Crash Consistency: Durable State Without Mysticism

Monthly research note. Theme: Correctness & Foundations.

TL;DR

Treat crash consistency as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Separate durable state from derived state; derived must be recomputable or reconcilable.
Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
Ack semantics must be explicit: durable, best-effort, or ambiguous.
Write assumptions down; treat them as interfaces.
Treat retries, reordering, and partial failure as default conditions.
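One way to make ack semantics and ambiguity explicit is to put them in the type a client receives. The sketch below is illustrative only; `Ack`, `Outcome`, and `resend_is_safe` are hypothetical names, not an existing API.

```rust
/// What the server promises when it acknowledges a write.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ack {
    /// The write is on stable storage and survives a crash.
    Durable,
    /// The write was accepted but may be lost on a crash.
    BestEffort,
}

/// What the client actually knows after one attempt.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Outcome {
    Applied(Ack),
    /// Definitive refusal: nothing was applied.
    Rejected,
    /// Timeout or connection loss: the write may or may not have happened.
    Ambiguous,
}

/// True if resending the same request cannot double-apply it.
pub fn resend_is_safe(outcome: Outcome, has_idempotency_key: bool) -> bool {
    match outcome {
        // Nothing happened, so a resend cannot duplicate anything.
        Outcome::Rejected => true,
        // The write happened or may have happened: only a key makes a resend safe.
        Outcome::Applied(_) | Outcome::Ambiguous => has_idempotency_key,
    }
}
```

Forcing callers to match on `Ambiguous` keeps a timeout from being silently read as a failure.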

Why this matters

The cost of unclear invariants is paid in production, under load, during an incident.
Performance work that changes semantics is a correctness regression with a nicer latency chart.
In distributed code, retries and duplication are the common case—not the edge case.
A system without explicit contracts becomes a collection of folklore and dashboards.

Key questions

What does a client learn after a timeout: success, failure, or ambiguity?
Where does concurrency create “double spend” style failures in your domain?
Where do you need atomicity (and where is eventual consistency acceptable)?
What exactly is the state, and what is derived or cached?
Which transitions are allowed, and which are impossible by construction?
What is your ordering model: FIFO per key, per partition, or none at all?

Assumptions

Requests can be duplicated, reordered, delayed, and replayed across restarts.
Clients retry with backoff but not with perfect discipline (bursts happen).
Time is untrusted: clock skew, NTP steps, monotonic vs wall-clock confusion.
Crashes happen mid-write (torn state) unless you prove otherwise.
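A common way to avoid torn state for a single file is to write a temporary file, sync it, and rename it over the target. The sketch below assumes a POSIX-like filesystem where a rename within one directory is atomic; `write_atomically` is a hypothetical helper name.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replace `path` with `bytes` so that a crash leaves either the old or the
/// new contents, never a mix. Assumes atomic rename within one directory.
pub fn write_atomically(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Temp file lives next to the target so the rename stays on one filesystem.
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(bytes)?;
        // Flush file data and metadata before the file becomes visible.
        f.sync_all()?;
    }
    fs::rename(&tmp, path)?;
    // Best-effort sync of the parent directory so the rename itself is durable.
    // Opening a directory this way works on Unix; elsewhere it may fail and
    // is skipped here. A stricter version would propagate the error.
    if let Some(dir) = path.parent() {
        if let Ok(d) = File::open(dir) {
            let _ = d.sync_all();
        }
    }
    Ok(())
}
```

This covers one file only. State spread across several files or records still needs a log or a transaction to be crash-consistent.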

Non-goals

Relying on “best effort” client behavior for safety properties.
Baking invariants into tribal knowledge instead of code.

Attack surface

Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.

Model & invariants

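For idempotent operations, the contract is set-like: applying the same operation with the same key k a second time leaves the state unchanged, while distinct keys identify distinct applications.

\mathrm{apply}(\mathrm{apply}(s, op, k), op, k) = \mathrm{apply}(s, op, k) \quad\text{and}\quad \mathrm{apply}(s, op, k_1) \neq \mathrm{apply}(s, op, k_2)\ \text{for } k_1 \neq k_2 \text{ in general}.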
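The idempotency contract can be sketched with a set of seen keys stored alongside the state. This is a minimal in-memory illustration; `Account` and `credit` are hypothetical, and a real system would have to persist the key set and the balance atomically together.

```rust
use std::collections::HashSet;

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct Account {
    pub balance: i64,
    /// Idempotency keys already applied. Part of the durable state.
    seen: HashSet<u64>,
}

impl Account {
    /// Apply a credit identified by `key`.
    /// Returns true if the state changed, false if `key` was a replay.
    pub fn credit(&mut self, amount: i64, key: u64) -> bool {
        // `insert` returns false when the key is already present.
        if !self.seen.insert(key) {
            return false;
        }
        self.balance += amount;
        true
    }
}
```

Replaying key 1 leaves the account identical to the state after the first application; a new key 2 is a distinct operation and changes the balance again.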

Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.
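As a rough illustration of replay detection with sequence numbers, a receiver can track the last applied number and classify each arrival. The names `Receiver` and `Verdict` are hypothetical, and the sketch assumes one sender whose numbering starts at 1.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Next expected message: apply it.
    Apply,
    /// Already applied: drop it without changing state.
    Duplicate,
    /// Arrived early: something before it is missing.
    Gap { expected: u64 },
}

#[derive(Debug, Default)]
pub struct Receiver {
    /// Highest sequence number applied so far (0 means none yet).
    last_seq: u64,
}

impl Receiver {
    pub fn check(&mut self, seq: u64) -> Verdict {
        let expected = self.last_seq + 1;
        if seq < expected {
            Verdict::Duplicate
        } else if seq > expected {
            Verdict::Gap { expected }
        } else {
            self.last_seq = seq;
            Verdict::Apply
        }
    }
}
```

For replays to be detectable across restarts, `last_seq` has to be persisted in the same durable write as the effect it guards.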

Crash points matter: define what happens if the process stops after each line that mutates state or acknowledges work.
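One way to make crash points concrete is to model a multi-step operation as a durable intent record and test recovery after every possible stop. The sketch below is an in-memory simulation under a strong assumption: each step is a single atomic durable write. `Store`, `begin`, `run`, and `recover` are hypothetical names.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Store {
    pub a: i64,
    pub b: i64,
    /// In-flight transfer as (amount, progress); None when idle.
    pub intent: Option<(i64, u8)>,
}

/// Perform the next step of the in-flight transfer. Returns false when idle.
/// Each arm updates the balance and the progress marker together, standing
/// in for one atomic durable write.
fn step(s: &mut Store) -> bool {
    let current = s.intent;
    match current {
        Some((amount, 0)) => { s.a -= amount; s.intent = Some((amount, 1)); true }
        Some((amount, 1)) => { s.b += amount; s.intent = Some((amount, 2)); true }
        Some(_) => { s.intent = None; true }
        None => false,
    }
}

/// Record the intent durably before touching any balance.
pub fn begin(s: &mut Store, amount: i64) {
    s.intent = Some((amount, 0));
}

/// Run at most `max_steps` steps, then stop, simulating a crash.
pub fn run(s: &mut Store, max_steps: usize) {
    for _ in 0..max_steps {
        if !step(s) {
            break;
        }
    }
}

/// After a restart, resume from the recorded progress until the transfer is done.
pub fn recover(s: &mut Store) {
    while step(s) {}
}
```

A test can then loop over every crash point (stop after 0, 1, 2, or 3 steps), call `recover`, and assert that the final state is the same in all cases.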

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.

Security properties

Replay resistance: duplicated inputs do not change outcomes.
Least authority: privileges are scoped by purpose and time.
Integrity: invalid transitions are rejected (and detectable).
Evidence: critical actions emit verifiable audit events.
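The integrity property can be sketched as a transition function that lists the allowed edges and rejects everything else with a typed error. The order lifecycle here is a made-up domain for illustration only.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OrderState {
    Created,
    Paid,
    Shipped,
    Cancelled,
}

/// Carries both ends of the rejected edge so the attempt can be audited.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidTransition {
    pub from: OrderState,
    pub to: OrderState,
}

pub fn transition(from: OrderState, to: OrderState) -> Result<OrderState, InvalidTransition> {
    use OrderState::*;
    match (from, to) {
        // The complete list of allowed edges; anything else is rejected.
        (Created, Paid) | (Paid, Shipped) | (Created, Cancelled) | (Paid, Cancelled) => Ok(to),
        _ => Err(InvalidTransition { from, to }),
    }
}
```

Because the error names the attempted edge, a rejected transition is detectable in logs as well as refused.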

Failure modes

Recovery paths that only work when nothing is broken.
Timeout ambiguity causing double-apply or partial state transitions.
Observability gaps during incidents (missing evidence).
Config drift that weakens security posture over time.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

flowchart TD
  input["Input"] --> parse["Parse/Validate"]
  parse --> decide["Decide (pure)"]
  decide --> write["Durable write"]
  write --> ack["Acknowledge"]
  ack --> obs["Emit evidence (logs/metrics)"]

Implementation notes

Implementation is the act of making invalid state unrepresentable (or at least unignorable).

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
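A small sketch of this rule for a length-prefixed frame: check the declared size against a cap before reading or allocating anything for the body. The wire format (4-byte big-endian length, then body) and the 64 KiB cap are assumptions made for the example.

```rust
/// Hypothetical cap on body size; tune per endpoint.
pub const MAX_BODY: usize = 64 * 1024;

#[derive(Debug, PartialEq, Eq)]
pub enum FrameError {
    Truncated,
    TooLarge { declared: usize },
}

/// Parse one frame: a 4-byte big-endian length followed by the body.
/// Returns a borrowed slice, so nothing is allocated for rejected input.
pub fn parse_frame(input: &[u8]) -> Result<&[u8], FrameError> {
    if input.len() < 4 {
        return Err(FrameError::Truncated);
    }
    let declared = u32::from_be_bytes([input[0], input[1], input[2], input[3]]) as usize;
    // Reject oversized claims before doing any work proportional to them.
    if declared > MAX_BODY {
        return Err(FrameError::TooLarge { declared });
    }
    input.get(4..4 + declared).ok_or(FrameError::Truncated)
}
```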

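use core::fmt;

/// Inputs the machine can observe, including faults.
#[derive(Clone, Debug)]
pub enum Event {
    /// Raw bytes from the boundary, not yet parsed or trusted.
    Input(Vec<u8>),
    /// Passage of logical time.
    Tick,
    /// A named fault condition.
    Fault(&'static str),
}

/// A pure transition function paired with an executable invariant.
pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;

    /// Compute the next state without mutating the current one.
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    /// Must hold for every state reachable through `step`.
    fn invariant(state: &Self::State) -> bool;
}

// Invariants are part of the API contract.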

Verification strategy

Property-based tests: generate adversarial sequences and assert invariants after every step.
Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
Differential tests against a reference model (even a slow one).
Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
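The first strategy can be approximated without any test framework: drive a model with a seeded generator and check the invariant after every step. The sketch below uses a hand-rolled xorshift generator and a made-up `Pool` model; a real suite would more likely use a property-testing crate with shrinking.

```rust
/// Tiny xorshift64 generator so the example needs no external crates.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

#[derive(Clone, Copy, Debug)]
pub enum Op {
    Reserve(u32),
    Release(u32),
}

#[derive(Clone, Debug)]
pub struct Pool {
    pub capacity: u32,
    pub used: u32,
}

impl Pool {
    /// Apply one operation; rejected operations leave the state untouched.
    pub fn apply(&mut self, op: Op) -> Result<(), &'static str> {
        match op {
            Op::Reserve(n) => {
                let new_used = self.used.checked_add(n).ok_or("overflow")?;
                if new_used > self.capacity {
                    return Err("over capacity");
                }
                self.used = new_used;
            }
            Op::Release(n) => {
                self.used = self.used.checked_sub(n).ok_or("release exceeds used")?;
            }
        }
        Ok(())
    }

    pub fn invariant(&self) -> bool {
        self.used <= self.capacity
    }
}

/// Run `steps` random operations from `seed`.
/// Returns the index of the first step that broke the invariant, if any.
pub fn check_random_ops(seed: u64, steps: usize) -> Option<usize> {
    let mut rng = XorShift(seed.max(1)); // xorshift must not start at 0
    let mut pool = Pool { capacity: 100, used: 0 };
    for i in 0..steps {
        let n = (rng.next_u64() % 150) as u32; // sometimes larger than capacity
        let op = if rng.next_u64() % 2 == 0 { Op::Reserve(n) } else { Op::Release(n) };
        let _ = pool.apply(op); // errors are expected; only the invariant matters
        if !pool.invariant() {
            return Some(i);
        }
    }
    None
}
```

Because the generator is seeded, any failing run can be reproduced exactly from its seed and step index.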

Operational notes

Design “degraded modes” explicitly (fail closed vs fail open per operation).
Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
Log as evidence: append-only where possible; isolate logs from compromised workloads.
Make rollbacks safe: schema and protocol compatibility is a security boundary.
Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Admission-control / rate-limit rejections (by reason).
Error budget burn + tail latency under load.
Rollback events and the conditions that triggered them.
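The invariant violation rate can be fed by a counter that is incremented wherever a safety property is checked. This is a minimal sketch using only the standard library; the counter name and the stderr logging are placeholders for whatever metrics pipeline is in use.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide count of failed invariant checks, to be exported as a metric.
pub static INVARIANT_VIOLATIONS: AtomicU64 = AtomicU64::new(0);

/// Record the result of an invariant check without panicking in production.
/// Returns `holds` so the caller can still decide to reject the request.
pub fn check_invariant(name: &'static str, holds: bool) -> bool {
    if !holds {
        // Relaxed is enough: the value is only read as a monotonic counter.
        INVARIANT_VIOLATIONS.fetch_add(1, Ordering::Relaxed);
        eprintln!("invariant violated: {}", name);
    }
    holds
}
```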

Rollback plan

Prefer backward-compatible changes; avoid “flag day” upgrades.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Use canaries and staged rollout; stop early when signals degrade.
Keep dual-write / dual-verify windows where appropriate.
Define an explicit rollback trigger (metrics + thresholds).
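An explicit rollback trigger can be as small as a pure function from observed signals and thresholds to a decision. The field names and the example thresholds below are assumptions for illustration, not recommended values.

```rust
#[derive(Clone, Copy, Debug)]
pub struct Signals {
    pub error_rate: f64,
    pub p99_latency_ms: f64,
    pub invariant_violations: u64,
}

#[derive(Clone, Copy, Debug)]
pub struct Thresholds {
    pub max_error_rate: f64,
    pub max_p99_latency_ms: f64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Continue,
    /// Roll back, with the reason that fired.
    Rollback(&'static str),
}

/// Evaluate canary signals against thresholds.
/// Any invariant violation triggers a rollback regardless of the other signals.
pub fn evaluate(canary: Signals, t: Thresholds) -> Decision {
    if canary.invariant_violations > 0 {
        return Decision::Rollback("invariant violation");
    }
    if canary.error_rate > t.max_error_rate {
        return Decision::Rollback("error rate");
    }
    if canary.p99_latency_ms > t.max_p99_latency_ms {
        return Decision::Rollback("p99 latency");
    }
    Decision::Continue
}
```

Keeping the trigger pure makes it easy to test and to replay against recorded metrics from a past incident.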

Evidence

Time, Clocks, and the Ordering of Events (Lamport, 1978) (1) — The mental model for causality and ordering in distributed systems.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
RFC 9110: HTTP Semantics (2) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.

Open questions

Which correctness properties can be enforced at compile time (types/capabilities)?
Where does your API currently allow ambiguous outcomes, and how will clients cope?
Which invariant, if violated, would silently corrupt state for weeks?
What would you do if you had to replay a month of traffic into a rebuilt system?

Checklist

Assumptions listed and reviewed.
Failure modes enumerated with mitigations.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Safety properties stated as invariants.
Telemetry captures correctness signals.
Rollback plan rehearsed and automated.
