Monthly research note. Theme: Correctness & Foundations.

TL;DR

Treat crash consistency as an engineering constraint, not mysticism: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

  • Separate durable state from derived state; derived must be recomputable or reconcilable.
  • Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
  • Ack semantics must be explicit: durable, best-effort, or ambiguous.
  • Write assumptions down; treat them as interfaces.
  • Treat retries, reordering, and partial failure as default conditions.
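The takeaways on semantic retries and explicit acks can be sketched as follows. This is a minimal illustration, not a definitive design: `Ack`, `Ledger`, and `credit` are hypothetical names, and the in-memory `HashMap` stands in for a dedupe table that would have to be durable in a real system.

```rust
use std::collections::HashMap;

/// What the server tells the client about a request (hypothetical type).
#[derive(Clone, Debug, PartialEq)]
pub enum Ack {
    /// The effect is recorded; retrying with the same key is a no-op.
    Durable { version: u64 },
    /// The effect was accepted but may be lost on a crash.
    BestEffort,
    /// The outcome is unknown; the client should retry with the same key.
    Ambiguous,
}

/// In-memory stand-in for durable state plus a dedupe table.
#[derive(Default)]
pub struct Ledger {
    balance: i64,
    version: u64,
    /// Idempotency key -> version at which that key was first applied.
    seen: HashMap<String, u64>,
}

impl Ledger {
    /// Applies a credit at most once per idempotency key.
    pub fn credit(&mut self, key: &str, amount: i64) -> Ack {
        if let Some(&v) = self.seen.get(key) {
            // Duplicate: report the original outcome, change nothing.
            return Ack::Durable { version: v };
        }
        self.balance += amount;
        self.version += 1; // monotonic version, never reused
        self.seen.insert(key.to_string(), self.version);
        Ack::Durable { version: self.version }
    }

    pub fn balance(&self) -> i64 {
        self.balance
    }
}
```

A retry that reuses the key receives the same `Ack` as the first attempt, so a client that timed out can safely resend instead of guessing.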

Why this matters

  • The cost of unclear invariants is paid in production, under load, during an incident.
  • Performance work that changes semantics is a correctness regression with a nicer latency chart.
  • In distributed code, retries and duplication are the common case—not the edge case.
  • A system without explicit contracts becomes a collection of folklore and dashboards.

Key questions

  • What does a client learn after a timeout: success, failure, or ambiguity?
  • Where does concurrency create “double spend” style failures in your domain?
  • Where do you need atomicity (and where is eventual consistency acceptable)?
  • What exactly is the state, and what is derived or cached?
  • Which transitions are allowed, and which are impossible by construction?
  • What is your ordering model: FIFO per key, per partition, or none at all?

Assumptions

  • Requests can be duplicated, reordered, delayed, and replayed across restarts.
  • Clients retry with backoff but not with perfect discipline (bursts happen).
  • Time is untrusted: clock skew, NTP steps, monotonic vs wall-clock confusion.
  • Crashes happen mid-write (torn state) unless you prove otherwise.
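One common way to make the torn-write assumption checkable is to frame each log record with a length and a checksum, and to stop recovery at the first record that fails the check. The sketch below assumes a hypothetical 8-byte header layout and uses FNV-1a purely as a dependency-free stand-in for a real CRC.

```rust
/// FNV-1a, used here only as a small stand-in for a real CRC.
fn checksum(data: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5;
    for &b in data {
        h ^= b as u32;
        h = h.wrapping_mul(0x0100_0193);
    }
    h
}

/// Reads a little-endian u32 at `at`; the caller guarantees 4 bytes exist.
fn read_u32(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Frames a payload as [len: u32 LE][checksum: u32 LE][payload].
pub fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Returns the payloads of all intact records, stopping at the first
/// truncated or corrupt one (treated as the torn tail of the log).
pub fn recover(log: &[u8]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut pos = 0usize;
    while log.len() - pos >= 8 {
        let len = read_u32(log, pos) as usize;
        let sum = read_u32(log, pos + 4);
        let start = pos + 8;
        let end = match start.checked_add(len) {
            Some(e) if e <= log.len() => e,
            _ => break, // declared length runs past the end: torn write
        };
        let payload = &log[start..end];
        if checksum(payload) != sum {
            break; // bytes present but wrong: corrupt or torn
        }
        out.push(payload.to_vec());
        pos = end;
    }
    out
}
```

Recovery keeps the intact prefix and drops the tail, which is only safe if nothing was acknowledged before its record was fully written.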

Non-goals

  • Relying on “best effort” client behavior for safety properties.
  • Baking invariants into tribal knowledge instead of code.

Attack surface

Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.

Model & invariants

For idempotent operations, the contract is set-like:

$$\mathrm{apply}(\mathrm{apply}(s, op, k), op, k) = \mathrm{apply}(s, op, k) \quad\text{and}\quad \mathrm{apply}(s, op, k_1) \neq \mathrm{apply}(s, op, k_2)\ \text{in general}.$$

Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.
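As a rough illustration of replay detection with sequence numbers, the sketch below tracks the last accepted sequence number per source. `SeqTracker` and `Admit` are hypothetical names, and the policy is an assumption: sequences start at 1, and gaps are rejected rather than buffered.

```rust
use std::collections::HashMap;

/// Outcome of checking an incoming sequence number.
#[derive(Debug, PartialEq)]
pub enum Admit {
    /// Exactly the next expected number: apply it.
    Fresh,
    /// At or below the last accepted number: a duplicate or replay.
    Replay,
    /// Ahead of the expected number: something was lost or reordered.
    Gap { expected: u64 },
}

/// Tracks the last accepted sequence number per source.
#[derive(Default)]
pub struct SeqTracker {
    last: HashMap<String, u64>,
}

impl SeqTracker {
    pub fn admit(&mut self, source: &str, seq: u64) -> Admit {
        let last = self.last.get(source).copied().unwrap_or(0);
        if seq <= last {
            return Admit::Replay;
        }
        if seq != last + 1 {
            return Admit::Gap { expected: last + 1 };
        }
        self.last.insert(source.to_string(), seq);
        Admit::Fresh
    }
}
```

Because the decision depends only on counters, it is unaffected by clock skew between the sender and the receiver.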

Crash points matter: define what happens if the process stops after each line that mutates state or acknowledges work.
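One way to make crash points concrete is to simulate a stop after each step of a handler and check an invariant after recovery. The sketch below is a toy model under stated assumptions: `Node` is hypothetical, `log` plays the role of durable storage, `balance` is volatile derived state, and `acked` records what clients were told.

```rust
#[derive(Debug, Default)]
pub struct Node {
    /// Durable: (operation id, delta). Survives a crash.
    pub log: Vec<(u64, i64)>,
    /// Volatile and derived: rebuilt from the log on recovery.
    pub balance: i64,
    /// Operation ids that were acknowledged to a client.
    pub acked: Vec<u64>,
}

impl Node {
    /// Handles one operation; `crash_after` stops the process after
    /// step 1 or step 2. Returns true only if the client got an ack.
    pub fn handle(&mut self, id: u64, delta: i64, crash_after: Option<u8>) -> bool {
        self.log.push((id, delta)); // step 1: durable write
        if crash_after == Some(1) {
            return false;
        }
        self.balance += delta; // step 2: update derived state
        if crash_after == Some(2) {
            return false;
        }
        self.acked.push(id); // step 3: acknowledge
        true
    }

    /// Rebuilds derived state from the durable log alone.
    pub fn recover(&mut self) {
        self.balance = self.log.iter().map(|(_, d)| *d).sum();
    }

    /// Every acked operation is durable, and derived state matches the log.
    pub fn invariant(&self) -> bool {
        let durable = self.acked.iter().all(|id| self.log.iter().any(|(i, _)| i == id));
        let total: i64 = self.log.iter().map(|(_, d)| *d).sum();
        durable && self.balance == total
    }
}
```

With this ordering a crash never produces an acked operation that is missing from the log. The reverse case, durable but unacknowledged, does occur, and it is exactly the ambiguity a retrying client has to be able to resolve.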

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.

Security properties

  • Replay resistance: duplicated inputs do not change outcomes.
  • Least authority: privileges are scoped by purpose and time.
  • Integrity: invalid transitions are rejected (and detectable).
  • Evidence: critical actions emit verifiable audit events.

Failure modes

  • Recovery paths that only work when nothing is broken.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Observability gaps during incidents (missing evidence).
  • Config drift that weakens security posture over time.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

flowchart TD
  input["Input"] --> parse["Parse/Validate"]
  parse --> decide["Decide (pure)"]
  decide --> write["Durable write"]
  write --> ack["Acknowledge"]
  ack --> obs["Emit evidence (logs/metrics)"]

Implementation notes

Implementation is the act of making invalid state unrepresentable (or at least unignorable).

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
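A minimal sketch of this rule for a length-prefixed frame: the declared size is checked against a cap before any buffer is allocated. The frame layout (4-byte big-endian length, then body), the `MAX_BODY` value, and the names are assumptions made for illustration.

```rust
/// Hypothetical cap on body size; pick a value that fits your workload.
pub const MAX_BODY: usize = 64 * 1024;

#[derive(Debug, PartialEq)]
pub enum ParseError {
    /// Fewer bytes are present than the header or declared length requires.
    Truncated,
    /// The declared length exceeds the cap; nothing was allocated.
    TooLarge { declared: usize },
}

/// Parses [len: u32 BE][body], rejecting oversized frames up front.
pub fn parse_frame(input: &[u8]) -> Result<Vec<u8>, ParseError> {
    if input.len() < 4 {
        return Err(ParseError::Truncated);
    }
    let declared = u32::from_be_bytes([input[0], input[1], input[2], input[3]]) as usize;
    // Cap the cost before touching the allocator.
    if declared > MAX_BODY {
        return Err(ParseError::TooLarge { declared });
    }
    let body = input.get(4..4 + declared).ok_or(ParseError::Truncated)?;
    Ok(body.to_vec())
}
```

The allocation size is bounded by `MAX_BODY` regardless of what the sender claims in the header.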

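use core::fmt;

/// Inputs that can drive the state machine.
#[derive(Clone, Debug)]
pub enum Event {
    /// Raw bytes received at the boundary.
    Input(Vec<u8>),
    /// A timer tick.
    Tick,
    /// A fault, identified by a static label.
    Fault(&'static str),
}

/// A transition function paired with an executable invariant.
pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;

    /// Computes the next state from the current one without mutating it.
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    /// Returns true when `state` satisfies the invariant.
    fn invariant(state: &Self::State) -> bool;
}

// Invariants are part of the API contract.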
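To show how the trait might be used, here is a small hypothetical implementation and a driver that checks the invariant after every accepted step. `Budget`, `BudgetState`, and `run` are illustrative names that do not come from the note, and the `Event` and `StateMachine` definitions are repeated only so the sketch compiles on its own.

```rust
use core::fmt;

// Repeated from the note so that this sketch is self-contained.
#[derive(Clone, Debug)]
pub enum Event {
    Input(Vec<u8>),
    Tick,
    Fault(&'static str),
}

pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    fn invariant(state: &Self::State) -> bool;
}

/// Hypothetical machine: inputs consume a byte budget, ticks reset it.
pub struct Budget;

#[derive(Clone, Debug, PartialEq)]
pub struct BudgetState {
    pub used: usize,
    pub cap: usize,
}

impl StateMachine for Budget {
    type State = BudgetState;
    type Error = String;

    fn step(s: &BudgetState, event: Event) -> Result<BudgetState, String> {
        match event {
            Event::Input(bytes) => {
                let used = s.used.saturating_add(bytes.len());
                if used > s.cap {
                    return Err(format!("over budget: {} > {}", used, s.cap));
                }
                Ok(BudgetState { used, cap: s.cap })
            }
            Event::Tick => Ok(BudgetState { used: 0, cap: s.cap }),
            Event::Fault(why) => Err(format!("fault: {}", why)),
        }
    }

    fn invariant(s: &BudgetState) -> bool {
        s.used <= s.cap
    }
}

/// Applies events in order. Rejected events leave the state unchanged;
/// an accepted step that breaks the invariant panics.
pub fn run<M: StateMachine>(mut state: M::State, events: Vec<Event>) -> M::State {
    for event in events {
        if let Ok(next) = M::step(&state, event) {
            assert!(M::invariant(&next), "invariant violated: {:?}", next);
            state = next;
        }
    }
    state
}
```

Because `step` returns a new state instead of mutating in place, a rejected event cannot leave a half-applied transition behind.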

Verification strategy

  • Property-based tests: generate adversarial sequences and assert invariants after every step.
  • Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
  • Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
  • Differential tests against a reference model (even a slow one).
  • Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
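The first and fourth strategies can be combined in a dependency-free sketch: generate pseudo-random operation sequences and compare a fast implementation against a slow reference model after every step. Everything here is hypothetical (`XorShift`, `Fast`, `reference_total`, `check`); a real project would more likely use a property-testing crate with shrinking.

```rust
use std::collections::HashSet;

/// Tiny xorshift generator so the sketch needs no external crates.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Implementation under test: dedupes by key, keeps a running total.
#[derive(Default)]
pub struct Fast {
    seen: HashSet<u64>,
    pub total: u64,
}

impl Fast {
    pub fn apply(&mut self, key: u64, amount: u64) {
        if self.seen.insert(key) {
            self.total += amount;
        }
    }
}

/// Slow reference model: sums the first occurrence of each key.
pub fn reference_total(history: &[(u64, u64)]) -> u64 {
    let mut total = 0u64;
    for (i, (key, amount)) in history.iter().enumerate() {
        let first = history[..i].iter().all(|(k, _)| k != key);
        if first {
            total += *amount;
        }
    }
    total
}

/// Runs one generated sequence, comparing both models after every step.
pub fn check(seed: u64, steps: usize) -> Result<(), String> {
    let mut rng = XorShift(seed | 1); // xorshift state must be nonzero
    let mut fast = Fast::default();
    let mut history = Vec::new();
    for step in 0..steps {
        // A small key space forces frequent duplicates.
        let (key, amount) = (rng.next_u64() % 8, rng.next_u64() % 100);
        fast.apply(key, amount);
        history.push((key, amount));
        if fast.total != reference_total(&history) {
            return Err(format!("seed {} diverged at step {}", seed, step));
        }
    }
    Ok(())
}
```

Reporting the seed on failure keeps a divergence reproducible, which matters more than the quality of the generator.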

Operational notes

  • Design “degraded modes” explicitly (fail closed vs fail open per operation).
  • Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
  • Log as evidence: append-only where possible; isolate logs from compromised workloads.
  • Make rollbacks safe: schema and protocol compatibility is a security boundary.
  • Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Retry/timeout rates by endpoint and client cohort.
  • Invariant violation rate (should be ~0).
  • Admission-control / rate-limit rejections (by reason).
  • Error budget burn + tail latency under load.
  • Rollback events and the conditions that triggered them.

Rollback plan

  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Keep dual-write / dual-verify windows where appropriate.
  • Define an explicit rollback trigger (metrics + thresholds).
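An explicit rollback trigger can be as simple as a pure function from observed signals to a decision. The sketch below is illustrative only: the signal names, the thresholds, and the rule that any invariant violation forces a rollback are assumptions, not recommendations from the note.

```rust
/// Signals sampled from a canary (hypothetical fields).
#[derive(Clone, Copy, Debug)]
pub struct Signals {
    pub error_rate: f64,
    pub p99_ms: f64,
    pub invariant_violations: u64,
}

/// Limits agreed before the rollout starts.
pub struct Thresholds {
    pub max_error_rate: f64,
    pub max_p99_ms: f64,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    Continue,
    /// Carries the reason so the rollback event can be logged as evidence.
    Rollback(&'static str),
}

/// Decides whether to continue a staged rollout.
pub fn decide(s: Signals, t: &Thresholds) -> Decision {
    if s.invariant_violations > 0 {
        return Decision::Rollback("invariant violation");
    }
    if s.error_rate > t.max_error_rate {
        return Decision::Rollback("error rate");
    }
    if s.p99_ms > t.max_p99_ms {
        return Decision::Rollback("p99 latency");
    }
    Decision::Continue
}
```

Keeping the decision pure makes it easy to rehearse: recorded signals from past incidents can be replayed through `decide` to see whether the trigger would have fired.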

Evidence

  • Time, Clocks, and the Ordering of Events (Lamport, 1978) (1) — The mental model for causality and ordering in distributed systems.
    • Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
  • RFC 9110: HTTP Semantics (2) — Defines method semantics including idempotency and safety—useful for API contracts.
    • Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.

Open questions

  • Which correctness properties can be enforced at compile time (types/capabilities)?
  • Where does your API currently allow ambiguous outcomes, and how will clients cope?
  • Which invariant, if violated, would silently corrupt state for weeks?
  • What would you do if you had to replay a month of traffic into a rebuilt system?

Checklist

  • Assumptions listed and reviewed.
  • Failure modes enumerated with mitigations.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Safety properties stated as invariants.
  • Telemetry captures correctness signals.
  • Rollback plan rehearsed and automated.

Further reading

1. Lamport L. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM [Internet]. 1978;21(7):558–65. Available from: https://lamport.azurewebsites.net/pubs/time-clocks.pdf
2. Fielding RT, Nottingham M, Reschke J. HTTP Semantics [Internet]. RFC Editor; 2022. Report No.: 9110. Available from: https://www.rfc-editor.org/rfc/rfc9110