Monthly research note. Theme: Correctness & Foundations.
TL;DR
Crash Consistency: Durable State Without Mysticism. Treat crash consistency as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.
Key takeaways
- Separate durable state from derived state; derived must be recomputable or reconcilable.
- Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
- Ack semantics must be explicit: durable, best-effort, or ambiguous.
- Write assumptions down; treat them as interfaces.
- Treat retries, reordering, and partial failure as default conditions.
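One way to make ack semantics explicit is to put them in a type. The sketch below is illustrative only: `Ack`, `ClientAction`, `classify`, and `on_ack` are hypothetical names, and the mapping from ack to client action is one plausible policy, not a prescribed one.

```rust
/// What the server promises when it answers a write (hypothetical names).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ack {
    /// The write reached stable storage before the reply was sent.
    Durable,
    /// The write was accepted but may be lost if the server crashes.
    BestEffort,
    /// The outcome is unknown to the client.
    Ambiguous,
}

/// What the client does next under this example policy.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ClientAction {
    Done,
    /// Accept for now, but verify against durable state later.
    ReconcileLater,
    /// Retry with the same idempotency key so a duplicate is harmless.
    RetryWithSameKey,
}

/// A missing reply (for example, a timeout) says nothing about the write,
/// so it is classified as `Ambiguous` rather than as a failure.
pub fn classify(reply: Option<Ack>) -> Ack {
    reply.unwrap_or(Ack::Ambiguous)
}

/// Maps each ack to the next client step.
pub fn on_ack(ack: Ack) -> ClientAction {
    match ack {
        Ack::Durable => ClientAction::Done,
        Ack::BestEffort => ClientAction::ReconcileLater,
        Ack::Ambiguous => ClientAction::RetryWithSameKey,
    }
}
```

Because the match is exhaustive, adding a fourth ack variant forces every caller to decide how to handle it.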
Why this matters
- The cost of unclear invariants is paid in production, under load, during an incident.
- Performance work that changes semantics is a correctness regression with a nicer latency chart.
- In distributed code, retries and duplication are the common case—not the edge case.
- A system without explicit contracts becomes a collection of folklore and dashboards.
Key questions
- What does a client learn after a timeout: success, failure, or ambiguity?
- Where does concurrency create “double spend” style failures in your domain?
- Where do you need atomicity (and where is eventual consistency acceptable)?
- What exactly is the state, and what is derived or cached?
- Which transitions are allowed, and which are impossible by construction?
- What is your ordering model: FIFO per key, per partition, or none at all?
Assumptions
- Requests can be duplicated, reordered, delayed, and replayed across restarts.
- Clients retry with backoff but not with perfect discipline (bursts happen).
- Time is untrusted: clock skew, NTP steps, monotonic vs wall-clock confusion.
- Crashes happen mid-write (torn state) unless you prove otherwise.
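The assumption about untrusted time can be reflected in code by measuring timeouts on the monotonic clock. This is a small sketch using `std::time::Instant`; the `Deadline` type and its method names are assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// A deadline measured on the monotonic clock, so wall-clock steps
/// (NTP corrections, manual changes) cannot shorten or extend it.
#[derive(Clone, Copy, Debug)]
pub struct Deadline {
    start: Instant,
    budget: Duration,
}

impl Deadline {
    /// Starts a deadline that expires `budget` from now.
    pub fn after(budget: Duration) -> Self {
        Deadline { start: Instant::now(), budget }
    }

    /// True once at least `budget` has elapsed since creation.
    pub fn expired(&self) -> bool {
        self.start.elapsed() >= self.budget
    }

    /// Time left, clamped at zero instead of underflowing.
    pub fn remaining(&self) -> Duration {
        self.budget.saturating_sub(self.start.elapsed())
    }
}
```

`Instant` is only meaningful within one process; values that cross a restart or a network boundary still need counters or epochs rather than timestamps.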
Non-goals
- Relying on “best effort” client behavior for safety properties.
- Baking invariants into tribal knowledge instead of code.
Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.
Model & invariants
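For idempotent operations, the contract is set-like: applying the same request twice leaves the same state as applying it once, just as adding an element to a set twice leaves the set unchanged.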
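A minimal sketch of a set-like contract, assuming each request carries a unique `request_id`. The `Account` type and its fields are hypothetical and exist only to show the shape of the check.

```rust
use std::collections::HashSet;

/// A balance plus the set of request IDs already applied (hypothetical model).
#[derive(Debug, Default)]
pub struct Account {
    balance: i64,
    applied: HashSet<u64>,
}

impl Account {
    /// Applies a credit at most once per `request_id`.
    /// Returns true if state changed, false if the request was a duplicate.
    pub fn credit(&mut self, request_id: u64, amount: i64) -> bool {
        // `HashSet::insert` returns false when the ID is already present.
        if !self.applied.insert(request_id) {
            return false;
        }
        self.balance += amount;
        true
    }

    pub fn balance(&self) -> i64 {
        self.balance
    }
}
```

In a durable system the applied-ID set and the balance would have to be persisted in the same atomic write; otherwise a crash between the two reintroduces the double-apply this check is meant to prevent.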
Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.
Crash points matter: define what happens if the process stops after each line that mutates state or acknowledges work.
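As an illustration of reasoning about crash points line by line, here is a sketch of the common write-temp, sync, rename pattern using only the standard library. It assumes a filesystem where rename within one directory is atomic (as on POSIX systems); the function names and the fixed `.tmp` suffix are simplifications chosen for this example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replaces `path` with `bytes` so that a crash leaves either the old
/// contents or the new contents, never a mix of both.
pub fn write_atomically(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    // Crash before the rename: `path` still holds the old contents,
    // possibly with a stray temp file to clean up on recovery.
    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    // Force data and metadata to the device before publishing the file.
    file.sync_all()?;
    drop(file);
    // The rename is the commit point.
    fs::rename(&tmp, path)?;
    // Crash after the rename but before the directory sync: the new name
    // may not have reached disk yet, so do not acknowledge until this returns.
    sync_parent(path)
}

#[cfg(unix)]
fn sync_parent(path: &Path) -> io::Result<()> {
    let parent = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    File::open(parent)?.sync_all()
}

#[cfg(not(unix))]
fn sync_parent(_path: &Path) -> io::Result<()> {
    // Directory sync is not attempted on other platforms in this sketch.
    Ok(())
}
```

A production version would use a unique temp name per writer and remove leftover temp files during recovery.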
Monotonicity beats timestamps: counters and epochs survive clock skew.
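The point about epochs can be sketched as a fencing check: a writer presents its epoch, and the store rejects anything older than the newest epoch it has seen. `Fenced` and `StaleEpoch` are hypothetical names used only for this example.

```rust
/// A stored value guarded by a monotonically increasing epoch.
#[derive(Debug, Default)]
pub struct Fenced {
    epoch: u64,
    value: String,
}

/// Returned when a writer presents an epoch older than the newest one seen.
#[derive(Debug, PartialEq, Eq)]
pub struct StaleEpoch {
    pub seen: u64,
    pub offered: u64,
}

impl Fenced {
    /// Accepts a write only if `epoch` is at least the newest epoch seen.
    pub fn write(&mut self, epoch: u64, value: &str) -> Result<(), StaleEpoch> {
        if epoch < self.epoch {
            return Err(StaleEpoch { seen: self.epoch, offered: epoch });
        }
        self.epoch = epoch;
        self.value = value.to_string();
        Ok(())
    }

    pub fn value(&self) -> &str {
        &self.value
    }
}
```

No clock is consulted: a delayed write from an older epoch is rejected regardless of what either machine believes the time to be.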
Security properties
- Replay resistance: duplicated inputs do not change outcomes.
- Least authority: privileges are scoped by purpose and time.
- Integrity: invalid transitions are rejected (and detectable).
- Evidence: critical actions emit verifiable audit events.
Failure modes
- Recovery paths that only work when nothing is broken.
- Timeout ambiguity causing double-apply or partial state transitions.
- Observability gaps during incidents (missing evidence).
- Config drift that weakens security posture over time.
Sampling hides the rare schedule that breaks your invariants.
Design sketch
flowchart TD
input["Input"] --> parse["Parse/Validate"]
parse --> decide["Decide (pure)"]
decide --> write["Durable write"]
write --> ack["Acknowledge"]
ack --> obs["Emit evidence (logs/metrics)"]
Implementation notes
Implementation is the act of making invalid state unrepresentable (or at least unignorable).
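One way to make invalid state unrepresentable is to give each lifecycle stage its own type, so that an out-of-order transition fails to compile. The order lifecycle below (`Draft`, `Paid`, `Shipped`) is an invented example, not part of the original design.

```rust
/// Each stage is a distinct type; a `Draft` has no `ship` method,
/// so "ship before pay" cannot be written.
pub struct Draft {
    pub id: u64,
}

pub struct Paid {
    pub id: u64,
    pub receipt: String,
}

pub struct Shipped {
    pub id: u64,
    pub receipt: String,
    pub tracking: String,
}

impl Draft {
    /// Consumes the draft, so the same draft cannot be paid twice.
    pub fn pay(self, receipt: String) -> Paid {
        Paid { id: self.id, receipt }
    }
}

impl Paid {
    /// Consumes the paid order and carries the receipt forward.
    pub fn ship(self, tracking: String) -> Shipped {
        Shipped { id: self.id, receipt: self.receipt, tracking }
    }
}
```

The compiler only protects in-memory values; state loaded from disk or the network still needs a validating constructor at the boundary.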
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
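A small sketch of bounding work before allocating, assuming a newline-separated request body. The limits `MAX_BODY_BYTES` and `MAX_ITEMS` and the `Reject` type are hypothetical values and names picked for illustration.

```rust
/// Hypothetical limits; real values depend on the workload.
pub const MAX_BODY_BYTES: usize = 64 * 1024;
pub const MAX_ITEMS: usize = 1_000;

#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    TooLarge(usize),
    TooManyItems(usize),
    NotUtf8,
}

/// Parses a newline-separated list, checking cost limits before
/// any per-item allocation happens.
pub fn parse_items(body: &[u8]) -> Result<Vec<&str>, Reject> {
    // Cheapest check first: a length comparison.
    if body.len() > MAX_BODY_BYTES {
        return Err(Reject::TooLarge(body.len()));
    }
    // Validation borrows the input; it does not copy it.
    let text = std::str::from_utf8(body).map_err(|_| Reject::NotUtf8)?;
    // Count with a lazy iterator before building the result vector.
    let count = text.lines().count();
    if count > MAX_ITEMS {
        return Err(Reject::TooManyItems(count));
    }
    Ok(text.lines().collect())
}
```

Returning a distinct rejection reason per limit also gives admission-control metrics a label to group by.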
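use core::fmt;
/// Inputs that can drive the state machine.
#[derive(Clone, Debug)]
pub enum Event {
    Input(Vec<u8>),
    Tick,
    Fault(&'static str),
}
/// A transition function plus an executable invariant over the state.
pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;
    /// Computes the next state from a borrowed current state and an event.
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    /// Returns true when the state satisfies the safety properties.
    fn invariant(state: &Self::State) -> bool;
}
// Crash Consistency: Durable State Without Mysticism: invariants are part of the API contract.
Verification strategy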
- Property-based tests: generate adversarial sequences and assert invariants after every step.
- Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
- Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
- Differential tests against a reference model (even a slow one).
- Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
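The first bullet above can be sketched without any external crate. This hand-rolled loop is a stand-in for a real property-testing library, not a replacement for one: the `Pool` model, the `Op` type, and the xorshift generator are all assumptions made to keep the example self-contained.

```rust
/// A tiny model: `held` must never exceed `capacity`.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Pool {
    pub capacity: u32,
    pub held: u32,
}

#[derive(Clone, Copy, Debug)]
pub enum Op {
    Acquire,
    Release,
}

/// Pure transition: returns the next state or rejects the operation.
pub fn step(s: &Pool, op: Op) -> Result<Pool, &'static str> {
    match op {
        Op::Acquire if s.held < s.capacity => Ok(Pool { capacity: s.capacity, held: s.held + 1 }),
        Op::Acquire => Err("pool exhausted"),
        Op::Release if s.held > 0 => Ok(Pool { capacity: s.capacity, held: s.held - 1 }),
        Op::Release => Err("nothing to release"),
    }
}

pub fn invariant(s: &Pool) -> bool {
    s.held <= s.capacity
}

/// Xorshift64: a small deterministic generator, so a failure replays from its seed.
fn next(seed: &mut u64) -> u64 {
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    *seed
}

/// Runs `steps` pseudo-random operations and checks the invariant after each.
/// Returns the index of the first step that broke it, if any.
pub fn check(seed: u64, steps: usize) -> Result<(), usize> {
    let mut rng = seed.max(1); // xorshift must not start at zero
    let mut state = Pool { capacity: 3, held: 0 };
    for i in 0..steps {
        let op = if next(&mut rng) % 2 == 0 { Op::Acquire } else { Op::Release };
        // Rejected operations leave the state unchanged.
        if let Ok(s) = step(&state, op) {
            state = s;
        }
        if !invariant(&state) {
            return Err(i);
        }
    }
    Ok(())
}
```

Reporting the seed and step index on failure is what makes a rare sequence reproducible; a dedicated library adds shrinking on top of that.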
Operational notes
- Design “degraded modes” explicitly (fail closed vs fail open per operation).
- Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
- Log as evidence: append-only where possible; isolate logs from compromised workloads.
- Make rollbacks safe: schema and protocol compatibility is a security boundary.
- Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.
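Explicit degraded modes can be expressed as a per-operation policy table. The operations and their policies below are invented for illustration; the point is only that each operation names its choice between failing closed and failing open.

```rust
/// How an operation behaves when a dependency it needs is unavailable.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OnDependencyDown {
    /// Refuse the operation.
    FailClosed,
    /// Allow the operation without the dependency's answer.
    FailOpen,
}

/// Hypothetical operations for this example.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Operation {
    AuthorizePayment,
    RecommendItems,
}

/// The exhaustive match forces every new operation to pick a policy.
pub fn degraded_policy(op: Operation) -> OnDependencyDown {
    match op {
        Operation::AuthorizePayment => OnDependencyDown::FailClosed,
        Operation::RecommendItems => OnDependencyDown::FailOpen,
    }
}

/// `dependency` is `None` when the dependency could not be reached.
pub fn allow(op: Operation, dependency: Option<bool>) -> bool {
    match dependency {
        Some(answer) => answer,
        None => degraded_policy(op) == OnDependencyDown::FailOpen,
    }
}
```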
What to monitor
- Retry/timeout rates by endpoint and client cohort.
- Invariant violation rate (should be ~0).
- Admission-control / rate-limit rejections (by reason).
- Error budget burn + tail latency under load.
- Rollback events and the conditions that triggered them.
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Use canaries and staged rollout; stop early when signals degrade.
- Keep dual-write / dual-verify windows where appropriate.
- Define an explicit rollback trigger (metrics + thresholds).
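An explicit rollback trigger can be written as a pure function over observed signals and agreed thresholds. The signal names, threshold fields, and the rule that any invariant violation triggers rollback are assumptions for this sketch.

```rust
/// Signals sampled during a staged rollout (hypothetical set).
#[derive(Clone, Copy, Debug)]
pub struct Signals {
    pub error_rate: f64,
    pub p99_latency_ms: f64,
    pub invariant_violations: u64,
}

/// Limits agreed on before the rollout starts.
#[derive(Clone, Copy, Debug)]
pub struct Thresholds {
    pub max_error_rate: f64,
    pub max_p99_latency_ms: f64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Continue,
    Rollback(&'static str),
}

/// Checks the signals in priority order and returns the first reason to roll back.
pub fn evaluate(s: Signals, t: Thresholds) -> Decision {
    if s.invariant_violations > 0 {
        return Decision::Rollback("invariant violated");
    }
    if s.error_rate > t.max_error_rate {
        return Decision::Rollback("error rate above threshold");
    }
    if s.p99_latency_ms > t.max_p99_latency_ms {
        return Decision::Rollback("p99 latency above threshold");
    }
    Decision::Continue
}
```

Keeping the trigger pure makes it testable and lets the same function run in a rehearsal and in the real rollout.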
Evidence
- Time, Clocks, and the Ordering of Events (Lamport, 1978) (1) — The mental model for causality and ordering in distributed systems.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
- RFC 9110: HTTP Semantics (2) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.
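To tie retries to method semantics rather than timeouts, the safe and idempotent method sets from RFC 9110 can be encoded directly. The `Method` enum and `may_auto_retry` are illustrative names; the idempotency-key parameter is an application-level convention assumed here, not something RFC 9110 defines.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Method {
    Get,
    Head,
    Options,
    Trace,
    Put,
    Delete,
    Post,
    Patch,
    Connect,
}

impl Method {
    /// Safe methods per RFC 9110: GET, HEAD, OPTIONS, TRACE.
    pub fn is_safe(self) -> bool {
        matches!(self, Method::Get | Method::Head | Method::Options | Method::Trace)
    }

    /// Idempotent methods per RFC 9110: PUT, DELETE, and all safe methods.
    pub fn is_idempotent(self) -> bool {
        self.is_safe() || matches!(self, Method::Put | Method::Delete)
    }
}

/// Example policy: retry automatically only when a repeat cannot change
/// the outcome, either by method semantics or because the request carries
/// an idempotency key that the server deduplicates on (an assumption).
pub fn may_auto_retry(method: Method, has_idempotency_key: bool) -> bool {
    method.is_idempotent() || has_idempotency_key
}
```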
Open questions
- Which correctness properties can be enforced at compile time (types/capabilities)?
- Where does your API currently allow ambiguous outcomes, and how will clients cope?
- Which invariant, if violated, would silently corrupt state for weeks?
- What would you do if you had to replay a month of traffic into a rebuilt system?
Checklist
- Assumptions listed and reviewed.
- Failure modes enumerated with mitigations.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Safety properties stated as invariants.
- Telemetry captures correctness signals.
- Rollback plan rehearsed and automated.
Further reading
- Time, Clocks, and the Ordering of Events (Lamport, 1978) — The mental model for causality and ordering in distributed systems.
- Jepsen — Failure testing focused on correctness under partitions and reordering.
- RFC 9110: HTTP Semantics — Defines method semantics including idempotency and safety—useful for API contracts.
- Learn TLA+ — A pragmatic workflow for invariants and model checking.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.