Monthly research note. Theme: Correctness & Foundations.
TL;DR
A focused memo on Protocol State Machines: Invariants, Events, and Recovery: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.
Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
Key takeaways
- Ack semantics must be explicit: durable, best-effort, or ambiguous.
- Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
- Crash points are part of the design; specify recovery after each state mutation.
- Make failure modes explicit and observable.
- Design rollbacks as part of the happy path.
Why this matters
- In distributed code, retries and duplication are the common case—not the edge case.
- If recovery is not specified, recovery becomes improvisation.
- The cost of unclear invariants is paid in production, under load, during an incident.
- Correctness is a property you enforce at boundaries: parsing, persistence, concurrency, RPC.
Key questions
- Which invariants must hold across crashes, restarts, and partial deployments?
- Which transitions are allowed, and which are impossible by construction?
- What must be durable before you acknowledge?
- What is your ordering model: FIFO per key, per partition, or none at all?
- What exactly is the state, and what is derived or cached?
- What does a client learn after a timeout: success, failure, or ambiguity?
Assumptions
- Errors are lossy: transient vs permanent is often indistinguishable at the boundary.
- Observability is incomplete: you will debug from partial evidence.
- Clients retry with backoff but not with perfect discipline (bursts happen).
- Partial failure is normal: one replica slow, one unavailable, one returning stale data.
Non-goals
- Relying on “best effort” client behavior for safety properties.
- Baking invariants into tribal knowledge instead of code.
Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.
Model & invariants
We want a transition function and invariant such that:
Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.
Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.
Monotonicity beats timestamps: counters and epochs survive clock skew.
Security properties
- Integrity: invalid transitions are rejected (and detectable).
- Least authority: privileges are scoped by purpose and time.
- Replay resistance: duplicated inputs do not change outcomes.
- Authenticity: actions are bound to identity and purpose.
Failure modes
- Mixed-version behavior that violates assumptions silently.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Recovery paths that only work when nothing is broken.
- Observability gaps during incidents (missing evidence).
Caches tend to become sources of truth unless you can recompute and validate them.
Design sketch
flowchart TD
input["Input"] --> parse["Parse/Validate"]
parse --> decide["Decide (pure)"]
decide --> write["Durable write"]
write --> ack["Acknowledge"]
ack --> obs["Emit evidence (logs/metrics)"]Implementation notes
Implementation is the act of making invalid state unrepresentable (or at least unignorable).
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
// Idempotency sketch: reserve -> execute -> commit result (or return cached).
type Key string
type Store interface {
Get(key Key) (value []byte, ok bool, err error)
PutIfAbsent(key Key, value []byte) (stored bool, err error)
}
// Protocol State Machines: Invariants, Events, and Recovery: "timeout" must not mean "try again and maybe double-apply".Verification strategy
- Metamorphic tests: same operation applied twice must not change the result.
- Fault injection: latency, partial writes, dropped acks, and duplicated messages.
- Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
- Property-based tests: generate adversarial sequences and assert invariants after every step.
- Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
Operational notes
- Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
- Track invariant violations as pages, not dashboards.
- Validate time assumptions: alert on clock steps, skew, and monotonicity issues.
- Design “degraded modes” explicitly (fail closed vs fail open per operation).
- Log as evidence: append-only where possible; isolate logs from compromised workloads.
Attach explicit rollout/rollback triggers to changes that touch security or correctness.
What to monitor
- Error budget burn + tail latency under load.
- Rollback events and the conditions that triggered them.
- Authz failures and policy denials (unexpected spikes).
- Invariant violation rate (should be ~0).
- Admission-control / rate-limit rejections (by reason).
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Use canaries and staged rollout; stop early when signals degrade.
- Keep dual-write / dual-verify windows where appropriate.
- Define an explicit rollback trigger (metrics + thresholds).
Evidence
- Time, Clocks, and the Ordering of Events (Lamport, 1978) (1) — The mental model for causality and ordering in distributed systems.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
- Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Open questions
- What is the minimal durable record needed to recover safely?
- What would you do if you had to replay a month of traffic into a rebuilt system?
- Where does your API currently allow ambiguous outcomes, and how will clients cope?
- Which correctness properties can be enforced at compile time (types/capabilities)?
Checklist
- Failure modes enumerated with mitigations.
- Rollback plan rehearsed and automated.
- Safety properties stated as invariants.
- Telemetry captures correctness signals.
- Assumptions listed and reviewed.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Further reading
- Jepsen — Failure testing focused on correctness under partitions and reordering.
- Learn TLA+ — A pragmatic workflow for invariants and model checking.
- Time, Clocks, and the Ordering of Events (Lamport, 1978) — The mental model for causality and ordering in distributed systems.
- RFC 9110: HTTP Semantics — Defines method semantics including idempotency and safety—useful for API contracts.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.