Monthly research note. Theme: Correctness & Foundations.

TL;DR

A focused memo on protocol state machines (invariants, events, and recovery): define the model, state the properties, then design the system so those properties remain true under failure and adversarial input.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.

Key takeaways

  • Ack semantics must be explicit: durable, best-effort, or ambiguous.
  • Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
  • Crash points are part of the design; specify recovery after each state mutation.
  • Make failure modes explicit and observable.
  • Design rollbacks as part of the happy path.

Why this matters

  • In distributed code, retries and duplication are the common case—not the edge case.
  • If recovery is not specified, recovery becomes improvisation.
  • The cost of unclear invariants is paid in production, under load, during an incident.
  • Correctness is a property you enforce at boundaries: parsing, persistence, concurrency, RPC.

Key questions

  • Which invariants must hold across crashes, restarts, and partial deployments?
  • Which transitions are allowed, and which are impossible by construction?
  • What must be durable before you acknowledge?
  • What is your ordering model: FIFO per key, per partition, or none at all?
  • What exactly is the state, and what is derived or cached?
  • What does a client learn after a timeout: success, failure, or ambiguity?

Assumptions

  • Errors are lossy: transient vs permanent is often indistinguishable at the boundary.
  • Observability is incomplete: you will debug from partial evidence.
  • Clients retry with backoff but not with perfect discipline (bursts happen).
  • Partial failure is normal: one replica slow, one unavailable, one returning stale data.

Non-goals

  • Relying on “best effort” client behavior for safety properties.
  • Baking invariants into tribal knowledge instead of code.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

We want a transition function $\delta$ and an invariant $\mathrm{Inv}$ such that:

$$s_{t+1} = \delta(s_t, e_t)\qquad\wedge\qquad \mathrm{Inv}(s_t)\Rightarrow \mathrm{Inv}(s_{t+1}).$$
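The transition function and the inductive invariant check can be sketched in Go. Everything here is a made-up example for illustration: the `Idle`/`Prepared`/`Committed` phases, the event kinds, and the invariant ("the sequence number is zero exactly when the phase is `Idle`") are assumptions, not a protocol taken from this note.

```go
package main

import "fmt"

// Phase is a hypothetical protocol phase.
type Phase int

const (
	Idle Phase = iota
	Prepared
	Committed
)

// State is the full protocol state: a phase plus the last applied sequence number.
type State struct {
	Phase Phase
	Seq   uint64
}

// Event is an input to the state machine.
type Event struct {
	Kind string // "prepare" or "commit" in this sketch
	Seq  uint64
}

// Inv is the example invariant: Seq is zero exactly when the phase is Idle.
func Inv(s State) bool {
	return (s.Phase == Idle) == (s.Seq == 0)
}

// Delta is the pure transition function. Events that are not allowed in the
// current state return the unchanged state and an error.
func Delta(s State, e Event) (State, error) {
	switch {
	case s.Phase == Idle && e.Kind == "prepare" && e.Seq == s.Seq+1:
		return State{Phase: Prepared, Seq: e.Seq}, nil
	case s.Phase == Prepared && e.Kind == "commit" && e.Seq == s.Seq:
		return State{Phase: Committed, Seq: e.Seq}, nil
	default:
		return s, fmt.Errorf("invalid transition: phase=%d event=%q seq=%d", s.Phase, e.Kind, e.Seq)
	}
}

// Step applies Delta and enforces the inductive step Inv(s) => Inv(s').
// On any error the caller keeps the previous state.
func Step(s State, e Event) (State, error) {
	if !Inv(s) {
		return s, fmt.Errorf("invariant already violated before event %q", e.Kind)
	}
	next, err := Delta(s, e)
	if err != nil {
		return s, err
	}
	if !Inv(next) {
		return s, fmt.Errorf("event %q would break the invariant", e.Kind)
	}
	return next, nil
}
```

Keeping `Delta` pure means the same function can be replayed over a durable event log during recovery and exercised directly in tests.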

Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.

Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
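As a minimal sketch of replay detection with monotonic identifiers, the receiver below orders messages by an (epoch, sequence) pair and accepts only strictly newer ones. The `Stamp` and `Receiver` types are hypothetical, and the sketch assumes a single sender per receiver and that the epoch is bumped on every restart or leader change.

```go
package main

// Stamp orders messages by (Epoch, Seq) instead of wall-clock time.
type Stamp struct {
	Epoch uint64 // assumed to increase on every restart or leader change
	Seq   uint64 // strictly increasing within one epoch
}

// Less reports whether a is ordered strictly before b.
func (a Stamp) Less(b Stamp) bool {
	if a.Epoch != b.Epoch {
		return a.Epoch < b.Epoch
	}
	return a.Seq < b.Seq
}

// Receiver remembers the highest stamp it has accepted so far.
// The zero value starts at (0, 0), so the first valid stamp must be greater.
type Receiver struct {
	last Stamp
}

// Accept advances the receiver and returns true only if s is strictly newer
// than everything accepted before. Replays and stale messages return false.
func (r *Receiver) Accept(s Stamp) bool {
	if !r.last.Less(s) {
		return false
	}
	r.last = s
	return true
}
```

This variant tolerates gaps in the sequence; a protocol that needs gap detection would instead require `s.Seq == r.last.Seq+1` within an epoch.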

Security properties

  • Integrity: invalid transitions are rejected (and detectable).
  • Least authority: privileges are scoped by purpose and time.
  • Replay resistance: duplicated inputs do not change outcomes.
  • Authenticity: actions are bound to identity and purpose.

Failure modes

  • Mixed-version behavior that violates assumptions silently.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Recovery paths that only work when nothing is broken.
  • Observability gaps during incidents (missing evidence).

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart TD
  input["Input"] --> parse["Parse/Validate"]
  parse --> decide["Decide (pure)"]
  decide --> write["Durable write"]
  write --> ack["Acknowledge"]
  ack --> obs["Emit evidence (logs/metrics)"]

Implementation notes

Implementation is the act of making invalid state unrepresentable (or at least unignorable).

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.

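// Idempotency sketch: reserve -> execute -> commit result (or return cached).

// Key identifies one logical operation (for example, a client-supplied idempotency key).
type Key string

// Store is the minimal durable interface the sketch relies on.
type Store interface {
  // Get returns the value stored under key; ok is false if the key is absent.
  Get(key Key) (value []byte, ok bool, err error)
  // PutIfAbsent stores value only if key has no value yet (assumed atomic);
  // stored reports whether this call performed the write.
  PutIfAbsent(key Key, value []byte) (stored bool, err error)
}

// A "timeout" must not mean "try again and maybe double-apply".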

Verification strategy

  • Metamorphic tests: same operation applied twice must not change the result.
  • Fault injection: latency, partial writes, dropped acks, and duplicated messages.
  • Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
  • Property-based tests: generate adversarial sequences and assert invariants after every step.
  • Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
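The property-based and metamorphic checks above can be combined in a small harness. This is a sketch under assumptions: the `Ledger` model, its non-negative-balance invariant, and `CheckRandomRun` are invented for the example, and a real suite would more likely live in a `_test.go` file or use a property-testing library.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Event is a hypothetical ledger event with a unique ID.
type Event struct {
	ID    uint64
	Delta int64
}

// Ledger is a toy state machine whose invariant is Balance >= 0.
type Ledger struct {
	Balance int64
	seen    map[uint64]bool
}

func NewLedger() *Ledger { return &Ledger{seen: make(map[uint64]bool)} }

// Apply is idempotent per event ID and rejects events that would overdraw.
// It reports whether the event changed the state.
func (l *Ledger) Apply(e Event) bool {
	if l.seen[e.ID] {
		return false // duplicate delivery
	}
	if l.Balance+e.Delta < 0 {
		return false // would break the invariant
	}
	l.seen[e.ID] = true
	l.Balance += e.Delta
	return true
}

// CheckRandomRun feeds n pseudo-random events (including duplicate IDs) into a
// fresh Ledger and checks the properties after every step.
func CheckRandomRun(seed int64, n int) error {
	rng := rand.New(rand.NewSource(seed))
	l := NewLedger()
	for i := 0; i < n; i++ {
		e := Event{
			ID:    uint64(rng.Intn(n/2 + 1)), // small ID space forces duplicates
			Delta: int64(rng.Intn(201) - 100),
		}
		l.Apply(e)
		if l.Balance < 0 {
			return fmt.Errorf("seed %d step %d: negative balance %d", seed, i, l.Balance)
		}
		// Metamorphic check: replaying the same event immediately is a no-op.
		before := l.Balance
		l.Apply(e)
		if l.Balance != before {
			return fmt.Errorf("seed %d step %d: replay changed balance", seed, i)
		}
	}
	return nil
}
```

Reporting the seed in the error makes any failing sequence reproducible, which matters more than the number of random runs.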

Operational notes

  • Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
  • Track invariant violations as pages, not dashboards.
  • Validate time assumptions: alert on clock steps, skew, and monotonicity issues.
  • Design “degraded modes” explicitly (fail closed vs fail open per operation).
  • Log as evidence: append-only where possible; isolate logs from compromised workloads.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

  • Error budget burn + tail latency under load.
  • Rollback events and the conditions that triggered them.
  • Authz failures and policy denials (unexpected spikes).
  • Invariant violation rate (should be ~0).
  • Admission-control / rate-limit rejections (by reason).

Rollback plan

  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Keep dual-write / dual-verify windows where appropriate.
  • Define an explicit rollback trigger (metrics + thresholds).

Evidence

  • Time, Clocks, and the Ordering of Events (Lamport, 1978) (1) — The mental model for causality and ordering in distributed systems.
    • Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
  • Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
    • Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

  • What is the minimal durable record needed to recover safely?
  • What would you do if you had to replay a month of traffic into a rebuilt system?
  • Where does your API currently allow ambiguous outcomes, and how will clients cope?
  • Which correctness properties can be enforced at compile time (types/capabilities)?

Checklist

  • Failure modes enumerated with mitigations.
  • Rollback plan rehearsed and automated.
  • Safety properties stated as invariants.
  • Telemetry captures correctness signals.
  • Assumptions listed and reviewed.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

Further reading

1. Lamport L. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM [Internet]. 1978;21(7):558–65. Available from: https://lamport.azurewebsites.net/pubs/time-clocks.pdf
2. Beyer B, Jones C, Petoff J, Murphy NR. Site Reliability Engineering: How Google Runs Production Systems [Internet]. O’Reilly Media; 2016. Available from: https://sre.google/sre-book/table-of-contents/