Protocol State Machines: Invariants, Events, and Recovery

Monthly research note. Theme: Correctness & Foundations.

TL;DR

A focused memo on Protocol State Machines: Invariants, Events, and Recovery: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.

Key takeaways

Ack semantics must be explicit: durable, best-effort, or ambiguous.
Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
Crash points are part of the design; specify recovery after each state mutation.
Make failure modes explicit and observable.
Design rollbacks as part of the happy path.

Why this matters

In distributed code, retries and duplication are the common case—not the edge case.
If recovery is not specified, recovery becomes improvisation.
The cost of unclear invariants is paid in production, under load, during an incident.
Correctness is a property you enforce at boundaries: parsing, persistence, concurrency, RPC.

Key questions

Which invariants must hold across crashes, restarts, and partial deployments?
Which transitions are allowed, and which are impossible by construction?
What must be durable before you acknowledge?
What is your ordering model: FIFO per key, per partition, or none at all?
What exactly is the state, and what is derived or cached?
What does a client learn after a timeout: success, failure, or ambiguity?

Assumptions

Errors are lossy: transient vs permanent is often indistinguishable at the boundary.
Observability is incomplete: you will debug from partial evidence.
Clients retry with backoff but not with perfect discipline (bursts happen).
Partial failure is normal: one replica slow, one unavailable, one returning stale data.

Non-goals

Relying on “best effort” client behavior for safety properties.
Baking invariants into tribal knowledge instead of code.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

We want a transition function $\delta$ and invariant $\mathrm{Inv}$ such that:

s_{t+1} = \delta(s_t, e_t)\qquad\wedge\qquad \mathrm{Inv}(s_t)\Rightarrow \mathrm{Inv}(s_{t+1}).

Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.

Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.

Security properties

Integrity: invalid transitions are rejected (and detectable).
Least authority: privileges are scoped by purpose and time.
Replay resistance: duplicated inputs do not change outcomes.
Authenticity: actions are bound to identity and purpose.

Failure modes

Mixed-version behavior that violates assumptions silently.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Recovery paths that only work when nothing is broken.
Observability gaps during incidents (missing evidence).

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart TD
  input["Input"] --> parse["Parse/Validate"]
  parse --> decide["Decide (pure)"]
  decide --> write["Durable write"]
  write --> ack["Acknowledge"]
  ack --> obs["Emit evidence (logs/metrics)"]

Implementation notes

Implementation is the act of making invalid state unrepresentable (or at least unignorable).

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.

// Idempotency sketch: reserve -> execute -> commit result (or return cached).
type Key string

type Store interface {
  Get(key Key) (value []byte, ok bool, err error)
  PutIfAbsent(key Key, value []byte) (stored bool, err error)
}

// Protocol State Machines: Invariants, Events, and Recovery: "timeout" must not mean "try again and maybe double-apply".

Verification strategy

Metamorphic tests: same operation applied twice must not change the result.
Fault injection: latency, partial writes, dropped acks, and duplicated messages.
Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
Property-based tests: generate adversarial sequences and assert invariants after every step.
Deterministic schedulers (e.g., Loom-like) to force rare interleavings.

Operational notes

Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
Track invariant violations as pages, not dashboards.
Validate time assumptions: alert on clock steps, skew, and monotonicity issues.
Design “degraded modes” explicitly (fail closed vs fail open per operation).
Log as evidence: append-only where possible; isolate logs from compromised workloads.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Error budget burn + tail latency under load.
Rollback events and the conditions that triggered them.
Authz failures and policy denials (unexpected spikes).
Invariant violation rate (should be ~0).
Admission-control / rate-limit rejections (by reason).

Rollback plan

Prefer backward-compatible changes; avoid “flag day” upgrades.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Use canaries and staged rollout; stop early when signals degrade.
Keep dual-write / dual-verify windows where appropriate.
Define an explicit rollback trigger (metrics + thresholds).

Evidence

Time, Clocks, and the Ordering of Events (Lamport, 1978) (1) — The mental model for causality and ordering in distributed systems.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

What is the minimal durable record needed to recover safely?
What would you do if you had to replay a month of traffic into a rebuilt system?
Where does your API currently allow ambiguous outcomes, and how will clients cope?
Which correctness properties can be enforced at compile time (types/capabilities)?

Checklist

Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Safety properties stated as invariants.
Telemetry captures correctness signals.
Assumptions listed and reviewed.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading