A Minimal TLA+ Workflow for Distributed Protocols

Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

A focused memo on A Minimal TLA+ Workflow for Distributed Protocols: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.

Key takeaways

Expose protocol state (epoch/term/commit index) as first-class telemetry.
Write the safety property first; liveness is always conditional on timing assumptions.
Mixed-version operation is the default; upgrades must preserve invariants.
Treat retries, reordering, and partial failure as default conditions.
Prefer protocols and APIs that make invalid states hard to express.

Why this matters

Mixed-version operation is the default state of real deployments.
Most protocol bugs hide in timeouts, retries, and membership changes.
State compaction and snapshots are where correctness goes to die quietly.
If your protocol isn’t testable under reordering, it isn’t deployable.

Key questions

Which components require determinism for reproducibility?
What is the compaction story (snapshots, log truncation, state transfer)?
What is the failure model (crash, byzantine, partitions, reordering)?
Where do you pay for liveness (timeouts, leader election, reconfiguration)?
How do clients discover leaders safely (and what happens during flaps)?
How do you prevent overload from becoming inconsistency?

Assumptions

Partitions happen at multiple layers (network, DNS, LB, service mesh).
Delays are unbounded during incidents; timeouts are guesses.
Clocks drift; leases can be violated under GC pauses or VM stalls.
Nodes restart with partial state unless you prove durability.

Non-goals

Treating membership as static or human-managed only.
Relying on global time for ordering without strong synchronization assumptions.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

For quorum-based protocols, the intersection property is the backbone of safety:

\text{Crash-fault: } |Q| > \frac{n}{2}\qquad\qquad \text{Byzantine: } n \ge 3f+1,\ |Q| \ge 2f+1.

Treat membership changes as protocol events, not control-plane side effects.

Write down the safety property first. If it’s not written, it’s not implemented.

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.

Security properties

Authenticity: actions are bound to identity and purpose.
Replay resistance: duplicated inputs do not change outcomes.
Least authority: privileges are scoped by purpose and time.
Evidence: critical actions emit verifiable audit events.

Failure modes

Observability gaps during incidents (missing evidence).
Mixed-version behavior that violates assumptions silently.
Config drift that weakens security posture over time.
Recovery paths that only work when nothing is broken.

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

sequenceDiagram
  participant C as Client
  participant L as Leader
  participant F1 as Follower 1
  participant F2 as Follower 2
  C->>L: propose(cmd)
  L->>F1: appendEntries
  L->>F2: appendEntries
  F1-->>L: ack
  F2-->>L: ack
  L-->>C: commit(result)

Implementation notes

Your protocol is an interface between failures and invariants. Encode both.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.

Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control

Verification strategy

Model checking the smallest core (timeouts, election, reconfiguration).
Stress + skew tests: hot keys, slow disks, noisy neighbors.
Upgrade tests: mixed versions and rolling deploy invariants.
Jepsen-style fault injection: partitions + reordering + client retries.
Linearizability checks for read/write APIs that claim it.

Operational notes

Treat compaction and snapshot install as first-class SLOs.
Rate-limit retries and apply admission control before saturation.
Rehearse region failover and reconfiguration under load.
Prefer monotonic time sources for leases; alert on clock discontinuities.
Expose protocol state: term/epoch, leader, commit index, config version.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

Authz failures and policy denials (unexpected spikes).
Invariant violation rate (should be ~0).
Admission-control / rate-limit rejections (by reason).
Rollback events and the conditions that triggered them.
Error budget burn + tail latency under load.

Rollback plan

Keep dual-write / dual-verify windows where appropriate.
Use canaries and staged rollout; stop early when signals degrade.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Define an explicit rollback trigger (metrics + thresholds).

Evidence

Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Time, Clocks, and the Ordering of Events (Lamport) (2) — Causality, ordering, and why clocks are tricky.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.

Open questions

How do you prevent “operator fixes” from changing safety properties?
What is the worst-case recovery time after a leader + disk failure?
Which invariants are violated first under overload: latency, availability, or correctness?
Where does your protocol assume synchrony without admitting it?

Checklist

Telemetry captures correctness signals.
Assumptions listed and reviewed.
Safety properties stated as invariants.
Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading