Monthly research note. Theme: Distributed Systems Under Failure.
TL;DR
A focused memo on BFT from First Principles: Safety, Liveness, and Quorums: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- Expose protocol state (epoch/term/commit index) as first-class telemetry.
- Mixed-version operation is the default; upgrades must preserve invariants.
- Treat membership changes and compaction as protocol events—not operational details.
- Make boundaries boring: validate inputs, cap costs, and be deterministic where needed.
- Treat retries, reordering, and partial failure as default conditions.
Why this matters
- State compaction and snapshots are where correctness goes to die quietly.
- Most protocol bugs hide in timeouts, retries, and membership changes.
- If your protocol isn’t testable under reordering, it isn’t deployable.
- Observability must explain protocol state, not just latency.
Key questions
- Which components require determinism for reproducibility?
- What is the unit of ordering (per key, per partition, global)?
- What is your reconfiguration model (joint consensus, epochs, leases)?
- What is the compaction story (snapshots, log truncation, state transfer)?
- Which safety property is non-negotiable (no double-commit, no forks, no split brain)?
- What is the failure model (crash, byzantine, partitions, reordering)?
Assumptions
- Workload is skewed: hot keys exist and dominate.
- Reconfigurations happen mid-incident (the worst time).
- Nodes restart with partial state unless you prove durability.
- Partitions happen at multiple layers (network, DNS, LB, service mesh).
Non-goals
- Relying on global time for ordering without strong synchronization assumptions.
- Pretending backpressure is an implementation detail.
Parsing is an attacker-controlled interface—validate early and fail fast.
Model & invariants
For quorum-based protocols, the intersection property is the backbone of safety:
Write down the safety property first. If it’s not written, it’s not implemented.
Make overload explicit: admission control is a protocol boundary.
Make the “impossible state” observable: a metric or alert that fires when invariants drift.
Security properties
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Least authority: privileges are scoped by purpose and time.
- Replay resistance: duplicated inputs do not change outcomes.
- Evidence: critical actions emit verifiable audit events.
Failure modes
- Recovery paths that only work when nothing is broken.
- Config drift that weakens security posture over time.
- Observability gaps during incidents (missing evidence).
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Caches tend to become sources of truth unless you can recompute and validate them.
Design sketch
flowchart TD
client["Client"] --> leader["Leader"]
leader --> log["Replicated Log"]
log --> snap["Snapshot"]
snap --> recover["Recovery / Catch-up"]
leader --> reconfig["Reconfiguration"]Implementation notes
Make the state machine explicit; then make persistence and networking boring.
Acknowledge only after durability (or make “ack” explicitly best-effort).
Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_controlVerification strategy
- Stress + skew tests: hot keys, slow disks, noisy neighbors.
- Model checking the smallest core (timeouts, election, reconfiguration).
- Upgrade tests: mixed versions and rolling deploy invariants.
- Linearizability checks for read/write APIs that claim it.
- Jepsen-style fault injection: partitions + reordering + client retries.
Operational notes
- Rehearse region failover and reconfiguration under load.
- Expose protocol state: term/epoch, leader, commit index, config version.
- Make client behavior part of the system: document retry semantics.
- Rate-limit retries and apply admission control before saturation.
- Treat compaction and snapshot install as first-class SLOs.
Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.
What to monitor
- Error budget burn + tail latency under load.
- Invariant violation rate (should be ~0).
- Admission-control / rate-limit rejections (by reason).
- Retry/timeout rates by endpoint and client cohort.
- Authz failures and policy denials (unexpected spikes).
Rollback plan
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Define an explicit rollback trigger (metrics + thresholds).
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Keep dual-write / dual-verify windows where appropriate.
- Use canaries and staged rollout; stop early when signals degrade.
Evidence
- Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
- Jepsen (2) — Testing correctness under partitions and faults.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Open questions
- Which invariants are violated first under overload: latency, availability, or correctness?
- What is the worst-case recovery time after a leader + disk failure?
- How do you prevent “operator fixes” from changing safety properties?
- Where does your protocol assume synchrony without admitting it?
Checklist
- Safety properties stated as invariants.
- Assumptions listed and reviewed.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Telemetry captures correctness signals.
- Failure modes enumerated with mitigations.
- Rollback plan rehearsed and automated.
Further reading
- Time, Clocks, and the Ordering of Events (Lamport) — Causality, ordering, and why clocks are tricky.
- Jepsen — Testing correctness under partitions and faults.
- In Search of an Understandable Consensus Algorithm (Raft) — Consensus with explicit state machines and practical tradeoffs.
- Paxos Made Simple (Lamport) — Agreement basics and the invariants that matter.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.