Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

A focused memo on BFT from First Principles: Safety, Liveness, and Quorums: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

  • Expose protocol state (epoch/term/commit index) as first-class telemetry.
  • Mixed-version operation is the default; upgrades must preserve invariants.
  • Treat membership changes and compaction as protocol events—not operational details.
  • Make boundaries boring: validate inputs, cap costs, and be deterministic where needed.
  • Treat retries, reordering, and partial failure as default conditions.

Why this matters

  • State compaction and snapshots are where correctness goes to die quietly.
  • Most protocol bugs hide in timeouts, retries, and membership changes.
  • If your protocol isn’t testable under reordering, it isn’t deployable.
  • Observability must explain protocol state, not just latency.

Key questions

  • Which components require determinism for reproducibility?
  • What is the unit of ordering (per key, per partition, global)?
  • What is your reconfiguration model (joint consensus, epochs, leases)?
  • What is the compaction story (snapshots, log truncation, state transfer)?
  • Which safety property is non-negotiable (no double-commit, no forks, no split brain)?
  • What is the failure model (crash, byzantine, partitions, reordering)?

Assumptions

  • Workload is skewed: hot keys exist and dominate.
  • Reconfigurations happen mid-incident (the worst time).
  • Nodes restart with partial state unless you prove durability.
  • Partitions happen at multiple layers (network, DNS, LB, service mesh).

Non-goals

  • Relying on global time for ordering without strong synchronization assumptions.
  • Pretending backpressure is an implementation detail.
Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

For quorum-based protocols, the intersection property is the backbone of safety:

Crash-fault: Q>n2Byzantine: n3f+1, Q2f+1.\text{Crash-fault: } |Q| > \frac{n}{2}\qquad\qquad \text{Byzantine: } n \ge 3f+1,\ |Q| \ge 2f+1.

Write down the safety property first. If it’s not written, it’s not implemented.

Make overload explicit: admission control is a protocol boundary.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.

Security properties

  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Least authority: privileges are scoped by purpose and time.
  • Replay resistance: duplicated inputs do not change outcomes.
  • Evidence: critical actions emit verifiable audit events.

Failure modes

  • Recovery paths that only work when nothing is broken.
  • Config drift that weakens security posture over time.
  • Observability gaps during incidents (missing evidence).
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart TD
  client["Client"] --> leader["Leader"]
  leader --> log["Replicated Log"]
  log --> snap["Snapshot"]
  snap --> recover["Recovery / Catch-up"]
  leader --> reconfig["Reconfiguration"]

Implementation notes

Make the state machine explicit; then make persistence and networking boring.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).

Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control

Verification strategy

  • Stress + skew tests: hot keys, slow disks, noisy neighbors.
  • Model checking the smallest core (timeouts, election, reconfiguration).
  • Upgrade tests: mixed versions and rolling deploy invariants.
  • Linearizability checks for read/write APIs that claim it.
  • Jepsen-style fault injection: partitions + reordering + client retries.

Operational notes

  • Rehearse region failover and reconfiguration under load.
  • Expose protocol state: term/epoch, leader, commit index, config version.
  • Make client behavior part of the system: document retry semantics.
  • Rate-limit retries and apply admission control before saturation.
  • Treat compaction and snapshot install as first-class SLOs.
Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Error budget burn + tail latency under load.
  • Invariant violation rate (should be ~0).
  • Admission-control / rate-limit rejections (by reason).
  • Retry/timeout rates by endpoint and client cohort.
  • Authz failures and policy denials (unexpected spikes).

Rollback plan

  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Keep dual-write / dual-verify windows where appropriate.
  • Use canaries and staged rollout; stop early when signals degrade.

Evidence

  • Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
  • Jepsen (2) — Testing correctness under partitions and faults.
    • Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

  • Which invariants are violated first under overload: latency, availability, or correctness?
  • What is the worst-case recovery time after a leader + disk failure?
  • How do you prevent “operator fixes” from changing safety properties?
  • Where does your protocol assume synchrony without admitting it?

Checklist

  • Safety properties stated as invariants.
  • Assumptions listed and reviewed.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Telemetry captures correctness signals.
  • Failure modes enumerated with mitigations.
  • Rollback plan rehearsed and automated.

Further reading

1.
Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/
2.
Jepsen. Jepsen: Distributed Systems Safety Analysis [Internet]. Web; Available from: https://jepsen.io/