Consensus Under Partial Synchrony: From Paxos to Raft

Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

Treat consensus under partial synchrony as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Treat a timeout as a third outcome: neither success nor failure, but ambiguity you must model.
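One way to keep that ambiguity visible in code is a three-valued result type. The sketch below is illustrative only: the names `Outcome` and `next_step`, and the retry policy itself, are assumptions and not part of any particular protocol or library.

```python
from enum import Enum


class Outcome(Enum):
    COMMITTED = "committed"  # the system acknowledged the write
    REJECTED = "rejected"    # the system definitively refused the write
    UNKNOWN = "unknown"      # timeout: the write may or may not have applied


def next_step(outcome: Outcome) -> str:
    """Map each outcome to a client action (hypothetical policy)."""
    if outcome is Outcome.COMMITTED:
        return "done"
    if outcome is Outcome.REJECTED:
        return "report-error"
    # UNKNOWN must not be collapsed into failure: a blind re-submit could
    # apply the command twice, so retry only with the same request id.
    return "retry-with-same-request-id"
```

Keeping `UNKNOWN` as its own case forces every caller to decide what a timeout means for it, instead of inheriting whatever the error path happens to do.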

Key takeaways

Treat membership changes and compaction as protocol events—not operational details.
Mixed-version operation is the default; upgrades must preserve invariants.
Expose protocol state (epoch/term/commit index) as first-class telemetry.
Design rollbacks as part of the happy path.
Make failure modes explicit and observable.

Why this matters

Most protocol bugs hide in timeouts, retries, and membership changes.
If your protocol isn’t testable under reordering, it isn’t deployable.
Global systems fail in correlated ways (regions, dependencies, routing).
Safety failures are permanent; liveness failures are (sometimes) recoverable.

Key questions

Which components require determinism for reproducibility?
What is the unit of ordering (per key, per partition, global)?
Which safety property is non-negotiable (no double-commit, no forks, no split brain)?
How do clients discover leaders safely (and what happens during flaps)?
How do you prevent overload from becoming inconsistency?
What is the compaction story (snapshots, log truncation, state transfer)?

Assumptions

Clients retry and amplify load right when the system is weakest.
Reconfigurations happen mid-incident (the worst time).
Workload is skewed: hot keys exist and dominate.
Packets can be duplicated and reordered; acks can be lost.

Non-goals

Treating membership as static or human-managed only.
Assuming the network eventually behaves “nicely” under load.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

For quorum-based protocols, the intersection property is the backbone of safety:

\[
\text{Crash-fault: } |Q| > \frac{n}{2} \qquad\qquad \text{Byzantine: } n \ge 3f+1,\ |Q| \ge 2f+1.
\]
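The bounds above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper names are hypothetical, and the Byzantine case assumes the minimal configuration n = 3f + 1.

```python
from typing import Tuple


def crash_quorum(n: int) -> int:
    """Smallest quorum size satisfying |Q| > n/2."""
    return n // 2 + 1


def byzantine_quorum(f: int) -> Tuple[int, int]:
    """(n, |Q|) for the minimal Byzantine configuration n = 3f + 1."""
    return 3 * f + 1, 2 * f + 1


def min_overlap(n: int, q: int) -> int:
    """Fewest nodes that two quorums of size q can share among n nodes."""
    return max(0, 2 * q - n)
```

For n = 5, `crash_quorum(5)` is 3 and `min_overlap(5, 3)` is 1, so any two majorities share at least one node. For f = 1, `byzantine_quorum(1)` is (4, 3) and `min_overlap(4, 3)` is 2 = f + 1, so any two quorums share at least one correct node.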

Liveness is always conditional: specify when progress is expected and what you do otherwise.

Write down the safety property first. If it’s not written, it’s not implemented.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
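As a sketch of what such a signal could look like, the monitor below counts regressions of a node's reported commit index. The class name and the in-memory counter are assumptions; a real deployment would export the counter through its own metrics system.

```python
class CommitIndexMonitor:
    """Counts observations where a node's commit index moves backwards."""

    def __init__(self) -> None:
        self.last_seen = {}   # node_id -> highest commit index observed
        self.violations = 0   # should stay at 0; alert on any increase

    def observe(self, node_id: str, commit_index: int) -> bool:
        """Record one observation; return False if it violates monotonicity."""
        prev = self.last_seen.get(node_id)
        if prev is not None and commit_index < prev:
            self.violations += 1
            return False
        self.last_seen[node_id] = commit_index
        return True
```

A regression is counted rather than silently overwritten, so the "impossible" event leaves evidence even if the node later catches up.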

Security properties

Downgrade resistance: negotiation can’t silently weaken security posture.
Least authority: privileges are scoped by purpose and time.
Replay resistance: duplicated inputs do not change outcomes.
Evidence: critical actions emit verifiable audit events.
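Replay resistance is commonly approximated by deduplicating on a client-supplied request id. The sketch below assumes such an id exists and keeps an unbounded result cache for simplicity; both the class and its fields are hypothetical, and a real system would need to bound or expire the cache.

```python
class DedupingStateMachine:
    """Applies each request id at most once and replays the cached result."""

    def __init__(self) -> None:
        self.balance = 0
        self.results = {}  # request_id -> result returned the first time

    def apply(self, request_id: str, amount: int) -> int:
        if request_id in self.results:
            # Duplicate delivery or client retry: do not apply again.
            return self.results[request_id]
        self.balance += amount
        self.results[request_id] = self.balance
        return self.balance
```

Delivering `("r1", 10)` twice leaves the balance at 10, and the second call returns the same result as the first.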

Failure modes

Timeout ambiguity causing double-apply or partial state transitions.
Config drift that weakens security posture over time.
Recovery paths that only work when nothing is broken.
Observability gaps during incidents (missing evidence).

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

sequenceDiagram
  participant C as Client
  participant L as Leader
  participant F1 as Follower 1
  participant F2 as Follower 2
  C->>L: propose(cmd)
  L->>F1: appendEntries
  L->>F2: appendEntries
  F1-->>L: ack
  F2-->>L: ack
  L-->>C: commit(result)
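The diagram leaves the commit rule implicit. A minimal sketch of it, assuming a fixed cluster size and counting the leader's own copy toward the majority, might look like the following. The class is hypothetical and omits terms, log matching, and Raft's restriction that a leader only commits entries from its current term by counting replicas.

```python
class Leader:
    """Tracks acks per log index and advances the commit index on majority."""

    def __init__(self, cluster_size: int) -> None:
        self.cluster_size = cluster_size
        self.acks = {}          # log index -> set of node ids storing the entry
        self.commit_index = 0   # highest index known to be committed

    def propose(self, index: int) -> None:
        # The leader has stored the entry locally, so it counts as one ack.
        self.acks[index] = {"leader"}

    def on_ack(self, index: int, follower_id: str) -> int:
        self.acks[index].add(follower_id)  # a set ignores duplicate acks
        majority = self.cluster_size // 2 + 1
        # Advance over every consecutive entry that a majority has stored.
        while len(self.acks.get(self.commit_index + 1, ())) >= majority:
            self.commit_index += 1
        return self.commit_index
```

With three nodes, one follower ack is enough to commit, so the reply to the client does not have to wait for the slower follower.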

Implementation notes

Your protocol is an interface between failures and invariants. Encode both.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control

Verification strategy

Model checking the smallest core (timeouts, election, reconfiguration).
Stress + skew tests: hot keys, slow disks, noisy neighbors.
Linearizability checks for read/write APIs that claim it.
Upgrade tests: mixed versions and rolling deploy invariants.
Jepsen-style fault injection: partitions + reordering + client retries.
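A small randomized test gives a feel for the reordering and duplication cases. This is a sketch under stated assumptions: `Receiver` is a hypothetical component that applies entries strictly by 1-based index, and the schedule generator only shuffles and duplicates messages. Random sampling of this kind complements model checking; it does not replace it.

```python
import random


def deliver_in_order(entries):
    """Reference result: values of (index, value) entries in index order."""
    return [value for _, value in sorted(entries)]


class Receiver:
    """Buffers out-of-order entries and applies them strictly by index."""

    def __init__(self):
        self.pending = {}   # index -> value, received but not yet applicable
        self.applied = []   # values applied so far, in index order

    def receive(self, index, value):
        if index <= len(self.applied):
            return  # duplicate of an already-applied entry
        self.pending[index] = value
        while len(self.applied) + 1 in self.pending:
            self.applied.append(self.pending.pop(len(self.applied) + 1))


def run_schedule(entries, seed):
    """Deliver every entry at least once, shuffled, with random duplicates."""
    rng = random.Random(seed)
    schedule = entries + [rng.choice(entries) for _ in range(len(entries))]
    rng.shuffle(schedule)
    receiver = Receiver()
    for index, value in schedule:
        receiver.receive(index, value)
    return receiver.applied
```

Running `run_schedule` across many seeds and comparing each result with `deliver_in_order` checks that neither reordering nor duplication changes the applied sequence. Fixing the seed keeps any failing schedule reproducible.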

Operational notes

Expose protocol state: term/epoch, leader, commit index, config version.
Make client behavior part of the system: document retry semantics.
Treat compaction and snapshot install as first-class SLOs.
Rate-limit retries and apply admission control before saturation.
Prefer monotonic time sources for leases; alert on clock discontinuities.
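For the lease point, a minimal sketch using Python's `time.monotonic` is shown below. The `Lease` class, its injectable `clock` parameter, and the safety margin are assumptions for illustration. A monotonic clock only protects against local wall-clock jumps; leases shared between nodes still depend on bounded clock drift.

```python
import time


class Lease:
    """Local view of a lease, measured on a monotonic clock."""

    def __init__(self, duration_s, clock=time.monotonic):
        self._clock = clock            # injectable so tests can control time
        self._duration_s = duration_s
        self._expires_at = None        # None until the lease is first renewed

    def renew(self):
        self._expires_at = self._clock() + self._duration_s

    def is_valid(self, safety_margin_s=0.0):
        """True only if the lease outlives now plus the safety margin."""
        if self._expires_at is None:
            return False
        return self._clock() + safety_margin_s < self._expires_at
```

The safety margin lets the holder stop acting on the lease slightly early, which leaves room for drift between the holder's clock and the grantor's.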

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Authz failures and policy denials (unexpected spikes).
Rollback events and the conditions that triggered them.
Invariant violation rate (expected to be 0; treat any nonzero value as an incident).
Admission-control / rate-limit rejections (by reason).
Error budget burn + tail latency under load.

Rollback plan

Keep dual-write / dual-verify windows where appropriate.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Define an explicit rollback trigger (metrics + thresholds).
Use canaries and staged rollout; stop early when signals degrade.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
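An explicit rollback trigger can be as small as a set of thresholds evaluated against canary metrics. In the sketch below the metric names and threshold values are placeholders chosen for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    # Threshold values are placeholders, not recommendations.
    max_invariant_violations: int = 0
    max_error_rate: float = 0.01
    max_p99_latency_ms: float = 500.0

    def reasons(self, metrics: dict) -> list:
        """Return the name of every threshold the metrics exceed."""
        breached = []
        if metrics["invariant_violations"] > self.max_invariant_violations:
            breached.append("invariant_violations")
        if metrics["error_rate"] > self.max_error_rate:
            breached.append("error_rate")
        if metrics["p99_latency_ms"] > self.max_p99_latency_ms:
            breached.append("p99_latency_ms")
        return breached

    def should_roll_back(self, metrics: dict) -> bool:
        return bool(self.reasons(metrics))
```

Returning the breached thresholds, and not only a boolean, records why a rollback fired, which supports preserving evidence about what changed.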

Evidence

Jepsen (1) — Testing correctness under partitions and faults.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Learn TLA+ (2) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

Where does your protocol assume synchrony without admitting it?
How do you prevent “operator fixes” from changing safety properties?
Which guarantee degrades first under overload: latency, availability, or correctness?
What is the worst-case recovery time after a leader + disk failure?

Checklist

Assumptions listed and reviewed.
Failure modes enumerated with mitigations.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Safety properties stated as invariants.
Rollback plan rehearsed and automated.
Telemetry captures correctness signals.
