Monthly research note. Theme: Distributed Systems Under Failure.
TL;DR
A focused memo on State Machine Replication: Log Design, Snapshots, and Compaction: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.
Correctness is cheaper to enforce at interfaces than to repair in production data.
Key takeaways
- Expose protocol state (epoch/term/commit index) as first-class telemetry.
- Treat membership changes and compaction as protocol events—not operational details.
- Backpressure and admission control are correctness mechanisms under load.
- Treat retries, reordering, and partial failure as default conditions.
- Bind security decisions to evidence (audit, invariants, telemetry).
Why this matters
- Safety failures are permanent; liveness failures are (sometimes) recoverable.
- Global systems fail in correlated ways (regions, dependencies, routing).
- If your protocol isn’t testable under reordering, it isn’t deployable.
- Observability must explain protocol state, not just latency.
Key questions
- What does “read” mean under replication lag?
- What is the unit of ordering (per key, per partition, global)?
- What is your reconfiguration model (joint consensus, epochs, leases)?
- How do clients discover leaders safely (and what happens during flaps)?
- Where do you pay for liveness (timeouts, leader election, reconfiguration)?
- How do you prevent overload from becoming inconsistency?
Assumptions
- Workload is skewed: hot keys exist and dominate.
- Nodes restart with partial state unless you prove durability.
- Clients retry and amplify load right when the system is weakest.
- Packets can be duplicated and reordered; acks can be lost.
Non-goals
- Pretending backpressure is an implementation detail.
- Assuming the network eventually behaves “nicely” under load.
Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.
Model & invariants
A common safety shape for replicated logs:
Liveness is always conditional: specify when progress is expected and what you do otherwise.
Treat membership changes as protocol events, not control-plane side effects.
Make the “impossible state” observable: a metric or alert that fires when invariants drift.
Security properties
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Evidence: critical actions emit verifiable audit events.
- Authenticity: actions are bound to identity and purpose.
- Replay resistance: duplicated inputs do not change outcomes.
Failure modes
- Mixed-version behavior that violates assumptions silently.
- Observability gaps during incidents (missing evidence).
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Timeout ambiguity causing double-apply or partial state transitions.
Mixed-version deployments create states you never tested—plan for them explicitly.
Design sketch
flowchart TD
client["Client"] --> leader["Leader"]
leader --> log["Replicated Log"]
log --> snap["Snapshot"]
snap --> recover["Recovery / Catch-up"]
leader --> reconfig["Reconfiguration"]Implementation notes
Your protocol is an interface between failures and invariants. Encode both.
If you can’t explain a timeout outcome, you can’t make retries safe.
Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_controlVerification strategy
- Jepsen-style fault injection: partitions + reordering + client retries.
- Upgrade tests: mixed versions and rolling deploy invariants.
- Deterministic replay of network traces to reproduce rare failures.
- Model checking the smallest core (timeouts, election, reconfiguration).
- Stress + skew tests: hot keys, slow disks, noisy neighbors.
Operational notes
- Expose protocol state: term/epoch, leader, commit index, config version.
- Make client behavior part of the system: document retry semantics.
- Treat compaction and snapshot install as first-class SLOs.
- Rehearse region failover and reconfiguration under load.
- Rate-limit retries and apply admission control before saturation.
Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.
What to monitor
- Rollback events and the conditions that triggered them.
- Error budget burn + tail latency under load.
- Authz failures and policy denials (unexpected spikes).
- Admission-control / rate-limit rejections (by reason).
- Invariant violation rate (should be ~0).
Rollback plan
- Keep dual-write / dual-verify windows where appropriate.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Use canaries and staged rollout; stop early when signals degrade.
- Define an explicit rollback trigger (metrics + thresholds).
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Evidence
- Time, Clocks, and the Ordering of Events (Lamport) (1) — Causality, ordering, and why clocks are tricky.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
- Jepsen (2) — Testing correctness under partitions and faults.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Open questions
- Which invariants are violated first under overload: latency, availability, or correctness?
- What is the worst-case recovery time after a leader + disk failure?
- How do you prevent “operator fixes” from changing safety properties?
- Where does your protocol assume synchrony without admitting it?
Checklist
- Telemetry captures correctness signals.
- Rollback plan rehearsed and automated.
- Assumptions listed and reviewed.
- Safety properties stated as invariants.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Failure modes enumerated with mitigations.
Further reading
- Jepsen — Testing correctness under partitions and faults.
- Time, Clocks, and the Ordering of Events (Lamport) — Causality, ordering, and why clocks are tricky.
- Paxos Made Simple (Lamport) — Agreement basics and the invariants that matter.
- In Search of an Understandable Consensus Algorithm (Raft) — Consensus with explicit state machines and practical tradeoffs.
- Learn TLA+ — Practical entry point for specification and model checking.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.