Monthly research note. Theme: Distributed Systems Under Failure.
TL;DR
Treat designing for network partitions, and for degraded modes that still make sense, as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- Write the safety property first; liveness is always conditional on timing assumptions.
- Expose protocol state (epoch/term/commit index) as first-class telemetry.
- Treat membership changes and compaction as protocol events—not operational details.
- Prefer protocols and APIs that make invalid states hard to express.
- Automate guardrails; humans are for judgment, not for consistent enforcement.
Why this matters
- Operational simplicity is a security property: fewer modes, fewer surprises.
- Tail latency is a protocol input: it changes who retries and when.
- State compaction and snapshots are where correctness goes to die quietly.
- Backpressure and fairness are part of correctness when resources are finite.
Key questions
- How do you prevent overload from becoming inconsistency?
- Which components require determinism for reproducibility?
- Which safety property is non-negotiable (no double-commit, no forks, no split brain)?
- How do clients discover leaders safely (and what happens during flaps)?
- What does “read” mean under replication lag?
- What is your reconfiguration model (joint consensus, epochs, leases)?
Assumptions
- Clients retry and amplify load right when the system is weakest.
- Delays are unbounded during incidents; timeouts are guesses.
- Workload is skewed: hot keys exist and dominate.
- Partitions happen at multiple layers (network, DNS, LB, service mesh).
Non-goals
- Pretending backpressure is an implementation detail.
- Relying on global time for ordering without strong synchronization assumptions.
Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.
Model & invariants
Under partial synchrony, progress depends on a stabilizing period: safety must hold at all times, while liveness is only expected once message delays become bounded.
Make overload explicit: admission control is a protocol boundary.
Write down the safety property first. If it’s not written, it’s not implemented.
If the system can enter an invalid state, it eventually will—usually during an incident.
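One way to make a written safety property executable is a small monitor that checks each observed transition. The sketch below is illustrative only: `SafetyMonitor`, `observe_commit`, and `observe_leader` are hypothetical names, and the two invariants (commit index never decreases per node, at most one leader per term) are assumed examples of the kind of property this section has in mind.

```python
from dataclasses import dataclass, field


class InvariantViolation(Exception):
    """Raised when an observed state transition breaks a safety invariant."""


@dataclass
class SafetyMonitor:
    # Highest commit index seen so far for each node (hypothetical telemetry).
    commit_index: dict = field(default_factory=dict)
    # First leader observed for each term; a second, different one is a violation.
    leader_by_term: dict = field(default_factory=dict)

    def observe_commit(self, node: str, index: int) -> None:
        """Invariant: a node's commit index never moves backwards."""
        previous = self.commit_index.get(node, 0)
        if index < previous:
            raise InvariantViolation(
                f"commit index moved backwards on {node}: {previous} -> {index}"
            )
        self.commit_index[node] = index

    def observe_leader(self, term: int, leader: str) -> None:
        """Invariant: at most one leader per term (no split brain)."""
        known = self.leader_by_term.setdefault(term, leader)
        if known != leader:
            raise InvariantViolation(
                f"two leaders in term {term}: {known} and {leader}"
            )
```

Feeding the same monitor from tests and from production telemetry keeps the written property and the checked property from drifting apart.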
Security properties
- Integrity: invalid transitions are rejected (and detectable).
- Replay resistance: duplicated inputs do not change outcomes.
- Evidence: critical actions emit verifiable audit events.
- Least authority: privileges are scoped by purpose and time.
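Replay resistance is often implemented with idempotency keys. The following is a minimal sketch, assuming clients attach a unique `request_id` to every request; `IdempotentApplier` and `apply_deposit` are hypothetical names, and the in-memory dictionary stands in for whatever durable store a real system would use.

```python
class IdempotentApplier:
    """Applies each request at most once, keyed by a client-supplied request ID."""

    def __init__(self) -> None:
        self.balance = 0
        self._results = {}  # request_id -> result of the first application

    def apply_deposit(self, request_id: str, amount: int) -> int:
        if request_id in self._results:
            # Duplicate delivery (retry or replay): return the recorded
            # result and leave the state untouched.
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance
```

The deduplication table here grows without bound; when and how entries may be expired is itself a protocol decision, because expiring too early reopens the replay window.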
Failure modes
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Mixed-version behavior that violates assumptions silently.
- Recovery paths that only work when nothing is broken.
- Observability gaps during incidents (missing evidence).
Caches tend to become sources of truth unless you can recompute and validate them.
Design sketch
sequenceDiagram
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
C->>L: propose(cmd)
L->>F1: appendEntries
L->>F2: appendEntries
F1-->>L: ack
F2-->>L: ack
L-->>C: commit(result)
Implementation notes
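The commit step in the design sketch can be illustrated with a quorum check. The diagram shows both followers acknowledging but does not state the commit rule, so this sketch assumes a simple majority quorum in the style of Raft; `majority` and `is_committed` are hypothetical helper names.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def is_committed(follower_acks: set, leader: str, cluster_size: int) -> bool:
    """An entry counts as committed once a majority holds it.

    The leader counts toward the quorum because it has already
    appended the entry to its own log.
    """
    holders = set(follower_acks) | {leader}
    return len(holders) >= majority(cluster_size)
```

In the three-node cluster from the diagram, one follower ack plus the leader is enough under this assumption; a leader cut off from both followers cannot commit and should stop acknowledging writes.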
Protocols fail at the boundaries: timeouts, membership, compaction, and overload.
Make rollbacks boring: if rollback is a hero move, it will fail.
Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control
Verification strategy
- Stress + skew tests: hot keys, slow disks, noisy neighbors.
- Model checking the smallest core (timeouts, election, reconfiguration).
- Linearizability checks for read/write APIs that claim it.
- Jepsen-style fault injection: partitions + reordering + client retries.
- Upgrade tests: mixed versions and rolling deploy invariants.
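To make the linearizability item above concrete, here is a brute-force checker for a single register. It is a sketch under stated assumptions: every operation in the history has completed, histories are small (the search is factorial in their length), and `Op` and `is_linearizable` are hypothetical names. Production checkers use far more efficient search.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    start: float           # invocation time
    end: float             # response time


def _respects_real_time(order) -> bool:
    # If an operation finished before another started, it must also come
    # first in the candidate sequential order.
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def _is_legal_register(order, initial) -> bool:
    # Replay the order against a register: each read must return the
    # most recently written value.
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history, initial=None) -> bool:
    """True if some sequential order explains the history."""
    return any(
        _respects_real_time(order) and _is_legal_register(order, initial)
        for order in permutations(history)
    )
```

A read that returns the old value after a write has already completed fails this check, while the same read overlapping the write passes, which is exactly the distinction a stale-read bug under replication lag turns on.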
Operational notes
- Rate-limit retries and apply admission control before saturation.
- Rehearse region failover and reconfiguration under load.
- Expose protocol state: term/epoch, leader, commit index, config version.
- Make client behavior part of the system: document retry semantics.
- Prefer monotonic time sources for leases; alert on clock discontinuities.
Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.
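Admission control before saturation, as recommended in the operational notes, can be sketched with a token bucket. This is one common technique rather than a prescribed design; `TokenBucket` and `try_admit` are hypothetical names, and the clock is injectable so the behavior can be tested deterministically. It defaults to `time.monotonic`, in line with the note about preferring monotonic time sources.

```python
import time


class TokenBucket:
    """Admits requests at a sustained rate with a bounded burst."""

    def __init__(self, rate_per_sec: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst      # start full so an idle system accepts a burst
        self.clock = clock
        self.last = clock()

    def try_admit(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        # Reject explicitly so callers can shed load instead of queueing.
        return False
```

Rejections from `try_admit` are the natural source for a metric such as rejected_requests_due_to_admission_control, and the same bucket can be applied per client cohort to rate-limit retries.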
What to monitor
- Invariant violation rate (should be ~0).
- Authz failures and policy denials (unexpected spikes).
- Rollback events and the conditions that triggered them.
- Error budget burn + tail latency under load.
- Retry/timeout rates by endpoint and client cohort.
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Keep dual-write / dual-verify windows where appropriate.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Define an explicit rollback trigger (metrics + thresholds).
- Use canaries and staged rollout; stop early when signals degrade.
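An explicit rollback trigger can be written as a pure function over canary metrics. The sketch below is illustrative: the metric names and the default thresholds in `RollbackThresholds` are placeholder assumptions, not recommended values, and `rollback_reasons` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackThresholds:
    # Placeholder limits; real values come from the service's error budget.
    max_error_rate: float = 0.02
    max_p99_latency_ms: float = 500.0
    max_invariant_violations: int = 0


def rollback_reasons(
    error_rate: float,
    p99_latency_ms: float,
    invariant_violations: int,
    limits: RollbackThresholds = RollbackThresholds(),
) -> list:
    """Return the list of breached thresholds; an empty list means continue."""
    reasons = []
    if error_rate > limits.max_error_rate:
        reasons.append(f"error_rate {error_rate} > {limits.max_error_rate}")
    if p99_latency_ms > limits.max_p99_latency_ms:
        reasons.append(f"p99 {p99_latency_ms}ms > {limits.max_p99_latency_ms}ms")
    if invariant_violations > limits.max_invariant_violations:
        reasons.append(f"{invariant_violations} invariant violation(s)")
    return reasons
```

Returning the reasons rather than a bare boolean means the rollback event can record the conditions that triggered it, which is the evidence the monitoring section asks for.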
Evidence
- Jepsen (1) — Testing correctness under partitions and faults.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
- Time, Clocks, and the Ordering of Events (Lamport) (2) — Causality, ordering, and why clocks are tricky.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
Open questions
- Where does your protocol assume synchrony without admitting it?
- What is the worst-case recovery time after a leader + disk failure?
- How do you prevent “operator fixes” from changing safety properties?
- Which invariants are violated first under overload: latency, availability, or correctness?
Checklist
- Telemetry captures correctness signals.
- Safety properties stated as invariants.
- Rollback plan rehearsed and automated.
- Assumptions listed and reviewed.
- Failure modes enumerated with mitigations.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Further reading
- Paxos Made Simple (Lamport) — Agreement basics and the invariants that matter.
- In Search of an Understandable Consensus Algorithm (Raft) — Consensus with explicit state machines and practical tradeoffs.
- Time, Clocks, and the Ordering of Events (Lamport) — Causality, ordering, and why clocks are tricky.
- Jepsen — Testing correctness under partitions and faults.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.
- Learn TLA+ — Practical entry point for specification and model checking.