Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

Designing for Network Partitions: Degraded Modes That Still Make Sense as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

  • Write the safety property first; liveness is always conditional on timing assumptions.
  • Expose protocol state (epoch/term/commit index) as first-class telemetry.
  • Treat membership changes and compaction as protocol events—not operational details.
  • Prefer protocols and APIs that make invalid states hard to express.
  • Automate guardrails; humans are for judgment, not for consistent enforcement.

Why this matters

  • Operational simplicity is a security property: fewer modes, fewer surprises.
  • Tail latency is a protocol input: it changes who retries and when.
  • State compaction and snapshots are where correctness goes to die quietly.
  • Backpressure and fairness are part of correctness when resources are finite.

Key questions

  • How do you prevent overload from becoming inconsistency?
  • Which components require determinism for reproducibility?
  • Which safety property is non-negotiable (no double-commit, no forks, no split brain)?
  • How do clients discover leaders safely (and what happens during flaps)?
  • What does “read” mean under replication lag?
  • What is your reconfiguration model (joint consensus, epochs, leases)?

Assumptions

  • Clients retry and amplify load right when the system is weakest.
  • Delays are unbounded during incidents; timeouts are guesses.
  • Workload is skewed: hot keys exist and dominate.
  • Partitions happen at multiple layers (network, DNS, LB, service mesh).

Non-goals

  • Pretending backpressure is an implementation detail.
  • Relying on global time for ordering without strong synchronization assumptions.
Attack surface

Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.

Model & invariants

Under partial synchrony, progress depends on a stabilizing period:

T: tT, messages delivered within Δ.\exists T:\ \forall t \ge T,\ \text{messages delivered within } \Delta.

Make overload explicit: admission control is a protocol boundary.

Write down the safety property first. If it’s not written, it’s not implemented.

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

  • Integrity: invalid transitions are rejected (and detectable).
  • Replay resistance: duplicated inputs do not change outcomes.
  • Evidence: critical actions emit verifiable audit events.
  • Least authority: privileges are scoped by purpose and time.

Failure modes

  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Mixed-version behavior that violates assumptions silently.
  • Recovery paths that only work when nothing is broken.
  • Observability gaps during incidents (missing evidence).
Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

sequenceDiagram
  participant C as Client
  participant L as Leader
  participant F1 as Follower 1
  participant F2 as Follower 2
  C->>L: propose(cmd)
  L->>F1: appendEntries
  L->>F2: appendEntries
  F1-->>L: ack
  F2-->>L: ack
  L-->>C: commit(result)

Implementation notes

Protocols fail at the boundaries: timeouts, membership, compaction, and overload.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control

Verification strategy

  • Stress + skew tests: hot keys, slow disks, noisy neighbors.
  • Model checking the smallest core (timeouts, election, reconfiguration).
  • Linearizability checks for read/write APIs that claim it.
  • Jepsen-style fault injection: partitions + reordering + client retries.
  • Upgrade tests: mixed versions and rolling deploy invariants.

Operational notes

  • Rate-limit retries and apply admission control before saturation.
  • Rehearse region failover and reconfiguration under load.
  • Expose protocol state: term/epoch, leader, commit index, config version.
  • Make client behavior part of the system: document retry semantics.
  • Prefer monotonic time sources for leases; alert on clock discontinuities.
Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Invariant violation rate (should be ~0).
  • Authz failures and policy denials (unexpected spikes).
  • Rollback events and the conditions that triggered them.
  • Error budget burn + tail latency under load.
  • Retry/timeout rates by endpoint and client cohort.

Rollback plan

  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Keep dual-write / dual-verify windows where appropriate.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Use canaries and staged rollout; stop early when signals degrade.

Evidence

  • Jepsen (1) — Testing correctness under partitions and faults.
    • Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
  • Time, Clocks, and the Ordering of Events (Lamport) (2) — Causality, ordering, and why clocks are tricky.
    • Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.

Open questions

  • Where does your protocol assume synchrony without admitting it?
  • What is the worst-case recovery time after a leader + disk failure?
  • How do you prevent “operator fixes” from changing safety properties?
  • Which invariants are violated first under overload: latency, availability, or correctness?

Checklist

  • Telemetry captures correctness signals.
  • Safety properties stated as invariants.
  • Rollback plan rehearsed and automated.
  • Assumptions listed and reviewed.
  • Failure modes enumerated with mitigations.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

Further reading

1.
Jepsen. Jepsen: Distributed Systems Safety Analysis [Internet]. Web; Available from: https://jepsen.io/
2.
Lamport L. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM [Internet]. 1978;21(7):558–65. Available from: https://lamport.azurewebsites.net/pubs/time-clocks.pdf