Monthly research note. Theme: Distributed Systems Under Failure.
TL;DR
A focused memo on Rate Limiting and Fairness: Protecting Critical Paths: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.
Correctness is cheaper to enforce at interfaces than to repair in production data.
Key takeaways
- Write the safety property first; liveness is always conditional on timing assumptions.
- Expose protocol state (epoch/term/commit index) as first-class telemetry.
- Mixed-version operation is the default; upgrades must preserve invariants.
- Write assumptions down; treat them as interfaces.
- Design rollbacks as part of the happy path.
Why this matters
- Most protocol bugs hide in timeouts, retries, and membership changes.
- Mixed-version operation is the default state of real deployments.
- Tail latency is a protocol input: it changes who retries and when.
- Observability must explain protocol state, not just latency.
Key questions
- How do clients discover leaders safely (and what happens during flaps)?
- Which components require determinism for reproducibility?
- Where do you pay for liveness (timeouts, leader election, reconfiguration)?
- What is your reconfiguration model (joint consensus, epochs, leases)?
- What does “read” mean under replication lag?
- What is the failure model (crash, byzantine, partitions, reordering)?
Assumptions
- Reconfigurations happen mid-incident (the worst time).
- Clocks drift; leases can be violated under GC pauses or VM stalls.
- Workload is skewed: hot keys exist and dominate.
- Clients retry and amplify load right when the system is weakest.
Non-goals
- Treating membership as static or human-managed only.
- Relying on global time for ordering without strong synchronization assumptions.
Any unbounded work per request becomes a DoS primitive under adversaries.
Model & invariants
A common safety shape for replicated logs:
Liveness is always conditional: specify when progress is expected and what you do otherwise.
Treat membership changes as protocol events, not control-plane side effects.
Make the “impossible state” observable: a metric or alert that fires when invariants drift.
Security properties
- Integrity: invalid transitions are rejected (and detectable).
- Evidence: critical actions emit verifiable audit events.
- Authenticity: actions are bound to identity and purpose.
- Least authority: privileges are scoped by purpose and time.
Failure modes
- Config drift that weakens security posture over time.
- Observability gaps during incidents (missing evidence).
- Recovery paths that only work when nothing is broken.
- Timeout ambiguity causing double-apply or partial state transitions.
Mixed-version deployments create states you never tested—plan for them explicitly.
Design sketch
stateDiagram-v2
[*] --> Follower
Follower --> Candidate: timeout
Candidate --> Leader: win quorum
Candidate --> Follower: lose
Leader --> Follower: stepdownImplementation notes
Your protocol is an interface between failures and invariants. Encode both.
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_controlVerification strategy
- Model checking the smallest core (timeouts, election, reconfiguration).
- Linearizability checks for read/write APIs that claim it.
- Stress + skew tests: hot keys, slow disks, noisy neighbors.
- Jepsen-style fault injection: partitions + reordering + client retries.
- Deterministic replay of network traces to reproduce rare failures.
Operational notes
- Prefer monotonic time sources for leases; alert on clock discontinuities.
- Expose protocol state: term/epoch, leader, commit index, config version.
- Rehearse region failover and reconfiguration under load.
- Treat compaction and snapshot install as first-class SLOs.
- Rate-limit retries and apply admission control before saturation.
Keep audit and config history queryable during incidents—evidence beats intuition.
What to monitor
- Authz failures and policy denials (unexpected spikes).
- Rollback events and the conditions that triggered them.
- Invariant violation rate (should be ~0).
- Retry/timeout rates by endpoint and client cohort.
- Error budget burn + tail latency under load.
Rollback plan
- Use canaries and staged rollout; stop early when signals degrade.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Define an explicit rollback trigger (metrics + thresholds).
- Keep dual-write / dual-verify windows where appropriate.
Evidence
- In Search of an Understandable Consensus Algorithm (Raft) (1) — Consensus with explicit state machines and practical tradeoffs.
- Evidence: Track term/commitIndex as explicit evidence; test leader changes and log conflicts as part of rollback behavior.
- Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Open questions
- Where does your protocol assume synchrony without admitting it?
- What is the worst-case recovery time after a leader + disk failure?
- Which invariants are violated first under overload: latency, availability, or correctness?
- How do you prevent “operator fixes” from changing safety properties?
Checklist
- Assumptions listed and reviewed.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Rollback plan rehearsed and automated.
- Failure modes enumerated with mitigations.
- Safety properties stated as invariants.
- Telemetry captures correctness signals.
Further reading
- Jepsen — Testing correctness under partitions and faults.
- Paxos Made Simple (Lamport) — Agreement basics and the invariants that matter.
- Time, Clocks, and the Ordering of Events (Lamport) — Causality, ordering, and why clocks are tricky.
- In Search of an Understandable Consensus Algorithm (Raft) — Consensus with explicit state machines and practical tradeoffs.
- Learn TLA+ — Practical entry point for specification and model checking.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.