Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

A focused memo on Rate Limiting and Fairness: Protecting Critical Paths: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Write the safety property first; liveness is always conditional on timing assumptions.
  • Expose protocol state (epoch/term/commit index) as first-class telemetry.
  • Mixed-version operation is the default; upgrades must preserve invariants.
  • Write assumptions down; treat them as interfaces.
  • Design rollbacks as part of the happy path.

Why this matters

  • Most protocol bugs hide in timeouts, retries, and membership changes.
  • Mixed-version operation is the default state of real deployments.
  • Tail latency is a protocol input: it changes who retries and when.
  • Observability must explain protocol state, not just latency.

Key questions

  • How do clients discover leaders safely (and what happens during flaps)?
  • Which components require determinism for reproducibility?
  • Where do you pay for liveness (timeouts, leader election, reconfiguration)?
  • What is your reconfiguration model (joint consensus, epochs, leases)?
  • What does “read” mean under replication lag?
  • What is the failure model (crash, byzantine, partitions, reordering)?

Assumptions

  • Reconfigurations happen mid-incident (the worst time).
  • Clocks drift; leases can be violated under GC pauses or VM stalls.
  • Workload is skewed: hot keys exist and dominate.
  • Clients retry and amplify load right when the system is weakest.

Non-goals

  • Treating membership as static or human-managed only.
  • Relying on global time for ordering without strong synchronization assumptions.
Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.

Model & invariants

A common safety shape for replicated logs:

i: Committed(i)r: Logr[i]=Logleader[i].\forall i:\ \text{Committed}(i)\Rightarrow \forall r:\ \text{Log}_r[i] = \text{Log}_\text{leader}[i].

Liveness is always conditional: specify when progress is expected and what you do otherwise.

Treat membership changes as protocol events, not control-plane side effects.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.

Security properties

  • Integrity: invalid transitions are rejected (and detectable).
  • Evidence: critical actions emit verifiable audit events.
  • Authenticity: actions are bound to identity and purpose.
  • Least authority: privileges are scoped by purpose and time.

Failure modes

  • Config drift that weakens security posture over time.
  • Observability gaps during incidents (missing evidence).
  • Recovery paths that only work when nothing is broken.
  • Timeout ambiguity causing double-apply or partial state transitions.
Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

stateDiagram-v2
  [*] --> Follower
  Follower --> Candidate: timeout
  Candidate --> Leader: win quorum
  Candidate --> Follower: lose
  Leader --> Follower: stepdown

Implementation notes

Your protocol is an interface between failures and invariants. Encode both.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.

Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control

Verification strategy

  • Model checking the smallest core (timeouts, election, reconfiguration).
  • Linearizability checks for read/write APIs that claim it.
  • Stress + skew tests: hot keys, slow disks, noisy neighbors.
  • Jepsen-style fault injection: partitions + reordering + client retries.
  • Deterministic replay of network traces to reproduce rare failures.

Operational notes

  • Prefer monotonic time sources for leases; alert on clock discontinuities.
  • Expose protocol state: term/epoch, leader, commit index, config version.
  • Rehearse region failover and reconfiguration under load.
  • Treat compaction and snapshot install as first-class SLOs.
  • Rate-limit retries and apply admission control before saturation.
Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

  • Authz failures and policy denials (unexpected spikes).
  • Rollback events and the conditions that triggered them.
  • Invariant violation rate (should be ~0).
  • Retry/timeout rates by endpoint and client cohort.
  • Error budget burn + tail latency under load.

Rollback plan

  • Use canaries and staged rollout; stop early when signals degrade.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Keep dual-write / dual-verify windows where appropriate.

Evidence

  • In Search of an Understandable Consensus Algorithm (Raft) (1) — Consensus with explicit state machines and practical tradeoffs.
    • Evidence: Track term/commitIndex as explicit evidence; test leader changes and log conflicts as part of rollback behavior.
  • Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
    • Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

  • Where does your protocol assume synchrony without admitting it?
  • What is the worst-case recovery time after a leader + disk failure?
  • Which invariants are violated first under overload: latency, availability, or correctness?
  • How do you prevent “operator fixes” from changing safety properties?

Checklist

  • Assumptions listed and reviewed.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Rollback plan rehearsed and automated.
  • Failure modes enumerated with mitigations.
  • Safety properties stated as invariants.
  • Telemetry captures correctness signals.

Further reading

1.
Ongaro D, Ousterhout J. In Search of an Understandable Consensus Algorithm (Raft). In: 2014 USENIX Annual Technical Conference (USENIX ATC 14) [Internet]. 2014. Available from: https://raft.github.io/raft.pdf
2.
Beyer B, Jones C, Petoff J, Murphy NR. Site Reliability Engineering: How Google Runs Production Systems [Internet]. O’Reilly Media; 2016. Available from: https://sre.google/sre-book/table-of-contents/