Rate Limiting and Fairness: Protecting Critical Paths

Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

A focused memo on rate limiting and fairness as a way to protect critical paths. The approach: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

Write the safety property first; liveness is always conditional on timing assumptions.
Expose protocol state (epoch/term/commit index) as first-class telemetry.
Mixed-version operation is the default; upgrades must preserve invariants.
Write assumptions down; treat them as interfaces.
Design rollbacks as part of the happy path.

Why this matters

Most protocol bugs hide in timeouts, retries, and membership changes.
Mixed-version operation is the default state of real deployments.
Tail latency is a protocol input: it changes who retries and when.
Observability must explain protocol state, not just latency.

Key questions

How do clients discover leaders safely (and what happens during flaps)?
Which components require determinism for reproducibility?
Where do you pay for liveness (timeouts, leader election, reconfiguration)?
What is your reconfiguration model (joint consensus, epochs, leases)?
What does “read” mean under replication lag?
What is the failure model (crash, Byzantine, partitions, reordering)?

Assumptions

Reconfigurations happen mid-incident (the worst time).
Clocks drift; leases can be violated under GC pauses or VM stalls.
Workload is skewed: hot keys exist and dominate.
Clients retry and amplify load right when the system is weakest.
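
One way to act on the last assumption is to cap retries as a fraction of first-attempt traffic. The following is a minimal sketch of a client-side retry budget; the class name `RetryBudget`, the default ratio, and the token cap are illustrative assumptions, not values taken from any specific system.

```python
class RetryBudget:
    """Allow retries only up to a fixed fraction of first-attempt requests."""

    def __init__(self, ratio: float = 0.1, min_tokens: float = 1.0,
                 max_tokens: float = 100.0):
        self.ratio = ratio            # retry tokens earned per first attempt
        self.tokens = min_tokens      # small initial allowance (assumption)
        self.max_tokens = max_tokens  # cap so quiet periods cannot bank unlimited retries

    def record_request(self) -> None:
        """Call once per first-attempt request; earns a fraction of a retry."""
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_retry(self) -> bool:
        """Spend one token if available; otherwise the caller should fail fast."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With a ratio of 0.1, retries can add at most roughly 10% on top of fresh traffic, so a struggling backend sees bounded amplification instead of a retry storm.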

Non-goals

Treating membership as static or human-managed only.
Relying on global time for ordering without strong synchronization assumptions.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.

Model & invariants

A common safety shape for replicated logs:

\[ \forall i:\ \text{Committed}(i) \Rightarrow \forall r:\ \text{Log}_r[i] = \text{Log}_{\text{leader}}[i]. \]
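
The property can be turned into an offline check over captured logs. The sketch below is hedged: the function name and the `replica_commit` map are hypothetical, and it restricts the check to entries each replica itself considers committed, on the assumption that a lagging follower may not yet hold every entry the leader has committed.

```python
from typing import Dict, List, Tuple


def committed_prefix_violations(
    leader_log: List[str],
    replica_logs: Dict[str, List[str]],
    replica_commit: Dict[str, int],
) -> List[Tuple[str, int]]:
    """Return (replica, index) pairs where a committed entry differs from the leader's.

    replica_commit[r] is the number of entries replica r considers committed.
    """
    violations = []
    for name, log in replica_logs.items():
        for i in range(replica_commit[name]):
            # A committed index must exist on both sides and hold the same entry.
            missing = i >= len(log) or i >= len(leader_log)
            if missing or log[i] != leader_log[i]:
                violations.append((name, i))
    return violations
```

An empty result means the sampled state is consistent with the invariant; it does not prove the protocol is safe.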

Liveness is always conditional: specify when progress is expected and what you do otherwise.

Treat membership changes as protocol events, not control-plane side effects.

Invariant

Make the “impossible state” observable: export a metric, with an alert on it, that fires when an invariant is violated.

Security properties

Integrity: invalid transitions are rejected (and detectable).
Evidence: critical actions emit verifiable audit events.
Authenticity: actions are bound to identity and purpose.
Least authority: privileges are scoped by purpose and time.

Failure modes

Config drift that weakens security posture over time.
Observability gaps during incidents (missing evidence).
Recovery paths that only work when nothing is broken.
Timeout ambiguity causing double-apply or partial state transitions.
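
Timeout ambiguity is commonly handled with idempotency keys: the client attaches a request ID, and the server applies each ID at most once. A minimal in-memory sketch, assuming a hypothetical `IdempotentApplier` and a toy balance as the state:

```python
from typing import Dict


class IdempotentApplier:
    """Apply each request ID at most once; replay the stored result on retries."""

    def __init__(self) -> None:
        self.balance = 0
        self._results: Dict[str, int] = {}  # request_id -> result of the first apply

    def apply_deposit(self, request_id: str, amount: int) -> int:
        if request_id in self._results:
            # Retry after an ambiguous timeout: do not apply again.
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance
```

In a real system the deduplication record would need to be persisted atomically with the state change and eventually expired; both are outside this sketch.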

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

stateDiagram-v2
  [*] --> Follower
  Follower --> Candidate: timeout
  Candidate --> Leader: win quorum
  Candidate --> Follower: lose
  Leader --> Follower: stepdown
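
The diagram can be encoded as an explicit transition table so that any transition not drawn above is rejected rather than silently accepted. A small sketch; `TRANSITIONS` and `next_role` are illustrative names, and the event strings are copied from the diagram labels.

```python
# (current role, event) -> next role, mirroring the diagram edges exactly.
TRANSITIONS = {
    ("Follower", "timeout"): "Candidate",
    ("Candidate", "win quorum"): "Leader",
    ("Candidate", "lose"): "Follower",
    ("Leader", "stepdown"): "Follower",
}


def next_role(role: str, event: str) -> str:
    """Return the next role, or raise ValueError for a transition not in the table."""
    try:
        return TRANSITIONS[(role, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {role} on {event!r}") from None
```

Rejected transitions are a natural place to increment an invariant-violation counter.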

Implementation notes

Your protocol is an interface between failures and invariants. Encode both.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
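
A sketch of that ordering for a JSON batch endpoint: check size before parsing, validate shape before touching the items, and cap the item count before any expensive work. The limits, the `items` field, and the `admit_batch` name are all hypothetical.

```python
import json

MAX_BODY_BYTES = 64 * 1024  # hypothetical limit
MAX_BATCH_ITEMS = 100       # hypothetical limit


class RequestRejected(Exception):
    """Raised when a request fails a cheap pre-check."""


def admit_batch(raw_body: bytes) -> list:
    # Cheapest check first: reject oversized bodies before parsing anything.
    if len(raw_body) > MAX_BODY_BYTES:
        raise RequestRejected("body too large")
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        raise RequestRejected("malformed JSON") from None
    # Validate shape, then cap cost, before handing items to heavier stages.
    if not isinstance(payload, dict) or not isinstance(payload.get("items"), list):
        raise RequestRejected("expected an object with an 'items' list")
    if len(payload["items"]) > MAX_BATCH_ITEMS:
        raise RequestRejected("too many items")
    return payload["items"]
```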

Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control
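
As an illustration of the second item, a monitor for `commit_index_monotonic` can be as small as the sketch below. The class and field names are assumptions; a real deployment would export `violations` through whatever metrics library is already in use.

```python
class CommitIndexMonitor:
    """Track the highest commit index seen per node and count regressions."""

    def __init__(self) -> None:
        self.last_seen = {}   # node id -> highest commit index observed
        self.violations = 0   # export as a counter; expected to stay at 0

    def observe(self, node: str, commit_index: int) -> bool:
        """Record a sample; return False if the index moved backwards."""
        previous = self.last_seen.get(node)
        if previous is not None and commit_index < previous:
            self.violations += 1
            return False
        self.last_seen[node] = commit_index
        return True
```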

Verification strategy

Model checking the smallest core (timeouts, election, reconfiguration).
Linearizability checks for read/write APIs that claim it.
Stress + skew tests: hot keys, slow disks, noisy neighbors.
Jepsen-style fault injection: partitions + reordering + client retries.
Deterministic replay of network traces to reproduce rare failures.

Operational notes

Prefer monotonic time sources for leases; alert on clock discontinuities.
Expose protocol state: term/epoch, leader, commit index, config version.
Rehearse region failover and reconfiguration under load.
Treat compaction and snapshot install as first-class SLOs.
Rate-limit retries and apply admission control before saturation.
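
For the last point, a token bucket per client is one common way to combine admission control with basic fairness: each client gets its own refill rate and burst, so one noisy client cannot drain capacity reserved for others. The sketch below uses `time.monotonic` in line with the advice on time sources; the class names and parameters are illustrative assumptions.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refill at `rate` tokens per second, holding at most `burst` tokens."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = burst
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never exceeding the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerClientLimiter:
    """One bucket per client ID so limits are enforced independently."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = self._buckets[client_id] = TokenBucket(
                self.rate, self.burst, self.clock)
        return bucket.allow()
```

Note that the bucket map itself grows with the number of distinct client IDs; a production version would cap or evict entries so the limiter does not become unbounded state.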

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

Authz failures and policy denials (unexpected spikes).
Rollback events and the conditions that triggered them.
Invariant violation rate (should be ~0).
Retry/timeout rates by endpoint and client cohort.
Error budget burn + tail latency under load.
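
Error budget burn is usually expressed as a burn rate: the observed error ratio divided by the ratio the SLO allows. A value of 1.0 means the budget is being spent exactly as fast as the SLO permits. The helper below is a sketch with a hypothetical name and signature.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - SLO)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if requests == 0:
        return 0.0  # no traffic in the window: nothing burned
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget
```

For example, 10 errors in 1000 requests against a 99% SLO gives a burn rate of about 1.0, while 50 errors in the same window gives about 5.0.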

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Define an explicit rollback trigger (metrics + thresholds).
Keep dual-write / dual-verify windows where appropriate.
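
An explicit rollback trigger can be written down as data plus one pure function, which makes it reviewable and testable before an incident. In this sketch the metric names and limits are placeholders, and treating a missing metric as a breach is a design assumption (it fails safe when observability itself is degraded).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


def breached(thresholds: List[Threshold], metrics: Dict[str, float]) -> List[str]:
    """Names of metrics over their limit; a missing metric counts as a breach."""
    out = []
    for t in thresholds:
        value = metrics.get(t.metric)
        if value is None or value > t.max_value:
            out.append(t.metric)
    return out


def should_roll_back(thresholds: List[Threshold], metrics: Dict[str, float]) -> bool:
    """Roll back as soon as any threshold is breached."""
    return bool(breached(thresholds, metrics))
```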

Evidence

In Search of an Understandable Consensus Algorithm (Raft) (1) — Consensus with explicit state machines and practical tradeoffs.
- Evidence: Track term/commitIndex as explicit evidence; test leader changes and log conflicts as part of rollback behavior.
Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

Where does your protocol assume synchrony without admitting it?
What is the worst-case recovery time after a leader + disk failure?
Which invariants are violated first under overload: latency, availability, or correctness?
How do you prevent “operator fixes” from changing safety properties?

Checklist

Assumptions listed and reviewed.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Rollback plan rehearsed and automated.
Failure modes enumerated with mitigations.
Safety properties stated as invariants.
Telemetry captures correctness signals.
