Monthly research note. Theme: Correctness & Foundations.
TL;DR
Idempotency Everywhere: Designing Safe Retries in Distributed APIs. Treat safe retries as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- Separate durable state from derived state; derived must be recomputable or reconcilable.
- Crash points are part of the design; specify recovery after each state mutation.
- Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
- Define safety properties before performance goals.
- Prefer protocols and APIs that make invalid states hard to express.
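As a rough illustration of the monotonic-epoch takeaway, the sketch below has a store refuse any write whose epoch is older than the highest one it has accepted. `EpochGuardedStore`, `StaleEpochError`, and the epoch values are hypothetical names chosen for this note, not part of any particular system.

```python
class StaleEpochError(Exception):
    """Raised when a write carries an epoch older than one already accepted."""


class EpochGuardedStore:
    """Hypothetical in-memory store that orders writers by epoch, not by clock."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.value = None

    def write(self, epoch: int, value: object) -> None:
        # Reject superseded writers; wall-clock time is never consulted.
        if epoch < self.highest_epoch:
            raise StaleEpochError(f"epoch {epoch} < {self.highest_epoch}")
        self.highest_epoch = epoch
        self.value = value


store = EpochGuardedStore()
store.write(epoch=1, value="from old leader")
store.write(epoch=2, value="from new leader")
try:
    store.write(epoch=1, value="late write from old leader")
except StaleEpochError:
    pass  # the stale write is refused, so the newer value survives
```

The comparison uses only integers handed out by whatever component assigns epochs, so clock skew between writers cannot reorder them.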
Why this matters
- “Works in tests” often means “fails under reordering and retries.”
- Undefined behavior is an attack surface when inputs are adversarial.
- If recovery is not specified, recovery becomes improvisation.
- A system without explicit contracts becomes a collection of folklore and dashboards.
Key questions
- What does a client learn after a timeout: success, failure, or ambiguity?
- Where does concurrency create double-spend-style failures in your domain?
- How do you make “unsafe defaults” impossible to ship?
- What must be durable before you acknowledge?
- Which invariants must hold across crashes, restarts, and partial deployments?
- What exactly is the state, and what is derived or cached?
Assumptions
- Clients retry with backoff but not with perfect discipline (bursts happen).
- Time is untrusted: clock skew, NTP steps, monotonic vs wall-clock confusion.
- Crashes happen mid-write (torn state) unless you prove otherwise.
- Input is hostile: malformed, oversized, boundary values, protocol confusion.
Non-goals
- Baking invariants into tribal knowledge instead of code.
- Treating retries as a transport detail rather than a semantic constraint.
Parsing is an attacker-controlled interface—validate early and fail fast.
Model & invariants
A common pattern is to split state into durable and derived: durable state is the source of truth, and derived state is anything that can be rebuilt from it.
If you can’t define what a timeout means, you can’t implement retries safely. Make ambiguity explicit in the API.
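One way to make that ambiguity explicit is to give the client a three-valued result instead of a boolean. This is a minimal sketch; the `Outcome` enum and the policy strings returned by `next_step` are assumptions made for illustration.

```python
from enum import Enum


class Outcome(Enum):
    APPLIED = "applied"    # server confirmed the effect
    REJECTED = "rejected"  # server confirmed there was no effect
    UNKNOWN = "unknown"    # timeout or lost response: the effect may or may not exist


def next_step(outcome: Outcome) -> str:
    """Map each outcome to a client action (hypothetical policy)."""
    if outcome is Outcome.APPLIED:
        return "done"
    if outcome is Outcome.REJECTED:
        return "fix request or give up"
    # UNKNOWN must not be treated as failure: retry with the same idempotency key.
    return "retry with same idempotency key"
```

Keeping UNKNOWN as its own case forces every caller to decide what to do with it, rather than folding it into "failed".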
Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.
Invariants must be checkable from evidence you actually have (state + logs + counters).
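The durable/derived split and a checkable invariant might look like the following sketch. The `Account` class, its ledger-of-cents representation, and the method names are hypothetical; the point is only that the cached value can be validated against, and rebuilt from, the durable record.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    # Durable state: the append-only list of ledger entries (amounts in cents).
    entries: list[int] = field(default_factory=list)
    # Derived state: a cached balance that must be recomputable from entries.
    cached_balance: int = 0

    def append(self, amount: int) -> None:
        self.entries.append(amount)
        self.cached_balance += amount

    def recompute_balance(self) -> int:
        return sum(self.entries)

    def invariant_holds(self) -> bool:
        # Checkable from evidence we actually have: cache equals recomputation.
        return self.cached_balance == self.recompute_balance()

    def reconcile(self) -> None:
        # Durable state wins; the cache is overwritten, never the ledger.
        self.cached_balance = self.recompute_balance()


acct = Account()
acct.append(500)
acct.append(-200)
acct.cached_balance = 999           # simulate "ghost state" in the cache
drifted = not acct.invariant_holds()
acct.reconcile()                    # rebuild derived state from durable state
```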
Security properties
- Replay resistance: duplicated inputs do not change outcomes.
- Evidence: critical actions emit verifiable audit events.
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Least authority: privileges are scoped by purpose and time.
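Replay resistance is commonly approached with an idempotency key that the server remembers alongside the first response. The sketch below assumes a single-process, in-memory service; `PaymentService`, `charge`, and the response shape are invented for this example.

```python
class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self.balance = 100
        self._responses: dict[str, dict] = {}  # key -> first recorded response

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # A replayed key returns the stored response and applies nothing.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        if amount > self.balance:
            response = {"status": "rejected", "balance": self.balance}
        else:
            self.balance -= amount
            response = {"status": "applied", "balance": self.balance}
        # In a real system this record and the balance change would need to be
        # committed atomically; here both live in one process for illustration.
        self._responses[idempotency_key] = response
        return response


svc = PaymentService()
first = svc.charge("key-1", 30)
replay = svc.charge("key-1", 30)   # duplicate delivery of the same request
```

A fuller version would likely also reject a reused key that arrives with a different payload, and would expire old keys.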
Failure modes
- Config drift that weakens security posture over time.
- Timeout ambiguity causing double-apply or partial state transitions.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Observability gaps during incidents (missing evidence).
Mixed-version deployments create states you never tested—plan for them explicitly.
Design sketch
stateDiagram-v2
[*] --> Init
Init --> Ready: bootstrap()
Ready --> Processing: event(e)
Processing --> Ready: commit()
Processing --> Error: violate(Inv)
Error --> Ready: recover()
Implementation notes
Implementation is the act of making invalid state unrepresentable (or at least unignorable).
Make rollbacks boring: if rollback is a hero move, it will fail.
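The state diagram in the design sketch can be encoded as a transition table so that edges absent from the diagram are refused at runtime. This is one possible encoding, not the only one; the action strings are shortened from the diagram's labels and the `step` helper is hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()
    READY = auto()
    PROCESSING = auto()
    ERROR = auto()


# Allowed transitions, copied from the diagram: (state, action) -> next state.
TRANSITIONS = {
    (State.INIT, "bootstrap"): State.READY,
    (State.READY, "event"): State.PROCESSING,
    (State.PROCESSING, "commit"): State.READY,
    (State.PROCESSING, "violate"): State.ERROR,
    (State.ERROR, "recover"): State.READY,
}


def step(state: State, action: str) -> State:
    """Return the next state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"illegal transition: {action} from {state.name}") from None
```

For example, `step(State.INIT, "commit")` raises `ValueError`, because the diagram has no edge from Init on commit.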
Correctness checklist:
1) Define state (durable vs derived).
2) Enumerate transitions.
3) Write invariants (safety) and progress conditions (liveness).
4) Pick crash points and specify recovery.
5) Make retries part of semantics (idempotency keys, monotonic versions).
Verification strategy
- Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
- Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
- Metamorphic tests: applying the same operation twice must leave the same state and response as applying it once.
- Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
- Crash/restart tests: persist mid-transition and validate recovery correctness.
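The metamorphic item above can be sketched as a small reusable check. `assert_idempotent` and the two toy handlers are hypothetical; a real harness would run against the service's actual state and request types.

```python
import copy


def assert_idempotent(apply, state, request):
    """Metamorphic check: applying `request` twice equals applying it once."""
    once = copy.deepcopy(state)
    first_response = apply(once, request)

    twice = copy.deepcopy(once)
    second_response = apply(twice, request)

    assert twice == once, "second application changed state"
    assert second_response == first_response, "second application changed response"


def set_email(state: dict, request: dict) -> dict:
    # Idempotent by construction: assigns an absolute value.
    state[request["user"]] = request["email"]
    return {"ok": True}


def append_email(state: dict, request: dict) -> dict:
    # Not idempotent: every call grows the list.
    state.setdefault(request["user"], []).append(request["email"])
    return {"ok": True}
```

Calling `assert_idempotent(set_email, {}, request)` passes, while the same call with `append_email` raises `AssertionError`.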
Operational notes
- Design “degraded modes” explicitly (fail closed vs fail open per operation).
- Validate time assumptions: alert on clock steps, skew, and monotonicity issues.
- Make rollbacks safe: schema and protocol compatibility is a security boundary.
- Log as evidence: append-only where possible; isolate logs from compromised workloads.
- Track invariant violations as pages, not dashboards.
Attach explicit rollout/rollback triggers to changes that touch security or correctness.
What to monitor
- Invariant violation rate (should be ~0).
- Error budget burn + tail latency under load.
- Retry/timeout rates by endpoint and client cohort.
- Rollback events and the conditions that triggered them.
- Authz failures and policy denials (unexpected spikes).
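Invariant violation rate can only be monitored if violations are counted rather than silently tolerated. In the sketch below a `Counter` stands in for a metrics client; `check_invariant`, `record_transfer`, and the invariant names are assumptions for illustration.

```python
from collections import Counter

violations = Counter()  # hypothetical stand-in for a metrics client


def check_invariant(name: str, holds: bool) -> bool:
    """Record a violation instead of crashing, so the rate can be alerted on."""
    if not holds:
        violations[name] += 1
    return holds


def record_transfer(balances: dict, src: str, dst: str, amount: int) -> None:
    total_before = sum(balances.values())
    balances[src] -= amount
    balances[dst] += amount
    # Safety properties: money is conserved and no balance goes negative.
    check_invariant("total_conserved", sum(balances.values()) == total_before)
    check_invariant("non_negative", all(v >= 0 for v in balances.values()))
```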
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Use canaries and staged rollout; stop early when signals degrade.
- Define an explicit rollback trigger (metrics + thresholds).
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Keep dual-write / dual-verify windows where appropriate.
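An explicit rollback trigger can be written down as data plus a pure function, which makes it reviewable and testable before an incident. The metric names and thresholds below are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    # Thresholds are illustrative assumptions, not recommendations.
    max_invariant_violations: int = 0
    max_error_rate: float = 0.02
    max_p99_latency_ms: float = 800.0


def should_roll_back(trigger: RollbackTrigger, metrics: dict) -> list[str]:
    """Return the list of breached conditions; non-empty means roll back."""
    breached = []
    if metrics["invariant_violations"] > trigger.max_invariant_violations:
        breached.append("invariant_violations")
    if metrics["error_rate"] > trigger.max_error_rate:
        breached.append("error_rate")
    if metrics["p99_latency_ms"] > trigger.max_p99_latency_ms:
        breached.append("p99_latency_ms")
    return breached
```

Returning the breached conditions, rather than a bare boolean, also preserves evidence of why a rollback fired.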
Evidence
- Time, Clocks, and the Ordering of Events (Lamport, 1978) (1) — The mental model for causality and ordering in distributed systems.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
- RFC 9110: HTTP Semantics (2) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.
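Tying retries to method semantics could be as simple as the predicate below. The set of idempotent methods follows RFC 9110; the `has_idempotency_key` parameter models an application-level dedupe key and is an assumption of this sketch, not something the RFC defines.

```python
# Methods RFC 9110 defines as idempotent (the safe methods plus PUT and DELETE).
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


def may_retry(method: str, has_idempotency_key: bool = False) -> bool:
    """Decide whether an automatic retry is safe for this request.

    `has_idempotency_key` models an application-level dedupe key; that mechanism
    is an assumption of this sketch, not something RFC 9110 defines.
    """
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
```

Under this policy a plain POST is never retried automatically after a timeout, while a POST carrying a dedupe key may be.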
Open questions
- Which correctness properties can be enforced at compile time (types/capabilities)?
- Where does your API currently allow ambiguous outcomes, and how will clients cope?
- Which operations need monotonic versioning vs idempotency keys vs both?
- What is the minimal durable record needed to recover safely?
Checklist
- Failure modes enumerated with mitigations.
- Rollback plan rehearsed and automated.
- Telemetry captures correctness signals.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Assumptions listed and reviewed.
- Safety properties stated as invariants.
Further reading
- Time, Clocks, and the Ordering of Events (Lamport, 1978) — The mental model for causality and ordering in distributed systems.
- Jepsen — Failure testing focused on correctness under partitions and reordering.
- RFC 9110: HTTP Semantics — Defines method semantics including idempotency and safety—useful for API contracts.
- Learn TLA+ — A pragmatic workflow for invariants and model checking.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.