Idempotency Everywhere: Designing Safe Retries in Distributed APIs

Monthly research note. Theme: Correctness & Foundations.

TL;DR

Idempotency Everywhere: Designing Safe Retries in Distributed APIs as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

Separate durable state from derived state; derived must be recomputable or reconcilable.
Crash points are part of the design; specify recovery after each state mutation.
Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
Define safety properties before performance goals.
Prefer protocols and APIs that make invalid states hard to express.

Why this matters

“Works in tests” often means “fails under reordering and retries.”
Undefined behavior is an attack surface when inputs are adversarial.
If recovery is not specified, recovery becomes improvisation.
A system without explicit contracts becomes a collection of folklore and dashboards.

Key questions

What does a client learn after a timeout: success, failure, or ambiguity?
Where does concurrency create “double spend” style failures in your domain?
How do you make “unsafe defaults” impossible to ship?
What must be durable before you acknowledge?
Which invariants must hold across crashes, restarts, and partial deployments?
What exactly is the state, and what is derived or cached?

Assumptions

Clients retry with backoff but not with perfect discipline (bursts happen).
Time is untrusted: clock skew, NTP steps, monotonic vs wall-clock confusion.
Crashes happen mid-write (torn state) unless you prove otherwise.
Input is hostile: malformed, oversized, boundary values, protocol confusion.

Non-goals

Baking invariants into tribal knowledge instead of code.
Treating retries as a transport detail rather than a semantic constraint.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

A common pattern is splitting state into durable vs derived:

S = S_\text{durable} \times S_\text{derived}\qquad\text{and}\qquad S_\text{derived} = f(S_\text{durable}).

If you can’t define what a timeout means, you can’t implement retries safely. Make ambiguity explicit in the API.

Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).

Security properties

Replay resistance: duplicated inputs do not change outcomes.
Evidence: critical actions emit verifiable audit events.
Downgrade resistance: negotiation can’t silently weaken security posture.
Least authority: privileges are scoped by purpose and time.

Failure modes

Config drift that weakens security posture over time.
Timeout ambiguity causing double-apply or partial state transitions.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Observability gaps during incidents (missing evidence).

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

stateDiagram-v2
  [*] --> Init
  Init --> Ready: bootstrap()
  Ready --> Processing: event(e)
  Processing --> Ready: commit()
  Processing --> Error: violate(Inv)
  Error --> Ready: recover()

Implementation notes

Implementation is the act of making invalid state unrepresentable (or at least unignorable).

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Correctness checklist:
1) Define state (durable vs derived).
2) Enumerate transitions.
3) Write invariants (safety) and progress conditions (liveness).
4) Pick crash points and specify recovery.
5) Make retries part of semantics (idempotency keys, monotonic versions).

Verification strategy

Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
Metamorphic tests: same operation applied twice must not change the result.
Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
Crash/restart tests: persist mid-transition and validate recovery correctness.

Operational notes

Design “degraded modes” explicitly (fail closed vs fail open per operation).
Validate time assumptions: alert on clock steps, skew, and monotonicity issues.
Make rollbacks safe: schema and protocol compatibility is a security boundary.
Log as evidence: append-only where possible; isolate logs from compromised workloads.
Track invariant violations as pages, not dashboards.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Invariant violation rate (should be ~0).
Error budget burn + tail latency under load.
Retry/timeout rates by endpoint and client cohort.
Rollback events and the conditions that triggered them.
Authz failures and policy denials (unexpected spikes).

Rollback plan

Prefer backward-compatible changes; avoid “flag day” upgrades.
Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Keep dual-write / dual-verify windows where appropriate.

Evidence

Time, Clocks, and the Ordering of Events (Lamport, 1978) (1) — The mental model for causality and ordering in distributed systems.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
RFC 9110: HTTP Semantics (2) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.

Open questions

Which correctness properties can be enforced at compile time (types/capabilities)?
Where does your API currently allow ambiguous outcomes, and how will clients cope?
Which operations need monotonic versioning vs idempotency keys vs both?
What is the minimal durable record needed to recover safely?

Checklist

Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Telemetry captures correctness signals.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Assumptions listed and reviewed.
Safety properties stated as invariants.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading