Monthly research note. Theme: Correctness & Foundations.
TL;DR
A focused memo on backpressure as a correctness property (stability under load): define the model, state the properties, then design the system so those properties remain true under failure and adversarial input.
Correctness is cheaper to enforce at interfaces than to repair in production data.
Key takeaways
- Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
- Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
- Crash points are part of the design; specify recovery after each state mutation.
- Design rollbacks as part of the happy path.
- Define safety properties before performance goals.
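The takeaway about monotonic epochs can be illustrated with a small fencing check. This is a minimal sketch, not a prescribed design: `FencedStore`, `WriteError`, and the rule "reject any write whose epoch is lower than the highest one seen" are assumptions made for illustration.

```rust
/// Hypothetical store that accepts writes only from the newest epoch it has seen.
#[derive(Debug, Default)]
pub struct FencedStore {
    highest_epoch: u64,
    value: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WriteError {
    /// The writer holds an epoch older than one the store has already accepted.
    StaleEpoch { seen: u64, got: u64 },
}

impl FencedStore {
    /// Apply a write tagged with the writer's epoch.
    /// Epochs are compared as integers, so clock skew cannot reorder writers.
    pub fn write(&mut self, epoch: u64, value: &str) -> Result<(), WriteError> {
        if epoch < self.highest_epoch {
            return Err(WriteError::StaleEpoch { seen: self.highest_epoch, got: epoch });
        }
        self.highest_epoch = epoch;
        self.value = Some(value.to_string());
        Ok(())
    }

    /// Current value, if any write has been accepted.
    pub fn value(&self) -> Option<&str> {
        self.value.as_deref()
    }
}
```

A writer that was paused and resumes with an old epoch gets `StaleEpoch` back instead of silently overwriting newer data, which is the property a wall-clock timestamp cannot guarantee under skew.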
Why this matters
- Your on-call runbook is part of the specification—make it match the code.
- Undefined behavior is an attack surface when inputs are adversarial.
- A system without explicit contracts becomes a collection of folklore and dashboards.
- “Works in tests” often means “fails under reordering and retries.”
Key questions
- Which transitions are allowed, and which are impossible by construction?
- What exactly is the state, and what is derived or cached?
- What must be durable before you acknowledge?
- Where does concurrency create “double spend” style failures in your domain?
- What does a client learn after a timeout: success, failure, or ambiguity?
- Where do you need atomicity (and where is eventual consistency acceptable)?
Assumptions
- Partial failure is normal: one replica slow, one unavailable, one returning stale data.
- Clients retry with backoff but not with perfect discipline (bursts happen).
- Input is hostile: malformed, oversized, boundary values, protocol confusion.
- Concurrency is adversarial: some races appear only under production schedules.
Non-goals
- Letting recovery be “restart the service and hope.”
- Treating retries as a transport detail rather than a semantic constraint.
Any unbounded work per request becomes a DoS primitive once inputs are adversarial.
Model & invariants
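We want a transition function step and an invariant Inv such that Inv holds in the initial state and every successful transition preserves it: if Inv(s) holds and step(s, e) returns Ok(s'), then Inv(s') holds.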
Crash points matter: define what happens if the process stops after each line that mutates state or acknowledges work.
Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.
Invariants must be checkable from evidence you actually have (state + logs + counters).
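One way to make derived state checkable from evidence is to recompute it from the log and compare. The sketch below assumes a hypothetical append-only log of signed deltas (`Delta`) and a cached running total; neither comes from a specific system.

```rust
/// Hypothetical append-only log entry: a signed change to a counter.
#[derive(Clone, Copy, Debug)]
pub struct Delta(pub i64);

/// Recompute the derived total from the log; `None` signals arithmetic overflow.
pub fn recompute_total(log: &[Delta]) -> Option<i64> {
    log.iter().try_fold(0i64, |acc, d| acc.checked_add(d.0))
}

/// The cached total is trusted only if it matches the value recomputed from the log.
pub fn cache_is_consistent(cached_total: i64, log: &[Delta]) -> bool {
    recompute_total(log) == Some(cached_total)
}
```

A reconciliation job could run `cache_is_consistent` periodically and treat a mismatch as an invariant violation rather than quietly overwriting the cache.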
Security properties
- Least authority: privileges are scoped by purpose and time.
- Authenticity: actions are bound to identity and purpose.
- Integrity: invalid transitions are rejected (and detectable).
- Replay resistance: duplicated inputs do not change outcomes.
Failure modes
- Recovery paths that only work when nothing is broken.
- Timeout ambiguity causing double-apply or partial state transitions.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Observability gaps during incidents (missing evidence).
Caches tend to become sources of truth unless you can recompute and validate them.
Design sketch
```mermaid
sequenceDiagram
    participant C as Client
    participant API as API
    participant DB as Durable Store
    C->>API: request(op, idempotency_key)
    API->>DB: check_or_reserve(key)
    DB-->>API: miss | hit(result)
    API->>DB: commit(result)
    API-->>C: ack(result)
```
Implementation notes
The goal isn’t cleverness—it’s eliminating ambiguity at boundaries and making recovery boring.
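The check-or-reserve flow from the design sketch could look roughly like this. It is a sketch under assumptions: an in-memory `HashMap` stands in for the durable store, results are plain strings, and the `Pending` answer for a key that is reserved but not yet committed is an addition that the diagram does not show.

```rust
use std::collections::HashMap;

/// State of an idempotency key in the (hypothetical) durable store.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry {
    /// Reserved by an attempt that has not committed yet.
    InFlight,
    /// Committed; retries get this stored result back.
    Done(String),
}

/// What `check_or_reserve` tells the caller.
#[derive(Debug, PartialEq, Eq)]
pub enum Lookup {
    /// Key was unseen and is now reserved: the caller should do the work.
    Miss,
    /// Key already committed: return the stored result and do no work.
    Hit(String),
    /// Another attempt holds the reservation: the outcome is not known yet.
    Pending,
}

#[derive(Default)]
pub struct Store {
    entries: HashMap<String, Entry>,
}

impl Store {
    /// Look the key up and reserve it if it has never been seen.
    pub fn check_or_reserve(&mut self, key: &str) -> Lookup {
        if let Some(entry) = self.entries.get(key) {
            return match entry {
                Entry::Done(result) => Lookup::Hit(result.clone()),
                Entry::InFlight => Lookup::Pending,
            };
        }
        self.entries.insert(key.to_string(), Entry::InFlight);
        Lookup::Miss
    }

    /// Record the final result so that later retries observe a `Hit`.
    pub fn commit(&mut self, key: &str, result: &str) {
        self.entries.insert(key.to_string(), Entry::Done(result.to_string()));
    }
}
```

In a real store the reserve step would need to be atomic and durable, and an `InFlight` entry left behind by a crash would need an explicit recovery rule; both are out of scope for this sketch.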
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
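Bounding work per request can be sketched as an admission step that validates size first and then refuses to grow a queue past a fixed capacity. The names `Admission` and `Reject`, and both limits, are hypothetical and chosen only for illustration.

```rust
use std::collections::VecDeque;

/// Why a request was refused; callers can count rejections by reason.
#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    PayloadTooLarge { len: usize, max: usize },
    QueueFull { capacity: usize },
}

/// Hypothetical bounded admission queue.
pub struct Admission {
    queue: VecDeque<Vec<u8>>,
    capacity: usize,
    max_payload: usize,
}

impl Admission {
    pub fn new(capacity: usize, max_payload: usize) -> Self {
        Self { queue: VecDeque::with_capacity(capacity), capacity, max_payload }
    }

    /// Check the cheap limits before copying the payload.
    /// Never blocks and never grows the queue past `capacity`.
    pub fn try_admit(&mut self, payload: &[u8]) -> Result<(), Reject> {
        if payload.len() > self.max_payload {
            return Err(Reject::PayloadTooLarge { len: payload.len(), max: self.max_payload });
        }
        if self.queue.len() >= self.capacity {
            return Err(Reject::QueueFull { capacity: self.capacity });
        }
        self.queue.push_back(payload.to_vec());
        Ok(())
    }

    /// Take the oldest admitted payload, freeing one slot.
    pub fn pop(&mut self) -> Option<Vec<u8>> {
        self.queue.pop_front()
    }

    /// Number of payloads currently waiting.
    pub fn depth(&self) -> usize {
        self.queue.len()
    }
}
```

Rejecting with an explicit reason keeps the queue depth bounded by construction, so "depth never exceeds capacity" becomes an invariant that can be tested rather than a tuning goal.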
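```rust
use core::fmt;

/// Inputs that can drive the state machine.
#[derive(Clone, Debug)]
pub enum Event {
    Input(Vec<u8>),
    Tick,
    Fault(&'static str),
}

/// A pure transition function plus the invariant it must preserve.
pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;

    /// Compute the next state or reject the event; the input state is never mutated.
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;

    /// Must hold for the initial state and after every successful `step`.
    fn invariant(state: &Self::State) -> bool;
}
// Backpressure as a Correctness Property: Stability Under Load: invariants are part of the API contract.
```
Verification strategy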
- Crash/restart tests: persist mid-transition and validate recovery correctness.
- Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
- Property-based tests: generate adversarial sequences and assert invariants after every step.
- Differential tests against a reference model (even a slow one).
- Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
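The property-based bullet above can be sketched without external crates. The block repeats the `Event` and `StateMachine` definitions from the implementation notes so that it stands alone; `BoundedQueue`, `QueueState`, `CAP`, and the xorshift generator are hypothetical, and a real test suite would more likely use a property-testing library with shrinking.

```rust
use core::fmt;

#[derive(Clone, Debug)]
pub enum Event {
    Input(Vec<u8>),
    Tick,
    Fault(&'static str),
}

pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    fn invariant(state: &Self::State) -> bool;
}

/// Hypothetical model: a queue that holds at most `CAP` pending inputs.
pub struct BoundedQueue;
pub const CAP: usize = 4;

#[derive(Clone, Debug, Default)]
pub struct QueueState {
    pub pending: usize,
    pub rejected: u64,
}

#[derive(Debug)]
pub struct Halted(pub &'static str);

impl StateMachine for BoundedQueue {
    type State = QueueState;
    type Error = Halted;

    fn step(state: &QueueState, event: Event) -> Result<QueueState, Halted> {
        let mut next = state.clone();
        match event {
            // Admit while there is room; otherwise shed load and count it.
            Event::Input(_) if next.pending < CAP => next.pending += 1,
            Event::Input(_) => next.rejected += 1,
            // A tick drains one pending item, if any.
            Event::Tick => next.pending = next.pending.saturating_sub(1),
            // A fault rejects the transition; the caller keeps the old state.
            Event::Fault(why) => return Err(Halted(why)),
        }
        Ok(next)
    }

    fn invariant(state: &QueueState) -> bool {
        state.pending <= CAP
    }
}

/// Tiny xorshift generator (the seed must be nonzero).
pub fn xorshift(seed: &mut u64) -> u64 {
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    *seed
}

/// Drive `steps` pseudo-random events and check the invariant after each one.
/// Returns the final state, or the index of the first violating step.
pub fn check_random_run(mut seed: u64, steps: usize) -> Result<QueueState, usize> {
    let mut state = QueueState::default();
    for i in 0..steps {
        let event = match xorshift(&mut seed) % 4 {
            0 => Event::Tick,
            1 => Event::Fault("injected"),
            _ => Event::Input(vec![0u8; 1]),
        };
        // A rejected transition leaves the state untouched.
        if let Ok(next) = BoundedQueue::step(&state, event) {
            state = next;
        }
        if !BoundedQueue::invariant(&state) {
            return Err(i);
        }
    }
    Ok(state)
}
```

Because the generator is seeded, a failing run can be replayed exactly by reusing the same seed and step count.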
Operational notes
- Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
- Design “degraded modes” explicitly (fail closed vs fail open per operation).
- Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
- Track invariant violations as pages, not dashboards.
- Instrument ambiguity: measure “unknown outcome” responses separately from failures.
Keep audit and config history queryable during incidents—evidence beats intuition.
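Instrumenting ambiguity might start with making "unknown" a first-class outcome rather than folding it into failures. The `Outcome` enum and `OutcomeCounters` below are hypothetical names for a minimal in-process sketch; a real service would export these counts through its metrics system.

```rust
/// What the caller knows after an attempt.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Failure,
    /// For example, a timeout after the request was sent: it may or may not have applied.
    Unknown,
}

/// Separate counters so ambiguous outcomes are never hidden inside failures.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct OutcomeCounters {
    pub success: u64,
    pub failure: u64,
    pub unknown: u64,
}

impl OutcomeCounters {
    pub fn record(&mut self, outcome: Outcome) {
        match outcome {
            Outcome::Success => self.success += 1,
            Outcome::Failure => self.failure += 1,
            Outcome::Unknown => self.unknown += 1,
        }
    }

    /// Fraction of attempts with an ambiguous outcome; `None` when nothing was recorded.
    pub fn unknown_ratio(&self) -> Option<f64> {
        let total = self.success + self.failure + self.unknown;
        if total == 0 {
            None
        } else {
            Some(self.unknown as f64 / total as f64)
        }
    }
}
```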
What to monitor
- Error budget burn + tail latency under load.
- Admission-control / rate-limit rejections (by reason).
- Retry/timeout rates by endpoint and client cohort.
- Authz failures and policy denials (unexpected spikes).
- Invariant violation rate (should be ~0).
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Use canaries and staged rollout; stop early when signals degrade.
- Define an explicit rollback trigger (metrics + thresholds).
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Keep dual-write / dual-verify windows where appropriate.
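An explicit rollback trigger can be as small as a pure function over canary metrics. The field names, the thresholds, and the choice to treat any invariant violation as an immediate rollback are assumptions for illustration, not recommended values.

```rust
/// Hypothetical canary metrics sampled during a staged rollout.
#[derive(Clone, Copy, Debug)]
pub struct CanarySample {
    /// Fraction of failed requests, in 0.0..=1.0.
    pub error_rate: f64,
    pub p99_latency_ms: u64,
    pub invariant_violations: u64,
}

/// Limits agreed on before the rollout starts.
#[derive(Clone, Copy, Debug)]
pub struct RollbackThresholds {
    pub max_error_rate: f64,
    pub max_p99_latency_ms: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Continue,
    /// Roll back, with the reason that fired.
    Rollback(&'static str),
}

/// Decide whether the rollout may proceed; correctness signals are checked first.
pub fn evaluate(sample: CanarySample, limits: RollbackThresholds) -> Decision {
    if sample.invariant_violations > 0 {
        return Decision::Rollback("invariant violation");
    }
    if sample.error_rate > limits.max_error_rate {
        return Decision::Rollback("error rate above threshold");
    }
    if sample.p99_latency_ms > limits.max_p99_latency_ms {
        return Decision::Rollback("p99 latency above threshold");
    }
    Decision::Continue
}
```

Keeping the decision pure makes it easy to replay recorded samples against proposed thresholds before a rollout.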
Evidence
- Jepsen (1) — Failure testing focused on correctness under partitions and reordering.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
- RFC 9110: HTTP Semantics (2) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.
Open questions
- What would you do if you had to replay a month of traffic into a rebuilt system?
- Which correctness properties can be enforced at compile time (types/capabilities)?
- What is the minimal durable record needed to recover safely?
- Which invariant, if violated, would silently corrupt state for weeks?
Checklist
- Telemetry captures correctness signals.
- Failure modes enumerated with mitigations.
- Rollback plan rehearsed and automated.
- Safety properties stated as invariants.
- Assumptions listed and reviewed.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Further reading
- Jepsen — Failure testing focused on correctness under partitions and reordering.
- Time, Clocks, and the Ordering of Events (Lamport, 1978) — The mental model for causality and ordering in distributed systems.
- Learn TLA+ — A pragmatic workflow for invariants and model checking.
- RFC 9110: HTTP Semantics — Defines method semantics including idempotency and safety—useful for API contracts.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.