Backpressure as a Correctness Property: Stability Under Load

Monthly research note. Theme: Correctness & Foundations.

TL;DR

This memo treats backpressure as a correctness property: define the model, state the properties, then design the system so those properties still hold under failure and adversarial input.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
Crash points are part of the design; specify recovery after each state mutation.
Design rollbacks as part of the happy path.
Define safety properties before performance goals.
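
The monotonic-version takeaway can be sketched as a fencing check. This is a minimal illustration, not a prescribed design: `FencedRegister` and `WriteError` are hypothetical names, and the sketch assumes writers obtain strictly increasing epochs from somewhere else (for example, a lease service).

```rust
/// Hypothetical register guarded by a monotonic epoch (a fencing token).
/// Epoch 0 is reserved here to mean "never written" (an assumption of this sketch).
#[derive(Debug, Default)]
pub struct FencedRegister {
    epoch: u64,
    value: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WriteError {
    /// The offered epoch is not newer than the last accepted one.
    StaleEpoch { seen: u64, offered: u64 },
}

impl FencedRegister {
    /// Accepts the write only if `epoch` is strictly greater than the stored epoch.
    /// No wall-clock time is consulted, so clock skew cannot reorder writers.
    pub fn write(&mut self, epoch: u64, value: Vec<u8>) -> Result<(), WriteError> {
        if epoch <= self.epoch {
            return Err(WriteError::StaleEpoch { seen: self.epoch, offered: epoch });
        }
        self.epoch = epoch;
        self.value = value;
        Ok(())
    }

    /// Returns the last accepted epoch and value.
    pub fn read(&self) -> (u64, &[u8]) {
        (self.epoch, &self.value)
    }
}
```

A delayed writer holding epoch 1 is rejected once epoch 2 has been accepted, and a replay of epoch 2 is rejected as well, so a duplicated write cannot change the stored value.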

Why this matters

Your on-call runbook is part of the specification—make it match the code.
Undefined behavior is an attack surface when inputs are adversarial.
A system without explicit contracts becomes a collection of folklore and dashboards.
“Works in tests” often means “fails under reordering and retries.”

Key questions

Which transitions are allowed, and which are impossible by construction?
What exactly is the state, and what is derived or cached?
What must be durable before you acknowledge?
Where does concurrency create “double spend” style failures in your domain?
What does a client learn after a timeout: success, failure, or ambiguity?
Where do you need atomicity (and where is eventual consistency acceptable)?

Assumptions

Partial failure is normal: one replica slow, one unavailable, one returning stale data.
Clients retry with backoff but not with perfect discipline (bursts happen).
Input is hostile: malformed, oversized, boundary values, protocol confusion.
Concurrency is adversarial: some races appear only under production schedules.

Non-goals

Letting recovery be “restart the service and hope.”
Treating retries as a transport detail rather than a semantic constraint.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.

Model & invariants

We want a transition function $\delta$ and invariant $\mathrm{Inv}$ such that:

$$s_{t+1} = \delta(s_t, e_t)\qquad\wedge\qquad \bigl(\mathrm{Inv}(s_t)\Rightarrow \mathrm{Inv}(s_{t+1})\bigr).$$
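
One consequence is worth spelling out. If the initial state satisfies the invariant and every transition preserves it, induction on $t$ gives the invariant for every reachable state. Here $s_0$ denotes the initial state, a symbol this memo does not otherwise name.

$$\mathrm{Inv}(s_0)\;\wedge\;\bigl(\forall t.\ \mathrm{Inv}(s_t)\Rightarrow\mathrm{Inv}(s_{t+1})\bigr)\;\Longrightarrow\;\forall t.\ \mathrm{Inv}(s_t).$$

The base case matters as much as the step: a system that boots or recovers into a state violating $\mathrm{Inv}$ gets no guarantee from preservation alone.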

Crash points matter: define what happens if the process stops after each line that mutates state or acknowledges work.

Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).

Security properties

Least authority: privileges are scoped by purpose and time.
Authenticity: actions are bound to identity and purpose.
Integrity: invalid transitions are rejected (and detectable).
Replay resistance: duplicated inputs do not change outcomes.

Failure modes

Recovery paths that only work when nothing is broken.
Timeout ambiguity causing double-apply or partial state transitions.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Observability gaps during incidents (missing evidence).
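
The timeout-ambiguity failure mode is easier to handle when "unknown" is a distinct result rather than a kind of error. The following is a minimal sketch with hypothetical names (`Outcome`, `NextStep`, `next_step`); it assumes the server deduplicates on an idempotency key when one is present.

```rust
/// What a caller knows after one attempt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The server confirmed the operation was applied.
    Applied,
    /// The server confirmed the operation was not applied.
    Rejected,
    /// Timed out: the operation may or may not have been applied.
    Unknown,
}

/// What the caller should do next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    Done,
    GiveUp,
    /// Safe only because the server is assumed to dedupe on the key.
    RetrySameKey,
    /// No key: ask what happened before sending anything again.
    QueryStatus,
}

/// Maps an outcome to the next action without ever treating `Unknown` as failure.
pub fn next_step(outcome: Outcome, has_idempotency_key: bool) -> NextStep {
    match (outcome, has_idempotency_key) {
        (Outcome::Applied, _) => NextStep::Done,
        (Outcome::Rejected, _) => NextStep::GiveUp,
        (Outcome::Unknown, true) => NextStep::RetrySameKey,
        (Outcome::Unknown, false) => NextStep::QueryStatus,
    }
}
```

Because the match is exhaustive, adding a new outcome forces every caller to decide how to handle it at compile time.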

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.
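
A minimal sketch of the "recompute and validate" discipline, assuming a hypothetical `Ledger` whose source of truth is an append-only list of deltas and whose cached total is derived state. The names and the reconciliation policy (overwrite the cache, report the drift) are illustrative assumptions.

```rust
/// Hypothetical ledger: `entries` is the source of truth, `cached_total` is derived.
#[derive(Debug, Default)]
pub struct Ledger {
    pub entries: Vec<i64>,
    pub cached_total: i64,
}

impl Ledger {
    /// Appends a delta and updates the cache incrementally.
    pub fn append(&mut self, delta: i64) {
        self.entries.push(delta);
        self.cached_total += delta;
    }

    /// Recomputes the total from the source of truth alone.
    pub fn recompute(&self) -> i64 {
        self.entries.iter().sum()
    }

    /// Validates the cache against a full recomputation.
    /// On mismatch, repairs the cache and returns the drift (cached minus true value)
    /// so the caller can record it as an invariant violation.
    pub fn reconcile(&mut self) -> Result<(), i64> {
        let truth = self.recompute();
        if truth == self.cached_total {
            return Ok(());
        }
        let drift = self.cached_total - truth;
        self.cached_total = truth;
        Err(drift)
    }
}
```

The point of returning the drift instead of silently repairing it is that a nonzero value is evidence of a bug somewhere upstream.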

Design sketch

sequenceDiagram
  participant C as Client
  participant API as API
  participant DB as Durable Store
  C->>API: request(op, idempotency_key)
  API->>DB: check_or_reserve(key)
  DB-->>API: miss | hit(result)
  API->>DB: commit(result)
  API-->>C: ack(result)
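
The `check_or_reserve` and `commit` steps in the diagram could look roughly like this. It is a sketch under stated assumptions: an in-memory `HashMap` stands in for the durable store, results are plain strings, and the `InFlight` case (a second attempt arriving while the first still holds the reservation) is an addition the diagram does not show.

```rust
use std::collections::HashMap;

/// State of one idempotency key in the (here, in-memory) store.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Slot {
    /// Work in progress; no result recorded yet.
    Reserved,
    /// Stored result, replayed on every retry of the same key.
    Committed(String),
}

/// Result of `check_or_reserve`.
#[derive(Debug, PartialEq, Eq)]
pub enum Lookup {
    /// Key was unseen; the caller now owns the reservation and must commit.
    Miss,
    /// Another attempt holds the reservation (assumption, not in the diagram).
    InFlight,
    /// Key already committed; return the stored result without redoing work.
    Hit(String),
}

#[derive(Default)]
pub struct IdempotencyStore {
    slots: HashMap<String, Slot>,
}

impl IdempotencyStore {
    /// Looks up the key and reserves it if it has never been seen.
    pub fn check_or_reserve(&mut self, key: &str) -> Lookup {
        match self.slots.get(key) {
            Some(Slot::Committed(result)) => Lookup::Hit(result.clone()),
            Some(Slot::Reserved) => Lookup::InFlight,
            None => {
                self.slots.insert(key.to_string(), Slot::Reserved);
                Lookup::Miss
            }
        }
    }

    /// Records the result for a reserved key.
    /// Returns false if the key was never reserved or is already committed.
    pub fn commit(&mut self, key: &str, result: String) -> bool {
        if let Some(slot) = self.slots.get_mut(key) {
            if *slot == Slot::Reserved {
                *slot = Slot::Committed(result);
                return true;
            }
        }
        false
    }
}
```

A real store would also need to persist the reservation before doing the work and to expire abandoned reservations; both are outside this sketch.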

Implementation notes

The goal isn’t cleverness—it’s eliminating ambiguity at boundaries and making recovery boring.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
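
As an illustration of capping cost before allocating, the sketch below parses a hypothetical frame format (a 4-byte big-endian item count followed by that many 8-byte big-endian items). The format, the limits, and the names `parse_items` and `AdmitError` are assumptions made for the example.

```rust
pub const MAX_BODY_BYTES: usize = 64 * 1024; // assumed cap on request size
pub const MAX_ITEMS: usize = 1_000; // assumed cap on declared item count

#[derive(Debug, PartialEq, Eq)]
pub enum AdmitError {
    BodyTooLarge,
    TooManyItems,
    Malformed,
}

/// Parses the hypothetical frame, rejecting oversized or inconsistent input
/// before any allocation proportional to attacker-controlled fields.
pub fn parse_items(body: &[u8]) -> Result<Vec<u64>, AdmitError> {
    if body.len() > MAX_BODY_BYTES {
        return Err(AdmitError::BodyTooLarge);
    }
    if body.len() < 4 {
        return Err(AdmitError::Malformed);
    }
    let count = u32::from_be_bytes([body[0], body[1], body[2], body[3]]) as usize;
    // Check the declared count against the cap BEFORE using it to size anything.
    if count > MAX_ITEMS {
        return Err(AdmitError::TooManyItems);
    }
    let rest = &body[4..];
    // The declared count must match the bytes actually present.
    if rest.len() != count * 8 {
        return Err(AdmitError::Malformed);
    }
    let mut items = Vec::with_capacity(count);
    for chunk in rest.chunks_exact(8) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        items.push(u64::from_be_bytes(buf));
    }
    Ok(items)
}
```

A four-byte request declaring 0xFFFFFFFF items is rejected with `TooManyItems` and never reaches `Vec::with_capacity`.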

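use core::fmt;

/// Inputs that can drive a transition.
#[derive(Clone, Debug)]
pub enum Event {
    /// Raw input bytes; validate before acting on them.
    Input(Vec<u8>),
    /// A time step with no input.
    Tick,
    /// A fault, named for diagnostics.
    Fault(&'static str),
}

/// A transition function plus the invariant it must preserve.
pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;

    /// Returns the next state, or an error; `state` itself is never mutated.
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    /// Must hold for the initial state and for every state `step` returns.
    fn invariant(state: &Self::State) -> bool;
}

// Invariants are part of the API contract.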
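
To connect the trait to backpressure, here is one possible implementation: a bounded queue that refuses input when full instead of growing. `BoundedQueue`, `QueueState`, `QueueError`, and `CAPACITY` are hypothetical, and the choice to treat `Fault` as a permanent halt is an assumption of the sketch. The `Event` and `StateMachine` definitions are repeated only so the block compiles on its own.

```rust
use core::fmt;

// Repeated from the listing above so this block is self-contained.
#[derive(Clone, Debug)]
pub enum Event {
    Input(Vec<u8>),
    Tick,
    Fault(&'static str),
}

pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    fn invariant(state: &Self::State) -> bool;
}

pub const CAPACITY: usize = 4; // assumed bound on queued work

/// Hypothetical bounded queue: `Input` enqueues, `Tick` drains one item.
pub struct BoundedQueue;

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct QueueState {
    pub pending: Vec<Vec<u8>>,
    pub halted: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum QueueError {
    /// Backpressure signal: the caller must slow down or shed load.
    Overloaded,
    /// A fault was observed earlier; no further transitions are accepted.
    Halted,
}

impl StateMachine for BoundedQueue {
    type State = QueueState;
    type Error = QueueError;

    fn step(state: &QueueState, event: Event) -> Result<QueueState, QueueError> {
        if state.halted {
            return Err(QueueError::Halted);
        }
        let mut next = state.clone();
        match event {
            Event::Input(bytes) => {
                // Reject rather than grow: this is what keeps the invariant true.
                if next.pending.len() >= CAPACITY {
                    return Err(QueueError::Overloaded);
                }
                next.pending.push(bytes);
            }
            Event::Tick => {
                if !next.pending.is_empty() {
                    next.pending.remove(0);
                }
            }
            Event::Fault(_) => next.halted = true,
        }
        Ok(next)
    }

    /// Queue depth never exceeds the configured capacity.
    fn invariant(state: &QueueState) -> bool {
        state.pending.len() <= CAPACITY
    }
}
```

Here the `Overloaded` error is the backpressure signal, and the invariant is exactly the statement that the signal is never skipped.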

Verification strategy

Crash/restart tests: persist mid-transition and validate recovery correctness.
Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
Property-based tests: generate adversarial sequences and assert invariants after every step.
Differential tests against a reference model (even a slow one).
Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
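
The property-based item can be sketched without any external crate: generate operation sequences from a seeded generator and assert the invariant after every step. Everything here is illustrative. `Limiter` is a hypothetical system under test, and the hand-rolled xorshift generator stands in for a real property-testing library, which would also shrink failing cases.

```rust
/// Tiny deterministic PRNG (xorshift64) so a failing seed can be replayed exactly.
pub struct XorShift64(pub u64);

impl XorShift64 {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical system under test: a limiter on in-flight requests.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Limiter {
    pub in_flight: u32,
    pub limit: u32,
}

#[derive(Debug, Clone, Copy)]
pub enum Op {
    Acquire,
    Release,
}

impl Limiter {
    pub fn apply(&mut self, op: Op) {
        match op {
            Op::Acquire => {
                // Refuse silently when at the limit; never exceed it.
                if self.in_flight < self.limit {
                    self.in_flight += 1;
                }
            }
            // Saturate at zero so a spurious release cannot underflow.
            Op::Release => self.in_flight = self.in_flight.saturating_sub(1),
        }
    }

    pub fn invariant(&self) -> bool {
        self.in_flight <= self.limit
    }
}

/// Runs `steps` pseudo-random operations from `seed`.
/// Returns the index of the first step after which the invariant fails, if any.
pub fn check_seed(seed: u64, steps: usize) -> Option<usize> {
    let mut rng = XorShift64(seed | 1); // xorshift state must be nonzero
    let mut sut = Limiter { in_flight: 0, limit: 3 };
    for i in 0..steps {
        let op = if rng.next_u64() % 2 == 0 { Op::Acquire } else { Op::Release };
        sut.apply(op);
        if !sut.invariant() {
            return Some(i);
        }
    }
    None
}
```

Checking after every step, rather than only at the end, pins a violation to the exact operation that caused it.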

Operational notes

Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
Design “degraded modes” explicitly (fail closed vs fail open per operation).
Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
Track invariant violations as pages, not dashboards.
Instrument ambiguity: measure “unknown outcome” responses separately from failures.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

Error budget burn + tail latency under load.
Admission-control / rate-limit rejections (by reason).
Retry/timeout rates by endpoint and client cohort.
Authz failures and policy denials (unexpected spikes).
Invariant violation rate (should be ~0).
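
Counting rejections by reason is simplest when the admission path itself owns the counters. The sketch below is a hedged illustration: `Admission`, `Reject`, and `MAX_QUEUE_DEPTH` are hypothetical, and plain integer fields stand in for whatever metrics library is actually in use.

```rust
pub const MAX_QUEUE_DEPTH: usize = 2; // assumed bound on admitted-but-unfinished work

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Reject {
    QueueFull,
    RateLimited,
}

/// Hypothetical admission controller with per-reason rejection counters.
#[derive(Debug, Default)]
pub struct Admission {
    pub queue_depth: usize,
    pub tokens: u32,
    pub admitted: u64,
    pub rejected_queue_full: u64,
    pub rejected_rate_limited: u64,
}

impl Admission {
    /// Admits one request or rejects it with a specific, counted reason.
    pub fn try_admit(&mut self) -> Result<(), Reject> {
        if self.queue_depth >= MAX_QUEUE_DEPTH {
            self.rejected_queue_full += 1;
            return Err(Reject::QueueFull);
        }
        if self.tokens == 0 {
            self.rejected_rate_limited += 1;
            return Err(Reject::RateLimited);
        }
        self.tokens -= 1;
        self.queue_depth += 1;
        self.admitted += 1;
        Ok(())
    }

    /// Marks one admitted request as finished.
    pub fn complete(&mut self) {
        self.queue_depth = self.queue_depth.saturating_sub(1);
    }

    /// Adds tokens, for example on a timer tick.
    pub fn refill(&mut self, tokens: u32) {
        self.tokens = self.tokens.saturating_add(tokens);
    }
}
```

Separating `QueueFull` from `RateLimited` matters operationally: the first suggests the service is slow, the second that a client is sending too fast.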

Rollback plan

Prefer backward-compatible changes; avoid “flag day” upgrades.
Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Keep dual-write / dual-verify windows where appropriate.
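
An explicit rollback trigger can be as small as a pure function over canary metrics. This sketch assumes hypothetical names and thresholds (`CanarySample`, `Trigger`, `decide`); real values would come from the service's own error budget.

```rust
/// Metrics observed on the canary during one evaluation window.
#[derive(Debug, Clone, Copy)]
pub struct CanarySample {
    pub requests: u64,
    pub errors: u64,
    pub p99_ms: u64,
    pub invariant_violations: u64,
}

/// Thresholds that define the trigger (values are chosen by the operator).
#[derive(Debug, Clone, Copy)]
pub struct Trigger {
    pub max_error_ppm: u64, // errors per million requests
    pub max_p99_ms: u64,
    pub min_requests: u64, // below this, the sample is too small to judge
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Continue,
    Hold,
    Rollback(&'static str),
}

/// Decides what the rollout should do, with a named reason for any rollback.
pub fn decide(s: CanarySample, t: Trigger) -> Decision {
    // Any invariant violation rolls back immediately, regardless of sample size.
    if s.invariant_violations > 0 {
        return Decision::Rollback("invariant violation");
    }
    if s.requests < t.min_requests {
        return Decision::Hold;
    }
    // Compare errors/requests against ppm/1e6 using integers; u128 avoids overflow.
    if (s.errors as u128) * 1_000_000 > (t.max_error_ppm as u128) * (s.requests as u128) {
        return Decision::Rollback("error rate");
    }
    if s.p99_ms > t.max_p99_ms {
        return Decision::Rollback("tail latency");
    }
    Decision::Continue
}
```

Keeping the decision pure makes it testable against recorded incidents and leaves the reason string as evidence of why a rollout stopped.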

Evidence

Jepsen (1) — Failure testing focused on correctness under partitions and reordering.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
RFC 9110: HTTP Semantics (2) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.

Open questions

What would you do if you had to replay a month of traffic into a rebuilt system?
Which correctness properties can be enforced at compile time (types/capabilities)?
What is the minimal durable record needed to recover safely?
Which invariant, if violated, would silently corrupt state for weeks?

Checklist

Telemetry captures correctness signals.
Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Safety properties stated as invariants.
Assumptions listed and reviewed.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
