Monthly research note. Theme: Correctness & Foundations.
TL;DR
A focused memo on "Observability as Specification: SLOs, Error Budgets, and Contracts." Define the model, state the properties, then design the system so those properties remain true under failure and adversarial input.
Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.
Key takeaways
- Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
- Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
- Ack semantics must be explicit: durable, best-effort, or ambiguous.
- Automate guardrails; humans are for judgment, not for consistent enforcement.
- Make failure modes explicit and observable.
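The takeaways on semantic retries and explicit acks can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive design: `Ack`, `Dedupe`, and `apply_once` are hypothetical names, and the in-memory map stands in for what would have to be a durable store with a retention window.

```rust
use std::collections::HashMap;

/// What a caller may conclude from a response (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ack {
    /// The effect is persisted and survives a crash.
    Durable,
    /// The request was accepted, but durability is not promised.
    BestEffort,
    /// The outcome is unknown; the caller should retry with the same key.
    Ambiguous,
}

/// Remembers the result recorded for each idempotency key so that a retry
/// replays the original result instead of repeating the side effect.
/// Assumption: in-memory only; a real system would persist this map.
#[derive(Default)]
pub struct Dedupe {
    seen: HashMap<String, u64>,
}

impl Dedupe {
    /// Runs `effect` at most once per key; later calls return the stored result.
    pub fn apply_once(&mut self, key: &str, effect: impl FnOnce() -> u64) -> u64 {
        if let Some(result) = self.seen.get(key) {
            return *result;
        }
        let result = effect();
        self.seen.insert(key.to_string(), result);
        result
    }
}
```

A retry that carries the same key observes the first result, which is what lets a client treat `Ack::Ambiguous` as "retry safely" rather than "guess".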
Why this matters
- Your on-call runbook is part of the specification—make it match the code.
- Undefined behavior is an attack surface when inputs are adversarial.
- In distributed code, retries and duplication are the common case—not the edge case.
- A system without explicit contracts becomes a collection of folklore and dashboards.
Key questions
- What does a client learn after a timeout: success, failure, or ambiguity?
- Which invariants must hold across crashes, restarts, and partial deployments?
- Which transitions are allowed, and which are impossible by construction?
- Where does concurrency create “double spend” style failures in your domain?
- How do you make “unsafe defaults” impossible to ship?
- What exactly is the state, and what is derived or cached?
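One way to approach the question about transitions that are "impossible by construction" is to encode each state as its own type. The sketch below assumes a hypothetical order workflow (`Pending`, `Paid`, `Shipped`); the names and the workflow are illustrative only.

```rust
/// An order that exists but has not been paid (hypothetical domain type).
#[derive(Debug)]
pub struct Pending {
    id: u64,
}

/// An order with a recorded payment. Only `Pending::pay` constructs it.
#[derive(Debug)]
pub struct Paid {
    id: u64,
    receipt: u64,
}

/// A shipped order. Only `Paid::ship` constructs it.
#[derive(Debug)]
pub struct Shipped {
    id: u64,
    receipt: u64,
}

impl Pending {
    pub fn new(id: u64) -> Self {
        Pending { id }
    }

    /// Takes `self` by value, so the same pending order cannot be paid twice.
    pub fn pay(self, receipt: u64) -> Paid {
        Paid { id: self.id, receipt }
    }
}

impl Paid {
    /// Takes `self` by value, so the same paid order cannot be shipped twice.
    pub fn ship(self) -> Shipped {
        Shipped { id: self.id, receipt: self.receipt }
    }
}

impl Shipped {
    pub fn id(&self) -> u64 {
        self.id
    }

    pub fn receipt(&self) -> u64 {
        self.receipt
    }
}
```

Because the fields are private, code outside the defining module cannot build a `Shipped` value without going through `pay` and `ship`. This only rules out invalid transitions inside one process; duplicates arriving across processes still need a durable check.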
Assumptions
- Requests can be duplicated, reordered, delayed, and replayed across restarts.
- Observability is incomplete: you will debug from partial evidence.
- Crashes happen mid-write (torn state) unless you prove otherwise.
- Time is untrusted: clock skew, NTP steps, monotonic vs wall-clock confusion.
Non-goals
- Letting recovery be “restart the service and hope.”
- Assuming a single authoritative clock that never moves backwards.
Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.
Model & invariants
A common pattern is to split state into durable state (the source of truth) and derived state (caches, indexes, and anything else that can be rebuilt from it).
Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.
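A small sketch of replay detection with sequence numbers, assuming a single writer that numbers its messages from 1. `Replica` and `Apply` are hypothetical names introduced for illustration.

```rust
/// Applies deltas in strict sequence order (hypothetical type).
#[derive(Debug, Default)]
pub struct Replica {
    last_applied: u64,
    value: i64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Apply {
    Applied,
    /// The sequence number was already applied: a duplicate or a replay.
    Replay,
    /// A sequence number was skipped; the gap must be filled first.
    Gap { expected: u64 },
}

impl Replica {
    /// Applies `delta` only if `seq` is exactly the next sequence number.
    pub fn apply(&mut self, seq: u64, delta: i64) -> Apply {
        let expected = self.last_applied + 1;
        if seq < expected {
            return Apply::Replay;
        }
        if seq > expected {
            return Apply::Gap { expected };
        }
        self.value += delta;
        self.last_applied = seq;
        Apply::Applied
    }

    pub fn value(&self) -> i64 {
        self.value
    }
}
```

Duplicates and reordering become explicit return values rather than silent double application, and nothing here depends on a wall clock.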
Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.
Invariants must be checkable from evidence you actually have (state + logs + counters).
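To make the durable/derived split concrete, here is a minimal sketch in which the derived value can always be recomputed from the durable log and checked against it. `Ledger` is a hypothetical type, and the vector stands in for an append-only durable log.

```rust
/// Durable state is the list of deltas; the balance is derived from it.
pub struct Ledger {
    /// Stand-in for an append-only durable log (assumption).
    entries: Vec<i64>,
    /// Derived state: a cached sum that must always be recomputable.
    cached_balance: i64,
}

impl Ledger {
    pub fn new() -> Self {
        Ledger { entries: Vec::new(), cached_balance: 0 }
    }

    /// Appends to the log and updates the cache incrementally.
    pub fn append(&mut self, delta: i64) {
        self.entries.push(delta);
        self.cached_balance += delta;
    }

    /// Recomputes the derived value from durable state only.
    pub fn recompute(&self) -> i64 {
        self.entries.iter().sum()
    }

    /// Invariant: the cache agrees with the durable log.
    pub fn invariant(&self) -> bool {
        self.cached_balance == self.recompute()
    }

    /// Overwrites the cache with the recomputed value.
    /// Returns true if the cache had drifted.
    pub fn reconcile(&mut self) -> bool {
        let truth = self.recompute();
        let drifted = truth != self.cached_balance;
        self.cached_balance = truth;
        drifted
    }
}
```

The return value of `reconcile` is itself evidence: counting how often it is true gives a direct measure of cache drift.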
Security properties
- Integrity: invalid transitions are rejected (and detectable).
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Evidence: critical actions emit verifiable audit events.
- Authenticity: actions are bound to identity and purpose.
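Downgrade resistance can be illustrated with a negotiation function that enforces a floor. This is a sketch under simplifying assumptions: versions are plain integers, and `negotiate` and `NegotiationError` are hypothetical names, not part of any real protocol library.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum NegotiationError {
    /// The two sides share no version at all.
    NoCommonVersion,
    /// A common version exists but is weaker than policy allows.
    BelowFloor { best_common: u16, min_allowed: u16 },
}

/// Picks the highest version both sides support and refuses anything
/// below `min_allowed` instead of silently falling back.
pub fn negotiate(
    ours: &[u16],
    theirs: &[u16],
    min_allowed: u16,
) -> Result<u16, NegotiationError> {
    let best = ours
        .iter()
        .copied()
        .filter(|v| theirs.contains(v))
        .max()
        .ok_or(NegotiationError::NoCommonVersion)?;
    if best < min_allowed {
        return Err(NegotiationError::BelowFloor { best_common: best, min_allowed });
    }
    Ok(best)
}
```

The failure is a distinct, loggable error, so a peer that only offers weak versions shows up as evidence rather than as a quiet fallback.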
Failure modes
- Recovery paths that only work when nothing is broken.
- Config drift that weakens security posture over time.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Mixed-version behavior that violates assumptions silently.
Sampling hides the rare schedule that breaks your invariants.
Design sketch
flowchart TD
input["Input"] --> parse["Parse/Validate"]
parse --> decide["Decide (pure)"]
decide --> write["Durable write"]
write --> ack["Acknowledge"]
ack --> obs["Emit evidence (logs/metrics)"]
Implementation notes
The goal isn’t cleverness—it’s eliminating ambiguity at boundaries and making recovery boring.
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
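A minimal sketch of bounding work before doing it, assuming a hypothetical wire format of comma-separated unsigned integers. The limits `MAX_BODY_BYTES` and `MAX_ITEMS` are illustrative values, not recommendations.

```rust
/// Illustrative caps (assumptions); tune them per endpoint.
pub const MAX_BODY_BYTES: usize = 64 * 1024;
pub const MAX_ITEMS: usize = 1_000;

#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    TooLarge { len: usize },
    TooManyItems { count: usize },
    Malformed,
}

/// Parses a comma-separated list of u32 values.
/// Size and item-count caps are checked before any output is allocated.
pub fn parse_bounded(body: &[u8]) -> Result<Vec<u32>, Reject> {
    if body.len() > MAX_BODY_BYTES {
        return Err(Reject::TooLarge { len: body.len() });
    }
    let text = std::str::from_utf8(body).map_err(|_| Reject::Malformed)?;
    // Counting fields is a cheap pass that allocates nothing.
    let count = text.split(',').count();
    if count > MAX_ITEMS {
        return Err(Reject::TooManyItems { count });
    }
    let mut out = Vec::with_capacity(count);
    for field in text.split(',') {
        out.push(field.trim().parse::<u32>().map_err(|_| Reject::Malformed)?);
    }
    Ok(out)
}
```

Each rejection reason is a separate variant, so admission-control rejections can be counted by reason.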
use core::fmt;
/// Inputs that can drive the state machine.
#[derive(Clone, Debug)]
pub enum Event {
    Input(Vec<u8>),
    Tick,
    Fault(&'static str),
}
/// A deterministic state machine with a checkable safety property.
pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;
    /// Pure transition: returns the next state or rejects the event.
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    /// Safety property that must hold in every reachable state.
    fn invariant(state: &Self::State) -> bool;
}
// Observability as specification: invariants are part of the API contract.
Verification strategy
- Differential tests against a reference model (even a slow one).
- Metamorphic tests: applying the same idempotent operation twice must yield the same result as applying it once.
- Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
- Fault injection: latency, partial writes, dropped acks, and duplicated messages.
- Crash/restart tests: persist mid-transition and validate recovery correctness.
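The differential and metamorphic items above can be sketched together. The example assumes a toy operation (counting distinct request ids); all function names are hypothetical, and the generator is a small linear congruential generator used only so that runs are reproducible without external crates.

```rust
use std::collections::HashSet;

/// Implementation under test: counts distinct ids with a hash set.
pub fn count_distinct_fast(ids: &[u32]) -> usize {
    let mut seen = HashSet::new();
    ids.iter().filter(|id| seen.insert(**id)).count()
}

/// Reference model: slow (quadratic) but easy to check by reading it.
pub fn count_distinct_reference(ids: &[u32]) -> usize {
    let mut seen: Vec<u32> = Vec::new();
    for id in ids {
        if !seen.contains(id) {
            seen.push(*id);
        }
    }
    seen.len()
}

/// Deterministic pseudo-random ids in `0..modulus` (modulus must be non-zero).
pub fn generate_ids(seed: u64, len: usize, modulus: u32) -> Vec<u32> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            ((state >> 33) as u32) % modulus
        })
        .collect()
}

/// Differential check: fast and reference agree on the same input.
/// Metamorphic check: delivering every id twice does not change the result.
pub fn check(seed: u64) -> bool {
    let ids = generate_ids(seed, 200, 50);
    let mut doubled = ids.clone();
    doubled.extend_from_slice(&ids);
    count_distinct_fast(&ids) == count_distinct_reference(&ids)
        && count_distinct_fast(&doubled) == count_distinct_fast(&ids)
}
```

Keeping the seed as the only input means a failing case can be replayed exactly, which matters more than the quality of the randomness.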
Operational notes
- Log as evidence: append-only where possible; isolate logs from compromised workloads.
- Instrument ambiguity: measure “unknown outcome” responses separately from failures.
- Track invariant violations as pages, not dashboards.
- Design “degraded modes” explicitly (fail closed vs fail open per operation).
- Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
Attach explicit rollout/rollback triggers to changes that touch security or correctness.
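The point about explicit degraded modes can be sketched as a per-operation policy. The operations and the `authorize` helper are hypothetical; the assumption is that `policy_check` is `None` when the policy dependency is unreachable.

```rust
/// Hypothetical operations with different risk profiles.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Operation {
    ReadPublicCatalog,
    TransferFunds,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OnDependencyFailure {
    FailOpen,
    FailClosed,
}

/// The match is exhaustive, so adding an operation does not compile
/// until someone decides its degraded mode.
pub fn degraded_mode(op: Operation) -> OnDependencyFailure {
    match op {
        Operation::ReadPublicCatalog => OnDependencyFailure::FailOpen,
        Operation::TransferFunds => OnDependencyFailure::FailClosed,
    }
}

/// `policy_check` is `None` when the policy service could not be reached.
pub fn authorize(op: Operation, policy_check: Option<bool>) -> bool {
    match policy_check {
        Some(allowed) => allowed,
        None => degraded_mode(op) == OnDependencyFailure::FailOpen,
    }
}
```

The design choice here is to make the fail-open decision a reviewed table entry rather than an accident of error handling.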
What to monitor
- Authz failures and policy denials (unexpected spikes).
- Error budget burn + tail latency under load.
- Rollback events and the conditions that triggered them.
- Admission-control / rate-limit rejections (by reason).
- Retry/timeout rates by endpoint and client cohort.
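For the error-budget item, a worked sketch of the usual arithmetic: the budget is one minus the SLO target, and the burn rate is the observed error ratio divided by that budget. Function names are hypothetical, and the request counts are assumed to come from whatever window is being evaluated.

```rust
/// Fraction of requests allowed to fail, e.g. a 0.999 target gives 0.001.
pub fn error_budget(slo_target: f64) -> f64 {
    1.0 - slo_target
}

/// Burn rate = observed error ratio / budgeted error ratio.
/// A value of 1.0 spends the budget exactly over the SLO window;
/// higher values exhaust it proportionally sooner.
/// Returns `None` when there is no traffic or no budget to divide by.
pub fn burn_rate(failed: u64, total: u64, slo_target: f64) -> Option<f64> {
    let budget = error_budget(slo_target);
    if total == 0 || budget <= 0.0 {
        return None;
    }
    Some((failed as f64 / total as f64) / budget)
}
```

For example, 10 failures out of 1,000 requests against a 99.9% target is an error ratio of 0.01 against a budget of 0.001, a burn rate of about 10.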
Rollback plan
- Use canaries and staged rollout; stop early when signals degrade.
- Keep dual-write / dual-verify windows where appropriate.
- Define an explicit rollback trigger (metrics + thresholds).
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
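An explicit rollback trigger can be as small as a struct of thresholds evaluated against canary signals. The field names and the choice of signals below are assumptions for illustration; the thresholds in any real rollout would come from the service's own SLOs.

```rust
/// Signals observed on the canary (hypothetical selection).
#[derive(Clone, Copy, Debug)]
pub struct Signals {
    pub error_ratio: f64,
    pub p99_latency_ms: u64,
    pub invariant_violations: u64,
}

/// Thresholds agreed on before the rollout starts.
#[derive(Clone, Copy, Debug)]
pub struct RollbackTrigger {
    pub max_error_ratio: f64,
    pub max_p99_latency_ms: u64,
}

impl RollbackTrigger {
    /// Returns every reason the canary fails, so the decision leaves evidence.
    pub fn reasons(&self, canary: &Signals) -> Vec<&'static str> {
        let mut reasons = Vec::new();
        if canary.error_ratio > self.max_error_ratio {
            reasons.push("error ratio above threshold");
        }
        if canary.p99_latency_ms > self.max_p99_latency_ms {
            reasons.push("p99 latency above threshold");
        }
        // Any invariant violation triggers rollback, regardless of rate.
        if canary.invariant_violations > 0 {
            reasons.push("invariant violation observed");
        }
        reasons
    }

    pub fn should_roll_back(&self, canary: &Signals) -> bool {
        !self.reasons(canary).is_empty()
    }
}
```

Returning the reasons rather than a bare boolean keeps the rollback event explainable after the fact.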
Evidence
- RFC 9110: HTTP Semantics (1) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.
- Learn TLA+ (2) — A pragmatic workflow for invariants and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
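The RFC 9110 point can be sketched as a retry policy keyed on method semantics. The safe and idempotent classifications follow RFC 9110; `PATCH` is defined in RFC 5789 and is not idempotent in general. The `may_auto_retry` helper and the idempotency-key flag are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Method {
    Get,
    Head,
    Options,
    Trace,
    Put,
    Delete,
    Post,
    Patch,
    Connect,
}

impl Method {
    /// Safe methods are read-only from the client's point of view.
    pub fn is_safe(self) -> bool {
        matches!(self, Method::Get | Method::Head | Method::Options | Method::Trace)
    }

    /// Idempotent methods: all safe methods plus PUT and DELETE.
    pub fn is_idempotent(self) -> bool {
        self.is_safe() || matches!(self, Method::Put | Method::Delete)
    }
}

/// Retry automatically only when the method is idempotent by definition
/// or the request carries an idempotency key the server honors (assumption).
pub fn may_auto_retry(method: Method, has_idempotency_key: bool) -> bool {
    method.is_idempotent() || has_idempotency_key
}
```

A timeout does not appear anywhere in the decision: whether a retry is allowed depends on what the request means, not on how long it took.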
Open questions
- What is the minimal durable record needed to recover safely?
- Where does your API currently allow ambiguous outcomes, and how will clients cope?
- What would you do if you had to replay a month of traffic into a rebuilt system?
- Which correctness properties can be enforced at compile time (types/capabilities)?
Checklist
- Telemetry captures correctness signals.
- Assumptions listed and reviewed.
- Failure modes enumerated with mitigations.
- Rollback plan rehearsed and automated.
- Safety properties stated as invariants.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Further reading
- Time, Clocks, and the Ordering of Events (Lamport, 1978) — The mental model for causality and ordering in distributed systems.
- RFC 9110: HTTP Semantics — Defines method semantics including idempotency and safety—useful for API contracts.
- Learn TLA+ — A pragmatic workflow for invariants and model checking.
- Paxos Made Simple (Lamport) — A clean reference for agreement and invariants.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.