Observability as Specification: SLOs, Error Budgets, and Contracts

Monthly research note. Theme: Correctness & Foundations.

TL;DR

A focused memo on Observability as Specification: SLOs, Error Budgets, and Contracts: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
Ack semantics must be explicit: durable, best-effort, or ambiguous.
Automate guardrails; humans are for judgment, not for consistent enforcement.
Make failure modes explicit and observable.

Why this matters

Your on-call runbook is part of the specification—make it match the code.
Undefined behavior is an attack surface when inputs are adversarial.
In distributed code, retries and duplication are the common case—not the edge case.
A system without explicit contracts becomes a collection of folklore and dashboards.

Key questions

What does a client learn after a timeout: success, failure, or ambiguity?
Which invariants must hold across crashes, restarts, and partial deployments?
Which transitions are allowed, and which are impossible by construction?
Where does concurrency create “double spend” style failures in your domain?
How do you make “unsafe defaults” impossible to ship?
What exactly is the state, and what is derived or cached?

Assumptions

Requests can be duplicated, reordered, delayed, and replayed across restarts.
Observability is incomplete: you will debug from partial evidence.
Crashes happen mid-write (torn state) unless you prove otherwise.
Time is untrusted: clock skew, NTP steps, monotonic vs wall-clock confusion.

Non-goals

Letting recovery be “restart the service and hope.”
Assuming a single authoritative clock that never moves backwards.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

A common pattern is splitting state into durable vs derived:

S = S_\text{durable} \times S_\text{derived}\qquad\text{and}\qquad S_\text{derived} = f(S_\text{durable}).

Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.

Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).

Security properties

Integrity: invalid transitions are rejected (and detectable).
Downgrade resistance: negotiation can’t silently weaken security posture.
Evidence: critical actions emit verifiable audit events.
Authenticity: actions are bound to identity and purpose.

Failure modes

Recovery paths that only work when nothing is broken.
Config drift that weakens security posture over time.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Mixed-version behavior that violates assumptions silently.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

flowchart TD
  input["Input"] --> parse["Parse/Validate"]
  parse --> decide["Decide (pure)"]
  decide --> write["Durable write"]
  write --> ack["Acknowledge"]
  ack --> obs["Emit evidence (logs/metrics)"]

Implementation notes

The goal isn’t cleverness—it’s eliminating ambiguity at boundaries and making recovery boring.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.

use core::fmt;

#[derive(Clone, Debug)]
pub enum Event {
    Input(Vec<u8>),
    Tick,
    Fault(&'static str),
}

pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;

    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    fn invariant(state: &Self::State) -> bool;
}

// Observability as Specification: SLOs, Error Budgets, and Contracts: invariants are part of the API contract.

Verification strategy

Differential tests against a reference model (even a slow one).
Metamorphic tests: same operation applied twice must not change the result.
Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
Fault injection: latency, partial writes, dropped acks, and duplicated messages.
Crash/restart tests: persist mid-transition and validate recovery correctness.

Operational notes

Log as evidence: append-only where possible; isolate logs from compromised workloads.
Instrument ambiguity: measure “unknown outcome” responses separately from failures.
Track invariant violations as pages, not dashboards.
Design “degraded modes” explicitly (fail closed vs fail open per operation).
Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Authz failures and policy denials (unexpected spikes).
Error budget burn + tail latency under load.
Rollback events and the conditions that triggered them.
Admission-control / rate-limit rejections (by reason).
Retry/timeout rates by endpoint and client cohort.

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Keep dual-write / dual-verify windows where appropriate.
Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.

Evidence

RFC 9110: HTTP Semantics (1) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.
Learn TLA+ (2) — A pragmatic workflow for invariants and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

What is the minimal durable record needed to recover safely?
Where does your API currently allow ambiguous outcomes, and how will clients cope?
What would you do if you had to replay a month of traffic into a rebuilt system?
Which correctness properties can be enforced at compile time (types/capabilities)?

Checklist

Telemetry captures correctness signals.
Assumptions listed and reviewed.
Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Safety properties stated as invariants.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading