Monthly research note. Theme: Correctness & Foundations.

TL;DR

Treat “Reproducible Builds: Trusting Artifacts in a Hostile World” as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
  • Crash points are part of the design; specify recovery after each state mutation.
  • Ack semantics must be explicit: durable, best-effort, or ambiguous.
  • Define safety properties before performance goals.
  • Measure correctness signals, not only latency/throughput.
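One way to make the retry and ack takeaways concrete is to give acknowledgements a type, so that "ambiguous" is a value the caller must handle rather than a timeout it guesses about. The sketch below is illustrative only: `Ack`, `RetryDecision`, and `next_step` are hypothetical names, and treating a best-effort ack as final is an assumption about what the caller opted into.

```rust
/// What the caller learned from one attempt (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ack {
    /// The write is persisted; no retry is needed.
    Durable,
    /// Accepted, but may be lost on a crash; the caller opted into that risk.
    BestEffort,
    /// Timeout or reset: the write may or may not have happened.
    Ambiguous,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryDecision {
    Done,
    /// Retry with the same idempotency key so a replay cannot double-apply.
    RetrySameKey,
    /// Surface the uncertainty instead of guessing.
    ReportAmbiguous,
}

/// Decide what to do after one attempt.
pub fn next_step(ack: Ack, has_idempotency_key: bool, attempts_left: u32) -> RetryDecision {
    match ack {
        Ack::Durable | Ack::BestEffort => RetryDecision::Done,
        Ack::Ambiguous if has_idempotency_key && attempts_left > 0 => RetryDecision::RetrySameKey,
        Ack::Ambiguous => RetryDecision::ReportAmbiguous,
    }
}
```

The point of the shape is that a blind retry without a key is not representable: the only retry variant carries the "same key" requirement in its name.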

Why this matters

  • Interfaces that allow invalid state guarantee someone will eventually produce it.
  • Undefined behavior is an attack surface when inputs are adversarial.
  • A system without explicit contracts becomes a collection of folklore and dashboards.
  • “Works in tests” often means “fails under reordering and retries.”

Key questions

  • How do you make “unsafe defaults” impossible to ship?
  • Where does concurrency create “double spend” style failures in your domain?
  • What exactly is the state, and what is derived or cached?
  • What must be durable before you acknowledge?
  • Which invariants must hold across crashes, restarts, and partial deployments?
  • How do you ensure deduplication is scoped correctly (tenant, resource, operation)?
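For the deduplication-scope question, a minimal sketch is to make the scope part of the key type, so a client-chosen key can never collide across tenants, resources, or operations. `DedupKey` and `DedupTable` are hypothetical names, and the in-memory `HashSet` stands in for whatever durable store a real system would use.

```rust
use std::collections::HashSet;

/// A dedup key that carries its full scope (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DedupKey {
    pub tenant: String,
    pub resource: String,
    pub operation: String,
    pub client_key: String,
}

impl DedupKey {
    pub fn new(tenant: &str, resource: &str, operation: &str, client_key: &str) -> Self {
        DedupKey {
            tenant: tenant.to_string(),
            resource: resource.to_string(),
            operation: operation.to_string(),
            client_key: client_key.to_string(),
        }
    }
}

/// In-memory stand-in for a durable dedup table.
#[derive(Default)]
pub struct DedupTable {
    seen: HashSet<DedupKey>,
}

impl DedupTable {
    /// Returns true the first time a key is seen and false on replays.
    pub fn first_time(&mut self, key: DedupKey) -> bool {
        self.seen.insert(key)
    }
}
```

With this shape, two tenants that happen to pick the same `client_key` are both treated as first-time requests, while a replay within one scope is rejected.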

Assumptions

  • Clients retry with backoff but not with perfect discipline (bursts happen).
  • Deployments are mixed-version for longer than you think.
  • Errors are lossy: transient vs permanent is often indistinguishable at the boundary.
  • Observability is incomplete: you will debug from partial evidence.

Non-goals

  • Letting recovery be “restart the service and hope.”
  • Perfect exactly-once semantics across an untrusted network without coordination.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
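A minimal sketch of bounding work at the boundary: check size limits before allocating or looping. The limits, `ParseError`, and `parse_items` are assumptions chosen for illustration, not recommended values.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    TooLarge { len: usize, max: usize },
    TooManyItems { max: usize },
}

/// Illustrative limits; real values depend on the workload.
pub const MAX_BODY_BYTES: usize = 64 * 1024;
pub const MAX_ITEMS: usize = 1_000;

/// Split a newline-delimited body into items, refusing oversized input up front.
pub fn parse_items(body: &[u8]) -> Result<Vec<&[u8]>, ParseError> {
    // Reject before doing any per-byte work.
    if body.len() > MAX_BODY_BYTES {
        return Err(ParseError::TooLarge { len: body.len(), max: MAX_BODY_BYTES });
    }
    let mut items = Vec::new();
    for item in body.split(|b| *b == b'\n') {
        // Stop as soon as the item budget is exhausted.
        if items.len() == MAX_ITEMS {
            return Err(ParseError::TooManyItems { max: MAX_ITEMS });
        }
        items.push(item);
    }
    Ok(items)
}
```

Both rejections are cheap and typed, which also lets rate-limit and admission metrics be broken down by reason.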

Model & invariants

For idempotent operations, the contract is set-like:

\[ \mathrm{apply}(\mathrm{apply}(s, op, k), op, k) = \mathrm{apply}(s, op, k) \quad\text{and}\quad \mathrm{apply}(s, op, k_1) \neq \mathrm{apply}(s, op, k_2)\ \text{in general}. \]
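A minimal sketch of this contract, reading it as "replaying an operation under the same key leaves the state unchanged, while a different key produces a different state". The `Ledger` type and the choice of a credit as the operation are assumptions for illustration.

```rust
use std::collections::HashSet;

/// State s: a balance plus the set of idempotency keys already applied.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct Ledger {
    pub balance: i64,
    applied: HashSet<u64>,
}

/// apply(s, op, k): credit `amount` under key `k`, at most once per key.
pub fn apply(mut s: Ledger, amount: i64, k: u64) -> Ledger {
    // `insert` returns false when the key was already present, so replays are no-ops.
    if s.applied.insert(k) {
        s.balance += amount;
    }
    s
}
```

Because the applied keys are part of the state, the same credit under two different keys yields two different states even when the balances match, which is the "in general" clause of the contract.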

Treat invariants as a first-class interface: a function that cannot check its invariants cannot be safely composed. Start with the smallest invariant that is both meaningful and enforceable at your boundaries.

Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.
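As a sketch of "reproducible or explicitly reconciled", the structure below keeps a derived total next to its source of truth and can recompute and compare it on demand. `Accounts`, its fields, and `reconcile` are hypothetical; the fields are public only so that drift can be simulated.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Default)]
pub struct Accounts {
    /// Source of truth.
    pub entries: BTreeMap<String, i64>,
    /// Derived state: maintained incrementally, must equal the sum of `entries`.
    pub cached_total: i64,
}

impl Accounts {
    /// Set one entry and update the cached total incrementally.
    pub fn set(&mut self, name: &str, amount: i64) {
        let old = self.entries.insert(name.to_string(), amount).unwrap_or(0);
        self.cached_total += amount - old;
    }

    /// Recompute the derived value from the source of truth.
    pub fn recompute_total(&self) -> i64 {
        self.entries.values().sum()
    }

    /// Returns true if the cache matched; repairs it from the source either way.
    pub fn reconcile(&mut self) -> bool {
        let fresh = self.recompute_total();
        let matched = fresh == self.cached_total;
        self.cached_total = fresh;
        matched
    }
}
```

A `false` from `reconcile` is itself a correctness signal worth counting, since it means the incremental path and the recompute path disagreed.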

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
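A minimal sketch of making drift observable: route every invariant check through one function that counts failures. `InvariantMonitor` is a hypothetical name, and the atomic counter stands in for whatever metrics client is actually in use.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counts invariant violations so they can be exported as a metric.
#[derive(Default)]
pub struct InvariantMonitor {
    violations: AtomicU64,
}

impl InvariantMonitor {
    /// Record the outcome of one check and pass the result through.
    pub fn check(&self, name: &str, holds: bool) -> bool {
        if !holds {
            self.violations.fetch_add(1, Ordering::Relaxed);
            eprintln!("invariant violated: {}", name);
        }
        holds
    }

    /// Current violation count; alert when this is nonzero.
    pub fn violations(&self) -> u64 {
        self.violations.load(Ordering::Relaxed)
    }
}
```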

Security properties

  • Authenticity: actions are bound to identity and purpose.
  • Replay resistance: duplicated inputs do not change outcomes.
  • Least authority: privileges are scoped by purpose and time.
  • Evidence: critical actions emit verifiable audit events.

Failure modes

  • Timeout ambiguity causing double-apply or partial state transitions.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Recovery paths that only work when nothing is broken.
  • Mixed-version behavior that violates assumptions silently.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart TD
  input["Input"] --> parse["Parse/Validate"]
  parse --> decide["Decide (pure)"]
  decide --> write["Durable write"]
  write --> ack["Acknowledge"]
  ack --> obs["Emit evidence (logs/metrics)"]
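The flow above can be sketched as a single handler with a pure decision step in the middle. All names here (`Command`, `Rejected`, `Store`, `handle`) are hypothetical, and the in-memory `Store` only stands in for a durable write; a real one would persist before returning.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct Command {
    pub delta: i64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Rejected {
    Malformed,
    OutOfRange,
}

/// Parse/Validate: turn untrusted text into a typed command.
pub fn parse(input: &str) -> Result<Command, Rejected> {
    input
        .trim()
        .parse::<i64>()
        .map(|delta| Command { delta })
        .map_err(|_| Rejected::Malformed)
}

/// Decide (pure): no I/O, so it can be tested in isolation.
pub fn decide(balance: i64, cmd: &Command) -> Result<i64, Rejected> {
    match balance.checked_add(cmd.delta) {
        Some(next) if next >= 0 => Ok(next),
        _ => Err(Rejected::OutOfRange),
    }
}

/// Stand-in for durable storage plus an evidence log.
#[derive(Default)]
pub struct Store {
    pub balance: i64,
    pub log: Vec<String>,
}

pub fn handle(store: &mut Store, input: &str) -> Result<i64, Rejected> {
    let cmd = parse(input)?;
    let next = decide(store.balance, &cmd)?;
    store.balance = next; // "durable" write happens before the ack
    // Evidence is recorded alongside the ack in this sketch.
    store.log.push(format!("applied delta={} balance={}", cmd.delta, next));
    Ok(next) // acknowledge
}
```

Rejected inputs never reach the write step, so the stored balance only changes through `decide`.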

Implementation notes

Correctness lives in the seams: encoding, persistence, concurrency, and retries.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

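use core::fmt;

/// Inputs that can drive the machine, including injected faults.
#[derive(Clone, Debug)]
pub enum Event {
    /// Raw input bytes.
    Input(Vec<u8>),
    /// A time tick.
    Tick,
    /// A fault, identified by a static label.
    Fault(&'static str),
}

/// A transition function paired with the invariant it is expected to preserve.
pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;

    /// Computes the next state from a borrowed current state; the input is not mutated.
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    /// Returns true when `state` satisfies the machine's invariant.
    fn invariant(state: &Self::State) -> bool;
}

// Reproducible Builds: Trusting Artifacts in a Hostile World: invariants are part of the API contract.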
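To show how the trait might be used, here is a hedged sketch with one toy implementation and a driver that checks the invariant after every accepted step. The trait and `Event` are repeated so the block stands alone; `ByteBudget`, `LIMIT`, and `run` are hypothetical, and ignoring rejected events is one possible policy, not the only one.

```rust
use core::fmt;

#[derive(Clone, Debug)]
pub enum Event {
    Input(Vec<u8>),
    Tick,
    Fault(&'static str),
}

pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    fn invariant(state: &Self::State) -> bool;
}

/// Toy machine: counts buffered bytes, drains on Tick, capped at LIMIT.
pub struct ByteBudget;
pub const LIMIT: usize = 8;

impl StateMachine for ByteBudget {
    type State = usize;
    type Error = String;

    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error> {
        match event {
            Event::Input(bytes) => {
                let next = *state + bytes.len();
                if next > LIMIT {
                    Err(format!("over budget: {} > {}", next, LIMIT))
                } else {
                    Ok(next)
                }
            }
            Event::Tick => Ok(0),
            Event::Fault(reason) => Err(format!("fault: {}", reason)),
        }
    }

    fn invariant(state: &Self::State) -> bool {
        *state <= LIMIT
    }
}

/// Drive a machine, checking the invariant after every accepted step.
pub fn run<M: StateMachine>(mut state: M::State, events: Vec<Event>) -> Result<M::State, String> {
    for event in events {
        match M::step(&state, event) {
            Ok(next) => {
                if !M::invariant(&next) {
                    return Err(format!("invariant broken in {:?}", next));
                }
                state = next;
            }
            // Rejected events leave the state unchanged in this sketch.
            Err(_) => {}
        }
    }
    Ok(state)
}
```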

Verification strategy

  • Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
  • Differential tests against a reference model (even a slow one).
  • Property-based tests: generate adversarial sequences and assert invariants after every step.
  • Metamorphic tests: applying the same operation twice with the same idempotency key must yield the same state as applying it once.
  • Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
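The differential and property-based bullets can be combined in a small harness without external crates: generate a seeded sequence of operations with frequent key replays, and compare the system under test against a slow reference model after every step. Everything here is a hypothetical sketch; a real project would more likely use a property-testing library with shrinking.

```rust
use std::collections::{HashMap, HashSet};

/// Tiny deterministic generator (xorshift64) so failures replay from the seed.
pub struct Rng(pub u64);

impl Rng {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// System under test: credits each key at most once.
#[derive(Default)]
pub struct Ledger {
    pub balance: u64,
    seen: HashSet<u64>,
}

impl Ledger {
    pub fn credit(&mut self, key: u64, amount: u64) {
        if self.seen.insert(key) {
            self.balance += amount;
        }
    }
}

/// Runs `steps` generated operations; returns Err(step) at the first divergence.
pub fn check_sequence(seed: u64, steps: usize) -> Result<(), usize> {
    let mut rng = Rng(seed);
    let mut sut = Ledger::default();
    // Slow reference model: key -> first amount seen for that key.
    let mut model: HashMap<u64, u64> = HashMap::new();
    for i in 0..steps {
        let key = rng.next_u64() % 8; // small key space forces frequent replays
        let amount = rng.next_u64() % 100;
        sut.credit(key, amount);
        model.entry(key).or_insert(amount);
        let expected: u64 = model.values().sum();
        if sut.balance != expected {
            return Err(i);
        }
    }
    Ok(())
}
```

Returning the failing step index together with the seed gives a reproducible counterexample, which matters more than the sophistication of the generator.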

Operational notes

  • Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
  • Log as evidence: append-only where possible; isolate logs from compromised workloads.
  • Make rollbacks safe: schema and protocol compatibility is a security boundary.
  • Track invariant violations as pages, not dashboards.
  • Validate time assumptions: alert on clock steps, skew, and monotonicity issues.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

  • Authz failures and policy denials (unexpected spikes).
  • Error budget burn + tail latency under load.
  • Rollback events and the conditions that triggered them.
  • Admission-control / rate-limit rejections (by reason).
  • Invariant violation rate (should be ~0).

Rollback plan

  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Keep dual-write / dual-verify windows where appropriate.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
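An explicit rollback trigger can be as small as a pure function from observed signals to a verdict. The sketch below assumes three signals and treats any invariant violation as an immediate rollback; `CanarySignals`, `RollbackPolicy`, `Verdict`, and `evaluate` are hypothetical names, and the thresholds are placeholders.

```rust
/// Signals observed on the canary (illustrative selection).
#[derive(Debug, Clone, Copy)]
pub struct CanarySignals {
    pub invariant_violations: u64,
    /// Fraction of failed requests, in 0.0..=1.0.
    pub error_rate: f64,
    pub p99_latency_ms: u64,
}

#[derive(Debug, Clone, Copy)]
pub struct RollbackPolicy {
    pub max_error_rate: f64,
    pub max_p99_latency_ms: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Continue,
    /// Carries the reason so the rollback event is self-describing.
    Rollback(&'static str),
}

/// Pure evaluation: the same signals always yield the same verdict.
pub fn evaluate(policy: &RollbackPolicy, s: &CanarySignals) -> Verdict {
    if s.invariant_violations > 0 {
        return Verdict::Rollback("invariant violation");
    }
    if s.error_rate > policy.max_error_rate {
        return Verdict::Rollback("error rate");
    }
    if s.p99_latency_ms > policy.max_p99_latency_ms {
        return Verdict::Rollback("tail latency");
    }
    Verdict::Continue
}
```

Keeping the trigger pure makes it easy to rehearse: recorded signals from past incidents can be replayed through `evaluate` to check that the policy would have fired.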

Evidence

  • Jepsen (1) — Fault injection and correctness testing for distributed systems.
    • Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
  • RFC 9110: HTTP Semantics (2) — Defines method semantics including idempotency and safety—useful for API contracts.
    • Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.

Open questions

  • What would you do if you had to replay a month of traffic into a rebuilt system?
  • Which operations need monotonic versioning vs idempotency keys vs both?
  • Which invariant, if violated, would silently corrupt state for weeks?
  • Which correctness properties can be enforced at compile time (types/capabilities)?

Checklist

  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Rollback plan rehearsed and automated.
  • Assumptions listed and reviewed.
  • Failure modes enumerated with mitigations.
  • Telemetry captures correctness signals.
  • Safety properties stated as invariants.

Further reading

1. Jepsen. Jepsen: Distributed Systems Safety Analysis [Internet]. Available from: https://jepsen.io/
2. Fielding RT, Nottingham M, Reschke J. HTTP Semantics [Internet]. RFC Editor; 2022. Report No.: 9110. Available from: https://www.rfc-editor.org/rfc/rfc9110