Monthly research note. Theme: Correctness & Foundations.

TL;DR

Treat cryptographic hygiene (domain separation, KDFs, and context binding) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Crash points are part of the design; specify recovery after each state mutation.
  • Separate durable state from derived state; derived must be recomputable or reconcilable.
  • Ack semantics must be explicit: durable, best-effort, or ambiguous.
  • Treat retries, reordering, and partial failure as default conditions.
  • Automate guardrails; humans are for judgment, not for consistent enforcement.
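
One way to make ack semantics explicit is to put them in the response type rather than in documentation. The following is a minimal sketch; the names `Ack`, `Response`, and `client_should_retry` are hypothetical and not taken from any particular framework.

```python
from dataclasses import dataclass
from enum import Enum


class Ack(Enum):
    DURABLE = "durable"          # write is persisted; safe to stop retrying
    BEST_EFFORT = "best_effort"  # accepted, but may be lost on a crash
    AMBIGUOUS = "ambiguous"      # outcome unknown (for example, a timeout)


@dataclass(frozen=True)
class Response:
    ack: Ack
    request_id: str  # reused on retry so the server can deduplicate


def client_should_retry(resp: Response) -> bool:
    # Only an unknown outcome calls for a retry in this sketch; the retry
    # must carry the same request_id as the original attempt.
    return resp.ack is Ack.AMBIGUOUS
```

With the three cases in the type, a client cannot treat a timeout as a plain failure without writing that decision down in code.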

Why this matters

  • Undefined behavior is an attack surface when inputs are adversarial.
  • When the environment is adversarial, a correctness bug is often indistinguishable from a security incident.
  • Correctness is a property you enforce at boundaries: parsing, persistence, concurrency, RPC.
  • A system without explicit contracts becomes a collection of folklore and dashboards.

Key questions

  • What exactly is the state, and what is derived or cached?
  • What does a client learn after a timeout: success, failure, or ambiguity?
  • How do you ensure deduplication is scoped correctly (tenant, resource, operation)?
  • Which invariants must hold across crashes, restarts, and partial deployments?
  • How do you make “unsafe defaults” impossible to ship?
  • Where do you need atomicity (and where is eventual consistency acceptable)?
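
For the deduplication question, a scoped key can be sketched as a hash over every dimension of the scope. The function name `dedup_key` and the choice of fields are assumptions for illustration. Each field is length-prefixed so that field boundaries are unambiguous, which is the same idea as domain separation applied to identifiers.

```python
import hashlib


def dedup_key(tenant: str, resource: str, operation: str, client_key: str) -> str:
    """Build a deduplication key scoped by tenant, resource, and operation."""
    h = hashlib.sha256()
    for field in (tenant, resource, operation, client_key):
        data = field.encode("utf-8")
        # Length prefix: ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()
```

Without the length prefix, plain concatenation would let two different scopes produce the same key, and one tenant's request could suppress another's.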

Assumptions

  • Clients retry with backoff but not with perfect discipline (bursts happen).
  • Input is hostile: malformed, oversized, boundary values, protocol confusion.
  • Crashes happen mid-write (torn state) unless you prove otherwise.
  • Requests can be duplicated, reordered, delayed, and replayed across restarts.

Non-goals

  • Letting recovery be “restart the service and hope.”
  • Baking invariants into tribal knowledge instead of code.

Attack surface

Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.

Model & invariants

We want a transition function $\delta$ and invariant $\mathrm{Inv}$ such that:

$$s_{t+1} = \delta(s_t, e_t) \qquad\wedge\qquad \mathrm{Inv}(s_t) \Rightarrow \mathrm{Inv}(s_{t+1}).$$

Treat invariants as a first-class interface: a function that cannot check its invariants cannot be safely composed. Start with the smallest invariant that is both meaningful and enforceable at your boundaries.
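
As a sketch of an executable invariant, the transition function can refuse to return any state that violates it. The `State`, `Event`, and `inv` definitions here (a non-negative balance with a version counter) are a hypothetical example, chosen only because the invariant is small and easy to enforce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    balance: int
    version: int


@dataclass(frozen=True)
class Event:
    delta: int


class InvariantViolation(Exception):
    pass


def inv(s: State) -> bool:
    # Hypothetical safety property: balances and versions never go negative.
    return s.balance >= 0 and s.version >= 0


def step(s: State, e: Event) -> State:
    """Apply one transition; reject it if the invariant would break."""
    if not inv(s):
        raise InvariantViolation(f"precondition failed for {s}")
    nxt = State(balance=s.balance + e.delta, version=s.version + 1)
    if not inv(nxt):
        raise InvariantViolation(f"rejected transition to {nxt}")
    return nxt
```

Because `State` is frozen and `step` is the only way to produce a successor, the invariant holds for every state that this code can reach from a valid start.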

Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.
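
A minimal sketch of replay detection with per-sender sequence numbers follows. `ReplayGuard` is a hypothetical name, and the in-memory dictionary is an assumption made to keep the example short: to detect replays across restarts, the high-water mark would have to live in durable state.

```python
class ReplayGuard:
    """Tracks the highest sequence number accepted from each sender."""

    def __init__(self) -> None:
        # In-memory only in this sketch; persist it to survive restarts.
        self._last: dict[str, int] = {}

    def accept(self, sender: str, seq: int) -> bool:
        last = self._last.get(sender, -1)
        if seq <= last:
            # Duplicate, replay, or a late message that arrived out of order.
            return False
        self._last[sender] = seq
        return True
```

This strict variant also drops legitimate messages that arrive late. Whether that is acceptable depends on the protocol; a sliding window of accepted numbers is a common alternative.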

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
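
One hedged way to make an impossible state observable is to route every invariant check through a helper that counts failures. The in-process `Counter` below stands in for whatever metrics client is actually in use, and `check_invariant` is a hypothetical name.

```python
from collections import Counter

# Stand-in for a real metrics client: violation counts keyed by invariant name.
violations: Counter = Counter()


def check_invariant(name: str, holds: bool) -> bool:
    """Record a violation when the invariant fails; return the check result."""
    if not holds:
        violations[name] += 1
    return holds
```

An alert on any non-zero count for a given name then fires the first time the system drifts, rather than when a user reports corrupted data.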

Security properties

  • Evidence: critical actions emit verifiable audit events.
  • Least authority: privileges are scoped by purpose and time.
  • Integrity: invalid transitions are rejected (and detectable).
  • Downgrade resistance: negotiation can’t silently weaken security posture.
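
The note's theme names domain separation, KDFs, and context binding. One way to apply "scoped by purpose" to key material is to derive a separate key per purpose and context from a single master secret. The sketch below implements HKDF with SHA-256 (RFC 5869) using only the standard library; `derive_key`, the purpose labels, and the layout of `info` are assumptions for illustration. In production, a vetted cryptography library is the safer choice.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    # RFC 5869: an empty salt is replaced by HASH_LEN zero bytes.
    if not salt:
        salt = b"\x00" * HASH_LEN
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    if length > 255 * HASH_LEN:
        raise ValueError("requested length too large")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        # T(n) = HMAC(PRK, T(n-1) || info || n)
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_key(master: bytes, purpose: str, context: bytes, length: int = 32) -> bytes:
    """Derive a key bound to a purpose label and a context (hypothetical layout)."""
    label = purpose.encode("utf-8")
    # Length-prefix the label so the label/context boundary is unambiguous.
    info = len(label).to_bytes(2, "big") + label + context
    return hkdf_expand(hkdf_extract(b"", master), info, length)
```

For example, `derive_key(master, "encrypt", b"tenant-a")` and `derive_key(master, "mac", b"tenant-a")` yield unrelated keys, so a key exposed in one role gives no direct handle on the other.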

Failure modes

  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Mixed-version behavior that violates assumptions silently.
  • Config drift that weakens security posture over time.

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart TD
  input["Input"] --> parse["Parse/Validate"]
  parse --> decide["Decide (pure)"]
  decide --> write["Durable write"]
  write --> ack["Acknowledge"]
  ack --> obs["Emit evidence (logs/metrics)"]

Implementation notes

Implementation is the act of making invalid state unrepresentable (or at least unignorable).

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
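
As a sketch of that rule, a parser can run its cheapest checks first and cap cost before doing further work. The limits and the expected request shape (an object with an `items` list) are hypothetical.

```python
import json

MAX_BODY_BYTES = 64 * 1024  # hypothetical limit
MAX_ITEMS = 1000            # hypothetical limit


class RequestTooExpensive(ValueError):
    pass


def parse_request(raw: bytes) -> dict:
    # Cheapest check first: reject oversized bodies before decoding anything.
    if len(raw) > MAX_BODY_BYTES:
        raise RequestTooExpensive("body too large")
    try:
        doc = json.loads(raw)
    except (ValueError, RecursionError) as exc:
        # Covers invalid JSON, invalid UTF-8, and pathological nesting.
        raise ValueError("malformed JSON") from exc
    if not isinstance(doc, dict) or not isinstance(doc.get("items"), list):
        raise ValueError("expected an object with an 'items' list")
    if len(doc["items"]) > MAX_ITEMS:
        raise RequestTooExpensive("too many items")
    return doc
```

The byte cap bounds the cost of parsing itself, and the item cap bounds the cost of everything downstream of the parser.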

Correctness checklist:
1) Define state (durable vs derived).
2) Enumerate transitions.
3) Write invariants (safety) and progress conditions (liveness).
4) Pick crash points and specify recovery.
5) Make retries part of semantics (idempotency keys, monotonic versions).
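
Step 5 can be illustrated with an operation that records its result under an idempotency key, so that a retry returns the original outcome instead of applying the change twice. `IdempotentCounter` is a hypothetical example. A real system would write the state change and the key record in the same durable transaction, and would also decide what to do when a key is reused with a different payload.

```python
class IdempotentCounter:
    """Applies each keyed increment at most once, even when retried."""

    def __init__(self) -> None:
        self.value = 0
        self._results: dict[str, int] = {}  # idempotency key -> recorded result

    def increment(self, idempotency_key: str, amount: int) -> int:
        if idempotency_key in self._results:
            # Retry: return the recorded result without mutating state again.
            return self._results[idempotency_key]
        self.value += amount
        self._results[idempotency_key] = self.value
        return self.value
```

A client that times out after sending `increment("k1", 5)` can resend it safely: the second call returns the recorded result and the counter stays unchanged.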

Verification strategy

  • Differential tests against a reference model (even a slow one).
  • Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
  • Property-based tests: generate adversarial sequences and assert invariants after every step.
  • Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
  • Fault injection: latency, partial writes, dropped acks, and duplicated messages.
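
The property-based and differential ideas above can be combined in a small sketch using only the standard library: generate schedules with duplicated and reordered events, then compare the system against a slow reference model after every step. `Store` and `reference_total` are hypothetical. A library such as Hypothesis would add input shrinking, which this sketch lacks.

```python
import random


class Store:
    """System under test: sums event amounts, applying each event id once."""

    def __init__(self) -> None:
        self.seen = set()
        self.total = 0

    def apply(self, event_id: int, amount: int) -> None:
        if event_id in self.seen:
            return  # duplicate delivery is a no-op
        self.seen.add(event_id)
        self.total += amount


def reference_total(events) -> int:
    # Slow but obviously correct model: count each distinct event id once.
    return sum(dict(events).values())


def run_trial(rng: random.Random, n_events: int = 20) -> None:
    base = [(i, rng.randint(1, 100)) for i in range(n_events)]
    # Adversarial schedule: duplicate some events, then shuffle the order.
    schedule = base + [rng.choice(base) for _ in range(n_events)]
    rng.shuffle(schedule)
    store = Store()
    for i, (event_id, amount) in enumerate(schedule):
        store.apply(event_id, amount)
        # Check after every step, not only at the end of the schedule.
        assert store.total == reference_total(schedule[: i + 1])


def run_trials(seed: int = 0, trials: int = 200) -> int:
    rng = random.Random(seed)  # seeded so that failures are reproducible
    for _ in range(trials):
        run_trial(rng)
    return trials
```

Seeding the generator matters more than it looks: a failing schedule that cannot be replayed is hard to turn into a regression test.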

Operational notes

  • Validate time assumptions: alert on clock steps, skew, and monotonicity issues.
  • Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
  • Design “degraded modes” explicitly (fail closed vs fail open per operation).
  • Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
  • Instrument ambiguity: measure “unknown outcome” responses separately from failures.

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.
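
A per-operation policy table is one way to write that choice down. The operation names and their assignments below are hypothetical. The one deliberate design choice is that an unlisted operation fails closed, so the safe behavior needs no configuration.

```python
from enum import Enum


class OnDependencyFailure(Enum):
    FAIL_CLOSED = "fail_closed"  # refuse the operation
    FAIL_OPEN = "fail_open"      # proceed without the dependency


# Hypothetical operations and policy choices, for illustration only.
POLICY = {
    "authorize_payment": OnDependencyFailure.FAIL_CLOSED,
    "render_recommendations": OnDependencyFailure.FAIL_OPEN,
}


def may_proceed(operation: str, dependency_healthy: bool) -> bool:
    """Return True if the operation may proceed."""
    if dependency_healthy:
        return True
    # Unlisted operations fail closed by default.
    policy = POLICY.get(operation, OnDependencyFailure.FAIL_CLOSED)
    return policy is OnDependencyFailure.FAIL_OPEN
```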

What to monitor

  • Invariant violation rate (should be ~0).
  • Authz failures and policy denials (unexpected spikes).
  • Admission-control / rate-limit rejections (by reason).
  • Retry/timeout rates by endpoint and client cohort.
  • Rollback events and the conditions that triggered them.

Rollback plan

  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Keep dual-write / dual-verify windows where appropriate.
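
An explicit rollback trigger can be sketched as a list of metric thresholds evaluated against the current readings. The metric names and limits here are hypothetical. A missing metric is treated as a breach, on the assumption that a lack of signal during a rollout is not evidence of health.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical metric names and limits.
ROLLBACK_TRIGGERS = (
    Threshold("invariant_violation_rate", 0.0),
    Threshold("error_rate", 0.02),
)


def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of triggers that call for a rollback."""
    out = []
    for t in ROLLBACK_TRIGGERS:
        value = metrics.get(t.metric)
        # A missing reading counts as a breach in this sketch.
        if value is None or value > t.max_value:
            out.append(t.metric)
    return out
```

A non-empty result from `breached` would stop the staged rollout; which component acts on it is outside the scope of this sketch.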

Evidence

  • Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
  • Jepsen (2) — Fault injection and correctness testing for distributed systems.
    • Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

  • Which correctness properties can be enforced at compile time (types/capabilities)?
  • Which invariant, if violated, would silently corrupt state for weeks?
  • Which operations need monotonic versioning vs idempotency keys vs both?
  • Where does your API currently allow ambiguous outcomes, and how will clients cope?

Checklist

  • Telemetry captures correctness signals.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Rollback plan rehearsed and automated.
  • Safety properties stated as invariants.
  • Failure modes enumerated with mitigations.
  • Assumptions listed and reviewed.

Further reading

1. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/
2. Jepsen. Jepsen: Distributed Systems Safety Analysis [Internet]. Web; Available from: https://jepsen.io/