Monthly research note. Theme: Correctness & Foundations.
TL;DR
Treat cryptographic hygiene (domain separation, KDFs, and context binding) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
Correctness is cheaper to enforce at interfaces than to repair in production data.
Key takeaways
- Crash points are part of the design; specify recovery after each state mutation.
- Separate durable state from derived state; derived must be recomputable or reconcilable.
- Ack semantics must be explicit: durable, best-effort, or ambiguous.
- Treat retries, reordering, and partial failure as default conditions.
- Automate guardrails; humans are for judgment, not for consistent enforcement.
Why this matters
- Undefined behavior is an attack surface when inputs are adversarial.
- Correctness bugs are indistinguishable from security incidents when the environment is adversarial.
- Correctness is a property you enforce at boundaries: parsing, persistence, concurrency, RPC.
- A system without explicit contracts becomes a collection of folklore and dashboards.
Key questions
- What exactly is the state, and what is derived or cached?
- What does a client learn after a timeout: success, failure, or ambiguity?
- How do you ensure deduplication is scoped correctly (tenant, resource, operation)?
- Which invariants must hold across crashes, restarts, and partial deployments?
- How do you make “unsafe defaults” impossible to ship?
- Where do you need atomicity (and where is eventual consistency acceptable)?
Assumptions
- Clients retry with backoff but not with perfect discipline (bursts happen).
- Input is hostile: malformed, oversized, boundary values, protocol confusion.
- Crashes happen mid-write (torn state) unless you prove otherwise.
- Requests can be duplicated, reordered, delayed, and replayed across restarts.
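The torn-write assumption above can be ruled out for a single small file with the write-to-temp, fsync, rename pattern. The sketch below assumes a POSIX-like file system where a rename within one directory is atomic; the function name `atomic_write_json` and the JSON payload are illustrative only.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, payload: dict) -> None:
    """Replace `path` so readers see the old file or the new one, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to stable storage
        os.replace(tmp_path, path)  # atomic rename (assumption: POSIX-like semantics)
    except BaseException:
        # The target file was never touched, so recovery is just cleanup.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Best effort: persist the directory entry too (O_DIRECTORY is absent on Windows).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A crash before the rename leaves the previous file intact plus a stray temp file; a crash after it leaves the new file. Both are states that recovery can reason about.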
Non-goals
- Letting recovery be “restart the service and hope.”
- Baking invariants into tribal knowledge instead of code.
Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.
Model & invariants
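We want a transition function and an invariant such that the invariant holds in the initial state and every accepted transition preserves it: if the invariant holds before a transition, it holds after it, and inputs that would break it are rejected.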
Treat invariants as a first-class interface: a function that cannot check its invariants cannot be safely composed. Start with the smallest invariant that is both meaningful and enforceable at your boundaries.
Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.
Make the “impossible state” observable: a metric or alert that fires when invariants drift.
Security properties
- Evidence: critical actions emit verifiable audit events.
- Least authority: privileges are scoped by purpose and time.
- Integrity: invalid transitions are rejected (and detectable).
- Downgrade resistance: negotiation can’t silently weaken security posture.
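The themes in the TL;DR (domain separation, KDFs, context binding) map onto "scoped by purpose" fairly directly: derive one key per purpose from a master secret, and bind the context into the derivation. The sketch below implements HKDF with SHA-256 as described in RFC 5869 using only the standard library. The label `example-app/v1`, the purpose and tenant fields, and the helper names are assumptions made for illustration; production code would normally use a vetted library rather than a hand-written KDF.

```python
import hashlib
import hmac


def hkdf_sha256(key_material: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it with `info`."""
    if length > 255 * 32:
        raise ValueError("requested output too long")
    prk = hmac.new(salt or b"\x00" * 32, key_material, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def encode_context(*fields: bytes) -> bytes:
    """Length-prefix each field so ("ab", "c") and ("a", "bc") encode differently."""
    return b"".join(len(f).to_bytes(4, "big") + f for f in fields)


def derive_key(master: bytes, purpose: str, tenant: str) -> bytes:
    """One key per (application version, purpose, tenant); the label is hypothetical."""
    info = encode_context(b"example-app/v1", purpose.encode(), tenant.encode())
    return hkdf_sha256(master, salt=b"", info=info)


def tag(master: bytes, purpose: str, tenant: str, message: bytes) -> bytes:
    """MAC a message under the purpose-specific key."""
    return hmac.new(derive_key(master, purpose, tenant), message, hashlib.sha256).digest()


def verify(master: bytes, purpose: str, tenant: str, message: bytes, candidate: bytes) -> bool:
    """Constant-time comparison; a tag made for another purpose does not verify."""
    return hmac.compare_digest(tag(master, purpose, tenant, message), candidate)
```

Because the purpose and tenant feed the derivation, a tag minted for one context fails verification in another, which keeps a value from being replayed across protocols that share a master secret. Including a version label in the context is one way to keep a negotiated downgrade from silently reusing keys.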
Failure modes
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Timeout ambiguity causing double-apply or partial state transitions.
- Mixed-version behavior that violates assumptions silently.
- Config drift that weakens security posture over time.
A recovery plan that isn’t exercised will fail when you need it.
Design sketch
flowchart TD
input["Input"] --> parse["Parse/Validate"]
parse --> decide["Decide (pure)"]
decide --> write["Durable write"]
write --> ack["Acknowledge"]
ack --> obs["Emit evidence (logs/metrics)"]
Implementation notes
Implementation is the act of making invalid state unrepresentable (or at least unignorable).
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
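As a sketch of bounding work before doing it, the parser below checks size first, then structure, then per-item ranges. The limits, the `items` field, and the `BadRequest` type are hypothetical; real limits depend on the endpoint.

```python
import json

# Hypothetical limits for illustration.
MAX_BODY_BYTES = 64 * 1024
MAX_ITEMS = 100
MAX_VALUE = 1_000_000


class BadRequest(ValueError):
    """Input rejected at the boundary."""


def parse_request(raw: bytes) -> list:
    """Return the validated list of item values, or raise BadRequest."""
    # Cheapest check first: refuse oversized bodies before parsing anything.
    if len(raw) > MAX_BODY_BYTES:
        raise BadRequest("body too large")
    try:
        doc = json.loads(raw)
    except (ValueError, RecursionError) as exc:
        # ValueError covers malformed JSON and bad encodings;
        # RecursionError covers pathologically deep nesting.
        raise BadRequest("malformed JSON") from exc
    if not isinstance(doc, dict) or not isinstance(doc.get("items"), list):
        raise BadRequest("expected an object with an 'items' list")
    items = doc["items"]
    if len(items) > MAX_ITEMS:
        raise BadRequest("too many items")
    for item in items:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(item, bool) or not isinstance(item, int):
            raise BadRequest("items must be integers")
        if not 0 <= item <= MAX_VALUE:
            raise BadRequest("item out of range")
    return items
```

Everything downstream of `parse_request` can then assume a bounded, well-typed input instead of re-checking it.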
Correctness checklist:
1) Define state (durable vs derived).
2) Enumerate transitions.
3) Write invariants (safety) and progress conditions (liveness).
4) Pick crash points and specify recovery.
5) Make retries part of semantics (idempotency keys, monotonic versions).
Verification strategy
- Differential tests against a reference model (even a slow one).
- Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
- Property-based tests: generate adversarial sequences and assert invariants after every step.
- Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
- Fault injection: latency, partial writes, dropped acks, and duplicated messages.
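The first and third items above can be combined in a few lines without a testing framework. This sketch uses the standard `random` module rather than a property-testing library; `FastCounter` and `reference_total` are toy stand-ins for an implementation and its slow reference model.

```python
import random


class FastCounter:
    """Implementation under test: applies each request id at most once."""

    def __init__(self):
        self.total = 0
        self.seen = set()

    def apply(self, request_id: int, amount: int) -> bool:
        if request_id in self.seen:
            return False  # duplicate: no effect
        self.seen.add(request_id)
        self.total += amount
        return True


def reference_total(history: list) -> int:
    """Slow reference model: replay the history, keeping the first amount per id."""
    first = {}
    for request_id, amount in history:
        first.setdefault(request_id, amount)
    return sum(first.values())


def run_trial(seed: int, steps: int = 200, make_impl=FastCounter) -> None:
    rng = random.Random(seed)  # seeded so a failure is reproducible
    impl, history = make_impl(), []
    for _ in range(steps):
        request_id = rng.randrange(20)  # small id space forces duplicates
        amount = rng.randrange(1, 10)
        impl.apply(request_id, amount)
        history.append((request_id, amount))
        # Assert after every step, not only at the end of the sequence.
        assert impl.total == reference_total(history), f"diverged, seed={seed}"
        assert impl.total >= 0, f"invariant violated, seed={seed}"


for seed in range(50):
    run_trial(seed)
```

Printing the seed in the assertion message is the cheap part that matters most: it turns a rare failure into a deterministic regression test.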
Operational notes
- Validate time assumptions: alert on clock steps, skew, and monotonicity issues.
- Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
- Design “degraded modes” explicitly (fail closed vs fail open per operation).
- Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
- Instrument ambiguity: measure “unknown outcome” responses separately from failures.
Whether an operation fails closed or fails open is a policy choice; decide it ahead of time rather than during an incident.
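A sketch of explicit idempotency semantics, with deduplication scoped by tenant, resource, and operation and bounded by a retention window. `IdempotencyStore` and its in-memory dictionary are hypothetical; a real store would be durable and written together with the effect it guards, which this sketch does not attempt.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# (tenant, resource, operation, idempotency key)
Scope = Tuple[str, str, str, str]


@dataclass
class Entry:
    result: str
    expires_at: float


class IdempotencyStore:
    """In-memory illustration only; entries do not survive a restart."""

    def __init__(self, retention_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._retention = retention_seconds
        self._clock = clock
        self._entries: Dict[Scope, Entry] = {}

    def execute(self, tenant: str, resource: str, operation: str, key: str,
                action: Callable[[], str]) -> Tuple[str, bool]:
        """Return (result, replayed). The action runs at most once per live scope."""
        scope = (tenant, resource, operation, key)
        now = self._clock()
        entry = self._entries.get(scope)
        if entry is not None and entry.expires_at > now:
            return entry.result, True  # duplicate inside the retention window
        # If the action raises, nothing is recorded and a retry will run it again;
        # whether that is safe depends on the action itself.
        result = action()
        self._entries[scope] = Entry(result, now + self._retention)
        return result, False
```

Scoping the key matters: the same client-chosen key from two tenants, or for two different operations, must not collide. After the retention window expires the key is treated as new, so the window length is part of the API contract.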
What to monitor
- Invariant violation rate (should be ~0).
- Authz failures and policy denials (unexpected spikes).
- Admission-control / rate-limit rejections (by reason).
- Retry/timeout rates by endpoint and client cohort.
- Rollback events and the conditions that triggered them.
Rollback plan
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Use canaries and staged rollout; stop early when signals degrade.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Define an explicit rollback trigger (metrics + thresholds).
- Keep dual-write / dual-verify windows where appropriate.
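An explicit rollback trigger can be as small as a table of metrics and thresholds evaluated against canary data. The metric names and threshold values below are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical triggers; real values come from your own baselines.
TRIGGERS = [
    Threshold("invariant_violation_rate", 0.0),
    Threshold("error_rate", 0.02),
    Threshold("p99_latency_ms", 800.0),
]


def rollback_reasons(canary_metrics: dict) -> list:
    """Return the reasons to roll back; an empty list means keep going."""
    reasons = []
    for trigger in TRIGGERS:
        value = canary_metrics.get(trigger.metric)
        if value is None:
            # Fail closed: a missing signal is treated as a breach.
            reasons.append(f"{trigger.metric}: missing")
        elif value > trigger.max_value:
            reasons.append(f"{trigger.metric}: {value} > {trigger.max_value}")
    return reasons
```

Returning the reasons rather than a bare boolean keeps the evidence for why a rollback fired, which is what the first bullet in this list asks for.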
Evidence
- Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
- Jepsen (2) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Open questions
- Which correctness properties can be enforced at compile time (types/capabilities)?
- Which invariant, if violated, would silently corrupt state for weeks?
- Which operations need monotonic versioning vs idempotency keys vs both?
- Where does your API currently allow ambiguous outcomes, and how will clients cope?
Checklist
- Telemetry captures correctness signals.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Rollback plan rehearsed and automated.
- Safety properties stated as invariants.
- Failure modes enumerated with mitigations.
- Assumptions listed and reviewed.
Further reading
- Time, Clocks, and the Ordering of Events (Lamport, 1978) — The mental model for causality and ordering in distributed systems.
- Paxos Made Simple (Lamport) — A clean reference for agreement and invariants.
- RFC 9110: HTTP Semantics — Defines method semantics including idempotency and safety—useful for API contracts.
- Learn TLA+ — A pragmatic workflow for invariants and model checking.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Jepsen — Fault injection and correctness testing for distributed systems.