Monthly research note. Theme: Correctness & Foundations.

TL;DR

Treat the overlap between security and reliability (the same bug under two names) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

  • Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
  • Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
  • Ack semantics must be explicit: durable, best-effort, or ambiguous.
  • Measure correctness signals, not only latency/throughput.
  • Define safety properties before performance goals.
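The first takeaway can be illustrated with a fencing check. This is a minimal sketch, not a prescribed design: `Fenced` and `Write` are hypothetical names, and the epoch is assumed to be issued by some external coordinator that only ever increases it.

```go
package main

import "fmt"

// Fenced is a hypothetical resource guarded by a monotonic epoch.
type Fenced struct {
	epoch uint64 // highest epoch accepted so far
	value string
}

// Write applies the update only if the caller's epoch is not older than the
// newest one already accepted. Wall-clock time plays no part in the decision,
// so clock skew cannot let a stale writer win.
func (f *Fenced) Write(epoch uint64, value string) error {
	if epoch < f.epoch {
		return fmt.Errorf("stale epoch %d < %d", epoch, f.epoch)
	}
	f.epoch = epoch
	f.value = value
	return nil
}
```

A writer that paused (GC, network partition) and resumes with an old epoch is rejected, however recent its local timestamp looks.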

Why this matters

  • Correctness is a property you enforce at boundaries: parsing, persistence, concurrency, RPC.
  • The cost of unclear invariants is paid in production, under load, during an incident.
  • In an adversarial environment, correctness bugs are indistinguishable from security incidents.
  • Most outages are “state management” failures: partial writes, ambiguous outcomes, invalid transitions.

Key questions

  • Which invariants must hold across crashes, restarts, and partial deployments?
  • What is your ordering model: FIFO per key, per partition, or none at all?
  • What exactly is the state, and what is derived or cached?
  • How do you ensure deduplication is scoped correctly (tenant, resource, operation)?
  • Which transitions are allowed, and which are impossible by construction?
  • What does a client learn after a timeout: success, failure, or ambiguity?

Assumptions

  • Deployments are mixed-version for longer than you think.
  • Requests can be duplicated, reordered, delayed, and replayed across restarts.
  • Partial failure is normal: one replica slow, one unavailable, one returning stale data.
  • Input is hostile: malformed, oversized, boundary values, protocol confusion.

Non-goals

  • Baking invariants into tribal knowledge instead of code.
  • Letting recovery be “restart the service and hope.”

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

We want a transition function $\delta$ and invariant $\mathrm{Inv}$ such that:

$$s_{t+1} = \delta(s_t, e_t)\qquad\wedge\qquad \mathrm{Inv}(s_t)\Rightarrow \mathrm{Inv}(s_{t+1}).$$
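One way to make the formula above executable is to wrap the transition function in a step that refuses any successor state violating the invariant. The sketch below assumes a toy state (a balance that must stay non-negative); `State`, `Event`, `Delta`, `Inv`, and `Step` are illustrative names, and integer overflow is ignored.

```go
package main

import "errors"

// State and Event are hypothetical: a balance and a signed adjustment.
type State struct{ Balance int64 }
type Event struct{ Delta int64 }

// Inv is the invariant: the balance never goes negative.
func Inv(s State) bool { return s.Balance >= 0 }

// Delta is the pure transition function; it has no side effects.
func Delta(s State, e Event) State { return State{Balance: s.Balance + e.Delta} }

// ErrInvariant reports a transition that would leave the valid state space.
var ErrInvariant = errors.New("transition would violate invariant")

// Step computes the successor state and rejects it if Inv fails, returning
// the unchanged input state so the caller never observes an invalid state.
func Step(s State, e Event) (State, error) {
	next := Delta(s, e)
	if !Inv(next) {
		return s, ErrInvariant
	}
	return next, nil
}
```

Because `Delta` is pure, the same check can run in production code, in tests, and in a replay of the event log.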

If you can’t define what a timeout means, you can’t implement retries safely. Make ambiguity explicit in the API.

Crash points matter: define what happens if the process stops after each line that mutates state or acknowledges work.

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

  • Authenticity: actions are bound to identity and purpose.
  • Evidence: critical actions emit verifiable audit events.
  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Integrity: invalid transitions are rejected (and detectable).
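The integrity property can be sketched as an allow-list of transitions, so that anything not listed is rejected by default. The order lifecycle below (`Pending`, `Paid`, `Shipped`, `Cancelled`) is purely illustrative and not taken from any particular system.

```go
package main

import "fmt"

// Status is a hypothetical lifecycle state for an order.
type Status string

const (
	Pending   Status = "pending"
	Paid      Status = "paid"
	Shipped   Status = "shipped"
	Cancelled Status = "cancelled"
)

// allowed lists every legal transition. States with no entry (Shipped,
// Cancelled) are terminal, and any pair not listed here is rejected.
var allowed = map[Status][]Status{
	Pending: {Paid, Cancelled},
	Paid:    {Shipped, Cancelled},
}

// Transition returns nil if from -> to is legal and a descriptive error
// otherwise, which makes rejected attempts easy to log as audit evidence.
func Transition(from, to Status) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("invalid transition %s -> %s", from, to)
}
```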

Failure modes

  • Recovery paths that only work when nothing is broken.
  • Mixed-version behavior that violates assumptions silently.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Timeout ambiguity causing double-apply or partial state transitions.

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart TD
  input["Input"] --> parse["Parse/Validate"]
  parse --> decide["Decide (pure)"]
  decide --> write["Durable write"]
  write --> ack["Acknowledge"]
  ack --> obs["Emit evidence (logs/metrics)"]

Implementation notes

The goal isn’t cleverness—it’s eliminating ambiguity at boundaries and making recovery boring.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.

// Idempotency sketch: reserve -> execute -> commit result (or return cached).
type Key string

type Store interface {
  Get(key Key) (value []byte, ok bool, err error)
  PutIfAbsent(key Key, value []byte) (stored bool, err error)
}

// A "timeout" must not mean "try again and maybe double-apply".
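The interface above can be exercised with a small in-memory implementation. This is a sketch of the "return cached" half only: `memStore`, `newMemStore`, and `Do` are hypothetical, the store is not durable, and because the interface has no reservation step, two concurrent first attempts may both run the operation (only one result is recorded). `Key` and `Store` are repeated so the block compiles on its own.

```go
package main

import "sync"

type Key string

type Store interface {
	Get(key Key) (value []byte, ok bool, err error)
	PutIfAbsent(key Key, value []byte) (stored bool, err error)
}

// memStore is an illustrative, non-durable Store.
type memStore struct {
	mu sync.Mutex
	m  map[Key][]byte
}

func newMemStore() *memStore { return &memStore{m: make(map[Key][]byte)} }

func (s *memStore) Get(key Key) ([]byte, bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.m[key]
	return v, ok, nil
}

func (s *memStore) PutIfAbsent(key Key, value []byte) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.m[key]; exists {
		return false, nil
	}
	s.m[key] = value
	return true, nil
}

// Do returns the recorded result for key if one exists; otherwise it runs op
// and records the result. If another caller recorded first, that result wins.
func Do(s Store, key Key, op func() ([]byte, error)) ([]byte, error) {
	if v, ok, err := s.Get(key); err != nil || ok {
		return v, err
	}
	result, err := op()
	if err != nil {
		return nil, err // nothing recorded, so a retry runs op again
	}
	stored, err := s.PutIfAbsent(key, result)
	if err != nil {
		return nil, err // op ran but was not recorded: an ambiguous outcome
	}
	if !stored {
		v, _, err := s.Get(key)
		return v, err
	}
	return result, nil
}
```

A sequential retry with the same key returns the stored bytes without running `op` a second time. Closing the concurrent-duplicate gap would need a reserve or compare-and-swap primitive that this interface does not define.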

Verification strategy

  • Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
  • Metamorphic tests: applying the same idempotent operation twice must leave the same state as applying it once.
  • Fault injection: latency, partial writes, dropped acks, and duplicated messages.
  • Property-based tests: generate adversarial sequences and assert invariants after every step.
  • Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
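The property-based item can be sketched without a framework: generate a seeded random sequence with many duplicates and compare the system against a trivial reference model after every step. `Ledger` and `CheckDedup` are hypothetical; a real suite would more likely use Go's built-in fuzzing or a property-testing library.

```go
package main

import "math/rand"

// Ledger is a hypothetical system under test: credits are deduplicated by ID.
type Ledger struct {
	seen    map[int]bool
	balance int
}

func NewLedger() *Ledger { return &Ledger{seen: make(map[int]bool)} }

// Apply credits amount once per id; replays of the same id are ignored.
func (l *Ledger) Apply(id, amount int) {
	if l.seen[id] {
		return
	}
	l.seen[id] = true
	l.balance += amount
}

// CheckDedup feeds random, heavily duplicated credits into a Ledger and
// checks after every step that the balance equals the sum over distinct IDs.
// It returns the index of the first violating step, or -1 if none is found.
func CheckDedup(seed int64, steps int) int {
	rng := rand.New(rand.NewSource(seed)) // seeded so failures are replayable
	l := NewLedger()
	model := map[int]int{} // reference model: first amount seen per id
	for i := 0; i < steps; i++ {
		// A small ID space forces frequent duplicates.
		id, amount := rng.Intn(10), rng.Intn(100)
		if _, ok := model[id]; !ok {
			model[id] = amount
		}
		l.Apply(id, amount)
		want := 0
		for _, a := range model {
			want += a
		}
		if l.balance != want {
			return i
		}
	}
	return -1
}
```

Returning the failing step index together with the seed makes a violation replayable.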

Operational notes

  • Log as evidence: append-only where possible; isolate logs from compromised workloads.
  • Track invariant violations as pages, not dashboards.
  • Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
  • Instrument ambiguity: measure “unknown outcome” responses separately from failures.
  • Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

  • Retry/timeout rates by endpoint and client cohort.
  • Rollback events and the conditions that triggered them.
  • Invariant violation rate (should be ~0).
  • Error budget burn + tail latency under load.
  • Authz failures and policy denials (unexpected spikes).

Rollback plan

  • Use canaries and staged rollout; stop early when signals degrade.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Keep dual-write / dual-verify windows where appropriate.
  • Define an explicit rollback trigger (metrics + thresholds).
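An explicit rollback trigger can be as small as a pure function over a metrics window. The sketch below is illustrative only: `Signals`, `Trigger`, and the threshold fields are hypothetical names, and real thresholds would come from the service's error budget.

```go
package main

// Signals is a hypothetical snapshot of canary metrics over one window.
type Signals struct {
	Requests            int
	Errors              int
	InvariantViolations int
}

// Trigger holds the rollback thresholds; the values are placeholders.
type Trigger struct {
	MaxErrorRate float64 // fraction of failed requests that forces a rollback
	MinRequests  int     // below this volume the error rate is not trusted
}

// ShouldRollback reports whether the window breaches the trigger. Any
// invariant violation forces a rollback regardless of traffic volume.
func (t Trigger) ShouldRollback(s Signals) bool {
	if s.InvariantViolations > 0 {
		return true
	}
	if s.Requests == 0 || s.Requests < t.MinRequests {
		return false
	}
	return float64(s.Errors)/float64(s.Requests) > t.MaxErrorRate
}
```

Keeping the decision pure makes it easy to unit test and to replay against recorded metrics after an incident.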

Evidence

  • Learn TLA+ (1) — A pragmatic workflow for invariants and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
  • Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

  • Which operations need monotonic versioning vs idempotency keys vs both?
  • What is the minimal durable record needed to recover safely?
  • Where does your API currently allow ambiguous outcomes, and how will clients cope?
  • Which correctness properties can be enforced at compile time (types/capabilities)?

Checklist

  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Rollback plan rehearsed and automated.
  • Telemetry captures correctness signals.
  • Safety properties stated as invariants.
  • Assumptions listed and reviewed.
  • Failure modes enumerated with mitigations.

Further reading

1. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/
2. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/