Security vs Reliability: When the Same Bug Has Two Names

Monthly research note. Theme: Correctness & Foundations.

TL;DR

Security vs Reliability: When the Same Bug Has Two Names as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
Ack semantics must be explicit: durable, best-effort, or ambiguous.
Measure correctness signals, not only latency/throughput.
Define safety properties before performance goals.

Why this matters

Correctness is a property you enforce at boundaries: parsing, persistence, concurrency, RPC.
The cost of unclear invariants is paid in production, under load, during an incident.
Correctness bugs are indistinguishable from security incidents when the system is adversarial.
Most outages are “state management” failures: partial writes, ambiguous outcomes, invalid transitions.

Key questions

Which invariants must hold across crashes, restarts, and partial deployments?
What is your ordering model: FIFO per key, per partition, or none at all?
What exactly is the state, and what is derived or cached?
How do you ensure deduplication is scoped correctly (tenant, resource, operation)?
Which transitions are allowed, and which are impossible by construction?
What does a client learn after a timeout: success, failure, or ambiguity?

Assumptions

Deployments are mixed-version for longer than you think.
Requests can be duplicated, reordered, delayed, and replayed across restarts.
Partial failure is normal: one replica slow, one unavailable, one returning stale data.
Input is hostile: malformed, oversized, boundary values, protocol confusion.

Non-goals

Baking invariants into tribal knowledge instead of code.
Letting recovery be “restart the service and hope.”

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

We want a transition function $\delta$ and invariant $\mathrm{Inv}$ such that:

s_{t+1} = \delta(s_t, e_t)\qquad\wedge\qquad \mathrm{Inv}(s_t)\Rightarrow \mathrm{Inv}(s_{t+1}).

If you can’t define what a timeout means, you can’t implement retries safely. Make ambiguity explicit in the API.

Crash points matter: define what happens if the process stops after each line that mutates state or acknowledges work.

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

Authenticity: actions are bound to identity and purpose.
Evidence: critical actions emit verifiable audit events.
Downgrade resistance: negotiation can’t silently weaken security posture.
Integrity: invalid transitions are rejected (and detectable).

Failure modes

Recovery paths that only work when nothing is broken.
Mixed-version behavior that violates assumptions silently.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Timeout ambiguity causing double-apply or partial state transitions.

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart TD
  input["Input"] --> parse["Parse/Validate"]
  parse --> decide["Decide (pure)"]
  decide --> write["Durable write"]
  write --> ack["Acknowledge"]
  ack --> obs["Emit evidence (logs/metrics)"]

Implementation notes

The goal isn’t cleverness—it’s eliminating ambiguity at boundaries and making recovery boring.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.

// Idempotency sketch: reserve -> execute -> commit result (or return cached).
type Key string

type Store interface {
  Get(key Key) (value []byte, ok bool, err error)
  PutIfAbsent(key Key, value []byte) (stored bool, err error)
}

// Security vs Reliability: When the Same Bug Has Two Names: "timeout" must not mean "try again and maybe double-apply".

Verification strategy

Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
Metamorphic tests: same operation applied twice must not change the result.
Fault injection: latency, partial writes, dropped acks, and duplicated messages.
Property-based tests: generate adversarial sequences and assert invariants after every step.
Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.

Operational notes

Log as evidence: append-only where possible; isolate logs from compromised workloads.
Track invariant violations as pages, not dashboards.
Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
Instrument ambiguity: measure “unknown outcome” responses separately from failures.
Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Retry/timeout rates by endpoint and client cohort.
Rollback events and the conditions that triggered them.
Invariant violation rate (should be ~0).
Error budget burn + tail latency under load.
Authz failures and policy denials (unexpected spikes).

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Keep dual-write / dual-verify windows where appropriate.
Define an explicit rollback trigger (metrics + thresholds).

Evidence

Learn TLA+ (1) — A pragmatic workflow for invariants and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

Which operations need monotonic versioning vs idempotency keys vs both?
What is the minimal durable record needed to recover safely?
Where does your API currently allow ambiguous outcomes, and how will clients cope?
Which correctness properties can be enforced at compile time (types/capabilities)?

Checklist

Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Rollback plan rehearsed and automated.
Telemetry captures correctness signals.
Safety properties stated as invariants.
Assumptions listed and reviewed.
Failure modes enumerated with mitigations.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading