Monthly research note. Theme: Correctness & Foundations.
TL;DR
Threat Modeling for Engineers: Assumptions as Interfaces. Treat threat modeling as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
Key takeaways
- Ack semantics must be explicit: durable, best-effort, or ambiguous.
- Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
- Crash points are part of the design; specify recovery after each state mutation.
- Measure correctness signals, not only latency/throughput.
- Write assumptions down; treat them as interfaces.
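The explicit-ack takeaway and the "third outcome" from the TL;DR can be written down as a small result type. This is a minimal sketch, not a definitive design: `Outcome`, `Classify`, `timedOutCall`, and `errInvalid` are hypothetical names, and the rule that every non-timeout error is a confirmed rejection is an assumption that real transports often break (a reset connection is also ambiguous).

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Outcome is a hypothetical three-valued result for a remote write.
type Outcome int

const (
	Applied  Outcome = iota // the server confirmed a durable write
	Rejected                // the server confirmed nothing was applied
	Unknown                 // no confirmation either way (for example, a timeout)
)

func (o Outcome) String() string {
	return [...]string{"applied", "rejected", "unknown"}[o]
}

// errInvalid stands in for an explicit rejection from the server.
var errInvalid = errors.New("validation failed")

// Classify maps a call error to an Outcome.
// Assumption: any error other than a deadline or cancellation is an
// explicit rejection. Adjust this for your transport.
func Classify(err error) Outcome {
	switch {
	case err == nil:
		return Applied
	case errors.Is(err, context.DeadlineExceeded), errors.Is(err, context.Canceled):
		return Unknown
	default:
		return Rejected
	}
}

// timedOutCall simulates a request whose deadline has already passed.
func timedOutCall() error {
	ctx, cancel := context.WithTimeout(context.Background(), 0)
	defer cancel()
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	fmt.Println(Classify(nil))
	fmt.Println(Classify(timedOutCall()))
	fmt.Println(Classify(errInvalid))
}
```

Callers that receive `Unknown` should reconcile (query state or retry with an idempotency key) rather than treat the call as failed.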
Why this matters
- Undefined behavior is an attack surface when inputs are adversarial.
- Correctness bugs are indistinguishable from security incidents when the environment is adversarial.
- Performance work that changes semantics is a correctness regression with a nicer latency chart.
- Your on-call runbook is part of the specification—make it match the code.
Key questions
- Where do you need atomicity (and where is eventual consistency acceptable)?
- What does a client learn after a timeout: success, failure, or ambiguity?
- Which invariants must hold across crashes, restarts, and partial deployments?
- Which transitions are allowed, and which are impossible by construction?
- How do you make “unsafe defaults” impossible to ship?
- What exactly is the state, and what is derived or cached?
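For the question about allowed transitions, one option is an explicit transition table where anything not listed is rejected. The states below (`Pending`, `Reserved`, `Committed`, `Aborted`) are hypothetical, and this is a runtime check, which is weaker than making illegal transitions unrepresentable in the type system.

```go
package main

import "fmt"

// State is a hypothetical lifecycle for a single operation.
type State string

const (
	Pending   State = "pending"
	Reserved  State = "reserved"
	Committed State = "committed"
	Aborted   State = "aborted"
)

// allowed lists every legal transition; anything absent is rejected.
// Committed and Aborted have no entry, so they are terminal.
var allowed = map[State][]State{
	Pending:  {Reserved, Aborted},
	Reserved: {Committed, Aborted},
}

// Transition returns the next state, or the unchanged state and an
// error if the move is not in the table.
func Transition(from, to State) (State, error) {
	for _, next := range allowed[from] {
		if next == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("illegal transition %s -> %s", from, to)
}

func main() {
	s, err := Transition(Pending, Reserved)
	fmt.Println(s, err)
	s, err = Transition(Committed, Pending)
	fmt.Println(s, err)
}
```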
Assumptions
- Input is hostile: malformed, oversized, boundary values, protocol confusion.
- Observability is incomplete: you will debug from partial evidence.
- Deployments are mixed-version for longer than you think.
- Crashes happen mid-write (torn state) unless you prove otherwise.
Non-goals
- Relying on “best effort” client behavior for safety properties.
- Letting recovery be “restart the service and hope.”
Parsing is an attacker-controlled interface—validate early and fail fast.
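A sketch of "validate early and fail fast" for a single field, assuming a hypothetical quantity parameter. The limits and error names are illustrative; the point is the order of checks, cheapest first, so hostile input is rejected before any expensive work.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Limits are assumptions for illustration; derive real ones from your cost budget.
const (
	maxDigits   = 6
	maxQuantity = 100_000
)

var (
	ErrOversized = errors.New("input too long")
	ErrMalformed = errors.New("input is not a decimal integer")
	ErrRange     = errors.New("quantity out of range")
)

// ParseQuantity rejects hostile input in order of cost: length first,
// then character set, then numeric range.
func ParseQuantity(s string) (int, error) {
	if len(s) > maxDigits {
		return 0, ErrOversized
	}
	if len(s) == 0 {
		return 0, ErrMalformed
	}
	for i := 0; i < len(s); i++ {
		if s[i] < '0' || s[i] > '9' {
			return 0, ErrMalformed
		}
	}
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, ErrMalformed
	}
	if n < 1 || n > maxQuantity {
		return 0, ErrRange
	}
	return n, nil
}

func main() {
	for _, in := range []string{"42", "-1", "0", "99999999999999999999"} {
		n, err := ParseQuantity(in)
		fmt.Println(in, n, err)
	}
}
```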
Model & invariants
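A common pattern is splitting state into durable and derived: durable state is the source of truth, and derived state (caches, indexes, aggregates) must be recomputable from it.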
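One way to picture the durable/derived split, as a sketch with hypothetical types: `Entry` is the durable record, `Balances` is derived, and a cached view is only trusted if it matches a rebuild from the durable record.

```go
package main

import "fmt"

// Entry is a hypothetical durable record: one line of an append-only ledger.
type Entry struct {
	Account string
	Delta   int64
}

// Balances is derived state: it can always be rebuilt from the entries.
type Balances map[string]int64

// Rebuild recomputes derived state from the durable record alone.
func Rebuild(entries []Entry) Balances {
	b := Balances{}
	for _, e := range entries {
		b[e.Account] += e.Delta
	}
	return b
}

// Validate reports whether a cached view still matches the durable record.
func Validate(cached Balances, entries []Entry) bool {
	fresh := Rebuild(entries)
	if len(fresh) != len(cached) {
		return false
	}
	for k, v := range fresh {
		if cv, ok := cached[k]; !ok || cv != v {
			return false
		}
	}
	return true
}

func main() {
	entries := []Entry{{"a", 10}, {"b", 5}, {"a", -3}}
	cache := Rebuild(entries)
	fmt.Println(cache["a"], Validate(cache, entries))
	cache["a"] = 99 // simulate a corrupted cache
	fmt.Println(Validate(cache, entries))
}
```

A full rebuild is the simplest validator; larger systems would likely sample or checksum instead, which is outside this sketch.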
Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.
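A minimal sketch of replay detection with per-sender sequence numbers. `Receiver` and `Accept` are hypothetical names. The sketch assumes senders number messages from 1 and never reuse a number, and it tolerates gaps; a protocol that needs strict ordering would reject gaps as well.

```go
package main

import "fmt"

// Receiver tracks the highest sequence number applied per sender.
type Receiver struct {
	last map[string]uint64
}

func NewReceiver() *Receiver {
	return &Receiver{last: map[string]uint64{}}
}

// Accept reports whether the message should be applied. A message is
// applied only if its sequence number advances the sender's counter;
// replays and stale messages return false and leave state unchanged.
func (r *Receiver) Accept(sender string, seq uint64) bool {
	if seq <= r.last[sender] {
		return false
	}
	r.last[sender] = seq
	return true
}

func main() {
	r := NewReceiver()
	fmt.Println(r.Accept("a", 1)) // first delivery
	fmt.Println(r.Accept("a", 1)) // replay of the same message
	fmt.Println(r.Accept("a", 3)) // newer message, gap tolerated
	fmt.Println(r.Accept("a", 2)) // stale message arriving late
}
```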
Treat invariants as a first-class interface: a function that cannot check its invariants cannot be safely composed. Start with the smallest invariant that is both meaningful and enforceable at your boundaries.
If the system can enter an invalid state, it eventually will—usually during an incident.
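An invariant becomes an interface once it is a function the code can call. The sketch below uses a hypothetical `Inventory` type with one small invariant and checks it before committing a mutation, so a rejected call leaves state untouched.

```go
package main

import (
	"errors"
	"fmt"
)

// Inventory is a hypothetical aggregate with one small invariant:
// 0 <= Reserved <= OnHand.
type Inventory struct {
	OnHand   int
	Reserved int
}

var ErrInvariant = errors.New("invariant violated: 0 <= reserved <= on hand")

// Check is the executable form of the invariant.
func (inv Inventory) Check() error {
	if inv.OnHand < 0 || inv.Reserved < 0 || inv.Reserved > inv.OnHand {
		return ErrInvariant
	}
	return nil
}

// Reserve computes the next state on a copy and commits it only if the
// invariant holds, so a rejected call leaves the receiver unchanged.
func (inv *Inventory) Reserve(n int) error {
	next := *inv
	next.Reserved += n
	if err := next.Check(); err != nil {
		return err
	}
	*inv = next
	return nil
}

func main() {
	inv := Inventory{OnHand: 5}
	fmt.Println(inv.Reserve(3), inv.Reserved)
	fmt.Println(inv.Reserve(3), inv.Reserved)
}
```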
Security properties
- Least authority: privileges are scoped by purpose and time.
- Authenticity: actions are bound to identity and purpose.
- Replay resistance: duplicated inputs do not change outcomes.
- Evidence: critical actions emit verifiable audit events.
Failure modes
- Timeout ambiguity causing double-apply or partial state transitions.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Config drift that weakens security posture over time.
- Recovery paths that only work when nothing is broken.
Caches tend to become sources of truth unless you can recompute and validate them.
Design sketch
flowchart TD
input["Input"] --> parse["Parse/Validate"]
parse --> decide["Decide (pure)"]
decide --> write["Durable write"]
write --> ack["Acknowledge"]
ack --> obs["Emit evidence (logs/metrics)"]
Implementation notes
Implementation is the act of making invalid state unrepresentable (or at least unignorable).
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
// Idempotency sketch: reserve -> execute -> commit result (or return cached).
type Key string

// Store is the minimal durable interface the sketch relies on.
type Store interface {
	// Get returns the recorded result for key; ok is false if none exists.
	Get(key Key) (value []byte, ok bool, err error)
	// PutIfAbsent records value only if key has no entry yet.
	PutIfAbsent(key Key, value []byte) (stored bool, err error)
}

// A "timeout" must not mean "try again and maybe double-apply".
Verification strategy
- Fault injection: latency, partial writes, dropped acks, and duplicated messages.
- Property-based tests: generate adversarial sequences and assert invariants after every step.
- Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
- Crash/restart tests: persist mid-transition and validate recovery correctness.
- Metamorphic tests: applying the same operation twice must yield the same result as applying it once.
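The duplicated-message and metamorphic checks can be exercised against a small in-memory version of the `Store` interface from the implementation notes. This is a sketch under stated assumptions: `memStore` and `Do` are hypothetical, and `Do` skips the reserve step, so two concurrent first calls may both execute the operation even though only one result is recorded.

```go
package main

import (
	"fmt"
	"sync"
)

type Key string

// Store mirrors the interface from the implementation notes.
type Store interface {
	Get(key Key) (value []byte, ok bool, err error)
	PutIfAbsent(key Key, value []byte) (stored bool, err error)
}

// memStore is a hypothetical in-memory Store intended for tests only.
type memStore struct {
	mu sync.Mutex
	m  map[Key][]byte
}

func newMemStore() *memStore { return &memStore{m: map[Key][]byte{}} }

func (s *memStore) Get(k Key) ([]byte, bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.m[k]
	return v, ok, nil
}

func (s *memStore) PutIfAbsent(k Key, v []byte) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.m[k]; ok {
		return false, nil
	}
	s.m[k] = v
	return true, nil
}

// Do returns the recorded result for key if one exists; otherwise it
// runs op and records the result. Failed operations are not recorded.
func Do(s Store, key Key, op func() ([]byte, error)) ([]byte, error) {
	if v, ok, err := s.Get(key); err != nil || ok {
		return v, err
	}
	out, err := op()
	if err != nil {
		return nil, err
	}
	stored, err := s.PutIfAbsent(key, out)
	if err != nil {
		return nil, err
	}
	if !stored {
		// Another caller recorded a result first; return theirs.
		v, _, err := s.Get(key)
		return v, err
	}
	return out, nil
}

func main() {
	s := newMemStore()
	calls := 0
	op := func() ([]byte, error) { calls++; return []byte("done"), nil }
	first, _ := Do(s, "req-1", op)
	second, _ := Do(s, "req-1", op) // duplicate delivery of the same request
	fmt.Println(string(first), string(second), calls)
}
```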
Operational notes
- Instrument ambiguity: measure “unknown outcome” responses separately from failures.
- Make rollbacks safe: schema and protocol compatibility is a security boundary.
- Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
- Log as evidence: append-only where possible; isolate logs from compromised workloads.
- Design “degraded modes” explicitly: fail closed vs fail open is a policy choice, made per operation.
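A per-operation degraded-mode policy can be as small as a lookup table. The operation names below are hypothetical, and the choice of fail closed as the default for unlisted operations is an assumption of this sketch, not a rule from the note.

```go
package main

import "fmt"

// Mode says what to do when a dependency cannot answer.
type Mode int

const (
	FailClosed Mode = iota // deny when the dependency is unavailable
	FailOpen               // allow when the dependency is unavailable
)

// policy is a hypothetical per-operation table. Operations not listed
// fall back to FailClosed because it is the zero value of Mode.
var policy = map[string]Mode{
	"read_public_page": FailOpen,
	"transfer_funds":   FailClosed,
}

// Allow decides an operation given whether the authorization dependency
// answered (depUp) and, if it did, what it said (depAllows).
func Allow(op string, depUp, depAllows bool) bool {
	if depUp {
		return depAllows
	}
	return policy[op] == FailOpen
}

func main() {
	fmt.Println(Allow("read_public_page", false, false))
	fmt.Println(Allow("transfer_funds", false, false))
	fmt.Println(Allow("unlisted_operation", false, false))
}
```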
What to monitor
- Error budget burn + tail latency under load.
- Authz failures and policy denials (unexpected spikes).
- Retry/timeout rates by endpoint and client cohort.
- Rollback events and the conditions that triggered them.
- Invariant violation rate (should be ~0).
Rollback plan
- Keep dual-write / dual-verify windows where appropriate.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Define an explicit rollback trigger (metrics + thresholds).
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Use canaries and staged rollout; stop early when signals degrade.
Evidence
- Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
- RFC 9110: HTTP Semantics (2) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.
Open questions
- Which correctness properties can be enforced at compile time (types/capabilities)?
- What is the minimal durable record needed to recover safely?
- Which operations need monotonic versioning vs idempotency keys vs both?
- Which invariant, if violated, would silently corrupt state for weeks?
Checklist
- Safety properties stated as invariants.
- Assumptions listed and reviewed.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Failure modes enumerated with mitigations.
- Rollback plan rehearsed and automated.
- Telemetry captures correctness signals.
Further reading
- Time, Clocks, and the Ordering of Events (Lamport, 1978) — The mental model for causality and ordering in distributed systems.
- RFC 9110: HTTP Semantics — Defines method semantics including idempotency and safety—useful for API contracts.
- Paxos Made Simple (Lamport) — A clean reference for agreement and invariants.
- Learn TLA+ — A pragmatic workflow for invariants and model checking.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.