Threat Modeling for Engineers: Assumptions as Interfaces

Monthly research note. Theme: Correctness & Foundations.

TL;DR

Treat threat modeling as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
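
One way to make the third outcome explicit is a small result type. This is a minimal sketch, not a library API: `Outcome`, `Classify`, `ErrRejected`, and `ErrTimeout` are all hypothetical names chosen for illustration. The point is that only a definite rejection counts as failure; every other error is treated as ambiguity.

```go
package main

import "errors"

// Outcome is a hypothetical three-valued result for a remote call.
type Outcome int

const (
	Succeeded Outcome = iota // the server confirmed the effect
	Failed                   // the server confirmed there was no effect
	Unknown                  // no confirmation either way
)

// Hypothetical sentinel errors, assumed for this sketch only.
var (
	ErrRejected = errors.New("rejected: not applied")
	ErrTimeout  = errors.New("timeout: outcome unknown")
)

// Classify maps an error to an Outcome. Only a definite rejection is a
// failure; timeouts and unrecognized errors are Unknown, because the
// request may or may not have been applied.
func Classify(err error) Outcome {
	switch {
	case err == nil:
		return Succeeded
	case errors.Is(err, ErrRejected):
		return Failed
	default:
		return Unknown
	}
}
```

Callers then have to handle `Unknown` as its own branch (for example by querying state or retrying with an idempotency key) instead of folding it into `Failed`.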

Key takeaways

Ack semantics must be explicit: durable, best-effort, or ambiguous.
Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
Crash points are part of the design; specify recovery after each state mutation.
Measure correctness signals, not only latency/throughput.
Write assumptions down; treat them as interfaces.

Why this matters

Undefined behavior is an attack surface when inputs are adversarial.
A correctness bug is hard to tell apart from a security incident once an adversary can trigger it on demand.
Performance work that changes semantics is a correctness regression with a nicer latency chart.
Your on-call runbook is part of the specification—make it match the code.

Key questions

Where do you need atomicity (and where is eventual consistency acceptable)?
What does a client learn after a timeout: success, failure, or ambiguity?
Which invariants must hold across crashes, restarts, and partial deployments?
Which transitions are allowed, and which are impossible by construction?
How do you make “unsafe defaults” impossible to ship?
What exactly is the state, and what is derived or cached?

Assumptions

Input is hostile: malformed, oversized, boundary values, protocol confusion.
Observability is incomplete: you will debug from partial evidence.
Deployments are mixed-version for longer than you think.
Crashes happen mid-write (torn state) unless you prove otherwise.
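
The torn-write assumption can be made checkable by framing each durable record with a length and a checksum, so recovery can tell a complete record from a partial one. The sketch below assumes a hypothetical framing (4-byte big-endian length, payload, CRC-32 of the payload); `Encode`, `Decode`, and `ErrTorn` are illustrative names, and a real log format would also need versioning and sync discipline.

```go
package main

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
)

// ErrTorn is a hypothetical error for a truncated or corrupt record.
var ErrTorn = errors.New("record is truncated or corrupt")

// Encode frames payload as: length (4 bytes), payload, CRC-32 (4 bytes).
func Encode(payload []byte) []byte {
	buf := make([]byte, 8+len(payload))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(payload)))
	copy(buf[4:], payload)
	binary.BigEndian.PutUint32(buf[4+len(payload):], crc32.ChecksumIEEE(payload))
	return buf
}

// Decode returns the payload only if the frame is complete and the
// checksum matches; otherwise it returns ErrTorn.
func Decode(buf []byte) ([]byte, error) {
	if len(buf) < 8 {
		return nil, ErrTorn
	}
	n := binary.BigEndian.Uint32(buf[0:4])
	if uint64(n) > uint64(len(buf)-8) {
		return nil, ErrTorn // declared length runs past the data we have
	}
	end := 4 + int(n)
	payload := buf[4:end]
	if crc32.ChecksumIEEE(payload) != binary.BigEndian.Uint32(buf[end:end+4]) {
		return nil, ErrTorn
	}
	return payload, nil
}
```

On recovery, a reader would stop at the first `ErrTorn` and treat everything before it as the durable prefix.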

Non-goals

Relying on “best effort” client behavior for safety properties.
Letting recovery be “restart the service and hope.”

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

A common pattern is splitting state into durable vs derived:

$$S = S_{\text{durable}} \times S_{\text{derived}} \qquad \text{and} \qquad S_{\text{derived}} = f(S_{\text{durable}}).$$

Prefer monotonic identifiers at boundaries (sequence numbers, epochs, version vectors) so that replays are detectable and order can be reasoned about.
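
As an illustration of replay detection with sequence numbers, here is a minimal sketch. `SeqGuard` is a hypothetical type, and the sketch assumes per-sender sequence numbers that start at 1. It keeps its state in memory, so a real system would need to persist `last` as part of durable state.

```go
package main

import "sync"

// SeqGuard is a hypothetical per-sender replay detector: it accepts a
// message only if its sequence number is strictly greater than the last
// one accepted from that sender.
type SeqGuard struct {
	mu   sync.Mutex
	last map[string]uint64
}

// NewSeqGuard returns an empty guard.
func NewSeqGuard() *SeqGuard {
	return &SeqGuard{last: make(map[string]uint64)}
}

// Accept reports whether seq is new for sender and, if so, records it.
// Sequence numbers start at 1, so the zero value means "nothing seen".
func (g *SeqGuard) Accept(sender string, seq uint64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if seq <= g.last[sender] {
		return false // replay, or a stale message delivered out of order
	}
	g.last[sender] = seq
	return true
}
```

Note the trade-off: this strict form also rejects legitimate messages that arrive out of order. Tolerating reordering would need a window of recently seen numbers rather than a single high-water mark.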

Treat invariants as a first-class interface: a function that cannot check its invariants cannot be safely composed. Start with the smallest invariant that is both meaningful and enforceable at your boundaries.
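
A small sketch of an invariant exposed as part of a type's interface. The `Ledger` type and its two invariants are hypothetical examples. The idea is only that every mutation produces a candidate state, checks it, and refuses to return an invalid one.

```go
package main

import "fmt"

// Ledger is a hypothetical piece of state with two invariants:
// Balance >= 0 and 0 <= Reserved <= Balance.
type Ledger struct {
	Balance  int64
	Reserved int64
}

// Check returns nil if the invariants hold, or an error naming the
// first one that does not.
func (l Ledger) Check() error {
	if l.Balance < 0 {
		return fmt.Errorf("balance %d is negative", l.Balance)
	}
	if l.Reserved < 0 || l.Reserved > l.Balance {
		return fmt.Errorf("reserved %d outside [0, %d]", l.Reserved, l.Balance)
	}
	return nil
}

// Reserve returns a new Ledger with n more units reserved. It checks
// the candidate state and returns the original, unchanged, on violation.
func (l Ledger) Reserve(n int64) (Ledger, error) {
	if n <= 0 {
		return l, fmt.Errorf("reserve amount %d must be positive", n)
	}
	next := Ledger{Balance: l.Balance, Reserved: l.Reserved + n}
	if err := next.Check(); err != nil {
		return l, err
	}
	return next, nil
}
```

Because `Check` is exported alongside the mutation, callers composing `Ledger` with other state can re-run it at their own boundaries.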

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

Least authority: privileges are scoped by purpose and time.
Authenticity: actions are bound to identity and purpose.
Replay resistance: duplicated inputs do not change outcomes.
Evidence: critical actions emit verifiable audit events.

Failure modes

Timeout ambiguity causing double-apply or partial state transitions.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Config drift that weakens security posture over time.
Recovery paths that only work when nothing is broken.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.
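
To keep a cache from becoming a source of truth, the derived value can carry enough information to be recomputed and compared. The sketch below assumes a toy model in which durable state is an append-only list of deltas and the derived value is their sum; `Durable`, `Derive`, `Cache`, and `Rebuild` are hypothetical names.

```go
package main

// Durable is the source of truth: an append-only list of deltas.
type Durable []int64

// Derive plays the role of f: it recomputes the derived total.
func Derive(d Durable) int64 {
	var total int64
	for _, x := range d {
		total += x
	}
	return total
}

// Cache holds a derived value plus the durable prefix it was computed from.
type Cache struct {
	Total int64
	UpTo  int // number of durable entries covered
}

// Validate recomputes the total over the prefix the cache claims to
// cover and reports whether it matches the cached value.
func (c Cache) Validate(d Durable) bool {
	if c.UpTo < 0 || c.UpTo > len(d) {
		return false
	}
	return Derive(d[:c.UpTo]) == c.Total
}

// Rebuild discards any cached value and recomputes from durable state.
func Rebuild(d Durable) Cache {
	return Cache{Total: Derive(d), UpTo: len(d)}
}
```

A cache that fails `Validate` is thrown away and rebuilt. If rebuilding is not possible, the cache has already become durable state and should be treated as such.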

Design sketch

flowchart TD
  input["Input"] --> parse["Parse/Validate"]
  parse --> decide["Decide (pure)"]
  decide --> write["Durable write"]
  write --> ack["Acknowledge"]
  ack --> obs["Emit evidence (logs/metrics)"]

Implementation notes

Implementation is the act of making invalid state unrepresentable (or at least unignorable).

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
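
A minimal sketch of capping cost at the read boundary using the standard library's `io.LimitReader`. `ReadBounded` and `ErrTooLarge` are hypothetical names, and `source` is a test double standing in for a client that sends an arbitrary number of bytes.

```go
package main

import (
	"errors"
	"io"
)

// ErrTooLarge is a hypothetical error for input over the cap.
var ErrTooLarge = errors.New("input exceeds limit")

// ReadBounded reads at most limit bytes from r. It reads one byte past
// the limit so that "exactly limit" and "too large" are distinguishable.
// It assumes limit is well below the maximum int64.
func ReadBounded(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, ErrTooLarge
	}
	return data, nil
}

// source is a test double that yields `remaining` bytes of 'x', then EOF.
type source struct{ remaining int }

func (s *source) Read(p []byte) (int, error) {
	if s.remaining == 0 {
		return 0, io.EOF
	}
	n := len(p)
	if n > s.remaining {
		n = s.remaining
	}
	for i := 0; i < n; i++ {
		p[i] = 'x'
	}
	s.remaining -= n
	return n, nil
}
```

With a 16-byte limit, `ReadBounded(&source{remaining: 1 << 20}, 16)` returns `ErrTooLarge` after buffering at most 17 bytes, however much the sender offers.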

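// Idempotency sketch: reserve -> execute -> commit result (or return cached).
// The package name is illustrative.
package idempotency

// Key identifies one logical operation across retries.
type Key string

// Store is the minimal durable interface the sketch relies on.
type Store interface {
  // Get returns the value recorded for key; ok is false if there is none.
  Get(key Key) (value []byte, ok bool, err error)
  // PutIfAbsent records value only if key has no value yet;
  // stored reports whether this call performed the write.
  PutIfAbsent(key Key, value []byte) (stored bool, err error)
}

// A "timeout" must not mean "try again and maybe double-apply".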
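
One possible way to build reserve, execute, commit on top of that interface is sketched below. It repeats the `Key` and `Store` definitions so that it compiles on its own. `Do`, `ErrInProgress`, `memStore`, and the `reserve/` and `result/` key prefixes are assumptions made for this sketch. The in-memory store is not durable and stands in for a real one.

```go
package main

import (
	"errors"
	"sync"
)

type Key string

type Store interface {
	Get(key Key) (value []byte, ok bool, err error)
	PutIfAbsent(key Key, value []byte) (stored bool, err error)
}

// ErrInProgress means the key is reserved but has no committed result:
// the outcome is unknown to this caller and must not be re-executed.
var ErrInProgress = errors.New("operation reserved but not committed")

// Do runs op at most once per key and returns the committed result on
// later calls. A failed or crashed op leaves the reservation in place.
func Do(s Store, key Key, op func() ([]byte, error)) ([]byte, error) {
	if v, ok, err := s.Get("result/" + key); err != nil || ok {
		return v, err // cached result, or a store error
	}
	reserved, err := s.PutIfAbsent("reserve/"+key, nil)
	if err != nil {
		return nil, err
	}
	if !reserved {
		// Someone else reserved first; they may have committed meanwhile.
		if v, ok, err := s.Get("result/" + key); err != nil || ok {
			return v, err
		}
		return nil, ErrInProgress
	}
	out, err := op()
	if err != nil {
		return nil, err
	}
	if _, err := s.PutIfAbsent("result/"+key, out); err != nil {
		return nil, err
	}
	return out, nil
}

// memStore is an in-memory Store for illustration only.
type memStore struct {
	mu sync.Mutex
	m  map[Key][]byte
}

func newMemStore() *memStore { return &memStore{m: make(map[Key][]byte)} }

func (s *memStore) Get(key Key) ([]byte, bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.m[key]
	return v, ok, nil
}

func (s *memStore) PutIfAbsent(key Key, value []byte) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.m[key]; exists {
		return false, nil
	}
	s.m[key] = value
	return true, nil
}
```

The design choice here is that a stuck reservation surfaces as `ErrInProgress` instead of silently re-running the operation. Clearing stuck reservations (expiry, operator action, or a recovery scan) is a separate decision this sketch leaves open.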

Verification strategy

Fault injection: latency, partial writes, dropped acks, and duplicated messages.
Property-based tests: generate adversarial sequences and assert invariants after every step.
Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
Crash/restart tests: persist mid-transition and validate recovery correctness.
Metamorphic tests: for operations declared idempotent, applying the same operation twice must leave the same state as applying it once.
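
A hand-rolled sketch of the property-based and metamorphic ideas, using only `math/rand` from the standard library. `Applier` is a hypothetical deduplicating state machine. `CheckDuplicateDelivery` generates random operations, delivers each one to three times, and compares the result after every step with a model that applies each operation exactly once.

```go
package main

import "math/rand"

// Applier is a hypothetical state machine that applies each operation
// ID at most once.
type Applier struct {
	seen  map[int]bool
	Total int
}

// NewApplier returns an empty Applier.
func NewApplier() *Applier {
	return &Applier{seen: make(map[int]bool)}
}

// Apply adds amount to Total unless id was already applied.
func (a *Applier) Apply(id, amount int) {
	if a.seen[id] {
		return
	}
	a.seen[id] = true
	a.Total += amount
}

// CheckDuplicateDelivery runs `steps` random operations derived from
// seed, delivering each 1 to 3 times. It returns the index of the first
// step where the Applier diverges from the apply-once model, or -1.
func CheckDuplicateDelivery(seed int64, steps int) int {
	rng := rand.New(rand.NewSource(seed))
	a := NewApplier()
	want := 0
	for i := 0; i < steps; i++ {
		amount := rng.Intn(100)
		want += amount
		for n := 1 + rng.Intn(3); n > 0; n-- {
			a.Apply(i, amount)
		}
		if a.Total != want {
			return i
		}
	}
	return -1
}
```

Returning the failing step index, together with the seed, gives a reproducible counterexample when the property breaks.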

Operational notes

Instrument ambiguity: measure “unknown outcome” responses separately from failures.
Make rollbacks safe: schema and protocol compatibility is a security boundary.
Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
Log as evidence: append-only where possible; isolate logs from compromised workloads.
Design “degraded modes” explicitly (fail closed vs fail open per operation).

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.
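
A sketch of recording that policy choice per operation instead of leaving it implicit in error handling. The operation names, `FailMode`, `Decide`, and `ErrUnavailable` are hypothetical. The zero value of `FailMode` is deliberately `FailClosed`, so an operation missing from the table is denied.

```go
package main

import "errors"

// FailMode says what to do when the dependency behind a decision is
// unavailable.
type FailMode int

const (
	FailClosed FailMode = iota // deny when the dependency is unavailable
	FailOpen                   // allow when the dependency is unavailable
)

// ErrUnavailable is a hypothetical error from the policy dependency.
var ErrUnavailable = errors.New("policy service unavailable")

// policy is a hypothetical per-operation table. Operations that are not
// listed get the zero value, FailClosed.
var policy = map[string]FailMode{
	"read_public_profile": FailOpen,
	"transfer_funds":      FailClosed,
}

// Decide returns whether op may proceed. If the check succeeded, its
// answer is used; if it errored, the policy table decides.
func Decide(op string, allowed bool, checkErr error) bool {
	if checkErr == nil {
		return allowed
	}
	return policy[op] == FailOpen
}
```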

What to monitor

Error budget burn + tail latency under load.
Authz failures and policy denials (unexpected spikes).
Retry/timeout rates by endpoint and client cohort.
Rollback events and the conditions that triggered them.
Invariant violation rate (should be ~0).
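
For the invariant violation rate, one lightweight approach is a named counter that is incremented wherever an "impossible" state is observed. The `Monitor` type below is a hypothetical in-process sketch. Exporting the counts to a metrics system is left out.

```go
package main

import "sync"

// Monitor counts invariant violations by name.
type Monitor struct {
	mu     sync.Mutex
	counts map[string]uint64
}

// NewMonitor returns a Monitor with no recorded violations.
func NewMonitor() *Monitor {
	return &Monitor{counts: make(map[string]uint64)}
}

// Observe records a violation when holds is false. It returns holds
// unchanged so it can wrap a condition inline.
func (m *Monitor) Observe(name string, holds bool) bool {
	if !holds {
		m.mu.Lock()
		m.counts[name]++
		m.mu.Unlock()
	}
	return holds
}

// Violations returns how many times the named invariant failed.
func (m *Monitor) Violations(name string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.counts[name]
}
```

A call site might read `if !mon.Observe("balance_non_negative", balance >= 0) { ... }`, where the invariant name is again only an example.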

Rollback plan

Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Define an explicit rollback trigger (metrics + thresholds).
Prefer backward-compatible changes; avoid “flag day” upgrades.
Use canaries and staged rollout; stop early when signals degrade.
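
An explicit rollback trigger can be as small as a pure function from observed signals to reasons. The sketch below is illustrative only: the `Signals` and `Thresholds` fields are hypothetical, as is the rule that any invariant violation triggers rollback regardless of thresholds.

```go
package main

// Signals is a hypothetical snapshot of canary metrics.
type Signals struct {
	ErrorRate           float64 // fraction of failed requests, 0..1
	P99Millis           float64 // tail latency in milliseconds
	InvariantViolations uint64  // count since the rollout started
}

// Thresholds are the limits agreed on before the rollout.
type Thresholds struct {
	MaxErrorRate float64
	MaxP99Millis float64
}

// ShouldRollback returns the reasons to roll back. An empty result
// means the rollout may continue.
func ShouldRollback(s Signals, t Thresholds) []string {
	var reasons []string
	if s.InvariantViolations > 0 {
		reasons = append(reasons, "invariant violations observed")
	}
	if s.ErrorRate > t.MaxErrorRate {
		reasons = append(reasons, "error rate above threshold")
	}
	if s.P99Millis > t.MaxP99Millis {
		reasons = append(reasons, "p99 latency above threshold")
	}
	return reasons
}
```

Returning reasons instead of a bare boolean gives the rollback event something to record as evidence.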

Evidence

Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
RFC 9110: HTTP Semantics (2) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.
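
To tie retries to method semantics, a client can decide from the method alone whether an automatic retry is allowed. The method list follows RFC 9110, which defines PUT, DELETE, and the safe methods (GET, HEAD, OPTIONS, TRACE) as idempotent. `IsIdempotent` and `MayRetry` are hypothetical helper names, and the idempotency-key escape hatch assumes the server actually deduplicates on that key.

```go
package main

import "net/http"

// IsIdempotent reports whether RFC 9110 defines the method as idempotent.
func IsIdempotent(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodOptions,
		http.MethodTrace, http.MethodPut, http.MethodDelete:
		return true
	}
	return false
}

// MayRetry reports whether a request with an unknown outcome may be
// retried automatically: only if the method is idempotent, or the
// request carries an idempotency key the server deduplicates on.
func MayRetry(method string, hasIdempotencyKey bool) bool {
	return IsIdempotent(method) || hasIdempotencyKey
}
```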

Open questions

Which correctness properties can be enforced at compile time (types/capabilities)?
What is the minimal durable record needed to recover safely?
Which operations need monotonic versioning vs idempotency keys vs both?
Which invariant, if violated, would silently corrupt state for weeks?

Checklist

Safety properties stated as invariants.
Assumptions listed and reviewed.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Telemetry captures correctness signals.
