Monthly research note. Theme: Formal Methods & Verification.

TL;DR

A focused memo on “Model Checking at Scale: State Explosion and How to Cheat”. The approach: define the model, state the properties, then design the system so those properties remain true under failures and adversaries.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Write properties in plain language next to the formal statement.
  • Model the smallest system that can still fail in the way you fear.
  • Refinement boundaries prevent spec drift between paper and code.
  • Write assumptions down; treat them as interfaces.
  • Bind security decisions to evidence (audit, invariants, telemetry).

Why this matters

  • Refinement boundaries prevent “spec drift” between paper and code.
  • The goal is not a perfect proof—it’s reducing the space of unknown failure modes.
  • Most catastrophic bugs are small: a missing condition, a stale variable, a rare interleaving.
  • Formal models force you to name assumptions (time, ordering, failure).

Key questions

  • Which properties belong in the model vs in tests vs in monitoring?
  • What is the environment model (adversary actions, scheduling, failures)?
  • Which invariants must hold under every interleaving and crash point?
  • What is the refinement boundary between spec and implementation?
  • What is the smallest model that still captures the bug class you fear?
  • How do you ensure proofs stay valid through refactors and upgrades?

Assumptions

  • Concurrency introduces interleavings humans don’t reason about reliably.
  • Specifications omit details; implementations invent them. That gap is risk.
  • Teams need workflows that keep models and code aligned over time.
  • Adversaries choose the worst schedule, not the average one.

Non-goals

  • Writing models that can’t produce counterexamples quickly.
  • Proving the whole system end-to-end with all implementation details.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.
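
As a minimal sketch of validating early and failing fast, the parser below either returns a fully validated command or raises, with no partial results. The `SET <key> <value>` wire format, the `SetCommand` type, and the size limits are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical limits; pick values that match your own protocol.
MAX_LINE_LEN = 256
MAX_KEY_LEN = 64
MAX_VALUE = 1_000_000


@dataclass(frozen=True)
class SetCommand:
    key: str
    value: int


def parse_set_command(line: str) -> SetCommand:
    """Parse 'SET <key> <value>' or raise ValueError; never return partial data."""
    if len(line) > MAX_LINE_LEN:          # bound work before doing any parsing
        raise ValueError("line too long")
    parts = line.split(" ")
    if len(parts) != 3 or parts[0] != "SET":
        raise ValueError("expected: SET <key> <value>")
    _, key, raw = parts
    if not key or len(key) > MAX_KEY_LEN or not (key.isascii() and key.isalnum()):
        raise ValueError("key must be 1..64 ASCII alphanumerics")
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError("value must be a non-negative decimal integer")
    value = int(raw)
    if value > MAX_VALUE:
        raise ValueError("value out of range")
    return SetCommand(key, value)
```

Everything downstream of `parse_set_command` can then assume a well-formed `SetCommand`, which keeps the checks in one reviewable place.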

Model & invariants

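A common way to state linearizability is the existence of a sequential history H_s that is equivalent (∼) to the observed concurrent history H_c and consistent with its real-time order: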

\[ \exists H_s:\ H_s \text{ is sequential } \wedge H_s \sim H_c. \]
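
For very small histories, the existential above can be checked by brute force: try every ordering of the operations and accept if one is a legal sequential history. The sketch below assumes a single read/write register and also enforces the usual real-time ordering requirement of linearizability; the `Op` record and its field names are hypothetical. It is exponential in the history length, so it only suits tiny test histories.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    start: int   # invocation time
    end: int     # response time
    kind: str    # "write" or "read"
    value: int   # value written, or value the read returned


def is_sequentially_valid(order, initial=0):
    """A register read must return the most recently written value."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def respects_real_time(order):
    """If an op finished before another started, it must also come first here."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def is_linearizable(history, initial=0):
    """True if some ordering of the history is sequential, legal, and time-consistent."""
    return any(
        respects_real_time(candidate) and is_sequentially_valid(candidate, initial)
        for candidate in permutations(history)
    )
```

A read that overlaps a write may return either the old or the new value, while a read that starts after the write has completed must return the new one.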

Model the scheduler explicitly when concurrency is part of the threat model.
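
One way to make the scheduler explicit is to enumerate every interleaving of the per-thread steps and run the model under each one. The sketch below assumes a toy model, two threads performing a non-atomic increment as separate read and write steps; the function names are illustrative.

```python
from itertools import combinations


def interleavings(n_a, n_b):
    """Yield every schedule of n_a 'A' steps and n_b 'B' steps as a string."""
    total = n_a + n_b
    for a_slots in combinations(range(total), n_a):
        slots = set(a_slots)
        yield "".join("A" if i in slots else "B" for i in range(total))


def run(schedule):
    """Each thread does: tmp = counter; counter = tmp + 1 (two separate steps)."""
    counter = 0
    tmp = {"A": None, "B": None}
    pc = {"A": 0, "B": 0}
    for thread in schedule:
        if pc[thread] == 0:
            tmp[thread] = counter          # read step
        else:
            counter = tmp[thread] + 1      # write step, possibly from a stale read
        pc[thread] += 1
    return counter


def find_violations():
    """Schedules under which the final counter is not 2 (a lost update)."""
    return [s for s in interleavings(2, 2) if run(s) != 2]
```

Of the six schedules, only the two fully serial ones (`AABB` and `BBAA`) preserve both increments; the other four lose an update. A real system has far more steps and schedules than this, which is why the model should stay small.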

Write properties in plain language next to the formal version.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
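
A minimal sketch of an observable invariant: each check increments a named counter when it fails, and that counter can be exported as a metric. The `InvariantMonitor` class and the balance example are hypothetical; a real deployment would forward the count to its own metrics and alerting system.

```python
from collections import Counter


class InvariantMonitor:
    """Counts invariant violations by name so they can be exported as a metric."""

    def __init__(self):
        self.violations = Counter()

    def check(self, name: str, holds: bool) -> bool:
        if not holds:
            # In production this would also increment a real metric and alert.
            self.violations[name] += 1
        return holds


def apply_withdrawal(balance: int, amount: int, monitor: InvariantMonitor) -> int:
    """Hypothetical state change with an invariant check on its result."""
    new_balance = balance - amount
    monitor.check("balance_non_negative", new_balance >= 0)
    return new_balance
```

The check here only observes; whether a violation should also block the operation is a separate design decision.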

Security properties

  • Evidence: critical actions emit verifiable audit events.
  • Integrity: invalid transitions are rejected (and detectable).
  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Least authority: privileges are scoped by purpose and time.
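
The integrity and evidence properties can be sketched together as a state machine that rejects transitions outside an allow-list and records every attempt. The order states, `Order` class, and event fields below are hypothetical, and a plain in-memory list stands in for the audit log; making events verifiable (for example by signing or hash-chaining them) is not shown.

```python
# Hypothetical allow-list of (from_state, to_state) pairs.
ALLOWED = {
    ("created", "authorized"),
    ("authorized", "captured"),
    ("authorized", "voided"),
}


class InvalidTransition(Exception):
    pass


class Order:
    def __init__(self, audit_log: list):
        self.state = "created"
        self.audit_log = audit_log

    def transition(self, new_state: str, actor: str) -> None:
        allowed = (self.state, new_state) in ALLOWED
        # Record the attempt before acting, so rejections are detectable too.
        self.audit_log.append(
            {"actor": actor, "from": self.state, "to": new_state, "allowed": allowed}
        )
        if not allowed:
            raise InvalidTransition(f"{self.state} -> {new_state}")
        self.state = new_state
```

Logging rejected attempts as well as accepted ones is what makes invalid transitions detectable and not merely rejected.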

Failure modes

  • Timeout ambiguity causing double-apply or partial state transitions.
  • Config drift that weakens security posture over time.
  • Observability gaps during incidents (missing evidence).
  • Mixed-version behavior that violates assumptions silently.
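
For the timeout-ambiguity case, one common mitigation is to make retries idempotent with a client-supplied request ID. This is a minimal in-memory sketch; the `Ledger` class and `request_id` parameter are hypothetical, and a real system would have to persist the ID table atomically with the state change.

```python
class Ledger:
    """Applies each deposit at most once per request ID."""

    def __init__(self):
        self.balance = 0
        self._applied = {}  # request_id -> balance returned for that request

    def deposit(self, request_id: str, amount: int) -> int:
        if request_id in self._applied:
            # A retry after a timeout: return the original result, do not re-apply.
            return self._applied[request_id]
        self.balance += amount
        self._applied[request_id] = self.balance
        return self.balance
```

A client that times out can then retry with the same ID without knowing whether the first attempt was applied.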

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart LR
  spec["Spec (TLA+/PlusCal)"] --> mc["Model Check"]
  mc --> refine["Refinement / Invariants"]
  refine --> impl["Implementation (Rust/Go)"]
  impl --> tests["Fuzz / PBT / Differential"]
  tests --> spec

Implementation notes

Treat invariants as code: version, review, and test them.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).

// Practical tip: make the model "executable" enough to emit traces you can replay.
// Then treat traces as regression inputs for your implementation.
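
A minimal sketch of that tip, assuming a toy bounded counter as the system: the model is a pure step function that emits a trace of (action, expected state) pairs, and the implementation replays the trace and reports the first divergence. All names here (`model_step`, `emit_trace`, `BoundedCounter`, `replay`) are illustrative.

```python
CAP = 3  # hypothetical capacity of the bounded counter


def model_step(state: int, action: str) -> int:
    """Model: 'inc' saturates at CAP, 'reset' returns to 0."""
    if action == "inc":
        return min(state + 1, CAP)
    if action == "reset":
        return 0
    raise ValueError(f"unknown action: {action}")


def emit_trace(actions, initial=0):
    """Run the model and record (action, expected_state) after every step."""
    trace, state = [], initial
    for action in actions:
        state = model_step(state, action)
        trace.append((action, state))
    return trace


class BoundedCounter:
    """Implementation under test."""

    def __init__(self):
        self.value = 0

    def inc(self):
        if self.value < CAP:
            self.value += 1

    def reset(self):
        self.value = 0


def replay(impl, trace):
    """Return the index of the first step where impl diverges, or None."""
    for i, (action, expected) in enumerate(trace):
        getattr(impl, action)()
        if impl.value != expected:
            return i
    return None
```

Traces that once exposed a bug can be stored and replayed against every new build, so the model's counterexamples become regression tests.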

Verification strategy

  • Runtime assertions for invariants that are cheap to check.
  • Proof maintenance: keep models in CI with a time budget.
  • Differential tests against other implementations/specs.
  • Model checking bounded versions of the core protocol.
  • Refinement tests: compare model traces to implementation traces.
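
Bounded model checking from the list above can be sketched as a breadth-first search over reachable states with a depth bound, returning the shortest trace that violates an invariant. The model here is a deliberately broken check-then-set lock for two processes, an assumption made for illustration; it is not a real protocol, and this is not how any particular model checker is implemented.

```python
from collections import deque


def successors(state):
    """Yield (label, next_state) for a non-atomic check-then-set lock."""
    pcs, lock = state
    for p in (0, 1):
        pc = pcs[p]
        if pc == 0 and not lock:      # step 1: observe the lock as free
            nxt, new_lock = 1, lock
        elif pc == 1:                 # step 2: take the lock (check may be stale)
            nxt, new_lock = 2, True
        elif pc == 2:                 # step 3: leave the critical section
            nxt, new_lock = 0, False
        else:
            continue                  # blocked: the lock is held
        new_pcs = tuple(nxt if i == p else pcs[i] for i in range(2))
        yield f"p{p}:{pc}->{nxt}", (new_pcs, new_lock)


def mutual_exclusion(state):
    """Invariant: both processes are never in the critical section (pc == 2)."""
    pcs, _ = state
    return pcs != (2, 2)


def check(initial, invariant, max_depth):
    """Return the shortest violating trace within max_depth steps, else None."""
    seen = {initial}
    queue = deque([(initial, [])])
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        if len(trace) >= max_depth:
            continue                  # bound reached: do not expand further
        for label, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [label]))
    return None
```

With `initial = ((0, 0), False)`, a bound of 4 finds the four-step counterexample in which both processes observe the lock as free before either takes it, while a bound of 3 reports nothing. A `None` result therefore means "no violation within the bound", not "correct".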

Operational notes

  • Treat counterexamples as incidents: track, root-cause, regression-test.
  • Run the model checker in CI with explicit timeouts and bounds.
  • Use models to evaluate protocol upgrades before shipping.
  • Version properties and invariants like code; review changes carefully.
  • Keep a library of “known hard schedules” from past failures.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

  • Retry/timeout rates by endpoint and client cohort.
  • Rollback events and the conditions that triggered them.
  • Invariant violation rate (should be ~0).
  • Admission-control / rate-limit rejections (by reason).
  • Error budget burn + tail latency under load.

Rollback plan

  • Use canaries and staged rollout; stop early when signals degrade.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Keep dual-write / dual-verify windows where appropriate.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
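
An explicit rollback trigger can be as small as a function from current metrics to a list of breached thresholds. The metric names and threshold values below are hypothetical placeholders; the sketch treats a missing metric as a breach, on the assumption that an observability gap during a rollout is itself a reason to stop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits; derive real ones from your SLOs.
    max_error_rate: float = 0.01
    max_p99_latency_ms: float = 500.0
    max_invariant_violations: int = 0


def rollback_reasons(metrics: dict, limits: Thresholds = Thresholds()) -> list:
    """Return human-readable reasons to roll back; an empty list means proceed."""
    checks = [
        ("error_rate", limits.max_error_rate),
        ("p99_latency_ms", limits.max_p99_latency_ms),
        ("invariant_violations", limits.max_invariant_violations),
    ]
    reasons = []
    for name, limit in checks:
        value = metrics.get(name)
        if value is None:
            reasons.append(f"{name}: missing")   # no evidence counts as a breach
        elif value > limit:
            reasons.append(f"{name}: {value} > {limit}")
    return reasons
```

Returning the reasons instead of a bare boolean keeps the evidence for why a rollback fired.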

Evidence

  • Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
    • Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
  • Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

  • Which invariants are cheap enough to monitor in production?
  • How will you keep models aligned during rapid iteration?
  • Which properties are you currently assuming but not testing or proving?
  • What is the smallest model that reproduces your worst incident class?

Checklist

  • Telemetry captures correctness signals.
  • Assumptions listed and reviewed.
  • Rollback plan rehearsed and automated.
  • Safety properties stated as invariants.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Failure modes enumerated with mitigations.

Further reading

1. Beyer B, Jones C, Petoff J, Murphy NR. Site Reliability Engineering: How Google Runs Production Systems [Internet]. O’Reilly Media; 2016. Available from: https://sre.google/sre-book/table-of-contents/
2. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/