Monthly research note. Theme: Formal Methods & Verification.

TL;DR

A focused memo on TLA+ for engineers, under the working title "Modeling the Minimal Thing That Can Break You". The method: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
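
One way to make the third outcome concrete is to give it its own value in the result type, so callers cannot fold it into success or failure by accident. The sketch below is illustrative only: the names Outcome and next_step, and the policy strings, are assumptions and not part of any library.

```python
from enum import Enum, auto


class Outcome(Enum):
    SUCCESS = auto()   # the callee confirmed the operation was applied
    FAILURE = auto()   # the callee confirmed the operation was NOT applied
    UNKNOWN = auto()   # timeout: the operation may or may not have been applied


def next_step(outcome: Outcome) -> str:
    """Decide what the caller may do next (hypothetical policy names)."""
    if outcome is Outcome.SUCCESS:
        return "done"
    if outcome is Outcome.FAILURE:
        return "retry"             # safe: nothing was applied
    # UNKNOWN: a blind retry risks double-apply, so reconcile first.
    return "query-then-retry"
```

In a model, the same idea shows up as a third branch in the transition relation: after a timeout, both "applied" and "not applied" are reachable successor states.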

Key takeaways

  • Counterexamples are engineering artifacts—minimize them and turn them into tests.
  • Write properties in plain language next to the formal statement.
  • Model the smallest system that can still fail in the way you fear.
  • Prefer protocols and APIs that make invalid states hard to express.
  • Treat retries, reordering, and partial failure as default conditions.

Why this matters

  • Verification complements testing by exploring adversarial schedules systematically.
  • Refinement boundaries prevent “spec drift” between paper and code.
  • Counterexamples are better than intuition—they are executable bug reports.
  • The goal is not a perfect proof—it’s reducing the space of unknown failure modes.

Key questions

  • What is the environment model (adversary actions, scheduling, failures)?
  • Which invariants must hold under every interleaving and crash point?
  • How do you convert counterexamples into test harnesses?
  • Which properties belong in the model vs in tests vs in monitoring?
  • How do you handle state explosion (symmetry, abstraction, bounds)?
  • What is the refinement boundary between spec and implementation?

Assumptions

  • Adversaries choose the worst schedule, not the average one.
  • Concurrency introduces interleavings humans don’t reason about reliably.
  • Most systems have implicit assumptions about timeouts and ordering.
  • Specifications omit details; implementations invent them. That gap is risk.

Non-goals

  • Proving the whole system end-to-end with all implementation details.
  • Treating verification as a one-time event rather than a process.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

In temporal logic terms, the common shape is:

\mathrm{Safety} \equiv \Box\,\mathrm{Inv} \qquad\qquad \mathrm{Liveness} \equiv \Box\Diamond\,\mathrm{Progress}.

Write properties in plain language next to the formal version.

Model the scheduler explicitly when concurrency is part of the threat model.
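
To illustrate what an explicit scheduler looks like, here is a minimal explicit-state search written in Python as a stand-in for a real model checker such as TLC. It models the classic lost update: two processes each read a shared counter and then write back the value plus one. The state layout, the action labels, and the function names are all assumptions made for this sketch.

```python
from collections import deque

# State: (counter, pcs, tmp); pcs[i] is "read", "write", or "done",
# and tmp[i] is the value process i last read.
INIT = (0, ("read", "read"), (0, 0))


def steps(state):
    """Yield (label, next_state) for every process the scheduler could pick."""
    counter, pcs, tmp = state
    for i in range(len(pcs)):
        if pcs[i] == "read":
            new_tmp = tmp[:i] + (counter,) + tmp[i + 1:]
            new_pcs = pcs[:i] + ("write",) + pcs[i + 1:]
            yield f"p{i}.read", (counter, new_pcs, new_tmp)
        elif pcs[i] == "write":
            new_pcs = pcs[:i] + ("done",) + pcs[i + 1:]
            yield f"p{i}.write", (tmp[i] + 1, new_pcs, tmp)


def invariant(state):
    """Once every process is done, the counter equals the number of processes."""
    counter, pcs, _ = state
    return not all(pc == "done" for pc in pcs) or counter == len(pcs)


def find_counterexample(init):
    """Breadth-first search over all schedules; return a shortest violating trace or None."""
    seen = {init}
    queue = deque([(init, [])])
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        for label, nxt in steps(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [label]))
    return None
```

Calling find_counterexample(INIT) returns a four-step trace in which both reads happen before either write. The scheduler is nothing more than the loop in steps that offers every enabled process as a possible next move.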

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
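
A rough sketch of what "observable" can mean in practice: check the invariant at the point where it could break, and count violations rather than only asserting. The violations counter here is a hypothetical in-process stand-in for whatever metrics system is in use, and the transfer example is invented for illustration.

```python
from collections import Counter

# Hypothetical in-process metric store; a real system would export this counter.
violations = Counter()


def check_invariant(name: str, holds: bool) -> bool:
    """Record a violation instead of crashing, so an alert can fire on the count."""
    if not holds:
        violations[name] += 1
    return holds


def transfer(balances: dict, src: str, dst: str, amount: int) -> None:
    total_before = sum(balances.values())
    balances[src] -= amount
    balances[dst] += amount
    # Invariant: transfers conserve the total balance.
    check_invariant("total_conserved", sum(balances.values()) == total_before)
    # Invariant: no account goes negative (the "impossible state").
    check_invariant("non_negative", all(v >= 0 for v in balances.values()))
```

Whether a violation should also abort the operation is a separate design decision; the sketch only makes the drift visible.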

Security properties

  • Replay resistance: duplicated inputs do not change outcomes.
  • Integrity: invalid transitions are rejected (and detectable).
  • Evidence: critical actions emit verifiable audit events.
  • Least authority: privileges are scoped by purpose and time.
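
The first two properties can be sketched together. The example below assumes callers attach a unique request ID to each request; the Ledger class and its method names are hypothetical and chosen only for illustration.

```python
class Ledger:
    """Applies each request at most once, keyed by a caller-supplied request ID."""

    def __init__(self) -> None:
        self.balance = 0
        self._results = {}   # request_id -> balance recorded after the apply

    def deposit(self, request_id: str, amount: int) -> int:
        if amount <= 0:
            # Integrity: reject the invalid transition loudly.
            raise ValueError("amount must be positive")
        if request_id in self._results:
            # Replay: return the recorded result and change nothing.
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance
```

A production version would also need to bound or expire the stored results and persist them alongside the state they protect; both are left out here.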

Failure modes

  • Observability gaps during incidents (missing evidence).
  • Mixed-version behavior that violates assumptions silently.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Config drift that weakens security posture over time.

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart LR
  spec["Spec (TLA+/PlusCal)"] --> mc["Model Check"]
  mc --> refine["Refinement / Invariants"]
  refine --> impl["Implementation (Rust/Go)"]
  impl --> tests["Fuzz / PBT / Differential"]
  tests --> spec

Implementation notes

Treat invariants as code: version, review, and test them.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).
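
As a minimal sketch of this rule for a single local log file: the function returns (the "ack") only after the write has been flushed and fsync has returned. The function name and the file layout are assumptions. How much durability fsync actually provides depends on the operating system, filesystem, and hardware, and this sketch does not fsync the containing directory.

```python
import os


def append_durably(path: str, record: bytes) -> bool:
    """Append a record and return True (the "ack") only after fsync succeeds."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push it to stable storage
    return True                # any exception above means no ack was sent
```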

// Practical tip: make the model "executable" enough to emit traces you can replay.
// Then treat traces as regression inputs for your implementation.
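
A possible shape for such a replay, assuming the model emits traces as (action, expected state) pairs. CounterImpl is a hypothetical stand-in for the implementation under test, and the trace format is an assumption, not the output format of any particular tool.

```python
class CounterImpl:
    """Stand-in implementation under test (hypothetical)."""

    def __init__(self) -> None:
        self.value = 0

    def apply(self, action: str) -> None:
        if action == "inc":
            self.value += 1
        elif action == "reset":
            self.value = 0
        else:
            raise ValueError(f"unknown action: {action}")


def replay(trace, impl):
    """Return the index of the first step where impl diverges from the trace, or None."""
    for i, (action, expected) in enumerate(trace):
        impl.apply(action)
        if impl.value != expected:
            return i
    return None


# A trace as a model might emit it: (action, state after the action).
TRACE = [("inc", 1), ("inc", 2), ("reset", 0), ("inc", 1)]
```

Stored traces then work like any other regression fixture: replay(TRACE, CounterImpl()) returning None means the implementation still agrees with the model on that schedule.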

Verification strategy

  • Model checking bounded versions of the core protocol.
  • Differential tests against other implementations/specs.
  • Runtime assertions for invariants that are cheap to check.
  • Property-based tests derived from invariants.
  • Refinement tests: compare model traces to implementation traces.
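
As one illustration of deriving a property-based test from an invariant, the sketch below drives a small queue with random action sequences and checks the invariant after every step. It uses only the standard library; dedicated libraries such as Hypothesis add input shrinking on top of this idea. BoundedQueue, the capacity of 3, and the run counts are all assumptions for the example.

```python
import random


class BoundedQueue:
    """Hypothetical implementation under test."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items = []

    def push(self, x: int) -> bool:
        if len(self.items) >= self.capacity:
            return False           # reject rather than overflow
        self.items.append(x)
        return True

    def pop(self):
        return self.items.pop(0) if self.items else None


def invariant(q) -> bool:
    return 0 <= len(q.items) <= q.capacity


def run_random_schedules(make_queue=lambda: BoundedQueue(3),
                         runs: int = 200, steps: int = 50, seed: int = 0):
    """Return the first action sequence that breaks the invariant, or None."""
    rng = random.Random(seed)      # fixed seed keeps any failure reproducible
    for _ in range(runs):
        q, history = make_queue(), []
        for _ in range(steps):
            action = rng.choice(["push", "pop"])
            history.append(action)
            if action == "push":
                q.push(rng.randint(0, 9))
            else:
                q.pop()
            if not invariant(q):
                return history
    return None
```

A returned history plays the same role as a model-checker counterexample: minimize it, then keep it as a regression input.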

Operational notes

  • Treat counterexamples as incidents: track, root-cause, regression-test.
  • Version properties and invariants like code; review changes carefully.
  • Keep a library of “known hard schedules” from past failures.
  • Use models to evaluate protocol upgrades before shipping.
  • Run the model checker in CI with explicit timeouts and bounds.
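
For the CI point, a small wrapper can enforce the timeout and keep "ran out of time" distinct from "found a violation". This is a sketch under assumptions: the command to run is left as a placeholder, and the mapping of exit code 0 to "no violation found" should be verified against the model checker actually in use.

```python
import subprocess
import sys


def run_bounded(cmd, timeout_s: float) -> str:
    """Run a checker command; classify the result as 'pass', 'fail', or 'timeout'."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"           # inconclusive: not a pass, not a counterexample
    # Assumption: exit code 0 means the checker found no violation.
    return "pass" if result.returncode == 0 else "fail"


if __name__ == "__main__":
    # Placeholder command; substitute your model checker invocation.
    print(run_bounded([sys.executable, "-c", "pass"], timeout_s=30))
```

Treating "timeout" as its own CI status, rather than as a pass or a failure, signals that the bounds need revisiting instead of hiding the gap.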

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Authz failures and policy denials (unexpected spikes).
  • Invariant violation rate (should be ~0).
  • Error budget burn + tail latency under load.
  • Retry/timeout rates by endpoint and client cohort.
  • Admission-control / rate-limit rejections (by reason).

Rollback plan

  • Use canaries and staged rollout; stop early when signals degrade.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Keep dual-write / dual-verify windows where appropriate.
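
An explicit rollback trigger can be as simple as a reviewed list of metrics and thresholds plus a function that reports which ones are breached. The metric names and threshold values below are hypothetical; real values would come from your SLOs. Treating a missing metric as a breach is a design assumption of this sketch, not a universal rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical thresholds; real values come from your SLOs.
TRIGGERS = [
    Threshold("invariant_violations_per_min", 0.0),
    Threshold("error_rate", 0.02),
    Threshold("p99_latency_ms", 800.0),
]


def breached(metrics: dict, triggers=TRIGGERS) -> list:
    """Return the names of metrics that exceed their threshold.

    A missing metric counts as a breach: no signal is not a healthy signal.
    """
    out = []
    for t in triggers:
        value = metrics.get(t.metric)
        if value is None or value > t.max_value:
            out.append(t.metric)
    return out
```

A staged rollout would call breached at each stage and stop or roll back as soon as the returned list is non-empty.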

Evidence

  • Learn TLA+ (1) — Practical workflow and examples.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
  • Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
    • Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

  • How will you keep models aligned during rapid iteration?
  • What is the smallest model that reproduces your worst incident class?
  • Which invariants are cheap enough to monitor in production?
  • Which properties are you currently assuming but not testing or proving?

Checklist

  • Safety properties stated as invariants.
  • Assumptions listed and reviewed.
  • Failure modes enumerated with mitigations.
  • Rollback plan rehearsed and automated.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Telemetry captures correctness signals.

Further reading

1. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/
2. Beyer B, Jones C, Petoff J, Murphy NR. Site Reliability Engineering: How Google Runs Production Systems [Internet]. O’Reilly Media; 2016. Available from: https://sre.google/sre-book/table-of-contents/