TLA+ for Engineers: Modeling the Minimal Thing That Can Break You

Monthly research note. Theme: Formal Methods & Verification.

TL;DR

Define the model, state the properties, then design the system so those properties remain true under failure and adversarial conditions.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
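
One way to make the third outcome concrete is to give it its own value in the type the caller sees. The sketch below is illustrative only: `Outcome` and `call_with_deadline` are hypothetical names, not part of any library, and a worker thread stands in for a remote call.

```python
import enum
import queue
import threading
from typing import Callable


class Outcome(enum.Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"  # timed out: the operation may or may not have applied


def call_with_deadline(op: Callable[[], None], timeout_s: float) -> Outcome:
    """Run op in a worker thread and classify the result three ways."""
    result: "queue.Queue[Outcome]" = queue.Queue(maxsize=1)

    def worker() -> None:
        try:
            op()
            result.put(Outcome.SUCCESS)
        except Exception:
            result.put(Outcome.FAILURE)

    threading.Thread(target=worker, daemon=True).start()
    try:
        return result.get(timeout=timeout_s)
    except queue.Empty:
        # The worker is still running; we cannot tell whether op took effect.
        return Outcome.UNKNOWN
```

A caller that receives `UNKNOWN` should neither report failure nor blindly retry a non-idempotent operation; the same distinction belongs in the model as an explicit state.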

Key takeaways

Counterexamples are engineering artifacts—minimize them and turn them into tests.
Write properties in plain language next to the formal statement.
Model the smallest system that can still fail in the way you fear.
Prefer protocols and APIs that make invalid states hard to express.
Treat retries, reordering, and partial failure as default conditions.

Why this matters

Verification complements testing by exploring adversarial schedules systematically.
Refinement boundaries prevent “spec drift” between paper and code.
Counterexamples are better than intuition—they are executable bug reports.
The goal is not a perfect proof—it’s reducing the space of unknown failure modes.

Key questions

What is the environment model (adversary actions, scheduling, failures)?
Which invariants must hold under every interleaving and crash point?
How do you convert counterexamples into test harnesses?
Which properties belong in the model vs in tests vs in monitoring?
How do you handle state explosion (symmetry, abstraction, bounds)?
What is the refinement boundary between spec and implementation?

Assumptions

Adversaries choose the worst schedule, not the average one.
Concurrency introduces interleavings humans don’t reason about reliably.
Most systems have implicit assumptions about timeouts and ordering.
Specifications omit details; implementations invent them. That gap is risk.

Non-goals

Proving the whole system end-to-end with all implementation details.
Treating verification as a one-time event rather than a process.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

In temporal logic terms, the common shape is:

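$$\mathrm{Safety} \equiv \Box\,\mathrm{Inv} \qquad\qquad \mathrm{Liveness} \equiv \Box\Diamond\,\mathrm{Progress}$$

In plain language: safety says Inv holds in every reachable state; liveness says Progress happens infinitely often along every behavior, which in practice usually requires a fairness assumption.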

Write properties in plain language next to the formal version.

Model the scheduler explicitly when concurrency is part of the threat model.
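
To show what "every interleaving" means mechanically, here is a minimal explicit-state checker in Python. It is a sketch of the idea behind a model checker, not a substitute for TLC. The model (two processes doing a non-atomic read-then-write increment) and all names are assumptions chosen for illustration.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional, Tuple

# State: (counter, pcs, tmps). Each process reads the counter, then writes tmp + 1.
State = Tuple[int, Tuple[str, ...], Tuple[int, ...]]

INIT: State = (0, ("read", "read"), (0, 0))


def next_states(s: State) -> Iterator[Tuple[str, State]]:
    """Yield every (action label, successor) the scheduler could pick."""
    counter, pcs, tmps = s
    for i, pc in enumerate(pcs):
        if pc == "read":
            new_pcs = pcs[:i] + ("write",) + pcs[i + 1:]
            new_tmps = tmps[:i] + (counter,) + tmps[i + 1:]
            yield f"p{i}.read", (counter, new_pcs, new_tmps)
        elif pc == "write":
            new_pcs = pcs[:i] + ("done",) + pcs[i + 1:]
            yield f"p{i}.write", (tmps[i] + 1, new_pcs, tmps)


def inv(s: State) -> bool:
    """Safety: once every process is done, no increment was lost."""
    counter, pcs, _ = s
    return not all(pc == "done" for pc in pcs) or counter == len(pcs)


def check(init: State) -> Optional[List[str]]:
    """Breadth-first search; returns a shortest violating trace, or None."""
    parent: Dict[State, Optional[Tuple[State, str]]] = {init: None}
    frontier = deque([init])
    while frontier:
        s = frontier.popleft()
        if not inv(s):
            trace: List[str] = []
            while parent[s] is not None:
                s, label = parent[s]
                trace.append(label)
            return trace[::-1]
        for label, t in next_states(s):
            if t not in parent:
                parent[t] = (s, label)
                frontier.append(t)
    return None
```

Calling `check(INIT)` returns a four-step trace in which both reads happen before either write, which is the classic lost update. Because the search is breadth-first, the trace is as short as any violation can be.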

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
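
A minimal sketch of that pattern, assuming a hypothetical in-process `InvariantMonitor` whose counter stands in for a real metrics client:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class InvariantMonitor:
    """Counts invariant violations instead of crashing the process."""
    violations: Dict[str, int] = field(default_factory=dict)

    def check(self, name: str, predicate: Callable[[], bool]) -> bool:
        ok = bool(predicate())
        if not ok:
            # In a real system this would increment a metric and raise an alert.
            self.violations[name] = self.violations.get(name, 0) + 1
        return ok


monitor = InvariantMonitor()
balance = {"alice": 70, "bob": 30}
monitor.check("total_conserved", lambda: sum(balance.values()) == 100)
balance["bob"] -= 5  # simulated bug: money disappears
monitor.check("total_conserved", lambda: sum(balance.values()) == 100)
```

The predicate is the same statement the model checks; only the enforcement point moves from the model checker to production telemetry.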

Security properties

Replay resistance: duplicated inputs do not change outcomes.
Integrity: invalid transitions are rejected (and detectable).
Evidence: critical actions emit verifiable audit events.
Least authority: privileges are scoped by purpose and time.
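
Replay resistance is the easiest of these to sketch. The example below assumes clients attach a unique request id to each request; `Ledger` and its in-memory dictionary are hypothetical stand-ins for a durable deduplication table.

```python
from typing import Dict


class Ledger:
    """Applies each request at most once, keyed by a client-chosen request id."""

    def __init__(self) -> None:
        self.balance = 0
        self._applied: Dict[str, int] = {}  # request id -> balance after applying

    def deposit(self, request_id: str, amount: int) -> int:
        if request_id in self._applied:
            # Replay or retry: return the recorded result, change nothing.
            return self._applied[request_id]
        self.balance += amount
        self._applied[request_id] = self.balance
        return self.balance
```

The same mechanism makes retries after an ambiguous timeout safe, since a duplicate request cannot apply twice.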

Failure modes

Observability gaps during incidents (missing evidence).
Mixed-version behavior that violates assumptions silently.
Timeout ambiguity causing double-apply or partial state transitions.
Config drift that weakens security posture over time.

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart LR
  spec["Spec (TLA+/PlusCal)"] --> mc["Model Check"]
  mc --> refine["Refinement / Invariants"]
  refine --> impl["Implementation (Rust/Go)"]
  impl --> tests["Fuzz / PBT / Differential"]
  tests --> spec

Implementation notes

Treat invariants as code: version, review, and test them.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).
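
As a rough illustration for a single-node append-only log, the ack can be the return value of a function that only returns after `fsync`. `durable_append` is a hypothetical helper; a real system would also need to fsync the parent directory when the file is first created, and replication changes what "durable" means.

```python
import os


def durable_append(path: str, record: bytes) -> bool:
    """Append record and return True (the ack) only after fsync succeeds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(record + b"\n")
        while len(view) > 0:
            written = os.write(fd, view)  # may be a partial write
            view = view[written:]
        os.fsync(fd)  # flush file data to the storage device before acking
    finally:
        os.close(fd)
    return True
```

If `os.write` or `os.fsync` raises, the exception propagates and no ack is produced, so the caller never sees success for a record that might be lost.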

// Practical tip: make the model "executable" enough to emit traces you can replay.
// Then treat traces as regression inputs for your implementation.
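
A sketch of what replay could look like, in Python. The trace format (a list of action names paired with the expected state) and the `Counter` stand-in implementation are assumptions; adapt both to whatever your model checker exports.

```python
from typing import Dict, List, Tuple

# A trace is a list of (action, expected state after the action).
Trace = List[Tuple[str, Dict[str, int]]]


class Counter:
    """Stand-in implementation under test."""

    def __init__(self) -> None:
        self.value = 0

    def apply(self, action: str) -> None:
        if action == "inc":
            self.value += 1
        elif action == "reset":
            self.value = 0
        else:
            raise ValueError(f"unknown action: {action}")

    def snapshot(self) -> Dict[str, int]:
        return {"value": self.value}


def replay(trace: Trace) -> int:
    """Return the index of the first divergent step, or -1 if none."""
    impl = Counter()
    for i, (action, expected) in enumerate(trace):
        impl.apply(action)
        if impl.snapshot() != expected:
            return i
    return -1
```

Storing each counterexample trace alongside the test suite turns every model-checking failure into a permanent regression input.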

Verification strategy

Model checking bounded versions of the core protocol.
Differential tests against other implementations/specs.
Runtime assertions for invariants that are cheap to check.
Property-based tests derived from invariants.
Refinement tests: compare model traces to implementation traces.
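
The property-based item can be sketched with only the standard library: generate random operation sequences, check the invariant after every step, and shrink any failure. The toy bounded buffer, its deliberate bug, and all function names are assumptions made for illustration; a dedicated property-testing library would do generation and shrinking far better.

```python
import random
from typing import Callable, List, Optional

Op = str


def run(ops: List[Op], capacity: int = 2) -> bool:
    """Execute ops on a toy bounded buffer; return True if the invariant held."""
    size = 0
    for op in ops:
        if op == "put":
            size += 1  # deliberate bug: no capacity check
        elif op == "get" and size > 0:
            size -= 1
        if not 0 <= size <= capacity:
            return False
    return True


def find_counterexample(prop: Callable[[List[Op]], bool],
                        trials: int = 200, max_len: int = 10,
                        seed: int = 0) -> Optional[List[Op]]:
    """Random search for a failing op sequence (fixed seed for reproducibility)."""
    rng = random.Random(seed)
    for _ in range(trials):
        ops = [rng.choice(["put", "get"]) for _ in range(rng.randint(1, max_len))]
        if not prop(ops):
            return ops
    return None


def shrink(ops: List[Op], prop: Callable[[List[Op]], bool]) -> List[Op]:
    """Greedily drop single ops while the sequence still fails."""
    changed = True
    while changed:
        changed = False
        for i in range(len(ops)):
            candidate = ops[:i] + ops[i + 1:]
            if candidate and not prop(candidate):
                ops, changed = candidate, True
                break
    return ops
```

For this toy buffer, shrinking any failing sequence ends at three consecutive `put` operations, which is the smallest input that exceeds a capacity of two.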

Operational notes

Treat counterexamples as incidents: track, root-cause, regression-test.
Version properties and invariants like code; review changes carefully.
Keep a library of “known hard schedules” from past failures.
Use models to evaluate protocol upgrades before shipping.
Run the model checker in CI with explicit timeouts and bounds.
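
For the CI item, a thin wrapper can separate "violation found" from "ran out of time", since the two call for different responses. This sketch assumes the checker signals violations through a non-zero exit code; the jar path, spec name, and config name in `TLC_CMD` are placeholders.

```python
import subprocess
from typing import List


def run_bounded(cmd: List[str], timeout_s: float) -> str:
    """Run cmd; return 'pass', 'fail', or 'timeout' so CI can treat each differently."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "pass" if proc.returncode == 0 else "fail"


# Hypothetical invocation; file names are placeholders for your own spec.
TLC_CMD = ["java", "-cp", "tla2tools.jar", "tlc2.TLC", "-config", "Spec.cfg", "Spec.tla"]
```

A timeout is not a pass: it usually means the bounds in the model configuration grew past what the CI budget allows, and the fix is to tighten the bounds rather than raise the limit silently.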

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Authz failures and policy denials (unexpected spikes).
Invariant violation rate (should be ~0).
Error budget burn + tail latency under load.
Retry/timeout rates by endpoint and client cohort.
Admission-control / rate-limit rejections (by reason).

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
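
An explicit rollback trigger can be as small as a table of thresholds and a function that compares canary metrics against it. The metric names and limits below are illustrative assumptions; real values should come from your SLOs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Illustrative thresholds only.
TRIGGERS = [
    Threshold("invariant_violations", 0.0),
    Threshold("error_rate", 0.01),
    Threshold("p99_latency_ms", 500.0),
]


def breached(canary: Dict[str, float], triggers: List[Threshold] = TRIGGERS) -> List[str]:
    """Return metrics over threshold; a missing metric counts as a breach."""
    return [t.metric for t in triggers
            if canary.get(t.metric, float("inf")) > t.max_value]
```

Treating a missing metric as a breach is a deliberate choice in this sketch: an observability gap during rollout is itself a reason to stop.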

Evidence

Learn TLA+ (1) — Practical workflow and examples.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

How will you keep models aligned during rapid iteration?
What is the smallest model that reproduces your worst incident class?
Which invariants are cheap enough to monitor in production?
Which properties are you currently assuming but not testing or proving?

Checklist

Safety properties stated as invariants.
Assumptions listed and reviewed.
Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Telemetry captures correctness signals.
