Monthly research note. Theme: Formal Methods & Verification.

TL;DR

A focused memo on the Safety/Liveness Catalog, a practical checklist for protocol specs: define the model, state the properties, then design the system so those properties remain true under failure and against adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

  • Counterexamples are engineering artifacts—minimize them and turn them into tests.
  • Refinement boundaries prevent spec drift between paper and code.
  • Write properties in plain language next to the formal statement.
  • Write assumptions down; treat them as interfaces.
  • Treat retries, reordering, and partial failure as default conditions.

Why this matters

  • Counterexamples are better than intuition—they are executable bug reports.
  • Most catastrophic bugs are small: a missing condition, a stale variable, a rare interleaving.
  • Verification complements testing by exploring adversarial schedules systematically.
  • The goal is not a perfect proof—it’s reducing the space of unknown failure modes.

Key questions

  • What is the smallest model that still captures the bug class you fear?
  • How do you ensure proofs stay valid through refactors and upgrades?
  • Which properties belong in the model vs in tests vs in monitoring?
  • How do you handle state explosion (symmetry, abstraction, bounds)?
  • Which invariants must hold under every interleaving and crash point?
  • How do you convert counterexamples into test harnesses?

Assumptions

  • Concurrency introduces interleavings humans don’t reason about reliably.
  • Most systems have implicit assumptions about timeouts and ordering.
  • Teams need workflows that keep models and code aligned over time.
  • Adversaries choose the worst schedule, not the average one.

Non-goals

  • Treating verification as a one-time event rather than a process.
  • Proving the whole system end-to-end with all implementation details.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

In temporal logic terms, the common shape is:

$$\mathrm{Safety} \equiv \Box\,\mathrm{Inv} \qquad\qquad \mathrm{Liveness} \equiv \Box\Diamond\,\mathrm{Progress}$$
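
As a hedged illustration of that shape, consider a two-process mutual-exclusion protocol. The predicate name inCS_i ("process i is in its critical section") is an assumption made for this example, not part of the catalog:

$$\mathrm{Safety} \equiv \Box\,\neg(\mathrm{inCS}_1 \land \mathrm{inCS}_2) \qquad\qquad \mathrm{Liveness} \equiv \Box\Diamond\,\mathrm{inCS}_1 \;\land\; \Box\Diamond\,\mathrm{inCS}_2$$

A safety violation always has a finite counterexample: a trace that ends in a state where Inv is false. A liveness violation needs an infinite one, usually shown as a reachable cycle in which Progress never holds, and it typically only makes sense to check under an explicit fairness assumption.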

Treat counterexamples as regression tests: reduce, encode, and replay.

Keep the model small enough to run in seconds; large models rot.

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

  • Integrity: invalid transitions are rejected (and detectable).
  • Replay resistance: duplicated inputs do not change outcomes.
  • Evidence: critical actions emit verifiable audit events.
  • Least authority: privileges are scoped by purpose and time.
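
Replay resistance is often the easiest of these properties to make concrete. The following is a minimal sketch, assuming each request carries a client-chosen idempotency key; the `Ledger` class and the `request_id` parameter are hypothetical names used only for illustration. The first application is recorded, and a duplicate returns the recorded outcome without changing state.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Ledger:
    """Hypothetical in-memory ledger that applies each request at most once."""
    balance: int = 0
    # request_id -> balance returned when the request was first applied
    applied: Dict[str, int] = field(default_factory=dict)

    def deposit(self, request_id: str, amount: int) -> int:
        if amount <= 0:
            # Integrity: reject invalid transitions instead of applying them.
            raise ValueError("amount must be positive")
        if request_id in self.applied:
            # Replay: return the original outcome and leave state untouched.
            return self.applied[request_id]
        self.balance += amount
        self.applied[request_id] = self.balance
        return self.balance


ledger = Ledger()
first = ledger.deposit("req-1", 50)
retry = ledger.deposit("req-1", 50)  # duplicate delivery, e.g. after a timeout
```

In a real system the deduplication record would need to be persisted atomically with the state change; otherwise a crash between the two reintroduces the double-apply.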

Failure modes

  • Config drift that weakens security posture over time.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Mixed-version behavior that violates assumptions silently.
  • Observability gaps during incidents (missing evidence).

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart TD
  props["Properties"] --> inv["Invariants"]
  inv --> model["Model"]
  model --> cex["Counterexamples"]
  cex --> tests["Regression Tests"]
  tests --> model

Implementation notes

Keep refinement boundaries explicit: what the spec promises vs what code enforces.

Rule of thumb

Acknowledge a write only after it is durable (or make “ack” explicitly best-effort).
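
A minimal sketch of this rule for a local append-only log, assuming a POSIX-like filesystem. The function name `append_durably` and the file name `wal.log` are hypothetical; the point is only the ordering: write, flush, fsync, then acknowledge.

```python
import os
import tempfile


def append_durably(path: str, record: bytes) -> bool:
    """Append one record and return True only after it has been fsynced."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to persist the data to stable storage
    return True               # only now may the caller send the acknowledgement


# Hypothetical usage with a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
acked = append_durably(log_path, b"set x=1")
```

On some filesystems a newly created file also requires an fsync of its parent directory before the file itself is guaranteed to survive a crash; that step is omitted here for brevity.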

Workflow:
1) Write a model with a few state variables.
2) State invariants (safety) and progress conditions (liveness).
3) Run the model checker with tight bounds.
4) Minimize counterexamples into test cases.
5) Iterate until failures are boring.
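
The workflow above can be sketched end to end with a tiny explicit-state checker. This is an illustration under stated assumptions, not a substitute for a real model checker: the model (two processes doing a non-atomic read-then-write increment), the state encoding, and every function name are invented for this example. Breadth-first search means the first violation found is also a shortest counterexample.

```python
from collections import deque

# State = (counter, pcs, tmps); pc 0 = about to read, 1 = about to write, 2 = done.
INIT = (0, (0, 0), (0, 0))


def next_states(state):
    """Yield (label, successor) for every enabled step of every process."""
    counter, pcs, tmps = state
    for p in range(len(pcs)):
        if pcs[p] == 0:  # read the shared counter into a local variable
            new_tmps = tmps[:p] + (counter,) + tmps[p + 1:]
            new_pcs = pcs[:p] + (1,) + pcs[p + 1:]
            yield f"p{p}:read", (counter, new_pcs, new_tmps)
        elif pcs[p] == 1:  # write back local value + 1
            new_pcs = pcs[:p] + (2,) + pcs[p + 1:]
            yield f"p{p}:write", (tmps[p] + 1, new_pcs, tmps)


def invariant(state):
    """Safety: once every process is done, no increment may be lost."""
    counter, pcs, _ = state
    return not all(pc == 2 for pc in pcs) or counter == len(pcs)


def check(init, next_states, invariant, max_states=10_000):
    """BFS over reachable states; return a shortest violating trace or None."""
    parent = {init: None}  # state -> (previous state, step label)
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return list(reversed(trace))
        for label, succ in next_states(state):
            if succ not in parent:
                if len(parent) >= max_states:
                    raise RuntimeError("state bound exceeded")
                parent[succ] = (state, label)
                queue.append(succ)
    return None


counterexample = check(INIT, next_states, invariant)
```

Here `counterexample` is the four-step lost-update schedule `['p0:read', 'p1:read', 'p0:write', 'p1:write']`. A list of step labels like this is small enough to check into the repository and replay against the implementation as a regression test.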

Verification strategy

  • Property-based tests derived from invariants.
  • Proof maintenance: keep models in CI with a time budget.
  • Runtime assertions for invariants that are cheap to check.
  • Refinement tests: compare model traces to implementation traces.
  • Differential tests against other implementations/specs.
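
Of these strategies, the refinement test is the one teams most often leave abstract, so here is a minimal sketch. It assumes the implementation can log each step as a (state before, action, state after) triple; the `MODEL` transition table and the state and action names are hypothetical. The check reports the first step that the model does not allow.

```python
from typing import Dict, List, Tuple

# Hypothetical model: allowed (state, action) -> next state.
MODEL: Dict[Tuple[str, str], str] = {
    ("idle", "send"): "waiting",
    ("waiting", "ack"): "idle",
    ("waiting", "timeout"): "waiting",  # a retry keeps waiting
}

Step = Tuple[str, str, str]  # (state_before, action, state_after)


def first_divergence(trace: List[Step], init: str = "idle") -> int:
    """Return the index of the first step the model does not allow, or -1."""
    expected = init
    for i, (before, action, after) in enumerate(trace):
        # The step must start where the previous one ended and must be
        # a transition that the model permits.
        if before != expected or MODEL.get((before, action)) != after:
            return i
        expected = after
    return -1


good = [("idle", "send", "waiting"), ("waiting", "timeout", "waiting"),
        ("waiting", "ack", "idle")]
bad = [("idle", "send", "waiting"), ("waiting", "timeout", "idle")]
```

`first_divergence(good)` returns -1, while `first_divergence(bad)` returns 1 because the model says a timeout must stay in `waiting`. Either the code or the model is wrong at that step, which is exactly the question a refinement boundary is meant to force.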

Operational notes

  • Run the model checker in CI with explicit timeouts and bounds.
  • Treat counterexamples as incidents: track, root-cause, regression-test.
  • Version properties and invariants like code; review changes carefully.
  • Keep a library of “known hard schedules” from past failures.
  • Use models to evaluate protocol upgrades before shipping.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

  • Error budget burn + tail latency under load.
  • Rollback events and the conditions that triggered them.
  • Retry/timeout rates by endpoint and client cohort.
  • Invariant violation rate (should be ~0).
  • Admission-control / rate-limit rejections (by reason).

Rollback plan

  • Use canaries and staged rollout; stop early when signals degrade.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Keep dual-write / dual-verify windows where appropriate.
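
An explicit rollback trigger can be as small as a table of metrics and thresholds evaluated against canary telemetry. The sketch below is illustrative only: the metric names and threshold values are assumptions, and real values would come from your own SLOs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical thresholds for illustration.
TRIGGERS = [
    Threshold("error_rate", 0.01),
    Threshold("p99_latency_ms", 500.0),
    Threshold("invariant_violations", 0.0),
]


def breached(metrics: Dict[str, float], triggers: List[Threshold]) -> List[str]:
    """Return the metrics over threshold; a missing metric counts as breached."""
    out = []
    for t in triggers:
        value = metrics.get(t.metric)
        if value is None or value > t.max_value:
            out.append(t.metric)
    return out


canary = {"error_rate": 0.002, "p99_latency_ms": 620.0,
          "invariant_violations": 0.0}
should_roll_back = bool(breached(canary, TRIGGERS))
```

Treating a missing metric as a breach is a deliberate choice in this sketch: an observability gap during a rollout is itself a reason to stop, since the evidence needed to judge the change is absent.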

Evidence

  • Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
  • Learn TLA+ (2) — Practical workflow and examples.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

  • Which invariants are cheap enough to monitor in production?
  • How will you keep models aligned during rapid iteration?
  • Which properties are you currently assuming but not testing or proving?
  • What is the smallest model that reproduces your worst incident class?

Checklist

  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Rollback plan rehearsed and automated.
  • Failure modes enumerated with mitigations.
  • Assumptions listed and reviewed.
  • Telemetry captures correctness signals.
  • Safety properties stated as invariants.

Further reading

1. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/
2. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/