Safety/Liveness Catalog: A Practical Checklist for Protocol Specs

Monthly research note. Theme: Formal Methods & Verification.

TL;DR

Define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Counterexamples are engineering artifacts—minimize them and turn them into tests.
Refinement boundaries prevent spec drift between paper and code.
Write properties in plain language next to the formal statement.
Write assumptions down; treat them as interfaces.
Treat retries, reordering, and partial failure as default conditions.

Why this matters

Counterexamples are better than intuition—they are executable bug reports.
Most catastrophic bugs are small: a missing condition, a stale variable, a rare interleaving.
Verification complements testing by exploring adversarial schedules systematically.
The goal is not a perfect proof—it’s reducing the space of unknown failure modes.

Key questions

What is the smallest model that still captures the bug class you fear?
How do you ensure proofs stay valid through refactors and upgrades?
Which properties belong in the model vs in tests vs in monitoring?
How do you handle state explosion (symmetry, abstraction, bounds)?
Which invariants must hold under every interleaving and crash point?
How do you convert counterexamples into test harnesses?

Assumptions

Concurrency introduces interleavings humans don’t reason about reliably.
Most systems have implicit assumptions about timeouts and ordering.
Teams need workflows that keep models and code aligned over time.
Adversaries choose the worst schedule, not the average one.

Non-goals

Treating verification as a one-time event rather than a process.
Proving the whole system end-to-end with all implementation details.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

In temporal logic terms, the common shape is:

\mathrm{Safety} \equiv \Box\,\mathrm{Inv}\qquad\qquad \mathrm{Liveness} \equiv \Box\Diamond\,\mathrm{Progress}.
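
As one hedged illustration of these shapes, consider a hypothetical two-process lock with a request/grant handshake. The predicate names crit_i, req_i, and grant_i are assumptions introduced here, not part of any particular protocol. Safety instantiates Inv as mutual exclusion; the liveness line uses the response form, a common variant of the recurrence shape above.

```latex
\mathrm{Safety} \equiv \Box\,\neg(\mathrm{crit}_0 \land \mathrm{crit}_1)
\qquad\qquad
\mathrm{Liveness} \equiv \Box\,(\mathrm{req}_i \Rightarrow \Diamond\,\mathrm{grant}_i)
```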

Treat counterexamples as regression tests: reduce, encode, and replay.

Keep the model small enough to run in seconds; large models rot.
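
To make "small enough to run in seconds" concrete, here is a minimal sketch of an explicit-state safety check. The toy model (two processes sharing a non-atomic check-then-set lock) and the names `next_states`, `mutual_exclusion`, and `check_invariant` are assumptions chosen for illustration; a real project would more likely use a dedicated model checker.

```python
from collections import deque

# Hypothetical toy model: two processes guard a critical section with a
# check-then-set lock that is NOT atomic. State = (pc0, pc1, lock).
INIT = ("idle", "idle", False)

def next_states(state):
    """Yield (action_label, successor) for every enabled step."""
    pcs, lock = list(state[:2]), state[2]
    for i in (0, 1):
        pc = pcs[i]
        if pc == "idle" and not lock:      # step 1: observe the lock is free
            new_pc, new_lock = "checked", lock
        elif pc == "checked":              # step 2: take the lock (too late)
            new_pc, new_lock = "crit", True
        elif pc == "crit":                 # step 3: leave and release
            new_pc, new_lock = "idle", False
        else:
            continue                       # idle while the lock is held: blocked
        succ = pcs.copy()
        succ[i] = new_pc
        yield f"p{i}:{pc}->{new_pc}", (succ[0], succ[1], new_lock)

def mutual_exclusion(state):
    """Safety invariant: at most one process is in the critical section."""
    return not (state[0] == "crit" and state[1] == "crit")

def check_invariant(init, step, inv):
    """Breadth-first search; returns a shortest violating trace, or None."""
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not inv(state):
            trace = []
            while parent[state] is not None:
                prev, label = parent[state]
                trace.append(label)
                state = prev
            return trace[::-1]
        for label, succ in step(state):
            if succ not in parent:
                parent[succ] = (state, label)
                queue.append(succ)
    return None

counterexample = check_invariant(INIT, next_states, mutual_exclusion)
print(counterexample)
```

Because the search is breadth-first, the returned trace is a shortest one: both processes observe the lock as free, then both enter. That four-step schedule is the kind of artifact worth encoding as a regression test.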

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

Integrity: invalid transitions are rejected (and detectable).
Replay resistance: duplicated inputs do not change outcomes.
Evidence: critical actions emit verifiable audit events.
Least authority: privileges are scoped by purpose and time.
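
Replay resistance and integrity can be sketched together in a few lines. The `Ledger` class and the idea of a client-chosen `request_id` are assumptions for illustration, not a prescribed design; the point is only that a duplicated input returns the recorded outcome instead of applying twice.

```python
class Ledger:
    """Applies each request at most once, keyed by a client-chosen request ID."""

    def __init__(self):
        self.balance = 0
        self._results = {}   # request_id -> result of the first application

    def deposit(self, request_id, amount):
        if request_id in self._results:
            # Replay: return the recorded outcome, change nothing.
            return self._results[request_id]
        if amount <= 0:
            # Integrity: invalid transitions are rejected.
            raise ValueError("amount must be positive")
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance

ledger = Ledger()
first = ledger.deposit("req-1", 50)
retry = ledger.deposit("req-1", 50)   # same ID: a retry or a replayed message
```

In a real system the deduplication table would need to be persisted with the state it protects; this sketch keeps both in memory.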

Failure modes

Config drift that weakens security posture over time.
Timeout ambiguity causing double-apply or partial state transitions.
Mixed-version behavior that violates assumptions silently.
Observability gaps during incidents (missing evidence).

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart TD
  props["Properties"] --> inv["Invariants"]
  inv --> model["Model"]
  model --> cex["Counterexamples"]
  cex --> tests["Regression Tests"]
  tests --> model

Implementation notes

Keep refinement boundaries explicit: what the spec promises vs what code enforces.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).
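
A minimal sketch of this rule for a file-backed log, assuming a POSIX-like filesystem. The function name `append_durably` and the log path are hypothetical; the calls themselves (`flush`, `os.fsync`) are standard library.

```python
import os
import tempfile

def append_durably(path, record: bytes) -> bool:
    """Append one record and return True (the ack) only after fsync succeeds."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its cache to stable storage
    return True                # reached only if no exception was raised

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
acked = append_durably(log_path, b"set x=1")
```

If `write` or `fsync` raises, no ack is returned and the caller must treat the outcome as unknown. Note that durability of a newly created file can also require syncing its parent directory, which this sketch omits.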

Workflow:
1) Write a model with a few state variables.
2) State invariants (safety) and progress conditions (liveness).
3) Run the model checker with tight bounds.
4) Minimize counterexamples into test cases.
5) Iterate until failures are boring.
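
Step 4 can be sketched as a greedy trace minimizer. This is a simplified one-step-at-a-time reduction, not full delta debugging, and the account model behind `fails` is a hypothetical stand-in for replaying a trace against your own model.

```python
def minimize(trace, fails):
    """Greedily drop steps while the trace still fails.

    The result is 1-minimal: removing any single remaining step
    makes the failure disappear.
    """
    current = list(trace)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):
                current = candidate
                changed = True
                break
    return current

def fails(trace):
    """Hypothetical replay: True if the balance ever goes negative."""
    balance = 0
    for action in trace:
        if action == "deposit":
            balance += 1
        elif action == "withdraw":
            balance -= 1          # the modeled bug: no balance check
        if balance < 0:           # invariant: balance >= 0
            return True
    return False

raw = ["noop", "deposit", "withdraw", "noop", "withdraw", "withdraw", "noop"]
minimal = minimize(raw, fails)
```

The minimized trace can then be checked in as a regression test, for example as an assertion that replaying it no longer violates the invariant once the bug is fixed.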

Verification strategy

Property-based tests derived from invariants.
Proof maintenance: keep models in CI with a time budget.
Runtime assertions for invariants that are cheap to check.
Refinement tests: compare model traces to implementation traces.
Differential tests against other implementations/specs.
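
The first, third, and fourth items can be combined in one small harness. The sketch below uses only the standard library rather than a property-testing framework; `BoundedQueue` is a hypothetical implementation under test, and a plain list plays the role of the model.

```python
import random
from collections import deque

class BoundedQueue:
    """Hypothetical implementation under test."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def push(self, x):
        if len(self._items) >= self.capacity:
            return False          # reject rather than overflow
        self._items.append(x)
        return True

    def pop(self):
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)

def run_random_trial(seed, steps=200, capacity=3):
    """Drive the queue with random operations; compare against a list model."""
    rng = random.Random(seed)     # seeded so a failure can be replayed
    impl, model = BoundedQueue(capacity), []
    for _ in range(steps):
        if rng.random() < 0.6:
            x = rng.randrange(100)
            accepted = impl.push(x)
            if len(model) < capacity:
                model.append(x)
                assert accepted, f"seed={seed}: push rejected below capacity"
            else:
                assert not accepted, f"seed={seed}: push accepted at capacity"
        else:
            expected = model.pop(0) if model else None
            assert impl.pop() == expected, f"seed={seed}: FIFO order broken"
        # Invariant checked after every step, like a runtime assertion.
        assert 0 <= len(impl) <= capacity, f"seed={seed}: size out of bounds"
    return True

results = [run_random_trial(seed) for seed in range(50)]
```

Including the seed in every assertion message means a failing run is itself a replayable counterexample.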

Operational notes

Run the model checker in CI with explicit timeouts and bounds.
Treat counterexamples as incidents: track, root-cause, regression-test.
Version properties and invariants like code; review changes carefully.
Keep a library of “known hard schedules” from past failures.
Use models to evaluate protocol upgrades before shipping.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

Error budget burn + tail latency under load.
Rollback events and the conditions that triggered them.
Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Admission-control / rate-limit rejections (by reason).

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Define an explicit rollback trigger (metrics + thresholds).
Keep dual-write / dual-verify windows where appropriate.
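
An explicit rollback trigger can be as small as a named set of thresholds plus one function. The metric names and threshold values below are assumptions for illustration; substitute whatever your telemetry actually reports.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RollbackTrigger:
    # Hypothetical thresholds; tune them to your own baselines.
    max_error_rate: float = 0.02          # fraction of failed requests
    max_p99_latency_ms: float = 800.0
    max_invariant_violations: int = 0     # any violation trips the trigger

def should_roll_back(metrics: dict, trigger: RollbackTrigger) -> list:
    """Return the list of tripped conditions; an empty list means keep going."""
    reasons = []
    if metrics["error_rate"] > trigger.max_error_rate:
        reasons.append("error_rate")
    if metrics["p99_latency_ms"] > trigger.max_p99_latency_ms:
        reasons.append("p99_latency_ms")
    if metrics["invariant_violations"] > trigger.max_invariant_violations:
        reasons.append("invariant_violations")
    return reasons

canary = {"error_rate": 0.01, "p99_latency_ms": 950.0, "invariant_violations": 0}
reasons = should_roll_back(canary, RollbackTrigger())
```

Returning the tripped conditions rather than a bare boolean keeps the evidence of why a rollback fired, which fits the note's emphasis on preserving what changed.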

Evidence

Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Learn TLA+ (2) — Practical workflow and examples.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

Which invariants are cheap enough to monitor in production?
How will you keep models aligned during rapid iteration?
Which properties are you currently assuming but not testing or proving?
What is the smallest model that reproduces your worst incident class?

Checklist

Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Rollback plan rehearsed and automated.
Failure modes enumerated with mitigations.
Assumptions listed and reviewed.
Telemetry captures correctness signals.
Safety properties stated as invariants.
