Monthly research note. Theme: Formal Methods & Verification.

TL;DR

A focused memo on concurrency testing in Rust (Loom, schedules, and determinism). The approach: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

  • Refinement boundaries prevent spec drift between paper and code.
  • Write properties in plain language next to the formal statement.
  • Keep models small enough to run in seconds or they will rot.
  • Make boundaries boring: validate inputs, cap costs, and be deterministic where needed.
  • Bind security decisions to evidence (audit, invariants, telemetry).

Why this matters

  • Most catastrophic bugs are small: a missing condition, a stale variable, a rare interleaving.
  • Verification complements testing by exploring adversarial schedules systematically.
  • Counterexamples are better than intuition—they are executable bug reports.
  • Formal models force you to name assumptions (time, ordering, failure).

Key questions

  • How do you convert counterexamples into test harnesses?
  • What is the smallest model that still captures the bug class you fear?
  • What is the refinement boundary between spec and implementation?
  • How do you handle state explosion (symmetry, abstraction, bounds)?
  • Which invariants must hold under every interleaving and crash point?
  • How do you ensure proofs stay valid through refactors and upgrades?

Assumptions

  • Teams need workflows that keep models and code aligned over time.
  • Specifications omit details; implementations invent them. That gap is risk.
  • Adversaries choose the worst schedule, not the average one.
  • Most systems have implicit assumptions about timeouts and ordering.

Non-goals

  • Proving the whole system end-to-end with all implementation details.
  • Writing models that can’t produce counterexamples quickly.

Attack surface

Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.
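
As one illustration of protecting the pipeline, the sketch below neutralizes log injection and caps label cardinality. It is a minimal sketch under stated assumptions: the names `sanitize_log_field` and `LabelLimiter`, the `"other"` overflow bucket, and the length cap are hypothetical choices for illustration, not part of any particular logging or metrics library.

```rust
use std::collections::HashSet;

/// Escapes line breaks and replaces other control characters so that
/// attacker-controlled input cannot forge extra log lines.
/// `max_len` caps the number of input characters that are kept.
fn sanitize_log_field(input: &str, max_len: usize) -> String {
    let mut out = String::new();
    for c in input.chars().take(max_len) {
        match c {
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            // Any remaining control character (tab included) becomes '?'.
            c if c.is_control() => out.push('?'),
            c => out.push(c),
        }
    }
    out
}

/// Admits at most `max` distinct label values; later new values collapse
/// into a single overflow bucket (hypothetical name: "other").
struct LabelLimiter {
    seen: HashSet<String>,
    max: usize,
}

impl LabelLimiter {
    fn new(max: usize) -> Self {
        LabelLimiter { seen: HashSet::new(), max }
    }

    fn admit(&mut self, label: &str) -> String {
        if self.seen.contains(label) {
            return label.to_string();
        }
        if self.seen.len() < self.max {
            self.seen.insert(label.to_string());
            return label.to_string();
        }
        "other".to_string()
    }
}
```

Both functions bound cost as well as content: the field length and the number of distinct labels are capped no matter what the input looks like.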

Model & invariants

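A common way to state linearizability is the existence of a legal sequential history H_s that matches the observed concurrent history H_c:

\[ \exists H_s:\ H_s \text{ is sequential } \wedge H_s \sim H_c. \]

Here \sim is read as equivalence that also preserves the real-time order of non-overlapping operations.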
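
For tiny histories, the existence claim can be checked by brute force: try every sequential order that respects real-time precedence and see whether any of them is legal. The sketch below does this for a single integer register with initial value 0. The types `Op` and `Kind`, the logical timestamps, and the function `is_linearizable` are assumptions made for illustration; real checkers use far better search strategies.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Write(i32),
    Read(i32), // the value the read returned
}

/// One completed operation in the concurrent history H_c.
#[derive(Clone, Copy, Debug)]
struct Op {
    invoke: u32,  // logical time of invocation
    respond: u32, // logical time of response
    kind: Kind,
}

/// True if some sequential order H_s of `history` respects real-time
/// order and the semantics of a register that starts at 0.
fn is_linearizable(history: &[Op]) -> bool {
    let mut used = vec![false; history.len()];
    search(history, &mut used, 0, 0)
}

fn search(history: &[Op], used: &mut [bool], placed: usize, value: i32) -> bool {
    if placed == history.len() {
        return true; // every operation has a slot in H_s
    }
    for i in 0..history.len() {
        if used[i] {
            continue;
        }
        // `i` may go next only if no unplaced operation finished before it began.
        let blocked = (0..history.len())
            .any(|j| j != i && !used[j] && history[j].respond < history[i].invoke);
        if blocked {
            continue;
        }
        let next_value = match history[i].kind {
            Kind::Write(v) => v,
            Kind::Read(v) if v == value => value,
            Kind::Read(_) => continue, // read disagrees with the register
        };
        used[i] = true;
        let ok = search(history, used, placed + 1, next_value);
        used[i] = false;
        if ok {
            return true;
        }
    }
    false
}
```

A read that overlaps a write may return either the old or the new value. A read that starts after the write has responded must return the new one, which is exactly what the `blocked` check enforces.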

Model the scheduler explicitly when concurrency is part of the threat model.

Keep the model small enough to run in seconds; large models rot.
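
To make "model the scheduler explicitly" concrete, here is a minimal sketch of a model in which the scheduler is just a loop over every possible next thread. Two threads each increment a shared counter with a separate load and store, and the explorer enumerates all interleavings and records those that violate the invariant "final value is 2". All names (`State`, `step`, `explore`, `check_all_schedules`) are hypothetical. Loom applies the same idea of exhaustive schedule exploration to real Rust code through its own versions of the standard atomics and thread types; this sketch uses only the standard library and a hand-written model.

```rust
/// Model state: the shared counter plus each thread's program counter
/// and local register.
#[derive(Clone, Debug, Default)]
struct State {
    shared: u32,
    pc: [usize; 2],
    local: [u32; 2],
}

const STEPS_PER_THREAD: usize = 2;

/// Executes one step of thread `t`: step 0 loads, step 1 stores local + 1.
fn step(s: &mut State, t: usize) {
    match s.pc[t] {
        0 => s.local[t] = s.shared,
        1 => s.shared = s.local[t] + 1,
        _ => unreachable!("thread already finished"),
    }
    s.pc[t] += 1;
}

/// Depth-first exploration of every schedule. Schedules whose final
/// state violates the invariant are collected in `bad`.
fn explore(s: State, schedule: &mut Vec<usize>, bad: &mut Vec<Vec<usize>>, total: &mut usize) {
    let runnable: Vec<usize> = (0..2).filter(|&t| s.pc[t] < STEPS_PER_THREAD).collect();
    if runnable.is_empty() {
        *total += 1;
        if s.shared != 2 {
            bad.push(schedule.clone()); // the counterexample is the schedule itself
        }
        return;
    }
    for t in runnable {
        let mut next = s.clone();
        step(&mut next, t);
        schedule.push(t);
        explore(next, schedule, bad, total);
        schedule.pop();
    }
}

/// Returns (number of schedules explored, violating schedules).
fn check_all_schedules() -> (usize, Vec<Vec<usize>>) {
    let mut bad = Vec::new();
    let mut total = 0;
    explore(State::default(), &mut Vec::new(), &mut bad, &mut total);
    (total, bad)
}
```

With two threads of two steps each there are six schedules. Only the two fully serial ones keep the invariant, so `check_all_schedules()` reports four counterexamples, for instance the schedule `[0, 1, 0, 1]`. The model stays small because each thread is reduced to the two steps that matter.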

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Evidence: critical actions emit verifiable audit events.
  • Replay resistance: duplicated inputs do not change outcomes.
  • Authenticity: actions are bound to identity and purpose.
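
Replay resistance in particular is easy to state as executable code. The sketch below assumes every input carries a unique request id (an assumption not stated in the properties above) and uses a hypothetical `Ledger` type that remembers the outcome of each id, so a duplicate returns the recorded result instead of being applied again.

```rust
use std::collections::HashMap;

/// Hypothetical state machine: a balance plus the recorded outcome of
/// every request id that has already been applied.
#[derive(Default)]
struct Ledger {
    balance: i64,
    applied: HashMap<u64, i64>,
}

impl Ledger {
    /// Applies `delta` at most once per `request_id` and returns the
    /// balance observed by that request. Replays return the same value.
    fn apply(&mut self, request_id: u64, delta: i64) -> i64 {
        if let Some(&result) = self.applied.get(&request_id) {
            return result; // duplicate input: outcome unchanged
        }
        self.balance += delta;
        self.applied.insert(request_id, self.balance);
        self.balance
    }
}
```

The property to test is that applying any input twice leaves the state identical to applying it once. In a real system the `applied` map would also need a bounded retention policy, which this sketch leaves out.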

Failure modes

  • Mixed-version behavior that violates assumptions silently.
  • Observability gaps during incidents (missing evidence).
  • Config drift that weakens security posture over time.
  • Recovery paths that only work when nothing is broken.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart TD
  props["Properties"] --> inv["Invariants"]
  inv --> model["Model"]
  model --> cex["Counterexamples"]
  cex --> tests["Regression Tests"]
  tests --> model

Implementation notes

Keep refinement boundaries explicit: what the spec promises vs what code enforces.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

// Practical tip: make the model "executable" enough to emit traces you can replay.
// Then treat traces as regression inputs for your implementation.
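
One way to follow this tip is to store each counterexample as a plain schedule and replay it deterministically against the implementation. The sketch below assumes a hypothetical trace format (comma-separated thread ids, one per scheduled step) and a hypothetical `Counter` trait that exposes the implementation one step at a time. `LoadStore` is a deliberately buggy increment, and `SingleStep` stands in for a fix that makes the increment indivisible.

```rust
/// Parses a trace such as "0, 1, 0, 1" into a list of thread ids.
fn parse_trace(text: &str) -> Result<Vec<usize>, std::num::ParseIntError> {
    text.split(',').map(|tok| tok.trim().parse::<usize>()).collect()
}

/// Implementation under test, structured so that each access to shared
/// state is one schedulable step.
trait Counter {
    /// Runs the next step of thread `t`.
    fn step(&mut self, t: usize);
    fn value(&self) -> u32;
}

/// Buggy version: an increment is a load followed by a separate store.
#[derive(Default)]
struct LoadStore {
    shared: u32,
    local: [Option<u32>; 2],
}

impl Counter for LoadStore {
    fn step(&mut self, t: usize) {
        match self.local[t].take() {
            None => self.local[t] = Some(self.shared), // load
            Some(v) => self.shared = v + 1,            // store
        }
    }
    fn value(&self) -> u32 {
        self.shared
    }
}

/// Fixed version: the increment happens in one step, and each thread's
/// later steps are no-ops.
#[derive(Default)]
struct SingleStep {
    shared: u32,
    done: [bool; 2],
}

impl Counter for SingleStep {
    fn step(&mut self, t: usize) {
        if !self.done[t] {
            self.shared += 1;
            self.done[t] = true;
        }
    }
    fn value(&self) -> u32 {
        self.shared
    }
}

/// Replays a recorded schedule and returns the final counter value.
fn replay<C: Counter>(mut c: C, trace: &[usize]) -> u32 {
    for &t in trace {
        c.step(t);
    }
    c.value()
}
```

Replaying `"0, 1, 0, 1"` loses an update on `LoadStore` (final value 1) and passes on `SingleStep` (final value 2). Because the replay involves no real threads, the regression test gives the same result on every run.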

Verification strategy

  • Proof maintenance: keep models in CI with a time budget.
  • Runtime assertions for invariants that are cheap to check.
  • Refinement tests: compare model traces to implementation traces.
  • Differential tests against other implementations/specs.
  • Property-based tests derived from invariants.
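
Refinement tests and property-based tests combine naturally: generate random operations from a seed, apply them to both a trivially correct model and the implementation, and stop at the first step where they disagree. The sketch below is one possible shape, with hypothetical names throughout. `Ring` is the implementation under test, `VecDeque` plays the model, and a small xorshift generator keeps every run reproducible from its seed.

```rust
use std::collections::VecDeque;

/// Implementation under test: a fixed-capacity ring buffer.
struct Ring {
    buf: Vec<Option<u8>>,
    head: usize,
    len: usize,
}

impl Ring {
    fn new(cap: usize) -> Self {
        Ring { buf: vec![None; cap], head: 0, len: 0 }
    }

    /// Returns false when the buffer is full.
    fn push(&mut self, v: u8) -> bool {
        if self.len == self.buf.len() {
            return false;
        }
        let tail = (self.head + self.len) % self.buf.len();
        self.buf[tail] = Some(v);
        self.len += 1;
        true
    }

    fn pop(&mut self) -> Option<u8> {
        if self.len == 0 {
            return None;
        }
        let v = self.buf[self.head].take();
        self.head = (self.head + 1) % self.buf.len();
        self.len -= 1;
        v
    }
}

/// Small xorshift generator so that a failing seed reproduces exactly.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Runs `steps` random operations against model and implementation.
/// Returns the index of the first diverging step, if any.
fn first_divergence(seed: u64, cap: usize, steps: usize) -> Option<usize> {
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    let mut model: VecDeque<u8> = VecDeque::new();
    let mut ring = Ring::new(cap);
    for i in 0..steps {
        let r = rng.next();
        let same = if r % 2 == 0 {
            let v = (r >> 8) as u8;
            let model_ok = model.len() < cap;
            if model_ok {
                model.push_back(v);
            }
            ring.push(v) == model_ok
        } else {
            ring.pop() == model.pop_front()
        };
        // Invariant taken from the model: never more than `cap` elements.
        if !same || ring.len > cap {
            return Some(i);
        }
    }
    None
}
```

A CI job can loop over a fixed range of seeds within its time budget. When a seed fails, the pair (seed, step index) is enough to reproduce the divergence locally.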

Operational notes

  • Keep a library of “known hard schedules” from past failures.
  • Run the model checker in CI with explicit timeouts and bounds.
  • Use models to evaluate protocol upgrades before shipping.
  • Treat counterexamples as incidents: track, root-cause, regression-test.
  • Version properties and invariants like code; review changes carefully.

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Admission-control / rate-limit rejections (by reason).
  • Invariant violation rate (should be ~0).
  • Authz failures and policy denials (unexpected spikes).
  • Error budget burn + tail latency under load.
  • Rollback events and the conditions that triggered them.

Rollback plan

  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Keep dual-write / dual-verify windows where appropriate.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Define an explicit rollback trigger (metrics + thresholds).
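
An explicit rollback trigger can be as small as a pure function from observed signals to a decision, which makes it easy to test and review. The sketch below is illustrative only: the signal names, the thresholds, and the rule that any invariant violation forces a rollback are assumptions, not recommendations for specific values.

```rust
/// Signals sampled from a canary (hypothetical field names).
#[derive(Debug, Clone, Copy)]
struct Signals {
    error_rate: f64,
    p99_latency_ms: f64,
    invariant_violations: u64,
}

/// Thresholds agreed on before the rollout starts.
#[derive(Debug, Clone, Copy)]
struct Thresholds {
    max_error_rate: f64,
    max_p99_latency_ms: f64,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Continue,
    Rollback(&'static str), // carries the reason for the audit trail
}

/// Pure decision function: same inputs always give the same decision.
fn evaluate(s: Signals, t: Thresholds) -> Decision {
    if s.invariant_violations > 0 {
        return Decision::Rollback("invariant violation");
    }
    if s.error_rate > t.max_error_rate {
        return Decision::Rollback("error rate");
    }
    if s.p99_latency_ms > t.max_p99_latency_ms {
        return Decision::Rollback("tail latency");
    }
    Decision::Continue
}
```

Keeping the trigger free of I/O means the thresholds can be versioned and reviewed like any other property, and the reason string can be written to the audit log when a rollback fires.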

Evidence

  • Learn TLA+ (1) — Practical workflow and examples.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
  • Jepsen (2) — Fault injection and correctness testing for distributed systems.
    • Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

  • Which properties are you currently assuming but not testing or proving?
  • Which invariants are cheap enough to monitor in production?
  • What is the smallest model that reproduces your worst incident class?
  • How will you keep models aligned during rapid iteration?

Checklist

  • Rollback plan rehearsed and automated.
  • Safety properties stated as invariants.
  • Failure modes enumerated with mitigations.
  • Telemetry captures correctness signals.
  • Assumptions listed and reviewed.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

Further reading

1. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/
2. Jepsen. Jepsen: Distributed Systems Safety Analysis [Internet]. Available from: https://jepsen.io/