Monthly research note. Theme: Formal Methods & Verification.
TL;DR
A focused memo on differential testing, using other implementations as oracles: define the model, state the properties, then design the system so those properties hold under failures and adversaries.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- Keep models small enough to run in seconds or they will rot.
- Counterexamples are engineering artifacts—minimize them and turn them into tests.
- Model the smallest system that can still fail in the way you fear.
- Make failure modes explicit and observable.
- Automate guardrails; humans are for judgment, not for consistent enforcement.
Why this matters
- Verification complements testing by exploring adversarial schedules systematically.
- Refinement boundaries prevent “spec drift” between paper and code.
- Most catastrophic bugs are small: a missing condition, a stale variable, a rare interleaving.
- The goal is not a perfect proof—it’s reducing the space of unknown failure modes.
Key questions
- What is the smallest model that still captures the bug class you fear?
- What is the refinement boundary between spec and implementation?
- What is the environment model (adversary actions, scheduling, failures)?
- Which properties belong in the model vs in tests vs in monitoring?
- Which invariants must hold under every interleaving and crash point?
- How do you ensure proofs stay valid through refactors and upgrades?
Assumptions
- Specifications omit details; implementations invent them. That gap is risk.
- Concurrency introduces interleavings humans don’t reason about reliably.
- Most systems have implicit assumptions about timeouts and ordering.
- Teams need workflows that keep models and code aligned over time.
Non-goals
- Proving the whole system end-to-end with all implementation details.
- Treating verification as a one-time event rather than a process.
Any unbounded work per request becomes a DoS primitive under adversaries.
Model & invariants
In temporal logic terms, the common shape is:
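One conventional way to write that shape, assuming TLA+-style linear temporal logic. The symbols Init, Next, vars, Inv, P, and Q are placeholders for illustration, not names from a specific model:

```latex
\begin{align*}
\mathit{Spec} &\triangleq \mathit{Init} \land \Box[\mathit{Next}]_{\mathit{vars}} \\
\text{Safety:} &\quad \mathit{Spec} \Rightarrow \Box\,\mathit{Inv} \\
\text{Liveness:} &\quad \mathit{Spec} \Rightarrow \Box\,(P \Rightarrow \Diamond Q)
\end{align*}
```

The safety line says the invariant holds in every reachable state. The liveness line says every state satisfying P is eventually followed by one satisfying Q; in practice it only holds if the spec also includes a fairness assumption.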
Keep the model small enough to run in seconds; large models rot.
Model the scheduler explicitly when concurrency is part of the threat model.
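As a rough illustration of an explicit scheduler, the sketch below enumerates every interleaving of a hypothetical two-process model in which each process performs a non-atomic increment (read, then write). The model, the state layout, and all function names are assumptions made for this example. A real project would more likely use a dedicated model checker.

```python
from collections import deque

# Hypothetical model: two processes each read a shared counter into a local
# copy, then write local + 1 back. The scheduler is the nondeterministic
# choice of which process takes the next step.

def initial_state():
    # (shared counter, program counters, local copies)
    return (0, (0, 0), (0, 0))

def next_states(state):
    counter, pcs, local = state
    for p in range(2):
        if pcs[p] == 0:  # read step
            new_local = local[:p] + (counter,) + local[p + 1:]
            new_pcs = pcs[:p] + (1,) + pcs[p + 1:]
            yield f"p{p}:read", (counter, new_pcs, new_local)
        elif pcs[p] == 1:  # write step
            new_pcs = pcs[:p] + (2,) + pcs[p + 1:]
            yield f"p{p}:write", (local[p] + 1, new_pcs, local)

def invariant(state):
    counter, pcs, _ = state
    # Once both processes are done, no increment may have been lost.
    return pcs != (2, 2) or counter == 2

def find_counterexample():
    """Breadth-first search over all schedules; returns a shortest bad trace."""
    start = initial_state()
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return list(reversed(trace))
        for label, nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None

trace = find_counterexample()
```

The returned trace is a schedule in which both reads happen before either write, which is the classic lost update. That trace is the kind of artifact worth turning into a regression test.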
Make the “impossible state” observable: a metric or alert that fires when invariants drift.
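A minimal sketch of making an "impossible state" observable, assuming a toy account-transfer example. The `violations` counter stands in for a real metrics client, and `check_invariant` and `transfer` are hypothetical names introduced here:

```python
from collections import Counter

# Hypothetical in-process metric sink; a real system would export these
# counts to its monitoring stack and alert on any nonzero rate.
violations = Counter()

def check_invariant(name, holds):
    """Count a violation instead of crashing, and report whether it held."""
    if not holds:
        violations[name] += 1
    return holds

def transfer(balances, src, dst, amount):
    """Move funds, then check conservation of the total and non-negativity."""
    total_before = sum(balances.values())
    balances[src] -= amount
    balances[dst] += amount
    check_invariant("total_conserved", sum(balances.values()) == total_before)
    check_invariant("non_negative", all(v >= 0 for v in balances.values()))
```

Counting by invariant name keeps the signal cheap and lets an alert say which property drifted, not just that something went wrong.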
Security properties
- Integrity: invalid transitions are rejected (and detectable).
- Least authority: privileges are scoped by purpose and time.
- Evidence: critical actions emit verifiable audit events.
- Downgrade resistance: negotiation can’t silently weaken security posture.
Failure modes
- Mixed-version behavior that violates assumptions silently.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Observability gaps during incidents (missing evidence).
- Timeout ambiguity causing double-apply or partial state transitions.
Caches tend to become sources of truth unless you can recompute and validate them.
Design sketch
flowchart TD
props["Properties"] --> inv["Invariants"]
inv --> model["Model"]
model --> cex["Counterexamples"]
cex --> tests["Regression Tests"]
tests --> model
Implementation notes
Keep refinement boundaries explicit: what the spec promises vs what code enforces.
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
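The ordering described above (size check, parse, validate, cap cost, and only then do real work) might look like the following sketch. The limits, the JSON request shape with an `items` list, and the names `admit` and `RequestRejected` are all assumptions for illustration:

```python
import json

MAX_BODY_BYTES = 4096  # assumed cap on raw request size
MAX_ITEMS = 100        # assumed cap on work units per request

class RequestRejected(Exception):
    """Raised with a short reason code so rejections can be counted by reason."""

def admit(raw: bytes) -> list:
    """Check size, parse, validate shape, and cap cost before any heavy work."""
    if len(raw) > MAX_BODY_BYTES:
        raise RequestRejected("body_too_large")
    try:
        body = json.loads(raw)
    except (ValueError, RecursionError):
        # Covers malformed JSON, bad encodings, and pathological nesting.
        raise RequestRejected("malformed_json")
    items = body.get("items") if isinstance(body, dict) else None
    if not isinstance(items, list) or not all(isinstance(i, int) for i in items):
        raise RequestRejected("invalid_shape")
    if len(items) > MAX_ITEMS:
        raise RequestRejected("too_many_items")
    return items
```

The cheapest check runs first, so an oversized body is rejected before the parser ever sees it.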
Workflow:
1) Write a model with a few state variables.
2) State invariants (safety) and progress conditions (liveness).
3) Run the model checker with tight bounds.
4) Minimize counterexamples into test cases.
5) Iterate until failures are boring.
Verification strategy
- Property-based tests derived from invariants.
- Model checking bounded versions of the core protocol.
- Runtime assertions for invariants that are cheap to check.
- Refinement tests: compare model traces to implementation traces.
- Differential tests against other implementations/specs.
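To make the last item concrete, here is a small differential-testing sketch. It uses a toy run-length encoder as the subject: `reference_rle` plays the oracle, and `candidate_rle` is a second implementation with a deliberately planted bug. Both functions, the input generator, and the shrinker are hypothetical and exist only for this example.

```python
import random

def reference_rle(s):
    """Straightforward run-length encoder, used as the oracle."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def candidate_rle(s):
    """Second implementation with a planted bug: the final run is dropped."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        if j < len(s):  # BUG (intentional): the last run is never emitted
            out.append((s[i], j - i))
        i = j
    return out

def shrink(s, impl, oracle):
    """Greedily delete characters while the implementations still disagree."""
    changed = True
    while changed:
        changed = False
        for i in range(len(s)):
            t = s[:i] + s[i + 1:]
            if impl(t) != oracle(t):
                s, changed = t, True
                break
    return s

def differential_test(impl, oracle, trials=200, seed=0):
    """Return a minimized disagreeing input, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = "".join(rng.choice("ab") for _ in range(rng.randint(0, 8)))
        if impl(s) != oracle(s):
            return shrink(s, impl, oracle)
    return None

counterexample = differential_test(candidate_rle, reference_rle)
```

Here the minimized counterexample is a one-character string, which is small enough to paste directly into a regression test. A disagreement only shows that the implementations differ. Deciding which one is wrong still requires the spec.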
Operational notes
- Version properties and invariants like code; review changes carefully.
- Treat counterexamples as incidents: track, root-cause, regression-test.
- Use models to evaluate protocol upgrades before shipping.
- Run the model checker in CI with explicit timeouts and bounds.
- Keep a library of “known hard schedules” from past failures.
Attach explicit rollout/rollback triggers to changes that touch security or correctness.
What to monitor
- Rollback events and the conditions that triggered them.
- Invariant violation rate (should be ~0).
- Authz failures and policy denials (unexpected spikes).
- Error budget burn + tail latency under load.
- Admission-control / rate-limit rejections (by reason).
Rollback plan
- Keep dual-write / dual-verify windows where appropriate.
- Define an explicit rollback trigger (metrics + thresholds).
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Use canaries and staged rollout; stop early when signals degrade.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
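An explicit rollback trigger can be as small as a table of metrics and thresholds plus one function that evaluates it. The sketch below is illustrative only: the metric names and limits are invented, and real values would come from the service's own SLOs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float

# Hypothetical thresholds for illustration.
TRIGGERS = [
    Threshold("invariant_violations_per_min", 0.0),
    Threshold("error_rate", 0.02),
    Threshold("p99_latency_ms", 800.0),
]

def rollback_reasons(metrics: dict) -> list:
    """Return the breached triggers; a missing metric counts as a breach."""
    reasons = []
    for t in TRIGGERS:
        value = metrics.get(t.metric)
        if value is None:
            reasons.append(f"{t.metric}: missing")
        elif value > t.limit:
            reasons.append(f"{t.metric}: {value} > {t.limit}")
    return reasons
```

Treating a missing metric as a breach is a design choice in this sketch: it keeps an observability gap from silently passing a rollout gate.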
Evidence
- Jepsen (1) — Fault injection and correctness testing for distributed systems.
  - Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
- Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
  - Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Open questions
- How will you keep models aligned during rapid iteration?
- What is the smallest model that reproduces your worst incident class?
- Which invariants are cheap enough to monitor in production?
- Which properties are you currently assuming but not testing or proving?
Checklist
- Rollback plan rehearsed and automated.
- Telemetry captures correctness signals.
- Safety properties stated as invariants.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Assumptions listed and reviewed.
- Failure modes enumerated with mitigations.
Further reading
- Specifying Systems (Lamport) — The TLA+ reference for safety/liveness and system specs.
- Paxos Made Simple (Lamport) — A small protocol that demonstrates why specs matter.
- Learn TLA+ — Practical workflow and examples.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Jepsen — Fault injection and correctness testing for distributed systems.