Property-Based Testing: Finding Bugs You Didn’t Imagine

Monthly research note. Theme: Formal Methods & Verification.

TL;DR

Treat property-based testing as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

Write properties in plain language next to the formal statement.
Keep models small enough to run in seconds or they will rot.
Refinement boundaries prevent spec drift between paper and code.
Measure correctness signals, not only latency/throughput.
Write assumptions down; treat them as interfaces.

Why this matters

Formal models force you to name assumptions (time, ordering, failure).
Refinement boundaries prevent “spec drift” between paper and code.
Verification complements testing by exploring adversarial schedules systematically.
The goal is not a perfect proof—it’s reducing the space of unknown failure modes.

Key questions

How do you handle state explosion (symmetry, abstraction, bounds)?
How do you convert counterexamples into test harnesses?
What is the environment model (adversary actions, scheduling, failures)?
What is the refinement boundary between spec and implementation?
Which invariants must hold under every interleaving and crash point?
How do you ensure proofs stay valid through refactors and upgrades?

Assumptions

Teams need workflows that keep models and code aligned over time.
Adversaries choose the worst schedule, not the average one.
Concurrency introduces interleavings humans don’t reason about reliably.
Specifications omit details; implementations invent them. That gap is risk.

Non-goals

Assuming the spec and the code share the same definitions implicitly.
Writing models that can’t produce counterexamples quickly.

Attack surface

Observability pipelines can themselves be attacked (cardinality explosions, log injection), so bound label cardinality and escape untrusted input before it is logged.

Model & invariants

In temporal logic terms, the common shape is:

\mathrm{Safety} \equiv \Box\,\mathrm{Inv}\qquad\qquad \mathrm{Liveness} \equiv \Box\Diamond\,\mathrm{Progress}.

Write properties in plain language next to the formal version.

Model the scheduler explicitly when concurrency is part of the threat model.
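
As a rough illustration of the two shapes, the sketch below evaluates them over a finite recorded trace. A finite trace cannot establish true liveness, so `progresses_within` is only a bounded stand-in (every window of consecutive states must contain a progress state). The state layout (`queued`, `served`) and both function names are assumptions made for this example.

```python
from typing import Callable, Sequence

State = dict  # hypothetical state representation for this sketch


def holds_always(trace: Sequence[State], inv: Callable[[State], bool]) -> bool:
    """Safety on a finite trace: the invariant holds in every state."""
    return all(inv(s) for s in trace)


def progresses_within(trace: Sequence[State],
                      progress: Callable[[State], bool],
                      window: int) -> bool:
    """Bounded stand-in for liveness: each run of `window` consecutive
    states contains at least one progress state. Traces shorter than
    `window` pass vacuously."""
    return all(
        any(progress(s) for s in trace[i:i + window])
        for i in range(len(trace) - window + 1)
    )


trace = [
    {"queued": 2, "served": False},
    {"queued": 1, "served": True},
    {"queued": 1, "served": False},
    {"queued": 0, "served": True},
]

safe = holds_always(trace, lambda s: s["queued"] >= 0)
live_2 = progresses_within(trace, lambda s: s["served"], window=2)
live_1 = progresses_within(trace, lambda s: s["served"], window=1)
```

Here `safe` and `live_2` hold, while `live_1` does not, because the first state makes no progress. Choosing the window is itself an assumption worth writing down next to the property.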

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).
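
One way to read this is that an invariant check should take only the evidence as input. The sketch below assumes a hypothetical ledger whose evidence is a current balance (state), an append-only list of applied deltas (log), and an applied-entry counter; none of these names come from a real system.

```python
def check_ledger_evidence(balance, applied_count, log):
    """Return a list of human-readable violations; empty means the
    state, the counter, and the log agree with each other."""
    violations = []
    if applied_count != len(log):
        violations.append(
            f"counter says {applied_count} entries, log has {len(log)}"
        )
    expected = sum(entry["delta"] for entry in log)
    if balance != expected:
        violations.append(
            f"balance {balance} != sum of logged deltas {expected}"
        )
    return violations


log = [{"id": 1, "delta": 50}, {"id": 2, "delta": -20}]
consistent = check_ledger_evidence(balance=30, applied_count=2, log=log)
drifted = check_ledger_evidence(balance=50, applied_count=2, log=log)
```

If a check like this needs data that is not retained (for example, the log is sampled), the invariant is not really checkable and should be restated.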

Security properties

Replay resistance: duplicated inputs do not change outcomes.
Integrity: invalid transitions are rejected (and detectable).
Least authority: privileges are scoped by purpose and time.
Authenticity: actions are bound to identity and purpose.
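
Replay resistance translates fairly directly into a property: delivering each request one or more times, in any order, should give the same outcome as delivering each exactly once. The sketch below uses a hand-rolled random generator from the standard library rather than a property-testing framework, and the `Ledger` class is a hypothetical stand-in for the system under test.

```python
import random


class Ledger:
    """Hypothetical system under test: applies each request id at most once."""

    def __init__(self):
        self.balance = 0
        self.seen = set()

    def apply(self, request_id, amount):
        if request_id in self.seen:
            return False  # duplicate delivery, ignored
        self.seen.add(request_id)
        self.balance += amount
        return True


def replay_resistant(seed):
    """Property: duplicated, reordered deliveries do not change the outcome."""
    rng = random.Random(seed)
    requests = [(i, rng.randint(-100, 100)) for i in range(rng.randint(0, 20))]
    # Deliver every request between one and three times, in shuffled order.
    deliveries = [r for r in requests for _ in range(rng.randint(1, 3))]
    rng.shuffle(deliveries)

    once, replayed = Ledger(), Ledger()
    for request_id, amount in requests:
        once.apply(request_id, amount)
    for request_id, amount in deliveries:
        replayed.apply(request_id, amount)
    return once.balance == replayed.balance


failing_seeds = [seed for seed in range(500) if not replay_resistant(seed)]
```

Recording the failing seed, rather than only a pass or fail bit, keeps any counterexample reproducible.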

Failure modes

Mixed-version behavior that violates assumptions silently.
Config drift that weakens security posture over time.
Observability gaps during incidents (missing evidence).
Recovery paths that only work when nothing is broken.

Pitfall

Sampling hides the rare schedule that breaks your invariants.
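
For small models, enumerating every schedule sidesteps the sampling problem entirely. The sketch below is a toy: two hypothetical workers each perform a non-atomic increment (read, then write) on a shared counter, and every interleaving that preserves per-worker order is explored. In this tiny model the bad schedules are common; in larger systems they tend to be a much smaller fraction, which is where sampling misses them.

```python
from itertools import combinations


def interleavings(n_a, n_b):
    """Yield every order of n_a steps of worker 'A' and n_b steps of
    worker 'B' that preserves each worker's own step order."""
    total = n_a + n_b
    for a_slots in combinations(range(total), n_a):
        slots = set(a_slots)
        yield tuple("A" if i in slots else "B" for i in range(total))


def run(schedule):
    """Execute a schedule; each worker's first step reads the counter
    and its second step writes back the value it read plus one."""
    counter = 0
    local = {}
    steps_done = {"A": 0, "B": 0}
    for worker in schedule:
        if steps_done[worker] == 0:
            local[worker] = counter        # read
        else:
            counter = local[worker] + 1    # write (may lose an update)
        steps_done[worker] += 1
    return counter


all_schedules = list(interleavings(2, 2))
# Invariant: after both increments the counter equals 2.
bad_schedules = [s for s in all_schedules if run(s) != 2]
```

Each entry in `bad_schedules` is a concrete counterexample that can be replayed deterministically.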

Design sketch

flowchart TD
  props["Properties"] --> inv["Invariants"]
  inv --> model["Model"]
  model --> cex["Counterexamples"]
  cex --> tests["Regression Tests"]
  tests --> model

Implementation notes

Treat invariants as code: version, review, and test them.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Workflow:
1) Write a model with a few state variables.
2) State invariants (safety) and progress conditions (liveness).
3) Run the model checker with tight bounds.
4) Minimize counterexamples into test cases.
5) Iterate until failures are boring.
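
The workflow above can be sketched end to end with a toy explicit-state checker. This is not TLA+ or any real tool: it is a breadth-first search over a hypothetical two-process protocol in which each process checks the other's flag and only later sets its own, a deliberately flawed design. The state encoding `(pc_a, pc_b, flag_a, flag_b)` and all function names are assumptions for illustration.

```python
from collections import deque

INIT = (0, 0, False, False)  # (pc_a, pc_b, flag_a, flag_b)


def next_states(state):
    """pc 0: wait until the other flag is clear; pc 1: set own flag and
    enter the critical section; pc 2: leave and clear own flag."""
    pcs, flags = list(state[:2]), list(state[2:])
    for i in (0, 1):
        other = 1 - i
        pc, fl = pcs[:], flags[:]
        if pcs[i] == 0:
            if flags[other]:
                continue  # blocked on the other process's flag
            pc[i] = 1
        elif pcs[i] == 1:
            fl[i], pc[i] = True, 2
        else:
            fl[i], pc[i] = False, 0
        yield (pc[0], pc[1], fl[0], fl[1])


def mutual_exclusion(state):
    """Safety invariant: both processes are never critical at once."""
    return not (state[0] == 2 and state[1] == 2)


def check(init, successors, invariant, max_depth):
    """Bounded BFS; returns a shortest violating trace, or None."""
    parent = {init: None}
    queue = deque([(init, 0)])
    while queue:
        state, depth = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        if depth == max_depth:
            continue  # bound reached: do not expand further
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append((nxt, depth + 1))
    return None


counterexample = check(INIT, next_states, mutual_exclusion, max_depth=10)
```

Because the search is breadth-first, the returned trace is already a shortest one within the bound, which makes it a reasonable starting point for a regression test. A bound that is too tight (here, `max_depth=3`) reports nothing, so the bound belongs in the written assumptions.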

Verification strategy

Property-based tests derived from invariants.
Runtime assertions for invariants that are cheap to check.
Refinement tests: compare model traces to implementation traces.
Model checking bounded versions of the core protocol.
Differential tests against other implementations/specs.
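
Two of these strategies combine naturally: generate random inputs, compare the implementation against a simple reference, and shrink any failing input before filing it. The sketch below uses only the standard library instead of a property-testing framework; `dedupe` is a hypothetical implementation with a deliberate bug (it removes only adjacent repeats), and the greedy `shrink` is a simplification of what real frameworks do.

```python
import random


def dedupe(xs):
    """Hypothetical implementation under test. Deliberate bug: only
    adjacent repeats are removed."""
    out = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out


def prop_matches_reference(xs):
    """Differential property: agree with an order-preserving reference."""
    return dedupe(xs) == list(dict.fromkeys(xs))


def shrink(xs, prop):
    """Greedily delete single elements while the property still fails."""
    shrunk = True
    while shrunk:
        shrunk = False
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if not prop(candidate):
                xs, shrunk = candidate, True
                break
    return xs


def find_counterexample(prop, trials=1000, seed=0):
    """Return a shrunk failing input, or None if no trial fails."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(0, 5) for _ in range(rng.randint(0, 12))]
        if not prop(xs):
            return shrink(xs, prop)
    return None


minimal = find_counterexample(prop_matches_reference)
```

For this bug the shrunk input always has the shape `[a, b, a]`, which is much easier to read in a regression test than the original random list.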

Operational notes

Use models to evaluate protocol upgrades before shipping.
Version properties and invariants like code; review changes carefully.
Run the model checker in CI with explicit timeouts and bounds.
Treat counterexamples as incidents: track, root-cause, regression-test.
Keep a library of “known hard schedules” from past failures.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Error budget burn + tail latency under load.
Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Authz failures and policy denials (unexpected spikes).
Rollback events and the conditions that triggered them.
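
An invariant violation rate needs both a numerator and a denominator. A minimal sketch, assuming a hypothetical in-process `InvariantMonitor` that counts checks and violations by name instead of crashing on the first failure (a real deployment would export these counters to its metrics system):

```python
from collections import Counter


class InvariantMonitor:
    """Counts invariant checks and violations by name."""

    def __init__(self):
        self.checks = Counter()
        self.violations = Counter()

    def check(self, name, condition):
        """Record one evaluation of the named invariant."""
        self.checks[name] += 1
        if not condition:
            self.violations[name] += 1
        return condition

    def violation_rate(self, name):
        """Violations divided by checks; 0.0 if never checked."""
        total = self.checks[name]
        return self.violations[name] / total if total else 0.0


monitor = InvariantMonitor()
for balance in (10, 0, -5, 7):
    monitor.check("balance_non_negative", balance >= 0)
rate = monitor.violation_rate("balance_non_negative")
```

A rate of zero with zero checks is indistinguishable from a healthy system here, so the check count is worth exporting alongside the rate.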

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Use canaries and staged rollout; stop early when signals degrade.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
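
An explicit rollback trigger can be as small as a table of metrics and thresholds evaluated on every rollout step. In the sketch below all metric names and threshold values are made up for illustration, and treating a missing metric as a breach is a design assumption (missing evidence during a rollout is itself a signal), not a universal rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float  # breach when the observed value exceeds this


# Hypothetical triggers; tune names and thresholds per system.
TRIGGERS = [
    Trigger("invariant_violations_per_min", 0.0),
    Trigger("error_rate", 0.02),
    Trigger("p99_latency_ms", 800.0),
]


def breached_triggers(metrics):
    """Return the names of breached triggers, in TRIGGERS order.
    A metric that is absent counts as breached."""
    breached = []
    for trigger in TRIGGERS:
        value = metrics.get(trigger.metric)
        if value is None or value > trigger.threshold:
            breached.append(trigger.metric)
    return breached


healthy = breached_triggers({
    "invariant_violations_per_min": 0.0,
    "error_rate": 0.01,
    "p99_latency_ms": 420.0,
})
degraded = breached_triggers({
    "invariant_violations_per_min": 0.0,
    "error_rate": 0.05,
})
```

Returning the breached names, rather than a bare boolean, preserves the condition that triggered the rollback for later review.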

Evidence

Learn TLA+ (1) — Practical workflow and examples.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

Which properties are you currently assuming but not testing or proving?
Which invariants are cheap enough to monitor in production?
What is the smallest model that reproduces your worst incident class?
How will you keep models aligned during rapid iteration?

Checklist

Safety properties stated as invariants.
Telemetry captures correctness signals.
Rollback plan rehearsed and automated.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Assumptions listed and reviewed.
Failure modes enumerated with mitigations.
