Formal Verification of Crypto Protocols: Models, Gaps, and Pain

Monthly research note. Theme: Adversarial Infrastructure & Global Systems.

TL;DR

Define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
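One way to make the third outcome concrete is to give it its own value in the result type, so callers cannot silently fold a timeout into "failed". The sketch below is illustrative only: `Outcome`, `Rejected`, and `classify` are hypothetical names, and it assumes the callee raises `Rejected` only when it is certain the operation did not take effect.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"   # the operation definitely took effect
    FAILURE = "failure"   # the operation definitely did not take effect
    UNKNOWN = "unknown"   # no definite answer: it may or may not have applied


class Rejected(Exception):
    """Hypothetical error: the callee explicitly refused the operation."""


def classify(call):
    """Run `call` and map what happened onto the three outcomes."""
    try:
        call()
        return Outcome.SUCCESS
    except Rejected:
        return Outcome.FAILURE
    except Exception:
        # Timeouts, resets, and anything else without a definite answer.
        return Outcome.UNKNOWN
```

A caller that receives `Outcome.UNKNOWN` should reconcile (query state, or retry with an idempotency key) rather than assume either result.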

Key takeaways

Protect observability: you can’t respond blind, and telemetry can be attacked.
Degraded modes are security decisions; write them down and test them.
Evidence pipelines (audit/config history) are part of incident response correctness.
Treat retries, reordering, and partial failure as default conditions.
Measure correctness signals, not only latency/throughput.

Why this matters

Logs are only useful if they remain trustworthy under compromise.
Degraded modes without explicit policy become accidental vulnerabilities.
Incident response is a protocol: practice it, automate it, validate it.
Privacy failures often come from metadata, not plaintext.

Key questions

Which logs are trustworthy under compromise (append-only, signed, isolated)?
What is the minimum viable recovery path after a catastrophic event?
How do you make abuse expensive (proof-of-work, quotas, pricing, friction)?
How do you detect attacks that look like “normal traffic spikes”?
Where is the attacker’s leverage (routing, DNS, dependency, identity, time)?
Which controls fail first under load: auth, rate limits, storage, or observability?

Assumptions

Operators are human and will make mistakes under pressure.
Some dependencies will fail open or fail closed unexpectedly.
Observability pipelines can be attacked (cardinality explosions, log injection).
Traffic spikes can be malicious or accidental; you must handle both.

Non-goals

Relying on perfect attribution (you rarely know who is attacking in real time).
Treating WAF/rate limits as sufficient without architecture changes.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
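A minimal sketch of bounding work per request, assuming each request carries an explicit budget that every expensive step must charge against. `WorkBudget`, `BudgetExceeded`, and `expand_items` are hypothetical names, and the unit of work (one per item) is chosen only for illustration.

```python
class BudgetExceeded(Exception):
    """Raised when a request tries to do more work than it was granted."""


class WorkBudget:
    def __init__(self, max_units):
        self.remaining = max_units

    def spend(self, units=1):
        # Refuse before doing the work, not after.
        if units > self.remaining:
            raise BudgetExceeded(f"needed {units}, only {self.remaining} left")
        self.remaining -= units


def expand_items(items, budget):
    """Process a caller-supplied list, charging one unit per item."""
    out = []
    for item in items:
        budget.spend(1)
        out.append(item.upper())
    return out
```

The point of the design is that the limit is set by the server per request, so input size alone cannot decide how much CPU a request consumes.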

Model & invariants

Defense is about cost asymmetry. If the attacker spends $1$ unit and you spend $100$ units to respond, you lose.

$$\mathrm{Cost}_{\text{defense}} \ll \mathrm{Cost}_{\text{attack}} \quad \text{(per unit of damage prevented)}$$

Treat observability as a dependency: protect it from overload and manipulation.

Define which operations fail closed vs fail open. Do it before an incident.

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
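The invariant can be sketched as an epoch check: a receiver remembers the highest epoch it has seen and refuses anything older, without consulting a clock. `EpochGuard` is a hypothetical name, and the sketch assumes epochs are integers issued by some coordinator that only ever increases them.

```python
class EpochGuard:
    """Rejects commands carrying an epoch older than the newest one seen."""

    def __init__(self):
        self.highest_seen = 0

    def accept(self, epoch):
        if epoch < self.highest_seen:
            # Stale writer, regardless of what its wall clock says.
            return False
        self.highest_seen = epoch
        return True
```

Because the comparison uses only integers that move forward, clock skew between nodes cannot make a stale writer look current.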

Security properties

Evidence: critical actions emit verifiable audit events.
Replay resistance: duplicated inputs do not change outcomes.
Authenticity: actions are bound to identity and purpose.
Least authority: privileges are scoped by purpose and time.
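For the evidence property, one common construction is an HMAC-chained log in which each entry's tag covers the previous entry's tag. The sketch below is an illustration under assumptions, not a prescribed design: `append`, `verify`, and `GENESIS` are hypothetical names, the key is assumed to live somewhere the logged system cannot read, and truncation of the tail is not detected without an external checkpoint.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous tag" for the first entry


def _mac(key, prev, event):
    # Canonical JSON so the same event always produces the same bytes.
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def append(log, key, event):
    """Append an event whose tag binds it to the previous entry."""
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"event": event, "mac": _mac(key, prev, event)})


def verify(log, key):
    """Return True if no entry was altered, reordered, or removed mid-chain."""
    prev = GENESIS
    for entry in log:
        expected = _mac(key, prev, entry["event"])
        if not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev = entry["mac"]
    return True
```

Editing or deleting an entry in the middle breaks every later tag, which is what makes the audit trail verifiable rather than merely present.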

Failure modes

Recovery paths that only work when nothing is broken.
Timeout ambiguity causing double-apply or partial state transitions.
Config drift that weakens security posture over time.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
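The double-apply failure mode above is usually handled with idempotency keys: the client attaches a unique key to each logical operation, and the server returns the stored result when it sees the key again. A minimal in-memory sketch follows; `Ledger` and `credit` are hypothetical names, and a real system would persist the key table atomically with the state change.

```python
class Ledger:
    def __init__(self):
        self.balance = 0
        self._applied = {}  # idempotency key -> result of the first apply

    def credit(self, key, amount):
        """Apply a credit at most once per idempotency key."""
        if key in self._applied:
            # Retry after an ambiguous timeout: return the original result.
            return self._applied[key]
        self.balance += amount
        self._applied[key] = self.balance
        return self.balance
```

With this in place, a client that timed out can safely resend the same request instead of guessing whether it landed.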

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart LR
  attack["Attack"] --> detect["Detect"]
  detect --> contain["Contain"]
  contain --> recover["Recover"]
  recover --> learn["Learn/Regress"]
  learn --> detect

Implementation notes

Degraded modes are design artifacts. Write them down and test them.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Degraded-mode table (example):
Operation | Normal | Under attack   | Rationale
Auth      | full   | strict         | prevent abuse
Reads     | full   | cached/limited | protect core
Writes    | full   | queued/limited | preserve integrity
Admin     | full   | JIT + MFA      | reduce blast radius
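The table can also live in code as data, which makes it testable. The sketch below mirrors the rows above; `Mode`, `POLICY`, and `behavior` are hypothetical names, and the choice to return "deny" for anything not listed is an assumption (fail closed), not something the table itself specifies.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    UNDER_ATTACK = "under_attack"


# One row per operation, one column per mode, copied from the table.
POLICY = {
    "auth":   {Mode.NORMAL: "full", Mode.UNDER_ATTACK: "strict"},
    "reads":  {Mode.NORMAL: "full", Mode.UNDER_ATTACK: "cached/limited"},
    "writes": {Mode.NORMAL: "full", Mode.UNDER_ATTACK: "queued/limited"},
    "admin":  {Mode.NORMAL: "full", Mode.UNDER_ATTACK: "JIT + MFA"},
}


def behavior(operation, mode):
    """Look up the policy; unlisted operations fail closed (assumption)."""
    try:
        return POLICY[operation][mode]
    except KeyError:
        return "deny"
```

Keeping the policy as a plain dictionary means a unit test can assert every cell, so a degraded mode cannot change without a reviewed diff.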

Verification strategy

Observability stress: cardinality explosions and sampling under attack.
Incident replay: reconstruct timeline from evidence pipelines.
Dependency chaos: DNS issues, cert failures, upstream outages.
Policy tests: fail closed/open behaviors are unit-tested.
Game days: simulate DDoS, dependency failure, and credential abuse.

Operational notes

Protect the edge and the evidence: rate limits + SIEM + log integrity.
Make emergency controls quick: feature flags, circuit breakers, safe defaults.
Document and rehearse degraded-mode policy with on-call rotations.
Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
Instrument cost: which defenses become expensive and when.

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.

What to monitor

Authz failures and policy denials (unexpected spikes).
Error budget burn + tail latency under load.
Admission-control / rate-limit rejections (by reason).
Rollback events and the conditions that triggered them.
Retry/timeout rates by endpoint and client cohort.
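Counting rejections by reason can be as simple as a labeled counter next to the admission check. The following is a sketch under assumptions: `TokenBucket`, `admit`, and the reason labels are hypothetical, and time is passed in explicitly so the behavior is deterministic in tests.

```python
from collections import Counter


class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill for elapsed time; ignore clocks that appear to go backwards.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = max(self.last, now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


rejections = Counter()  # reason -> count, exported as a metric in practice


def admit(bucket, authenticated, now):
    """Admit a request or record why it was rejected."""
    if not authenticated:
        rejections["unauthenticated"] += 1
        return False
    if not bucket.allow(now):
        rejections["rate_limited"] += 1
        return False
    return True
```

Splitting the count by reason lets an operator tell a credential-stuffing spike from plain overload, which a single "rejected" total would hide.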

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Define an explicit rollback trigger (metrics + thresholds).
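An explicit trigger can be written as a pure function over current metrics. In this sketch the metric names and threshold values are invented for illustration, and treating a missing metric as a breach is an assumption (if telemetry is gone, do not assume the rollout is healthy).

```python
# Hypothetical thresholds; real values come from the service's own SLOs.
THRESHOLDS = {"error_rate": 0.02, "p99_latency_ms": 800.0}


def breached(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that are missing or over threshold."""
    out = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None or value > limit:
            out.append(name)
    return sorted(out)


def should_roll_back(metrics):
    """Roll back if any threshold is breached."""
    return bool(breached(metrics))
```

Because the function has no side effects, the same code can gate a canary stage and be replayed against recorded metrics after an incident.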

Evidence

Jepsen (1) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Let's Encrypt Incident Reports (2) — Operational failures and recovery in real-world PKI.
- Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.

Open questions

What is your ‘safe mode’ when dependencies fail?
Where do you pay cost asymmetry today—and can you flip it?
Which operation, if abused, causes irreversible damage?
How do you keep control-plane access during widespread incidents?

Checklist

Assumptions listed and reviewed.
Failure modes enumerated with mitigations.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Rollback plan rehearsed and automated.
Telemetry captures correctness signals.
Safety properties stated as invariants.
