Monthly research note. Theme: Adversarial Infrastructure & Global Systems.
TL;DR
A focused memo on “Formal Verification of Crypto Protocols: Models, Gaps, and Pain”. The approach: define the model, state the properties, then design the system so those properties hold under failure and adversarial pressure.
Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
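One way to make the third outcome concrete is to give it its own value in the type that callers branch on. The sketch below is illustrative only: `Outcome`, `call_with_outcome`, and `next_step` are hypothetical names, and the follow-up actions are assumptions, not a prescribed policy.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"  # timed out: the operation may or may not have applied


def call_with_outcome(operation):
    """Run a zero-argument callable and classify the result three ways."""
    try:
        operation()
    except TimeoutError:
        # No answer is not a "no": the remote side may have applied the change.
        return Outcome.UNKNOWN
    except Exception:
        return Outcome.FAILURE
    return Outcome.SUCCESS


def next_step(outcome):
    """Hypothetical caller policy for each outcome."""
    if outcome is Outcome.SUCCESS:
        return "continue"
    if outcome is Outcome.FAILURE:
        return "retry_or_report"
    # UNKNOWN: query the authoritative state, or retry with an idempotency key.
    return "reconcile"
```

The point of the separate value is that a caller cannot silently fold a timeout into either success or failure; it has to choose a reconciliation step.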
Key takeaways
- Protect observability: you can’t respond blind, and telemetry can be attacked.
- Degraded modes are security decisions; write them down and test them.
- Evidence pipelines (audit/config history) are part of incident response correctness.
- Treat retries, reordering, and partial failure as default conditions.
- Measure correctness signals, not only latency/throughput.
Why this matters
- Logs are only useful if they remain trustworthy under compromise.
- Degraded modes without explicit policy become accidental vulnerabilities.
- Incident response is a protocol: practice it, automate it, validate it.
- Privacy failures often come from metadata, not plaintext.
Key questions
- Which logs are trustworthy under compromise (append-only, signed, isolated)?
- What is the minimum viable recovery path after a catastrophic event?
- How do you make abuse expensive (proof-of-work, quotas, pricing, friction)?
- How do you detect attacks that look like “normal traffic spikes”?
- Where is the attacker’s leverage (routing, DNS, dependency, identity, time)?
- Which controls fail first under load: auth, rate limits, storage, or observability?
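For the log-trust question, a hash chain is one common way to make an append-only log tamper-evident. The following is a minimal sketch under stated assumptions: the in-memory list, the `append_event` and `verify_chain` names, and the JSON encoding are all illustrative, and a real deployment would also sign or externally anchor the latest hash.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(log):
    """Return True only if every entry links to and matches its predecessor."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects edits to past entries, but on its own it does not detect truncation of the tail or an attacker who rewrites the whole chain, which is why isolation and signing of the head still matter.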
Assumptions
- Operators are human and will make mistakes under pressure.
- Some dependencies will fail open or fail closed unexpectedly.
- Observability pipelines can be attacked (cardinality explosions, log injection).
- Traffic spikes can be malicious or accidental; you must handle both.
Non-goals
- Assuming perfect attribution (you rarely know who is attacking in real time).
- Assuming WAF/rate limits are sufficient without architecture changes.
Any unbounded work per request becomes a DoS primitive under adversaries.
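A small illustration of bounding work per request: cap the input size and the number of items before doing anything proportional to either. The limits and the `parse_batch` function are hypothetical; the right values depend on the service.

```python
class WorkBudgetExceeded(Exception):
    """Raised when a request would exceed its fixed work budget."""


MAX_BODY_BYTES = 64 * 1024  # assumed limit, for illustration
MAX_ITEMS = 100             # assumed limit, for illustration


def parse_batch(body: bytes):
    """Split a newline-delimited batch, refusing oversized inputs up front."""
    if len(body) > MAX_BODY_BYTES:
        raise WorkBudgetExceeded("body too large")
    # maxsplit bounds the number of pieces created, even for hostile input.
    items = body.split(b"\n", MAX_ITEMS)
    if len(items) > MAX_ITEMS:
        raise WorkBudgetExceeded("too many items")
    return [item.decode("utf-8", errors="replace") for item in items]
```

Both checks run before any per-item work, so the cost of rejecting a hostile request stays small and roughly constant.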
Model & invariants
Defense is about cost asymmetry. If each request costs the attacker far less than it costs you to handle, you lose.
Treat observability as a dependency: protect it from overload and manipulation.
Define which operations fail closed vs fail open. Do it before an incident.
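Writing the fail-closed versus fail-open decision as data makes it reviewable and testable ahead of time. This is a sketch only: the operation names in `FAIL_POLICY` and the `decide` helper are hypothetical, and defaulting unlisted operations to fail closed is an assumption.

```python
from enum import Enum


class FailMode(Enum):
    CLOSED = "closed"  # deny when the dependency cannot answer
    OPEN = "open"      # allow when the dependency cannot answer


# Hypothetical policy table; the operation names are illustrative.
FAIL_POLICY = {
    "authorize": FailMode.CLOSED,
    "admin_action": FailMode.CLOSED,
    "recommendations": FailMode.OPEN,
}


def decide(operation, check):
    """Return True if the operation may proceed.

    `check` is a zero-argument callable returning a bool; it may raise
    when the dependency behind it is unavailable.
    """
    try:
        return bool(check())
    except Exception:
        # Unlisted operations fail closed by default (an assumption).
        mode = FAIL_POLICY.get(operation, FailMode.CLOSED)
        return mode is FailMode.OPEN
```

Because the policy is a plain table, each row can be asserted in a unit test, and a change to it shows up in review like any other code change.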
Monotonicity beats timestamps: counters and epochs survive clock skew.
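As a sketch of the monotonicity idea, a value can carry a version counter and reject any update that is not strictly newer, so ordering never depends on wall-clock time. `VersionedValue` is a hypothetical name, and the single-process, in-memory form is a simplification.

```python
class VersionedValue:
    """Holds a value guarded by a monotonically increasing version."""

    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        """Apply an update only if its version is strictly newer."""
        if version <= self.version:
            # Stale or duplicate update, whatever time it arrived.
            return False
        self.version = version
        self.value = value
        return True
```

A delayed write tagged with version 1 is rejected after version 2 has been applied, even if the sender's clock claims the delayed write is more recent.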
Security properties
- Evidence: critical actions emit verifiable audit events.
- Replay resistance: duplicated inputs do not change outcomes.
- Authenticity: actions are bound to identity and purpose.
- Least authority: privileges are scoped by purpose and time.
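Replay resistance from the list above is often approximated with idempotency keys: the first application of a key is recorded, and repeats return the recorded result without changing state. The `Ledger` class below is a hypothetical, in-memory sketch; a real system would need the state change and the key record to commit atomically, and would bound how long keys are retained.

```python
class Ledger:
    """Toy balance that applies each idempotency key at most once."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # idempotency key -> balance after first application

    def credit(self, idempotency_key, amount):
        """Credit `amount` once per key; duplicates return the first result."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance += amount
        self._results[idempotency_key] = self.balance
        return self.balance
```

This also gives a caller a safe action after an ambiguous timeout: resend with the same key and let the server deduplicate.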
Failure modes
- Recovery paths that only work when nothing is broken.
- Timeout ambiguity causing double-apply or partial state transitions.
- Config drift that weakens security posture over time.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Mixed-version deployments create states you never tested—plan for them explicitly.
Design sketch
flowchart LR
attack["Attack"] --> detect["Detect"]
detect --> contain["Contain"]
contain --> recover["Recover"]
recover --> learn["Learn/Regress"]
learn --> detect
Implementation notes
Degraded modes are design artifacts. Write them down and test them.
Make rollbacks boring: if rollback is a hero move, it will fail.
Degraded-mode table (example):
Operation | Normal | Under attack   | Rationale
Auth      | full   | strict         | prevent abuse
Reads     | full   | cached/limited | protect core
Writes    | full   | queued/limited | preserve integrity
Admin     | full   | JIT + MFA      | reduce blast radius
Verification strategy
- Observability stress: cardinality explosions and sampling under attack.
- Incident replay: reconstruct timeline from evidence pipelines.
- Dependency chaos: DNS issues, cert failures, upstream outages.
- Policy tests: fail closed/open behaviors are unit-tested.
- Game days: simulate DDoS, dependency failure, and credential abuse.
Operational notes
- Protect the edge and the evidence: rate limits + SIEM + log integrity.
- Make emergency controls quick: feature flags, circuit breakers, safe defaults.
- Document and rehearse degraded-mode policy with on-call rotations.
- Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
- Instrument cost: which defenses become expensive and when.
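Of the emergency controls listed above, a circuit breaker is small enough to sketch. The version below is a minimal illustration, not a production design: the class name, thresholds, and the injectable `clock` parameter (used so the behavior can be tested without sleeping) are all assumptions.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # After the cool-down, permit a trial call (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            # Open, or re-open after a failed trial call.
            self.opened_at = self.clock()
```

Using a monotonic clock by default keeps the cool-down immune to wall-clock adjustments.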
Make degraded modes explicit: fail closed vs fail open is a policy choice.
What to monitor
- Authz failures and policy denials (unexpected spikes).
- Error budget burn + tail latency under load.
- Admission-control / rate-limit rejections (by reason).
- Rollback events and the conditions that triggered them.
- Retry/timeout rates by endpoint and client cohort.
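Counting rejections by reason, as the list above suggests, can be as simple as tagging each rejection path. The sketch below pairs a basic token bucket with a `Counter`; the names, limits, and reason labels are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from collections import Counter


class TokenBucket:
    """Basic token bucket; the caller supplies the current time in seconds."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last = now

    def try_take(self, now):
        elapsed = max(0.0, now - self.last)  # ignore time going backwards
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admit(bucket, now, body_size, rejections, max_body=1024):
    """Admit or reject a request, recording the reason for each rejection."""
    if body_size > max_body:
        rejections["body_too_large"] += 1
        return False
    if not bucket.try_take(now):
        rejections["rate_limited"] += 1
        return False
    return True
```

In a real service the counter would be a labeled metric; the per-reason breakdown is what lets an operator tell an oversized-payload attack from ordinary rate limiting.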
Rollback plan
- Use canaries and staged rollout; stop early when signals degrade.
- Keep dual-write / dual-verify windows where appropriate.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Define an explicit rollback trigger (metrics + thresholds).
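An explicit rollback trigger can be written as a small pure function over the canary's metrics, which makes it easy to review and test. The thresholds and names below are placeholders chosen for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    # Placeholder thresholds; tune per service.
    max_error_rate: float = 0.02
    max_p99_latency_ms: float = 800.0
    min_requests: int = 500  # avoid deciding on too little data


def should_roll_back(trigger, requests, errors, p99_latency_ms):
    """Return (decision, reason) for the current canary window."""
    if requests == 0 or requests < trigger.min_requests:
        return False, "insufficient data"
    if errors / requests > trigger.max_error_rate:
        return True, "error rate"
    if p99_latency_ms > trigger.max_p99_latency_ms:
        return True, "p99 latency"
    return False, "healthy"
```

Returning the reason alongside the decision feeds directly into the "rollback events and the conditions that triggered them" signal.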
Evidence
- Jepsen (1) — Fault injection and correctness testing for distributed systems.
  - Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
- Let's Encrypt Incident Reports (2) — Operational failures and recovery in real-world PKI.
  - Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.
Open questions
- What is your ‘safe mode’ when dependencies fail?
- Where do you pay cost asymmetry today—and can you flip it?
- Which operation, if abused, causes irreversible damage?
- How do you keep control-plane access during widespread incidents?
Checklist
- Assumptions listed and reviewed.
- Failure modes enumerated with mitigations.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Rollback plan rehearsed and automated.
- Telemetry captures correctness signals.
- Safety properties stated as invariants.
Further reading
- RFC 4271: BGP-4 — Routing is part of your threat model whether you like it or not.
- RFC 6480: An Infrastructure to Support Secure Internet Routing — RPKI basics and why routing security is hard operationally.
- Let's Encrypt Incident Reports — Operational failures and recovery in real-world PKI.
- Cloudflare Outage (July 2, 2019) Postmortem — A concrete example of global failure, containment, and recovery lessons.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Jepsen — Fault injection and correctness testing for distributed systems.