Monthly research note. Theme: Adversarial Infrastructure & Global Systems.
TL;DR
This memo covers designing for catastrophic failure through compartmentalization and recovery: define the model, state the properties, then design the system so those properties still hold under failure and adversarial pressure.
Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
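A minimal sketch of the three-outcome idea, assuming a caller that wraps a remote call. The names `Outcome`, `classify`, and `next_step` are hypothetical, and the policy in `next_step` is one possible choice, not a rule.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"  # timed out: the request may or may not have been applied


def classify(call):
    """Run `call` and map its result to one of three outcomes."""
    try:
        call()
        return Outcome.SUCCESS
    except TimeoutError:
        # No response was seen, so we cannot assume the write was rejected.
        return Outcome.UNKNOWN
    except Exception:
        return Outcome.FAILURE


def next_step(outcome):
    """Hypothetical caller policy: this sketch treats a definite rejection as retryable."""
    if outcome is Outcome.SUCCESS:
        return "done"
    if outcome is Outcome.FAILURE:
        return "retry"
    return "reconcile"  # query current state, or retry with an idempotency key
```

The point of the third branch is that an ambiguous result gets its own code path instead of being folded into "failure" and retried blindly.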
Key takeaways
- Protect observability: you can’t respond blind, and telemetry can be attacked.
- Engineer cost asymmetry: defense must be cheaper than attack per unit of damage prevented.
- Dependencies (DNS, routing, PKI) are shared attack surfaces—plan containment.
- Prefer protocols and APIs that make invalid states hard to express.
- Design rollbacks as part of the happy path.
Why this matters
- Incident response is a protocol: practice it, automate it, validate it.
- Global dependencies (DNS, routing, PKI) are shared attack surfaces.
- Logs are only useful if they remain trustworthy under compromise.
- Privacy failures often come from metadata, not plaintext.
Key questions
- What is your degraded-mode behavior (and is it safe)?
- Which logs are trustworthy under compromise (append-only, signed, isolated)?
- Where is the attacker’s leverage (routing, DNS, dependency, identity, time)?
- How do you prevent dependency failures from becoming integrity failures?
- What is the minimum viable recovery path after a catastrophic event?
- Which controls fail first under load: auth, rate limits, storage, or observability?
Assumptions
- Some dependencies will fail open or fail closed unexpectedly.
- Operators are human and will make mistakes under pressure.
- Traffic spikes can be malicious or accidental; you must handle both.
- Observability pipelines can be attacked (cardinality explosions, log injection).
Non-goals
- Assuming perfect attribution (you rarely know who is attacking in real time).
- Relying on dashboards that vanish during the incident.
Any unbounded work per request becomes a DoS primitive under adversaries.
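One way to bound per-request work is to reject oversized inputs before doing anything expensive. This is a sketch only: `MAX_ITEMS`, `BudgetExceeded`, and `process_batch` are hypothetical names, and the cap of 1000 is an arbitrary placeholder.

```python
class BudgetExceeded(Exception):
    """Raised when a request asks for more work than its budget allows."""


MAX_ITEMS = 1000  # assumed cap on items processed per request


def process_batch(items, handle, max_items=MAX_ITEMS):
    """Apply `handle` to each item, refusing oversized requests up front."""
    if len(items) > max_items:
        # Reject before doing any work so an oversized request is cheap to refuse.
        raise BudgetExceeded(f"{len(items)} items exceeds limit of {max_items}")
    return [handle(item) for item in items]
```

The same shape applies to other resources (bytes read, recursion depth, fan-out): pick a limit, check it first, and fail with a distinct error.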
Model & invariants
Resilience is about containment:
Treat observability as a dependency: protect it from overload and manipulation.
Engineer friction where attackers pay but legitimate users don’t (asymmetric controls).
If the system can enter an invalid state, it eventually will—usually during an incident.
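A small illustration of keeping invalid states out of reach: an explicit transition table that rejects anything not listed. The states and the `ALLOWED` table are assumptions made for the example, not a prescribed lifecycle.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    APPLIED = "applied"
    ROLLED_BACK = "rolled_back"


# The only transitions this sketch permits; anything else is rejected.
ALLOWED = {
    State.PENDING: {State.APPLIED, State.ROLLED_BACK},
    State.APPLIED: {State.ROLLED_BACK},
    State.ROLLED_BACK: set(),
}


def transition(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Because every state change goes through `transition`, a new state added during an incident fails loudly until someone decides which moves are legal.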
Security properties
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Replay resistance: duplicated inputs do not change outcomes.
- Authenticity: actions are bound to identity and purpose.
- Evidence: critical actions emit verifiable audit events.
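The evidence property can be sketched as a hash-chained audit log in which each event's MAC covers the previous event's MAC. This example uses HMAC-SHA256 from the Python standard library as a stand-in for real signatures; `append_event`, `verify_log`, and `GENESIS` are hypothetical names.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first event


def append_event(log, key, action):
    """Append an event whose MAC covers the action and the previous event's MAC."""
    prev = log[-1]["mac"] if log else GENESIS
    body = json.dumps({"action": action, "prev": prev}, sort_keys=True)
    mac = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    log.append({"action": action, "prev": prev, "mac": mac})


def verify_log(log, key):
    """Return True only if every event's MAC is valid and the chain is unbroken."""
    prev = GENESIS
    for event in log:
        body = json.dumps({"action": event["action"], "prev": prev}, sort_keys=True)
        expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
        if event["prev"] != prev or not hmac.compare_digest(expected, event["mac"]):
            return False
        prev = event["mac"]
    return True
```

Two limits are worth stating. HMAC uses a shared key, so anyone holding the key can forge events; asymmetric signatures avoid that. The chain also does not detect truncation of the tail unless the latest MAC is stored somewhere the attacker cannot reach.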
Failure modes
- Timeout ambiguity causing double-apply or partial state transitions.
- Mixed-version behavior that violates assumptions silently.
- Observability gaps during incidents (missing evidence).
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
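The first failure mode above, timeout ambiguity causing double-apply, is commonly handled with idempotency keys. A toy in-memory sketch follows; the `Ledger` class is hypothetical, and a real system would persist the key table atomically with the state change.

```python
class Ledger:
    """Toy in-memory ledger that applies each idempotency key at most once."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # idempotency key -> balance returned the first time

    def credit(self, key, amount):
        if key in self._results:
            # Duplicate delivery (for example a retry after a timeout): replay the result.
            return self._results[key]
        self.balance += amount
        self._results[key] = self.balance
        return self.balance
```

A client that times out can resend the same key without risk of applying the credit twice.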
Mixed-version deployments create states you never tested—plan for them explicitly.
Design sketch
flowchart TD
edge["Edge (rate limits + WAF)"] --> core["Core Services"]
core --> data["Data Plane"]
data --> control["Control Plane"]
control --> edge
siem["Detection/Response"] --> core
siem --> edge
Implementation notes
Prefer containment over heroics: isolate blast radius, keep core correct.
Make rollbacks boring: if rollback is a hero move, it will fail.
Evidence checklist:
- Immutable logs (append-only)
- Signed audit events
- Time sync monitoring
- Dependency health snapshots
- Config change history
Verification strategy
- Dependency chaos: DNS issues, cert failures, upstream outages.
- Policy tests: fail closed/open behaviors are unit-tested.
- Game days: simulate DDoS, dependency failure, and credential abuse.
- Observability stress: cardinality explosions and sampling under attack.
- Incident replay: reconstruct timeline from evidence pipelines.
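The policy-test item can be made concrete with a fail-closed authorization check and two plain test functions. Everything here is illustrative: `is_allowed`, the `lookup_role` callback, and the "admin" role are assumptions for the example.

```python
def is_allowed(user, lookup_role):
    """Authorize `user`, denying access if the role lookup itself fails."""
    try:
        role = lookup_role(user)
    except Exception:
        # Fail closed: a broken dependency must not grant access.
        return False
    return role == "admin"


def test_fail_closed_when_dependency_errors():
    def broken_lookup(user):
        raise ConnectionError("role service unreachable")

    assert is_allowed("alice", broken_lookup) is False


def test_allows_admin_when_dependency_healthy():
    assert is_allowed("alice", lambda user: "admin") is True
```

Whether a given control should fail open or closed is a product decision; the test only pins down whichever behavior was chosen so it cannot drift silently.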
Operational notes
- Instrument cost: which defenses become expensive and when.
- Document and rehearse degraded-mode policy with on-call rotations.
- Protect the edge and the evidence: rate limits + SIEM + log integrity.
- Make emergency controls quick: feature flags, circuit breakers, safe defaults.
- Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
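Of the emergency controls listed above, a circuit breaker is the easiest to sketch. This version is deliberately minimal and single-threaded; the class, its parameters, and the default thresholds are assumptions, and the injectable `clock` exists only to make it testable.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows a trial call after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the dependency entirely
            # Cool-down elapsed: fall through and let one trial call probe it.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The `fallback` is where the safe default lives, so it should be cheap and should not depend on the failing service.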
Attach explicit rollout/rollback triggers to changes that touch security or correctness.
What to monitor
- Authz failures and policy denials (unexpected spikes).
- Rollback events and the conditions that triggered them.
- Invariant violation rate (should be ~0).
- Admission-control / rate-limit rejections (by reason).
- Retry/timeout rates by endpoint and client cohort.
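To show what "rejections by reason" might look like in code, here is a small token-bucket limiter that counts each rejection under a label. The class, the reason labels, and the injectable `clock` are assumptions for illustration; a real service would export the counter to its metrics system.

```python
import time
from collections import Counter


class TokenBucket:
    """Token-bucket limiter that counts rejections by reason."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()
        self.rejections = Counter()

    def allow(self, cost=1):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if cost > self.capacity:
            self.rejections["oversized"] += 1
            return False
        if cost > self.tokens:
            self.rejections["rate_limited"] += 1
            return False
        self.tokens -= cost
        return True
```

Keeping the set of reason labels small and fixed also keeps this metric's cardinality bounded.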
Rollback plan
- Keep dual-write / dual-verify windows where appropriate.
- Use canaries and staged rollout; stop early when signals degrade.
- Define an explicit rollback trigger (metrics + thresholds).
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
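An explicit rollback trigger can be written down as data plus a pure function. The metric names and threshold values below are hypothetical placeholders; real values depend on the service's SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    """Hypothetical thresholds; real values depend on the service's SLOs."""
    max_error_rate: float = 0.02        # fraction of failed requests
    max_p99_latency_ms: float = 800.0
    max_invariant_violations: int = 0


def breached_thresholds(metrics, trigger=RollbackTrigger()):
    """Return the names of breached thresholds; a non-empty list means roll back."""
    breaches = []
    if metrics["error_rate"] > trigger.max_error_rate:
        breaches.append("error_rate")
    if metrics["p99_latency_ms"] > trigger.max_p99_latency_ms:
        breaches.append("p99_latency_ms")
    if metrics["invariant_violations"] > trigger.max_invariant_violations:
        breaches.append("invariant_violations")
    return breaches
```

Returning the breached names, not just a boolean, gives the rollback event the context that the monitoring section asks for.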
Evidence
- Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
- Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Open questions
- Where does cost asymmetry favor the attacker today, and can you flip it?
- Which operation, if abused, causes irreversible damage?
- How do you keep control-plane access during widespread incidents?
- What is your “safe mode” when dependencies fail?
Checklist
- Assumptions listed and reviewed.
- Safety properties stated as invariants.
- Rollback plan rehearsed and automated.
- Telemetry captures correctness signals.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Failure modes enumerated with mitigations.
Further reading
- Cloudflare Outage (July 2, 2019) Postmortem — A concrete example of global failure, containment, and recovery lessons.
- RFC 6480: An Infrastructure to Support Secure Internet Routing — RPKI basics and why routing security is hard operationally.
- RFC 4271: BGP-4 — Routing is part of your threat model whether you like it or not.
- Let's Encrypt Incident Reports — Operational failures and recovery in real-world PKI.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.