Monthly research note. Theme: Adversarial Infrastructure & Global Systems.
TL;DR
A focused memo on "Byzantine Fault Injection: Testing Protocols Like an Attacker". The approach: define the model, state the properties, then design the system so those properties remain true under failures and adversarial behavior.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- Engineer cost asymmetry: defense must be cheaper than attack per unit of damage prevented.
- Protect observability: you can’t respond blind, and telemetry can be attacked.
- Evidence pipelines (audit/config history) are part of incident response correctness.
- Write assumptions down; treat them as interfaces.
- Prefer protocols and APIs that make invalid states hard to express.
Why this matters
- Privacy failures often come from metadata, not plaintext.
- Global dependencies (DNS, routing, PKI) are shared attack surfaces.
- Attackers exploit cost asymmetry: they look for places where abuse is cheap and defense is expensive.
- Logs are only useful if they remain trustworthy under compromise.
Key questions
- What is the minimum viable recovery path after a catastrophic event?
- How do you detect attacks that look like “normal traffic spikes”?
- How do you make abuse expensive (proof-of-work, quotas, pricing, friction)?
- How do you prevent dependency failures from becoming integrity failures?
- What is your degraded-mode behavior (and is it safe)?
- Which controls fail first under load: auth, rate limits, storage, or observability?
Assumptions
- Some dependencies will fail open or fail closed unexpectedly.
- Operators are human and will make mistakes under pressure.
- Observability pipelines can be attacked (cardinality explosions, log injection).
- Traffic spikes can be malicious or accidental; you must handle both.
Non-goals
- Assuming perfect attribution (you rarely know who is attacking in real time).
- Relying on dashboards that vanish during the incident.
Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.
Model & invariants
Defense is about cost asymmetry. If each unit the attacker spends forces you to spend more than that in response, you lose.
Treat observability as a dependency: protect it from overload and manipulation.
Engineer friction where attackers pay but legitimate users don’t (asymmetric controls).
Invariants must be checkable from evidence you actually have (state + logs + counters).
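A minimal sketch of what "checkable from evidence" could look like. The `Evidence` record, its fields, and the ledger-style example are hypothetical, chosen only to show state, log, and counter being cross-checked against each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical snapshot of what an operator can actually read."""
    balance: int            # current state
    applied_log: tuple      # amounts recorded in the audit log, in order
    applied_counter: int    # monotonic counter of applied operations


def check_invariants(ev: Evidence) -> list:
    """Return the list of violated invariants (empty means all hold)."""
    violations = []
    # State must equal a replay of the log.
    if ev.balance != sum(ev.applied_log):
        violations.append("state does not match replay of the log")
    # The counter must agree with the number of logged operations.
    if ev.applied_counter != len(ev.applied_log):
        violations.append("counter does not match log length")
    return violations
```

If any of the three sources is missing during an incident, the corresponding invariant cannot be checked at all, which is a reason to treat those sources as dependencies.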
Security properties
- Least authority: privileges are scoped by purpose and time.
- Authenticity: actions are bound to identity and purpose.
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Integrity: invalid transitions are rejected (and detectable).
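Downgrade resistance can be sketched as a policy floor plus a MAC over what was offered. This is an illustration under assumptions, not a real protocol: `MIN_VERSION`, `negotiate`, and `transcript_mac` are hypothetical names, and the shared key is assumed to be established elsewhere.

```python
import hashlib
import hmac

MIN_VERSION = 3  # hypothetical policy floor


class DowngradeError(Exception):
    """Raised when negotiation would land below the policy floor."""


def negotiate(client_versions, server_versions, min_version=MIN_VERSION):
    """Pick the highest mutually supported version, never below the floor."""
    common = set(client_versions) & set(server_versions)
    if not common:
        raise DowngradeError("no common version")
    chosen = max(common)
    if chosen < min_version:
        raise DowngradeError("best common version is below the floor")
    return chosen


def transcript_mac(key: bytes, offered, chosen: int) -> bytes:
    """MAC over the offered list and the choice, so a stripped offer is detectable."""
    msg = ",".join(str(v) for v in sorted(offered)) + "|" + str(chosen)
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()
```

The floor rejects outcomes below policy. The transcript MAC covers the remaining case, where an on-path party removes the strongest option but the result still clears the floor: both sides compare MACs (with `hmac.compare_digest`) over the list each believes was offered.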
Failure modes
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Mixed-version behavior that violates assumptions silently.
- Observability gaps during incidents (missing evidence).
- Timeout ambiguity causing double-apply or partial state transitions.
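One common mitigation for the timeout-ambiguity case is an idempotency key: the client retries with the same request ID and the server applies it at most once. A minimal in-memory sketch follows; the `Ledger` class and its fields are hypothetical, and a real system would need the dedup record to be durable and updated atomically with the state.

```python
class Ledger:
    """Applies each request ID at most once, even if the client retries."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # request_id -> result of the first application

    def apply(self, request_id: str, amount: int) -> int:
        # A retry after an ambiguous timeout returns the original result
        # instead of applying the change a second time.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance
```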
A recovery plan that isn’t exercised will fail when you need it.
Design sketch
flowchart TD
edge["Edge (rate limits + WAF)"] --> core["Core Services"]
core --> data["Data Plane"]
data --> control["Control Plane"]
control --> edge
siem["Detection/Response"] --> core
siem --> edge
Implementation notes
Prefer containment over heroics: isolate blast radius, keep core correct.
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
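A rough sketch of bounding work per request: check the cheap limits first, and reject with a named reason before doing anything expensive. The limits, the `admit` function, and the `items` field are hypothetical.

```python
import json

MAX_BODY_BYTES = 4096   # hypothetical cap, checked before parsing
MAX_ITEMS = 100         # hypothetical cap on per-request fan-out


class RequestRejected(Exception):
    def __init__(self, reason: str):
        super().__init__(reason)
        self.reason = reason


def admit(raw: bytes) -> list:
    """Validate and cap a request body; return its items if admitted."""
    # Cheapest check first: size, before any parsing or allocation.
    if len(raw) > MAX_BODY_BYTES:
        raise RequestRejected("body_too_large")
    try:
        body = json.loads(raw)
    except (ValueError, RecursionError):
        # Malformed, wrongly encoded, or pathologically nested input.
        raise RequestRejected("malformed_json")
    if not isinstance(body, dict) or not isinstance(body.get("items"), list):
        raise RequestRejected("bad_shape")
    if len(body["items"]) > MAX_ITEMS:
        raise RequestRejected("too_many_items")
    return body["items"]
```

Attaching a reason to every rejection also makes the rejection counters useful as evidence later.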
Degraded-mode table (example):
Operation | Normal | Under attack   | Rationale
Auth      | full   | strict         | prevent abuse
Reads     | full   | cached/limited | protect core
Writes    | full   | queued/limited | preserve integrity
Admin     | full   | JIT + MFA      | reduce blast radius
Verification strategy
- Dependency chaos: DNS issues, cert failures, upstream outages.
- Policy tests: fail closed/open behaviors are unit-tested.
- Observability stress: cardinality explosions and sampling under attack.
- Incident replay: reconstruct timeline from evidence pipelines.
- Game days: simulate DDoS, dependency failure, and credential abuse.
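The verification items above can be complemented by a small fault-injection harness in the spirit of this memo's title. The sketch below is a toy under stated assumptions: a hypothetical channel that drops, duplicates, corrupts, and reorders messages, and a hypothetical receiver that authenticates with an HMAC and deduplicates by sequence number. The key is for tests only, and the channel does not hold it.

```python
import hashlib
import hmac
import random

KEY = b"test-only-key"  # hypothetical; the faulty channel does not know it


def seal(seq: int, payload: bytes) -> tuple:
    tag = hmac.new(KEY, seq.to_bytes(8, "big") + payload, hashlib.sha256).digest()
    return (seq, payload, tag)


def byzantine_channel(messages, rng: random.Random) -> list:
    """Drop, duplicate, or corrupt each message, then reorder the batch."""
    out = []
    for seq, payload, tag in messages:
        fault = rng.choice(["ok", "drop", "dup", "corrupt"])
        if fault == "drop":
            continue
        if fault == "corrupt":
            payload = payload + b"!"  # tamper without fixing the tag
        out.append((seq, payload, tag))
        if fault == "dup":
            out.append((seq, payload, tag))
    rng.shuffle(out)
    return out


class Receiver:
    def __init__(self):
        self.accepted = {}  # seq -> payload

    def deliver(self, msg) -> bool:
        seq, payload, tag = msg
        expected = hmac.new(KEY, seq.to_bytes(8, "big") + payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return False  # corrupted or forged
        if seq in self.accepted:
            return False  # replayed or duplicated
        self.accepted[seq] = payload
        return True


def run_trial(seed: int, n: int = 50) -> bool:
    """Invariant: everything accepted is exactly what was sent, at most once."""
    rng = random.Random(seed)
    sent = {i: b"msg-%d" % i for i in range(n)}
    rx = Receiver()
    for msg in byzantine_channel([seal(s, p) for s, p in sent.items()], rng):
        rx.deliver(msg)
    return all(sent.get(s) == p for s, p in rx.accepted.items())
```

Seeding the random generator keeps a failing trial reproducible, which matters more than the number of trials run.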
Operational notes
- Protect the edge and the evidence: rate limits + SIEM + log integrity.
- Instrument cost: which defenses become expensive and when.
- Make emergency controls quick: feature flags, circuit breakers, safe defaults.
- Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
- Document and rehearse degraded-mode policy with on-call rotations.
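For the log-integrity point above, one simple technique is a hash chain, where each entry commits to the one before it. This is a minimal sketch; `append_entry`, `verify_chain`, and the entry layout are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev: str, event: dict) -> str:
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev + body).encode()).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": _digest(prev, event)})


def verify_chain(log: list) -> bool:
    """Recompute every link; any edit to an earlier entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

A chain only detects partial edits. Someone who can rewrite the whole log can recompute every hash, so the latest hash would need to be stored somewhere the attacker cannot reach.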
Attach explicit rollout/rollback triggers to changes that touch security or correctness.
What to monitor
- Error budget burn + tail latency under load.
- Authz failures and policy denials (unexpected spikes).
- Rollback events and the conditions that triggered them.
- Admission-control / rate-limit rejections (by reason).
- Retry/timeout rates by endpoint and client cohort.
Rollback plan
- Keep dual-write / dual-verify windows where appropriate.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Define an explicit rollback trigger (metrics + thresholds).
- Use canaries and staged rollout; stop early when signals degrade.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
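An explicit rollback trigger can be written down as data plus one decision function. The thresholds and names below are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    max_error_rate: float = 0.02        # hypothetical threshold
    max_p99_latency_ms: float = 800.0   # hypothetical threshold
    min_requests: int = 200             # avoid deciding on too little data

    def should_rollback(self, requests: int, errors: int, p99_latency_ms: float):
        """Return (decision, reason) for one canary observation window."""
        if requests < self.min_requests:
            return (False, "insufficient_data")
        if errors / requests > self.max_error_rate:
            return (True, "error_rate")
        if p99_latency_ms > self.max_p99_latency_ms:
            return (True, "p99_latency")
        return (False, "healthy")
```

Returning the reason alongside the decision lets the rollback event be recorded with the condition that triggered it.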
Evidence
- Let's Encrypt Incident Reports (1) — Operational failures and recovery in real-world PKI.
  - Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.
- Learn TLA+ (2) — Practical entry point for specification and model checking.
  - Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
Open questions
- Which operation, if abused, causes irreversible damage?
- Where does cost asymmetry work against you today, and can you flip it?
- What is your “safe mode” when dependencies fail?
- How do you keep control-plane access during widespread incidents?
Checklist
- Telemetry captures correctness signals.
- Safety properties stated as invariants.
- Failure modes enumerated with mitigations.
- Assumptions listed and reviewed.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Rollback plan rehearsed and automated.
Further reading
- RFC 4271: BGP-4 — Routing is part of your threat model whether you like it or not.
- RFC 6480: An Infrastructure to Support Secure Internet Routing — RPKI basics and why routing security is hard operationally.
- Let's Encrypt Incident Reports — Operational failures and recovery in real-world PKI.
- Cloudflare Outage (July 2, 2019) Postmortem — A concrete example of global failure, containment, and recovery lessons.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Learn TLA+ — Practical entry point for specification and model checking.