Monthly research note. Theme: Adversarial Infrastructure & Global Systems.
TL;DR
A focused memo on “Secure Enclaves in Distributed Systems: Remote Attestation and Trust”. The approach: define the model, state the properties, then design the system so those properties remain true under failure and under adversarial pressure.
Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
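A minimal sketch of the three-outcome idea, assuming a caller that wraps some remote call. The names `Outcome`, `classify`, and `next_step` are illustrative and not part of any library. Treating every non-timeout exception as a definite rejection is a simplifying assumption.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"  # the operation is known to have been applied
    FAILURE = "failure"  # the operation is known NOT to have been applied
    UNKNOWN = "unknown"  # timeout: it may or may not have been applied


def classify(call):
    """Run `call` and map what happened onto the three outcomes."""
    try:
        call()
        return Outcome.SUCCESS
    except TimeoutError:
        # No reply arrived in time; the remote side may still have acted.
        return Outcome.UNKNOWN
    except Exception:
        # Assumption for this sketch: any other exception is a definite rejection.
        return Outcome.FAILURE


def next_step(outcome, idempotent):
    """Decide what the caller may safely do next."""
    if outcome is Outcome.SUCCESS:
        return "done"
    if outcome is Outcome.FAILURE:
        return "retry"
    # UNKNOWN: a blind retry is only safe when the operation is idempotent.
    return "retry" if idempotent else "reconcile"
```

The point of the third branch is that a non-idempotent operation that timed out needs reconciliation (read back the state) rather than a retry.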
Key takeaways
- Protect observability: you can’t respond blind, and telemetry can be attacked.
- Evidence pipelines (audit/config history) are part of incident response correctness.
- Engineer cost asymmetry: defense must be cheaper than attack per unit of damage prevented.
- Measure correctness signals, not only latency/throughput.
- Define safety properties before performance goals.
Why this matters
- Incident response is a protocol: practice it, automate it, validate it.
- Logs are only useful if they remain trustworthy under compromise.
- Privacy failures often come from metadata, not plaintext.
- Global dependencies (DNS, routing, PKI) are shared attack surfaces.
Key questions
- What is your degraded-mode behavior (and is it safe)?
- How do you prevent dependency failures from becoming integrity failures?
- Which logs are trustworthy under compromise (append-only, signed, isolated)?
- Where is the attacker’s leverage (routing, DNS, dependency, identity, time)?
- Which controls fail first under load: auth, rate limits, storage, or observability?
- How do you detect attacks that look like “normal traffic spikes”?
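One way to make the “append-only, signed” question concrete is a hash-chained log in which each entry carries an HMAC over its content and the previous entry's MAC. This is a sketch only: `append_entry`, `verify_log`, and the entry layout are hypothetical, and it assumes the key is held somewhere the compromised host cannot read.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key, prev, event):
    """HMAC-SHA256 over the previous MAC and the event, in canonical JSON."""
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hmac.new(key, body.encode(), hashlib.sha256).hexdigest()


def append_entry(log, key, event):
    """Append `event` (a JSON-serializable dict), chained to the last entry."""
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"prev": prev, "event": event, "mac": _mac(key, prev, event)})


def verify_log(log, key):
    """Return True only if every entry is authentic and correctly linked."""
    prev = GENESIS
    for entry in log:
        expected = _mac(key, prev, entry["event"])
        if entry["prev"] != prev or not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev = entry["mac"]
    return True
```

Editing or deleting an entry in the middle breaks the chain. Truncating the tail does not, so the latest MAC still has to be anchored somewhere isolated for that case to be detectable.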
Assumptions
- Some dependencies will fail open or fail closed unexpectedly.
- Attackers can manipulate routing and DNS indirectly (upstream failures, BGP issues).
- Operators are human and will make mistakes under pressure.
- Traffic spikes can be malicious or accidental; you must handle both.
Non-goals
- Relying on dashboards that vanish during the incident.
- Assuming perfect attribution (you rarely know who is attacking in real time).
Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.
Model & invariants
Defense is about cost asymmetry. If every unit the attacker spends forces you to spend more than that in response, you lose.
Treat observability as a dependency: protect it from overload and manipulation.
Engineer friction where attackers pay but legitimate users don’t (asymmetric controls).
If the system can enter an invalid state, it eventually will—usually during an incident.
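A per-client token bucket is one example of an asymmetric control: a client sending at a normal rate never notices it, while a flood from one identity is capped at a fixed cost to the defender. The sketch below is illustrative; `TokenBucket`, `PerClientLimiter`, and the rate and burst parameters are assumptions, and time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum tokens the bucket can hold
        self.tokens = burst   # start full
        self.updated = 0.0    # timestamp of the last refill

    def allow(self, now, cost=1.0):
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerClientLimiter:
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        # Note: this map is unbounded here; a real limiter must cap or expire it,
        # otherwise the limiter itself becomes a memory-exhaustion target.
        self.buckets = {}

    def allow(self, client_id, now, cost=1.0):
        if client_id not in self.buckets:
            self.buckets[client_id] = TokenBucket(self.rate, self.burst)
        return self.buckets[client_id].allow(now, cost)
```

Each decision costs the defender a dictionary lookup and a few arithmetic operations, regardless of how many requests the attacker sends.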
Security properties
- Authenticity: actions are bound to identity and purpose.
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Replay resistance: duplicated inputs do not change outcomes.
- Least authority: privileges are scoped by purpose and time.
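Replay resistance can be illustrated with idempotency keys: the first request with a given ID changes state, and every duplicate returns the recorded result. `Ledger` and `deposit` are toy names for this sketch. It assumes the request ID is already bound to an authenticated caller and that the `seen` map would be stored durably in a real system.

```python
class Ledger:
    """Toy account that applies each request ID at most once."""

    def __init__(self):
        self.balance = 0
        self.seen = {}  # request_id -> result of the first application

    def deposit(self, request_id, amount):
        if request_id in self.seen:
            # Replay: return the recorded result without changing state.
            return self.seen[request_id]
        self.balance += amount
        self.seen[request_id] = self.balance
        return self.balance
```

With this shape, a retried or duplicated `deposit("r1", 10)` leaves the balance at 10 rather than 20.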
Failure modes
- Config drift that weakens security posture over time.
- Mixed-version behavior that violates assumptions silently.
- Observability gaps during incidents (missing evidence).
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Sampling hides the rare schedule that breaks your invariants.
Design sketch
flowchart TD
edge["Edge (rate limits + WAF)"] --> core["Core Services"]
core --> data["Data Plane"]
data --> control["Control Plane"]
control --> edge
siem["Detection/Response"] --> core
siem --> edge
Implementation notes
Keep evidence pipelines alive: you can’t respond blind.
Acknowledge only after durability (or make “ack” explicitly best-effort).
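A sketch of “ack after durability” for a single append-only file, assuming a POSIX-like filesystem. `durable_append` is a hypothetical helper. It does not fsync the parent directory, which a newly created file would also need before it can be considered durable.

```python
import os


def durable_append(path, record):
    """Append `record` (bytes) and return True only after it has been fsynced."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush the file to stable storage
    return True  # only now is it safe to acknowledge the write to the client
```

If the process crashes before `os.fsync` returns, no acknowledgment was sent, so the client sees a timeout rather than a false success.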
Degraded-mode table (example):
Operation | Normal | Under attack   | Rationale
Auth      | full   | strict         | prevent abuse
Reads     | full   | cached/limited | protect core
Writes    | full   | queued/limited | preserve integrity
Admin     | full   | JIT + MFA      | reduce blast radius
Verification strategy
- Observability stress: cardinality explosions and sampling under attack.
- Game days: simulate DDoS, dependency failure, and credential abuse.
- Dependency chaos: DNS issues, cert failures, upstream outages.
- Incident replay: reconstruct timeline from evidence pipelines.
- Policy tests: fail closed/open behaviors are unit-tested.
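The degraded-mode table and the policy-test bullet can be combined: encode the table as data, make the lookup fail closed, and assert on both. The `POLICY` layout, the mode names, and the `"deny"` default are assumptions made for this sketch.

```python
# Degraded-mode policy, transcribed from the example table.
POLICY = {
    "auth":   {"normal": "full", "under_attack": "strict"},
    "reads":  {"normal": "full", "under_attack": "cached/limited"},
    "writes": {"normal": "full", "under_attack": "queued/limited"},
    "admin":  {"normal": "full", "under_attack": "JIT + MFA"},
}


def decide(operation, mode):
    """Return the policy for `operation` in `mode`; unknown inputs fail closed."""
    try:
        return POLICY[operation][mode]
    except KeyError:
        # Assumption: anything the table does not cover is denied.
        return "deny"


def test_policy():
    assert decide("writes", "under_attack") == "queued/limited"
    assert decide("admin", "normal") == "full"
    # Fail-closed behavior is itself under test.
    assert decide("export", "normal") == "deny"
    assert decide("reads", "unknown_mode") == "deny"
```

Whether a given operation should fail open or closed is a design decision. The value of the test is that the decision is written down and checked on every change.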
Operational notes
- Make emergency controls quick: feature flags, circuit breakers, safe defaults.
- Document and rehearse degraded-mode policy with on-call rotations.
- Protect the edge and the evidence: rate limits + SIEM + log integrity.
- Instrument cost: which defenses become expensive and when.
- Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.
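As one example of a quick emergency control, here is a minimal circuit breaker. It is a simplified sketch: `CircuitBreaker`, `max_failures`, and `reset_after` are illustrative names, time is passed in explicitly, and the half-open state lets any caller through instead of a single probe.

```python
class CircuitBreaker:
    def __init__(self, max_failures, reset_after):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to stay open before probing
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def allow(self, now):
        """Return True if a call to the dependency may be attempted."""
        if self.opened_at is None:
            return True
        # Half-open once the cool-down has elapsed: let a probe through.
        return now - self.opened_at >= self.reset_after

    def record(self, ok, now):
        """Report the result of an attempted call."""
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now  # open, or re-open after a failed probe
```

What the caller does while the breaker is open (serve cached data, queue, or reject) is the degraded-mode policy, and belongs in the playbook rather than in the breaker.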
What to monitor
- Authz failures and policy denials (unexpected spikes).
- Invariant violation rate (should be ~0).
- Retry/timeout rates by endpoint and client cohort.
- Error budget burn + tail latency under load.
- Rollback events and the conditions that triggered them.
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Use canaries and staged rollout; stop early when signals degrade.
- Define an explicit rollback trigger (metrics + thresholds).
- Keep dual-write / dual-verify windows where appropriate.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
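An explicit rollback trigger can be as small as a list of metric thresholds evaluated against canary metrics. The metric names and limits below are hypothetical placeholders, not recommendations. A missing metric is treated as a breach, on the assumption that a canary you cannot observe should not be promoted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float  # roll back when the observed value exceeds this


# Hypothetical triggers; real values come from your SLOs.
TRIGGERS = [
    Threshold("error_rate", 0.02),
    Threshold("p99_latency_ms", 800.0),
    Threshold("invariant_violations", 0.0),
]


def breached(canary_metrics, triggers=TRIGGERS):
    """Return the names of metrics that are missing or over their limit."""
    failed = []
    for t in triggers:
        value = canary_metrics.get(t.metric)
        if value is None or value > t.limit:
            failed.append(t.metric)
    return failed


def should_rollback(canary_metrics, triggers=TRIGGERS):
    return bool(breached(canary_metrics, triggers))
```

Returning the list of breached metrics, not only a boolean, keeps a record of why a rollback fired.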
Evidence
- Let's Encrypt Incident Reports (1) — Operational failures and recovery in real-world PKI.
- Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.
- Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Open questions
- Where does cost asymmetry favor the attacker today, and can you flip it?
- What is your “safe mode” when dependencies fail?
- Which operation, if abused, causes irreversible damage?
- How do you keep control-plane access during widespread incidents?
Checklist
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Failure modes enumerated with mitigations.
- Telemetry captures correctness signals.
- Rollback plan rehearsed and automated.
- Safety properties stated as invariants.
- Assumptions listed and reviewed.
Further reading
- RFC 4271: BGP-4 — Routing is part of your threat model whether you like it or not.
- RFC 6480: An Infrastructure to Support Secure Internet Routing — RPKI basics and why routing security is hard operationally.
- Let's Encrypt Incident Reports — Operational failures and recovery in real-world PKI.
- Cloudflare Outage (July 2, 2019) Postmortem — A concrete example of global failure, containment, and recovery lessons.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.