Monthly research note. Theme: Adversarial Infrastructure & Global Systems.

TL;DR

A focused memo on "DDoS at Scale: Adaptive Defense and Cost Asymmetry." Define the model, state the properties, then design the system so those properties remain true under failure and under attack.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Engineer cost asymmetry: defense must be cheaper than attack per unit of damage prevented.
  • Evidence pipelines (audit/config history) are part of incident response correctness.
  • Protect observability: you can’t respond blind, and telemetry can be attacked.
  • Define safety properties before performance goals.
  • Write assumptions down; treat them as interfaces.

Why this matters

  • Attackers exploit cost asymmetry: they make abuse cheap and defense expensive.
  • Incident response is a protocol: practice it, automate it, validate it.
  • Privacy failures often come from metadata, not plaintext.
  • Global dependencies (DNS, routing, PKI) are shared attack surfaces.

Key questions

  • What is the minimum viable recovery path after a catastrophic event?
  • Which logs are trustworthy under compromise (append-only, signed, isolated)?
  • Where is the attacker’s leverage (routing, DNS, dependency, identity, time)?
  • How do you make abuse expensive (proof-of-work, quotas, pricing, friction)?
  • How do you prevent dependency failures from becoming integrity failures?
  • What is your degraded-mode behavior (and is it safe)?

Assumptions

  • Operators are human and will make mistakes under pressure.
  • Observability pipelines can be attacked (cardinality explosions, log injection).
  • Traffic spikes can be malicious or accidental; you must handle both.
  • Some dependencies will fail open or fail closed unexpectedly.

Non-goals

  • Assuming WAF/rate limits are sufficient without architecture changes.
  • Assuming perfect attribution (you rarely know who is attacking in real time).

Attack surface

Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.
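
A minimal sketch of two protections, assuming a metrics pipeline with string labels and line-oriented logs: cap the number of distinct values per label so attacker-chosen values cannot explode cardinality, and escape line breaks so one request cannot forge extra log lines. `LabelLimiter`, `sanitize_log_field`, and the `__other__` bucket are hypothetical names for illustration.

```python
class LabelLimiter:
    """Cap distinct values per label; overflow collapses into one bucket."""

    def __init__(self, max_values: int, overflow: str = "__other__"):
        self.max_values = max_values
        self.overflow = overflow
        self._seen = {}  # label name -> set of admitted values

    def clamp(self, label: str, value: str) -> str:
        seen = self._seen.setdefault(label, set())
        if value in seen:
            return value
        if len(seen) < self.max_values:
            seen.add(value)
            return value
        # Budget exhausted: new values share a single overflow series.
        return self.overflow


def sanitize_log_field(value: str) -> str:
    """Escape CR/LF so a field cannot start a forged log line."""
    return value.replace("\r", "\\r").replace("\n", "\\n")


if __name__ == "__main__":
    limiter = LabelLimiter(max_values=2)
    for path in ["/a", "/b", "/random-1", "/a"]:
        print(limiter.clamp("path", path))
    print(sanitize_log_field("user=eve\nlevel=INFO forged entry"))
```

The limiter admits values first come, first served, so an attacker who arrives early can still occupy the budget. An allowlist of known values is the stricter option where the label set is known in advance.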

Model & invariants

Defense is about cost asymmetry. If the attacker spends 1 and you spend 100, you lose.

$$\mathrm{Cost}_{\text{defense}} \ll \mathrm{Cost}_{\text{attack}} \quad \text{(per unit of damage prevented)}$$

Engineer friction where attackers pay but legitimate users don’t (asymmetric controls).
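
The cost model can be made concrete with a small calculation. This is a sketch with illustrative numbers. The `margin` used to approximate "much cheaper" is an assumption for the example, not a threshold the model prescribes.

```python
def cost_ratio(defense_cost: float, attack_cost: float) -> float:
    """Defender spend per unit of attacker spend, for the same damage prevented."""
    if attack_cost <= 0:
        raise ValueError("attack_cost must be positive")
    return defense_cost / attack_cost


def defender_is_winning(defense_cost: float, attack_cost: float,
                        margin: float = 0.1) -> bool:
    """Model 'much cheaper' as ratio <= margin (margin is an assumed cutoff)."""
    return cost_ratio(defense_cost, attack_cost) <= margin


if __name__ == "__main__":
    # Illustrative: the defender pays 100 for every 1 the attacker pays.
    print(cost_ratio(100, 1), defender_is_winning(100, 1))
    # Illustrative: an asymmetric control flips the ratio to 2 against 50.
    print(cost_ratio(2, 50), defender_is_winning(2, 50))
```

Tracking this ratio per defense, rather than total spend, shows which controls actually shift cost onto the attacker.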

Treat observability as a dependency: protect it from overload and manipulation.

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

  • Authenticity: actions are bound to identity and purpose.
  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Evidence: critical actions emit verifiable audit events.
  • Least authority: privileges are scoped by purpose and time.

Failure modes

  • Mixed-version behavior that violates assumptions silently.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Observability gaps during incidents (missing evidence).
  • Config drift that weakens security posture over time.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

flowchart TD
  edge["Edge (rate limits + WAF)"] --> core["Core Services"]
  core --> data["Data Plane"]
  data --> control["Control Plane"]
  control --> edge
  siem["Detection/Response"] --> core
  siem --> edge

Implementation notes

Prefer containment over heroics: isolate blast radius, keep core correct.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Evidence checklist:

  • Immutable logs (append-only)
  • Signed audit events
  • Time sync monitoring
  • Dependency health snapshots
  • Config change history
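
The first two checklist items can be sketched together as a hash-chained audit log: each record's MAC covers the previous record's MAC, so edits and deletions in the middle of the log become detectable. This is a minimal sketch, not a full design. It uses a shared-secret HMAC where a real system would likely use asymmetric signatures with the key held off the logging host, and `append_event`, `verify_chain`, and `GENESIS` are names invented for this example.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def _mac(key: bytes, prev: str, payload: str) -> str:
    return hmac.new(key, (prev + payload).encode(), hashlib.sha256).hexdigest()


def append_event(log: list, key: bytes, event: dict) -> dict:
    """Append an event whose MAC covers the previous record's MAC."""
    prev = log[-1]["mac"] if log else GENESIS
    payload = json.dumps(event, sort_keys=True)
    record = {"prev": prev, "payload": payload, "mac": _mac(key, prev, payload)}
    log.append(record)
    return record


def verify_chain(log: list, key: bytes) -> bool:
    """Recompute every MAC in order; any edit or mid-log deletion fails."""
    prev = GENESIS
    for record in log:
        expected = _mac(key, prev, record["payload"])
        if record["prev"] != prev or not hmac.compare_digest(record["mac"], expected):
            return False
        prev = record["mac"]
    return True


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    log = []
    append_event(log, key, {"actor": "alice", "action": "config.update"})
    append_event(log, key, {"actor": "bob", "action": "cert.rotate"})
    print(verify_chain(log, key))
    log[0]["payload"] = log[0]["payload"].replace("alice", "mallory")
    print(verify_chain(log, key))
```

The chain alone does not detect truncation of the newest records. Periodically publishing the latest MAC to an isolated store is one way to close that gap.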

Verification strategy

  • Policy tests: fail closed/open behaviors are unit-tested.
  • Dependency chaos: DNS issues, cert failures, upstream outages.
  • Game days: simulate DDoS, dependency failure, and credential abuse.
  • Incident replay: reconstruct timeline from evidence pipelines.
  • Observability stress: cardinality explosions and sampling under attack.
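
As a sketch of the policy-test item, the degraded-mode decision can be isolated in one function and pinned with tests, so a refactor cannot silently flip fail-closed to fail-open. `authorize`, `broken_check`, and the request shape are hypothetical. The tests are plain functions that a runner such as pytest would collect.

```python
def authorize(check, request: dict, fail_closed: bool = True) -> bool:
    """Return the dependency's decision; on error, apply the degraded-mode policy."""
    try:
        return bool(check(request))
    except Exception:
        # Fail closed denies on dependency failure; fail open allows.
        return not fail_closed


def broken_check(request: dict) -> bool:
    """Stand-in for a policy service that is down."""
    raise TimeoutError("policy service unreachable")


def test_fail_closed_denies_when_dependency_is_down():
    assert authorize(broken_check, {"path": "/admin"}, fail_closed=True) is False


def test_fail_open_allows_when_dependency_is_down():
    assert authorize(broken_check, {"path": "/health"}, fail_closed=False) is True


def test_healthy_dependency_decides():
    assert authorize(lambda r: r["path"] != "/admin", {"path": "/admin"}) is False
    assert authorize(lambda r: r["path"] != "/admin", {"path": "/docs"}) is True
```

Making `fail_closed` the default means a caller has to opt in to the riskier behavior explicitly.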

Operational notes

  • Instrument cost: which defenses become expensive and when.
  • Make emergency controls quick: feature flags, circuit breakers, safe defaults.
  • Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
  • Protect the edge and the evidence: rate limits + SIEM + log integrity.
  • Document and rehearse degraded-mode policy with on-call rotations.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

  • Rollback events and the conditions that triggered them.
  • Admission-control / rate-limit rejections (by reason).
  • Authz failures and policy denials (unexpected spikes).
  • Retry/timeout rates by endpoint and client cohort.
  • Invariant violation rate (should be ~0).
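
To illustrate counting rejections by reason, here is a minimal token-bucket sketch with a per-reason counter. The class, the reason strings, and the explicit `now` argument (passed in so behavior is deterministic and testable) are assumptions for this example.

```python
from collections import Counter


class TokenBucket:
    """Refill `rate` tokens per second, up to `burst`."""

    def __init__(self, rate: float, burst: float, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        elapsed = max(0.0, now - self.updated)  # ignore clock going backwards
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


rejections = Counter()  # reason -> count; export this as a labeled metric


def admit(bucket: TokenBucket, now: float, authenticated: bool) -> bool:
    """Admission control that records why each request was rejected."""
    if not authenticated:
        rejections["unauthenticated"] += 1
        return False
    if not bucket.allow(now):
        rejections["rate_limited"] += 1
        return False
    return True


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, burst=2.0)
    print([admit(bucket, 0.0, True) for _ in range(3)])
    print(admit(bucket, 1.0, True), admit(bucket, 1.0, False))
    print(dict(rejections))
```

Keeping the reason set small and fixed keeps this metric itself safe from the cardinality problems described earlier.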

Rollback plan

  • Define an explicit rollback trigger (metrics + thresholds).
  • Use canaries and staged rollout; stop early when signals degrade.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Keep dual-write / dual-verify windows where appropriate.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
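
An explicit rollback trigger can be expressed as data plus a pure function, which makes it reviewable and testable before the rollout starts. The sketch below assumes one metric sampled at a fixed interval. `Trigger`, `min_consecutive`, and the sample values are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float
    min_consecutive: int = 3  # consecutive bad samples required, to avoid flapping


def should_roll_back(trigger: Trigger, samples: Iterable[float]) -> bool:
    """True once the metric exceeds the threshold min_consecutive times in a row."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > trigger.threshold else 0
        if streak >= trigger.min_consecutive:
            return True
    return False


if __name__ == "__main__":
    trigger = Trigger(metric="error_rate", threshold=0.05)
    print(should_roll_back(trigger, [0.01, 0.06, 0.07, 0.02, 0.08]))
    print(should_roll_back(trigger, [0.01, 0.06, 0.07, 0.09]))
```

Requiring a streak trades a few samples of detection delay for fewer spurious rollbacks. The right value depends on the sampling interval and on how costly a false rollback is.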

Evidence

  • Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
  • Learn TLA+ (2) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

  • Where do you pay cost asymmetry today—and can you flip it?
  • How do you keep control-plane access during widespread incidents?
  • What is your ‘safe mode’ when dependencies fail?
  • Which operation, if abused, causes irreversible damage?

Checklist

  • Failure modes enumerated with mitigations.
  • Rollback plan rehearsed and automated.
  • Assumptions listed and reviewed.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Safety properties stated as invariants.
  • Telemetry captures correctness signals.

Further reading

1. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/
2. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/