BGP and Routing Attacks: Engineering for the Internet We Have

Monthly research note. Theme: Adversarial Infrastructure & Global Systems.

TL;DR

Treat BGP and routing attacks as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Protect observability: you can’t respond blind, and telemetry can be attacked.
Degraded modes are security decisions; write them down and test them.
Engineer cost asymmetry: defense must be cheaper than attack per unit of damage prevented.
Design rollbacks as part of the happy path.
Write assumptions down; treat them as interfaces.

Why this matters

Global dependencies (DNS, routing, PKI) are shared attack surfaces.
Attackers exploit cost asymmetry: they keep abuse cheap and make defense expensive.
Degraded modes without explicit policy become accidental vulnerabilities.
Privacy failures often come from metadata, not plaintext.

Key questions

What is your degraded-mode behavior (and is it safe)?
What is the minimum viable recovery path after a catastrophic event?
Which controls fail first under load: auth, rate limits, storage, or observability?
Which logs are trustworthy under compromise (append-only, signed, isolated)?
How do you detect attacks that look like “normal traffic spikes”?
How do you prevent dependency failures from becoming integrity failures?

Assumptions

Observability pipelines can be attacked (cardinality explosions, log injection).
Traffic spikes can be malicious or accidental; you must handle both.
Attackers can influence routing and DNS without touching your systems directly (upstream failures, BGP hijacks, route leaks).
Some dependencies will fail open or fail closed unexpectedly.

Non-goals

Treating degraded modes as “we’ll decide later.”
Relying on dashboards that vanish during the incident.

Attack surface

Negotiation and fallbacks are where security silently becomes optional; treat them as hostile.

Model & invariants

Defense is about cost asymmetry. If the attacker spends $1$ and you spend $100$, you lose.

$$\mathrm{Cost}_{\text{defense}} \ll \mathrm{Cost}_{\text{attack}}\ \text{(per unit of damage prevented)}.$$

Treat observability as a dependency: protect it from overload and manipulation.

Define which operations fail closed vs fail open. Do it before an incident.

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
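
One way to make this invariant executable is to order updates by an (epoch, counter) pair and reject anything that is not strictly newer. The sketch below is illustrative only: `Version`, `Register`, and the example values are hypothetical names, and it assumes the epoch is bumped on failover or leadership change.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    """Logical version: compared by epoch first, then by counter."""
    epoch: int    # assumed to be bumped on failover or leadership change
    counter: int  # bumped on every accepted update within an epoch


class Register:
    """Holds a value and accepts only strictly newer versions."""

    def __init__(self) -> None:
        self.version = Version(0, 0)
        self.value = None

    def apply(self, version: Version, value) -> bool:
        # Replays and stale updates are rejected; no wall clock is consulted.
        if version <= self.version:
            return False
        self.version, self.value = version, value
        return True


reg = Register()
assert reg.apply(Version(1, 1), "policy-a")
assert not reg.apply(Version(1, 1), "replayed")    # same version: rejected
assert not reg.apply(Version(0, 99), "old-epoch")  # older epoch: rejected
assert reg.apply(Version(2, 0), "policy-b")        # newer epoch wins
```

Because the comparison never reads a timestamp, a node with a skewed clock cannot make a stale update look fresh.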

Security properties

Authenticity: actions are bound to identity and purpose.
Evidence: critical actions emit verifiable audit events.
Integrity: invalid transitions are rejected (and detectable).
Least authority: privileges are scoped by purpose and time.
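
The Evidence property can be approximated with a hash-chained, keyed audit log, where each entry's MAC covers the previous entry's MAC. This is a minimal sketch, not a complete design: the function names, the entry layout, and the in-memory list are assumptions, and key management is out of scope.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key: bytes, prev: str, event: dict) -> str:
    # The MAC binds the event to everything before it via `prev`.
    payload = prev.encode() + json.dumps(event, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def append(log: list, key: bytes, event: dict) -> None:
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"event": event, "prev": prev, "mac": _mac(key, prev, event)})


def verify(log: list, key: bytes) -> bool:
    prev = GENESIS
    for entry in log:
        expected = _mac(key, prev, entry["event"])
        if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev = entry["mac"]
    return True


key = b"demo-key-not-for-production"
log: list = []
append(log, key, {"actor": "alice", "action": "rotate-cert"})
append(log, key, {"actor": "bob", "action": "change-policy"})
assert verify(log, key)

log[0]["event"]["action"] = "nothing-happened"  # tamper with history
assert not verify(log, key)
```

A chain like this detects edits and reordering, but not truncation of the tail; detecting that would need the latest MAC to be stored somewhere the attacker cannot reach.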

Failure modes

Recovery paths that only work when nothing is broken.
Config drift that weakens security posture over time.
Observability gaps during incidents (missing evidence).
Mixed-version behavior that violates assumptions silently.

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart LR
  attack["Attack"] --> detect["Detect"]
  detect --> contain["Contain"]
  contain --> recover["Recover"]
  recover --> learn["Learn/Regress"]
  learn --> detect

Implementation notes

Keep evidence pipelines alive: you can’t respond blind.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).
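
As a sketch of this rule for a local append-only file, the function below returns its acknowledgement only after `os.fsync` has returned. The name `durable_append` and the newline-delimited record format are assumptions made for illustration.

```python
import os
import tempfile


def durable_append(path: str, record: bytes) -> bool:
    """Append one record; return True (the "ack") only after fsync succeeds."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()             # move data from Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to persist it to stable storage
    return True               # any failure above raises, so no ack is sent


with tempfile.TemporaryDirectory() as d:
    wal = os.path.join(d, "wal.log")
    assert durable_append(wal, b"announce 192.0.2.0/24")
    with open(wal, "rb") as f:
        assert f.read() == b"announce 192.0.2.0/24\n"
```

On many filesystems a newly created file also needs its parent directory synced before the entry itself is durable; that step is omitted here.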

Degraded-mode table (example):
Operation | Normal | Under attack   | Rationale
Auth      | full   | strict         | prevent abuse
Reads     | full   | cached/limited | protect core
Writes    | full   | queued/limited | preserve integrity
Admin     | full   | JIT + MFA      | reduce blast radius
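
The table can also live in code, so the policy is reviewed and tested like any other interface. The sketch below encodes the rows above as data; the `Mode` enum, the `behavior` helper, and the choice to return "deny" for unknown operations are assumptions, not part of the table.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    UNDER_ATTACK = "under_attack"


# One entry per row of the degraded-mode table.
POLICY = {
    "auth":   {Mode.NORMAL: "full", Mode.UNDER_ATTACK: "strict"},
    "reads":  {Mode.NORMAL: "full", Mode.UNDER_ATTACK: "cached/limited"},
    "writes": {Mode.NORMAL: "full", Mode.UNDER_ATTACK: "queued/limited"},
    "admin":  {Mode.NORMAL: "full", Mode.UNDER_ATTACK: "JIT + MFA"},
}


def behavior(operation: str, mode: Mode) -> str:
    # Assumption: an operation missing from the table fails closed.
    return POLICY.get(operation, {}).get(mode, "deny")


assert behavior("writes", Mode.UNDER_ATTACK) == "queued/limited"
assert behavior("bulk-export", Mode.NORMAL) == "deny"
```

Keeping the default explicit means a newly added operation cannot silently inherit permissive behavior during an incident.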

Verification strategy

Observability stress: cardinality explosions and sampling under attack.
Incident replay: reconstruct timeline from evidence pipelines.
Policy tests: fail closed/open behaviors are unit-tested.
Game days: simulate DDoS, dependency failure, and credential abuse.
Dependency chaos: DNS issues, cert failures, upstream outages.
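
A policy test in the sense above might look like the following sketch. `DependencyDown`, `authorize`, and the test names are hypothetical; the point is only that the fail-closed or fail-open default is an explicit argument that a test can pin down.

```python
class DependencyDown(Exception):
    """Raised by a hypothetical policy backend when it is unreachable."""


def authorize(check, *, fail_open: bool) -> bool:
    """Run `check`; if the dependency is down, return the documented default."""
    try:
        return bool(check())
    except DependencyDown:
        return fail_open


def _down():
    raise DependencyDown("policy service unreachable")


def test_admin_fails_closed():
    assert authorize(_down, fail_open=False) is False


def test_cached_read_fails_open():
    assert authorize(_down, fail_open=True) is True


def test_healthy_dependency_decides():
    # The default must not override a real answer from the dependency.
    assert authorize(lambda: False, fail_open=True) is False
```

The functions follow the `test_` naming convention, so a runner such as pytest would collect them; they can also be called directly.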

Operational notes

Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
Protect the edge and the evidence: rate limits + SIEM + log integrity.
Instrument cost: which defenses become expensive and when.
Make emergency controls quick: feature flags, circuit breakers, safe defaults.
Document and rehearse degraded-mode policy with on-call rotations.
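
For the emergency controls mentioned above, a circuit breaker can be sketched in a few lines. This is a simplified illustration with assumed parameter names and defaults; it is not thread-safe.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; probes after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # monotonic by default, so wall-clock jumps are ignored
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, calls are allowed again as probes.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed probe


now = [0.0]  # fake clock so the example does not need to sleep
cb = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])
cb.record(False)
cb.record(False)
assert not cb.allow()  # open: callers should use the degraded path
now[0] = 10.0
assert cb.allow()      # cool-down elapsed: probe allowed
```

Injecting the clock keeps the breaker testable without real delays.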

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Rollback events and the conditions that triggered them.
Error budget burn + tail latency under load.
Admission-control / rate-limit rejections (by reason).
Authz failures and policy denials (unexpected spikes).
Retry/timeout rates by endpoint and client cohort.
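
Counting rejections "by reason" is itself an attack surface if the reason string is attacker-influenced, since it can cause the cardinality explosion listed under Assumptions. A minimal sketch, with a hypothetical fixed label set and an in-process counter standing in for a real metrics client:

```python
from collections import Counter

# Hypothetical, closed set of labels agreed on in advance.
ALLOWED_REASONS = frozenset({"rate_limit", "queue_full", "authz_denied"})

rejections: Counter = Counter()


def record_rejection(reason: str) -> None:
    # Unknown reasons are folded into one bucket so that arbitrary strings
    # cannot create an unbounded number of metric labels.
    label = reason if reason in ALLOWED_REASONS else "other"
    rejections[label] += 1


record_rejection("rate_limit")
record_rejection("rate_limit")
record_rejection("x" * 1000)  # hostile or unexpected value
assert rejections == {"rate_limit": 2, "other": 1}
```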

Rollback plan

Prefer backward-compatible changes; avoid “flag day” upgrades.
Define an explicit rollback trigger (metrics + thresholds).
Use canaries and staged rollout; stop early when signals degrade.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Keep dual-write / dual-verify windows where appropriate.
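
An explicit rollback trigger can be as small as one pure function over the metrics that are already monitored. The thresholds and names below are placeholders chosen for illustration, and the rule that too few samples never triggers a rollback is an assumption that would need review for small canaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float = 0.02        # hypothetical: 2% of requests
    max_p99_latency_ms: float = 800.0   # hypothetical latency ceiling
    min_samples: int = 500              # below this, the signal is too noisy


def should_roll_back(errors: int, requests: int, p99_latency_ms: float,
                     t: Thresholds = Thresholds()) -> bool:
    if requests < t.min_samples:
        return False  # assumption: wait for more data rather than flap
    error_rate = errors / requests
    return error_rate > t.max_error_rate or p99_latency_ms > t.max_p99_latency_ms


assert should_roll_back(errors=30, requests=1000, p99_latency_ms=120.0)
assert not should_roll_back(errors=5, requests=1000, p99_latency_ms=120.0)
```

Keeping the decision free of side effects makes it easy to replay against recorded incident metrics.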

Evidence

Let's Encrypt Incident Reports (1) — Operational failures and recovery in real-world PKI.
- Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.
Learn TLA+ (2) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

How do you keep control-plane access during widespread incidents?
What is your ‘safe mode’ when dependencies fail?
Which operation, if abused, causes irreversible damage?
Where do you pay cost asymmetry today—and can you flip it?

Checklist

Failure modes enumerated with mitigations.
Safety properties stated as invariants.
Assumptions listed and reviewed.
Rollback plan rehearsed and automated.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Telemetry captures correctness signals.
