Monthly research note. Theme: Adversarial Infrastructure & Global Systems.

TL;DR

A focused memo on supply chain attacks, specifically dependency poisoning and maintainer compromise. The approach: define the model, state the properties, then design the system so those properties hold under failure and adversarial pressure.

Key insight

Treat “timeouts” as a third outcome: neither success nor failure, but ambiguity you must model explicitly.
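A minimal sketch of the three-outcome idea, assuming a caller that wraps a remote operation in a callable. The names `Outcome`, `classify`, and `next_step` are illustrative, not taken from any library, and the "retry"/"reconcile" split is one possible policy rather than the only one.

```python
import enum


class Outcome(enum.Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"  # timed out: the operation may or may not have happened


def classify(call):
    """Run `call` and map what happened onto the three outcomes."""
    try:
        call()
        return Outcome.SUCCESS
    except TimeoutError:
        # No answer is not the same as a negative answer.
        return Outcome.UNKNOWN
    except Exception:
        return Outcome.FAILURE


def next_step(outcome, idempotent):
    """Hypothetical policy: only retry blindly when the operation is idempotent."""
    if outcome is Outcome.SUCCESS:
        return "commit"
    if outcome is Outcome.FAILURE:
        return "abort"
    return "retry" if idempotent else "reconcile"
```

The point of the third branch is that a non-idempotent operation with an unknown outcome needs a reconciliation step (query the authoritative state) before anything is retried.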

Key takeaways

  • Evidence pipelines (audit/config history) are part of incident response correctness.
  • Engineer cost asymmetry: defense must be cheaper than attack per unit of damage prevented.
  • Degraded modes are security decisions; write them down and test them.
  • Design rollbacks as part of the happy path.
  • Make failure modes explicit and observable.

Why this matters

  • Global dependencies (DNS, routing, PKI) are shared attack surfaces.
  • Attackers exploit cost asymmetry: they keep abuse cheap for themselves and make defense expensive for you.
  • Privacy failures often come from metadata, not plaintext.
  • Logs are only useful if they remain trustworthy under compromise.

Key questions

  • How do you make abuse expensive (proof-of-work, quotas, pricing, friction)?
  • What is your degraded-mode behavior (and is it safe)?
  • How do you prevent dependency failures from becoming integrity failures?
  • Where is the attacker’s leverage (routing, DNS, dependency, identity, time)?
  • Which logs are trustworthy under compromise (append-only, signed, isolated)?
  • What is the minimum viable recovery path after a catastrophic event?

Assumptions

  • Operators are human and will make mistakes under pressure.
  • Traffic spikes can be malicious or accidental; you must handle both.
  • Some dependencies will fail open or fail closed unexpectedly.
  • Attackers can manipulate routing and DNS indirectly (upstream failures, BGP issues).

Non-goals

  • Assuming perfect attribution (you rarely know who is attacking in real time).
  • Treating degraded modes as “we’ll decide later.”

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
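One way to bound per-request work is to enforce fixed caps while consuming input, rather than after it has been read. The sketch below assumes a request arrives as an iterable of byte strings; `bounded_parse`, `BudgetExceeded`, and the default limits are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a single request tries to consume more than its budget."""


def bounded_parse(items, max_items=1000, max_bytes=64 * 1024):
    """Decode an iterable of byte strings, refusing to exceed fixed caps.

    The caps are checked as items arrive, so an endless or oversized input
    is rejected after a bounded amount of work.
    """
    total = 0
    out = []
    for i, item in enumerate(items):
        if i >= max_items:
            raise BudgetExceeded("too many items")
        total += len(item)
        if total > max_bytes:
            raise BudgetExceeded("too many bytes")
        out.append(item.decode("utf-8", errors="replace"))
    return out
```

Because the input is consumed lazily, even an infinite generator costs at most `max_items` iterations before the request is rejected.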

Model & invariants

Defense is about cost asymmetry. If the attacker spends 1 and you spend 100, you lose.

$$\mathrm{Cost}_\text{defense} \ll \mathrm{Cost}_\text{attack} \quad \text{(per unit of damage prevented)}$$

Engineer friction where attackers pay but legitimate users don’t (asymmetric controls).
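As an illustration of an asymmetric control, the sketch below pairs a hashcash-style proof-of-work puzzle with a quota: clients inside the quota get a zero-difficulty puzzle, while clients beyond it pay an exponentially growing solving cost. Verification stays at one hash either way. The function names, the quota of 20, and the difficulty curve are assumptions made for the example.

```python
import hashlib
import itertools


def _leading_zero_bits(digest: bytes) -> int:
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def required_bits(requests_in_window: int, free_quota: int = 20) -> int:
    """Hypothetical policy: no puzzle within quota, harder puzzles beyond it."""
    if requests_in_window <= free_quota:
        return 0
    return min(8 + (requests_in_window - free_quota) // 10, 20)


def solve(challenge: bytes, bits: int) -> int:
    """Client side: brute-force a nonce; expected cost is about 2**bits hashes."""
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if _leading_zero_bits(digest) >= bits:
            return nonce


def verify(challenge: bytes, nonce: int, bits: int) -> bool:
    """Server side: a single hash, regardless of difficulty."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return _leading_zero_bits(digest) >= bits
```

In a real deployment the challenge would need to be unpredictable and single-use so that solutions cannot be precomputed or replayed; that part is omitted here.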

Define which operations fail closed vs fail open. Do it before an incident.
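Writing the decision down can be as simple as a table that code consults when a dependency check raises. The sketch below is one possible shape; the operation names and the `decide` helper are hypothetical, and the default for unlisted operations (fail closed) is an assumption of the example.

```python
from enum import Enum


class OnError(Enum):
    FAIL_CLOSED = "deny"
    FAIL_OPEN = "allow"


# Hypothetical operations; the point is that the choice is explicit and reviewable.
POLICY = {
    "admin_action": OnError.FAIL_CLOSED,
    "write": OnError.FAIL_CLOSED,
    "read_public": OnError.FAIL_OPEN,
}


def decide(operation, check):
    """Return True if the operation may proceed.

    `check` is a callable that returns a truthy value when the request is
    authorized and may raise if the dependency behind it is unavailable.
    """
    mode = POLICY.get(operation, OnError.FAIL_CLOSED)  # unknown operations fail closed
    try:
        return bool(check())
    except Exception:
        return mode is OnError.FAIL_OPEN
```

A table like this is also easy to unit-test, which keeps the incident-time behavior from being decided ad hoc.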

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
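A small sketch of the invariant, assuming each update carries an integer epoch issued by some authority. `EpochGuard` is an illustrative name; the rule it enforces is that an epoch equal to or lower than the highest one seen is rejected, which needs no clock at all.

```python
class EpochGuard:
    """Accept an update only if its epoch is strictly newer than any seen so far."""

    def __init__(self):
        self._highest = 0

    def accept(self, epoch: int) -> bool:
        if epoch <= self._highest:
            return False  # stale or replayed: reject regardless of wall-clock time
        self._highest = epoch
        return True
```

For example, after accepting epoch 3, a delayed message carrying epoch 2 is rejected even if its timestamp looks recent on a skewed clock.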

Security properties

  • Integrity: invalid transitions are rejected (and detectable).
  • Evidence: critical actions emit verifiable audit events.
  • Least authority: privileges are scoped by purpose and time.
  • Downgrade resistance: negotiation can’t silently weaken security posture.
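The evidence property can be illustrated with a hash-chained audit log, where each entry commits to the one before it. This is a sketch under simplifying assumptions: events are JSON-serializable dicts, the log is an in-memory list, and the helper names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def append_event(log, event: dict) -> None:
    """Append an event whose hash covers both its content and the previous hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
    log.append({"prev": prev, "event": event, "hash": digest})


def verify_log(log) -> bool:
    """Recompute the chain from the start; any edited or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        body = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

A chain on its own only makes tampering evident if the latest hash is also stored or signed somewhere the attacker cannot rewrite; otherwise the whole log could be regenerated.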

Failure modes

  • Config drift that weakens security posture over time.
  • Mixed-version behavior that violates assumptions silently.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Observability gaps during incidents (missing evidence).

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.
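One way to keep a cache from quietly becoming authoritative is to store a digest with each entry and periodically recompute from the source. The sketch below assumes values are bytes and that `recompute` is the authoritative lookup; the `ValidatedCache` class is hypothetical.

```python
import hashlib


class ValidatedCache:
    def __init__(self, recompute):
        self._recompute = recompute  # authoritative function: key -> bytes
        self._entries = {}           # key -> (value, sha256 hex digest)

    def get(self, key):
        """Serve from cache, filling from the authoritative source on a miss."""
        if key not in self._entries:
            value = self._recompute(key)
            self._entries[key] = (value, hashlib.sha256(value).hexdigest())
        return self._entries[key][0]

    def audit(self, key) -> bool:
        """Recompute from the source and compare digests; evict on mismatch."""
        cached = self._entries.get(key)
        if cached is None:
            return True
        fresh = hashlib.sha256(self._recompute(key)).hexdigest()
        if fresh != cached[1]:
            del self._entries[key]
            return False
        return True
```

The audit can run on a sample of keys in the background; a nonzero mismatch rate is itself a signal worth alerting on.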

Design sketch

flowchart LR
  attack["Attack"] --> detect["Detect"]
  detect --> contain["Contain"]
  contain --> recover["Recover"]
  recover --> learn["Learn/Regress"]
  learn --> detect

Implementation notes

Prefer containment over heroics: isolate blast radius, keep core correct.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Degraded-mode table (example):
Operation | Normal | Under attack   | Rationale
Auth      | full   | strict         | prevent abuse
Reads     | full   | cached/limited | protect core
Writes    | full   | queued/limited | preserve integrity
Admin     | full   | JIT + MFA      | reduce blast radius

Verification strategy

  • Observability stress: cardinality explosions and sampling under attack.
  • Game days: simulate DDoS, dependency failure, and credential abuse.
  • Incident replay: reconstruct timeline from evidence pipelines.
  • Policy tests: fail closed/open behaviors are unit-tested.
  • Dependency chaos: DNS issues, cert failures, upstream outages.

Operational notes

  • Protect the edge and the evidence: rate limits + SIEM + log integrity.
  • Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
  • Instrument cost: which defenses become expensive and when.
  • Document and rehearse degraded-mode policy with on-call rotations.
  • Make emergency controls quick: feature flags, circuit breakers, safe defaults.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

  • Error budget burn + tail latency under load.
  • Authz failures and policy denials (unexpected spikes).
  • Invariant violation rate (should be ~0).
  • Rollback events and the conditions that triggered them.
  • Admission-control / rate-limit rejections (by reason).

Rollback plan

  • Use canaries and staged rollout; stop early when signals degrade.
  • Keep dual-write / dual-verify windows where appropriate.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Define an explicit rollback trigger (metrics + thresholds).
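An explicit trigger can be expressed as a small, testable object rather than a judgment call made mid-incident. The sketch below is illustrative only: the metric names and threshold values are assumptions and would need to be derived from your own SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    # Hypothetical thresholds; tune them against real baselines.
    max_error_rate: float = 0.02
    max_p99_latency_ms: float = 800.0
    max_invariant_violations: int = 0

    def reasons(self, error_rate, p99_latency_ms, invariant_violations):
        """Return the list of breached thresholds; roll back if it is non-empty."""
        breached = []
        if error_rate > self.max_error_rate:
            breached.append("error_rate")
        if p99_latency_ms > self.max_p99_latency_ms:
            breached.append("p99_latency_ms")
        if invariant_violations > self.max_invariant_violations:
            breached.append("invariant_violations")
        return breached
```

Returning the reasons rather than a bare boolean means the rollback event can record which condition fired, which feeds the evidence trail.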

Evidence

  • Let's Encrypt Incident Reports (1) — Operational failures and recovery in real-world PKI.
    • Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.
  • Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
    • Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

  • What is your ‘safe mode’ when dependencies fail?
  • Where do you pay cost asymmetry today—and can you flip it?
  • How do you keep control-plane access during widespread incidents?
  • Which operation, if abused, causes irreversible damage?

Checklist

  • Assumptions listed and reviewed.
  • Failure modes enumerated with mitigations.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Safety properties stated as invariants.
  • Telemetry captures correctness signals.
  • Rollback plan rehearsed and automated.

Further reading

1. Let’s Encrypt. Let’s Encrypt Incident Reports [Internet]. Available from: https://community.letsencrypt.org/c/incidents/16/l/top
2. Beyer B, Jones C, Petoff J, Murphy NR. Site Reliability Engineering: How Google Runs Production Systems [Internet]. O’Reilly Media; 2016. Available from: https://sre.google/sre-book/table-of-contents/