Monthly research note. Theme: Adversarial Infrastructure & Global Systems.

TL;DR

This memo covers Consensus Under Attack: Adaptive Adversaries and Network Control. The approach is to define the model, state the properties, then design the system so those properties remain true under failure and under adversarial pressure.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Protect observability: you can’t respond blind, and telemetry can be attacked.
  • Degraded modes are security decisions; write them down and test them.
  • Dependencies (DNS, routing, PKI) are shared attack surfaces—plan containment.
  • Measure correctness signals, not only latency/throughput.
  • Make failure modes explicit and observable.

Why this matters

  • Privacy failures often come from metadata, not plaintext.
  • Degraded modes without explicit policy become accidental vulnerabilities.
  • Logs are only useful if they remain trustworthy under compromise.
  • Attackers exploit cost asymmetry: abuse is cheap for them while defense is expensive for you.

Key questions

  • How do you make abuse expensive (proof-of-work, quotas, pricing, friction)?
  • Which logs are trustworthy under compromise (append-only, signed, isolated)?
  • Which controls fail first under load: auth, rate limits, storage, or observability?
  • Where is the attacker’s leverage (routing, DNS, dependency, identity, time)?
  • How do you detect attacks that look like “normal traffic spikes”?
  • How do you prevent dependency failures from becoming integrity failures?
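One way to approach the question above about which logs stay trustworthy is a hash-chained, append-only log. The following is a minimal sketch, not a definitive implementation: the names `append`, `verify`, and `GENESIS` are hypothetical, and the record fields are illustrative only.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the predecessor of the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always produces the same hash.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)})


def verify(log: list) -> bool:
    """Return True only if every entry still links to its predecessor."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this is tamper-evident, not tamper-proof: an attacker who controls the whole log can recompute every hash. It only helps if the latest hash is also signed or copied to storage the attacker cannot reach.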

Assumptions

  • Traffic spikes can be malicious or accidental; you must handle both.
  • Some dependencies will fail open or fail closed unexpectedly.
  • Observability pipelines can be attacked (cardinality explosions, log injection).
  • Operators are human and will make mistakes under pressure.
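The observability assumption above can be made concrete with two small guards: a cap on distinct label values (against cardinality explosions) and escaping of line breaks (against log injection). This is a sketch under stated assumptions; `LabelLimiter`, `sanitize_log_field`, and the `__other__` bucket name are hypothetical.

```python
class LabelLimiter:
    """Caps the number of distinct values accepted for one metric label."""

    def __init__(self, max_values: int, overflow: str = "__other__"):
        self.max_values = max_values
        self.overflow = overflow
        self._seen = set()

    def normalize(self, value: str) -> str:
        # Known values pass through unchanged.
        if value in self._seen:
            return value
        # New values are admitted only while there is room.
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        # Everything else collapses into one overflow bucket.
        return self.overflow


def sanitize_log_field(value: str) -> str:
    # Escape CR/LF so attacker-controlled input cannot forge extra log lines.
    return value.replace("\r", "\\r").replace("\n", "\\n")
```

The cap bounds memory and storage cost per label regardless of what clients send, at the price of losing detail for values that arrive after the cap is reached.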

Non-goals

  • Assuming perfect attribution (you rarely know who is attacking in real time).
  • Relying on dashboards that vanish during the incident.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.
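A minimal sketch of treating negotiation as hostile: pick the strongest mutually supported version, and refuse outright when nothing meets a configured floor. The integer version numbers, `MIN_ACCEPTED`, and `DowngradeError` are assumptions for illustration, not any real protocol's scheme.

```python
from typing import Iterable

# Assumption: a higher number means a stronger protocol version.
MIN_ACCEPTED = 3


class DowngradeError(Exception):
    """Raised when the only common versions fall below the floor."""


def negotiate(local: Iterable[int], remote: Iterable[int], floor: int = MIN_ACCEPTED) -> int:
    common = set(local) & set(remote)
    acceptable = [v for v in common if v >= floor]
    if not acceptable:
        # Fail loudly instead of silently falling back to a weak version.
        raise DowngradeError(f"no common version at or above {floor}")
    return max(acceptable)
```

The design choice is that a failed negotiation is an error the caller must handle, so a weakened posture can never be the quiet default.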

Model & invariants

Resilience is about containment:

$$\text{damage} \le \sum_i \text{blast\_radius}(i) \quad\text{with}\quad \text{blast\_radius}(i)\ \text{bounded by design}.$$

Treat observability as a dependency: protect it from overload and manipulation.

Define which operations fail closed vs fail open. Do it before an incident.
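One way to write the fail-closed versus fail-open decision down ahead of time is as data that the code consults when a dependency check errors out. This is a sketch only; the operation names, `POLICY`, and `guarded` are hypothetical.

```python
from enum import Enum
from typing import Callable


class OnFailure(Enum):
    FAIL_CLOSED = "closed"  # deny when the check cannot be completed
    FAIL_OPEN = "open"      # allow when the check cannot be completed


# Hypothetical policy table, reviewed before any incident.
POLICY = {
    "authorize_payment": OnFailure.FAIL_CLOSED,
    "render_recommendations": OnFailure.FAIL_OPEN,
}


def guarded(operation: str, check: Callable[[], bool]) -> bool:
    """Run check(); if it raises, apply the declared policy for the operation."""
    try:
        return bool(check())
    except Exception:
        # Assumption: operations missing from the table fail closed.
        policy = POLICY.get(operation, OnFailure.FAIL_CLOSED)
        return policy is OnFailure.FAIL_OPEN
```

Because the table is plain data, each row can be covered by a unit test and reviewed like any other security-relevant change.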

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

  • Least authority: privileges are scoped by purpose and time.
  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Integrity: invalid transitions are rejected (and detectable).
  • Authenticity: actions are bound to identity and purpose.
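The integrity property above (invalid transitions are rejected and detectable) can be sketched as an explicit transition table plus a record of every rejection. The states, `ALLOWED`, and `transition` are hypothetical examples, not a prescribed model.

```python
# Hypothetical lifecycle: each state maps to the states it may move to.
ALLOWED = {
    "pending": {"active", "cancelled"},
    "active": {"suspended", "closed"},
    "suspended": {"active", "closed"},
    "cancelled": set(),
    "closed": set(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the table."""


def transition(current: str, target: str, rejected: list) -> str:
    """Return the new state, or record and raise on an invalid transition."""
    if target not in ALLOWED.get(current, set()):
        # Recording the attempt is what makes the rejection detectable.
        rejected.append((current, target))
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

In a real system the `rejected` list would be a metric or audit event; a non-zero count is the correctness signal worth alerting on.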

Failure modes

  • Observability gaps during incidents (missing evidence).
  • Config drift that weakens security posture over time.
  • Recovery paths that only work when nothing is broken.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.

Pitfall

Sampling hides the rare schedule (event interleaving) that breaks your invariants.

Design sketch

flowchart LR
  attack["Attack"] --> detect["Detect"]
  detect --> contain["Contain"]
  contain --> recover["Recover"]
  recover --> learn["Learn/Regress"]
  learn --> detect

Implementation notes

Keep evidence pipelines alive: you can’t respond blind.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Degraded-mode table (example):
Operation | Normal | Under attack   | Rationale
Auth      | full   | strict         | prevent abuse
Reads     | full   | cached/limited | protect core
Writes    | full   | queued/limited | preserve integrity
Admin     | full   | JIT + MFA      | reduce blast radius
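
The table above can be kept as data so that code and tests read the same policy the document states. A minimal sketch: the dictionary mirrors the example rows, while `mode_for` and the choice to reject unlisted operations are assumptions.

```python
# Mirrors the example degraded-mode table; values are the table's own labels.
DEGRADED_MODES = {
    "auth": {"normal": "full", "under_attack": "strict"},
    "reads": {"normal": "full", "under_attack": "cached/limited"},
    "writes": {"normal": "full", "under_attack": "queued/limited"},
    "admin": {"normal": "full", "under_attack": "JIT + MFA"},
}


def mode_for(operation: str, under_attack: bool) -> str:
    """Look up the declared mode for an operation in the current posture."""
    row = DEGRADED_MODES.get(operation)
    if row is None:
        # Assumption: an operation with no declared mode is rejected,
        # not given a guessed default.
        raise KeyError(f"no degraded-mode policy for {operation!r}")
    return row["under_attack" if under_attack else "normal"]
```

Raising on an unlisted operation forces every new operation to get an explicit row before it ships.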

Verification strategy

  • Policy tests: fail closed/open behaviors are unit-tested.
  • Incident replay: reconstruct timeline from evidence pipelines.
  • Observability stress: cardinality explosions and sampling under attack.
  • Dependency chaos: DNS issues, cert failures, upstream outages.
  • Game days: simulate DDoS, dependency failure, and credential abuse.

Operational notes

  • Document and rehearse degraded-mode policy with on-call rotations.
  • Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
  • Make emergency controls quick: feature flags, circuit breakers, safe defaults.
  • Instrument cost: which defenses become expensive and when.
  • Protect the edge and the evidence: rate limits + SIEM + log integrity.

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Rollback events and the conditions that triggered them.
  • Invariant violation rate (should be ~0).
  • Retry/timeout rates by endpoint and client cohort.
  • Authz failures and policy denials (unexpected spikes).
  • Admission-control / rate-limit rejections (by reason).
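To illustrate the last item, here is a sketch of a token-bucket admission controller that counts rejections by reason. The class, its parameters, and the reason strings are hypothetical. Time is passed in explicitly so the behavior is deterministic; a real deployment would more likely read a monotonic clock.

```python
from collections import Counter


class TokenBucket:
    """Token bucket that records why each request was rejected."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = now
        self.rejections = Counter()

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = max(self.last, now)
        if cost > self.capacity:
            # This request could never succeed, so report it separately.
            self.rejections["cost_exceeds_capacity"] += 1
            return False
        if self.tokens < cost:
            self.rejections["rate_limited"] += 1
            return False
        self.tokens -= cost
        return True
```

Splitting the counter by reason distinguishes ordinary throttling from requests that are malformed or abusive by construction, which are different signals during an incident.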

Rollback plan

  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Keep dual-write / dual-verify windows where appropriate.
  • Define an explicit rollback trigger (metrics + thresholds).
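
An explicit rollback trigger can be as small as a list of metric thresholds and a function that evaluates them. The metric names and limits below are hypothetical placeholders, and treating a missing metric as a breach is an assumption made here because telemetry can disappear during an incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical metrics and limits; real values come from your own baselines.
TRIGGERS = (
    Threshold("error_rate", 0.02),
    Threshold("p99_latency_ms", 800.0),
    Threshold("invariant_violations", 0.0),
)


def breached(metrics: dict, triggers=TRIGGERS) -> list:
    """Return the names of metrics that are missing or above their limit."""
    out = []
    for t in triggers:
        value = metrics.get(t.metric)
        # Assumption: no data is treated as a breach, not as healthy.
        if value is None or value > t.max_value:
            out.append(t.metric)
    return out


def should_rollback(metrics: dict) -> bool:
    return bool(breached(metrics))
```

Returning the list of breached metrics, not only a boolean, preserves the reason for the rollback as evidence.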

Evidence

  • Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
  • Learn TLA+ (2) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

  • Where do you pay cost asymmetry today—and can you flip it?
  • What is your ‘safe mode’ when dependencies fail?
  • Which operation, if abused, causes irreversible damage?
  • How do you keep control-plane access during widespread incidents?

Checklist

  • Telemetry captures correctness signals.
  • Rollback plan rehearsed and automated.
  • Failure modes enumerated with mitigations.
  • Assumptions listed and reviewed.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Safety properties stated as invariants.

Further reading

1. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/
2. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/