Monthly research note. Theme: Adversarial Infrastructure & Global Systems.

TL;DR

Treat "Metadata and Privacy: The Hard Part Isn’t Encryption" as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Protect observability: you can’t respond blind, and telemetry can be attacked.
  • Engineer cost asymmetry: defense must be cheaper than attack per unit of damage prevented.
  • Evidence pipelines (audit/config history) are part of incident response correctness.
  • Treat retries, reordering, and partial failure as default conditions.
  • Bind security decisions to evidence (audit, invariants, telemetry).

Why this matters

  • Degraded modes without explicit policy become accidental vulnerabilities.
  • Privacy failures often come from metadata, not plaintext.
  • Incident response is a protocol: practice it, automate it, validate it.
  • Global dependencies (DNS, routing, PKI) are shared attack surfaces.

Key questions

  • Where is the attacker’s leverage (routing, DNS, dependency, identity, time)?
  • Which logs are trustworthy under compromise (append-only, signed, isolated)?
  • How do you make abuse expensive (proof-of-work, quotas, pricing, friction)?
  • How do you detect attacks that look like “normal traffic spikes”?
  • What is your degraded-mode behavior (and is it safe)?
  • How do you prevent dependency failures from becoming integrity failures?
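
One way to reason about the "make abuse expensive" question is a hashcash-style proof-of-work check, sketched below under stated assumptions. The function names, the 8-byte nonce encoding, and the use of SHA-256 are illustrative choices, not something this note prescribes. The point it shows is the asymmetry: the defender verifies with one hash, while the client needs about 2^difficulty attempts on average.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the front of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Defender side: a single hash, independent of the difficulty."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force search, roughly 2**difficulty hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

Quotas and pricing shift cost in a similar way without burning client CPU, so which mechanism fits depends on who the legitimate clients are.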

Assumptions

  • Attackers can manipulate routing and DNS indirectly (upstream failures, BGP issues).
  • Operators are human and will make mistakes under pressure.
  • Some dependencies will fail open or fail closed unexpectedly.
  • Observability pipelines can be attacked (cardinality explosions, log injection).

Non-goals

  • Relying on dashboards that vanish during the incident.
  • Assuming perfect attribution (you rarely know who is attacking in real time).

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
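
A minimal sketch of bounding work per request, assuming a hypothetical `WorkBudget` helper and a toy traversal over attacker-supplied nested lists. The names and the one-unit-per-node cost model are assumptions made for illustration.

```python
class BudgetExceeded(Exception):
    """Raised when a request tries to consume more work than it was granted."""


class WorkBudget:
    def __init__(self, max_units: int) -> None:
        self.remaining = max_units

    def spend(self, units: int = 1) -> None:
        # Charge before doing the work so the bound holds even if the work is costly.
        if units > self.remaining:
            raise BudgetExceeded(f"needed {units}, had {self.remaining}")
        self.remaining -= units


def count_leaves(node, budget: WorkBudget) -> int:
    """Walk nested lists iteratively, charging one unit per visited node."""
    stack = [node]
    leaves = 0
    while stack:
        budget.spend()
        current = stack.pop()
        if isinstance(current, list):
            stack.extend(current)
        else:
            leaves += 1
    return leaves
```

The traversal is iterative on purpose: deeply nested input then exhausts the budget instead of the interpreter's recursion limit.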

Model & invariants

Resilience is about containment:

$$
\text{damage} \le \sum_i \text{blast\_radius}(i) \quad \text{with} \quad \text{blast\_radius}(i)\ \text{bounded by design}.
$$

Define which operations fail closed vs fail open. Do it before an incident.
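
As a sketch of what "define it before an incident" could look like in code, the table below maps operations to a degraded-mode decision. The operation names are hypothetical, and defaulting unlisted operations to fail closed is an assumption of this example rather than a universal rule.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_CLOSED = "fail_closed"  # deny when the dependency is unavailable
    FAIL_OPEN = "fail_open"      # allow when the dependency is unavailable


# Hypothetical operations; the point is that every entry is decided in advance.
DEGRADED_POLICY = {
    "authorize_payment": FailureMode.FAIL_CLOSED,
    "serve_cached_profile": FailureMode.FAIL_OPEN,
}


def decide(operation: str, dependency_healthy: bool, normal_decision: bool) -> bool:
    """Return whether to allow the operation."""
    if dependency_healthy:
        return normal_decision
    # Assumption: operations missing from the table fail closed.
    mode = DEGRADED_POLICY.get(operation, FailureMode.FAIL_CLOSED)
    return mode is FailureMode.FAIL_OPEN
```

Because the policy is plain data, it can be reviewed like any other change and exercised in unit tests.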

Treat observability as a dependency: protect it from overload and manipulation.
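
Two small guards illustrate protecting telemetry from overload and manipulation: a cap on distinct label values and escaping of line breaks in log fields. This is a sketch under assumptions. `BoundedLabelSet`, `sanitize_log_field`, and the `__other__` bucket are hypothetical names, and a real metrics pipeline may already offer its own limits.

```python
class BoundedLabelSet:
    """Admit at most max_values distinct label values; fold the rest into one bucket."""

    def __init__(self, max_values: int, overflow_label: str = "__other__") -> None:
        self.max_values = max_values
        self.overflow_label = overflow_label
        self._seen = set()

    def normalize(self, value: str) -> str:
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        # Cap reached: attacker-chosen values can no longer create new series.
        return self.overflow_label


def sanitize_log_field(value: str) -> str:
    """Escape backslash, CR, and LF so one field cannot forge extra log lines."""
    return value.replace("\\", "\\\\").replace("\r", "\\r").replace("\n", "\\n")
```

The cap is first come, first served, so an attacker who arrives early can still occupy slots. Pre-seeding the set with known-good values is one possible mitigation.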

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
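
A minimal sketch of this invariant, assuming a hypothetical `FencedStore` that accepts writes tagged with an epoch number. Ordering is decided only by the integer epoch, so skewed wall clocks on the writers do not matter.

```python
class StaleEpoch(Exception):
    """Raised when a writer presents an epoch older than one already accepted."""


class FencedStore:
    def __init__(self) -> None:
        self.highest_epoch = 0
        self.value = None

    def write(self, epoch: int, value) -> None:
        # Compare epochs, never timestamps: a deposed writer with a fast clock
        # still carries an old epoch and is rejected.
        if epoch < self.highest_epoch:
            raise StaleEpoch(f"epoch {epoch} < {self.highest_epoch}")
        self.highest_epoch = epoch
        self.value = value
```

How epochs get issued (for example by a lease or election service) is outside this sketch.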

Security properties

  • Least authority: privileges are scoped by purpose and time.
  • Authenticity: actions are bound to identity and purpose.
  • Evidence: critical actions emit verifiable audit events.
  • Downgrade resistance: negotiation can’t silently weaken security posture.
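
The downgrade-resistance property can be partly illustrated with a version floor: negotiation picks the highest shared version but refuses anything below a configured minimum. The function and parameter names are hypothetical. A floor alone does not stop an attacker from stripping versions above it, which is why real protocols also authenticate the negotiation itself.

```python
class NegotiationError(Exception):
    """Raised instead of silently accepting a version below the floor."""


def negotiate(client_versions, server_versions, minimum_version: int) -> int:
    """Return the highest mutually supported version at or above the floor."""
    common = set(client_versions) & set(server_versions)
    acceptable = [v for v in common if v >= minimum_version]
    if not acceptable:
        # Fail loudly: a weaker posture must never be the quiet fallback.
        raise NegotiationError("no mutually supported version at or above the floor")
    return max(acceptable)
```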

Failure modes

  • Config drift that weakens security posture over time.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Mixed-version behavior that violates assumptions silently.
  • Timeout ambiguity causing double-apply or partial state transitions.
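
For the timeout-ambiguity failure mode, one common mitigation is an idempotency key: the client reuses the same key when it retries, and the server applies the operation at most once. The `Ledger` class below is a hypothetical in-memory sketch. A real system would need to persist the key and the state change atomically.

```python
class Ledger:
    def __init__(self) -> None:
        self.balance = 0
        self._applied = {}  # idempotency key -> balance returned for that request

    def credit(self, key: str, amount: int) -> int:
        # A retry after an ambiguous timeout carries the same key, so it gets
        # the recorded result back instead of applying the credit twice.
        if key in self._applied:
            return self._applied[key]
        self.balance += amount
        self._applied[key] = self.balance
        return self.balance
```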

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

flowchart TD
  edge["Edge (rate limits + WAF)"] --> core["Core Services"]
  core --> data["Data Plane"]
  data --> control["Control Plane"]
  control --> edge
  siem["Detection/Response"] --> core
  siem --> edge

Implementation notes

Keep evidence pipelines alive: you can’t respond blind.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Evidence checklist

  • Immutable logs (append-only)
  • Signed audit events
  • Time sync monitoring
  • Dependency health snapshots
  • Config change history
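
The first two checklist items can be sketched together as a hash-chained log in which each entry's MAC covers the previous entry's MAC. This is an illustration under assumptions: HMAC-SHA256 with a shared key stands in for a real signature scheme, and the entry layout and function names are hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key: bytes, prev_mac: str, event: dict) -> str:
    # Canonical JSON so the same event always yields the same bytes.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, prev_mac.encode() + payload, hashlib.sha256).hexdigest()


def append_event(log: list, key: bytes, event: dict) -> None:
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"event": event, "mac": _mac(key, prev, event)})


def verify_log(log: list, key: bytes) -> bool:
    """Recompute the chain; any edited, reordered, or dropped interior entry fails."""
    prev = GENESIS
    for entry in log:
        expected = _mac(key, prev, entry["event"])
        if not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev = entry["mac"]
    return True
```

The chain does not detect truncation of the newest entries on its own. Periodically publishing the latest MAC to an isolated system is one way to cover that gap.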

Verification strategy

  • Observability stress: cardinality explosions and sampling under attack.
  • Game days: simulate DDoS, dependency failure, and credential abuse.
  • Dependency chaos: DNS issues, cert failures, upstream outages.
  • Incident replay: reconstruct timeline from evidence pipelines.
  • Policy tests: fail closed/open behaviors are unit-tested.

Operational notes

  • Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
  • Instrument cost: which defenses become expensive and when.
  • Protect the edge and the evidence: rate limits + SIEM + log integrity.
  • Document and rehearse degraded-mode policy with on-call rotations.
  • Make emergency controls quick: feature flags, circuit breakers, safe defaults.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

  • Rollback events and the conditions that triggered them.
  • Retry/timeout rates by endpoint and client cohort.
  • Invariant violation rate (should be ~0).
  • Error budget burn + tail latency under load.
  • Authz failures and policy denials (unexpected spikes).

Rollback plan

  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Keep dual-write / dual-verify windows where appropriate.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Define an explicit rollback trigger (metrics + thresholds).
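
An explicit rollback trigger can be as small as a list of metric thresholds evaluated against a snapshot. The sketch below assumes hypothetical metric names and threshold values. Treating a missing metric as a breach is a design choice of this example, in line with not responding blind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float  # roll back when the observed value exceeds this


# Hypothetical thresholds; real values come from the service's error budget.
TRIGGERS = [
    Trigger("error_rate", 0.02),
    Trigger("p99_latency_ms", 800.0),
    Trigger("invariant_violations", 0.0),
]


def breached(metrics: dict, triggers=TRIGGERS) -> list:
    """Return the names of the triggers that call for a rollback."""
    hits = []
    for trigger in triggers:
        value = metrics.get(trigger.metric)
        # Assumption: a metric that cannot be read counts as a breach.
        if value is None or value > trigger.threshold:
            hits.append(trigger.metric)
    return hits
```

A staged rollout could call `breached` after each canary step and stop as soon as the list is non-empty.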

Evidence

  • Learn TLA+ (1) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
  • Let's Encrypt Incident Reports (2) — Operational failures and recovery in real-world PKI.
    • Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.

Open questions

  • What is your ‘safe mode’ when dependencies fail?
  • How do you keep control-plane access during widespread incidents?
  • Where do you pay cost asymmetry today—and can you flip it?
  • Which operation, if abused, causes irreversible damage?

Checklist

  • Assumptions listed and reviewed.
  • Safety properties stated as invariants.
  • Rollback plan rehearsed and automated.
  • Failure modes enumerated with mitigations.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Telemetry captures correctness signals.

Further reading

1. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/
2. Let’s Encrypt. Let’s Encrypt Incident Reports [Internet]. Available from: https://community.letsencrypt.org/c/incidents/16/l/top