Monthly research note. Theme: Adversarial Infrastructure & Global Systems.

TL;DR

A focused memo on time-based attacks (NTP manipulation, expiration, and replay). Define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

  • Engineer cost asymmetry: defense must be cheaper than attack per unit of damage prevented.
  • Evidence pipelines (audit/config history) are part of incident response correctness.
  • Degraded modes are security decisions; write them down and test them.
  • Prefer protocols and APIs that make invalid states hard to express.
  • Bind security decisions to evidence (audit, invariants, telemetry).

Why this matters

  • Degraded modes without explicit policy become accidental vulnerabilities.
  • Privacy failures often come from metadata, not plaintext.
  • Incident response is a protocol: practice it, automate it, validate it.
  • Logs are only useful if they remain trustworthy under compromise.

Key questions

  • What is your degraded-mode behavior (and is it safe)?
  • Which logs are trustworthy under compromise (append-only, signed, isolated)?
  • Which controls fail first under load: auth, rate limits, storage, or observability?
  • How do you prevent dependency failures from becoming integrity failures?
  • What is the minimum viable recovery path after a catastrophic event?
  • Where is the attacker’s leverage (routing, DNS, dependency, identity, time)?

Assumptions

  • Observability pipelines can be attacked (cardinality explosions, log injection).
  • Traffic spikes can be malicious or accidental; you must handle both.
  • Some dependencies will fail open or fail closed unexpectedly.
  • Attackers can manipulate routing and DNS indirectly (upstream failures, BGP issues).

Non-goals

  • Relying on dashboards that vanish during the incident.
  • Treating degraded modes as “we’ll decide later.”

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.

Model & invariants

Resilience is about containment:

$$\text{damage} \le \sum_i \text{blast\_radius}(i) \quad\text{with}\quad \text{blast\_radius}(i)\ \text{bounded by design}.$$

Define which operations fail closed vs fail open. Do it before an incident.

Engineer friction where attackers pay but legitimate users don’t (asymmetric controls).

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

  • Replay resistance: duplicated inputs do not change outcomes.
  • Least authority: privileges are scoped by purpose and time.
  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Authenticity: actions are bound to identity and purpose.
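The replay-resistance and authenticity properties above can be sketched together as a verifier that checks a MAC, an expiry window, and a nonce cache. This is a minimal sketch under stated assumptions, not a definitive design: the `ReplayGuard` class, the `sign` helper, the message layout, and the `MAX_AGE_S` / `MAX_SKEW_S` values are all hypothetical, and a shared HMAC key is assumed.

```python
import hashlib
import hmac
import struct
import time
from typing import Callable, Dict

MAX_AGE_S = 300   # assumed validity window for a message
MAX_SKEW_S = 30   # assumed tolerance for sender clocks running ahead


def sign(key: bytes, nonce: str, issued_at: int, payload: bytes) -> str:
    """MAC over a length-prefixed encoding so field boundaries are unambiguous."""
    nonce_b = nonce.encode()
    msg = struct.pack(">H", len(nonce_b)) + nonce_b
    msg += struct.pack(">Q", issued_at) + payload
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


class ReplayGuard:
    def __init__(self, key: bytes, now: Callable[[], float] = time.time):
        self._key = key
        self._now = now
        # nonce -> time after which it can safely be forgotten
        self._seen: Dict[str, float] = {}

    def accept(self, nonce: str, issued_at: int, payload: bytes, tag: str) -> bool:
        now = self._now()
        expected = sign(self._key, nonce, issued_at, payload)
        if not hmac.compare_digest(expected, tag):
            return False                      # not authentic
        if issued_at > now + MAX_SKEW_S:
            return False                      # timestamp from the future
        if now - issued_at > MAX_AGE_S:
            return False                      # expired
        # Drop nonces whose messages would be rejected as expired anyway.
        self._seen = {n: t for n, t in self._seen.items() if t > now}
        if nonce in self._seen:
            return False                      # replay
        self._seen[nonce] = issued_at + MAX_AGE_S + MAX_SKEW_S
        return True
```

Note that the expiry check is only as trustworthy as the verifier's own clock: if an attacker can move that clock backwards, expired messages whose nonces have been purged become acceptable again, which is why the clock source is injected here and should be monitored separately.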

Failure modes

  • Recovery paths that only work when nothing is broken.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Mixed-version behavior that violates assumptions silently.
  • Timeout ambiguity causing double-apply or partial state transitions.

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart LR
  attack["Attack"] --> detect["Detect"]
  detect --> contain["Contain"]
  contain --> recover["Recover"]
  recover --> learn["Learn/Regress"]
  learn --> detect

Implementation notes

Prefer containment over heroics: isolate blast radius, keep core correct.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Evidence checklist:
- Immutable logs (append-only)
- Signed audit events
- Time sync monitoring
- Dependency health snapshots
- Config change history
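For the time sync monitoring item, one possible check compares the local clock offset reported by several independent time sources and refuses to call the clock healthy unless a quorum agrees. The sketch assumes the offsets (in seconds) have already been measured elsewhere; the `assess_clock` function, the source names, and all three thresholds are hypothetical.

```python
from statistics import median
from typing import Dict, List, NamedTuple

MAX_OFFSET_S = 0.5    # assumed alert threshold for local clock error
MAX_SPREAD_S = 0.25   # assumed agreement tolerance around the median
MIN_AGREEING = 3      # assumed quorum of sources that must agree


class ClockVerdict(NamedTuple):
    healthy: bool
    median_offset: float
    outliers: List[str]


def assess_clock(offsets: Dict[str, float]) -> ClockVerdict:
    """Judge the local clock from per-source offsets in seconds.

    A source is an outlier if it disagrees with the median by more than
    MAX_SPREAD_S. The clock is healthy only if enough sources agree and
    the agreed offset is small.
    """
    if not offsets:
        return ClockVerdict(False, float("nan"), [])
    mid = median(offsets.values())
    outliers = sorted(
        name for name, off in offsets.items() if abs(off - mid) > MAX_SPREAD_S
    )
    agreeing = len(offsets) - len(outliers)
    healthy = agreeing >= MIN_AGREEING and abs(mid) <= MAX_OFFSET_S
    return ClockVerdict(healthy, mid, outliers)
```

Using the median rather than the mean means a single manipulated source shows up as an outlier instead of dragging the estimate, and the outlier list is itself worth recording as evidence.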

Verification strategy

  • Observability stress: cardinality explosions and sampling under attack.
  • Policy tests: fail closed/open behaviors are unit-tested.
  • Dependency chaos: DNS issues, cert failures, upstream outages.
  • Game days: simulate DDoS, dependency failure, and credential abuse.
  • Incident replay: reconstruct timeline from evidence pipelines.
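Incident replay depends on evidence that can be checked for tampering before a timeline is rebuilt from it. A minimal sketch of one approach is a hash-chained log in which each entry carries an HMAC over its event and the previous entry's MAC. The `append` and `verify` helpers, the entry layout, and the key handling below are hypothetical, and the sketch assumes events are JSON-serializable dictionaries.

```python
import hashlib
import hmac
import json
from typing import Dict, List

GENESIS = "0" * 64  # fixed-length placeholder for "no previous entry"


def _digest(key: bytes, prev: str, event: Dict) -> str:
    # Canonical JSON so the same event always produces the same bytes.
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, prev.encode() + body, hashlib.sha256).hexdigest()


def append(log: List[Dict], key: bytes, event: Dict) -> None:
    """Add an event, chaining it to the MAC of the previous entry."""
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"event": event, "prev": prev, "mac": _digest(key, prev, event)})


def verify(log: List[Dict], key: bytes) -> int:
    """Return the index of the first inconsistent entry, or -1 if intact."""
    prev = GENESIS
    for i, entry in enumerate(log):
        expected = _digest(key, prev, entry["event"])
        if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
            return i
        prev = entry["mac"]
    return -1
```

A chain like this detects edits and removals in the middle of the log, but not truncation of the tail; detecting that would need the latest MAC to be anchored somewhere the attacker cannot rewrite, which is outside this sketch.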

Operational notes

  • Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
  • Make emergency controls quick: feature flags, circuit breakers, safe defaults.
  • Instrument cost: which defenses become expensive and when.
  • Protect the edge and the evidence: rate limits + SIEM + log integrity.
  • Document and rehearse degraded-mode policy with on-call rotations.

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.

What to monitor

  • Error budget burn + tail latency under load.
  • Invariant violation rate (should be ~0).
  • Authz failures and policy denials (unexpected spikes).
  • Retry/timeout rates by endpoint and client cohort.
  • Rollback events and the conditions that triggered them.

Rollback plan

  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Keep dual-write / dual-verify windows where appropriate.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
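An explicit rollback trigger can be as simple as a list of metric thresholds evaluated against canary telemetry. The following is a sketch under stated assumptions: the metric names, the limits, and the `breached` helper are hypothetical, and a missing metric is deliberately treated as a breach so that lost telemetry does not look like good health.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical metric names and limits, for illustration only.
ROLLBACK_TRIGGERS: List[Threshold] = [
    Threshold("error_rate", 0.02),
    Threshold("p99_latency_ms", 800.0),
    Threshold("invariant_violations", 0.0),
]


def breached(canary: Dict[str, float]) -> List[str]:
    """Return the names of the triggers that fired for this canary sample.

    A metric that is absent from the sample counts as a breach.
    """
    fired = []
    for trigger in ROLLBACK_TRIGGERS:
        value = canary.get(trigger.metric)
        if value is None or value > trigger.max_value:
            fired.append(trigger.metric)
    return fired
```

A staged rollout could stop and roll back whenever `breached(sample)` is non-empty, and record the returned names alongside the preserved evidence.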

Evidence

  • Learn TLA+ (1) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
  • Let's Encrypt Incident Reports (2) — Operational failures and recovery in real-world PKI.
    • Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.

Open questions

  • What is your ‘safe mode’ when dependencies fail?
  • Which operation, if abused, causes irreversible damage?
  • How do you keep control-plane access during widespread incidents?
  • Where do you pay cost asymmetry today—and can you flip it?

Checklist

  • Rollback plan rehearsed and automated.
  • Assumptions listed and reviewed.
  • Failure modes enumerated with mitigations.
  • Safety properties stated as invariants.
  • Telemetry captures correctness signals.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

Further reading

1. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/
2. Let’s Encrypt. Let’s Encrypt Incident Reports [Internet]. Available from: https://community.letsencrypt.org/c/incidents/16/l/top