Monthly research note. Theme: Adversarial Infrastructure & Global Systems.

TL;DR

Treat ZKP systems engineering (provers, verifiers, and operational cost) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Engineer cost asymmetry: defense must be cheaper than attack per unit of damage prevented.
  • Evidence pipelines (audit/config history) are part of incident response correctness.
  • Degraded modes are security decisions; write them down and test them.
  • Prefer protocols and APIs that make invalid states hard to express.
  • Write assumptions down; treat them as interfaces.

Why this matters

  • Attackers exploit cost asymmetry: make abuse cheap and defense expensive.
  • Privacy failures often come from metadata, not plaintext.
  • Logs are only useful if they remain trustworthy under compromise.
  • Incident response is a protocol: practice it, automate it, validate it.
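
One way to make the point about trustworthy logs concrete is a hash-chained, append-only log. The sketch below is illustrative only: `GENESIS`, `append`, and `verify` are hypothetical names, and it assumes the latest hash is also copied somewhere the attacker cannot write, since a chain alone does not reveal truncation of its tail.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed fixed starting value for the chain


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list, record: dict) -> None:
    """Append a record, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "hash": _digest(prev_hash, record)})


def verify(chain: list) -> bool:
    """Recompute every link; an edited, reordered, or mid-chain-deleted entry fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Editing any earlier record changes its hash and invalidates every later link, so tampering is detectable as long as the verifier holds a trusted copy of the most recent hash.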

Key questions

  • What is the minimum viable recovery path after a catastrophic event?
  • How do you prevent dependency failures from becoming integrity failures?
  • Which logs are trustworthy under compromise (append-only, signed, isolated)?
  • What is your degraded-mode behavior (and is it safe)?
  • How do you detect attacks that look like “normal traffic spikes”?
  • Which controls fail first under load: auth, rate limits, storage, or observability?

Assumptions

  • Operators are human and will make mistakes under pressure.
  • Some dependencies will fail open or fail closed unexpectedly.
  • Observability pipelines can be attacked (cardinality explosions, log injection).
  • Traffic spikes can be malicious or accidental; you must handle both.

Non-goals

  • Treating degraded modes as “we’ll decide later.”
  • Assuming perfect attribution (you rarely know who is attacking in real time).

Attack surface

Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.
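
As a rough sketch of what protecting the pipeline could look like, the snippet below escapes control characters in logged fields (against log injection) and caps the number of distinct values a metric label may take (against cardinality explosions). `MAX_FIELD_LEN`, `sanitize_log_field`, and `LabelLimiter` are hypothetical names and limits chosen for illustration.

```python
import re

MAX_FIELD_LEN = 256  # assumed cap on a single logged field
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_log_field(value: str) -> str:
    """Escape control characters so attacker input cannot forge extra log lines."""
    escaped = _CONTROL_CHARS.sub(lambda m: "\\x%02x" % ord(m.group()), value)
    return escaped[:MAX_FIELD_LEN]


class LabelLimiter:
    """Cap distinct values per metric label; overflow collapses into one bucket."""

    def __init__(self, max_values: int, overflow: str = "__other__"):
        self.max_values = max_values
        self.overflow = overflow
        self._seen = set()

    def normalize(self, value: str) -> str:
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        # Budget exhausted: new values share one label instead of creating series.
        return self.overflow
```

The overflow bucket trades detail for bounded cost: an attacker who sprays unique label values inflates one counter instead of the series count.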

Model & invariants

Defense is about cost asymmetry. If the attacker spends 1 and you spend 100, you lose.

\[
\mathrm{Cost}_{\text{defense}} \ll \mathrm{Cost}_{\text{attack}} \quad \text{(per unit of damage prevented)}.
\]

Treat observability as a dependency: protect it from overload and manipulation.

Engineer friction where attackers pay but legitimate users don’t (asymmetric controls).
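
A per-client token bucket is one common example of an asymmetric control: clients that stay under the refill rate never notice it, while a flood is rejected at the cost of a few arithmetic operations. This is a minimal sketch; the class name, the parameters, and the caller-supplied clock are assumptions made for illustration and testability.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for elapsed time; ignore clock regressions instead of refunding.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing a higher `cost` for expensive operations lets the same bucket charge more for requests that cost the defender more to serve.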

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).
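
To illustrate, here is a small sketch of an invariant check that uses only state, logs, and counters. The ledger-style `Evidence` record and the three invariants are hypothetical; the point is that each check compares two independent pieces of evidence.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    balance: int          # current state
    log: list             # positive or negative deltas, in the order applied
    applied_counter: int  # incremented once per applied operation


def check_invariants(ev: Evidence) -> list:
    """Return the violated invariants; an empty list means the evidence agrees."""
    violations = []
    if ev.balance != sum(ev.log):
        violations.append("balance != sum(log)")
    if ev.applied_counter != len(ev.log):
        violations.append("counter != len(log)")
    if ev.balance < 0:
        violations.append("balance < 0")
    return violations
```

Returning the list of violations, instead of a single boolean, gives monitoring a reason to attach to each invariant failure.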

Security properties

  • Replay resistance: duplicated inputs do not change outcomes.
  • Integrity: invalid transitions are rejected (and detectable).
  • Least authority: privileges are scoped by purpose and time.
  • Authenticity: actions are bound to identity and purpose.
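
The authenticity and least-authority properties can be sketched with an HMAC tag that binds an action to an identity, a purpose, and an expiry time. The function names and the message layout below are assumptions for illustration, not a prescribed token format; key distribution and rotation are out of scope.

```python
import hashlib
import hmac
import json


def sign_action(key: bytes, identity: str, purpose: str, expires_at: int) -> str:
    """Tag an action; JSON encoding keeps field boundaries unambiguous."""
    msg = json.dumps([identity, purpose, expires_at]).encode("utf-8")
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def verify_action(key: bytes, identity: str, purpose: str,
                  expires_at: int, tag: str, now: int) -> bool:
    """Accept only an unexpired tag issued for this identity and purpose."""
    if now >= expires_at:
        return False
    expected = sign_action(key, identity, purpose, expires_at)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8"))
```

A tag issued for one purpose fails verification for any other, and extending the expiry changes the signed message, so the holder cannot widen either scope.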

Failure modes

  • Config drift that weakens security posture over time.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Mixed-version behavior that violates assumptions silently.

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart LR
  attack["Attack"] --> detect["Detect"]
  detect --> contain["Contain"]
  contain --> recover["Recover"]
  recover --> learn["Learn/Regress"]
  learn --> detect

Implementation notes

Prefer containment over heroics: isolate blast radius, keep core correct.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.
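
One way to make a timeout outcome explainable is an idempotency key: the server records the result of the first apply, so a retry after an ambiguous timeout returns that result instead of applying twice. The `Ledger` class and `credit_with_retry` helper are hypothetical, and `send` stands in for whatever transport is in use.

```python
class Ledger:
    """Toy server state that deduplicates operations by idempotency key."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # idempotency key -> result of the first apply

    def credit(self, key: str, amount: int) -> int:
        if key in self._results:
            # Replay or retry: return the recorded outcome, change nothing.
            return self._results[key]
        self.balance += amount
        self._results[key] = self.balance
        return self.balance


def credit_with_retry(ledger, key, amount, send, attempts=3):
    """Retry with the SAME key, so an applied-but-unacknowledged call is safe."""
    for _ in range(attempts):
        try:
            return send(ledger, key, amount)
        except TimeoutError:
            continue
    raise TimeoutError("outcome still unknown after retries")
```

If the first attempt applies the credit but the response is lost, the second attempt with the same key observes the stored result and the balance changes exactly once.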

Degraded-mode table (example):
Operation | Normal | Under attack   | Rationale
Auth      | full   | strict         | prevent abuse
Reads     | full   | cached/limited | protect core
Writes    | full   | queued/limited | preserve integrity
Admin     | full   | JIT + MFA      | reduce blast radius
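
The table can be written down as data so that it is testable. In this sketch the `Mode` enum and the `behavior` lookup are hypothetical, and the `"deny"` default for unlisted operations is an added assumption (fail closed), not something the table itself specifies.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    UNDER_ATTACK = "under_attack"


# Mirrors the example table: (operation, mode) -> behavior.
POLICY = {
    ("auth", Mode.NORMAL): "full",
    ("auth", Mode.UNDER_ATTACK): "strict",
    ("reads", Mode.NORMAL): "full",
    ("reads", Mode.UNDER_ATTACK): "cached/limited",
    ("writes", Mode.NORMAL): "full",
    ("writes", Mode.UNDER_ATTACK): "queued/limited",
    ("admin", Mode.NORMAL): "full",
    ("admin", Mode.UNDER_ATTACK): "JIT + MFA",
}


def behavior(operation: str, mode: Mode) -> str:
    """Look up the policy; unlisted operations fail closed (assumed default)."""
    return POLICY.get((operation, mode), "deny")
```

With the policy in one table, a unit test can assert that every operation has an entry for every mode, which turns "we'll decide later" into a failing build.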

Verification strategy

  • Incident replay: reconstruct timeline from evidence pipelines.
  • Observability stress: cardinality explosions and sampling under attack.
  • Dependency chaos: DNS issues, cert failures, upstream outages.
  • Policy tests: fail closed/open behaviors are unit-tested.
  • Game days: simulate DDoS, dependency failure, and credential abuse.

Operational notes

  • Protect the edge and the evidence: rate limits + SIEM + log integrity.
  • Document and rehearse degraded-mode policy with on-call rotations.
  • Instrument cost: which defenses become expensive and when.
  • Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.
  • Make emergency controls quick: feature flags, circuit breakers, safe defaults.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

  • Rollback events and the conditions that triggered them.
  • Admission-control / rate-limit rejections (by reason).
  • Invariant violation rate (should be ~0).
  • Retry/timeout rates by endpoint and client cohort.
  • Error budget burn + tail latency under load.

Rollback plan

  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Keep dual-write / dual-verify windows where appropriate.
  • Use canaries and staged rollout; stop early when signals degrade.
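
An explicit rollback trigger can be as small as a metric name, a threshold, and a required number of consecutive breaching samples. The `Trigger` fields, the sample values, and the consecutive-samples rule below are illustrative assumptions; real triggers would be tuned per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float
    consecutive: int  # breaching samples required in a row


def should_rollback(trigger: Trigger, samples: list) -> bool:
    """True when the most recent `consecutive` samples all exceed the threshold."""
    if trigger.consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < trigger.consecutive:
        return False
    recent = samples[-trigger.consecutive:]
    return all(value > trigger.threshold for value in recent)
```

Requiring consecutive breaches keeps a single noisy sample from triggering a rollback, at the price of reacting a few samples later.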

Evidence

  • Let's Encrypt Incident Reports (1) — Operational failures and recovery in real-world PKI.
    • Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.
  • Learn TLA+ (2) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

  • How do you keep control-plane access during widespread incidents?
  • What is your ‘safe mode’ when dependencies fail?
  • Which operation, if abused, causes irreversible damage?
  • Where do you pay cost asymmetry today—and can you flip it?

Checklist

  • Assumptions listed and reviewed.
  • Failure modes enumerated with mitigations.
  • Safety properties stated as invariants.
  • Telemetry captures correctness signals.
  • Rollback plan rehearsed and automated.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

Further reading

1. Let’s Encrypt. Let’s Encrypt Incident Reports [Internet]. Available from: https://community.letsencrypt.org/c/incidents/16/l/top
2. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/