Monthly research note. Theme: Quantum-Resilient Systems Engineering.

TL;DR

Treat the research frontiers (composability, proofs, and future primitives) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Downgrade resistance must be explicit and tested under active attackers.
  • Inventory long-lived secrets first; you can’t migrate what you can’t locate.
  • Measure cost shifts (CPU/bandwidth) and adapt DoS defenses accordingly.
  • Make failure modes explicit and observable.
  • Treat retries, reordering, and partial failure as default conditions.

Why this matters

  • Migration risk is operational: inventory, rollout, rollback, and monitoring.
  • Long-lived devices and PKI lifecycles are the hard constraint.
  • Hybrid protocols fail if binding is unclear or downgrade is possible.
  • Cost changes drive new DoS surfaces; defenses must evolve.

Key questions

  • What secrets must remain confidential for 10–30 years (and where are they today)?
  • How do you manage mixed deployments across regions and vendors?
  • What does rotation look like at fleet scale (devices, certs, tunnels, identities)?
  • Which protocols need hybrid now, and which can wait without regret?
  • How do you define success metrics for PQ readiness beyond “enabled”?
  • How do you stop downgrade under active adversaries?

Assumptions

  • Operational teams need safe playbooks; crypto changes are not one-off.
  • Key and certificate lifecycles outlive application versions.
  • Rollouts happen under partial adoption; compatibility matters.
  • Adversaries record traffic today (HNDL) and attack later.

Non-goals

  • Assuming performance impacts will be negligible.
  • Treating PQ migration as a single deployment event.

Attack surface

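Observability pipelines are themselves an attack surface (cardinality explosions, log injection). Protect them: cap label cardinality and sanitize attacker-controlled fields before they reach logs or metrics.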

Model & invariants

Hybrid composition should be explicit and transcript-bound:

$$\mathrm{ss} = \mathrm{HKDF}\bigl(\mathrm{ss}_{\text{classical}} \,\Vert\, \mathrm{ss}_{\text{pqc}},\ \text{info} = \mathrm{transcript}\bigr).$$
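The composition above can be sketched with the Python standard library alone. This is a minimal illustration, not a vetted implementation: the function names `hkdf_sha256` and `combine_hybrid` are hypothetical, SHA-256 is an assumed hash choice, and plain concatenation assumes both shared secrets have fixed, known lengths.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested output too long")
    # Extract: an empty salt is replaced by hash_len zero bytes, per RFC 5869.
    prk = hmac.new(salt or b"\x00" * hash_len, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output has been produced.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def combine_hybrid(ss_classical: bytes, ss_pqc: bytes, transcript: bytes) -> bytes:
    """Derive one session secret from both shared secrets, bound to the transcript."""
    # Refuse to derive from a single component: a missing secret is a downgrade.
    if not ss_classical or not ss_pqc:
        raise ValueError("both shared secrets are required")
    # ss_classical || ss_pqc as input keying material, transcript as info.
    return hkdf_sha256(ss_classical + ss_pqc, info=transcript)
```

Because the transcript is fed into `info`, two handshakes that negotiated different parameters derive different secrets even when the underlying shared secrets match, which is the binding property the formula is meant to capture.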

Inventory first. You can’t migrate what you can’t locate.

Treat ops as part of the protocol: monitoring, rollback, and incident response.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
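One way to make an "impossible state" observable is to check the invariant on every session and count violations under a stable name. The sketch below is illustrative only: the `Session` fields, the invariant name, and the in-process `Counter` standing in for a metrics client are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    peer_supports_hybrid: bool  # hypothetical field: peer advertised hybrid support
    negotiated_hybrid: bool     # hypothetical field: handshake ended in hybrid mode


# Stand-in for a real metrics client; an alert would fire on any nonzero count.
invariant_violations = Counter()


def check_session(session: Session) -> bool:
    """Return True if the invariant holds; otherwise count the violation."""
    # Invariant: a hybrid-capable peer never ends up in a classical-only session.
    holds = session.negotiated_hybrid or not session.peer_supports_hybrid
    if not holds:
        invariant_violations["hybrid_capable_peer_negotiated_classical"] += 1
    return holds
```

The counter should stay at zero in normal operation, so even a single increment is a useful alerting signal.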

Security properties

  • Integrity: invalid transitions are rejected (and detectable).
  • Authenticity: actions are bound to identity and purpose.
  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Replay resistance: duplicated inputs do not change outcomes.
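Replay resistance in the sense above (duplicated inputs do not change outcomes) can be sketched with a request identifier that is remembered alongside its result. The class name, the `request_id` parameter, and the in-memory store are hypothetical; a real system would also need to bound and persist the store.

```python
class IdempotentLedger:
    """Applies each request at most once, keyed by a caller-supplied request ID."""

    def __init__(self) -> None:
        self.balance = 0
        self._results = {}  # request_id -> balance returned the first time

    def apply(self, request_id: str, amount: int) -> int:
        # A duplicate (retry or replay) returns the original result unchanged.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance
```

A client that times out and retries with the same `request_id` sees the same answer, and the state changes only once.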

Failure modes

  • Mixed-version behavior that violates assumptions silently.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Observability gaps during incidents (missing evidence).

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart LR
  threat["Threat Model (quantum + classical)"] --> design["Protocol Design"]
  design --> impl["Implementation (no_std where needed)"]
  impl --> verify["Verification (tests + formal)"]
  verify --> ops["Operationalization (rotation + monitoring)"]
  ops --> threat

Implementation notes

PQ readiness is a systems program: crypto, networking, ops, and UX must compose.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
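A minimal sketch of this rule, assuming a handshake endpoint: cheap checks (size, algorithm label) run first, and expensive work is admitted only if a cost budget allows it. The size cap, the algorithm label, and the `CostBudget` type are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

MAX_HANDSHAKE_BYTES = 8 * 1024                   # hypothetical size cap
SUPPORTED_KEMS = frozenset({"x25519+mlkem768"})  # hypothetical algorithm label


@dataclass
class CostBudget:
    remaining: int  # abstract cost units left in the current window

    def try_spend(self, units: int) -> bool:
        if units > self.remaining:
            return False
        self.remaining -= units
        return True


def admit(declared_len: int, kem: str, budget: CostBudget, cost_units: int = 1) -> str:
    """Return "ok" or a rejection reason. Cheapest checks run first."""
    if declared_len < 0 or declared_len > MAX_HANDSHAKE_BYTES:
        return "too_large"
    if kem not in SUPPORTED_KEMS:
        return "unsupported_kem"
    # Budget is spent only after the request has passed the cheap checks.
    if not budget.try_spend(cost_units):
        return "over_budget"
    return "ok"
```

Returning a reason string rather than a bare boolean lets rejections be counted by cause.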

// PQ migration note: "enabled" is not "safe" unless binding and downgrade resistance are explicit.

Verification strategy

  • Rotation drills: certificates, tunnels, device identities.
  • Downgrade simulations with active attackers.
  • Interop tests across stacks and versions.
  • Performance profiling under load to quantify DoS risk.
  • Side-channel audits for constrained implementations.

Operational notes

  • Maintain an inventory of long-lived secrets and their lifetimes.
  • Define compatibility windows and communicate them to stakeholders.
  • Roll out hybrid with canaries and explicit rollback triggers.
  • Practice emergency deprecation (turn off broken algorithms quickly).
  • Add telemetry for algorithm negotiation and failure modes.

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.
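The policy choice can be made explicit in code rather than left to whatever an error path happens to do. In this sketch the enum, the exception, and the mode labels are hypothetical names used for illustration.

```python
from enum import Enum


class DegradedPolicy(Enum):
    FAIL_CLOSED = "fail_closed"  # refuse service when the PQ path is unavailable
    FAIL_OPEN = "fail_open"      # continue in a weaker mode that is clearly labeled


class PqUnavailable(RuntimeError):
    """Raised when the PQ path is down and the policy is fail closed."""


def select_mode(pq_available: bool, policy: DegradedPolicy) -> str:
    if pq_available:
        return "hybrid"
    if policy is DegradedPolicy.FAIL_OPEN:
        # Degraded on purpose: the caller should log and alert on this label.
        return "classical_only"
    raise PqUnavailable("PQ path unavailable and policy is fail closed")
```

Making the policy a required argument means every call site has to state which behavior it wants.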

What to monitor

  • Rollback events and the conditions that triggered them.
  • Invariant violation rate (should be ~0).
  • Admission-control / rate-limit rejections (by reason).
  • Error budget burn + tail latency under load.
  • Retry/timeout rates by endpoint and client cohort.

Rollback plan

  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Keep dual-write / dual-verify windows where appropriate.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Define an explicit rollback trigger (metrics + thresholds).
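An explicit rollback trigger can be written down as data (metric plus threshold) and evaluated mechanically. The metric names and threshold values below are hypothetical placeholders; treating a missing metric as a breach is a design assumption of this sketch, not a rule from the note.

```python
from dataclasses import dataclass
from typing import List, Mapping, Sequence


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float  # roll back when the observed value exceeds this


# Hypothetical triggers for a hybrid rollout canary.
TRIGGERS = (
    Trigger("handshake_failure_rate", 0.02),
    Trigger("p99_latency_ms", 250.0),
    Trigger("invariant_violations", 0.0),
)


def rollback_reasons(
    observed: Mapping[str, float], triggers: Sequence[Trigger] = TRIGGERS
) -> List[str]:
    """Return the list of breached triggers; an empty list means keep rolling out."""
    reasons = []
    for trigger in triggers:
        value = observed.get(trigger.metric)
        if value is None:
            # No signal is treated as a breach: we cannot prove the canary is healthy.
            reasons.append(f"{trigger.metric}: missing")
        elif value > trigger.threshold:
            reasons.append(f"{trigger.metric}: {value} > {trigger.threshold}")
    return reasons
```

The returned reasons double as the record of which conditions triggered a rollback.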

Evidence

  • Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
    • Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
  • Jepsen (2) — Fault injection and correctness testing for distributed systems.
    • Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

  • Which protocol surfaces are most exposed to HNDL risk in your environment?
  • How do you prevent configuration drift from re-enabling weak modes?
  • What is your minimal ‘safe mode’ when PQ paths fail?
  • What is your plan for third-party dependencies that can’t migrate quickly?

Checklist

  • Safety properties stated as invariants.
  • Telemetry captures correctness signals.
  • Assumptions listed and reviewed.
  • Failure modes enumerated with mitigations.
  • Rollback plan rehearsed and automated.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

Further reading

1. Beyer B, Jones C, Petoff J, Murphy NR. Site Reliability Engineering: How Google Runs Production Systems [Internet]. O’Reilly Media; 2016. Available from: https://sre.google/sre-book/table-of-contents/
2. Jepsen. Jepsen: Distributed Systems Safety Analysis [Internet]. Available from: https://jepsen.io/