Monthly research note. Theme: Post-Quantum Cryptography & Migration.

TL;DR

A focused memo on signatures in practice (Dilithium/Falcon) and their deployment constraints: define the model, state the properties, then design the system so those properties hold under failure and under attack.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

  • Interop is the migration plan—test matrices are more important than whitepapers.
  • PQC changes handshake costs; plan DoS defenses and budgets.
  • Hybrid composition must be explicit and transcript-bound to resist downgrade.
  • Treat retries, reordering, and partial failure as default conditions.
  • Define safety properties before performance goals.

Why this matters

  • Interop is the real risk: multiple stacks, vendors, and versions.
  • PQC changes bandwidth and CPU costs; DoS surfaces move.
  • Migration will be mixed-version for years; plan for it explicitly.
  • Operationalization (monitoring, rollback) determines success more than crypto choice.

Key questions

  • Which parts must be constant-time, and how will you validate that?
  • What are the new DoS surfaces (bigger keys, more CPU, more bandwidth)?
  • How do you rotate algorithms safely (crypto agility without chaos)?
  • What does interoperability testing look like across vendors and stacks?
  • Which secrets require long-term confidentiality (harvest now, decrypt later, or HNDL, exposure) and where are they today?
  • What telemetry proves PQC is working (not just enabled)?

Assumptions

  • Side channels exist: timing and cache behavior leak information.
  • Bandwidth is limited in some environments; larger handshakes matter.
  • Deployments are mixed; old clients must interoperate or fail safely.
  • Vendors vary: implementations and defaults differ.

Non-goals

  • Relying on silent fallback to weaker modes during interop failures.
  • Assuming PQC is “drop-in” without changing operational processes.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.
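
A minimal sketch of negotiation that fails closed, assuming a server-side preference list and a policy set of acceptable algorithms. The algorithm labels, `SERVER_PREFERENCE`, `ALLOWED`, and `NegotiationError` are hypothetical names chosen for illustration, not any particular stack's API.

```python
from typing import FrozenSet, Sequence

# Hypothetical preference order, strongest first; the labels are illustrative.
SERVER_PREFERENCE = ("x25519+mlkem768", "x25519")
# Hypothetical policy: the only algorithms this deployment accepts.
ALLOWED = frozenset({"x25519+mlkem768"})


class NegotiationError(Exception):
    """No acceptable algorithm is shared; the handshake must abort."""


def negotiate(client_offers: Sequence[str],
              preference: Sequence[str] = SERVER_PREFERENCE,
              allowed: FrozenSet[str] = ALLOWED) -> str:
    """Return the first server-preferred algorithm that is offered and allowed."""
    offered = set(client_offers)
    for alg in preference:
        if alg in offered and alg in allowed:
            return alg
    # Fail closed: never fall back to something outside the policy set.
    raise NegotiationError(f"no acceptable algorithm in offer: {sorted(offered)}")
```

With this policy, a client that offers only `x25519` gets an explicit error that can be counted and alerted on, rather than a silently weaker session.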

Model & invariants

A KEM gives you shared secrets without discrete-log assumptions:

$$(\mathrm{pk},\mathrm{sk})\leftarrow \mathrm{KeyGen}();\quad (\mathrm{ct},\mathrm{ss})\leftarrow \mathrm{Enc}(\mathrm{pk});\quad \mathrm{ss}\leftarrow \mathrm{Dec}(\mathrm{sk},\mathrm{ct}).$$

Treat algorithm negotiation as adversarial: build in explicit downgrade resistance.

Binding is the whole game: make the transcript an input to the KDF.
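
One way to picture the binding is a transcript hash that covers what was offered as well as what was chosen. The sketch below assumes SHA-256 and a simple length-prefixed encoding; the field layout, function names, and algorithm labels are illustrative assumptions, not a wire format from any standard.

```python
import hashlib
from typing import Sequence


def _lp(data: bytes) -> bytes:
    """Length-prefix a field so field boundaries are unambiguous."""
    return len(data).to_bytes(4, "big") + data


def transcript_hash(client_offers: Sequence[str],
                    server_choice: str,
                    client_share: bytes,
                    server_share: bytes) -> bytes:
    """Hash everything one side believes was negotiated and exchanged."""
    h = hashlib.sha256()
    h.update(len(client_offers).to_bytes(2, "big"))
    for alg in client_offers:
        h.update(_lp(alg.encode("utf-8")))
    h.update(_lp(server_choice.encode("utf-8")))
    h.update(_lp(client_share))
    h.update(_lp(server_share))
    return h.digest()


# The client offered a hybrid; an attacker stripped it in transit.
shares = (b"\x01" * 32, b"\x02" * 32)
client_view = transcript_hash(["x25519+mlkem768", "x25519"], "x25519", *shares)
server_view = transcript_hash(["x25519"], "x25519", *shares)
print(client_view != server_view)  # the two views disagree
```

Because the two sides hash different offer lists, any key derived with the transcript hash as input will differ, and the handshake fails instead of completing at the weaker level.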

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Least authority: privileges are scoped by purpose and time.
  • Integrity: invalid transitions are rejected (and detectable).
  • Authenticity: actions are bound to identity and purpose.

Failure modes

  • Recovery paths that only work when nothing is broken.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Config drift that weakens security posture over time.
  • Mixed-version behavior that violates assumptions silently.

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart TD
  negotiate["Negotiate Algorithms"] --> bind["Bind Transcript"]
  bind --> kdf["KDF (hybrid)"]
  kdf --> keys["Traffic Keys"]
  keys --> monitor["Monitor + Rollback"]

Implementation notes

Interop tests are the migration plan; everything else is a hypothesis.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.

// Hybrid binding sketch (pseudocode):
// ss = HKDF(ss_classical || ss_pqc, info=transcript_hash)
// Then derive traffic keys from ss.
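
The pseudocode above can be made concrete with the standard library alone. This is a sketch under stated assumptions: HKDF with SHA-256 written out per RFC 5869, an empty salt, and two shared secrets of fixed, known length so that plain concatenation is unambiguous. The function names `hkdf_sha256` and `hybrid_secret` are illustrative.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested output is too long for HKDF")
    # Extract: an empty salt defaults to hash_len zero bytes.
    prk = hmac.new(salt or b"\x00" * hash_len, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output is available.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_secret(ss_classical: bytes, ss_pqc: bytes, transcript_hash: bytes) -> bytes:
    """Combine both shared secrets and bind the result to the transcript."""
    # Assumes fixed-length inputs; variable-length secrets need length prefixes.
    return hkdf_sha256(ss_classical + ss_pqc, info=transcript_hash)
```

Changing any of the three inputs changes the output, so a tampered transcript or a substituted share on either side yields mismatched traffic keys.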

Verification strategy

  • Interop matrices across vendors/versions and failure modes.
  • Side-channel tests where tooling exists; constant-time audits.
  • Chaos deploys: mixed versions + rollback during partial outages.
  • Downgrade tests: active attacker manipulates negotiation.
  • DoS tests: measure CPU/bandwidth amplification and mitigation impact.
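
The interop matrix in the first bullet can be generated rather than hand-maintained. A small sketch, assuming each test case is one client, one server, and one injected fault; the stack names and fault labels are hypothetical placeholders.

```python
from itertools import product
from typing import Dict, List, Sequence


def interop_matrix(clients: Sequence[str],
                   servers: Sequence[str],
                   faults: Sequence[str]) -> List[Dict[str, str]]:
    """Enumerate every client/server/fault combination as one test case."""
    return [
        {"client": c, "server": s, "fault": f}
        for c, s, f in product(clients, servers, faults)
    ]


# Hypothetical stack names and fault labels, for illustration only.
CLIENTS = ["stack-a/1.0", "stack-a/2.0", "stack-b/3.1"]
SERVERS = ["stack-a/2.0", "stack-c/0.9"]
FAULTS = ["none", "fragmented_hello", "retry_after_timeout", "rollback_mid_handshake"]

cases = interop_matrix(CLIENTS, SERVERS, FAULTS)
print(len(cases))  # 3 clients * 2 servers * 4 faults = 24 cases
```

Generating the matrix makes gaps visible: adding a vendor or a fault mode grows the case list automatically, and any cell without a recorded result is an untested state.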

Operational notes

  • Inventory long-lived secrets and migrate the highest-risk first.
  • Document supported algorithm sets and deprecation timelines.
  • Add telemetry for negotiation outcomes, failures, and client cohorts.
  • Cap handshake cost per peer/IP; use stateless cookies when needed.
  • Roll out with canaries and explicit rollback triggers.

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.
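
For the stateless-cookie bullet in the operational notes, one common shape is an HMAC over the peer address and a coarse time window, so the server does no expensive PQC work until the peer proves it can receive traffic at its claimed address. The sketch below is an assumption-laden illustration: `COOKIE_KEY`, `WINDOW_SECONDS`, and the function names are hypothetical, and key rotation is left out.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

COOKIE_KEY = secrets.token_bytes(32)  # hypothetical server secret; rotate in practice
WINDOW_SECONDS = 30                   # hypothetical cookie validity window


def _mac(peer_addr: str, epoch: int, key: bytes) -> bytes:
    msg = epoch.to_bytes(8, "big") + peer_addr.encode("utf-8")
    return hmac.new(key, msg, hashlib.sha256).digest()


def _epoch(now: Optional[float]) -> int:
    return int(time.time() if now is None else now) // WINDOW_SECONDS


def issue_cookie(peer_addr: str, now: Optional[float] = None,
                 key: bytes = COOKIE_KEY) -> bytes:
    """Create a cookie without storing any per-peer state."""
    return _mac(peer_addr, _epoch(now), key)


def verify_cookie(peer_addr: str, cookie: bytes, now: Optional[float] = None,
                  key: bytes = COOKIE_KEY) -> bool:
    """Accept cookies from the current or the previous time window."""
    epoch = _epoch(now)
    candidates = [e for e in (epoch, epoch - 1) if e >= 0]
    return any(hmac.compare_digest(cookie, _mac(peer_addr, e, key))
               for e in candidates)
```

A server under load would answer the first message with `issue_cookie(...)` and only begin key generation or decapsulation once `verify_cookie(...)` succeeds on the retry.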

What to monitor

  • Invariant violation rate (should be ~0).
  • Error budget burn + tail latency under load.
  • Admission-control / rate-limit rejections (by reason).
  • Authz failures and policy denials (unexpected spikes).
  • Retry/timeout rates by endpoint and client cohort.

Rollback plan

  • Define an explicit rollback trigger (metrics + thresholds).
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Keep dual-write / dual-verify windows where appropriate.
  • Use canaries and staged rollout; stop early when signals degrade.
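
An explicit rollback trigger can be as small as a function from current metrics to a list of tripped thresholds. This is a sketch with made-up metric names and limits (`handshake_failure_rate`, `p99_handshake_ms`, `invariant_violations`); real values would come from the service's own error budget.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class RollbackThresholds:
    # Hypothetical limits, for illustration only.
    max_handshake_failure_rate: float = 0.01
    max_p99_handshake_ms: float = 250.0
    max_invariant_violations: int = 0


def rollback_reasons(metrics: Mapping[str, float],
                     limits: RollbackThresholds = RollbackThresholds()) -> List[str]:
    """Return the tripped triggers; an empty list means the rollout may continue."""
    checks = [
        ("handshake_failure_rate", limits.max_handshake_failure_rate),
        ("p99_handshake_ms", limits.max_p99_handshake_ms),
        ("invariant_violations", limits.max_invariant_violations),
    ]
    reasons = []
    for name, limit in checks:
        value = metrics.get(name)
        if value is None:
            # A missing signal counts as a trigger: no evidence, no rollout.
            reasons.append(f"{name}: missing")
        elif value > limit:
            reasons.append(f"{name}: {value} > {limit}")
    return reasons
```

Returning reasons rather than a bare boolean keeps the decision auditable: the same strings can go into the deploy log that the evidence-preservation bullet asks for.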

Evidence

  • NIST Post-Quantum Cryptography Project (1) — Standardization process and algorithm selections.
    • Evidence: Treat PQ migration as a program (inventory, interop, rollback). Use NIST status to drive prioritization and timelines.
  • Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

  • Which clients will fail first, and what is the safe fallback behavior?
  • What is the worst-case handshake cost under attack?
  • Where would a downgrade be visible today, and how would you detect it?
  • How do you rotate algorithms without introducing configuration chaos?

Checklist

  • Safety properties stated as invariants.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Rollback plan rehearsed and automated.
  • Telemetry captures correctness signals.
  • Failure modes enumerated with mitigations.
  • Assumptions listed and reviewed.

Further reading

1. National Institute of Standards and Technology (NIST). Post-Quantum Cryptography [Internet]. Available from: https://csrc.nist.gov/projects/post-quantum-cryptography
2. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/