Monthly research note. Theme: Post-Quantum Cryptography & Migration.
TL;DR
A focused memo on PQC threat models, specifically 'Harvest Now, Decrypt Later' (HNDL) in real systems. The approach: define the model, state the properties, then design the system so those properties remain true under failure and attack.
Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.
Key takeaways
- Migration is mixed-version for years: compatibility and rollback are security features.
- Hybrid composition must be explicit and transcript-bound to resist downgrade.
- PQC changes handshake costs; plan DoS defenses and budgets.
- Prefer protocols and APIs that make invalid states hard to express.
- Define safety properties before performance goals.
Why this matters
- PQC changes bandwidth and CPU costs; DoS surfaces move.
- Hybrid designs fail if binding is ambiguous (mix-and-match, downgrade).
- Interop is the real risk: multiple stacks, vendors, and versions.
- Operationalization (monitoring, rollback) determines success more than crypto choice.
Key questions
- Which parts must be constant-time, and how will you validate that?
- How do you bind hybrid secrets to prevent downgrade and mix-and-match attacks?
- What does interoperability testing look like across vendors and stacks?
- What telemetry proves PQC is working (not just enabled)?
- How do you handle failures: decryption failures, invalid ciphertexts, malformed keys?
- Which secrets require long-term confidentiality (HNDL) and where are they today?
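For the last question above, one way to rank an inventory is the heuristic often called Mosca's inequality: data is exposed to HNDL if its required confidentiality lifetime plus the time needed to migrate it exceeds the time until a cryptographically relevant quantum computer exists. The memo does not cite this heuristic itself. The sketch below assumes it, and every name and number in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretClass:
    name: str
    confidentiality_years: float  # how long the data must stay secret
    migration_years: float        # estimated time to move it to PQC


def hndl_margin(s: SecretClass, years_to_crqc: float) -> float:
    """Positive margin: traffic recorded today is still sensitive once it can be decrypted."""
    return s.confidentiality_years + s.migration_years - years_to_crqc


def prioritize(secrets, years_to_crqc: float):
    """Return only the exposed classes, most exposed first."""
    at_risk = [s for s in secrets if hndl_margin(s, years_to_crqc) > 0]
    return sorted(at_risk, key=lambda s: hndl_margin(s, years_to_crqc), reverse=True)


# Illustrative inventory; the figures are placeholders, not estimates.
inventory = [
    SecretClass("session-cookies", 0.1, 1.0),
    SecretClass("medical-records", 25.0, 3.0),
    SecretClass("backup-archives", 10.0, 4.0),
]
ranked = prioritize(inventory, years_to_crqc=12.0)
print([s.name for s in ranked])
```

The `years_to_crqc` input is a planning assumption, not a measurement, so it is worth rerunning the ranking under several values.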
Assumptions
- Vendors vary: implementations and defaults differ.
- Side channels exist: timing and cache behavior leak information.
- Bandwidth is limited in some environments; larger handshakes matter.
- Deployments are mixed; old clients must interoperate or fail safely.
Non-goals
- Treating migration as a single flag flip.
- Assuming PQC is “drop-in” without changing operational processes.
Any unbounded per-request work becomes a DoS primitive in an adversary's hands.
Model & invariants
A KEM (key generation, encapsulation, decapsulation) gives both peers a shared secret without relying on discrete-log assumptions.
Treat algorithm negotiation as adversarial: explicit downgrade resistance.
Make costs explicit: measure CPU and bandwidth, then add protections.
Monotonicity beats timestamps: counters and epochs survive clock skew.
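The point about monotonicity can be illustrated with a small replay guard. This is a minimal sketch, assuming each message carries an (epoch, counter) pair. The class name and the pair layout are hypothetical, not taken from any particular protocol.

```python
class EpochGuard:
    """Accepts a message only if its (epoch, counter) is strictly newer than the last accepted one."""

    def __init__(self):
        self._last = (-1, -1)

    def accept(self, epoch: int, counter: int) -> bool:
        candidate = (epoch, counter)
        # Tuples compare lexicographically: epoch first, then counter.
        if candidate <= self._last:
            return False  # replayed or stale; no wall clock involved
        self._last = candidate
        return True


guard = EpochGuard()
results = [
    guard.accept(1, 0),  # first message
    guard.accept(1, 1),  # next counter in the same epoch
    guard.accept(1, 1),  # replay
    guard.accept(0, 9),  # message from an older epoch
    guard.accept(2, 0),  # new epoch, counter restarts
]
print(results)
```

Because the decision depends only on stored state, it behaves the same whether or not the peers' clocks agree.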
Security properties
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Evidence: critical actions emit verifiable audit events.
- Least authority: privileges are scoped by purpose and time.
- Integrity: invalid transitions are rejected (and detectable).
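The downgrade-resistance property above can be sketched as a negotiation function with an explicit policy floor. The suite labels `classical` and `hybrid` and their strength ranks are hypothetical placeholders, and a real deployment would pair this check with transcript binding.

```python
# Hypothetical suite labels and strength ranks, for illustration only.
STRENGTH = {"classical": 1, "hybrid": 2}


class NegotiationError(Exception):
    """Raised when no acceptable suite exists; the caller should fail closed."""


def negotiate(client_offer, server_supported, minimum="hybrid"):
    """Pick the strongest mutually supported suite, refusing anything below the floor."""
    common = [s for s in client_offer if s in server_supported and s in STRENGTH]
    if not common:
        raise NegotiationError("no mutually supported suite")
    best = max(common, key=lambda s: STRENGTH[s])
    if STRENGTH[best] < STRENGTH[minimum]:
        raise NegotiationError(
            f"best common suite {best!r} is below policy floor {minimum!r}"
        )
    return best


print(negotiate(["classical", "hybrid"], {"classical", "hybrid"}))
```

With the floor set to `hybrid`, an attacker who strips the hybrid offer causes a visible failure rather than a silent fallback. In a mixed-version fleet the floor may have to stay at `classical` for some cohorts, which is where transcript binding has to carry the downgrade protection.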
Failure modes
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Timeout ambiguity causing double-apply or partial state transitions.
- Observability gaps during incidents (missing evidence).
- Recovery paths that only work when nothing is broken.
A recovery plan that isn’t exercised will fail when you need it.
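For the timeout-ambiguity failure mode listed above, one common mitigation is to make state transitions idempotent by request ID, so a retry after an unexplained timeout cannot apply twice. The sketch below is a minimal in-memory illustration; the class and its counter are hypothetical, and a real system would persist the result map atomically with the state change.

```python
class IdempotentApplier:
    """Applies each request ID at most once and replays the stored result on retry."""

    def __init__(self):
        self._results = {}  # request_id -> result of the first application
        self.balance = 0

    def apply(self, request_id: str, amount: int) -> int:
        if request_id in self._results:
            # Retry of a request that already took effect: return the same answer.
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance


applier = IdempotentApplier()
first = applier.apply("req-1", 10)
retry = applier.apply("req-1", 10)  # client timed out and retried
other = applier.apply("req-2", 5)
print(first, retry, other, applier.balance)
```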
Design sketch
sequenceDiagram
participant A as Initiator
participant B as Responder
A->>B: classical_keyshare + pqc_pk
B-->>A: classical_keyshare + pqc_ct + sig
A-->>B: sig
Note over A,B: ss = HKDF(ss_classical || ss_pqc, transcript)
Implementation notes
Explicit binding prevents downgrade and mix-and-match. Don’t leave it implicit.
If you can’t explain a timeout outcome, you can’t make retries safe.
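The `ss = HKDF(ss_classical || ss_pqc, transcript)` note in the design sketch can be made concrete as follows. This is a minimal sketch built on RFC 5869 HKDF with SHA-256 from the Python standard library. Several choices are assumptions the memo does not specify: hashing the transcript, placing the transcript hash in the `info` input, length-prefixing the two secrets, and the `hybrid-demo v1` label.

```python
import hashlib
import hmac

HASH = hashlib.sha256
HASH_LEN = HASH().digest_size


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """RFC 5869 extract step; an empty salt defaults to HASH_LEN zero bytes."""
    if not salt:
        salt = bytes(HASH_LEN)
    return hmac.new(salt, ikm, HASH).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 expand step."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested length too large for HKDF")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), HASH).digest()
        okm += block
        counter += 1
    return okm[:length]


def length_prefixed(value: bytes) -> bytes:
    """Assumption: prefix each secret with its length so the concatenation is unambiguous."""
    return len(value).to_bytes(4, "big") + value


def hybrid_secret(ss_classical: bytes, ss_pqc: bytes, transcript: bytes,
                  length: int = 32) -> bytes:
    """Combine both shared secrets and bind the result to the handshake transcript."""
    transcript_hash = HASH(transcript).digest()
    ikm = length_prefixed(ss_classical) + length_prefixed(ss_pqc)
    prk = hkdf_extract(salt=b"", ikm=ikm)
    info = b"hybrid-demo v1" + transcript_hash  # hypothetical domain-separation label
    return hkdf_expand(prk, info, length)


k_honest = hybrid_secret(b"\x01" * 32, b"\x02" * 32, b"offer=[classical,hybrid]")
k_tampered = hybrid_secret(b"\x01" * 32, b"\x02" * 32, b"offer=[classical]")
print(k_honest != k_tampered)
```

If an attacker alters the negotiation messages, the two peers hash different transcripts and derive different keys, so the handshake fails instead of completing in a weakened mode.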
Hybrid handshake checklist:
- Explicit negotiation (no silent downgrade)
- Transcript-bound KDF
- DoS protections (rate limits, cookies, puzzles)
- Constant-time operations
- Telemetry: which mode, which failures, which clients
Verification strategy
- Downgrade tests: active attacker manipulates negotiation.
- Interop matrices across vendors/versions and failure modes.
- Side-channel tests where tooling exists; constant-time audits.
- Chaos deploys: mixed versions + rollback during partial outages.
- DoS tests: measure CPU/bandwidth amplification and mitigation impact.
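The interop-matrix item above can be sketched as a generated list of test cases. The stack names, the fault labels, and the expected outcomes are all hypothetical; the point is only that the matrix is enumerated mechanically rather than written by hand.

```python
from itertools import product

# Hypothetical stacks: name -> whether the stack supports the hybrid mode.
STACKS = {"vendorA-1.0": False, "vendorA-2.0": True, "vendorB-3.1": True}
FAULTS = ("none", "stripped_pqc_offer")


def expected(client: str, server: str, fault: str) -> str:
    """Expected handshake outcome for one cell of the matrix."""
    both_hybrid = STACKS[client] and STACKS[server]
    if not both_hybrid:
        # At least one side is classical-only, so stripping the PQC offer changes nothing.
        return "classical"
    if fault == "stripped_pqc_offer":
        return "abort"  # transcript binding should expose the tampering
    return "hybrid"


cases = [
    (client, server, fault, expected(client, server, fault))
    for client, server, fault in product(STACKS, STACKS, FAULTS)
]
for case in cases:
    print(case)
```

Each tuple would then drive one real handshake in a test harness, with the last field as the assertion.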
Operational notes
- Roll out with canaries and explicit rollback triggers.
- Add telemetry for negotiation outcomes, failures, and client cohorts.
- Inventory long-lived secrets and migrate the highest-risk first.
- Cap handshake cost per peer/IP; use stateless cookies when needed.
- Document supported algorithm sets and deprecation timelines.
Make degraded modes explicit: fail closed vs fail open is a policy choice.
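The stateless cookies mentioned in the operational notes can be sketched with an HMAC over the peer address and an issue time, so the responder stores nothing until the initiator proves it can receive traffic at its claimed address. This is a minimal sketch: the cookie layout, the 30-second lifetime, and the function names are assumptions, and key rotation is omitted.

```python
import hashlib
import hmac
import os
import struct

SECRET = os.urandom(32)   # per-process cookie key; rotation is omitted here
COOKIE_LIFETIME = 30      # seconds; a hypothetical policy value
TAG_LEN = hashlib.sha256().digest_size


def _tag(peer_addr: str, issued: bytes, secret: bytes) -> bytes:
    return hmac.new(secret, issued + peer_addr.encode(), hashlib.sha256).digest()


def make_cookie(peer_addr: str, now: int, secret: bytes = SECRET) -> bytes:
    """Cookie = 8-byte issue time followed by an HMAC over (issue time, peer address)."""
    issued = struct.pack(">Q", now)
    return issued + _tag(peer_addr, issued, secret)


def verify_cookie(cookie: bytes, peer_addr: str, now: int,
                  secret: bytes = SECRET) -> bool:
    """Check length, then the tag in constant time, then freshness."""
    if len(cookie) != 8 + TAG_LEN:
        return False
    issued, tag = cookie[:8], cookie[8:]
    if not hmac.compare_digest(tag, _tag(peer_addr, issued, secret)):
        return False
    age = now - struct.unpack(">Q", issued)[0]
    return 0 <= age <= COOKIE_LIFETIME


cookie = make_cookie("203.0.113.7", now=1000)
print(verify_cookie(cookie, "203.0.113.7", now=1010))
```

The expensive KEM and signature operations would run only after `verify_cookie` succeeds, which keeps the per-packet cost for unverified peers at one HMAC.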
What to monitor
- Admission-control / rate-limit rejections (by reason).
- Retry/timeout rates by endpoint and client cohort.
- Invariant violation rate (should be ~0).
- Error budget burn + tail latency under load.
- Authz failures and policy denials (unexpected spikes).
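To distinguish PQC that is working from PQC that is merely enabled, the signals above can be reduced to a per-cohort ratio of successful handshakes that actually negotiated the hybrid mode. The sketch below is an in-memory illustration; the class, the cohort names, and the `hybrid`/`classical` and `ok`/`fail` labels are hypothetical.

```python
from collections import Counter


class HandshakeTelemetry:
    """Counts handshake outcomes keyed by (cohort, mode, outcome)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, cohort: str, mode: str, outcome: str) -> None:
        self._counts[(cohort, mode, outcome)] += 1

    def hybrid_success_ratio(self, cohort: str) -> float:
        """Share of successful handshakes in a cohort that used the hybrid mode."""
        ok_total = sum(
            n for (c, _mode, outcome), n in self._counts.items()
            if c == cohort and outcome == "ok"
        )
        if ok_total == 0:
            return 0.0
        return self._counts[(cohort, "hybrid", "ok")] / ok_total


telemetry = HandshakeTelemetry()
for _ in range(3):
    telemetry.record("mobile", "hybrid", "ok")
telemetry.record("mobile", "classical", "ok")
telemetry.record("mobile", "hybrid", "fail")
print(telemetry.hybrid_success_ratio("mobile"))
```

A ratio that stays flat after a rollout suggests clients are silently falling back, which is the case the negotiation telemetry is meant to catch.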
Rollback plan
- Define an explicit rollback trigger (metrics + thresholds).
- Use canaries and staged rollout; stop early when signals degrade.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Keep dual-write / dual-verify windows where appropriate.
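The explicit rollback trigger in the first item can be sketched as a pure function from canary metrics to a list of reasons. The metric names and threshold values are hypothetical placeholders; real values would come from the service's own baselines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits, for illustration only.
    max_handshake_failure_rate: float = 0.02
    max_p99_latency_ms: float = 250.0
    max_invariant_violations: int = 0


def rollback_reasons(metrics: dict, limits: Thresholds = Thresholds()) -> list:
    """Return every breached threshold; an empty list means the rollout may continue."""
    reasons = []
    if metrics["handshake_failure_rate"] > limits.max_handshake_failure_rate:
        reasons.append("handshake_failure_rate")
    if metrics["p99_latency_ms"] > limits.max_p99_latency_ms:
        reasons.append("p99_latency_ms")
    if metrics["invariant_violations"] > limits.max_invariant_violations:
        reasons.append("invariant_violations")
    return reasons


healthy = {"handshake_failure_rate": 0.01, "p99_latency_ms": 120.0, "invariant_violations": 0}
degraded = {"handshake_failure_rate": 0.05, "p99_latency_ms": 180.0, "invariant_violations": 1}
print(rollback_reasons(healthy), rollback_reasons(degraded))
```

Returning the reasons rather than a bare boolean keeps the decision auditable, which fits the evidence-preservation item in the same list.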
Evidence
- RFC 5869: HKDF (1) — Useful when discussing hybrid binding and context separation.
- Evidence: HKDF is the workhorse for domain separation; bind purpose/context to avoid cross-protocol key reuse.
- NIST Post-Quantum Cryptography Project (2) — Standardization process and algorithm selections.
- Evidence: Treat PQ migration as a program (inventory, interop, rollback). Use NIST status to drive prioritization and timelines.
Open questions
- How do you rotate algorithms without introducing configuration chaos?
- Which clients will fail first, and what is the safe fallback behavior?
- What is the worst-case handshake cost under attack?
- Where would a downgrade be visible today, and how would you detect it?
Checklist
- Safety properties stated as invariants.
- Failure modes enumerated with mitigations.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Assumptions listed and reviewed.
- Rollback plan rehearsed and automated.
- Telemetry captures correctness signals.
Further reading
- CRYSTALS-Kyber — KEM design and parameters commonly referenced in deployments.
- CRYSTALS-Dilithium — Signature scheme design and deployment constraints.
- RFC 5869: HKDF — Useful when discussing hybrid binding and context separation.
- NIST Post-Quantum Cryptography Project — Standardization process and algorithm selections.
- Learn TLA+ — Practical entry point for specification and model checking.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.