Monthly research note. Theme: Post-Quantum Cryptography & Migration.
TL;DR
Treat PQC in TLS (negotiation, downgrade, and interop) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- PQC changes handshake costs; plan DoS defenses and budgets.
- Interop is the migration plan—test matrices are more important than whitepapers.
- Migration is mixed-version for years: compatibility and rollback are security features.
- Treat retries, reordering, and partial failure as default conditions.
- Write assumptions down; treat them as interfaces.
Why this matters
- Operationalization (monitoring, rollback) determines success more than crypto choice.
- Constant-time requirements are harder to meet and to audit with large primitives.
- Migration will be mixed-version for years; plan for it explicitly.
- Hybrid designs fail if binding is ambiguous (mix-and-match, downgrade).
Key questions
- How do you handle failures: decryption failures, invalid ciphertexts, malformed keys?
- What does interoperability testing look like across vendors and stacks?
- Which parts must be constant-time, and how will you validate that?
- Which secrets require long-term confidentiality (HNDL) and where are they today?
- What are the new DoS surfaces (bigger keys, more CPU, more bandwidth)?
- What telemetry proves PQC is working (not just enabled)?
Assumptions
- Vendors vary: implementations and defaults differ.
- Active attacker can force retries, downgrades, and expensive handshakes.
- Side channels exist: timing and cache behavior leak information.
- Deployments are mixed; old clients must interoperate or fail safely.
Non-goals
- Assuming PQC is “drop-in” without changing operational processes.
- Relying on silent fallback to weaker modes during interop failures.
Any unbounded work per request becomes a DoS primitive in an attacker's hands.
Model & invariants
A post-quantum KEM gives you a shared secret without relying on discrete-log assumptions.
Binding is the whole game: make the transcript an input to the KDF.
Make costs explicit: measure CPU and bandwidth, then add protections.
Make the “impossible state” observable: a metric or alert that fires when invariants drift.
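One way to make an "impossible state" observable is to check invariants when a handshake completes and count every violation. The sketch below is illustrative only: `HandshakeResult`, `check_invariants`, and the in-process `VIOLATIONS` counter are hypothetical names, and a real deployment would export the counter to its monitoring system.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical in-process metric; alert on any non-zero value.
VIOLATIONS = Counter()


@dataclass(frozen=True)
class HandshakeResult:
    offered: frozenset    # groups the client offered
    supported: frozenset  # groups the server accepts by policy
    negotiated: str       # group actually used for this session


def check_invariants(result):
    """Return True if all invariants hold; count each violation otherwise."""
    ok = True
    if result.negotiated not in result.offered:
        VIOLATIONS["negotiated_not_offered"] += 1
        ok = False
    if result.negotiated not in result.supported:
        VIOLATIONS["negotiated_not_supported"] += 1
        ok = False
    return ok
```

The counter should stay at zero in normal operation, so any increment is worth an alert rather than a dashboard entry.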
Security properties
- Replay resistance: duplicated inputs do not change outcomes.
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Evidence: critical actions emit verifiable audit events.
- Authenticity: actions are bound to identity and purpose.
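Downgrade resistance from the list above can be sketched as a selection function that fails closed instead of falling back silently. The group names, `PREFERENCE`, `select_group`, and the `require_hybrid` policy flag are assumptions made for illustration; they are not registry identifiers or any library's API.

```python
# Hypothetical preference order, strongest first. Names are placeholders.
PREFERENCE = ["hybrid_x25519_mlkem768", "x25519"]


class NegotiationError(Exception):
    """Raised when no acceptable group exists; the caller must abort."""


def select_group(client_offered, server_supported, require_hybrid):
    """Pick the most preferred mutual group, or fail closed."""
    for group in PREFERENCE:
        if group in client_offered and group in server_supported:
            if require_hybrid and not group.startswith("hybrid_"):
                # Policy forbids classical-only; refuse rather than weaken.
                raise NegotiationError("policy requires hybrid, got " + group)
            return group
    raise NegotiationError("no mutually supported group")
```

Selection alone is not sufficient: both sides' offers also need to be covered by the authenticated transcript, otherwise an active attacker can strip the hybrid option before this function ever sees it.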
Failure modes
- Observability gaps during incidents (missing evidence).
- Timeout ambiguity causing double-apply or partial state transitions.
- Mixed-version behavior that violates assumptions silently.
- Recovery paths that only work when nothing is broken.
Sampling hides the rare schedule that breaks your invariants.
Design sketch
sequenceDiagram
participant A as Initiator
participant B as Responder
A->>B: classical_keyshare + pqc_pk
B-->>A: classical_keyshare + pqc_ct + sig
A-->>B: sig
Note over A,B: ss = HKDF(ss_classical || ss_pqc, transcript)
Implementation notes
Explicit binding prevents downgrade and mix-and-match. Don’t leave it implicit.
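A minimal sketch of transcript-bound derivation, using only the Python standard library. `hkdf_sha256` follows the extract-then-expand construction of RFC 5869; `hybrid_secret`, the 32-byte secret lengths, and the `hybrid-sketch v1` label are assumptions for illustration, not a standardized combiner.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(ikm, salt, info, length):
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long")
    # Extract: an empty salt defaults to HASH_LEN zero bytes.
    prk = hmac.new(salt or bytes(HASH_LEN), ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_secret(ss_classical, ss_pqc, transcript):
    """Derive one session secret from both shared secrets and the transcript."""
    # Assumption: both secrets are fixed-length, so concatenation is unambiguous.
    if len(ss_classical) != 32 or len(ss_pqc) != 32:
        raise ValueError("unexpected shared secret length")
    transcript_hash = hashlib.sha256(transcript).digest()
    info = b"hybrid-sketch v1" + transcript_hash  # hypothetical label
    return hkdf_sha256(ss_classical + ss_pqc, b"", info, 32)
```

Because the transcript hash feeds the derivation, two peers that saw different negotiation messages end up with different keys and the handshake fails instead of completing in a weakened mode.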
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
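As a sketch of "cap cost before you allocate", the responder can reject a malformed initiator share on length alone, before any KEM operation runs. The sizes assume X25519 (32-byte public key) and ML-KEM-768 (1184-byte encapsulation key); the framing, `split_initiator_share`, and `RejectedInput` are hypothetical.

```python
# Assumed sizes in bytes; the concatenated framing is hypothetical.
CLASSICAL_SHARE_LEN = 32  # X25519 public key
PQC_PK_LEN = 1184         # ML-KEM-768 encapsulation key
EXPECTED_LEN = CLASSICAL_SHARE_LEN + PQC_PK_LEN


class RejectedInput(Exception):
    """Raised for input that fails cheap validation."""


def split_initiator_share(payload):
    """Constant-cost length check first; no crypto runs on bad input."""
    if len(payload) != EXPECTED_LEN:
        raise RejectedInput("bad share length: %d" % len(payload))
    classical = payload[:CLASSICAL_SHARE_LEN]
    pqc_pk = payload[CLASSICAL_SHARE_LEN:]
    return classical, pqc_pk
```

Length checks do not replace the key validation that the KEM itself requires; they only ensure the expensive path is reached by inputs that could plausibly be valid.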
Hybrid handshake checklist:
- Explicit negotiation (no silent downgrade)
- Transcript-bound KDF
- DoS protections (rate limits, cookies, puzzles)
- Constant-time operations
- Telemetry: which mode, which failures, which clients
Verification strategy
- DoS tests: measure CPU/bandwidth amplification and mitigation impact.
- Downgrade tests: active attacker manipulates negotiation.
- Chaos deploys: mixed versions + rollback during partial outages.
- Side-channel tests where tooling exists; constant-time audits.
- Interop matrices across vendors/versions and failure modes.
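An interop matrix can start as a generated list of cases with an expected outcome per cell, which test harnesses then execute against real stacks. The client and server labels, group sets, and the server-preference rule below are placeholders chosen for illustration.

```python
from itertools import product

# Hypothetical stack labels and group configurations.
CLIENTS = ["client-old", "client-new"]
SERVERS = ["server-old", "server-new"]
GROUP_SETS = [("classical",), ("hybrid", "classical"), ("hybrid",)]


def expected_outcome(client_groups, server_groups):
    """First server-preferred group both sides share, else fail closed."""
    for group in server_groups:
        if group in client_groups:
            return group
    return "fail-closed"


def build_matrix():
    """Enumerate every client/server/config combination with its expectation."""
    rows = []
    for client, server, cg, sg in product(CLIENTS, SERVERS, GROUP_SETS, GROUP_SETS):
        rows.append((client, server, cg, sg, expected_outcome(cg, sg)))
    return rows
```

Writing the expected result down for each cell, including the cells that must fail, is what turns the matrix into a downgrade test rather than a connectivity test.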
Operational notes
- Cap handshake cost per peer/IP; use stateless cookies when needed.
- Inventory long-lived secrets and migrate the highest-risk first.
- Document supported algorithm sets and deprecation timelines.
- Add telemetry for negotiation outcomes, failures, and client cohorts.
- Roll out with canaries and explicit rollback triggers.
Attach explicit rollout/rollback triggers to changes that touch security or correctness.
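The stateless cookie mentioned in the notes above can be sketched with an HMAC over the peer address and an issue time, so the server stores nothing until the peer proves it can receive replies. `make_cookie`, `verify_cookie`, the 30-second lifetime, and the cookie layout are assumptions for illustration.

```python
import hashlib
import hmac
import struct

COOKIE_LIFETIME_S = 30  # hypothetical validity window
TS_LEN = 8
TAG_LEN = 32


def make_cookie(key, peer_addr, now):
    """Cookie = 8-byte issue time || HMAC-SHA256(key, time || peer address)."""
    ts = struct.pack(">Q", int(now))
    tag = hmac.new(key, ts + peer_addr.encode(), hashlib.sha256).digest()
    return ts + tag


def verify_cookie(key, peer_addr, cookie, now):
    """Check length, then the MAC in constant time, then freshness."""
    if len(cookie) != TS_LEN + TAG_LEN:
        return False
    ts, tag = cookie[:TS_LEN], cookie[TS_LEN:]
    expected = hmac.new(key, ts + peer_addr.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return False
    issued = struct.unpack(">Q", ts)[0]
    return 0 <= int(now) - issued <= COOKIE_LIFETIME_S
```

A server would run the expensive KEM operation only after `verify_cookie` returns True, and would rotate `key` periodically so old cookies cannot be replayed indefinitely.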
What to monitor
- Authz failures and policy denials (unexpected spikes).
- Retry/timeout rates by endpoint and client cohort.
- Rollback events and the conditions that triggered them.
- Error budget burn + tail latency under load.
- Invariant violation rate (should be ~0).
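To separate "PQC is enabled" from "PQC is working", it helps to record the negotiated mode per client cohort and compute the share of handshakes that actually completed in hybrid mode. The in-memory `OUTCOMES` counter, cohort labels, and function names below are hypothetical stand-ins for a real metrics pipeline.

```python
from collections import Counter

# Hypothetical in-memory store keyed by (cohort, mode, outcome).
OUTCOMES = Counter()


def record_handshake(cohort, mode, succeeded):
    """Count one handshake attempt for this cohort and negotiated mode."""
    OUTCOMES[(cohort, mode, "ok" if succeeded else "fail")] += 1


def hybrid_success_share(cohort):
    """Fraction of this cohort's handshakes that completed in hybrid mode."""
    total = sum(n for (c, _, _), n in OUTCOMES.items() if c == cohort)
    if total == 0:
        return None  # no data is not the same as zero percent
    return OUTCOMES[(cohort, "hybrid", "ok")] / total
```

A cohort whose share drops while hybrid remains enabled in configuration is a candidate signal for a silent fallback or an interop regression.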
Rollback plan
- Keep dual-write / dual-verify windows where appropriate.
- Define an explicit rollback trigger (metrics + thresholds).
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Use canaries and staged rollout; stop early when signals degrade.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
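An explicit rollback trigger can be as small as a set of named thresholds and a function that reports which ones were breached. The metric names and threshold values below are placeholders, not recommendations; real values should come from measured baselines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackThresholds:
    # Hypothetical values for illustration only.
    max_handshake_failure_rate: float = 0.02
    max_p99_latency_ms: float = 250.0
    max_invariant_violations: int = 0


def rollback_reasons(metrics, thresholds):
    """Return the names of breached thresholds; an empty list means keep going."""
    reasons = []
    if metrics["handshake_failure_rate"] > thresholds.max_handshake_failure_rate:
        reasons.append("handshake_failure_rate")
    if metrics["p99_latency_ms"] > thresholds.max_p99_latency_ms:
        reasons.append("p99_latency_ms")
    if metrics["invariant_violations"] > thresholds.max_invariant_violations:
        reasons.append("invariant_violations")
    return reasons
```

Returning the reasons, rather than a bare boolean, gives the rollback event the evidence needed to reconstruct why it fired.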
Evidence
- RFC 5869: HKDF (1) — Useful when discussing hybrid binding and context separation.
- Evidence: HKDF is the workhorse for domain separation; bind purpose/context to avoid cross-protocol key reuse.
- Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Open questions
- Where would a downgrade be visible today, and how would you detect it?
- What is the worst-case handshake cost under attack?
- How do you rotate algorithms without introducing configuration chaos?
- Which clients will fail first, and what is the safe fallback behavior?
Checklist
- Telemetry captures correctness signals.
- Safety properties stated as invariants.
- Failure modes enumerated with mitigations.
- Rollback plan rehearsed and automated.
- Assumptions listed and reviewed.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Further reading
- CRYSTALS-Kyber — KEM design and parameters commonly referenced in deployments.
- CRYSTALS-Dilithium — Signature scheme design and deployment constraints.
- NIST Post-Quantum Cryptography Project — Standardization process and algorithm selections.
- RFC 5869: HKDF — Useful when discussing hybrid binding and context separation.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.