Monthly research note. Theme: Post-Quantum Cryptography & Migration.
TL;DR
This memo covers hybrid key exchange, specifically how to bind classical and post-quantum (PQ) secrets correctly: define the model, state the properties, then design the system so those properties still hold under failure and against adversaries.
Correctness is cheaper to enforce at interfaces than to repair in production data.
Key takeaways
- Hybrid composition must be explicit and transcript-bound to resist downgrade.
- Interop is the migration plan—test matrices are more important than whitepapers.
- Constant-time requirements don’t disappear; they become harder to meet as primitives grow larger.
- Make failure modes explicit and observable.
- Define safety properties before performance goals.
Why this matters
- Constant-time constraints are harder to satisfy and to audit with large primitives.
- Operationalization (monitoring, rollback) determines success more than crypto choice.
- Interop is the real risk: multiple stacks, vendors, and versions.
- PQC changes bandwidth and CPU costs; DoS surfaces move.
Key questions
- What does interoperability testing look like across vendors and stacks?
- What telemetry proves PQC is working (not just enabled)?
- How do you handle failures: decryption failures, invalid ciphertexts, malformed keys?
- How do you bind hybrid secrets to prevent downgrade and mix-and-match attacks?
- How do you rotate algorithms safely (crypto agility without chaos)?
- Which parts must be constant-time, and how will you validate that?
Assumptions
- Active attacker can force retries, downgrades, and expensive handshakes.
- Side channels exist: timing and cache behavior leak information.
- Vendors vary: implementations and defaults differ.
- Bandwidth is limited in some environments; larger handshakes matter.
Non-goals
- Assuming PQC is “drop-in” without changing operational processes.
- Treating migration as a single flag flip.
Parsing is an attacker-controlled interface—validate early and fail fast.
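As one way to make "validate early and fail fast" concrete, the sketch below rejects a malformed initiator key share on length alone, before any key material is used. It assumes an X25519 + ML-KEM-768 pairing (32-byte X25519 public key, 1184-byte ML-KEM-768 encapsulation key) and a fixed `classical || pqc_pk` layout; the memo does not name algorithms or a wire format, so the sizes, layout, and all names here are illustrative assumptions.

```python
from dataclasses import dataclass

# Assumed sizes for an X25519 + ML-KEM-768 hybrid; adjust for other parameter sets.
X25519_PUBLIC_LEN = 32
MLKEM768_PUBLIC_LEN = 1184


class MalformedKeyShare(ValueError):
    """Raised before any cryptographic work is done on a bad input."""


@dataclass(frozen=True)
class InitiatorShare:
    classical_keyshare: bytes
    pqc_pk: bytes


def parse_initiator_share(wire: bytes) -> InitiatorShare:
    """Split a fixed-layout key share (hypothetical layout: classical || pqc_pk)."""
    expected = X25519_PUBLIC_LEN + MLKEM768_PUBLIC_LEN
    if len(wire) != expected:
        # Fail fast: reject on length alone, before touching any key material.
        raise MalformedKeyShare(f"expected {expected} bytes, got {len(wire)}")
    return InitiatorShare(wire[:X25519_PUBLIC_LEN], wire[X25519_PUBLIC_LEN:])
```

A length check is only the cheapest first gate. It does not replace whatever key validation the chosen KEM specifies, and the rejection should be counted in telemetry so malformed-input spikes are visible.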
Model & invariants
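Hybrid composition should be transcript-bound: derive the shared secret from both component secrets together with the handshake transcript, so neither secret can be swapped or dropped without changing the result.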
Treat algorithm negotiation as adversarial: explicit downgrade resistance.
Make costs explicit: measure CPU and bandwidth, then add protections.
Invariants must be checkable from evidence you actually have (state + logs + counters).
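The downgrade-resistance idea above can be illustrated with a transcript hash that covers the offered algorithm list. This is a minimal sketch, not a protocol definition: the length-prefixed encoding, the comma-separated offer format, and the algorithm labels are all assumptions made for the example.

```python
import hashlib
import hmac


def transcript_hash(messages: list[bytes]) -> bytes:
    """Hash handshake messages with length prefixes so boundaries are unambiguous."""
    h = hashlib.sha256()
    for msg in messages:
        h.update(len(msg).to_bytes(4, "big"))
        h.update(msg)
    return h.digest()


def encode_offer(algorithms: list[str]) -> bytes:
    """Hypothetical wire encoding of the initiator's offered algorithm list."""
    return ",".join(algorithms).encode("ascii")


# What the initiator actually sent.
sent_offer = encode_offer(["x25519+mlkem768", "x25519"])
# What the responder received after an attacker stripped the hybrid option.
seen_offer = encode_offer(["x25519"])

initiator_view = transcript_hash([sent_offer, b"responder_hello"])
responder_view = transcript_hash([seen_offer, b"responder_hello"])

# The two views differ, so keys derived from the transcript (and any
# signature computed over it) will not match and the handshake fails.
downgrade_detected = not hmac.compare_digest(initiator_view, responder_view)
```

The point of the design is that tampering with negotiation produces a hard failure that both sides can log, rather than a silently weaker session.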
Security properties
- Least authority: privileges are scoped by purpose and time.
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Authenticity: actions are bound to identity and purpose.
- Replay resistance: duplicated inputs do not change outcomes.
Failure modes
- Mixed-version behavior that violates assumptions silently.
- Recovery paths that only work when nothing is broken.
- Observability gaps during incidents (missing evidence).
- Timeout ambiguity causing double-apply or partial state transitions.
A recovery plan that isn’t exercised will fail when you need it.
Design sketch
sequenceDiagram
participant A as Initiator
participant B as Responder
A->>B: classical_keyshare + pqc_pk
B-->>A: classical_keyshare + pqc_ct + sig
A-->>B: sig
Note over A,B: ss = HKDF(ss_classical || ss_pqc, transcript)
Implementation notes
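The derivation in the diagram note, ss = HKDF(ss_classical || ss_pqc, transcript), can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions: HMAC-SHA-256 as the HKDF hash, 32-byte component secrets, a 32-byte transcript hash passed as `info`, and hypothetical traffic-key labels. None of these choices come from the memo.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """RFC 5869 extract step; an empty salt is replaced by HASH_LEN zero bytes."""
    return hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 expand step: T(i) = HMAC(prk, T(i-1) || info || i)."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long for HKDF-SHA-256")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_secret(ss_classical: bytes, ss_pqc: bytes, transcript_hash: bytes) -> bytes:
    """ss = HKDF(ss_classical || ss_pqc, info=transcript_hash).

    Plain concatenation assumes both secrets have fixed, known lengths;
    variable-length inputs would need length prefixes.
    """
    prk = hkdf_extract(salt=b"", ikm=ss_classical + ss_pqc)
    return hkdf_expand(prk, info=transcript_hash, length=HASH_LEN)


def traffic_keys(ss: bytes) -> tuple[bytes, bytes]:
    """Derive separate per-direction keys from ss (labels are hypothetical)."""
    return (
        hkdf_expand(ss, b"initiator->responder", HASH_LEN),
        hkdf_expand(ss, b"responder->initiator", HASH_LEN),
    )
```

Changing either component secret or the transcript changes `ss`, which is the property the binding is meant to provide. In production, a vetted HKDF from the platform's crypto library is preferable to hand-rolled code.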
Interop tests are the migration plan; everything else is a hypothesis.
Acknowledge only after durability (or make “ack” explicitly best-effort).
// Hybrid binding sketch (pseudocode):
// ss = HKDF(ss_classical || ss_pqc, info=transcript_hash)
// Then derive traffic keys from ss.
Verification strategy
- Interop matrices across vendors/versions and failure modes.
- Side-channel tests where tooling exists; constant-time audits.
- Chaos deploys: mixed versions + rollback during partial outages.
- Downgrade tests: active attacker manipulates negotiation.
- DoS tests: measure CPU/bandwidth amplification and mitigation impact.
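An interop matrix like the one in the first bullet can be generated rather than hand-maintained. The sketch below enumerates client/server/mode/fault combinations with an expected outcome per cell. The stack names, fault names, and the fallback policy (classical is allowed when either side is classical-only, any injected fault must abort) are hypothetical placeholders, not recommendations.

```python
from itertools import product

# Hypothetical inventory; replace with the real stacks and versions.
STACKS = ["vendor_a-1.4", "vendor_b-2.0", "in_house-0.9"]
MODES = ["classical_only", "hybrid"]
FAULTS = ["none", "truncated_pqc_ct", "stripped_hybrid_offer"]


def interop_cases():
    """Yield (client, server, client_mode, server_mode, fault, expected)."""
    for client, server, c_mode, s_mode in product(STACKS, STACKS, MODES, MODES):
        both_hybrid = c_mode == s_mode == "hybrid"
        # Faults that target the PQ part only apply when hybrid would be negotiated.
        for fault in (FAULTS if both_hybrid else ["none"]):
            if fault != "none":
                expected = "abort"
            elif both_hybrid:
                expected = "hybrid"
            else:
                expected = "classical"
            yield (client, server, c_mode, s_mode, fault, expected)


cases = list(interop_cases())
```

Each tuple becomes one test invocation against real endpoints. Keeping the expected outcome in the matrix makes a silent fallback show up as a test failure instead of a pass.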
Operational notes
- Add telemetry for negotiation outcomes, failures, and client cohorts.
- Inventory long-lived secrets and migrate the highest-risk first.
- Document supported algorithm sets and deprecation timelines.
- Roll out with canaries and explicit rollback triggers.
- Cap handshake cost per peer/IP; use stateless cookies when needed.
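For the last bullet, a stateless cookie lets a responder defer expensive PQ work until the peer has proven it can receive traffic at its claimed address. This is a minimal sketch assuming an HMAC-SHA-256 cookie over the peer address and a timestamp; the 30-second lifetime, the field layout, and the function names are assumptions for illustration.

```python
import hashlib
import hmac

COOKIE_LIFETIME_S = 30  # Assumed validity window.
TS_LEN, TAG_LEN = 8, 32


def _tag(server_key: bytes, peer_addr: str, ts: bytes) -> bytes:
    # ts has a fixed length and comes last, so the MAC input is unambiguous.
    return hmac.new(server_key, peer_addr.encode() + b"|" + ts, hashlib.sha256).digest()


def make_cookie(server_key: bytes, peer_addr: str, now: int) -> bytes:
    """Return timestamp || MAC; the server keeps no per-peer state."""
    ts = now.to_bytes(TS_LEN, "big")
    return ts + _tag(server_key, peer_addr, ts)


def check_cookie(server_key: bytes, peer_addr: str, cookie: bytes, now: int) -> bool:
    """Verify the cookie before doing any expensive key-exchange work."""
    if len(cookie) != TS_LEN + TAG_LEN:
        return False
    ts, tag = cookie[:TS_LEN], cookie[TS_LEN:]
    if not hmac.compare_digest(tag, _tag(server_key, peer_addr, ts)):
        return False
    age = now - int.from_bytes(ts, "big")
    return 0 <= age <= COOKIE_LIFETIME_S
```

A caller would pass `int(time.time())` as `now`. Taking the clock as a parameter keeps the check deterministic in tests, and rotating `server_key` invalidates all outstanding cookies at once.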
Attach explicit rollout/rollback triggers to changes that touch security or correctness.
What to monitor
- Error budget burn + tail latency under load.
- Authz failures and policy denials (unexpected spikes).
- Retry/timeout rates by endpoint and client cohort.
- Invariant violation rate (should be ~0).
- Admission-control / rate-limit rejections (by reason).
Rollback plan
- Define an explicit rollback trigger (metrics + thresholds).
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Keep dual-write / dual-verify windows where appropriate.
- Use canaries and staged rollout; stop early when signals degrade.
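An explicit rollback trigger (metrics plus thresholds) can be written as a small, reviewable function. The sketch below evaluates canary metrics for a cohort assumed to be hybrid-capable. The metric names and every threshold are hypothetical and would need to come from a measured baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    handshakes: int
    handshake_failures: int
    hybrid_negotiated: int
    p99_handshake_ms: float


# Hypothetical thresholds; real values come from the pre-rollout baseline.
MIN_SAMPLE = 1000
MAX_FAILURE_RATE = 0.01
MIN_HYBRID_RATE = 0.90
MAX_P99_MS = 250.0


def rollback_reasons(m: CanaryMetrics) -> list[str]:
    """Return the reasons to roll back; an empty list means keep going."""
    if m.handshakes < MIN_SAMPLE:
        return []  # Not enough evidence yet to decide either way.
    reasons = []
    if m.handshake_failures / m.handshakes > MAX_FAILURE_RATE:
        reasons.append("handshake failure rate above threshold")
    succeeded = m.handshakes - m.handshake_failures
    # "Enabled" is not "working": successful handshakes must actually be hybrid.
    if succeeded and m.hybrid_negotiated / succeeded < MIN_HYBRID_RATE:
        reasons.append("hybrid negotiation rate below threshold")
    if m.p99_handshake_ms > MAX_P99_MS:
        reasons.append("p99 handshake latency above threshold")
    return reasons
```

Returning reasons rather than a boolean preserves evidence for the post-incident record, in line with the second bullet.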
Evidence
- Jepsen (1) — Fault injection and correctness testing for distributed systems.
  - Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
- Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
  - Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Open questions
- What is the worst-case handshake cost under attack?
- Which clients will fail first, and what is the safe fallback behavior?
- Where would a downgrade be visible today, and how would you detect it?
- How do you rotate algorithms without introducing configuration chaos?
Checklist
- Telemetry captures correctness signals.
- Assumptions listed and reviewed.
- Safety properties stated as invariants.
- Failure modes enumerated with mitigations.
- Rollback plan rehearsed and automated.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Further reading
- NIST Post-Quantum Cryptography Project — Standardization process and algorithm selections.
- RFC 5869: HKDF — Useful when discussing hybrid binding and context separation.
- CRYSTALS-Dilithium — Signature scheme design and deployment constraints.
- CRYSTALS-Kyber — KEM design and parameters commonly referenced in deployments.
- Jepsen — Fault injection and correctness testing for distributed systems.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.