Hybrid Key Exchange: Binding Classical and PQ Secrets Correctly

Monthly research note. Theme: Post-Quantum Cryptography & Migration.

TL;DR

A focused memo on Hybrid Key Exchange: Binding Classical and PQ Secrets Correctly: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

Hybrid composition must be explicit and transcript-bound to resist downgrade.
Interop is the migration plan—test matrices are more important than whitepapers.
Constant-time requirements don’t disappear; they become harder under bigger primitives.
Make failure modes explicit and observable.
Define safety properties before performance goals.

Why this matters

Constant-time constraints are harder under large primitives.
Operationalization (monitoring, rollback) determines success more than crypto choice.
Interop is the real risk: multiple stacks, vendors, and versions.
PQC changes bandwidth and CPU costs; DoS surfaces move.

Key questions

What does interoperability testing look like across vendors and stacks?
What telemetry proves PQC is working (not just enabled)?
How do you handle failures: decryption failures, invalid ciphertexts, malformed keys?
How do you bind hybrid secrets to prevent downgrade and mix-and-match attacks?
How do you rotate algorithms safely (crypto agility without chaos)?
Which parts must be constant-time, and how will you validate that?

Assumptions

Active attacker can force retries, downgrades, and expensive handshakes.
Side channels exist: timing and cache behavior leak information.
Vendors vary: implementations and defaults differ.
Bandwidth is limited in some environments; larger handshakes matter.

Non-goals

Assuming PQC is “drop-in” without changing operational processes.
Treating migration as a single flag flip.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

Hybrid composition should be transcript-bound:

\mathrm{ss} = \mathrm{HKDF}(\mathrm{ss}_\text{classical}\ \Vert\ \mathrm{ss}_\text{pqc},\ \text{info}=\mathrm{transcript}).

Treat algorithm negotiation as adversarial: explicit downgrade resistance.

Make costs explicit: measure CPU and bandwidth, then add protections.

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).

Security properties

Least authority: privileges are scoped by purpose and time.
Downgrade resistance: negotiation can’t silently weaken security posture.
Authenticity: actions are bound to identity and purpose.
Replay resistance: duplicated inputs do not change outcomes.

Failure modes

Mixed-version behavior that violates assumptions silently.
Recovery paths that only work when nothing is broken.
Observability gaps during incidents (missing evidence).
Timeout ambiguity causing double-apply or partial state transitions.

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

sequenceDiagram
  participant A as Initiator
  participant B as Responder
  A->>B: classical_keyshare + pqc_pk
  B-->>A: classical_keyshare + pqc_ct + sig
  A-->>B: sig
  Note over A,B: ss = HKDF(ss_classical || ss_pqc, transcript)

Implementation notes

Interop tests are the migration plan; everything else is a hypothesis.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).

// Hybrid binding sketch (pseudocode):
// ss = HKDF(ss_classical || ss_pqc, info=transcript_hash)
// Then derive traffic keys from ss.

Verification strategy

Interop matrices across vendors/versions and failure modes.
Side-channel tests where tooling exists; constant-time audits.
Chaos deploys: mixed versions + rollback during partial outages.
Downgrade tests: active attacker manipulates negotiation.
DoS tests: measure CPU/bandwidth amplification and mitigation impact.

Operational notes

Add telemetry for negotiation outcomes, failures, and client cohorts.
Inventory long-lived secrets and migrate the highest-risk first.
Document supported algorithm sets and deprecation timelines.
Roll out with canaries and explicit rollback triggers.
Cap handshake cost per peer/IP; use stateless cookies when needed.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Error budget burn + tail latency under load.
Authz failures and policy denials (unexpected spikes).
Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Admission-control / rate-limit rejections (by reason).

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
Use canaries and staged rollout; stop early when signals degrade.

Evidence

Jepsen (1) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

What is the worst-case handshake cost under attack?
Which clients will fail first, and what is the safe fallback behavior?
Where would a downgrade be visible today, and how would you detect it?
How do you rotate algorithms without introducing configuration chaos?

Checklist

Telemetry captures correctness signals.
Assumptions listed and reviewed.
Safety properties stated as invariants.
Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading