Monthly research note. Theme: Post-Quantum Cryptography & Migration.

TL;DR

A focused memo on KEMs in practice (Kyber handshakes and failure surfaces): define the model, state the properties, then design the system so those properties remain true under failure and under attack.

Key insight

Treat a timeout as a third outcome: neither success nor failure, but ambiguity you must model explicitly.
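One way to make the third outcome concrete is to give it its own value in the type that callers must handle. The sketch below is illustrative only: the names `Outcome`, `classify`, and `next_step`, and the action strings, are hypothetical and not taken from any particular stack.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    AMBIGUOUS = "ambiguous"  # timed out: the peer may or may not have acted


def classify(replied: bool, ok: bool) -> Outcome:
    """Map a handshake attempt to one of three outcomes."""
    if not replied:
        # No reply within the deadline tells us nothing about the peer's state.
        return Outcome.AMBIGUOUS
    return Outcome.SUCCESS if ok else Outcome.FAILURE


def next_step(outcome: Outcome) -> str:
    """Pick the caller's next action (hypothetical action names)."""
    if outcome is Outcome.SUCCESS:
        return "proceed"
    if outcome is Outcome.FAILURE:
        return "abort"
    # Ambiguity: never reuse half-established state; start from scratch.
    return "discard_state_and_retry_fresh"
```

Because `AMBIGUOUS` is a distinct value, a caller cannot fold a timeout into "failure" without doing so visibly in code review.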

Key takeaways

  • Constant-time requirements don’t disappear; they become harder to meet and to audit as primitives get larger.
  • Hybrid composition must be explicit and transcript-bound to resist downgrade.
  • Interop is the migration plan: test matrices matter more than whitepapers.
  • Automate guardrails; humans are for judgment, not for consistent enforcement.
  • Write assumptions down; treat them as interfaces.

Why this matters

  • Hybrid designs fail if binding is ambiguous (mix-and-match, downgrade).
  • Interop is the real risk: multiple stacks, vendors, and versions.
  • PQC changes bandwidth and CPU costs; DoS surfaces move.
  • Operationalization (monitoring, rollback) determines success more than crypto choice.

Key questions

  • What are the new DoS surfaces (bigger keys, more CPU, more bandwidth)?
  • Which secrets require long-term confidentiality and are therefore exposed to harvest-now, decrypt-later (HNDL) collection, and where are they today?
  • Which parts must be constant-time, and how will you validate that?
  • How do you bind hybrid secrets to prevent downgrade and mix-and-match attacks?
  • What telemetry proves PQC is working (not just enabled)?
  • How do you handle failures: decryption failures, invalid ciphertexts, malformed keys?

Assumptions

  • Deployments are mixed; old clients must interoperate or fail safely.
  • An active attacker can force retries, downgrades, and expensive handshakes.
  • Side channels exist: timing and cache behavior leak information.
  • Bandwidth is limited in some environments; larger handshakes matter.

Non-goals

  • Relying on silent fallback to weaker modes during interop failures.
  • Assuming PQC is “drop-in” without changing operational processes.

Attack surface

Parsing is an attacker-controlled interface: validate lengths and encodings early and fail fast, before any expensive cryptographic work.
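A minimal sketch of "validate early": reject wrongly sized keys and ciphertexts before any cryptographic code runs. The byte sizes are the ML-KEM (standardized Kyber) encapsulation-key and ciphertext lengths from FIPS 203; the function and exception names are hypothetical. A length check is necessary but not sufficient, since FIPS 203 also specifies input checks on the key encoding that this sketch does not perform.

```python
from typing import Optional

# (encapsulation key bytes, ciphertext bytes) per FIPS 203 parameter set.
ML_KEM_SIZES = {
    "ML-KEM-512": (800, 768),
    "ML-KEM-768": (1184, 1088),
    "ML-KEM-1024": (1568, 1568),
}


class MalformedInput(ValueError):
    """Raised before any cryptographic work is done on bad input."""


def check_lengths(param_set: str,
                  public_key: Optional[bytes] = None,
                  ciphertext: Optional[bytes] = None) -> None:
    """Cheap structural checks; raises MalformedInput on the first problem."""
    if param_set not in ML_KEM_SIZES:
        raise MalformedInput(f"unsupported parameter set: {param_set!r}")
    pk_len, ct_len = ML_KEM_SIZES[param_set]
    if public_key is not None and len(public_key) != pk_len:
        raise MalformedInput(f"public key must be {pk_len} bytes")
    if ciphertext is not None and len(ciphertext) != ct_len:
        raise MalformedInput(f"ciphertext must be {ct_len} bytes")
```

Calling `check_lengths` at the parsing boundary keeps the expensive path (decapsulation, signature verification) reachable only by well-formed inputs.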

Model & invariants

Hybrid composition should be transcript-bound:

\[
\mathrm{ss} = \mathrm{HKDF}\bigl(\mathrm{ss}_{\text{classical}} \,\Vert\, \mathrm{ss}_{\text{pqc}},\ \text{info} = \mathrm{transcript}\bigr).
\]

Make costs explicit: measure CPU and bandwidth, then add protections.

Binding is the whole game: make the transcript an input to the KDF.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
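As a rough illustration, an invariant can be wrapped in a check that increments a counter whenever it fails, so the "impossible state" shows up in a metric instead of passing silently. The counter, the invariant name, and the `record_session` helper are all hypothetical; a real deployment would export the counter through its existing metrics system.

```python
from collections import Counter

# In-process stand-in for a metrics backend (assumption for this sketch).
violations = Counter()


def check_invariant(name: str, holds: bool) -> bool:
    """Count a violation when the invariant does not hold; return `holds`."""
    if not holds:
        violations[name] += 1
    return holds


def record_session(negotiated: str, policy_requires_hybrid: bool) -> None:
    """Example invariant: if policy requires hybrid, the session is hybrid."""
    check_invariant(
        "hybrid_required_but_not_negotiated",
        not (policy_requires_hybrid and negotiated != "hybrid"),
    )
```

An alert on any non-zero value of such a counter turns a silent assumption into an operational signal.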

Security properties

  • Least authority: privileges are scoped by purpose and time.
  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Authenticity: actions are bound to identity and purpose.
  • Replay resistance: duplicated inputs do not change outcomes.

Failure modes

  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Mixed-version behavior that violates assumptions silently.
  • Config drift that weakens security posture over time.

Pitfall

Mixed-version deployments create states you never tested, so plan for them explicitly.

Design sketch

sequenceDiagram
  participant A as Initiator
  participant B as Responder
  A->>B: classical_keyshare + pqc_pk
  B-->>A: classical_keyshare + pqc_ct + sig
  A-->>B: sig
  Note over A,B: ss = HKDF(ss_classical || ss_pqc, transcript)

Implementation notes

PQC migration is a systems program: protocol, performance, ops, and UX must compose.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
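One common way to cap cost before allocating state is a stateless cookie: the responder hands out a cheap MAC over the peer address and a timestamp, and only performs expensive work once the peer echoes a valid cookie. The sketch below uses only the standard library; the key handling, the 30-second lifetime, and the function names are assumptions for illustration.

```python
import hashlib
import hmac
import os

COOKIE_KEY = os.urandom(32)  # assumption: per-process key, rotated periodically
COOKIE_TTL = 30              # seconds; hypothetical value


def make_cookie(peer_addr: str, now: int) -> bytes:
    """Return timestamp || HMAC(key, timestamp || peer address)."""
    ts = now.to_bytes(8, "big")
    mac = hmac.new(COOKIE_KEY, ts + peer_addr.encode(), hashlib.sha256).digest()
    return ts + mac


def verify_cookie(cookie: bytes, peer_addr: str, now: int) -> bool:
    """Check length, MAC, and age; costs one HMAC and no stored state."""
    if len(cookie) != 8 + 32:
        return False
    ts, mac = cookie[:8], cookie[8:]
    expected = hmac.new(COOKIE_KEY, ts + peer_addr.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return False
    age = now - int.from_bytes(ts, "big")
    return 0 <= age <= COOKIE_TTL
```

The responder would run `verify_cookie` before key generation or decapsulation, so a spoofed source address costs it one HMAC rather than a full handshake.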

// Hybrid binding sketch (pseudocode):
// ss = HKDF(ss_classical || ss_pqc, info=transcript_hash)
// Then derive traffic keys from ss.
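The pseudocode above can be made concrete with the standard library alone. This is a sketch under stated assumptions, not a vetted combiner: HKDF follows RFC 5869 with SHA-256, both shared secrets are assumed to be fixed-length (32 bytes each, as with X25519 and ML-KEM) so plain concatenation is unambiguous, and the `b"hybrid-ss"` label and function names are hypothetical.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """RFC 5869 extract step: PRK = HMAC(salt, IKM)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 expand step: T(i) = HMAC(PRK, T(i-1) || info || i)."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long for HKDF")
    out, block, counter = b"", b"", 1
    while len(out) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]


def hybrid_secret(ss_classical: bytes, ss_pqc: bytes, transcript: bytes) -> bytes:
    """ss = HKDF(ss_classical || ss_pqc, info = label || hash(transcript))."""
    transcript_hash = hashlib.sha256(transcript).digest()
    prk = hkdf_extract(b"\x00" * HASH_LEN, ss_classical + ss_pqc)
    return hkdf_expand(prk, b"hybrid-ss" + transcript_hash, 32)
```

Any change to the transcript (for example a stripped algorithm offer) changes `info` and therefore the derived secret, which is what makes the binding useful against downgrade and mix-and-match.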

Verification strategy

  • Downgrade tests: active attacker manipulates negotiation.
  • Chaos deploys: mixed versions + rollback during partial outages.
  • Interop matrices across vendors/versions and failure modes.
  • DoS tests: measure CPU/bandwidth amplification and mitigation impact.
  • Side-channel tests where tooling exists; constant-time audits.
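The downgrade test in the list above can be sketched as a unit test: an attacker strips the hybrid option from the offer in transit, and the test checks that the two sides end up with different transcript hashes, so any MAC or signature over the transcript would fail. The algorithm labels, preference order, and function names are hypothetical.

```python
import hashlib
from typing import List

# Strongest first; labels are illustrative, not registry names.
PREFERENCE = ["x25519+mlkem768", "x25519"]


def select(offered: List[str]) -> str:
    """Responder picks its most preferred algorithm that the peer offered."""
    for alg in PREFERENCE:
        if alg in offered:
            return alg
    raise ValueError("no common algorithm")


def transcript_hash(offered: List[str], selected: str) -> bytes:
    """Hash the full offer and the selection, length-prefixing each item."""
    h = hashlib.sha256()
    for item in [*offered, "|", selected]:
        data = item.encode()
        h.update(len(data).to_bytes(2, "big") + data)
    return h.digest()


def downgrade_detected(client_offer: List[str], offer_seen_by_server: List[str]) -> bool:
    """True when the client's and server's transcript views disagree."""
    selected = select(offer_seen_by_server)
    client_view = transcript_hash(client_offer, selected)
    server_view = transcript_hash(offer_seen_by_server, selected)
    return client_view != server_view
```

In an interop matrix, the same check would run for every pair of stacks, with the tampering applied by a test proxy rather than in-process.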

Operational notes

  • Cap handshake cost per peer/IP; use stateless cookies when needed.
  • Add telemetry for negotiation outcomes, failures, and client cohorts.
  • Roll out with canaries and explicit rollback triggers.
  • Document supported algorithm sets and deprecation timelines.
  • Inventory long-lived secrets and migrate the highest-risk first.
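The per-peer cap in the first bullet above could be a token bucket keyed by peer address. The sketch below is illustrative: the rate and burst values are hypothetical, and time is passed in explicitly so the behavior can be tested deterministically.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Refill for elapsed time, then spend `cost` tokens if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets: Dict[str, TokenBucket] = {}


def admit_handshake(peer: str, now: Optional[float] = None) -> bool:
    """Admit at most 5 handshakes in a burst, 1 per second sustained, per peer."""
    bucket = buckets.get(peer)
    if bucket is None:
        bucket = buckets[peer] = TokenBucket(rate=1.0, capacity=5.0, now=now)
    return bucket.allow(now=now)
```

The `buckets` map grows with the number of distinct peers, so a production version would need eviction or a fixed-size structure; otherwise the limiter itself becomes a memory-exhaustion surface.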

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

  • Invariant violation rate (should be ~0).
  • Retry/timeout rates by endpoint and client cohort.
  • Admission-control / rate-limit rejections (by reason).
  • Rollback events and the conditions that triggered them.
  • Authz failures and policy denials (unexpected spikes).

Rollback plan

  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Keep dual-write / dual-verify windows where appropriate.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Define an explicit rollback trigger (metrics + thresholds).
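An explicit trigger can be as simple as a table of metrics and thresholds that is evaluated on every monitoring tick. The metric names and threshold values below are hypothetical placeholders; the one design choice worth noting is that a missing metric counts as a fired trigger, so a broken telemetry pipeline cannot mask a bad rollout.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float  # fire when the observed value exceeds this


# Hypothetical thresholds; tune from canary baselines.
TRIGGERS = [
    Trigger("handshake_failure_rate", 0.02),
    Trigger("invariant_violations_per_min", 0.0),
    Trigger("p99_handshake_ms", 250.0),
]


def fired_triggers(metrics: Dict[str, float]) -> List[str]:
    """Return the names of triggers that fired for this metrics snapshot."""
    fired = []
    for t in TRIGGERS:
        value = metrics.get(t.metric)
        # A missing metric is treated as a violation (fail closed).
        if value is None or value > t.threshold:
            fired.append(t.metric)
    return fired
```

A rollout controller would roll back as soon as `fired_triggers` returns a non-empty list and would record that list as the reason for the rollback.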

Evidence

  • RFC 5869: HKDF (1) — Useful when discussing hybrid binding and context separation.
    • Evidence: HKDF is the workhorse for domain separation; bind purpose/context to avoid cross-protocol key reuse.
  • Learn TLA+ (2) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

  • What is the worst-case handshake cost under attack?
  • Which clients will fail first, and what is the safe fallback behavior?
  • How do you rotate algorithms without introducing configuration chaos?
  • Where would a downgrade be visible today, and how would you detect it?

Checklist

  • Assumptions listed and reviewed.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Rollback plan rehearsed and automated.
  • Telemetry captures correctness signals.
  • Failure modes enumerated with mitigations.
  • Safety properties stated as invariants.

Further reading

1. Krawczyk H, Eronen P. HMAC-based Extract-and-Expand Key Derivation Function (HKDF) [Internet]. RFC Editor; 2010. Report No.: 5869. Available from: https://www.rfc-editor.org/rfc/rfc5869
2. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/