PQC in TLS: Negotiation, Downgrade, and Interop

Monthly research note. Theme: Post-Quantum Cryptography & Migration.

TL;DR

Treat PQC negotiation, downgrade resistance, and interop in TLS as engineering constraints: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

PQC changes handshake costs; plan DoS defenses and budgets.
Interop is the migration plan—test matrices are more important than whitepapers.
Migration is mixed-version for years: compatibility and rollback are security features.
Treat retries, reordering, and partial failure as default conditions.
Write assumptions down; treat them as interfaces.

Why this matters

Operationalization (monitoring, rollback) determines success more than crypto choice.
Constant-time implementation is harder to achieve and to verify when primitives are large.
Migration will be mixed-version for years; plan for it explicitly.
Hybrid designs fail if binding is ambiguous (mix-and-match, downgrade).

Key questions

How do you handle failures: decryption failures, invalid ciphertexts, malformed keys?
What does interoperability testing look like across vendors and stacks?
Which parts must be constant-time, and how will you validate that?
Which secrets require long-term confidentiality against harvest-now-decrypt-later (HNDL) collection, and where are they today?
What are the new DoS surfaces (bigger keys, more CPU, more bandwidth)?
What telemetry proves PQC is working (not just enabled)?

Assumptions

Vendors vary: implementations and defaults differ.
An active attacker can force retries, downgrades, and expensive handshakes.
Side channels exist: timing and cache behavior leak information.
Deployments are mixed; old clients must interoperate or fail safely.

Non-goals

Assuming PQC is “drop-in” without changing operational processes.
Relying on silent fallback to weaker modes during interop failures.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.

Model & invariants

A post-quantum KEM (for example ML-KEM) gives you a shared secret without relying on discrete-log or factoring assumptions:

$$(\mathrm{pk},\mathrm{sk})\leftarrow \mathrm{KeyGen}();\quad (\mathrm{ct},\mathrm{ss})\leftarrow \mathrm{Enc}(\mathrm{pk});\quad \mathrm{ss}\leftarrow \mathrm{Dec}(\mathrm{sk},\mathrm{ct}).$$

Binding is the whole game: make the transcript an input to the KDF.

Make costs explicit: measure CPU and bandwidth, then add protections.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
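One way to make such a state observable is a counter that increments whenever a handshake ends in a combination the design says cannot happen. The sketch below assumes one invariant, "if both peers support hybrid, hybrid is negotiated", and a hypothetical `HandshakeTelemetry` class; neither the class nor the field names come from any real TLS stack.

```python
from collections import Counter


class HandshakeTelemetry:
    """Counts negotiation outcomes and flags the 'impossible' one (hypothetical helper)."""

    def __init__(self) -> None:
        self.outcomes = Counter()
        self.violations = 0

    def record(self, client_offered_hybrid: bool,
               server_supports_hybrid: bool,
               negotiated_hybrid: bool) -> None:
        mode = "hybrid" if negotiated_hybrid else "classical"
        self.outcomes[mode] += 1
        # Assumed invariant: when both peers support hybrid, hybrid must win.
        if client_offered_hybrid and server_supports_hybrid and not negotiated_hybrid:
            self.violations += 1

    def violation_rate(self) -> float:
        total = sum(self.outcomes.values())
        return self.violations / total if total else 0.0


telemetry = HandshakeTelemetry()
telemetry.record(True, True, True)     # expected hybrid handshake
telemetry.record(True, False, False)   # legitimate classical fallback
telemetry.record(True, True, False)    # the state that should never occur
```

An alert on `violation_rate() > 0` then separates "PQC is enabled" from "PQC is actually being negotiated".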

Security properties

Replay resistance: duplicated inputs do not change outcomes.
Downgrade resistance: negotiation can’t silently weaken security posture.
Evidence: critical actions emit verifiable audit events.
Authenticity: actions are bound to identity and purpose.

Failure modes

Observability gaps during incidents (missing evidence).
Timeout ambiguity causing double-apply or partial state transitions.
Mixed-version behavior that violates assumptions silently.
Recovery paths that only work when nothing is broken.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

sequenceDiagram
  participant A as Initiator
  participant B as Responder
  A->>B: classical_keyshare + pqc_pk
  B-->>A: classical_keyshare + pqc_ct + sig
  A-->>B: sig
  Note over A,B: ss = HKDF(ss_classical || ss_pqc, transcript)

Implementation notes

Explicit binding prevents downgrade and mix-and-match. Don’t leave it implicit.
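A minimal sketch of explicit binding, assuming SHA-256, fixed-length 32-byte component secrets, and a made-up label `hybrid-demo v1`. HKDF is written out from RFC 5869 using only the standard library. The function name `derive_hybrid_secret` and the transcript encoding are illustrative, not the TLS 1.3 key schedule.

```python
import hashlib
import hmac

HASH = hashlib.sha256
HASH_LEN = HASH().digest_size


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """RFC 5869 Extract: PRK = HMAC-Hash(salt, IKM)."""
    if not salt:
        salt = b"\x00" * HASH_LEN
    return hmac.new(salt, ikm, HASH).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 Expand: T(i) = HMAC-Hash(PRK, T(i-1) || info || i)."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), HASH).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_hybrid_secret(ss_classical: bytes, ss_pqc: bytes,
                         transcript: bytes, length: int = 32) -> bytes:
    """Combine both shared secrets and bind the result to the transcript."""
    # Plain concatenation is only unambiguous because both inputs are
    # assumed to be fixed-length; otherwise length-prefix each one.
    prk = hkdf_extract(b"", ss_classical + ss_pqc)
    # The label and transcript hash go into `info`, so a key derived for one
    # negotiation cannot be reused for a different one.
    info = b"hybrid-demo v1" + HASH(transcript).digest()
    return hkdf_expand(prk, info, length)
```

If an attacker alters any negotiated parameter, the two peers hash different transcripts and derive different keys, so the handshake fails instead of silently completing in a weaker mode.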

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
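As an illustration of that rule, the sketch below rejects malformed key shares by length alone, before any decapsulation or key generation runs. The sizes are the ML-KEM-768 encapsulation key (1184 bytes, per FIPS 203) and an X25519 share (32 bytes). The cap `MAX_OFFERED_SHARES` and the names `CheapReject` and `precheck_key_shares` are assumptions made for this example.

```python
X25519_SHARE_LEN = 32
MLKEM768_EK_LEN = 1184  # ML-KEM-768 encapsulation key size (FIPS 203)

# Expected client key share length per group.
EXPECTED_CLIENT_SHARE_LEN = {
    "x25519": X25519_SHARE_LEN,
    "X25519MLKEM768": MLKEM768_EK_LEN + X25519_SHARE_LEN,
}
MAX_OFFERED_SHARES = 4  # assumed policy cap, not taken from any spec


class CheapReject(Exception):
    """Raised before any expensive cryptographic work is attempted."""


def precheck_key_shares(shares: list[tuple[str, bytes]]) -> list[tuple[str, bytes]]:
    """Return the shares worth processing; raise on obviously bad input."""
    if len(shares) > MAX_OFFERED_SHARES:
        raise CheapReject("too many key shares offered")
    accepted = []
    for group, data in shares:
        expected = EXPECTED_CLIENT_SHARE_LEN.get(group)
        if expected is None:
            continue  # unsupported group: skip it, do not spend work on it
        if len(data) != expected:
            raise CheapReject(f"bad share length for {group}")
        accepted.append((group, data))
    return accepted
```

Every check here is a table lookup or a length comparison, so the cost of rejecting bad input stays small compared with the cryptographic work it avoids.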

Hybrid handshake checklist:
- Explicit negotiation (no silent downgrade)
- Transcript-bound KDF
- DoS protections (rate limits, cookies, puzzles)
- Constant-time operations
- Telemetry: which mode, which failures, which clients
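The first checklist item can be made concrete with a selection function that fails closed. This is a sketch under assumptions: `HYBRID_GROUPS`, `NegotiationError`, and the `require_hybrid` policy flag are hypothetical names, and real stacks expose group selection through their own configuration.

```python
from typing import Sequence

HYBRID_GROUPS = frozenset({"X25519MLKEM768"})  # assumed set of hybrid groups


class NegotiationError(Exception):
    """Raised instead of silently falling back to a weaker mode."""


def select_group(server_preference: Sequence[str],
                 client_offer: Sequence[str],
                 require_hybrid: bool) -> str:
    """Pick the first server-preferred group that the client offers and policy allows."""
    offered = set(client_offer)
    candidates = [
        group for group in server_preference
        if group in offered and (not require_hybrid or group in HYBRID_GROUPS)
    ]
    if not candidates:
        reason = "no hybrid group in common" if require_hybrid else "no common group"
        raise NegotiationError(reason)
    return candidates[0]
```

With `require_hybrid=True`, a client that offers only `x25519` gets an explicit error that can be counted per client cohort, rather than a quiet classical session.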

Verification strategy

DoS tests: measure CPU/bandwidth amplification and mitigation impact.
Downgrade tests: active attacker manipulates negotiation.
Chaos deploys: mixed versions + rollback during partial outages.
Side-channel tests where tooling exists; constant-time audits.
Interop matrices across vendors/versions and failure modes.
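A downgrade test can start as a small simulation before it becomes a network harness. The sketch below assumes a simplified handshake in which each side MACs a hash of the offer as it saw it; the fixed `handshake_key` stands in for a secret the attacker does not know. The function names are illustrative only.

```python
import hashlib
import hmac

HYBRID = "X25519MLKEM768"


def encode_offer(groups: list[str]) -> bytes:
    """Toy encoding of the client's offered groups."""
    return b"supported_groups:" + ",".join(groups).encode("ascii")


def finished_mac(handshake_key: bytes, transcript: bytes) -> bytes:
    """MAC over the transcript hash, in the spirit of a Finished message."""
    digest = hashlib.sha256(transcript).digest()
    return hmac.new(handshake_key, digest, hashlib.sha256).digest()


def handshake_completes(strip_hybrid: bool) -> bool:
    """Return True if both sides agree on the transcript."""
    offer = [HYBRID, "x25519"]
    sent = encode_offer(offer)
    if strip_hybrid:
        # Active attacker removes the hybrid group in flight.
        seen_by_server = encode_offer([g for g in offer if g != HYBRID])
    else:
        seen_by_server = sent
    handshake_key = b"\x07" * 32  # stand-in for the negotiated secret
    client_mac = finished_mac(handshake_key, sent)
    server_expected = finished_mac(handshake_key, seen_by_server)
    return hmac.compare_digest(client_mac, server_expected)
```

The property under test is that `handshake_completes(True)` is false: tampering with negotiation must break the handshake rather than produce a working classical session.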

Operational notes

Cap handshake cost per peer/IP; use stateless cookies when needed.
Inventory long-lived secrets and migrate the highest-risk first.
Document supported algorithm sets and deprecation timelines.
Add telemetry for negotiation outcomes, failures, and client cohorts.
Roll out with canaries and explicit rollback triggers.
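For the stateless-cookie item, one possible shape is an HMAC over the peer address and a coarse time window, so the server stores nothing until the peer proves it can receive traffic at that address. The 30-second window, the one-window grace period, and the function names are assumptions for this sketch.

```python
import hashlib
import hmac

COOKIE_WINDOW_SECONDS = 30  # assumed validity granularity


def _cookie_mac(secret: bytes, peer: str, window: int) -> bytes:
    message = peer.encode("utf-8") + b"|" + window.to_bytes(8, "big")
    return hmac.new(secret, message, hashlib.sha256).digest()


def make_cookie(secret: bytes, peer: str, now: float) -> bytes:
    """Issue a cookie bound to the peer address and the current time window."""
    return _cookie_mac(secret, peer, int(now) // COOKIE_WINDOW_SECONDS)


def verify_cookie(secret: bytes, peer: str, cookie: bytes, now: float) -> bool:
    """Accept cookies from the current or the immediately previous window."""
    window = int(now) // COOKIE_WINDOW_SECONDS
    return any(
        hmac.compare_digest(cookie, _cookie_mac(secret, peer, w))
        for w in (window, window - 1)
    )
```

Expensive KEM operations then run only after `verify_cookie` succeeds, which limits what a spoofed-source flood can make the server compute.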

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Authz failures and policy denials (unexpected spikes).
Retry/timeout rates by endpoint and client cohort.
Rollback events and the conditions that triggered them.
Error budget burn + tail latency under load.
Invariant violation rate (should be ~0).

Rollback plan

Keep dual-write / dual-verify windows where appropriate.
Define an explicit rollback trigger (metrics + thresholds).
Prefer backward-compatible changes; avoid “flag day” upgrades.
Use canaries and staged rollout; stop early when signals degrade.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
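An explicit trigger can be as small as a list of metric names with thresholds, evaluated on every canary step. The metric names and threshold values below are placeholders, not recommendations. A missing metric is treated as a breach, on the assumption that lost telemetry during a rollout is itself a reason to stop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float  # roll back when the observed value exceeds this


def breached_triggers(metrics: dict[str, float],
                      triggers: list[Trigger]) -> list[str]:
    """Return the names of all triggers that currently call for rollback."""
    return [
        t.metric for t in triggers
        # A missing metric counts as infinitely bad: no evidence, no rollout.
        if metrics.get(t.metric, float("inf")) > t.threshold
    ]


TRIGGERS = [
    Trigger("handshake_failure_rate", 0.01),   # placeholder threshold
    Trigger("p99_handshake_ms", 250.0),        # placeholder threshold
    Trigger("invariant_violation_rate", 0.0),  # any violation stops the rollout
]

observed = {"handshake_failure_rate": 0.002, "p99_handshake_ms": 310.0}
to_roll_back = breached_triggers(observed, TRIGGERS)
```

Here `to_roll_back` names the latency trigger and the missing invariant metric, which gives the rollback record a stated cause.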

Evidence

RFC 5869: HKDF (1) — Useful when discussing hybrid binding and context separation.
- Evidence: HKDF is the workhorse for domain separation; bind purpose/context to avoid cross-protocol key reuse.
Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

Where would a downgrade be visible today, and how would you detect it?
What is the worst-case handshake cost under attack?
How do you rotate algorithms without introducing configuration chaos?
Which clients will fail first, and what is the safe fallback behavior?

Checklist

Telemetry captures correctness signals.
Safety properties stated as invariants.
Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Assumptions listed and reviewed.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
