Monthly research note. Theme: Post-Quantum Cryptography & Migration.

TL;DR

Treat side channels in PQC implementations, the point where theory meets cache, as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Interop is the migration plan: test matrices matter more than whitepapers.
  • Hybrid composition must be explicit and transcript-bound to resist downgrade.
  • Migration is mixed-version for years: compatibility and rollback are security features.
  • Make failure modes explicit and observable.
  • Define safety properties before performance goals.

Why this matters

  • PQC changes bandwidth and CPU costs; DoS surfaces move.
  • Operationalization (monitoring, rollback) determines success more than crypto choice.
  • Interop is the real risk: multiple stacks, vendors, and versions.
  • Hybrid designs fail if binding is ambiguous (mix-and-match, downgrade).

Key questions

  • What are the new DoS surfaces (bigger keys, more CPU, more bandwidth)?
  • What does interoperability testing look like across vendors and stacks?
  • How do you handle failures: decryption failures, invalid ciphertexts, malformed keys?
  • Which secrets require long-term confidentiality against harvest-now, decrypt-later (HNDL) collection, and where are they today?
  • What telemetry proves PQC is working (not just enabled)?
  • Which parts must be constant-time, and how will you validate that?

Assumptions

  • Bandwidth is limited in some environments; larger handshakes matter.
  • Side channels exist: timing and cache behavior leak information.
  • Deployments are mixed; old clients must interoperate or fail safely.
  • Active attacker can force retries, downgrades, and expensive handshakes.
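The side-channel assumption has a small, everyday instance: comparing secret-derived bytes (a MAC tag, a key-confirmation value) with an early-exit loop. The sketch below contrasts a leaky comparison with Python's `hmac.compare_digest`. The function names `leaky_equal` and `tags_match` are illustrative and not from any particular stack, and this only addresses timing in the comparison itself, not cache behavior inside a KEM implementation.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: running time depends on where the first mismatch is."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # returns sooner the earlier the mismatch occurs
    return True


def tags_match(expected: bytes, received: bytes) -> bool:
    """Comparison via hmac.compare_digest, which is designed to avoid
    content-dependent timing when comparing secret values."""
    return hmac.compare_digest(expected, received)
```

Both functions return the same answers; the difference is only in what their running time reveals, which is why functional tests alone cannot validate constant-time behavior.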

Non-goals

  • Relying on silent fallback to weaker modes during interop failures.
  • Ignoring DoS implications of large primitives.

Attack surface

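Observability pipelines are themselves attack surface: attacker-controlled values can cause metric cardinality explosions or log injection. Bound label cardinality and sanitize untrusted fields before they reach logs or metrics.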

Model & invariants

A post-quantum KEM establishes a shared secret without relying on discrete-log or factoring assumptions. Its interface is three algorithms:

$$(\mathrm{pk},\mathrm{sk})\leftarrow \mathrm{KeyGen}();\ (\mathrm{ct},\mathrm{ss})\leftarrow \mathrm{Enc}(\mathrm{pk});\ \mathrm{ss}\leftarrow \mathrm{Dec}(\mathrm{sk},\mathrm{ct}).$$

Treat algorithm negotiation as adversarial and build in explicit downgrade resistance.
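A minimal sketch of what adversarial negotiation can look like, under stated assumptions: the group labels, the `ACCEPTABLE` policy, and the helper names are hypothetical, and the policy of refusing classical-only peers is one possible choice rather than a recommendation for every deployment. The selection fails closed when no acceptable group is shared, and both the offered and supported lists are hashed so that an attacker who strips options from the offer changes the digest.

```python
import hashlib
from typing import Sequence

# Hypothetical policy: acceptable groups in preference order (strongest first).
ACCEPTABLE = ("x25519+mlkem768", "mlkem768")


class NegotiationError(Exception):
    """Raised when no acceptable group is shared; the handshake fails closed."""


def negotiate(client_offer: Sequence[str], server_supported: Sequence[str]) -> str:
    """Pick the most preferred acceptable group offered by both sides."""
    common = set(client_offer) & set(server_supported)
    for group in ACCEPTABLE:
        if group in common:
            return group
    raise NegotiationError("no acceptable common group; failing closed")


def _encode_list(items: Sequence[str]) -> bytes:
    """Length-prefixed encoding so list boundaries are unambiguous."""
    out = len(items).to_bytes(2, "big")
    for item in items:
        raw = item.encode("utf-8")
        out += len(raw).to_bytes(2, "big") + raw
    return out


def negotiation_digest(client_offer: Sequence[str],
                       server_supported: Sequence[str],
                       selected: str) -> bytes:
    """Hash everything that influenced the choice, not just the choice."""
    h = hashlib.sha256()
    h.update(b"negotiation-demo v1")  # illustrative domain-separation label
    h.update(_encode_list(client_offer))
    h.update(_encode_list(server_supported))
    h.update(_encode_list([selected]))
    return h.digest()
```

If this digest feeds the key derivation, two peers that saw different offers derive different keys and the handshake fails instead of silently completing in a weaker mode.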

Binding is the whole game: make the transcript an input to the KDF.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
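One way to make an "impossible state" observable is to count it instead of asserting and crashing. The sketch below uses a plain `collections.Counter` as a stand-in for a real metrics client; the invariant names, the `record_handshake` helper, and the substring check for a PQ group are all hypothetical.

```python
from collections import Counter

# Stand-in for a metrics client; a real system would export these as counters.
INVARIANT_VIOLATIONS = Counter()


def check_invariant(name: str, holds: bool) -> bool:
    """Record a violation under `name` when the invariant does not hold."""
    if not holds:
        INVARIANT_VIOLATIONS[name] += 1
    return holds


def record_handshake(negotiated: str, offered_by_both: set, pq_required: bool) -> None:
    """Check two illustrative invariants after each completed handshake."""
    # The selected group must be one that both peers actually offered.
    check_invariant("selected_group_was_offered", negotiated in offered_by_both)
    # If policy requires PQ, the negotiated group must contain a PQ component.
    check_invariant("pq_required_implies_pq_group",
                    (not pq_required) or ("mlkem" in negotiated))
```

An alert on any non-zero value of these counters turns invariant drift into a page rather than a postmortem finding.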

Security properties

  • Least authority: privileges are scoped by purpose and time.
  • Integrity: invalid transitions are rejected (and detectable).
  • Authenticity: actions are bound to identity and purpose.
  • Evidence: critical actions emit verifiable audit events.

Failure modes

  • Config drift that weakens security posture over time.
  • Observability gaps during incidents (missing evidence).
  • Recovery paths that only work when nothing is broken.
  • Timeout ambiguity causing double-apply or partial state transitions.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart TD
  negotiate["Negotiate Algorithms"] --> bind["Bind Transcript"]
  bind --> kdf["KDF (hybrid)"]
  kdf --> keys["Traffic Keys"]
  keys --> monitor["Monitor + Rollback"]

Implementation notes

Explicit binding prevents downgrade and mix-and-match. Don’t leave it implicit.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
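A sketch of that ordering for an incoming KEM ciphertext: cheap checks first, expensive decapsulation only after admission. The 1088-byte ciphertext length is the ML-KEM-768 value from FIPS 203; the message cap, the in-flight limit, and the names `admit_key_share` and `RejectedRequest` are assumptions made for illustration.

```python
ML_KEM_768_CT_LEN = 1088   # ML-KEM-768 ciphertext size in bytes (FIPS 203)
MAX_MESSAGE_LEN = 4096     # hypothetical cap on any handshake message


class RejectedRequest(Exception):
    """Carries a short reason string suitable for a rejection metric."""

    def __init__(self, reason: str):
        super().__init__(reason)
        self.reason = reason


def admit_key_share(message: bytes, in_flight: int, max_in_flight: int = 64) -> bytes:
    """Run constant-cost checks before any expensive cryptographic work."""
    if len(message) > MAX_MESSAGE_LEN:
        raise RejectedRequest("oversized")
    if in_flight >= max_in_flight:
        raise RejectedRequest("overloaded")
    if len(message) != ML_KEM_768_CT_LEN:
        raise RejectedRequest("bad_ciphertext_length")
    # Only now is the ciphertext handed to decapsulation (not shown here).
    return message
```

The reason strings double as labels for the "rejections by reason" telemetry, and keeping them to a fixed small set avoids creating a cardinality problem in the metrics pipeline.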

// Hybrid binding sketch (pseudocode):
// ss = HKDF(ss_classical || ss_pqc, info=transcript_hash)
// Then derive traffic keys from ss.
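The pseudocode above can be made concrete with the standard library alone. This is a sketch, not a vetted construction: HKDF follows RFC 5869 with SHA-256, while the 32-byte length checks, the `hybrid-demo v1` label, and the function name `hybrid_secret` are assumptions. Fixing both input lengths keeps the concatenation unambiguous; variable-length secrets would need length prefixes.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """RFC 5869 extract step; an empty salt defaults to HASH_LEN zero bytes."""
    if not salt:
        salt = bytes(HASH_LEN)
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 expand step; output is capped at 255 hash blocks."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_secret(ss_classical: bytes, ss_pqc: bytes,
                  transcript_hash: bytes, length: int = 32) -> bytes:
    """Combine both shared secrets and bind the result to the transcript."""
    if len(ss_classical) != 32 or len(ss_pqc) != 32:
        raise ValueError("this sketch assumes two 32-byte shared secrets")
    ikm = ss_classical + ss_pqc  # fixed lengths, so no ambiguity at the boundary
    prk = hkdf_extract(b"", ikm)
    # The label is an illustrative domain separator; the transcript hash binds
    # the output to the negotiation and key shares both peers observed.
    return hkdf_expand(prk, b"hybrid-demo v1" + transcript_hash, length)
```

Because the transcript hash is part of `info`, peers that disagree about any negotiated parameter derive different secrets, which is what turns a downgrade attempt into a handshake failure.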

Verification strategy

  • DoS tests: measure CPU/bandwidth amplification and mitigation impact.
  • Side-channel tests where tooling exists; constant-time audits.
  • Downgrade tests: active attacker manipulates negotiation.
  • Chaos deploys: mixed versions + rollback during partial outages.
  • Interop matrices across vendors/versions and failure modes.
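The interop matrix in the list above can start as a table of expected outcomes that a test harness then compares against real handshakes. The sketch below only generates the expectations; the stack names, group labels, and the `require_hybrid` policy switch are hypothetical.

```python
from itertools import product

HYBRID = "x25519+mlkem768"
CLASSICAL = "x25519"

# Hypothetical stacks and the groups each one supports.
CLIENTS = {"client-v1": {CLASSICAL}, "client-v2": {CLASSICAL, HYBRID}}
SERVERS = {"server-v1": {CLASSICAL}, "server-v2": {CLASSICAL, HYBRID}}


def expected_outcome(client_groups: set, server_groups: set, require_hybrid: bool) -> str:
    """State what SHOULD happen for one client/server pairing."""
    common = client_groups & server_groups
    if HYBRID in common:
        return HYBRID
    if not require_hybrid and CLASSICAL in common:
        return CLASSICAL
    return "fail_closed"


def interop_matrix(require_hybrid: bool) -> dict:
    """Expected outcome for every (client, server) pair."""
    return {
        (c_name, s_name): expected_outcome(c_groups, s_groups, require_hybrid)
        for (c_name, c_groups), (s_name, s_groups)
        in product(CLIENTS.items(), SERVERS.items())
    }
```

Writing `fail_closed` into the matrix as an expected result makes the failure mode a tested behavior rather than an accident of whichever stack happens to reject first.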

Operational notes

  • Cap handshake cost per peer/IP; use stateless cookies when needed.
  • Add telemetry for negotiation outcomes, failures, and client cohorts.
  • Roll out with canaries and explicit rollback triggers.
  • Document supported algorithm sets and deprecation timelines.
  • Inventory long-lived secrets and migrate the highest-risk ones first.
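The per-peer handshake cap from the first bullet can be sketched as a token bucket keyed by peer. The rates, the class names, and the injected clock are assumptions for illustration. The bucket table here is unbounded, which is itself a DoS surface; a real deployment would bound and evict it or fall back to stateless cookies.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second up to `burst`."""

    def __init__(self, rate: float, burst: float, now: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.updated = now

    def allow(self, cost: float, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class HandshakeLimiter:
    """One bucket per peer; `clock` is injectable so tests can control time."""

    def __init__(self, rate: float = 5.0, burst: float = 10.0, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets = {}  # NOTE: unbounded in this sketch; bound it in practice

    def allow(self, peer: str, cost: float = 1.0) -> bool:
        now = self.clock()
        bucket = self.buckets.get(peer)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, now)
            self.buckets[peer] = bucket
        return bucket.allow(cost, now)
```

The `cost` parameter lets a hybrid handshake be charged more than a resumed session, so the cap tracks CPU spent rather than request count.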

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Retry/timeout rates by endpoint and client cohort.
  • Authz failures and policy denials (unexpected spikes).
  • Invariant violation rate (should be ~0).
  • Admission-control / rate-limit rejections (by reason).
  • Rollback events and the conditions that triggered them.

Rollback plan

  • Use canaries and staged rollout; stop early when signals degrade.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Keep dual-write / dual-verify windows where appropriate.
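An explicit rollback trigger can be as small as a pure function from canary metrics to a list of reasons. The sketch below is illustrative only: the threshold values, field names, and minimum sample size are assumptions, not recommended numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    max_handshake_failure_rate: float = 0.02   # hypothetical: 2% of handshakes
    max_p99_latency_ms: float = 250.0          # hypothetical latency budget
    max_invariant_violations: int = 0          # any violation triggers rollback


@dataclass(frozen=True)
class CanaryMetrics:
    handshakes: int
    handshake_failures: int
    p99_latency_ms: float
    invariant_violations: int


def rollback_reasons(m: CanaryMetrics, t: Thresholds = Thresholds(),
                     min_samples: int = 100) -> List[str]:
    """Return every tripped condition; an empty list means keep rolling out."""
    reasons = []
    # Invariant violations are evaluated regardless of sample size.
    if m.invariant_violations > t.max_invariant_violations:
        reasons.append("invariant_violation")
    # Rate and latency signals are only trusted once enough samples exist.
    if m.handshakes >= min_samples:
        if m.handshake_failures / m.handshakes > t.max_handshake_failure_rate:
            reasons.append("handshake_failure_rate")
        if m.p99_latency_ms > t.max_p99_latency_ms:
            reasons.append("p99_latency")
    return reasons
```

Returning the reasons rather than a boolean gives the rollback event something to record, which supports the "rollback events and the conditions that triggered them" signal in the monitoring list.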

Evidence

  • RFC 5869: HKDF (1) — Useful when discussing hybrid binding and context separation.
    • Evidence: HKDF is the workhorse for domain separation; bind purpose/context to avoid cross-protocol key reuse.
  • Jepsen (2) — Fault injection and correctness testing for distributed systems.
    • Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

  • How do you rotate algorithms without introducing configuration chaos?
  • Where would a downgrade be visible today, and how would you detect it?
  • Which clients will fail first, and what is the safe fallback behavior?
  • What is the worst-case handshake cost under attack?

Checklist

  • Safety properties stated as invariants.
  • Failure modes enumerated with mitigations.
  • Telemetry captures correctness signals.
  • Rollback plan rehearsed and automated.
  • Assumptions listed and reviewed.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

Further reading

1. Krawczyk H, Eronen P. HMAC-based Extract-and-Expand Key Derivation Function (HKDF). RFC 5869. RFC Editor; 2010. Available from: https://www.rfc-editor.org/rfc/rfc5869
2. Jepsen. Jepsen: Distributed Systems Safety Analysis. Available from: https://jepsen.io/