Migration Risk Management: Inventory, Prioritization, and Cutover

Monthly research note. Theme: Post-Quantum Cryptography & Migration.

TL;DR

Treat migration risk management (inventory, prioritization, and cutover) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

Hybrid composition must be explicit and transcript-bound to resist downgrade.
PQC changes handshake costs; plan DoS defenses and budgets.
Migration is mixed-version for years: compatibility and rollback are security features.
Measure correctness signals, not only latency/throughput.
Define safety properties before performance goals.

Why this matters

Constant-time implementation is harder to achieve and to audit when primitives have large keys and ciphertexts.
Hybrid designs fail if binding is ambiguous (mix-and-match, downgrade).
Operationalization (monitoring, rollback) determines success more than crypto choice.
Migration will be mixed-version for years; plan for it explicitly.

Key questions

What does interoperability testing look like across vendors and stacks?
What are the new DoS surfaces (bigger keys, more CPU, more bandwidth)?
How do you handle failures: decryption failures, invalid ciphertexts, malformed keys?
How do you bind hybrid secrets to prevent downgrade and mix-and-match attacks?
Which parts must be constant-time, and how will you validate that?
How do you rotate algorithms safely (crypto agility without chaos)?

Assumptions

Side channels exist: timing and cache behavior leak information.
Bandwidth is limited in some environments; larger handshakes matter.
An active attacker can force retries, downgrades, and expensive handshakes.
Deployments are mixed; old clients must interoperate or fail safely.

Non-goals

Treating migration as a single flag flip.
Assuming PQC is “drop-in” without changing operational processes.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
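
One way to bound per-request work is a per-peer token bucket that is charged before any expensive cryptography runs. The sketch below is illustrative only: `HandshakeBudget`, the cost units, and the default capacity and refill rate are hypothetical names and values, not measured figures.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class HandshakeBudget:
    """Per-peer token bucket; each handshake spends tokens equal to its cost."""

    capacity: float = 10.0       # hypothetical burst allowance, in cost units
    refill_per_sec: float = 1.0  # hypothetical sustained rate, cost units per second
    # peer -> (tokens remaining, time of last update).
    # Assumption: a real deployment would also bound or expire this table,
    # otherwise the table itself becomes unbounded per-request state.
    buckets: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def allow(self, peer: str, cost: float, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(peer, (self.capacity, now))
        # Refill for the elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if cost > tokens:
            self.buckets[peer] = (tokens, now)
            return False  # reject before doing any expensive work
        self.buckets[peer] = (tokens - cost, now)
        return True
```

Passing `now` explicitly keeps the limiter deterministic in tests; in production the monotonic clock is used. A hybrid handshake could be charged a higher `cost` than a classical one once CPU and bandwidth have been measured.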

Model & invariants

A post-quantum KEM (for example ML-KEM) gives both parties a shared secret without relying on discrete-log or factoring assumptions:

(\mathrm{pk},\mathrm{sk})\leftarrow \mathrm{KeyGen}();\ (\mathrm{ct},\mathrm{ss})\leftarrow \mathrm{Enc}(\mathrm{pk});\ \mathrm{ss}\leftarrow \mathrm{Dec}(\mathrm{sk},\mathrm{ct}).

Binding is the whole game: make the transcript an input to the KDF.

Make costs explicit: measure CPU and bandwidth, then add protections.

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

Evidence: critical actions emit verifiable audit events.
Integrity: invalid transitions are rejected (and detectable).
Least authority: privileges are scoped by purpose and time.
Downgrade resistance: negotiation can’t silently weaken security posture.

Failure modes

Recovery paths that only work when nothing is broken.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Mixed-version behavior that violates assumptions silently.
Observability gaps during incidents (missing evidence).

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

sequenceDiagram
  participant A as Initiator
  participant B as Responder
  A->>B: classical_keyshare + pqc_pk
  B-->>A: classical_keyshare + pqc_ct + sig
  A-->>B: sig
  Note over A,B: ss = HKDF(ss_classical || ss_pqc, info=transcript_hash)

Implementation notes

Interop tests are the migration plan; everything else is a hypothesis.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.

// Hybrid binding sketch (pseudocode):
// ss = HKDF(ss_classical || ss_pqc, info=transcript_hash)
// Then derive traffic keys from ss.
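
The binding pseudocode above could be made concrete roughly as follows. This is a minimal sketch using only the Python standard library, with HKDF written out per RFC 5869 over SHA-256. The function name `hybrid_secret`, the `hybrid-v1` label, and the all-zero salt are assumptions for illustration; plain concatenation is only unambiguous if both input secrets have fixed lengths.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869): PRK = HMAC(salt, IKM)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869): chain HMAC blocks until `length` bytes exist."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long for HKDF")
    out, block, counter = b"", b"", 1
    while len(out) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]


def hybrid_secret(ss_classical: bytes, ss_pqc: bytes, transcript: bytes) -> bytes:
    """Combine both shared secrets and bind the result to the transcript."""
    transcript_hash = hashlib.sha256(transcript).digest()
    # Assumption: both secrets are fixed-length, so plain concatenation
    # cannot be re-split in a different way by an attacker.
    ikm = ss_classical + ss_pqc
    prk = hkdf_extract(b"\x00" * HASH_LEN, ikm)
    # "hybrid-v1" is a hypothetical domain-separation label.
    return hkdf_expand(prk, b"hybrid-v1" + transcript_hash, HASH_LEN)
```

Because the transcript hash enters the `info` input, two peers that saw different negotiation messages derive different secrets, so a tampered offer surfaces as a key mismatch rather than a silently weaker session.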

Verification strategy

DoS tests: measure CPU/bandwidth amplification and mitigation impact.
Side-channel tests where tooling exists; constant-time audits.
Interop matrices across vendors/versions and failure modes.
Downgrade tests: active attacker manipulates negotiation.
Chaos deploys: mixed versions + rollback during partial outages.
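
The interop-matrix and downgrade items above might start from something as small as the sketch below. The suite labels, cohort names, and the `negotiate` policy are hypothetical; the point is that the expected outcome for every client/server pair, including "fail closed", is written down and asserted.

```python
from itertools import product

# Hypothetical suite labels, strongest first.
PREFERENCE = ["hybrid-x25519-mlkem768", "x25519"]

# Hypothetical cohorts and the suites each one supports.
COHORTS = {
    "legacy": ["x25519"],
    "upgraded": ["hybrid-x25519-mlkem768", "x25519"],
}


def negotiate(client_offers, server_supports, server_min="x25519"):
    """Pick the strongest mutually supported suite at or above the server
    minimum, or return None to fail closed."""
    floor = PREFERENCE.index(server_min)
    for suite in PREFERENCE[: floor + 1]:
        if suite in client_offers and suite in server_supports:
            return suite
    return None


def interop_matrix(server_min):
    """Expected negotiation outcome for every (client, server) cohort pair."""
    return {
        (c, s): negotiate(COHORTS[c], COHORTS[s], server_min)
        for c, s in product(COHORTS, COHORTS)
    }


def strip_pqc(offers):
    """Model an active attacker who removes PQC suites from the client offer."""
    return [s for s in offers if "mlkem" not in s]
```

With a permissive minimum, a stripped offer still negotiates `x25519`, which is exactly the silent downgrade that transcript binding and telemetry need to expose. With the minimum set to the hybrid suite, the same attack yields `None` and the connection fails closed.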

Operational notes

Document supported algorithm sets and deprecation timelines.
Roll out with canaries and explicit rollback triggers.
Add telemetry for negotiation outcomes, failures, and client cohorts.
Inventory long-lived secrets and migrate the highest-risk first.
Cap handshake cost per peer/IP; use stateless cookies when needed.
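
For the inventory item above, one possible ranking follows the reasoning often called Mosca's inequality: data is exposed when its required confidentiality lifetime plus the time needed to migrate exceeds the time until a relevant quantum attacker exists. The sketch below assumes hypothetical field names, and the horizon is a planning assumption supplied by the caller, not a prediction.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Asset:
    name: str
    algorithm: str            # e.g. "RSA-2048", "ECDH-P256"
    shelf_life_years: float   # how long the protected data must stay confidential
    migration_years: float    # estimated time to migrate this asset


def exposure_years(asset: Asset, horizon_years: float) -> float:
    """Positive result: the data outlives the assumed horizon by that margin."""
    return asset.shelf_life_years + asset.migration_years - horizon_years


def prioritize(assets: Iterable[Asset], horizon_years: float) -> List[Asset]:
    """Order assets from most exposed to least exposed."""
    return sorted(assets, key=lambda a: exposure_years(a, horizon_years), reverse=True)


# Hypothetical inventory entries for illustration.
inventory = [
    Asset("session-keys", "ECDH-P256", 0.1, 1.0),
    Asset("records-archive", "RSA-2048", 25.0, 3.0),
    Asset("firmware-signing", "ECDSA-P256", 10.0, 2.0),
]
ranked = [a.name for a in prioritize(inventory, horizon_years=10.0)]
```

Re-running the ranking under several horizon values shows which assets stay at the top regardless of the assumption; those are the ones to migrate first.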

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Retry/timeout rates by endpoint and client cohort.
Authz failures and policy denials (unexpected spikes).
Invariant violation rate (should be ~0).
Error budget burn + tail latency under load.
Rollback events and the conditions that triggered them.

Rollback plan

Keep dual-write / dual-verify windows where appropriate.
Use canaries and staged rollout; stop early when signals degrade.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Define an explicit rollback trigger (metrics + thresholds).
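
An explicit trigger, as called for in the list above, can be expressed as data plus one small function. The metric names and threshold values below are hypothetical placeholders. Treating a missing metric as a breach is a design choice made here so that an observability gap cannot mask a failing rollout.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical thresholds; real values should come from measured baselines.
TRIGGERS = (
    Threshold("handshake_failure_rate", 0.02),
    Threshold("invariant_violations", 0.0),
    Threshold("p99_handshake_ms", 250.0),
)


def breached(metrics: Dict[str, float], triggers: Sequence[Threshold] = TRIGGERS) -> List[str]:
    """Return the names of breached triggers, in trigger order.

    A metric that is absent counts as breached.
    """
    out = []
    for t in triggers:
        value = metrics.get(t.metric)
        if value is None or value > t.max_value:
            out.append(t.metric)
    return out


def should_roll_back(metrics: Dict[str, float], triggers: Sequence[Threshold] = TRIGGERS) -> bool:
    """Roll back as soon as any trigger is breached."""
    return bool(breached(metrics, triggers))
```

Logging the list returned by `breached` alongside each rollback gives the record of which conditions fired.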

Evidence

NIST Post-Quantum Cryptography Project (1) — Standardization process and algorithm selections.
- Evidence: Treat PQ migration as a program (inventory, interop, rollback). Use NIST status to drive prioritization and timelines.
Jepsen (2) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

Which clients will fail first, and what is the safe fallback behavior?
What is the worst-case handshake cost under attack?
How do you rotate algorithms without introducing configuration chaos?
Where would a downgrade be visible today, and how would you detect it?

Checklist

Failure modes enumerated with mitigations.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Rollback plan rehearsed and automated.
Assumptions listed and reviewed.
Telemetry captures correctness signals.
Safety properties stated as invariants.
