Monthly research note. Theme: Post-Quantum Cryptography & Migration.
TL;DR
Treat PQC on IoT devices, with its memory, CPU, and timing side-channel costs, as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
Correctness is cheaper to enforce at interfaces than to repair in production data.
Key takeaways
- PQC changes handshake costs; plan DoS defenses and budgets.
- Migration is mixed-version for years: compatibility and rollback are security features.
- Interop is the migration plan: test matrices matter more than whitepapers.
- Write assumptions down; treat them as interfaces.
- Prefer protocols and APIs that make invalid states hard to express.
Why this matters
- Operationalization (monitoring, rollback) determines success more than crypto choice.
- PQC changes bandwidth and CPU costs; DoS surfaces move.
- Constant-time implementation is harder to achieve and to audit for large primitives.
- Interop is the real risk: multiple stacks, vendors, and versions.
Key questions
- Which parts must be constant-time, and how will you validate that?
- What are the new DoS surfaces (bigger keys, more CPU, more bandwidth)?
- Which secrets require long-term confidentiality (HNDL) and where are they today?
- How do you rotate algorithms safely (crypto agility without chaos)?
- What does interoperability testing look like across vendors and stacks?
- How do you handle failures: decryption failures, invalid ciphertexts, malformed keys?
Assumptions
- Active attacker can force retries, downgrades, and expensive handshakes.
- Deployments are mixed; old clients must interoperate or fail safely.
- Bandwidth is limited in some environments; larger handshakes matter.
- Side channels exist: timing and cache behavior leak information.
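To make the bandwidth assumption concrete, here is a rough sketch comparing key-share bytes for a classical and a hybrid key exchange. The byte counts are the published X25519 and ML-KEM-768 (Kyber-768) encoding sizes; the 256-byte per-packet payload and the function names are illustrative assumptions, not figures from a specific deployment.

```python
import math

# Published encoding sizes in bytes.
X25519_SHARE = 32            # public key, sent in both directions
MLKEM768_PUBLIC_KEY = 1184   # client -> server (encapsulation key)
MLKEM768_CIPHERTEXT = 1088   # server -> client


def key_share_bytes(hybrid: bool) -> dict:
    """Key-exchange bytes per direction (key shares only, no framing)."""
    client = X25519_SHARE + (MLKEM768_PUBLIC_KEY if hybrid else 0)
    server = X25519_SHARE + (MLKEM768_CIPHERTEXT if hybrid else 0)
    return {"client_to_server": client, "server_to_client": server}


def packets_needed(payload_bytes: int, packet_payload: int = 256) -> int:
    """Minimum packets for a payload; packet_payload is an assumed link limit."""
    return math.ceil(payload_bytes / packet_payload)


classical = key_share_bytes(hybrid=False)
hybrid = key_share_bytes(hybrid=True)
for name, sizes in (("classical", classical), ("hybrid", hybrid)):
    for direction, size in sizes.items():
        print(f"{name:9} {direction:16} {size:5d} B  {packets_needed(size)} packet(s)")
```

On a constrained link the interesting number is usually the packet count rather than the byte count, since each extra packet adds loss and retransmission exposure.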
Non-goals
- Ignoring DoS implications of large primitives.
- Relying on silent fallback to weaker modes during interop failures.
Negotiation and fallbacks are where security silently becomes optional; treat them as hostile.
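One way to avoid silent fallback is to make the negotiation function fail closed. The sketch below assumes hypothetical group identifiers and a local `require_pqc` policy flag; it is not the negotiation logic of any particular TLS stack.

```python
from typing import Sequence

# Hypothetical group identifiers, ordered strongest-first by local policy.
LOCAL_PREFERENCE = ("x25519+mlkem768", "x25519")
ALLOWED = frozenset(LOCAL_PREFERENCE)


class NegotiationError(Exception):
    """Raised instead of falling back to a weaker mode."""


def negotiate(peer_offer: Sequence[str], require_pqc: bool) -> str:
    """Return the first locally preferred group that the peer also offers.

    Unknown groups are ignored. If policy requires PQC and the only common
    group is classical, the handshake is refused rather than downgraded.
    """
    offered = set(peer_offer) & ALLOWED
    for group in LOCAL_PREFERENCE:
        if group in offered:
            if require_pqc and "mlkem" not in group:
                raise NegotiationError("peer offers no acceptable PQC group")
            return group
    raise NegotiationError("no common group")


print(negotiate(["x25519", "x25519+mlkem768"], require_pqc=True))
```

The refusal is an explicit, countable event, which is what makes a downgrade attempt visible in telemetry.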
Model & invariants
Hybrid composition should be transcript-bound.
Binding is the whole game: make the transcript an input to the KDF.
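A minimal sketch of transcript-bound hybrid derivation, using HKDF-SHA256 (RFC 5869) built from the Python standard library. It assumes both shared secrets are exactly 32 bytes (true for X25519 and ML-KEM) so that plain concatenation is unambiguous; the function names are illustrative, and a production system should use a reviewed library and its protocol's own key schedule.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract: an empty salt defaults to HASH_LEN zero bytes."""
    return hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand: T(i) = HMAC(prk, T(i-1) || info || i)."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_secret(ss_classical: bytes, ss_pqc: bytes, transcript_hash: bytes) -> bytes:
    """Combine both secrets and bind the result to the handshake transcript."""
    if len(ss_classical) != 32 or len(ss_pqc) != 32:
        raise ValueError("expected two 32-byte shared secrets")
    prk = hkdf_extract(b"", ss_classical + ss_pqc)
    return hkdf_expand(prk, transcript_hash, 32)


# Placeholder inputs: real values come from the key exchange and handshake.
transcript = hashlib.sha256(b"ClientHello||ServerHello").digest()
print(hybrid_secret(b"\x01" * 32, b"\x02" * 32, transcript).hex())
```

Because the transcript hash is an input, an attacker who alters the negotiated algorithms changes the derived secret, and the handshake fails rather than completing in a weaker mode.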
Make costs explicit: measure CPU and bandwidth, then add protections.
Monotonicity beats timestamps: counters and epochs survive clock skew.
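The monotonicity point can be sketched as a guard that accepts a key or configuration epoch only if it is strictly newer than the last one seen. `EpochGuard` is a hypothetical name; on a real device the last accepted epoch would need to be persisted across reboots, which this sketch does not do.

```python
class EpochGuard:
    """Accept an epoch only if it strictly increases; no wall clock involved."""

    def __init__(self, last_epoch: int = 0) -> None:
        self.last_epoch = last_epoch

    def accept(self, epoch: int) -> bool:
        # Replayed or rolled-back epochs are rejected regardless of clock skew.
        if epoch <= self.last_epoch:
            return False
        self.last_epoch = epoch
        return True


guard = EpochGuard()
for epoch in (1, 1, 3, 2):
    print(epoch, guard.accept(epoch))
```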
Security properties
- Least authority: privileges are scoped by purpose and time.
- Integrity: invalid transitions are rejected (and detectable).
- Authenticity: actions are bound to identity and purpose.
- Downgrade resistance: negotiation can’t silently weaken security posture.
Failure modes
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Observability gaps during incidents (missing evidence).
- Mixed-version behavior that violates assumptions silently.
- Config drift that weakens security posture over time.
A recovery plan that isn’t exercised will fail when you need it.
Design sketch
flowchart TD
negotiate["Negotiate Algorithms"] --> bind["Bind Transcript"]
bind --> kdf["KDF (hybrid)"]
kdf --> keys["Traffic Keys"]
keys --> monitor["Monitor + Rollback"]
Implementation notes
Interop tests are the migration plan; everything else is a hypothesis.
Make rollbacks boring: if rollback is a hero move, it will fail.
// Hybrid binding sketch (pseudocode):
// ss = HKDF(ss_classical || ss_pqc, info=transcript_hash)
// Then derive traffic keys from ss.
Verification strategy
- Side-channel tests where tooling exists; constant-time audits.
- Downgrade tests: active attacker manipulates negotiation.
- Chaos deploys: mixed versions + rollback during partial outages.
- DoS tests: measure CPU/bandwidth amplification and mitigation impact.
- Interop matrices across vendors/versions and failure modes.
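An interop matrix can start as a generated table of expected outcomes that real handshake tests are then compared against. The profile names and group identifiers below are hypothetical, and the selection rule (client preference order, fail closed when nothing matches) is an assumption for illustration.

```python
from itertools import product

# Hypothetical deployment profiles and the groups each one supports,
# listed in preference order.
PROFILES = {
    "legacy": ["x25519"],
    "hybrid": ["x25519+mlkem768", "x25519"],
    "pqc-required": ["x25519+mlkem768"],
}


def expected_outcome(client_groups, server_groups) -> str:
    """First client-preferred group the server supports, else fail closed."""
    for group in client_groups:
        if group in server_groups:
            return group
    return "fail-closed"


def build_matrix(profiles=PROFILES) -> dict:
    """Expected result for every (client, server) profile pair."""
    return {
        (client, server): expected_outcome(profiles[client], profiles[server])
        for client, server in product(profiles, repeat=2)
    }


for (client, server), outcome in sorted(build_matrix().items()):
    print(f"{client:>12} -> {server:<12} {outcome}")
```

Any cell where an observed handshake differs from the expected outcome is either a bug or an undocumented fallback, and both are worth knowing about before rollout.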
Operational notes
- Add telemetry for negotiation outcomes, failures, and client cohorts.
- Roll out with canaries and explicit rollback triggers.
- Inventory long-lived secrets and migrate the highest-risk first.
- Document supported algorithm sets and deprecation timelines.
- Cap handshake cost per peer/IP; use stateless cookies when needed.
Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.
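Capping handshake cost per peer can be sketched as a token bucket keyed by peer identifier. The rate, burst, and class name are illustrative assumptions. The per-peer table is itself unbounded state, so a real deployment would bound or expire it, or use stateless cookies before admitting a peer to the table.

```python
import time


class HandshakeBudget:
    """Token bucket per peer: each handshake spends tokens, time refills them."""

    def __init__(self, rate_per_s: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock          # monotonic clock, immune to wall-clock jumps
        self._state = {}            # peer -> (tokens, last_refill_time)

    def allow(self, peer: str, cost: float = 1.0) -> bool:
        now = self.clock()
        tokens, last = self._state.get(peer, (self.burst, now))
        # Refill for the elapsed time, never above the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._state[peer] = (tokens, now)
        return allowed


budget = HandshakeBudget(rate_per_s=1.0, burst=2.0)
print([budget.allow("198.51.100.7") for _ in range(3)])
```

The `cost` parameter lets a more expensive handshake (for example, one involving a large signature verification) draw more from the budget than a cheap one.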
What to monitor
- Authz failures and policy denials (unexpected spikes).
- Invariant violation rate (should be ~0).
- Retry/timeout rates by endpoint and client cohort.
- Rollback events and the conditions that triggered them.
- Error budget burn + tail latency under load.
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Use canaries and staged rollout; stop early when signals degrade.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Keep dual-write / dual-verify windows where appropriate.
- Define an explicit rollback trigger (metrics + thresholds).
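An explicit rollback trigger can be as small as a pure function from metrics to a list of reasons. The metric names and threshold values below are hypothetical placeholders; real thresholds should come from baseline measurements of the deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values; tune against measured baselines.
    max_handshake_failure_rate: float = 0.02
    max_p99_handshake_ms: float = 800.0
    max_invariant_violations: int = 0


def rollback_reasons(metrics: dict, limits: Thresholds = Thresholds()) -> list:
    """Return every breached threshold; an empty list means keep rolling out."""
    reasons = []
    if metrics["handshake_failure_rate"] > limits.max_handshake_failure_rate:
        reasons.append("handshake failure rate above threshold")
    if metrics["p99_handshake_ms"] > limits.max_p99_handshake_ms:
        reasons.append("p99 handshake latency above threshold")
    if metrics["invariant_violations"] > limits.max_invariant_violations:
        reasons.append("invariant violations observed")
    return reasons


canary = {"handshake_failure_rate": 0.05, "p99_handshake_ms": 300.0, "invariant_violations": 0}
print(rollback_reasons(canary))
```

Returning reasons rather than a boolean means the rollback event carries its own evidence, which helps when reconstructing what changed.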
Evidence
- Learn TLA+ (1) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
- Jepsen (2) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Open questions
- How do you rotate algorithms without introducing configuration chaos?
- Which clients will fail first, and what is the safe fallback behavior?
- Where would a downgrade be visible today, and how would you detect it?
- What is the worst-case handshake cost under attack?
Checklist
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Telemetry captures correctness signals.
- Rollback plan rehearsed and automated.
- Safety properties stated as invariants.
- Failure modes enumerated with mitigations.
- Assumptions listed and reviewed.
Further reading
- NIST Post-Quantum Cryptography Project — Standardization process and algorithm selections.
- CRYSTALS-Kyber — KEM design and parameters commonly referenced in deployments.
- CRYSTALS-Dilithium — Signature scheme design and deployment constraints.
- RFC 5869: HKDF — Useful when discussing hybrid binding and context separation.
- Jepsen — Fault injection and correctness testing for distributed systems.
- Learn TLA+ — Practical entry point for specification and model checking.