Monthly research note. Theme: Post-Quantum Cryptography & Migration.

TL;DR

PQC for IoT: treat memory, CPU, and timing side channels as engineering constraints. Write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • PQC changes handshake costs; plan DoS defenses and budgets.
  • Migration is mixed-version for years: compatibility and rollback are security features.
  • Interop is the migration plan—test matrices are more important than whitepapers.
  • Write assumptions down; treat them as interfaces.
  • Prefer protocols and APIs that make invalid states hard to express.

Why this matters

  • Operationalization (monitoring, rollback) determines success more than crypto choice.
  • PQC changes bandwidth and CPU costs; DoS surfaces move.
  • Constant-time constraints are harder under large primitives.
  • Interop is the real risk: multiple stacks, vendors, and versions.

Key questions

  • Which parts must be constant-time, and how will you validate that?
  • What are the new DoS surfaces (bigger keys, more CPU, more bandwidth)?
  • Which secrets require long-term confidentiality (HNDL) and where are they today?
  • How do you rotate algorithms safely (crypto agility without chaos)?
  • What does interoperability testing look like across vendors and stacks?
  • How do you handle failures: decryption failures, invalid ciphertexts, malformed keys?

Assumptions

  • Active attacker can force retries, downgrades, and expensive handshakes.
  • Deployments are mixed; old clients must interoperate or fail safely.
  • Bandwidth is limited in some environments; larger handshakes matter.
  • Side channels exist: timing and cache behavior leak information.

Non-goals

  • Ignoring DoS implications of large primitives.
  • Relying on silent fallback to weaker modes during interop failures.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.
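One way to treat negotiation as hostile is to fail closed when the peer's offer contains nothing above a configured floor. The sketch below is illustrative only: the suite labels, the preference order, and the `negotiate` helper are assumptions made for this example and are not taken from any particular protocol stack.

```python
from typing import FrozenSet, Sequence

# Hypothetical policy: suite labels and ordering are illustrative only.
LOCAL_PREFERENCE = ("hybrid-x25519-mlkem768", "x25519")
MINIMUM_ACCEPTABLE = frozenset({"hybrid-x25519-mlkem768"})


class NegotiationError(Exception):
    """Raised when no mutually acceptable suite exists."""


def negotiate(peer_offer: Sequence[str],
              local_preference: Sequence[str] = LOCAL_PREFERENCE,
              minimum: FrozenSet[str] = MINIMUM_ACCEPTABLE) -> str:
    """Pick the first locally preferred suite that the peer also offers.

    Fails closed: a suite outside `minimum` is never selected, so a
    stripped or manipulated offer yields an error, not a weaker session.
    """
    offered = set(peer_offer)
    for suite in local_preference:
        if suite in offered and suite in minimum:
            return suite
    raise NegotiationError(f"no acceptable suite in offer: {sorted(offered)}")
```

Raising an error makes the downgrade attempt visible in telemetry. Whether the floor should include classical-only suites during migration is a policy decision this sketch does not settle.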

Model & invariants

Hybrid composition should be transcript-bound:

$$\mathrm{ss} = \mathrm{HKDF}(\mathrm{ss}_{\text{classical}} \,\Vert\, \mathrm{ss}_{\text{pqc}},\ \text{info} = \mathrm{transcript}).$$

Binding is the whole game: make the transcript an input to the KDF.

Make costs explicit: measure CPU and bandwidth, then add protections.
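Once handshake cost has been measured, one possible protection is a per-peer budget enforced before any expensive operation runs. The following is a minimal token-bucket sketch. The class name, the capacity and refill values, and the abstract cost unit are all assumptions for illustration; real values should come from measurement.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Bucket:
    tokens: float
    updated: float


@dataclass
class HandshakeBudget:
    """Token bucket per peer; cost units are abstract (e.g. measured CPU ms)."""
    capacity: float = 10.0        # assumed burst allowance
    refill_per_sec: float = 1.0   # assumed sustained rate
    clock: Callable[[], float] = time.monotonic
    _buckets: Dict[str, Bucket] = field(default_factory=dict, init=False)

    def allow(self, peer: str, cost: float) -> bool:
        """Return True and charge the peer if it can afford `cost`."""
        now = self.clock()
        b = self._buckets.setdefault(peer, Bucket(self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - b.updated
        b.tokens = min(self.capacity, b.tokens + elapsed * self.refill_per_sec)
        b.updated = now
        if b.tokens >= cost:
            b.tokens -= cost
            return True
        return False
```

The bucket table here grows without bound, so a production version would need eviction. On constrained IoT gateways in particular, the state kept per peer is itself a DoS surface.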

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
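As a sketch of how this invariant can be made executable, the guard below accepts a message only when its (epoch, counter) pair strictly increases. The `Version` and `ReplayGuard` names are hypothetical, and the strict ordering assumes in-order delivery; a transport that reorders messages would need a sliding window instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    epoch: int    # bumped on key or algorithm rotation
    counter: int  # bumped per message within an epoch


class ReplayGuard:
    """Accepts a message only if its (epoch, counter) strictly increases."""

    def __init__(self) -> None:
        self._last = Version(0, 0)

    def accept(self, v: Version) -> bool:
        # Dataclass ordering compares (epoch, counter) as a tuple.
        if v <= self._last:
            return False  # stale or replayed; wall-clock time is never consulted
        self._last = v
        return True
```

Because no timestamp is involved, a device with a drifting or reset clock still enforces the same ordering, provided the last accepted version is persisted across reboots.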

Security properties

  • Least authority: privileges are scoped by purpose and time.
  • Integrity: invalid transitions are rejected (and detectable).
  • Authenticity: actions are bound to identity and purpose.
  • Downgrade resistance: negotiation can’t silently weaken security posture.

Failure modes

  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Observability gaps during incidents (missing evidence).
  • Mixed-version behavior that violates assumptions silently.
  • Config drift that weakens security posture over time.

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart TD
  negotiate["Negotiate Algorithms"] --> bind["Bind Transcript"]
  bind --> kdf["KDF (hybrid)"]
  kdf --> keys["Traffic Keys"]
  keys --> monitor["Monitor + Rollback"]

Implementation notes

Interop tests are the migration plan; everything else is a hypothesis.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

// Hybrid binding sketch (pseudocode):
// ss = HKDF(ss_classical || ss_pqc, info=transcript_hash)
// Then derive traffic keys from ss.
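The pseudocode above could be made concrete roughly as follows, using HKDF as specified in RFC 5869 with SHA-256 and only the Python standard library. The `hybrid_secret` helper, the `b"hybrid-ss"` label, and the empty salt are assumptions for illustration. Plain concatenation is only unambiguous when both shared secrets have fixed lengths, which this sketch assumes.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-256; empty salt means all zeros."""
    return hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869) with SHA-256."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_secret(ss_classical: bytes, ss_pqc: bytes, transcript: bytes) -> bytes:
    """Combine both shared secrets and bind the result to the transcript."""
    transcript_hash = hashlib.sha256(transcript).digest()
    prk = hkdf_extract(b"", ss_classical + ss_pqc)
    # The label is hypothetical; it separates this use from other derivations.
    return hkdf_expand(prk, b"hybrid-ss" + transcript_hash, HASH_LEN)
```

Any change to the negotiated transcript, including a stripped algorithm offer, changes the derived secret, so the two sides fail to agree on keys and the downgrade does not complete silently.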

Verification strategy

  • Side-channel tests where tooling exists; constant-time audits.
  • Downgrade tests: active attacker manipulates negotiation.
  • Chaos deploys: mixed versions + rollback during partial outages.
  • DoS tests: measure CPU/bandwidth amplification and mitigation impact.
  • Interop matrices across vendors/versions and failure modes.
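An interop matrix can start as a small table of expected outcomes per client/server pair, which the test harness then compares against observed behavior. In the sketch below the stack names, the suite labels, and the "fail-closed" outcome are hypothetical placeholders, not real products or versions.

```python
from itertools import product
from typing import Dict, FrozenSet, Iterator, Tuple

# Hypothetical stacks and the suites each one supports.
STACKS: Dict[str, FrozenSet[str]] = {
    "client-v1": frozenset({"classical"}),
    "client-v2": frozenset({"classical", "hybrid"}),
    "server-v2": frozenset({"classical", "hybrid"}),
    "server-v3": frozenset({"hybrid"}),
}


def expected_outcome(client: str, server: str) -> str:
    """Prefer hybrid, allow classical if shared, otherwise fail closed."""
    common = STACKS[client] & STACKS[server]
    if "hybrid" in common:
        return "hybrid"
    if common:
        return "classical"
    return "fail-closed"


def interop_matrix() -> Iterator[Tuple[str, str, str]]:
    """Yield (client, server, expected outcome) for every pairing."""
    clients = sorted(s for s in STACKS if s.startswith("client"))
    servers = sorted(s for s in STACKS if s.startswith("server"))
    for c, s in product(clients, servers):
        yield c, s, expected_outcome(c, s)
```

Each row then becomes a test case: run the real handshake for that pair and assert the observed outcome matches the expected one, including the pairs that are supposed to fail.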

Operational notes

  • Add telemetry for negotiation outcomes, failures, and client cohorts.
  • Roll out with canaries and explicit rollback triggers.
  • Inventory long-lived secrets and migrate the highest-risk first.
  • Document supported algorithm sets and deprecation timelines.
  • Cap handshake cost per peer/IP; use stateless cookies when needed.

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Authz failures and policy denials (unexpected spikes).
  • Invariant violation rate (should be ~0).
  • Retry/timeout rates by endpoint and client cohort.
  • Rollback events and the conditions that triggered them.
  • Error budget burn + tail latency under load.

Rollback plan

  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Keep dual-write / dual-verify windows where appropriate.
  • Define an explicit rollback trigger (metrics + thresholds).
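An explicit rollback trigger can be written as data plus a small evaluation function, so that the decision is reviewable and testable ahead of an incident. The metric names and thresholds below are hypothetical and would need to be derived from measured baselines.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical metric names and limits; tune from measured baselines.
TRIGGERS = (
    Threshold("handshake_failure_rate", 0.02),
    Threshold("p99_handshake_ms", 250.0),
    Threshold("invariant_violations", 0.0),
)


def breached(metrics: Dict[str, float],
             triggers: Sequence[Threshold] = TRIGGERS) -> List[str]:
    """Return names of breached triggers; a missing metric counts as breached."""
    out = []
    for t in triggers:
        value = metrics.get(t.metric)
        if value is None or value > t.max_value:
            out.append(t.metric)
    return out


def should_roll_back(metrics: Dict[str, float]) -> bool:
    """True if any trigger is breached or any required metric is absent."""
    return bool(breached(metrics))
```

Treating a missing metric as a breach is a deliberate choice in this sketch: an observability gap during a rollout is handled as a reason to stop, not as evidence of health.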

Evidence

  • Learn TLA+ (1) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
  • Jepsen (2) — Fault injection and correctness testing for distributed systems.
    • Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

  • How do you rotate algorithms without introducing configuration chaos?
  • Which clients will fail first, and what is the safe fallback behavior?
  • Where would a downgrade be visible today, and how would you detect it?
  • What is the worst-case handshake cost under attack?

Checklist

  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Telemetry captures correctness signals.
  • Rollback plan rehearsed and automated.
  • Safety properties stated as invariants.
  • Failure modes enumerated with mitigations.
  • Assumptions listed and reviewed.

Further reading

1. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/
2. Jepsen. Jepsen: Distributed Systems Safety Analysis [Internet]. Available from: https://jepsen.io/