Monthly research note. Theme: Quantum-Resilient Systems Engineering.
TL;DR
A focused memo on "Quantum-Resilient Identity: Device + Human, Online + Offline". Define the model, state the properties, then design the system so those properties hold under failure and under attack.
If the spec is implicit, the implementation becomes the spec, and you’ll learn it during incidents.
Key takeaways
- Downgrade resistance must be explicit and tested under active attackers.
- Define success metrics beyond “enabled”: cohorts, failures, and evidence.
- Measure cost shifts (CPU/bandwidth) and adapt DoS defenses accordingly.
- Prefer protocols and APIs that make invalid states hard to express.
- Automate guardrails; humans are for judgment, not for consistent enforcement.
Why this matters
- Hybrid protocols fail if binding is unclear or downgrade is possible.
- Long-lived devices and PKI lifecycles are the hard constraint.
- Quantum risk is uneven: some secrets must last decades, others do not.
- Cost changes drive new DoS surfaces; defenses must evolve.
Key questions
- How do you manage mixed deployments across regions and vendors?
- Which protocols need hybrid now, and which can wait without regret?
- How do you validate resilience (DoS, side channels, rollback, compromise)?
- What secrets must remain confidential for 10–30 years (and where are they today)?
- How do you stop downgrade under active adversaries?
- How do you define success metrics for PQ readiness beyond “enabled”?
Assumptions
- Some environments require constrained implementations (no_std, embedded).
- Operational teams need safe playbooks; crypto changes are not one-off.
- Key and certificate lifecycles outlive application versions.
- Adversaries record traffic today and decrypt it later ("harvest now, decrypt later", HNDL).
Non-goals
- Switching algorithms without inventorying where secrets are used.
- Relying on ‘automatic’ negotiation without downgrade resistance.
Negotiation and fallbacks are where security silently becomes optional; treat them as hostile.
Model & invariants
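Hybrid composition should be explicit and transcript-bound: derive the session key from both the classical and the post-quantum shared secret, and bind it to the full handshake transcript, including the algorithms each side offered.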
Make downgrade resistance explicit and test it like a security feature.
Inventory first. You can’t migrate what you can’t locate.
Invariants must be checkable from evidence you actually have (state + logs + counters).
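As a minimal sketch of transcript-bound hybrid composition, the following derives one session key from two shared secrets using HKDF (RFC 5869) built from the Python standard library. The function names, the `hybrid-demo v1` label, and the choice to use the transcript hash as both salt and context are assumptions made for illustration, not a standardized construction; a real deployment should follow the combiner defined by its protocol.

```python
import hashlib
import hmac


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-256."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF-Expand (RFC 5869) with SHA-256."""
    out, block, counter = b"", b"", 1
    while len(out) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]


def lp(data: bytes) -> bytes:
    """Length-prefix a field so concatenated fields cannot be confused."""
    return len(data).to_bytes(4, "big") + data


def hybrid_session_key(classical_ss: bytes, pq_ss: bytes, transcript: bytes) -> bytes:
    """Derive a key that depends on both secrets and on the handshake transcript.

    Hypothetical helper: refuses to run with a missing component, so there is
    no silent single-algorithm fallback.
    """
    if not classical_ss or not pq_ss:
        raise ValueError("both shared secrets are required")
    transcript_hash = hashlib.sha256(transcript).digest()
    prk = hkdf_extract(salt=transcript_hash, ikm=lp(classical_ss) + lp(pq_ss))
    return hkdf_expand(prk, info=b"hybrid-demo v1" + transcript_hash)
```

Because the transcript feeds the derivation, two peers that saw different offers or different negotiated algorithms end up with different keys, and the mismatch surfaces as a handshake failure rather than a silently weaker session.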
Security properties
- Least authority: privileges are scoped by purpose and time.
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Replay resistance: duplicated inputs do not change outcomes.
- Authenticity: actions are bound to identity and purpose.
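One way to read the replay-resistance property above is as idempotency: a duplicated request returns the first outcome instead of running again. The sketch below assumes each request carries a unique, authenticated ID; `ReplayGuard` and its parameters are hypothetical names. The bounded cache is a simplification, since an evicted ID could be replayed, so a real system would pair it with an expiry window enforced on request timestamps.

```python
from collections import OrderedDict


class ReplayGuard:
    """Remember results by request ID so a duplicate returns the first outcome."""

    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._seen = OrderedDict()  # request_id -> stored result

    def apply(self, request_id: str, handler, payload):
        if request_id in self._seen:
            # Duplicate: return the stored result without re-running the handler.
            return self._seen[request_id]
        result = handler(payload)
        self._seen[request_id] = result
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # evict the oldest entry
        return result


# Usage: the side effect happens once, however often the request is replayed.
balance = {"value": 0}

def credit(amount):
    balance["value"] += amount
    return balance["value"]

guard = ReplayGuard()
first = guard.apply("req-1", credit, 10)
again = guard.apply("req-1", credit, 10)  # replayed input, same outcome
```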
Failure modes
- Config drift that weakens security posture over time.
- Mixed-version behavior that violates assumptions silently.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Recovery paths that only work when nothing is broken.
Caches tend to become sources of truth unless you can recompute and validate them.
Design sketch
flowchart TD
inventory["Inventory"] --> prioritize["Prioritize"]
prioritize --> hybrid["Hybrid Deploy"]
hybrid --> monitor["Monitor"]
monitor --> cutover["Cutover"]
cutover --> deprecate["Deprecate Old"]
Implementation notes
PQ readiness is a systems program: crypto, networking, ops, and UX must compose.
Make rollbacks boring: if rollback is a hero move, it will fail.
// PQ migration note: "enabled" is not "safe" unless binding and downgrade resistance are explicit.
Verification strategy
- Downgrade simulations with active attackers.
- Side-channel audits for constrained implementations.
- Interop tests across stacks and versions.
- Performance profiling under load to quantify DoS risk.
- Rotation drills: certificates, tunnels, device identities.
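The downgrade simulation in the first item can start as a small test harness. This sketch models an attacker who strips the hybrid group from the client's offer and checks that a MAC over the transcript exposes the change. The group names, the fixed `demo-key`, and every function here are illustrative assumptions; in a real protocol the MAC key comes from the handshake's own key schedule (TLS 1.3, for example, covers the ClientHello with the Finished message).

```python
import hashlib
import hmac

# Server preference, strongest first. Names are illustrative only.
PREFERENCE = ["hybrid-x25519-mlkem768", "x25519"]


def server_select(offered):
    """Pick the most preferred group the client offered."""
    for group in PREFERENCE:
        if group in offered:
            return group
    raise ValueError("no common group")


def transcript_mac(key: bytes, offered, selected: str) -> bytes:
    """MAC over the offer list and the selection."""
    transcript = ",".join(offered).encode() + b"|" + selected.encode()
    return hmac.new(key, transcript, hashlib.sha256).digest()


def handshake(client_offer, tamper=None, key=b"demo-key"):
    """Return the negotiated group, or raise if the two transcripts disagree."""
    on_wire = tamper(client_offer) if tamper else list(client_offer)
    selected = server_select(on_wire)
    server_mac = transcript_mac(key, on_wire, selected)       # what the server saw
    client_mac = transcript_mac(key, client_offer, selected)  # what the client sent
    if not hmac.compare_digest(server_mac, client_mac):
        raise RuntimeError("downgrade detected: offer was modified in transit")
    return selected


def strip_hybrid(offer):
    """Active attacker: remove every hybrid group from the offer."""
    return [g for g in offer if not g.startswith("hybrid-")]
```

Running `handshake(offer)` without tampering selects the hybrid group; running it with `tamper=strip_hybrid` should fail loudly. A test suite can assert both outcomes so that a regression in transcript binding shows up as a failed test, not as a quiet fallback to the classical group.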
Operational notes
- Maintain an inventory of long-lived secrets and their lifetimes.
- Roll out hybrid with canaries and explicit rollback triggers.
- Add telemetry for algorithm negotiation and failure modes.
- Define compatibility windows and communicate them to stakeholders.
- Practice emergency deprecation (turn off broken algorithms quickly).
Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.
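The inventory of long-lived secrets can begin as a small table with a prioritization rule. The sketch below applies Mosca's inequality: a secret is exposed when its required confidentiality lifetime plus its migration time exceeds the assumed time until a cryptographically relevant quantum computer exists. The record names, the lifetimes, and the 15-year horizon are made-up inputs for illustration, not estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretRecord:
    name: str
    confidentiality_years: float  # how long the data must stay secret
    migration_years: float        # how long moving it to hybrid/PQ will take


def exposure(record: SecretRecord, threat_horizon_years: float) -> float:
    """Years by which the secret outlives the assumed threat horizon."""
    return record.confidentiality_years + record.migration_years - threat_horizon_years


def prioritize(inventory, threat_horizon_years: float):
    """Return exposed records, most exposed first."""
    exposed = [r for r in inventory if exposure(r, threat_horizon_years) > 0]
    return sorted(exposed, key=lambda r: exposure(r, threat_horizon_years), reverse=True)


# Hypothetical inventory; every value here is an assumption.
inventory = [
    SecretRecord("session-logs", confidentiality_years=1, migration_years=1),
    SecretRecord("device-root-keys", confidentiality_years=25, migration_years=5),
    SecretRecord("medical-archive", confidentiality_years=30, migration_years=3),
]
queue = [r.name for r in prioritize(inventory, threat_horizon_years=15)]
# queue == ["medical-archive", "device-root-keys"]
```

The horizon is the least certain input, so it is worth rerunning the ranking with several values to see which records are exposed under every plausible assumption.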
What to monitor
- Admission-control / rate-limit rejections (by reason).
- Authz failures and policy denials (unexpected spikes).
- Retry/timeout rates by endpoint and client cohort.
- Invariant violation rate (should be ~0).
- Error budget burn + tail latency under load.
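For the invariant violation rate, a minimal counter can track handshakes by cohort, negotiated group, and outcome, and flag any successful negotiation outside the allowed set. `NegotiationTelemetry` and the group names are hypothetical; a production system would export these counts through its existing metrics pipeline instead of holding them in memory.

```python
from collections import Counter


class NegotiationTelemetry:
    """Count handshakes and flag successes that used a disallowed group."""

    def __init__(self, allowed_groups):
        self.allowed = frozenset(allowed_groups)
        self.by_outcome = Counter()  # (cohort, group, "ok" | "fail") -> count
        self.violations = 0
        self.total = 0

    def record(self, cohort: str, group: str, ok: bool) -> None:
        self.total += 1
        self.by_outcome[(cohort, group, "ok" if ok else "fail")] += 1
        if ok and group not in self.allowed:
            # Invariant: no session completes on a group outside policy.
            self.violations += 1

    def violation_rate(self) -> float:
        return self.violations / self.total if self.total else 0.0


# Usage with illustrative group names.
telemetry = NegotiationTelemetry(allowed_groups=["hybrid-x25519-mlkem768"])
telemetry.record("eu-canary", "hybrid-x25519-mlkem768", ok=True)
telemetry.record("eu-canary", "hybrid-x25519-mlkem768", ok=False)
telemetry.record("legacy-devices", "x25519", ok=True)  # counts as a violation
telemetry.record("legacy-devices", "x25519", ok=False)  # rejected, so no violation
```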
Rollback plan
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Define an explicit rollback trigger (metrics + thresholds).
- Keep dual-write / dual-verify windows where appropriate.
- Use canaries and staged rollout; stop early when signals degrade.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
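An explicit rollback trigger can be written down as data plus one small function, which keeps the decision reviewable and automatable. In this sketch the metric names and thresholds are placeholders, and a missing metric is treated as a breach on the assumption that losing visibility during a rollout is itself a reason to stop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


def rollback_reasons(metrics: dict, thresholds) -> list:
    """Return one reason per breached threshold; an empty list means keep going."""
    reasons = []
    for t in thresholds:
        value = metrics.get(t.metric)
        if value is None:
            reasons.append(f"{t.metric}: missing (treated as a breach)")
        elif value > t.max_value:
            reasons.append(f"{t.metric}: {value} > {t.max_value}")
    return reasons


# Hypothetical thresholds for a canary stage.
THRESHOLDS = [
    Threshold("handshake_failure_rate", 0.01),
    Threshold("p99_handshake_ms", 250.0),
    Threshold("invariant_violations", 0.0),
]

healthy = rollback_reasons(
    {"handshake_failure_rate": 0.002, "p99_handshake_ms": 180.0, "invariant_violations": 0},
    THRESHOLDS,
)
degraded = rollback_reasons(
    {"handshake_failure_rate": 0.05, "p99_handshake_ms": 180.0},
    THRESHOLDS,
)
```

Here `healthy` is empty and `degraded` lists two reasons: the failure rate breach and the missing invariant counter. Logging the reasons alongside the rollback also preserves the evidence needed to reconstruct why it fired.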
Evidence
- Jepsen (1) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
- Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Open questions
- What is your minimal ‘safe mode’ when PQ paths fail?
- Which protocol surfaces are most exposed to HNDL risk in your environment?
- How do you prevent configuration drift from re-enabling weak modes?
- What is your plan for third-party dependencies that can’t migrate quickly?
Checklist
- Assumptions listed and reviewed.
- Telemetry captures correctness signals.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Failure modes enumerated with mitigations.
- Rollback plan rehearsed and automated.
- Safety properties stated as invariants.
Further reading
- NIST Post-Quantum Cryptography Project — The standardization baseline for PQC readiness programs.
- Let's Encrypt Incident Reports — Operational lessons relevant to rotation and recovery at scale.
- RFC 8446: TLS 1.3 — A useful reference for handshake structure and downgrade resistance patterns.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Jepsen — Fault injection and correctness testing for distributed systems.
- Learn TLA+ — Practical entry point for specification and model checking.