BFT with PQ Primitives: When Crypto Costs Dominate

Monthly research note. Theme: Quantum-Resilient Systems Engineering.

TL;DR

This memo covers BFT with post-quantum (PQ) primitives when crypto costs dominate: define the model, state the properties, then design the system so those properties hold under failures and adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Define success metrics beyond “enabled”: cohorts, failures, and evidence.
Hybrid is an operational mode (deploy, monitor, roll back), not a paper design.
Downgrade resistance must be explicit and tested under active attackers.
Write assumptions down; treat them as interfaces.
Design rollbacks as part of the happy path.

Why this matters

Larger keys, signatures, and ciphertexts change the per-message cost, which creates new DoS surfaces; defenses must evolve with them.
Quantum risk is uneven: some secrets must last decades, others do not.
Long-lived devices and PKI lifecycles are the hard constraint.
Migration risk is operational: inventory, rollout, rollback, and monitoring.

Key questions

How do you validate resilience (DoS, side channels, rollback, compromise)?
How do you manage mixed deployments across regions and vendors?
What does rotation look like at fleet scale (devices, certs, tunnels, identities)?
How do you define success metrics for PQ readiness beyond “enabled”?
What secrets must remain confidential for 10–30 years (and where are they today)?
Which protocols need hybrid now, and which can wait without regret?

Assumptions

Some environments require constrained implementations (no_std, embedded).
Operational teams need safe playbooks; crypto changes are not one-off.
Key and certificate lifecycles outlive application versions.
Rollouts happen under partial adoption; compatibility matters.

Non-goals

Switching algorithms without inventorying where secrets are used.
Assuming performance impacts will be negligible.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
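
One way to bound per-request work is to charge every expensive verification against a per-peer budget before doing it. The sketch below is a minimal token bucket; the class name, the 16 KiB message cap, and the cost units are assumptions chosen for illustration, not values from any particular BFT stack.

```python
from dataclasses import dataclass, field

MAX_MESSAGE_BYTES = 16 * 1024  # hypothetical cap; size it to your largest valid message


@dataclass
class VerifyBudget:
    """Token bucket that caps expensive verifications for one peer."""
    capacity: float          # maximum burst, in verification units
    refill_per_sec: float    # sustained rate, in units per second
    tokens: float = field(init=False)
    last: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def try_spend(self, cost: float, now: float) -> bool:
        # Refill for the elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True


def admit(budget: VerifyBudget, message: bytes, now: float, cost: float = 1.0) -> bool:
    """Run the cheap checks first; spend budget only if they pass."""
    if len(message) > MAX_MESSAGE_BYTES:
        return False
    return budget.try_spend(cost, now)
```

The caller passes `now` explicitly (for example from `time.monotonic()`), which keeps the admission logic deterministic under test. Rejections should be counted by reason so that a budget that is too tight shows up in telemetry rather than as silent drops.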

Model & invariants

Hybrid composition should be explicit and transcript-bound:

\mathrm{ss} = \mathrm{HKDF}\bigl(\mathrm{ss}_{\text{classical}} \,\Vert\, \mathrm{ss}_{\text{pqc}},\ \mathrm{info} = \mathrm{transcript}\bigr).
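
A minimal sketch of this combiner, using only the Python standard library. It implements HKDF as specified in RFC 5869 with SHA-256. The function name `hybrid_secret`, the empty salt, and the assumption that both input secrets have fixed lengths (so plain concatenation is unambiguous) are illustrative choices, not part of the formula above.

```python
import hashlib
import hmac

HASH = hashlib.sha256
HASH_LEN = HASH().digest_size  # 32 bytes for SHA-256


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract: PRK = HMAC(salt, IKM); an empty salt means HASH_LEN zero bytes."""
    if not salt:
        salt = b"\x00" * HASH_LEN
    return hmac.new(salt, ikm, HASH).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand: T(i) = HMAC(PRK, T(i-1) || info || i), concatenated and truncated."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), HASH).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_secret(ss_classical: bytes, ss_pqc: bytes, transcript: bytes,
                  length: int = 32) -> bytes:
    """Derive one session secret from both shared secrets, bound to the transcript."""
    # Assumption: both inputs are fixed-length, so concatenation is unambiguous.
    prk = hkdf_extract(b"", ss_classical + ss_pqc)
    return hkdf_expand(prk, transcript, length)
```

Because the transcript enters as `info`, two sessions with identical shared secrets but different negotiated parameters derive different keys, so a tampered negotiation surfaces as a key mismatch instead of a silently weaker session.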

Inventory first. You can’t migrate what you can’t locate.

Make downgrade resistance explicit and test it like a security feature.
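
As a sketch of what "explicit" can mean here: the server picks the strongest mutually supported suite, and the client recomputes that choice from its own authenticated offer list once the transcript is verified. The suite labels, the `STRENGTH` ranking, and the function names are hypothetical and stand in for whatever registry your protocol uses.

```python
# Hypothetical suite labels ranked weakest to strongest; not a real registry.
STRENGTH = {"classical": 1, "hybrid": 2}


class DowngradeError(Exception):
    """Raised when negotiation would settle below what both sides support."""


def select_suite(client_offers, server_supported, floor="classical"):
    """Pick the strongest mutually supported suite at or above the policy floor."""
    mutual = [s for s in client_offers if s in server_supported and s in STRENGTH]
    allowed = [s for s in mutual if STRENGTH[s] >= STRENGTH[floor]]
    if not allowed:
        raise DowngradeError("no mutually supported suite meets the policy floor")
    return max(allowed, key=STRENGTH.__getitem__)


def check_selection(selected, authenticated_client_offers, server_supported,
                    floor="classical"):
    """Client-side check, run after the transcript has been authenticated."""
    expected = select_suite(authenticated_client_offers, server_supported, floor)
    if selected != expected:
        raise DowngradeError(f"negotiated {selected!r}, expected {expected!r}")
```

A downgrade test then plays the attacker: strip `"hybrid"` from the offer in flight, let the server select `"classical"`, and assert that `check_selection` raises on the client. Raising `floor` to `"hybrid"` is the cutover step, and it makes classical-only peers fail loudly.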

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

Integrity: invalid transitions are rejected (and detectable).
Replay resistance: duplicated inputs do not change outcomes.
Downgrade resistance: negotiation can’t silently weaken security posture.
Authenticity: actions are bound to identity and purpose.
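
The replay-resistance property can be sketched as a per-sender high-water mark: an input is applied only if its sequence number is strictly greater than the last one accepted from that sender. The class name and the choice of monotonic sequence numbers (rather than a nonce cache) are assumptions made to keep memory bounded.

```python
class ReplayGuard:
    """Accept each (sender, seq) at most once, using one integer of state per sender."""

    def __init__(self) -> None:
        self._highest = {}  # sender -> highest accepted sequence number

    def accept(self, sender: str, seq: int) -> bool:
        if seq <= self._highest.get(sender, -1):
            return False  # duplicate or stale: must not change outcomes
        self._highest[sender] = seq
        return True
```

This rejects out-of-order delivery as well as replays, which is a deliberate trade for constant state per sender. The high-water marks must survive restarts, otherwise a crash reopens the replay window.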

Failure modes

Mixed-version behavior that violates assumptions silently.
Config drift that weakens security posture over time.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Observability gaps during incidents (missing evidence).

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart TD
  inventory["Inventory"] --> prioritize["Prioritize"]
  prioritize --> hybrid["Hybrid Deploy"]
  hybrid --> monitor["Monitor"]
  monitor --> cutover["Cutover"]
  cutover --> deprecate["Deprecate Old"]

Implementation notes

Operationalize early: rollback and monitoring are part of the design.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).
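
A minimal sketch of that rule for a file-backed log: the handler returns its acknowledgement only after the record has been written and `os.fsync` has returned. The 4-byte length prefix, the function names, and the `"ack"` return value are illustrative assumptions.

```python
import os


def append_durably(path: str, record: bytes) -> int:
    """Append a length-prefixed record; return only after fsync succeeds."""
    frame = len(record).to_bytes(4, "big") + record
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(frame)
        while len(view) > 0:
            written = os.write(fd, view)  # os.write may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # flush file data to stable storage before acknowledging
    finally:
        os.close(fd)
    return len(frame)


def handle_vote(path: str, vote: bytes) -> str:
    """Persist first, acknowledge second; an exception means no ack is sent."""
    append_durably(path, vote)
    return "ack"
```

On POSIX systems a newly created file may also need an fsync on its parent directory before the file's existence is durable; that step is omitted here for brevity.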

// PQ migration note: "enabled" is not "safe" unless binding and downgrade resistance are explicit.

Verification strategy

Downgrade simulations with active attackers.
Side-channel audits for constrained implementations.
Interop tests across stacks and versions.
Rotation drills: certificates, tunnels, device identities.
Performance profiling under load to quantify DoS risk.
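
Before profiling, a back-of-the-envelope estimate helps decide what to measure. The sketch below computes the size of a quorum certificate under two assumptions that may not match your protocol: replicas number n = 3f + 1 with a quorum of 2f + 1, and the certificate carries one unaggregated signature per quorum member. Signature sizes are the published ones for Ed25519 (64 bytes) and for ML-DSA-44 and ML-DSA-65 in FIPS 204 (2420 and 3309 bytes).

```python
# Signature sizes in bytes (Ed25519; ML-DSA parameter sets from FIPS 204).
SIG_BYTES = {"Ed25519": 64, "ML-DSA-44": 2420, "ML-DSA-65": 3309}


def quorum_size(f: int) -> int:
    """Quorum for n = 3f + 1 replicas tolerating f Byzantine faults."""
    return 2 * f + 1


def qc_bytes(f: int, scheme: str) -> int:
    """Signature bytes in one quorum certificate, assuming no aggregation."""
    return quorum_size(f) * SIG_BYTES[scheme]


if __name__ == "__main__":
    f = 33  # n = 100 replicas
    for scheme in SIG_BYTES:
        print(f"{scheme:10s} {qc_bytes(f, scheme):>8d} bytes per certificate")
```

For f = 33 this gives 4288 bytes with Ed25519 against 162140 bytes with ML-DSA-44, roughly a 38-fold increase per certificate. That ratio is the number to validate under load, since every replica that forwards or stores certificates pays it.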

Operational notes

Maintain an inventory of long-lived secrets and their lifetimes.
Practice emergency deprecation (turn off broken algorithms quickly).
Add telemetry for algorithm negotiation and failure modes.
Define compatibility windows and communicate them to stakeholders.
Roll out hybrid with canaries and explicit rollback triggers.
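
The telemetry item above can start as a counter keyed by cohort, negotiated suite, and outcome. This is a sketch with hypothetical names (`NegotiationTelemetry`, the `"hybrid"` label, the `"ok"` outcome); a real deployment would export the same dimensions through its existing metrics library.

```python
from collections import Counter


class NegotiationTelemetry:
    """Count handshakes by (cohort, suite, outcome)."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, cohort: str, suite: str, outcome: str) -> None:
        self.counts[(cohort, suite, outcome)] += 1

    def hybrid_share(self, cohort: str):
        """Fraction of successful handshakes in a cohort that used hybrid, or None."""
        ok_by_suite = Counter()
        for (c, suite, outcome), n in self.counts.items():
            if c == cohort and outcome == "ok":
                ok_by_suite[suite] += n
        total = sum(ok_by_suite.values())
        return ok_by_suite["hybrid"] / total if total else None
```

Reporting the share per cohort, rather than a fleet-wide "enabled" flag, is what makes partial adoption and silent fallback visible.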

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Authz failures and policy denials (unexpected spikes).
Admission-control / rate-limit rejections (by reason).
Invariant violation rate (should be ~0).
Rollback events and the conditions that triggered them.
Retry/timeout rates by endpoint and client cohort.

Rollback plan

Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Keep dual-write / dual-verify windows where appropriate.
Use canaries and staged rollout; stop early when signals degrade.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Define an explicit rollback trigger (metrics + thresholds).
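
An explicit trigger can be as small as a table of metrics and limits that the rollout controller evaluates on every canary step. The metric names and thresholds below are placeholders, not recommendations; the one design choice worth keeping is that a missing metric counts as a breach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical metric names and limits; tune them to your own baselines.
TRIGGERS = (
    Threshold("handshake_failure_rate", 0.02),
    Threshold("invariant_violations", 0.0),
    Threshold("p99_verify_latency_ms", 50.0),
)


def breached(metrics: dict, triggers=TRIGGERS) -> list:
    """Return the names of breached triggers; a missing metric counts as a breach."""
    out = []
    for t in triggers:
        value = metrics.get(t.metric)
        if value is None or value > t.max_value:
            out.append(t.metric)
    return out


def should_roll_back(metrics: dict, triggers=TRIGGERS) -> bool:
    return bool(breached(metrics, triggers))
```

Logging the output of `breached` alongside each rollback gives the evidence trail the first item in this list asks for.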

Evidence

Jepsen (1) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Let's Encrypt Incident Reports (2) — Operational lessons relevant to rotation and recovery at scale.
- Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.

Open questions

How do you prevent configuration drift from re-enabling weak modes?
What is your plan for third-party dependencies that can’t migrate quickly?
What is your minimal ‘safe mode’ when PQ paths fail?
Which protocol surfaces are most exposed to harvest-now, decrypt-later (HNDL) risk in your environment?

Checklist

Telemetry captures correctness signals.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Safety properties stated as invariants.
Assumptions listed and reviewed.
