Cryptographic Agility: Designing for the Algorithm You Haven't Met Yet

Monthly research note. Theme: Cryptographic Infrastructure.

TL;DR

Treat cryptographic agility as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.

Key takeaways

Audit logs are evidence: make them tamper-evident and queryable during incidents.
Treat key IDs as capabilities; never pass raw private key material across boundaries.
Side-channel constraints turn performance details into security boundaries.
Design rollbacks as part of the happy path.
Write assumptions down; treat them as interfaces.

Why this matters

Most organizations don’t know where their keys live—until an incident.
Policy drift silently turns strong crypto into weak practice.
Managed services shift responsibilities; they don’t remove them.
Cryptographic agility is useless if rollout and rollback are unsafe.

Key questions

What is your disaster recovery story for KMS/HSM outages?
What is the root of trust (HSM, TPM, offline CA, threshold ceremony)?
What is the rollback plan when a new algorithm breaks production?
How do you handle key erasure and “right to be forgotten” constraints?
How do you prove usage (who signed what, when, and why) without leaking secrets?
How do keys rotate safely (overlap windows, dual-sign, staged rollout)?

Assumptions

Attackers can observe timing and resource usage in shared environments.
Certificate chains and policies evolve; clients won’t all update together.
Key usage is high-volume; audit pipelines must scale without sampling away truth.
Some environments are hostile (CI, ephemeral runners, shared build agents).

Non-goals

Passing raw private keys across process boundaries.
Relying on manual rotation procedures for fleet-scale systems.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

Key derivation is where protocols quietly succeed or fail. A sane default is domain-separated HKDF:

k \leftarrow \mathrm{HKDF}(\text{salt},\ \text{ikm},\ \text{info}=\text{context}).

Bind every derived key to context: protocol, role, version, and transcript.
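
A minimal sketch of context-bound derivation, using only the Go standard library. HKDF-Extract and HKDF-Expand follow RFC 5869 with SHA-256. The names contextInfo and deriveKey, the length-prefixed encoding of the context fields, and the sample protocol and role strings are assumptions made for illustration, not part of the RFC.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"errors"
	"fmt"
)

// hkdfExtract is HKDF-Extract from RFC 5869, section 2.2, fixed to SHA-256.
func hkdfExtract(salt, ikm []byte) []byte {
	if len(salt) == 0 {
		salt = make([]byte, sha256.Size) // RFC default: HashLen zero bytes
	}
	mac := hmac.New(sha256.New, salt)
	mac.Write(ikm)
	return mac.Sum(nil)
}

// hkdfExpand is HKDF-Expand from RFC 5869, section 2.3, fixed to SHA-256.
func hkdfExpand(prk, info []byte, length int) ([]byte, error) {
	if length <= 0 || length > 255*sha256.Size {
		return nil, errors.New("hkdf: invalid output length")
	}
	var okm, block []byte
	for counter := byte(1); len(okm) < length; counter++ {
		mac := hmac.New(sha256.New, prk)
		mac.Write(block) // T(i-1), empty on the first round
		mac.Write(info)
		mac.Write([]byte{counter})
		block = mac.Sum(nil)
		okm = append(okm, block...)
	}
	return okm[:length], nil
}

// contextInfo length-prefixes each field so ("ab", "c") and ("a", "bc")
// never encode to the same info string. Fields must be under 64 KiB.
// This encoding is an illustrative assumption.
func contextInfo(fields ...string) []byte {
	var info []byte
	for _, f := range fields {
		info = append(info, byte(len(f)>>8), byte(len(f)))
		info = append(info, f...)
	}
	return info
}

// deriveKey binds a 32-byte key to protocol, role, version, and transcript.
func deriveKey(salt, ikm []byte, protocol, role, version, transcript string) ([]byte, error) {
	info := contextInfo(protocol, role, version, transcript)
	return hkdfExpand(hkdfExtract(salt, ikm), info, 32)
}

func main() {
	ikm := []byte("demo input keying material, not a real secret")
	salt := []byte("demo salt")
	client, _ := deriveKey(salt, ikm, "example-proto", "client", "v1", "transcript-hash")
	server, _ := deriveKey(salt, ikm, "example-proto", "server", "v1", "transcript-hash")
	fmt.Println("keys differ by role:", !hmac.Equal(client, server))
}
```

Changing any one context field yields an unrelated key, which is the property that prevents cross-protocol and cross-role key reuse.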

Audit logs are evidence. Make them tamper-evident and operationally accessible.
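
One common way to make a log tamper-evident is a hash chain, sketched below. The Entry fields and the Append and Verify helpers are hypothetical names chosen for this note; a production design would also anchor the latest hash somewhere an attacker cannot rewrite (for example by signing or publishing it), which this sketch does not do.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Entry is one audit event. Hash covers the fields and PrevHash, so editing
// or dropping an earlier entry breaks the chain at that point.
type Entry struct {
	Seq      uint64
	Actor    string
	Action   string
	KeyID    string
	PrevHash [32]byte
	Hash     [32]byte
}

// digest hashes the entry with length-prefixed fields to avoid ambiguity.
// The stored Hash field itself is not part of the input.
func digest(e Entry) [32]byte {
	h := sha256.New()
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], e.Seq)
	h.Write(n[:])
	for _, f := range []string{e.Actor, e.Action, e.KeyID} {
		binary.BigEndian.PutUint64(n[:], uint64(len(f)))
		h.Write(n[:])
		h.Write([]byte(f))
	}
	h.Write(e.PrevHash[:])
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// Append links a new event to the tail of the log and returns the new log.
func Append(events []Entry, actor, action, keyID string) []Entry {
	e := Entry{Seq: uint64(len(events)), Actor: actor, Action: action, KeyID: keyID}
	if len(events) > 0 {
		e.PrevHash = events[len(events)-1].Hash
	}
	e.Hash = digest(e)
	return append(events, e)
}

// Verify returns the index of the first entry that fails, or -1 if intact.
func Verify(events []Entry) int {
	var prev [32]byte
	for i, e := range events {
		if e.Seq != uint64(i) || e.PrevHash != prev || e.Hash != digest(e) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var events []Entry
	events = Append(events, "alice", "sign", "k1")
	events = Append(events, "bob", "rotate", "k1")
	fmt.Println(Verify(events)) // -1: chain intact
	events[0].Actor = "mallory"
	fmt.Println(Verify(events)) // 0: first entry no longer matches its hash
}
```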

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
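
The invariant can be made executable with a small guard that orders events by (epoch, counter) instead of wall-clock time. This is a sketch under assumptions: Stamp, Guard, and ErrStale are hypothetical names, and state is kept in memory, whereas a real service would persist the last accepted stamp.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Stamp orders events without wall-clock time: Epoch bumps on rotation or
// restart, Counter increases within an epoch.
type Stamp struct {
	Epoch   uint64
	Counter uint64
}

// After reports whether s is strictly newer than o.
func (s Stamp) After(o Stamp) bool {
	if s.Epoch != o.Epoch {
		return s.Epoch > o.Epoch
	}
	return s.Counter > o.Counter
}

// ErrStale is returned for replayed or out-of-order stamps.
var ErrStale = errors.New("stale or replayed stamp")

// Guard remembers the newest stamp seen per key ID.
type Guard struct {
	mu   sync.Mutex
	last map[string]Stamp
}

func NewGuard() *Guard { return &Guard{last: make(map[string]Stamp)} }

// Accept records s if it is strictly newer than the last stamp for keyID.
func (g *Guard) Accept(keyID string, s Stamp) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if prev, seen := g.last[keyID]; seen && !s.After(prev) {
		return ErrStale
	}
	g.last[keyID] = s
	return nil
}

func main() {
	g := NewGuard()
	fmt.Println(g.Accept("key-a", Stamp{Epoch: 1, Counter: 1}) == nil) // true: first stamp
	fmt.Println(g.Accept("key-a", Stamp{Epoch: 1, Counter: 1}) == nil) // false: replay
	fmt.Println(g.Accept("key-a", Stamp{Epoch: 2, Counter: 0}) == nil) // true: new epoch
	fmt.Println(g.Accept("key-a", Stamp{Epoch: 1, Counter: 9}) == nil) // false: old epoch
}
```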

Security properties

Replay resistance: duplicated inputs do not change outcomes.
Integrity: invalid transitions are rejected (and detectable).
Authenticity: actions are bound to identity and purpose.
Evidence: critical actions emit verifiable audit events.

Failure modes

Observability gaps during incidents (missing evidence).
Recovery paths that only work when nothing is broken.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Timeout ambiguity causing double-apply or partial state transitions.
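
The double-apply failure mode is usually addressed with an idempotency key: the caller picks a request ID once and reuses it on every retry, so a retry after an ambiguous timeout cannot apply the operation twice. A minimal sketch follows, assuming a hypothetical Issuer that hands out certificate serial numbers and keeps its dedup table in memory.

```go
package main

import (
	"fmt"
	"sync"
)

// Issuer hands out certificate serial numbers. The request ID is an
// idempotency key chosen by the caller and reused on every retry.
type Issuer struct {
	mu        sync.Mutex
	next      uint64
	byRequest map[string]uint64
}

func NewIssuer() *Issuer { return &Issuer{byRequest: make(map[string]uint64)} }

// Issue returns the serial for requestID. A retry of an already-applied
// request returns the original serial with replayed=true and changes nothing.
func (i *Issuer) Issue(requestID string) (serial uint64, replayed bool) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if s, ok := i.byRequest[requestID]; ok {
		return s, true
	}
	i.next++
	i.byRequest[requestID] = i.next
	return i.next, false
}

func main() {
	iss := NewIssuer()
	first, _ := iss.Issue("req-42")
	// Suppose the response was lost and the caller timed out. The outcome is
	// unknown to the caller, so it retries with the same request ID.
	retry, replayed := iss.Issue("req-42")
	fmt.Println(first == retry, replayed) // true true
}
```

The same table also gives replay resistance: a duplicated input returns the recorded outcome instead of producing a new one.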

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart LR
  policy["Policy (purpose + TTL)"] --> service["Signer Service"]
  service --> hsm["HSM/KMS"]
  service --> audit["Audit Stream"]
  audit --> siem["Detection/Response"]

Implementation notes

Never pass secrets around; pass handles with purpose constraints.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.

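// Capability-style API: callers get a handle scoped to purpose + TTL,
// never the private key itself.

// KeyPurpose names the use a handle is valid for.
type KeyPurpose string

// KeyHandle references a key held by the signer service.
type KeyHandle struct {
	ID            string     // key identifier, not key material
	Purpose       KeyPurpose // purpose constraint carried by the handle
	ExpiresAtUnix int64      // expiry as Unix seconds
}

// Signer signs msg with the key that h refers to.
type Signer interface {
	Sign(h KeyHandle, msg []byte) (sig []byte, err error)
}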
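
One way the Signer interface above might be implemented is sketched here. The memSigner type, its error values, the injected clock, and the choice of Ed25519 are all assumptions for illustration; the types are repeated so the example compiles on its own. Note that a plain struct handle can be forged by any caller, so a real service would issue and authenticate handles itself.

```go
package main

import (
	"crypto/ed25519"
	"errors"
	"fmt"
	"time"
)

type KeyPurpose string

type KeyHandle struct {
	ID            string
	Purpose       KeyPurpose
	ExpiresAtUnix int64
}

type Signer interface {
	Sign(h KeyHandle, msg []byte) (sig []byte, err error)
}

var (
	ErrUnknownKey   = errors.New("unknown key ID")
	ErrWrongPurpose = errors.New("handle purpose does not match key policy")
	ErrExpired      = errors.New("handle expired")
)

type keyEntry struct {
	purpose KeyPurpose
	priv    ed25519.PrivateKey
}

// memSigner is a hypothetical in-memory stand-in for an HSM/KMS-backed signer.
type memSigner struct {
	keys map[string]keyEntry
	now  func() int64 // injected clock (Unix seconds) so tests control expiry
}

// Sign checks key ID, purpose, and expiry before touching key material.
func (s *memSigner) Sign(h KeyHandle, msg []byte) ([]byte, error) {
	k, ok := s.keys[h.ID]
	if !ok {
		return nil, ErrUnknownKey
	}
	if h.Purpose != k.purpose {
		return nil, ErrWrongPurpose
	}
	if s.now() >= h.ExpiresAtUnix {
		return nil, ErrExpired
	}
	return ed25519.Sign(k.priv, msg), nil
}

// Verify checks sig against the public half of the named key.
func (s *memSigner) Verify(id string, msg, sig []byte) bool {
	k, ok := s.keys[id]
	return ok && ed25519.Verify(k.priv.Public().(ed25519.PublicKey), msg, sig)
}

// newDemoSigner builds a signer with one key from a fixed all-zero seed.
// Demo only: real keys come from a CSPRNG and stay inside the HSM/KMS.
func newDemoSigner(now func() int64) *memSigner {
	priv := ed25519.NewKeyFromSeed(make([]byte, ed25519.SeedSize))
	return &memSigner{
		keys: map[string]keyEntry{"k1": {purpose: "release-signing", priv: priv}},
		now:  now,
	}
}

func main() {
	var s Signer = newDemoSigner(func() int64 { return time.Now().Unix() })
	h := KeyHandle{ID: "k1", Purpose: "release-signing", ExpiresAtUnix: time.Now().Add(time.Minute).Unix()}
	_, err := s.Sign(h, []byte("artifact digest"))
	fmt.Println("sign ok:", err == nil)
	h.Purpose = "tls-server"
	_, err = s.Sign(h, []byte("artifact digest"))
	fmt.Println("wrong purpose rejected:", errors.Is(err, ErrWrongPurpose))
}
```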

Verification strategy

Constant-time validation: microbenchmarks + side-channel tooling where feasible.
Rotation drills: staged rollout, dual-sign windows, and rollback.
Misuse resistance tests: wrong purpose, wrong context, wrong key type must fail.
Config drift detection: policy-as-code with diffs treated as security events.
Forensics tests: can you reconstruct “who signed what” under load?
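
Config drift detection from the list above can start as a plain diff between a reviewed baseline and the live policy. The sketch below assumes the policy has been flattened to string settings; the Policy type, the Drift function, and the setting names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Policy is a hypothetical flattened crypto policy: setting name -> value.
type Policy map[string]string

// Drift lists every setting that was added, removed, or changed relative to
// the baseline, sorted so output is stable across runs.
func Drift(baseline, current Policy) []string {
	var out []string
	for k, want := range baseline {
		got, ok := current[k]
		switch {
		case !ok:
			out = append(out, fmt.Sprintf("removed %s (was %q)", k, want))
		case got != want:
			out = append(out, fmt.Sprintf("changed %s: %q -> %q", k, want, got))
		}
	}
	for k, got := range current {
		if _, ok := baseline[k]; !ok {
			out = append(out, fmt.Sprintf("added %s = %q", k, got))
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	baseline := Policy{"signing.alg": "ed25519", "tls.min_version": "1.3", "handle.ttl": "300s"}
	current := Policy{"signing.alg": "ed25519", "tls.min_version": "1.2", "handle.ttl": "300s"}
	for _, d := range Drift(baseline, current) {
		fmt.Println("SECURITY EVENT:", d)
	}
}
```

Here the only reported event is the downgrade of tls.min_version from "1.3" to "1.2".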

Operational notes

Separate duties and restrict production key access paths.
Make audit streams append-only and queryable during incidents.
Alert on policy drift: cipher suites, key sizes, algorithm toggles, TTL changes.
Automate rotation with safety rails (canary, dual-sign, fast rollback).
Test backup/restore for crypto material with the same rigor as databases.
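
The rotation item above depends on an overlap window: verifiers trust the new key before anything is signed with it, and keep trusting the old key until rollback is no longer needed. A minimal sketch, assuming a hypothetical KeyRing with Ed25519 keys built from fixed seeds for reproducibility:

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// Signed carries the key ID so verifiers can pick the right public key.
type Signed struct {
	KeyID string
	Msg   []byte
	Sig   []byte
}

// KeyRing is a hypothetical rotation-aware key set: many keys may verify,
// at most one signs.
type KeyRing struct {
	private map[string]ed25519.PrivateKey
	trusted map[string]ed25519.PublicKey
	active  string
}

func NewKeyRing() *KeyRing {
	return &KeyRing{
		private: make(map[string]ed25519.PrivateKey),
		trusted: make(map[string]ed25519.PublicKey),
	}
}

// Add registers a key for verification. It does not make the key active.
// Fixed seeds are demo only; real keys come from a CSPRNG inside the HSM/KMS.
func (r *KeyRing) Add(id string, seed byte) {
	s := make([]byte, ed25519.SeedSize)
	s[0] = seed
	priv := ed25519.NewKeyFromSeed(s)
	r.private[id] = priv
	r.trusted[id] = priv.Public().(ed25519.PublicKey)
}

// Activate switches the signing key. It only succeeds for trusted keys, so
// verifiers accept a key before anything is signed with it.
func (r *KeyRing) Activate(id string) bool {
	if _, ok := r.trusted[id]; !ok {
		return false
	}
	r.active = id
	return true
}

// Retire stops trusting a key. The active key cannot be retired.
func (r *KeyRing) Retire(id string) bool {
	if id == r.active {
		return false
	}
	delete(r.trusted, id)
	delete(r.private, id)
	return true
}

// Sign uses the active key and reports false if none is active.
func (r *KeyRing) Sign(msg []byte) (Signed, bool) {
	priv, ok := r.private[r.active]
	if !ok {
		return Signed{}, false
	}
	return Signed{KeyID: r.active, Msg: msg, Sig: ed25519.Sign(priv, msg)}, true
}

// Verify accepts a signature from any currently trusted key.
func (r *KeyRing) Verify(s Signed) bool {
	pub, ok := r.trusted[s.KeyID]
	return ok && ed25519.Verify(pub, s.Msg, s.Sig)
}

func main() {
	r := NewKeyRing()
	r.Add("k-old", 1)
	r.Activate("k-old")
	old, _ := r.Sign([]byte("release-1"))
	r.Add("k-new", 2)     // overlap window opens: both keys verify
	r.Activate("k-new")   // staged switch of the signing key
	r.Activate("k-old")   // rollback is just another Activate
	again, _ := r.Sign([]byte("release-2"))
	fmt.Println(r.Verify(old), r.Verify(again), again.KeyID)
}
```

Because rollback reuses the same Activate path as rollout, it gets exercised on every rotation rather than only during incidents.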

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

Authz failures and policy denials (unexpected spikes).
Rollback events and the conditions that triggered them.
Error budget burn + tail latency under load.
Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).

Rollback plan

Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
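
An explicit rollback trigger can be written as a pure function over canary signals, which makes it testable and reviewable like any other policy. The Signals and Trigger types and every threshold value below are illustrative assumptions, not recommended settings.

```go
package main

import "fmt"

// Signals is a hypothetical snapshot of canary metrics for one rollout stage.
type Signals struct {
	Requests            int
	VerifyFailures      int
	InvariantViolations int
	P99LatencyMillis    float64
}

// Trigger holds explicit thresholds. Values used below are illustrative.
type Trigger struct {
	MinRequests          int     // do not judge rates on too little traffic
	MaxVerifyFailureRate float64 // fraction of requests
	MaxP99LatencyMillis  float64
}

// ShouldRollback returns the decision and the reasons that fired, so the
// reasons can be logged as evidence alongside the rollback event.
func (t Trigger) ShouldRollback(s Signals) (bool, []string) {
	var reasons []string
	if s.InvariantViolations > 0 { // any violation trips, regardless of volume
		reasons = append(reasons, fmt.Sprintf("invariant violations: %d", s.InvariantViolations))
	}
	if s.Requests > 0 && s.Requests >= t.MinRequests {
		rate := float64(s.VerifyFailures) / float64(s.Requests)
		if rate > t.MaxVerifyFailureRate {
			reasons = append(reasons, fmt.Sprintf("verify failure rate %.4f > %.4f", rate, t.MaxVerifyFailureRate))
		}
		if s.P99LatencyMillis > t.MaxP99LatencyMillis {
			reasons = append(reasons, fmt.Sprintf("p99 latency %.0fms > %.0fms", s.P99LatencyMillis, t.MaxP99LatencyMillis))
		}
	}
	return len(reasons) > 0, reasons
}

func main() {
	trig := Trigger{MinRequests: 100, MaxVerifyFailureRate: 0.01, MaxP99LatencyMillis: 250}
	rollback, why := trig.ShouldRollback(Signals{Requests: 1000, VerifyFailures: 50, P99LatencyMillis: 120})
	fmt.Println(rollback, why)
}
```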

Evidence

RFC 5869: HKDF (1) — Domain separation and key derivation done sanely.
- Evidence: HKDF is the workhorse for domain separation; bind purpose/context to avoid cross-protocol key reuse.
Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

What would a KMS compromise look like in your telemetry?
How do you guarantee that audit does not become a data exfiltration channel?
What is your plan for emergency revocation at global scale?
Which secrets must remain confidential for 10+ years and where are they stored today?

Checklist

Safety properties stated as invariants.
Failure modes enumerated with mitigations.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Rollback plan rehearsed and automated.
Telemetry captures correctness signals.
Assumptions listed and reviewed.
