KMS/HSM Threat Models: When 'Managed' Doesn't Mean 'Safe'

Monthly research note. Theme: Cryptographic Infrastructure.

TL;DR

Treat the KMS/HSM threat model as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness. A managed service changes who runs the hardware, not who owns the consequences of key misuse.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

Bind purpose and context (domain separation) so keys can’t be misused accidentally.
Audit logs are evidence: make them tamper-evident and queryable during incidents.
Side-channel constraints turn performance details into security boundaries.
Make boundaries boring: validate inputs, cap costs, and be deterministic where needed.
Automate guardrails; humans are for judgment, not for consistent enforcement.

Why this matters

Most organizations don’t know where their keys live—until an incident.
Auditability must not become a secret-leaking logging pipeline.
Managed services shift responsibilities; they don’t remove them.
Operational reality (rotation, audit, rollback) is where crypto systems fail.

Key questions

How do you separate duties (operators vs developers vs security responders)?
How do you handle key erasure and “right to be forgotten” constraints?
Which operations must be constant-time and how do you validate that?
What is your disaster recovery story for KMS/HSM outages?
How do you prove usage (who signed what, when, and why) without leaking secrets?
What is the blast radius of compromise (tenant, service, region, environment)?

Assumptions

Attackers can observe timing and resource usage in shared environments.
Some environments are hostile (CI, ephemeral runners, shared build agents).
Secrets leak through logs, metrics, crash dumps, and backups unless prevented.
Rotation must occur under incident pressure; automation must be safe.
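
One way to make the "unless prevented" part concrete is a wrapper type that cannot be formatted into a log line by accident. The sketch below is a minimal illustration; `Secret` and `expose` are hypothetical names, and the wrapper does not zero memory on drop, so it does nothing for crash dumps or backups on its own.

```rust
use std::fmt;

/// Wrapper for secret bytes whose `Debug` and `Display` output never
/// includes the secret. Names are illustrative, not from a specific crate.
pub struct Secret(Vec<u8>);

impl Secret {
    pub fn new(bytes: Vec<u8>) -> Self {
        Secret(bytes)
    }

    /// The only way to read the bytes; a single name to search for in review.
    pub fn expose(&self) -> &[u8] {
        &self.0
    }
}

impl fmt::Debug for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Deliberately omits both contents and length.
        f.write_str("Secret([REDACTED])")
    }
}

impl fmt::Display for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("[REDACTED]")
    }
}
```

Because `Secret` does not derive `Debug`, any struct that embeds it and derives `Debug` picks up the redacted form automatically.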

Non-goals

Relying on manual rotation procedures for fleet-scale systems.
Passing raw private keys across process boundaries.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.
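
As a rough illustration of "validate early and fail fast", the sketch below parses a key reference of the form `purpose:id`. The format, the length cap, and the allowed purposes are assumptions made for the example, not part of any KMS API.

```rust
/// Upper bound on accepted input, checked before any other work.
/// The limit and the `purpose:id` format are assumptions for illustration.
const MAX_KEY_REF_LEN: usize = 128;

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    TooLong,
    MissingSeparator,
    UnknownPurpose,
    BadId,
}

#[derive(Debug, PartialEq, Eq)]
pub struct KeyRef {
    pub purpose: String,
    pub id: String,
}

pub fn parse_key_ref(input: &str) -> Result<KeyRef, ParseError> {
    // Cap cost first: reject oversized input before scanning it.
    if input.len() > MAX_KEY_REF_LEN {
        return Err(ParseError::TooLong);
    }
    // Split at the first ':' only; any later ':' lands in the id and is rejected below.
    let (purpose, id) = input.split_once(':').ok_or(ParseError::MissingSeparator)?;
    // Allow-list the purpose instead of passing arbitrary strings downstream.
    if !matches!(purpose, "tls" | "jwt" | "firmware" | "ledger") {
        return Err(ParseError::UnknownPurpose);
    }
    // Restrict the id to a small, unambiguous alphabet.
    let id_ok = !id.is_empty()
        && id.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'_');
    if !id_ok {
        return Err(ParseError::BadId);
    }
    Ok(KeyRef { purpose: purpose.to_string(), id: id.to_string() })
}
```

Each rejection has its own error variant, which keeps denials countable by reason in telemetry.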

Model & invariants

Audit integrity is a cryptographic property:

\mathrm{log\_entry} \leftarrow \mathrm{Sign}_{k_\text{audit}}(\mathrm{hash}(\text{event})\ \Vert\ \text{metadata}).

Assume compromise and design for recovery: rotation, revocation, and forensics.

Treat key identifiers as capabilities with purpose constraints—enforce in code and policy.

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
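
A minimal sketch of this invariant, assuming key material is tagged with an integer epoch that only increases on rotation. `EpochGuard` is a hypothetical name, and persisting the highest epoch across restarts is out of scope here.

```rust
use std::cmp::Ordering;

/// Tracks the highest key epoch seen and classifies each new observation.
/// No wall-clock time is consulted, so clock skew cannot affect the outcome.
#[derive(Debug, Default)]
pub struct EpochGuard {
    highest: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum EpochDecision {
    /// Newer than anything seen so far; the guard has moved forward.
    Advanced,
    /// Same as the highest epoch seen.
    Current,
    /// Older than the highest epoch seen; the caller should reject it.
    Stale,
}

impl EpochGuard {
    pub fn observe(&mut self, epoch: u64) -> EpochDecision {
        match epoch.cmp(&self.highest) {
            Ordering::Greater => {
                self.highest = epoch;
                EpochDecision::Advanced
            }
            Ordering::Equal => EpochDecision::Current,
            Ordering::Less => EpochDecision::Stale,
        }
    }
}
```

Whether a `Stale` result is a hard failure or is tolerated during a rotation window is a policy decision the guard leaves to the caller.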

Security properties

Least authority: privileges are scoped by purpose and time.
Integrity: invalid transitions are rejected (and detectable).
Authenticity: actions are bound to identity and purpose.
Evidence: critical actions emit verifiable audit events.

Failure modes

Timeout ambiguity causing double-apply or partial state transitions.
Recovery paths that only work when nothing is broken.
Mixed-version behavior that violates assumptions silently.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart TD
  gen["KeyGen (HSM/KMS)"] --> use["Use (TLS/VPN/Signing)"]
  use --> rot["Rotate (policy + automation)"]
  rot --> revoke["Revoke (incident)"]
  revoke --> audit["Audit/Forensics"]
  audit --> gen

Implementation notes

Make policy explicit and enforce it in the narrowest component possible.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.
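
One common way to make retries safe after an ambiguous timeout is to key each operation by a caller-supplied request id and replay the stored outcome on retry. The sketch below is in-memory only and uses a `u64` result for brevity; `IdempotentExecutor` is a hypothetical name, and a real deployment would need durable, shared storage.

```rust
use std::collections::HashMap;

/// Records the outcome of each request id so a retry replays the stored
/// result instead of applying the operation a second time.
#[derive(Default)]
pub struct IdempotentExecutor {
    completed: HashMap<String, u64>,
}

impl IdempotentExecutor {
    /// Runs `op` at most once per `request_id`.
    /// Returns the result and whether it was replayed from a previous call.
    pub fn execute<F>(&mut self, request_id: &str, op: F) -> (u64, bool)
    where
        F: FnOnce() -> u64,
    {
        if let Some(&previous) = self.completed.get(request_id) {
            return (previous, true);
        }
        let result = op();
        self.completed.insert(request_id.to_string(), result);
        (result, false)
    }
}
```

The sketch does not cover a crash between running `op` and recording its result; closing that gap requires the operation and the record to commit together.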

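#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Purpose { Tls, Jwt, Firmware, Ledger }

pub struct KeyHandle { id: String, purpose: Purpose }

impl KeyHandle {
    // Enforce purpose and algorithm policy at the boundary, not in the caller.
    // Sketch: only the purpose check is shown; an algorithm check would sit beside it.
    pub fn id_for(&self, wanted: Purpose) -> Option<&str> {
        (self.purpose == wanted).then(|| self.id.as_str())
    }
}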

Verification strategy

Config drift detection: policy-as-code with diffs treated as security events.
Misuse resistance tests: wrong purpose, wrong context, wrong key type must fail.
Forensics tests: can you reconstruct “who signed what” under load?
Constant-time validation: microbenchmarks + side-channel tooling where feasible.
Chaos for KMS: inject throttling, partial outages, and latency spikes.
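
For the constant-time item, the sketch below shows the usual shape of a comparison with no early exit on the first mismatch. It is best-effort only: the compiler and the hardware can still introduce timing differences, which is why the list pairs it with measurement. Production code would more typically rely on a vetted crate such as `subtle`.

```rust
use std::hint::black_box;

/// Compares two byte slices without returning early on the first mismatch.
/// `black_box` discourages the optimizer from reintroducing a short circuit,
/// but it is a hint, not a guarantee.
pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    // Length is treated as public here; only the contents are protected.
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        // Accumulate differences instead of branching on them.
        diff |= black_box(x ^ y);
    }
    black_box(diff) == 0
}
```

A function like this is a natural target for the microbenchmarks mentioned above: time it with mismatches at the first and last byte and check that the distributions are indistinguishable.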

Operational notes

Alert on policy drift: cipher suites, key sizes, algorithm toggles, TTL changes.
Test backup/restore for crypto material with the same rigor as databases.
Inventory keys and usage paths; treat unknown usage as an incident.
Make audit streams append-only and queryable during incidents.
Automate rotation with safety rails (canary, dual-sign, fast rollback).
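
To illustrate what "append-only" can mean in practice, the sketch below links each audit record to the digest of the previous one, so that editing or removing an earlier record breaks the chain. The digest function is passed in as a parameter because the standard library has no cryptographic hash; a real log would use one (for example SHA-256) and sign each record as in the formula under "Model & invariants". All type and function names are hypothetical.

```rust
/// One audit record: sequence number, digest of the previous record, payload.
#[derive(Debug, Clone)]
pub struct AuditRecord {
    pub seq: u64,
    pub prev_digest: Vec<u8>,
    pub event: Vec<u8>,
}

/// Bytes covered by the digest. Fixed-width and length-prefixed fields come
/// first so the encoding is unambiguous.
fn record_bytes(r: &AuditRecord) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&r.seq.to_be_bytes());
    out.extend_from_slice(&(r.prev_digest.len() as u64).to_be_bytes());
    out.extend_from_slice(&r.prev_digest);
    out.extend_from_slice(&r.event);
    out
}

/// Appends an event, linking it to the current tail of the log.
pub fn append<D: Fn(&[u8]) -> Vec<u8>>(log: &mut Vec<AuditRecord>, event: &[u8], digest: D) {
    let (seq, prev_digest) = match log.last() {
        Some(tail) => (tail.seq + 1, digest(&record_bytes(tail))),
        None => (0, Vec::new()),
    };
    log.push(AuditRecord { seq, prev_digest, event: event.to_vec() });
}

/// Returns the index of the first record whose sequence number or link to
/// its predecessor is wrong, or `None` if the chain is intact.
pub fn first_broken_link<D: Fn(&[u8]) -> Vec<u8>>(log: &[AuditRecord], digest: D) -> Option<usize> {
    for i in 0..log.len() {
        let expected_prev = if i == 0 {
            Vec::new()
        } else {
            digest(&record_bytes(&log[i - 1]))
        };
        if log[i].seq != i as u64 || log[i].prev_digest != expected_prev {
            return Some(i);
        }
    }
    None
}
```

A chain like this only detects changes to records that have a successor. Tampering with, or truncating, the tail is caught only if the latest digest is also anchored somewhere the writer cannot modify.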

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Error budget burn + tail latency under load.
Invariant violation rate (should be ~0).
Authz failures and policy denials (unexpected spikes).
Admission-control / rate-limit rejections (by reason).
Rollback events and the conditions that triggered them.

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Define an explicit rollback trigger (metrics + thresholds).
Prefer backward-compatible changes; avoid “flag day” upgrades.
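
An explicit rollback trigger can be as small as a struct of thresholds and a pure function that evaluates canary metrics against them. The field names and the metrics chosen below are illustrative assumptions, not recommended values.

```rust
/// Thresholds that define when a rollout must be rolled back.
#[derive(Debug, Clone, Copy)]
pub struct RollbackPolicy {
    /// Maximum tolerated fraction of failed requests, in 0.0..=1.0.
    pub max_error_rate: f64,
    pub max_p99_latency_ms: u64,
    pub max_invariant_violations: u64,
}

/// Metrics sampled from the canary during a rollout stage.
#[derive(Debug, Clone, Copy)]
pub struct CanaryMetrics {
    pub error_rate: f64,
    pub p99_latency_ms: u64,
    pub invariant_violations: u64,
}

/// Returns the name of every breached threshold; an empty list means "keep going".
pub fn rollback_reasons(policy: &RollbackPolicy, m: &CanaryMetrics) -> Vec<&'static str> {
    let mut reasons = Vec::new();
    // Written as "not within budget" so a NaN error rate counts as a breach.
    if !(m.error_rate <= policy.max_error_rate) {
        reasons.push("error_rate");
    }
    if m.p99_latency_ms > policy.max_p99_latency_ms {
        reasons.push("p99_latency_ms");
    }
    if m.invariant_violations > policy.max_invariant_violations {
        reasons.push("invariant_violations");
    }
    reasons
}
```

Returning the reasons rather than a bare boolean means the rollback event itself records which condition fired.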

Evidence

Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
RFC 5869: HKDF (2) — Domain separation and key derivation done sanely.
- Evidence: HKDF is the workhorse for domain separation; bind purpose/context to avoid cross-protocol key reuse.
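
A sketch of the binding step, assuming the derived key's context consists of an application name, a purpose, a tenant, and an epoch (all hypothetical fields). It only builds the byte string that would be passed as HKDF's `info` input; the HKDF computation itself is not shown and would come from a vetted library.

```rust
/// Builds an unambiguous context label suitable for HKDF's `info` input.
/// Each string field is length-prefixed so ("ab", "c") and ("a", "bc")
/// can never encode to the same bytes.
pub fn derive_info(application: &str, purpose: &str, tenant: &str, epoch: u32) -> Vec<u8> {
    let mut info = Vec::new();
    for field in [application, purpose, tenant] {
        let bytes = field.as_bytes();
        // Big-endian u32 length prefix; fields are assumed to be far shorter than u32::MAX.
        info.extend_from_slice(&(bytes.len() as u32).to_be_bytes());
        info.extend_from_slice(bytes);
    }
    // Fixed-width epoch goes last and needs no prefix.
    info.extend_from_slice(&epoch.to_be_bytes());
    info
}
```

Plain concatenation with a separator character is a common alternative, but it stays unambiguous only if the separator is forbidden inside every field; length prefixes avoid that extra rule.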

Open questions

How do you guarantee that audit does not become a data exfiltration channel?
Which secrets must remain confidential for 10+ years and where are they stored today?
What would a KMS compromise look like in your telemetry?
What is your plan for emergency revocation at global scale?

Checklist

Rollback plan rehearsed and automated.
Failure modes enumerated with mitigations.
Safety properties stated as invariants.
Assumptions listed and reviewed.
Telemetry captures correctness signals.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
