Monthly research note. Theme: Cryptographic Infrastructure.

TL;DR

Treat key-compromise playbooks as an engineering constraint on crypto systems: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

  • Audit logs are evidence: make them tamper-evident and queryable during incidents.
  • Rotation and rollback are core features—design them before you ship.
  • Bind purpose and context (domain separation) so keys can’t be misused accidentally.
  • Automate guardrails; humans are for judgment, not for consistent enforcement.
  • Define safety properties before performance goals.

Why this matters

  • Cryptographic agility is useless if rollout and rollback are unsafe.
  • Side channels turn performance details into security boundaries.
  • Key management failures are systemic: the breach is “a workflow,” not a bug.
  • Most organizations don’t know where their keys live—until an incident.

Key questions

  • How do you separate duties (operators vs developers vs security responders)?
  • How do you prove usage (who signed what, when, and why) without leaking secrets?
  • What is the root of trust (HSM, TPM, offline CA, threshold ceremony)?
  • What is the blast radius of compromise (tenant, service, region, environment)?
  • What is your disaster recovery story for KMS/HSM outages?
  • How do you handle key erasure and “right to be forgotten” constraints?

Assumptions

  • Secrets leak through logs, metrics, crash dumps, and backups unless prevented.
  • Certificate chains and policies evolve; clients won’t all update together.
  • Some environments are hostile (CI, ephemeral runners, shared build agents).
  • Key usage is high-volume; audit pipelines must scale without sampling away truth.
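The first assumption (secrets leak through logs and crash dumps unless prevented) can be partly enforced in the type system. The following is a minimal sketch, not a complete defense: `Secret` and `expose` are hypothetical names, and the wrapper only covers formatting paths. It does nothing for backups, and it does not zero memory on drop.

```rust
use std::fmt;

/// Hypothetical wrapper: holds secret bytes but never prints them.
pub struct Secret(Vec<u8>);

impl Secret {
    pub fn new(bytes: Vec<u8>) -> Self {
        Secret(bytes)
    }

    /// Single, greppable access point for the raw bytes.
    pub fn expose(&self) -> &[u8] {
        &self.0
    }
}

// Manual Debug impl: anything formatted with {:?} shows a placeholder only.
// Display is deliberately not implemented, so "{}" fails to compile.
impl fmt::Debug for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Secret([REDACTED])")
    }
}
```

Because every read of the raw bytes goes through `expose`, a code search for that one method name lists the places a secret can escape.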

Non-goals

  • Passing raw private keys across process boundaries.
  • Assuming “HSM = secure” without defining the threat model.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

Audit integrity is a cryptographic property:

$$\mathrm{log\_entry} \leftarrow \mathrm{Sign}_{k_\text{audit}}(\mathrm{hash}(\text{event})\ \Vert\ \text{metadata})$$

Audit logs are evidence. Make them tamper-evident and operationally accessible.
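One common way to make a log tamper-evident is to chain each entry to the digest of the one before it, so that editing any record invalidates it and everything after it. The sketch below shows only the chaining structure. `AuditLog`, `Entry`, and `first_invalid` are hypothetical names, and `DefaultHasher` is a non-cryptographic stand-in: a real system would use a cryptographic hash and sign each digest under the audit key, as in the formula above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One audit record, linked to its predecessor through `prev`.
#[derive(Debug, Clone)]
pub struct Entry {
    pub event: String,
    pub metadata: String,
    pub prev: u64,
    pub digest: u64,
}

// Stand-in digest over (previous digest, event, metadata).
// NOT cryptographic: replace with SHA-256 plus a signature in real use.
fn digest(prev: u64, event: &str, metadata: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    event.hash(&mut h);
    metadata.hash(&mut h);
    h.finish()
}

#[derive(Default)]
pub struct AuditLog {
    pub entries: Vec<Entry>,
}

impl AuditLog {
    /// Append a record chained to the current head (0 for an empty log).
    pub fn append(&mut self, event: &str, metadata: &str) {
        let prev = self.entries.last().map_or(0, |e| e.digest);
        let d = digest(prev, event, metadata);
        self.entries.push(Entry {
            event: event.to_string(),
            metadata: metadata.to_string(),
            prev,
            digest: d,
        });
    }

    /// Index of the first entry that fails verification, if any.
    pub fn first_invalid(&self) -> Option<usize> {
        let mut prev = 0u64;
        for (i, e) in self.entries.iter().enumerate() {
            if e.prev != prev || e.digest != digest(prev, &e.event, &e.metadata) {
                return Some(i);
            }
            prev = e.digest;
        }
        None
    }
}
```

A chain like this detects edits but not truncation of the tail. Detecting truncation requires publishing or countersigning the head digest somewhere the log writer cannot modify.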

Treat key identifiers as capabilities with purpose constraints—enforce in code and policy.

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

  • Integrity: invalid transitions are rejected (and detectable).
  • Authenticity: actions are bound to identity and purpose.
  • Evidence: critical actions emit verifiable audit events.
  • Least authority: privileges are scoped by purpose and time.

Failure modes

  • Recovery paths that only work when nothing is broken.
  • Config drift that weakens security posture over time.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.
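The timeout-ambiguity failure mode listed above is usually handled with idempotency keys: the caller picks a request id, and a retry with the same id returns the original result instead of applying the change again. This is a minimal in-memory sketch under that assumption. `Rotator` and its fields are hypothetical, and a real implementation would persist the request table atomically with the version bump.

```rust
use std::collections::HashMap;

/// Hypothetical rotation service state.
#[derive(Default)]
pub struct Rotator {
    versions: HashMap<String, u32>,            // key id -> current version
    completed: HashMap<String, (String, u32)>, // request id -> (key id, resulting version)
}

impl Rotator {
    /// Rotate `key_id` once per `request_id`; retries return the same version.
    pub fn rotate(&mut self, request_id: &str, key_id: &str) -> Result<u32, String> {
        if let Some((prior_key, version)) = self.completed.get(request_id) {
            if prior_key.as_str() != key_id {
                return Err(format!("request id {} reused for a different key", request_id));
            }
            return Ok(*version); // replay of a finished request: no second rotation
        }
        let v = self.versions.entry(key_id.to_string()).or_insert(0);
        *v += 1;
        let new_version = *v;
        self.completed
            .insert(request_id.to_string(), (key_id.to_string(), new_version));
        Ok(new_version)
    }
}
```

A client that times out on `rotate("req-1", "tls-edge")` can resend the same call without risking a second rotation. Only a new request id advances the version.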

Design sketch

flowchart TD
  gen["KeyGen (HSM/KMS)"] --> use["Use (TLS/VPN/Signing)"]
  use --> rot["Rotate (policy + automation)"]
  rot --> revoke["Revoke (incident)"]
  revoke --> audit["Audit/Forensics"]
  audit --> gen
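The flowchart can be turned into an executable invariant by encoding its edges as the only permitted state transitions. The sketch below does exactly that and nothing more. The state names are hypothetical, and a production lifecycle would likely need extra edges the diagram does not draw (for example, emergency revocation directly from use).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeyState {
    Generated,
    InUse,
    Rotating,
    Revoked,
    Audited,
}

#[derive(Debug, PartialEq, Eq)]
pub struct InvalidTransition {
    pub from: KeyState,
    pub to: KeyState,
}

/// Allow only the edges drawn in the flowchart; reject everything else.
pub fn transition(from: KeyState, to: KeyState) -> Result<KeyState, InvalidTransition> {
    use KeyState::*;
    match (from, to) {
        (Generated, InUse)
        | (InUse, Rotating)
        | (Rotating, Revoked)
        | (Revoked, Audited)
        | (Audited, Generated) => Ok(to),
        _ => Err(InvalidTransition { from, to }),
    }
}
```

Returning the rejected pair in the error gives the audit pipeline something concrete to record when an invalid transition is attempted.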

Implementation notes

Never pass secrets around; pass handles with purpose constraints.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

#[derive(Clone, Copy, Debug)]
pub enum Purpose { Tls, Jwt, Firmware, Ledger }

pub struct KeyHandle { id: String, purpose: Purpose }

// Enforce purpose and algorithm policy at the boundary, not in the caller.
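One way to read the closing comment is as a single authorization function that every request passes through before it reaches the HSM/KMS. The sketch below repeats the two types above so that it compiles on its own. `Algorithm`, `PolicyError`, `authorize`, and the deny-list are illustrative assumptions, not part of the original design.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Purpose { Tls, Jwt, Firmware, Ledger }

// Hypothetical algorithm identifiers, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Algorithm { Ed25519, EcdsaP256, RsaPkcs1Sha1 }

pub struct KeyHandle { id: String, purpose: Purpose }

#[derive(Debug, PartialEq, Eq)]
pub enum PolicyError {
    WrongPurpose { expected: Purpose, requested: Purpose },
    AlgorithmDenied(Algorithm),
}

impl KeyHandle {
    pub fn new(id: &str, purpose: Purpose) -> Self {
        KeyHandle { id: id.to_string(), purpose }
    }
}

/// Boundary check: returns the key id only if purpose and algorithm pass policy.
pub fn authorize(
    handle: &KeyHandle,
    requested: Purpose,
    alg: Algorithm,
) -> Result<&str, PolicyError> {
    if handle.purpose != requested {
        return Err(PolicyError::WrongPurpose { expected: handle.purpose, requested });
    }
    // Illustrative deny-list; a real policy would come from reviewed configuration.
    if alg == Algorithm::RsaPkcs1Sha1 {
        return Err(PolicyError::AlgorithmDenied(alg));
    }
    Ok(&handle.id)
}
```

Callers never inspect `purpose` themselves. They ask for an operation and receive either a usable key id or a typed denial that can be logged and counted.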

Verification strategy

  • Constant-time validation: microbenchmarks + side-channel tooling where feasible.
  • Forensics tests: can you reconstruct “who signed what” under load?
  • Config drift detection: policy-as-code with diffs treated as security events.
  • Misuse resistance tests: wrong purpose, wrong context, wrong key type must fail.
  • Chaos for KMS: inject throttling, partial outages, and latency spikes.

Operational notes

  • Alert on policy drift: cipher suites, key sizes, algorithm toggles, TTL changes.
  • Separate duties and restrict production key access paths.
  • Automate rotation with safety rails (canary, dual-sign, fast rollback).
  • Inventory keys and usage paths; treat unknown usage as an incident.
  • Test backup/restore for crypto material with the same rigor as databases.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.
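An explicit trigger can be as small as a pure function from observed signals to a decision, which makes it easy to review and to test ahead of the rollout. The signal names and threshold values below are hypothetical. The only point carried over from the text is that the rule is written down and executable.

```rust
/// Hypothetical signals sampled during a staged rollout.
#[derive(Debug, Clone, Copy)]
pub struct Signals {
    pub invariant_violations: u64,
    pub handshake_failure_rate: f64, // fraction of attempts, 0.0 to 1.0
    pub p99_latency_ms: f64,
}

#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub max_handshake_failure_rate: f64,
    pub max_p99_latency_ms: f64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Continue,
    Rollback(&'static str), // reason, for the audit trail
}

/// Checks are ordered by severity; the first breached rule wins.
pub fn evaluate(s: &Signals, t: &Thresholds) -> Decision {
    if s.invariant_violations > 0 {
        return Decision::Rollback("invariant violation");
    }
    if s.handshake_failure_rate > t.max_handshake_failure_rate {
        return Decision::Rollback("handshake failure rate");
    }
    if s.p99_latency_ms > t.max_p99_latency_ms {
        return Decision::Rollback("p99 latency");
    }
    Decision::Continue
}
```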

What to monitor

  • Admission-control / rate-limit rejections (by reason).
  • Retry/timeout rates by endpoint and client cohort.
  • Invariant violation rate (should be ~0).
  • Error budget burn + tail latency under load.
  • Authz failures and policy denials (unexpected spikes).

Rollback plan

  • Keep dual-write / dual-verify windows where appropriate.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Use canaries and staged rollout; stop early when signals degrade.
  • Define an explicit rollback trigger (metrics + thresholds).
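A dual-verify window from the list above can be sketched as a verifier that accepts the current key plus the previous key until a deadline, with rollback implemented as a swap while that window is still open. `VerifyPolicy` and its fields are hypothetical, timestamps are plain seconds supplied by the caller, and real signature verification is out of scope here.

```rust
/// Hypothetical verifier state during a rotation.
pub struct VerifyPolicy {
    pub current: String,
    pub previous: Option<(String, u64)>, // (key id, accept-until time in seconds)
}

impl VerifyPolicy {
    /// True if a signature made under `key_id` should be accepted at `now`.
    pub fn accepts(&self, key_id: &str, now: u64) -> bool {
        if key_id == self.current.as_str() {
            return true;
        }
        match &self.previous {
            Some((id, until)) => key_id == id.as_str() && now <= *until,
            None => false,
        }
    }

    /// Swap back to the previous key; only possible while its window is open.
    /// The demoted key stays verifiable until the same deadline.
    pub fn rollback(&mut self, now: u64) -> bool {
        match self.previous.take() {
            Some((id, until)) if now <= until => {
                let demoted = std::mem::replace(&mut self.current, id);
                self.previous = Some((demoted, until));
                true
            }
            other => {
                self.previous = other; // window closed or no previous key: unchanged
                false
            }
        }
    }
}
```

The design choice worth noting is that rollback fails closed: once the window has expired, the previous key cannot be reinstated through this path, so an expired key is never silently revived.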

Evidence

  • Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
    • Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
  • RFC 8446: TLS 1.3 (2) — Modern handshake design, key schedule, and downgrade resistance patterns.
    • Evidence: Handshake transcript binding and downgrade resistance patterns; monitor negotiation paths and failure reasons.

Open questions

  • What would a KMS compromise look like in your telemetry?
  • Which secrets must remain confidential for 10+ years and where are they stored today?
  • How do you guarantee that audit does not become a data exfiltration channel?
  • What is your plan for emergency revocation at global scale?

Checklist

  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Telemetry captures correctness signals.
  • Failure modes enumerated with mitigations.
  • Rollback plan rehearsed and automated.
  • Assumptions listed and reviewed.
  • Safety properties stated as invariants.

Further reading

1. Beyer B, Jones C, Petoff J, Murphy NR. Site Reliability Engineering: How Google Runs Production Systems [Internet]. O’Reilly Media; 2016. Available from: https://sre.google/sre-book/table-of-contents/
2. Rescorla E. The Transport Layer Security (TLS) Protocol Version 1.3 [Internet]. RFC Editor; 2018. Report No.: 8446. Available from: https://www.rfc-editor.org/rfc/rfc8446