Hardware Roots of Trust: TPM, Secure Boot, and Attestation

Monthly research note. Theme: Cryptographic Infrastructure.

TL;DR

Treat hardware roots of trust (TPM, Secure Boot, and attestation) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

Rotation and rollback are core features—design them before you ship.
Treat key IDs as capabilities; never pass raw private key material across boundaries.
Side-channel constraints turn performance details into security boundaries.
Measure correctness signals, not only latency/throughput.
Define safety properties before performance goals.

Why this matters

Managed services shift responsibilities; they don’t remove them.
Most organizations don’t know where their keys live—until an incident.
Cryptographic agility is useless if rollout and rollback are unsafe.
Auditability must not become a secret-leaking logging pipeline.

Key questions

What is your disaster recovery story for KMS/HSM outages?
What is the blast radius of compromise (tenant, service, region, environment)?
What is the rollback plan when a new algorithm breaks production?
How do keys rotate safely (overlap windows, dual-sign, staged rollout)?
How do you prove usage (who signed what, when, and why) without leaking secrets?
How do you handle key erasure and “right to be forgotten” constraints?

Assumptions

Rotation must occur under incident pressure; automation must be safe.
Secrets leak through logs, metrics, crash dumps, and backups unless prevented.
Key usage is high-volume; audit pipelines must scale without sampling away truth.
Certificate chains and policies evolve; clients won’t all update together.

Non-goals

Passing raw private keys across process boundaries.
Designing audit trails that expose sensitive plaintext or identifiers.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

Key derivation is where protocols quietly succeed or fail. A sane default is domain-separated HKDF:

k \leftarrow \mathrm{HKDF}(\text{salt},\ \text{ikm},\ \text{info}=\text{context}).

Audit logs are evidence. Make them tamper-evident and operationally accessible.
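One common way to make an audit log tamper-evident is to chain entries by hash, so that editing any earlier record breaks verification of that record and everything after it. The sketch below is a minimal illustration under stated assumptions: the `AuditEntry` shape, the field names, and the `Append`/`Verify` helpers are hypothetical and not taken from any particular logging system.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// AuditEntry is a hypothetical record shape; field names are illustrative.
type AuditEntry struct {
	Actor    string // who performed the action
	KeyID    string // key identifier only, never key material
	Action   string // what was done
	PrevHash string // hex hash of the previous entry ("" for the first)
	Hash     string // hex hash over PrevHash and this entry's fields
}

// entryHash hashes the fields with length prefixes so that field
// boundaries cannot be shifted without changing the digest.
func entryHash(prev, actor, keyID, action string) string {
	h := sha256.New()
	for _, f := range []string{prev, actor, keyID, action} {
		fmt.Fprintf(h, "%d:%s|", len(f), f)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Append links a new entry to the current head of the chain.
func Append(entries []AuditEntry, actor, keyID, action string) []AuditEntry {
	prev := ""
	if len(entries) > 0 {
		prev = entries[len(entries)-1].Hash
	}
	e := AuditEntry{Actor: actor, KeyID: keyID, Action: action, PrevHash: prev}
	e.Hash = entryHash(prev, actor, keyID, action)
	return append(entries, e)
}

// Verify returns the index of the first inconsistent entry, or -1 if the
// chain is intact.
func Verify(entries []AuditEntry) int {
	prev := ""
	for i, e := range entries {
		if e.PrevHash != prev || e.Hash != entryHash(e.PrevHash, e.Actor, e.KeyID, e.Action) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var entries []AuditEntry
	entries = Append(entries, "svc-a", "key-1", "sign")
	entries = Append(entries, "svc-b", "key-1", "rotate")
	fmt.Println(Verify(entries)) // -1: chain intact

	entries[0].Action = "export" // simulate tampering with an old record
	fmt.Println(Verify(entries)) // 0: first broken entry
}
```

A hash chain on its own only detects edits by someone who cannot rewrite the whole chain; in practice the latest hash would also need to be anchored somewhere the log writer cannot modify.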

Bind every derived key to context: protocol, role, version, and transcript.
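To make the context binding concrete, here is a minimal sketch of HKDF with HMAC-SHA-256, following the extract-then-expand construction in RFC 5869 and using only the Go standard library. The `contextInfo` helper and its length-prefixed layout are assumptions made for illustration; the note does not prescribe an encoding for `info`.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"errors"
	"fmt"
)

// hkdfSHA256 is a sketch of RFC 5869 extract-then-expand with HMAC-SHA-256.
func hkdfSHA256(salt, ikm, info []byte, length int) ([]byte, error) {
	if length <= 0 || length > 255*sha256.Size {
		return nil, errors.New("hkdf: invalid output length")
	}
	// Extract: PRK = HMAC(salt, IKM).
	ext := hmac.New(sha256.New, salt)
	ext.Write(ikm)
	prk := ext.Sum(nil)

	// Expand: T(i) = HMAC(PRK, T(i-1) || info || i), concatenated and truncated.
	var okm, t []byte
	for i := byte(1); len(okm) < length; i++ {
		exp := hmac.New(sha256.New, prk)
		exp.Write(t)
		exp.Write(info)
		exp.Write([]byte{i})
		t = exp.Sum(nil)
		okm = append(okm, t...)
	}
	return okm[:length], nil
}

// contextInfo encodes protocol, role, version, and a transcript hash into
// the info parameter. The 1-byte length prefix is an illustrative choice
// and assumes every field is shorter than 256 bytes.
func contextInfo(protocol, role, version string, transcriptHash []byte) []byte {
	var info []byte
	for _, f := range [][]byte{[]byte(protocol), []byte(role), []byte(version), transcriptHash} {
		info = append(info, byte(len(f)))
		info = append(info, f...)
	}
	return info
}

func main() {
	ikm := []byte("shared secret from key agreement (demo value)")
	salt := []byte("demo salt")
	th := sha256.Sum256([]byte("handshake transcript (demo value)"))

	client, _ := hkdfSHA256(salt, ikm, contextInfo("demo-proto", "client", "v1", th[:]), 32)
	server, _ := hkdfSHA256(salt, ikm, contextInfo("demo-proto", "server", "v1", th[:]), 32)
	fmt.Printf("client key: %x\nserver key: %x\n", client, server)
}
```

Because the role is part of `info`, the client and server keys differ even though they share the same input keying material and salt, which is the cross-context separation the invariant asks for.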

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.

Security properties

Integrity: invalid transitions are rejected (and detectable).
Authenticity: actions are bound to identity and purpose.
Least authority: privileges are scoped by purpose and time.
Replay resistance: duplicated inputs do not change outcomes.
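Replay resistance can be illustrated with an idempotent operation keyed by a request ID: a retried request returns the original result instead of applying twice. The `Rotator` type below is a hypothetical in-memory sketch, not a real KMS API, and it keeps request IDs forever, which a production system would need to bound.

```go
package main

import (
	"fmt"
	"sync"
)

// Rotator bumps a key version at most once per request ID, so a "rotate"
// retried after an ambiguous timeout cannot rotate twice.
type Rotator struct {
	mu      sync.Mutex
	version int
	applied map[string]int // request ID -> version produced by that request
}

func NewRotator() *Rotator {
	return &Rotator{applied: make(map[string]int)}
}

// Rotate applies the rotation once per requestID. For a duplicate it
// returns the version produced by the first attempt and replayed=true.
func (r *Rotator) Rotate(requestID string) (version int, replayed bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if v, ok := r.applied[requestID]; ok {
		return v, true
	}
	r.version++
	r.applied[requestID] = r.version
	return r.version, false
}

func main() {
	r := NewRotator()
	v1, _ := r.Rotate("req-42")
	v2, dup := r.Rotate("req-42") // retry of the same request
	fmt.Println(v1, v2, dup)      // 1 1 true
}
```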

Failure modes

Recovery paths that only work when nothing is broken.
Timeout ambiguity causing double-apply or partial state transitions.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Config drift that weakens security posture over time.

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart TD
  gen["KeyGen (HSM/KMS)"] --> use["Use (TLS/VPN/Signing)"]
  use --> rot["Rotate (policy + automation)"]
  rot --> revoke["Revoke (incident)"]
  revoke --> audit["Audit/Forensics"]
  audit --> gen

Implementation notes

Crypto infra is a product: UX, policy, audit, and rollback must compose.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.

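// Capability-style API: callers get a handle scoped to purpose + TTL,
// never the private key itself.
type KeyPurpose string

type KeyHandle struct {
  ID            string     // opaque key identifier; not key material
  Purpose       KeyPurpose // the only use this handle authorizes
  ExpiresAtUnix int64      // expiry as a Unix timestamp
}

// Signer performs the private-key operation on the caller's behalf,
// so key material stays behind the HSM/KMS boundary.
type Signer interface {
  Sign(h KeyHandle, msg []byte) (sig []byte, err error)
}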
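As a sketch of how an implementation of this interface might enforce the handle's scope, the example below checks the key ID, purpose, and expiry before signing with Ed25519. Everything beyond `KeyPurpose`, `KeyHandle`, and `Signer` is an assumption for illustration: `localSigner` is an in-memory stand-in for an HSM/KMS, and the error values, the `newDemoSigner` and `verifyFor` helpers, and the purpose-prefix convention are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
	"time"
)

type KeyPurpose string

type KeyHandle struct {
	ID            string
	Purpose       KeyPurpose
	ExpiresAtUnix int64
}

type Signer interface {
	Sign(h KeyHandle, msg []byte) (sig []byte, err error)
}

var (
	ErrUnknownKey   = errors.New("unknown key id")
	ErrWrongPurpose = errors.New("handle purpose does not match key purpose")
	ErrExpired      = errors.New("handle expired")
)

type keyRecord struct {
	purpose KeyPurpose
	priv    ed25519.PrivateKey
}

// localSigner is a hypothetical in-memory stand-in for an HSM/KMS.
type localSigner struct {
	keys map[string]keyRecord
	now  func() int64 // injectable clock (Unix seconds) for testing
}

var _ Signer = (*localSigner)(nil)

func (s *localSigner) Sign(h KeyHandle, msg []byte) ([]byte, error) {
	rec, ok := s.keys[h.ID]
	if !ok {
		return nil, ErrUnknownKey
	}
	if rec.purpose != h.Purpose {
		return nil, ErrWrongPurpose
	}
	if s.now() >= h.ExpiresAtUnix {
		return nil, ErrExpired
	}
	// Prefix the purpose so a signature made for one purpose does not
	// verify under another (an illustrative convention).
	return ed25519.Sign(rec.priv, append([]byte(h.Purpose+"\x00"), msg...)), nil
}

// verifyFor checks a signature under the same purpose-prefix convention.
func verifyFor(pub ed25519.PublicKey, purpose KeyPurpose, msg, sig []byte) bool {
	return ed25519.Verify(pub, append([]byte(purpose+"\x00"), msg...), sig)
}

// newDemoSigner creates one "artifact-signing" key with ID "key-1" and a
// fixed clock.
func newDemoSigner(nowUnix int64) (*localSigner, ed25519.PublicKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return &localSigner{
		keys: map[string]keyRecord{"key-1": {purpose: "artifact-signing", priv: priv}},
		now:  func() int64 { return nowUnix },
	}, pub
}

func main() {
	now := time.Now().Unix()
	s, pub := newDemoSigner(now)
	h := KeyHandle{ID: "key-1", Purpose: "artifact-signing", ExpiresAtUnix: now + 60}
	sig, err := s.Sign(h, []byte("release-1.2.3"))
	fmt.Println(err == nil, verifyFor(pub, "artifact-signing", []byte("release-1.2.3"), sig))

	h.Purpose = "tls-server" // wrong purpose must fail
	_, err = s.Sign(h, []byte("release-1.2.3"))
	fmt.Println(err)
}
```

In this sketch the handle is a plain struct that the caller fills in, so the checks only guard against accidental misuse. A real service would need to authenticate the handle or look up its scope server-side.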

Verification strategy

Rotation drills: staged rollout, dual-sign windows, and rollback.
Constant-time validation: microbenchmarks + side-channel tooling where feasible.
Chaos for KMS: inject throttling, partial outages, and latency spikes.
Misuse resistance tests: wrong purpose, wrong context, wrong key type must fail.
Forensics tests: can you reconstruct “who signed what” under load?
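For the constant-time item above, one small and checkable piece is tag comparison. The sketch below verifies an HMAC-SHA-256 tag with `hmac.Equal`, which the Go standard library documents as comparing without leaking timing information, rather than a short-circuiting byte comparison. The `tag` and `validTag` helpers are hypothetical names, and this covers only the comparison step, not the rest of a verification path.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// tag computes an HMAC-SHA-256 tag over msg.
func tag(key, msg []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write(msg)
	return m.Sum(nil)
}

// validTag recomputes the tag and compares it using hmac.Equal instead of
// an early-exit comparison such as bytes.Equal.
func validTag(key, msg, got []byte) bool {
	return hmac.Equal(tag(key, msg), got)
}

func main() {
	key := []byte("demo key")
	msg := []byte("demo message")
	good := tag(key, msg)

	bad := append([]byte{}, good...)
	bad[0] ^= 1 // flip one bit

	fmt.Println(validTag(key, msg, good), validTag(key, msg, bad)) // true false
}
```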

Operational notes

Test backup/restore for crypto material with the same rigor as databases.
Alert on policy drift: cipher suites, key sizes, algorithm toggles, TTL changes.
Separate duties and restrict production key access paths.
Automate rotation with safety rails (canary, dual-sign, fast rollback).
Inventory keys and usage paths; treat unknown usage as an incident.
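Policy drift alerting can start as a plain comparison between an approved baseline and the live configuration. The sketch below is one possible shape: the `Policy` struct, its fields, and the rule that only weakening changes count as drift are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Policy is a hypothetical snapshot of crypto settings to watch for drift.
type Policy struct {
	MinKeyBits   int
	CipherSuites []string
	HandleTTLSec int
}

// Drift lists the ways live is weaker than baseline: smaller keys, longer
// handle TTLs, or cipher suites that the baseline does not approve.
func Drift(baseline, live Policy) []string {
	var findings []string
	if live.MinKeyBits < baseline.MinKeyBits {
		findings = append(findings,
			fmt.Sprintf("min key size weakened: %d < %d", live.MinKeyBits, baseline.MinKeyBits))
	}
	if live.HandleTTLSec > baseline.HandleTTLSec {
		findings = append(findings,
			fmt.Sprintf("handle TTL lengthened: %ds > %ds", live.HandleTTLSec, baseline.HandleTTLSec))
	}
	approved := make(map[string]bool, len(baseline.CipherSuites))
	for _, s := range baseline.CipherSuites {
		approved[s] = true
	}
	var extra []string
	for _, s := range live.CipherSuites {
		if !approved[s] {
			extra = append(extra, s)
		}
	}
	sort.Strings(extra) // stable output for alerts and tests
	for _, s := range extra {
		findings = append(findings, "unapproved cipher suite: "+s)
	}
	return findings
}

func main() {
	baseline := Policy{MinKeyBits: 3072, CipherSuites: []string{"TLS_AES_128_GCM_SHA256"}, HandleTTLSec: 300}
	live := Policy{
		MinKeyBits:   2048,
		CipherSuites: []string{"TLS_AES_128_GCM_SHA256", "TLS_RSA_WITH_RC4_128_SHA"},
		HandleTTLSec: 3600,
	}
	for _, f := range Drift(baseline, live) {
		fmt.Println("ALERT:", f)
	}
}
```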

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Rollback events and the conditions that triggered them.
Error budget burn + tail latency under load.
Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Authz failures and policy denials (unexpected spikes).

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Define an explicit rollback trigger (metrics + thresholds).
Prefer backward-compatible changes; avoid “flag day” upgrades.

Evidence

RFC 5869: HKDF (1) — Domain separation and key derivation done sanely.
- Evidence: HKDF is the workhorse for domain separation; bind purpose/context to avoid cross-protocol key reuse.
Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

What is your plan for emergency revocation at global scale?
What would a KMS compromise look like in your telemetry?
Which secrets must remain confidential for 10+ years and where are they stored today?
How do you guarantee that audit does not become a data exfiltration channel?

Checklist

Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Assumptions listed and reviewed.
Failure modes enumerated with mitigations.
Telemetry captures correctness signals.
Safety properties stated as invariants.
Rollback plan rehearsed and automated.
