Side Channels: Constant-Time, Cache Attacks, and Real Threat Models

Monthly research note. Theme: Cryptographic Infrastructure.

TL;DR

This memo covers side channels (constant-time code, cache attacks, and realistic threat models) in three steps: define the model, state the properties, then design the system so those properties hold under failure and against adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Audit logs are evidence: make them tamper-evident and queryable during incidents.
Treat key IDs as capabilities; never pass raw private key material across boundaries.
Rotation and rollback are core features—design them before you ship.
Write assumptions down; treat them as interfaces.
Define safety properties before performance goals.

Why this matters

Key management failures are systemic: the breach is “a workflow,” not a bug.
Policy drift silently turns strong crypto into weak practice.
Most organizations don’t know where their keys live—until an incident.
Side channels turn performance details into security boundaries: a secret-dependent branch, table lookup, or early exit becomes something an attacker can measure.

Key questions

What is the rollback plan when a new algorithm breaks production?
What is the blast radius of compromise (tenant, service, region, environment)?
How do keys rotate safely (overlap windows, dual-sign, staged rollout)?
What is your disaster recovery story for KMS/HSM outages?
How do you separate duties (operators vs developers vs security responders)?
What is the root of trust (HSM, TPM, offline CA, threshold ceremony)?

Assumptions

Key usage is high-volume; audit pipelines must scale without sampling away truth.
Secrets leak through logs, metrics, crash dumps, and backups unless prevented.
Attackers can observe timing and resource usage in shared environments.
Some environments are hostile (CI, ephemeral runners, shared build agents).
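
One way to keep secrets out of logs and crash output is to make the type itself refuse to print. The sketch below is a minimal illustration under that assumption; `Secret` and `expose` are hypothetical names, and the wrapper does not zero memory on drop or protect against core dumps and backups.

```rust
use std::fmt;

/// Hypothetical wrapper: holds secret bytes and never prints them.
pub struct Secret(Vec<u8>);

impl Secret {
    pub fn new(bytes: Vec<u8>) -> Self {
        Secret(bytes)
    }

    /// Explicit, greppable access point for the raw bytes.
    pub fn expose(&self) -> &[u8] {
        &self.0
    }
}

// Debug output is a fixed string, so `{:?}` in a log line or panic
// message cannot leak the contents or even the length.
impl fmt::Debug for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("Secret([REDACTED])")
    }
}
```

Because `Display` is deliberately not implemented, `format!("{}", secret)` fails to compile, which turns an accidental log statement into a build error.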

Non-goals

Designing audit trails that expose sensitive plaintext or identifiers.
Relying on manual rotation procedures for fleet-scale systems.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.
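
A minimal sketch of "validate early and fail fast", assuming a made-up wire format of a 2-byte big-endian length followed by the payload. `MAX_FIELD_LEN`, `ParseError`, and `parse_field` are hypothetical names chosen for illustration.

```rust
/// Hypothetical upper bound on a single field; tune per protocol.
const MAX_FIELD_LEN: usize = 4096;

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    Truncated,
    TooLarge(usize),
    TrailingBytes(usize),
}

/// Parses one field: 2-byte big-endian length, then exactly that many bytes.
/// Every check runs before any payload is handed to the caller.
pub fn parse_field(input: &[u8]) -> Result<&[u8], ParseError> {
    if input.len() < 2 {
        return Err(ParseError::Truncated);
    }
    let len = u16::from_be_bytes([input[0], input[1]]) as usize;
    // Reject oversized claims before looking at the body at all.
    if len > MAX_FIELD_LEN {
        return Err(ParseError::TooLarge(len));
    }
    let body = &input[2..];
    if body.len() < len {
        return Err(ParseError::Truncated);
    }
    // Trailing bytes are an error, not something to ignore silently.
    if body.len() > len {
        return Err(ParseError::TrailingBytes(body.len() - len));
    }
    Ok(body)
}
```

Returning a borrowed slice avoids allocating anything sized by attacker input, which keeps the cost of a rejected message bounded.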

Model & invariants

Key derivation is where protocols quietly succeed or fail. A sane default is domain-separated HKDF:

k \leftarrow \mathrm{HKDF}(\text{salt},\ \text{ikm},\ \text{info}=\text{context}).

Assume compromise and design for recovery: rotation, revocation, and forensics.

Bind every derived key to context: protocol, role, version, and transcript.
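
Binding a key to its context only works if the context encodes unambiguously. The sketch below builds an `info` byte string with length-prefixed fields; it stops short of calling HKDF itself, which needs a crypto library. `KdfContext`, `push_field`, and `encode_info` are hypothetical names, and the field layout is an assumption rather than a standard.

```rust
/// Hypothetical context fields bound into every derived key.
pub struct KdfContext<'a> {
    pub protocol: &'a str,
    pub role: &'a str,
    pub version: u16,
    pub transcript_hash: &'a [u8],
}

/// Appends a field as a 4-byte big-endian length followed by the bytes,
/// so ("ab", "c") and ("a", "bc") can never encode to the same string.
fn push_field(out: &mut Vec<u8>, field: &[u8]) {
    out.extend_from_slice(&(field.len() as u32).to_be_bytes());
    out.extend_from_slice(field);
}

/// Builds the `info` input for a KDF from the context.
pub fn encode_info(ctx: &KdfContext<'_>) -> Vec<u8> {
    let mut out = Vec::new();
    push_field(&mut out, ctx.protocol.as_bytes());
    push_field(&mut out, ctx.role.as_bytes());
    push_field(&mut out, &ctx.version.to_be_bytes());
    push_field(&mut out, ctx.transcript_hash);
    out
}
```

Plain concatenation without length prefixes would let two different contexts collide, which defeats the domain separation the derivation is meant to provide.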

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
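
The invariant can be sketched as a guard that accepts only strictly increasing epochs and never consults a clock. `EpochGuard` and `EpochError` are hypothetical names; the sketch assumes epochs start at 1 and leaves out the durable storage a real guard would need to survive restarts.

```rust
/// Tracks the highest epoch accepted so far (illustrative type).
#[derive(Debug, Default)]
pub struct EpochGuard {
    highest: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum EpochError {
    /// The presented epoch is not newer than one already accepted.
    Stale { presented: u64, highest: u64 },
}

impl EpochGuard {
    /// Accepts only strictly increasing epochs; no wall clock is involved,
    /// so clock skew between nodes cannot reorder or resurrect old state.
    pub fn accept(&mut self, presented: u64) -> Result<(), EpochError> {
        if presented <= self.highest {
            return Err(EpochError::Stale { presented, highest: self.highest });
        }
        self.highest = presented;
        Ok(())
    }
}
```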

Security properties

Replay resistance: duplicated inputs do not change outcomes.
Evidence: critical actions emit verifiable audit events.
Downgrade resistance: negotiation can’t silently weaken security posture.
Integrity: invalid transitions are rejected (and detectable).
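
Replay resistance is often implemented with request IDs: the first application records its outcome, and any duplicate returns that outcome without touching state. The sketch below assumes callers supply a unique `request_id`; `Ledger` is a hypothetical toy state machine, and its in-memory map grows without bound, which a real system would have to cap or expire.

```rust
use std::collections::HashMap;

/// Hypothetical state machine: a running total that requests add to.
#[derive(Default)]
pub struct Ledger {
    total: u64,
    /// Outcome recorded per request ID, so a duplicate returns the original result.
    seen: HashMap<String, u64>,
}

impl Ledger {
    /// Applies `amount` once per `request_id`; a replay returns the stored
    /// outcome and leaves the total unchanged.
    pub fn apply(&mut self, request_id: &str, amount: u64) -> u64 {
        if let Some(&outcome) = self.seen.get(request_id) {
            return outcome;
        }
        self.total += amount;
        self.seen.insert(request_id.to_string(), self.total);
        self.total
    }

    pub fn total(&self) -> u64 {
        self.total
    }
}
```

The same mechanism covers retries after an ambiguous timeout: the client resends with the same ID and the change is applied at most once.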

Failure modes

Timeout ambiguity causing double-apply or partial state transitions.
Observability gaps during incidents (missing evidence).
Config drift that weakens security posture over time.
Mixed-version behavior that violates assumptions silently.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart LR
  policy["Policy (purpose + TTL)"] --> service["Signer Service"]
  service --> hsm["HSM/KMS"]
  service --> audit["Audit Stream"]
  audit --> siem["Detection/Response"]

Implementation notes

Crypto infra is a product: UX, policy, audit, and rollback must compose.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).
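
A minimal sketch of the rule for a file-backed log, using only the standard library. `append_durable` is a hypothetical helper; the caller is assumed to send its acknowledgement only after the function returns `Ok`. On some file systems a newly created file also needs its parent directory synced, which this sketch omits.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Appends one newline-terminated record and returns only after the OS
/// reports the data flushed to the storage device.
pub fn append_durable(path: &Path, record: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(record)?;
    file.write_all(b"\n")?;
    // sync_all asks the OS to flush both file data and metadata.
    file.sync_all()?;
    Ok(())
}
```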

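/// What a key is allowed to be used for.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Purpose { Tls, Jwt, Firmware, Ledger }

/// Opaque reference to a key; the key material itself stays behind the HSM/KMS boundary.
#[derive(Clone, Debug)]
pub struct KeyHandle { id: String, purpose: Purpose }

// Enforce purpose and algorithm policy at the boundary, not in the caller.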
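
The boundary check named in that comment might look like the sketch below. The `Algorithm` enum, the `allowed` table, `PolicyError`, and `authorize` are all hypothetical, and the purpose-to-algorithm mapping is illustrative only, not a recommendation.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Purpose { Tls, Jwt, Firmware, Ledger }

/// Hypothetical algorithm identifiers; a real service lists what its HSM/KMS supports.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Algorithm { EcdsaP256, Ed25519, RsaPss3072 }

pub struct KeyHandle { id: String, purpose: Purpose }

#[derive(Debug, PartialEq, Eq)]
pub enum PolicyError {
    WrongPurpose { key: Purpose, requested: Purpose },
    AlgorithmNotAllowed(Algorithm),
}

impl KeyHandle {
    pub fn new(id: impl Into<String>, purpose: Purpose) -> Self {
        KeyHandle { id: id.into(), purpose }
    }

    pub fn id(&self) -> &str {
        &self.id
    }
}

/// Illustrative policy table: which algorithms each purpose may use.
fn allowed(purpose: Purpose) -> &'static [Algorithm] {
    match purpose {
        Purpose::Tls => &[Algorithm::EcdsaP256, Algorithm::Ed25519],
        Purpose::Jwt => &[Algorithm::EcdsaP256, Algorithm::Ed25519],
        Purpose::Firmware => &[Algorithm::Ed25519, Algorithm::RsaPss3072],
        Purpose::Ledger => &[Algorithm::Ed25519],
    }
}

/// Runs inside the signer service, before any request reaches the HSM/KMS.
pub fn authorize(key: &KeyHandle, requested: Purpose, alg: Algorithm) -> Result<(), PolicyError> {
    if key.purpose != requested {
        return Err(PolicyError::WrongPurpose { key: key.purpose, requested });
    }
    if !allowed(requested).contains(&alg) {
        return Err(PolicyError::AlgorithmNotAllowed(alg));
    }
    Ok(())
}
```

Keeping the fields of `KeyHandle` private means callers cannot relabel a firmware key as a JWT key; the purpose is fixed when the handle is issued.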

Verification strategy

Forensics tests: can you reconstruct “who signed what” under load?
Constant-time validation: timing microbenchmarks plus side-channel tooling where feasible; a flat benchmark is evidence, not proof.
Misuse resistance tests: wrong purpose, wrong context, wrong key type must fail.
Chaos for KMS: inject throttling, partial outages, and latency spikes.
Rotation drills: staged rollout, dual-sign windows, and rollback.
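
For the constant-time item, the usual target of such tests is a comparison with no secret-dependent early exit. The sketch below is a minimal illustration using only the standard library; `ct_eq` is a hypothetical name. It treats length as public, and `std::hint::black_box` is only a best-effort hint to the optimizer, so the compiled code still needs to be checked. Production code would more likely rely on a vetted crate such as `subtle`.

```rust
use std::hint::black_box;

/// Compares two byte slices without exiting at the first mismatch.
/// Length is treated as public; only the contents are assumed secret.
pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        // Accumulate differences; every byte pair is visited regardless of content.
        diff |= x ^ y;
    }
    // black_box is a best-effort barrier against the optimizer, not a guarantee.
    black_box(diff) == 0
}
```

A naive `a == b` may return as soon as one byte differs, so response time can reveal how many leading bytes of a MAC or token the attacker guessed correctly.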

Operational notes

Automate rotation with safety rails (canary, dual-sign, fast rollback).
Separate duties and restrict production key access paths.
Alert on policy drift: cipher suites, key sizes, algorithm toggles, TTL changes.
Make audit streams append-only and queryable during incidents.
Test backup/restore for crypto material with the same rigor as databases.
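
The overlap window behind automated rotation can be sketched as a verifier-side rule: accept the current key always, and accept the previous key only until a cutoff epoch. `KeyRing`, `KeyStatus`, and `classify` are hypothetical names, and the epoch is assumed to come from a monotonic counter rather than a wall clock.

```rust
/// Hypothetical view of the key set during a rotation.
pub struct KeyRing {
    pub current: String,
    /// Previous key ID and the last epoch at which it is still accepted.
    pub previous: Option<(String, u64)>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum KeyStatus {
    Current,
    PreviousInOverlap,
    Rejected,
}

impl KeyRing {
    /// Decides whether a signature made with `key_id` is acceptable at `epoch`.
    pub fn classify(&self, key_id: &str, epoch: u64) -> KeyStatus {
        if key_id == self.current.as_str() {
            return KeyStatus::Current;
        }
        match &self.previous {
            Some((id, last_epoch)) if id.as_str() == key_id && epoch <= *last_epoch => {
                KeyStatus::PreviousInOverlap
            }
            _ => KeyStatus::Rejected,
        }
    }
}
```

Counting `PreviousInOverlap` results during the window gives a direct signal for when the old key can be retired, and extending the cutoff is the fast rollback path.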

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

Authz failures and policy denials (unexpected spikes).
Invariant violation rate (should be ~0).
Admission-control / rate-limit rejections (by reason).
Retry/timeout rates by endpoint and client cohort.
Rollback events and the conditions that triggered them.

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Use canaries and staged rollout; stop early when signals degrade.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Keep dual-write / dual-verify windows where appropriate.
Prefer backward-compatible changes; avoid “flag day” upgrades.
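
An explicit rollback trigger can be as small as a pure function from canary metrics to a decision. The sketch below assumes two signals, error rate and p99 latency; `CanaryMetrics`, `RollbackThresholds`, `Decision`, and `evaluate` are hypothetical names, and the choice of signals and thresholds is illustrative.

```rust
/// Hypothetical canary metrics sampled over one evaluation window.
pub struct CanaryMetrics {
    pub requests: u64,
    pub errors: u64,
    pub p99_latency_ms: u64,
}

pub struct RollbackThresholds {
    /// Below this many requests the window is too small to judge.
    pub min_requests: u64,
    pub max_error_rate: f64,
    pub max_p99_latency_ms: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Continue,
    InsufficientData,
    Rollback(&'static str),
}

/// Evaluates one window; the reason string is meant for the audit log.
pub fn evaluate(m: &CanaryMetrics, t: &RollbackThresholds) -> Decision {
    if m.requests == 0 || m.requests < t.min_requests {
        return Decision::InsufficientData;
    }
    let error_rate = m.errors as f64 / m.requests as f64;
    if error_rate > t.max_error_rate {
        return Decision::Rollback("error rate above threshold");
    }
    if m.p99_latency_ms > t.max_p99_latency_ms {
        return Decision::Rollback("p99 latency above threshold");
    }
    Decision::Continue
}
```

Keeping the function free of I/O makes the trigger easy to rehearse: recorded metrics from past incidents can be replayed through it to check that the thresholds would have fired.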

Evidence

Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.

Open questions

What is your plan for emergency revocation at global scale?
Which secrets must remain confidential for 10+ years and where are they stored today?
How do you guarantee that audit does not become a data exfiltration channel?
What would a KMS compromise look like in your telemetry?

Checklist

Safety properties stated as invariants.
Telemetry captures correctness signals.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Assumptions listed and reviewed.
Rollback plan rehearsed and automated.
Failure modes enumerated with mitigations.
