PKI as an Operating System: Certificates, Policies, and Expiration

Monthly research note. Theme: Cryptographic Infrastructure.

TL;DR

Treat PKI (certificates, policies, and expiration) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Treat key IDs as capabilities; never pass raw private key material across boundaries.
Side-channel constraints turn performance details into security boundaries.
Bind purpose and context (domain separation) so keys can’t be misused accidentally.
Design rollbacks as part of the happy path.
Bind security decisions to evidence (audit, invariants, telemetry).
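The domain-separation takeaway can be made concrete with a small sketch. It assumes a `Purpose` enum like the one in the implementation notes. The `label` strings, the `signing_input` function, and the length-prefixed layout are illustrative choices, not a standard encoding.

```rust
/// Hypothetical purpose labels; the names are illustrative.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Purpose { Tls, Jwt, Firmware, Ledger }

impl Purpose {
    fn label(self) -> &'static [u8] {
        match self {
            Purpose::Tls => b"tls",
            Purpose::Jwt => b"jwt",
            Purpose::Firmware => b"firmware",
            Purpose::Ledger => b"ledger",
        }
    }
}

/// Append a field prefixed with its length (big-endian u64),
/// so field boundaries cannot be shifted by an attacker.
fn push_field(out: &mut Vec<u8>, field: &[u8]) {
    out.extend_from_slice(&(field.len() as u64).to_be_bytes());
    out.extend_from_slice(field);
}

/// Build the bytes that are actually signed or MACed.
/// The same payload under a different purpose or context yields different bytes.
pub fn signing_input(purpose: Purpose, context: &[u8], payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    push_field(&mut out, purpose.label());
    push_field(&mut out, context);
    push_field(&mut out, payload);
    out
}
```

With this layout, `signing_input(Purpose::Tls, b"ab", b"c")` and `signing_input(Purpose::Tls, b"a", b"bc")` differ, which plain concatenation would not guarantee.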

Why this matters

Managed services shift responsibilities; they don’t remove them.
Most organizations don’t know where their keys live—until an incident.
Side channels turn performance details into security boundaries.
Policy drift silently turns strong crypto into weak practice.

Key questions

What is the blast radius of compromise (tenant, service, region, environment)?
What is your disaster recovery story for KMS/HSM outages?
What is the rollback plan when a new algorithm breaks production?
What is the root of trust (HSM, TPM, offline CA, threshold ceremony)?
How do you separate duties (operators vs developers vs security responders)?
How do you prove usage (who signed what, when, and why) without leaking secrets?

Assumptions

Secrets leak through logs, metrics, crash dumps, and backups unless prevented.
Some environments are hostile (CI, ephemeral runners, shared build agents).
Rotation must occur under incident pressure; automation must be safe.
Key usage is high-volume; audit pipelines must scale without sampling away truth.

Non-goals

Designing audit trails that expose sensitive plaintext or identifiers.
Assuming “HSM = secure” without defining the threat model.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
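One way to bound per-request work is to reject oversized input before any parsing or signature verification. The sketch below applies this to a presented certificate chain. The limits `MAX_CHAIN_LEN` and `MAX_CERT_BYTES` and the function name are assumptions chosen for illustration.

```rust
/// Illustrative limits; the values are assumptions, tune them per environment.
pub const MAX_CHAIN_LEN: usize = 8;
pub const MAX_CERT_BYTES: usize = 16 * 1024;

#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    EmptyChain,
    ChainTooLong { len: usize },
    CertTooLarge { index: usize, bytes: usize },
}

/// Cheap, constant-cost-per-certificate admission check.
/// Runs before DER parsing or signature verification.
pub fn check_chain_limits(chain: &[Vec<u8>]) -> Result<(), LimitError> {
    if chain.is_empty() {
        return Err(LimitError::EmptyChain);
    }
    if chain.len() > MAX_CHAIN_LEN {
        return Err(LimitError::ChainTooLong { len: chain.len() });
    }
    for (index, cert) in chain.iter().enumerate() {
        if cert.len() > MAX_CERT_BYTES {
            return Err(LimitError::CertTooLarge { index, bytes: cert.len() });
        }
    }
    Ok(())
}
```

Returning a distinct error per limit keeps rejections countable by reason in telemetry.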

Model & invariants

Audit integrity is a cryptographic property:

\mathrm{log\_entry} \leftarrow \mathrm{Sign}_{k_\text{audit}}(\mathrm{hash}(\text{event})\ \Vert\ \text{metadata}).

Audit logs are evidence. Make them tamper-evident and operationally accessible.
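A minimal sketch of a tamper-evident log follows. It links each entry to the previous one, which goes beyond the formula above, and it uses the standard library's `DefaultHasher` purely as a placeholder. `DefaultHasher` is not a cryptographic hash or MAC. A real system would compute the tag as a signature or HMAC inside the KMS/HSM. All type and function names here are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// PLACEHOLDER ONLY: not cryptographic. Stands in for Sign/HMAC under k_audit.
fn toy_tag(key: u64, parts: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u64(key);
    for p in parts {
        h.write_u64(p.len() as u64); // length prefix keeps field boundaries unambiguous
        h.write(p);
    }
    h.finish()
}

#[derive(Clone, Debug)]
pub struct LogEntry {
    pub event_digest: u64, // hash(event); the event itself stays out of the log
    pub metadata: String,
    pub prev_tag: u64,     // link to the previous entry (an addition to the formula)
    pub tag: u64,
}

fn entry_tag(key: u64, prev_tag: u64, event_digest: u64, metadata: &str) -> u64 {
    let prev = prev_tag.to_be_bytes();
    let digest = event_digest.to_be_bytes();
    toy_tag(key, &[&prev[..], &digest[..], metadata.as_bytes()])
}

pub fn append(log: &mut Vec<LogEntry>, key: u64, event: &[u8], metadata: &str) {
    let prev_tag = log.last().map_or(0, |e| e.tag);
    let event_digest = toy_tag(0, &[event]);
    let tag = entry_tag(key, prev_tag, event_digest, metadata);
    log.push(LogEntry { event_digest, metadata: metadata.to_string(), prev_tag, tag });
}

/// Returns the index of the first entry that fails verification, if any.
pub fn first_bad_entry(log: &[LogEntry], key: u64) -> Option<usize> {
    let mut prev_tag = 0u64;
    for (i, e) in log.iter().enumerate() {
        let expected = entry_tag(key, e.prev_tag, e.event_digest, &e.metadata);
        if e.prev_tag != prev_tag || e.tag != expected {
            return Some(i);
        }
        prev_tag = e.tag;
    }
    None
}
```

The chain detects edits and deletions in the middle of the log. Detecting truncation of the tail needs an external anchor, for example a periodically published head tag.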

Assume compromise and design for recovery: rotation, revocation, and forensics.

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).
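As an illustration of an invariant that is checkable from state alone, the sketch below scans a certificate inventory. The record fields, the violation names, and the renewal-margin rule are assumptions made for the example. A real inventory would come from the CA database or a certificate-transparency feed.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CertRecord {
    pub id: String,
    pub not_after: u64, // expiry, seconds since the Unix epoch
    pub serving: bool,  // currently presented to clients
    pub revoked: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Violation {
    ServingExpired { id: String },
    ServingRevoked { id: String },
    RenewalOverdue { id: String, seconds_left: u64 },
}

/// Invariant: every serving certificate is unrevoked, unexpired,
/// and has at least `min_margin` seconds of validity left.
pub fn check_invariants(certs: &[CertRecord], now: u64, min_margin: u64) -> Vec<Violation> {
    let mut out = Vec::new();
    for c in certs.iter().filter(|c| c.serving) {
        if c.revoked {
            out.push(Violation::ServingRevoked { id: c.id.clone() });
        }
        if c.not_after <= now {
            out.push(Violation::ServingExpired { id: c.id.clone() });
        } else if c.not_after - now < min_margin {
            out.push(Violation::RenewalOverdue { id: c.id.clone(), seconds_left: c.not_after - now });
        }
    }
    out
}
```

Running a check like this on a schedule and exporting the length of the result gives a direct "invariant violation rate" signal.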

Security properties

Least authority: privileges are scoped by purpose and time.
Authenticity: actions are bound to identity and purpose.
Downgrade resistance: negotiation can’t silently weaken security posture.
Evidence: critical actions emit verifiable audit events.

Failure modes

Timeout ambiguity causing double-apply or partial state transitions.
Recovery paths that only work when nothing is broken.
Mixed-version behavior that violates assumptions silently.
Config drift that weakens security posture over time.
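The first failure mode, timeout ambiguity leading to double-apply, is commonly addressed with idempotency keys. The sketch below is an in-memory illustration with hypothetical names (`Issuer`, `issue`, `Outcome`). In a real service the recorded outcome and the state change would need to commit atomically in durable storage.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Outcome {
    Issued { serial: u64 },
}

#[derive(Default)]
pub struct Issuer {
    next_serial: u64,
    completed: HashMap<String, Outcome>, // idempotency key -> recorded outcome
}

impl Issuer {
    /// Issue at most one certificate per idempotency key.
    /// A retry after an ambiguous timeout gets the original outcome back.
    pub fn issue(&mut self, idempotency_key: &str) -> Outcome {
        if let Some(prev) = self.completed.get(idempotency_key) {
            return prev.clone();
        }
        self.next_serial += 1;
        let outcome = Outcome::Issued { serial: self.next_serial };
        self.completed.insert(idempotency_key.to_string(), outcome.clone());
        outcome
    }

    /// Number of certificates actually issued.
    pub fn issued_count(&self) -> u64 {
        self.next_serial
    }
}
```

A client that times out on `issue("req-1")` can call it again with the same key and the issued count stays at one.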

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart TD
  gen["KeyGen (HSM/KMS)"] --> use["Use (TLS/VPN/Signing)"]
  use --> rot["Rotate (policy + automation)"]
  rot --> revoke["Revoke (incident)"]
  revoke --> audit["Audit/Forensics"]
  audit --> gen

Implementation notes

Crypto infra is a product: UX, policy, audit, and rollback must compose.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.

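/// What a key may be used for; one key, one purpose.
#[derive(Clone, Copy, Debug)]
pub enum Purpose { Tls, Jwt, Firmware, Ledger }

/// Opaque reference to a key held by the HSM/KMS; carries no key material.
pub struct KeyHandle { id: String, purpose: Purpose }

// Enforce purpose and algorithm policy at the boundary, not in the caller.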
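A hedged sketch of what enforcing policy at the boundary might look like, building on the `Purpose` and `KeyHandle` types above. The `Algorithm` enum, the allow-list in `algorithm_allowed`, and the `authorize` method are hypothetical additions, and the purpose-to-algorithm pairings are illustrative rather than a recommendation.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Purpose { Tls, Jwt, Firmware, Ledger }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Algorithm { EcdsaP256, Ed25519, RsaPss2048 }

pub struct KeyHandle { id: String, purpose: Purpose }

#[derive(Debug, PartialEq, Eq)]
pub enum PolicyError {
    WrongPurpose { key: Purpose, requested: Purpose },
    AlgorithmNotAllowed { purpose: Purpose, algorithm: Algorithm },
}

/// Hypothetical allow-list; anything not listed is denied.
fn algorithm_allowed(purpose: Purpose, algorithm: Algorithm) -> bool {
    matches!(
        (purpose, algorithm),
        (Purpose::Tls, Algorithm::EcdsaP256)
            | (Purpose::Jwt, Algorithm::EcdsaP256)
            | (Purpose::Jwt, Algorithm::Ed25519)
            | (Purpose::Firmware, Algorithm::Ed25519)
            | (Purpose::Ledger, Algorithm::Ed25519)
    )
}

impl KeyHandle {
    pub fn new(id: &str, purpose: Purpose) -> Self {
        KeyHandle { id: id.to_string(), purpose }
    }

    /// Gate every operation here; callers only learn the key ID once
    /// both the purpose and the algorithm pass policy.
    pub fn authorize(&self, requested: Purpose, algorithm: Algorithm) -> Result<&str, PolicyError> {
        if self.purpose != requested {
            return Err(PolicyError::WrongPurpose { key: self.purpose, requested });
        }
        if !algorithm_allowed(requested, algorithm) {
            return Err(PolicyError::AlgorithmNotAllowed { purpose: requested, algorithm });
        }
        Ok(&self.id)
    }
}
```

Because the fields stay private and the allow-list denies by default, a new algorithm or purpose has to be added to policy explicitly before any caller can use it.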

Verification strategy

Config drift detection: policy-as-code with diffs treated as security events.
Constant-time validation: microbenchmarks + side-channel tooling where feasible.
Forensics tests: can you reconstruct “who signed what” under load?
Misuse resistance tests: wrong purpose, wrong context, wrong key type must fail.
Chaos for KMS: inject throttling, partial outages, and latency spikes.

Operational notes

Automate rotation with safety rails (canary, dual-sign, fast rollback).
Make audit streams append-only and queryable during incidents.
Alert on policy drift: cipher suites, key sizes, algorithm toggles, TTL changes.
Inventory keys and usage paths; treat unknown usage as an incident.
Separate duties and restrict production key access paths.
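Policy-drift alerting can be sketched as a comparison between a reviewed baseline and the observed configuration, reporting only changes that weaken posture. The `CryptoPolicy` fields and `Drift` variants below are hypothetical examples of the key sizes, TTLs, and algorithm toggles mentioned above.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CryptoPolicy {
    pub min_rsa_bits: u32,
    pub max_cert_ttl_days: u32,
    pub allow_sha1: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Drift {
    KeySizeWeakened { baseline: u32, observed: u32 },
    TtlExtended { baseline: u32, observed: u32 },
    LegacyHashEnabled,
}

/// Report every field where `observed` is weaker than `baseline`.
/// Changes that tighten policy are not reported.
pub fn weakening_drift(baseline: &CryptoPolicy, observed: &CryptoPolicy) -> Vec<Drift> {
    let mut drift = Vec::new();
    if observed.min_rsa_bits < baseline.min_rsa_bits {
        drift.push(Drift::KeySizeWeakened {
            baseline: baseline.min_rsa_bits,
            observed: observed.min_rsa_bits,
        });
    }
    if observed.max_cert_ttl_days > baseline.max_cert_ttl_days {
        drift.push(Drift::TtlExtended {
            baseline: baseline.max_cert_ttl_days,
            observed: observed.max_cert_ttl_days,
        });
    }
    if observed.allow_sha1 && !baseline.allow_sha1 {
        drift.push(Drift::LegacyHashEnabled);
    }
    drift
}
```

A non-empty result would be raised as a security event rather than logged as a routine config change.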

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Admission-control / rate-limit rejections (by reason).
Authz failures and policy denials (unexpected spikes).
Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Error budget burn + tail latency under load.

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Keep dual-write / dual-verify windows where appropriate.
Prefer backward-compatible changes; avoid “flag day” upgrades.
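An explicit rollback trigger can be written as a pure function over canary metrics and thresholds, so the decision is testable and auditable. The metric names and threshold fields below are assumptions for illustration. Real values would come from the signals listed under "What to monitor".

```rust
#[derive(Debug, Clone, Copy)]
pub struct CanaryMetrics {
    pub handshake_failure_rate: f64, // fraction in [0, 1]
    pub p99_latency_ms: f64,
    pub invariant_violations: u64,
}

#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub max_handshake_failure_rate: f64,
    pub max_p99_latency_ms: f64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Continue,
    /// Carries the reasons so the rollback is explainable afterwards.
    Rollback(Vec<&'static str>),
}

/// Decide whether a staged rollout should stop and roll back.
/// Any invariant violation triggers a rollback regardless of thresholds.
pub fn evaluate(m: &CanaryMetrics, t: &Thresholds) -> Decision {
    let mut reasons = Vec::new();
    if m.handshake_failure_rate > t.max_handshake_failure_rate {
        reasons.push("handshake failure rate");
    }
    if m.p99_latency_ms > t.max_p99_latency_ms {
        reasons.push("p99 latency");
    }
    if m.invariant_violations > 0 {
        reasons.push("invariant violation");
    }
    if reasons.is_empty() {
        Decision::Continue
    } else {
        Decision::Rollback(reasons)
    }
}
```

Keeping the reasons in the decision makes it straightforward to preserve them alongside configs and audit logs as evidence of what changed.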

Evidence

Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Let's Encrypt Incident Reports (2) — Real-world PKI incidents and operational lessons.
- Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.

Open questions

Which secrets must remain confidential for 10+ years and where are they stored today?
How do you guarantee that audit does not become a data exfiltration channel?
What would a KMS compromise look like in your telemetry?
What is your plan for emergency revocation at global scale?

Checklist

Rollback plan rehearsed and automated.
Safety properties stated as invariants.
Failure modes enumerated with mitigations.
Telemetry captures correctness signals.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Assumptions listed and reviewed.
