Secrets vs Capabilities: Token Design in Microservices

Monthly research note. Theme: Cryptographic Infrastructure.

TL;DR

This memo treats service tokens as capabilities scoped by purpose and time rather than as shared secrets. Define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
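One way to make the third outcome explicit is to return a three-valued result instead of a boolean. The sketch below is illustrative only: `Outcome`, `Classify`, `ErrRejected`, and `ErrTimeout` are hypothetical names, and it assumes that only an explicit rejection proves nothing was applied.

```go
package main

import "errors"

// Outcome is a hypothetical three-valued result for a remote call.
type Outcome int

const (
	Succeeded Outcome = iota // the callee confirmed the effect
	Failed                   // the callee confirmed that nothing was applied
	Unknown                  // no confirmation either way, e.g. a timeout
)

// Hypothetical sentinel errors used by this sketch.
var (
	ErrRejected = errors.New("request rejected by callee")
	ErrTimeout  = errors.New("no response before deadline")
)

// Classify maps a call error onto an Outcome.
// Assumption: only ErrRejected proves that nothing was applied. Every other
// error, including timeouts, is treated as ambiguous.
func Classify(err error) Outcome {
	switch {
	case err == nil:
		return Succeeded
	case errors.Is(err, ErrRejected):
		return Failed
	default:
		return Unknown
	}
}
```

Defaulting to `Unknown` is the conservative choice: callers must handle ambiguity unless the callee has explicitly ruled it out.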

Key takeaways

Rotation and rollback are core features—design them before you ship.
Audit logs are evidence: make them tamper-evident and queryable during incidents.
Side-channel constraints turn performance details into security boundaries.
Bind security decisions to evidence (audit, invariants, telemetry).
Automate guardrails; humans are for judgment, not for consistent enforcement.

Why this matters

Auditability must not become a secret-leaking logging pipeline.
Most organizations don’t know where their keys live—until an incident.
Operational reality (rotation, audit, rollback) is where crypto systems fail.
Side channels turn performance details into security boundaries.

Key questions

How do keys rotate safely (overlap windows, dual-sign, staged rollout)?
What is the rollback plan when a new algorithm breaks production?
How do you separate duties (operators vs developers vs security responders)?
What is the root of trust (HSM, TPM, offline CA, threshold ceremony)?
What is your disaster recovery story for KMS/HSM outages?
How do you handle key erasure and “right to be forgotten” constraints?

Assumptions

Certificate chains and policies evolve; clients won’t all update together.
Attackers can observe timing and resource usage in shared environments.
Secrets leak through logs, metrics, crash dumps, and backups unless prevented.
Key usage is high-volume; audit pipelines must scale without sampling away truth.
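The log-leak assumption above can be partly addressed in code by making secrets hard to print by accident. A minimal sketch, assuming a hypothetical `Secret` wrapper type. It covers formatting through package `fmt` only; crash dumps and backups need separate controls.

```go
package main

import "fmt"

// Secret is a hypothetical wrapper that keeps key material out of formatted output.
type Secret struct{ b []byte }

// NewSecret copies b so later changes to the caller's slice do not alias the secret.
func NewSecret(b []byte) Secret { return Secret{b: append([]byte(nil), b...)} }

// Format implements fmt.Formatter. Package fmt calls it for value-printing
// verbs such as %v, %+v, %#v, %s and %x, so the bytes are never rendered.
func (Secret) Format(f fmt.State, verb rune) { fmt.Fprint(f, "[REDACTED]") }

// Reveal returns a copy of the raw bytes. Call sites are easy to grep and review.
func (s Secret) Reveal() []byte { return append([]byte(nil), s.b...) }

// Render shows what a log line containing the secret would look like.
func Render(s Secret) string { return fmt.Sprintf("key=%v hex=%x go=%#v", s, s, s) }
```

Implementing `fmt.Formatter` rather than only `fmt.Stringer` matters here: with `String` alone, a verb like `%x` or `%d` would still print the struct's bytes.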

Non-goals

Designing audit trails that expose sensitive plaintext or identifiers.
Relying on manual rotation procedures for fleet-scale systems.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.
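As an illustration of validating early, the sketch below parses a made-up token layout `v1.<key id>.<base64url payload>`. The layout, the 4096-byte limit, and the names `ParseToken`, `ParsedToken`, and `ErrMalformed` are assumptions for this example, not a format the memo defines.

```go
package main

import (
	"encoding/base64"
	"errors"
	"strings"
)

// maxTokenLen is an assumed upper bound; checking it first bounds all later work.
const maxTokenLen = 4096

// ErrMalformed is deliberately uninformative so callers cannot probe the parser.
var ErrMalformed = errors.New("malformed token")

// ParsedToken holds the fields of the hypothetical "v1.<key id>.<payload>" layout.
type ParsedToken struct {
	KeyID   string
	Payload []byte
}

// ParseToken rejects anything that is not exactly the expected shape
// before any key lookup or cryptographic work happens.
func ParseToken(raw string) (ParsedToken, error) {
	if len(raw) == 0 || len(raw) > maxTokenLen {
		return ParsedToken{}, ErrMalformed
	}
	parts := strings.Split(raw, ".")
	if len(parts) != 3 || parts[0] != "v1" || parts[1] == "" || parts[2] == "" {
		return ParsedToken{}, ErrMalformed
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[2])
	if err != nil {
		return ParsedToken{}, ErrMalformed
	}
	return ParsedToken{KeyID: parts[1], Payload: payload}, nil
}
```

The size check comes first so that an oversized input costs a single length comparison rather than a split and a decode.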

Model & invariants

A practical safety statement for key usage is least authority: a capability issued for key k and purpose p must not permit use of k for any other purpose p'.

\text{capability}(k,\ p) \Rightarrow \forall p' \neq p:\ \neg\,\text{use}(k,\ p').

Assume compromise and design for recovery: rotation, revocation, and forensics.

Bind every derived key to context: protocol, role, version, and transcript.
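Context binding can be sketched as a keyed derivation over an unambiguous encoding of the context. The example below uses a single HMAC-SHA256 call as a stand-in KDF; a production system would more likely use HKDF (RFC 5869). `DeriveContext`, `DeriveKey`, and `SameKey` are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
)

// DeriveContext lists the values a derived key is bound to.
type DeriveContext struct {
	Protocol   string
	Role       string
	Version    string
	Transcript []byte // e.g. a hash of the handshake so far
}

// encode length-prefixes every field so that ("ab", "c") and ("a", "bc")
// cannot produce the same byte string.
func (c DeriveContext) encode() []byte {
	var out []byte
	fields := [][]byte{[]byte(c.Protocol), []byte(c.Role), []byte(c.Version), c.Transcript}
	for _, f := range fields {
		var n [4]byte
		binary.BigEndian.PutUint32(n[:], uint32(len(f)))
		out = append(out, n[:]...)
		out = append(out, f...)
	}
	return out
}

// DeriveKey returns HMAC-SHA256(master, encode(ctx)), a 32-byte key.
func DeriveKey(master []byte, ctx DeriveContext) []byte {
	mac := hmac.New(sha256.New, master)
	mac.Write(ctx.encode())
	return mac.Sum(nil)
}

// SameKey compares two keys in constant time.
func SameKey(a, b []byte) bool { return hmac.Equal(a, b) }
```

Changing any one of protocol, role, version, or transcript yields an unrelated key, so a key derived for one context cannot be reused in another.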

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).

Security properties

Integrity: invalid transitions are rejected (and detectable).
Downgrade resistance: negotiation can’t silently weaken security posture.
Least authority: privileges are scoped by purpose and time.
Replay resistance: duplicated inputs do not change outcomes.
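Replay resistance is often approximated with a nonce cache plus a freshness window. The sketch below is one hedged way to do that; `ReplayGuard` is a hypothetical type, times are Unix seconds passed in by the caller, and it is not safe for concurrent use.

```go
package main

// ReplayGuard is a hypothetical, single-goroutine nonce cache with a freshness window.
type ReplayGuard struct {
	windowSec int64
	seen      map[string]int64 // nonce -> Unix second after which it can be forgotten
}

// NewReplayGuard creates a guard that accepts messages within windowSec of now.
func NewReplayGuard(windowSec int64) *ReplayGuard {
	return &ReplayGuard{windowSec: windowSec, seen: make(map[string]int64)}
}

// Accept reports whether a message with this nonce and issue time may be processed at now.
func (g *ReplayGuard) Accept(nonce string, issuedAt, now int64) bool {
	// Reject messages outside the freshness window (too old or dated too far ahead).
	if issuedAt < now-g.windowSec || issuedAt > now+g.windowSec {
		return false
	}
	// Forget nonces whose messages would now fail the freshness check anyway.
	for n, forgetAfter := range g.seen {
		if forgetAfter < now {
			delete(g.seen, n)
		}
	}
	if _, dup := g.seen[nonce]; dup {
		return false
	}
	g.seen[nonce] = issuedAt + g.windowSec
	return true
}
```

The freshness window is what keeps the cache bounded: a nonce only has to be remembered for as long as its message could still pass the time check.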

Failure modes

Recovery paths that only work when nothing is broken.
Mixed-version behavior that violates assumptions silently.
Timeout ambiguity causing double-apply or partial state transitions.
Observability gaps during incidents (missing evidence).

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart TD
  gen["KeyGen (HSM/KMS)"] --> use["Use (TLS/VPN/Signing)"]
  use --> rot["Rotate (policy + automation)"]
  rot --> revoke["Revoke (incident)"]
  revoke --> audit["Audit/Forensics"]
  audit --> gen

Implementation notes

Crypto infra is a product: UX, policy, audit, and rollback must compose.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.
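One common way to make a retry safe after an ambiguous timeout is an idempotency key: the server records the result of the first attempt and returns it for every repeat. A minimal in-memory sketch, assuming a hypothetical `IdempotentStore`; a real deployment would need durable storage and expiry.

```go
package main

import "sync"

// IdempotentStore is a hypothetical in-memory record of completed operations.
type IdempotentStore struct {
	mu      sync.Mutex
	results map[string]string
}

// NewIdempotentStore returns an empty store.
func NewIdempotentStore() *IdempotentStore {
	return &IdempotentStore{results: make(map[string]string)}
}

// Do runs op at most once per key and returns the recorded result on repeats.
// Simplification: the lock is held while op runs, which serializes all callers.
func (s *IdempotentStore) Do(key string, op func() string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r, ok := s.results[key]; ok {
		return r
	}
	r := op()
	s.results[key] = r
	return r
}
```

With this in place, a client that saw a timeout can resend the same key and learn what actually happened instead of applying the operation twice.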

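// Capability-style API: callers get a handle scoped to purpose + TTL.

// KeyPurpose names the single use a key is issued for.
type KeyPurpose string

// KeyHandle is what callers hold instead of raw key material.
type KeyHandle struct {
	ID            string     // identifies the key inside the key service
	Purpose       KeyPurpose // the only purpose this handle is valid for
	ExpiresAtUnix int64      // expiry as a Unix timestamp
}

// Signer signs msg with the key that h refers to.
// Implementations are expected to reject expired or wrong-purpose handles.
type Signer interface {
	Sign(h KeyHandle, msg []byte) (sig []byte, err error)
}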
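To show how the interface above might enforce its scoping, here is a hedged sketch of one implementation. `MACSigner`, `Register`, and the error values are hypothetical, HMAC-SHA256 stands in for a real signature scheme, and the clock is injected so the expiry check can be tested.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"errors"
)

type KeyPurpose string

type KeyHandle struct {
	ID            string
	Purpose       KeyPurpose
	ExpiresAtUnix int64
}

var (
	ErrUnknownKey   = errors.New("unknown key")
	ErrWrongPurpose = errors.New("key not issued for this purpose")
	ErrExpired      = errors.New("key handle expired")
)

// keyRecord is the signer's own, authoritative view of a key.
type keyRecord struct {
	purpose  KeyPurpose
	material []byte
}

// MACSigner is a hypothetical Signer backed by an in-memory key table.
type MACSigner struct {
	keys map[string]keyRecord
	now  func() int64 // injected clock returning Unix time
}

func NewMACSigner(now func() int64) *MACSigner {
	return &MACSigner{keys: make(map[string]keyRecord), now: now}
}

// Register records the one purpose a key may be used for.
func (s *MACSigner) Register(id string, p KeyPurpose, material []byte) {
	s.keys[id] = keyRecord{purpose: p, material: material}
}

// Sign checks purpose and expiry before touching key material.
func (s *MACSigner) Sign(h KeyHandle, msg []byte) ([]byte, error) {
	rec, ok := s.keys[h.ID]
	if !ok {
		return nil, ErrUnknownKey
	}
	if rec.purpose != h.Purpose {
		return nil, ErrWrongPurpose
	}
	if s.now() >= h.ExpiresAtUnix {
		return nil, ErrExpired
	}
	mac := hmac.New(sha256.New, rec.material)
	mac.Write([]byte(rec.purpose)) // domain-separate the tag by purpose
	mac.Write([]byte{0})
	mac.Write(msg)
	return mac.Sum(nil), nil
}
```

The purpose check compares against the signer's own record, so a caller cannot widen authority by editing the handle. The expiry is still read from the handle itself, so a real system would make the handle opaque or authenticated.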

Verification strategy

Chaos for KMS: inject throttling, partial outages, and latency spikes.
Constant-time validation: microbenchmarks + side-channel tooling where feasible.
Misuse resistance tests: wrong purpose, wrong context, wrong key type must fail.
Rotation drills: staged rollout, dual-sign windows, and rollback.
Forensics tests: can you reconstruct “who signed what” under load?
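A rotation drill needs something concrete to assert against. The sketch below models an overlap window with a hypothetical `KeyRing` that signs with the current key and verifies against the current and previous keys. HMAC-SHA256 stands in for the real signature scheme.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
)

// KeyRing is a hypothetical two-slot key set for rotation drills.
type KeyRing struct {
	current  []byte
	previous []byte // non-nil only during the overlap window
}

func NewKeyRing(initial []byte) *KeyRing { return &KeyRing{current: initial} }

func tag(key, msg []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write(msg)
	return m.Sum(nil)
}

// Sign always uses the current key.
func (r *KeyRing) Sign(msg []byte) []byte { return tag(r.current, msg) }

// Verify accepts the current key and, during the overlap window, the previous one.
func (r *KeyRing) Verify(msg, sig []byte) bool {
	if hmac.Equal(sig, tag(r.current, msg)) {
		return true
	}
	return r.previous != nil && hmac.Equal(sig, tag(r.previous, msg))
}

// Rotate makes next the signing key and opens the overlap window.
func (r *KeyRing) Rotate(next []byte) { r.previous, r.current = r.current, next }

// EndOverlap closes the window; signatures from the old key stop verifying.
func (r *KeyRing) EndOverlap() { r.previous = nil }

// Rollback swaps the two keys. It only works while the overlap window is open.
func (r *KeyRing) Rollback() bool {
	if r.previous == nil {
		return false
	}
	r.current, r.previous = r.previous, r.current
	return true
}
```

A drill can then check three things: old signatures verify during the window, rollback restores the old signing key without invalidating new signatures, and old signatures fail once the window closes.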

Operational notes

Separate duties and restrict production key access paths.
Automate rotation with safety rails (canary, dual-sign, fast rollback).
Make audit streams append-only and queryable during incidents.
Inventory keys and usage paths; treat unknown usage as an incident.
Test backup/restore for crypto material with the same rigor as databases.
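An append-only audit stream can be made tamper-evident by chaining entry hashes. This is a minimal sketch with hypothetical `AuditLog` and `AuditEntry` types. It records key IDs rather than key material, and it assumes the latest hash is also anchored somewhere outside the log, since a chain alone cannot detect truncation of its tail.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// AuditEntry records who did what with which key, linked to the previous entry.
type AuditEntry struct {
	Actor, Action, KeyID string
	PrevHash             string // hex hash of the previous entry, "" for the first
	Hash                 string // hex hash over PrevHash and this entry's fields
}

// entryHash hashes length-prefixed fields so field boundaries are unambiguous.
func entryHash(prev, actor, action, keyID string) string {
	h := sha256.New()
	for _, f := range []string{prev, actor, action, keyID} {
		n := len(f)
		h.Write([]byte{byte(n >> 24), byte(n >> 16), byte(n >> 8), byte(n)})
		h.Write([]byte(f))
	}
	return hex.EncodeToString(h.Sum(nil))
}

// AuditLog is a hypothetical in-memory hash chain.
type AuditLog struct{ entries []AuditEntry }

// Append adds an entry linked to the current tail.
func (l *AuditLog) Append(actor, action, keyID string) {
	prev := ""
	if n := len(l.entries); n > 0 {
		prev = l.entries[n-1].Hash
	}
	l.entries = append(l.entries, AuditEntry{
		Actor: actor, Action: action, KeyID: keyID,
		PrevHash: prev, Hash: entryHash(prev, actor, action, keyID),
	})
}

// Verify returns the index of the first inconsistent entry, or -1 if the chain is intact.
func (l *AuditLog) Verify() int {
	prev := ""
	for i, e := range l.entries {
		if e.PrevHash != prev || e.Hash != entryHash(prev, e.Actor, e.Action, e.KeyID) {
			return i
		}
		prev = e.Hash
	}
	return -1
}
```

Editing any earlier entry changes its hash and breaks every link after it, so `Verify` points at where the evidence stops being trustworthy.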

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Error budget burn + tail latency under load.
Authz failures and policy denials (unexpected spikes).
Rollback events and the conditions that triggered them.

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Keep dual-write / dual-verify windows where appropriate.
Use canaries and staged rollout; stop early when signals degrade.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.

Evidence

Let's Encrypt Incident Reports (1) — Real-world PKI incidents and operational lessons.
- Evidence: Rotation and revocation are operational protocols; extract failure patterns into drills and automated rollbacks.
RFC 8446: TLS 1.3 (2) — Modern handshake design, key schedule, and downgrade resistance patterns.
- Evidence: Handshake transcript binding and downgrade resistance patterns; monitor negotiation paths and failure reasons.

Open questions

How do you guarantee that audit does not become a data exfiltration channel?
What is your plan for emergency revocation at global scale?
Which secrets must remain confidential for 10+ years and where are they stored today?
What would a KMS compromise look like in your telemetry?

Checklist

Failure modes enumerated with mitigations.
Assumptions listed and reviewed.
Rollback plan rehearsed and automated.
Telemetry captures correctness signals.
Safety properties stated as invariants.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
