Multi-Tenant Isolation: Crypto Boundaries vs Kernel Boundaries

Monthly research note. Theme: Cryptographic Infrastructure.

TL;DR

A focused memo on multi-tenant isolation across crypto and kernel boundaries: define the model, state the properties, then design the system so those properties stay true under failures and adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Treat key IDs as capabilities; never pass raw private key material across boundaries.
Audit logs are evidence: make them tamper-evident and queryable during incidents.
Rotation and rollback are core features: design them before you ship.
Exercise rollback as part of the happy path, not only during incidents.
Bind security decisions to evidence (audit, invariants, telemetry).

Why this matters

Managed services shift responsibilities; they don’t remove them.
Key management failures are systemic: the breach is “a workflow,” not a bug.
Operational reality (rotation, audit, rollback) is where crypto systems fail.
Most organizations don’t know where their keys live—until an incident.

Key questions

What is the root of trust (HSM, TPM, offline CA, threshold ceremony)?
Which operations must be constant-time and how do you validate that?
What is your disaster recovery story for KMS/HSM outages?
How do keys rotate safely (overlap windows, dual-sign, staged rollout)?
How do you prove usage (who signed what, when, and why) without leaking secrets?
What is the rollback plan when a new algorithm breaks production?

Assumptions

Some environments are hostile (CI, ephemeral runners, shared build agents).
Attackers can observe timing and resource usage in shared environments.
Secrets leak through logs, metrics, crash dumps, and backups unless prevented.
Key usage is high-volume; audit pipelines must scale without sampling away truth.

Non-goals

Passing raw private keys across process boundaries.
Designing audit trails that expose sensitive plaintext or identifiers.

Attack surface

Observability pipelines are attack surface too (cardinality explosions, log injection). Bound label cardinality and treat every logged field as untrusted input.

Model & invariants

A practical safety statement for key usage is least authority: a capability granted for key k and purpose p never authorizes use of k for any other purpose p'.

\text{capability}(k,\ p) \Rightarrow \forall p' \neq p:\ \neg\,\text{use}(k,\ p').

Audit logs are evidence. Make them tamper-evident and operationally accessible.
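
One common way to make an audit log tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below is a minimal illustration under that assumption; the `AuditEntry` fields and the `Append` and `Verify` helpers are hypothetical names, not part of any particular KMS.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// AuditEntry is a hypothetical record of one key operation.
type AuditEntry struct {
	Actor   string
	KeyID   string
	Purpose string
	Unix    int64
	Prev    string // hex hash of the previous entry ("" for the first)
	Hash    string // hex hash over Prev and this entry's fields
}

// entryHash binds an entry to its predecessor. Fields are length-prefixed so
// that ("ab","c") and ("a","bc") hash differently.
func entryHash(prev, actor, keyID, purpose string, unix int64) string {
	h := sha256.New()
	for _, f := range []string{prev, actor, keyID, purpose} {
		fmt.Fprintf(h, "%d:%s|", len(f), f)
	}
	fmt.Fprintf(h, "%d", unix)
	return hex.EncodeToString(h.Sum(nil))
}

// Append returns the log with a new entry chained to the current tail.
func Append(log []AuditEntry, actor, keyID, purpose string, unix int64) []AuditEntry {
	prev := ""
	if len(log) > 0 {
		prev = log[len(log)-1].Hash
	}
	e := AuditEntry{Actor: actor, KeyID: keyID, Purpose: purpose, Unix: unix, Prev: prev}
	e.Hash = entryHash(prev, actor, keyID, purpose, unix)
	return append(log, e)
}

// Verify returns the index of the first entry that breaks the chain, or -1.
func Verify(log []AuditEntry) int {
	prev := ""
	for i, e := range log {
		if e.Prev != prev || e.Hash != entryHash(e.Prev, e.Actor, e.KeyID, e.Purpose, e.Unix) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var log []AuditEntry
	log = Append(log, "svc-a", "key-1", "sign", 1700000000)
	log = Append(log, "svc-b", "key-2", "decrypt", 1700000060)
	fmt.Println(Verify(log)) // -1: chain intact
	log[0].Actor = "mallory"
	fmt.Println(Verify(log)) // 0: the edited entry is detected
}
```

A chain like this detects in-place edits but not truncation of the tail or a full rewrite by someone who can recompute every hash. Closing that gap would need the latest hash to be anchored somewhere the writer cannot modify.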

Treat key identifiers as capabilities with purpose constraints—enforce in code and policy.

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).
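
As an illustration of checking an invariant from evidence, the sketch below replays audit records against a key inventory and reports every use that falls outside the granted purpose. The `Grant` and `Use` schemas are assumptions made for the example, and it assumes one purpose per key.

```go
package main

import "fmt"

// Grant records the single purpose a key was issued for (hypothetical schema).
type Grant struct {
	KeyID   string
	Purpose string
}

// Use is one audit record of a key operation (hypothetical schema).
type Use struct {
	Actor   string
	KeyID   string
	Purpose string
}

// Violations returns every audited use whose key is missing from the
// inventory or whose purpose differs from the purpose the key was granted for.
func Violations(grants []Grant, uses []Use) []Use {
	granted := make(map[string]string, len(grants))
	for _, g := range grants {
		granted[g.KeyID] = g.Purpose
	}
	var bad []Use
	for _, u := range uses {
		if p, ok := granted[u.KeyID]; !ok || p != u.Purpose {
			bad = append(bad, u)
		}
	}
	return bad
}

func main() {
	grants := []Grant{{"k1", "sign"}, {"k2", "decrypt"}}
	uses := []Use{
		{"svc-a", "k1", "sign"},
		{"svc-b", "k2", "sign"},    // wrong purpose
		{"svc-c", "k9", "decrypt"}, // key not in inventory
	}
	for _, v := range Violations(grants, uses) {
		fmt.Printf("violation: actor=%s key=%s purpose=%s\n", v.Actor, v.KeyID, v.Purpose)
	}
}
```

The count of returned records is one candidate for an invariant-violation metric, since it is derived only from state and logs.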

Security properties

Integrity: invalid transitions are rejected (and detectable).
Authenticity: actions are bound to identity and purpose.
Downgrade resistance: negotiation can’t silently weaken security posture.
Replay resistance: duplicated inputs do not change outcomes.
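
To make the downgrade-resistance property concrete, here is a minimal sketch of negotiation that fails closed. The function name `Negotiate` and the algorithm names in the policy are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrNoAcceptableAlgorithm = errors.New("no offered algorithm is permitted by policy")

// Negotiate returns the first entry of allowed (ordered most-preferred first)
// that the peer also offered. Anything outside allowed is never selected, so
// a peer cannot talk the server below its policy floor.
func Negotiate(allowed, offered []string) (string, error) {
	offeredSet := make(map[string]bool, len(offered))
	for _, a := range offered {
		offeredSet[a] = true
	}
	for _, a := range allowed {
		if offeredSet[a] {
			return a, nil
		}
	}
	return "", ErrNoAcceptableAlgorithm
}

func main() {
	allowed := []string{"ed25519", "ecdsa-p256"} // illustrative policy, not a recommendation
	alg, err := Negotiate(allowed, []string{"rsa-1024", "ecdsa-p256"})
	fmt.Println(alg, err)
	_, err = Negotiate(allowed, []string{"rsa-1024"})
	fmt.Println(err) // the connection is refused rather than weakened
}
```

Selection order comes from the server's list, not the peer's. This covers only the policy floor. Stopping an on-path attacker from editing the offer also requires authenticating the negotiation itself, which this sketch does not attempt.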

Failure modes

Mixed-version behavior that violates assumptions silently.
Config drift that weakens security posture over time.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Timeout ambiguity causing double-apply or partial state transitions.
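
Timeout ambiguity is usually handled by making the operation idempotent, so a retry cannot apply twice. The sketch below shows one way to do that for key rotation, assuming the caller attaches a unique request ID. `KeyRing` and `Rotate` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// KeyRing tracks the current key version and which requests produced it.
type KeyRing struct {
	mu      sync.Mutex
	version int
	applied map[string]int // request ID -> version that request produced
}

func NewKeyRing() *KeyRing { return &KeyRing{applied: make(map[string]int)} }

// Rotate advances the key version at most once per request ID. A retried
// request gets the version its first attempt produced.
func (k *KeyRing) Rotate(requestID string) int {
	k.mu.Lock()
	defer k.mu.Unlock()
	if v, done := k.applied[requestID]; done {
		return v
	}
	k.version++
	k.applied[requestID] = k.version
	return k.version
}

func main() {
	kr := NewKeyRing()
	fmt.Println(kr.Rotate("req-1")) // 1: first attempt
	fmt.Println(kr.Rotate("req-1")) // 1: retry after an ambiguous timeout
	fmt.Println(kr.Rotate("req-2")) // 2: a genuinely new rotation
}
```

In a real service the request-ID table would need to be persisted together with the version and eventually expired. Both are left out here.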

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart TD
  gen["KeyGen (HSM/KMS)"] --> use["Use (TLS/VPN/Signing)"]
  use --> rot["Rotate (policy + automation)"]
  rot --> revoke["Revoke (incident)"]
  revoke --> audit["Audit/Forensics"]
  audit --> gen

Implementation notes

Make policy explicit and enforce it in the narrowest component possible.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
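
A minimal sketch of that rule for a signing request: cap the bytes read, then parse, then validate, all before any key operation is attempted. The `SignRequest` fields and the 4 KiB cap are assumptions chosen for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

const maxBodyBytes = 4 << 10 // hypothetical cap: 4 KiB

var (
	ErrTooLarge  = errors.New("request body exceeds cap")
	ErrMalformed = errors.New("missing or malformed field")
)

// SignRequest is a hypothetical wire format for a signing call.
type SignRequest struct {
	KeyID   string `json:"key_id"`
	Purpose string `json:"purpose"`
	Digest  string `json:"digest"` // hex SHA-256, so exactly 64 characters
}

// ParseSignRequest reads at most maxBodyBytes+1 bytes, so an oversized body
// is rejected instead of being silently truncated or fully buffered.
func ParseSignRequest(r io.Reader) (SignRequest, error) {
	var req SignRequest
	body, err := io.ReadAll(io.LimitReader(r, maxBodyBytes+1))
	if err != nil {
		return req, err
	}
	if len(body) > maxBodyBytes {
		return req, ErrTooLarge
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return req, err
	}
	if req.KeyID == "" || req.Purpose == "" || len(req.Digest) != 64 {
		return req, ErrMalformed
	}
	return req, nil
}

// parseString is a small convenience wrapper for the examples.
func parseString(s string) (SignRequest, error) {
	return ParseSignRequest(strings.NewReader(s))
}

func main() {
	ok := `{"key_id":"k1","purpose":"sign","digest":"` + strings.Repeat("a", 64) + `"}`
	_, err := parseString(ok)
	fmt.Println(err)
	_, err = parseString(strings.Repeat("x", maxBodyBytes+1))
	fmt.Println(err)
}
```

Signing a fixed-size digest rather than an arbitrary payload is what keeps the cap small. That is a design choice of this sketch, not a requirement stated in the memo.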

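// Capability-style API: callers get a handle scoped to purpose + TTL,
// never the private key material itself.
type KeyPurpose string

// KeyHandle names a key and the single purpose it may be used for.
type KeyHandle struct {
	ID            string     // opaque key identifier, not key material
	Purpose       KeyPurpose // the one operation class this handle permits
	ExpiresAtUnix int64      // Unix seconds after which the handle is invalid
}

// Signer signs msg with the key behind h. Implementations are expected to
// reject handles whose purpose or expiry does not permit the operation.
type Signer interface {
	Sign(h KeyHandle, msg []byte) (sig []byte, err error)
}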
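
One possible implementation of this interface is sketched below. It assumes an in-process Ed25519 key standing in for an HSM/KMS, and the `localSigner`, `boundKey`, and `newDemoSigner` names are hypothetical. The signer checks the handle against a purpose binding it holds itself, because a plain struct handle could otherwise be forged by the caller.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
)

type KeyPurpose string

type KeyHandle struct {
	ID            string
	Purpose       KeyPurpose
	ExpiresAtUnix int64
}

var (
	ErrUnknownKey   = errors.New("unknown key id")
	ErrWrongPurpose = errors.New("purpose not permitted for this key")
	ErrExpired      = errors.New("handle expired")
)

// boundKey pairs private material with the single purpose it may serve.
type boundKey struct {
	priv    ed25519.PrivateKey
	purpose KeyPurpose
}

// localSigner is a hypothetical in-process stand-in for an HSM/KMS client.
type localSigner struct {
	keys    map[string]boundKey
	nowUnix func() int64 // injectable clock so expiry is testable
}

// Sign validates the capability before touching key material.
func (s *localSigner) Sign(h KeyHandle, msg []byte) ([]byte, error) {
	k, ok := s.keys[h.ID]
	switch {
	case !ok:
		return nil, ErrUnknownKey
	case h.Purpose != k.purpose:
		return nil, ErrWrongPurpose
	case s.nowUnix() >= h.ExpiresAtUnix:
		return nil, ErrExpired
	}
	return ed25519.Sign(k.priv, msg), nil
}

// newDemoSigner builds a signer with one key, "k1", bound to release signing.
func newDemoSigner(now int64) (*localSigner, ed25519.PublicKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	s := &localSigner{
		keys:    map[string]boundKey{"k1": {priv: priv, purpose: "release-signing"}},
		nowUnix: func() int64 { return now },
	}
	return s, pub
}

func main() {
	s, pub := newDemoSigner(1000)
	msg := []byte("artifact digest")
	sig, err := s.Sign(KeyHandle{ID: "k1", Purpose: "release-signing", ExpiresAtUnix: 2000}, msg)
	fmt.Println(err == nil && ed25519.Verify(pub, msg, sig))
	_, err = s.Sign(KeyHandle{ID: "k1", Purpose: "tls", ExpiresAtUnix: 2000}, msg)
	fmt.Println(err)
	_, err = s.Sign(KeyHandle{ID: "k1", Purpose: "release-signing", ExpiresAtUnix: 900}, msg)
	fmt.Println(err)
}
```

In a multi-tenant deployment the handle would more plausibly be minted and authenticated by the key service, so that the expiry cannot be chosen by the caller either. That step is outside this sketch.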

Verification strategy

Chaos for KMS: inject throttling, partial outages, and latency spikes.
Constant-time validation: microbenchmarks + side-channel tooling where feasible.
Rotation drills: staged rollout, dual-sign windows, and rollback.
Config drift detection: policy-as-code with diffs treated as security events.
Misuse resistance tests: wrong purpose, wrong context, wrong key type must fail.
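
The KMS chaos item can start small, with a deterministic fault-injecting wrapper used in unit tests. The sketch below assumes a hypothetical minimal `KMS` interface and injects throttling on the first few calls. It is not modeled on any specific provider's client.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrThrottled = errors.New("kms: throttled")

// KMS is a hypothetical minimal client interface.
type KMS interface {
	Sign(msg []byte) ([]byte, error)
}

// fakeKMS always succeeds; it stands in for the real backend.
type fakeKMS struct{}

func (fakeKMS) Sign(msg []byte) ([]byte, error) {
	return append([]byte("sig:"), msg...), nil
}

// Flaky wraps a KMS and throttles the first FailFirst calls.
type Flaky struct {
	Inner     KMS
	FailFirst int
	calls     int
}

func (f *Flaky) Sign(msg []byte) ([]byte, error) {
	f.calls++
	if f.calls <= f.FailFirst {
		return nil, ErrThrottled
	}
	return f.Inner.Sign(msg)
}

// SignWithRetry retries only on throttling, up to attempts tries in total.
func SignWithRetry(k KMS, msg []byte, attempts int) ([]byte, error) {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		var sig []byte
		if sig, err = k.Sign(msg); err == nil {
			return sig, nil
		}
		if !errors.Is(err, ErrThrottled) {
			return nil, err // non-retryable failure
		}
	}
	return nil, err
}

func main() {
	sig, err := SignWithRetry(&Flaky{Inner: fakeKMS{}, FailFirst: 2}, []byte("m"), 3)
	fmt.Println(string(sig), err)
	_, err = SignWithRetry(&Flaky{Inner: fakeKMS{}, FailFirst: 2}, []byte("m"), 2)
	fmt.Println(err) // retry budget exhausted
}
```

A production retry loop would also need backoff and a deadline. They are omitted to keep the fault-injection pattern visible.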

Operational notes

Make audit streams append-only and queryable during incidents.
Inventory keys and usage paths; treat unknown usage as an incident.
Alert on policy drift: cipher suites, key sizes, algorithm toggles, TTL changes.
Separate duties and restrict production key access paths.
Test backup/restore for crypto material with the same rigor as databases.
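
The policy-drift alert above can be sketched as a field-by-field comparison between a reviewed baseline and what is actually deployed. The `Policy` struct and its fields are assumptions for the example, loosely following the items the alert names.

```go
package main

import "fmt"

// Policy is a hypothetical snapshot of security-relevant settings.
type Policy struct {
	CipherSuites     []string // ordered by preference, so order matters
	MinRSABits       int
	HandleTTLSeconds int
}

// sameOrder reports whether two lists have the same elements in the same order.
func sameOrder(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// Drift returns one human-readable finding per field that differs.
func Drift(baseline, observed Policy) []string {
	var d []string
	if !sameOrder(baseline.CipherSuites, observed.CipherSuites) {
		d = append(d, fmt.Sprintf("cipher_suites: %v -> %v", baseline.CipherSuites, observed.CipherSuites))
	}
	if baseline.MinRSABits != observed.MinRSABits {
		d = append(d, fmt.Sprintf("min_rsa_bits: %d -> %d", baseline.MinRSABits, observed.MinRSABits))
	}
	if baseline.HandleTTLSeconds != observed.HandleTTLSeconds {
		d = append(d, fmt.Sprintf("handle_ttl_seconds: %d -> %d", baseline.HandleTTLSeconds, observed.HandleTTLSeconds))
	}
	return d
}

func main() {
	base := Policy{CipherSuites: []string{"TLS_AES_256_GCM_SHA384"}, MinRSABits: 3072, HandleTTLSeconds: 300}
	live := Policy{CipherSuites: []string{"TLS_AES_256_GCM_SHA384"}, MinRSABits: 2048, HandleTTLSeconds: 3600}
	for _, finding := range Drift(base, live) {
		fmt.Println("policy drift:", finding)
	}
}
```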

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.
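
An explicit trigger can be as simple as a pure function from observed signals to a decision, which makes it easy to review and rehearse. The sketch below is one hypothetical shape, and the thresholds in it are placeholders rather than recommendations.

```go
package main

import "fmt"

// Signals is a hypothetical snapshot taken during a canary or staged rollout.
type Signals struct {
	ErrorRate           float64 // fraction of failed requests, 0..1
	P99LatencyMs        float64
	InvariantViolations int
}

// Thresholds are the limits agreed before the rollout starts.
type Thresholds struct {
	MaxErrorRate    float64
	MaxP99LatencyMs float64
}

// ShouldRollback returns the first breached condition, or ("", false) if the
// rollout may continue. Any invariant violation triggers a rollback.
func ShouldRollback(s Signals, t Thresholds) (reason string, rollback bool) {
	switch {
	case s.InvariantViolations > 0:
		return "invariant violation observed", true
	case s.ErrorRate > t.MaxErrorRate:
		return "error rate above threshold", true
	case s.P99LatencyMs > t.MaxP99LatencyMs:
		return "p99 latency above threshold", true
	}
	return "", false
}

func main() {
	limits := Thresholds{MaxErrorRate: 0.01, MaxP99LatencyMs: 250} // placeholder values
	fmt.Println(ShouldRollback(Signals{ErrorRate: 0.002, P99LatencyMs: 120}, limits))
	fmt.Println(ShouldRollback(Signals{ErrorRate: 0.05, P99LatencyMs: 120}, limits))
	fmt.Println(ShouldRollback(Signals{InvariantViolations: 1}, limits))
}
```

Recording the returned reason next to the rollback event keeps the trigger condition available for later review.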

What to monitor

Authz failures and policy denials (unexpected spikes).
Rollback events and the conditions that triggered them.
Invariant violation rate (should be ~0).
Error budget burn + tail latency under load.
Admission-control / rate-limit rejections (by reason).

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Keep dual-write / dual-verify windows where appropriate.
Define an explicit rollback trigger (metrics + thresholds).
Prefer backward-compatible changes; avoid “flag day” upgrades.
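
A dual-verify window can be sketched as a verifier that accepts the current key always and the previous key only until a cutoff. The example below uses HMAC-SHA256 for brevity. The `Verifier` type and its fields are hypothetical, and an asymmetric scheme would follow the same shape.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// tag computes an HMAC-SHA256 tag for msg under key.
func tag(key, msg []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write(msg)
	return m.Sum(nil)
}

// Verifier accepts the current key always and the previous key only until
// PreviousValidUntil (Unix seconds), which bounds the overlap window.
type Verifier struct {
	Current            []byte
	Previous           []byte
	PreviousValidUntil int64
}

// Verify reports whether t is a valid tag for msg at time nowUnix.
func (v Verifier) Verify(msg, t []byte, nowUnix int64) bool {
	if hmac.Equal(t, tag(v.Current, msg)) {
		return true
	}
	if len(v.Previous) > 0 && nowUnix < v.PreviousValidUntil {
		return hmac.Equal(t, tag(v.Previous, msg))
	}
	return false
}

func main() {
	oldKey, newKey := []byte("old-key"), []byte("new-key")
	v := Verifier{Current: newKey, Previous: oldKey, PreviousValidUntil: 2000}
	msg := []byte("payload")

	fmt.Println(v.Verify(msg, tag(newKey, msg), 1500)) // true: current key
	fmt.Println(v.Verify(msg, tag(oldKey, msg), 1500)) // true: inside the overlap window
	fmt.Println(v.Verify(msg, tag(oldKey, msg), 2500)) // false: window closed
}
```

Rolling back during the window amounts to swapping `Current` and `Previous`, which is one reason to keep the old key verifiable until the new one has proven itself.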

Evidence

Jepsen (1) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

What is your plan for emergency revocation at global scale?
What would a KMS compromise look like in your telemetry?
Which secrets must remain confidential for 10+ years and where are they stored today?
How do you guarantee that audit does not become a data exfiltration channel?

Checklist

Failure modes enumerated with mitigations.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Safety properties stated as invariants.
Assumptions listed and reviewed.
Rollback plan rehearsed and automated.
Telemetry captures correctness signals.
