Secrets Hygiene: Rotation, Scoping, and Runtime Delivery

Monthly research note. Theme: DevSecOps & Resilience Engineering.

TL;DR

This memo covers secrets hygiene (rotation, scoping, and runtime delivery) in three steps: define the model, state the properties it must satisfy, then design the system so those properties hold under failure and against adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Treat CI/CD as attacker-controlled until proven otherwise; minimize secrets and privileges.
Policy-as-code needs tests, rollout, and rollback like any other production system.
Make rollback a first-class operation: design it into the happy path, with explicit triggers and regular rehearsal.
Write assumptions down; treat them as interfaces.

Why this matters

Policy drift is the default; guardrails must be automated and enforced.
Rollouts are where incidents happen; safe rollback is a security feature.
Supply-chain attacks target your CI/CD because it has keys and reach.
Infrastructure-as-code without policy is just scripting the attack surface.

Key questions

What is your supply-chain threat model (dependency poisoning, CI compromise)?
How do you rehearse incident response as code (runbooks, chaos, drills)?
How do you prevent “break glass” from becoming the standard path?
How do you do safe rollouts (canary, blast-radius, rapid rollback)?
What is the minimum set of humans who can ship to production?
Which signals prove correctness (not just availability) in production?

Assumptions

Dependencies can be compromised upstream (typosquatting, maintainer takeover).
Rollbacks must be executed under time pressure.
Policy enforcement must be consistent across environments.
CI runners are exposed to untrusted code (PRs, dependencies).

Non-goals

Trusting CI environments by default.
Long-lived credentials embedded in pipelines.

Attack surface

Negotiation and fallback paths are where security silently becomes optional; treat any request to downgrade as hostile.

Model & invariants

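A policy gate is a predicate P over deployment metadata: a deploy is allowed if and only if P holds for the artifact's attestation, its scan results, and the target environment: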

\mathrm{allow}(\text{deploy}) \Leftrightarrow P(\text{attestation},\ \text{scan},\ \text{env}).

Policy should be code with diffs and reviews—guardrails, not guidelines.
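
The predicate above can be sketched in a few lines. This is a minimal illustration, not a real policy engine: the `DeployRequest` fields, the environment names, and the rule that prod requires zero critical findings are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    # Hypothetical evidence fields; a real gate would read these from
    # signed attestations and scanner output rather than plain booleans.
    attestation_verified: bool  # provenance checked against a trusted signer
    critical_findings: int      # unresolved critical scan findings
    env: str                    # target environment name


def allow(req: DeployRequest) -> bool:
    """P(attestation, scan, env): deny by default, allow only on evidence."""
    if not req.attestation_verified:
        return False
    if req.env == "prod":
        return req.critical_findings == 0
    # Assumed rule: non-prod tolerates findings but still needs provenance.
    # Unknown environments fall through to a denial.
    return req.env in {"dev", "staging"}
```

Because the gate is a pure function of its inputs, each rule can be covered by an ordinary unit test and reviewed as a diff.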

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).

Security properties

Least authority: privileges are scoped by purpose and time.
Authenticity: actions are bound to identity and purpose.
Downgrade resistance: negotiation can’t silently weaken security posture.
Integrity: invalid transitions are rejected (and detectable).
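
As a rough sketch of least authority, the snippet below models a credential that carries an explicit scope set and an expiry, and an authorization check that honors both. The `Credential` type, the scope strings, and the 15-minute lifetime are hypothetical; a real system would delegate this to its identity provider or secrets manager.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Credential:
    subject: str           # who the credential was issued to
    scopes: frozenset      # what it may be used for (purpose)
    expires_at: datetime   # when it stops being valid (time)


def issue(subject: str, scopes: set, ttl: timedelta, now: datetime) -> Credential:
    """Issue a credential limited to the given scopes and lifetime."""
    return Credential(subject, frozenset(scopes), now + ttl)


def authorize(cred: Credential, required_scope: str, now: datetime) -> bool:
    """Allow only if the credential is unexpired and names the exact scope."""
    if now >= cred.expires_at:
        return False
    return required_scope in cred.scopes


# Usage: a CI job gets 15 minutes of staging-only deploy authority.
_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
_cred = issue("ci-job-42", {"deploy:staging"}, timedelta(minutes=15), _now)
```

Passing `now` explicitly keeps the check deterministic, which makes expiry behavior easy to test.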

Failure modes

Mixed-version behavior that violates assumptions silently.
Observability gaps during incidents (missing evidence).
Timeout ambiguity causing double-apply or partial state transitions.
Config drift that weakens security posture over time.
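
One common mitigation for the timeout-ambiguity case is an idempotency key: a retried request carries the same identifier, so the operation is applied at most once. The sketch below applies that idea to secret rotation. The `SecretStore` class and its in-memory bookkeeping are hypothetical; a real store would persist the request log durably alongside the secret.

```python
import secrets


class SecretStore:
    """In-memory stand-in for a secrets backend (illustration only)."""

    def __init__(self) -> None:
        self.version = 0
        self.value = None
        self._seen = {}  # request_id -> version produced by that request

    def rotate(self, request_id: str) -> int:
        """Rotate the secret once per request_id; retries return the same version."""
        if request_id in self._seen:
            return self._seen[request_id]
        self.version += 1
        self.value = secrets.token_hex(16)
        self._seen[request_id] = self.version
        return self.version
```

If a client times out and retries `rotate("req-1")`, it gets the version created by the first attempt instead of triggering a second rotation that would invalidate a secret already handed to consumers.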

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart TD
  pr["PR"] --> checks["Checks"]
  checks --> merge["Merge"]
  merge --> release["Release"]
  release --> canary["Canary"]
  canary --> prod["Prod"]
  prod --> rollback["Rollback Plan"]

Implementation notes

The pipeline is production: it has credentials, network reach, and authority.

Rule of thumb

Acknowledge a write only after it is durable (or make “ack” explicitly best-effort).

CI hardening checklist:
- No long-lived secrets in CI
- OIDC to obtain short-lived creds
- Pin dependencies and verify integrity
- Reproducible builds + provenance attestation
- Policy-as-code gates (deploy blocked on evidence)
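
The "pin dependencies and verify integrity" item can be illustrated with a digest check. This is a minimal sketch assuming the pinned value is a hex-encoded SHA-256 digest recorded at review time; the function name is hypothetical, and real package managers provide their own hash-checking modes that should be preferred.

```python
import hashlib
import hmac


def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """Return True only if the artifact bytes match the pinned SHA-256 digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest performs a constant-time comparison of the two hex strings.
    return hmac.compare_digest(actual, pinned_sha256.lower())
```

A build step would call this on each downloaded artifact and fail closed when it returns False.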

Verification strategy

Policy tests: unit tests for policy-as-code rules.
Dependency tampering drills: lockfile changes, integrity failures.
Runtime conformance: detect drift between desired and actual state.
Rollback tests: run them as part of every release, not only “if needed”.
Pipeline attack simulations: compromise a runner and measure blast radius.
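
Runtime conformance reduces to comparing desired and actual state and reporting every difference. The following is a small sketch over flat dictionaries; the setting names in the usage lines are invented for illustration, and real configurations are usually nested and need a recursive comparison.

```python
MISSING = object()  # sentinel for "key absent", distinct from None or False


def detect_drift(desired: dict, actual: dict) -> dict:
    """Map each drifted key to a (desired, actual) pair; empty dict means no drift."""
    drift = {}
    for key in desired.keys() | actual.keys():
        want = desired.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Usage with hypothetical settings: a weakened TLS floor and an
# undeclared debug flag both show up as drift.
_desired = {"tls_min_version": "1.2", "public_access": False}
_actual = {"tls_min_version": "1.0", "public_access": False, "debug": True}
_report = detect_drift(_desired, _actual)
```

Keys present only in the actual state are reported too, since undeclared settings are a typical way for posture to weaken unnoticed.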

Operational notes

Audit who can ship and how; remove implicit paths.
Rehearse incident response for the pipeline itself.
Keep a provenance trail for every artifact deployed to production.
Treat policy changes as security-sensitive deploys (review + rollout).
Continuously scan and inventory dependencies; prioritize by exposure.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Invariant violation rate (should be ~0).
Error budget burn + tail latency under load.
Retry/timeout rates by endpoint and client cohort.
Rollback events and the conditions that triggered them.
Authz failures and policy denials (unexpected spikes).

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Use canaries and staged rollout; stop early when signals degrade.
Prefer backward-compatible changes; avoid “flag day” upgrades.
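
An explicit rollback trigger can be written down as data plus a small decision function. The sketch below is illustrative only: the threshold fields, the 2% and 500 ms values in the comments, and the minimum-sample guard are assumptions, not recommended settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    max_error_rate: float      # e.g. 0.02 for 2% (assumed value)
    max_p99_latency_ms: float  # e.g. 500.0 (assumed value)
    min_samples: int           # ignore canary windows with too little traffic


def should_rollback(trigger: RollbackTrigger, errors: int, requests: int,
                    p99_latency_ms: float) -> bool:
    """Return True when canary metrics breach either threshold."""
    if requests == 0 or requests < trigger.min_samples:
        return False  # not enough evidence yet; keep observing
    error_rate = errors / requests
    return (error_rate > trigger.max_error_rate
            or p99_latency_ms > trigger.max_p99_latency_ms)
```

The minimum-sample guard avoids rolling back on noise from a handful of requests; a real rollout would likely pair it with a separate hard stop for obviously broken canaries.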

Evidence

Jepsen (1) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Learn TLA+ (2) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

What is the smallest CI compromise that becomes a prod compromise today?
Can you answer “what code is running” with cryptographic evidence?
Which deploy actions are irreversible and how do you mitigate that?
How quickly can you revoke all pipeline credentials in an incident?

Checklist

Assumptions listed and reviewed.
Failure modes enumerated with mitigations.
Telemetry captures correctness signals.
Rollback plan rehearsed and automated.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Safety properties stated as invariants.
