Monthly research note. Theme: DevSecOps & Resilience Engineering.

TL;DR

A focused memo on secure configuration through policy-as-code and guardrails: define the model, state the properties, then design the system so those properties remain true under failure and attack.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

  • Provenance is a cryptographic statement; ship evidence with artifacts.
  • Treat CI/CD as attacker-controlled until proven otherwise; minimize secrets and privileges.
  • Policy-as-code needs tests, rollout, and rollback like any other production system.
  • Treat retries, reordering, and partial failure as default conditions.
  • Automate guardrails; humans are for judgment, not for consistent enforcement.

Why this matters

  • Policy drift is the default; guardrails must be automated and enforced.
  • Supply-chain attacks target your CI/CD because it has keys and reach.
  • Reproducibility is how you know what you shipped is what you built.
  • Runtime security needs evidence pipelines, not just dashboards.

Key questions

  • How do you manage secrets without long-lived credentials in CI?
  • What is your supply-chain threat model (dependency poisoning, CI compromise)?
  • How do you rehearse incident response as code (runbooks, chaos, drills)?
  • How do you prevent “break glass” from becoming the standard path?
  • Where do you enforce policy (pre-merge, build, deploy, runtime)?
  • What is the minimum set of humans who can ship to production?

Assumptions

  • Rollbacks must be executed under time pressure.
  • Observability pipelines can be attacked (log injection, PII leaks).
  • Policy enforcement must be consistent across environments.
  • CI runners are exposed to untrusted code (PRs, dependencies).

Non-goals

  • Manual policy enforcement or manual security review as the only control.
  • Trusting CI environments by default.

Attack surface

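Observability pipelines can themselves be attacked, for example through cardinality explosions or log injection. Treat them as part of the attack surface and protect them like any other production input path.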
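
As a minimal sketch of one defence against log injection, the snippet below emits each event as a single JSON line, so a newline smuggled into a field cannot forge a second log record. The function name `log_line` and the 256-character cap are illustrative assumptions, not part of any particular logging library.

```python
import json

MAX_FIELD_LEN = 256  # hypothetical cap; bounds the size of any single field


def log_line(event: str, **fields: object) -> str:
    """Serialize one event as a single JSON line.

    json.dumps escapes newlines and other control characters, so
    attacker-supplied text stays inside its own field.
    """
    record = {"event": event}
    for key, value in fields.items():
        record[key] = str(value)[:MAX_FIELD_LEN]
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # The embedded newline is escaped rather than starting a new log line.
    print(log_line("login_failed", user="alice\nevent=admin_granted"))
```

This covers injection and field size only; bounding label cardinality would need a separate allowlist of field names and values.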

Model & invariants

A policy gate is a predicate over metadata:

\[ \mathrm{allow}(\text{deploy}) \Leftrightarrow P(\text{attestation},\ \text{scan},\ \text{env}). \]

Policy should be code with diffs and reviews—guardrails, not guidelines.
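
The predicate can be sketched as a pure function over deploy metadata. Everything below is an assumption made for illustration: the field names, the rule that production blocks on critical vulnerabilities, and the idea that attestation verification has already happened upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    attestation_verified: bool  # assumed: provenance was verified upstream
    critical_vulns: int         # assumed: count reported by the scanner
    env: str                    # target environment name, e.g. "prod"


def allow(req: DeployRequest) -> tuple[bool, list[str]]:
    """Return the decision together with the reasons for any denial."""
    reasons = []
    if not req.attestation_verified:
        reasons.append("missing or unverified attestation")
    if req.env == "prod" and req.critical_vulns > 0:
        reasons.append("critical vulnerabilities present")
    return (not reasons, reasons)


if __name__ == "__main__":
    print(allow(DeployRequest(True, 0, "prod")))
    print(allow(DeployRequest(False, 2, "prod")))
```

Returning the reasons alongside the boolean keeps the gate explainable: a blocked deploy states which evidence was missing.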

Treat CI as attacker-controlled until proven otherwise; minimize secrets and privileges.

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
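
One way to apply this invariant to configuration rollout is to stamp each policy bundle with an epoch and refuse anything that is not strictly newer than what was last applied. The class name `EpochGuard` is hypothetical, and the sketch keeps state in memory only.

```python
class EpochGuard:
    """Accept an update only if its epoch is strictly newer than the last applied."""

    def __init__(self) -> None:
        self.applied_epoch = 0

    def try_apply(self, epoch: int) -> bool:
        # Stale or duplicated updates are rejected regardless of wall-clock time.
        if epoch <= self.applied_epoch:
            return False
        self.applied_epoch = epoch
        return True


if __name__ == "__main__":
    guard = EpochGuard()
    for epoch in (1, 3, 2, 3):
        print(epoch, guard.try_apply(epoch))
```

Because the comparison uses a counter rather than a timestamp, a delayed or reordered message cannot roll the configuration back.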

Security properties

  • Authenticity: actions are bound to identity and purpose.
  • Evidence: critical actions emit verifiable audit events.
  • Replay resistance: duplicated inputs do not change outcomes.
  • Downgrade resistance: negotiation can’t silently weaken security posture.
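
Replay resistance from the list above can be illustrated with a request identifier that deduplicates repeated inputs. This is a sketch under assumptions: the caller supplies a unique `request_id`, and results are remembered in memory rather than in durable storage.

```python
class IdempotentApplier:
    """Apply each request at most once, keyed by a caller-supplied request id."""

    def __init__(self) -> None:
        self.total = 0
        self._results: dict[str, int] = {}

    def apply(self, request_id: str, delta: int) -> int:
        # A replayed request returns the recorded result and changes nothing.
        if request_id in self._results:
            return self._results[request_id]
        self.total += delta
        self._results[request_id] = self.total
        return self.total


if __name__ == "__main__":
    applier = IdempotentApplier()
    print(applier.apply("req-1", 5))
    print(applier.apply("req-1", 5))  # replay of the same request
    print(applier.total)
```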

Failure modes

  • Timeout ambiguity causing double-apply or partial state transitions.
  • Observability gaps during incidents (missing evidence).
  • Mixed-version behavior that violates assumptions silently.
  • Recovery paths that only work when nothing is broken.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

flowchart TD
  pr["PR"] --> checks["Checks"]
  checks --> merge["Merge"]
  merge --> release["Release"]
  release --> canary["Canary"]
  canary --> prod["Prod"]
  prod --> rollback["Rollback Plan"]

Implementation notes

Build systems that can prove what happened after an incident.
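
One common building block for such evidence is a hash-chained audit log, where each entry commits to the one before it. The sketch below is illustrative only: the entry layout and function names are assumptions, and a real deployment would also sign entries or anchor the latest hash somewhere the writer cannot modify.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev: str, event: dict) -> str:
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


def append_event(log: list[dict], event: dict) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": _digest(prev, event)})


def verify(log: list[dict]) -> bool:
    """Recompute the chain; any edited or removed interior entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


if __name__ == "__main__":
    log: list[dict] = []
    append_event(log, {"actor": "ci", "action": "deploy"})
    append_event(log, {"actor": "oncall", "action": "rollback"})
    print(verify(log))
```

The chain alone detects modification of earlier entries; it does not stop an attacker who can rewrite the whole log, which is why the head hash needs an external anchor.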

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.

CI hardening checklist:

  • No long-lived secrets in CI.
  • OIDC to obtain short-lived credentials.
  • Pin dependencies and verify integrity.
  • Reproducible builds + provenance attestation.
  • Policy-as-code gates (deploy blocked on evidence).
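
The "pin dependencies and verify integrity" item can be sketched as a digest comparison against a pinned value. The function name is hypothetical, and the pinned digest is assumed to come from a reviewed lockfile.

```python
import hashlib
import hmac


def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """Return True only if the artifact's SHA-256 matches the pinned digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking the position of the first mismatch.
    return hmac.compare_digest(actual, pinned_sha256.lower())


if __name__ == "__main__":
    artifact = b"example artifact bytes"
    pin = hashlib.sha256(artifact).hexdigest()  # stands in for a lockfile entry
    print(verify_artifact(artifact, pin))
    print(verify_artifact(b"tampered bytes", pin))
```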

Verification strategy

  • Dependency tampering drills: lockfile changes, integrity failures.
  • Runtime conformance: detect drift between desired and actual state.
  • Pipeline attack simulations: compromise a runner and measure blast radius.
  • Rollback tests as part of release (not “if needed”).
  • Policy tests: unit tests for policy-as-code rules.
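
The last item, unit tests for policy rules, might look like the table-driven sketch below. The rule itself (pinned dependencies, no static secrets) and the pipeline dictionary layout are assumptions chosen for illustration.

```python
def violations(pipeline: dict) -> list[str]:
    """Hypothetical rule set: dependencies must be pinned, no static secrets."""
    found = []
    for dep in pipeline.get("dependencies", []):
        if "sha256" not in dep:
            found.append(f"unpinned dependency: {dep['name']}")
    if pipeline.get("static_secrets"):
        found.append("long-lived secret in CI")
    return found


# Each case: (name, pipeline description, expected number of violations).
CASES = [
    ("clean", {"dependencies": [{"name": "lib", "sha256": "ab" * 32}]}, 0),
    ("unpinned", {"dependencies": [{"name": "lib"}]}, 1),
    ("secret", {"static_secrets": ["DEPLOY_KEY"]}, 1),
    ("both", {"dependencies": [{"name": "lib"}], "static_secrets": ["K"]}, 2),
]


def run_policy_tests() -> list[str]:
    """Return a description of every failing case (empty means all passed)."""
    failures = []
    for name, pipeline, expected in CASES:
        got = len(violations(pipeline))
        if got != expected:
            failures.append(f"{name}: expected {expected}, got {got}")
    return failures


if __name__ == "__main__":
    print(run_policy_tests() or "all policy tests passed")
```

Keeping both allowed and denied examples in the table guards against a rule that silently starts permitting everything.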

Operational notes

  • Rehearse incident response for the pipeline itself.
  • Treat policy changes as security-sensitive deploys (review + rollout).
  • Audit who can ship and how; remove implicit paths.
  • Keep a provenance trail for every artifact deployed to production.
  • Continuously scan and inventory dependencies; prioritize by exposure.

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.
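
A small sketch of making that choice explicit: the wrapper below decides what happens when the policy check itself fails, for example because a policy service is unreachable. The function name `gated` and its default are assumptions for illustration.

```python
from typing import Callable


def gated(check: Callable[[], bool], fail_closed: bool = True) -> bool:
    """Run a policy check; on error, deny (fail closed) or allow (fail open)."""
    try:
        return check()
    except Exception:
        # Broad catch on purpose: any failure of the check is a degraded mode,
        # and the outcome is the configured policy choice, not an accident.
        return not fail_closed


def unreachable_policy_service() -> bool:
    raise TimeoutError("policy service unreachable")


if __name__ == "__main__":
    print(gated(unreachable_policy_service))                     # fail closed
    print(gated(unreachable_policy_service, fail_closed=False))  # fail open
```

Defaulting to fail closed means an outage of the checker blocks deploys rather than waving them through; whether that is the right default depends on the action being gated.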

What to monitor

  • Retry/timeout rates by endpoint and client cohort.
  • Error budget burn + tail latency under load.
  • Rollback events and the conditions that triggered them.
  • Invariant violation rate (should be ~0).
  • Admission-control / rate-limit rejections (by reason).

Rollback plan

  • Use canaries and staged rollout; stop early when signals degrade.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Keep dual-write / dual-verify windows where appropriate.
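
An explicit rollback trigger of the "metrics + thresholds" kind can be sketched as follows. The metric names and threshold values are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float = 0.02        # placeholder value
    max_p99_latency_ms: float = 500.0   # placeholder value


def breached_triggers(error_rate: float, p99_latency_ms: float,
                      limits: Thresholds = Thresholds()) -> list[str]:
    """List every threshold the canary currently exceeds."""
    breached = []
    if error_rate > limits.max_error_rate:
        breached.append("error rate above threshold")
    if p99_latency_ms > limits.max_p99_latency_ms:
        breached.append("p99 latency above threshold")
    return breached


def should_roll_back(error_rate: float, p99_latency_ms: float,
                     limits: Thresholds = Thresholds()) -> bool:
    """Roll back as soon as any single trigger is breached."""
    return bool(breached_triggers(error_rate, p99_latency_ms, limits))


if __name__ == "__main__":
    print(should_roll_back(0.01, 300.0))
    print(breached_triggers(0.05, 900.0))
```

Recording which trigger fired, not just the boolean, supports the bullet on preserving evidence about why a rollback happened.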

Evidence

  • Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
  • Learn TLA+ (2) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

  • Which deploy actions are irreversible and how do you mitigate that?
  • How quickly can you revoke all pipeline credentials in an incident?
  • What is the smallest CI compromise that becomes a prod compromise today?
  • Can you answer “what code is running” with cryptographic evidence?

Checklist

  • Assumptions listed and reviewed.
  • Safety properties stated as invariants.
  • Telemetry captures correctness signals.
  • Rollback plan rehearsed and automated.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Failure modes enumerated with mitigations.

Further reading

1. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/
2. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/