Monthly research note. Theme: DevSecOps & Resilience Engineering.

TL;DR

Treat reproducible CI/CD (determinism as defense) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec, and you’ll learn it during incidents.

Key takeaways

  • Make rollback a first-class operation with explicit triggers and rehearsal.
  • Policy-as-code needs tests, rollout, and rollback like any other production system.
  • Treat CI/CD as attacker-controlled until proven otherwise; minimize secrets and privileges.
  • Measure correctness signals, not only latency/throughput.
  • Make boundaries boring: validate inputs, cap costs, and be deterministic where needed.

Why this matters

  • Infrastructure-as-code without policy is just scripting the attack surface.
  • Policy drift is the default; guardrails must be automated and enforced.
  • Reproducibility is how you know what you shipped is what you built.
  • Secrets in CI turn “one compromised job” into “full compromise.”

Key questions

  • Which signals prove correctness (not just availability) in production?
  • How do you rehearse incident response as code (runbooks, chaos, drills)?
  • What is the minimum set of humans who can ship to production?
  • What is your supply-chain threat model (dependency poisoning, CI compromise)?
  • How do you manage secrets without long-lived credentials in CI?
  • Where do you enforce policy (pre-merge, build, deploy, runtime)?

Assumptions

  • CI runners are exposed to untrusted code (PRs, dependencies).
  • Rollbacks must be executed under time pressure.
  • Dependencies can be compromised upstream (typosquatting, maintainer takeover).
  • Policy enforcement must be consistent across environments.

Non-goals

  • Long-lived credentials embedded in pipelines.
  • Manual policy enforcement or manual security review as the only control.

Attack surface

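Observability pipelines can be attacked (cardinality explosions, log injection). Protect them like any other input boundary: bound label cardinality and escape untrusted values before they are written.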
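A minimal sketch of both defenses, assuming hypothetical helpers named `SanitizeLogValue` and `LabelLimiter` (neither comes from a real library). The first escapes CR/LF so untrusted input cannot forge extra log lines; the second caps the number of distinct label values a metric will accept and folds the overflow into "other". It is not safe for concurrent use as written.

```go
package main

import (
	"fmt"
	"strings"
)

// SanitizeLogValue escapes CR and LF so untrusted input cannot
// inject additional log lines. (Hypothetical helper.)
func SanitizeLogValue(s string) string {
	r := strings.NewReplacer("\r", `\r`, "\n", `\n`)
	return r.Replace(s)
}

// LabelLimiter admits at most max distinct label values.
// (Hypothetical type; add a mutex before sharing across goroutines.)
type LabelLimiter struct {
	max  int
	seen map[string]struct{}
}

func NewLabelLimiter(max int) *LabelLimiter {
	return &LabelLimiter{max: max, seen: make(map[string]struct{})}
}

// Admit returns v if it is already known or there is still room;
// otherwise it returns "other" so cardinality stays bounded.
func (l *LabelLimiter) Admit(v string) string {
	if _, ok := l.seen[v]; ok {
		return v
	}
	if len(l.seen) >= l.max {
		return "other"
	}
	l.seen[v] = struct{}{}
	return v
}

func main() {
	fmt.Println(SanitizeLogValue("user=alice\nlevel=error forged"))

	lim := NewLabelLimiter(2)
	for _, v := range []string{"a", "b", "c", "a"} {
		fmt.Println(lim.Admit(v))
	}
}
```

The cap is a cost bound, not a filter: the event is still recorded, only its attacker-chosen label is collapsed.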

Model & invariants

Build provenance is a cryptographic statement:

$$\mathrm{attest} \leftarrow \mathrm{Sign}_{k_\text{build}}(\mathrm{hash}(\text{artifact})\ \Vert\ \text{metadata}).$$

Treat CI as attacker-controlled until proven otherwise; minimize secrets and privileges.

Make provenance verifiable: “what built this” must be cryptographically bound.

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).
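As one illustration of an invariant checked from evidence, the sketch below replays an audit log and reports every artifact that was deployed without an earlier attestation event. The `AuditEvent` shape and the event kinds "attested" and "deployed" are assumptions made for the example.

```go
package main

import "fmt"

// AuditEvent is a hypothetical audit-log record.
type AuditEvent struct {
	Kind   string // "attested" or "deployed" (assumed event kinds)
	Digest string // artifact digest the event refers to
}

// CheckDeployInvariant replays the log in order and returns the digests
// that were deployed before any attestation for them was recorded.
func CheckDeployInvariant(log []AuditEvent) []string {
	attested := make(map[string]bool)
	var violations []string
	for _, e := range log {
		switch e.Kind {
		case "attested":
			attested[e.Digest] = true
		case "deployed":
			if !attested[e.Digest] {
				violations = append(violations, e.Digest)
			}
		}
	}
	return violations
}

func main() {
	log := []AuditEvent{
		{Kind: "attested", Digest: "sha256:aaa"},
		{Kind: "deployed", Digest: "sha256:aaa"},
		{Kind: "deployed", Digest: "sha256:bbb"}, // no attestation on record
	}
	fmt.Println(CheckDeployInvariant(log)) // [sha256:bbb]
}
```

If the check cannot be written against the logs you actually retain, the invariant is not yet checkable and the gap is in the evidence, not the code.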

Security properties

  • Evidence: critical actions emit verifiable audit events.
  • Least authority: privileges are scoped by purpose and time.
  • Integrity: invalid transitions are rejected (and detectable).
  • Authenticity: actions are bound to identity and purpose.

Failure modes

  • Observability gaps during incidents (missing evidence).
  • Mixed-version behavior that violates assumptions silently.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

flowchart LR
  src["Source"] --> build["Build (reproducible)"]
  build --> attest["Attestation"]
  attest --> scan["SAST/DAST/SCA"]
  scan --> deploy["Deploy (policy gates)"]
  deploy --> runtime["Runtime Policy + Observability"]

Implementation notes

The pipeline is production: it has credentials, network reach, and authority.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.
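One common way to make a retry safe after an ambiguous timeout is an idempotency key: the caller reuses the same key on retry, and the server returns the recorded result instead of applying the change again. The sketch below is an in-memory illustration with a hypothetical `Applier` type; a real system would persist the key and the state change atomically.

```go
package main

import (
	"fmt"
	"sync"
)

// Applier applies each operation at most once per idempotency key.
// (Hypothetical type; state is in memory only.)
type Applier struct {
	mu      sync.Mutex
	results map[string]int // key -> balance recorded when the key was first applied
	balance int
}

func NewApplier() *Applier {
	return &Applier{results: make(map[string]int)}
}

// Apply adds delta once per key and returns the resulting balance.
// A retry with the same key returns the recorded result without re-applying.
func (a *Applier) Apply(key string, delta int) int {
	a.mu.Lock()
	defer a.mu.Unlock()
	if r, ok := a.results[key]; ok {
		return r
	}
	a.balance += delta
	a.results[key] = a.balance
	return a.balance
}

func main() {
	a := NewApplier()
	fmt.Println(a.Apply("req-1", 10)) // 10: first attempt
	fmt.Println(a.Apply("req-1", 10)) // 10: retry after a timeout, not applied twice
	fmt.Println(a.Apply("req-2", 5))  // 15: a different operation
}
```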

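// Token is a CI credential.
// Treat CI as untrusted: keep tokens short-lived and scoped.
type Token struct {
  Value         string // opaque secret; never log it
  ExpiresAtUnix int64  // expiry in Unix seconds; keep the lifetime short
  Scope         string // the single purpose this token is valid for
}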

Verification strategy

  • Pipeline attack simulations: compromise a runner and measure blast radius.
  • Rollback tests as part of release (not “if needed”).
  • Dependency tampering drills: lockfile changes, integrity failures.
  • Policy tests: unit tests for policy-as-code rules.
  • Runtime conformance: detect drift between desired and actual state.
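The policy-test item above can be sketched as a table of cases run against a rule. The rule itself (`Allow`: every deploy needs an attestation, and production also needs two approvals) and the `DeployRequest` shape are hypothetical, chosen only to show the testing pattern.

```go
package main

import "fmt"

// DeployRequest is a hypothetical input to a deploy-time policy gate.
type DeployRequest struct {
	Env            string
	HasAttestation bool
	Approvals      int
}

// Allow is an illustrative rule: every deploy needs an attestation,
// and production additionally needs at least two approvals.
func Allow(r DeployRequest) (bool, string) {
	if !r.HasAttestation {
		return false, "missing attestation"
	}
	if r.Env == "prod" && r.Approvals < 2 {
		return false, "prod requires two approvals"
	}
	return true, ""
}

func main() {
	cases := []struct {
		name string
		req  DeployRequest
		want bool
	}{
		{"prod with attestation and approvals", DeployRequest{"prod", true, 2}, true},
		{"prod with one approval", DeployRequest{"prod", true, 1}, false},
		{"staging without attestation", DeployRequest{"staging", false, 0}, false},
		{"staging with attestation", DeployRequest{"staging", true, 0}, true},
	}
	for _, c := range cases {
		got, reason := Allow(c.req)
		status := "PASS"
		if got != c.want {
			status = "FAIL"
		}
		fmt.Printf("%s: %s (allowed=%v, reason=%q)\n", status, c.name, got, reason)
	}
}
```

Deny cases matter as much as allow cases: a policy that never denies passes every positive test.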

Operational notes

  • Keep a provenance trail for every artifact deployed to production.
  • Treat policy changes as security-sensitive deploys (review + rollout).
  • Audit who can ship and how; remove implicit paths.
  • Rehearse incident response for the pipeline itself.
  • Continuously scan and inventory dependencies; prioritize by exposure.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

  • Invariant violation rate (should be ~0).
  • Authz failures and policy denials (unexpected spikes).
  • Error budget burn + tail latency under load.
  • Rollback events and the conditions that triggered them.
  • Retry/timeout rates by endpoint and client cohort.

Rollback plan

  • Define an explicit rollback trigger (metrics + thresholds).
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Keep dual-write / dual-verify windows where appropriate.
  • Use canaries and staged rollout; stop early when signals degrade.
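An explicit rollback trigger can be as small as a set of thresholds evaluated against canary signals. The sketch below assumes hypothetical `Signals` and `Trigger` types and illustrative threshold values; which metrics and limits apply is a decision for each service.

```go
package main

import "fmt"

// Signals is a hypothetical snapshot of canary metrics.
type Signals struct {
	ErrorRate           float64 // fraction of failed requests, 0..1
	InvariantViolations int     // count of detected invariant violations
}

// Trigger holds the explicit thresholds that force a rollback.
type Trigger struct {
	MaxErrorRate           float64
	MaxInvariantViolations int
}

// ShouldRollback returns the decision and the reason, so the reason
// can be recorded as evidence alongside the rollback event.
func (t Trigger) ShouldRollback(s Signals) (bool, string) {
	if s.InvariantViolations > t.MaxInvariantViolations {
		return true, "invariant violations above threshold"
	}
	if s.ErrorRate > t.MaxErrorRate {
		return true, "error rate above threshold"
	}
	return false, ""
}

func main() {
	trigger := Trigger{MaxErrorRate: 0.01, MaxInvariantViolations: 0} // illustrative values

	for _, s := range []Signals{
		{ErrorRate: 0.002, InvariantViolations: 0},
		{ErrorRate: 0.05, InvariantViolations: 0},
		{ErrorRate: 0.002, InvariantViolations: 1},
	} {
		rollback, reason := trigger.ShouldRollback(s)
		fmt.Printf("rollback=%v reason=%q\n", rollback, reason)
	}
}
```

Correctness signals are checked before availability signals so that a single invariant violation stops the rollout even when the error rate looks healthy.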

Evidence

  • Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
  • Learn TLA+ (2) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

  • Can you answer “what code is running” with cryptographic evidence?
  • What is the smallest CI compromise that becomes a prod compromise today?
  • How quickly can you revoke all pipeline credentials in an incident?
  • Which deploy actions are irreversible and how do you mitigate that?

Checklist

  • Telemetry captures correctness signals.
  • Rollback plan rehearsed and automated.
  • Assumptions listed and reviewed.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Failure modes enumerated with mitigations.
  • Safety properties stated as invariants.

Further reading

1. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/
2. LearnTLA. Learn TLA+ [Internet]. Web; Available from: https://learntla.com/