Monthly research note. Theme: DevSecOps & Resilience Engineering.

TL;DR

Multi-Region Design: Failover That You Can Actually Test. Treat testable failover as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

  • Policy-as-code needs tests, rollout, and rollback like any other production system.
  • Make rollback a first-class operation with explicit triggers and rehearsal.
  • Treat CI/CD as attacker-controlled until proven otherwise; minimize secrets and privileges.
  • Treat retries, reordering, and partial failure as default conditions.
  • Design rollbacks as part of the happy path.

Why this matters

  • Supply-chain attacks target your CI/CD because it has keys and reach.
  • Reproducibility is how you know what you shipped is what you built.
  • Runtime security needs evidence pipelines, not just dashboards.
  • Secrets in CI turn “one compromised job” into “full compromise.”

Key questions

  • Which signals prove correctness (not just availability) in production?
  • What is the minimum set of humans who can ship to production?
  • Where do you enforce policy (pre-merge, build, deploy, runtime)?
  • How do you manage secrets without long-lived credentials in CI?
  • How do you prevent “break glass” from becoming the standard path?
  • How do you rehearse incident response as code (runbooks, chaos, drills)?

Assumptions

  • Rollbacks must be executed under time pressure.
  • Dependencies can be compromised upstream (typosquatting, maintainer takeover).
  • CI runners are exposed to untrusted code (PRs, dependencies).
  • Policy enforcement must be consistent across environments.

Non-goals

  • Assuming deploy equals success without runtime evidence.
  • Trusting CI environments by default.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

A policy gate is a predicate over metadata:

$$\mathrm{allow}(\text{deploy}) \Leftrightarrow P(\text{attestation},\ \text{scan},\ \text{env}).$$

Treat CI as attacker-controlled until proven otherwise; minimize secrets and privileges.

Make provenance verifiable: “what built this” must be cryptographically bound to the artifact it describes.
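
The gate predicate P can be made executable. The following is a minimal sketch, not a definitive implementation: the `Attestation` and `ScanResult` record shapes, the `TRUSTED_BUILDERS` table, and the builder names are all hypothetical, and signature verification is assumed to happen elsewhere and arrive as a boolean.

```python
from dataclasses import dataclass

# Hypothetical record shapes for illustration; real attestation formats
# carry more fields than this.
@dataclass(frozen=True)
class Attestation:
    artifact_digest: str   # digest the attestation claims to describe
    builder_id: str        # identity of the build system
    signature_valid: bool  # outcome of signature verification done elsewhere

@dataclass(frozen=True)
class ScanResult:
    artifact_digest: str
    critical_findings: int

# Assumed policy input: which builders are trusted for which environment.
TRUSTED_BUILDERS = {
    "prod": {"ci.example.internal/release"},
    "staging": {"ci.example.internal/release", "ci.example.internal/dev"},
}

def allow_deploy(artifact_digest: str, att: Attestation,
                 scan: ScanResult, env: str) -> bool:
    """Allow only if all evidence refers to this exact artifact and passes."""
    trusted = TRUSTED_BUILDERS.get(env)
    if trusted is None:
        return False  # unknown environment: deny rather than guess
    return (
        att.signature_valid
        and att.artifact_digest == artifact_digest   # provenance bound to artifact
        and scan.artifact_digest == artifact_digest  # scan covers the same bytes
        and att.builder_id in trusted
        and scan.critical_findings == 0
    )
```

Comparing both digests against the artifact being deployed is what keeps a valid attestation for one build from being reused to admit another.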

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

  • Authenticity: actions are bound to identity and purpose.
  • Integrity: invalid transitions are rejected (and detectable).
  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Replay resistance: duplicated inputs do not change outcomes.
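
Replay resistance is often approached with idempotency keys. The sketch below is one way to illustrate the property, assuming a hypothetical `IdempotentApplier` with an in-memory store; a real system would need the key record and the state change to be durable and committed atomically.

```python
class IdempotentApplier:
    """Applies each request at most once, keyed by a caller-supplied key."""

    def __init__(self) -> None:
        self._results: dict = {}  # idempotency key -> outcome of first application
        self.balance = 0

    def apply(self, key: str, amount: int) -> int:
        if key in self._results:
            # Replay or retry: return the recorded outcome, change nothing.
            return self._results[key]
        self.balance += amount
        self._results[key] = self.balance
        return self.balance
```

A client that times out can resend with the same key and receive the original outcome, so a duplicated input leaves the state unchanged.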

Failure modes

  • Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
  • Recovery paths that only work when nothing is broken.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Mixed-version behavior that violates assumptions silently.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

flowchart TD
  pr["PR"] --> checks["Checks"]
  checks --> merge["Merge"]
  merge --> release["Release"]
  release --> canary["Canary"]
  canary --> prod["Prod"]
  prod --> rollback["Rollback Plan"]

Implementation notes

The pipeline is production: it has credentials, network reach, and authority.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.

CI hardening checklist:
- No long-lived secrets in CI
- Use OIDC to obtain short-lived credentials
- Pin dependencies and verify integrity
- Reproducible builds + provenance attestation
- Policy-as-code gates (deploy blocked on evidence)
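
The "pin dependencies and verify integrity" item can be sketched as a digest check. This assumes the pinned value is a hex-encoded SHA-256 digest recorded at review time (for example in a lockfile); the function name `verify_integrity` is hypothetical.

```python
import hashlib
import hmac

def verify_integrity(data: bytes, pinned_sha256: str) -> bool:
    """Return True only if the bytes hash to the pinned hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(actual, pinned_sha256.lower())
```

A build step would call this on every fetched dependency and stop on the first `False`, rather than logging and continuing.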

Verification strategy

  • Runtime conformance: detect drift between desired and actual state.
  • Dependency tampering drills: lockfile changes, integrity failures.
  • Policy tests: unit tests for policy-as-code rules.
  • Pipeline attack simulations: compromise a runner and measure blast radius.
  • Rollback tests as part of release (not “if needed”).
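
Runtime conformance from the list above can be sketched as a diff between desired and observed state. The flat key/value shape and the names `detect_drift` and `ABSENT` are assumptions for illustration; real desired-state documents are usually nested.

```python
ABSENT = "<absent>"  # marker for a key present on only one side

def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {key: (desired_value, actual_value)} for every differing key."""
    drift = {}
    for key in desired.keys() | actual.keys():
        want = desired.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift
```

An empty result means the observed state conforms; any entry is evidence to alert on or to feed into a rollback decision.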

Operational notes

  • Keep a provenance trail for every artifact deployed to production.
  • Rehearse incident response for the pipeline itself.
  • Continuously scan and inventory dependencies; prioritize by exposure.
  • Audit who can ship and how; remove implicit paths.
  • Treat policy changes as security-sensitive deploys (review + rollout).

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.
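
One way to make that choice explicit is to pass it as a parameter instead of leaving it to whatever the exception handler happens to do. This is a minimal sketch; the `guarded` helper is hypothetical.

```python
from typing import Callable

def guarded(check: Callable[[], bool], fail_open: bool) -> bool:
    """Run a policy check; if the check itself fails, apply the declared mode."""
    try:
        return check()
    except Exception:
        # The checker is unavailable or broken: the caller's declared
        # policy decides, not an accident of control flow.
        return fail_open
```

A deploy gate would typically pass `fail_open=False`, while a non-critical advisory check might pass `True`; either way the decision is visible at the call site.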

What to monitor

  • Rollback events and the conditions that triggered them.
  • Retry/timeout rates by endpoint and client cohort.
  • Authz failures and policy denials (unexpected spikes).
  • Admission-control / rate-limit rejections (by reason).
  • Invariant violation rate (should be ~0).

Rollback plan

  • Keep dual-write / dual-verify windows where appropriate.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
  • Use canaries and staged rollout; stop early when signals degrade.
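
An explicit rollback trigger can be written down as data plus a small decision function. The sketch below assumes error rate and p99 latency as the signals; the `RollbackTrigger` fields and any threshold values are hypothetical and would come from your own SLOs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RollbackTrigger:
    # All thresholds are illustrative assumptions, not recommended values.
    max_error_rate: float       # fraction of failed requests in the window
    max_p99_latency_ms: float   # latency ceiling for the canary
    min_requests: int           # do not decide on too little traffic

def rollback_reason(trigger: RollbackTrigger, errors: int, requests: int,
                    p99_latency_ms: float) -> Optional[str]:
    """Return the name of the breached threshold, or None if no rollback is due."""
    if requests < max(trigger.min_requests, 1):
        return None  # not enough evidence yet; keep observing
    if errors / requests > trigger.max_error_rate:
        return "error_rate"
    if p99_latency_ms > trigger.max_p99_latency_ms:
        return "p99_latency"
    return None
```

Returning the reason, not just a boolean, gives the rollback event a recorded cause that can be monitored and reviewed later.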

Evidence

  • Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
    • Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
  • Jepsen (2) — Fault injection and correctness testing for distributed systems.
    • Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
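
To tie a trigger to SLO burn, the usual quantity is the burn rate: the observed error rate divided by the error budget (1 minus the SLO target). A minimal sketch, assuming a request-based availability SLO; the function name and the example target are hypothetical.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if requests == 0:
        return 0.0  # no traffic in the window, so no budget was consumed
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget
```

A burn rate of 1 consumes the budget exactly over the SLO window; sustained values well above 1 are the kind of signal a rollback trigger or page can key on.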

Open questions

  • Can you answer “what code is running” with cryptographic evidence?
  • How quickly can you revoke all pipeline credentials in an incident?
  • What is the smallest CI compromise that becomes a prod compromise today?
  • Which deploy actions are irreversible and how do you mitigate that?

Checklist

  • Rollback plan rehearsed and automated.
  • Safety properties stated as invariants.
  • Telemetry captures correctness signals.
  • Assumptions listed and reviewed.
  • Failure modes enumerated with mitigations.
  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

Further reading

  • NIST SP 800-218 (SSDF) — Secure software development practices as an engineering framework.
  • SLSA v1.0 Specification — Supply-chain levels and provenance requirements.
  • in-toto — Securing the integrity of software supply chains with attestations.
  • Sigstore — Signing and verifying artifacts at scale with transparency logs.
  • Jepsen — Fault injection and correctness testing for distributed systems.
  • Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.

1. Beyer B, Jones C, Petoff J, Murphy NR. Site Reliability Engineering: How Google Runs Production Systems [Internet]. O’Reilly Media; 2016. Available from: https://sre.google/sre-book/table-of-contents/
2. Jepsen. Jepsen: Distributed Systems Safety Analysis [Internet]. Available from: https://jepsen.io/