Monthly research note. Theme: DevSecOps & Resilience Engineering.
TL;DR
A focused memo on "Rust/Go Secure Coding Patterns: The Bugs That Still Happen". The approach: define the model, state the properties, then design the system so that those properties still hold under failures and in the presence of adversaries.
Correctness is cheaper to enforce at interfaces than to repair in production data.
Key takeaways
- Make rollback a first-class operation with explicit triggers and rehearsal.
- Policy-as-code needs tests, rollout, and rollback like any other production system.
- Treat CI/CD as attacker-controlled until proven otherwise; minimize secrets and privileges.
- Write assumptions down; treat them as interfaces.
- Measure correctness signals, not only latency/throughput.
Why this matters
- Policy drift is the default; guardrails must be automated and enforced.
- Infrastructure-as-code without policy is just scripting the attack surface.
- Supply-chain attacks target your CI/CD because it has keys and reach.
- Runtime security needs evidence pipelines, not just dashboards.
Key questions
- What is the minimum set of humans who can ship to production?
- How do you prevent “break glass” from becoming the standard path?
- How do you rehearse incident response as code (runbooks, chaos, drills)?
- Where do you enforce policy (pre-merge, build, deploy, runtime)?
- What is your supply-chain threat model (dependency poisoning, CI compromise)?
- How do you manage secrets without long-lived credentials in CI?
Assumptions
- CI runners are exposed to untrusted code (PRs, dependencies).
- Rollbacks must be executed under time pressure.
- Observability pipelines can be attacked (log injection, PII leaks).
- Dependencies can be compromised upstream (typosquatting, maintainer takeover).
Non-goals
- Assuming deploy equals success without runtime evidence.
- Long-lived credentials embedded in pipelines.
Any unbounded work per request becomes a DoS primitive in the presence of adversaries.
Model & invariants
A policy gate is a predicate over metadata: given what is known about a change or artifact, it returns allow or deny.
Treat CI as attacker-controlled until proven otherwise; minimize secrets and privileges.
Policy should be code with diffs and reviews—guardrails, not guidelines.
If the system can enter an invalid state, it eventually will—usually during an incident.
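The policy-gate-as-predicate idea can be sketched as small, composable functions. This is an illustrative sketch only: `ArtifactMeta`, its fields, and the individual rules are hypothetical and not taken from any particular policy engine.

```go
package main

import "fmt"

// ArtifactMeta is a hypothetical set of facts known about a build artifact.
type ArtifactMeta struct {
	Signed        bool   // signature verified against a trusted key
	HasProvenance bool   // a build provenance attestation is attached
	CriticalVulns int    // unresolved critical scanner findings
	BuilderID     string // identity of the builder that produced it
}

// Gate is a pure predicate over metadata: same input, same decision.
// The string explains a denial and is empty on success.
type Gate func(ArtifactMeta) (bool, string)

func RequireSigned(m ArtifactMeta) (bool, string) {
	if !m.Signed {
		return false, "artifact is not signed"
	}
	return true, ""
}

func RequireProvenance(m ArtifactMeta) (bool, string) {
	if !m.HasProvenance {
		return false, "artifact has no provenance"
	}
	return true, ""
}

func NoCriticalVulns(m ArtifactMeta) (bool, string) {
	if m.CriticalVulns > 0 {
		return false, "artifact has critical vulnerabilities"
	}
	return true, ""
}

// TrustedBuilder builds a gate that only admits the listed builder IDs.
func TrustedBuilder(allowed ...string) Gate {
	return func(m ArtifactMeta) (bool, string) {
		for _, id := range allowed {
			if m.BuilderID == id {
				return true, ""
			}
		}
		return false, "builder " + m.BuilderID + " is not trusted"
	}
}

// AllOf combines gates; the first failing gate decides the outcome.
func AllOf(gates ...Gate) Gate {
	return func(m ArtifactMeta) (bool, string) {
		for _, g := range gates {
			if ok, reason := g(m); !ok {
				return false, reason
			}
		}
		return true, ""
	}
}

func main() {
	gate := AllOf(RequireSigned, RequireProvenance, NoCriticalVulns, TrustedBuilder("ci-prod"))
	m := ArtifactMeta{Signed: true, HasProvenance: true, BuilderID: "laptop"}
	ok, reason := gate(m)
	fmt.Printf("allowed=%v reason=%q\n", ok, reason)
}
```

Because each gate is a pure function, it can be unit-tested, diffed, and reviewed like any other code, which is the point of treating policy as code.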
Security properties
- Integrity: invalid transitions are rejected (and detectable).
- Evidence: critical actions emit verifiable audit events.
- Authenticity: actions are bound to identity and purpose.
- Replay resistance: duplicated inputs do not change outcomes.
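Replay resistance is often implemented by recording a caller-supplied request ID and returning the recorded result on repeats. A minimal in-memory sketch, assuming a hypothetical `Ledger` type; a real system would persist the seen set and bound or expire it so that it does not itself become unbounded state.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger is a hypothetical state holder that applies each request ID at most once.
type Ledger struct {
	mu      sync.Mutex
	balance int64
	seen    map[string]int64 // request ID -> balance recorded when it was applied
}

func NewLedger() *Ledger {
	return &Ledger{seen: make(map[string]int64)}
}

// Apply adds delta under requestID. A repeated requestID returns the
// previously recorded balance and leaves the state unchanged.
func (l *Ledger) Apply(requestID string, delta int64) (balance int64, replayed bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if prev, ok := l.seen[requestID]; ok {
		return prev, true
	}
	l.balance += delta
	l.seen[requestID] = l.balance
	return l.balance, false
}

func main() {
	l := NewLedger()
	b, replayed := l.Apply("req-1", 10)
	fmt.Println(b, replayed) // 10 false
	b, replayed = l.Apply("req-1", 10)
	fmt.Println(b, replayed) // 10 true
	b, replayed = l.Apply("req-2", 5)
	fmt.Println(b, replayed) // 15 false
}
```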
Failure modes
- Observability gaps during incidents (missing evidence).
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Mixed-version behavior that violates assumptions silently.
- Config drift that weakens security posture over time.
Sampling hides the rare schedule that breaks your invariants.
Design sketch
```mermaid
flowchart LR
    src["Source"] --> build["Build (reproducible)"]
    build --> attest["Attestation"]
    attest --> scan["SAST/DAST/SCA"]
    scan --> deploy["Deploy (policy gates)"]
    deploy --> runtime["Runtime Policy + Observability"]
```
Implementation notes
The pipeline is production: it has credentials, network reach, and authority.
If you can’t explain a timeout outcome, you can’t make retries safe.
```go
// Treat CI as untrusted: keep tokens short-lived and scoped.
type Token struct {
	Value         string
	ExpiresAtUnix int64
	Scope         string
}
```
Verification strategy
- Pipeline attack simulations: compromise a runner and measure blast radius.
- Policy tests: unit tests for policy-as-code rules.
- Rollback tests as part of release (not “if needed”).
- Runtime conformance: detect drift between desired and actual state.
- Dependency tampering drills: lockfile changes, integrity failures.
Operational notes
- Audit who can ship and how; remove implicit paths.
- Rehearse incident response for the pipeline itself.
- Keep a provenance trail for every artifact deployed to production.
- Treat policy changes as security-sensitive deploys (review + rollout).
- Continuously scan and inventory dependencies; prioritize by exposure.
Keep audit and config history queryable during incidents—evidence beats intuition.
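One way to make audit history verifiable rather than merely stored is to chain each event to the hash of its predecessor, so in-place edits become detectable. This is a minimal sketch with a hypothetical `AuditEvent` type; on its own a hash chain does not detect truncation or a full rewrite unless the latest hash is also anchored somewhere the writer cannot modify.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// AuditEvent is a hypothetical audit record linked to its predecessor by hash.
type AuditEvent struct {
	Actor    string
	Action   string
	PrevHash string // hash of the previous event; empty for the first event
	Hash     string // hash over PrevHash, Actor, and Action
}

// hashEvent length-prefixes each field so that ("ab","c") and ("a","bc")
// cannot produce the same hash input.
func hashEvent(prev, actor, action string) string {
	h := sha256.New()
	for _, f := range []string{prev, actor, action} {
		fmt.Fprintf(h, "%d:%s", len(f), f)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Append returns the log with a new event chained to the last one.
func Append(log []AuditEvent, actor, action string) []AuditEvent {
	prev := ""
	if len(log) > 0 {
		prev = log[len(log)-1].Hash
	}
	return append(log, AuditEvent{
		Actor:    actor,
		Action:   action,
		PrevHash: prev,
		Hash:     hashEvent(prev, actor, action),
	})
}

// Verify returns the index of the first event that breaks the chain, or -1.
func Verify(log []AuditEvent) int {
	prev := ""
	for i, e := range log {
		if e.PrevHash != prev || e.Hash != hashEvent(e.PrevHash, e.Actor, e.Action) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var log []AuditEvent
	log = Append(log, "alice", "approve release")
	log = Append(log, "ci", "deploy to canary")
	log = Append(log, "bob", "promote to prod")
	fmt.Println(Verify(log)) // -1

	log[1].Action = "deploy to prod" // simulate an in-place edit
	fmt.Println(Verify(log))         // 1
}
```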
What to monitor
- Error budget burn + tail latency under load.
- Rollback events and the conditions that triggered them.
- Retry/timeout rates by endpoint and client cohort.
- Invariant violation rate (should be ~0).
- Admission-control / rate-limit rejections (by reason).
Rollback plan
- Keep dual-write / dual-verify windows where appropriate.
- Use canaries and staged rollout; stop early when signals degrade.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Define an explicit rollback trigger (metrics + thresholds).
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Evidence
- Jepsen (1) — Fault injection and correctness testing for distributed systems.
  - Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
- Learn TLA+ (2) — Practical entry point for specification and model checking.
  - Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
Open questions
- How quickly can you revoke all pipeline credentials in an incident?
- Can you answer “what code is running” with cryptographic evidence?
- What is the smallest CI compromise that becomes a prod compromise today?
- Which deploy actions are irreversible and how do you mitigate that?
Checklist
- Safety properties stated as invariants.
- Rollback plan rehearsed and automated.
- Assumptions listed and reviewed.
- Failure modes enumerated with mitigations.
- Telemetry captures correctness signals.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Further reading
- NIST SP 800-218 (SSDF) — Secure software development practices as an engineering framework.
- Sigstore — Signing and verifying artifacts at scale with transparency logs.
- in-toto — Securing the integrity of software supply chains with attestations.
- SLSA v1.0 Specification — Supply-chain levels and provenance requirements.
- Learn TLA+ — Practical entry point for specification and model checking.
- Jepsen — Fault injection and correctness testing for distributed systems.