Monthly research note. Theme: DevSecOps & Resilience Engineering.
TL;DR
Treat rate limiting and load shedding (protecting reliability SLOs) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- Provenance is a cryptographic statement; ship evidence with artifacts.
- Policy-as-code needs tests, rollout, and rollback like any other production system.
- Short-lived credentials (OIDC) beat long-lived tokens in pipelines.
- Write assumptions down; treat them as interfaces.
- Design rollbacks as part of the happy path.
Why this matters
- Reproducibility is how you know what you shipped is what you built.
- Policy drift is the default; guardrails must be automated and enforced.
- Secrets in CI turn “one compromised job” into “full compromise.”
- Infrastructure-as-code without policy is just scripting the attack surface.
Key questions
- How do you manage secrets without long-lived credentials in CI?
- Which signals prove correctness (not just availability) in production?
- What is the minimum set of humans who can ship to production?
- How do you rehearse incident response as code (runbooks, chaos, drills)?
- Where do you enforce policy (pre-merge, build, deploy, runtime)?
- How do you prevent “break glass” from becoming the standard path?
Assumptions
- Rollbacks must be executed under time pressure.
- Observability pipelines can be attacked (log injection, PII leaks).
- Policy enforcement must be consistent across environments.
- CI runners are exposed to untrusted code (PRs, dependencies).
Non-goals
- Manual policy enforcement or manual security review as the only control.
- Long-lived credentials embedded in pipelines.
Parsing is an attacker-controlled interface—validate early and fail fast.
Model & invariants
A policy gate is a predicate over metadata: given what is known about an artifact, it returns allow or deny.
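One way to make that concrete is sketched below in Go. The `ArtifactMeta` fields, the gate names, and the `ci-prod` builder ID are all hypothetical, chosen only for illustration; a real gate would evaluate whatever metadata your pipeline actually records.

```go
package main

import (
	"errors"
	"fmt"
)

// ArtifactMeta is a hypothetical metadata record; the fields are illustrative.
type ArtifactMeta struct {
	BuilderID     string // identity of the system that built the artifact
	Signed        bool   // a signature was verified upstream
	HasProvenance bool   // a provenance attestation is attached
}

// Gate is a predicate over metadata: nil means allow, an error explains a deny.
type Gate func(ArtifactMeta) error

// AllOf composes gates; the first denial wins.
func AllOf(gates ...Gate) Gate {
	return func(m ArtifactMeta) error {
		for _, g := range gates {
			if err := g(m); err != nil {
				return err
			}
		}
		return nil
	}
}

// RequireSigned denies artifacts without a verified signature.
func RequireSigned(m ArtifactMeta) error {
	if !m.Signed {
		return errors.New("artifact is not signed")
	}
	return nil
}

// RequireProvenance denies artifacts without an attached attestation.
func RequireProvenance(m ArtifactMeta) error {
	if !m.HasProvenance {
		return errors.New("artifact has no provenance attestation")
	}
	return nil
}

// TrustedBuilder denies artifacts built by anything outside the allow-list.
func TrustedBuilder(allowed ...string) Gate {
	set := make(map[string]bool, len(allowed))
	for _, id := range allowed {
		set[id] = true
	}
	return func(m ArtifactMeta) error {
		if !set[m.BuilderID] {
			return fmt.Errorf("builder %q is not trusted", m.BuilderID)
		}
		return nil
	}
}

func main() {
	gate := AllOf(RequireSigned, RequireProvenance, TrustedBuilder("ci-prod"))
	fmt.Println(gate(ArtifactMeta{BuilderID: "ci-prod", Signed: true, HasProvenance: true}))
	fmt.Println(gate(ArtifactMeta{BuilderID: "laptop", Signed: true, HasProvenance: true}))
}
```

Returning an error rather than a bare boolean keeps the reason for a denial available for audit events.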
Policy should be code with diffs and reviews—guardrails, not guidelines.
Make provenance verifiable: “what built this” must be cryptographically bound.
Invariants must be checkable from evidence you actually have (state + logs + counters).
Security properties
- Least authority: privileges are scoped by purpose and time.
- Authenticity: actions are bound to identity and purpose.
- Replay resistance: duplicated inputs do not change outcomes.
- Evidence: critical actions emit verifiable audit events.
Failure modes
- Config drift that weakens security posture over time.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Timeout ambiguity causing double-apply or partial state transitions.
- Mixed-version behavior that violates assumptions silently.
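The double-apply failure mode above is commonly mitigated by making operations idempotent. The following is a minimal in-memory sketch; the `Ledger` type and the caller-supplied operation ID are hypothetical, and a production version would need durable storage and an expiry policy for recorded IDs.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger applies each operation at most once, keyed by a caller-supplied ID.
type Ledger struct {
	mu      sync.Mutex
	balance int
	applied map[string]int // operation ID -> balance recorded after that operation
}

func NewLedger() *Ledger {
	return &Ledger{applied: make(map[string]int)}
}

// Apply adds delta once per opID. A retry with the same opID changes nothing
// and returns the originally recorded result with replayed set to true.
func (l *Ledger) Apply(opID string, delta int) (balance int, replayed bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if prev, ok := l.applied[opID]; ok {
		return prev, true
	}
	l.balance += delta
	l.applied[opID] = l.balance
	return l.balance, false
}

func main() {
	l := NewLedger()
	fmt.Println(l.Apply("op-1", 10)) // first attempt
	fmt.Println(l.Apply("op-1", 10)) // retry after an ambiguous timeout
	fmt.Println(l.Apply("op-2", 5))
}
```

With this shape, a client that times out can retry safely: the outcome is the same whether or not the first attempt landed.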
Sampling hides the rare schedule that breaks your invariants.
Design sketch
flowchart LR
src["Source"] --> build["Build (reproducible)"]
build --> attest["Attestation"]
attest --> scan["SAST/DAST/SCA"]
scan --> deploy["Deploy (policy gates)"]
deploy --> runtime["Runtime Policy + Observability"]
Implementation notes
Build systems that can prove what happened after an incident.
Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
// Treat CI as untrusted: keep tokens short-lived and scoped.
type Token struct {
	Value         string // opaque credential value
	ExpiresAtUnix int64  // expiry time in Unix seconds
	Scope         string // purpose the token is limited to
}
Verification strategy
- Runtime conformance: detect drift between desired and actual state.
- Rollback tests as part of release (not “if needed”).
- Pipeline attack simulations: compromise a runner and measure blast radius.
- Policy tests: unit tests for policy-as-code rules.
- Dependency tampering drills: lockfile changes, integrity failures.
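The runtime conformance item above (drift between desired and actual state) can be sketched as a comparison over key/value settings. The `Drift` type and the flat string map are simplifying assumptions; real desired state is usually structured.

```go
package main

import (
	"fmt"
	"sort"
)

// Drift describes one key whose actual value differs from the desired one.
// An empty Desired means the key is unexpected; an empty Actual means it is missing.
type Drift struct {
	Key     string
	Desired string
	Actual  string
}

// DetectDrift reports changed, missing, and unexpected keys, sorted by key.
func DetectDrift(desired, actual map[string]string) []Drift {
	var out []Drift
	for k, want := range desired {
		if got, ok := actual[k]; !ok || got != want {
			out = append(out, Drift{Key: k, Desired: want, Actual: got})
		}
	}
	for k, got := range actual {
		if _, ok := desired[k]; !ok {
			out = append(out, Drift{Key: k, Actual: got})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	desired := map[string]string{"tls_min": "1.3", "mfa": "on"}
	actual := map[string]string{"tls_min": "1.2", "mfa": "on", "debug": "true"}
	for _, d := range DetectDrift(desired, actual) {
		fmt.Printf("%s: desired=%q actual=%q\n", d.Key, d.Desired, d.Actual)
	}
}
```

Sorting the result keeps reports stable across runs, which makes them easier to diff and alert on.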
Operational notes
- Rehearse incident response for the pipeline itself.
- Continuously scan and inventory dependencies; prioritize by exposure.
- Treat policy changes as security-sensitive deploys (review + rollout).
- Audit who can ship and how; remove implicit paths.
- Keep a provenance trail for every artifact deployed to production.
Keep audit and config history queryable during incidents—evidence beats intuition.
What to monitor
- Retry/timeout rates by endpoint and client cohort.
- Admission-control / rate-limit rejections (by reason).
- Rollback events and the conditions that triggered them.
- Error budget burn + tail latency under load.
- Invariant violation rate (should be ~0).
Rollback plan
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Define an explicit rollback trigger (metrics + thresholds).
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Use canaries and staged rollout; stop early when signals degrade.
- Keep dual-write / dual-verify windows where appropriate.
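An explicit rollback trigger (metrics plus thresholds, as listed above) might look like the following sketch. The `Signals` and `Thresholds` fields and the example values are hypothetical; choose signals and limits from your own SLOs.

```go
package main

import "fmt"

// Signals are canary measurements over one evaluation window (illustrative).
type Signals struct {
	ErrorRate           float64 // fraction of failed requests, 0..1
	P99LatencyMs        float64
	InvariantViolations int
}

// Thresholds are the written-down limits that trigger a rollback.
type Thresholds struct {
	MaxErrorRate    float64
	MaxP99LatencyMs float64
}

// ShouldRollback returns every reason to roll back; an empty result means continue.
func ShouldRollback(s Signals, t Thresholds) []string {
	var reasons []string
	if s.InvariantViolations > 0 {
		reasons = append(reasons, fmt.Sprintf("%d invariant violations", s.InvariantViolations))
	}
	if s.ErrorRate > t.MaxErrorRate {
		reasons = append(reasons, fmt.Sprintf("error rate %.3f exceeds %.3f", s.ErrorRate, t.MaxErrorRate))
	}
	if s.P99LatencyMs > t.MaxP99LatencyMs {
		reasons = append(reasons, fmt.Sprintf("p99 latency %.0fms exceeds %.0fms", s.P99LatencyMs, t.MaxP99LatencyMs))
	}
	return reasons
}

func main() {
	limits := Thresholds{MaxErrorRate: 0.01, MaxP99LatencyMs: 500}
	canary := Signals{ErrorRate: 0.04, P99LatencyMs: 320}
	if reasons := ShouldRollback(canary, limits); len(reasons) > 0 {
		fmt.Println("rollback:", reasons)
	}
}
```

Returning the reasons, not just a boolean, gives the rollback event something concrete to record.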
Evidence
- Jepsen (1) — Fault injection and correctness testing for distributed systems.
  - Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
- Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
  - Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Open questions
- How quickly can you revoke all pipeline credentials in an incident?
- Which deploy actions are irreversible and how do you mitigate that?
- What is the smallest CI compromise that becomes a prod compromise today?
- Can you answer “what code is running” with cryptographic evidence?
Checklist
- Assumptions listed and reviewed.
- Safety properties stated as invariants.
- Telemetry captures correctness signals.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Rollback plan rehearsed and automated.
- Failure modes enumerated with mitigations.
Further reading
- in-toto — Securing the integrity of software supply chains with attestations.
- NIST SP 800-218 (SSDF) — Secure software development practices as an engineering framework.
- SLSA v1.0 Specification — Supply-chain levels and provenance requirements.
- Sigstore — Signing and verifying artifacts at scale with transparency logs.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Jepsen — Fault injection and correctness testing for distributed systems.