Monthly research note. Theme: DevSecOps & Resilience Engineering.
TL;DR
Treat Kubernetes hardening (RBAC, NetworkPolicy, and Pod Security) as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
Correctness is cheaper to enforce at interfaces than to repair in production data.
Key takeaways
- Short-lived credentials (OIDC) beat long-lived tokens in pipelines.
- Policy-as-code needs tests, rollout, and rollback like any other production system.
- Provenance is a cryptographic statement; ship evidence with artifacts.
- Bind security decisions to evidence (audit, invariants, telemetry).
- Make failure modes explicit and observable.
Why this matters
- Supply-chain attacks target your CI/CD because it has keys and reach.
- Policy drift is the default; guardrails must be automated and enforced.
- Rollouts are where incidents happen; safe rollback is a security feature.
- Runtime security needs evidence pipelines, not just dashboards.
Key questions
- How do you prevent “break glass” from becoming the standard path?
- How do you rehearse incident response as code (runbooks, chaos, drills)?
- How do you manage secrets without long-lived credentials in CI?
- Where do you enforce policy (pre-merge, build, deploy, runtime)?
- How do you do safe rollouts (canary, blast-radius, rapid rollback)?
- What is your supply-chain threat model (dependency poisoning, CI compromise)?
Assumptions
- CI runners are exposed to untrusted code (PRs, dependencies).
- Policy enforcement must be consistent across environments.
- Dependencies can be compromised upstream (typosquatting, maintainer takeover).
- Rollbacks must be executed under time pressure.
Non-goals
- Long-lived credentials embedded in pipelines.
- Assuming deploy equals success without runtime evidence.
Any unbounded work per request becomes a DoS primitive in the hands of an adversary.
Model & invariants
Build provenance is a cryptographic statement: “what built this” must be cryptographically bound to the artifact and independently verifiable.
Policy should be code with diffs and reviews—guardrails, not guidelines.
If the system can enter an invalid state, it eventually will—usually during an incident.
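One way to make the "cryptographically bound" requirement concrete is to check that an artifact's digest matches the digest recorded in its provenance statement. The following is a minimal sketch, assuming a simplified in-toto-style statement with a `subject` list; the function names are hypothetical, and verifying the signature over the statement itself (for example with Sigstore) is out of scope here.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def artifact_matches_provenance(artifact: bytes, statement: dict) -> bool:
    """Check that the artifact digest appears among the statement's subjects.

    `statement` is a simplified, hypothetical stand-in for an in-toto-style
    statement: {"subject": [{"name": ..., "digest": {"sha256": ...}}]}.
    The signature over the statement is assumed to be verified elsewhere.
    """
    actual = sha256_hex(artifact)
    for subject in statement.get("subject", []):
        expected = subject.get("digest", {}).get("sha256", "")
        # compare_digest avoids leaking the match position through timing.
        if hmac.compare_digest(actual, expected):
            return True
    return False


artifact = b"example build output"
statement = {
    "subject": [{"name": "app.tar", "digest": {"sha256": sha256_hex(artifact)}}]
}
print(artifact_matches_provenance(artifact, statement))     # True
print(artifact_matches_provenance(b"tampered", statement))  # False
```

A digest match alone proves nothing if the statement can be forged, so in practice this check only has value after the statement's signature and signer identity have been verified.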
Security properties
- Downgrade resistance: negotiation can’t silently weaken security posture.
- Evidence: critical actions emit verifiable audit events.
- Least authority: privileges are scoped by purpose and time.
- Replay resistance: duplicated inputs do not change outcomes.
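Replay resistance from the list above can be illustrated with idempotency keys: each request carries a caller-supplied ID, and a repeated ID returns the recorded result instead of being applied again. This is a minimal in-memory sketch; the class and method names are hypothetical, and a real system would persist the recorded results durably and atomically with the state change.

```python
class IdempotentLedger:
    """Applies each request at most once, keyed by a caller-supplied ID."""

    def __init__(self) -> None:
        self.balance = 0
        self._results = {}  # request_id -> balance returned for that request

    def credit(self, request_id: str, amount: int) -> int:
        # A replayed request returns the recorded result without re-applying.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance


ledger = IdempotentLedger()
print(ledger.credit("req-1", 10))  # 10
print(ledger.credit("req-1", 10))  # 10 (duplicate, not applied again)
print(ledger.credit("req-2", 5))   # 15
```

The same mechanism lets a client retry safely after a timeout, because it cannot tell whether the first attempt was applied.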
Failure modes
- Timeout ambiguity causing double-apply or partial state transitions.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Mixed-version behavior that violates assumptions silently.
- Config drift that weakens security posture over time.
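Config drift becomes observable once desired and actual state are compared mechanically. The sketch below diffs two flat dictionaries; the keys are illustrative security settings, and the `ABSENT` marker and `find_drift` name are hypothetical. It assumes the actual state has already been collected from the running system.

```python
ABSENT = "<absent>"  # marker for a key that exists on only one side


def find_drift(desired: dict, actual: dict) -> dict:
    """Return {key: (desired_value, actual_value)} for every mismatch."""
    drift = {}
    for key in desired.keys() | actual.keys():
        want = desired.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


desired = {"runAsNonRoot": True, "allowPrivilegeEscalation": False}
actual = {"runAsNonRoot": True, "allowPrivilegeEscalation": True, "hostNetwork": True}

for key, (want, have) in sorted(find_drift(desired, actual).items()):
    print(f"{key}: desired={want!r} actual={have!r}")
```

Reporting keys that exist only in the actual state matters as much as reporting changed values, since settings added out of band are a common form of drift.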
Caches tend to become sources of truth unless you can recompute and validate them.
Design sketch
flowchart TD
pr["PR"] --> checks["Checks"]
checks --> merge["Merge"]
merge --> release["Release"]
release --> canary["Canary"]
canary --> prod["Prod"]
prod --> rollback["Rollback Plan"]
Implementation notes
Build systems that can prove what happened after an incident.
Make rollbacks boring: if rollback is a hero move, it will fail.
CI hardening checklist:
- No long-lived secrets in CI
- OIDC to obtain short-lived creds
- Pin dependencies and verify integrity
- Reproducible builds + provenance attestation
- Policy-as-code gates (deploy blocked on evidence)
Verification strategy
- Policy tests: unit tests for policy-as-code rules.
- Rollback tests as part of release (not “if needed”).
- Dependency tampering drills: lockfile changes, integrity failures.
- Pipeline attack simulations: compromise a runner and measure blast radius.
- Runtime conformance: detect drift between desired and actual state.
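Policy tests, the first item in the list above, can start very small. The following sketch expresses one rule as a plain function over a simplified pod manifest and tests it with ordinary assertions. The `securityContext` field names follow the Kubernetes pod spec, but the rule itself, including treating a missing `allowPrivilegeEscalation` as a violation, is an assumption of this sketch and not a statement of any particular policy engine's behavior.

```python
def pod_violations(pod: dict) -> list:
    """Return human-readable violations for a simplified pod manifest."""
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})
        if ctx.get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")
        # Sketch's choice: the field must be set to false explicitly.
        if ctx.get("allowPrivilegeEscalation", True):
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
    return violations


def test_hardened_pod_passes():
    ctx = {"privileged": False, "allowPrivilegeEscalation": False}
    pod = {"spec": {"containers": [{"name": "app", "securityContext": ctx}]}}
    assert pod_violations(pod) == []


def test_privileged_pod_is_rejected():
    ctx = {"privileged": True}
    pod = {"spec": {"containers": [{"name": "app", "securityContext": ctx}]}}
    assert len(pod_violations(pod)) == 2


# Run directly; a test runner such as pytest would also discover these.
test_hardened_pod_passes()
test_privileged_pod_is_rejected()
print("policy tests passed")
```

Keeping each rule a pure function of the manifest makes it cheap to test both the accept and the reject path before the rule is rolled out.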
Operational notes
- Treat policy changes as security-sensitive deploys (review + rollout).
- Continuously scan and inventory dependencies; prioritize by exposure.
- Audit who can ship and how; remove implicit paths.
- Rehearse incident response for the pipeline itself.
- Keep a provenance trail for every artifact deployed to production.
Keep audit and config history queryable during incidents—evidence beats intuition.
What to monitor
- Admission-control / rate-limit rejections (by reason).
- Error budget burn + tail latency under load.
- Authz failures and policy denials (unexpected spikes).
- Retry/timeout rates by endpoint and client cohort.
- Rollback events and the conditions that triggered them.
Rollback plan
- Keep dual-write / dual-verify windows where appropriate.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Define an explicit rollback trigger (metrics + thresholds).
- Use canaries and staged rollout; stop early when signals degrade.
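An explicit rollback trigger, as called for in the list above, can be written down as data plus a pure decision function. This is a minimal sketch: the threshold names and values are hypothetical and would normally be derived from your SLOs, and the canary metrics are assumed to come from an existing monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    """Hypothetical thresholds; real values should come from your SLOs."""
    max_error_rate: float      # fraction of failed requests, e.g. 0.02
    max_p99_latency_ms: float  # tail-latency ceiling for the canary
    min_requests: int          # do not decide on too little traffic


def should_roll_back(trigger: RollbackTrigger, requests: int, errors: int,
                     p99_latency_ms: float) -> tuple:
    """Return (decision, reason) for one observation window of the canary."""
    # max(..., 1) also guards the division below against zero requests.
    if requests < max(trigger.min_requests, 1):
        return False, "insufficient traffic"
    error_rate = errors / requests
    if error_rate > trigger.max_error_rate:
        return True, "error rate above threshold"
    if p99_latency_ms > trigger.max_p99_latency_ms:
        return True, "p99 latency above threshold"
    return False, "within thresholds"


trigger = RollbackTrigger(max_error_rate=0.02, max_p99_latency_ms=500.0,
                          min_requests=100)
print(should_roll_back(trigger, requests=1000, errors=50, p99_latency_ms=120.0))
print(should_roll_back(trigger, requests=1000, errors=5, p99_latency_ms=120.0))
```

Returning the reason alongside the decision means every automated rollback leaves a record of the condition that triggered it.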
Evidence
- Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
  - Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
- Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
  - Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Open questions
- What is the smallest CI compromise that becomes a prod compromise today?
- How quickly can you revoke all pipeline credentials in an incident?
- Which deploy actions are irreversible and how do you mitigate that?
- Can you answer “what code is running” with cryptographic evidence?
Checklist
- Failure modes enumerated with mitigations.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Assumptions listed and reviewed.
- Safety properties stated as invariants.
- Telemetry captures correctness signals.
- Rollback plan rehearsed and automated.
Further reading
- SLSA v1.0 Specification — Supply-chain levels and provenance requirements.
- in-toto — Securing the integrity of software supply chains with attestations.
- Sigstore — Signing and verifying artifacts at scale with transparency logs.
- NIST SP 800-218 (SSDF) — Secure software development practices as an engineering framework.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.