Runtime Security: eBPF, Policy, and Drift Detection

Monthly research note. Theme: DevSecOps & Resilience Engineering.

TL;DR

This memo covers runtime security (eBPF, policy, and drift detection) in three steps: define the model, state the properties, then design the system so those properties hold under failure and adversarial pressure.

Key insight

Treat a timeout as a third outcome: neither success nor failure, but ambiguity you must model explicitly.
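One way to make the third outcome concrete is to give it its own value instead of folding it into failure. The sketch below is illustrative only: `Outcome`, `classify`, and the three sample callables are hypothetical names, and it assumes the client surfaces a missed deadline as Python's built-in `TimeoutError`.

```python
import enum


class Outcome(enum.Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"  # timed out: the operation may or may not have applied


def classify(call):
    """Run `call` and map what happened onto one of three outcomes."""
    try:
        call()
        return Outcome.SUCCESS
    except TimeoutError:
        # Checked before the generic handler: a timeout is not a rejection.
        return Outcome.UNKNOWN
    except Exception:
        return Outcome.FAILURE


# Hypothetical stand-ins for a remote call.
def applied():
    return None


def rejected():
    raise ValueError("rejected by server")


def deadline_missed():
    raise TimeoutError("no response before deadline")
```

A caller that receives `Outcome.UNKNOWN` would then reconcile (read back the state, or retry with an idempotency key) rather than assume either result.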

Key takeaways

Make rollback a first-class operation with explicit triggers and rehearsal.
Treat CI/CD as attacker-controlled until proven otherwise; minimize secrets and privileges.
Policy-as-code needs tests, rollout, and rollback like any other production system.
Make failure modes explicit and observable.
Measure correctness signals, not only latency/throughput.

Why this matters

Rollouts are where incidents happen; safe rollback is a security feature.
Reproducibility is how you know what you shipped is what you built.
Runtime security needs evidence pipelines, not just dashboards.
Secrets in CI turn “one compromised job” into “full compromise.”

Key questions

Which signals prove correctness (not just availability) in production?
Where do you enforce policy (pre-merge, build, deploy, runtime)?
How do you do safe rollouts (canary, blast-radius, rapid rollback)?
How do you manage secrets without long-lived credentials in CI?
How do you rehearse incident response as code (runbooks, chaos, drills)?
How do you prevent “break glass” from becoming the standard path?

Assumptions

Rollbacks must be executed under time pressure.
Policy enforcement must be consistent across environments.
Dependencies can be compromised upstream (typosquatting, maintainer takeover).
CI runners are exposed to untrusted code (PRs, dependencies).

Non-goals

Trusting CI environments by default.
Long-lived credentials embedded in pipelines.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
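A minimal sketch of bounding per-request work, assuming the work can be counted in items. `process_bounded`, `BudgetExceeded`, and the default limit are hypothetical, and the doubling step only stands in for real per-item work.

```python
class BudgetExceeded(Exception):
    """Raised when a request asks for more work than its budget allows."""


def process_bounded(items, max_items=1000):
    """Process at most `max_items` items and reject the request beyond that."""
    results = []
    for count, item in enumerate(items, start=1):
        if count > max_items:
            # Fail closed: stop consuming input as soon as the budget is spent.
            raise BudgetExceeded(f"request exceeded {max_items} items")
        results.append(item * 2)  # placeholder for real per-item work
    return results
```

Because `enumerate` is lazy, the check also holds for streamed or unbounded inputs: the loop stops after `max_items + 1` reads.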

Model & invariants

Build provenance is a cryptographic statement:

\mathrm{attest} \leftarrow \mathrm{Sign}_{k_\text{build}}(\mathrm{hash}(\text{artifact})\ \Vert\ \text{metadata}).

Make provenance verifiable: “what built this” must be cryptographically bound.
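The statement above can be sketched with the standard library alone. This is an assumption-laden illustration, not a production scheme: it substitutes HMAC-SHA256 (a symmetric MAC) for the signature operation, whereas a real pipeline would more likely use an asymmetric signature so that verifiers never hold the signing key. The function names and the metadata fields are hypothetical.

```python
import hashlib
import hmac
import json


def attest(artifact: bytes, metadata: dict, build_key: bytes) -> bytes:
    """Compute a MAC over hash(artifact) || metadata."""
    digest = hashlib.sha256(artifact).digest()  # fixed length, so the join is unambiguous
    # Canonical JSON so the same metadata always serializes to the same bytes.
    meta = json.dumps(metadata, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(build_key, digest + meta, hashlib.sha256).digest()


def verify(artifact: bytes, metadata: dict, build_key: bytes, attestation: bytes) -> bool:
    """Recompute the attestation and compare in constant time."""
    expected = attest(artifact, metadata, build_key)
    return hmac.compare_digest(expected, attestation)


# Hypothetical example values.
key = b"demo-build-key"
meta = {"builder": "ci-runner-1", "commit": "abc123"}
tag = attest(b"binary-bytes", meta, key)
```

Changing either the artifact bytes or any metadata field causes `verify` to return `False`, which is the binding the model asks for.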

Treat CI as attacker-controlled until proven otherwise; minimize secrets and privileges.

Invariant

If the system can enter an invalid state, it eventually will, usually during an incident.

Security properties

Least authority: privileges are scoped by purpose and time.
Integrity: invalid transitions are rejected (and detectable).
Replay resistance: duplicated inputs do not change outcomes.
Evidence: critical actions emit verifiable audit events.
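Replay resistance is often implemented with idempotency keys. The sketch below assumes each request carries a caller-chosen unique `request_id`; `IdempotentLedger` is a hypothetical in-memory example, and a real system would need to persist the seen-set atomically with the state change.

```python
class IdempotentLedger:
    """Applies each request at most once, keyed by request ID."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # request_id -> result returned the first time

    def apply(self, request_id: str, amount: int) -> int:
        if request_id in self._seen:
            # Duplicate delivery: return the recorded result, change nothing.
            return self._seen[request_id]
        self.balance += amount
        self._seen[request_id] = self.balance
        return self.balance
```

Retrying after an ambiguous timeout is then safe: the second delivery of the same `request_id` returns the first result without applying the change again.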

Failure modes

Recovery paths that only work when nothing is broken.
Mixed-version behavior that violates assumptions silently.
Observability gaps during incidents (missing evidence).
Timeout ambiguity causing double-apply or partial state transitions.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.
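A small sketch of the "recompute and validate" half of that sentence, assuming every cached value can be re-derived from an authoritative source. `audit_cache` and its arguments are hypothetical names.

```python
def audit_cache(cache: dict, source: dict, derive) -> list:
    """Return the keys whose cached value disagrees with a fresh recomputation.

    A key that is cached but absent from the source also counts as a mismatch.
    """
    return sorted(
        key
        for key, cached in cache.items()
        if key not in source or derive(source[key]) != cached
    )
```

Running such an audit on a sample of keys gives a measurable signal that the cache is still a cache and not the only copy of the truth.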

Design sketch

flowchart LR
  src["Source"] --> build["Build (reproducible)"]
  build --> attest["Attestation"]
  attest --> scan["SAST/DAST/SCA"]
  scan --> deploy["Deploy (policy gates)"]
  deploy --> runtime["Runtime Policy + Observability"]

Implementation notes

The pipeline is production: it has credentials, network reach, and authority.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

CI hardening checklist:
- No long-lived secrets in CI
- OIDC to obtain short-lived creds
- Pin dependencies and verify integrity
- Reproducible builds + provenance attestation
- Policy-as-code gates (deploy blocked on evidence)
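The last checklist item, a gate that blocks deploys when evidence is missing, might look like this in miniature. The `Evidence` fields and denial messages are assumptions chosen to mirror the checklist, not the schema of any particular policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence bundle attached to a deploy request."""
    provenance_verified: bool
    dependencies_pinned: bool
    uses_long_lived_secret: bool


def deploy_denials(evidence: Evidence) -> list:
    """Return the reasons to block a deploy; an empty list means allow."""
    denials = []
    if not evidence.provenance_verified:
        denials.append("provenance attestation missing or unverified")
    if not evidence.dependencies_pinned:
        denials.append("dependencies not pinned")
    if evidence.uses_long_lived_secret:
        denials.append("long-lived secret present in pipeline")
    return denials
```

Returning reasons rather than a bare boolean keeps denials observable, and the function is small enough to unit-test like any other production code.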

Verification strategy

Dependency tampering drills: lockfile changes, integrity failures.
Policy tests: unit tests for policy-as-code rules.
Pipeline attack simulations: compromise a runner and measure blast radius.
Runtime conformance: detect drift between desired and actual state.
Rollback tests as part of release (not “if needed”).
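Runtime conformance can be sketched as a diff between two maps, assuming both desired and actual state can be flattened to key/value pairs. `detect_drift`, the `MISSING` sentinel, and the sample keys are hypothetical.

```python
MISSING = object()  # sentinel: the key is absent on that side


def detect_drift(desired: dict, actual: dict) -> dict:
    """Map each drifting key to a (desired, actual) pair."""
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: one changed value, one undeclared setting.
desired_state = {"image": "app@sha256:aaa", "replicas": 3}
actual_state = {"image": "app@sha256:bbb", "replicas": 3, "debug_port": 9229}
report = detect_drift(desired_state, actual_state)
```

Here `report` flags `image` (value changed) and `debug_port` (present at runtime but never declared); `replicas` matches and is omitted.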

Operational notes

Treat policy changes as security-sensitive deploys (review + rollout).
Keep a provenance trail for every artifact deployed to production.
Audit who can ship and how; remove implicit paths.
Continuously scan and inventory dependencies; prioritize by exposure.
Rehearse incident response for the pipeline itself.

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Admission-control / rate-limit rejections (by reason).
Retry/timeout rates by endpoint and client cohort.
Authz failures and policy denials (unexpected spikes).
Rollback events and the conditions that triggered them.
Invariant violation rate (should be ~0).

Rollback plan

Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).
Keep dual-write / dual-verify windows where appropriate.
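An explicit trigger can be expressed as a function of metrics and thresholds. The sketch below uses error-budget burn rate, defined as the observed error rate divided by the budget (1 minus the SLO). The SLO, burn threshold, and minimum sample size are illustrative assumptions, not recommended values.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Observed error rate divided by the error budget (1 - slo)."""
    if requests == 0:
        return 0.0
    budget = 1.0 - slo
    return (errors / requests) / budget


def should_rollback(errors: int, requests: int, slo: float = 0.999,
                    max_burn: float = 10.0, min_requests: int = 100) -> bool:
    """Trigger rollback when the canary burns budget faster than `max_burn`."""
    if requests < min_requests:
        return False  # too few samples to judge
    return burn_rate(errors, requests, slo) >= max_burn
```

For example, 20 errors in 1000 requests against a 99.9% SLO is an error rate of 2% against a 0.1% budget, a burn rate of about 20, which exceeds the assumed threshold of 10.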

Evidence

Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Jepsen (2) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

How quickly can you revoke all pipeline credentials in an incident?
Can you answer “what code is running” with cryptographic evidence?
What is the smallest CI compromise that becomes a prod compromise today?
Which deploy actions are irreversible and how do you mitigate that?

Checklist

Failure modes enumerated with mitigations.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Assumptions listed and reviewed.
Telemetry captures correctness signals.
Rollback plan rehearsed and automated.
Safety properties stated as invariants.
