Backup/Restore as a Protocol: RPO/RTO with Adversaries

Monthly research note. Theme: DevSecOps & Resilience Engineering.

TL;DR

Treat backup/restore as a protocol with adversaries in scope: define the model, state the properties, then design the system so those properties remain true under failure and attack.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Provenance is a cryptographic statement; ship evidence with artifacts.
Treat CI/CD as attacker-controlled until proven otherwise; minimize secrets and privileges.
Short-lived credentials (OIDC) beat long-lived tokens in pipelines.
Write assumptions down; treat them as interfaces.
Measure correctness signals, not only latency/throughput.

Why this matters

Rollouts are where incidents happen; safe rollback is a security feature.
Supply-chain attacks target your CI/CD because it has keys and reach.
Runtime security needs evidence pipelines, not just dashboards.
Secrets in CI turn “one compromised job” into “full compromise.”

Key questions

How do you prevent “break glass” from becoming the standard path?
Which signals prove correctness (not just availability) in production?
How do you rehearse incident response as code (runbooks, chaos, drills)?
How do you manage secrets without long-lived credentials in CI?
What is your supply-chain threat model (dependency poisoning, CI compromise)?
Where do you enforce policy (pre-merge, build, deploy, runtime)?

Assumptions

Rollbacks must be executed under time pressure.
Dependencies can be compromised upstream (typosquatting, maintainer takeover).
CI runners are exposed to untrusted code (PRs, dependencies).
Policy enforcement must be consistent across environments.

Non-goals

Treating a completed deploy as success without runtime evidence.
Relying on manual policy enforcement or manual security review as the only control.

Attack surface

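Observability pipelines can be attacked (cardinality explosions, log injection). Protect them like any other input path: cap label cardinality and escape untrusted fields before they reach logs.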

Model & invariants

A policy gate is a predicate over metadata:

$$\mathrm{allow}(\text{deploy}) \Leftrightarrow P(\text{attestation},\ \text{scan},\ \text{env}).$$
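
The predicate above can be sketched as a pure function over deploy metadata. This is a minimal illustration, not a real policy engine: the field names (`attestation_verified`, `critical_findings`, `env`) and the per-environment limits are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    # All field names are illustrative assumptions, not a real schema.
    attestation_verified: bool  # provenance signature was checked upstream
    critical_findings: int      # count of critical issues from the scan
    env: str                    # target environment name


# Hypothetical per-environment limit on critical scan findings.
MAX_CRITICAL = {"staging": 5, "prod": 0}


def allow_deploy(req: DeployRequest) -> bool:
    """P(attestation, scan, env): every condition must hold."""
    if req.env not in MAX_CRITICAL:
        return False  # unknown environment: fail closed
    if not req.attestation_verified:
        return False
    return req.critical_findings <= MAX_CRITICAL[req.env]
```

Keeping the gate a pure function of its inputs makes it straightforward to unit test and to replay against recorded metadata during an incident review.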

Treat CI as attacker-controlled until proven otherwise; minimize secrets and privileges.

Make provenance verifiable: “what built this” must be cryptographically bound.

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
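
A minimal sketch of this invariant, assuming a hypothetical in-memory store: writers carry an integer epoch, and the store rejects any write whose epoch is older than the newest it has seen. No wall-clock value is compared anywhere.

```python
class EpochGuardedStore:
    """Accepts a write only if its epoch is not older than the last one seen."""

    def __init__(self) -> None:
        self.epoch = 0
        self.value = None

    def write(self, epoch: int, value: str) -> bool:
        # Compare integers, never timestamps: a stale writer with a fast
        # clock still carries an old epoch and is rejected.
        if epoch < self.epoch:
            return False
        self.epoch = epoch
        self.value = value
        return True
```

In a real system the epoch would come from a coordination service or a durable counter; here it is passed in by the caller to keep the example self-contained.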

Security properties

Evidence: critical actions emit verifiable audit events.
Authenticity: actions are bound to identity and purpose.
Replay resistance: duplicated inputs do not change outcomes.
Downgrade resistance: negotiation can’t silently weaken security posture.
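
Replay resistance is often implemented by keying each state change on a caller-supplied request ID. The sketch below is one such approach under stated assumptions: the class name, the `request_id` parameter, and the in-memory result table are all hypothetical, and a production version would need durable storage.

```python
class IdempotentApplier:
    """Applies each request ID at most once and replays the stored result."""

    def __init__(self) -> None:
        self.balance = 0
        self._results = {}  # request_id -> balance returned the first time

    def apply(self, request_id: str, amount: int) -> int:
        # A duplicate (replayed or retried) request returns the original
        # result without touching state again.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance
```

The same mechanism helps when a client retries after an ambiguous timeout: the retry carries the same ID, so it cannot apply the change twice.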

Failure modes

Recovery paths that only work when nothing is broken.
Mixed-version behavior that violates assumptions silently.
Config drift that weakens security posture over time.
Timeout ambiguity causing double-apply or partial state transitions.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.
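
One way to keep a cache honest is to audit it against a recomputation from the real source. The helper below is a small sketch; `audit_cache` and its `recompute` callback are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List


def audit_cache(cache: Dict[str, int], recompute: Callable[[str], int]) -> List[str]:
    """Return the keys whose cached value disagrees with the recomputed one."""
    return sorted(k for k, v in cache.items() if recompute(k) != v)
```

If no `recompute` function can be written for a cache, that is a sign the cache has already become the source of truth.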

Design sketch

flowchart TD
  pr["PR"] --> checks["Checks"]
  checks --> merge["Merge"]
  merge --> release["Release"]
  release --> canary["Canary"]
  canary --> prod["Prod"]
  prod --> rollback["Rollback Plan"]

Implementation notes

Prefer short-lived credentials (OIDC) and explicit policy gates.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
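
As a sketch of this rule, the parser below checks the cheapest bound first (raw size), then parses, then validates shape and item count before any expensive work would start. The limits and the `items` field are assumptions chosen for the example.

```python
import json

MAX_BODY_BYTES = 64 * 1024  # hypothetical cap on raw request size
MAX_ITEMS = 100             # hypothetical cap on items per request


class RequestRejected(ValueError):
    """Raised when a request fails a size, syntax, or shape check."""


def parse_request(raw: bytes) -> list:
    # 1. Cap cost before parsing: size is the cheapest check.
    if len(raw) > MAX_BODY_BYTES:
        raise RequestRejected("body too large")
    # 2. Parse; malformed input is rejected, not propagated.
    try:
        doc = json.loads(raw)
    except ValueError as exc:
        raise RequestRejected("malformed JSON") from exc
    # 3. Validate shape and bound the work the caller will do.
    items = doc.get("items") if isinstance(doc, dict) else None
    if not isinstance(items, list):
        raise RequestRejected("missing items list")
    if len(items) > MAX_ITEMS:
        raise RequestRejected("too many items")
    return items
```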

CI hardening checklist:
- No long-lived secrets in CI
- OIDC to obtain short-lived creds
- Pin dependencies and verify integrity
- Reproducible builds + provenance attestation
- Policy-as-code gates (deploy blocked on evidence)
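
The "pin dependencies and verify integrity" item can be illustrated with a digest check. This is a minimal sketch that assumes the pinned SHA-256 value comes from a lockfile or similar trusted record; `verify_artifact` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """Return True only if the artifact bytes match the pinned hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(actual, pinned_sha256.lower())
```

A failed check should block the build rather than log a warning; otherwise the pin is advisory only.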

Verification strategy

Runtime conformance: detect drift between desired and actual state.
Pipeline attack simulations: compromise a runner and measure blast radius.
Policy tests: unit tests for policy-as-code rules.
Dependency tampering drills: lockfile changes, integrity failures.
Rollback tests as part of release (not “if needed”).
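
The runtime conformance check can be sketched as a diff between two flat key/value maps. The function name and the flat-dictionary representation are assumptions; real desired state is usually nested and needs a recursive comparison.

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Map each drifted key to a (desired, actual) pair.

    None marks a key that is absent on one side, so this sketch cannot
    distinguish "absent" from "explicitly set to None".
    """
    keys = desired.keys() | actual.keys()
    return {
        k: (desired.get(k), actual.get(k))
        for k in sorted(keys)
        if desired.get(k) != actual.get(k)
    }
```

An empty result means the two states agree; any other result is a candidate for alerting or automatic reconciliation.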

Operational notes

Treat policy changes as security-sensitive deploys (review + rollout).
Rehearse incident response for the pipeline itself.
Keep a provenance trail for every artifact deployed to production.
Continuously scan and inventory dependencies; prioritize by exposure.
Audit who can ship and how; remove implicit paths.

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.
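
One way to make that choice explicit is to pass the failure mode as a parameter instead of burying it in an exception handler. The names below (`FailureMode`, `authorize`) are hypothetical and exist only for this sketch.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    FAIL_CLOSED = "closed"  # deny when the check itself cannot run
    FAIL_OPEN = "open"      # allow when the check itself cannot run


def authorize(check: Callable[[], bool], mode: FailureMode) -> bool:
    """Run the check; if it errors, fall back to the declared mode."""
    try:
        return check()
    except Exception:
        # The degraded-mode decision is visible at the call site.
        return mode is FailureMode.FAIL_OPEN
```

A check that runs and returns False is always a denial; the mode only governs what happens when the check cannot produce an answer.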

What to monitor

Rollback events and the conditions that triggered them.
Authz failures and policy denials (unexpected spikes).
Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Admission-control / rate-limit rejections (by reason).

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
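
An explicit rollback trigger can be written down as data plus one decision function. The thresholds and the minimum sample size below are placeholder assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    max_error_rate: float = 0.02       # hypothetical threshold
    max_p99_latency_ms: float = 800.0  # hypothetical threshold
    min_requests: int = 500            # below this, the sample is too small

    def should_rollback(self, requests: int, errors: int, p99_latency_ms: float) -> bool:
        # Design choice in this sketch: do not trigger on tiny samples.
        if requests < self.min_requests:
            return False
        return (
            errors / requests > self.max_error_rate
            or p99_latency_ms > self.max_p99_latency_ms
        )
```

Whether a small sample should hold the rollout or trigger a rollback is itself a policy decision; this sketch picks the former and says so in a comment so the choice can be reviewed.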

Evidence

Learn TLA+ (1) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
Jepsen (2) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

Can you answer “what code is running” with cryptographic evidence?
Which deploy actions are irreversible and how do you mitigate that?
How quickly can you revoke all pipeline credentials in an incident?
What is the smallest CI compromise that becomes a prod compromise today?

Checklist

Telemetry captures correctness signals.
Safety properties stated as invariants.
Assumptions listed and reviewed.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Rollback plan rehearsed and automated.
Failure modes enumerated with mitigations.

Further reading