Sandbox Escapes: Isolation Boundaries as a Design Input

Monthly research note. Theme: Adversarial Infrastructure & Global Systems.

TL;DR

A focused memo on sandbox escapes, treating isolation boundaries as a design input: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.
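
A minimal sketch of the three-outcome idea, assuming a caller that wraps a remote call. The names `Outcome`, `classify`, and `next_step` are hypothetical, and the retry policy is one possible choice rather than a recommendation.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"  # timed out: the effect may or may not have happened


def classify(call: Callable[[], None]) -> Outcome:
    """Run a call and map its result onto three outcomes instead of two."""
    try:
        call()
    except TimeoutError:
        # No reply arrived in time; the remote side may still have applied it.
        return Outcome.UNKNOWN
    except Exception:
        # Any other exception is treated as a definite failure in this sketch.
        return Outcome.FAILURE
    return Outcome.SUCCESS


def next_step(outcome: Outcome) -> str:
    """Hypothetical policy: only a definite failure is retried blindly."""
    if outcome is Outcome.SUCCESS:
        return "commit"
    if outcome is Outcome.FAILURE:
        return "retry"
    # Ambiguous: look up remote state, or retry with an idempotency key.
    return "reconcile"
```

The point of the separate `UNKNOWN` branch is that a blind retry after a timeout can apply the same effect twice, so ambiguity gets its own handling path.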

Key takeaways

Evidence pipelines (audit/config history) are part of incident response correctness.
Protect observability: you can’t respond blind, and telemetry can be attacked.
Dependencies (DNS, routing, PKI) are shared attack surfaces—plan containment.
Write assumptions down; treat them as interfaces.
Make failure modes explicit and observable.

Why this matters

Incident response is a protocol: practice it, automate it, validate it.
Logs are only useful if they remain trustworthy under compromise.
Attackers exploit cost asymmetry: abuse is cheap for them while defense is expensive for you.
Privacy failures often come from metadata, not plaintext.

Key questions

How do you detect attacks that look like “normal traffic spikes”?
Where is the attacker’s leverage (routing, DNS, dependency, identity, time)?
How do you make abuse expensive (proof-of-work, quotas, pricing, friction)?
Which controls fail first under load: auth, rate limits, storage, or observability?
What is your degraded-mode behavior (and is it safe)?
Which logs are trustworthy under compromise (append-only, signed, isolated)?

Assumptions

Attackers can manipulate routing and DNS indirectly (upstream failures, BGP issues).
Traffic spikes can be malicious or accidental; you must handle both.
Observability pipelines can be attacked (cardinality explosions, log injection).
Operators are human and will make mistakes under pressure.

Non-goals

Assuming WAF/rate limits are sufficient without architecture changes.
Relying on dashboards that vanish during the incident.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.

Model & invariants

Defense is about cost asymmetry. If the attacker spends $1$ unit and you spend $100$ units to absorb it, you lose.

$$\mathrm{Cost}_{\text{defense}} \ll \mathrm{Cost}_{\text{attack}} \quad \text{(per unit of damage prevented)}$$

Treat observability as a dependency: protect it from overload and manipulation.

Define which operations fail closed vs fail open. Do it before an incident.
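
One way to write the fail-closed / fail-open decision down ahead of time is a small policy table that code consults when a dependency is unavailable. This is a sketch under assumptions: the operation names and the `decide` helper are hypothetical, and the default for unlisted operations is a design choice.

```python
from enum import Enum


class FailMode(Enum):
    CLOSED = "closed"  # deny when the dependency is unavailable
    OPEN = "open"      # allow when the dependency is unavailable


# Hypothetical operations; what matters is that the table exists before the incident.
POLICY = {
    "authorize_payment": FailMode.CLOSED,
    "check_rate_limit": FailMode.CLOSED,
    "load_recommendations": FailMode.OPEN,
}


def decide(operation: str, dependency_healthy: bool, normal_decision: bool) -> bool:
    """Return True if the operation should be allowed."""
    if dependency_healthy:
        return normal_decision
    # Operations missing from the table fail closed in this sketch.
    mode = POLICY.get(operation, FailMode.CLOSED)
    return mode is FailMode.OPEN
```

Because the table is plain data, each entry can be covered by a unit test, which is what makes the degraded-mode behavior reviewable.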

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
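
A minimal sketch of the invariant, assuming writers carry an integer epoch (sometimes called a fencing token) issued by some coordinator. `EpochGuard` is a hypothetical name; no wall-clock time is consulted anywhere.

```python
class EpochGuard:
    """Accept a write only if its epoch is not older than the highest seen."""

    def __init__(self) -> None:
        self.highest_epoch = 0

    def accept(self, epoch: int) -> bool:
        if epoch < self.highest_epoch:
            # Stale writer, for example a deposed leader that has not noticed yet.
            return False
        self.highest_epoch = epoch
        return True


guard = EpochGuard()
guard.accept(1)   # accepted: first writer
guard.accept(2)   # accepted: newer writer takes over
guard.accept(1)   # rejected: epoch 1 is now stale, whatever its clock says
```

The comparison depends only on ordering of integers, so clock skew between writers cannot reorder or resurrect a stale write.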

Security properties

Downgrade resistance: negotiation can’t silently weaken security posture.
Evidence: critical actions emit verifiable audit events.
Authenticity: actions are bound to identity and purpose.
Replay resistance: duplicated inputs do not change outcomes.
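
Replay resistance can be illustrated with a request identifier that is remembered after the first application. This is a sketch, assuming callers supply a unique `request_id`; the `Ledger` class is hypothetical, and a real store would need bounded retention for the remembered identifiers.

```python
class Ledger:
    """Apply each request at most once, keyed by a caller-supplied request ID."""

    def __init__(self) -> None:
        self.balance = 0
        self._applied = {}  # request_id -> balance returned the first time

    def credit(self, request_id: str, amount: int) -> int:
        if request_id in self._applied:
            # Duplicate or replayed input: return the original result, change nothing.
            return self._applied[request_id]
        self.balance += amount
        self._applied[request_id] = self.balance
        return self.balance
```

Returning the original result on a duplicate, rather than an error, lets a client retry safely after an ambiguous timeout.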

Failure modes

Config drift that weakens security posture over time.
Observability gaps during incidents (missing evidence).
Recovery paths that only work when nothing is broken.
Mixed-version behavior that violates assumptions silently.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

flowchart TD
  edge["Edge (rate limits + WAF)"] --> core["Core Services"]
  core --> data["Data Plane"]
  data --> control["Control Plane"]
  control --> edge
  siem["Detection/Response"] --> core
  siem --> edge

Implementation notes

Keep evidence pipelines alive: you can’t respond blind.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.
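
The rule of thumb can be sketched as an admission function that runs the cheapest checks first. The limits, the `items` field, and the names `admit` and `Rejected` are assumptions for illustration, not values from a real system.

```python
import json

MAX_BODY_BYTES = 64 * 1024  # assumed limit
MAX_ITEMS = 100             # assumed limit


class Rejected(Exception):
    """Raised with a short reason string when a request is refused."""


def admit(raw_body: bytes) -> list:
    """Parse, validate, and cap cost before any heavy work is scheduled."""
    # 1. Cap size before parsing: the cheapest check goes first.
    if len(raw_body) > MAX_BODY_BYTES:
        raise Rejected("body_too_large")
    # 2. Parse. Malformed or pathologically nested input is rejected, not retried.
    try:
        payload = json.loads(raw_body)
    except (ValueError, RecursionError):
        raise Rejected("malformed_json") from None
    # 3. Validate shape.
    if not isinstance(payload, dict) or not isinstance(payload.get("items"), list):
        raise Rejected("invalid_shape")
    # 4. Cap the work this request is allowed to cause downstream.
    if len(payload["items"]) > MAX_ITEMS:
        raise Rejected("too_many_items")
    return payload["items"]
```

Each rejection carries a distinct reason string, so rejections can be counted by reason rather than lumped into a single error rate.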

Evidence checklist:
- Immutable logs (append-only)
- Signed audit events
- Time sync monitoring
- Dependency health snapshots
- Config change history
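
The first two checklist items can be sketched together as a hash-chained log whose entries are authenticated with a key. Assumptions: HMAC-SHA256 from the standard library stands in for a real signature scheme, the key is held somewhere the audited system cannot read, and the function names are hypothetical. The sketch detects edits and reordering but not truncation of the tail.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key: bytes, prev: str, event: dict) -> str:
    # Canonical JSON so the same event always produces the same bytes.
    body = json.dumps(event, sort_keys=True).encode()
    return hmac.new(key, prev.encode() + body, hashlib.sha256).hexdigest()


def append(log: list, key: bytes, event: dict) -> None:
    """Append an event, chaining it to the MAC of the previous entry."""
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"event": event, "prev": prev, "mac": _mac(key, prev, event)})


def verify(log: list, key: bytes) -> bool:
    """Recompute the chain from the start; any edited entry breaks it."""
    prev = GENESIS
    for entry in log:
        expected = _mac(key, prev, entry["event"])
        if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev = entry["mac"]
    return True
```

Running `verify` during incident replay gives a yes/no answer on whether the timeline being reconstructed has been altered since it was written.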

Verification strategy

Incident replay: reconstruct timeline from evidence pipelines.
Dependency chaos: DNS issues, cert failures, upstream outages.
Game days: simulate DDoS, dependency failure, and credential abuse.
Observability stress: cardinality explosions and sampling under attack.
Policy tests: fail closed/open behaviors are unit-tested.

Operational notes

Document and rehearse degraded-mode policy with on-call rotations.
Instrument cost: which defenses become expensive and when.
Make emergency controls quick: feature flags, circuit breakers, safe defaults.
Protect the edge and the evidence: rate limits + SIEM + log integrity.
Keep recovery paths simple: restore from known-good, rotate secrets, reissue certs.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

Authz failures and policy denials (unexpected spikes).
Rollback events and the conditions that triggered them.
Error budget burn + tail latency under load.
Admission-control / rate-limit rejections (by reason).
Invariant violation rate (should be ~0).

Rollback plan

Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Use canaries and staged rollout; stop early when signals degrade.
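
An explicit rollback trigger can be as small as a frozen set of thresholds plus one pure function. This is a sketch: the metric names, the threshold values in the usage line, and `should_roll_back` are assumptions, and a real trigger would read from the metrics pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    max_error_rate: float      # fraction of requests, e.g. 0.02 for 2%
    max_p99_latency_ms: float
    min_samples: int           # avoid deciding on too little canary traffic


def should_roll_back(trigger: RollbackTrigger, errors: int, requests: int,
                     p99_latency_ms: float) -> bool:
    """Return True when canary signals cross the agreed thresholds."""
    if requests < max(trigger.min_samples, 1):
        return False  # not enough evidence yet; keep observing
    error_rate = errors / requests
    return (error_rate > trigger.max_error_rate
            or p99_latency_ms > trigger.max_p99_latency_ms)


# Hypothetical thresholds for a canary stage.
canary = RollbackTrigger(max_error_rate=0.02, max_p99_latency_ms=800.0, min_samples=500)
```

Keeping the decision a pure function of recorded metrics means the same inputs can be replayed afterwards to explain why a rollout was stopped.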

Evidence

Learn TLA+ (1) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
Designing Data-Intensive Applications (Kleppmann) (2) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.

Open questions

How do you keep control-plane access during widespread incidents?
Where do you pay cost asymmetry today—and can you flip it?
What is your ‘safe mode’ when dependencies fail?
Which operation, if abused, causes irreversible damage?

Checklist

Telemetry captures correctness signals.
Failure modes enumerated with mitigations.
Safety properties stated as invariants.
Rollback plan rehearsed and automated.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Assumptions listed and reviewed.
