Monthly research note. Theme: Deep Systems Notes.
TL;DR
Treat composable security, and the places where proofs break in real systems, as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- Operational behavior is part of correctness: rollout, rollback, and evidence.
- Contracts need enforcement: tests, assertions, and monitoring—not documentation.
- Integration boundaries are where proofs evaporate; treat them as first-class.
- Bind security decisions to evidence (audit, invariants, telemetry).
- Treat retries, reordering, and partial failure as default conditions.
Why this matters
- Security becomes optional through configuration drift unless enforced.
- Most real failures happen at integration boundaries, not inside components.
- Resilience requires making failure modes explicit and bounded.
- Operational behavior is part of correctness (rollouts, rollbacks, drift).
Key questions
- Which assumptions leak across boundaries (time, randomness, identity, ordering)?
- Which proofs are worth maintaining vs replacing with tests and monitoring?
- What are your compositional failure modes (partial deploys, mixed versions)?
- How do you keep “security properties” visible to operators and SREs?
- How do you prevent “optional security” from appearing via config drift?
- Where does “correctness” become an operational contract (SLOs, budgets, policy)?
Assumptions
- Components are built by different teams with different threat models.
- Integration happens under time pressure; defaults become de facto policy.
- Adversaries exploit ambiguity between systems, not within them.
- Observability is imperfect; you debug from partial evidence.
Non-goals
- Relying on “tribal knowledge” to connect assumptions across layers.
- Allowing config to silently weaken security properties.
Observability pipelines are an attack surface too (cardinality explosions, log injection). Protect them: bound label cardinality and escape untrusted input before it reaches a log line.
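A minimal sketch of guarding against the two pipeline attacks named above, assuming a text-based log format and a metrics system keyed by label values. The names `sanitize_log_field`, `BoundedLabel`, and both limits are hypothetical, chosen for illustration only.

```python
import re

# Hypothetical limits for illustration; tune them to your own pipeline.
MAX_FIELD_LEN = 256
MAX_LABEL_VALUES = 100

_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_log_field(value: str) -> str:
    """Escape control characters so untrusted input cannot forge log lines."""
    escaped = _CONTROL_CHARS.sub(lambda m: f"\\x{ord(m.group()):02x}", value)
    return escaped[:MAX_FIELD_LEN]


class BoundedLabel:
    """Map unbounded label values onto a capped set plus an overflow bucket."""

    def __init__(self, limit: int = MAX_LABEL_VALUES) -> None:
        self._limit = limit
        self._seen = set()

    def normalize(self, value: str) -> str:
        if value in self._seen:
            return value
        if len(self._seen) < self._limit:
            self._seen.add(value)
            return value
        return "__other__"  # overflow bucket keeps cardinality bounded
```

The overflow bucket trades detail for a hard upper bound on series count, which is usually the right trade when label values come from untrusted input.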
Model & invariants
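Interface contracts are predicates: a precondition P that the caller must establish and a postcondition Q that the callee must then guarantee (P -> Q).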
Make assumptions executable: encode them as assertions, tests, and run-time checks.
Treat config as code: version it, review it, and monitor drift.
Invariants must be checkable from evidence you actually have (state + logs + counters).
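One way to make a boundary assumption executable is to check a precondition on entry and a postcondition on exit. The sketch below assumes a toy `withdraw` function; `contract` and `ContractViolation` are hypothetical helpers, not part of any library.

```python
import functools
from typing import Callable


class ContractViolation(AssertionError):
    """Raised when a boundary contract (P -> Q) does not hold."""


def contract(pre: Callable[..., bool], post: Callable[..., bool]):
    """Wrap a function so P is checked on entry and Q on exit."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise ContractViolation(f"precondition failed for {fn.__name__}")
            result = fn(*args, **kwargs)
            if not post(result, *args, **kwargs):
                raise ContractViolation(f"postcondition failed for {fn.__name__}")
            return result
        return wrapper
    return decorate


@contract(
    pre=lambda balance, amount: amount > 0 and balance >= amount,
    post=lambda result, balance, amount: result == balance - amount and result >= 0,
)
def withdraw(balance: int, amount: int) -> int:
    return balance - amount
```

Calling `withdraw(5, 10)` raises `ContractViolation` instead of returning a negative balance, so the violated assumption surfaces at the boundary rather than downstream.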
Security properties
- Authenticity: actions are bound to identity and purpose.
- Replay resistance: duplicated inputs do not change outcomes.
- Integrity: invalid transitions are rejected (and detectable).
- Downgrade resistance: negotiation can’t silently weaken security posture.
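Authenticity and replay resistance from the list above can be sketched together: bind each action to a nonce with an HMAC tag, and refuse any nonce already seen. This is an illustration under simplifying assumptions (a single shared key, an in-memory nonce set that never expires); `ReplayGuard` is a hypothetical name.

```python
import hashlib
import hmac


class ReplayGuard:
    """Accept each authenticated (nonce, action) pair at most once."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._seen_nonces = set()  # unbounded here; real systems expire entries

    def sign(self, nonce: str, action: str) -> str:
        # Length-prefix the nonce so (nonce, action) pairs cannot be confused.
        message = f"{len(nonce)}:{nonce}|{action}".encode()
        return hmac.new(self._key, message, hashlib.sha256).hexdigest()

    def accept(self, nonce: str, action: str, tag: str) -> bool:
        expected = self.sign(nonce, action)
        if not hmac.compare_digest(expected.encode(), tag.encode()):
            return False  # authenticity: tag does not bind this nonce/action
        if nonce in self._seen_nonces:
            return False  # replay: this nonce was already applied
        self._seen_nonces.add(nonce)
        return True
```

The tag is verified before the nonce is recorded, so an attacker cannot burn valid nonces by submitting forged requests.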
Failure modes
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Timeout ambiguity causing double-apply or partial state transitions.
- Recovery paths that only work when nothing is broken.
- Mixed-version behavior that violates assumptions silently.
Mixed-version deployments create states you never tested—plan for them explicitly.
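Planning for mixed versions can start with simply enumerating the pairs that can coexist and comparing them against what the upgrade tests cover. A rough sketch, assuming versions are opaque strings and tested pairs are tracked as `(writer, reader)` tuples; both function names are hypothetical.

```python
from itertools import product


def mixed_version_pairs(versions):
    """All (writer, reader) pairs that can coexist during a rollout or rollback."""
    return [(w, r) for w, r in product(versions, repeat=2) if w != r]


def untested_pairs(versions, tested):
    """Pairs that can occur in production but have no upgrade test."""
    return [pair for pair in mixed_version_pairs(versions) if pair not in tested]


# Hypothetical rollout v1 -> v2 where only one direction has a test.
gaps = untested_pairs(["v1", "v2"], tested={("v1", "v2")})
# gaps == [("v2", "v1")]: a v2 writer with a v1 reader is untested.
```

The untested pair here is the one that appears during a partial deploy or a rollback, which is when a silent assumption violation is hardest to diagnose.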
Design sketch
flowchart LR
boundary["Boundary"] --> contract["Contract (P -> Q)"]
contract --> test["Tests"]
test --> monitor["Monitoring"]
monitor --> incident["Incident"]
incident --> contract
Implementation notes
Operational constraints are part of the design: deploy, rollback, and drift.
Acknowledge only after durability (or make “ack” explicitly best-effort).
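A minimal sketch of "ack after durability" for a local append-only file, assuming one record per line. The function name `durable_append` is hypothetical; `os.fsync` is the standard-library call that asks the OS to flush the file to stable storage.

```python
import os


def durable_append(path: str, record: bytes) -> bool:
    """Append a record and return True (the ack) only after fsync succeeds."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its cache to stable storage
    return True                # reached only if no step above raised
```

If any step raises, no ack is returned and the caller must treat the write as unknown, not as failed. On POSIX systems a newly created file may also need an fsync of its parent directory before the entry itself is durable; the sketch leaves that out.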
// Integration note: treat FFI/service boundaries as an API with invariants.
// Encode invariants as types where possible, assertions otherwise.
Verification strategy
- End-to-end property tests for the smallest meaningful workflow.
- Invariant monitoring tied to incident response playbooks.
- Upgrade tests for mixed-version and rollback scenarios.
- Contract tests at boundaries with adversarial inputs and skew.
- Fault injection at seams (queues, caches, RPC) not only components.
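Fault injection at a seam can be sketched without any framework: simulate a queue that duplicates and reorders messages, then check that the consumer's result does not depend on delivery behavior. `Ledger`, `faulty_delivery`, and `check_seam` are hypothetical names, and the duplication rate is an arbitrary illustrative value.

```python
import random


class Ledger:
    """Toy consumer whose invariant is: each message id is applied at most once."""

    def __init__(self):
        self.balance = 0
        self.applied = set()

    def handle(self, msg_id, amount):
        if msg_id in self.applied:
            return  # duplicate delivery: ignore
        self.applied.add(msg_id)
        self.balance += amount


def faulty_delivery(messages, rng, dup_rate=0.3):
    """Simulate a queue that duplicates and reorders messages."""
    delivered = list(messages)
    delivered += [m for m in messages if rng.random() < dup_rate]
    rng.shuffle(delivered)
    return delivered


def check_seam(trials=200, seed=0):
    """Property: final balance is independent of duplication and ordering."""
    rng = random.Random(seed)
    messages = [(i, rng.randint(1, 100)) for i in range(20)]
    expected = sum(amount for _, amount in messages)
    for _ in range(trials):
        ledger = Ledger()
        for msg_id, amount in faulty_delivery(messages, rng):
            ledger.handle(msg_id, amount)
        if ledger.balance != expected:
            return False
    return True
```

Seeding the generator keeps any failure reproducible, so a violated property becomes a regression test and not a one-off observation.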
Operational notes
- Use canaries for protocol and crypto changes; define rollback triggers.
- Maintain runbooks that reference invariants, not just symptoms.
- Store evidence: audit logs, config diffs, and deployment metadata.
- Treat config drift as an incident: detect, alert, and remediate.
- Make security and correctness properties observable (metrics + alerts).
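Detecting drift, as the list above asks, needs a reviewed baseline to compare against. The sketch below assumes configs are JSON-serializable dicts with string keys; the setting names are hypothetical examples.

```python
import hashlib
import json

_MISSING = object()  # distinguishes "key absent" from "key set to None"


def fingerprint(config: dict) -> str:
    """Stable hash of a config; key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def drifted_keys(reviewed: dict, live: dict) -> list:
    """Keys whose live value differs from the reviewed baseline."""
    keys = set(reviewed) | set(live)
    return sorted(
        k for k in keys if reviewed.get(k, _MISSING) != live.get(k, _MISSING)
    )


# Hypothetical settings for illustration.
reviewed = {"tls_min_version": "1.3", "require_mtls": True}
live = {"require_mtls": False, "tls_min_version": "1.3"}
```

Comparing fingerprints is a cheap signal to alert on; `drifted_keys(reviewed, live)` then names what changed, which is the evidence a responder needs.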
Make degraded modes explicit: fail closed vs fail open is a policy choice.
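Making that policy choice explicit can be as simple as requiring the caller to name it. A sketch, assuming an authorization check that may raise when its dependency is degraded; `FailureMode` and `authorize` are hypothetical names.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_CLOSED = "fail_closed"  # deny when the decision cannot be made
    FAIL_OPEN = "fail_open"      # allow when the decision cannot be made


def authorize(check, request, on_error: FailureMode) -> bool:
    """Run an authorization check; apply the declared policy if it errors."""
    try:
        return bool(check(request))
    except Exception:
        # The dependency is degraded; the outcome is a policy decision,
        # not an accident of where the exception happened to propagate.
        return on_error is FailureMode.FAIL_OPEN
```

Because `on_error` has no default, every call site has to state which way it fails, and that choice shows up in code review.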
What to monitor
- Authz failures and policy denials (unexpected spikes).
- Rollback events and the conditions that triggered them.
- Error budget burn + tail latency under load.
- Invariant violation rate (should be ~0).
- Retry/timeout rates by endpoint and client cohort.
Rollback plan
- Keep dual-write / dual-verify windows where appropriate.
- Define an explicit rollback trigger (metrics + thresholds).
- Prefer backward-compatible changes; avoid “flag day” upgrades.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Use canaries and staged rollout; stop early when signals degrade.
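An explicit rollback trigger can be written as a pure function over metrics and thresholds. The metric names and threshold values below are hypothetical; real ones would come from your SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackThresholds:
    # Hypothetical values for illustration; derive real ones from your SLOs.
    max_error_rate: float = 0.02
    max_p99_latency_ms: float = 500.0
    max_invariant_violations: int = 0


def should_roll_back(metrics: dict, limits: RollbackThresholds) -> list:
    """Return the reasons to roll back; an empty list means keep going."""
    reasons = []
    if metrics["error_rate"] > limits.max_error_rate:
        reasons.append("error_rate")
    if metrics["p99_latency_ms"] > limits.max_p99_latency_ms:
        reasons.append("p99_latency_ms")
    if metrics["invariant_violations"] > limits.max_invariant_violations:
        reasons.append("invariant_violations")
    return reasons
```

Returning the reasons, not just a boolean, preserves the evidence for why a rollout stopped, which the rollback record can then store alongside configs and artifacts.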
Evidence
- Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
- Jepsen (2) — Integration-focused fault testing and correctness thinking.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Open questions
- Where can config silently weaken security properties today?
- Which assumptions do you currently enforce only through convention?
- What boundary is most likely to be bypassed under incident pressure?
- Which properties can be proven locally vs only tested end-to-end?
Checklist
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Telemetry captures correctness signals.
- Rollback plan rehearsed and automated.
- Assumptions listed and reviewed.
- Safety properties stated as invariants.
- Failure modes enumerated with mitigations.
Further reading
- RFC 1122: Requirements for Internet Hosts — A classic example of operational constraints becoming protocol reality.
- Jepsen — Integration-focused fault testing and correctness thinking.
- End-to-End Arguments in System Design — A foundational argument about where to enforce correctness properties.
- Learn TLA+ — Practical entry point for specification and model checking.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.