Monthly research note. Theme: Deep Systems Notes.
TL;DR
Verifiable Computation as Infrastructure: Proof Systems at Scale. Treat it as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- Interfaces must carry assumptions: time, randomness, identity, and ordering.
- Operational behavior is part of correctness: rollout, rollback, and evidence.
- Integration boundaries are where proofs evaporate; treat them as first-class.
- Define safety properties before performance goals.
- Write assumptions down; treat them as interfaces.
Why this matters
- Most real failures happen at integration boundaries, not inside components.
- Operational behavior is part of correctness (rollouts, rollbacks, drift).
- Mixed-version operation creates states you didn’t model.
- Resilience requires making failure modes explicit and bounded.
Key questions
- How do you keep 'security properties' visible to operators and SREs?
- Where does 'correctness' become an operational contract (SLOs, budgets, policy)?
- What are your compositional failure modes (partial deploys, mixed versions)?
- What is the smallest integration test that can falsify your assumptions?
- Which proofs are worth maintaining vs replacing with tests and monitoring?
- How do you prevent 'optional security' from appearing via config drift?
Assumptions
- Components are built by different teams with different threat models.
- Upgrades are incremental; compatibility is a security boundary.
- Integration happens under time pressure; defaults become de facto policy.
- Observability is imperfect; you debug from partial evidence.
Non-goals
- Allowing config to silently weaken security properties.
- Relying on “tribal knowledge” to connect assumptions across layers.
Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.
Model & invariants
Composability is the promise that proofs survive integration.
Make assumptions executable: encode them as assertions, tests, and run-time checks.
Choose what to prove and what to monitor. Both are necessary in practice.
Monotonicity beats timestamps: counters and epochs survive clock skew.
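One way to make the monotonicity assumption executable is a small guard that rejects any update whose epoch does not strictly increase. This is a minimal sketch, not a prescribed design: `EpochGuard`, `StaleEpochError`, and the choice of a plain integer epoch are hypothetical names and assumptions introduced for illustration.

```python
from dataclasses import dataclass


class StaleEpochError(Exception):
    """Raised when an update carries an epoch that is not newer than the current one."""


@dataclass
class EpochGuard:
    """Accepts an update only when its epoch strictly increases (hypothetical helper)."""

    current_epoch: int = 0

    def admit(self, epoch: int) -> int:
        # The invariant is checked at run time: a stale or replayed epoch is
        # rejected no matter what any wall clock on either side reports.
        if epoch <= self.current_epoch:
            raise StaleEpochError(
                f"epoch {epoch} is not newer than {self.current_epoch}"
            )
        self.current_epoch = epoch
        return self.current_epoch
```

Because the check compares counters rather than timestamps, clock skew between writers cannot reorder accepted updates; gaps in the sequence are allowed, replays are not.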
Security properties
- Integrity: invalid transitions are rejected (and detectable).
- Authenticity: actions are bound to identity and purpose.
- Evidence: critical actions emit verifiable audit events.
- Downgrade resistance: negotiation can’t silently weaken security posture.
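The downgrade-resistance property can be sketched as a negotiation function with an explicit floor. The names `negotiate`, `DowngradeError`, and `min_version` are assumptions made for this example; a fuller design would probably also authenticate the offered version lists so an attacker cannot edit them in transit.

```python
class DowngradeError(Exception):
    """Raised when no mutually supported version meets the required floor."""


def negotiate(local_versions, peer_versions, min_version):
    """Return the highest version both sides support, never below min_version."""
    common = set(local_versions) & set(peer_versions)
    acceptable = [v for v in common if v >= min_version]
    if not acceptable:
        # Fail closed and loudly instead of silently falling back to a
        # weaker version that happens to be shared.
        raise DowngradeError(
            f"no common version >= {min_version}; common={sorted(common)}"
        )
    return max(acceptable)
```

For example, `negotiate([1, 2, 3], [2, 3], min_version=2)` selects 3, while a peer that offers only version 1 causes a typed failure that can be counted and alerted on.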
Failure modes
- Mixed-version behavior that violates assumptions silently.
- Timeout ambiguity causing double-apply or partial state transitions.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Config drift that weakens security posture over time.
A recovery plan that isn’t exercised will fail when you need it.
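The timeout-ambiguity failure mode above is commonly bounded with idempotency keys: a client that cannot tell whether its request was applied retries with the same key, and the server returns the recorded result instead of applying it twice. The sketch below is illustrative only; `IdempotentApplier` and its in-memory dictionary are hypothetical, and a real system would need the key table to be durable and updated atomically with the state change.

```python
from typing import Dict


class IdempotentApplier:
    """Applies each keyed operation at most once (in-memory illustration)."""

    def __init__(self) -> None:
        self.balance = 0
        self._results: Dict[str, int] = {}  # idempotency key -> recorded result

    def apply(self, key: str, amount: int) -> int:
        # A retry with a known key returns the original result and leaves
        # state untouched, so a timed-out request can be resent safely.
        if key in self._results:
            return self._results[key]
        self.balance += amount
        self._results[key] = self.balance
        return self.balance
```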
Design sketch
flowchart LR
boundary["Boundary"] --> contract["Contract (P -> Q)"]
contract --> test["Tests"]
test --> monitor["Monitoring"]
monitor --> incident["Incident"]
incident --> contract
Implementation notes
If it’s not enforced, it’s not a contract.
Acknowledge only after durability (or make “ack” explicitly best-effort).
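As a rough illustration of acknowledging only after durability, the sketch below appends a record to a local file and returns the acknowledgement only once `os.fsync` has returned. The function name `append_durably` and the newline-delimited log format are assumptions for this example; on POSIX systems a newly created file may also need an fsync on its parent directory, which is omitted here.

```python
import os


def append_durably(path: str, record: bytes) -> bool:
    """Append one record and return True only after it has been fsynced."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()              # move data from Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to persist it to stable storage
    # Reaching this line means the fsync call succeeded; any exception above
    # propagates, so the caller never sees an ack for an unpersisted write.
    return True
```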
Boundary contract template:
Preconditions (P):
- input validation, size limits, auth context
- monotonic versions / idempotency keys
Postconditions (Q):
- durable state transitions
- evidence emitted (audit/metrics)
Failure modes:
- explicit, typed, and observable
Verification strategy
- End-to-end property tests for the smallest meaningful workflow.
- Upgrade tests for mixed-version and rollback scenarios.
- Fault injection at seams (queues, caches, RPC) not only components.
- Invariant monitoring tied to incident response playbooks.
- Contract tests at boundaries with adversarial inputs and skew.
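A contract test at a boundary can be sketched by wrapping the entry point in executable pre- and postconditions (the P and Q of the template) and then feeding it adversarial inputs. Everything named here is hypothetical: the `contract` decorator, the `store` function, and the `MAX_PAYLOAD` limit are stand-ins chosen for illustration, not part of any particular library.

```python
import functools


class ContractViolation(Exception):
    """Raised when a precondition or postcondition does not hold."""


def contract(pre, post):
    """Wrap a function so that P is checked before and Q after each call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise ContractViolation(f"precondition failed for {fn.__name__}")
            result = fn(*args, **kwargs)
            if not post(result, *args, **kwargs):
                raise ContractViolation(f"postcondition failed for {fn.__name__}")
            return result
        return inner
    return wrap


MAX_PAYLOAD = 1024  # hypothetical size limit for the example


@contract(
    pre=lambda payload: isinstance(payload, bytes) and 0 < len(payload) <= MAX_PAYLOAD,
    post=lambda result, payload: result == len(payload),
)
def store(payload):
    """Stand-in for a boundary call; reports how many bytes it accepted."""
    return len(payload)


def run_adversarial_cases():
    """Return (rejected, total) for inputs that must all fail the precondition."""
    cases = [b"", b"x" * (MAX_PAYLOAD + 1), "not-bytes", None]
    rejected = 0
    for case in cases:
        try:
            store(case)
        except ContractViolation:
            rejected += 1
    return rejected, len(cases)
```

The test passes only when every adversarial case is rejected with the typed error, which keeps the failure mode explicit and observable instead of surfacing as an arbitrary exception deeper in the stack.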
Operational notes
- Treat config drift as an incident: detect, alert, and remediate.
- Maintain runbooks that reference invariants, not just symptoms.
- Store evidence: audit logs, config diffs, and deployment metadata.
- Make security and correctness properties observable (metrics + alerts).
- Use canaries for protocol and crypto changes; define rollback triggers.
Keep audit and config history queryable during incidents—evidence beats intuition.
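Drift detection can start as a comparison between a reviewed baseline and the live configuration. The sketch below assumes configs are flat, JSON-serializable dictionaries; `fingerprint`, `detect_drift`, and the sample keys in the usage note are hypothetical.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Hash a canonical JSON form so that key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_drift(baseline: dict, live: dict) -> dict:
    """Return {key: (expected, actual)} for every key whose value differs.

    A key missing on one side is reported with None on that side, so this
    simple version cannot tell "absent" from "explicitly None".
    """
    keys = sorted(set(baseline) | set(live))
    return {
        k: (baseline.get(k), live.get(k))
        for k in keys
        if baseline.get(k) != live.get(k)
    }
```

Comparing fingerprints gives a cheap periodic check suitable for alerting; when they differ, `detect_drift({"tls_min": "1.3"}, {"tls_min": "1.2"})` yields the per-key diff to attach to the incident as evidence.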
What to monitor
- Error budget burn + tail latency under load.
- Rollback events and the conditions that triggered them.
- Admission-control / rate-limit rejections (by reason).
- Retry/timeout rates by endpoint and client cohort.
- Authz failures and policy denials (unexpected spikes).
Rollback plan
- Use canaries and staged rollout; stop early when signals degrade.
- Define an explicit rollback trigger (metrics + thresholds).
- Keep dual-write / dual-verify windows where appropriate.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
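An explicit rollback trigger can be written down as code instead of left to judgment during a rollout. The sketch below ties the trigger to error-budget burn rate, defined here as the observed error rate divided by the budget (1 minus the SLO target). The class name `RollbackTrigger` and the example thresholds are assumptions; the measurement window is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    """Rollback rule: fire when the budget burns faster than max_burn_rate."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    max_burn_rate: float  # multiple of the sustainable burn rate that triggers rollback

    def burn_rate(self, errors: int, requests: int) -> float:
        if requests == 0:
            return 0.0  # no traffic in the window means no evidence of burn
        budget = 1.0 - self.slo_target
        return (errors / requests) / budget

    def should_roll_back(self, errors: int, requests: int) -> bool:
        return self.burn_rate(errors, requests) >= self.max_burn_rate
```

With `RollbackTrigger(slo_target=0.999, max_burn_rate=10.0)`, a canary window showing 20 errors in 1000 requests burns budget at roughly 20 times the sustainable rate and trips the trigger, while 5 errors in 1000 does not.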
Evidence
- Learn TLA+ (1) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
- Site Reliability Engineering (Google) (2) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Open questions
- What boundary is most likely to be bypassed under incident pressure?
- Which assumptions do you currently enforce only through convention?
- Which properties can be proven locally vs only tested end-to-end?
- Where can config silently weaken security properties today?
Checklist
- Safety properties stated as invariants.
- Assumptions listed and reviewed.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Rollback plan rehearsed and automated.
- Telemetry captures correctness signals.
- Failure modes enumerated with mitigations.
Further reading
- End-to-End Arguments in System Design — A foundational argument about where to enforce correctness properties.
- Jepsen — Integration-focused fault testing and correctness thinking.
- RFC 1122: Requirements for Internet Hosts — A classic example of operational constraints becoming protocol reality.
- Learn TLA+ — Practical entry point for specification and model checking.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.