Monthly research note. Theme: Deep Systems Notes.

TL;DR

A focused memo on "Secure Distributed Storage: Erasure Coding Under Adversaries". The approach: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

  • Operational behavior is part of correctness: rollout, rollback, and evidence.
  • Interfaces must carry assumptions: time, randomness, identity, and ordering.
  • Contracts need enforcement: tests, assertions, and monitoring—not documentation.
  • Treat retries, reordering, and partial failure as default conditions.
  • Automate guardrails; humans are for judgment, not for consistent enforcement.

Why this matters

  • Operational behavior is part of correctness (rollouts, rollbacks, drift).
  • Resilience requires making failure modes explicit and bounded.
  • Mixed-version operation creates states you didn’t model.
  • Security becomes optional through configuration drift unless enforced.

Key questions

  • What are your compositional failure modes (partial deploys, mixed versions)?
  • How do you prevent 'optional security' from appearing via config drift?
  • Which proofs are worth maintaining vs replacing with tests and monitoring?
  • What is the smallest integration test that can falsify your assumptions?
  • Which assumptions leak across boundaries (time, randomness, identity, ordering)?
  • Where does 'correctness' become an operational contract (SLOs, budgets, policy)?

Assumptions

  • Observability is imperfect; you debug from partial evidence.
  • Upgrades are incremental; compatibility is a security boundary.
  • Integration happens under time pressure; defaults become de facto policy.
  • Components are built by different teams with different threat models.

Non-goals

  • Relying on “tribal knowledge” to connect assumptions across layers.
  • Assuming proofs automatically survive composition.

Attack surface

Negotiation and fallbacks are where security silently becomes optional. Treat them as hostile: pin a minimum acceptable configuration and fail closed when it cannot be met.

Model & invariants

Interface contracts are predicates:

$$\text{caller obeys } P \Rightarrow \text{callee guarantees } Q.$$

Treat config as code: version it, review it, and monitor drift.
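
One way to monitor drift is to fingerprint the reviewed config and compare it with what is actually running. The sketch below is illustrative only: the function names and the config keys (`tls_min_version`, `verify_fragments`) are assumptions, not part of any real system described here.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a config in canonical JSON form so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_keys(reviewed: dict, running: dict) -> list:
    """Return the top-level keys whose values differ from the reviewed baseline."""
    missing = object()  # sentinel so an absent key differs from an explicit None
    keys = set(reviewed) | set(running)
    return sorted(
        k for k in keys if reviewed.get(k, missing) != running.get(k, missing)
    )


# Hypothetical configs: the running one has silently disabled verification.
reviewed = {"tls_min_version": "1.3", "verify_fragments": True}
running = {"verify_fragments": False, "tls_min_version": "1.3"}

if config_fingerprint(reviewed) != config_fingerprint(running):
    print("config drift detected in:", drifted_keys(reviewed, running))
```

Comparing fingerprints is cheap enough to run on every deploy and on a timer; the per-key diff is only needed once a mismatch is seen.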

Make assumptions executable: encode them as assertions, tests, and run-time checks.
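
The contract "caller obeys P implies callee guarantees Q" can be made executable with a small wrapper that checks P before the call and Q after it. This is a minimal sketch; `contract`, `ContractViolation`, and `split_into_chunks` are hypothetical names chosen for illustration.

```python
import functools


class ContractViolation(Exception):
    """Raised when either side of an interface contract is broken."""


def contract(pre, post):
    """Wrap a function so P is checked on entry and Q on exit."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise ContractViolation(f"{fn.__name__}: caller violated P")
            result = fn(*args, **kwargs)
            if not post(result, *args, **kwargs):
                raise ContractViolation(f"{fn.__name__}: callee violated Q")
            return result
        return wrapper
    return decorate


@contract(
    # P: non-empty input and a positive chunk count.
    pre=lambda data, k: k > 0 and len(data) > 0,
    # Q: exactly k chunks that reassemble to the original bytes.
    post=lambda chunks, data, k: (
        len(chunks) == k and b"".join(chunks)[: len(data)] == data
    ),
)
def split_into_chunks(data: bytes, k: int) -> list:
    """Split data into k equal-size chunks, zero-padding the last one."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]
```

Checking Q at run time costs something, so in practice the postcondition might be sampled or enabled only in tests and canaries; the precondition is usually cheap enough to keep on everywhere.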

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.

Security properties

  • Downgrade resistance: negotiation can’t silently weaken security posture.
  • Authenticity: actions are bound to identity and purpose.
  • Evidence: critical actions emit verifiable audit events.
  • Integrity: invalid transitions are rejected (and detectable).
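
As an illustration of the downgrade-resistance property above, the following sketch picks the strongest mutually supported version but refuses to go below a pinned floor. The integer version numbers and `MIN_ALLOWED_VERSION` are hypothetical. A real protocol would also need to authenticate the negotiation messages themselves, which this sketch does not do.

```python
class DowngradeError(Exception):
    """Raised when negotiation would fall below the pinned security floor."""


# Hypothetical floor: versions below this are never acceptable.
MIN_ALLOWED_VERSION = 2


def negotiate(local_versions, peer_versions, min_allowed=MIN_ALLOWED_VERSION):
    """Return the highest common version at or above the floor, else fail closed."""
    common = set(local_versions) & set(peer_versions)
    acceptable = [v for v in common if v >= min_allowed]
    if not acceptable:
        # Fail closed: no silent fallback to a weaker common version.
        raise DowngradeError(
            f"no common version >= {min_allowed}; common={sorted(common)}"
        )
    return max(acceptable)
```

For example, `negotiate([1, 2, 3], [1, 2])` returns 2, while `negotiate([1, 2, 3], [1])` raises `DowngradeError` instead of quietly settling on version 1.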

Failure modes

  • Config drift that weakens security posture over time.
  • Timeout ambiguity causing double-apply or partial state transitions.
  • Recovery paths that only work when nothing is broken.
  • Observability gaps during incidents (missing evidence).

Pitfall

A recovery plan that isn’t exercised will fail when you need it.
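
The timeout-ambiguity failure mode listed above (a client cannot tell whether a timed-out request was applied) is commonly handled with idempotency keys. The sketch below assumes the client attaches a unique `request_id` to every logical operation; the class and field names are hypothetical, and a real store would persist the applied-request table together with the state it protects.

```python
class IdempotentCounter:
    """Applies each request_id at most once, so retries after a timeout are safe."""

    def __init__(self):
        self._value = 0
        self._applied = {}  # request_id -> result returned the first time

    def apply(self, request_id: str, delta: int) -> int:
        # A retry of an already-applied request returns the recorded result
        # instead of applying the delta a second time.
        if request_id in self._applied:
            return self._applied[request_id]
        self._value += delta
        self._applied[request_id] = self._value
        return self._value


store = IdempotentCounter()
first = store.apply("req-1", 5)
retry = store.apply("req-1", 5)  # client timed out and retried the same request
assert first == retry == 5
```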

Design sketch

flowchart TD
  spec["Spec"] --> impl["Impl"]
  impl --> proofs["Proofs / Tests"]
  proofs --> ops["Ops"]
  ops --> incidents["Incidents"]
  incidents --> spec

Implementation notes

Treat integration boundaries (FFI, services, queues) as formal interfaces.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).
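
A minimal sketch of this rule for a single local file, assuming a POSIX filesystem: the function returns (the "ack") only after the data and the rename have been flushed to stable storage. `durable_write` is a hypothetical helper, not an API from any particular storage system.

```python
import os
import tempfile


def durable_write(path: str, payload: bytes) -> bool:
    """Write payload atomically and return True only after it is durable."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to push it to stable storage
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Assumption: POSIX. fsync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return True  # only now is it safe to acknowledge the write
```

If the caller cannot afford the latency of `fsync`, the alternative named in the rule applies: return early, but label the acknowledgement as best-effort in the interface.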

// Integration note: treat FFI/service boundaries as an API with invariants.
// Encode invariants as types where possible, assertions otherwise.
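
As a sketch of "invariants as types", the example below uses a frozen dataclass whose constructor rejects invalid values, so any `FragmentRef` that exists already satisfies the boundary's invariants. The type and its fields are hypothetical and chosen only to fit the storage theme.

```python
from dataclasses import dataclass

_HEX = set("0123456789abcdef")


@dataclass(frozen=True)
class FragmentRef:
    """Hypothetical reference to one fragment of an erasure-coded object."""
    index: int
    total: int
    digest_hex: str  # lowercase SHA-256 hex digest of the fragment

    def __post_init__(self):
        # Invariant 1: the index must address a fragment that exists.
        if self.total <= 0 or not (0 <= self.index < self.total):
            raise ValueError(f"index {self.index} out of range for total {self.total}")
        # Invariant 2: the digest must look like a SHA-256 hex string.
        if len(self.digest_hex) != 64 or not set(self.digest_hex) <= _HEX:
            raise ValueError("digest_hex must be 64 lowercase hex characters")
```

Code that receives a `FragmentRef` across a service or FFI boundary can then skip re-validating these fields, because an invalid instance cannot be constructed and a frozen instance cannot be mutated afterwards through normal attribute assignment.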

Verification strategy

  • End-to-end property tests for the smallest meaningful workflow.
  • Upgrade tests for mixed-version and rollback scenarios.
  • Contract tests at boundaries with adversarial inputs and skew.
  • Fault injection at seams (queues, caches, RPC) not only components.
  • Invariant monitoring tied to incident response playbooks.
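
To make the first item concrete for this memo's topic, here is a sketch of an end-to-end property test over a deliberately tiny erasure code: k data fragments plus one XOR parity fragment, which tolerates the loss of any single fragment. The scheme is a stand-in chosen for brevity, not the code a production system would use, and it assumes the per-fragment SHA-256 digests are stored somewhere the adversary cannot modify.

```python
import hashlib
import random


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int):
    """Split into k equal fragments plus one XOR parity; return (fragments, digests)."""
    size = -(-len(data) // k) or 1  # ceiling division, at least one byte
    padded = data.ljust(size * k, b"\0")
    frags = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = frags[0]
    for f in frags[1:]:
        parity = xor_bytes(parity, f)
    frags.append(parity)
    return frags, [hashlib.sha256(f).digest() for f in frags]


def decode(frags, digests, k: int, length: int) -> bytes:
    """Treat missing or digest-mismatched fragments as erasures; repair at most one."""
    good = [
        f if f is not None and hashlib.sha256(f).digest() == d else None
        for f, d in zip(frags, digests)
    ]
    bad = [i for i, f in enumerate(good) if f is None]
    if len(bad) > 1:
        raise ValueError("more than one fragment lost or corrupted")
    if bad:
        present = [f for f in good if f is not None]
        rebuilt = present[0]
        for f in present[1:]:
            rebuilt = xor_bytes(rebuilt, f)  # XOR of the rest equals the missing one
        good[bad[0]] = rebuilt
    return b"".join(good[:k])[:length]


def check_roundtrip(trials: int = 200, seed: int = 0) -> bool:
    """Property: losing or corrupting any one fragment never changes the decoded data."""
    rng = random.Random(seed)
    for _ in range(trials):
        k = rng.randint(1, 6)
        data = bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 64)))
        frags, digests = encode(data, k)
        victim = rng.randrange(k + 1)
        if rng.random() < 0.5:
            frags[victim] = None  # crash fault: fragment missing
        else:
            # Adversarial fault: flip a bit in every byte of the fragment.
            frags[victim] = xor_bytes(frags[victim], b"\x01" * len(frags[victim]))
        assert decode(frags, digests, k, len(data)) == data
    return True
```

The digest check is what turns an adversarial corruption into an ordinary erasure; without it, a corrupted fragment would be silently mixed into the output.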

Operational notes

  • Use canaries for protocol and crypto changes; define rollback triggers.
  • Treat config drift as an incident: detect, alert, and remediate.
  • Maintain runbooks that reference invariants, not just symptoms.
  • Make security and correctness properties observable (metrics + alerts).
  • Store evidence: audit logs, config diffs, and deployment metadata.

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

  • Rollback events and the conditions that triggered them.
  • Error budget burn + tail latency under load.
  • Retry/timeout rates by endpoint and client cohort.
  • Admission-control / rate-limit rejections (by reason).
  • Authz failures and policy denials (unexpected spikes).

Rollback plan

  • Use canaries and staged rollout; stop early when signals degrade.
  • Prefer backward-compatible changes; avoid “flag day” upgrades.
  • Define an explicit rollback trigger (metrics + thresholds).
  • Keep dual-write / dual-verify windows where appropriate.
  • Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
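
An explicit rollback trigger can be written down as data plus a pure function, which makes it reviewable and testable. The metric names and threshold values below are hypothetical placeholders; real values would come from the service's SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackPolicy:
    """Hypothetical thresholds; tune them to the service's own SLOs."""
    max_error_rate: float = 0.01
    max_p99_latency_ms: float = 500.0
    max_invariant_violations: int = 0


def rollback_reasons(metrics: dict, policy: RollbackPolicy = RollbackPolicy()) -> list:
    """Return the names of breached thresholds; a non-empty list means roll back."""
    reasons = []
    if metrics["error_rate"] > policy.max_error_rate:
        reasons.append("error_rate")
    if metrics["p99_latency_ms"] > policy.max_p99_latency_ms:
        reasons.append("p99_latency_ms")
    if metrics["invariant_violations"] > policy.max_invariant_violations:
        reasons.append("invariant_violations")
    return reasons


canary = {"error_rate": 0.05, "p99_latency_ms": 120.0, "invariant_violations": 1}
print(rollback_reasons(canary))  # lists which thresholds the canary breached
```

Returning the reasons rather than a bare boolean means the rollback event itself records why it fired, which feeds the evidence-preservation item above.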

Evidence

  • Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
    • Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
  • Learn TLA+ (2) — Practical entry point for specification and model checking.
    • Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

  • Which assumptions do you currently enforce only through convention?
  • Where can config silently weaken security properties today?
  • Which properties can be proven locally vs only tested end-to-end?
  • What boundary is most likely to be bypassed under incident pressure?

Checklist

  • Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
  • Telemetry captures correctness signals.
  • Safety properties stated as invariants.
  • Failure modes enumerated with mitigations.
  • Assumptions listed and reviewed.
  • Rollback plan rehearsed and automated.

Further reading

1. Kleppmann M. Designing Data-Intensive Applications [Internet]. O’Reilly Media; 2017. Available from: https://dataintensive.net/
2. LearnTLA. Learn TLA+ [Internet]. Available from: https://learntla.com/