Secure Distributed Storage: Erasure Coding Under Adversaries

Monthly research note. Theme: Deep Systems Notes.

TL;DR

This memo covers secure distributed storage with erasure coding under adversaries: define the model, state the properties, then design the system so those properties remain true under failure and adversarial behavior.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Operational behavior is part of correctness: rollout, rollback, and evidence.
Interfaces must carry assumptions: time, randomness, identity, and ordering.
Contracts need enforcement: tests, assertions, and monitoring—not documentation.
Treat retries, reordering, and partial failure as default conditions.
Automate guardrails; humans are for judgment, not for consistent enforcement.

Why this matters

Operational behavior is part of correctness (rollouts, rollbacks, drift).
Resilience requires making failure modes explicit and bounded.
Mixed-version operation creates states you didn’t model.
Security becomes optional through configuration drift unless enforced.

Key questions

What are your compositional failure modes (partial deploys, mixed versions)?
How do you prevent 'optional security' from appearing via config drift?
Which proofs are worth maintaining vs replacing with tests and monitoring?
What is the smallest integration test that can falsify your assumptions?
Which assumptions leak across boundaries (time, randomness, identity, ordering)?
Where does 'correctness' become an operational contract (SLOs, budgets, policy)?

Assumptions

Observability is imperfect; you debug from partial evidence.
Upgrades are incremental; compatibility is a security boundary.
Integration happens under time pressure; defaults become de facto policy.
Components are built by different teams with different threat models.

Non-goals

Relying on “tribal knowledge” to connect assumptions across layers.
Assuming proofs automatically survive composition.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.
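
One way to make negotiation fail closed is a policy floor that the negotiated version can never go below. The sketch below is illustrative only: the integer version numbers, `MIN_ALLOWED_VERSION`, `DowngradeError`, and `negotiate` are hypothetical names, not part of any real protocol.

```python
from typing import Iterable

# Assumed policy floor; in practice this would come from reviewed, versioned config.
MIN_ALLOWED_VERSION = 3


class DowngradeError(Exception):
    """Raised when negotiation would fall below the policy floor."""


def negotiate(local: Iterable[int], peer: Iterable[int],
              floor: int = MIN_ALLOWED_VERSION) -> int:
    """Pick the highest mutually supported version, never below the floor."""
    common = set(local) & set(peer)
    acceptable = [v for v in common if v >= floor]
    if not acceptable:
        # Fail closed: no silent fallback to a weaker version.
        raise DowngradeError(
            f"no common version >= {floor}; common={sorted(common)}"
        )
    return max(acceptable)
```

The design choice is that a peer offering only weak versions produces an error and an audit trail, not a weaker session.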

Model & invariants

Interface contracts are predicates:

\[ \text{caller obeys } P \;\Rightarrow\; \text{callee guarantees } Q. \]

Treat config as code: version it, review it, and monitor drift.

Make assumptions executable: encode them as assertions, tests, and run-time checks.
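
A minimal sketch of the "caller obeys P, callee guarantees Q" contract as a run-time check. The decorator `contract`, the exception `ContractViolation`, and the example function `split_into_chunks` are hypothetical names chosen for illustration; a real system might prefer types or a dedicated contract library.

```python
from functools import wraps
from typing import Callable


class ContractViolation(AssertionError):
    """Raised when either side of the interface contract is broken."""


def contract(pre: Callable[..., bool], post: Callable[..., bool]):
    """Wrap a function so P is checked on entry and Q on exit."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise ContractViolation(f"{fn.__name__}: caller broke precondition P")
            result = fn(*args, **kwargs)
            if not post(result, *args, **kwargs):
                raise ContractViolation(f"{fn.__name__}: callee broke postcondition Q")
            return result
        return wrapper
    return decorate


@contract(
    pre=lambda data, k: k > 0 and len(data) > 0,
    post=lambda chunks, data, k: len(chunks) == k and b"".join(chunks) == data,
)
def split_into_chunks(data: bytes, k: int) -> list:
    """Split data into exactly k contiguous chunks (trailing chunks may be short or empty)."""
    size = -(-len(data) // k)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(k)]
```

The error message names which side broke the contract, which is the information an on-call engineer needs first.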

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.
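
As a rough sketch of an observable invariant, the code below counts violations instead of crashing, so an alert can fire on drift. The in-process `Counter` stands in for a real metrics client, and `check_invariant` and `firing_alerts` are hypothetical helper names.

```python
from collections import Counter
from typing import Callable

# In-process stand-in for a metrics backend (an assumption for this sketch).
violations: Counter = Counter()


def check_invariant(name: str, predicate: Callable[[], bool]) -> bool:
    """Evaluate an invariant; record a violation rather than raising."""
    ok = bool(predicate())
    if not ok:
        violations[name] += 1
    return ok


def firing_alerts(threshold: int = 1) -> list:
    """Names of invariants whose violation count has reached the threshold."""
    return sorted(name for name, n in violations.items() if n >= threshold)
```

For example, a storage node could call `check_invariant("enough_shards", lambda: healthy >= k)` on every repair pass, where `healthy` and `k` are whatever the deployment tracks.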

Security properties

Downgrade resistance: negotiation can’t silently weaken security posture.
Authenticity: actions are bound to identity and purpose.
Evidence: critical actions emit verifiable audit events.
Integrity: invalid transitions are rejected (and detectable).
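
The evidence property can be illustrated with a hash-chained audit log, where each entry commits to the one before it. This is a sketch under assumptions: the field names (`actor`, `action`, `purpose`) and the helpers `append_event` and `verify_log` are hypothetical, and the chain only gives tamper evidence if the latest hash is anchored somewhere an attacker cannot rewrite (for example, signed or stored externally).

```python
import hashlib
import json

GENESIS = "0" * 64  # fixed starting value for the chain


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list, actor: str, action: str, purpose: str) -> None:
    """Append an event that binds the action to an identity and a purpose."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    event = {"actor": actor, "action": action, "purpose": purpose}
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_log(log: list) -> bool:
    """Recompute the chain; any edited, dropped, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```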

Failure modes

Config drift that weakens security posture over time.
Timeout ambiguity causing double-apply or partial state transitions.
Recovery paths that only work when nothing is broken.
Observability gaps during incidents (missing evidence).
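
Timeout ambiguity is commonly handled with idempotency keys: the client retries with the same request ID, and the server applies it at most once. The class below is a minimal in-memory sketch with hypothetical names (`IdempotentStore`, `apply`); a real service would need to persist the dedup table atomically with the state change.

```python
class IdempotentStore:
    """Applies each request at most once, keyed by a client-chosen request ID."""

    def __init__(self):
        self.value = 0
        self._results = {}  # request_id -> result of the first application

    def apply(self, request_id: str, delta: int) -> int:
        if request_id in self._results:
            # Retry of a request that already landed: replay the recorded result.
            return self._results[request_id]
        self.value += delta
        self._results[request_id] = self.value
        return self.value
```

A client that times out on `apply("r1", 5)` can resend it safely; the second call returns the first result without changing state.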

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

flowchart TD
  spec["Spec"] --> impl["Impl"]
  impl --> proofs["Proofs / Tests"]
  proofs --> ops["Ops"]
  ops --> incidents["Incidents"]
  incidents --> spec

Implementation notes

Treat integration boundaries (FFI, services, queues) as formal interfaces.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).
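
A minimal sketch of "ack after durability" for a local append-only file, assuming a POSIX-like filesystem. The function name `durable_append` is hypothetical. Note the limits: a newly created file may also need its parent directory synced, and hardware write caches can still weaken the guarantee.

```python
import os


def durable_append(path: str, record: bytes) -> bool:
    """Append a record and return True (the ack) only after fsync succeeds."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to stable storage
    # Any exception above propagates, so the caller never sees a false ack.
    return True
```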

// Integration note: treat FFI/service boundaries as an API with invariants.
// Encode invariants as types where possible, assertions otherwise.

Verification strategy

End-to-end property tests for the smallest meaningful workflow.
Upgrade tests for mixed-version and rollback scenarios.
Contract tests at boundaries with adversarial inputs and skew.
Fault injection at seams (queues, caches, RPC) not only components.
Invariant monitoring tied to incident response playbooks.
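
To make the first and third items concrete, here is a small property test over a toy erasure code: k data shards plus one XOR parity shard, with a SHA-256 digest per shard. This is an illustrative stand-in, not a production code such as Reed-Solomon, and it assumes the digests are stored where the adversary cannot modify them. All function names are hypothetical.

```python
import hashlib
import random


def encode(data_shards):
    """Return data shards plus one XOR parity shard, and a digest per shard."""
    parity = bytes(len(data_shards[0]))
    for shard in data_shards:
        parity = bytes(a ^ b for a, b in zip(parity, shard))
    shards = list(data_shards) + [parity]
    digests = [hashlib.sha256(s).digest() for s in shards]
    return shards, digests


def decode(shards, digests):
    """Recover data shards; a missing or digest-failing shard counts as erased."""
    bad = [i for i, s in enumerate(shards)
           if s is None or hashlib.sha256(s).digest() != digests[i]]
    if len(bad) > 1:
        raise ValueError("single parity cannot repair more than one erasure")
    if bad:
        size = len(next(s for i, s in enumerate(shards) if i not in bad))
        rebuilt = bytes(size)
        for i, s in enumerate(shards):
            if i != bad[0]:
                rebuilt = bytes(a ^ b for a, b in zip(rebuilt, s))
        shards = shards[:bad[0]] + [rebuilt] + shards[bad[0] + 1:]
    return shards[:-1]


def property_holds(trials=200, seed=0):
    """Round-trip survives one adversarial drop or bit flip on any shard."""
    rng = random.Random(seed)
    for _ in range(trials):
        k, size = rng.randint(1, 5), rng.randint(1, 16)
        data = [bytes(rng.randrange(256) for _ in range(size)) for _ in range(k)]
        shards, digests = encode(data)
        victim = rng.randrange(k + 1)
        if rng.random() < 0.5:
            shards[victim] = None  # adversary drops the shard
        else:
            flipped = bytearray(shards[victim])
            flipped[0] ^= 0x01     # adversary flips one bit
            shards[victim] = bytes(flipped)
        if decode(shards, digests) != data:
            return False
    return True
```

The digest check is what turns silent corruption into an erasure the code can repair; without it, a flipped bit would be decoded as valid data.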

Operational notes

Use canaries for protocol and crypto changes; define rollback triggers.
Treat config drift as an incident: detect, alert, and remediate.
Maintain runbooks that reference invariants, not just symptoms.
Make security and correctness properties observable (metrics + alerts).
Store evidence: audit logs, config diffs, and deployment metadata.
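
Drift detection can be sketched as comparing a fingerprint of the live config against a reviewed baseline. The helper names `fingerprint` and `drifted_keys` and the example keys are hypothetical; the sketch assumes the config is a flat, JSON-serializable dictionary.

```python
import hashlib
import json

_MISSING = object()  # distinguishes "key absent" from "key set to None"


def fingerprint(config: dict) -> str:
    """Stable hash of a config, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_keys(baseline: dict, live: dict) -> list:
    """Keys whose value differs from, or is missing relative to, the baseline."""
    keys = set(baseline) | set(live)
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != live.get(k, _MISSING)
    )
```

A mismatch in `fingerprint` is the cheap alert signal; `drifted_keys` is the detail attached to the incident, for example `["verify_shards"]` when a hypothetical integrity check was switched off.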

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Rollback events and the conditions that triggered them.
Error budget burn + tail latency under load.
Retry/timeout rates by endpoint and client cohort.
Admission-control / rate-limit rejections (by reason).
Authz failures and policy denials (unexpected spikes).

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Define an explicit rollback trigger (metrics + thresholds).
Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
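
An explicit rollback trigger can be written down as data plus a pure function, so it can be reviewed and tested like any other code. The thresholds and names below (`RollbackPolicy`, `rollback_reasons`) are placeholders for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackPolicy:
    max_error_rate: float = 0.01        # hypothetical threshold
    max_p99_latency_ms: float = 500.0   # hypothetical threshold
    max_invariant_violations: int = 0   # any violation triggers rollback


def rollback_reasons(policy: RollbackPolicy, error_rate: float,
                     p99_latency_ms: float, invariant_violations: int) -> list:
    """Return the list of breached thresholds; an empty list means keep rolling out."""
    reasons = []
    if error_rate > policy.max_error_rate:
        reasons.append("error_rate")
    if p99_latency_ms > policy.max_p99_latency_ms:
        reasons.append("p99_latency_ms")
    if invariant_violations > policy.max_invariant_violations:
        reasons.append("invariant_violations")
    return reasons
```

Returning the reasons, not just a boolean, gives the rollback event the evidence the monitoring section asks for.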

Evidence

Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Learn TLA+ (2) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

Which assumptions do you currently enforce only through convention?
Where can config silently weaken security properties today?
Which properties can be proven locally vs only tested end-to-end?
What boundary is most likely to be bypassed under incident pressure?

Checklist

Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Telemetry captures correctness signals.
Safety properties stated as invariants.
Failure modes enumerated with mitigations.
Assumptions listed and reviewed.
Rollback plan rehearsed and automated.
