Gossip & Epidemic Dissemination: Fast, Probabilistic, and Weird

Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

Gossip & Epidemic Dissemination: Fast, Probabilistic, and Weird as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.

Key takeaways

Backpressure and admission control are correctness mechanisms under load.
Treat membership changes and compaction as protocol events—not operational details.
Write the safety property first; liveness is always conditional on timing assumptions.
Bind security decisions to evidence (audit, invariants, telemetry).
Define safety properties before performance goals.

Why this matters

Tail latency is a protocol input: it changes who retries and when.
Safety failures are permanent; liveness failures are (sometimes) recoverable.
Global systems fail in correlated ways (regions, dependencies, routing).
Observability must explain protocol state, not just latency.

Key questions

Where do you pay for liveness (timeouts, leader election, reconfiguration)?
Which safety property is non-negotiable (no double-commit, no forks, no split brain)?
What is the compaction story (snapshots, log truncation, state transfer)?
How do clients discover leaders safely (and what happens during flaps)?
What is the failure model (crash, byzantine, partitions, reordering)?
What is your reconfiguration model (joint consensus, epochs, leases)?

Assumptions

Clients retry and amplify load right when the system is weakest.
Packets can be duplicated and reordered; acks can be lost.
Nodes restart with partial state unless you prove durability.
Delays are unbounded during incidents; timeouts are guesses.

Non-goals

Assuming the network eventually behaves “nicely” under load.
Pretending backpressure is an implementation detail.

Attack surface

Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.

Model & invariants

A common safety shape for replicated logs:

\forall i:\ \text{Committed}(i)\Rightarrow \forall r:\ \text{Log}_r[i] = \text{Log}_\text{leader}[i].

Treat membership changes as protocol events, not control-plane side effects.

Liveness is always conditional: specify when progress is expected and what you do otherwise.

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

Evidence: critical actions emit verifiable audit events.
Replay resistance: duplicated inputs do not change outcomes.
Authenticity: actions are bound to identity and purpose.
Downgrade resistance: negotiation can’t silently weaken security posture.

Failure modes

Config drift that weakens security posture over time.
Timeout ambiguity causing double-apply or partial state transitions.
Recovery paths that only work when nothing is broken.
Mixed-version behavior that violates assumptions silently.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

stateDiagram-v2
  [*] --> Follower
  Follower --> Candidate: timeout
  Candidate --> Leader: win quorum
  Candidate --> Follower: lose
  Leader --> Follower: stepdown

Implementation notes

Protocols fail at the boundaries: timeouts, membership, compaction, and overload.

Rule of thumb

Bound work per request: parse, validate, and cap cost before you allocate heavy resources.

Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control

Verification strategy

Stress + skew tests: hot keys, slow disks, noisy neighbors.
Upgrade tests: mixed versions and rolling deploy invariants.
Jepsen-style fault injection: partitions + reordering + client retries.
Deterministic replay of network traces to reproduce rare failures.
Linearizability checks for read/write APIs that claim it.

Operational notes

Rehearse region failover and reconfiguration under load.
Expose protocol state: term/epoch, leader, commit index, config version.
Make client behavior part of the system: document retry semantics.
Treat compaction and snapshot install as first-class SLOs.
Rate-limit retries and apply admission control before saturation.

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.

What to monitor

Authz failures and policy denials (unexpected spikes).
Admission-control / rate-limit rejections (by reason).
Rollback events and the conditions that triggered them.
Retry/timeout rates by endpoint and client cohort.
Error budget burn + tail latency under load.

Rollback plan

Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
Define an explicit rollback trigger (metrics + thresholds).
Use canaries and staged rollout; stop early when signals degrade.

Evidence

Jepsen (1) — Testing correctness under partitions and faults.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Time, Clocks, and the Ordering of Events (Lamport) (2) — Causality, ordering, and why clocks are tricky.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.

Open questions

Which invariants are violated first under overload: latency, availability, or correctness?
What is the worst-case recovery time after a leader + disk failure?
Where does your protocol assume synchrony without admitting it?
How do you prevent “operator fixes” from changing safety properties?

Checklist

Failure modes enumerated with mitigations.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Assumptions listed and reviewed.
Rollback plan rehearsed and automated.
Telemetry captures correctness signals.
Safety properties stated as invariants.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading