State Machine Replication: Log Design, Snapshots, and Compaction

Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

A focused memo on State Machine Replication: Log Design, Snapshots, and Compaction: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

Expose protocol state (epoch/term/commit index) as first-class telemetry.
Treat membership changes and compaction as protocol events—not operational details.
Backpressure and admission control are correctness mechanisms under load.
Treat retries, reordering, and partial failure as default conditions.
Bind security decisions to evidence (audit, invariants, telemetry).

Why this matters

Safety failures are permanent; liveness failures are (sometimes) recoverable.
Global systems fail in correlated ways (regions, dependencies, routing).
If your protocol isn’t testable under reordering, it isn’t deployable.
Observability must explain protocol state, not just latency.

Key questions

What does “read” mean under replication lag?
What is the unit of ordering (per key, per partition, global)?
What is your reconfiguration model (joint consensus, epochs, leases)?
How do clients discover leaders safely (and what happens during flaps)?
Where do you pay for liveness (timeouts, leader election, reconfiguration)?
How do you prevent overload from becoming inconsistency?

Assumptions

Workload is skewed: hot keys exist and dominate.
Nodes restart with partial state unless you prove durability.
Clients retry and amplify load right when the system is weakest.
Packets can be duplicated and reordered; acks can be lost.

Non-goals

Pretending backpressure is an implementation detail.
Assuming the network eventually behaves “nicely” under load.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

A common safety shape for replicated logs:

\forall i:\ \text{Committed}(i)\Rightarrow \forall r:\ \text{Log}_r[i] = \text{Log}_\text{leader}[i].

Liveness is always conditional: specify when progress is expected and what you do otherwise.

Treat membership changes as protocol events, not control-plane side effects.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.

Security properties

Downgrade resistance: negotiation can’t silently weaken security posture.
Evidence: critical actions emit verifiable audit events.
Authenticity: actions are bound to identity and purpose.
Replay resistance: duplicated inputs do not change outcomes.

Failure modes

Mixed-version behavior that violates assumptions silently.
Observability gaps during incidents (missing evidence).
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Timeout ambiguity causing double-apply or partial state transitions.

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

flowchart TD
  client["Client"] --> leader["Leader"]
  leader --> log["Replicated Log"]
  log --> snap["Snapshot"]
  snap --> recover["Recovery / Catch-up"]
  leader --> reconfig["Reconfiguration"]

Implementation notes

Your protocol is an interface between failures and invariants. Encode both.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.

Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control

Verification strategy

Jepsen-style fault injection: partitions + reordering + client retries.
Upgrade tests: mixed versions and rolling deploy invariants.
Deterministic replay of network traces to reproduce rare failures.
Model checking the smallest core (timeouts, election, reconfiguration).
Stress + skew tests: hot keys, slow disks, noisy neighbors.

Operational notes

Expose protocol state: term/epoch, leader, commit index, config version.
Make client behavior part of the system: document retry semantics.
Treat compaction and snapshot install as first-class SLOs.
Rehearse region failover and reconfiguration under load.
Rate-limit retries and apply admission control before saturation.

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Rollback events and the conditions that triggered them.
Error budget burn + tail latency under load.
Authz failures and policy denials (unexpected spikes).
Admission-control / rate-limit rejections (by reason).
Invariant violation rate (should be ~0).

Rollback plan

Keep dual-write / dual-verify windows where appropriate.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.

Evidence

Time, Clocks, and the Ordering of Events (Lamport) (1) — Causality, ordering, and why clocks are tricky.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.
Jepsen (2) — Testing correctness under partitions and faults.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

Which invariants are violated first under overload: latency, availability, or correctness?
What is the worst-case recovery time after a leader + disk failure?
How do you prevent “operator fixes” from changing safety properties?
Where does your protocol assume synchrony without admitting it?

Checklist

Telemetry captures correctness signals.
Rollback plan rehearsed and automated.
Assumptions listed and reviewed.
Safety properties stated as invariants.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Failure modes enumerated with mitigations.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading