Consistency Models: Linearizability, Serializability, and What You Actually Need

Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

A focused memo on Consistency Models: Linearizability, Serializability, and What You Actually Need: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Mixed-version operation is the default; upgrades must preserve invariants.
Treat membership changes and compaction as protocol events—not operational details.
Backpressure and admission control are correctness mechanisms under load.
Bind security decisions to evidence (audit, invariants, telemetry).
Make boundaries boring: validate inputs, cap costs, and be deterministic where needed.

Why this matters

Observability must explain protocol state, not just latency.
Tail latency is a protocol input: it changes who retries and when.
Backpressure and fairness are part of correctness when resources are finite.
Global systems fail in correlated ways (regions, dependencies, routing).

Key questions

What is the failure model (crash, byzantine, partitions, reordering)?
Which safety property is non-negotiable (no double-commit, no forks, no split brain)?
How do clients discover leaders safely (and what happens during flaps)?
Where do you pay for liveness (timeouts, leader election, reconfiguration)?
What is your reconfiguration model (joint consensus, epochs, leases)?
What is the unit of ordering (per key, per partition, global)?

Assumptions

Nodes restart with partial state unless you prove durability.
Clocks drift; leases can be violated under GC pauses or VM stalls.
Reconfigurations happen mid-incident (the worst time).
Partitions happen at multiple layers (network, DNS, LB, service mesh).

Non-goals

Pretending backpressure is an implementation detail.
Assuming the network eventually behaves “nicely” under load.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

A common safety shape for replicated logs:

\forall i:\ \text{Committed}(i)\Rightarrow \forall r:\ \text{Log}_r[i] = \text{Log}_\text{leader}[i].

Write down the safety property first. If it’s not written, it’s not implemented.

Treat membership changes as protocol events, not control-plane side effects.

Invariant

Make the “impossible state” observable: a metric or alert that fires when invariants drift.

Security properties

Replay resistance: duplicated inputs do not change outcomes.
Authenticity: actions are bound to identity and purpose.
Least authority: privileges are scoped by purpose and time.
Downgrade resistance: negotiation can’t silently weaken security posture.

Failure modes

Observability gaps during incidents (missing evidence).
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Config drift that weakens security posture over time.
Mixed-version behavior that violates assumptions silently.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

sequenceDiagram
  participant C as Client
  participant L as Leader
  participant F1 as Follower 1
  participant F2 as Follower 2
  C->>L: propose(cmd)
  L->>F1: appendEntries
  L->>F2: appendEntries
  F1-->>L: ack
  F2-->>L: ack
  L-->>C: commit(result)

Implementation notes

Your protocol is an interface between failures and invariants. Encode both.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

type LogIndex = u64;

#[derive(Clone, Debug)]
struct Entry {
  index: LogIndex,
  term: u64,
  bytes: Vec<u8>,
}

// Persist(term, vote, log) before acknowledging anything.

Verification strategy

Model checking the smallest core (timeouts, election, reconfiguration).
Linearizability checks for read/write APIs that claim it.
Stress + skew tests: hot keys, slow disks, noisy neighbors.
Upgrade tests: mixed versions and rolling deploy invariants.
Deterministic replay of network traces to reproduce rare failures.

Operational notes

Prefer monotonic time sources for leases; alert on clock discontinuities.
Treat compaction and snapshot install as first-class SLOs.
Rate-limit retries and apply admission control before saturation.
Rehearse region failover and reconfiguration under load.
Expose protocol state: term/epoch, leader, commit index, config version.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Rollback events and the conditions that triggered them.
Authz failures and policy denials (unexpected spikes).
Error budget burn + tail latency under load.

Rollback plan

Keep dual-write / dual-verify windows where appropriate.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).

Evidence

Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
Learn TLA+ (2) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

Which invariants are violated first under overload: latency, availability, or correctness?
What is the worst-case recovery time after a leader + disk failure?
How do you prevent “operator fixes” from changing safety properties?
Where does your protocol assume synchrony without admitting it?

Checklist

Safety properties stated as invariants.
Assumptions listed and reviewed.
Telemetry captures correctness signals.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Rollback plan rehearsed and automated.
Failure modes enumerated with mitigations.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading