Transactions: 2PC, 3PC, and Coordinators You Can't Trust

Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

A focused memo on Transactions: 2PC, 3PC, and Coordinators You Can't Trust: define the model, state the properties, then design the system so those properties remain true under failure and adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Mixed-version operation is the default; upgrades must preserve invariants.
Backpressure and admission control are correctness mechanisms under load.
Write the safety property first; liveness is always conditional on timing assumptions.
Automate guardrails; humans are for judgment, not for consistent enforcement.
Define safety properties before performance goals.

Why this matters

Operational simplicity is a security property: fewer modes, fewer surprises.
If your protocol isn’t testable under reordering, it isn’t deployable.
State compaction and snapshots are where correctness goes to die quietly.
Observability must explain protocol state, not just latency.

Key questions

What is the failure model (crash, byzantine, partitions, reordering)?
What does “read” mean under replication lag?
Which safety property is non-negotiable (no double-commit, no forks, no split brain)?
How do clients discover leaders safely (and what happens during flaps)?
Which components require determinism for reproducibility?
What is your reconfiguration model (joint consensus, epochs, leases)?

Assumptions

Delays are unbounded during incidents; timeouts are guesses.
Nodes restart with partial state unless you prove durability.
Workload is skewed: hot keys exist and dominate.
Packets can be duplicated and reordered; acks can be lost.

Non-goals

Pretending backpressure is an implementation detail.
Assuming the network eventually behaves “nicely” under load.

Attack surface

Observability pipelines can be attacked (cardinality explosions, log injection). Protect them.

Model & invariants

A common safety shape for replicated logs:

\forall i:\ \text{Committed}(i)\Rightarrow \forall r:\ \text{Log}_r[i] = \text{Log}_\text{leader}[i].

Write down the safety property first. If it’s not written, it’s not implemented.

Liveness is always conditional: specify when progress is expected and what you do otherwise.

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.

Security properties

Replay resistance: duplicated inputs do not change outcomes.
Authenticity: actions are bound to identity and purpose.
Evidence: critical actions emit verifiable audit events.
Least authority: privileges are scoped by purpose and time.

Failure modes

Mixed-version behavior that violates assumptions silently.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Observability gaps during incidents (missing evidence).
Recovery paths that only work when nothing is broken.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

flowchart TD
  client["Client"] --> leader["Leader"]
  leader --> log["Replicated Log"]
  log --> snap["Snapshot"]
  snap --> recover["Recovery / Catch-up"]
  leader --> reconfig["Reconfiguration"]

Implementation notes

Make the state machine explicit; then make persistence and networking boring.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control

Verification strategy

Model checking the smallest core (timeouts, election, reconfiguration).
Jepsen-style fault injection: partitions + reordering + client retries.
Upgrade tests: mixed versions and rolling deploy invariants.
Stress + skew tests: hot keys, slow disks, noisy neighbors.
Linearizability checks for read/write APIs that claim it.

Operational notes

Expose protocol state: term/epoch, leader, commit index, config version.
Treat compaction and snapshot install as first-class SLOs.
Rehearse region failover and reconfiguration under load.
Prefer monotonic time sources for leases; alert on clock discontinuities.
Make client behavior part of the system: document retry semantics.

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.

What to monitor

Invariant violation rate (should be ~0).
Retry/timeout rates by endpoint and client cohort.
Rollback events and the conditions that triggered them.
Admission-control / rate-limit rejections (by reason).
Error budget burn + tail latency under load.

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Use canaries and staged rollout; stop early when signals degrade.
Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.

Evidence

Site Reliability Engineering (Google) (1) — Error budgets, incident response, and reliability as an engineering discipline.
- Evidence: Error budgets and incident response are correctness controls; tie monitoring and rollback triggers to SLO burn.
In Search of an Understandable Consensus Algorithm (Raft) (2) — Consensus with explicit state machines and practical tradeoffs.
- Evidence: Track term/commitIndex as explicit evidence; test leader changes and log conflicts as part of rollback behavior.

Open questions

Which invariants are violated first under overload: latency, availability, or correctness?
Where does your protocol assume synchrony without admitting it?
How do you prevent “operator fixes” from changing safety properties?
What is the worst-case recovery time after a leader + disk failure?

Checklist

Telemetry captures correctness signals.
Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
Safety properties stated as invariants.
Assumptions listed and reviewed.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading