Geo-Replication: Latency Budgets and Cross-Region Failure Modes

Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

Geo-Replication: Latency Budgets and Cross-Region Failure Modes as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Write the safety property first; liveness is always conditional on timing assumptions.
Expose protocol state (epoch/term/commit index) as first-class telemetry.
Backpressure and admission control are correctness mechanisms under load.
Make failure modes explicit and observable.
Treat retries, reordering, and partial failure as default conditions.

Why this matters

Tail latency is a protocol input: it changes who retries and when.
Mixed-version operation is the default state of real deployments.
Operational simplicity is a security property: fewer modes, fewer surprises.
State compaction and snapshots are where correctness goes to die quietly.

Key questions

Which safety property is non-negotiable (no double-commit, no forks, no split brain)?
What is the compaction story (snapshots, log truncation, state transfer)?
What is the failure model (crash, byzantine, partitions, reordering)?
Which components require determinism for reproducibility?
What is the unit of ordering (per key, per partition, global)?
How do you prevent overload from becoming inconsistency?

Assumptions

Partitions happen at multiple layers (network, DNS, LB, service mesh).
Clients retry and amplify load right when the system is weakest.
Nodes restart with partial state unless you prove durability.
Reconfigurations happen mid-incident (the worst time).

Non-goals

Pretending backpressure is an implementation detail.
Relying on global time for ordering without strong synchronization assumptions.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

For quorum-based protocols, the intersection property is the backbone of safety:

\text{Crash-fault: } |Q| > \frac{n}{2}\qquad\qquad \text{Byzantine: } n \ge 3f+1,\ |Q| \ge 2f+1.

Treat membership changes as protocol events, not control-plane side effects.

Write down the safety property first. If it’s not written, it’s not implemented.

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.

Security properties

Integrity: invalid transitions are rejected (and detectable).
Evidence: critical actions emit verifiable audit events.
Downgrade resistance: negotiation can’t silently weaken security posture.
Replay resistance: duplicated inputs do not change outcomes.

Failure modes

Config drift that weakens security posture over time.
Recovery paths that only work when nothing is broken.
Observability gaps during incidents (missing evidence).
Mixed-version behavior that violates assumptions silently.

Pitfall

Mixed-version deployments create states you never tested—plan for them explicitly.

Design sketch

sequenceDiagram
  participant C as Client
  participant L as Leader
  participant F1 as Follower 1
  participant F2 as Follower 2
  C->>L: propose(cmd)
  L->>F1: appendEntries
  L->>F2: appendEntries
  F1-->>L: ack
  F2-->>L: ack
  L-->>C: commit(result)

Implementation notes

Protocols fail at the boundaries: timeouts, membership, compaction, and overload.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Operational invariants to monitor:
- leader_changes_per_minute
- commit_index_monotonic
- snapshot_install_failures
- quorum_acks_latency_p99
- rejected_requests_due_to_admission_control

Verification strategy

Jepsen-style fault injection: partitions + reordering + client retries.
Model checking the smallest core (timeouts, election, reconfiguration).
Stress + skew tests: hot keys, slow disks, noisy neighbors.
Upgrade tests: mixed versions and rolling deploy invariants.
Linearizability checks for read/write APIs that claim it.

Operational notes

Treat compaction and snapshot install as first-class SLOs.
Rehearse region failover and reconfiguration under load.
Expose protocol state: term/epoch, leader, commit index, config version.
Make client behavior part of the system: document retry semantics.
Rate-limit retries and apply admission control before saturation.

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Invariant violation rate (should be ~0).
Retry/timeout rates by endpoint and client cohort.
Admission-control / rate-limit rejections (by reason).
Rollback events and the conditions that triggered them.
Error budget burn + tail latency under load.

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Keep dual-write / dual-verify windows where appropriate.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Use canaries and staged rollout; stop early when signals degrade.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.

Evidence

Learn TLA+ (1) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.
Time, Clocks, and the Ordering of Events (Lamport) (2) — Causality, ordering, and why clocks are tricky.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.

Open questions

How do you prevent “operator fixes” from changing safety properties?
Where does your protocol assume synchrony without admitting it?
What is the worst-case recovery time after a leader + disk failure?
Which invariants are violated first under overload: latency, availability, or correctness?

Checklist

Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Rollback plan rehearsed and automated.
Assumptions listed and reviewed.
Safety properties stated as invariants.
Failure modes enumerated with mitigations.
Telemetry captures correctness signals.

TL;DR

Key takeaways

Why this matters

Key questions

Assumptions

Non-goals

Model & invariants

Security properties

Failure modes

Design sketch

Implementation notes

Verification strategy

Operational notes

What to monitor

Rollback plan

Evidence

Open questions

Checklist

Further reading