Queues & Streams: Exactly-Once Semantics Without Lying to Yourself

Monthly research note. Theme: Distributed Systems Under Failure.

TL;DR

This memo covers exactly-once semantics in queues and streams: define the model, state the properties, then design the system so those properties stay true under failure and against adversaries.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Backpressure and admission control are correctness mechanisms under load.
Write the safety property first; liveness is always conditional on timing assumptions.
Mixed-version operation is the default; upgrades must preserve invariants.
Write assumptions down; treat them as interfaces.
Define safety properties before performance goals.

Why this matters

Operational simplicity is a security property: fewer modes, fewer surprises.
Observability must explain protocol state, not just latency.
State compaction and snapshots are where correctness goes to die quietly.
If your protocol isn’t testable under reordering, it isn’t deployable.

Key questions

What does “read” mean under replication lag?
What is the unit of ordering (per key, per partition, global)?
Which safety property is non-negotiable (no double-commit, no forks, no split brain)?
Where do you pay for liveness (timeouts, leader election, reconfiguration)?
What is your reconfiguration model (joint consensus, epochs, leases)?
How do you prevent overload from becoming inconsistency?

Assumptions

Delays are unbounded during incidents; timeouts are guesses.
Reconfigurations happen mid-incident (the worst time).
Clocks drift; leases can be violated under GC pauses or VM stalls.
Partitions happen at multiple layers (network, DNS, LB, service mesh).

Non-goals

Pretending backpressure is an implementation detail.
Assuming the network eventually behaves “nicely” under load.

Attack surface

Parsing is an attacker-controlled interface: validate early and fail fast.

Model & invariants

For quorum-based protocols, the intersection property is the backbone of safety:

\text{Crash-fault: } |Q| > \frac{n}{2}\qquad\qquad \text{Byzantine: } n \ge 3f+1,\ |Q| \ge 2f+1.
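
To make the intersection property concrete, here is a minimal sketch that computes quorum sizes and the smallest possible overlap between two quorums. The function names (`crash_quorum`, `byzantine_quorum`, `min_overlap`) are illustrative and not taken from any library, and the Byzantine case assumes exactly n = 3f + 1 nodes.

```rust
/// Smallest majority quorum for a crash-fault system of `n` nodes: |Q| > n/2.
fn crash_quorum(n: u64) -> u64 {
    n / 2 + 1
}

/// Byzantine quorum size for `f` faulty nodes, assuming n = 3f + 1: |Q| = 2f + 1.
fn byzantine_quorum(f: u64) -> u64 {
    2 * f + 1
}

/// Minimum number of nodes shared by any two quorums of size `q` drawn
/// from `n` nodes (pigeonhole: 2q - n, floored at zero).
fn min_overlap(n: u64, q: u64) -> u64 {
    (2 * q).saturating_sub(n)
}
```

With n = 5, a majority quorum has 3 nodes and any two quorums share at least 1 node. With f = 1 and n = 4, a Byzantine quorum has 3 nodes and any two share at least 2 = f + 1, so at least one shared node is honest.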

Make overload explicit: admission control is a protocol boundary.
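
One way to treat admission control as a protocol boundary is to make rejection a typed outcome that carries its reason, rather than an incidental timeout. The sketch below assumes a simple in-flight limit; `AdmissionController` and `Admission` are hypothetical names, and a real system would likely add per-tenant limits and queue-time budgets.

```rust
/// Outcome of an admission decision. Rejection carries the evidence
/// needed to report it by reason.
#[derive(Debug, PartialEq)]
enum Admission {
    Accepted,
    Rejected { in_flight: usize, limit: usize },
}

/// Hypothetical controller that bounds the number of in-flight requests.
struct AdmissionController {
    limit: usize,
    in_flight: usize,
}

impl AdmissionController {
    fn new(limit: usize) -> Self {
        Self { limit, in_flight: 0 }
    }

    /// Admits a request only while the in-flight count is below the limit.
    fn try_admit(&mut self) -> Admission {
        if self.in_flight >= self.limit {
            return Admission::Rejected {
                in_flight: self.in_flight,
                limit: self.limit,
            };
        }
        self.in_flight += 1;
        Admission::Accepted
    }

    /// Releases one slot when a request finishes (successfully or not).
    fn complete(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }
}
```

Rejecting before saturation keeps the accepted work inside the timing assumptions the rest of the protocol depends on.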

Treat membership changes as protocol events, not control-plane side effects.

Invariant

Invariants must be checkable from evidence you actually have (state + logs + counters).
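
As an illustration of checking invariants from evidence, the sketch below compares two consecutive observations of a node's exposed state. It assumes Raft-style counters observed within a single process lifetime; the `Snapshot` struct and `violations` function are hypothetical, and the four checks are examples rather than a complete set.

```rust
/// One observation of a node's exposed protocol state (hypothetical shape).
#[derive(Debug, Clone, Copy)]
struct Snapshot {
    term: u64,
    commit_index: u64,
    applied_index: u64,
    last_log_index: u64,
}

/// Returns a description of every invariant that `cur` violates,
/// given the previous observation `prev` of the same node.
fn violations(prev: &Snapshot, cur: &Snapshot) -> Vec<&'static str> {
    let mut found = Vec::new();
    if cur.term < prev.term {
        found.push("term went backwards");
    }
    if cur.commit_index < prev.commit_index {
        found.push("commit index went backwards");
    }
    if cur.applied_index > cur.commit_index {
        found.push("applied an uncommitted entry");
    }
    if cur.commit_index > cur.last_log_index {
        found.push("commit index is past the end of the log");
    }
    found
}
```

Each check uses only values the node already exports, so the same function can run against live metrics or against logs after an incident.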

Security properties

Integrity: invalid transitions are rejected (and detectable).
Evidence: critical actions emit verifiable audit events.
Replay resistance: duplicated inputs do not change outcomes.
Authenticity: actions are bound to identity and purpose.
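
Replay resistance is the property that turns at-least-once delivery into an exactly-once effect. A minimal sketch, assuming each producer stamps its messages with a strictly increasing sequence number and that messages from one producer arrive in order; `Ledger` and `Applied` are hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Applied {
    Fresh,
    Duplicate,
}

/// Hypothetical consumer state: a balance plus the highest sequence
/// number already applied for each producer.
struct Ledger {
    balance: i64,
    last_seq: HashMap<String, u64>,
}

impl Ledger {
    fn new() -> Self {
        Self { balance: 0, last_seq: HashMap::new() }
    }

    /// Applies `delta` at most once per (producer, seq) pair.
    /// Assumes per-producer in-order delivery: a sequence number at or
    /// below the last applied one is treated as a redelivery.
    fn apply(&mut self, producer: &str, seq: u64, delta: i64) -> Applied {
        match self.last_seq.get(producer) {
            Some(&last) if seq <= last => Applied::Duplicate,
            _ => {
                self.balance += delta;
                self.last_seq.insert(producer.to_string(), seq);
                Applied::Fresh
            }
        }
    }
}
```

The guarantee only survives a crash if `balance` and `last_seq` are persisted atomically; storing them separately reintroduces the double-apply window.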

Failure modes

Recovery paths that only work when nothing is broken.
Timeout ambiguity causing double-apply or partial state transitions.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Config drift that weakens security posture over time.

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

stateDiagram-v2
  [*] --> Follower
  Follower --> Candidate: timeout
  Candidate --> Leader: win quorum
  Candidate --> Follower: lose
  Leader --> Follower: stepdown
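
The diagram above can be written as a total transition function, which makes every unlisted transition an explicit rejection. This is a sketch of the four edges shown and nothing more; `Role`, `Event`, and `step` are illustrative names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    Timeout,
    WinQuorum,
    Lose,
    StepDown,
}

/// Returns the next role, or `None` when the diagram has no edge
/// for this (role, event) pair.
fn step(role: Role, event: Event) -> Option<Role> {
    match (role, event) {
        (Role::Follower, Event::Timeout) => Some(Role::Candidate),
        (Role::Candidate, Event::WinQuorum) => Some(Role::Leader),
        (Role::Candidate, Event::Lose) => Some(Role::Follower),
        (Role::Leader, Event::StepDown) => Some(Role::Follower),
        _ => None,
    }
}
```

Returning `None` instead of silently staying in place gives the caller a point at which to log or count invalid transitions.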

Implementation notes

Make the state machine explicit; then make persistence and networking boring.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

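/// Position of an entry in the replicated log.
type LogIndex = u64;

/// One replicated log entry.
#[derive(Clone, Debug)]
struct Entry {
  index: LogIndex, // position in the log
  term: u64,       // term in which the entry was created
  bytes: Vec<u8>,  // opaque command payload
}

// Persist (term, vote, log) to stable storage before acknowledging anything.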
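
The persist-before-acknowledge rule can be sketched as follows. The `Storage` trait, `MemStorage`, and `Node` are hypothetical stand-ins: a real store would write and fsync, while `MemStorage` only counts writes and can be told to fail so the error path is visible.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    index: u64,
    term: u64,
    bytes: Vec<u8>,
}

/// Hypothetical durable store. A real implementation would write and fsync.
trait Storage {
    fn persist(&mut self, term: u64, voted_for: Option<u64>, log: &[Entry]) -> Result<(), String>;
}

/// In-memory stand-in that counts successful writes and can simulate failure.
struct MemStorage {
    fail: bool,
    writes: usize,
}

impl Storage for MemStorage {
    fn persist(&mut self, _term: u64, _voted_for: Option<u64>, _log: &[Entry]) -> Result<(), String> {
        if self.fail {
            return Err("disk error".to_string());
        }
        self.writes += 1;
        Ok(())
    }
}

struct Node<S: Storage> {
    term: u64,
    voted_for: Option<u64>,
    log: Vec<Entry>,
    storage: S,
}

impl<S: Storage> Node<S> {
    /// Appends an entry and returns its index only after persistence succeeds.
    /// The returned index is what the caller is allowed to acknowledge.
    fn append(&mut self, bytes: Vec<u8>) -> Result<u64, String> {
        let index = self.log.len() as u64 + 1;
        self.log.push(Entry { index, term: self.term, bytes });
        if let Err(e) = self.storage.persist(self.term, self.voted_for, &self.log) {
            // Undo the in-memory append so memory never runs ahead of disk.
            self.log.pop();
            return Err(e);
        }
        Ok(index)
    }
}
```

Because the acknowledgement value is produced only on the success path, a caller cannot acknowledge an entry that was never made durable.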

Verification strategy

Stress + skew tests: hot keys, slow disks, noisy neighbors.
Model checking the smallest core (timeouts, election, reconfiguration).
Upgrade tests: mixed versions and rolling deploy invariants.
Jepsen-style fault injection: partitions + reordering + client retries.
Deterministic replay of network traces to reproduce rare failures.
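
Deterministic replay depends on the fault schedule being a pure function of a seed. The sketch below is a toy version of that idea: a small xorshift generator produces a delivery order with simulated retries, and a deduplicating consumer is run against it. All names (`XorShift`, `schedule`, `consume`) are hypothetical, and the consumer stands in for whatever component is under test.

```rust
use std::collections::HashSet;

/// Tiny xorshift64 generator so a schedule depends only on its seed.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Builds a delivery schedule for message ids 0..n. Every id appears at
/// least once, some appear twice (simulated retries), and the order is
/// shuffled. The same seed always yields the same schedule.
fn schedule(seed: u64, n: u64) -> Vec<u64> {
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    let mut out: Vec<u64> = (0..n).collect();
    for id in 0..n {
        if rng.next() % 3 == 0 {
            out.push(id);
        }
    }
    // Fisher-Yates shuffle driven by the seeded generator.
    for i in (1..out.len()).rev() {
        let j = (rng.next() % (i as u64 + 1)) as usize;
        out.swap(i, j);
    }
    out
}

/// Consumer under test: applies each id at most once.
/// Returns the sum of applied ids and the number of duplicates dropped.
fn consume(deliveries: &[u64]) -> (u64, usize) {
    let mut seen = HashSet::new();
    let mut sum = 0;
    let mut duplicates = 0;
    for &id in deliveries {
        if seen.insert(id) {
            sum += id;
        } else {
            duplicates += 1;
        }
    }
    (sum, duplicates)
}
```

When a seed exposes a failure, recording that seed is enough to reproduce the exact schedule in a later run.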

Operational notes

Make client behavior part of the system: document retry semantics.
Prefer monotonic time sources for leases; alert on clock discontinuities.
Rate-limit retries and apply admission control before saturation.
Expose protocol state: term/epoch, leader, commit index, config version.
Rehearse region failover and reconfiguration under load.
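
For the lease guidance above, a minimal sketch using the standard library's monotonic clock. The `Lease` struct and its safety `margin` are assumptions for illustration; the current time is passed in so the logic can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical lease as seen by its holder.
struct Lease {
    granted_at: Instant,
    ttl: Duration,
    /// Safety margin subtracted from the TTL on the holder's side.
    margin: Duration,
}

impl Lease {
    /// Holder's view: stop acting `margin` before the nominal expiry.
    /// `Instant` is monotonic, so wall-clock jumps do not affect the result.
    fn holder_may_act(&self, now: Instant) -> bool {
        let usable = self.ttl.saturating_sub(self.margin);
        now.saturating_duration_since(self.granted_at) < usable
    }
}
```

The margin narrows the window but does not close it: a pause between the check and the action it guards can still outlast the lease, which is why the assumptions treat leases as violable.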

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Admission-control / rate-limit rejections (by reason).
Error budget burn + tail latency under load.
Rollback events and the conditions that triggered them.
Retry/timeout rates by endpoint and client cohort.
Authz failures and policy denials (unexpected spikes).

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Use canaries and staged rollout; stop early when signals degrade.
Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
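
An explicit rollback trigger can be as small as a function from canary metrics to a decision. The sketch below is one possible shape; the struct names, the two thresholds, and the minimum sample size are all assumptions, not recommended values.

```rust
/// Hypothetical thresholds agreed on before the rollout starts.
struct RollbackTrigger {
    max_error_rate: f64,
    max_p99_ms: u64,
    min_requests: u64,
}

/// Hypothetical metrics collected from the canary cohort.
struct CanaryStats {
    requests: u64,
    errors: u64,
    p99_ms: u64,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Continue,
    Rollback(&'static str),
    InsufficientData,
}

/// Evaluates the trigger. Refuses to decide on too small a sample.
fn evaluate(t: &RollbackTrigger, s: &CanaryStats) -> Decision {
    if s.requests == 0 || s.requests < t.min_requests {
        return Decision::InsufficientData;
    }
    let error_rate = s.errors as f64 / s.requests as f64;
    if error_rate > t.max_error_rate {
        return Decision::Rollback("error rate above threshold");
    }
    if s.p99_ms > t.max_p99_ms {
        return Decision::Rollback("p99 latency above threshold");
    }
    Decision::Continue
}
```

Returning the reason alongside the decision keeps the rollback event and its triggering condition together in the record.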

Evidence

Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Time, Clocks, and the Ordering of Events (Lamport) (2) — Causality, ordering, and why clocks are tricky.
- Evidence: Use this as the baseline for happens-before vs wall-clock; avoid embedding clock assumptions into safety properties.

Open questions

How do you prevent “operator fixes” from changing safety properties?
What is the worst-case recovery time after a leader + disk failure?
Where does your protocol assume synchrony without admitting it?
Which invariants are violated first under overload: latency, availability, or correctness?

Checklist

Safety properties stated as invariants.
Assumptions listed and reviewed.
Rollback plan rehearsed and automated.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Failure modes enumerated with mitigations.
Telemetry captures correctness signals.
