Monthly research note. Theme: Correctness & Foundations.
TL;DR
A focused memo on "Time Is a Lie: Clocks, Causality, and Ordering". Define the model, state the properties, then design the system so those properties remain true under failure and adversaries.
If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.
Key takeaways
- Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
- Crash points are part of the design; specify recovery after each state mutation.
- Ack semantics must be explicit: durable, best-effort, or ambiguous.
- Write assumptions down; treat them as interfaces.
- Prefer protocols and APIs that make invalid states hard to express.
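The first takeaway can be illustrated with a Lamport logical clock, a well-known way to order events by causality without reading wall-clock time. This is a minimal sketch, not a prescribed implementation: the `LamportClock` type and its method names are illustrative assumptions, and counter overflow is ignored.

```rust
/// A Lamport logical clock: a monotonic counter that orders events
/// causally without consulting wall-clock time.
/// (Illustrative type; not part of any API named in this note.)
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct LamportClock {
    counter: u64,
}

impl LamportClock {
    /// Advance for a local event or just before sending a message.
    /// Returns the timestamp to attach to that event or message.
    pub fn tick(&mut self) -> u64 {
        self.counter += 1;
        self.counter
    }

    /// Merge a timestamp received from a peer: move past whichever
    /// value is larger, so the receive is ordered after the send.
    pub fn observe(&mut self, remote: u64) -> u64 {
        self.counter = self.counter.max(remote) + 1;
        self.counter
    }
}
```

If node A sends with `a.tick()` and node B receives with `b.observe(sent)`, the receive timestamp is always greater than the send timestamp, regardless of how far either machine's wall clock has drifted.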
Why this matters
- Undefined behavior is an attack surface when inputs are adversarial.
- Correctness is a property you enforce at boundaries: parsing, persistence, concurrency, RPC.
- Performance work that changes semantics is a correctness regression with a nicer latency chart.
- Your on-call runbook is part of the specification—make it match the code.
Key questions
- Where does concurrency create “double spend” style failures in your domain?
- Which transitions are allowed, and which are impossible by construction?
- What does a client learn after a timeout: success, failure, or ambiguity?
- What is your ordering model: FIFO per key, per partition, or none at all?
- Which invariants must hold across crashes, restarts, and partial deployments?
- What exactly is the state, and what is derived or cached?
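For the question about transitions that are impossible by construction, one option in Rust is the typestate pattern: each lifecycle state becomes its own type, and only the legal transitions exist as methods. The sketch below borrows the state and transition names from the design sketch in this note; the `u64` payload and field names are assumptions made for illustration.

```rust
/// Typestate sketch: each state is a distinct type, so an illegal
/// transition (for example, committing before processing) has no
/// method to call and therefore does not compile.
pub struct Init;

pub struct Ready {
    pub committed: u64,
}

pub struct Processing {
    committed: u64,
    pending: u64,
}

impl Init {
    /// Init --> Ready
    pub fn bootstrap(self) -> Ready {
        Ready { committed: 0 }
    }
}

impl Ready {
    /// Ready --> Processing; consumes `self` so the old state cannot be reused.
    pub fn event(self, amount: u64) -> Processing {
        Processing { committed: self.committed, pending: amount }
    }
}

impl Processing {
    /// Processing --> Ready; the pending amount becomes committed.
    pub fn commit(self) -> Ready {
        Ready { committed: self.committed + self.pending }
    }
}
```

With this shape, `Init.bootstrap().event(5).commit()` type-checks, while `Init.commit()` is rejected by the compiler because `Init` has no such method. The trade-off is that states known only at runtime (for example, after deserialization) still need a checked conversion.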
Assumptions
- Concurrency is adversarial: assume the worst races surface only under production schedules.
- Observability is incomplete: you will debug from partial evidence.
- Deployments are mixed-version for longer than you think.
- Partial failure is normal: one replica slow, one unavailable, one returning stale data.
Non-goals
- Relying on “best effort” client behavior for safety properties.
- Treating retries as a transport detail rather than a semantic constraint.
Parsing is an attacker-controlled interface—validate early and fail fast.
Model & invariants
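A common pattern is to split state into durable state (the source of truth that must survive a crash) and derived state (caches and indexes that can be rebuilt from it).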
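A minimal sketch of the durable/derived split, assuming a hypothetical ledger whose durable state is an append-only log and whose derived state is a balance table. All type and field names here are illustrative.

```rust
use std::collections::BTreeMap;

/// Durable state: the only data that must survive a crash.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct Durable {
    /// Append-only log of (account, delta) entries.
    pub log: Vec<(String, i64)>,
}

/// Derived state: a cache that can always be rebuilt from `Durable`.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct Derived {
    pub balances: BTreeMap<String, i64>,
}

impl Derived {
    /// Recompute the cache from scratch, for example after a restart
    /// or when the cache is suspected to have drifted.
    pub fn rebuild(durable: &Durable) -> Self {
        let mut balances = BTreeMap::new();
        for (account, delta) in &durable.log {
            *balances.entry(account.clone()).or_insert(0) += delta;
        }
        Derived { balances }
    }
}
```

Because `rebuild` is a pure function of the log, recovery never has to trust a cache that may have been written halfway through a crash.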
If you can’t define what a timeout means, you can’t implement retries safely. Make ambiguity explicit in the API.
Crash points matter: define what happens if the process stops after each line that mutates state or acknowledges work.
If the system can enter an invalid state, it eventually will—usually during an incident.
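One way to make timeout ambiguity explicit in the API is to give "unknown" its own variant instead of folding it into failure. The sketch below is an assumption about how such an API might look; `Outcome`, `NextStep`, and `next_step` are hypothetical names, and the retry policy shown is deliberately simple.

```rust
/// What a caller knows once a request returns or times out.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The server confirmed the operation was applied.
    Applied,
    /// The server confirmed the operation was not applied.
    Rejected,
    /// Timeout or lost reply: the operation may or may not have been applied.
    Unknown,
}

/// What the caller should do next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    Done,
    /// Safe only because the request carries an idempotency key.
    RetrySameKey,
    ReportFailure,
    /// Surface the ambiguity instead of guessing.
    ReportAmbiguity,
}

/// Decide the next step from the outcome and whether the operation
/// is idempotent (hypothetical policy for illustration).
pub fn next_step(outcome: Outcome, idempotent: bool) -> NextStep {
    match (outcome, idempotent) {
        (Outcome::Applied, _) => NextStep::Done,
        (Outcome::Rejected, _) => NextStep::ReportFailure,
        (Outcome::Unknown, true) => NextStep::RetrySameKey,
        (Outcome::Unknown, false) => NextStep::ReportAmbiguity,
    }
}
```

The point of the exhaustive `match` is that adding a new outcome later forces every caller to decide what it means, rather than silently treating it as a failure.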
Security properties
- Authenticity: actions are bound to identity and purpose.
- Least authority: privileges are scoped by purpose and time.
- Evidence: critical actions emit verifiable audit events.
- Integrity: invalid transitions are rejected (and detectable).
Failure modes
- Recovery paths that only work when nothing is broken.
- Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
- Timeout ambiguity causing double-apply or partial state transitions.
- Mixed-version behavior that violates assumptions silently.
Sampling hides the rare schedule that breaks your invariants.
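The double-apply failure listed above is commonly closed with an idempotency key: the server remembers the result for each key and replays it on retry. The following is an in-memory sketch under stated assumptions: `Ledger` and its methods are hypothetical, the dedup table is never pruned, and a real system would need to persist the key and the mutation atomically.

```rust
use std::collections::HashMap;

/// Toy ledger that applies each request at most once per idempotency key.
pub struct Ledger {
    balance: i64,
    /// Results already produced, keyed by client-supplied idempotency key.
    seen: HashMap<String, i64>,
}

impl Ledger {
    pub fn new() -> Self {
        Ledger { balance: 0, seen: HashMap::new() }
    }

    /// Apply `delta` at most once per `key`. A replay with a known key
    /// returns the originally recorded result and changes nothing.
    pub fn apply(&mut self, key: &str, delta: i64) -> i64 {
        if let Some(&previous) = self.seen.get(key) {
            return previous;
        }
        self.balance += delta;
        self.seen.insert(key.to_string(), self.balance);
        self.balance
    }

    pub fn balance(&self) -> i64 {
        self.balance
    }
}
```

A client that times out on `apply("req-1", 10)` can resend the same key safely: the second call returns the recorded result instead of adding 10 again.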
Design sketch
    stateDiagram-v2
        [*] --> Init
        Init --> Ready: bootstrap()
        Ready --> Processing: event(e)
        Processing --> Ready: commit()
        Processing --> Error: violate(Inv)
        Error --> Ready: recover()
Implementation notes
Treat every boundary (RPC, DB, queue, cache) as a semantic interface with explicit contracts.
Acknowledge only after durability (or make “ack” explicitly best-effort).
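The acknowledgement rule above might look like the following sketch, which appends a record to a file and only reports `Ack::Durable` after `File::sync_all` returns. The `Ack` enum, the `append` function, and the newline-delimited record format are assumptions for illustration. Note also that on some filesystems a newly created file is only crash-safe once its parent directory has been synced too, which this sketch does not do.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// The kind of acknowledgement the caller receives.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ack {
    /// The record was handed to the storage device via `sync_all`.
    Durable,
    /// The record was written to the OS but may be lost on a crash.
    BestEffort,
}

/// Append one newline-terminated record. The returned `Ack` states
/// exactly which guarantee the caller got.
pub fn append(path: &Path, record: &[u8], wait_for_disk: bool) -> io::Result<Ack> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(record)?;
    file.write_all(b"\n")?;
    if wait_for_disk {
        // Ask the OS to flush file data and metadata before acknowledging.
        file.sync_all()?;
        Ok(Ack::Durable)
    } else {
        Ok(Ack::BestEffort)
    }
}
```

Returning the guarantee as a value, rather than documenting it in prose, keeps the ack semantics visible at every call site.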
    use core::fmt;

    /// Inputs that drive the state machine.
    #[derive(Clone, Debug)]
    pub enum Event {
        Input(Vec<u8>),
        Tick,
        Fault(&'static str),
    }

    pub trait StateMachine {
        type State: Clone + fmt::Debug;
        type Error: fmt::Debug;

        /// Pure transition: returns the next state or rejects the event.
        fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;

        /// Safety property that must hold in every reachable state.
        fn invariant(state: &Self::State) -> bool;
    }
    // Time Is a Lie: Clocks, Causality, and Ordering: invariants are part of the API contract.
Verification strategy
- Crash/restart tests: persist mid-transition and validate recovery correctness.
- Metamorphic tests: applying the same idempotent operation twice must leave the same state as applying it once.
- Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
- Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
- Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
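As a small-scale stand-in for the model checking and deterministic scheduling ideas above, the sketch below brute-forces every event sequence up to a fixed depth against a toy state machine and checks the invariant in each reachable state. It is not Loom or TLA+; the bounded counter, `LIMIT`, and all function names are hypothetical.

```rust
/// Events for a toy bounded counter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Inc,
    Dec,
    Reset,
}

pub const LIMIT: u32 = 3;

/// Transition function: rejects increments that would exceed `LIMIT`.
pub fn step(value: u32, event: Event) -> Result<u32, &'static str> {
    match event {
        Event::Inc if value < LIMIT => Ok(value + 1),
        Event::Inc => Err("limit reached"),
        Event::Dec => Ok(value.saturating_sub(1)),
        Event::Reset => Ok(0),
    }
}

/// Safety property: the counter never exceeds `LIMIT`.
pub fn invariant(value: u32) -> bool {
    value <= LIMIT
}

/// Explore every accepted event sequence up to `depth` and return the
/// first sequence that reaches a state violating the invariant, if any.
pub fn find_violation(depth: usize) -> Option<Vec<Event>> {
    let mut stack = vec![(0u32, Vec::new())];
    while let Some((state, path)) = stack.pop() {
        if !invariant(state) {
            return Some(path);
        }
        if path.len() == depth {
            continue;
        }
        for event in [Event::Inc, Event::Dec, Event::Reset] {
            // Rejected events do not change state, so they are not explored further.
            if let Ok(next) = step(state, event) {
                let mut next_path = path.clone();
                next_path.push(event);
                stack.push((next, next_path));
            }
        }
    }
    None
}
```

When a violation exists, the returned path is a replayable counterexample, which is usually more useful in a bug report than a stack trace.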
Operational notes
- Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
- Instrument ambiguity: measure “unknown outcome” responses separately from failures.
- Design “degraded modes” explicitly (fail closed vs fail open per operation).
- Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
- Track invariant violations as pages, not dashboards.
Keep audit and config history queryable during incidents—evidence beats intuition.
What to monitor
- Rollback events and the conditions that triggered them.
- Invariant violation rate (should be ~0).
- Retry/timeout rates by endpoint and client cohort.
- Error budget burn + tail latency under load.
- Admission-control / rate-limit rejections (by reason).
Rollback plan
- Use canaries and staged rollout; stop early when signals degrade.
- Define an explicit rollback trigger (metrics + thresholds).
- Keep dual-write / dual-verify windows where appropriate.
- Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
- Prefer backward-compatible changes; avoid “flag day” upgrades.
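An explicit rollback trigger can be written down as a pure function over canary metrics, so the decision is reviewable and testable rather than a judgment call made mid-incident. The struct names, fields, and the order of checks below are illustrative assumptions; real thresholds would come from your own SLOs.

```rust
/// Hypothetical thresholds for a staged rollout.
#[derive(Debug, Clone, Copy)]
pub struct RollbackPolicy {
    pub max_invariant_violations: u64,
    pub max_error_rate: f64,
    pub max_p99_latency_ms: u64,
}

/// Metrics observed from the canary over one evaluation window.
#[derive(Debug, Clone, Copy)]
pub struct CanarySample {
    pub invariant_violations: u64,
    pub requests: u64,
    pub errors: u64,
    pub p99_latency_ms: u64,
}

/// Returns the reason to roll back, or `None` if the rollout may continue.
/// Correctness signals are checked before performance signals.
pub fn should_roll_back(policy: &RollbackPolicy, s: &CanarySample) -> Option<&'static str> {
    if s.invariant_violations > policy.max_invariant_violations {
        return Some("invariant violations");
    }
    // Skip the ratio when there is no traffic to avoid dividing by zero.
    if s.requests > 0 && (s.errors as f64 / s.requests as f64) > policy.max_error_rate {
        return Some("error rate");
    }
    if s.p99_latency_ms > policy.max_p99_latency_ms {
        return Some("tail latency");
    }
    None
}
```

Returning the reason alongside the decision also gives the rollback event something concrete to log, which supports the evidence-preservation item in the list above.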
Evidence
- Designing Data-Intensive Applications (Kleppmann) (1): The systems-engineering baseline for correctness, replication, and failure.
  - Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
- Jepsen (2): Failure testing focused on correctness under partitions and reordering.
  - Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Open questions
- Which correctness properties can be enforced at compile time (types/capabilities)?
- Which invariant, if violated, would silently corrupt state for weeks?
- Where does your API currently allow ambiguous outcomes, and how will clients cope?
- Which operations need monotonic versioning vs idempotency keys vs both?
Checklist
- Telemetry captures correctness signals.
- Rollback plan rehearsed and automated.
- Failure modes enumerated with mitigations.
- Assumptions listed and reviewed.
- Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
- Safety properties stated as invariants.
Further reading
- Learn TLA+ — A pragmatic workflow for invariants and model checking.
- RFC 9110: HTTP Semantics — Defines method semantics including idempotency and safety—useful for API contracts.
- Jepsen — Failure testing focused on correctness under partitions and reordering.
- Paxos Made Simple (Lamport) — A clean reference for agreement and invariants.
- Site Reliability Engineering (Google) — Error budgets, incident response, and reliability as an engineering discipline.
- Designing Data-Intensive Applications (Kleppmann) — The systems-engineering baseline for correctness, replication, and failure.