Time Is a Lie: Clocks, Causality, and Ordering

Monthly research note. Theme: Correctness & Foundations.

TL;DR

This memo treats clocks, causality, and ordering as a correctness problem: define the model, state the properties, then design the system so those properties hold under failure and against adversaries.

Key insight

If the spec is implicit, the implementation becomes the spec—and you’ll learn it during incidents.

Key takeaways

Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
Crash points are part of the design; specify recovery after each state mutation.
Ack semantics must be explicit: durable, best-effort, or ambiguous.
Write assumptions down; treat them as interfaces.
Prefer protocols and APIs that make invalid states hard to express.
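
The first takeaway can be made concrete with a Lamport logical clock, one well-known way to get a monotonic counter that respects causality without reading the wall clock. This is a minimal sketch, not a prescription: the type and method names are illustrative, and counter overflow is ignored.

```rust
/// A Lamport logical clock: a counter that only moves forward.
/// Illustrative sketch; names are assumptions, overflow is not handled.
#[derive(Debug, Default, Clone)]
pub struct LamportClock {
    counter: u64,
}

impl LamportClock {
    pub fn new() -> Self {
        Self { counter: 0 }
    }

    /// Advances the clock for a local event or a send and returns the new timestamp.
    pub fn tick(&mut self) -> u64 {
        self.counter += 1;
        self.counter
    }

    /// Merges a timestamp received from another node, then advances past it,
    /// so the receive event is ordered after the send event.
    pub fn observe(&mut self, remote: u64) -> u64 {
        self.counter = self.counter.max(remote) + 1;
        self.counter
    }

    /// Returns the current timestamp without advancing.
    pub fn now(&self) -> u64 {
        self.counter
    }
}
```

If event A causally precedes event B, A's timestamp is smaller than B's. The converse does not hold: a smaller timestamp alone does not prove causality.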

Why this matters

Undefined behavior is an attack surface when inputs are adversarial.
Correctness is a property you enforce at boundaries: parsing, persistence, concurrency, RPC.
Performance work that changes semantics is a correctness regression with a nicer latency chart.
Your on-call runbook is part of the specification—make it match the code.

Key questions

Where does concurrency create “double spend” style failures in your domain?
Which transitions are allowed, and which are impossible by construction?
What does a client learn after a timeout: success, failure, or ambiguity?
What is your ordering model: FIFO per key, per partition, or none at all?
Which invariants must hold across crashes, restarts, and partial deployments?
What exactly is the state, and what is derived or cached?

Assumptions

Concurrency is adversarial: some races surface only under production schedules.
Observability is incomplete: you will debug from partial evidence.
Deployments are mixed-version for longer than you think.
Partial failure is normal: one replica slow, one unavailable, one returning stale data.

Non-goals

Relying on “best effort” client behavior for safety properties.
Treating retries as a transport detail rather than a semantic constraint.

Attack surface

Parsing is an attacker-controlled interface—validate early and fail fast.

Model & invariants

A common pattern is splitting state into durable vs derived:

S = S_\text{durable} \times S_\text{derived}\qquad\text{and}\qquad S_\text{derived} = f(S_\text{durable}).
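
One way to read the split above: derived state is a pure function of durable state, so any cache can be checked against a rebuild. The sketch below assumes a hypothetical durable log of (key, delta) entries; the shapes and names are for illustration only.

```rust
use std::collections::BTreeMap;

/// Durable state: an append-only log of (key, delta) entries (hypothetical shape).
#[derive(Debug, Default, Clone)]
pub struct Durable {
    pub log: Vec<(String, i64)>,
}

/// Derived state: per-key totals, always recomputable from the log.
pub type Derived = BTreeMap<String, i64>;

/// Plays the role of f: rebuilds derived state from durable state alone.
pub fn derive(durable: &Durable) -> Derived {
    let mut totals = Derived::new();
    for (key, delta) in &durable.log {
        *totals.entry(key.clone()).or_insert(0) += delta;
    }
    totals
}

/// Checks that a cached view equals f(durable); a mismatch means the cache is stale or corrupt.
pub fn cache_is_consistent(durable: &Durable, cached: &Derived) -> bool {
    derive(durable) == *cached
}
```

Because the derived view holds no information of its own, recovery can discard it and recompute rather than repair it.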

If you can’t define what a timeout means, you can’t implement retries safely. Make ambiguity explicit in the API.
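
A sketch of what making ambiguity explicit could look like: the result type gives "unknown" its own variant, and the retry decision depends on whether the operation is idempotent. All names here are hypothetical.

```rust
/// What a caller knows after one attempt (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The server confirmed the operation was applied.
    Committed,
    /// The server confirmed the operation was not applied.
    Rejected,
    /// Timeout or lost connection: the operation may or may not have been applied.
    Unknown,
}

/// What the caller should do next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    Done,
    GiveUp,
    /// Safe only because the operation is idempotent under the same key.
    RetryWithSameKey,
    /// Ask the server what happened before doing anything else.
    QueryStatus,
}

/// Decides the follow-up action; a blind retry is allowed only for idempotent operations.
pub fn next_step(outcome: Outcome, idempotent: bool) -> NextStep {
    match (outcome, idempotent) {
        (Outcome::Committed, _) => NextStep::Done,
        (Outcome::Rejected, _) => NextStep::GiveUp,
        (Outcome::Unknown, true) => NextStep::RetryWithSameKey,
        (Outcome::Unknown, false) => NextStep::QueryStatus,
    }
}
```

Folding `Unknown` into `Rejected` is the shortcut that produces double-apply bugs: the caller retries an operation that already succeeded.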

Crash points matter: define what happens if the process stops after each line that mutates state or acknowledges work.

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

Authenticity: actions are bound to identity and purpose.
Least authority: privileges are scoped by purpose and time.
Evidence: critical actions emit verifiable audit events.
Integrity: invalid transitions are rejected (and detectable).

Failure modes

Recovery paths that only work when nothing is broken.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Timeout ambiguity causing double-apply or partial state transitions.
Mixed-version behavior that violates assumptions silently.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

stateDiagram-v2
  [*] --> Init
  Init --> Ready: bootstrap()
  Ready --> Processing: event(e)
  Processing --> Ready: commit()
  Processing --> Error: violate(Inv)
  Error --> Ready: recover()
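
The diagram can be encoded as a total function over (state, trigger) pairs in which every edge not drawn is rejected. This is a sketch: the `Trigger` names mirror the edge labels in the diagram, and everything else is an assumption.

```rust
/// States from the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Init,
    Ready,
    Processing,
    Error,
}

/// Edge labels from the diagram, without their arguments.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trigger {
    Bootstrap,
    Event,
    Commit,
    Violate,
    Recover,
}

/// Returns the next state for an allowed edge and `None` for any transition not in the diagram.
pub fn transition(state: State, trigger: Trigger) -> Option<State> {
    match (state, trigger) {
        (State::Init, Trigger::Bootstrap) => Some(State::Ready),
        (State::Ready, Trigger::Event) => Some(State::Processing),
        (State::Processing, Trigger::Commit) => Some(State::Ready),
        (State::Processing, Trigger::Violate) => Some(State::Error),
        (State::Error, Trigger::Recover) => Some(State::Ready),
        _ => None,
    }
}
```

The catch-all arm is the point: a transition missing from the diagram is rejected rather than silently accepted.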

Implementation notes

Treat every boundary (RPC, DB, queue, cache) as a semantic interface with explicit contracts.

Rule of thumb

Acknowledge only after durability (or make “ack” explicitly best-effort).
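
A minimal sketch of the rule, assuming a hypothetical append-only log file: the acknowledgement value can only be constructed after `sync_all` returns successfully. The function and type names are illustrative.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Returned only after the record has been handed to stable storage.
#[derive(Debug, PartialEq, Eq)]
pub struct DurableAck {
    pub bytes_written: usize,
}

/// Appends one newline-terminated record and syncs before acknowledging.
/// Hypothetical helper: a real log would also need framing and checksums.
pub fn append_durably(path: &Path, record: &[u8]) -> io::Result<DurableAck> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(record)?;
    file.write_all(b"\n")?;
    // Crash before this line: the caller never got an ack, so a retry is legitimate.
    file.sync_all()?;
    // Crash after this line but before the caller sees the ack: the outcome is ambiguous.
    Ok(DurableAck {
        bytes_written: record.len() + 1,
    })
}
```

On some filesystems a newly created file is durable only once its parent directory has also been synced; the sketch leaves that step out.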

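use core::fmt;

/// Inputs that can drive a state transition.
#[derive(Clone, Debug)]
pub enum Event {
    /// Raw bytes received at a boundary.
    Input(Vec<u8>),
    /// A time-driven step with no payload.
    Tick,
    /// A fault, named for diagnostics.
    Fault(&'static str),
}

/// A pure transition function plus the safety property it must preserve.
pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;

    /// Computes the next state without mutating the current one.
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    /// Returns true when `state` satisfies the safety invariant.
    fn invariant(state: &Self::State) -> bool;
}

// Invariants are part of the API contract: every state returned by `step` should satisfy `invariant`.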

Verification strategy

Crash/restart tests: persist mid-transition and validate recovery correctness.
Metamorphic tests: applying an idempotent operation twice must give the same result as applying it once.
Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
Invariant monitoring in prod: encode safety properties as metrics (rate of impossible states).
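
The metamorphic check in the list above can be written as a small generic helper. The sketch below pairs it with a hypothetical account that deduplicates credits by idempotency key; both the `Account` type and the helper name are assumptions for illustration.

```rust
use std::collections::HashSet;

/// A toy account that applies each idempotency key at most once (hypothetical type).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct Account {
    pub balance: i64,
    applied: HashSet<u64>,
}

impl Account {
    /// Applies a credit at most once per key; returns true if state changed.
    pub fn credit(&mut self, key: u64, amount: i64) -> bool {
        if !self.applied.insert(key) {
            return false; // duplicate delivery: no effect
        }
        self.balance += amount;
        true
    }
}

/// Metamorphic relation: applying `op` twice must equal applying it once.
pub fn is_idempotent<S: Clone + PartialEq>(start: &S, op: impl Fn(&mut S)) -> bool {
    let mut once = start.clone();
    op(&mut once);
    let mut twice = once.clone();
    op(&mut twice);
    once == twice
}
```

The helper needs no expected value, only the relation between one application and two, which is why it pairs well with fuzzed or randomly generated operations.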

Operational notes

Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
Instrument ambiguity: measure “unknown outcome” responses separately from failures.
Design “degraded modes” explicitly (fail closed vs fail open per operation).
Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
Track invariant violations as pages, not dashboards.

Operational note

Keep audit and config history queryable during incidents—evidence beats intuition.

What to monitor

Rollback events and the conditions that triggered them.
Invariant violation rate (should be ~0).
Retry/timeout rates by endpoint and client cohort.
Error budget burn + tail latency under load.
Admission-control / rate-limit rejections (by reason).
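
An invariant violation rate needs both a numerator and a denominator. A minimal sketch using atomic counters follows; the `InvariantMonitor` type is hypothetical, and exporting the counters to a metrics system is left out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counts invariant evaluations and violations (illustrative; no metrics backend is wired in).
pub struct InvariantMonitor {
    checks: AtomicU64,
    violations: AtomicU64,
}

impl InvariantMonitor {
    pub const fn new() -> Self {
        Self {
            checks: AtomicU64::new(0),
            violations: AtomicU64::new(0),
        }
    }

    /// Records one evaluation and passes the result through so the call can be used inline.
    pub fn record(&self, holds: bool) -> bool {
        self.checks.fetch_add(1, Ordering::Relaxed);
        if !holds {
            self.violations.fetch_add(1, Ordering::Relaxed);
        }
        holds
    }

    pub fn checks(&self) -> u64 {
        self.checks.load(Ordering::Relaxed)
    }

    pub fn violations(&self) -> u64 {
        self.violations.load(Ordering::Relaxed)
    }

    /// Any violation at all is page-worthy, since the expected rate is zero.
    pub fn should_page(&self) -> bool {
        self.violations() > 0
    }
}
```

Relaxed ordering is enough here because the counters are only summed and reported; they do not guard any other memory.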

Rollback plan

Use canaries and staged rollout; stop early when signals degrade.
Define an explicit rollback trigger (metrics + thresholds).
Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
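
An explicit rollback trigger can be as small as a pure function from observed signals and thresholds to a decision. The sketch below is illustrative: the field names and the choice of signals are assumptions, not values or metrics recommended by this note.

```rust
/// Signals observed on the canary (hypothetical fields).
#[derive(Debug, Clone, Copy)]
pub struct CanarySignals {
    pub error_rate: f64,
    pub p99_latency_ms: f64,
    pub invariant_violations: u64,
}

/// Thresholds agreed before the rollout starts.
#[derive(Debug, Clone, Copy)]
pub struct RollbackThresholds {
    pub max_error_rate: f64,
    pub max_p99_latency_ms: f64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Continue,
    /// Carries the reason so the rollback event records what triggered it.
    Rollback(&'static str),
}

/// Evaluates the trigger; invariant violations take priority over performance signals.
pub fn evaluate(signals: &CanarySignals, limits: &RollbackThresholds) -> Decision {
    if signals.invariant_violations > 0 {
        return Decision::Rollback("invariant violation");
    }
    if signals.error_rate > limits.max_error_rate {
        return Decision::Rollback("error rate");
    }
    if signals.p99_latency_ms > limits.max_p99_latency_ms {
        return Decision::Rollback("tail latency");
    }
    Decision::Continue
}
```

Keeping the decision pure makes it testable and lets the same function run in a rehearsal and in the real rollout.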

Evidence

Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Jepsen (2) — Failure testing focused on correctness under partitions and reordering.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

Which correctness properties can be enforced at compile time (types/capabilities)?
Which invariant, if violated, would silently corrupt state for weeks?
Where does your API currently allow ambiguous outcomes, and how will clients cope?
Which operations need monotonic versioning vs idempotency keys vs both?

Checklist

Telemetry captures correctness signals.
Rollback plan rehearsed and automated.
Failure modes enumerated with mitigations.
Assumptions listed and reviewed.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Safety properties stated as invariants.
