Memory Models and Concurrency: Reasoning About Races

Monthly research note. Theme: Correctness & Foundations.

TL;DR

Treat reasoning about races as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Correctness is cheaper to enforce at interfaces than to repair in production data.

Key takeaways

Ack semantics must be explicit: durable, best-effort, or ambiguous.
Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
Make retries semantic: idempotency keys, monotonic versions, and explicit ambiguity.
Write assumptions down; treat them as interfaces.
Treat retries, reordering, and partial failure as default conditions.

Why this matters

Most outages are “state management” failures: partial writes, ambiguous outcomes, invalid transitions.
Correctness bugs are indistinguishable from security incidents when the environment is adversarial.
Undefined behavior is an attack surface when inputs are adversarial.
The cost of unclear invariants is paid in production, under load, during an incident.

Key questions

Where does concurrency create “double spend” style failures in your domain?
What exactly is the state, and what is derived or cached?
What must be durable before you acknowledge?
What does a client learn after a timeout: success, failure, or ambiguity?
How do you make “unsafe defaults” impossible to ship?
What is your ordering model: FIFO per key, per partition, or none at all?

Assumptions

Deployments are mixed-version for longer than you think.
Input is hostile: malformed, oversized, boundary values, protocol confusion.
Observability is incomplete: you will debug from partial evidence.
Crashes happen mid-write (torn state) unless you prove otherwise.

Non-goals

Letting recovery be “restart the service and hope.”
Perfect exactly-once semantics across an untrusted network without coordination.

Attack surface

Negotiation and fallbacks are where security silently becomes optional—treat them as hostile.

Model & invariants

We want a transition function $\delta$ and invariant $\mathrm{Inv}$ such that:

$$s_{t+1} = \delta(s_t, e_t)\qquad\wedge\qquad \bigl(\mathrm{Inv}(s_t)\Rightarrow \mathrm{Inv}(s_{t+1})\bigr).$$
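One way to make the condition above executable is to wrap the transition function in a checked step. The sketch below is illustrative only: `Account`, `Withdraw`, and the specific invariant are hypothetical names chosen for the example, not part of any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    balance: int  # durable state
    version: int  # monotonic version, bumped on every transition


@dataclass(frozen=True)
class Withdraw:
    amount: int


def inv(s: Account) -> bool:
    """Inv(s): neither balance nor version may go negative."""
    return s.balance >= 0 and s.version >= 0


def delta(s: Account, e: Withdraw) -> Account:
    """delta(s, e): apply the event, rejecting invalid transitions."""
    if e.amount <= 0 or e.amount > s.balance:
        raise ValueError("invalid transition")
    return Account(balance=s.balance - e.amount, version=s.version + 1)


def step(s: Account, e: Withdraw) -> Account:
    """Checked step: if Inv held before, it must hold after."""
    assert inv(s), "invariant must hold before the step"
    nxt = delta(s, e)
    assert inv(nxt), "transition violated the invariant"
    return nxt
```

Running every production transition through something like `step` (or its equivalent in tests) turns the invariant from documentation into a check that fails loudly.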

Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.

If you can’t define what a timeout means, you can’t implement retries safely. Make ambiguity explicit in the API.
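A minimal sketch of what explicit ambiguity could look like on the client side, assuming a hypothetical `send` callable that returns one of three outcomes. The names `Outcome` and `call_with_retry` are invented for illustration.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    SUCCESS = "success"      # server confirmed the operation applied
    FAILURE = "failure"      # server confirmed it did not apply
    AMBIGUOUS = "ambiguous"  # timeout: it may or may not have applied


def call_with_retry(
    send: Callable[[str], Outcome],
    idempotency_key: str,
    max_attempts: int = 3,
) -> Outcome:
    """Retry only ambiguous outcomes, always reusing the same key."""
    outcome = Outcome.AMBIGUOUS
    for _ in range(max_attempts):
        outcome = send(idempotency_key)
        if outcome is not Outcome.AMBIGUOUS:
            return outcome
    # Still unknown after all attempts: surface that, do not guess.
    return outcome
```

The design choice here is that a timeout never collapses into success or failure; the caller sees `AMBIGUOUS` and must decide, and retries are safe only because the key stays the same.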

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
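As an illustration of why epochs hold up where timestamps do not, the sketch below shows a store that rejects writes carrying a stale epoch (often called a fencing token). `FencedStore` is a hypothetical in-memory stand-in, and the policy of accepting equal epochs is an assumption of this example.

```python
from typing import Optional


class FencedStore:
    """Hypothetical store that rejects writes carrying a stale epoch."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.value: Optional[str] = None

    def write(self, epoch: int, value: str) -> bool:
        # No clock is consulted: only the order of epochs matters.
        if epoch < self.highest_epoch:
            return False  # stale writer, e.g. an old leader that was paused
        self.highest_epoch = epoch
        self.value = value
        return True
```

A writer holding epoch 1 that wakes up after epoch 2 has written is rejected regardless of what its local clock says.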

Security properties

Least authority: privileges are scoped by purpose and time.
Integrity: invalid transitions are rejected (and detectable).
Evidence: critical actions emit verifiable audit events.
Authenticity: actions are bound to identity and purpose.

Failure modes

Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Observability gaps during incidents (missing evidence).
Recovery paths that only work when nothing is broken.
Mixed-version behavior that violates assumptions silently.

Pitfall

Sampling hides the rare schedule that breaks your invariants.

Design sketch

sequenceDiagram
  participant C as Client
  participant API as API
  participant DB as Durable Store
  C->>API: request(op, idempotency_key)
  API->>DB: check_or_reserve(key)
  alt key already committed
    DB-->>API: hit(result)
  else key not seen
    DB-->>API: miss
    API->>DB: commit(result)
  end
  API-->>C: ack(result)
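The sequence above can be sketched in code roughly as follows. This is an in-memory stand-in under stated assumptions: `InMemoryStore`, `handle`, and the extra `"pending"` status (a key that is reserved but not yet committed) are hypothetical additions for the example, and a real store would persist reservations and results durably.

```python
import threading
from typing import Callable, Dict, Optional, Tuple


class InMemoryStore:
    """Stand-in for the durable store; nothing here survives a crash."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results: Dict[str, Optional[str]] = {}

    def check_or_reserve(self, key: str) -> Tuple[str, Optional[str]]:
        with self._lock:
            if key in self._results:
                result = self._results[key]
                if result is None:
                    return ("pending", None)  # reserved, not yet committed
                return ("hit", result)
            self._results[key] = None  # reserve the key
            return ("miss", None)

    def commit(self, key: str, result: str) -> None:
        with self._lock:
            self._results[key] = result


def handle(store: InMemoryStore, key: str, op: Callable[[], str]) -> Optional[str]:
    status, result = store.check_or_reserve(key)
    if status == "hit":
        return result  # replay the stored result; do not run op again
    if status == "pending":
        return None  # another attempt is in flight: report ambiguity
    result = op()
    store.commit(key, result)  # commit before acknowledging
    return result
```

Note what the sketch does not cover: a crash between `op()` and `commit` leaves the key reserved forever, so a real design needs a recovery rule for pending reservations.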

Implementation notes

Correctness lives in the seams: encoding, persistence, concurrency, and retries.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

Correctness checklist:
1) Define state (durable vs derived).
2) Enumerate transitions.
3) Write invariants (safety) and progress conditions (liveness).
4) Pick crash points and specify recovery.
5) Make retries part of semantics (idempotency keys, monotonic versions).

Verification strategy

Differential tests against a reference model (even a slow one).
Fault injection: latency, partial writes, dropped acks, and duplicated messages.
Metamorphic tests: applying the same operation twice with the same idempotency key must leave the state and the returned result unchanged.
Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
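To illustrate the idea behind forcing rare interleavings, the sketch below exhaustively enumerates every schedule of two threads that each perform an unsynchronized read followed by a write of `read + 1`. It is a toy model written for this note, not Loom or any real scheduler; `interleavings` and `run` are hypothetical helpers.

```python
from itertools import permutations


def interleavings(n_threads: int, steps_per_thread: int):
    """All schedules as tuples of thread ids; the k-th occurrence of a
    thread id in a schedule stands for that thread's k-th step."""
    base = [t for t in range(n_threads) for _ in range(steps_per_thread)]
    return sorted(set(permutations(base)))


def run(schedule) -> int:
    """Each thread: step 0 reads the shared counter, step 1 writes read + 1."""
    shared = 0
    local = {}  # per-thread copy of the value it read
    pc = {}     # per-thread program counter
    for t in schedule:
        step = pc.get(t, 0)
        if step == 0:
            local[t] = shared
        else:
            shared = local[t] + 1
        pc[t] = step + 1
    return shared


schedules = interleavings(2, 2)
# Invariant under test: two increments must yield 2.
bad = [s for s in schedules if run(s) != 2]
```

Only the two fully serial schedules preserve the invariant; every schedule in which both reads happen before either write loses an update. Random sampling of real executions may rarely hit those schedules, which is why enumeration or a deterministic scheduler is worth the effort.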

Operational notes

Make rollbacks safe: schema and protocol compatibility is a security boundary.
Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
Validate time assumptions: alert on clock steps, skew, and monotonicity issues.
Track invariant violations as pages, not dashboards.
Design “degraded modes” explicitly (fail closed vs fail open per operation).

Operational note

Make degraded modes explicit: fail closed vs fail open is a policy choice.

What to monitor

Authz failures and policy denials (unexpected spikes).
Admission-control / rate-limit rejections (by reason).
Retry/timeout rates by endpoint and client cohort.
Invariant violation rate (should be ~0).
Error budget burn + tail latency under load.

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
Use canaries and staged rollout; stop early when signals degrade.
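An explicit rollback trigger can be as small as a function from metrics to a list of breached signals. The sketch below is a hypothetical example: the metric names and every threshold value are placeholders, not recommendations.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RollbackThresholds:
    # Placeholder limits for illustration only.
    max_invariant_violations: int = 0
    max_error_rate: float = 0.01
    max_p99_latency_ms: float = 500.0


def rollback_reasons(
    metrics: Dict[str, float],
    limits: RollbackThresholds = RollbackThresholds(),
) -> List[str]:
    """Return the breached signals; a non-empty list means roll back."""
    reasons = []
    if metrics["invariant_violations"] > limits.max_invariant_violations:
        reasons.append("invariant_violations")
    if metrics["error_rate"] > limits.max_error_rate:
        reasons.append("error_rate")
    if metrics["p99_latency_ms"] > limits.max_p99_latency_ms:
        reasons.append("p99_latency_ms")
    return reasons
```

Returning the reasons rather than a bare boolean keeps the decision auditable: the rollback record can state which signal fired.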

Evidence

Designing Data-Intensive Applications (Kleppmann) (1) — The systems-engineering baseline for correctness, replication, and failure.
- Evidence: Replication and consistency tradeoffs as engineering constraints; use as reference when naming guarantees.
Learn TLA+ (2) — A pragmatic workflow for invariants and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

Which correctness properties can be enforced at compile time (types/capabilities)?
What is the minimal durable record needed to recover safely?
Which operations need monotonic versioning vs idempotency keys vs both?
Which invariant, if violated, would silently corrupt state for weeks?

Checklist

Safety properties stated as invariants.
Failure modes enumerated with mitigations.
Assumptions listed and reviewed.
Rollback plan rehearsed and automated.
Telemetry captures correctness signals.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
