Fault Injection: Turning Unknown Unknowns into Test Cases

Monthly research note. Theme: Correctness & Foundations.

TL;DR

This memo treats fault injection as a way to turn unknown unknowns into test cases: define the model, state the properties, then design the system so those properties hold under failure and adversarial input.

Key insight

Most failures are boundary failures: parsing, persistence, concurrency, retries, and upgrades.

Key takeaways

Prefer monotonic counters/epochs over wall-clock timestamps at correctness boundaries.
Ack semantics must be explicit: durable, best-effort, or ambiguous.
Separate durable state from derived state; derived must be recomputable or reconcilable.
Prefer protocols and APIs that make invalid states hard to express.
Define safety properties before performance goals.
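One way to make ack semantics explicit is to put them in the type the client receives. The sketch below is illustrative only: `Ack`, `NextStep`, and `next_step` are hypothetical names, and the policy of treating a best-effort ack as final is an assumption, not something this note prescribes.

```rust
/// What the server tells the caller about a write (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ack {
    /// The write is persisted and survives a crash.
    Durable,
    /// The write was accepted but may be lost on a crash.
    BestEffort,
    /// The outcome is unknown, for example after a timeout.
    Ambiguous,
}

/// What the client should do next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    Done,
    RetryWithSameKey,
    ReconcileBeforeRetry,
}

/// Maps an ack to a client action. Assumed policy: only an ambiguous
/// outcome triggers follow-up work, and a blind retry is allowed only
/// when the operation is idempotent.
pub fn next_step(ack: Ack, idempotent: bool) -> NextStep {
    match (ack, idempotent) {
        (Ack::Durable, _) | (Ack::BestEffort, _) => NextStep::Done,
        (Ack::Ambiguous, true) => NextStep::RetryWithSameKey,
        (Ack::Ambiguous, false) => NextStep::ReconcileBeforeRetry,
    }
}
```

Because `Ambiguous` is a distinct variant, a caller cannot fold a timeout into success or failure without writing that decision down.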

Why this matters

Correctness is a property you enforce at boundaries: parsing, persistence, concurrency, RPC.
The cost of unclear invariants is paid in production, under load, during an incident.
If recovery is not specified, recovery becomes improvisation.
Undefined behavior is an attack surface when inputs are adversarial.

Key questions

How do you make “unsafe defaults” impossible to ship?
What is your ordering model: FIFO per key, per partition, or none at all?
What does a client learn after a timeout: success, failure, or ambiguity?
What exactly is the state, and what is derived or cached?
How do you ensure deduplication is scoped correctly (tenant, resource, operation)?
Where does concurrency create “double spend” style failures in your domain?

Assumptions

Time is untrusted: clock skew, NTP steps, monotonic vs wall-clock confusion.
Deployments are mixed-version for longer than you think.
Errors are lossy: transient vs permanent is often indistinguishable at the boundary.
Crashes happen mid-write (torn state) unless you prove otherwise.

Non-goals

Treating retries as a transport detail rather than a semantic constraint.
Assuming a single authoritative clock that never moves backwards.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.
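A minimal sketch of bounding work at a parsing boundary, assuming a hypothetical frame format (4-byte big-endian length followed by the payload) and an arbitrary `MAX_PAYLOAD` limit. The declared length is checked against the limit before any payload is touched.

```rust
/// Reasons a frame is rejected (hypothetical error type).
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    Truncated,
    TooLarge { declared: usize, limit: usize },
}

/// Assumed upper bound on payload size; pick a value that fits your budget.
pub const MAX_PAYLOAD: usize = 1024;

/// Parses one frame and returns its payload as a slice of the input.
/// The length check runs before any work proportional to the payload.
pub fn parse_frame(input: &[u8]) -> Result<&[u8], ParseError> {
    let header = input.get(..4).ok_or(ParseError::Truncated)?;
    let declared =
        u32::from_be_bytes([header[0], header[1], header[2], header[3]]) as usize;
    if declared > MAX_PAYLOAD {
        return Err(ParseError::TooLarge { declared, limit: MAX_PAYLOAD });
    }
    // `declared` is at most MAX_PAYLOAD here, so `4 + declared` cannot overflow.
    input.get(4..4 + declared).ok_or(ParseError::Truncated)
}
```

An attacker who declares a 4 GiB payload gets a cheap rejection instead of an allocation or a long read.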

Model & invariants

A common pattern is splitting state into durable vs derived:

S = S_{\text{durable}} \times S_{\text{derived}} \qquad\text{and}\qquad S_{\text{derived}} = f(S_{\text{durable}}).

Crash points matter: define what happens if the process stops after each line that mutates state or acknowledges work.
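To make crash points concrete, the sketch below simulates a three-step write (log the intent, apply it, clear the log) and stops after a chosen step. Everything here is hypothetical: `Disk`, `apply_with_crash`, and `recover` are illustrative names, and step 2 is assumed to be a single atomic record update.

```rust
/// In-memory stand-in for durable storage (hypothetical layout).
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Disk {
    /// Pending intent: (operation id, delta).
    pub log: Option<(u64, i64)>,
    pub value: i64,
    /// Highest operation id already applied to `value`.
    pub applied_up_to: u64,
}

/// Runs the write but "crashes" after `crash_after` steps.
/// Returns true only if the write ran to completion and could be acked.
pub fn apply_with_crash(disk: &mut Disk, id: u64, delta: i64, crash_after: usize) -> bool {
    if crash_after == 0 {
        return false;
    }
    disk.log = Some((id, delta)); // step 1: record the intent
    if crash_after == 1 {
        return false;
    }
    disk.value += delta; // step 2: apply (assumed atomic with the id update)
    disk.applied_up_to = id;
    if crash_after == 2 {
        return false;
    }
    disk.log = None; // step 3: clear the intent
    true
}

/// Replays a pending intent at most once, then clears it.
pub fn recover(disk: &mut Disk) {
    if let Some((id, delta)) = disk.log {
        if id > disk.applied_up_to {
            disk.value += delta;
            disk.applied_up_to = id;
        }
        disk.log = None;
    }
}
```

Enumerating `crash_after` from 0 to 3 and running `recover` after each gives one test case per crash point: the delta is applied exactly once or not at all, never twice.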

Avoid “ghost state” in caches that can’t be recomputed or validated. Derived state must be either reproducible or explicitly reconciled.
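The relation between durable and derived state can be sketched as a recompute-and-compare step. The domain here (an append-only list of account deltas with cached balances) and the names `derive` and `reconcile` are assumptions chosen for illustration.

```rust
use std::collections::BTreeMap;

/// Derived state: balances recomputed from the durable entries.
/// This plays the role of f in "derived = f(durable)".
pub fn derive(entries: &[(String, i64)]) -> BTreeMap<String, i64> {
    let mut balances = BTreeMap::new();
    for (account, delta) in entries {
        *balances.entry(account.clone()).or_insert(0) += *delta;
    }
    balances
}

/// Validates a cache against the durable entries.
/// Returns true if the cache was already correct; otherwise replaces it
/// with the recomputed value and returns false.
pub fn reconcile(entries: &[(String, i64)], cache: &mut BTreeMap<String, i64>) -> bool {
    let fresh = derive(entries);
    if *cache == fresh {
        true
    } else {
        *cache = fresh;
        false
    }
}
```

A `false` return is worth counting as a signal: it means the cache held state that the durable record could not explain.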

Invariant

Monotonicity beats timestamps: counters and epochs survive clock skew.
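A minimal sketch of epoch-based fencing, assuming a hypothetical `FencedRegister` whose writers carry an epoch number issued elsewhere (for example by a lease or leader election, which is not modeled here). No wall-clock value takes part in the decision.

```rust
/// Returned when a writer presents an epoch older than one already seen.
#[derive(Debug, PartialEq, Eq)]
pub struct StaleEpoch {
    pub seen: u64,
    pub offered: u64,
}

/// A register that accepts writes only from the current or a newer epoch.
#[derive(Debug, Default)]
pub struct FencedRegister {
    epoch: u64,
    value: Option<String>,
}

impl FencedRegister {
    /// Rejects writes from stale epochs; otherwise records the epoch and value.
    pub fn write(&mut self, epoch: u64, value: &str) -> Result<(), StaleEpoch> {
        if epoch < self.epoch {
            return Err(StaleEpoch { seen: self.epoch, offered: epoch });
        }
        self.epoch = epoch;
        self.value = Some(value.to_string());
        Ok(())
    }

    pub fn value(&self) -> Option<&str> {
        self.value.as_deref()
    }
}
```

A writer that was paused and resumes with an old epoch is rejected regardless of what its local clock says.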

Security properties

Downgrade resistance: negotiation can’t silently weaken security posture.
Evidence: critical actions emit verifiable audit events.
Integrity: invalid transitions are rejected (and detectable).
Authenticity: actions are bound to identity and purpose.

Failure modes

Observability gaps during incidents (missing evidence).
Config drift that weakens security posture over time.
Timeout ambiguity causing double-apply or partial state transitions.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
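Timeout ambiguity is usually handled with idempotency keys, and the scope of the key matters. The sketch below is one possible shape, not a prescribed design: `DedupeKey` and `Ledger` are hypothetical, retention windows are ignored, and everything lives in memory.

```rust
use std::collections::HashMap;

/// Dedupe scope: the client-chosen key is only unique within
/// (tenant, resource, operation).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct DedupeKey {
    tenant: String,
    resource: String,
    operation: String,
    idempotency_key: String,
}

impl DedupeKey {
    pub fn new(tenant: &str, resource: &str, operation: &str, idempotency_key: &str) -> Self {
        DedupeKey {
            tenant: tenant.to_string(),
            resource: resource.to_string(),
            operation: operation.to_string(),
            idempotency_key: idempotency_key.to_string(),
        }
    }
}

#[derive(Default)]
pub struct Ledger {
    balances: HashMap<(String, String), u64>,
    /// Recorded result per scoped key, replayed on retry.
    seen: HashMap<DedupeKey, u64>,
}

impl Ledger {
    /// Credits an account once per scoped key; a retry returns the recorded result.
    pub fn credit(&mut self, key: DedupeKey, amount: u64) -> u64 {
        if let Some(&recorded) = self.seen.get(&key) {
            return recorded;
        }
        let balance = self
            .balances
            .entry((key.tenant.clone(), key.resource.clone()))
            .or_insert(0);
        *balance += amount;
        let result = *balance;
        self.seen.insert(key, result);
        result
    }

    pub fn balance(&self, tenant: &str, resource: &str) -> u64 {
        self.balances
            .get(&(tenant.to_string(), resource.to_string()))
            .copied()
            .unwrap_or(0)
    }
}
```

If the key were the bare `idempotency_key`, two tenants that happen to pick the same string would suppress each other's writes.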

Pitfall

Caches tend to become sources of truth unless you can recompute and validate them.

Design sketch

stateDiagram-v2
  [*] --> Init
  Init --> Ready: bootstrap()
  Ready --> Processing: event(e)
  Processing --> Ready: commit()
  Processing --> Error: violate(Inv)
  Error --> Ready: recover()
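The diagram above can be transcribed into a total transition function, which makes every edge not drawn an explicit rejection. The names `Phase`, `Transition`, and `next_phase` are assumptions chosen to mirror the diagram labels.

```rust
/// States from the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Init,
    Ready,
    Processing,
    Error,
}

/// Edge labels from the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transition {
    Bootstrap,
    Event,
    Commit,
    Violate,
    Recover,
}

/// Returns the next phase, or None for any edge the diagram does not draw.
pub fn next_phase(phase: Phase, transition: Transition) -> Option<Phase> {
    match (phase, transition) {
        (Phase::Init, Transition::Bootstrap) => Some(Phase::Ready),
        (Phase::Ready, Transition::Event) => Some(Phase::Processing),
        (Phase::Processing, Transition::Commit) => Some(Phase::Ready),
        (Phase::Processing, Transition::Violate) => Some(Phase::Error),
        (Phase::Error, Transition::Recover) => Some(Phase::Ready),
        _ => None,
    }
}
```

A fault-injection harness can then feed arbitrary transitions and assert that illegal ones return `None` rather than moving the system.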

Implementation notes

The goal isn’t cleverness—it’s eliminating ambiguity at boundaries and making recovery boring.

Rule of thumb

Make rollbacks boring: if rollback is a hero move, it will fail.

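use core::fmt;

/// Inputs the machine can observe, including injected faults.
#[derive(Clone, Debug)]
pub enum Event {
    /// Raw input bytes from a boundary.
    Input(Vec<u8>),
    /// A logical time step.
    Tick,
    /// A named fault to inject.
    Fault(&'static str),
}

/// A transition function together with the invariant it must preserve.
pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;

    /// Computes the next state from a shared reference, so a failed step
    /// cannot leave the current state half-updated.
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    /// Returns true when `state` satisfies the safety property.
    fn invariant(state: &Self::State) -> bool;
}

// Invariants are part of the API contract: check `invariant` after every successful `step`.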
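A minimal sketch of how the trait might be implemented and driven with injected faults. The `Event` and `StateMachine` definitions are repeated so the block compiles on its own; `Budget`, `BudgetState`, `BudgetError`, and `run` are hypothetical and exist only for illustration.

```rust
use core::fmt;

#[derive(Clone, Debug)]
pub enum Event {
    Input(Vec<u8>),
    Tick,
    Fault(&'static str),
}

pub trait StateMachine {
    type State: Clone + fmt::Debug;
    type Error: fmt::Debug;
    fn step(state: &Self::State, event: Event) -> Result<Self::State, Self::Error>;
    fn invariant(state: &Self::State) -> bool;
}

/// Hypothetical machine: a byte budget that must never exceed its cap.
pub struct Budget;

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct BudgetState {
    pub used: usize,
    pub cap: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BudgetError {
    OverCap,
    Injected(&'static str),
}

impl StateMachine for Budget {
    type State = BudgetState;
    type Error = BudgetError;

    fn step(state: &BudgetState, event: Event) -> Result<BudgetState, BudgetError> {
        match event {
            Event::Input(bytes) => {
                let used = state.used.checked_add(bytes.len()).ok_or(BudgetError::OverCap)?;
                if used > state.cap {
                    return Err(BudgetError::OverCap);
                }
                Ok(BudgetState { used, cap: state.cap })
            }
            // A tick resets the budget window.
            Event::Tick => Ok(BudgetState { used: 0, cap: state.cap }),
            // Injected faults surface as errors and must not change state.
            Event::Fault(name) => Err(BudgetError::Injected(name)),
        }
    }

    fn invariant(state: &BudgetState) -> bool {
        state.used <= state.cap
    }
}

/// Applies events in order. A failed step keeps the previous state.
/// Returns the final state and the number of rejected events.
pub fn run<M: StateMachine>(init: M::State, events: Vec<Event>) -> (M::State, usize) {
    let mut state = init;
    let mut rejected = 0;
    for event in events {
        match M::step(&state, event) {
            Ok(next) => {
                assert!(M::invariant(&next), "invariant violated: {:?}", next);
                state = next;
            }
            Err(_) => rejected += 1,
        }
    }
    (state, rejected)
}
```

Because `step` takes the state by reference and returns a new one, an injected fault in the middle of a sequence cannot leave a partially updated state behind.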

Verification strategy

Deterministic schedulers (e.g., Loom-like) to force rare interleavings.
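Metamorphic tests: assert relations between related runs rather than exact outputs; for example, applying an idempotent operation twice must leave the same state as applying it once.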
Fuzzing at the boundary: parsers, schema evolution, and “unknown field” handling.
Property-based tests: generate adversarial sequences and assert invariants after every step.
Crash/restart tests: persist mid-transition and validate recovery correctness.
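A property-based test can be sketched without any external crate by pairing a small deterministic generator with an invariant check after every step. The xorshift generator, the `Accounts` type, and `check_conservation` are assumptions for illustration; a real suite would more likely use a property-testing library with shrinking.

```rust
/// Tiny deterministic PRNG (xorshift64) so failures replay from the seed.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // Xorshift must not start from zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical system under test: three accounts with a conserved total.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Accounts {
    pub balances: [u64; 3],
}

impl Accounts {
    pub fn total(&self) -> u64 {
        self.balances.iter().sum()
    }

    /// Rejects self-transfers and overdrafts; a rejected transfer changes nothing.
    pub fn transfer(&mut self, from: usize, to: usize, amount: u64) -> bool {
        if from == to || self.balances[from] < amount {
            return false;
        }
        self.balances[from] -= amount;
        self.balances[to] += amount;
        true
    }
}

/// Generates `steps` pseudo-random transfers, including ones that should be
/// rejected, and checks the invariant after every step.
pub fn check_conservation(seed: u64, steps: usize) -> Result<(), String> {
    let mut rng = XorShift64::new(seed);
    let mut accounts = Accounts { balances: [100, 100, 100] };
    let expected = accounts.total();
    for i in 0..steps {
        let from = (rng.next_u64() % 3) as usize;
        let to = (rng.next_u64() % 3) as usize;
        let amount = rng.next_u64() % 150;
        accounts.transfer(from, to, amount);
        if accounts.total() != expected {
            return Err(format!(
                "seed {} step {}: total {} != {}",
                seed, i, accounts.total(), expected
            ));
        }
    }
    Ok(())
}
```

Reporting the seed and step in the error turns a rare failure into a test case that replays the same sequence every time.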

Operational notes

Design “degraded modes” explicitly (fail closed vs fail open per operation).
Instrument ambiguity: measure “unknown outcome” responses separately from failures.
Expose idempotency semantics explicitly (headers, keys, retention windows, error codes).
Run chaos drills focused on state: partial DB outages, replica lag, cache poisoning.
Log as evidence: append-only where possible; isolate logs from compromised workloads.

Operational note

Design playbooks as protocols: predictable steps, bounded risk, and clear ownership.

What to monitor

Admission-control / rate-limit rejections (by reason).
Retry/timeout rates by endpoint and client cohort.
Error budget burn + tail latency under load.
Authz failures and policy denials (unexpected spikes).
Invariant violation rate (should be ~0).

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Use canaries and staged rollout; stop early when signals degrade.
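An explicit rollback trigger can be as small as a pure function over a metrics snapshot. The field names, the threshold values used in any call, and the ordering of checks below are all assumptions, shown only to illustrate "metrics + thresholds" as code.

```rust
/// Limits that define the rollback trigger (hypothetical fields).
#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub max_error_rate: f64,
    pub max_p99_ms: u64,
    pub max_invariant_violations: u64,
}

/// Metrics observed during a canary window (hypothetical fields).
#[derive(Debug, Clone, Copy)]
pub struct Snapshot {
    pub requests: u64,
    pub errors: u64,
    pub p99_ms: u64,
    pub invariant_violations: u64,
}

/// Returns the first reason to roll back, or None if all signals are within limits.
/// Invariant violations are checked first because they indicate a correctness
/// problem rather than a performance one.
pub fn rollback_reason(s: &Snapshot, t: &Thresholds) -> Option<&'static str> {
    if s.invariant_violations > t.max_invariant_violations {
        return Some("invariant violations");
    }
    if s.requests > 0 && (s.errors as f64 / s.requests as f64) > t.max_error_rate {
        return Some("error rate");
    }
    if s.p99_ms > t.max_p99_ms {
        return Some("tail latency");
    }
    None
}
```

Keeping the trigger as a pure function means it can be unit-tested and replayed against recorded snapshots after an incident.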

Evidence

RFC 9110: HTTP Semantics (1) — Defines method semantics including idempotency and safety—useful for API contracts.
- Evidence: Method semantics (safe/idempotent) are contracts; tie retries and dedupe behavior to these semantics, not timeouts.
Jepsen (2) — Failure testing focused on correctness under partitions and reordering.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.

Open questions

What would you do if you had to replay a month of traffic into a rebuilt system?
Which operations need monotonic versioning vs idempotency keys vs both?
Which correctness properties can be enforced at compile time (types/capabilities)?
Where does your API currently allow ambiguous outcomes, and how will clients cope?

Checklist

Safety properties stated as invariants.
Assumptions listed and reviewed.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Telemetry captures correctness signals.
Failure modes enumerated with mitigations.
Rollback plan rehearsed and automated.
