Finality and Reorgs: What Users Think vs What Protocols Provide

Monthly research note. Theme: Blockchain Protocols.

TL;DR

Treat the gap between what users think finality means and what protocols actually provide as an engineering constraint: write down assumptions, make invariants executable, and design operational recovery as part of correctness.

Key insight

Treat “timeouts” as a third outcome: not success, not failure—ambiguity you must model.

Key takeaways

Mempools are adversarial schedulers: admission and fairness are protocol concerns.
Consensus safety is meaningless if execution is nondeterministic across nodes.
Finality guarantees are user security guarantees—document and enforce them.
Bind security decisions to evidence (audit, invariants, telemetry).
Make boundaries boring: validate inputs, cap costs, and be deterministic where needed.

Why this matters

Bridges reintroduce trust; you must model it explicitly.
Light clients shift assumptions; they must be written down.
Consensus safety is meaningless if execution is nondeterministic across nodes.
State growth is a security problem: it impacts decentralization and verification.

Key questions

What is the finality guarantee users can rely on (and when does it break)?
Where is the economic/DoS pressure applied (mempool, gossip, execution, storage)?
What is the determinism story (byte-for-byte re-execution across platforms)?
How do upgrades change security assumptions (fork choice, state transition rules)?
How do you defend against topology attacks (eclipse, partition, sybil)?
What is the reorg budget for applications and how do you communicate it?

Assumptions

Nodes are heterogeneous; determinism must survive platform differences.
Users and apps rely on probabilistic finality until proven otherwise.
Peers are untrusted; gossip can be manipulated for delay or isolation.
Upgrades happen under partial adoption; mixed-version is inevitable.

Non-goals

Allowing execution nondeterminism for performance convenience.
Assuming honest majority without defining the adversary’s budget.

Attack surface

Any unbounded work per request becomes a DoS primitive under adversaries.

Model & invariants

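A ledger is a replicated state machine. Safety is uniqueness of finalized history: any two finalized histories must be prefix-comparable, where h_1 \preceq h_2 means that h_1 is a prefix of h_2: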

\forall h_1,h_2:\ \mathrm{Final}(h_1)\wedge \mathrm{Final}(h_2)\Rightarrow h_1 \preceq h_2 \ \vee\ h_2 \preceq h_1.
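The safety property above can be made executable as a pairwise prefix check. This is a minimal sketch, assuming each finalized history is available as an ordered list of block hashes; the function names `is_prefix` and `finalized_views_are_safe` are illustrative and not taken from any particular client.

```python
from typing import Sequence


def is_prefix(a: Sequence[str], b: Sequence[str]) -> bool:
    """Return True if history a is a prefix of history b."""
    return len(a) <= len(b) and list(b[: len(a)]) == list(a)


def finalized_views_are_safe(views: Sequence[Sequence[str]]) -> bool:
    """Check that every pair of finalized histories is prefix-comparable."""
    for i, h1 in enumerate(views):
        for h2 in views[i + 1:]:
            if not (is_prefix(h1, h2) or is_prefix(h2, h1)):
                return False  # two nodes finalized conflicting histories
    return True
```

A check like this could run over the finalized views collected from several nodes in a test network. A single False result is a safety violation, not a liveness hiccup.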

Model the mempool as an adversarial scheduler: it chooses which work gets executed.

Explicitly model upgrade boundaries: old rules vs new rules during transition.

Invariant

If the system can enter an invalid state, it eventually will—usually during an incident.

Security properties

Downgrade resistance: negotiation can’t silently weaken security posture.
Authenticity: actions are bound to identity and purpose.
Replay resistance: duplicated inputs do not change outcomes.
Integrity: invalid transitions are rejected (and detectable).

Failure modes

Mixed-version behavior that violates assumptions silently.
Resource exhaustion (CPU/bandwidth/storage) turning into correctness failures.
Recovery paths that only work when nothing is broken.
Observability gaps during incidents (missing evidence).

Pitfall

A recovery plan that isn’t exercised will fail when you need it.

Design sketch

sequenceDiagram
  participant U as User
  participant N as Node
  participant P as Peers
  U->>N: submit(tx)
  N->>P: gossip(tx)
  P-->>N: gossip(more tx)
  Note over N: admission + ordering
  N-->>U: inclusion/finality signal

Implementation notes

Encode resource accounting and limits early; retrofits are painful.

Rule of thumb

If you can’t explain a timeout outcome, you can’t make retries safe.
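One way to make the rule of thumb concrete is to give the ambiguous case its own value and resolve it before retrying. The sketch below assumes two hypothetical caller-supplied callables, `submit` and `lookup`, and assumes that resubmitting the same `tx_id` is idempotent (for example, the same signed transaction with the same nonce). None of these names come from a real client API.

```python
from enum import Enum


class Outcome(Enum):
    INCLUDED = "included"
    REJECTED = "rejected"
    UNKNOWN = "unknown"  # timed out: the tx may or may not have landed


def submit_with_retries(submit, lookup, tx_id: str, max_attempts: int = 3) -> Outcome:
    """Submit a tx, treating a timeout as ambiguity rather than failure.

    submit(tx_id) -> Outcome and lookup(tx_id) -> bool are hypothetical
    callables supplied by the caller.
    """
    for _ in range(max_attempts):
        outcome = submit(tx_id)
        if outcome is not Outcome.UNKNOWN:
            return outcome
        # Ambiguous result: check observable state before sending again.
        if lookup(tx_id):
            return Outcome.INCLUDED
    # Still ambiguous: report that honestly instead of guessing.
    return Outcome.UNKNOWN
```

Returning UNKNOWN to the caller is deliberate: collapsing it into failure invites double submission, and collapsing it into success invites acting on a transaction that never landed.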

Mempool hardening checklist:
- Per-peer rate limits + global admission budget
- Duplicate detection and eviction policy
- Signature verification batching with caps
- Anti-DoS: bounded decode/parse cost
- Fairness: per-sender quotas (avoid hot-account starvation)
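Several items on the checklist above can be sketched as a single admission function that runs the cheapest checks first. This is an illustrative sketch only: the `Admission` class, its limits, and the rejection reason strings are assumptions, and it leaves out window resets, eviction, and signature verification.

```python
from dataclasses import dataclass, field


@dataclass
class Admission:
    max_tx_bytes: int = 1024    # bounds decode/parse cost
    per_peer_limit: int = 100   # admitted txs per peer (per window)
    per_sender_limit: int = 16  # pending txs per sender
    global_limit: int = 5000    # total pending txs
    seen: set = field(default_factory=set)
    peer_counts: dict = field(default_factory=dict)
    sender_counts: dict = field(default_factory=dict)

    def admit(self, peer: str, sender: str, tx_hash: str, raw: bytes) -> str:
        """Return "ok" or a rejection reason; cheapest checks run first."""
        if len(raw) > self.max_tx_bytes:
            return "too_large"
        if tx_hash in self.seen:
            return "duplicate"
        # .get avoids creating map entries for rejected peers and senders.
        if self.peer_counts.get(peer, 0) >= self.per_peer_limit:
            return "peer_rate_limited"
        if self.sender_counts.get(sender, 0) >= self.per_sender_limit:
            return "sender_quota"
        if len(self.seen) >= self.global_limit:
            return "pool_full"
        self.seen.add(tx_hash)
        self.peer_counts[peer] = self.peer_counts.get(peer, 0) + 1
        self.sender_counts[sender] = self.sender_counts.get(sender, 0) + 1
        return "ok"
```

Returning a reason string rather than a boolean makes it straightforward to count rejections by reason, which is what the monitoring guidance in this note asks for.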

Verification strategy

Cross-implementation tests when multiple clients exist.
Determinism tests across architectures (x86/ARM) and OSes.
Adversarial mempool tests: spam, pinning, worst-case signature patterns.
Formal invariants for supply/balance conservation where appropriate.
Fork/reorg simulations: application-facing invariants under reorgs.
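A determinism test needs a digest that every platform computes identically, and a conservation invariant needs a transition it can be checked against. The sketch below uses a toy balance map and a canonical JSON encoding; `state_digest` and `apply_transfers` are hypothetical helpers, and a real chain would use its own canonical serialization rather than JSON.

```python
import hashlib
import json


def state_digest(state: dict) -> str:
    """Hash a canonical encoding: sorted keys, no insignificant whitespace."""
    encoded = json.dumps(state, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def apply_transfers(balances: dict, transfers: list) -> dict:
    """Toy state transition: apply (sender, receiver, amount) tuples in order.

    Transfers with a non-positive amount or insufficient funds are skipped,
    so the total supply is conserved.
    """
    out = dict(balances)
    for sender, receiver, amount in transfers:
        if amount > 0 and out.get(sender, 0) >= amount:
            out[sender] = out.get(sender, 0) - amount
            out[receiver] = out.get(receiver, 0) + amount
    return out
```

In a cross-platform or cross-implementation run, each environment would execute the same transfer list and publish its digest; any mismatch is a determinism bug. Keeping balances as integers avoids floating-point differences between architectures.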

Operational notes

Protect peer tables against eclipse attempts (diversity, scoring, rotation).
Monitor reorg depth and frequency; treat increases as incidents.
Keep execution resource limits explicit and enforced.
Rehearse upgrades with mixed versions and rollback paths.
Measure invalid tx rejection reasons and rates; a sudden shift in the mix of reasons can be a signature of spam.
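Monitoring reorg depth requires a precise definition of depth. A minimal sketch, assuming chains are represented as lists of block hashes from genesis to tip; `reorg_depth` and `is_settled` are illustrative names, and the reorg budget is an application-chosen parameter rather than a protocol constant.

```python
def reorg_depth(old_chain: list, new_chain: list) -> int:
    """Number of blocks on the old canonical chain that the new chain dropped."""
    common = 0
    for old_hash, new_hash in zip(old_chain, new_chain):
        if old_hash != new_hash:
            break
        common += 1
    return len(old_chain) - common


def is_settled(tx_height: int, tip_height: int, reorg_budget: int) -> bool:
    """True once at least reorg_budget blocks are built on top of the tx's block."""
    return tip_height - tx_height >= reorg_budget
```

A pure extension of the chain yields depth 0, so alerting on nonzero depth, and paging on depth at or above the application's reorg budget, gives two distinct signals.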

Operational note

Attach explicit rollout/rollback triggers to changes that touch security or correctness.

What to monitor

Authz failures and policy denials (unexpected spikes).
Rollback events and the conditions that triggered them.
Invariant violation rate (should be ~0).
Admission-control / rate-limit rejections (by reason).
Error budget burn + tail latency under load.

Rollback plan

Define an explicit rollback trigger (metrics + thresholds).
Preserve evidence (configs, artifacts, audit logs) to reconstruct what changed.
Prefer backward-compatible changes; avoid “flag day” upgrades.
Keep dual-write / dual-verify windows where appropriate.
Use canaries and staged rollout; stop early when signals degrade.
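An explicit rollback trigger can be expressed as data plus one small function. The sketch below is an assumption-laden illustration: the `Threshold` type, the `breached` function, and the metric names in the usage note are hypothetical and would need to match whatever telemetry is actually collected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


def breached(thresholds, observed: dict) -> list:
    """Return the names of metrics that are over threshold.

    A metric missing from `observed` counts as breached: absent evidence
    is treated as a reason to stop, not as a pass.
    """
    out = []
    for t in thresholds:
        value = observed.get(t.metric)
        if value is None or value > t.max_value:
            out.append(t.metric)
    return out
```

For example, with thresholds on `invariant_violations` (maximum 0) and `max_reorg_depth` (maximum 2), a canary reporting a reorg depth of 3 returns `["max_reorg_depth"]`, and a staged rollout would halt on any non-empty result.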

Evidence

Jepsen (1) — Fault injection and correctness testing for distributed systems.
- Evidence: Turn faults into test cases; prioritize partition and clock-skew scenarios that violate user-visible guarantees.
Learn TLA+ (2) — Practical entry point for specification and model checking.
- Evidence: Model the smallest thing that can break; use model checking to validate invariants before optimizing.

Open questions

Where does your implementation accidentally depend on local wall-clock time?
How do you communicate finality uncertainty to users without lying?
Which invariants should be proven vs tested vs monitored?
What is the worst-case work a single transaction can force?

Checklist

Rollback plan rehearsed and automated.
Costs bounded (CPU/memory/bandwidth) under adversarial inputs.
Safety properties stated as invariants.
Failure modes enumerated with mitigations.
Telemetry captures correctness signals.
Assumptions listed and reviewed.
