Stateful Signatures Are a Distributed Systems Problem: XMSS/LMS Without Index Reuse

Paper/spec-driven systems note. Theme: PQC that fails because of systems engineering, not math.

TL;DR

Stateful hash-based signatures (XMSS, LMS/HSS) are attractive in post-quantum migrations because their security rests on hash functions and conservative assumptions. But they hide a non-negotiable constraint: each one-time signing key must be used at most once. In practice that means: your signature scheme is only as strong as your crash-consistency and concurrency control. If you cannot guarantee “no index reuse” under retries, rollbacks, snapshots, and partial deployment, you are not deploying PQC — you are deploying a latent signing-key compromise.

Key insight

For XMSS/LMS, correctness is an invariant on durable state. The cryptography is only the leaf function; the security boundary is the state machine that allocates and commits leaf indices.

Key takeaways

Index reuse is catastrophic, not “degraded security”. Treat it like key exfiltration.
The signing counter is a replicated state machine. Build it with linearizability, not best-effort databases.
Crash consistency beats cleverness. Burning indices is acceptable; reusing indices is not.
Rollback attacks are real in cloud/edge fleets. Snapshots, restores, and imaging are an adversary primitive.
Operational evidence is part of the scheme. If you can’t prove which indices were used, you can’t prove you’re still secure.

Introduction (pragmatic abstract: why you should care today)

The supply-chain incident is not “someone broke SHA-256”. It’s usually one of:

a compromised CI signer,
a leaked code-signing key,
a rollback to an old firmware image and a forged update chain,
or an operator restoring a “known good” backup that accidentally rewinds signing state.

In PQC migration programs, stateful hash-based signatures are often proposed for the conservative path: they are standardized, their assumptions are narrow, and their security is expected to hold even under aggressive quantum timelines. (1) (2) (3)

But stateful signatures demand that you treat the signing key as a protocol state.

If you are signing firmware for IIoT devices, you’re signing into an adversarial lifecycle: devices get cloned, images get restored, regional partitions happen, and “just retry” becomes policy. That is exactly the environment where index reuse happens unless you engineer against it.

I’m writing this the way I operate in Chile: fewer slogans, more invariants. If the system can’t fail, you don’t “enable PQC” — you prove that your allocator cannot reuse a leaf index under the failure model you actually have.

Key questions

What is your signing state exactly (counter, tree id, subtree, key epoch)?
Is index allocation linearizable across all signers?
What happens if a signer crashes after producing a signature but before persisting state?
Can an attacker force a rollback of signing state (snapshot restore, disk imaging, DB restore)?
Do you have evidence (logs, receipts, transparency) that binds each signature to a unique index?

Assumptions

I’ll be explicit because “implicit assumptions” become production incidents.

Hash functions behave as modeled (preimage/second-preimage resistance).
Adversary can observe signatures and choose messages (EUF-CMA setting).
Operators can and will restore from backups; edge devices can and will be imaged.
Failures include process crashes, disk-full, partial writes, timeouts, and retries.
Some components may be malicious or compromised (CI worker, signing host), but we still require the non-reuse invariant to hold.

Non-goals

Designing a brand new signature scheme. Use standardized constructions. (1) (2) (3)
Proving detailed cryptographic bounds here. I focus on the systems invariant the proofs depend on.
Solving global supply chain security. I’m isolating the signer state problem.

Security properties

This is what “secure deployment” means in operational terms.

P1 — Unforgeability (EUF-CMA, in the intended threat model)

An attacker who sees signatures $\sigma_1,\dots,\sigma_q$ for chosen messages should not be able to produce a valid signature $\sigma^{*}$ for a new message $m^{*}$ with non-negligible probability.

P2 — No index reuse (deployment invariant)

For each keypair and each leaf index $i$, at most one signature is ever produced using the one-time key at $i$.

In words: the key is stateful. If you cannot enforce P2, P1 is not a meaningful claim.

Invariant

NoReuse: For a given signing key kid, every leaf index i is used at most once, across all replicas, across all time, including after crash recovery and restores.

P3 — Rollback resistance (or rollback detection)

You must prevent or detect state rollback that could cause index reuse. “Detect” is acceptable only if your response is “treat as compromise; rotate; revoke”.

Failure modes

These are the places where teams get hurt: not on paper, but in production.

Concurrent signers racing on the same counter (eventual-consistency DB, stale caches).
Crash after signing but before committing state → the system “forgets” it used an index.
Backup restore / snapshot rollback rewinds the counter.
Partial deployment where old/new versions interpret state differently (range reservation, burn semantics).
Sharded state without coordination (two regions allocate overlapping index ranges).
Opaque evidence: you cannot answer “which indices were used?” during incident response.

Pitfall

“We store the counter in Postgres” is not a design. The question is: what isolation level, what recovery semantics, what rollback story, what evidence?

What to monitor

Operability is part of correctness for stateful signatures: you need signals that correspond to proof obligations.

Current index / remaining capacity (per key id, per subtree).
Signature rate vs index burn rate (burn spikes indicate retries/crashes).
Allocator linearizability signals: leader changes, term changes, commit lag (if Raft/Paxos).
Duplicate detection: any reuse event must page immediately (treat as key compromise).
State durability health: fsync latency, WAL lag, disk-full events, snapshot restore events.
Fleet drift: which signer version and which state schema is active.

The Mathematical Anatomy of the Problem

Stateful signatures are not hard because Merkle trees are hard. They are hard because one-time signatures are not “one-time-ish”.

I’ll use the XMSS/LMS family (Merkle tree over OTS keys) because that’s the shared shape. (1) (2)

Merkle signatures in one page

You have:

A Merkle tree of height $h$ with $2^h$ leaves.
Each leaf $i$ commits to a one-time public key $\mathrm{pk}_i$.
The global public key is the Merkle root $\mathrm{root}$.

A signature on message $m$ at index $i$ contains:

an OTS signature $\sigma_i$ proving knowledge of $\mathrm{sk}_i$ for message $m$,
an authentication path $\pi_i$ proving that $\mathrm{pk}_i$ is in the tree under $\mathrm{root}$.

Verification is:

$$\mathrm{Verify}(\mathrm{root}, m, i, \sigma_i, \pi_i) = \mathrm{VerifyOTS}(\mathrm{pk}_i, m, \sigma_i) \wedge \mathrm{VerifyMerkle}(\mathrm{root}, \mathrm{pk}_i, i, \pi_i).$$

The root stays fixed. The only signer-side state that changes from one signature to the next is the leaf index $i$, and each value of $i$ selects a different one-time key.
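The Merkle half of that verification can be sketched in a few lines. This is a toy model, not XMSS or LMS: `h2` is a stand-in built on the standard library's non-cryptographic `DefaultHasher`, leaves are plain `u64` values standing in for hashed one-time public keys, and the names `merkle_root`, `auth_path`, and `verify_merkle` are hypothetical. What it does show faithfully is how the index $i$ decides, level by level, whether the running node is the left or right child.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Toy two-to-one hash for interior nodes. NOT cryptographic; a real scheme
/// uses the hash and domain separation fixed by its parameter set.
fn h2(left: u64, right: u64) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u64(left);
    h.write_u64(right);
    h.finish()
}

/// Hash one level of the tree into its parent level.
fn parent_level(level: &[u64]) -> Vec<u64> {
    level.chunks(2).map(|pair| h2(pair[0], pair[1])).collect()
}

/// Root of a tree whose leaf count is a power of two.
fn merkle_root(leaves: &[u64]) -> u64 {
    assert!(leaves.len().is_power_of_two());
    let mut level = leaves.to_vec();
    while level.len() > 1 {
        level = parent_level(&level);
    }
    level[0]
}

/// Authentication path for leaf `index`: the sibling at every level, bottom up.
fn auth_path(leaves: &[u64], index: usize) -> Vec<u64> {
    assert!(leaves.len().is_power_of_two() && index < leaves.len());
    let (mut level, mut idx, mut path) = (leaves.to_vec(), index, Vec::new());
    while level.len() > 1 {
        path.push(level[idx ^ 1]); // sibling shares all bits except the lowest
        level = parent_level(&level);
        idx >>= 1;
    }
    path
}

/// Recompute the root from a leaf, its index, and its authentication path.
fn verify_merkle(root: u64, leaf: u64, index: usize, path: &[u64]) -> bool {
    let (mut node, mut idx) = (leaf, index);
    for &sibling in path {
        // Even position: we are the left child. Odd position: the right child.
        node = if idx & 1 == 0 { h2(node, sibling) } else { h2(sibling, node) };
        idx >>= 1;
    }
    // idx == 0 rejects an index that does not fit in a tree of this height.
    idx == 0 && node == root
}
```

Because the index is bound into the path computation, a verifier can always tell which leaf a signature consumed. That is what makes external duplicate detection possible later on.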

Why “one-time” is an invariant, not a suggestion

In WOTS+/LM-OTS-style constructions, the signature leaks structured information about the secret key. The security proof assumes you leak that structure once. Twice is a different game.

Here’s the minimal intuition using a Winternitz-like chain view (XMSS uses WOTS+): (1)

Let $F$ be a one-way function (modeled as a hash). For each chain position $j \in \{1,\dots,\ell\}$:

secret seed: $x_j$
public value: $y_j = F^{w-1}(x_j)$ for Winternitz parameter $w$ (each chain has $w-1$ hash steps)

For a message $m$, you compute a base-$w$ representation of its digest, plus a checksum, that yields digits $a_j(m) \in \{0,\dots,w-1\}$.

The signature reveals:

$$s_j = F^{a_j(m)}(x_j).$$

If the same one-time key is used twice for messages $m_1, m_2$, the attacker learns:

$$F^{a_j(m_1)}(x_j),\quad F^{a_j(m_2)}(x_j).$$

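Since hashing forward is easy, the attacker can compute, for every chain $j$:

$$F^{k}(x_j) \quad \text{for all } k \ge \min(a_j(m_1), a_j(m_2))$$

by hashing forward from the shallower of the two revealed values. Any message whose digits all sit at or above those per-chain minimums can then be signed without the secret key. With a single signature the checksum digits block this, because raising a message digit lowers a checksum digit. With two signatures that protection weakens, and with enough reuse and chosen messages this becomes a practical forging path.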
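A toy version of a single chain makes the leak concrete. This is a sketch under loud assumptions: `f` is built on the standard library's non-cryptographic `DefaultHasher` purely so the example runs without dependencies, secrets are bare `u64` values, and `sign_digit` and `forge_digit` are hypothetical names, not part of any XMSS or LMS API. It shows only the structural point: two revealed values on the same chain let anyone produce the chain value at any depth at or beyond the shallower one.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Toy stand-in for the one-way function F. NOT cryptographic.
fn f(x: u64) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u64(x);
    h.finish()
}

/// Apply F `n` times: F^n(x).
fn chain(x: u64, n: u32) -> u64 {
    (0..n).fold(x, |v, _| f(v))
}

/// What the legitimate signer reveals on one chain for digit `a`: F^a(secret).
fn sign_digit(secret: u64, a: u32) -> u64 {
    chain(secret, a)
}

/// What an attacker can derive for a target digit `k` after seeing two
/// revealed values (a1, s1) and (a2, s2) on the same chain. Only forward
/// hashing is used; the secret never appears.
fn forge_digit(a1: u32, s1: u64, a2: u32, s2: u64, k: u32) -> Option<u64> {
    // Start from the shallower revealed value, since it reaches more depths.
    let (lo, s_lo) = if a1 <= a2 { (a1, s1) } else { (a2, s2) };
    if k >= lo {
        Some(chain(s_lo, k - lo))
    } else {
        None // would require inverting F
    }
}
```

With digits 9 and 3 revealed on a chain, `forge_digit` can produce the value for digit 5, which neither signature contained, but not the value for digit 2. A real attack repeats this across all chains, including the checksum chains.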

You do not need to memorize the exact attack to engineer correctly. You need to internalize the operational conclusion:

Rule of thumb

For XMSS/LMS, a single index reuse is a “stop the world” event. Rotate keys, revoke certificates, and treat all artifacts since the last known-good index as suspect.

The invariant as a formal predicate

Model the signer as a state machine with durable state:

next : Nat (the next unused index)
used : SUBSET Nat (or, more realistically, an append-only log)

The deployment invariant is:

$$\mathrm{NoReuse} \equiv \forall i.\; i \in \mathrm{used} \Rightarrow i < \mathrm{next}\;\;\wedge\;\; \mathrm{used}\ \text{is monotone}.$$

In TLA+-style pseudocode:

VARIABLES next, used

Init ==
  /\ next = 0
  /\ used = {}

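Reserve ==
  /\ used' = used \cup {next}
  /\ next' = next + 1

\* State invariant: every used index lies below next.
Inv_NoReuse == used \subseteq 0..(next - 1)

\* Action property (primed variables cannot appear in a state invariant):
\* used never shrinks, so a rollback is a violation.
Prop_Monotone == [][used \subseteq used']_<<next, used>>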

This looks trivial until you map it onto real failures:

Reserve must be linearizable across signers.
used must be durable across crashes.
state must not roll back.

That is where systems engineering starts.
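One way to see why the third requirement matters is to run the model and then rewind it. The sketch below is an assumption-laden toy: `SignerState` mirrors the two variables above, `reused` is a hypothetical helper that inspects indices from the outside (as a verifier or transparency log would), and a snapshot restore is modeled as a plain `clone` assigned back.

```rust
use std::collections::BTreeSet;

/// In-memory mirror of the model's state variables.
#[derive(Clone, Debug, Default)]
struct SignerState {
    next: u64,
    used: BTreeSet<u64>,
}

impl SignerState {
    /// The Reserve action: hand out `next`, then advance it.
    fn reserve(&mut self) -> u64 {
        let i = self.next;
        self.next += 1;
        self.used.insert(i);
        i
    }

    /// State part of NoReuse: every used index lies below `next`.
    fn inv_no_reuse(&self) -> bool {
        self.used.iter().all(|&i| i < self.next)
    }
}

/// Indices that were handed out more than once, judged from the outside
/// (for example from published signatures), not from the signer's own state.
fn reused(issued: &[u64]) -> Vec<u64> {
    let mut seen = BTreeSet::new();
    let mut dup = BTreeSet::new();
    for &i in issued {
        if !seen.insert(i) {
            dup.insert(i);
        }
    }
    dup.into_iter().collect()
}
```

If you reserve twice, snapshot, reserve once more, restore the snapshot, and reserve again, the issued sequence is 0, 1, 2, 2. The restored state still satisfies `inv_no_reuse`, because the evidence of the first use of index 2 was rolled back with it. The state invariant alone cannot catch a rollback; the monotonicity requirement has to be enforced by something outside the rollback domain.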

From Proofs to Binaries: The Implementation Challenge

Formal models talk about “steps”. Your deployment talks about:

scheduler jitter,
fsync latency,
retries under timeouts,
backups and restores,
region-level partitions,
and adversaries who turn those into weapons.

1) Concurrency: allocation must be linearizable

If you have more than one signing worker, you need a single source of truth for next.

Correct solutions:

a dedicated allocator replicated with Raft/Paxos (linearizable log) (4),
an HSM with an internal monotonic counter (if it exists and is trustworthy),
a single leader signer with strict fencing + durable WAL.

Incorrect solutions (common in the wild):

eventually consistent caches,
“best effort” database updates without serializable semantics,
“allocate ranges per region” without a global coordination story.

Attack surface

Index reuse is not only a bug. It is an adversary primitive: force retries + partitions + restores until your allocator violates linearizability.
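Within a single process, the "single source of truth for next" can be sketched with an atomic counter. This is only the local analogue of the problem: `LocalAllocator` and `reserve_concurrently` are hypothetical names, nothing here is durable or replicated, and a multi-host deployment needs the consensus-backed or hardware-backed options listed above. The sketch shows the property to aim for: every caller gets a distinct index no matter how threads interleave.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical in-process allocator. Not durable, not replicated.
struct LocalAllocator {
    next: AtomicU64,
    capacity: u64, // number of leaves, 2^h for a tree of height h
}

impl LocalAllocator {
    fn new(capacity: u64) -> Self {
        Self { next: AtomicU64::new(0), capacity }
    }

    /// Atomically claim the next index; `None` once the key is exhausted.
    /// `fetch_add` is a single read-modify-write, so two threads can never
    /// observe the same value. (It wraps on u64 overflow, which this sketch
    /// assumes is unreachable.)
    fn reserve(&self) -> Option<u64> {
        let i = self.next.fetch_add(1, Ordering::SeqCst);
        if i < self.capacity { Some(i) } else { None }
    }
}

/// Spawn `workers` threads that each reserve `per_worker` indices,
/// and return everything they were handed.
fn reserve_concurrently(
    alloc: Arc<LocalAllocator>,
    workers: usize,
    per_worker: usize,
) -> Vec<u64> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let a = Arc::clone(&alloc);
            thread::spawn(move || {
                (0..per_worker).filter_map(|_| a.reserve()).collect::<Vec<u64>>()
            })
        })
        .collect();
    handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

A read-then-write version (`load`, add one, `store`) would pass most tests and still hand out duplicates under contention, which is the in-process equivalent of a best-effort database update.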

2) Crash consistency: durability must happen before you return success

The hardest bug is:

signer produces signature $\sigma$ for index $i$,
process crashes before persisting “i was used”,
on restart, the signer reuses $i$.

The safe pattern is intentionally boring:

Reserve index $i$ by appending to durable log and fsync.
Sign message using $i$.
Record signature receipt (message hash, artifact id, timestamp, index) for evidence.
If signing fails mid-flight, burn the index anyway.

Burning indices reduces capacity. Reusing an index destroys security.
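The reserve-then-sign ordering can be sketched with nothing but the standard library. Everything here is an illustrative assumption: `DurableCounter`, `sync_parent_dir`, and `sign_with_reservation` are hypothetical names, the state is a single decimal number in a file rather than an append-only log, and the counter is safe for one process only (cross-process or cross-host use needs the allocator from the previous section). The part that carries over is the order of operations: write the advanced counter, `sync_all`, rename, sync the directory, and only then let the index reach the signing code.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Flush the directory entry so a completed rename survives a crash.
#[cfg(unix)]
fn sync_parent_dir(path: &Path) -> io::Result<()> {
    let dir = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    File::open(dir)?.sync_all()
}

#[cfg(not(unix))]
fn sync_parent_dir(_path: &Path) -> io::Result<()> {
    Ok(()) // assumption: platform-specific handling is out of scope here
}

/// Hypothetical single-process counter file holding `next` as decimal text.
struct DurableCounter {
    path: PathBuf,
}

impl DurableCounter {
    /// Create state for a brand-new key; refuses to overwrite existing state.
    fn init(path: &Path) -> io::Result<Self> {
        let mut f = OpenOptions::new().write(true).create_new(true).open(path)?;
        f.write_all(b"0")?;
        f.sync_all()?;
        sync_parent_dir(path)?;
        Ok(Self { path: path.to_path_buf() })
    }

    /// Open existing state; a missing file is an error, never "start from 0".
    fn open(path: &Path) -> io::Result<Self> {
        let ctr = Self { path: path.to_path_buf() };
        ctr.load()?;
        Ok(ctr)
    }

    fn load(&self) -> io::Result<u64> {
        fs::read_to_string(&self.path)?
            .trim()
            .parse::<u64>()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
    }

    /// Persist `next = i + 1` to stable storage, then hand out `i`.
    fn reserve(&self) -> io::Result<u64> {
        let i = self.load()?;
        let next = i
            .checked_add(1)
            .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "index space exhausted"))?;
        let tmp = self.path.with_extension("tmp");
        let mut f = File::create(&tmp)?;
        f.write_all(next.to_string().as_bytes())?;
        f.sync_all()?; // file contents are on stable storage
        fs::rename(&tmp, &self.path)?; // atomic replace on POSIX filesystems
        sync_parent_dir(&self.path)?; // the rename itself is durable
        Ok(i)
    }
}

/// Reserve first, sign second. If `sign` fails, the index stays burned.
fn sign_with_reservation<T, E>(
    ctr: &DurableCounter,
    sign: impl FnOnce(u64) -> Result<T, E>,
) -> io::Result<(u64, Result<T, E>)> {
    let i = ctr.reserve()?;
    Ok((i, sign(i)))
}
```

Two choices are deliberate. `open` treats a missing state file as an error, because silently starting from zero is exactly the rewind this section warns about. And `sign_with_reservation` never gives an index back: a failed signing attempt costs one leaf, which is the trade the text argues for.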

3) Rollback attacks: snapshots are an adversary tool

If your signing state lives on disk and you restore an old snapshot, your counter goes backwards. Every index between the restored value and the pre-restore value will then be issued a second time, which is index reuse.

Mitigations, in increasing order of strength:

Detect rollbacks: remote transparency log of (kid, index, artifact-hash); alert on non-monotone indices.
Prevent rollbacks: store the counter in tamper-resistant hardware (TPM monotonic counters, HSM state) — with skepticism about vendor semantics.
Make rollback irrelevant: run the allocator as a replicated state machine with quorum persistence; do not restore it from point-in-time backups without a protocol.
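The detection option can be sketched as a small monitor that consumes (kid, index) pairs from the transparency log and flags any index that does not advance. `RollbackMonitor` and `Verdict` are hypothetical names, the artifact hash is left out to keep the example short, and the monitor is assumed to live outside the signer's rollback domain; if it is restored from the same snapshot, it detects nothing.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    /// Index is strictly above everything seen so far for this key.
    Fresh,
    /// Index is at or below the high-water mark: possible rollback or reuse.
    NonMonotone { last_seen: u64, got: u64 },
}

/// Hypothetical monitor fed from a transparency log of (kid, index) entries.
#[derive(Default)]
struct RollbackMonitor {
    high_water: HashMap<String, u64>,
}

impl RollbackMonitor {
    fn observe(&mut self, kid: &str, index: u64) -> Verdict {
        match self.high_water.get(kid).copied() {
            Some(last) if index <= last => {
                // Keep the old high-water mark so later entries are still
                // judged against the true maximum.
                Verdict::NonMonotone { last_seen: last, got: index }
            }
            _ => {
                self.high_water.insert(kid.to_string(), index);
                Verdict::Fresh
            }
        }
    }
}
```

A strict high-water mark assumes signatures are published in index order. If signers hold range leases and publish out of order, this check will raise false alarms, and a per-key set of seen indices is the better structure. Either way, a `NonMonotone` verdict should feed the "treat as compromise" path rather than a dashboard.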

4) Refinement mapping: keep the spec-to-code bridge explicit

The formal model’s state is (next, used). The implementation’s state becomes:

a WAL segment with committed reservations,
an allocator term/leader epoch,
a signer’s local reservation lease,
and a set of receipts that can be audited.

Write the refinement mapping down:

next ↔ one past the highest index covered by any committed reservation in the allocator log.
used ↔ committed reservation set (or ranges) + receipts.

If you can’t express that mapping, you can’t convincingly argue you implemented the invariant.
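Writing the mapping as code is one way to keep it honest. The sketch below assumes the allocator log is a list of committed ranges; `CommittedReservation`, `abstract_next`, `abstract_used_contains`, and `reservations_disjoint` are hypothetical names. It reads the abstract next as one past the highest reserved index, which is an interpretation of the mapping rather than something the specs dictate.

```rust
/// One committed entry in the allocator log: indices start..start+len.
#[derive(Clone, Copy, Debug)]
struct CommittedReservation {
    start: u64,
    len: u32,
}

impl CommittedReservation {
    /// Exclusive upper bound of the range.
    fn end(&self) -> u64 {
        self.start + u64::from(self.len)
    }
}

/// Abstract `next`: one past the highest index covered by any reservation.
fn abstract_next(log: &[CommittedReservation]) -> u64 {
    log.iter().map(|r| r.end()).max().unwrap_or(0)
}

/// Abstract `used` membership: a reserved index counts as used even if no
/// signature was produced, because burned indices must never be reissued.
fn abstract_used_contains(log: &[CommittedReservation], i: u64) -> bool {
    log.iter().any(|r| r.start <= i && i < r.end())
}

/// The concrete form of NoReuse: no two committed reservations overlap.
fn reservations_disjoint(log: &[CommittedReservation]) -> bool {
    let mut sorted: Vec<_> = log.iter().filter(|r| r.len > 0).copied().collect();
    sorted.sort_by_key(|r| r.start);
    sorted.windows(2).all(|w| w[0].end() <= w[1].start)
}
```

Running `reservations_disjoint` over the real log, in tests and as a periodic audit, turns the refinement argument into something that can fail loudly.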

Implementation sketch (Rust)

Treat index allocation as an interface with explicit failure semantics:

index_allocator.rs

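pub struct Reservation {
    pub key_id: String,
    pub start: u64,
    pub len: u32,
    pub epoch: u64, // fencing token
}
// Assumed layout: evidence for one consumed index, mirroring the receipt step above.
pub struct Receipt {
    pub index: u64,
    pub message_hash: [u8; 32],
    pub artifact_id: String,
    pub timestamp: u64,
}
// Assumed, illustrative error cases; the real set depends on the backend.
#[derive(Debug)]
pub enum AllocError { Exhausted, StaleEpoch, Storage(String) }

pub trait IndexAllocator {
    // Must durably commit the reservation before returning Ok.
    fn reserve(&self, key_id: &str, len: u32) -> Result<Reservation, AllocError>;
    fn commit_receipt(&self, receipt: Receipt) -> Result<(), AllocError>;
}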

The invariants the implementation must preserve are not “Rust safety” invariants. They are protocol invariants:

{ Linearizable(next) * DurableLog(kid) }
reserve(kid, len)
{ Disjoint(reservation, prior) ∧ Monotone(next) }

If you cannot test linearizability under adversarial schedules, you are guessing. Use deterministic concurrency testing where possible (e.g., Loom for the local state machine) and fault-injection for the allocator boundary.

Rollback plan

This is incident response, not wishful thinking. If you don’t rehearse it, you don’t have it.

Trigger: any evidence of index reuse, rollback, or allocator split-brain.
Immediate action: stop signing; quarantine signing workers; preserve disks/logs for forensics.
Containment: rotate signing key; revoke code-signing certificate; publish incident notice if artifacts shipped.
Recovery: re-issue artifacts signed under new key; enforce monotonic counter storage or RSM allocator before resuming.
Postmortem: add a forced test that reproduces the failure (snapshot restore + retry storm + crash at worst point).

Evidence

RFC 8554: LMS/HSS (2)
- Evidence (spec constraint): “An LM-OTS private key MUST NOT be used to sign more than one message.”
RFC 8391: XMSS (1)
- Evidence (deployment reality): the security story explicitly assumes one-time use of WOTS+ keys.
NIST SP 800-208 (3)
- Evidence (operationalization): stateful signature schemes require secure state management; rollback is a first-class hazard.
Raft (4)
- Evidence (engineering pattern): linearizable replicated logs are the standard way to enforce “exactly-once allocation” under failures.

Open questions

What is your hard boundary: “prevent rollback” or “detect rollback and rotate”?
Can you justify a single-region allocator for your threat model, or do you need cross-region quorum?
What is your evidence story: can you prove non-reuse to an auditor after an incident?
If the allocator is compromised, what are your containment mechanisms (fencing, transparency, revocation)?

Checklist

NoReuse invariant is written as code + tests, not a wiki sentence.
Allocation is linearizable across signers (not “usually correct”).
Reservations are durable before success is returned (fsync/WAL semantics).
Snapshot/backup restore cannot rewind state without detection/rotation.
Duplicate detection pages immediately and blocks signing.
Key rotation + certificate revocation playbook is rehearsed.
