July 8, 2025 Stall Post-Mortem

On Tuesday, July 8th, the Stacks network experienced an extended stall: Stacks block 2,122,849 was timestamped at 22:16:02 UTC, more than five hours after its predecessor (timestamped at 17:07:44 UTC). Consistent block production resumed once the network fully stabilized, around 00:41:56 UTC.

The stall was caused by a bug that triggered when a Stacks node encountered a communication error with the Bitcoin node while processing a Bitcoin reorg. This bug had been previously identified and fixed in stacks-core release 3.1.0.0.13, published on July 2nd. Unfortunately:

1. The development team had underestimated the likelihood of hitting the bug and therefore did not label the upgrade as “critical”.
2. A few node operators had not yet updated to the .13 release and were thus still exposed.

In particular, two large signers were affected and stopped signing blocks. Since those two signers together held more than 30% of the signing weight, the remaining signers could not reach the 70% approval required for generating a valid block, as illustrated in the sketch below.
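To make the arithmetic concrete, here is a minimal sketch of a weighted-approval check. The names, structure, and weight distribution are hypothetical, not the actual stacks-core implementation; it only illustrates why signers holding more than 30% of the weight going offline halts block production.

```rust
/// Hypothetical signer entry; real signer sets are weighted by stacked STX.
struct SignerEntry {
    weight: u64,  // voting weight of this signer
    signed: bool, // did this signer approve the block?
}

/// A block is valid only if approving signers hold at least 70% of the
/// total weight. If offline signers hold more than 30%, no block can pass.
fn block_approved(signers: &[SignerEntry]) -> bool {
    let total: u64 = signers.iter().map(|s| s.weight).sum();
    let approved: u64 = signers
        .iter()
        .filter(|s| s.signed)
        .map(|s| s.weight)
        .sum();
    // Integer form of approved / total >= 0.70
    approved * 10 >= total * 7
}

fn main() {
    // Hypothetical distribution: two large signers (20% + 15%) stalled.
    let signers = vec![
        SignerEntry { weight: 20, signed: false }, // large signer, stalled
        SignerEntry { weight: 15, signed: false }, // large signer, stalled
        SignerEntry { weight: 65, signed: true },  // everyone else, healthy
    ];
    // 65% < 70%: no block can be approved until the stalled signers return.
    assert!(!block_approved(&signers));
}
```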

Technical Details

When processing a Bitcoin reorg, the Stacks node first drops the reorged blocks from its databases and then downloads the replacement blocks. If an error occurred while downloading those new blocks, for example due to a temporary communication error with the Bitcoin node, the reorg-handling code exited early, after the old blocks had been dropped but before the new ones were stored, leaving the chainstate incoherent. Either a genesis sync or a snapshot restore was then required to recover from the corrupted chainstate. Nodes that hit this error would see a NonContiguousBurnchainBlock error in their logs. For complete details, see PR #6214, which fixed this problem. The sketch below illustrates the ordering issue.
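The following is a minimal sketch of the failure mode, with hypothetical types and function names rather than the actual stacks-core code (see PR #6214 for that). It shows why performing the destructive drop before the fallible download is dangerous, and how reordering the two steps would avoid the corruption (the actual fix in PR #6214 may differ).

```rust
// Hypothetical stand-ins for the burnchain DB and the bitcoind RPC client.
struct Block { height: u64 }
struct Db { blocks: Vec<Block> }
struct Btc;
#[derive(Debug)]
struct RpcError; // stand-in for a transient communication error

impl Db {
    fn drop_blocks_above(&mut self, h: u64) {
        self.blocks.retain(|b| b.height <= h);
    }
    fn store_blocks(&mut self, mut bs: Vec<Block>) {
        self.blocks.append(&mut bs);
    }
}
impl Btc {
    // May fail transiently, e.g. if the connection to bitcoind drops.
    fn fetch_blocks_after(&self, _h: u64) -> Result<Vec<Block>, RpcError> {
        Err(RpcError)
    }
}

// Buggy ordering: the destructive drop happens before the fallible
// download. If the download fails, `?` returns early and the dropped
// blocks are never replaced, leaving a gap in the burnchain view
// (surfacing later as NonContiguousBurnchainBlock).
fn handle_reorg_buggy(db: &mut Db, btc: &Btc, fork_point: u64) -> Result<(), RpcError> {
    db.drop_blocks_above(fork_point);
    let new_blocks = btc.fetch_blocks_after(fork_point)?;
    db.store_blocks(new_blocks);
    Ok(())
}

// Safer ordering: fetch first, mutate only once the data is in hand,
// so a failed download leaves the old, coherent state untouched.
fn handle_reorg_fixed(db: &mut Db, btc: &Btc, fork_point: u64) -> Result<(), RpcError> {
    let new_blocks = btc.fetch_blocks_after(fork_point)?; // safe to fail here
    db.drop_blocks_above(fork_point);
    db.store_blocks(new_blocks);
    Ok(())
}
```

The key property of the reordered version is that the only fallible step happens before any state is mutated, so a transient bitcoind error leaves the node exactly where it was.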

Lessons and Remediation

The experience of this stall highlighted the need to improve several key areas:

Upgrade Alerts

All Stacks node operators should be aware of, and subscribed to, the announcement mailing list, where releases and other important announcements are shared. Releases, especially those including bug fixes, should also be broadcast more widely on Discord and social media (see @StacksStatus on X).

Faster Recovery

This stall’s duration was mostly due to the time affected signers needed to come back online by restoring their nodes’ chainstate. That process requires downloading and extracting very large snapshots over the internet (~250 GB as of today) and is bandwidth-bound during the download and CPU-bound during the extraction.

Time-to-recovery can be significantly reduced by relying on local snapshots and restore processes, but today snapshotting a chainstate requires stopping the node, at the risk of otherwise producing a corrupted snapshot. The Stacks team will:

1. Provide guidance for node operators encouraging them to take local snapshots, verify them, and restore from them.
2. Make the necessary improvements to the node to enable snapshotting while it is up and running (see the sketch below for one possible approach).
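For the SQLite-backed parts of the chainstate, one possible building block for hot snapshots is SQLite's online backup API, exposed in Rust by the rusqlite crate (behind its `backup` feature). This is a sketch under assumptions, not the team's stated plan: it assumes the store being copied is a plain SQLite file (the paths are hypothetical), and a full chainstate snapshot would need to cover every store the node keeps, consistently.

```rust
// Sketch: copy a live SQLite database without stopping its writer,
// using SQLite's online backup API via rusqlite (requires the
// "backup" feature in Cargo.toml).
use std::time::Duration;

use rusqlite::backup::Backup;
use rusqlite::Connection;

fn hot_snapshot(src_path: &str, dst_path: &str) -> rusqlite::Result<()> {
    let src = Connection::open(src_path)?;
    let mut dst = Connection::open(dst_path)?;

    // Copy 256 pages at a time, pausing 250 ms between steps so the
    // running node keeps priority on the database.
    let backup = Backup::new(&src, &mut dst)?;
    backup.run_to_completion(256, Duration::from_millis(250), None)
}

fn main() -> rusqlite::Result<()> {
    // Hypothetical file names, for illustration only.
    hot_snapshot("chainstate/example.sqlite", "snapshots/example.sqlite")
}
```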


Could an alternative to #2 (enabling hot snapshotting) be letting a burnchain tip height change trigger switching a node from active (mining or signing, respectively) to inactive (following), or from inactive to active, as needed? An operator could then switch operations from a primary node to a secondary node, take the primary offline and snapshot it, then swap back and snapshot the secondary.

Even if hot snapshotting gets implemented, I suggest the ability to trigger a node switch would still be useful: the miner creates duplicate leader commits if it is not started in the interval between the burnchain tip height change and the first Nakamoto block.


Thanks for the update, Brice! <3
