On Friday, May 23, the Stacks network experienced an extended stall: block 1271291 was timestamped at 20:15:48 GMT and block 1271292 at 03:56:35 GMT the following day, so no blocks were produced for roughly 7 hours and 41 minutes.
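As a quick check of the stall duration, the gap between the two block timestamps can be computed directly. This is a minimal sketch that uses only the times of day quoted above and assumes block 1271292 landed on the following calendar day; the variable names are illustrative.

```python
from datetime import timedelta

# Time of day of block 1271291 on Friday (GMT), as an offset from midnight.
last_block_before_stall = timedelta(hours=20, minutes=15, seconds=48)

# Time of day of block 1271292, assumed to be on the following day (GMT),
# so one full day is added to the offset.
first_block_after_stall = timedelta(days=1, hours=3, minutes=56, seconds=35)

stall = first_block_after_stall - last_block_before_stall
print(stall)  # 7:40:47
```

The exact gap is 7 hours, 40 minutes, and 47 seconds, which rounds to the 7 hours and 41 minutes reported.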
Just before the stall, the Stacks network received a large volume of very large mempool transactions, each of which would occupy nearly the entirety of a block’s size budget. Due to a bug in the mempool propagation RPC endpoints, Stacks nodes that held those transactions attempted to construct network packets larger than the maximum allowable packet size for the P2P network. Because of a separate bug, downstream Stacks nodes were unable to continue communicating over the network after attempting to read these oversized packets. This prevented signers from receiving block proposals from miners, miners from receiving other miners’ blocks, and other nodes in the network from receiving processable blocks. The problem was fixed in release 3.1.0.0.11, and the network resumed operation once a sufficient threshold of signers and miners had rolled out the release.
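The two failure modes described above (a sender building a packet past the size cap, and a receiver that cannot recover after seeing one) can be illustrated with a small, generic sketch. This is not the Stacks node's code or wire format: `MAX_PAYLOAD`, `pack_batch`, `read_frame`, and the 4-byte length prefix are all hypothetical, chosen only to show the general defensive pattern.

```python
import io
import struct

MAX_PAYLOAD = 1024  # hypothetical cap, not the real P2P limit


def pack_batch(items, start=0, max_payload=MAX_PAYLOAD):
    """Sender side: pack length-prefixed items without exceeding the cap.

    Returns (payload, next_index) so the caller can resume from next_index.
    """
    out = bytearray()
    i = start
    while i < len(items):
        entry = struct.pack(">I", len(items[i])) + items[i]
        if len(entry) > max_payload:
            i += 1  # can never fit in any packet; skip instead of stalling
            continue
        if len(out) + len(entry) > max_payload:
            break  # packet is full; the rest goes in the next batch
        out += entry
        i += 1
    return bytes(out), i


def read_frame(stream, max_payload=MAX_PAYLOAD):
    """Receiver side: read one length-prefixed frame from a file-like stream.

    An oversized frame is drained and reported as None, so the stream stays
    in sync and later frames can still be read.
    """
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream ended inside frame header")
    (length,) = struct.unpack(">I", header)
    if length > max_payload:
        remaining = length
        while remaining:
            chunk = stream.read(min(remaining, 4096))
            if not chunk:
                raise EOFError("stream ended inside oversized frame")
            remaining -= len(chunk)
        return None
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("stream ended inside frame body")
    return payload


# Usage: an oversized frame is rejected, and the next frame still parses.
wire = io.BytesIO(struct.pack(">I", 5000) + b"x" * 5000 + struct.pack(">I", 3) + b"abc")
first = read_frame(wire)
second = read_frame(wire)
```

Draining the oversized frame is only one possible policy. A real node might instead disconnect the offending peer while keeping its other connections alive; the point of the sketch is that one bad packet should not end all communication.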
Lessons and Remediation
In addition to resolving the bugs identified above, the following actions are being taken to prevent similar events and improve response times for future incidents:
Testing Improvements
Enhanced testing remains a primary focus of our ongoing strategy. The two bugs that triggered this stall are five years old, dating from release 2.0. Uncovering bugs like these before they surface on mainnet requires a mix of strategies, and the core development team is focusing on expanding test coverage, static analysis, and black-box testing of the existing codebase.
Signer and Miner Preparedness
As with some prior stalls, the speed of mitigation depended on how quickly nodes in the network could upgrade with the hotfix. Alongside the technical improvements, we will continue to prioritize the preparedness of signers and miners to handle future incidents swiftly. We recommend that signers and miners implement effective monitoring and alerting solutions, ensuring timely awareness and action when issues arise. One option here may be to set up a community-wide alerts channel to which interested parties can subscribe (details to follow).
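As one illustration of the kind of monitoring recommended above, a stall check can be as simple as tracking how long the chain tip has gone without advancing. The sketch below is hypothetical: `StallDetector`, its fields, and the 10-minute threshold are assumptions for illustration, and how the tip height is obtained and how alerts are delivered are left to the operator's own tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallDetector:
    """Flags a stall when the observed tip height stops advancing."""

    threshold_secs: float  # hypothetical alerting threshold
    last_height: Optional[int] = None
    last_change: Optional[float] = None

    def observe(self, height: int, now: float) -> bool:
        """Record the tip height seen at time `now` (in seconds).

        Returns True if the height has not advanced for at least
        `threshold_secs`, meaning an alert should fire.
        """
        if self.last_height is None or height > self.last_height:
            # First observation, or the chain advanced: reset the timer.
            self.last_height = height
            self.last_change = now
            return False
        return now - self.last_change >= self.threshold_secs


# Usage: poll periodically and alert when observe() returns True.
detector = StallDetector(threshold_secs=600)
detector.observe(1271291, now=0)                # tip first seen
still_ok = detector.observe(1271291, now=300)   # 5 minutes without a block
stalled = detector.observe(1271291, now=600)    # 10 minutes without a block
recovered = detector.observe(1271292, now=700)  # tip advanced again
```

Keeping the detector free of network calls makes it easy to test and to plug into whatever polling and paging setup a signer or miner already runs.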
Summary
Collectively, these improvements in testing and readiness represent a shared commitment across developers and the wider community to strengthen the resilience of the Stacks network. While coordination in a decentralized ecosystem poses unique challenges, proactive tooling and ongoing collaboration can help prevent issues like these in the first place and ensure the network responds more effectively to challenging events like this one.