Howdy,
We’ve just released version 2.0.4. A combination of two bugs led to a network-wide crash of all public-facing nodes and stalled most NAT’ed nodes. The outage was triggered by the registration of a BNS name with an Atlas attachment.
Bug #1: Atlas Paging
The first bug was in the code that propagates Atlas zone file attachments.
The check that decided whether requested attachment inventory pages were
valid was faulty: a network request for pages that corresponded to
non-existent state could crash the node, resulting in a denial of service.
This affected all public-facing nodes. Stacks nodes regularly try to exchange
inventory lists of the Atlas zone files they each have, so they know who
has which zone files (and whom to ask when fetching them). Both public-facing and
NAT’ed nodes issue these queries to their peers. It turns out that they could
issue a query that would cause the remote peer to crash. The combined effect of
all nodes issuing these Atlas inventory queries was a network-wide outage.
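To illustrate the class of bug, here is a minimal Python sketch of validating a peer-supplied page index before touching local state. It is not the actual stacks-node code: `InventoryStore`, `get_page_unchecked`, `get_page`, and `handle_inventory_request` are hypothetical names, and the dictionary of pages is an assumed stand-in for the real inventory state.

```python
from typing import Dict, List, Optional


class InventoryStore:
    """Toy attachment inventory: page index -> list of 'have it' flags."""

    def __init__(self, pages: Dict[int, List[bool]]) -> None:
        self.pages = dict(pages)

    def get_page_unchecked(self, index: int) -> List[bool]:
        # Trusts the peer: a request for a page that does not exist
        # raises KeyError, which stands in for the node crash.
        return self.pages[index]

    def get_page(self, index: int) -> Optional[List[bool]]:
        # Treats the index as untrusted input: a missing page yields None,
        # which the caller can turn into a "no such page" reply.
        return self.pages.get(index)


def handle_inventory_request(
    store: InventoryStore, requested: List[int]
) -> Dict[int, Optional[List[bool]]]:
    # Answer every requested page; unknown pages map to None instead of
    # aborting the whole request.
    return {index: store.get_page(index) for index in requested}


store = InventoryStore({0: [True, False], 1: [False, False]})
reply = handle_inventory_request(store, [0, 7])  # page 7 does not exist
```

The point of the sketch is only that a request naming non-existent state should produce an error reply, never an unhandled failure in the serving node.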
Bug #2: Resource Exhaustion
The second bug had to do with the way the Atlas downloader code handled the case
where another node did not have a given attachment, even if it reported it
present in its inventory. The downloader did not free up the underlying event
descriptor in the network’s I/O poller in the error case. Over time, this would
cause the node to run out of network event descriptors, thereby preventing it
from downloading blocks or handling network requests.
This was exacerbated by the chronic absence of the attachment. Because of
bug #1, no one had the attachment, so a node would request it again and again,
and each request would allocate an event descriptor without freeing it.
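The leak pattern can be sketched in a few lines of Python. This is an illustration under assumptions, not the node's real downloader: `EventPool`, `AttachmentNotFound`, `fetch_attachment_leaky`, and `fetch_attachment` are hypothetical names, and the fixed-capacity pool is an assumed model of the I/O poller's event descriptors.

```python
from typing import Dict, Set


class AttachmentNotFound(Exception):
    """Raised when a peer does not actually have the attachment."""


class EventPool:
    """Toy model of a poller with a fixed number of event descriptors."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.in_use: Set[int] = set()
        self.next_id = 0

    def register(self) -> int:
        if len(self.in_use) >= self.capacity:
            raise RuntimeError("out of event descriptors")
        event_id = self.next_id
        self.next_id += 1
        self.in_use.add(event_id)
        return event_id

    def deregister(self, event_id: int) -> None:
        self.in_use.discard(event_id)


def fetch_attachment_leaky(pool: EventPool, peer: Dict[str, bytes], key: str) -> bytes:
    event_id = pool.register()
    if key not in peer:
        # Error path returns early and never releases the descriptor.
        raise AttachmentNotFound(key)
    data = peer[key]
    pool.deregister(event_id)
    return data


def fetch_attachment(pool: EventPool, peer: Dict[str, bytes], key: str) -> bytes:
    event_id = pool.register()
    try:
        if key not in peer:
            raise AttachmentNotFound(key)
        return peer[key]
    finally:
        # Runs on success and on error, so retries cannot exhaust the pool.
        pool.deregister(event_id)
```

With the leaky variant, repeated requests for a missing attachment use up one descriptor per attempt until `register` fails. With the `try`/`finally` variant, the pool stays empty no matter how many times the request is retried.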
Mitigation
Release 2.0.4 fixes both bugs by temporarily disabling the Atlas system until we’ve had a chance to conduct a more thorough audit. If you register new BNS names before Atlas is re-enabled on nodes, your zone files will not propagate to Atlas. It’s better to wait for the Atlas patch to go live so that the zone files for your BNS name registrations can propagate.
All of Hiro PBC’s and the Stacks Foundation’s nodes are in the process of deploying 2.0.4 to mitigate both bugs.
Anyone running a Stacks node should upgrade to the latest version to avoid the bugs listed above.