API performance hiccups & analysis - 12-20-2021

Over the past two days, the Hiro API experienced significant slowdowns caused by a surge of activity. The issue affected all clients that use the Hiro API, including the Explorer, the Hiro desktop and web wallets, and the Stacks CLI. The API stayed online and requests were still being served, but performance degraded to the point where requests frequently timed out.

It bears noting that the Stacks ecosystem is seeing rapidly growing activity levels, which is a sign of exciting times ahead.

The root cause and the timeline of events are detailed below.

Root cause: Poor performance of specific database queries from requests to the /transfers and /nft_events endpoints.

Starting ~ Dec 20, 2021, 12:00PM ET : All API databases serving public requests started running a large number of unoptimized queries triggered by requests to the /transfers and /nft_events endpoints, exhausting the memory on the database nodes. This slowed each database considerably, affecting all further reads and writes.

~ Dec 20, 2021, 12:41PM ET : A new version of the API (1.0.0), which contained optimizations for the /transfers and /nft_events endpoints, was deployed to the green API. This upgrade required Hiro to take its green API and database out of traffic while they were upgraded and events were replayed to the database, a process that can take about 3 to 5 hours. During this time, the blue API and database kept serving traffic.

~ Dec 20, 2021, 5:30PM ET - Dec 21 2021, 9:00 AM ET : The green API’s event-replay to the database failed but did not trigger any alerts, and the API proceeded to boot normally. It promptly hit more errors when trying to sync to the STX tip and effectively stalled.

Error when replaying events:

Could not find transaction 0x5a97316c384ffb565f45e0bedac39ce07706f015f3fb79b5d9b55493c34c4dba associated with attachment

~ Dec 21, 2021, 9:14 AM ET : Thinking the event replay had failed due to an unrelated error, we restarted it, and it failed again the following morning. We eventually found the error log and created a new API version (1.0.1) that accounted for the bug.

~ Dec 21, 2021, 3:00 PM ET : Restarted the event-replay on this new API version.

~ Dec 21, 2021, 10:00PM ET : Restarting the event-replay on this new API version succeeded.
We switched traffic to the green API, which contained SQL optimizations for the /transfers and /nft_events endpoints. Performance on all /extended endpoints significantly improved as a result; however, /v2 was still showing response times over 30 seconds.

~ Dec 21, 2021, 11:00PM ET : An additional pool of followers was added behind the green API, allowing the load on /v2 to be distributed more broadly. This improved /v2 performance and returned it to normal behavior.

Here is a snapshot of the DB performance improvement before and after the fix: DB-Performance

At times like these, it is reassuring to note what worked really well:

  • Spot-on alert management and response time; we were aware of the issue and were able to act on it swiftly.
  • We quickly gathered data and metrics that helped us triage the issue and resolve it.
  • We utilized all follower pools without the need for scaling up our total follower count.
  • Overall community rallying around us

Next Up:

  • A known issue with taking periodic snapshots, which could have sped up the event replay, held us back from swiftly publishing a release; a fix is in the works.
  • Improve the handling of event replays, namely graceful exits and alerts on failed event replays.
  • Samples and tutorials that span use cases ranging from handling API calls to building a more scalable architecture that avoids computationally expensive calls.
  • Event Calendar to track major community releases
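
As an illustration of the "graceful exits, alerts on failed event replays" item, here is a minimal sketch of a wrapper that only boots the API when the replay step exits cleanly. It is not Hiro's actual tooling: the function name `run_replay_then_boot`, the `boot` and `alert` callbacks, and the replay command passed in are all hypothetical, and it assumes the replay process reports failure through a non-zero exit code.

```python
import subprocess
from typing import Callable, Sequence


def run_replay_then_boot(
    replay_cmd: Sequence[str],
    boot: Callable[[], None],
    alert: Callable[[str], None],
) -> bool:
    """Run the replay command; boot the API only if it succeeded.

    All names here are hypothetical. `replay_cmd` is whatever command
    performs the event replay, `boot` starts the API, and `alert`
    forwards a message to the on-call channel.
    """
    result = subprocess.run(list(replay_cmd), capture_output=True, text=True)
    if result.returncode != 0:
        # Include the tail of stderr so the alert carries the actual error
        # (for example a "Could not find transaction ..." line).
        tail = "\n".join(result.stderr.splitlines()[-5:])
        alert(f"event replay failed with exit code {result.returncode}:\n{tail}")
        return False  # Do not boot an API on top of a half-replayed database.
    boot()
    return True
```

The point of the design is that a failed replay becomes loud and blocking instead of silent: the alert fires right away and the API never starts serving from an incomplete database.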

Thanks for the update here! Growing pains for the Hiro API :)

Also, it is important to point out that this issue was entirely around Hiro’s deployment of the API. Hiro is just one company in the ecosystem; the decentralized Stacks blockchain continued to operate.

Just for my understanding: if the majority of the apps use their own Stacks node and API, will that help? Does the Hiro API have special properties that an API installed per https://docs.hiro.so/get-started/running-api-node for mainnet doesn’t have?

I don’t think there is anything special Hiro offers on their endpoint. To my knowledge it is the same API as in https://github.com/hirosystems/stacks-blockchain-api/. Just be aware of the long sync time (1-2 days). Also, ideally you should have a performant Bitcoin node nearby to speed up the sync. Hiro is running its own, which is available to everyone, but it is busy sometimes.

Yeah, that is what I thought as well. The solution would then simply be: if you have a project where you mint 10k NFTs or something: use your own API.

Of course the web wallet should also use multiple APIs, I guess. Sounds doable?
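
To make the "multiple APIs" idea concrete, here is a rough sketch of client-side failover: try each API base URL in order and move on when one times out or errors. The base URLs are placeholders, `get_with_failover` and `http_get` are names made up for this example, and `/v2/info` is used only as a sample path. A real wallet would likely also want retries with backoff and some check that the endpoint is synced to the chain tip.

```python
import urllib.request
from typing import Callable, Iterable, List


def http_get(url: str, timeout: float) -> bytes:
    """Plain HTTP GET using the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def get_with_failover(
    base_urls: Iterable[str],
    path: str,
    timeout: float = 10.0,
    fetch: Callable[[str, float], bytes] = http_get,
) -> bytes:
    """Return the first successful response among several API deployments."""
    errors: List[str] = []
    for base in base_urls:
        url = base.rstrip("/") + path
        try:
            return fetch(url, timeout)
        except OSError as exc:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all API endpoints failed: " + "; ".join(errors))


# Hypothetical usage: your own API first, a shared one as fallback.
# body = get_with_failover(
#     ["https://my-api.example.com", "https://shared-api.example.com"],
#     "/v2/info",
# )
```

With something like this, one slow or overloaded deployment degrades into a short delay for the user rather than a hard failure, as long as at least one of the listed APIs is healthy.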