Unplanned downtime on default Gaia hub from September 5, 2018

Yesterday, the default Gaia hub experienced significant downtime, from September 5, 23:03:09 GMT to September 5, 01:08:00 GMT, because its SSL certificate had expired. This disrupted service for blockstack-browser and for any applications users attempted to use during that period.

Our monitoring processes failed to detect this issue, which led to much more downtime than we view as acceptable. We have corrected the monitoring gap by linking a Runscope policy (polled every 5 minutes) to this instance, so issues like this should be dealt with much more rapidly in the future. Furthermore, Blockstack is re-architecting its infrastructure and monitoring practices to align them more closely with the very high reliability standards that our community and developers deserve. We already have a dedicated DevOps team that is upgrading the deployment and monitoring processes.
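For anyone running their own hub, a certificate-expiry check along these lines could complement uptime polling. This is only a minimal sketch, not the check we run in production: the hostname `hub.example.org`, the 14-day warning threshold, and the function names are all hypothetical, and it uses nothing beyond the Python standard library.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 14  # assumed alert threshold, not an official value


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host: str) -> str:
    """Return a one-line status string for the host's certificate."""
    try:
        not_after = fetch_not_after(host)
    except ssl.SSLCertVerificationError as err:
        # An already-expired (or otherwise invalid) certificate fails the handshake.
        return f"ALERT {host}: certificate rejected ({err.verify_message})"
    remaining = days_until_expiry(not_after, datetime.now(timezone.utc))
    if remaining < WARN_DAYS:
        return f"WARN {host}: certificate expires in {remaining:.1f} days"
    return f"OK {host}: certificate valid for {remaining:.1f} more days"


if __name__ == "__main__":
    print(check("hub.example.org"))  # hypothetical hostname
```

Run on a schedule (cron, for example), a check like this would warn days before a certificate lapses instead of after requests start failing.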

However, it is important to note that while we operate the majority of the infrastructure in the Blockstack ecosystem today, a key benefit of decentralized ecosystems is that no single entity or piece of infrastructure is vitally important. That’s the world we’re moving towards. Decentralizing Gaia hubs in particular is a major focus for the Blockstack team in the coming months.

Thanks for writing this up, @aaron. As a follow-up to this post, I’m happy to discuss this further in our weekly engineering meetings and in person. I have also written a follow-up post on diversifying the set of Gaia hubs.

Thanks @yukan for raising the issue internally, and thanks @aaron for pushing a fix within 10 minutes. Totally agree on (a) the better monitoring services that @jwiley is working on, and (b) less reliance on any single party’s infrastructure (such reliance goes totally against the mission of decentralization).

@aaron, we are using the default Gaia hub to develop prototypes during the Hackathon (Sept 8 to 10). Will there be any impact for us? Please let us know.

No, there won’t be any impact. The downtime was resolved very quickly after it occurred.

Thanks @larry & @aaron