Why use Atlas instead of using Kademlia DHT?

@zxcykuaile asks:

Why does Blockstack choose to implement Atlas rather than using Kademlia? Do you have any specs about Atlas?

One of the problems with the Kademlia DHT was ensuring that a given zone file remains available as nodes enter and leave the DHT. We used to have special nodes that were “DHT mirrors” - tasked with keeping a long-term copy of all items seen in the DHT to try to achieve that.

Another was the problem of performance and slow lookup times.

There might have been other issues as well…if we’re nice, perhaps @jude will elaborate on those. :wink:

As to how Atlas functions:

The Atlas network is very different from traditional DHTs like Kademlia.

a) Atlas nodes keep a full replica of all data items. DHT nodes generally keep only a subset of the data, along with routing information for a subset of the keyspace (generally O(log N) entries). This means that if you can connect to a single Atlas node, you have access to all the data you’re looking for (a sketch of this follows after (d) below).

b) Atlas nodes have a global view of the state, meaning that they know if they’re missing any data items. This is because we use the blockchain to propagate information about new puts (new data items written to the network). This increases reliability a lot, because traditional DHT nodes don’t even know if they’re missing data (there is no global view in traditional DHTs, and there are theoretical proofs of that). See this paper by Keshav for how traditional DHTs/peer networks cope with this lack of global state. The second sketch after (d) below shows how a node can use that global view to detect and repair missing items.

c) Nodes joining and leaving the network (churn) are less of an issue on the Atlas network, because churn is less disruptive: any peer that you can reach will likely have the data you’re looking for. The routing tables of traditional DHTs, by contrast, are disrupted by churn and go out of sync.

d) Atlas nodes are meant to index data where the total data size is small. This is related to (a), i.e., all nodes keep a full replica. This means that while the Atlas network can give you better performance and reliability than traditional DHTs, it works for a subset of use cases and is not a generic storage network.
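To make (a) concrete, here is a minimal sketch of what resolution against Atlas can look like. The node addresses, port, and endpoint path are assumptions for illustration, not the actual Blockstack API; the point is that any single reachable node can answer, because every node holds the full replica.

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical Atlas node addresses and endpoint, for illustration only.
ATLAS_NODES = [
    "http://atlas-1.example.com:6264",
    "http://atlas-2.example.com:6264",
]

def fetch_zonefile(zonefile_hash):
    """Fetch a zone file by hash. Since each Atlas node holds a full
    replica, the first reachable node should be able to answer."""
    for node in ATLAS_NODES:
        try:
            resp = requests.get(f"{node}/zonefiles/{zonefile_hash}", timeout=5)
            if resp.status_code == 200:
                return resp.content
        except requests.RequestException:
            continue  # node unreachable; try the next one
    raise LookupError(f"no Atlas node returned zone file {zonefile_hash}")
```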
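And to illustrate (b), here is a toy sketch (the function and storage interfaces are made up for this example) of the reconciliation a node can do precisely because the blockchain tells it the full set of zone file hashes that should exist:

```python
def missing_zonefile_hashes(announced_hashes, local_store):
    """The blockchain announces every new put, so a node can compute
    exactly which zone files it is missing."""
    return [h for h in announced_hashes if h not in local_store]

def reconcile(announced_hashes, local_store, fetch_from_peer):
    """Pull missing zone files from peers. fetch_from_peer is a stand-in
    for whatever transport the node actually uses (hypothetical here)."""
    for h in missing_zonefile_hashes(announced_hashes, local_store):
        data = fetch_from_peer(h)
        if data is not None:
            local_store[h] = data

# Example: the node can tell it is missing "hash3" and go fetch it.
announced = ["hash1", "hash2", "hash3"]
store = {"hash1": b"zonefile 1", "hash2": b"zonefile 2"}
print(missing_zonefile_hashes(announced, store))  # -> ['hash3']
```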

We plan to post more details on the Atlas network in the near future.


This is related to (c) in my response. The Atlas network uses a random sampling approach. Here is a tweet marking the commit of that code :slight_smile:
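As a rough illustration of the random sampling idea (a toy sketch, not the actual implementation or its parameters): a node keeps a small, bounded neighbor set and randomly replaces entries as it discovers new peers, which keeps the overlay well mixed even under churn.

```python
import random

MAX_NEIGHBORS = 8  # illustrative bound, not the real Atlas parameter

def refresh_neighbors(neighbors, discovered_peers):
    """Maintain a bounded, randomly sampled neighbor set: new peers either
    fill empty slots or randomly evict an existing neighbor."""
    neighbors = list(neighbors)
    for peer in discovered_peers:
        if peer in neighbors:
            continue
        if len(neighbors) < MAX_NEIGHBORS:
            neighbors.append(peer)
        else:
            neighbors[random.randrange(MAX_NEIGHBORS)] = peer
    return neighbors
```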


Here is a perfect example of why the move to the Atlas network is important.

Over this weekend (Jan 22), some community members (thanks, Albin!) reported a “Data not saved in DHT” error for their profiles. I debugged this issue, and it turned out that some nodes of our DHT deployment were on a partition. This is very common in DHTs. We’ve been lucky in our deployment (which started in summer 2015 and has been running continuously since) and haven’t experienced partition issues that frequently. This is because of:

  1. Active monitoring of the default discovery nodes, and throwing more RAM/CPU at them, so it’s hard to overwhelm them with requests.

  2. Use of a caching layer: even if the underlying DHT network is experiencing, or recovering from, a partition, read queries going to nodes that use the caching layer will still work for (heavily) cached data (see the sketch after this list).

  3. We can proactively check the blockchain for new data and verify that it has propagated on the DHT network (traditional DHTs don’t have any such channel where new data writes get broadcast).
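For point (2), the idea is a read-through cache in front of the DHT. A minimal sketch (the interfaces here are assumptions, not our actual code): reads served from the cache keep working even while the underlying lookup backend is partitioned; only cache misses feel the outage.

```python
import time

class CachedResolver:
    """Read-through cache in front of a flaky lookup backend (e.g. a DHT)."""

    def __init__(self, backend_lookup, ttl_seconds=3600):
        self.backend_lookup = backend_lookup  # callable: key -> value (may raise)
        self.ttl = ttl_seconds
        self.cache = {}                       # key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                   # cache hit: backend never touched
        value = self.backend_lookup(key)      # may fail if the DHT is partitioned
        self.cache[key] = (value, time.time() + self.ttl)
        return value
```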

Even with these additional monitoring and caching services, and extra information about new writes, we still experience issues. And the Atlas network described above helps a lot because it’s a fundamentally new design which, in my view, is much better than using a traditional DHT for our use case.

Anyway, I just resolved the DHT partition and things are back to normal. Looking forward to phasing out the DHT entirely in the next iterations of our deployment.
