Syncing data among Identity Hub instances

We have been working on a shared, high-level spec with the Blockstack team to define something we’re calling Identity Hubs. Just think of them as secure, provider-agnostic, distributed storage instances of user data. Your Hub can live on any host or device, and a user can have multiple active Hubs that all sync data and respond to requests. It’s basically your own secure, encrypted personal data store that is replicated in many places.

This introduces an interesting issue: if a user, Alice, has her Hub active on Cloud Provider X and Cloud Provider Y, and a data value changes at the same time in different locations, how do we reach a consistent state across all the Hubs? For example: Alice’s favorite color is simultaneously set to red on Cloud Provider X and blue on Cloud Provider Y; how do we ensure one value is recorded as first and latest across both instances of the user’s Hub?


One idea would be to select an existing open-source system/protocol for syncing data as THE one that spec-adherent implementations must use. Something like CouchDB/PouchDB could be an option: https://pouchdb.com/guides/replication.html

TL;DR: I think we’ve already partially solved this with the “datastore” concept. A little bit more work will be required for a reliable message queue (see arguments below), but this should be easier and safer than using a shared read/write database.

I think there are two ways to solve this: at the storage layer, and at the application layer.

The storage layer solution is to design a multi-reader multi-writer distributed storage system that all application nodes (i.e. Alice’s computer, cloud provider X’s client, cloud provider Y’s client) can use as a shared storage medium. It would either provide some well-defined consistency model that all applications would simply need to be aware of (the CouchDB/PouchDB approach), or the storage system itself would need to be programmable so the application could get exactly the consistency/durability semantics it needs (the Syndicate approach).

The advantage of the storage layer approach is that we solve the consistency problem once, for everyone. The disadvantages are that it’s going to amount to a lot of complexity in the lower layers of the stack, and it will likely lead to implicit tight coupling between applications and the implementation (in particular, the consistency model of the storage system’s implementation will dictate how we build applications that use it). I say “implementation” instead of “specification” here because it is notoriously difficult to implement the desired consistency model for a spec (history has shown we can’t even do it for the lax POSIX consistency model for filesystems, let alone distributed storage systems).

The tight coupling concern can be mitigated with wide-area software-defined storage, where the application may opt to prescribe behaviors to application nodes that act on the storage system’s data plane (and in doing so, enforce its own data consistency, durability, access controls, etc. independent of other applications). However, such systems are still in their infancy, and have complex specifications and complex implementations as it is.

The other strategy (and the one I prefer) is an application layer solution. Each application node maintains its own data store that only it can write to, but anyone can read. An application node does a little bit of extra processing to construct a consistent view over the set of other nodes’ data.

In most cases, including this example of choosing your favorite color, this is very simple (which is why I like it). Suppose that an application wants to know what Alice’s latest color preference is. Suppose also that the application records Alice’s favorite color to a well-defined path in a datastore (e.g. /favorites/color). Alice updates her favorite color on her personal datastore. The application nodes in cloud X and cloud Y periodically refresh /favorites/color from each other and from Alice’s datastore, and in the event of conflicting values, they take the value that was written last. Eventually, in the absence of writes, they all see the latest /favorites/color. It is easy to see how more advanced write-conflict-resolution algorithms can be applied as well.
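To make the conflict-resolution step concrete, here’s a minimal Python sketch of a last-write-wins refresh over a set of peer datastores. The node objects and their read() method are hypothetical stand-ins for whatever datastore client API an application node actually uses:

```python
# Minimal last-write-wins sketch. `nodes` are hypothetical peer datastore
# clients, each exposing read(path) -> (write_timestamp, value) or None.

def resolve_latest(nodes, path="/favorites/color"):
    """Poll every node for the record at `path` and keep the newest write."""
    latest = None
    for node in nodes:
        record = node.read(path)
        if record is None:
            continue
        if latest is None or record[0] > latest[0]:
            latest = record
    return latest[1] if latest else None
```

Each application node would run the same refresh periodically, so in the absence of new writes every replica converges on the last-written value; swapping in a smarter conflict-resolution function is a one-line change.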

As an optimization, we can extend the per-node datastore to have a per-record pub/sub interface. This can be achieved simply by giving each node a publicly-routable reliable message queue, and by having the client publish each mutation operation (create, update, delete) to it. Other application nodes can subscribe to this queue to get informed as to when certain mutations happen. In our example, Alice’s datastore client would send a “write” message to her message queue, which cloud X and cloud Y have already subscribed to. Cloud X and cloud Y’s application nodes would each receive the “write” message on /favorites/color, and immediately refresh /favorites/color from Alice’s storage provider. Of course, the message queue would be designed so that messages are end-to-end confidential and authenticated (so Alice could use any queue provider without having to trust it).
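Here’s a rough sketch of how the pub/sub optimization could look, assuming a queue that offers publish() and callback-based subscribe(). The MessageQueue class and the refresh_from_alice_storage() helper are illustrative stand-ins, not a concrete queue implementation (and the real messages would be end-to-end encrypted and signed):

```python
import json

class MessageQueue:
    """Stand-in for Alice's publicly routable, reliable message queue."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        # In practice the payload would be end-to-end confidential and
        # authenticated; delivery here is simplified to direct callbacks.
        for callback in self.subscribers:
            callback(message)

def refresh_from_alice_storage(path):
    # Hypothetical helper: re-read `path` from Alice's storage provider
    # and update the local replica. Stubbed out for illustration.
    print("refreshing", path)

def on_mutation(message):
    # Cloud X / cloud Y: on a "write" notice, re-fetch the record from
    # Alice's storage provider rather than trusting the notice payload.
    event = json.loads(message)
    if event["op"] == "write":
        refresh_from_alice_storage(event["path"])

alice_queue = MessageQueue()
alice_queue.subscribe(on_mutation)   # cloud X and cloud Y subscribe

# Alice's client publishes a mutation notice after each successful write.
alice_queue.publish(json.dumps({"op": "write", "path": "/favorites/color"}))
```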

Here’s a (badly-drawn) figure that tries to show how Alice, cloud X, and cloud Y work together with their respective storage providers and Alice’s message queue:


Red paths are instigated by Alice; blue paths by cloud X; green paths by cloud Y. Arrows show the flow of bytes. The clouds that hold the data are just commodity cloud storage providers, and the clients (circles) implement a simple one-writer-many-reader filesystem client (i.e. cloud X can use his client to read Alice’s data from her storage provider).

Two key advantages of the application-layer approach are that we keep both consistency and security easy to reason about. Only Alice can write to her storage provider, and only Alice can send messages via her queue (similarly for cloud X’s and cloud Y’s data storage). However, anyone who can decrypt Alice’s messages and data can read them (i.e. Alice would grant cloud X and cloud Y the ability to do so via a bootstrapping step). This is nice because it means the application developer doesn’t have to think about the storage edge-cases that come with dealing with shared read/write media. At the same time, this approach is probably more secure in practice, because from the get-go we require that only the data store’s owner can write (whereas if we took a shared read/write approach, we’d have to be extra careful that the necessarily complex implementation still ensured that only the data owner could write to their data).

The other less-obvious advantage of the application-layer approach is that it works behind NATs. The clients for Alice, cloud X, and cloud Y do not need to be publicly routable; only the storage providers and message queue providers do. Also, the clients do not have to be online 24/7; they can simply consume their pending messages once they come back online (but before doing anything else). Alice could have a personal computer at home that operates on her data, for example.

As for how to bootstrap this, Alice’s, cloud X’s, and cloud Y’s clients would each discover each other’s data store providers (e.g. Amazon S3, Dropbox, etc.) and message queue providers (e.g. a ZMQ instance running on a VM) via each other’s zonefiles. Alice would grant cloud X and cloud Y the ability to read her messages by sharing a secret with them, encrypted with cloud X’s and cloud Y’s public keys obtained via her Blockstack node. Similarly, cloud X and cloud Y would get Alice’s public key, so they could verify messages and data written by her. If anyone changed their storage providers or queue providers, the others could easily find out by checking each other’s zonefiles with their local Blockstack nodes.
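As a rough illustration of the key-sharing part of that bootstrap (not the actual Blockstack code), here is a sketch that wraps a fresh read secret under a peer’s RSA public key using the Python cryptography library; the PEM-loading step and the delivery mechanism are assumptions:

```python
import os

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def grant_read_access(peer_public_key_pem: bytes) -> bytes:
    """Wrap a fresh data-read secret for one peer (e.g. cloud X or cloud Y)."""
    data_secret = os.urandom(32)  # symmetric key protecting Alice's records
    peer_key = serialization.load_pem_public_key(peer_public_key_pem)
    wrapped = peer_key.encrypt(
        data_secret,
        padding.OAEP(
            mgf=padding.MGF1(algorithm=hashes.SHA256()),
            algorithm=hashes.SHA256(),
            label=None,
        ),
    )
    # The wrapped secret would then be delivered to the peer, e.g. via
    # Alice's message queue or a storage path the peer is able to read.
    return wrapped
```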

We have already implemented the data store client functionality in the rc-0.14.1 branch. I still have to create the reliable message queue functionality.


I think there may have been a fundamental misunderstanding about what I meant by Cloud Provider - this is something like Azure, AWS, etc., which is the storage layer. I am not clear on what you see as a Cloud Client here, given there is no need to talk to anything but the URIs of the Cloud Providers (a host of external storage/compute Hub instances) listed in the user’s DNS Zone File records.

Ah. By “a user can have multiple active Hubs that all sync data and respond to requests”, I thought you meant that these hubs were asynchronous actors that did things on behalf of Alice autonomously. My comment was built around getting these actors and Alice to converge on the same view of her data (i.e. the actors and Alice each had a “client” that carried out their actions on top of existing storage providers).

If all a hub is meant to be is a place for Alice to dump her data and get it later, then we don’t have to worry about getting the hubs to agree on consistent replicas. That’s the client’s job, not the cloud storage provider’s.

Our implementation currently does best-effort replication on write (replication “succeeds” if at least one cloud storage provider driver succeeds), and uses Lamport clocks embedded in each datum to ensure that reads do not consume stale data. That way, if Alice and Bob have a driver for at least one storage provider in common to which Alice’s client can successfully write, then Bob will always get fresh data from Alice.
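To sketch the read side of this (simplified from what is actually in the rc-0.14.1 branch), each datum carries a Lamport clock, and a reader skips any replica whose clock is older than the newest version it knows about. The Datum structure and driver interface below are simplified stand-ins:

```python
from dataclasses import dataclass

@dataclass
class Datum:
    lamport_clock: int   # incremented by the writer on every mutation
    payload: bytes

def read_fresh(drivers, key, min_clock):
    """Return the first replica at least as new as `min_clock`, else None."""
    for driver in drivers:
        datum = driver.get(key)          # may return None or a stale replica
        if datum is not None and datum.lamport_clock >= min_clock:
            return datum
    return None                          # every reachable replica was stale
```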

At a lower level, the driver model requires the implementation to do whatever it takes to implement read-follows-write consistency for gets and puts to the same key. Most storage providers already do this automatically, but you might imagine a driver for an eventually-consistent data store that blocks on put until it can be sure that a subsequent get will return the requisite data (or vice versa–a driver’s get implementation might block until it stops getting stale data).
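For example, here is a rough sketch (an assumption about how such a driver might be written, not our actual driver code) of wrapping an eventually-consistent backend so that put blocks until a subsequent get would see the new value:

```python
import time

class ReadFollowsWriteDriver:
    """Wraps a hypothetical backend exposing get(key) and put(key, value)."""

    def __init__(self, backend, poll_interval=0.5, timeout=30.0):
        self.backend = backend
        self.poll_interval = poll_interval
        self.timeout = timeout

    def put(self, key, value):
        self.backend.put(key, value)
        deadline = time.time() + self.timeout
        # Block until the backend serves the value we just wrote.
        while self.backend.get(key) != value:
            if time.time() > deadline:
                raise IOError("backend did not converge in time")
            time.sleep(self.poll_interval)

    def get(self, key):
        return self.backend.get(key)
```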

One nice thing about our driver model that I think you’ll like is that it’s meant to make drivers composable. I could create a “meta-driver” that implemented some more advanced replication policy on top of a set of drivers, like requiring m-of-n drivers to succeed for the meta-driver’s replication to succeed. We haven’t tried this yet, but this is a feature I’d like to have on by default.
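To give a feel for what such a meta-driver might look like (a hypothetical sketch of the idea, not code we’ve written), assuming each underlying driver exposes put(key, value) -> bool and get(key):

```python
class MofNReplicationDriver:
    """Declares a write successful only if at least m of n drivers succeed."""

    def __init__(self, drivers, m):
        assert 1 <= m <= len(drivers)
        self.drivers = drivers
        self.m = m

    def put(self, key, value):
        successes = 0
        for driver in self.drivers:
            try:
                if driver.put(key, value):
                    successes += 1
            except IOError:
                continue          # one failed driver doesn't abort the write
        return successes >= self.m

    def get(self, key):
        # Return the first available replica; freshness checks (e.g. the
        # Lamport clocks described above) would be layered on top.
        for driver in self.drivers:
            datum = driver.get(key)
            if datum is not None:
                return datum
        return None
```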


Jude - I’m a beginner, learning more about Blockstack and code … so I can help promote and build on Blockstack. I was reading last night about Gaia storage (it looks like you wrote it) as used in the Blockstack Browser (“BSB”) and wanted to make sure I understand how it works, from a (very) low level. As I understand it, the datastore (“ds”) in Gaia (a local virtual file system) is used to achieve the same goal (for lack of a better description) as, e.g., a standard database (like the MySQL instance used for, say, a WordPress site). Is that a fair representation?

Re: PouchDB (“pdb”). As I understand it, pdb stores data/information in the browser (I assume this is also a type of virtual file system), and I was wondering whether something like pdb or pdb-server is something to use in BSB, or whether it’s even applicable since Gaia is available. Thanks again for your help.