RFC: Moving Gaia data to Home Computers

Gaia is a decentralized storage system that’s been in operation for several years, and has been used by a wide variety of Stacks (and Blockstack) applications. As mentioned elsewhere, Gaia is currently run by Hiro, and it’s very expensive to run as deployed.

Gaia was designed with the intention that users would run their own Gaia hubs, and connect them to storage back-ends of their own choosing. The BNS client libraries ensure that applications automatically discover other users’ Gaia hubs when loading data: given the user’s BNS name, an application name, and the name of the file to load, the client library is able to find the user’s Gaia hub and fetch the corresponding file.

This of course did not come to pass. In a well-meaning effort to bootstrap usage, Blockstack PBC (now Hiro) ran a default Gaia hub, which of course everyone used in place of running their own. To make matters worse, because the user’s Gaia hub URL is written to a BNS zone file, changing over to a new hub requires sending a Stacks transaction. So realistically, users are not going to migrate their Gaia data to their own Gaia hubs anytime soon (if forced, they’ll more likely abandon the data).

Gaia Rail

This has put Hiro in a bind. They don’t want to delete the default Gaia hub because it would break other applications that use it. But they don’t want to be paying to host an ever-growing amount of data either (who does?). Asking users to download a copy of their data might be tenable, but asking them to spin up their own hubs and send Stacks transactions is probably a no-go.

This gives me an idea. Instead of having Hiro host the hub and pay for the back-end, what if Hiro just ran a “request rail” for Gaia? It would serve to give a public, consistent URL to users’ Gaia hubs running at home behind NATs. This service would be stateless, and host no data of its own. Users would host their data at home (or even on their on-the-go laptops), and would run Gaia hubs that establish a persistent TCP connection to the Gaia rail in order to traverse whatever NATs separate their hubs from the public Internet.

The request flow would look like this:

                        NAT
                         |
request --> Gaia rail -- | -> Gaia hub --> local storage
                         |

When a request for a user’s data arrives at the Gaia rail, the rail simply forwards the request through the persistent TCP connection to the user’s Gaia hub, which in turn loads the data and sends it back to the rail. As the data arrives at the Gaia rail, the rail simply pushes it back to the requester.
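The forwarding step can be sketched as follows. This is a deliberately simplified, in-process model: the names (`GaiaRail`, `HubConnection`) are hypothetical, and a real rail would multiplex requests over the hub’s single persistent TCP connection rather than invoke a callback.

```typescript
// Hypothetical sketch: the rail keeps one live connection per authenticated
// BNS name and forwards each incoming request down that connection.
// It holds no file data of its own.
type HubConnection = (path: string) => Promise<Uint8Array | null>;

class GaiaRail {
  // BNS name -> persistent connection established by the at-home hub
  private hubs = new Map<string, HubConnection>();

  // Called once the hub completes its (separate) authentication handshake
  register(bnsName: string, conn: HubConnection): void {
    this.hubs.set(bnsName, conn);
  }

  // Called for each public HTTP GET that arrives at the rail
  async fetch(bnsName: string, path: string): Promise<Uint8Array | null> {
    const conn = this.hubs.get(bnsName);
    if (!conn) return null; // hub offline: a real rail would return 404/503
    return conn(path);      // push the hub's reply back to the requester
  }
}
```

The key property is that the rail is a pure conduit: if the hub is not connected, the rail has nothing to serve.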

All the Gaia rail operator does here is supply network bandwidth and a public IP/DNS name for clients to access, and perhaps does some rate limiting and caching (possibly by means of a 3rd party CDN) to prevent the service from being overwhelmed. The Gaia rail would implement an authentication protocol for users’ Gaia hubs, such that the Gaia hub must prove that it operates on behalf of a particular BNS name owner (meaning, only users with BNS names can use the Gaia rail).

The Gaia hub itself would be largely unmodified, save for two things:

  • This authentication protocol, whereby it proves to the Gaia rail that it’s the Gaia hub for a particular BNS name
  • A keep-alive protocol, so that when the user’s computer changes its IP address, it can reconnect itself to the Gaia rail. This would permit users to run their Gaia hubs on mobile devices such as laptops, which can go offline or rejoin the system from different IP addresses as the user moves through the world.

Decentralizing the Rail

Running the above rail would not be free of course – it would cost money in terms of the bandwidth required to run at scale. In the future, we could have Stacks nodes implement the rail through a variation of the up-and-coming StackerDB system. Briefly, a StackerDB is a store-and-forward chunk store for storing soft state in the Stacks network on behalf of a smart contract. The smart contract grants a whitelist of users a storage quota (i.e. a fixed number of fixed-sized chunks they can write), and employs a best-effort store-and-forward protocol to ensure that a user’s written chunks get replicated to all Stacks nodes that replicate that particular StackerDB. It’s currently being built for sBTC signers to leverage the Stacks p2p network to exchange FROST DKG information, but it’s otherwise general purpose.
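The quota model described above can be sketched like this. All shapes here are illustrative: the real StackerDB wire format, chunk size, and versioning scheme live in the Stacks node, and the whitelist would come from the smart contract rather than a constructor argument.

```typescript
const CHUNK_SIZE = 4096; // assumed fixed chunk size for this sketch

interface ChunkWrite {
  slot: number;    // which of the writer's granted slots
  version: number; // monotonically increasing; replicas keep the highest
  data: Uint8Array;
}

class StackerDBReplica {
  // Smart-contract whitelist: writer -> number of slots granted
  constructor(private quota: Map<string, number>) {}
  private slots = new Map<string, ChunkWrite[]>();

  write(writer: string, w: ChunkWrite): boolean {
    const n = this.quota.get(writer);
    if (n === undefined || w.slot >= n) return false;    // not whitelisted / over quota
    if (w.data.length > CHUNK_SIZE) return false;        // chunk too large
    const mine = this.slots.get(writer) ?? [];
    const prev = mine[w.slot];
    if (prev && prev.version >= w.version) return false; // stale write; keep newest
    mine[w.slot] = w;
    this.slots.set(writer, mine);
    return true; // accepted; store-and-forward would now gossip this chunk
  }

  read(writer: string, slot: number): Uint8Array | null {
    return this.slots.get(writer)?.[slot]?.data ?? null;
  }
}
```

The version check is what makes best-effort replication converge: every replica ends up holding the highest-versioned chunk it has seen for each slot.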

Already, the StackerDB system would enable a set of Stacks nodes to replicate an authorized user’s Gaia writes amongst themselves, as long as the data was small enough to fit into the user’s quota. From there, we could extend the StackerDB system to permit authorized users to open persistent TCP connections to Stacks nodes that replicate that DB, and in doing so, subscribe to new chunk writes and service chunk reads. This would enable users to write to their Gaia hubs at home by means of a Stacks node – the user writes data as one or more StackerDB chunks, which in turn get pushed to the user’s at-home Gaia hub (which is a subscriber).

In addition, we would extend the StackerDB system to allow downstream clients to register themselves as origins for chunk data. Then, the Gaia hub would open a persistent TCP connection to one or more public Stacks nodes that replicated the hub’s StackerDB, and register itself as the origin for the user’s chunks. Then, when the user asks the Stacks node for a particular file, a Stacks node would simply forward the request to the hub and ferry back the data it returns (just as the rail would do).
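Putting the two proposed extensions together, a Stacks node’s role in this design might look like the sketch below (all names hypothetical): subscribers receive new chunk writes pushed to them, and a registered origin services reads, just as the rail would.

```typescript
type ReadHandler = (slot: number) => Uint8Array | null;

class StackerDBNode {
  private subscribers: Array<(slot: number, data: Uint8Array) => void> = [];
  private origins = new Map<string, ReadHandler>();

  // Extension 1: the at-home hub subscribes to new chunk writes
  subscribe(onWrite: (slot: number, data: Uint8Array) => void): void {
    this.subscribers.push(onWrite);
  }

  // Extension 2: the at-home hub registers as the origin for a user's chunks
  registerOrigin(bnsName: string, handler: ReadHandler): void {
    this.origins.set(bnsName, handler);
  }

  // A user's write arrives as a StackerDB chunk and is pushed to subscribers
  acceptWrite(slot: number, data: Uint8Array): void {
    for (const s of this.subscribers) s(slot, data);
  }

  // A read is forwarded to the registered origin and the reply ferried back
  read(bnsName: string, slot: number): Uint8Array | null {
    return this.origins.get(bnsName)?.(slot) ?? null;
  }
}
```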

If we implement the Gaia rail this way, then we open up a way to fund Gaia operation to Stacks node operators: for a fee paid in STX, they would agree to run a StackerDB replica for a user’s Gaia hub. The node operator does not store data persistently; it just provides transit for the user’s reads and writes to and from their (NAT’ed) Gaia hub. The StackerDB chunk store acts as both a write-back cache for a user’s Gaia writes, and a read cache for other users’ Gaia reads.


Instead of all this complication, could the servers hosting the apps themselves hold the files? App files are encrypted so who cares where they’re stored. I’m not entirely sure how this works, but if the profile url isn’t pointing to gaia.blockstack.org, then no change, otherwise change the url to point back to where the app came from.

This is undesirable, since it brings us right back to the same power dynamics of Web 2.0 that we are trying to avoid. Post Web 2.0, the user, not the app, must decide where their data is kept.

The app could simply delete your files or hold them for ransom. Better to control where your data is stored and how it is stored so you can de-risk the possibility that a malicious app destroys your data.

Nothing proposed prevents the user from using their own gaia server.


I am all for users having control over where their data is stored, and I completely understand why this is needed.

My concerns though are how this impacts Scaling and Availability.

A user’s laptop on WiFi behind a DSL connection will only be able to respond to a handful of calls at a time; furthermore, it can’t respond at all while it’s offline.

Would utilizing the suggested StackerDB system help remedy these concerns?

Wouldn’t Gaia rail providers want to do rate limiting to keep hub hosts from being DDoSed?

Shouldn’t the Gaia rail provider also cache frequently requested content, to ensure greater scaling and availability under demand?

The existence of a Gaia rail does not preclude someone else from running a Gaia hub which simply replicates your home Gaia hub. In fact, the rail operator could do this as an add-on service: on write, your data gets copied to both your home hub and to the rail operator’s public hub. Then, the rail could simply redirect HTTP GETs to the hub replica based on a user-specified policy in their profile.json (which the hub operator would poll every so often).
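The user-specified policy mentioned above could be quite small. The shape below is purely hypothetical — none of these fields exist in profile.json today — but it illustrates what the rail would poll for and how it might choose where to direct reads.

```typescript
// Hypothetical replication policy embedded in the user's profile.json.
// Every field name here is an assumption, not an existing Gaia schema.
interface GaiaReplicationPolicy {
  replicateToRail: boolean;       // copy writes to the rail operator's public hub
  preferReplicaForReads: boolean; // redirect HTTP GETs to the replica when true
  pollIntervalSecs: number;       // how often the rail re-fetches profile.json
}

// Where the rail should send an HTTP GET, given the policy and hub liveness.
function chooseReadTarget(
  p: GaiaReplicationPolicy,
  homeHubOnline: boolean
): "replica" | "home" {
  // Use the replica if the user prefers it, or if the home hub is offline
  if (p.preferReplicaForReads || !homeHubOnline) return "replica";
  return "home";
}
```

A policy like this would also answer the availability concern above: reads fall back to the replica whenever the laptop is off.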

Yes, it could. The StackerDB system is essentially a way to attach a fixed amount of off-chain data to a smart contract, such that all Stacks nodes that subscribe to that smart contract will get a copy of that data. Unlike Atlas, the data can be written and re-written without sending a transaction. This could serve as the back-end of a Gaia rail service: writes would get replicated to the rail service’s StackerDB in addition to being sent to your home hub, and reads could be serviced from the StackerDB replica instead of your home hub.

Because the ability to write data to a StackerDB is contingent on authorization of a signing key in the StackerDB’s smart contract, the system would provide native support for encoding SLAs and payment for service in Clarity. Users would pay the Gaia rail operator in STX (or some other on-chain-tracked asset) in exchange for registering with the rail’s StackerDB instance. The rail operator would run a fleet of Stacks nodes to keep the StackerDB instance online. Also, anyone can create a replica of any StackerDB by simply configuring their Stacks node to search for other Stacks nodes that replicate the same data, and then query them for newly-written state. So, the rail system not only scales up in the number of replicas; it also does so in an open-membership way.

Yes, they would.

That would be a good idea.

Getting users to create and run their own Gaia hubs seems like poor UX. I would say it’s as problematic as sending a single tx to update the zone file.

Secondly, am I correct in assuming that if a user doesn’t run their own Gaia hub, they will face a potential denial of service on applications? If this is the case, then we would need 100% of all app users to run their own Gaia hub, which again seems pretty difficult, considering that a previous attempt by Blockstack at that failed.

It does not have to be users. It just can’t be Hiro footing the bill for everyone, and it can’t be one Gaia hub per app (otherwise what’s the point?)

Blockstack (now Hiro) also never really tried to diversify Gaia hubs, nor did they even prioritize helping power users run them in a meaningful way. And now it’s costing them a fortune.