Request for Comments: Gaia Indexing Service

aaron · May 31, 2018, 1:58pm

Overview

The basic idea here is that many applications will have a difficult time implementing sharing and other functionality directly with the decentralized data storage interface provided by Gaia. The solution is to provide an indexing service which is open-source, configurable, and easy to deploy. The exact specifications for this, however, still need to be ironed out.

Easily realizable version

The most directly realizable version is something like a simple search indexer, which uses multi-player reads of gaia-stored data. The indexer would consume a data schema, and be provided with a filename to regularly index. This is a pretty direct generalization of the profile indexer that powers our currently deployed search service (see https://github.com/kantai/blockstack-search-indexer/).

This could emit data in a number of different formats: JSON by default, but, importantly, it should also be able to POST to an elasticsearch endpoint. The repository should also have an easy-to-deploy setup which initializes both the indexer and a search endpoint.
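A minimal sketch of the crawl step described above, assuming a very simple schema format (field name mapped to a Python type) and a caller-supplied `fetch` function that returns a file body or `None`. The names `index_file`, `matches_schema`, and the record layout are hypothetical, not part of any existing indexer.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical schema format: field name -> required Python type.
Schema = Dict[str, type]


def matches_schema(doc: dict, schema: Schema) -> bool:
    """Return True when every schema field is present with the expected type."""
    return all(isinstance(doc.get(field), t) for field, t in schema.items())


def index_file(
    hub_urls: Dict[str, str],
    filename: str,
    schema: Schema,
    fetch: Callable[[str], Optional[str]],
) -> List[dict]:
    """Read `filename` from each user's hub and keep documents that fit the schema."""
    records = []
    for user, base_url in sorted(hub_urls.items()):
        body = fetch(base_url.rstrip("/") + "/" + filename)
        if body is None:
            continue  # this user has not written the file
        try:
            doc = json.loads(body)
        except ValueError:
            continue  # skip malformed data rather than abort the crawl
        if isinstance(doc, dict) and matches_schema(doc, schema):
            # Index only the fields named by the schema.
            records.append({"user": user, **{k: doc[k] for k in schema}})
    return records
```

The returned list can be written out with `json.dumps` for the default JSON output, or handed to whatever code POSTs documents to a search endpoint.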

Open questions

Support for notifications?

Can this service be used to enable notifications (i.e., pushed from the indexer to a client)?
If so, is there a standard that we’d like to implement? How should this interact with the notion of a “gaia inbox”?

jude · May 31, 2018, 3:23pm

Support for notifications?

I think this is going to be an app-specific thing, regardless of whether or not it touches the indexer design and/or implementation.

To a first approximation, the indexer could implement a set of GET and POST endpoints to receive re-indexing hints. Apps would use this endpoint to tell the indexer that the data in the user’s Gaia hub has changed, and that the indexer should go re-crawl it. The endpoint would have the following properties and exhibit the following behaviors:

The GET endpoint serves a description of how to POST hints. It includes:
- A version number
- A challenge text
- Which app(s) it indexes
- The maximum payload size the POST hint will accept
The POST endpoint authenticates the POSTer as being sent by a particular Blockstack user. This can be achieved much the same way it is with Gaia: the indexer gives the user a challenge text, and the user replies with a signature over the challenge text, a random salt, and the posted payload.
The POST endpoint accepts a posted payload that encodes the following information:
- Version number
- The user’s app address
- a list of files that the user has changed (and need to be re-crawled)
- OPTIONAL: the new data, if it’s small enough
The POST endpoint requires that the posted payload’s user app address must correspond to the public key used to authenticate the POST header—i.e. the user can only signal that they have modified their own files for a specific application.
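The checks listed above could be sketched roughly as follows. This is only an illustration: the JSON field names (`version`, `app_address`, `files`), the size limit, the digest layout, and the app origin are assumptions, and signature verification and address derivation are passed in as callables because the real scheme would follow whatever Gaia uses.

```python
import hashlib
import json

MAX_PAYLOAD_BYTES = 4096  # assumed limit, advertised by the GET endpoint


def describe(challenge: str) -> dict:
    """Body served by the GET endpoint (field names are hypothetical)."""
    return {
        "version": 1,
        "challenge": challenge,
        "apps": ["https://app.example"],  # placeholder app origin
        "max_payload_bytes": MAX_PAYLOAD_BYTES,
    }


def signing_digest(challenge: str, salt: str, payload: bytes) -> bytes:
    """Digest the user signs: challenge text, random salt, and posted payload."""
    h = hashlib.sha256()
    for part in (challenge.encode(), salt.encode(), payload):
        h.update(len(part).to_bytes(4, "big"))  # length prefix keeps parts unambiguous
        h.update(part)
    return h.digest()


def accept_hint(challenge, salt, payload, signature, public_key, verify, address_of):
    """Return the list of files to re-crawl, or raise ValueError.

    `verify(public_key, digest, signature)` and `address_of(public_key)` are
    supplied by the caller; they stand in for the real signature scheme.
    """
    if len(payload) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload too large")
    if not verify(public_key, signing_digest(challenge, salt, payload), signature):
        raise ValueError("bad signature")
    hint = json.loads(payload)
    if hint.get("version") != 1:
        raise ValueError("unsupported version")
    # The user may only signal changes to their own files for this app.
    if hint.get("app_address") != address_of(public_key):
        raise ValueError("address does not match signing key")
    return list(hint.get("files", []))
```

Checking the payload size and signature before parsing the JSON keeps unauthenticated input away from the parser.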

The indexer would need to have access to each user’s profile so it can identify and authenticate the POSTed hint. I think this is fine–getting the set of profiles is going to be necessary anyway, since the indexer will need to know where the set of Gaia hubs are for a particular app.

The indexer implementation can decide what to do when a user POSTs to it. It can do things like:

Rate-limit or throttle user requests
Queue files for reindexing at a particular date, or at a particular rate
Synchronously update its index with new data
Ignore the user
etc.
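Two of the options above (throttling and queueing for later re-indexing) can be combined in a small in-memory structure. A rough sketch, assuming one accepted hint per user per interval; the class name `HintQueue` and its parameters are made up for illustration.

```python
import time
from collections import deque
from typing import Deque, Dict, List, Tuple


class HintQueue:
    """Accept at most one hint per user per `min_interval` seconds and
    queue the named files for later re-indexing."""

    def __init__(self, min_interval: float = 60.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the policy can be tested
        self.last_seen: Dict[str, float] = {}
        self.pending: Deque[Tuple[str, str]] = deque()

    def submit(self, user: str, files: List[str]) -> bool:
        now = self.clock()
        last = self.last_seen.get(user)
        if last is not None and now - last < self.min_interval:
            return False  # throttled: this hint is ignored
        self.last_seen[user] = now
        for name in files:
            if (user, name) not in self.pending:  # avoid duplicate crawl jobs
                self.pending.append((user, name))
        return True

    def drain(self, batch_size: int) -> List[Tuple[str, str]]:
        """Pop up to `batch_size` jobs, oldest first."""
        batch = []
        while self.pending and len(batch) < batch_size:
            batch.append(self.pending.popleft())
        return batch
```

A crawler loop would call `drain` at whatever rate the deployment can afford, which is where the "at a particular rate" policy would live.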

I’m not sure this has any interaction with the Gaia inbox proposal—I think the Gaia inbox proposal is less about maintaining an index over an application’s data, and more about bootstrapping a social graph between users. Thoughts?

aaron · May 31, 2018, 3:31pm

Here’s a mockup of a social network bootstrap:

Alice adds user Bob as a desired contact
Bob searches for all users which indicate Bob as a desired contact
Bob adds Alice back

This could be done more simply with the gaia hub inboxes, and the gaia hub inboxes were designed to prevent exactly the above use case. Still, I want to explore why the above use case is a bad one, because it is a significantly simpler architecture (and doesn’t involve users writing to each other’s gaia hubs).
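For concreteness, the search-based bootstrap in the three steps above amounts to a reverse lookup over each user's published contact list. A minimal in-memory sketch; `ContactIndex` and its methods are hypothetical names, and a real indexer would fill it by crawling each user's file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class ContactIndex:
    """Reverse index: for each user, who lists them as a desired contact."""

    def __init__(self) -> None:
        self._wanted_by: Dict[str, Set[str]] = defaultdict(set)

    def ingest(self, owner: str, desired_contacts: Iterable[str]) -> None:
        """Replace `owner`'s entries with their current desired-contact list."""
        for wanted in self._wanted_by.values():
            wanted.discard(owner)  # drop stale entries, so removals take effect
        for target in desired_contacts:
            self._wanted_by[target].add(owner)

    def who_wants(self, user: str) -> List[str]:
        """Step 2: all users that indicate `user` as a desired contact."""
        return sorted(self._wanted_by.get(user, set()))
```

Because each crawl replaces the owner's previous entries, Alice removing Bob from her list simply makes her disappear from Bob's search results on the next crawl.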

jude · May 31, 2018, 3:36pm

Here’s a mockup of a social network bootstrap:
Alice adds user Bob as a desired contact
Bob searches for all users which indicate Bob as a desired contact
Bob adds Alice back

Totally agree that Gaia inboxes make this particular problem simpler!

My protocol sketch above was more for the case of:

Alice and Bob are already connected
Bob signs in, and in doing so, subscribes to push notifications from the indexer for Bob-specific events.
Alice writes a new file, or updates an existing one (like a status or profile picture). This pushes a notification to the indexer that the relevant file(s) have changed.
The indexer sends Bob an immediate notification that Alice has written new data, and his client refreshes it.
The indexer marks its cached files for Alice as stale, until it can fetch them from her Gaia hub and re-process them.
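The flow above could be sketched like this, with subscriptions as plain callbacks and the cache as a dictionary. Everything here (`NotifyingIndexer`, the cache layout, the `fetch` signature) is an assumption for illustration; a real deployment would push over a network transport rather than call a function.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class NotifyingIndexer:
    def __init__(self) -> None:
        # author -> callbacks of clients subscribed to that author's events
        self.subscribers: Dict[str, List[Callable]] = defaultdict(list)
        # (author, filename) -> {"data": ..., "stale": bool}
        self.cache: Dict[Tuple[str, str], dict] = {}

    def subscribe(self, author: str, callback: Callable) -> None:
        """A client registers for events about `author` when it signs in."""
        self.subscribers[author].append(callback)

    def hint(self, author: str, files: List[str]) -> None:
        """The author's app reports changed files."""
        # Mark cached copies stale first, so a subscriber that queries the
        # index from inside its callback never sees old data as fresh.
        for name in files:
            entry = self.cache.setdefault((author, name), {"data": None, "stale": True})
            entry["stale"] = True
        for callback in self.subscribers[author]:
            callback(author, list(files))

    def refresh(self, author: str, name: str, fetch: Callable[[str, str], object]) -> None:
        """Later, re-fetch the file from the author's hub and clear the stale flag."""
        self.cache[(author, name)] = {"data": fetch(author, name), "stale": False}
```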

aaron · May 31, 2018, 6:37pm

Right – I think this is what I was trying to get an idea of. The above use case actually doesn’t sound like an indexer, but instead a notification service, with the difference being how important the specific service is to the normal functioning of the application. A search index should (in theory) be a completely replaceable part – Alice could run an “Alice Index” and Bob could run a “Bob Index” and have the exact same data. This is not true for the proposed notification system. With such a system, it’s very important which notification service Bob pushes his updates to, because Alice should subscribe for updates for Bob’s events from that service. In that case, I think this is something that would need to be user-specified and published, i.e., Bob’s profile contains an entry that says “For notifications about Bob’s files, ask server X”. This, at least to me, seems like a separate thing from a search index.

jude · May 31, 2018, 7:16pm

The use-case I had in mind was something like a decentralized Facebook, which needs to implement both an indexer and a notification service. The indexer aggregates your friends’ status updates into a Wall, which can be fetched with a single HTTP request. The notification service informs you when one of your friends posts something, so your client sees the update without having to poll the indexer all the time. Both the indexer and notification service are app-specific.

How separable are indexers and notification services? I think the deployments are separable—Alice and Bob could run their own indexers and share a notification service. However, the code probably isn’t separable—handling notifications in a way that achieves the above effect sounds like a cross-cutting concern to me.

I’m not convinced that Alice and Bob need to share a notification service per se, although a simple implementation of a notification service could be a logically centralized one. But, if the indexer and notification service turn out to be logically inseparable, we’ll need to think hard about the design of the notification service to allow for multiple cooperating deployments.

One idea I had for this problem is to have a namespace for the decentralized Facebook whereby people who run indexers and notification services can list their services’ DNS or IP addresses. Then, users subscribe to one or more such services. While indexers don’t need to communicate, the notification services can ensure all-to-all notification transmission by enumerating the set of notification services via the namespace, and forwarding notifications along to other notification servers (kind of like how Matrix works today).
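A toy model of the forwarding idea, assuming each notification carries a unique id and that a shared dictionary stands in for the list of services enumerated from the namespace. The class and method names are hypothetical, and real servers would talk over the network instead of calling each other directly.

```python
from typing import Dict, List, Set


class NotificationServer:
    def __init__(self, address: str, registry: Dict[str, "NotificationServer"]) -> None:
        self.address = address
        self.registry = registry  # stand-in for the namespace listing
        self.seen: Set[str] = set()
        self.delivered: List[str] = []
        registry[address] = self  # "list" this service in the namespace

    def publish(self, event_id: str, event: str) -> None:
        """Called by a local client: deliver locally, then forward to every peer."""
        self.receive(event_id, event, forward=True)

    def receive(self, event_id: str, event: str, forward: bool = False) -> None:
        if event_id in self.seen:
            return  # duplicate: already delivered here
        self.seen.add(event_id)
        self.delivered.append(event)
        if forward:
            for address, peer in self.registry.items():
                if address != self.address:
                    peer.receive(event_id, event)
```

Only the originating server forwards, so each event crosses each link once. The `seen` set guards against a client publishing the same event twice or to more than one server.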

aaron · May 31, 2018, 8:11pm

I think this is an example where the assembly of the wall could easily be done client-side, where the indexer is just used to aggregate — it’s just a search endpoint. So in the example of a wall, you’d have a search like “find all posts with wall identifier = X”, and the client would be responsible for assembling.
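As a rough illustration of that split, the indexer side is just a filter on a wall identifier and everything else happens in the client. The post fields (`wall`, `author`, `id`, `timestamp`) are assumed for the example, and `search_wall` stands in for a call to the search endpoint.

```python
from typing import Dict, List


def search_wall(posts: List[dict], wall_id: str) -> List[dict]:
    """Stand-in for the search endpoint: all indexed posts with the given wall identifier."""
    return [p for p in posts if p.get("wall") == wall_id]


def assemble_wall(hits: List[dict], limit: int = 20) -> List[dict]:
    """Client-side assembly: drop duplicate hits and order newest first."""
    unique: Dict[tuple, dict] = {}
    for post in hits:
        key = (post["author"], post["id"])
        # Keep the most recent copy of each post.
        if key not in unique or post["timestamp"] > unique[key]["timestamp"]:
            unique[key] = post
    return sorted(unique.values(), key=lambda p: p["timestamp"], reverse=True)[:limit]
```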

They wouldn’t need to share the same notification service, but in order to receive notifications from Bob, Alice would need to subscribe for “Bob Events” from a service that Bob’s client is communicating with (say Service 1). Now, Alice could designate some other service (Service 2) as a notification service for “Alice Events”, and then subscribers for notifications on Alice events could use that service. Obviously there are other schemes that are possible, but those require either (1) way more infrastructure to be actively crawling or (2) significant latency degradation.
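If each user published their chosen service in their profile, a client could work out where to subscribe with something like the sketch below. The profile key `notificationService` is invented for this example; no such field is defined in the profile format today.

```python
from typing import Dict, Iterable, List, Tuple


def plan_subscriptions(
    follows: Iterable[str],
    profiles: Dict[str, dict],
    key: str = "notificationService",  # hypothetical profile entry
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group the users to follow by the notification service each one publishes.

    Returns (service -> users, users with no published service).
    """
    plan: Dict[str, List[str]] = {}
    unknown: List[str] = []
    for user in follows:
        service = profiles.get(user, {}).get(key)
        if service is None:
            unknown.append(user)  # fall back to polling or crawling for these
        else:
            plan.setdefault(service, []).append(user)
    return plan, unknown
```

The client then opens one subscription per service in `plan`, which is how Alice ends up subscribed to "Bob Events" at Service 1 while her own followers use Service 2.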

alexc.id · June 13, 2018, 6:03pm

I don’t understand why that use case is a bad one as long as nothing requires step 1 of a user (also, you might want the ability to undo step 1–i.e. remove Bob as a desired contact before Bob gets a chance to search for users indicating him as a desired contact). Is there more background information that I missed?

aaron · June 13, 2018, 6:10pm

I’m fairly convinced that this use case for an indexer is fine. The only downside I can really think of is that it makes the indexer a required component for the application’s normal functionality – though I would argue that is okay as long as the indexer is user-selectable.

alexc.id · June 13, 2018, 6:11pm

This is actually similar to how Stealthy’s offline messaging service works today–the final product it assembles is not a wall, but there is no reason why it couldn’t be. With some planned modifications to our protocol, it scales reasonably well, but probably works best with something like the Gaia inbox proposal.

This discussion about indexing is an interesting way of potentially increasing the efficiency of the protocol in a way similar to the inbox proposal (reducing the number of sources that need to be consulted to check for updates). I wonder if it would be sufficiently fast for offline messaging notifications–though I believe the Gaia inbox is still preferable.

alvesjtiago · June 20, 2018, 4:15am

Hi everyone. First of all, great discussion here and I agree with most of the topics raised.

Without getting into the technical implementation for now, I would just like to raise two points:

From my personal experience with Travelstack and after talking with Justin from Graphite and George from Souq, there is a first indexing component that is common to all our apps: the ability to know which users are already authenticated. This could be integrated directly into core.blockstack (like the endpoint to look up users) or as a separate service, but I believe that for new developers it would be extremely useful to have that information readily available when they start.
This would also allow for cross-app prompts such as querying if a user uses both Graphite and Travelstack, for example, and prompting to add an image to a doc from Travelstack.
Regarding the other features such as the example given of adding a contact I believe both services (indexer and notifications) serve different purposes. The indexer would always work as the aggregator that one could query to reconstruct the overall graph and the notifications service as an ephemeral service to push and receive changes from the users.

Looking forward to hearing your opinions.

petec · July 10, 2018, 4:42pm

Was wondering if anyone can share some sample code or point me to other resources on how to create a Gaia Indexing Service. Thanks

alexc.id · July 28, 2018, 12:03am

Hi Petec,

There is an example up from our Feb. 2018 Stealthy website on our github. Look at indexedIO.js

Some things to note:

hierarchy not supported (i.e. one index per directory)
indexedIO.js stores deleted files under ‘inactive’; this is inefficient for systems that create/delete a lot of files
a sharedIndex is also created and encrypted with a separate public key (this is to allow others to understand what files are present)
if you see mention of firebase or firebaseIO, this was a switchable back end that allows for rapid development and debugging

Here’s the github link:

We’ve since designed a number of changes and might eventually open source this with a lot of new features, though we’re looking at Jude’s list files work with GAIA beforehand, and that might actually meet your current requirements. Be sure to look there first.

-AC

alexc.id · July 28, 2018, 12:08am

Hi Tiago,

I’m a bit late in replying, but I’m wondering if “the ability to know which users are already authenticated” is met by the apps list in a user’s profile or if you mean something different?

For example, from my profile, the apps list shows that I’ve been on stealthy, graphite, and others (https://gaia.blockstack.org/hub/1GHZbCnbufz53Skb79FwnwuedW4Hhe2VhR/0/profile.json):

"apps": {
            "https://www.stealthy.im": "https://gaia.blockstack.org/hub/1MkrVDKyiPRh4qNXfnMXt67VHQwxLy9CXH/",
...
"https://gaia.blockstack.org/hub/1rAvjmKvtPGnEKd4TEiHzmtFxhxJvJXen/",
            "http://localhost:8080": "https://gaia.blockstack.org/hub/1HAYgicfp9uVZMiYJPCLDiFxv11pqTautH/",
            "http://localhost:3000": "https://gaia.blockstack.org/hub/19ixjyXmbDq9w1VMXCEkrEMytW712rt9eM/",
            "https://app.graphitedocs.com": 
...

-AC

alvesjtiago · July 28, 2018, 2:29pm

Hi Alexander,

Thanks for the reply.
I mean indexing that information for every user, in order to know which users have already authenticated on your application without having to query all 22,000 users (the number of users on blockstack right now: https://core.blockstack.org/v1/names?page=219) every time you need to know who’s already using your app. Does that make sense?

The indexer for Travelstack does exactly that and exposes that info at https://user-indexer.travelstack.club/users.json. It would be great to hear your thoughts on that.
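The core of such a user indexer is an inversion of the `apps` map found in each profile. A minimal sketch, assuming the profiles have already been fetched into a dictionary keyed by user name; the function names are illustrative and not taken from the Travelstack indexer.

```python
from typing import Dict, Iterable, List, Set


def build_app_index(profiles: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Invert each profile's `apps` map into app origin -> set of user names."""
    index: Dict[str, Set[str]] = {}
    for name, profile in profiles.items():
        for origin in (profile.get("apps") or {}):
            index.setdefault(origin, set()).add(name)
    return index


def users_of(index: Dict[str, Set[str]], origin: str) -> List[str]:
    """Users who have authenticated with one app."""
    return sorted(index.get(origin, set()))


def users_of_all(index: Dict[str, Set[str]], origins: Iterable[str]) -> List[str]:
    """Users who have authenticated with every listed app (cross-app prompts)."""
    sets = [index.get(origin, set()) for origin in origins]
    return sorted(set.intersection(*sets)) if sets else []
```

Building the index once per crawl means an app only pays the cost of walking every profile when the index is refreshed, not on every lookup.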

Thanks,
Tiago

alexc.id · July 30, 2018, 6:45pm

Thanks for the explanation.

I don’t think I can suggest any alternatives you haven’t already considered (e.g., a centralized store like Firebase). We’ve talked about a decentralized analytics/db service, but that’s a ways off for us to reconsider.

An analytics platform might also be useful for this, depending on its ability to let you incorporate user data and export it manually or, better, automatically. Also, inevitably someone will ask you about DAU/MAU stats.

Here’s the forum discussion on Blockstack Portal analytics: Tracking in Browser Portal