Indexing application users

danielrearden.id.blo · September 15, 2019, 2:27pm

As part of the dapp I’m developing, I’d like to be able to maintain a list of users that are actually using the dapp server-side. Moreover, my intent is to make the server API configurable – anyone will be able to run their own server and point the dapp at it if they don’t trust or can’t access the public server. That means in order to maintain a list of users, the server needs to iterate through all possible users’ profiles to compile the list of actual users.

This approach was originally discussed in this post.

My question is two-fold:

1. From the perspective of my dapp, are all names and subdomains potential users whose profile I need to process and possibly add? There are currently about 1,350 pages of subdomains (or 135,000 subdomains) returned by the API. If I could whitelist/blacklist particular TLDs or subdomains, it would somewhat speed things up.
2. Has anyone dealt with doing indexing like this, and if so, what have you done to improve performance? As it stands, the current process requires thousands of requests to both the Blockstack core node and the Gaia hub. Running a core node locally helps considerably, but fetching the actual profile data from Gaia is still pretty slow due to rate limiting.
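
For illustration, the crawl-and-filter idea in the first question could be sketched roughly like this. It is a minimal sketch under stated assumptions, not something tested against a real node: `fetch_page` is a hypothetical callable standing in for whatever request returns one page of names from the core node, the suffix allow-list is just one possible way to whitelist, and the names in the fake data are made up.

```python
from typing import Callable, Iterable, Iterator, List


def iter_names(fetch_page: Callable[[int], List[str]]) -> Iterator[str]:
    """Yield every name by requesting pages until an empty page comes back.

    fetch_page is a hypothetical stand-in for the core node request.
    """
    page = 0
    while True:
        names = fetch_page(page)
        if not names:
            return
        yield from names
        page += 1


def filter_by_suffix(names: Iterable[str], allowed: Iterable[str]) -> Iterator[str]:
    """Keep only names ending in one of the allowed suffixes."""
    suffixes = tuple(allowed)
    for name in names:
        if name.endswith(suffixes):
            yield name


# Made-up two-page data set so the sketch runs without a node.
FAKE_PAGES = [
    ["alice.id.blockstack", "bob.example.id"],
    ["carol.id.blockstack"],
]


def fake_fetch_page(page: int) -> List[str]:
    """Return one fake page, or an empty list past the last page."""
    return FAKE_PAGES[page] if page < len(FAKE_PAGES) else []


candidates = list(filter_by_suffix(iter_names(fake_fetch_page), [".id.blockstack"]))
```

Filtering before any profile is fetched means the Gaia requests, which are the slow part, are only made for names that pass the allow-list.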

EDIT: After some additional thought, I think federation between the different servers may be an appropriate compromise:

- Each server would maintain a list of peers.
- A peer could be bootstrapped by either getting a list of usernames from an existing peer or by going through the process described above.
- When a user registers, an event would be broadcast to all peers to add the username.

The downside is that there’s now a level of trust required in assuming that the events are in fact being fired. On the other hand, this would also provide a mechanism for broadcasting other events (like the user adding content), which would lessen the need to poll Gaia for that data as well.
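
To make the federation idea a bit more concrete, here is a minimal in-memory sketch of the three steps. Everything in it is hypothetical: the `IndexerPeer` class, its method names, and the usernames are invented for illustration, and there is no networking or authentication.

```python
from typing import List, Set


class IndexerPeer:
    """In-memory stand-in for one indexer server (hypothetical, no networking)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.peers: List["IndexerPeer"] = []
        self.usernames: Set[str] = set()

    def bootstrap_from(self, peer: "IndexerPeer") -> None:
        """Copy the known usernames from an existing peer and link the two."""
        self.usernames |= peer.usernames
        self.peers.append(peer)
        peer.peers.append(self)

    def register(self, username: str) -> None:
        """Record a new user locally, then broadcast the event to all peers."""
        self.usernames.add(username)
        for peer in self.peers:
            peer.receive_registration(username)

    def receive_registration(self, username: str) -> None:
        """Handle a broadcast; it is not re-broadcast, to avoid loops."""
        self.usernames.add(username)


a = IndexerPeer("a")
a.register("alice.id")

b = IndexerPeer("b")
b.bootstrap_from(a)   # b learns about alice.id from a
b.register("bob.id")  # a learns about bob.id from the broadcast
```

Nothing here verifies that a peer actually forwards its registrations, which is exactly the trust issue mentioned above. Because broadcasts are not relayed, peers that are not directly linked would also miss events unless some gossip step were added.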

I would still love to hear if anyone has additional thoughts around any of this though!

mikecohen.id · September 15, 2019, 7:54pm

Hi @danielrearden.id.blo, I do this in the Radicle apps by indexing users into a Lucene index when they visit, i.e. the index is limited by app domain rather than by username domain. It also has endpoints to build the index by looping over users, pulling pages from a local Blockstack node, but that's a lot more data and would probably take a day or two to build initially because of the Gaia profile reads, as Jude points out. It's much easier to just index users when they visit, using a local Blockstack node, and get the elastic search logic that way.

I’d like to extend this at some point by having a common search object schema, allowing apps in general to add elastic search logic to their app via a REST API.

I don’t think there is a trust issue (I may be misunderstanding), as this is just a stateless cache of publicly available data. It’s a bit like the Radiks server in this respect, but with a different elastic search implementation. The worst case would be the cache becoming stale, but this can be mitigated by, say, a scheduled refresh.
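
As a rough illustration of the scheduled-refresh idea, a periodic job could pick out the entries that are older than some cutoff and re-fetch only those. This is a sketch under assumptions: the `stale_users` helper, the timestamps, and the one-hour cutoff are all made up for the example.

```python
from typing import Dict, List


def stale_users(last_indexed: Dict[str, float], now: float, max_age_seconds: float) -> List[str]:
    """Return usernames whose cached entry is older than max_age_seconds."""
    return sorted(u for u, ts in last_indexed.items() if now - ts > max_age_seconds)


# Hypothetical "last indexed at" timestamps, in seconds.
last_indexed = {"alice.id": 1000.0, "bob.id": 4000.0}

# With a one-hour cutoff at t=5000, only alice.id (4000 s old) needs a refresh.
to_refresh = stale_users(last_indexed, now=5000.0, max_age_seconds=3600.0)
```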

Our indexer runs in a Docker container, so it can be deployed, replicated, and partitioned fairly easily.

danielrearden.id.blo · September 15, 2019, 8:24pm

Thanks for the feedback @mikecohen.id! I think I’m trying to do pretty much what you described, except utilizing Neo4j. The idea is, starting from a list of app users, for each user, we get the relevant files from Gaia and dump the data into the db and then expose some API for querying the data. It sounds like that’s what Radicle is doing too, which is pretty cool!

My hangup with “indexing users when they visit” concerns bootstrapping a new indexer. Imagine that, as a user, I want to spin up my own indexer and use the app with that instead. Maybe the original indexer is temporarily unavailable, for example. Or the developer gets hit by a bus. The new indexer would be able to index any new visitors to the site but would have no knowledge of the previously indexed users. Likewise, if the user decides which indexer to use, the data would be fragmented across all indexers, as not every user would talk to the same indexer.

That’s the crux of the problem as I see it… unless I misunderstood your response and missed some insight.

mikecohen.id · September 15, 2019, 9:19pm

Hey - yes, understood - I don’t think you can tell whether a user’s profile and apps list have changed until you pull the profile JSON, so it’s not like indexing a Bitcoin node, where you can use a snapshot to seed another node.

I guess it depends on the question you’re asking of the indexer. If it’s ‘which apps has a user logged into?’ then you’ll have to pull the profile to answer, but if it’s ‘what discoverable data has the user uploaded on a specific domain?’ then the index can be kept current for that domain and can be used to seed/spin up other instances, and then you can do stuff like replication and sharding.
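
As one possible way to do the sharding part, each username could be mapped to an indexer instance with a stable hash, so every instance agrees on who owns which user. This is only a sketch: the `shard_for` helper and the choice of SHA-256 are assumptions, not how any existing indexer does it.

```python
import hashlib


def shard_for(username: str, shard_count: int) -> int:
    """Map a username to a shard number using a stable hash."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce it to a shard index.
    return int.from_bytes(digest[:8], "big") % shard_count


# The same username always lands on the same shard for a given shard count.
shard = shard_for("alice.id", 4)
```

Changing `shard_count` moves most users to a different shard, so a real deployment would need a re-seeding step or a scheme such as consistent hashing.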

BTW I used to work with Neo4j - really nice NoSQL db. Good luck!