Proposal: List files in Gaia Hub

Motivation

There are a number of reasons we’d want a list-files (ls) command in gaia hubs. The biggest is data migration and portability. For a user to truly be able to migrate all their data (or see all their data), they’d need to be able to see all the files that the gaia hub is storing on their behalf (because an application may not show them that information). This has also been requested a number of times (example: Feature Request: Gaia `ls` functionality).

Overview

Supporting this requires 2 spec changes:

  1. Gaia hub driver model
  2. Gaia hub API

Gaia Hub Driver Model Changes

This proposal would extend the gaia hub driver model to support listing files with a given prefix:

listFiles(prefix: string) : Promise<Array<string>>

This returns all of the files which begin with the given prefix.
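To make the driver contract concrete, here is a minimal sketch of a driver that satisfies it. The in-memory storage, the class name InMemoryDriver, the putFile method, and the matchPrefix helper are all hypothetical and only for illustration; a real driver would call its storage backend's listing API instead. It also assumes files are stored under paths of the form <hub-address>/<filename>, so the hub would pass the address as the prefix.

```typescript
// Hypothetical helper: keeps the paths that begin with `prefix`, in sorted order.
function matchPrefix(paths: string[], prefix: string): string[] {
  return paths.filter((path) => path.startsWith(prefix)).sort();
}

// Hypothetical in-memory driver, used only to illustrate the listFiles contract.
class InMemoryDriver {
  private paths: string[] = [];

  // Records a stored file path (contents are omitted in this sketch).
  putFile(path: string): void {
    if (this.paths.indexOf(path) === -1) {
      this.paths.push(path);
    }
  }

  // Same shape as the proposed driver method: all files beginning with `prefix`.
  listFiles(prefix: string): Promise<Array<string>> {
    return Promise.resolve(matchPrefix(this.paths, prefix));
  }
}
```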

API Changes

This would add a new HTTP GET route to the gaia hub:

GET /list-files/<hub-address>

This requires a valid authentication token for <hub-address> (see the Gaia authentication docs: https://github.com/blockstack/gaia#address-based-access-control).

Open Questions

How should we best support pagination?

If we want to limit the return size of these requests, we’ll need to support pagination, but many backend drivers will have different methods of doing their own pagination, and we wouldn’t want to act as a buffer here.

Work Branch

Tagging @jude for comments

Great write-up @aaron!

Just a couple clarifications:

  • The GET /list-files/<hub-address> endpoint is implemented by the Gaia hub, not the storage endpoint. It must only be accessed via https. This is crucial because user data shouldn’t be enumerable by default by just anyone, and (see below) we’ll need to pass some sensitive data to the Gaia hub to do pagination efficiently.
  • Not all storage systems support pagination natively, since they have no notion of pages. Instead, they expect clients to pass a cursor-like object in the relevant listFiles()-like API endpoint.

Regarding pagination, one idea is to allow the Gaia hub to pass back to the client a cookie that contains any driver-specific state, like pagination cursors. Then, successive calls to GET /list-files/<hub-address>?page=XXX would preserve the pagination cursor, thereby allowing file scans to operate efficiently (the alternative would be to force each page query to scan “up to” the page requested, which would have O(n^2) time complexity for n files).
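The cost difference can be sketched by counting how many entries the backend touches while a client lists everything. This is only a rough model, assuming a fixed page size and a backend that must scan linearly from the start when it has no cursor; both function names are made up for the illustration.

```typescript
// Entries touched when every page request rescans from the start "up to" its page.
function entriesScannedWithoutCursor(n: number, pageSize: number): number {
  let scanned = 0;
  for (let end = pageSize; end < n + pageSize; end += pageSize) {
    scanned += Math.min(end, n); // request for this page walks entries 0..end
  }
  return scanned;
}

// Entries touched when each request resumes from a preserved cursor.
function entriesScannedWithCursor(n: number, pageSize: number): number {
  let scanned = 0;
  for (let start = 0; start < n; start += pageSize) {
    scanned += Math.min(pageSize, n - start); // only the new page is walked
  }
  return scanned;
}
```

Under this model, listing 1000 files in pages of 100 touches 5500 entries without a cursor and 1000 with one. The gap grows quadratically with the number of files for a fixed page size.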

Also, regarding prefix matching, getting it to work at the API level is going to be a lot of work, since not all drivers support it. Do we have a real case where match-by-prefix will be necessary? If so, should we just pass a prefix as a driver-specific hint in the query string, so we don’t have to commit to supporting it in all future drivers?

Definitely excited about this feature @aaron. I don’t have anything meaningful to add regarding pagination or prefix matches that @jude mentions.

Stealthy currently writes each offline message to a file that is indexed by our own js module. There are some performance concerns when writing many files, and because deletion of a file is not possible, we’re currently storing deleted file handles (essentially empty files) in an index for the day when deletion is possible. I’m curious if true file deletion is on the roadmap?

As an FYI (and probably specific to our use case), Stealthy’s indexing optionally writes two index files: one encrypted for the user and another encrypted for the recipient. This way the intended recipient is able to see what new files (messages) are available for processing.

Okay, so I think that, to deal with the issues of continuation tokens / pagination, the updated model should be:

POST /list-files/<hub-address>

Request JSON data:
{ continuation?: string }

Returns JSON object:

{
  files: ["a.json", "b.json", "c.json"],
  continuation?: string
}

And the driver should implement:

listFiles(prefix: string, continuation: string) : Promise<{files: Array<string>, continuation?: string}>
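As a rough illustration of how the continuation token threads through successive calls, here is a minimal sketch over an in-memory list of names. The helper names listPage and listAll, the pageSize parameter, and the choice of "last name returned" as the token are all assumptions made for the example; a real driver would pass through whatever opaque cursor its backend issues rather than re-filtering the full list.

```typescript
interface ListFilesResult {
  files: Array<string>;
  continuation?: string;
}

// Hypothetical helper: returns one page of names matching `prefix`.
// The token is the last name of the previous page; `pageSize` must be at least 1.
function listPage(
  names: string[],
  prefix: string,
  continuation: string | undefined,
  pageSize: number
): ListFilesResult {
  const remaining = names
    .slice()
    .sort()
    .filter((name) => name.startsWith(prefix))
    .filter((name) => continuation === undefined || name > continuation);
  const files = remaining.slice(0, pageSize);
  // Only hand back a token when more entries remain after this page.
  return remaining.length > pageSize
    ? { files, continuation: files[files.length - 1] }
    : { files };
}

// Client-side loop: keep requesting pages until no token comes back.
function listAll(names: string[], prefix: string, pageSize: number): string[] {
  const all: string[] = [];
  let token: string | undefined = undefined;
  do {
    const page: ListFilesResult = listPage(names, prefix, token, pageSize);
    all.push(...page.files);
    token = page.continuation;
  } while (token !== undefined);
  return all;
}
```

The absence of a continuation field in the result is what tells the caller that the listing is complete, which matches the optional field in the response shape above.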

Hey @aaron, I think I independently came to this conclusion and wrote a PR for it here. The interface is nearly identical:

listFiles(prefix: string, page: string) : Promise<{entries: Array<string>, page: string}>

Also, the request body in my code is {page?: string} instead of {continuation?: string} (but they’re semantically equivalent).

Happy to tweak it to conform to your interface.

Also, just to amend my earlier comments:

  • We’re settling on POST /list-files/<hub-address>, not GET.
  • The request does not take any query string arguments. Everything the method needs will be sent via the request body (which shall be application/json).
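Putting those two points together, a client request might be assembled roughly as follows. This is a sketch under assumptions: the function name buildListFilesRequest and the ListFilesRequest type are made up, the body field is named continuation (the PR currently uses page), and the "bearer" Authorization header format is assumed rather than confirmed by this thread.

```typescript
interface ListFilesRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds the POST /list-files/<hub-address> request.
function buildListFilesRequest(
  hubUrl: string,
  hubAddress: string,
  authToken: string,
  continuation?: string
): ListFilesRequest {
  // The endpoint must only be accessed via https.
  if (!hubUrl.startsWith("https://")) {
    throw new Error("list-files must be requested over https");
  }
  const base = hubUrl.replace(/\/+$/, ""); // drop any trailing slashes
  return {
    url: `${base}/list-files/${hubAddress}`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Assumed header format; check the Gaia authentication docs.
      Authorization: `bearer ${authToken}`,
    },
    // No query string: everything goes in the JSON body.
    body: JSON.stringify(continuation === undefined ? {} : { continuation }),
  };
}
```

The result can be handed to fetch as fetch(req.url, { method: req.method, headers: req.headers, body: req.body }), repeating the call with the returned token until the response no longer includes one.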

Pull request opened: https://github.com/blockstack/gaia/pull/128

The blockstack-cli tool now supports gaia_getfile, gaia_putfile, gaia_listfiles, and get_app_keys directives, making this somewhat straightforward to test. Will deploy this code to the testnet as well.