WRFS

by Josh Patterson and Josh Lewis

When Can I Use That

During development of any application, the team typically sits down with users for a testing session, asking questions and gauging responses to see how well the product satisfies market demand, or, put simply, whether it "does the job". While developing floe.tv, a media mashup system, I was demoing the beta to two videographers one day, showing off features and asking what was important to them. We came to the subject of data storage, local hard drives, and getting media online, and just as a thought exercise I asked, "Well, what if floe.tv just knew about all your online media by your login name, and referenced it automatically in your libraries the first time you logged in, just as if it was an app installed locally on your hd?" Immediately both of them became excited, and one asked, "Can I do that right now? When can I use that?" I knew from experience that the market was speaking very loudly and clearly in my direction, and that I had better listen very closely.

The very next meeting I posed this question to our team:

What if our app was "inherently installed" in the internet? What if someone logged in, and the app just acted like a desktop app that "knew" about your flickr images, your youtube videos, it knew about your myspace friends, facebook friends, and automatically treated them as one logical database, one logical social graph? And someone started right into an app tutorial right off the bat with their contacts, files, and assets already referenced (but fully respecting privacy, control, etc)?

So the next question naturally becomes: that all sounds really great, but ... how do we get there?

In order to even think about "getting there", we need to map out and discuss a few things first. Ideally we need to at least sketch out a reasonable strategy for:
  1. Identifying some of the ways we can execute a strategy that satisfies the stated needs
  2. Identifying existing technologies that exhibit the properties we want
  3. Condensing a layered abstraction of the utility from the vapor of its required properties
  4. Defining which current technologies fit the properties of the current abstraction so that we don't reinvent the wheel
  5. Describing how a prototype of the proposed system might execute
  6. Defining the obstacles, opportunities, strengths, and weaknesses for a proposed system such as this

A Rough Sketch

So exactly what are we trying to construct here?

From a user's perspective we want this system to:

As developers we want a system that:

Hey, while we're dreaming big, let's go for it --- We might also want it to:

So exactly what are you saying here?

Basically, we want an API that allows us to view, query, and aggregate a user's data regardless of location (restriction: the data must be accessible through a webserver), in much the same way that we do with a local disk-based filesystem (at least in most ways). Say, how does a filesystem work, anyway? That sounds like a good place to start on our model!

That Same Ol Song

So we said we want to be able to:

our data? Huh. What other systems do these same things? Really, we want to do some things that have already been done quite well in computer science. So let's take a look at how they do it, and build a roadmap/model of how we might create an abstraction.

The Filesystem as a Metaphor

What are some interesting properties of a filesystem that are very applicable to our situation?

The Database as a Metaphor

I think really the aspects of a database that are interesting in this context are Protect, Relate, and Query.

DNS as a Metaphor

DNS allows us to take a domain name and translate it into an IP address. This is interesting from the standpoint of our need to resolve an openID-like token into a set of data container URIs for a given data type { social graph, images, videos, ? }

Abstract Art

So at this point we've talked about what we want, listed some properties of those requirements, and talked about existing technology that performs similar functions. It might be a good time to try and develop an abstraction of what we want so we can get a clearer understanding of how it might work.

Application Layer

At the top of most stacks or abstraction layers is the application layer. It is simply the endpoint where the request begins (most of the time) and where the result comes back to. (Example: a SQL query in a winform application is executed via an ADO.NET connection at the application layer, gets sent to the database, is checked for validity, translated into a query tree, and executed against tables; the database constructs and relates the data into a set of records, which is then returned to the application layer.)

With something like what we are proposing, our application layer could be a number of platforms, but a good example might be a flash application that wants to construct a slide show of all images for openID = 'joe@oid.floe.tv' regardless of whether they are stored on smugmug or flickr or wherever.

Query and Aggregation Layer

Processes data, relates data, aggregates data together, returns it back to the application layer.

Needs to be accessible either as an API or as a SQL-Like query language.

Discovery and Translation Layer

Knows how to find data based on an openID-type identifier. This layer might query the Identity provider for the location of the data-indexing service for this user, which for now we will call the "WebInode Server", and its interface might look like this (should return dummy stub data). This mechanism simply tells the discovery layer "hey, this user has data in X, Y, and Z data providers". These results are then used to query each data store for what data the user might have in each of them, and the results are returned to the parent layer for aggregation.
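To make the lookup a little more concrete, here is a minimal sketch of a "WebInode Server" that returns dummy stub data. Everything in it is an assumption made up for illustration: the function name getDataProviders, the record fields, and the example.com URIs are not part of any existing service.

```javascript
// Hypothetical in-memory stand-in for a "WebInode Server".
// Every name and URI here is an illustrative assumption, not a real service.
const DUMMY_INDEX = new Map([
  [
    "joe@oid.floe.tv",
    [
      { dataType: "image", uri: "https://images.example.com/users/joe" },
      { dataType: "image", uri: "https://photos.example.org/joe" },
      { dataType: "video", uri: "https://video.example.com/joe" },
    ],
  ],
]);

// Returns the data container URIs registered for an identifier,
// optionally narrowed to one data type. Unknown users get an empty list.
function getDataProviders(openId, dataType) {
  const entries = DUMMY_INDEX.get(openId) || [];
  return entries
    .filter((entry) => dataType === undefined || entry.dataType === dataType)
    .map((entry) => entry.uri);
}

console.log(getDataProviders("joe@oid.floe.tv", "image"));
```

The discovery layer would then visit each returned URI in turn. A real service would sit behind a web endpoint and check credentials before answering; both are left out here.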

Storage Layer

So now we've hit the bottom layer of our model, where the data itself is actually stored. This layer needs to represent a place where any type of data is stored, and it must be queryable about its contents for a specific userID. It should also protect the data, so as to protect both the user and the data provider, while allowing the proper parties, given the correct credentials, to access the actual data/file itself.
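As a rough sketch of those two duties (answer "what does this user have here?" and refuse callers without the right credentials), a storage container might behave something like the following. The class, its method names, and the token string are all hypothetical, and the string comparison only stands in for whatever real credential check a provider would use.

```javascript
// Hypothetical storage-layer container. Class, method, and token names are
// assumptions made for illustration; a real service would verify proper
// credentials rather than compare strings held in memory.
class StorageContainer {
  constructor() {
    this.filesByUser = new Map(); // userID -> array of file URIs
    this.grants = new Map(); // access token -> Set of userIDs it may read
  }

  // Records that a file belongs to a user.
  store(userId, fileUri) {
    if (!this.filesByUser.has(userId)) this.filesByUser.set(userId, []);
    this.filesByUser.get(userId).push(fileUri);
  }

  // Allows the holder of `token` to list the given user's files.
  grant(token, userId) {
    if (!this.grants.has(token)) this.grants.set(token, new Set());
    this.grants.get(token).add(userId);
  }

  // Lists a user's files, but only for a token that was granted access.
  listFor(userId, token) {
    const allowed = this.grants.get(token);
    if (!allowed || !allowed.has(userId)) {
      throw new Error("access denied for " + userId);
    }
    return (this.filesByUser.get(userId) || []).slice();
  }
}

const container = new StorageContainer();
container.store("joe@oid.floe.tv", "https://images.example.com/joe/1.jpg");
container.grant("token-for-floe", "joe@oid.floe.tv");
console.log(container.listFor("joe@oid.floe.tv", "token-for-floe"));
```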

Filling in the Blanks

So now we've got these nice layers to break things up, abstract away the details, and let us focus on managing the process. But really, if you leave those layers, well, abstract, then all they really do is end up in a research paper, and we are trying to make something work (sooner rather than later) here. So let's see what we have lying around that might work for our model.

For the time being, we'll call the system "WRFS" as in "Web Relational File System"

Application Layer

Pretty much everything { web-app, flash app, winform, shell script } can be an application, so we don't have to do much here.

Query and Aggregation Layer

To make things simple, let's say javascript is our api language, since this is aimed initially at the web2.0 world. Let's say that we might make a call like:

var oProxy = new WRFSProxy();
oProxy.GetDataFor( strOpenIDIdentifier, WRFS.DATA_TYPE_IMAGE, MyCallbackFunction );


Under the hood of that javascript call, this layer would use an HTTP-calling mechanism (depending on implementation). Let's say for the sake of this exercise that we are using openID to manage our identity, and we send off a quick webservice call to the domain in the openID identifier (along with any security tokens) to say "hey, where does this person store their data index?", and let's assume that at this point openID had a mechanism or field in place that pointed to that server with a URI.
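Here is one way the proxy call might be wired up, purely as a sketch. Only the names WRFSProxy, GetDataFor, and WRFS.DATA_TYPE_IMAGE come from the call shown earlier; the transport object, its two methods, the (err, result) callback shape, and the example URIs are assumptions, and the stub returns canned answers instead of making web calls.

```javascript
// Sketch only: the internals below are assumptions. Only the names WRFSProxy,
// GetDataFor and WRFS.DATA_TYPE_IMAGE come from the article's sample call.
const WRFS = { DATA_TYPE_IMAGE: "image", DATA_TYPE_VIDEO: "video" };

// Extracts the identity provider's domain from an identifier such as
// 'joe@oid.floe.tv'. Returns null when there is no domain part.
function identityDomain(openId) {
  const at = openId.lastIndexOf("@");
  return at === -1 || at === openId.length - 1 ? null : openId.slice(at + 1);
}

function WRFSProxy(transport) {
  // `transport` is a hypothetical object standing in for the HTTP layer:
  //   findIndexServer(domain, openId) -> URI of the user's data index
  //   listContainers(indexUri, openId, dataType) -> array of container URIs
  this.transport = transport;
}

WRFSProxy.prototype.GetDataFor = function (openId, dataType, callback) {
  const domain = identityDomain(openId);
  if (domain === null) {
    callback(new Error("identifier has no domain: " + openId), null);
    return;
  }
  // Step 1: ask the identity provider where the data index lives.
  const indexUri = this.transport.findIndexServer(domain, openId);
  // Step 2: ask the data index which containers hold this type of data.
  const containers = this.transport.listContainers(indexUri, openId, dataType);
  callback(null, containers);
};

// Stub transport returning canned answers instead of making web calls.
const stubTransport = {
  findIndexServer: (domain) => "https://" + domain + "/wrfs-index",
  listContainers: (indexUri, openId, dataType) =>
    dataType === WRFS.DATA_TYPE_IMAGE ? ["https://images.example.com/joe"] : [],
};

const oProxy = new WRFSProxy(stubTransport);
oProxy.GetDataFor("joe@oid.floe.tv", WRFS.DATA_TYPE_IMAGE, (err, uris) => {
  console.log(err ? err.message : uris);
});
```

Passing the transport in keeps the proxy testable without a network. A browser version would swap the stub for real asynchronous HTTP calls and invoke the callback once they complete.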

Discovery and Translation Layer

The system then calls a service that would return a list of indexed data storage URIs for a given openID identifier. This api might look like this. (Notice the name of the sample server? Inode.asmx. We make the allusion that the data indexing service is essentially the "inode" of the web file system.) The system then takes each URL/URI and queries that server for a list of relevant files, which are returned possibly as xml in the RDF dialect. There are some issues here, though. Is everyone going to let us come tromping through their front door just looking for random people's data? More than likely not, but maybe, just maybe, if we knock the right way, they might let us in.

Data Storage Layer

Now, we just established that we look at the data-indexing service as a sort of filesystem "inode", so, if we are storing the actual user's data on a webserver, what does that make it? A disk block, in a lot of ways. Just like with a normal filesystem in linux, a single jpeg image might be stored in 50 different disk blocks scattered around the hard drive, and a single inode points to all the disk blocks. Here, the data index service, or inode, points to all of the servers that a user has data in. And just like how the linux filesystem uses the call "bmap" to take an inode and find disk blocks, our "javascript system call" uses the data-indexing service to find the user's stored data "blocks".

But whoa now. Who can do what with whose data, and how? That's a very big issue that a lot of people are looking at, and we believe the emerging OAuth spec might just be the solution for this. OAuth does a lot of cool things, but in the end, it essentially says "yes, the application at floe.tv can mashup the image data at flickr for user X". That sounds very, very handy for what we would like to do here. What if each data storage service, like say a flickr, exposed an open, standard web API that allowed for automated OAuth mechanisms? (Generally, the way we understand it, a user currently has to actually be redirected to the data storage page, click "yes", etc., and then be redirected back to the 3rd party application site.)

So let's pretend for a moment that OAuth worked with our "Web Inode" and allowed for caching of its security tokens, and allowed us to quickly query the data storage layer via the js/webservice mechanism to get the list of relevant data stored on that particular web server for that particular person. This stub data is then returned to the "Discovery and Translation Layer" via { SOAP, REST, JSON } and then sent back up the stack to the "Query and Aggregation Layer" to be combined or aggregated into a single recordset or data structure { RDF, SIOC, XML } to be passed back to the Application Layer.

The one missing piece in this sequence is: how did the so-called "web inode" KNOW in the first place that there was data in flickr for openID X? To that I'd suggest simply adding a web-method like this to the "web inode" and then requiring each data storage container to register the fact that the user has data in their container. Just a simple, single web-method call the first time a user enters some data in their service.
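A registration method of that kind could be as small as the sketch below. The WebInode class, the registerContainer and lookup names, and the example URIs are hypothetical, and the sketch skips the part where the web inode verifies that the caller really is the storage service it claims to be.

```javascript
// Hypothetical "web inode" with a registration method. All names are
// assumptions for illustration; caller authentication is deliberately omitted.
class WebInode {
  constructor() {
    // openID identifier -> Map(dataType -> Set of container URIs)
    this.index = new Map();
  }

  // Called once by a storage service the first time a user stores data there.
  // Registering the same container twice has no additional effect.
  registerContainer(openId, dataType, containerUri) {
    if (!this.index.has(openId)) this.index.set(openId, new Map());
    const byType = this.index.get(openId);
    if (!byType.has(dataType)) byType.set(dataType, new Set());
    byType.get(dataType).add(containerUri);
  }

  // Answers "where does this user keep data of this type?"
  lookup(openId, dataType) {
    const byType = this.index.get(openId);
    if (!byType || !byType.has(dataType)) return [];
    return Array.from(byType.get(dataType));
  }
}

const inode = new WebInode();
inode.registerContainer("joe@oid.floe.tv", "image", "https://images.example.com/joe");
inode.registerContainer("joe@oid.floe.tv", "image", "https://images.example.com/joe");
inode.registerContainer("joe@oid.floe.tv", "social-graph", "https://social.example.net/joe");
console.log(inode.lookup("joe@oid.floe.tv", "image"));
```

Storing container URIs in a Set makes the call idempotent, so a storage service that registers on every upload instead of only the first one would do no harm.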

Example Use Case

Let's say we want to tackle another problem, say getting all friends of a user regardless of social network (hey! wait a minute! that sounds an awful lot like that whole Open Social Graph nonsense. You can't do that! But oh wait, we can). In our as3 code we might use an api call, or we might use some sort of WebSql call like

SELECT * FROM [Global Social Graph] WHERE [openID] = 'joe'

The Query and Aggregation Layer would break this query down into a query tree, and then pass on the query to the discovery layer. The discovery layer might find out that userID = 'joe' has social graph data in myspace, and then it sends a REST request to myspace.com (with cached OAuth tokens handling security and permissions), possibly with parameters, to find out which friends 'joe' has stored there, which might be returned as RDF data. This data would then be passed up to the Query and Aggregation layer, to be recombined with the data the process also got from ning.com, and presented to the Application layer as a unified recordset.
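The final recombination step might look something like this sketch. It assumes each provider's response has already been parsed into plain objects and, more importantly, that the same friend carries the same identifier on every network, which is a big assumption. The function name, record fields, and sample friends are made up for illustration.

```javascript
// Sketch of the aggregation step. Provider names match the example in the
// text; the record fields and friend identifiers are illustrative assumptions.
function aggregateFriends(resultsByProvider) {
  const merged = new Map(); // friend identifier -> unified record
  for (const [provider, friends] of Object.entries(resultsByProvider)) {
    for (const friend of friends) {
      if (!merged.has(friend.id)) {
        merged.set(friend.id, { id: friend.id, name: friend.name, sources: [] });
      }
      const record = merged.get(friend.id);
      // Remember every network this friend was found on, without repeats.
      if (!record.sources.includes(provider)) record.sources.push(provider);
    }
  }
  return Array.from(merged.values());
}

const recordset = aggregateFriends({
  "myspace.com": [
    { id: "ann@oid.example.com", name: "Ann" },
    { id: "bob@oid.example.com", name: "Bob" },
  ],
  "ning.com": [
    { id: "bob@oid.example.com", name: "Bob" },
    { id: "cat@oid.example.com", name: "Cat" },
  ],
});
console.log(recordset);
```

Keying the merge on the identifier is what turns two separate friend lists into one logical social graph: a friend found on both networks appears once, with both networks listed as sources.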

Color By Number

So really, if you think about it, most everything we need is either already in use, or sitting in a final spec stage (ok, we still need some standardization, and some minor extensions to openID and OAuth, but those aren't particularly monster obstacles). It would really only take a group of open-minded people that said "you know, that sounds cool, and we can be a part of a sum that is greater than its parts. Let's give that a shot!" and form up into a sort of "open data alliance" and get to work. But ...

Why would they? Where does that take us? What are the consequences?

No one is ever going to open their data up into some crazy united disk system. How do you make money? You mean we can't lock the user into our site? What? This is completely LUDICROUS. Simply crazy talk.

Markets and Equilibrium

The Funny Thing Is...

There are always pressures on any market, especially markets that are far from mature, and especially one like the internet and data storage.

Let's think about a few things first...

The Equilibrium of the Economics of Data Storage

Users will tend towards whoever gives them the best deal: they can and will move their data to a system that gives them the maximum value for it, and being able to inter-relate data makes data more valuable. In the end, since the commodity of hosting images, social graphs, and videos is easily set up, the "containers" hold no real power in the long run. To illustrate this, let's take a look at the convenience store business.

Myspace as the Gas-N-Go of the internet

Convenience stores make nearly no money off of selling gasoline, maybe a penny a gallon. However, they continue to sell gasoline. Why? Because it attracts motorists to their location, and provides them with the opportunity to sell milk, candy, bread, etc. Carrying the gas is essentially the overhead of marketing their location and attracting customers. The store owners are simply a third party to the gasoline transaction, yet are able to make a profit from the percentage of motorists who also drop into the store. Who holds the power? Ultimately the consumer does, since they can get gas elsewhere if the terms of the station do not meet their needs, although it could be argued that gas prices are controlled by the gas companies themselves, but I digress. So what does this have to do with the economics of storing, aggregating, and protecting data?

But in the end

It all comes down to the perception of value in the eyes of the customer, what they will bear in terms of restrictions, and what alternative choices they have.

I for one would like to give them a new choice and let the market decide.

Josh Patterson
(email: jpatterson @ [insert floe.tv here ] )
floe.tv