WRFS

by Josh Patterson and Josh Lewis
Abstract

We propose a stack of abstraction layers intended to facilitate data portability and open web data. Each layer abstracts functionality based on recurring computer science patterns for modeling a system. In this document we look at current practices for data on the web, set up an argument for open web data, explore scenarios where this model could come to fruition, and then consider the economics of open web data from a business standpoint.

Sections

When Can I Use That?

During development of any application, the team typically sits down with users for a testing session, asking questions and gauging responses to see how well the utility satisfies market demand, or --- does it "do the job"? While developing floe.tv, a media mashup system, I was demoing the beta to two videographers one day, showing off features and asking what was important to them. We came to the subject of data storage, local hard drives, and getting media online, and just as a thought exercise I asked, "Well, what if floe.tv just knew about all your online media by your login name, and referenced it automatically in your libraries the first time you logged in --- just as if it was an app installed locally on your hd?" Immediately both of them became excited, and one asked, "Can I do that right now? When can I use that?" I knew from experience that the market was speaking very loudly and clearly in my direction, and that I had better listen very closely.



The very next meeting I posed this question to our team:

What if our app was "inherently installed" in the internet? What if someone logged in, and the app just acted like a desktop app that "knew" about your flickr images, your youtube videos, it knew about your myspace friends, facebook friends, and automatically treated them as one logical database, one logical social graph? And someone started right into an app tutorial right off the bat with their contacts, files, and assets already referenced (but fully respecting privacy, control, etc)?

So the next question naturally becomes: that all sounds really great, but ... how do we get there?

In order to even think about "getting there", we need to map out and discuss a few things first. Ideally we need to at least sketch out a reasonable strategy for:

  1. Identifying ways to execute a strategy that satisfies the stated needs
  2. Identifying existing technologies that exhibit the properties we want
  3. Condensing a layered abstraction of the utility from the vapor of its required properties
  4. Defining which current technologies fit the properties of that abstraction so that we don't reinvent the wheel
  5. Describing how a prototype of the proposed system might execute
  6. Defining the obstacles, opportunities, strengths, and weaknesses of a proposed system such as this

A Rough Sketch

Signal in the Noise

For every idea, there are generally 1000 other ideas that preceded, influenced, or inspired it. The more we talked about integrating apis, sharing data, and figuring out just how to make it work, the more we realized that other people were having the same conversations. The Six Apart guys came up with openID in order to solve that pesky log-in issue, and then got onto data permissions with the coming OAuth spec. Brad Fitzpatrick and David Recordon laid out their thoughts on where the open social graph is, and where it needs to go. This writeup really got us thinking, and more importantly, talking to people. Brad and David are obviously smart guys, and their points quickly resonated with us. We digested their writeup and came away with the basic tenet being a problem that plagues the current crop of internet apps:

People are getting sick of registering and re-declaring their friends on every site, but also: Developing "Social Applications" is too much work.[1]

They go on to describe some properties of a potential implementation, which got us thinking: "Well, that's good stuff. But if we wanted to expand that to all web data to power our web app, how would we do that?" Several other people are looking into this concept, such as Chris Messina, Chris Saad, Jeremy Keith, David Recordon, Brian Oberkirch, Wired Magazine, The Social Network Portability Group, and The Microformats Group. We started reading their takes and tried to really digest what was floating around in the blogosphere. It didn't take long to figure out that we weren't alone, and people were more than ready to begin a dialog about how to make open web data happen, sooner rather than later.

So exactly what are we trying to construct here?

From a user's perspective we want this system to:

As developers we want a system that:

Hey, while we're dreaming big, let's go for it --- We might also want it to:

So exactly what are you saying here?

Basically, we want an api that allows us to view, query, and aggregate a user's data regardless of location (restriction: data must be accessible through a webserver), in much the same way that we do with a local disk-based filesystem. Say, how does a filesystem work, anyway? That sounds like a good place to start on our model!

That Same Ol Song

So we said we want to be able to:

Our data? Huh. What other systems do these same things?

So really we want to do some things that have already been done quite well in computer science. Let's take a look at how they do it, and build a roadmap/model of how we might create an abstraction.

The Filesystem as a Metaphor

What are some interesting properties of a filesystem that are very applicable to our situation?

What other filesystems are there? How are they similar?

| System | Resource | Aggregation Unit | Storage Units | Notes |
| --- | --- | --- | --- | --- |
| Linux FS | File Link | Inode | Disk Block | |
| Sun's Network File System | User File | [Inode] | [Data Block] | |
| DCE | User File | [Inode] | [Data Block] | |
| AFS | User File | [Inode] | [Data Block] | |
| Plan 9 | User File | [Inode] | [Data Block] | |
| SSHFS | User File | [Inode] | [Data Block] | |
| Elastic Drive | User File | [Inode] | [Data Block] | |
| GmailFS | User File | [Inode] | [Data Block] | |
| Google FS | Resource | Master Node | Chunk Server | GFS uses a simple design with a single master server for hosting the entire metadata and where the data is split into chunks and stored in chunkservers. |
| Amazon's Dynamo | Key/Value Resource | Master Node | Chunk Server | |
| HDFS | User File | Namenode + Datanode | [Data Block] | |
| WRFS | User Data | Web Inode | Data Provider | |

The Database as a Metaphor

I think really the aspects of a database that are interesting in this context are Protect, Relate, and Query.

DNS as a Metaphor

DNS allows us to take a domain name and translate it into an IP address. This is interesting from the standpoint of our need to resolve an openID-like token into a set of data container uris for a given data type { social graph, images, videos, ? }.

Abstract Art

So at this point we've talked about what we want, listed some properties of those requirements, and talked about existing technology that performs similar functions. It might be a good time to try and develop an abstraction of what we want so we can get a clearer understanding of how it might work.

Application Layer

At the top of most stacks or abstraction layers is the application layer. It is simply the endpoint where the request begins (most of the time) and where the result comes back to. (Example: a sql query in a winform application is executed via an ADO.NET connection at the application layer, gets sent to the database, is checked for validity, translated into a query tree, and executed against tables; the database constructs and relates the data into a set of records, which is then returned to the application layer.)

With something like what we are proposing, our application layer could be any number of platforms, but a good example might be a flash application that wants to construct a slide show of all images for openID = 'joe@oid.floe.tv' regardless of whether they are stored on smugmug or flickr or wherever.

Query and Aggregation Layer

Processes data, relates data, aggregates data together, returns it back to the application layer.

Needs to be accessible either as an API or as a SQL-Like query language.
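As a rough sketch of what "aggregates data together" could mean at this layer, the snippet below merges per-provider result lists into one recordset. The function name `aggregateResults`, the record fields (`uri`, `type`), and the example URLs are all assumptions made for illustration; none of them is a defined WRFS interface.

```javascript
// Hypothetical sketch: merge per-provider result lists into one recordset.
// The record fields (uri, type) and provider names are illustrative assumptions.
function aggregateResults(resultsByProvider) {
  const seen = new Set();
  const recordset = [];
  for (const [provider, records] of Object.entries(resultsByProvider)) {
    for (const record of records) {
      if (seen.has(record.uri)) continue; // skip an item already reported by another provider
      seen.add(record.uri);
      recordset.push({ ...record, provider }); // remember where each record came from
    }
  }
  return recordset;
}

const merged = aggregateResults({
  flickr: [{ uri: "http://flickr.example/joe/1.jpg", type: "image" }],
  smugmug: [
    { uri: "http://smugmug.example/joe/a.jpg", type: "image" },
    { uri: "http://flickr.example/joe/1.jpg", type: "image" }, // duplicate of the flickr item
  ],
});
console.log(merged.length); // 2
```

Tagging each record with its provider keeps the origin visible to the application layer even after the lists are combined.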

Discovery and Translation Layer

Knows how to find data based on an openID-type identifier. This layer might query the Identity provider for the location of the data-indexing service for this user, which for now we will call the "WebInode Server", and its interface might look like this (should return dummy stub data). This mechanism simply tells the discovery layer "hey, this user has data in X, Y, and Z data providers". These results are then used to query each data store for what data the user might have in each of them, and the results are returned to the parent layer for aggregation.
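A minimal sketch of such a stub, returning dummy data only, might look like the following. The function name `lookupDataProviders`, the record shape, and the container URIs are assumptions for illustration; the identifier is the example one used earlier in this document.

```javascript
// Hypothetical stub of a "WebInode Server" lookup. It returns dummy data only;
// the function name, record shape, and URIs are assumptions for illustration.
const STUB_INDEX = {
  "joe@oid.floe.tv": [
    { dataType: "image", containerUri: "http://flickr.example/api" },
    { dataType: "image", containerUri: "http://smugmug.example/api" },
    { dataType: "video", containerUri: "http://youtube.example/api" },
  ],
};

// Answers "which data providers hold data of this type for this user?"
function lookupDataProviders(openId, dataType) {
  const entries = STUB_INDEX[openId] || [];
  return entries
    .filter((entry) => entry.dataType === dataType)
    .map((entry) => entry.containerUri);
}

console.log(lookupDataProviders("joe@oid.floe.tv", "image"));
console.log(lookupDataProviders("nobody@oid.floe.tv", "image")); // unknown user: empty list
```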

Storage Layer

So now we've hit the bottom layer of our model, where the data itself is actually stored. This layer represents any place where data of any type is stored, and it must be able to answer queries about its contents for a specific userID. It should also protect the data, for the sake of both the user and the data provider, while still granting the proper parties access to the actual data/file when they present the correct credentials.
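To make those two duties concrete (answer "what does this user store here?" and refuse callers without credentials), here is a small sketch. The names `createStorageProvider`, `grantAccess`, and `listItems` are hypothetical, and the plain-string token only stands in for whatever credential mechanism a real provider would use.

```javascript
// Hypothetical storage-layer sketch: a provider that lists a user's items
// only for callers holding a token the provider has issued for that user.
// Names and the plain-string token are assumptions, not a real auth scheme.
function createStorageProvider(itemsByUser) {
  const grants = new Map(); // token -> userId the token is valid for

  return {
    grantAccess(userId, token) {
      grants.set(token, userId);
    },
    listItems(userId, token) {
      if (!grants.has(token) || grants.get(token) !== userId) {
        throw new Error("access denied");
      }
      return (itemsByUser[userId] || []).slice(); // copy so callers cannot mutate the store
    },
  };
}

const provider = createStorageProvider({ joe: ["beach.jpg", "dog.jpg"] });
provider.grantAccess("joe", "token-123");
console.log(provider.listItems("joe", "token-123"));
```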

Filling in the Blanks

So now we've got these nice layers to break things up, abstract away the details, and let us focus on managing the process. But really, if you leave those layers, well, abstract, then all they really do is end up in a research paper, and we are trying to make something work (sooner rather than later) here. So let's see what we have lying around that might work for our model.

For the time being, we'll call the system "WRFS" as in "Web Relational File System"

Application Layer

Pretty much everything { web-app, flash app, winform, shell script } can be an application, so we don't have to do much here.

Query and Aggregation Layer

To make things simple, let's say javascript is our api language, since this is aimed initially at the web2.0 world. Let's say that we might make a call like:

var oProxy = new WRFSProxy();
oProxy.GetDataFor( strOpenIDIdentifier, WRFS.DATA_TYPE_IMAGE, MyCallbackFunction );


Under the hood of that javascript call this layer would use a http-calling mechanism (depending on implementation) such as:

Let's say for the sake of this exercise that we are using openID to manage our identity, and we send off a quick webservice call to the domain in the openID identifier (along with any security tokens) to say "hey, where does this person store their data index?", and let's assume that at this point openID had a mechanism or field in place that pointed to that server with a URI.
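The two hops described here (ask the identity provider where the data index lives, then ask that index for the data) can be sketched without any real network calls. In the snippet below, `transport` stands in for whatever http-calling mechanism an implementation would use; passing it to the constructor, the endpoint URLs, and the response shapes are all assumptions for illustration, not part of openID or of any defined WRFS api.

```javascript
// Hypothetical sketch of the call flow behind GetDataFor. No real network
// calls are made: `transport` stands in for whatever HTTP mechanism is used,
// and every URL and response shape below is an assumption for illustration.
const WRFS = { DATA_TYPE_IMAGE: "image" };

function WRFSProxy(transport) {
  this.transport = transport;
}

WRFSProxy.prototype.GetDataFor = function (openId, dataType, callback) {
  const transport = this.transport;
  // The domain part of the identifier names the identity provider.
  const domain = openId.split("@")[1];
  // Step 1: ask the identity provider where this user's data index lives.
  transport("http://" + domain + "/index-location", { openId }, function (indexUri) {
    // Step 2: ask the data index for this user's records of the given type.
    transport(indexUri, { openId, dataType }, callback);
  });
};

// Stub transport that answers synchronously from canned responses.
function stubTransport(url, params, done) {
  if (url === "http://oid.floe.tv/index-location") {
    done("http://inode.example/lookup");
  } else if (url === "http://inode.example/lookup") {
    done(params.dataType === "image" ? ["http://flickr.example/joe/1.jpg"] : []);
  } else {
    done(null); // unknown endpoint in this stub
  }
}

const oProxy = new WRFSProxy(stubTransport);
let received;
oProxy.GetDataFor("joe@oid.floe.tv", WRFS.DATA_TYPE_IMAGE, function (data) {
  received = data;
  console.log(data);
});
```

Injecting the transport keeps the sketch testable offline; a real proxy would swap in an asynchronous HTTP call and carry the security tokens along with each request.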

Discovery and Translation Layer

The system then calls a service that would return a list of indexed data storage URIs for a given openID identifier. This api might look like this. (Notice the name of the sample server? Inode.asmx --- we make the allusion that the data indexing service is essentially the "inode" of the web file system.) The system then takes each URL/URI and queries that server for a list of relevant files, which are returned possibly as xml in the RDF dialect. There are some issues here, though. Is everyone going to let us come tromping through their front door just looking for random people's data? More than likely not, but maybe, just maybe, if we knock the right way, they might let us in.

Data Storage Layer

Now, we just established that we look at the data-indexing service as a sort of filesystem "inode", so, if we are storing the actual user's data on a webserver, what does that make it? A disk block, in a lot of ways. Just like with a normal filesystem in Linux, a single jpeg image might be stored in 50 different disk blocks scattered around the hard drive, and a single inode points to all the disk blocks. Here, the data index service, or inode, points to all of the servers that a user has data in. And just as the Linux filesystem uses the call "bmap" to take an inode and find disk blocks, our "javascript system call" uses the data-indexing service to find the user's stored data "blocks".



But whoa now --- who can do what with whose data, and how? That's a very big issue that a lot of people are looking at, and we believe the emerging OAuth spec might just be the solution for this. OAuth does a lot of cool things, but in the end, it essentially says "yes, the application at floe.tv can mashup the image data at flickr for user X". That sounds very, very handy for what we would like to do here. What if each data storage service, like say a flickr, exposed an open, standard web API that allowed for automated OAuth mechanisms? (As we understand it, a user currently has to be redirected to the data storage page, click "yes", etc., and then be redirected back to the 3rd party application site.)

So let's pretend for a moment that OAuth worked with our "Web Inode" and allowed for caching of its security tokens, and allowed us to quickly query the data storage layer via the js/webservice mechanism to get the list of relevant data stored on that particular web server for that particular person. This stub data is then returned to the "Discovery and Translation Layer" via { SOAP, REST, JSON } and then sent back up the stack to the "Query and Aggregation Layer" to be combined or aggregated into a single recordset or data structure { RDF, SIOC, XML } to be passed back to the Application Layer.

The one missing piece in this sequence is: how did the so-called "web inode" KNOW in the first place that there was data in flickr for openID X? To that I'd suggest simply adding a web-method like this to the "web inode" and then requiring each data storage container to register the fact that the user has data in their container. Just a simple, single web-method call the first time a user enters some data in their service.
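Here is one way the registration call and the token cache could fit together, as a sketch only. The class and method names (`WebInode`, `registerContainer`, `findContainers`, `cacheToken`, `getToken`) are hypothetical, the tokens are plain strings, and no part of a real OAuth flow is modeled.

```javascript
// Hypothetical "web inode" sketch with the registration method described
// above plus a simple cache for per-container access tokens. Method names,
// token strings, and URIs are illustrative; real OAuth flows are not modeled.
class WebInode {
  constructor() {
    this.containers = new Map(); // openId -> Set of "dataType|containerUri"
    this.tokens = new Map();     // "openId|containerUri" -> cached token
  }

  // Called by a data storage container the first time a user stores data there.
  registerContainer(openId, dataType, containerUri) {
    if (!this.containers.has(openId)) this.containers.set(openId, new Set());
    this.containers.get(openId).add(dataType + "|" + containerUri); // Set makes repeat calls harmless
  }

  // Lists the containers known to hold this user's data of the given type.
  findContainers(openId, dataType) {
    const entries = this.containers.get(openId) || new Set();
    return [...entries]
      .filter((entry) => entry.startsWith(dataType + "|"))
      .map((entry) => entry.slice(dataType.length + 1));
  }

  cacheToken(openId, containerUri, token) {
    this.tokens.set(openId + "|" + containerUri, token);
  }

  getToken(openId, containerUri) {
    return this.tokens.get(openId + "|" + containerUri); // undefined if nothing was cached
  }
}

const inode = new WebInode();
inode.registerContainer("joe@oid.floe.tv", "image", "http://flickr.example/api");
inode.registerContainer("joe@oid.floe.tv", "image", "http://flickr.example/api"); // repeat call
inode.cacheToken("joe@oid.floe.tv", "http://flickr.example/api", "token-abc");
console.log(inode.findContainers("joe@oid.floe.tv", "image"));
```

Storing registrations in a set means a container can safely call the method more than once for the same user without creating duplicate entries.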

Example Use Case

Let's say we want to tackle another problem, like getting all friends of a user regardless of social network (hey! wait a minute! that sounds an awful lot like that whole Open Social Graph nonsense. You can't do that! --- but oh wait, we can). In our as3 code we might use an api call, or we might use some sort of WebSql call like



SELECT * FROM [Global Social Graph] WHERE [openID] = 'joe'

The Query and Aggregation Layer would break this query down into a query tree, and then pass on the query to the discovery layer. The discovery layer might find out that userID = 'joe' has social graph data in myspace, and then it sends a REST request to myspace.com (with cached OAuth tokens handling security and permissions), possibly with parameters, to find out which friends 'joe' has stored there, which might be returned as RDF data. This data would then be passed up to the Query and Aggregation layer, to be recombined with the data the process also got from ning.com, and presented to the Application layer as a unified recordset.
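The flow above can be sketched end to end with stubbed providers. The snippet below handles only the single query form shown in the text, and its function names, canned friend lists, and the idea that a friend's name is comparable across networks are assumptions for illustration; real responses would arrive over REST with OAuth tokens.

```javascript
// Hypothetical sketch: parse the WebSql-like statement into a tiny query
// description, fan it out to stubbed providers, and merge the answers.
// The grammar handled here is only the single form shown in the text.
function parseQuery(sql) {
  const match = /^SELECT \* FROM \[(.+?)\] WHERE \[openID\] = '(.+?)'$/.exec(sql.trim());
  if (!match) throw new Error("unsupported query");
  return { source: match[1], openId: match[2] };
}

// Stubbed provider responses; real ones would arrive over REST with OAuth tokens.
const STUB_FRIENDS = {
  "myspace.com": { joe: ["alice", "bob"] },
  "ning.com": { joe: ["bob", "carol"] },
};

function runQuery(sql) {
  const query = parseQuery(sql);
  if (query.source !== "Global Social Graph") throw new Error("unknown source");
  const friends = new Set();
  for (const provider of Object.keys(STUB_FRIENDS)) {
    for (const friend of STUB_FRIENDS[provider][query.openId] || []) {
      friends.add(friend); // the Set collapses friends listed on both networks
    }
  }
  return [...friends].sort();
}

console.log(runQuery("SELECT * FROM [Global Social Graph] WHERE [openID] = 'joe'"));
```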

Color By Number

So really, if you think about it, most everything we need is either already in use, or sitting in a final spec stage (ok, we still need some standardization, and some minor extensions to openID and OAuth, but those aren't particularly monster obstacles). It would really only take a group of open-minded people that said "you know, that sounds cool, and we can be a part of a sum that is greater than its parts. Let's give that a shot!" and form up into a sort of "open data alliance" and get to work. But ...

Why would they? Where does that take us? What are the consequences?

No one is ever going to open their data up into some crazy united disk system. How do you make money? You mean we can't lock the user into our site? What? This is completely LUDICROUS. Simply crazy talk.

Mi Casa Su Casa

Stone Soup

From Wikipedia:

According to the story, some travelers come to a village, carrying nothing more than an empty pot. Upon their arrival, the villagers are unwilling to share any of their food stores with the hungry travelers. The travelers fill the pot with water, drop a large stone in it, and place it over a fire in the village square. One of the villagers becomes curious and asks what they are doing. The travelers answer that they are making "stone soup", which tastes wonderful, although it still needs a little bit of garnish to improve the flavor, which they are missing. The villager doesn't mind parting with just a little bit to help them out, so it gets added to the soup. Another villager walks by, inquiring about the pot, and the travelers again mention their stone soup which hasn't reached its full potential yet. The villager hands them a little bit of seasoning to help them out. More and more villagers walk by, each adding another ingredient. Finally, a delicious and nourishing pot of soup is enjoyed by all.

Really, in a lot of ways, our band of startups is just the hungry travelers. Alone, we have a long road ahead to compete with the high traffic sites. We need traffic to survive, and the evolution of the web is highly heterogeneous in nature in terms of mixtures of purpose and data. In other words, traffic is our food, and right now a very attractive option for all of us is to share data and promote one another's sites so that we become more than the sum of our parts.

In effect, we are making stone soup, and WRFS is the pot.

Markets and Equilibrium

The Funny Thing Is...

There are always pressures on any market, especially markets that are far from mature, and especially one like the internet and data storage.



Let's think about a few things first...

Brian Oberkirch states it well in his outline:

The keystone thought undergirding this design approach is that your data is yours. Your identity is yours. Web services are blessed to get your attention and they shouldn’t anticipate owning, hoarding or slow walking access to our data and contact lists.[2]

The Equilibrium of the Economics of Data Storage

Users will tend towards whoever gives them the best deal: they can and will move their data to a system that gives them the maximum value for it, and being able to inter-relate data makes data more valuable. In the end, since the commodity of hosting images, social graphs, and videos is easily set up, the "containers" hold no real power in the long run. To illustrate this, let's take a look at the convenience store business.

Myspace as the Gas-N-Go of the internet

Convenience stores make nearly no money off of selling gasoline, maybe a penny a gallon. However, they continue to sell gasoline --- why? Because it attracts motorists to their location, and provides them with the opportunity to sell milk, candy, bread, etc. Carrying the gas is essentially the overhead of marketing their location and attracting customers. The store owners are simply a third party to the gasoline transaction, yet are able to make a profit from the percentage of motorists who also drop into the store. Who holds the power? Ultimately the consumer does, since they can get gas elsewhere if the terms of the station do not meet their needs, although it could be argued that gas prices are controlled by the gas companies themselves, but I digress. So what does this have to do with the economics of storing, aggregating, and protecting data?

But in the end

It all comes down to the perception of value in the eyes of the customer, what they will bear in terms of restrictions, and what alternative choices they have.

I for one would like to give them a new choice and let the market decide.

Josh Patterson
(email: jpatterson @ [insert floe.tv here ] )
floe.tv

References

  1. http://bradfitz.com/social-graph-problem/
  2. http://www.brianoberkirch.com/2007/08/02/designing-portable-social-networks/