by Josh Patterson and Josh Lewis
We are proposing a stack of abstraction layers intended to facilitate data portability, or open web data. Each layer abstracts a piece of functionality, based on recurring computer science patterns for modeling a system. In this document we look at current practices for data on the web, set up an argument for open web data, explore scenarios where this model could come to fruition, and then consider the economics of open web data from a business standpoint.
During development of any application, the team typically sits down with users for a testing session, asking questions and gauging responses to see how well the product satisfies market demand. In other words, does it "do the job"? While developing floe.tv, a media mashup system, I was demoing the beta to two videographers one day, showing off features and asking what was important to them. We came to the subject of data storage, local hard drives, and getting media online, and just as a thought exercise I asked, "Well, what if floe.tv just knew about all your online media by your login name, and referenced it automatically in your libraries the first time you logged in, just as if it were an app installed locally on your hard drive?" Both of them immediately became excited, and one asked, "Can I do that right now? When can I use that?" I knew from experience that the market was speaking very loudly and clearly in my direction, and that I had better listen closely.
So the next question naturally becomes: that all sounds really great, but how do we get there?
In order to even think about "getting there", we need to map out and discuss a few things first. Ideally, we need to at least sketch out a reasonable strategy for:
Basically, we want an API that allows us to view, query, and aggregate a user's data regardless of location (restriction: the data must be accessible through a web server), in much the same way that we do with a local disk-based filesystem. Say, how does a filesystem work, anyway? That sounds like a good place to start on our model!
Our data? Huh. What other systems do these same things?
So really, we want to do some things that have already been done quite well in computer science. Let's take a look at how those systems do it, and build a roadmap/model of how we might create an abstraction.
What are some interesting properties of a filesystem that are very applicable to our situation?
We have to be able to store our data in whatever container we want on the internet, and we need to be able to aggregate that data back together again, right? Well, what if we said:
| System | Resource | Aggregation Unit | Storage Unit | Notes |
| --- | --- | --- | --- | --- |
| Linux FS | File Link | Inode | Disk Block | |
| Sun's Network File System | User File | [Inode] | [Data Block] | |
| DCE | User File | [Inode] | [Data Block] | |
| AFS | User File | [Inode] | [Data Block] | |
| Plan 9 | User File | [Inode] | [Data Block] | |
| SSHFS | User File | [Inode] | [Data Block] | |
| Elastic Drive | User File | [Inode] | [Data Block] | |
| GmailFS | User File | [Inode] | [Data Block] | |
| Google FS | Resource | Master Node | Chunk Server | GFS uses a simple design with a single master server for hosting the entire metadata and where the data is split into chunks and stored in chunkservers. |
| Amazon's Dynamo | Key/Value Resource | Consistent-hash ring (no master node) | Storage Node | |
| HDFS | User File | Namenode + Datanode | [Data Block] | |
| WRFS | User Data | Web Inode | Data Provider | |
I think the aspects of a database that are really interesting in this context are Protect, Relate, and Query.
DNS allows us to take a domain name and translate it into an IP address. This is interesting because we need to resolve an openID-like token into a set of data container URIs for a given data type { social graph, images, videos, ? }.
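To make the DNS comparison concrete, here is a minimal sketch of that kind of lookup. It assumes a simple in-memory record table; the function name `resolveContainers`, the data type keys, and every URI are hypothetical placeholders, not part of any existing openID mechanism.

```javascript
// Hypothetical record table, playing the role of a DNS zone file.
// Every identifier and URI here is a made-up placeholder.
const containerRecords = new Map([
  [
    "joe@oid.floe.tv",
    new Map([
      ["images", ["https://images.example.com/joe", "https://photos.example.org/joe"]],
      ["videos", ["https://video.example.net/joe"]],
    ]),
  ],
]);

// Resolve (identifier, data type) to container URIs, much as a DNS
// resolver maps (name, record type) to addresses.
function resolveContainers(identifier, dataType) {
  const byType = containerRecords.get(identifier);
  const uris = byType ? byType.get(dataType) : undefined;
  // Copy the array so callers cannot mutate the record table.
  return uris ? uris.slice() : [];
}

console.log(resolveContainers("joe@oid.floe.tv", "images"));
console.log(resolveContainers("joe@oid.floe.tv", "social graph")); // no record, so an empty list
```

An unknown identifier or data type simply resolves to an empty list, which keeps the caller's aggregation logic free of special cases.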
So at this point we've talked about what we want, listed some properties of those requirements, and talked about existing technology that performs similar functions. It might be a good time to try to develop an abstraction of what we want, so we can get a clearer understanding of how it might work.
At the top of most stacks or abstraction layers is the application layer. It is simply the endpoint where the request begins (most of the time) and where the result comes back to. (Example: a SQL query in a WinForms application is executed via an ADO.NET connection at the application layer, gets sent to the database, is checked for validity, translated into a query tree, and executed against tables; the database constructs and relates the data into a set of records, which are then returned to the application layer.)
With something like what we are proposing, our application layer could be any number of platforms, but a good example might be a Flash application that wants to construct a slide show of all images for openID = 'joe@oid.floe.tv', regardless of whether they are stored on SmugMug or Flickr or wherever.
The Query and Aggregation Layer processes, relates, and aggregates data, then returns it to the application layer. It needs to be accessible either as an API or through a SQL-like query language.
The Discovery and Translation Layer knows how to find data based on an openID-type identifier. It might query the identity provider for the location of this user's data-indexing service, which for now we will call the "WebInode Server", and its interface might look like this (should return dummy stub data). This mechanism simply tells the discovery layer, "Hey, this user has data in data providers X, Y, and Z." Those results are then used to query each data store for the data the user has there, and the results are returned to the parent layer for aggregation.
So now we've hit the bottom layer of our model, where the data itself is actually stored. This layer represents a place where any type of data is stored, and it must be able to answer queries about its contents for a specific userID. It should also protect the data, for the sake of both the user and the data provider, while giving the proper parties access to the actual data/file when they present the correct credentials.
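As a rough illustration of those two requirements (answer "what does this user have here?" and refuse callers without credentials), a storage-layer stub might look like the sketch below. The `DataProvider` class, its method names, and the bare token check are all hypothetical; the token check is only a stand-in for a real authorization scheme such as OAuth.

```javascript
// Hypothetical data provider: keeps items per user and checks an access
// token before listing them. The token handling is a stand-in, not OAuth.
class DataProvider {
  constructor(name) {
    this.name = name;
    this.itemsByUser = new Map(); // userId -> [{ uri, type }]
    this.grants = new Map(); // token -> userId that the token may read
  }

  // Record that a user has stored an item with this provider.
  store(userId, item) {
    const items = this.itemsByUser.get(userId) || [];
    items.push(item);
    this.itemsByUser.set(userId, items);
  }

  // Allow the holder of `token` to read `userId`'s data.
  grant(token, userId) {
    this.grants.set(token, userId);
  }

  // List one user's items of a given type, but only for a valid token.
  list(userId, dataType, token) {
    if (!this.grants.has(token) || this.grants.get(token) !== userId) {
      throw new Error("access denied for " + userId);
    }
    const items = this.itemsByUser.get(userId) || [];
    return items.filter((item) => item.type === dataType);
  }
}

const provider = new DataProvider("photos.example.com");
provider.store("joe", { uri: "https://photos.example.com/joe/1.jpg", type: "image" });
provider.store("joe", { uri: "https://photos.example.com/joe/clip.mp4", type: "video" });
provider.grant("token-abc", "joe");

console.log(provider.list("joe", "image", "token-abc"));
```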
So now we've got these nice layers to break things up, abstract away the details, and let us focus on managing the process. But really, if you leave those layers, well, abstract, then all they really do is end up in a research paper, and we are trying to make something work (sooner rather than later) here. So let's see what we have lying around that might work for our model.
For the time being, we'll call the system "WRFS" as in "Web Relational File System"
Pretty much everything { web-app, flash app, winform, shell script } can be an application, so we don't have to do much here.
To make things simple, let's say JavaScript is our API language, since this is aimed initially at the Web 2.0 world. Let's say that we might make a call like:
var oProxy = new WRFSProxy();
oProxy.GetDataFor( strOpenIDIdentifier, WRFS.DATA_TYPE_IMAGE, MyCallbackFunction );
Under the hood of that JavaScript call, this layer would use an HTTP-calling mechanism (depending on implementation) such as:
Let's say for the sake of this exercise that we are using openID to manage our identity, and we send off a quick webservice call to the domain in the openID identifier (along with any security tokens) to say "hey, where does this person store their data index?", and let's assume that at this point openID had a mechanism or field in place that pointed to that server with a URI.
The system then calls a service that returns a list of indexed data storage URIs for a given openID identifier. This API might look like this. (Notice the name of the sample server? Inode.asmx. We are alluding to the idea that the data indexing service is essentially the "inode" of the web file system.) The system then takes each URL/URI and queries that server for a list of relevant files, which are returned, possibly as XML in the RDF dialect. There are some issues here, though. Is everyone going to let us come tromping through their front door just looking for random people's data? More than likely not, but maybe, just maybe, if we knock the right way, they might let us in.
Now, we just established that we look at the data-indexing service as a sort of filesystem "inode". So if we are storing the actual user's data on a web server, what does that make it? A disk block, in a lot of ways. Just like with a normal filesystem in Linux, a single JPEG image might be stored in 50 different disk blocks scattered around the hard drive, and a single inode points to all the disk blocks. Here, the data index service, or inode, points to all of the servers that a user has data in. And just like how the Linux filesystem uses the call "bmap" to take an inode and find disk blocks, our "JavaScript system call" uses the data-indexing service to find the user's stored data "blocks".
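The inode-to-block walk described above can be sketched in a few lines. This is only an illustration with in-memory stubs standing in for the web services: `webInodes`, `stubProviders`, and `mapUserData` are hypothetical names, and a real version would make HTTP calls instead of local function calls.

```javascript
// Hypothetical web inode table: one entry per user, pointing at the
// providers that hold that user's data (like an inode's block pointers).
const webInodes = new Map([
  ["joe@oid.floe.tv", ["providerA", "providerB"]],
]);

// Hypothetical stub providers; each returns the "blocks" it holds for a user.
const stubProviders = new Map([
  [
    "providerA",
    (id) => (id === "joe@oid.floe.tv" ? ["https://a.example.com/joe/1.jpg"] : []),
  ],
  [
    "providerB",
    (id) =>
      id === "joe@oid.floe.tv"
        ? ["https://b.example.com/joe/2.jpg", "https://b.example.com/joe/3.jpg"]
        : [],
  ],
]);

// Rough analogue of bmap: follow each pointer in the inode and collect
// the data "blocks" stored behind it.
function mapUserData(identifier) {
  const pointers = webInodes.get(identifier) || [];
  return pointers.flatMap((name) => {
    const listBlocks = stubProviders.get(name);
    // Skip pointers to providers we do not know how to reach.
    return listBlocks ? listBlocks(identifier) : [];
  });
}

console.log(mapUserData("joe@oid.floe.tv"));
```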
But whoa now: who can do what with whose data, and how? That's a very big issue that a lot of people are looking at, and we believe the emerging OAuth spec might just be the solution. OAuth does a lot of cool things, but in the end it essentially says, "Yes, the application at floe.tv can mash up the image data at Flickr for user X." That sounds very, very handy for what we would like to do here.
What if each data storage service, say a Flickr, exposed an open, standard web API that allowed for automated OAuth mechanisms? (As we understand it, a user currently has to be redirected to the data storage site, click "yes", and then be redirected back to the third-party application's site.) So let's pretend for a moment that OAuth worked with our "Web Inode" and allowed for caching of its security tokens, letting us quickly query the data storage layer via the JS/web service mechanism for the list of relevant data stored on that particular web server for that particular person. This stub data is then returned to the "Discovery and Translation Layer" via { SOAP, REST, JSON }, and then sent back up the stack to the "Query and Aggregation Layer" to be combined or aggregated into a single recordset or data structure { RDF, SIOC, XML } and passed back to the Application Layer.
The one missing piece in this sequence is: how did the so-called "web inode" KNOW in the first place that there was data in Flickr for openID X? To that I'd suggest simply adding a web-method like this to the "web inode", and then requiring each data storage container to register the fact that the user has data in it. Just a simple, single web-method call the first time a user enters some data in their service.
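One way the registration idea could look, as a sketch only: the `WebInode` class and its `registerContainer` and `listContainers` methods are hypothetical names invented for illustration, and the index lives in memory rather than behind a web service.

```javascript
// Hypothetical web inode index: identifier -> Set of container URIs.
class WebInode {
  constructor() {
    this.containers = new Map();
  }

  // Called by a storage service the first time a user stores data there.
  // Using a Set makes repeated registrations of the same container harmless.
  registerContainer(identifier, containerUri) {
    if (!this.containers.has(identifier)) {
      this.containers.set(identifier, new Set());
    }
    this.containers.get(identifier).add(containerUri);
  }

  // Called by the discovery layer to learn where a user has data.
  listContainers(identifier) {
    return [...(this.containers.get(identifier) || [])];
  }
}

const inode = new WebInode();
inode.registerContainer("joe@oid.floe.tv", "https://photos.example.com/api");
inode.registerContainer("joe@oid.floe.tv", "https://photos.example.com/api"); // duplicate, ignored
inode.registerContainer("joe@oid.floe.tv", "https://video.example.net/api");

console.log(inode.listContainers("joe@oid.floe.tv"));
```

Making registration idempotent means a storage service does not need to remember whether it has already announced a user; it can simply call again.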
Let's say we want to tackle another problem, like getting all friends of a user regardless of social network. (Hey, wait a minute! That sounds an awful lot like that whole Open Social Graph nonsense. You can't do that! But oh wait, we can.) In our AS3 code we might use an API call, or we might use some sort of WebSql call like
The Query and Aggregation Layer would break this query down into a query tree, and then pass on the query to the discovery layer. The discovery layer might find out that userID = 'joe' has social graph data in myspace, and then it sends a REST request to myspace.com (with cached OAuth tokens handling security and permissions), possibly with parameters, to find out which friends 'joe' has stored there, which might be returned as RDF data. This data would then be passed up to the Query and Aggregation layer, to be recombined with the data the process also got from ning.com, and presented to the Application layer as a unified recordset.
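The recombination step at the end of that flow might be sketched as follows. The sample results are invented, `aggregateFriends` is a hypothetical name, and the sketch assumes the same friend carries the same `friendId` on every network, which real social networks would not guarantee without some identity mapping.

```javascript
// Hypothetical per-network results, as the discovery layer might hand them up.
const resultsByNetwork = {
  "myspace.com": [
    { friendId: "alice", name: "Alice" },
    { friendId: "bob", name: "Bob" },
  ],
  "ning.com": [
    { friendId: "bob", name: "Bob" },
    { friendId: "carol", name: "Carol" },
  ],
};

// Merge the per-network lists into one recordset, keeping a single record
// per friend and noting which networks that friend came from.
function aggregateFriends(results) {
  const merged = new Map();
  for (const [network, friends] of Object.entries(results)) {
    for (const friend of friends) {
      // Assumption: friendId identifies the same person across networks.
      const record = merged.get(friend.friendId) || { ...friend, sources: [] };
      record.sources.push(network);
      merged.set(friend.friendId, record);
    }
  }
  return [...merged.values()];
}

console.log(aggregateFriends(resultsByNetwork));
```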
So really, if you think about it, most everything we need is either already in use, or sitting in a final spec stage (ok, we still need some standardization, and some minor extensions to openID and OAuth, but those aren't particularly monster obstacles). It would really only take a group of open-minded people that said "you know, that sounds cool, and we can be a part of a sum that is greater than its parts. Let's give that a shot!" and form up into a sort of "open data alliance" and get to work. But ...
Why would they? Where does that take us? What are the consequences?
No one is ever going to open their data up into some crazy united disk system. How do you make money? You mean we can't lock the user into our site? What? This is completely LUDICROUS. Simply crazy talk.
From Wikipedia:
According to the story, some travelers come to a village, carrying nothing more than an empty pot. Upon their arrival, the villagers are unwilling to share any of their food stores with the hungry travelers. The travelers fill the pot with water, drop a large stone in it, and place it over a fire in the village square. One of the villagers becomes curious and asks what they are doing. The travelers answer that they are making "stone soup", which tastes wonderful, although it still needs a little bit of garnish to improve the flavor, which they are missing. The villager doesn't mind parting with just a little bit to help them out, so it gets added to the soup. Another villager walks by, inquiring about the pot, and the travelers again mention their stone soup which hasn't reached its full potential yet. The villager hands them a little bit of seasoning to help them out. More and more villagers walk by, each adding another ingredient. Finally, a delicious and nourishing pot of soup is enjoyed by all.
Really, in a lot of ways, our band of startups is just like the hungry travelers. Alone, we have a long road ahead of us to compete with the high-traffic sites. We need traffic to survive, and the evolving web is highly heterogeneous in its mix of purposes and data. In other words, traffic is our food, and right now a very attractive option for all of us is to share data and promote one another's sites so that we become more than the sum of our parts.
In effect, we are making stone soup, and WRFS is the pot.
There are always pressures on any market, especially markets that are far from mature, and especially one like the internet and data storage.
Let's think about a few things first...
Users will gravitate toward whoever gives them the best deal: they can and will move their data to a system that gives them the maximum value for it, and being able to inter-relate data makes that data more valuable. In the end, since hosting images, social graphs, and videos is a commodity that is easily set up, the "containers" hold no real power in the long run. To illustrate this, let's take a look at the convenience store business.
Convenience stores make nearly no money off of selling gasoline, maybe a penny a gallon. However, they continue to sell gasoline. Why? Because it attracts motorists to their location, and gives them the opportunity to sell milk, candy, bread, etc. Carrying the gas is essentially the overhead of marketing their location and attracting customers. The store owners are simply a third party to the gasoline transaction, yet they are able to make a profit from the percentage of motorists who also drop into the store. Who holds the power? Ultimately the consumer does, since they can get gas elsewhere if the terms of the station do not meet their needs, although it could be argued that gas prices are controlled by the gas companies themselves, but I digress. So what does this have to do with the economics of storing, aggregating, and protecting data?
It all comes down to the perception of value in the eyes of the customer, what they will bear in terms of restrictions, and what alternative choices they have.
I for one would like to give them a new choice and let the market decide.
Josh Patterson
(email: jpatterson @ [insert floe.tv here ] )
floe.tv