Thursday, September 03, 2009

Linked Data vs Owned Data - RDF thoughts

Finally attempting to get my head round RDF and Linked Data ("Semantic Web" is another of those terms - like Web 2.0 - I'm going to try to avoid using but end up using far too often) in the footsteps of Tim Berners-Lee coming on board with the UK government, giving TED talks and pushing the idea forwards.

I've been fairly sceptical about it all so far, but mostly from a layman's (LAME-man's) point of view - nobody in my networks is really doing anything with RDF seriously - at least not as far as it's been visible to me. Maybe there's some great stuff being done right under my nose, but the RDF is hidden. A lot of the "hacker" hubbub has been around JSON, Google, Yahoo, even CSV files, as alternatives to the current way data is usually handed out (e.g., PDF or Excel files). In other words, there's been no reason to clear out some time to start looking at all this so far.

But potential it has, questions it raises, and thoughts it doth provoke. Of all of these, it is the most basic I guess I have first: Can it fly in the real world?

HTML and HTTP grew up in very different circumstances to the world we have now. The Internet was not established. Communicating digitally was not established. The people interested in doing these things were in a very particular place.

Of course, there are also some similarities to now. In his TED talk (link above), TBL notes that the idea for the WWW came about because software vendors all had their own systems, their own structures and their own formats for accessing documents. Even in those days, competition abounded, and I think the same thing can be said of data these days - where data gets stored is very proprietary - whether the system is paid for, or free like Google spreadsheets, say.

But in innovation, shifting the incumbent is tricky at the best of times. We have digital conversation, and now the encoding of relationships, which people like - and the push now is to get past that, towards something almost less personal, something more akin to a public good. Or at least that's how it feels - the separation of data from the reputation of the source, perhaps. The paradigm of data for data's sake, rather than data that is owned by someone. That is the cultural - not technological - shift that I think Linked Data is relying on.

Half Hidden Gate. by pdeee454, on Flickr
There are some really interesting hat-tips in this direction: the Open Knowledge Foundation, for example, or the way in which ManyEyes forces you to keep submitted data open to anyone else. But are these just experimental outcrops, or can they somehow be turned into a more ubiquitous paradigm that people are proud to take on board?

Sadly, I think the prevailing force of data ownership is far too large for these attempts to dislodge it - in their current form, at least. Much more useful data is probably being created through other efforts such as Creative Commons licensing on Flickr and OpenStreetMap - data with a purpose, but that people are happy to relinquish "ownership" over - most likely because they don't stand to gain from it anyway.

Forcing governments to open up linked data is a precarious, knife-edged scheme. It could work. I really want it to work, but there are some basic psychological attitudes that are needed first - to detach data from the idea that it is something to be owned. Under a climate of competition and economy this is, IMHO, a long way off.

So what will happen instead? What should happen? The latter is a dangerous question. But if we take the crowd-sourced, low-investment approach of Flickr and OpenStreetMap, then chances are we'll see the creation of some very big, and possibly very personal, libraries of data. The real winners will be the ones who, like Facebook and Yahoo, can not just get the interfaces to these libraries right, but also come out publicly with some very strong models of permissions.

What do I mean by this? Take, for example, the approach that Yahoo take to authentication with Flickr and Fire Eagle - security is broken down into a number of tasks and specificities (such as "add images" or "see my location down to GPS/postcode/area level") and assigned to applications, each of which must ask permission for every task it wants. The data producer (not owner...) has control over this from the beginning, setting a strong precedent for applications to stick to these rules if they want to survive. As seen with some Facebook widgets, when this trust is abused, people know about it. If there's anything you want as a provider of anything, it's generally not the ire of your customers. (Unless you're the music industry...)
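To make the idea of task-level permissions a bit more concrete, here is a rough sketch in Python. It is not how Flickr or Fire Eagle actually implement anything - the scope names, the `Grants` class and the application names are all made up for illustration - it just shows the shape of "each application holds only the specific permissions the data producer has handed it".

```python
from dataclasses import dataclass, field
from enum import Enum


class Scope(Enum):
    # Hypothetical scope names, loosely based on the examples in the text.
    ADD_IMAGES = "add_images"
    LOCATION_AREA = "location_area"
    LOCATION_POSTCODE = "location_postcode"
    LOCATION_GPS = "location_gps"


@dataclass
class Grants:
    """Per-application permissions, controlled by the data producer."""
    by_app: dict = field(default_factory=dict)

    def grant(self, app, scope):
        # The producer explicitly hands one scope to one application.
        self.by_app.setdefault(app, set()).add(scope)

    def revoke(self, app, scope):
        # The producer can take a scope back at any time.
        self.by_app.get(app, set()).discard(scope)

    def allows(self, app, scope):
        # Anything not explicitly granted is denied.
        return scope in self.by_app.get(app, set())


grants = Grants()
grants.grant("photo-uploader", Scope.ADD_IMAGES)
grants.grant("weather-widget", Scope.LOCATION_AREA)
```

In this sketch the (imaginary) weather widget can see which area I'm in, but `grants.allows("weather-widget", Scope.LOCATION_GPS)` stays false until I say otherwise - the default is "no", which is really the whole point.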

The successful Linked Data service will therefore be:

a) the one that hooks into existing services as and where possible - e.g. using OAuth to get your number of Twitter followers, your geo-tagged images, data from Google spreadsheets, etc. It will also encourage other forms of entry, such as by SMS, Twitter direct message, e-mail, widgets, and direct plug-ins to other services like home energy monitoring kits. Getting data into the system is important.

b) the one that also makes it possible to get data out, but in a trusted way. There are several ways of doing this - stripping out personal data is one, but not a very good one (as data can be matched against an individual fairly easily). Aggregating data, randomising it, and other techniques to hide identity are where the true value in a system lies. That, and (to a lesser extent?) keeping the raw data away from prying eyes.
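As a rough illustration of point b), here is a minimal Python sketch of "aggregate, then randomise". Everything in it is an assumption made for the example: the `aggregate` function, the record fields (`user`, `area`, `value`), the minimum group size of 3 and the uniform noise are mine, not any real service's. It is also nowhere near a formal anonymisation guarantee - it only shows the general shape of handing out summaries rather than raw rows.

```python
import random
from collections import defaultdict
from statistics import mean


def aggregate(readings, min_group_size=3, noise=0.0, rng=None):
    """Average `value` per `area`, hiding small groups and adding optional noise."""
    rng = rng or random.Random()
    groups = defaultdict(list)
    for r in readings:
        # The user identifier is deliberately ignored from here on.
        groups[r["area"]].append(r["value"])

    result = {}
    for area, values in groups.items():
        if len(values) < min_group_size:
            # Too few contributors: an average could point at one person.
            continue
        # Blur the published figure by up to +/- `noise`.
        result[area] = mean(values) + rng.uniform(-noise, noise)
    return result


# Made-up readings, e.g. from home energy monitors.
readings = [
    {"user": "a", "area": "N1", "value": 10.0},
    {"user": "b", "area": "N1", "value": 12.0},
    {"user": "c", "area": "N1", "value": 14.0},
    {"user": "d", "area": "E2", "value": 30.0},
]
summary = aggregate(readings)
```

Here the single reading from "E2" never leaves the system, because publishing an "average" of one household is just publishing that household. The analyst still gets something useful for "N1", and the individual isn't exposed - which is the trade-off I'm getting at.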

These two things make a data system valuable on the two important levels - the individual/user, and the systemic/analyst. And maybe matching up these two levels is the real challenge, whatever technology gets used.