It usually takes me a while to process what came out of OpenTech. Since I know what most speakers are going to talk about in advance (I can ask them), it’s what happens next that I’m most interested in.
One thing that came out of OpenTech conversations for me was the detail of getting data out. Not the policy, but the practice: from Hadley talking about LinkedGov, to some of the OKFN presentations, to the ongoing debate about PDFs. But it seemed there were two conversations going on, sometimes between different people and sometimes between the same people, but actually about different things.
Everyone generally acknowledges that open data and transparency are good, that more transparency is often better, and that data should be released. The techie types always call for XML and structured data. But [in a reference I can’t find] there was a detailed FoI request for data which demanded CSV output. And that’s what they got: a spreadsheet that contained no rows other than a title/heading, because there was no data. There is clearly something wrong there.
The (potential) data users say the only way to get something they want is to get it in the formats they want, whereas those with the data are, in some ways, arguing that what is being asked for is somewhat crazy. They’re both right at different times, and it turns out that very few are satisfied with what’s actually going on.
Again on formats: back at OpenTech 2005, I remember we spent a huge amount of time converting the videos/audio from mp3 to ogg for the people who really wanted free formats only. This year we published in m4a only, and no one has said anything. That argument seems to have gone away. A decade ago, there was an objection to Word files, and a statement that everything should be RTF. Again, that’s a conversation that seems to have gone away in some areas, and needs to in others.
While RDF and XML are fantastic, they’re completely unusable by most people. When Ed Miliband was announced as Labour leader, I wanted to do a quick look at how MPs voted for him: was there a split between older/newer MPs on their Ed/David preferences?
While mySociety is a huge participant in the open data arena, and TheyWorkForYou is the pre-eminent site for this, their open data is split across a load of XML files which make perfect sense for TWFY. But to look at information about current MPs, I needed to match together various structures. First, you take the list of current MPs (all-members-2010.xml for this Parliament, and members.xml for all previous Parliaments). Then you take the list of all people in people.xml (an MP can sit across multiple Parliaments). Finally, for each person, you find their first Parliament, which requires matching each member record listed for that person in the people file against the members files to find the earliest year recorded. That’s 3 XML files, full of identifiers, all alike.
That’s not a difficult problem for anyone who can write a 10 line Perl script to build the hash structures and run it. However, most people can’t. And while not everyone will want to do that, I suspect the number of people who would want to, and could if they had a set of Excel files, is much higher. I’m not saying that mySociety made the wrong decision to make data available in this way (they also have an API and other methods), but none of them are really options for individuals doing basic analysis (and in fact, while I can write the above code, I had a bug in it that I couldn’t figure out, and gave up). And while mySociety can make it available in multiple different formats, that’s not a scalable solution for anyone. There needs to be a better solution.
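For the curious, the matching described above might look something like the sketch below, in Python rather than Perl. It is only an illustration: the three XML snippets are made-up stand-ins for the real files, and the element and attribute names (member, person, office, fromdate, latestname) are assumptions chosen for the example, not a description of the actual TheyWorkForYou data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-ins for the three files described above.
# The element and attribute names are assumptions for illustration only.
CURRENT_MEMBERS_XML = """
<publicwhip>
  <member id="member/201" firstname="Ann" lastname="Example" fromdate="2010-05-06"/>
  <member id="member/202" firstname="Bob" lastname="Sample" fromdate="2010-05-06"/>
</publicwhip>
"""

PREVIOUS_MEMBERS_XML = """
<publicwhip>
  <member id="member/101" firstname="Ann" lastname="Example" fromdate="1997-05-01"/>
  <member id="member/150" firstname="Ann" lastname="Example" fromdate="2005-05-05"/>
</publicwhip>
"""

PEOPLE_XML = """
<publicwhip>
  <person id="person/1" latestname="Ann Example">
    <office id="member/101"/>
    <office id="member/150"/>
    <office id="member/201"/>
  </person>
  <person id="person/2" latestname="Bob Sample">
    <office id="member/202"/>
  </person>
</publicwhip>
"""


def start_years(*members_xml):
    """Map each member id to the year that membership started."""
    years = {}
    for text in members_xml:
        for member in ET.fromstring(text).iter("member"):
            years[member.get("id")] = int(member.get("fromdate")[:4])
    return years


def first_elected(people_xml, current_xml, previous_xml):
    """Return (name, earliest year) for every person who is a current MP."""
    current_ids = set(start_years(current_xml))
    all_years = start_years(current_xml, previous_xml)
    rows = []
    for person in ET.fromstring(people_xml).iter("person"):
        member_ids = [office.get("id") for office in person.iter("office")]
        if not current_ids.intersection(member_ids):
            continue  # this person is not sitting in the current Parliament
        known = [all_years[m] for m in member_ids if m in all_years]
        rows.append((person.get("latestname"), min(known)))
    return sorted(rows)


def to_csv(rows):
    """Render the rows as CSV text that a spreadsheet can open."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "first_elected"])
    writer.writerows(rows)
    return out.getvalue()


rows = first_elected(PEOPLE_XML, CURRENT_MEMBERS_XML, PREVIOUS_MEMBERS_XML)
print(to_csv(rows))
```

The last step writes the result as CSV, which is the point: once the identifiers have been matched up, what most people actually need is one flat table they can open in a spreadsheet.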
Out of OpenTech and other developments over the last few months, it may be that everyone is starting, slowly, to realise that formats matter less than content. And at some point soon we’ll see the equivalent of what ffmpeg is for video, but for data: something which takes in pretty much anything people have, and gives out whatever people want.
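As a very rough sketch of the shape such a tool might take, here is a toy converter in Python that only knows about flat CSV and JSON records. The function names (read_records, write_records, convert) are hypothetical, and a real tool would need to cope with far messier inputs than this.

```python
import csv
import io
import json


def read_records(text, fmt):
    """Parse text in the named format into a list of flat dicts."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        return json.loads(text)
    raise ValueError(f"unsupported input format: {fmt}")


def write_records(records, fmt):
    """Serialise a list of flat dicts into the named format."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "csv":
        out = io.StringIO()
        # Column order is taken from the first record.
        fields = list(records[0]) if records else []
        writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return out.getvalue()
    raise ValueError(f"unsupported output format: {fmt}")


def convert(text, source, target):
    """Take in one format, give out another."""
    return write_records(read_records(text, source), target)


print(convert("name,year\nAnn Example,1997\n", "csv", "json"))
```

Everything passes through one common in-memory form (a list of dicts), so adding a new format means writing one reader and one writer rather than a converter for every pair of formats.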
The other thing that was bubbling around underneath was timescales.
It ranged from Gavin Starks and Robin Houston talking about short term action, now, which will give benefits in 50 years (www.amee.com, www.1010uk.org), to the dear friend of OpenTech who couldn’t be there because of a news cycle measured in a few hours.
LinkedGov.org is starting a long-term project to start to clean up and restructure data to make it usable; Hadley’s looking both towards her feet to avoid stumbling over the stones of short term data questions and getting the thing off the ground, while simultaneously lifting the perspective of many up, to look at the horizon. And the view (over time) is breathtaking.
Bill Thompson and Mia Ridge were also talking about the interactions with museums and archives. Bill is working on the BBC Archive project, and Mia works at the Science Museum. Both are looking at maintaining the past, and making it usable to people beyond those able to visit the place. Primarily because, if no one can find the material online, it’ll become less valued in the future, as what matters is what’s findable online.
And, fundamentally, that ties everything together. If a dataset is open but unusable by those who want to use it for something, it is, effectively, not open. Data.Gov.UK may have released data in formats which aren’t preferred by some, but most of what they do is generally accessible and becoming more so over time. And Government is focussing on that long term.
All OpenTech audio (and slides where offered) is available on the OpenTech schedule page at http://www.ukuug.org/events/opentech2010/schedule/ and we’re also looking for sponsors for 2011. Contact firstname.lastname@example.org and we can chat ☺