I just read a post (title: redaction and technical incompetence) by the Sunlight Foundation about how US Treasury Secretary Geithner’s schedule is only available as a scan in a PDF.
Now, leaving aside the fact that they didn’t link to the PDF in the first place (they were trying to make a point), they talk about the many reasons that things are as they are.
But, Sunlight, as an organisation which complains about this often enough, has much better tools at their disposal than complaining about it. As people using computers in 2010, we all have better tools to use on PDFs than we currently use. We often complain about how inaccessible PDFs are, without doing the basic, simple, automatable tasks which can make them readable.
Open the PDF in Acrobat, press “Recognise text using OCR”, and it’s searchable. Sunlight could then republish it for everyone to use, or put up a web service which adds the OCR text in such a way that when you search, what gets highlighted is the relevant bits of the page where the OCRed text matches. That is possible now.
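To make the highlighting idea concrete, here is a minimal sketch in Python. It assumes some OCR step has already produced one record per word with its page and bounding box. The `OcrWord` record, the `find_highlights` function and the sample words are all made up for illustration; they are not the output format of Acrobat or of any particular OCR tool.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One OCRed word and where it sits on the page (hypothetical layout)."""
    text: str
    page: int
    left: int
    top: int
    width: int
    height: int


def find_highlights(words, query):
    """Return every run of consecutive OCR words that matches the query.

    Each hit is a list of OcrWord records, so a viewer can draw a
    highlight box over each word in the run.
    """
    terms = [t.lower() for t in query.split()]
    if not terms:
        return []
    # Compare case-insensitively and ignore punctuation stuck to a word.
    normalised = [w.text.lower().strip(".,;:()") for w in words]
    hits = []
    for i in range(len(words) - len(terms) + 1):
        if normalised[i:i + len(terms)] == terms:
            hits.append(words[i:i + len(terms)])
    return hits


# Made-up sample data: four words on one line of page 1.
words = [
    OcrWord("Call", 1, 100, 200, 40, 14),
    OcrWord("with", 1, 145, 200, 40, 14),
    OcrWord("senior", 1, 190, 200, 60, 14),
    OcrWord("staff.", 1, 255, 200, 50, 14),
]
hits = find_highlights(words, "senior staff")
```

Each hit carries the page number and box of every matching word, which is all a web page needs to overlay a highlight on the scanned image. Real OCR output is noisier than this, so a real service would probably want fuzzy matching rather than exact word equality.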
But, as a community, we prefer to stick to the notion that anything in PDF is utterly locked up in a way which no one can get at.
It’s not (really).
It is far from ideal, it’s a bugger to use, and it is not the best format for most things, but it’s what we’ve got. And showing how valuable this data is will get us much further than complaining that we can’t read a file that most people clearly can in the tools they use. It’s the tools we choose to use that are letting us down. And, as a movement, open data has to get better at it, and then it’ll be less of a problem for us, and we can spend more time doing what we claim to be wanting to do.
In that spirit, and for testing, I’ve put together a little thing which takes a PDF of tables and has a go at converting them to a spreadsheet (Excel, but if you read all the comments, you get JSON, and every language can parse a simple Office 2003 Excel file). Have a play and see what you think. Comments by email welcome (source code). It works well on some PDFs which are output from Excel or databases as tables.
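For a flavour of what that kind of conversion involves, here is a rough sketch in Python. It is not the source of the tool linked above, only an illustration of one common approach: take the positioned text fragments that a PDF-to-XML step gives you and group them into rows by their vertical position. The `(left, top, text)` tuple layout, the `fragments_to_rows` name and the `row_tolerance` value are assumptions made for this example.

```python
import json


def fragments_to_rows(fragments, row_tolerance=3):
    """Group (left, top, text) fragments into rows of cell strings.

    Fragments whose top coordinate is within row_tolerance of the first
    fragment in the current row are treated as being on the same line.
    """
    rows = []  # each entry is [top_of_row, [(left, text), ...]]
    for left, top, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(top - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append([top, [(left, text)]])
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Made-up fragments: a header row and one data row, slightly misaligned.
fragments = [
    (50, 100, "Date"),
    (200, 101, "Event"),
    (50, 120, "2010-01-05"),
    (200, 119, "Staff meeting"),
]
table = fragments_to_rows(fragments)
as_json = json.dumps(table)
```

This only recovers rows. It does not line cells up into columns when a row has an empty cell, which is one of the places where real PDFs get awkward and where better tools would help most.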
If anyone knows how to make pdf2xml output the hidden OCR searchable info that Acrobat adds in, that’d be useful (for now, what you get from my tool on Geithner’s schedule is this, which is not useful, but can be better).
More generally, our tools have to get better if we want to maintain credibility, using the data to the maximum benefit rather than complaining about what we want. We have mediocre tools to handle this data; we just don’t use them, and we aren’t improving them enough to do what we need them to do. We should start.