The nice thing about #opendata these days is that, occasionally, one of the immensely talented people working on code comes up with a nice, simple implementation which, by virtue of its existence, doesn’t just decimate but completely removes some of the arguments around #opendata.

There’s currently a belief that putting things in PDFs makes them harder to read, and that this will, in some way, save some people political hassle.

Not so much any more. Jeni Tennison introduced me on Twitter to @kitwallace – who built this for converting tables in PDFs to CSV/HTML/XML.

It’s worth an aside at this point to remember that tables in PDFs don’t exist. All you have are chunks of text at positions x and y on a page, and lines that run from x1,y1 to x2,y2; from those, you can see a table visually.
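
To make that aside concrete, here is a rough Python sketch of how rows might be recovered from nothing but positioned chunks of text. It is not how the tool above works (I haven’t looked inside it); the function name `chunks_to_rows`, the `(x, y, text)` tuple layout, the `y_tolerance` value and the sample data are all made up for illustration, and it assumes y is measured downwards from the top of the page.

```python
def chunks_to_rows(chunks, y_tolerance=2.0):
    """Group (x, y, text) chunks into rows of cell text.

    Assumption: chunks whose y values are within y_tolerance of the
    first chunk in a row sit on the same visual line.
    """
    rows = []  # each entry is [y_of_first_chunk, [(x, text), ...]]
    # Walk the chunks from the top of the page to the bottom.
    for x, y, text in sorted(chunks, key=lambda c: c[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))   # same visual line
        else:
            rows.append([y, [(x, text)]])   # start a new row
    # Within each row, order the cells left to right by x.
    return [
        [text for _, text in sorted(cells, key=lambda c: c[0])]
        for _, cells in rows
    ]


# Hypothetical chunks, in the arbitrary order a PDF might store them.
chunks = [
    (50, 100.0, "Dept"),
    (200, 120.0, "1200"),
    (200, 100.5, "Amount"),
    (50, 119.5, "Parks"),
]
table = chunks_to_rows(chunks)
```

Real PDFs are messier than this (wrapped cells, merged columns, ruling lines that matter), so a proper converter has far more work to do, but the basic idea is the same: the table is inferred from positions, not read from the file.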

The most evil PDFs published are the Birmingham spending data: simple, long, verbose, 80+MB PDFs. Here’s an HTML version, with links to CSV and XML:

http://184.73.216.20/exist/rest/db/apps/gtd/pdf2xml.xq?path=/db/apps/gtd/pesa&table=1.6&format=html

Not only will this save huge amounts of time, but it also makes a potentially stronger argument to political types about publication: “If you publish as PDF, we’ll assume you’re trying to hide something.”

 

posted: 28 Feb 2011