Scanning Word Documents

Each year, the “democracy.org.uk collective” usually meets in someone’s house to talk and work on stuff for a day or two. Two years ago, they launched www.directionlessgov.com; a year ago, www.hassleme.co.uk was created on day 2. On day 1, however, we were looking at a different project, which didn’t actually go anywhere for reasons given at the end of the tale. I wonder what we’ll do this year…

The government publishes a lot of documents online. Most of those are PDFs, but a huge number of Word documents are published too.

We fetched a huge swathe of them from random government departments (all the big ones, some of the smaller ones), along with every press release that www.thegovernmentsays.com ever picked up.

This was somewhere between 12,000 and 15,000 documents.

We then ran all of them through a program to look for those which had “Track Changes” turned on, and had a look at the changes that we found.
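
As a rough illustration of that kind of check, here is a minimal sketch in Python. It is not the program we used, and it assumes the newer .docx format (a zip archive of XML parts), which is not necessarily what the scanned documents were; older binary .doc files would need a different parser. The function name scan_docx and the shape of the returned dictionary are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside .docx files.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _read_xml(archive, name):
    """Parse one XML part of the archive, or return None if it is absent."""
    if name not in archive.namelist():
        return None
    return ET.fromstring(archive.read(name))


def _count(root, tag):
    """Count elements with the given WordprocessingML tag under root."""
    if root is None:
        return 0
    return sum(1 for _ in root.iter(W + tag))


def scan_docx(path):
    """Summarise tracked-change evidence in one .docx file (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        body = _read_xml(archive, "word/document.xml")
        settings = _read_xml(archive, "word/settings.xml")
        comments = _read_xml(archive, "word/comments.xml")

    # <w:trackRevisions/> in the settings part means Track Changes is switched on,
    # unless its w:val attribute explicitly turns it off.
    flag = settings.find(W + "trackRevisions") if settings is not None else None
    tracking_on = flag is not None and flag.get(W + "val", "true") not in ("0", "false", "off")

    return {
        "tracking_on": tracking_on,
        "insertions": _count(body, "ins"),   # recorded insertions
        "deletions": _count(body, "del"),    # recorded deletions
        "comments": _count(comments, "comment"),
    }
```

Run over a directory of files, a summary like this separates documents that merely have tracking switched on from those that still carry recorded insertions, deletions or comments.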

I think the number that we found was around 15 (it made it into double digits, but it wasn’t by much). And many of those had track changes enabled, but no actual changes. Almost all of the rest had random layout or white space changes made. I think we found one which had a comment left in. It wasn’t something funny.

In short, while it was a fun exercise, we found effectively nothing useful. It is something we could look at again in a few years, though, to see if the standard has regressed.

While I was heavily involved, it wasn’t just me; Francis Irving did the Word document analysis, and the London Perl Mongers were an invaluable source of ideas for doing bits of it.

posted: 01 Dec 2006

News boxes.

Another project I spent some time on, about a year ago now, with significant amounts of help from Richard Pope, is a way of easily putting democracy news onto webpages. Think Google ads, but for local democracy information.

Behind the scenes, it again uses the TheGovernmentSays.com model of pulling in RSS feeds and then spitting them out of the database. Instead of aiming at big webpages and sites, it is targeted at small fragments of HTML. You visit the site, enter your postcode, pick which information you want included (based on your postcode, it tells you who your MP is and localises what you see), then you get given a fragment of HTML to paste into your webpages. That’s it.

Whenever someone looks at your website after inclusion of the fragment, they get the latest democracy news for your area, what your MP has been doing and saying, and what PledgeBank pledges are happening in your area. We hope to add more sources of information over time, but this is what we’re starting with (as the information is easily accessible).
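
The feeds-in, fragment-out idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the code behind the site: it takes RSS 2.0 documents that have already been fetched as strings, skips the database and the postcode localisation entirely, and the function names and the news-box class name are invented for the example.

```python
import html
import xml.etree.ElementTree as ET


def parse_rss_items(rss_xml):
    """Extract (title, link) pairs from one RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and link:
            items.append((title, link))
    return items


def render_news_box(feeds, limit=5):
    """Merge items from several feeds into one small HTML fragment."""
    items = []
    for rss_xml in feeds:
        items.extend(parse_rss_items(rss_xml))
    rows = [
        '<li><a href="{}">{}</a></li>'.format(
            html.escape(link, quote=True), html.escape(title)
        )
        for title, link in items[:limit]
    ]
    return '<ul class="news-box">\n' + "\n".join(rows) + "\n</ul>"
```

Titles and links are escaped before they go into the fragment, since the output ends up pasted into someone else’s page.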

While this implementation is focussed on democracy, very little of it is tied to democracy. It could be reused for any area of interest where multiple feeds need aggregating in a simple fashion.

Over time, one enhancement could be to show information local to the browser itself, rather than to the website owner, where the browser has an override cookie. But that’s going to be left as an exercise for the reader.

posted: 26 Jun 2006