Scanning Word Documents

Each year, the “ collective” usually meet in someone’s house to talk and work on stuff for a day or two. 2 years ago, they launched, a year ago, was created on day 2. However, on day 1, we were looking at a different project which didn’t actually go anywhere, for reasons given at the end of the tale. I wonder what we’ll do this year…

The government publishes a lot of documents online. Most of those are PDFs, but a huge number of Word documents are also published.

We fetched a huge swathe from random government departments (all the big ones, some of the smaller ones), along with every press release that every picked up.

This was somewhere between 12,000 and 15,000 documents.

We then ran all of them through a program to look for those which had “Track Changes” turned on, and had a look at the changes that we found.
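For anyone wanting to repeat the exercise, here is a rough sketch of that kind of check. It is not the program we used: the documents back then were binary .doc files, whereas this assumes the newer .docx format, which is a zip archive of XML parts. The function names are made up for illustration, and it assumes the conventional "w:" namespace prefix, which Word writes but the format does not strictly guarantee.

```python
import re
import zipfile

# Revision markup in WordprocessingML: insertions, deletions and moves.
# The trailing character class stops "<w:del" matching "<w:delText>".
REVISION_TAGS = re.compile(rb"<w:(ins|del|moveFrom|moveTo)[\s/>]")
# Present in settings.xml when "Track Changes" is switched on.
# Caveat: this also matches an explicit w:val="false" setting.
TRACKING_FLAG = re.compile(rb"<w:trackRevisions\b")
# Individual comments live in their own part, word/comments.xml.
COMMENT_TAG = re.compile(rb"<w:comment[\s/>]")


def read_part(archive, name):
    """Return the bytes of one part of the archive, or b'' if it is absent."""
    try:
        return archive.read(name)
    except KeyError:
        return b""


def scan_docx(source):
    """Summarise tracked-change residue in a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        body = read_part(archive, "word/document.xml")
        settings = read_part(archive, "word/settings.xml")
        comments = read_part(archive, "word/comments.xml")
    return {
        "tracking_enabled": bool(TRACKING_FLAG.search(settings)),
        "revisions": len(REVISION_TAGS.findall(body)),
        "comments": len(COMMENT_TAG.findall(comments)),
    }
```

Run over a directory of downloads, anything reporting a non-zero "revisions" or "comments" count is worth opening by hand. A document with "tracking_enabled" set but no revisions is the "turned on, nothing changed" case.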

I think the number that we found was around 15 (it made it into double digits, but it wasn’t by much). And many of those had track changes enabled, but no actual changes. Almost all of the rest had random layout or white space changes made. I think we found one which had a comment left in. It wasn’t something funny.

In short, while it was a fun exercise, we found effectively nothing useful, but it is something we could look at again in a few years to see if the standard has regressed.

While I was heavily involved, it wasn’t just me; Francis Irving did the Word document analysis, and the London Perl Mongers were an invaluable source of ideas for doing bits of it.

posted: 01 Dec 2006

“Why I do what Sam says…”

I have no idea how to respond to a plug for TGS that reads like this

posted: 28 Aug 2006