When is St Petersburg (not) the same as Leningrad?

If you’ve not already read it, see Tom Longley’s first post about this; otherwise you won’t have a clue what’s below. That said, I’d strongly recommend that you don’t read the linked report unless you’re very interested in specifics – it’s really not pleasant reading.

Tom asked some people to take a look at the documents and see what we could do to turn them into something more useful.

The structure of each document is relatively simple, and the formatting mostly fine (three different structures, which can be identified easily with regular expressions). In fact, the whole script handling most of the 5 years (a couple of reports are very different for various reasons) is only 122 lines of code (roughly half parsing and half spreadsheet generation).
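To give a flavour of what “identified easily with regular expressions” means, here is a minimal sketch. It is not the actual script, and the three layouts below are invented for illustration (the real reports’ structures aren’t reproduced in this post); the pattern names and the `classify` helper are likewise hypothetical.

```python
import re

# Three HYPOTHETICAL entry layouts, made up for this example only.
# The real reports use different structures.
PATTERNS = {
    # e.g. "12 March 2005 - Harare: description"
    "date_first": re.compile(
        r"^(?P<date>\d{1,2} [A-Za-z]+ \d{4}) - (?P<place>[^:]+): (?P<text>.+)$"
    ),
    # e.g. "Harare (12/03/2005): description"
    "place_first": re.compile(
        r"^(?P<place>[^(]+?) \((?P<date>\d{2}/\d{2}/\d{4})\): (?P<text>.+)$"
    ),
    # e.g. "[2005-03-12] Harare | description"
    "bracketed": re.compile(
        r"^\[(?P<date>\d{4}-\d{2}-\d{2})\] (?P<place>[^|]+?) \| (?P<text>.+)$"
    ),
}


def classify(line):
    """Return (layout_name, fields) for the first pattern that matches.

    Returns (None, {}) when no layout matches, so odd lines can be
    collected and looked at by hand rather than silently dropped.
    """
    stripped = line.strip()
    for name, pattern in PATTERNS.items():
        match = pattern.match(stripped)
        if match:
            return name, match.groupdict()
    return None, {}
```

Each matched entry then becomes one spreadsheet row (date, place, text), and anything that comes back as `None` gets flagged for manual checking.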

The reason for using a spreadsheet comes down to simplicity: it can be done fast, and the data is only semi-structured. Plus, since it’s in Excel, it can be edited in almost anything, on any platform, without a requirement for anything above Windows 95. Future work can be based either in VB/macros or in a more intelligent spreadsheet, or in something else. But in order to operate in a country with some security concerns which are far from abstract, simplicity is useful.

TomL touched on the issue of geography. Detailed geography is useful – lat/long/date being ideal as the basic item, as that is constant over time and can be rolled up into larger geographies as needed (no, it’s not that easy, but the other options are far harder). The main problem for the existing data, at the moment, is tying together the different names a place has been given over time, and grouping them in some way that lets us look at the data from a higher level (e.g. region).
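As a rough sketch of what tying names together could look like, the snippet below maps variant or former names onto one canonical name and then onto a region. The lookup tables, the `normalise` and `resolve` helpers and the choice of regions are all assumptions for illustration; a real gazetteer for this dataset would have to be built by hand from the names that actually appear in the reports.

```python
# Illustrative lookup tables only; not taken from the real data.
ALIASES = {
    "leningrad": "St Petersburg",
    "petrograd": "St Petersburg",
    "st petersburg": "St Petersburg",
    "salisbury": "Harare",
    "harare": "Harare",
}

# Canonical place -> higher-level grouping (assumed granularity).
REGIONS = {
    "St Petersburg": "Northwestern Federal District",
    "Harare": "Harare Province",
}


def normalise(name):
    """Lower-case, drop full stops and collapse whitespace."""
    return " ".join(name.replace(".", "").split()).lower()


def resolve(name):
    """Return (canonical_name, region), or (None, None) if unknown.

    Unknown names are returned as None rather than guessed at, so
    they can be reviewed and added to the alias table by a person.
    """
    canonical = ALIASES.get(normalise(name))
    if canonical is None:
        return None, None
    return canonical, REGIONS.get(canonical)
```

With something like this, `resolve("Leningrad")` and `resolve("St. Petersburg")` land on the same row, which is all the higher-level roll-up needs.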

This stuff is fundamentally a qualitative time series dataset on which we then build a (somewhat depressing) quantitative time series dataset. Discontinuities in geography are not a rare or unknown problem for such conversions. But we need to know the names of places before we can start doing things with them. Town names don’t change much and often it doesn’t matter when they do – even if there’s a formal edict changing the name to Leningrad, people will still call it St. Petersburg.

For some purposes (constituency stuff), knowing in the abstract that you live in Manchester or Harare is nowhere near enough. At the current level, though, we don’t have better data for the past stuff (or, at the moment, any decent data), so we’ll have to live without it. But it does mean that as things move forwards, we can retrofit future (hopefully good) data onto the historic data. Want 2008-2010 data? Then you can get your own decent geography. Want 2003-2010? You can get what we can give you. It’s a very common problem, and there is no single solution. But something is better than nothing, and, in this case, perfect could be the enemy of good. For legal or audit purposes, being on the right street matters; for other things, it’s arguably less so.

At a mySociety meeting before Christmas, Tom Steinberg spoke with his usual eloquence about how those in the room (the leading e-democracy activists from the UK and USA) were extremely fortunate to operate in countries where we can do what we do without concern for our own safety or that of friends and family. Unfortunately, this is all too rare in the world. Democracy is a great gift from the past, and we don’t own it. We’re only custodians for the future; and we should believe in doing the right thing for the right reasons, and in so doing, leave things slightly better than we found them.

It’s easy for someone to claim that they’re doing something in the name of democracy; that should not be confused with actually doing so, especially when they’re running for something.

posted: 08 Feb 2008