Update: Sept 2014: Someone did some further analysis.
It took 10 hours, from the release of ineffectively anonymised driver numbers, to extract the driver IDs of 173 million taxi journeys in New York City. That data, released by the NYC Taxi authority, has also been used to do useful things, but what else will be done with the rest of the fields? What else could have been done?
Prurient munging with other data is of course now happening on the internets, calculating things like the busiest drivers, gross incomes of drivers, tip percentages (20%…!!!!) and so on. What could happen next? If the US Internal Revenue Service (the tax collectors) knew how to use data, they’d be laughing so hard that their clip-on tie might fall off at the enforcement possibilities.
Both are at the expense of the privacy of cab drivers – working people (though often eccentric, let’s be honest) on low to average wages. Will there be an outcry about that? Or will the next outcry be about their passengers, and generate a mad dash towards private providers (who do have a range of problems, of which high levels of transparency is not one)?
The data released is detailed enough for institutions and large homes to be inferred, which may allow some types of establishments to be derived, especially if the same trip happens on the same day, most weeks. It will generally be a different taxi, but it’s the same journey. A train runs at the same time every day; whether the driver is the same or different is irrelevant. It is the routine journey that matters. Humans are creatures of habit, and those habits form patterns that emerge from the data in bulk.
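To make the "same trip, same day, most weeks" point concrete, here is a rough sketch of how such routines could surface. Everything in it is an assumption for illustration: the trip records are invented, the field layout is not the real file format, and the thresholds (three decimal places of lat/long, roughly a 100 metre grid, and three distinct weeks) are arbitrary choices.

```python
from collections import defaultdict
from datetime import datetime

# Invented records: (pickup time, pickup lat, pickup lon, dropoff lat, dropoff lon).
# These are NOT taken from the released data.
trips = [
    ("2013-03-05 08:02", 40.7411, -73.9897, 40.7587, -73.9787),
    ("2013-03-12 08:05", 40.7412, -73.9896, 40.7586, -73.9788),
    ("2013-03-19 08:10", 40.7411, -73.9898, 40.7587, -73.9786),
    ("2013-03-20 23:40", 40.7300, -73.9950, 40.7000, -73.9500),
]


def routine_key(when, p_lat, p_lon, d_lat, d_lon, places=3):
    """Bucket a trip by weekday, hour, and coarsened start/end points."""
    return (
        when.weekday(),
        when.hour,
        round(p_lat, places),
        round(p_lon, places),
        round(d_lat, places),
        round(d_lon, places),
    )


# For each bucket, collect the distinct (ISO year, ISO week) pairs it appears in.
weeks_seen = defaultdict(set)
for stamp, p_lat, p_lon, d_lat, d_lon in trips:
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    iso = when.isocalendar()
    weeks_seen[routine_key(when, p_lat, p_lon, d_lat, d_lon)].add((iso[0], iso[1]))

# A bucket seen in three or more different weeks looks like someone's routine.
routines = {key: len(weeks) for key, weeks in weeks_seen.items() if len(weeks) >= 3}

for key, week_count in routines.items():
    print(key, "seen in", week_count, "different weeks")
```

Note that no driver ID appears anywhere in this. The lat/long and the timestamps are enough, which is why fixing the driver field alone would not have fixed the release. A real analysis would need smarter bucketing (trips near a grid or hour boundary get split here), but the principle holds.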
The problem isn’t just the detailed driver IDs, but the detailed lat/long and times.
What about the various establishments, of which NY has many, that people might not want family or colleagues at their destination to know they came from? What about destinations where the lat/long is enough to identify a dwelling, and where the other end of each journey can then be seen? One member of a household can potentially see (and know) where others go, based on external information.
Not all possibilities are a good idea for the city as a whole.
Can this detail be used to detect when taxis disappear? As one person said, it is possible that, at some point “all licences taxis will be visible at all times on google maps”…”Think how many assaults and worse might be prevented by that sort of visibility”. Prevented as long as no one turns the GPS off. That the information is available to law enforcement is different to it being public. How many crimes, reportedly involving a taxi, have someone with forensic data skills looking at them? There will be no crimes solved by this data, but it may move some crime types to other means which are more noticeable to a potential victim.
How could this have been prevented?
Clearly, someone thought they knew what they were doing, and didn’t. This isn’t new. Normally, it’s not 173 million items of cockup, but NYC doesn’t do things in small doses. Some form of review would have helped.
Who looked at the dataset, not from the perspective of “can we release it”, but from the perspective of “what are we actually giving away”? The taxi authority thought about this as journeys, with the only individual involved being the driver. But in every trip, there was a passenger, and, for passengers, there are regular journeys.
Properly hashing the driver IDs would have been a start, but releasing several files could have led to the same productive outcomes, without some of the downsides, but with others. Under Freedom of Information law, that is a very hard judgement to make. Which is why considered open data releases, for records that people can get under FoI law, are generally a good thing.
It’s not hard, it just requires some thought from different perspectives. Because 173 million trips is heavy traffic on cockup boulevard.
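As a rough illustration of what "properly hashing" might mean, the sketch below contrasts a plain unsalted hash with a keyed one. The assumptions are mine, not the taxi authority's: the ID format (digit, letter, digit, digit), the use of MD5 for the weak version, and the function names are all hypothetical. The point is only that when the set of possible IDs is small, anyone can hash every candidate and build a reverse lookup table.

```python
import hashlib
import hmac
import itertools
import secrets
import string


def weak_pseudonym(driver_id: str) -> str:
    """Unsalted hash: the same input always gives the same public output."""
    return hashlib.md5(driver_id.encode("ascii")).hexdigest()


def build_lookup() -> dict[str, str]:
    """Hash every ID in the (assumed) digit-letter-digit-digit format."""
    table = {}
    for d1, letter, d2, d3 in itertools.product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[weak_pseudonym(candidate)] = candidate
    return table


def keyed_pseudonym(driver_id: str, key: bytes) -> str:
    """Keyed hash (HMAC-SHA256): enumeration fails without the secret key."""
    return hmac.new(key, driver_id.encode("ascii"), hashlib.sha256).hexdigest()


# The attack: 10 * 26 * 10 * 10 = 26,000 candidates, hashed in well under a second.
lookup = build_lookup()
released = weak_pseudonym("5X55")   # what would appear in the published file
recovered = lookup[released]        # the original ID, read straight back out

# A safer variant: the key is generated once, kept private, never published.
secret_key = secrets.token_bytes(32)
safer = keyed_pseudonym("5X55", secret_key)
```

The keyed version still lets analysts group trips by driver, since the same ID always maps to the same pseudonym, but it cannot be reversed by enumeration as long as the key stays secret. It does nothing about the lat/long and time fields, which are a separate problem.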