Always, always, always look closely at raw data before doing any statistics! This was the most important lesson my statistics teacher tried to impress upon me back in undergraduate training. Funny things can go wrong when handling large datasets, so switch on your common sense and compare input with output - or so he said. He has just been proven right once more. I spent two weeks paying for my negligence, and the following three blog posts had to be corrected:

Mapping Lucknow: party strongholds
Mapping Lucknow: Muslim life
Residential segregation

What happened? Two weeks ago, I decided to wrap up my work with the electoral rolls, which had kept me occupied for so many weeks. While copying all files into a common folder to clean up the mess on my pendrive, I saw an odd irregularity in polling station names. I looked closer. And it all blew up.

In order to create the maps and statistics mentioned above, I had to integrate datasets from four different years: election results from 2007, 2009 and 2012, polling station localities from 2009, and electoral rolls revised in 2011. I knew that 2007 would be tricky, since constituency boundaries were redrawn in the 2008 delimitation exercise. I did not expect 2009, 2011 and 2012 to be a problem though. Consequently, I just integrated these datasets based on the unique polling booth ID assigned by the Election Commission. Silly me.

Turns out: this unique ID changes. Every year. So when I mapped party strongholds, I mistakenly compared the winning party in one locality with the winning party in another locality - because I thought that polling station number 312 in 2009 and polling station number 312 in 2012 would be the same. When I mapped Muslim life, the percentage of voters with Muslim names was correctly calculated, but then mapped on the wrong area - because electoral rolls are from 2011 and GIS data from 2009. And so on. A nightmare.
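A toy example makes the trap concrete (the station names and IDs below are made up for illustration, not taken from the actual rolls): a join on booth ID runs without any error, yet silently pairs up different localities.

```python
# Hypothetical booth-ID -> station-name records for two years.
# The Election Commission renumbers stations, so ID 312 is a
# different locality in each year.
rolls_2009 = {312: "Primary School, Aminabad", 313: "Inter College, Chowk"}
rolls_2012 = {312: "Inter College, Chowk", 313: "Primary School, Aminabad"}

# A naive join on the shared ID looks perfectly clean...
joined = {pid: (rolls_2009[pid], rolls_2012[pid])
          for pid in rolls_2009.keys() & rolls_2012.keys()}

# ...but every "matched" pair actually links two different stations.
mismatches = sorted(pid for pid, (a, b) in joined.items() if a != b)
print(mismatches)  # both IDs point at different localities
```

The join reports full coverage and zero missing keys, which is exactly why the error stayed invisible until the station names themselves were inspected.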

To resolve this problem, I spent the last two weeks writing yet another algorithm, which integrates my different datasets based on a combination of factors: similarity of polling station names (to account for typos, abbreviations, etc.), relative position on the list of polling stations (the rough order of stations tends to stay the same across years), difference in number of electors, and difference in number of booths within each station. I tested and tweaked, tested and tweaked. I needed to integrate as much as possible, but could not tolerate any mismatch: it would be unfortunate if the resulting integrated dataset covered only a few stations, but it would be much worse if errors were produced along the way. Add to this that some station names are written in English, some in Hindi Unicode, some in Hindi KrutiDev, and that some use abbreviations for things such as "primary school XY" while others don't - a complete mess.
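The scoring idea above can be sketched as follows. This is a minimal illustration, not the actual algorithm: the station names, weights, and threshold are all hypothetical, and the real matching also compares booth counts and handles the Hindi/KrutiDev name variants.

```python
from difflib import SequenceMatcher

def match_score(a, b, n_a, n_b):
    """Score a candidate pair of stations; a and b are
    (name, list_position, electors) tuples, n_a/n_b the list lengths."""
    name_sim = SequenceMatcher(None, a[0], b[0]).ratio()  # typo-tolerant
    pos_sim = 1 - abs(a[1] / n_a - b[1] / n_b)            # relative order
    size_sim = 1 - abs(a[2] - b[2]) / max(a[2], b[2])     # elector counts
    # Hypothetical weights: name similarity dominates.
    return 0.5 * name_sim + 0.25 * pos_sim + 0.25 * size_sim

# Made-up stations from two years, with an abbreviated name in year two.
year1 = [("Primary School Aminabad", 0, 980), ("Inter College Chowk", 1, 1210)]
year2 = [("Pri. School Aminabad", 0, 1005), ("Inter College Chowk", 1, 1190)]

THRESHOLD = 0.85  # better to leave a station unmatched than to mismatch
matches = []
for s1 in year1:
    best = max(year2, key=lambda s2: match_score(s1, s2, len(year1), len(year2)))
    if match_score(s1, best, len(year1), len(year2)) >= THRESHOLD:
        matches.append((s1[0], best[0]))
```

The conservative threshold reflects the trade-off described above: stations that score below it stay unmatched (white space on the map) rather than risk a wrong pairing that would distort the statistics.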

In the end, however, I got it sorted out. I managed to correctly integrate data from 2009, 2011 and 2012 for upwards of 95% of all stations in most constituencies.1 Those polling stations which could not be integrated across these three years seem to be distributed fairly randomly - and thus create white space in maps, but do not distort statistics much. Which is a great relief.

Still: while I had hoped to close this chapter for now, I instead spent my days going back to the raw mess. The lesson learned is an old one: always, always, always look closely at raw data before doing any statistics!

  • 1. 2007 was far more complicated due to the disruption caused by delimitation - more on this in a later post