The letter in today's picture almost wrecked my presentation in Oxford last week - and reminded me how easy it is to get quantitative papers spectacularly wrong. What happened? While rewriting my talk two weeks ago (basically extending earlier work from Lucknow to the whole of Uttar Pradesh), I noticed some odd phenomena in my dataset. And freaked out. Oxford is not just any place to be invited to speak at. I had to know exactly what had gone wrong. After some fairly tense hours, I discovered it: the "Š", a letter which had apparently crept into some electoral rolls of Eastern UP (and no, these are not written in Latin script, but in Devanagari). I have no idea why it was there in the first place, and there seemed to be no pattern to where it appeared - but as soon as my name-matching algorithm stumbled across it, it crashed. And left my dataset corrupted. Luckily, I was able to solve the problem (the final dataset arrived literally five minutes before my presentation), and could ditch the "I am truly sorry but my talk just dissolved in a data nightmare" embarrassment. Close call!

As with my earlier data trouble, I realized once again how easy it is to fail spectacularly in quantitative research. You get one calculation wrong, a data row slips elsewhere - and your analysis is blown. It's much harder to fail quite so grandly in qualitative research. This is no argument in the quantitative-qualitative debate (which I find silly most of the time anyway). But if you deal with numbers as part of your research: be careful. Very careful. There might be a "Š" lurking around, waiting to destroy your fancy arguments at the most inconvenient hour...
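For what it's worth, a stray character like the "Š" can be caught before it ever reaches a matching step. The following is only a minimal sketch, not the code I actually ran: the function names are made up for illustration, and it assumes that every legitimate character in a name falls into the basic Devanagari Unicode block (U+0900 to U+097F) or is whitespace. Real rolls may also contain digits, punctuation or zero-width joiners, which this check would flag as well.

```python
# Hypothetical helper names; the basic Devanagari block is U+0900..U+097F.
DEVANAGARI_START = "\u0900"
DEVANAGARI_END = "\u097F"


def stray_characters(name):
    """Return the characters in `name` that are neither Devanagari nor whitespace."""
    return [
        ch for ch in name
        if not ch.isspace() and not (DEVANAGARI_START <= ch <= DEVANAGARI_END)
    ]


def audit_names(names):
    """List (row index, name, stray characters) for every suspicious name."""
    report = []
    for row, name in enumerate(names):
        strays = stray_characters(name)
        if strays:
            report.append((row, name, strays))
    return report


# Made-up sample rows, not taken from any real electoral roll.
sample = ["राम कुमार", "राŠम", "सीता"]
for row, name, strays in audit_names(sample):
    print(row, name, strays)
```

Running such an audit first, and looking at the flagged rows by hand, means the decision about what to do with them (drop, repair, skip) is made deliberately rather than by a crash.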

Always, always, always look closely at raw data before doing any statistics! This was the most important lesson my statistics teacher tried to impress upon me back in undergraduate training. Funny things can go wrong when handling large datasets, so switch on your common sense and compare input with output - or so he said. He has just been proven right once more. I spent two weeks paying for my negligence, and the following three blog posts had to be corrected:

Mapping Lucknow: party strongholds
Mapping Lucknow: Muslim life
Residential segregation

What happened? Two weeks ago, I decided to wrap up my work with the electoral rolls, which had kept me occupied for the past several weeks. While copying all files into a common folder to clean up the mess on my pendrive, I saw an odd irregularity in polling station names. I looked closer. And it all blew up.

In order to create the maps and statistics mentioned above, I had to integrate datasets from four different years: election results from 2007, 2009 and 2012, polling station localities from 2009, and electoral rolls revised in 2011. I knew that 2007 would be tricky, since constituency boundaries were redrawn in the 2008 delimitation exercise. I did not expect 2009, 2011 and 2012 to be a problem though. Consequently, I just integrated these datasets based on the unique polling booth ID assigned by the Election Commission. Silly me.
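In hindsight, a quick check of the join key would have been cheap. The sketch below is illustrative only: the function name, the data layout (a mapping from booth ID to polling station name per year) and the sample values are all assumptions, not my actual files. It simply asks whether the same booth ID points to the same station name in two datasets before they are merged.

```python
def check_join_key(left, right):
    """Compare two {booth_id: station_name} mappings before merging on booth_id.

    Returns the IDs present in only one dataset, and the shared IDs whose
    station names disagree (ignoring case and surrounding whitespace).
    """
    shared = set(left) & set(right)
    return {
        "only_left": sorted(set(left) - set(right)),
        "only_right": sorted(set(right) - set(left)),
        "mismatched": sorted(
            booth_id for booth_id in shared
            if left[booth_id].strip().lower() != right[booth_id].strip().lower()
        ),
    }


# Made-up example data: booth 2 refers to different stations in the two years.
rolls_2011 = {1: "Primary School A", 2: "Primary School B", 3: "Community Hall"}
results_2012 = {1: "Primary School A", 2: "Community Hall", 4: "Primary School B"}

report = check_join_key(rolls_2011, results_2012)
print(report)
```

Any non-empty list in the report is a reason to stop and look at the raw data before trusting the merged result. Exact string comparison is deliberately strict here; spelling variants would show up as mismatches and need a manual look.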

Can one infer the religious community to which an Indian belongs from his or her name? Intuitively, the answer would be yes: Indians and those familiar with the country certainly develop a pretty good sense for such inferences. And even though names remain only one among several clues (including dress, language, etc.), names alone are sadly often reason enough to discriminate against people (for instance to deny Muslims housing). But most Indians also know the flurry of probing questions along the lines of "What's your name?" - "X" - "No, your full name?" - "X Y" - "Where are you from?" - "Z" - "No, I mean: Hindu?". Clearly, names are not always reliable indicators of an individual's community.

Today's post sheds a probabilistic light on this problem. First, I discuss why it could be useful to infer communities from names. Next, I introduce a name-matching algorithm which I developed to achieve this task (building on others' earlier efforts, and available for download below under the GNU Affero General Public License). Finally, I give a first indication of how well my algorithm works: what's in a name? Your comments are of course highly appreciated - and I apologize in advance for a rather technical post (which is in fact as much a writeup for my own memory as it is meant for you to read). Once I develop empirical applications of this software, I promise more lively prose...