It's almost a year since I began working on a name-matching algorithm to approximate the Muslim population share in Lucknow's mohallas by exploiting the religious connotations of names on the electoral rolls of these areas. This has worked out quite well and has since led to a number of follow-up analyses, several conference papers, new collaborations, an article under review, two more in the pipeline - and last but not least the publication of a large dataset on religion and politics in Uttar Pradesh (featured in my second-to-last post).

One thing kept worrying me, though: the scope of the algorithm varied quite a bit. Across UP's assembly constituencies, for instance, it sometimes managed to categorize 95% of the electorate - and sometimes only 70%. While the accuracy for those names that were identified seemed alright, missing values of up to a third were worrisome. Overwhelmingly, however, they occurred because names in the electoral rolls were simply not covered by indiachildnames.com. There is little I could do about that, I thought.

And then I thought again - and spoke with Kurt Salentin, who introduced me to computational linguistics and the n-gram methodology (thank you!). N-grams are premised on the assumption that specific combinations of words or characters are more frequent in certain languages than in others: if a text contains a good amount of "th", for instance (as in "the" or "atheist" or "bath"), it might well be English. A similar strategy can be used on the character level to match names: Muslim names in North India, for instance, would frequently contain the 3-gram "u-la-h" (as in "Abdullah"), or even just the letter "z" (which appears in Urdu but not in Hindi). An algorithm would not need to know that "Abdullah" is a Muslim name; it would suffice if it knew that a double "l" is slightly more frequent in Muslim names, as is the "u-la-h". Beyond this, probabilities would do their magic...
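To make the intuition concrete, here is a toy sketch in Python of how a name can be dissolved into overlapping character 3-grams. (This uses plain characters for simplicity; my actual tokens are consonant + matra, as described below.)

```python
def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    text = text.replace(" ", "").lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("Abdullah"))
# ['abd', 'bdu', 'dul', 'ull', 'lla', 'lah']
```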

Consequently, I read some more specific literature on short n-gram linguistics [1] and settled on 3-grams with Kneser-Ney smoothing [2] and a manual cut-off if I have fewer than ten characters to work with as the most promising flavour of n-grammology. And I got to work, hacking away on a beautifully sunny day, crammed into my office. Using the SRILM toolkit [3], I first created "language profiles" for all seven religious categories covered by my initial algorithm (Hindu, Muslim, Christian, Sikh, Parsi, Buddhist, Jain). To do this, I dissolved all names on the electoral rolls of Lucknow which I had already identified as Muslim (or Hindu, Christian, ...) into their 3-grams, with a token defined as a consonant plus its accompanying matra (Abdullah, for instance, became Abd, bdu, dul, lla and lah). I then counted the frequency of each possible 3-gram to come up with a kind of fingerprint for names hinting at each of the seven religious categories (I fingerprinted separately for each constituency, to account for regional variation in names). From then on, whenever my algorithm came across a name it could not classify based on its internal database, it would likewise dissolve it into 3-grams and compare their frequency to these fingerprints. If the name matched one fingerprint considerably better than the others (with a threshold empirically determined below), we would have our categorization, even though the name did not show up in the original masterlist!
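For illustration, here is a minimal pure-Python sketch of the fingerprint-and-score logic. The real pipeline uses SRILM with Kneser-Ney smoothing and consonant + matra tokens; plain character trigrams and simple add-one (Laplace) smoothing stand in for both here, and the training names are entirely made up:

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams, as in the sketch above."""
    text = text.replace(" ", "").lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_fingerprint(names, n=3):
    """Pool the trigram counts of all training names in one category."""
    counts = Counter()
    for name in names:
        counts.update(char_ngrams(name, n))
    return counts

def log_prob(name, fingerprint, vocab_size, n=3):
    """Sum of smoothed log probabilities of the name's trigrams."""
    total = sum(fingerprint.values())
    score = 0.0
    for gram in char_ngrams(name, n):
        # add-one smoothing, so unseen trigrams don't zero everything out
        score += math.log((fingerprint[gram] + 1) / (total + vocab_size))
    return score

# Hypothetical training names, just to make the sketch runnable:
profiles = {
    "Muslim": build_fingerprint(["Abdullah", "Zahid", "Nazia", "Feroz"]),
    "Hindu":  build_fingerprint(["Ramesh", "Sunita", "Rajendra", "Kavita"]),
}
vocab_size = len(set().union(*profiles.values()))

scores = {cat: log_prob("Shahnaz", fp, vocab_size)
          for cat, fp in profiles.items()}
print(max(scores, key=scores.get))  # -> "Muslim", one hopes
```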

To see how well this works, however, I did not feed the n-gram algorithm unknown names, but rather names already classified by my original method, which had proven to be fairly accurate. Doing this made it possible to calculate sensitivity, specificity, and positive and negative predictive values for the n-gram module at various cutoff thresholds. While higher thresholds would lead to better accuracy, they would also reduce coverage - but I decided that I would want at least roughly similar accuracy compared to my original algorithm, even at the expense of breadth.
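Concretely, each of the four diagnostics falls out of a 2x2 confusion table per category (here Muslim vs. everything else, with made-up counts):

```python
def diagnostics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),  # true Muslims correctly flagged
        "specificity": tn / (tn + fp),  # non-Muslims correctly rejected
        "ppv": tp / (tp + fp),          # "Muslim" calls that are correct
        "npv": tn / (tn + fn),          # "non-Muslim" calls that are correct
    }

# Illustrative counts only, not my actual validation data:
print(diagnostics(tp=920, fp=80, tn=4750, fn=90))
```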

This trade-off was achieved with a cutoff of 0.75 log probability between the best and second-best bet suggested by the n-gram algorithm (in addition to discarding any names which render, together with their father's names, fewer than ten characters to work with in the first place), resulting in a coverage of 62-78% (varying from constituency to constituency) with only slight dents in accuracy compared to my original algorithm: a sensitivity of 89-94%, specificity of 95-97%, positive predictive value of 85-93% and negative predictive value of 96-97%. Quite encouraging!
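In code, this decision rule comes down to roughly the following (a sketch; `scores` would come from comparing a name against all seven fingerprints, as in the earlier example):

```python
MIN_CHARS = 10      # name + father's name must offer at least this much
MIN_LOG_GAP = 0.75  # best fingerprint must beat the runner-up by this much

def classify(name, father_name, scores):
    """Return the best category, or None if evidence is thin or ambiguous."""
    if len((name + father_name).replace(" ", "")) < MIN_CHARS:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_cat, best), (_, second) = ranked[0], ranked[1]
    if best - second < MIN_LOG_GAP:
        return None  # best and second-best bet too close to call
    return best_cat

# With hypothetical log-probability scores per category:
print(classify("Shahnaz", "Feroz Khan", {"Muslim": -18.6, "Hindu": -19.6}))
```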

And indeed: when I let this n-gram module classify whatever names the original algorithm could not identify, I managed to reduce missing values by at least two-thirds across UP. In most constituencies, the share of names which could not be identified either way now stands at less than 5%, which makes the resulting estimates of Muslim population share much more robust. While these estimates did not change much in the process (half a percent here and there) - testimony to the fact that the missing values were unsystematic in the first place - I am now even more confident than before that my dataset accurately captures the religious population distribution on the local level.

Now, I shall return to more substantive analyses and a decent chunk of writing rather than coding. Meanwhile, the name-matching algorithm including the n-gram module (and language training data from Lucknow - it does pay off to create your own, though!) is attached below under a GNU Affero GPL license. Let me know if you find it useful...