A few weeks ago, I wrote about additional estimates for the accuracy of my name-matching algorithm, and also commented once more on the test corpus of names used to establish these. In the meantime, Gilles gave me full access to his dataset of the social profiles of all MLAs in Uttar Pradesh since independence, i.e. including a manually coded variable for religion (thanks again!). Using his data, I was able to alleviate some potential biases in my original test corpus (particularly in terms of non-Muslim names), making my accuracy estimates more robust still.

The new test corpus consists of three raw name lists: the Haj Qurrah for 2012 (which by law only includes Muslim names, and should be fairly representative of those, as argued earlier), the undergraduate admissions list of Lucknow University under the SC quota (which by law excludes Muslims, but has a bias towards lower economic strata of non-Muslims as well as towards the young), and finally the names of all MLAs since independence (both Muslim and non-Muslim, and arguably with a bias towards higher economic strata as well as older people). The former two lists provide names and fathers' names, the latter provides names and gender. In the overall corpus, the ratio of Muslims to non-Muslims is roughly 50:50 (since the Qurrah is fairly extensive); the following figures weight the corpus to reflect the religious demographics of UP (which does not affect sensitivity and specificity, but renders predictive values more meaningful).
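Why reweighting changes the predictive values but not sensitivity or specificity: the latter two are conditional on the true religion of a name, so the corpus composition cancels out, while predictive values depend on prevalence via Bayes' rule. A minimal sketch of this relationship, using illustrative sensitivity/specificity values and a rough ~19% Muslim population share for UP (both numbers are placeholders, not exact figures from the analysis):

```python
def predictive_values(sens, spec, prevalence):
    """Bayes-rule predictive values for a given population prevalence."""
    p = prevalence
    ppv = sens * p / (sens * p + (1 - spec) * (1 - p))
    npv = spec * (1 - p) / (spec * (1 - p) + (1 - sens) * p)
    return ppv, npv

# Sensitivity and specificity are properties of the classifier alone,
# so they stay fixed; only the predictive values shift with prevalence.
# Illustrative: the unweighted ~50:50 corpus vs. a ~19% population share.
for p in (0.5, 0.19):
    ppv, npv = predictive_values(0.96, 0.99, p)
    print(f"prevalence {p:.2f}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

The lower the prevalence, the more every false positive hurts the PPV, which is why reporting population-weighted predictive values is the more honest choice.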

Accepting this test corpus as a gold standard, the algorithm demonstrates a sensitivity (rate of "true" Muslims identified as Muslims) of 96%, specificity (rate of "true" non-Muslims identified as non-Muslims) of 99%, positive predictive value (rate of "true" Muslims among all those identified as Muslims) of 95% and negative predictive value (rate of "true" non-Muslims among all those identified as non-Muslims) of 99%. This is up 1% in sensitivity but down 3% in PPV compared to the original test corpus (i.e. without the MLA names). Around 5% of names could not be matched (since they were in Latin transliteration, I did not use the additional n-gram module, which would likely have reduced the share of missings). In light of these impressive figures, thresholds (which the algorithm allows, based on its certainty index) did not improve the result much: with a threshold of 10, for instance, coverage shrank by a third with only marginal improvements in accuracy.
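For readers who prefer the four rates spelled out: they all derive from the same confusion-matrix counts. A short sketch, using hypothetical counts chosen only to mirror the reported magnitudes (not the actual corpus tallies):

```python
def rates(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV, NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true Muslims identified as Muslims
        "specificity": tn / (tn + fp),  # true non-Muslims identified as non-Muslims
        "ppv": tp / (tp + fp),          # true Muslims among all flagged as Muslim
        "npv": tn / (tn + fn),          # true non-Muslims among all flagged as non-Muslim
    }

# Hypothetical counts for illustration only.
print(rates(tp=960, fn=40, tn=990, fp=10))
```

Note that with these made-up 50:50 counts the PPV comes out near 99%; the 95% reported above reflects the population-weighted corpus, where the lower Muslim share pulls the PPV down.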

Finally, a gimmick: Gilles also gave me data for MP, Chhatisgarh and Rajasthan. Both MP and Rajasthan had fewer than 50 Muslim MLAs each, though, and Chhatisgarh only 15, which really makes accuracy calculations a bit of a hit-and-miss game. With this caveat in mind, sensitivity was 83%, specificity 99%, PPV 97% and NPV 94% in MP (adjusted for population shares - unadjusted figures have a much worse PPV owing to the small n). In Rajasthan, sensitivity stands at 79%, specificity at 97%, PPV at 93% and NPV at 89%. Chhatisgarh has had too few Muslim MLAs to make these measurements meaningful. Does that tell us that UP works best, then? I think this would be premature - what it tells us above all is that the fewer names to play with, the higher the risk...
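The small-n caveat can be made concrete: a sensitivity estimated from ~50 Muslim MLAs comes with a very wide confidence interval. A sketch using the Wilson score interval (the sample sizes below are illustrative, not the actual state counts):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    phat = successes / n
    denom = 1 + z**2 / n
    centre = (phat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# The same ~80% rate, estimated from 50 names vs. 5,000 names:
print(wilson_interval(40, 50))      # wide interval
print(wilson_interval(4000, 5000))  # much narrower interval
```

With 50 names the 95% interval spans roughly twenty percentage points, so the apparent gap between UP and Rajasthan could largely be sampling noise.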