Whenever I discuss my name-matching algorithm and derivative work, one question comes up: how well does it work outside UP, at other times, for other groups of people? And: what if your test corpus of names (Haj pilgrims and SC students) were non-representative of wider names (a concern particularly strong with the SC list)? Unfortunately, I have no hard and fast answer to these questions; they bother me, too.

But now I have some fresh indicators at least, drawn from work-in-progress by Francesca Jensenius and a team around Christophe Jaffrelot spearheaded by Gilles Verniers. They are looking into the social profiles of MLAs in India since independence, and as a prerequisite came up with a list of the names of all contestants in all elections in all states. Unlike the SC names, this corpus is arguably more elite, and it moves beyond UP, thus nicely complementing my own. On the downside, this list includes neither gender nor fathers' names, and first names are frequently abbreviated - much less material for my algorithm to work with. Most importantly, I only have the bare names from them, not the manual classification (which, as I understand, is still work in progress - once this is done, I could calculate actual sensitivity, specificity, PPV and NPV).
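For what it's worth, once the manual classifications do arrive, the four metrics are straightforward to compute. A minimal sketch, assuming hypothetical `gold` and `pred` lists of labels per candidate (the label strings and the binary Muslim/other framing are my assumptions, not their coding scheme):

```python
def confusion_counts(gold, pred, positive="muslim"):
    """Tally the confusion matrix for one 'positive' category."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    return tp, fp, fn, tn

def metrics(gold, pred):
    tp, fp, fn, tn = confusion_counts(gold, pred)
    return {
        "sensitivity": tp / (tp + fn),  # share of actual Muslims detected
        "specificity": tn / (tn + fp),  # share of non-Muslims correctly left out
        "ppv": tp / (tp + fp),          # how trustworthy a "Muslim" call is
        "npv": tn / (tn + fn),          # how trustworthy a "not Muslim" call is
    }
```

The same tallies work for any other category by swapping the `positive` label.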

But some interesting insights can be drawn already from the extent to which my algorithm was able to classify candidates from various states - and from the average certainty of these classifications (even if actual accuracy might be a different story). Across India, classification was successful for between 84% and 100% of all candidates (with three exceptional cases, which wrongly suggest dismal coverage due to raw data issues). This breadth is comparable to my test corpus, which is remarkable: the algorithm by and large truly covers pan-Indian names. Regional variation is nonetheless observable. Of the ten states with the worst coverage, four are North-Eastern (Nagaland, Meghalaya, Mizoram and Arunachal Pradesh) and three Southern (Tamil Nadu, Karnataka, Pondicherry), suggesting that naming patterns in these areas are more distinct from the rest of India, and harder for my algorithm to grasp (in the South, many people also abbreviate their caste or village names, leaving less data to work with in the first place). The remaining three states less easy to read are Jharkhand, Uttarakhand and Uttaranchal - newly created states with few data points yet, which might have distorted the outcome. The ten states with the best coverage in turn lie mostly in northern and central India: Goa, Punjab, Himachal Pradesh, J&K, Delhi, West Bengal, UP, Bihar, Orissa and Assam. Overall, however, upwards of 84% of candidate names could be categorized even in the worst states - the variation isn't all too troublesome.
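The coverage figures above boil down to a simple per-state tally. A sketch of how I compute them, assuming each candidate record is a hypothetical `(state, classification)` pair where an unclassified name carries `None`:

```python
from collections import defaultdict

def coverage_by_state(records):
    """Share of candidates per state that received any classification at all."""
    total = defaultdict(int)
    classified = defaultdict(int)
    for state, label in records:
        total[state] += 1
        if label is not None:
            classified[state] += 1
    return {state: classified[state] / total[state] for state in total}
```

Sorting the resulting dict by value then yields the best- and worst-covered states directly.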

How reliable are these classifications, though? I obviously can't tell for sure, but I have a proxy measurement: the average certainty index produced by my algorithm. This varies from 0.1 to 36.4 (measured as the difference between the certainty of the best and second-best bet). Again, some of the higher values suggest raw data issues. Discarding these, the algorithm was most confident of its categorizations in the central and northern states again, with the exception of Tamil Nadu and Assam (the other eight being UP, Maharashtra, Haryana, Bihar, Uttarakhand, Uttaranchal, Delhi and Jharkhand). Those with the least certainty in turn are mostly in the South or North-East: Mizoram, Goa, Nagaland, Madras, Arunachal, Mysore, Pondicherry, Manipur, Andhra and Gujarat.
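To make the certainty index concrete: it is simply the margin between the top two scoring categories for a name. A minimal sketch, assuming a hypothetical `scores` dict mapping each candidate category to the algorithm's raw confidence value:

```python
def certainty_margin(scores):
    """Gap between the best and second-best scoring category for one name."""
    ranked = sorted(scores.values(), reverse=True)
    if not ranked:
        return 0.0  # no category matched at all
    if len(ranked) == 1:
        return ranked[0]  # uncontested best bet
    return ranked[0] - ranked[1]
```

Averaging this margin over all names in a state gives the per-state figures discussed above; a small margin means the runner-up category was a close call.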

Overall, the algorithm thus fared better than I feared across India - at least when judged on coverage and average certainty. Robust tests would require a name corpus with known classifications, also to rule out effects of general religious demography. The North-East, for instance, has many Christian names, which the algorithm easily confuses with Muslim ones, reducing certainty measures (similar issues could arise in the South with Jain vs Buddhist vs Hindu names, I suppose). Still, the areas of concern are the South and North-East, which was kind of expected. A firmer assessment of accuracy will have to wait until Francesca and Gilles are done with manual categorization1...

A final afterthought: going back to my post on Muslim names over time in Lucknow, I thought I'd briefly look into coverage and clarity over the years. Which is quite interesting: detection rates plummet in recent years (from the early nineties onwards), but the average certainty of classification rises. I have no idea how to resolve this riddle - yet. I don't even know whether it's a data issue (better data in recent years - but then that should make names easier to categorize on both indicators...) or a substantial one. Any clues?

  • 1. Since it's their data, I will also refrain from telling you the Muslim share of MLAs and candidates across states and decades - but it looks very interesting indeed, if I can trust my algorithmic classification. I hope their analyses progress fast, so that we know for sure...