A few weeks ago, I wrote about additional estimates for the accuracy of my name-matching algorithm, and also commented once more on the test corpus of names used to establish these. In the meantime, Gilles gave me full access to his dataset of the social profiles of all MLAs in Uttar Pradesh since independence, i.e. including a manually coded variable for religion (thanks again!). Using his data, I was able to alleviate some potential biases in my original test corpus (particularly in terms of non-Muslim names), making my accuracy estimates more robust still.

The new test corpus consists of three raw name lists: the Haj Qurrah for 2012 (which by law only includes Muslim names, and should be fairly representative of those, as argued earlier), the undergraduate admissions list of Lucknow University under SC quota (which by law excludes Muslims, but has a bias towards lower economic strata of non-Muslims as well as towards the young), and finally the names of all MLAs since independence (both Muslim and non-Muslim, and arguably with a bias towards higher economic strata as well as older people). The first two lists provide names and fathers' names, the last has name and gender. In the overall corpus, the ratio of Muslims to non-Muslims is roughly 50:50 (since the Qurrah is fairly extensive); for the following figures, I weighted the corpus to reflect the religious demographics of UP (which does not affect sensitivity and specificity, but renders predictive values more meaningful).
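Why reweighting leaves sensitivity and specificity untouched but shifts the predictive values is easiest to see with a small sketch. The confusion counts below are invented for illustration (they are not my actual results), the target Muslim share of 19% is merely an assumed round figure, and the function name `accuracy_metrics` is hypothetical:

```python
def accuracy_metrics(tp, fn, fp, tn, w_pos=1.0, w_neg=1.0):
    """Return sensitivity, specificity, PPV and NPV from confusion counts.

    w_pos rescales all truly Muslim names, w_neg all truly non-Muslim names.
    """
    tp, fn = tp * w_pos, fn * w_pos
    fp, tn = fp * w_neg, tn * w_neg
    return {
        "sensitivity": tp / (tp + fn),  # weight cancels within the Muslim row
        "specificity": tn / (tn + fp),  # weight cancels within the non-Muslim row
        "ppv": tp / (tp + fp),          # mixes both rows, so weights matter
        "npv": tn / (tn + fn),          # mixes both rows, so weights matter
    }


# Hypothetical counts for a corpus with a 50:50 ratio (NOT real results).
tp, fn = 950, 50   # Muslim names: correctly identified / missed
fp, tn = 20, 980   # non-Muslim names: wrongly flagged / correctly rejected

target_share = 0.19  # assumed Muslim population share, for illustration only
corpus_share = (tp + fn) / (tp + fn + fp + tn)

# Post-stratification weights that move the corpus ratio to the target ratio.
w_pos = target_share / corpus_share
w_neg = (1 - target_share) / (1 - corpus_share)

raw = accuracy_metrics(tp, fn, fp, tn)
weighted = accuracy_metrics(tp, fn, fp, tn, w_pos, w_neg)

for key in raw:
    print(f"{key:12s} raw={raw[key]:.3f} weighted={weighted[key]:.3f}")
```

Sensitivity and specificity are each computed within one religious group, so a group-wide weight cancels out. PPV and NPV combine both groups, which is why they depend on the assumed population share.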

It's almost a year since I began working on a name-matching algorithm to approximate Muslim population share in Lucknow's mohallas by exploiting the religious connotations of names on the electoral rolls of these areas. This has worked out quite well, and has since led to a number of follow-up analyses, several conference papers, new collaborations, an article under review, two more in the pipeline - and last but not least the publication of a large dataset on religion and politics in Uttar Pradesh (featured in my second-to-last post).

One thing kept worrying me, though: the scope of the algorithm varied quite a bit. Across UP's assembly constituencies, for instance, it sometimes managed to categorize 95% of the electorate - and sometimes only 70%. While the accuracy of those names which were identified seemed alright, missing values of up to a third were worrisome. Overwhelmingly, however, they occurred because names in the electoral rolls were simply not covered by indiachildnames.com. There was little I could do about that, I thought.

Whenever I discuss my name-matching algorithm and derivative work, one question comes up: how well does it work outside UP, at other times, for other groups of people? And: what if your test corpus of names (Haj pilgrims and SC students) were non-representative of wider names (a concern particularly strong with the SC list)? Unfortunately, I have no hard and fast answer to these questions; they bother me, too.

But now, I have some fresh indicators at least - drawn from work-in-progress by Francesca Jensenius and a team around Christophe Jaffrelot spearheaded by Gilles Verniers. They try to look into social profiles of MLAs in India since independence - and as a prerequisite came up with a list of the names of all contestants in all elections in all states. Unlike the SC names, this corpus is arguably more elite, and it moves beyond UP, thus nicely complementing my own. On the downside, this list includes neither gender nor fathers' names, and first names are frequently abbreviated - much less material for my algorithm to work with. Most importantly, I only have the bare names from them, not the manual classification (which, as I understand, is still work in progress - once this is done, I could calculate actual sensitivity, specificity, PPV and NPV).

Today, I follow up on my initial post on names ("What's in a name?"), which later inspired the map of Muslim Lucknow and my ongoing election analyses. The key idea back then was: if micro-level datasets on religion are unavailable, can we not create our own by making informed guesses about the religion of registered voters - lists of which are readily available? This methodology and its surprisingly high accuracy created quite a bit of excitement over the last few months, and a "research note" on it is on the way to publication (here). It thus seems to be about time to clarify the limits of this strategy: what is not in a name?

One thing that is not - or at least not clearly enough - is sectarian affiliation. Quite a few people who got excited about my earlier posts asked whether the same strategy would also work to separate Shia and Sunni based on their names. This would open up interesting analyses in the case of Lucknow in particular (see here), but I honestly did not think it would fly. People insisted, so I gave it a shot - which by and large confirmed my hesitation: inferring sectarian belonging from names is fraught with difficulties. That much is clearly not in a name.

One of the received wisdoms of my discipline holds that it's usually women who bear the fallout of groupism.1 Women are told to uphold "traditional values", women have to be protected from honor attacks on men, in short: women are the signifier of community. I was thus surprised when I discovered last week that the rise of groupism in India seems to have an impact on male Muslim names - but not on female ones. Many of the most prominent male names among Muslims have a religious connotation, whereas female names tend not to. We also saw that female names are much more diverse, with less clear trends. Take today's picture as an example, an election hoarding in Lucknow's recent municipal polls: the woman candidate is a Saniya - no religious meaning - but her husband (included here, of course, since he runs the show even if his ward became a women's reserved seat this time around) is a Mohammad.

  • 1. A term coined by Rogers Brubaker, which I still adore...