Last week at the AAS in Philly, I had an interesting discussion of votebank politics in India and the importance of spatial variation. My contention was that most politics are local, and that electoral dynamics such as Muslim votebanks (i.e. Muslims voting for certain parties) and the extent of ethnic coordination (i.e. Muslims voting for Muslim candidates) depend on largely local factors. Some people disagreed, many agreed - but it remained a gut feeling. Until, on the flight back, I got an idea how to prove my point. This brief post thus explains at which level votebanks form and operate in India (well, in one instance at least)...

A few weeks ago, I wrote about additional estimates for the accuracy of my namematching algorithm, and also commented once more on the test corpus of names used to establish these. In the meantime, Gilles let me access his dataset of the social profiles of all MLAs in Uttar Pradesh since independence in full, i.e. including a manually coded variable for religion (thanks again!). Using his data, I was able to alleviate some potential biases in my original test corpus (particularly in terms of non-Muslim names), making my accuracy estimates more robust still.

The new test corpus consists of three raw name lists: the Haj Qurrah for 2012 (which by law only includes Muslim names, and should be fairly representative of those, as argued earlier), the undergraduate admissions list of Lucknow University under SC quota (which by law excludes Muslims, but has a bias towards lower economic strata of non-Muslims as well as towards the young), and finally the names of all MLAs since independence (both Muslim and non-Muslim, and arguably with a bias towards higher economic strata as well as older people). The first two lists provide names and fathers' names, the third has name and gender. In the overall corpus, the ratio of Muslims to non-Muslims is roughly 50:50 (since the Qurrah is fairly extensive); for the following figures, I weighted the corpus to reflect the religious demography of UP (which does not affect sensitivity and specificity, but renders predictive values more meaningful).
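To see why reweighting leaves sensitivity and specificity untouched but shifts the predictive values, here is a minimal sketch in Python. The confusion-matrix counts and the assumed Muslim population share of 19% are made up for illustration; they are not my actual results, and the function name is hypothetical.

```python
def accuracy_metrics(tp, fn, fp, tn, prevalence=None):
    """Sensitivity, specificity, PPV and NPV from a 2x2 confusion matrix.

    tp/fn: Muslim names classified correctly / incorrectly
    fp/tn: non-Muslim names classified incorrectly / correctly
    prevalence: share of Muslims to weight towards; defaults to the
    share found in the test corpus itself.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    if prevalence is None:
        prevalence = (tp + fn) / (tp + fn + fp + tn)
    # Bayes' rule: predictive values depend on prevalence,
    # sensitivity and specificity do not.
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    )
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
    }


# Hypothetical counts for a corpus with a 50:50 ratio (illustration only).
counts = dict(tp=950, fn=50, fp=30, tn=970)

balanced = accuracy_metrics(**counts)
# Assumed population share for the weighted version (illustration only).
weighted = accuracy_metrics(**counts, prevalence=0.19)

for label, result in (("balanced", balanced), ("weighted", weighted)):
    print(label, {name: round(value, 3) for name, value in result.items()})
```

With a lower assumed prevalence, the positive predictive value drops and the negative predictive value rises, while sensitivity and specificity stay exactly the same. This is why the weighted figures say more about how far a classification can be trusted in the actual population.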

It's almost a year since I began working on a namematching algorithm to approximate Muslim population share in Lucknow's mohallas by exploiting the religious connotations of names on the electoral rolls of these areas. This has worked out quite well, and since led to a number of follow-up analyses, several conference papers, new collaborations, an article under review, two more in the pipeline - and last but not least the publication of a large dataset on religion and politics in Uttar Pradesh (featured in my second last post).

One thing kept worrying me, though: the scope of the algorithm varied quite a bit. Across UP's assembly constituencies, for instance, it sometimes managed to categorize 95% of the electorate - and sometimes only 70%. While the accuracy of those names which were identified seemed alright, missing values of up to a third were worrisome. Overwhelmingly, however, they occurred because names in the electoral rolls were simply not covered. There is little I could do about that, I thought.

Whenever I discuss my name-matching algorithm and derivative work, one question comes up: how well does it work outside UP, at other times, for other groups of people? And: what if your test corpus of names (Haj pilgrims and SC students) were non-representative of wider names (a concern particularly strong with the SC list)? Unfortunately, I have no hard and fast answer to these questions; they bother me, too.

But now, I have some fresh indicators at least - drawn from work-in-progress by Francesca Jensenius and a team around Christophe Jaffrelot spearheaded by Gilles Verniers. They try to look into social profiles of MLAs in India since independence - and as a prerequisite came up with a list of the names of all contestants in all elections in all states. Unlike the SC names, this corpus is arguably more elite, and it moves beyond UP, thus nicely complementing my own. On the downside, this list includes neither gender nor fathers' names, and first names are frequently abbreviated - much less material for my algorithm to work with. Most importantly, I only have the bare names from them, not the manual classification (which, as I understand, is still work in progress - once this is done, I could calculate actual sensitivity, specificity, PPV and NPV).

First an apology to my readers: this "weekly blog" turned monthly ever since I started writing up this paper, that one, a resubmit and my PhD in general. Add to this Easter holidays and incessant networking now that I am back in Europe - you get the picture. More: I am afraid this state of affairs is likely to continue for a while. But one particular project reached a milestone worth reporting: sharing my dataset on religion and politics in Uttar Pradesh - under an open license.