Can one infer the religious community to which an Indian belongs from his or her name? Intuitively, the answer would be yes: Indians and those familiar with the country certainly develop a pretty good sense for such inferences. And even though names remain only one among several clues (including dress, language, etc), names alone are sadly often reason enough to discriminate against people (for instance to deny Muslims housing). But most Indians also know the flurry of probing questions along the lines of "What's your name?" - "X" - "No, your full name?" - "X Y" - "Where are you from?" - "Z" - "No, I mean: Hindu?". Clearly, names are not always good indicators to gauge an individual's community.
Today's post sheds a probabilistic light on this problem. First, I discuss why it could be useful to infer communities from names. Next, I introduce a name matching algorithm which I developed to achieve this task (building on others' earlier efforts, and available for download below under the GNU Affero GPL license). Finally, I give a first indication of how well my algorithm works: what's in a name? Your comments are of course highly appreciated - and I apologize in advance for a rather technical post (which is in fact as much a writeup for my own memory as it is meant for you to read). Once I develop empirical applications of this software, I promise more lively prose...
Why would you want to do this? There can be many reasons why you might want to infer religion from a person's name. One of the unpleasant ones has already been mentioned: to discriminate against minorities. My reasons are less sinister: I am both continuously fascinated by Indian languages - and continuously deprived of micro-level C-series census data (i.e. data which would allow me to map where Muslims live in Lucknow). One potential workaround for the latter problem would be to interpolate such C-series data from electoral rolls - but for this to work, I would need to be able to match the names therein to religious community.1 So I decided to develop the software introduced below.
There are a number of caveats to this endeavour, however. I am surely not the only one uncomfortably reminded of colonial caste censuses and tabulations; in fact most of my published work so far has painstakingly argued that inferring anything - and especially group identities - about an individual based on statistical abstraction is a highly problematic endeavour. I already hear the voices denouncing the whole idea as outrageously orientalist; one of these voices is indeed my own. Adding to these normative problems are a large number of statistical and computational difficulties. So is it still worth it?
On balance, I think so - for two reasons. Firstly, I intentionally bring the issue into the statistical domain: I am not making essentialist judgments based on dubious grounds, but calculate probabilities based on real-world data. Unlike colonial census bureaucrats, the algorithm introduced below therefore does not claim more precision than is statistically warranted. Secondly, I decided that having the resulting data - murky as it may be - is in many cases still better than having no data at all, which seems pretty much to be the alternative. And while my calculations might not be 100% accurate (and certainly don't claim to be), they are arguably better than relying on hearsay alone.
How can you do this? To infer religious community from names, one basically needs a master name list (coding first- and surnames to community) and an algorithm which matches any given name against this master list and gives its "best bet" as to the religious community of the namebearer. I am not the first one going down this route: two software packages already exist - Nam Pahchan and SANGRA - as do a number of publications based on them, especially in the field of public health (see here). The problem is, however, that the existing software was primarily written to distinguish South Asian from other names (rather than one religious community from another) - and that it is not available under a free license. So I decided to write my own.
Originally, I intended to base my algorithm on a master name list inferred from one of the many Indian matrimonial sites like shaadi.com (which seemed to be a data heaven in many ways). But these sites proved either too hard to crawl technically - or explicitly forbade crawling in their Terms & Conditions. Happily, I then stumbled across another source: those friendly sites that want to help Indian parents find a name for their babies. The most comprehensive one of these is indiachildnames.com, which covers all of India (though my algorithm will later expect North Indian naming conventions, i.e. a first/lastname distinction), and links roughly 23,000 firstnames to religion and gender. As far as surnames go, the picture is more complicated: indiachildnames.com lists 4200 surnames, but apart from Christian and Muslim names - which are marked as such - links these surnames to region/state rather than religious community.2 To at least alleviate this problem (though my surname matching remains unsatisfying overall), I returned to the earlier idea about matrimonial sites - and crawled vivaah.com to identify the religious connotations of those surnames not classified as Christian or Muslim. Even with these corrections, however, my algorithm is far better for these two communities than for all others - so if you know of a good replacement for the surname corpus, please let me know!
Apart from creating a master name list, I wanted to be prepared for input in various scripts, especially Latin transliteration and Devanagari (which, by the way, also inspired the picture for today's post - based on the typeface developed by the hilarious Hinglish Project). I also needed a way to deal with the manifold ways of spelling one single name (think for instance of the notorious Chowdhury, which can be spelled in at least nine different ways in transliteration alone). For the first issue, I decided that the algorithm would internally work with Devanagari Unicode; it uses Google Transliterate whenever it is fed input in any other script. As for the second issue - multiple spellings - I rely on the wonderful IndicSoundex algorithm developed by Santhosh Thottingal at SILPA to be able to match not only how a name is spelled, but also how it sounds.
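To illustrate why matching on sound helps with spelling variants, here is a minimal sketch of a phonetic key. It is a deliberately crude stand-in and not the IndicSoundex algorithm (which works on Indic scripts and is far more careful); the function name `toy_phonetic_key` and its rules are assumptions made up purely for this example.

```python
def toy_phonetic_key(name: str) -> str:
    """Crude phonetic key for Latin-script names (hypothetical, NOT IndicSoundex).

    Keeps the first letter, drops later vowels and the weak letters h, w, y,
    and skips any consonant that repeats the previously kept letter.
    """
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c in "aeiouhwy":
            continue  # vowels and weak consonants carry little signal here
        if c != key[-1]:
            key.append(c)  # avoid doubling the last kept letter
    return "".join(key)


# A few common transliterations of the same surname.
variants = ["Chowdhury", "Chaudhary", "Choudhury", "Chaudhry", "Chowdhry"]
keys = {v: toy_phonetic_key(v) for v in variants}
print(keys)
```

All five spellings collapse onto the same key, so a lookup by key would find a master-list entry regardless of which transliteration it was stored under. The price is more false matches, which is why spelling matches and sound matches are treated separately further down.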
Now the program itself expects a full or partial Indian name as input, and optionally a gender code ('m' or 'f'). It then outputs a list of probable communities to which the namebearer might belong, including the statistical likelihood of each option, and a "best bet". To achieve this, the algorithm first transliterates input into Devanagari (if necessary) and calculates IndicSoundex codes. It also splits the full name into first- and lastnames (based on both the master name list and the relative position within the full name input). For each name, the software then finds all matches in the respective master name list (surname, firstname male or firstname female), both according to spelling (in Devanagari) and according to pronunciation (using the IndicSoundex codes). Finally, each match is assigned a quality factor as probability, based on the clarity and explanatory potential of the respective master name list.3
Here the trouble starts. In many cases, the algorithm will find more than one match (especially when matching by pronunciation - this is the whole point of having soundex matching), and sometimes these matches will indicate more than one community. In this case, the probability needs to be adjusted. Similarly, the algorithm must combine the probabilities individually calculated for each name into one composite probability for the full name initially given as input. These calculations are done in two stages: first combine spelling and pronunciation matches, then combine firstnames and lastnames.
For the following mathematical formulas, I will introduce a concrete example: the fictional "Mohammad Ram Lal Yadav", a person whose gender I assume I don't know (to keep it at least a little simpler than it actually is). To what community would a "Mohammad Ram Lal Yadav" of unknown gender most likely belong? Let us first look at the raw number of matches against the master name lists for spelling and pronunciation (assuming that Yadav is a lastname, and the rest are firstnames). How often do the four names of that fictional person match the baby name list?
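The step of separating first- and lastnames could look roughly like the sketch below. The post only says that the split relies on the master name list and on position within the input, so the concrete rule here is an assumption, as are the function name `split_name` and the one-entry toy surname list. Latin script is used for readability, whereas the real program works on Devanagari.

```python
def split_name(full_name, known_surnames):
    """Split a full name into (first_names, last_names).

    Assumed rule (hypothetical): a token counts as a surname if it appears in
    the surname list and is not the first token. If no token qualifies and the
    name has several tokens, the final token is taken as the surname.
    """
    tokens = full_name.split()
    first, last = [], []
    for position, token in enumerate(tokens):
        if position > 0 and token.lower() in known_surnames:
            last.append(token)
        else:
            first.append(token)
    if not last and len(tokens) > 1:
        last.append(first.pop())  # fall back on position alone
    return first, last


# Toy surname list, made up for this example only.
surnames = {"yadav"}
print(split_name("Mohammad Ram Lal Yadav", surnames))
```

Each resulting name would then be looked up in the matching list (surname, male firstname or female firstname), once by spelling and once by sound code.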
|Spelling||1x Muslim||1x Hindu|
|Pronunciation||3x Muslim||3x Hindu|
As you can see in the table, this very first step throws up 34 different matches. The first calculation to get towards a "best bet" is to integrate spelling and pronunciation matches for each name-community combination. The formula is simple but long: the probability for a community-name combination X equals (one minus ((spelling matches for all communities for this name minus spelling matches for this community for this name) divided by spelling matches for all communities for this name) multiplied by ((pronunciation matches for all communities for this name minus pronunciation matches for this community for this name) divided by pronunciation matches for all communities for this name)) multiplied by quality factor for spelling matches multiplied by quality factor for pronunciation matches. Now that is a mouthful. If we take the example of a combined probability of "Ram" being a Hindu name, the formula would translate into (one minus ((two minus one) divided by two) multiplied by ((five minus three) divided by five)) multiplied by the quality factors. Which would be (one minus (50% multiplied by 40%)) multiplied by the quality factors. Which ends up being 80% multiplied by the quality factors. If we integrate spelling and pronunciation matches for all names in this way, the following table results (ignoring the quality factors for now, to keep it a bit simpler):
|Muslim 100%||Hindu 80%|
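The first-stage formula is easier to follow as code. The sketch below is one reading of the verbal formula above, not the actual community.pl source; the function name `stage_one`, the dictionary layout and the handling of a name with no matches at all are assumptions. The label "other" is a placeholder, since the example does not say which communities the remaining "Ram" matches belong to.

```python
def stage_one(spell_hits, sound_hits, community, q_spell=1.0, q_sound=1.0):
    """Combine spelling and pronunciation matches for one name and one community.

    spell_hits / sound_hits map community -> number of matches for this name.
    q_spell / q_sound are the quality factors (left at 1.0 to ignore them).
    """
    def miss_share(hits):
        # Share of this name's matches that point to some OTHER community.
        total = sum(hits.values())
        if total == 0:
            return 1.0  # assumption: no matches means no evidence either way
        return (total - hits.get(community, 0)) / total

    combined = 1 - miss_share(spell_hits) * miss_share(sound_hits)
    return combined * q_spell * q_sound


# "Ram" from the worked example: 1 of 2 spelling matches and
# 3 of 5 pronunciation matches are Hindu.
ram_spelling = {"Hindu": 1, "other": 1}
ram_sound = {"Hindu": 3, "other": 2}
ram_hindu = stage_one(ram_spelling, ram_sound, "Hindu")
print(f"{ram_hindu:.0%}")  # 80%
```

The structure is that of a "noisy or": the community is ruled out only to the extent that both the spelling evidence and the pronunciation evidence point elsewhere.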
This is good already, but not good enough: I don't want to know whether Ram is a Hindu name - I want to know whether the whole of "Mohammad Ram Lal Yadav" is likely to be a Hindu name or not. In a final step, the algorithm thus integrates the probabilities for each community-name combination into overall probabilities for each community. The formula for this is even longer: combined probability for community X is one minus (((number of entries in table above minus percentage of community X for name A) divided by number of entries in table above) multiplied by ((number of entries in table above minus percentage of community X for name B) divided by number of entries in table above) multiplied by [repeat for each name]). Filled in with our example, the combined probability for "Mohammad Ram Lal Yadav" being a Hindu name would thus be one minus (((seven minus zero) divided by seven) multiplied by ((seven minus 80%) divided by seven) multiplied by ((seven minus 83%) divided by seven) multiplied by ((seven minus 100%) divided by seven)). Which ends up being 33%. After this second stage is calculated, the software spits out its final verdict (this time including quality factors):
./community.pl 'Mohammad Ram Lal Yadav'
Best bet: Hindu with likelihood difference to second best bet of 8%
Hindu with likelihood of 22%
Parsi with likelihood of 14%
Muslim with likelihood of 11%
Christian with likelihood of 2%
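The second-stage formula described above can be sketched in the same way. Again this is an illustration of the verbal formula rather than the program's own code; `stage_two` and its parameter names are assumptions.

```python
from math import prod


def stage_two(per_name_probs, n_entries):
    """Combine per-name probabilities for one community into an overall value.

    per_name_probs: first-stage probabilities (0..1) for this community,
                    one value per name in the full name.
    n_entries:      number of entries in the first-stage table
                    (seven in the worked example).
    """
    return 1 - prod((n_entries - p) / n_entries for p in per_name_probs)


# Hindu values for the four names, as given in the worked example.
hindu = stage_two([0.0, 0.80, 0.83, 1.00], n_entries=7)
print(f"{hindu:.0%}")  # 33%
```

This reproduces the 33% from the worked example, which leaves out the quality factors; the 22% in the program output above includes them.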
If you want to further aggregate from here (depending on your needs), you could do so - one way of arriving at a "best bet" is built in (difference between probability of best bet and second best bet in percent), other options would work through divisions, or again through inverted probabilities. But I will leave it at that - too many formulas already for this post.
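As a small sketch of the built-in "best bet" rule (the difference between the top two likelihoods), using the figures from the output above; the function name `best_bet` and the dictionary input are assumptions for illustration, not the program's interface.

```python
def best_bet(likelihoods):
    """Return (community, margin): the top community and its lead over the runner-up."""
    ranked = sorted(likelihoods.items(), key=lambda item: item[1], reverse=True)
    top, top_p = ranked[0]
    # With a single candidate there is no runner-up, so the margin is its own value.
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    return top, top_p - second_p


result = {"Hindu": 0.22, "Parsi": 0.14, "Muslim": 0.11, "Christian": 0.02}
community, margin = best_bet(result)
print(f"Best bet: {community} with likelihood difference to second best bet of {margin:.0%}")
```

A ratio of the two top values (one of the "divisions" mentioned above) would be a one-line change, but it behaves badly when the runner-up is close to zero.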
For those working on Linux (or familiar with running Perl and Python scripts on Windows), the actual software is available for download below (published, reusable, and modifiable under the GNU Affero GPL license). Because neither indiachildnames.com nor vivaah.com has an explicit scraping policy, I have to assume their copyright, and therefore will not redistribute the master name list as such. But the software package contains a script with which you can scrape your own list (downloading is fine, but redistributing is not - that's the assumption here). For the same reason, I can unfortunately not provide an online version of this tool - you will need to run it locally. For a very rough idea, you can however use the indiachildnames search function...
How well does it work? Now the most interesting question is of course: does all this effort end up somewhere meaningful? Does it help to tell what's in a name? Or at least give a reasonably accurate guess of an individual's religious community? There are at least two questions involved: how good is the algorithm at assigning community clearly - and how well does this assignment reflect the real world, discounting false positives and negatives?
As to the first question, the internal consistency of the master name list is surprisingly high - there are very few ambiguous names in there. Which might either be because there are few ambiguous names in India - though I would dispute this in many cases, in particular for surnames - or because the master name list isn't all too accurate. As for the names that remain unmatched (the "missings"), it is very hard to guess what they might be. This basically requires assumptions about the clarity of naming within different religious communities, mediated by the fact that the master name list includes many more Hindu or Muslim names than, for instance, Buddhist ones. Buddhists might thus be over-represented in the missings either because they have less clear names in the real world or because their master name list is less complete or because of both factors. Clearly, we enter murky territory here - even if the master name list contains close to 30,000 first- and lastnames, which really should cover the most frequent ones.
Or does it? First tests of the algorithm - both on my personal address book and on minor sections of the electoral rolls - further demonstrate that close to 90% of names in either corpus are identified clearly (though with varying probability) by the algorithm. Which really means that the master name list covers a high percentage of frequently used names - and which is better than I hoped. Good!
This does not, however, answer the question whether these 90% clearly identified names are actually correctly identified. My personal address book for one has very few wrong assignments - but then this is arguably not a very good benchmark. The only real way to test this would be to run my algorithm against a corpus which includes names (many, many names) as well as self-reported community (or which includes a mono-religious list of names, such as a list of Haj subsidy recipients or RSS members - thanks to Chris Taylor for this clarification). Then I could compare algorithm results with actual community belonging and so identify all those cases where, for instance, parents give fancy Persian names to Hindu children (a recent trend which a friend in Lucknow alerted me to), where cross-community marriages are involved, or where "Christian" names are adopted not as Christian names but as Western (and thus modern and favoured) names. There are many reasons why the accuracy of my algorithm vis-a-vis real-world data will be lower than its internal consistency of around 90%. The problem is - I don't have such a corpus to test my algorithm against. Any suggestions are clearly welcome - especially since, if you have read this post this far, you will most certainly be an expert in these things :-).4
So in the meantime, I can only wait, hope to find a good test corpus, tweak the algorithm - and let the Oxford servers do their job. In a week or two, I will know more. Till then, I am curious for your comments, suggestions, or links...
- 1. In fact, as I write this post, a powerful server at Oxford crawls through millions of voter names, at a pace of roughly one constituency per day (yes, it takes that much time). Next week, I will then hopefully be able to let you know what the Muslim landscape of Lucknow looks like. I will also be revisiting last week's election maps to assess the accuracy of the prominent Muslim vote urban legend - and tell a number of other interesting stories. Stay tuned!
- 2. A problematic assumption in itself, of course - tentatively reinforcing the wrong idea that Christians and Muslims are somehow not Indian. But that's another topic...
- 3. Basically the percentage of unambiguous names in this list, i.e. names which are clearly assigned to one community and not another
- 4. The only workaround available to me at the moment is to test aggregated population shares for different communities calculated with my algorithm from electoral rolls against actual census data - though this introduces another set of problems (the fact that electoral rolls exclude minors being one of them, time-lag in when the respective body of data was collected another one...)