India's general elections are coming up, and many data folks are looking forward to analyzing and mapping results spatially (assuming, as I also argued last time, that all politics is local). Until very recently, only a few could do this, however, because the basic prerequisite - GIS shapefiles of India's post-delimitation constituencies and polling station localities - was only available commercially (and could easily cost several thousand US dollars). Today, I wish to present a set of draft shapefiles comprising current polling booth localities, assembly constituencies and parliamentary constituencies under an open license, shared in the hope that they enable more visualizations and better spatial analyses of the ongoing elections.

Unlike the only other set of openly licensed shapefiles I am aware of - the handcrafted parliamentary constituency shapefiles recently published by DataMeet after their Bangalore hackathon (which do not yet contain assembly constituencies or polling station localities) - I chose an automated, algorithm-driven approach, working off draft polling station locality data published online by the Election Commission. I processed this data in multiple steps to derive assembly and later parliamentary constituency shapefiles:

1. Using the district code given for each polling station, I discarded polling stations that were clearly out of place (GPS errors, typos, etc.) by matching them against the outer extent of GADM's district boundary shapefiles (with some manual adjustment for cases of re-districting). This is, by the way, one of the reasons why my shapefiles are released under a non-commercial license - they are 'tainted' by GADM material (for which I am tremendously thankful, though).

2. With the help of some friendly people at GIS stackexchange, I calculated heatmaps for each assembly constituency point cloud.

3. Using some GRASS wizardry, I overlaid the heatmaps for each state and calculated the 'hottest' constituency for each and every location, which gave me fairly accurate boundary lines. The two grey shapes in today's post's picture for instance separate the green from the red point cloud (the heatmap approach also effectively dealt with random points out of place, though at the expense of some accuracy).

4. This was followed by a series of cleanup steps, making sure that small 'islands' were deleted, that each assembly constituency was represented by only one contiguous area polygon, etc. The last of these steps was to cut the result with GADM's state boundaries - you will instinctively see the exact border line as opposed to the otherwise approximate constituency boundaries.

5. Finally, using the EC's Delimitation order, I matched in parliamentary constituency codes and names, dissolved the assembly shapefiles into parliamentary ones - and packaged everything together.
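Steps 1 to 3 above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the GRASS workflow actually used for the published files: the function names, the Gaussian kernel, the bounding box standing in for the district's "outer extent", and the `cell_size` and `bandwidth` parameters are all assumptions made for the example.

```python
import math
from collections import defaultdict


def filter_by_extent(stations, bbox):
    """Step 1 (simplified): keep stations inside the district's extent.

    stations: list of (constituency_code, x, y)
    bbox: (min_x, min_y, max_x, max_y), a hypothetical stand-in for the
          outer extent of a district boundary polygon.
    """
    min_x, min_y, max_x, max_y = bbox
    return [
        (code, x, y)
        for code, x, y in stations
        if min_x <= x <= max_x and min_y <= y <= max_y
    ]


def heat(cell, points, bandwidth):
    """Step 2: heat at a cell = sum of Gaussian kernels around each point."""
    cx, cy = cell
    return sum(
        math.exp(-((cx - x) ** 2 + (cy - y) ** 2) / (2 * bandwidth ** 2))
        for x, y in points
    )


def hottest_grid(stations, bbox, cell_size, bandwidth):
    """Step 3: label every grid cell with its 'hottest' constituency."""
    clouds = defaultdict(list)
    for code, x, y in filter_by_extent(stations, bbox):
        clouds[code].append((x, y))
    if not clouds:
        raise ValueError("no polling stations left inside the extent")

    min_x, min_y, max_x, max_y = bbox
    cols = int(round((max_x - min_x) / cell_size))
    rows = int(round((max_y - min_y) / cell_size))

    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Evaluate the heat at the centre of each cell.
            centre = (min_x + (c + 0.5) * cell_size,
                      min_y + (r + 0.5) * cell_size)
            row.append(max(
                clouds,
                key=lambda code: heat(centre, clouds[code], bandwidth),
            ))
        grid.append(row)
    return grid
```

The boundary lines then fall wherever neighbouring cells carry different labels. A stray point inside the extent adds only a little heat far from its own cloud, which is in keeping with the remark that the heatmap approach tolerates random out-of-place points.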

A key advantage of this algorithm-driven method over manual geocoding is speed (I am a constantly time-strapped grad student after all), but this comes at the disadvantage of reduced accuracy. More specifically, there are two sources of inaccuracy here. Firstly, the outcome's quality depends largely on the quality of the Election Commission's raw point data - which varies from district to district. Secondly, the heatmap approach itself also tends to create 'smoother' boundaries, particularly where there are very few polling stations (in really remote areas, think northern Sikkim) or many points in one place (in crowded urban areas, think Bangalore). But I believe the result still reflects the 'space under influence' of one constituency well enough, definitely good enough for visualization purposes - and better than the alternative: not being able to map elections at all.
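To see why point density matters, here is a small one-dimensional sketch. It assumes a plain summed Gaussian heatmap, which may differ from the kernel and parameters actually used, and the station positions are made up. A crowded constituency accumulates far more heat than a sparse neighbour, so the line where it becomes "hottest" shifts towards the sparse side. The `normalise` option (dividing by the number of stations) is a hypothetical mitigation shown only for contrast, not the adjustment applied to the published files.

```python
import math


def heat(x, stations, bandwidth):
    """Summed Gaussian heat of a 1-D point cloud at position x."""
    return sum(
        math.exp(-((x - s) ** 2) / (2 * bandwidth ** 2)) for s in stations
    )


def boundary(a, b, bandwidth, normalise=False, step=0.01):
    """Scan from cloud a towards cloud b; return the first x where b is hotter."""
    start, end = min(a), max(b)
    n_steps = int((end - start) / step) + 1
    for i in range(n_steps):
        x = start + i * step
        ha = heat(x, a, bandwidth)
        hb = heat(x, b, bandwidth)
        if normalise:
            # Hypothetical mitigation: compare average heat per station.
            ha, hb = ha / len(a), hb / len(b)
        if hb > ha:
            return x
    return None


# Made-up example: a sparse rural cloud and a dense urban cluster.
sparse = [0.0, 1.0]
dense = [4.0] * 50

raw_line = boundary(sparse, dense, bandwidth=1.0)
normalised_line = boundary(sparse, dense, bandwidth=1.0, normalise=True)
```

With these made-up inputs, `raw_line` lands much closer to the sparse cloud than `normalised_line` does, which illustrates how a dense cluster can claim territory beyond its own point cloud.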

But judge for yourself: below, you will find draft versions of these shapefiles, including (state by state) polling stations (raw and cleaned up), assembly constituencies and parliamentary constituencies. These are published under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, which basically means you can use them for any non-commercial purpose as long as you attribute me and this blog post as the source and share any additions or modifications on equal terms.

If you are familiar with one particular state and competent in GIS, it would help tremendously if you could have a look at the files and suggest corrections, either in the comment box below, or via email. While this post goes online, not all states are yet available, but I will add them as they come out of my algorithm over the coming days (the larger a state by area, and the more constituencies therein, the longer it takes). I intend to gather feedback until the end of April, and then publish a final set of shapefiles before counting starts on May 16. Thank you!

EDIT: Thanks to the amazing data services at my university, the final shapefiles are now published and permanently archived after slightly adjusting the algorithm (see comment stream below) at http://dx.doi.org/10.4119/unibi/2674065

Johnson's picture

Raphael,

Great work! Just wanted to know if/when Maharashtra will be available.

Thanks!

Raphael Susewind's picture

The planned schedule is Chhattisgarh, Haryana, Punjab, Assam, Kerala, Orissa, Gujarat, Rajasthan, Karnataka, MP, Tamil Nadu, Bihar, and then - tada! - Maharashtra. And then West Bengal, AP and UP. Small ones (in terms of number of constituencies) first, rest later - watch Twitter for announcements: https://twitter.com/RaphaelSusewind

Devdatta's picture

Hi Raphael,

This is certainly an interesting approach to organically build up the constituencies from polling stations.

I am more knowledgeable about two large states: Maharashtra and Rajasthan, but those have not been generated.

I also know a little bit about Uttaranchal and Delhi, so I decided to check those out.

For reference, I used the maps available from NIC at: http://ecimaps.gisserver1.nic.in/

Because we are generating this information based on two main data sources, GADM and the polling station information, the result depends on the accuracy of these two sources. The data in GADM might be slightly old, but it is quite accurate for that date. The polling station data, however, is horrendously incorrect. (I have personally checked the polling station data of Mumbai manually, and over 40% had some major issue.) The problems are in the geometry as well as in the attributes.

Because of this, the end product is also incorrect. This problem is exacerbated in regions of low population density, which in turn have few polling stations.

But it's not all bad. I'm quite surprised to see how well the data matches the official boundaries in places where it does.

I think we might have to try to correlate this data with some other data source to improve it further. As to what this could be, I have no answer right now.

But it's a great start, and let's hope that we can improve this data further.

Raphael Susewind's picture

Devdatta, you are absolutely right. Since you have seen Uttaranchal and Delhi, you have seen about the worst and best example so far (look at the point clouds and you will see what I mean). Haryana, Chhattisgarh and Punjab (just uploaded) look a bit better. It seems the heatmap approach gives its best results in "medium-density" situations. That covers the heatmap inaccuracies - the raw data inaccuracies are out of reach (and worst in Uttarakhand, of all states published so far). Unfortunately, my EC contacts tell me the private company to whom this was outsourced considers the job over and done. So we should not expect better raw data anytime soon...

avinash's picture

Hi Raphael,
Quick question. How did you get the polling station coordinates from the ECI site? Did you scrape them? If so, it would be great if you could give us a brief overview.

Avinash

Raphael Susewind's picture

Yep, scraped them. Was more difficult than your usual JavaScript-infested site though, because of rolling keys and Google AJAX nightmares, if I remember correctly. Anyway, the solution was to automate Firefox using the MozRepl plugin to display polling booths district by district (which simulated actual use beyond whatever WWW::Mechanize can do) while routing these requests through a custom-written proxy server, which then extracted the interesting bits and pieces from the raw HTTP stream...

Jeff Weaver's picture

Thanks so much for this wonderful work you're doing!

You may already have these/be planning on this, but I have post-delimitation AC boundaries for Bihar (from ML InfoMap) and could compare your output from Bihar to those once it is complete. Due to licensing issues, I suspect that this couldn't directly be used to modify your maps, but it might give you a better sense of how accurate the method is.

Best,

Jeff

Raphael Susewind's picture

Thanks Jeff - Bihar just went online, and looks pretty good at first glance, even in urban Patna (usually urban areas are troublemakers: raw data is more distorted because GPS measurement errors are both more likely and more severe in relation to the small size of urban constituencies, and the heatmap approach leads to "overheating" situations which distort the constituency's shape; for the latter at least, I have a workaround in the pipeline). Please do have a look and let me know...

Jeff Weaver's picture

Adding the comments I emailed you to the blog:

The main discrepancies that I noticed were:
1. According to http://gis.bih.nic.in/giselection.html, the ML InfoMap maps (which I was checking against) and the heat map generated shapes are both a little off for the Patna ACs. I think yours are actually a little closer to the correct ones than ML InfoMap's though!
2. Sahebganj AC is a bit off - it incorporates some of the Paroo AC
3. In the southwest, Buxar/Rajpur/Dumraon are not quite right, each is taking a piece of the other.
4. In the heat maps, there is a Muzaffarpur AC between Gaighat and Aurai ACs, but on the official Bihar website (and my ML InfoMap maps), that AC is located between Kanti, Kurhani and Bochana ACs. From regular maps, it seems like the official Bihar website is correct; I am not sure why the polling station coordinates would be systematically off there.

Other than that, there are the expected slight differences, but it is incredible how close the fit is! Thanks so much for sharing the maps with all of us, this is going to be a great resource going forward.

Dilip Damle's picture

You have done a great job. I bumped into this only today and have so far seen just one map, of Goa.
You mentioned that your algorithmic approach tends to create smoother boundaries. Just a wild thought about that, which perhaps you could implement later: using the drainage network, wherever a boundary comes near or runs along a river, snap it to the river.
Looking forward to the final shapefiles.

Raphael Susewind's picture

Thanks Dilip for the suggestion. I went with a simpler solution, though, and readjusted some parameters in the algorithm which to me seems to make urban areas more accurate. Anyways: The final dataset is now online at http://dx.doi.org/10.4119/unibi/2674065. Thanks to all who provided feedback!
