India's general elections are coming up, and many data folks are looking forward to analyzing and mapping the results spatially (assuming, as I also argued last time, that all politics are local). Until very recently, only a few could do this, however, because the basic prerequisite - GIS shapefiles of India's post-delimitation constituencies and polling station localities - was only available commercially (and could easily cost several thousand US dollars). Today, I wish to present a set of draft shapefiles comprising current polling booth localities, assembly constituencies and parliamentary constituencies under an open license, shared in the hope that they enable more visualizations and better spatial analyses of the ongoing elections.

Unlike the only other set of openly licensed shapefiles I am aware of - the handcrafted parliamentary constituency shapefiles recently published by DataMeet after their Bangalore hackathon (which do not yet contain assembly constituencies or polling station localities) - I chose an automated, algorithm-driven approach, working off draft polling station locality data published online by the Election Commission. I processed this data in multiple steps to derive assembly and later parliamentary constituency shapefiles:

1. Using the district code given for each polling station, I discarded polling stations that were clearly out of place (GPS errors, typos, etc.) by matching them against the outer extent of GADM's district boundary shapefiles (with some manual adjustment for cases of re-districting). This is, by the way, one of the reasons why my shapefiles are released under a non-commercial license - they are 'tainted' by GADM material (for which I am tremendously thankful, though).

2. With the help of some friendly people at GIS stackexchange, I calculated heatmaps for each assembly constituency point cloud.

3. Using some GRASS wizardry, I overlaid the heatmaps for each state and calculated the 'hottest' constituency for each and every location, which gave me fairly accurate boundary lines. The two grey shapes in today's post's picture for instance separate the green from the red point cloud (the heatmap approach also effectively dealt with random points out of place, though at the expense of some accuracy).

4. This was followed by a series of cleanup steps, making sure that small 'islands' were deleted, that each assembly constituency was represented by only one contiguous area polygon, etc. The last of these steps was to cut the result with GADM's state boundaries - you will immediately see the exact state border line as opposed to the otherwise approximate constituency boundaries.

5. Finally, using the EC's Delimitation order, I matched in parliamentary constituency codes and names, dissolved the assembly shapefiles into parliamentary ones - and packaged everything together.
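For readers who want to see the idea behind steps 1, 2, 3 and 5 in code, here is a minimal pure-Python sketch. It is not the GRASS workflow I actually ran, just an illustration under stated assumptions: the Gaussian kernel, the `bandwidth` value, the coarse grid and all function names are hypothetical, and the district check uses a simple bounding box as a stand-in for the outer extent of a GADM district.

```python
import math

def within_extent(point, extent):
    """Step 1 (simplified): keep a station only if it lies inside the
    bounding box (min_x, min_y, max_x, max_y) of its district."""
    x, y = point
    min_x, min_y, max_x, max_y = extent
    return min_x <= x <= max_x and min_y <= y <= max_y

def density(cell, points, bandwidth):
    """Step 2: heatmap value at a cell centre, here a sum of Gaussian
    kernels (the kernel choice is an assumption of this sketch)."""
    cx, cy = cell
    return sum(
        math.exp(-((cx - x) ** 2 + (cy - y) ** 2) / (2 * bandwidth ** 2))
        for x, y in points
    )

def hottest_constituency(cell, stations_by_ac, bandwidth):
    """Step 3: the constituency whose heatmap is 'hottest' at this cell."""
    return max(
        stations_by_ac,
        key=lambda ac: density(cell, stations_by_ac[ac], bandwidth),
    )

def label_grid(stations_by_ac, extent, n_cols, n_rows, bandwidth):
    """Label every grid cell with its hottest assembly constituency.
    Row 0 is the southernmost row, column 0 the westernmost column."""
    min_x, min_y, max_x, max_y = extent
    dx = (max_x - min_x) / n_cols
    dy = (max_y - min_y) / n_rows
    grid = []
    for r in range(n_rows):
        cy = min_y + (r + 0.5) * dy
        row = []
        for c in range(n_cols):
            cx = min_x + (c + 0.5) * dx
            row.append(hottest_constituency((cx, cy), stations_by_ac, bandwidth))
        grid.append(row)
    return grid

def dissolve(grid, ac_to_pc):
    """Step 5 (raster analogue): relabel assembly constituencies with
    their parliamentary constituency, merging adjacent areas."""
    return [[ac_to_pc[ac] for ac in row] for row in grid]

# Toy data: AC1 has one stray station at (9, 9), deep inside AC2 territory.
stations = {
    "AC1": [(1.0, 1.0), (1.5, 1.2), (9.0, 9.0)],
    "AC2": [(8.0, 8.0), (8.5, 8.2), (9.0, 8.5)],
}
state_extent = (0.0, 0.0, 10.0, 10.0)
ac_grid = label_grid(stations, state_extent, n_cols=2, n_rows=2, bandwidth=2.0)
pc_grid = dissolve(ac_grid, {"AC1": "PC1", "AC2": "PC1"})
```

In the toy data, the stray AC1 station is simply outvoted by the three AC2 stations around it, which is the same effect that let the heatmap approach cope with random points out of place. The real boundaries then come from vectorizing such a labelled raster, which this sketch leaves out.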

A key advantage of this algorithm-driven method over manual geocoding is speed (I am a constantly time-strapped grad student after all), but this comes at the cost of reduced accuracy. More specifically, there are two sources of inaccuracy here. Firstly, the outcome's quality depends largely on the quality of the Election Commission's raw point data - which varies from district to district. Secondly, the heatmap approach itself also tends to create 'smoother' boundaries, particularly where there are very few polling stations (in really remote areas, think northern Sikkim) or many points in one place (in crowded urban areas, think Bangalore). But I believe the result still reflects the 'space under influence' of one constituency well enough, definitely good enough for visualization purposes - and better than the alternative: not being able to map elections at all.

But judge for yourself: below, you will find draft versions of these shapefiles, including (state by state) polling stations (raw and cleaned up), assembly constituencies and parliamentary constituencies. These are published under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, which basically means you can use them for any non-commercial purpose as long as you attribute me and this blog post as the source and share any additions or modifications on equal terms.

If you are familiar with one particular state and competent in GIS, it would help tremendously if you could have a look at the files and suggest corrections, either in the comment box below or via email. As this post goes online, not all states are available yet, but I will add them as they come out of my algorithm over the coming days (the larger a state by area, and the more constituencies therein, the longer it takes). I intend to gather feedback until the end of April, and then publish a final set of shapefiles before counting starts on May 16. Thank you!

EDIT: Thanks to the amazing data services at my university, the final shapefiles are now published and permanently archived after slightly adjusting the algorithm (see comment stream below) at http://dx.doi.org/10.4119/unibi/2674065