Viewing Population within a Country: How to Create Cartograms and Visualize Centroids with Python

Greg Feliu
8 min read · Nov 25, 2020

All maps lie.

Allow me to explain: every map is designed to show a specific feature of our world. Every detail of the map highlights features that the mapmaker wants the viewer to see. The projection, the words, the colors, everything helps communicate a certain viewpoint.

Map of light pollution over Italy (lighter color means more light pollution).

In most cases, maps successfully show political boundaries, major cities, land area, and water area. These are all features we expect from maps. One feature that is rarely shown, though, is the population of political units. In order to demonstrate how population within political units can be mapped, I will take the case of the populations within Italian provinces (equivalent to U.S. counties). With this data, we will learn more about where people live in Italy by using cartograms and by visualizing centroids.

Along the way, I’ll show the reader some neat features of the Python libraries GeoPandas and geoplot. Both libraries are easy to use and extremely helpful.

Without further ado, let’s get started!

Libraries

In order to start this project, I highly recommend using the Python libraries GeoPandas and geoplot (to my knowledge, geoplot is the only Python library that creates cartograms). Installing GeoPandas also installs the Shapely library, which we will need as well. Therefore, the first step is the following:

pip install geopandas
pip install geoplot

Data

In order to view Italy’s population, we need two things: data about the population of Italy’s provinces and the “shapes” of those provinces. The need for the former is obvious; the latter, less so. The shapes are simply the boundaries of each political unit of interest. They need to be converted into Python objects which give the precise coordinates used to plot the units.

For the population data, we turn to our trusted friend Wikipedia. The table of data can be converted into a CSV file in many different ways. For me, the easiest method is to use the extremely helpful Wikipedia table converter tool. Once converted, we download the results as a CSV file.

The province shape data was much more difficult to find. In fact, the difficulty of finding this data almost made me give up on the whole project! Eventually, I found out that the data could be downloaded from a repository maintained by NYU. This data was exported as a GeoJSON file, which can be read by GeoPandas (a shapefile is another option).

Data Engineering

The first task in this project is to combine the two data sources into a single GeoDataFrame (a GeoPandas object). In order to do this, I opened up Jupyter Notebook and imported the previously mentioned libraries, along with the ever reliable Pandas library.

import geopandas as gpd
import geoplot as gplt
import geoplot.crs as gcrs  # map projections, used for the cartograms below
import pandas as pd

Additionally, I imported the population data into a Pandas DataFrame, and the province shape data into a GeoPandas GeoDataFrame.

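# GeoPandas detects the GeoJSON format on its own, so no driver argument is needed
prov_gdf = gpd.read_file("../data/it_provinces_shapes.geojson")
# skipfooter is only supported by the Python parsing engine
provinces_df = pd.read_csv("../data/provinces.csv", skipfooter=1, engine='python')

Example data from each of the two DataFrames.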

Next, I removed columns that were unnecessary for my purposes (“President”, “adm1_code,” etc.). After this, I corrected the datatypes for columns as needed. For example, numbers were listed with commas in the “Population (2019)[3]” column in provinces_df. Pandas interprets this column as an object. In order to correct the datatypes for these columns, I did:

provinces_df['Population'] = provinces_df['Population (2019)[3]'].str.replace(',', '').astype(int)

In this one line of code, we use the string accessor to remove the commas and then convert the resulting strings into ints. With this change, we can show the population as a continuous variable, instead of an ill-conceived categorical one.
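
As a small, self-contained illustration of that conversion, the sketch below runs the same idea on a made-up three-row frame. The province names and figures are placeholders rather than the Wikipedia data, and `clean_population` is a hypothetical helper name of my own.

```python
import pandas as pd


def clean_population(column: pd.Series) -> pd.Series:
    """Strip thousands separators from a string column and cast it to int."""
    # regex=False treats ',' as a literal character rather than a pattern
    return column.str.replace(",", "", regex=False).astype(int)


# Placeholder rows standing in for the scraped table (values are invented)
demo_df = pd.DataFrame({
    "Province": ["A", "B", "C"],
    "Population (2019)[3]": ["1,234,567", "89,012", "345"],
})
demo_df["Population"] = clean_population(demo_df["Population (2019)[3]"])
print(demo_df["Population"].tolist())
```

If you control the read step, `pd.read_csv(..., thousands=',')` is another option that parses such numbers while loading the file.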

A problem with the shapes data also had to be corrected. Some province shapes have changed since 2015 (when the shape data was collected). This was mainly an issue for southern Sardinia, where the 4 provinces of 2015 became 2 in 2016. Since the shape of one of these new provinces was not provided to me, I had to combine all four shapes into one. The before and after can be seen below.

South Sardinia before and after conversion of the geometries.

In order to combine these four shapes, one needs to use GeoDataFrame.dissolve() (I use this method for all other aggregations in this project). The method merges all rows that share a value in a chosen column, so every row of an intended unit must carry the same label. In this case, I labelled all four units as “South_Sardinia” in the “Code” column.

prov_gdf2 = prov_gdf.dissolve(by='Code')
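
The relabelling step that comes before the dissolve can be sketched with plain Pandas. The codes "P1" to "P5" and the area values are invented for illustration (they are not the real province identifiers), and the groupby here only stands in for what dissolve does to the geometries.

```python
import pandas as pd

# Hypothetical codes for the four 2015 provinces (not the real identifiers)
OLD_CODES = ["P1", "P2", "P3", "P4"]

demo = pd.DataFrame({
    "Code": ["P1", "P2", "P3", "P4", "P5"],
    "area_km2": [10.0, 20.0, 30.0, 40.0, 50.0],  # invented values
})

# Give all four rows the same key so they collapse into one unit
demo.loc[demo["Code"].isin(OLD_CODES), "Code"] = "South_Sardinia"

# groupby/sum plays the role that dissolve(by="Code") plays for geometries
merged = demo.groupby("Code", as_index=False)["area_km2"].sum()
print(merged)
```

On a real GeoDataFrame, dissolve also takes an aggfunc argument (default 'first') that controls how the non-geometry columns of the merged rows are combined.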

Great! Our GeoDataFrame now has the correct data! Time to merge the population data to the GeoDataFrame.

provinces_data_gdf = prov_gdf2.merge(provinces_df, on='Code', how='inner')
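
One caveat: an inner merge silently drops any row whose “Code” appears in only one of the two tables. A quick way to spot such rows beforehand, sketched here on invented codes and populations, is an outer merge with Pandas’ indicator option. The frame names `shapes` and `population` are placeholders of mine.

```python
import pandas as pd

# Invented stand-ins for the shape table and the population table
shapes = pd.DataFrame({"Code": ["P1", "P2", "P3"]})
population = pd.DataFrame({"Code": ["P2", "P3", "P4"],
                           "Population": [100, 200, 300]})

# indicator=True adds a "_merge" column saying where each key was found
check = shapes.merge(population, on="Code", how="outer", indicator=True)

# Keys found in only one table would be lost by how='inner'
unmatched = check.loc[check["_merge"] != "both", ["Code", "_merge"]]
print(unmatched)
```

If `unmatched` is empty, the inner merge keeps every province.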

Now that our data is ready, time to start mapping it!

Cartograms

A cartogram is “a map in which the geometry of regions is distorted in order to convey the information of an alternate variable” (source). In essence, instead of emphasizing the size (i.e., land area) of the unit, a continuous variable is shown as the size of the unit. This is best shown with an example:

To learn more about the relative population of the Italian provinces, we can plot the distorted size of the province next to its representation on the map.

# gcrs is geoplot.crs (import geoplot.crs as gcrs)
ax1 = gplt.cartogram(provinces_data_gdf, scale='Population', projection=gcrs.AlbersEqualArea(), figsize=(8, 8), limits=(0.1, 0.9), color='green')
# Overlay the true province outlines for comparison
gplt.polyplot(provinces_data_gdf, facecolor='white', edgecolor='grey', ax=ax1)
ax1.set_title("Cartogram of Italian Provinces", fontdict={"fontsize": 15})

A cartogram of Italian provinces showing population.

As we can see above, most provinces cover a large area but have relatively few people. With these limits, the most populous province is drawn at 90% of its actual size, while the least populous is drawn at 10%. Rome, the most populous province (center), is almost entirely green, whereas most of the provinces in Sardinia, for example, are much, much smaller than they appear on a typical map. With good reason, too: the Rome province has more than twice as many people as all of Sardinia! With the cartogram, this becomes very clear.
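
To make the 10% and 90% figures concrete, here is a sketch of how a population could be turned into a scale factor. It assumes the default scaling is linear between the smallest and largest values, which is my reading of the limits parameter rather than something verified against geoplot’s source. The function name and the populations are invented.

```python
def linear_scale_factor(value, vmin, vmax, limits=(0.1, 0.9)):
    """Map a value in [vmin, vmax] linearly onto [limits[0], limits[1]]."""
    low, high = limits
    if vmax == vmin:
        return high  # all units equal: draw every shape at the upper limit
    return low + (high - low) * (value - vmin) / (vmax - vmin)


# Invented populations, chosen only to make the arithmetic easy to follow
populations = {"smallest": 100_000, "middle": 2_100_000, "largest": 4_100_000}
vmin = min(populations.values())
vmax = max(populations.values())

factors = {name: linear_scale_factor(pop, vmin, vmax)
           for name, pop in populations.items()}
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}")
```

Under this assumption, a province halfway between the extremes is drawn at half of its true width and height.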

Italian regions distorted according to their population.

I also used the cartogram to view larger aggregations of the data. Just as I did when aggregating the “South_Sardinia” province, I used GeoDataFrame.dissolve() to aggregate provinces. Regions, for example, show large disparities in population. Lombardy, the most populous region, has more than 80 times as many people as the least populous region, Aosta Valley (top left)!

Italian cultural regions distorted according to their population.

Similarly, Italy is often divided into (two, sometimes three) cultural regions: North, (Central), and South Italy (the line dividing them isn’t agreed upon). From the cartogram (left), we can clearly see the North is the most populated, followed by the South, then by Central Italy.

Centroids

The “center” of a political unit can teach us other things about the regional units. What does the center mean, though? One can interpret this in many ways, of which I will focus on two: the geographic center, and the center of population. The geographic center is, as one would expect, the center of the political unit’s shape. This is a kind of baseline that tells us the center if everything were perfectly balanced.
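
For a single simple polygon, the geographic center is the area-weighted centroid, which is what I would expect Shapely’s .centroid to return. The dependency-free sketch below computes it with the shoelace formula on a made-up L-shaped outline. It ignores holes and multi-part shapes, and `polygon_centroid` is a hypothetical helper, not part of any library.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    twice_area = 0.0
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # Standard formula divides by 6 * area, i.e. 3 * twice_area
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


# Invented L-shaped outline, listed counter-clockwise
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
center = polygon_centroid(l_shape)
print(center)
```

Note that this is not the same as averaging the vertices: for the L-shape above, the vertex average is (1, 1), while the centroid sits closer to the corner where most of the area lies.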

The center of population, or more accurately the “weighted mean population centroid,” is the balance point of the population within that political unit (see this excellent blog on how to calculate this coordinate or check out this project’s GitHub repo). If every person carried equal weight, the political unit would balance on this point; because it is a weighted mean, the two sides need not hold exactly equal numbers of people. The population centroid, when plotted next to the geographic centroid, tells us roughly in which direction the population is skewed within a political unit. Let’s look at the region centroids for an example.

To find the geographic centroid, we replace the geometry column with .centroid for each shape. Simple enough. The weighted centroid is a bit more complicated to compute: one has to find the geographic centroid of each province, multiply its coordinates by that province’s population, sum the results, and divide by the total population of the aggregate (details can be found here). These coordinates are then plotted.

# Finding geographic centroid (geographic_centroids starts out as a copy of the regions GeoDataFrame)
geographic_centroids['geometry'] = geographic_centroids['geometry'].centroid
# Plotting both centroids on one set of axes
ax = geographic_centroids.plot(color='black', figsize=(5, 5))
population_weighted_centroids.plot(ax=ax, color='green')
# Region outlines for context
gplt.polyplot(geometries, facecolor='white', edgecolor='grey', ax=ax)
ax.set_title("Region Centroids", fontdict={'fontweight': 'bold'})

Geographic and population weighted centroids for Italian regions.
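
The population-weighted centroid described above comes down to a weighted average of coordinates. Here is a minimal sketch with three invented province centroids and populations. `weighted_centroid` is a hypothetical helper and not necessarily how the project’s repository implements it.

```python
def weighted_centroid(points, weights):
    """Weighted mean of (x, y) points; the weights are populations here."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return x, y


# Invented province centroids and populations for one aggregate region
province_centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0)]
province_populations = [100, 300, 100]

pop_center = weighted_centroid(province_centroids, province_populations)
plain_center = weighted_centroid(province_centroids, [1, 1, 1])
print(pop_center, plain_center)
```

The populous second province pulls the weighted point toward x = 4, which is the kind of skew the maps in this section make visible. Averaging longitude and latitude directly is an approximation that is usually fine at the scale of a region; projecting to planar coordinates first is the more careful route.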

The centroids are sometimes the same, sometimes different, depending on the region. Sicily (the island at the bottom), for example, has nearly identical centroids, meaning that its population is balanced around its geographic center. Apulia (the “heel” of Italy) shows the largest difference between the centroids: its population is skewed south of the geographic center.

As I did with the cartograms, I also plotted the centroids using the cultural regions. In this map, we see that the population of the South is skewed east. This isn’t very surprising because, as stated before, Sardinia has a relatively small population. The other cultural regions are more evenly balanced.

Last, I plotted the centroids for all of Italy. To my surprise, there was almost no difference between the two centroids! It turns out that the center of Italy is also the center of Italy’s population, meaning that the population is balanced around the country’s geographic center.

Alternative Ways to View Population

The methods shown here are only two of the many ways to show population. A more precise visual of where people live in Italy would need more granular data. If we had population at the town/city level, we could use more precise maps such as a geospatial scatter plot or a Kernel Density Estimation (KDE) plot. Both can be plotted using geoplot.

One of the more interesting, and potentially more accurate, methods of mapping the population distribution uses light pollution (see uppermost image). Where people live, light must follow. In fact, researchers have attempted to find the population centroid using this approach. If each person had the same probability of creating a footprint of light, this approach would be very fruitful.

Conclusion

Everyone has a perspective. Maps are similar to us in that way. In this blog post, I leveraged population data to learn more about where the majority of people live in Italy. The cartograms made clear that there are huge disparities in population among all the political units examined here. It comes as no surprise that units containing large cities, such as the province of Rome, have more people. What the cartogram let us see was the extent of these disparities within a larger political unit.

With the geographic and population centroids, we saw which regions were “skewed” in terms of where people live within political units. Apulia, for example, has its population skewed to the south of its geographic centroid. We also saw that the two centroids of Italy are nearly identical, signifying that the country’s population is balanced around its geographic center.

With the help of the geoplot and GeoPandas libraries I showed the reader how easy it is to view population within a country. I hope you use this project to make even more interesting cartograms and centroid visualizations in the future!

P.S.: All of the code used to engineer the data and to make these plots can be found in the project’s GitHub repository.

Greg Feliu

Data Analyst | Data Engineer — Interests in language, sports, marketing and geographic visualizations