Visualizing geospatial data

May 6, 2020

Geospatial data refers to data which has a geographic component in it. Usually this means that the data entries are associated to some point on the surface of the earth. Therefore, geospatial data are usually visualized on maps.

Because the earth is round, in order to make a flat map, one must transform the earth's surface to a flat surface. Such transformations are called projections by cartographers (not to be confused with linear projections from linear algebra). Mathematicians know that a transformation between topologically different regions must be discontinous somewhere. So these projections, while very useful, cannot be perfect replicas of reality. It is useful to know this and a bit more about python modules for projections while attempting to visualize geospatial data on the globe.

Many references, including Geographic Data with Basemap of [JV-H], use the python module basemap. However in recent years, the module basemap has been deprecated in favor of the new python mapping project called Cartopy. Therefore, this activity aims at taking a quick look at cartopy. Cartopy, together with geopandas, a package built on top of pandas, shows promise in easing geospatial visualization. They are nonetheless relatively new efforts. You will notice that their documentation, while constantly improving, is not as mature as, say numpy. There will likely be a number of changes in the provided facilities as these efforts take hold. Our goal in this activity is to get acquainted with these new developments for visualizing geospatial data.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import geopandas as gpd
from cartopy import crs

Geometry representation

The GeoDataFrame class of geopandas is a pandas data frame with a special column representing geometry. This column is a GeoSeries object, which may be viewed as a pandas series where each entry is a set of shapes. The shapes are geometric objects like a a set of points, lines, a single polygon, or many polygons. These shapes are objects made using the shapely package. Together they allow easy interaction with matplotlib for plotting geospatial data.

Here is a GeoDataFrame object we have already used in 01 Overview:

In [2]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
type(world)
Out[2]:
geopandas.geodataframe.GeoDataFrame

The above-mentioned special column of the world data frame is this.

In [3]:
type(world.geometry)
Out[3]:
geopandas.geoseries.GeoSeries

A GeoDataFrame can have many columns of type GeoSeries, but there can only be one active geometry column, which is what is returned by the attribute GeoDataFrame.geometry. Note also that the name of this active GeoSeries can be anything (not necessarily geometry), but in the world object, the column happens to be called geometry, as can be seen below.

In [4]:
world.geometry.name
Out[4]:
'geometry'
In [5]:
world.head()
Out[5]:
pop_est continent name iso_a3 gdp_md_est geometry
0 920938 Oceania Fiji FJI 8374.0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 53950935 Africa Tanzania TZA 150600.0 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2 603253 Africa W. Sahara ESH 906.5 POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 35623680 North America Canada CAN 1674000.0 MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4 326625791 North America United States of America USA 18560000.0 MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

The plot method of the data frame is redefined in a GeoDataFrame to use the geometry objects in the active geometry column. So to plot this map, all we have to do is use the plot method:

In [6]:
world.plot()
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x120d3cd00>

A GeoDataFrame has more attributes than a regular pandas data frame. For example, it stores the centroids of the shapes in the active geometry column.

In [7]:
type(world.centroid)
Out[7]:
geopandas.geoseries.GeoSeries

This is a GeoSeries that did not show up when we queried world.head(), but it is an attribute of world. We can, of course, make it an additional column of world by in the usual pandas way.

In [8]:
world['centroids'] = world.centroid
world.head()
Out[8]:
pop_est continent name iso_a3 gdp_md_est geometry centroids
0 920938 Oceania Fiji FJI 8374.0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000... POINT (163.85316 -17.31631)
1 53950935 Africa Tanzania TZA 150600.0 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... POINT (34.75299 -6.25773)
2 603253 Africa W. Sahara ESH 906.5 POLYGON ((-8.66559 27.65643, -8.66512 27.58948... POINT (-12.13783 24.29117)
3 35623680 North America Canada CAN 1674000.0 MULTIPOLYGON (((-122.84000 49.00000, -122.9742... POINT (-98.14238 61.46908)
4 326625791 North America United States of America USA 18560000.0 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... POINT (-112.59944 45.70563)

Now, world has two GeoSeries columns. If we make the centroids column the active geometry column, then the output of the plot method changes since it uses the active column's geometry specifications.

In [9]:
world = world.set_geometry('centroids')   # change the active geometry column
world.plot();

Coordinate Reference Systems

An essential data structure of cartopy is CRS, or Coordinate Reference Systems, the name used by cartopy (and other python projects) for the projections used in maps. A CRS often used as default is the the Plate Carrée projection, which cartopy provides as follows.

In [10]:
crs.PlateCarree()
Out[10]:
<cartopy.crs.PlateCarree object at 0x122e01ea0>

As you guessed from the above output, the CRS object is able to plot itself using matplotlib. This points to one avenue for visualizing geospatial data that has no need for geopandas. Cartopy produces a matplotlib axis on which you can overlay your data as you see fit: if your data has latitude and longitude associated to it, cartopy can apply the relevant projection automatically to place it at the proper place in the figure. Below, we will focus on alternative visualization methods using geopandas and facilities to interact between cartopy and geopandas.

The world object of class GeoDataFrame comes with a CRS attribute, another attribute that does not show up in world.head() output.

In [11]:
world.crs
Out[11]:
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

We see that the CRS above is codenamed EPSG:4326. Here EPSG stands for European Petroleum Survey Group. They maintain several data sets at https://spatialreference.org/ref/epsg/. Each set in “EPSG Geodetic Parameter Dataset is a collection of definitions of coordinate reference systems and coordinate transformations which may be global, regional, national or local in application.”

EPSG-number 4326 that we have here belongs to the WGS 84 coordinate system, the latest in the World Geodetic System (WGS) where the earth is represented as an oblate spheroid with coordinates in decimal degrees (Lat, Lon).

Changing the active geometry from previously set centroids to the original geometry column, let us plot the world again in order to compare the result with another CRS.

In [12]:
world = world.set_geometry('geometry')   # set the active geometry
world.plot(); plt.title('World in WGS 84 CRS');

We compare this output with another very well-known projection, the Mercator projection, which has the nice property that it is conformal, i.e., it preserves angles. EPSG has made available a Web Mercator projection. We can easily convert world from the WGS 84 to the Mercator CRS:

In [13]:
world_Mercator = world.to_crs("EPSG:3395") 
world_Mercator.plot();
plt.title('World in Mercator CRS');

Generations of school kids have grown up with this Mercator map. Note how the Mercator projection distorts the size of objects as the latitude increases from the equator to the poles. Some countries near equator look smaller than they really are. For example, Brazil is actually larger than the contiguous United States, but it looks smaller in the Mercator projection. If you have time for a digression, have a look into the many discussions online, e.g., The Problem With Our Maps, on how the false sizes on maps (perhaps inadvertently) shape our views on countries.

Two other CRS

The Azimuthal Equidistant and Albers Equal Area coordinate reference systems show areas of the globe in better proportions than the Mercator projection, as their names suggest. These are implemented in cartopy. We want to leverage geopandas' ability to work with cartopy in the next step. We don't always get meaningful plots after an arbitrary CRS to CRS conversion, but what is officially possible is laid out in the current geopandas documentation, which would be a good source to check back on for future changes.

First, for conversion from the default WGS 84 to the azimuthal equidistant CRS, we create a cartopy CRS object.

In [14]:
ae = crs.AzimuthalEquidistant()
type(ae)
Out[14]:
cartopy.crs.AzimuthalEquidistant

Then we convert the cartopy object to a form usable by geopandas through an intermediate step. That step offers a glimpse of what lies at the core of many of the mapping tools, the PROJ project. (All these dependencies made installing geopandas more involved, as you recall from our early install sessions.)

In [15]:
aeproj4 = ae.proj4_init           # Convert to`proj4` string/dict usable in gpd
world_ae = world.to_crs(aeproj4)  # Then call to_crs method
world_ae.plot()
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x12e704490>

This represents the geopandas object world_ae, which is the world object whose geometry has been converted to the azimuthal equidistant CRS. From the above output, it we see that the azimuthal equidistant projection shows the central view in good proportions, with obvious distortions farther out although the distortions are evocative of the globe. However, this CRS is often difficult to use for showing data that is spread over the populous parts of the globe. (You can change the central view as shown in the next cell.) See, for example, how difficult it is to get the far east, the west, and Europe, together in any perspective, due to the vastness of the intermediate Pacific ocean.

In [16]:
crs.AzimuthalEquidistant(central_longitude=200, central_latitude=10)
Out[16]:
<cartopy.crs.AzimuthalEquidistant object at 0x12e772cc0>

It is therefore useful to get to know another projection from cartopy called AlbersEqualArea projection.

In [17]:
aea = crs.AlbersEqualArea()
aea
Out[17]:
<cartopy.crs.AlbersEqualArea object at 0x12e8464f0>

Finally, as an illustration of how to plot geopandas geometries into an axis generated by cartopy, we will convert or project the existing geometry objects in world_ae to AlbersEqualArea CRS as shown below.

In [18]:
aea_geo = [aea.project_geometry(ii, src_crs=ae) 
           for ii in world_ae['geometry'].values]

Since cartopy works directly with matplotlib, we can immediately render the resulting geometries in matplotlib's axis.

In [19]:
fig, ax = plt.subplots(subplot_kw={'projection': aea})
ax.add_geometries(aea_geo, crs=aea);

We can alternately produce essentially the same plot using geopandas as follows.

In [20]:
gpd.GeoDataFrame(world, geometry=aea_geo, crs=aea.proj4_init).plot();

Mapping COVID-19 cases on the globe

As an application of the above-discussed geospatial visualization techniques, we will now make a map of COVID-19 cases throughout the world using the AlbersEqualArea coordinate reference system.

In [21]:
import os
from git import Repo

First update your data folder by pulling the newest reports of COVID-19 from the GitHub repository maintained by Johns Hopkins' researchers.

In [22]:
covidfolder = '../../data_external/covid19'
if os.path.isdir(covidfolder):   # if repo exists, pull newest data 
    repo = Repo(covidfolder)
    repo.remotes.origin.pull()
else:                            # otherwise, clone from remote
    repo = Repo.clone_from('https://github.com/CSSEGISandData/COVID-19.git',
                           covidfolder)   
datadir = repo.working_dir + '/csse_covid_19_data/csse_covid_19_time_series'
f = datadir + '/time_series_covid19_confirmed_global.csv'
In [23]:
c = pd.read_csv(os.path.abspath(f))
c = c.rename(columns={'Country/Region': 'country'}).iloc[:, 1:]
c.head()
Out[23]:
country Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 ... 7/14/20 7/15/20 7/16/20 7/17/20 7/18/20 7/19/20 7/20/20 7/21/20 7/22/20 7/23/20
0 Afghanistan 33.93911 67.709953 0 0 0 0 0 0 0 ... 34740 34994 35070 35229 35301 35475 35526 35615 35727 35928
1 Albania 41.15330 20.168300 0 0 0 0 0 0 0 ... 3667 3752 3851 3906 4008 4090 4171 4290 4358 4466
2 Algeria 28.03390 1.659600 0 0 0 0 0 0 0 ... 20216 20770 21355 21948 22549 23084 23691 24278 24872 25484
3 Andorra 42.50630 1.521800 0 0 0 0 0 0 0 ... 861 862 877 880 880 880 884 884 889 889
4 Angola -11.20270 17.873900 0 0 0 0 0 0 0 ... 541 576 607 638 687 705 749 779 812 851

5 rows × 187 columns

Some countries are repeated (as their provinces are being counted in separate data entries), as can be seen from the following nonzero difference:

In [24]:
len(c['country']) - len(set(c['country']))
Out[24]:
78

Therefore, we sum up each country's confirmed COVID-19 counts for each day using a pandas groupby operation. Of course, it doesn't make sense to sum up the latitudes and longitudes, so we take their average values within countries. (Taking the mean of latitudes and longitudes will do for our current purposes, but it is definitely not a perfect solution, which is why I'll be asking you to improve on it by making a chloropleth map in the next assignment.)

In [25]:
cg = c.groupby('country')[c.columns[3:]].sum()
cg['Lat'] = c.groupby('country')['Lat'].mean()
cg['Long'] = c.groupby('country')['Long'].mean()

This newly created data frame has no GeoSeries. We use the latitude and longitude information to create point geometries. With this information, we can make a GeoDataFrame.

In [26]:
geo = gpd.points_from_xy(cg['Long'], cg['Lat'])
c_aea_geo = [aea.project_geometry(ii) for ii in geo]
cg = gpd.GeoDataFrame(cg, geometry=c_aea_geo, crs=aea.proj4_init)

Now, in cg, we have a GeoDataFrame object that should know how to plot its data columns in the data's associated places on the globe.

In [27]:
def covidworldmap(date):

    fig, ax = plt.subplots(figsize=(12, 10))
    # put the world map on an axis
    w = gpd.GeoDataFrame(world, geometry=aea_geo, crs=aea.proj4_init)
    w.plot(ax=ax, color='midnightblue', edgecolor='darkslategray')
    ax.set_facecolor('dimgray') 
    mx = cg.iloc[:, :-3].max().max()   # get max across data

    # set marker sizes, with a min marker size for cases > 1000
    msz = 500 * np.where(cg[date]-1000, np.maximum(cg[date]/mx, 0.001), 0)
    cg.plot(ax=ax, cmap='Wistia', markersize=msz, alpha=0.5)

    ax.set_xticks([])   # remove axis marks
    ax.set_yticks([]);
In [28]:
covidworldmap('5/5/20')

The world is now littered with COVID-19 cases. I wish the world and our country had fared better, but the data doesn't lie.