Geospatial data refers to data which has a geographic component in it. Usually this means that the data entries are associated to some point on the surface of the earth. Therefore, geospatial data are usually visualized on maps.
Because the earth is round, in order to make a flat map, one must transform the earth's surface to a flat surface. Such transformations are called projections by cartographers (not to be confused with linear projections from linear algebra). Mathematicians know that a transformation between topologically different regions must be discontinous somewhere. So these projections, while very useful, cannot be perfect replicas of reality. It is useful to know this and a bit more about python modules for projections while attempting to visualize geospatial data on the globe.
Many references, including
Geographic Data with Basemap of [JV-H], use the python module basemap
. However in recent years, the module basemap has been deprecated in favor of the new python mapping project called Cartopy. Therefore, this activity aims at taking a quick look at cartopy. Cartopy, together with geopandas, a package built on top of pandas, shows promise in easing geospatial visualization. They are nonetheless relatively new efforts. You will notice that their documentation, while constantly improving, is not as mature as, say numpy. There will likely be a number of changes in the provided facilities as these efforts take hold. Our goal in this activity is to get acquainted with these new developments for visualizing geospatial data.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import geopandas as gpd
from cartopy import crs
The GeoDataFrame
class of geopandas
is a pandas
data frame with a special column representing geometry. This column is a GeoSeries
object, which may be viewed as a pandas
series where each entry is a set of shapes. The shapes are geometric objects like a a set of points, lines, a single polygon, or many polygons. These shapes are objects made using the shapely package. Together they allow easy interaction with matplotlib
for plotting geospatial data.
Here is a GeoDataFrame
object we have already used in 01 Overview:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
type(world)
The above-mentioned special column of the world
data frame is this.
type(world.geometry)
A GeoDataFrame
can have many columns of type GeoSeries
, but there can only be one active geometry column, which is what is returned by the attribute GeoDataFrame.geometry
. Note also that the name of this active GeoSeries
can be anything (not necessarily geometry
), but in the world
object, the column happens to be called geometry
, as can be seen below.
world.geometry.name
world.head()
The plot
method of the data frame is redefined in a GeoDataFrame
to use the geometry objects in the active geometry column. So to plot this map, all we have to do is use the plot
method:
world.plot()
A GeoDataFrame
has more attributes than a regular pandas
data frame. For example, it stores the centroids of the shapes in the active geometry column.
type(world.centroid)
This is a GeoSeries
that did not show up when we queried world.head()
, but it is an attribute of world
. We can, of course, make it an additional column of world
by in the usual pandas way.
world['centroids'] = world.centroid
world.head()
Now, world
has two GeoSeries
columns. If we make the centroids
column the active geometry column, then the output of the plot
method changes since it uses the active column's geometry specifications.
world = world.set_geometry('centroids') # change the active geometry column
world.plot();
An essential data structure of cartopy is CRS, or Coordinate Reference Systems, the name used by cartopy (and other python projects) for the projections used in maps. A CRS often used as default is the the Plate Carrée projection, which cartopy provides as follows.
crs.PlateCarree()
As you guessed from the above output, the CRS object is able to plot itself using matplotlib. This points to one avenue for visualizing geospatial data that has no need for geopandas. Cartopy produces a matplotlib axis on which you can overlay your data as you see fit: if your data has latitude and longitude associated to it, cartopy can apply the relevant projection automatically to place it at the proper place in the figure. Below, we will focus on alternative visualization methods using geopandas and facilities to interact between cartopy and geopandas.
The world
object of class GeoDataFrame
comes with a CRS attribute, another attribute that does not show up in world.head()
output.
world.crs
We see that the CRS above is codenamed EPSG:4326. Here EPSG stands for European Petroleum Survey Group. They maintain several data sets at https://spatialreference.org/ref/epsg/. Each set in “EPSG Geodetic Parameter Dataset is a collection of definitions of coordinate reference systems and coordinate transformations which may be global, regional, national or local in application.”
EPSG-number 4326 that we have here belongs to the WGS 84 coordinate system,
the latest in the
World Geodetic System (WGS)
where the earth is represented as an oblate spheroid with coordinates in
decimal degrees (Lat, Lon
).
Changing the active geometry from previously set centroids
to the original geometry
column, let us plot the world
again in order to compare the result with another CRS.
world = world.set_geometry('geometry') # set the active geometry
world.plot(); plt.title('World in WGS 84 CRS');
We compare this output with another very well-known projection, the Mercator projection, which has the nice property that it is conformal, i.e., it preserves angles. EPSG has made available a Web Mercator projection. We can easily
convert world
from the WGS 84 to the Mercator CRS:
world_Mercator = world.to_crs("EPSG:3395")
world_Mercator.plot();
plt.title('World in Mercator CRS');
Generations of school kids have grown up with this Mercator map. Note how the Mercator projection distorts the size of objects as the latitude increases from the equator to the poles. Some countries near equator look smaller than they really are. For example, Brazil is actually larger than the contiguous United States, but it looks smaller in the Mercator projection. If you have time for a digression, have a look into the many discussions online, e.g., The Problem With Our Maps, on how the false sizes on maps (perhaps inadvertently) shape our views on countries.
The Azimuthal Equidistant and Albers Equal Area coordinate reference systems show areas of the globe in better proportions than the Mercator projection, as their names suggest. These are implemented in cartopy. We want to leverage geopandas' ability to work with cartopy in the next step. We don't always get meaningful plots after an arbitrary CRS to CRS conversion, but what is officially possible is laid out in the current geopandas documentation, which would be a good source to check back on for future changes.
First, for conversion from the default WGS 84 to the azimuthal equidistant CRS, we create a cartopy CRS object.
ae = crs.AzimuthalEquidistant()
type(ae)
Then we convert the cartopy object to a form usable by geopandas through an intermediate step. That step offers a glimpse of what lies at the core of many of the mapping tools, the PROJ project. (All these dependencies made installing geopandas more involved, as you recall from our early install sessions.)
aeproj4 = ae.proj4_init # Convert to`proj4` string/dict usable in gpd
world_ae = world.to_crs(aeproj4) # Then call to_crs method
world_ae.plot()
This represents the geopandas object world_ae
, which is the world
object whose geometry has been converted to the azimuthal equidistant CRS. From the above output, it we see that the azimuthal equidistant projection shows the central view in good proportions, with obvious distortions farther out although the distortions are evocative of the globe. However, this CRS is often difficult to use for showing data that is spread over the populous parts of the globe. (You can change the central view as shown in the next cell.) See, for example, how difficult it is to get the far east, the west, and Europe, together in any perspective, due to the vastness of the intermediate Pacific ocean.
crs.AzimuthalEquidistant(central_longitude=200, central_latitude=10)
It is therefore useful to get to know another projection from cartopy
called AlbersEqualArea
projection.
aea = crs.AlbersEqualArea()
aea
Finally, as an illustration of how to plot geopandas geometries into an axis generated by cartopy, we will convert or project the existing geometry objects in world_ae
to AlbersEqualArea
CRS as shown below.
aea_geo = [aea.project_geometry(ii, src_crs=ae)
for ii in world_ae['geometry'].values]
Since cartopy
works directly with matplotlib
, we can immediately render the resulting geometries in matplotlib's axis.
fig, ax = plt.subplots(subplot_kw={'projection': aea})
ax.add_geometries(aea_geo, crs=aea);
We can alternately produce essentially the same plot using geopandas as follows.
gpd.GeoDataFrame(world, geometry=aea_geo, crs=aea.proj4_init).plot();
As an application of the above-discussed geospatial visualization techniques, we will now make a map of COVID-19 cases throughout the world using the AlbersEqualArea
coordinate reference system.
import os
from git import Repo
First update your data folder by pulling the newest reports of COVID-19 from the GitHub repository maintained by Johns Hopkins' researchers.
covidfolder = '../../data_external/covid19'
if os.path.isdir(covidfolder): # if repo exists, pull newest data
repo = Repo(covidfolder)
repo.remotes.origin.pull()
else: # otherwise, clone from remote
repo = Repo.clone_from('https://github.com/CSSEGISandData/COVID-19.git',
covidfolder)
datadir = repo.working_dir + '/csse_covid_19_data/csse_covid_19_time_series'
f = datadir + '/time_series_covid19_confirmed_global.csv'
c = pd.read_csv(os.path.abspath(f))
c = c.rename(columns={'Country/Region': 'country'}).iloc[:, 1:]
c.head()
Some countries are repeated (as their provinces are being counted in separate data entries), as can be seen from the following nonzero difference:
len(c['country']) - len(set(c['country']))
Therefore, we sum up each country's confirmed COVID-19 counts for each day using a pandas
groupby operation. Of course, it doesn't make sense to sum up the latitudes and longitudes, so we take their average values within countries. (Taking the mean of latitudes and longitudes will do for our current purposes, but it is definitely not a perfect solution, which is why I'll be asking you to improve on it by making a chloropleth map in the next assignment.)
cg = c.groupby('country')[c.columns[3:]].sum()
cg['Lat'] = c.groupby('country')['Lat'].mean()
cg['Long'] = c.groupby('country')['Long'].mean()
This newly created data frame has no GeoSeries
. We use the latitude and longitude information to create point geometries. With this information, we can make a GeoDataFrame
.
geo = gpd.points_from_xy(cg['Long'], cg['Lat'])
c_aea_geo = [aea.project_geometry(ii) for ii in geo]
cg = gpd.GeoDataFrame(cg, geometry=c_aea_geo, crs=aea.proj4_init)
Now, in cg
, we have a GeoDataFrame
object that should know how to plot its data columns in the data's associated places on the globe.
def covidworldmap(date):
fig, ax = plt.subplots(figsize=(12, 10))
# put the world map on an axis
w = gpd.GeoDataFrame(world, geometry=aea_geo, crs=aea.proj4_init)
w.plot(ax=ax, color='midnightblue', edgecolor='darkslategray')
ax.set_facecolor('dimgray')
mx = cg.iloc[:, :-3].max().max() # get max across data
# set marker sizes, with a min marker size for cases > 1000
msz = 500 * np.where(cg[date]-1000, np.maximum(cg[date]/mx, 0.001), 0)
cg.plot(ax=ax, cmap='Wistia', markersize=msz, alpha=0.5)
ax.set_xticks([]) # remove axis marks
ax.set_yticks([]);
covidworldmap('5/5/20')
The world is now littered with COVID-19 cases. I wish the world and our country had fared better, but the data doesn't lie.
Author: Jay Gopalakrishnan
License: ©2020. CC-BY-SA
$\ll$Table of Contents