Country Map

Author

David Gerbing

Published

May 10, 2024, 01:43 pm

Setup

Background

Goal: Create a map of Italy with its cities of population size greater than 250,000 plotted on the map as a bubble plot. The size of each bubble reflects the city population.

Our map will consist of three different layers, each layer plotted separately by ggplot2: Italy, the cities, and the city names. The data for each distinct set of information is constructed separately.

Prerequisite: First read Section 8.1, Map the World. This section explains projections and spatial data, including the special features mapping data format referred to below. The section also illustrates the ggplot2 geom for plotting a world map from a special features data file. The material on a datum can be ignored.

This material replaces book Section 8.1.3, subsection Geonames geocodes, and Section 8.1.4.

As an option, to allow for wider output displays, increase the line width to 100 characters.

options(width=100)

Access Needed Packages

As always, the following contributed packages must first be installed with install.packages(), also accessed from the RStudio Tools menu.

library(lessR)
library(rnaturalearth)  # access map data for plotting
library(sf)
library(ggrepel)  # access the function for plotting city names

Another package must also be installed for the mapping functions to access, though it does not need to be loaded directly with library(). That package is the high resolution version of rnaturalearth, rnaturalearthhires. Unlike other packages we have installed, this package is not available on the R server network, called CRAN. Instead, invoke the repos parameter to specify another repository, as shown below. The needed sp package is not on the specified repos server, so first download from the usual CRAN servers.

install.packages("sp")
install.packages("rnaturalearthhires", 
                 repos="https://ropensci.r-universe.dev", type="source")

Geocodes

Access the Geocodes

To identify each city we need its location, its geocode. A free, public domain database, GeoNames, offers city geocodes that include cities across the world. One source of geocodes is from simplemaps.com.

Following is a URL where you can download a free list of about 47,000 the world’s most populous cities. Go to this URL and click the Download button for the Basic, and free, database.

https://simplemaps.com/data/world-cities

Here the data was downloaded to the data folder within the current folder from which this R session is active.

d <- Read("data/worldcitiesOrig.csv")
Data Types
------------------------------------------------------------
character: Non-numeric data values
integer: Numeric data values, integers only
double: Numeric data values with decimal digits
------------------------------------------------------------

      Variable                  Missing  Unique 
          Name     Type  Values  Values  Values   First and last values
------------------------------------------------------------------------------------------
 1        city character  47868       0   44383   Tokyo  Jakarta ... Hongseong  Charlotte Amalie
 2  city_ascii character  47867       1   44155   Tokyo  Jakarta ... Hongseong  Charlotte Amalie
 3         lat    double  47868       0   35697   35.6897  -6.175 ... 36.6009  18.342
 4         lng    double  47868       0   38684   139.6922  106.8275 ... 126.665  -64.9331
 5     country character  47868       0     242   Japan ... U.S. Virgin Islands
 6        iso2 character  47868       0     241   JP  ID  IN ... KR  KR  VI
 7        iso3 character  47868       0     241   JPN  IDN  IND ... KOR  KOR  VIR
 8  admin_name character  47671     197    4042   Tōkyō  Jakarta ... Chungnam  Virgin Islands
 9     capital character  13023   34845       3   primary  primary ... admin  primary
10  population    double  47656     212   31038   37732000  33756000 ... NA  NA
11          id   integer  47868       0   47868   1392685764 ... 1850037473
------------------------------------------------------------------------------------------

From the output of Read(), there are 11 variables and 47868 rows of data. View the variable names and the first several lines of the d data frame.

head(d)
       city city_ascii     lat      lng     country iso2 iso3  admin_name capital population
1     Tokyo      Tokyo 35.6897 139.6922       Japan   JP  JPN       Tōkyō primary   37732000
2   Jakarta    Jakarta -6.1750 106.8275   Indonesia   ID  IDN     Jakarta primary   33756000
3     Delhi      Delhi 28.6100  77.2300       India   IN  IND       Delhi   admin   32226000
4 Guangzhou  Guangzhou 23.1300 113.2600       China   CN  CHN   Guangdong   admin   26940000
5    Mumbai     Mumbai 19.0761  72.8775       India   IN  IND Mahārāshtra   admin   24973000
6    Manila     Manila 14.5958 120.9772 Philippines   PH  PHL      Manila primary   24922000
          id
1 1392685764
2 1360771077
3 1356872604
4 1156237133
5 1356226629
6 1608618140

As an option for clarity, change the abbreviated variable names lat and lng to their full names. Here, use the lessR function rename() to make the changes, which requires the name of the data frame that contains the variables, the vector of the variable names as they currently exist, followed by the vector of their new names.

d <- rename(d, c(lat, lng), c(latitude, longitude))
Change the following variable names for data frame d :

lat --> latitude 
lng --> longitude 

Subset the Geocodes Data

The geocodes file is large. Because the goal is to create a map of Italy with its largest cities plotted, for convenience we subset the data frame to just include rows that reference large Italian cities and the relevant variables.

With that information, subset the data frame d of city information to just the Italian cities with more than 250,000 inhabitants. Retain only the relevant variables in the subset. Accomplish this subsetting of rows and columns of data frame d with the standard R extract operator that subsets rows and columns of the data frame, separated by commas within square brackets.

Choose only records from the full data file of Italian cities with a population greater than 250,000. Save the data in a new data frame, named here cities10.

cols <- c("city", "latitude", "longitude", "country", "population")
rows <- d$country=="Italy" & d$population > 250000
cities10 <- d[rows, cols]

The contents of the entire subsetted data frame follow. We have the city names, their location, and their respective populations, all the information we need from the complete, downloaded file.

cities10
         city latitude longitude country population
288      Rome  41.8933   12.4828   Italy    2748109
567     Milan  45.4669    9.1900   Italy    1354196
841    Naples  40.8333   14.2500   Italy     913462
915     Turin  45.0792    7.6761   Italy     841600
1184  Palermo  38.1111   13.3517   Italy     630167
1327    Genoa  44.4111    8.9328   Italy     558745
1838  Bologna  44.4939   11.3428   Italy     387971
1945 Florence  43.7714   11.2542   Italy     360930
2209     Bari  41.1253   16.8667   Italy     316015
2314  Catania  37.5000   15.0903   Italy     298762
2630   Verona  45.4386   10.9928   Italy     255588
2684   Venice  45.4375   12.3358   Italy     250369

The new cities10 data frame is considerably smaller than the data file initially read. The subsetted data frame contains only 12 rows of data and 5 columns of data. We now have the location of each Italian city that has more than 250,000 residents as well as its population.

This cleaned-up file of geocodes is now ready for further processing.

Italy Map Data

Get the Data

We have the relevant city information stored in the cities10 data frame. To plot a map of Italy we also need the shape of Italy and its constituent states. The needed shapes are constructed from polygons, stored in a separate data frame with a special format called a simple features data frame or sf.

The following code lists all the countries in the rnaturalearth package for which mapping information is available.

world <- ne_countries(returnclass="sf")
world[,"name"][[1]]

However, in this example it is not necessary to run this code as we already know to identify Italy by its name.

Obtain these polygons that define the countries and states (or equivalent) from the rnaturalearth package with the ne_states() function. Identify each country in this data set with the value of the country parameter, which is the full country name, such as “United States of America”, “Kenya”, or “Italy”.

Extract and then save only the Italian map data into the simple features data frame. Because ne_states(), by default, returns the older style sp formatted data frame, specify the more modern simple features data frame with the parameter returnclass set to "sf". Here, save the shape information in the data frame italy_sf.

italy_sf <- ne_states(country="Italy", returnclass="sf")

The information contained within the sf type of data frame is more complex than a regular data frame. These data files contain much geometric information that defines the boundaries of each individual state, here within Italy. We could apply the head() function to the created simple features data frame italy_sf. However, this function is not applied here because there is much information that is not particularly meaningful to view.

Merge Map Data with City Population

Although we have the needed longitude and latitude that locates each city in the simple features italy_sf data frame, and the shapes of the individual states, we also need the population information for each city that we wish tknitr::knit_code\(get('setup')oknitr::knit_code\)get(‘setup’) plot. The population of each city is not included in the mapping data. Create a simple features object that merges the additional information to the italy_sf data frame with the st_as_sf() function from the sf package. This combination of information is possible because simple features data frames can contain other variables in addition to the mapping information.

We have the longitude and latitude of each city in the cities10 data frame, but to place the cities on a map we need those coordinates in simple features data frame. We need to retrieve the the geographical information from the italy_sf simple features data frame to add to the population data in the cities10 data frame. Use the st_as_sf() function from the sf package, which converts other objects such as regular data frames to an sf data frame. The crs parameter of st_as_sf() extracts the geographical information in the form of what is called a coordinate reference system or crs for the map of Italy. The crs parameter invokes the st_crs() function, which does the retrieval. Set the st_as_sf() parameter remove to FALSE to retain the city latitude and longitude as separate variables from the cities10 data frame in the resulting newly created sf data frame, cities10_sf.

cities10_sf <- st_as_sf(cities10, coords=c("longitude", "latitude"),
                       crs=st_crs(italy_sf), remove=FALSE)

Show the contents of the new simple features data frame that contains the location of each city and the accompanying city statistics, here population. This new data frame contains the same information as the original cities10 data frame, but now in simple features format. The new mapping information is presented as the values for the variable geometry, the coordinates for each city, plotted as a point.

cities10_sf
Simple feature collection with 12 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 7.6761 ymin: 37.5 xmax: 16.8667 ymax: 45.4669
Geodetic CRS:  WGS 84
First 10 features:
         city latitude longitude country population                geometry
288      Rome  41.8933   12.4828   Italy    2748109 POINT (12.4828 41.8933)
567     Milan  45.4669    9.1900   Italy    1354196    POINT (9.19 45.4669)
841    Naples  40.8333   14.2500   Italy     913462   POINT (14.25 40.8333)
915     Turin  45.0792    7.6761   Italy     841600  POINT (7.6761 45.0792)
1184  Palermo  38.1111   13.3517   Italy     630167 POINT (13.3517 38.1111)
1327    Genoa  44.4111    8.9328   Italy     558745  POINT (8.9328 44.4111)
1838  Bologna  44.4939   11.3428   Italy     387971 POINT (11.3428 44.4939)
1945 Florence  43.7714   11.2542   Italy     360930 POINT (11.2542 43.7714)
2209     Bari  41.1253   16.8667   Italy     316015 POINT (16.8667 41.1253)
2314  Catania  37.5000   15.0903   Italy     298762    POINT (15.0903 37.5)

Now we have the two simple features data frames that we need from which to create the map: The mapping information for Italy and for 10 of its cities.

Create the Map

Create this map layer-by-layer.

  1. map of Italy
  2. the ten cities, each plotted as a bubble reflecting population size
  3. text that labels the cities.

To create the map we need the country information, here for Italy, and the city information, here the 10 largest cities in Italy. As such, wneed access to two simple feature data frames that we created: italy_sf and cities_sf.

The name of a ggplot2 plotted geometric object is called a geom. Plot the first two layers with the simple features geom, geom_sf(), which transforms the sf polygon data into a visualization, the corresponding part of the map.

Layer 1: The simple features version of a data frame, italy_sf, contains the polygon information to plot the map of Italy. Reduce the size of the border lines with the size parameter set to a value less than 1, here set at 0.2.

Draw the map of Italy with colors. The ggplot2 functions (and also the lessR functions) apply the parameters fill and color to customize visualizations. The parameter fill specifies the interior color of a polygon, the color that fills the polygon. The parameter color specifies the exterior color of the object, the color as it appears to someone viewing the object from the outside. To view all named colors with the color illustrated, invoke the lessR function showColors(), which directs the output to a pdf file.

geom_sf(data=italy_sf, fill="azure", color="burlywood", size=.2)

Layer 2: The second application of geom_sf() plots the cities, with the corresponding population mapped to the size of the plotted point. Specify a mild transparency for each point by setting the alpha parameter to 0.7. The optional scale_size_area() function instructs the scaling of the dots based on area instead of radius which, I think, makes more interesting sizing of the points.

geom_sf(data=cities10_sf, mapping=aes(size=population), alpha=.7) +  
        scale_size_area() +

Layer 3: The geom_text_repel() geom plots the third layer, the city names, available in the cities_sf data frame. It uses the base R parameter name for color, col, instead of color. The plotted label is the variable city, a variable in the cities_sf simple features version of the data frame. Set the size of the labels with the size parameter.

geom_text_repel(data=cities10_sf, aes(longitude, latitude, label=city),  
                size=3.75, col="black")

Following is the complete ggplot2 instructions for visualizing each of the three layers with a single call to the ggplot() function. Because there is a different data frame accessed by each layer, the data parameter is specified for each layer instead of the ggplot() function directly.

ggplot() +
  geom_sf(data=italy_sf, fill="azure", color="burlywood", size=.2) +
  geom_sf(data=cities10_sf, mapping=aes(size=population), alpha=.7) +  
          scale_size_area() +  
  geom_text_repel(data=cities10_sf, aes(longitude, latitude, label=city),  
                  size=3.75, col="black") 

The text placement algorithm for geom_text_repel() in this ggplot2 function call is insensitive to the mapping of population to the size of plotted point.1 For the large bubbles, the label tends to be too close or overlap with the bubble. To further customize, use the nudge_y parameter to change the default vertical location of each of the ten plotted points. The nudge units are the same as those expressed on the corresponding axis, here the latitude of each city on the vertical axis.

1 There is a way to account for different point sizes but requires more programming. See https://ggrepel.slowkow.com/articles/examples.html, which also illustrates many more features of ggrepl functions.

ggplot() +
  geom_sf(data=italy_sf, fill="azure", color="burlywood", size=.2) +
  geom_sf(data=cities10_sf, mapping=aes(size=population), alpha=.7) +  
          scale_size_area() +  
  geom_text_repel(data=cities10_sf, aes(longitude, latitude, label=city),  
                  size=3.75, col="black",  
                  nudge_y=c(.5,.4,.5,-.6,-.4,.4,-.2,0,.5,.4))

Now we have the finished map of Italy in custom colors with its ten largest cities located and plotted according to population size.