Chapter 15 Making maps: displaying spatial data
ggplot’s functionality can be easily applied to create maps. Maps are largely polygons filled with certain colors (e.g. administrative units or lakes), and amended with various line segments (e.g. rivers), dots (e.g. cities) and text. ggplot possesses all these tools. This section explains how to use most important ones of these.
15.1 Simple maps with ggplot
ggplot gives you access to a number of basic maps, such as the world map or the map of France. But before we use any of these, it is very helpful to make one out of scratch. This will explain all the tricks that you need to know, such as polygons, groups, order, and id-s when you use the ready-made maps.
15.1.1 How maps are made: a hand carved example
The map is made of lines that connect vertices. The picture demonstrates the vertices, lines, and polygons on a 2-D plane and indicates the coordinate pairs for each vertex. For instance, in the lower-left corner there is a line that connects point (0,1) to (2,0). The two polygons (South Island and North Island) are filled with gray.
If we want to plot this data as a map we should just draw the black lines that connect the vertices. We have to ensure though that we get these details right:
- We must be careful to only connect the
lines that correspond to the coastline of each island, we should not
draw a line across the sea from one coastline to the other.
- We must connect the points in the correct order, not in a criss-cross manner.
- And finally, we have to ensure that the polygons are closed, i.e. the last vertex of each polygon is connected to its first one.
Next, let’s create this map. We start with creating a data frame of the vertices. This is a little tedious process as data frames expect x and y to be given separately while for us it is easier to think of those as pairs. So if you need to make a somewhat larger map, you may want to enter data in a spreadsheet instead. But here is the R code:
data.frame(x = c(0, 2, 5, 5, # south island
nz <-6, 10, 8, 7, 7, 6), # north island
y = c(1, 0, 3, 6, # south island
4, 8, 8, 12, 8, 7)) # north island
nz
## x y
## 1 0 1
## 2 2 0
## 3 5 3
## 4 5 6
## 5 6 4
## 6 10 8
## 7 8 8
## 8 7 12
## 9 7 8
## 10 6 7
Now the data is ready and we can plot it using polygons:
ggplot(nz,
aes(x, y)) +
geom_polygon(col="black", fill=NA)
geom_polygon()
connects the vertices with lines in a similar manner
as geom_path()
(see Section 13.9.2).
However, it also connects the last vertex to the first one, and allows
the polygons to be filled with color. Here
we ask the outline to be drawn in black, and the polygons not to be
filled (fill = NA
).
But the result has serious problems. In particular, it connects both
islands by drawing a line from the last vertex of South Island to the
first vertex of the North island.
This is not surprising as we did not tell ggplot in
any way which points form separate islands. Note that geom_polygon
connected the last and first vertex automatically. Also, importantly,
we entered
the vertices in the correct order, and hence we got a nice coastline plot.
Next, let’s fix the island connection problem. This requires adding an island-id, an id that tells which vertex belongs to which island. This id is typically called group and consists of a numeric id. However, we can give it also more descriptive values, e.g. “south” and “north” to denote the south and north island respectively:
$group <- c("south", "south", "south", "south",
nz"north", "north", "north", "north", "north", "north")
nz
## x y group
## 1 0 1 south
## 2 2 0 south
## 3 5 3 south
## 4 5 6 south
## 5 6 4 north
## 6 10 8 north
## 7 8 8 north
## 8 7 12 north
## 9 7 8 north
## 10 6 7 north
Thereafter we tell ggplot that
each group is a separate polygon by using group
aesthetic:
ggplot(nz,
aes(x, y, group=group)) +
geom_polygon(col="black", fill=NA)
Indeed, we got the two islands separated: now geom_polygon()
understands that vertices labeled “north” belong to one polygon, and
should not be connected to vertices labeled “south”.
Finally, in order to avoid distorted maps, we use
coord_quickmap
to ensure that the lengths correspond to the real map
lengths. We also fill the islands green:
ggplot(nz,
aes(x, y, group=group)) +
geom_polygon(col="black",
fill="seagreen3") +
coord_quickmap()
coord_quickmap()
is a handy tool for quick maps, but you can also
choose from many other map projections using coord_map()
. See the
relevant documentation.
The result is fairly similar to real maps, except that our manually designed coastline is far too simplistic. But now we have all the tools we need to plot real maps given we get the suitable data.
15.1.2 Displaying data on maps
It is great that we now can plot a green map of New Zealand. But this alone is not very interesting. There are plenty of better maps already available. We want to use this map to display spatial data.
Imagine we want to display the weather map, and the temperature on North Island is 17C and on South Island it is 14C. A simple option is just to add this column to our map data:
nz %>%
nzt <- mutate(temp = ifelse(group == "north", 17, 14))
nzt
## x y group temp
## 1 0 1 south 14
## 2 2 0 south 14
## 3 5 3 south 14
## 4 5 6 south 14
## 5 6 4 north 17
## 6 10 8 north 17
## 7 8 8 north 17
## 8 7 12 north 17
## 9 7 8 north 17
## 10 6 7 north 17
Here we just use ifelse()
(see Section 8.3.3)
to assign value “17” for every data line where group is “north” and
“14” to all other lines.
We can achieve this by just speficying that we want to map data variable temp to the fill color:
ggplot(nzt,
aes(x, y, group = group,
fill = temp)) +
geom_polygon() +
coord_quickmap()
But such approach–manually creating data column in the map data–is very inconvenient for anything resembling a real-world dataset. Temperature data is always loaded from another source and should be merged with map data. Let us now do this.
First, here is temperature data:
data.frame(
temperature <-island = c("north", "south"),
temp = c(17, 14)
) temperature
## island temp
## 1 north 17
## 2 south 14
How can we merge this to the map data? The result should be similar what we got when we did above: it should be a dataset of one vertex per row, and there should be an additional column for temperature, that displays the corresponding temperature data. Here we need to use the island as the merge key–the island name is what connects polygons in the map data with the temperature in the temperature data. do not have
left_join(nz, temperature, by = c("group" = "island"))
## x y group temp
## 1 0 1 south 14
## 2 2 0 south 14
## 3 5 3 south 14
## 4 5 6 south 14
## 5 6 4 north 17
## 6 10 8 north 17
## 7 8 8 north 17
## 8 7 12 north 17
## 9 7 8 north 17
## 10 6 7 north 17
Exercise 15.1 Do you want to use left_join()
, right_join()
, inner_join()
or
outer_join()
above? Explain!
Now we have automatically created a similar dataset as what we manually made above.
15.1.3 Bonus maps: map_data()
Fortunately we do not have to create maps manually. ggplot includes a few common maps, and there are straightforward tools to load standard map data, such as what you can dowload from the websites.
ggplot can directly access maps in the package maps using function
map_data()
. First you need to install that package, or you’ll see
an error message
Error in
map_data()
: ! The package “maps” is required formap_data()
(See Section 3.6 for how to install packages.) For instance, the world map can be accessed with
map_data("world")
world <-head(world)
## long lat group order region subregion
## 1 -69.89912 12.45200 1 1 Aruba <NA>
## 2 -69.89571 12.42300 1 2 Aruba <NA>
## 3 -69.94219 12.43853 1 3 Aruba <NA>
## 4 -70.00415 12.50049 1 4 Aruba <NA>
## 5 -70.06612 12.54697 1 5 Aruba <NA>
## 6 -70.05088 12.59707 1 6 Aruba <NA>
This dataset is very similar to what we constructed for New Zealand above. The variables long and lat denote longitude and latitude (we used x and y above), and group is also the same as what we used above to distinguish between islands. There are a few differences we need to discuss though.
- Most importantly, order denotes the order by which the points must be connected. In case of the NZ map above, we created the vertices in the correct order. The same is true here–the vertices are in correct order, but the order may be violated after certain operations, for instance after merging map data with some other kind of data. If this is the case, then we can use the order variable to restore the order (but remember to group by group—the order is valid within each group!).
- The map distinguished between group and region. group is a
technical concept that tells
geom_polygon()
which vertices to connect, region is country name. This matters in case if multiple polygons belong to the same country like in case of New Zealand that consists of multiple islands. Islands (group-s) must be drawn as separate polygons, but if you want to color the countries by their GDP, all New Zealand (region) should be of the same color. - Finally, the map also contains subregion, name of political subdivisions, such as the U.S. states (Aruba does not have any subregions).
We can plot the map as
ggplot(world, aes(long, lat, group=group)) +
geom_polygon(
fill = "seagreen", # fill color
col = "black", # border color
linewidth = 0.3) +
coord_quickmap()
Next, let’s put some kind of data on the map. Here we use country-concept similarity data. This dataset contains similarity measures between country names and a set of different words. It looks like
read_delim("data/country-concept-similarity.csv.bz2")
similarity <-%>%
similarity head(2)
## # A tibble: 2 × 12
## country terrorism nuclear trade battery regime volcano palm
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aruba 0.0891 -0.011 0.0504 -0.01 -0.0356 0.166 0.293
## 2 afghanis… 0.447 0.220 0.109 0.0578 0.180 0.129 0.116
## # ℹ 4 more variables: fir <dbl>, flood <dbl>, drought <dbl>,
## # mountain <dbl>
It contains a list of country names (column country), and a list of words (terrorism, nuclear, trade and so on). The numbers are measures of similarity between the country name and the corresponding word. It is out of scope of this book to discuss how the similarity is calculated, but to but it broadly–it means how often these words are used in a similar context. For instance, “Aruba” and “terrorism” are less similar (similarity 0.891) than e.g. “Aruba” and “nuclear” (similarity -0.011).
Before we can plot the similarity data on a map, we have to merge to map data with similarity data. But note that that the map data is much larger (99338 rows) than similarity data (252 rows). This is because the map data contains one data point for each vertex–each turn of the coastline and political boundary, while the similarity data only contains a single value for each country. Hence we want to merge the map data in a way that we keep all world data points, and assign the same similarity value for each map datapoint that describes the same country. Also note that the merge key–the information that connects the map and similarities–is the country name. We also need to ensure that the country names have the same case, here we need to convert the world map country names to lower case. We can achieve this with (See section 14.1 for more about merging data):
world %>%
similarityMap <- mutate(country = tolower(region)) %>%
left_join(similarity, by = "country")
The few lines of the result look like:
%>%
similarityMap head(3)
## long lat group order region subregion country
## 1 -69.89912 12.45200 1 1 Aruba <NA> aruba
## 2 -69.89571 12.42300 1 2 Aruba <NA> aruba
## 3 -69.94219 12.43853 1 3 Aruba <NA> aruba
## terrorism nuclear trade battery regime volcano palm fir
## 1 0.0891 -0.011 0.0504 -0.01 -0.0356 0.1659 0.2927 0.0965
## 2 0.0891 -0.011 0.0504 -0.01 -0.0356 0.1659 0.2927 0.0965
## 3 0.0891 -0.011 0.0504 -0.01 -0.0356 0.1659 0.2927 0.0965
## flood drought mountain
## 1 0.0158 0.0581 0.1073
## 2 0.0158 0.0581 0.1073
## 3 0.0158 0.0581 0.1073
You can see that the three first Aruba vertices all have similarity 0.0891 with “terrrorism”.
The result is just what it is supposed to be–a combination of the map data and word similarity data.
Finally, we can make a plot as we did above, but this time filling the polygons by a given word, e.g. by "trade:
ggplot(similarityMap,
aes(long, lat, group=group, fill=trade)) +
geom_polygon(col="black", size=0.3) +
coord_quickmap()
We can see that the country name “China” is most often used in a similar context as “trade”, while the countries in central Asia and in Africa are not.
Exercise 15.2 Make similar plots but this time use words “fir” and “palm”. Do you see how geographic location is associated with trees?
Exercise 15.3 TBD: Merge data but leave out re-ordering. What happens?
15.2 Shapefiles and GeoJSON: using the professional maps
The example maps, included in ggplot, are fairly good, but there is only a small number of those to choose from. It is not a substitute for professional maps.
Fortunately, it is easy to use popular map files. There are multiple ways to store map data. One of the most popular formats is ESRI shapefiles (or just shapefiles). This format is widely used to store geographic data by government agencies, architects, and municipalities. You can find shapefiles for national boundaries, land use zones, and protected water bodies.
Another popular format is GeoJSON. This is an open standard, developed by geojson.org.
Below, we discuss how to use such files, using an example of Ukrainian regions.
15.2.1 Your first professional map: regions in Ukraine
In this example, we use a map of Ukrainian regions. This is stored in GeoJSON format, but working with shapefiles is very similar. The most convenient library for accessing geospatial data is sf, it also includes a nice integration with ggplot.
First, let’s load the file and make the map:
library(sf)
read_sf(
map <-"data/ukraine-with-regions_1530.geojson"
)ggplot(map) +
geom_sf(linewidth = 0.5)
- first, you need to load the sf library.
- thereafter, you load the map data.
read_sf()
can load both GeoJSON and shapefiles. - finally, you can use
ggplot()
to plot it.geom_sf()
understands the spatial nature of the dataset and makes an appropriate plot. You can adjust it with ordinary aesthetics likelinewidth
,col
, andfill
.
15.2.2 What is spatial data frame?
The map data we stored in variable map is a special type of data frame–spatial data frame. It is in many ways similar to ordinary data frame, but it also includes spatial data–coordinates of vertices, lines and polygons, and coordinate reference system.
Let’s take a closer look:
head(map, 2)
##
## 1 function (.x, .f, ..., .progress = FALSE)
## 2 {
First, the dataframe prints some overall information about the map. It tells that the map is made of polygons,28 what is it bounding box (the locations of the extreme left, right, top and bottom points), and coordinate reference system (CRS). Here it is “WGS 84”, the popular longitude-latitude coordinate system that is based on Greenwich meridian.
The data frame contains
0 variables,
__.
The first four are ordinary columns id, name, density and
path. The fourth one, geometry, is a special column that contains
polygons. This is the column that holds the most important map data,
and the column that geom_sf()
uses to actually draw the map.
Exercise 15.4 Compare the spatial data frame map here with the manually created New Zealand map data frame in Section 15.1.1. What do you think, what is the most important difference? What other differences can you spot?
See the answer
15.2.3 Adding data to maps
Adding data to professional maps works mostly in the same way as in case of ggplot-maps (see Section (maps-simple-data)). We need another data frame that contains polygon id-s and the value of interest. As this is a data of oblasts–each polygon is an oblast, we want to merge this with some other dataset about oblasts. The dataset ukraine-oblasts-population contains the oblasts’ population as in 2015:
read_delim("data/ukraine-oblasts-population.csv")
population <-head(population, 3)
## # A tibble: 3 × 4
## Prefecture Population `Urban population` `Rural population`
## <chr> <dbl> <dbl> <dbl>
## 1 Donetsk Oblast 4387702 3973317 414385
## 2 Dnipropetrovsk… 3258705 2724872 533833
## 3 Kyiv 2900920 2900920 NA
This is an ordinary (non-spatial) data frame where Prefecture denotes the oblast. Here we are just interested in the column Population. Now we can merge population to the map data. Remember that the key, the oblast name, is called name in the latter:
map %>%
ukraine <- left_join(population, by = c("name" = "Prefecture"))
## Error in UseMethod("left_join"): no applicable method for 'left_join' applied to an object of class "function"
Exercise 15.5 Why is the example using left_join()
? What happens if some
population
information is missing? What happens if some oblasts are missing?
See the answer
Let’s visualize the result by printing two variables only–name and Population:
%>%
ukraine select(name, Population) %>%
head(3)
## Error in eval(expr, envir, enclos): object 'ukraine' not found
As you can see, now the column Population is included in the map data. Besides this, the map still includes CRS and the geometry column, even if we did not explicitly select the latter. This is because it is spatial data frame.
Finally, we can plot it:
ggplot(ukraine) +
geom_sf(aes(fill = Population/1000)) +
labs(fill = "Population\n(thousand)")
The population is displayed in thousands as the resulting numbers are easier to grasp.
Exercise 15.6 Why is the population of three oblasts (Kiyv, Odesa and Zaporizhzhia) missing?
15.2.4 Working with different spatial files
Another task we frequently want to do is to add some other kind of data to the maps. Here we demonstrate it in only a rudimentary fashion, you can look up the more detailed instructions yourself. Let’s add the Kyiv, the capital city of Ukraine on the map as a red dot. Kyiv is located at 50°27′00″N, 30°31′24″E (lat/lon), or in decimal degrees 50.45N, 30.5233E.
This point (and more points or lines) can be added to the plot as follows:- Create a spatial data frame that contains the points requested
- Specify the coordinate referece system (CRS) for that spatial data
- Add the new data frame to the plot using another
geom_sf()
.
First, we create just a data frame of the coordinate. Thereafter we
convert it to a spatial data frame using st_as_sf
. We can
immediately add CRS
the the result when doing the conversion:
data.frame(x = 30.5233, y = 50.45) %>%
kyiv <- st_as_sf(coords = c("x", "y"),
crs = "wgs84")
## Error in st_as_sf(., coords = c("x", "y"), crs = "wgs84"): could not find function "st_as_sf"
kyiv
## Error in eval(expr, envir, enclos): object 'kyiv' not found
Here we use wgs84 as the coordinate reference system. Coordinate systems is a complicated topic, well beyond the scope of this book. To make it brief–Earth is broadly spherical, but not exactly spherical. If a high precision is required then much better models are needed than just a spherical earth. WGS 84 (World Geodetic System 1984 standard) is one of the most popular such models. It is based on longitude and latitude, and as a rule of thumb–if coordinates are given as longitude and latitude, then WGS 84 is good crs. If unsure, you can get crs of the current map using
st_crs(map) # WGS 84
Now we are ready to plot the map with Kyiv marked as a red dot:
ggplot() +
geom_sf(data = ukraine,
aes(fill = Population/1000)) +
geom_sf(data = kyiv,
col = "orangered", size = 3) +
labs(fill = "Population\n(thousand)")
Here I have chosen to make it a way that data is only fed to the
corresponding geom_sf
layer, not to the ggplot
layer itself. This
is to make it more clear which layer uses which data.
Alternatively, map may only contain lines (e.g. river map) or points (e.g. map of electricity substations).↩︎