Chapter 13 Visualizations: the ggplot2 Library

Data visualization, including plotting, is one of the most powerful ways to communicate information and findings. R provides multiple visualization packages, in particular the base-R plotting tools (the graphics library), which are flexible and powerful. However, in this chapter we introduce the ggplot2 library, which is oriented to visualizing datasets. It has an intuitive and powerful interface and simplifies many tasks that are tedious to achieve with base-R graphics. But be aware that, like any other tool, ggplot2 has its limits, and sometimes it is better to use other visualization packages.

ggplot2 is called ggplot2 because once upon a time there was a package called ggplot. However, as the authors found its API somewhat limiting, they wanted to break compatibility and start from a blank sheet. To distinguish the new package from the old one, they called it ggplot2.

Examples in this chapter are adapted from R for Data Science by Garrett Grolemund and Hadley Wickham.

13.1 A Grammar of Graphics

Just as the grammar of language helps us construct meaningful sentences out of words, the Grammar of Graphics helps us to construct graphical figures out of different visual elements. This grammar gives us a way to talk about parts of a plot: all the circles, lines, arrows, and words that are combined into a diagram for visualizing data. Originally developed by Leland Wilkinson, the Grammar of Graphics was adapted by Hadley Wickham to describe the components of a plot. It includes

  • the data being plotted
  • the aesthetics–visual elements, such as positions, colors and line styles, that make up the plot. This also covers the aesthetics mapping: which visual elements are related to which data variables.
  • the geometric objects (circles, lines, etc.) that appear on the plot
  • a scale that describes how the data values are represented as visual elements
  • a statistical transformation used to calculate the data values used in the plot
  • a position adjustment for locating each geometric object on the plot
  • a coordinate system used to organize the geometric objects
  • the facets, a set of sub-plots to display different subsets of data.

ggplot organizes these components into layers, where each layer has a single geometric object, statistical transformation, and position adjustment. Following this grammar, you can think of each plot as a set of layers of images, where each image’s appearance is based on some aspect of the data set. This is somewhat similar to dplyr, which develops a “grammar” for data processing. ggplot’s approach is intuitive in a similar fashion.
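
To make the idea of a layer concrete, here is a minimal sketch that spells out one layer in full with ggplot2's layer() function and compares it with the usual geom_point() shorthand. The 200-row slice d and the object names are only for illustration.

```r
library(ggplot2)

d <- head(diamonds, 200)   # small slice of the built-in data, for speed

# Spell out one layer in full: a geom, a stat and a position adjustment.
p_explicit <- ggplot(d) +
   layer(geom = "point",          # geometric object
         stat = "identity",       # statistical transformation: use data as is
         position = "identity",   # position adjustment: none
         mapping = aes(x = carat, y = price))

# The usual shorthand: geom_point() supplies the same stat and position.
p_short <- ggplot(d, aes(x = carat, y = price)) +
   geom_point()

# Both versions place one point per row of the data.
nrow(layer_data(p_explicit, 1))
nrow(layer_data(p_short, 1))
```

In everyday use you will rarely call layer() directly; the geom_ functions are convenient wrappers around it.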

The ggplot2 library provides a set of functions that mirror the above grammar, so you can fairly easily specify what you want a plot to look like. Compared to dplyr, it is somewhat less intuitive though.

ggplot2 is a part of the tidyverse set of packages, so if you installed and loaded tidyverse, then ggplot2 is ready to use. Otherwise, you need to install it (using install.packages("ggplot2")) and load it as:

library("ggplot2")

13.2 Basic Plotting with ggplot2

Now it is time to take a quick look at simple plotting with ggplot. The first task is to understand the basics; we’ll discuss all the topics in more detail below.

13.2.1 Diamonds data

The ggplot2 library comes with a number of built-in data sets. One of the more interesting ones is diamonds (see Section I.5). It contains price, shape, color and other information for approximately 50,000 diamonds. A sample of it looks like this:

diamonds %>%
   sample_n(4)
## # A tibble: 4 × 10
##   carat cut   color clarity depth table price     x     y     z
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.73 Ideal I     VS1      60.7    56  2397  5.85  5.81  3.54
## 2  0.7  Ideal G     VS1      60.8    56  3300  5.73  5.8   3.51
## 3  0.31 Ideal D     VS1      61.6    55   713  4.3   4.33  2.66
## 4  0.31 Ideal H     VVS1     62.2    56   707  4.34  4.37  2.71

It is included in the ggplot2 library, so you do not need to load it separately. Here we use variables

  • carat: mass of diamonds in carats (ct), 1 ct = 0.2 g
  • cut: describes the shape of the diamond. There are five different cuts: Ideal is the best and Fair is the worst in these data. Better cuts make diamonds that are more brilliant.
  • price: in $

As the dataset is big, we take a small subset for plotting:

d1000 <- diamonds %>%
   sample_n(1000)
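
Because sample_n() draws rows at random, every run gives a slightly different subset and hence a slightly different plot. If you want the same subset each time, one option is to fix the random seed first. This is a sketch, and the seed value 1 is arbitrary.

```r
library(ggplot2)   # provides the diamonds data
library(dplyr)     # provides %>% and sample_n()

set.seed(1)        # arbitrary seed, only to make the draw repeatable
d1000 <- diamonds %>%
   sample_n(1000)
nrow(d1000)        # number of rows in the subset
```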

13.2.2 Our first ggplot

A simple scatterplot of mass versus price.

Let’s explain the plotting function with a small example, a simple scatterplot of diamonds’ mass (carat) versus price:

ggplot(d1000,
       aes(x = carat, y = price)) +
   geom_point()

Here is the walk-through of the code above:

  • ggplot plotting starts with ggplot(). This sets the plot up, but does not do much else useful. It typically takes two arguments–the first one is the dataset we are using (here d1000, the 1000-observation subset of diamonds); and the second one is the aesthetics mapping (see more below in Section ??). Here we tell ggplot that we want to put data variable carat on the horizontal axis (x) and price on the vertical axis (y).

  • Aesthetic mappings are defined using the aes() function. The aes() function takes a number of arguments like x = carat. The first one, x is the visual property to map to, and the second one carat, is the column in the data to map from. In the example above, x = carat means to take the data variable carat, and map it to the visual property x: the horizontal position; in a similar fashion it maps price to y, the vertical position.

    We did not specify any other visual properties, such as color, point size or point shape, so by default the geom_point() produced a set of black dots of the same size, positioned according to the carat and price. (See Section 13.3 below for how to use more aesthetics.) This is what the plot displays: each diamond is a dot, positioned according to carat and price.

  • Next line, geom_point(), does the actual plotting. It is one of the many geom-s, named geom_ followed by the name of the kind of plot type you wish to create. Here, geom_point() will create a layer with “points”, usually called scatterplot (see Section 13.4.1).

    There are many other options, including geom_line() to connect points with lines, geom_col() to make columns (barplots) and many more, see Sections 13.4 and 13.9.

  • You can add other geom layers (see Section 13.4.2), or other elements, such as labels or color customization to the plot by using the addition (+) operator.

Thus, basic simple plots can be created just by specifying a data set, a geom, and a set of aesthetic mappings. Although the graph we did above does not look the best, it may be enough for many purposes.
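
As an illustration of stacking elements with +, the sketch below adds a second layer and axis labels to the same scatterplot. The choice of geom_smooth() with a straight-line fit and the label texts are assumptions made for this example only.

```r
library(ggplot2)
library(dplyr)

set.seed(1)                       # arbitrary seed for a repeatable sample
d1000 <- sample_n(diamonds, 1000)

p <- ggplot(d1000, aes(x = carat, y = price)) +
   geom_point() +                                   # layer 1: the dots
   geom_smooth(method = "lm", formula = y ~ x) +    # layer 2: a straight trend line
   labs(x = "Mass (ct)", y = "Price ($)")           # axis labels
p
```

Both layers use the x and y mapping given in ggplot(), so the mapping is written only once.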

Exercise 13.1 How are diamonds’ length and width related? Make a similar plot where you put diamonds’ length x on the horizontal axis and their width y on the vertical axis.

See the solution

As the example above shows, ggplot is very well suited to visualize data in data frames. But sometimes you want to plot vectors instead. The library includes the qplot() function for such “quick plots”, but base-R plotting may be quicker and easier (see Section 11.6.5).

Next, we discuss the ggplot tools in more detail.

13.3 Aesthetics mapping

The aesthetic mapping is a central concept of every data visualization. It means setting up the correspondence between aesthetics, the visual properties of the plot, such as position, color, size, or shape of the points and lines; and certain properties of the data, typically numeric values of certain variables. Aesthetics are the plot properties that you want to drive with your data values, rather than fix in code for all markers. Each property can therefore encode an aspect of the data and be used to express underlying patterns.

13.3.1 Specifying aesthetics

In our first example above we only used the horizontal position x and vertical position y and mapped these to carat and price as aes(x=carat, y=price). We did not specify any other visual properties, such as color, point size or point shape, so by default the geom_point() produced a set of black dots of the same size, positioned according to the carat and price. This is exactly what aesthetics means here: we position the dots according to carat and price. (See Section 13.3.3 for how to specify visuals that are not linked to data.)

Plot of the same diamonds as above, but this time cut variable is mapped to color aesthetic.

The power of the aes() function is how simple it makes adding more visual properties that are driven by data. For instance, let’s color the dots according to cut (diamond shape). This means taking an additional aesthetic, color, and mapping it to the variable cut in data as col = cut (color = cut works as well). This must be done in the aes() function as an additional named argument:

ggplot(d1000,
       aes(x = carat, y = price,
           col = cut)) +
   geom_point()

The resulting plot displays the same dots as the one in Section 13.2.2–it uses the same x = carat and y = price mapping. But now the dots are colored according to cut as we added col = cut mapping. ggplot also adds the color key, telling which color corresponds to which cut.

The aesthetics mapping can be specified in ggplot() function, as we did above. In that case it applies to all following geom-s, here just to the geom_point(). This is a handy approach when we want to add multiple geoms using the same data, e.g. both points and lines. But it can also be specified inside of the geom function, such as geom_point(aes(...)). In that case it only applies to the particular geom.

Exercise 13.2 Use aes() function twice: once inside ggplot() to specify x and y position, and once inside geom_point() to specify color. What happens?

See the solution

Finally, note that ggplot treats variables differently, depending on whether they are continuous, discrete, or ordered. This is a frequent source of confusion, see Section 13.5 below.

13.3.2 Most important aesthetics

There are a number of aesthetics that ggplot recognizes. The most important are:

  • x, y: the horizontal and vertical position. These are the default first and second argument for aes() function, so you do not normally need to specify these. So instead of aes(x = carat, y = price), we can also write aes(carat, price). This is not true for any other aesthetics.
  • col or color: dot and line color. In case of scatterplot or line plot (geom_point() and geom_line()), this is the color of the objects. In case of filled objects, such as bars on the barplot, it is the outline color, not the fill color!
  • fill: the fill color of area objects, such as bars on barplot or regions on a map. It has no effect on lines and points.
  • size: size of points
  • linewidth: width of line elements
  • linetype: type of lines–solid, dotted, dashed and similar.
  • alpha: transparency. alpha = 1 is completely opaque, and alpha = 0 is completely transparent (invisible).
  • group: determines how data is grouped. For instance, in child growth data that contains multiple children, measured at multiple time points, you may want to draw a separate line for each child. See Section 13.4.2.
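
Several of these aesthetics can be combined in one mapping. The sketch below gives x and y by position, maps color and point size to data variables, and sets a fixed transparency. The choice of depth for point size is only an illustration, not a recommendation.

```r
library(ggplot2)
library(dplyr)

set.seed(1)                       # arbitrary seed for a repeatable sample
d1000 <- sample_n(diamonds, 1000)

p <- ggplot(d1000,
       aes(carat, price,        # x and y given by position
           col = cut,           # data-driven color
           size = depth)) +     # data-driven point size
   geom_point(alpha = 0.5)      # fixed transparency for all points
p
```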

13.3.3 Fixed aesthetics

Sometimes we do not want to map the visual properties to data, but just to specify some kind of fixed values. For instance, we want to make a plot, similar to the one above in Section 13.2.2, but request the points to be purple.

The same plot as above, but now we request “purple” color for the points.

Such request must be done inside of the corresponding geom, but outside of the aes() function. For instance,

ggplot(d1000,
       aes(x = carat, y = price)) +
   geom_point(col = "purple")

When used outside of aes(), there will not be any mapping to the data, and the color is just what it is: a color.

What happens if you map color to variable “purple”

But what happens if you specify a fixed color inside of the aes() function? This is a frequent source of confusing errors for beginners. This makes ggplot think that we are mapping color to a data vector c("purple"). So it thinks that “purple” is a data value, not a color, and picks whatever color it considers appropriate:

ggplot(d1000,
       aes(x = carat, y = price,
           col = "purple")) +
   geom_point()

Here the color turns out to be orange. ggplot is even helpful and adds a color key that tells you that the orange color corresponds to value “purple”!
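
You can check the difference without looking at the picture. The sketch below uses ggplot2's layer_data() to print the colors that are actually drawn in both cases. The 100-row slice d is only there to keep the example small.

```r
library(ggplot2)

d <- head(diamonds, 100)   # a small slice is enough here

p_fixed  <- ggplot(d, aes(carat, price)) +
   geom_point(col = "purple")            # fixed color, outside aes()
p_mapped <- ggplot(d, aes(carat, price, col = "purple")) +
   geom_point()                          # "purple" treated as a data value

unique(layer_data(p_fixed)$colour)    # the color actually drawn
unique(layer_data(p_mapped)$colour)   # whatever the default scale picks
```

In the first case the drawn color is the one we asked for. In the second case it is a single color chosen by the default color scale, and it is not purple.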

Exercise 13.3 Amend the colored plot:

  • Use a different color
  • Make the dots larger
  • Make them semi-transparent.

See the solution

13.4 Most important plot types

There is a large number of possible plot types. But here we introduce some of the most widely used ones–scatterplot, line plot, barplot, histogram and boxplot. Picking one of these is enough in many circumstances.

13.4.1 Scatterplot

One of the most widely used plot types is the scatterplot, a plot of point clouds.

An example scatterplot of random data.

Scatterplots are good for visualizing continuous data–it is best if the variables you put on both the horizontal and the vertical axis are continuous. This includes values like income and age, usage percentage and reliability, and GDP and child mortality.

The objects you put on a scatterplot should be distinct–they should not transition from one to another. So different humans, hard disks or countries are good examples–one human does not transform into another, and neither do hard disks. Even for countries, such transitions are very rare.

Petal length versus width for iris flowers.

Here is an example scatterplot of real data–of the R built-in iris dataset:

## use iris dataset
ggplot(iris,
       ## plot petal length vs width
       aes(Petal.Length, Petal.Width)) +
   ## make scatterplot
   geom_point() +
   ## add axis labels
   labs(x = "Petal length (cm)",
        y = "Petal width (cm)")

The plot depicts the relationship between length and width of petals of iris flowers. Scatterplot is a good choice here because each unit (observation) in data depicts a separate flower. Obviously, flowers do not transition to each other, so it would be misleading (and very ugly) to connect the dots. We prefer a scatterplot.

Petal length versus width for iris flowers, this time marking different species with different colors.

However, there are three different species included in the dataset. If we want to convey the difference of their petal size, we can use another aesthetic, for instance color, to represent species:

ggplot(iris,
       ## plot petal length vs width,
       aes(Petal.Length, Petal.Width,
           ## mark species with color
           col = Species)) +
   geom_point() +
   labs(x = "Petal length (cm)",
        y = "Petal width (cm)") +
   theme(text = element_text(size=15))

Here col = Species tells ggplot that an additional aesthetic, color, should be mapped according to the data variable Species. Or to put it simpler–use dots of different color for different species. Note how ggplot automatically makes a legend, an explanation of which color denotes which species.

Using more aesthetics allows us to put more information on the plot. Besides the relationship between length and width, we can now also see that setosa flowers tend to be small and virginica flowers large.

13.4.2 Line plot

TBD: add an example with multiple geoms (line + point) here.

Line plot is another very popular way of presenting information. It is similar to scatterplot in the sense that it is well-suited for plotting continuous data. However, connecting points with lines is useful mainly if there is a clear transition from one observation to the next one. This is commonly the case with time series data–time flows continuously, and usually the features we measure at different points in time are also continuously changing.

We demonstrate line plot using Scandinavian COVID-19 data:

covS <- read_delim("data/covid-scandinavia.csv.bz2") %>%
   select(country, date, type, count) %>%
   filter(date > "2020-03-01",
          date < "2020-07-01")
                           # select a 4-month date range only
covS %>%
   sample_n(5)
## # A tibble: 5 × 4
##   country date       type      count
##   <chr>   <date>     <chr>     <dbl>
## 1 Sweden  2020-06-09 Confirmed 46299
## 2 Norway  2020-03-04 Confirmed    56
## 3 Sweden  2020-06-30 Confirmed 67924
## 4 Finland 2020-04-28 Confirmed  4740
## 5 Denmark 2020-05-28 Deaths      568

The dataset includes the cumulative number of deaths and confirmed cases in four Scandinavian countries: Norway, Sweden, Denmark and Finland. Note the structure of the dataset: an observation is a country-date-type combination. For each country and each date, there are two types of counts: Deaths and Confirmed. Below, we filter deaths only.

Line plot of total COVID-19 deaths. Different countries are depicted by different color.

As the data contains four different countries, it is natural to distinguish between countries using lines of different color: we pick the color aesthetic and map it to variable country as color = country:

covS %>%
   filter(type=="Deaths") %>%
                           # look at deaths only
   ggplot(aes(date, count,
                           # date vs death count
              color=country)) +
                           # distinguish countries by color
   geom_line() +
   theme(text = element_text(size=15))
                           # make text larger

The data shows that there was a rapid growth in COVID-19–related deaths in spring 2020. We can also see that there were many more deaths in Sweden than elsewhere.

Why is line plot a good choice here? Because the counts are based on dates, and time flows continuously from one day to another. One can imagine replacing the lines by dots (scatterplot), and sometimes it is useful. But here lines stress that observations–the dots–are actually connected. As time flows, yesterday turns into today, and yesterday’s counts turn into today’s counts.

Denoting different countries with similar lines.

Sometimes we may not want to use different colors or linestyles to denote different countries (or other groups of observations). In that case one can use the group aesthetic–it simply tells which observations should be grouped together. Visual representation, however, is unaffected:

covS %>%
   filter(type=="Deaths") %>%
   ggplot(aes(date, count,
              group=country)) +
                           # denote different countries by different lines
                           # of same color and type
   geom_line(col="gray") +
   theme(text = element_text(size=15))

The plot is less attractive, and, in particular, we cannot tell which line represents which country. But this may be sometimes desirable, for instance, if there are too many groups to color them individually. We may want to plot everything with the same gray color, and add a selected few with marked colors on top of it.
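
The idea of gray background lines with a highlighted few can be sketched as follows. Since the COVID-19 file may not be at hand, this example assumes the built-in Orange data (5 trees, each measured at 7 ages) instead, and the choice of tree “4” and the red color are arbitrary.

```r
library(ggplot2)

orange <- as.data.frame(Orange)              # built-in data: 5 trees, 7 ages each
highlight <- subset(orange, Tree == "4")     # the tree we want to stress

p <- ggplot(orange, aes(age, circumference, group = Tree)) +
   geom_line(col = "gray") +                      # all trees in gray
   geom_line(data = highlight, col = "red",       # one tree on top, in color
             linewidth = 1)
p
```

The second geom_line() gets its own data argument, so it draws only the selected tree, while still using the mapping given in ggplot().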

Unsuccessful attempt to do ungrouped line plot.

However, if you leave out the group or color aesthetic altogether, then the result may be hard to interpret:

covS %>%
   filter(type=="Deaths") %>%
   ggplot(aes(date, count)) +
   geom_line() +
   theme(text = element_text(size=15))

What happens here is that ggplot orders the observations along the date-axis, and then uses line to connect previous count to the next count. However, for every day we have four different counts–one for each country. So it ends up connecting all countries vertically for each date, and so we get an interesting shape made of densely packed lines here.
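
To see how many separate lines ggplot will draw, you can inspect the group column of the computed layer data. This is a small sketch that again assumes the built-in Orange data in place of the COVID-19 file.

```r
library(ggplot2)

orange <- as.data.frame(Orange)   # built-in data: 5 trees, 7 ages each

p_ungrouped <- ggplot(orange, aes(age, circumference)) +
   geom_line()
p_grouped   <- ggplot(orange, aes(age, circumference, group = Tree)) +
   geom_line()

# number of separate lines ggplot will draw
length(unique(layer_data(p_ungrouped)$group))
length(unique(layer_data(p_grouped)$group))
```

Without a group (or another discrete aesthetic such as color), all observations end up in a single group and are connected by one line.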

13.4.3 Barplot

Barplots are suitable to display data where one variable is categorical and the other one is numerical. We demonstrate the barplot using the average size of orange trees (see Section I.10):

data(Orange, package = "datasets")
avg <- Orange %>%
   group_by(Tree) %>%
   summarize(size = mean(circumference))
avg  # average size of 5 orange trees
## # A tibble: 5 × 2
##   Tree   size
##   <ord> <dbl>
## 1 3      94  
## 2 1      99.6
## 3 5     111. 
## 4 2     135. 
## 5 4     139.

Barplot using default options.

For instance, we can plot

ggplot(avg, aes(Tree, size)) +
   geom_col()

Barplots can be created with geom_col(). (There is also geom_bar(), but by default that counts the observations in each category, much like a histogram!) The default options of geom_col() create a rather dull figure of gray bars, but it conveys all the necessary information.
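
The difference between the two geoms is easiest to see side by side. In this sketch geom_bar() receives the raw diamonds data and counts the rows itself, while geom_col() receives counts that we computed beforehand. The helper data frame counts is made for this example only.

```r
library(ggplot2)

# geom_bar(): give it raw rows, it counts them per category
p_bar <- ggplot(diamonds, aes(cut)) +
   geom_bar()

# geom_col(): give it one precomputed value per category
counts <- as.data.frame(table(cut = diamonds$cut))  # columns: cut, Freq
p_col <- ggplot(counts, aes(cut, Freq)) +
   geom_col()

# both plots end up with the same bar heights
layer_data(p_bar)$count
layer_data(p_col)$y
```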

Adjusting colors of barplot. Remember that fill is the fill color and col is the border color. linewidth here is the width of the outline.

The gray color may be exactly what you want if you intend to print it on a b/w printer. But if you want to show it on a color-aware device, you may want to specify colors:

ggplot(avg, aes(Tree, size)) +
   geom_col(fill="mediumpurple4",
            col="gold1", linewidth=2)

Why is barplot a good plot type for such tasks? This is because the horizontal position of bars is rather arbitrary (often based on alphabetic ordering, here based on the average size of trees). Bars are just next to each other, they are typically also of equal width, and the fact that tree “3” is after tree “2” does not typically mean these trees are “close” in any meaningful sense. The discrete bars stress that there is no natural smooth connection between trees; they are separate, discrete objects.

Exercise 13.4 Give each bar a different color by making the fill aesthetic depend on the tree id. Do you like the result?

See the solution

13.4.4 Histogram

Histograms visualize distributions–what kind of values are more common or less common. In case of continuous data, the values are split into a limited number of “bins”, and then the computer counts the number of values that fall into each bin.

Histogram of diamonds’ price. Cheap diamonds are most common.

Let’s visualize the distribution of diamonds’ price using the same color scheme as for the barplot (Section 13.4.3).

ggplot(diamonds, aes(price)) +
   geom_histogram(
      bins = 20,
                           # split into 20 bins
      fill = "mediumpurple4",
      col = "gold1"
   )

The histogram shows that most diamonds are relatively cheap–the largest count is in the second-smallest price bin, less than $1000. But there are more expensive diamonds, almost up to $20,000.
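
If you want to see the bins and counts behind the bars, you can again look at the computed layer data. This is a sketch: the column names xmin, xmax and count are the ones ggplot2 uses for binned data, while the object names p and h are arbitrary.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(price)) +
   geom_histogram(bins = 20)

h <- layer_data(p)                  # one row per bin
h[1:3, c("xmin", "xmax", "count")]  # first three bins and their counts
nrow(h)                             # number of bins
sum(h$count) == nrow(diamonds)      # every diamond lands in exactly one bin
```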

Exercise 13.5 Use histograms to show two distributions:

  • Age of Titanic passengers
  • Fare paid by the passengers

Experiment with the bins= option to make the histograms look good.

What do you think, why do these distributions look so different?

See the solution

13.4.5 Boxplot

Boxplots are simplified histograms, typically used to display distribution differences for different groups. They are often appropriate where it is otherwise hard to show the relationship between different values.

Scatterplot of cut versus price. The result is not readable.

For instance, let’s try to understand how cut and price of diamonds are related. We can attempt to do it using a scatterplot:

ggplot(diamonds,
       aes(cut, price)) +
   geom_point(
      position = position_jitter(
         width=0.3
      ),
      alpha=0.3
   )

We move the points randomly left and right from the discrete cut values to avoid overplotting (this is what position = position_jitter() does) and make the points semi-transparent (alpha = 0.3). But the results are still incomprehensible.

A better way to display the dependency is to use a boxplot. The image here displays the same data (just random values) in two ways–as scattered points at left, and as a boxplot at right. A boxplot consists of a box that normally stretches from the lower to the upper quartile of the distribution, and the median is prominently displayed by a bold line. The box has “whiskers”, lines that stretch up and down from the box by no more than 1.5 times the height of the box, to the last data point within this range. Finally, all cases that do not fit inside the whiskers are marked by individual dots. See more in Wikipedia.
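
The elements of a boxplot can be computed by hand. The sketch below does this for a small made-up vector, assuming R's default quantile definition; the numbers are invented so that one value is a clear outlier.

```r
# made-up values, with one clear outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q <- quantile(x, c(0.25, 0.5, 0.75))   # lower quartile, median, upper quartile
iqr <- q[3] - q[1]                     # height of the box

# whiskers reach the most extreme values within 1.5 box heights
lower_whisker <- min(x[x >= q[1] - 1.5 * iqr])
upper_whisker <- max(x[x <= q[3] + 1.5 * iqr])

# everything beyond the whiskers is drawn as individual dots
outliers <- x[x < lower_whisker | x > upper_whisker]

unname(q)        # 3.25 5.50 7.75
lower_whisker    # 1
upper_whisker    # 9
outliers         # 30
```

Here the box is 4.5 units tall, so the upper whisker could reach up to 7.75 + 6.75 = 14.5. The last data point within that range is 9, and the value 30 is marked as a separate dot.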

Boxplot is well suited to reveal how the price depends on diamonds’ cut.

Here is the price distribution by cut as a boxplot:

ggplot(diamonds,
       aes(cut, price)) +
   geom_boxplot()

It reveals that if anything, we have an inverse relationship–better cut diamonds are cheaper.

Exercise 13.6 The message from the previous plot is that better cut diamonds are cheaper. This seems counter-intuitive and needs some explanation. Perhaps the ideal-cut diamonds are just smaller? Let’s check this out!

  • Select diamonds in a narrow mass range only, e.g. in \([0.45, 0.5]\)ct and in \([0.95, 1.0]\)ct.
  • Do a similar boxplot, separately for both of these ranges. Do you see now that a more desirable cut commands a higher price?

See the solution

13.4.6 When to use which plot type

We finish this section with a short recap of the most important plot types. There are many more types, some of which are described in Section 13.9.

Scatterplot (geom_point()) is suitable to display the relationship between two continuous (or nearly continuous) variables. The data points should not be connected by some sort of smooth transition. For instance, different diamonds do not transform to each other in a smooth way.

Alternatives: line plot.

Line plot (geom_line()) is suitable to show relationship between two continuous variables, like scatterplot. However, there should exist a smooth transition between data points. For instance, data points that correspond to different age can be connected, to indicate that the points are in fact connected–we just do not have data for the intermediate age values.

It may be useful to depict different subjects with different lines, and mark the actual data points on the lines.

Alternatives: scatterplot, barplot

Barplot (geom_col()) is suitable to describe the relationship between a categorical and a continuous variable. The bars indicate that the data on \(x\)-axis is not continuous.

Alternatives: scatterplot, line plot

Histogram (geom_histogram()) is good to display distributions, most likely for a single continuous variable. It is usually hard to put several histograms on the same plot.

Alternatives: density plot, boxplot, violinplot.

Boxplot (geom_boxplot()) is good to display distributions of a continuous variable by different categories of a categorical variable.

Alternatives: violinplot, histogram, density plot.

13.5 Discrete versus continuous variables

ggplot tries hard to guess what the user wants, and then makes the plots accordingly. One important distinction is between continuous and discrete values (see Section 17.2 for how R describes discrete values).

Continuous variables are measured numerically and take any value, for instance age, distance, temperature or price. Discrete variables, however, can only contain a small set of pre-determined values. Examples include college majors (math, English, philosophy, …), college seniority (freshman, sophomore, junior, …), or the city name where you grew up (Seattle, Chongqing, Bangkok, …). It turns out that on the plots, discrete and continuous values are best to be represented somewhat differently.
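
A small sketch of the difference, using the built-in mtcars data (an assumption for this example; it is not used elsewhere in the chapter). The number of cylinders, cyl, is stored as a number, so ggplot treats it as continuous unless we convert it with factor().

```r
library(ggplot2)

# cyl is stored as a number, so ggplot uses a continuous color gradient
p_cont <- ggplot(mtcars, aes(wt, mpg, col = cyl)) +
   geom_point()

# factor(cyl) is discrete, so each value gets its own distinct color
p_disc <- ggplot(mtcars, aes(wt, mpg, col = factor(cyl))) +
   geom_point()

sort(unique(layer_data(p_cont)$colour))   # colors picked from a gradient
sort(unique(layer_data(p_disc)$colour))   # colors picked from a discrete palette
```

Both plots show three different colors, because cyl only takes the values 4, 6 and 8, but the colors and the legends differ: a color bar for the continuous version and a key with three entries for the discrete one.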

TBD: continuous versus discrete aesthetics

TBD: factor for discrete colors

13.6 Inheritance: aesthetics and data

Consider the orange tree data. Aesthetics specified in ggplot() are inherited by all geoms:

ggplot(data, aes(age, circumference,
                 col = Tree))

  • aes(x = age, y = circumference, col = Tree) is inherited by all geoms
  • You can override inherited aesthetics as

... +
   geom_point(aes(col = soiltype))

or with fixed aesthetics as

... +
   geom_point(col = "mediumpurple")

For example, with gray points and colored lines:

orange <- read_delim(
   "data/orange-trees.csv"
)
ggplot(orange,
       aes(age, circumference,
           col = Tree)) +
   geom_point(col = "gray30",
              size = 3) +
   geom_line(linewidth = 1)

Tree is not categorical, hence col = Tree does not group data!

Make “Tree” categorical with factor():

ggplot(orange,
       aes(age, circumference,
           col = factor(Tree))) +
   geom_point(col = "gray30",
              size = 3) +
   geom_line(linewidth = 1)

Now it groups data!

What about colored dots and gray lines? Problem: the lines are not grouped any more:

ggplot(orange,
       aes(age, circumference,
           col = factor(Tree))) +
   geom_point(size = 3) +
   geom_line(col = "gray30",
             linewidth = 1)

  • Problem: col = "gray30" overrides col = factor(Tree)
  • Hence data not grouped!

Adding the group aesthetic to geom_line() restores the grouping:

ggplot(orange,
       aes(age, circumference,
           col = factor(Tree))) +
   geom_point(size = 3) +
   geom_line(aes(group = Tree),
             col = "gray30",
             linewidth = 1)

TBD: aesthetics and data inheritance

Exercise 13.7 Use Ice Extent data. Make a line and point plot of the extent over years for February month (month = 2). Include both northern (region = “N”) and southern (region = “S”) hemisphere, use different colors for the hemispheres.

  • Make a plot with both lines and points of different color
  • Make a plot with only points of different color, but lines dark gray.

Now make a plot of ice extent in the Northern hemisphere only, but including three months: February, May, and September.

  • Make a plot with both lines and points of different color
  • Make a plot with only points of different color

Hint: use the %in% operator to select from multiple months (see Section 12.6.1.2).

See the solution

13.7 Tuning plots: scales and colors

ggplot2 can make a large variety of plots. It will pick appropriate colors and supply meaningful labels, so you immediately have something that looks reasonable enough to present.

But even if reasonable, the resulting plot may not be good enough. Sometimes you are happy with the fonts and colors and want just to adjust the labels, but other times the colors are completely misleading and the plot looks like an incomprehensible mish-mash. There is no way around tuning the plots.

13.7.1 Scales: linking aesthetics and data

Before we get into adjusting colors, we need to talk a bit about scales. In Section 13.3 above, we discussed aesthetics mapping. The mapping describes which visual properties, such as position and color, are derived from which data variables. For instance, aes(fill = Tree) tells that the bars should be filled by a color that depends on variable Tree.

But which tree is painted with which color? Is tree number “1” going to be red and number “2” blue? Or the other way around? Or something else completely? This is the place where scales come into play. Scales are a way to specify this kind of connection. There are multiple types of scales: some are relevant for discrete variables like the tree number, others for continuous variables like temperature, and yet others for completely different tasks like setting a logarithmic scale for a coordinate axis (see Section 13.8.6).

Discrete scales let you specify the exact color for each different tree (see Section 13.7.2), or line style for each political party. These are suitable for displaying a small number of distinct categories as different colors, fill patterns or point shapes.

Continuous scales specify gradients of colors or other continuous properties, such as transparency or point size. They are suitable for displaying continuous outcomes, such as temperature or income. For instance, you can tell ggplot to display temperature as colors with dark blue being the coldest and bright yellow the hottest (see Section 13.7.3).

Finally, there are other scales, that specify coordinate types, such as log scale (see Section 13.8.6).

What are aesthetics and what are scales?

Aesthetics                           | Scales
Aesthetics mapping: which variable   | Scales: how exactly the property
determines which visual property     | is related to the values
Example: color should depend on      | Example: Tree #1 should be red,
variable Tree                        | Tree #2 blue, …
use aes(col = Tree)                  | use scale_color_manual() (see below)

Both aesthetic mappings and scales treat continuous and discrete variables differently.

13.7.2 Adjusting discrete colors

Color is one of the most common things we want to adjust on plots. We discussed above how you can specify the element color manually as geom_point(col = "black") (see Section 13.3.3). But this is often not enough: we do not want to specify a single color but the dependency–how colors are related to values.

How to use scales depends very much on the type of variable–continuous or categorical. This is similar to how ggplot treats colors, line styles and many other plot parameters. We discuss discrete colors first; adjusting continuous colors is explained below in Section 13.7.3.

Consider a simple task: you are a political analyst in India and you want to make a plot of election results–the number of seats in the Lok Sabha (the lower house) won by the three largest parties, BJP (Bharatiya Janata Party), INC (Indian National Congress) and AITC (All India Trinamool Congress). You have a data frame that looks like:

df <- data.frame(party = c("BJP", "INC", "AITC"),
                   seats = c(303, 52, 23))
df
##   party seats
## 1   BJP   303
## 2   INC    52
## 3  AITC    23
plot of chunk ggplot-tuning-loksabha-varplot

Seats by political parties using default colors.

It is easy to visualize the results with colored bars:

ggplot(df,
       aes(party, seats, fill=party)) +
   geom_col()

But now you have a problem. As in many other countries, the Indian political parties are traditionally represented with colors, but just not with these colors. BJP is usually saffron (orange), INC is sky blue, and AITC is light green. While you got INC blue, the colors of AITC and BJP are swapped around. This is just misleading.

So we need to tell ggplot that the default colors are not good, and it should pick different color values: value BJP should be “saffron”, value INC “sky blue”, and value AITC should be “light green”.

plot of chunk ggplot-tuning-loksabha-barplot-scale

Seats by political parties represented by custom colors.

Fortunately, it is easy to achieve. We need to add a color scale, that tells which party name should correspond to which color. This can be achieved by scale_fill_manual(values = c(BJP="orange2", ...)):

ggplot(df,
       aes(party, seats, fill=party)) +
   geom_col() +
   scale_fill_manual(
      values = c(BJP="orange2",
                 INC="skyblue3",
                 AITC="springgreen3")
   )

This results in the desired colors for each political party.

Note the syntax of setting colors:

scale_fill_manual(
   values = c(BJP="orange2",
              INC="skyblue3",
              AITC="springgreen3")
)

scale_fill_manual takes argument values, a named vector where names correspond to the discrete values of the variable (here party) and the vector components are the corresponding color values. Obviously, one can also use different color codes, such as c(BJP="#FF9933", INC="#19AAED", AITC="#20C646") for somewhat more customary colors for these parties.
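
The hex-code variant mentioned above can be sketched as follows. This is just an illustration: the variable name partyColors is our own choice, while the color codes are the ones quoted in the text.

```r
library(ggplot2)

# same data as in the example above
df <- data.frame(party = c("BJP", "INC", "AITC"),
                 seats = c(303, 52, 23))

# named vector: names are the values of `party`,
# components are the corresponding hex color codes
partyColors <- c(BJP = "#FF9933",
                 INC = "#19AAED",
                 AITC = "#20C646")

p <- ggplot(df,
            aes(party, seats, fill = party)) +
   geom_col() +
   scale_fill_manual(values = partyColors)
p
```

Because the colors are matched to the data values by name, the order of the components in the vector does not matter.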

Exercise 13.8 What happens if you use scale_fill_manual() but do not specify the color for one of the discrete values? Do you get an error, a default color, or something else? Try it with the political party plot!

See the solution.

But now we need to say a few more words about scale_fill_manual(). What exactly does it do and when should you use it?

  • manual in scale_fill_manual() means that we pick colors manually–we are manually providing colors for each value in data. This is a good choice when there are only a few values, and when the pre-defined color palettes do not contain a suitable set of colors. This is the case here–we have only three values, and there is no dedicated palette for Indian politics.
  • fill in scale_fill_manual means you manually specify individual colors for the fill aesthetic. If you use col = party instead of fill = party, then you need to use its sibling function, scale_color_manual() instead.
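
As a minimal sketch of the sibling function, here is the same election data plotted as points, where the party is mapped to the col aesthetic instead of fill. The choice of geom_point() and the point size are our own, for illustration only:

```r
library(ggplot2)

df <- data.frame(party = c("BJP", "INC", "AITC"),
                 seats = c(303, 52, 23))

# `col = party` is a color (not fill) aesthetic, so we need
# scale_color_manual() instead of scale_fill_manual()
p <- ggplot(df,
            aes(party, seats, col = party)) +
   geom_point(size = 5) +
   scale_color_manual(
      values = c(BJP = "orange2",
                 INC = "skyblue3",
                 AITC = "springgreen3")
   )
p
```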

scale_fill_manual(), like scale_color_manual(), is a discrete scale. This means that you can only specify colors for discrete values. If the data variable is not discrete, e.g. you want to specify colors for different years, but year is a continuous number, then scale_fill_manual() will not work. You get an error:

gdp <- data.frame(GDP=c(1000, 1050), year=c(2023, 2024))
ggplot(gdp,
       aes(year, GDP, fill=year)) +
   geom_col() +
   scale_fill_manual(
      values = c("2023"="orangered2", "2024" = "steelblue3")
   )
## Error in `scale_fill_manual()`:
## ! Continuous values supplied to discrete scale.
## ℹ Example values: 2023 and 2024

The error tells you exactly what it is–a continuous value (here year) is supplied to a discrete scale (here scale_fill_manual()). This is the same problem we encountered in Section 13.5.

plot of chunk unnamed-chunk-38

Forcing year to a factor for the fill aesthetic allows us to use a discrete scale. Note that we haven’t forced it for the x aesthetic, and hence we have fractions on the x-axis.

The solution is also the same: the continuous variable should be forced to categorical by wrapping it into factor():

ggplot(gdp,
       aes(year, GDP, fill=factor(year))) +
   geom_col() +
   scale_fill_manual(
      values = c("2023"="orangered2", "2024" = "steelblue3")
   )

(See more in Section 17.2.)
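
If the fractional years on the x-axis are not desired either, one possible approach is to wrap year into factor() for the x aesthetic as well. This is a sketch only; the labs() call is our own addition, so that the axis and legend titles read “year” instead of “factor(year)”:

```r
library(ggplot2)

gdp <- data.frame(GDP = c(1000, 1050), year = c(2023, 2024))

# force `year` to categorical for both x and fill
p <- ggplot(gdp,
            aes(factor(year), GDP, fill = factor(year))) +
   geom_col() +
   scale_fill_manual(
      values = c("2023" = "orangered2", "2024" = "steelblue3")
   ) +
   labs(x = "year", fill = "year")
p
```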

Exercise 13.9 Why is scale_color_manual() a discrete scale? Could you envision a function where you can manually specify colors for a continuous variable?

See the solution

Exercise 13.10 The GDP example above uses fill aesthetic and fill scale to specify the fill colors. But what happens if you use another scale, e.g. scale_color_manual() instead? Does it change the outline colors? Do you get an error?

See the solution

13.7.3 Adjusting continuous colors

Specifying individual colors manually is a good choice when there is only a small number of discrete data values. But often we have data where the count of possible values is essentially unlimited. This includes many physical measurements, such as height, weight, temperature, elevation and light intensity. Also many economic measures, in particular those that involve money belong here–income, wealth, price and GDP, but also inflation and unemployment are such values. In such a case there is no way that we can specify the colors manually. We need a continuous scale for continuous variables.

Below, we use the Icecream dataset from the Ecdat package (see Section I.7). This is a small dataset of ice cream consumption in the U.S. in the early 1950s. A sample of the data looks like:

library(Ecdat)
Icecream %>%
   sample_n(3)
##     cons income price temp
## 20 0.342     86 0.277   60
## 24 0.326     92 0.285   27
## 23 0.284     94 0.277   32

Here we use cons (ice cream consumption per person in pints), price (USD per pint), and temp (temperature in °F).

plot of chunk ggplot-colors-ice-cream-temp

Now let’s make a simple plot about how consumption depends on price and temperature. We put price on x-axis, consumption on y-axis, and color the data points by temperature:

ggplot(Icecream,
       aes(price, cons, col=temp)) +
   geom_point(size=5)

The picture suggests that there is little relationship between price and consumption–the dots are arranged fairly randomly. However, the relationship between weather and consumption is strong–you can see that the light blue dots, denoting warmer weather, tend to be associated with more consumption.

In terms of colors, ggplot will automatically pick a scale to represent the various temperature values. The scale ranges from dark blue (low values) to light blue (high values). This is a continuous color scale, a color gradient, and it can represent an unlimited number of colors, corresponding to an unlimited number of potential temperature values.

plot of chunk ggplot-colors-ice-cream-temp-gradient

But we may want to show the temperature not just as shades of blue but with red colors to represent hot weather and blue colors to represent cold weather. In order to achieve this, we need to provide another color gradient where we supply our own custom colors for low and high temperature values. This can be done with scale_color_gradient(low="blue", high="red"). This will make a similar color gradient from blue to red, representing temperature from its lowest value to the highest value in the data:

ggplot(Icecream,
       aes(price, cons, col=temp)) +
   geom_point(size=5) +
   scale_color_gradient(
      low="steelblue2",
      high="orangered"
   )

The message from the image is similar, but the choice of colors is a more conventional one for representing temperatures. As with the discrete scales, you should replace scale_color_gradient() with scale_fill_gradient() if you use the fill aesthetic instead of the color aesthetic.
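
The fill variant can be sketched in the same way. The data frame temps below is made up purely for illustration (it is not part of the Icecream data); the point is only that a continuous variable mapped to fill needs scale_fill_gradient():

```r
library(ggplot2)

# hypothetical monthly temperatures (°F), for illustration only
temps <- data.frame(month = 1:6,
                    temp = c(28, 33, 41, 52, 63, 72))

# `fill = temp` is a continuous fill aesthetic,
# hence scale_fill_gradient(), not scale_color_gradient()
p <- ggplot(temps,
            aes(month, temp, fill = temp)) +
   geom_col() +
   scale_fill_gradient(
      low = "steelblue2",
      high = "orangered"
   )
p
```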

plot of chunk ggplot-colors-ice-cream-temp-gradient2

There are more ways to create gradients. For instance, if you want the blues not to turn into reds directly, but first into white and only thereafter into red, then you can use scale_color_gradient2(). This scale takes three color values: low, mid and high. It also requires the argument midpoint–the middle temperature value that should be represented by the middle color:

## pick average temp for midpoint
midpoint <- mean(Icecream$temp)
ggplot(Icecream,
       aes(price, cons, col=temp)) +
   geom_point(size=5) +
   scale_color_gradient2(
      low = "steelblue2",
      mid = "white",
      high = "orangered2",
      midpoint = midpoint
   )

Here we picked the middle point value to be mean temperature in the data.

If two gradients with a middle point are still not enough for you, check out scale_color_gradientn() and the pre-defined palettes below.

Like the discrete manual scale (see Section 13.7.2), a continuous scale fails if applied to discrete data. If we try to use a color gradient with the political parties example from Section 13.7.2, we get:
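
As a rough sketch of scale_color_gradientn(), the example below strings five hand-picked colors into a single gradient. To keep it self-contained, it uses the airquality dataset that ships with base R as a stand-in for the ice cream data; both the dataset and the particular colors are our own choices for illustration:

```r
library(ggplot2)

# airquality: daily New York weather, Temp is in °F
p <- ggplot(airquality,
            aes(Day, Wind, col = Temp)) +
   geom_point(size = 3) +
   scale_color_gradientn(
      # any number of colors, from low to high values
      colors = c("navy", "steelblue2", "white",
                 "orangered", "darkred")
   )
p
```

The colors are by default spread evenly over the range of the data values.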

partySeats <- data.frame(party = c("BJP", "INC", "AITC"),
                         seats = c(303, 52, 23))
ggplot(partySeats,
       aes(party, seats, fill=party)) +
   geom_col() +
   scale_fill_gradient()
## Error: Discrete value supplied to continuous scale

This means that the scale expects numbers, but it was given fill=party, and party only contains discrete values.

Exercise 13.11 Use the ice extent data. Make a barplot of March ice extent in the Northern Hemisphere only, over all the years, where each bar represents the March ice extent for that particular year. Color the bars using a blue-white-red gradient where blue represents a lot of ice, red represents little ice, and white is the period average.

See the solution

13.7.3.1 Pre-defined palettes

It is fairly easy to pick two or three colors that fit nicely together and get a professional-looking plot in this manner. But if you want to pick a larger number of colors, it rapidly becomes tricky. The task gets even more complex if you intend your figures to be readable for people with different types of color blindness, or when printed on paper in just black and white. Fortunately, you are not the first one to stumble upon this problem. R includes a number of pre-defined color palettes. These include heat.colors(), terrain.colors(), topo.colors() and others. These functions return a number of color codes, e.g.

heat.colors(4)
## [1] "#FF0000" "#FF8000" "#FFFF00" "#FFFF80"

returns four color codes on red-yellow scale that may be good to represent “heat”.

plot of chunk ggplot-colors-ice-cream-temp-heat

Icecream consumption versus price. Outside temperature marked using heat.colors().

If we want to use such palettes for ggplot gradients, we can just feed a number of colors from the palette to scale_color_gradientn():

ggplot(Ecdat::Icecream,
       aes(price, cons, col=temp)) +
   geom_point(size=5) +
   scale_color_gradientn(
      colors=heat.colors(10)
   )

The result looks like different levels of heat, although the color codes may be more about melting steel and less about weather…

The palettes above are designed with continuous data in mind, like the smooth transition of color with temperature above. If you are displaying discrete values, then you may prefer colors that are the opposite–not blending smoothly into each other but easy to distinguish instead. ggplot2 includes such a palette: the default colors for election results in Section 13.7.2 are selected from ggplot’s built-in palette.

Another popular choice is to use a pre-defined palette from colorbrewer.org. ColorBrewer palettes have been designed to look good and to be viewable both for people with normal vision and for those with certain forms of color blindness. ColorBrewer’s color palettes are incorporated into R’s RColorBrewer package; one can see all the palettes with RColorBrewer::display.brewer.all() (but remember to install RColorBrewer first). You can also get a palette and its color codes from the colorbrewer website by looking at the scheme query parameter in the URL.

plot of chunk brewer_point

1000 random diamonds’ size versus price, with cut denoted by different oranges.

These palettes can be used with the scale_color_brewer() function, passing the palette as an argument. For instance, let’s plot the diamonds’ price using the “Oranges” palette:

diamonds %>%
   sample_n(1000) %>%
   ggplot(aes(carat, price, col=cut)) +
   geom_point() +
   scale_color_brewer(palette = "Oranges",
                      direction = -1)

The last argument, direction = -1, reverses the scale, so “Fair” will be dark and “Ideal” light orange.

Note that ColorBrewer’s palettes are discrete–even the continuous-looking scales, like “YlOrRd” (yellow-orange-red) or “Blues” (light blues to dark blues) are discrete scales with only a limited number of possible values. This is because the human eye cannot easily distinguish between a large number of similar tones, and hence, if we want to make different continuous levels distinguishable, we need to use fewer colors. If you want a truly continuous scale, you can always feed a ColorBrewer palette into scale_color_gradientn(), for instance.
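
Feeding a ColorBrewer palette into a gradient can be sketched as below. We assume that RColorBrewer is installed; the choice of the mpg data, of city mileage (cty) as the color variable, and of the “YlOrRd” palette is ours, for illustration only:

```r
library(ggplot2)
library(RColorBrewer)

# extract the 9 discrete colors of the "YlOrRd" palette
pal <- brewer.pal(9, "YlOrRd")

# interpolate between these colors to get a continuous scale
p <- ggplot(mpg,
            aes(displ, hwy, col = cty)) +
   geom_point(size = 3) +
   scale_color_gradientn(colors = pal)
p
```

ggplot2 also provides scale_color_distiller(), which does a similar interpolation of ColorBrewer palettes for continuous data.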

13.8 Tuning other parameters

There are many more things you may want to adjust than just colors. Some of these, e.g. line type or point shape, behave in a fairly similar fashion as color and can be adjusted with the corresponding scales (see Section 13.8.1).

But there are other adjustments, e.g. position, labels, facets and coordinate systems. This section discusses all these.

13.8.1 Scales for other aesthetics

There are many more aesthetics than just color. ggplot can map data columns to those visual properties as well, and how exactly the mapping occurs is defined through the corresponding scales. Some of the other scales, for instance point shape or line style, behave in a fairly similar fashion as color scales. But others, such as coordinate axes, have different properties.

  • size: point size. It works with both discrete and continuous values.
  • linewidth: width of lines, or outlines for bars and polygons. Works for both discrete and continuous values.
  • shape: point shape. Discrete values only.
  • alpha: transparency with “1” being completely opaque, and “0” being completely transparent (invisible). Works for both discrete and continuous values.
  • linetype: different line types, such as solid, dotted or dashed, for line plots. Discrete values only.
  • x, y: horizontal, vertical position.

13.8.1.1 Point shape and line type

Some of these aesthetics are discrete–they can only display a limited set of discrete values. These are point shape and line type–ggplot only offers a limited set of discrete shapes and types. You will get an error if you attempt to map continuous variables to shape or linetype. ggplot is also unhappy if you feed them with ordered categoricals (see Section 17.2.1). This is because there is no inherent ordering of point shapes like +, x and o. Your audience cannot guess which value belongs where in the ordered scale, and ggplot issues a warning.

Below, we demonstrate the usage of some of these with orange tree data (see Section I.10). As the aim is to explain the usage of scales, the result will not be very good.

plot of chunk orange-scaling-types

Growth of orange trees. Point size and shape, line width and style, and transparency, all differ by the tree.

We use a number of aesthetics–shape, linetype, alpha, and size to denote tree number. As alpha and size can display continuous variables, we can just map those as aes(alpha = Tree). However, shape and linetype can only handle discrete values, and hence we need to convert these to categoricals using factor(Tree).

orange %>%
   ggplot(aes(age, circumference,
              shape = factor(Tree),
              linetype = factor(Tree),
              alpha = Tree,
              size = Tree)) +
   geom_line() +
   geom_point()

The plot, using the default scales, is not pleasant. All trees are marked with black color, but as the black lines are at least somewhat transparent, they appear as different shades of gray. Also, the line styles are hard to comprehend, notably a very wide dashed line resembles a series of dots instead.

Note also that we have two legends, one for Tree and another for factor(Tree). This is because we mapped both of these options to different aesthetics.

Now let’s improve the legibility of the plot by some scales and other tuning:
plot of chunk orange-scaling-size

The same plot as above, just this time using manually designed scales.

orange %>%
   ggplot(aes(age, circumference,
              shape = factor(Tree),
              linetype = factor(Tree),
              alpha = Tree,
              size = Tree)) +
   geom_line() +
   geom_point() +
   scale_shape_manual(
      values = c("1" = 10, "2" = 11,
                 "3" = 12, "4" = 8,
                 "5" = 5)) +
   scale_linetype_manual(
      values = c("1" = "solid", "2" = "dashed",
                 "3" = "dotted", "4" = "dotdash",
                 "5" = "twodash")) +
   scale_size_continuous(range=c(0.5,3)) +
   scale_alpha_continuous(range=c(0.3,1)) +
   guides(linetype = "none",
          size = "none",
          alpha = "none") +
   labs(shape = "Tree #")

First, we hand-pick the values for the two discrete scales. Each tree will have its own dedicated line type and point shape. For the continuous scales, we do not pick the individual values but adjust the ranges instead. For line widths we choose values between 0.5 and 3 (range = c(0.5, 3)) to make the lines of more equal width, and for alpha we do a similar conversion, forcing the lines to be a bit more opaque. Finally, we tell ggplot that we do not want to see the linetype, size, and alpha legends on the figure, and adjust the legend label.

Exercise 13.12 What are the possible point shapes? Try this out:

data.frame(x = 1:25) %>%
  ggplot(aes(x, x)) +
  geom_point(shape = 1:25,
             col = "mediumpurple3",
             fill = "gold3",
             size = 2,
             stroke = 2)

(You may want to adjust size and stroke parameters).

  • Which point shapes include fill and outline?
  • What does stroke parameter do?
  • Explain why we can use shape = 1:25 instead of shape = factor(1:25).
  • Shapes can also be specified as letters. What does shape = "m" do?
  • If you want to use three different point shapes, which ones would you choose? What are your considerations?

Exercise 13.13 Use the same aesthetics as above–point shape, line type, transparency and line width. Make the plot into an aesthetically pleasant publication-quality figure. Do not use colors!

13.8.1.2 x and y

The position aesthetics x and y have scales too. For instance, scale_x_reverse() reverses the direction of the x-axis:

# mileage relationship, ordered in reverse
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy)) +
  scale_x_reverse()

Similarly, you can use scale_x_log10() to plot on a logarithmic scale.

You can also use scales to specify the range of values on an axis by passing in a limits argument. This is useful for making sure that multiple graphs share scales or formats.

# subset data by class
suv <- mpg %>% filter(class == "suv") # suvs
compact <- mpg %>% filter(class == "compact") # compact cars

# scales
x_scale <- scale_x_continuous(limits = range(mpg$displ))
y_scale <- scale_y_continuous(limits = range(mpg$hwy))
col_scale <- scale_colour_discrete(limits = unique(mpg$drv))

ggplot(data = suv) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  x_scale + y_scale + col_scale

ggplot(data = compact) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  x_scale + y_scale + col_scale

plot of chunk scale_limit

Notice how it is easy to compare the two data sets to each other because the axes and colors match!

These scales can also be used to specify the “tick” marks and labels; see the resources at the end of the chapter for details. And for further ways of specifying where the data appears on the graph, see the Coordinate Systems section below.
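
Tick marks and labels are set through the breaks and labels arguments of the scale functions. A minimal sketch with the mpg data; the particular break positions and the “L” suffix are our own choices for illustration:

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  # ticks at 2, 3, ..., 7, labelled "2 L", "3 L", ...
  scale_x_continuous(breaks = 2:7,
                     labels = paste(2:7, "L")) +
  # ticks at every 5 mpg
  scale_y_continuous(breaks = seq(10, 45, by = 5))
p
```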

13.8.2 Statistical Transformations

If you look at the bar chart above, you’ll notice that the y axis was defined for you as the count of elements that have the particular type. This count isn’t part of the data set (it’s not a column in mpg), but is instead a statistical transformation that geom_bar automatically applies to the data. In particular, it applies the stat_count transformation, which simply counts the number of rows in which each class appears in the dataset.

ggplot2 supports many different statistical transformations. For example, the “identity” transformation will leave the data “as is”. You can specify which statistical transformation a geom uses by passing it as the stat argument:

# bar chart of make and model vs. mileage

# quickly (lazily) filter the dataset to a sample of the cars: one of each make/model
new_cars <- mpg %>%
  mutate(car = paste(manufacturer, model)) %>% # combine make + model
  distinct(car, .keep_all = TRUE) %>% # select one of each cars -- lazy filtering!
  slice(1:20) # only keep 20 cars

# create the plot (you need the `y` mapping since it is not implied by the stat transform of geom_bar)
ggplot(new_cars) +
  geom_bar(mapping = aes(x = car, y = hwy), stat = "identity") +
  coord_flip() # horizontal bar chart

plot of chunk bar_chart

Additionally, ggplot2 contains stat_ functions (e.g., stat_identity for the “identity” transformation) that can be used to specify a layer in the same way a geom does:

# generate a "binned" (grouped) display of highway mileage
ggplot(data = mpg) +
  stat_bin(aes(x = hwy, color = hwy), binwidth = 4) # binned into groups of 4 units

plot of chunk stat_summary

Notice the above chart is actually a histogram! Indeed, almost every stat transformation corresponds to a particular geom (and vice versa) by default. Thus they can often be used interchangeably, depending on how you want to emphasize your layer creation when writing the code.

# these two charts are identical
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

ggplot(data = mpg) +
  stat_count(mapping = aes(x = class))

13.8.3 Position Adjustments

In addition to a default statistical transformation, each geom also has a default position adjustment which specifies a set of “rules” as to how different components should be positioned relative to each other.

Position adjustment is used quite often with geom_col() and geom_histogram(). It allows you to show different sides of the same data. Below we demonstrate four different histograms of diamond size depending on cut. Each of these histograms stresses a different side of the data.

plot of chunk carat-histogram-stacked

When we fill the diamond histogram by cut, the bars are by default stacked on top of each other.

Let’s show this by plotting the histogram of diamonds’ mass (carat), coloring the bars differently according to cut:

diamonds %>%
   ggplot(aes(carat, fill=cut)) +
   geom_histogram()

This shows the histogram–counts by binned carat values. The counts are done separately for different cuts, and the corresponding bars are stacked on top of each other. This plot is good for showing the overall distribution of diamonds, and what kind of role gems of different cut play there. We see that by far the most diamonds are small, less than 0.5ct. But there are secondary peaks at 1ct, 1.5ct and 2ct. We can also see that there are many “ideal” diamonds, although the numbers differ in different bins.

plot of chunk carat-histogram-fill

Another histogram of carat, split into five different groups by cut. Now all the bars are of the same height, stressing the proportion of different cuts.

But the previous plot is not very informative if we are interested in comparing the share of different cuts. The bars are of different length, and in particular for the shorter ones it is hard to see what the proportions of ideal and other diamonds are. But we can make the proportions visible by making the bars of equal height with the pos = "fill" argument (pos is an abbreviation of the full argument name, position):

diamonds %>%
   ggplot(aes(carat, fill=cut)) +
   geom_histogram(pos = "fill")

In this form, the histogram shows that ideal cut is more common for small diamonds, and larger diamonds tend to be of less valuable cut.

plot of chunk carat-histogram-dodge

A third similar histogram. Now the bars are located next to each other, allowing a comparison of the mass distribution for differently cut diamonds. We only show 10 bins to make the bars easier to read.

But if we want to compare how the differently cut diamonds are distributed, neither of the plots above is good. We may want to put the bars next to each other with position = "dodge" instead:

diamonds %>%
   ggplot(aes(carat, fill=cut)) +
   geom_histogram(pos = "dodge",
                  bins=10)

(We only show 10 bins to make the bars easier to read.) This plot gives a somewhat similar message as the one above (position = "fill"). This time, however, we can also see that diamonds between 0.5 and 2.5ct are the most common ones for every cut, while diamonds larger than 3ct are extremely rare, no matter which cut you are looking at.

plot of chunk carat-histogram-density

Plotting density instead of counts. This is a better way to see what kind of diamonds are more or less common for a given cut.

Finally, if we are interested not in the distribution of different cuts given the diamond size, but in the size distribution within each cut, we can compare the densities by setting y = ..density.. (see Section 13.4.4) in the aesthetics mapping:

diamonds %>%
   ggplot(aes(carat,
              y = ..density..,
              fill=cut)) +
   geom_histogram(pos = "dodge",
                  bins=10)

Here we can see that the most common ideal-cut diamonds are around 0.5ct, but the most common fair-cut diamonds are around 1ct.
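
In ggplot2 version 3.3.0 and later, the same mapping can also be written with after_stat(); the ..density.. notation was deprecated in version 3.4.0 and produces a warning there. A sketch of the equivalent plot:

```r
library(ggplot2)

# after_stat(density) refers to the `density` variable
# computed by the histogram's statistical transformation
p <- ggplot(diamonds,
            aes(carat,
                y = after_stat(density),
                fill = cut)) +
   geom_histogram(position = "dodge",
                  bins = 10)
p
```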

As you can see from these examples, there is a variety of ways to display the same data. Each of these sends a different message, and which one you want to use, depends on which side you want to stress.

13.8.4 Facets

Facets are ways of grouping a data plot into multiple different pieces (subplots). This allows you to view a separate plot for each value in a categorical variable. Conceptually, breaking a plot up into facets is similar to using the group_by() verb in dplyr, with each facet acting like a level in an R factor.

You can construct a plot with multiple facets by using the facet_wrap() function. This will produce a “row” of subplots, one for each value of the categorical variable (the number of rows can be specified with an additional argument):

# a plot with facets based on vehicle type.
# similar to what we did with `suv` and `compact`!
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~class)

plot of chunk facets

Note that the argument to facet_wrap() function is written with a tilde (~) in front of it. This specifies that the column name should be treated as a formula. A formula is a bit like an “equation” in mathematics; it’s like a string representing what set of operations you want to perform (putting the column name in a string also works in this simple case). Formulas are in fact the same structure used with standard evaluation in dplyr; putting a ~ in front of an expression (such as ~ desc(colname)) allows SE to work.

  • In short: put a ~ in front of the column name you want to “group” by.
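
The different ways of specifying the facetting variable, together with the additional argument for the number of rows, can be sketched as follows. The object names base, p1, p2 and p3 are our own; vars() is an alternative to the formula notation that recent versions of ggplot2 provide:

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# three equivalent ways to facet by `class`,
# arranging the subplots into two rows
p1 <- base + facet_wrap(~class, nrow = 2)      # formula
p2 <- base + facet_wrap(vars(class), nrow = 2) # vars()
p3 <- base + facet_wrap("class", nrow = 2)     # string
p1
```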

13.8.5 Labels & Annotations

Textual labels and annotations (on the plot, axes, geometry, and legend) are an important part of making a plot understandable and communicating information. Although not an explicit part of the Grammar of Graphics (they would be considered a form of geometry), ggplot makes it easy to add such annotations.

You can add titles and axis labels to a chart using the labs() function (not labels, which is a different R function!):

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  labs(
    title = "Fuel Efficiency by Engine Power, 1999-2008", # plot title
    x = "Engine power (litres displacement)", # x-axis label (with units!)
    y = "Fuel Efficiency (miles per gallon)", # y-axis label (with units!)
    color = "Car Type" # legend label for the "color" property
  )

plot of chunk labels

It is possible to add labels into the plot itself (e.g., to label each point or line) by adding a new geom_text or geom_label to the plot; effectively, you’re plotting an extra set of data which happen to be the variable names:

# a data table of each car that has best efficiency of its type
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +  # same mapping for all geoms
  geom_point(mapping = aes(color = class)) +
  geom_label(data = best_in_class, mapping = aes(label = model), alpha = 0.5)

plot of chunk annotations

R for Data Science (linked in the resources below) recommends using the ggrepel package to help position labels.
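
As a hedged sketch of that recommendation, the example above can be repeated with geom_label_repel() from ggrepel, which moves the labels so that they do not overlap each other. We assume that the ggrepel and dplyr packages are installed:

```r
library(ggplot2)
library(dplyr)
library(ggrepel)  # assumption: the package is installed

# each car that has the best efficiency of its type
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  # like geom_label(), but labels repel each other
  geom_label_repel(data = best_in_class,
                   mapping = aes(label = model),
                   alpha = 0.5)
p
```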

It is important to know how to produce the desired plots on a computer. But it is perhaps even more important to know what kind of plots to produce. Here we discuss some of the most important properties you should try to achieve.

13.8.6 Linear and log scale

plot of chunk diamond-price-histogram

Diamonds’ price. Both the horizontal and vertical axes are in linear scale.

Many datasets contain a lot of small values and not so many large values. For instance, the histogram of diamonds’ price looks like this:

ggplot(diamonds, aes(price)) +
   geom_histogram(bins=50,
                  fill="steelblue3",
                  col="white")

This may be exactly what you want–it shows that there are many cheap diamonds (in a few thousand dollar price range), and not that many expensive ones with a price over $10,000. But this plot also has its downsides–only a small portion of the figure is devoted to the most common price range, while over a half of it is almost empty and only confirms what we may know anyway–there are not that many expensive gems.

If we want to see more details in the cheap diamonds’ region, we may want to stretch the space devoted to cheap diamonds while compressing the area covered by expensive ones. The most common solution is to use a log transformation. The logarithm does exactly that–it expands small numbers (makes differences between small numbers larger) and contracts large numbers (makes differences between large numbers smaller).

plot of chunk diamond-price-histogram-logx

Diamonds’ price. The vertical axis is still linear, but the horizontal is now in log scale. It is a log-linear plot.

Let’s repeat the above example using log scale:

ggplot(diamonds, aes(price)) +
   geom_histogram(bins=50,
                  fill="steelblue3",
                  col="white") +
   scale_x_log10()

Code-wise, we just add + scale_x_log10() to the previous example; this forces the x-axis to be logarithmic (see Section 13.8.1 for more about scales). Now the plot looks very different. The bars are of broadly equal height, and a problem with the previous plot–a lot of empty space–is gone.
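The stretching and compressing effect of the logarithm can be checked with a little arithmetic. The sketch below uses made-up prices (they are not taken from the diamonds data) and compares two gaps of the same size in dollars, one in the cheap and one in the expensive range:

```r
# Two price gaps of the same size, $100, at opposite ends of the range
# (hypothetical prices, chosen for illustration only)
cheap <- c(400, 500)
expensive <- c(15000, 15100)

# On the linear scale both gaps are equally wide
diff(cheap)
diff(expensive)

# On the log10 scale the gap between the cheap prices is much wider
gap_cheap <- diff(log10(cheap))
gap_expensive <- diff(log10(expensive))
gap_cheap
gap_expensive
gap_cheap / gap_expensive  # how many times wider the "cheap" gap is
```

On a log axis, what matters is the ratio between two values, not their difference. This is why the cheap price range takes up so much more room after adding scale_x_log10().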

13.8.7 Coordinate Systems

The next term from the Grammar of Graphics that can be specified is the coordinate system. As with scales, coordinate systems are specified with functions (that all start with coord_) and are added to a ggplot. There are a number of different possible coordinate systems to use, including:

  • coord_cartesian: the default Cartesian coordinate system, where you specify x and y values.
  • coord_flip: a Cartesian system with the x and y flipped.
  • coord_fixed: a Cartesian system with a “fixed” aspect ratio (e.g., 1.78 for a “widescreen” plot).
  • coord_polar: a plot using polar coordinates.
  • coord_quickmap: a coordinate system that approximates a good aspect ratio for maps. See the documentation for more details.

Most of these systems support the xlim and ylim arguments, which specify the limits for the coordinate system.
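As a minimal sketch of how coordinate systems are added to a plot, the example below uses the mpg data that ships with ggplot2. The variable names p, p_flipped and p_zoomed are made up for this illustration:

```r
library(ggplot2)

# Base plot: highway mileage by car class
p <- ggplot(mpg, aes(class, hwy)) +
   geom_boxplot()

# Swap the axes so that the class names are easier to read
p_flipped <- p + coord_flip()

# Zoom in on a range of mileage values without dropping any data;
# the boxplot statistics are still computed from all observations
p_zoomed <- p + coord_cartesian(ylim = c(10, 35))

p_flipped
p_zoomed
```

Note that limits set through a coord_ function only change the visible region of the plot. The underlying data, and hence any statistics computed from it, remain the same.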

13.9 More geoms and plot types

In Section 13.4 we introduced the most important plot types and the corresponding geoms. Here we discuss a selection of other plot types, and other tasks that can be achieved with different geoms.

We use ice extent data, a dataset compiled from NSIDC data that contains monthly satellite measures of polar ice area and extent from 1978 on. A sample of the dataset looks like

ice <- read.delim("data/ice-extent.csv.bz2")
tail(ice, 4)
##      year month data.type region extent  area     time
## 1059 2022    12   NRTSI-G      N  11.92 10.23 2022.958
## 1060 2022    12   NRTSI-G      S   8.69  5.12 2022.958
## 1061 2023     1   NRTSI-G      N  13.35 11.83 2023.042
## 1062 2023     1   NRTSI-G      S   3.23  1.95 2023.042

The relevant variables here are

  • time: time, measured in years at mid-month.
  • extent: sea ice extent, M \(km^2\). Sea ice extent is the sea area where ice concentration is over 15%. This is a measure of sea ice that is less error-prone than sea ice area, the area where the sea is completely frozen over.
  • region: “N”/“S” for Northern or Southern hemisphere.

Below we use only the September observations for the Northern hemisphere:

septemberData <- ice %>%
   filter(month == 9, region == "N")


13.9.1 Density plot

Above, in Section 13.4.4, we introduced histograms. These are basically counts of observations by binned x. However, histograms have two undesirable properties:

  • The bin heights–counts–depend not just on how the data is distributed, but also on the bin width and the dataset size. Obviously, narrow bins on a small dataset contain few observations. This makes it hard to compare two different distributions.
  • The shape of a histogram depends on the exact width and location of the bins. This is quite a big problem for discrete values, such as age (almost always measured in full years), and may obscure some patterns in the data.

The solution to the first problem is fairly simple: instead of plotting the number (count) of observations in each bin, we plot density–the share of data points in the bin per unit of bin width. The density numbers do not change much when changing the dataset size or number of bins.

plot of chunk sept-histogram

Histogram of September ice extent. Bin heights correspond to density.

In order to replace counts with density, you need to specify after_stat(density) in the aes() function for the y aesthetic:

ggplot(septemberData,
       aes(extent,
           after_stat(density))) +
   geom_histogram(
      bins=7,
      fill="mediumpurple4",
      col="gold1"
   )

We see that the smallest extent values are less than 3 (M km\(^2\)), and the largest values are close to 8.

plot of chunk sept-hist-20

The same histogram, now with 20 bins. The density values are comparable.

For comparison, let’s repeat the same plot with 20 bins:

ggplot(septemberData,
       aes(extent,
           after_stat(density))) +
   geom_histogram(
      bins=20,
      fill="mediumpurple4",
      col="gold1"
   )

This is clearly too many bins for such a small dataset. But as you can see, while the bin count and shape are rather different, the density values are comparable.
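The claim that density values stay comparable while counts do not can also be checked numerically. The sketch below uses the built-in faithful data instead of the ice data, so that it runs without external files. The helper hist_data() is a made-up name; layer_data() is the ggplot2 function that returns the values computed for a layer:

```r
library(ggplot2)

# Built-in data: waiting times between eruptions of Old Faithful
hist_data <- function(bins) {
   p <- ggplot(faithful, aes(waiting, after_stat(density))) +
      geom_histogram(bins = bins)
   layer_data(p)  # the computed bins: count, density, xmin, xmax, ...
}
h7 <- hist_data(7)
h20 <- hist_data(20)

# Counts depend heavily on the number of bins...
max(h7$count)
max(h20$count)
# ...while density values stay on a comparable scale
max(h7$density)
max(h20$density)

# Density is the count divided by (number of observations * bin width),
# so the bar areas add up to one whatever the number of bins
sum(h7$density * (h7$xmax - h7$xmin))
sum(h20$density * (h20$xmax - h20$xmin))
```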

plot of chunk september-density

Smooth density curve (gold), filled from below with semi-transparent purple.

Unfortunately, this does not solve the second problem–histograms are sensitive to the exact location of bins. A solution is not to use discrete bins, but to “smear” the observations around on the plot. The result will look something like a “smoothed histogram”. This can be done with geom_density():

ggplot(septemberData, aes(extent)) +
   geom_density(
      fill = "mediumpurple4",
      col = "gold1",
      alpha = 0.5
   )

By default, geom_density() just draws the smooth curve (colored gold here through col =). When you specify fill =, it also fills the area under the curve. Here we also set alpha = 0.5, which makes the fill color semi-transparent so that the underlying grid lines remain somewhat visible.

Such semi-transparent density curves may be good for comparing distributions across the categories of a categorical variable.
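As an illustration of such a comparison, here is a minimal sketch that uses the mpg data from ggplot2 (not the ice data) and maps the fill color to a categorical variable, so that one density curve is drawn for each category:

```r
library(ggplot2)

# Highway mileage distribution by drive train type
# (drv is "4", "f" or "r" in the mpg data)
p <- ggplot(mpg, aes(hwy, fill = drv)) +
   geom_density(alpha = 0.5)
p
```

Because the curves are semi-transparent, the regions where the distributions overlap remain visible.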

Exercise 13.14 Make such a density plot for the ticket price (fare) distribution on the Titanic for all 3 passenger classes. Mark the different classes with different colors.

See the solution

13.9.2 Paths and lines

The line plot (see Section 13.4.2) is one of the simplest and most popular plot types. It first orders observations by x and then connects the corresponding data points with lines. But sometimes we do not want to order the observations. For instance, we may want to plot the path of something moving over time. In such a case we can use geom_path()–it is otherwise similar to geom_line(), but the points are connected in the order they appear in the data.
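The difference between the two geoms is easiest to see on a tiny dataset. The sketch below uses a made-up data frame, track, with hypothetical positions of a moving object, and draws it with both geoms:

```r
library(ggplot2)

# Hypothetical positions of a moving object, in the order they were visited
track <- data.frame(
   x = c(1, 3, 2, 4, 3),
   y = c(1, 2, 3, 2, 0)
)

# geom_line() connects the points from left to right (ordered by x)
p_line <- ggplot(track, aes(x, y)) +
   geom_line() +
   geom_point()

# geom_path() connects the points in the order of the rows
p_path <- ggplot(track, aes(x, y)) +
   geom_path() +
   geom_point()

layer_data(p_line)$x  # x values in the order geom_line() connects them
layer_data(p_path)$x  # x values in the order geom_path() connects them

p_line
p_path
```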

plot of chunk ice-path-2022

For instance, here is how ice extent and area changed through 2022:

ice %>%
   filter(year == 2022,
          region == "N") %>%
   ggplot(
      aes(extent, area,
          col = month)
   ) +
   geom_path() +
   geom_point()

In winter, both area and extent were large. However, area fell faster through the melt season, as the melting ice weakened and broke up. This made area smaller than extent. The opposite happens during the fall freezing season: the sea freezes over with few breakups, and hence area can almost catch up with extent.

plot of chunk ice-path-all

Sometimes it may be worth plotting all the other data points in gray. This involves feeding data to the plot twice: first to the ggplot() function (through the pipe), and then explicitly, through the data argument, to the first geom_path():

ice %>%
   filter(
      ## use only 2022 for the
      ## colored plot
      year == 2022,
      region == "N") %>%
   ggplot(
      aes(extent, area,
          col = month)
   ) +
   geom_path(
      ## use all years for the
      ## gray background
      data = ice %>%
         filter(region == "N",
                area > 0),
      col = "gray60") +
   geom_path(linewidth = 2) +
   geom_point(size = 4)

Now we can see how 2022 relates to all the other years.

There are also a number of geoms whose primary purpose is not to display data but to mark certain values on the figure. These include geom_vline() for vertical lines, geom_hline() for horizontal lines, and geom_abline() for diagonal lines. These geoms can be used to display data, but the result is often not particularly useful.

plot of chunk september-vlines

All September ice extents, displayed as vertical lines.

For instance, we can plot all the ice extents using vertical lines, colored according to year:

ggplot(septemberData) +
   geom_vline(
      ## have to specify 'xintercept' here,
      ## not in ggplot()!
      aes(xintercept = extent, col=year)
   ) +
   viridis::scale_color_viridis()

The figure is somewhat useful (it is a form of rugplot), but alone, it is usually not the best way to visualize data. Nevertheless, when colored like this, it indicates that the largest September ice extents in the data occurred in the 1980s, while the 2010s tend to be at the low end.

plot of chunk september-avg-median

The period average and median ice extent marked as vertical lines of different color.

Instead, such lines are typically used to mark certain values on the figure. For instance, we can add the sample mean and median to the density plot as vertical lines of different color:

ggplot(septemberData, aes(extent)) +
   geom_density() +
   geom_vline(
      xintercept=c(
         mean(septemberData$extent),
         median(septemberData$extent)
      ),
      col = c("orangered1",  # mean is red
              "seagreen4")  # median is green
   )

Note that while aes() accepts the data variable names directly, xintercept = given outside of aes() in geom_vline() looks for workspace variables only. One has to use dollar notation or something similar to extract data variables from the data frame.
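An alternative is to put the reference values into a small data frame of their own and map them inside aes(), as with any other layer-specific data. The sketch below is one possible way to do this. It uses the built-in faithful data, and the names refs, value and statistic are made up for the example:

```r
library(ggplot2)

# One row per reference line
refs <- data.frame(
   value = c(mean(faithful$eruptions), median(faithful$eruptions)),
   statistic = c("mean", "median")
)

p <- ggplot(faithful, aes(eruptions)) +
   geom_density() +
   # inside aes(), 'value' and 'statistic' refer to columns of 'refs'
   geom_vline(data = refs,
              aes(xintercept = value, col = statistic))
p
```

A side effect of mapping color inside aes() is that ggplot adds a legend that tells which line is which, so the colors do not need to be explained in comments or in the caption.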

13.9.3 Other plot types

TBD: other plot types

13.10 Programming with ggplot2

ggplot is well suited to be incorporated into code. So far, we have been focusing on the interactive usage, or usage in small code snippets. When you run the code in a more extensive program, there are a number of issues you may run into. Here we discuss some of these.

13.10.1 Outputting plots

The typical usage of ggplot in short code snippets is something like

ggplot(data, aes(x, y)) +
   geom_something()

When you run this code, you’ll see the plot popping up in the plot window. This may make you think that the ggplot() function itself draws the image on screen. But this is not true! A call to ggplot() just creates a ggplot object, a data structure that contains the data and all other details necessary for creating the plot. But it does not create the image. The image is created by the print method instead. This is analogous to all other R expressions on the console–if you just type an expression, R evaluates it and prints the result. This is why we can use R as a manual calculator for simple math, such as 2 + 2. The same is true for ggplot: it returns a ggplot object, and if you don’t store it in a variable, it is printed, and the print method of the ggplot object actually makes the image. This is why we can immediately see the images when we work with ggplot on the console.
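A short sketch that makes the distinction between creating and drawing visible, using the mpg data from ggplot2:

```r
library(ggplot2)

# Storing the plot in a variable draws nothing
p <- ggplot(mpg, aes(displ, hwy)) +
   geom_point()

# p is an ordinary R object; its class vector includes "ggplot"
class(p)

# The image is only drawn when the object is printed
print(p)
```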

Things may be different, however, if we do this in a script. When you execute a script, the returned objects are not printed, even if you do not save those in a variable. For instance, the script

diamonds %>%
   sample_n(1000) %>%
   ggplot() +
   geom_point(aes(carat, price))  # note: not printed

will not produce any output when the whole script is sourced (Source-d without echo, as opposed to being run line-by-line). If this is the problem you run into, you may want to store the plot in a variable and then print it explicitly:

p <- ggplot(...) +
   geom_something(...)
print(p)

As an additional bonus, now you have stored your plot in a variable and you can also add additional layers to it (see Section 13.10.2).

Note that this also applies to other contexts where ggplot is used, e.g. when making plots in shiny (see Section E).

In scripts we often want the code not to produce an image on screen, but to store it in a file instead. This can be achieved in a variety of ways, for instance through redirecting graphical output to a pdf device with the command pdf():

data <- diamonds %>%
   sample_n(1000)
p <- ggplot(data) +
   geom_point(aes(carat, price))  # store here
## redirect graphical output to a pdf file
pdf(file="diamonds.pdf", width=10, height=8)
print(p)  # print here
dev.off()  # remember to close the file

After redirecting the output, all plots will be written to the pdf file (as separate pages if you create more than one plot). Note that you have to close the file with dev.off(), otherwise it will be broken. There are other output options besides pdf; you may want to check the jpeg() and png() image outputs. Finally, ggplot also has a dedicated way to save individual plots to a file, ggsave().
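A minimal sketch of ggsave() follows. It writes to a temporary file, which is an assumption made only to keep the example free of side effects; in real code you would give a path of your own:

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
   geom_point()

# Temporary file; in real code use a path such as "mileage.pdf"
out_file <- tempfile(fileext = ".pdf")

# The file type is picked from the extension;
# width and height are in inches by default
ggsave(out_file, plot = p, width = 10, height = 8)

file.exists(out_file)
```

Unlike with pdf(), there is no need to print the plot or to call dev.off(): ggsave() opens the device, draws the plot and closes the file in one step.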

13.10.2 Re-using your code: adding layers with +

You frequently need to make multiple rather similar plots: most of the plotting commands are the same, but a few are different. A good way to achieve this in ggplot is to store the parts of your plot you want to re-use in variables. For instance, let’s make two plots of diamonds, one in linear-linear and the other in log-log scale.

plot of chunk diamonds-plot-log

Two similar plots, done by re-using the same code.

First, we select a subsample of diamonds and make a simple scatterplot:

p <- diamonds %>%
   sample_n(1000) %>%
   ggplot(aes(carat, price)) +
   geom_point()

This scatterplot is saved as p, and we can create the plot by just printing it as

p

As we did not specify any special scale, the plot will be in linear-linear scale. The other plot is created from the same variable p by adding the log-log scales:

p +
   scale_x_log10() +
   scale_y_log10()

As we do not store this into a new variable, the result will just be printed (plotted).

Note how we just add more layers to p, in a similar fashion as one can add numbers to a numeric variable. The + sign understands that these are ggplot objects and combines them accordingly.
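The same idea works the other way around: instead of storing the plot and adding different scales, one can store the shared components and add them to different plots. A sketch, assuming the components are collected into a list (log_log is a name made up for this example):

```r
library(ggplot2)

# Components that several plots share, stored as a list
log_log <- list(
   scale_x_log10(),
   scale_y_log10(),
   labs(x = "Carat (log scale)", y = "Price (log scale)")
)

# Two different plots of the first 1000 diamonds
small <- diamonds[1:1000, ]
p_plain <- ggplot(small, aes(carat, price)) +
   geom_point()
p_cut <- ggplot(small, aes(carat, price, col = cut)) +
   geom_point()

# Adding the list adds every element in it
p_plain + log_log
p_cut + log_log
```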

13.10.3 Indirect variable names

Another common task when programming is to use indirect variable names (see Section 11.3.1).


Indirect variable name used in a wrong way. ggplot assumes that the vertical position is given through the workspace variable “var”, and its value is “circumference”. That’s why every single dot is at level “circumference”.

It is not immediately obvious how to do it–a naive attempt with

data <- Orange %>%
   filter(Tree == 1)
var <- "circumference"
ggplot(data, aes(age, var)) +
   geom_point()

will result in a weird plot that is almost certainly not what you want.

The problem is similar to that of dollar versus double bracket notation for data frames: ggplot expects “var” to be a data variable name. As the data does not contain such a variable, it picks up the workspace variable var instead and plots its value, the string “circumference”, as a constant.


Indirect variable names can be used with aes_string(). Note that “age” must be quoted too.

Fortunately, there is an easy remedy: aes_string(), unlike aes(), expects the arguments to be strings, not unquoted variable names. And one can easily pass string values indirectly:

data <- Orange %>%
   filter(Tree == 1)
var <- "circumference"
ggplot(data, aes_string("age", var)) +
   geom_point()

Note that all variable names must be passed as character strings to aes_string(), so you must write aes_string("age", var), not aes_string(age, var). Here var must be left unquoted as it is a workspace variable, but “age” is the name of a data variable.
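Note that aes_string() is marked as deprecated in current ggplot2 releases. The replacement is the .data pronoun inside the ordinary aes(). Below is a sketch of the same plot written that way; it uses base-R subset() instead of filter() only to keep the example self-contained:

```r
library(ggplot2)

data <- subset(as.data.frame(Orange), Tree == 1)
var <- "circumference"

# .data[[var]] looks up the column whose name is stored in 'var';
# 'age' can stay unquoted as in any other aes() call
p <- ggplot(data, aes(age, .data[[var]])) +
   geom_point() +
   labs(y = var)  # set the axis label explicitly
p
```

Here the roles are the reverse of aes_string(): data variables such as age are written unquoted as usual, and only the indirect name is wrapped in .data[[ ]].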

13.11 Other Visualization Libraries

ggplot2 is easily the most popular library for producing data visualizations in R. That said, ggplot2 is used to produce static visualizations: unchanging “pictures” of plots. Static plots are great for explanatory visualizations: visualizations that are used to communicate some information or, more commonly, an argument about that information. All of the above visualizations have been ways to explain and demonstrate an argument about the data (e.g., the relationship between car engines and fuel efficiency).

Data visualizations can also be highly effective for exploratory analysis, in which the visualization is used as a way to ask and answer questions about the data (rather than to convey an answer or argument). While it is perfectly feasible to do such exploration on a static visualization, many explorations can be better served with interactive visualizations in which the user can select and change the view and presentation of that data in order to understand it.

While ggplot2 does not directly support interactive visualizations, there are a number of additional R libraries that provide this functionality, including:

  • ggvis is a library that uses the Grammar of Graphics (similar to ggplot), but for interactive visualizations. The interactivity is provided through the shiny library, which is introduced in a later chapter.

  • Bokeh is an open-source library for developing interactive visualizations. It automatically provides a number of “standard” interactions (pop-up labels, drag to pan, select to zoom, etc.). It is similar to ggplot2, in that you create a figure and then add layers representing different geometries (points, lines etc.). It has detailed and readable documentation, and is also available for other programming languages (such as Python).

  • Plotly is another library similar to Bokeh, in that it automatically provides standard interactions. It is also possible to take a ggplot2 plot and wrap it in Plotly in order to make it interactive. Plotly has many examples to learn from, though its documentation is less effective than that of other libraries.

  • rCharts provides a way to utilize a number of JavaScript interactive visualization libraries. JavaScript is the programming language used to create interactive websites (HTML files), and so is highly specialized for creating interactive experiences.

There are many other libraries as well; searching around for a specific feature you need may lead you to a useful tool!

Resources


  1. This conclusion is perhaps a bit premature. Namely, the variation in price is fairly small, compared to that of temperature.↩︎

  2. RColorBrewer::display.brewer.all() calls the function display.brewer.all() from the RColorBrewer package, without loading it first. It is mostly equivalent to two separate commands, library(RColorBrewer) and display.brewer.all().↩︎