Chapter 10 Visualizing data

set.seed(34)
library(tidyverse)
options(width=66, pillar.width=66)
thematic::thematic_on(font = thematic::font_spec(scale=1.8))

Data visualizations are a powerful way to understand certain properties and relationships in data. They may also liven up an otherwise dull report.

But visualizations are not all-powerful, and they can sometimes be deceptive. Visualizations can often represent low-dimensional data (containing only 2-3 variables) very well, but the options to make a high-dimensional dataset visually understandable are very limited.

Different kinds of visualizations are useful for different types of data. Here is a list of some of the most important categories:

Histogram: a single numeric variable. See Section 10.2.1.
Scatterplot: two numeric variables. Scatterplot is the appropriate way to represent data points where there is no inherent connection between different points. See Section 10.2.2.
Line plot: two numeric variables. Line plot is the appropriate way to represent data points where the data points are ordered, so there is a connection from the “previous” to the “next” point. See Section 10.2.3.
Barplot: a categorical and a numeric variable, and you only want to show a single value for each category (e.g. mean income by occupation). See Section 10.2.4.
Boxplot: a categorical and a numeric variable, but where you want to visualize the distribution of the numeric variable, depending on the category. See Section 10.2.5.

We discuss these visualizations in more detail below.

We rely on the ggplot visualization library (the ggplot2 package). It is part of the tidyverse, and hence does not require any additional setup once tidyverse is loaded.

10.1 ggplot visualization framework

ggplot is a framework designed for visualizing data. A ggplot plotting command consists of “layers”, joined by the “+” sign. The most important layers are aesthetics, which connect data to the visual properties of the plot (e.g. we may want to put the variable “age” on the horizontal axis), and geoms, the layers that actually draw the plot. Below, we provide a few basic examples, and discuss the plot types separately later.

We demonstrate the plots with Ice extent data. This can be loaded as:

ice <- read_delim("ice-extent.csv.bz2")

Now we have loaded the dataset and stored it under the name ice in the R workspace. The dataset looks like

ice %>%
   head(3)

## # A tibble: 3 × 7
##    year month `data-type` region extent  area  time
##   <dbl> <dbl> <chr>       <chr>   <dbl> <dbl> <dbl>
## 1  1978    11 Goddard     N        11.6  9.04 1979.
## 2  1978    11 Goddard     S        15.9 11.7  1979.
## 3  1978    12 Goddard     N        13.7 10.9  1979.

For our purposes, the important variables are year, month, region, extent, and area. Region means the hemisphere, “N” for north and “S” for south. Extent and area are sea ice extent and area (in millions of km2).

First, let’s make a plot of the average sea ice area for each month for the northern hemisphere in 2021:

## first filter 2021 only:
ice2021 <- ice %>%
   filter(year == 2021, region == "N")
## next, do the plot
ggplot(ice2021,
       aes(x=month, y=area)) +
   geom_point() +
   labs(x = "Month (2021)",
        y = "Monthly average sea ice area in North (million km2)")

plot of chunk icescatterplot

The first two lines of the code block just filter the 2021 northern hemisphere data out of the whole dataset, and store it as a new dataset “ice2021”. Next, we get into the actual plotting:

The first line, ggplot(ice2021, sets the dataset. It just tells ggplot that the data used below originates from that dataset.
The next line, aes(x=month, y=area)) sets up the aesthetics. This is the mapping between the visual layout of the plot and the variables in the data. Here it tells ggplot that the variable “month” should be placed horizontally (“x”) and “area” should be placed vertically (“y”). But note that while these first two lines set everything up for plotting (such as the axes, labels, and the gray background), they do not actually show any data.
The third line, geom_point() is the geom. Here we pick geom_point(), the scatterplot, which actually makes the black dots that are visible on the image.
The final line, labs(...) adjusts the axis labels. It is fairly self-explanatory. If you leave this out, the axis labels will be just the variable names, here “month” and “area”.

Note that the lines are combined with plus signs, not pipes. The whole ggplot command is essentially a single command, and you can think of the plus signs as the word “add”. The example above might then be read as

Take ice2021 data. Put “month” on x-axis and “area” on y-axis. Add scatterplot, add labels.
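The “add” reading can be made concrete by building a plot one piece at a time. The sketch below uses a small made-up data frame (toy, with invented values, not the real ice measurements) and stores each intermediate step under a name; the names base, withPoints and final are chosen only for this illustration.

```r
library(ggplot2)

## Toy data, invented for illustration (not the real ice measurements)
toy <- data.frame(month = 1:6,
                  area = c(12.1, 13.0, 13.2, 12.4, 10.9, 8.7))

## Data and aesthetics only: axes and background, but no data shown
base <- ggplot(toy, aes(x = month, y = area))

## "Add" a scatterplot layer, then "add" labels
withPoints <- base + geom_point()
final <- withPoints + labs(x = "Month", y = "Area (toy values)")

final  # printing the object draws the plot
```

Printing base alone should give an empty coordinate system, while final should look like the scatterplot built in a single command.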

The plot above was quite simplistic, and one may want to adjust it in many ways. Here is the same plot that includes a bit more tuning:

ggplot(ice2021,
       aes(x=month, y=area)) +
   geom_point(col="skyblue2", alpha=0.7, size=4) +
   labs(x = "Month (2021)",
        y = "Monthly average sea ice area in North (million km2)") +
   coord_cartesian(ylim=c(0,13)) +
   scale_x_continuous(breaks = c("Mar"=3, "Jun"=6, "Sep"=9, "Dec"=12))

plot of chunk unnamed-chunk-3

Here we tell geom_point to make the points of color “skyblue2” (just search Google for R color names; you can also use HTML hex values), and somewhat transparent (alpha=0.7 means only 70% opaque). We also request the points to be larger (size 4).
coord_cartesian(ylim=c(0,13)) sets the vertical span of the plot to be from 0 to 13 (M km2). Here it is mainly to demonstrate the axis limits, but it also helps the reader to understand how far we are from zero, that is, from having no sea ice at all.
Finally, scale_x_continuous(breaks = c("Mar"=3, "Jun"=6, "Sep"=9, "Dec"=12)) tells ggplot to only mark months 3, 6, 9 and 12, and label those not with the numbers, but with the month names.
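As a variation on the tuning options above, the following sketch spells out the axis marks with separate breaks and labels arguments instead of a named vector, and gives the color as an HTML hex value. The data frame toy and its values are invented for illustration only.

```r
library(ggplot2)

## Toy data, invented for illustration (not the real ice measurements)
toy <- data.frame(month = 1:12,
                  area = c(12.3, 12.8, 12.9, 12.6, 11.0, 8.8,
                           6.1, 4.0, 3.4, 5.2, 8.3, 10.7))

p <- ggplot(toy, aes(x = month, y = area)) +
   ## a light blue given as a hex value instead of a color name
   geom_point(col = "#7EC0EE", alpha = 0.7, size = 4) +
   ## zoom the vertical axis to 0..13 without dropping any data
   coord_cartesian(ylim = c(0, 13)) +
   ## mark only months 3, 6, 9, 12 and label them by name
   scale_x_continuous(breaks = c(3, 6, 9, 12),
                      labels = c("Mar", "Jun", "Sep", "Dec"))
p
```

The breaks and labels vectors must have the same length, with each label matched to the break in the same position.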

Next, we discuss the plot types in more detail.

10.2 Basic plot types

10.2.1 Histogram

Histogram is a way to display the distribution of a single numeric variable. We encountered histograms in Section 7.2.3 above. It is essentially counts of different values. For continuous variables the values are binned: we do not display the individual values, but instead count how many values fall into each bin. For instance, we can display the distribution of ice area on the northern hemisphere through all the years and months:

iceN <- ice %>%
   filter(region == "N") %>%
   filter(area > 0)  # remove missings, coded as negative values
ggplot(iceN, aes(x=area)) +
   geom_histogram()

plot of chunk ice-histogram

The default values are not particularly beautiful–we have a number of dark gray bars on light gray background. But we can see that the values stretch from less than 3 to almost 15, with values around 12 and around 5 being the most common.

However, displaying monthly sea ice data on a histogram like this is not very enlightening. Sea ice has strong seasonal trends, and on top of that the ice area has been steadily falling through the recent decades. Neither of these features is visible on the histogram. Histogram is better suited for variables that do not show such trends, or for cross-sectional data, that is, data where one measures a number of cases at the same time. In Section 7.2.3 we demonstrated the age and fare histograms for Titanic passengers. Let’s repeat the age histogram here. First, load Titanic data:

titanic <- read_delim("titanic.csv.bz2")

Now we can create the histogram as

ggplot(titanic, aes(x = age)) +
   geom_histogram(fill="skyblue", col="black", bins=30) +
   labs(x = "Age", y = "Count")

plot of chunk titanic-age-histogram

We added better axis labels with labs(). We also picked better colors. Note that for barplots, and other plots that cover an area, fill means the fill color and col means the border color. Finally, we told geom_histogram to create 30 bins.

Such a representation of passengers’ age is rather informative. For instance, it tells that the bulk of passengers were between 20 and 40 years old, probably representing the prime age for re-settling from the Old World to the New one. It also indicates another peak for toddlers; one may guess that this represents the children of the immigrants.
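The bins argument used above fixes the number of bins. A related option is binwidth, which fixes the width of each bin instead. The sketch below compares the two on a handful of made-up ages (the data frame toy is invented for illustration and includes one missing value).

```r
library(ggplot2)

## Toy ages, invented for illustration; NA marks a missing value
toy <- data.frame(age = c(1, 3, 19, 21, 22, 24, 25, 27, 28, 30,
                          31, 33, 36, 40, 45, 52, 60, NA))

## Fix the number of bins ...
pBins <- ggplot(toy, aes(x = age)) +
   geom_histogram(fill = "skyblue", col = "black", bins = 6)

## ... or fix the width of each bin (here 10 years)
pWidth <- ggplot(toy, aes(x = age)) +
   geom_histogram(fill = "skyblue", col = "black", binwidth = 10)

pBins   # ggplot warns that the missing value is removed
pWidth
```

Both plots count the same 17 non-missing ages; only the way they are grouped into bars differs. It is usually worth trying a few values, since too few bins hide the shape and too many make the plot noisy.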

10.2.2 Scatterplot

Scatterplot is a good way to relate two numeric variables. Above in Section 10.1 we used a scatterplot (geom_point) to relate month and ice area. Let’s do it here again, but now we will plot all months in the data:

ice %>%
   filter(area > 0,  # exclude missings (marked as negative)
          region == "N") %>%  # only northern hemisphere
   ggplot(aes(time, area)) +
   geom_point(col="skyblue", size=0.5)

plot of chunk viz-ice-north-scatterplot

While this plot looks somewhat interesting, it is not very enlightening. The problem is that we are squeezing too much information–too many months with widely varying ice area–on the same plot. Let us just focus on a single month–September (month of the yearly northern ice minimum):

ice %>%
   filter(area > 0,
          region == "N",
          month == 9) %>%
   ggplot(aes(time, area)) +
   geom_point(col="skyblue3", size=3)

plot of chunk viz-sept-scatterplot

10.2.3 Line plot

Line plot is also a way to represent two numeric variables. In a sense it is very similar to scatterplot, but here the points (observations) must be somehow clearly linked, e.g. they may represent the same object measured at different points in time.

In the case of ice area over time, the points (years) are clearly ordered in time. Hence one may also consider connecting them with lines, essentially transforming the scatterplot above into a line plot:

ice %>%
   filter(area > 0,
          region == "N",
          month == 9) %>%
   ggplot(aes(time, area)) +
   geom_line(col="orangered") +
   geom_point(col="skyblue3", size=3)

plot of chunk unnamed-chunk-6

Here we use two geoms–geom_point to make the dots, and geom_line to connect them. Note that the geoms are drawn in the given order, here first the lines and thereafter the points. Here we are essentially using a combined plot, neither a pure line plot nor a pure scatterplot.

10.2.4 Barplot

Barplot is a way to display the relationship between a numeric and a categorical variable. Unlike in the case of scatterplot and line plot, it is hard to display many different values per category using a barplot. For instance, we cannot really do a barplot of each fare paid by all the passengers depending on the class (but we can do such a scatterplot). Instead, we can make a barplot that shows the average fare paid on Titanic, by passenger class. We start by computing the average by class:

titanic %>%
   group_by(pclass) %>%
   summarize(fare = mean(fare, na.rm=TRUE))

## # A tibble: 3 × 2
##   pclass  fare
##    <dbl> <dbl>
## 1      1  87.5
## 2      2  21.2
## 3      3  13.3

(See Section 5.6 for more about grouped operations.)

This results in a data frame with two columns, pclass and fare. We can either store it as a variable, or feed directly to ggplot:

titanic %>%
   group_by(pclass) %>%
   summarize(fare = mean(fare, na.rm=TRUE)) %>%
   ggplot(aes(x=pclass, y=fare)) +
   geom_col(fill="skyblue", col="white")

plot of chunk unnamed-chunk-8

Why do we prefer a barplot here instead of scatterplot or line plot? Line plot would be quite misleading–the lines connecting the averages hint that there is some sort of continuous change from 87.5 (average for the 1st class) to 21.2 (average for the second class) and so on. But there is no continuous transition between classes.

Scatterplot is, strictly speaking, not wrong, but the small dots hint that these values are measured exactly at 1.0, 2.0, and 3.0 for the 1st, 2nd and 3rd class. It looks as if it is also possible to have 2.6th class and so on. Wide bars of the barplot stress that it is not possible.
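The other option mentioned above, storing the summary under a name before plotting, can be sketched as follows. The data frame toy and its fares are invented for illustration, and the name avgFare is chosen only for this example.

```r
library(dplyr)
library(ggplot2)

## Toy passenger data, invented for illustration
toy <- data.frame(pclass = c(1, 1, 2, 2, 3, 3),
                  fare = c(80, 100, 20, 24, 10, 14))

## Store the per-class averages under a name ...
avgFare <- toy %>%
   group_by(pclass) %>%
   summarize(fare = mean(fare, na.rm = TRUE))

## ... and plot the stored data frame
ggplot(avgFare, aes(x = pclass, y = fare)) +
   geom_col(fill = "skyblue", col = "white") +
   labs(x = "Passenger class", y = "Average fare")
```

Storing the summary is handy when the same averages are needed again, for instance to print them as a table next to the plot.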

10.2.5 Boxplot and violin plot

Boxplot is a way to compare the distribution of a numeric variable across the categories of a categorical variable. It is a little bit like several histograms plotted next to each other, just the histograms are rather simplified. Boxplots are widely used in scientific literature, but not that much elsewhere.

Below, we compare the age distribution by sex. “Sex” (male and female) is a categorical variable, and “age” is a continuous numerical variable, the distribution of which we are analyzing. One option for this is to do two separate histograms, one for men and one for women. This is, in essence, what boxplot does, just the histograms are very much simplified.

Let us compare the age distribution of male and female passengers:

ggplot(titanic,
       aes(sex, age)) +
   geom_boxplot(fill="lightblue3")

A typical boxplot shows multiple features of the distribution. The first, and the most prominent one, is the box. It displays where the middle half of the data, from the first to the third quartile, is located. The thick black lines in the middle of the boxes denote medians. We can see that the median age for both men and women is in the upper 20s, slightly higher for men than for women. On top and bottom of the boxes are “whiskers”. These extend up and down to the most extreme observation that lies within 1.5 times the box height from the edge of the box. Finally, data points that are further away than the end of the whiskers are “outliers” and are marked as separate dots.

This plot shows that male and female age distributions are very similar. Men tend to be 1-2 years older than women, but no major difference is visible here.
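To see where the box, whiskers and outliers come from, the quantities can be computed by hand. The sketch below does this in base R for ten made-up ages (invented for illustration). It follows the usual 1.5 box heights convention and is meant as an illustration of the idea, not as the exact code ggplot runs internally.

```r
## Toy ages, invented for illustration
ages <- c(2, 18, 21, 24, 27, 30, 33, 38, 45, 80)

q <- quantile(ages, c(0.25, 0.5, 0.75))  # box edges and median
boxHeight <- q[3] - q[1]                 # interquartile range

## Whiskers may reach at most 1.5 box heights beyond the box ...
upperLimit <- q[3] + 1.5 * boxHeight
lowerLimit <- q[1] - 1.5 * boxHeight
## ... and end at the most extreme observation inside these limits
upperWhisker <- max(ages[ages <= upperLimit])
lowerWhisker <- min(ages[ages >= lowerLimit])

## Everything beyond the limits is drawn as a separate dot
outliers <- ages[ages > upperLimit | ages < lowerLimit]
```

For these toy values the box spans 21.75 to 36.75 with the median at 28.5, the whiskers end at 2 and 45, and the single age of 80 is an outlier.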

Boxplot, as shown above, only works if the grouping variable is categorical. Above, we use sex, which has two categories. However, categories are often marked by numbers, and in that case R may just not know that the variable is actually categorical. For instance, it is common to denote sex by “1” and “2” instead of “male” and “female”. In such cases we need to tell R that it is actually a categorical variable by wrapping sex in the factor() function:

ggplot(titanic,
       aes(factor(sex), age)) +
   geom_boxplot(fill="lightblue3")

factor() changes the variable from numeric to categorical. Here it is not necessary, because the existing categories are not numbers! See more in Section 6.1.3.
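A small sketch of what factor() does to numerically coded categories is given below. The vector sexCode is invented for illustration, and which code stands for which sex is an assumption made up for this example only.

```r
## Toy data, invented for illustration: sex coded as 1 and 2.
## Which code means which sex is an assumption made up for this sketch.
sexCode <- c(1, 2, 2, 1, 1)

is.numeric(sexCode)          # TRUE: R treats the codes as numbers
sexFactor <- factor(sexCode, levels = c(1, 2),
                    labels = c("male", "female"))
is.factor(sexFactor)         # TRUE: now categorical
levels(sexFactor)            # "male" "female"
table(sexFactor)             # counts per category
```

The optional labels argument replaces the numeric codes with readable names, which then also show up on the plot axis.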

Exercise 10.1 Make a box plot where you analyze the passengers’ age depending on their class. What can you conclude from this plot?

Hint: pclass is numeric, not categorical!

See the solution

Another plot type that can be used for the same tasks as boxplot is the violin plot. They are called violin plots because they typically look like elongated, rounded, symmetric objects, somewhat like violins.

Violin plots are almost like plotting multiple histograms, one for each group, vertically:

ggplot(titanic, aes(sex, age)) +
   geom_violin(fill="lightblue3")

The violin plot gives us broadly similar information as the boxplot. We can see that both men and women on the ship were predominantly in their 20s and 30s, but there were also a number of children. We also see that men are somewhat older, but the difference is not large.

There are various options for violin plot, e.g. to display the sample quantiles.

Which one, boxplot or violin plot, should you choose? This depends on what exactly you want to show. Do you want to show just a few sample quantiles? Then the boxplot is better, as it marks the quantiles and little else. But if you want to stress the shape of the distributions, then this is what violin plots are suited for.

Exercise 10.2 Create a violin plot that shows the ticket price (fare) by passenger class.

See the solution

10.3 Grouping data on plots

A powerful feature of ggplot is the ability to split the data into groups and denote the groups by different colors, line types or other markers. Above, in Section 10.1, we made a plot of ice area by month in the northern hemisphere. We achieved this by first filtering only northern values (filter(region == "N")), and second by setting the month on the horizontal and ice area on the vertical axis by aes(x = month, y = area). But what if we want to display the amount of ice both in the southern and northern hemisphere?

Obviously, one can do two similar plots, one for north and another for south. This is fairly easy to do, but it feels somewhat redundant. After all, the original image had plenty of free space, and instead of creating two such mostly empty plots, we could mark both ice figures on the same plot by using a different color (or line style for black and white print). Fortunately, this can be achieved easily. We just need to keep both regions (remove the region == "N" filtering condition), and tell ggplot to use an additional aesthetic that will now depend on the region. For instance, if we want to mark the different hemispheres with different colors, we can add col = region to the aes() function, so the function becomes aes(x=month, y=area, col=region). Here is the result:

ice %>%
   filter(area > 0,
          year == 2021) %>%
   ggplot(aes(x=month, y=area, col=region)) +
   geom_line() +
   geom_point() +
   scale_x_continuous(breaks = c("Mar"=3, "Jun"=6, "Sep"=9, "Dec"=12))

plot of chunk viz-ice-groups

The example is otherwise similar to the one in Section 10.1. But now the aesthetics function aes includes three variables that are mapped to graphical elements. We tell ggplot to use

month as horizontal position x
area as vertical position y
region as line color col.

Note that col = region will automatically split the plot into two separate lines, one red (for north) and another blue (for south). It is not a single line with alternating red and blue segments! So additional aesthetics do not just make the points of different color, they also separate the points into different groups. Moreover, we also get the corresponding legend on the right-hand side of the plot.
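Color is not the only aesthetic that groups data in this way. For black-and-white print one might map the region to the line style instead, using the linetype aesthetic. The sketch below does this on a small made-up data frame (toy, invented for illustration, not the real ice measurements).

```r
library(ggplot2)

## Toy data, invented for illustration (not the real ice measurements)
toy <- data.frame(month = rep(1:4, times = 2),
                  region = rep(c("N", "S"), each = 4),
                  area = c(12.5, 13.2, 13.4, 12.6,
                           3.1, 2.2, 3.0, 5.4))

## linetype = region: one line style per hemisphere
p <- ggplot(toy, aes(x = month, y = area, linetype = region)) +
   geom_line() +
   geom_point()
p
```

As with color, the categorical variable splits the data into two separate lines and produces a legend, here showing the two line styles.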

Here is another example. Let’s plot the monthly average ice area again as points connected by a line. But this time, let’s color the points according to the area:

ice %>%
   filter(area > 0,
          year == 2021,
          region == "N") %>%
   ggplot(aes(x=month, y=area, col=area)) +
   geom_line() +
   geom_point(size=15) +
   scale_x_continuous(breaks = c("Mar"=3, "Jun"=6, "Sep"=9, "Dec"=12))

plot of chunk viz-ice-monthly

The plot is done mostly in a similar way as the one in Section 10.1. First we filter only valid observations, year 2021, and the northern hemisphere. But now we ask ggplot to use

month as horizontal position x
area as vertical position y
area as the color of the line and points col.

We also make the points large (size = 15) in order to be able to actually see the color difference. There are two things to notice:

First, we use the variable area for two different visual elements: vertical location and color. This is perfectly fine, although may be somewhat redundant–we can guess what color is a point based on the position, and we can also guess the position based on color. But this may be useful sometimes to stress your point.

Second, instead of making a different line for each area value, we still have just a single line. Just the points are of different color. Compare this with the previous example: when we requested col = region we got two lines, but now that we request col = area, we get a single line. Why is it like that? The reason is that region is a categorical variable, but area is a numeric variable. If you request the color (or other similar elements) to be dependent on a categorical variable, then the data points will be split into different groups based on that variable. If you request color to be dependent on a numeric variable, then you get a single group, but the points are painted in different colors. This is typically what one wants, but sometimes not.

For instance, let us now look at the long-term trends in ice area for different months. We plot the ice area over the years, and we mark the months with different colors. Let’s focus on the northern hemisphere. Let us start with March (northern ice maximum), June (month of rapid melting) and September (minimum). Here is the plot:

ice %>%
   filter(area > 0,
          month %in% c(3,6,9),
          region == "N") %>%
   ggplot(aes(x=year, y=area, col=month)) +
   geom_line() +
   geom_point()

plot of chunk viz-ice-month

The plot looks weird. Instead of seeing three different lines drawn in three different colors, we see a single very jumpy line that contains dots of different shades of blue. The problem is that our color variable, month, is not coded as categorical. If you look at a few lines of data,

ice %>%
   head(1)

## # A tibble: 1 × 7
##    year month `data-type` region extent  area  time
##   <dbl> <dbl> <chr>       <chr>   <dbl> <dbl> <dbl>
## 1  1978    11 Goddard     N        11.6  9.04 1979.

you can see that month is marked as <dbl>, i.e. a numeric variable. Fortunately we can convert a numeric variable into a categorical one very easily using the function factor():

ice %>%
   filter(area > 0,
          month %in% c(3,6,9),
          region == "N") %>%
   ggplot(aes(x=year, y=area, col=factor(month))) +
   geom_line() +
   geom_point()

plot of chunk viz-ice-month-factor

This plot now looks as expected. Let’s repeat what we did here: First, we filter only the selected months, 3, 6, and 9, and only the northern hemisphere. We tell ggplot to map

year as horizontal position x
area as vertical position y
month as line color col. But note–as month is a numeric variable, we convert it to a categorical one as factor(month).

So we get three lines of different color that depict the ice area for different months. The red line shows the March area which, not surprisingly, is at the top of the figure, as March is the month of ice maximum on the northern hemisphere.
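The difference between a numeric and a categorical color variable can also be checked without looking at the picture. The sketch below builds both versions of the plot on a small made-up data frame (toy, invented for illustration) and uses layer_data() from ggplot2 to count how many separate groups, and hence lines, ggplot will draw.

```r
library(ggplot2)

## Toy data, invented for illustration (not the real ice measurements)
toy <- data.frame(year = rep(2019:2021, each = 3),
                  month = rep(c(3, 6, 9), times = 3),
                  area = c(13.0, 9.1, 3.5,
                           13.2, 9.0, 3.3,
                           12.9, 8.8, 3.6))

## Numeric color variable: a single group
pNumeric <- ggplot(toy, aes(x = year, y = area, col = month)) +
   geom_line()
## Categorical color variable: one group per month
pFactor <- ggplot(toy, aes(x = year, y = area, col = factor(month))) +
   geom_line()

## Number of separate lines ggplot will draw in each case
length(unique(layer_data(pNumeric)$group))  # 1
length(unique(layer_data(pFactor)$group))   # 3
```

With the numeric month all points fall into one group, which is what produces the single jumpy line; wrapping month in factor() gives one group per month.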