2018-07-11

Data visualization

Data visualization is a critical component of data analysis and is used in two main settings:

  • visualization for explaining
  • visualization for exploring

While common principles apply to both, the aims of the two are different.

In this lecture we focus primarily on the second type of data visualization, and will introduce tools for exploratory data analysis in R.

ggplot2 R package for Data Visualisation

ggplot2 R package

  • ggplot is a function in the ggplot2 package and is based on The Grammar of Graphics by Leland Wilkinson, and the lattice package
  • ggplot is designed to work in a layered fashion, starting with a layer showing the raw data then adding layers of annotation and statistical summaries
  • The idea is to make the nice features of lattice available in a simpler way, and also make it easier to add additional components to the plot (as layers, which we talk about later)

ggplot() function

Let's look at an example using the diamonds data that comes with the ggplot2 package. But first, let's load the package into our session using the library() function. If you have not yet installed the ggplot2 package, you should do this first (you only have to do this once).
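The install-then-load step described above might look like the following minimal sketch. The requireNamespace() check is an illustrative addition (not part of the original slides) that skips installation when the package is already present.

```r
# Install ggplot2 only if it is missing (needed once per R installation)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package into the current session (needed in every new session)
library(ggplot2)
```

After this, ggplot() and the diamonds data are available without the ggplot2:: prefix.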

Some information about the diamonds dataset:

  • ~54,000 round diamonds from http://www.diamondse.info/
  • Variables:
    • carat, colour, clarity, cut
    • total depth, table, depth, width, height
    • price
  • A question of interest: What is the relationship between carat and price, and how does it depend on other factors?

Data on diamonds

Load the diamonds data set, get the dimensions, and look at the first few lines

data(diamonds, package="ggplot2")
dim(diamonds)
# [1] 53940    10
head(diamonds)
# # A tibble: 6 x 10
#   carat cut       color clarity depth table price     x     y     z
#   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.230 Ideal     E     SI2      61.5   55.   326  3.95  3.98  2.43
# 2 0.210 Premium   E     SI1      59.8   61.   326  3.89  3.84  2.31
# 3 0.230 Good      E     VS1      56.9   65.   327  4.05  4.07  2.31
# 4 0.290 Premium   I     VS2      62.4   58.   334  4.20  4.23  2.63
# 5 0.310 Good      J     SI2      63.3   58.   335  4.34  4.35  2.75
# 6 0.240 Very Good J     VVS2     62.8   57.   336  3.94  3.96  2.48

Default R plot of price versus carat

Using the default settings with the plot() function

plot(price~carat, data=diamonds)

Default ggplot of price versus carat

Using the default settings in ggplot()

ggplot(diamonds, aes(x=carat,y=price)) + geom_point()

A first glance

  • The default option in ggplot looks nicer!
  • ggplot's syntax looks weird, especially if you're not very familiar with lattice
  • ggplot() may be slower than plot() that comes with base R
  • Note: It is possible to manipulate the plot() options to get a similar plot (and maybe even a better one), but that would require a lot of extra coding
  • The ggplot syntax also makes plotting more structured and easier to update

ggplot syntax

Using the package ggplot2

Elements of a plot

  • data
  • aesthetics: mapping of variables to graphical elements
  • geom: type of plot structure to use
  • transformations: log scale, …

Additional components

  • layers: multiple geoms, multiple data sets, annotation
  • facets: show subsets in different plots
  • themes: modifying style
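As a rough illustration of how the elements listed above map onto code, the sketch below uses one ggplot2 call per element. The particular choices (alpha = 0.2, log scales via scale_x_log10() and scale_y_log10(), faceting by cut, theme_minimal()) are assumptions made for the example, not settings prescribed by the lecture.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +  # data and aesthetics
  geom_point(alpha = 0.2) +                         # geom: draw points
  scale_x_log10() +                                 # transformation: log scale on x
  scale_y_log10() +                                 # transformation: log scale on y
  facet_wrap(~cut) +                                # facets: one panel per cut
  theme_minimal()                                   # theme: overall style

p  # printing the object draws the plot
```

Each `+` adds one component, so any line can be removed or swapped without touching the others.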

ggplot syntax

ggplot(diamonds, aes(x=carat,y=price)) + geom_point()

The basic concept of a ggplot graphic is to combine different elements into layers. Each layer of a ggplot graphic must have a data set and aesthetic mappings.

  • data: for ggplot(), this must be a data frame!
  • aes: a mapping from the data to the plot; basically the x and y-axes

ggplot syntax

ggplot(diamonds, aes(x=carat,y=price)) + geom_point()

Layers can also have:

  • a geom, or a geometric object: defines the overall look of the layer (bars, points, or lines)
  • a stat, or a statistical summary: how the data should be summarized (e.g., binning for histograms, or smoothing to draw regression lines)
  • a position: how to handle overlapping points

When not specified, the defaults are used.
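To make the defaults concrete, the sketch below spells out the stat and position that geom_point() uses when none are given ("identity" for both, per its documented signature) and checks that the result matches the short form. The object names are hypothetical.

```r
library(ggplot2)

# Short form: stat and position are left at their defaults
p_default <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# Long form: the same layer with the defaults written out
p_explicit <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(stat = "identity", position = "identity")

# layer_data() returns the data as ggplot2 has prepared it for drawing
same_layer <- isTRUE(all.equal(layer_data(p_default), layer_data(p_explicit)))
same_layer
```

Other geoms have different defaults; geom_histogram(), for example, bins the data and stacks the bars.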

geom_boxplot

We can use geom_boxplot to create boxplots when one variable is continuous and the other is a factor.

ggplot(diamonds, aes(x=cut,y=price)) + geom_boxplot()

Changing the aesthetics

You can control the aesthetics of each layer, e.g. colour, size, shape, alpha (opacity) etc.

ggplot(diamonds, aes(carat, price)) + geom_point(colour = "blue")
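A common point of confusion is the difference between setting an aesthetic to a constant and mapping it to a variable. The sketch below contrasts the two; using cut as the mapped variable is an assumption made for illustration.

```r
library(ggplot2)

# Setting: colour is a fixed value, given outside aes()
p_set <- ggplot(diamonds, aes(carat, price)) +
  geom_point(colour = "blue")

# Mapping: colour depends on a variable, given inside aes();
# ggplot2 picks the colours and adds a legend
p_map <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(colour = cut))

# Inspect the colours actually assigned to the points
set_colours <- unique(layer_data(p_set)$colour)
map_colours <- unique(layer_data(p_map)$colour)
```

Writing aes(colour = "blue") instead would treat "blue" as a one-level variable rather than as a colour name.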

A few more examples

Changing the alpha level

ggplot(diamonds, aes(x=carat,y=price)) + geom_point(alpha = 0.2)

A few more examples

Changing the point size

ggplot(diamonds, aes(x=carat,y=price)) + geom_point(size = 0.2)

A few more examples

Changing the shape and the point size

ggplot(diamonds, aes(x=carat,y=price)) + geom_point(shape = 2,size=0.4)

Combining layers

The real power of ggplot is its ability to combine layers

ggplot(diamonds, aes(x=carat,y=price)) + geom_point(size = 0.2) +
  geom_smooth()
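Because a plot is an ordinary R object, layers can also be added step by step. In the sketch below, method = "lm" and formula = y ~ x are assumptions chosen to keep the example quick; with no arguments, geom_smooth() picks a smoother automatically.

```r
library(ggplot2)

# Start with a single layer of points and store the plot
p_points <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(size = 0.2)

# Add a second layer to the stored plot: a straight-line fit
p_smooth <- p_points +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 1 holds the raw points, layer 2 the fitted line
points_data <- layer_data(p_smooth, 1)
smooth_data <- layer_data(p_smooth, 2)
```

The original p_points object is unchanged, so several variants can be built from the same base plot.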

Transformations

In this case (and in many other situations), a log transformation may make the relationships between variables clearer. We can use coord_trans() for this.

ggplot(diamonds, aes(carat, price)) + geom_point(size = 0.5) +
coord_trans(x = "log10", y = "log10")
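ggplot2 offers a second way to get log axes, scale_x_log10() and scale_y_log10(), and the two approaches differ in when the transformation happens. The sketch below, offered as an illustration rather than as part of the original lecture, shows that scales transform the data before any statistics are computed, while coord_trans() leaves the data alone and only changes how it is drawn.

```r
library(ggplot2)

# Scale transformation: the data are converted to log10 units first
p_scale <- ggplot(diamonds, aes(carat, price)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_y_log10()

# Coordinate transformation: the data stay on the original scale
p_coord <- ggplot(diamonds, aes(carat, price)) +
  geom_point(size = 0.5) +
  coord_trans(x = "log10", y = "log10")

# Range of x as seen by each layer
range_scale <- range(layer_data(p_scale)$x)  # log10(carat)
range_coord <- range(layer_data(p_coord)$x)  # carat
```

The distinction matters once a statistical layer such as geom_smooth() is added, because the fit is computed on whatever values the layer sees.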

Adding information for a third variable

We can color by a factor variable (not that it's useful here!)

ggplot(diamonds, aes(carat, price, colour=color)) + geom_point() + 
    coord_trans(x = "log10", y = "log10")

Adding information for a third variable

We can also color by a continuous variable (not really useful either!)

ggplot(diamonds, aes(carat, price, colour=depth)) + geom_point() + 
    coord_trans(x = "log10", y = "log10")

Adding information for a third variable

In some cases, it may be more useful to get separate plots for each category of the third variable, to understand conditional relationships

ggplot(diamonds, aes(carat, price)) + geom_point() +
  facet_wrap(~color, ncol=4)

Adding information for a third variable

Alternatively, you can use facet_grid(), which also allows more than one conditioning variable (tables of plots)

ggplot(diamonds, aes(carat, price)) + geom_point() +
 facet_grid(~color, labeller=label_both)
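With two conditioning variables, facet_grid() takes a formula of the form rows ~ columns. The sketch below, which assumes cut for the rows and color for the columns, gives a 5 x 7 table of plots.

```r
library(ggplot2)

# One row of panels per cut, one column per color
p_grid <- ggplot(diamonds, aes(carat, price)) +
  geom_point(size = 0.2) +
  facet_grid(cut ~ color, labeller = label_both)

# Count the panels that contain data
n_panels <- length(unique(layer_data(p_grid)$PANEL))

p_grid
```

Every panel shares the same axes, which makes it easier to compare the carat-price relationship across combinations.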

A final note about syntax

There are actually many ways to get the same plot! The following commands will produce the same plot:

  • ggplot(diamonds, aes(price, carat)) + geom_point()
  • ggplot() + geom_point(aes(price, carat), diamonds)
  • ggplot(diamonds) + geom_point(aes(price, carat))
  • ggplot(diamonds, aes(price)) + geom_point(aes(y = carat))
  • ggplot(diamonds, aes(y=carat)) + geom_point(aes(x = price))
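One way to convince yourself that these forms are equivalent is to compare the x and y values each plot would draw. The helper xy() below is hypothetical and written only for this check.

```r
library(ggplot2)

# The five equivalent specifications from the list above
plots <- list(
  ggplot(diamonds, aes(price, carat)) + geom_point(),
  ggplot() + geom_point(aes(price, carat), diamonds),
  ggplot(diamonds) + geom_point(aes(price, carat)),
  ggplot(diamonds, aes(price)) + geom_point(aes(y = carat)),
  ggplot(diamonds, aes(y = carat)) + geom_point(aes(x = price))
)

# Hypothetical helper: extract the plotted positions of the first layer
xy <- function(p) layer_data(p)[, c("x", "y")]

# Compare every plot against the first one
reference <- xy(plots[[1]])
all_same <- all(vapply(plots, function(p) isTRUE(all.equal(xy(p), reference)),
                       logical(1)))
all_same
```

Mappings given in ggplot() are inherited by every layer, while mappings given in a geom apply to that layer only.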

A final note about syntax

ggplot(diamonds) + geom_point(aes(price, carat))

Additional plots available with ggplot

Cheat sheet

We are covering only a few of the many plot types that can be created with the ggplot2 package.

For a more comprehensive view of ggplot2, take a look at the ggplot2 Cheat sheet

Boxplots

We can summarize univariate distributions using boxplots

ggplot(diamonds, aes(1, depth)) + geom_boxplot()

Histograms

However, a histogram would be a better choice here

ggplot(diamonds, aes(depth)) + geom_histogram()

Notice the difference in the aes call; boxplot is really designed for multiple categories!

Histograms

The default options in geom_histogram() may not be sensible, and you often need to adjust the binwidth and xlim

ggplot(diamonds, aes(depth)) + geom_histogram(binwidth=0.2) + xlim(56,67)
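Note that xlim() does more than zoom: observations outside the limits are dropped (with a warning) before the bins are computed. If only the view should change, coord_cartesian() is an alternative. The sketch below compares the two by summing the bin counts.

```r
library(ggplot2)

# xlim() sets scale limits: depths outside [56, 67] are removed
# before the histogram bins are computed
p_xlim <- ggplot(diamonds, aes(depth)) +
  geom_histogram(binwidth = 0.2) +
  xlim(56, 67)

# coord_cartesian() only zooms the view: every observation is still binned
p_zoom <- ggplot(diamonds, aes(depth)) +
  geom_histogram(binwidth = 0.2) +
  coord_cartesian(xlim = c(56, 67))

# Total count across all bins for each version
n_xlim <- sum(layer_data(p_xlim)$count)
n_zoom <- sum(layer_data(p_zoom)$count)
```

For a plain histogram the visible bars look much the same either way; the difference matters when a summary is computed from the data.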

Boxplots with multiple categories

A better use of boxplot is when we want to compare distributions of a quantitative variable across categories of a factor variable, as previously discussed

ggplot(diamonds, aes(cut, depth)) + geom_boxplot()

Histograms with multiple categories

We can also get histograms for multiple categories. One option is to display them in separate panels (less useful when comparing)

ggplot(diamonds, aes(depth)) + geom_histogram(binwidth = 0.2) + 
    facet_wrap(~cut) + xlim(56, 67)

Overlaying Histograms

Or, you can combine the histograms in a single panel, filled by category (by default the bars are stacked)

ggplot(diamonds, aes(depth, fill=cut)) + 
    geom_histogram(binwidth=0.2) + xlim(56,67)
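By default geom_histogram() stacks the bars of the different groups on top of one another. For histograms that truly overlap, one option is position = "identity" with some transparency. The alpha value of 0.4 below is an arbitrary choice for illustration.

```r
library(ggplot2)

# Default: position = "stack", so bars for each cut sit on top of one another
p_stack <- ggplot(diamonds, aes(depth, fill = cut)) +
  geom_histogram(binwidth = 0.2) +
  xlim(56, 67)

# Overlay: every bar starts at zero; alpha keeps overlapping bars visible
p_overlay <- ggplot(diamonds, aes(depth, fill = cut)) +
  geom_histogram(binwidth = 0.2, position = "identity", alpha = 0.4) +
  xlim(56, 67)

p_overlay
```

With five categories the overlaid version can get crowded, in which case geom_freqpoly() (lines instead of bars) may be easier to read.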