2018-07-11
While common principles apply to both of the above, the aims of the two are different.
In this lecture we focus primarily on the second type of data visualization, and will introduce tools for exploratory data analysis in R.
Let's look at the example diamonds data set that comes with the ggplot2 package. But first, let's load the package into our session using the library() function. If you have not yet installed the ggplot2 package, you should do this first (you only have to do this once).
Some information about the diamonds dataset:
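The install-once, load-every-session pattern described above can be sketched as follows. The requireNamespace() guard is an addition for illustration, not part of the original slides; it simply skips the installation when the package is already present.

```r
# Install ggplot2 only if it is not already available (a one-time step)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Attach the package; this is needed in every new R session
library(ggplot2)
```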
Load the diamonds data set, get the dimensions, and look at the first few lines
data(diamonds, package="ggplot2")
dim(diamonds)
# [1] 53940 10
head(diamonds)
# # A tibble: 6 x 10
#   carat cut       color clarity depth table price     x     y     z
#   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.230 Ideal     E     SI2      61.5   55.   326  3.95  3.98  2.43
# 2 0.210 Premium   E     SI1      59.8   61.   326  3.89  3.84  2.31
# 3 0.230 Good      E     VS1      56.9   65.   327  4.05  4.07  2.31
# 4 0.290 Premium   I     VS2      62.4   58.   334  4.20  4.23  2.63
# 5 0.310 Good      J     SI2      63.3   58.   335  4.34  4.35  2.75
# 6 0.240 Very Good J     VVS2     62.8   57.   336  3.94  3.96  2.48
Using the default settings with the plot() function
plot(price~carat, data=diamonds)
Using the default settings in ggplot()
ggplot(diamonds, aes(x=carat,y=price)) + geom_point()
Elements of a plot
Additional components
ggplot(diamonds, aes(x=carat,y=price)) + geom_point()
The basic concept of a ggplot graphic is to combine different elements into layers. Each layer of a ggplot graphic must have a data set and aesthetic mappings.
ggplot(diamonds, aes(x=carat,y=price)) + geom_point()
Layers can also have:
a geom, or a geometric object: defines the overall look of the layer – is it bars, points, or lines?
a position: how to handle overlapping points
When not specified, the defaults are used.
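To make the defaults concrete, here is a minimal sketch that spells out what geom_point() fills in on its own. It assumes the documented defaults for geom_point(), namely stat = "identity" and position = "identity"; the jittered variant is shown only as one example of a non-default position.

```r
library(ggplot2)

# Relying on the defaults
p_default <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# The same layer with the default stat and position written out
p_explicit <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(stat = "identity", position = "identity")

# A non-default position: add a little random noise to separate overlapping points
p_jitter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(position = "jitter")
```

Printing p_default and p_explicit should give the same picture, while p_jitter moves each point slightly.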
We can use geom_boxplot to create boxplots when one variable is continuous and the other is a factor.
ggplot(diamonds, aes(x=cut,y=price)) + geom_boxplot()
You can control the aesthetics of each layer, e.g. colour, size, shape, alpha (opacity) etc.
ggplot(diamonds, aes(carat, price)) + geom_point(colour = "blue")
Changing the alpha level
ggplot(diamonds, aes(x=carat,y=price)) + geom_point(alpha = 0.2)
Changing the point size
ggplot(diamonds, aes(x=carat,y=price)) + geom_point(size = 0.2)
Changing the shape and the point size
ggplot(diamonds, aes(x=carat,y=price)) + geom_point(shape = 2,size=0.4)
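A common source of confusion is the difference between setting an aesthetic to a fixed value (outside aes()) and mapping it to data (inside aes()). The sketch below contrasts the two; the object names p_set and p_map are illustrative only.

```r
library(ggplot2)

# Setting: every point is drawn in blue
p_set <- ggplot(diamonds, aes(carat, price)) +
  geom_point(colour = "blue")

# Mapping: "blue" is treated as a data value with a single level, so the
# points get the first colour of the default discrete palette and a legend
p_map <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(colour = "blue"))

# Inspect the colours that ggplot2 actually computed for each layer
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```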
The real power of ggplot is its ability to combine layers
ggplot(diamonds, aes(x=carat,y=price)) + geom_point(size = 0.2) + geom_smooth()
In this case (and in many other situations), a log transformation may make the relationships between variables clearer. We can use coord_trans() for this.
ggplot(diamonds, aes(carat, price)) + geom_point(size = 0.5) + coord_trans(x = "log10", y = "log10")
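As an alternative that is not used in the original slides, the same transformation can be applied through the scales with scale_x_log10() and scale_y_log10(). The sketch below assumes that approach is acceptable here. The practical difference is that scale transformations are applied to the data before any statistics are computed, whereas coord_trans() transforms the coordinate system afterwards, which matters once a layer such as geom_smooth() is added.

```r
library(ggplot2)

# Log-transform both axes via the scales instead of the coordinate system
p_scale <- ggplot(diamonds, aes(carat, price)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_y_log10()

# The data seen by the layer are already on the log10 scale
range(layer_data(p_scale)$x)
```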
We can color by a factor variable (not that it's useful here!)
ggplot(diamonds, aes(carat, price, colour=color)) + geom_point() + coord_trans(x = "log10", y = "log10")
We can also color by a continuous variable (not really useful either!)
ggplot(diamonds, aes(carat, price, colour=depth)) + geom_point() + coord_trans(x = "log10", y = "log10")
In some cases, it may be more useful to get separate plots for each category of the third variable, to understand conditional relationships
ggplot(diamonds, aes(carat, price)) + geom_point() + facet_wrap(~color, ncol=4)
Alternatively, you can use facet_grid, which also allows more than 1 conditioning variable (tables of plots)
ggplot(diamonds, aes(carat, price)) + geom_point() + facet_grid(~color, labeller=label_both)
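A minimal sketch of facet_grid() with two conditioning variables follows. Using cut for the rows and color for the columns is a choice made here for illustration; any pair of factors in the data would work the same way.

```r
library(ggplot2)

# One panel per combination of cut (rows) and color (columns)
p_grid <- ggplot(diamonds, aes(carat, price)) +
  geom_point(size = 0.3) +
  facet_grid(cut ~ color, labeller = label_both)

# Number of panels that contain data
length(unique(layer_data(p_grid)$PANEL))
```

With 5 levels of cut and 7 levels of color this is a large table of small plots, so it is mainly useful on a wide display.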
There are actually many ways to get the same plot! For example, the aesthetics can be mapped inside the geom layer rather than in the ggplot() call:
ggplot(diamonds) + geom_point(aes(price, carat))
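To illustrate the point about equivalent specifications, the three calls below place the data and the aesthetic mapping in different parts of the call. They are written here with carat on the x-axis and price on the y-axis, as in the earlier scatterplots; the object names p1, p2 and p3 are illustrative only.

```r
library(ggplot2)

# Data and mapping both given in ggplot()
p1 <- ggplot(diamonds, aes(carat, price)) + geom_point()

# Data in ggplot(), mapping in the layer
p2 <- ggplot(diamonds) + geom_point(aes(carat, price))

# Data and mapping both given in the layer
p3 <- ggplot() + geom_point(aes(carat, price), data = diamonds)
```

Anything given in ggplot() is inherited by every layer, while anything given inside a geom applies to that layer only, which becomes relevant once a plot has several layers.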
We are covering only a few of the many plot types that can be created with the ggplot2 package.
For a more comprehensive view of ggplot2, take a look at the ggplot2 Cheat sheet
We can summarize univariate distributions using boxplots
ggplot(diamonds, aes(1, depth)) + geom_boxplot()
However, a histogram would be a better choice here
ggplot(diamonds, aes(depth)) + geom_histogram()
Notice the difference in the aes call; boxplot is really designed for multiple categories!
The default options in geom_histogram() may not be sensible, and you often need to adjust the binwidth and xlim
ggplot(diamonds, aes(depth)) + geom_histogram(binwidth=0.2) + xlim(56,67)
A better use of boxplot is when we want to compare distributions of a quantitative variable across categories of a factor variable, as previously discussed
ggplot(diamonds, aes(cut, depth)) + geom_boxplot()
We can also get multiple histograms. One option is to display them in separate panels (less useful when comparing)
ggplot(diamonds, aes(depth)) + geom_histogram(binwidth = 0.2) + facet_wrap(~cut) + xlim(56, 67)
Or, you can draw the histograms in a single panel (with the default settings, the bars for each cut are stacked)
ggplot(diamonds, aes(depth, fill=cut)) + geom_histogram(binwidth=0.2) + xlim(56,67)
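Because geom_histogram() uses position = "stack" by default, the bars in the plot above sit on top of one another. A minimal sketch of a literal overlay, assuming semi-transparent bars are acceptable, sets position = "identity" together with an alpha value (0.4 is an arbitrary choice):

```r
library(ggplot2)

# Default behaviour: bars for the different cuts are stacked
p_stack <- ggplot(diamonds, aes(depth, fill = cut)) +
  geom_histogram(binwidth = 0.2) +
  xlim(56, 67)

# Overlay: every histogram starts at zero and the bars are semi-transparent
p_overlay <- ggplot(diamonds, aes(depth, fill = cut)) +
  geom_histogram(binwidth = 0.2, position = "identity", alpha = 0.4) +
  xlim(56, 67)
```

Both plots warn about the observations that fall outside the chosen x limits; that is expected here.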