Chapter 4 Plotting: matplotlib and seaborn

import numpy as np
np.random.seed(10)

Python has many plotting libraries. Here we discuss two of the simplest ones, matplotlib and seaborn. Matplotlib is in a sense a very basic plotting library, oriented toward vectors rather than datasets (in this sense comparable to base-R plotting). But it is very widely used, and with some effort it can produce very good-looking plots. It is also easier to tinker with lower-level features in matplotlib than in the more high-level, data-oriented libraries.

Seaborn is such a high-level, data-oriented plotting library (comparable to ggplot in R in this sense). It has ready-made functionality to pick variables from datasets and to modify the visual properties of lines and points depending on other values in the data.

We assume you have imported the following modules:

import numpy as np
import pandas as pd

4.1 Matplotlib

Matplotlib is designed to be similar to the plotting functionality in the popular matrix language MATLAB. It is a library geared toward scientific plotting. In these notes we are mainly interested in the pyplot module, but matplotlib contains more functionality, e.g. handling of images. Typically we import pyplot as plt:

import matplotlib.pyplot as plt

This page is compiled using matplotlib 3.5.1.

4.1.1 Introductory examples

The module provides basic plot types such as scatterplots and line plots. Both of these functions should be called with x and y vectors. Here is a demonstration of a simple scatterplot:

x = np.random.normal(size=50)
y = np.random.normal(size=50)
_ = plt.scatter(x, y)
_ = plt.show()


This short snippet demonstrates several features:

  • first, we create 50 random dots using numpy
  • plt.scatter creates a scatterplot (point plot). It takes arguments x and y for the horizontal and vertical placement of dots
  • plt.scatter returns an object, so we may want to assign it to a temporary variable to avoid printing it. We use the variable name _ (just an underscore) for a value that we do not intend to store but only want to keep from being printed.
  • Scatterplot automatically computes the suitable axis range.
  • plt.show makes the plot visible. It may not be necessary, depending on the environment you use. For instance, when you run a notebook cell, it will automatically make the plot visible at the end. However, if you want to make two plots inside of the cell, you still need to call plt.show to tell matplotlib that you are done with the first plot and now it is time to show it.
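  • Finally, we also assign the result of plt.show to a temporary variable. In plain Python plt.show returns None, so this is not strictly necessary, but it is harmless and keeps the output clean in environments that echo return values.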
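To make the points about return values and automatic axis ranges concrete, here is a minimal sketch. It assumes the non-interactive Agg backend so that it runs without a display, and it uses a local random generator called rng, an illustrative name that is not used elsewhere in these notes. The object returned by plt.scatter is a PathCollection, and the automatically chosen limits enclose all the data points.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(10)  # illustrative generator, separate from np.random.seed
x = rng.normal(size=50)
y = rng.normal(size=50)

points = plt.scatter(x, y)
print(type(points).__name__)  # name of the class returned by plt.scatter

# The automatically chosen x-axis limits enclose all data points
xmin, xmax = plt.xlim()
covers_data = (xmin <= x.min()) and (xmax >= x.max())
print(covers_data)

plt.close()  # discard the figure
```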

Next, here is a simple example of a line plot:

x = np.linspace(-5, 5, 100)
y = np.sin(x)
_ = plt.plot(x, y)
_ = plt.show()


Most of the functionality should be clear by now, but here are a few notes:

  • The first two lines create a sequence of 100 evenly spaced numbers between -5 and 5 and compute the sine of these numbers.
  • Line plots are made with plt.plot, which also takes arguments x and y.
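As a small extension of the example above, the following sketch draws two curves on the same axes by calling plt.plot twice. The label and linestyle arguments and the call to plt.legend are not used in the original example; they are added here only for illustration. The Agg backend is an assumption that lets the code run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-5, 5, 100)

# Each call to plt.plot adds a line to the current axes and returns a list of lines
sin_lines = plt.plot(x, np.sin(x), label="sin(x)")
cos_lines = plt.plot(x, np.cos(x), linestyle="--", label="cos(x)")

legend = plt.legend()  # uses the label of each line
legend_labels = [t.get_text() for t in legend.get_texts()]
n_lines = len(plt.gca().lines)  # number of lines on the current axes

plt.close()
```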

4.1.2 Tuning plots

Matplotlib offers a number of arguments and additional functions to improve the look of the plots. Below we demonstrate a few:

x = np.random.normal(size=50)
y = np.random.normal(size=50)
_ = plt.scatter(x, y,
                color="red",  # dot color
                edgecolor="black",
                alpha=0.5  # transparency
)
_ = plt.xlabel("x")  # axis labels
_ = plt.ylabel("y")
_ = plt.title("Random dots")  # main label
_ = plt.xlim(-5, 5)  # axis limits
_ = plt.ylim(-5, 5)
_ = plt.show()


Most of the features demonstrated above are obvious from the code and comments. However, some explanations are still needed:

  • Argument color denotes dot color when specified as color name, like “red” or “black”. There is also another way to specify colors, c, see below.
  • Alpha denotes transparency, with alpha=0 being completely transparent (invisible) and alpha=1 being completely opaque (default).
  • All the additional functions return an object that we store into a temporary variable in order to avoid printing.
  • All the additional functions in plt are executed before the actual plot is drawn on screen. In particular, although we specify the axis limits after plt.scatter, they still apply to the scatterplot.
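The last point can be checked directly. The sketch below, which assumes the Agg backend and an illustrative generator rng, reads the x-axis limits before and after calling plt.xlim(-5, 5). Calling plt.xlim without arguments returns the current limits.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(10)  # illustrative generator
x = rng.normal(size=50)
y = rng.normal(size=50)

plt.scatter(x, y, color="red", edgecolor="black", alpha=0.5)
auto_limits = tuple(plt.xlim())    # limits chosen automatically from the data

plt.xlim(-5, 5)                    # set after the scatterplot was created
plt.ylim(-5, 5)
manual_limits = tuple(plt.xlim())  # the same axes now use the new limits

print(auto_limits, manual_limits)
plt.close()
```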

Sometimes we want to make the color of the dots depend on another variable. In this case we can use the argument c instead of color:

x = np.random.normal(size=50)
y = np.random.normal(size=50)
z = np.random.choice([1,2,3], size=50)
_ = plt.scatter(x, y,
                c=z  # color made of variable "z"
)
_ = plt.show()


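Now the dots have different colors, depending on the value of z. Note that values passed to the c argument are mapped to colors through a colormap, so they must be numbers. Arbitrary strings (such as category labels) will not work, although strings that are valid color names are accepted.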
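If the grouping variable contains strings, one possible workaround is to convert the labels to integer codes first. The sketch below uses pd.factorize for this. The labels "low", "mid" and "high" are hypothetical and serve only as an illustration, and the Agg backend is an assumption that lets the code run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)  # illustrative generator
x = rng.normal(size=50)
y = rng.normal(size=50)

# Hypothetical category labels, repeated in a fixed order
group = np.tile(["low", "mid", "high"], 17)[:50]

# factorize assigns codes 0, 1, 2, ... in order of first appearance
codes, labels = pd.factorize(group)

points = plt.scatter(x, y, c=codes)  # numeric codes are accepted by c
plt.close()
```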

4.1.3 Histograms

Histograms are a quick and easy way to get an overview of 1-D data distributions. They can be plotted using plt.hist. As hist returns the bin data, one may want to assign the result to a temporary variable to avoid spurious printing in IPython-based environments (such as notebooks):

x = np.random.normal(size=1000)
_ = plt.hist(x)
_ = plt.show()


Not surprisingly, the histogram of normal random variables looks like, well, a normal curve.

We may tune the picture somewhat by using the argument bins to specify the desired number of bins, and make the bins more distinct by specifying edgecolor:

_ = plt.hist(x, bins=30, edgecolor="w")
_ = plt.show()

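Instead of discarding the value returned by plt.hist, we can also inspect it. The following sketch unpacks it into the bin counts, the bin edges, and the drawn bars. The variable names are illustrative, and the Agg backend is an assumption that lets the code run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(10)  # illustrative generator
x = rng.normal(size=1000)

# plt.hist returns the counts per bin, the bin edges, and the bar objects
counts, edges, patches = plt.hist(x, bins=30, edgecolor="w")

print(counts.sum())  # every observation falls into exactly one bin
print(len(edges))    # one more edge than there are bins
plt.close()
```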

4.2 Seaborn: data oriented plotting

The seaborn library is designed for plotting data, not vectors of numbers. It is built on top of matplotlib and has only limited functionality outside of that library. Hence, in order to achieve the desired results with seaborn, one has to rely on some matplotlib functionality for adjusting the plots. Seaborn is typically imported as sns:

import seaborn as sns

Below we use seaborn 0.11.2.

Here is a usage example with a data frame of three random variables:

df = pd.DataFrame({"x": np.random.normal(size=50),
                   "y": np.random.normal(size=50),
                   "z": np.random.choice([1,2,3], size=50)})
_ = sns.scatterplot(x="x", y="y", hue="z", data=df)
_ = plt.show()


Note the similarities and differences compared to matplotlib:

  • The information is fed to seaborn using arguments x, y, and hue (and more) that determine the horizontal and vertical location of the dots, and their color (“hue”).
  • Here these arguments are not data vectors, as in the case of matplotlib, but variable names. They are looked up in the data frame specified with the argument data.
  • Seaborn automatically provides the axis labels and the legend.
  • If needed, the plot can be further adjusted with matplotlib functionality; here we just use plt.show() to display it.
  • Seaborn may show a legend entry for the z value “0”, even though no such value exists in the data. This is most likely because a numeric hue variable is treated as a continuous quantity, so the legend may show evenly spaced reference values rather than the values actually present:
df.z.unique()
## array([2, 3, 1])
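If the legend should list exactly the values present in the data, one option that appears to work is passing legend="full" to sns.scatterplot. The sketch below is an illustration under stated assumptions: it uses the Agg backend, an illustrative generator rng, and a z column built with np.tile so that each of the values 1, 2 and 3 is guaranteed to occur. The exact legend layout may differ between seaborn versions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(10)  # illustrative generator
df = pd.DataFrame({"x": rng.normal(size=60),
                   "y": rng.normal(size=60),
                   "z": np.tile([1, 2, 3], 20)})  # each value occurs 20 times

# legend="full" requests one legend entry per unique value of the hue variable
ax = sns.scatterplot(x="x", y="y", hue="z", data=df, legend="full")
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
plt.close()
```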

4.2.1 Different plot types

The plotting functions of seaborn are largely comparable to those of matplotlib, but the names may differ. Seaborn also offers additional plot types, such as density plots, and the option to add a regression line to a scatterplot.

4.2.1.1 Scatterplot

The example above already demonstrated a scatterplot. We make another scatterplot here using sea ice extent data, this time demonstrating marker types (style). The dataset looks like this:

ice = pd.read_csv("../data/ice-extent.csv.bz2", sep="\t")
ice.head(3)
##    year  month data-type region  extent   area         time
## 0  1978     11   Goddard      N   11.65   9.04  1978.875000
## 1  1978     11   Goddard      S   15.90  11.69  1978.875000
## 2  1978     12   Goddard      N   13.67  10.90  1978.958333

We plot the northern sea ice extent (measured in millions of km\(^2\)) for September (the month of the yearly minimum) and March (the yearly maximum) through the years. We put both months on the same plot, using a different marker for each:

_ = sns.scatterplot(x="time", y="extent", style="month",
                    data=ice[ice.month.isin([3,9]) & (ice.region == "N")])
_ = plt.ylim(0, 17)
_ = plt.show()


The plot shows two sets of dots: circles for March and crosses for September. Note that seaborn automatically adds default labels for the marker types. We also use matplotlib’s plt.ylim to set the limits for the \(y\)-axis.
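The row selection passed to the data argument may be easier to follow in isolation. The sketch below applies the same two conditions to a tiny data frame called toy. Its rows are made up for illustration and are not real measurements; only the column names follow the ice dataset.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT real measurements
toy = pd.DataFrame({
    "year":   [2000, 2000, 2000, 2000, 2001, 2001],
    "month":  [3, 3, 9, 11, 3, 9],
    "region": ["N", "S", "N", "N", "N", "S"],
    "extent": [15.0, 4.0, 6.0, 10.0, 15.5, 18.0],
})

keep_month = toy.month.isin([3, 9])  # True for March and September rows
keep_region = toy.region == "N"      # True for rows of the northern region

# & combines the two boolean vectors element by element
subset = toy[keep_month & keep_region]
print(subset)
```

The parentheses around ice.region == "N" in the plotting code are needed because & binds more tightly than ==.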

4.2.1.2 Line plot

Here we replicate the previous example using a line plot:

_ = sns.lineplot(x="time", y="extent", style="month",
                 data=ice[ice.month.isin([3,9]) & (ice.region == "N")])
_ = plt.ylim(0, 17)
_ = plt.show()


Note that the code is exactly the same as in the scatterplot example, except that we use sns.lineplot instead of sns.scatterplot. As a result the plot is made of lines, not dots, and the style option controls the line style, not the marker style.

4.2.1.3 Regression line on scatterplot

Seaborn has a handy plot type, sns.regplot, that allows one to add a regression line to the plot. Here we plot the September ice extent and add a trend line (regression line) to the plot. We also change the default colors using the scatter_kws and line_kws arguments:

_ = sns.regplot(x="time", y="extent",
                scatter_kws={"color":"blue", "alpha":0.5, "edgecolor":"black"},
                line_kws={"color":"black"},
                data=ice[ice.month.isin([9]) & (ice.region == "N")])
_ = plt.show()


Unfortunately, regplot does not accept arguments like style for splitting data into two groups.
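One possible workaround, sketched here under stated assumptions, is to call sns.regplot once per group on the same axes. The data frame toy is synthetic and is not the ice extent dataset; the column names, the group labels "A" and "B", and the use of ci=None to skip the confidence band are all choices made for this illustration. The Agg backend is assumed so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data for illustration only, not the ice extent dataset
rng = np.random.default_rng(10)
time = np.tile(np.arange(1980, 2020), 2)
group = np.repeat(["A", "B"], 40)
level = np.where(group == "A", 15.0, 7.0)
value = level - 0.05 * (time - 1980) + rng.normal(scale=0.3, size=80)
toy = pd.DataFrame({"time": time, "value": value, "group": group})

fig, ax = plt.subplots()
# One regplot call per group, all drawn on the same axes
for name, part in toy.groupby("group"):
    sns.regplot(x="time", y="value", data=part, ci=None, label=name, ax=ax)
ax.legend()

n_trend_lines = len(ax.lines)  # one fitted line per group
plt.close(fig)
```

Seaborn also provides sns.lmplot, which accepts a hue argument and may be a more convenient choice when a separate figure is acceptable.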

4.2.1.4 Histograms and density plots

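Seaborn can do both kernel density plots and histograms using sns.distplot. By default, the function shows a histogram overlaid with a kernel density line, but either can be turned off. Both plots can be customized further with additional keywords. Note that distplot has been deprecated since seaborn 0.11 in favor of histplot, kdeplot and displot, so it may emit a warning.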

_ = sns.distplot(ice[ice.month.isin([9]) & (ice.region == "N")].extent,
                 bins=10,
                 kde=False,  # no density
                 hist_kws={"edgecolor":"black"})
_ = plt.show()


_ = sns.distplot(ice[ice.month.isin([9]) & (ice.region == "N")].extent,
                 hist=False)  # no histogram
_ = plt.show()


Note that distplot does not use the data frame centric approach of regplot or lineplot; it takes its input in vector form instead.