Chapter 4 Plotting: matplotlib and seaborn
import numpy as np
np.random.seed(10)
Python has many plotting libraries. Here we discuss two of the simplest, matplotlib and seaborn. Matplotlib is a fairly low-level plotting library, oriented toward vectors rather than datasets (in this sense comparable to base-R plotting). It is nevertheless very widely used, and with some effort it can produce very nice-looking plots. It is also easier to tinker with lower-level features in matplotlib than in the higher-level, data-oriented libraries.
Seaborn is one such high-level, data-oriented plotting library (comparable to ggplot in R in this sense). It has ready-made functionality to pick variables from datasets and to modify the visual properties of lines and points depending on other values in the data.
We assume you have imported the following modules:
import numpy as np
import pandas as pd
4.1 Matplotlib
Matplotlib is designed to be similar to the plotting functionality in the popular matrix language MATLAB. It is a library geared to scientific plotting. In these notes, we are mainly interested in the pyplot module, but matplotlib contains more functionality, e.g. handling of images. Typically we import pyplot as plt:
import matplotlib.pyplot as plt
This page is compiled using matplotlib 3.5.1.
4.1.1 Introductory examples
The module provides basic functions such as scatterplot and line plot. Both of these functions should be called with x and y vectors. Here is a demonstration of a simple scatterplot:
x = np.random.normal(size=50)
y = np.random.normal(size=50)
_ = plt.scatter(x, y)
_ = plt.show()
This small piece of code demonstrates several functions:
- First, we create 50 random dots using numpy.
- plt.scatter creates a scatterplot (point plot). It takes arguments x and y for the horizontal and vertical placement of the dots.
- plt.scatter returns an object, so we may want to assign it to a temporary variable to avoid printing. We use the variable name _ (just underscore) for a value we are not really storing but just want to keep from being printed.
- The scatterplot automatically computes a suitable axis range.
- plt.show makes the plot visible. It may not be necessary, depending on the environment you use. For instance, when you run a notebook cell, it will automatically make the plot visible at the end. However, if you want to make two plots inside one cell, you still need to call plt.show to tell matplotlib that you are done with the first plot and that it is time to show it.
- Finally, we also assign the result of plt.show to a temporary variable, so that nothing is printed in environments that echo return values.
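The points above can be checked directly. The following minimal sketch inspects the object returned by plt.scatter and the axis range that was chosen automatically. The variable names (points, kind, n_points) and the Agg backend line are assumptions added for illustration so that the sketch runs without a display; in a notebook the backend line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
x = np.random.normal(size=50)
y = np.random.normal(size=50)

points = plt.scatter(x, y)             # the returned object describes the drawn dots
kind = type(points).__name__           # class name of the returned object
n_points = len(points.get_offsets())   # one (x, y) pair per dot
xmin, xmax = plt.xlim()                # without arguments, xlim reports the current range
plt.close()                            # discard the figure once we are done
```

The automatically chosen range (xmin, xmax) covers all x values, which is why no explicit axis limits were needed in the example.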
Next, here is another simple example, this time a line plot:
x = np.linspace(-5, 5, 100)
y = np.sin(x)
_ = plt.plot(x, y)
_ = plt.show()
Most of the functionality should be clear by now, but here are a few notes:
- The first lines create a linear sequence of 100 numbers between -5 and 5, and compute the sine of these numbers.
- Line plots are made using plt.plot, which also takes arguments x and y.
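As a small extension of the line plot, the sketch below draws two curves on the same axes by calling plt.plot twice before the figure is finished. The second curve (cosine), the labels, and the Agg backend line are assumptions added for illustration and are not part of the example above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-5, 5, 100)   # 100 evenly spaced values, both endpoints included

sin_lines = plt.plot(x, np.sin(x), label="sin")  # plt.plot returns a list of line objects
cos_lines = plt.plot(x, np.cos(x), label="cos")  # a second call draws on the same axes
legend = plt.legend()                            # legend built from the labels

n_lines = len(plt.gca().lines)  # number of lines on the current axes
plt.close()
```

Each call to plt.plot adds to the current figure, so both curves end up in one plot until plt.show (or plt.close) finishes it.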
4.1.2 Tuning plots
Matplotlib offers a number of arguments and additional functions to improve the look of the plots. Below we demonstrate a few:
x = np.random.normal(size=50)
y = np.random.normal(size=50)
_ = plt.scatter(x, y,
                color="red",  # dot color
                edgecolor="black",
                alpha=0.5)  # transparency
_ = plt.xlabel("x")  # axis labels
_ = plt.ylabel("y")
_ = plt.title("Random dots")  # main label
_ = plt.xlim(-5, 5)  # axis limits
_ = plt.ylim(-5, 5)
_ = plt.show()
Most of the features demonstrated above are obvious from the code and comments. However, some explanations are still needed:
- Argument color denotes the dot color when specified as a color name, like “red” or “black”. There is also another way to specify colors, c, see below.
- Alpha denotes transparency, with alpha=0 being completely transparent (invisible) and alpha=1 being completely opaque (default).
- All the additional functions return an object that we store in a temporary variable in order to avoid printing.
- All the additional functions in plt are executed before the actual plot is drawn on screen. In particular, even though we specify the axis limits after plt.scatter, they still apply to the scatterplot.
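The last point can be verified with a short sketch: the axis limits are read once right after plt.scatter and once more after plt.xlim has been called. The variable names (auto_limits, new_limits) and the Agg backend line are assumptions added for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
x = np.random.normal(size=50)
y = np.random.normal(size=50)

pts = plt.scatter(x, y, color="red", edgecolor="black", alpha=0.5)
auto_limits = plt.xlim()   # range computed automatically from the data
plt.xlim(-5, 5)            # set afterwards, yet it applies to the same scatterplot
new_limits = plt.xlim()    # the limits that will be used when the plot is drawn
plt.close()
```

Nothing is drawn until the figure is shown or saved, so the order of plt.scatter and plt.xlim does not matter here.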
Sometimes we want to make the color of the dots depend on another variable. In this case we can use the argument c instead of color:
x = np.random.normal(size=50)
y = np.random.normal(size=50)
z = np.random.choice([1,2,3], size=50)
_ = plt.scatter(x, y,
                c=z)  # color made of variable "z"
_ = plt.show()
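Now the dots have different colors, depending on the value of z. Note that the values passed to the c argument here must be numbers, which matplotlib maps to colors. Strings that are not valid color names, such as arbitrary category labels, will not work.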
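When the grouping variable holds strings, one possible workaround is to convert the labels to integer codes first. The sketch below does this with pd.factorize; the label values ("low", "mid", "high") and the variable names are hypothetical and only serve as an illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(10)
x = np.random.normal(size=50)
y = np.random.normal(size=50)
labels = np.random.choice(["low", "mid", "high"], size=50)  # hypothetical string categories

# factorize assigns an integer code to each distinct label;
# uniques[codes] recovers the original labels
codes, uniques = pd.factorize(labels)

pts = plt.scatter(x, y, c=codes)  # numeric codes are accepted by the c argument
plt.close()
```

The codes are numbered in order of first appearance, so the mapping between colors and labels should be read from uniques rather than assumed.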
4.1.3 Histograms
Histograms are a quick and easy way to get an overview of 1-D data distributions. These can be plotted using plt.hist. As hist returns bin data, one may want to assign the result to a temporary variable to avoid spurious printing in IPython-based environments (such as notebooks):
x = np.random.normal(size=1000)
_ = plt.hist(x)
_ = plt.show()
Not surprisingly, the histogram of normal random variables looks like, well, a normal curve.
We may tune the picture somewhat by using the argument bins to specify the desired number of bins, and make the bins more distinct by specifying edgecolor:
_ = plt.hist(x, bins=30, edgecolor="w")
_ = plt.show()
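The bin data returned by plt.hist can also be put to use instead of being discarded. The sketch below unpacks the three returned values; the names counts, edges and patches are our own choice for illustration, and the Agg backend line is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
x = np.random.normal(size=1000)

# hist returns the bin counts, the bin edges, and the drawn bars
counts, edges, patches = plt.hist(x, bins=30, edgecolor="w")

n_obs = int(counts.sum())   # every observation falls into exactly one bin
n_edges = len(edges)        # 30 bins need 31 edges
plt.close()
```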
4.2 Seaborn: data oriented plotting
The seaborn library is designed for plotting data, not vectors of numbers. It is built on top of matplotlib and has only limited functionality outside of that library. Hence, in order to achieve the desired results with seaborn, one has to rely on some matplotlib functionality for adjusting the plots. Seaborn is typically imported as sns:
import seaborn as sns
Below we use seaborn 0.11.2.
Here is a usage example with a data frame of three random variables:
df = pd.DataFrame({"x": np.random.normal(size=50),
                   "y": np.random.normal(size=50),
                   "z": np.random.choice([1,2,3], size=50)})
_ = sns.scatterplot(x="x", y="y", hue="z", data=df)
_ = plt.show()
Note the similarities and differences compared to matplotlib:
- The information is fed to seaborn using arguments x, y, and hue (and more) that determine the horizontal and vertical location of the dots, and their color (“hue”).
- Here these arguments are not data vectors, as in the case of matplotlib, but names of data variables. These are looked up in the data frame specified with the argument data.
- Seaborn automatically provides the axis labels and the legend.
- If needed, the plot can be further adjusted with matplotlib functionality. Here we just use plt.show() to display it.
- For some reason, seaborn insists that there should be a legend entry for the z value “0”, even though no such value exists in the data:
df.z.unique()
## array([2, 3, 1])
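Adjusting a seaborn plot with matplotlib works because sns.scatterplot returns the matplotlib axes it drew on. The sketch below keeps that return value instead of discarding it; the title text, the variable name ax, and the Agg backend line are assumptions added for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

np.random.seed(10)
df = pd.DataFrame({"x": np.random.normal(size=50),
                   "y": np.random.normal(size=50),
                   "z": np.random.choice([1, 2, 3], size=50)})

ax = sns.scatterplot(x="x", y="y", hue="z", data=df)  # returns a matplotlib Axes
ax.set_title("Random dots")   # ordinary matplotlib method on the returned axes

xlabel = ax.get_xlabel()      # label that seaborn filled in from the column name
ylabel = ax.get_ylabel()
has_legend = ax.get_legend() is not None
plt.close()
```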
4.2.1 Different plot types
The plotting functions of seaborn are largely comparable to those of matplotlib, but the names may differ. Seaborn also offers additional plot types, such as density plots, and the option to add a regression line to a scatterplot.
4.2.1.1 Scatterplot
The example above already demonstrated a scatterplot. We make another scatterplot here using sea ice extent data, this time demonstrating marker types (style). The dataset looks like this:
ice = pd.read_csv("../data/ice-extent.csv.bz2", sep="\t")
ice.head(3)
## year month data-type region extent area time
## 0 1978 11 Goddard N 11.65 9.04 1978.875000
## 1 1978 11 Goddard S 15.90 11.69 1978.875000
## 2 1978 12 Goddard N 13.67 10.90 1978.958333
We plot the northern sea ice extent (measured in millions of km\(^2\)) for September (the month of the yearly minimum) and March (the yearly maximum) through the years. We put both months on the same plot using different markers:
_ = sns.scatterplot(x="time", y="extent", style="month",
                    data=ice[ice.month.isin([3,9]) & (ice.region == "N")])
_ = plt.ylim(0, 17)
_ = plt.show()
The plot shows two sets of dots: circles for March and crosses for September. Note that seaborn automatically adds default labels for the marker types. We also use matplotlib’s plt.ylim to set the limits for the \(y\)-axis.
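The data argument in the example is an ordinary pandas subset. The sketch below shows how that filter works on a tiny made-up data frame; the values in ice_demo are hypothetical and are not taken from the ice extent dataset.

```python
import pandas as pd

# Hypothetical miniature version of the ice data (values are made up)
ice_demo = pd.DataFrame({
    "year":   [2000, 2000, 2000, 2000, 2000, 2001],
    "month":  [3, 9, 3, 9, 6, 3],
    "region": ["N", "N", "S", "S", "N", "N"],
    "extent": [15.0, 6.0, 4.0, 18.0, 11.0, 15.5],
})

# isin() marks rows whose month is March or September;
# the comparison needs parentheses because & binds tighter than ==
keep = ice_demo.month.isin([3, 9]) & (ice_demo.region == "N")
subset = ice_demo[keep]   # only northern rows for months 3 and 9 remain
```

Only the rows that satisfy both conditions are passed on to the plotting function, so the southern region and the other months never reach the plot.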
4.2.1.2 Line plot
Here we replicate the previous example using a line plot:
_ = sns.lineplot(x="time", y="extent", style="month",
                 data=ice[ice.month.isin([3,9]) & (ice.region == "N")])
_ = plt.ylim(0, 17)
_ = plt.show()
Note that the code is exactly the same as in the scatterplot example, except that we use sns.lineplot instead of sns.scatterplot. As a result, the plot is made of lines, not dots, and the style option controls the line style, not the marker style.
4.2.1.3 Regression line on scatterplot
Seaborn has a handy plot type, sns.regplot, that allows one to add a regression line to the plot. Here we plot the September ice extent and add a trend line (regression line) to the plot. We also change the default colors using the scatter_kws and line_kws arguments:
_ = sns.regplot(x="time", y="extent",
                scatter_kws={"color":"blue", "alpha":0.5, "edgecolor":"black"},
                line_kws={"color":"black"},
                data=ice[ice.month.isin([9]) & (ice.region == "N")])
_ = plt.show()
Unfortunately, regplot does not accept arguments like style for splitting data into two groups.
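One possible alternative for grouped trend lines is sns.lmplot, which accepts a hue argument and fits a separate regression line for each group. The sketch below uses synthetic data in place of the ice dataset; the data frame demo and all its values are made up for illustration, and the Agg backend line is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

np.random.seed(10)
# Synthetic stand-in for the ice data: two months, 40 years each (made-up values)
time = np.tile(np.arange(1980, 2020), 2).astype(float)
month = np.repeat([3, 9], 40)
extent = (np.where(month == 3, 16.0, 8.0)      # different level for each month
          - 0.05 * (time - 1980)               # common downward trend
          + np.random.normal(scale=0.3, size=80))
demo = pd.DataFrame({"time": time, "month": month, "extent": extent})

# hue splits the data into groups; each group gets its own regression line
grid = sns.lmplot(x="time", y="extent", hue="month", data=demo, ci=None)
n_trend_lines = len(grid.axes.flat[0].lines)   # one fitted line per month
plt.close("all")
```

Unlike regplot, lmplot creates its own figure and returns a grid object rather than a single axes, so it cannot be drawn into an existing plot in the same way.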
4.2.1.4 Histograms and density plots
Seaborn can make both kernel density plots and histograms using sns.distplot. By default, the function shows a histogram overlaid with a kernel density line, but either of these can be turned off. Both plots can be customized further with additional keywords.
_ = sns.distplot(ice[ice.month.isin([9]) & (ice.region == "N")].extent,
                 bins=10,
                 kde=False,  # no density
                 hist_kws={"edgecolor":"black"})
_ = plt.show()
_ = sns.distplot(ice[ice.month.isin([9]) & (ice.region == "N")].extent,
                 hist=False)  # no histogram
_ = plt.show()
Note that distplot does not use the data frame centric approach. Unlike regplot or lineplot, it takes its input in vector form instead.