Chapter 16 Trees and Forests
We assume you have loaded the following packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
We load further packages below as we introduce them.
16.1 Trees and forests in sklearn
The sklearn library provides a number of decision-tree-related methods. The central ones we discuss here are DecisionTreeRegressor and DecisionTreeClassifier. The former is designed for regression tasks (continuous target) and the latter for classification tasks (categorical target). The usage of both is similar to that of the other supervised ML models in sklearn, such as linear regression and logistic regression (see Sections 10.2.2 and 11.2.2).
As in the case of other sklearn models, we proceed in the following steps:
- Create the design matrix \(\mathsf{X}\) and the target vector \(\boldsymbol{y}\).
- Set up the model; we call it m. For decision trees you may want to set various parameters (see below), unlike in the case of linear regression.
- Fit the model using the .fit(X, y) method.
- Use the fitted model to compute accuracy or \(R^2\) (the .score() method), or for predictions.
Obviously, you may want to use other relevant tools, such as a training-validation split, when doing these tasks, as the sketch below shows.
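Here is a minimal sketch of this workflow with a training-validation split. The data frame df and its columns x1, x2, and y are made up purely for illustration:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# a small synthetic data frame, purely illustrative
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=200),
                   "x2": rng.normal(size=200)})
df["y"] = (df.x1 + df.x2 > 0).astype(int)

X = df[["x1", "x2"]]  # design matrix
y = df.y              # target vector
# hold out 25% of the data for validation
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.25)
m = DecisionTreeClassifier(max_depth=3)
_ = m.fit(Xt, yt)
m.score(Xv, yv)  # accuracy on validation data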
DecisionTreeRegressor and DecisionTreeClassifier allow for a large number of options. The most relevant are
- max_depth: the maximum depth of the tree to build
- min_samples_split: the minimum node size to be considered for splitting
See more in the sklearn documentation.
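For instance, one might set both options when creating the model. A quick sketch; the particular values here are arbitrary:
from sklearn.tree import DecisionTreeClassifier

# allow at most 3 levels of splits, and only split
# nodes that contain at least 10 data points
m = DecisionTreeClassifier(max_depth=3, min_samples_split=10)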
Below, we demonstrate the basic usage and visualization through different examples, first using regression trees and thereafter classification trees.
16.2 Regression Trees
Below is an example of how to use decision trees for regression tasks. We focus on a 1-D case, as this is easy to visualize. We use the Boston housing data to predict house prices based on the number of rooms.
First, we load data and create the matrices:
boston = pd.read_csv("../data/boston.csv.bz2", sep="\t")
X = boston[["rm"]]
y = boston.medv
In order to use regression trees, we need to import the class, set up the model, and fit it. Below, we specify a maximum depth of 2, again for easier visualization:
from sklearn.tree import DecisionTreeRegressor
m = DecisionTreeRegressor(max_depth=2)
_ = m.fit(X, y)

# on training data, may not be reliable
m.score(X, y)
We can plot it using the following code:
xx = np.linspace(X.rm.min(), X.rm.max(),
                 300).reshape((-1, 1))  # reshape to 2-D
yhat = m.predict(xx)
_ = plt.scatter(X.rm, y,
                color="black",
                alpha=0.5)
_ = plt.plot(xx, yhat,
             color="deepskyblue")
_ = plt.show()
We begin by creating an equally-spaced sequence of 300 points that spans the same range as the variable rm. We need to reshape it into a matrix, otherwise sklearn cannot use it for predictions. Thereafter we predict values at all these points. As you can see, the predicted values form a piecewise-constant curve; this is how the rectangular regions look in one dimension.
A simpler way to visualize the tree is the plot_tree function from sklearn.tree:
from sklearn.tree import plot_tree
_ = plot_tree(m,
              feature_names=X.columns)
This is simple and often good enough, but it may be hard to fit into an HTML document, as one can see here.
More complex visualizations can be done with the graphviz package:
import graphviz
from sklearn.tree import export_graphviz

dot = export_graphviz(m, out_file=None,
                      feature_names=X.columns,
                      filled=True)
graph = graphviz.Source(dot, format="svg")
_ = graph.render("boston-tree-graphviz",
                 directory=".fig")
Here one has to start by creating a dot object that describes the structure of the tree. Afterward it is loaded as a graph via graphviz.Source, and finally rendered; here it is rendered as an svg file, displayed in the browser at right. The graphviz package has many more options; consult the documentation for details.
Exercise 16.1 Use the Boston housing data.
- Estimate the median price (medv) as a function of the average number of rooms (rm) and age (age) using regression trees. Use max_depth=2 as above.
- Compute \(R^2\) on training data. Did it improve over what we found above?
- Visualize the corresponding regression tree using plot_tree or graphviz. How does the tree differ from the only-rm tree above?
- Now include all the features in your model (all columns except medv) and repeat the above: compute \(R^2\) and visualize the tree. Does adding more features change the tree in any significant way?
See The Solution
16.3 Classification Trees
This section demonstrates the usage of classification trees using Yin-Yang data (see Section 22.1).
16.3.1 Doing classification trees in sklearn
First, load the dataset. As a refresher: it contains three variables, x, y, and c; the first two are coordinates on the \(x\)-\(y\) plane, and the last one (0 or 1) is the color. Have a look:
yy = pd.read_csv("../data/yin-yang.csv.bz2", sep="\t")
yy.head(3)
Decision trees in sklearn work in a similar fashion to other sklearn classification models. In particular, you need to import the class (DecisionTreeClassifier), create the model, fit it, and use it for any kind of predictive analysis. For fitting, we need the design matrix X and the outcome vector y. sklearn can typically handle both data frames and matrices, but the decision boundary code below assumes we use matrices. So we ensure they are matrices:
from sklearn.tree import DecisionTreeClassifier
y = yy.c.values            # .values converts to numpy vector
X = yy[["x", "y"]].values  # convert to numpy matrix

m = DecisionTreeClassifier()
_ = m.fit(X, y)
# accuracy on training data
m.score(X, y)
Here we followed the standard sklearn approach: fit the model, and compute accuracy using .score(). By default, trees are grown until they fit the training data perfectly, hence they overfit and we get accuracy 1 on training data.
16.3.2 Decision boundary
One of the most instructive ways to understand a model is to plot its decision boundary. This is only feasible for 2-D models (the features form a 2-D space): in the 1-D case the boundary is just a set of points, in the 3-D case it is complicated, and with more features it is impossible.
Here is a function that makes such a plot. It takes four arguments: a fitted model, the design matrix X, the outcome vector y, and the number of grid points. This version assumes that X and y are numpy arrays, not pandas objects. The actual values are used (a) to find the extent of the data to be plotted, and (b) to be added to the plot for visual guidance.
def DBPlot(m, X, y, nGrid=100):
    ## find the extent of the features
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    ## make a dense grid (nGrid x nGrid) in this extent
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, nGrid),
                           np.linspace(x2_min, x2_max, nGrid))
    XX = np.column_stack((xx1.ravel(), xx2.ravel()))
    ## predict on the grid
    hatyy = m.predict(XX).reshape(xx1.shape)
    plt.figure(figsize=(8, 8))
    ## show the predictions as a semi-transparent image
    _ = plt.imshow(hatyy, extent=(x1_min, x1_max, x2_min, x2_max),
                   aspect="auto",
                   interpolation='none', origin='lower',
                   alpha=0.3)
    ## add the actual data points on the image
    plt.scatter(X[:, 0], X[:, 1], c=y, s=30, edgecolors='k')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.show()
Now we can plot the decision boundary of the fitted tree model:
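DBPlot(m, X, y)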
This shows which areas are predicted as purple and which ones as gold. The figure shows pretty clearly that the current tree splits the feature space into overly detailed rectangles: it carves out individual rectangles for single purple and gold dots. This is how it achieves perfect accuracy, but as we only evaluate on training data, it is obviously overfitting.
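One way to confirm the overfitting is to hold out part of the data and compare training and validation accuracy. A minimal sketch; the split fraction and random_state are arbitrary:
from sklearn.model_selection import train_test_split

# hold out 25% of the data for validation
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.25, random_state=1)
m = DecisionTreeClassifier().fit(Xt, yt)
print(m.score(Xt, yt))  # training accuracy: typically 1.0
print(m.score(Xv, yv))  # validation accuracy: noticeably lower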
It is probably more useful to use trees of smaller depth:
m = DecisionTreeClassifier(max_depth=2)
_ = m.fit(X, y)
DBPlot(m, X, y)
This figure shows how a tree of depth 2 partitions the feature space into four rectangles.
16.4 Bagging
Bagging (bootstrap aggregation) fits many trees, each on a separate bootstrap sample of the data, and aggregates their predictions. sklearn implements this in BaggingClassifier, which by default bags decision trees:
from sklearn.ensemble import BaggingClassifier

m = BaggingClassifier()
_ = m.fit(X, y)

# accuracy on training data
m.score(X, y)
Predict on the grid and plot the decision boundary:
DBPlot(m, X, y)
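The base learner and the ensemble size can also be set explicitly. A sketch with illustrative values only; note that in older sklearn versions the estimator keyword is called base_estimator:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# bag 100 depth-4 trees instead of the defaults
m = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=4),
                      n_estimators=100)
_ = m.fit(X, y)
One can inspect the result with DBPlot(m, X, y) as above.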
16.5 Random Forests
Random forests also grow many trees on bootstrap samples, but in addition each split only considers a random subset of the features, which de-correlates the individual trees:
from sklearn.ensemble import RandomForestClassifier

m = RandomForestClassifier()
_ = m.fit(X, y)

# accuracy on training data
m.score(X, y)
Predict on the grid and plot the decision boundary:
DBPlot(m, X, y)
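The most important tuning parameters are the number of trees and the number of features considered at each split. A sketch with illustrative values:
from sklearn.ensemble import RandomForestClassifier

# 500 trees; each split considers only sqrt(number of features) candidates
m = RandomForestClassifier(n_estimators=500, max_features="sqrt")
_ = m.fit(X, y)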
16.6 Boosting
Boosting builds trees sequentially, with each new tree focusing on the cases that the previous trees got wrong. As we have a classification task here, we use AdaBoostClassifier:
from sklearn.ensemble import AdaBoostClassifier

m = AdaBoostClassifier()
_ = m.fit(X, y)

# accuracy on training data
m.score(X, y)
Predict on the grid and plot the decision boundary:
DBPlot(m, X, y)
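The main tuning knobs of boosting are the number of boosting rounds and the learning rate. A sketch with illustrative values:
from sklearn.ensemble import AdaBoostClassifier

# 200 boosting rounds with a smaller learning rate
m = AdaBoostClassifier(n_estimators=200, learning_rate=0.5)
_ = m.fit(X, y)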