Chapter 16 Trees and Forests

We assume you have loaded the following packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Below, we load more packages as we introduce them.

16.1 Trees and forests in sklearn

The sklearn library provides a number of decision-tree-related methods. The central ones we discuss here are DecisionTreeRegressor and DecisionTreeClassifier. The former is designed for regression tasks (continuous target) and the latter for classification tasks (categorical target). The usage of both is similar to that of the other supervised ML models in sklearn, such as linear regression and logistic regression (see Sections 10.2.2 and 11.2.2).

As in the case of other sklearn models, we have to proceed in the following steps:

  1. Create the design matrix \(\mathsf{X}\) and the target vector \(\boldsymbol{y}\).
  2. Set up the model; we call it m. Unlike in the case of linear regression, for decision trees you may want to set various parameters (see below).
  3. Fit the model using the .fit(X, y)-method.
  4. Use the fitted model to compute accuracy or \(R^2\) (the .score() method), or for predictions.

Naturally, you may want to combine these steps with other relevant tools, such as a training-validation split; a minimal sketch of the workflow is shown below.
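
For instance, here is a minimal sketch of these steps for a regression tree, assuming the design matrix X and the target y already exist (the variable names and the max_depth value are just placeholders):

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

## 1. split the existing X and y into training and validation parts
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.25)
## 2. set up the model, possibly with parameters
m = DecisionTreeRegressor(max_depth=3)
## 3. fit it on training data
_ = m.fit(Xt, yt)
## 4. compute R^2 on validation data (or predict)
m.score(Xv, yv)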

DecisionTreeRegressor and DecisionTreeClassifier allow for a large number of options. The most relevant ones are

  • max_depth: maximum depth of the tree to build
  • min_samples_split: minimum number of samples a node must contain to be considered for splitting

See more in the sklearn documentation.
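
For instance, one might constrain a classification tree as follows (the particular values are arbitrary, for illustration only):

from sklearn.tree import DecisionTreeClassifier
## at most 3 levels of splits; only split nodes with at least 20 samples
m = DecisionTreeClassifier(max_depth=3, min_samples_split=20)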

Below, we demonstrate the basic usage and visualization through different examples, first using regression trees and thereafter classification trees.

16.2 Regression Trees

Below is an example of how to use decision trees for regression tasks. We focus on a 1-D case, as this is easy to visualize: we use the Boston housing data to predict house prices based on the number of rooms.

First, we load the data and create the design matrix and the outcome vector:

boston = pd.read_csv("../data/boston.csv.bz2", sep="\t")
X = boston[["rm"]]
y = boston.medv

In order to use regression trees, we need to import the class, set up the model, and fit it. Below, we specify a maximum depth of 2, again for easier visualization:

from sklearn.tree import DecisionTreeRegressor
m = DecisionTreeRegressor(max_depth=2)
_ = m.fit(X, y)
m.score(X, y)  # on training data, may not be reliable
## 0.5828967342544766
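
As the \(R^2\) computed on training data may be overly optimistic, it may be worth checking the out-of-sample performance as well, for instance with cross-validation (a small sketch; the exact numbers will differ):

from sklearn.model_selection import cross_val_score
## 5-fold cross-validation: average R^2 on the held-out folds
cross_val_score(DecisionTreeRegressor(max_depth=2), X, y, cv=5).mean()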

Figure: Boston housing data, including the relationship as predicted by the depth-2 tree.

We can plot it using the following code:

xx = np.linspace(X.rm.min(), X.rm.max(),
                 300).reshape((-1, 1))  # reshape to a 2-D matrix
yhat = m.predict(xx)
_ = plt.scatter(X.rm, y,
                color = "black",
                alpha=0.5)
_ = plt.plot(xx, yhat,
             color = "deepskyblue")
_ = plt.show()

We begin by creating an equally-spaced sequence of 300 points that spans the same range as the variable rm. We need to reshape it into a matrix, otherwise sklearn cannot use it for predictions. Thereafter we predict the target values at all these points. As you can see, the predicted values form a piecewise-constant curve; this is how the rectangular regions look in one dimension.
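
If you want the numeric split points instead of reading them off the plot, they can be extracted from the fitted tree object (a small sketch using the tree_ attribute of the fitted model; internal nodes have a feature index of 0 or larger, leaves are marked with -2):

## print the variable and the threshold for every internal node
for feat, thr in zip(m.tree_.feature, m.tree_.threshold):
    if feat >= 0:  # skip leaves
        print("split on", X.columns[feat], "at", thr)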

Figure: The tree above, as visualized by plot_tree().

The simplest way to visualize the tree is to use the plot_tree function from sklearn.tree:

from sklearn.tree import plot_tree
_ = plot_tree(m,
              feature_names = X.columns)

This is simple and often good enough, but the result may be hard to fit into an html document, as one can see here.
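
One workaround is to draw the tree on a larger canvas and, if needed, save it as a separate image file (a sketch; the figure size and file name are arbitrary):

_ = plt.figure(figsize=(12, 6))
_ = plot_tree(m, feature_names=X.columns, filled=True)
plt.savefig("boston-tree.png", dpi=150)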

Figure: The tree above, as visualized by the graphviz package.

More complex visualizations can be done with the graphviz package:

import graphviz
from sklearn.tree import export_graphviz

dot = export_graphviz(m, out_file=None,
                      feature_names=X.columns,
                      filled=True)
graph = graphviz.Source(dot, format="svg")
_ = graph.render("boston-tree-graphviz",
                 directory=".fig")

Here one starts by creating a dot string that describes the structure of the tree. Thereafter it is turned into a graph object with graphviz.Source, and finally rendered, here as an svg file (the figure shown above). The graphviz package has many more options; consult its documentation for details.
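
If you work in a Jupyter notebook, the rendering step can often be skipped: evaluating the graph object at the end of a cell typically displays the tree inline.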

Exercise 16.1 Use Boston Housing data.

  • Estimate the median price (medv) as a function of average number of rooms (rm) and age (age) using regression trees. Use max_depth=2 as above.
  • Compute \(R^2\) on training data. Did it improve over what we found above?
  • Visualize the corresponding regression tree using plot_tree or graphviz. How does the tree differ from the only-rm tree above?
  • Now include all the features in your model (all columns, except medv) and repeat the above: compute \(R^2\) and visualize the tree. Does adding more features change the tree in any significant way?

See The Solution

16.3 Classification Trees

This section demonstrates the usage of classification trees using Yin-Yang data (see Section 22.1).

16.3.1 Doing classification trees in sklearn

First load the dataset. For a refresher, it contains three variables, x, y, and c; the first two are coordinates on the \(x\)-\(y\) plane, and the last one (0 or 1) is the color. Load it and have a look:

yy = pd.read_csv("../data/yin-yang.csv.bz2", sep="\t")
yy.head(3)


Decision trees in sklearn work in a similar fashion as other sklearn classification models. In particular, you need to import the method (DecisionTreeClassifier), create the model, fit it, and use it for any kind of predictive analysis. For fitting, we need the design matrix X and the outcome vector y. sklearn can typically handle both data frames and matrices, but the decision boundary code below assumes we use matrices, so we convert both to numpy arrays:

from sklearn.tree import DecisionTreeClassifier

y = yy.c.values  # .values converts to a numpy vector
X = yy[["x", "y"]].values  # convert to a numpy matrix

m = DecisionTreeClassifier()
_ = m.fit(X, y)
m.score(X, y)  # accuracy on training data
## 1.0

Here we followed the standard sklearn approach: fit the model and compute accuracy using .score(). By default, decision trees are grown until they fit the training data (almost) perfectly, hence they overfit and we get accuracy 1 on training data.
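
To see that this perfect score does not carry over to new data, we can hold out a validation set and compare training and validation accuracy (a sketch; the exact numbers will differ from run to run, and we use a separate model mv so that the full-data model m above remains intact):

from sklearn.model_selection import train_test_split
## hold out a validation set to see how the accuracy drops on unseen data
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.25)
mv = DecisionTreeClassifier()
_ = mv.fit(Xt, yt)
mv.score(Xt, yt), mv.score(Xv, yv)  # training vs validation accuracy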

16.3.2 Decision boundary

One of the most instructive ways to understand such models is to plot their decision boundary. This is only feasible for 2-D models (the features form a 2-D space): in the 1-D case the boundary is just a set of points, in 3-D it is complicated to display, and with more features it is impossible.

Here is a function that makes such a plot. It has four arguments: a fitted model m, the design matrix X, the outcome vector y, and the number of grid points. This version assumes that X and y are numpy arrays, not pandas objects. The actual values are used (a) for finding the extent of the data to be plotted; and (b) to be added to the plot, for visual guidance.

def DBPlot(m, X, y, nGrid = 100):
    ## find the extent of the features
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    ## make a dense grid (nGrid x nGrid) in this extent
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, nGrid),
                           np.linspace(x2_min, x2_max, nGrid))
    XX = np.column_stack((xx1.ravel(), xx2.ravel()))
    ## predict on the grid
    hatyy = m.predict(XX).reshape(xx1.shape)
    plt.figure(figsize=(8,8))
    ## show the predictions as a semi-transparent image
    _ = plt.imshow(hatyy, extent=(x1_min, x1_max, x2_min, x2_max),
                   aspect="auto",
                   interpolation='none', origin='lower',
                   alpha=0.3)
    ## add the actual data points on the image
    plt.scatter(X[:,0], X[:,1], c=y, s=30, edgecolors='k')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.show()

Now we can plot the decision boundary of the fitted tree model:

DBPlot(m, X, y)

This shows which areas are predicted as purple, and which ones as gold. The figure shows pretty clearly that the current tree splits the feature space into overly detailed rectangles: it carves out single small rectangles for individual purple and gold dots. This is how it achieves the perfect accuracy, but as we only evaluate on training data, it is obviously overfitting.


It is probably more useful to use trees of smaller depth:

m = DecisionTreeClassifier(max_depth = 2)
_ = m.fit(X, y)
DBPlot(m, X, y)

This figure shows how a tree of depth-2 partitions the feature space into four rectangles.

16.4 Bagging
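
Bagging (bootstrap aggregating) fits a number of models, by default decision trees, each on a different bootstrap sample of the data, and combines their predictions. In sklearn this is implemented by BaggingClassifier, used in the same way as the individual trees above: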

from sklearn.ensemble import BaggingClassifier

m = BaggingClassifier()
_ = m.fit(X, y)
m.score(X, y)
## 0.545

Predict on grid:

DBPlot(m, X, y)


16.5 Random Forests
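
Random forests extend bagging: besides growing each tree on a bootstrap sample, they also consider only a random subset of features at each split, which makes the individual trees less correlated. In sklearn this is RandomForestClassifier: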

from sklearn.ensemble import RandomForestClassifier

m = RandomForestClassifier()
_ = m.fit(X, y)
m.score(X, y)
## 1.0

Predict on grid:

DBPlot(m, X, y)


16.6 Boosting
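
Boosting builds the ensemble sequentially, with each new tree paying more attention to the cases the previous trees predicted poorly. Note that the example below applies AdaBoostRegressor to the 0/1 labels, so .score() reports \(R^2\) rather than accuracy, and DBPlot displays the continuous predicted values; for a pure classification approach one may use AdaBoostClassifier instead.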

from sklearn.ensemble import AdaBoostRegressor

m = AdaBoostRegressor()
_ = m.fit(X, y)
m.score(X, y)
## 0.863419490657551

Predict on grid:

DBPlot(m, X, y)
