Chapter 13 Overfitting and Validation

This section demonstrates overfitting, the training-validation approach, and cross-validation using Python. While overfitting is a pervasive problem in predictive modeling, the examples here are somewhat artificial. The reason is that linear and logistic regression are not typically used in a way that makes overfitting a major concern; overfitting is more likely with more flexible models and a larger number of features.

Below we add random features to the dataset. This may feel like a strange thing to do, but it approximates working with a large feature set where we know little about the quality and relevance of many of the features.

We assume you have loaded the following packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(3)  # for replicability

We work with the sklearn package, as it is the library typically used for predictive modeling in Python.

13.1 How random data overfits

We demonstrate overfitting and validation using the Boston Housing Dataset (see Section 23.1). This is a dataset of 506 neighborhoods in Boston, MA. It contains the median house price in each neighborhood (variable medv) and 13 other relevant features, such as the percentage of old housing stock (age), the crime rate (crim), and proximity to the Charles River (chas):

boston = pd.read_csv("../data/boston.csv.bz2", sep="\t")
boston.head(3)
##       crim    zn  indus  chas    nox  ...  tax  ptratio   black  lstat  medv
## 0  0.00632  18.0   2.31     0  0.538  ...  296     15.3  396.90   4.98  24.0
## 1  0.02731   0.0   7.07     0  0.469  ...  242     17.8  396.90   9.14  21.6
## 2  0.02729   0.0   7.07     0  0.469  ...  242     17.8  392.83   4.03  34.7
## 
## [3 rows x 14 columns]

The task is to predict the median house value medv using all available features. For simplicity, we limit the analysis to a random sample of 100 cases:

boston = boston.sample(100)

We split the data into the target y and the design matrix X:

y = boston.medv.values
X = boston.drop("medv", axis=1).values
X.shape
## (100, 13)

Note that both X and y are converted to numpy arrays. You can also opt to work with pandas’ objects, but in my opinion numpy arrays have simpler and more intuitive indexing.
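
For instance, a minimal sketch of the pandas-based alternative might look like this (not used below, shown only for illustration; sklearn models generally accept data frames as well):

ydf = boston.medv                   # pandas Series instead of numpy array
Xdf = boston.drop("medv", axis=1)   # pandas DataFrame instead of numpy array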

First let’s fit the model and compute a few performance indicators:

from sklearn.linear_model import LinearRegression

m = LinearRegression()
a = m.fit(X, y)
m.score(X, y)  # R2
## 0.858976794738731
yhat = m.predict(X)
np.sqrt(np.mean((yhat - y)**2))  # RMSE
## 3.4461957491420616
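
The same RMSE can also be computed with sklearn's built-in metric (a sketch using sklearn.metrics.mean_squared_error, which returns the MSE, so we take the square root ourselves):

from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y, yhat))  # identical to the manual RMSE above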

Next, we add random columns to the data. As we keep only 100 cases, we add 86 columns of random numbers, so that the resulting design matrix contains 99 columns:

X99 = np.hstack((X, np.random.uniform(size=(X.shape[0], 86))))
X99.shape
## (100, 99)

Now we also have an augmented design matrix X99 where the first 13 columns are real data, and the remaining 86 columns are just random noise.
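
To see how these noise columns overfit in-sample, we can fit the same linear regression on X99 (a sketch; the exact numbers depend on the random draw, but with 99 features for 100 cases the in-sample fit is essentially perfect):

m99 = LinearRegression()
_ = m99.fit(X99, y)
m99.score(X99, y)  # in-sample R2, essentially 1: the noise "explains" the data perfectly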

13.2 Cross-validation

The simplest way to perform cross-validation in Python is to use the function cross_val_score from the module sklearn.model_selection. The most relevant arguments are the following:

  • estimator: the model, such as LinearRegression(). It does not have to be fitted, cross_val_score fits the model internally.
  • X: design matrix X
  • y: target y
  • scoring: the scoring function. If left out, the model's default score is used (R2 for linear regression, accuracy for logistic regression). Otherwise, one can specify the scoring function with a string, such as “f1” for F-score, “recall” for recall, and “neg_root_mean_squared_error” for the negative of RMSE. See the sklearn documentation for more. However, the R2 scoring can misbehave (see below), so it is often better to use something else.
  • cv: the number of cross-validation folds.

Let’s start by simply fitting a linear regression and computing R2 using the score method:

from sklearn.linear_model import LinearRegression

m = LinearRegression()
_t = m.fit(X, y)
m.score(X, y)
## 0.858976794738731

Next, here is an example of how to use cross_val_score with the original 13-feature Boston housing data:

from sklearn.model_selection import cross_val_score

cv = cross_val_score(m, X, y, cv=5)
cv
## array([0.86524896, 0.79070171, 0.69521005, 0.61697953, 0.87665679])

Note that we are using the same model m as we defined above. We have already fitted it, but that does not matter: cross_val_score fits it again internally, in this example five times.

cv is now an array of 5 separate R2 values: one for each fold, that is, for each different training-validation split. Normally one reports the average score over all folds instead:

cv.mean()
## 0.7689594079137093

One might expect better results when validating over more folds, as each model is then trained on more data; but this comes with added computational cost:

cross_val_score(m, X, y, cv=50).mean()
## -138.40009402271704

Unfortunately, R2 misbehaves here: the more folds we request, the smaller each validation fold becomes. With cv=50 and 100 cases, each validation fold contains only two observations, and R2 computed over just two data points is essentially meaningless and can be hugely negative.
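
We can check the fold sizes directly (a minimal sketch; for a regression target, an integer cv argument corresponds to KFold splitting):

from sklearn.model_selection import KFold

# with 50 folds over 100 cases, every validation fold contains just 2 observations
[len(test) for _, test in KFold(n_splits=50).split(X)][:5]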

But as long as we rely on the same data, we can use different scoring functions. For instance, we can use RMSE instead of R2: sklearn implements the negative of RMSE as “neg_root_mean_squared_error”, which can be selected with the scoring argument:

cross_val_score(m, X, y, cv=5, scoring="neg_root_mean_squared_error").mean()
## -4.136440137189319
cross_val_score(m, X, y, cv=50, scoring="neg_root_mean_squared_error").mean()
## -3.6840867875366428
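
Finally, we can run the same cross-validation on the noise-augmented matrix X99 (a sketch; the exact value depends on the random columns, but the out-of-sample error should be clearly worse than for the original 13 features, in contrast to the near-perfect in-sample fit above):

cross_val_score(m, X99, y, cv=5, scoring="neg_root_mean_squared_error").mean()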