Chapter 13 Overfitting and Validation

This section demonstrates overfitting, the training-validation approach, and cross-validation using Python. While overfitting is a pervasive problem in predictive modeling, the examples here are somewhat artificial: linear and logistic regression are not typically used in a way where overfitting becomes a major concern. Overfitting is more likely to occur with more flexible models and with a larger number of features.

Below we add random features to the dataset. This may feel like a weird thing to do, but it approximates working with a large feature set where we have little knowledge about the quality and relevance of many of the features.

We assume you have loaded the following packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We load additional modules below as we introduce them.

13.1 Boston Housing Data

We demonstrate overfitting and validation using the Boston Housing Dataset. This is a dataset of 506 neighborhoods in Boston, MA. It contains the average house price in each neighborhood (variable medv) and 13 other relevant features, such as the percentage of old housing stock (age), the crime rate (crim), and proximity to the Charles River (chas):

boston = pd.read_csv("../data/boston.csv.bz2", sep="\t")
boston.head(3)
##       crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio   black  lstat  medv
## 0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
## 1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
## 2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7

Our task is to predict the average house value medv as well as we can using all other features. For simplicity, we limit the analysis to a random sample of 100 cases:

boston = boston.sample(100)
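
The sample is drawn at random, so the exact numbers below will differ from run to run. If you want reproducible results, you can fix the seed via the random_state argument of sample, for example:

boston = boston.sample(100, random_state=1)  # fix the seed so the same 100 cases are drawn every run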

We split the data into the target y and the design matrix X:

y = boston.medv.values
X = boston.drop("medv", axis=1).values
X.shape
## (100, 13)

Next, we add random columns to the data. As we only keep 100 cases, we add 86 columns of random numbers, so that the resulting design matrix contains 99 columns, almost as many columns as we have cases:

X99 = np.hstack((X, np.random.uniform(size=(X.shape[0], 86))))
X99.shape
## (100, 99)

Now we also have an augmented design matrix X99 where the first 13 columns are real data and the remaining 86 columns are just random noise.

13.2 Cross-validation

The simplest way to perform cross-validation in Python is to use the function cross_val_score from the module sklearn.model_selection. Its most relevant arguments are the following (a compact call sketch follows the list):

  • estimator: the model, such as LinearRegression(). It does not have to be fitted; cross_val_score fits the model internally.
  • X: design matrix X
  • y: target y
  • scoring: scoring function. If omitted, it uses the standard model score (\(R^2\) for linear regression, accuracy for logistic regression). Otherwise one can use strings to select scoring functions, such as “f1” for \(F\)-score, “recall” for recall, and “neg_root_mean_squared_error” for the negative of RMSE. See the sklearn documentation for more. However, the \(R^2\) score misbehaves with many small folds (see below), so it is often better to use something else.
  • cv: the number of folds to use.
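
For instance, a single call pulling these arguments together might look like the following sketch (the individual pieces are demonstrated step by step below; the exact numbers depend on the random sample drawn above):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of a linear regression, scored by negative RMSE
cross_val_score(LinearRegression(), X, y,
                scoring="neg_root_mean_squared_error", cv=5)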

Let’s start by simply fitting a linear regression and computing \(R^2\) with the score method:

from sklearn.linear_model import LinearRegression
m = LinearRegression()
_ = m.fit(X, y)
m.score(X, y)
## 0.7456673303610316
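
This is the in-sample \(R^2\) on the original 13 features. For comparison, we can compute the in-sample score on the augmented matrix X99 as well (a quick sketch using a separate model object m99 to keep m untouched; the exact value depends on the random columns generated above, but with 99 columns plus an intercept the model can essentially interpolate all 100 cases):

m99 = LinearRegression()
_ = m99.fit(X99, y)
m99.score(X99, y)   # in-sample R^2, expected to be very close to 1: a textbook case of overfitting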

Now an example of how to use cross_val_score with the original 13-feature Boston housing data:

from sklearn.model_selection import cross_val_score

cv = cross_val_score(m, X, y, cv=5)
cv
## array([ 0.75748442,  0.48907166, -0.05559189,  0.64465188,  0.58328873])

Note that we are using the same model m that we defined above. We have already fitted it, but that does not matter: cross_val_score will fit it again, in this example five times.

cv is now an array of 5 separate \(R^2\) values, one for each fold, i.e. for each different training-validation split of the data. Normally one reports the average score over all the folds instead:

cv.mean()
## 0.48378095924231157
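
Under the hood, cross_val_score is doing roughly the following (a minimal sketch using KFold; sklearn's actual implementation differs in details such as cloning the estimator and optional parallelization):

from sklearn.model_selection import KFold

scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    m.fit(X[train_idx], y[train_idx])                # fit on the training folds
    scores.append(m.score(X[val_idx], y[val_idx]))   # score on the held-out fold
np.mean(scores)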

In principle, more folds means more training data within each fold and hence potentially better results, but it comes with added computational cost:

cross_val_score(m, X, y, cv=50).mean()
## -30.559410306619665

Unfortunately, the \(R^2\) score misbehaves here: the more folds we request, the smaller it gets. The likely reason is that with 100 cases and 50 folds, each validation fold contains only two observations, and \(R^2\) computed on two points (relative to the mean of those same two points) is extremely unstable and often strongly negative.

But as long as we rely on the same data, we can use different scoring functions. For instance, we can use RMSE instead of \(R^2\): sklearn implements the negative of RMSE under the name “neg_root_mean_squared_error”. This is requested via the scoring argument:

cross_val_score(m, X, y, cv=5, scoring="neg_root_mean_squared_error").mean()
## -5.543081927545676
cross_val_score(m, X, y, cv=50, scoring="neg_root_mean_squared_error").mean()
## -4.11972450802044
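
Finally, we can use cross-validation to expose the overfitting caused by the random columns added above (a sketch; the exact numbers depend on the random sample and the random columns drawn earlier):

# Cross-validated negative RMSE on the augmented matrix: in-sample the noise
# columns inflate the fit, but out-of-sample they hurt, so this is expected to
# be noticeably more negative (i.e. a larger RMSE) than the value for X above.
cross_val_score(m, X99, y, cv=5, scoring="neg_root_mean_squared_error").mean()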