Chapter 13 Overfitting and Validation
This section demonstrates overfitting, the training-validation approach, and cross-validation in Python. While overfitting is a pervasive problem in predictive modeling, the examples here are somewhat artificial: linear and logistic regression are not typically used in a way that makes overfitting a major concern. Overfitting is more likely with more flexible models and with a larger number of features.
Below we add random features to the dataset. This may feel like a weird thing to do, but it approximates working with a large feature set where we know little about the quality and relevance of many of the features.
We assume you have loaded the following packages:
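(The following is a minimal set consistent with the examples below; the exact list is an assumption.)
import numpy as np   # numerical arrays
import pandas as pd  # data frames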
We work with the sklearn package, as it is the library typically used for predictive modeling in Python.
13.1 How random data overfits
We demonstrate overfitting and validation using the Boston Housing Dataset (see Section 23.1). This is a dataset of 506 neighborhoods in Boston, MA. It contains the median house price in each neighborhood (variable medv) and 13 other relevant features, such as the percentage of old housing stock (age), the crime rate (crim), and proximity to the Charles River (chas):
## crim zn indus chas nox ... tax ptratio black lstat medv
## 0 0.00632 18.0 2.31 0 0.538 ... 296 15.3 396.90 4.98 24.0
## 1 0.02731 0.0 7.07 0 0.469 ... 242 17.8 396.90 9.14 21.6
## 2 0.02729 0.0 7.07 0 0.469 ... 242 17.8 392.83 4.03 34.7
##
## [3 rows x 14 columns]
The task is to predict the median house value medv using all available features. For simplicity, we limit the analysis to a random sample of 100 cases:
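The sampling step is a sketch here; it assumes the data has been loaded into a pandas DataFrame named boston, and the random seed is a placeholder.
boston = boston.sample(100, random_state=1)  # keep a random sample of 100 neighborhoods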
We split the data into the target y and the design matrix X:
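A sketch of the split, continuing with the assumed DataFrame name boston; .values converts both pieces to numpy arrays:
y = boston.medv.values                   # target: median house value
X = boston.drop("medv", axis=1).values   # design matrix: all other features
X.shape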
## (100, 13)
Note that both X and y are converted to numpy arrays. You can also opt to work with pandas objects, but in my opinion numpy arrays have simpler and more intuitive indexing.
First let’s fit the model and compute a few performance indicators:
from sklearn.linear_model import LinearRegression
m = LinearRegression()
a = m.fit(X, y)
m.score(X, y) # R2
## 0.858976794738731
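The second figure below is apparently the in-sample RMSE; a minimal sketch of how it can be computed (assuming numpy is imported as np):
yhat = m.predict(X)              # in-sample predictions
np.sqrt(np.mean((y - yhat)**2))  # root-mean-squared error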
## 3.4461957491420616
Next, we add random columns to the data. As we stay with 100 cases only, we add 86 columns of random numbers, so that the resulting design matrix contains 99 columns:
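A sketch of how the noise columns can be added, assuming numpy is imported as np; it stacks 86 columns of standard normal noise next to the 13 real features:
noise = np.random.normal(size=(X.shape[0], 86))  # 86 columns of pure noise
X99 = np.column_stack((X, noise))                # real features first, noise after
X99.shape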
## (100, 99)
Now we also have an augmented design matrix X99 where the first 13 columns are real data, and the other columns are just random noise.
13.2 Cross-validation
The simplest way to perform cross-validation in Python is to use the function cross_val_score from the module sklearn.model_selection.
The most relevant arguments are the following:
- estimator: the model, such as LinearRegression(). It does not have to be fitted; cross_val_score fits the model internally.
- X: design matrix X
- y: target y
- scoring: the scoring function. If left out, it uses the standard model score (R2 for linear regression, accuracy for logistic regression). Otherwise one can choose a scoring function by name, e.g. “f1” for F-score, “recall” for recall, and “neg_root_mean_squared_error” for the negative of RMSE; see the sklearn documentation for more. However, the R2 scoring apparently misbehaves (see below), so it is better to use something else.
- cv: the number of folds to use.
Let’s start by just performing a linear regression and computing R2 using the score method:
from sklearn.linear_model import LinearRegression
m = LinearRegression()
_t = m.fit(X, y)
m.score(X, y)
## 0.858976794738731
First, an example of how to use it with the original 13-feature Boston housing data:
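A sketch of the call that produces the five scores below (the variable name cv is the one used in the text that follows):
from sklearn.model_selection import cross_val_score
cv = cross_val_score(m, X, y, cv=5)  # 5-fold CV, default scoring (R2)
cv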
## array([0.86524896, 0.79070171, 0.69521005, 0.61697953, 0.87665679])
Note that we are using the same model m as we defined above. We already fitted it, but that does not matter: cross_val_score will fit it again, in this example five times.
cv is now an array of 5 separate values for R2: one value for each fold, i.e. each different training-validation split. Normally one reports the average score over all the folds instead:
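A sketch of the averaging step:
cv.mean()  # average R2 over the 5 folds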
## 0.7689594079137093
We can get better results if we validate over more folds, but this comes with an added computational cost:
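A sketch of such a call; the exact number of folds used to produce the value below is not shown, so the fold count here is only a placeholder:
cross_val_score(m, X, y, cv=50).mean()  # hypothetical fold count; more folds means smaller validation sets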
## -138.40009402271704
Unfortunately, the R2 score misbehaves here: the more folds we request, the smaller the reported R2. This is because the validation folds become very small, and R2 computed on a handful of cases is extremely unstable, so it is better to use a different scoring function.
But as long as we rely on the same data, we can use different scoring functions. For instance, we can use RMSE instead of R2: sklearn implements the negative of RMSE as “neg_root_mean_squared_error”, which is selected through the scoring argument:
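A sketch of how the scoring argument is passed; the two values below presumably come from two separate calls (not reproduced here), so this shows the form of the call rather than the exact settings:
cross_val_score(m, X, y, cv=5,
                scoring="neg_root_mean_squared_error").mean()  # average negative RMSE over the folds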
## -4.136440137189319
## -3.6840867875366428