Chapter 15 Regularization and Feature Selection

15.1 How highly correlated features fail

Explore the case where highly correlated features fail. We create a small sample of random data where \(y\) is closely related to (true) \(x\). However, imagine we do not observe the true \(\boldsymbol{x}\). Instead we can measure three proxies, \(x_1\), \(x_2\) and \(x_3\) that are all quite similar to the true \(x\). We create the data as follows: data in the form \[\begin{align*} \boldsymbol{x} & \sim N(0,1) \\ y & = \boldsymbol{x} + \boldsymbol{u} & u &\sim N(0,0.3)\\[2ex] \boldsymbol{x}_{1} &= \boldsymbol{x} + \boldsymbol{\epsilon}_1 & \boldsymbol{\epsilon}_1 & \sim_{i.i.d} N(0, 0.1)\\ \boldsymbol{x}_{2} &= \boldsymbol{x} + \boldsymbol{\epsilon}_1 & \boldsymbol{\epsilon}_2 & \sim_{i.i.d} N(0, 0.1)\\ \boldsymbol{x}_{3} &= \boldsymbol{x} + \boldsymbol{\epsilon}_1 & \boldsymbol{\epsilon}_3 & \sim_{i.i.d} N(0, 0.1)\\ \end{align*}\] and estimate linear regression model \[\begin{equation*} y_{i} = \beta_{0} + \beta_{1} x_{1i} + \beta_{2} x_{2i} + \beta_{3} x_{3i} + e_{i} \end{equation*}\] One may think that this will not cause too many problems as we have not just one but three fairly precise variables (note that \(\sigma_\epsilon = 0.1\), i.e. the observed variables do not have much noise.

Next, we create a small dataset using exactly the same DGP:

n = 25
x = np.random.normal(size=n)
y = x + np.random.normal(scale=0.3, size=n)
x1 = x + np.random.normal(scale=0.1, size=n)
x2 = x + np.random.normal(scale=0.1, size=n)
x3 = x + np.random.normal(scale=0.1, size=n)
X = np.column_stack((x1, x2, x3))

The code creates normally distributed true \(\boldsymbol{x}\), \(\boldsymbol{y}\) that is a linear function of \(\boldsymbol{x}\) plus some random noise, and thereafter the vectors \(\boldsymbol{x}_1\), \(\boldsymbol{x}_2\) and \(\boldsymbol{x}_3\), the obervable data. We also create the design matrix \(\mathsf{X}\) that we use below.

Let’s fit linear regression model with all the observable data:

from sklearn.linear_model import LinearRegression

m = LinearRegression()
_ = m.fit(X, y)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

m.score(X, y)  # training R2

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

The training \(R^2\) looks pretty good. However, when we compute cross-validated \(R^2\):

from sklearn.model_selection import cross_val_score

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

cv = cross_val_score(m, X, y, cv=10)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

cv.mean()  # CV R2

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

The cross-validation \(R^2\) is not impressive at all.

The problem is that we introduce highly correlated features and so the model severely overfits.

15.2 Forward selection

There are several solutions to this problem. A popular algorithm is forward selection where one first picks the best 1-feature model, thereafter tries adding all remaining features one-by-one to build the best two-feature model, and thereafter the best three-feature model, and so on, until the model performance starts to deteriorate.

Let us play through this algorithm with the example data we created:

X1 = np.column_stack((x1,))

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

_ = m.fit(X1, y)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

m.score(X1, y)  # R2 on training data

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

X2 = np.column_stack((x2,))

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

m.fit(X2, y).score(X2, y)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

X3 = np.column_stack((x3,))

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

m.fit(X3, y).score(X3, y)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

We use column_stack to create matrices out of individual vectors (remember: the fit method requires X to be a matrix, not vector!) and thereafter fit the model and output it’s \(R^2\) on the training data. Remember: the score method computes \(R^2\) in case of linear regression. Also be aware that models of similar number of features can be compared with just training \(R^2\), cross-validation is not needed. Out of these three features \(\boldsymbol{x}_2\) gives the best \(R^2\).

Next, we’ll add the second feature. As \(\boldsymbol{x}_1\) is now taken, we only have to test \(\boldsymbol{x}_1\) and \(\boldsymbol{x}_3\) and see if any of these improves our model:

X21 = np.column_stack((x2,x1))

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

m.fit(X21, y).score(X21, y)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

X23 = np.column_stack((x2,x3))

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

m.fit(X23, y).score(X23, y)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

The feature combination \(\boldsymbol{x}_2\) and \(\boldsymbol{x}_3\) gave us a slightly better result.

But is the best two-feature model better than the best one-feature model? As these models contain different number of features, we need to compute cross-validated \(R^2\) instead. So we test our best single-feature model and the best two-feature model using cross-validation:

cross_val_score(m, X2, y, cv=10).mean()

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

cross_val_score(m, X23, y, cv=10).mean()

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Not surprisingly, both models overfit, but the two-feature model is noticeably better than the single-feature model.

Finally, we can also evaluate the 3-feature model. As we only have 3 features, there is only a single model:

X123 = np.column_stack((x1, x2, x3))

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

cross_val_score(m, X123, y, cv=10).mean()

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Hence the three-feature model turned out worse than the two-feature model. We conclude that based on forward-selection, the best model is \[\begin{equation*} y_{i} = \beta_{0} + \beta_{2} x_{2i} + \beta_{3} x_{3i} + e_{i}. \end{equation*}\]

15.3 Ridge and Lasso regression

Ridge and Lasso are methods that are related to forward selection. These methods penalize large \(\beta\) values and hence suppress or eliminate correlated variables. These do not need looping over different combinations of variables like forward selection, however, one normally has to loop over the penalty parameter alpha to find the optimal value.

Both methods are living in sklearn.linear_regression. We demonstrate this with ridge regression below:

from sklearn.linear_model import Ridge, Lasso

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

m = Ridge(alpha = 1)  # penalty value 1

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

cross_val_score(m, X, y, cv=10).mean()

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

m = Ridge(alpha = 0.3)  # penalty value 0.1

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

cross_val_score(m, X, y, cv=10).mean()

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

As evident from the above, penalty values 1 and 0.3 give virtually equal results. The cross-validated \(R^2\) is also comparable to what forward-selection suggested.

Lasso works in an analogous fashion as ridge, but as it’s penalty is not globally differentiable (it is \(L_1\) penalty), then you may run into more problems with convergence.

Both methods have more options, in particular they can normalize the features before fitting.