Chapter 10 Linear Regression
We assume you have loaded the following packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Below we load more packages as we introduce them.
10.1 Solving Linear Regression Tasks Manually
10.1.1 The task
In case of simple regression, the task is to find parameters \(\beta_0\) and \(\beta_1\) such that the mean squared error (MSE) is minimized, where MSE is defined as \[\begin{equation} MSE = \frac{1}{n}\sum_i (y_i - \hat y(x_i))^2 \end{equation}\] where the predicted value is \[\begin{equation} \hat y(x_i) = \beta_0 + \beta_1\cdot x_i. \end{equation}\] Here \(n\) is the number of observations, \(x\) is our exogenous variable, \(y\) is the outcome variable, and \(i\) indexes the observations. Normally we want to use software to perform this optimization (in fact, this problem can even be solved analytically), but it is instructive to attempt to solve it by hand.
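For reference, this minimization has a well-known closed-form solution (stated here without derivation):
\[\begin{equation}
\hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2},
\qquad
\hat\beta_0 = \bar y - \hat\beta_1 \bar x,
\end{equation}\]
where \(\bar x\) and \(\bar y\) denote the sample means.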
Let us experiment with the iris data and estimate the relationship between petal width and length of versicolor flowers. This is one of the most popular statistics and machine learning datasets; the version we use here originates from R datasets. You can download it from the Bitbucket repo of these notes. The dataset itself contains three species, and as their petals may have a different relationship, we keep only one of them, versicolor. Also, as the variable names in this file are not well suited for the modeling below, we rename the variable Petal.Length to plength, and Petal.Width to pwidth.
First we load the data, and thereafter filter with rename chained at the end of the filtering operation:
iris = pd.read_csv("../data/iris.csv.bz2", sep="\t")
# select only versicolor species,
# make variable names easier to handle
versicolor = iris[iris.Species == "versicolor"].rename(
    columns={"Petal.Length": "plength", "Petal.Width": "pwidth"})
# check if data looks reasonable
versicolor.head(3)
## Sepal.Length Sepal.Width plength pwidth Species
## 50 7.0 3.2 4.7 1.4 versicolor
## 51 6.4 3.2 4.5 1.5 versicolor
## 52 6.9 3.1 4.9 1.5 versicolor
="k")
plt.scatter(versicolor.plength, versicolor.pwidth, edgecolor"Petal length (cm)")
plt.xlabel("Petal width (cm)") plt.ylabel(
The plot reveals that lengths and widths are positively correlated. This is to be expected: longer petals are also wider.
10.1.2 Solve it manually
Denote petal length by \(l\) and petal width by \(w\). Solving the problem manually involves the following steps:
- Pick suitable values for \(\beta_0\) and \(\beta_1\).
- Compute the predicted petal width, \(\hat w_i = \beta_0 + \beta_1\cdot l_i\).
- Compute the deviations \(e_i = w_i - \hat w_i\).
- Compute MSE as \(MSE = \frac{1}{n} \sum_i e_i^2\).
Let us first write standalone code to perform these steps, and thereafter put all this into a function.
- For a start, we can pick \(\beta_0 = 0\) and \(\beta_1 = 1\). This would amount to square (or circular) petals:
beta0, beta1 = 0, 1
- Now we compute the predicted values. Remember, the petal length is called plength and the petal width is called pwidth in the versicolor dataframe:
hat_w = beta0 + beta1*versicolor.plength  # note: vectorized operators!
hat_w.head(3)  # this is a series!
## 50 4.7
## 51 4.5
## 52 4.9
## Name: plength, dtype: float64
The predicted widths are far too large: the true values are roughly 1.5, not 4.5. Apparently the choice of \(\beta_0\) and \(\beta_1\) was not a good one.
- Compute deviations \(e_i\):
e = versicolor.pwidth - hat_w
e.head(3)
## 50 -3.3
## 51 -3.0
## 52 -3.4
## dtype: float64
- Finally, compute MSE. Note that we can use the built-in np.mean to achieve this:
mse = np.mean(e**2)
mse
## 8.719800000000003
We get \(MSE \approx 8.72\). This number in isolation does not tell us much, but one can pick a different set of parameters \(\beta_0\) and \(\beta_1\) and compare the resulting MSE values. This allows us to tell which choice is better.
10.1.3 Create a function to make it more compact
Now we can wrap these steps into a function to make the code more compact. The function might take two arguments, \(\beta_0\) and \(\beta_1\), so we can easily play with different values:
def mse(beta0, beta1):
    hat_w = beta0 + beta1*versicolor.plength  # note: vectorized operators!
    e = versicolor.pwidth - hat_w
    mse = np.mean(e**2)
    return mse
Now we can quickly try a few different parameter values to find out which combination works better:
mse(0, 1)  # this is what we did
## 8.719800000000003
mse(0, 0.5)  # this is better
## 0.6671999999999998
mse(-0.5, 0.5)  # even better!
## 0.11320000000000001
So out of these three attempts, the best parameter values are \(\beta_0 = -0.5\) and \(\beta_1 = 0.5\).
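Instead of guessing, one can also sweep a grid of candidate values and keep the pair with the smallest MSE; a rough sketch (the grid limits below are arbitrary choices, not taken from the text):
best = (None, None, np.inf)
for b0 in np.linspace(-2, 2, 41):
    for b1 in np.linspace(0, 1, 41):
        value = mse(b0, b1)
        if value < best[2]:
            best = (b0, b1, value)
best  # best beta0, beta1, and the corresponding MSE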
10.1.4 Visualize the Regression Line
So far we just computed \(MSE\), but we can do even better: we can also visualize the regression line. Matplotlib does not have a ready-made function to plot a regression line, but it is easy to do by hand. We just create a number of petal length values in a suitable range, predict the corresponding widths, and plot the results.
beta0, beta1 = -0.5, 0.5
# start by plotting the data
plt.scatter(versicolor.plength, versicolor.pwidth, edgecolor="k")
# create 11 evenly spaced lengths between the smallest and largest
# value in the data
length = np.linspace(versicolor.plength.min(), versicolor.plength.max(), 11)
# predict width
hatw = beta0 + beta1*length
plt.plot(length, hatw, c='r')
plt.xlabel("Petal length")
plt.ylabel("Petal width")
Despite the impressive \(MSE\) value, the figure shows that our line does not actually align well with the data. We might continue trying different values to find an even better combination.
Exercise 10.1 Write a function that takes \(\beta_0\) and \(\beta_1\) as arguments and a) computes \(MSE\) and b) plots the regression line. Experiment with the parameter values and find a combination that makes the line align well with the data points.
See the solution
10.2 Linear Regression in python: statsmodels.formula.api and sklearn
We discuss two popular libraries for doing linear regression in python. The first one, statsmodels.formula.api, is useful if we want to interpret the model coefficients, explore \(t\)-values, and assess the overall model goodness. It is based on R-style formulas, and it provides well-designed summary tables. The other, sklearn, is more suitable for predictive modeling. Instead of a formula it expects the design matrix, and it does not provide nicely formatted summary tables. However, it can easily be swapped with other predictive models in the sklearn module.
10.2.1 Statsmodels and its formula API
We import the statsmodels formula API functionality as
import statsmodels.formula.api as smf
It provides a number of statistical models, including ols, ordinary least squares, i.e. linear regression. The basic syntax of ols is
smf.ols(formula, data)
The formula is an R-style formula, such as "y ~ x1 + x2", and data is a data frame that contains the variables. For instance, the versicolor petal regression can be done as
= smf.ols("pwidth ~ plength", data=versicolor) m
This line just sets up the linear regression model but does not actually evaluate it; in order to get an answer, one has to add .fit():
m = smf.ols("pwidth ~ plength", data=versicolor).fit()
This results in an evaluated model object m that provides several useful methods, in particular .params for parameter estimates and .summary() for a summary table:
m.params
## Intercept -0.084288
## plength 0.331054
## dtype: float64
m.summary()
Dep. Variable: | pwidth | R-squared: | 0.619 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.611 |
Method: | Least Squares | F-statistic: | 77.93 |
Date: | Tue, 17 Sep 2024 | Prob (F-statistic): | 1.27e-11 |
Time: | 19:52:49 | Log-Likelihood: | 34.709 |
No. Observations: | 50 | AIC: | -65.42 |
Df Residuals: | 48 | BIC: | -61.59 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.0843 | 0.161 | -0.525 | 0.602 | -0.407 | 0.239 |
plength | 0.3311 | 0.038 | 8.828 | 0.000 | 0.256 | 0.406 |
Omnibus: | 0.204 | Durbin-Watson: | 2.412 |
---|---|---|---|
Prob(Omnibus): | 0.903 | Jarque-Bera (JB): | 0.002 |
Skew: | -0.012 | Prob(JB): | 0.999 |
Kurtosis: | 3.017 | Cond. No. | 41.6 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
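Beyond the full table, the fitted model object also exposes the individual pieces programmatically; for instance (these are standard attributes of the statsmodels results object):
m.pvalues     # p-values of the coefficients
m.bse         # standard errors
m.conf_int()  # confidence intervals
m.rsquared    # R-squared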
10.2.1.1 R-style formulas
The ability to enter formulas directly into the model is one of the strong sides of statsmodels.formula. It makes it easy to specify the dependent and independent variables directly, given we have a data frame; there is no need to create an explicit design matrix (as in sklearn). These formulas originate from R, where they are used in a similar fashion; that is why they are called “R-style” formulas.
10.2.1.1.1 Specifying dependent and independent variables
The central part of the model formula is the dependent - tilde - independent specification. We can write "y ~ x" and this will be understood as the linear regression model
\[\begin{equation}
y_i = \beta_0 + x_i \cdot \beta_1 + \epsilon_i.
\end{equation}\]
In particular, what is before the tilde ~ is the dependent variable (here \(y\)), and what is after the tilde are the independent variables (here just \(x\)). Unless explicitly requested not to, the model also includes the intercept \(\beta_0\).
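If the intercept is not wanted, R-style formulas let us drop it by appending - 1 (or + 0) to the formula; a minimal example using the versicolor regression from above:
m = smf.ols("pwidth ~ plength - 1", data=versicolor).fit()  # regression through the origin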
In case we have more independent variables, those can be added with a plus sign, for instance "y ~ x1 + x2 + x3". This corresponds to the regression model
\[\begin{equation}
y_i = \beta_0 + x_{i1} \cdot \beta_1 + x_{i2} \cdot \beta_2 + x_{i3} \cdot \beta_3 + \epsilon_i.
\end{equation}\]
Hence the plus sign in the formula does not mean “add” but “include into the model”. So the formula should be read as “use \(y\) as the dependent variable, and include \(x_1\), \(x_2\) and \(x_3\) as explanatory variables”.
For instance, if we want to estimate the median house value (medv) as a function of number of rooms (rm), percentage of lower-status population (lstat) and closeness to Charles river (chas), we can run the regression model like:
= smf.ols("medv ~ rm + lstat + chas", data=boston).fit() m
10.2.1.1.2 Categorical variables and interaction effects
Besides just specifying the variables, the formulas allow many more options. We use titanic data for the examples below.
First, note that if a variable is categorical (i.e. string) by design, it will be automatically converted to appropriate dummies. For instance, sex is a categorical variable (with categories female and male), and hence it will be automatically converted into dummies. So if we want to estimate a model \[\begin{equation} \text{survived}_i = \beta_0 + \text{male}_i \cdot \beta_1 + \epsilon_i, \end{equation}\] we can just insert the string variable sex into the model:
= pd.read_csv("../data/titanic.csv.bz2")
titanic = smf.ols("survived ~ sex", data=titanic).fit()
m m.summary()
Dep. Variable: | survived | R-squared: | 0.280 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.279 |
Method: | Least Squares | F-statistic: | 507.1 |
Date: | Tue, 17 Sep 2024 | Prob (F-statistic): | 3.78e-95 |
Time: | 19:52:50 | Log-Likelihood: | -697.97 |
No. Observations: | 1309 | AIC: | 1400. |
Df Residuals: | 1307 | BIC: | 1410. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 0.7275 | 0.019 | 38.049 | 0.000 | 0.690 | 0.765 |
sex[T.male] | -0.5365 | 0.024 | -22.518 | 0.000 | -0.583 | -0.490 |
Omnibus: | 37.339 | Durbin-Watson: | 1.611 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 39.798 |
Skew: | 0.419 | Prob(JB): | 2.28e-09 |
Kurtosis: | 2.834 | Cond. No. | 3.11 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The line sex[T.male] shows the effect of the male dummy, with female being the reference group. This is because the categories are ordered alphabetically, and the first one (female) is taken as the reference category.
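If a different reference group is preferred, patsy's Treatment coding lets us set it explicitly; a minimal sketch using male as the reference:
# use 'male' as the reference category instead of the alphabetically first one
m = smf.ols("survived ~ C(sex, Treatment(reference='male'))", data=titanic).fit()
m.params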
But sometimes the categories are not strings but have numeric labels. For instance, male may be labeled as 1 and female as 2. Or, as is the case for titanic, we have numeric labels for passenger classes. Let us analyze survival by passenger class using a model like \[\begin{equation} \text{survived}_i = \beta_0 + \text{class2}_i \cdot \beta_1 + \text{class3}_i \cdot \beta_2 + \epsilon_i. \end{equation}\] If we just specify the model as
"survived ~ pclass", data=titanic) smf.ols(
then pclass will be treated as a numeric variable, and the estimate should be interpreted as the effect per “one unit larger pclass”. This clearly does not make any sense. Instead, we have to tell statsmodels that pclass is a categorical variable by wrapping it into the C() function:
= smf.ols("survived ~ C(pclass)", data=titanic).fit()
m m.summary()
Dep. Variable: | survived | R-squared: | 0.098 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.096 |
Method: | Least Squares | F-statistic: | 70.69 |
Date: | Tue, 17 Sep 2024 | Prob (F-statistic): | 7.10e-30 |
Time: | 19:52:50 | Log-Likelihood: | -845.26 |
No. Observations: | 1309 | AIC: | 1697. |
Df Residuals: | 1306 | BIC: | 1712. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 0.6192 | 0.026 | 24.084 | 0.000 | 0.569 | 0.670 |
C(pclass)[T.2] | -0.1896 | 0.038 | -5.011 | 0.000 | -0.264 | -0.115 |
C(pclass)[T.3] | -0.3639 | 0.031 | -11.732 | 0.000 | -0.425 | -0.303 |
Omnibus: | 14135.138 | Durbin-Watson: | 1.846 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 142.320 |
Skew: | 0.446 | Prob(JB): | 1.25e-31 |
Kurtosis: | 1.653 | Cond. No. | 4.51 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now pclass is automatically converted into dummies, with 1 being the reference category. The lines C(pclass)[T.2] and C(pclass)[T.3] describe the effect of 2nd and 3rd class travel respectively (and reveal a sad story of lower survival chances for lower-class passengers).
We may also want to introduce gender and class interaction effects to test whether the survival rate depends on class, and whether this dependence differs for men and women. This can be achieved with interaction effects of the form \(\text{male} \times \text{class}\); the full model would look like
\[\begin{align*}
\text{survived}_i &= \beta_0 +
\text{class2}_i \cdot \beta_1 + \text{class3}_i \cdot \beta_2 \\
&\quad + \text{male}_i \cdot \beta_3 \\
&\quad + \text{class2}_i \cdot \text{male}_i \cdot \beta_4 +
\text{class3}_i \cdot \text{male}_i \cdot \beta_5 +
\epsilon_i.
\end{align*}\]
Interaction effects can be entered with the asterisk *. For instance, x1 * x2 will include three variables: \(x_1\), \(x_2\), and \(x_1 \times x_2\). If any of the variables is categorical (either a string or converted to categorical using C()), the interaction effects will be inserted for all categories. Hence the Titanic class and sex example can be done as:
= smf.ols("survived ~ C(pclass)*sex", data=titanic).fit()
m m.summary()
Dep. Variable: | survived | R-squared: | 0.370 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.368 |
Method: | Least Squares | F-statistic: | 153.2 |
Date: | Tue, 17 Sep 2024 | Prob (F-statistic): | 4.02e-128 |
Time: | 19:52:50 | Log-Likelihood: | -609.86 |
No. Observations: | 1309 | AIC: | 1232. |
Df Residuals: | 1303 | BIC: | 1263. |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 0.9653 | 0.032 | 29.974 | 0.000 | 0.902 | 1.028 |
C(pclass)[T.2] | -0.0785 | 0.049 | -1.587 | 0.113 | -0.176 | 0.019 |
C(pclass)[T.3] | -0.4745 | 0.042 | -11.414 | 0.000 | -0.556 | -0.393 |
sex[T.male] | -0.6245 | 0.043 | -14.436 | 0.000 | -0.709 | -0.540 |
C(pclass)[T.2]:sex[T.male] | -0.1161 | 0.064 | -1.801 | 0.072 | -0.243 | 0.010 |
C(pclass)[T.3]:sex[T.male] | 0.2859 | 0.054 | 5.340 | 0.000 | 0.181 | 0.391 |
Omnibus: | 113.251 | Durbin-Watson: | 1.746 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 142.526 |
Skew: | 0.805 | Prob(JB): | 1.12e-31 |
Kurtosis: | 3.136 | Cond. No. | 13.3 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
As visible from the table, all the variables from the model above are included, and all the pclass categories are taken into account. The lines containing a colon, C(pclass)[T.2]:sex[T.male] and C(pclass)[T.3]:sex[T.male], are the interaction effects for male and the corresponding passenger class.
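To see what these coefficients imply, we can ask the fitted model to predict the survival rate for a few representative passenger profiles; the .predict method of a formula-based model accepts a data frame with the original variable names:
profiles = pd.DataFrame({"pclass": [1, 1, 3, 3],
                         "sex": ["female", "male", "female", "male"]})
m.predict(profiles)  # predicted survival rate for each profile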
10.2.1.1.3 Transforming variables
One can perform certain computations directly in the model formula. In this example we use a random subset of diamonds data from R’s ggplot2 package. The dataset contains information on diamonds’ price (variable price), mass (carat), and other characteristics we do not discuss here. A few sample lines of the dataset are
diamonds.head()
## carat cut color clarity depth table price x y z
## 0 1.40 Ideal G VS2 62.1 55.0 10427 7.12 7.05 4.40
## 1 0.52 Good D SI2 63.7 55.0 1293 5.11 5.10 3.25
## 2 0.70 Very Good F VS2 60.9 61.0 2812 5.66 5.71 3.46
## 3 1.74 Premium I SI2 62.2 58.0 7858 7.65 7.61 4.75
## 4 0.30 Ideal E VVS2 62.3 54.1 779 4.28 4.34 2.69
We model the price as a function of mass. First we use a linear-linear regression model \[\begin{equation} \text{price}_i = \beta_0 + \text{carat}_i \cdot \beta_1 + \epsilon_i: \end{equation}\]
= smf.ols("price ~ carat", data=diamonds).fit()
m m.summary()
Dep. Variable: | price | R-squared: | 0.838 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.838 |
Method: | Least Squares | F-statistic: | 5152. |
Date: | Tue, 17 Sep 2024 | Prob (F-statistic): | 0.00 |
Time: | 19:52:52 | Log-Likelihood: | -8765.7 |
No. Observations: | 1000 | AIC: | 1.754e+04 |
Df Residuals: | 998 | BIC: | 1.755e+04 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -2129.9155 | 96.843 | -21.993 | 0.000 | -2319.955 | -1939.876 |
carat | 7543.6403 | 105.102 | 71.774 | 0.000 | 7337.394 | 7749.887 |
Omnibus: | 380.954 | Durbin-Watson: | 1.895 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 4043.711 |
Skew: | 1.435 | Prob(JB): | 0.00 |
Kurtosis: | 12.424 | Cond. No. | 3.69 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The results reveal that the price is closely related to the carat weight (coefficient value 7543.6 with a huge \(t\)-value). The model is also fairly good in terms of \(R^2 = 0.838\). But the plot below reveals that the linear-linear model is not the best fit for the data:
The figure suggests that the true price-weight curve is convex rather than linear: note that the prices of low-weight diamonds (ct < 0.5) are consistently underpredicted, while those in the 0.5-1 ct range are overpredicted.
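The plot itself is not reproduced here, but it can be re-created along the lines of the manual regression-line plot in Section 10.1.4; a sketch, assuming the diamonds data frame from above:
plt.scatter(diamonds.carat, diamonds.price, edgecolor="k", alpha=0.3)
carat = np.linspace(diamonds.carat.min(), diamonds.carat.max(), 11)
plt.plot(carat, m.params["Intercept"] + m.params["carat"]*carat, c="r")
plt.xlabel("Carat")
plt.ylabel("Price (USD)")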
We may try a log-log transformation instead. We can use functions directly to transform variables in the formula; in this case we use np.log for the log transform and write np.log(price) instead of price:
\[\begin{equation}
\log\text{price}_i = \beta_0 + \log\text{carat}_i \cdot \beta_1 + \epsilon_i:
\end{equation}\]
= smf.ols("np.log(price) ~ np.log(carat)", data=diamonds).fit()
m m.summary()
Dep. Variable: | np.log(price) | R-squared: | 0.928 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.928 |
Method: | Least Squares | F-statistic: | 1.284e+04 |
Date: | Tue, 17 Sep 2024 | Prob (F-statistic): | 0.00 |
Time: | 19:52:53 | Log-Likelihood: | -115.55 |
No. Observations: | 1000 | AIC: | 235.1 |
Df Residuals: | 998 | BIC: | 244.9 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 8.4414 | 0.010 | 810.896 | 0.000 | 8.421 | 8.462 |
np.log(carat) | 1.6729 | 0.015 | 113.332 | 0.000 | 1.644 | 1.702 |
Omnibus: | 16.775 | Durbin-Watson: | 1.940 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 26.702 |
Skew: | 0.119 | Prob(JB): | 1.59e-06 |
Kurtosis: | 3.764 | Cond. No. | 2.09 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This model looks better as suggested by both \(R^2 = 0.928\) and the \(t\)-value. And here is the corresponding plot:
Now the trend line looks perfectly straight and the linear model fits very well.
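As before, the figure is not shown here; a sketch of how one might produce it, plotting the fitted log-log line over the log-transformed data:
plt.scatter(np.log(diamonds.carat), np.log(diamonds.price), edgecolor="k", alpha=0.3)
logc = np.linspace(np.log(diamonds.carat).min(), np.log(diamonds.carat).max(), 11)
plt.plot(logc, m.params["Intercept"] + m.params["np.log(carat)"]*logc, c="r")
plt.xlabel("log carat")
plt.ylabel("log price")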
10.2.2 Scikit-learn and LinearRegression
Another very popular python library for linear regression is scikit-learn (usually abbreviated as sklearn). While statsmodels is mostly geared toward inference and provides nice summary tables with coefficients, \(t\) statistics and \(R^2\), sklearn is designed for predictive modeling. It does not provide similar output tables, and it does not have a formula API. Instead, one has to create the design matrix manually and use it when fitting the model. This may be somewhat complex if some feature engineering, such as creating dummies, is needed. As an advantage, though, sklearn contains a large number of other predictive models with a similar API, so swapping one model for another is very easy. It also contains other tools, such as cross-validation, that are commonly used for predictive modeling. However, some of the methods are implemented slightly differently than the textbook cases in statsmodels; in particular, the sklearn versions often have regularization switched on by default. So the results may turn out to be slightly different from what one might expect based on other implementations.
Below, we demonstrate how to use sklearn's LinearRegression; Section 11.2.2 below explains how to perform logistic regression.
10.2.2.1 Numeric variables only
Let us start with the example of estimating the petal width of versicolor flowers (see Section 10.1.1). Linear regression can be performed using the LinearRegression class:
# we have to import the model..
from sklearn.linear_model import LinearRegression
# .. and create the model
m = LinearRegression()
The import call makes the LinearRegression class available. The second line of code constructs the linear regression object that we can fit afterwards. It is similar to the statsmodels call m = smf.ols("y ~ x", data=df), i.e. setting up the model without fitting it. LinearRegression takes various parameters, but here we are usually happy with the defaults. In particular, we are happy with the default behavior of adding the intercept to the model automatically, so we do not have to add a constant column to \(\mathsf{X}\).
Next, we have to construct the design matrix. Here it is easy, as it contains just a single numeric variable. But it must be a matrix (or data frame), not a vector. The following code will fail:
# create design matrix
X = versicolor.plength
# problem: X is a series, not matrix!
m = LinearRegression()
_ = m.fit(X, versicolor.pwidth)
## ValueError: Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.
Let us now ensure that our design matrix is a matrix and not a vector. We can achieve this by adding an additional pair of brackets around plength (see Section 3.3.1 for more explanations):
# create design matrix
X = versicolor[["plength"]]
# now X is a data frame
m = LinearRegression().fit(X, versicolor.pwidth)
Note the similarities and differences between statsmodels and sklearn: in both cases you first set up the model (either with ols() or LinearRegression()) and thereafter fit it with the .fit() method. However, in the case of statsmodels you supply the data to the ols() function, while in the case of sklearn you supply it to the .fit() method. Also, sklearn requires you to supply the design matrix X and the target y separately; statsmodels creates these objects itself from the formula.
Now m is a fitted model. We can query the parameters as
m.intercept_  # intercept
## -0.08428835489833664
m.coef_  # the estimated parameters, here just one
## array([0.3310536])
You can see that these numbers are exactly the same as in Section 10.2.1 above.
One can query a quick model goodness summary using the .score method. In the case of linear regression (and other regression models) it returns \(R^2\):
m.score(X, versicolor.pwidth)
## 0.6188466815001425
Again, we get exactly the same number as when using statsmodels above. Note that the .score method expects two arguments: the design matrix X and the target y, in this order. (It is the same order as for the .fit method.) Internally it predicts the values for every row of X, computes the errors between the predictions and y, and transforms this into \(R^2\). By supplying X and y to the .score method we can also calculate \(R^2\) on different data, e.g. on validation data instead of training data (see Section 13 for more information).
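To see this equivalence, we can reproduce the number by hand using .predict (a quick check, not something one normally needs to do):
yhat = m.predict(X)           # predicted widths for every row of X
e = versicolor.pwidth - yhat  # prediction errors
1 - np.sum(e**2)/np.sum((versicolor.pwidth - versicolor.pwidth.mean())**2)  # = R-squared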
But sklearn does not have a function to print a summary table similar to what .summary() does in statsmodels. We cannot easily see standard errors, \(p\)-values or statistical significance. sklearn is designed for predictive modeling, where such tables are of little value. In complex cases, for instance in image processing, we may want to include the values of millions of pixels in the model, and an individual specification of each pixel is clearly infeasible; we prefer to use the design matrix approach instead. In a similar fashion, we are not interested in interpreting the values of millions of coefficients. sklearn is designed for exactly such tasks.
TBD: exercise with a single variable (Hubble)
Exercise 10.2 Take the iris data. Extract data for virginica (where Species = virginica) and run a regression: \[\begin{equation} \text{sepal width}_i = \beta_0 + \beta_1 \cdot \text{sepal length}_i + \beta_2 \cdot \text{petal width}_i + \beta_3 \cdot \text{petal length}_i + \epsilon_i. \end{equation}\] Compute \(R^2\).
See the solution
10.2.2.2 Categorical variables
As sklearn does not include a formula interface, one cannot rely on it to convert categoricals to dummies. Instead, one has to make the dummies manually. A good choice is the pd.get_dummies function. As an example, let's use the males dataset and estimate a regression model
\[\begin{equation}
\log w_i = \beta_0 + \beta_1 \cdot \text{school}_i +
\beta_2 \cdot \text{residence}_i +
\epsilon_i.
\end{equation}\]
Here \(\log w\) is log hourly wage (in the 1980s, called wage in the dataset), school is years of schooling, and residence is the residence region (“rural_area”, “north_east”, “nothern_central” and “south”).
As the first step, we have to create the design matrix X by including all the relevant variables and converting categoricals to dummies:
= pd.read_csv("../data/males.csv.bz2", sep="\t")\
males "wage", "school", "residence"]]
[[# a look at the original data
3) males.sample(
## wage school residence
## 3184 2.238261 14 NaN
## 306 1.680934 10 nothern_central
## 1186 1.680934 12 south
The simplest approach is just to convert all the explanatory variables (here school and residence) with pd.get_dummies and let it sort out what is numeric and what is not. We also ask for one category to be dropped as the reference category:
X = pd.get_dummies(males[["school", "residence"]],
                   # do not include 'wage' into X!
                   drop_first=True)
X.sample(3)
## school residence_nothern_central residence_rural_area residence_south
## 3138 6 0 0 0
## 1520 12 0 0 0
## 1810 12 1 0 0
One can see that school was preserved as a numeric column, but the four-category residence was transformed into three columns (with north_east dropped as the reference category).
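One caveat: depending on the pandas version, pd.get_dummies may return the dummy columns as booleans instead of 0/1 integers. sklearn handles both, but if numeric dummies are preferred, they can be requested explicitly:
X = pd.get_dummies(males[["school", "residence"]],
                   drop_first=True, dtype=int)  # force 0/1 integer dummies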
For consistency, we also separate the outcome, \(\log w\), into a vector \(\mathbf{y}\):
y = males.wage
Now X is a ready-made numeric matrix we can use for fitting sklearn models, and y is the corresponding outcome variable, here just the log wage males.wage. We can fit the model as
m = LinearRegression()
_ = m.fit(X, y)
Note that we assign the result of .fit to a temporary variable: it returns the fitted object, and we do not want it to be printed here. As no output is visible, we examine the results manually:
m.coef_  # vector of parameters (but not intercept)
We can see that .coef_ is a vector of four values, one for each column of X, in the same order as the columns. So .coef_[0] tells us that those who have spent one more year at school earn 8% more, at least on average. We also see that income in all regions is smaller than in the reference region, North-East.
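To avoid counting columns by hand, it is convenient to pair the coefficients with the column names of X, for instance:
pd.Series(m.coef_, index=X.columns)  # one coefficient per column of X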
We can compute \(R^2\) using the .score method; there are no differences from the numeric-only case in Section 10.2.2.1:
m.score(X, y)
Exercise 10.3 Take the males dataset and run a regression: \[\begin{equation} \log w_i = \beta_0 + \beta_1 \cdot \text{school}_i + \beta_2 \cdot \text{ethn}_i + \epsilon_i. \end{equation}\] where school is years of schooling, and ethn is race categories.
Ensure you have the correct number of coefficients. Compute \(R^2\).
See the solution
10.3 Model Goodness
10.3.1 Create Random Data for Experiments
Random data is useful for learning and testing methods for various reasons: it allows us to design exactly the features we want to try our methods on, and because we create the data ourselves, we know exactly what we put in and hence also what we ought to get out. If the results do not agree, something is wrong, either in our code or in our understanding of the method.
For linear regression we can create data exactly as specified by the linear regression model: \[\begin{equation} y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i. \end{equation}\] Here \(x\) is our explanatory variable. We do not have to specify precisely what it looks like, but random normals are a good bet. \(\epsilon\) is the random disturbance term, necessary to knock the points off the straight line so that the artificial data resembles real data. It should be normal to ensure that the estimated standard errors are correct. We also have to choose values for the parameters \(\beta_0\) and \(\beta_1\); in this example we choose 1 and 2. Let us create a dataset with 100 observations:
n = 100
beta0, beta1 = 1, 2
x = np.random.normal(size=n)    # create random x
eps = np.random.normal(size=n)  # random disturbance term
y = beta0 + beta1*x + eps       # create y exactly as specified by the linear regression definition
# finally, keep this data in a dataframe for future use
randomDF = pd.DataFrame({"x": x, "y": y})
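If reproducible random data is desired, one can fix numpy's random seed before running the code above, for instance:
np.random.seed(20)  # any fixed number will do; makes the random draws reproducible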
Let’s plot the resulting \(x\) and \(y\):
_ = plt.scatter(x, y, edgecolor="k")
_ = plt.xlabel("x")
_ = plt.ylabel("y")
_ = plt.show()
We can see an upward-trending cloud of points.
Exercise 10.4
- Modify the parameters \(\beta_0\) and \(\beta_1\) and create a downward trending point cloud
- Modify the scale parameter of np.random.normal when creating \(\epsilon\) to create a point cloud that is very narrow (dots are almost perfectly lined up).
Show your data on a plot.
See the solution
10.3.2 Compute SSE, TSS, and \(R^2\) manually
In this example we use the random data frame we created above. As the first step, let's compute TSS using the definition \[\begin{equation} \text{TSS} = \sum_{i=1}^{N} (y_{i} - \bar{y} )^{2}. \end{equation}\] This definition can be directly translated into computer code:
TSS = np.sum((y - np.mean(y))**2)
TSS
## 519.5528490036102
This is the total variation in the data (in \(y\)). How much will be left unexplained by linear regression? First we have to fit a linear regression model in order to be able to compute this:
= smf.ols("y ~ x", data=randomDF).fit() # fit the model
m = m.params[0] # assign the best solution coefficients to beta0, beta1
beta0 = m.params[1] beta1
Now we can compute the predicted values in a fairly similar way to how we created the random data. Just remember that predictions are expected values, and hence we simply ignore \(\epsilon\):
yhat = beta0 + beta1*x
Now we have both the true values y and the predicted values yhat. We can use the definition of SSE to compute the deviations \(e\) and thereafter SSE:
e = y - yhat
SSE = np.sum(e**2)
SSE
## 99.95983754370855
Finally, we can also compute \(R^2\) from the definition:
R2 = 1 - SSE/TSS
R2
## 0.8076041008428501
We may want to compare our “home-made” \(R^2\) with the one provided by the statsmodels rsquared attribute:
m.rsquared
## 0.8076041008428501
One can see that the rsquared attribute provides exactly the same value.
Exercise 10.5 Use the versicolor data to estimate a regression model \[\begin{equation} \text{Petal width}_{i} = \beta_{0} + \beta_{1} \cdot \text{Petal length} + \epsilon_{i}, \end{equation}\] the exact same model we used in Solving Linear Regression Tasks Manually. Compute \(R^2\) manually as we did above.
See the solution
Exercise 10.6 Modify your data generation procedure, in particular the scale parameter for \(\epsilon\), so that you get:
- \(R^2 > 0.99\)
- \(R^2 < 0.1\).
Show the corresponding data points on a plot.
See the solution