Chapter 11 Logistic Regression
We assume you have loaded the following packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Below we load more packages as we introduce them.
Examples in this section use the treatment data, a dataset that contains information about certain job training programs in the 1970s. The data look like this:
treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
treatment.sample(3)
## treat age educ ethn married re74 re75 re78 u74 u75
## 617 False 34 9 black True 9796.33 8951.61 10344.1 False False
## 972 False 27 15 other True 3722.61 4654.84 19210.4 False False
## 2508 False 36 8 hispanic True 19592.70 11995.20 16993.9 False False
The main variables we are concerned with here are treat, whether someone participated in the program or not (True/False), and age. Education is given in years, and the re and u variables denote real income in 1974, 1975, and 1978, and unemployment in the corresponding years. Below, our main task is to estimate how participation (treat) depends on age.
11.1 Do it manually
Before we embark on the analysis of participation and age, let’s start with a corresponding plot:
_ = plt.scatter(treatment.age, treatment.treat, s=1)
_ = plt.xlabel("Age")
_ = plt.ylabel("Participated")
_ = plt.show()
This plot just tells us what we expect to find: there are only two possible values for treat, and age is coded as integer values only. Unfortunately, this figure is hard to read, so we add some random noise to each datapoint:
## add uniform noise in interval [-0.2, 0.2] to the x values...
x = treatment.age + np.random.uniform(-0.2, 0.2, size=len(treatment))
## ...and uniform noise in interval [-0.1, 0.1] to the y values:
y = treatment.treat + np.random.uniform(-0.1, 0.1, size=len(treatment))
_ = plt.scatter(x, y, s=1, alpha=0.5)
_ = plt.xlabel("Age")
_ = plt.ylabel("Participated")
_ = plt.show()
This plot is much clearer: all individual observations are knocked off from their correct location, but thanks to that we can actually see how many dots there are. It seems like most individuals are untreated (participation is 0), and that young people are more likely to participate.
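We can also confirm this first impression directly, for instance by computing the share of participants:
treatment.treat.mean()  # share of participants (True values) in the data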
Note that seaborn has built-in functionality for random noise (jitter), so one can achieve the figure above in a somewhat simpler manner:
import seaborn as sns
_ = sns.regplot(x="age", y="treat", data=treatment,
                ci=None,
                line_kws={"color": "black"},
                scatter_kws={"s": 1, "alpha": 0.5},
                x_jitter=0.4, y_jitter=0.2)
_ = plt.show()
The black line corresponds to the linear probability model (LPM), a linear regression model used to estimate probabilities.
As LPMs have several less-than-desirable properties, we model the probabilities using logistic regression instead. Logistic regression models the probability using the logistic (sigmoid) function: \[\begin{align*} \Pr(Y_i = 1) &= \Lambda ( \eta_i ) \\ &= \frac{1}{1 + e^{-\eta_i}} \end{align*}\] where \[\begin{equation*} \eta_i = \beta_0 + \beta_1 \cdot \mathit{age}_i. \end{equation*}\] In an analogous fashion as in the case of linear regression, we need values for the coefficients \(\beta_0\) and \(\beta_1\). Let's pick \(1\) and \(-0.1\). For instance, with these values a 25-year-old has \(\eta = 1 - 0.1 \cdot 25 = -1.5\) and hence a predicted participation probability of \(1/(1 + e^{1.5}) \approx 0.18\). Below we compute the curve over the whole age range:
## add the predicted logistic curve to the scatterplot
age = np.linspace(17, 60, 100)  # age, uniformly spaced in 17-60 years
beta0 = 1
beta1 = -0.1
eta = beta0 + age*beta1  # this is how we predicted for OLS
pr = 1/(1 + np.exp(-eta))  # now we predict probability like this instead
_ = plt.plot(age, pr, c="black")
_ = plt.scatter(x, y, s=1, alpha=0.5)  # add the datapoints, knocked off for visibility
_ = plt.show()
The computed logistic curve broadly resembles the distribution of data points. We could try different \(\beta_0\), \(\beta_1\) values to improve the fit. However, this is hard to do by visual inspection alone; we need some sort of metric, such as the log-likelihood, but a detailed treatment of the latter is out of the scope of these notes.
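For the curious, here is a minimal sketch of one such metric: the log-likelihood of the data for candidate \(\beta_0\), \(\beta_1\) values. The helper function log_likelihood below is our own, not a library function; larger values indicate a better fit:
def log_likelihood(beta0, beta1, age, treat):
    ## Bernoulli log-likelihood of the logistic model
    eta = beta0 + beta1*age
    pr = 1/(1 + np.exp(-eta))
    return np.sum(np.where(treat, np.log(pr), np.log(1 - pr)))

## compare two candidate parameter pairs: the larger value fits better
log_likelihood(1, -0.1, treatment.age, treatment.treat)
log_likelihood(0, -0.05, treatment.age, treatment.treat)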
11.2 Logistic Regression in Python: statsmodels.formula.api and sklearn
As with linear regression, we can use both libraries, statsmodels and sklearn, for logistic regression too. The usage is fairly similar to that for linear regression, but both libraries come with their own quirks.
11.2.1 Statsmodels and its formula API
When using statsmodels, we need to specify a formula similar to the one for linear regression (see Section 10.2.1). However, before we can do that, we need to address a quirk in statsmodels: it does not allow the outcome to be a logical or string variable. It must be a 0/1 (numeric) variable, otherwise you get an error like
ValueError: endog has evaluated to an array with multiple columns that has shape (2675, 2). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).
So in order to fit a model where we estimate the probability of treatment, we first have to create a new variable that is just the numeric version of treat. We call it T below:
T = treatment.treat.astype(int)
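Alternatively, one might store the numeric version as a new column in the data frame; a minimal sketch of this variant (the column name "T" is just our choice):
treatment["T"] = treatment.treat.astype(int)  # store as a data variable (column) instead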
Note that here T is a workspace variable; the sketch above shows the alternative of storing it as a data variable (a column in the data frame). For what we do here, both approaches will work. Now we import the module and estimate the logistic regression model:
import statsmodels.formula.api as smf
m = smf.logit('T ~ age', data=treatment).fit()
## Optimization terminated successfully.
## Current function value: 0.221883
## Iterations 8
This code is very similar to linear regression; the only difference is using smf.logit instead of smf.ols.
Exactly as with linear regression, logistic regression uses R-style formulas (see Section 10.2.1.1). Hence the model we use is
\[\begin{align*}
\Pr(Y_i = 1)
&= \Lambda ( \eta_i ) \\
&= \frac{1}{1 + e^{-\eta_i}}
\end{align*}\]
where
\[\begin{equation*}
\eta_i = \beta_0 + \beta_1 \cdot \mathit{age}_i.
\end{equation*}\]
Note that the formula specifies just the linear predictor \(\eta\); the logistic function part of the model is already embedded in smf.logit.
We can see the estimated values with
m.params
## Intercept 1.034257
## age -0.122942
## dtype: float64
This indicates that our example plot above, using values 1 and -0.1, was fairly close.
We can also get a full summary output as
m.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
## Logit Regression Results
## ==============================================================================
## Dep. Variable: T No. Observations: 2675
## Model: Logit Df Residuals: 2673
## Method: MLE Df Model: 1
## Date: Tue, 17 Sep 2024 Pseudo R-squ.: 0.1176
## Time: 19:52:50 Log-Likelihood: -593.54
## converged: True LL-Null: -672.65
## Covariance Type: nonrobust LLR p-value: 2.760e-36
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 1.0343 0.330 3.135 0.002 0.388 1.681
## age -0.1229 0.012 -10.052 0.000 -0.147 -0.099
## ==============================================================================
## """
The output is similar to that of linear regression, with certain differences that are appropriate for logistic regression.
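As with linear regression, individual components of this table can also be extracted programmatically; for instance, the p-values and the confidence intervals of the coefficients:
m.pvalues     # p-values of the coefficients
m.conf_int()  # confidence intervals of the coefficients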
For completeness, let's make another plot here, one that also contains the fitted logistic curve (marked in red):
## add the predicted logistic curves to the scatterplot
age = np.linspace(17, 60, 100)  # age, uniformly spaced in 17-60 years
beta0 = 1
beta1 = -0.1
eta = beta0 + age*beta1  # the manually chosen link values
pr = 1/(1 + np.exp(-eta))  # corresponding probabilities
_ = plt.plot(age, pr, c="black")  # manually chosen curve
etaX = m.params["Intercept"] + m.params["age"]*age  # link computed with fitted values
prX = 1/(1 + np.exp(-etaX))  # fitted probabilities
_ = plt.plot(age, prX, c="red")  # fitted curve
_ = plt.scatter(x, y, s=1, alpha=0.5)  # use the knocked-off datapoints from above
_ = plt.show()
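Instead of computing the link and the probabilities manually from m.params, we can also let statsmodels do the prediction; a brief sketch that should reproduce the same red curve (prX2 is just our name for the result):
## predicted participation probabilities over the same age grid
prX2 = m.predict(pd.DataFrame({"age": age}))
_ = plt.plot(age, prX2, c="red")
_ = plt.show()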
For marginal effects, there is a dedicated method .get_margeff() that computes the effects; its result has its own .summary() method that displays a table similar to a regression table. These two can be chained as
m.get_margeff().summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
## Logit Marginal Effects
## =====================================
## Dep. Variable: T
## Method: dydx
## At: overall
## ==============================================================================
## dy/dx std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## age -0.0075 0.001 -9.181 0.000 -0.009 -0.006
## ==============================================================================
## """
By default, it computes the average marginal effect, i.e., the average of all individual marginal effects. These figures are directly interpretable: here, individuals who are one year older are, on average, 0.75 percentage points less likely to participate.
One can also specify a different type of marginal effect using the at argument. For instance, at = "mean" will produce the marginal effect at the mean of each explanatory variable.
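A sketch of how this looks in code (output not reproduced here):
m.get_margeff(at="mean").summary()  # marginal effect at the mean age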
11.2.2 Scikit-learn and LogisticRegression
Doing logistic analysis using sklearn is in many ways similar to doing linear regression in sklearn. Here we assume you are familiar with that section.
Next, we proceed in an analogous fashion as in the case of linear regression:
- import the function LogisticRegression from sklearn
- set up the model
- create the numeric-only design matrix X
- fit the logistic regression model
- analyze the results.
The fact that we can use the same approach with logistic regression as with linear regression is a big advantage of sklearn: the same approach applies to other models too, so it is very easy to experiment with different models. Very little additional code and no additional data processing is needed.
First import and set up the model:
# we need to import the model
from sklearn.linear_model import LogisticRegression
# create the model
m = LogisticRegression()
We can specify various options for LogisticRegression, but for now we are happy with the default values.
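For instance, LogisticRegression applies L2 regularization by default, so its estimates differ slightly from the unpenalized ML estimates of statsmodels; a sketch of how one might make the penalty negligible (the name m_unpenalized is just illustrative):
## a very large C makes the default L2 penalty negligible
m_unpenalized = LogisticRegression(C=1e6, max_iter=1000)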
Next, we construct the design matrix: we just need to pull out the "age" variable and ensure the result is a data frame (or matrix):
X = treatment[["age"]]  # remember: double brackets for matrix
y = treatment.treat     # just for consistency
And now we fit the model:
_ = m.fit(X, y)
As in the case of linear regression, we can query the estimates as
m.intercept_  # intercept
m.coef_  # the estimated parameters, here just one
In the case of logistic regression, the score method reports accuracy (the percentage of correct predictions) instead of \(R^2\):
m.score(X, y)  # 93% correct
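Accuracy compares the predicted classes with the actual outcomes. If we want the predicted classes or the predicted probabilities themselves, we can use the predict and predict_proba methods; a brief sketch (yhat and phat are just our names):
yhat = m.predict(X)        # predicted classes (True/False)
phat = m.predict_proba(X)  # predicted probabilities, one column per class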
Note that LogisticRegression, unlike smf.logit, can handle logical outcome variables, so we do not have to transform those into numbers.
When you want to include categorical variables, you have to transform those to dummies. See the examples in LinearRegression: Categorical variables and Section Using pd.cut.
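For instance, here is a minimal sketch of how one might dummy-code the ethn variable and add it to the design matrix (the names ethnDummies, X2, and m2 are just illustrative):
## dummy-code ethnicity; drop_first drops one redundant category
ethnDummies = pd.get_dummies(treatment.ethn, drop_first=True)
X2 = pd.concat([treatment[["age"]], ethnDummies], axis=1)
m2 = LogisticRegression().fit(X2, treatment.treat)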
Exercise 11.1 Take treatment data and model participation (variable treat). Now include all other variables in the model as explanatory variables! (But do not include treat itself).
Compute accuracy–do you get a better number than 93%?
See the solution