Chapter 11 Logistic Regression

We assume you have loaded the following packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Below, we load more packages as we introduce them.

Examples in this section use the treatment data, a dataset that contains information about certain job training programs in the 1970s. The data look like this:

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
treatment.sample(3)
##       treat  age  educ   ethn  married      re74     re75     re78    u74    u75
## 757   False   38     9  black     True  20388.10  22407.7  16255.0  False  False
## 884   False   48    17  other     True      0.00      0.0      0.0   True   True
## 1009  False   19    13  other    False   6857.43  15217.7  17732.7  False  False

The main variables we are concerned with here are treat, whether or not someone participated in the program (True/False), and age. Education (educ) is given in years; the re and u variables denote real income in 1974, 1975, and 1978, and unemployment in the corresponding years. Below, our main task is to estimate how participation (treat) depends on age.

11.1 Do it manually

Before we embark on the analysis of participation and age, let’s start with a corresponding plot:

_ = plt.scatter(treatment.age, treatment.treat, s=1)
_ = plt.xlabel("Age")
_ = plt.ylabel("Participated")
_ = plt.show()

Figure: scatterplot of participation versus age.

This plot just tells us what we expect to find: there are only two possible values for treat, and age is coded as integers only. Unfortunately, this figure is hard to read because many points overlap, so we add some random noise to each datapoint:

## add uniform noise in interval [-0.2, 0.2] to the x values...
x = treatment.age + np.random.uniform(-0.2, 0.2, size=len(treatment))
## ...and uniform noise in interval [-0.1, 0.1] to the y values:
y = treatment.treat + np.random.uniform(-0.1, 0.1, size=len(treatment))
_ = plt.scatter(x, y, s=1, alpha=0.5)
_ = plt.xlabel("Age")
_ = plt.ylabel("Participated")
_ = plt.show()

Figure: the same scatterplot with random noise added to each datapoint.

This plot is much clearer: all individual observations have been knocked off from their correct locations, but thanks to that, we can actually see how many dots there are. It seems like most individuals are untreated (participation is 0), and younger people are more likely to participate.

Note that seaborn has built-in functionality for random noise (jitter), so one can achieve the figure above in a somewhat simpler manner using seaborn:

import seaborn as sns
_ = sns.regplot("age", "treat", ci=None, 
                line_kws = {"color":"black"}, 
                scatter_kws = {"s":1, "alpha":0.5},
                x_jitter = 0.4, y_jitter=0.2,
                data=treatment)
_ = plt.show()

Figure: jittered scatterplot with the linear probability model line, made with seaborn.

The black line corresponds to the linear probability model (LPM), a linear regression model to estimate probabilities.

As LPMs have several less-than-desirable properties, we model the probabilities using logistic regression instead. Logistic regression models the probability using the logistic (sigmoid) function: \[\begin{align*} \Pr(Y_i = 1) &= \Lambda ( \eta_i ) \\ &= \frac{1}{1 + e^{-\eta_i}} \end{align*}\] where \[\begin{equation*} \eta_i = \beta_0 + \beta_1 \cdot \mathit{age}_i. \end{equation*}\] As with linear regression, we need values for the coefficients \(\beta_0\) and \(\beta_1\). Let's pick \(1\) and \(-0.1\):

## add the predicted logistic curve to the scatterplot
age = np.linspace(17, 60, 100)  # age, uniformly spaced over 17-60 years
beta0 = 1
beta1 = -0.1
eta = beta0 + beta1*age  # the linear predictor: this is how we predicted for OLS
pr = 1/(1 + np.exp(-eta))  # now we predict the probability like this instead
_ = plt.plot(age, pr, c="black")
_ = plt.scatter(x, y, s=1, alpha=0.5)  # add the datapoints, knocked off for visibility
_ = plt.show()

Figure: jittered datapoints with the hand-picked logistic curve.

The computed logistic curve broadly resembles the distribution of the data points. We could try different \(\beta_0\), \(\beta_1\) values and improve the fit. However, this is hard to do by visual inspection alone; we need some sort of metric, such as log-likelihood, but the latter is out of the scope of these notes.
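
For the curious, here is a minimal sketch (not part of the original example) of how one such metric, the Bernoulli log-likelihood, could be computed for our hand-picked coefficients; a larger (less negative) value indicates a better fit:

yobs = treatment.treat.astype(int)  # observed outcomes as 0/1
p = 1/(1 + np.exp(-(beta0 + beta1*treatment.age)))  # predicted Pr(treat) for each person
np.sum(yobs*np.log(p) + (1 - yobs)*np.log(1 - p))  # log-likelihood of these coefficients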

11.2 Logistic Regression in python: statsmodels.formula.api and sklearn

As with linear regression, we can use both libraries, statsmodels and sklearn, for logistic regression too. The usage is fairly similar to that of linear regression, but both libraries come with their own quirks.

11.2.1 Statsmodels and its formula API

When using statsmodels, we need to specify a formula similar to the one for linear regression (see Section 10.2.1). However, before we can do that, we need to address a quirk in statsmodels: it does not allow the outcome to be a logical or string variable. It must be a 0/1 (numeric) variable, otherwise you get an error

ValueError: endog has evaluated to an array with multiple columns that has shape (2675, 2). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).

So in order to fit a model where we estimate the probability of treatment, we first have to create a new variable that is just the numeric version of treat. We call it T below:

T = treatment.treat.astype(int)

Note that here we created a workspace variable T, not a variable in the data frame. For what we do here, both approaches work.
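
If we prefer the data-variable approach, we can store the numeric version as a new column instead; a small equivalent sketch (the column name T is arbitrary):

treatment["T"] = treatment.treat.astype(int)

Now we import the module and estimate the logistic regression model: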

import statsmodels.formula.api as smf
m = smf.logit('T ~ age', data=treatment).fit()
## Optimization terminated successfully.
##          Current function value: 0.221883
##          Iterations 8

This code is very similar to linear regression; the only difference is using smf.logit instead of smf.ols. Exactly as with linear regression, logistic regression uses R-style formulas (see Section 10.2.1.1). Hence the model we use is \[\begin{align*} \Pr(Y_i = 1) &= \Lambda ( \eta_i ) \\ &= \frac{1}{1 + e^{-\eta_i}} \end{align*}\] where \[\begin{equation*} \eta_i = \beta_0 + \beta_1 \cdot \mathit{age}_i. \end{equation*}\] Note that the formula specifies just the linear predictor \(\eta_i\); the logistic-function part of the model is already embedded in smf.logit. We can see the estimated values with

m.params
## Intercept    1.034257
## age         -0.122942
## dtype: float64

This indicates that our example plot above, using the values 1 and -0.1, was fairly close.

We can also get a full summary output as

m.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                            Logit Regression Results                           
## ==============================================================================
## Dep. Variable:                      T   No. Observations:                 2675
## Model:                          Logit   Df Residuals:                     2673
## Method:                           MLE   Df Model:                            1
## Date:                Sun, 24 Mar 2024   Pseudo R-squ.:                  0.1176
## Time:                        09:03:15   Log-Likelihood:                -593.54
## converged:                       True   LL-Null:                       -672.65
## Covariance Type:            nonrobust   LLR p-value:                 2.760e-36
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept      1.0343      0.330      3.135      0.002       0.388       1.681
## age           -0.1229      0.012    -10.052      0.000      -0.147      -0.099
## ==============================================================================
## """

The output is similar to that of linear regression, with certain differences that are appropriate for logistic regression.
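
The individual components of this table are also available as attributes of the fitted model, for instance (a small sketch):

m.llf        # log-likelihood of the fitted model
m.prsquared  # pseudo R-squared
m.pvalues    # p-values of the coefficients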

For completeness, let’s make another plot here that also contains the fitted logistic curve (marked red):

## add the hand-picked and the fitted logistic curves to the scatterplot
age = np.linspace(17, 60, 100)  # age, uniformly spaced over 17-60 years
beta0 = 1
beta1 = -0.1
eta = beta0 + beta1*age  # linear predictor with the hand-picked values
pr = 1/(1 + np.exp(-eta))
_ = plt.plot(age, pr, c="black")  # the hand-picked curve, as above
etaX = m.params["Intercept"] + m.params["age"]*age
    # link computed with the fitted values
prX = 1/(1 + np.exp(-etaX))
    # fitted probabilities
_ = plt.plot(age, prX, c="red")
_ = plt.scatter(x, y, s=1, alpha=0.5)
    # use the knocked-off datapoints from above
_ = plt.show()

Figure: jittered datapoints with the hand-picked (black) and fitted (red) logistic curves.
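
Instead of computing the link and the probabilities by hand, we can also let statsmodels do the prediction. A small sketch, using the same age grid as above:

newdata = pd.DataFrame({"age": age})  # a data frame with the formula's explanatory variable
prX = m.predict(newdata)  # predicted probabilities, equivalent to the manual computation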

For marginal effects, there is a dedicated method .get_margeff() that computes the effects; its result has its own .summary() method that displays a regression-table-like output. The two can be chained as

m.get_margeff().summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##         Logit Marginal Effects       
## =====================================
## Dep. Variable:                      T
## Method:                          dydx
## At:                           overall
## ==============================================================================
##                 dy/dx    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## age           -0.0075      0.001     -9.181      0.000      -0.009      -0.006
## ==============================================================================
## """

By default, it computes the average marginal effect, the average of all individual marginal effects. These figures are directly interpretable: here, individuals who are one year older are, on average, 0.75 percentage points less likely to participate.

One can also specify different types of marginal effects using the at argument. For instance, at="mean" will produce the marginal effect at the mean of each explanatory variable.
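
A quick sketch of the latter (the resulting numbers will in general differ somewhat from the average marginal effects above):

m.get_margeff(at="mean").summary()  # marginal effects evaluated at the mean age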

11.2.2 Scikit-learn and LogisticRegression

Doing logistic regression using sklearn is in many ways similar to doing linear regression in sklearn. Here we assume you are familiar with that section.

Next, we proceed in a fashion analogous to linear regression:

  1. import the LogisticRegression class from sklearn
  2. set up the model
  3. create the numeric-only design matrix X
  4. fit the logistic regression model
  5. analyze the results.

The fact that we can use the same approach for logistic regression as for linear regression is a big advantage of sklearn: the same approach applies to other models too, so it is very easy to experiment with different models. Very little additional code and no additional data processing is needed.

First import and set up the model:

# we need to import the model
from sklearn.linear_model import LogisticRegression
# create the model
m = LogisticRegression()

We can specify various options for LogisticRegression, but for now we are happy with the default values.
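
For instance, sklearn applies L2 regularization by default; one may effectively switch it off with a large C, or raise the solver's iteration limit with max_iter. A small illustrative sketch (the particular values are just examples, and this model is not used below; we keep the default model m):

m2 = LogisticRegression(C=1e6, max_iter=1000)  # weak regularization, more iterations allowed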

Next, we construct the design matrix: we just need to pull out the age variable and ensure the result is a data frame (or matrix):

X = treatment[["age"]]  # remember: double brackets for matrix
y = treatment.treat  # just for consistency

And now we fit the model:

_ = m.fit(X, y)

As with linear regression, we can query the estimates as

m.intercept_  # intercept
## array([1.03377561])
m.coef_  # the estimated parameters, here just one
## array([[-0.12292312]])

In the case of logistic regression, the .score() method reports accuracy (the percentage of correct predictions) instead of \(R^2\):

m.score(X, y)  # 93% correct
## 0.930841121495327
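
Note that accuracy is most meaningful when compared with a naive baseline, such as always predicting the majority class (here: not treated). A small check, not part of the original example:

1 - treatment.treat.mean()  # share of untreated: accuracy of always predicting False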

Note that LogisticRegression, unlike smf.logit, can handle logical outcome variables; we do not have to transform them to numbers.

When you want to include categorical variables, you have to transform them to dummies first. See the examples in LinearRegression: Categorical variables and in Section Using pd.cut.
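
For instance, a minimal sketch of including ethnicity through dummies (using pd.get_dummies; dropping the first category as the reference is just one possible choice):

X = pd.concat([treatment[["age"]],
               pd.get_dummies(treatment.ethn, drop_first=True)],
              axis=1)  # age plus ethnicity dummies; can be passed to m.fit(X, y) as before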

Exercise 11.1 Take the treatment data and model participation (variable treat). Now include all the other variables in the model as explanatory variables (but do not include treat itself)!

Compute the accuracy. Do you get a better number than 93%?

See the solution