Chapter 11 Logistic Regression

We assume you have loaded the following packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Below, we load more packages as we introduce them.

Examples in this section use the treatment data, a dataset that contains information about certain job training programs in the 1970s. The data look like this:

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
treatment.sample(3)
##       treat  age  educ   ethn  married      re74     re75     re78    u74    u75
## 757   False   38     9  black     True  20388.10  22407.7  16255.0  False  False
## 884   False   48    17  other     True      0.00      0.0      0.0   True   True
## 1009  False   19    13  other    False   6857.43  15217.7  17732.7  False  False

The main variables we are concerned with here are treat, whether or not someone participated in the program (True/False), and age. Education (educ) is given in years; the re and u variables denote real income in 1974, 1975, and 1978, and unemployment in the corresponding years. Below, our main task is to estimate how participation (treat) depends on age.

11.1 Do it manually

Before we embark on the analysis of participation and age, let’s start with a corresponding plot:

_ = plt.scatter(treatment.age, treatment.treat, s=1)
_ = plt.xlabel("Age")
_ = plt.ylabel("Participated")
_ = plt.show()

Figure: scatterplot of participation versus age.

This plot just tells us what we expect to find: there are only two possible values for treat, and age is coded as integers only. Unfortunately, this figure is hard to read because many points overlap, so we add some random noise to each datapoint:

## add uniform noise in interval [-0.2, 0.2] to the x values...
x = treatment.age + np.random.uniform(-0.2, 0.2, size=len(treatment))
## ...and uniform noise in interval [-0.1, 0.1] to the y values:
y = treatment.treat + np.random.uniform(-0.1, 0.1, size=len(treatment))
_ = plt.scatter(x, y, s=1, alpha=0.5)
_ = plt.xlabel("Age")
_ = plt.ylabel("Participated")
_ = plt.show()

Figure: the same scatterplot with random noise added to each datapoint.

This plot is much clearer: all individual observations have been knocked off from their correct locations, but thanks to that, we can actually see how many dots there are. It seems like most individuals are untreated (participation is 0), and younger people are more likely to participate.

Note that seaborn has built-in functionality for random noise (jitter), so one can achieve the figure above in a somewhat simpler manner using seaborn:

import seaborn as sns
_ = sns.regplot("age", "treat", ci=None, 
                line_kws = {"color":"black"}, 
                scatter_kws = {"s":1, "alpha":0.5},
                x_jitter = 0.4, y_jitter=0.2,
                data=treatment)
_ = plt.show()

Figure: jittered scatterplot with the linear probability model line, made with seaborn.

The black line corresponds to the linear probability model (LPM), a linear regression model to estimate probabilities.

As LPMs have several less-than-desirable properties, we model the probabilities using logistic regression instead. Logistic regression models the probability using the logistic (sigmoid) function: \[\begin{align*} \Pr(Y_i = 1) &= \Lambda ( \eta_i ) \\ &= \frac{1}{1 + e^{-\eta_i}} \end{align*}\] where \[\begin{equation*} \eta_i = \beta_0 + \beta_1 \cdot \mathit{age}_i. \end{equation*}\] As with linear regression, we need values for the coefficients \(\beta_0\) and \(\beta_1\). Let's pick \(1\) and \(-0.1\):

## add the predicted logistic curve to the scatterplot
age = np.linspace(17, 60, 100)  # age, uniformly spaced over 17-60 years
beta0 = 1
beta1 = -0.1
eta = beta0 + beta1*age  # the linear predictor: this is how we predicted for OLS
pr = 1/(1 + np.exp(-eta))  # now we predict the probability like this instead
_ = plt.plot(age, pr, c="black")
_ = plt.scatter(x, y, s=1, alpha=0.5)  # add the datapoints, knocked off for visibility
_ = plt.show()

Figure: jittered datapoints with the hand-picked logistic curve.

The computed logistic curve broadly resembles the distribution of the data points. We could try different \(\beta_0\), \(\beta_1\) values and improve the fit. However, this is hard to do by visual inspection alone; we need some sort of metric, such as log-likelihood, but the latter is out of the scope of these notes.
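
For the curious, here is a minimal sketch (not part of the original example) of how one such metric, the Bernoulli log-likelihood, could be computed for our hand-picked coefficients; a larger (less negative) value indicates a better fit:

yobs = treatment.treat.astype(int)  # observed outcomes as 0/1
p = 1/(1 + np.exp(-(beta0 + beta1*treatment.age)))  # predicted Pr(treat) for each person
np.sum(yobs*np.log(p) + (1 - yobs)*np.log(1 - p))  # log-likelihood of these coefficients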

11.2 Logistic Regression in python: statsmodels.formula.api and sklearn

As with linear regression, we can use both libraries, statsmodels and sklearn, for logistic regression too. The usage is fairly similar to that of linear regression, but both libraries come with their own quirks.

11.2.1 Statsmodels and its formula API

When using statsmodels, we need to specify a formula similar to the one for linear regression (see Section 10.2.1). However, before we can do that, we need to address a quirk in statsmodels: it does not allow the outcome to be a logical or string variable. It must be a 0/1 (numeric) variable, otherwise you get an error

ValueError: endog has evaluated to an array with multiple columns that has shape (2675, 2). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).

So in order to fit a model where we estimate the probability of treatment, we first have to create a new variable that is just the numeric version of treat. We call it T below:

T = treatment.treat.astype(int)

Note that here we created a workspace variable T, not a variable in the data frame. For what we do here, both approaches work.
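
If we prefer the data-variable approach, we can store the numeric version as a new column instead; a small equivalent sketch (the column name T is arbitrary):

treatment["T"] = treatment.treat.astype(int)

Now we import the module and estimate the logistic regression model: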

import statsmodels.formula.api as smf
m = smf.logit('T ~ age', data=treatment).fit()
## Optimization terminated successfully.
##          Current function value: 0.221883
##          Iterations 8

This code is very similar to linear regression; the only difference is using smf.logit instead of smf.ols. Exactly as with linear regression, logistic regression uses R-style formulas (see Section 10.2.1.1). Hence the model we use is \[\begin{align*} \Pr(Y_i = 1) &= \Lambda ( \eta_i ) \\ &= \frac{1}{1 + e^{-\eta_i}} \end{align*}\] where \[\begin{equation*} \eta_i = \beta_0 + \beta_1 \cdot \mathit{age}_i. \end{equation*}\] Note that the formula specifies just the linear predictor \(\eta_i\); the logistic-function part of the model is already embedded in smf.logit. We can see the estimated values with

m.params
## Intercept    1.034257
## age         -0.122942
## dtype: float64

This indicates that our example plot above, using the values 1 and -0.1, was fairly close.

We can also get a full summary output as

m.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                            Logit Regression Results                           
## ==============================================================================
## Dep. Variable:                      T   No. Observations:                 2675
## Model:                          Logit   Df Residuals:                     2673
## Method:                           MLE   Df Model:                            1
## Date:                Sun, 24 Mar 2024   Pseudo R-squ.:                  0.1176
## Time:                        09:03:15   Log-Likelihood:                -593.54
## converged:                       True   LL-Null:                       -672.65
## Covariance Type:            nonrobust   LLR p-value:                 2.760e-36
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept      1.0343      0.330      3.135      0.002       0.388       1.681
## age           -0.1229      0.012    -10.052      0.000      -0.147      -0.099
## ==============================================================================
## """

The output is similar to that of linear regression, with certain differences that are appropriate for logistic regression.
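
The individual components of this table are also available as attributes of the fitted model, for instance (a small sketch):

m.llf        # log-likelihood of the fitted model
m.prsquared  # pseudo R-squared
m.pvalues    # p-values of the coefficients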

For completeness, let’s make another plot here that also contains the fitted logistic curve (marked red):

## add the hand-picked and the fitted logistic curves to the scatterplot
age = np.linspace(17, 60, 100)  # age, uniformly spaced over 17-60 years
beta0 = 1
beta1 = -0.1
eta = beta0 + beta1*age  # linear predictor with the hand-picked values
pr = 1/(1 + np.exp(-eta))
_ = plt.plot(age, pr, c="black")  # the hand-picked curve, as above
etaX = m.params["Intercept"] + m.params["age"]*age
    # link computed with the fitted values
prX = 1/(1 + np.exp(-etaX))
    # fitted probabilities
_ = plt.plot(age, prX, c="red")
_ = plt.scatter(x, y, s=1, alpha=0.5)
    # use the knocked-off datapoints from above
_ = plt.show()

Figure: jittered datapoints with the hand-picked (black) and fitted (red) logistic curves.
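
Instead of computing the link and the probabilities by hand, we can also let statsmodels do the prediction. A small sketch, using the same age grid as above:

newdata = pd.DataFrame({"age": age})  # a data frame with the formula's explanatory variable
prX = m.predict(newdata)  # predicted probabilities, equivalent to the manual computation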

For marginal effects, there is a dedicated method .get_margeff() that computes the effects; its result has its own .summary() method that displays a regression-table-like output. The two can be chained as

m.get_margeff().summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##         Logit Marginal Effects       
## =====================================
## Dep. Variable:                      T
## Method:                          dydx
## At:                           overall
## ==============================================================================
##                 dy/dx    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## age           -0.0075      0.001     -9.181      0.000      -0.009      -0.006
## ==============================================================================
## """

By default, it computes the average marginal effect, the average of all individual marginal effects. These figures are directly interpretable: here, individuals who are one year older are, on average, 0.75 percentage points less likely to participate.

One can also specify different types of marginal effects using the at argument. For instance, at="mean" will produce the marginal effect at the mean of each explanatory variable.
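
A quick sketch of the latter (the resulting numbers will in general differ somewhat from the average marginal effects above):

m.get_margeff(at="mean").summary()  # marginal effects evaluated at the mean age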

11.2.2 Scikit-learn and LogisticRegression

Doing logistic regression using sklearn is in many ways similar to doing linear regression in sklearn. Here we assume you are familiar with that section.

Next, we proceed in a fashion analogous to linear regression:

  1. import the LogisticRegression class from sklearn
  2. set up the model
  3. create the numeric-only design matrix X
  4. fit the logistic regression model
  5. analyze the results.

The fact that we can use the same approach for logistic regression as for linear regression is a big advantage of sklearn: the same approach applies to other models too, so it is very easy to experiment with different models. Very little additional code and no additional data processing is needed.

First import and set up the model:

# we need to import the model
from sklearn.linear_model import LogisticRegression
# create the model
m = LogisticRegression()

We can specify various options for LogisticRegression, but for now we are happy with the default values.
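
For instance, sklearn applies L2 regularization by default; one may effectively switch it off with a large C, or raise the solver's iteration limit with max_iter. A small illustrative sketch (the particular values are just examples, and this model is not used below; we keep the default model m):

m2 = LogisticRegression(C=1e6, max_iter=1000)  # weak regularization, more iterations allowed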

Next, we construct the design matrix: we just need to pull out the age variable and ensure the result is a data frame (or matrix):

X = treatment[["age"]]  # remember: double brackets for matrix
y = treatment.treat  # just for consistency

And now we fit the model:

_ = m.fit(X, y)

As with linear regression, we can query the estimates as

m.intercept_  # intercept
## array([1.03377561])
m.coef_  # the estimated parameters, here just one
## array([[-0.12292312]])

In the case of logistic regression, the .score() method reports accuracy (the percentage of correct predictions) instead of \(R^2\):

m.score(X, y)  # 93% correct
## 0.930841121495327
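
Note that accuracy is most meaningful when compared with a naive baseline, such as always predicting the majority class (here: not treated). A small check, not part of the original example:

1 - treatment.treat.mean()  # share of untreated: accuracy of always predicting False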

Note that LogisticRegression, unlike smf.logit, can handle logical outcome variables; we do not have to transform them to numbers.

When you want to include categorical variables, you have to transform them to dummies first. See the examples in LinearRegression: Categorical variables and in Section Using pd.cut.
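
For instance, a minimal sketch of including ethnicity through dummies (using pd.get_dummies; dropping the first category as the reference is just one possible choice):

X = pd.concat([treatment[["age"]],
               pd.get_dummies(treatment.ethn, drop_first=True)],
              axis=1)  # age plus ethnicity dummies; can be passed to m.fit(X, y) as before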

Exercise 11.1 Take the treatment data and model participation (variable treat). Now include all the other variables in the model as explanatory variables (but do not include treat itself)!

Compute the accuracy. Do you get a better number than 93%?

See the solution