Chapter 13 Logistic regression

In Section 11.3 we introduced linear regression, one of the most widely used tools for analyzing relationships. Linear regression is an excellent choice for analyzing many relationships, but not all of them. In particular, it requires that the outcome variable, \(y\), is continuous, or at least close to continuous. This was the case with temperature, baby birth weight, and iris flower size. Note that the explanatory variable \(x\) does not need to be continuous, as was the case with smoking (Section @(relation-cat-dummies)).

However, for a large class of problems this is not the case. For instance, whether someone survived a shipwreck, whether a tweet will be retweeted, and whether an oil drill gets stuck in the drillhole cannot be described with a continuous outcome. The passenger either survived or not, and a tweet was either retweeted or not. Even if we describe these outcomes with numbers (e.g. survival as “1” and death as “0”), the result is not a continuous variable. We need different tools for this type of task.

13.1 What is logistic regression and what is it good for?

Consider policymakers during economically challenging times. Unemployment is high and work is nowhere to be found. The government is spending a lot of money on benefits, and the voices concerned about the effect on workers’ motivation and on government coffers are growing louder. But are workers actually receiving the benefits? And who are the workers who get them?

Here we use unemployment benefits data. We focus on two variables: ui tells whether an unemployed person receives benefits (yes/no), and tenure is the tenure at the job the worker lost, i.e. how many years they worked at that job. An example of the relevant data looks like

##    ui tenure
## 1 yes     21
## 2  no      2
## 3 yes     19
## 4 yes     17

This can be read as follows: the first worker applied for and received unemployment benefits, and they had worked for 21 years at their previous job. The second worker had worked at their previous job for 2 years and did not receive benefits.

But how is receiving benefits related to job tenure? One might guess that the longer the tenure, the more worthwhile workers find it to apply for benefits. But is that actually the case? Let’s find out!

[Figure: Receiving benefits as a function of job tenure.]

A good place to start such an analysis is a plot: just plot receiving benefits as a function of job tenure. We do this using geom_jitter() to kick the points a little off from the only two values (“yes” and “no”) to make them better visible (see Section 11.4.2):
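A minimal sketch of such a plot, assuming the data frame is called benefits (a hypothetical name) and contains the variables ui and tenure shown above:

```r
library(ggplot2)

## Benefit receipt ("yes"/"no") versus job tenure.
## geom_jitter() nudges the points vertically and horizontally so they
## do not all collapse onto the two values "yes" and "no".
ggplot(benefits, aes(x = tenure, y = ui)) +
  geom_jitter(height = 0.1, width = 0.3, size = 0.5, alpha = 0.5) +
  labs(x = "Tenure (years)", y = "Receives benefits")
```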

The figure reveals that at every year of tenure it is common to see both those who receive and those who do not receive benefits. But it is hard to tell how the two variables are related: are benefit recipients more common among workers with less tenure, or among those with more tenure?

[Figure: Receiving benefits as a function of tenure.]

Next, let’s add a trend line to this figure. For this to be possible, we need to convert the variable on the y-axis, ui, to a numeric (see Section 11.4.1). This can be done as as.numeric(ui == "yes"):
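One way to do this, again assuming the hypothetical data frame name benefits:

```r
library(ggplot2)

## Convert "yes"/"no" to 1/0 on the fly and add a linear trend line.
ggplot(benefits, aes(x = tenure, y = as.numeric(ui == "yes"))) +
  geom_jitter(height = 0.05, width = 0.3, size = 0.5, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Tenure (years)", y = "Receives benefits")
```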

The trend line is upward sloping. This suggests that claiming benefits is more common among those who have been working at the job for many years.

But compare this plot with the one in Section 11.3.1 where we introduced linear regression. In that plot, the dots (global temperature anomaly) lined up very well with the trend line. Why is that not the case here?

The main culprit is the fact that the outcome is a binary variable. Benefit status can only be “yes” or “no”, “0” or “1”, and nothing in between. Someone either receives benefits or does not. But a line cannot touch just one or the other of these values; a line also connects everything in between. So we necessarily see values like “0.1” and “0.5”, numbers that do not make any sense in terms of receiving benefits.

The way to overcome this problem is to interpret the outcome not as the outcome value itself (whether someone receives benefits) but as the probability of the outcome, the probability that someone receives benefits. A value of “0.5” would then mean a fifty-fifty chance that someone receives benefits, while “0.99” means that the person almost certainly receives them. Taking this view, the trend line suggests that the probability for someone with 20 years of tenure to claim benefits is approximately 80%, while for a person with just one year of tenure it is more like 55%.

13.1.1 From the linear probability model to logistic regression

In fact, this approach is widely used, and a linear regression model whose fitted values are interpreted as probabilities is called a linear probability model (LPM). But LPM-s also have another problem. Consider the treatment data we analyze below, where we model participation in a program as a function of age: there the fitted line falls below zero around age 46. Should we interpret this as 50-year-olds having a negative probability to participate? That is obviously nonsense. In a similar fashion, the line will exceed probability 1 somewhere (the age where this happens is negative in this case, so it is not a problem here). We can obviously hack the model by setting the probability to zero whenever the predicted probability is negative. But what should we do with the 48-year-old participant then (the oldest participant in the data is 48 years old)? If the participation probability at that age is zero, then we should not see even a single participant in that age category. But we see a few, so we need to set the probability not to zero but to a small positive number… If you are still with me, then you probably agree that making linear regression work with probabilities requires a lot of hacking, and the model is no longer nice and intuitive. So we need another model, one that is designed for binary outcomes and probabilities from the start.
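As a sketch of what an LPM looks like in practice, one can simply fit ordinary least squares to the 0/1 outcome. The chunk below assumes the treatment data used later in this chapter (variables treat and age) is already loaded; it is an illustration, not the model we recommend.

<<lpm-sketch, eval=FALSE>>=
## Linear probability model: OLS with a binary (0/1) outcome.
## Fitted values are interpreted as probabilities, but nothing keeps
## them inside the [0, 1] interval.
lpm <- lm(as.numeric(treat) ~ age, data = treatment)
summary(lpm)

## Predicted "probabilities" -- these can fall below 0 or above 1
range(fitted(lpm))
@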

There is a wide range of applications with binary outcomes where such a model comes in handy. Whether someone attends college, gets a job, or defaults on a loan, whether an email is spam, or whether an image depicts a cat are all binary-outcome questions. And linear regression is not well suited to answer them.

Logistic regression (aka the logit model) is the most popular model designed for exactly this type of task: tasks with binary outcomes. ``Binary outcome'' means that these questions can only have two answers: ``0'' or ``1'', ``true'' or ``false'', ``cat'' or ``not a cat''. This makes it distinct from linear regression, which is designed to model continuous outcomes, i.e. outcomes that can take all sorts of values. Whether the outcome is numeric or something else plays almost no role for logistic regression: we can always transform the two possible outcomes into ``0'' and ``1''. This is what we did above when we coded receiving benefits as ``1'' and not receiving benefits as ``0''.

Mathematically, logistic regression is essentially a transformation of the linear regression model that is interpreted as probability. The transformation is done using the \emph{logistic function} (aka the \emph{sigmoid function})
\[\begin{equation}
  \label{eq:logistic-function}
  \Lambda(\eta) = \frac{\me^{\eta}}{\me^{\eta} + 1} = \frac{1}{1 + \me^{-\eta}}.
\end{equation}\]
Here one can understand \(\eta\) as the ``output'' of linear regression, and \(\Lambda(\eta)\) as the logistic transformation of \(\eta\), an S-shaped curve. It has two properties that make it a perfect fit for probability modeling: its value always stays between 0 and 1, and it increases monotonically in \(\eta\).

More specifically, \(\eta\) in the logistic regression formula is not called ``linear regression output'' but \emph{link}, \emph{linear predictor} or \emph{linear index}. But it is calculated in exactly the same way as the predicted value of linear regression: \(\eta_{i} = \beta_{0} + \beta_{1} \cdot x_{i}\) in the case of a single explanatory variable, or in vectorized form \(\eta_{i} = \vec{\beta}^{\transpose} \cdot \vec{x}_{i}\) in the case of multiple explanatory variables. So we can write the logistic probability in a slightly longer form as
\[\begin{equation}
  \label{eq:logistic-probability}
  \Lambda(\vec{x}_{i}) = \frac{1}{1 + \me^{-\vec{\beta}^{\transpose} \cdot \vec{x}_{i}}}.
\end{equation}\]
This is the expression for the probability that the outcome is ``1'' for given values of \(\vec{x}\). For completeness, we state it once again:
\[\begin{equation}
  \Pr(Y_{i} = 1|\vec{x}_{i}) = \Lambda(\vec{\beta}^{\transpose} \cdot \vec{x}_{i}) = \frac{1}{1 + \me^{-\vec{\beta}^{\transpose} \cdot \vec{x}_{i}}}.
\end{equation}\]

This must be understood as the rule to compute the probability that the outcome is \(Y=1\) when the value of the explanatory variable is \(\vec{x}\). Exactly as in the case of linear regression, we have to find the parameter vector \(\vec{\beta}\) that gives the ``best'' fit to the data.
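In R, the logistic function is available as plogis(); a quick sketch that checks it against the formula above:

<<logistic-function-sketch, eval=FALSE>>=
## The logistic (sigmoid) function written out by hand ...
Lambda <- function(eta) 1 / (1 + exp(-eta))

## ... and the built-in equivalent
Lambda(c(-5, 0, 2))   # approximately 0.0067 0.5000 0.8808
plogis(c(-5, 0, 2))   # the same values
@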

Note another important difference between the logistic and linear regression models: the logistic regression equation does not contain an error term while the linear regression equation does. This is because in the case of linear regression we model the outcome value, and we need an error term to account for the fact that the modeled and actual values almost always differ. But in the case of logistic regression we model a probability, which already allows the event to happen or not happen. Probability describes a process that is stochastic to begin with, so we do not need an additional error term.

Let us demonstrate these calculations using the treatment data. But before we can calculate anything, we have to specify which event we are modeling: the probability of treatment or of non-treatment? In this case it seems more natural to model the probability of treatment, \(\Pr(T=1)\), instead of \(\Pr(T=0)\). It is often useful to model the probability of the ``rare'' events, or the probability of ``interventions''. Treatment checks both boxes here, as only \(\sim 7\%\) of the cases in the data are treated, and treatment is a more ``active'' process than non-treatment. But both approaches are equally valid; one just has to make a decision and stick with it.

As we look at how the treatment probability depends on age, we have a single explanatory variable \(x\), namely age, and we can write
\[\begin{equation}
  \label{eq:logit-treatment-age-link}
  \begin{split}
  \eta_{i} &= \beta_{0} + \beta_{1} \cdot \mathit{age}_{i} \\[2ex]
  \Pr(T_{i} = 1) &= \frac{1}{1 + \me^{-\eta_{i}}} = \frac{1}{1 + \me^{-\beta_{0} - \beta_{1} \cdot \mathit{age}_{i}}}.
  \end{split}
\end{equation}\]
In order to actually calculate the probability of treatment, we have to pick values for \(\beta_{0}\) and \(\beta_{1}\).

<<>>=
## Need this below
pct30 <- treatment[age == 30, mean(treat)]
@
For instance, let’s just guess the values \(0\) and \(-0.1\) for \(\beta_{0}\) and \(\beta_{1}\) respectively, and compute the participation probability for a 30-year-old person. We have
\[\begin{equation}
  \label{eq:example-participation-probability-30-year-old}
  \Pr(T=1|\mathit{age}= 30) = \frac{1}{1 + \me^{-\beta_{0} - \beta_{1} \cdot \mathit{age}}} = \frac{1}{1 + \me^{-0 + 0.1 \cdot 30}} = \frac{1}{1 + \me^{3}} \approx 0.047.
\end{equation}\]
So our model, given this choice of parameters, predicts that roughly 5% of 30-year-olds will participate. The actual number in the data is \(\Sexpr{round(100*pct30, 1)}\)%. The figure below shows how the modeled participation probability depends on age for three different sets of parameters. It suggests that out of the three combinations displayed there, the one we calculated above, \((0, -0.1)\) (blue curve), is close to the actual data. The red curve \((0, 0.05)\) gets the age dependency completely wrong, and the green curve \((0,-0.05)\) suggests participation probabilities that are too high. But it is hard to select a good combination of parameters just by visual inspection, even for this simple case with only a single explanatory variable. The best set of parameters for logistic regression is usually computed using the Maximum Likelihood method. The corresponding probability curve is shown as the dashed black curve.
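The hand calculation above can be verified directly in R, for instance:

<<prob-check-sketch, eval=FALSE>>=
## Participation probability of a 30-year-old with the guessed
## parameters beta0 = 0 and beta1 = -0.1
b0 <- 0
b1 <- -0.1
1 / (1 + exp(-(b0 + b1 * 30)))   # approximately 0.047
plogis(b0 + b1 * 30)             # the same
@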

\begin{figure}[ht]
<<message=FALSE>>=
m <- glm(treat ~ age, data=treatment, family=binomial())
f <- function(x, b0, b1) plogis(b0 + b1*x)
palette <- scales::hue_pal()(3) %>%
   c("black")
ggplot(treatment, aes(age, as.numeric(treat))) +
   geom_jitter(height=0.04, width=0.3, size=0.5, col="skyblue2", alpha=0.5) +
   geom_function(fun = f, args=list(b0=0, b1=0.05), col=palette[1]) +
   geom_function(fun = f, args=list(b0=0, b1=-0.05), col=palette[2]) +
   geom_function(fun = f, args=list(b0=0, b1=-0.1), col=palette[3]) +
   ## Mark the ML estimate with a dashed black curve
   geom_function(fun = f, args=list(b0=coef(m)[1], b1=coef(m)[2]),
                 col=palette[4], linetype="dashed") +
   labs(y="Participation", x="Age")
@

\end{figure}

When we compute the best possible coefficients (the dashed black curve in the figure above), we get the following results:
<<glm, results="asis">>=
summary(m) %>%
   xtable() %>%
   print(booktabs=TRUE, include.rownames=TRUE,
         sanitize.rownames.function=latexNames)
@
The results table, as provided by common software packages, looks rather similar to the linear regression table. We see similar columns for the estimates, standard errors, \(z\)-values and \(p\)-values (obviously, different software packages provide somewhat different output). The meaning of the parameters is rather similar to that of linear regression, with two main differences. First, the interpretation of the logistic coefficients is quite different from that of the linear regression coefficients, so it is explained in the next section.

Second, instead of \(t\)-values, logistic regression estimates are typically reported with \(z\)-values. From a practical standpoint these are fairly similar; we are just concerned with critical \(z\)-values instead of critical \(t\)-values (for the 5% significance level the critical \(z\)-value is 1.96). In a similar fashion, the \(z\)-value measures the distance between the estimated coefficient and the \(H_{0}\) value, and in exactly the same way, the software normally assumes \(H_{0}: \beta = 0\). The difference between \(z\)- and \(t\)-values lies primarily in the assumptions: in the case of linear regression, for the \(t\)-values to be correct, the error term \(\epsilon\) must be normally distributed; in logistic regression, for the \(z\)-values to be correct, the sample size must be large.
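As a small illustration of how the reported \(z\)- and \(p\)-values are related (using made-up numbers for the estimate and its standard error, not values from the model above):

<<z-values-sketch, eval=FALSE>>=
## Critical z-value for a two-sided test at the 5% significance level
qnorm(0.975)            # 1.96

## z- and p-value for a hypothetical estimate and standard error
est <- -0.08            # hypothetical coefficient estimate
se  <- 0.02             # hypothetical standard error
z   <- (est - 0)/se     # distance from H0: beta = 0, in standard errors
2*pnorm(-abs(z))        # two-sided p-value
@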

Logistic regression is in many ways similar to linear regression, including being an interpretable model. Unfortunately, the interpretation of logistic regression results is more complicated than in the case of linear regression. There are two related reasons for that. First, logistic regression is a non-linear model, and hence the slope depends on the values of the explanatory variables. And second, because the slope depends on the explanatory variables, we cannot interpret the parameters \(\beta_{0}\) and \(\beta_{1}\) directly in terms of probability.

There are two popular ways to overcome these limitations: marginal effects and odds ratios.

The marginal effect (ME) is the slope of the logistic curve in the figure where we have probability on the \(y\)-axis and the explanatory variable \(x\) (not the link \(\eta\)) on the \(x\)-axis. The marginal effect answers the same question as the slope \(\beta_{1}\) in the case of linear regression: how much does the outcome change when \(x\) increases by one unit? In the example above, the ME answers the question of how much more likely it is to participate for those who are one year older. In the case of multiple logistic regression we should also add the phrase ``everything else equal''. So, in this sense marginal effects are very similar to linear regression coefficients. However, there are two major differences, both related to the fact that we now have a non-linear model: the marginal effect is not constant but depends on the values of the explanatory variables, and it cannot be read directly off the estimated coefficients but must be computed separately.

As the marginal effect is just a slope, we can compute it by taking the derivative of the logistic probability. For instance, in order to compute the marginal effect of age in the example above, we take the derivative of the treatment probability:
\[\begin{equation}
  \label{eq:logit-age-marginal-effect}
  \pderiv{\mathit{age}} \frac{1}{1 + \me^{-\eta}} = \frac{\me^{-\eta}}{ (1 + \me^{-\eta})^{2} } \cdot \beta_{1}
\end{equation}\]
where \(\eta = \beta_{0} + \beta_{1} \cdot \mathit{age}\). This is straightforward to compute, but normally we let statistical software do the work.
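In R, the derivative above is the density of the logistic distribution, dlogis(). A sketch of the marginal effect of age at age 30, using the fitted model m from the figure chunk above:

<<me-sketch, eval=FALSE>>=
## Marginal effect of age at age 30:
## dlogis(eta) equals exp(-eta)/(1 + exp(-eta))^2
eta30 <- coef(m)["(Intercept)"] + coef(m)["age"]*30
dlogis(eta30) * coef(m)["age"]

## Numerical check: change in predicted probability for a tiny age step
h <- 1e-6
(plogis(coef(m)["(Intercept)"] + coef(m)["age"]*(30 + h)) -
   plogis(eta30))/h
@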

To see the meaning of marginal effects, consider the logistic curve as a function of the link \(\eta\). Its slope differs at different points: for instance, at \(\eta_{1} = -2.5\) the slope is \(0.069\), and at \(\eta_{2} = 0.5\) the slope is \(0.23\). These numbers, \(0.069\) and \(0.23\), are the marginal effects of the link \(\eta\). But we are interested in the marginal effect of age instead: how much more likely it is to participate for those who are one year older. Now we have to take into account that \(\eta\) depends on \(x\) as \(\eta = \beta_{0} + \beta_{1} x\). Hence a one unit larger \(x\) means a \(\beta_{1}\) units larger \(\eta\), and so the marginal effect of \(x\) is just the marginal effect of \(\eta\), multiplied by \(\beta_{1}\).

As marginal effects depend on \(\vec{x}\), we cannot provide a single marginal effect that applies universally. Obviously, when \(\eta\) is very small or very large, the effect will also be very small, while \(\eta\) values near 0 are associated with larger effects. Typically, one of three options is reported: the average marginal effect (AME), the marginal effect at the mean values of the explanatory variables, or the marginal effect at certain representative values. An example of marginal-effect output is in the table below:

The basics of this table are quite similar to those of the logistic coefficients table above. AME is the average marginal effect: the software computes the marginal effect for every individual in the data and takes the average. SE stands for the standard error of the AME, \(z\) and \(p\) are the corresponding \(z\)- and \(p\)-values, and the last two columns are the confidence interval for the AME.
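The AME can also be computed by hand by averaging the individual marginal effects over everyone in the data; dedicated packages (e.g. margins or marginaleffects) report the same quantity together with its standard error. A sketch using the fitted model m:

<<ame-sketch, eval=FALSE>>=
## Average marginal effect (AME) of age:
## average dlogis(linear predictor) * beta_age over all observations
eta <- predict(m, type = "link")   # linear predictor for each person
mean(dlogis(eta) * coef(m)["age"])
@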

The AME is directly interpretable, in a similar fashion to the \(\beta\)-s in linear regression. The number \(\Sexpr{round(ame, 4)}\) means that if a person is one year older, their participation probability changes, on average, by \(\Sexpr{round(ame, 4)}\). This can be phrased somewhat better using percentage points: one additional year of age changes the participation probability by \(\Sexpr{round(100*ame, 2)}\) percentage points on average.

And as explained above, if we are working with multiple logistic regression, we should add ``everything else equal'' to the sentence above.

Another popular way to interpret logistic regression results is by using odds ratios. The odds ratio is simply the ratio of the probability of one outcome to the probability of the other; in the example above it is the probability of participation over the probability of non-participation,
\[\begin{equation}
  \label{eq:odds-ratio}
  r = \frac{\Pr(Y=1|\vec{x})}{\Pr(Y=0|\vec{x})}.
\end{equation}\]
If we compute the probabilities using the sample averages, we get
\[\begin{equation}
  \label{eq:odds-ratio-participation}
  r = \frac{N_{y=1}}{N_{y=0}} = \frac{ \Sexpr{sum(treatment$treat)} }{ \Sexpr{sum(!treatment$treat)} } = \Sexpr{ round(sum(treatment$treat)/sum(!treatment$treat), 3) }.
\end{equation}\]
Odds ratios are popular for describing the probabilities of certain kinds of events, such as winning chances in horse races. But outside of such contexts these ratios are not widely used, and hence people tend not to understand the values well.

It turns out that logit coefficients are directly interpretable as effects on the logarithm of the odds ratio, the log-odds. From the logistic probability formula we can express \(\me^{\vec{\beta}^{\transpose} \cdot \vec{x}_{i}}\) as
\[\begin{equation}
  \label{eq:logit-odds}
  \me^{\vec{\beta}^{\transpose} \cdot \vec{x}_{i}} = \frac{\Pr(Y=1|\vec{x})}{1 - \Pr(Y=1|\vec{x})} = \frac{\Pr(Y=1|\vec{x})}{\Pr(Y=0|\vec{x})}.
\end{equation}\]
This is exactly the odds ratio.

We can use this idea to find the effect on the odds ratio. Consider two vectors of explanatory variables, \(\vec{x}_{1}\) and \(\vec{x}_{2}\). The latter is equal to the former, except that one of the components of \(\vec{x}_{2}\), \(x_{2i}\), is larger by one unit than \(x_{1i}\). So while \(\vec{x}_{1} = (1, x_{11}, x_{12}, \dots, x_{1i}, \dots, x_{1K})\), we have \(\vec{x}_{2} = (1, x_{11}, x_{12}, \dots, (x_{1i} + 1), \dots, x_{1K})\). Hence the odds ratio for case \(\vec{x}_{2}\) is
\[\begin{multline}
  \label{eq:odds-ratio-effect}
  \frac{\Pr(Y=1|\vec{x}_{2})}{\Pr(Y=0|\vec{x}_{2})} = \me^{ \beta_{0} + \beta_{1}\,x_{11} + \beta_{2}\,x_{12} + \dots + \beta_{i}\,(x_{1i} + 1) + \dots + \beta_{K}\, x_{1K} } =\\= \me^{ \beta_{0} + \beta_{1}\,x_{11} + \beta_{2}\,x_{12} + \dots + \beta_{i}\,x_{1i} + \dots + \beta_{K}\, x_{1K} } \cdot \me^{\beta_{i}} = \frac{\Pr(Y=1|\vec{x}_{1})}{\Pr(Y=0|\vec{x}_{1})} \cdot \me^{\beta_{i}}.
\end{multline}\]
So \(\me^{\beta}\) describes the multiplicative effect on the odds ratio: if \(x\) is larger by one unit, the odds ratio is larger by a factor of \(\me^{\beta}\).
<<include=FALSE>>=
ca <- coef(m)["age"]
eca <- exp(ca)
ecaPct <- eca*100
ecaDiff <- 100 - ecaPct
@
For instance, the age effect in the model above is \(\Sexpr{round(ca, 3)}\). Hence the odds ratio effect is
\[\begin{equation*}
  \me^{\Sexpr{round(ca, 3)}} = \Sexpr{round(eca, 3)}.
\end{equation*}\]
This means that the odds of one-year-older individuals are \(\Sexpr{round(eca, 3)}\) times those of the younger individuals. Or alternatively, one-year-older individuals have \(\Sexpr{round(ecaDiff, 1)}\)% lower odds to participate. Note that unlike in the case of marginal effects, this number, \(\Sexpr{round(ecaDiff, 1)}\)%, is measured in percent (of the baseline odds), not in percentage points.
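In R, the multiplicative odds-ratio effects are obtained by exponentiating the fitted coefficients, for instance:

<<or-sketch, eval=FALSE>>=
## Odds-ratio effects: exponentiated logit coefficients
exp(coef(m))

## Percentage change in the odds for one more year of age
(exp(coef(m)["age"]) - 1)*100
@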

Odds ratios have two advantages over marginal effects: they are easier to compute (you only need to take the exponent of a coefficient) and they are stable, as the odds-ratio effects are constant and independent of personal characteristics. This contrasts with marginal effects, which depend on the values of the explanatory variables. But as odds ratios are harder to understand, and as software to compute marginal effects is nowadays easily available, odds ratios have become less popular.