Chapter 11 Regression Models

This chapter discusses linear and logistic regression. These are among the most basic and most widely used statistical models for describing relationships in data. Linear regression is used for continuous outcome variables, and logistic regression for binary outcome variables. Below, we discuss how to implement and evaluate these models in R. We assume that you are familiar with the theoretical side of the models; here we only discuss the implementation in R.

11.1 Linear regression

This section describes linear regression. First, we discuss how to optimize the parameter values manually, but most of the section is devoted to built-in R modeling tools, namely formulas that you can use with dedicated model functions such as lm().

11.1.1 Working With Iris Virginica Data

We demonstrate the usage of linear regression (the lm() function) with the iris data. The iris data, first used by Ronald Fisher in 1936, is included in R as the built-in dataset iris. It contains sepal and petal measurements for three species of iris: setosa, virginica, and versicolor. See Section 14.2 for more about the data and the flower parts. More specifically, we analyze the relationship between petal width and length for iris virginica only; focusing on a single species keeps the data more homogeneous.

We start by filtering out the virginica observations (see the dplyr package for more details):

library(dplyr)    # for %>% and filter()
library(ggplot2)  # for the plots below
data(iris)
virginica <- iris %>%
   filter(Species == "virginica")
head(virginica, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          6.3         3.3          6.0         2.5 virginica
## 2          5.8         2.7          5.1         1.9 virginica
## 3          7.1         3.0          5.9         2.1 virginica

The table above displays the first few rows of the data. We see that these petals are roughly 5-6 cm long and around 2 cm wide.


Next, let’s plot petal length versus petal width to get an idea of what kind of relationship we can expect:

ggplot(
   virginica,
   aes(Petal.Length, Petal.Width)) +
   geom_jitter() +
   labs(x = "Petal length (cm)",
        y = "Petal width (cm)")

Not surprisingly, petal width and length are related: longer petals tend to be wider, but the relationship is quite noisy. Just by looking at the figure, we can also guess that a straight line will be a good enough way to summarize this relationship; there is no reason to pick anything more complex, such as a curve.

Note that we have added a little jitter to the data because the measurements have limited precision (0.1 cm) and some data points would otherwise overlap exactly.
See Plotting Data: The ggplot way for more details about ggplot.

Next, let’s describe the relationship using linear regression.

11.1.2 Manually Compute and Minimize \(SSE\)

While statistical software provides easy ways to solve the linear regression problem, it is instructive to compute and minimize the sum of squared errors, \(SSE\), manually. As a reminder, we define \(SSE\) as \[\begin{equation} SSE(\beta_{0}, \beta_{1}) = \sum_{i=1}^{N} (y_{i} - \hat y_{i} )^{2} \end{equation}\] where \(\hat y_i = \beta_0 + \beta_1 \cdot x_i\) is the predicted value for case \(i\). It depends on the corresponding explanatory variable \(x_i\) and the unknown coefficients \(\beta_0\) and \(\beta_1\). Solving the linear regression problem means finding the vector \((\beta_0, \beta_1)\) that makes \(SSE\) as small as possible.

Attempt a. We can start by just picking arbitrary values for \(\beta_0\) and \(\beta_1\) in a reasonable range. Take \(\beta_0 = 0\) and \(\beta_1 = 0.5\), i.e. assume that petals of zero length have zero width (a reasonable assumption), and that petals that are 1 cm longer are 0.5 cm wider. This may or may not be correct, but at least it does not feel outrageously wrong. Let us compute \(SSE\) for these values (the corresponding regression line is plotted below):

## define x and y for clarity
x <- virginica$Petal.Length
y <- virginica$Petal.Width
## define parameter values
beta0 <- 0  
beta1 <- 0.5
## Compute predicted width
yHat <- beta0 + beta1*x
## Compute SSE by its definition
SSE <- sum((y - yHat)^2)
SSE
## [1] 33.16

The code does the following: first, it defines \(x\) and \(y\) for clarity. Thereafter, it assigns the values we picked above to \(\beta_0\) and \(\beta_1\). Next, it computes \(\hat y\), yHat, following the definition of the linear regression prediction above. Finally, it computes SSE following its definition.

The value 33.16 does not tell us much by itself. But we can compare it with \(SSE\) values computed for different \(\beta\)-s and hence tell which parameter vector is better. See below for the plot.

Attempt b. Now use parameters \(\beta_0 = 0\) and \(\beta_1 = 0.4\):

beta0 <- 0
beta1 <- 0.4  # only change one parameter value
yHat <- beta0 + beta1*x
SSE <- sum((y - yHat)^2)
SSE
## [1] 6.0676

The new result, 6.0676, is much better than the previous one, and hence \((0, 0.4)\) is a better vector of parameters than \((0, 0.5)\). (See below for the plot.)

Attempt c. Set \(\beta_0 = -0.1\) while keeping our previous best value \(\beta_1 = 0.4\):

beta0 <- -0.1
beta1 <- 0.4
yHat <- beta0 + beta1*x
SSE <- sum((y - yHat)^2)
SSE
## [1] 4.6196

This \(SSE\), 4.6196, is even smaller, indicating that we have found an even better combination of \(\beta_0\) and \(\beta_1\).

Exercise 11.1 It is a good exercise to continue the manual optimization further. Repeat this procedure many times and see how small an \(SSE\) you can get. I recommend making a function sse(beta0, beta1) to simplify the calculations.

You may consider the following coordinate descent approach: adjust one parameter (either \(\beta_0\) or \(\beta_1\)) while keeping the other constant until you can no longer find a smaller \(SSE\), and thereafter adjust the other parameter. You can repeat the procedure many times over.
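
For instance, a minimal version of such a helper might look like this (a sketch, assuming x and y are defined as above):

sse <- function(beta0, beta1) {
   ## predicted widths for the given parameters
   yHat <- beta0 + beta1*x
   ## sum of squared errors
   sum((y - yHat)^2)
}
sse(-0.1, 0.4)   # reproduces attempt c: 4.6196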

See a solution


All three attempts depicted as the corresponding prediction lines.

Finally, let’s plot all three attempted regression lines:

## Create data frame of the
## three attempted parameters
lines <- data.frame(
   attempt = c("a", "b", "c"),
   intercept = c(0, 0, -0.1),
   slope = c(0.5, 0.4, 0.4))
## Make a scatterplot with
## the predicted lines
ggplot(virginica,
       aes(Petal.Length, Petal.Width)) +
   geom_jitter() +
   labs(x = "Petal length (cm)",
        y = "Petal width (cm)") +
   geom_abline(data=lines,
               aes(intercept=intercept,
                   slope=slope,
                   col=attempt))

The code adds straight lines to the previous plot by specifying their intercept (\(\beta_0\)) and slope (\(\beta_1\)). The lines are distinguished by color, denoting our three attempts, labeled a, b and c. The figure clearly shows that the last attempt, c, is the best one, but it still seems to predict values that are too large. We should perhaps make the intercept \(\beta_0\) even smaller to move the predicted values into the middle of the point cloud.

Exercise 11.2 Use the same virginica data. However, now analyze the relationship between sepal width and sepal length, not petal width and petal length.

  • Optimize the parameters \((\beta_0, \beta_1)\) manually until the resulting \(SSE < 5\).

See the solution

11.1.3 Using R modeling tools

The R language is designed for statistical work and includes several unique tools to help with modeling. Here we mainly discuss formulas, special objects in the R language designed for specifying statistical models. Internally, the formula and the provided dataset are used to construct the design matrix, the data structure used for the actual computation. In this section we discuss formulas in the linear regression context, using the lm() function.
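
If you are curious what the design matrix looks like, you can inspect it with the model.matrix() function; a quick sketch using the virginica data from above:

## design matrix for a two-variable model (first rows only)
head(model.matrix(Petal.Width ~ Petal.Length + Sepal.Width,
                  data = virginica), 3)

The first column is the intercept, a column of ones; the remaining columns are the explanatory variables.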

11.1.3.1 lm and formulas

Linear regression in R is typically run with the lm() function. In its basic form, it takes two arguments: a formula and data. The formula defines the outcome and explanatory variables in the form

outcome ~ var1 + var2 + ...

where outcome is the name of the dependent variable (without quotes), and var1, var2, etc. are the names of the independent variables (without quotes). data is the dataset where these variables “live”. If they are not found in the dataset, R looks for the variables in the workspace (and in other parent environments). This formula is equivalent to the mathematical notation \[\begin{equation} \mathit{outcome}_i = \beta_0 + \beta_1 \cdot \mathit{var1}_i + \beta_2 \cdot \mathit{var2}_i + \epsilon_i \end{equation}\] where \(i\) counts the observations.
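
As a small illustration of this lookup rule, the following sketch fits a model from plain workspace vectors, without a data argument (len and wid are ad hoc names, not part of the dataset):

len <- virginica$Petal.Length
wid <- virginica$Petal.Width
lm(wid ~ len)   # variables are found in the workspace, not in a data frame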

So the example regression model with petal width and length of virginica flower, \[\begin{equation} \text{petal width}_i = \beta_0 + \beta_1 \cdot \text{petal length}_i + \epsilon_i \end{equation}\] can be run as

m <- lm(Petal.Width ~ Petal.Length, data=virginica)

Remember the basic formula:

outcome ~ explanatory + explanatory...

So the outcome variable goes to the left and the explanatory variables to the right; they must be separated by a tilde, ~.

This code fits the model and stores the result in the variable m. We can get a nice output table using the summary() method:

summary(m)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length, data = virginica)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6337 -0.1608  0.0041  0.1812  0.4503 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    1.1360     0.3794   2.995  0.00434 **
## Petal.Length   0.1603     0.0680   2.357  0.02254 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2627 on 48 degrees of freedom
## Multiple R-squared:  0.1038, Adjusted R-squared:  0.08508 
## F-statistic: 5.557 on 1 and 48 DF,  p-value: 0.02254

The output displays all the estimates, t- and p-values, statistical significance and some diagnostic data. In particular, we can see that \(R^2 = 0.104\), not an impressive result. This confirms our visual impression that the point cloud is far from a straight line.

One can also extract the coefficients alone, without the rest of the table, with the coef() method:

coef(m)
##  (Intercept) Petal.Length 
##     1.136031     0.160297

This returns a named vector of coefficients, which is handy when one needs to use the estimated values in code. The confint() method displays the corresponding confidence intervals:

confint(m)
##                   2.5 %    97.5 %
## (Intercept)  0.37326439 1.8987982
## Petal.Length 0.02357139 0.2970225

We can use the formula approach and include more explanatory variables, e.g. we can include all the information about sepals in order to estimate the petal width as \[\begin{equation} \text{petal width}_i = \beta_0 + \beta_1 \cdot \text{petal length}_i + \beta_2 \cdot \text{sepal length}_i + \beta_3 \cdot \text{sepal width}_i + \epsilon_i \end{equation}\]

m <- lm(Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width,
        data=virginica)
summary(m)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width, 
##     data = virginica)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51430 -0.17403  0.00328  0.16883  0.44503 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.50026    0.39454   1.268 0.211192    
## Petal.Length  0.14968    0.12084   1.239 0.221751    
## Sepal.Length -0.09258    0.10803  -0.857 0.395875    
## Sepal.Width   0.43869    0.11698   3.750 0.000493 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2348 on 46 degrees of freedom
## Multiple R-squared:  0.3136, Adjusted R-squared:  0.2689 
## F-statistic: 7.006 on 3 and 46 DF,  p-value: 0.00056

We can see that \(R^2\) almost tripled compared to the previous model; it also looks like the only significant predictor of petal width is sepal width, a variable we did not include in the previous model.

More importantly, we see that our formula included four terms: three flower characteristics and the intercept. Note that the plus sign, +, in the formula is not mathematical addition; after all, we were not adding together these lengths and widths. Instead, + means to include the variable in the model. So we simply included petal length and the sepal dimensions in the model in the way linear regression considers appropriate. In a similar fashion, the minus sign, -, means to exclude a variable from the model. For instance, we can exclude the intercept, denoted by 1, from the first model:

m <- lm(Petal.Width ~ Petal.Length - 1, data=virginica)
summary(m)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length - 1, data = virginica)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63253 -0.22492  0.05783  0.18080  0.54895 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Petal.Length 0.362951   0.007181   50.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2833 on 49 degrees of freedom
## Multiple R-squared:  0.9812, Adjusted R-squared:  0.9808 
## F-statistic:  2555 on 1 and 49 DF,  p-value: < 2.2e-16

Note that \(R^2\) is unreliable when the model does not contain an intercept.

Finally, the asterisk * between two variables, as in x * z, means to include x, z, and their interaction term \(x\times z\) in the model. For instance, if we want to estimate \[\begin{equation} \text{petal width}_i = \beta_0 + \beta_1 \cdot \text{petal length}_i + \beta_2 \cdot \text{sepal width}_i + \beta_3 \cdot \text{petal length}_i \times \text{sepal width}_i + \epsilon_i \end{equation}\] we can do it as

m <- lm(Petal.Width ~ Petal.Length*Sepal.Width, data=virginica)
summary(m)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length * Sepal.Width, data = virginica)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47266 -0.15242  0.02304  0.15318  0.47864 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)  
## (Intercept)               -4.3377     2.6763  -1.621   0.1119  
## Petal.Length               0.8914     0.4636   1.923   0.0607 .
## Sepal.Width                2.0713     0.9249   2.240   0.0300 *
## Petal.Length:Sepal.Width  -0.2862     0.1586  -1.804   0.0777 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2288 on 46 degrees of freedom
## Multiple R-squared:  0.3487, Adjusted R-squared:  0.3063 
## F-statistic: 8.211 on 3 and 46 DF,  p-value: 0.0001753

The table shows four coefficients: the intercept, petal length, sepal width, and the interaction effect of petal length and sepal width. The latter is denoted with a colon, as Petal.Length:Sepal.Width.
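
The colon can also be used directly in a formula. As a sketch, the following model includes only the interaction term, without the two main effects (rarely what you want, but it illustrates the notation):

## ':' alone adds just the interaction term (plus the intercept)
m2 <- lm(Petal.Width ~ Petal.Length:Sepal.Width, data = virginica)
coef(m2)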

These three symbols, +, -, and *, cover the modeling tools that are most frequently needed for linear and logistic regression modeling. Next we discuss a few other tools that are occasionally very handy.

11.1.4 Categorical variables and factor

Categorical variables are a special kind of variable whose values are not numbers but one of a small set of categories. See Section 2.1.7 for more.

Categorical variables are handled fairly transparently by model formulas. In particular, all text (character) variables are automatically treated as categorical, with the reference category being the first one in alphabetical order. However, sometimes the categories are coded as numbers. In that case, the models treat the categories as ordinary numbers like any others, i.e. values one can do standard arithmetic with. If this is not desired, one has to explicitly convert the numbers to categories with the factor() function.
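
A toy sketch of the conversion itself (grp is just a made-up numeric code):

grp <- c(2, 1, 3, 1)
factor(grp)
## the values become the levels "1", "2" and "3";
## the first level is used as the reference category in models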

As a more realistic example, let’s use the orange tree data to model tree size as a function of age for different trees. The dataset looks like this:

library(readr)   # for read_delim()
orange <- read_delim("../data/orange-trees.csv")
orange %>%
   slice(c(1, 2, 7, 8))
## # A tibble: 4 × 3
##    tree   age circumference
##   <dbl> <dbl>         <dbl>
## 1     1   118            30
## 2     1   484            58
## 3     1  1582           145
## 4     2   118            33

In particular, note that tree is a numeric variable.

We want to model tree size in a way that allows all trees to grow at the same rate but to have different initial sizes. Formally, the size of tree \(i\) at age \(t\) will be: \[\begin{equation} \text{circumference}_{it} = \beta_0 + \beta_1 \cdot \text{tree}_{i} + \beta_2 \cdot \text{age}_{t} + \epsilon_{it}. \end{equation}\] This means we assume all trees grow at the same rate \(\beta_2\), but they may have different initial sizes \(\beta_0 + \beta_1 \cdot \text{tree}_i\). Obviously, the tree id \(\text{tree}_i\) is a category. But if we model it as

lm(circumference ~ tree + age, data = orange) %>%
   summary()
## 
## Call:
## lm(formula = circumference ~ tree + age, data = orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.310 -15.599  -4.168  20.197  42.397 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.256793  12.132031   0.763    0.451    
## tree         2.714286   2.840950   0.955    0.347    
## age          0.106770   0.008288  12.883  3.3e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.77 on 32 degrees of freedom
## Multiple R-squared:  0.8391, Adjusted R-squared:  0.8291 
## F-statistic: 83.44 on 2 and 32 DF,  p-value: 2.017e-13

The problem, however, is that \(\text{tree}\) enters as a number, and hence we only get a single coefficient \(\beta_1\) for it. This can be interpreted as “if the tree id is larger by one unit, then the circumference is larger by 2.7 mm”. This does not make much sense, as there is no inherent relationship between a tree’s id and its circumference.

As a solution, we need to convert it to a categorical:

lm(circumference ~ factor(tree) + age,
   data = orange) %>%
   summary()
## 
## Call:
## lm(formula = circumference ~ factor(tree) + age, data = orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.505  -8.790   3.737   7.650  21.859 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.113936   7.572732   0.147 0.884072    
## factor(tree)2 35.714286   8.157252   4.378 0.000142 ***
## factor(tree)3 -5.571429   8.157252  -0.683 0.500025    
## factor(tree)4 39.714286   8.157252   4.869 3.65e-05 ***
## factor(tree)5 11.571429   8.157252   1.419 0.166690    
## age            0.106770   0.005321  20.066  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.26 on 29 degrees of freedom
## Multiple R-squared:  0.9399, Adjusted R-squared:  0.9295 
## F-statistic:  90.7 on 5 and 29 DF,  p-value: < 2.2e-16

Now the tree ids, numbers 1-5, are treated as categories. Tree 1 is the reference category, and we can see that trees 2 and 4 are statistically significantly larger than tree 1.
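
If another reference category is more natural, it can be changed; a sketch using the relevel() function, here arbitrarily picking tree 3 as the reference:

## compare all trees against tree 3 instead of tree 1
lm(circumference ~ relevel(factor(tree), ref = "3") + age,
   data = orange) %>%
   summary()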

Exercise 11.3 Use the titanic data to model survival probability as \[\begin{equation} \text{survived}_{i} = \beta_0 + \beta_1 \cdot \text{pclass}_{i} + \beta_2 \cdot \text{sex}_{i} + \epsilon_{i}. \end{equation}\] Note that you are asked to use linear regression, not logistic regression, to model the probability. This is called a linear probability model.

Do it in two ways: first, use pclass as a number, and second, convert pclass to a categorical. Interpret the results in both cases, including the intercept.

See the solution

11.1.5 Using functions in formulas

Often we want to estimate models that include functions, such as logarithms or exponents, of certain variables. We demonstrate this using the Treatment data, a dataset about job training programs in the U.S. in the 1970s. Let’s analyze how post-training income (variable re78) depends on education (educ) and age.

However, when using a linear model to describe income, it is typically better to model its logarithm. So we want to estimate \[\begin{equation} \log(\mathit{re78}_i) = \beta_0 + \beta_1 \cdot \mathit{educ}_{i} + \beta_2 \cdot \mathit{age}_{i} + \epsilon_{i}. \end{equation}\] Fortunately, this is easy to achieve by just writing log(re78) ~ in the model formula. There is an additional issue though: many people earn no income, and hence their re78 is 0. We cannot compute the logarithm of 0, so we only look at those who actually earn a positive income:

treatment <- read_delim("../data/treatment.csv.bz2")
treatment %>%
   filter(re78 > 0) %>%
   lm(log(re78) ~ educ + age,
      data = .) %>%
   summary()
## 
## Call:
## lm(formula = log(re78) ~ educ + age, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5497 -0.2395  0.1326  0.4328  2.1902 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.827642   0.088168   88.78   <2e-16 ***
## educ        0.111117   0.005142   21.61   <2e-16 ***
## age         0.019198   0.001516   12.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7431 on 2341 degrees of freedom
## Multiple R-squared:  0.1908, Adjusted R-squared:  0.1901 
## F-statistic:   276 on 2 and 2341 DF,  p-value: < 2.2e-16

The result shows that one additional year of education is associated with roughly 11 percent higher income. Income also grows with age, but at a much slower pace. Finally, the \(R^2\) of the model is 0.19, a typical result for such regression models.
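
For a coefficient this small, log points are approximately percent; the exact implied change can be checked directly from the estimate in the table above:

## exact proportional change in income per additional year of education
exp(0.111117) - 1
## roughly 0.118, i.e. about 11.8 percent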

Exercise 11.4 Which is a better way to model the age dependency: \(\mathit{age}\) or \(\log \mathit{age}\)? Replace age with log(age) in the model above and see if it improves \(R^2\).

See the solution

But modeling the dependence on age through a single linear term \(\beta_2 \cdot \mathit{age}_{i}\) may be too restrictive. It assumes that income grows at the same pace each year, no matter whether the worker is young or old. In practice, income typically peaks in the fifties, and will either plateau or start falling before retirement. A common solution is to add a quadratic term in age, \(\beta_3 \cdot \mathit{age}_{i}^2\), to the model, so it will look like \[\begin{equation} \log(\mathit{re78}_i) = \beta_0 + \beta_1 \cdot \mathit{educ}_{i} + \beta_2 \cdot \mathit{age}_{i} + \beta_3 \cdot \mathit{age}_{i}^2 + \epsilon_{i}. \end{equation}\] Such a term can be included in the formula as I(age^2). Remember, ordinary mathematical operators like + and * have a different meaning in formulas. The I() function, however, preserves the ordinary meaning of the operators, so I(age^2) means to add an additional term \(\mathit{age}^2\):

treatment %>%
   filter(re78 > 0) %>%
   lm(log(re78) ~ educ + age + I(age^2),
      data = .) %>%
   summary()
## 
## Call:
## lm(formula = log(re78) ~ educ + age + I(age^2), data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6291 -0.2333  0.1229  0.4229  2.0966 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.5761289  0.2071765  31.742  < 2e-16 ***
## educ         0.1062689  0.0051469  20.647  < 2e-16 ***
## age          0.0989778  0.0120690   8.201 3.89e-16 ***
## I(age^2)    -0.0011108  0.0001667  -6.662 3.35e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7363 on 2340 degrees of freedom
## Multiple R-squared:  0.2059, Adjusted R-squared:  0.2049 
## F-statistic: 202.2 on 3 and 2340 DF,  p-value: < 2.2e-16

We see that the age term now has a much larger value than above, while the age^2 term is negative and highly significant. Hence the model captures the plateauing property described above: income initially grows with age, but flattens out after a certain number of years.
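
Because the fitted relationship is a parabola in age, the predicted log-income peaks where the derivative \(\beta_2 + 2 \beta_3 \cdot \mathit{age}\) equals zero, i.e. at \(\mathit{age} = -\beta_2/(2\beta_3)\). Plugging in the estimates above:

## age at which predicted log-income peaks
-0.0989778/(2*(-0.0011108))
## approximately 44.6 years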

11.1.6 Polynomial regression

When data are noisy, a linear fit, i.e. describing the relationship with a single straight line, is usually good enough. But sometimes it clearly misses the mark. A common way to accommodate a curved relationship is to use polynomial regression.


Hump-shaped relationship between age and income (re78) in Treatment data.

Here is an example from Treatment data (see Section 14.5). We start by showing the relationship between age and income:

treatment <- read_delim(
   "../data/treatment.csv.bz2")
ggplot(treatment,
       aes(age, re78)) +
   geom_jitter(size = 0.6) +
   geom_smooth(se = FALSE,
               col = "orangered2",
               linewidth = 2) +
   geom_smooth(se = FALSE,
               method = "lm",
               col = "dodgerblue2",
               linewidth = 2)

The plot shows two smoothing curves: a straight line (blue) and a curved fit (red). Of the two, the red curve better captures the steady income growth through the twenties and the slight decline in the fifties. The straight line, obviously, misses these effects.

This problem can be addressed by including higher polynomials of age in the model. R formulas normally assign a different meaning to common mathematical operators, such as + and *. However, you can use the function I() to insulate the mathematical operations from this special interpretation. In this example, we can estimate the model \[\begin{equation*} \mathit{re78}_i = \beta_0 + \beta_1 \cdot \mathit{age}_i + \beta_2 \cdot \mathit{age}_i^2 + \epsilon_i \end{equation*}\] as

m <- lm(re78 ~ age + I(age^2), data = treatment)
summary(m)
## 
## Call:
## lm(formula = re78 ~ age + I(age^2), data = treatment)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24251 -10596   -680   8106  99225 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18567.922   3887.278  -4.777 1.88e-06 ***
## age           2192.763    228.663   9.589  < 2e-16 ***
## I(age^2)       -28.073      3.131  -8.966  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15320 on 2672 degrees of freedom
## Multiple R-squared:  0.04016,    Adjusted R-squared:  0.03945 
## F-statistic:  55.9 on 2 and 2672 DF,  p-value: < 2.2e-16

The result shows that both \(\mathit{age}\) and \(\mathit{age}^2\) are statistically significant. The coefficient of age^2 is negative, hence we have a hump-shaped relationship.

Unfortunately, polynomial effects are hard to interpret directly, but they can be fairly easily visualized. First, we do it through explicit prediction, and thereafter using the geom_smooth() function.

In order to compute and plot the predicted relationship, you need to create a vector of ages. On this plot, age ranges from 17 to 55, so we can create a corresponding vector of ages and predict income for each age. Finally, for ggplot, all of this must be in a data frame:

Prediction line, computed and plotted separately

## predict for given age range
predictionData <- data.frame(age = 17:55)
predictionData$re78 <- predict(
   m, newdata = predictionData)
## plot
ggplot(treatment, aes(age, re78)) +
   geom_jitter(size = 0.6)  +
   geom_line(data = predictionData,
             col = "orangered2",
             linewidth = 2)

The result shows a hump-shaped relationship that captures both the low income early in the career and the decreasing income before retirement.

Alternatively, a similar curve can be plotted using geom_smooth(). You need to set arguments method = "lm" (to use linear regression) and formula = y ~ x + I(x^2) (to use quadratic regression):

Prediction curve plotted with geom_smooth()

## the same curve with geom_smooth()
ggplot(treatment, aes(age, re78)) +
   geom_jitter(size = 0.6)  +
   geom_smooth(se = FALSE,
               method = "lm",
               formula = y ~ x + I(x^2),
               col = "orangered2",
               linewidth = 2)

The result is similar to the one above. Note that the formula needs to be specified as y ~ x + I(x^2), even though the variables are called re78 and age. This is because ggplot here thinks in terms of the figure axes, not in terms of the data.

In practice, the second approach with geom_smooth() may be easier, but manual prediction as above allows you to use more complex models that geom_smooth() cannot handle.
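
As a sketch of a model that geom_smooth() cannot draw directly, we can predict from the earlier education-and-age model while holding education fixed (here arbitrarily at 12 years):

## re-fit the log-income model from above
m2 <- lm(log(re78) ~ educ + age + I(age^2),
         data = filter(treatment, re78 > 0))
## predict over the age range, holding educ fixed at 12 years
predictionData <- data.frame(age = 17:55, educ = 12)
predictionData$logre78 <- predict(m2, newdata = predictionData)
head(predictionData, 3)

The resulting predictions can then be added to a plot with geom_line(), exactly as above.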

11.2 Summary

Main modeling tools:

Basic formula

outcome ~ var1 + var2 + ...
  • +: include the variable in the model
  • -: exclude the variable from the model (e.g. - 1 excludes the intercept)
  • *: include both variables and their interaction effect
  • :: include only the interaction effect

Formulas

log(outcome) ~ var1 + I(var1^2) + ...
  • log(outcome): include the logarithm of the outcome instead of the plain outcome
  • I(var1^2): include var1 squared