Chapter 14 Predictive modeling

14.1 Prediction versus inferences

So far we have mainly discussed inferential modeling. We computed linear regression slopes or logit marginal effects, and in each case we asked what does the number mean? Such insights are valuable and a major way of learning about natural and social phenomena. But this is not always what we need.

For instance, imagine we are considering whether to have a picnic in a park tomorrow. We all agree that we’ll have the picnic, but only if it is not raining. How will inferential modeling help us with our decision? Sure, it will–we can learn that in case of north-westerly winds, and if the Ocean temperature is above a certain level, and in case of high-pressure area over British Columbia… and so on and so forth. Understanding the processes behind weather are definitely valuable, but … should we have picnic tomorrow? What I want to know is a simple yes/no answer: will it be raining tomorrow?

This is an example of predictive modeling. Instead of interpreting the wind patterns and air humidity, we want just have an answer, not an interpretation. This is not to say that interpretation is not valuable, but we happily leave this decision to the weather forecasters to do.

There are many other similar problems where we are mostly just interested in the outcome, not interpretation of the process. For instance, doctors’ immediate task is to find diagnosis given the symptoms. How are symptoms related to illness is a valid and important question–but I’d happily leave this to medical researchers for another time, and expect a quick answer to another question–what is the diagnosis–instead.

14.2 Predicting linear regression

14.2.1 How to predict with linear regression

Let’s repeat the global temperature trend analysis from Section 11.3.1, using HadCRUT temperature data. For a refresher, the dataset looks like

hadcrut <- read_delim("data/hadcrut-annual.csv")
hadcrut %>%
   sample_n(5)

## # A tibble: 5 × 4
##    Time `Anomaly (deg C)` `Lower confidence limit (2.5%)`
##   <dbl>             <dbl>                           <dbl>
## 1  2010            0.680                           0.649 
## 2  1986            0.0957                          0.0644
## 3  1859           -0.281                          -0.424 
## 4  1899           -0.355                          -0.486 
## 5  1857           -0.467                          -0.616 
## # ℹ 1 more variable: `Upper confidence limit (97.5%)` <dbl>

From 1960 onward, the temperature shows a fairly steady linear trend:

hadcrut %>%
   filter(Time >= 1960) %>%
   ggplot(aes(Time, `Anomaly (deg C)`)) +
   geom_point() +
   geom_smooth(method="lm", se=FALSE)

We estimate this trend with linear regression, and make corresponding predictions.

We can do the linear regression as (see Section 11.3.1):

hadcrut %>%
   filter(Time >= 1960) %>%
   lm(`Anomaly (deg C)` ~ Time, data = .)

## 
## Call:
## lm(formula = `Anomaly (deg C)` ~ Time, data = .)
## 
## Coefficients:
## (Intercept)         Time  
##   -35.45587      0.01795

As a refresher–as the variable names in HadCRUT dataset contain spaces and parenthesis, we need to quote those with backticks like `Anomaly (deg C)`.

Previously, we just interpreted the results, in particular we noted that the global temperature is increasing 0.18 degrees per decade. But now our focus is on prediction–how much will be global temperature at year 2050, 2100 and so on.

In case of linear regression, it is easy to use the formula, and write, e.g.

Temperature in 2100 = -35.456 + 0.018 \(\times\) 2100 = 2.245

Below we use the predict() method instead, it is somewhat more robust (less room for typos), and it is much easier for logistic regression problems.

As a first step, we want to store the fitted model into a variable, here we store it in m:

m <- hadcrut %>%
   filter(Time > 1960) %>%
   lm(`Anomaly (deg C)` ~ Time, data = .)

Now the variable m contains the fitted model, we can see it by just typing m:

## 
## Call:
## lm(formula = `Anomaly (deg C)` ~ Time, data = .)
## 
## Coefficients:
## (Intercept)         Time  
##   -35.92526      0.01819

These are exactly the same results as above.

We can just predict the values by using predict(m):

predict(m)

##            1            2            3            4 
## -0.259831343 -0.241643977 -0.223456611 -0.205269245 
##            5            6            7            8 
## -0.187081879 -0.168894513 -0.150707147 -0.132519781 
##            9           10           11           12 
## -0.114332415 -0.096145049 -0.077957683 -0.059770317 
##           13           14           15           16 
## -0.041582951 -0.023395585 -0.005208219  0.012979147 
##           17           18           19           20 
##  0.031166513  0.049353879  0.067541246  0.085728612 
##           21           22           23           24 
##  0.103915978  0.122103344  0.140290710  0.158478076 
##           25           26           27           28 
##  0.176665442  0.194852808  0.213040174  0.231227540 
##           29           30           31           32 
##  0.249414906  0.267602272  0.285789638  0.303977004 
##           33           34           35           36 
##  0.322164370  0.340351736  0.358539102  0.376726468 
##           37           38           39           40 
##  0.394913834  0.413101201  0.431288567  0.449475933 
##           41           42           43           44 
##  0.467663299  0.485850665  0.504038031  0.522225397 
##           45           46           47           48 
##  0.540412763  0.558600129  0.576787495  0.594974861 
##           49           50           51           52 
##  0.613162227  0.631349593  0.649536959  0.667724325 
##           53           54           55           56 
##  0.685911691  0.704099057  0.722286423  0.740473789 
##           57           58           59           60 
##  0.758661155  0.776848522  0.795035888  0.813223254 
##           61           62           63 
##  0.831410620  0.849597986  0.867785352

This prints a large array of numbers. These are predicted temperature values for all these years we included into the model, -, in the same order as years are stored in the dataset. So we can see that the predicted value for is -0.27 and for it is 0.86.

But these are just the values for the actual years we have in data. If we want to compute predicted temperature for, e.g., 2050, we need to supply a new dataset to predict(). For instance,

predict(m, newdata = data.frame(Time = 2050))

##        1 
## 1.358844

So the predicted temperature anomaly for 2050 is 1.36.

The newdata = data.frame() tells predict() that this time we do not want to hear what we know better–namely, what was the temperature 1960-2023, but instead, want it to tell us what does it think the temperature in 2050 will be. data.frame(Time = 2050), in turn, creates a new dataset with a column Time:

data.frame(Time = 2050)

##   Time
## 1 2050

This new dataset is used for prediction. The column name, Time, must be exactly the same as used in the lm() above, where we asked for lm(`Anomaly (deg C)` ~ Time).

But we cannot ignore interpretation in predictive modeling either. What does the number 1.36 mean? It must be understood in the same way as the numbers in HadCRUT data: these are global surface temperature anomalies, in °C, above 1961-1990 average. Hence the model predicts that by 2050, the average global surface temperature is 1.36°C above 1961-1990 average.

The other extremely important point to understand is what kind of assumptions are our predictions based on. Here it is very simple–we describe the temperature 1960-2023 by a linear trend and we assume the same trend continues into 2050. But we do not know if it is correct. We cannot know. It has hold steady for the last 60 years, so it is a plausible assumption, but it is still an assumption. See more in Section ?? below.

14.2.2 RMSE/R2: model goodness

TBD

RMSE is ``typical’’ prediction error \[\begin{equation*} \mathit{RMSE} = \sqrt{ \frac{1}{N} \sum_{i} (y_{i} - \hat y_{i})^{2} } \end{equation*}\]

\(R^2\) (called “r-squared”). \[\begin{equation*} R^{2} = 1 - \frac{ \sum_{i} (y_{i} - \hat y_{i})^{2} }{ \sum_{i} (y_{i} - \bar y)^{2} } \end{equation*}\]

\(R^{2} = 1\): perfect prediction
\(R^{2} = 0\): useless model
- As good as predicting mean value for everyone

14.3 Categorization: predicting categorical outcomes

So far we were predicting linear regression results. As linear regression expects a continuous outcome, such as basketball score or temperature, measures such as \(R^2\) and RMSE are appropriate.

But not all prediction tasks are like that. There is an important class of problems, called classification or categorization, where the outcome is categorical. Here are a few examples of classification problems:

image categorization: does it depict a cat or a dog?
email categorization: is it spam or not?
loan processing: approve the application or not?
analyzing bank transaction: fraudulent or not?

None of these case are concerned of a numeric outcome. Instead, it is a category, cat versus dog, spam or non-spam, and so on. Obviously, linear regression is not suitable for such tasks–what would you decide about email being spam if the model responds “7.4”? Fortunately, there are other tools that are designed for just such tasks. Above we introduced logistic regression, am model that is designed to describe how probability of one category depends on the other characteristics. This is one of the models we will use below as well.

In a similar fashion, we cannot easily compute values like RMSE or \(R^2\), as the concepts like “approve - not approve” do not make much sense. We discuss these tools below.

14.3.1 Predicting logistic regression results

Logistic regression results can be predicted in a fairly similar way as linear regression results. Using info180 package²³

As a refresher: the most recent version can be installed by install.packages("https://faculty.uw.edu/otoomet/info180.tar.gz").↩︎