Chapter 11 How are values related

This section discusses relationships in data. Often these are relationships that we are interested in, e.g. how income is related to education, or how global temperature is changing over the years. Relationships are worth analyzing because they tell us something about the problem. Even if a positive correlation is not proof that more education gives you a better salary, it may still provide some evidence that this is true.

11.1 Visualizing trend lines

One of the simplest and most powerful ways to understand simple relationships is to visualize them, using tools similar to those we used in Section 10. We demonstrate trends using the sea ice extent data. This can be loaded as:

ice <- read_delim("ice-extent.csv")

Now we have loaded the dataset and stored it under the name ice in the R workspace. The dataset looks like

ice %>%
   head(3)
## # A tibble: 3 × 7
##    year month `data-type` region extent  area  time
##   <dbl> <dbl> <chr>       <chr>   <dbl> <dbl> <dbl>
## 1  1978    11 Goddard     N        11.6  9.04 1979.
## 2  1978    11 Goddard     S        15.9 11.7  1979.
## 3  1978    12 Goddard     N        13.7 10.9  1979.

For our purpose, the important variables are year, month, region, extent, and area. Region means the hemisphere, “N” for north and “S” for south. Extent and area are sea ice extent and area (in millions of km2). Area is just the sea surface area that is covered by ice; extent is defined as the sea surface area with at least 15% ice concentration. Extent is a widely used measure, as satellite measurements of extent are more precise than those of area.

When you look at the figures in Section 10.3, you may have got an impression that the monthly ice area is falling over the years. Let us now focus not on the plot itself, but on how to show this trend. First, let’s plot the September ice area in the north over the years:

ice %>%
   filter(area > 0,
          month == 9,
          region == "N") %>%
   ggplot(aes(x=year, y=area)) +
   geom_point()


We keep only valid values (area > 0), the month of September (month == 9), and the northern hemisphere (region == "N"). We also choose to use only points, not lines, to avoid interference with the trend line later. The dots seem to be heading downward at a fairly steady pace, although there is a lot of yearly fluctuation around this trend.


Next, let’s add a trend line on it. This can be done using ggplot’s geom_smooth()–you just add another layer as + geom_smooth(method="lm", se=FALSE):

ice %>%
   filter(area > 0,
          month == 9,
          region == "N") %>%
   ggplot(aes(x=year, y=area)) +
   geom_point() +
   geom_smooth(method="lm", se=FALSE)

We tell geom_smooth to draw a trend line, not a curve (method = "lm" for linear model, see Section 11.3). We also tell it not to plot the confidence region (se = FALSE for standard errors). As a result, we have a single clean trend line. The line suggests that September ice area has fallen from approximately 5 million km2 around 1980 to approximately 3 million by 2020.

Exercise 11.1 Repeat the plot, but this time leave out the method="lm" and se=FALSE. What happens? Which plot do you like more?


geom_smooth works perfectly with data groups (see Section 10.3). For instance, let’s replicate the same example, but using data for three months–March, June and September:

ice %>%
   filter(area > 0,
          month %in% c(3, 6, 9),
          region == "N") %>%
   ggplot(aes(x=year, y=area,
              col=factor(month))) +
   geom_point() +
   geom_smooth(method="lm", se=FALSE)

The differences are quite small:

  • we select March, June and September this time, not just September
  • we add col = factor(month) to the aesthetics. This separates the dots (and trend lines) into three monthly groups, each of which is displayed with a different color. factor(month) is needed to convert the numeric month into a categorical variable (see Section 10.3).
  • finally, we use month %in% c(3, 6, 9) to specify that we want to preserve only those months that are in the given list (See Section 5.3.2.2).

Now we see three differently colored sets of points, and also three trend lines, one for each month, displayed with the respective color. The figure shows that while September ice area is trending down fairly fast, the trend for March is very small, almost non-existent. While we may see ice-free Septembers in a few decades, a completely ice-free Arctic Ocean is still far away.

Exercise 11.2 Remove the function factor from col=factor(month). What happens? How many trend lines will you see now?

See the solution

We can do a quick back-of-the-envelope calculation about when the ice will completely vanish. It took approximately 40 years for the ice area to fall from 5 to 3 M km2, hence it will take another 60 years to fall to 0. Obviously, this is only true if the current trend continues.
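
We can do the same extrapolation in R. The sketch below fits the trend with the lm() function (introduced below in Section 11.3) and solves for the year where the fitted line reaches zero; it assumes the ice data is loaded as above and that the linear trend continues unchanged.

fit <- ice %>%
   filter(area > 0,
          month == 9,
          region == "N") %>%
   lm(area ~ year, data = .)
b <- coef(fit)  # b[1] is the intercept, b[2] the slope (M km2 per year)
-b[1]/b[2]  # the year where the trend line reaches zero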

11.2 Strength of the relationship: correlation

There are two ways to describe the strength of a relationship: correlation and slope. Here we discuss correlation; slope is explained below in the linear regression section (Section 11.3).

11.2.1 What is correlation

Correlation is a measure that describes how closely two variables are related. It is “+1” if both variables are related exactly in a linear fashion, and there is no wiggle room for one to change while the other stays the same. The “+” in front of “+1” stresses that the relationship is positive, i.e. if one variable increases, then the other necessarily increases too. Alternatively, if correlation is “-1”, then there is no wiggle room either, but if one variable increases, the other one necessarily decreases.


How correlation describes the shape of the point cloud.

Here are a few examples that display two randomly generated variables that are correlated to different degrees. The top-left image shows a relationship with correlation exactly “-1”. Hence all the data points are exactly on the blue line (“1” means perfect correlation), and when \(x\) increases, \(y\) falls (this is what the “-” in front of “1” means). The next image, “-0.5”, shows a point cloud that is clearly negatively related–larger \(x\) values are associated with smaller \(y\) values–but the relationship is far from perfect. When \(x\) grows, \(y\) tends to decrease, but this is not always the case–sometimes a larger \(x\) corresponds to a larger \(y\) instead. When correlation is 0 (middle-left image), the line is almost flat, and we cannot see any obvious relationship. These variables are unrelated. Further, the plot shows a few possible positive correlations. Now larger \(x\) values correspond to larger \(y\) values. The largest positive value displayed here is 0.99: the points align very well with the line, but not perfectly.

You can play and learn more on dedicated websites, such as Guess The Correlation.

Note that correlation is agnostic about scale. In these figures both \(x\) and \(y\) are scaled in a similar fashion, but that is not necessary. Correlation measures how well the points fall on the line, not what the slope of the line is. Whatever measurement units we are using, we can always scale the plot down to roughly a square and see how well the points line up there.
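
We can check this scale-invariance with a small simulation; the variables x and y below are made up for illustration and are not part of any dataset used in this chapter.

set.seed(1)
x <- rnorm(100)  # 100 random values
y <- 2*x + rnorm(100)  # y is related to x, plus noise
cor(x, y)  # some positive correlation
cor(1000*x, y/10)  # exactly the same value: the units do not matter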

Because correlation describes the relationship between two variables in such an intuitive and easily understandable manner, it is widely used in practice.

11.2.2 Example: how similar are iris flower parts

Virginica flower. Unlike in colloquial language, petals in the botanical sense are only the simpler, smaller, secondary components of the flower, here light purple with darker purple veins. The large base components with a yellow patch in the center and dark purple veins are sepals.

Frank Mayfield, CC BY-SA 2.0, via Wikimedia Commons

Now it is time for an example. We use the iris dataset, first used by R. Fisher in 1936. It consists of measurements of three different iris species, 50 flowers from each–their sepal and petal length and width (see the figure). Below, we focus on virginica flowers only, as combining multiple species on the same figure would make the picture unnecessarily complex.

It is intuitively obvious that the different measures of flowers are positively related: longer petals tend to be wider too, and large flowers have both large petals and large sepals. We may not even think separately about length and width but instead about the “size” of flowers. Given the shape remains broadly similar, larger petals are both longer and wider.

Let us now compute the correlation. First, we load the data and keep only the virginica flowers:

virginica <- read_delim("data/iris.csv") %>%
   filter(Species == "virginica")

The dataset contains the following variables:

virginica %>%
   head(3)
## # A tibble: 3 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          6.3         3.3          6           2.5 virginica
## 2          5.8         2.7          5.1         1.9 virginica
## 3          7.1         3            5.9         2.1 virginica

The first four columns are the dimensions–sepal length, sepal width, petal length, petal width; and the last column is the species.


Relationship between virginica sepals and petals.

Let us focus on the relationship between sepal length and petal length. This can be visualized as

ggplot(virginica,
       aes(x = Sepal.Length,
           y = Petal.Length)) +
   geom_point() +
   geom_smooth(method="lm",
               se=FALSE) +
   labs(x = "Sepal length, cm",
        y = "Petal length, cm")

As expected, we see a positive relationship: flowers with longer sepals also tend to have longer petals. Larger flowers are, well, larger. The trend line seems to fit the point cloud rather well–the points do not spread too much around it, and there are no clear outliers. We do not spot any of the problems discussed in Section 11.2.4 below.

Now we can compute the correlation. We use the function cor(), which computes the correlation matrix:

virginica %>%
   select(Sepal.Length, Petal.Length) %>%
   cor()  # prints correlation matrix
##              Sepal.Length Petal.Length
## Sepal.Length    1.0000000    0.8642247
## Petal.Length    0.8642247    1.0000000

The matrix consists of two rows and two columns, one for each variable. The most important number here is 0.864–the correlation between petal length and sepal length. The 1.000 on the diagonal means that each variable is perfectly correlated with itself (they always are). So the only purpose the diagonal values serve is to remind the reader that this table is about correlation. We also see that both the bottom-left and the top-right cells of the table contain the same number–0.864. This just means that we can compute the correlation both ways–between petal length and sepal length, and between sepal length and petal length–and the result is exactly the same.

The value 0.864 itself means that the lengths of petals and sepals are closely correlated. This is confirmed by the impression that the petal-sepal length figure is most similar to the figure that depicts correlation 0.9 in Section 11.2.1. So flowers with longer sepals have longer petals too; the relationship is very strong but not perfect.
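
If we only need this single number, we can also call cor() with two columns directly instead of a whole data frame; this is just a small variation of the code above, assuming the virginica data frame is still loaded.

cor(virginica$Sepal.Length,
    virginica$Petal.Length)  # a single number: the same 0.864 as in the matrix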

We can also print the correlation of more than two variables in an analogous table:

virginica %>%
   select(!Species) %>%  # keep all lengths, remove 'Species'
   cor()
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.4572278    0.8642247   0.2811077
## Sepal.Width     0.4572278   1.0000000    0.4010446   0.5377280
## Petal.Length    0.8642247   0.4010446    1.0000000   0.3221082
## Petal.Width     0.2811077   0.5377280    0.3221082   1.0000000

The output is a similar correlation matrix. However, this time it contains four rows and four columns, one for each variable that was left in the data. The cells contain the correlation values for the corresponding variables. For instance, we can see that the correlation between sepal width and sepal length is 0.457, and the correlation between petal width and petal length is 0.322. All the diagonal values are 1.000–all variables are always perfectly correlated with themselves.

This example shows that the sizes of all flower parts are positively related. Maybe somewhat surprisingly, the closest relationship is between the lengths–between petal length and sepal length. The link between length and width is weaker, and the weakest correlation (0.281) is between petal width and sepal length. It seems flowers are happy to have these parts of different shapes.

11.2.3 Example: basketball score and time played

Let’s take another example, this time from the world of basketball. It is intuitively obvious that the more time players play, the more they score. It is also clear that the relationship is far from perfect: sometimes they just get lucky and score a lot, other times they may mainly be playing defense, where opportunities to score are not as good.

Harden playing for Rockets in 2016.

Keith Allison from Hanover, MD, USA, CC BY-SA 2.0, via Wikimedia Commons

Let’s look at the 2021-2022 season results of James Harden, an NBA basketball player. How did his total score depend on the minutes he played? See the Data Appendix to learn more about the data; Appendix 11.5 below explains how these data are processed to get the results in this section.

During the 2021-2022 season, Harden played in 65 games. His shortest contribution was 28 min and 39 sec, and his longest was 44 min and 4 sec. During his shortest game he scored 11 points, and during the longest game he scored 26 points. So the longer game resulted in more points.


But how is the number of points related to time played? Intuitively, staying in the game for more minutes should lead to more points scored. But how close is the relationship? Here is a plot that shows his playing time (in minutes) and his score (total points).

We see a cloud of dots denoting games. There seems to be an upward trend, but the relationship is far from clear; just by guessing, one might think that the correlation is around 0.2-0.3 (see the figures above in Section 11.2.1). When computing the actual correlation, it turns out to be 0.308. So playing time and the player’s score are positively related, but the score seems to be primarily determined by other factors; the playing time only plays a secondary role.

11.2.4 Caveats of correlation

But one should be aware that correlation describes a linear relationship, i.e. how well the points fit on a line. If the relationship is non-linear, correlation may display surprising results.

Four images with the same correlation value, but very different messages. Image 1 depicts a simple point cloud, but all the other images show potential data or model problems.

Here are several examples of data where correlation is approximately the same, but data is arranged in a very different fashion. The first figure displays a case that is similar to the previous figure–a noisy but positive relationship that approximately follows a line.

The second figure, however, shows a cloud of points where \(x\) and \(y\) are not related at all. But in addition to the correlation-zero cloud, there is a single outlier up-right from the cloud. This outlier causes data to be highly correlated.

The third figure shows perfectly aligned data. But this time it is not aligned in a linear fashion along a line, but along a curve. As the curve still somewhat resembles a line, the correlation is high here too.

The fourth image shows data that are aligned in an almost perfect linear fashion–except one outlier. That outlier causes the correlation to fall from 1.0 down to approximately 0.8.

These caveats demonstrate that while correlation is a useful way to analyze “closeness” of relationships, one should always check the corresponding plots too. The correlation number alone may be fairly misleading. It does not necessarily mean that any of these plots is “wrong” (although this may be the case). It means that a single number, correlation, may not represent data in a way that we expect.

Take for instance the second image, with an uncorrelated point cloud and a large outlier. Such pictures are common if we analyze e.g. settlement size and income. Typical datasets (and typical countries or states) have many small towns and a few big cities with much higher income. In that case we may see that most of the small towns have similarly low income, but the single outlier, the only big city in the data, is much wealthier. Size and income are genuinely positively correlated; it is not just an error in the data.

The examples above assume that both variables, \(x\) and \(y\), are numeric. In that case we can compute correlation (more specifically, the Pearson correlation coefficient) as \[\begin{equation} r_{xy} = \frac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum\limits_{i=1}^n (x_i-\bar{x})^2} \sqrt{\sum\limits_{i=1}^n (y_i-\bar{y})^2}}. \end{equation}\] Here \(i\) denotes the rows, and \(x_i\) and \(y_i\) are the values of variables \(x\) and \(y\) in row \(i\). \(\bar x\) and \(\bar y\) are the average values of \(x\) and \(y\). But these days we almost always use statistical software to do these calculations.
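
As a sanity check, the formula above can be translated into R almost literally. The sketch below uses two made-up vectors x and y (any two numeric vectors of equal length would do) and compares the hand-made computation with the built-in cor():

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)
r <- sum((x - mean(x))*(y - mean(y))) /
   (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
c(byHand = r, builtin = cor(x, y))  # the two numbers should be identical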

11.3 Measuring the relationship: linear regression

Sometimes correlation is all we need. The knowledge that flower shape can vary, but some flowers tend to be taller overall while others are wider, may be exactly what we want. But at other times we want to know more. For instance, how much wider do sepals tend to be if they are 1cm longer? How fast does the ice cover decrease from year to year? How much more do people with more education earn? Such questions cannot be addressed with correlation. The best correlation can do is to tell us that the ice cover fluctuates a lot and that the yearly values are dispersed widely around the trend line. But it does not answer the how fast question.

This is where linear regression comes in. Linear regression is the most popular way to compute trend lines, and behind the scenes, the visualizations in Section 11.1 used linear regression as well.

Linear regression is a way to find trend lines through data (there are other ways). The linear part refers to the fact that we are looking for a line (not a curve or a step function); regression refers to the concept of regression to the mean, a phenomenon explained by Galton (1886).18 The name is a little bit of an accident, as the “regression” part of these models is usually not relevant. This is also why linear regression is sometimes called a linear model (LM), or ordinary least squares (OLS). Here ordinary refers to the fact that it is the most popular of the least squares models, and least squares is related to the definition of OLS: the line is chosen in such a way that the squared errors are as small as possible (see Section 11.3.4).

Below, we introduce linear regression with an example, thereafter provide a more formal definition of it, and finally discuss further examples and problems.

11.3.1 Linear regression: an example

A simple way to understand linear regression is to analyze trends in time. Below, we use the HadCRUT data, a temperature dataset collected by the Hadley Centre/Climatic Research Unit and maintained by the UK Met Office. It covers the years from 1850 onward, although the earlier results are less precise.

The dataset includes four columns: year, temperature anomaly relative to the 1961-1990 period (global yearly average), and the lower (2.5%) and upper (97.5%) confidence limits of the latter:

hadcrut <- read_delim("data/hadcrut-annual.csv")
hadcrut %>%
   sample_n(5)
## # A tibble: 5 × 4
##    Time `Anomaly (deg C)` `Lower confidence limit (2.5%)`
##   <dbl>             <dbl>                           <dbl>
## 1  1935            -0.206                          -0.303
## 2  1892            -0.508                          -0.644
## 3  2010             0.680                           0.649
## 4  1996             0.277                           0.243
## 5  1867            -0.357                          -0.553
## # ℹ 1 more variable: `Upper confidence limit (97.5%)` <dbl>

Below, we only analyze anomaly and ignore the confidence bounds.


Let’s start with a plot that displays the global temperature:

hadcrut %>%
   ggplot(
      aes(Time, `Anomaly (deg C)`)) +
   geom_line() +
   geom_smooth(method="lm", se=FALSE)

(Remember: as the column name “Anomaly (deg C)” contains spaces and parentheses, we need to quote it with backticks, see Section 5.3.1.)

The plot shows a zig-zaggy pattern, but from 1900 onward, the temperature seems to be heading consistently upward. We also display the trend line (blue), but a single line like this does not seem to capture the complicated trend pattern well.


So next we only look at the time period from 1961 onward, as that part of the data seems to follow a linear trend much more closely:

hadcrut %>%
   filter(Time > 1960) %>%
   ggplot(
      aes(Time, `Anomaly (deg C)`)) +
   geom_point() +
   geom_smooth(method="lm", se=FALSE)

Indeed, the global surface temperature over the last 60 years seems to be well approximated with a linear trend. The lowest datapoint in this figure is in the early 1960s, and the highest in 2023.

Exercise 11.3 The first picture, the 1850-2023 trend, is displayed with lines, while the second picture uses points. Which do you think is the better way to display the temperature and the trend? Replicate these figures using points for the whole range and lines for the period from 1961 onward.

But exactly how strong is the trend? One can just look at the figure and estimate that the temperature seems to have increased from -0.25 to +0.8 or so over the last 60 years, i.e. roughly 0.175 degrees per decade. But it would be nice to get the exact number without manual calculations. This is exactly the question that linear regression can answer. In R, you can use the lm() function for this task (see Section 11.3.3 for details):

hadcrut %>%  # take 'hadcrut' data...
   filter(Time > 1960) %>%  # keep only years after 1960...
   lm(`Anomaly (deg C)` ~ Time, data = .)  # compute linear regression
## 
## Call:
## lm(formula = `Anomaly (deg C)` ~ Time, data = .)
## 
## Coefficients:
## (Intercept)         Time  
##   -35.92526      0.01819

lm() is the function that computes the linear regression (“lm” stands for “linear model”). It normally takes two arguments: first the formula–what to calculate–and thereafter the dataset to be used for the calculations. Here we feed the filtered data through the pipe operator, writing data = . instead of naming the dataset (see Section 5.7).

The function returns two numbers, here labeled “(Intercept)” and “Time”. The latter means that through this time period, the global temperature has been growing by 0.0182 degrees each year, on average.

Next, we discuss these numbers in more detail.

11.3.2 Intercept and slope

Before discussing these numbers any further, let’s talk more about lines.

Describing a trend in data using a line–a linear trend.

Linear regression models the pattern in data using a line. In the figure here, the data are depicted by black dots and the trend line is blue. In this example, a line describes the trend rather well. This may not always be the case: sometimes the trend may not look like a line but like some kind of curve instead, or we may not find any trend in the data at all. But we can always find the “best” line, even if the “best” is not necessarily good. This is what linear regression does–it finds the “best” line (see Section 11.3.4 for how the best line is defined).

In the time-temperature figures above, the trend is depicted as the blue line. Linear regression describes this line by two numbers, one called the intercept (-35.9253 in the example above) and the other the slope (0.0182 above, labeled not “slope” but Time). What do these numbers mean?


How intercept and slope are defined.

The figure here shows a line (blue) on a plane. It broadly describes the trend in six datapoints (black), but the concepts of intercept and slope apply to all lines, not just to those that describe data. Besides the line, the figure displays the axes, labeled \(x\) (horizontal) and \(y\) (vertical). These define the “location” on the plane. The blue line intersects the \(y\) axis at a certain point–where \(x=0\) (that is where the \(y\)-axis is)–at a certain height. That height is called the intercept.

Slope, in turn, describes the steepness of the line. More precisely, the slope tells how much larger \(y\) is if \(x\) is larger by one unit. On the figure, the horizontal red line shows an \(x\) difference of one unit, and the vertical red line shows how much larger the corresponding \(y\) is–this is exactly what the slope is.

In case of linear regression, intercept and slope have a distinct meaning.

Intercept is the height at which the trend line cuts the \(y\) axis, the \(y\) value that corresponds to the \(x\) value zero. In the example above, the intercept is -35.925. This means that when \(x\) is zero, the \(y\)-value will be -35.925. In the example above, we had the global temperature anomaly as \(y\) and year as \(x\). So we can re-phrase the intercept as “At year 0, the global temperature anomaly was -35.925 degrees”. We’ll discuss this number more below.

Slope shows how much larger \(y\) is if \(x\) is larger by one unit. Here it is 0.0182 units. Although correct, such language is confusing. As we measure \(x\), time, in years instead of “units”, and \(y\), temperature, in degrees C, we can re-phrase it as “when the year is larger by 1, the temperature is larger by 0.0182C”. Or even better: “Each year, the global temperature increases by 0.0182C”19. Obviously, these statements describe not what the actual temperature does (those are the black dots that jump up and down), but what the temperature trend, the blue line, does. So an even better way to explain the number would be “Global temperature is trending upward by 0.0182C per year”.
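
The intercept and slope together give the trend value for any year: the trend line is just intercept + slope * year. A small sketch, using the (rounded) coefficients from the regression above:

b0 <- -35.92526  # intercept
b1 <- 0.01819  # slope: degrees per year
b0 + b1*2020  # trend value for 2020: roughly 0.8 degrees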

The figure above depicted a positive intercept and a positive trend. Here are a few examples showing a negative intercept, a negative slope, and a zero slope. In the latter case we can say that there is really no trend in the data.

Example data with negative intercept, negative slope, and no trend at all.

Note also one peculiar thing about the example with a negative intercept: despite all data points being positive, the intercept is negative. This is because there are no data points close to where the trend line meets the vertical axis. All data points are further away, in positive territory. It is just the trend line that has a negative intercept.

Qaanaaq in late summer 2007. There is little vegetation, and a glacier is lurking on the mountains in the background. For most of the year, the landscape is covered with snow and ice. The yearly average temperature is -9C, or approximately 25C below the Earth’s average.

Col. Lee-Volker Cox, Public domain, via Wikimedia Commons

Now let’s return to the intercept, -35.925. This is the temperature anomaly compared to the 1961-1990 average, in degrees C. In other words, the temperature at year 0 should have been 35.925 degrees colder than the 1961-1990 average. Now if you know anything about temperature, that would be very... chilly, close to -20C or 0F. It is about 10C colder than the average in the town of Qaanaaq (also known as “Thule”) in Northern Greenland. Was the weather during the Han dynasty really that cold?

Obviously not. The intercept, again, describes the blue trend line, not the actual data pattern. As the images in the previous section indicate, the blue line is quite a good approximation of the temperature trend since 1960. It is a much worse approximation for the whole 1850-2023 time period. And it would probably be a miserable one for describing the temperature through the last two millennia–we just do not have the data. In this sense it is similar to the example with a negative intercept above: there are no data points anywhere near year 0. So a better wording to explain the intercept would include a qualifying remark: “At year 0, the global temperature anomaly would have been -35.925 degrees, had the current trend held all the way back”. And here we have good reasons to believe that it did not.

11.3.3 How to use the lm() function

Next, we’ll discuss how to perform linear regression using the lm() function in R. The basic syntax of it is

lm(y ~ x, data)
  • the first argument, y ~ x, is the formula. It tells which variables should be related. In this example, these are y and x.

    If you write it as y ~ x, the intercept is the average y value where x is zero, and the slope shows how much larger y is if x is larger by one unit. But if you write it the other way around, x ~ y, then the intercept means the average x value where y is zero, and the slope means how much larger x is if y is larger by one unit.

  • data is the dataset where the columns x and y are taken from. You can name the dataset directly, or, if you use a pipeline to feed data to lm(), you may write data = . instead.

In the example in section 11.3.1 we invoked lm() as

hadcrut %>%
   filter(Time > 1960) %>%
   lm(`Anomaly (deg C)` ~ Time, data = .) 

The formula here is `Anomaly (deg C)` ~ Time. You can imagine that we put the anomaly on the \(y\)-axis and the year on the \(x\)-axis. Hence the intercept means the temperature at year 0, and the slope is the yearly temperature growth.

But what happens if you swap the formula around and write it as Time ~ `Anomaly (deg C)`? Let’s try!

hadcrut %>%
   filter(Time > 1960) %>%
   lm(Time ~ `Anomaly (deg C)`, data = .) 
## 
## Call:
## lm(formula = Time ~ `Anomaly (deg C)`, data = .)
## 
## Coefficients:
##       (Intercept)  `Anomaly (deg C)`  
##            1976.8               50.1

Year as a function of temperature. Although not wrong, the data is much harder to interpret in this way.

In order to understand the result better, let’s make a corresponding plot where year is on the vertical and temperature on the horizontal axis. With the help of this figure, it is easier to interpret the numbers:
  • Intercept is the year where the temperature anomaly is 0. This turns out to be 1976.
  • Slope is 50. From its general interpretation: if \(x\) is larger by one unit, \(y\) is larger by 50 units. Here it means that when temperature grows by one degree, time grows by 50 years. The temperature trend is roughly 50 years per degree.

It is not wrong to present data in this way, but it is typically much more confusing. So when doing the actual analysis, you should always think about what is the better way to present your data.

Note the difference in how lm() and ggplot() treat data: lm() needs the formula y ~ x while ggplot() needs the aesthetics aes(x, y). So in one case \(y\) comes first, in the other case \(x\) comes first.

This is related to how such formulas are traditionally written in mathematics.

lm() is traditionally used together with its dedicated summary() function. It outputs much more information than the plain lm():

hadcrut %>%
   filter(Time > 1960) %>%
   lm(`Anomaly (deg C)` ~ Time, data = .) %>%
   summary()
## 
## Call:
## lm(formula = `Anomaly (deg C)` ~ Time, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.22884 -0.09464  0.01395  0.07744  0.23983 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.593e+01  1.448e+00  -24.81   <2e-16 ***
## Time         1.819e-02  7.268e-04   25.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1049 on 61 degrees of freedom
## Multiple R-squared:  0.9112, Adjusted R-squared:  0.9098 
## F-statistic: 626.3 on 1 and 61 DF,  p-value: < 2.2e-16

However, as we rarely use the additional information in this book, we usually invoke lm() without summary().
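
If you only need the intercept and the slope as plain numbers (for instance for further calculations), you can extract them with the base-R helper coef(). A small sketch, reusing the regression from above:

fit <- hadcrut %>%
   filter(Time > 1960) %>%
   lm(`Anomaly (deg C)` ~ Time, data = .)
coef(fit)  # a named vector: (Intercept) and Time
coef(fit)["Time"]  # just the slope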

11.3.4 Linear regression: formal definition

Linear regression uses a line to describe the data. You can always write the equation for a line like \[\begin{equation} y = b_0 + b_1 \cdot x \end{equation}\] where \(b_0\) is the intercept and \(b_1\) is the slope. If you pick different values for \(b_0\) and \(b_1\), you’ll get different lines.


Four different lines that all describe the data points well.

So by picking different intercepts and slopes, you can draw a number of lines through the data points. If the lines are fairly similar, you cannot easily tell which one is the best. Let’s look at a hypothetical dataset, the black dots on the figure here. It depicts five data points and four lines that describe the relationship. All lines are rather similar, but they are not exactly the same. Which one is the best?

Obviously, the answer to such questions is “it depends”. To begin with, we cannot even talk about “the best line” unless we define the “goodness of a line” in a more precise way.


The five datapoints (black), the regression line (blue), and the residuals (red).

Linear regression has its own particular way of telling how good a line is. It is based on residuals, the differences between the actual data points (black) and the corresponding trend points (blue). On the figure, the residuals are the red vertical lines, denoted by \(e_1\), \(e_2\) and so on. We can see that \(e_3\) is rather large because the third data point is rather far from the trend line. \(e_1\), however, is small because here the data point is almost exactly on the line.

The regression line is defined as the line where the sum of squared residuals is as small as possible. We can write the sum of squared residuals as \[\begin{equation} \mathit{SSE} = e_1^2 + e_2^2 + e_3^2 + e_4^2 + e_5^2 \end{equation}\] or, when using the standard mathematical notation, \[\begin{equation} \mathit{SSE} = \sum_{i=1}^5 e_i^2. \end{equation}\] (SSE stands for sum of squared errors.) So if we want to get the regression line, we have to choose from all possible lines the one where SSE is as small as possible.20

As every line is defined by its two parameters, intercept \(b_0\) and slope \(b_1\), we can pick the best line by picking the best \(b_0\) and \(b_1\). These are the two parameters that the lm() function returns.
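
To make the “smallest SSE” idea concrete, the sketch below computes the SSE of the fitted temperature regression with the built-in residuals() helper, and then the SSE of a slightly shifted line; the 0.05-degree shift is an arbitrary choice for illustration.

recent <- hadcrut %>%
   filter(Time > 1960)
fit <- lm(`Anomaly (deg C)` ~ Time, data = recent)
sum(residuals(fit)^2)  # SSE of the regression line
# shift the whole line up by 0.05 degrees: the SSE must grow
e <- recent$`Anomaly (deg C)` - (fitted(fit) + 0.05)
sum(e^2)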

But why do we want to define the regression line using squared residuals, and not something else? In fact, there are models that use something else. For instance, if you define the line based on the sum of absolute values of residuals, not squares of residuals, then you get what is called median regression. It is a less popular method, but it is a better choice for certain tasks.

Also, when squaring residuals, we exaggerate the large ones. A large residual is large to begin with, and squaring it makes it extra large. As a result, larger residuals weigh more in deciding how the line should go. Linear regression works hard to avoid a few large residuals, unlike median regression. This is often, although not always, what we want.

11.3.5 Two examples

This section presents two linear regression examples, one using the iris flower data, and another using basketball data.

11.3.5.1 Linear regression: virginica flower

Above, in Section 11.2.2, we used iris data and found that for virginica flowers, \[\begin{equation} cor(\text{petal length}, \text{sepal length}) = 0.864 \end{equation}\]

The number 0.864 tells us that the lengths of these two flower parts are closely related: if one is longer, then the other is usually longer too. The value is close to one, which means that there is not much wiggle room–if you make the plot, the data points line up fairly closely.

But how much longer are the petals in flowers with longer sepals? This is not something the correlation coefficient tells us. We need linear regression to answer this question. The resulting slope tells exactly that–how much longer the petals tend to be in flowers whose sepals are one unit longer:

virginica <- read_delim("data/iris.csv") %>%  # load data...
   filter(Species == "virginica")  # ... and keep only virginica
lm(Petal.Length ~ Sepal.Length, data = virginica)
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = virginica)
## 
## Coefficients:
##  (Intercept)  Sepal.Length  
##       0.6105        0.7501
  • The formula is stated as Petal.Length ~ Sepal.Length. This means the slope tells how much longer the petals are for flowers with longer sepals.
  • Note also that here we feed data directly using data = virginica, not through a pipe.

We see that the coefficient for sepal length is 0.7501. This means that virginica flowers that have 1cm longer sepals tend to have 0.75cm longer petals. As here the units for both \(x\) and \(y\) are the same (both are measured in centimeters), this also holds for other length units. So we can also tell that 1mm (or 1 mile if you wish) longer sepals are associated with 0.75mm (or 0.75 mile) longer petals.

The value of the intercept is 0.61. This means that flowers with 0-length sepals have petals that are 0.61cm long. This is obviously an absurd result. It is similar to the case with an unrealistic temperature at year 0 (see Section 11.3.2). However, it will likely not be a problem, as in our dataset there are no virginicas with sepal length close to zero.
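
The two coefficients can again be combined into a prediction. A quick sketch, using the rounded coefficients from above (the 6cm sepal length is just an example value):

0.6105 + 0.7501*6  # predicted petal length for 6cm sepals: about 5.11cm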

Exercise 11.4 Use iris data.

  • Find the shortest and longest sepal length for virginica flowers.
  • Find the same for all three species. Try to do this within a single pipeline using grouped operations! (See Section 5.6.)

11.3.5.2 Linear regression: playing time and score in basketball

In Section 11.2.3 above we displayed the relationship between game time and score for James Harden in the 2021-2022 season. See the Data Appendix about the Harden data and Appendix 11.5 for explanations of the data processing. Here we just copy the final code from the appendix and store the cleaned data in a new data frame:

harden <- read_delim("data/harden-21-22.csv")

cleanHarden <- harden %>%
   select(MP, PTS) %>%
   mutate(pts = as.numeric(PTS)) %>%
   filter(!is.na(pts)) %>% 
   separate(MP,
            into=c("min", "sec"),
            convert=TRUE) %>% 
   mutate(duration = min + sec/60)

As a reminder, the correlation between time played and score was 0.3:

cleanHarden %>%
   select(pts, duration) %>%
   cor()
##                pts  duration
## pts      1.0000000 0.3084074
## duration 0.3084074 1.0000000

But again, this number does not tell how much more Harden will score if he plays more time. This can be answered with linear regression:

lm(pts ~ duration, data=cleanHarden)
## 
## Call:
## lm(formula = pts ~ duration, data = cleanHarden)
## 
## Coefficients:
## (Intercept)     duration  
##     -3.8367       0.6949

Here we see that the duration coefficient is 0.69. This tells us that one more unit of duration is associated with 0.69 more points. As we measure duration in minutes, it is better to express the result as “In games where Harden played one more minute, he scored on average 0.7 points more”.
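
As before, the intercept and the slope can be combined into a rough prediction. A sketch with the rounded coefficients from above (35 minutes is just an example value, and this is the trend, not an actual game):

-3.8367 + 0.6949*35  # predicted score for a 35-minute game: about 20 points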

11.3.6 Pitfalls: correlation versus causation

Note that neither correlation nor linear regression tells us whether one variable causes the other one to grow or decrease.

TBD: correlation and causation

11.3.7 Pitfalls: regression to mean

TBD: regression to mean

11.4 Linear regression: categorical variables

All the linear regression examples above were done assuming that the explanatory variable, \(x\), is continuous. This included years (for global temperature), sepal length (for flowers), and game time (for basketball). But linear regression can also be used with categorical variables, such as binary variables.

Next, we discuss what binary (dummy) variables are, thereafter we give examples of where they may be useful for linear regression, and finally we discuss how to interpret the corresponding regression results.

11.4.1 Binary (dummy) variables

The explanatory variables we looked at earlier, such as year or game time, can take all kinds of values (they are continuous), although the “all kinds” is often limited to a certain range (e.g. game time cannot be negative). Linear regression helps to analyze how the outcome depends on the explanatory variable. But what should we do if we are interested in the relationship between some kind of outcome and a categorical variable?

There are a lot of categorical variables that are important to analyze. Smoking–with two possible options (smoker/non-smoker)–is one example of a binary variable. Binary variables can take two values only, and usually these values are some sort of categories, such as male/female, science/arts, working/not working, and so on.21 All these examples are categorical–the categories are not numbers–but binary variables can also be numeric.

Below, we analyze how the babies’ birth weight depends on whether the mother smokes or not. Babies’ birth weight is considered a good indicator of their health–low weight is an indicator of certain problems in their development.

We use the North Carolina births data. This is a sample of 1000 births from North Carolina that includes the baby’s weight and information about the mother and the pregnancy. Let’s only keep the birth weight (weight, in pounds) and the smoking status (habit, either “smoker” or “nonsmoker”), as we do not need the other variables here.

A sample of the dataset looks like this:

births <- read_delim("data/ncbirths.csv") %>%
   select(habit, weight) %>%
   filter(!is.na(habit))
                           # also: remove missing smoking status
births %>%
   sample_n(4)
## # A tibble: 4 × 2
##   habit     weight
##   <chr>      <dbl>
## 1 smoker      7.69
## 2 smoker      7.13
## 3 smoker      7.94
## 4 nonsmoker   7.44

How many smokers/non-smokers do we have?

births %>%
   group_by(habit) %>%
   summarize(n())
## # A tibble: 2 × 2
##   habit     `n()`
##   <chr>     <int>
## 1 nonsmoker   873
## 2 smoker      126

What is the average birth weight for smokers/non-smokers?

births %>%
   group_by(habit) %>%
   summarize(mean(weight))
## # A tibble: 2 × 2
##   habit     `mean(weight)`
##   <chr>              <dbl>
## 1 nonsmoker           7.14
## 2 smoker              6.83

We see that babies of non-smoking mothers weigh a bit more–7.14 versus 6.83, i.e. the non-smoking moms’ babies are about 0.32 pounds heavier on average.


Distribution of birth weight–smokers versus non-smokers.

How is the birth weight distributed by smokers/non-smokers?

births %>%
   ggplot(aes(habit, weight)) +
   geom_boxplot()

The boxplot suggests that the distributions are fairly similar. If anything, the smoking moms’ babies have a slightly longer tail toward low birth weight.

But when we want to do any computations, we cannot use values like “smoker” and “non-smoker”. We need numbers. A common approach is to convert categorical variables to dummy variables, or dummies. These are “stupid” variables that can only take the values “0” or “1”. Importantly, these are numeric variables.

The most common way to convert a categorical variable to a dummy is to pick one of the values (typically the one that comes first alphabetically, here “nonsmoker”) and say that it is “0”. The other, “smoker”, will thus be “1”. We can call such a dummy smoker to stress that the value “1” means smoking and the value “0” means non-smoking. Here is an example of how the results may look:

  id  weight  habit      smoker
   1    7.63  nonsmoker       0
   2    7.88  nonsmoker       0
 160    6.69  smoker          1
 161    5.25  smoker          1

The first two mothers in this sample are not smokers (habit = “nonsmoker”), and hence their smoker dummy is “0”. The next two moms are smokers, and their smoker = 1 accordingly.

TBD: reference category

The software typically converts categorical variables to dummies automatically as needed, but if you want to do it manually, you can use something like habit == "smoker"–this creates a logical variable with TRUE/FALSE values, see Section 2.6. Or if you want a numeric type, you can write as.numeric(habit == "smoker"). For instance:

births %>%
   mutate(smoker = as.numeric(habit == "smoker")) %>%
   slice(c(1,2, 160,161))  # select two rows of non-smokers and two of smokers
## # A tibble: 4 × 3
##   habit     weight smoker
##   <chr>      <dbl>  <dbl>
## 1 nonsmoker   7.63      0
## 2 nonsmoker   7.88      0
## 3 smoker      6.69      1
## 4 smoker      5.25      1

We can use this for re-computing the average birth weight by smoking status:

births %>%
   mutate(smoker = as.numeric(habit == "smoker")) %>%
   group_by(smoker) %>%
   summarize(mean(weight))
## # A tibble: 2 × 2
##   smoker `mean(weight)`
##    <dbl>          <dbl>
## 1      0           7.14
## 2      1           6.83

The results are exactly the same as above, only now the groups are not called “smoker” and “nonsmoker”, but “0” and “1”. This may seem like an unnecessary exercise, but this is what needs to be done if you want to compute using categorical variables. Even if the software does it automatically, knowing how dummies are made helps you to understand the results.

11.4.2 Analyzing the difference with linear regression

The dummy we created above, smoker, is numeric–despite it taking only two values, 0 and 1, it is still numeric.


Here we plot the data together with the trend line. In order to make the data points more visible, we use geom_jitter(), which shifts the points a little bit (here by up to 0.1) off their horizontal position, while leaving the vertical position unchanged:

births %>%
   mutate(smoker = as.numeric(
             habit == "smoker"
          )) %>%
   ggplot(aes(smoker, weight)) +
   geom_jitter(width = 0.1, height = 0) +
   geom_smooth(method = "lm",
               se = FALSE)

The x-axis (“smoker-axis”) may look a bit weird: after all, only the values “0” and “1” should be present there. But let the axis values stay as they are for now.

The main message from the figure is that the trend line heads down–smokers’ babies tend to be smaller.

We can capture this trend line by linear regression as

births %>%
   mutate(smoker = as.numeric(habit == "smoker")) %>%
   lm(weight ~ smoker, data = .)
## 
## Call:
## lm(formula = weight ~ smoker, data = .)
## 
## Coefficients:
## (Intercept)       smoker  
##      7.1443      -0.3155

How to interpret these results?

The basic interpretation is the same as in the case of ordinary numeric variables.

  • The intercept means that where smoker = 0, the babies’ average weight is 7.144lb. This corresponds exactly to what we computed above by just doing grouped averages–non-smoking moms’ babies weigh 7.144lb on average.

  • The slope, here labeled “smoker”, tells that moms whose smoker value is larger by one unit have babies that are lighter by 0.316lb (on average). But note that smoker can only have two values–“0” and “1”. Hence “smoker larger by one unit” means a smoking mom, as opposed to a non-smoking mom. So the correct but clumsy sentence above can be re-phrased as “babies of smoking moms are on average 0.316lb lighter”.

Comparing the results with those in Section 11.4.1, we can see that linear regression captured a) the average weight for non-smokers, and b) the weight difference between smokers and non-smokers.
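
A quick check that the two coefficients reproduce the group averages from Section 11.4.1 (a simple sketch, using the rounded numbers from the output above):

7.1443 + (-0.3155)  # intercept + slope: about 6.83lb, the smokers' average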

Normally, we do not have to create the dummies manually–software does it for you behind the scenes. So you can just write

births %>%
   lm(weight ~ habit, data = .)
## 
## Call:
## lm(formula = weight ~ habit, data = .)
## 
## Coefficients:
## (Intercept)  habitsmoker  
##      7.1443      -0.3155

The results are identical to those above, except that now we have the slope labeled as “habitsmoker” instead of “smoker”.

11.5 Appendix: how to process Harden data

The Harden data is mostly clean, but for our purpose it has two problems: first, because it contains data about games where Harden did not play, the points and other columns are not numeric; and second, his playing time is stored as mm:ss. The latter cannot be used directly for computing and must be separated into minutes and seconds.

For our purpose, we only need two columns: MP (minutes played) and PTS (points).

First load data

harden <- read_delim("data/harden-21-22.csv")
                           # it is in compressed format in the repo

and take a quick look:

harden %>%
   head(3)
## # A tibble: 3 × 30
##      Rk     G Date       Age    Tm    ...6  Opp   ...8    GS    MP   
##   <dbl> <dbl> <date>     <chr>  <chr> <chr> <chr> <chr>   <chr> <chr>
## 1     1     1 2021-10-19 32-054 BRK   @     MIL   L (-23) 1     30:38
## 2     2     2 2021-10-22 32-057 BRK   @     PHI   W (+5)  1     38:25
## 3     3     3 2021-10-24 32-059 BRK   <NA>  CHO   L (-16) 1     33:09
##   FG    FGA   `FG%` `3P`  `3PA` `3P%` FT    FTA   `FT%` ORB   DRB   TRB  
##   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 6     16    .375  4     8     .500  4     4     1.000 3     5     8    
## 2 7     17    .412  3     7     .429  3     4     .750  1     6     7    
## 3 6     16    .375  2     8     .250  1     1     1.000 2     5     7    
##   AST   STL   BLK   TOV   PF    PTS   GmSc  `+/-`
##   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 8     1     2     4     3     20    17.6  -20  
## 2 8     2     0     5     0     20    15.6  -1   
## 3 8     1     1     8     5     15    6.4   -15

As there are too many variables, let’s select only the ones we actually need. These are MP (minutes played) and PTS (points):

harden %>%
   select(MP, PTS) %>%
   head(3)
## # A tibble: 3 × 2
##   MP    PTS  
##   <chr> <chr>
## 1 30:38 20   
## 2 38:25 20   
## 3 33:09 15

This gives us a much more understandable picture. We can see that minutes played is coded as MM:SS and stored as text (“<chr>” = character, see Section 6.1). Points all look like numbers, but for some reason they are stored as characters.

The first thing we need to do now is to understand why the numbers are stored as characters. Normally, numbers are read as numbers, but for some reason this is not the case here. Because the points must be numbers, we need to convert the column to numbers. But before converting, we should understand the root of the problem. A simple option is to convert PTS to numeric and see whether this results in any missings:

harden %>%
   select(MP, PTS) %>%
   mutate(pts = as.numeric(PTS)) %>%
               # converted column is in lower case, the original in
               # upper case
   filter(is.na(pts)) %>%  # keep only those cases where 'pts' is NA
   head(3)
## # A tibble: 3 × 3
##   MP       PTS        pts
##   <chr>    <chr>    <dbl>
## 1 Inactive Inactive    NA
## 2 Inactive Inactive    NA
## 3 Inactive Inactive    NA

We see a number of cases where PTS is listed as “Inactive”, i.e. Harden did not play. For some reason the point values are not left empty (or set to zero), but instead contain text, either “Inactive”, “Did Not Dress” or “Did Not Play”.22 As this seems to be an otherwise harmless way to denote missing data, we can just proceed by removing these missings:

harden %>%
   select(MP, PTS) %>%
   mutate(pts = as.numeric(PTS)) %>%
   filter(!is.na(pts)) %>%  # keep only those cases where 'pts' is not NA
   head(3)
## # A tibble: 3 × 3
##   MP    PTS     pts
##   <chr> <chr> <dbl>
## 1 30:38 20       20
## 2 38:25 20       20
## 3 33:09 15       15

Now we have a numeric pts column, denoted by “<dbl>” (double).

But we are not done just yet. We need to convert the playing time into some sort of decimal time, e.g. measured in minutes. This can be achieved with the function separate(). It separates a text column into two or more parts based on patterns in the text. This is exactly what we need here–to separate a text like “30:38” into two parts, “30” and “38”, where the colon is the separator. There are many options for how to define the pattern and what to do with the original and resulting values; for now the default split pattern (anything that is neither a letter nor a number) works well:

harden %>%
   select(MP, PTS) %>%
   mutate(pts = as.numeric(PTS)) %>%
   filter(!is.na(pts)) %>% 
   separate(MP,  # separate `MP` column
            into=c("min", "sec"),  # ... into columns `min` and `sec`
            convert=TRUE) %>%  # and auto-convert to numbers
   head(3)
## # A tibble: 3 × 4
##     min   sec PTS     pts
##   <int> <int> <chr> <dbl>
## 1    30    38 20       20
## 2    38    25 20       20
## 3    33     9 15       15

So we tell separate() to split the variable “MP” into two new variables, called “min” and “sec”, based on anything that is neither a letter nor a number (this is the default behavior). We also tell it to convert the resulting “min” and “sec” columns to numbers.

Finally, for plotting we can compute the playing duration in minutes as just min + sec/60:

cleanHarden <- harden %>%
   select(MP, PTS) %>%
   mutate(pts = as.numeric(PTS)) %>%
   filter(!is.na(pts)) %>% 
   separate(MP,
            into=c("min", "sec"),
            convert=TRUE) %>% 
   mutate(duration = min + sec/60)  # continuous duration in min

cleanHarden %>%
   head(3)
## # A tibble: 3 × 5
##     min   sec PTS     pts duration
##   <int> <int> <chr> <dbl>    <dbl>
## 1    30    38 20       20     30.6
## 2    38    25 20       20     38.4
## 3    33     9 15       15     33.2

For instance, in the first game, 30 minutes and 38 seconds is approximately 30.6 minutes.

We stored this in a new data frame, cleanHarden, for future use.


Now we can plot the game duration versus score as:

cleanHarden %>%
   ggplot(aes(duration, pts)) +
   geom_smooth(method="lm", se=FALSE) +
   geom_point()

This is the same plot that was shown in Section 11.2.3.

The shortest and longest durations can be found by ordering the games with arrange(); there are other options too.
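
One possible sketch of this approach (the slice() call picks the first and the last row after sorting):

cleanHarden %>%
   arrange(duration) %>%  # sort games from shortest to longest
   slice(c(1, n())) %>%  # keep the first and the last game
   select(duration, pts)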


  18. Galton, F. (1886) Regression towards mediocrity in hereditary stature, The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263↩︎

  19. Media typically reports the temperature trend per decade, not per year. Obviously, 0.0182C per year is the same as 0.182C per decade.↩︎

  20. This is why linear regression is also called least squares.↩︎

  21. Some of these, e.g. smoking, do not have to be binary. Instead of a binary “smoker”/“non-smoker”, we can also record how many cigarettes a day someone smokes, with possibly “0” for non-smokers. However, in many accessible datasets, smoking is coded as binary. The same applies to gender and many other variables.↩︎

  22. Obviously, we could also just have looked at the table, either in the data viewer or on the Basketball Reference website. But that option only works for small datasets.↩︎