Chapter 15 Machine Learning Workflow
This section describes a typical machine learning workflow: fitting the model, computing its performance, tinkering with it, and trying again. This is how you determine which model is best and which hyperparameter values to use.
Obviously, ML involves many more tasks, in particular data cleaning and pre-processing, but we do not discuss those here.
First we provide the examples using base R, and afterwards show how to achieve similar results using tidymodels.
15.1 Overfitting: random numbers improve results!
This section demonstrates many of the relevant tools by overfitting a classification (logistic regression) model. We do this by adding random data to the model. As random numbers are not related to the features or the outcome, they should not change the results; in particular, they should not substantially improve them. However, if we assess the model's performance on the training data, its performance will appear to improve.
15.1.1 The baseline model
Below is an example using the 1984 House votes data (house-votes-84). First we load it and give the columns meaningful names:
hv <- read_delim("../data/house-votes-84.csv.bz2",
                 col_names = c("party", paste0("v", 1:16)))
head(hv, 3)
## # A tibble: 3 × 17
## party v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 republican n y n y y y n n n y ? y
## 2 republican n y n y y y n n n n n y
## 3 democrat ? y y ? y y n n n n y n
## v13 v14 v15 v16
## <chr> <chr> <chr> <chr>
## 1 y y n y
## 2 y y n ?
## 3 y y n n
Next, for simplicity, let's only use the first 4 votes, and convert each vote (either "y", "n", or "?") into a logical variable with TRUE for yea and FALSE for non-yea. We'll also convert party into a logical republican (TRUE/FALSE):
hv <- hv %>%
   select(party, v1:v4) %>%     # only the first 4 votes for simplicity
   mutate(across(v1:v4, function(x) x == "y"),
          republican = party == "republican") %>%
   select(!party)
head(hv, 3)
## # A tibble: 3 × 5
## v1 v2 v3 v4 republican
## <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 FALSE TRUE FALSE TRUE TRUE
## 2 FALSE TRUE FALSE TRUE TRUE
## 3 FALSE TRUE TRUE FALSE FALSE
As the last step in data preparation, we extract the column republican as the outcome vector y. We do not need it here, but it will make the notation simpler below.
y <- hv$republican
As this is a classification problem, we use logistic regression. First, we model republican using the four votes:
m <- glm(republican ~ ., family = binomial(), data = hv)
and thereafter predict:
yhat <- predict(m, type = "response") > 0.5
As we do not have preferences over type-1/type-2 errors, we can just use accuracy to evaluate the model:
acc <- mean(yhat == y)
acc
## [1] 0.9494253
So the model gets roughly 94.9 percent of the predictions right.
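Accuracy treats both error types the same; if we did care about the distinction, we could inspect the confusion matrix instead. A quick sketch using the yhat and y computed above:
table(predicted = yhat, actual = y)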
15.1.2 Adding random columns to data
Next, let's add a column of random TRUE/FALSE values to the data. Here we use sample() to draw a random value for each row:
r <- sample(c(TRUE, FALSE), nrow(hv), replace = TRUE)   # the random column
hv1 <- cbind(hv, r)
Now there is an additional random column in the updated data frame hv1:
hv1 %>%
   sample_n(3)
## v1 v2 v3 v4 republican r
## 1 FALSE TRUE FALSE TRUE TRUE FALSE
## 2 TRUE TRUE FALSE TRUE TRUE FALSE
## 3 TRUE FALSE TRUE FALSE FALSE TRUE
We fit a similar model, this time also including the random column:
m <- glm(republican ~ ., family = "binomial", data = hv1)
yhat <- predict(m, type = "response") > 0.5
acc1 <- mean(y == yhat)
acc1
## [1] 0.9563218
This results in slightly better accuracy; the difference is 0.69 percentage points.
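The difference quoted above is simply the gap between the two accuracy values, expressed in percentage points:
100*(acc1 - acc)   # difference in percentage points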
If we add more random columns, the results improve even more. Here is an example with 10 columns:
r10 <- sample(c(TRUE, FALSE), 10*nrow(hv), replace = TRUE) %>%   # create random T/F for 10 columns
   matrix(ncol = 10) %>%     # transform it into a 10-column matrix
   as.data.frame()           # and into a 10-column data frame
hv10 <- cbind(hv, r10)
m <- glm(republican ~ ., family = "binomial", data = hv10)
yhat <- predict(m, type = "response") > 0.5
acc10 <- mean(y == yhat)
acc10
## [1] 0.9609195
Now the accuracy is 96.09 percent, an improvement of 1.15 percentage points over the baseline.
If we add enough random data, we’ll eventually get perfect predictions:
r40 <- sample(c(TRUE, FALSE), 40*nrow(hv), replace = TRUE) %>%   # 40 columns of random data
   matrix(ncol = 40) %>%
   as.data.frame()
hv40 <- cbind(hv, r40)
m <- glm(republican ~ ., family = "binomial", data = hv40)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
yhat <- predict(m, type = "response") > 0.5
mean(y == yhat)
## [1] 1
We get convergence warnings, but the accuracy is now 1: adding enough random data gives us perfect results. This is overfitting in its purest form: with enough columns of pure noise, the model can effectively memorize the training data.
15.2 Training-validation split
We should compute model performance on validation data, not training data:
iTrain <- sample(nrow(hv), 0.8*nrow(hv))
hvTrain <- hv[iTrain,]
hvValid <- hv[-iTrain,]
As a check, let's look at the dimensions:
dim(hvTrain)
## [1] 348 5
dim(hvValid)
## [1] 87 5
The original data is split into 348 training and 87 validation cases, roughly an 80/20 ratio.
m <- glm(republican ~ ., family = binomial, data = hvTrain)
yhat <- predict(m, type = "response", newdata = hvValid) > 0.5
mean(y == yhat)
## [1] 0.5172414
The accuracy appears to drop to about 52%, but this number is misleading: yhat now contains predictions for the 87 validation cases only, while y still holds all 435 outcomes, so R silently recycles yhat in the comparison. We have to compare the predictions against the validation outcomes only, as sketched below and as done in the following examples by extracting yTrain and yValid.
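A minimal sketch of the corrected comparison, reusing the model m, hvValid, and iTrain from above:
yValid <- hv$republican[-iTrain]                                 # outcomes for the validation cases only
yhat <- predict(m, type = "response", newdata = hvValid) > 0.5   # validation predictions
mean(yValid == yhat)                                             # baseline validation accuracy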
Now let's see how the performance looks if we add 100 random columns.
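The data frame hv100 below holds the original votes plus 100 random columns; assuming it is built the same way as r10 and r40 above:
r100 <- sample(c(TRUE, FALSE), 100*nrow(hv), replace = TRUE) %>%   # random T/F for 100 columns
   matrix(ncol = 100) %>%
   as.data.frame()
hv100 <- cbind(hv, r100)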
hvTrain <- hv100[iTrain,]
hvValid <- hv100[-iTrain,]
yTrain <- hv$republican[iTrain]
yValid <- hv$republican[-iTrain]
m <- glm(republican ~ ., family = "binomial", data = hvTrain)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
yhat <- predict(m, type = "response", newdata = hvValid) > 0.5
mean(yValid == yhat)
## [1] 0.9310345
Now the validation performance is noticeably lower: only about 93%. On data the model has not seen, the random columns do not help.
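For contrast, a quick sketch of the same model's accuracy on its own training data, reusing m and yTrain from the block above; given the warnings about fitted probabilities of 0 or 1, we can expect it to be close to 1:
yhatTrain <- predict(m, type = "response") > 0.5   # predictions on the training data
mean(yTrain == yhatTrain)                          # training accuracy of the hv100 model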
We can see how the model's performance changes as we keep adding random columns by looping over the number of columns:
rcolumns <- seq(1, 350, by = 5)
aTrain <- aValid <- numeric()
iTrain <- sample(nrow(hv), 0.8*nrow(hv))
i <- 1
for(rcols in rcolumns) {
   r <- sample(c(TRUE, FALSE), rcols*nrow(hv), replace = TRUE) %>%
      matrix(ncol = rcols)
   X <- cbind(hv, r)
   XTrain <- X[iTrain,]
   XValid <- X[-iTrain,]
   yTrain <- y[iTrain]
   yValid <- y[-iTrain]
   m <- glm(republican ~ ., family = binomial(),
            data = XTrain)
   yhat <- predict(m, type = "response") > 0.5
   aTrain[i] <- mean(yTrain == yhat)    # training accuracy
   yhat <- predict(m, type = "response", newdata = XValid) > 0.5
   aValid[i] <- mean(yValid == yhat)    # validation accuracy
   i <- i + 1
}
data.frame(rcolumns, aTrain, aValid) %>%
   head(6)
## rcolumns aTrain aValid
## 1 1 0.9511494 0.9310345
## 2 6 0.9655172 0.9310345
## 3 11 0.9655172 0.9655172
## 4 16 0.9885057 0.9425287
## 5 21 0.9741379 0.9540230
## 6 26 1.0000000 0.9195402
Finally, let's plot these results. As the table above already suggests, the training accuracy quickly reaches 1, while the validation accuracy does not improve:
plot(rcolumns, aValid, type = "l",
     xlab = "# of random columns", ylab = "Accuracy",
     ylim = c(0, 1))
lines(rcolumns, aTrain, col = "red")
legend("bottomleft",
       legend = c("Training", "Validation"),
       lty = 1, col = c("red", "black"))
15.3 tidymodels workflow
For more advanced usage, it is common not to build such training and validation pipelines yourself, but to use ready-made packages.
library(tidymodels)
Below, we demonstrate the tidymodels approach using decision trees. However, trees work better if we use more columns than just the first four votes (otherwise tree depth will not influence the results, as the trees will always rely only on vote 4).
hv <- read_delim("../data/house-votes-84.csv.bz2",
                 col_names = c("party", paste0("v", 1:16)))
hv <- hv %>%
   mutate(across(v1:v16, function(x) x == "y"),
          republican = factor(party == "republican")) %>%
   # classification tree requires a factor outcome
   select(!party)
y <- hv$republican
Next, the training-validation split:
splitData <- validation_split(hv, prop = 0.80)
## Warning: `validation_split()` was deprecated in rsample 1.2.0.
## ℹ Please use `initial_validation_split()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
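The warning points to initial_validation_split() as the replacement. A sketch of the newer interface, assuming rsample 1.2.0 or later (the object names split3 and splitData3 are arbitrary): it produces a three-way train/validation/test split, which validation_set() then packages into a resampling object that tune_grid() accepts.
split3 <- initial_validation_split(hv, prop = c(0.6, 0.2))   # 60% train, 20% validation, 20% test
splitData3 <- validation_set(split3)                         # rset usable by tune_grid()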
m <- decision_tree(mode = "classification",
                   tree_depth = tune(),
                   cost_complexity = 1e-5,
                   min_n = 3)
recipe <- recipe(republican ~ ., data = hv)
wf <- workflow() %>%
   add_model(m) %>%
   add_recipe(recipe)
wf
## ══ Workflow ════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = 1e-05
## tree_depth = tune()
## min_n = 3
##
## Computational engine: rpart
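The printout shows 0 Recipe Steps: our recipe only declares the model formula. If preprocessing were needed, it could be added as step_*() calls from the recipes package; for instance, a hypothetical recipe that also drops zero-variance predictors:
recipe(republican ~ ., data = hv) %>%
   step_zv(all_predictors())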
grid <- data.frame(tree_depth = 1:5)
grid
## tree_depth
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
res <- wf %>%
   tune_grid(splitData,
             grid = grid,
             control = control_grid(save_pred = TRUE),
             metrics = metric_set(accuracy,
                                  precision, recall, f_meas))
res
## # Tuning results
## # Validation Set Split (0.8/0.2)
## # A tibble: 1 × 5
## splits id .metrics .notes .predictions
## <list> <chr> <list> <list> <list>
## 1 <split [348/87]> validation <tibble [20 × 5]> <tibble [0 × 3]> <tibble [435 × 5]>
metrics <- res %>%
   collect_metrics()
filter(metrics, .metric == "accuracy")
## # A tibble: 5 × 7
## tree_depth .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 accuracy binary 0.954 1 NA Preprocessor1_Model1
## 2 2 accuracy binary 0.954 1 NA Preprocessor1_Model2
## 3 3 accuracy binary 0.966 1 NA Preprocessor1_Model3
## 4 4 accuracy binary 0.954 1 NA Preprocessor1_Model4
## 5 5 accuracy binary 0.966 1 NA Preprocessor1_Model5

Different metrics as a function of tree depth
For a better understanding, let's plot how the performance indicators depend on the tree depth:
ggplot(metrics,
       aes(tree_depth, mean, col = .metric)) +
   geom_point() +
   geom_line()
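With the depth chosen, one would normally finalize and refit the model. A minimal sketch using standard tune functions (the object names best, finalWf, and finalFit are our own):
best <- select_best(res, metric = "accuracy")   # best tree depth by accuracy
finalWf <- finalize_workflow(wf, best)          # plug the chosen depth into the workflow
finalFit <- fit(finalWf, data = hv)             # refit on the data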