Chapter 14 Machine Learning Workflow
14.1 Overfitting: random numbers improve results!
As random numbers are not related to your data and outcomes, including random numbers in the model should not change the results; in particular, it should not substantially improve them. However, if we assess the model's performance on the training data, then the performance will actually improve.
hv <- read_delim("../data/house-votes-84.csv.bz2",
                 col_names = c("party", paste0("v", 1:16)))
head(hv, 3)
## # A tibble: 3 × 17
## party v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 republican n y n y y y n n n y ? y y y n y
## 2 republican n y n y y y n n n n n y y y n ?
## 3 democrat ? y y ? y y n n n n y n y y n n
Make the votes into logical variables: TRUE for yes, FALSE for non-yes:
hv <- hv %>%
   select(party, v1:v4) %>%      # only the first 4 votes, for simplicity
   mutate(across(v1:v4, function(x) x == "y"),
          republican = party == "republican") %>%
   select(!party)
y <- hv$republican               # store the outcome separately for later use
head(hv, 3)
## # A tibble: 3 × 5
## v1 v2 v3 v4 republican
## <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 FALSE TRUE FALSE TRUE TRUE
## 2 FALSE TRUE FALSE TRUE TRUE
## 3 FALSE TRUE TRUE FALSE FALSE
Fit a logit model:
Confusion matrix and accuracy:
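A minimal sketch of these two steps, following the same pattern as the later blocks in this chapter (the model uses all four vote columns; table() cross-tabulates predictions against the truth):
m <- glm(republican ~ ., family = binomial(), data = hv)
yhat <- predict(m, type = "response") > 0.5
table(yhat, y)    # confusion matrix: predictions in rows, truth in columns
mean(y == yhat)   # accuracy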
## y
## yhat FALSE TRUE
## FALSE 254 9
## TRUE 13 159
## [1] 0.9494253
Add a random variable to the data:
r <- sample(c(TRUE, FALSE), nrow(hv), replace=TRUE)
hv1 <- cbind(hv, r)
m <- glm(republican ~ ., family="binomial", data=hv1)
yhat <- predict(m, type="response") > 0.5
mean(y == yhat)
## [1] 0.9563218
Add more random variables:
r10 <- sample(c(TRUE, FALSE), 10*nrow(hv), replace=TRUE) %>%
   matrix(ncol=10)
hv10 <- cbind(hv, r10)
m <- glm(republican ~ ., family="binomial", data=hv10)
yhat <- predict(m, type="response") > 0.5
mean(y == yhat)
## [1] 0.9586207
Add even more random variables:
r100 <- sample(c(TRUE, FALSE), 100*nrow(hv), replace=TRUE) %>%
   matrix(ncol=100)
hv100 <- cbind(hv, r100)
m <- glm(republican ~ ., family="binomial", data=hv100)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
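Predictions and accuracy are again computed on the training data; a sketch, mirroring the earlier blocks:
yhat <- predict(m, type = "response") > 0.5
mean(y == yhat)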
## [1] 1
We see that the accuracy is now 1: adding random variables to the data apparently improves the results. Of course, this only happens because we are assessing the model on the very data it was fitted to; it is overfitting, not a genuine improvement.
14.2 Training-validation split
We should compute model performance on validation data, not training data:
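A sketch of such a split: 80% of the rows go to training, the rest to validation (the names iTrain, hvTrain, hvValid, yTrain, yValid are the ones used below):
iTrain <- sample(nrow(hv), 0.8*nrow(hv))   # indices of training cases
hvTrain <- hv[iTrain,]                     # training data
hvValid <- hv[-iTrain,]                    # validation data
yTrain <- y[iTrain]                        # training outcomes
yValid <- y[-iTrain]                       # validation outcomes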
For a check, let’s see the dimensions:
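Using the names from the sketch above:
dim(hvTrain)
dim(hvValid)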
## [1] 348 5
## [1] 87 5
The original data is split into 348 training and 87 validation cases, roughly an 80/20 ratio.
m <- glm(republican ~ ., family = binomial(), data=hvTrain)
yhat <- predict(m, type="response", newdata=hvValid) > 0.5
mean(yValid == yhat)
With the data in its original form, the performance on the validation data is similar to the performance on the complete data.
Now let's see how the performance looks if we add the random columns:
hvTrain <- hv100[iTrain,]
hvValid <- hv100[-iTrain,]
yTrain <- hv$republican[iTrain]
yValid <- hv$republican[-iTrain]
m <- glm(republican ~ ., family="binomial", data=hvTrain)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
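Predicting on the validation data and computing accuracy, a sketch following the earlier pattern:
yhat <- predict(m, type = "response", newdata = hvValid) > 0.5
mean(yValid == yhat)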
## [1] 0.8505747
Now the performance is noticeably worse: only about 85%.
We can see how the model's performance changes as we keep adding random columns by looping over the number of random columns:
rcolumns <- seq(1, 350, by=5)              # numbers of random columns to try
aTrain <- aValid <- numeric()              # accuracy on training/validation data
iTrain <- sample(nrow(hv), 0.8*nrow(hv))
i <- 1
for(rcols in rcolumns) {
   # add 'rcols' random columns to the data
   r <- sample(c(TRUE, FALSE), rcols*nrow(hv), replace=TRUE) %>%
      matrix(ncol=rcols)
   X <- cbind(hv, r)
   XTrain <- X[iTrain,]
   XValid <- X[-iTrain,]
   yTrain <- y[iTrain]
   yValid <- y[-iTrain]
   m <- glm(republican ~ ., family = binomial(),
            data = XTrain)
   # accuracy on training data
   yhat <- predict(m, type="response") > 0.5
   aTrain[i] <- mean(yTrain == yhat)
   # accuracy on validation data
   yhat <- predict(m, type = "response", newdata = XValid) > 0.5
   aValid[i] <- mean(yValid == yhat)
   i <- i + 1
}
data.frame(rcolumns, aTrain, aValid) %>%
   head(6)
## rcolumns aTrain aValid
## 1 1 0.9511494 0.9310345
## 2 6 0.9655172 0.9310345
## 3 11 0.9655172 0.9655172
## 4 16 0.9885057 0.9425287
## 5 21 0.9741379 0.9540230
## 6 26 1.0000000 0.9195402
Finally, let’s plot these results:
plot(rcolumns, aValid, type="l",
     xlab="# of random columns", ylab="Accuracy",
     ylim=c(0,1))
legend("bottomleft",
       legend=c("Training", "Validation"),
       lty=1, col=c("red", "black"))
lines(rcolumns, aTrain, col="red")
14.3 tidymodels workflow
For more advanced usage, it is common not to build such training and validation pipelines yourself, but to use ready-made packages.
Below, we demonstrate this with the tidymodels approach, using decision trees. However, trees work better if we use more columns than just the first four votes (otherwise the tree depth will not influence the results, as the trees will always rely only on vote 4).
hv <- read_delim("../data/house-votes-84.csv.bz2",
                 col_names = c("party", paste0("v", 1:16)))
hv <- hv %>%
   mutate(across(v1:v16, function(x) x == "y"),
          republican = factor(party == "republican")) %>%
   # classification tree requires a factor outcome
   select(!party)
y <- hv$republican
Training-validation split
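A sketch of this step, using rsample's validation_split() for an 80/20 split (this is what triggers the deprecation warning below; the name splitData is the one expected by the tuning code further down):
library(tidymodels)   # if not already loaded; attaches rsample, parsnip, recipes, workflows, tune, ...

splitData <- validation_split(hv, prop = 0.8)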
## Warning: `validation_split()` was deprecated in rsample 1.2.0.
## ℹ Please use `initial_validation_split()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this
## warning was generated.
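The workflow printed below combines an empty recipe with a tunable decision tree. A sketch of how it could be constructed (the names wf, treeSpec, and rec are illustrative):
rec <- recipe(republican ~ ., data = hv)           # no preprocessing steps
treeSpec <- decision_tree(cost_complexity = 1e-5,
                          tree_depth = tune(),     # depth will be tuned
                          min_n = 3) %>%
   set_mode("classification") %>%
   set_engine("rpart")
wf <- workflow() %>%
   add_recipe(rec) %>%
   add_model(treeSpec)
wf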
## ══ Workflow ══════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
##
## ── Preprocessor ──────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ─────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = 1e-05
## tree_depth = tune()
## min_n = 3
##
## Computational engine: rpart
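The grid of candidate tree depths is just a data frame with a single column, tree_depth; a sketch (the name grid matches the tune_grid() call below):
grid <- data.frame(tree_depth = 1:5)
grid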
## tree_depth
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
res <- wf %>%
   tune_grid(splitData,
             grid = grid,
             control = control_grid(save_pred = TRUE),
             metrics = metric_set(accuracy,
                                  precision, recall, f_meas)
             )
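Printing the result object shows what was computed:
res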
## # Tuning results
## # Validation Set Split (0.8/0.2)
## # A tibble: 1 × 5
## splits id .metrics .notes .predictions
## <list> <chr> <list> <list> <list>
## 1 <split [348/87]> validation <tibble [20 × 5]> <tibble [0 × 3]> <tibble [435 × 5]>
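The per-depth accuracies shown below can be extracted from the tuning results; a sketch using collect_metrics():
res %>%
   collect_metrics() %>%
   filter(.metric == "accuracy")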
## # A tibble: 5 × 7
## tree_depth .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 accuracy binary 0.954 1 NA Preprocessor1_Model1
## 2 2 accuracy binary 0.954 1 NA Preprocessor1_Model2
## 3 3 accuracy binary 0.966 1 NA Preprocessor1_Model3
## 4 4 accuracy binary 0.954 1 NA Preprocessor1_Model4
## 5 5 accuracy binary 0.966 1 NA Preprocessor1_Model5