Chapter 14 Machine Learning Workflow
We assume you have loaded the following packages:
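A plausible set of imports, judging by the aliases `pd` and `plt` used later in the chapter. Including `numpy` as `np` is an assumption.

```python
import numpy as np               # assumed: array handling
import pandas as pd              # data frames (alias pd is used below)
import matplotlib.pyplot as plt  # plotting (alias plt is used below)
```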
We load further packages below as we need them.
In this section we walk through a typical machine learning workflow using the sklearn package. We assume you know what overfitting and validation are (see Section 13) and that you are familiar with the basics of sklearn (see Section 10.2.2).
14.1 Boston Housing Data
Here we demonstrate a small task: selecting the best linear regression model using validation. We use the Boston Housing dataset.
Our task is to predict the average house value medv as well as we can using all other features. We pick a subset of features:
And we add a few features, namely \(\mathit{age}\times\mathit{rm}\) and
Now we have a dataset that looks like
First we demonstrate the workflow using the training-validation approach. We split the data into training and validation parts:
Now we can test different models in terms of \(R^2\):
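A minimal sketch of the split-fit-score cycle. The location of the Boston data file is not given here, so the block builds a synthetic stand-in data frame: the values, the coefficients used to generate `medv`, and the column name `age_rm` are all hypothetical. With the real data only the first few lines would change.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data: hypothetical values, NOT the real dataset
rng = np.random.default_rng(0)
n = 500
boston = pd.DataFrame({"age": rng.uniform(0, 100, n),
                       "rm": rng.uniform(4, 9, n)})
boston["medv"] = 9*boston["rm"] - 0.05*boston["age"] - 25 + rng.normal(0, 3, n)
boston["age_rm"] = boston["age"]*boston["rm"]  # interaction feature age x rm

X = boston[["age", "rm", "age_rm"]]
y = boston["medv"]
# hold back 20% of the rows for validation
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=1)

m = LinearRegression()
_ = m.fit(Xt, yt)
print("training R2:  ", m.score(Xt, yt))   # .score() returns R^2 for regression
print("validation R2:", m.score(Xv, yv))
```

The validation R² is the number to compare across models; the training R² is shown only to make a possible overfitting gap visible.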
This was the model with all variables. We can try other combinations of variables:
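Comparing variable combinations amounts to repeating the same split-fit-score steps for each feature list. The sketch below again uses a synthetic stand-in for the Boston data (hypothetical values); the helper `validation_r2` and the feature lists are illustrative assumptions, not the combinations used in the chapter.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data: hypothetical values, NOT the real dataset
rng = np.random.default_rng(0)
n = 500
boston = pd.DataFrame({"age": rng.uniform(0, 100, n),
                       "rm": rng.uniform(4, 9, n)})
boston["medv"] = 9*boston["rm"] - 0.05*boston["age"] - 25 + rng.normal(0, 3, n)

def validation_r2(features):
    """Fit a linear regression on `features` and return its validation R^2."""
    Xt, Xv, yt, yv = train_test_split(boston[features], boston["medv"],
                                      test_size=0.2, random_state=1)
    m = LinearRegression().fit(Xt, yt)
    return m.score(Xv, yv)

# the same random_state gives every model the same split
for features in (["age"], ["rm"], ["age", "rm"]):
    print(features, round(validation_r2(features), 3))
```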
and
14.2 Categorization: image recognition
Here we analyze MNIST digits. This is a dataset of handwritten digits, widely used for computer vision tasks. sklearn contains a low-resolution sample of this dataset:
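A minimal sketch of the loading step, using `load_digits` from `sklearn.datasets`. The variable names `mnist`, `X` and `y` follow those used in the rest of the chapter.

```python
from sklearn.datasets import load_digits

mnist = load_digits()  # dataset bundled with sklearn, no download needed
X = mnist.data         # design matrix: one flattened image per row
y = mnist.target       # labels: the digit each image depicts
```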
This loads the dataset and extracts the design matrix X and the labels y from it. We can see what the data look like with
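One way to inspect the data is to print the dimensions of both objects. This is a sketch that assumes the inspection was done with the `.shape` attribute.

```python
from sklearn.datasets import load_digits

mnist = load_digits()
X, y = mnist.data, mnist.target

print(X.shape)  # (1797, 64): 1797 images, 64 pixels each
print(y.shape)  # (1797,): one label per image
```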
The dimensions of X tell us that we have 1797 different digits, each of which contains 64 features. These features are pixels: each image consists of \(8\times8\) pixels, and in the design matrix X the images are flattened into 64 consecutive pixels. A sample of the data looks like
You can see many “0”-s (background) and numbers between “1” and “15”, denoting the varying intensity of the pen. We can easily plot these images; we just have to reshape them back into \(8\times8\) matrices. Reshaping a single row gives
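A short sketch of the reshaping step. Which row the chapter uses is not stated, so the block simply picks the first image labelled 7 (an assumption).

```python
import numpy as np
from sklearn.datasets import load_digits

mnist = load_digits()
X, y = mnist.data, mnist.target

i = np.flatnonzero(y == 7)[0]  # index of the first image labelled 7 (assumed choice)
img = X[i].reshape((8, 8))     # 64 consecutive pixels -> 8x8 matrix
print(img)
```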
If you look at the matrix closely, you can see that it depicts a number “7”. This is much easier to see if we plot the result:
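Plotting the reshaped matrix can be sketched with `imshow`. The choice of image (the first one labelled 7) and the `gray_r` colormap are assumptions, the latter borrowed from the multi-image plot later in this section.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

mnist = load_digits()
X, y = mnist.data, mnist.target

i = np.flatnonzero(y == 7)[0]  # first image labelled 7 (assumed choice)
ax = plt.subplot(1, 1, 1)
_ = ax.imshow(X[i].reshape((8, 8)), cmap="gray_r")  # dark pen on light background
_ = ax.axis("off")
```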
This is a low-resolution image of number “7”. Here is a larger example of the first 30 digits:
# plot the first 30 digits on a 3x10 grid
for i in range(30):
    ax = plt.subplot(3, 10, i+1)
    # reshape the 64 pixels back into an 8x8 image
    _ = ax.imshow(X[i].reshape((8, 8)), cmap='gray_r')
    _ = ax.axis("off")
    # title shows the actual label
    _ = ax.set_title(f"A: {y[i]}")
As you can see, the digits are of low quality and a bit hard for us to recognize. A computer will do very well though: our brains are trained on high-quality images, not on low-quality images like these.
As a first step, let’s take an easy task and separate “0”-s and “1”-s. We’ll test a few different models to see how well they perform. First, extract all “0”-s and “1”-s:
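One way to extract the two digits is a boolean mask built with `np.isin`. The names `Xd` and `yd` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits

mnist = load_digits()
X, y = mnist.data, mnist.target

i = np.isin(y, [0, 1])  # True for images labelled 0 or 1
Xd = X[i]               # hypothetical name: pixels of the selected images
yd = y[i]               # hypothetical name: their labels
print(Xd.shape, np.unique(yd))
```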
Next, we do training-validation split to validate our predictions:
We can use logistic regression for this binary classification task, and after we fit the model, we compute accuracy:
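A sketch of the split, fit and accuracy steps for the “0” versus “1” task. The 80/20 split, the `random_state` and the raised `max_iter` are assumptions, so the exact accuracy may differ from the one discussed in the text.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
i = np.isin(y, [0, 1])  # keep only "0"-s and "1"-s
Xt, Xv, yt, yv = train_test_split(X[i], y[i], test_size=0.2, random_state=1)
print(Xt.shape, Xv.shape)

m = LogisticRegression(max_iter=5000)  # generous iteration limit (assumption)
_ = m.fit(Xt, yt)
print(m.score(Xv, yv))  # for classifiers .score() is accuracy
```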
We got a perfect score, 1.0. This means that the model is able to perfectly distinguish between these two digits. Indeed, these digits are easy to distinguish because their pixel patterns look substantially different.
Let us try a more challenging task: distinguishing between “4”-s and “9”-s:
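The same steps can be repeated with a different pair of labels. As before, the split proportions, `random_state` and `max_iter` are assumptions, so the number of misclassified images may differ from the text.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
i = np.isin(y, [4, 9])  # keep only "4"-s and "9"-s
Xt, Xv, yt, yv = train_test_split(X[i], y[i], test_size=0.2, random_state=1)

m = LogisticRegression(max_iter=5000)
_ = m.fit(Xt, yt)
print(m.score(Xv, yv))  # validation accuracy
```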
The result is still ridiculously good with only a single wrong prediction as shown by the confusion matrix:
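A sketch of how such a confusion matrix can be computed with `confusion_matrix` from `sklearn.metrics`. The split settings are assumptions, so the counts need not match the text; the name `cm` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
i = np.isin(y, [4, 9])
Xt, Xv, yt, yv = train_test_split(X[i], y[i], test_size=0.2, random_state=1)
m = LogisticRegression(max_iter=5000).fit(Xt, yt)

yhat = m.predict(Xv)             # predicted labels for validation data
cm = confusion_matrix(yv, yhat)  # rows: actual labels, columns: predicted labels
print(cm)
```

Correct predictions sit on the main diagonal, so the off-diagonal sum is the number of errors.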
The mis-categorized image is
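Misclassified images can be located by comparing predictions with the actual labels. This sketch plots every such validation image; with the assumed split settings there may be more or fewer of them than in the text. The title format "A: actual, P: predicted" is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
i = np.isin(y, [4, 9])
Xt, Xv, yt, yv = train_test_split(X[i], y[i], test_size=0.2, random_state=1)
m = LogisticRegression(max_iter=5000).fit(Xt, yt)
yhat = m.predict(Xv)

wrong = np.flatnonzero(yhat != yv)  # positions of misclassified validation images
for k, j in enumerate(wrong):
    ax = plt.subplot(1, len(wrong), k + 1)
    _ = ax.imshow(Xv[j].reshape((8, 8)), cmap="gray_r")
    _ = ax.axis("off")
    _ = ax.set_title(f"A: {yv[j]}, P: {yhat[j]}")  # actual and predicted label
```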
Indeed, even human eyes cannot tell what the digit is.
Logistic regression can also handle more than two categories; this is called multinomial logit. So instead of distinguishing between just two types of digits, we can categorize all 10 digits:
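A sketch of the ten-class fit. `LogisticRegression` accepts multi-class labels directly. Silencing `ConvergenceWarning` is an assumption about how the default iteration limit was handled; raising `max_iter` would be an alternative.

```python
import warnings
from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# the default iteration limit may trigger a convergence warning on raw pixels
warnings.filterwarnings("ignore", category=ConvergenceWarning)

X, y = load_digits(return_X_y=True)  # all ten digits this time
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=1)

m = LogisticRegression()
_ = m.fit(Xt, yt)
print(m.score(Xv, yv))  # validation accuracy over ten categories
```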
The results are still very good, although we got more than a single case wrong. The confusion matrix is
We can see that by far most cases are on the main diagonal, that is, the model gets most of the cases right. The most problematic cases are mispredicting “8” as “1”.
14.3 Training-validation-testing approach
We start by separating testing, or hold-out data:
Now we do not touch the hold-out data until the very end. Instead, we split the work-data into training and validation parts:
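The two-stage split can be sketched with two calls to `train_test_split`. The names `Xw` and `Xt` follow the chapter; `Xh`, `yh`, `yw` and the 80/20 proportions are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# stage 1: set the hold-out (testing) data aside
Xw, Xh, yw, yh = train_test_split(X, y, test_size=0.2, random_state=1)
print(Xw.shape, Xh.shape)

# stage 2: split the remaining work data into training and validation parts
Xt, Xv, yt, yv = train_test_split(Xw, yw, test_size=0.2, random_state=1)
print(Xt.shape, Xv.shape)
```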
Now we can test different models on training-validation data:
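A sketch of the first candidate, logistic regression, evaluated on the validation part only. The split settings and `max_iter` are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
Xw, Xh, yw, yh = train_test_split(X, y, test_size=0.2, random_state=1)    # work / hold-out
Xt, Xv, yt, yv = train_test_split(Xw, yw, test_size=0.2, random_state=1)  # training / validation

mLogistic = LogisticRegression(max_iter=5000)
_ = mLogistic.fit(Xt, yt)
print(mLogistic.score(Xv, yv))  # validation accuracy; hold-out data stay untouched
```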
The beauty of sklearn is that it is easy to try different models. Let’s try a single nearest-neighbor classifier:
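A sketch with `KNeighborsClassifier`. Only the model line differs from the logistic regression workflow, and setting `n_neighbors=5` instead gives the 5-nearest-neighbors variant. The split settings are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
Xw, Xh, yw, yh = train_test_split(X, y, test_size=0.2, random_state=1)    # work / hold-out
Xt, Xv, yt, yv = train_test_split(Xw, yw, test_size=0.2, random_state=1)  # training / validation

m1NN = KNeighborsClassifier(n_neighbors=1)  # predict the label of the single closest image
_ = m1NN.fit(Xt, yt)
print(m1NN.score(Xv, yv))  # validation accuracy
```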
This one achieved a very good score on validation data. What about 5-nearest neighbors?
The accuracy is almost as good as in the single-nearest-neighbor case. We can also try decision trees:
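A sketch with `DecisionTreeClassifier` using default settings. Fixing `random_state` is an assumption that makes the run reproducible.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
Xw, Xh, yw, yh = train_test_split(X, y, test_size=0.2, random_state=1)    # work / hold-out
Xt, Xv, yt, yv = train_test_split(Xw, yw, test_size=0.2, random_state=1)  # training / validation

mTree = DecisionTreeClassifier(random_state=1)
_ = mTree.fit(Xt, yt)
print(mTree.score(Xv, yv))  # validation accuracy
```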
Trees were clearly inferior here.
Instead of training-validation split, we can use cross-validation:
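A sketch of the cross-validation variant with `cross_val_score`, applied to the work data so that the hold-out part stays untouched. The choice of 5 folds, the dictionary `models` and the name `cv_accuracy` are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
Xw, Xh, yw, yh = train_test_split(X, y, test_size=0.2, random_state=1)  # work / hold-out

models = {
    "logistic": LogisticRegression(max_iter=5000),
    "1-NN": KNeighborsClassifier(n_neighbors=1),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(random_state=1),
}
cv_accuracy = {}
for name, model in models.items():
    # average accuracy over 5 folds of the work data
    cv_accuracy[name] = cross_val_score(model, Xw, yw, cv=5).mean()
    print(name, round(cv_accuracy[name], 3))
```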
Cross-validation basically replicates the training-validation split results, and the best model again appears to be 1-NN. Its lead over 5-NN is tiny, but we can still pick 1-NN as our preferred model.
Finally, the hold-out data gives us the final performance measure:
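A sketch of the final step, assuming the selected 1-NN model is the one fitted on the training part and that the hold-out data are named `Xh` and `yh` (hypothetical names).

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
Xw, Xh, yw, yh = train_test_split(X, y, test_size=0.2, random_state=1)    # work / hold-out
Xt, Xv, yt, yv = train_test_split(Xw, yw, test_size=0.2, random_state=1)  # training / validation

m1NN = KNeighborsClassifier(n_neighbors=1)
_ = m1NN.fit(Xt, yt)
print(m1NN.score(Xh, yh))  # accuracy on data never used for model selection
```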
Now that we have computed the final model accuracy, we should not change the model any more.