A Introduction to Machine Learning

A.1 ML: Making Computers Spot Patterns

Example of a simple pattern in data.

A.2 Decision Trees

TBD

A.3 Overfitting

An example dataset.

Consider the data in the figure above. What might be a good model to describe the relationship between \(x\) and \(y\)? Without knowing much more, one may assume that linear regression will do: the points seem to broadly follow an increasing linear trend. Obviously, if we model the data using linear regression, we will not get exact results, as the points do not follow the trend very closely.

The linear trend (linear regression) is displayed below at left. The prediction errors, the differences between the green trend line and the blue data points, are marked in orange. But this is not the only way to model the data. We can pick something else, for instance decision trees. Decision trees allow us to model data as an arbitrarily fine-grained step function. “Arbitrarily fine-grained” means that we can make a separate step for every single data point in the dataset. This is displayed below at right. Obviously, such an “à la carte” step function captures each point perfectly, so all of them are modeled exactly. There will be no errors.

Modeling the same data with linear regression (left) and regression trees (right). While linear regression results in substantial prediction errors, trees can capture each data point perfectly. But are trees any better?
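To make the comparison concrete, here is a minimal Python sketch using scikit-learn. The small dataset is made up for illustration (the data shown in the figures are not reproduced here); it fits both models and compares their errors on the training data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# A small made-up dataset with a noisy increasing trend
# (a stand-in for the data shown in the figures).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 12).reshape(-1, 1)
y = 1.5 * x.ravel() + rng.normal(scale=2.0, size=len(x))

# Linear regression: a rigid model that leaves visible errors.
lin = LinearRegression().fit(x, y)

# An unrestricted regression tree: with no depth limit it can grow
# one leaf ("step") per data point and reproduce each one exactly.
tree = DecisionTreeRegressor().fit(x, y)

print("linear training MSE:", mean_squared_error(y, lin.predict(x)))
print("tree   training MSE:", mean_squared_error(y, tree.predict(x)))
# The tree's training MSE is zero -- but, as the text explains
# below, that is not the number we should care about.
```

With no depth limit, the tree grows one leaf per distinct \(x\) value, so its training error drops to zero, exactly the “no errors” situation described above.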

While trees may look like the superior method here, it is not that simple. The problem lies in what we want to use the model for. Indeed, if we already know the values, as in these figures, then why do we need a predictive model in the first place? We build such models to predict values that we do not know, not those we already know. And it turns out that if that is what we want, then the excellent results of the regression tree above may be wildly misleading.

Such behavior, excellent results on known data, is often a symptom of overfitting: models may look very good on the dataset that was used for training, while their performance on unknown data may be very bad. Overfitting is a pervasive problem when using flexible models, such as trees, for predictive modeling. It stems from the models’ flexibility: overly flexible models may pick up all kinds of patterns, not only those we are interested in. In this example, the tree did not just pick up the upward trend; it figured out how to make a separate “step” for every single data point. It learned too elaborate a pattern.¹ We need to test the model’s behavior on unknown data instead.

But how can we test how well the model performs on unknown data? After all, these are values we do not know, so how can we tell how well the model predicts them? Fortunately, there is an easy way around this. Namely, the “unknown” data points only need to be unknown to the model; they do not have to be unknown to us.

The same dataset, but now split into training (dark blue) and validation (light blue) chunks.

Consider the same data as above, but now we have decided to keep one of the points, the light blue “validation data”, unknown to the model. The model will thus be trained with no information that there is, in fact, a value at this location. It will still do its best to fit all training points as well as possible, but it cannot adjust itself to the validation data.

The resulting models are shown below. At left, we use the same kind of linear regression as above. As you can see, the trend line is essentially unchanged, and it still passes fairly close to the validation data point: the error on unknown (validation) data is small. The regression tree, however, now has little guidance about what to do near the missing data point. It simply extends the horizontal “stair” from the left to the right and misses the validation point by a significant margin. On unknown data, linear regression performs better than the decision tree.

Separating data into training and validation chunks. Validation data are not “shown” to the model and are only used afterward to test its performance. In these figures, the validation error is larger for the tree.
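The same experiment can be sketched in code. Here is a minimal, self-contained Python example, again with scikit-learn and a made-up dataset, so the exact numbers are illustrative only. It hides one point from both models and measures the error on it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# The same kind of made-up noisy trend as in the previous sketch.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 12).reshape(-1, 1)
y = 1.5 * x.ravel() + rng.normal(scale=2.0, size=len(x))

# Keep one point in the middle of the range as validation data;
# the models are trained with no knowledge that it exists.
val_idx = 6
train = np.arange(len(x)) != val_idx
x_train, y_train = x[train], y[train]
x_val, y_val = x[~train], y[~train]

lin = LinearRegression().fit(x_train, y_train)
tree = DecisionTreeRegressor().fit(x_train, y_train)

# The linear trend barely changes, so its validation error stays small.
# The tree just extends a neighboring "stair", so it will typically
# miss the hidden point by a wider margin.
print("linear validation error:", abs(y_val - lin.predict(x_val))[0])
print("tree   validation error:", abs(y_val - tree.predict(x_val))[0])
```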

This has been a brief introduction to the idea of model validation and to the use of dedicated validation data for it. The main message is straightforward: if we care about the model’s performance on unknown data, then we should measure it on unknown data.
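In practice, one rarely holds out a single point by hand. A common pattern, sketched here with scikit-learn’s train_test_split on placeholder arrays, is to split off a random validation chunk before training:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)  # placeholder feature matrix
y = np.arange(20)                 # placeholder target vector

# Hold out 25% of the data as a validation chunk; the model is then
# trained on (X_train, y_train) and scored on (X_val, y_val).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```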


  1. Overfitting is typically not a problem with linear and logistic regression, as those are much more “rigid” models.