Chapter 19 Neural Networks

This section discusses how to use neural networks in python. First we discuss multi-layer perceptrons in the sklearn package, and thereafter we build more complex networks using keras.

We assume you have loaded the following packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We load more functions below as we introduce them.

We demonstrate neural networks using artificial color spiral data. This is a 2-D dataset where the points are colored differently, and the task is to predict the correct color based on the point’s location. So it is a basic classification task. However, in order to make the task reasonably complex, we arrange the colors in a spiral pattern. We create the data as follows:

n = 800  # number of data points
x1 = np.random.normal(size=n)
x2 = np.random.normal(size=n)
X = np.column_stack((x1, x2))  # design matrix
alpha = np.arctan2(x2, x1)
r = np.sqrt(x1**2 + x2**2)
c1 = np.sin(3*alpha + 2*r)
c2 = np.cos(3*alpha + 2*r)
## partition the sum of a sin and cosine into 5 intervals
category = pd.cut(c1 + c2,
           bins=[-1.5, -1.1, -0.6, 0.6, 1.1, 1.5],
           labels=[1, 2, 3, 4, 5])
y = category.astype(int)

So we transform the data into polar coordinates and compute the sin and cos of the angle \(\alpha\)–not of \(\alpha\) alone, though, but with the polar distance added to the angle. As a result we get a spiral. Let’s visualize the result:

_ = plt.figure(figsize=(8,8))
ax = plt.axes()
_ = ax.scatter(X[:,0], X[:,1], c=y, s=40, edgecolors='black')
_ = ax.set_aspect("equal")
_ = plt.show()
[Figure: scatter plot of the color spiral data]

The image depicts a complex pattern of spiral arms in five different colors: three-armed yellow and violet spirals correspond to the largest and smallest values, and six-armed spirals in different shades of blue and green correspond to the values in-between. These decision boundaries are very hard to capture with simple models, such as logistic regression, SVM, or even trees. However, neural networks (and k-NN) can do fairly well.

19.1 Multi-Layer Perceptron

sklearn implements simple feed-forward neural networks, multi-layer perceptrons. These are dense feed-forward networks with an arbitrary number of hidden layers. Even though simple by neural-network standards, they are still powerful enough for many tasks. As with other advanced methods, sklearn provides two classes: MLPClassifier for classification tasks and MLPRegressor for regression tasks. This closely parallels the trees and k-NN methods. The basic usage of these perceptron models is similar to that of all other sklearn models.

The most important arguments for MLPClassifier are

MLPClassifier(hidden_layer_sizes, activation, max_iter, alpha)

Out of these, hidden_layer_sizes is the most central one. It describes the network, in particular its hidden layers (the sizes of the input and output layers are automatically determined from data). It is a tuple that gives the number of nodes for each hidden layer, so the length of the tuple also determines the number of hidden layers. For instance, hidden_layer_sizes = (32, 16) means two hidden layers, the first one with 32 and the following one with 16 nodes. activation is the activation function; choose “relu” (the default) unless you have good reasons to choose something else. alpha is the l2 regularization parameter, and max_iter is the maximum number of iterations (or epochs if using SGD) before the optimization stops.
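
For instance, a classifier with two hidden layers of 32 and 16 nodes and a slightly stronger l2 penalty might be specified as follows (a sketch; the parameter values are only for illustration):

from sklearn.neural_network import MLPClassifier

m = MLPClassifier(hidden_layer_sizes=(32, 16),  # two hidden layers
                  activation="relu",  # the default
                  alpha=1e-3,  # l2 penalty
                  max_iter=1000)  # more than the default 200 iterations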

Let us now demonstrate the usage of MLPClassifier on the color spiral data. Let us start with a simple perceptron with a single hidden layer of 20 nodes. Hence we use hidden_layer_sizes=(20,), a tuple with just a single number. Fitting the model and predicting are mostly similar to the other sklearn models, so we do not discuss them here. We also increase the number of iterations, as the default 200 is too small in this case:

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

m = MLPClassifier(hidden_layer_sizes = (20,), max_iter=10000)
_ = m.fit(X, y)
yhat = m.predict(X)
confusion_matrix(y, yhat)

This simple model did not do well even on training data. Its accuracy is

np.mean(yhat == y)

This is because it is too simple: a single small hidden layer is not enough to model the complex spiral pattern well. Let’s also check the decision boundary plot to see how the model represents the image:

def DBPlot(m, X, y, nGrid = 100):
    ## plot range: data range plus a margin of 1
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    ## regular grid of nGrid x nGrid points covering the range
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, nGrid),
                           np.linspace(x2_min, x2_max, nGrid))
    XX = np.column_stack((xx1.ravel(), xx2.ravel()))
    ## predicted category for each grid point
    hatyy = m.predict(XX).reshape(xx1.shape)
    plt.figure(figsize=(8,8))
    ## background: predicted categories as transparent colors
    _ = plt.imshow(hatyy, extent=(x1_min, x1_max, x2_min, x2_max),
                   aspect="auto",
                   interpolation='none', origin='lower',
                   alpha=0.3)
    ## foreground: the actual data points
    plt.scatter(X[:,0], X[:,1], c=y, s=30, edgecolors='k')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.show()
DBPlot(m, X, y)
[Figure: decision boundary of the single hidden layer (20 nodes) model]


We can see that the model correctly captures the idea–spirals of different colors–but the shape of the spirals is not accurate enough.

Let us repeat the model with a more powerful network:

m = MLPClassifier(hidden_layer_sizes = (256, 128, 64), max_iter=10000)
_ = m.fit(X, y)
yhat = m.predict(X)
confusion_matrix(y, yhat)
np.mean(yhat == y)
DBPlot(m, X, y)
[Figure: decision boundary of the (256, 128, 64) network]


Now the results are very good, with accuracy around 0.98 (note: on training data!). A visual inspection confirms that the more powerful neural network is quite good at capturing the overall model structure.

Fitted network models have a number of methods and attributes, e.g. coefs_ gives the model weights (as a list of weight matrices, one for each layer), and intercepts_ gives the model biases (as a list of bias vectors, one for each layer). For instance, the model fitted above contains

np.sum([b.size for b in m.intercepts_])

453 biases: 256 + 128 + 64 in the hidden layers, plus 5 in the output layer.
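
We can inspect the weight matrices in a similar fashion. Given the network above (2 inputs, hidden layers of 256, 128, and 64 nodes, and 5 outputs), their shapes should come out as

[W.shape for W in m.coefs_]
## [(2, 256), (256, 128), (128, 64), (64, 5)]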

While sklearn offers easy access to neural network models, those models are substantially limited. For more powerful networks one has to use other libraries, such as tensorflow or pytorch.

19.2 Convolutional Neural Networks in Keras

19.2.1 Tensorflow and Keras

Tensorflow is a library that implements compute graphs which can compute gradients automatically. It also allows the computations to be carried out on a GPU, potentially offering big speed improvements over CPU computations. It is in many ways similar to numpy, in that one can create and manipulate matrices and tensors. However, because of the compute-graph approach and GPU-related considerations, it is much harder to use.

Keras is a tensorflow front-end, a submodule, that offers much more user-friendly access to building neural networks. The functionality includes a wide variety of network layers where one can adjust the corresponding parameters. When constructing the network, one can just add the layers one after another; the connections between the layers are taken care of by keras itself. A good source of keras documentation is its API reference docs.

Tensorflow can be hard and frustrating to install. Normally it works fine using either conda install tensorflow or pip install tensorflow. However, sometimes things go wrong, and it may be hard to find and fix the issues. In particular, pip normally installs the most recent version, even if it is incompatible with the rest of your packages. It also installs dependencies, and may upgrade certain packages, breaking the python installation in the process. In order to avoid messing with the rest of your system, we strongly recommend installing tensorflow in a virtual environment, such as anaconda.

19.2.2 Example network in keras

Let us re-implement the color spiral example from the Multi-Layer Perceptron section but this time using keras. There will be a few noteworthy differences:

  • Construction of the network itself is different in keras.
  • Keras does not compute the size of input and output layers from data. Both must be specified by the user.
  • Finally, keras only predicts probabilities for all the categories, so we have to add code that finds the column (category) with maximum probability.

The full code can be downloaded from the Bitbucket repo; below we discuss selected details.

19.2.2.1 Building the Model

First, the most important step: building and compiling the model. This is very different from how it is done in sklearn. Let’s build a sequential model with dense layers, a perceptron similar to the one we created using sklearn above:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# sequential (not recursive) model (one input, one output)
model = Sequential()
# first hidden layer: expects rows of 2 features as input
model.add(Dense(512, activation="relu",
                input_shape=(2,)))
model.add(Dense(256, activation="relu"))
model.add(Dense(64, activation="relu"))
# output layer: one node per category (nCategories is computed below)
model.add(Dense(nCategories, activation="softmax"))

We start by importing the functionality we need from tensorflow.keras. Thereafter we create an empty sequential model (i.e. a model that does not contain loops or data flowing backward) and start adding layers to it. We add three hidden dense layers: the first one contains 512 nodes, the second one 256, and the last one 64 nodes. All these nodes are activated using the relu function.

The first important new feature is the argument input_shape for the first layer, the input layer. This tells keras what kind of inputs to expect. Here it is just the tuple (2,), as our X-matrix only contains 2 columns. So (2,) is the shape of a single row of X, a single instance of the input data. You can find the correct shape with X[0].shape. But the input does not have to be a vector. For instance, in case of images it may be a 3-D tensor of shape (width, height, #color channels). You only need to specify the input_shape argument for the first layer; keras can work out the shapes of the following layers itself.

We also have to add an explicit output layer. As the task here is classification, we need as many output nodes as we have categories–we can compute this number as

nCategories = len(np.unique(category))

Each output node will predict the probability that the input falls into the corresponding category; we use softmax (multinomial logit) activation to ensure that the outcomes are valid probabilities. Note that in case of just two categories, softmax activation is equivalent to ordinary logistic regression. But unlike ordinary logistic regression, we have a number of other layers preceding the final logistic layer.
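
For reference, with \(K\) output nodes the softmax activation converts the raw node outputs \(z_1, \dots, z_K\) into probabilities
\[
p_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}},
\qquad j = 1, \dots, K,
\]
which are positive and sum to one.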

Getting the input shapes and output nodes right is one of the major sources of frustration when starting to work with keras. The error messages are long and not particularly helpful, and it is hard to understand what went wrong. Here is a checklist to work through if something does not work:

  • Does the first (and only the first) layer contain input_shape argument?
  • Does input_shape correctly represent the shape of a single instance of the input data?
  • Do you have the correct number of nodes in the softmax-activated output layer?

See also Common Error Messages below.

19.2.2.2 Fitting the Model

The next task is to compile and fit the model. Keras models need to be compiled–what we set up so far is just a description of the model, not the actual model built out of tensorflow tensors and potentially set up for GPU execution. We can compile the model as

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

The three most important arguments are

  • loss describes the model loss function; sparse_categorical_crossentropy, essentially the negative log-likelihood, is suitable for categorization tasks. The exact type to use also depends on how exactly the outcome is coded.
  • optimizer is the optimizer to use for stochastic gradient descent. adam and rmsprop are good choices but there are other options.
  • metrics is a list of metrics to be evaluated and printed while optimizing, offering some feedback about how the training is going.

The last line here prints the model summary, a handy overview of what we have done:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 512)               1536      
_________________________________________________________________
dense_1 (Dense)              (None, 256)               131328    
_________________________________________________________________
dense_2 (Dense)              (None, 64)                16448     
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 325       
=================================================================
Total params: 149,637
Trainable params: 149,637
Non-trainable params: 0

So this network contains almost 150,000 parameters, all of which are trainable.

After successful compilation we can fit the model:

history = model.fit(X, y, epochs=200)

In this example X is the design matrix, y is the outcome vector, and the argument epochs tells how many epochs to run the optimizer. Keras reports progress while optimizing; it may look like

Epoch 1/200
25/25 [==============================] - 0s 2ms/step - loss: 1.5984 - accuracy: 0.2438
Epoch 2/200
25/25 [==============================] - 0s 2ms/step - loss: 1.5709 - accuracy: 0.2763
Epoch 3/200
25/25 [==============================] - 0s 3ms/step - loss: 1.5550 - accuracy: 0.3013

where one can see the current epoch, the batch (for stochastic gradient descent), and the current accuracy on training data. As in this example all the epochs run to completion, we see 25/25 for batches; otherwise one can see the batch number progressing as the training proceeds. This example runs very fast–keras reports 2ms per step (batch), and the total time per epoch is too small to be reported–but a single epoch may take many minutes for more complex models and more data.

The reported metric (here accuracy) is computed for a single batch, and it gives only crude guidance about the actual model accuracy, even on training data. Do not take this accuracy measure too seriously!
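
If you want a more reliable accuracy figure on the full training data, you can compute it explicitly after fitting, e.g. with model.evaluate, which returns the loss and the metrics we requested when compiling:

loss, accuracy = model.evaluate(X, y, verbose=0)
print("training accuracy:", accuracy)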

Keras lets you predict using a model that is not fitted (unlike sklearn, where that causes an error). The results will look mediocre at best.

19.2.2.3 Predicting and Plotting

When the fitting is done, we can use the model for prediction. Prediction works in a similar fashion as in sklearn, except that the predict method predicts probabilities, not categories (analogously to sklearn’s predict_proba):

phat = model.predict(X)

In this example, this will be a matrix of 5 columns, where each column represents the probability that the data point belongs to the corresponding category. Example lines of phat may look like

phat[:5]
[[1.3339513e-37 5.6408087e-24 2.7101044e-10 1.2674906e-03 9.9873251e-01]
 [2.7687559e-09 1.9052830e-02 9.8094696e-01 2.3103729e-07 2.8597169e-19]
 [1.6330467e-18 6.0083215e-07 9.5986998e-01 4.0129449e-02 2.5692884e-08]
 [5.6379267e-15 1.8879733e-06 9.9859852e-01 1.3995816e-03 2.9565132e-11]
 [1.0658005e-19 7.9592645e-08 1.7678380e-01 7.8631145e-01 3.6904618e-02]]

In case of the first line, the largest probability, 0.998, is in the 5th column. In the three following lines, the 3rd column contains the largest probabilities, 0.981, 0.960, and 0.999 respectively; and finally, in the fifth line, the maximum value 0.786 is in the 4th column.

Next, in order to find the corresponding column, we can use np.argmax(phat, axis=-1); it finds the location of the largest element across the last axis (axis=-1), i.e. across columns. So for each row, we get the column number of the largest probability. Be aware that np.argmax counts columns starting from 0, not from 1:

yhat = np.argmax(phat, axis=-1)
yhat[:5]
[4, 2, 2, 2, 3]

Finally, we can compute the confusion matrix and the accuracy. The argmax step above is the same one we use in the modified DBPlot function below. Here we use sklearn’s confusion_matrix, but pd.crosstab works as well:

from sklearn.metrics import confusion_matrix

print("confusion matrix:\n", confusion_matrix(category, yhat))
print("Accuracy (on training data):", np.mean(category == yhat))

In this example we predict on training data, but we can obviously choose another set of data for predictions. As the predicted value is a probability matrix of 5 columns, we compute yhat as the column number that contains the largest probability for each row.

The output may look something like this:

confusion matrix:
 [[172   4   0   0   0]
 [ 10  82   4   0   0]
 [  0  11 215   7   0]
 [  0   0   8 116   1]
 [  0   0   0  10 160]]
Accuracy (on training data): 0.93125

As one can see, the confusion matrix is populated almost exclusively on the main diagonal, and the accuracy is high.
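
Note that the accuracy can be read directly off the confusion matrix as the share of the main diagonal: with the matrix above we have 172 + 82 + 215 + 116 + 160 = 745 correct predictions out of 800 points, exactly the reported 0.93125:

cm = confusion_matrix(category, yhat)
print(np.trace(cm)/cm.sum())
## 0.93125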

Finally, if we want to make a decision boundary plot similar to the one above, we have to modify the DBPlot function to account for the fact that keras models only predict probabilities:

def DBPlot(m, X, y, nGrid = 100):
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, nGrid),
                           np.linspace(x2_min, x2_max, nGrid))
    XX = np.column_stack((xx1.ravel(), xx2.ravel()))
    ## predict probability 
    phat = m.predict(XX)
    ## find the column that corresponds to the maximum probability
    hatyy = np.argmax(phat, axis=-1).reshape(xx1.shape)
    plt.figure(figsize=(8,8))
    _ = plt.imshow(hatyy, extent=(x1_min, x1_max, x2_min, x2_max),
                   aspect="auto",
                   interpolation='none', origin='lower',
                   alpha=0.3)
    plt.scatter(X[:,0], X[:,1], c=y, s=30, edgecolors='k')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.show()

Again, the function is almost identical to the sklearn version, except for the line containing np.argmax(phat, axis=-1) that converts predicted probabilities to categories.
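
The call itself is the same as before, we just pass the keras model:

DBPlot(model, X, y)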

19.3 Image processing with convolutional networks

The main reason to choose keras over sklearn is the much more powerful toolset in the keras library. This includes convolutional layers, which are widely used in image processing, and also data generators that load data on the fly, so one does not have to keep tens of thousands of images in memory.

We demonstrate the usage by categorizing images into cats and dogs. The data used in this example can be downloaded from kaggle. It contains 25,000 labeled training images and 12,500 unlabeled testing images (as these are not labeled, they cannot really be used for testing). The data sets are large (as is common in image processing): training data is 600MB and testing data 300MB. In the code below we assume the training images are located in cats-n-dogs/train. The full code example is in the Bitbucket repo; here we discuss just the more crucial parts of it.

19.3.1 Loading Data

Let us first set the model parameters:

imgDir = "cats-n-dogs"
imageWidth, imageHeight = 128, 128
imageSize = (imageWidth, imageHeight)
channels = 3

As we need to repeatedly find the images, we specify the location of the folder here (you have to adjust this for your computer if you want to run this program). Next, because the input tensors that correspond to the images must all be of the same size, we specify the image target size here, and later resize all images into shape imageSize (this will be done by the data generators, see below). In this example, one color channel will be a \(128\times128\) matrix. We also specify that the images contain 3 color channels, so a single image is in fact a \(128\times128\times3\) tensor. Obviously, a larger image size gives better predictions but will be slower.
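
A quick back-of-the-envelope calculation shows why we do not want to load all images into memory at once (assuming all 25,000 training images are resized to the target shape and stored as float32):

nImages = 25_000
bytesNeeded = nImages * imageWidth * imageHeight * channels * 4  # 4 bytes per float32
print(bytesNeeded/1e9, "GB")
## 4.9152 GB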

We do not want to load all images into memory–that would be a very memory-hungry approach. We specify data generators instead, functions that load the images from disk only when they are needed. There are different options for data generators; the one we use here is flow_from_dataframe. It expects the file names and the corresponding correct categories to be listed as data frame rows. We start by creating a data frame of all labeled images and their corresponding labels (cat vs dog). The images are named as “cat.xxxxx.jpg” and “dog.xxxxx.jpg”, so we can use the first three letters of the file name as the category:

import os

filenames = os.listdir(os.path.join(imgDir, "train"))  # all training file names
print(len(filenames), "images found")
trainImages = pd.DataFrame({
    'filename':filenames,
    'category':pd.Series(filenames).str[:3]
    # first three letters of file name give the category
})
print("categories:\n", trainImages.category.value_counts())

So we create a data frame that contains two columns, filename and category. Later we tell keras to load the images listed in this data frame, using the supplied category as the label. A sample of the data frame might look like:

           filename category
2966   cat.1278.jpg      cat
1105   cat.2979.jpg      cat
8483   dog.5446.jpg      dog
19006  dog.5471.jpg      dog
20363  dog.5543.jpg      dog

We can see that image file cat.1278.jpg is of category cat while image dog.5446.jpg is of category dog.

19.3.2 Building the Model

Now it is time to build the model. The basic model-building steps are similar to those discussed in Example network in keras, but this time we add more types of layers:

from tensorflow.keras.layers import (Conv2D, MaxPooling2D,
                                     BatchNormalization, Dropout,
                                     Flatten, Dense)

model = Sequential()
## First convolutional layer with 32 filters (kernels)
model.add(Conv2D(32,
                 kernel_size=3,
                 activation='relu',
                 input_shape=(imageWidth, imageHeight, channels)))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.25))
## 2nd convolutional layer
model.add(Conv2D(64,
                 kernel_size = 3,
                 activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.25))
## 3rd convolutional layer
model.add(Conv2D(128,
                 kernel_size=3,
                 activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.25))
## Flatten the image into a string of pixels
model.add(Flatten())
## Use one final dense layer
model.add(Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
## Output layer with 2 softmax nodes
model.add(Dense(2, activation='softmax'))

We build a sequential model, but now the first three blocks of layers are built around convolutional layers. All layers, except the output layer, are activated using the relu function.

  • The first layer (a 2-D convolutional layer) contains 32 filters of size \(3\times3\). This means we introduce 32 different convolutions and let the network learn which 32 filters give the best performance. As the strides argument is not specified, these filters use \(1\times1\) strides, i.e. the kernel is moved over the image one pixel at a time. Unlike the other layers, the first layer specifies the input shape, in this case the image shape (width, height, and color channels), as an individual data point here is an image. The input shape describes what the input to the model looks like. In case of linear regression, or a non-convolutional network, it is normally a single vector of \(x\)-s. However, here we cannot use vectors–we cannot just flatten the image into a 1-D series of pixels, because convolutions need information about the pixel locations. So we have to tell the model the image size and the number of color channels.

    The first layer contains 896 parameters: \(3\times3\) kernel weights for each of the 32 filters and each of the 3 color channels, plus 32 biases, one for each filter: \((3\times3\times3 + 1)\times 32 = 896\).

  • Each convolutional layer is followed by the corresponding max pooling over \(2\times2\) image regions.

  • BatchNormalization and Dropout are not separate layers in the usual sense, but ways of training the corresponding layer’s weights. The former is useful for ensuring stable gradients, the latter for avoiding overfitting.

  • The second and the third convolutional layers are similar to the first one, except that they contain more filters and do not specify the input shape. Keras can find the input shape itself, based on the previous layers. These layers also contain many more parameters, even though they are specified using a \(3\times3\) kernel. Remember–the image itself contains three color channels, but the second convolutional layer works on the output of the first layer, i.e. on 32 channels. Now the actual size of the kernel is \(3\times3\times32\), and hence we have \((3\times3\times32 + 1)\times 64 = 18,496\) parameters.

  • The final block of layers starts by flattening the image. This means we transform the 3-D tensors into a 1-D array of pixel results, and lose the spatial information in the process. The flattened data is fed into a dense layer, again with batch normalization and dropout.

  • Finally, we predict using two output nodes, activated through the softmax function.
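
Before fitting, this model, too, must be compiled, exactly as in Section 19.2.2.2. As the data generator below produces one-hot encoded labels (class_mode='categorical'), the matching loss is categorical_crossentropy (a sketch; the full code in the repo may differ in details):

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])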

The model summary will look like:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 126, 126, 32)      896       
_________________________________________________________________
batch_normalization (BatchNo (None, 126, 126, 32)      128       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 63, 63, 32)        0         
_________________________________________________________________
dropout (Dropout)            (None, 63, 63, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 61, 61, 64)        18496     
_________________________________________________________________
batch_normalization_1 (Batch (None, 61, 61, 64)        256       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 30, 30, 64)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 30, 30, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 28, 28, 128)       73856     
_________________________________________________________________
batch_normalization_2 (Batch (None, 28, 28, 128)       512       
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 14, 14, 128)       0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 14, 14, 128)       0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
dense (Dense)                (None, 512)               12845568  
_________________________________________________________________
batch_normalization_3 (Batch (None, 512)               2048      
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 1026      
=================================================================
Total params: 12,942,786
Trainable params: 12,941,314
Non-trainable params: 1,472

It is very instructive to analyze and understand the number of parameters and the output shapes. We can see that the model contains almost 13M parameters, most of which are in the final dense layer. This is because all the output pixels of the third pooling layer (\(14\times14\times128\)) must be fed into all 512 nodes of the dense layer. Hence we have \(512 \times 14 \times 14 \times 128 = 12,845,056\) weights plus 512 biases, exactly the 12,845,568 parameters of the dense layer.
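
As a sanity check, we can reproduce the convolutional and dense layer parameter counts of the summary with a few lines of arithmetic:

print((3*3*3 + 1)*32)  # conv2d: 896
print((3*3*32 + 1)*64)  # conv2d_1: 18,496
print((3*3*64 + 1)*128)  # conv2d_2: 73,856
print(14*14*128*512 + 512)  # dense: 12,845,568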

We can also see how the image size decreases through the convolutional layers. Remember, the input images are \(128\times128\) pixels. As the convolutional filters are \(3\times3\) pixels large, each of them cuts two pixels off the image (as we did not specify any padding), and hence the first conv2d layer outputs \(126\times126\) pixels. Max pooling over \(2\times2\) squares further halves the image size to \(63\times63\). Had we specified strides larger than one, the size would shrink even more rapidly.
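
In general, with no padding, the output size along one dimension is
\[
\text{output} = \left\lfloor \frac{\text{input} - \text{kernel}}{\text{stride}} \right\rfloor + 1,
\]
so the first convolutional layer outputs \((128 - 3)/1 + 1 = 126\) pixels, and the subsequent \(2\times2\) max pooling (where the stride defaults to the pool size) outputs \(\lfloor (126 - 2)/2 \rfloor + 1 = 63\) pixels.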

19.3.3 Common model errors

Keras’ error messages may be hard to understand for the uninitiated. Here we describe a few common errors. Note that these may be buried inside a long list of messages, usually toward the end.

19.3.3.1 Wrong input shape

Beginners often do not understand the input shape. But it is a necessary piece of information that must be fed to the model. If we get it wrong, for instance by specifying the first layer as

model.add(Conv2D(32,
                 kernel_size=3,
                 activation='relu',
                 input_shape=(imageWidth, imageHeight)))  # no channels!

then keras responds with a message

ValueError: Input 0 of layer conv2d is incompatible with the layer:
expected min_ndim=4, found ndim=3. 
Full shape received: [None, 128, 128]

This tells us that a Conv2D layer expects a 4-D tensor as its input (min_ndim=4). The correct shape would be [None, 128, 128, 3], where None stands for the different cases (different images) that are stacked along that dimension.

This error happens during the model building, i.e. when you call model.add(Conv2D(...)).

Sometimes you get the number of input dimensions right, but their sizes wrong. For instance, if you specify 5 color channels instead of 3:

model.add(Conv2D(32,
                 kernel_size=3,
                 activation='relu',
                 input_shape=(imageWidth, imageHeight, 5)))  # should be 3 channels!

then the error will be

tensorflow.python.framework.errors_impl.InvalidArgumentError:
input depth must be evenly divisible by filter depth: 3 vs 5

It tries to wrap the \(128\times128\times3\) images into \(128\times128\times5\) tensors, but they do not fit.

This error occurs only when the fitting algorithm discovers that the images contain 3 channels instead of 5, i.e. when you call model.fit.

19.3.3.2 Wrong number of categories

This problem is conceptually fairly easy to grasp: the number of nodes in the final softmax layer must equal the number of categories. If we get this wrong, e.g. by specifying the last layer as

model.add(Dense(3, activation='softmax'))  # too many categories!

then keras produces

tensorflow.python.framework.errors_impl.InvalidArgumentError:  
logits and labels must be broadcastable: 
logits_size=[32,3] labels_size=[32,2]

This tells us that we requested 3 nodes (3 “logits”), but the data (labels) only contain 2 different categories. “32” is the batch size here; that is why keras reports not “3” and “2” but “[32,3]” and “[32,2]”.

This error occurs when the fitting algorithm finds that there are too few categories, i.e. when you call model.fit.

19.3.3.3 Image runs out of pixels

Each convolutional filter makes the image smaller, because we lose pixels at the edges (unless we use padding). If we use strides larger than one, we also lose output pixels, because we now move in larger steps. In a similar way, pixels get lost in pooling, because pooling layers “pool” pixels over neighboring areas into a single one. In this way it may happen that the image does not contain any pixels at some stage.

Let’s demonstrate this by adding strides=3 to all convolutional layers. The first layer now looks like:

model.add(Conv2D(32,
                 kernel_size=3,
                 strides=3,
                 activation='relu',
                 input_shape=(imageWidth, imageHeight, channels)))

The image gets too small–after a few operations it contains no pixels–and keras stops with

ValueError: Negative dimension size caused by subtracting 2 from 1
for '{{node max_pooling2d_2/MaxPool}} = MaxPool[T=DT_FLOAT, 
data_format="NHWC", ksize=[1, 2, 2, 1], padding="VALID", 
strides=[1, 2, 2, 1]](batch_normalization_2/cond/Identity)' 
with input shapes: [?,1,1,128].

If you notice such a “negative dimension size” error, you should check the “Output Shape” column in the model summary (see Section 19.3.2 above). Remove all convolutional layers besides the first one, and try to understand what the image size will be after each operation, and which operations you want to remove or modify in order to retain an image with a meaningful number of pixels.

This error occurs at the model compilation stage, where keras computes the sizes of the tensors.

19.3.4 Training the model

Now it is time to set up the training data generator that reads the files during model fitting:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

## Training data generator:
train_generator = ImageDataGenerator(
    rescale=1./255,
    rotation_range=15,
    shear_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
    width_shift_range=0.1,
    height_shift_range=0.1
).flow_from_dataframe(
    trainImages,  # the data frame created above
    os.path.join(imgDir, "train"),
    x_col='filename', y_col='category',
    class_mode='categorical',  # target is 2-D array of one-hot encoded labels
    target_size=imageSize,
    shuffle=True
)

ImageDataGenerator is a class that can manipulate images in various ways. In particular, it can rescale the intensity from the 0–255 integer range to the 0–1 float range, and it can introduce various small modifications to the images, such as rotation and shear. This is useful for adding more variation to the training data. However, do not modify your test data! We also shuffle the training data, in order to avoid always feeding it to the network in the same order.

After specifying the modifications, we call the method flow_from_dataframe. It takes the data frame that lists both the image files (specified with x_col) and the labels (specified with y_col), and tells keras how the data should be read. In this case we request the images to be resized to the target size. class_mode tells keras to convert category into a one-hot encoded matrix, the shape needed by the model.
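
To convince yourself that the generator produces what the model expects, you can pull out a single batch and check its shapes (here with the default batch size of 32):

Xbatch, ybatch = train_generator[0]  # the first batch
print(Xbatch.shape)
## (32, 128, 128, 3): a batch of image tensors
print(ybatch.shape)
## (32, 2): the corresponding one-hot encoded labels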

It is possible to add another similar data generator for validation data, so keras will output current information not just about training accuracy but also about validation accuracy when the model is running. However, we do not do it here for simplicity.

Now we can proceed with model training:

## Model Training:
history = model.fit(
    train_generator,
    epochs=1
)

The model.fit method is somewhat similar to that of sklearn, but it accepts many more options. Here we provide the training data generator, and tell how many epochs to train. One epoch is usually too little, but more epochs may be slow. In case of this network and data, a single epoch will give you an accuracy of approximately 60%, while 40 epochs will reach 95%.

Model fitting returns a history object which contains information about the training loss and accuracy; if validation_data is provided (it is not here), it also contains the validation loss and accuracy.
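
The history can be used, for instance, to plot the training accuracy across epochs (a minimal sketch; 'accuracy' is the metric name we specified when compiling):

_ = plt.plot(history.history['accuracy'])
_ = plt.xlabel("epoch")
_ = plt.ylabel("training accuracy")
_ = plt.show()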

19.3.5 Predictions and Validation

The next step is to predict the categories of the testing data. We proceed in a similar way, by creating a data generator that reads the files listed in a data frame:

testDir = os.path.join(imgDir, "test")
dfTest = pd.DataFrame({
    'filename': os.listdir(testDir)
})
print(dfTest.shape, "test files read from", testDir)

test_generator = ImageDataGenerator(
    rescale=1./255
    # do not randomize testing!
).flow_from_dataframe(
    dfTest,
    os.path.join(imgDir, "test"),
    x_col='filename',
    class_mode = None,  # we don't want target for prediction
    target_size = imageSize,
    shuffle = False
    # do _not_ randomize the order!
    # this would clash with the file name order!
)

In an analogous fashion to the training data, we first list the image files in the folder and create the corresponding data frame. However, as these images are not labeled, we do not have a “category” variable here; this is also why we specify class_mode=None. We do not want the testing images distorted or shuffled either. Shuffling the test images would break the correspondence between the images and the predictions (which follow the file name order), and give an essentially random accuracy.

We can predict the probabilities just by feeding the test generator to model.predict:

phat = model.predict(test_generator)

dfTest['category'] = np.argmax(phat, axis=-1)
label_map = {0:"cat", 1:"dog"}
dfTest['category'] = dfTest['category'].replace(label_map)

phat will be the predicted probability matrix, with the first column giving the probability that the image is a cat, and the second column the probability that it is a dog. It may look like

[[0.5337883  0.46621174]
 [0.4693922  0.53060776]
 [0.10248788 0.89751214]
 [0.05728473 0.9427153 ]
 [0.6855554  0.31444466]]

In this example the first image is more likely a cat (\(p = 0.53\)), but the algorithm is not really sure. The fourth image, however, is confidently a dog (\(p = 0.94\)). The last three lines of the code find the category with the maximum probability, and replace the 0/1 labels with cat/dog labels to make the results easier to read.

The final part of the code also plots a random set of images with the corresponding labels, after resizing them to the desired image size, so you can see what kind of images the computer was working with!

19.3.6 Analyzing the model

You can get the layer data out of a fitted model with

conv1 = model.get_layer("conv1")

This returns the layer object, carrying a large number of parameters, for the layer called conv1.¹ One of its attributes is weights, a list with two components: the filter weights (the first component) and the filter biases (the second component). We can extract the weights as

conv1.weights[0]

This returns an array of size \(S_x \times S_y \times N_L \times N\), where \(S_x\) and \(S_y\) are the x- and y-size of the filters, \(N_L\) is the number of input channels (layers) the filters work on, and \(N\) is the number of filters. For instance, if the first convolutional layer contains 50 filters of size \(4\times4\), and the input is a 3-channel color image, then the corresponding weight array is of size \(4\times 4\times 3\times 50\).
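
For the model in Section 19.3.2 (32 filters of size \(3\times3\) working on a 3-channel input), the first convolutional layer’s weights should hence have the shape \(3\times3\times3\times32\) (this assumes you gave the layer the name "conv1", see the footnote):

print(conv1.weights[0].shape)
## (3, 3, 3, 32)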


  1. You can see layer names with model.summary(). See Section 19.2.2.2. You can also give each layer a name of your choice with the argument name, e.g. Conv2D(..., name = "conv1").↩︎