Chapter 19 Neural Networks
This chapter discusses how to use neural networks in python. First we discuss multi-layer perceptrons in the sklearn package, and thereafter we build more complex networks using keras.
We assume you have loaded the following packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
We load more functions below as we introduce those.
We demonstrate neural networks using artificial color spiral data. This is a 2-D dataset where different points are colored differently, and the task is to predict the correct color based on the point location. So it is a basic decision task. However, in order to make the task reasonably complex, we introduce the colors in a spiral pattern. We create the data as follows:
n = 800  # number of data points
x1 = np.random.normal(size=n)
x2 = np.random.normal(size=n)
X = np.column_stack((x1, x2))  # design matrix
alpha = np.arctan2(x2, x1)
r = np.sqrt(x1**2 + x2**2)
c1 = np.sin(3*alpha + 2*r)
c2 = np.cos(3*alpha + 2*r)
## partition the sum of a sin and cosine into 5 intervals
category = pd.cut(c1 + c2,
                  bins=[-1.5, -1.1, -0.6, 0.6, 1.1, 1.5],
                  labels=[1, 2, 3, 4, 5])
y = category.astype(int)
So we transform the data into polar coordinates and compute the sin and cos of the angle \(\alpha\); but instead of using just \(\alpha\), we also add a multiple of the polar distance \(r\) to the angle. As a result we get a spiral. Let's visualize the result:
_ = plt.figure(figsize=(8,8))
ax = plt.axes()
_ = ax.scatter(X[:,0], X[:,1], c=y, s=40, edgecolors='black')
_ = ax.set_aspect("equal")
_ = plt.show()
The image depicts a complex pattern of spiral arms of five different colors: the three-armed yellow and violet spirals correspond to the largest and smallest values, while the six-armed spiral arms in different shades of blue and green correspond to the values in-between. These decision boundaries are very hard to capture with simple models, such as logistic regression, SVM, or even trees. However, neural networks (and k-NN) can do fairly well.
19.1 Multi-Layer Perceptron
sklearn implements simple feed-forward neural networks, multi-layer perceptrons. These are simple dense feed-forward networks with an arbitrary number of hidden layers. Even though simple in the neural network context, they are still powerful enough for many tasks. As with other advanced methods, sklearn provides two functions: MLPClassifier for classification tasks and MLPRegressor for regression tasks. This closely parallels the trees and k-NN methods.
The basic usage of these perceptron models is similar to that of all other sklearn models. The most important arguments for MLPClassifier are

MLPClassifier(hidden_layer_sizes, activation, max_iter, alpha)

Out of these, hidden_layer_sizes is the most central one. It describes the network, in particular its hidden layers (the sizes of the input and output layers are automatically determined from data). It is a tuple that gives the number of nodes for each hidden layer, so the length of the tuple also gives the number of hidden layers. For instance, hidden_layer_sizes = (32, 16) means two hidden layers, the first one with 32 and the following one with 16 nodes. activation describes the activation function; choose "relu" (the default) unless you have good reasons to choose something else. alpha is the l2 regularization parameter, and max_iter gives the maximum number of iterations (or epochs if using SGD) before the optimization stops.
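For instance, a classifier with two hidden layers, explicit regularization, and a larger iteration limit might be specified as follows (a minimal sketch; the parameter values are purely illustrative):

from sklearn.neural_network import MLPClassifier

## two hidden layers with 32 and 16 nodes, relu activation,
## a slightly stronger l2 penalty, and a larger iteration limit
m = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                  alpha=1e-3, max_iter=2000)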
Let us now demonstrate the usage of MLPClassifier on the color spiral data. We start with a simple perceptron with a single hidden layer of 20 nodes; hence we use hidden_layer_sizes=(20,), a tuple with just a single number. Fitting the model and predicting is mostly similar to the other sklearn models, so we do not discuss it here. We also increase the number of iterations, as the default 200 is too little in this case:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix
m = MLPClassifier(hidden_layer_sizes=(20,), max_iter=10000)
_ = m.fit(X, y)
yhat = m.predict(X)
confusion_matrix(y, yhat)
This simple model did not do well even on training data. Its accuracy is
np.mean(yhat == y)
This is because the model is too simple: a single small hidden layer is not enough to model the complex spiral pattern well. Let's also check the decision boundary plot to see how the model represents the image:
def DBPlot(m, X, y, nGrid=100):
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, nGrid),
                           np.linspace(x2_min, x2_max, nGrid))
    XX = np.column_stack((xx1.ravel(), xx2.ravel()))
    hatyy = m.predict(XX).reshape(xx1.shape)
    plt.figure(figsize=(8,8))
    _ = plt.imshow(hatyy, extent=(x1_min, x1_max, x2_min, x2_max),
                   aspect="auto",
                   interpolation='none', origin='lower',
                   alpha=0.3)
    plt.scatter(X[:,0], X[:,1], c=y, s=30, edgecolors='k')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.show()
DBPlot(m, X, y)
We can see that the model correctly captures the idea, spirals of different colors, but the shape of the spirals is not accurate enough.
Let us repeat the exercise with a more powerful network:
m = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=10000)
_ = m.fit(X, y)
yhat = m.predict(X)
confusion_matrix(y, yhat)
np.mean(yhat == y)
DBPlot(m, X, y)
Now the results are very good, with accuracy around 0.98 (note: on training data!). A visual inspection confirms that the more powerful neural network is quite good at capturing the overall spiral structure.
Fitted network models have a number of methods and attributes, e.g. coefs_ gives the model weights (as a list of weight matrices, one for each layer), and intercepts_ gives the model biases (as a list of bias vectors, one for each layer). For instance, the model fitted above contains
np.sum([b.size for b in m.intercepts_])
biases.
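In the same spirit we can count all parameters of the fitted model, weights and biases together (a minimal sketch):

## total number of parameters: all weights plus all biases
nWeights = np.sum([W.size for W in m.coefs_])
nBiases = np.sum([b.size for b in m.intercepts_])
print("total parameters:", nWeights + nBiases)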
While sklearn offers easy access to neural network models, those models are substantially limited. For more powerful networks one has to use other libraries, such as tensorflow or pytorch.
19.2 Convolutional Neural Networks in Keras
19.2.1 Tensorflow and Keras
Tensorflow is a library that implements compute graphs that can compute gradients automatically. It also allows the computations to be carried out on a GPU, potentially offering a big speed improvement over CPU computations. It is in many ways similar to numpy in that one can create and manipulate matrices and tensors. However, because of the compute-graph approach and GPU-related considerations, it is much harder to use.
Keras is a tensorflow front-end, a submodule, that offers much more user-friendly access to building neural networks. The functionality includes a wide variety of network layers where one can adjust the corresponding parameters. When constructing the network, one can just add the layers one after another; the connections between the layers will be taken care of by keras itself. A good source for keras documentation is its API reference docs.
Tensorflow can be hard and frustrating to install. Normally it works fine using either conda install tensorflow or pip install tensorflow. However, sometimes things may go wrong and it may be hard to find and fix the issues. In particular, pip normally installs the most recent version, even if it is incompatible with the rest of your packages. It also installs dependencies, and may upgrade certain packages, breaking the python installation in the process. In order to avoid messing with the rest of your system, we strongly recommend installing it in a virtual environment, such as anaconda.
19.2.2 Example network in keras
Let us re-implement the color spiral example from the Multi-Layer Perceptron section, but this time using keras. There will be a few noteworthy differences:
- Construction of the network itself is different in keras.
- Keras does not compute the sizes of the input and output layers from data. Both must be specified by the user.
- Finally, keras only predicts probabilities for all the categories, so we have to add code that finds the column (category) with the maximum probability.
The full code can be downloaded from the Bitbucket repo; below we discuss selected details.
19.2.2.1 Building the Model
First, the most important step: building and compiling the model. This is very different from how it is done in sklearn. Let’s build a sequential model with dense layers, the same perceptron that we created using sklearn above:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# sequential (not recursive) model (one input, one output)
model = Sequential()
model.add(Dense(512, activation="relu",
                input_shape=(2,)))
model.add(Dense(256, activation="relu"))
model.add(Dense(64, activation="relu"))
model.add(Dense(nCategories, activation="softmax"))
We start by importing the functionality we need from tensorflow.keras. Thereafter we create an empty sequential model (i.e. a model that does not contain loops or data flowing backward) and start adding layers to it. We add 3 dense layers: the first layer contains 512 nodes, the second one 256, and the last one 64 nodes. All these nodes are activated using the relu function.

The first important new feature is the argument input_shape for the first layer, the input layer. This tells keras what kind of inputs to expect. Here it is just the tuple (2,) as our X-matrix only contains 2 columns. So (2,) is just the shape of a single row of X, a single instance of the input data. You can find the correct shape with X[0].shape. But the input here does not have to be just a vector. For instance, in case of images it may be a 3-D tensor with shape (width, height, #color channels). You only need to specify the input_shape argument for the first layer; keras can find the information for the following layers itself.
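A quick way to double-check the input shape is to print the shape of a single data row (a minimal sketch):

print(X[0].shape)   # (2,): this is what input_shape should be
## a single 128x128 RGB image would instead have shape (128, 128, 3)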
We also have to add an explicit output layer. As the task here is classification, we need as many output nodes as we have categories; we can compute this number as
nCategories = len(np.unique(category))
Each output node will predict the probability that the input falls into the corresponding category; we use softmax (multinomial logit) activation to ensure that the outcomes are valid probabilities. Note that in case of just two categories, softmax activation is equivalent to ordinary logistic regression. But unlike ordinary logistic regression, we have a number of other layers preceding the final logistic layer.
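To see what softmax does, here is a minimal numpy sketch (the node values z are made up for illustration): it exponentiates the output-node values and normalizes them so that they are positive and sum to one:

def softmax(z):
    ez = np.exp(z - z.max())  # subtract max for numerical stability
    return ez / ez.sum()

z = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # hypothetical output-node values
p = softmax(z)
print(p, p.sum())  # valid probabilities that sum to 1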
Getting the input shapes and output nodes right is one of the major sources of frustration when starting to work with keras. The error messages are long and not particularly helpful, and it is hard to understand what went wrong. Here is a checklist to work through if something does not work:
- Does the first (and only the first) layer contain the input_shape argument?
- Does input_shape correctly represent the shape of a single instance of the input data?
- Do you have the correct number of nodes in the softmax-activated output layer?
See also Common Error Messages below.
19.2.2.2 Fitting the Model
The next task is to compile and fit the model. Keras models need to be compiled–what we set up so far is just a description of the model, not the actual model that is set up for tensorflow tensors and potentially for GPU execution. We can compile the model as
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print(model.summary())
The three most important arguments are:
- loss describes the model loss function. sparse_categorical_crossentropy, essentially log-likelihood, is suitable for categorization tasks. The exact type also depends on how exactly the outcome is coded.
- optimizer is the optimizer to use for stochastic gradient descent. adam and rmsprop are good choices, but there are other options.
- metrics is a metric to be evaluated and printed while optimizing, offering some feedback about how the optimization is going.
The last line here prints the model summary, a handy overview of what we have done:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 512) 1536
_________________________________________________________________
dense_1 (Dense) (None, 256) 131328
_________________________________________________________________
dense_2 (Dense) (None, 64) 16448
_________________________________________________________________
dense_3 (Dense) (None, 5) 325
=================================================================
Total params: 149,637
Trainable params: 149,637
Non-trainable params: 0
So this network contains almost 150,000 parameters, all of which are trainable.
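We can verify these counts by hand: a dense layer with \(n\) inputs and \(m\) nodes contains \((n + 1)\cdot m\) parameters (weights plus biases). A minimal sketch:

## parameters of a dense layer: (inputs + 1) * nodes
sizes = [2, 512, 256, 64, 5]  # input, hidden layers, output
params = [(n + 1)*m for n, m in zip(sizes[:-1], sizes[1:])]
print(params, sum(params))  # [1536, 131328, 16448, 325] 149637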
After successful compilation we can fit the model:
history = model.fit(X, y, epochs=200)
In this example X is the design matrix, y is the outcome vector, and the argument epochs tells how many epochs to run the optimizer. Keras reports the progress while optimizing; it may look like
Epoch 1/200
25/25 [==============================] - 0s 2ms/step - loss: 1.5984 - accuracy: 0.2438
Epoch 2/200
25/25 [==============================] - 0s 2ms/step - loss: 1.5709 - accuracy: 0.2763
Epoch 3/200
25/25 [==============================] - 0s 3ms/step - loss: 1.5550 - accuracy: 0.3013
where one can see the current epoch, the batch (for stochastic gradient descent), and the current accuracy on training data. As in this example all the epochs are run to completion, we always see 25/25 for batches, but otherwise one can see the batch number progressing as the training proceeds. This example runs very fast: keras reports 2ms per step (batch), and the total time per epoch is too small to be reported. But a single epoch may take many minutes for more complex models and more data.
The reported metric (here accuracy) is computed for a single batch, and it gives only a crude indication of the actual model accuracy, even on training data. Do not take this accuracy measure too seriously!
Keras lets you predict using a model that has not been fitted (unlike sklearn, where that causes an error). The results will look mediocre at best.
19.2.2.3 Predicting and Plotting
When the fitting is done, we can use the model for prediction. Prediction itself works in a similar fashion as in sklearn, just the predict method predicts probabilities, not categories (analogously to sklearn's predict_proba):
phat = model.predict(X)
In this example, this will be a matrix of 5 columns where each column represents the probability that the data point belongs to the corresponding category. Example lines of phat may look like

phat[:5]
[[1.3339513e-37 5.6408087e-24 2.7101044e-10 1.2674906e-03 9.9873251e-01]
[2.7687559e-09 1.9052830e-02 9.8094696e-01 2.3103729e-07 2.8597169e-19]
[1.6330467e-18 6.0083215e-07 9.5986998e-01 4.0129449e-02 2.5692884e-08]
[5.6379267e-15 1.8879733e-06 9.9859852e-01 1.3995816e-03 2.9565132e-11]
[1.0658005e-19 7.9592645e-08 1.7678380e-01 7.8631145e-01 3.6904618e-02]]
In case of the first line, the largest probability, 0.998, is in the 5th column. In the three following lines, the 3rd column contains the largest probabilities, 0.981, 0.960, and 0.999 respectively; and finally, in the fifth line, the maximum value 0.786 is in the 4th column.
Next, in order to find the corresponding column, we can use np.argmax(phat, axis=-1). It just finds the location of the largest element in the array, across the last axis (axis=-1), i.e. the columns. So for each row, we find the corresponding column number. Be aware that np.argmax counts columns starting from 0, not from 1:

yhat = np.argmax(phat, axis=-1)
yhat[:5]
[4, 2, 2, 2, 3]
Finally, we can compute the confusion matrix, using e.g. pd.crosstab or sklearn's confusion_matrix. Here is an example of how to compute the confusion matrix and accuracy:
from sklearn.metrics import confusion_matrix
print("confusion matrix:\n", confusion_matrix(category, yhat))
print("Accuracy (on training data):", np.mean(category == yhat))
In this example we predict on training data, but we can obviously choose another set of data for predictions. As the predicted value will be a probability matrix of 5 columns, we compute yhat as the column number that contains the largest probability for each row. The output may look something like this:
confusion matrix:
[[172 4 0 0 0]
[ 10 82 4 0 0]
[ 0 11 215 7 0]
[ 0 0 8 116 1]
[ 0 0 0 10 160]]
Accuracy (on training data): 0.93125
As one can see, the confusion matrix is populated almost exclusively on the main diagonal, and accuracy is high.
Finally, if we want to make a similar decision boundary plot as above, then we have to modify the DBPlot function in order to address the fact that keras models only predict probabilities:
def DBPlot(m, X, y, nGrid=100):
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, nGrid),
                           np.linspace(x2_min, x2_max, nGrid))
    XX = np.column_stack((xx1.ravel(), xx2.ravel()))
    ## predict probability
    phat = m.predict(XX)
    ## find the column that corresponds to the maximum probability
    hatyy = np.argmax(phat, axis=-1).reshape(xx1.shape)
    plt.figure(figsize=(8,8))
    _ = plt.imshow(hatyy, extent=(x1_min, x1_max, x2_min, x2_max),
                   aspect="auto",
                   interpolation='none', origin='lower',
                   alpha=0.3)
    plt.scatter(X[:,0], X[:,1], c=y, s=30, edgecolors='k')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.show()
Again, the function is almost identical to the sklearn version, except for the line containing np.argmax(phat, axis=-1) that converts the predicted probabilities to categories.
19.3 Image processing with convolutional networks
The main reason to choose keras over sklearn is the much more powerful toolset included in the keras library. This includes convolutional layers, which are widely used in image processing, and also data generators that load data on the fly, so one does not have to keep tens of thousands of images in memory.
We demonstrate the usage by categorizing images into cats and dogs. The data used in this example can be downloaded from kaggle. It contains 25,000 labeled training images and 12,500 unlabeled testing images (as these are not labeled, they cannot really be used for testing). The data sets are large (as is common for image processing): the training data is 600MB and the testing data 300MB. In the code below we assume the training images are located in cats-n-dogs/train. The full code example is on the Bitbucket repo; here we discuss just the more crucial parts of it.
19.3.1 Loading Data
Let us first set the model parameters:
= "cats-n-dogs"
imgDir = 128, 128
imageWidth, imageHeight = (imageWidth, imageHeight)
imageSize = 3 channels
As we need to repeatedly find the images, we specify the location of the folder here (you have to adjust this for your computer if you want to run this program). Next, because the input tensors that correspond to the images must be of the same size, we specify the image target size here and resize all images later into shape imageSize (this will be done by the data generators, see below). In this example, one color channel will be a \(128\times128\) matrix. We also specify that the images contain 3 color channels, so a single image is in fact a \(128\times128\times3\) tensor. Obviously, a larger image size gives better predictions but will be slower.
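If you want to check what such an image tensor looks like, you can load and resize a single image with keras' image utilities (a sketch; the file name cat.0.jpg is just an example, use any file you actually have):

import os
from tensorflow.keras.preprocessing.image import load_img, img_to_array

img = load_img(os.path.join(imgDir, "train", "cat.0.jpg"),
               target_size=imageSize)
x = img_to_array(img)
print(x.shape)  # (128, 128, 3): pixels and color channels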
We do not want to load all images into memory; that would be a very memory-hungry approach. Instead, we specify data generators, functions that load the images from disk only when they are needed. There are different options for data generators; the one we use here is flow_from_dataframe. It expects the file names and the corresponding correct categories to be listed as data frame rows. We start by creating a data frame of all labeled images and their corresponding labels (cat vs dog). The images are named "cat.xxxxx.jpg" and "dog.xxxxx.jpg", so we can use the first three letters of the file name as the category:
import os

filenames = os.listdir(os.path.join(imgDir, "train"))  # all training file names
print(len(filenames), "images found")
trainImages = pd.DataFrame({
    'filename': filenames,
    # first three letters of file name give the category
    'category': pd.Series(filenames).str[:3]
})
print("categories:\n", trainImages.category.value_counts())
So we create a data frame that contains two columns, filename and category. Later we tell keras to load images that are listed in this data frame using the supplied category as the label. A sample of the dataframe might look like:
filename category
2966 cat.1278.jpg cat
1105 cat.2979.jpg cat
8483 dog.5446.jpg dog
19006 dog.5471.jpg dog
20363 dog.5543.jpg dog
We can see that image file cat.1278.jpg is of category cat while image dog.5446.jpg is of category dog.
19.3.2 Building the Model
Now it is time to build the model. The basic model building steps are similar to those discussed in Example network in keras: building the model, but this time we add more types of layers:
from tensorflow.keras.layers import (Conv2D, MaxPooling2D,
                                     BatchNormalization, Dropout,
                                     Flatten, Dense)

model = Sequential()
## First convolutional layer with 32 filters (kernels)
model.add(Conv2D(32,
                 kernel_size=3,
                 activation='relu',
                 input_shape=(imageWidth, imageHeight, channels)))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.25))
## 2nd convolutional layer
model.add(Conv2D(64,
                 kernel_size=3,
                 activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.25))
## 3rd convolutional layer
model.add(Conv2D(128,
                 kernel_size=3,
                 activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.25))
## Flatten the image into a string of pixels
model.add(Flatten())
## Use one final dense layer
model.add(Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
## Output layer with 2 softmax nodes
model.add(Dense(2, activation='softmax'))
We build a sequential model, but now the first three layers are convolutional layers. All layers, except the output layer, are activated using the relu function.
- The first layer (a 2-D convolutional layer) contains 32 filters of size \(3\times3\). This means we introduce 32 different convolutions and let the network learn which 32 filters give the best performance. As the strides argument is not specified, these filters use \(1\times1\) strides, i.e. the kernel is moved over the images one pixel at a time.
- Unlike the other layers, the first layer specifies the input shape, in this case the image shape (width, height, and color channels), as an individual data point here is an image. The input shape describes what the input to the model looks like. In case of linear regression, or a non-convolutional network, it is normally a single vector of \(x\)-s. However, here we cannot use vectors: we cannot just flatten the image into a 1-D series of pixels, because convolutions need information about the pixel locations. So we have to tell the model what the image size is and how many color channels there are.
- The first layer contains 896 parameters: \(3\times3\) kernel weights for each of the 32 filters and each of the 3 color channels, plus 32 biases, one for each filter: \((3\times3\times3 + 1)\times 32 = 896\).
- Convolutional layers are followed by the corresponding max pooling over \(2\times2\) image regions.
- BatchNormalization and Dropout are not separate layers but ways of training the corresponding layer's weights. The former is useful to ensure stable gradients, the latter to avoid overfitting.
- The second and the third convolutional layers are similar to the first one, except that they contain more filters and do not specify the input shape. Keras can find the input shape itself based on the previous layers. These layers also contain many more parameters, even though they are specified using a \(3\times3\) kernel. Remember: the image itself contains three color channels, but the second convolutional layer works on the output of the first layer, i.e. 32 channels. Now the actual size of the kernel is \(3\times3\times32\) and hence we have \((3\times3\times32 + 1)\times 64 = 18,496\) parameters.
- The final block of layers starts by flattening the image. This means we transform the 3-D tensors into a 1-D array of pixel results, and lose the spatial information in the process. The flattened data is fed into the dense layer with similar batch normalization and dropout.
- Finally, we predict using two output nodes, activated through the softmax function.
The model summary will look like:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 126, 126, 32) 896
_________________________________________________________________
batch_normalization (BatchNo (None, 126, 126, 32) 128
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 63, 63, 32) 0
_________________________________________________________________
dropout (Dropout) (None, 63, 63, 32) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 61, 61, 64) 18496
_________________________________________________________________
batch_normalization_1 (Batch (None, 61, 61, 64) 256
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 30, 30, 64) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 30, 30, 64) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 28, 28, 128) 73856
_________________________________________________________________
batch_normalization_2 (Batch (None, 28, 28, 128) 512
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 14, 14, 128) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 14, 14, 128) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
dense (Dense) (None, 512) 12845568
_________________________________________________________________
batch_normalization_3 (Batch (None, 512) 2048
_________________________________________________________________
dropout_3 (Dropout) (None, 512) 0
_________________________________________________________________
dense_1 (Dense) (None, 2) 1026
=================================================================
Total params: 12,942,786
Trainable params: 12,941,314
Non-trainable params: 1,472
It is very instructive to analyze and understand the number of parameters and the output shapes. We can see that the model contains almost 13M parameters, most of which are in the last dense layer. This is because all the output pixels of the third pooling layer (\(14\times14\times128\)) must be fed into all 512 nodes of the dense layer. Hence we have \(512 \times 14 \times 14 \times 128 = 12,845,056\) weights and 512 biases, this is exactly 12,845,568 parameters for the dense layer.
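We can reproduce the other parameter counts in the same way; a minimal sketch following the formula (kernel width × kernel height × input channels + 1) × filters:

## convolutional layers: (3*3 kernel * input channels + 1) * filters
print((3*3*3 + 1)*32)      # conv2d:   896
print((3*3*32 + 1)*64)     # conv2d_1: 18496
print((3*3*64 + 1)*128)    # conv2d_2: 73856
## dense layer: flattened 14*14*128 outputs into 512 nodes, plus biases
print(14*14*128*512 + 512) # dense:    12845568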
We can also see how the image size decreases through the convolutional layers. Remember, the input images are \(128\times128\) pixels. As the convolutional filters are \(3\times3\) pixels large, each of them cuts two pixels off from the image (as we did not specify any padding), and hence the first conv2d layer outputs \(126\times126\) pixels. Max pooling over \(2\times2\) squares further halves the image size to \(63\times63\). If we had specified strides larger than one, the size would shrink even more rapidly.
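This bookkeeping is easy to reproduce in code: a convolution with a \(k\times k\) kernel and no padding outputs \(n - k + 1\) pixels per dimension, and \(2\times2\) max pooling halves the result (with integer division). A minimal sketch:

size = 128
for block in range(3):
    size = size - 3 + 1  # 3x3 convolution, no padding, 1x1 strides
    size = size // 2     # 2x2 max pooling
    print("image size after block", block + 1, ":", size)  # 63, 30, 14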
19.3.3 Common model errors
Keras' errors may be hard to understand for the uninitiated. Here we describe a few common errors. Note that these may be buried inside a long list of messages, usually toward the end.
19.3.3.1 Wrong input shape
Beginners often do not understand the input shape. But it is a necessary piece of information that must be fed to the model. If we get it wrong, for instance if we specify the first layer as
model.add(Conv2D(32,
                 kernel_size=3,
                 activation='relu',
                 input_shape=(imageWidth, imageHeight)))  # no channels!
Then keras responds with a message
ValueError: Input 0 of layer conv2d is incompatible with the layer:
expected min_ndim=4, found ndim=3.
Full shape received: [None, 128, 128]
This tells us that a Conv2D layer expects a 4-D tensor as its input (min_ndim=4). The correct shape should be [None, 128, 128, 3], where None marks the dimension along which the different cases (different images) are stacked.
This error happens during model building, i.e. when you call model.add(Conv2D(...)).
Sometimes you get the number of input dimensions right, but their extents wrong. For instance, if you specify 5 color channels instead of 3:
model.add(Conv2D(32,
                 kernel_size=3,
                 activation='relu',
                 input_shape=(imageWidth, imageHeight, 5)))  # should be 3 channels!
then the error will be
tensorflow.python.framework.errors_impl.InvalidArgumentError:
input depth must be evenly divisible by filter depth: 3 vs 5
It tries to wrap the \(128\times128\times3\) image into a \(128\times128\times5\) tensor, but it does not fit.
This error first occurs when the fitting algorithm discovers that the images contain 3 channels instead of 5, i.e. when you call model.fit.
19.3.3.2 Wrong number of categories
This problem is conceptually fairly easy to grasp: the number of nodes in the final softmax layer must equal the number of categories. If we get this wrong, e.g. by specifying the last layer as
model.add(Dense(3, activation='softmax'))  # too many categories!
then keras produces
tensorflow.python.framework.errors_impl.InvalidArgumentError:
logits and labels must be broadcastable:
logits_size=[32,3] labels_size=[32,2]
This tells us that we requested 3 nodes (3 "logits"), but the data (labels) only contain 2 different categories. "32" is the batch size here; that is why it reports not "3" and "2" but "[32,3]" and "[32,2]".
This error occurs when the fitting algorithm finds that there are too few categories, i.e. when you call model.fit.
19.3.3.3 Image runs out of pixels
Each convolutional filter makes the image smaller because we lose pixels at the edges (unless we use padding). If we use strides larger than one, we also lose output pixels because we now move in larger steps. In a similar way, pixels get lost in pooling, because pooling layers "pool" pixels over neighboring areas into a single one. In this way it may happen that at a certain stage the image does not contain any pixels anymore.
Let’s demonstrate it by adding strides=3 to all convolutional layers. The first layer now looks like:
model.add(Conv2D(32,
                 kernel_size=3,
                 strides=3,
                 activation='relu',
                 input_shape=(imageWidth, imageHeight, channels)))
The image gets too small: after a few operations it contains no pixels, and keras stops with
ValueError: Negative dimension size caused by subtracting 2 from 1
for '{{node max_pooling2d_2/MaxPool}} = MaxPool[T=DT_FLOAT,
data_format="NHWC", ksize=[1, 2, 2, 1], padding="VALID",
strides=[1, 2, 2, 1]](batch_normalization_2/cond/Identity)'
with input shapes: [?,1,1,128].
If you notice such a "negative dimension size" error, then you should check the "Output Shape" column in the model summary (see Section 19.3.2 above). Remove all convolutional layers besides the first one and try to understand what the image size will be after each operation, and which operations you want to remove or modify to retain an image with a meaningful number of pixels.
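One can mimic this bookkeeping with a few lines of code. With valid padding, a convolution outputs \(\lfloor (n - k)/s \rfloor + 1\) pixels per dimension for kernel size \(k\) and stride \(s\); a minimal sketch for the strides=3 example above:

size = 128
for block in range(3):
    size = (size - 3)//3 + 1  # 3x3 convolution with strides=3
    size = size // 2          # 2x2 max pooling
    print("image size after block", block + 1, ":", size)  # 21, 3, 0
## the image is down to a single pixel before the third pooling layer,
## hence the "negative dimension size" error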
This error occurs at the model compilation stage, where keras computes the sizes of the tensors.
19.3.4 Training the model
Now it is time to set up the training data generator that reads files during the model fitting:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## Training data generator:
train_generator = ImageDataGenerator(
    rescale=1./255,
    rotation_range=15,
    shear_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
    width_shift_range=0.1,
    height_shift_range=0.1
).flow_from_dataframe(
    trainImages,  # the data frame created above
    os.path.join(imgDir, "train"),
    x_col='filename', y_col='category',
    class_mode='categorical',  # target is 2-D array of one-hot encoded labels
    target_size=imageSize,
    shuffle=True
)
ImageDataGenerator is a class that can handle images in various ways; in particular, it can rescale the intensity from the 0-255 integer range to the 0-1 float range, and it can introduce various small modifications to the image, such as rotation and shear. This is useful for adding more variation to the training data. However, do not modify your test data! We also shuffle the training data in order to avoid always feeding it to the network in the same order.
After introducing modifications to the images, we call the method flow_from_dataframe. It takes the data frame that contains both the image file names (specified as x_col) and the labels (specified as y_col) and tells how the data should be read. In this case we request the images to be converted to the target size. class_mode tells keras to convert the category into a one-hot encoded matrix, the shape needed by the model.
It is possible to add another similar data generator for validation data, so keras will output current information not just about training accuracy but also about validation accuracy when the model is running. However, we do not do it here for simplicity.
Now we can proceed with model training:
## Model Training:
history = model.fit(
    train_generator,
    epochs=1
)
The model.fit method is somewhat similar to that of sklearn, but it accepts many more options. Here we provide the training data generator and tell how many epochs to train. One epoch is usually too little, but more epochs may be slow. In case of this network and data, a single epoch will give you an accuracy of approximately 60%, while 40 epochs will reach about 95%.
Model fitting returns a history object which contains information about training loss and accuracy; if validation_data is provided (which we did not do here), it also records validation loss and accuracy.
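The history object can be used, for instance, to plot how accuracy develops over epochs (a minimal sketch; with epochs=1 as above there is only a single point to plot, so this is more interesting with more epochs):

## plot training accuracy by epoch
plt.plot(history.history['accuracy'], label='training accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()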
19.3.5 Predictions and Validation
The next step is to predict the category on testing data. We proceed in a similar way, by creating a data generator that reads files as specified in the data frame:
testDir = os.path.join(imgDir, "test")
dfTest = pd.DataFrame({
    'filename': os.listdir(testDir)
})
print(dfTest.shape, "test files read from", testDir)
test_generator = ImageDataGenerator(
    rescale=1./255
    # do not randomize testing!
).flow_from_dataframe(
    dfTest,
    os.path.join(imgDir, "test"),
    x_col='filename',
    class_mode=None,  # we don't want target for prediction
    target_size=imageSize,
    shuffle=False
    # do _not_ randomize the order!
    # this would clash with the file name order!
)
In an analogous fashion as when we created the training data, we first list the image files in the folder and create the corresponding data frame. However, as these images are not labeled, we do not have a "category" variable here; this is also why we specify class_mode=None. We do not want the testing images distorted or shuffled either. Shuffling the test images would break the correspondence between the images and the predicted labels, and give essentially random accuracy.
We can predict the probabilities just by feeding the test generator to model.predict:
phat = model.predict(test_generator)

dfTest['category'] = np.argmax(phat, axis=-1)
label_map = {0: "cat", 1: "dog"}
dfTest['category'] = dfTest['category'].replace(label_map)
phat will be the predicted probability matrix, with the first column describing the probability that the image is a cat, and the second column the probability that the image is a dog. It may look like
[[0.5337883 0.46621174]
[0.4693922 0.53060776]
[0.10248788 0.89751214]
[0.05728473 0.9427153 ]
[0.6855554 0.31444466]]
In this example the first image is more likely a cat (\(p = 0.53\)), but the algorithm is not really sure. However, the fourth image is confidently a dog (\(p = 0.94\)). The last three lines of the code find the category with the maximum probability, and replace the 0/1 labels with cat/dog labels to make the results easier to read.
The final part of the code also plots a random set of images with the corresponding labels, after resizing them to the desired image size, so you can see what kind of images the computer was working with!
19.3.6 Analyzing the model
You can get the layer data out of fitted models with

conv1 = model.get_layer("conv1")
This returns a large number of parameters for the layer called conv1. One of these is weights, a list with two components: filter weights (the first component) and filter biases (the second component). We can extract the weights as
conv1.weights[0]
This returns an array of shape \(S_x \times S_y \times N_L \times N\) where \(S_x\) and \(S_y\) are the x- and y-size of the filter, \(N_L\) is the number of input channels (layers) for the filters, and \(N\) is the number of filters. For instance, if the first convolutional layer contains 50 filters of \(4\times4\) size, and the input is a 3-layer color image, then the corresponding weights form a \(4\times 4\times 3\times 50\) array.
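Note that keras names layers automatically (conv2d, conv2d_1, and so on), so a layer called conv1 only exists if it was explicitly named when the model was built. A minimal sketch (the name conv1 is an assumption, not a keras default):

## name the layer when adding it:
## model.add(Conv2D(32, kernel_size=3, activation='relu', name="conv1", ...))
conv1 = model.get_layer("conv1")
W, b = conv1.weights     # filter weights and biases
print(W.shape, b.shape)  # e.g. (3, 3, 3, 32) and (32,)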