Chapter 18 Unsupervised Learning
We assume you have loaded the following packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Below we load more packages as we introduce them.
18.1 Clustering
18.1.1 \(k\)-means in sklearn
Clustering is best understood in a 2-D framework where we can easily visualize the data and clusters. We start by making such a dataset and demonstrate how to use \(k\)-means clustering with the sklearn library. But be aware that these data are exceptionally well suited for clustering; real data are almost always much messier and often not even appropriate for cluster analysis.
We create 5 clusters. As a first step, we randomly create the corresponding cluster centers’ coordinates \(x_{1c}\) and \(x_{2c}\):
nc = 5  # number of clusters
np.random.seed(1)  # make the results replicable
## create cluster centers
x1c = np.random.normal(scale=5, size=nc)
x2c = np.random.normal(scale=5, size=nc)
The random locations are taken from a normal distribution with scale (i.e. standard deviation) equal to 5. We want the clusters to be located sufficiently far from each other so that they are clearly distinct for this demonstration.
Next, we create 100 data points. We assign each data point to a cluster and thereafter place it at a random location near the corresponding cluster center. The clusters might have meaningful ids, but here we just assign them a numeric label \(0, 1, \dots, 4\).
n = 100  # number of data points
cl = np.random.choice(nc, n)  # cluster membership label for each dot
## create clusters around centers
x1 = x1c[cl] + np.random.normal(size=n)
x2 = x2c[cl] + np.random.normal(size=n)
Variable cl is the id (numeric label) of the corresponding cluster, picked randomly with np.random.choice. Note how the points \((x_1, x_2)\) are placed close to the corresponding cluster center \((x_{1c}, x_{2c})\) by adding additional normal random noise. However, this time scale has its default value, i.e. the standard deviation of the noise is 1, so the data points are dispersed much less around their cluster center than the centers themselves are dispersed.
The figure shows the generated data.
ax = plt.axes()
_ = ax.scatter(x1, x2, c=cl,
               s=50, alpha=0.5)
_ = ax.set_aspect("equal")
_ = plt.show()
There are five distinct clusters, here colored differently depending on the corresponding cluster id. The vertical and horizontal scales are made equal so that the human eye can correctly judge the distances we use below.
Next, to our main task: to correctly detect the clusters in the data. Here we use the \(k\)-means algorithm, a simple and perhaps the most popular clustering algorithm. It attempts to allocate the data points into a pre-defined number \(k\) of clusters in a way that minimizes intra-cluster variation. See more in Lecture Notes Chapter 10.
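To make the idea concrete, here is a minimal NumPy sketch of the classical iteration behind \(k\)-means (Lloyd's algorithm): alternate between assigning every point to its nearest center and moving every center to the mean of its assigned points. This is only an illustration under simplifying assumptions (fixed number of iterations, no handling of empty clusters; the helper name kmeans_sketch is ours), not the actual sklearn implementation, which we use below.

def kmeans_sketch(X, k, n_iter=10, seed=1):
    """Simplified k-means (Lloyd's algorithm), for illustration only."""
    rng = np.random.default_rng(seed)
    ## start from k randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        ## assign every point to its nearest center (squared Euclidean distance)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        ## move every center to the mean of the points assigned to it
        ## (a full implementation would also handle clusters that lose all points)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers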
Using unsupervised methods in sklearn is in many ways similar to using supervised ones, except that fitting the model must be done without the target y.
from sklearn.cluster import KMeans
X = np.column_stack((x1, x2))
m = KMeans(n_clusters=5)
_ = m.fit(X)
We import KMeans from sklearn.cluster. Next, we create the design matrix \(\mathsf{X}\) which we need below for fitting. Thereafter we set up the model by just calling KMeans(). The most important argument it expects is the number of clusters \(k\). There are other options regarding the iterations (max_iter), the exact algorithm (algorithm), etc.
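For instance, a more explicit setup might look like the sketch below; n_init, max_iter and random_state are genuine KMeans arguments, but the particular values are merely illustrative assumptions, and the name m2 is ours.

m2 = KMeans(n_clusters=5,   # number of clusters k
            n_init=10,      # number of random re-starts
            max_iter=300,   # iteration cap per re-start
            random_state=1) # make the results replicable
_ = m2.fit(X)

In what follows we stick to the simple call above, m = KMeans(n_clusters=5).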
Thereafter we fit the model. Note that for unsupervised model fitting we only supply the design matrix X and no target. This creates the fitted model that we can use in the next steps for predicting cluster labels and querying the location of the cluster centers:
hatCl = m.predict(X)  # predicted cluster membership for each dot
print("labels:\n", hatCl[:20])
The predict method computes the predicted cluster label for each observation (in X); the possible labels range from \(0\) to \(k-1\). \(k\)-means relies on cluster centroids; those can be queried through the cluster_centers_ attribute:
hatc = m.cluster_centers_  # centroids for each cluster
print("centers:\n", hatc)
This returns \(k\) (here 5) vectors (as rows of the matrix). Each vector is the centroid of the corresponding cluster, in the same order as the labels. So the first row of the matrix is the centroid for cluster “0”. As the data here are 2-dimensional, each centroid contains two components.
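As a quick sanity check, we can confirm (using the objects created above) that each centroid is just the average of the points assigned to that cluster, and that predict returns the label of the nearest centroid. This is a small illustrative sketch; the name dist below is ours, and up to ties both comparisons are expected to agree.

## centroid of cluster "0" versus the mean of points predicted as "0"
print(hatc[0])
print(X[hatCl == 0].mean(axis=0))
## re-derive the labels by assigning each point to its nearest centroid
dist = ((X[:, None, :] - hatc[None, :, :]) ** 2).sum(axis=-1)
print(np.all(dist.argmin(axis=1) == hatCl))  # expect True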
It is fairly simple to visualize \(k\)-means results:
ax = plt.axes()
## plot datapoints in size 50,
## color as predicted cluster
_ = ax.scatter(X[:,0], X[:,1],
               c=hatCl,
               alpha=0.3, s=50,
               edgecolor="k")
## plot centers in size 200,
## colors as cluster labels
_ = ax.scatter(hatc[:,0], hatc[:,1],
               c=np.unique(hatCl),
               s=200, alpha=0.5,
               edgecolor="k")
_ = ax.set_aspect("equal")
_ = plt.show()
We also plot the centroids hatc as larger circles, in colors that correspond to the clusters.
It is also very easy to make an elbow plot:
ks = range(2, 10)  # range of possible clusters, should be >1
loss = pd.Series(0.0, index=ks)  # store loss values here
for k in ks:
    m = KMeans(k).fit(X)
    loss[k] = m.inertia_

_ = plt.plot(ks, loss, marker="o")
_ = plt.xlabel("Number of clusters k")
_ = plt.ylabel("Loss")
_ = plt.show()
The fitted model’s inertia_ attribute contains the corresponding loss function value. Here we fit \(k\)-means models for \(k = 2, 3, \dots, 9\), store the respective loss, and plot it.
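For intuition, inertia_ is just the within-cluster sum of squared distances: the distance from each point to its assigned centroid, squared and summed over all points. The sketch below recomputes it by hand for a single freshly fitted model (the names km and loss_by_hand are our own); the two printed numbers should agree up to floating-point rounding.

km = KMeans(n_clusters=5).fit(X)
labels = km.predict(X)         # cluster label for every point
centers = km.cluster_centers_  # centroid of every cluster
## squared distance from each point to its own centroid, summed up
loss_by_hand = ((X - centers[labels]) ** 2).sum()
print(loss_by_hand, km.inertia_)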
We can see that the loss falls rapidly up to \(k=5\) and barely decreases thereafter. Hence \(k=5\) is the optimal number of clusters in these data. Indeed, this can be confirmed visually in the figures above.
We can also compute an analogue of the confusion matrix:
from sklearn.metrics import confusion_matrix
confusion_matrix(cl, hatCl)  # cl: original cluster id,
                             # hatCl: predicted cluster id
## array([[ 0, 21,  0,  0,  0],
##        [ 0,  0,  0,  0, 25],
##        [ 0,  0,  0, 17,  0],
##        [15,  0,  0,  0,  0],
##        [ 0,  0, 22,  0,  0]])
One can see that each original cluster (rows) is consistently predicted as a single cluster (columns); for instance, all members of the original cluster “0” (the first row) are predicted to be members of cluster “1”. This is as good an accuracy as it is possible to achieve, as the algorithm has no way to know the original labels of the clusters.
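If a single label-invariant summary number is preferred, one option is the adjusted Rand index from sklearn.metrics: it equals 1 when two labelings agree perfectly up to renaming of the clusters. The snippet below uses the cl and hatCl arrays from above; given the confusion matrix we just saw, we expect a value of 1.

from sklearn.metrics import adjusted_rand_score
## 1.0 means perfect agreement up to a relabeling of clusters
print(adjusted_rand_score(cl, hatCl))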
18.1.2 Hierarchical clustering in sklearn
Hierarchical clustering works in a fairly similar way in sklearn:
from sklearn.cluster import AgglomerativeClustering
m = AgglomerativeClustering(n_clusters=5)
cl = m.fit_predict(X)
There is one noticeable difference though: the AgglomerativeClustering model does not have a .predict method. One should use .fit_predict instead, which does both fitting and predicting in one go.
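As an illustration, one can also request a different merging rule through the linkage argument (valid options include "ward", "complete", "average" and "single"); the particular choice below, and the names mc and clc, are just for demonstration. In the rest of this section we continue with the default (ward) result stored in cl above.

## hierarchical clustering with complete linkage (illustrative choice)
mc = AgglomerativeClustering(n_clusters=5, linkage="complete")
clc = mc.fit_predict(X)  # fit and predict in one go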
On these data, the hierarchical clusters are exactly the same as the \(k\)-means ones (just their labels may differ):
_ = plt.scatter(X[:,0], X[:,1],
                c=cl, alpha=0.5, s=20)
_ = plt.show()
This is because the data are so cleanly split into distinct clusters that essentially any clustering algorithm will pick up the same structure.
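One way to check this claim is to cross-tabulate the \(k\)-means labels hatCl with the hierarchical labels cl: if the two solutions indeed coincide, every row of the matrix should contain a single non-zero count (a quick sketch using the confusion_matrix function imported above).

## compare k-means labels with hierarchical labels
print(confusion_matrix(hatCl, cl))

Finally, scipy can display the whole hierarchical merging process as a dendrogram: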
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
_ = plt.show()