Chapter 18 Unsupervised Learning

We assume you have loaded the following packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Below we load more packages as we introduce them.

18.1 Clustering

18.1.1 \(k\)-means in sklearn

Clustering is best understood in a 2-D setting where we can easily visualize the data and the clusters. We start by making such a dataset and demonstrate how to do \(k\)-means clustering with the sklearn library. But be aware that these data are exceptionally well suited for clustering; real data are almost always much messier and often not even appropriate for cluster analysis.

We create 5 clusters. As a first step, we randomly create the corresponding cluster centers’ coordinates \(x_{1c}\) and \(x_{2c}\):

nc = 5  # number of clusters
np.random.seed(1)  # make the results replicable
## create cluster centers
x1c = np.random.normal(scale=5, size=nc)
x2c = np.random.normal(scale=5, size=nc)

The random locations are drawn from a normal distribution with scale (i.e. standard deviation) equal to 5. We want the clusters to be located sufficiently far from each other so that they are clearly distinct for this demonstration.

Next, we create 100 data points. We assign each data point to a cluster and thereafter put it into a random location near the corresponding cluster center. The clusters might have meaningful ids, but here we just assign them the numeric labels \(0, 1, \dots, 4\).

n = 100  # number of data points
cl = np.random.choice(nc, n)  # cluster membership label for each dot
## create clusters around centers
x1 = x1c[cl] + np.random.normal(size=n)
x2 = x2c[cl] + np.random.normal(size=n)

Variable cl is the id (numeric label) of the corresponding cluster, picked randomly with np.random.choice. Note how the points \((x_1, x_2)\) are placed close to the corresponding cluster center \((x_{1c}, x_{2c})\) by adding normal random noise. However, this time scale has its default value, i.e. the standard deviation of the noise is 1, so the data points are dispersed much less around their cluster center than the centers themselves.
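As a quick check (a small addition, not part of the original code), we can count how many points were assigned to each cluster:

np.bincount(cl)  # number of data points carrying each cluster label 0..4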

Figure: Randomly generated distinct clusters of data points. The horizontal and vertical scales are made equal.

The figure shows the generated data.

ax = plt.axes()
_ = ax.scatter(x1, x2, c=cl,
               s=50, alpha=0.5)
_ = ax.set_aspect("equal")
_ = plt.show()

There are five distinct clusters, colored differently depending on the corresponding cluster id. The vertical and horizontal scales are made equal, so that the human eye can correctly judge the distances we rely on below.

Next, to our main task: to correctly detect the clusters in the data. Here we use the \(k\)-means algorithm, a simple and perhaps the most popular clustering algorithm. It attempts to allocate the data points into a pre-defined number of clusters \(k\) in a way that minimizes the intra-cluster variation. See more in Lecture Notes Chapter 10.
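More formally, given cluster assignments \(C_1, \dots, C_k\) with centroids \(\mu_1, \dots, \mu_k\), \(k\)-means minimizes the within-cluster sum of squares \[ \sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2, \] the same quantity that sklearn reports as the inertia_ attribute (used for the elbow plot below).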

Using unsupervised methods in sklearn is in many ways similar to using supervised ones, except that fitting the model must be done without the target y.

from sklearn.cluster import KMeans
X = np.column_stack((x1, x2))
m = KMeans(n_clusters=5)
_ = m.fit(X)

We import KMeans from sklearn.cluster. Next, we create the design matrix \(\mathsf{X}\), which we need below for fitting. Thereafter we set up the model by just calling KMeans(). The most important argument it expects is the number of clusters \(k\). There are other options regarding the iterations (max_iter), the exact algorithm (algorithm), etc.
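For instance, a call along the following lines also fixes the random initialization and caps the number of iterations (the name m2 and the specific values are only for illustration; below we continue with the simpler model m created above):

m2 = KMeans(n_clusters=5, n_init=10, max_iter=300, random_state=1)
          # n_init: number of random restarts
          # max_iter: iteration limit per restart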

Thereafter we fit the model. Note that for unsupervised model fitting we only supply the design matrix X and no target. This creates the fitted model that we can use in the next steps for predicting cluster labels and querying the location of cluster centers:

hatCl = m.predict(X)  # predicted cluster membership for each dot
print("labels:\n", hatCl[:20])
## labels:
##  [4 1 0 0 2 0 0 0 0 2 1 0 2 2 3 4 0 2 3 0]

The predict method computes the predicted cluster label for each observation (in X); the possible labels range from \(0\) to \(k-1\). \(k\)-means relies on cluster centroids; those can be queried through the cluster_centers_ attribute:

hatc = m.cluster_centers_  # centroids for each cluster
print("centers:\n", hatc)
## centers:
##  [[ -3.11293253   8.42900307]
##  [  4.20444648  -1.50785831]
##  [  8.0292849  -11.62947644]
##  [ -5.41322852   1.54188806]
##  [ -2.64264202  -3.99768911]]

This returns \(k\) (here 5) vectors (as rows of the matrix). Each vector is the centroid of the corresponding cluster, in the same order as the labels, so the first row of the matrix is the centroid of cluster “0”. As the data here are 2-dimensional, each centroid contains two components.
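Because a \(k\)-means centroid is just the average of the points assigned to that cluster, we can verify the first row directly (a small check added for illustration):

print(X[hatCl == 0].mean(axis=0))  # mean of the points predicted into cluster 0
print(hatc[0])  # should be (essentially) the same as the first centroid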

Figure: The same data points as above. Different colors depict the predicted cluster membership; the big circles are the corresponding cluster centers.

It is fairly simple to visualize \(k\)-means results:

ax = plt.axes()
## plot datapoints in size 50,
## color as predicted cluster
_ = ax.scatter(X[:,0], X[:,1],
               c=hatCl,
               alpha=0.3, s=50,
               edgecolor="k")
## plot centers in size 200,
## colors as cluster labels
_ = ax.scatter(hatc[:,0], hatc[:,1],
               c=np.unique(hatCl), 
               s=200, alpha=0.5,
               edgecolor="k")
_ = ax.set_aspect("equal")
_ = plt.show()

We also plot the centroids hatc as larger circles, in colors that correspond to the clusters.


It is also very easy to make an elbow plot:

ks = range(2, 10)
          # range of possible
          # clusters, should be >1
loss = pd.Series(0.0, index=ks)
          # store loss values here
for k in ks:
    m = KMeans(k).fit(X)
    loss[k] = m.inertia_

_ = plt.plot(ks, loss, marker="o")
_ = plt.xlabel("Number of clusters k")
_ = plt.ylabel("Loss")
_ = plt.show()

The fitted model’s inertia_ attribute contains the corresponding loss function value. Here we fit \(k\)-means models for \(k = 2, 3, \dots, 9\), store the respective loss, and plot it.
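The inertia is simply the total squared distance from each data point to its assigned centroid. As a small illustration (not part of the original code), we can recompute it by hand for one fitted model:

m5 = KMeans(n_clusters=5).fit(X)
d2 = ((X - m5.cluster_centers_[m5.labels_])**2).sum()
          # squared distance of each point to its own centroid, summed up
print(d2, m5.inertia_)  # the two numbers agree (up to rounding)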

We can see that the loss falls rapidly until \(k=5\) and is essentially flat (and close to zero) thereafter. Hence \(k=5\) is a good choice for the number of clusters in these data. Indeed, this can be confirmed visually in the figures above.

We can also compute an analogue of the confusion matrix:

from sklearn.metrics import confusion_matrix

confusion_matrix(cl, hatCl)
            # cl: original cluster id,
            # hatCl: predicted cluster id
## array([[ 0,  0, 21,  0,  0],
##        [25,  0,  0,  0,  0],
##        [ 0,  0,  0,  0, 17],
##        [ 0,  0,  0, 15,  0],
##        [ 0, 22,  0,  0,  0]])

One can see that each original cluster (rows) is consistently predicted to a single cluster (columns); for instance, all members of the original cluster “0” (the first row) are predicted to be members of cluster “2”. This is as good an accuracy as it is possible to achieve, as the algorithm has no way of knowing the original labels of the clusters.
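Since cluster labels are arbitrary, a convenient single-number summary of this agreement is the adjusted Rand index from sklearn.metrics (shown here as an extra check; it equals 1 for a perfect match up to relabeling):

from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(cl, hatCl)  # 1.0 = perfect agreement up to relabeling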

18.1.2 Hierarchical clustering in sklearn

Hierarchical clustering works in a fairly similar way in sklearn:

from sklearn.cluster import AgglomerativeClustering

m = AgglomerativeClustering(n_clusters=5)
cl = m.fit_predict(X)  # note: this overwrites the true labels stored in cl above

There is one noticeable difference though: the AgglomerativeClustering model does not have a .predict method. One should use .fit_predict instead, which does both fitting and predicting in one go.
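AgglomerativeClustering also takes a linkage argument ("ward" by default); for instance, one may try average linkage instead (purely an illustration, the names mAvg and clAvg are not used elsewhere):

mAvg = AgglomerativeClustering(n_clusters=5, linkage="average")
clAvg = mAvg.fit_predict(X)  # labels under average linkage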


On these data, the hierarchical clusters are exactly the same as the \(k\)-means ones (just their labels may differ):

_ = plt.scatter(X[:,0], X[:,1],
                c=cl, alpha=0.5, s=20)
_ = plt.show()

This is because the data are so cleanly split into distinct clusters that essentially any clustering algorithm will pick up the same structure.
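We can make this concrete by cross-tabulating the two sets of predicted labels (an extra check, not in the original code); each row should again contain a single non-zero count:

confusion_matrix(hatCl, cl)  # k-means labels (rows) vs. hierarchical labels (columns)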

However, hierarchical clustering also allows us to display dendrograms:

import scipy.cluster.hierarchy as sch

_ = sch.dendrogram(sch.linkage(X, method="ward"))
_ = plt.show()
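If one wants flat cluster labels from the same hierarchy, the linkage matrix can also be cut with scipy's fcluster function (added here as a pointer; note that fcluster numbers clusters from 1, not 0):

Z = sch.linkage(X, method="ward")  # the same linkage matrix as above
labelsH = sch.fcluster(Z, t=5, criterion="maxclust")  # cut into 5 flat clusters
print(labelsH[:20])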