How Machines See People

Skip to animation ↓ ↓ Technical deep dive

An honest story about what cluster analysis is, how it works, and where it lies.

Loading story…

1 / 29

See K-means run live

This is the algorithm you just read about. Click Start to watch it run. Use Step to walk through it one move at a time. Change K to see how different choices give different stories.

Ready

Iteration:0

Click Start to begin the algorithm. Each step will highlight as it runs.

Drop K centroids randomly
Assign each dot to its nearest centroid
Move each centroid to its group's center
Repeat steps 2 and 3 until nothing changes
Converged

K = Speed

↑ Back to story ↓ Want more? Read the technical deep dive

For the curious

The story above is the human version. This section is the technical version. Read whichever serves you. If you came for the math, this is where it lives.

The objective function

K-means optimizes one quantity. For data points \(x_1, \ldots, x_N\) and \(K\) centroids \(\mu_1, \ldots, \mu_K\), with \(z_{ik} = 1\) when point \(i\) is assigned to cluster \(k\) and \(0\) otherwise:

\[J = \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \, \lVert x_i - \mu_k \rVert^2\]

The algorithm alternates between two steps. Fix the centroids, then choose each \(z_{ik}\) to minimize \(J\) (assign every point to its nearest centroid). Fix the assignments, then choose each \(\mu_k\) to minimize \(J\) (set every centroid to the mean of its assigned points). Each step is guaranteed to either lower \(J\) or keep it the same. The process is guaranteed to converge.

It is not guaranteed to converge to the lowest possible \(J\). K-means finds a local minimum that depends on where the centroids were initialized. Two reasonable people running K-means on the same data with different random seeds can land on different clusterings, both locally optimal.

This is the formal mirror of the four choices Lina was making implicitly: \(K\) (how many centroids), the features in \(x\) (which dimensions to measure), the norm in \(\lVert \cdot \rVert\) (which distance metric), and the slice of data she sampled (which time window). Change any of these and you change \(J\), and you can change which local minimum the algorithm finds.

Within-cluster sum of squares decreases at every iteration. Five random seeds, same data, K = 5. All five converge. Not all five land in the same place.

A worked example

Synthetic streaming-service customers. Four features per customer: weekly watch hours, genre variety, monthly spend, and number of devices. Five underlying groups generated with different means and spreads. Two thousand customers in total.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X is a (2000, 4) array of watch_hours, genre_variety, monthly_spend, device_count
X = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=5, n_init=10, random_state=7)
labels = km.fit_predict(X)

# project to 2D so we can see what the algorithm did
pca = PCA(n_components=2).fit(X)
XY = pca.transform(X)
centroids_xy = pca.transform(km.cluster_centers_)

Standardizing matters. Without it, monthly_spend (range roughly 0 to 60) would dominate device_count (range 1 to 5) purely because of unit choice. Standardization puts every feature on the same scale so the Euclidean distance treats them as equally important. This is itself a choice.

PCA projection of the K = 5 K-means solution. Five colored clusters with brass X marks at the centroids. The picture is two-dimensional; the algorithm worked in four.

The elbow method

Run K-means for \(K = 1, 2, \ldots, 10\). Plot \(J\) at the converged solution against \(K\). As \(K\) increases, \(J\) has to decrease (more centroids means each point is on average closer to one of them). What you are looking for is a “bend” where adding another cluster stops buying you much.

In textbook plots the elbow is obvious. In real data it is almost always ambiguous. The chart below shows the elbow for the synthetic dataset. There is a defensible argument for \(K = 3\), \(K = 4\), or \(K = 5\).

Elbow plot. The shaded region marks a defensible range. Outside that region the answer is clearer; inside, it is a judgment call.

Silhouette scores

The silhouette score for a point compares its average distance to the points in its own cluster against its average distance to the points in the nearest other cluster. The result is in \([-1, +1]\). Positive values say the point is closer to its own cluster than to any other. Negative values say it would be better off elsewhere. The mean silhouette across all points gives one number per choice of \(K\).

Silhouette and elbow can disagree, and often do. When they do, you have to choose what you are optimizing for, and write that decision down.

Silhouette scores for K from 2 to 10. The best K by this metric is highlighted in brass.

Stability via bootstrap

This is the test that would have saved Lina in the story above.

Resample the data with replacement. Run K-means on the bootstrap sample. Record, for every pair of customers, whether they ended up in the same cluster. Do this fifty times. For each pair, the fraction of bootstrap runs in which they were co-assigned is their consensus score.

If the clusters are real structure in the data, customers should be co-assigned almost every time. If the clusters are an artifact of sampling noise, the consensus matrix will be fuzzy.

Sort the matrix by a reference clustering and look at the result. Sharp dark blocks on the diagonal say the clusters are stable. Pale boundaries between blocks say the algorithm is uncertain where to draw the line.

Consensus matrix from 50 bootstrap K-means runs at K = 5. Rows and columns sorted by a reference clustering. Dark blocks are stable groupings, lighter cells are unstable.

If you only run one clustering and report the personas, you have no idea whether they are real or whether you would have invented different personas given a different draw of the data.

Same data, four stories

The most uncomfortable plot in this post. Same 2000 customers, four reasonable analytical choices, four different clusterings.

K = 3 Euclidean, K = 5 Euclidean, K = 5 Manhattan, K = 5 cosine. Pick the one that supports your conclusion and you have a press release.

Each of these clusterings could be defended. None of them is wrong. They are different answers to slightly different questions. The marketing team that builds five segments off the K = 5 Euclidean plot will tell a different story than the team that builds three segments off K = 3.

Where K-means breaks

K-means is doing exactly what its objective tells it to do. The objective just does not match what we want from a clustering in every situation.

Three failure modes. Non-spherical clusters get cut in half. Clusters of unequal density get merged or split. A handful of outliers drag centroids away from the real cluster centers.

The crescent-moons example is famous. K-means assumes clusters are roughly spherical and roughly equally sized in feature space. Crescents are neither, so K-means cuts each moon in half along the boundary that minimizes \(J\). The unequal-density example shows the same problem from a different angle: when one true cluster is much tighter than another, K-means tends to split the dense cluster and merge the sparse one to balance variance. The outlier example shows what happens when a few extreme points pull centroids away from where the mass of the data lives.

None of these are bugs. K-means is optimizing \(J\). The complaint is that \(J\) is not the thing we wanted.

Alternatives

DBSCAN. Density-based clustering. Finds clusters as connected regions of high density, marks low-density points as noise. Handles arbitrary cluster shapes and is robust to outliers. The choices move from \(K\) to two new parameters: a neighborhood radius and a minimum number of points to count as a dense region. Different choices, still choices.

Hierarchical clustering. Builds a tree of nested groupings. You can cut the tree at any level to get any number of clusters, and you can see how clusters merge as you relax the threshold. The choices move to the linkage rule (single, complete, average, Ward) and the cut height. The tree itself is a useful artifact even if you never commit to one cut.

Gaussian mixture models. Each cluster is a Gaussian distribution with its own mean and covariance, so clusters can be elongated, rotated, or differently sized. Fit by expectation-maximization. The choices move to \(K\), the covariance structure (full, diagonal, tied), and the assumption that the components really are Gaussian. The model also gives every point a soft probability of belonging to each cluster rather than a hard label.

Switching algorithms does not eliminate the problem from the story above. It moves the choices around. You still have to defend them in writing.

A defensible practice

A checklist that survives most clustering projects.

Standardize features before clustering, or have a written reason not to.
Try multiple \(K\) values. Show both elbow and silhouette plots. Note where they agree and where they do not.
Run K-means with multiple random initializations (n_init = 10 or higher). Report the best result and a sense of the variance across initializations.
Bootstrap the data, run the clustering many times, and look at the consensus matrix. Report stability alongside the clustering.
Label the output as constructions, not discoveries. The personas exist because you built them. They do not exist in the data without you.
Defend every choice in writing. If you cannot defend a choice, change it or document why you made it anyway.