K-Means Clustering: Finding Structure in Unlabeled Data

Silhouette = 0.72 — 3 clusters converged in 5 iterations

K-means with k=3 on 120 points from 3 well-separated blobs converges in about 5 iterations. The silhouette score of 0.72 indicates well-defined clusters with good separation.

Formulas

Inertia: J = Σ_k Σ_{x∈C_k} ||x − μ_k||²
Centroid update: μ_k = (1/|C_k|) Σ_{x∈C_k} x
Silhouette: s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other members of its own cluster and b(i) is the mean distance to the members of the nearest other cluster
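
To make the notation concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available) that evaluates all three quantities on a toy two-cluster dataset; the points and labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy data: two tight clusters with a hand-made assignment
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])

# Centroid update: mu_k is the mean of the points assigned to cluster k
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# Inertia: J sums squared distances from each point to its own centroid
inertia = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(2))

# Mean silhouette s(i), averaged over all points
print(inertia, silhouette_score(X, labels))
```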

Unsupervised Learning: No Labels Needed

Clustering is the flagship unsupervised learning task: finding natural groupings in data without labeled examples. K-means, proposed by Stuart Lloyd at Bell Labs in 1957 (though his report was not published until 1982) and named by James MacQueen in 1967, remains one of the most widely used clustering algorithms. Its simplicity (just two alternating steps) belies its effectiveness: k-means scales to millions of points, converges quickly, and produces interpretable results. From customer segmentation to image compression, it is often the first algorithm tried.

The Algorithm: Assign and Update

K-means starts by placing k centroids (cluster centers), ideally with k-means++ initialization, which spreads them out. It then alternates two steps: assign each point to its nearest centroid (carving the space into Voronoi regions), then move each centroid to the mean of its assigned points. The simulation animates this process: watch the centroids converge from their initial positions to stable locations that minimize within-cluster variance. Convergence is typically fast, often within 20 iterations; a minimal sketch of the loop follows.
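
A minimal NumPy sketch of the assign/update loop with a k-means++-style seeding might look like this; the function names are illustrative rather than taken from any library.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Seed centroids far apart: each new centroid is sampled with
    probability proportional to its squared distance from the nearest
    centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None] - np.array(centroids)) ** 2).sum(-1), axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        # Assign: each point goes to its nearest centroid (Voronoi regions)
        labels = ((X[:, None] - centroids) ** 2).sum(-1).argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points;
        # keep an empty cluster's centroid where it is
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged: centroids stopped moving
            break
        centroids = new
    return centroids, labels
```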

Choosing K: The Eternal Question

K-means requires you to specify k in advance, but how do you know how many clusters exist? The elbow method runs k-means for k = 1, 2, 3, ... and plots inertia (the within-cluster sum of squares). The 'elbow' where the curve bends marks the point of diminishing returns and suggests a natural number of clusters. The silhouette method is more rigorous, measuring how well each point fits its own cluster versus the next-best cluster. The simulation lets you mismatch k against the true number of blobs to see what happens, as in the sweep sketched below.
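
With scikit-learn (assumed available), the sweep is a few lines; a sketch using a 3-blob setup similar to the simulation's (the silhouette score is undefined at k=1, so the loop starts at 2):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=120, centers=3, random_state=0)
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares; look for the elbow,
    # and for the k that maximizes the mean silhouette
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```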

Beyond K-Means

K-means assumes clusters are spherical and equally sized — assumptions that often fail on real data. DBSCAN discovers clusters of arbitrary shape by following density-connected regions. Gaussian Mixture Models allow soft (probabilistic) cluster assignments and handle elliptical clusters. Hierarchical clustering builds a tree of nested clusters without requiring k upfront. Despite these alternatives, k-means remains the go-to starting point because of its speed, simplicity, and surprising effectiveness.
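
As one example of soft assignment, scikit-learn's GaussianMixture exposes per-component membership probabilities directly; a brief sketch on blob data:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=120, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
# Each row sums to 1: the probability the point belongs to each component,
# unlike k-means' hard 0/1 assignments
print(gmm.predict_proba(X)[:3].round(3))
```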

FAQ

How does k-means clustering work?

K-means partitions data into k clusters by iterating two steps: (1) assign each point to the nearest centroid, and (2) recompute each centroid as the mean of its assigned points. This repeats until centroids stop moving. The algorithm minimizes within-cluster sum of squares (inertia). It is fast and simple but assumes spherical clusters and requires specifying k in advance.

How do you choose the right number of clusters?

The elbow method plots inertia versus k and looks for a 'bend' where adding clusters yields diminishing returns. The silhouette method measures how similar points are to their own cluster versus neighboring clusters; higher is better. The gap statistic compares the observed inertia to that expected under a null reference distribution, such as uniformly scattered data (a rough sketch follows). In practice, domain knowledge often guides the final choice.
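
A rough sketch of the gap statistic, under the simplifying assumptions of a uniform reference over the data's bounding box and no standard-error correction:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    """Gap(k) = E[log W_k(reference)] - log W_k(data); larger is better."""
    rng = np.random.default_rng(seed)
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_logs = []
    for _ in range(n_refs):
        # Reference: points scattered uniformly over the data's bounding box
        R = rng.uniform(lo, hi, size=X.shape)
        ref_logs.append(np.log(KMeans(n_clusters=k, n_init=10,
                                      random_state=seed).fit(R).inertia_))
    return np.mean(ref_logs) - log_wk
```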

What is the silhouette score?

The silhouette score for each point measures how well it fits its assigned cluster. It ranges from -1 to 1: values near 1 mean the point is well-matched to its cluster and far from others; near 0 means it is on the boundary; near -1 means it is probably in the wrong cluster. The average silhouette score summarizes overall clustering quality.
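
scikit-learn's silhouette_samples returns the per-point values, which is handy for flagging boundary points; a short sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=120, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
s = silhouette_samples(X, labels)
# The lowest-scoring points sit on cluster boundaries (or in the wrong cluster)
print("worst-fitting points:", s.argsort()[:5], "min s(i):", s.min().round(3))
```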

When does k-means fail?

K-means struggles with: non-spherical clusters (elongated, ring-shaped), clusters of very different sizes, clusters with different densities, and outliers. For these cases, algorithms like DBSCAN (density-based), Gaussian Mixture Models (soft assignment), or spectral clustering are more appropriate. K-means also depends on initialization — k-means++ helps avoid poor starting positions.
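
A classic demonstration uses two interleaved half-moons: k-means cuts straight across them, while DBSCAN follows the density. A sketch (the eps and min_samples values are illustrative, not tuned):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print("k-means clusters:", set(km_labels))  # splits each moon across clusters
print("DBSCAN clusters:", set(db_labels))   # recovers the two curved shapes
```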

