Unsupervised Learning: No Labels Needed
Clustering is the flagship unsupervised learning task: finding natural groupings in data without labeled examples. K-means is among the most widely used clustering algorithms. Its standard form was devised by Stuart Lloyd at Bell Labs in 1957 (though not published until 1982), and the name 'k-means' comes from James MacQueen's 1967 paper. Its simplicity (just two alternating steps) belies its effectiveness: k-means scales to millions of points, usually converges quickly, and produces interpretable results. From customer segmentation to image compression, k-means is often the first algorithm tried.
The Algorithm: Assign and Update
K-means starts by placing k centroids (cluster centers), ideally using k-means++ initialization, which spreads them out by favoring points far from the centroids already chosen. Then it alternates two steps: assign each point to its nearest centroid (partitioning the space into Voronoi regions), then move each centroid to the mean of its assigned points. Neither step can increase the within-cluster sum of squared distances, so the algorithm always converges, though only to a local minimum that depends on the starting positions. The simulation animates this process: watch the centroids move from their initial positions to stable locations. On well-separated data convergence is typically fast, often under 20 iterations.
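The assign-and-update loop can be sketched in a few lines of NumPy. This is a minimal illustration, not the simulation's own code: the function names (sq_dists, kmeans_pp_init, kmeans), the seed, and the toy three-blob data are all assumptions chosen for the example.

```python
import numpy as np


def sq_dists(X, C):
    """Squared Euclidean distance from every row of X to every row of C."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)


def kmeans_pp_init(X, k, rng):
    """Pick k starting centroids with k-means++ style seeding."""
    # First centroid: a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from each point to its closest chosen centroid.
        d2 = sq_dists(X, np.array(centroids)).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are favored and the seeds spread out.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)


def kmeans(X, k, max_iter=100, seed=0):
    """Alternate assignment and update until the labels stop changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(X, k, rng)
    labels = None
    history = []  # inertia recorded after each assignment step
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        d2 = sq_dists(X, centroids)
        new_labels = d2.argmin(axis=1)
        history.append(d2[np.arange(len(X)), new_labels].sum())
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point changed cluster, so the centroids are stable
        labels = new_labels
        # Update step: each centroid moves to the mean of its points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels, history


# Toy data (hypothetical): three tight blobs in 2-D.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in blob_centers])

centroids, labels, history = kmeans(X, k=3)
print(len(history), "assignment steps; final inertia:", round(history[-1], 2))
```

The recorded inertia values never increase from one pass to the next, which is the reason the loop is guaranteed to stop. Different seeds can still end in different local minima, so practical implementations usually run several restarts and keep the best result.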
Choosing K: The Eternal Question
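K-means requires you to specify k in advance, but how do you know how many clusters exist? The elbow method runs k-means for k=1,2,3,... and plots inertia (the total squared distance from each point to its assigned centroid). Because inertia tends to keep falling as k grows, the lowest value is not the goal; the 'elbow', where the curve bends and extra clusters bring little further improvement, suggests a natural number of clusters. The silhouette method is less subjective: for each point it compares the mean distance to its own cluster with the mean distance to the next-best cluster, giving a score between -1 and 1, and you choose the k with the highest average score. The simulation lets you mismatch k against the true number of blobs to see what happens.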
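Both diagnostics can be computed by hand on toy data. The sketch below is illustrative only: it uses a compact k-means with deterministic farthest-point seeding (an assumption made to keep the demo reproducible; it is not k-means++), and the helper names and three-blob data are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(A, B):
    """Squared Euclidean distance between every row of A and every row of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)


def kmeans(X, k, max_iter=100):
    """Compact k-means; returns the labels and the final inertia."""
    centroids = X[[0]].copy()
    while len(centroids) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        far = pairwise_sq_dists(X, centroids).min(axis=1).argmax()
        centroids = np.vstack([centroids, X[far]])
    labels = None
    for _ in range(max_iter):
        new_labels = pairwise_sq_dists(X, centroids).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    # Inertia: total squared distance from each point to its own centroid.
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, inertia


def silhouette_score(X, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    dists = np.sqrt(pairwise_sq_dists(X, X))
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a point alone in its cluster scores 0
        # a: mean distance to the other members of the point's own cluster.
        a = dists[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Toy data (hypothetical): three tight blobs, so the "true" k is 3.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in blob_centers])

inertias, silhouettes = {}, {}
for k in range(1, 7):
    labels, inertias[k] = kmeans(X, k)
    if k >= 2:  # the silhouette is undefined for a single cluster
        silhouettes[k] = silhouette_score(X, labels)

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes.get(k, float("nan")), 3))
print("k with the highest mean silhouette:", best_k)
```

On data like this the inertia drops sharply up to k=3 and only slightly afterwards, which is the elbow. Real data rarely gives such a clean bend, which is why the silhouette is a useful second opinion. For real work, scikit-learn provides the same quantities as KMeans.inertia_ and sklearn.metrics.silhouette_score.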
Beyond K-Means
K-means implicitly assumes clusters are roughly spherical and similar in size, assumptions that often fail on real data. DBSCAN discovers clusters of arbitrary shape by following density-connected regions. Gaussian Mixture Models allow soft (probabilistic) cluster assignments and handle elliptical clusters. Hierarchical clustering builds a tree of nested clusters without requiring k upfront. Despite these alternatives, k-means remains the go-to starting point because of its speed, simplicity, and surprising effectiveness.
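To make the contrast between hard and soft assignment concrete, the sketch below computes Gaussian-mixture responsibilities (the posterior probability of each component for a point) and compares them with the nearest-centroid rule. The two components are hand-picked for illustration rather than fitted to data, and every name and number here is an assumption.

```python
import numpy as np


def gaussian_pdf(X, mean, cov):
    """Density of a multivariate normal evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    # Squared Mahalanobis distance: diff^T cov^{-1} diff for every row.
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm


def responsibilities(X, weights, means, covs):
    """Posterior probability of each component for each point."""
    weighted = np.column_stack([
        w * gaussian_pdf(X, m, c) for w, m, c in zip(weights, means, covs)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)


# Hypothetical mixture: component 0 is stretched along the x-axis,
# component 1 is small and round.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [4.0, 2.0]])
covs = np.array([[[9.0, 0.0], [0.0, 0.25]],
                 [[0.25, 0.0], [0.0, 0.25]]])

points = np.array([[4.0, 0.5],   # far along the stretched component's long axis
                   [4.0, 2.0]])  # exactly at the round component's mean

resp = responsibilities(points, weights, means, covs)
# Hard k-means style assignment: index of the nearest mean.
nearest = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

print("nearest mean:     ", nearest)
print("responsibilities:\n", resp.round(3))
```

The first point is closer to the round component's mean, so the nearest-centroid rule gives it to component 1, yet the mixture assigns it to the stretched component with a responsibility of about 0.79, because that component's covariance makes points along its long axis plausible. This is the sense in which a mixture model handles elliptical clusters and reports uncertainty instead of a single label.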