library(tidyverse)
library(patchwork)
library(ggforce)
library(fastcluster)

set.seed(1)

Updated 2021-06-04: After @ctwardy tried replicating my results in Python, I discovered that the LICORS implementation of k-means++ in R appears to be faulty. I have switched to the flexclust implementation, which gives superior results. I’ve updated my recommendation: use either Ward clustering or k-means++ to choose initial cluster centers.

Updated 2021-06-19: Added Appendix 2 sketching an argument that k-means++ produces the optimal asymptotic cluster density.

Introduction

For a collection of \(n\) points in \(d\)-dimensional space, k-means clustering seeks to assign the points to \(k\) clusters and find cluster centers so as to minimize the sum of squared distances from each point to its cluster center.
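
Writing \(x_1, \dots, x_n\) for the points, \(a(i) \in \{1, \dots, k\}\) for the cluster assigned to point \(i\), and \(\mu_1, \dots, \mu_k\) for the cluster centers (notation introduced here only to state the objective), the quantity being minimized is

\[ SS = \sum_{i=1}^{n} \left\| x_i - \mu_{a(i)} \right\|^2 \]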

In other words, the cluster assignments and centers approximate the original set of points and we are seeking to do this as well as possible in terms of sum-of-squares error. This is also sometimes called vector quantization.

The original k-means algorithm, “Lloyd’s algorithm”, iterates:

  1. Assign points to their nearest cluster center.
  2. Update cluster centers to the mean of the points in their cluster.
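
To make the loop concrete, here is a bare-bones sketch of it in base R (for illustration only; it assumes no cluster ever becomes empty, and in practice you would simply call kmeans):

# Illustrative sketch of Lloyd's algorithm -- not the implementation used below
lloyd_sketch <- function(mat, centers, iters=100) {
    for (step in seq_len(iters)) {
        # 1. Assign points to their nearest cluster center
        d2 <- outer(rowSums(mat^2), rowSums(centers^2), "+") - 2*tcrossprod(mat, centers)
        cluster <- max.col(-d2, ties.method="first")
        # 2. Update cluster centers to the mean of the points in their cluster
        #    (assumes every cluster retains at least one point)
        centers <- do.call(rbind, lapply(
            split(seq_len(nrow(mat)), cluster),
            function(rows) colMeans(mat[rows,,drop=FALSE])))
    }
    list(centers=centers, cluster=cluster)
}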

Some slight improvements to this are possible. For example, centers can be updated alongside updates to cluster assignments, rather than treating these as two separate phases. R’s kmeans function uses the “Hartigan-Wong” algorithm, which also has a slightly improved cluster re-assignment step.

These iterative methods all require some initial choice of cluster centers. A common default is to use points chosen at random from the dataset. This initial choice will follow the density distribution of points in the dataset. However, for large \(k\) and larger \(n\), the density of cluster centers should be proportional to the density of the points to the power \(\frac{d}{d+2}\). An argument for this is given in Appendix 1. In other words the distribution of clusters found by k-means should be more spread out than the distribution of points. Commonly used iterative schemes stay stuck close to the initial choice of centers, and do not in general achieve this.
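
As a concrete illustration of that power rule (allocate_k is a hypothetical helper introduced only for this example, following the argument in Appendix 1):

# Split k centers between regions containing n points each,
# in proportion to n^(d/(d+2))
allocate_k <- function(n, k, d) n^(d/(d+2)) / sum(n^(d/(d+2))) * k
allocate_k(c(10000, 1000), k=200, d=2)  # approximately 152 and 48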

There are several different approaches to k-means that avoid getting “stuck”:

  • Choose better initial clusters using the k-means++ method. Randomly chosen initial cluster centers follow the distribution of the dataset, which I am claiming is wrong. k-means++ chooses clusters randomly but with a different scheme that avoids putting initial cluster centers too close together. Appendix 2 sketches an argument that the density of these initial clusters matches the density of the optimal clusters.

  • In one dimension, a dynamic programming solution is possible. Ckmeans.1d.dp is a wonderfully fast implementation.

  • Ward agglomerative clustering takes an entirely different approach to clustering, while still seeking to minimize the sum-of-squares error. Older implementations have \(O(n^3)\) time complexity and also keep a full distance matrix in memory, but fastcluster::hclust.vector provides a memory-efficient implementation with \(O(n^2)\) time complexity.

In this note, I will demonstrate that Ward clustering can perform much better than the usual iterative schemes. Further improvement is achieved by polishing the Ward cluster centers using an iterative method. k-means++ is also a decent option, performing nearly as well.

The proposed Ward+polish method can be performed in R using:

# For matrix of points mat and specified k

library(tidyverse)
library(fastcluster)

hc <- hclust.vector(mat, method="ward")
cluster <- cutree(hc, k=k)
centers <- 
    seq_len(nrow(mat)) %>% 
    split(cluster) %>% 
    map(~ colMeans(mat[.,,drop=FALSE])) %>% 
    do.call(rbind,.)

# This can be followed by a refinement step
result <- kmeans(mat, centers, iter.max=100)

centers <- result$centers
cluster <- result$cluster

For k-means++ use:

library(flexclust)

result <- kcca(mat, k=k, family=kccaFamily("kmeans"), control=list(initcent="kmeanspp"))

centers <- result@centers
cluster <- result@cluster

Demonstration

Balancing squares

The first dataset will consist of two squares of uniform density, the first containing 10,000 points and the second containing 1,000 points. We will find k=200 means.

The point counts have a ratio of 10:1, so the number of clusters should be split \(\sqrt{10}:1\). For 200 clusters, this should be k1=152, k2=48.

c(10000,1000) %>% sqrt %>% {./sum(.)*200}
## [1] 151.94939  48.05061
library(tidyverse)
library(patchwork)
library(ggforce)
library(fastcluster)

# Total sum of squared distances from each point to its assigned cluster center
ss <- function(mat, fit) {
    sum( (mat - fit$centers[fit$cluster,])^2 )
}

# Degenerate "fit" in which every point is its own cluster center
null_fit <- function(mat) list(centers=mat,cluster=seq_len(nrow(mat)))

# Plot cluster centers as circles with area proportional to the number of points in
# each cluster, optionally annotating the title with the left/right split and the SS
plotit <- function(mat, fit, do_lr=TRUE, do_ss=TRUE, title="", scale=0.003, kchar="k") {
    df <- tibble(x=fit$centers[,1], y=fit$centers[,2], size=as.integer(table(fit$cluster)))

    if (do_lr) {
        k1 <- sum(fit$centers[,1]<=0)
        k2 <- nrow(fit$centers)-k1
        title <- sprintf("%s %s1=%d %s2=%d",title,kchar,k1,kchar,k2)
    }

    if (do_ss) {
        title <- sprintf("%s SS=%.1f",title,ss(mat,fit))
    }

    ggplot(df) + 
        aes(x0=x,y0=y,r=sqrt(size)*scale) + 
        geom_circle(fill="white", color="black",n=16) + 
        coord_fixed() +
        labs(title=title,x="",y="")
}

# Fit all four methods: k-means with random initial centers, k-means++ (flexclust),
# Ward clustering, and Ward clustering followed by a k-means polishing step
do_fits <- function(mat, k) {
    result <- list()

    # k-means the usual way

    result$kmeans <- kmeans(mat,k,iter.max=100)

    # k-means++

    # Do not use LICORS::kmeanspp. Appears to be incorrect implementation.
    #result$kmeanspp <- LICORS::kmeanspp(mat,k)

    kmpp <- flexclust::kcca(mat, k=k, family=flexclust::kccaFamily("kmeans"), control=list(initcent="kmeanspp", verbose=0))
    result$kmeanspp <- list(centers=kmpp@centers, cluster=kmpp@cluster)

    # Ward + polish

    hc <- hclust.vector(mat, method="ward")
    cluster <- cutree(hc, k=k)
    centers <- 
        seq_len(nrow(mat)) %>% 
        split(cluster) %>% 
        map(~ colMeans(mat[.,,drop=FALSE])) %>% 
        do.call(rbind,.)
    result$ward <- list(cluster=cluster,centers=centers)
    result$ward_polish <- kmeans(mat, centers, iter.max=100)

    result
}

# Plot the data alongside each of the four fits in a patchwork grid
do_plots <- function(mat, fits, scale=0.003, do_lr=FALSE) {
    mat_plot <- 
        ggplot() + aes(x=mat[,1],y=mat[,2]) +
        geom_point(size=0.1) +
        coord_fixed() + 
        labs(x="",y="",title="data")
    
    ( mat_plot | plot_spacer() ) /
    ( plotit(mat, fits$kmeans, do_lr,T,"k-means",scale) | plotit(mat, fits$kmeanspp, do_lr,T,"k-means++",scale) ) /
    ( plotit(mat, fits$ward, do_lr,T,"Ward",scale) | plotit(mat, fits$ward_polish, do_lr,T,"Ward + polish",scale) )
}

k <- 200
n1 <- 10000
n2 <- 1000
mat <- rbind(
    cbind(runif(n1)-1,runif(n1)),
    cbind(runif(n2),  runif(n2)))

fits <- do_fits(mat, k)
do_plots(mat, fits, do_lr=T)

# Note we could also do this in 3D.
# Omitted as a digression.
if (FALSE) {
    # Expected optimal split
    c(10000,1000) %>% {.^(3/5)} %>% {./sum(.)*200}

    mat3 <- rbind(
        cbind(runif(n1)-1,runif(n1),runif(n1)),
        cbind(runif(n2),  runif(n2),runif(n2)))
    fits3 <- do_fits(mat3, k)
    do_plots(mat3, fits3, do_lr=T)
}

You can see the default k-means, with random initial centers, has gotten the balance between the left and right very wrong.

Ward clustering and Ward clustering with a polishing step get the balance between left and right (k1 and k2) about as expected, and therefore have smaller sum-of-squares error (SS).

I have also included “k-means++” clustering as implemented in flexclust, which uses a better initialization of cluster centers. It has also performed well.

Balancing lines

The second dataset will consist of two lines of uniform density, the first containing 1,000 points and the second containing 100 points. We will find k=20 means.

Since this is one-dimensional data, even though it is embedded in two dimensions, the expected split between the two lines is \(\sqrt[3]{10}:1\):

c(1000,100) %>% {.^(1/3)} %>% {./sum(.)*20}
## [1] 13.65972  6.34028
line_k <- 20 
line_n1 <- 1000
line_n2 <- 100
line_mat <- rbind(
    cbind(-0.5, runif(line_n1)),
    cbind(0.5, runif(line_n2)))

line_fits <- do_fits(line_mat, line_k)
do_plots(line_mat,line_fits, do_lr=T)

Ward (and k-means++) clustering “knows” this is one-dimensional data and has allocated clusters accordingly. Isn’t it wonderful?

More datasets

quakes

This is the quakes dataset containing the locations of 1,000 earthquakes near Fiji. I used k=100. Notice how Ward clustering produces more small clusters around the edges.

quake_mat <- cbind(quakes$long, quakes$lat)
quake_k <- 100

quake_fits <- do_fits(quake_mat, quake_k)
do_plots(quake_mat,quake_fits, scale=0.1)

diamonds

This is the ggplot2::diamonds dataset, using the carat and price columns, which have been scaled. I used k=50. k-means++ and Ward clustering have done dramatically better with this highly skewed data in terms of sum-of-squares error.

diamonds_mat <- cbind(diamonds$carat,diamonds$price) %>% scale()
diamonds_k <- 50

diamonds_fits <- do_fits(diamonds_mat, diamonds_k)
do_plots(diamonds_mat, diamonds_fits, do_lr=FALSE, scale=0.003)

Shapley_galaxy

These are 4,215 galaxies in the Shapley supercluster. I used k=100. Similar to the two squares example, in the k-means++ and Ward clusterings dense regions have fewer clusters and sparse regions have more clusters, leading to a smaller sum-of-squares error.

library(astrodatR)
data(Shapley_galaxy)
galaxy_mat <- cbind(Shapley_galaxy$R.A., Shapley_galaxy$Dec.)
galaxy_k <- 100

galaxy_fits <- do_fits(galaxy_mat, galaxy_k)
do_plots(galaxy_mat, galaxy_fits, do_lr=FALSE, scale=0.03)

Conclusion

I hope I have demonstrated that the usual k-means algorithms, with randomly chosen initial centers, are flawed. It’s a little shocking, given the simplicity of the objective function and how long the idea has been around. My proposed Ward+polish alternative is applicable up to moderately large datasets.

We have also had k-means++ available since 2007, but it has maybe not been widely appreciated how much of an improvement it can make. While k-means++ did a little worse than Ward+polish for the datasets shown here, it is still far better than choosing initial cluster centers at random. k-means++ initialization also has the advantage of being computationally efficient, requiring \(O(kn)\) distance calculations as compared to \(O(n^2)\) for Ward clustering. k-means++ is used by default in Python’s sklearn.cluster.KMeans, which is likely what most Python users would try first. R has fallen behind by not supporting k-means++ with its default kmeans function.

I believe further progress is possible both in terms of speed and closeness to the global optimum. Since the objective is well defined, a coding competition around some carefully chosen datasets might encourage a search for better algorithms.

My own research is currently to do with gene expression in tens of thousands of individual cells (scRNA-Seq data). I am interested in 0, 1, and 2-dimensional manifolds embedded in very high dimensional spaces. We saw how the clustering adapted to lower dimensional data (1D) embedded in a higher dimensional (2D) space. There’s no reason to think this doesn’t extend to higher dimensional spaces. A good k-means algorithm should provide clusters with good coverage of such data. Also interesting to think about is that the dimension of the manifold is somewhat dependent on the resolution (k) at which we are viewing it.

I believe k-means is an essential tool for summarizing data. It is not simply “clustering”, it is an approximation that provides good coverage of a whole dataset by neither overly concentrating on the most dense regions nor focussing too much on extremes. Maybe this is something our society needs more of. Anyway, we should get it right.




Appendix 1: Optimal point-cluster density power relationship

M. Anthony Wong (1982, 1984) has shown in one dimension that the asymptotic density of clusters is proportional to the density of points to the power \(\frac{1}{3}\). I will now sketch out an argument extending this idea to higher dimensions, showing that the power is in general \(\frac{d}{d+2}\). Compared to this, the traditional uniformly random choice of initial clusters is not just randomly sub-optimal but actually gives the wrong distribution of initial cluster centers, and the usual k-means iteration is unable to escape this.

Consider two \(d\)-dimensional hypercubes of equal size, containing \(n_1\) and \(n_2\) points respectively.

Out of \(k\), how shall we allocate the means to the two hypercubes, \(k_1\), \(k_2=k-k_1\)?

Within a hypercube, the distance to the nearest mean will typically be proportional to \(k^\frac{-1}{d}\) and the squared distance to \(k^\frac{-2}{d}\) (I acknowledge that this assertion needs a formal justification, which is why I only call this a sketch of an argument). So within a hypercube the sum of squared distances will behave approximately like

\[ SS = nc k^\frac{-2}{d} \]

where \(c\) is some constant.
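
As a rough empirical sanity check of this scaling (not part of the argument): for uniform points in the unit square we have \(d=2\), so \(k \cdot SS\) should come out roughly constant across a range of \(k\).

# Rough check that SS scales like k^(-2/d) = 1/k for uniform 2D data:
# the products k * SS should be roughly constant
pts <- cbind(runif(5000), runif(5000))
sapply(c(25, 50, 100, 200), function(k)
    k * kmeans(pts, k, iter.max=100, nstart=5)$tot.withinss)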

Within two hypercubes we would have

\[ SS = n_1 c k_1^\frac{-2}{d} + n_2 c (k-k_1)^\frac{-2}{d} \]

Assuming \(k\) is large enough that we can treat it as effectively continuous, find the minimum by differentiation:

\[ \frac{dSS}{dk_1} = n_1 c \frac{-2}{d} k_1 ^{\frac{-2}{d}-1} - n_2 c \frac{-2}{d} (k-k_1)^{\frac{-2}{d}-1} \]

Set \(\frac{dSS}{dk_1} = 0\)

\[ \begin{align*} \Rightarrow & & n_1 k_1^{\frac{-2}{d}-1} &= n_2 k_2^{\frac{-2}{d}-1} \\ \Rightarrow & & \left( \frac{k_1}{k_2} \right)^{\frac{-2}{d}-1} &= \frac{n_2}{n_1} \\ \Rightarrow & & \left( \frac{k_2}{k_1} \right)^{\frac{2}{d}+1} &= \frac{n_2}{n_1} \\ \Rightarrow & & \frac{k_2}{k_1} &= \left( \frac{n_2}{n_1} \right)^\frac{1}{\frac{2}{d}+1} \\ \Rightarrow & & \frac{k_2}{k_1} &= \left( \frac{n_2}{n_1} \right)^\frac{d}{2+d} \\ \end{align*} \]

This shows how \(k\) means will be allocated between two regions of differing density. With more than two regions, each pair of regions will be balanced in this way.

A corollary is that the sizes of clusters (the number of points assigned to each cluster) will be proportional to the point density to the power \(\frac{2}{2+d}\).
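
As a quick arithmetic check against the two-squares example (\(d=2\), so the power is \(\frac{2}{2+2}=\frac{1}{2}\)):

# Expected points per cluster in each square under the optimal split of k=200 centers
n <- c(10000, 1000)
k_split <- n^(1/2) / sum(n^(1/2)) * 200
n / k_split  # approximately 65.8 and 20.8, a ratio of sqrt(10)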

Appendix 2: Density of k-means++ initialization

k-means++ successively chooses initial cluster centers randomly from the set of points to be clustered: the first center is chosen uniformly at random, and each subsequent center is a point chosen with probability proportional to its squared distance to the nearest already chosen cluster center. The k-means++ paper places an \(O(\log k)\) bound on the expected sum of squares error compared to optimal. It’s an impressive result showing that k-means++ fares well even in pathological cases. However I think the practical importance of k-means++ can be better appreciated in terms of the asymptotic density of these initial cluster centers. In this section I sketch out an argument that the asymptotic density from k-means++ matches the optimal asymptotic density.
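
For concreteness, here is a bare-bones sketch of this initialization in base R (kmeanspp_init is an illustrative name only; the flexclust implementation is what I actually use above):

# Illustrative sketch of k-means++ ("D^2") initialization
kmeanspp_init <- function(mat, k) {
    n <- nrow(mat)
    centers <- matrix(NA_real_, nrow=k, ncol=ncol(mat))
    # First center: a data point chosen uniformly at random
    centers[1,] <- mat[sample.int(n, 1),]
    # Squared distance from each point to its nearest chosen center so far
    d2 <- rowSums((mat - rep(centers[1,], each=n))^2)
    for (i in seq_len(k-1) + 1) {
        # Next center: a data point chosen with probability proportional to d2
        centers[i,] <- mat[sample.int(n, 1, prob=d2),]
        d2 <- pmin(d2, rowSums((mat - rep(centers[i,], each=n))^2))
    }
    centers
}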

Think about k-means++ in terms of asymptotic densities: we are choosing “many” cluster centers, and we have “many many” data points. Let \(f(x)\) be the density of points, and \(g(x)\) be the density of initial cluster centers generated by k-means++.

I again make an assertion without full justification: once a large number of initial cluster centers have been chosen, the expected squared distance from a point to its nearest cluster center will be proportional to \(g(x)^{\frac{-2}{d}}\), where \(d\) is the number of dimensions of the space. Referring back to the logic of the previous section, think of this as a point process with the expected number of centers \(k_x\) within a “small” hypercube of fixed size centered on \(x\) being proportional to \(g(x)\). Once many initial clusters have been chosen, the probability density from which a further cluster center is chosen will be proportional to

\[ \begin{align*} & & g(x) &\propto g(x)^{\frac{-2}{d}} f(x) \\ \Rightarrow & & g(x)^{1+\frac{2}{d}} &\propto f(x) \\ \Rightarrow & & g(x)^{\frac{d+2}{d}} &\propto f(x) \\ \Rightarrow & & g(x) &\propto f(x)^{\frac{d}{d+2}} \\ \end{align*} \]

Thus the asymptotic density of initial k-means++ clusters matches the density of the optimal clusters that we saw in the previous section. This does not mean it chooses optimal clusters – the initial arrangement of clusters will be much less uniformly spaced than the optimal clusters – but at least the density is correct.