library(tidyverse)
library(patchwork)
library(ggforce)
library(fastcluster)

set.seed(1)

Updated 2021-06-04: After @ctwardy tried replicating my results in Python, I discovered that the LICORS implementation of k-means++ in R appears to be faulty. I have switched to the flexclust implementation, which gives superior results. I’ve updated my recommendation: use either Ward clustering or k-means++ to choose initial cluster centers.

Updated 2021-06-19: Added Appendix 2 sketching an argument that k-means++ produces the optimal asymptotic cluster density.

Introduction

For a collection of \(n\) points in \(d\)-dimensional space, k-means clustering seeks to assign the points to \(k\) clusters and find cluster centers so as to minimize the sum of squared distances from each point to its cluster center.
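
In symbols: given points \(x_1, \ldots, x_n\), we seek cluster centers \(c_1, \ldots, c_k\) and an assignment \(a(i)\) of each point to a cluster, so as to minimize

\[ \sum_{i=1}^{n} \left\| x_i - c_{a(i)} \right\|^2 \]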

In other words, the cluster assignments and centers approximate the original set of points, and we seek to do this as well as possible in terms of sum-of-squares error. This is also sometimes called vector quantization.

The original k-means algorithm, “Lloyd’s algorithm”, iterates:

  1. Assign points to their nearest cluster center.
  2. Update cluster centers to the mean of the points in their cluster.
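
To make the iteration concrete, here is a minimal sketch of Lloyd’s algorithm (the lloyd function is illustrative only, not from any package; in practice use R’s kmeans):

# A bare-bones Lloyd iteration. Assumes no cluster ever becomes empty.
lloyd <- function(mat, centers, iters=100) {
    for(step in seq_len(iters)) {
        # 1. Assign each point to its nearest cluster center
        d2 <- sapply(seq_len(nrow(centers)), function(i)
            colSums((t(mat) - centers[i,])^2))
        cluster <- max.col(-d2)
        # 2. Update each center to the mean of its assigned points
        centers <- rowsum(mat, cluster) / as.vector(table(cluster))
    }
    list(centers=centers, cluster=cluster)
}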

Some slight improvement on this is possible. For example, centers can be updated alongside updates to cluster assignments, rather than making these two separate phases. R’s kmeans function uses the “Hartigan-Wong” algorithm, which also has slightly improved cluster re-assignment.

These iterative methods all require some initial choice of cluster centers. A common default is to use points chosen at random from the dataset. This initial choice will follow the density distribution of points in the dataset. However, for large \(k\) and larger \(n\), the density of cluster centers should be proportional to the density of the points raised to the power \(\frac{d}{d+2}\); in two dimensions, for example, this power is \(\frac{2}{2+2}=\frac{1}{2}\), the square root. An argument for this is given in Appendix 1. In other words, the distribution of cluster centers found by k-means should be more spread out than the distribution of points. Commonly used iterative schemes stay stuck close to the initial choice of centers, and do not in general achieve this.

There are several different approaches to k-means that avoid getting “stuck”:

  • Choose better initial clusters using the k-means++ method. Randomly chosen initial cluster centers follow the distribution of the dataset, which I am claiming is wrong. k-means++ chooses clusters randomly but with a different scheme that avoids putting initial cluster centers too close together (see the sketch after this list). Appendix 2 sketches an argument that the density of these initial clusters matches the density of the optimal clusters.

  • In one dimension, a dynamic programming solution is possible. Ckmeans.1d.dp is a wonderfully fast implementation.

  • Ward agglomerative clustering takes an entirely different approach to clustering, while still seeking to minimize the sum-of-squares error. Older implementations have \(O(n^3)\) time complexity, and also keep a full distance matrix in memory, but fastcluster::hclust.vector provides a memory-efficient implementation with \(O(n^2)\) time complexity.
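
To make the k-means++ seeding scheme concrete, here is a minimal sketch (the kmeanspp_centers function is illustrative only, not from any package; the flexclust implementation is what I actually use below):

# Choose k rows of mat as initial centers, k-means++ style.
kmeanspp_centers <- function(mat, k) {
    n <- nrow(mat)
    chosen <- integer(k)
    # First center: a data point chosen uniformly at random
    chosen[1] <- sample.int(n, 1)
    # Squared distance from each point to its nearest chosen center
    d2 <- colSums((t(mat) - mat[chosen[1],])^2)
    for(i in seq(2, k)) {
        # Each further center: a data point chosen with probability
        # proportional to d2, avoiding centers too close together
        chosen[i] <- sample.int(n, 1, prob=d2)
        d2 <- pmin(d2, colSums((t(mat) - mat[chosen[i],])^2))
    }
    mat[chosen,,drop=FALSE]
}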

In this note, I will demonstrate that Ward clustering can perform much better than the usual iterative schemes. Further improvement is achieved by polishing the Ward cluster centers using an iterative method. k-means++ is also a decent option, performing nearly as well.

The proposed Ward+polish method can be performed in R using:

# For matrix of points mat and specified k

library(tidyverse)
library(fastcluster)

hc <- hclust.vector(mat, method="ward")
cluster <- cutree(hc, k=k)
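
# Each cluster center is the mean of the points assigned to it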
centers <- 
    seq_len(nrow(mat)) %>% 
    split(cluster) %>% 
    map(~ colMeans(mat[.,,drop=FALSE])) %>% 
    do.call(rbind,.)

# This can be followed by a refinement step
result <- kmeans(mat, centers, iter.max=100)

centers <- result$centers
cluster <- result$cluster

For k-means++ use:

library(flexclust)

result <- kcca(mat, k=k, family=kccaFamily("kmeans"), control=list(initcent="kmeanspp"))

centers <- result@centers
cluster <- result@cluster
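
Note that kcca returns an S4 object, so the centers and cluster assignments are extracted with @ rather than $.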

Demonstration

Balancing squares

The first dataset will consist of two squares of uniform density, the first containing 10,000 points and the second containing 1,000 points. We will find k=200 means.

The points have a ratio of 10:1 and here \(d=2\), so the optimal cluster density is proportional to the square root of the point density, and the number of clusters should be split \(\sqrt{10}:1\). For 200 clusters, this should be k1=152, k2=48.

c(10000,1000) %>% sqrt %>% {./sum(.)*200}
## [1] 151.94939  48.05061
library(tidyverse)
library(patchwork)
library(ggforce)
library(fastcluster)

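# Sum of squared distances from each point to its assigned cluster center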
ss <- function(mat, fit) {
    sum( (mat - fit$centers[fit$cluster,])^2 )
}

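# A trivial "fit" placing a center on every point, giving zero error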
null_fit <- function(mat) list(centers=mat,cluster=seq_len(nrow(mat)))

plotit <- function(mat, fit, do_lr=TRUE, do_ss=TRUE, title="", scale=0.003, kchar="k") {
    df <- tibble(x=fit$centers[,1], y=fit$centers[,2], size=as.integer(table(fit$cluster)))

    if (do_lr) {
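        # Count cluster centers in the left (x <= 0) and right squares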
        k1 <- sum(fit$centers[,1]<=0)
        k2 <- nrow(fit$centers)-k1
        title <- sprintf("%s %s1=%d %s2=%d",title,kchar,k1,kchar,k2)
    }

    if (do_ss) {
        title <- sprintf("%s SS=%.1f",title,ss(mat,fit))
    }

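    # Draw each center as a circle with area proportional to cluster size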
    ggplot(df) + 
        aes(x0=x,y0=y,r=sqrt(size)*scale) + 
        geom_circle(fill="white", color="black",n=16) + 
        coord_fixed() +
        labs(title=title,x="",y="")
}

do_fits <- function(mat, k) {
    result <- list()

    # k-means the usual way

    result$kmeans <- kmeans(mat,k,iter.max=100)

    # k-means++

    # Do not use LICORS::kmeanspp. Appears to be incorrect implementation.
    #result$kmeanspp <- LICORS::kmeanspp(mat,k)

    kmpp <- flexclust::kcca(mat,k=k,family=kccaFamily("kmeans"), control=list(initcent="kmeanspp", verbose=0))
    result$kmeanspp <- list(centers=kmpp@centers, cluster=kmpp@cluster)

    # Ward + polish

    hc <- hclust.vector(mat, method="ward")
    cluster <- cutree(hc, k=k)
    centers <- 
        seq_len(nrow(mat)) %>% 
        split(cluster) %>% 
        map(~ colMeans(mat[.,,drop=FALSE])) %>% 
        do.call(rbind,.)
    result$ward <- list(cluster=cluster,centers=centers)
    result$ward_polish <- kmeans(mat, centers, iter.max=100)

    result
}

do_plots <- function(mat, fits, scale=0.003, do_lr=F) {
    mat_plot <- 
        ggplot() + aes(x=mat[,1],y=mat[,2]) +
        geom_point(size=0.1) +
        coord_fixed() + 
        labs(x="",y="",title="data")
    
    ( mat_plot | plot_spacer() ) /
    ( plotit(mat, fits$kmeans, do_lr,T,"k-means",scale) | plotit(mat, fits$kmeanspp, do_lr,T,"k-means++",scale) ) /
    ( plotit(mat, fits$ward, do_lr,T,"Ward",scale) | plotit(mat, fits$ward_polish, do_lr,T,"Ward + polish",scale) )
}

k <- 200
n1 <- 10000
n2 <- 1000
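# Left square on [-1,0] x [0,1] with 10,000 points,
# right square on [0,1] x [0,1] with 1,000 points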
mat <- rbind(
    cbind(runif(n1)-1,runif(n1)),
    cbind(runif(n2),  runif(n2)))

fits <- do_fits(mat, k)
do_plots(mat, fits, do_lr=T)