library(tidyverse)
library(patchwork)
library(ggforce)
library(fastcluster)

set.seed(1)

Introduction

For a collection of \(n\) points in \(d\)-dimensional space, k-means clustering seeks to assign the points to \(k\) clusters and find cluster centers so as to minimize the sum of squared distances from each point to its cluster center.
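
In symbols, if \(c(i)\) is the cluster assigned to point \(x_i\) and \(\mu_1, \ldots, \mu_k\) are the cluster centers, the quantity being minimized is

\[ \sum_{i=1}^{n} \left\| x_i - \mu_{c(i)} \right\|^2 . \]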

In other words, the cluster assignments and centers form an approximation to the original set of points, and we are seeking to make this approximation as good as possible in terms of sum-of-squares error. This is also sometimes called vector quantization.

The original k-means algorithm, “Lloyd’s algorithm”, iterates two steps (a bare-bones sketch in R follows the list):

  1. Assign points to their nearest cluster center.
  2. Update cluster centers to the mean of the points in their cluster.
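
A sketch of this, for illustration only. The function name lloyd and its arguments are my own invention, not an existing API:

# Illustrative sketch of Lloyd's algorithm, assuming mat and centers
# are numeric matrices with points/centers in rows
lloyd <- function(mat, centers, iters=100) {
    for (i in seq_len(iters)) {
        # 1. Squared distance from every point to every center,
        #    then assign each point to its nearest center.
        d2 <- outer(rowSums(mat^2), rep(1, nrow(centers))) -
            2 * mat %*% t(centers) +
            outer(rep(1, nrow(mat)), rowSums(centers^2))
        cluster <- max.col(-d2)
        # 2. Move each non-empty center to the mean of its points.
        for (j in seq_len(nrow(centers)))
            if (any(cluster == j))
                centers[j,] <- colMeans(mat[cluster == j,,drop=FALSE])
    }
    list(cluster=cluster, centers=centers)
}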

Some slight improvements to this are possible. For example, centers can be updated alongside updates to cluster assignments, rather than in two separate phases. R’s kmeans function uses the “Hartigan-Wong” algorithm, which also has a slightly improved cluster re-assignment step.
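
For reference, the stock kmeans function selects between these variants through its algorithm argument (here mat is a matrix of points and k the number of clusters):

# "Hartigan-Wong" is the default; Lloyd's algorithm is also available
fit_hw    <- kmeans(mat, k, iter.max=100)
fit_lloyd <- kmeans(mat, k, iter.max=100, algorithm="Lloyd")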

For large \(k\) and larger \(n\), the density of cluster centers should be proportional to the density of the points to the power \(\frac{d}{d+2}\) (see Appendix). In other words the distribution of clusters found by k-means should be more spread out than the distribution of points. This is not in general achieved by commonly used iterative schemes, which stay stuck close to the initial choice of centers.
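
In symbols, with \(\rho\) denoting density:

\[ \rho_{\text{centers}} \propto \rho_{\text{points}}^{\,d/(d+2)} \]

so in two dimensions the exponent is \(\frac{2}{4} = \frac{1}{2}\).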

There are a couple of different approaches to k-means that avoid getting “stuck”:

  • In one dimension, a dynamic programming solution is possible. Ckmeans.1d.dp is a wonderfully fast implementation (a usage sketch follows this list).

  • Ward agglomerative clustering takes an entirely different approach to clustering, while still seeking to minimize the sum-of-squares error. Older implementations calculate a full distance matrix, making them \(O(n^2)\) and not applicable even to moderately sized datasets, but fastcluster::hclust.vector avoids this problem and provides a reasonably fast implementation.
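
For the one-dimensional case, a minimal usage sketch of Ckmeans.1d.dp (the data here are invented purely for illustration):

library(Ckmeans.1d.dp)

x <- c(rnorm(500), rnorm(100, mean=5))  # two 1-d blobs
fit <- Ckmeans.1d.dp(x, k=4)            # exact optimum via dynamic programming
fit$centers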

In this note, I will demonstrate that Ward clustering can perform much better than the usual iterative schemes. Further improvement is achieved by polishing the Ward cluster centers using an iterative method.

The proposed method can be performed in R using:

# For matrix of points mat and specified k

library(tidyverse)
library(fastcluster)

hc <- hclust.vector(mat, method="ward")  # Ward clustering, no full distance matrix
cluster <- cutree(hc, k=k)               # cut the dendrogram into k clusters

# Each cluster's center is the mean of its points
centers <- 
    seq_len(nrow(mat)) %>% 
    split(cluster) %>% 
    map(~ colMeans(mat[.,,drop=FALSE])) %>% 
    do.call(rbind, .)

# This can be followed by a refinement step. Starting kmeans from the
# Ward centers means it only polishes them locally.
result <- kmeans(mat, centers, iter.max=100)

Let’s compare some clustering methods.

Demonstration

Balancing squares

The first dataset will consist of two equal-sized squares of uniform density, the first containing 10,000 points and the second containing 1,000 points. We will find k=200 means.

The two squares have equal area and point densities in the ratio 10:1, so with \(d=2\) the number of clusters should be split in the ratio \(10^{1/2}:1 = \sqrt{10}:1\):

c(10000,1000) %>% sqrt %>% {./sum(.)*200}
## [1] 151.94939  48.05061

# Sum of squared distances from each point to its assigned center
ss <- function(mat, fit) {
    sum( (mat - fit$centers[fit$cluster,])^2 )
}

# Treat every point as its own center, for plotting the raw data
null_fit <- function(mat) list(centers=mat, cluster=seq_len(nrow(mat)))

# Plot cluster centers as circles with area proportional to cluster size
plotit <- function(mat, fit, do_lr=TRUE, do_ss=TRUE, title="", scale=0.003, kchar="k") {
    df <- tibble(x=fit$centers[,1], y=fit$centers[,2], size=as.integer(table(fit$cluster)))

    if (do_lr) {
        # Count how many centers fall in the left (x<=0) and right squares
        k1 <- sum(fit$centers[,1]<=0)
        k2 <- nrow(fit$centers)-k1
        title <- sprintf("%s %s1=%d %s2=%d",title,kchar,k1,kchar,k2)
    }

    if (do_ss) {
        title <- sprintf("%s SS=%.1f",title,ss(mat,fit))
    }

    ggplot(df) + 
        aes(x0=x,y0=y,r=sqrt(size)*scale) + 
        geom_circle(fill="white", color="black",n=16) + 
        coord_fixed() +
        labs(title=title,x="",y="")
}

# Fit all methods: stock kmeans, kmeans++ (from LICORS), Ward, and Ward + polish
do_fits <- function(mat, k, do_kmeanspp=TRUE) {
    result <- list()
    result$kmeans <- kmeans(mat,k,iter.max=100)

    if (do_kmeanspp)
        result$kmeanspp <- LICORS::kmeanspp(mat,k)

    hc <- hclust.vector(mat, method="ward")
    cluster <- cutree(hc, k=k)
    centers <- 
        seq_len(nrow(mat)) %>% 
        split(cluster) %>% 
        map(~ colMeans(mat[.,,drop=FALSE])) %>% 
        do.call(rbind,.)
    result$ward <- list(cluster=cluster,centers=centers)
    result$ward_polish <- kmeans(mat, centers, iter.max=100)

    result
}

# Arrange the raw data and the four fits as a grid of panels
do_plots <- function(mat, fits, scale=0.003, do_lr=FALSE) {
    (plotit(mat, null_fit(mat), do_lr, FALSE, "data", scale, kchar="n") | plot_spacer() ) /
    (plotit(mat, fits$kmeans, do_lr, TRUE, "k-means", scale) | plotit(mat, fits$kmeanspp, do_lr, TRUE, "k-means++", scale)) /
    (plotit(mat, fits$ward, do_lr, TRUE, "Ward", scale) | plotit(mat, fits$ward_polish, do_lr, TRUE, "Ward + polish", scale))
}

k <- 200
n1 <- 10000   # left square, 10 times the density
n2 <- 1000    # right square
mat <- rbind(
    cbind(runif(n1)-1,runif(n1)),   # left square: x in [-1,0]
    cbind(runif(n2),runif(n2)))     # right square: x in [0,1]

fits <- do_fits(mat, k)
do_plots(mat, fits, do_lr=TRUE)