k-means the diversifier, the deviralizer


For a collection of points, the k-means algorithm seeks a set of k "mean" points minimizing the sum of squared distances from each point to its nearest mean. k-means is a simple way of clustering data. It has a fast approximate algorithm to find a local optimum, but this might not be sufficient for the application I am talking about here, which needs something like a truly global optimum. It can also be viewed as a way of approximating a dataset using a smaller number of points, even if it does not consist of distinct "clusters".

I've recently become interested in the behaviour of k-means for large k. What is the distribution of the means compared to the original distribution of vectors, as k becomes large but assuming n is always much larger?

In one dimension Wong (1982) has shown that the density distribution of means is proportional the cube root of the original density distribution. Raising a density to a fractional power such as here 1/3 has a flattening, widening effect. Peaks are lowered and tails are fattened. After some rough calculation (see end), in d dimensions I believe the distribution will be proportional to the original distribution to the power d/(2+d). Altering the distance metric (k-medians, etc) I think will result in a different power.

So, for any collection of things where we have a notion of distance, k-means provides a way to flatten and summarize the distribution. (If we don't want to interpolate, we could limit means to members of the original collection.)

In a collection in which most of the variation is only in a subset of dimensions I think the effective d will depend on k. For example, for moderate k the effective dimension might be 1 or 2, but for large k one gets into the fine structure, the effective d rises, and the flattening effect is reduced.

This idea of flattening a distribution seems useful, so an algorithm that does it is exciting:

It also seems reminiscent of Wikipedia pages, which tend to cover all the major opinions, not including wild theories but also not entirely focussing on the dominant theory.

Update 2020-10-24: I've applied this idea by clustering 2020 bioRxiv abstracts. This uses a "greedy" variant of k-means where means are optimized one after the other. In other words, it is an ordered list in which topics become progressively more specialized. The algorithm I used also has fairly good ability to escape local optima.

Further note 2021-03-23: The usual k-means algorithm performs very poorly at finding the global optimum, or even at producing clusterings with the expected properties I have described the global optimum as having. Some improvement may be obtained by initializing cluster membership using Ward agglomerative clustering. In R, use fastcluster::hclust.vector() followed by cutree().

In one dimension, obtain the exact global optimum with Ckmeans.1d.dp::Ckmeans.1d.dp().

Appendix: A very rough examination of what happens in d dimensions.

Consider two d-dimensional unit hypercubes, containing n1 and n2 points respectively.

Out of k, how shall we allocate the means to the two hypercubes, k1, k2=k-k1?

Within a hypercube, the distance to the nearest mean will typically go proportional to k^(-1/d).

So within a hypercube the sum of squared distances will go approximately like

SS = c (k^(-1/d))^2 n
   = c k^(-2/d) n

where c is some constant.

Within two hypercubes we would have

SS = c k1^(-2/d) n1 + c (k-k1)^(-2/d) n2

Assuming k is large enough that we can treat it as effectively continuous, find the minimum by differentiation:

dSS/dk1 = c n1 (-2/d) k1 ^(-2/d-1) - c n2 (-1/d) (k-k1)^(-2/d-1)

Set dSS/dk1 = 0
=> n1 k1^(-2/d-1) = n2 (k-k1)^(-2/d-1)
=> (k1/(k-k1))^(-2/d-1) = n2/n1
=> ((k-k1)/k1)^(2/d+1) = n2/n1
=> (k2/k1)^(2/d+1) = n2/n1
=> k2/k1 = (n2/n1)^(1/(2/d+1))
=> k2/k1 = (n2/n1)^(d/(2+d))

This shows how k means will be allocated between two regions of differing density. Between more regions, each pair of regions will be balanced in this way.