k-means the diversifier, the deviralizer

For a collection of points, the k-means algorithm seeks a set of k "mean" points minimizing the sum of squared distances from each point to its nearest mean. k-means is a simple way of clustering data. It has a fast approximate algorithm to find a local optimum, but this might not be sufficient for the application I am talking about here, which needs something like a truly global optimum. It can also be viewed as a way of approximating a dataset using a smaller number of points, even if it does not consist of distinct "clusters".

I've recently become interested in the behaviour of k-means for large k. What is the distribution of the means compared to the original distribution of vectors, as k becomes large but assuming n is always much larger?

In one dimension Wong (1982) has shown that the density distribution of means is proportional the cube root of the original density distribution. Raising a density to a fractional power such as here 1/3 has a flattening, widening effect. Peaks are lowered and tails are fattened. After some rough calculation (see end), in d dimensions I believe the distribution will be proportional to the original distribution to the power d/(2+d). Altering the distance metric (k-medians, etc) I think will result in a different power.

So, for any collection of things where we have a notion of distance, k-means provides a way to flatten and summarize the distribution. (If we don't want to interpolate, we could limit means to members of the original collection.)

In a collection in which most of the variation is only in a subset of dimensions I think the effective d will depend on k. For example, for moderate k the effective dimension might be 1 or 2, but for large k one gets into the fine structure, the effective d rises, and the flattening effect is reduced.

This idea of flattening a distribution seems useful, so an algorithm that does it is exciting:

From a news feed, pick items covering the diversity of opinions. Do not concentrate too much on wild outliers, but also do not concentrate too much on the most common themes.

In a shop, stock items to maximize the average ability to meet the needs of customers. Some stocked items will have higher turnover than others.

Looking to employ k people, choose a set of people who will bring diverse skills. For tasks that arise we want to ensure we've employed someone with skills fairly well matching the task. We have an anticipated distribution of tasks. The possible "means" are the available applicants.

It also seems reminiscent of Wikipedia pages, which tend to cover all the major opinions, not including wild theories but also not entirely focussing on the dominant theory.

Update 2020-10-24: I've applied this idea by clustering 2020 bioRxiv abstracts. This uses a "greedy" variant of k-means where means are optimized one after the other. In other words, it is an ordered list in which topics become progressively more specialized. The algorithm I used also has fairly good ability to escape local optima.

Further note 2021-03-23: The usual k-means algorithm performs very poorly at finding the global optimum, or even at producing clusterings with the expected properties I have described the global optimum as having. Some improvement may be obtained by initializing cluster membership using Ward agglomerative clustering. In R, use fastcluster::hclust.vector() followed by cutree().

In one dimension, obtain the exact global optimum with Ckmeans.1d.dp::Ckmeans.1d.dp().

Appendix: A very rough examination of what happens in d dimensions.

Consider two d-dimensional unit hypercubes, containing n1 and n2 points respectively.

Out of k, how shall we allocate the means to the two hypercubes, k1, k2=k-k1?

Within a hypercube, the distance to the nearest mean will typically go proportional to k^(-1/d).

So within a hypercube the sum of squared distances will go approximately like

SS = c (k^(-1/d))^2 n
   = c k^(-2/d) n

where c is some constant.

Within two hypercubes we would have

SS = c k1^(-2/d) n1 + c (k-k1)^(-2/d) n2

Assuming k is large enough that we can treat it as effectively continuous, find the minimum by differentiation:

dSS/dk1 = c n1 (-2/d) k1 ^(-2/d-1) - c n2 (-1/d) (k-k1)^(-2/d-1)

Set dSS/dk1 = 0
=> n1 k1^(-2/d-1) = n2 (k-k1)^(-2/d-1)
=> (k1/(k-k1))^(-2/d-1) = n2/n1
=> ((k-k1)/k1)^(2/d+1) = n2/n1
=> (k2/k1)^(2/d+1) = n2/n1
=> k2/k1 = (n2/n1)^(1/(2/d+1))
=> k2/k1 = (n2/n1)^(d/(2+d))

This shows how k means will be allocated between two regions of differing density. Between more regions, each pair of regions will be balanced in this way.