LSA + alpha-stable ICA ?




I have been thinking about how to apply alpha-stable distributions to discrete data. Doing so would require extracting continuous attributes from the discrete data.

With the Gaussian distribution, one way to do this is "Principal Components Analysis". For example, one might apply this to text by turning each document (or perhaps paragraph or sentence) into a (big) binary vector, each element of the vector indicating the presence or absence of a particular word. PCA can then be applied to find words that often co-occur. This particular application is called "Latent Semantic Analysis".
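As a rough illustration of the idea above, here is a minimal sketch of PCA on binary word-presence vectors. The six toy documents and all variable names are made up for the example, and it uses NumPy's SVD on the centred matrix as one common way of computing principal components, not as the definitive LSA recipe.

```python
import numpy as np

# Hypothetical toy corpus: three "pets" documents, three "finance" documents.
docs = [
    "cat dog pet",
    "dog pet vet",
    "cat pet vet",
    "stock bond market",
    "bond market trade",
    "stock market trade",
]

# One column per word, one row per document; 1.0 marks presence of the word.
vocab = sorted({word for doc in docs for word in doc.split()})
X = np.array([[1.0 if word in doc.split() else 0.0 for word in vocab]
              for doc in docs])

# PCA: centre each column, then take the SVD of the centred matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

first = Vt[0]         # word loadings on the first principal component
scores = Xc @ first   # position of each document along that component

# Words with the largest absolute loadings (the overall sign is arbitrary).
top_words = [vocab[i] for i in np.argsort(-np.abs(first))[:2]]

for word, loading in zip(vocab, first):
    print(f"{word:>7s} {loading:+.3f}")
print("document scores:", np.round(scores, 3))
```

In this toy case the first component should simply separate the two groups of documents, with co-occurring words getting loadings of the same sign.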

The alpha-stable equivalent to Principal Components Analysis is Independent Components Analysis (actually, ICA can be applied using any non-Gaussian distribution). Alpha-stable ICA is somewhat more powerful than PCA in that the presence of outliers along a particular direction can be taken as evidence of a component along that direction.
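To make the "outliers point along the components" idea concrete, here is a small sketch under some strong assumptions of my own. It uses the Cauchy distribution (the alpha-stable case with alpha = 1, chosen because its density has a closed form), only two dimensions, and a mixing matrix that is a pure rotation. The unmixing angle is found by a brute-force grid search over the likelihood, which is a stand-in for a proper ICA algorithm, and the function names are hypothetical.

```python
import numpy as np

def rotation(theta):
    """2-D rotation matrix for an angle in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def cauchy_loglik(y):
    """Total log-density of y under independent standard Cauchy components."""
    return -np.sum(np.log(np.pi * (1.0 + y ** 2)))

def fit_rotation(x):
    """Grid-search the unmixing angle over whole degrees in [0, 90).

    Components are only identifiable up to order and sign, so a quarter
    turn covers every distinct solution.
    """
    thetas = np.deg2rad(np.arange(90))
    logliks = [cauchy_loglik(x @ rotation(t)) for t in thetas]
    return thetas[int(np.argmax(logliks))]

# Synthetic data: two independent Cauchy sources, rotated by 30 degrees.
rng = np.random.default_rng(0)
sources = rng.standard_cauchy(size=(2000, 2))
true_theta = np.deg2rad(30.0)
mixed = sources @ rotation(true_theta).T

theta_hat = fit_rotation(mixed)
print("estimated angle in degrees:", np.rad2deg(theta_hat))
```

The covariance of this data carries no reliable directional information (Cauchy variables have no finite variance), so the estimate here is driven by where the large outliers line up.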

Alpha-stable ICA is also more robust than PCA, in that outliers cause less surprise. Essentially, one is saying "it doesn't rain, but it pours". A topic (i.e. a set of words) may be very rarely mentioned, but in those documents where it is mentioned it may be mentioned often. Because of this, alpha-stable ICA can be seen not just as factorization, but also almost as a kind of clustering.
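The "less surprise" claim can be checked with a little arithmetic. This sketch compares the negative log-density of a value under a standard Gaussian with that under a standard Cauchy, which I am using as a convenient closed-form alpha-stable example (alpha = 1). Unit scale for both is an assumption made only for the comparison.

```python
import math

def gaussian_surprise(x):
    """Negative log-density of x under a standard Gaussian."""
    return 0.5 * x * x + 0.5 * math.log(2.0 * math.pi)

def cauchy_surprise(x):
    """Negative log-density of x under a standard Cauchy (alpha = 1)."""
    return math.log(math.pi * (1.0 + x * x))

# Surprise grows quadratically for the Gaussian, only logarithmically
# for the Cauchy, so far-out values are penalised much less.
for x in (0.0, 1.0, 3.0, 10.0):
    print(f"x={x:5.1f}  gaussian={gaussian_surprise(x):8.3f}  "
          f"cauchy={cauchy_surprise(x):6.3f}")
```

Near zero the Gaussian is slightly less surprised than the Cauchy, but for a value many scale units out the ordering reverses by a wide margin.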


I think applying alpha-stable ICA to LSA, and to other categorization tasks based on discrete data, may be worth a try.



