Traditional (computer-aided) classification (see e.g. http://www.csse.monash.edu.au/~dld/Snob.html) assumes a large number of items and a small number of attributes. But consider:
- Classification of supermarket customers based on buying habits.
- Classification of drugs based on their usefulness to a set of patients. (http://www.remedyfind.com)
- Classification of movies based on user ratings. (http://www.netflixprize.com/, http://movielens.umn.edu/)
- Classification of bacteria based on the genes they carry.
- Face recognition from digital images.
- Classification of documents by word frequency.
The traditional model of classification breaks down in the following ways:
- There are often as many attributes by which to classify items as there are items.
- The items may be best classified one way with respect to one subset of the attributes, and another way with respect to another subset.
- Given the large number of attributes, the data is necessarily sparse.
Furthermore, these problems are often symmetric: the problem is just as meaningful if you call the attributes items and the items attributes:
- Classification of supermarket goods by who buys them.
- Classification of patients by their responses to drugs.
- Classification of users by their movie ratings.
- Classification of genes by which bacteria they are found in.
- Classification of words by the documents they are found in.
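To make the symmetry concrete, here's a minimal sketch in Python (the data and names are invented purely for illustration, and SciPy is assumed to be available): store the data as a sparse incidence matrix, and transposing it is all it takes to swap the roles of items and attributes.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sparse 0/1 incidence data: rows are items (say, bacteria),
# columns are attributes (say, genes).  Invented for illustration.
rows = [0, 0, 1, 2, 2, 2]   # item indices
cols = [1, 3, 0, 0, 2, 3]   # attribute indices
vals = np.ones(len(rows))

items_by_attrs = csr_matrix((vals, (rows, cols)), shape=(3, 4))

# Transposing swaps the two roles: any clustering routine that
# classifies rows can be pointed at the attributes instead.
attrs_by_items = items_by_attrs.T.tocsr()
```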
I'm interested in classification strategies that treat the problem symmetrically. Ring any bells for you?
Responses
Lee points me to this Clay Shirky essay, in which Shirky points out the flaws of hierarchical classification and advocates tagging as an alternative.
Jiri pointed me to one of his blog entries which touches on similar concerns, and suggests the keyword "bigraph".
pfh notes that Singular Value Decomposition is suitably symmetric. I know this has been applied to at least the words-and-documents problem.
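For what it's worth, here's a rough sketch of that symmetry (plain numpy, with an invented ratings matrix): the SVD A = UΣVᵀ produces row factors and column factors in one go, and decomposing Aᵀ just swaps their roles.

```python
import numpy as np

# A tiny, invented user-by-movie ratings matrix.
A = np.array([[5, 0, 3, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

# A = U * diag(s) * Vt: U holds row (user) factors and Vt holds
# column (movie) factors -- both come out of one decomposition.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Decomposing the transpose gives the same singular values, with the
# two factor matrices exchanging roles.
U_t, s_t, Vt_t = np.linalg.svd(A.T, full_matrices=False)
assert np.allclose(s, s_t)
```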