Surprise is the KL-divergence between prior and posterior

Jiří pointed me at this interesting paper on surprise and attention: Itti and Baldi, 2005, "Bayesian Surprise Attracts Human Attention", with free software available here. Their application of this was to eye-tracking, which immediately piqued my interest.

Itti and Baldi define surprise as the Kullback-Leibler divergence between the distribution over model hypotheses before and after observing a datum.
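In symbols, as I read the paper: with M ranging over model hypotheses and D the observed datum,

```latex
S(D, M) = \mathrm{KL}\left( P(M \mid D) \,\middle\|\, P(M) \right)
        = \int P(M \mid D) \, \log \frac{P(M \mid D)}{P(M)} \, dM
```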

I have previously heard surprise defined as the amount of information a datum contains (its Shannon surprisal), but this definition is superior: Bayesian surprise measures how much a datum actually shifts the observer's beliefs, so data that are improbable but uninformative, such as random noise, need not be surprising.
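To make the contrast concrete, here is a toy sketch of my own (not from the paper) using a conjugate Beta-Bernoulli coin model: a single flip can carry a whole bit of Shannon information while barely moving your beliefs at all.

```python
import math
from scipy.special import betaln, digamma

def beta_kl(a1, b1, a2, b2):
    """KL( Beta(a1, b1) || Beta(a2, b2) ), in nats."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

# Prior belief about a coin's bias: Beta(a, b).
a, b = 5.0, 5.0

# Observe one head; the conjugate update gives posterior Beta(a + 1, b).
post_a, post_b = a + 1.0, b

# Shannon surprisal: -log of the datum's probability under the
# prior predictive, P(head) = a / (a + b).
shannon = -math.log(a / (a + b))

# Bayesian surprise: KL( posterior || prior ).
bayesian = beta_kl(post_a, post_b, a, b)

print(f"Shannon surprisal: {shannon:.3f} nats")  # ~0.693
print(f"Bayesian surprise: {bayesian:.3f} nats")  # ~0.047
```

The head carries -ln(1/2) ≈ 0.69 nats of information, yet it barely shifts a Beta(5,5) prior, so its Bayesian surprise is only about 0.05 nats.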

This definition of surprise (and of attention) is principled and general, and they present evidence that it matches human behaviour well. It is very nifty indeed.



My own immediate interest is seeing if I can adapt their software to be autistic (or perhaps non-autistic), then hopefully comparing it to published eye-tracking studies of autism. (One slight drawback here -- their model, though it allows for the possibility in theory, does not yet implement higher-level cognitive processes. Implementing these is of course AI-complete, but it would be necessary in order to perfectly model human attention, as human attention is guided by higher-level processes. I am thinking in particular of a study that did eye-tracking on people watching the movie "Who's Afraid of Virginia Woolf?", in which attention (in normal viewers) was partly directed by what the actors were saying or looking at.)




... I wonder if it would be better to swap the order of the KL-divergence's arguments from the one Itti and Baldi use. The KL-divergence is not symmetric, and what one really wants is an estimate of the cost of having held your previous, erroneous world view, now that you have updated your beliefs based on the new data point.
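The two orders genuinely give different numbers. A minimal sketch (my own toy figures), with the coding-cost reading of each direction noted in the comments:

```python
import math

def kl(p, q):
    """KL( p || q ) for discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy beliefs over two hypotheses, before and after seeing a datum.
prior     = [0.5, 0.5]
posterior = [0.9, 0.1]

# KL(posterior || prior): expected extra code length if you keep a
# code tuned to the prior while your new beliefs are the posterior --
# the cost of clinging to the old world view.
print(kl(posterior, prior))  # ~0.368 nats

# KL(prior || posterior): the reverse direction gives a different
# number, so the choice of argument order matters.
print(kl(prior, posterior))  # ~0.511 nats
```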




[æ]