# [R] deviance vs entropy

Warren R. Greiff greiff at mitre.org
Thu Feb 15 17:51:49 CET 2001

```
> RemoteAPL wrote:
>
> Hello,
>
> The question looks simple. It's probably even stupid. But I spent several hours searching
> the Internet and downloaded tons of papers where deviance is mentioned, and...
> And haven't found an answer.
>
> Well, the use of entropy when I split some node of a classification tree is clear to me.
> The sense is clear, because entropy is a good old measure of how uniform a distribution is.
> And we want, for sure, the distribution to be as far from uniform as possible, ideally representing one class only.
>
> Where does deviance come from at all? I look at the formula and see that the only difference from
> entropy is the use of the *number* of points in each class, instead of the *probability*, as the
> multiplier of log(Pik). So it looks like deviance and entropy differ by a factor of 1/N (or 2/N),
> where N is the total number of cases. Then WHY say "deviance"? Any historical reason?
> Or, most likely, I do not understand something very basic. Please help.
>
> Thanks,
> Alexander Skomorokhov,
>
```

I'm not quite sure what you have in mind, but I'm inferring from your comments that by "deviance"
you mean:

SUM p_i log (p_i/q_i)  (or 2 SUM p_i log (p_i/q_i))

In information theoretic terms this is the relative entropy (KL divergence):

D(p||q) = SUM p_i log p_i - SUM p_i log q_i = H(p:q) - H(p)

where H(p) = -SUM p_i log p_i is the entropy of p, and H(p:q) = -SUM p_i log q_i is the cross
entropy.  If q is the uniform distribution over N classes, i.e. q_i = 1/N, then the cross entropy
reduces to:

-SUM p_i log q_i = -SUM p_i log(1/N) = log(N) SUM p_i = log(N)

and you get:

D(p||q) = log(N) - H(p).
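As a quick numeric sanity check of these identities (in Python, using a made-up class
distribution p for illustration):

```python
import math

# A hypothetical class distribution at a tree node (probabilities sum to 1).
p = [0.7, 0.2, 0.1]
N = len(p)
q = [1.0 / N] * N  # uniform reference distribution

# Entropy H(p), cross entropy H(p:q), and relative entropy D(p||q), in nats.
H_p = -sum(pi * math.log(pi) for pi in p)
H_pq = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
D_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# With q uniform, the cross entropy H(p:q) equals log(N), so the
# relative entropy is D(p||q) = log(N) - H(p).
print(H_pq, math.log(N))
print(D_pq, math.log(N) - H_p)
```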

I'm guessing that in the things you've read, when they talk about deviance, q can be (and generally
is) something other than the uniform distribution.  For example, p is often the empirical
distribution of a data sample, and q is the distribution corresponding to some induced model.  Then
D(p||q) is a measure of how far the model is from the observed data.

Note that the cross entropy corresponds (up to sign and a factor of 1/n, for n observations) to the
log likelihood, log L(data;q), of the data (which has an empirical distribution of p) having been
produced by the induced model q.  The entropy likewise corresponds to the log likelihood,
log L(data;p), of the data having been produced by a saturated model, one which fits the empirical
distribution of the data perfectly.  So the relative entropy, scaled up by n:

n D(p||q) = log L(data;p) - log L(data;q) = log [L(data;p) / L(data;q)]

is a measure of how far the model is from being as good as it could be.  Statisticians tend to think
in terms of the ratio of likelihoods, Machine Learning folks tend to think in terms of relative
entropy (entropy - cross_entropy, or KL-divergence).  Statisticians are interested in deviance
because (with the factor of 2) it is asymptotically chi-square for many modeling families.  In
information theoretic terms it's nice to think of the deviance as the number of extra bits it
would take to transmit the data with a code designed for the distribution q, relative to a code
designed for p, which is the best code for transmitting that particular data set.
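The likelihood-ratio reading above can be checked numerically.  The sketch below (Python, with
made-up counts and a made-up model q) compares the statistician's deviance,
2 [log L(data;p) - log L(data;q)], against 2n D(p||q); the two agree exactly for multinomial data:

```python
import math

# Hypothetical multinomial counts observed in a data sample.
counts = [70, 20, 10]
n_total = sum(counts)
p_hat = [c / n_total for c in counts]  # empirical (saturated) distribution

# A made-up induced model's predicted class probabilities.
q = [0.6, 0.3, 0.1]

def log_lik(dist):
    """Multinomial log likelihood kernel of the counts under dist.

    The combinatorial constant is omitted; it cancels in the ratio.
    """
    return sum(c * math.log(d) for c, d in zip(counts, dist))

# Deviance as twice the log likelihood ratio (saturated vs induced model).
deviance = 2 * (log_lik(p_hat) - log_lik(q))

# Relative entropy D(p_hat||q) between the empirical and model distributions.
kl = sum(p * math.log(p / qi) for p, qi in zip(p_hat, q))

print(deviance, 2 * n_total * kl)  # the two quantities coincide
```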

Then again, maybe I've misunderstood you completely.  Please set me straight if I have.

-warren