Shannon's measure - any profound explanation?

1. Oct 26, 2004

saviourmachine

Shannon's measure of information is the well known formula (in the discrete case):
$$H(S) = {-}\sum_{i=1}^N P(s_i) \log_{2}P(s_i)$$
Of course this can be written as:
$$H(S) = \sum_{i=1}^N \log_{2}\left(P(s_i)^{-P(s_i)}\right)$$
It appears to me that the multiple occurrence of the same quantity (the probability of a particular symbol) must have some profound meaning. Why is it a power that is used to formulate this 'entropy' measure? In the case of throwing a six with a die, the results are $$(1/6)^{-1/6} \approx 1.348$$ and $$(5/6)^{-5/6} \approx 1.164$$. Is there some intuitive alternative association with these numbers (without direct connection with binary strings)?
So far I have only found explanations that take the logarithm for granted. IMHO the logarithm only rescales the "chances to the power of chances" characteristic. So, I would like to have an explanation that addresses that characteristic.
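
One way to read the "chances to the power of chances" form: a sum of logarithms is the logarithm of a product, so the product of all the factors $$P(s_i)^{-P(s_i)}$$ equals $$2^{H(S)}$$. That product is often called the perplexity and can be read as an effective number of equally likely outcomes. The following is a minimal Python sketch of this reading for the six-or-not-six example; the function names `entropy_bits` and `power_product` are made up for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: -sum p*log2(p), skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def power_product(probs):
    """Product of p**(-p) over all outcomes ('chances to the power of chances')."""
    return math.prod(p ** (-p) for p in probs if p > 0)

# Event "six" vs "not six" on a fair die.
probs = [1 / 6, 5 / 6]

factors = [p ** (-p) for p in probs]  # one factor per outcome
H = entropy_bits(probs)
product = power_product(probs)

print(factors)      # the individual p**(-p) factors
print(H, product)   # entropy in bits and the product of the factors
print(2 ** H)       # agrees with the product up to rounding
```

For a uniform distribution over N outcomes the product is exactly N, which is why it can be interpreted as a count of "effective" outcomes.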

Or is it more or less arbitrary, like taking quadratic error measures instead of absolute errors (e.g. in the least mean squares method)?

2. Sep 13, 2008

Terry Oldberg

A scientific model is an algorithm (mathematical procedure) for making inferences. An "inference" is an extrapolation from the state of a real object in an "observed state-space" to the state of the same object in an "unobserved state-space." Shannon's measure of information can be shown to be the unique measure of these inferences when the uncertainty of each state in the unobserved state-space is measured by the notion of probability.

Generally, Shannon's measure of the inferences is called the "conditional entropy." In the circumstance that the observed state-space contains but a single state, the conditional entropy reduces to the "entropy."

The existence and uniqueness of Shannon's measure makes it the only possible choice if one is to optimize the inferences that are made by a model. One optimizes these inferences by maximizing the entropy under zero or more constraints or by minimizing the conditional entropy. Optimization works extremely well in deciding which inferences shall be made by a model, thus the role for Shannon's measure in science.
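
As a small worked case of maximizing the entropy under constraints, consider only the normalization constraint $$\sum_i p_i = 1$$, writing $$p_i$$ for $$P(s_i)$$. The sketch below uses a Lagrange multiplier $$\lambda$$ (a symbol introduced here for illustration) and natural logarithms, which change the entropy only by a constant factor:

$$\mathcal{L} = -\sum_{i=1}^N p_i \ln p_i + \lambda\left(\sum_{i=1}^N p_i - 1\right)$$

$$\frac{\partial \mathcal{L}}{\partial p_i} = -\ln p_i - 1 + \lambda = 0 \quad\Rightarrow\quad p_i = e^{\lambda - 1}$$

All $$p_i$$ are therefore equal, so $$p_i = 1/N$$, and the maximum entropy is $$\ln N$$ (or $$\log_2 N$$ bits).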

By the way, Shannon's measure of information is not identical to the entropy but rather is a broader concept, some of whose manifestations also include the conditional entropy and the mutual information.

3. Sep 13, 2008

rbj

i think this:
$$H(S) = \sum_{i=1}^N \log_{2}\left(P(s_i)^{-P(s_i)}\right)$$

is just a consequence of this:

$$H(S) = {-}\sum_{i=1}^N P(s_i) \log_{2}P(s_i)$$

and has no other meaning than that it is mathematically equivalent. the reason the bottom one works is that the measure of information of message $$s_i$$ is $$-\log_{2}P(s_i)$$. then the mean quantity of information (per message) over all of the messages is H(S).
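
The averaging described here can be sketched in a few lines of Python. The function names `surprisal_bits` and `mean_information` and the four-message source are illustrative assumptions, not something taken from the thread.

```python
import math

def surprisal_bits(p):
    """Information content -log2(p) of a message that occurs with probability p."""
    return -math.log2(p)

def mean_information(probs):
    """Probability-weighted average of the surprisal, i.e. H(S) in bits."""
    return sum(p * surprisal_bits(p) for p in probs if p > 0)

# Hypothetical four-message source.
probs = [0.5, 0.25, 0.125, 0.125]

for p in probs:
    print(p, surprisal_bits(p))  # rarer messages carry more bits

print(mean_information(probs))   # average bits per message
```

Each term of the sum is one message's information weighted by how often that message occurs, which is where the second appearance of the probability comes from.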

4. Sep 13, 2008

atyy

The logarithm is there because we want the entropy for independent probability distributions [p(x,y)=p(x)p(y)] to add. A log is what converts multiplication into addition.
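
A quick numerical check of this additivity, as a sketch: the two marginal distributions below are arbitrary examples, and `entropy_bits` is a helper name chosen here.

```python
import math
from itertools import product

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two arbitrary example distributions.
px = [0.5, 0.3, 0.2]
py = [0.9, 0.1]

# Joint distribution of independent variables: p(x, y) = p(x) * p(y).
joint = [a * b for a, b in product(px, py)]

h_x, h_y, h_xy = entropy_bits(px), entropy_bits(py), entropy_bits(joint)
print(h_x + h_y, h_xy)  # the two values agree up to rounding
```

The agreement follows from log(p(x)p(y)) = log p(x) + log p(y); without the logarithm, the probabilities would multiply rather than add.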

The entropy is roughly "log of the number of different possibilities". The precise formulation of this statement is called the "asymptotic equipartition property".
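
A rough numerical illustration of this counting picture, not a proof: for a biased coin with P(heads) = p, count the length-n sequences that contain the expected number of heads, and compare the log of that count per symbol with the entropy. The function names and the choice p = 0.1 are assumptions made for this sketch.

```python
import math

def binary_entropy_bits(p):
    """Entropy in bits of a coin with P(heads) = p, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def log2_count_per_symbol(n, p):
    """(1/n) * log2 of the number of length-n binary strings with round(n*p) ones."""
    k = round(n * p)
    return math.log2(math.comb(n, k)) / n

p = 0.1
for n in (10, 100, 1000, 10000):
    # The first value approaches the second as n grows.
    print(n, log2_count_per_symbol(n, p), binary_entropy_bits(p))
```

As n grows, the number of such sequences behaves like 2 to the power n times the entropy, which is the sense in which entropy counts possibilities on a log scale.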

The mutual information is roughly "reduction in the log of the number of different possibilities".
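
The "reduction" can be made concrete with a small sketch that computes the mutual information as H(X) minus the conditional entropy H(X|Y). The 2x2 joint table below is a made-up example, and `entropy_bits` is a helper name chosen here.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

px = [sum(row) for row in joint]        # marginal of x
py = [sum(col) for col in zip(*joint)]  # marginal of y

h_x = entropy_bits(px)
h_y = entropy_bits(py)
h_xy = entropy_bits([p for row in joint for p in row])

h_x_given_y = h_xy - h_y                # uncertainty about x left after seeing y
mutual_info = h_x - h_x_given_y         # reduction in uncertainty about x

print(h_x, h_x_given_y, mutual_info)
```

If x and y were independent, the conditional entropy would equal H(X) and the mutual information would be zero.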

Last edited: Sep 13, 2008