How Does Unequal Probability Affect Entropy Calculation in Information Theory?

AI Thread Summary
The discussion focuses on how unequal probabilities affect entropy calculations in information theory. It explains that while entropy is typically calculated assuming equal probabilities for each message value, a different approach is needed when probabilities vary. An example using a loaded die illustrates that the entropy can be calculated as 127/64 bits, allowing for efficient communication of outcomes. The law of large numbers supports the idea that, over many trials, the average bits needed to convey outcomes can be less than the maximum required. The conversation concludes with a clarification on the additivity of entropy, emphasizing that the total entropy remains consistent regardless of how probabilities are grouped.
Zak
I am reading a book called 'Quantum Processes, Systems, and Information', and near the beginning a basic idea of information is set out in the following way. If information is coded in 'bits', each of which has two possible states, then the number of different messages/'values' that can be encoded in an n-bit system is 2^n. If it is known that a message (to be received in the future) has M possible values, then the entropy is defined as H = log(M), where the logarithm is base 2 (because it's a binary system).

Effectively, the entropy of a message with M possible values tells you the minimum number of bits required to represent the message. However, this assumes that each possible 'value' of the message has an equal probability of being read. It is later discussed that if a message with two possible values is being read and the probabilities of the two values are not equal, then the entropy is defined in a different way.

If, say, the first value (0) has a probability of 1/3, and the second (1) of 2/3, then the second value is 'split' into two values (say, 1a and 1b), each of which has a probability of 1/3. The entropy is then defined in the following way: if H(M) = entropy of message M, and P(x) = probability of value x, then H(0,1a,1b) = H(0,1) + P(1)H(1a,1b).
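As a quick numerical sanity check of this grouping rule, here is a minimal sketch in Python (the helper function H and the variable names are my own, not from the book):

```python
import math

def H(probs):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Left-hand side: entropy of the three equally likely values 0, 1a, 1b
lhs = H([1/3, 1/3, 1/3])                      # log2(3) ≈ 1.585 bits

# Right-hand side: H(0,1) plus P(1) times the entropy of the split of 1
rhs = H([1/3, 2/3]) + (2/3) * H([1/2, 1/2])   # ≈ 0.918 + 0.667

print(lhs, rhs)  # both ≈ 1.585
```

The second term is weighted by P(1) because the split of value 1 only matters on the fraction of readings where 1 actually occurs.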

Does anybody have an intuitive understanding of this definition of entropy for the second case (with unequal probabilities for each value of the message) that they could explain to me?

Thanks.
 
Let's look at an 8-sided loaded die with the following probabilities for each side:
P(side 1) = 1/2
P(side 2) = 1/4
P(side 3) = 1/8
P(side 4) = 1/16
P(side 5) = 1/32
P(side 6) = 1/64
P(side 7) = 1/128
P(side 8) = 1/128

The entropy of this probability distribution is 127/64, or about 1.98 bits.
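Here is a minimal sketch (my own, not from the original post) that computes this entropy directly from the definition H = -Σ P(x) log2 P(x):

```python
import math

# Loaded-die probabilities from the post (sides 1 through 8)
probs = [1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/128]

# Shannon entropy in bits: H = -sum over outcomes of p * log2(p)
entropy = -sum(p * math.log2(p) for p in probs)

print(entropy, 127 / 64)  # both print 1.984375
```

Because every probability here is a power of 1/2, -log2 P(x) is a whole number of bits for each side, which is why this distribution is particularly easy to work with.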

Now with 8 outcomes, one needs three bits to uniquely label each outcome.
(side 1) -> 000
(side 2) -> 001
...
(side 8) -> 111
One can therefore always communicate the outcome with certainty by sending no more than three bits of information every time.

What the entropy tells us is the minimum number of bits we need, on average, to communicate the outcome.

Indeed, if we roll the die many times, we can with high probability communicate the sequence of outcomes using only 1.98 bits per roll instead of 3 bits per roll.
What makes this possible is the law of large numbers.
In particular, with a long sequence of die rolls, the sequence of outcomes is overwhelmingly likely to be one where the relative frequencies of each outcome are very close to the true probabilities (these sequences are "typical").

What gives us this saving of data is that the number of typical sequences is usually a whole lot smaller than the total number of possible sequences. In particular, the number of bits needed to uniquely label all typical sequences is smaller than the number of bits needed to uniquely label all possible sequences.

For a more concrete example (since this distribution is particularly easy to work with), consider this alternative encoding of each outcome:

(side 1) -> 1
(side 2) -> 01
(side 3) -> 001
(side 4) -> 0001
(side 5) -> 00001
(side 6) -> 000001
(side 7) -> 0000001
(side 8) -> 0000000

Even though sides 7 and 8 have 7-bit code words for their outcomes, those outcomes are much less probable than the most likely outcome, which needs only one bit. Indeed, if you use this scheme, you can uniquely communicate the outcome using, on average, only 1.98 bits per roll.
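As a sketch of my own (the probabilities and code words are from the table above; everything else is an assumption for illustration), the expected code length can be checked analytically, and a simulation shows the law-of-large-numbers behaviour described earlier:

```python
import math
import random

# Probabilities and the prefix-free code from the table above (sides 1..8)
probs = [1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/128]
code  = ["1", "01", "001", "0001", "00001", "000001", "0000001", "0000000"]

# Expected code length: sum over sides of P(side) * length of its code word.
# It equals the entropy exactly, since each code word has length -log2 P(side).
expected_len = sum(p * len(c) for p, c in zip(probs, code))
entropy = -sum(p * math.log2(p) for p in probs)
print(expected_len, entropy)      # both print 1.984375

# Simulate many rolls: the average encoded length per roll converges to the
# entropy, well below the 3 bits per roll of the fixed-length labels.
random.seed(0)
n_rolls = 100_000
rolls = random.choices(range(8), weights=probs, k=n_rolls)
print(sum(len(code[r]) for r in rolls) / n_rolls)   # close to 1.98
```

A decoder can recover the roll sequence unambiguously because no code word is a prefix of another.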
 
Ah yes, I think that makes a lot of sense! So does this mean that the statement 'H(0,1a,1b) = H(0,1) + P(1)H(1a,1b)' is, in a sense, saying that after a very large number of repetitions the number of bits required (on average) to express all possible sequences is equal to the average number of bits needed to express the typical sequences, plus the number of bits required to express each individual message, weighted by its probability of occurring over many repetitions?

Does this mean, then, that the statement could actually be written as H(0,1a,1b) = H(0,1) + P(1)H(1a,1b) + P(0)H(0), where H(0) = log(1) = 0 (because '0' is not split further and so has only one possible value)?

Thanks a lot.
 
Unfortunately, I'm not familiar with the mathematical notation you're using. If I read it correctly, you're talking about the additivity of the entropy (that the total entropy is the same no matter how you group your probabilities). Wikipedia has a decent treatment of it, and maybe there you'll see what you're looking for?
 