How Does Unequal Probability Affect Entropy Calculation in Information Theory?

AI Thread Summary
The discussion focuses on how unequal probabilities affect entropy calculations in information theory. It explains that while entropy is typically calculated assuming equal probabilities for each message value, a different approach is needed when probabilities vary. An example using a loaded die illustrates that the entropy can be calculated as 127/64 bits, allowing for efficient communication of outcomes. The law of large numbers supports the idea that, over many trials, the average bits needed to convey outcomes can be less than the maximum required. The conversation concludes with a clarification on the additivity of entropy, emphasizing that the total entropy remains consistent regardless of how probabilities are grouped.
Zak
I am reading a book called 'Quantum Processes, Systems, and Information', and near the beginning a basic idea of information is set out in the following way. If information is coded in 'bits', each of which has two possible states, then the number of different messages/'values' that can be encoded in an n-bit system is 2^n. If it is known that a message (to be received in the future) has M possible values, then the entropy is defined as H = log(M), where the logarithm is base 2 (because it's a binary system).

Effectively, the entropy of a message with M possible values tells you the minimum number of bits required to represent the message. However, this assumes that each possible 'value' of the message has an equal probability of being read. It is later discussed that if a message with two possible values is being read and the probabilities of the two values are not equal, then the entropy is defined in a different way.

If, say, the first value (0) has a probability of 1/3, and the second (1) of 2/3, then the second value is 'split' into two values (say, 1a and 1b), each of which has a probability of 1/3. The entropy is then defined in the following way: if H(M) = entropy of message M, and P(x) = probability of value x, then H(0,1a,1b) = H(0,1) + P(1)H(1a,1b).
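As a quick numerical sanity check of this grouping rule, here is a minimal sketch in Python (the helper function H and the variable names are my own, not from the book):

```python
import math

def H(probs):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Left-hand side: entropy of the three equally likely values 0, 1a, 1b
lhs = H([1/3, 1/3, 1/3])                      # log2(3) ≈ 1.585 bits

# Right-hand side: H(0,1) plus P(1) times the entropy of the split of 1
rhs = H([1/3, 2/3]) + (2/3) * H([1/2, 1/2])   # ≈ 0.918 + 0.667

print(lhs, rhs)  # both ≈ 1.585
```

The second term is weighted by P(1) because the split of value 1 only matters on the fraction of readings where 1 actually occurs.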

Does anybody have an intuitive understanding of this definition of entropy for the second case (with unequal probabilities for each value of the message) that they could explain to me?

Thanks.
 
Let's look at an 8-sided loaded die with the following probabilities for each side:
P(side 1) = 1/2
P(side 2) = 1/4
P(side 3) = 1/8
P(side 4) = 1/16
P(side 5) = 1/32
P(side 6) = 1/64
P(side 7) = 1/128
P(side 8) = 1/128

The entropy of this probability distribution is 127/64, or about 1.98 bits.
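Here is a minimal sketch (my own, not from the original post) that computes this entropy directly from the definition H = -Σ P(x) log2 P(x):

```python
import math

# Loaded-die probabilities from the post (sides 1 through 8)
probs = [1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/128]

# Shannon entropy in bits: H = -sum over outcomes of p * log2(p)
entropy = -sum(p * math.log2(p) for p in probs)

print(entropy, 127 / 64)  # both print 1.984375
```

Because every probability here is a power of 1/2, -log2 P(x) is a whole number of bits for each side, which is why this distribution is particularly easy to work with.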

Now with 8 outcomes, one needs three bits to uniquely label each outcome.
(side 1) -> 000
(side 2) -> 001
...
(side 8) -> 111
One can therefore always communicate the outcome with certainty by sending no more than three bits of information every time.

What the entropy tells us is the minimum number of bits we need, on average, to communicate the outcome.

Indeed, if we roll the die many times, we can with high probability communicate the sequence of outcomes using only 1.98 bits per roll instead of 3 bits per roll.
What makes this possible is the law of large numbers.
In particular, with a long sequence of die rolls, the sequence of outcomes is overwhelmingly likely to be one where the relative frequencies of each outcome are very close to the true probabilities (these sequences are "typical").

What gives us this saving of data is that the number of typical sequences is usually a whole lot smaller than the total number of possible sequences. In particular, the number of bits needed to uniquely label all typical sequences is smaller than the number of bits needed to uniquely label all possible sequences.

For a more concrete example (since this distribution is particularly easy to work with), consider this alternative encoding of each outcome:

(side 1) -> 1
(side 2) -> 01
(side 3) -> 001
(side 4) -> 0001
(side 5) -> 00001
(side 6) -> 000001
(side 7) -> 0000001
(side 8) -> 0000000

Even though sides 7 and 8 have 7-bit code words for their outcomes, those outcomes are much less probable than the most likely outcome, which needs only one bit. Indeed, if you use this scheme, you can uniquely communicate the outcome using, on average, only 1.98 bits per roll.
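As a sketch of my own (the probabilities and code words are from the table above; everything else is an assumption for illustration), the expected code length can be checked analytically, and a simulation shows the law-of-large-numbers behaviour described earlier:

```python
import math
import random

# Probabilities and the prefix-free code from the table above (sides 1..8)
probs = [1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/128]
code  = ["1", "01", "001", "0001", "00001", "000001", "0000001", "0000000"]

# Expected code length: sum over sides of P(side) * length of its code word.
# It equals the entropy exactly, since each code word has length -log2 P(side).
expected_len = sum(p * len(c) for p, c in zip(probs, code))
entropy = -sum(p * math.log2(p) for p in probs)
print(expected_len, entropy)      # both print 1.984375

# Simulate many rolls: the average encoded length per roll converges to the
# entropy, well below the 3 bits per roll of the fixed-length labels.
random.seed(0)
n_rolls = 100_000
rolls = random.choices(range(8), weights=probs, k=n_rolls)
print(sum(len(code[r]) for r in rolls) / n_rolls)   # close to 1.98
```

A decoder can recover the roll sequence unambiguously because no code word is a prefix of another.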
 
Ah yes, I think that makes a lot of sense! So does this mean that the statement 'H(0,1a,1b) = H(0,1) + P(1)H(1a,1b)' is, in a sense, saying that after a very large number of repetitions the number of bits required (on average) to express all possible sequences is equal to the average number of bits needed to express the typical sequences, plus the number of bits required to express each individual message, weighted by its probability of occurring over many repetitions?

Does this mean, then, that the statement could actually be written as H(0,1a,1b) = H(0,1) + P(1)H(1a,1b) + P(0)H(0), where H(0) = log(1) = 0 (because '0' is not split further and so has only one possible value)?

Thanks a lot.
 
Unfortunately, I'm not familiar with the mathematical notation you're using. If I read it correctly, you're talking about the additivity of the entropy (that the total entropy is the same no matter how you group your probabilities). Wikipedia has a decent treatment of it, and maybe there you'll see what you're looking for?
 