Help understanding Shannon entropy

In summary, Shannon entropy is a measure of the average number of bits needed to describe a random variable. It is related to the probability distribution of the variable and can be used to find the most efficient way to encode and transmit information about the variable. The encoding method used in Shannon coding theory ensures that the receiver can accurately decode the transmitted information.
  • #1
voila
Hi
I'm having some trouble understanding Shannon entropy and its relation to "computer" bits (zeros and ones). Shannon entropy of a random variable is (assume b=2 so that we work in bits)
[itex]H(X)=-\sum_{i} p(x_{i})\log_{2}p(x_{i})[/itex]

and everywhere I've read says it is "the number of bits on the average required to describe the random variable". Yet I can't seem to find a proper demonstration of the connection between that sentence and the equation.

I've also found an example (Example 1.1.2 in Elements of Information Theory, Cover & Thomas), which reads:

"Suppose we have a horse race with eight horses taking part. The probabilities of winning for the eight horses are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). The entropy of the horse race is (shows the calculation and equals) 2 bits. Suppose we wish to send a message indicating which horse won the race. One alternative is to send the index of the winning horse . This description requires 3 bits for any of the horses. But the probabilities are not uniform, so it makes sense to use shorter descriptions for the more probable horses, and viceversa, so that we achieve a lower average description length. For example, we could use:

0, 10, 110, 1110, 111100, 111101, 111110, 111111.

The average description length in this case is 2 bits."

Alright, it works out in this case. But why does it have to be that particular code? Why can't I use 1 for the second horse, 01 for the third, and so on, reducing the average description length?

Apart from that, there's another example in the same book (Example 2.1.1) where we have a variable that takes the value 1 with probability p, and the value 0 with probability 1-p. If p is different from 1/2 (that is, if one outcome is more probable), the Shannon entropy is less than one bit. How can I describe it with less than one bit? It makes no sense. If it means "you could just not send a signal for the more probable outcome", wouldn't having a "not sending a signal" possibility effectively mean working with trits?
And once again, how does Shannon entropy relate to all this?
 
  • #2
Let's keep things general.

You have a random variable [itex]X[/itex] with [itex]N[/itex] possible outcomes.
Each outcome [itex]X_{i}[/itex] occurs with probability [itex]P(X_{i})[/itex], and these probabilities sum to 1 over all outcomes.

Now, let's say you want to communicate the outcome of this random variable to someone else.
The simplest thing you could do is number the outcomes from 1 to N, express those numbers in binary, and send the binary number corresponding to the outcome that happened.
The number of binary digits, or "bits", you need to express each of these numbers is [itex]\log_{2}(N)[/itex], rounded up to the nearest whole number.

If all outcomes were equally likely, this would also be the best you could do. But if you know what [itex]P(X)[/itex] is, and it is not completely uniform, there are more efficient ways.

It has to do with what's called the law of large numbers (or, in that Cover & Thomas book, the Asymptotic Equipartition Property).
Over many trials of outcomes of [itex]X[/itex], you will find that the relative frequencies of each outcome grow to closely resemble the probability of that outcome. These sequences of outcomes that resemble the probability distributions they came from are called "typical".
In the limit of a large number of trials, the probability that your sequence is "typical" becomes overwhelmingly close to 1.

So, with a vanishingly small probability of error, you don't have to have unique numbers for every string of outcomes. You just have to have a unique number for every "typical" string of outcomes.

And, the number of bits you need per outcome in a typical string is equal to the Shannon entropy of [itex]X[/itex].

[itex]H(X)=-\sum_{i} P(X_{i})\log_{2}(P(X_{i}))[/itex]
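
For concreteness, here is a minimal Python sketch (my own addition, not part of the original post) that evaluates this formula for the horse-race distribution from post #1 and compares it with the average lengths of the two codes quoted there:

[code]
from math import log2

# Winning probabilities for the eight horses (Cover & Thomas, Example 1.1.2)
p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

# Shannon entropy in bits: H(X) = -sum_i p_i * log2(p_i)
H = -sum(pi * log2(pi) for pi in p)

# Average description length of the plain 3-bit index code
# versus the variable-length code from the example
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]
avg_fixed = sum(pi * 3 for pi in p)
avg_variable = sum(pi * len(c) for pi, c in zip(p, codes))

print(H, avg_fixed, avg_variable)   # 2.0  3.0  2.0
[/code]

The entropy and the average length of the clever code both come out to 2 bits, while the naive fixed-length code needs 3 bits per message.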
 
  • #4
Thank you for your responses, jfizzix & atyy. The chapter you showed me, atyy, helped a bit more (no pun intended) in understanding how Shannon information and entropy work. However, my questions remain:
1) On the first example from my first post, why do we have to choose that description for the horses so the result matches Shannon entropy, instead of choosing a shorter one?
2) On the second example, how could you describe the outcome with less than a bit?
 
  • #5
voila said:
"Suppose we have a horse race with eight horses taking part. The probabilities of winning for the eight horses are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). The entropy of the horse race is (shows the calculation and equals) 2 bits. Suppose we wish to send a message indicating which horse won the race. One alternative is to send the index of the winning horse . This description requires 3 bits for any of the horses. But the probabilities are not uniform, so it makes sense to use shorter descriptions for the more probable horses, and viceversa, so that we achieve a lower average description length. For example, we could use:

0, 10, 110, 1110, 111100, 111101, 111110, 111111.

The average description length in this case is 2 bits."

Alright, it works out in this case. But why does it have to be that particular code? Why can't I use 1 for the second horse, 01 for the third, and so on, reducing the average description length?

The idea in Shannon coding theory is that you're encoding and sending a constant stream of messages drawn randomly in an i.i.d. fashion, so someone receiving your messages needs to be able to tell where one message ends and the next begins. Both the example encodings described here make this possible: in the first, the receiver will know that each encoded message is exactly three bits long. In the second, each encoded message is either 1) up to four bits long and ends in a zero or 2) exactly six bits long and the first four bits are all ones.
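
To make the decodability point concrete, here is a small Python sketch (my own illustration, not from the thread): it decodes a concatenated stream with the book's prefix-free code by greedy matching, and the closing comment shows why the proposed shorter code fails.

[code]
# The book's code is "prefix-free": no codeword is a prefix of another,
# so a concatenated stream can be split unambiguously, left to right.
book_code = {"0": 1, "10": 2, "110": 3, "1110": 4,
             "111100": 5, "111101": 6, "111110": 7, "111111": 8}

def decode(stream, code):
    """Greedy prefix decoding: emit a horse index whenever the buffer matches a codeword."""
    out, buf = [], ""
    for bit in stream:
        buf += bit
        if buf in code:
            out.append(code[buf])
            buf = ""
    if buf:
        raise ValueError("stream ended in the middle of a codeword")
    return out

# Winners of five races: horses 1, 3, 1, 8, 2
stream = "0" + "110" + "0" + "111111" + "10"
print(decode(stream, book_code))   # [1, 3, 1, 8, 2]

# A code like 0, 1, 01, ... is not prefix-free: "0" is a prefix of "01",
# so the stream "01" could mean (horse 1, horse 2) or just horse 3 --
# the receiver cannot tell where one message ends and the next begins.
[/code]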

Apart from that, there's another example in the same book (Example 2.1.1) where we have a variable that takes the value 1 with probability p, and the value 0 with probability 1-p. If p is different from 1/2 (that is, if one outcome is more probable), the Shannon entropy is less than one bit. How can I describe it with less than one bit? It makes no sense. If it means "you could just not send a signal for the more probable outcome", wouldn't having a "not sending a signal" possibility effectively mean working with trits?
And once again, how does Shannon entropy relate to all this?

Like I said, the idea of Shannon coding theory is that you want to communicate a constant stream of these variables drawn randomly. You can encode these more efficiently by waiting until you have multiple values that you want to send and encoding them together. For instance, if you wait until you have 1000 bits to send instead of just one, then there are [itex]2^{1000}[/itex] possible sequences of 1000 bits, but the vast majority of the time you will get sequences with close to 1000p ones, and you can prioritise giving these the shorter encodings.
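
Here is a rough Python sketch of that idea (my own illustration, not from the thread): it builds a Huffman code over blocks of n flips of a coin with P(1) = p and shows the expected number of bits per flip creeping down toward the entropy H(p), which is well below 1 bit when p is far from 1/2.

[code]
import heapq
from itertools import product
from math import log2

def huffman_lengths(probs):
    """Return the codeword length a Huffman code assigns to each symbol."""
    # Heap entries: (probability, tie-breaking id, leaf indices under this node)
    heap = [(q, i, [i]) for i, q in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        qa, _, a = heapq.heappop(heap)
        qb, _, b = heapq.heappop(heap)
        for leaf in a + b:          # each merge adds one bit to every leaf below it
            lengths[leaf] += 1
        heapq.heappush(heap, (qa + qb, counter, a + b))
        counter += 1
    return lengths

p = 0.1                                # P(flip = 1); H(p) is well below one bit
H = -p * log2(p) - (1 - p) * log2(1 - p)

for n in (1, 4, 8, 12):
    blocks = list(product([0, 1], repeat=n))                      # all 2^n blocks
    probs = [p ** sum(b) * (1 - p) ** (n - sum(b)) for b in blocks]
    lengths = huffman_lengths(probs)
    bits_per_flip = sum(q * L for q, L in zip(probs, lengths)) / n
    print(f"n={n}: {bits_per_flip:.3f} bits per flip vs H(p) = {H:.3f}")
[/code]

For n = 1 you are stuck at 1 bit per flip, but as the blocks get longer the average drops toward H(p), which is the "less than one bit per outcome" claim in practice.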

In general, when the Shannon or a related entropy is proved to mean something, the proof usually only applies exactly in the asymptotic limit of infinitely many repetitions and assuming the messages are drawn randomly in an i.i.d. fashion. If you literally only have one bit you want to send then this is very far from the type of situation that Shannon coding theory applies to.
 
  • #6
I see. I hadn't taken into account that the messages are sent together as a stream and have to be made distinguishable for the receiver. Thank you very much!
 
  • #7
This is one of my favorite things; it really changed my worldview when I got it. My advice for intuitively grasping it is to forget about the information stream it describes to start, and just say that the information needed to express something is the log2 of the reciprocal of its probability. So suppose we have numbers 0-255 making up a message, each occurring with equal probability. Then the information each needs to be expressed is log2(256/1) = 8 bits = 1 byte. As you know, a byte can encode the numbers 0-255, so it's simple. But suppose only the numbers 0-15 occur, with equal probability, and the rest now have zero probability. Now we don't need a byte to encode each number; we need log2(16/1) = 4 bits, so we can fit 2 such numbers to the byte, and 100 of these numbers takes half the information of 100 of the previous.
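
As a quick sanity check of that rule, here is a tiny Python snippet (my own addition, not part of the original post):

[code]
from math import log2

# Information (in bits) needed to express an outcome of probability p: log2(1/p)
def info_bits(p):
    return log2(1 / p)

print(info_bits(1 / 256))   # 8.0 -- one of 256 equally likely numbers takes a full byte
print(info_bits(1 / 16))    # 4.0 -- one of 16 equally likely numbers takes half a byte
[/code]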

Now, imagine a message where the numbers 0-15 occur very regularly, but the numbers 16-255 occur incredibly rarely. We could use a byte for each, but we can use way less information on average if we make a 4-bit symbol for the numbers 0-15, along with a 4-bit escape symbol that signifies 8-bit mode for the next number, in the rare event that it is greater than 15. This encoding will use just over half the information of the full 8 bits on average, if the numbers greater than 15 are truly rare.

Once you understand that, go back and look at Shannon entropy (remember -log(p) = log(1/p); I use the latter, Shannon the former). When it hits you, you see that probability and information are inseparable concepts. I personally find it really beautiful.

Another thing to remember is that Shannon entropy can only provide an upper bound for the information in a message; the actual information content is the Kolmogorov complexity, which is not computable.
 
  • #8
Fooality, in that example where only the numbers 0-15 occur, so that 4 bits identify each of them and another 4-bit escape symbol indicates that the next number is in 8-bit mode, why would we need 4 whole extra bits for the escape and not just one?

Maybe you were just using that as an example and we actually could use a single bit.
 
  • #9
voila said:
Fooality, in that example where only the numbers 0-15 occur, so that 4 bits identify each of them and another 4-bit escape symbol indicates that the next number is in 8-bit mode, why would we need 4 whole extra bits for the escape and not just one?

Maybe you were just using that as an example and we actually could use a single bit.

That was kind of a quick and dirty example; I think I meant to say 0-14 would have 4 bits. So, for example, the string 00000110111111111111 breaks down into 0000, 0110, 1111, 11111111, or 0, 6, (15 = escape to 8-bit mode), 255. But there's a million ways to do it; the key idea is using more info for rare data, and the relation between low probability and high information.
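
Here is a small Python sketch of that corrected scheme (my own illustration): the values 0-14 get a plain 4-bit code, and 1111 acts as the escape symbol announcing that the next 8 bits carry the full value. It reproduces the example string above.

[code]
ESCAPE = "1111"   # the 4-bit pattern for 15, reused to mean "next 8 bits follow"

def encode(values):
    """Encode each value 0-255: 4 bits if it is 0-14, otherwise escape + 8 bits."""
    out = []
    for v in values:
        if 0 <= v <= 14:
            out.append(format(v, "04b"))
        else:
            out.append(ESCAPE + format(v, "08b"))
    return "".join(out)

def decode(bits):
    values, i = [], 0
    while i < len(bits):
        chunk = bits[i:i + 4]
        if chunk == ESCAPE:
            values.append(int(bits[i + 4:i + 12], 2))   # 8-bit mode
            i += 12
        else:
            values.append(int(chunk, 2))                # plain 4-bit value
            i += 4
    return values

print(encode([0, 6, 255]))                # 00000110111111111111
print(decode("00000110111111111111"))     # [0, 6, 255]
[/code]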
 

1. What is Shannon entropy?

Shannon entropy is a measure of the uncertainty or randomness in a system. It was developed by Claude Shannon in the 1940s as a way to quantify the amount of information in a message.

2. How is Shannon entropy calculated?

Shannon entropy is calculated using the formula H = -Σ(Pi * log2(Pi)), where Pi is the probability of each possible outcome in a system. It can be applied to a variety of systems, such as language, genetics, and computer data.

3. What does a higher Shannon entropy value indicate?

A higher Shannon entropy value indicates a system with more uncertainty or randomness. This means that there are more possible outcomes or variations in the system, and it is more difficult to predict or compress the information.

4. How is Shannon entropy used in information theory?

Shannon entropy is a fundamental concept in information theory, which is the study of how information is processed, transmitted, and stored. It is used to measure the amount of information in a message or data set, and to analyze the efficiency and reliability of communication systems.

5. Can Shannon entropy be applied to non-technical fields?

Yes, Shannon entropy can be applied to a wide range of fields, including linguistics, biology, and social sciences. It can be used to analyze the complexity and diversity in language, the genetic diversity in a population, and the diversity of opinions in a social network.
