Help understanding Shannon entropy


Discussion Overview

The discussion centers around understanding Shannon entropy, its mathematical formulation, and its implications for encoding information in bits. Participants explore examples from information theory, particularly focusing on how entropy relates to the average number of bits required to describe outcomes of random variables.

Discussion Character

  • Exploratory
  • Technical explanation
  • Conceptual clarification
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant expresses confusion about the connection between Shannon entropy and the average number of bits needed to describe a random variable, questioning why certain encoding choices are necessary.
  • Another participant explains that the number of bits needed to express outcomes is related to the logarithm of the number of outcomes, and that more efficient encoding can be achieved if the probabilities of outcomes are known.
  • There is a discussion about the law of large numbers and the concept of typical sequences, suggesting that in large samples, the Shannon entropy reflects the average bits needed per outcome.
  • Participants discuss specific examples from a textbook, questioning how to encode outcomes with less than one bit and the implications of doing so.
  • One participant clarifies that Shannon coding theory applies when sending a stream of messages, emphasizing the need for distinguishable messages and the efficiency gained by encoding multiple values together.
  • Another participant shares a personal insight about the intuitive understanding of Shannon entropy, suggesting a focus on the relationship between information and probability.

Areas of Agreement / Disagreement

Participants express differing views on the necessity of specific encoding methods and the interpretation of Shannon entropy in practical scenarios. The discussion remains unresolved regarding the implications of encoding with less than one bit and the conditions under which Shannon entropy applies.

Contextual Notes

Participants note that the proofs and applications of Shannon entropy typically assume a large number of trials and independent identically distributed (i.i.d.) messages, which may not apply in all situations discussed.

Who May Find This Useful

This discussion may be useful for those interested in information theory, coding theory, and the mathematical foundations of entropy, particularly in the context of communication and data encoding.

voila
Hi
I'm having some trouble understanding Shannon entropy and its relation to "computer" bits (zeros and ones). Shannon entropy of a random variable is (assume b=2 so that we work in bits)
H(X)=-\sum_{i} P(X_{i})\log_{2}(P(X_{i}))

and everywhere I've read says it is "the number of bits on the average required to describe the random variable". Yet I can't seem to find a proper demonstration of the connection between that sentence and the equation.

I've also found an example (Example 1.1.2 on Elements of Information Theory, Cover & Thomas) which reads:

"Suppose we have a horse race with eight horses taking part. The probabilities of winning for the eight horses are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). The entropy of the horse race is (shows the calculation and equals) 2 bits. Suppose we wish to send a message indicating which horse won the race. One alternative is to send the index of the winning horse. This description requires 3 bits for any of the horses. But the probabilities are not uniform, so it makes sense to use shorter descriptions for the more probable horses, and vice versa, so that we achieve a lower average description length. For example, we could use:

0, 10, 110, 1110, 111100, 111101, 111110, 111111.

The average description length in this case is 2 bits."

Alright, in this case. But why does it have to be that case? Why can't I use a 1 for the second horse, 01 for the third and so on, reducing the average description length?

Apart from that, there's another example in the same book (Example 2.1.1) where we have a variable that takes the value 1 with probability p, and the value 0 with probability 1-p. If p is different from 1/2 (that is, if one outcome is more probable), the Shannon entropy is less than one bit. How can I describe it with less than one bit? It makes no sense. If it means "you could just not send a signal for the more probable outcome", wouldn't having a "not sending a signal" possibility effectively mean working with trits?
And once again, how does Shannon entropy relate to all this??
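The quoted figures can be verified directly; this sketch uses only the probabilities and codewords given in the textbook example above, nothing new:

```python
import math

# Winning probabilities from the quoted example (Cover & Thomas, Example 1.1.2)
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

# Shannon entropy in bits: H(X) = -sum_i p_i log2 p_i
H = -sum(p * math.log2(p) for p in probs)

# The prefix code quoted above, and its average description length
code = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]
avg_len = sum(p * len(c) for p, c in zip(probs, code))

print(H, avg_len)  # both come out to exactly 2.0 bits
```

Both the entropy and the average codeword length of that particular code equal 2 bits, which is what makes the example work.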
 
Let's keep things general.

You have a random variable X with N possible outcomes.
Each outcome X_{i} occurs with probability P(X_{i}), and these probabilities sum to 1.

Now, let's say you want to communicate the outcome of this random variable to someone else.
The simplest thing you could do is number the outcomes from 1 to N and express them in binary, and send the binary number related to the outcome that happened.
The number of binary digits, or "bits" you need to express each of these numbers is \log_{2}(N), rounded up to the nearest whole number.

If all outcomes were equally likely, this would also be the best you could do; but if you know what P(X) is, and it is not uniform, there are more efficient ways.

It has to do with what's called the law of large numbers (or, in that Cover & Thomas book, the Asymptotic Equipartition Property).
Over many trials of outcomes of X, you will find that the relative frequencies of each outcome grow to closely resemble the probability of that outcome. These sequences of outcomes that resemble the probability distributions they came from are called "typical".
In the limit of a large number of trials, the probability that your sequence is "typical" becomes overwhelmingly close to 1.

So, with a vanishingly small probability of error, you don't have to have unique numbers for every string of outcomes. You just have to have a unique number for every "typical" string of outcomes.

And, the number of bits you need per outcome in a typical string is equal to the Shannon entropy of X.

H(X)=-\sum_{i} P(X_{i})\log_{2}(P(X_{i}))
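A quick numerical illustration of the typicality argument above (a sketch; the bias p = 0.2 and the sample size are arbitrary choices, not from the thread): over a long i.i.d. sequence, the negative log-probability per outcome concentrates at H(X):

```python
import math
import random

random.seed(0)
p = 0.2  # assumed bias for the example; any non-uniform value works

# Binary Shannon entropy H(X) in bits
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Draw a long i.i.d. sequence and measure -log2(probability) per outcome
n = 100_000
sample = [random.random() < p for _ in range(n)]
per_outcome = -sum(math.log2(p if x else 1 - p) for x in sample) / n

print(H, per_outcome)  # the two agree to a couple of decimal places
```

This is the sense in which a long sequence is "typical": its per-outcome information content is almost certainly close to H(X).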
 
Thank you for your responses, jfizzix & atyy. The chapter you showed me, atyy, helped a bit more (no pun intended) in understanding how Shannon information and entropy work. However, my questions remain:
1) On the first example from my first post, why do we have to choose that description for the horses so the result matches Shannon entropy, instead of choosing a shorter one?
2) On the second example, how could you describe the outcome with less than a bit?
 
voila said:
"Suppose we have a horse race with eight horses taking part. The probabilities of winning for the eight horses are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). The entropy of the horse race is (shows the calculation and equals) 2 bits. Suppose we wish to send a message indicating which horse won the race. One alternative is to send the index of the winning horse. This description requires 3 bits for any of the horses. But the probabilities are not uniform, so it makes sense to use shorter descriptions for the more probable horses, and vice versa, so that we achieve a lower average description length. For example, we could use:

0, 10, 110, 1110, 111100, 111101, 111110, 111111.

The average description length in this case is 2 bits."

Alright, in this case. But why does it have to be that case? Why can't I use a 1 for the second horse, 01 for the third and so on, reducing the average description length?

The idea in Shannon coding theory is that you're encoding and sending a constant stream of messages drawn randomly in an i.i.d. fashion, so someone receiving your messages needs to be able to tell where one message ends and the next begins. Both the example encodings described here make this possible: in the first, the receiver will know that each encoded message is exactly three bits long. In the second, each encoded message is either 1) up to four bits long and ends in a zero or 2) exactly six bits long and the first four bits are all ones.
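That self-punctuating property can be sketched with a greedy decoder for the horse-race code quoted above; because no codeword is a prefix of another, the first codeword matched is always the right one:

```python
# Codewords for horses 0-7, from the quoted example
code = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]
decode_table = {c: i for i, c in enumerate(code)}

def decode_stream(bits):
    """Split a concatenated bit string back into horse indices."""
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in decode_table:  # prefix-free: first match is unambiguous
            out.append(decode_table[cur])
            cur = ""
    return out

# Horses 0, 2, 7, 1 encoded back-to-back with no separators
print(decode_stream("0" + "110" + "111111" + "10"))  # [0, 2, 7, 1]
```

The alternative code proposed in the first post (1 for the second horse, 01 for the third, ...) fails exactly here: a received stream of bits could be split into messages in more than one way.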

Apart from that, there's another example in the same book (Example 2.1.1) where we have a variable that takes the value 1 with probability p, and the value 0 with probability 1-p. If p is different from 1/2 (that is, if one outcome is more probable), the Shannon entropy is less than one bit. How can I describe it with less than one bit? It makes no sense. If it means "you could just not send a signal for the more probable outcome", wouldn't having a "not sending a signal" possibility effectively mean working with trits?
And once again, how does Shannon entropy relate to all this??

Like I said, the idea of Shannon coding theory is that you want to communicate a constant stream of these variables drawn randomly. You can encode these more efficiently by waiting until you have multiple values that you want to send and encoding them together. For instance, if you wait until you have 1000 bits to send instead of just one, then there are 2^1000 possible sequences of 1000 bits, but the vast majority of the time you will get sequences with close to 1000p ones, and you can prioritise giving these the shorter encodings.
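The concentration claim can be checked against the binomial distribution (the value p = 0.1 below is an illustrative choice, not from the thread): almost all of the probability falls on sequences whose number of ones is close to 1000p, and there are far fewer such sequences than 2^1000:

```python
import math

n, p = 1000, 0.1  # assumed parameters for illustration

# Sequences whose count of ones lies within 3 standard deviations of n*p
mean, std = n * p, math.sqrt(n * p * (1 - p))
lo, hi = int(mean - 3 * std), int(mean + 3 * std)

# Total probability carried by those sequences
prob = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))

# log2 of how many such sequences exist, versus n bits for all 2^n of them
count_bits = math.log2(sum(math.comb(n, k) for k in range(lo, hi + 1)))

print(prob)          # nearly all the probability (about 0.997)
print(count_bits, n) # far fewer bits needed to index these than the full n
```

So a code that only assigns short labels to these likely sequences almost never fails, and needs well under 1000 bits per 1000-bit block.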

In general, when the Shannon or a related entropy is proved to mean something, the proof usually only applies exactly in the asymptotic limit of infinitely many repetitions and assuming the messages are drawn randomly in an i.i.d. fashion. If you literally only have one bit you want to send then this is very far from the type of situation that Shannon coding theory applies to.
 
I see. I hadn't taken into account that messages are sent as a stream and must be made distinguishable to the receiver. Thank you very much!
 
This is one of my favorite things; it really changed my worldview when I got it. My advice for intuitively grasping it is to forget about the information stream it describes to start, and just say that the information needed to express something is the log2 of the reciprocal of its probability. So suppose we have numbers 0-255 making up a message, each occurring with equal probability. Then the information each needs to be expressed is log2(256/1) = 8 bits = 1 byte. As you know, a byte can encode numbers 0-255, so it's simple. But suppose only numbers 0-15 occur with equal probability, and the rest now have zero probability. Now we don't need a byte to encode each number, we need log2(16/1) = 4 bits; we can fit 2 such numbers in a byte, so 100 of these numbers takes half the information of 100 of the previous.
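The rule in the paragraph above, stated as code (just the log2 of the reciprocal of the probability):

```python
import math

def info_bits(p):
    """Bits needed to express an outcome that occurs with probability p."""
    return math.log2(1 / p)

print(info_bits(1 / 256))  # 8.0: one of 256 equally likely numbers needs a byte
print(info_bits(1 / 16))   # 4.0: one of 16 equally likely numbers needs a nibble
```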

Now, imagine a message where the numbers 0-15 occur very regularly, but the numbers 16-255 occur incredibly rarely. We could use a byte for each, but we can use far less information on average if we make a 4-bit symbol for the numbers 0-15, along with a 4-bit escape symbol that signifies 8-bit mode for the next number, in the rare event that it is greater than 15. This encoding will use just over half the information of the full 8 bits on average, if the numbers greater than 15 are truly rare.
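Taking that scheme at face value, the "just over half" claim is easy to quantify (the escape probability q below is an assumed illustrative value, not from the post):

```python
# Escape scheme cost: 4 bits for a common number, 4 + 8 = 12 bits when escaped.
# q is the probability that a number needs the escape (assumed, for illustration).
q = 0.01
avg_bits = 4 * (1 - q) + 12 * q

print(avg_bits)  # about 4.1 bits on average, just over half of a flat 8
```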

Once you understand that, go back and look at Shannon entropy (remember -log(p) = log(1/p); I use the latter form, Shannon the former). When it hits you, you see that probability and information are inseparable concepts. I personally find it really beautiful.

Another thing to remember is that Shannon entropy can only provide an upper bound for the information in a message; the actual information is the Kolmogorov complexity, which is not computable.
 
Fooality, in that example where the numbers 0-15 are identified by the first 4 bits of each byte, with the next 4 bits indicating whether the next number is in 8-bit mode, why would we need all 4 extra bits, and not just one?

Maybe you were just using that as an example and we actually could use a single bit.
 
voila said:
Fooality, in that example where the numbers 0-15 are identified by the first 4 bits of each byte, with the next 4 bits indicating whether the next number is in 8-bit mode, why would we need all 4 extra bits, and not just one?

Maybe you were just using that as an example and we actually could use a single bit.

That was kind of a quick and dirty example; I think I meant to say 0-14 would have 4 bits, so for example the string 00000110111111111111 breaks down into 0000, 0110, 1111, 11111111, or 0, 6, (15 = escape to 8-bit mode) 255. But there are a million ways to do it; the key idea is using more info for rare data, and the relation between low probability and high information.
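A sketch of that corrected scheme (0-14 as 4-bit nibbles, 15 as the escape into 8-bit mode), reproducing the worked string above:

```python
ESCAPE = 15  # nibble value 15 means "the next 8 bits are a full byte"

def encode(nums):
    """Encode numbers 0-255: 0-14 as 4-bit nibbles, 15-255 escaped to 8 bits."""
    bits = ""
    for x in nums:
        if x < 15:
            bits += format(x, "04b")
        else:
            bits += format(ESCAPE, "04b") + format(x, "08b")
    return bits

def decode(bits):
    """Invert encode: read nibbles, switching to 8-bit mode after an escape."""
    out, i = [], 0
    while i < len(bits):
        nibble = int(bits[i:i + 4], 2)
        i += 4
        if nibble == ESCAPE:
            out.append(int(bits[i:i + 8], 2))
            i += 8
        else:
            out.append(nibble)
    return out

print(encode([0, 6, 255]))              # 00000110111111111111, as in the post
print(decode("00000110111111111111"))  # [0, 6, 255]
```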
 
