Help understanding Shannon entropy

In summary, Shannon entropy is a measure of the average number of bits needed to describe a random variable. It is related to the probability distribution of the variable and can be used to find the most efficient way to encode and transmit information about the variable. The encoding method used in Shannon coding theory ensures that the receiver can accurately decode the transmitted information.
  • #1
voila
Hi
I'm having some trouble understanding Shannon entropy and its relation to "computer" bits (zeros and ones). Shannon entropy of a random variable is (assume b=2 so that we work in bits)
[itex]H(X)=-\sum_{i} p(x_{i})\log_{2}p(x_{i})[/itex]

and everywhere I've read says it is "the number of bits on the average required to describe the random variable". Yet I can't seem to find a proper demonstration of the connection between that sentence and the equation.

I've also found an example (Example 1.1.2 in Elements of Information Theory, Cover & Thomas), which reads:

"Suppose we have a horse race with eight horses taking part. The probabilities of winning for the eight horses are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). The entropy of the horse race is (shows the calculation and equals) 2 bits. Suppose we wish to send a message indicating which horse won the race. One alternative is to send the index of the winning horse . This description requires 3 bits for any of the horses. But the probabilities are not uniform, so it makes sense to use shorter descriptions for the more probable horses, and viceversa, so that we achieve a lower average description length. For example, we could use:

0, 10, 110, 1110, 111100, 111101, 111110, 111111.

The average description length in this case is 2 bits."

Alright, it works out in this case. But why does it have to be that particular code? Why can't I use 1 for the second horse, 01 for the third, and so on, reducing the average description length?

Apart from that, there's another example in the same book (Example 2.1.1) where we have a variable that takes the value 1 with probability p, and the value 0 with probability 1-p. If p is different from 1/2 (that is, if one outcome is more probable), the Shannon entropy is less than one bit. How can I describe it with less than one bit? It makes no sense. If it means "you could just not send a signal for the more probable outcome", wouldn't having a "not sending a signal" possibility effectively mean working with trits?
And once again, how does Shannon entropy relate to all this?
 
  • #2
Let's keep things general.

You have a random variable [itex]X[/itex] with [itex]N[/itex] possible outcomes.
Each outcome [itex]X_{i}[/itex] occurs with probability [itex]P(X_{i})[/itex], and these probabilities sum to 1 over all outcomes.

Now, let's say you want to communicate the outcome of this random variable to someone else.
The simplest thing you could do is number the outcomes from 1 to N, express those numbers in binary, and send the binary number corresponding to the outcome that happened.
The number of binary digits, or "bits", you need to express each of these numbers is [itex]\log_{2}(N)[/itex], rounded up to the nearest whole number.

If all outcomes were equally likely, this would also be the best you could do. But if you know what [itex]P(X)[/itex] is, and it is not completely uniform, there are more efficient ways.

It has to do with what's called the law of large numbers (or, in that Cover & Thomas book, the Asymptotic Equipartition Property).
Over many trials of outcomes of [itex]X[/itex], you will find that the relative frequencies of each outcome grow to closely resemble the probability of that outcome. These sequences of outcomes that resemble the probability distributions they came from are called "typical".
In the limit of a large number of trials, the probability that your sequence is "typical" becomes overwhelmingly close to 1.

So, with a vanishingly small probability of error, you don't have to have unique numbers for every string of outcomes. You just have to have a unique number for every "typical" string of outcomes.

And, the number of bits you need per outcome in a typical string is equal to the Shannon entropy of [itex]X[/itex].

[itex]H(X)=-\sum_{i} P(X_{i})\log_{2}(P(X_{i}))[/itex]
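
For concreteness, here is a minimal Python sketch (my own addition, not part of the original post) that evaluates this formula for the horse-race distribution from post #1 and compares it with the average lengths of the two codes quoted there:

[code]
from math import log2

# Winning probabilities for the eight horses (Cover & Thomas, Example 1.1.2)
p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

# Shannon entropy in bits: H(X) = -sum_i p_i * log2(p_i)
H = -sum(pi * log2(pi) for pi in p)

# Average description length of the plain 3-bit index code
# versus the variable-length code from the example
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]
avg_fixed = sum(pi * 3 for pi in p)
avg_variable = sum(pi * len(c) for pi, c in zip(p, codes))

print(H, avg_fixed, avg_variable)   # 2.0  3.0  2.0
[/code]

The entropy and the average length of the clever code both come out to 2 bits, while the naive fixed-length code needs 3 bits per message.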
 
  • #4
Thank you for your responses, jfizzix & atyy. The chapter you showed me, atyy, helped a bit more (no pun intended) in understanding how Shannon information and entropy work. However, my questions remain:
1) On the first example from my first post, why do we have to choose that description for the horses so the result matches Shannon entropy, instead of choosing a shorter one?
2) On the second example, how could you describe the outcome with less than a bit?
 
  • #5
voila said:
"Suppose we have a horse race with eight horses taking part. The probabilities of winning for the eight horses are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). The entropy of the horse race is (shows the calculation and equals) 2 bits. Suppose we wish to send a message indicating which horse won the race. One alternative is to send the index of the winning horse . This description requires 3 bits for any of the horses. But the probabilities are not uniform, so it makes sense to use shorter descriptions for the more probable horses, and viceversa, so that we achieve a lower average description length. For example, we could use:

0, 10, 110, 1110, 111100, 111101, 111110, 111111.

The average description length in this case is 2 bits."

Alright, it works out in this case. But why does it have to be that particular code? Why can't I use 1 for the second horse, 01 for the third, and so on, reducing the average description length?

The idea in Shannon coding theory is that you're encoding and sending a constant stream of messages drawn randomly in an i.i.d. fashion, so someone receiving your messages needs to be able to tell where one message ends and the next begins. Both the example encodings described here make this possible: in the first, the receiver will know that each encoded message is exactly three bits long. In the second, each encoded message is either 1) up to four bits long and ends in a zero or 2) exactly six bits long and the first four bits are all ones.
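
To make the decodability point concrete, here is a small Python sketch (my own illustration, not from the thread): it decodes a concatenated stream with the book's prefix-free code by greedy matching, and the closing comment shows why the proposed shorter code fails.

[code]
# The book's code is "prefix-free": no codeword is a prefix of another,
# so a concatenated stream can be split unambiguously, left to right.
book_code = {"0": 1, "10": 2, "110": 3, "1110": 4,
             "111100": 5, "111101": 6, "111110": 7, "111111": 8}

def decode(stream, code):
    """Greedy prefix decoding: emit a horse index whenever the buffer matches a codeword."""
    out, buf = [], ""
    for bit in stream:
        buf += bit
        if buf in code:
            out.append(code[buf])
            buf = ""
    if buf:
        raise ValueError("stream ended in the middle of a codeword")
    return out

# Winners of five races: horses 1, 3, 1, 8, 2
stream = "0" + "110" + "0" + "111111" + "10"
print(decode(stream, book_code))   # [1, 3, 1, 8, 2]

# A code like 0, 1, 01, ... is not prefix-free: "0" is a prefix of "01",
# so the stream "01" could mean (horse 1, horse 2) or just horse 3 --
# the receiver cannot tell where one message ends and the next begins.
[/code]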

Apart from that, there's another example in the same book (Example 2.1.1) where we have a variable that takes the value 1 with probability p, and the value 0 with probability 1-p. If p is different from 1/2 (that is, if one outcome is more probable), the Shannon entropy is less than one bit. How can I describe it with less than one bit? It makes no sense. If it means "you could just not send a signal for the more probable outcome", wouldn't having a "not sending a signal" possibility effectively mean working with trits?
And once again, how does Shannon entropy relate to all this?

Like I said, the idea of Shannon coding theory is that you want to communicate a constant stream of these variables drawn randomly. You can encode these more efficiently by waiting until you have multiple values that you want to send and encoding them together. For instance, if you wait until you have 1000 bits to send instead of just one, then there are [itex]2^{1000}[/itex] possible sequences of 1000 bits, but the vast majority of the time you will get sequences with close to 1000p ones, and you can prioritise giving these the shorter encodings.
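
Here is a rough Python sketch of that idea (my own illustration, not from the thread): it builds a Huffman code over blocks of n flips of a coin with P(1) = p and shows the expected number of bits per flip creeping down toward the entropy H(p), which is well below 1 bit when p is far from 1/2.

[code]
import heapq
from itertools import product
from math import log2

def huffman_lengths(probs):
    """Return the codeword length a Huffman code assigns to each symbol."""
    # Heap entries: (probability, tie-breaking id, leaf indices under this node)
    heap = [(q, i, [i]) for i, q in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        qa, _, a = heapq.heappop(heap)
        qb, _, b = heapq.heappop(heap)
        for leaf in a + b:          # each merge adds one bit to every leaf below it
            lengths[leaf] += 1
        heapq.heappush(heap, (qa + qb, counter, a + b))
        counter += 1
    return lengths

p = 0.1                                # P(flip = 1); H(p) is well below one bit
H = -p * log2(p) - (1 - p) * log2(1 - p)

for n in (1, 4, 8, 12):
    blocks = list(product([0, 1], repeat=n))                      # all 2^n blocks
    probs = [p ** sum(b) * (1 - p) ** (n - sum(b)) for b in blocks]
    lengths = huffman_lengths(probs)
    bits_per_flip = sum(q * L for q, L in zip(probs, lengths)) / n
    print(f"n={n}: {bits_per_flip:.3f} bits per flip vs H(p) = {H:.3f}")
[/code]

For n = 1 you are stuck at 1 bit per flip, but as the blocks get longer the average drops toward H(p), which is the "less than one bit per outcome" claim in practice.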

In general, when the Shannon or a related entropy is proved to mean something, the proof usually only applies exactly in the asymptotic limit of infinitely many repetitions and assuming the messages are drawn randomly in an i.i.d. fashion. If you literally only have one bit you want to send then this is very far from the type of situation that Shannon coding theory applies to.
 
  • #6
I see. I hadn't taken into account that the messages are sent together as a stream and have to be made distinguishable for the receiver. Thank you very much!
 
  • #7
This is one of my favorite things; it really changed my worldview when I got it. My advice for intuitively grasping it is to forget about the information stream it describes to start, and just say that the information needed to express something is the log2 of the reciprocal of its probability. So suppose we have numbers 0-255 making up a message, each occurring with equal probability. Then the information each needs to be expressed is log2(256/1) = 8 bits = 1 byte. As you know, a byte can encode the numbers 0-255, so it's simple. But suppose only the numbers 0-15 occur, with equal probability, and the rest now have zero probability. Now we don't need a byte to encode each number; we need log2(16/1) = 4 bits, so we can fit 2 such numbers to the byte, and 100 of these numbers takes half the information of 100 of the previous.
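
As a quick sanity check of that rule, here is a tiny Python snippet (my own addition, not part of the original post):

[code]
from math import log2

# Information (in bits) needed to express an outcome of probability p: log2(1/p)
def info_bits(p):
    return log2(1 / p)

print(info_bits(1 / 256))   # 8.0 -- one of 256 equally likely numbers takes a full byte
print(info_bits(1 / 16))    # 4.0 -- one of 16 equally likely numbers takes half a byte
[/code]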

Now, imagine a message where the numbers 0-15 occur very regularly, but the numbers 16-255 occur incredibly rarely. We could use a byte for each, but we can use way less information on average if we make a 4-bit symbol for the numbers 0-15, along with a 4-bit escape symbol that signifies 8-bit mode for the next number, in the rare event that it is greater than 15. This encoding will use just over half the information of the full 8 bits on average, if the numbers greater than 15 are truly rare.

Once you understand that, go back and look at Shannon entropy (remember -log(p) = log(1/p); I use the latter, Shannon the former). When it hits you, you see that probability and information are inseparable concepts. I personally find it really beautiful.

Another thing to remember is that Shannon entropy can only provide an upper bound for the information in a message; the actual information content is the Kolmogorov complexity, which is not computable.
 
  • #8
Fooality, in that example where only the numbers 0-15 occur, so that 4 bits identify each of them and another 4-bit escape symbol indicates that the next number is in 8-bit mode, why would we need 4 whole extra bits for the escape and not just one?

Maybe you were just using that as an example and we actually could use a single bit.
 
  • #9
voila said:
Fooality, in that example where only the numbers 0-15 occur, so that 4 bits identify each of them and another 4-bit escape symbol indicates that the next number is in 8-bit mode, why would we need 4 whole extra bits for the escape and not just one?

Maybe you were just using that as an example and we actually could use a single bit.

That was kind of a quick and dirty example; I think I meant to say 0-14 would have 4 bits. So, for example, the string 00000110111111111111 breaks down into 0000, 0110, 1111, 11111111, or 0, 6, (15 = escape to 8-bit mode), 255. But there's a million ways to do it; the key idea is using more info for rare data, and the relation between low probability and high information.
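
Here is a small Python sketch of that corrected scheme (my own illustration): the values 0-14 get a plain 4-bit code, and 1111 acts as the escape symbol announcing that the next 8 bits carry the full value. It reproduces the example string above.

[code]
ESCAPE = "1111"   # the 4-bit pattern for 15, reused to mean "next 8 bits follow"

def encode(values):
    """Encode each value 0-255: 4 bits if it is 0-14, otherwise escape + 8 bits."""
    out = []
    for v in values:
        if 0 <= v <= 14:
            out.append(format(v, "04b"))
        else:
            out.append(ESCAPE + format(v, "08b"))
    return "".join(out)

def decode(bits):
    values, i = [], 0
    while i < len(bits):
        chunk = bits[i:i + 4]
        if chunk == ESCAPE:
            values.append(int(bits[i + 4:i + 12], 2))   # 8-bit mode
            i += 12
        else:
            values.append(int(chunk, 2))                # plain 4-bit value
            i += 4
    return values

print(encode([0, 6, 255]))                # 00000110111111111111
print(decode("00000110111111111111"))     # [0, 6, 255]
[/code]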
 

1. What is Shannon entropy?

Shannon entropy is a measure of the uncertainty or randomness in a system. It was developed by Claude Shannon in the 1940s as a way to quantify the amount of information in a message.

2. How is Shannon entropy calculated?

Shannon entropy is calculated using the formula H = -Σ(Pi * log2(Pi)), where Pi is the probability of each possible outcome in a system. It can be applied to a variety of systems, such as language, genetics, and computer data.

3. What does a higher Shannon entropy value indicate?

A higher Shannon entropy value indicates a system with more uncertainty or randomness. This means that there are more possible outcomes or variations in the system, and it is more difficult to predict or compress the information.

4. How is Shannon entropy used in information theory?

Shannon entropy is a fundamental concept in information theory, which is the study of how information is processed, transmitted, and stored. It is used to measure the amount of information in a message or data set, and to analyze the efficiency and reliability of communication systems.

5. Can Shannon entropy be applied to non-technical fields?

Yes, Shannon entropy can be applied to a wide range of fields, including linguistics, biology, and social sciences. It can be used to analyze the complexity and diversity in language, the genetic diversity in a population, and the diversity of opinions in a social network.
