The Entropy of Information Source

Summary
The discussion centers on Shannon's Theorem 4 regarding the entropy of an information source, specifically how to interpret n(q) and log n(q) in the context of sequence probabilities. n(q) is the number of the most probable sequences needed to accumulate a total probability q, and log n(q) is the number of bits required to specify one of those sequences. As the sequence length N approaches infinity, the ratio of log n(q) to N converges to the entropy H of the information source. The theorem implies that the number of bits needed to represent a sequence is at most about NH, reinforcing that H is less than or equal to the logarithm of the alphabet size. This highlights the significance of focusing on the most probable sequences, while the less probable sequences are effectively neglected in this context.
iVenky
In the paper "A Mathematical Theory of Communication" by Shannon, I couldn't understand a theorem (Theorem 4) under the topic "The Entropy of an Information Source":
Consider again the sequences of length N and let them be arranged in order of decreasing probability. We define n(q) to be the number we must take from this set starting with the most probable one in order to accumulate a total probability q for those taken.

Theorem 4:
\lim_{N \to \infty} \frac{\log n(q)}{N} = H

when q does not equal 0 or 1.
We may interpret log n(q) as the number of bits required to specify the sequence when we consider only the most probable sequences with a total probability q.

Here 'H' is the entropy.


Can you explain to me how we get this theorem?
What is this n(q)? I couldn't understand the sentence quoted above that defines it. How is log n(q) the number of bits required to specify the sequence?

Here's his paper-

http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

It's under the topic "The Entropy of an Information Source" on pages 13 and 14. He has written the proof in the appendix, but I couldn't understand it.


Thanks a lot.
 
I was hoping someone who has actually gone through the derivation would jump in here, but I guess not...so I'll give it a little whack... this theorem is a step in getting to the (more understandable?) Shannon Entropy equation.

Let's say you decide to count all the four-letter words found on the internet, so:
N = 4.

For each word you count the number of times it is found and then calculate the probability of its being found as:
p = number-found / total-words
where p will be between 0 and 1. So if you counted 10 words in total and "word" occurred 5 times, its p = 0.5.

Then you sort that list by probability such that the most frequently found word is first and the least is last. This gives you the list specified in the first sentence.

Since it is a gigantic list with a lot of very improbable entries, you are probably -- no pun intended -- just interested in the most frequent ones, so you might decide to eliminate all but the top 80%, i.e., your q = 0.80. You go down the list collecting entries and summing the associated probabilities until you get a total of 0.80. The number of entries you collect is then n(q). Notice that this doesn't mean n*q; it might better have been notated as nq.
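As a rough illustration of that accumulation step (my own sketch in Python, not from the paper; the corpus is a made-up stand-in):

from collections import Counter

# Made-up stand-in corpus of four-letter words (N = 4).
corpus = ["word", "that", "with", "word", "this", "that", "word", "have", "word", "that"]

counts = Counter(corpus)
total = sum(counts.values())
# Sort the probabilities in decreasing order, as in Shannon's construction.
probs = sorted((c / total for c in counts.values()), reverse=True)

def n_q(probs, q):
    """Count entries taken from the top of the sorted list until their
    probabilities sum to at least q; that count is n(q)."""
    running = 0.0
    for i, p in enumerate(probs, start=1):
        running += p
        if running >= q:
            return i
    return len(probs)

print(n_q(probs, 0.80))   # number of most-probable words covering 80% of the mass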

Then... you take the log2 of your nq number. That it's the log to base 2 may not have been obvious; you can use any base, but base 2 is what Shannon uses all along, and it's the one that gives you "bits".

For instance, if your nq == 32, meaning that you selected the 32 most popular words from your giant list, then log2(nq) = 5 bits. This means that you can represent each of those 32 words as a unique 5 bit number, and, if you ignore all the other words in the big list, every sentence made of 4-letter words can be encoded as a sequence of 5-bit numbers. This is a danged good compression ratio when compared to the number of bits one would need to send the raw text as alphabetic characters.
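To make the 32-words / 5-bits point concrete, here is a tiny sketch (the word list is purely hypothetical):

import math

# Hypothetical list of the 32 most frequent four-letter words kept after truncating at q.
top_words = [f"w{i:02d}" for i in range(32)]
bits = math.ceil(math.log2(len(top_words)))            # 5 bits suffice to index 32 entries
codebook = {w: format(i, f"0{bits}b") for i, w in enumerate(top_words)}
print(bits, codebook["w03"])                           # -> 5 00011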

In a more(or less?) intuitive mode, think of all the individual values between 0 and 255. There's 256 of em. And they can each be represented by an 8-bit number. Magically enough log2(256) == 8 !

Now, getting H by normalizing by N as N->infinity is a little more mysterious than my poor brain can handle right now. Maybe someone else can explain that part...
 
What I know is this: Shannon proved that the number of bits required to represent a sequence of length N is at most NH bits (where H is the entropy). This in turn means that the number of bits required to represent each letter in the sequence is at most NH/N = H bits, and note that

H ≤ log(total number of letters)

Ideally speaking, as you said, the number of bits required to represent a letter should have been log(total number of letters), but according to Shannon's theorem it is only H, and H will always be less than or equal to log(total number of letters).
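For instance, a quick check of that inequality with a made-up skewed distribution over a 4-letter alphabet:

from math import log2

probs = [0.7, 0.15, 0.1, 0.05]              # hypothetical letter probabilities
H = -sum(p * log2(p) for p in probs)        # entropy in bits per letter
print(H, log2(len(probs)))                  # H is about 1.32 bits, versus log2(4) = 2 bits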
This is what he seems to have proved in the expression that I asked about in the question. But there he has taken only n(q) sequences and left out the least probable ones.
Does that mean we have completely neglected those least probable sequences? We should represent them using bits too, right? What would you do for those?

Thanks a lot for your help
 
I totally figured that out!

Actually, he proved in his paper that (I can follow this derivation; it's in the paper)

\lim_{N \to \infty} \frac{\log(p^{-1})}{N} = H

where p is the probability of a typical sequence of length N. It looks like, as N → ∞, all the likely sequences have roughly the same probability p. This can be easily understood if you see that derivation.
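A minimal sketch of why that holds, assuming an i.i.d. source with symbol probabilities p_i (Shannon proves it more generally for ergodic sources): in a long typical sequence, symbol i appears about N p_i times, so

p \approx \prod_i p_i^{N p_i}, \qquad \frac{\log p^{-1}}{N} \approx -\sum_i p_i \log p_i = H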

Now if you define n(q) as the number of sequences you take until the sum of their probabilities equals q, then we have

n(q) * p ≈ q (we have assumed that as N → ∞ all the likely sequences have the same probability p; if you see that derivation in his paper you will get this)

Taking logarithms,

log n(q) = log q + log(p^{-1})

and since q is a fixed constant (not 0 or 1), the term log q / N goes to 0 as N → ∞, which means

lim log n(q) / N = lim log(p^{-1}) / N

When you substitute the expression above, you get the result!

What I understand from this is that when you have an ergodic process (a Markov-chain kind of thing), then as N → ∞ some sequences become so improbable that you need not worry about them at all. This also means that the number of bits required to represent the sequences that do matter is not log(total sample space) but rather just about NH.
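Here is a numerical check of the theorem (my own sketch, not from the paper), for a memoryless binary source that emits 1 with probability p = 0.3 and an arbitrary q = 0.5; sequences with fewer ones are more probable, so n(q) can be accumulated one "type class" at a time:

import math

def n_of_q(N, p, q):
    """n(q) for an i.i.d. binary source emitting 1 with probability p < 0.5:
    count the most probable length-N sequences until their total probability reaches q."""
    total_prob, count = 0.0, 0
    for k in range(N + 1):                                   # k = number of ones; fewer ones = more probable
        n_k = math.comb(N, k)                                # sequences with exactly k ones
        log_seq_prob = k * math.log(p) + (N - k) * math.log(1 - p)
        class_prob = n_k * math.exp(log_seq_prob)            # total mass of this class
        if total_prob + class_prob >= q:
            needed = math.ceil((q - total_prob) / math.exp(log_seq_prob))
            return count + needed                            # take just enough from this class
        total_prob += class_prob
        count += n_k
    return count

p, q = 0.3, 0.5
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))         # entropy, about 0.881 bits/symbol
for N in (50, 200, 800):
    print(N, math.log2(n_of_q(N, p, q)) / N, H)              # the ratio creeps up toward H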

Thanks a lot for your help.
 
Thank you for your questions. I will do my best to explain the theorem and its implications for the entropy of an information source.

First, let's define some terms that are used in this theorem. An information source is a system that produces a sequence of symbols or messages. In Shannon's paper, he uses the term "sequence" to refer to a series of symbols, which could be letters, numbers, or any other type of symbol. The length of a sequence is denoted by N.

The probability of a sequence refers to the likelihood of that sequence occurring, i.e., of it being produced by the information source. Shannon does not assume that all sequences of length N are equally probable; in general some sequences are far more likely than others, and the theorem describes how, as N grows, the probability concentrates on a comparatively small set of roughly equiprobable sequences.

Now, let's look at the definition of n(q) in the theorem. This refers to the number of sequences that we need to take from the set of all possible sequences of length N, starting with the most probable one, in order to accumulate a total probability of q. In other words, n(q) represents the number of sequences that are needed to make up a proportion of q of the total probability.

For example, if we have a set of 8 equally likely sequences of length 3, each with probability 1/8, then n(1/2) would be 4, because we need to take the first 4 sequences (1/8 + 1/8 + 1/8 + 1/8) to accumulate a total probability of 1/2.

Now, let's look at the theorem itself. It states that as N approaches infinity, the ratio of log n(q) to N approaches the entropy of the information source (H). In other words, as the sequences get longer and longer, the number of bits required to specify one of the high-probability sequences, log n(q), grows like NH, so the number of bits per symbol approaches H.

This makes sense intuitively if we think about the definition of entropy. Entropy is a measure of the uncertainty or randomness of a system. In this context, it represents the average number of bits needed per symbol to specify a sequence from the information source. As the sequences get longer, the total number of bits needed (about NH) grows in proportion, while the per-symbol rate settles down to H.
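For reference, the entropy H being discussed here is Shannon's

H = -\sum_i p_i \log_2 p_i

where the p_i are the symbol probabilities of the source.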

To understand the proof of this theorem, it might be helpful to look at the appendix of Shannon's paper, where he provides the detailed argument.
 