The Entropy of an Information Source

In summary: the maximum number of bits required to represent a sequence of length N is NH bits (where H is the entropy). This in turn means that the maximum number of bits required to represent each letter in the sequence is NH/N = H bits, and note that H ≤ log(total number of entries).
  • #1
iVenky
In the paper "A Mathematical Theory of Communication" by Shannon, I couldn't understand a theorem (Theorem 4) under the topic "The Entropy of an Information Source":
Consider again the sequences of length N and let them be arranged in order of decreasing probability. We define n(q) to be the number we must take from this set starting with the most probable one in order to accumulate a total probability q for those taken.

Theorem 4:
[tex]
\lim_{N \to \infty} \frac{\log n(q)}{N} = H
[/tex]

when q does not equal 0 or 1.
We may interpret log n(q) as the number of bits required to specify the sequence when we consider only the most probable sequences with a total probability q.

Here 'H' is the entropy.


Can you explain to me how we get this theorem?
What is this n(q)? I couldn't understand the sentence quoted above about interpreting log n(q). How is log n(q) the number of bits required to specify the sequence?

Here's his paper-

http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf [Broken]

It's under the topic "The Entropy of an Information Source" on pages 13 and 14. He has written the proof in the appendix, but I couldn't understand it.


thanks a lot
 
  • #2
I was hoping someone who has actually gone through the derivation would jump in here, but I guess not... so I'll give it a little whack. This theorem is a step in getting to the (more understandable?) Shannon entropy equation.

Let's say you decide to count all the four-letter words found on the internet, so:
N = 4.

For each word you count the number of times it is found and then calculate the probability of its being found as:
p = number-found / total-words
where p will be between 0 and 1. So if you counted 10 words in total and "word" occurred 5 times, its p = 0.5.

Then you sort that list by probability such that the most frequently found word is first and the least is last. This gives you the list specified in the first sentence.

Since it is a gigantic list with a lot of very improbable entries, you are probably -- no pun intended -- just interested in the most frequent ones, so you might decide to eliminate all but the top 80%, i.e., your q = 0.80. You go down the list collecting entries and summing the associated probabilities until you get a total of 0.80. The number of entries you collect is then n(q). Notice that this doesn't mean n*q; n(q) is a function of q, and it might better have been notated with a subscript, as nq.

Then you take the log2 of your nq number. That it's the log to base 2 may not have been obvious; you can use any base, but base 2 is what Shannon uses all along, and it's the only one that gives you "bits".

For instance, if your nq == 32, meaning that you selected the 32 most popular words from your giant list, then log2(nq) = 5 bits. This means that you can represent each of those 32 words as a unique 5 bit number, and, if you ignore all the other words in the big list, every sentence made of 4-letter words can be encoded as a sequence of 5-bit numbers. This is a danged good compression ratio when compared to the number of bits one would need to send the raw text as alphabetic characters.
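Here's a minimal Python sketch of that whole procedure, just to make it concrete (the word list and counts are made up for illustration, not real internet statistics):

[code]
from math import ceil, log2

# Hypothetical counts of four-letter words (made-up numbers, purely illustrative).
counts = {"that": 50, "with": 30, "this": 10, "from": 6, "have": 3, "word": 1}
total = sum(counts.values())

# Probability of each word, sorted most probable first.
probs = sorted((c / total for c in counts.values()), reverse=True)

def n_of_q(probs, q):
    """Walk down the sorted list until the accumulated probability reaches q."""
    acc = 0.0
    for i, p in enumerate(probs, start=1):
        acc += p
        if acc >= q:
            return i
    return len(probs)

nq = n_of_q(probs, 0.80)   # keep the most frequent words covering 80% of the probability
bits = ceil(log2(nq))      # bits needed to give each of those nq words a unique code
print(nq, bits)
[/code]

With the 32-word case above, n_of_q would return 32 and ceil(log2(32)) = 5, matching the 5-bit codes.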

In a more (or less?) intuitive mode, think of all the individual values between 0 and 255. There are 256 of 'em, and they can each be represented by an 8-bit number. Magically enough, log2(256) == 8!

Now, getting H by normalizing by N as N->infinity is a little more mysterious than my poor brain can handle right now. Maybe someone else can explain that part...
 
  • #3
What I know is this: Shannon proved that
the maximum number of bits required to represent a sequence of length N is NH bits (where H is the entropy). This in turn means that the maximum number of bits required to represent each letter in the sequence is NH/N = H bits, and note that

H ≤ log(total number of entries)

Ideally speaking, as you said, the number of bits required to represent a letter should have been log(total number of letters), but according to Shannon's theorem it is only H, and H will always be less than or equal to log(total number of letters).
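As a concrete check of that inequality (my own toy example, not from the paper): take a three-letter source with probabilities 1/2, 1/4, 1/4. Then

[tex]
H = \tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{4}\log_2 4 = 1.5 \text{ bits} < \log_2 3 \approx 1.585 \text{ bits},
[/tex]

with equality only when all the letters are equally likely.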
This is what he seems to have proved in the expression I asked about in my question. But there he has taken only the n(q) most probable sequences and left out the least probable ones.
Does that mean we have completely neglected those least probable sequences? I mean, you should represent them using bits too, right? What would you do for those?

Thanks a lot for your help
 
  • #4
I totally figured that out!

Actually, he had already proved in his paper (in a derivation I can follow) that

[tex]
\lim_{N \to \infty} \frac{\log\left(p^{-1}\right)}{N} = H
[/tex]

where p is the probability of a sequence of length N. It looks like, as N → ∞, all the (typical) sequences end up with essentially the same probability p. This can be easily understood if you look at that derivation.

Now, if you define n(q) as the number of sequences you must take until the sum of their probabilities reaches q, then we have

n(q) * p ≈ q (we have assumed that, as N → ∞, all the typical sequences have the same probability p; if you look at the derivation I mentioned from his paper, you will get this)

We know that q is a fixed constant (not 0 or 1), so log q is just a constant and (log q)/N → 0 as N → ∞

which means that in the limit the q doesn't matter, and effectively

n(q) ≈ 1/p = p^(-1)

When you substitute this into the expression above, you get the result!

What I understand from this is that when you have an ergodic process (a Markov chain kind of thing), then as N → ∞ the least probable sequences carry almost none of the total probability, so you need not worry about them. This also means that the number of bits required to represent a sequence is not log(total sample space) but rather just NH.
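To convince myself, I also ran a quick numerical check in Python (my own toy setup, a memoryless two-symbol source with p = 0.9 and q = 0.5, not something from the paper):

[code]
from math import comb, log2

p, q_target = 0.9, 0.5                       # symbol probability and the target total probability q
H = -(p * log2(p) + (1 - p) * log2(1 - p))   # entropy per symbol, about 0.469 bits

def n_of_q(N):
    # A length-N sequence with k ones has probability p**k * (1-p)**(N-k),
    # and there are comb(N, k) such sequences.  Since p > 1/2, the most
    # probable sequences are those with the most ones, so accumulate
    # from k = N downward until the total probability reaches q_target.
    total_prob, count = 0.0, 0
    for k in range(N, -1, -1):
        total_prob += comb(N, k) * p**k * (1 - p)**(N - k)
        count += comb(N, k)
        if total_prob >= q_target:
            return count

print("H =", H)
for N in (10, 50, 200, 1000):
    print(N, log2(n_of_q(N)) / N)            # should creep toward H as N grows
[/code]

The convergence is slow, and it hasn't fully arrived even at N = 1000, but the trend toward H is already clear at these modest N.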

Thanks a lot for your help.
 
  • #5
Thank you for your questions. I will do my best to explain the theorem and its implications for the entropy of an information source.

First, let's define some terms that are used in this theorem. An information source is a system that produces a sequence of symbols or messages. In Shannon's paper, he uses the term "sequence" to refer to a series of symbols, which could be letters, numbers, or any other type of symbol. The length of a sequence is denoted by N.

The probability of a sequence refers to the likelihood of that sequence being produced by the information source. Shannon does not assume that every sequence of length N is equally likely; rather, he shows that for large N the "typical" sequences all have roughly the same probability, approximately 2^(-NH).

Now, let's look at the definition of n(q) in the theorem. This refers to the number of sequences that we need to take from the set of all possible sequences of length N, starting with the most probable one, in order to accumulate a total probability of q. In other words, n(q) represents the number of sequences that are needed to make up a proportion of q of the total probability.

For example, if we have a set of 8 sequences of length 3, with each sequence having a probability of 1/8, then n(1/2) would be 4 because we need to take the first 4 most probable sequences (1/8, 1/8, 1/8, 1/8) to accumulate a total probability of 1/2.
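Carrying that toy example further (here the source is effectively a fair binary source, since 2^3 = 8 equally likely sequences of length 3, so H = 1 bit per symbol), for general N we get approximately

[tex]
n(q) \approx q \cdot 2^{N}, \qquad \frac{\log_2 n(q)}{N} \approx 1 + \frac{\log_2 q}{N} \rightarrow 1 = H \quad \text{as } N \rightarrow \infty,
[/tex]

so the dependence on q washes out in the limit, which is exactly what Theorem 4 asserts.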

Now, let's look at the theorem itself. It states that as N approaches infinity, the ratio of log n(q) to N approaches the entropy of the information source (H). In other words, as the sequences get longer and longer, the number of bits per symbol required to specify a sequence, log n(q) divided by N, gets closer and closer to the entropy of the source.

This makes sense intuitively if we think about the definition of entropy. Entropy is a measure of the uncertainty or randomness of a system; in this context, it is the average number of bits per symbol needed to specify a sequence from the information source. As the sequences get longer, the number of typical sequences grows roughly like 2^(NH), so the total number of bits grows in proportion to N while the bits per symbol settle toward H.

To understand the proof of this theorem, it might be helpful to look at the appendix of Shannon's paper, where he provides the detailed derivation.
 

What is the Entropy of an Information Source?

The entropy of an information source is a measure of the uncertainty or randomness in a given source of information. It is a measure of how much information is needed, on average, to describe or represent the source accurately.

How is the Entropy of an Information Source calculated?

The entropy of an information source is calculated using the Shannon entropy formula, which weights the information content of each possible outcome by the probability of that outcome and sums over all outcomes.
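In symbols, for a source that emits symbol i with probability p_i, the Shannon entropy is

[tex]
H = -\sum_{i} p_i \log_2 p_i
[/tex]

measured in bits per symbol when the logarithm is taken to base 2.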

What is the significance of the Entropy of an Information Source?

The entropy of an information source is important in fields such as information theory, statistics, and machine learning. It quantifies the amount of uncertainty or predictability in a given source of information, which aids in data compression, information storage, and data analysis.

How does increasing or decreasing entropy affect the information source?

Increasing entropy means that there is more uncertainty or randomness in the information source, making it harder to predict or compress. On the other hand, decreasing entropy means that the information source becomes more predictable, making it easier to represent or compress.

Can entropy be negative?

No, entropy cannot be negative as it is a measure of randomness or uncertainty. It can only have non-negative values, with a value of 0 representing complete predictability and a higher value representing more unpredictability.
