The Entropy of Information Source

iVenky · Mar 31, 2013

In the paper "A Mathematical theory of Communication" by Shannon I couldn't understand a theorem (Theorem 4) under the topic "The Entropy of Information Source"-

Consider again the sequences of length N and let them be arranged in order of decreasing probability. We define n(q) to be the number we must take from this set starting with the most probable one in order to accumulate a total probability q for those taken

Theorem 4:
[tex] \lim_{N \to +\infty} \frac{log (n(q))}{N}= H[/tex]

when q does not equal 0 or 1.
We may interpret log n(q) as the number of bits required to specify the sequence when we consider only the most probable sequences with a total probability q.

Here 'H' is the entropy.

Can you explain me how we get this theorem?
What is this n(q)? I couldn't understand the sentence that I have underlined. How is log n(q) number of bits required to specify the sequence?

Here's his paper-

http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

It's under the topic "The Entropy of an Information Source" in Page 13 and 14. He has written the proof for that in the appendix but I couldn't understand that.

thanks a lot

schip666! · Apr 1, 2013

I was hoping someone who has actually gone through the derivation would jump in here, but I guess not...so I'll give it a little whack... this theorem is a step in getting to the (more understandable?) Shannon Entropy equation.

Lets say you decide to count all the four-letter words found on the internet, so:
N = 4.

For each word you count the number of times it is found and then calculate the probability of its being found as:
p = number-found / total-words
where p will be between 0 and 1. So if you found 10 words and "word" occurred 5 times its p = 0.5

Then you sort that list by probability such that the most frequently found word is first and the least is last. This gives you the list specified in the first sentence.

Since it is a gigantic list with a lot of very improbable entries you are probably -- no pun intended -- just interested in the most frequent ones, so you might decide to eliminate all but the top 80%, i.e., your q = 0.80. You go down the list collecting entries and summing the associated probabilities until you get a total of 0.80. The number of entries you collect is then n(q). Notice that this doesn't mean n*q, it might better have been notated as: n_q

Then...you take the log₂ of your n_q number. That it's the log to the base-2 may not have been obvious, you can use any base, but base-2 is what Shannon uses all along and it's the only one that gives you "bits".

For instance, if your n_q == 32, meaning that you selected the 32 most popular words from your giant list, then log₂(n_q) = 5 bits. This means that you can represent each of those 32 words as a unique 5 bit number, and, if you ignore all the other words in the big list, every sentence made of 4-letter words can be encoded as a sequence of 5-bit numbers. This is a danged good compression ratio when compared to the number of bits one would need to send the raw text as alphabetic characters.

In a more(or less?) intuitive mode, think of all the individual values between 0 and 255. There's 256 of em. And they can each be represented by an 8-bit number. Magically enough log₂(256) == 8 !

Now, getting H by normalizing by N as N->infinity is a little more mysterious than my poor brain can handle right now. Maybe someone else can explain that part...

iVenky · Apr 1, 2013

What I know is this- Shannon proved that
The maximum number of bits required to represent a sequence of length 'N' is NH bits (where H is the entropy). This in turn means that maximum number of bits required to represent each letter in the word is NH/N= H bits and note that

H ≤ log (total number of entries )

Ideally speaking as you said the number of bits required to represent a letter should have been log(total number of letters) but according to Shannon's theorem it is H only and H will always be less than or equal to log (total number of letters).
This is what he seems to have proved in the above expression that I had asked in the question. But there he has taken only n(q) sequences and he has left out those least probable ones.
Does that you mean have completely neglected about those least probable sequences? I mean you should represent them using bits too right? What would you do for those?

Thanks a lot for your help

iVenky · Apr 2, 2013

I totally figured that out!

Actually he had proved in his paper that (I can understand this derivation. It's in his paper)

[tex] <br /> \lim_{N \to \infty }\frac{log(p^{-1})}{N} = H<br /> [/tex]

where p is the total probability of the sequence of length N. It looks like as N-> infinity, all have the same probability 'p'. This can be easily understood if you see that derivation.

Now if you define n(q) as the number of sequences till the summation of those probabilities become equal to 'q' then we have

n(q) * p= q (we have assumed that as N-> infinity all have the same probability 'p'. If you see that derivation I had told in his paper you will get this)

We know that the upper bound of q is 1

so clearly

n(q)*p=1

which means

n(q)=1/p =p^-1

When you substitute this in the above expression you will get the result !

What I understand from this is that when you have an ergodic process (with a Markov chain kind of thing) then as N-> infinity some sequences won't occur at all so that you need not worry about them. This also means that the number of bits required to represent them is not log (total sample space) but rather just NH.

Thanks a lot for your help.

alphysicist · Apr 8, 2013

for your questions. I will do my best to explain the theorem and its implications for the entropy of an information source.

First, let's define some terms that are used in this theorem. An information source is a system that produces a sequence of symbols or messages. In Shannon's paper, he uses the term "sequence" to refer to a series of symbols, which could be letters, numbers, or any other type of symbol. The length of a sequence is denoted by N.

The probability of a sequence refers to the likelihood of that sequence occurring. In this context, it is the probability of a specific sequence being produced by the information source. Shannon assumes that all sequences of length N have equal probability, which means that each sequence has a probability of 1/2^N.

Now, let's look at the definition of n(q) in the theorem. This refers to the number of sequences that we need to take from the set of all possible sequences of length N, starting with the most probable one, in order to accumulate a total probability of q. In other words, n(q) represents the number of sequences that are needed to make up a proportion of q of the total probability.

For example, if we have a set of 8 sequences of length 3, with each sequence having a probability of 1/8, then n(1/2) would be 4 because we need to take the first 4 most probable sequences (1/8, 1/8, 1/8, 1/8) to accumulate a total probability of 1/2.

Now, let's look at the theorem itself. It states that as N approaches infinity, the ratio of log n(q) to N approaches the entropy of the information source (H). In other words, as the length of the sequences gets longer and longer, the number of bits required to specify the sequence (log n(q)) becomes closer and closer to the entropy of the information source.

This makes sense intuitively if we think about the definition of entropy. Entropy is a measure of the uncertainty or randomness of a system. In this context, it represents the average number of bits needed to specify a sequence from the information source. As the length of the sequences gets longer, the number of possible sequences also increases, resulting in a higher entropy.

To understand the proof of this theorem, it might be helpful to look at the appendix of Shannon's paper where he provides

The Entropy of Information Source

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Similar threads

Why must residential electrical systems be connected to Earth (soil)?

VFD for powering a car lift

Series motors, switched to parallel

Wireless Charging

Interpreting VIA PCB characteristics through TDR and TDT

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect