In the paper "A Mathematical theory of Communication" by Shannon I couldn't understand a theorem (Theorem 4) under the topic "The Entropy of Information Source"-

Here 'H' is the entropy.

Can you explain me how we get this theorem?
What is this n(q)? I couldn't understand the sentence that I have underlined. How is log n(q) number of bits required to specify the sequence?

It's under the topic "The Entropy of an Information Source" in Page 13 and 14. He has written the proof for that in the appendix but I couldn't understand that.

I was hoping someone who has actually gone through the derivation would jump in here, but I guess not...so I'll give it a little whack... this theorem is a step in getting to the (more understandable?) Shannon Entropy equation.

Lets say you decide to count all the four-letter words found on the internet, so:
N = 4.

For each word you count the number of times it is found and then calculate the probability of its being found as:
p = number-found / total-words
where p will be between 0 and 1. So if you found 10 words and "word" occurred 5 times its p = 0.5

Then you sort that list by probability such that the most frequently found word is first and the least is last. This gives you the list specified in the first sentence.

Since it is a gigantic list with a lot of very improbable entries you are probably -- no pun intended -- just interested in the most frequent ones, so you might decide to eliminate all but the top 80%, i.e., your q = 0.80. You go down the list collecting entries and summing the associated probabilities until you get a total of 0.80. The number of entries you collect is then n(q). Notice that this doesn't mean n*q, it might better have been notated as: n_{q}

Then...you take the log_{2} of your n_{q} number. That it's the log to the base-2 may not have been obvious, you can use any base, but base-2 is what Shannon uses all along and it's the only one that gives you "bits".

For instance, if your n_{q} == 32, meaning that you selected the 32 most popular words from your giant list, then log_{2}(n_{q}) = 5 bits. This means that you can represent each of those 32 words as a unique 5 bit number, and, if you ignore all the other words in the big list, every sentence made of 4-letter words can be encoded as a sequence of 5-bit numbers. This is a danged good compression ratio when compared to the number of bits one would need to send the raw text as alphabetic characters.

In a more(or less?) intuitive mode, think of all the individual values between 0 and 255. There's 256 of em. And they can each be represented by an 8-bit number. Magically enough log_{2}(256) == 8 !!!

Now, getting H by normalizing by N as N->infinity is a little more mysterious than my poor brain can handle right now. Maybe someone else can explain that part...

What I know is this- Shannon proved that
The maximum number of bits required to represent a sequence of length 'N' is NH bits (where H is the entropy). This in turn means that maximum number of bits required to represent each letter in the word is NH/N= H bits and note that

H ≤ log (total number of entries )

Ideally speaking as you said the number of bits required to represent a letter should have been log(total number of letters) but according to Shannon's theorem it is H only and H will always be less than or equal to log (total number of letters).
This is what he seems to have proved in the above expression that I had asked in the question. But there he has taken only n(q) sequences and he has left out those least probable ones.
Does that you mean have completely neglected about those least probable sequences? I mean you should represent them using bits too right? What would you do for those?

Actually he had proved in his paper that (I can understand this derivation. It's in his paper)

[tex]

\lim_{N \to \infty }\frac{log(p^{-1})}{N} = H

[/tex]

where p is the total probability of the sequence of length N. It looks like as N-> infinity, all have the same probability 'p'. This can be easily understood if you see that derivation.

Now if you define n(q) as the number of sequences till the summation of those probabilities become equal to 'q' then we have

n(q) * p= q (we have assumed that as N-> infinity all have the same probability 'p'. If you see that derivation I had told in his paper you will get this)

We know that the upper bound of q is 1

so clearly

n(q)*p=1

which means

n(q)=1/p =p^{-1}

When you substitute this in the above expression you will get the result !

What I understand from this is that when you have an ergodic process (with a Markov chain kind of thing) then as N-> infinity some sequences won't occur at all so that you need not worry about them. This also means that the number of bits required to represent them is not log (total sample space) but rather just NH.