# Huffman coding a Markov source

techmologist
Suppose you have a Markov source of symbols (or more generally, a stationary ergodic source). If you Huffman encode longer and longer extensions (that is, you take blocks of n source symbols at a time, letting n increase) of this source, then the frequencies of 1's and 0's in the encoded message tend to 1/2 each. How do you prove this? I have an idea of why it is so. Namely, if the frequencies of 1's and 0's were very different, you could Huffman encode the coded message taken in blocks of k << n, and thereby obtain an even shorter code for the original message. But the original code was as short as possible by construction (the Huffman procedure), a contradiction. But this isn't really a proof. It sort of assumes that the expectation of a product is the product of the expectations, which is not generally true.

Hey techmologist.

Are you referring to the long term distribution of symbols? In Markovian analysis there is a thing called the stationary distribution that you can calculate given a valid transition matrix which gives the long distribution of getting a particular value as the number of process realizations gets large (or tends to infinity).

techmologist
Yes, but specifically the long term distribution of the code symbols, 0 and 1 in this case. If the source symbols have a stationary distribution, the code symbols in the encoded message do, too. I'm trying to fill in the gaps in a proof that the 0's and the 1's become equiprobable as you Huffman encode longer and longer blocks of source symbols (extensions of the source).

Another argument that is entirely convincing to me, but isn't a proof, is this. If L_n is the average codeword length of a Huffman code for blocks of n source symbols, then as n increases, L_n/n approaches the source entropy H(S). Since the average entropy of a block of n source symbols is n*H(S), the average entropy per code symbol is n*H(S)/L_n, and this tends to 1. That is, the code symbols tend toward being equiprobable.

This argument has the same gaps as my first argument, namely assuming that averages of products and quotients are the same as the products and quotients of the averages. They must not differ by much for the problem at hand, but that fact would still require a proof. It makes me wonder if there ain't a better, more direct argument.

Can you get an explicit distribution of each Huffman code for n? If you assume that each block is independent then it means you can refer to only one block in terms of the distribution of all code-words and subsequently for the alphabet realizations of 0 and 1.

Remember that if you have independence, entropy is additive and the mutual entropy is 0. If you have a distribution for the Huffman codes for each block and show that it goes to uniform, then you're done.

techmologist
I have been trying that, and even for the simplest case, where the source itself is binary (with alphabet {a, b}, say), I can't figure out an explicit distribution of 1's and 0's in the Huffman code for blocks of n source symbols. The particular form the Huffman tree takes depends too much on the relative value of the probabilities P(a) = p and P(b) = q = 1-p. All I can think of is that as n increases, the Huffman tree tends to become more balanced, which would mean 1's and 0's happen equally often. This comes from the Shannon-McMillan-Breiman theorem. I was hoping not to have to appeal to something that complicated.

I have been trying that, and even for the simplest case, where the source itself is binary (with alphabet {a, b}, say), I can't figure out an explicit distribution of 1's and 0's in the Huffman code for blocks of n source symbols. The particular form the Huffman tree takes depends too much on the relative value of the probabilities P(a) = p and P(b) = q = 1-p. All I can think of is that as n increases, the Huffman tree tends to become more balanced, which would mean 1's and 0's happen equally often. This comes from the Shannon-McMillan-Breiman theorem. I was hoping not to have to appeal to something that complicated.

Before doing a full fledged analytic proof, try simulation to generate the distribution for the final Huffman codes, instead of doing it analytically.

With regard to the analytic distribution though, you should use a computer program to generate an explicit distribution corresponding to map every n-bit block to every relevant huffman code and then calculate the entropy of that.

Basically this will do an exhaustive case for a given n and it won't be a general proof. Once you do this though, you can look at the structure and see if you can make an inductive argument about either the distribution itself or just the entropy.

What I mean by the above is that the computer generates a PDF for a given n which is explicit and then take that and try a symbolic extension to generate a PDF for some value of n.

So to verify, first do simulation. Then try analytic for specific n by simply mapping each code to it's corresponding huff-man code and go from there.

At least if you have an explicit distribution for a few values of n on your screen this might help you get the pattern you are after.

techmologist
That's a good suggestion. The best way for me to develop intuition for how the Huffman procedure works on longer extensions of the source is to actually see a bunch of specific examples. Perhaps a pattern will emerge. Thank you.