Huffman coding a Markov source

In summary: I have been trying to generate an explicit distribution for the final Huffman codes, but I'm not getting very far. I think my problem is that I'm trying to do it too analytically. Perhaps I should try simulations instead.
  • #1
techmologist
Suppose you have a Markov source of symbols (or, more generally, a stationary ergodic source). If you Huffman encode longer and longer extensions of this source (that is, you take blocks of n source symbols at a time, letting n increase), then the frequencies of 1's and 0's in the encoded message tend to 1/2 each. How do you prove this? I have an idea of why it is so: if the frequencies of 1's and 0's were very different, you could Huffman encode the coded message taken in blocks of k << n, and thereby obtain an even shorter code for the original message. But the original code was as short as possible by construction (the Huffman procedure), a contradiction. But this isn't really a proof. It sort of assumes that the expectation of a product is the product of the expectations, which is not generally true.
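
To illustrate the claim numerically, a rough sketch for the simplest case (a memoryless binary source with P(a) = 0.9, an arbitrary choice) might look like the following. It builds the Huffman code for blocks of n symbols and prints the probability-weighted fraction of 1's, E[#1's per codeword]/E[codeword length]. The exact value depends on the arbitrary 0/1 labeling of the branches, but the drift toward 1/2 is the point.

```python
# Sketch (illustrative only): check the claim for the simplest case, a
# memoryless binary source with P(a) = p, P(b) = 1 - p.  For each block
# length n, build a Huffman code over all 2^n blocks and compute the
# long-run fraction of 1's in the encoded stream, which equals
# E[#1's per codeword] / E[codeword length].
import heapq
from itertools import product

def huffman_codes(probs):
    """probs: dict symbol -> probability.  Returns dict symbol -> codeword."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        # Prefix the two subtrees with 0 and 1 (this labeling is arbitrary).
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

p = 0.9  # P(a); an arbitrary choice for illustration
for n in range(1, 9):
    blocks = {"".join(b): p ** b.count("a") * (1 - p) ** b.count("b")
              for b in product("ab", repeat=n)}
    codes = huffman_codes(blocks)
    ones = sum(prob * codes[blk].count("1") for blk, prob in blocks.items())
    length = sum(prob * len(codes[blk]) for blk, prob in blocks.items())
    print(f"n={n}: E[len]/n={length/n:.4f}, fraction of 1's={ones/length:.4f}")
```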
 
  • #2
Hey techmologist.

Are you referring to the long-term distribution of symbols? In Markov chain analysis there is something called the stationary distribution, which you can calculate from a valid transition matrix; it gives the long-run probability of seeing a particular value as the number of steps of the process gets large (or tends to infinity).
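
For illustration, a minimal sketch of that calculation (the two-state transition matrix below is just an example) could look like this:

```python
# Sketch: stationary distribution of a Markov chain from its transition
# matrix P (rows sum to 1).  The stationary distribution pi solves
# pi P = pi with sum(pi) = 1.  The matrix below is only an example.
import numpy as np

P = np.array([[0.9, 0.1],    # P(a -> a), P(a -> b)
              [0.4, 0.6]])   # P(b -> a), P(b -> b)

# Solve the left-eigenvector problem as a linear system:
# (P^T - I) pi = 0 together with the normalization sum(pi) = 1.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print("stationary distribution:", pi)   # approx [0.8, 0.2]
```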
 
  • #3
Yes, but specifically the long term distribution of the code symbols, 0 and 1 in this case. If the source symbols have a stationary distribution, the code symbols in the encoded message do, too. I'm trying to fill in the gaps in a proof that the 0's and the 1's become equiprobable as you Huffman encode longer and longer blocks of source symbols (extensions of the source).

Another argument that is entirely convincing to me, but isn't a proof, is this. If L_n is the average codeword length of a Huffman code for blocks of n source symbols, then as n increases, L_n/n approaches the source entropy H(S). Since the entropy of a block of n source symbols is approximately n*H(S), the average entropy per code symbol is n*H(S)/L_n, and this tends to 1 bit. That is, the code symbols tend toward being equiprobable.

This argument has the same gaps as my first argument, namely assuming that averages of products and quotients are the same as the products and quotients of the averages. They must not differ by much for the problem at hand, but that fact would still require a proof. It makes me wonder if there ain't a better, more direct argument.
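
For reference, the standard source-coding bounds this argument leans on are the following (H(S) is the entropy rate in bits, H(S^n) is the entropy of an n-block, and H(S) > 0 is assumed):

```latex
\begin{align*}
  H(S^n) \;\le\; L_n \;<\; H(S^n) + 1
    &\quad\text{(optimality of the block Huffman code)} \\
  \frac{H(S^n)}{n} \;\to\; H(S)
    &\quad\text{(stationary ergodic source)} \\
  \Rightarrow\quad \frac{L_n}{n} \;\to\; H(S),
    \qquad \frac{H(S^n)}{L_n} \;\to\; 1
    &\quad\text{(bits of information per code symbol)}
\end{align*}
% A binary code symbol carries at most 1 bit, with equality iff 0 and 1 are
% equiprobable; turning "information per code symbol tends to 1" into
% "marginal entropy of a code symbol tends to 1" is exactly the gap noted above.
```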
 
  • #4
Can you get an explicit distribution of the Huffman codewords for a given n? If you assume that each block is independent, then you only need to analyze a single block: its codeword distribution determines the distribution of the code alphabet symbols 0 and 1 in the whole encoded message.

Remember that if you have independence, entropy is additive and the mutual information is 0. If you have a distribution for the Huffman codes for each block and show that the resulting distribution of 0's and 1's goes to uniform, then you're done.
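
As a quick numerical sanity check of the additivity claim (the block distribution below is just an arbitrary example):

```python
# Tiny check: for independent blocks X and Y, H(X, Y) = H(X) + H(Y),
# i.e. the mutual information I(X; Y) is zero.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px = np.array([0.81, 0.09, 0.09, 0.01])   # distribution of one block (example)
joint = np.outer(px, px)                  # two independent, identical blocks
print(entropy(joint.ravel()), 2 * entropy(px))   # the two numbers agree
```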
 
  • #5
I have been trying that, and even for the simplest case, where the source itself is binary (with alphabet {a, b}, say), I can't figure out an explicit distribution of 1's and 0's in the Huffman code for blocks of n source symbols. The particular form the Huffman tree takes depends too much on the relative value of the probabilities P(a) = p and P(b) = q = 1-p. All I can think of is that as n increases, the Huffman tree tends to become more balanced, which would mean 1's and 0's happen equally often. This comes from the Shannon-McMillan-Breiman theorem. I was hoping not to have to appeal to something that complicated.
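
For example, already at n = 2 the tree shape flips between balanced and skewed depending on p (the 0/1 labels on the branches are an arbitrary choice, so only the codeword lengths are canonical):

```latex
% p = 0.6:  P(aa)=0.36, P(ab)=P(ba)=0.24, P(bb)=0.16
%   The Huffman merges give codeword lengths (2, 2, 2, 2): a balanced tree.
%
% p = 0.9:  P(aa)=0.81, P(ab)=P(ba)=0.09, P(bb)=0.01
%   The Huffman merges give lengths (1, 2, 3, 3): a maximally skewed tree, e.g.
%     aa -> 0,  ab -> 10,  ba -> 110,  bb -> 111.
%   With this labeling the long-run fraction of 1's is
\[
  \frac{E[\#1\text{'s per codeword}]}{E[\text{codeword length}]}
  = \frac{0.81\cdot 0 + 0.09\cdot 1 + 0.09\cdot 2 + 0.01\cdot 3}
         {0.81\cdot 1 + 0.09\cdot 2 + 0.09\cdot 3 + 0.01\cdot 3}
  = \frac{0.30}{1.29} \approx 0.23,
\]
% already closer to 1/2 than the n = 1 value of 0.10 (with the code a -> 0, b -> 1).
```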
 
  • #6
techmologist said:
I have been trying that, and even for the simplest case, where the source itself is binary (with alphabet {a, b}, say), I can't figure out an explicit distribution of 1's and 0's in the Huffman code for blocks of n source symbols. The particular form the Huffman tree takes depends too much on the relative value of the probabilities P(a) = p and P(b) = q = 1-p. All I can think of is that as n increases, the Huffman tree tends to become more balanced, which would mean 1's and 0's happen equally often. This comes from the Shannon-McMillan-Breiman theorem. I was hoping not to have to appeal to something that complicated.

Before attempting a full-fledged analytic proof, try simulation to generate the distribution of the final Huffman codes.

With regard to the analytic distribution, though, you could use a computer program to map every block of n source symbols to its corresponding Huffman codeword, generate the explicit distribution, and then calculate the entropy of that.

Basically this is an exhaustive calculation for a given n, so it won't be a general proof. Once you have done it, though, you can look at the structure and see whether you can make an inductive argument about either the distribution itself or just the entropy.

What I mean by the above is that the computer generates an explicit distribution for a given n; you then take that and try a symbolic extension to get a distribution for general n.

So to verify: first do the simulation. Then try the analytic route for specific n by simply mapping each block to its corresponding Huffman codeword and go from there.

At the very least, having an explicit distribution for a few values of n in front of you might help you spot the pattern you are after.
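
A rough sketch of that simulation for a two-state Markov source (the transition probabilities and block lengths below are arbitrary example values; the Huffman code is built from the exact block probabilities and then used to encode a simulated run) might look like this:

```python
# Sketch of the simulation idea for a two-state Markov source with states
# "a" and "b".  For each block length n: build a Huffman code from the exact
# block probabilities, encode a long simulated run, and count 0's and 1's.
import heapq
import random
from itertools import product

P = {"a": {"a": 0.9, "b": 0.1},   # example transition probabilities
     "b": {"a": 0.4, "b": 0.6}}
PI = {"a": 0.8, "b": 0.2}         # corresponding stationary distribution

def block_prob(block):
    """Exact probability of a block started from the stationary distribution."""
    prob = PI[block[0]]
    for prev, cur in zip(block, block[1:]):
        prob *= P[prev][cur]
    return prob

def huffman_codes(probs):
    """probs: dict block -> probability.  Returns dict block -> codeword."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

def simulate(num_symbols, rng):
    """Simulate a run of the chain started from the stationary distribution."""
    state = "a" if rng.random() < PI["a"] else "b"
    out = []
    for _ in range(num_symbols):
        out.append(state)
        state = "a" if rng.random() < P[state]["a"] else "b"
    return "".join(out)

rng = random.Random(0)
for n in (1, 2, 4, 8):
    codes = huffman_codes({"".join(b): block_prob(b)
                           for b in product("ab", repeat=n)})
    stream = simulate(200_000 // n * n, rng)
    bits = "".join(codes[stream[i:i + n]] for i in range(0, len(stream), n))
    print(f"n={n}: bits/symbol={len(bits)/len(stream):.3f}, "
          f"fraction of 1's={bits.count('1')/len(bits):.3f}")
```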
 
  • #7
That's a good suggestion. The best way for me to develop intuition for how the Huffman procedure works on longer extensions of the source is to actually see a bunch of specific examples. Perhaps a pattern will emerge. Thank you.
 

1. What is Huffman coding for a Markov source?

Huffman coding is a data compression technique that reduces the space needed to store data by assigning shorter codewords to more frequently occurring symbols. A Markov source is a source whose successive symbols show statistical dependency; to exploit that dependency, Huffman coding is applied to blocks (extensions) of the source rather than to individual symbols.

2. How does Huffman coding work for a Markov source?

Huffman coding builds a binary tree in which the most probable symbols end up closer to the root and the least probable symbols end up deeper in the tree, so frequent symbols get shorter codewords and the encoded data is more compact. For a Markov source, the tree is typically built over blocks of source symbols, using the block probabilities determined by the transition matrix and the stationary distribution.

3. What are the advantages of using Huffman coding for a Markov source?

The main advantage is that it can significantly reduce the space needed to store data, which is useful for storing large data sets or transmitting data over a network. In addition, when it is applied to blocks of symbols, it can take advantage of the dependency between successive symbols and achieve better compression than coding each symbol separately.

4. Can Huffman coding be used for any type of data?

Huffman coding can be applied to any type of data for which the symbol probabilities are known or can be estimated; it compresses whenever the distribution of symbols is not uniform. It is commonly used for text data, such as documents or web pages, where certain characters or words are much more likely to occur than others.

5. Are there any limitations to using Huffman coding for a Markov source?

One limitation is that the codebook grows exponentially with the block length, so building and storing the tree for long blocks of a Markov source requires significant processing and memory. In addition, because each codeword uses an integer number of bits, Huffman coding achieves little compression when the source distribution is close to uniform.
