How Does Huffman Coding Affect Symbol Frequencies in a Markov Source?

  • Context: Graduate
  • Thread starter: techmologist
  • Tags: Coding, Source
Discussion Overview

The discussion revolves around the effects of Huffman coding on symbol frequencies in a Markov source, particularly focusing on how encoding longer blocks of symbols influences the distribution of encoded symbols (0's and 1's). Participants explore theoretical aspects, potential proofs, and practical approaches to understanding this phenomenon.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests that as longer blocks of symbols are Huffman encoded, the frequencies of 1's and 0's in the encoded message tend to equalize at 1/2, although they acknowledge that their reasoning lacks a formal proof.
  • Another participant introduces the concept of stationary distribution in Markov analysis, indicating that it could relate to the long-term distribution of code symbols.
  • A participant emphasizes the need to demonstrate that the code symbols become equiprobable as block sizes increase, referencing the average codeword length and its relationship to source entropy.
  • One participant questions the possibility of deriving an explicit distribution for Huffman codes, suggesting that independence among blocks could simplify the analysis.
  • Another participant expresses difficulty in determining the distribution of 1's and 0's in the Huffman code, noting that the structure of the Huffman tree is influenced by the probabilities of the source symbols.
  • A suggestion is made to use simulations to generate distributions for Huffman codes, which could provide insights before attempting an analytic proof.
  • One participant acknowledges the value of simulations for developing intuition about the Huffman procedure and hopes to identify patterns through specific examples.

Areas of Agreement / Disagreement

Participants express various viewpoints on the relationship between Huffman coding and symbol frequencies, with no consensus reached on a definitive proof or explicit distribution. Multiple competing ideas and approaches remain present throughout the discussion.

Contextual Notes

Participants note limitations in their arguments, including assumptions about independence, the need for formal proofs, and the complexity of deriving distributions based on varying probabilities of source symbols.

techmologist
Suppose you have a Markov source of symbols (or more generally, a stationary ergodic source). If you Huffman encode longer and longer extensions (that is, you take blocks of n source symbols at a time, letting n increase) of this source, then the frequencies of 1's and 0's in the encoded message tend to 1/2 each. How do you prove this? I have an idea of why it is so. Namely, if the frequencies of 1's and 0's were very different, you could Huffman encode the coded message taken in blocks of k << n, and thereby obtain an even shorter code for the original message. But the original code was as short as possible by construction (the Huffman procedure), a contradiction. But this isn't really a proof. It sort of assumes that the expectation of a product is the product of the expectations, which is not generally true.
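As a concrete illustration of the claim (a sketch only; the two-state transition matrix and its probabilities below are made up for illustration, not from the thread): Huffman-code all n-blocks of a small Markov source and compute the probability-weighted fraction of 1's in the codewords as n grows.

```python
import heapq
from itertools import product

# Hypothetical two-state Markov source over {'a', 'b'};
# the transition probabilities are illustrative only.
P = {'a': {'a': 0.9, 'b': 0.1}, 'b': {'a': 0.5, 'b': 0.5}}
pi = {'a': 5/6, 'b': 1/6}  # stationary distribution of this chain

def block_probs(n):
    """Probability of each length-n block, starting from the stationary distribution."""
    probs = {}
    for block in product('ab', repeat=n):
        p = pi[block[0]]
        for x, y in zip(block, block[1:]):
            p *= P[x][y]
        probs[block] = p
    return probs

def huffman(probs):
    """Binary Huffman code: {symbol: codeword}."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {s: '' for s in probs}
    tie = len(heap)  # tiebreaker so the heap never compares lists
    while len(heap) > 1:
        p0, _, s0 = heapq.heappop(heap)
        p1, _, s1 = heapq.heappop(heap)
        for s in s0:
            code[s] = '0' + code[s]
        for s in s1:
            code[s] = '1' + code[s]
        heapq.heappush(heap, (p0 + p1, tie, s0 + s1))
        tie += 1
    return code

# Expected fraction of 1's in the encoded message, weighted by block probability.
frac_ones = {}
for n in (1, 2, 4, 8):
    probs = block_probs(n)
    code = huffman(probs)
    ones = sum(p * code[b].count('1') for b, p in probs.items())
    total = sum(p * len(code[b]) for b, p in probs.items())
    frac_ones[n] = ones / total
    print(n, round(frac_ones[n], 4))
```

For n = 1 the fraction is simply the stationary probability of whichever symbol got codeword 1; as n grows it should drift toward 1/2, which is the behavior in question.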
 
Hey techmologist.

Are you referring to the long-term distribution of symbols? In Markovian analysis there is the stationary distribution, which you can calculate from a valid transition matrix; it gives the long-run probability of observing each value as the number of process realizations grows large (or tends to infinity).
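For a concrete 2-state example (the transition matrix here is made up for illustration), the stationary distribution can be computed by power iteration:

```python
# Hypothetical 2x2 transition matrix (rows sum to 1); states 0 and 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

# Power iteration: repeatedly apply pi <- pi * P. For an ergodic chain this
# converges to the unique stationary distribution (here exactly [5/6, 1/6]).
pi = [0.5, 0.5]
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

print(pi)
```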
 
Yes, but specifically the long term distribution of the code symbols, 0 and 1 in this case. If the source symbols have a stationary distribution, the code symbols in the encoded message do, too. I'm trying to fill in the gaps in a proof that the 0's and the 1's become equiprobable as you Huffman encode longer and longer blocks of source symbols (extensions of the source).

Another argument that is entirely convincing to me, but isn't a proof, is this. If L_n is the average codeword length of a Huffman code for blocks of n source symbols, then as n increases, L_n/n approaches the source entropy H(S). Since the average entropy of a block of n source symbols is n*H(S), the average entropy per code symbol is n*H(S)/L_n, and this tends to 1. That is, the code symbols tend toward being equiprobable.

This argument has the same gaps as my first argument, namely assuming that averages of products and quotients are the same as the products and quotients of the averages. They must not differ by much for the problem at hand, but that fact would still require a proof. It makes me wonder if there ain't a better, more direct argument.
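The convergence L_n/n → H(S) behind this argument can be checked numerically. Assuming an i.i.d. binary source for simplicity (p = 0.9 is an arbitrary choice), Huffman coding of n-blocks satisfies H(S) ≤ L_n/n < H(S) + 1/n. The sketch below uses the standard identity that the average Huffman codeword length equals the sum of the merged node probabilities:

```python
import heapq
from itertools import product
from math import log2

p = 0.9  # illustrative symbol probability for an i.i.d. binary source
H = -(p * log2(p) + (1 - p) * log2(1 - p))  # entropy H(S) in bits/symbol

def huffman_avg_len(probs):
    """Average codeword length of a binary Huffman code.

    Identity: the average length equals the sum of the probabilities
    of the internal (merged) nodes of the Huffman tree.
    """
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

Ln_per_symbol = {}
for n in (1, 2, 4, 8):
    probs = []
    for bits in product((0, 1), repeat=n):
        k = sum(bits)  # number of low-probability symbols in the block
        probs.append(p**(n - k) * (1 - p)**k)
    Ln_per_symbol[n] = huffman_avg_len(probs) / n
    print(n, round(Ln_per_symbol[n], 4), round(H, 4))
```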
 
Can you get an explicit distribution of each Huffman code for a given n? If you assume the blocks are independent, then you only need to analyze a single block to get the distribution of all codewords, and from that the distribution of the code alphabet symbols 0 and 1.

Remember that under independence entropy is additive and the mutual information is 0. If you can find the distribution of the Huffman codewords for each block and show that it tends to uniform, then you're done.
 
I have been trying that, and even for the simplest case, where the source itself is binary (with alphabet {a, b}, say), I can't figure out an explicit distribution of 1's and 0's in the Huffman code for blocks of n source symbols. The particular form the Huffman tree takes depends too much on the relative value of the probabilities P(a) = p and P(b) = q = 1-p. All I can think of is that as n increases, the Huffman tree tends to become more balanced, which would mean 1's and 0's happen equally often. This comes from the Shannon-McMillan-Breiman theorem. I was hoping not to have to appeal to something that complicated.
 

Before attempting a full-fledged analytic proof, try simulation to generate the distribution of the final Huffman codes, instead of doing it analytically.

With regard to the analytic distribution, you could use a computer program to generate an explicit distribution by mapping every n-symbol block to its Huffman codeword, and then calculate the entropy of that.

This handles an exhaustive case for a given n, so it won't be a general proof. But once you have it, you can look at the structure and see whether you can make an inductive argument about either the distribution itself or just the entropy.

What I mean is: have the computer generate an explicit PMF for a given n, then take that and attempt a symbolic extension to get a PMF for general n.

So, to verify: first do a simulation. Then try the analytic route for specific n by simply mapping each block to its corresponding Huffman codeword and going from there.

At least if you have explicit distributions for a few values of n on your screen, it might help you spot the pattern you're after.
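As a sketch of that suggestion (assuming a memoryless binary source, with a made-up p = 0.7), the following prints the explicit block probabilities and Huffman codewords for a couple of small n:

```python
import heapq
from itertools import product

p = 0.7  # hypothetical P(a); P(b) = 0.3

def huffman(probs):
    """Binary Huffman code: {symbol: codeword}."""
    heap = [(q, i, [s]) for i, (s, q) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {s: '' for s in probs}
    tie = len(heap)  # tiebreaker so the heap never compares lists
    while len(heap) > 1:
        q0, _, s0 = heapq.heappop(heap)
        q1, _, s1 = heapq.heappop(heap)
        for s in s0:
            code[s] = '0' + code[s]
        for s in s1:
            code[s] = '1' + code[s]
        heapq.heappush(heap, (q0 + q1, tie, s0 + s1))
        tie += 1
    return code

codebooks = {}
for n in (2, 3):
    probs = {''.join(b): p**b.count('a') * (1 - p)**b.count('b')
             for b in product('ab', repeat=n)}
    codebooks[n] = huffman(probs)
    for block in sorted(probs, key=probs.get, reverse=True):
        print(n, block, round(probs[block], 4), codebooks[n][block])
```

Each run lays out one explicit codebook per n, so the shape of the tree (and the count of 0's and 1's it produces) can be inspected directly as p or n is varied.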
 
That's a good suggestion. The best way for me to develop intuition for how the Huffman procedure works on longer extensions of the source is to actually see a bunch of specific examples. Perhaps a pattern will emerge. Thank you.
 
