# Inverse source coding theorem

## Main Question or Discussion Point

The source coding theorem tells us that given a discrete probability distribution, there is an optimal encoding for it. Is it possible to go in the reverse direction? That is, suppose you start with an encoding of a discrete random variable X whose distribution is unknown. Assuming that this encoding is optimal, can one derive the distribution of X?

Suppose that one can determine the distribution of X from the optimal encoding. Then this would allow us to determine the a priori probability of any string in the encoding. In other words, that would let us construct a concept of probability out of nothing but a formal language!

Related Set Theory, Logic, Probability, Statistics News on Phys.org
The source coding theorem tells us that given a discrete probability distribution, there is an optimal encoding for it. Is it possible to go in the reverse direction? That is, suppose you start with an encoding of a discrete random variable X whose distribution is unknown. Assuming that this encoding is optimal, can one derive the distribution of X?
Yeah, sure. You know that the length of each codeword is proportional to the log of the probability corresponding element, so you can just raise all the codeword lengths to the power of 2 and then normalize. Up to round-off error (because the encoding lengths must be integers), you will end up with the original distribution.

More generally, any code, optimal or not, corresponds to an assumed probability distribution, and the average excess length of such a code (relative to the optimal one) is then the KL divergence between the assumed distribution and the true one.

In other words, that would let us construct a concept of probability out of nothing but a formal language!
Not so fast. This whole approach is based on the assumption that the encoding is optimal with respect to the distribution, which is assumed to already exist. You aren't constructing anything here, but rather noting that the structure of the code tells you what the distribution was.

Yeah, sure. You know that the length of each codeword is proportional to the log of the probability corresponding element, so you can just raise all the codeword lengths to the power of 2 and then normalize. Up to round-off error (because the encoding lengths must be integers), you will end up with the original distribution.
If I am not misinterpreting you, then
|s| = k * log(P(d(s)))
P(d(s)) = 2^(|s|/k)
where k is a normalizing constant, and d(s) is the event encoded by s. Since the string s is produced if and only if the event d(s) occurs, P(d(s)) = P(s).

Setting k = -1/2 normalizes the distribution, giving a prior distribution over binary strings, on the assumption that these strings are an optimal encoding for some random variable:
P(s) = 1/4^|s|

If I am not misinterpreting you, then
|s| = k * log(P(d(s)))
P(d(s)) = 2^(|s|/k)
where k is a normalizing constant, and d(s) is the event encoded by s. Since the string s is produced if and only if the event d(s) occurs, P(d(s)) = P(s).
Yeah, that's the idea. Although, since $P(d(s)) \leq 1$, I'd write it as:

$$|s| = -k\log_2(P(d(s)))$$

or even just

$$|s| \propto -\log_2(P(d(s)))$$

I wish to solve the following problem using similar techniques. Suppose we have a set of computer programs written as binary strings, and a set of inputs also written as binary strings. Each computer program either terminates on an input, which means that it accepts the input, or it fails to terminate, which means that it rejects the input. I wish to answer the question, "given an input string, what is the maximum likelihood computer program that accepts that input string"?

In order to answer this question, it is necessary to derive a joint probability distribution over the computer programs and the input strings. It is not possible to assume that the computer programs are an optimal encoding of anything, because there may be more than one computer program that accepts a given input, a computer program may accept more than one input, and a computer program may reject all inputs. But perhaps it is possible to assume that the computer programs are optimal "recognizers" of the input, for some meaning of optimal.

This is related to another question: given the Kolmogorov complexity of a string, is it possible to derive the prior probability of that string, assuming that the computer programs used to generate strings are somehow optimal at doing so? In this case it's still not possible to interpret the computer programs as encoding the strings, because more than one computer program may generate the same string. But in this case each computer program corresponds to only one string, so it may be simpler in that regard.