# Minimum Description Length for Features

## Main Question or Discussion Point

In minimum description length, you start with some programming language L, and some data D, and seek to find a program P in L such that evaluating P outputs D. You try to minimize the length of P, and you consider P to be a valid model of the data if the length of P is less than the length of D.

However, what if your goals are more modest--suppose instead of describing all the data you only wish to find a "feature" in the data, like identifying a shape. Let a "feature descriptor" be a program F that outputs a string S, such that S agrees with D at a large number of points, and where S contains a "dummy" symbol (a symbol not used in D) at each other position. For example, if D = 0101001111111111010100011, then one might let F be a program that outputs the string S = xxxxxx1111111111xxxxxxxxx, which agrees with D at 10 places.

What is a criterion, along the lines of minimum description length, for finding valid features in the data? One possible criterion is to say that F is a valid feature descriptor if the length of F is less than the number of places where F's output, S, agrees with D. Is this a good criterion, or does it lead to spurious features?

A good criterion for finding valid features should very rarely allow one to find any "features" in a random string of data, no matter how long. So, suppose we have an infinite string D of 0's and 1's, each randomly selected by a fair coin flip. What is the probability that, by examining D, we can write a computer program P (say, in Python) that is a valid feature descriptor by the above criterion (length of P in bits is less than the number of places where the output of P agrees with D)? Is this probability close to 0 (as we'd like) or close to 1?

Related Set Theory, Logic, Probability, Statistics News on Phys.org
The aforementioned probability is close to 0 for most programming languages, as desired. I can argue thusly:
The case that would maximize that probability is where you have a programming language L = $$\{0,1\}^*$$, such that every program P in L outputs an infinite string consisting of almost all x's, except for |P|+1 0's and 1's.

Now, let's examine the probability that there is any program in L that is a valid descriptor for D. The probability for any program of length $$\ell$$ is $$2^{-1-\ell}$$. The probability that any program is a valid descriptor for D is maximized when the event that D matches the output of any one program is mutually exclusive with D matching the output of every other program, in which case we can just sum the probabilities for each program. So,
$$\text{p(valid descriptor) }\leq \sum_{\ell=1}^\infty \sum_{i=2^{\ell - 1}}^{2^{\ell}} 2^{-1 - \ell}$$
$$= \sum_{\ell=1}^\infty 2^{-1 - \ell} \left(2^{\ell - 1} - 1\right)$$
$$= \sum_{\ell=1}^\infty 2^{1 - \ell}/4$$
$$= \frac{1}{4}\frac{1}{1-1/2} = 1/2$$

So in this ideal case, for the programming language L, we have already ruled out the case that the probability of a valid descriptor for D is high. However, L was as "good" a programming language as we could possibly hope for. Suppose that instead of L, we have a programming language L', which is exactly like L except that only programs of length at least 10 bits are allowed. L' maximizes p(valid descriptor) over all programming languages that permit no programs of length less than 10 bits. For L', we find

$$\text{p(valid descriptor) }\leq \sum_{\ell=10}^\infty \sum_{i=2^{\ell - 1}}^{2^{\ell}} 2^{-1 - \ell}$$
$$= \sum_{\ell=10}^\infty 2^{-1 - \ell} \left(2^{\ell - 1} - 1\right)$$
$$= \sum_{\ell=10}^\infty 2^{1 - \ell}/4$$
= 1/1024.

So for L', which is already a very good programming language for finding valid descriptors, we can bound the probability of a valid descriptor by a small number. For practical languages like Python, one would therefore expect p(valid descriptor) to be very low (much lower than 1/1024).

In general, a feature might not be a set of 1's and 0's in specific places. Consider the feature, in an image, of a hand appearing somewhere in the scene: the presence of this feature does not dictate the values of any particular pixels taken alone.

To generalize the idea of feature descriptors to this situation using the concepts of minimum description length, here is what I propose. Let a generalized feature descriptor be a function F, written in a programming language L, that takes the data D as an input and outputs either a 1 (meaning that D does have the given feature) or a 0 (meaning that D does not have the feature). Now, let $$\ell_F = -\log_2 P(F(D) = 1)$$, where the probability is taken with the assumption that every bit of D is sampled from a fair coin flip independently of every other bit of D. If $$\ell_F < |F|$$, and F(D) = 1, then accept F as a valid feature descriptor of D.

This is consistent with the idea of a valid feature descriptor given in the first post. Simply replace the program P of the first post with a function F, such that F(D) = 1 iff D matches the bit mask that is output by P. Then if P is a valid feature descriptor of D, F is a valid generalized feature descriptor of D, and if P is not a valid descriptor of D, neither is F.