Minimum Description Length for Features

In summary: the concept of minimum description length can be applied to finding valid features in data by using a generalized feature descriptor function F, which outputs a 1 or 0 depending on the presence of the feature in the data. The criterion weighs the length of the feature descriptor against the probability of the feature occurring in random data: F is accepted as a valid descriptor only if its length in bits is less than [tex]-\log_2 P(F(D) = 1)[/tex].
  • #1
mXSCNT
In minimum description length, you start with some programming language L, and some data D, and seek to find a program P in L such that evaluating P outputs D. You try to minimize the length of P, and you consider P to be a valid model of the data if the length of P is less than the length of D.

However, what if your goals are more modest--suppose instead of describing all the data you only wish to find a "feature" in the data, like identifying a shape. Let a "feature descriptor" be a program F that outputs a string S, such that S agrees with D at a large number of points, and where S contains a "dummy" symbol (a symbol not used in D) at each other position. For example, if D = 0101001111111111010100011, then one might let F be a program that outputs the string S = xxxxxx1111111111xxxxxxxxx, which agrees with D at 10 places.

What is a criterion, along the lines of minimum description length, for finding valid features in the data? One possible criterion is to say that F is a valid feature descriptor if the length of F is less than the number of places where F's output, S, agrees with D. Is this a good criterion, or does it lead to spurious features?
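To make the criterion concrete, here is a small sketch using the example string above. The 8- and 12-bit descriptor lengths are hypothetical values for illustration, since the true |F| depends on the programming language used.

```python
# Checking the proposed criterion on the example from the text.

D = "0101001111111111010100011"
S = "xxxxxx1111111111xxxxxxxxx"  # output of a candidate feature descriptor F

# Count positions where S commits to a symbol and agrees with D.
agreements = sum(1 for d, s in zip(D, S) if s != "x" and s == d)

def is_valid_descriptor(descriptor_length_bits: int, agreements: int) -> bool:
    """Proposed criterion: |F| in bits must be less than the agreement count."""
    return descriptor_length_bits < agreements

print(agreements)                           # 10, as stated in the text
print(is_valid_descriptor(8, agreements))   # an 8-bit descriptor would qualify
print(is_valid_descriptor(12, agreements))  # a 12-bit one would not
```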

A good criterion for finding valid features should very rarely allow one to find any "features" in a random string of data, no matter how long. So, suppose we have an infinite string D of 0's and 1's, each randomly selected by a fair coin flip. What is the probability that, by examining D, we can write a computer program P (say, in Python) that is a valid feature descriptor by the above criterion (length of P in bits is less than the number of places where the output of P agrees with D)? Is this probability close to 0 (as we'd like) or close to 1?
 
  • #2
The aforementioned probability is close to 0 for most programming languages, as desired. I can argue thusly:
The case that would maximize that probability is an idealized programming language L containing, for each length [tex]\ell[/tex], a program P with [tex]|P| = \ell[/tex] whose output is an infinite string consisting almost entirely of x's, except for |P|+1 positions filled with 0's and 1's.

Now, let's examine the probability that there is any program in L that is a valid descriptor for D. A program of length [tex]\ell[/tex] is valid only if D agrees with its output at more than [tex]\ell[/tex] places, i.e. at all [tex]\ell + 1[/tex] of its determined positions, which happens with probability [tex]2^{-1-\ell}[/tex]. By the union bound, the probability that any program is a valid descriptor for D is at most the sum of these probabilities, with equality when the event that D matches the output of any one program is mutually exclusive with D matching the output of every other program. So,
[tex]\text{p(valid descriptor) }\leq \sum_{\ell=1}^\infty 2^{-1-\ell}[/tex]
[tex] = \frac{1}{4}\cdot\frac{1}{1-1/2} = 1/2[/tex]

So in this ideal case, for the programming language L, we have already ruled out the case that the probability of a valid descriptor for D is high. However, L was as "good" a programming language as we could possibly hope for. Suppose that instead of L, we have a programming language L', which is exactly like L except that only programs of length at least 10 bits are allowed. L' maximizes p(valid descriptor) over all programming languages that permit no programs of length less than 10 bits. For L', we find

[tex]\text{p(valid descriptor) }\leq \sum_{\ell=10}^\infty 2^{-1-\ell}[/tex]
[tex] = 2^{-11}\cdot\frac{1}{1-1/2} = 2^{-10} = 1/1024.[/tex]

So for L', which is already a very good programming language for finding valid descriptors, we can bound the probability of a valid descriptor by a small number. For practical languages like Python, one would therefore expect p(valid descriptor) to be very low (much lower than 1/1024).
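As a sanity check, the two series above can be evaluated numerically. This sketch uses exact rational arithmetic and a 60-term partial sum, whose tail is below [tex]2^{-60}[/tex] and so does not affect the comparison:

```python
# Numeric check of the two geometric-series bounds derived above.
from fractions import Fraction

def bound(min_len: int, max_len: int = 60) -> Fraction:
    """Partial sum of 2^(-1-l) for l = min_len .. max_len."""
    return sum(Fraction(1, 2 ** (1 + l)) for l in range(min_len, max_len + 1))

print(float(bound(1)))   # ≈ 1/2
print(float(bound(10)))  # ≈ 1/1024
```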
 
  • #3
In general, a feature might not be a set of 1's and 0's in specific places. Consider the feature, in an image, of a hand appearing somewhere in the scene: the presence of this feature does not dictate the values of any particular pixels taken alone.

To generalize the idea of feature descriptors to this situation using the concepts of minimum description length, here is what I propose. Let a generalized feature descriptor be a function F, written in a programming language L, that takes the data D as an input and outputs either a 1 (meaning that D does have the given feature) or a 0 (meaning that D does not have the feature). Now, let [tex]\ell_F = -\log_2 P(F(D) = 1)[/tex], where the probability is taken with the assumption that every bit of D is sampled from a fair coin flip independently of every other bit of D. If [tex]|F| < \ell_F[/tex], and F(D) = 1, then accept F as a valid feature descriptor of D.

This is consistent with the idea of a valid feature descriptor given in the first post. Simply replace the program P of the first post with a function F, such that F(D) = 1 iff D matches the bit mask that is output by P. Then if P is a valid feature descriptor of D, F is a valid generalized feature descriptor of D, and if P is not a valid descriptor of D, neither is F.
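A minimal sketch of this generalized criterion, estimating [tex]\ell_F[/tex] by Monte Carlo sampling of random bit strings. The mask feature, trial count, and seed are illustrative assumptions, not prescribed by the posts above:

```python
# Sketch: estimate l_F = -log2 P(F(D) = 1) by sampling uniformly random
# bit strings, for a simple bit-mask feature like the one in the posts.
import math
import random

def make_mask_feature(mask: str):
    """F(D) = 1 iff D matches `mask` at every non-'x' position."""
    def F(data: str) -> int:
        return int(all(m == "x" or m == d for m, d in zip(mask, data)))
    return F

def estimate_l_F(F, n_bits: int, trials: int = 200_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    hits = sum(F("".join(rng.choice("01") for _ in range(n_bits)))
               for _ in range(trials))
    return -math.log2(hits / trials) if hits else float("inf")

F = make_mask_feature("xx11xx")  # matches with probability 1/4 on random data
l_F = estimate_l_F(F, 6)
print(round(l_F))  # ≈ 2, i.e. -log2(1/4), up to sampling noise
```

Mirroring the first post's criterion, F would then be accepted only if it could be written in fewer than roughly [tex]\ell_F[/tex] bits.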
 

1. What is Minimum Description Length (MDL) for Features?

Minimum Description Length for Features (MDL-F) is a statistical method used to select the most informative and relevant features from a dataset. It aims to find the most concise and accurate description of the data by balancing the complexity of the features and their ability to explain the data.

2. How does MDL-F work?

MDL-F works by evaluating the information content of each feature in a dataset and assigning a cost to each feature based on its complexity. It then selects the features that minimize the total description length, which is a combination of the feature costs and the description length of the data using those features. This results in a set of features that are both informative and parsimonious.
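As a toy illustration of this two-part trade-off: the sketch below charges one "bit" per mask character and one bit per unexplained data symbol, which is a deliberate simplification of a real MDL coding scheme; the masks and data are made up.

```python
# Toy two-part MDL score for feature selection: total description length =
# cost of naming the features + cost of the residual data given them.
from itertools import combinations

def mdl_score(data: list[str], feature_masks: list[str]) -> float:
    model_cost = sum(len(m) for m in feature_masks)  # cost of the features
    residual = 0
    for row in data:
        explained = set()
        for m in feature_masks:
            # A mask explains its non-'x' positions in the rows it matches.
            if all(c == "x" or c == r for c, r in zip(m, row)):
                explained |= {i for i, c in enumerate(m) if c != "x"}
        residual += len(row) - len(explained)  # 1 bit per unexplained symbol
    return model_cost + residual

# Every row shares the pattern 111 in positions 1-3; the first-bit mask
# matches only some rows and does not pay for itself.
data = ["0111010", "1111001", "0111110"]
candidates = ["x111xxx", "0xxxxxx"]
best = min((subset for r in range(len(candidates) + 1)
            for subset in combinations(candidates, r)),
           key=lambda s: mdl_score(data, list(s)))
print(best)  # the feature subset minimizing total description length
```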

3. What are the benefits of using MDL-F?

Using MDL-F can lead to more accurate and robust models by selecting the most relevant features from a dataset. It also helps to avoid overfitting by penalizing complex features, leading to better generalization to new data. MDL-F can also handle noisy and redundant features, making it a useful tool for feature selection in a variety of domains.

4. Are there any limitations to MDL-F?

One potential limitation of MDL-F is that it depends on the model class (coding scheme) chosen to describe the data; if that class fits the data poorly, the selected features may not be meaningful. Additionally, MDL-F can be computationally expensive, especially for large datasets, since in the worst case it involves scoring many possible feature combinations.

5. How does MDL-F compare to other feature selection methods?

Compared to traditional feature selection methods, such as forward or backward selection, MDL-F is a more rigorous and principled approach. It takes into account both the complexity and relevance of features, whereas other methods may only consider one of these factors. This makes MDL-F a more robust and reliable method for feature selection.
