A Feature Selection, Information Gain and Information Entropy

AI Thread Summary
The discussion focuses on the process of feature selection using information gain and entropy to reduce a set of features. A normalized weight is assigned to each feature, and the cutoff point for retaining a feature is defined as 2 raised to the entropy value. The conversation raises questions about the probability space involved in sampling features and how this relates to information gain. It also touches on concepts of lossy compression and the definition of sufficient subsets, referencing MacKay's work for further insights. The exploration emphasizes the importance of context in defining the utility of information.
WWGD
TL;DR Summary
To each feature in a set we apply a normalized weight; to each weight we assign an entropy value and then we use a cutoff value to decide which features remain. I am trying to get a better "feel" for the choice of cutoff value
Sorry for being fuzzy here; I started reading a small paper and I am a bit confused. These are some loose notes without sources or references.

Say we start with a collection F of features we want to trim into a smaller set F' of features through information gain and entropy (where we are using the formula ## -P_{a_i} \log_2(P_{a_i}) ##).

We start by assigning a normalized weight ##N_F## (so that all weights are between 0 and 1) to each feature. I am not sure of the actual assignment rule. Then we assign an entropy value ##E_F## to each ##N_F##.

And **here** is the key to my post: the cutoff point for a feature is defined as ##2^{E_F}##. What does this measure? It seems to have to do with inverting the log value, but how does this relate to information gain as a cutoff point?
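To make my question a bit more concrete, here is a rough sketch of how I currently read the scheme; the weight values, the feature names, and the exact cutoff rule are my own guesses rather than anything from the notes:

```python
import math

# Hypothetical normalized weights N_F, one per feature; the actual
# assignment rule from the notes is unknown, so these values are made up.
weights = {"f1": 0.40, "f2": 0.25, "f3": 0.20, "f4": 0.10, "f5": 0.05}

def entropy_term(p):
    """Per-feature entropy value E_F = -p * log2(p)."""
    return -p * math.log2(p)

# Entropy value E_F assigned to each normalized weight N_F.
entropies = {f: entropy_term(w) for f, w in weights.items()}

# The quantity I am asking about: 2**E_F, i.e. the entropy mapped back
# out of log space (a perplexity-like "effective number of outcomes").
cutoffs = {f: 2 ** e for f, e in entropies.items()}

for f, w in weights.items():
    print(f"{f}: N_F={w:.2f}  E_F={entropies[f]:.4f}  2^E_F={cutoffs[f]:.4f}")
```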
 
WWGD said:
(where we are using the formula ## -P_{a_i} \log_2(P_{a_i}) ##).

What probability space is being used? If something is being sampled, define what it is and how it's being sampled.

Pick a feature at random from the set of features? Pick a thing at random from a set of things and then see if the thing possesses the ##i##th feature?
 
Stephen Tashi said:
What probability space is being used? If something is being sampled, define what it is and how it's being sampled.

Pick a feature at random from the set of features? Pick a thing at random from a set of things and then see if the thing possesses the ##i##th feature?
Thank you. Unfortunately, I took this from some loose notes rather than a book, and I am not clear about the source or the details. I was hoping someone would know the area well enough to help me narrow things down.
 
I may be completely wrong here (your definition of the problem is quite vague), but this strikes me as a problem of lossy compression, i.e., how can we compress the set F into a smaller set F' without losing too much information in the process? MacKay's book [1, pp. 75-84] contains material that, while not identical, seems closely related to your problem. The author has made a PDF of the book available on his personal website here: https://www.inference.org.uk/mackay/itila/

In [1, Ch. 4, p. 75], an example of lossy compression is given using an alphabet ##\mathcal{A}_X##
$$
\mathcal{A}_X = \{\textrm{a,b,c,d,e,f,g,h}\},
$$
whose symbols occur with probabilities:
$$
\mathcal{P}_X = \{\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{3}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}\}.
$$
Noting that the probability of a given letter being a, b, c, or d is 15/16, if we are willing to live with losing information one time out of 16, then we can reduce our alphabet to:
$$
\mathcal{A}'_X = \{\textrm{a,b,c,d}\}
$$
This is used to build up to a definition of smallest ##\delta##-sufficient subsets. However, the thing that catches my eye in relation to your post is on page 80, in a discussion about compressing length-##N## strings of symbols from ##\mathcal{A}_X##. The information content of such a string is given [1, eq. 4.28] as:
$$
\log_2 \dfrac{1}{P(\mathbf{x})} \approx N \sum_i p_i \log_2 \dfrac{1}{p_i} = NH
$$
MacKay then defines "typical elements" of the set as those with probability close to ##2^{-NH}##, which to me seems very similar to your cutoff threshold of ##2^{E_F}##. I am by no means an expert in information theory, so I am unable to offer the insight you are looking for, but MacKay's book, especially the page range I have referenced here, may be of use.
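In case it helps to see the numbers, here is a quick sketch (my own, not from the book) that reproduces the figures in this example: the entropy of ##\mathcal{P}_X##, the 15/16 probability covered by the reduced alphabet, and the rough probability ##2^{-NH}## of a typical string, with ##N = 100## chosen arbitrarily:

```python
from fractions import Fraction
import math

# Symbol probabilities from the example alphabet A_X = {a,...,h}.
p = [Fraction(1, 4)] * 3 + [Fraction(3, 16)] + [Fraction(1, 64)] * 4
assert sum(p) == 1

# Entropy H = sum_i p_i * log2(1 / p_i), in bits per symbol.
H = sum(float(pi) * math.log2(1 / float(pi)) for pi in p)
print(f"H = {H:.4f} bits per symbol")

# Probability that a symbol falls in the reduced alphabet {a, b, c, d}:
print("P(a..d) =", sum(p[:4]))           # 15/16, so delta = 1/16 is lost

# Rough probability of a "typical" string of N symbols, about 2**(-N*H).
N = 100                                   # arbitrary string length
print(f"2^(-N*H) is about 10^{-N * H * math.log10(2):.1f}")
```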

[1] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, 7th ed. Cambridge: Cambridge University Press, 2003.
 
It's an interesting question whether "information" can be defined independently of some specific way of using information.

For example, if we have information about people, which type of list contains more information: a list in alphabetical order by last name, or a list in numerical order from smallest age to largest?

If the list is used for tasks like "Find out about Nelly Smith", then the alphabetical list is preferred. If it is used for tasks like "Find out about 18-year-olds", then the latter list is better.
 