Feature Selection, Information Gain and Information Entropy

WWGD
TL;DR Summary
To each feature in a set we apply a normalized weight; to each weight we assign an entropy value, and then we use a cutoff value to decide which features remain. I am trying to get a better "feel" for the choice of cutoff value.
Sorry for being fuzzy here; I started reading a small paper and I am a bit confused. These are some loose notes without sources or refs.

Say we start with a collection F of features we want to trim into a smaller set F' of features through information gain and entropy (where we are using the formula ## -P_{a_i} \log_2(P_{a_i}) ##).

We start by assigning a normalized weight ##N_F## (so that all weights are between 0 and 1) to each feature. I am not sure of the actual assignment rule. Then we assign an entropy value ##E_F## to each ##N_F##.

And **here** is the key to my post: the cutoff point for a feature is defined as ## 2^{E_F} ##. What does this measure? It seems to have to do with inverting the log value, but how does this relate to information gain as a cutoff point?
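In case it helps to make my reading concrete, here is a minimal Python sketch of how I understand the procedure. The raw scores and the normalization rule (dividing each score by the total) are my own guesses - the notes don't give the assignment rule - and only the ##E_F## and ##2^{E_F}## steps come from the notes.

```python
import math

def entropy_term(w):
    """Shannon entropy term -w * log2(w) for a normalized weight w in [0, 1]."""
    return 0.0 if w == 0 else -w * math.log2(w)

# Hypothetical raw feature scores; the notes don't say how these are produced.
raw_scores = {"f1": 3.0, "f2": 1.5, "f3": 0.25, "f4": 5.0}

# Guessed normalization rule: divide by the total so every weight lies in [0, 1].
total = sum(raw_scores.values())
weights = {f: s / total for f, s in raw_scores.items()}

# Per-feature entropy value E_F and the cutoff 2^{E_F} from the notes.
# Note that 2^{-w log2(w)} simplifies to w^{-w}.
for f, w in weights.items():
    e_f = entropy_term(w)
    cutoff = 2 ** e_f
    print(f"{f}: weight={w:.3f}, E_F={e_f:.3f}, 2^E_F={cutoff:.3f}")
```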
 
WWGD said:
(where we are using the formula ## -P_{a_i} \log_2(P_{a_i}) ##).

What probability space is being used? If something is being sampled, define what it is and how it's being sampled.

Pick a feature at random from the set of features? Pick a thing at random from a set of things and then see if the thing possesses the ##i##th feature?
 
Stephen Tashi said:
What probability space is being used? If something is being sampled, define what it is and how it's being sampled.

Pick a feature at random from the set of features? Pick a thing at random from a set of things and then see if the thing possesses the ##i##th feature?
Thank you. Unfortunately, I took this from some loose notes rather than a book, and I am not clear about the source or the details. I was hoping someone would know enough about the area to help me narrow things down.
 
I may be completely wrong here (your definition of the problem is quite vague), but this strikes me as a problem of lossy compression, i.e., how we can compress the set F into a smaller set F' without losing too much information in the process. MacKay's book [1, pp. 75-84] contains some material that, while not identical, seems closely related to your problem. The author has made a PDF of the book available from their personal website here: https://www.inference.org.uk/mackay/itila/

In [1, Ch. 4, p. 75], an example of lossy compression is given using an alphabet ##\mathcal{A}_X##
$$
\mathcal{A}_X = \{\textrm{a,b,c,d,e,f,g,h}\},
$$
whose letters occur with probabilities:
$$
\mathcal{P}_X = \{\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{3}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}\}.
$$
Noting that the probability of a given letter being a, b, c, or d is 15/16, if we are willing to live with losing information one time out of 16, then we can reduce our alphabet to:
$$
\mathcal{A}'_X = \{\textrm{a,b,c,d}\}.
$$
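To make those numbers concrete, here is a small Python sketch (my own illustration, not code from the book) that greedily keeps the most probable letters until the retained probability reaches ##1 - \delta##, here with ##\delta = 1/16##:

```python
from fractions import Fraction

# MacKay's example alphabet and letter probabilities [1, Ch. 4].
probs = {
    "a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 4), "d": Fraction(3, 16),
    "e": Fraction(1, 64), "f": Fraction(1, 64), "g": Fraction(1, 64), "h": Fraction(1, 64),
}

def smallest_delta_sufficient_subset(probs, delta):
    """Keep the most probable letters until the retained probability is at least 1 - delta."""
    kept, retained = [], Fraction(0)
    for letter, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        if retained >= 1 - delta:
            break
        kept.append(letter)
        retained += p
    return kept, retained

subset, retained = smallest_delta_sufficient_subset(probs, Fraction(1, 16))
print(subset, retained)  # ['a', 'b', 'c', 'd'] 15/16
```

The greedy rule recovers ##\mathcal{A}'_X = \{\textrm{a,b,c,d}\}## with retained probability 15/16.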
MacKay uses this example to build up to a definition of smallest ##\delta##-sufficient subsets. However, the thing that catches my eye in relation to your post is on page 80, in a discussion about compressing strings of ##N## symbols from ##\mathcal{A}_X##. The information content of such a string is given [1, eq. 4.28] as:
$$
\log_2 \dfrac{1}{P(\mathbf{x})} \approx N \sum_i p_i \log_2 \dfrac{1}{p_i} = NH
$$
MacKay then defines "typical elements" of the set as those with probability close to ##2^{-NH}##, which to me seems very similar to your cutoff threshold of ##2^{E_F}##. I am by no means an expert in information theory, so I am unable to offer the insight that you are looking for, but MacKay's book - especially the page range that I have referenced here - may be of use.
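For what it's worth, here is a short Python sketch that evaluates ##H## and the typical-set probability ##2^{-NH}## for the distribution above; the string length ##N = 100## is just an arbitrary choice for illustration:

```python
import math

# The same example distribution as above.
p = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]

# Per-symbol entropy H = sum_i p_i * log2(1/p_i), in bits.
H = sum(pi * math.log2(1 / pi) for pi in p)

# A "typical" string x of N symbols has information content close to N*H bits,
# i.e. probability close to 2^(-N*H).  N = 100 is an arbitrary illustration.
N = 100
print(f"H = {H:.3f} bits/symbol")
print(f"N*H = {N * H:.1f} bits, so a typical P(x) is about 2^(-{N * H:.1f})")
```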

[1] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, 7th ed. Cambridge: Cambridge University Press, 2003.
 
It's an interesting question whether "information" can be defined independently of some specific way of using information.

For example, if we have information about people, which type of list contains more information: a list in alphabetical order by last name, or a list in numerical order from smallest age to largest?

If the list is used for tasks like "Find out about Nelly Smith", then the alphabetical list is preferred. If it is used for tasks like "Find out about 18-year-olds", then the latter list is better.
 