Feature Selection, Information Gain and Information Entropy

In summary, the conversation discusses using information gain and entropy to trim a collection of features down to a smaller set. The key question is the cutoff point, defined as ##2^{E_F}##, which appears to invert the logarithm in the entropy formula; however, the source and details of the procedure are unclear. Lossy compression is raised as a possibly related idea, with MacKay's book suggested as further reading, and the conversation ends with a question about whether information can be defined independently of a specific way of using it.
  • #1
WWGD
TL;DR Summary
To each feature in a set we assign a normalized weight; to each weight we assign an entropy value, and then we use a cutoff value to decide which features remain. I am trying to get a better "feel" for the choice of cutoff value.
Sorry for being fuzzy here; I started reading a small paper and I am a bit confused. These are some loose notes without sources or references.

Say we start with a collection F of features we want to trim into a smaller set F' of features through information gain and entropy (where we are using the formula ## -P_{a_i} \log_2(P_{a_i}) ##).

We start by assigning a normalized weight ##N_F## (so that all weights are between 0 and 1) to each feature. I am not sure of the actual assignment rule. Then we assign an entropy value ##E_F## to each ##N_F##.

And **here** is the key to my post: the cutoff point for a feature is defined as ## 2^{E_F} ##. What does this measure? It seems to have to do with inverting the log value, but how does this relate to information gain as a cutoff point?
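
For whatever it's worth, here is a rough sketch of how I picture the procedure the notes describe; the weight values and the way the cutoff is actually applied are my own guesses, since the notes don't say.

```python
import math

# Hypothetical normalized weights N_F, one per feature, each in [0, 1].
# The actual assignment rule from the notes is unknown to me.
weights = {"f1": 0.50, "f2": 0.90, "f3": 0.05, "f4": 0.30}

def entropy_term(p):
    """The term -p * log2(p), with the convention that 0 * log2(0) = 0."""
    return 0.0 if p == 0 else -p * math.log2(p)

# Entropy value E_F assigned to each normalized weight N_F.
entropies = {f: entropy_term(w) for f, w in weights.items()}

# The notes define the cutoff for a feature as 2**E_F; how this number is
# compared, and to what, is exactly the part I am unsure about.
cutoffs = {f: 2 ** e for f, e in entropies.items()}

for f in weights:
    print(f, weights[f], round(entropies[f], 3), round(cutoffs[f], 3))
```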
 
  • #3
WWGD said:
(where we are using the formula ## -P_{a_i} \log_2(P_{a_i}) ##).

What probability space is being used? If something is being sampled, define what it is and how it's being sampled.

Pick a feature at random from the set of features? Pick a thing at random from a set of things and then see if the thing possesses the ##i##th feature?
 
  • #4
Stephen Tashi said:
What probability space is being used? If something is being sampled, define what it is and how it's being sampled.

Pick a feature at random from the set of features? Pick a thing at random from a set of things and then see if the thing possesses the ##i##th feature?
Thank you. Unfortunately, I took this from some loose notes rather than a book, and I am not clear about either the source or the details. I was hoping someone would know the area well enough to help me narrow things down.
 
  • #5
I may be completely wrong here (your definition of the problem is quite vague), but this strikes me as a problem of lossy compression, i.e., how can we compress the set F into a smaller set F' without losing too much information in the process? MacKay's book [1, pp. 75-84] contains material that, while not identical, seems closely related to your problem. The author has made a PDF of the book available on their personal website here: https://www.inference.org.uk/mackay/itila/

In [1, Ch. 4, p. 75], an example of lossy compression is given using an alphabet ##\mathcal{A}_X##
$$
\mathcal{A}_X = \{\textrm{a,b,c,d,e,f,g,h}\},
$$
whose letters occur with probabilities:
$$
\mathcal{P}_X = \{\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{3}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}\}.
$$
Noting that the probability of a given letter being a, b, c, or d is 15/16, if we are willing to live with losing information one time out of 16 then we can reduce our alphabet to:
$$
\mathcal{A}'_X = \{\textrm{a,b,c,d}\}
$$
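Just as my own illustration (not code from the book), the 15/16 figure and the reduced alphabet can be found numerically by keeping the most probable symbols until the retained probability mass reaches ##1 - \delta##:

```python
# Probabilities from MacKay's example alphabet {a, ..., h}.
probs = {"a": 1/4, "b": 1/4, "c": 1/4, "d": 3/16,
         "e": 1/64, "f": 1/64, "g": 1/64, "h": 1/64}

def reduced_alphabet(p, delta):
    """Keep the most probable symbols until their total mass is at least 1 - delta."""
    kept, mass = [], 0.0
    for symbol, prob in sorted(p.items(), key=lambda kv: kv[1], reverse=True):
        if mass >= 1 - delta:
            break
        kept.append(symbol)
        mass += prob
    return kept, mass

subset, mass = reduced_alphabet(probs, delta=1/16)
print(subset, mass)  # ['a', 'b', 'c', 'd'] with total probability 15/16 = 0.9375
```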
This reduced-alphabet example is used to build up to a definition of smallest ##\delta##-sufficient subsets. However, the thing that catches my eye in relation to your post is on page 80, in a discussion of compressing strings of ##N## symbols from ##\mathcal{A}_X##. The information content of such a string is given [1, eq. 4.28] as:
$$
\log_2 \dfrac{1}{P(\mathbf{x})} \approx N \sum_i p_i \log_2 \dfrac{1}{p_i} = NH
$$
MacKay then defines "typical elements" of the set as those with probability close to ##2^{-NH}##, which to me seems very similar to your cutoff threshold of ##2^{E_F}##. I am by no means an expert in information theory, so I am unable to offer the insight that you are looking for, but MacKay's book - especially the page range that I have referenced here - may be of use.
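
To put numbers on that (again, just my own back-of-the-envelope illustration): the per-symbol entropy ##H## of the example alphabet, and hence the ##2^{-NH}## probability that typical strings of length ##N## cluster around, can be computed directly:

```python
import math

# Same probabilities as MacKay's example alphabet.
probs = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]

# Per-symbol Shannon entropy H = sum_i p_i * log2(1 / p_i), in bits.
H = sum(p * math.log2(1 / p) for p in probs)

N = 100  # length of the string of symbols drawn from the alphabet
print(f"H = {H:.3f} bits per symbol")
print(f"Typical strings of length N = {N} have probability close to 2^(-{N * H:.1f})")
```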

[1] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, 7th ed. Cambridge: Cambridge University Press, 2003.
 
  • #6
It's an interesting question whether "information" can be defined independently of some specific way of using information.

For example, if we have information about people, then which type of list contains more information: a list in alphabetical order by last name, or a list in numerical order from smallest age to largest age?

If the list is used for tasks like "Find out about Nelly Smith", then the alphabetical list is preferred. If it is used for tasks like "Find out about 18-year-olds", then the latter list is better.
 

1. What is feature selection?

Feature selection is the process of selecting a subset of relevant features from a larger set of features in a dataset. It aims to improve the performance of a machine learning model by reducing complexity and overfitting.

2. What is information gain?

Information gain is a measure used in decision trees to evaluate the usefulness of a feature in classifying a dataset. It measures the reduction in entropy after a dataset is split on a particular feature. Higher information gain indicates that the feature is more important for predicting the target variable.

3. How is information gain calculated?

Information gain is calculated by taking the difference between the entropy of the parent node and the weighted average of the child nodes' entropy after splitting on a particular feature. The feature with the highest information gain is chosen as the splitting feature.
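
As a sketch of the calculation (a toy example, not tied to any particular dataset or library), a binary split might be scored like this:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

# Toy example: 8 samples split on a hypothetical binary feature.
parent = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
left = ["yes", "yes", "yes", "no"]   # samples where the feature is 0
right = ["yes", "no", "no", "no"]    # samples where the feature is 1
print(information_gain(parent, [left, right]))  # about 0.19 bits
```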

4. What is information entropy?

Information entropy is a measure of uncertainty or randomness in a dataset. It is used to quantify the amount of information contained in a dataset. Higher entropy indicates more randomness, while lower entropy indicates more predictability.
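
For example, a fair coin has the maximum possible entropy for two outcomes, while a heavily biased coin is much more predictable and so has lower entropy:
$$
H_{\text{fair}} = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}, \qquad
H_{\text{biased}} = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.47 \text{ bits}.
$$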

5. How does feature selection using information gain help in machine learning?

Feature selection using information gain helps in machine learning by reducing the number of features and selecting the most relevant ones, which leads to a simpler and more accurate model. It also reduces overfitting and improves the model's generalization ability.
