Feature Selection, Information Gain and Information Entropy


Discussion Overview

The discussion revolves around the concepts of feature selection, information gain, and information entropy, particularly in the context of reducing a set of features through these measures. Participants explore the mathematical formulations involved and the implications of using entropy as a cutoff point for feature selection.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant expresses confusion about the assignment of normalized weights to features and how entropy values are calculated, specifically questioning the significance of the cutoff point defined as ## 2^{E_F} ##.
  • Another participant suggests a resource, an online book by Burkov, which may contain relevant information, although they do not provide a direct answer to the initial question.
  • Several participants inquire about the probability space being used in the context of sampling features, seeking clarification on the sampling method and the nature of the features involved.
  • A participant proposes a connection between the problem of feature selection and lossy compression, referencing MacKay's work on information theory and suggesting that the concept of typical elements may relate to the cutoff threshold mentioned earlier.
  • Another participant raises a philosophical question about the nature of information, discussing how the organization of data can influence its perceived value depending on the context of use.

Areas of Agreement / Disagreement

Participants do not reach a consensus, as there are multiple competing views regarding the definitions and implications of information gain and entropy in feature selection. The discussion remains unresolved with various interpretations and questions raised.

Contextual Notes

Participants note limitations in the clarity of definitions and the specifics of the sampling process, which may affect the understanding of the concepts discussed. There is also a recognition of the vagueness in the initial problem statement.

WWGD
TL;DR
To each feature in a set we apply a normalized weight; to each weight we assign an entropy value, and then we use a cutoff value to decide which features remain. I am trying to get a better "feel" for the choice of cutoff value.
Sorry for being fuzzy here; I started reading a small paper and I am a bit confused. These are some loose notes without sources or references.

Say we start with a collection F of features we want to trim into a smaller set F' of features through information gain and entropy (where we are using the formula ## -P_{a_i} \log_2(P_{a_i}) ##).

We start by assigning a normalized weight N_F (so that all weights are between 0 and 1) to each feature. I am not sure of the actual assignment rule. Then we assign an entropy value E_F to each N_F.

And **here** is the key to my post: the cutoff point for a feature is defined as ## 2^{E_F} ##. What does this measure? It seems to have to do with inverting the log value, but how does this relate to information gain as a cutoff point?
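In case it helps pin down what I mean, here is a minimal numerical sketch of the pipeline as I currently understand it; the weights below and the per-feature entropy formula ## -N_F \log_2(N_F) ## are my own guesses at what the notes intend, not something stated in them.

```python
import math

# Hypothetical normalized weights N_F for four features; the only property
# I am taking from the notes is that each weight lies between 0 and 1.
weights = {"f1": 0.50, "f2": 0.25, "f3": 0.15, "f4": 0.10}

for name, n_f in weights.items():
    # Per-feature entropy value E_F = -N_F * log2(N_F)  (my reading of the notes)
    e_f = -n_f * math.log2(n_f)
    # The quantity the notes use as a cutoff: 2^{E_F}
    cutoff = 2 ** e_f
    print(f"{name}: N_F = {n_f:.2f}, E_F = {e_f:.3f}, 2^(E_F) = {cutoff:.3f}")
```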
 
WWGD said:
(where we are using the formula ## -P_{a_i} \log_2(P_{a_i}) ##).

What probability space is being used? If something is being sampled, define what it is and how it's being sampled.

Pick a feature at random from the set of features? Pick a thing at random from a set of things and then see if the thing possesses the ##i##th feature?
 
Stephen Tashi said:
What probability space is being used? If something is being sampled, define what it is and how it's being sampled.

Pick a feature at random from the set of features? Pick a thing at random from a set of things and then see if the thing possesses the ##i##th feature?
Thank you. Unfortunately, I took this from some loose notes rather than a book, and I am not clear about either the source or the details. I was hoping someone would know the area well enough to help me narrow things down.
 
I may be completely wrong here (your definition of the problem is quite vague), but this strikes me as a problem of lossy compression, i.e., how can we compress the set F into a smaller set F' without losing too much information in the process? MacKay's book [1, pp. 75-84] contains some material that, while not identical, seems like it may be closely related to your problem. The author has made a PDF of the book available from their personal website here: https://www.inference.org.uk/mackay/itila/

In [1, Ch. 4, p. 75], an example of lossy compression is given using an alphabet ##\mathcal{A}_X##
$$
\mathcal{A}_X = \{\textrm{a,b,c,d,e,f,g,h}\},
$$
whose symbols occur with probabilities:
$$
\mathcal{P}_X = \{\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{3}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}\}.
$$
Noting that the probability of a given letter being a, b, c, or d is 15/16, if we are willing to live with losing information one time out of 16 then we can reduce our alphabet to:
$$
\mathcal{A}'_X = \{\textrm{a,b,c,d}\}
$$
This is used to build up to a definition of smallest ##\delta##-sufficient subsets. However, the thing that catches my eye in relation to your post is on page 80, in a discussion about compressing strings of ##N## symbols from ##\mathcal{A}_X##. The information content of such a string is given [1, eq. 4.28] as:
$$
\log_2 \dfrac{1}{P(\mathbf{x})} \approx N \sum_i p_i \log_2 \dfrac{1}{p_i} = NH
$$
MacKay then defines "typical elements" of the set as those with probability close to ##2^{-NH}##, which to me seems very similar to your cutoff threshold of ##2^{E_F}##. I am by no means an expert in information theory, so I am unable to offer the insight that you are looking for, but MacKay's book - especially the page range that I have referenced here - may be of use.
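To make the numbers concrete, here is a small sketch (mine, not from the book) that checks the 15/16 figure above and computes the per-symbol entropy ##H## and the typical-element probability ##2^{-NH}## for a string of ##N## symbols:

```python
import math

# Symbol probabilities for a..h from MacKay's example [1, Ch. 4, p. 75]
probs = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]

# Probability that a letter is one of a, b, c, d (the reduced alphabet A'_X)
print("P(a, b, c or d) =", sum(probs[:4]))   # 15/16 = 0.9375

# Per-symbol entropy H = sum_i p_i * log2(1/p_i)
H = sum(p * math.log2(1 / p) for p in probs)
print(f"H = {H:.3f} bits per symbol")        # roughly 2.33 bits

# "Typical elements" among strings of length N have probability close to 2^(-N*H)
N = 100
print(f"typical-element probability ~ 2^(-{N * H:.1f})")
```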

[1] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, 7th ed. Cambridge: Cambridge University Press, 2003.
 
It's an interesting question whether "information" can be defined independently of some specific way of using information.

For example, if we have information about people, which type of list contains more information: a list in alphabetical order by last name, or a list in numerical order from smallest age to largest age?

If the use of the list is in tasks like "Find out about Nelly Smith", then the alphabetical list is preferred. If the use of the list is in tasks like "Find out about 18 year olds", then the latter list is better.
 
