Feature Selection, Information Gain and Information Entropy


Discussion Overview

The discussion revolves around the concepts of feature selection, information gain, and information entropy, particularly in the context of reducing a set of features through these measures. Participants explore the mathematical formulations involved and the implications of using entropy as a cutoff point for feature selection.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant expresses confusion about the assignment of normalized weights to features and how entropy values are calculated, specifically questioning the significance of the cutoff point defined as ## 2^{E_F} ##.
  • Another participant suggests a resource, an online book by Burkov, which may contain relevant information, although they do not provide a direct answer to the initial question.
  • Several participants inquire about the probability space being used in the context of sampling features, seeking clarification on the sampling method and the nature of the features involved.
  • A participant proposes a connection between the problem of feature selection and lossy compression, referencing MacKay's work on information theory and suggesting that the concept of typical elements may relate to the cutoff threshold mentioned earlier.
  • Another participant raises a philosophical question about the nature of information, discussing how the organization of data can influence its perceived value depending on the context of use.

Areas of Agreement / Disagreement

Participants do not reach a consensus, as there are multiple competing views regarding the definitions and implications of information gain and entropy in feature selection. The discussion remains unresolved with various interpretations and questions raised.

Contextual Notes

Participants note limitations in the clarity of definitions and the specifics of the sampling process, which may affect the understanding of the concepts discussed. There is also a recognition of the vagueness in the initial problem statement.

WWGD
TL;DR
To each feature in a set we apply a normalized weight; to each weight we assign an entropy value, and then we use a cutoff value to decide which features remain. I am trying to get a better "feel" for the choice of cutoff value.
Sorry for being fuzzy here; I started reading a small paper and I am a bit confused. These are some loose notes without sources or references.

Say we start with a collection F of features we want to trim into a smaller set F' of features through information gain and entropy (where we are using the formula ## -P_{a_i} \log_2(P_{a_i}) ##).

We start by assigning a normalized weight N_F (so that all weights are between 0 and 1) to each feature. I am not sure of the actual assignment rule. Then we assign an entropy value E_F to each N_F.

And **here** is the key to my post: the cutoff point for a feature is defined as ## 2^{E_F} ##. What does this measure? It seems to have to do with inverting the log value, but how does this relate to information gain as a cutoff point?
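In case it helps pin down what I mean, here is a minimal numerical sketch of the pipeline as I currently understand it; the weights below and the per-feature entropy formula ## -N_F \log_2(N_F) ## are my own guesses at what the notes intend, not something stated in them.

```python
import math

# Hypothetical normalized weights N_F for four features; the only property
# I am taking from the notes is that each weight lies between 0 and 1.
weights = {"f1": 0.50, "f2": 0.25, "f3": 0.15, "f4": 0.10}

for name, n_f in weights.items():
    # Per-feature entropy value E_F = -N_F * log2(N_F)  (my reading of the notes)
    e_f = -n_f * math.log2(n_f)
    # The quantity the notes use as a cutoff: 2^{E_F}
    cutoff = 2 ** e_f
    print(f"{name}: N_F = {n_f:.2f}, E_F = {e_f:.3f}, 2^(E_F) = {cutoff:.3f}")
```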
 
WWGD said:
(where we are using the formula ## -P_{a_i} \log_2(P_{a_i}) ##).

What probability space is being used? If something is being sampled, define what it is and how it's being sampled.

Pick a feature at random from the set of features? Pick a thing at random from a set of things and then see if the thing possesses the ##i##th feature?
 
Stephen Tashi said:
What probability space is being used? If something is being sampled, define what it is and how it's being sampled.

Pick a feature at random from the set of features? Pick a thing at random from a set of things and then see if the thing possesses the ##i##th feature?
Thank you. Unfortunately, I took this from some loose notes rather than a book, and I am not clear about either the source or the details. I was hoping someone would know the area well enough to help me narrow things down.
 
I may be completely wrong here (your definition of the problem is quite vague), but this strikes me as a problem of lossy compression, i.e., how can we compress the set F into a smaller set F' without losing too much information in the process? MacKay's book [1, pp. 75-84] contains some material that, while not identical, seems like it may be closely related to your problem. The author has made a PDF of the book available from their personal website here: https://www.inference.org.uk/mackay/itila/

In [1, Ch. 4, p. 75], an example of lossy compression is given using an alphabet ##\mathcal{A}_X##
$$
\mathcal{A}_X = \{\textrm{a,b,c,d,e,f,g,h}\},
$$
whose symbols occur with probabilities:
$$
\mathcal{P}_X = \{\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{3}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}\}.
$$
Noting that the probability of a given letter being a, b, c, or d is 15/16, if we are willing to live with losing information one time out of 16 then we can reduce our alphabet to:
$$
\mathcal{A}'_X = \{\textrm{a,b,c,d}\}
$$
This is used to build up to a definition of smallest ##\delta##-sufficient subsets. However, the thing that catches my eye in relation to your post is on page 80, in a discussion about compressing strings of ##N## symbols from ##\mathcal{A}_X##. The information content of such a string is given [1, eq. 4.28] as:
$$
\log_2 \dfrac{1}{P(\mathbf{x})} \approx N \sum_i p_i \log_2 \dfrac{1}{p_i} = NH
$$
MacKay then defines "typical elements" of the set as those with probability close to ##2^{-NH}##, which to me seems very similar to your cutoff threshold of ##2^{E_F}##. I am by no means an expert in information theory, so I am unable to offer the insight that you are looking for, but MacKay's book - especially the page range that I have referenced here - may be of use.
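To make the numbers concrete, here is a small sketch (mine, not from the book) that checks the 15/16 figure above and computes the per-symbol entropy ##H## and the typical-element probability ##2^{-NH}## for a string of ##N## symbols:

```python
import math

# Symbol probabilities for a..h from MacKay's example [1, Ch. 4, p. 75]
probs = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]

# Probability that a letter is one of a, b, c, d (the reduced alphabet A'_X)
print("P(a, b, c or d) =", sum(probs[:4]))   # 15/16 = 0.9375

# Per-symbol entropy H = sum_i p_i * log2(1/p_i)
H = sum(p * math.log2(1 / p) for p in probs)
print(f"H = {H:.3f} bits per symbol")        # roughly 2.33 bits

# "Typical elements" among strings of length N have probability close to 2^(-N*H)
N = 100
print(f"typical-element probability ~ 2^(-{N * H:.1f})")
```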

[1] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, 7th ed. Cambridge: Cambridge University Press, 2003.
 
It's an interesting question whether "information" can be defined independently of some specific way of using information.

For example, if we have information about people, which type of list contains more information: a list in alphabetical order by last name, or a list in numerical order from smallest age to largest age?

If the use of the list is in tasks like "Find out about Nelly Smith", then the alphabetical list is preferred. If the use of the list is in tasks like "Find out about 18 year olds", then the latter list is better.
 
