WWGD said:
Summary: How to find features that correlate strongly with a dependent variable.
Hi, I remember reading a paper a while back that argued/proved that a large population of young people (say under 19 years old or so) and a population pyramid that is thick at the bottom are necessary features for the onset of revolution.
**My Question:** Is this determination based on statistics alone, subject-matter knowledge, or a combination of both? What process would one follow to do feature selection, i.e., to determine the choice of features to associate with a given dependent variable, other than just basic correlation analysis? Maybe some type of ANOVA?
There are a number of ways.
https://scikit-learn.org/stable/modules/feature_selection.html
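To make the ANOVA idea from the question concrete, here is a minimal sketch of univariate feature selection with scikit-learn's `SelectKBest` and the ANOVA F-test (`f_classif`). The synthetic dataset and the choice of `k=5` are just for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 candidate features, only 5 of which carry signal.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Score each feature with an ANOVA F-test and keep the 5 highest-scoring.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (200, 5)
print(selector.get_support())  # boolean mask of the chosen columns
```

Note this scores each feature independently; the link above also covers methods that consider features jointly (recursive feature elimination, L1-based selection, etc.).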
Always make sure that you test on a subset of the data that didn't inform the selection. For example, if you are using the features for a predictive model, do the feature selection within the cross-validation loop, not beforehand on the full data. If your feature selection process includes an optimal-parameter search, do that within an inner/nested cross-validation loop. In classical machine learning, the feature selection process is often itself part of the model (the whole pipeline is, including preprocessing, feature selection, and parameter tuning).
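A sketch of what "selection inside the loop" looks like with scikit-learn: the selector is a step in a `Pipeline`, so each fold selects features using only that fold's training data, and the search over `k` runs in an inner cross-validation. The dataset and grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# Feature selection is a pipeline step, so it is refit per training fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inner loop: search over the number of features k.
search = GridSearchCV(pipe, {"select__k": [5, 10, 20]}, cv=3)

# Outer loop: an honest performance estimate on data the selection never saw.
scores = cross_val_score(search, X, y, cv=5)
print(scores.mean())
```

Running `cross_val_score` on the `GridSearchCV` object is what makes this a nested cross-validation rather than an optimistic self-evaluation.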
In some cases there are a large number of candidate features, and just searching for the best ones can fail, since noise alone can produce an apparent correlation in the sample. In those cases, and to some extent in general, it is important to also have some reason to believe the feature might have a causal relationship with the dependent variable, or that the population distributions should show a correlation. That way you begin with a hypothesis and a much smaller number of candidates, and you have a better chance that your finding is reliable.
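A quick simulation shows the problem: with many purely random candidate features and few samples, the best observed correlation with the target is sizeable even though no feature has any real relationship to it. The sample sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 1000                      # few samples, many candidate features
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)           # y is independent of every column of X

# Sample correlation of each feature with y; none is "real", yet the
# maximum over 1000 candidates is far from zero.
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
print(np.abs(corrs).max())
```

Picking the winner of such a search and reporting its correlation is exactly the kind of false discovery the paragraph above warns about.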
It is believed that a very large subset of statistical research is faulty because of this issue. Different scientific fields/sub-fields are continually working toward more robust methodology to avoid these kinds of pitfalls. For example, the p-value threshold that can be relied on depends strongly on the application. Many, many works have presented false discoveries, or bad results in general, because of this issue, for example by assuming p < 0.05 is enough (not to mention p-hacking).
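The p < 0.05 pitfall is easy to demonstrate: run many tests where the null hypothesis is true by construction, and a predictable fraction come out "significant" anyway. Bonferroni correction, shown here as one simple remedy among several, suppresses most of them. The group sizes and number of tests are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests = 200

# 200 two-sample t-tests where both groups are drawn from the SAME
# distribution, so every null hypothesis is true.
pvals = np.array([
    stats.ttest_ind(rng.standard_normal(30), rng.standard_normal(30)).pvalue
    for _ in range(n_tests)
])

print((pvals < 0.05).sum())            # roughly 0.05 * 200 = 10 false hits
print((pvals < 0.05 / n_tests).sum())  # Bonferroni-corrected threshold
```

With the naive threshold you would "discover" several effects that do not exist; the corrected threshold trades some power for control of that error rate.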