Variable Normalization for Different Variable Ranges

  • Context: Undergrad 
  • Thread starter: fog37
  • Tags: Normalization, Variable

Discussion Overview

The discussion revolves around the necessity and implications of variable normalization in the context of feature scaling for regression models. Participants explore when normalization should be applied, its effects on model performance, and the challenges posed by correlated independent variables.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants question whether normalization should be applied universally or only in specific circumstances, particularly when independent variables have significantly different ranges.
  • There is a concern that variables with larger ranges may disproportionately influence the dependent variable, potentially skewing results.
  • One participant suggests that normalization is particularly important when variables differ greatly in magnitude and variance, as it can prevent ill-conditioning in matrices used in calculations.
  • Participants discuss the implications of high correlation among independent variables, with some noting that this can lead to misleading interpretations of correlation with the dependent variable.
  • There is mention of "confounding variables" and an example involving sunburns and ice cream consumption to illustrate the potential pitfalls of correlation without causation.
  • One participant proposes the use of stepwise regression as a method to address issues of correlated independent variables and to determine their significance in the model.

Areas of Agreement / Disagreement

Participants express differing views on the necessity of normalization and the treatment of correlated variables. While some advocate for normalization as a good practice, others question its universal applicability. The discussion on how to handle correlated independent variables remains unresolved, with multiple perspectives presented.

Contextual Notes

Participants highlight the complexity of determining acceptable levels of correlation and the potential for misinterpretation of relationships between variables. The discussion reflects a range of assumptions and conditions regarding normalization and correlation without reaching a consensus.

fog37
TL;DR
Understand when and when not to normalize the range of the independent variables...
Hello,

On the topic of feature scaling: I am wondering whether normalization needs to be used all the time or only in particular circumstances. By normalization I mean remapping the range of a variable with values in ##[x_0, x_f]## to the range ##[0,1]##.
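In code, min-max normalization of each feature column might look like the following minimal sketch (the NumPy array `X` and its values are made-up, purely for illustration):

```python
import numpy as np

# Three features with very different ranges (made-up example values)
X = np.array([[1500.0, 3.2, 0.41],
              [ 200.0, 7.8, 0.05],
              [1900.0, 1.1, 0.33]])

# Min-max normalization: remap each column from [x_0, x_f] to [0, 1]
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_norm = (X - x_min) / (x_max - x_min)

print(X_norm)  # every column now lies in [0, 1]
```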

For example, let's consider a linear regression model with 3 independent variables and one dependent variable: $$Y = a X_1 + b X_2 + c X_3$$
It is quite likely that the independent variables ##X_1, X_2, X_3## have very different ranges. For example, ##X_1## may take values between 0 and 2000 while ##X_3## only takes values between 0 and 0.5. Is that an issue? Would the variable with the largest range influence the dependent variable ##Y## more significantly just because of its wider range, and not because it is truly important? I don't see normalization being applied all the time...

Is it always good practice, no matter the model we are going for, to first normalize all the independent variables so that their values all fall within the same range?

Another possible issue with independent variables is that they may be pairwise linearly correlated: too much correlation is not good. How much correlation can we accept? Is such correlation undesirable because it leads one to infer a correlation between ##Y## and, say, ##X_1## merely by proxy through another independent variable, say ##X_2##, when ##X_1## and ##X_2## are correlated? I don't see a problem with that...
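One quick way to check for this is to compute the pairwise correlations among the predictors before fitting anything. A minimal sketch with synthetic data (the variables `x1`, `x2`, `x3` are made-up assumptions, with `x2` deliberately built to track `x1`):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)  # constructed to correlate with x1
x3 = rng.normal(size=500)                   # independent of the others

# Pairwise Pearson correlations among the predictors (rows = variables)
print(np.corrcoef([x1, x2, x3]))
```

An off-diagonal entry near ##\pm 1## flags a pair of predictors carrying nearly the same information.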

Thank you!
 
fog37 said:
Is it always good practice, no matter the model we are going for, to first normalize all the independent variables so that their values all fall within the same range?
I don't think it ever hurts. Some standard algorithms will automatically normalize them every time. The cases where you should really consider it are when the variables differ greatly in their magnitude and variance. Those are the cases where some matrices in the calculations become "ill-conditioned": small changes in round-off and accuracy can then make significant changes in the solution.
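To illustrate the ill-conditioning point, here is a minimal sketch (using made-up uniform data on the ranges mentioned earlier in the thread) comparing the condition number of ##X^T X## before and after min-max scaling; a large condition number means round-off errors are strongly amplified in the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Predictors on wildly different scales, echoing the 0-2000 vs 0-0.5 example
X = np.column_stack([rng.uniform(0, 2000, size=n),
                     rng.uniform(0, 10, size=n),
                     rng.uniform(0, 0.5, size=n)])

def cond_of_normal_matrix(A):
    """Condition number of A^T A, the matrix inverted in least squares."""
    return np.linalg.cond(A.T @ A)

# Min-max scale every column to [0, 1] and compare
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(f"raw:    {cond_of_normal_matrix(X):.3e}")
print(f"scaled: {cond_of_normal_matrix(X_scaled):.3e}")
```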
fog37 said:
Another possible issue with independent variables is that they may be pairwise linearly correlated: too much correlation is not good. How much correlation can we accept? Is such correlation undesirable because it leads one to infer a correlation between ##Y## and, say, ##X_1## merely by proxy through another independent variable, say ##X_2##, when ##X_1## and ##X_2## are correlated? I don't see a problem with that...
There is no problem with the conclusion that ##Y## and ##X_1## are correlated just because ##X_2## and ##X_1## are correlated. Both correlations exist. You should be very careful about drawing any conclusions regarding cause and effect, but either ##X_1## or ##X_2## can be used to estimate ##Y##.
There is a process called "stepwise regression" that might interest you. It treats the issue of correlated independent variables directly.
 
Thank you. I think highly correlated independent variables are called "confounding variables". An example I found online: sunburns and ice cream consumption.

Ice cream consumption is highly correlated with sunburn. Obviously, that does not mean ice cream consumption causes sunburn (correlation, not causation).

But should we eliminate one of those independent variables from our model because of the high correlation?
 
fog37 said:
Thank you. I think highly correlated independent variables are called "confounding variables". An example I found online: sunburns and ice cream consumption.

Ice cream consumption is highly correlated with sunburn. Obviously, that does not mean ice cream consumption causes sunburn (correlation, not causation).

But should we eliminate one of those independent variables from our model because of the high correlation?
Suppose you want to predict ##Y## based on correlated ##X_1## and ##X_2##. Suppose you start with a model that includes only the variable, say ##X_1##, most highly correlated with ##Y##. What to do about ##X_2##?
Suppose that you remove the correlation of ##X_2## with ##X_1## to get a residual variable, ##\hat{X_2}##. Likewise, you can remove the correlation of ##Y## with ##X_1## to get a residual variable, ##\hat{Y}##. The question is then whether, having accounted for ##X_1##, there is enough remaining correlation between the residual variables ##\hat{Y}## and ##\hat{X_2}## to justify including ##X_2##.
I suggest that you take a hard look at the process of stepwise regression if you have further questions.
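Here is a minimal sketch of the residual step described above, on made-up synthetic data; the helper `residual` is a hypothetical name, and the simple one-variable regressions are done with `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=500)   # correlated with x1
y = 2.0 * x1 + 0.5 * x2 + rng.normal(size=500)

def residual(v, u):
    """Residual of v after a simple linear regression of v on u."""
    slope, intercept = np.polyfit(u, v, 1)
    return v - (slope * u + intercept)

x2_hat = residual(x2, x1)  # the part of X2 not explained by X1
y_hat = residual(y, x1)    # the part of Y not explained by X1

# If this partial correlation is still sizeable, X2 adds information
# beyond what X1 already provides and is a candidate to include.
print(np.corrcoef(y_hat, x2_hat)[0, 1])
```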
 
