Variable Normalization for different variable ranges....

  • #1
TL;DR Summary
Understand when and when not to normalize the range of the independent variables...
Hello,

On the topic of feature scaling: I am wondering if normalization needs to be used all the time or only in some particular circumstances. Normalization means transforming/remapping the range of a variable with values ##[x_0,x_f]## to the range ##[0,1]##.
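
Written out, the remapping described above is the usual min-max formula (a sketch, assuming ##x_0 < x_f## are the smallest and largest values of the variable):

$$x' = \frac{x - x_0}{x_f - x_0}$$

For instance, with the range ##[0, 2000]##, a value of 500 maps to ##x' = 500/2000 = 0.25##.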

For example, let's consider a linear regression model with 3 independent variables and one dependent variable: $$Y = a X_1 + b X_2 + c X_3$$
It is quite likely that the independent variables ##X_1, X_2, X_3## have very different ranges. For example, ##X_1## may have values between 0 and 2000 while ##X_3## takes values only between 0 and 0.5... Is that an issue? Would the variable with the largest range possibly influence the dependent variable ##Y## more significantly just because of its wider range and not because it is truly important? I don't see normalization being applied all the time...

Is it always good practice, no matter the model we are going for, to first normalize all the independent variables so that their values all fall within the same range?

Another possible issue we may have with independent variables is that they may be pairwise linearly correlated: too much correlation is not good. How much correlation can we accept? Is the presence of such correlation undesirable because it leads us to infer an actual correlation between ##Y## and, say, ##X_1##, just by proxy via another independent variable, say ##X_2##, if ##X_1## and ##X_2## are correlated? I don't see a problem with that...

Thank you!
 
  • #2
Is it always good practice, no matter the model we are going for, to first normalize all the independent variables so that their values all fall within the same range?
I don't think it ever hurts. Some standard algorithms normalize the variables automatically every time. The times when you should really consider it are when the variables differ greatly in their magnitude and variance. Those are times when some matrices in the calculations become "ill-conditioned". In those cases, small changes from round-off and limited accuracy in the calculation can make significant changes in the solution.
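
To make the conditioning remark concrete, here is a minimal Python sketch (not from the thread). It assumes NumPy, synthetic data with the ranges mentioned in post #1 (the range for ##X_2## is invented), made-up coefficients, and a model with an intercept term, which the equation in post #1 does not include. It compares the condition number of the design matrix and the least-squares fit before and after min-max scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors; only the X1 and X3 ranges come from the thread.
X1 = rng.uniform(0.0, 2000.0, n)
X2 = rng.uniform(0.0, 10.0, n)
X3 = rng.uniform(0.0, 0.5, n)
X = np.column_stack([X1, X2, X3])

# Made-up "true" coefficients, chosen only for illustration.
y = 0.01 * X1 + 2.0 * X2 + 30.0 * X3 + rng.normal(0.0, 1.0, n)

def min_max(A):
    """Remap every column of A to the range [0, 1]."""
    lo, hi = A.min(axis=0), A.max(axis=0)
    return (A - lo) / (hi - lo)

def fit(A, y):
    """Ordinary least squares with an intercept column."""
    design = np.column_stack([np.ones(len(A)), A])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef, np.linalg.cond(design)

coef_raw, pred_raw, cond_raw = fit(X, y)
coef_scaled, pred_scaled, cond_scaled = fit(min_max(X), y)
span = X.max(axis=0) - X.min(axis=0)

print("condition number, raw   :", cond_raw)
print("condition number, scaled:", cond_scaled)
print("same predictions        :", np.allclose(pred_raw, pred_scaled))
print("slopes rescaled by range:", np.allclose(coef_scaled[1:], coef_raw[1:] * span))
```

With an intercept in the model, min-max scaling leaves the ordinary least-squares predictions unchanged and only multiplies each slope by that variable's range, so a wide range by itself does not make a variable more influential in the fit. What changes is the conditioning of the design matrix, which in this setup should improve noticeably after scaling.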
Another possible issue we may have with independent variables is that they may be pairwise linearly correlated: too much correlation is not good. How much correlation can we accept? Is the presence of such correlation undesirable because it leads us to infer an actual correlation between ##Y## and, say, ##X_1##, just by proxy via another independent variable, say ##X_2##, if ##X_1## and ##X_2## are correlated? I don't see a problem with that...
There is no problem with the conclusion that ##Y## and ##X_1## are correlated just because ##X_2## and ##X_1## are correlated. Both correlations exist. You should be very careful about drawing any conclusions regarding cause and effect, but either ##X_1## or ##X_2## can be used to estimate ##Y##.
There is a process called "stepwise regression" that might interest you. It treats the issue of correlated independent variables directly.
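
As a rough illustration of the idea, here is a simplified forward-selection sketch in Python (assuming NumPy and synthetic data). The function names `rss` and `forward_select` and the stopping rule `min_gain` are hypothetical. Textbook stepwise regression usually decides with F-tests, p-values, or an information criterion, and can also drop variables again, so this is only the greedy "add the most helpful variable" half.

```python
import numpy as np

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def forward_select(X, y, min_gain=0.05):
    """Greedily add the column that lowers the RSS the most.

    min_gain is a hypothetical stopping rule: stop when the best remaining
    candidate reduces the current RSS by less than this fraction.
    """
    chosen, remaining = [], list(range(X.shape[1]))
    current = rss([], y)
    while remaining:
        scores = {j: rss([X[:, k] for k in chosen + [j]], y) for j in remaining}
        best = min(scores, key=scores.get)
        if (current - scores[best]) / current < min_gain:
            break
        chosen.append(best)
        remaining.remove(best)
        current = scores[best]
    return chosen

# Synthetic example: column 1 is almost a copy of column 0.
rng = np.random.default_rng(0)
n = 400
X1 = rng.normal(size=n)
X2 = X1 + 0.01 * rng.normal(size=n)
X3 = rng.normal(size=n)
y = 3.0 * X1 + 2.0 * X3 + rng.normal(size=n)

selected = forward_select(np.column_stack([X1, X2, X3]), y)
print("selected column indices:", selected)
```

In this setup only one of the two nearly identical columns should be selected, because once one of them is in the model the other explains almost nothing more.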
 
  • #3
Thank you. I think highly correlated independent variables are said to be "collinear", and a "confounding variable" is a third variable that drives two correlated ones. An example I found online: sunburns and ice cream consumption.

Ice cream consumption is highly correlated with sunburn. That does not mean ice cream consumption causes sunburn, obviously (no cause-effect relationship).

But should we eliminate one of those independent variables from our model because of the high correlation?
 
  • #4
Suppose you want to predict ##Y## based on correlated ##X_1## and ##X_2##. Suppose you start with a model that includes only the variable most highly correlated with ##Y##, say ##X_1##. What should you do about ##X_2##?
Suppose that you remove the correlation of ##X_2## with ##X_1## (regress ##X_2## on ##X_1## and keep what is left over) to get a residual variable, ##\hat{X}_2##. Likewise, you can remove the correlation of ##Y## with ##X_1## to get a residual variable, ##\hat{Y}##. Then the question is whether, having accounted for ##X_1##, there is a significant enough remaining correlation between the residual variables ##\hat{Y}## and ##\hat{X}_2## to justify including ##X_2## in the model.
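
The residual construction above can be sketched in a few lines of Python (assuming NumPy and synthetic data; the helper `residual` and the numbers are made up for illustration). Each residual is what is left of a variable after a straight-line fit on ##X_1##.

```python
import numpy as np

def residual(v, x):
    """What remains of v after a least-squares line fit on x."""
    slope, intercept = np.polyfit(x, v, 1)
    return v - (slope * x + intercept)

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: X2 is strongly correlated with X1.
X1 = rng.normal(size=n)
X2 = 0.9 * X1 + 0.3 * rng.normal(size=n)
Y = 2.0 * X1 + 0.5 * X2 + rng.normal(size=n)

Y_res = residual(Y, X1)    # plays the role of \hat{Y}
X2_res = residual(X2, X1)  # plays the role of \hat{X}_2

raw_r = np.corrcoef(Y, X2)[0, 1]
partial_r = np.corrcoef(Y_res, X2_res)[0, 1]

print("corr(Y, X2) ignoring X1          :", raw_r)
print("corr of residuals, given X1 known:", partial_r)
```

The second number is the partial correlation of ##Y## and ##X_2## given ##X_1##. In this setup it should come out much smaller than the raw correlation, because most of what ##X_2## says about ##Y## is already carried by ##X_1##.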
I suggest that you take a hard look at the process of stepwise regression if you have further questions.
 
