Variable normalization for different variable ranges

fog37
TL;DR Summary
Understand when and when not to normalize the range of the independent variables...
Hello,

On the topic of feature scaling: I am wondering if normalization needs to be used all the time or only in some particular circumstances. Normalization means transforming/remapping the range of a variable with values ##[x_0,x_f]## to the range ##[0,1]##.
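Written out, that min-max remapping of a value ##x## is $$x' = \frac{x - x_0}{x_f - x_0},$$ which sends ##x_0## to 0 and ##x_f## to 1.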

For example, let's consider a linear regression model with 3 independent variables and one dependent variable: $$Y = a X_1 + b X_2 + c X_3$$
It is quite likely that the independent variables ##X_1, X_2, X_3## have very different ranges. For example, ##X_1## may take values between 0 and 2000 while ##X_3## only takes values between 0 and 0.5... Is that an issue? Would the variable with the largest range influence the dependent variable ##Y## more strongly just because of its wider range, and not because it is truly important? I don't see normalization being applied all the time...

Is it always good practice, no matter the model we are going for, to first normalize all the independent variables so that their values all fall within the same range?

Another possible issue we may have with independent variables is that they may be pairwise linearly correlated: too much correlation is not good. How much correlation can we accept? Is such correlation undesirable because it leads us to see a correlation between ##Y## and, say, ##X_1## merely by proxy through another independent variable, say ##X_2##, if ##X_1## and ##X_2## are correlated? I don't see a problem with that...

Thank you!
 
fog37 said:
Is it always good practice, no matter the model we are going for, to first normalize all the independent variables so that their values all fall within the same range?
I don't think it ever hurts. Some standard algorithms will automatically normalize the variables every time. The times when you should really consider it are when the variables differ greatly in magnitude and variance. Those are the cases where some matrices in the calculations become "ill-conditioned", so that small changes in round-off and accuracy can make significant changes in the solution.
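If it helps to see the effect concretely, here is a rough NumPy sketch (with made-up ranges echoing the 0–2000 vs 0–0.5 example above, not any particular dataset): min-max scale each column to [0, 1] and compare the condition number of ##X^T X## before and after.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical design matrix: X1 in [0, 2000], X2 in [0, 10], X3 in [0, 0.5]
X = np.column_stack([
    rng.uniform(0, 2000, n),   # X1: very wide range
    rng.uniform(0, 10, n),     # X2: moderate range
    rng.uniform(0, 0.5, n),    # X3: narrow range
])
y = 0.002 * X[:, 0] + 1.5 * X[:, 1] + 4.0 * X[:, 2] + rng.normal(0, 1, n)

# Min-max normalization: remap each column to [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Condition number of X^T X before and after scaling
print("cond(X^T X), raw:       ", np.linalg.cond(X.T @ X))
print("cond(X^T X), normalized:", np.linalg.cond(X_norm.T @ X_norm))

# Least squares works either way; only the numerics (and coefficient scale) change
coef_raw, *_ = np.linalg.lstsq(X, y, rcond=None)
coef_norm, *_ = np.linalg.lstsq(X_norm, y, rcond=None)
print("coefficients on raw scale:       ", coef_raw)
print("coefficients on normalized scale:", coef_norm)
```

The fitted relationship is the same in both cases; the normalized version is just better behaved numerically when the ranges differ by several orders of magnitude.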
fog37 said:
Another possible issue we may have with independent variables is that they may be pairwise linearly correlated: too much correlation is not good. How much correlation can we accept? Is such correlation undesirable because it leads us to see a correlation between ##Y## and, say, ##X_1## merely by proxy through another independent variable, say ##X_2##, if ##X_1## and ##X_2## are correlated? I don't see a problem with that...
There is no problem with the conclusion that ##Y## and ##X_1## are correlated just because ##X_2## and ##X_1## are correlated. Both correlations exist. You should be very careful about drawing any conclusions regarding cause and effect, but either ##X_1## or ##X_2## can be used to estimate ##Y##.
There is a process called "stepwise regression" that might interest you. It treats the issue of correlated independent variables directly.
 
Thank you. I think highly correlated independent variables are called "confounding variables". An example I found online: sunburns and ice cream consumption.

Ice cream consumption is highly correlated with sunburn. That does not mean ice cream consumption causes sunburn, obviously (correlation, not cause and effect).

But should we eliminate one of those independent variables from our model because of the high correlation?
 
fog37 said:
Thank you. I think highly correlated independent variables are called "confounding variables". An example I found online: sunburns and ice cream consumption.

Ice cream consumption is highly correlated with sunburn. That does not mean ice cream consumption causes sunburn, obviously (correlation, not cause and effect).

But should we eliminate one of those independent variables from our model because of the high correlation?
Suppose you want to predict ##Y## based on correlated ##X_1## and ##X_2##. Suppose you start with a model that includes only the variable, say ##X_1##, most highly correlated with ##Y##. What to do about ##X_2##?
Suppose that you remove the correlation of ##X_2## with ##X_1## to get a residual variable, ##\hat{X_2}##. Likewise, you can remove the correlation of ##Y## with ##X_1## to get a residual variable, ##\hat{Y}##. Then the question is whether, having accounted for ##X_1##, there is a significant enough remaining correlation between the residual variables, ##\hat{Y}## and ##\hat{X_2}## to include it.
I suggest that you take a hard look at the process of stepwise regression if you have further questions.
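For what it's worth, here is a minimal NumPy sketch of that residualizing step (with made-up correlated data, not a full stepwise-regression implementation): regress ##X_2## and ##Y## each on ##X_1##, keep the residuals, and look at how much correlation remains.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Made-up data: X2 is strongly correlated with X1, and Y depends on both
x1 = rng.normal(0, 1, n)
x2 = 0.9 * x1 + 0.3 * rng.normal(0, 1, n)
y = 2.0 * x1 + 1.0 * x2 + rng.normal(0, 1, n)

def residual(target, predictor):
    """Remove the simple linear effect of `predictor` from `target`."""
    A = np.column_stack([np.ones_like(predictor), predictor])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return target - A @ coef

# X1 alone is already correlated with Y
print("corr(Y, X1):", np.corrcoef(y, x1)[0, 1])

# Residualize both Y and X2 on X1 (the \hat{Y} and \hat{X_2} of the post),
# then check what correlation X2 still carries once X1 is accounted for
y_hat = residual(y, x1)
x2_hat = residual(x2, x1)
print("corr(Y_hat, X2_hat):", np.corrcoef(y_hat, x2_hat)[0, 1])
# A sizeable remaining correlation suggests X2 is worth adding to the model
```

A sizeable partial correlation between the residuals is essentially what a forward stepwise procedure tests before admitting ##X_2## alongside ##X_1##.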
 