Variable normalization for different variable ranges

In summary, feature scaling, specifically normalization, can be important in certain cases to prevent issues with ill-conditioned matrices. Additionally, high correlation between independent variables can be handled through the process of stepwise regression.
  • #1
fog37
TL;DR Summary
Understand when and when not to normalize the range of the independent variables...
Hello,

On the topic of feature scaling: I am wondering if normalization needs to be used all the time or only in some particular circumstances. Normalization means transforming/remapping the range of a variable with values ##[x_0,x_f]## to the range ##[0,1]##.
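For concreteness, here is a minimal sketch of that min-max remapping in Python (the sample values are made up for illustration):

```python
import numpy as np

def min_max_normalize(x):
    """Map values from [x.min(), x.max()] to [0, 1] (min-max scaling)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Example: a feature ranging roughly 0..2000 is remapped to [0, 1]
x1 = np.array([12.0, 850.0, 1999.0, 430.0])
print(min_max_normalize(x1))  # approximately [0. 0.4217 1. 0.2104]
```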

For example, let's consider a linear regression model with 3 independent variables and one dependent variable: $$Y = a X_1 + b X_2 + c X_3$$
It is generally likely that the independent variables ##X_1, X_2, X_3## have very different ranges. For example, ##X_1## may take values between 0 and 2000 while ##X_3## only takes values between 0 and 0.5. Is that an issue? Would the variable with the largest range influence the dependent variable ##Y## more significantly just because of its wider range, and not because it is truly important? I don't see normalization being applied all the time...

Is it always good practice, no matter the model we are going for, to first normalize all the independent variables so that their values all fall within the same range?

Another possible issue we may have with independent variables is that they may be pairwise linearly correlated: too much correlation is not good. How much correlation can we accept? Is such correlation undesirable because it suggests a correlation between ##Y## and, say, ##X_1##, only by proxy through another independent variable, say ##X_2##, when ##X_1## and ##X_2## are correlated? I don't see a problem with that...

Thank you!
 
  • #2
fog37 said:
Is it always good practice, no matter the model we are going for, to first normalize all the independent variables so that their values all fall within the same range?
I don't think it ever hurts. Some standard algorithms will automatically normalize them every time. The times when you should really consider it are when the variables differ greatly in magnitude and variance. Those are the cases where some matrices in the calculations become "ill-conditioned", so small changes in round-off and accuracy during the calculation can make significant changes in the solution.
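To see the ill-conditioning concretely, here is a small numerical sketch (synthetic data, not from the thread): the condition number of ##X^T X##, which appears in the normal equations, is huge when the columns have very different scales and drops sharply after min-max scaling.

```python
import numpy as np

# Hypothetical design matrix: column scales differ by several orders of magnitude
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 2000, 200),   # X1 in [0, 2000]
    rng.uniform(0, 5, 200),      # X2 in [0, 5]
    rng.uniform(0, 0.5, 200),    # X3 in [0, 0.5]
])

# Condition number of X^T X with the raw columns
print(np.linalg.cond(X.T @ X))

# Rescale each column to [0, 1] and compare
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.linalg.cond(X_scaled.T @ X_scaled))  # much smaller
```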
fog37 said:
Another possible issue we may have with independent variables is that they may be pairwise linearly correlated: too much correlation is not good. How much correlation can we accept? Is such correlation undesirable because it suggests a correlation between ##Y## and, say, ##X_1##, only by proxy through another independent variable, say ##X_2##, when ##X_1## and ##X_2## are correlated? I don't see a problem with that...
There is no problem with the conclusion that ##Y## and ##X_1## are correlated just because ##X_2## and ##X_1## are correlated. Both correlations exist. You should be very careful about drawing any conclusions regarding cause and effect, but either ##X_1## or ##X_2## can be used to estimate ##Y##.
There is a process called "stepwise regression" that might interest you. It treats the issue of correlated independent variables directly.
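As a rough illustration of the idea (a sketch of the forward-selection half only, not any particular textbook's stepwise procedure), variables are added one at a time, so a predictor that is highly correlated with one already in the model only enters if it still reduces the residual error:

```python
import numpy as np

def forward_stepwise(X, y, max_vars=None):
    """Greedy forward selection: repeatedly add the column of X that most
    reduces the residual sum of squares of an ordinary least-squares fit.
    Real stepwise procedures add a stopping rule (F-tests, AIC, etc.)."""
    n, p = X.shape
    max_vars = p if max_vars is None else max_vars
    selected, remaining = [], list(range(p))
    while remaining and len(selected) < max_vars:
        best_j, best_rss = None, np.inf
        for j in remaining:
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])  # intercept + candidates
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```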
 
  • #3
Thank you. I think highly correlated independent variables are called "confounding variables". An example I found online: sunburns and ice cream consumption.

Ice cream consumption is highly correlated with sunburn. That does not mean ice cream consumption causes sunburn, obviously (correlation, not cause and effect).

But should we eliminate one of those independent variables from our model because of the high correlation?
 
  • #4
fog37 said:
Thank you. I think highly correlated independent variables are called "confounding variables". An example I found online: sunburns and ice cream consumption.

Ice cream consumption is highly correlated with sunburn. That does not mean ice cream consumption causes sunburn, obviously (correlation, not cause and effect).

But should we eliminate one of those independent variables from our model because of the high correlation?
Suppose you want to predict ##Y## based on correlated ##X_1## and ##X_2##. Suppose you start with a model that includes only the variable most highly correlated with ##Y##, say ##X_1##. What to do about ##X_2##?
Suppose that you remove the correlation of ##X_2## with ##X_1## to get a residual variable, ##\hat{X_2}##. Likewise, you can remove the correlation of ##Y## with ##X_1## to get a residual variable, ##\hat{Y}##. Then the question is whether, having accounted for ##X_1##, there is a significant enough remaining correlation between the residual variables ##\hat{Y}## and ##\hat{X_2}## to justify including ##X_2## as well.
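A small sketch of that residual idea with synthetic data (the coefficients below are made up for illustration): the correlation between ##\hat{Y}## and ##\hat{X_2}## is the partial correlation of ##Y## and ##X_2## given ##X_1##.

```python
import numpy as np

def residual(v, u):
    """Residual of v after removing its least-squares fit on u (with an intercept)."""
    A = np.column_stack([np.ones_like(u), u])
    beta, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ beta

# Made-up correlated predictors and response, only for illustration
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=500)   # X2 correlated with X1
y = 2.0 * x1 + 0.5 * x2 + rng.normal(size=500)

x2_hat = residual(x2, x1)   # part of X2 not explained by X1
y_hat = residual(y, x1)     # part of Y not explained by X1

# If this remaining correlation is still sizable, X2 adds information
# beyond X1 and is worth keeping in the model
print(np.corrcoef(y_hat, x2_hat)[0, 1])
```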
I suggest that you take a hard look at the process of stepwise regression if you have further questions.
 

1. What is variable normalization?

Variable normalization is a process used in data analysis to transform variables with different ranges into a common scale. This allows for fair comparison and analysis of the variables.

2. Why is variable normalization important?

Variable normalization is important because it helps to eliminate the impact of different variable ranges on data analysis. It ensures that all variables are equally weighted and avoids giving undue importance to variables with larger ranges.

3. How is variable normalization performed?

Variable normalization can be performed using various techniques such as min-max scaling, z-score normalization, and decimal scaling. Min-max scaling maps values to a fixed interval such as [0, 1] or [-1, 1], z-score normalization rescales values to zero mean and unit standard deviation, and decimal scaling divides values by a power of ten so that their magnitudes fall below 1.
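A minimal sketch of those three techniques in Python (the sample values are arbitrary):

```python
import numpy as np

x = np.array([-12.0, 4.0, 250.0, 999.0])

# Min-max scaling: maps values to [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation
# (values are NOT confined to a fixed interval such as [0, 1])
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^k, where k is the smallest integer
# such that all scaled magnitudes are below 1
k = int(np.ceil(np.log10(np.abs(x).max())))
decimal = x / 10**k

print(min_max, z_score, decimal, sep="\n")
```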

4. What are the benefits of variable normalization?

Variable normalization has several benefits, including improving the accuracy and reliability of data analysis, making data more interpretable, and reducing the impact of outliers on the analysis. It also helps in improving the performance of machine learning algorithms.

5. Are there any limitations of variable normalization?

While variable normalization has many benefits, it also has some limitations. For example, it can distort the original distribution of the data and may not be suitable for all types of data. It also requires careful consideration of the appropriate normalization technique for the specific dataset.
