Linear regression, feature scaling, and regression coefficients

SUMMARY

This discussion centers on the importance of feature scaling in linear regression, particularly regarding how the range of independent variables affects regression coefficients. Standardizing independent variables before model building is essential for accurate interpretation of coefficients, as larger ranges can lead to misleading coefficient magnitudes. While scaling is not always required, it is crucial for certain algorithms and can prevent numerical issues. The consensus is that linear regression coefficients are computed correctly without scaling, provided there are no data entry errors, but scaling can enhance interpretability and model performance.

PREREQUISITES
  • Understanding of linear regression and ordinary least squares (OLS)
  • Familiarity with feature scaling techniques such as standardization and normalization
  • Knowledge of statistical significance and p-values
  • Awareness of machine learning algorithms that require scaling, such as K-nearest neighbors
NEXT STEPS
  • Research the impact of feature scaling on regression coefficients in Python using libraries like scikit-learn
  • Explore the differences between standardization and normalization in data preprocessing
  • Learn about the assumptions underlying linear regression models and their implications
  • Investigate how different machine learning algorithms handle feature scaling and the consequences of neglecting it
USEFUL FOR

Data scientists, statisticians, and machine learning practitioners looking to enhance their understanding of linear regression and the role of feature scaling in model accuracy and interpretability.

fog37
Hello,

In studying linear regression more deeply, I learned that scaling plays an important role in multiple ways:

a) the range of the independent variables ##X## affects the values of the regression coefficients. For example, a predictor variable ##X## with a large range typically gets assigned a larger regression coefficient, and comparing the relative importance of predictors solely on coefficient magnitude is misleading. The more appropriate way to compare coefficients for relative importance is to standardize the independent variables (standardization is a form of scaling) before building the model, as in the sketch below.
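A minimal sketch of this point, assuming scikit-learn is available; the feature names and data are invented for illustration. On the raw data the income coefficient looks negligible next to the age coefficient purely because of units; after standardization the two are directly comparable, and income actually matters more per standard deviation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
income = rng.normal(50_000, 15_000, n)   # large range (dollars)
age = rng.normal(40, 10, n)              # small range (years)
y = 0.0003 * income + 0.2 * age + rng.normal(0, 1, n)
X = np.column_stack([income, age])

# Raw fit: the income coefficient looks tiny only because of its units.
raw = LinearRegression().fit(X, y)
print("raw coefficients:", raw.coef_)            # ~[0.0003, 0.2]

# Standardized fit: coefficients are now "effect per standard deviation"
# and can be compared directly; here income dominates after all.
Xs = StandardScaler().fit_transform(X)
std = LinearRegression().fit(Xs, y)
print("standardized coefficients:", std.coef_)   # ~[4.5, 2.0]
```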

Another benefit of scaling the predictor variables (standardization, normalization, or any other scaling technique) is that it can make the coefficients more interpretable: sometimes a regression coefficient is extremely small simply because of the particular scale of the data. Rescaling the predictor can yield a larger coefficient and make the relationship between ##Y## and ##X## easier to read off, as the sketch below illustrates.
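A minimal sketch of this effect, assuming statsmodels is available and using invented data: rescaling a predictor by a constant multiplies its coefficient by the inverse factor, while the t-statistic and p-value are unchanged.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
salary = rng.normal(50_000, 10_000, 200)            # predictor in dollars
y = 0.0002 * salary + rng.normal(0, 0.5, 200)

# Same model, two unit choices for the predictor.
fit_dollars = sm.OLS(y, sm.add_constant(salary)).fit()
fit_10k = sm.OLS(y, sm.add_constant(salary / 10_000)).fit()

print(fit_dollars.params[1], fit_dollars.pvalues[1])  # ~0.0002, tiny p-value
print(fit_10k.params[1], fit_10k.pvalues[1])          # ~2.0, same p-value
```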

I also read that certain statistical and ML algorithms really require scaling, while others (rule-based ones) don't.

So, in essence, scaling is useful but not always required; for some algorithms, however, it is a necessary pre-processing step...

Finally, my question: without any scaling of the independent variables, does linear regression (multiple or simple) perform properly, i.e., are the regression coefficients computed correctly? And, interpretability issues aside, does linear regression (OLS) generate larger coefficients for variables with a larger range?

Thank you for any input on this!
 
Whether or not to scale is primarily determined just by concerns about computational overflows and round-off errors. You should always look at the statistical significance of the coefficients (how many standard deviations they are away from zero) rather than just their magnitude. Any reasonable statistics package will have a regression algorithm that includes the information you need.
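A minimal sketch of this advice, assuming statsmodels is available (its OLS results expose standard errors, t-statistics, and p-values directly); the data are synthetic. The first predictor's coefficient is tiny in magnitude yet highly significant, because significance is judged in standard-error units, not raw units:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
X[:, 0] *= 10_000                       # first predictor has a huge range
y = 0.001 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, 300)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)     # magnitudes reflect units, not importance
print(fit.tvalues)    # coefficient / standard error: unit-free
print(fit.pvalues)
```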
 
True. I have an example where the coefficient is practically zero yet the p-value is very small (< 0.05). Linearly rescaling the predictor gives a larger regression coefficient with the same p-value.

I guess my dilemma is that certain algorithms "require" feature scaling to perform correctly, and I am wondering if linear regression is one of them...
 
"Finally my question: without any type of scaling the independent variables, does linear regression (multiple or single) perform properly, i.e. are the regression coefficient computed correctly?"
Always, as long as there are no data entry errors. The problem here is that your question is not well-phrased: if you take any set of data, correctly entered, and apply least squares, then, assuming the program carries out LS correctly, the coefficients are computed correctly -- you get the answers you should get based on the inputs.
What I think you mean by "computed correctly" is this: are they the ones appropriate for the context of the problem? IMO the answer there is more subtle. Note that we never know the true form of any model: whenever you specify the form of a linear regression model you are making an assumption that it is correct. This means that, by default, assuming no errors in data entry, recording, or calculation, the coefficients are computed correctly for the assumed model form.
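A minimal sketch of this point using only numpy: for a given design matrix and assumed model form, OLS has a single correct answer, the solution of the normal equations ##\hat\beta = (X^T X)^{-1} X^T y##, whatever the units of the predictors. Rescaling a column changes its coefficient by the inverse factor, but the fitted values are identical:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(0, 1, 100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 100)

# Solve the normal equations directly.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)                                   # ~[1.0, 2.0]

# Rescale the predictor column by 100: its coefficient shrinks by 100,
# but the fitted values are unchanged.
Xs = X.copy()
Xs[:, 1] *= 100.0
beta_s = np.linalg.solve(Xs.T @ Xs, Xs.T @ y)
print(beta_s)                                 # ~[1.0, 0.02]
print(np.allclose(X @ beta, Xs @ beta_s))     # True
```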
If you're asking about scaling there are two things [at least] to think about.
First: suppose, as an extreme example, you're trying to predict a person's age in years from their salary in dollars. Typically salaries will be in the thousands, while age will be at most 100 [and most likely under 70, since we're talking about salaries]. In order for an equation of the form
##\text{Age} = \text{intercept} + \text{slope} \times \text{Salary}##
to work, the slope will need to be very small to bring the values on the right down to the scale of Age; a slope of, say, 0.001 years per dollar becomes 10 years per unit if salary is measured in units of $10,000. So if Salary is recorded in tens of thousands of dollars the slope won't be tiny, since the recorded values for salary are already roughly on the scale of age.
In short, in linear regression scaling is most often a matter of choice.
Second: there are some more sophisticated methods [K-nearest neighbors, for one] where the essential calculations are based on distances between values, and if one or more of the variables is of significantly greater magnitude than the others, those variables will dominate the calculations. Here good practice is to scale all variables to have the same magnitude and variability prior to performing the analysis, as in the sketch below.
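A minimal sketch, assuming scikit-learn and invented data: a K-nearest-neighbors classifier with one pure-noise feature on a much larger scale than the informative one. Without scaling, the distance calculation is dominated by the noise feature and accuracy is near chance; putting a StandardScaler in front of the classifier fixes it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 1_000
informative = rng.normal(0, 1, n)        # small scale, fully predictive
noise = rng.normal(0, 10_000, n)         # huge scale, pure noise
y = (informative > 0).astype(int)
X = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

unscaled = KNeighborsClassifier().fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

print("unscaled accuracy:", unscaled.score(X_te, y_te))  # near chance
print("scaled accuracy:  ", scaled.score(X_te, y_te))    # much higher
```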
 
statdad said:
Second: there are some more sophisticated methods [K-nearest neighbors, for one] where the essential calculations are based on distances between values, and if one or more of the variables is of significantly greater magnitude than the others, those variables will dominate the calculations. Here good practice is to scale all variables to have the same magnitude and variability prior to performing the analysis.
Would that be a standard step in the tool's algorithm, or at least an option the user can select?
 
To some extent, whether the scaling is done automatically or left as an on/off option for the user depends on the software. Regardless, for the types of processes I mentioned, scaling should be done.
 
statdad said:
To some extent, whether the scaling is done automatically or left as an on/off option for the user depends on the software. Regardless, for the types of processes I mentioned, scaling should be done.
I can't think of any case where scaling was bad to do, and there are certainly cases where it should be done.
 
I can't think of any where it would be bad, and [I believe] your comment also implies that there are situations where it isn't required.
 
statdad said:
I can't think of any where it would be bad, and [I believe] your comment also implies that there are situations where it isn't required.
It's not a required part of the algorithm. It is just safer in some cases to avoid numerical problems with the calculations.
 
