Limitations of Multivariate Linear Regression

In summary, multivariate linear regression remains usable when the relationships between the dependent and independent variables are linear, even if the independent variables are correlated with one another. When relationships are curvilinear, other methods may be better suited, either by transforming the predictors or by switching to a generalized linear model. Coefficient estimates for correlated predictors should be interpreted with caution, and it is often worth dropping one of a correlated pair from the model. Stepwise regression techniques can select the most statistically significant predictors, but the data should then be split into separate pools for building and testing the model.
  • #1
fog37
TL;DR Summary
Understand the possible limitations of multivariate linear regression
Hello,

With multivariate linear regression, there is a single dependent variable ##y## and multiple independent variables ##x_1##, ##x_2##, ##x_3##, etc.
There is a linear, weighted relationship between ##y## and the various ##x## variables:
$$ y = c_1 x_1 + c_2 x_2 + c_3 x_3 $$
The independent variables are ideally totally independent of each other; otherwise we run into the problem of collinearity. However, multivariate linear regression can still be used if pairs of independent variables are linearly related...

What happens if we discover that one or two of the independent variables ##x_i## have a curvilinear relationship with the dependent variable ##y## while the others have a linear one? Or if there is a curvilinear relationship between the independent variables themselves?
Should multivariate linear regression still be used?

Thank you!
 
  • #2
You can apply transformations to the ##x_i## variables if their relationship to ##y## looks nonlinear, and then use the transformed variables in the linear regression. In fact, it is common to include interaction terms like ##c_i x_i x_k##. (Because you are allowed to transform the ##x_i## variables, the term "linear" refers to how the ##c_i## coefficients enter the model rather than to how the ##x_i## variables appear.)
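As a minimal sketch of that idea (Python with numpy and statsmodels, on made-up data; the particular quadratic and interaction terms are just examples):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 5, n)
x2 = rng.uniform(0, 5, n)
# Synthetic response with a quadratic term and an interaction term
y = 1.0 + 2.0 * x1 + 0.5 * x1**2 + 1.5 * x1 * x2 + rng.normal(0, 1, n)

# The COLUMNS of the design matrix are transformed predictors;
# the model is still linear in the coefficients c_i.
X = sm.add_constant(np.column_stack([x1, x1**2, x1 * x2]))
fit = sm.OLS(y, X).fit()
print(fit.params)   # estimates for [const, x1, x1^2, x1*x2]
```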
Also, there are stepwise regression techniques that will not include ##x_i## variables unless they give a statistically significant improvement to the fit. So stepwise linear regression can still be applied even if the ##x_i## variables are strongly dependent. That is a good thing; total independence among many variables is uncommon in real applications.
 
  • #3
fog37 said:
Summary:: Understand the possible limitations of multivariate linear regression

What happens if we discover that one or two of the independent variables ##x_i## have a curvilinear relationship with the dependent variable ##y## while the others have a linear one?
This can be handled in a couple of different ways depending on the relationship. If the dependent variable is a linear combination of some function of the predictor ##y=c_1 f_1(x_1, ...)+c_2 ...## then you can still use multivariate linear regression. If the dependent variable is some function of a linear combination of the predictors ##y=f(c_1 x_1 + c_2 ...)## then you can use a generalized linear model.
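A hedged illustration of the two cases, using statsmodels on synthetic data (the specific functional forms here are just examples):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0.1, 3, n)

# Case 1: y is a linear combination of functions of x -> plain OLS
y1 = 2.0 * np.log(x) + 0.5 * x**2 + rng.normal(0, 0.3, n)
X1 = sm.add_constant(np.column_stack([np.log(x), x**2]))
ols_fit = sm.OLS(y1, X1).fit()

# Case 2: y is a function of a linear combination of the predictors
# (here exp of a linear predictor with count noise) -> a GLM
y2 = rng.poisson(np.exp(0.3 + 0.8 * x))
X2 = sm.add_constant(x)
glm_fit = sm.GLM(y2, X2, family=sm.families.Poisson()).fit()

print(ols_fit.params)   # approx [0, 2.0, 0.5]
print(glm_fit.params)   # approx [0.3, 0.8] on the log scale
```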

fog37 said:
Or if there is curvilinear correlation between the independent variables themselves?
Should multivariate linear regression still be used?
This is tricky. When you do this, the result of the overall regression is valid; however, the estimates of the fit parameters ##c_i## for the correlated predictors are unstable and inaccurate. So there are things you can do with that model, but there are also many problematic inferences you can draw. When I have a model with strong collinearity in the predictors, I usually try to drop one of the predictors from the model. You lose a little ##R^2##, but there is less danger in interpreting the results.
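A quick numpy demonstration of that instability (synthetic data, with ##x_2## built as a near copy of ##x_1##; the noise levels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)    # nearly a copy of x1
y = 3.0 * x1 + rng.normal(0, 1, n)

# Refit on bootstrap resamples: the fitted values stay stable,
# but the individual coefficients swing wildly.
for _ in range(3):
    idx = rng.integers(0, n, n)
    X = np.column_stack([np.ones(n), x1[idx], x2[idx]])
    coef, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    print(coef)   # c1 and c2 vary a lot, though c1 + c2 stays near 3
```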
 
  • #4
fog37 said:
Summary:: Understand the possible limitations of multivariate linear regression

Hello,

With multivariate linear regression, there is a single dependent variable ##y## and multiple independent variables ##x_1##, ##x_2##, ##x_3##, etc.
There is a linear, weighted relationship between ##y## and the various ##x## variables:
$$ y = c_1 x_1 + c_2 x_2 + c_3 x_3 $$
The independent variables are ideally totally independent of each other; otherwise we run into the problem of collinearity. However, multivariate linear regression can still be used if pairs of independent variables are linearly related...

Collinearity would imply that one variable is a linear combination of the other two. The variables can be correlated (i.e., not independent) without being collinear, in which case multivariate linear regression should still do OK.

fog37 said:
What happens if we discover that one or two of the independent variables ##x_i## have a curvilinear relationship with the dependent variable ##y## while the others have a linear one? Or if there is a curvilinear relationship between the independent variables themselves?
Should multivariate linear regression still be used?

Thank you!

It might help to take a Bayesian perspective. Performing multivariate linear regression is equivalent to assuming that the data follow a linear-Gaussian model in which the predicted variable is a linear combination of the regressors corrupted by additive Gaussian noise. If in fact there is some curvilinear relationship or non-Gaussian noise in the data, then multivariate linear regression is no longer the optimal method. If we knew the form of the curvilinear relationship then we could fit a model to the data which reflects that structure we believe to be present. If we don't know the form of the curvilinear relationship then various "no free lunch" theorems tell us that there is no one optimal method.
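To sketch that equivalence: assume ##y_i = \sum_j c_j x_{ij} + \varepsilon_i## with ##\varepsilon_i \sim \mathcal{N}(0, \sigma^2)##. The log-likelihood is then
$$ \log p(y \mid c) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i \Big(y_i - \sum_j c_j x_{ij}\Big)^2, $$
so maximizing the likelihood over the ##c_j## is exactly minimizing the sum of squared residuals. Under a different noise model the two criteria no longer coincide.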
 
  • #5
Dale said:
This is tricky. When you do this, the result of the overall regression is valid; however, the estimates of the fit parameters ##c_i## for the correlated predictors are unstable and inaccurate.
If some ##x_i## variables are exactly linearly dependent, there are trade-offs that allow multiple coefficient solutions, but each of those solutions fits the data equally well. If the variables are merely strongly correlated, there are still trade-offs, but some variables are statistically better predictors of the dependent variable than others. A stepwise regression algorithm would include the better predictors and only add the others if they still gave a statistically significant reduction in the remaining SSE.
Dale said:
So there are things you can do with that model, but there are also many problematic inferences you can draw.
Yes, but there are always dangers in interpreting a regression. Correlation does not imply causation.
Dale said:
When I have a model with strong collinearity in the predictors, I usually try to drop one of the predictors from the model. You lose a little ##R^2##, but there is less danger in interpreting the results.
That is what a backward elimination stepwise regression would do in a very methodical way.
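Here is a bare-bones sketch of what such an algorithm does, in Python with statsmodels. The greedy drop rule and the fixed p-value threshold `alpha` are simplifications; real implementations typically use partial F-tests or information criteria.

```python
import numpy as np
import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    """Greedy backward elimination: start with every predictor and
    repeatedly drop the least significant one until all remaining
    coefficients are statistically significant."""
    cols = list(range(X.shape[1]))
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals = fit.pvalues[1:]            # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break                          # everything left is significant
        cols.pop(worst)                    # drop the weakest predictor
    return cols                            # indices of retained columns
```

On a data set with redundant predictors, this keeps whichever of the correlated columns happens to test as the stronger predictor and discards the rest.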
 
  • #6
FactChecker said:
That is what a backward elimination stepwise regression would do in a very methodical way.
That is indeed one approach, but personally I prefer to do the elimination before the regression using any relevant non-statistical problem-specific knowledge I have to inform the model. If you use stepwise regression then you need to split your data into one pool for generating the model and a separate pool for testing it. Too many people don't do that when they use stepwise regression.
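A minimal sketch of that split discipline, on synthetic data (the 50/50 split and the single relevant predictor are just for illustration; any selection step, such as the backward-elimination sketch above, would run on the training pool only):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.normal(size=(n, p))          # mostly irrelevant predictors
y = 2.0 * X[:, 0] + rng.normal(0, 1, n)

# Split BEFORE any model selection: one pool builds the model,
# the other judges it on data it has never seen.
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2 :]

fit = sm.OLS(y[train], sm.add_constant(X[train])).fit()
pred = fit.predict(sm.add_constant(X[test]))
ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
print("out-of-sample R^2:", 1 - ss_res / ss_tot)
```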
 
  • #7
madness said:
Collinearity would imply that one variable is a linear combination of the other two. The variables can be correlated (i.e. not independent) without being collinear, in which case multivariate linear regression should still do ok.
It might help to take a Bayesian perspective. Performing multivariate linear regression is equivalent to assuming that the data follow a linear-Gaussian model in which the predicted variable is a linear combination of the regressors corrupted by additive Gaussian noise. If in fact there is some curvilinear relationship or non-Gaussian noise in the data, then multivariate linear regression is no longer the optimal method. If we knew the form of the curvilinear relationship then we could fit a model to the data which reflects that structure we believe to be present. If we don't know the form of the curvilinear relationship then various "no free lunch" theorems tell us that there is no one optimal method.
"Performing multivariate linear regression is equivalent to assuming that the data follow a linear-Gaussian model in which the predicted variable is a linear combination of the regressors corrupted by additive Gaussian noise."

No. The assumption of Gaussian errors is not one of the traditional regression assumptions. If you make that assumption, you are adding one more item to your assumptions about the relationship.
 

Frequently asked questions

1. What are the assumptions of multivariate linear regression?

Multivariate linear regression assumes that the relationship between the dependent and independent variables is linear, the errors are normally distributed and homoscedastic (constant variance), there is no multicollinearity (high correlation between independent variables), and that observations are independent of each other. Violation of these assumptions can affect the validity and reliability of the model's outputs.
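Two of those assumptions can be checked directly from the fitted residuals. A rough sketch in Python with statsmodels and scipy on synthetic data, using the Shapiro-Wilk test for normality and the Breusch-Pagan test for homoscedasticity:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(8)
n = 300
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1, n)

fit = sm.OLS(y, X).fit()
resid = fit.resid

# Normality of the errors: Shapiro-Wilk on the residuals
w_stat, w_pval = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", w_pval)

# Homoscedasticity: Breusch-Pagan test against the regressors
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", lm_pval)
```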

2. How does multicollinearity affect multivariate linear regression?

Multicollinearity refers to a situation in which two or more predictor variables are highly correlated. This can lead to unstable coefficients, where small changes in the data can lead to large changes in the model coefficients, making them difficult to interpret. Moreover, it can inflate the variance of the coefficient estimates, which might result in a failure to identify important predictors.
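One common diagnostic is the variance inflation factor (VIF). A short sketch with statsmodels on synthetic data (the 0.95 mixing weight is arbitrary, chosen to create strong correlation):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(0, 1, n)
x2 = 0.95 * x1 + 0.05 * rng.normal(0, 1, n)   # strongly correlated with x1
x3 = rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2, x3])  # intercept column included
for i in range(1, X.shape[1]):
    print(f"VIF of x{i}:", variance_inflation_factor(X, i))
# A common rule of thumb treats VIF above roughly 5-10 as problematic.
```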

3. Can multivariate linear regression handle non-linear relationships?

Multivariate linear regression is designed to model relationships that are linear. It does not perform well with non-linear relationships unless those relationships are transformed into linear ones through techniques like logarithmic or polynomial transformations. For inherently non-linear relationships, models such as polynomial regression, logistic regression, or other non-linear models might be more appropriate.
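A minimal sketch of the polynomial-transformation route, using scikit-learn on made-up data (the quadratic ground truth is assumed for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, (200, 1))
y = 1.0 - 2.0 * x[:, 0] + 0.7 * x[:, 0] ** 2 + rng.normal(0, 0.5, 200)

# Expand x into [x, x^2]; the model stays linear in its coefficients.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)   # roughly 1.0 and (-2.0, 0.7)
```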

4. What is the impact of outliers in multivariate linear regression?

Outliers can have a significant impact on a multivariate linear regression model. They can skew the results by affecting the slope of the regression line, leading to misleading interpretations. Outliers can disproportionately influence the model's estimate of the relationship between variables, potentially resulting in a poorer fit for the majority of the data. It's crucial to identify and handle outliers appropriately, possibly by removing them or using robust regression techniques.
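A small sketch contrasting ordinary least squares with Huber robust regression (statsmodels' RLM) on synthetic data with a few planted outliers:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 100
x = np.sort(rng.uniform(0, 10, n))
y = 2.0 * x + rng.normal(0, 1, n)
y[-5:] += 40                     # gross outliers at the high end of x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print("OLS slope:  ", ols.params[1])    # pulled upward by the outliers
print("Huber slope:", huber.params[1])  # stays much closer to 2
```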

5. How does the sample size affect the performance of multivariate linear regression?

A larger sample size generally improves the performance of a multivariate linear regression model by providing a more accurate estimate of the population parameters, reducing the standard error of the estimates. A small sample size can lead to overfitting, where the model describes random error or noise instead of the underlying relationship. Thus, ensuring an adequate sample size is essential for the reliability and validity of the model's conclusions.
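A quick simulation of that effect (pure numpy, synthetic data): the empirical standard error of the slope estimate shrinks roughly like ##1/\sqrt{n}##.

```python
import numpy as np

rng = np.random.default_rng(7)

def slope_se(n, reps=2000):
    """Empirical spread of the OLS slope estimate at sample size n."""
    slopes = []
    for _ in range(reps):
        x = rng.normal(0, 1, n)
        y = 1.5 * x + rng.normal(0, 1, n)
        X = np.column_stack([np.ones(n), x])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        slopes.append(coef[1])
    return np.std(slopes)

for n in (10, 40, 160):
    print(n, slope_se(n))   # the spread roughly halves as n quadruples
```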
