Multiple linear regression


Discussion Overview

The discussion revolves around the application of multiple linear regression on a dataset of test scores, specifically addressing the issue of multi-collinearity among the independent variables: income, reading score, and math score. Participants explore whether certain variables should be excluded based on their correlations and the implications of these correlations for model selection.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests that due to multi-collinearity, two of the three variables must be removed, noting that reading score has the highest correlation with test scores, while income has a lower correlation of 0.85.
  • Another participant questions whether it would be appropriate to retain the reading score despite it being part of the sum that defines the test score.
  • A different participant proposes using Principal Component Analysis (PCA) as a method to address multi-collinearity in the regression analysis.
  • One participant argues that if the goal is simply to create a regression model, it may not be necessary to use PCA and suggests testing different models with varying predictors while examining residual plots.
  • Another participant describes a standard step-wise multiple linear regression approach, emphasizing the importance of statistical significance in adding variables to the model.

Areas of Agreement / Disagreement

Participants express differing views on how to handle multi-collinearity and the appropriateness of excluding certain variables. There is no consensus on the best approach to take regarding the inclusion or exclusion of the correlated variables.

Contextual Notes

Participants have not fully defined the specific goals of the regression analysis, which may influence their recommendations. There are also unresolved considerations regarding the assumptions underlying the use of PCA and step-wise regression methods.

cutesteph
I am doing a multiple linear regression on a dataset of test scores. It has three highly correlated independent variables: income, reading score, and math score. Since the test score is the sum of the math score and reading score, would it be appropriate to exclude them simply on that basis? Obviously two of the three must be removed due to multi-collinearity. Reading score has the highest correlation with test score, and math is close. Income is only 0.85.
 
Or would it be appropriate to use the reading score, since it has the best correlation and the least spread, even though the test score is the average of the reading score and math score?
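The kind of correlation structure described here can be checked directly. Below is a rough sketch using plain NumPy and synthetic data that only loosely mimics the situation (all names and numbers are illustrative, not the actual dataset): it prints the correlation matrix and a variance inflation factor (VIF) for each predictor, a standard diagnostic for multi-collinearity.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data mimicking the setup: test score is the average of
# reading and math, and income tracks both (all values are made up).
n = 100
reading = rng.normal(70, 10, n)
math = rng.normal(65, 12, n)
income = 0.5 * (reading + math) + rng.normal(0, 6, n)
test = (reading + math) / 2

# Correlation matrix of the candidate predictors with the response.
data = np.column_stack([income, reading, math, test])
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))

# Variance inflation factor: VIF_j = 1 / (1 - R^2_j), where R^2_j comes
# from regressing predictor j on the remaining predictors.
def vif(X, j):
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ beta
    r2 = 1 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

X = data[:, :3]
for j, name in enumerate(["income", "reading", "math"]):
    print(f"VIF({name}) = {vif(X, j):.1f}")
```

A VIF well above 1 signals that a predictor is largely explained by the others, which is the symptom being discussed in this thread.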
 
Hey cutesteph.

Multi-collinearity (and hence correlation) among predictors can be handled in a number of ways.

I suggest you look at Principal Component Analysis (PCA) techniques for dealing with it in multivariate regression.

PCA should be available in most statistical software packages, including R, which is open source.

http://www.r-project.org/
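As a minimal sketch of the PCA suggestion above (synthetic data, plain NumPy rather than a statistics package; the variable names stand in for the thread's income/reading/math and are not real data): compute the principal components of the centered predictor matrix via SVD, then regress the response on the leading component instead of the raw, collinear predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: three highly correlated predictors, standing in for
# income, reading score, and math score (hypothetical values).
n = 200
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.1 * rng.normal(size=n),   # "reading score"
    base + 0.1 * rng.normal(size=n),   # "math score"
    base + 0.3 * rng.normal(size=n),   # "income"
])
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=n)  # "test score"

# PCA via SVD on the centered predictor matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
print("variance explained per component:", np.round(explained, 3))

# Principal-component regression: regress y on the first component
# instead of the three collinear columns.
pc1 = Xc @ Vt[0]
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), pc1]), y, rcond=None)
print("intercept and slope on PC1:", np.round(beta, 3))
```

When the predictors are as collinear as described, the first component typically absorbs most of the variance, so a single-component regression sidesteps the unstable coefficients that collinearity causes.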
 
cutesteph said:
I am doing a multiple linear regression on a dataset of test scores. It has three highly correlated independent variables: income, reading score, and math score. Since the test score is the sum of the math score and reading score, would it be appropriate to exclude them simply on that basis? Obviously two of the three must be removed due to multi-collinearity. Reading score has the highest correlation with test score, and math is close. Income is only 0.85.

If all you need is a regression model describing test score in terms of some subset of income, reading score, and math score, you don't need component analysis. Run through the different models (one predictor; two predictors, excluding reading and math scores together) and judge which is best, looking carefully at the residual plots in each case.

With that said, I'm still a little unsure of exactly what the goal of your project is. If it is more sophisticated than simply coming away with a regression model, some extra information is needed.
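The "run through the different models and compare" advice above can be sketched as follows (synthetic data again; the candidate list and names are illustrative, not the thread's actual dataset). Each candidate subset is fit by ordinary least squares and its residual sum of squares is reported; in practice you would also plot the residuals for each fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the thread's variables.
n = 150
reading = rng.normal(70, 10, n)
math = rng.normal(65, 12, n)
income = 0.5 * (reading + math) + rng.normal(0, 8, n)
test_score = (reading + math) / 2  # test score built from its parts

def fit_ols(X, y):
    """Least-squares fit with intercept; returns coefficients and residuals."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, y - A @ beta

# Candidate models: each single predictor, plus income+reading
# (reading+math together is excluded, as suggested above).
candidates = {
    "reading": reading.reshape(-1, 1),
    "math": math.reshape(-1, 1),
    "income": income.reshape(-1, 1),
    "income+reading": np.column_stack([income, reading]),
}
rss = {}
for name, Xc in candidates.items():
    _, resid = fit_ols(Xc, test_score)
    rss[name] = float(resid @ resid)
    print(f"{name:15s} RSS = {rss[name]:.2f}")
```

Raw RSS always favors bigger models, so in a real comparison you would penalize model size (adjusted R², AIC) and inspect the residual plots rather than rank on RSS alone.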
 
A standard step-wise multiple linear regression would first do a regression using the independent variable with the most statistical significance. It would then remove the influence of that variable and determine whether a second independent variable has enough significance, in the modified data, to be added to the model; if so, it adds the second variable showing the most statistical significance. It proceeds in this logical manner, only adding variables that make the most statistical sense. See MATLAB's stepwisefit or R's stepAIC.
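The forward-selection procedure described above can be sketched in a few lines (a toy illustration, not a substitute for stepwisefit or stepAIC; the data, the F-to-enter cutoff, and the variable names are all assumptions made for the example). At each step it adds the variable that most reduces the residual sum of squares, stopping when the partial F statistic for the best remaining variable falls below the cutoff.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: y depends strongly on x0, weakly on x1, not on x2.
n = 120
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def rss_of(cols):
    """Residual sum of squares of an OLS fit (with intercept) on the columns."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

# Forward selection: greedily add the variable that reduces RSS the most,
# stopping when the partial F statistic drops below the threshold.
selected, remaining = [], [0, 1, 2]
F_IN = 4.0  # rough F-to-enter cutoff (assumed; real packages use p-values)
while remaining:
    current = rss_of(selected)
    best = min(remaining, key=lambda j: rss_of(selected + [j]))
    new = rss_of(selected + [best])
    df = n - len(selected) - 2  # residual degrees of freedom after adding
    f_stat = (current - new) / (new / df)
    if f_stat < F_IN:
        break
    selected.append(best)
    remaining.remove(best)

print("selected variables:", selected)
```

Note that with highly collinear predictors (the situation in this thread), step-wise results are notoriously order-sensitive: once one of the correlated variables enters, the others may show little remaining significance.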
 
