Multiple linear regression

  1. Mar 13, 2015 #1
    I am doing a multiple linear regression on a dataset of test scores. It has three highly correlated variables: income, reading score, and math score. Since the test score is the sum of the math score and the reading score, would it be appropriate to exclude them simply on that basis? Obviously two of the three must be removed due to multi-collinearity. Reading score has the highest correlation with test score, and math is close. Income is only 0.85.
  3. Mar 13, 2015 #2
    Or would it be appropriate to use reading score, since it has the best correlation and the least spread, even though the test score is the average of the reading score and the math score?
  4. Mar 14, 2015 #3


    Science Advisor

    Hey cutesteph.

    Removing data with multi-collinearity (and hence correlation) can be done in a number of ways.

    I suggest you look at Principal Component Analysis (PCA) techniques for dealing with that in multivariate regression.

    PCA techniques should be available in most statistical software packages, including R, which is open source.
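
    As a rough illustration of the PCA idea, here is a minimal sketch in Python with NumPy rather than R. The data are synthetic and made up purely for the example (the names `ability`, `income`, `reading`, and `math_score` are assumptions, not the poster's dataset). It standardizes three correlated predictors, extracts principal components via the SVD, and checks that the component scores are uncorrelated, so they could be used as regressors without the collinearity problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, made-up data: three predictors driven by one common factor,
# so they are strongly correlated with each other.
ability = rng.normal(size=n)
income = ability + 0.5 * rng.normal(size=n)
reading = ability + 0.2 * rng.normal(size=n)
math_score = ability + 0.2 * rng.normal(size=n)
X = np.column_stack([income, reading, math_score])

# Standardize each column (zero mean, unit sample standard deviation).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via the singular value decomposition of the standardized data.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)   # share of variance per component
scores = Z @ Vt.T                 # principal component scores

# The component scores are mutually uncorrelated by construction.
corr_pc = np.corrcoef(scores, rowvar=False)

print("explained variance ratio:", np.round(explained, 3))
```

    With predictors this strongly correlated, most of the variance lands in the first component, which is the sense in which PCA condenses the collinear variables.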

  5. Mar 14, 2015 #4


    Homework Helper

    If all you need is a regression model for describing test score using some subset of the three variables income, reading score, and math score, you don't need component analysis. Run through the different models (1 predictor, 2 predictors except for reading and math scores together), and judge the best one. Look carefully at residual plots in each case.
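
    The model comparison described above could be sketched roughly as follows. This is only an illustration on synthetic data (all variable names and the data-generating process are assumptions); it fits every one- and two-predictor model except the reading and math pair, ranks them by adjusted R-squared, and keeps the residuals so they can be plotted and inspected.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n = 200

# Synthetic, made-up data standing in for the real dataset.
ability = rng.normal(size=n)
data = {
    "income": ability + 0.5 * rng.normal(size=n),
    "reading": ability + 0.2 * rng.normal(size=n),
    "math": ability + 0.2 * rng.normal(size=n),
}
# As in the original question, the response is the sum of the two scores.
test_score = data["reading"] + data["math"]


def fit_ols(names):
    """Fit OLS with an intercept; return (adjusted R^2, residuals)."""
    X = np.column_stack([np.ones(n)] + [data[k] for k in names])
    beta, *_ = np.linalg.lstsq(X, test_score, rcond=None)
    resid = test_score - X @ beta
    ss_res = resid @ resid
    ss_tot = np.sum((test_score - test_score.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - len(names) - 1)
    return adj_r2, resid


# All 1- and 2-predictor models, skipping reading + math together,
# since those two reproduce the response exactly.
candidates = [
    combo
    for k in (1, 2)
    for combo in combinations(data, k)
    if set(combo) != {"reading", "math"}
]
results = {combo: fit_ols(combo) for combo in candidates}

for combo, (adj_r2, _) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(combo, round(adj_r2, 3))
```

    The residual arrays in `results` can be passed to any plotting tool to look for curvature or non-constant spread before choosing a model.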

    With that said, I'm still a little unsure of exactly what the goal of your project is. If it is more sophisticated than simply coming away with a regression model, some extra information is needed.
  6. Apr 4, 2015 #5


    Science Advisor
    Gold Member
    2017 Award

    A standard stepwise multiple linear regression first fits a regression using the independent variable that has the most statistical significance. It then removes the influence of that variable and determines whether a second independent variable has enough significance in the modified data to be added to the model; if so, it adds the one that shows the most statistical significance. It proceeds in this logical manner, only adding variables that make statistical sense. See MATLAB's stepwisefit or R's stepAIC.
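
    A forward-selection loop of this kind might look like the sketch below. Note the swap: it uses AIC as the entry criterion (in the spirit of R's stepAIC) instead of a significance test, and it is not a reimplementation of either stepwisefit or stepAIC. The data are synthetic, and a small noise term is added to the response (an assumption made here so that the fit is not exact and the AIC stays finite).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Synthetic, made-up predictors driven by one common factor.
ability = rng.normal(size=n)
predictors = {
    "income": ability + 0.5 * rng.normal(size=n),
    "reading": ability + 0.2 * rng.normal(size=n),
    "math": ability + 0.2 * rng.normal(size=n),
}
# Assumption: small noise added so the model is not a perfect fit.
y = predictors["reading"] + predictors["math"] + 0.1 * rng.normal(size=n)


def aic(names):
    """Gaussian AIC (up to an additive constant) of an OLS fit with intercept."""
    X = np.column_stack([np.ones(n)] + [predictors[k] for k in names])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * X.shape[1]


selected = []
trace = [aic(selected)]  # AIC of the intercept-only model

while len(selected) < len(predictors):
    remaining = [p for p in predictors if p not in selected]
    # Try adding each remaining variable and keep the best one.
    trial = {name: aic(selected + [name]) for name in remaining}
    best = min(trial, key=trial.get)
    if trial[best] >= trace[-1]:
        break  # no candidate improves the criterion, so stop
    selected.append(best)
    trace.append(trial[best])

print("selected order:", selected)
print("AIC trace:", np.round(trace, 1))
```

    Each pass adds at most one variable and stops as soon as no addition lowers the AIC, which mirrors the add-one-at-a-time logic described above.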