Multiple linear regression

In summary, the goal of the project is to come up with a regression model that describes test score using only the three variables income, reading score, and math score.
  • #1
cutesteph
I am running a multiple linear regression on a dataset of test scores. It has three highly correlated variables: income, reading score, and math score. Since the test score is the sum of the math score and the reading score, would it be appropriate to exclude those two simply on that basis? Obviously, two of the three must be removed because of multicollinearity. Reading score has the highest correlation with test score, and math is close. The correlation for income is only .85.
 
  • #2
Or would it be appropriate to use reading score alone, since it has the best correlation and the least spread, even though the test score is the average of the reading score and the math score?
 
  • #3
Hey cutesteph.

Removing data with multi-collinearity (and hence correlation) can be done in a number of ways.

I suggest you look at Principal Component Analysis (PCA) techniques for dealing with that in multivariate regression.

The PCA techniques should be available in most statistical software packages - including R which is open source.

http://www.r-project.org/
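
To make the PCA suggestion concrete, here is a minimal sketch in plain Python rather than R. The data are invented for illustration and are not the poster's dataset, and the helper names (`standardize`, `leading_component`) are hypothetical. It standardizes the three predictors, builds their correlation matrix, and finds the first principal component by power iteration, which is only one of several ways to compute it.

```python
import math

def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def leading_component(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    vector = [1.0] * len(matrix)
    for _ in range(iterations):
        product = [sum(a * v for a, v in zip(row, vector)) for row in matrix]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    # Rayleigh quotient v' M v gives the eigenvalue for a unit vector v.
    product = [sum(a * v for a, v in zip(row, vector)) for row in matrix]
    eigenvalue = sum(v * p for v, p in zip(vector, product))
    return eigenvalue, vector

# Hypothetical data, invented for illustration only.
income = [20, 35, 50, 65, 80, 95, 110, 125]
reading = [310, 330, 345, 350, 370, 365, 390, 400]
math_score = [300, 335, 340, 360, 365, 380, 385, 410]

z = [standardize(col) for col in (income, reading, math_score)]
n_obs = len(income)
corr = [[sum(a * b for a, b in zip(zi, zj)) / (n_obs - 1) for zj in z] for zi in z]

eigenvalue, loadings = leading_component(corr)
share = eigenvalue / len(corr)  # the trace of a correlation matrix is the variable count
pc1_scores = [sum(w * col[i] for w, col in zip(loadings, z)) for i in range(n_obs)]

print("loadings:", [round(w, 3) for w in loadings])
print("share of variance on the first component:", round(share, 3))
```

Regressing the test score on `pc1_scores` instead of the raw predictors is what is usually called principal component regression. The price is that the single coefficient then describes a blend of income, reading, and math rather than any one of them.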
 
  • #4
cutesteph said:
I am running a multiple linear regression on a dataset of test scores. It has three highly correlated variables: income, reading score, and math score. Since the test score is the sum of the math score and the reading score, would it be appropriate to exclude those two simply on that basis? Obviously, two of the three must be removed because of multicollinearity. Reading score has the highest correlation with test score, and math is close. The correlation for income is only .85.

If all you need is a regression model for describing test score using some subset of the three variables income, reading score, and math score, you don't need component analysis. Run through the different models (each single predictor, then each pair of predictors except reading and math scores together, since that pair determines the test score exactly), and judge the best one. Look carefully at residual plots in each case.

With that said, I'm still a little unsure of exactly what the goal of your project could be. If it is more sophisticated than simply coming away with a regression model, some extra information is needed.
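
The "run through the different models" advice in this post could be sketched as follows. This is an illustration under stated assumptions, not the poster's analysis: the data are invented, the test score is taken to be the exact sum of reading and math as described in post #1, and `solve` and `ols` are hypothetical helpers that fit ordinary least squares through the normal equations using only the standard library.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(columns, y):
    """Least-squares fit with an intercept; returns (coefficients, residuals, R^2)."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(bj * xj for bj, xj in zip(beta, r)) for r, yi in zip(rows, y)]
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    sse = sum(e * e for e in resid)
    return beta, resid, 1.0 - sse / sst

# Hypothetical data, invented for illustration only.
income = [20, 35, 50, 65, 80, 95, 110, 125]
reading = [310, 330, 345, 350, 370, 365, 390, 400]
math_score = [300, 335, 340, 360, 365, 380, 385, 410]
test = [r + m for r, m in zip(reading, math_score)]  # assumed exact sum

predictors = {"income": income, "reading": reading, "math": math_score}
# Every one- and two-predictor model except reading and math together.
candidates = [("income",), ("reading",), ("math",),
              ("income", "reading"), ("income", "math")]

results = {}
for names in candidates:
    beta, resid, r2 = ols([predictors[name] for name in names], test)
    n_obs, n_pred = len(test), len(names)
    adj_r2 = 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_pred - 1)
    results[names] = {"beta": beta, "resid": resid, "r2": r2, "adj_r2": adj_r2}
    print(names, "R^2 =", round(r2, 4), "adjusted R^2 =", round(adj_r2, 4))

# The excluded pair reproduces the response exactly, so it is an identity, not a model.
_, resid_identity, r2_identity = ols([reading, math_score], test)
print("reading and math together: R^2 =", round(r2_identity, 6))
```

The stored `resid` lists are what would be plotted against the fitted values or against each predictor when looking at residual plots. Adjusted R-squared is used here only as one convenient way to compare models of different sizes.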
 
  • #5
A standard stepwise multiple linear regression first fits a regression using the independent variable that has the most statistical significance. It then removes the influence of that variable and determines whether a second independent variable has enough significance in the adjusted data to be added to the model, choosing the one that shows the most statistical significance. It proceeds in this logical manner, adding only variables that make the most statistical sense. See MATLAB's stepwisefit or R's stepAIC.
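
A rough sketch of the forward part of that procedure is given below. It is not a reimplementation of stepwisefit or stepAIC (stepAIC, for instance, selects by AIC rather than by a significance test). The assumptions are: the data are invented, the response `overall` is deliberately not an exact sum of the predictors so the residual variance stays positive, and a fixed F-to-enter threshold of 4.0 stands in for a proper p-value cutoff. All function names are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def sse_of(names, data, y):
    """Residual sum of squares for an intercept plus the named predictors."""
    rows = [[1.0] + [data[name][i] for name in names] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(rows, y))

def forward_stepwise(data, y, f_to_enter=4.0):
    """Add one predictor at a time while its partial F statistic reaches f_to_enter."""
    selected, remaining = [], list(data)
    sse_path = [sse_of(selected, data, y)]  # starts at the intercept-only SSE
    while remaining:
        df_resid = len(y) - (len(selected) + 1) - 1
        if df_resid <= 0:
            break
        f_stats = {}
        for name in remaining:
            new_sse = sse_of(selected + [name], data, y)
            mse = new_sse / df_resid
            f_stats[name] = float("inf") if mse == 0 else (sse_path[-1] - new_sse) / mse
        best = max(f_stats, key=f_stats.get)
        if f_stats[best] < f_to_enter:
            break  # no remaining variable is significant enough to enter
        selected.append(best)
        remaining.remove(best)
        sse_path.append(sse_of(selected, data, y))
    return selected, sse_path

# Hypothetical data, invented for illustration only.
predictors = {
    "income": [20, 35, 50, 65, 80, 95, 110, 125, 140, 155],
    "reading": [310, 330, 345, 350, 370, 365, 390, 400, 405, 420],
    "math": [300, 335, 340, 360, 365, 380, 385, 410, 400, 430],
}
overall = [612, 663, 688, 707, 738, 742, 778, 808, 806, 853]

selected, sse_path = forward_stepwise(predictors, overall)
print("entered in order:", selected)
print("residual sum of squares after each step:", [round(s, 2) for s in sse_path])
```

A full stepwise routine would also reconsider variables already in the model and drop those that lose significance; this sketch only adds.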
 

1. What is multiple linear regression?

Multiple linear regression is a statistical technique used to analyze the relationship between a dependent variable and two or more independent variables. It is used to predict the value of the dependent variable based on the values of the independent variables.
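
Written out for $p$ independent variables and $n$ observations (the symbols are introduced here for illustration), the model is:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \dots, n
$$

Here $y_i$ is the dependent variable, the $x_{ij}$ are the independent variables, the $\beta_j$ are coefficients estimated from the data (usually by least squares), and $\varepsilon_i$ is the error term. For the thread above, one candidate would be $\text{test}_i = \beta_0 + \beta_1\,\text{income}_i + \beta_2\,\text{reading}_i + \varepsilon_i$.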

2. When is multiple linear regression used?

Multiple linear regression is used when there is a need to understand the effect of two or more independent variables on a dependent variable. It is commonly used in fields such as economics, social sciences, and business to make predictions and identify patterns in data.

3. What are the assumptions of multiple linear regression?

There are several assumptions of multiple linear regression, including linearity, normality, homoscedasticity, and independence of errors. Linearity assumes that the relationship between the dependent and independent variables is linear. Normality assumes that the residuals of the model are normally distributed. Homoscedasticity assumes that the variance of the residuals is constant across all values of the independent variables. Independence of errors assumes that the errors are not correlated with each other.

4. How is the performance of a multiple linear regression model evaluated?

The performance of a multiple linear regression model is evaluated by looking at the overall fit of the model and the significance of the independent variables. This can be done by looking at metrics such as the coefficient of determination (R-squared), the F-statistic, and the p-values of the independent variables. Additionally, it is important to check for any violations of the assumptions of the model.
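
For a model with an intercept, $p$ independent variables, and $n$ observations, the two summary metrics named above are commonly defined as follows, where $\hat{y}_i$ are the fitted values and $\bar{y}$ is the mean of the dependent variable (notation introduced here for illustration):

$$
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2
$$

$$
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, \qquad F = \frac{(\mathrm{SST} - \mathrm{SSE})/p}{\mathrm{SSE}/(n - p - 1)}
$$

Because $R^2$ never decreases when a variable is added, comparing models of different sizes usually calls for an adjusted measure such as $1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$.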

5. Can categorical variables be included in a multiple linear regression model?

Yes, categorical variables can be included in a multiple linear regression model by using dummy coding. This involves creating a 0/1 dummy variable for each category except one reference category and including them as independent variables in the model. With an intercept in the model, a dummy for every category would be perfectly collinear with it. Each dummy coefficient is then interpreted as a difference from the reference category.
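
A minimal sketch of dummy coding in plain Python follows. The `school_type` variable and the `dummy_code` helper are hypothetical and do not come from the thread's dataset.

```python
def dummy_code(values, reference=None):
    """Return one 0/1 column per category, leaving out the reference category."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: default to the first level alphabetically
    kept = [level for level in levels if level != reference]
    return {f"is_{level}": [1 if v == level else 0 for v in values] for level in kept}

# Hypothetical categorical predictor with three categories.
school_type = ["public", "private", "charter", "public", "charter"]
dummies = dummy_code(school_type, reference="public")

for name, column in dummies.items():
    print(name, column)
# is_charter [0, 0, 1, 0, 1]
# is_private [0, 1, 0, 0, 0]
```

Three categories produce two columns. A row with zeros in both columns belongs to the reference category, "public" in this example.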
