Linear regression and high correlation problems

In summary: in the first case (calorie intake), height and weight are each significant in the individual regressions, but when both are included together neither is significant. In the second case (skinfold thickness), the reverse happens: neither is significant individually, but both become significant when included together.
  • #1
bbb999
Hi guys,

I have data on 20 people's height, weight, calorie intake and skinfold thickness. I have carried out a regression of calorie intake on height, on weight, and on height and weight together. I have done the same thing for skinfold thickness. I then used R to work out the summary of results. Each model also has an intercept, i.e. y = beta1 + beta2*x.

Using the t-values, I have found that for calorie intake both height and weight are significantly different from zero in the individual models. But when I look at the model where height and weight are both included, both become non-significant.

For skinfold thickness the opposite happens: height and weight are not significantly different from zero individually, but in the model including both they both become significant.

I have found the correlation between weight and height to be -0.88, which is high in magnitude. Can anyone help me explain what causes the changes in significance?
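For reference, here is a minimal R sketch of the setup with simulated stand-in data (my actual 20 observations aren't reproduced here, and the coefficients below are arbitrary):

set.seed(1)
n       <- 20
height  <- rnorm(n, mean = 170, sd = 10)
weight  <- 150 - 0.5 * height + rnorm(n, sd = 2)   # strongly negatively correlated with height
calorie <- 2 * height + rnorm(n, sd = 20)          # stand-in response

cor(height, weight)                      # large and negative, comparable to the -0.88 above

summary(lm(calorie ~ height))            # height alone: typically significant
summary(lm(calorie ~ weight))            # weight alone: typically significant, via its link to height
summary(lm(calorie ~ height + weight))   # both together: standard errors inflate, t-values shrink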

thanks in advance
 
  • #3
OK, I have read that. So am I right to say that in case 1 the high correlation causes the standard errors to increase, which lowers the t-values, and that leads to neither weight nor height being significant in the joint model? But I am not sure why the opposite happens in the second case.
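If I have the mechanics right, with two predictors the sampling variance of each slope estimate is multiplied by the variance inflation factor 1/(1 - r^2). A quick check in R with the correlation I reported above:

r   <- -0.88          # correlation between height and weight
vif <- 1 / (1 - r^2)
vif                   # about 4.4: each slope variance is inflated roughly fourfold
sqrt(vif)             # so each standard error roughly doubles (~2.1x)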

Also, is it possible in case 1 to transform the model to reduce the correlation between height and weight?

thanks in advance
 
  • #4
In a finite sample "anything's possible" due to outliers and other idiosyncrasies. With that caveat, one guess is that the true model is SF = b0 + b1 H + b2 W + u and the regression does a good job of identifying both factors despite their high negative correlation. In contrast, each of the partial models SF = a0 + a1 H + u and SF = c0 + c1 W + u underestimates the slope coefficient due to omitted variable bias.
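A quick simulated illustration of that story, with arbitrary stand-in numbers (the true coefficients 0.4 and 0.6 below are made up):

set.seed(2)
n  <- 20
H  <- rnorm(n, 170, 10)
W  <- 150 - 0.5 * H + rnorm(n, sd = 2.5)       # strong negative correlation with H
SF <- 1 + 0.4 * H + 0.6 * W + rnorm(n, sd = 3)

coef(lm(SF ~ H + W))   # close to the true 0.4 and 0.6
coef(lm(SF ~ H))       # H slope biased toward 0.4 + 0.6*(-0.5) = 0.1
coef(lm(SF ~ W))       # W slope likewise biased toward roughly zero

Each partial model attributes part of the omitted variable's effect to the included one; because the correlation is negative and both true coefficients are positive, the marginal slopes are pulled toward zero.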
 
  • #5
Ah OK. Also, is there a way of reducing the correlation in case 1 by transforming the models?
 
  • #6
I would start by re-defining calories as calories/inch or calories/pound: regress calories/inch on W (or on H and W together), and regress calories/pound on H (or on H and W together).
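With the simulated height, weight and calorie variables from the sketch in post #1, that would look like:

cal_per_inch  <- calorie / height
cal_per_pound <- calorie / weight

summary(lm(cal_per_inch  ~ weight))            # calories/inch on W
summary(lm(cal_per_pound ~ height))            # calories/pound on H
summary(lm(cal_per_inch  ~ height + weight))   # or on both together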
 
  • #7
Erm, the data I have is a calorie intake index. I am not quite sure how to convert this to calories/inch, for example.
 
  • #8
Imagine the index is "normalized calories per person per unit of time." If you divide it by weight, for example, you get "normalized calories per pound per unit of time."
 
  • #9
I have just tried that, and it doesn't change the correlation between height and weight. Am I doing something wrong?
 
  • #10
Do you need all of the predictors? High correlation among them indicates that they "have the same information". I am especially curious since, with a sample size of 20, using a multiple regression with 4 predictors is a bit odd.
 
  • #11
I know what you mean, but it's a study we have to do. We have been told to carry out a regression of calorie intake on weight, on height, and on height and weight together, and then comment on what goes wrong and how to solve it.
 
  • #12
i.e. how to solve the fact that neither is significant in the regression on height and weight, due to the correlation between the two.
 
  • #13
bbb999 said:
i.e. how to solve the fact that neither is significant in the regression on height and weight, due to the correlation between the two.

By "solve" do you mean "explain, and take remedy" or "perform some work that will allow both to be used in the regression"?
 
  • #14
Sorry, it says to explain why the contradiction occurs, and why, if the results are interpreted correctly, the right conclusion can be drawn with no contradictions.
 
  • #15
Can anyone help me with how the results, if interpreted correctly, can still lead to the right conclusion?
 
  • #16
What conclusions do you draw from the results? Can you post about that?
 
  • #17
Are the conclusions not just that calorie intake can be modeled using height as a variable, and using weight, but not using both together?
 
  • #18
A question is: are you losing any information when you have both H and W as regressors?
 
  • #19
No; are you not gaining information?
 
  • #20
"Bingo!" You are gaining some additional info when you include both H and W, even though they are highly correlated and neither has an individually significant t-statistic.

Which regression statistic tells you about the joint significance of all of the slope variables simultaneously?
 
  • #21
Would it be the F statistic?

So can I say that even though the correlation causes neither to be significant individually, the fact that we have more information means it is a better model?
 
  • #22
I think you have the answer.
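In R, the joint test can be read off the overall F statistic in summary(), or run explicitly as a model comparison (reusing the simulated variables from the sketch in post #1):

full <- lm(calorie ~ height + weight)

summary(full)$fstatistic          # overall F statistic with its degrees of freedom
anova(lm(calorie ~ 1), full)      # the same test as an explicit model comparison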
 
  • #23
thanks for all the help. So can you just check this:

I need to use the F statistic to show that even though neither is individually significant, because the correlation lowers the t-values, the fact that the joint model has more information than the two models containing just height or just weight means that it is still a good model?
 
  • #24
Yes; in other words, one cannot justifiably argue that it's a poor model because "none of the variables is significant."
 
  • #25
Thanks again. Just to check: would I need to talk about the F statistic, or could I say the above without it?
 
  • #26
What I mean is, can I just say that even though the last model shows height and weight not to be significant, the first two show that they are, and the last model just adds more information to these initial models? So despite the t-values, the third model is still a good model of calorie intake?
 
  • #27
bbb999 said:
What I mean is, can I just say that even though the last model shows height and weight not to be significant, the first two show that they are, and the last model just adds more information to these initial models? So despite the t-values, the third model is still a good model of calorie intake?
Yes.
 
  • #28
Thanks, I just wanted to make sure I didn't need to mention the F statistic.
 
  • #29
Why?
 

1. What is linear regression and how is it used in data analysis?

Linear regression is a statistical method used to model the relationship between two or more variables. It is commonly used in data analysis to understand and predict the behavior of a dependent variable based on one or more independent variables. It involves fitting a straight line (or, with several predictors, a plane) to a set of data points to best describe the relationship between the variables.

2. How do you determine if there is a high correlation between two variables?

Correlation refers to the strength and direction of the relationship between two variables. To determine if there is a high correlation, we use a correlation coefficient, such as Pearson's r, which measures the linear relationship between two variables. A correlation coefficient close to 1 or -1 indicates a strong positive or negative correlation, while a coefficient close to 0 indicates a weak or no correlation.
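A minimal check in R, using toy vectors made up for illustration:

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9)
cor(x, y)        # Pearson's r, here close to 1: strong positive linear association
cor.test(x, y)   # adds a significance test and a confidence interval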

3. What are some common problems encountered when using linear regression for high correlation data?

One common problem is multicollinearity, which occurs when there is a high correlation between independent variables in the regression model. This can lead to inaccurate coefficient estimates and difficulty in interpreting the results. Another problem is overfitting, where the model fits the training data too closely and may not perform well on new data.

4. How can high correlation problems be addressed in linear regression?

To address multicollinearity, we can use techniques such as feature selection or regularization to select the most important variables or penalize the coefficients of highly correlated variables. To avoid overfitting, we can use cross-validation techniques to evaluate the model's performance on unseen data and make adjustments as needed.
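As a sketch of one such remedy, ridge regression shrinks the coefficients of correlated predictors. The example below uses MASS::lm.ridge (MASS ships with R) on made-up collinear data:

library(MASS)

set.seed(3)
x1 <- rnorm(50)
x2 <- -0.9 * x1 + rnorm(50, sd = 0.3)   # highly correlated with x1
y  <- 2 * x1 + 2 * x2 + rnorm(50)

fit <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))
select(fit)      # suggested penalties, including one chosen by generalized cross-validation
coef(fit)[1, ]   # coefficients at lambda = 0, i.e. ordinary least squares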

5. Are there any alternative methods to linear regression for dealing with high correlation data?

Yes, there are alternative methods such as decision trees, random forests, and support vector machines that can handle high correlation data. These methods do not rely on the assumption of a linear relationship between variables and can handle multicollinearity more effectively. However, they may not provide as much interpretability as linear regression.
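For instance, a random forest can be fit to the same kind of collinear data with the randomForest package (a third-party package; install.packages("randomForest") if needed):

library(randomForest)

set.seed(4)
x1 <- rnorm(100)
x2 <- -0.9 * x1 + rnorm(100, sd = 0.3)   # highly correlated predictors
y  <- x1 + x2 + rnorm(100)

fit <- randomForest(y ~ x1 + x2, data = data.frame(y, x1, x2))
importance(fit)   # variable importance scores in place of interpretable slopes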
