Linear regression and high correlation problems

In summary: in the first case (calorie intake), height and weight are each significant in the individual regressions, but when both are included together neither is significant. In the second case (skinfold thickness), the reverse happens: neither is significant individually, but both become significant when included together.
  • #1
bbb999
Hi guys,

I have data on 20 people's height, weight, calorie intake and skinfold thickness. I have carried out a regression of calorie intake on height, on weight, and on height and weight together. I have done the same thing for skinfold thickness. I then used R to work out the summary of results. Each model also has an intercept, i.e. y = beta1 + beta2*x.

Using the t-values, I have found that for calorie intake both height and weight are significantly different from zero in the individual models. But when I look at the model where height and weight are both included, both become non-significant.

For skinfold thickness the opposite happens: height and weight are not significantly different from zero individually, but in the model including both they both become significant.

I have found the correlation between weight and height to be -0.88, which is high in magnitude. Can anyone help me explain what causes the changes in significance?
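For reference, here is a minimal R sketch of the setup with simulated stand-in data (my actual 20 observations aren't reproduced here, and the coefficients below are arbitrary):

set.seed(1)
n       <- 20
height  <- rnorm(n, mean = 170, sd = 10)
weight  <- 150 - 0.5 * height + rnorm(n, sd = 2)   # strongly negatively correlated with height
calorie <- 2 * height + rnorm(n, sd = 20)          # stand-in response

cor(height, weight)                      # large and negative, comparable to the -0.88 above

summary(lm(calorie ~ height))            # height alone: typically significant
summary(lm(calorie ~ weight))            # weight alone: typically significant, via its link to height
summary(lm(calorie ~ height + weight))   # both together: standard errors inflate, t-values shrink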

thanks in advance
 
  • #3
OK, I have read that. So am I right to say that in case 1 the high correlation causes the standard errors to increase, which lowers the t-values, and that leads to neither weight nor height being significant in the joint model? But I am not sure why the opposite happens in the second case.
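If I have the mechanics right, with two predictors the sampling variance of each slope estimate is multiplied by the variance inflation factor 1/(1 - r^2). A quick check in R with the correlation I reported above:

r   <- -0.88          # correlation between height and weight
vif <- 1 / (1 - r^2)
vif                   # about 4.4: each slope variance is inflated roughly fourfold
sqrt(vif)             # so each standard error roughly doubles (~2.1x)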

Also, is it possible in case 1 to transform the model to reduce the correlation between height and weight?

thanks in advance
 
  • #4
In a finite sample "anything's possible" due to outliers and other idiosyncrasies. With that caveat, one guess is that the true model is SF = b0 + b1 H + b2 W + u and the regression does a good job of identifying both factors despite their high negative correlation. In contrast, each of the partial models SF = a0 + a1 H + u and SF = c0 + c1 W + u underestimates the slope coefficient due to omitted variable bias.
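A quick simulated illustration of that story, with arbitrary stand-in numbers (the true coefficients 0.4 and 0.6 below are made up):

set.seed(2)
n  <- 20
H  <- rnorm(n, 170, 10)
W  <- 150 - 0.5 * H + rnorm(n, sd = 2.5)       # strong negative correlation with H
SF <- 1 + 0.4 * H + 0.6 * W + rnorm(n, sd = 3)

coef(lm(SF ~ H + W))   # close to the true 0.4 and 0.6
coef(lm(SF ~ H))       # H slope biased toward 0.4 + 0.6*(-0.5) = 0.1
coef(lm(SF ~ W))       # W slope likewise biased toward roughly zero

Each partial model attributes part of the omitted variable's effect to the included one; because the correlation is negative and both true coefficients are positive, the marginal slopes are pulled toward zero.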
 
  • #5
Ah OK. Also, is there a way of reducing the correlation in case 1 by transforming the models?
 
  • #6
I would start by re-defining calories as calories/inch or calories/pound: regress calories/inch on W (or on H and W together), and regress calories/pound on H (or on H and W together).
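With the simulated height, weight and calorie variables from the sketch in post #1, that would look like:

cal_per_inch  <- calorie / height
cal_per_pound <- calorie / weight

summary(lm(cal_per_inch  ~ weight))            # calories/inch on W
summary(lm(cal_per_pound ~ height))            # calories/pound on H
summary(lm(cal_per_inch  ~ height + weight))   # or on both together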
 
  • #7
Erm, the data I have is a calorie intake index. I am not quite sure how to convert this to calories/inch, for example.
 
  • #8
Imagine the index is "normalized calories per person per unit of time." If you divide it by weight, for example, you get "normalized calories per pound per unit of time."
 
  • #9
I have just tried that, and it doesn't change the correlation between height and weight. Am I doing something wrong?
 
  • #10
Do you need all of the predictors? High correlation among them indicates that they "have the same information". I am especially curious since, with a sample size of 20, using a multiple regression with 4 predictors is a bit odd.
 
  • #11
I know what you mean, but it's a study we have to do. We have been told to carry out a regression of calorie intake on weight, on height, and on height and weight together, and then comment on what goes wrong and how to solve it.
 
  • #12
i.e. how to solve the fact that neither is significant in the regression on height and weight, due to the correlation between the two.
 
  • #13
bbb999 said:
i.e. how to solve the fact that neither is significant in the regression on height and weight, due to the correlation between the two.

By "solve" do you mean "explain, and take remedy" or "perform some work that will allow both to be used in the regression"?
 
  • #14
Sorry, it says to explain why the contradiction occurs, and why, if the results are interpreted correctly, the right conclusion can be drawn with no contradictions.
 
  • #15
Can anyone help me with how the results, if interpreted correctly, can still lead to the right conclusion?
 
  • #16
What conclusions do you draw from the results? Can you post about that?
 
  • #17
Are the conclusions not just that calorie intake can be modeled using height as a variable, and using weight, but not using both together?
 
  • #18
A question is: are you losing any information when you have both H and W as regressors?
 
  • #19
No; are you not gaining information?
 
  • #20
"Bingo!" You are gaining some additional info when you include both H and W, even though they are highly correlated and neither has an individually significant t-statistic.

Which regression statistic tells you about the joint significance of all of the slope variables simultaneously?
 
  • #21
Would it be the F statistic?

So can I say that even though the correlation causes neither to be significant individually, the fact that we have more information means it is a better model?
 
  • #22
I think you have the answer.
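In R, the joint test can be read off the overall F statistic in summary(), or run explicitly as a model comparison (reusing the simulated variables from the sketch in post #1):

full <- lm(calorie ~ height + weight)

summary(full)$fstatistic          # overall F statistic with its degrees of freedom
anova(lm(calorie ~ 1), full)      # the same test as an explicit model comparison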
 
  • #23
thanks for all the help. So can you just check this:

I need to use the F statistic to show that even though neither is individually significant, because the correlation lowers the t-values, the fact that the joint model has more information than the two models containing just height or just weight means that it is still a good model?
 
  • #24
Yes; in other words, one cannot justifiably argue that it's a poor model because "none of the variables is significant."
 
  • #25
Thanks again. Just to check: would I need to talk about the F statistic, or could I say the above without it?
 
  • #26
What I mean is, can I just say that even though the last model shows height and weight not to be significant, the first two show that they are, and the last model just adds more information to these initial models? So despite the t-values, the third model is still a good model of calorie intake?
 
  • #27
bbb999 said:
What I mean is, can I just say that even though the last model shows height and weight not to be significant, the first two show that they are, and the last model just adds more information to these initial models? So despite the t-values, the third model is still a good model of calorie intake?
Yes.
 
  • #28
Thanks, I just wanted to make sure I didn't need to mention the F statistic.
 
  • #29
Why?
 

1. What is linear regression and how is it used in data analysis?

Linear regression is a statistical method used to model the relationship between two or more variables. It is commonly used in data analysis to understand and predict the behavior of a dependent variable based on one or more independent variables. It involves fitting a straight line (or, with several predictors, a plane) to a set of data points to best describe the relationship between the variables.

2. How do you determine if there is a high correlation between two variables?

Correlation refers to the strength and direction of the relationship between two variables. To determine if there is a high correlation, we use a correlation coefficient, such as Pearson's r, which measures the linear relationship between two variables. A correlation coefficient close to 1 or -1 indicates a strong positive or negative correlation, while a coefficient close to 0 indicates a weak or no correlation.
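A minimal check in R, using toy vectors made up for illustration:

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9)
cor(x, y)        # Pearson's r, here close to 1: strong positive linear association
cor.test(x, y)   # adds a significance test and a confidence interval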

3. What are some common problems encountered when using linear regression for high correlation data?

One common problem is multicollinearity, which occurs when there is a high correlation between independent variables in the regression model. This can lead to inaccurate coefficient estimates and difficulty in interpreting the results. Another problem is overfitting, where the model fits the training data too closely and may not perform well on new data.

4. How can high correlation problems be addressed in linear regression?

To address multicollinearity, we can use techniques such as feature selection or regularization to select the most important variables or penalize the coefficients of highly correlated variables. To avoid overfitting, we can use cross-validation techniques to evaluate the model's performance on unseen data and make adjustments as needed.
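As a sketch of one such remedy, ridge regression shrinks the coefficients of correlated predictors. The example below uses MASS::lm.ridge (MASS ships with R) on made-up collinear data:

library(MASS)

set.seed(3)
x1 <- rnorm(50)
x2 <- -0.9 * x1 + rnorm(50, sd = 0.3)   # highly correlated with x1
y  <- 2 * x1 + 2 * x2 + rnorm(50)

fit <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))
select(fit)      # suggested penalties, including one chosen by generalized cross-validation
coef(fit)[1, ]   # coefficients at lambda = 0, i.e. ordinary least squares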

5. Are there any alternative methods to linear regression for dealing with high correlation data?

Yes, there are alternative methods such as decision trees, random forests, and support vector machines that can handle high correlation data. These methods do not rely on the assumption of a linear relationship between variables and can handle multicollinearity more effectively. However, they may not provide as much interpretability as linear regression.
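For instance, a random forest can be fit to the same kind of collinear data with the randomForest package (a third-party package; install.packages("randomForest") if needed):

library(randomForest)

set.seed(4)
x1 <- rnorm(100)
x2 <- -0.9 * x1 + rnorm(100, sd = 0.3)   # highly correlated predictors
y  <- x1 + x2 + rnorm(100)

fit <- randomForest(y ~ x1 + x2, data = data.frame(y, x1, x2))
importance(fit)   # variable importance scores in place of interpretable slopes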
