Residuals are not 'random' -- How do I resolve?

  • Thread starter: semidevil
  • Tags: Random
AI Thread Summary
The discussion centers on issues with residuals in a regression analysis using age and gender as variables. The residual plot appears non-random and resembles a curve, indicating that a simple linear model may not be suitable. Participants suggest considering nonlinear models by incorporating higher-order terms, such as squares or cubes, to better fit the data. There is also a call for clarity on the dependent variable, as the current setup with gender and age lacks context for meaningful regression analysis. Overall, the conversation emphasizes the need for a more appropriate modeling approach to address the observed residual patterns.
semidevil
I'm dabbling with regression (in Excel), but I'm stuck because my residual plot is not normal. I have 2 variables: age and gender (0 or 1). I regress it in Excel and also plot the residuals, and they are not random. In general, how do I solve this issue?

If it matters, the plot of my raw data looks like a curve. It curves up and slowly curves back down (imagine a sine function that goes from 0 to pi).
 
Generally, this means that your model is not appropriate (in this case, that a simple linear model does not adequately explain the data). It's hard to say more without seeing the data/model.
 
Thanks for the feedback. When you say that a simple linear model cannot explain the data, do you mean that it could be nonlinear? Is it possible to determine that by visual inspection of the scatter plot? From what I can see, it does curve up and then down like a sine function, so maybe it is nonlinear? If it is nonlinear, how do I go about regressing it? If that's not the issue, how do I go about determining the best model (linear or otherwise)?
 
semidevil said:
I'm dabbling with regression (in Excel), but I'm stuck because my residual plot is not normal. I have 2 variables: age and gender (0 or 1). I regress it in Excel and also plot the residuals, and they are not random. In general, how do I solve this issue?

If it matters, the plot of my raw data looks like a curve. It curves up and slowly curves back down (imagine a sine function that goes from 0 to pi).

Linear regression is used if there is some reason to believe the data is linear. This seems not to be the case here.

There is no procedure to input data and get a model in return. You have to have some reason to think that the data matches your model. Statistics is meant to tell you whether your guess is reasonable or not, not to guess for you.
 
semidevil said:
Thanks for the feedback. When you say that a simple linear model cannot explain the data, do you mean that it could be nonlinear? Is it possible to determine that by visual inspection of the scatter plot? From what I can see, it does curve up and then down like a sine function, so maybe it is nonlinear? If it is nonlinear, how do I go about regressing it? If that's not the issue, how do I go about determining the best model (linear or otherwise)?

Yes, it seems to be nonlinear. Linear regression could still apply, but you should add in higher order terms. Try to make a model with squares or cubes instead of just a linear parameter. That might do the trick.
It's hard to give more advice without some specific pictures and details about the model you're trying to fit.
 
semidevil said:
Thanks for the feedback. When you say that a simple linear model cannot explain the data, do you mean that it could be nonlinear? Is it possible to determine that by visual inspection of the scatter plot? From what I can see, it does curve up and then down like a sine function, so maybe it is nonlinear? If it is nonlinear, how do I go about regressing it? If that's not the issue, how do I go about determining the best model (linear or otherwise)?
What confidence interval do you get for the slope, and what value for ##r^2##?
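For readers who want to check those two numbers outside Excel, here is a minimal sketch using the textbook formulas for simple linear regression. The data are hypothetical (a noise-free half sine wave standing in for the hump the OP describes), and the critical value 1.96 is a large-sample assumption rather than the exact t quantile.

```python
import numpy as np

# Hypothetical data shaped like the hump described in the thread.
x = np.linspace(0.0, np.pi, 50)
y = np.sin(x)
n = len(x)

# Simple linear regression y = intercept + slope * x by the textbook formulas.
sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
slope = sxy / sxx
intercept = y.mean() - slope * x.mean()

# Coefficient of determination from the residual and total sums of squares.
residuals = y - (intercept + slope * x)
sse = np.sum(residuals ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - sse / sst

# Standard error of the slope, with n - 2 residual degrees of freedom.
se_slope = np.sqrt(sse / (n - 2) / sxx)

# ASSUMPTION: 1.96 is the large-sample normal critical value; the exact
# 95% interval uses a t quantile with n - 2 degrees of freedom (about 2.01 here).
crit = 1.96
lo, hi = slope - crit * se_slope, slope + crit * se_slope

print(f"slope = {slope:.4f}, approx. 95% CI = ({lo:.4f}, {hi:.4f})")
print(f"r^2   = {r2:.4f}")
```

For a symmetric hump like this one, the slope interval straddles zero and ##r^2## is essentially zero even though ##Y## clearly depends on ##X##, which is one way a straight-line fit can hide a curved relationship.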
 
micromass said:
Yes, it seems to be nonlinear. Linear regression could still apply, but you should add in higher order terms. Try to make a model with squares or cubes instead of just a linear parameter. That might do the trick.
It's hard to give more advice without some specific pictures and details about the model you're trying to fit.
Good point. Modeling it as ##aX^2 + bX + c## would allow linear regression to determine the parabola coefficients ##a##, ##b##, ##c## that best fit the data. And it sounds like that may be what is needed. This is still called linear regression because the coefficients ##a##, ##b##, ##c## enter the model linearly. The ##X^2## term does not prevent applying linear regression. Of course, there will be a very strong correlation between the ##X## and ##X^2## data entries. Step-wise linear regression should be used to take the correlations into account when it determines the final model. Excel might not have a good step-wise regression. In that case you might want to look into a statistical package like R.
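As a rough illustration of the point that a quadratic model is still fitted by ordinary linear least squares, the sketch below adds an ##X^2## column to the design matrix and compares it with a straight-line fit. The data are hypothetical (a half sine wave, since the OP's numbers were never posted), and the helper `fit` is a name introduced only for this example.

```python
import numpy as np

# Hypothetical data: a response that rises and falls like half a sine wave.
x = np.linspace(0.0, np.pi, 50)
y = np.sin(x)

# Design matrices: one column per regressor, one coefficient per column.
X_line = np.column_stack([np.ones_like(x), x])         # c + b*x
X_quad = np.column_stack([np.ones_like(x), x, x**2])   # c + b*x + a*x^2

def fit(X, y):
    """Ordinary least squares; returns the coefficients and the residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

coef_line, res_line = fit(X_line, y)
coef_quad, res_quad = fit(X_quad, y)

# Residual sums of squares for the two models.
sse_line = float(np.sum(res_line**2))
sse_quad = float(np.sum(res_quad**2))

print("straight line SSE:", sse_line)
print("quadratic SSE:    ", sse_quad)
print("quadratic coefficients (c, b, a):", coef_quad)
```

In Excel the same idea amounts to adding a column holding the squared predictor and including it as an extra regressor. Whether a quadratic is adequate for the real data can only be judged by plotting the new residuals.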
 
FactChecker said:
Of course, there will be a very strong correlation between the ##X## and ##X^2## data entries.

A standard trick to remove this problem is to center the variables. So instead of using ##Y = a + bX + cX^2## as a model, you should use ##Y = a + b(X - \overline{X}) + c(X - \overline{X})^2##. It describes the same family of curves of course (only the meaning of ##a## and ##b## changes), but the strong correlation between the linear and quadratic terms is largely gone this way.
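The effect of centering is easy to check numerically. The sketch below uses a hypothetical set of ages (20 to 60 in one-year steps, not the OP's data) and compares the correlation between the linear and quadratic terms before and after subtracting the mean.

```python
import numpy as np

# Hypothetical predictor: ages from 20 to 60 in steps of one year.
age = np.arange(20.0, 61.0)

# Correlation between the raw linear and quadratic terms.
r_raw = np.corrcoef(age, age**2)[0, 1]

# The same two terms after subtracting the mean age.
centered = age - age.mean()
r_centered = np.corrcoef(centered, centered**2)[0, 1]

print(f"corr(X, X^2) raw:      {r_raw:.4f}")
print(f"corr(X, X^2) centered: {r_centered:.4f}")
```

The centered correlation vanishes here only because these made-up ages are symmetric about their mean. With skewed data, centering typically reduces the correlation without removing it entirely.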
 
Wait... how do you do all that with a binary variable (gender)?
 
  • #10
mfb said:
Wait... how do you do all that with a binary variable (gender)?
I think OP is using standard linear regression with numerical variables for dependent and independent variables.
 
  • #11
It depends highly on the specifics. I find this thread a bit annoying because we're basically shooting in the dark: the OP hasn't given us any plots or numbers or anything. It's hard to give any meaningful advice then.

In any case, with a categorical variable, I would analyse the two genders separately first. Then you can bring them in a full model containing perhaps an interaction term or multiple ones.
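A minimal sketch of that two-step approach, assuming an unnamed numeric response `y` with predictors `age` and `gender` (0 or 1). All numbers are simulated for illustration, and `ols` is a helper name introduced only here.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: unnamed response y, predictors age and gender (0 or 1).
n = 200
age = rng.uniform(20, 60, n)
gender = rng.integers(0, 2, n)
y = 1.0 + 0.5 * age + 2.0 * gender + 0.3 * gender * age + rng.normal(0, 1, n)

def ols(X, y):
    """Ordinary least squares coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Step 1: fit each gender separately (one intercept and one slope per group).
fits = {}
for g in (0, 1):
    m = gender == g
    fits[g] = ols(np.column_stack([np.ones(m.sum()), age[m]]), y[m])

# Step 2: one full model with a gender main effect and an age-by-gender interaction.
X_full = np.column_stack([np.ones(n), age, gender, age * gender])
b0, b_age, b_gender, b_inter = ols(X_full, y)

print("gender 0 (intercept, slope):", fits[0])
print("gender 1 (intercept, slope):", fits[1])
print("full model:", b0, b_age, b_gender, b_inter)
```

With every term interacted, the full model reproduces the two separate fits: the gender-1 intercept is `b0 + b_gender` and its slope is `b_age + b_inter`. The advantage of the combined model is that the interaction coefficient can be tested, and dropped if the two slopes do not differ meaningfully.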
 
  • #12
WWGD said:
I think OP is using standard linear regression with numerical variables for dependent and independent variables.
I just see age and gender mentioned. And gender is a binary variable, which makes regression a bit... simple?
 
  • #13
Well, I took it as an unnamed variable ##Y## with predictors gender and age.
 
  • #14
mfb said:
I just see age and gender mentioned. And gender is a binary variable, which makes regression a bit... simple?
My bad, you're right.
 
  • #15
Can this be anything other than a logistic regression, given that one of the inputs is Boolean/binary? What are the options for the dependent variable?
 
  • #16
WWGD said:
Can this be anything other than a logistic regression, given that one of the inputs is Boolean/binary? What are the options for the dependent variable?

Come on, a regression ##\text{gender} \sim \text{age}## makes no sense at all. Who in their right mind would try to predict gender based on the age? There has to be a dependent variable that the OP is not telling us.
 
  • #17
micromass said:
Come on, a regression ##\text{gender} \sim \text{age}## makes no sense at all. Who in their right mind would try to predict gender based on the age? There has to be a dependent variable that the OP is not telling us.
These two may be independent variables used to logistically regress some third variable. There is no specification in the OP as to whether either of these is the dependent variable or not.
 
  • #18
I suggest waiting for @semidevil to explain in more detail what was done and what went wrong.
 