Residuals are not 'random' -- How do I resolve?

  • Thread starter semidevil
In summary, linear regression can still be used to fit a nonlinear model, but it may require the use of higher order terms.
  • #1
semidevil
I'm dabbling with regression (in Excel), but I'm stuck because my residual plot is not normal. I have 2 variables: age and gender (0 or 1). When I run the regression in Excel and plot the residuals, they are not random. In general, how do I solve this issue?

If it matters, my raw data looks like a curve when plotted. It curves up and slowly curves back down (imagine a sine function that goes from 0 to pi).
 
  • #2
Generally, this means that your model is not appropriate (in this case, that a simple linear model does not adequately explain the data). It's hard to say more without seeing the data/model.
 
  • #3
Thanks for the feedback. When you say that a simple linear model cannot explain the data, do you mean that it could be nonlinear? Is it possible to determine that by visual inspection of the scatter plot? From what I can see, it does curve up and then down like a sine function, so maybe it is nonlinear? If it is nonlinear, how do I go about regressing it? If that's not the issue, how do I go about determining the best model (linear or otherwise)?
 
  • #4
semidevil said:
I'm dabbling with regression (in Excel), but I'm stuck because my residual plot is not normal. I have 2 variables: age and gender (0 or 1). When I run the regression in Excel and plot the residuals, they are not random. In general, how do I solve this issue?

If it matters, my raw data looks like a curve when plotted. It curves up and slowly curves back down (imagine a sine function that goes from 0 to pi).

Linear regression is used if there is some reason to believe the data is linear. This seems not to be the case here.

There is no procedure to input data and get a model in return. You have to have some reason to think that the data matches your model. Statistics is meant to tell you whether your guess is reasonable or not, not to guess for you.
 
  • #5
semidevil said:
Thanks for the feedback. When you say that a simple linear model cannot explain the data, do you mean that it could be nonlinear? Is it possible to determine that by visual inspection of the scatter plot? From what I can see, it does curve up and then down like a sine function, so maybe it is nonlinear? If it is nonlinear, how do I go about regressing it? If that's not the issue, how do I go about determining the best model (linear or otherwise)?

Yes, it seems to be nonlinear. Linear regression could still apply, but you should add in higher order terms. Try to make a model with squares or cubes instead of just a linear parameter. That might do the trick.
It's hard to give more advice without some specific pictures and details about the model you're trying to fit.
 
  • #6
semidevil said:
Thanks for the feedback. When you say that a simple linear model cannot explain the data, do you mean that it could be nonlinear? Is it possible to determine that by visual inspection of the scatter plot? From what I can see, it does curve up and then down like a sine function, so maybe it is nonlinear? If it is nonlinear, how do I go about regressing it? If that's not the issue, how do I go about determining the best model (linear or otherwise)?
What confidence interval do you get for the slope, what value for r^2?
 
  • #7
micromass said:
Yes, it seems to be nonlinear. Linear regression could still apply, but you should add in higher order terms. Try to make a model with squares or cubes instead of just a linear parameter. That might do the trick.
It's hard to give more advice without some specific pictures and details about the model you're trying to fit.
Good point. Modeling it as ##aX^2 + bX + c## would allow linear regression to determine the parabola coefficients a, b, c that best fit the data. And it sounds like that may be what is needed. This is still called linear regression because the coefficients a, b, c are used in a linear way. The ##X^2## does not prevent applying linear regression. Of course, there will be a very strong correlation between the ##X## and ##X^2## data entries. Step-wise linear regression should be used to take the correlations into account when it determines the final model. Excel might not have a good step-wise regression. In that case you might want to look into a statistical package like R.
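
As a rough illustration of the point above, the following sketch fits both a straight line and a parabola by ordinary least squares. The data is synthetic (a noisy half sine arch, chosen only because the OP described that shape), and the helper name `fit_least_squares` is made up for this example. It uses NumPy rather than Excel or R, and it does not attempt step-wise selection.

```python
import numpy as np

# Synthetic data (an assumption for illustration): half a sine arch plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, np.pi, 50)
y = np.sin(x) + rng.normal(scale=0.05, size=x.size)


def fit_least_squares(design, y):
    """Return (coefficients, residuals) of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, y - design @ coef


# Straight line: design columns are [1, X].
line_design = np.column_stack([np.ones_like(x), x])
line_coef, line_resid = fit_least_squares(line_design, y)

# Parabola: design columns are [1, X, X^2]. The model is still linear
# in the unknown coefficients c, b, a, so the same solver applies.
quad_design = np.column_stack([np.ones_like(x), x, x**2])
quad_coef, quad_resid = fit_least_squares(quad_design, y)

line_sse = float(np.sum(line_resid**2))
quad_sse = float(np.sum(quad_resid**2))
print("line SSE:    ", line_sse)
print("parabola SSE:", quad_sse)
print("a (X^2 coefficient):", quad_coef[2])
```

On arch-shaped data like this, the parabola's sum of squared residuals should come out much smaller than the line's, and the fitted `a` should be negative. Whether a parabola is the right model for the OP's real data is a separate question that the fit alone cannot answer.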
 
  • #8
FactChecker said:
Of course, there will be a very strong correlation between the ##X## and ##X^2## data entries.

A standard trick to remove this problem is to center the variables. So instead of using ##Y = a + bX + cX^2## as a model, you should use ##Y = a + b(X - \overline{X}) + c(X - \overline{X})^2##. It's the same thing of course, but you have no strong correlations anymore this way.
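
A small numerical check of the centering trick, under stated assumptions: the ages below are hypothetical and evenly spaced, and the response `y` is invented for the example. With evenly spaced (symmetric) values the correlation between the centered term and its square vanishes. With skewed data it is usually reduced, not necessarily removed.

```python
import numpy as np

# Hypothetical predictor: ages 20, 21, ..., 60 (illustration only).
x = np.arange(20.0, 61.0)
xc = x - x.mean()  # centered predictor

# Correlation between the linear and squared terms, before and after centering.
raw_corr = np.corrcoef(x, x**2)[0, 1]
centered_corr = np.corrcoef(xc, xc**2)[0, 1]
print("corr(X, X^2) raw:     ", raw_corr)
print("corr(X, X^2) centered:", centered_corr)

# Invented response, used only to show that both parameterizations
# describe the same fitted curve.
rng = np.random.default_rng(1)
y = 5.0 + 0.8 * x - 0.01 * x**2 + rng.normal(scale=0.5, size=x.size)

raw_design = np.column_stack([np.ones_like(x), x, x**2])
cen_design = np.column_stack([np.ones_like(x), xc, xc**2])
raw_fit = raw_design @ np.linalg.lstsq(raw_design, y, rcond=None)[0]
cen_fit = cen_design @ np.linalg.lstsq(cen_design, y, rcond=None)[0]

same_fit = bool(np.allclose(raw_fit, cen_fit))
print("same fitted values:", same_fit)
```

The coefficients differ between the two versions, but the fitted values agree, which is the sense in which the centered model is "the same thing".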
 
  • #9
Wait... how do you do all that with a binary variable (gender)?
 
  • #10
mfb said:
Wait... how do you do all that with a binary variable (gender)?
I think OP is using standard linear regression with numerical variables for dependent and independent variables.
 
  • #11
It depends highly on the specifics. I find this thread a bit annoying since we're basically shooting in the dark since the OP hasn't given us any plots or numbers or anything. It's hard to give any meaningful advice then.

In any case, with a categorical variable, I would analyse the two genders separately first. Then you can bring them in a full model containing perhaps an interaction term or multiple ones.
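
Since the OP never named the response, the sketch below invents one (`y`) purely to show the two steps suggested above: fit each gender separately, then fit one model with an age-by-gender interaction term. All data and names here are assumptions for illustration.

```python
import numpy as np

# Invented data: age, a 0/1 gender code, and a made-up response y whose
# intercept and slope both differ between the two groups.
rng = np.random.default_rng(2)
n = 200
age = rng.uniform(20, 60, size=n)
gender = rng.integers(0, 2, size=n)  # values are 0 or 1
y = 10 + 0.5 * age + gender * (3 + 0.2 * age) + rng.normal(scale=1.0, size=n)


def ols(design, y):
    """Ordinary least-squares coefficients for the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Step 1: a separate straight-line fit (intercept, slope) per gender.
fits = {}
for g in (0, 1):
    mask = gender == g
    fits[g] = ols(np.column_stack([np.ones(mask.sum()), age[mask]]), y[mask])

# Step 2: one full model, y ~ 1 + age + gender + age*gender.
full_design = np.column_stack([np.ones(n), age, gender, age * gender])
b0, b_age, b_gender, b_inter = ols(full_design, y)

print("gender 0 (separate):", fits[0], " full model:", (b0, b_age))
print("gender 1 (separate):", fits[1], " full model:", (b0 + b_gender, b_age + b_inter))
```

With a full set of interaction terms, the combined model reproduces the per-group intercepts and slopes. The gender and interaction coefficients are then the differences between the two groups, which is what makes the combined model convenient for comparing them.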
 
  • #12
WWGD said:
I think OP is using standard linear regression with numerical variables for dependent and independent variables.
I just see age and gender mentioned. And gender is a binary variable, which makes regression a bit... simple?
 
  • #13
Well, I took it as an unnamed variable ##Y## with predictors gender and age.
 
  • #14
mfb said:
I just see age and gender mentioned. And gender is a binary variable, which makes regression a bit... simple?
My bad, you're right.
 
  • #15
Can this be anything other than a logistic regression, given that one of the inputs is Boolean/binary? What are the options for the dependent variable?
 
  • #16
WWGD said:
Can this be anything other than a logistic regression, given that one of the inputs is Boolean/binary? What are the options for the dependent variable?

Come on, a regression ##\text{gender} \sim \text{age}## makes no sense at all. Who in their right mind would try to predict gender based on the age? There has to be a dependent variable that the OP is not telling us.
 
  • #17
micromass said:
Come on, a regression ##\text{gender} \sim \text{age}## makes no sense at all. Who in their right mind would try to predict gender based on the age? There has to be a dependent variable that the OP is not telling us.
These two may be independent variables used to logistically regress some third variable. There is no specification in the OP as to whether either of these is the dependent variable or not.
 
  • #18
I suggest waiting for @semidevil to explain in more detail what was done and what went wrong.
 

1. What does it mean when residuals are not 'random'?

When the residuals of a statistical model are not 'random', it means that the model has left some systematic structure in the data unexplained. This could be due to factors that were not included in the model, or the model's functional form may not be an accurate representation of the data.

2. How do I know if my residuals are not 'random'?

One way to determine if your residuals are not 'random' is by examining a plot of the residuals. If the plot shows a clear pattern or trend, it suggests that there is still some underlying structure in the data that the model has not captured.
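
Besides looking at the plot, a crude numerical check is to sort the residuals by the predictor and count how often their sign flips. The sketch below does this on synthetic data. The helper names `line_residuals` and `sign_changes` are made up for the example, and this is an informal indicator, not a formal statistical test.

```python
import numpy as np


def line_residuals(x, y):
    """Residuals of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


def sign_changes(residuals):
    """Number of sign flips between consecutive residuals."""
    signs = np.sign(residuals)
    return int(np.sum(signs[1:] != signs[:-1]))


# Synthetic data (illustration only), with x already sorted.
rng = np.random.default_rng(3)
x = np.linspace(0.0, np.pi, 60)
y_curved = np.sin(x) + rng.normal(scale=0.02, size=x.size)   # arch-shaped
y_linear = 0.5 * x + rng.normal(scale=0.02, size=x.size)     # truly linear

changes_curved = sign_changes(line_residuals(x, y_curved))
changes_linear = sign_changes(line_residuals(x, y_linear))
print("sign changes, line fitted to curved data:", changes_curved)
print("sign changes, line fitted to linear data:", changes_linear)
```

Residuals that behave like noise flip sign roughly every other point. Long runs of same-sign residuals, as with the line fitted to the arch, suggest leftover structure such as curvature.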

3. What are the consequences of having non-random residuals?

Having non-random residuals can lead to biased and inaccurate results. This can affect the overall conclusions and interpretations of the study. Non-random residuals can also indicate that the model needs to be improved or revised to better fit the data.

4. How can I resolve non-random residuals?

To resolve non-random residuals, it is important to first identify the source of the problem. This could involve adding more variables to the model, transforming the data, or using a different modeling technique. It may also be necessary to collect more data to better understand the underlying patterns in the data.

5. Are non-random residuals always a problem?

Not necessarily. In some cases, non-random residuals may be expected and may not affect the overall validity of the model. For example, in time series analysis, the residuals may show a seasonal pattern. However, if the non-random residuals are significantly large or have a systematic pattern, it is important to address them to ensure the accuracy of the model.
