# Residuals are not 'random' -- How do I resolve?

1. Mar 8, 2016

### semidevil

I'm dabbling with regression (in Excel), but I'm stuck because my residual plot is not normal. I have 2 variables: age and gender (0 or 1). I run the regression in Excel and plot the residuals, and they are not random. In general, how do I solve this issue?

If it matters, the plot of my raw data looks like a curve. It curves up and slowly curves back down (imagine a sine function that goes from 0 to pi).

2. Mar 8, 2016

### Number Nine

Generally, this means that your model is not appropriate (in this case, that a simple linear model does not adequately explain the data). It's hard to say more without seeing the data/model.

3. Mar 9, 2016

### semidevil

Thanks for the feedback. When you say that a simple linear model cannot explain the data, do you mean that it could be nonlinear? Is it possible to determine that by visual inspection of the scatter plot? From what I can see, it does curve up and then down like a sine function, so maybe it is nonlinear? If it is nonlinear, how do I go about regressing it? If that's not the issue, how do I go about determining the best model (linear or otherwise)?

4. Mar 9, 2016

### Hornbein

Linear regression is used if there is some reason to believe the data is linear. This seems not to be the case here.

There is no procedure to input data and get a model in return. You have to have some reason to think that the data matches your model. Statistics is meant to tell you whether your guess is reasonable or not, not to guess for you.

5. Mar 9, 2016

### micromass

Staff Emeritus
Yes, it seems to be nonlinear. Linear regression could still apply, but you should add in higher-order terms. Try a model with squared or cubed terms in addition to the linear term. That might do the trick.
It's hard to give more advice without some specific pictures and details about the model you're trying to fit.
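As a rough illustration of the higher-order-term idea, here is a minimal sketch in plain Python. Since no data was posted, it uses a made-up, noise-free hump ($y = \sin x$ on $[0, \pi]$, matching the shape described in the first post), and the helper names `least_squares` and `sse` are hypothetical. It compares the residual sum of squares of a straight-line fit with that of a fit that also includes $x^2$.

```python
import math

def least_squares(columns, y):
    """Solve the normal equations for y ~ sum_j beta_j * columns[j]."""
    k = len(columns)
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(p * t for p, t in zip(columns[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def sse(columns, coeffs, y):
    """Residual sum of squares for the fitted model."""
    fitted = [sum(c * col[i] for c, col in zip(coeffs, columns))
              for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Assumed synthetic data: a hump that rises and falls like sin(x) on [0, pi].
x = [i * math.pi / 20 for i in range(21)]
y = [math.sin(v) for v in x]
ones = [1.0] * len(x)
x_sq = [v * v for v in x]

linear_cols = [ones, x]
quad_cols = [ones, x, x_sq]
linear_coeffs = least_squares(linear_cols, y)
quad_coeffs = least_squares(quad_cols, y)
sse_linear = sse(linear_cols, linear_coeffs, y)
sse_quad = sse(quad_cols, quad_coeffs, y)

print("straight line SSE:", sse_linear)
print("with x^2 term SSE:", sse_quad)
```

On real data the improvement will be less clean than on this idealized curve, so the residual plot of the larger model still needs to be checked.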

6. Mar 9, 2016

### WWGD

What confidence interval do you get for the slope, and what value for $r^2$?
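For reference, both quantities can be computed by hand for a one-predictor fit. The sketch below is only illustrative: the data is a made-up symmetric hump ($\sin x$ on $[0, \pi]$), not the OP's data, and the function name `simple_regression_summary` is hypothetical. The interval uses slope ± 2 standard errors as a rough stand-in for the exact $t$-based interval with $n - 2$ degrees of freedom.

```python
import math

def simple_regression_summary(x, y):
    """Return slope, intercept, r^2 and the slope's standard error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - sse / syy
    # Standard error of the slope from the residual variance (n - 2 df).
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, r_squared, se_slope

# Assumed synthetic data: rises and falls symmetrically, like sin(x) on [0, pi].
x = [i * math.pi / 20 for i in range(21)]
y = [math.sin(v) for v in x]

slope, intercept, r_squared, se_slope = simple_regression_summary(x, y)
# Rough interval only: the exact one uses a t quantile instead of 2.
approx_ci = (slope - 2 * se_slope, slope + 2 * se_slope)

print("slope:", slope, "approx. CI:", approx_ci)
print("r^2:", r_squared)
```

For a symmetric hump like this, the fitted slope and $r^2$ are both essentially zero even though the data is strongly structured, which is one way a straight-line model can look "insignificant" while the residuals are clearly not random.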

7. Mar 12, 2016

### FactChecker

Good point. Modeling it as $aX^2 + bX + c$ would allow linear regression to determine the parabola coefficients $a$, $b$, $c$ that best fit the data. And it sounds like that may be what is needed. This is still called linear regression because the coefficients $a$, $b$, $c$ are used in a linear way. The $X^2$ does not prevent applying linear regression. Of course, there will be a very strong correlation between the $X$ and $X^2$ data entries. Step-wise linear regression should be used to take the correlations into account when it determines the final model. Excel might not have a good step-wise regression. In that case you might want to look into a statistical package like R.
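To spell out why the squared term keeps this a linear regression, here is a short sketch in matrix form. The symbols $Z$, $\beta$, and $\varepsilon$ are notation introduced for this illustration, and the last formula assumes $Z^{T}Z$ is invertible.

$$Y_i = aX_i^2 + bX_i + c + \varepsilon_i, \qquad i = 1, \dots, n$$

$$y = Z\beta + \varepsilon, \qquad Z = \begin{pmatrix} X_1^2 & X_1 & 1 \\ \vdots & \vdots & \vdots \\ X_n^2 & X_n & 1 \end{pmatrix}, \qquad \beta = \begin{pmatrix} a \\ b \\ c \end{pmatrix}$$

$$\hat{\beta} = (Z^{T}Z)^{-1}Z^{T}y$$

The model is linear in $\beta$: the column $X_i^2$ is just another known predictor, so the usual least-squares solution applies unchanged.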

8. Mar 12, 2016

### micromass

Staff Emeritus
A standard trick to remove this problem is to center the variables. So instead of using $Y = a + bX + cX^2$ as a model, you should use $Y = a + b(X - \overline{X}) + c(X - \overline{X})^2$. It's the same thing of course, but you have no strong correlations anymore this way.
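A quick way to see the effect of centering is to compare the correlation between the linear and squared columns before and after. This is a minimal sketch with made-up ages (20 through 60, evenly spread); the `pearson` helper is written out by hand so that only the standard library is needed.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Assumed data: one subject at each age from 20 to 60.
ages = [float(a) for a in range(20, 61)]
mean_age = sum(ages) / len(ages)

raw_corr = pearson(ages, [a ** 2 for a in ages])

centered = [a - mean_age for a in ages]
centered_corr = pearson(centered, [c ** 2 for c in centered])

print("corr(X, X^2) raw:     ", raw_corr)
print("corr(X, X^2) centered:", centered_corr)
```

The correlation vanishes here because these ages are spread symmetrically around their mean. With a skewed sample, centering typically reduces the correlation between the two columns but does not have to remove it entirely.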

9. Mar 12, 2016

### Staff: Mentor

Wait... how do you do all that with a binary variable (gender)?

10. Mar 12, 2016

### WWGD

I think OP is using standard linear regression with numerical variables for dependent and independent variables.

11. Mar 12, 2016

### micromass

Staff Emeritus
It depends highly on the specifics. I find this thread a bit annoying: we're basically shooting in the dark because the OP hasn't given us any plots or numbers or anything. It's hard to give any meaningful advice then.

In any case, with a categorical variable, I would analyse the two genders separately first. Then you can bring them together in a full model containing perhaps an interaction term or multiple ones.
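To make the "separate fits, then a full model" idea concrete, here is a minimal sketch. Everything in it is assumed for illustration: the response `y` is a made-up, noise-free variable (the thread never names the real one), ages are already centered, and `least_squares` and `fit_group` are hypothetical helpers. The full model is $y = b_0 + b_1\,\text{age} + b_2\,\text{gender} + b_3\,(\text{age} \times \text{gender})$.

```python
def least_squares(columns, y):
    """Solve the normal equations for y ~ sum_j beta_j * columns[j]."""
    k = len(columns)
    a = [
        [sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(p * t for p, t in zip(columns[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Assumed data: (centered age, gender, response); each gender has its own line.
ages = [float(a) for a in range(-10, 11)]
rows = ([(a, 0.0, 2.0 + 0.5 * a) for a in ages]
        + [(a, 1.0, 3.0 + 0.2 * a) for a in ages])

def fit_group(g):
    """Fit intercept and slope using only the rows of one gender."""
    sub = [r for r in rows if r[1] == g]
    xs = [r[0] for r in sub]
    ys = [r[2] for r in sub]
    return least_squares([[1.0] * len(sub), xs], ys)

# Step 1: analyse the two genders separately.
intercept0, slope0 = fit_group(0.0)
intercept1, slope1 = fit_group(1.0)

# Step 2: one full model with an age-by-gender interaction term.
age = [r[0] for r in rows]
gender = [r[1] for r in rows]
y = [r[2] for r in rows]
ones = [1.0] * len(rows)
interaction = [a * g for a, g in zip(age, gender)]
b0, b_age, b_gender, b_inter = least_squares([ones, age, gender, interaction], y)

print("separate slopes:", slope0, slope1)
print("full model:", b0, b_age, b_gender, b_inter)
```

In the full model, `b_age` is the slope for gender 0 and `b_age + b_inter` is the slope for gender 1, so the interaction coefficient measures how much the age effect differs between the two groups.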

12. Mar 12, 2016

### Staff: Mentor

I just see age and gender mentioned. And gender is a binary variable, which makes regression a bit... simple?

13. Mar 12, 2016

### micromass

Staff Emeritus
Well, I took it as an unnamed variable $Y$ with predictors gender and age.

15. Mar 12, 2016

### WWGD

Can this be anything other than a logistic regression, given that one of the variables is Boolean/binary? What are the options for the dependent variable?

16. Mar 12, 2016

### micromass

Staff Emeritus
Come on, a regression $\text{gender} \sim \text{age}$ makes no sense at all. Who in their right mind would try to predict gender based on the age? There has to be a dependent variable that the OP is not telling us.

17. Mar 12, 2016

### WWGD

These two may be independent variables used to logistically regress some third variable. There is no specification in the OP as to whether either of these is the dependent variable or not.

18. Mar 12, 2016

### Staff: Mentor

I suggest waiting for @semidevil to explain in more detail what was done and what went wrong.