# (regression) why would you exclude an explanatory variable

1. Sep 10, 2010

### mrcleanhands

If someone is interested in modelling household data to find out whether there is discrimination in the workplace, why would they ever leave out variables that are relevant to explaining the dependent variable but not so relevant to the investigation?

e.g. let's say they survey age, race and family honour rank (out of 100), and the Y variable is employability (also out of 100). This is a pretty bad example, but it's just to help illustrate my question.

Why would you exclude "education" from this regression or never bother to collect it?

Although it's probably not relevant to what we are trying to discover, won't you possibly get biased estimators (if education is correlated with honour rank)?

The only other thing I could think of is multicollinearity as an explanation, but then honour rank would have to be highly correlated with education, and we don't know that.

So isn't it best to just include the education variable as a sort of insurance, to make sure our estimators turn out right? We can handle multicollinearity once we test the regression.
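The bias worry above is easy to see in a simulation. The sketch below (hypothetical numbers, using numpy; the coefficients 0.8, 2.0 and 3.0 are made up for illustration) generates data where education drives both honour rank and employability, then fits the regression with and without education:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: education affects both
# honour rank and employability directly.
education = rng.normal(0.0, 1.0, n)
honour_rank = 0.8 * education + rng.normal(0.0, 1.0, n)
employability = 2.0 * honour_rank + 3.0 * education + rng.normal(0.0, 1.0, n)

def ols(y, *cols):
    """OLS via least squares; returns coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Full model recovers the true honour_rank coefficient (2.0).
full = ols(employability, honour_rank, education)

# Omitting education biases the honour_rank coefficient upward:
# honour_rank absorbs part of the education effect it is correlated with.
short = ols(employability, honour_rank)

print(f"honour_rank coef, full model : {full[1]:.2f}")
print(f"honour_rank coef, short model: {short[1]:.2f}")
```

With these made-up parameters the short regression's honour-rank coefficient lands well above 2, which is exactly the omitted-variable bias the question is asking about.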

2. Sep 11, 2010

Several possibilities:

* correlated predictors can play havoc with the estimates: coefficients can have the wrong sign (i.e., previous work indicates that a variable should contribute with a positive coefficient, but your model has it with a negative coefficient)
* correlated predictors can inflate the standard errors of the estimates
* even if the predictors are not correlated, we look for models that do a good job as efficiently as possible. A crude but widely used way to assess the "worth" of a regression model is to look at its R^2 value: it is a mathematical fact that R^2 never decreases (and almost always increases) any time a new predictor is introduced, regardless of whether that predictor is or is not appropriate. We have to decide whether an increased value of R^2 is worth the added complexity of the model: if it is, keep the predictor; if not, drop it.
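The R^2 point in the last bullet can be checked directly. This sketch (hypothetical data, using numpy) adds a pure-noise predictor that has nothing to do with y and shows that R^2 still goes up; the adjusted R^2, which penalizes the extra parameter, is one common way to weigh fit against complexity:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)
noise = rng.normal(size=n)  # junk predictor, unrelated to y

def r_squared(y, *cols):
    """R^2 of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

def adj_r2(r2, n, k):
    """Adjusted R^2; k = number of predictors excluding the intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

r2_base = r_squared(y, x)
r2_junk = r_squared(y, x, noise)

print(f"R^2 without junk: {r2_base:.4f}, with junk: {r2_junk:.4f}")
print(f"adjusted:         {adj_r2(r2_base, n, 1):.4f} vs {adj_r2(r2_junk, n, 2):.4f}")
print(r2_junk >= r2_base)  # True: plain R^2 never decreases
```

The plain R^2 rewards the junk predictor anyway, which is why it alone can't tell you whether a variable like education is worth keeping.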