(regression) why would you exclude an explanatory variable

Summary
Excluding relevant explanatory variables, such as education, from a regression model can lead to biased estimators, especially if those variables are correlated with the included predictors. This bias can produce coefficients with the wrong sign and distorted standard errors, undermining the model's validity. Multicollinearity is a concern, but it rarely justifies omitting a potentially important variable, since including it can improve the accuracy of the estimates. Also, adding any predictor will increase the R^2 value, so a higher R^2 alone does not show that the model is better. Ultimately, the decision to include or exclude variables should balance model complexity against the accuracy of the estimates.
mrcleanhands
If someone is interested in modelling data on households to find out whether there is discrimination in the workplace, why would they ever leave out variables that are relevant to explaining the dependent variable but not so relevant to the investigation?

E.g., let's say they survey age, race and family honor rank (out of 100), and the Y variable is employability (also out of 100). This is a pretty bad example, but it's just to help illustrate my question.


Why would you exclude "education" from this regression, or never bother to collect it?

Although it's probably not relevant to what we are trying to discover, won't you possibly get biased estimators (if education is correlated with "honor rank")?

The only other explanation I could think of is multicollinearity, but then honor rank would have to be highly correlated with education, and we don't know that.

So isn't it best to just include the "education" variable as a sort of insurance to make sure our estimators turn out right? We can handle multicollinearity once we test the regression.
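A minimal simulation sketch of the bias the question is worried about, with entirely made-up variable names and coefficients: "education" drives employability and is correlated with "honor rank", so fitting without education inflates the honor-rank coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process (all names and coefficients are invented):
# education affects employability directly and is correlated with honor rank.
education = rng.normal(size=n)
honor_rank = 0.8 * education + rng.normal(size=n)
employability = 2.0 * honor_rank + 3.0 * education + rng.normal(size=n)

def ols(X, y):
    """Ordinary least squares with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

full = ols(np.column_stack([honor_rank, education]), employability)  # both predictors
short = ols(honor_rank.reshape(-1, 1), employability)                # education omitted

print("honor_rank coefficient, full model :", full[1])   # close to the true value 2.0
print("honor_rank coefficient, short model:", short[1])  # biased upward: it absorbs part of education's effect
```

With education omitted, the honor-rank coefficient soaks up the effect of the correlated omitted variable, which is exactly the omitted-variable bias described above.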
 
Several possibilities:

* correlated predictors can play havoc with the estimates: coefficients can have the wrong sign (i.e., previous work indicates that a variable should contribute with a positive coefficient, but your model gives it a negative coefficient)
* correlated predictors can distort the standard errors of the estimates
* even if the predictors are not correlated, we look for models that do a good job as efficiently as possible. A crude but widely used way to assess the "worth" of a regression model is its R^2 value: it is a mathematical fact that R^2 never decreases (and almost always increases) when a new predictor is introduced, regardless of whether that predictor is appropriate. We have to decide whether the increase in R^2 is worth the added complexity of the model: if it is, keep the predictor; if not, don't (see the sketch below).
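A quick numerical check of the R^2 point, again just a rough sketch with made-up data: adding a column of pure noise cannot lower R^2, which is why a higher R^2 by itself doesn't justify keeping a predictor.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

x1 = rng.normal(size=n)
y = 1.5 * x1 + rng.normal(size=n)
noise = rng.normal(size=n)        # a predictor with no relation to y at all

def r_squared(X, y):
    """Fit OLS with an intercept and return the usual R^2 = 1 - SSR/SST."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print("R^2 with x1 only :", r_squared(x1.reshape(-1, 1), y))
print("R^2 with x1+noise:", r_squared(np.column_stack([x1, noise]), y))
# The second value is never smaller than the first, even though the extra
# column is pure noise -- hence R^2 alone can't justify adding a predictor.
```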
 
