(regression) Why would you exclude an explanatory variable?

  • Context: Undergrad
  • Thread starter: mrcleanhands
  • Tags: Regression, Variable
SUMMARY

This discussion addresses the rationale behind excluding explanatory variables in regression analysis, particularly in the context of modeling household data to investigate workplace discrimination. The example provided highlights the potential pitfalls of omitting relevant variables, such as education, which can lead to biased estimators and distorted standard errors. The conversation emphasizes the importance of considering multicollinearity and the impact of adding variables on the R² value of the model. Ultimately, the decision to include or exclude variables should balance model complexity against the accuracy of estimators.

PREREQUISITES
  • Understanding of regression analysis and its components
  • Familiarity with multicollinearity and its effects on regression models
  • Knowledge of R² as a measure of model fit
  • Experience with statistical software for regression modeling
NEXT STEPS
  • Explore techniques for detecting and addressing multicollinearity in regression models
  • Learn how to interpret R² values and their implications for model selection
  • Investigate the impact of omitted variable bias on regression results
  • Study best practices for variable selection in regression analysis
USEFUL FOR

Data analysts, statisticians, and researchers interested in regression modeling and the implications of variable selection in statistical analyses.

mrcleanhands
If someone is interested in modelling household data to find out whether there is discrimination in the workplace, why would they ever leave out variables that are relevant to explaining the dependent variable but not so relevant to the investigation?

e.g. let's say they survey age, race and family honor rank (out of 100), and the Y variable is employability (also out of 100). This is a pretty bad example, but it's just to help illustrate my question.


Why would you exclude "education" from this regression or never bother to collect it?

Although it's probably not relevant to what we are trying to discover, won't you possibly get biased estimators (if education is correlated with "honor rank")?

The only other explanation I could think of is multicollinearity, but then honor rank would have to be highly correlated with education, and we don't know that.

So isn't it best to just include the "education" variable as a sort of insurance to make sure our estimators turn out right? And we can handle multicollinearity once we test the regression.
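The bias the question worries about is easy to simulate. Below is a minimal numpy sketch (all data and variable names are hypothetical, loosely following the example above): education drives both honor rank and employability, so dropping education from the regression biases the honor_rank coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process: education affects both
# honor_rank and employability, so it is a confounder.
education = rng.normal(0.0, 1.0, n)
honor_rank = 0.8 * education + rng.normal(0.0, 1.0, n)
employability = 1.0 * honor_rank + 2.0 * education + rng.normal(0.0, 1.0, n)

def ols(X, y):
    """OLS coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Full model recovers the true coefficient on honor_rank (~1.0).
full = ols(np.column_stack([honor_rank, education]), employability)

# Omitting education: honor_rank absorbs part of education's effect,
# so its coefficient is biased well above the true value of 1.0.
short = ols(honor_rank.reshape(-1, 1), employability)

print(f"honor_rank coefficient, full model:  {full[1]:.3f}")
print(f"honor_rank coefficient, short model: {short[1]:.3f}")
```

With these hypothetical parameters the short model's honor_rank coefficient lands near 2, roughly double the true effect, which is exactly the "insurance" argument the question makes for keeping education in.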
 
Several possibilities:

* Correlated predictors can play havoc with the estimates: coefficients can come out with the wrong sign (e.g., previous work indicates that a variable should contribute with a positive coefficient, but your model gives it a negative one).
* Correlated predictors can inflate the standard errors of the estimates.
* Even if the predictors are not correlated, we look for models that do a good job as efficiently as possible. A crude but widely used way to assess the "worth" of a regression model is its R² value: it is a mathematical fact that R² increases (or at least never decreases) any time a new predictor is introduced, regardless of whether that predictor is appropriate. We then have to decide whether the increase in R² is worth the added complexity of the model; if it is, keep the predictor, and if not, drop it.
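The R² point in the last bullet can be checked directly. A minimal numpy sketch with simulated data (variable names hypothetical): fit y on a genuine predictor, then refit with a pure-noise column added, and observe that R² does not go down even though the extra column explains nothing.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 1.0, n)  # y genuinely depends on x only

def r_squared(X, y):
    """R² of an OLS fit of y on X (intercept added automatically)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / tss

r2_base = r_squared(x.reshape(-1, 1), y)

# Add a predictor that is pure noise: R² still cannot decrease.
noise = rng.normal(0.0, 1.0, n)
r2_padded = r_squared(np.column_stack([x, noise]), y)

print(f"R² with x only:    {r2_base:.4f}")
print(f"R² with x + noise: {r2_padded:.4f}")
```

This is why raw R² alone is a poor model-selection criterion; adjusted R² or information criteria penalize the extra parameter instead of rewarding it.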
 
