Removing dummy variables from a model: singly or only as a group?

  • Context: Graduate
  • Thread starter: wvguy8258
  • Tags: Group, Model, Variables
Discussion Overview

The discussion revolves around the treatment of dummy variables in generalized linear models, specifically regarding whether to drop individual dummy variables or to remove them only as a group. The context involves a categorical variable representing land use/cover with four levels: forest, agriculture, grass, and wetlands.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant questions whether the differences between each category and the baseline (forest) are really more important than the differences among the other categories themselves.
  • Another participant suggests creating a 4x4 matrix of pairwise differences to assess significance before deciding on excluding any dummy variables.
  • Concerns are raised about the arbitrary nature of the baseline choice and its implications when dropping individual dummy variables, as it affects potential class aggregations.
  • There is a proposal to run multiple models with different baseline categories to observe variations in significance and parameter estimates.
  • One participant emphasizes the importance of considering joint tests for groups of dummy variables rather than focusing solely on individual coefficients.
  • Another participant notes that the categories do not have an intrinsic ordering, which complicates the interpretation of differences between them.

Areas of Agreement / Disagreement

Participants express differing views on the methodology for handling dummy variables, with no consensus reached on whether to drop individual variables or groups. The discussion remains unresolved regarding the best approach to model the categorical variable.

Contextual Notes

Participants mention the significance of baseline selection and its impact on model interpretation, as well as the potential for different statistical tests to yield varying insights based on the chosen approach.

wvguy8258
Hi,

I'm running a few generalized linear models. One predictor of interest is a categorical variable with 4 levels, which I have coded as 3 dummy variables, with one level serving as the baseline absorbed into the intercept (to avoid perfect multicollinearity, of course). I haven't found a good treatment of the following question: should you consider dropping an individual dummy variable from the model, or only drop the dummies as a whole group (all in or all out)?

The categorical variable here is land use/cover; the classes are forest, agriculture, grass, and wetlands, with forest as the category not represented by a dummy. If agriculture and grass are statistically significant but wetlands is not, then the effect of removing the wetlands dummy is to merge forest and wetlands into a single baseline category. This has some intuitive appeal, because the nonsignificant result suggests there may be no difference between forest and wetlands as predictors. So, in a sense, you are letting the model results inform how to modify the categorical variable from which the dummies are produced; in this case, aggregating forest and wetlands would be indicated. Am I missing something important here? Any related literature recommendations? Thanks, Seth
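The coding scheme described above, and the effect of dropping the wetlands dummy, can be sketched in a few lines. This is a minimal pure-Python illustration (function names and the collapse choice are mine, not from the thread): dropping a level's dummy is exactly equivalent to recoding that level as the baseline.

```python
# Dummy-coding a 4-level land-cover factor with "forest" as baseline,
# then collapsing "wetlands" into the baseline by dropping its dummy.
# Pure-Python sketch; helper names are hypothetical.

LEVELS = ["forest", "agriculture", "grass", "wetlands"]  # forest = baseline

def dummy_code(value, levels=LEVELS):
    """Indicator columns for every level except the baseline (levels[0])."""
    return {lvl: int(value == lvl) for lvl in levels[1:]}

def dummy_code_collapsed(value, levels=LEVELS, merge_into_baseline=("wetlands",)):
    """Same coding, but with some levels folded into the baseline --
    equivalent to dropping their dummies from the model."""
    keep = [lvl for lvl in levels[1:] if lvl not in merge_into_baseline]
    return {lvl: int(value == lvl) for lvl in keep}

print(dummy_code("wetlands"))            # {'agriculture': 0, 'grass': 0, 'wetlands': 1}
print(dummy_code_collapsed("wetlands"))  # {'agriculture': 0, 'grass': 0}
```

With the wetlands dummy dropped, a wetlands observation has all indicators zero, which is the same row of the design matrix as a forest observation, so the two classes share one coefficient (the intercept).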
 
You must be thinking that the differences {a, g, w} minus forest are more important than, say, the difference a − g. Any particular reason why?

Before excluding anything, I'd create a 4x4 matrix of all pairwise differences and see which are significant. Then you might consider joint F tests (e.g., are x and y jointly significant when the baseline is z?).
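The pairwise-difference matrix suggested above can be built directly from coefficients expressed relative to the baseline, since the difference between any two levels is just the difference of their coefficients (with the baseline's coefficient fixed at zero). A minimal sketch, with made-up coefficient values purely for illustration:

```python
# 4x4 matrix of pairwise differences between land-cover levels, computed
# from coefficients relative to the "forest" baseline (beta_forest = 0).
# The beta values are invented placeholders, not results from the thread.

betas = {"forest": 0.0, "agriculture": 0.8, "grass": 0.5, "wetlands": 0.1}

levels = list(betas)
diff = {a: {b: round(betas[a] - betas[b], 3) for b in levels} for a in levels}

# The diagonal is zero and diff[a][b] == -diff[b][a], so only the strictly
# upper triangle (6 of the 16 entries) carries distinct information.
print(diff["agriculture"]["grass"])  # 0.3
```

Testing which of those differences are significant would additionally need the covariance matrix of the estimates, since var(β̂ᵢ − β̂ⱼ) = var(β̂ᵢ) + var(β̂ⱼ) − 2·cov(β̂ᵢ, β̂ⱼ); the sketch only shows the point estimates.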
 
Hmm, I've never seen this suggested for dummy variables. Usually the choice of a baseline is considered arbitrary, or made for reasons like mine: forest is the most "natural" and common condition in this area, so it seems a natural baseline for comparison. I can see how the choice of baseline might become more important when you are considering dropping individual dummy variables, since the baseline then dictates the possible class aggregations that result from dropping variables.
 
EnumaElish said:
You must be thinking that the differences {a, g, w} minus forest are more important than, say, the difference a − g. Any particular reason why?

Before excluding anything, I'd create a 4x4 matrix of all pairwise differences and see which are significant. Then you might consider joint F tests (e.g., are x and y jointly significant when the baseline is z?).

So you are suggesting running the model 4 times, once per possible baseline category, and then seeing how significance and parameter estimates vary?
 
It's a 4x4 matrix, but it's symmetric and its diagonal is zero, so you need 3 models at most. What I'm trying to get at is: one, why worry about individual coefficients if the model is significant? Two, what makes w vs. f special, while ignoring the other differences? Three, why not think in terms of sets of dummies? E.g., why isn't the most important test a & g jointly against f & w, or a & f vs. g & w (using an F test in each case)? I'm throwing these out because, unlike, say, a model measuring the separate effect of each level of education (primary/middle/high school, college, beyond) on income, your categories have no intrinsic ordering, unless maybe in terms of land vs. human input (wetlands = pristine, grassland = minimal labor, forest = moderate labor, agriculture = maximum labor).
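The joint F test mentioned above compares a restricted model (some dummies removed or forced equal) against the full model. A minimal sketch of the classic residual-sum-of-squares form of the statistic, with made-up numbers (for a GLM you would more often use a likelihood-ratio or Wald test, but the structure is analogous):

```python
# Joint F test of q linear restrictions: compare the restricted model's
# residual sum of squares (RSS) against the full model's.
# All numeric inputs below are invented placeholders for illustration.

def joint_f_stat(rss_restricted, rss_full, q, n, k_full):
    """F = ((RSS_r - RSS_u) / q) / (RSS_u / (n - k_full)),
    with q restrictions, n observations, and k_full parameters
    (including the intercept) in the unrestricted model."""
    return ((rss_restricted - rss_full) / q) / (rss_full / (n - k_full))

# Example: testing the agriculture & grass dummies jointly (q = 2) in a
# model with intercept + 3 dummies (k_full = 4) on n = 100 observations.
F = joint_f_stat(rss_restricted=130.0, rss_full=110.0, q=2, n=100, k_full=4)
```

Under the null, F follows an F(q, n − k_full) distribution; a large value says the group of dummies explains more variation than chance would, regardless of whether any single coefficient looks significant on its own.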
 