Linear Model with independent categorical variable

Discussion Overview

The discussion revolves around modeling blood pressure (BP) as a function of age and gender, where gender is a categorical variable with two levels. Participants explore the implications of including gender in linear regression models, comparing a no-interaction model with separate models for each gender.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant proposes two linear regression models, one that includes gender and one that does not, and asks whether the model with gender is equivalent to fitting separate regressions for each gender.
  • Another participant argues that the group-wise fitting allows for different intercepts and slopes, while the no-interaction model permits different intercepts but not slopes.
  • Some participants suggest that the no-interaction model may be preferable to separate models for each gender due to efficiency in estimating parameters with more data.
  • There is a suggestion that if an interaction between age and gender is suspected, an interaction model should be used to better account for this relationship.
  • One participant expresses confusion about the order of model equations and suggests that clarity in presentation is important to avoid misunderstanding.
  • A further suggestion is made to separate the genders into distinct datasets for regression analysis if sufficient data is available for each group.

Areas of Agreement / Disagreement

Participants generally disagree on the equivalence of the models discussed, with some advocating for the no-interaction model while others support the idea of separate models for each gender. The discussion remains unresolved regarding the best approach to modeling the data.

Contextual Notes

Participants note potential limitations in the models, such as the assumption of no interaction between age and gender and the implications of using different datasets for analysis.

fog37
Hello,

I have been pondering the following: we have data for blood pressure BP (the response variable) together with age and gender (a categorical variable with two levels). We can build two linear regression models: $$BP=b_0+b_1 age+b_2 gender$$ $$BP=b_0+b_1 age$$

The first model does not take gender into account and plots one single best-fit line, disregarding that gender may have an effect.
The 2nd model includes ##gender## and two scenarios are possible: assuming no interaction term, the categorical variable ##gender## may shift the best-fit regression line up or down depending on whether its value is ##1## or ##0## and on the sign of its corresponding coefficient. If the shift is very small, then ##gender## does not have an effect. But if the vertical shift of the best-fit line is meaningful, then ##gender## has an effect. That means that the ##BP## values for males and females form different clusters that would require two different best-fit lines (same slope, different intercept).
The 2nd model, by including ##gender##, takes care of that difference. Would the 2nd model be exactly equivalent to creating two separate linear regression models and best-fit lines, one for the male group and one for the female group, once we recognize that males and females form different clusters of points w.r.t. blood pressure BP?

Thank you!
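To make the "same slope, different intercept" picture concrete, here is the model that includes ##gender## written out for each level, assuming the 0/1 coding mentioned above:
$$gender=0:\quad BP=b_0+b_1\, age$$
$$gender=1:\quad BP=(b_0+b_2)+b_1\, age$$
Both lines share the slope ##b_1##; only the intercept changes, by ##b_2##.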
 
No, those would not be exactly equivalent. The group-wise fitting would allow different intercepts and different slopes for the two groups. The no-interaction model (your model 2) allows different intercepts but not different slopes for the two groups. Also, the standard error for the slope in the no-interaction model will be smaller (if there is indeed no significant interaction) because it is estimated with twice the data of either of the group-wise fits.
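A minimal sketch of this difference, using synthetic data rather than anything from the thread. The coefficients, sample sizes, noise level, and the helper name `ols` are all assumptions made for illustration; the fits use plain least squares via `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical numbers, not taken from the thread).
n = 200                                  # observations per gender
age = rng.uniform(20, 70, size=2 * n)
gender = np.repeat([0, 1], n)            # 0/1 dummy coding
# The true lines differ in both intercept and slope between the groups.
bp = 100 + 0.5 * age + gender * (5 + 0.3 * age) + rng.normal(0, 5, size=2 * n)


def ols(X, y):
    """Return least-squares coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


ones = np.ones_like(age)

# No-interaction model: BP = b0 + b1*age + b2*gender (one shared slope b1).
b_pooled = ols(np.column_stack([ones, age, gender]), bp)

# Group-wise fits: a separate intercept and slope for each gender.
g0 = gender == 0
g1 = gender == 1
b_g0 = ols(np.column_stack([ones[g0], age[g0]]), bp[g0])
b_g1 = ols(np.column_stack([ones[g1], age[g1]]), bp[g1])

print("shared slope (no-interaction):", b_pooled[1])
print("group-wise slopes:", b_g0[1], b_g1[1])
```

Because the synthetic groups were generated with different slopes, the group-wise fits recover two distinct slopes, while the no-interaction model is forced to report a single compromise slope that lies between them.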
 
Dale said:
No, those would not be exactly equivalent. The group-wise fitting would allow different intercepts and different slopes for the two groups. The no-interaction model (your model 2) allows different intercepts but not different slopes for the two groups. Also, the standard error for the slope in the no-interaction model will be smaller (if there is indeed no significant interaction) because it is estimated with twice the data of either of the group-wise fits.
Thanks. I see. So the no-interaction model 2 would be a better model than creating two separate models, one for each group. Thanks for confirming.
 
fog37 said:
Thanks. I see. So the no-interaction model 2 would be a better model than creating two separate models, one for each group. Thanks for confirming.
Yes. And if you think that there may be an interaction then I would use an interaction model instead of group-wise fits. It is a lot easier to control for multiple comparisons that way.
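As a rough illustration of why a single interaction model can stand in for group-wise fits: with 0/1 coding, the model ##BP=b_0+b_1 age+b_2 gender+b_3 (age)\times(gender)## gives the same fitted lines as two separate regressions, but in one fit. The sketch below checks this on synthetic data; the data-generating numbers and the helper name `ols` are assumptions, not something from the thread.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical numbers, for illustration only).
n = 150
age = rng.uniform(20, 70, size=2 * n)
gender = np.repeat([0, 1], n)            # 0/1 dummy coding
bp = 100 + 0.5 * age + gender * (5 + 0.3 * age) + rng.normal(0, 5, size=2 * n)


def ols(X, y):
    """Return least-squares coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


ones = np.ones_like(age)

# Interaction model: BP = b0 + b1*age + b2*gender + b3*age*gender.
b0, b1, b2, b3 = ols(np.column_stack([ones, age, gender, age * gender]), bp)

# Group-wise fits for comparison.
m = gender == 1
c0 = ols(np.column_stack([ones[~m], age[~m]]), bp[~m])   # gender = 0
c1 = ols(np.column_stack([ones[m], age[m]]), bp[m])      # gender = 1

# gender = 0 line is (b0, b1); gender = 1 line is (b0 + b2, b1 + b3).
same_g0 = np.allclose([b0, b1], c0)
same_g1 = np.allclose([b0 + b2, b1 + b3], c1)
print(same_g0, same_g1)
```

The point estimates should agree up to floating-point rounding. What differs is the inference: the interaction model pools the residual variance across both groups and gives a direct estimate of the slope difference ##b_3##, which is what makes a single test of the interaction possible.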
 
Dale said:
Yes. And if you think that there may be an interaction then I would use an interaction model instead of group-wise fits. It is a lot easier to control for multiple comparisons that way.
We can suspect the interaction term between ##age## and ##gender## but the proof would be to see that model 2 generates best-fit lines with different slopes for different values of the ##gender## variable. Once we see that, we should include the interaction term ##(age)\times(gender)##
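One caveat worth spelling out: a model without the interaction term fits a single shared slope by construction, so it cannot by itself show different slopes for the two genders. A more direct check is to fit the model that includes ##(age)\times(gender)## and look at the estimated coefficient relative to its standard error. The sketch below does this with the textbook OLS formulas on synthetic data; all numbers are assumptions for illustration, and in practice a statistics package would also report a p-value.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (hypothetical numbers) with a true slope difference of 0.3.
n = 150
age = rng.uniform(20, 70, size=2 * n)
gender = np.repeat([0, 1], n)            # 0/1 dummy coding
bp = 100 + 0.5 * age + gender * (5 + 0.3 * age) + rng.normal(0, 5, size=2 * n)

# Design matrix for BP = b0 + b1*age + b2*gender + b3*age*gender.
X = np.column_stack([np.ones_like(age), age, gender, age * gender])
beta, *_ = np.linalg.lstsq(X, bp, rcond=None)

# Classical OLS standard errors: sigma^2 * (X'X)^{-1}.
resid = bp - X @ beta
dof = X.shape[0] - X.shape[1]            # observations minus parameters
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))

# t-statistic for the interaction coefficient b3.
t_interaction = beta[3] / se[3]
print("b3 estimate:", beta[3], "t-statistic:", t_interaction)
```

A t-statistic far from zero suggests the slopes really do differ between the groups; a value near zero suggests the simpler no-interaction model may be adequate.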
 
fog37 said:
We can build two linear regression models: $$BP=b_0+b_1 age+b_2 gender$$ $$BP=b_0+b_1 age$$

The first model does not take gender into account and plots one single best-fit line disregarding that gender may have an effect.
The 2nd model includes ##gender## and two scenarios are possible: assuming no interaction term, the categorical variable ##gender## may shift the best-fit regression line up or down depending on whether its value is ##1## or ##0## and on the sign of its corresponding coefficient. If the shift is very small, then ##gender## does not have an effect.
You should have your model equations and your description in the same order so there is no confusion about which model is "first" and which is "second". It looks like your model equations are in reverse order. Otherwise, I disagree with practically everything you said about those two models.
A third option is to separate the genders into two distinct data sets and do separate regressions on each one. It is not clear to me if that is what you had in mind for the model that does not include a "gender" factor. I recommend this approach if you have enough data for each gender to get adequate parameter estimates for each.
 
