Different results for factor vs continuous

  • Context: Graduate
  • Thread starter: FallenApple
  • Tags: Continuous

Discussion Overview

The discussion revolves around the differences in results obtained from two statistical models analyzing the interaction between age and treatment type on a response variable. One model treats age as a continuous variable, while the other categorizes age into discrete groups. Participants explore the implications of these modeling choices on the significance of interaction effects, particularly in the context of a dataset involving patients aged 20 to 90.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant notes that the continuous model shows age as highly significant, suggesting that slight deviations in age have a large effect on treatment outcomes.
  • In contrast, when age is treated as a factor with groups (20-40, 40-60, etc.), the interaction effect is not significant, prompting questions about the underlying reasons for this discrepancy.
  • Some participants express skepticism about the large effect observed in the continuous model, questioning whether it is visually apparent in the data plots.
  • Concerns are raised about the number of groups into which age is factored, with suggestions that too many groups could lead to reduced significance due to increased degrees of freedom.
  • One participant highlights a specific treatment-age combination that appears significantly different from others, suggesting it may be a small-sample outlier.
  • There is discussion about the potential misleading influence of extreme results from specific combinations on overall conclusions, especially given the rarity of the response variable.
  • Participants discuss the importance of visualizing the raw data to assess the interaction effect, rather than relying solely on regression outputs.
  • Questions are raised about the need for balanced samples and the implications of having a small number of data points in certain combinations.

Areas of Agreement / Disagreement

Participants express differing views on the significance of the interaction effects in the continuous versus factor models. There is no consensus on the reasons for the observed differences, and the discussion remains unresolved regarding the implications of modeling choices and data characteristics.

Contextual Notes

Limitations include potential issues with sample size, the definition of age factor levels, and the influence of outliers on statistical significance. The discussion also highlights the complexity of interpreting interaction effects in the context of rare events.

FallenApple
So, I'm fitting an interaction model: response vs. treatment_type interacted with age, plus controls for confounders. Age is continuous, with patients ranging from 20 to 90 years old.

So I have two models:

y = age + treatment_type + age*treatment_type

y = factor(age) + treatment_type + factor(age)*treatment_type

Basically, what I got was that age is highly significant in the continuous model. For slight deviations in age, there is a huge effect on the treatment effect.

However, when age is factorized into groups, there is barely any interaction effect (p-value not significant). Why? What could be the reason for this? Is the continuous model so significant only because patients simply differ from each other so much that age seems to have an effect when it really doesn't?

After all, it has no effect when looking at groups. But somehow, within the groups, slight deviations give a large effect.
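
For reference, the two specifications above can be written out on the link scale. This is only a sketch under assumptions not stated in the opening post: a log link (the usual choice for the negative binomial count model mentioned later in the thread), treatment_type with T levels, age cut into K groups, and the control variables omitted for brevity. The symbols \beta, \alpha, \gamma, and \delta are illustrative names.

```latex
% Model 1: age continuous, one extra age slope per non-reference treatment
\log \mu_i = \beta_0 + \beta_1\,\mathrm{age}_i
  + \sum_{t=2}^{T} \gamma_t\,\mathbf{1}[\mathrm{trt}_i = t]
  + \sum_{t=2}^{T} \delta_t\,\mathrm{age}_i\,\mathbf{1}[\mathrm{trt}_i = t]

% Model 2: age as a K-level factor, one contrast per (age group, treatment) cell
\log \mu_i = \beta_0
  + \sum_{k=2}^{K} \alpha_k\,\mathbf{1}[\mathrm{grp}_i = k]
  + \sum_{t=2}^{T} \gamma_t\,\mathbf{1}[\mathrm{trt}_i = t]
  + \sum_{k=2}^{K}\sum_{t=2}^{T} \delta_{kt}\,\mathbf{1}[\mathrm{grp}_i = k]\,\mathbf{1}[\mathrm{trt}_i = t]
```

In the first model the interaction is constrained to be linear in age on the log scale. In the second, every age group gets its own treatment contrast and no ordering of the groups is imposed, so the two interaction tests are not testing the same hypothesis.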
 
Just to clarify, how are your age factor levels defined?
 
FallenApple said:
For slight deviations in age, there is a huge effect on the treatment effect
That seems suspicious on its own. If an effect is large then you should clearly see it when you plot the data even without doing statistics. Is that the case?

How do your regression diagnostic plots look? Do you have some high leverage or otherwise suspicious points?

FallenApple said:
However, when age is factorized into groups, there is barely any interaction effect (p-value not significant).
How many groups have you factored age into? If you have factored it into many groups then you will have a model with a large number of degrees of freedom. A good statistics package will take that into account and reduce the significance correspondingly.
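
To make the degrees-of-freedom point concrete, here is a minimal sketch that counts the interaction coefficients each specification spends, assuming standard dummy (treatment) coding. The function name `interaction_df` and the choice of 3 treatments are hypothetical; the thread does not say how many treatments there are.

```python
from typing import Optional

def interaction_df(n_treatments: int, n_age_groups: Optional[int] = None) -> int:
    """Number of interaction coefficients under dummy coding.

    n_age_groups=None means age enters as one continuous column;
    otherwise age is a factor with n_age_groups levels.
    """
    age_columns = 1 if n_age_groups is None else n_age_groups - 1
    return (n_treatments - 1) * age_columns

# Hypothetical setting: 3 treatments
for label, k in [("continuous age", None), ("4 age groups", 4), ("8 age groups", 8)]:
    print(f"{label}: {interaction_df(3, k)} interaction coefficients")
```

With few observations in some cells, each extra coefficient is estimated from very little data, which is one reason a joint test on the factor interaction can lose power.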
 
FactChecker said:
Just to clarify, how are your age factor levels defined?
I've split it up into 4 sections. So basically 20-40, 40-60 etc.
 
Dale said:
That seems suspicious on its own. If an effect is large then you should clearly see it when you plot the data even without doing statistics. Is that the case?

How do your regression diagnostic plots look? Do you have some high leverage or otherwise suspicious points?
Many. But I can't reject those because they occur due to some systematic process. I've accounted for that by using a negative binomial link.
Dale said:
How many groups have you factored age into? If you have factored it into many groups then you will have a model with a large number of degrees of freedom. A good statistics package will take that into account and reduce the significance correspondingly.

Just 4. But I've factored it again into many. And here's the plot.

ZIgzag_Plot.png
It seems that they are maybe canceling.
 
Looking at your data, it looks like there is only one (solid line treatment, age (18.9, 27.4]) combination that is significantly different from the others. (Are the different lines different treatments?) Is there much data in that combination category or could it be a small-sample outlier?

I recommend that you statistically analyse the one glaring (solid line treatment, age (18.9, 27.4]) combination as one step and then look at the others in a separate statistical analysis.
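
The question of how much data sits in each combination can be checked directly by counting observations per (treatment, age group) cell. This is a minimal sketch; the function name `sparse_cells`, the threshold `min_n=30`, and the toy rows are all hypothetical and not taken from the thread's data.

```python
from collections import Counter

def sparse_cells(rows, min_n=30):
    """Return (cell, count) pairs for cells with fewer than min_n rows.

    rows: iterable of (treatment, age_group) pairs, one per patient.
    Cells with zero observations never appear in rows and are not reported.
    """
    counts = Counter(rows)
    return sorted((cell, n) for cell, n in counts.items() if n < min_n)

# Toy data: one treatment/age combination is nearly empty
rows = (
    [("A", "young")] * 3
    + [("A", "old")] * 40
    + [("B", "young")] * 50
    + [("B", "old")] * 45
)
print(sparse_cells(rows))
```

Any cell flagged this way is a candidate for the small-sample outlier explanation and deserves a look before its estimate is trusted.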
 
FallenApple said:
And here's the plot
That doesn't look like it should be non-significant. How does the data itself look? Can you see the interaction in the raw data?
 
Dale said:
That doesn't look like it should be non-significant. How does the data itself look? Can you see the interaction in the raw data?
When I increase the number of partitions in the factor, it seems to follow the same trend (just a bunch of zigzags, with the solid one being the most prominent). I think that is why it's highly significant for continuous age: even one slight increment in age could send it in a certain direction.
 
FactChecker said:
Looking at your data, it looks like there is only one (solid line treatment, age (18.9, 27.4]) combination that is significantly different from the others. (Are the different lines different treatments?) Is there much data in that combination category or could it be a small-sample outlier?

I recommend that you statistically analyse the one glaring (solid line treatment, age (18.9, 27.4]) combination as one step and then look at the others in a separate statistical analysis.
So split the data into two different sets? It is a small sample, less than 10% of the data set. But there's only a small number of people under this treatment option in the first place, so every data point counts. The response is a count of relatively rare events (negative side effects), so most of the responses would be zero anyway.
 
  • #10
FallenApple said:
So split the data into two different sets? It is a small sample, less than 10% of the data set. But there's only a small number of people under this treatment option in the first place, so every data point counts. The response is a count of relatively rare events (negative side effects), so most of the responses would be zero anyway.
From the looks of the data, I think that it would be very misleading to allow the extreme result from one combination of (treatment, age) to influence your conclusions about the other combinations. In fact, the other treatments show, if anything, a slight bit of the opposite trend. If you do not address that combination separately, I don't think your conclusions will have any merit.
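
A toy illustration of how one extreme group can drive a continuous-age slope follows. It uses ordinary least squares on made-up numbers, not the negative binomial model or the data from this thread, and the helper `ols_slope` is a hypothetical name. The only purpose is to show the mechanism.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up event counts for one treatment: only the youngest patients have many events
ages = [22, 24, 26, 45, 50, 55, 65, 70, 75, 85]
counts = [6, 5, 7, 0, 1, 0, 1, 0, 0, 1]

slope_all = ols_slope(ages, counts)

# Refit without the youngest group (ages up to 27.4, as in the plot's first bin)
rest = [(a, c) for a, c in zip(ages, counts) if a > 27.4]
slope_rest = ols_slope([a for a, _ in rest], [c for _, c in rest])

print(f"slope with all patients:       {slope_all:.4f}")
print(f"slope without youngest group:  {slope_rest:.4f}")
```

In this toy case the steep age trend comes almost entirely from the three youngest patients; without them the slope is close to zero and slightly positive.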
 
  • #11
FallenApple said:
So split the data into two different sets? It is a small sample, less than 10% of the data set. But there's only a small number of people under this treatment option in the first place, so every data point counts. The response is a count of relatively rare events (negative side effects), so most of the responses would be zero anyway.
Then it doesn't sound like you will have enough data points to justify a large number of degrees of freedom. That is probably driving the lack of significance somewhat.

Also, do you see this interaction when you plot the data itself (not the fit)? I think I have asked this three times now.
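
One way to look at the data itself rather than a fit is to tabulate, for each (treatment, age group) cell, the sample size, the raw mean count, and the share of zeros. This is a sketch with a hypothetical function name `raw_cell_summary` and toy records; it involves no regression at all.

```python
from collections import defaultdict

def raw_cell_summary(records):
    """Summarize raw counts per (treatment, age_group) cell.

    records: iterable of (treatment, age_group, event_count).
    Returns {cell: (n, mean_count, share_of_zeros)}.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # n, sum of counts, number of zeros
    for treatment, group, y in records:
        cell = totals[(treatment, group)]
        cell[0] += 1
        cell[1] += y
        cell[2] += (y == 0)
    return {k: (n, s / n, z / n) for k, (n, s, z) in totals.items()}

# Toy records, not the thread's data
records = [
    ("A", "young", 0), ("A", "young", 4),
    ("A", "old", 0), ("A", "old", 0),
    ("B", "young", 0), ("B", "young", 1),
    ("B", "old", 1), ("B", "old", 0),
]
summary = raw_cell_summary(records)
for cell, (n, mean, zeros) in sorted(summary.items()):
    print(cell, f"n={n} mean={mean:.2f} zeros={zeros:.0%}")
```

Reporting n next to each mean matters here: with rare events, a cell mean built on a handful of patients can be dominated by a single non-zero count.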
 
  • #12
Dale said:
Then it doesn't sound like you will have enough data points to justify a large number of degrees of freedom. That is probably driving the lack of significance somewhat.

Also, do you see this interaction when you plot the data itself (not the fit)? I think I have asked this three times now.

I see. That makes sense. So generally, would I need to have the samples balanced?

I thought of one thing: it might not work because there is such a low number within that combination, in the tens compared to over a thousand total.

But if age is continuous, then there are no per-combination samples; it's just the whole data set. Is this the correct way to see it?

I'm not sure what you mean. The pattern that I plotted was based on the data. It wasn't derived from a regression.
 
