Is Occam's Razor Applicable to Modeling Data?

  • Thread starter EngWiPy
  • Tags: Correlation
In summary, the thread concludes that males tended not to survive while first class passengers tended to survive, but the gender-class interaction term is small: being a first class male adds little beyond the two main effects, so the simpler main-effects model is preferred.
  • #1
EngWiPy
Hello,

I have two predictor variables, gender and class, and one response variable, survived. After converting the categorical variables into numeric variables using dummy coding, I found the correlation between gender_male and survived to be -0.54, and between class_1 and survived to be 0.29. Clearly, there is a negative relationship between being male and surviving, and a positive relationship between being a class_1 passenger and surviving. However, I also want to study the combined effect of being male and in class_1 on the outcome. To this end, I created a new variable that is the logical AND of gender_male and class_1, and found its correlation with survived. It was -0.012. Is what I did correct? If so, how do I interpret the result?
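Concretely, what I did looks like this minimal sketch (toy numbers rather than the real data; assume a pandas DataFrame with the dummy columns gender_male, class_1, and survived):

Python:
import pandas as pd

# Toy dummy-coded data: 1 = male / first class / survived, 0 otherwise
df = pd.DataFrame({
    "gender_male": [1, 0, 1, 0, 1, 0, 1, 0],
    "class_1":     [1, 1, 0, 0, 1, 0, 0, 1],
    "survived":    [0, 1, 0, 1, 1, 1, 0, 1],
})

# Combined feature: logical AND of the two dummies, i.e. 1 only
# when the passenger is both male and in first class
df["male_class_1"] = df["gender_male"] * df["class_1"]

# Pearson correlation of every column with the outcome
print(df.corr()["survived"])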

Thanks
 
  • #2
S_David said:
However, I also want to study the combined effect of being male and in class_1 on the outcome. ... If so, how do I interpret the result?
What you are looking for in your model is called an interaction term. Basically, you want to see if the change in one effect depends on the value of the other effect.

So in your example, males tend to not survive (perhaps due to heroic “women and children first” decisions) and first class tended to survive (perhaps due to proximity to exits). If these effects had no interaction, then you would expect first class males to survive at (0.29 - 0.54) = -0.25. This is less than females of any class but more than non-first-class males (perhaps first class males help all females, just like non-first-class males, but after the females are out they still have the exit-proximity advantage).

An interaction effect would mean that there is something about the gender relationship in first class that is different than the gender relationship in non-first class. Perhaps first class males are more prone to acts of heroism, which results in their survival being below the expected -0.25, or perhaps they are generally older and similar heroism gets them killed more often.
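Spelling the no-interaction arithmetic out explicitly (just the quoted correlations, not a fitted model):

Python:
# Additive (no-interaction) expectation from the two main effects
r_male = -0.54    # correlation of gender_male with survived
r_class1 = 0.29   # correlation of class_1 with survived

# Under the additive assumption the two effects simply sum
print(round(r_male + r_class1, 2))  # -0.25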
 
  • #3
Dale said:
What you are looking for in your model is called an interaction term. ... If these effects had no interaction, then you would expect first class males to survive at (0.29 - 0.54) = -0.25.

So, if there is no interaction, first class males survive at -0.25, but I got a result of -0.012. What does this mean? Can we say that being a first class male doesn't affect the outcome significantly? If so, it doesn't affect the outcome relative to what: being male, or being a first class passenger?
 
  • #4
S_David said:
Can we say that being a first class male doesn't affect the outcome significantly?
Well, whether it is significant or not depends on the p-value of the interaction term. In any case, even if it is significant, it seems to be a small effect. Depending on the number of data points, even a very small difference can be significant, so it may not be important even if it is significant.

S_David said:
I got a result of -0.012. What does this mean?
It means that first class males survive at (-0.25) + (-0.012) = -0.262 instead of the expected -0.25, so slightly less than expected.
 
  • Like
Likes EngWiPy
  • #5
I still don't get it completely. Let us take the females as another example, to hopefully get a clearer understanding:

females survived at 0.54
first class survived at 0.29, and first class females at 0.41
second class survived at 0.093, and second class females at 0.34
third class survived at -0.32, and third class females at 0.10

How should these results be interpreted, and are we getting valuable information here? I think we are, but I am not sure how to express it correctly. For example, if there is no interaction, as you mentioned, the first class female survives at 0.54 + 0.29 = 0.83. Right? But I got 0.41. What does this mean? There is an interaction, correct? But how do I understand and interpret this interaction?
 
  • #6
It is a little hard for me to know since I am not familiar with the data, and you aren’t describing it exhaustively. So I will be a little general.

You have a model like ##S_{ij}=\mu+G_i+C_j+GC_{ij}## where ##S## is some measure of survival, ##G## is gender and ##C## is class. So ##j\in\{1,2,3\}## and ##i\in\{F,M\}##.

There will be some reference class, let’s say first class, and some reference gender, let’s say female. So automatically ##G_F=C_1=GC_{Fj}=GC_{i1}=0##.

Then when you do your fit, you will get an intercept term ##\mu##, a gender term ##G_M##, two class terms ##C_2## and ##C_3##, and two interaction terms ##GC_{M2}## and ##GC_{M3}##. So the survival measure of first class females would be ##S_{F1}=\mu##, the survival of first class males would be ##S_{M1}=\mu+G_M##, the survival of third class females would be ##S_{F3}=\mu+C_3##, and the survival of second class males would be ##S_{M2}=\mu+G_M+C_2+GC_{M2}##.
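I don’t really use Python myself, but I believe the statsmodels formula interface fits this model much like R’s lm; a sketch under that assumption (and assuming the seaborn Titanic sample with columns survived, sex, and pclass):

Python:
import seaborn as sns
import statsmodels.formula.api as smf

# Seaborn's Titanic sample: survived (0/1), sex (female/male), pclass (1/2/3)
df = sns.load_dataset("titanic")

# Full factorial model S ~ mu + G + C + G:C with treatment (dummy) coding;
# the reference levels (female, first class) are absorbed into the intercept
model = smf.ols("survived ~ C(sex) * C(pclass)", data=df).fit()

# Coefficients named like C(sex)[T.male], C(pclass)[T.2], and the interactions
print(model.params)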
 
  • Like
Likes EngWiPy
  • #7
S_David said:
females survived at 0.54
first class survived at 0.29, and first class females at 0.41
second class survived at 0.093, and second class females at 0.34
third class survived at -0.32, and third class females at 0.10
I just noticed there are too many coefficients here. If you have three classes and two genders, then the full model has 6 terms; you have 7. There should be one intercept, one gender, two classes, and two interactions.

What software are you using? It shouldn’t be giving you this output. Can you post your code and output?
 
  • Like
Likes EngWiPy
  • #8
Thanks for your replies. Let me start over and explain the issue in more detail: I have a dataset related to the Titanic disaster with a number of predictors and one response variable. At the moment I am concerned with the categorical predictors Sex/Gender and Class, and their relationships with the response variable Survived. The Sex predictor takes on two values, Female and Male, while Class takes on three values: first, second, and third. The response variable Survived takes on two values: 0 for didn't survive, and 1 for survived.

What I did first was to convert these categorical predictors into numeric ones using dummy variable coding. After this conversion I had 5 predictors, namely: Sex_Male, Sex_Female, Class_1, Class_2, Class_3. From these, I can find the correlation between each of the 5 predictors and Survived using Python as:

Python:
df.corr()

where df is a dataframe that has 6 columns (the numeric predictors mentioned above, and the response variable Survived).

After that, I asked what the combined effect of Sex/Class on the outcome is. To this end, I created 6 new columns that represent all the possible combinations of these two predictors, namely: Male_Class_1, Male_Class_2, ..., Female_Class_3. Each new feature was obtained by multiplying the columns the new feature represents. For example, for Male_Class_1, I multiplied the columns Sex_Male and Class_1, since this gives 1 only if both Sex_Male and Class_1 are 1, and so on for the other new features. Then I found the correlation matrix again with these new features (and the response variable). From here I got the results I mentioned before, but at this point I didn't know how to interpret them.
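Concretely, the construction was something like the following sketch (using the seaborn Titanic sample here; my actual column names differ slightly):

Python:
import pandas as pd
import seaborn as sns

df = sns.load_dataset("titanic")[["survived", "sex", "pclass"]]

# Dummy-code both categorical predictors: one 0/1 column per level
dummies = pd.get_dummies(df, columns=["sex", "pclass"], dtype=int)

# All six Sex-by-Class combination columns, built by multiplication
for s in ("male", "female"):
    for c in (1, 2, 3):
        dummies[f"{s}_class_{c}"] = dummies[f"sex_{s}"] * dummies[f"pclass_{c}"]

# Correlation of every feature with the outcome
print(dummies.corr()["survived"].round(3))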

For example, Class_2 has a correlation of 0.093 with the response variable, which implies a very weak relationship, but Female_Class_2 has a correlation of 0.34 with the response variable. How does Class_2 as a whole have a very low correlation with the outcome, yet a subset of that class has a higher correlation? Also, what does it mean if Sex_Male has a correlation of -0.54, but Male_Class_1 has a correlation of -0.012 with the outcome? Do these imply that Sex is a stronger predictor of the outcome than Class?

I am not yet at the stage of fitting the model as you presented it mathematically (this is a classification problem, though, which suggests the use of Logistic Regression rather than Linear Regression). I may later use the fitted model with the p-values to confirm some tentative conclusions from this exploration process.
 
  • #9
S_David said:
I created 6 new columns that represent all the possible combinations of these two predictors, namely: Male_Class_1, Male_Class_2, ..., Female_Class_3.
OK, so how did you get 7 coefficients out?

S_David said:
which suggests the use of Logistic Regression rather than Linear Regression
I would also recommend logistic regression, but your question on linear regression is still fine even in this context.
 
  • #10
I have the 6 combinations, but then I also have the 5 individual features, so in total I have 11 columns/features. In post #7 you asked why I have 7: I wrote 7 because Sex_Female is also a feature in its own right. But I actually have 11 features now after adding the combinations, since I didn't remove the individual features.
 
  • #11
Ok, that is way too many degrees of freedom. I am surprised (and concerned) that the Python package did this without throwing an error. Your matrix should be singular.

You should make two choices. First, what would you consider your reference group, against which you would compare all other groups. Second, do you want to do a traditional model where you have interaction effects and main effects, or do you just want to look at all the interaction effect combinations.
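You can check the singularity directly. I don’t use Python, but a sketch along these lines (assuming the seaborn Titanic sample and the 11-column construction you described) should show it:

Python:
import numpy as np
import pandas as pd
import seaborn as sns

df = sns.load_dataset("titanic")[["sex", "pclass"]]
X = pd.get_dummies(df, columns=["sex", "pclass"], dtype=int)
for s in ("male", "female"):
    for c in (1, 2, 3):
        X[f"{s}_class_{c}"] = X[f"sex_{s}"] * X[f"pclass_{c}"]

# 11 columns, but rank only 6: e.g. sex_male equals the sum of the three
# male_class_* columns, so any design matrix built from all of them is singular
print(X.shape[1], np.linalg.matrix_rank(X.to_numpy()))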
 
  • #12
Dale said:
Ok, that is way too many degrees of freedom. ... First, what would you consider your reference group, against which you would compare all other groups.

I was trying to see if I should include interaction effects or not based on the correlations with the outcome. If I take Sex_Female as the reference for gender and Class_1 as the reference for class, then I would end up with only two new features: Male_Class_2 and Male_Class_3. Right? In this case, what do the correlations of Male_Class_2 and Male_Class_3 with the response variable say about the correlations of the other combinations with the response variable?
 
  • #13
S_David said:
I was trying to see if I should include interaction effects or not based on the correlations with the outcome.
Ok, so for that you will want to take the traditional approach. In your design matrix, one column will be all 1's; that is your column for the intercept, which will be first class females. The next column will be for Male, the column after for 2nd class, then 3rd class, then 2nd class Male, then 3rd class Male. This is a total of 6 columns and is the standard full factorial model. If you were doing this in R, you would specify it as S ~ G + C + G*C.
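In Python, I believe the patsy library (which statsmodels uses for formulas) can build that same design matrix; a sketch, with the Treatment() arguments pinning the reference levels explicitly:

Python:
import seaborn as sns
from patsy import dmatrix

df = sns.load_dataset("titanic")

# Full factorial design matrix: Intercept (first class females), male,
# 2nd class, 3rd class, male:2nd, male:3rd -- six columns in total
X = dmatrix("C(sex, Treatment('female')) * C(pclass, Treatment(1))", df)
print(X.design_info.column_names)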

You will get a table out which should have 6 rows and several columns labeled things like Parameter, Estimate, Std. Err, p-value, ... The parameter column should have entries like Intercept, G_M, C_2, G_M:C_2. The first thing to do is look at the p-values and estimates for your interaction terms. If the p-values are large or the estimates are small, then drop the interaction terms and run a smaller model with the main effects only.

Now, if they are significant, then you have to understand what they mean. Since your reference is first class female, the average of the first class females is considered the overall “expected value”; that is why it is the intercept of the model. Then each of the other terms in the model represents a deviation from the expected. So the M term shows how first class males differ from first class females; the expected survival of first class males is therefore the intercept estimate plus the M estimate. Similarly, the 2 and 3 terms show how second and third class females differ from first class females; their expected survival is the intercept plus the 2 and 3 terms respectively.

So what about the groups that differ from the reference group by both gender and class? If your interaction terms were dropped above, then their expected survival is just the sum of the intercept and both main effects, so second class males would be intercept plus the Male term plus the 2 term. If the interaction was not dropped, then the group cannot simply be described by an independent sum of those two effects; there is a further difference from what would be expected, so the second class male expected survival would be the intercept plus the Male term plus the 2 term plus the Male:2 term.

You may find it helpful to expand all of those six groups into straight gender:class survival rates, without looking at them as effects and effects upon effects.
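For example, the raw cell rates are one line with pandas (a sketch, again assuming the seaborn Titanic sample):

Python:
import seaborn as sns

df = sns.load_dataset("titanic")

# Straight survival rate for each gender-by-class cell
print(df.groupby(["sex", "pclass"])["survived"].mean().unstack())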
 
  • #14
Thanks for the detailed reply. I tested the accuracy (the score, or the coefficient of determination ##R^2##) of this model, and it didn't change much compared to the accuracy of a model that contained the main effects only. I will check the estimates and the p-values for the interaction terms to double-check this (I think the Python package I am using doesn't give the p-values automatically, and they have to be coded), but I think the estimates will most likely be close to zero.
 
  • #15
I don’t know much about Python statistics packages. I use R for statistics. In R, there are several easy ways to evaluate a model. One is using “drop1” which drops all single terms one at a time. Another is using the AIC or the BIC. A third is to run two models and just use ANOVA to compare them.

If Python enables any of those, then I can help you understand what they mean.
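I don’t know the Python side well, but I understand the statsmodels package exposes at least the AIC/BIC and the two-model comparison; a sketch under that assumption (seaborn Titanic sample, logistic models since your outcome is binary):

Python:
import scipy.stats as stats
import seaborn as sns
import statsmodels.formula.api as smf

df = sns.load_dataset("titanic")

# Main-effects-only model vs. full factorial model
small = smf.logit("survived ~ C(sex) + C(pclass)", data=df).fit(disp=0)
full = smf.logit("survived ~ C(sex) * C(pclass)", data=df).fit(disp=0)

print(small.aic, full.aic)  # lower AIC is the preferred model

# Likelihood-ratio test, analogous to comparing nested models with anova() in R
lr = 2 * (full.llf - small.llf)
print(stats.chi2.sf(lr, full.df_model - small.df_model))  # joint p-value for the interactions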
 
  • #16
Dale said:
I don’t know much about Python statistics packages. I use R for statistics. ... If Python enables any of those, then I can help you understand what they mean.

I think Python evaluates models mainly using the score and MSE metrics, and for classification problems it uses a classification report that includes the precision, recall, f1-score, and support. Since the accuracy hasn't improved with the interaction terms, can't we conclude (at least tentatively) that they have little to no effect, just from this metric (the score)?
 
  • #17
Yes. The burden of proof for simplifying a model is small. Use any excuse to simplify; it is only a more complicated model that needs strong justification.

That is my personal prejudice, and it is common amongst scientists, but not universal.
 
  • #18
Dale is describing something sometimes called Ockham's Razor: https://en.wikipedia.org/wiki/Occam's_razor It is an important thing to be aware of when you are trying to model data.

simpler theories [hypotheses] are preferable to more complex ones because they are more testable

I added the word "hypotheses".
 

1. What is correlation and how is it calculated?

Correlation is a statistical measure that quantifies the relationship between two variables. It is commonly measured with a correlation coefficient, such as Pearson's r, which ranges from -1 to +1. A positive correlation means that the variables move in the same direction, while a negative correlation means they move in opposite directions.
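For reference, Pearson's ##r## for paired samples ##(x_i, y_i)## with means ##\bar{x}, \bar{y}## is ##r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}##.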

2. What does a correlation coefficient of 0 mean?

A correlation coefficient of 0 means that there is no linear relationship between the two variables. However, it is important to note that there could still be a non-linear relationship or other types of relationships between the variables.

3. How do I interpret the strength of a correlation?

The absolute value of the correlation coefficient indicates the strength of the relationship between the variables, and the sign indicates its direction. As a rough rule of thumb, an absolute value of 1 indicates a perfect correlation, around 0.5 a moderate correlation, and around 0.2 a weak correlation, whether positive or negative.

4. Can correlation imply causation?

No, correlation does not imply causation. Just because two variables are correlated, it does not necessarily mean that one variable causes the other. It is important to consider other factors and conduct further research to establish causation.

5. What are some limitations of interpreting correlation?

Correlation does not provide information about the direction of the relationship or the cause and effect between variables. It also does not account for other variables that may influence the relationship. Additionally, correlation does not indicate the strength of the relationship in the entire population, only in the sample being studied.
