Is Occam's Razor Applicable to Modeling Data?

  • Thread starter EngWiPy
  • Tags: Correlation
In summary, the thread concludes that males tended not to survive while first class passengers tended to survive, but the gender-class interaction term is small: being a first class male adds little beyond the two main effects, so the simpler main-effects model is preferred.
  • #1
EngWiPy
Hello,

I have two predictor variables, gender and class, and one response variable, survived. After converting the categorical variables into numeric variables using dummy coding, I found the correlation between gender_male and survived to be -0.54, and between class_1 and survived to be 0.29. Clearly, there is a negative relationship between being male and surviving, and a positive relationship between being a class_1 passenger and surviving. However, I also want to study the combined effect of being male and in class_1 on the outcome. To this end, I created a new variable that is the logical AND of gender_male and class_1, and found its correlation with survived. It was -0.012. Is what I did correct? If so, how do I interpret the result?
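Concretely, what I did looks like this minimal sketch (toy numbers rather than the real data; assume a pandas DataFrame with the dummy columns gender_male, class_1, and survived):

Python:
import pandas as pd

# Toy dummy-coded data: 1 = male / first class / survived, 0 otherwise
df = pd.DataFrame({
    "gender_male": [1, 0, 1, 0, 1, 0, 1, 0],
    "class_1":     [1, 1, 0, 0, 1, 0, 0, 1],
    "survived":    [0, 1, 0, 1, 1, 1, 0, 1],
})

# Combined feature: logical AND of the two dummies, i.e. 1 only
# when the passenger is both male and in first class
df["male_class_1"] = df["gender_male"] * df["class_1"]

# Pearson correlation of every column with the outcome
print(df.corr()["survived"])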

Thanks
 
  • #2
S_David said:
However, I also want to study the combined effect of being male and in class_1 on the outcome. ... If so, how do I interpret the result?
What you are looking for in your model is called an interaction term. Basically, you want to see if the change in one effect depends on the value of the other effect.

So in your example, males tend to not survive (perhaps due to heroic “women and children first” decisions) and first class tended to survive (perhaps due to proximity to exits). If these effects had no interaction, then you would expect first class males to survive at (0.29 - 0.54) = -0.25. This is less than females of any class but more than non-first-class males (perhaps first class males help all females, just like non-first-class males, but after the females are out they still have the exit-proximity advantage).

An interaction effect would mean that there is something about the gender relationship in first class that is different than the gender relationship in non-first class. Perhaps first class males are more prone to acts of heroism, which results in their survival being below the expected -0.25, or perhaps they are generally older and similar heroism gets them killed more often.
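Spelling the no-interaction arithmetic out explicitly (just the quoted correlations, not a fitted model):

Python:
# Additive (no-interaction) expectation from the two main effects
r_male = -0.54    # correlation of gender_male with survived
r_class1 = 0.29   # correlation of class_1 with survived

# Under the additive assumption the two effects simply sum
print(round(r_male + r_class1, 2))  # -0.25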
 
  • #3
Dale said:
What you are looking for in your model is called an interaction term. ... If these effects had no interaction, then you would expect first class males to survive at (0.29 - 0.54) = -0.25.

So, if there is no interaction, first class males survive at -0.25, but I got a result of -0.012. What does this mean? Can we say that being a first class male doesn't affect the outcome significantly? If so, it doesn't affect the outcome relative to what: being male, or being a first class passenger?
 
  • #4
S_David said:
Can we say that being a first class male doesn't affect the outcome significantly?
Well, whether it is significant or not depends on the p-value of the interaction term. In any case, even if it is significant, it seems to be a small effect. Depending on the number of data points, even a very small difference can be significant, so it may not be important even if it is significant.

S_David said:
I got a result of -0.012. What does this mean?
It means that first class males survive at (-0.25) + (-0.012) = -0.262 instead of the expected -0.25, so slightly less than expected.
 
  • Like
Likes EngWiPy
  • #5
I still don't get it completely. Let us take the females as another example, to hopefully get a clearer understanding:

females survived at 0.54
first class survived at 0.29, and first class females at 0.41
second class survived at 0.093, and second class females at 0.34
third class survived at -0.32, and third class females at 0.10

How should these results be interpreted, and are we getting valuable information here? I think we are, but I am not sure how to express it correctly. For example, if there is no interaction, as you mentioned, the first class female survives at 0.54 + 0.29 = 0.83. Right? But I got 0.41. What does this mean? There is an interaction, correct? But how do I understand and interpret this interaction?
 
  • #6
It is a little hard for me to know since I am not familiar with the data, and you aren’t describing it exhaustively. So I will be a little general.

You have a model like ##S_{ij}=\mu+G_i+C_j+GC_{ij}## where ##S## is some measure of survival, ##G## is gender and ##C## is class. So ##j\in\{1,2,3\}## and ##i\in\{F,M\}##.

There will be some reference class, let’s say first class, and some reference gender, let’s say female. So automatically ##G_F=C_1=GC_{Fj}=GC_{i1}=0##.

Then when you do your fit, you will get an intercept term ##\mu##, a gender term ##G_M##, two class terms ##C_2## and ##C_3##, and two interaction terms ##GC_{M2}## and ##GC_{M3}##. So the survival measure of first class females would be ##S_{F1}=\mu##, the survival of first class males would be ##S_{M1}=\mu+G_M##, the survival of third class females would be ##S_{F3}=\mu+C_3##, and the survival of second class males would be ##S_{M2}=\mu+G_M+C_2+GC_{M2}##.
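I don’t really use Python myself, but I believe the statsmodels formula interface fits this model much like R’s lm; a sketch under that assumption (and assuming the seaborn Titanic sample with columns survived, sex, and pclass):

Python:
import seaborn as sns
import statsmodels.formula.api as smf

# Seaborn's Titanic sample: survived (0/1), sex (female/male), pclass (1/2/3)
df = sns.load_dataset("titanic")

# Full factorial model S ~ mu + G + C + G:C with treatment (dummy) coding;
# the reference levels (female, first class) are absorbed into the intercept
model = smf.ols("survived ~ C(sex) * C(pclass)", data=df).fit()

# Coefficients named like C(sex)[T.male], C(pclass)[T.2], and the interactions
print(model.params)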
 
  • Like
Likes EngWiPy
  • #7
S_David said:
females survived at 0.54
first class survived at 0.29, and first class females at 0.41
second class survived at 0.093, and second class females at 0.34
third class survived at -0.32, and third class females at 0.10
I just noticed there are too many coefficients here. If you have three classes and two genders, then the full model has 6 terms; you have 7. There should be one intercept, one gender, two classes, and two interactions.

What software are you using? It shouldn’t be giving you this output. Can you post your code and output?
 
  • Like
Likes EngWiPy
  • #8
Thanks for your replies. Let me start over and explain the issue in more detail: I have a dataset related to the Titanic disaster with a number of predictors and one response variable. At the moment I am concerned with the categorical predictors Sex/Gender and Class, and their relationships with the response variable Survived. The Sex predictor takes on two values, Female and Male, while Class takes on three values: first, second, and third. The response variable Survived takes on two values: 0 for didn't survive, and 1 for survived.

What I did first was to convert these categorical predictors into numeric ones using dummy variable coding. After this conversion I had 5 predictors, namely: Sex_Male, Sex_Female, Class_1, Class_2, Class_3. From these, I can find the correlation between each of the 5 predictors and Survived using Python as:

Python:
df.corr()

where df is a dataframe that has 6 columns (the numeric predictors mentioned above, and the response variable Survived).

After that, I asked what the combined effect of Sex/Class on the outcome is. To this end, I created 6 new columns that represent all the possible combinations of these two predictors, namely: Male_Class_1, Male_Class_2, ..., Female_Class_3. Each new feature was obtained by multiplying the columns the new feature represents. For example, for Male_Class_1, I multiplied the columns Sex_Male and Class_1, since this gives 1 only if both Sex_Male and Class_1 are 1, and so on for the other new features. Then I found the correlation matrix again with these new features (and the response variable). From here I got the results I mentioned before, but at this point I didn't know how to interpret them.
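Concretely, the construction was something like the following sketch (using the seaborn Titanic sample here; my actual column names differ slightly):

Python:
import pandas as pd
import seaborn as sns

df = sns.load_dataset("titanic")[["survived", "sex", "pclass"]]

# Dummy-code both categorical predictors: one 0/1 column per level
dummies = pd.get_dummies(df, columns=["sex", "pclass"], dtype=int)

# All six Sex-by-Class combination columns, built by multiplication
for s in ("male", "female"):
    for c in (1, 2, 3):
        dummies[f"{s}_class_{c}"] = dummies[f"sex_{s}"] * dummies[f"pclass_{c}"]

# Correlation of every feature with the outcome
print(dummies.corr()["survived"].round(3))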

For example, Class_2 has a correlation of 0.093 with the response variable, which implies a very weak relationship, but Female_Class_2 has a correlation of 0.34 with the response variable. How does Class_2 as a whole have a very low correlation with the outcome, yet a subset of that class has a higher correlation? Also, what does it mean if Sex_Male has a correlation of -0.54, but Male_Class_1 has a correlation of -0.012 with the outcome? Do these imply that Sex is a stronger predictor of the outcome than Class?

I am not yet at the stage of fitting the model as you presented it mathematically (this is a classification problem, though, which suggests the use of Logistic Regression rather than Linear Regression). I may later use the fitted model with the p-values to confirm some tentative conclusions from this exploration process.
 
  • #9
S_David said:
I created 6 new columns that represent all the possible combinations of these two predictors, namely: Male_Class_1, Male_Class_2, ..., Female_Class_3.
OK, so how did you get 7 coefficients out?

S_David said:
which suggests the use of Logistic Regression rather than Linear Regression
I would also recommend logistic regression, but your question on linear regression is still fine even in this context.
 
  • #10
I have the 6 combinations, but then I also have the 5 individual features, so in total I have 11 columns/features. In post #7 you asked why I have 7: I wrote 7 because Sex_Female is also a feature in its own right. But I actually have 11 features now after adding the combinations, since I didn't remove the individual features.
 
  • #11
Ok, that is way too many degrees of freedom. I am surprised (and concerned) that the Python package did this without throwing an error. Your matrix should be singular.

You should make two choices. First, what would you consider your reference group, against which you would compare all other groups. Second, do you want to do a traditional model where you have interaction effects and main effects, or do you just want to look at all the interaction effect combinations.
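You can check the singularity directly. I don’t use Python, but a sketch along these lines (assuming the seaborn Titanic sample and the 11-column construction you described) should show it:

Python:
import numpy as np
import pandas as pd
import seaborn as sns

df = sns.load_dataset("titanic")[["sex", "pclass"]]
X = pd.get_dummies(df, columns=["sex", "pclass"], dtype=int)
for s in ("male", "female"):
    for c in (1, 2, 3):
        X[f"{s}_class_{c}"] = X[f"sex_{s}"] * X[f"pclass_{c}"]

# 11 columns, but rank only 6: e.g. sex_male equals the sum of the three
# male_class_* columns, so any design matrix built from all of them is singular
print(X.shape[1], np.linalg.matrix_rank(X.to_numpy()))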
 
  • #12
Dale said:
Ok, that is way too many degrees of freedom. ... First, what would you consider your reference group, against which you would compare all other groups.

I was trying to see if I should include interaction effects or not based on the correlations with the outcome. If I take Sex_Female as the reference for gender and Class_1 as the reference for class, then I would end up with only two new features: Male_Class_2 and Male_Class_3. Right? In this case, what do the correlations of Male_Class_2 and Male_Class_3 with the response variable say about the correlations of the other combinations with the response variable?
 
  • #13
S_David said:
I was trying to see if I should include interaction effects or not based on the correlations with the outcome.
Ok, so for that you will want to take the traditional approach. In your design matrix, one column will be all 1's; that is your column for the intercept, which will be first class females. The next column will be for Male, the column after for 2nd class, then 3rd class, then 2nd class Male, then 3rd class Male. This is a total of 6 columns and is the standard full factorial model. If you were doing this in R, you would specify it as S ~ G + C + G*C.
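In Python, I believe the patsy library (which statsmodels uses for formulas) can build that same design matrix; a sketch, with the Treatment() arguments pinning the reference levels explicitly:

Python:
import seaborn as sns
from patsy import dmatrix

df = sns.load_dataset("titanic")

# Full factorial design matrix: Intercept (first class females), male,
# 2nd class, 3rd class, male:2nd, male:3rd -- six columns in total
X = dmatrix("C(sex, Treatment('female')) * C(pclass, Treatment(1))", df)
print(X.design_info.column_names)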

You will get a table out which should have 6 rows and several columns labeled things like Parameter, Estimate, Std. Err, p-value, ... The parameter column should have entries like Intercept, G_M, C_2, G_M:C_2. The first thing to do is look at the p-values and estimates for your interaction terms. If the p-values are large or the estimates are small, then drop the interaction terms and run a smaller model with the main effects only.

Now, if they are significant, then you have to understand what they mean. Since your reference is first class female, the average of the first class females is considered the overall “expected value”; that is why it is the intercept of the model. Then each of the other terms in the model represents a deviation from the expected. So the M term shows how first class males differ from first class females; the expected survival of first class males is therefore the intercept estimate plus the M estimate. Similarly, the 2 and 3 terms show how second and third class females differ from first class females; their expected survival is the intercept plus the 2 and 3 terms respectively.

So what about the groups that differ from the reference group by both gender and class? If your interaction terms were dropped above, then their expected survival is just the sum of the intercept and both main effects, so second class males would be intercept plus the Male term plus the 2 term. If the interaction was not dropped, then the group cannot simply be described by an independent sum of those two effects; there is a further difference from what would be expected, so the second class male expected survival would be the intercept plus the Male term plus the 2 term plus the Male:2 term.

You may find it helpful to expand all of those six groups into straight gender:class survival rates, without looking at them as effects and effects upon effects.
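For example, the raw cell rates are one line with pandas (a sketch, again assuming the seaborn Titanic sample):

Python:
import seaborn as sns

df = sns.load_dataset("titanic")

# Straight survival rate for each gender-by-class cell
print(df.groupby(["sex", "pclass"])["survived"].mean().unstack())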
 
  • #14
Thanks for the detailed reply. I tested the accuracy (the score, or the coefficient of determination ##R^2##) of this model, and it didn't change much compared to the accuracy of a model that contained the main effects only. I will check the estimates and the p-values for the interaction terms to double-check this (I think the Python package I am using doesn't give the p-values automatically, and they have to be coded), but I think the estimates will most likely be close to zero.
 
  • #15
I don’t know much about Python statistics packages. I use R for statistics. In R, there are several easy ways to evaluate a model. One is using “drop1” which drops all single terms one at a time. Another is using the AIC or the BIC. A third is to run two models and just use ANOVA to compare them.

If Python enables any of those, then I can help you understand what they mean.
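I don’t know the Python side well, but I understand the statsmodels package exposes at least the AIC/BIC and the two-model comparison; a sketch under that assumption (seaborn Titanic sample, logistic models since your outcome is binary):

Python:
import scipy.stats as stats
import seaborn as sns
import statsmodels.formula.api as smf

df = sns.load_dataset("titanic")

# Main-effects-only model vs. full factorial model
small = smf.logit("survived ~ C(sex) + C(pclass)", data=df).fit(disp=0)
full = smf.logit("survived ~ C(sex) * C(pclass)", data=df).fit(disp=0)

print(small.aic, full.aic)  # lower AIC is the preferred model

# Likelihood-ratio test, analogous to comparing nested models with anova() in R
lr = 2 * (full.llf - small.llf)
print(stats.chi2.sf(lr, full.df_model - small.df_model))  # joint p-value for the interactions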
 
  • #16
Dale said:
I don’t know much about Python statistics packages. I use R for statistics. ... If Python enables any of those, then I can help you understand what they mean.

I think Python evaluates models mainly using the score and MSE metrics, and for classification problems it uses a classification report that includes the precision, recall, f1-score, and support. Since the accuracy hasn't improved with the interaction terms, can't we conclude (at least tentatively) that they have little to no effect, just from this metric (the score)?
 
  • #17
Yes. The burden of proof for simplifying a model is small. Use any excuse to simplify; it is only a more complicated model that needs strong justification.

That is my personal prejudice, and it is common amongst scientists, but not universal.
 
  • #18
Dale is describing something sometimes called Ockham's Razor: https://en.wikipedia.org/wiki/Occam's_razor It is an important thing to be aware of when you are trying to model data.

simpler theories [hypotheses] are preferable to more complex ones because they are more testable

I added the word "hypotheses".
 

1. What is correlation and how is it calculated?

Correlation is a statistical measure that quantifies the relationship between two variables. It is commonly measured with a correlation coefficient, such as Pearson's r, which ranges from -1 to +1. A positive correlation means that the variables move in the same direction, while a negative correlation means they move in opposite directions.
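For reference, Pearson's ##r## for paired samples ##(x_i, y_i)## with means ##\bar{x}, \bar{y}## is ##r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}##.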

2. What does a correlation coefficient of 0 mean?

A correlation coefficient of 0 means that there is no linear relationship between the two variables. However, it is important to note that there could still be a non-linear relationship or other types of relationships between the variables.

3. How do I interpret the strength of a correlation?

The absolute value of the correlation coefficient indicates the strength of the relationship between the variables, and the sign indicates its direction. As a rough rule of thumb, an absolute value of 1 indicates a perfect correlation, around 0.5 a moderate correlation, and around 0.2 a weak correlation, whether positive or negative.

4. Can correlation imply causation?

No, correlation does not imply causation. Just because two variables are correlated, it does not necessarily mean that one variable causes the other. It is important to consider other factors and conduct further research to establish causation.

5. What are some limitations of interpreting correlation?

Correlation does not provide information about the direction of the relationship or the cause and effect between variables. It also does not account for other variables that may influence the relationship. Additionally, correlation does not indicate the strength of the relationship in the entire population, only in the sample being studied.
