Question about multivariate statistics on proportions

1. Feb 9, 2014

Signifier

Hi all, I have no experience with multivariate statistics but I'm trying to figure out a type of problem that seems like it should be easy, but I don't know where to begin. I ran into the problem in my research, and I don't have a strong statistics background. I'll try to explain it with an example.

Let's say I've got two classes of individuals, "Mutants" and "Non-mutants". Let's say I have 50 Mutants and 120 Non-mutants, so 170 individuals total.

Now I want to study how these two groups may differ in some preferences.

Let's say I find that 32 of the Mutants like to eat cheese (64%), while 50 of the Non-mutants like to eat cheese (42%).

Let's say I find that 20 of the Mutants are tired (40%), while only 2 of the Non-mutants are tired (~2%).

I can do some univariate statistics and conclude that it seems that there is a statistically significant difference between the Mutants and Non-mutants in their cheese liking, and also a statistically significant difference between the Mutants and Non-mutants in their tiredness.

What I want to know is what statistical tests can I do to figure out if these differences are INDEPENDENT of each other. For instance, is the difference between the % of mutants who like cheese and % of non-mutants who like cheese really due to "mutant"/"non-mutant"? Or is it actually a product of the differences in tiredness (do tired people in general just like cheese?)

Is this even possible to answer with the information given? What sort of "multivariate statistics" do I need to do to sort something like this out?

Thank you so much in advance, and ask me any clarifying questions you need to. Thanks!

2. Feb 10, 2014

Stephen Tashi

Your type of problem is described by the phrase "analysis of categorical data". You can find a bewildering array of methods for doing such analysis.

One of the reasons that applying statistics is a subjective matter is that it involves assuming some probability model for how the data is generated. You have to decide what probability model you want to use.

The two major divisions of statistics are Hypothesis Testing and Estimation.

In statistical "Hypothesis Testing", the probability model that is assumed is specified by the "null hypothesis". The probability model must be specific enough to compute the probability of an event that contains the particular event given by the observed data. For example, in coin tossing, "The coin is just as likely to land heads as tails" specifies the particular probability model that p = 0.5 for the probability of a head. The assumption "The coin is more likely to land heads than tails" does not specify a particular probability model, so it isn't a usable "null hypothesis". If you want to do a hypothesis test using a probability model that reflects that mutants are more likely than non-mutants to like cheese then you must come up with some specific numbers to quantify this.

In Estimation, a probability model is assumed that has unknown parameters and the parameters are estimated from the observed data. For example, you could assume a model with several unknown variables representing conditional probabilities (e.g. $p_{c|m,t}$ = the probability that an individual likes cheese given that he is a mutant and tired).

Often people fit a model to data using Estimation and then, using that model as the "null hypothesis", they Hypothesis Test whether the model should be "accepted". If it is accepted, this has the feel of a self-fulfilling prophecy, but this is a traditional approach.

Since applying statistics is subjective, if you are doing work that you want to publish, you should look at other papers that have been published and see what kind of statistics editors in your field like.

3. Feb 10, 2014

FactChecker

There is a very direct way to test the hypothesis that being tired depends on Mutant/non-mutant. Assume that there is no dependence. Then the data you have indicates that a random 22 out of 170 should be tired. This gives a binomial distribution with p = 22/170 ~ 0.13 and q ~ 0.87. The Chi-squared goodness of fit test can test whether data fits a particular distribution. The expected numbers with this assumed distribution are 50*0.13 ~ 6 tired Mutants (had 20), 44 non-tired Mutants (had 30), 120*0.13 ~ 16 tired non-Mutants (had 2) and 104 non-tired non-Mutants (had 118). Plug this into the calculator at http://vassarstats.net/csfit.html. With these rounded expected counts, it gives a Chi-squared value of 51.26 with 3 degrees of freedom. It has a probability P < 0.0001. So our assumption that there is no connection between tiredness and Mutants is almost certainly wrong.
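As a cross-check on the arithmetic above, here is a minimal Python sketch (not the calculator's own code) that derives the expected counts from the table margins and computes the Pearson statistic with both rounded and unrounded expected counts. The function and variable names are illustrative. One caveat not stated in the post: because the expected counts are estimated from the data's own margins, the usual test of independence for a 2x2 table uses 1 degree of freedom rather than 3, so the p-value below uses the closed form for 1 degree of freedom. The conclusion is the same either way.

```python
import math

# Observed counts from the thread: (tired, not tired) for each group.
observed = {"mutant": (20, 30), "non_mutant": (2, 118)}


def expected_counts(table):
    """Expected counts under independence, computed from the table margins."""
    total = sum(sum(row) for row in table.values())
    col_totals = [sum(row[j] for row in table.values()) for j in range(2)]
    return {
        group: tuple(sum(row) * col_totals[j] / total for j in range(2))
        for group, row in table.items()
    }


def chi_squared(table, expected):
    """Pearson statistic: sum over cells of (observed - expected)^2 / expected."""
    return sum(
        (o - e) ** 2 / e
        for group in table
        for o, e in zip(table[group], expected[group])
    )


exact = expected_counts(observed)
# Rounding to whole individuals, as in the post (6, 44, 16, 104).
rounded = {g: tuple(round(e) for e in row) for g, row in exact.items()}

chi2_exact = chi_squared(observed, exact)
chi2_rounded = chi_squared(observed, rounded)

# Upper-tail probability of a chi-squared variable with 1 degree of freedom.
p_value_1df = math.erfc(math.sqrt(chi2_exact / 2))

print(f"chi-squared, unrounded expected counts: {chi2_exact:.2f}")
print(f"chi-squared, rounded expected counts:   {chi2_rounded:.2f}")
print(f"p-value (1 df, unrounded): {p_value_1df:.2e}")
```

The rounded version reproduces the 51.26 quoted above; the unrounded statistic is somewhat smaller but still far into the tail.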

4. Feb 11, 2014

Signifier

Thank you both. This was really useful.

Stephen Tashi: what you've described seems more complicated than what I have done, so I want to check my understanding. For what I'm interested in (is there a significant difference in the % of mutants and % of non-mutants who like cheese?), can't I just assume that the data follows a normal distribution, posit a null hypothesis that the %'s are the same, and then test it by calculating a Z-value? Using this formula: http://www.kean.edu/~fosborne/bstat/px/z-score-for-7-6-1.gif

where p-bar is the pooled proportion estimate. Is this not sufficient? (I basically am following the explanation at http://www.kean.edu/~fosborne/bstat/07d2pop.html)
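For readers who want to reproduce this test, here is a minimal Python sketch. It assumes the linked formula is the usual pooled two-proportion z statistic, z = (p1 - p2) / sqrt(p-bar * (1 - p-bar) * (1/n1 + 1/n2)); the function name is illustrative, and the two-sided p-value comes from the standard normal distribution.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic and its two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)  # pooled proportion estimate
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability under a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts from the example: Mutants first, Non-mutants second.
z_cheese, p_cheese = two_proportion_z(32, 50, 50, 120)
z_tired, p_tired = two_proportion_z(20, 50, 2, 120)

print(f"cheese:    z = {z_cheese:.2f}, p = {p_cheese:.4f}")
print(f"tiredness: z = {z_tired:.2f}, p = {p_tired:.2e}")
```

For a 2x2 table, the square of this z statistic equals the Pearson chi-squared statistic computed from unrounded expected counts, so the two approaches are the same test in different clothing.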

FactChecker: thank you. Is this a better technique for this problem than calculating a Z-score from testing of the difference between 2 population proportions (again, as described at http://www.kean.edu/~fosborne/bstat/07d2pop.html)?

Lastly, how could we use a technique like this to see if there is a connection between the other variables, like tiredness and cheese-liking? I'm trying to figure out if it's really the mutant/non-mutant status that makes the difference in cheese-liking, or if it's really the difference in tiredness. Is this the sort of question that can even be answered (in part) with more powerful statistical tests?

Thank you again both so much!

Last edited by a moderator: May 6, 2017

5. Feb 11, 2014

Stephen Tashi

I note that the calculation at the link you gave assumes $(p_1 - p_2) = 0$; that is, it uses the assumption $p_1 = p_2$.

You aren't assuming "the data" is normally distributed; you are assuming the test statistic is normally distributed.

As I said, applying statistics is a subjective process. Many people like the test you describe. As to whether it is "sufficient", I think it is sufficient in a sociological sense. Many readers would find it convincing. I don't know any mathematical definition for what it would mean to say that a hypothesis test was "sufficient".

Your last question indicates that you grant that there is a difference in cheese-liking between mutants and non-mutants. You are asking whether this can be "explained" by a difference in cheese-liking between tired and non-tired individuals. You can do a hypothesis test to "accept" or "reject" the idea that there is no difference in cheese-liking between tired and non-tired individuals. However, if you "reject" the idea that there is no difference, this doesn't answer the question of whether the difference (whatever it is) "explains" the difference in cheese-liking between mutants and non-mutants.

To test whether a difference in cheese-liking between tired and non-tired individuals explains the difference in cheese-liking between mutants and non-mutants, I think you must be specific about the difference in cheese-liking between tired and non-tired individuals. It isn't sufficient just to say "Yes, there is a difference" because that doesn't let you calculate probabilities. You'd have to assume something specific like "60% of tired individuals like cheese and 45% of non-tired individuals like cheese".
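To make that concrete, here is a small Python sketch of what such a specific assumption implies. The 60% and 45% figures are only the illustrative numbers from this post, not estimates from any data, and the sketch additionally assumes that cheese-liking depends on tiredness alone. It uses the observed tiredness rates to compute the cheese-liking rate each group would then be expected to show.

```python
# Hypothetical model: cheese-liking depends only on tiredness.
# These two rates are the illustrative figures from the post, not data.
P_CHEESE_GIVEN_TIRED = 0.60
P_CHEESE_GIVEN_NOT_TIRED = 0.45

# Observed counts from the thread: (number tired, group size).
groups = {"mutant": (20, 50), "non_mutant": (2, 120)}
observed_cheese_rate = {"mutant": 32 / 50, "non_mutant": 50 / 120}


def implied_cheese_rate(tired, size):
    """Cheese-liking rate implied by the group's share of tired individuals."""
    p_tired = tired / size
    return (p_tired * P_CHEESE_GIVEN_TIRED
            + (1 - p_tired) * P_CHEESE_GIVEN_NOT_TIRED)


implied = {g: implied_cheese_rate(*counts) for g, counts in groups.items()}
implied_gap = implied["mutant"] - implied["non_mutant"]
observed_gap = observed_cheese_rate["mutant"] - observed_cheese_rate["non_mutant"]

for g in groups:
    print(f"{g}: implied {implied[g]:.4f}, observed {observed_cheese_rate[g]:.4f}")
print(f"implied gap {implied_gap:.4f} vs observed gap {observed_gap:.4f}")
```

Under these particular assumed rates, tiredness alone would account for a gap of about 6 percentage points, against an observed gap of about 22. Different assumed rates would give a different answer, which is the point: the question only becomes computable once the numbers are specified.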

6. Feb 11, 2014

FactChecker

I don't know.

This sounds like an analysis of variance (ANOVA) question where the problem is to determine what variables are the cause of data variance. I noticed that your example did not include any data on how many tired / not-tired people like / do-not-like cheese. I think that it will be very hard to address your question without that data. This is especially true since your example has such a strong relationship between Mutant / non-Mutant and tired / not-tired. So when you ask which of those really causes liking cheese, you will need a lot of data to statistically distinguish between those two possible causes (assuming either is a cause).

Last edited by a moderator: May 6, 2017

7. Feb 11, 2014

Signifier

Again thank you both, this has given me a lot to think about and learn.

8. Feb 11, 2014

Stephen Tashi

I'll repeat my main point, which is that doing statistics involves assuming a probability model for how the data is generated.

(Introducing jargon like "null hypothesis" may obscure the fact that a probability model is being used, but there always is a probability model.)

FactChecker's suggestion of using ANOVA is excellent if you are willing to use a probability model that assumes a linear relation among some continuous random variables. If you go that route, you must define the variables.

If you are familiar with computer programming, I suggest that you think about how you would write a program to simulate how your data was generated. That will force you to think of a specific probability model. Imagining what underlying process generates the data is naturally a subjective process. However, in real life problems, the bare facts usually don't provide enough "givens" to make a solvable mathematical problem. Hence you must make subjective assumptions in order to apply math.
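As one possible starting point, here is a minimal Python simulation sketch. Everything about the model is an assumption made for illustration: tiredness is drawn at each group's observed rate, and cheese-liking depends on tiredness alone, with hypothetical rates of 60% (tired) and 45% (not tired). The program then asks how often such a process produces a Mutant/Non-mutant gap in cheese-liking at least as large as the one observed.

```python
import random

# Assumed model parameters (illustrative, not estimated conclusions).
GROUP_SIZES = {"mutant": 50, "non_mutant": 120}
P_TIRED = {"mutant": 20 / 50, "non_mutant": 2 / 120}  # observed tiredness rates
P_CHEESE = {True: 0.60, False: 0.45}  # hypothetical: keyed by "is tired"

OBSERVED_GAP = 32 / 50 - 50 / 120  # observed difference in cheese-liking rates


def simulate_gap(rng):
    """Simulate one data set and return the gap in cheese-liking rates."""
    rates = {}
    for group, size in GROUP_SIZES.items():
        likes = 0
        for _ in range(size):
            tired = rng.random() < P_TIRED[group]
            likes += rng.random() < P_CHEESE[tired]
        rates[group] = likes / size
    return rates["mutant"] - rates["non_mutant"]


rng = random.Random(0)  # fixed seed so the run is repeatable
N_SIMS = 5000
gaps = [simulate_gap(rng) for _ in range(N_SIMS)]

mean_gap = sum(gaps) / N_SIMS
fraction_as_large = sum(g >= OBSERVED_GAP for g in gaps) / N_SIMS

print(f"mean simulated gap: {mean_gap:.3f}")
print(f"fraction of simulations with gap >= observed: {fraction_as_large:.3f}")
```

Changing the assumed rates, or letting cheese-liking depend on Mutant status as well, gives a different model and a different answer. Writing the generator down makes those choices explicit.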

9. Feb 14, 2014

bpet

The analysis will be much much easier if the data includes how many individuals are Mutants that like cheese AND are tired, etc. Then you can make a 2x4 contingency table (or 2x2x2 if you wish) and apply the standard methods.
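To show what such a table might look like, here is a Python sketch of a 2x2x2 layout. The joint counts are HYPOTHETICAL: they were made up to be consistent with the margins given in the thread (50 Mutants, 32 liking cheese, 20 tired; 120 Non-mutants, 50 liking cheese, 2 tired) and say nothing about the real data. The Mantel-Haenszel pooled odds ratio used here is one common stratified method, chosen for illustration; the post does not name a specific method.

```python
# HYPOTHETICAL joint counts, invented only to match the thread's margins.
# Key: (group, is_tired) -> (likes cheese, does not like cheese)
counts = {
    ("mutant", True): (14, 6),
    ("mutant", False): (18, 12),
    ("non_mutant", True): (1, 1),
    ("non_mutant", False): (49, 69),
}


def crude_odds_ratio(table):
    """Odds ratio for Mutant vs. cheese-liking, ignoring tiredness."""
    a = sum(table[("mutant", t)][0] for t in (True, False))
    b = sum(table[("mutant", t)][1] for t in (True, False))
    c = sum(table[("non_mutant", t)][0] for t in (True, False))
    d = sum(table[("non_mutant", t)][1] for t in (True, False))
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(table):
    """Pooled odds ratio across the tired / not-tired strata."""
    numerator = denominator = 0.0
    for tired in (True, False):
        a, b = table[("mutant", tired)]
        c, d = table[("non_mutant", tired)]
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


crude_or = crude_odds_ratio(counts)
mh_or = mantel_haenszel_odds_ratio(counts)

print(f"crude odds ratio (tiredness ignored):     {crude_or:.2f}")
print(f"Mantel-Haenszel odds ratio (by tiredness): {mh_or:.2f}")
```

If the stratified odds ratio stayed well above 1 in real data, that would suggest a Mutant effect on cheese-liking that tiredness does not account for. With only two tired Non-mutants, though, the tired stratum carries very little information.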

10. Feb 14, 2014

FactChecker

The data in your example is not well suited to say anything directly about the connection between liking cheese and being tired. You don't have the best data (e.g. who likes cheese and is tired). The subject of Experimental Design addresses exactly that issue. There are several Experimental Design tools available to help design experiments that will allow you to determine the interactions between factors.