# Which statistical test should I use?

I have a data set consisting of ~10,000 patients. Each patient answered how often they had chocolate and we assigned the following values to their responses:

Never = 0

1-6 times per year = 0.01

7-11 times per year = 0.028

1 time per month = 0.033

2-3 times per month = 0.08

1 time per week = 0.14

2 times per week = 0.29

3-4 times per week = 0.5

5-6 times per week = 0.79

1 time per day = 1

2 or more times per day = 2

We also have data describing the levels of a certain cytokine in their blood.

The question I am trying to answer is: Does the amount of chocolate someone eats cause DECREASED levels of this cytokine?

What statistical test should I run in order to answer this question? I am very new to statistics, so any and all information is much appreciated.

Help would be greatly appreciated! Please post if you can!

Stephen Tashi
The question I am trying to answer is: Does the amount of chocolate someone eats cause DECREASED levels of this cytokine?

Applying statistics is a subjective matter, so you should clarify whether "trying to answer" means that you want an answer that can be published in an article or an answer for your own personal satisfaction.

Unless you explain your rating system, people can only advise you how to test whether higher levels of your rating are associated with decreased levels of the cytokine. This is not necessarily the same thing as whether higher levels of chocolate are associated with decreased levels.

Thanks Stephen, I am trying to obtain an answer that can be published in an article.

As for explaining the rating system - the participants are given a questionnaire in which they have the multiple choice options shown above (never, 1-6 times per year, 7-11 times per year, etc...). The database codes these answers to the numbers shown above (0=never, 0.01=1-6 times per year, 0.028=7-11 times per year, etc.)...

Maybe my lack of knowledge in statistics is causing me to be unfortunately vague. I thought that I could describe the data, our ranking system, and the question at hand (Does the amount of chocolate someone eats cause DECREASED levels of this cytokine?) in the hopes that someone would respond with:

"Use a t-test" or "use a chi-squared test"

I'm aware that it's probably going to require more than running one simple test in order to obtain publishable results, but any step in the right direction (maybe just explaining the general outline or pointing out which statistics concepts I should look into) would be a great help to get the ball rolling for me in this project.

Thanks again!

Stephen Tashi
Thanks Stephen, I am trying to obtain an answer that can be published in an article.

Then the only reliable way to do the statistics is to understand what types of statistical tests are accepted by the editors of the target publications. Academic journals may have a document describing their guidelines. You can look at other articles that were published and see what tests were used in a similar situation. If you can't find articles about similar situations (Does eating X more often influence the level of Y?) it's unlikely the journal would publish your article anyway.

The database codes these answers to the numbers shown above (0=never, 0.01=1-6 times per year, 0.028=7-11 times per year, etc.)...

Why code 1-6 times a year as 0.01? Why not 0.02 or 177.45? You haven't explained the reason for the ratings.

Maybe my lack of knowledge in statistics is causing me to be unfortunately vague. I thought that I could describe the data, our ranking system, and the question at hand (Does the amount of chocolate someone eats cause DECREASED levels of this cytokine?) in the hopes that someone would respond with:

"Use a t-test" or "use a chi-squared test"

There are people who make responses like that. If you want to believe them, good luck!

I think your lack of knowledge in statistics leads you to attribute too much capability to statistics.

Statistics does not determine cause and effect. (The very definition of a "cause" gets into philosophical discussions.) It's better to ask the question in the form "Is increased chocolate consumption associated with decreased levels of the cytokine".

The conclusions made by mathematics are limited by the given information. For example, if you know one side and one angle of a triangle, there is no mathematics that lets you determine the missing sides and angles. There is not enough given information in your problem (and in most applications of statistics to real world problems) to determine a yes-or-no answer to the question. It is not even possible to determine the probability that the answer is yes or the probability that the answer is no.

The statistics used in most life science journals is called "frequentist statistics" and it is historically the oldest form of statistics. A problem like yours is approached by a procedure called a "hypothesis test". We make an assumption that lets us compute the probability of the data. This assumption, called the "null hypothesis", is usually as "empty" as possible. In your case it would say "chocolate consumption has no effect on levels of the cytokine" and that would have to be further elaborated so it provides a mathematical model of how your data were generated. The probability of some function of the data is computed (such functions are "statistics" in the technical sense of the term). If the probability of the statistic being as extreme as the data shows is small then we "reject" the null hypothesis.

The above method is not a mathematical proof that "rejection" is the correct thing to do. It is simply a procedure that is accepted by many people as "evidence" in the non-technical sense of the word "evidence". Some persons may be persuaded by one type of statistic and others may not. Different journals may have different standards about what constitutes a "small" probability (0.05 is a typical value, but this is arbitrary).
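To make the procedure concrete, here is a minimal sketch of one kind of hypothesis test, a permutation test. It is only an illustration, not necessarily the test a journal would expect. The null hypothesis is "the cytokine values are unrelated to the chocolate codes", the statistic is the Pearson correlation, and the data are synthetic numbers invented for the example (the effect size, noise level and sample size are all assumptions).

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """One-sided p-value: how often shuffled data give a correlation
    at least as negative as the observed one."""
    rng = random.Random(seed)
    observed = pearson_r(xs, ys)
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Shuffling breaks any link between xs and ys, which is
        # what the null hypothesis claims about the real data.
        rng.shuffle(shuffled)
        if pearson_r(xs, shuffled) <= observed:
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

# SYNTHETIC data, invented for illustration only (not the poster's data).
rng = random.Random(42)
codes = [0, 0.01, 0.033, 0.14, 0.29, 0.5, 0.79, 1, 2]
chocolate = [rng.choice(codes) for _ in range(200)]
cytokine = [5.0 - 1.5 * c + rng.gauss(0, 1) for c in chocolate]

p = permutation_p_value(chocolate, cytokine)
print(f"observed r = {pearson_r(chocolate, cytokine):.3f}, p = {p:.4f}")
```

A small p here only says the synthetic data would be surprising if the null hypothesis were true. It says nothing about cause and effect.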

Wow, thanks a lot for the information. Your explanation of 'frequentist statistics' is really helpful. I've been independently trying to understand the uses of each statistical test, but your explanation of the null hypothesis as it relates to my study really made me understand what I am really trying to do (that is, compile an argument that supports the rejection of my null hypothesis).

Why code 1-6 times a year as 0.01? Why not 0.02 or 177.45? You haven't explained the reason for the ratings.

My apologies - I didn't understand your question at first. The decimal converts the multiple choice answer into the participant's daily frequency of chocolate consumption.
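As a rough check on that conversion, the sketch below recomputes a daily frequency for each option. The rule it uses (midpoint of the range, divided by 7, 30 or 365 days) is my guess at how the database derived its codes, not something the database documentation confirms.

```python
# Assumed (low, high, period_in_days, stored_code) for each questionnaire option.
# The midpoint rule and the 7/30/365-day periods are guesses, not documented facts.
options = {
    "1-6 times per year":  (1, 6, 365, 0.01),
    "7-11 times per year": (7, 11, 365, 0.028),
    "1 time per month":    (1, 1, 30, 0.033),
    "2-3 times per month": (2, 3, 30, 0.08),
    "1 time per week":     (1, 1, 7, 0.14),
    "2 times per week":    (2, 2, 7, 0.29),
    "3-4 times per week":  (3, 4, 7, 0.5),
    "5-6 times per week":  (5, 6, 7, 0.79),
    "1 time per day":      (1, 1, 1, 1),
}

def daily_frequency(low, high, period_days):
    """Midpoint of the answer range expressed as times per day."""
    return (low + high) / 2 / period_days

for label, (low, high, period, stored) in options.items():
    computed = daily_frequency(low, high, period)
    print(f"{label:20s} computed {computed:.3f}   stored {stored}")
```

Under these assumptions most options round to the stored codes. The "7-11 times per year" option comes out near 0.025 rather than 0.028, so the database presumably used a slightly different rule there.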

...the only reliable way to do the statistics is to understand what type of statistical tests are accepted by the editors of the target publications. Academic journals may have a document describing their guidelines. You can look at other articles that were published and see what tests were used in a similar situation.

I took your advice and looked at other journal articles that ran the same kind of study. According to these studies, cytokine "concentrations were transformed into natural logarithms to reduce their positive skewness and data were reported as geometric means. Differences between chocolate nonconsumers and dark chocolate consumers were evaluated using multivariate analysis of variance or multivariate binomial (Poisson) regression with the log link function." The article goes on to address covariates, but I think it's best for me to stop here and make covariates step 2 of my attempts to understand this process.
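For my own understanding, here is a small sketch of what "transformed into natural logarithms" and "reported as geometric means" amount to. The concentrations are hypothetical numbers I made up, and the approach assumes every value is strictly positive (the logarithm is undefined otherwise).

```python
import math
import statistics

# HYPOTHETICAL cytokine concentrations (arbitrary units), right-skewed.
concentrations = [0.4, 0.6, 0.9, 1.1, 1.5, 2.3, 4.8, 12.0]

# Step 1: natural-log transform, which pulls in the long right tail.
logs = [math.log(c) for c in concentrations]

# Step 2: the geometric mean is the exponential of the mean of the logs.
geometric_mean = math.exp(statistics.mean(logs))

print(f"arithmetic mean: {statistics.mean(concentrations):.2f}")
print(f"geometric mean:  {geometric_mean:.2f}")
```

The geometric mean is pulled less by the few large values than the arithmetic mean is, which seems to be why the articles report it for skewed concentrations.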

From here, I have looked into Poisson regression and multivariate ANOVA. Here, I run into another bit of confusion. When looking into Poisson regression, most instructions say that the outcome is dichotomized/categorical (yes or no, agree or disagree), so I'm not sure how that can be applied since there are various levels of chocolate consumption and various concentrations of CRP.

MANOVA seems complex. Plus, most sites and papers are "controlling for.." this and that (age, sex, race, etc.) which I don't know how to do. "Controlling for..." something makes intuitive sense to me. Obviously, there are more things that influence cytokine levels than chocolate - and I need to control for a convincing amount of these factors. Any help understanding these two tests and if/why they would be useful, what they would show, etc. would be wonderful.

Stephen Tashi
Differences between chocolate nonconsumers and dark chocolate consumers were evaluated using multivariate analysis of variance or multivariate binomial (Poisson) regression with the log link function."

That's a promising lead. Don't neglect non-chocolate related articles with the same general pattern. (The more X you eat the more your Y level goes down (or up).)

From here, I have looked into Poisson regression and multivariate ANOVA. Here, I run into another bit of confusion. When looking into Poisson regression, most instructions say that the outcome is dichotomized/categorical (yes or no, agree or disagree), so I'm not sure how that can be applied since there are various levels of chocolate consumption and various concentrations of CRP.

I don't see how to apply Poisson regression either, unless you treat the number of times per week a person eats chocolate as the dependent variable. The "number of times" is a count. (Using counts would ignore your rating scale.)

MANOVA seems complex. Plus, most sites and papers are "controlling for.." this and that (age, sex, race, etc.) which I don't know how to do. "Controlling for..." something makes intuitive sense to me. Obviously, there are more things that influence cytokine levels than chocolate - and I need to control for a convincing amount of these factors. Any help understanding these two tests and if/why they would be useful, what they would show, etc. would be wonderful.

The last time I did ANOVA was over 20 years ago when I took a course in it. However, if we discuss it, somebody who knows what they are doing will probably chime in, and it keeps the thread alive.

Before you worry about complicated approaches, do the simplest ANOVA. As I see it, this would use the cytokine level as the response variable and the levels of chocolate eating as the treatments. (It would ignore your rating scale.) This type of test won't provide evidence for "The more chocolate you eat, the lower your cytokine level will be". If it provides evidence, it only provides evidence that eating chocolate has some effect. (For example, perhaps if you eat a certain level of chocolate your cytokine level goes low, but at a greater or lesser level of chocolate eating, your cytokine level goes high.)
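A minimal sketch of the arithmetic behind that simplest one-way ANOVA, assuming each chocolate category is treated as a separate group and using hypothetical response values. It computes only the F statistic; turning F into a p-value needs the F distribution with (k - 1, N - k) degrees of freedom, which a library routine such as `scipy.stats.f_oneway` handles.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA.

    groups: list of lists, one list of response values per treatment level.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual values around their own group mean.
    ss_within = sum((y - m) ** 2
                    for g, m in zip(groups, group_means) for y in g)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# HYPOTHETICAL log-cytokine values for three chocolate categories.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```

For these made-up groups the between-group sum of squares is 26 and the within-group sum of squares is 6, so F works out to 13. A large F only says the group means differ somewhere; it does not say they fall as consumption rises.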

If you plan to "control for factors" you'll have to survey your data to make sure the data for those factors is present. The database may have a field for a datum but the contents of the field may be null or some place holder for "missing data" like "99999".
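A quick way to do that survey is to count, field by field, how many records hold a null or a placeholder. The sketch below assumes hypothetical field names and that `None`, an empty string and 99999 are the missing-data markers; the real database may use different ones.

```python
# HYPOTHETICAL markers for missing data; check the database's own conventions.
MISSING_MARKERS = {None, "", 99999, "99999"}

# HYPOTHETICAL records and field names, invented for illustration.
records = [
    {"age": 54, "sex": "F", "bmi": 27.1},
    {"age": 99999, "sex": "M", "bmi": None},
    {"age": 61, "sex": "", "bmi": 31.4},
    {"age": 47, "sex": "F", "bmi": 99999},
]

def count_missing(records, fields):
    """Count, for each field, the records whose value is absent or a placeholder."""
    counts = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            # record.get returns None when the field is absent entirely.
            if record.get(field) in MISSING_MARKERS:
                counts[field] += 1
    return counts

missing = count_missing(records, ["age", "sex", "bmi"])
for field, n in missing.items():
    print(f"{field}: {n} of {len(records)} missing")
```

If a factor turns out to be missing for a large share of patients, it may not be usable as a control without further thought about how to handle the gaps.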