
What does statistically significant mean?

  1. Aug 9, 2017 #1
    I'm looking at this quote:

    "The proportions of the phyla Firmicutes and Bacteroidetes were statistically significantly increased in the obese group compared to the normal weight group (p< 0.001, p = 0.003 respectively)."

    Since I don't know statistics can you please explain how to visualize these results? The difference between the numbers looks so small, 0.001 and 0.003. How can it be significant? Thanks.
  3. Aug 9, 2017 #2

    jim mcnamara


    Staff: Mentor

    This is partially guesswork on my part. It is hard to see what is being discussed - gut bacteria I think.
    Another poster may put up a wall of math, but I know from previous posts that you want a very simple, plain-English answer.

    (This is neither a rigorous nor textbook worthy description, nor is it meant to be applied to wherever the "p" in your question came from)

    What the "p" means in general. An example using people.

    Let's assume we have two towns, Town A and Town B. Most of the people in A seem to be shorter than the folks in B. Since people have schedules it would be hard to get an age, height, and weight for every adult in both towns. Expensive, too. So we compromise and rely on the fact that a sample smaller in number than the population can be analyzed. And the results can have meaning. There are assumptions behind this. One biggie is: We assume that people who we do get data from are representative of the larger population. This is not remotely easy with humans and medical research, BTW. We also assume a lot of other things.

    Okay, we have data. What we do is calculate the mean, and calculate statistical values, maybe things like a standard deviation. Then we ask the question: are the differences we see in the data due to the populations of Town A and Town B being truly different? Or are the differences just due to sampling variation, with nothing really going on?

    This is where "p" enters the picture. When you are given a deck of cards and asked to take one, the chance of getting the ace of spades is one chance in 52 possible cards. Right? So 1/52 means p = .019. Any time you get a very small p, it means the result would be unlikely to happen by chance alone, so the two samples probably came from different populations.
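    The deck-of-cards probability above can be checked directly, or estimated with a tiny simulation. This Python sketch is purely illustrative and not part of the original post:

```python
import random

# The chance of drawing the ace of spades from a 52-card deck is 1/52.
print(1 / 52)  # 0.019230769230769232

# Estimate the same probability by simulating random draws.
trials = 100_000
hits = sum(1 for _ in range(trials) if random.randrange(52) == 0)
print(hits / trials)  # close to 0.019
```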

    Depending on what you are doing, you may choose to view p = .05 as good enough to qualify the results as meaningful. It does NOT mean it is an absolute slam dunk. It means that if there were really no difference, a result this large would still turn up about 5 times in 100, so you have roughly 5 chances in 100 of drawing the wrong conclusion.

    In general the more people you sample the more likely the results will be clearer - either no difference or a smaller p value.
  4. Aug 9, 2017 #3

    jim mcnamara


    Staff: Mentor

    PS: can you tell us what research you were reading? Then someone can tell you what was going on. With p values as low as those, there is probably a meaningful interpretation.
  5. Aug 9, 2017 #4
    Yes, thanks. The article is this: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3258740/

    Frequency of Firmicutes and Bacteroidetes in gut microbiota in obese and normal weight Egyptian children and adults

    There is research suggesting that in obese people Bacteroidetes diminish while Firmicutes increase. This paper appears to corroborate this. I wanted to know how strong their result is. So supposedly they found that the obese children indeed had more Firmicutes than the non-obese children. How do I know that what they found is strong evidence and not just a close call?

    Do they find that all obese children have more Firmicutes? Or do they find some of the children have somewhat increased Firmicutes?

    I copy the section about statistical analysis:

    Statistical analysis

    Statistical analysis was performed using the software SPSS 12.0 (SPSS: Chicago, IL). Variables were expressed as mean ± standard deviation (SD). Mean values were compared among subjects using Student’s t test and ANOVA was used to compare variables between all groups. For comparing categorical data, χ2 test was performed. The exact test was used instead when the expected frequency was less than 5. Spearman’s correlation coefficient rho was used to correlate between non-normally distributed continuous variables. A probability value (p value) less than 0.05 was considered statistically significant.

    Thanks again..
  6. Aug 9, 2017 #5


    Science Advisor
    Gold Member
    2017 Award

    It all depends on how the difference between the two treatments compares to the normal variation of results when the treatment isn't changed.

    "statistically significant" means that it is unlikely to get that result simply due to luck.

    Not statistically significant example:
    Suppose you are measuring the effect of a treatment on something and the results are
    Without treatment: 1.0, 2.0, 0.0, 1.5, 2.5, 0.5
    With treatment: 1.01, 2.01, 0.01, 1.5, 2.51, 0.52
    Clearly the differences between the two sets of data are very small compared with the amount of variation of the results. So these differences may easily be just due to luck. The difference is not statistically significant.​

    Statistically significant example:
    Suppose you are measuring the effect of a treatment on something and the results are
    Without treatment: 1.0, 1.01, 0.99, 1.01, 0.98, 1.0
    With treatment: 2.0, 2.02, 1.1, 1.9, 2.01, 1.49
    Clearly the differences between the two sets of data are very large compared with the amount of variation of the results. So these differences are not likely be just due to luck. The difference is statistically significant.​
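    The two examples above can be checked numerically. Here is a sketch (not part of the original post) that computes Welch's t statistic for each pair of samples using only the Python standard library; with about 10 degrees of freedom, |t| above roughly 2.2 corresponds to p < 0.05:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Not statistically significant example:
no_treat_1 = [1.0, 2.0, 0.0, 1.5, 2.5, 0.5]
treat_1 = [1.01, 2.01, 0.01, 1.5, 2.51, 0.52]

# Statistically significant example:
no_treat_2 = [1.0, 1.01, 0.99, 1.01, 0.98, 1.0]
treat_2 = [2.0, 2.02, 1.1, 1.9, 2.01, 1.49]

print(abs(welch_t(no_treat_1, treat_1)))  # about 0.02: far from significant
print(abs(welch_t(no_treat_2, treat_2)))  # about 4.9: highly significant
```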

    The odds of the differences being due just to luck can be made mathematically rigorous if some assumptions about the nature of the random variation are satisfied.
    If the random part is a normal random variable, a lot can be done mathematically.
    There are some standard levels of probability that are typically set before one can state that the results are statistically significant. Some common ones are:
    0.05 ( odds are 1 in 20 that a difference that large or larger is just due to luck)
    0.025 ( odds are 1 in 40 that a difference that large or larger is just due to luck)
    0.01 ( odds are 1 in 100 that a difference that large or larger is just due to luck)

    One of the most extreme significance levels is the level required in nuclear physics before one can claim that a new particle has been discovered. That level is the 5 sigma level, which would occur only once in 3.5 million times just due to luck.
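    The "once in 3.5 million" figure is the one-sided tail probability of a standard normal distribution at 5 sigma. It can be checked in a couple of lines (a sketch, not part of the original post):

```python
import math

def upper_tail(z):
    """One-sided upper-tail probability of a standard normal: P(Z > z)."""
    return math.erfc(z / math.sqrt(2)) / 2

p5 = upper_tail(5.0)
print(p5)      # about 2.87e-07
print(1 / p5)  # about 3.5 million
```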
    Last edited: Aug 9, 2017
  7. Aug 9, 2017 #6


    Science Advisor
    2017 Award

    Here's a useful piece written about p-values for the general public: http://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/

    It's important to note that p-values don't really tell you about the size of the effect, nor do they tell you about causation. A large difference can have a large p-value if only a small fraction of the population was sampled. A minuscule difference that is not biologically relevant could have a small p-value if a large enough sample size was studied. P-values are often misinterpreted and misused when researchers are performing data-mining and testing multiple different hypotheses (see https://xkcd.com/882/ and http://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/).

    Being able to detect a statistically significant difference between two populations does not necessarily mean that the difference is meaningful. There could be confounding variables at play that make inferring anything about causality difficult. For example, obesity can be caused by one's diet, and one's diet will affect one's microbiome. Thus, the different microbial composition could be a side effect of obesity, and not an actual cause of obesity. Randomized, controlled experiments are required to establish causality.
    Last edited: Aug 9, 2017
  8. Aug 9, 2017 #7

    jim mcnamara


    Staff: Mentor

    In Biology it is possible to see "significance" statistically even when it is not "real" - because what you measured or tested was interfered with (confounded) by something you did not know about. This is all about how you design experiments. And then it is all about how you analyze the data - those assumptions I mentioned earlier. Which is why I left the p values you showed in limbo.
  9. Aug 9, 2017 #8


    Science Advisor

    In frequentist statistics, statistical significance means that there is a low probability that the "null" or "boring" hypothesis will produce the observed result.

    The degree of lowness needed for "significance" for publication in a journal is a matter of convention.

    Statistical significance is not the same as real world or biological significance. For example, the difference in mean between two very large samples drawn from the same distribution will almost certainly be statistically significant. But there is no real significance since the underlying distribution is the same.
  10. Aug 10, 2017 #9


    Science Advisor

    The p values reported aren't a difference!
    Rather, the p value is the probability of observing by chance a number of Bacteroidetes at least as extreme as the one observed, under the null hypothesis that this number is on average the same in obese as in non-obese children. As this probability turns out to be very small, you discard the null hypothesis and instead accept the alternative hypothesis, namely that the proportion of Bacteroidetes differs significantly between obese and normal weight children.
  11. Aug 15, 2017 #10
    Here's the wiki article: https://en.wikipedia.org/wiki/P-value
    DrDu has provided the correct answer. But let me elaborate.

    The topic here is "inferential statistics". That is, to test a hypothesis by collecting evidence and judging to what extent and in what way that evidence supports the conclusion.
    From the article:
    So, given this hypothesis, they should be able to observe a correlation between certain gut bacteria and obesity.

    Let's say that we tested only 2 Egyptians and found this:
    Egyptian 1: obese; harbors Firmicutes; likes Star Wars movies.
    Egyptian 2: not obese; does not harbor Firmicutes; does not like Star Wars movies.

    If we were a bit reckless, we might conclude that both Firmicutes and liking Star Wars movies result in obesity. But we have only collected data on two Egyptians. How confident can we be?

    The answer comes when you consider the "null hypothesis". In this case, there would be two null hypotheses:
    1) There is no correlation between obesity and Firmicutes.
    2) There is no correlation between obesity and liking Star Wars movies.

    From there you ask: How likely is it that I would get results this suggestive given these null hypotheses?
    The answer is the "p" value. In this case, with only 2 Egyptians, it's only about 50/50, so we cannot discount random chance.

    The term "statistical significance" is commonly applied to studies that reach the 90%, 95%, or 99% confidence level (p = 0.10, 0.05, 0.01). The author reports in this study:
    - so he is looking for p <= .05. That means that there is a 5% chance of getting a result that extreme from luck alone. On average, if you performed the experiment 20 times with 20 different groups of people, you would get a p <= 0.05 result about once in those 20 times.

    So, what the researchers did was to collect information from 79 Egyptians:
    Of the 52 obese Egyptians: 45 have Firmicutes, 7 do not.
    Of the 27 normal weight Egyptians: 12 have Firmicutes, 15 do not.

    Given the data above, you can compute the "p" value - usually with a calculator.
    Using this calculator: http://www.socscistatistics.com/tests/ztest/Default2.aspx
    It turns out that the chance that the study results were purely coincidental is p=0.00008 - which is pretty convincing.
    For the Bacteroidetes, the stats were 43/52 and 13/27. When I plugged in those numbers, I get p=0.00132 - not the same as their 0.003.
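    For anyone who wants to reproduce these numbers without the website, here is a sketch of the standard pooled two-proportion z-test (the online calculator may handle rounding or continuity slightly differently, which would explain small differences in the last digits):

```python
import math

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Two-sided two-proportion z-test with a pooled standard error."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Firmicutes: 45 of 52 obese vs 12 of 27 normal weight
print(two_proportion_z(45, 52, 12, 27))  # z ≈ 3.96, p ≈ 7.5e-05
# Bacteroidetes: 43 of 52 obese vs 13 of 27 normal weight
print(two_proportion_z(43, 52, 13, 27))  # z ≈ 3.21, p ≈ 0.0013
```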

    Also: I noticed that at the start of the study they reported 51 obese and 28 normal, but in the statistical calculations they used 52 obese and 27 normal weight. There may be an explanation for this in the study, but I didn't catch it. I also noted that they defined obese as BMI > 30, normal as BMI < 25, and overweight as between those values. In the final tally, only obese and normal were considered; no participants were "overweight". But they did not list "overweight" as one of the disqualifiers for participation in the study.
  12. Aug 15, 2017 #11


    Science Advisor
    Gold Member
    2017 Award

    Some slight word-smithing:
    p ≤ 0.05
    20 times with 20 separate groups of people from the same statistical population
  13. Aug 16, 2017 #12
    Statistical significance is a deprecated statistic (to the extent that some journals even ban its use). See http://amstat.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108?needAccess=true for a good explanation of the reasons. The follow-up https://errorstatistics.files.wordpress.com/2016/03/2_benjamin_berger.pdf gives a better choice which is very easy to calculate and avoids the ubiquitous misinterpretation associated with p-values.

    Note also that the two p-values quoted in the OP are clearly for two separate comparisons. The first compared the amount of Firmicutes in the obese group to the normal weight group and the second compared the amount of Bacteroidetes in the same two groups.
    Last edited: Aug 16, 2017
  14. Aug 16, 2017 #13
    That's really clear...
  15. Aug 17, 2017 #14
    "For comparing categorical data, χ2 test was performed." Your second post shows that the study was simply comparing categories, not continuous values such as "bacterial count" or "obesity index". However, the text you quoted was not clear that the χ2 test was used for these particular inferences. Let's assume that it was. The χ2 test would be the wrong test for comparing the probability of having Firmicutes AND Bacteroidetes. Therefore, the χ2 test could only have been used for the two category tests comparing Obesity AND Firmicutes, and Obesity AND Bacteroidetes. So these tests do NOT bear on your concern regarding the inverse relation between Firmicutes and Bacteroidetes.

    The study data was categorized by having or not having the conditions.
    Lets see what is inferred. Let O = Obese, N = Normal Weight, F = Firmicutes, B = Bacteroidetes, P = Probability of Occurrence, | = Given.

    "The proportions of the phyla Firmicutes and Bacteroidetes were statistically significantly increased in the obese group compared to the normal weight group (p< 0.001, p = 0.003 respectively)." First inference: that P(F | O) > P(F | N) is true in our sample, and that this difference is large enough to state that the two probabilities differ in the real population, with p < .001. This means the difference between these two proportions is "significant": if the two probabilities were really equal, a difference this large would show up by chance less than 0.1% of the time. With that very small p, one can conclude with high confidence that the probability of having Firmicutes is greater if a randomly picked subject is Obese. Estimating precisely how much greater would require the other distribution characteristics, which are not given.

    Since these other χ2 test data are not available, the best one can conclude is that there is a probable relationship between having these bacteria and being obese. What the study does not tell you is whether there is a causal relationship. That is, we cannot answer whether obesity causes vulnerability to these bacteria, or whether these bacteria contribute to obesity. It may be intuitive to professionals that the bacteria are the causal factor. Maybe other parts of the study do answer these questions.
  16. Aug 25, 2017 #15
    I came across an online article discussing p-values which tries to justify this claim:

    If you observe a P value close to 0.05, your false discovery rate will not be 5%. It will be at least 30% and it could easily be 80% for small studies.

    It seems pretty convincing to me but I am by no means an expert with this type of statistical description. As such I'll refrain from paraphrasing the author's arguments and just link the article. Would someone with more knowledge of the field be able to comment on how this applies to this discussion?
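    For what it's worth, the kind of arithmetic behind that claim can be illustrated with a toy calculation. The 10% prior and 80% power below are assumptions chosen purely for illustration, not figures from the article or the study:

```python
# Imagine 1000 hypotheses are tested. Assume 10% describe real effects,
# the test has 80% power, and significance is declared at p < 0.05.
n_tests = 1000
n_real = int(n_tests * 0.10)        # 100 real effects
n_null = n_tests - n_real           # 900 true nulls

true_positives = n_real * 0.80      # 80 real effects detected
false_positives = n_null * 0.05     # 45 nulls wrongly flagged as significant

fdr = false_positives / (false_positives + true_positives)
print(fdr)  # 0.36: over a third of the "significant" findings are false
```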

  17. Aug 25, 2017 #16

    jim mcnamara


    Staff: Mentor

    What he is on about is experimental design. And the effect of bogus interpretation/design on p values. Which is a huge subject.

    An example - suppose you want to test if a known good formulation of medicine can be improved by changing the concentration of "stuff" in the mixture.
    So you have 30 volunteers, all males over the age of 21. You get a small but significant (you think) change based on your statistical analysis. Turns out later that for women and children the new mixture is not better, it is worse. The experimental design stinks. Therefore the data analysis stinks, too.

    This example is horribly oversimplified, but it is along the lines of some of the points the guy is trying to make. I think the article is substandard myself (very brief read). However overstated some of his claims may be, the overall unstated point about design and interpretation of data is okay. His approach is not improved experimental design and analysis, but rather the use of p values much smaller than .05, like .001. Which does not really correct the root cause of the problem. My opinion only.

    In grad school, we were required to take graduate level statistics courses and did extensive labs and homework. I would guess that is still true for most people working in Science.

    From wikipedia: https://en.wikipedia.org/wiki/P-value
    OR, attributed to several people:
  18. Aug 25, 2017 #17


    Science Advisor
    2017 Award

    It is essentially taking a Bayesian approach to interpreting p-values, which is something that many statisticians advocate. For a more intuitive summary, see this graphic from Nature:
    (from https://www.nature.com/news/big-names-in-statistics-want-to-shake-up-much-maligned-p-value-1.22375)
    As you can see, a result with p = 0.05 does not do much to change one's prior beliefs when analyzed in a Bayesian framework. If you thought a hypothesis had a 50% chance of being false prior to the study, a result with p = 0.05 should only decrease this belief to about a 30% chance of being false.
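    A number like that 30% can be reproduced with the minimum Bayes factor bound discussed by Sellke, Bayarri and Berger (and in the Benjamin and Berger piece linked earlier in the thread), BF ≥ -e·p·ln(p). This sketch applies the bound to a 50/50 prior; it is an illustration of the approach, not necessarily the exact calculation behind the Nature graphic:

```python
import math

def min_posterior_null(p_value, prior_null=0.5):
    """Lower bound on the posterior probability that the null is true,
    using the minimum Bayes factor bound -e * p * ln(p), valid for p < 1/e."""
    bf_null = -math.e * p_value * math.log(p_value)  # bound on evidence for the null
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bf_null
    return posterior_odds / (1 + posterior_odds)

print(min_posterior_null(0.05))   # about 0.29: roughly a 30% chance the effect is not real
print(min_posterior_null(0.005))  # about 0.07
```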

    For a more technical discussion of the issue, here's a good piece from the journal, Nature Methods: http://www.nature.com/nmeth/journal/v14/n3/full/nmeth.4210.html
  19. Aug 25, 2017 #18


    Science Advisor
    Gold Member
    2017 Award

    His claim here does not challenge the mathematical statistical statement; it argues with how the confidence level is set and interpreted in the real world. To summarize his argument:
    Suppose the situation and process for what you decided to do a test on has only a 10% probability that there really is a difference. Then a 10% confidence level is too lax -- it would have you making a false claim almost half the time.

    I can't disagree with that. My conclusion is that one should make a judicious selection of the confidence level that is much tighter if the real odds of finding a solution are small. That is what the scientific world does. For instance, the required confidence level in nuclear physics to claim that a new particle has been discovered, is the 5 sigma level, which would occur only once in 3.5 million times just due to luck.

    Another thing that can be done is to do some repeat tests of a positive result. A legitimate positive should be repeatable whereas a false positive will only give a repeat positive at the rate of the confidence level. Again, that is what the scientific world does.

    Summary: If someone claims that he has discovered an alien from space and that it passed a test at the 10% confidence level, ask him to set a much stricter confidence level and insist on independent repeat testing.
    Last edited: Aug 25, 2017
  20. Aug 25, 2017 #19

    jim mcnamara


    Staff: Mentor

  21. Aug 25, 2017 #20
    *Vannay* is very close. His reference is correct. The p value has to do with the hypothesis supporting the study, not with the probability of a false positive or a false negative. With p < .0009, the null hypothesis is rejected, so the theory is very likely correct. But the data for the hypothesis were structured by class, not by an actual continuous variable, so computing a false-positive rate is not possible with these data (perhaps other parts of the study allow it).
  22. Aug 26, 2017 #21


    Science Advisor
    Gold Member
    2017 Award

    The p value is the probability of a false positive given that the null hypothesis is true. It is a hard mathematical fact. The p value is the only result that can be realistically calculated with mathematical certainty. The rest is up to the scientist to interpret judiciously.

    Alternatively, it is very speculative and debatable to try to expand that to a probability in general.

    I would prefer to start with a hard mathematical fact and interpret it judiciously than to muddy the picture immediately with unsubstantiated and debatable opinion regarding the overall probability of a particular result.
    Last edited: Aug 26, 2017