How far and how close to p=0.05 for statistical significance?

  • Thread starter fog37
  • Tags
    P-value
  • #1
fog37
TL;DR Summary
How far and how close to p=0.05 for statistical significance...
Hello Forum,

I understand what the p-value represents and how it is calculated in a statistical hypothesis test. In general, the p-value threshold is set to 0.05, i.e. 5%, which means that the null hypothesis is rejected 5 times out of 100 even when it is true. Or, equivalently, that the sample statistics, assuming the null hypothesis is true, are extremely rare (if p<0.05), leading us to reject H0...
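For concreteness, here is a minimal sketch of how such a p-value comes out of a test in practice (assuming Python with scipy; the sample data and the hypothesized mean of 0 are invented purely for illustration):

```python
import numpy as np
from scipy import stats

# Invented sample; H0: the population mean is 0
sample = np.array([0.3, -0.1, 0.8, 0.5, 0.2, 0.9, -0.4, 0.6])

# One-sample t-test of H0 against a two-sided alternative
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

# Compare against the conventional 0.05 threshold
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```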

What if our p value is just 0.057? Do we keep H0? What if p was 0.049? Would we reject H0? I guess I am asking how far the calculated p-value must be from the 0.05 threshold for the results to be either statistically significant or not...

Thank you!
 
  • #2
The choice of a confidence level depends on the subject area and the seriousness of a mistake. An extreme example is in physics where the issue is the claim of discovering a new nuclear particle. There, they typically insist on a standard of 5 sigma, which corresponds to percentages of 0.00006% or 0.00003% (two sided or one sided). See this CERN post.
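For reference, here is a quick sketch of where those percentages come from, computed as normal tail probabilities beyond 5 sigma (assuming Python with scipy):

```python
from scipy import stats

# Probability of a normal variable landing beyond 5 standard deviations
one_sided = stats.norm.sf(5)       # ~2.9e-7, i.e. ~0.00003%
two_sided = 2 * stats.norm.sf(5)   # ~5.7e-7, i.e. ~0.00006%

print(f"one-sided: {one_sided:.2e} ({100 * one_sided:.5f}%)")
print(f"two-sided: {two_sided:.2e} ({100 * two_sided:.5f}%)")
```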
In your example, where the p-value lands right around 0.05, remember that the goal is to be able to convince others, who may be skeptical of the alternative to the null hypothesis. You can go either way, but expect some resistance from others.
 
  • #3
fog37 said:
TL;DR Summary: How far and how close to p=0.05 for statistical significance...

What if our p value is just 0.057? Do we keep H0? What if p was 0.049? Would we reject H0?
Unfortunately, p values are very misused, and the magical 0.05 threshold especially so. A low p value is evidence against a null hypothesis, but null hypotheses are almost never actually believable and are rarely of interest.

One of the biggest misinterpretations of p values is that a small p value is evidence in favor of some scientific hypothesis of interest. Or that a small p value indicates a large or important effect.

Regarding your specific question, I usually consider all of the evidence, but I typically would not find a p value of 0.049 to be very persuasive even though it is significant.
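To illustrate the effect-size point above: with a large enough sample, even a negligible effect produces a tiny p-value. A rough sketch, assuming Python with numpy/scipy; the effect of 0.01 standard deviations and the sample size of one million are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A "real" but practically negligible effect: true mean 0.01, sd 1
n = 1_000_000
sample = rng.normal(loc=0.01, scale=1.0, size=n)

# The p-value against H0: mean == 0 is tiny despite the trivial effect
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"p = {p_value:.2e}, observed mean = {sample.mean():.4f}")
```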
 
  • Like
Likes fog37
  • #4
Dale said:
Unfortunately, p values are very misused, and the magical 0.05 threshold especially so. A low p value is evidence against a null hypothesis, but null hypotheses are almost never actually believable and are rarely of interest.

One of the biggest misinterpretations of p values is that a small p value is evidence in favor of some scientific hypothesis of interest. Or that a small p value indicates a large or important effect.

Regarding your specific question, I usually consider all of the evidence, but I typically would not find a p value of 0.049 to be very persuasive even though it is significant.
I see and suspected that...thank you. So what is the alternative when we are working with a sample of size n and need to see if our estimates are reasonable and similar to the population parameters?
 
  • #5
fog37 said:
I see and suspected that...thank you. So what is the alternative when we are working with a sample of size n and need to see if our estimates are reasonable and similar to the population parameters?
Whenever possible I prefer to use Bayesian methods. I have a few insights articles on them. This one is the most relevant one to your question, but there are three others too if you are interested

https://www.physicsforums.com/insights/how-bayesian-inference-works-in-the-context-of-science/
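To give a flavor of what a Bayesian summary can look like, here is a minimal beta-binomial sketch (Python with scipy; the data, the flat prior, and the question asked are all invented for illustration and are not taken from the linked article):

```python
from scipy import stats

# Invented data: 27 successes in 50 trials; flat Beta(1, 1) prior on the rate
successes, trials = 27, 50
posterior = stats.beta(1 + successes, 1 + (trials - successes))

# Instead of a point-null test, report the posterior distribution directly
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
print("P(rate > 0.5 | data):", posterior.sf(0.5))
```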
 
Last edited:
  • #7
You are beginning to explore the art of statistics. It may be a science, but it is also an art.
 
  • #8
Dale said:
A low p value is evidence against a null hypothesis, but null hypotheses are almost never actually believable and are rarely of interest.

Suppose a grad student wants to prove that exposure to the music of Led Zeppelin increases the sexual potency of rats. The null hypothesis is that this is not so. I find this believable. Usually the null hypothesis is that nothing of interest is going on. I would say that in general this is believable. The grad student's hope is that the null hypothesis will be rejected due to a low probability that the observed increased virility is an artifact of random chance.

Dale said:
One of the biggest misinterpretations of p values is that a small p value is evidence in favor of some scientific hypothesis of interest.

I share this misinterpretation, assuming an experiment is properly designed. A small p value suggests that the sought-for effect is real. Perhaps there is something I am missing. Of course all this depends on proper application of statistical methods.
 
  • Like
Likes FactChecker
  • #10
Hornbein said:
Suppose a grad student wants to prove that exposure to the music of Led Zeppelin increases the sexual potency of rats. The null hypothesis is that this is not so. I find this believable.
This is not a typical null hypothesis. This would be called an alternative hypothesis. So the hypothesis of interest is that the effect is positive, and the alternative hypothesis is that the effect is non-positive (negative or zero). The null hypothesis is that there is no effect, i.e. that the effect is exactly zero.

A point hypothesis is generally not believable. If a parameter, like the effect size, is continuous then the chance that it assumes a specific single value vanishes.

Nevertheless, unbelievable null hypotheses are used because they allow easy calculation of the probability of the observed data under the point hypothesis. In other words it is easy to calculate ##P(D|H)## where ##D## is the data and ##H## is the hypothesis if ##H## is a point hypothesis. For your example it would be just as difficult to calculate ##P(D|H)## for your experimental hypothesis as it is for your alternative hypothesis. There would be no utility in that alternative hypothesis.
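As a toy illustration of why a point null is computationally convenient (Python with scipy; a made-up coin-flip style example, not the rat experiment): under a point hypothesis the sampling distribution is fully specified, so ##P(D|H)## is a single number, while a composite hypothesis leaves a whole range of distributions to contend with.

```python
from scipy import stats

# Made-up data: 34 "successes" in 50 trials
k, n = 34, 50

# Under the point null H0: rate == 0.5, P(D|H0) is a single number
p_data_given_h0 = stats.binom.pmf(k, n, 0.5)

# One-sided p-value: probability of data at least this extreme under H0
p_value = stats.binom.sf(k - 1, n, 0.5)

print(f"P(D|H0) = {p_data_given_h0:.4f}, one-sided p = {p_value:.4f}")

# For a composite hypothesis such as "rate <= 0.5" there is no single
# P(D|H): it depends on which rate in that range you plug in.
```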

So typically your grad student would compare the data to the unbelievable null hypothesis, show that the data is unlikely to have arisen by chance under the null hypothesis, also show that the average effect is positive, and then claim that is evidence supporting the experimental hypothesis.

Hornbein said:
I share this misinterpretation, assuming an experiment is properly designed. A small p value suggests that the sought-for effect is real. Perhaps there is something I am missing. Of course all this depends on proper application of statistical methods.
A small p-value indicates only that the observed data is unlikely to have arisen by chance under the null hypothesis. Any other inference is suspect.

That the observed data is unlikely to have arisen by chance under the null hypothesis does not itself indicate anything about the experimental hypothesis. The null hypothesis could be true and the experimenter just was unlucky. The null hypothesis could be true but the sampling non-random. The null hypothesis and the experimental hypothesis could both be false together. The experimental hypothesis could be one of many experimental hypotheses and multiple comparisons were not considered. Etc.
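A rough simulation of the multiple-comparisons point (Python with numpy/scipy; the 20 tests, 30 observations each, and the number of simulated experiments are arbitrary): even when every null hypothesis is exactly true, the chance of at least one p < 0.05 among 20 independent tests is about ##1 - 0.95^{20} \approx 0.64##.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_tests, n_obs = 10_000, 20, 30

# Every one of the 20 null hypotheses is exactly true (mean really is 0)
data = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_tests, n_obs))
p_values = stats.ttest_1samp(data, popmean=0.0, axis=2).pvalue

# Fraction of experiments with at least one "significant" result
print(np.mean(np.any(p_values < 0.05, axis=1)))   # roughly 0.64
```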

You are by far not alone in your misinterpretation. That is one of the biggest problems with p values.

It is actually kind of sad because when we take statistics they very carefully explain that you say "we reject the null hypothesis" and never that we "accept the experimental hypothesis". In statistics class people are told that the test just rejects the null hypothesis and does not support the experimental hypothesis. And then we publish our first scientific paper and in the results we reject the null hypothesis as we were taught in statistics class, and then immediately in the discussion section we accept the experimental hypothesis anyway.
 
  • Informative
  • Like
Likes hutchphd, fog37 and berkeman
  • #11
It's not wise to lock yourself into a rigid set of statements like "If p < .05 then..., otherwise if p >= .05 then ...". As others have pointed out, that isn't what p-values do, and it's not really how Fisher and other early practitioners thought they should be used. Looking back at some comments from Fisher:

"In 1926, as one of Fisher's early statements endorsing a p value of 0.05 as a boundary, he wrote: “…it is convenient [emphasis added] to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials’.”17 In 1956, Fisher wrote: “[…] no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”

We don't base decisions on the results of single calculations: small p-values might indicate a particular H0 isn't true, but there are many reasons a null can qualify to be rejected besides its being false. You should also look at the quality of the data, whether you've really asked the correct question, confidence intervals (use a confidence level that corresponds to your test's significance level, and don't make the mistake, as too many new students do, of referring to the alpha you use in a test as a confidence level: it isn't, it's the test's significance level), and so on.
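As a sketch of the correspondence between tests and confidence intervals mentioned above (Python with numpy/scipy; the sample is made up): a two-sided test of H0: mean = 0 at the 0.05 significance level rejects exactly when the 95% confidence interval excludes 0.

```python
import numpy as np
from scipy import stats

# Made-up sample
x = np.array([0.4, 1.1, -0.2, 0.9, 0.5, 1.3, 0.7, 0.2, 0.8, 0.6])

mean = x.mean()
sem = x.std(ddof=1) / np.sqrt(len(x))
t_crit = stats.t.ppf(0.975, df=len(x) - 1)   # critical value for 95% confidence

ci = (mean - t_crit * sem, mean + t_crit * sem)
_, p = stats.ttest_1samp(x, popmean=0.0)

print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}), p = {p:.4f}")
# The interval excludes 0 if and only if p < 0.05 for this two-sided test.
```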
 
  • Like
Likes fog37 and FactChecker
  • #12
Further, I like to think of the p-value as a measure of significance (not a bright-line test of significance). In this it conforms to my subjective method of making personal decisions, where the quality of any input is always evaluated and colors the significance of that particular "fact".

It is also useful that ±2σ and 95% approximately correspond.
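A quick check of that correspondence (Python with scipy): the exact two-sided 95% cutoff of a normal distribution is about 1.96σ, so ±2σ is a slightly conservative round figure.

```python
from scipy import stats

print(stats.norm.ppf(0.975))    # ~1.96: exact two-sided 95% cutoff
print(2 * stats.norm.sf(2.0))   # ~0.0455: two-sided tail beyond 2 sigma
```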
 
Last edited:
  • Like
Likes fog37 and FactChecker
  • #13
fog, you may want to look up terms like p-hacking and the power of a test.
 
  • Like
Likes fog37 and Dale
  • #14
statdad said:
"In 1926, as one of Fisher's early statements endorsing a p value of 0.05 as a boundary, he wrote: “…it is convenient [emphasis added] to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials’.”17 In 1956, Fisher wrote: “[…] no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
Exactly. The real question is whether it is wise to make the claim of the alternative hypothesis if that claim might be wrong one time out of 20. In some cases, that might be fine and in other cases that might be terrible. That is why high energy physics claims are required to have a "5-sigma" (wrong once in every 1.7 million double tail, or once in every 3.5 million single tail) level of significance.
 
  • Like
Likes fog37
  • #15
Of course any possible deviation from a Gaussian normal distribution will make these low-probability significance estimates wildly speculative. There is a point where it becomes silly. (Space shuttle failure estimates of 1 in 10^5 per flight come to mind.)
 
  • Like
Likes fog37 and FactChecker
  • #16
hutchphd said:
Of course any possible deviation from a Gaussian normal distribution will make these low-probability significance estimates wildly speculative. There is a point where it becomes silly. (Space shuttle failure estimates of 1 in 10^5 per flight come to mind.)
I think that we should distinguish between the reliability of the statistical theory versus the reliability of the model assumptions. In the case of particle physics, where the assumptions only depend on physics, the 5-sigma results may be very reliable. On the other hand, in cases like the space shuttle, where the assumptions depend on upper management being non-political, I wouldn't count too much on any result that was less than 1/10.
 
  • Like
Likes hutchphd
  • #17
fog37 said:
What if p was 0.049?
What if it were 0.048?
What if it were 0.047?
What if it were 0.051?

If you draw a line, stick to it. Don't go changing it after the fact to get the answer you want.
 
  • Like
Likes FactChecker
  • #18
Vanadium 50 said:
What if it were 0.048?
What if it were 0.047?
What if it were 0.051?

If you draw a line, stick to it. Don't go changing it after the fact to get the answer you want.
Or don't draw the line
 
  • Like
Likes hutchphd
  • #19
Dale said:
Or don't draw the line
Sometimes decisions must be made.
 
  • #20
FactChecker said:
Sometimes decisions must be made.
Sure, but you can base decisions on an aggregate of available relevant information rather than a single artificial line that generally is not even relevant to the decision being made.
 
  • Like
Likes hutchphd and FactChecker
  • #21
Dale said:
Sure, but you can base decisions on an aggregate of available relevant information rather than a single artificial line that generally is not even relevant to the decision being made.
Good point. But the "aggregate of available relevant information" is often just as questionable (or more so) as the statistical results. That is often why the statistical analysis was asked for in the first place. The world is messy.
 
  • #22
FactChecker said:
Good point. But the "aggregate of available relevant information" is often just as questionable (or more so) as the statistical results.
I think maybe I was unclear. I am talking about all of “the statistical results” when I say “aggregate of available relevant information”, as opposed to “the p-value” as a single line.
 
  • Like
Likes FactChecker
  • #23
hutchphd said:
Of course any possible deviation from a Gaussian normal distribution will make these low-probability significance estimates wildly speculative. There is a point where it becomes silly. (Space shuttle failure estimates of 1 in 10^5 per flight come to mind.)
It's important to remember that there is no such thing as truly Gaussian data: every application of that distribution is an approximation; the only question is how drastic the approximation is.
 
  • Like
Likes hutchphd

1. What does p=0.05 mean in statistical testing?

The p-value of 0.05 is a conventional threshold used to determine statistical significance in hypothesis testing. It indicates that there is a 5% probability of observing the data, or something more extreme, if the null hypothesis is true. If the p-value is less than or equal to 0.05, the results are considered statistically significant, suggesting that the observed effect is unlikely to have occurred by chance alone.

2. Is p=0.05 still the standard for determining statistical significance?

While p=0.05 has been traditionally used as a standard threshold for statistical significance, there is growing debate and criticism regarding its use. Many researchers argue for a more flexible approach to p-values, sometimes suggesting lower thresholds like 0.01 to reduce the rate of false positives, or advocating for the use of confidence intervals and effect sizes as more informative statistical measures.

3. How should results be interpreted when p-values are close to 0.05?

When a p-value is close to 0.05, the results should be interpreted with caution. A p-value slightly below 0.05 does not necessarily mean strong evidence against the null hypothesis, just as a p-value slightly above 0.05 does not mean the absence of any effect. Researchers are encouraged to consider the context, the robustness of the methodology, and the possibility of replication to strengthen the interpretation of such borderline results.

4. What happens if the p-value is exactly 0.05?

If the p-value is exactly 0.05, it means that there is exactly a 5% probability of observing the data, or something more extreme, under the assumption that the null hypothesis is true. This is often considered the borderline for declaring statistical significance. However, declaring significance at this exact point should be done with understanding of its limitations and the potential for error, particularly Type I error (false positives).

5. Can we use a p-value other than 0.05 to determine statistical significance?

Yes, researchers can use different p-value thresholds to determine statistical significance based on the context of the study, the field of research, and the specific risks associated with Type I and Type II errors. Some fields may require more stringent criteria (e.g., p=0.01 or even lower), especially in cases where the consequences of errors are significant. Ultimately, the choice of p-value should be justified based on the study design, the quality of the data, and the overall research objectives.
