How far and how close to p=0.05 for statistical significance?

  • Thread starter fog37
  • Tags
    P-value
  • #1
fog37
TL;DR Summary
How far and how close to p=0.05 for statistical significance...
Hello Forum,

I understand what the p-value represents and how it is calculated in a statistical hypothesis test. In general, the p-value threshold is set to 0.05, i.e. 5%, which means that the null hypothesis is rejected 5 times out of 100 even when it is true. Or, equivalently, that the sample statistics, assuming the null hypothesis is true, are extremely rare (if p<0.05), leading us to reject H0...
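For concreteness, here is a minimal sketch of how such a p-value comes out of a test in practice (assuming Python with scipy; the sample data and the hypothesized mean of 0 are invented purely for illustration):

```python
import numpy as np
from scipy import stats

# Invented sample; H0: the population mean is 0
sample = np.array([0.3, -0.1, 0.8, 0.5, 0.2, 0.9, -0.4, 0.6])

# One-sample t-test of H0 against a two-sided alternative
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

# Compare against the conventional 0.05 threshold
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```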

What if our p value is just 0.057? Do we keep H0? What if p was 0.049? Would we reject H0? I guess I am asking how far the calculated p-value must be from the 0.05 threshold for the results to be either statistically significant or not...

Thank you!
 
  • #2
The choice of a confidence level depends on the subject area and the seriousness of a mistake. An extreme example is in physics where the issue is the claim of discovering a new nuclear particle. There, they typically insist on a standard of 5 sigma, which corresponds to percentages of 0.00006% or 0.00003% (two sided or one sided). See this CERN post.
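For reference, here is a quick sketch of where those percentages come from, computed as normal tail probabilities beyond 5 sigma (assuming Python with scipy):

```python
from scipy import stats

# Probability of a normal variable landing beyond 5 standard deviations
one_sided = stats.norm.sf(5)       # ~2.9e-7, i.e. ~0.00003%
two_sided = 2 * stats.norm.sf(5)   # ~5.7e-7, i.e. ~0.00006%

print(f"one-sided: {one_sided:.2e} ({100 * one_sided:.5f}%)")
print(f"two-sided: {two_sided:.2e} ({100 * two_sided:.5f}%)")
```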
In your example, where the p-value lands right around 0.05, remember that the goal is to be able to convince others, who may be skeptical of the alternative to the null hypothesis. You can go either way, but expect some resistance from others.
 
  • #3
fog37 said:
TL;DR Summary: How far and how close to p=0.05 for statistical significance...

What if our p value is just 0.057? Do we keep H0? What if p was 0.049? Would we reject H0?
Unfortunately, p values are very misused, and the magical 0.05 threshold especially so. A low p value is evidence against a null hypothesis, but null hypotheses are almost never actually believable and are rarely of interest.

One of the biggest misinterpretations of p values is that a small p value is evidence in favor of some scientific hypothesis of interest. Or that a small p value indicates a large or important effect.

Regarding your specific question, I usually consider all of the evidence, but I typically would not find a p value of 0.049 to be very persuasive even though it is significant.
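To illustrate the effect-size point above: with a large enough sample, even a negligible effect produces a tiny p-value. A rough sketch, assuming Python with numpy/scipy; the effect of 0.01 standard deviations and the sample size of one million are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A "real" but practically negligible effect: true mean 0.01, sd 1
n = 1_000_000
sample = rng.normal(loc=0.01, scale=1.0, size=n)

# The p-value against H0: mean == 0 is tiny despite the trivial effect
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"p = {p_value:.2e}, observed mean = {sample.mean():.4f}")
```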
 
  • Like
Likes fog37
  • #4
Dale said:
Unfortunately, p values are very misused, and the magical 0.05 threshold especially so. A low p value is evidence against a null hypothesis, but null hypotheses are almost never actually believable and are rarely of interest.

One of the biggest misinterpretations of p values is that a small p value is evidence in favor of some scientific hypothesis of interest. Or that a small p value indicates a large or important effect.

Regarding your specific question, I usually consider all of the evidence, but I typically would not find a p value of 0.049 to be very persuasive even though it is significant.
I see and suspected that...thank you. So what is the alternative when we are working with a sample of size n and need to see if our estimates are reasonable and similar to the population parameters?
 
  • #5
fog37 said:
I see and suspected that...thank you. So what is the alternative when we are working with a sample of size n and need to see if our estimates are reasonable and similar to the population parameters?
Whenever possible I prefer to use Bayesian methods. I have a few insights articles on them. This one is the most relevant one to your question, but there are three others too if you are interested

https://www.physicsforums.com/insights/how-bayesian-inference-works-in-the-context-of-science/
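To give a flavor of what a Bayesian summary can look like, here is a minimal beta-binomial sketch (Python with scipy; the data, the flat prior, and the question asked are all invented for illustration and are not taken from the linked article):

```python
from scipy import stats

# Invented data: 27 successes in 50 trials; flat Beta(1, 1) prior on the rate
successes, trials = 27, 50
posterior = stats.beta(1 + successes, 1 + (trials - successes))

# Instead of a point-null test, report the posterior distribution directly
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
print("P(rate > 0.5 | data):", posterior.sf(0.5))
```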
 
Last edited:
  • #7
You are beginning to explore the art of statistics. It may be a science, but it is also an art.
 
  • #8
Dale said:
A low p value is evidence against a null hypothesis, but null hypotheses are almost never actually believable and are rarely of interest.

Suppose a grad student wants to prove that exposure to the music of Led Zeppelin increases the sexual potency of rats. The null hypothesis is that this is not so. I find this believable. Usually the null hypothesis is that nothing of interest is going on. I would say that in general this is believable. The grad student's hope is that the null hypothesis will be rejected due to a low probability that the observed increased virility is an artifact of random chance.

Dale said:
One of the biggest misinterpretations of p values is that a small p value is evidence in favor of some scientific hypothesis of interest.

I share this misinterpretation, assuming an experiment is properly designed. A small p value suggests that the sought-for effect is real. Perhaps there is something I am missing. Of course all this depends on proper application of statistical methods.
 
  • Like
Likes FactChecker
  • #10
Hornbein said:
Suppose a grad student wants to prove that exposure to the music of Led Zeppelin increases the sexual potency of rats. The null hypothesis is that this is not so. I find this believable.
This is not a typical null hypothesis. This would be called an alternative hypothesis. So the hypothesis of interest is that the effect is positive, and the alternative hypothesis is that the effect is non-positive (negative or zero). The null hypothesis is that there is no effect, i.e. that the effect is exactly zero.

A point hypothesis is generally not believable. If a parameter, like the effect size, is continuous then the chance that it assumes a specific single value vanishes.

Nevertheless, unbelievable null hypotheses are used because they allow easy calculation of the probability of the observed data under the point hypothesis. In other words it is easy to calculate ##P(D|H)## where ##D## is the data and ##H## is the hypothesis if ##H## is a point hypothesis. For your example it would be just as difficult to calculate ##P(D|H)## for your experimental hypothesis as it is for your alternative hypothesis. There would be no utility in that alternative hypothesis.
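As a toy illustration of why a point null is computationally convenient (Python with scipy; a made-up coin-flip style example, not the rat experiment): under a point hypothesis the sampling distribution is fully specified, so ##P(D|H)## is a single number, while a composite hypothesis leaves a whole range of distributions to contend with.

```python
from scipy import stats

# Made-up data: 34 "successes" in 50 trials
k, n = 34, 50

# Under the point null H0: rate == 0.5, P(D|H0) is a single number
p_data_given_h0 = stats.binom.pmf(k, n, 0.5)

# One-sided p-value: probability of data at least this extreme under H0
p_value = stats.binom.sf(k - 1, n, 0.5)

print(f"P(D|H0) = {p_data_given_h0:.4f}, one-sided p = {p_value:.4f}")

# For a composite hypothesis such as "rate <= 0.5" there is no single
# P(D|H): it depends on which rate in that range you plug in.
```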

So typically your grad student would compare the data to the unbelievable null hypothesis, show that the data is unlikely to have arisen by chance under the null hypothesis, also show that the average effect is positive, and then claim that is evidence supporting the experimental hypothesis.

Hornbein said:
I share this misinterpretation, assuming an experiment is properly designed. A small p value suggests that the sought-for effect is real. Perhaps there is something I am missing. Of course all this depends on proper application of statistical methods.
A small p-value indicates only that the observed data is unlikely to have arisen by chance under the null hypothesis. Any other inference is suspect.

That the observed data is unlikely to have arisen by chance under the null hypothesis does not itself indicate anything about the experimental hypothesis. The null hypothesis could be true and the experimenter just was unlucky. The null hypothesis could be true but the sampling non-random. The null hypothesis and the experimental hypothesis could both be false together. The experimental hypothesis could be one of many experimental hypotheses and multiple comparisons were not considered. Etc.
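A rough simulation of the multiple-comparisons point (Python with numpy/scipy; the 20 tests, 30 observations each, and the number of simulated experiments are arbitrary): even when every null hypothesis is exactly true, the chance of at least one p < 0.05 among 20 independent tests is about ##1 - 0.95^{20} \approx 0.64##.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_tests, n_obs = 10_000, 20, 30

# Every one of the 20 null hypotheses is exactly true (mean really is 0)
data = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_tests, n_obs))
p_values = stats.ttest_1samp(data, popmean=0.0, axis=2).pvalue

# Fraction of experiments with at least one "significant" result
print(np.mean(np.any(p_values < 0.05, axis=1)))   # roughly 0.64
```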

You are by far not alone in your misinterpretation. That is one of the biggest problems with p values.

It is actually kind of sad because when we take statistics they very carefully explain that you say "we reject the null hypothesis" and never that we "accept the experimental hypothesis". In statistics class people are told that the test just rejects the null hypothesis and does not support the experimental hypothesis. And then we publish our first scientific paper and in the results we reject the null hypothesis as we were taught in statistics class, and then immediately in the discussion section we accept the experimental hypothesis anyway.
 
  • Informative
  • Like
Likes hutchphd, fog37 and berkeman
  • #11
It's not wise to lock yourself into a rigid set of statements like "If p < .05 then..., otherwise if p >= .05 then ...". As others have pointed out, that isn't what p-values do, and it's not really how Fisher and other early practitioners thought they should be used. Looking back at some comments from Fisher:

"In 1926, as one of Fisher's early statements endorsing a p value of 0.05 as a boundary, he wrote: “…it is convenient [emphasis added] to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials’.”17 In 1956, Fisher wrote: “[…] no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”

We don't base decisions on the results of single calculations: small p-values might indicate a particular H0 isn't true, but there are many reasons a null can qualify to be rejected besides its being false. You should also look at the quality of the data, whether you've really asked the correct question, confidence intervals (use a confidence level that corresponds to your test's significance level, and don't make the mistake, as too many new students do, of referring to the alpha you use in a test as a confidence level: it isn't, it's the test's significance level), and so on.
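As a sketch of the correspondence between tests and confidence intervals mentioned above (Python with numpy/scipy; the sample is made up): a two-sided test of H0: mean = 0 at the 0.05 significance level rejects exactly when the 95% confidence interval excludes 0.

```python
import numpy as np
from scipy import stats

# Made-up sample
x = np.array([0.4, 1.1, -0.2, 0.9, 0.5, 1.3, 0.7, 0.2, 0.8, 0.6])

mean = x.mean()
sem = x.std(ddof=1) / np.sqrt(len(x))
t_crit = stats.t.ppf(0.975, df=len(x) - 1)   # critical value for 95% confidence

ci = (mean - t_crit * sem, mean + t_crit * sem)
_, p = stats.ttest_1samp(x, popmean=0.0)

print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}), p = {p:.4f}")
# The interval excludes 0 if and only if p < 0.05 for this two-sided test.
```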
 
  • Like
Likes fog37 and FactChecker
  • #12
Further, I like to think of the p-value as a measure of significance (not a bright-line test of significance). In this it conforms to my subjective method of making personal decisions, where the quality of any input is always evaluated and colors the significance of that particular "fact".

It is also useful that ±2σ and 95% approximately correspond.
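A quick check of that correspondence (Python with scipy): the exact two-sided 95% cutoff of a normal distribution is about 1.96σ, so ±2σ is a slightly conservative round figure.

```python
from scipy import stats

print(stats.norm.ppf(0.975))    # ~1.96: exact two-sided 95% cutoff
print(2 * stats.norm.sf(2.0))   # ~0.0455: two-sided tail beyond 2 sigma
```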
 
Last edited:
  • Like
Likes fog37 and FactChecker
  • #13
fog, you may want to look up terms like p-hacking and the power of a test.
 
  • Like
Likes fog37 and Dale
  • #14
statdad said:
"In 1926, as one of Fisher's early statements endorsing a p value of 0.05 as a boundary, he wrote: “…it is convenient [emphasis added] to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials’.”17 In 1956, Fisher wrote: “[…] no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
Exactly. The real question is whether it is wise to make the claim of the alternative hypothesis if that claim might be wrong one time out of 20. In some cases, that might be fine and in other cases that might be terrible. That is why high energy physics claims are required to have a "5-sigma" (wrong once in every 1.7 million double tail, or once in every 3.5 million single tail) level of significance.
 
  • Like
Likes fog37
  • #15
Of course any possible deviation from a Gaussian normal distribution will make these low-probability significance estimates wildly speculative. There is a point where it becomes silly. (Space shuttle failure estimates of 1 in 10^5 per flight come to mind.)
 
  • Like
Likes fog37 and FactChecker
  • #16
hutchphd said:
Of course any possible deviation from a Gaussian normal distribution will make these low-probability significance estimates wildly speculative. There is a point where it becomes silly. (Space shuttle failure estimates of 1 in 10^5 per flight come to mind.)
I think that we should distinguish between the reliability of the statistical theory versus the reliability of the model assumptions. In the case of particle physics, where the assumptions only depend on physics, the 5-sigma results may be very reliable. On the other hand, in cases like the space shuttle, where the assumptions depend on upper management being non-political, I wouldn't count too much on any result that was less than 1/10.
 
  • Like
Likes hutchphd
  • #17
fog37 said:
What if p was 0.049?
What if it were 0.048?
What if it were 0.047?
What if it were 0.051?

If you draw a line, stick to it. Don't go changing it after the fact to get the answer you want.
 
  • Like
Likes FactChecker
  • #18
Vanadium 50 said:
What if it were 0.048?
What if it were 0.047?
What if it were 0.051?

If you draw a line, stick to it. Don't go changing it after the fact to get the answer you want.
Or don't draw the line
 
  • Like
Likes hutchphd
  • #19
Dale said:
Or don't draw the line
Sometimes decisions must be made.
 
  • #20
FactChecker said:
Sometimes decisions must be made.
Sure, but you can base decisions on an aggregate of available relevant information rather than a single artificial line that generally is not even relevant to the decision being made.
 
  • Like
Likes hutchphd and FactChecker
  • #21
Dale said:
Sure, but you can base decisions on an aggregate of available relevant information rather than a single artificial line that generally is not even relevant to the decision being made.
Good point. But the "aggregate of available relevant information" is often just as questionable (or more so) as the statistical results. That is often why the statistical analysis was asked for in the first place. The world is messy.
 
  • #22
FactChecker said:
Good point. But the "aggregate of available relevant information" is often just as questionable (or more so) as the statistical results.
I think maybe I was unclear. I am talking about all of “the statistical results” when I say “aggregate of available relevant information”, as opposed to “the p-value” as a single line.
 
  • Like
Likes FactChecker
  • #23
hutchphd said:
Of course any possible deviation from a Gaussian normal distribution will make these low-probability significance estimates wildly speculative. There is a point where it becomes silly. (Space shuttle failure estimates of 1 in 10^5 per flight come to mind.)
It's important to remember that there is no such thing as truly Gaussian data: every application of that distribution is an approximation; the only question is how drastic the approximation is.
 
  • Like
Likes hutchphd

1. What does p=0.05 mean in statistical testing?

The p-value of 0.05 is a conventional threshold used to determine statistical significance in hypothesis testing. It indicates that there is a 5% probability of observing the data, or something more extreme, if the null hypothesis is true. If the p-value is less than or equal to 0.05, the results are considered statistically significant, suggesting that the observed effect is unlikely to have occurred by chance alone.

2. Is p=0.05 still the standard for determining statistical significance?

While p=0.05 has been traditionally used as a standard threshold for statistical significance, there is growing debate and criticism regarding its use. Many researchers argue for a more flexible approach to p-values, sometimes suggesting lower thresholds like 0.01 to reduce the rate of false positives, or advocating for the use of confidence intervals and effect sizes as more informative statistical measures.

3. How should results be interpreted when p-values are close to 0.05?

When a p-value is close to 0.05, the results should be interpreted with caution. A p-value slightly below 0.05 does not necessarily mean strong evidence against the null hypothesis, just as a p-value slightly above 0.05 does not mean the absence of any effect. Researchers are encouraged to consider the context, the robustness of the methodology, and the possibility of replication to strengthen the interpretation of such borderline results.

4. What happens if the p-value is exactly 0.05?

If the p-value is exactly 0.05, it means that there is exactly a 5% probability of observing the data, or something more extreme, under the assumption that the null hypothesis is true. This is often considered the borderline for declaring statistical significance. However, declaring significance at this exact point should be done with understanding of its limitations and the potential for error, particularly Type I error (false positives).

5. Can we use a p-value other than 0.05 to determine statistical significance?

Yes, researchers can use different p-value thresholds to determine statistical significance based on the context of the study, the field of research, and the specific risks associated with Type I and Type II errors. Some fields may require more stringent criteria (e.g., p=0.01 or even lower), especially in cases where the consequences of errors are significant. Ultimately, the choice of p-value should be justified based on the study design, the quality of the data, and the overall research objectives.
