Chebyshev inequality, confidence intervals, etc

In summary, the conversation discusses two different approaches to determining the proportion of observations within a certain number of standard deviations from the mean. The first approach, known as Chebyshev's inequality, assumes only a finite variance and provides wider intervals compared to the second approach, which assumes a normal distribution and provides tighter intervals. The conversation also emphasizes the importance of paying attention to the assumptions in statistics.
  • #1
Vital
Hello.

I am bewildered by so many different notions of probability distribution percentages, i.e. the proportion of values that lie within certain standard deviations from the mean.

(1) There is Chebyshev's inequality:
- for any distribution with finite variance, the proportion of the observations within k standard deviations of the arithmetic mean is at least 1 − 1/k² for all k > 1. Below X is the mean.

k = 1.25 => X ± 1.25s => proportion 1 - 1/(1.25)^2 = 36% => 36% of observations lie within 36% from the mean (hence 18% below the mean and 18% above the mean)
k = 1.50 => X ± 1.5s => proportion 56% => 56% of observations lie within 56% from the mean (hence 28% below the mean and 28% above the mean)
k = 2 => X ± 2.0s => proportion 75% => 75% of observations lie within 75% from the mean (hence 37.5% below the mean and 37.5% above the mean)
k = 2.50 => X ± 2.5s => proportion 84% => 84% of observations lie within 84% from the mean (hence 42% below the mean and 42% above the mean)
k = 3.0 => X ± 3.0s => proportion 89% => 89% of observations lie within 89% from the mean (hence 44.5% below the mean and 44.5% above the mean)
k = 4.0 => X ± 4.0s => proportion 94% => 94% of observations lie within 94% from the mean (hence 47% below the mean and 47% above the mean)
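A quick Python sketch of that arithmetic (it only evaluates 1 − 1/k² for the k values above):

```python
# Chebyshev lower bound 1 - 1/k^2 for the k values listed above.
for k in [1.25, 1.5, 2.0, 2.5, 3.0, 4.0]:
    bound = 1 - 1 / k**2
    print(f"k = {k:4.2f}  =>  at least {bound:.0%} of observations within k standard deviations of the mean")
```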

(2) Confidence intervals:
a confidence interval is a range of values around the expected outcome within which we expect the actual outcome to fall some specified percentage of the time. For example, a 95% confidence interval is a range that we expect the random variable to be in 95% of the time.
μ ± 1.65σ for 90 percent of the observations
μ ± 1.96σ for 95 percent of the observations
μ ± 2.58σ for 99 percent of the observations.

Both approaches above show completely different percentages of observations within a certain number of standard deviations from the mean. With Chebyshev's inequality there are at least 94% of observations within ±4 standard deviations, while with the confidence interval approach there are 99% within ±2.58 standard deviations.

Please, help me to understand how these differ from each other, and why they give such different percentages. Please, do me a favour and don't go too deep down a rabbit hole by using complicated math formulas in your explanation.

Thank you very much.)
 
  • #2
No need to go down a rabbit hole. It is pretty simple actually.

The confidence interval numbers make the assumption that it is normally distributed. Making that assumption you can get pretty tight intervals.

The Chebyshev inequality makes a much weaker assumption. It assumes only that the variance is finite. Many distributions have finite variance but are much broader than the normal distribution. So you expect that the Chebyshev intervals will be wider than the normal distribution confidence intervals.
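To see the gap concretely, here is a minimal Python sketch (assuming scipy is available) comparing the distribution-free Chebyshev lower bound with the exact coverage of a normal distribution at the same number of standard deviations:

```python
from scipy.stats import norm

# Distribution-free Chebyshev lower bound vs. exact normal coverage
# at the same number k of standard deviations from the mean.
for k in [1.65, 1.96, 2.58, 4.0]:
    chebyshev = 1 - 1 / k**2              # holds for ANY distribution with finite variance
    normal = norm.cdf(k) - norm.cdf(-k)   # exact, but only if the data are normal
    print(f"k = {k:4.2f}: Chebyshev guarantees >= {chebyshev:.1%}, normal gives {normal:.1%}")
```

With a normal distribution you already get 95% coverage at k = 1.96, while the distribution-free bound only guarantees about 74% there.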

In statistics you should ALWAYS pay close attention to the assumptions. They are exceptionally important in statistics.
 
  • #3
Dale said:
No need to go down a rabbit hole. It is pretty simple actually.

The confidence interval numbers make the assumption that it is normally distributed. Making that assumption you can get pretty tight intervals.

The Chebyshev inequality makes a much weaker assumption. It assumes only that the variance is finite. Many distributions have finite variance but are much broader than the normal distribution. So you expect that the Chebyshev intervals will be wider than the normal distribution confidence intervals.

In statistics you should ALWAYS pay close attention to the assumptions. They are exceptionally important in statistics.
Thank you very much. It is much clearer now. So if in this or that problem it is stated that the distribution is normal, then I can use confidence intervals. But when the distribution is assumed to be non-normal, then I should use the Chebyshev inequality to define the interval around the mean. I hope I understood that correctly.)
 
  • #4
Yes, although if you know that the distribution is something specific other than normal, then you can construct confidence intervals for that specific distribution. That will give you results equal to or better than the Chebyshev bound, which assumes an unknown distribution.
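For example, here is a rough simulation sketch (the exponential distribution, sample size, and k = 2 are purely illustrative choices) showing that the actual coverage of a known, non-normal distribution can comfortably beat the generic Chebyshev guarantee:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative non-normal distribution: exponential with mean 1 (its sd is also 1).
x = rng.exponential(scale=1.0, size=1_000_000)
mu, sigma = x.mean(), x.std()

k = 2.0
coverage = np.mean(np.abs(x - mu) <= k * sigma)
print(f"Chebyshev guarantee within {k} sd: at least {1 - 1/k**2:.0%}")
print(f"Empirical coverage for this exponential sample: {coverage:.1%}")
```

For this distribution roughly 95% of the observations fall within two standard deviations, well above the 75% that Chebyshev guarantees.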
 
  • #5
Vital said:
Hello.

I am bewildered by so many different notions of probability distribution percentages, i.e. the proportion of values that lie within certain standard deviations from the mean.

(1) There is Chebyshev's inequality:
- for any distribution with finite variance, the proportion of the observations within k standard deviations of the arithmetic mean is at least 1 − 1/k² for all k > 1. Below X is the mean.

k = 1.25 => X ± 1.25s => proportion 1 - 1/(1.25)^2 = 36% => 36% of observations lie within 36% from the mean (hence 18% below the mean and 18% above the mean)
*************************************
(2) Confidence intervals:
a confidence interval is a range of values around the expected outcome within which we expect the actual outcome to fall some specified percentage of the time. For example, a 95% confidence interval is a range that we expect the random variable to be in 95% of the time.
μ ± 1.65σ for 90 percent of the observations
*****************************************
Both approaches above show completely different percentages of observations within a certain number of standard deviations from the mean. With Chebyshev's inequality there are at least 94% of observations within ±4 standard deviations, while with the confidence interval approach there are 99% within ±2.58 standard deviations.

Please, help me to understand how these differ from each other, and why they give such different percentages. Please, do me a favour and don't go too deep down a rabbit hole by using complicated math formulas in your explanation.

Thank you very much.)

You have misstated the results. For Chebyshev with ##k = 1.25## it follows that at least 36% of the observations lie within 1.25 standard deviations (i.e. 125%) of the mean, not within 36% of the mean as you said. Also, you cannot say that 18% lie above and 18% lie below the mean; you can only say that 36% lie within the range above and below the mean. (I suspect that I could construct an asymmetric example where 30% lie above the mean and 6% lie below.)

The point about Chebyshev is that it applies universally, to any distribution whatsoever with a finite mean and variance. Of course, if you know the actual form of the distribution you can do much better: Chebyshev is a type of worst-case bound, and when you have a given distribution you are no longer looking at the worst case.
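As an illustration of that worst-case remark, here is a small sketch (a standard tight-case construction, not something stated in this thread) of a three-point distribution for which the Chebyshev bound is attained exactly:

```python
import numpy as np

k = 2.0
p = 1 / k**2                 # probability mass pushed out to the extremes

# Three-point distribution: -1 and +1 with probability p/2 each, 0 otherwise.
values = np.array([-1.0, 0.0, 1.0])
probs  = np.array([p / 2, 1 - p, p / 2])

mean  = np.sum(probs * values)                        # 0
sigma = np.sqrt(np.sum(probs * (values - mean) ** 2)) # 1/k

inside = probs[np.abs(values - mean) < k * sigma].sum()
print(f"P(|X - mean| < {k} sd) = {inside:.2f}, Chebyshev bound = {1 - 1/k**2:.2f}")
```

For this distribution the inequality is an equality, which is why no distribution-free bound can do better.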
 
  • #6
Ray Vickson said:
You have misstated the results. For Chebyshev with ##k = 1.25## it follows that at least 36% of the observations lie within 1.25 standard deviations (i.e. 125%) of the mean, not within 36% of the mean as you said. [snip]
Thank you very much. But I am not sure what you mean when you say that 36% is not correct.
1 - 1/1.25^2 = 36%, hence around 36% fall within ±1.25 standard deviations of the mean. Why is this incorrect?
 
  • #7
Vital said:
Thank you very much. But I am not sure what you mean when you say that 36% is not correct.
1 - 1/1.25^2 = 36%, hence around 36% fall within ±1.25 standard deviations of the mean. Why is this incorrect?

I agree that ##1 - 1/1.25^2 = 0.36,## but that does NOT mean that points in the interval ##(\mu - 1.25 \sigma , \mu + 1.25 \sigma)## are within 36% of the mean. In fact, to even speak of "% from the mean" is using meaningless words. The concept of "%" must be in reference to some standard (or base) amount, which you have not specified. Even if you use the standard deviation ##\sigma## to be that base amount, the interval above is actually "within 125% of the mean." There are no intervals of length 36% here. The 36% applies to the probabilities, not to the "distances".
 

1. What is Chebyshev's inequality and how is it used?

Chebyshev's inequality is a mathematical result that relates to the spread of data around the mean in a probability distribution. It states that for any dataset, no matter the distribution, at least 1 − 1/k² of the data lie within k standard deviations of the mean (for any k > 1). This is useful for understanding the likelihood of events occurring and for setting bounds on the probability of certain outcomes.

2. How do confidence intervals work?

Confidence intervals are a way of estimating a population parameter based on a sample from that population. They provide a range of values within which the true value of the parameter is likely to fall, with a certain level of confidence. This level of confidence is typically expressed as a percentage, such as 95%. A wider confidence interval indicates more uncertainty in the estimate, while a narrower interval indicates more confidence in the estimate.

3. What is the relationship between sample size and confidence intervals?

The larger the sample size, the smaller the confidence interval. This is because a larger sample size provides more information and reduces the amount of uncertainty in the estimate. As a result, the confidence interval becomes narrower and the estimate becomes more precise.
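A minimal sketch of that 1/√n behaviour, assuming a normal population with a known standard deviation (all numbers are illustrative):

```python
import numpy as np

z = 1.96        # 95% multiplier for a normal distribution
sigma = 10.0    # illustrative population standard deviation

# Half-width of the 95% confidence interval for the mean shrinks like 1/sqrt(n).
for n in [25, 100, 400, 1600]:
    half_width = z * sigma / np.sqrt(n)
    print(f"n = {n:4d}: mean +/- {half_width:.2f}")
```

Quadrupling the sample size halves the width of the interval.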

4. How do you interpret a confidence interval?

A confidence interval provides a range of values within which the true value of a population parameter is likely to fall. This means that if the same sampling procedure were repeated many times and a confidence interval was calculated each time, a certain percentage of those intervals (e.g. 95%) would contain the true population parameter. In other words, we can be 95% confident that the true value of the parameter falls within the given interval.
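That repeated-sampling reading can be checked with a short simulation; this sketch assumes a normal population and uses the t-based 95% interval for the sample mean (all numbers are illustrative):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
true_mean, sigma, n, trials = 5.0, 2.0, 30, 10_000
t_mult = t.ppf(0.975, df=n - 1)   # two-sided 95% multiplier for a sample of size n

hits = 0
for _ in range(trials):
    sample = rng.normal(true_mean, sigma, n)
    half_width = t_mult * sample.std(ddof=1) / np.sqrt(n)
    if abs(sample.mean() - true_mean) <= half_width:
        hits += 1

print(f"{hits / trials:.1%} of the 95% intervals contained the true mean")
```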

5. How can confidence intervals be used in hypothesis testing?

In hypothesis testing, confidence intervals can be used to determine whether a null hypothesis (e.g. there is no difference between two groups) can be rejected or not. If the confidence interval for the difference between two groups does not include zero, it suggests that there is a statistically significant difference between the groups. However, if the confidence interval does include zero, it indicates that the observed difference could have occurred by chance and the null hypothesis cannot be rejected.
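A rough sketch of that logic, using a 95% interval for the difference between two group means (the simulated groups, sizes, and pooled-degrees-of-freedom shortcut are illustrative simplifications):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
group_a = rng.normal(10.0, 3.0, 50)   # illustrative samples
group_b = rng.normal(12.0, 3.0, 50)

diff = group_b.mean() - group_a.mean()
se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
t_mult = t.ppf(0.975, df=len(group_a) + len(group_b) - 2)   # pooled-df shortcut

low, high = diff - t_mult * se, diff + t_mult * se
print(f"95% CI for the difference: ({low:.2f}, {high:.2f})")
print("Reject H0 of no difference" if low > 0 or high < 0 else "Cannot reject H0")
```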
