Distribution/confidence question for dummies

  • Thread starter CRGreathouse
  • Start date
  • #1
CRGreathouse
Science Advisor
Homework Helper
2,844
0
Binomial distribution/confidence question for dummies

(The 'dummy' would be me.)

I have an event that happens with unknown probability p. In n independent trials, the event occurs k times. How do I construct a (95%) confidence interval for p?

For small n it's easy to figure this out with numerical combinatorics:

Pr(at most k events) = [tex]\sum_{i=0}^k{n\choose i}p^i(1-p)^{n-i}[/tex]
Pr(at least k events) = [tex]\sum_{i=k}^n{n\choose i}p^i(1-p)^{n-i}[/tex]

and then find the roots of Pr(at most k events) - 0.05 and Pr(at least k events) - 0.05. (Maybe I should use 0.025 instead?)


But for large n (even not all that large!), this is inconvenient. Surely there is some standard statistical method for this? Sticking as close to the roots as possible would be best -- I'd prefer to use as little Central Limit Theorem as I can.
 
Last edited:
  • #2
Actually, forget it. Computers are fast, and wasting a few billion cycles isn't going to kill me.

Pari/GP code:
Code:
probrange(n,k,confidence=.05)={
	\\ Lower bound: the p at which Pr(at least k successes) = confidence.
	\\ Upper bound: the p at which Pr(at most k successes) = confidence.
	[if(k==0, 0, solve(p=0,1, sum(i=k,n, binomial(n,i)*p^i*(1-p)^(n-i)) - confidence)),
	if(k==n, 1, solve(p=0,1, sum(i=0,k, binomial(n,i)*p^i*(1-p)^(n-i)) - confidence))]
};
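For anyone without Pari/GP handy, the same root-finding idea can be sketched in Python. This is my own translation, not code from the thread: the names (`prob_range`, `tail_at_least`, `tail_at_most`) and the bisection tolerance are arbitrary choices.

```python
from math import comb

def tail_at_least(n, k, p):
    """Pr(at least k successes in n Bernoulli(p) trials)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def tail_at_most(n, k, p):
    """Pr(at most k successes in n Bernoulli(p) trials)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

def bisect_root(f, lo, hi, tol=1e-12):
    """Find a root of f on [lo, hi] by bisection; f must change sign."""
    flo = f(lo)
    for _ in range(200):
        mid = (lo + hi) / 2
        fmid = f(mid)
        if fmid == 0 or hi - lo < tol:
            return mid
        if (flo < 0) == (fmid < 0):
            lo, flo = mid, fmid   # root lies in the upper half
        else:
            hi = mid              # root lies in the lower half
    return (lo + hi) / 2

def prob_range(n, k, confidence=0.05):
    """Interval with 'confidence' probability in each one-sided tail,
    mirroring the Pari/GP code above (including the k=0 and k=n cases)."""
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: tail_at_least(n, k, p) - confidence, 0.0, 1.0)
    upper = 1.0 if k == n else bisect_root(
        lambda p: tail_at_most(n, k, p) - confidence, 0.0, 1.0)
    return lower, upper
```

Both tail sums are monotone in p, so bisection is safe here even though it is slower than a dedicated root-finder.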
 
  • #3


CRGreathouse said:
(The 'dummy' would be me.)

I have an event that happens with unknown probability p. In n independent trials, the event occurs k times. How do I construct a (95%) confidence interval for p?

For small n it's easy to figure this out with numerical combinatorics:

Pr(at most k events) = [tex]\sum_{i=0}^k{n\choose i}p^i(1-p)^{n-i}[/tex]
Pr(at least k events) = [tex]\sum_{i=k}^n{n\choose i}p^i(1-p)^{n-i}[/tex]

and then find the roots of Pr(at most k events) - 0.05 and Pr(at least k events) - 0.05. (Maybe I should use 0.025 instead?)


But for large n (even not all that large!), this is inconvenient. Surely there is some standard statistical method for this? Sticking as close to the roots as possible would be best -- I'd prefer to use as little Central Limit Theorem as I can.

If you want the proper statistical justification for using those combinatorial tail sums, look up the Neyman-Pearson lemma. It tells you that, among all tests of a given size, the likelihood-ratio test is the most powerful. I'm afraid that without the Central Limit Theorem or a Poisson approximation you can't simplify the computation.

Oh, and your values should be 0.025 (so the two tails add up to 0.05 when you do a two-sided test).

Good luck.
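To make the lemma concrete: for two simple hypotheses p0 versus p1 > p0, the binomial likelihood ratio is increasing in k, so the most powerful test rejects exactly when k is large. A quick numerical check (in Python for convenience; the particular values n = 10, p0 = 0.3, p1 = 0.6 are just for illustration):

```python
from math import comb

def likelihood(n, k, p):
    """Binomial likelihood of observing k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p0, p1 = 10, 0.3, 0.6
ratios = [likelihood(n, k, p1) / likelihood(n, k, p0) for k in range(n + 1)]

# The ratio increases monotonically in k (monotone likelihood ratio),
# so the Neyman-Pearson test against H0: p = p0 rejects for large k.
assert all(a < b for a, b in zip(ratios, ratios[1:]))
```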
 
  • #4


Focus said:
If you want the proper statistical reasoning for using the combinatorial things then look up Neyman Pearson lemma. It tells you that out of all the estimators, the one using the likelihood estimate gives the largest power. I'm afraid without Central Limit or Poisson approximation you can't simplify the computation.

The maximum likelihood estimator is k/n -- if I have 10 trials and win 3 of them, the most likely probability is 30%. That's not hard to figure out.

I'm not actually getting particularly good results with using the normal estimate, so I think I'll have to compute numerically.

I will look up that lemma, though; that might help.
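For what it's worth, a grid scan confirms the k/n claim numerically. This Python snippet is purely illustrative (the grid resolution is an arbitrary choice):

```python
from math import comb

def likelihood(n, k, p):
    """Binomial likelihood of k successes in n trials at success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, k = 10, 3
# Scan a fine grid of candidate p; the unimodal likelihood peaks at k/n = 0.3.
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=lambda p: likelihood(n, k, p))
assert abs(best - k / n) < 1e-9
```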

Focus said:
Oo and your values should be 0.025 (so it adds to 0.5 when you do two sided test)

You know, this is where my lack of stats knowledge hurts. It's clear that I should use 0.05, not 0.025, when k = 0. But for k > 0, where I have a choice, I'm not sure what the justification is for choosing the left and right errors to be equal. What do you think?
 
  • #5


CRGreathouse said:
The maximum likelihood estimator is k/n -- if I have 10 trials and win 3 of them, the most likely probability is 30%. That's not hard to figure out.

I'm not actually getting particularly good results with using the normal estimate, so I think I'll have to compute numerically.

I will look up that lemma, though; that might help.

The lemma tells you that the likelihood function is, well, "best" to use.


CRGreathouse said:
You know, this is where my lack of stats knowledge hurts. It's clear that I should use 0.05, not 0.025, when k = 0. But for k > 0, where I have a choice, I'm not sure what the justification is for choosing the left and right errors to be equal. What do you think?

Confidence intervals conventionally put alpha/2 on each tail. It is analogous to a two-sided hypothesis test.
 
  • #6
Here's the revised numerical function I'm using:

Code:
probrange(n,k,conf=.05)={
	\\ Equal-tailed (Clopper-Pearson style) interval; one-sided at the endpoints.
	if (k==0, return([0, solve(p=0,1, (1-p)^n - conf)]));
	if (k==n, return([solve(p=0,1, p^n - conf), 1]));

	conf = conf/2;	\\ split the error equally between the two tails
	\\ Lower bound: Pr(X >= k) = conf; upper bound: Pr(X <= k) = conf.
	\\ The branch just picks whichever form of each sum has fewer terms.
	if(k+k < n,
		[solve(p=0,1, 1-conf - sum(i=0,k-1, binomial(n,i)*p^i*(1-p)^(n-i))),
		solve(p=0,1, sum(i=0,k, binomial(n,i)*p^i*(1-p)^(n-i)) - conf)],
		[solve(p=0,1, sum(i=k,n, binomial(n,i)*p^i*(1-p)^(n-i)) - conf),
		solve(p=0,1, 1-conf - sum(i=k+1,n, binomial(n,i)*p^i*(1-p)^(n-i)))]
	)
};
addhelp(probrange, "probrange(n,k,conf=.05): Gives a confidence interval for the probability of an event which happens k times in n trials.");

I'd like to test this against some normal approximations, but I'm having a bit of trouble in Pari -- I don't know how to calculate a normal percentile ([itex]z_{1-\alpha/2}[/itex]). All Pari has built-in is the complementary error function. Is there a better way than numerically solving for the inverse?
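One route inside Pari: since Φ(z) = 1 − erfc(z/√2)/2, the percentile z_{1−α/2} solves erfc(z/√2) = α, which is a single numeric solve against the built-in erfc. For cross-checking outside Pari, Python's standard library exposes the inverse normal CDF directly. A sketch for validation only (the Wald-interval comparison with n = 10, k = 3 is illustrative):

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1-alpha/2}, about 1.96 for 95%

# Wald (normal-approximation) interval for k successes in n trials,
# to compare against the exact numerical interval.
n, k = 10, 3
phat = k / n
half = z * sqrt(phat * (1 - phat) / n)
print(phat - half, phat + half)
```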
 

What is a distribution?

A distribution describes the way in which data values are spread out. It can be visualized through a graph or chart, and it helps to reveal the pattern or shape of the data.

What is the difference between a normal distribution and a skewed distribution?

A normal distribution is a symmetrical bell-shaped curve, where the majority of the data is clustered around the mean. A skewed distribution, on the other hand, is asymmetrical and has a longer tail on one side. This means that the data is not evenly distributed and is skewed towards one direction.

What is the purpose of calculating confidence intervals?

Confidence intervals are used to estimate the range of values within which a population parameter (such as a mean or proportion) is likely to fall. They provide a stated level of confidence in the estimate, based on a sample from the population.

How is confidence level related to confidence interval?

The confidence level is the long-run proportion of intervals, constructed by the same procedure, that contain the true population parameter. A higher confidence level (such as 95%) means greater assurance that the interval captures the true value.

What factors can affect the width of a confidence interval?

The width of a confidence interval can be affected by several factors, including the sample size, the variability of the data, and the chosen confidence level. A larger sample size and lower variability narrow the interval, while a higher confidence level widens it.
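The sample-size effect is easy to see numerically: under the normal approximation, the interval width 2z·√(p̂(1−p̂)/n) shrinks like 1/√n for fixed p̂ and confidence level. A Python sketch (the values p̂ = 0.3 and the sample sizes are illustrative):

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)   # 95% two-sided critical value
phat = 0.3
widths = {n: 2 * z * sqrt(phat * (1 - phat) / n) for n in (10, 100, 1000)}

# Each tenfold increase in n shrinks the width by a factor of sqrt(10).
assert widths[10] > widths[100] > widths[1000]
```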
