Binomial distribution/confidence question for dummies

  • Context: Undergrad 
  • Thread starter: CRGreathouse

Discussion Overview

The discussion revolves around constructing a confidence interval for an unknown probability p based on the outcomes of independent events, specifically using the binomial distribution. Participants explore methods for both small and large sample sizes, discussing numerical combinatorics and statistical reasoning.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant describes the process of calculating probabilities for at most and at least k events using combinatorial methods, expressing a preference to avoid the Central Limit Theorem for larger n.
  • Another participant suggests that for small n, numerical combinatorics is straightforward, but for larger n, a standard statistical method is preferable.
  • There is mention of the Neyman–Pearson lemma, which says that the likelihood-ratio test is the most powerful test of a given size.
  • Participants discuss the maximum likelihood estimator being k/n, with examples illustrating its application.
  • Concerns are raised about the choice of alpha values for confidence intervals, particularly the justification for splitting the error equally between the left and right tails when k > 0.
  • A revised numerical function for calculating confidence intervals is shared, with a note on the difficulty of calculating normal percentiles in the programming environment used.

Areas of Agreement / Disagreement

Participants express varying opinions on the best methods for constructing confidence intervals, particularly regarding the use of numerical methods versus approximations. There is no consensus on the optimal approach or the justification for specific choices in statistical parameters.

Contextual Notes

Participants note limitations in their statistical knowledge and the challenges of applying certain statistical methods, particularly in relation to large sample sizes and the use of normal approximations.

CRGreathouse
Science Advisor
Homework Helper
Binomial distribution/confidence question for dummies

(The 'dummy' would be me.)

I have an event that occurs with unknown probability p. Out of n independent trials, the event occurs in k of them. How do I construct a (95%) confidence interval for p?

For small n it's easy to figure this out with numerical combinatorics:

Pr(at most k events) = [tex]\sum_{i=0}^k{n\choose i}p^i(1-p)^{n-i}[/tex]
Pr(at least k events) = [tex]\sum_{i=k}^n{n\choose i}p^i(1-p)^{n-i}[/tex]

and then find the roots (in p) of Pr(at most k events) - 0.05 and Pr(at least k events) - 0.05. (Maybe I should use 0.025 instead?)


But for large n (even not all that large!), this is inconvenient. Surely there is some standard statistical method for this? Staying as close to the exact roots as possible would be best -- I'd prefer to rely on the Central Limit Theorem as little as possible.
 
Actually, forget it. Computers are fast, and wasting a few billion cycles isn't going to kill me.

Pari/GP code:
Code:
probrange(n,k,confidence=.05)={
	\\ Lower endpoint: the p with Pr(at least k events) = confidence;
	\\ for k == 0 that sum is identically 1, so the interval starts at 0.
	[if(k==0, 0, solve(p=0,1, sum(i=k,n, binomial(n,i)*p^i*(1-p)^(n-i)) - confidence)),
	\\ Upper endpoint: the p with Pr(at most k events) = confidence.
	if(k==n, 1, solve(p=0,1, sum(i=0,k, binomial(n,i)*p^i*(1-p)^(n-i)) - confidence))]
};
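(A quick sanity check: for k = 0 the upper endpoint has a closed form, since solve is just inverting (1-p)^n = confidence.)

Code:
? probrange(10, 0)
\\ upper endpoint solves (1-p)^10 = 0.05, i.e. p = 1 - 0.05^(1/10) ~ 0.259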
 


CRGreathouse said:
I have an event that occurs with unknown probability p. Out of n independent trials, the event occurs in k of them. How do I construct a (95%) confidence interval for p? [...] Surely there is some standard statistical method for this? Staying as close to the exact roots as possible would be best -- I'd prefer to rely on the Central Limit Theorem as little as possible.


If you want the proper statistical reasoning for using the combinatorial approach, look up the Neyman–Pearson lemma. It tells you that, among all tests of a given size, the likelihood-ratio test has the greatest power. I'm afraid that without the Central Limit Theorem or a Poisson approximation you can't simplify the computation.
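For reference, the lemma says (roughly): for testing a simple [itex]H_0: p = p_0[/itex] against a simple [itex]H_1: p = p_1[/itex], the most powerful test of size [itex]\alpha[/itex] rejects exactly when the likelihood ratio exceeds a threshold,

[tex]\Lambda(x) = \frac{L(p_1 \mid x)}{L(p_0 \mid x)} > c,[/tex]

with [itex]c[/itex] chosen so that [itex]\Pr(\Lambda > c \mid H_0) = \alpha[/itex].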

Oh, and your values should be 0.025 (so the two tails add up to 0.05 when you do a two-sided test).

Good luck.
 


Focus said:
If you want the proper statistical reasoning for using the combinatorial approach, look up the Neyman–Pearson lemma. It tells you that, among all tests of a given size, the likelihood-ratio test has the greatest power. I'm afraid that without the Central Limit Theorem or a Poisson approximation you can't simplify the computation.

The maximum likelihood estimator is k/n -- if I have 10 trials and win 3 of them, the most likely probability is 30%. That's not hard to figure out.
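(Spelled out: setting the derivative of the log-likelihood to zero,

[tex]\frac{d}{dp}\left[k\log p + (n-k)\log(1-p)\right] = \frac{k}{p} - \frac{n-k}{1-p} = 0,[/tex]

gives [itex]\hat p = k/n[/itex].)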

I'm not actually getting particularly good results using the normal approximation, so I think I'll have to compute numerically.

I will look up that lemma, though; that might help.

Focus said:
Oh, and your values should be 0.025 (so the two tails add up to 0.05 when you do a two-sided test).

You know, this is where my lack of stats knowledge hurts. It's clear that I should use 0.05, not 0.025, when k = 0. But for k > 0, where I have a choice, I'm not sure what the justification is for making the left and right errors equal. What do you think?
 


CRGreathouse said:
The maximum likelihood estimator is k/n -- if I have 10 trials and win 3 of them, the most likely probability is 30%. That's not hard to figure out.

I'm not actually getting particularly good results using the normal approximation, so I think I'll have to compute numerically.

I will look up that lemma, though; that might help.

The lemma tells you that the likelihood function is, well, "best" to use.


CRGreathouse said:
You know, this is where my lack of stats knowledge hurts. It's clear that I should use 0.05, not 0.025, when k = 0. But for k > 0, where I have a choice, I'm not sure what the justification is for making the left and right errors equal. What do you think?

Two-sided confidence intervals conventionally put alpha/2 in each tail. It is analogous to a two-sided hypothesis test.
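In symbols, the equal-tailed exact interval [itex][L, U][/itex] for [itex]0 < k < n[/itex] (this is the Clopper–Pearson construction, and it is what the numerical root-finding in this thread computes) is defined by

[tex]\sum_{i=k}^{n}{n\choose i}L^i(1-L)^{n-i} = \frac{\alpha}{2}, \qquad \sum_{i=0}^{k}{n\choose i}U^i(1-U)^{n-i} = \frac{\alpha}{2}.[/tex]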
 
Here's the revised numerical function I'm using:

Code:
probrange(n,k,conf=.05)={
	\\ Edge cases: one-sided intervals get the full error budget on one tail.
	if (k==0, return([0, solve(p=0,1, (1-p)^n - conf)]));
	if (k==n, return([solve(p=0,1, p^n - conf), 1]));

	conf = conf/2;	\\ two-sided: split the error equally between the tails
	\\ Lower endpoint solves Pr(X >= k) = conf; upper solves Pr(X <= k) = conf.
	\\ Each tail is expressed via whichever sum has fewer terms, for speed.
	if(k+k < n,
		[solve(p=0,1, 1 - conf - sum(i=0,k-1, binomial(n,i)*p^i*(1-p)^(n-i))),
		solve(p=0,1, sum(i=0,k, binomial(n,i)*p^i*(1-p)^(n-i)) - conf)],
		[solve(p=0,1, sum(i=k,n, binomial(n,i)*p^i*(1-p)^(n-i)) - conf),
		solve(p=0,1, 1 - conf - sum(i=k+1,n, binomial(n,i)*p^i*(1-p)^(n-i)))]
	)
};
addhelp(probrange, "probrange(n,k,conf=.05): Gives a confidence interval for the probability of an event which happens k times in n trials.");

I'd like to test this against some normal approximations, but I'm having a bit of trouble in Pari -- I don't know how to calculate a normal percentile ([itex]z_{1-\alpha/2}[/itex]). All Pari has built in is the complementary error function. Is there a better way than numerically solving for the inverse?
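One workable approach -- a sketch only, and zquantile and probrange_normal are names made up here -- is to invert erfc numerically, using [itex]\Phi(z) = 1 - \tfrac{1}{2}\,\mathrm{erfc}(z/\sqrt{2})[/itex]:

Code:
\\ Normal quantile: the z with Phi(z) = q, found by inverting erfc numerically.
\\ The bracket [-10,10] is far wider than any quantile of practical interest.
zquantile(q) = solve(z=-10, 10, 1 - erfc(z/sqrt(2))/2 - q);

\\ Wald (normal-approximation) interval, clipped to [0,1], for comparison with
\\ the exact probrange above; it degenerates to a point when k = 0 or k = n.
probrange_normal(n,k,conf=.05)={
	my(phat = k/n, z = zquantile(1 - conf/2));
	my(se = sqrt(phat*(1-phat)/n));
	[max(0, phat - z*se), min(1, phat + z*se)]
};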
 
