Probability of Binomial Variable ≥ Another Binomial Variable

AI Thread Summary
The discussion focuses on calculating the probability that one binomial variable, X, is greater than another, Y, when both are generated as paired events with different probabilities. Participants shared simulation results showing that for p=0.8 and p=0.6, X was greater than Y 31.73% of the time, while for p=0.7 and p=0.4, this increased to 42.04%. They explored using the Central Limit Theorem to derive a formula for P{X≥Y}, noting that for large sample sizes, the distribution of X-Y approximates a normal distribution. However, challenges arose when applying the formula for small sample sizes, indicating that the approximation is less accurate with n=m=1. Overall, the conversation emphasizes the importance of sample size in statistical probability calculations.
SirTristan
If two binomially distributed variables are generated as paired events, how often will the variable with p=X be greater than the variable with p=Y? Also what is the "equity" if ties are counted as .5 for each?

For instance, in Excel I generated 10,000 numbers with p=.8 and 10,000 with p=.6. The first set of numbers was greater 3,173 times, they were equal 5,642 times, and the second set was greater 1,185 times. So p=.8 was greater than p=.6 31.73% of the time. Counting each tie as half a win, the total equity for the first set was (3173+5642/2)/10000=.5994.

Repeating this for p=.7 and p=.4, the first was greater 4,204 times, they were equal 4,610 times, and the second set was greater 1,186 times. p=.7 was greater than p=.4 42.04% of the time, and the "equity" for the first variable was (4204+4610/2)/10000=.6509.
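
For anyone who wants to repeat this experiment outside Excel, here is a minimal Python sketch. It assumes each "number" is a single 0/1 draw and that the two draws in a pair are independent; the function name `simulate_pairs` and the fixed seed are illustrative choices, not part of the original spreadsheet.

```python
import random

def simulate_pairs(p, q, trials=10_000, seed=0):
    """Draw paired 0/1 values and count how often X beats, ties, or loses to Y."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    wins = ties = losses = 0
    for _ in range(trials):
        x = int(rng.random() < p)  # X = 1 with probability p
        y = int(rng.random() < q)  # Y = 1 with probability q
        if x > y:
            wins += 1
        elif x == y:
            ties += 1
        else:
            losses += 1
    equity = (wins + ties / 2) / trials  # each tie counts as half a win
    return wins, ties, losses, equity

for p, q in [(0.8, 0.6), (0.7, 0.4)]:
    wins, ties, losses, equity = simulate_pairs(p, q)
    print(p, q, wins, ties, losses, round(equity, 4))
```

The exact counts depend on the seed, so they will not match the Excel run digit for digit, but they should land in the same neighbourhood.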
 
Hi SirTristan! :smile:

What you are looking for is

P\{X\geq Y\}=P\{X-Y\geq 0\}

Thus you must know the distribution of X-Y. Sadly, I do not know any nice formula for this. However, if X~B(n,p) and Y~B(m,q) are independent and n and m are large, then we can apply the Central Limit Theorem.

Indeed, if n is large, then X~N(np,np(1-p)) and if m is large then Y~N(mq,mq(1-q)). Thus X-Y~N(np+mq,np(1-p)+mq(1-q)). Thus if Z is standard normal, then you need to calculate

P\left\{Z\geq \frac{-np-mq}{\sqrt{np(1-p)+mq(1-q)}}\right\}

which can be easily done by using some kind of table...
 
Looks like you mean Bernoulli variables (binomial with n=1). For this case it's easy to set up a 2x2 table. For example, with P[X=1]=p and P[Y=1]=q, and X and Y independent, you have P[X=0,Y=1]=(1-p)q, etc., and thus P[X>Y]=p(1-q) and P[X=Y]=pq+(1-p)(1-q), which should match reasonably closely the percentages you found by Monte Carlo simulation. You may like to try a chi-square test to see if the observations are close enough to the predictions.
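
The 2x2 table can be written out in a few lines of Python. This is only a sketch assuming X and Y are independent Bernoulli variables; the function name `bernoulli_comparison` is made up for illustration.

```python
def bernoulli_comparison(p, q):
    """Return (P[X>Y], P[X=Y], P[X<Y], equity) for independent Bernoulli X and Y."""
    p_greater = p * (1 - q)               # X=1, Y=0
    p_equal = p * q + (1 - p) * (1 - q)   # both 1, or both 0
    p_less = (1 - p) * q                  # X=0, Y=1
    equity = p_greater + p_equal / 2      # each tie counts as half a win
    return p_greater, p_equal, p_less, equity

for p, q in [(0.8, 0.6), (0.7, 0.4)]:
    print(p, q, [round(v, 4) for v in bernoulli_comparison(p, q)])
```

Under these assumptions the equity simplifies algebraically to (1+p-q)/2, which gives 0.6 for p=.8, q=.6 and 0.65 for p=.7, q=.4.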
 
Thank you very much guys :)

bpet, yes those numbers seem to match the simulations quite precisely. More simple math than I expected :) Here's what those formulas give:
Code:
P	Q	X>Y	X=Y	Equity
0.8	0.6	0.32	0.56	0.6
0.7	0.4	0.42	0.46	0.65
That's almost exactly the simulation numbers.

I'm having trouble with micromass's formula though - perhaps I'm doing something wrong? Since n=m=1, here's what I get for the numerator [-p-q] and the denominator [sqrt(p*(1-p)+q*(1-q))], the Z score, and the probability of being higher than that Z score:
Code:
P	Q	Num	Den	Z	Probability
0.8	0.6	-1.4	.6325	-2.2136	.9866
0.7	0.4	-1.1	.6708	-1.6398	.9495
0.6	0.8	-1.4	.6325	-2.2136	.9866
Perhaps I'm misapplying the formula, because as I read it, swapping P and Q gives the same result: the case where P is less than Q yields the same probability as the case where P is higher than Q. Shouldn't X-Y be distributed with a mean of np-mq rather than np+mq, so that the numerator is -(np-mq) rather than -np-mq? Using that numerator gives me:
Code:
P	Q	Num	Den	Z	Probability
0.8	0.6	-0.2	.6325	-.3162	.6241
0.7	0.4	-0.3	.6708	-.4472	.6726
0.6	0.8	0.2	.6325	.3162	.3759
These numbers make more sense to me, although I think they're a bit less accurate relative to the simulation.
 
I'm sorry SirTristan, you are correct! The numerator should indeed be -(np-mq).

Also, the formula I gave you will only approximate the real probability for large n and m. If you pick n=m=1, then this will be highly inaccurate, as your example shows!

Maybe you could try the same thing for n,m>20 or so; you'll see that the formula approximates your simulation quite closely!
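
One way to check this without a simulation is to compute P{X≥Y} exactly by summing over the binomial probabilities and compare it with the normal approximation using the corrected numerator -(np-mq). The sketch below assumes X and Y are independent; the function names are illustrative, and n=m=30 is just one arbitrary choice above 20.

```python
from math import comb, erf, sqrt

def binom_pmf(k, n, p):
    """P[K = k] for K ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def exact_prob_x_ge_y(n, p, m, q):
    """Exact P[X >= Y] for independent X ~ B(n, p) and Y ~ B(m, q)."""
    total = 0.0
    for y in range(m + 1):
        # P[X >= y]; the range is empty (sum 0) when y > n
        tail = sum(binom_pmf(x, n, p) for x in range(y, n + 1))
        total += binom_pmf(y, m, q) * tail
    return total

def normal_approx_x_ge_y(n, p, m, q):
    """Normal approximation: P[Z >= -(np - mq) / sqrt(np(1-p) + mq(1-q))]."""
    z = -(n * p - m * q) / sqrt(n * p * (1 - p) + m * q * (1 - q))
    return 0.5 * (1 - erf(z / sqrt(2)))  # upper tail of the standard normal

for n in (1, 30):
    exact = exact_prob_x_ge_y(n, 0.8, n, 0.6)
    approx = normal_approx_x_ge_y(n, 0.8, n, 0.6)
    print(n, round(exact, 4), round(approx, 4))
```

For n=m=1 the exact value is P[X>Y]+P[X=Y]=0.32+0.56=0.88, while the approximation gives about 0.6241, the figure in SirTristan's corrected table. Note that the formula targets P{X≥Y} with ties counted in full, not the equity.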
 
