Bayesian Probability Distributions

Rifscape · Feb 20, 2018

Hi,

I am having trouble with some Bayesian probability problems and was wondering if I could get some help. I think I got the first two but am confused by the last. If someone could check my work on the first two and help me with the last question, that would be great.

Here is the problem setup:

Let’s say that we work for the International Olympic Committee (IOC) as part of their Fight Against Doping (https://www.olympic.org/fight-against-doping). We have a drug test for a banned performance-enhancing drug (PED) that is 99.3% accurate at identifying an athlete that has the PED in their system. However, it is only 73% accurate at identifying the absence of PED in the athlete’s system. From a scientific study we also have a strong reason to believe that only 3% of Olympic athletes use this particular PED.

Here is the first question.

1. An athlete tests positive for the PED. Given that the test had a positive result, what is the probability the tested individual uses the PED?

My answer:

P = Positive, N = Negative

##p(P|PED) = 0.993##

##p(N|No PED) = 0.73##

##p(PED) = 0.03##

##p(P) = p(P|PED)*p(PED) + p(P|No PED)*p(No PED) = 0.993 * 0.03 + (1 - 0.73) * 0.97 = 0.292##

##p(PED|P) = 0.993 \times 0.03 / 0.292 = 0.102## - answer
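As a quick numerical check of the calculation above, here is a minimal Python sketch of Bayes' rule for a single test. The function name `posterior_given_positive` and its parameter names are illustrative choices, not part of the problem statement.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(PED | positive) for one test, via Bayes' rule.

    sensitivity = P(positive | PED), specificity = P(negative | No PED).
    """
    true_pos = sensitivity * prior                   # P(P | PED) * P(PED)
    false_pos = (1.0 - specificity) * (1.0 - prior)  # P(P | No PED) * P(No PED)
    return true_pos / (true_pos + false_pos)


p = posterior_given_positive(prior=0.03, sensitivity=0.993, specificity=0.73)
print(round(p, 3))  # 0.102
```

The low value comes mostly from the false-positive term, which is far larger than the true-positive term because only 3% of athletes are assumed to use the PED.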

Here is the second question.

2. As an employee of the IOC, we don’t want to needlessly ban an athlete from the chance to compete in the Olympics. As a result, we decide to institute a protocol that if an athlete tests positive for the use of the PED we will administer a second test. The second test is less accurate at identifying an athlete that has the PED in their system, at only 81%, but is more accurate at identifying the absence of PED in the athlete’s system, with a probability of 90%. If the athlete tests positive for the PED in both the first and second test, what is the probability that the accused individual uses the banned PED? (You may assume the outcome of the second drug test is conditionally independent of the outcome of the first drug test).

My answer:

##P_2## = Positive 2nd Test

##N_2## = Negative 2nd Test

##p(P_2|PED)## = 0.81

##p(N_2|No PED)## = 0.90

##p(PED|P_2) = p(P_2|PED)*p(PED)/p(P_2)##

##p(P_2) = p(P_2|PED)*p(PED) + p(P_2|No PED)*p(No PED) = 0.81*0.03 + (1-0.90)*(1-0.03) = 0.121##

##p(PED|P_2) = p(P_2|PED) * p(PED)/p(P_2) = 0.81*0.03/0.121 = 0.200##

Thus the probability of using PED given that you tested positive for both tests is:

##p(PED|P)*p(PED|P_2) = 0.102*0.200 = 0.0204 \approx 0.02## - answer

Here is question 3 - this is the one I am not sure on.

3. Our information that only 3% of Olympic athletes use the PED came from a study of 300 athletes. This year we tested 500 athletes and confirmed that 11 of them used the banned substance. In both cases, only a sample of all athletes to compete in the Olympics were tested for the PED. As a result there is some uncertainty, so we decide we would like to express what we have learned as a probability distribution. In two years when we begin testing athletes again, what is the fully specified distribution that we will use for the percentage of Olympic athletes that use the PED?

I don't understand what is meant by asking for the probability distribution. Would it be some form of Bayesian posterior distribution with 0.03 as the prior? If so, I'm not sure what the likelihood or the marginal would be.

I know this is the Bayes rule formula:

##p(\theta|x) = p(x|\theta)*p(\theta)/p(x)##

But how would I change this to suit the question?

Apologies for the long post, please let me know if you need any more information or clarification.

Thank you for reading

Ray Vickson · Feb 21, 2018

I don't see how you did question 2, and I don't think you answered the question that was actually asked. I would say that
$$
\begin{array}{rcl}\Pr(P1 \& P2) &=& \Pr(P1 \& P2 | PED ) \Pr(PED) + \Pr(P1 \& P2 | \text{No PED}) \Pr(\text{No PED}) \\
&=& 0.993 \times 0.81 \times 0.03 + 0.27 \times 0.10 \times 0.97
\end{array}
$$
The question asks for ##\Pr(PED | P1 \& P2)##.
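To make the remaining step concrete, the sketch below divides the "user and both positive" term by the total probability of two positives, assuming (as the problem allows) that the two tests are conditionally independent given the athlete's true status. The function name `posterior_after_positives` is a hypothetical helper, not something from the problem.

```python
def posterior_after_positives(prior, tests):
    """Return (P(PED | all tests positive), P(all tests positive)).

    tests is a list of (sensitivity, specificity) pairs; the tests are
    assumed conditionally independent given the athlete's true status.
    """
    like_user = 1.0   # P(all positive | PED)
    like_clean = 1.0  # P(all positive | No PED)
    for sensitivity, specificity in tests:
        like_user *= sensitivity
        like_clean *= 1.0 - specificity
    numerator = like_user * prior
    evidence = numerator + like_clean * (1.0 - prior)
    return numerator / evidence, evidence


post, evidence = posterior_after_positives(0.03, [(0.993, 0.73), (0.81, 0.90)])
print(round(evidence, 4), round(post, 3))  # 0.0503 0.48
```

Under these assumptions the probability of PED use after two positive tests comes out at roughly 0.48. The same number results from updating sequentially, that is, using the posterior after the first test as the prior for the second.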

The way I would approach question 3 would be to start with some prior density ##f_0(p)## of the parameter ##p = \Pr(PED)##; that is a prior, before any measurements have been made. Then, given ##p##, the number of doped athletes ##X## in our first sample is binomial with parameters ##300## and ##p##. We observed ##3\% \times 300 = 9## doped athletes, so
$$\Pr(X=9|p) = {300 \choose 9} p^9 (1-p)^{291}.$$
The posterior density of ##p## (after observing 9 doped athletes in 300) is
$$f(p) = \Pr(p | X = 9) = \frac{ \Pr(X=9 |p) f_0(p)}{\int_0^1 \Pr(X=9|p) f_0(p) \, dp}$$
Note that here I have abused notation by saying things like ##\Pr(p|X=9)## instead of the more accurate notation
$$\lim_{\Delta p \to 0} \frac{\Pr([p,p+\Delta p] | X=9)}{\Delta p}$$

If we make the reasonable (?) assumption that ##f_0(p) = 1## on ##0 < p < 1## then we have ##f(p) = C p^9 (1-p)^{291}## for a normalizing constant ##C## that gives ##\int_0^1 f(p) \, dp = 1##.
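For reference, a density proportional to ##p^9 (1-p)^{291}## on ##0 < p < 1## is a Beta density with parameters ##10## and ##292##, so under the uniform-prior assumption the constant can be written in closed form. The identification with the Beta family is a standard fact rather than something stated in the problem:
$$
f(p) = \frac{p^{10-1} (1-p)^{292-1}}{B(10, 292)}, \qquad C = \frac{1}{B(10,292)} = \frac{301!}{9! \, 291!} = 301 {300 \choose 9}.
$$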

You could use the above ##f(p)## as your new "prior" before the second round of measurements, to give you a posterior density.
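A minimal sketch of that two-stage update, assuming the uniform prior ##f_0(p) = 1## (a Beta(1, 1) density) and binomial sampling in both rounds. With a Beta prior and binomial data the posterior is again a Beta density, so the update only adds counts. The helper name `beta_update` is hypothetical.

```python
from fractions import Fraction


def beta_update(a, b, successes, trials):
    """Conjugate update of a Beta(a, b) prior with binomial data."""
    return a + successes, b + (trials - successes)


# Assumption: uniform prior on p, i.e. Beta(1, 1).
a, b = 1, 1
a, b = beta_update(a, b, successes=9, trials=300)    # first study -> Beta(10, 292)
a, b = beta_update(a, b, successes=11, trials=500)   # this year   -> Beta(21, 781)

mean = Fraction(a, a + b)          # posterior expected value of p
mode = Fraction(a - 1, a + b - 2)  # posterior most probable value of p
print((a, b), float(mean), float(mode))
```

A different starting prior shifts the parameters slightly; for example, treating the first study directly as a Beta(9, 291) prior would give Beta(20, 780) after the second round.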

The only remaining issue is how to use the posterior density obtained in that way. Should we use the whole posterior distribution, doing calculations for each fixed ##p##, then integrating over ##p##? Should we simplify life by using (for example) the expected value of ##p##, taken with respect to the posterior? Should we use the most probable value of ##p## (the mode of the posterior density)? Should we use the median of the posterior density?

BTW: the three summaries just mentioned, evaluated for the new "prior" (the posterior after testing the first 300), are:
$$\begin{array}{rcl}
\text{expected value of } \; p & = &5/151 \doteq 0.03311\\
\text{most probable value of } \; p &=& 3/100 = 0.03 \\
\text{median value of } \; p &= &0.03209
\end{array}
$$
Using the median is not easy: getting it requires solving a degree-301 polynomial equation, which definitely calls for a computer algebra system; I used Maple.
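These three values can also be reproduced numerically without a computer algebra system. The sketch below assumes the density ##C p^9 (1-p)^{291}##, i.e. Beta(10, 292), and uses the identity that for integer parameters the Beta CDF equals a binomial tail probability, which is where the degree-301 polynomial comes from. The function names are illustrative, and bisection stands in for Maple's solver.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    lower_tail = sum(comb(n, j) * x**j * (1.0 - x) ** (n - j) for j in range(a))
    return 1.0 - lower_tail


def beta_median_int(a, b, tol=1e-10):
    """Median of Beta(a, b), found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf_int(mid, a, b) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


a, b = 10, 292
print(a / (a + b))            # expected value, 5/151
print((a - 1) / (a + b - 2))  # most probable value, 9/300
print(beta_median_int(a, b))  # median, to compare with the 0.03209 quoted above
```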

Rifscape · Feb 22, 2018

I see, that makes sense. Apologies for the delayed response; I was working on the problem. Your solution was very helpful, and I was able to solve it thanks to you! The actual solution is not as complicated as the one you gave, but yours definitely put me on the right track. The model for the posterior was a beta distribution, with the initial values forming the prior and the values given in question 3 forming the likelihood. All that is needed after that is to plug the values into the beta distribution.

Thank you for your help and your clear solution - it helped a ton!
