# Stats Question Help

Hi, my main sticking point with the following is which bit of statistics/probability theory is needed to answer the question. We've only been taught basic Bayes theory, the standard pdfs and a little on hypothesis testing. I have spent way too much time just trying to figure out where to start, any pointers appreciated.

"A bakery has suspicions that their recent production run of cherry pies has resulted in half of all the pies becoming contaminated. The bakery is trying to work out how this issue will change the number of customers complaining. How many cherry pies out of their recent production run does the bakery need to test, to determine the probability that any given cherry pie is contaminated to better than 5%".

## Answers and Replies

Stephen Tashi
Science Advisor
Solving the problem is a mind reading exercise - what does the author of the problem expect us to assume? To get a hint about that, we need to know where you encountered this problem. What course? What are some examples of other problems from the same problem set?

We've only been taught basic Bayes theory, the standard pdfs and a little on hypothesis testing.

Have you studied confidence intervals?

"the probability that any given cherry pie is contaminated to better than 5%".

That language might be a mangled attempt to ask a question about "confidence".

As it stands, the problem seems to ask how many pies must be tested to be certain that the probability of a randomly selected pie being contaminated is estimated to within ##\pm .05 ##.

Having 95% "confidence" about the estimate of a probability is different from being certain it is estimated to within ##\pm .05##. There is also the question of whether "5%" means we are to estimate the unknown probability ##p## within ##\pm .05 ## or whether we must estimate it within ##\pm .05 p## of its true value.

An amusing approach is to assume we can give the answer in terms of the number of pies in the particular production run. If there are ##N## pies in the production run and we make the Bayesian assumption that ##N/2## are contaminated, then how many pies ##M## must be tested to be certain we test a number of contaminated pies ##C## such that ##| C/M - 1/2| \le .05 ## ?

The worst case for sampling is that we'd be unlucky enough to test all the contaminated pies or all the uncontaminated pies before testing pies in the opposite condition. Let ##M = N/2 + K##. As a function of ##N##, what value must ##K## have to satisfy ##| K/((N/2) + K) - 1/2| \le 0.05## ?

If the problem intends to make a point about the distinction between "confidence" about the estimate of a parameter versus "certainty" about the range of an estimate, that amusing interpretation might be what the author intends. However, given the average course material, I think such an interpretation is unlikely.
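A brute-force check of this worst-case reading can be sketched as follows. It is only an illustration under stated assumptions: ##N## is even, exactly ##N/2## pies are contaminated, and the tolerance is ##\pm .05##. The function name `min_tests_for_certainty` is made up for this example.

```python
from fractions import Fraction


def min_tests_for_certainty(n_pies: int, tol: Fraction = Fraction(1, 20)) -> int:
    """Smallest M such that EVERY possible sample of M pies satisfies
    |C/M - 1/2| <= tol, assuming exactly n_pies/2 pies are contaminated."""
    half = n_pies // 2  # assumption: n_pies is even
    for m in range(1, n_pies + 1):
        # Most and fewest contaminated pies a sample of size m can contain.
        c_max = min(m, half)
        c_min = max(0, m - half)
        worst = max(
            abs(Fraction(c_max, m) - Fraction(1, 2)),
            abs(Fraction(c_min, m) - Fraction(1, 2)),
        )
        if worst <= tol:  # exact rational comparison, no rounding issues
            return m
    return n_pies


print(min_tests_for_certainty(100))  # 91
print(min_tests_for_certainty(200))  # 182
```

Solving the inequality by hand gives ##K \ge 9N/22##, so ##M \ge 10N/11##: under this reading roughly 91% of the run would have to be tested, which is why the "certainty" interpretation is so much more demanding than a "confidence" one.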

Thanks for the reply. The course is essentially a "Stats and Data Analysis for Scientists" course, which is more about what we can read/interpret from data, with an emphasis on using programming languages to help solve the problems, make plots etc. We're only three weeks in and have not done anything on confidence intervals (though we have touched on the P-value and NHST). Other questions include one where we essentially just had to use the standard Bayes rule ## P(A | B) = \frac{P(B | A) P(A)}{P(B | A) P(A) + P(B | A^c) P(A^c)} ## where we had to make an assumption for the "prior" ##P(A)##, and another question where we were given some data and had to computationally find the Pearson r-value and p-value.
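For reference, the Bayes rule above with a two-way split into ##A## and ##A^c## can be sketched in a few lines. The function name `posterior` and the numbers in the usage line are purely illustrative assumptions, not values from the course problem.

```python
def posterior(prior: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """P(A | B) from the prior P(A) and the likelihoods P(B | A), P(B | A^c)."""
    numerator = p_b_given_a * prior
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|A^c) P(A^c).
    evidence = numerator + p_b_given_not_a * (1.0 - prior)
    return numerator / evidence


# Illustrative values only: prior 0.5, P(B|A) = 0.9, P(B|A^c) = 0.1.
print(posterior(0.5, 0.9, 0.1))
```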

Stephen Tashi
Science Advisor
The major difficulty in interpreting the problem is the following language:
"determine the probability that any given cherry pie is contaminated to better than 5%"

"Determining" the probability could mean estimating its value. Such an estimate has some error ##e_r## which is a random variable since it depends on the results of a random sample. For each given error bound ##\delta## there is a probability ##p_\delta## that ##|e_r| \le \delta##. From that viewpoint it isn't clear what "5%" refers to. Does it refer to ##\delta## or something about ##p_\delta##? If we assume the required value of ##p_\delta## is, say, 0.95 and the required value of ##\delta## is 0.05 then we have a well defined problem.
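Under that reading (##\delta = 0.05##, ##p_\delta = 0.95##), a rough sample size can be sketched with the usual normal approximation to the sample proportion. This is a sketch under assumptions that are not in the problem statement: pies are tested independently (a large run, or sampling with replacement), the normal approximation is adequate, and the true probability is near the suspected value of 1/2. The function name `sample_size_for_proportion` is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(delta: float, coverage: float, p_guess: float = 0.5) -> int:
    """Approximate n such that P(|p_hat - p| <= delta) is about `coverage`,
    using the normal approximation with standard error sqrt(p(1-p)/n)."""
    # Two-sided critical value, e.g. about 1.96 for coverage = 0.95.
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return math.ceil(z * z * p_guess * (1.0 - p_guess) / (delta * delta))


print(sample_size_for_proportion(0.05, 0.95))  # 385
```

If "5%" were instead read as a relative error ##\pm .05 p##, the same function could be called with `delta = 0.05 * p_guess`, which gives a much larger sample size.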

(though we have touched on the P-value, and NHST).

If we interpret "better than 5%" as a requirement to design a test of the null hypothesis that p(randomly selected pie is contaminated) = 1/2 with a significance level of ##\alpha = 0.05## then we have a well defined problem. However that interprets "determine the probability" as merely testing the hypothesis that it is 1/2 rather than estimating whether it has a different value. Have you done problems where you must determine the required sample size to use in a hypothesis test? If it's this kind of problem, it won't involve Bayes' theorem. You will assume the probability that a randomly selected pie is contaminated is (definitely) 1/2 and do all computations based on that assumption.

StoneTemplePython
Science Advisor
Gold Member
My guess is yes, the OP is trying to determine a confidence interval, or perhaps, in the Bayesian realm, a credible interval.

But as ST says... it's quite an expansive question and feels a lot like mind reading at times...

The major difficulty in interpreting the problem is the following language:

"Determining" the probability could mean estimating its value.

My bad, the question says "know" not "determine"; the rest of the text is as is.

If we interpret "better than 5%" as a requirement to design a test of the null hypothesis that p(randomly selected pie is contaminated) = 1/2 with a significance level of ##\alpha = 0.05## then we have a well defined problem. However that interprets "determine the probability" as merely testing the hypothesis that it is 1/2 rather than estimating whether it has a different value. Have you done problems where you must determine the required sample size to use in a hypothesis test? If it's this kind of problem, it won't involve Bayes' theorem. You will assume the probability that a randomly selected pie is contaminated is (definitely) 1/2 and do all computations based on that assumption.

This sounds more likely to be it, we have done NHST, P-values etc and nothing on confidence intervals. We haven't done anything where we had to determine the required sample size, but I am wondering whether we have to just make an assumption.

Based on testing the null hypothesis that 1/2 the cherry pies are contaminated, and just running some brute force methods, so far I found that if there were 100 pies, they'd need to test 60 to get ##\alpha \leq 0.05##, and if there were 200 pies, they'd need to test 114 (~57%). Does that sound at least reasonable? It sounds way too high to me.

Stephen Tashi
Science Advisor
just running some brute force methods so far I found that if there were 100 pies, they'd need to test 60 to get ##\alpha \leq 0.05##,
Are you using a "one-tailed" test or a "two-tailed" test? The language "how this issue will change the number of customers complaining" suggests the company is interested in whether the probability of contamination is different from 1/2, either by being less or greater.

What is the statistic for your test and what is the "rejection region" for your test?
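To make those two questions concrete, here is a minimal sketch of one possible choice, not necessarily the test used in the brute-force runs above (that method isn't specified in the thread). It assumes the statistic is ##C##, the number of contaminated pies among ##n## tested, that under the null hypothesis ##C## is binomial with ##p = 1/2## (independent tests, i.e. a large run or sampling with replacement), and that the two-tailed rejection region is symmetric. The function name `two_tailed_rejection_region` is hypothetical.

```python
from math import comb


def two_tailed_rejection_region(n: int, alpha: float = 0.05):
    """Return (k_low, k_high, size): reject H0: p = 1/2 when
    C <= k_low or C >= k_high, where size is the actual type I error.
    k_low == -1 means no rejection region fits within alpha."""
    # Binomial(n, 1/2) probabilities: P(C = k) = comb(n, k) / 2^n.
    pmf = [comb(n, k) / 2**n for k in range(n + 1)]
    k_low = -1
    tail = 0.0
    # Grow the lower tail while both symmetric tails together stay within alpha.
    while k_low + 1 < n / 2 and 2 * (tail + pmf[k_low + 1]) <= alpha:
        k_low += 1
        tail += pmf[k_low]
    return k_low, n - k_low, 2 * tail


k_low, k_high, size = two_tailed_rejection_region(10)
print(k_low, k_high, size)  # 1 9 0.021484375
```

Because ##C## is discrete, the achieved size is usually strictly below ##\alpha##. Note that fixing ##\alpha## alone does not determine a sample size; that also requires specifying which alternative value of the probability the test should be able to detect, and with what power.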