Stats Q Help: Find Probability of Contaminated Cherry Pies

  • Context: Undergrad 
  • Thread starter: ChrisJ
  • Tags: Stats

Discussion Overview

The discussion revolves around a statistical problem concerning the probability of contamination in cherry pies produced by a bakery. Participants explore the necessary statistical concepts and methods to determine how many pies need to be tested to estimate the contamination probability with a specified level of precision. The scope includes elements of probability theory, hypothesis testing, and Bayesian statistics.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant expresses uncertainty about which statistical methods are applicable, mentioning basic Bayes theory, standard probability density functions (pdfs), and hypothesis testing as their current knowledge base.
  • Another participant suggests that the problem may require assumptions about the expected contamination rate and questions the interpretation of "better than 5%" in terms of confidence intervals versus certainty in estimation.
  • There is a discussion about whether "better than 5%" refers to estimating the unknown probability within a fixed error margin or a relative error margin.
  • Some participants propose that the problem might involve determining sample sizes for hypothesis testing, contrasting this with Bayesian approaches.
  • One participant notes the ambiguity in the language of the problem, particularly regarding the terms "determine" and "know," and how these affect the interpretation of the statistical requirements.
  • Another participant suggests that the problem could be interpreted as requiring a credible interval in Bayesian statistics.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the interpretation of the problem or the appropriate statistical methods to apply. Multiple competing views remain regarding the meaning of "better than 5%" and the relevance of confidence intervals versus hypothesis testing.

Contextual Notes

Participants highlight limitations in their current understanding of confidence intervals and hypothesis testing, indicating that these concepts have not been fully covered in their course yet. The discussion also reflects uncertainty about the specific statistical framework intended by the problem's author.

ChrisJ
Hi, my main sticking point with the following is which bit of statistics/probability theory is needed to answer the question. We've only been taught basic Bayes theory, the standard pdfs and a little on hypothesis testing. I have spent way too much time just trying to figure out where to start, any pointers appreciated.

"A bakery has suspicions that their recent production run of cherry pies has resulted in half of all the pies becoming contaminated. The bakery is trying to work out how this issue will change the number of customers complaining. How many cherry pies out of their recent production run does the bakery need to test, to determine the probability that any given cherry pie is contaminated to better than 5%".
 
Solving the problem is a mind reading exercise - what does the author of the problem expect us to assume? To get a hint about that, we need to know where you encountered this problem. What course? What are some examples of other problems from the same problem set?

ChrisJ said:
We've only been taught basic Bayes theory, the standard pdfs and a little on hypothesis testing.

Have you studied confidence intervals?

"the probability that any given cherry pie is contaminated to better than 5%"

That language might be a mangled attempt to ask a question about "confidence".

As it stands, the problem seems to ask how many pies must be tested to be certain that the probability of a randomly selected pie being contaminated is estimated to within ##\pm .05##.

Having 95% "confidence" about the estimate of a probability is different from being certain it is estimated to within ##\pm .05##. There is also the question of whether "5%" means we are to estimate the unknown probability ##p## within ##\pm .05## or whether we must estimate it within ##\pm .05 p## of its true value.
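To make the difference between "confidence" and "certainty" concrete, here is a minimal Python sketch. It computes exactly how likely it is that the sample proportion lands within ##\pm .05## of 1/2, under the assumption that each tested pie is independently contaminated with probability 1/2 (a binomial model, which the problem does not state). The function name `prob_within` and its parameters are illustrative only.

```python
from fractions import Fraction
from math import comb


def prob_within(n, tol=Fraction(1, 20), p=Fraction(1, 2)):
    """Exact P(|X/n - p| <= tol) for X ~ Binomial(n, p).

    Assumes independent pies, each contaminated with probability p.
    Fractions are used so the boundary comparison is exact.
    """
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - p) <= tol:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return float(total)


for n in (20, 100, 385):
    print(n, round(prob_within(n), 4))
```

Under this binomial model the probability is below 1 for every ##n##, because a sample with no contaminated pies always has nonzero probability. That is why demanding certainty and demanding 95% confidence lead to very different answers.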

An amusing approach is to assume we can give the answer in terms of the number of pies in the particular production run. If there are ##N## pies in the production run and we make the Bayesian assumption that ##N/2## are contaminated, then how many pies ##M## must be tested to be certain that the number of contaminated pies tested, ##C##, satisfies ##| C/M - 1/2| \le .05##?

The worst case for sampling is that we'd be unlucky enough to test all the contaminated pies or all the uncontaminated pies before testing pies in the opposite condition. Let ##M = N/2 + K##. As a function of ##N##, what value must ##K## have to satisfy ##| K/((N/2) + K) - 1/2| \le .05##?

If the problem intends to make a point about the distinction between "confidence" about the estimate of a parameter versus "certainty" about the range of an estimate, that amusing interpretation might be what the author intends. However, given the average course material, I think such an interpretation is unlikely.
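The worst-case count in the "amusing" interpretation can be checked by brute force. The sketch below assumes exactly ##N/2## of the ##N## pies are contaminated and takes the tolerance to be .05, as in the inequality for ##C/M##. The function name is hypothetical.

```python
from fractions import Fraction


def min_tests_for_certainty(n_pies, tol=Fraction(1, 20)):
    """Smallest M such that EVERY sample of M pies has |C/M - 1/2| <= tol.

    Assumes exactly n_pies / 2 pies in the run are contaminated.
    """
    if n_pies % 2:
        raise ValueError("n_pies must be even")
    half = n_pies // 2
    for m in range(1, n_pies + 1):
        # Fewest and most contaminated pies a sample of size m can contain.
        c_min = max(0, m - half)
        c_max = min(m, half)
        worst = max(abs(Fraction(c, m) - Fraction(1, 2)) for c in (c_min, c_max))
        if worst <= tol:
            return m
    return n_pies


print(min_tests_for_certainty(100))  # 91, i.e. K = 41
print(min_tests_for_certainty(200))  # 182, i.e. K = 82
```

In closed form the condition works out to ##M \ge 10N/11##, so about 91% of the run would have to be tested to be certain.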
 
Thanks for the reply. The course is essentially a "Stats and Data Analysis for Scientists" course, which is more about what we can read/interpret from data, with an emphasis on using programming languages to help solve the problems, make plots etc. We're only three weeks in and have not done anything on confidence intervals (though we have touched on the P-value and NHST). Other questions include one where we essentially just had to use the standard Bayes rule ## P(A | B) = \frac{P(B | A) P(A)}{P(B | A) P(A) + P(B | A^c) P(A^c)} ##, where we had to make an assumption for the "prior" ##P(A)##, and another question where we were given some data and had to computationally find the Pearson r-value and p-value.
 
The major difficulty in interpreting the problem is the following language:
"determine the probability that any given cherry pie is contaminated to better than 5%"

"Determining" the probability could mean estimating its value. Such an estimate has some error ##e_r## which is a random variable since it depends on the results of a random sample. For each given error bound ##\delta## there is a probability ##p_\delta## that ##|e_r| \le \delta##. From that viewpoint it isn't clear what "5%" refers to. Does it refer to ##\delta## or something about ##p_\delta##? If we assume the required value of ##p_\delta## is, say, 0.95 and the required value of ##\delta## is 0.05 then we have a well defined problem.
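Under that reading (##p_\delta = 0.95##, ##\delta = 0.05##), a common textbook route is the normal approximation to the binomial, which gives ##n \ge z^2 p(1-p)/\delta^2##. The sketch below is one way to evaluate it. The normal approximation, the use of the guess ##p = 1/2## from the problem statement, and the names `sample_size`, `delta`, `confidence` and `p_guess` are all assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size(delta, confidence=0.95, p_guess=0.5):
    """Approximate n so that P(|p_hat - p| <= delta) is about `confidence`.

    Uses the normal approximation to the binomial and assumes the true
    proportion is close to p_guess (p = 1/2 is also the worst case).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return ceil(z * z * p_guess * (1 - p_guess) / delta**2)


print(sample_size(0.05))   # 385: absolute margin of +/- .05
print(sample_size(0.025))  # 1537: relative margin .05 * p when p is near 1/2
```

The two calls show how much the answer depends on whether "5%" is read as an absolute or a relative margin.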
ChrisJ said:
(though we have touched on the P-value, and NHST).

If we interpret "better than 5%" as a requirement to design a test of the null hypothesis that p(randomly selected pie is contaminated) = 1/2 with a significance level of ##\alpha = 0.05## then we have a well defined problem. However, that interprets "determine the probability" as merely testing the hypothesis that it is 1/2 rather than estimating whether it has a different value. Have you done problems where you must determine the required sample size to use in a hypothesis test? If it's this kind of problem, it won't involve Bayes' theorem. You will assume the probability that a randomly selected pie is contaminated is (definitely) 1/2 and do all computations based on that assumption.
 
My guess is that, yes, the OP is trying to determine a confidence interval, or perhaps, in the Bayesian realm, a credible interval.

But as ST says... it's quite an expansive question and feels a lot like mind reading at times...
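For the credible-interval reading, here is a rough sketch in plain Python. It assumes a uniform Beta(1, 1) prior on the contamination probability, so that after finding ##k## contaminated pies among ##n## tested the posterior is Beta(##k+1##, ##n-k+1##). The prior, the grid-based integration, the function name and the example counts (200 of 400) are all illustrative assumptions, not anything given in the problem.

```python
import math


def beta_credible_interval(k, n, level=0.95, grid=20001):
    """Central credible interval for p under a uniform prior (numerical grid)."""
    a, b = k + 1, n - k + 1
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    # Midpoints of equal-width cells on (0, 1); avoids log(0) at the ends.
    xs = [(i + 0.5) / grid for i in range(grid)]
    dens = [
        math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))
        for x in xs
    ]
    total = sum(dens)
    tail = (1 - level) / 2
    cum, lo, hi = 0.0, None, None
    for x, d in zip(xs, dens):
        cum += d / total
        if lo is None and cum >= tail:
            lo = x
        if cum >= 1 - tail:
            hi = x
            break
    return lo, hi


lo, hi = beta_credible_interval(200, 400)  # hypothetical test result
print(round(lo, 3), round(hi, 3))
```

One could then increase ##n## until the half-width of the interval drops below .05, which is the Bayesian counterpart of the confidence-interval sample-size question.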
 
Stephen Tashi said:
The major difficulty in interpreting the problem is the following language: "Determining" the probability could mean estimating its value.

My bad, the question says "know", not "determine"; the rest of the text is as is.

Stephen Tashi said:
If we interpret "better than 5%" as a requirement to design a test of the null hypothesis that p(randomly selected pie is contaminated) = 1/2 with a significance level of ##\alpha = 0.05## then we have a well defined problem. However, that interprets "determine the probability" as merely testing the hypothesis that it is 1/2 rather than estimating whether it has a different value. Have you done problems where you must determine the required sample size to use in a hypothesis test? If it's this kind of problem, it won't involve Bayes' theorem. You will assume the probability that a randomly selected pie is contaminated is (definitely) 1/2 and do all computations based on that assumption.

This sounds more likely to be it, we have done NHST, P-values etc and nothing on confidence intervals. We haven't done anything where we had to determine the required sample size, but I am wondering whether we have to just make an assumption.

Based on testing the null hypothesis that half of the cherry pies are contaminated, and running some brute-force methods, I have so far found that if there were 100 pies, they'd need to test 60 to get ##\alpha \leq 0.05##, and if there were 200 pies, they'd need to test 114 (~57%). Does that sound at least reasonable? It sounds way too high to me.
 
ChrisJ said:
just running some brute force methods so far I found that if there were 100 pies, they'd need to test 60 to get ##\alpha \leq 0.05##,
Are you using a "one-tailed" test or a "two-tailed" test? The language
"how this issue will change the number of customers complaining"
suggests the company is interested in whether the probability of contamination differs from 1/2, in either direction.

What is the statistic for your test and what is the "rejection region" for your test?
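As one possible answer to that question, the sketch below takes the test statistic to be the number ##X## of contaminated pies among ##n## tested and builds an equal-tailed two-sided rejection region for the null hypothesis ##p = 1/2##. It assumes a binomial model (a production run large enough that sampling without replacement can be ignored), which may not match the brute-force approach described above. The function name is hypothetical.

```python
from math import comb


def rejection_region(n, alpha=0.05):
    """Equal-tailed two-sided rejection region for H0: p = 1/2.

    Returns (c, n - c), meaning "reject H0 if X <= c or X >= n - c",
    where c is the largest count with P(X <= c) <= alpha / 2 under H0.
    Returns None if no such region exists for this n.
    """
    total = 2**n
    cum = 0
    c = -1
    for k in range(n + 1):
        cum += comb(n, k)  # number of outcomes with X <= k
        if cum / total > alpha / 2:
            break
        c = k
    return None if c < 0 else (c, n - c)


print(rejection_region(20))  # (5, 15): reject if X <= 5 or X >= 15
print(rejection_region(5))   # None: too few pies to ever reject at alpha = 0.05
```

Note that ##\alpha## alone does not fix a sample size. The rejection region exists for any ##n \ge 6##, so a required ##n## only emerges once one also specifies how large a departure from 1/2 the test should be able to detect.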
 
