When/how to reject Poisson distribution hypothesis?

In summary, you have a large number of buckets (1026059 in your example) holding objects of two types. You cannot count the objects in a bucket, only see which types are present, and you want to test whether the two types are distributed over the buckets independently and at random.
  • #1
DaanV

Homework Statement


I have run into a situation that my gut tells me is impossible (alright extremely unlikely) when assuming a Poisson distribution. I want to make this gut feeling more formal by testing it against a Poisson distribution. Sadly I'm not a schooled statistician.

Generalised form
I have ##N## objects that I am dividing over ##C## compartments.
##N## can be of distinguishable types ##N_A## and ##N_B## such that ##N_A + N_B = N##
(say green and blue tennis balls in buckets)

Sadly I cannot distinguish the number of objects that are in a bucket, but I can tell of what type they are. I.e. I can see if a bucket contains no tennis balls, only green, only blue or both.

Usually, ##C \gg N##

If my null hypothesis is that the distribution of tennis balls over the buckets is random, what calculations do I have to perform to reject that hypothesis with e.g. 99% certainty?

Real example
This example should (IMO) quite clearly yield a rejection of the hypothesis, but I have other examples where the distinction may be less obvious. So I'm looking for ways to calculate.

I have 1026059 "buckets". 29755 of these contain only (at least one) ##N_A##, 14 contain both (at least one each) ##N_A## and ##N_B##. 0 "buckets" contain only ##N_B##. The remainder (1026059 - 29755 - 14 - 0 = 996290) contains no objects, is empty.

It seems incredibly unlikely to me that all 14 ##N_B## objects would end up in a bucket also containing ##N_A##, despite there being so many empty buckets left. But how do I bring this formally?

Homework Equations


##H_0## = Poisson distribution
##H_1## = a greater than random probability of B being linked to A

Something along those lines?

The Attempt at a Solution


I want to determine ##N_A## and ##N_B## from the bucket counts, assuming a Poisson distribution. Then from there I want to determine the expected distribution of ##N_A## and ##N_B## over the buckets, again assuming a Poisson distribution. That should show that we expect very few "dual positive" buckets.

Probability of ##b## balls in any bucket, given an average 'Loading' ##L## of balls per bucket (##L = \frac{N}{C}##), is given by:
##P(b;L)=\frac{L^b * e^{-L}}{b!}##

The fraction of empty buckets over total buckets (##\frac{C^-}{C}##) is then given by
##\frac{C^-}{C}=P(0;L)=\frac{L^0 * e^{-L}}{0!} = e^{-L}## so that we can compute ##L=-\ln(\frac{C^-}{C})##

Doing so we can calculate ##L_A## and ##L_B## in an equivalent manner. Using the example above, I get to ##L_A=0.0289## and ##L_B=1.364*10^{-5}##.

From there I can calculate what distribution of fragments we would expect, given that loading. Doing so yields that we would expect on average 0.4 buckets with both ##N_A## and ##N_B## in them, rather than the 14 that I found.
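The loading estimate and the expected number of double-positive buckets described above can be sketched in a few lines of Python. This is only an illustration under the post's own assumptions (Poisson loading, independent placement of A and B). The language, the variable names and the helper `loading` are choices made for this sketch and do not come from the thread.

```python
import math

# Bucket counts quoted in the post above.
C_TOTAL = 1026059   # all buckets
C_ONLY_A = 29755    # buckets showing only type A
C_ONLY_B = 0        # buckets showing only type B
C_BOTH = 14         # buckets showing both types


def loading(total, without_type):
    """Poisson loading L = -ln(fraction of buckets lacking this type).

    Hypothetical helper name, introduced for this sketch only.
    """
    return -math.log(without_type / total)


# Buckets lacking A are the empty ones plus the B-only ones; likewise for B.
L_A = loading(C_TOTAL, C_TOTAL - C_ONLY_A - C_BOTH)
L_B = loading(C_TOTAL, C_TOTAL - C_ONLY_B - C_BOTH)

# Under independence, P(both) = P(at least one A) * P(at least one B).
p_a = 1.0 - math.exp(-L_A)
p_b = 1.0 - math.exp(-L_B)
expected_both = C_TOTAL * p_a * p_b

print(f"L_A = {L_A:.4g}, L_B = {L_B:.4g}")
print(f"expected buckets with both types = {expected_both:.3f}")
```

With the counts from the post, the expected number of double-positive buckets comes out at roughly 0.4, in line with the figure quoted above and far below the 14 observed.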

So.. How do I show that the expected distribution does not fit with the observed distribution?

Thanks in advance for any help provided!
 
  • #2
You've done a fair job of making the problem clear, but some details need clarification.

I suggest we not use the same symbol [itex] N_A [/itex] to denote both a type of thing and also the number of things of that type. Let's say "[itex] A [/itex]" denotes a type of thing and "[itex] N_A [/itex]" denotes the number of things of that type.

As I understand it, you don't know the numerical values of [itex] N_A [/itex] and [itex] N_B [/itex].

If you are trying to estimate the values of [itex] N_A, N_B [/itex] from the data, this is (naturally) a problem of "statistical estimation". If you are trying to do a statistical test of the hypothesis that the balls are placed independently and at random in the buckets, this is a problem of statistical "hypothesis testing". A method of hypothesis testing might involve a preliminary step that is "estimation".

As I understand the problem, you don't have an "observed distribution". Instead, you have 3 numerical values [itex] (L,L_A, L_B) [/itex]. A distribution of those values would be a function of 3 variables [itex] f(x,y,z) [/itex]. If you specify a certain volume V, you can ask "What is the probability that the observed 3 values will fall in that volume?". For example, you might specify the volume by [itex] y \le 0.4, z \le 0.4 [/itex]. Hypothesis testing relies on the subjective procedure of "rejecting" the hypothesis that [itex] f(x,y,z) [/itex] is the distribution from which the data is drawn when there is a small probability of the particular triplet of values observed in the data falling in the specified volume.

Can you write computer simulations? I think this would be the quickest way to investigate your problem. There might be a pencil-and-paper way to get some results, but none pops into my mind at the moment.
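As a rough idea of what such a simulation could look like, here is a minimal Python sketch. It assumes every object is dropped into a uniformly random bucket, independently of all others. The function names, the toy sizes and the seeds are hypothetical, picked only so that the example runs quickly. The true ball counts are unknown in the real problem and would have to be estimated first.

```python
import random


def simulate(n_buckets, n_a, n_b, rng):
    """Drop n_a type-A and n_b type-B objects into buckets uniformly at random.

    Returns counts of buckets that are empty, A-only, B-only, or hold both.
    """
    has_a = {rng.randrange(n_buckets) for _ in range(n_a)}
    has_b = {rng.randrange(n_buckets) for _ in range(n_b)}
    both = len(has_a & has_b)
    only_a = len(has_a) - both
    only_b = len(has_b) - both
    empty = n_buckets - only_a - only_b - both
    return {"empty": empty, "A": only_a, "B": only_b, "AB": both}


def tail_probability(n_buckets, n_a, n_b, observed_both, trials, seed=0):
    """Fraction of simulated runs with at least `observed_both` AB buckets."""
    rng = random.Random(seed)
    hits = sum(
        simulate(n_buckets, n_a, n_b, rng)["AB"] >= observed_both
        for _ in range(trials)
    )
    return hits / trials


# Hypothetical toy sizes, not the numbers from the real experiment.
counts = simulate(n_buckets=10_000, n_a=300, n_b=5, rng=random.Random(1))
print(counts)
print(tail_probability(10_000, 300, 5, observed_both=3, trials=2_000))
```

The second function estimates how often random placement alone produces at least as many double-positive buckets as were observed, which is the kind of quantity a test would be built on.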
 
  • #3
By the way, is this a homework problem? It doesn't sound like a homework problem to me. If it isn't, then I'll suggest the thread be moved to the mathematics section on set theory and probability.
 
  • #4
Stephen Tashi said:
You've done a fair job of making the problem clear, but some details need clarification.

I suggest we not use the same symbol [itex] N_A [/itex] to denote both a type of thing and also the number of things of that type. Let's say "[itex] A [/itex]" denotes a type of thing and "[itex] N_A [/itex]" denotes the number of things of that type.
You're absolutely right. Sorry about the confusion.

As I understand it, you don't know the numerical values of [itex] N_A [/itex] and [itex] N_B [/itex].

If you are trying to estimate the values of [itex] N_A, N_B [/itex] from the data, this is (naturally) a problem of "statistical estimation". If you are trying to do a statistical test of the hypothesis that the balls are placed independently and at random in the buckets, this is a problem of statistical "hypothesis testing". A method of hypothesis testing might involve a preliminary step that is "estimation".
That's right, I don't know ##N_A## and ##N_B## (the numbers of green and blue balls). I only know the number of buckets that have balls in them, and what colour(s) are in those buckets. I am indeed trying to test whether or not the balls are placed independently and at random.

As I understand the problem, you don't have an "observed distribution". Instead, you have 3 numerical values [itex] (L,L_A, L_B) [/itex] A distribution of those values would be a function of 3 variables [itex] f(x,y,z) [/itex].
Please forgive my ignorance on statistical nomenclature. What I mean by "observed distribution" is that I know the number of buckets, and I can literally count (observe) the number of buckets with coloured balls in them. Again, I do not know the distribution of balls over buckets.
I would say I have 4 numerical values: ##C_{empty}, C_A, C_B, C_{AB}##, denoting the number of buckets after distribution of the balls.

if you specify a certain volume V, you can ask "What is the probability that the observed 3 values will fall in that volume?". For example, you might specify the volume by [itex] y \le 0.4, z \le 0.4 [/itex] Hypothesis testing relies on the subjective procedure of "rejecting" the hypothesis that [itex] f(x,y,z) [/itex] is the distribution from which the data is drawn when there is a small probability of the particular triplet of values observed in the data falling in the specified volume.
I'm not sure I follow. Only 2 'values' (A and B) should fall in the same 'volume' (bucket).

Here's my shot:
Let ##C## be the number of buckets and ##N## the number of balls. Let ##P(C_A)## be the probability for any bucket to contain an ##A##, ##P(C_B)## to contain ##B## and ##P(C_{AB})## to contain both.

Can I then say that the probability for any one droplet to contain both A and B is:
##P(C_{AB}) = P(C_A) * P(C_B) = \frac{N_A}{C} * \frac{N_B}{C}##
?

Then the probability for X buckets to contain both ##A## and ##B##, given ##P(C_{AB})##, could be computed by:
##P = P(C_{AB})^X * \binom{C}{X}##
Does that sound remotely correct?
Of course problems arise in that I don't yet know ##N_A## and ##N_B##. I could estimate those statistically, as you mentioned earlier, under the assumption of independent and random placement.

Can you write computer simulations? I think this would be the quickest way to investigate your problem. There might be a pencil-and-paper way to get some results, but none pops into my mind at the moment.
No.
I'll be following an introductory R course soon-ish.

Stephen Tashi said:
By the way, is this a homework problem? It doesn't sound like a homework problem to me. If it isn't, then I'll suggest the thread be moved to the mathematics section on set theory and probability.
No it's not homework. But I figured that this would be low-level probability theory, so might as well go here. Apologies if this is the wrong spot.
 
  • #5
DaanV said:
I'm not sure I follow. Only 2 'values' (A and B) should fall in the same 'volume' (bucket).

I'm referring to the abstract scenario for hypothesis testing. This involves a "statistic" and the determination of whether the "observed value of the statistic" falls inside or outside an "acceptance region".

In the simplest case, an observed value of a statistic is a number. For example, the mean value of a sample of the heights of 10 people is a statistic. A particular observed value of the statistic might be 71.3 inches. The acceptance region can be an interval on the number line, for example "between 69 and 72 inches".

We assume a given population distribution (e.g. a normal distribution with mean 68 inches and standard deviation 6 inches), then we perform a hypothesis test by "rejecting" the null hypothesis if the observed value of the statistic falls outside the acceptance region.

The probabilities ("significance level" etc) that are associated with a hypothesis test are determined by the probabilities that the statistic falls inside or outside the acceptance region.

Note that in the above example, the probability that the observed statistic takes one particular value (e.g 71.3) is small. (In fact, it is theoretically zero.) So to make any use of probability, we have to test using a region that contains an interval of values.

For a given real world problem, one can define many different statistics. A statistic can be a vector, and an observed value of such a statistic is a vector of values. One statistic that can be defined in your problem is the vector with the 3 components that are given by

number of buckets with only red balls,
number of buckets with only green balls,
number of buckets with both red and green balls

An "acceptance region" for a test using that statistic would be defined by a set of vectors. Since these are 3 dimensional vectors, I called such a set a "volume". Since we are dealing with vectors of whole numbers, that might not be a good term. I'll have to think about this problem more in order to evaluate your work. Right now, I have to go to an appointment.
 
  • #6
Stephen Tashi said:
I'm referring to the abstract scenario for hypothesis testing. This involves a "statistic" and the determinati
DaanV said:
You're absolutely right. Sorry about the confusion. That's right, I don't know ##N_A## and ##N_B## (the numbers of green and blue balls). I only know the number of buckets that have balls in them, and what colour(s) are in those buckets. I am indeed trying to test whether or not the balls are placed independently and at random.

Are you saying that you know a bucket has either 0 balls, all red (but of unknown number), all green (but of unknown number) or some red and some green (but with both amounts unknown)? That seems like a rather artificial scenario to me. Where does it arise?
 
  • #7
DaanV said:
Let ##C## be the number of buckets and ##N## the number of balls. Let ##P(C_A)## be the probability for any bucket to contain an ##A##, ##P(C_B)## to contain ##B## and ##P(C_{AB})## to contain both.

When we say "the probability", we have to define the event precisely. You can speak of the probability that a randomly selected bucket contains an "A" (meaning at least one "A", but possibly more).

(However if the number of things of type "A" and the number of buckets is fixed then picking two randomly selected buckets gets into the situation of "random sampling with replacement" rather than "random sampling without replacement".)

Can I then say that the probability for any one droplet to contain both A and B is:
##P(C_{AB}) = P(C_A) * P(C_B)##
Yes, since we've assumed independence.
##= \frac{N_A}{C} * \frac{N_B}{C}##
?
I'd say no, just because the problem would still make sense if there were more things of type "A" than there were buckets (i.e. when [itex] N_A > C [/itex]). In that case, it is still possible (by chance) to have some empty buckets.
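A worked version of this point, offered as a sketch and not as part of the original reply: assume each of the ##N_A## objects lands in any of the ##C## buckets with equal probability, independently of the others. The chance that a given bucket holds at least one "A" is then

##P(C_A) = 1 - \left(1 - \frac{1}{C}\right)^{N_A} \approx 1 - e^{-N_A/C}##

This reduces to ##\frac{N_A}{C}## only when ##N_A \ll C##, and it stays below 1 even when ##N_A > C##. The same expression with ##N_B## in place of ##N_A## gives ##P(C_B)##.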

You may find some useful formulae by looking up probability problems that involve the jargon "occupancy numbers". I'll try that myself, but I won't have time this evening.

-----
I'm going to suggest this thread be moved to a section that handles real world problems instead of homework problems.
 
  • #8
Ray Vickson said:
Are you saying that you know a bucket has either 0 balls, all red (but of unknown number), all green (but of unknown number) or some red and some green (but with both amounts unknown)? That seems like a rather artificial scenario to me. Where does it arise?
Since you ask..
This is a problem derived from my research in a cancer institute. Have you ever heard of (Droplet) Digital PCR? In essence, the idea is to dilute a sample containing DNA to limiting dilutions, until you reach a point where you would only expect one DNA molecule of interest in a well (or in the case of droplet digital PCR, in a droplet).

We then perform a PCR reaction, amplifying only a region of interest, in the presence of two distinct hydrolysis probes targeting that region. One probe targets the wildtype variant, the other a specific mutant variant. We then perform 40 cycles of PCR amplification. If there was only a wildtype molecule in the beginning, the well (or droplet) would emit only the wildtype (HEX) wavelength; if there was the specific mutant, it would emit the mutant (FAM) wavelength. Or both, or none.

Perhaps red and green balls is not a perfect analogy. Perhaps it should be red and green paint, or something like that. Painting something red twice doesn't make a difference to painting it red once, but we can see if it was painted both red and green.

Does that help to make the problem a bit clearer?

Stephen Tashi said:
You may find some useful formulae by looking up probability problems that involve the jargon "occupancy numbers". I'll try that myself, but I won't have time this evening.

-----
I'm going to suggest this thread be moved to a section that handles real world problems instead of homework problems.
Thanks, I'll try performing a search with those terms. I'm fine with the topic being moved elsewhere.
 
  • #9
This morning's thoughts:

Thinking of tennis balls, I'll use "R" and "G" to denote types instead of "A" and "B".

Let [itex] S_R [/itex] be the observed number of buckets that have some red balls in them (i.e. at least one red ball).
Let [itex] S_G [/itex] be the observed number of buckets that have some green balls in them.

Under the assumption that each ball is placed randomly in the buckets, I think that any set of locations for the buckets with some red balls in them is equally likely, likewise for locations with some green balls. So, taking [itex] S_R [/itex] and [itex] S_G [/itex] as "given", and letting [itex] X [/itex] denote the number of buckets that have both red and green balls, I will consider [itex] X [/itex] to be a random variable. I'll denote the particular value of [itex] X [/itex] that was observed in your experiment by [itex] B [/itex].

Visualize taking random samples of [itex] X [/itex] as follows. Create [itex] S_R [/itex] labels that say "Has some red" and [itex] S_G [/itex] labels that say "Has some green". Place these labels in the buckets "at random" with the condition that there is at most one "Has some red" label in a bucket and at most one "Has some green" label in a bucket. Count the number of buckets that have both a "Has some red" and a "Has some green" label in them. This count gives the observed value of [itex] X [/itex].

I'm sure the distribution of [itex] X [/itex] can be calculated or simulated. We can rephrase the definition of [itex] X[/itex] in a different balls-in-cells scenario, where the labels now are considered to be "balls". To compute the distribution of [itex] X [/itex], we need to solve (for each possible value of [itex] k [/itex] ) the problem:

[itex] S_R[/itex] red balls and [itex] S_G [/itex] green balls are placed randomly in [itex] N [/itex] cells with the condition that no cell contains more than one red ball and no cell contains more than one green ball. Find the probability that exactly [itex] k [/itex] of the cells contain 2 balls, one ball of each color.

The distribution of [itex] X [/itex] will have a mean value and (hopefully) a single peak. We can base statistical tests on how far the observed value [itex] B [/itex] is from that peak.
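The labelled-cells problem stated above has a standard closed form: with the red labels fixed, the number of green labels that land in red-labelled cells follows a hypergeometric distribution. The Python sketch below evaluates it with log-gamma to avoid huge binomial coefficients. The function names are hypothetical. Mapping "R" to the original poster's type A (29755 + 14 cells) and "G" to type B (14 cells) is an assumption made for this illustration.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def overlap_pmf(k, n_cells, s_r, s_g):
    """P(X = k): exactly k cells carry both a red and a green label."""
    if k < max(0, s_r + s_g - n_cells) or k > min(s_r, s_g):
        return 0.0
    log_p = (log_comb(s_r, k) + log_comb(n_cells - s_r, s_g - k)
             - log_comb(n_cells, s_g))
    return math.exp(log_p)


def upper_tail(k_obs, n_cells, s_r, s_g):
    """P(X >= k_obs) when the labels are placed at random."""
    return sum(overlap_pmf(k, n_cells, s_r, s_g)
               for k in range(k_obs, min(s_r, s_g) + 1))


# Counts taken from the first post; the R/G <-> A/B mapping is assumed.
n_cells, s_r, s_g, observed = 1026059, 29769, 14, 14
mean_x = s_r * s_g / n_cells
tail = upper_tail(observed, n_cells, s_r, s_g)
print("mean of X:", mean_x)
print("P(X >= observed):", tail)
```

The upper tail computed this way is the same quantity a one-sided Fisher exact test would report for the corresponding 2x2 table, so it also answers the objection about small expected counts.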

-----
Applying statistics to real world problems is subjective and questions involving probability can be conceptually complicated. One can foresee objections to the above type of test like "You treated part of the observed data as given and part of it as a random variable". There is also the question of who is to be persuaded by the test - just yourself? your boss? the referees of a scientific journal?
 
  • #10
It seems like you are overcomplicating this. To me it seems like a place for a simple chi square test for independence. I mean, it doesn't sound like you really care if it is Poisson or normal or some other distribution. What it sounds like you want to know is if it is independent. That can be tested using a contingency table and a chi square test regardless of the underlying distribution.
 
  • #11
DaleSpam said:
It seems like you are overcomplicating this. To me it seems like a place for a simple chi square test for independence. I mean, it doesn't sound like you really care if it is Poisson or normal or some other distribution.
What do you mean by "it"?
What it sounds like you want to know is if it is independent. That can be tested using a contingency table and a chi square test regardless of the underlying distribution.

Distribution of what random variable?
 
  • #12
The OP seems to want to know whether or not "a bucket contains at least one red ball" is independent of whether or not "the bucket contains at least one green ball".

It doesn't matter how the probability of a bucket containing a red ball is distributed. You just set up a 2x2 contingency table. Columns are red ball (yes or no) and rows are green ball (yes or no). Run the chi square test under the null hypothesis that they are independent, and see if the data provides evidence against that.
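A minimal sketch of that 2x2 test in plain Python, using the counts from the first post. The function name and the table orientation are choices made for this illustration. One caveat: the smallest expected cell count here is well below the usual rule-of-thumb minimum of 5, so the chi square approximation is questionable for these data and an exact test may be the safer option.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (statistic, p_value, expected_counts). With 1 degree of freedom
    the chi-square survival function equals erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = [[a, b], [c, d]]
    expected = [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]
    stat = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value, expected


# Rows: B present / B absent. Columns: A present / A absent.
stat, p_value, expected = chi_square_2x2(14, 0, 29755, 996290)
print("chi-square statistic:", stat)
print("p-value:", p_value)
print("expected count, A and B both present:", expected[0][0])
```

No continuity correction is applied in this sketch. Adding one would lower the statistic somewhat without changing the overall picture for counts this extreme.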
 
  • #13
DaanV said:
I would say I have 4 numerical values: ##C_{empty}, C_A, C_B, C_{AB}##, denoting the number of buckets after distribution of the balls.
Yes this is the typical data for a chi square test. You want to test for independence, yes?
 
  • #14
DaleSpam said:
The OP seems to want to know whether or not "a bucket contains at least one red ball" is independent of whether or not "the bucket contains at least one green ball".
He may be happy to investigate that, but is it equivalent to the hypothesis H that both colors of balls are placed in the buckets independently and at random? It would seem to be implied by H, but does it imply H?
 
  • #15
@DaanV please let us know if it is independence that you wish to test or the distribution.
 
  • #16
Thanks for all the replies guys. Only just now checking back after the weekend.
For now, I do think it would be sufficient to show whether or not the events (Red and/or Green) are independent. I will look up chi square test for independence (sorry, as noted I'm not a statistics guy).

Terribly sorry Stephen Tashi, if my description of the situation put you on the wrong track. Thanks a whole bunch for all the time invested here.
 

1. What is the Poisson distribution hypothesis?

The Poisson distribution is a probability distribution that is used to model the number of occurrences of a specific event within a fixed interval of time or space. It assumes that the events occur at a constant rate and are independent of each other.

2. When should I reject the Poisson distribution hypothesis?

You should reject the Poisson distribution hypothesis when the observed data significantly deviates from the expected values based on the Poisson distribution. This can be determined by conducting a hypothesis test, such as the chi-square test, and calculating the p-value. If the p-value is less than the chosen significance level, typically 0.05, the null hypothesis of the data following a Poisson distribution can be rejected.

3. How do I conduct a hypothesis test for the Poisson distribution?

To conduct a hypothesis test for the Poisson distribution, you would first need to calculate the expected values based on the Poisson distribution for each category or interval in your data. Then, you would use a statistical test, such as the chi-square test, to compare the observed values to the expected values and calculate a p-value. If the p-value is less than the chosen significance level, the null hypothesis can be rejected.

4. What are some common reasons for rejecting the Poisson distribution hypothesis?

Some common reasons for rejecting the Poisson distribution hypothesis include the data not following a constant rate of occurrence, the events not being independent of each other, or the sample size being too small to accurately represent the population. Additionally, if the data shows a significant amount of overdispersion or underdispersion, the Poisson distribution may not be an appropriate model.

5. How do I handle rejecting the Poisson distribution hypothesis?

If the hypothesis test indicates that the data does not follow a Poisson distribution, you may need to consider using a different distribution or model that better fits the data. This could involve transforming the data or using a different type of statistical analysis that does not rely on the Poisson distribution. It is also important to carefully consider the assumptions and limitations of any statistical model before drawing conclusions from the data.
