# When/how to reject Poisson distribution hypothesis?

1. Oct 14, 2015

### DaanV

1. The problem statement, all variables and given/known data
I have run into a situation that my gut tells me is impossible (all right, extremely unlikely) when assuming a Poisson distribution. I want to make this gut feeling more formal by testing it against a Poisson distribution. Sadly, I'm not a trained statistician.

Generalised form
I have $N$ objects that I am dividing over $C$ compartments.
$N$ can be of distinguishable types $N_A$ and $N_B$ such that $N_A + N_B = N$
(say green and blue tennis balls in buckets)

Sadly I cannot distinguish the number of objects that are in a bucket, but I can tell of what type they are. I.e. I can see if a bucket contains no tennis balls, only green, only blue or both.

Usually, $C \gg N$

If my null hypothesis is that the distribution of tennis balls over the buckets is random, what calculations do I have to perform to reject that hypothesis with e.g. 99% certainty?

Real example
This example should (IMO) quite clearly yield a rejection of the hypothesis, but I have other examples where the distinction may be less obvious. So I'm looking for ways to calculate.

I have 1026059 "buckets". 29755 of these contain only (at least one) $N_A$, 14 contain both (at least one each) $N_A$ and $N_B$. 0 "buckets" contain only $N_B$. The remaining 1026059 - 29755 - 14 - 0 = 996290 buckets are empty.

It seems incredibly unlikely to me that all 14 $N_B$ objects would end up in a bucket also containing $N_A$, despite there being so many empty buckets left. But how do I show this formally?

2. Relevant equations
$H_0$ = Poisson distribution
$H_1$ = a greater than random probability of B being linked to A

Something along those lines?

3. The attempt at a solution
I want to determine $N_A$ and $N_B$ from the bucket counts, assuming a Poisson distribution. Then from there I want to determine the expected distribution of $N_A$ and $N_B$ over the buckets, again assuming a Poisson distribution. That should show that we expect very few "dual positive" buckets.

Probability of $b$ balls in any bucket, given an average 'Loading' $L$ of balls per bucket ($L = \frac{N}{C}$), is given by:
$P(b;L)=\frac{L^b e^{-L}}{b!}$

The fraction of empty buckets over total buckets ($\frac{C^-}{C}$) is then given by
$\frac{C^-}{C}=P(0;L)=\frac{L^0 e^{-L}}{0!} = e^{-L}$, so that we can compute $L=-\ln\left(\frac{C^-}{C}\right)$

Applying this to each type separately (taking $C^-$ as the number of buckets that contain none of that type), we can calculate $L_A$ and $L_B$. Using the example above, I get $L_A=0.0294$ and $L_B=1.364 \times 10^{-5}$.

From there I can calculate what distribution of fragments we would expect, given that loading. Doing so yields that we would expect on average 0.4 buckets with both $N_A$ and $N_B$ in them, rather than the 14 that I found.
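The calculation described above can be sketched in a few lines of Python. This is only an illustration under the post's assumptions (Poisson loading, independent placement of A and B); the function and variable names are made up for the example.

```python
import math

# Bucket counts from the real example in the post.
C = 1_026_059      # total number of buckets
c_a_only = 29_755  # buckets with only A
c_both = 14        # buckets with both A and B
c_b_only = 0       # buckets with only B

def loading(n_without: int, n_total: int) -> float:
    """Poisson loading L = -ln(fraction of buckets that lack the type)."""
    return -math.log(n_without / n_total)

# Buckets that contain no A (resp. no B) at all.
no_a = C - (c_a_only + c_both)
no_b = C - (c_b_only + c_both)

L_A = loading(no_a, C)
L_B = loading(no_b, C)

# Under independence, P(bucket has A and B) = P(has A) * P(has B),
# where P(type present) = 1 - exp(-L).
p_a = 1 - math.exp(-L_A)
p_b = 1 - math.exp(-L_B)
expected_both = C * p_a * p_b

print(f"L_A = {L_A:.4f}, L_B = {L_B:.3e}")
print(f"expected buckets with both A and B: {expected_both:.2f}")
```

The expected count is what gets compared with the 14 dual-positive buckets that were actually observed.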

So, how do I show that the expected distribution does not fit the observed distribution?

Thanks in advance for any help provided!

2. Oct 14, 2015

### Stephen Tashi

You've done a fair job of making the problem clear, but some details need clarification.

I suggest we not use the same symbol $N_A$ to denote both a type of thing and also the number of things of that type. Let's say "$A$" denotes a type of thing and "$N_A$" denotes the number of things of that type.

As I understand it, you don't know the numerical values of $N_A$ and $N_B$.

If you are trying to estimate the values of $N_A, N_B$ from the data, this is (naturally) a problem of "statistical estimation". If you are trying to do a statistical test of the hypothesis that the balls are placed independently and at random in the buckets, this is a problem of statistical "hypothesis testing". A method of hypothesis testing might involve a preliminary step that is "estimation".

As I understand the problem, you don't have an "observed distribution". Instead, you have 3 numerical values $(L, L_A, L_B)$. A distribution of those values would be a function of 3 variables $f(x,y,z)$. If you specify a certain volume $V$, you can ask "What is the probability that the observed 3 values will fall in that volume?". For example, you might specify the volume by $y \le 0.4, z \le 0.4$. Hypothesis testing relies on the subjective procedure of "rejecting" the hypothesis that $f(x,y,z)$ is the distribution from which the data is drawn when there is a small probability of the particular triplet of values observed in the data falling in the specified volume.

Can you write computer simulations? I think this would be the quickest way to investigate your problem. There might be a pencil-and-paper way to get some results, but none pops into my mind at the moment.
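As a rough idea of what such a simulation could look like, here is a minimal Python sketch. It assumes the null hypothesis that every ball lands in a uniformly random bucket, independently of all others. The function names and the scaled-down numbers (10,000 buckets, 300 A balls, 14 B balls) are hypothetical and chosen only to keep the run fast.

```python
import random

def simulate_counts(n_buckets, n_a, n_b, rng):
    """Throw n_a type-A and n_b type-B balls into buckets uniformly at random.

    Returns the bucket counts (empty, a_only, b_only, both)."""
    has_a = {rng.randrange(n_buckets) for _ in range(n_a)}
    has_b = {rng.randrange(n_buckets) for _ in range(n_b)}
    both = len(has_a & has_b)
    a_only = len(has_a) - both
    b_only = len(has_b) - both
    empty = n_buckets - a_only - b_only - both
    return empty, a_only, b_only, both

def fraction_at_least(n_buckets, n_a, n_b, observed_both, n_trials=2000, seed=0):
    """Fraction of simulated experiments with at least observed_both dual buckets."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        both = simulate_counts(n_buckets, n_a, n_b, rng)[3]
        if both >= observed_both:
            hits += 1
    return hits / n_trials

# Scaled-down illustration, not the real data.
counts = simulate_counts(10_000, 300, 14, random.Random(1))
frac = fraction_at_least(10_000, 300, 14, observed_both=14)
print(counts, frac)
```

When the observed outcome is extremely rare under the null hypothesis, a simulation like this will typically never produce it, so it can only bound the probability from above (roughly "less than 1 in `n_trials`").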

3. Oct 14, 2015

### Stephen Tashi

By the way, is this a homework problem? It doesn't sound like a homework problem to me. If it isn't, then I'll suggest the thread be moved to the mathematics section on set theory and probability.

4. Oct 15, 2015

### DaanV

You're absolutely right. Sorry about the confusion.

That's right, I don't know $N_A$ and $N_B$ (the numbers of green and blue balls). I only know the number of buckets that have balls in them, and what colour(s) are in those buckets. I am indeed trying to test whether or not the balls are placed independently and at random.

Please forgive my ignorance on statistical nomenclature. What I mean by "observed distribution" is that I know the number of buckets, and I can literally count (observe) the number of buckets with coloured balls in them. Again, I do not know the distribution of balls over buckets.
I would say I have 4 numerical values: $C_{empty}, C_A, C_B, C_{AB}$, denoting the number of buckets after distribution of the balls.

I'm not sure I follow. Only 2 'values' (A and B) should fall in the same 'volume' (bucket).

Here's my shot:
Let $C$ be the number of buckets and $N$ the number of balls. Let $P(C_A)$ be the probability for any bucket to contain an $A$, $P(C_B)$ to contain $B$ and $P(C_{AB})$ to contain both.

Can I then say that the probability for any one bucket to contain both A and B is:
$P(C_{AB}) = P(C_A) \cdot P(C_B) = \frac{N_A}{C} \cdot \frac{N_B}{C}$
?

Then the probability for X buckets to contain both $A$ and $B$, given $P(C_{AB})$, could be computed by:
$P = \binom{C}{X} P(C_{AB})^X$
Does that sound remotely correct?
Of course problems arise in that I don't yet know $N_A$ and $N_B$. I could estimate those statistically, as you mentioned earlier, under the assumption of independent and random placement.
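For comparison, if each bucket is treated as an independent trial, the usual binomial probability of exactly $X$ dual buckets also carries a factor $(1 - P(C_{AB}))^{C-X}$ for the buckets that are not dual. The Python sketch below computes that probability and the upper tail. It rests on two assumptions that are not in the post: the per-bucket probabilities are estimated from the observed fractions of buckets containing A and B, and the tail sum is truncated after a fixed number of terms (harmless when the observed count is far above the mean). All names are made up for the example.

```python
import math

def binom_log_pmf(k, n, p):
    """log P(exactly k successes in n independent trials with probability p)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_upper_tail(x, n, p, max_terms=200):
    """P(at least x successes), summed upward from x.

    Summing upward avoids computing 1 - (a number very close to 1)."""
    last = min(n, x + max_terms)
    return sum(math.exp(binom_log_pmf(k, n, p)) for k in range(x, last + 1))

C = 1_026_059
p_a = (29_755 + 14) / C   # estimated P(bucket contains A)
p_b = 14 / C              # estimated P(bucket contains B)
p_both = p_a * p_b        # under independence

mean_both = C * p_both
tail = binom_upper_tail(14, C, p_both)
print(f"expected dual buckets: {mean_both:.3f}")
print(f"P(at least 14 dual buckets): {tail:.3e}")
```

Working with logarithms (`math.lgamma`) keeps the large binomial coefficients from overflowing.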

No.
I'll be following an introductory R course soon-ish.

No it's not homework. But I figured that this would be low-level probability theory, so might as well go here. Apologies if this is the wrong spot.

5. Oct 15, 2015

### Stephen Tashi

I'm referring to the abstract scenario for hypothesis testing. This involves a "statistic" and the determination of whether the "observed value of the statistic" falls inside or outside an "acceptance region".

In the simplest case, an observed value of a statistic is a number. For example, the mean value of a sample of the heights of 10 people is a statistic. A particular observed value of the statistic might be 71.3 inches. The acceptance region can be an interval on the number line, for example "between 69 and 72 inches".

We assume a given population distribution (e.g. a normal distribution with mean 68 inches and standard deviation 6 inches), then we perform a hypothesis test by "rejecting" the null hypothesis if the observed value of the statistic falls outside the acceptance region.

The probabilities ("significance level" etc) that are associated with a hypothesis test are determined by the probabilities that the statistic falls inside or outside the acceptance region.

Note that in the above example, the probability that the observed statistic takes one particular value (e.g. 71.3) is small. (In fact, it is theoretically zero.) So to make any use of probability, we have to test using a region that contains an interval of values.
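To make the height example concrete, here is a small Python sketch that computes the probability that the sample mean lands in the acceptance region. It uses the numbers quoted above (population mean 68, standard deviation 6, sample size 10, region 69 to 72) and assumes the 10 heights are independent draws, so the sample mean is normal with standard deviation $6/\sqrt{10}$. The variable names are made up for the example.

```python
import math
from statistics import NormalDist

pop_mean, pop_sd, n = 68.0, 6.0, 10
region_lo, region_hi = 69.0, 72.0

# Distribution of the mean of n independent draws from the population.
sample_mean_dist = NormalDist(mu=pop_mean, sigma=pop_sd / math.sqrt(n))

# Probability that the statistic falls inside / outside the acceptance region
# when the null hypothesis is true.
p_accept = sample_mean_dist.cdf(region_hi) - sample_mean_dist.cdf(region_lo)
p_reject = 1.0 - p_accept

print(f"P(inside region)  = {p_accept:.3f}")
print(f"P(outside region) = {p_reject:.3f}")
```

The probability of falling outside the region under the null hypothesis is the significance level of the test. The numbers in the example were only illustrative, so the resulting level is not meant as a sensible test design.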

For a given real world problem, one can define many different statistics. A statistic can be a vector, and an observed value of such a statistic is a vector of values. One statistic that can be defined in your problem is the vector with the 3 components that are given by

number of buckets with only red balls,
number of buckets with only green balls,
number of buckets with both red and green balls

An "acceptance region" for a test using that statistic would be defined by a set of vectors. Since these are 3 dimensional vectors, I called such a set a "volume". Since we are dealing with vectors of whole numbers, that might not be a good term.

I'll have to think about this problem more in order to evaluate your work. Right now, I have to go to an appointment.

6. Oct 15, 2015

### Ray Vickson

7. Oct 15, 2015

### Stephen Tashi

When we say "the probability", we have to define the event precisely. You can speak of the probability that a randomly selected bucket contains an "A" (meaning at least one "A", but possibly more).

(However, if the number of things of type "A" and the number of buckets are fixed, then picking two randomly selected buckets gets into the situation of "random sampling with replacement" rather than "random sampling without replacement".)

Yes, since we've assumed independence.
I'd say no, just because the problem would still make sense if there were more things of type "A" than there were buckets (i.e. when $N_A > C$). In that case, it is still possible (by chance) to have some empty buckets.

You may find some useful formulae by looking up probability problems that involve the jargon "occupancy numbers". I'll try that myself, but I won't have time this evening.

-----
I'm going to suggest this thread be moved to a section that handles real world problems instead of homework problems.

Last edited: Oct 15, 2015
8. Oct 16, 2015

### DaanV

This is a problem derived from my research in a cancer institute. Have you ever heard of (Droplet) Digital PCR? In essence, the idea is to dilute a sample containing DNA to limiting dilutions, until you reach a point where you would only expect one DNA molecule of interest in a well (or in the case of droplet digital PCR, in a droplet).

We then perform a PCR reaction, amplifying only a region of interest, in the presence of two distinct hydrolysis probes targeting that region. One probe targets the wildtype variant, the other a specific mutant variant. We then perform 40 cycles of PCR amplification. If there was only a wildtype molecule in the beginning, the well (or droplet) would only emit the wildtype (HEX) wavelength, if there was the specific mutant it would emit mutant (FAM) wavelength. Or both, or none.

Perhaps red and green balls is not a perfect analogy. Perhaps it should be red and green paint, or something like that. Painting something red twice doesn't make a difference to painting it red once, but we can see if it was painted both red and green.

Does that help to make the problem a bit clearer?

Thanks, I'll try performing a search with those terms. I'm fine with the topic being moved elsewhere.

9. Oct 16, 2015

### Stephen Tashi

This morning's thoughts:

Thinking of tennis balls, I'll use "R" and "G" to denote types instead of "A" and "B".

Let $S_R$ be the observed number of buckets that have some red balls in them (i.e. at least one red ball).
Let $S_G$ be the observed number of buckets that have some green balls in them.

Under the assumption that each ball is placed randomly in the buckets, I think that any set of locations for the buckets with some red balls in them is equally likely, likewise for locations with some green balls. So, taking $S_R$ and $S_G$ as "given", and letting $X$ denote the number of buckets that have both red and green balls, I will consider $X$ to be a random variable. I'll denote the particular value of $X$ that was observed in your experiment by $B$.

Visualize taking random samples of $X$ as follows. Create $S_R$ labels that say "Has some red" and $S_G$ labels that say "Has some green". Place these labels in the buckets "at random" with the condition that there is at most one "Has some red" label in a bucket and at most one "Has some green" label in a bucket. Count the number of buckets that have both a "Has some red" and a "Has some green" label in them. This count gives the observed value of $X$.

I'm sure the distribution of $X$ can be calculated or simulated. We can rephrase the definition of $X$ in a different balls-in-cells scenario, where the labels now are considered to be "balls". To compute the distribution of $X$, we need to solve (for each possible value of $k$ ) the problem:

$S_R$ red balls and $S_G$ green balls are placed randomly in $C$ cells with the condition that no cell contains more than one red ball and no cell contains more than one green ball. Find the probability that exactly $k$ of the cells contain 2 balls, one ball of each color.

The distribution of $X$ will have a mean value and (hopefully) a single peak. We can base statistical tests on how far the observed value $B$ is from that peak.
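One way to compute this distribution exactly: if the red labels are regarded as fixed and the green labels are then dropped into distinct cells at random, the number of cells with both labels follows a hypergeometric distribution. The Python sketch below uses that observation. The function names are made up, and the counts plugged in ($S_R = 29755 + 14$, $S_G = 14$, 1026059 cells, observed $B = 14$) are read off the original post on the assumption that "red" plays the role of A and "green" the role of B.

```python
import math

def overlap_pmf(k, n_cells, s_r, s_g):
    """P(exactly k cells hold both a red and a green label).

    s_r red and s_g green labels are placed uniformly at random,
    with at most one label of each colour per cell (hypergeometric)."""
    if k < max(0, s_r + s_g - n_cells) or k > min(s_r, s_g):
        return 0.0
    return (math.comb(s_r, k) * math.comb(n_cells - s_r, s_g - k)
            / math.comb(n_cells, s_g))

def upper_tail(observed, n_cells, s_r, s_g):
    """P(X >= observed): chance of an overlap at least as large as observed."""
    return sum(overlap_pmf(k, n_cells, s_r, s_g)
               for k in range(observed, min(s_r, s_g) + 1))

n_cells = 1_026_059
s_r = 29_755 + 14   # cells with at least one red
s_g = 14            # cells with at least one green
observed_b = 14     # cells with both

mean_x = s_r * s_g / n_cells
p_value = upper_tail(observed_b, n_cells, s_r, s_g)
print(f"mean of X: {mean_x:.3f}")
print(f"P(X >= {observed_b}): {p_value:.3e}")
```

Python's exact integer arithmetic keeps the binomial coefficients from overflowing, and only the final ratio is converted to a float.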

-----
Applying statistics to real world problems is subjective and questions involving probability can be conceptually complicated. One can foresee objections to the above type of test like "You treated part of the observed data as given and part of it as a random variable". There is also the question of who is to be persuaded by the test - just yourself? your boss? the referees of a scientific journal?

10. Oct 16, 2015

### Staff: Mentor

It seems like you are overcomplicating this. To me it seems like a place for a simple chi square test for independence. I mean, it doesn't sound like you really care if it is Poisson or normal or some other distribution. What it sounds like you want to know is if it is independent. That can be tested using a contingency table and a chi square test regardless of the underlying distribution.

11. Oct 16, 2015

### Stephen Tashi

What do you mean by "it"?
Distribution of what random variable?

12. Oct 16, 2015

### Staff: Mentor

The OP seems to want to know whether or not "a bucket contains at least one red ball" is independent of whether or not "the bucket contains at least one green ball".

It doesn't matter how the probability of a bucket containing a red ball is distributed. You just set up a 2x2 contingency table. Columns are red ball (yes or no) and rows are green ball (yes or no). Run the chi square test under the null hypothesis that they are independent, and see if the data provides evidence against that.
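A minimal Python sketch of that test, using only the standard library, might look like the following. The table is filled with the counts from the original post, on the assumption that "red" corresponds to A and "green" to B. The function name is made up, and no continuity correction is applied.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (statistic, p_value, expected_counts) with 1 degree of freedom."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    expected = tuple(tuple(r * col / n for col in cols) for r in rows)
    stat = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected

# Rows: green yes / green no. Columns: red yes / red no.
stat, p_value, expected = chi_square_2x2(14, 0, 29_755, 996_290)
print(f"chi-square = {stat:.1f}, p = {p_value:.3e}")
print(f"expected count in the 'both' cell: {expected[0][0]:.3f}")
```

One caveat: the chi-square approximation is usually considered unreliable when an expected cell count is small (a common rule of thumb is below 5), which is the case for the "both" cell here, so an exact test may be the safer choice for data like this.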

13. Oct 16, 2015

### Staff: Mentor

Yes this is the typical data for a chi square test. You want to test for independence, yes?

14. Oct 17, 2015

### Stephen Tashi

He may be happy to investigate that, but is it equivalent to the hypothesis H that both colors of balls are placed in the buckets independently and at random? It would seem to be implied by H, but does it imply H?

15. Oct 17, 2015

### Staff: Mentor

@DaanV please let us know if it is independence that you wish to test or the distribution.

16. Oct 19, 2015

### DaanV

Thanks for all the replies guys. Only just now checking back after the weekend.
For now, I do think it would be sufficient to show whether or not the events (Red and/or Green) are independent. I will look up chi square test for independence (sorry, as noted I'm not a statistics guy).

Terribly sorry Stephen Tashi, if my description of the situation put you on the wrong track. Thanks a whole bunch for all the time invested here.