# Is it possible to calculate a binomial distribution with a non-constant p?

1. Aug 25, 2010

### jonnybgood66

Here's the actual problem I'm faced with. Suppose a segment of dna with 100 mutations (SNPs) which occur at different frequencies from each other and between 2 different populations for the same mutation. The expected number of mutations occuring in the segment of dna is different in either population, and using this difference I can predict from which population the segment originates. I need to determine the intersection/low point between the 2 frequency curves at which I can say, to the left of this point the segment is assigned to Pop 1, to the right of this point it's assigned to Pop 2. I've managed to do this by generating thousands of simulated curves with RND. But this causes the run time to increase by half an hour, which is unacceptable. That's when I started looking into trying to calculate this curve. From what I've read, the binomial distribution is clearly what I want, except for one thing, it assumes p is constant. In my problem, every p is different. Is this possible to calculate? [I have a feeling it isn't]

PS: I've glanced over the Beta-binomial distribution, but it seems to involve a random p. In my example, I have a known value for p for each of the k trials/events, and I need to use those exact p values.

2. Aug 25, 2010

### winterfors

It should be possible to calculate the probability of the DNA segment coming from population A versus population B given the number of mutations, provided that you know the mutation rate for each of the mutation sites (i.e. each "p") and how these are different for populations A and B. Did I get you right that you have this information?

3. Aug 25, 2010

### jonnybgood66

I did a poor job of explaining myself. Suppose 100 events/trials, each with a different (and known) frequency of success (p). I want to calculate the probabilities of the 100 trials resulting in 0 total succesful trials, 1 total successful trial, 2, 3, ... 99, and 100 total successful trials.

4. Aug 25, 2010

### CRGreathouse

I can't think of any good method for large numbers of successes. For 0, of course the probability is just
$$P+0=\prod_{n=1}^{100}(1 - p_n)$$
For 1, it is
$$P_1=P_0\sum_{n=1}^{100}\frac{p_n}{1-p_n}$$
For 2 it is
$$P_2=P_0\sum_{n=1}^{99}\sum_{m=n+1}^{100}\frac{p_np_m}{(1-p_n)(1-p_m)}$$

But scaling these directly would take too long -- binomially many terms, ~10^29 in the worst case of 50. I can see (what I think is) a better way to calculate the case 2, but I'm not sure if it can be generalized:
$$P_2=\frac12P_1\sum_{n=1}^{100}\frac{p_n}{1-p_n}-\frac12P_0\sum_{n=1}^{100}p_n^2$$

5. Aug 26, 2010

### winterfors

CRGreathouse is right. However, there is a generalization of his second approach:

Using the notation above, you can calculate the probability $P_k$ of having $k$ mutations out of $N$ using a recursive formula:

$${P_k} = \left\{ \begin{array}{l} \prod\limits_{n = 1}^N {(1 - {p_n}), \qquad k = 0} \\ \frac{1}{k}\sum\limits_{n = 1}^k {{{( - 1)}^{n - 1}}{P_{k - n}}{R_n}} , \qquad k > 0 \\ \end{array} \right.$$

where

$${R_n} = \sum\limits_{j = 1}^N {{{\left( {\frac{{{p_j}}}{{1 - {p_j}}}} \right)}^n}}$$

You first calculate $P_0$, then insert in the formula for $P_1$, then insert
$P_0$ and $P_1$ into the formula for $P_2$, etc. etc.

6. Aug 26, 2010

### winterfors

Considering your general problem of determining which population the DNA sample comes from, I really think you should use the information above to calculate the probability of the sample coming from each of the populations, rather than just using a threshold number of mutations to distinguish one form the other.

You could of course get a better estimate of what population the sample comes from by considering each mutation separately, rather than just the total number of mutations (which would also let you avoid this slightly complicated "generalized binomial" distribution of the cumulative number of mutations)

Good luck with your problem in any case.

-Emanuel

7. Aug 26, 2010

### CRGreathouse

winterfors: That looks really nice. Unfortunately I can't seem to get it to work -- if your formula is right then my code is wrong:

Code (Text):
generalizedBinomial(P:vec,n:int)={
my(N=#P,pr=vector(n+1));
pr[1]=prod(i=1,N,1.-P[i]);
for(k=1,n,
pr[k+1]=sum(i=1,k,(-1)^(i-1)*pr[k-i+1]*sum(j=1,N,(P[j]/(1-P[j]))^k))/k
);
pr[n+1]
};
addhelp(generalizedBinomial, "generalizedBinomial(P,n): Given a vector P of probabilities, gives the probability that exactly n of the independent events occur.");

8. Aug 26, 2010

### jonnybgood66

Thank you very much! I'm going to study this very carefully.

9. Aug 26, 2010

### winterfors

I'm not completely familiar with the script language you are coding in, but it seems to me you are taking (P[j]/(1-P[j])) to the power of "k" instead of the outer summation index "i".

Last edited: Aug 26, 2010
10. Aug 26, 2010

### jonnybgood66

I presume you meant to write

$$P_0=$$

right?

11. Aug 26, 2010

### jonnybgood66

I'm using your formulas on a scaled down version with only 10 events (n=10). The formula for P2 would match the actual real results if the final part were equal to 1, instead of P²:

$$P_2=\frac12P_1\sum_{n=1}^{100}\frac{p_n}{1-p_n}-\frac12P_0\sum_{n=1}^{100}1$$

Could there be an error in that part? Also, could you show me how your example formula would look like for $$P_3$$? [EDIT: it's supposed to say P3 in the previous sentence, I don't know why my edit isn't changing that mistake. Obviously, I don't need the P2 formula.] I'm trying to grasp how the subsequent formulas for P=4, P=5, etc. would be. This is difficult for me, but I'm managing to straddle along. :) I still have to get to winterfors' post.

Last edited: Aug 26, 2010
12. Aug 26, 2010

### winterfors

There is a slight error in this formula, it should read

$$P_2=\frac12P_1\sum_{n=1}^{100}\frac{p_n}{1-p_n}-\frac12P_0\sum_{n=1}^{100}\frac{p_n^2}{(1-p_n)^2}$$

For $P_n$ of higher $n$ the expressions get increasingly complex if you do not write it as a function of lower order $P_n$ (as in the recursion formula above)

$$P_3 = \frac{1}{6}{P_0}\left( {{{\left( {\sum\limits_{j = 1}^{100}{\frac{{{p_j}}}{{1 - {p_j}}}} } \right)}^3} - 3\left( {\sum\limits_{j = 1}^{100}{\frac{{{p_j}}}{{1 - {p_j}}}} } \right)\sum\limits_{j = 1}^{100}{{{\left( {\frac{{{p_j}}}{{1 - {p_j}}}} \right)}^2}} + 2\sum\limits_{j = 1}^{100}{{{\left( {\frac{{{p_j}}}{{1 - {p_j}}}} \right)}^3}} } \right)$$

About your problem with Physics forum not updating equations after you have changed them in the editor, this is a "bug" in how the editor works. The equation is changed, but you need to reload the whole page to make the change appear on your screen (due to the web browser cashing the previous bitmap image representing the equation).

13. Aug 26, 2010

### jonnybgood66

Thanks so much, winterfors. And of course, CMGreathouse. And yeah, you're right, I'm not going to write out a monster formula for P25, P26, etc. I was trying to understand the mechanics of the formula, that's why I wanted to see what it looked like for P3. But now that you've confirmed there was a slight error in CRGreathouse's initial formula, I don't think I'll be needing that anymore! I hope. I'm really grateful to both of you. I honestly did not expect this problem was mathematically possible to resolve in a simplified manner.

By the way, presuming everything works out just fine, it looks to me like this type of a statistical problem isn't really all that exotic. I'm kind of surprised there wouldn't be some textbook, known formula that deals with this. After all, this is simply a discrete probability, with the only twist that P is different in each event. Before I wrote in this forum, I must have looked at two dozen known formulas, and none of them even considered the possibility that P would be different in the separate trials/events, it was always fixed. There was only a case involving a random P, but that's not good at all, and in a certain way, it's kind of like a fixed P, just with greater variance. Any thoughts on this? That was why I was so pessimistic when I wrote the initial post here, I was thinking that if none of the well known formulas even considered the possibility of P changing between events, it must be because that results in an impossibility to estimate it with a formula.

14. Aug 26, 2010

### jonnybgood66

So far, I've tested winterfors' iteration formula using a simple model with 10 events and P always equal to 0.5. The iteration formula works perfectly, the results are almost identical to real results generated from thousands of randomized runs.

The correction winterfor made to CRGreathouse's formula actually doesn't change anything in the iteration formula he posted afterwards. So I still don't understand the problem CRGreathouse had with the code. Hope it's a coding problem, and not with the formula itself.

This is looking great.

15. Aug 27, 2010

### jonnybgood66

winterfors' recursive formula, in post #5, is the final word on the matter. Thanks. I've now tested this formula for its intended purpose, with the value of p being different in the N events, and it matches the real results generated with random numbers perfectly. There's only a slight technical detail.

Because of the recursive nature of the formula, the probabilities at the tail end of the series (when k is very close or equal to N) will produce distorted results that can indeed be a little too distorted. There's a simple trick to solve this. All the probabilities, p(1) to p(N), should be flipped over by performing p(1) = 1-p(1) to p(N) = 1-p(N). This produces the same results but p(N) is now calculated first (accurately) while p(1) is now calculated last (distorted). This is only important if the tail end probabilities are required. Calculate the average of all the probabilities p(1) to p(N). If the average is less than 0,5, then P(0) to P(N/2) should be calculated in the standard manner, while P(N/2) to P(N) should be calculated after "flipping" all the probabilities. Note that after flipping the probabilities, the result for P(N) would actually be found in P(0), and likewise P(N-1) would be found in P(1), P(N-2) would be found in P(2), etc. If the average is greater than 0,5 then the opposite applies (first half should be calculated after flipping the probabilities, second half should be calculated in the standard manner).

16. Aug 29, 2010

### bpet

Another way to do the calculation is with generating functions, where P_k is the coefficient of x^k in some polynomial. Here

$$\prod_{n=1}^{100} (p_nx+(1-p_n)) = \sum_{k=0}^{\infty}P_k x^k$$

and the product can be calculated by repeated convolution. Seems to give accurate results even with $$p_n=10^{-6}$$.

17. Aug 31, 2010

### winterfors

Last edited: Aug 31, 2010
18. Sep 9, 2010

### jonnybgood66

This post is related to my original issue, but it's about something else. It's about the math limitations of Visual Basic. In the following image, you can see the results I generated for the recursive probabilities formula posted by winterfors (post #5). At k=0, the probability is just 2.079 x 10 ^ (-107), and it gets progressively bigger at k=1, k=2, etc. But in the final result shown, at k=35, the computer "flips out" and gives a ridiculous result which clearly doesn't follow the pattern.

[PLAIN]http://img266.imageshack.us/img266/3676/visualbasicvariabledoub.gif [Broken]

I've noticed, in other cases, that this "flipping out" phenomenon usually occurs between k=20 and k=30. My belief is that with every additional recursive iteration of the formula, the number of digits stored in the variable that houses the result of k grows larger and larger, until it overflows the large capacity of the DOUBLE data type and that's when the new result comes out plain stupid.

I'm trying to find a way to deal with this. What are your thoughts?:

Option 1. I'm thinking if I could always ROUND the variable to an x number of digits that would be just under the maximum storage capacity of the DOUBLE data type, this way there's never an overflow creating a super crazy result, instead there's a continous loss of precision in the most insignificant digits of the result, and even with compounding over iterations, it can't significantly deviate from the correct result. I'm getting the impression that VB randomly determines which side of the data type gets "the axe", when I could have told him to always chop off on the decimal side of the number whenever the digits overflow its capacity to store them.

Option 2. I've read about a mathlab library that greatly expands the math capabilities of VB. Anyone know about this? Does it have new data types? I'd love a QUADRUPLE data type, even an OCTUPLE would help.

Last edited by a moderator: May 4, 2017
19. Sep 9, 2010

### bpet

It's called catastrophic cancellation and happens because the formula in post #5 is an alternating series; a similar thing can occur if you evaluate exp(-x) by taylor series.

The polynomial multiplication method in post #16 doesn't have this problem and can be implemented easily in Excel (see attached snapshot).

HTH

#### Attached Files:

• ###### Untitled.png
File size:
4.9 KB
Views:
130
20. Sep 9, 2010

### winterfors

I think Visual Basic has a 128-bit noninteger data type called "Decimal", have you tried using that?