- #1

- 137

- 5

## Main Question or Discussion Point

The name in the title is probably not what it's called but it so similar that I chose it anyways.

This is a problem I've been looking into on my spare time and I'm having a difficult time nailing it down. Essentially, it's an "extension" of the negative binomial distribution in the sense that instead of each Bernoulli trial being i.i.d., they're independent but not necessarily identically distributed. In other words, the probability for each independent trial is a sequence of probabilities. When the number of trials are fixed, the number of successes follows a Poisson-Binomial distribution.

However, in the case of a Negative Poisson-Binomial distribution, one would count the number of trials until the [itex]k[/itex]th success occurs. Hence in theory, the support for this distribution are all integers greater than or equal to [itex]k[/itex]. It follows then that the probability sequence must be defined for all positive, non-zero integers or to put it another way, the function that the sequence is drawn from has all positive, non-zero integers (i.e. the index of the trial) as a subset of its domain at least.

However, in practice that function may be a black box or may only be feasibly acquired numerically. So how then would one go around to compute this? If one were to look at the definition, the desired probability is the likelihood of getting [itex]k-1[/itex] successes out of [itex]n-1[/itex] trials followed by a success. If [itex]f_{n}(k)[/itex] is the PMF for a Poisson-Binomial distribution with n trials then the PMF for the Negative Poisson-Binomial distribution is [itex]f_{n-1}(k-1) p_n[/itex] where [itex]p_i[/itex] is the success probability of the [itex]i[/itex]th trial. Keeping [itex]k[/itex] fixed makes this a function of [itex]n[/itex] which is the total number of trials before the [itex]k[/itex]th success occurs. Given values on [itex]p_i \forall i \in \left \{1, 2, 3, ... , n \right \}[/itex], this can be computed in polynomial time using memoized recursion.

Therefore, PMF is defined for [itex]p_i \in [0,1][/itex]. However, with the preconditions about [itex]p_i[/itex], we only know the PMF of a small subset or in other words, it will be truncated. Can we still say anything about the statistical qualities such as mean, variance and median of the full distribution or if there's a way to estimate them from the truncated distribution?

So far, I've estimated them by approximating the distribution with the truncated distribution, which I think should be valid when the cumulative probability of the right tail of the untruncated distribution is sufficiently small. But then again, I suppose this would assume that the probability sequence is well-behaved in the sense that the values of the probability sequence don't jump too much between extremes or has long sequences of extremes. I wonder if all of this is correct and whether or not there is a better approach.

An example would be having an urn that always contain 30 marbles, each being either black or white. You are also told the proportion of black and white marbles. You pick one marble, then empty the urn and fill it with another 30 marbles which may not necessarily have the same proportion of black and white marbles as the previous trial. How many urns would one then need to go through on average until you pick the seventh white marble?

This is a problem I've been looking into on my spare time and I'm having a difficult time nailing it down. Essentially, it's an "extension" of the negative binomial distribution in the sense that instead of each Bernoulli trial being i.i.d., they're independent but not necessarily identically distributed. In other words, the probability for each independent trial is a sequence of probabilities. When the number of trials are fixed, the number of successes follows a Poisson-Binomial distribution.

However, in the case of a Negative Poisson-Binomial distribution, one would count the number of trials until the [itex]k[/itex]th success occurs. Hence in theory, the support for this distribution are all integers greater than or equal to [itex]k[/itex]. It follows then that the probability sequence must be defined for all positive, non-zero integers or to put it another way, the function that the sequence is drawn from has all positive, non-zero integers (i.e. the index of the trial) as a subset of its domain at least.

However, in practice that function may be a black box or may only be feasibly acquired numerically. So how then would one go around to compute this? If one were to look at the definition, the desired probability is the likelihood of getting [itex]k-1[/itex] successes out of [itex]n-1[/itex] trials followed by a success. If [itex]f_{n}(k)[/itex] is the PMF for a Poisson-Binomial distribution with n trials then the PMF for the Negative Poisson-Binomial distribution is [itex]f_{n-1}(k-1) p_n[/itex] where [itex]p_i[/itex] is the success probability of the [itex]i[/itex]th trial. Keeping [itex]k[/itex] fixed makes this a function of [itex]n[/itex] which is the total number of trials before the [itex]k[/itex]th success occurs. Given values on [itex]p_i \forall i \in \left \{1, 2, 3, ... , n \right \}[/itex], this can be computed in polynomial time using memoized recursion.

Therefore, PMF is defined for [itex]p_i \in [0,1][/itex]. However, with the preconditions about [itex]p_i[/itex], we only know the PMF of a small subset or in other words, it will be truncated. Can we still say anything about the statistical qualities such as mean, variance and median of the full distribution or if there's a way to estimate them from the truncated distribution?

So far, I've estimated them by approximating the distribution with the truncated distribution, which I think should be valid when the cumulative probability of the right tail of the untruncated distribution is sufficiently small. But then again, I suppose this would assume that the probability sequence is well-behaved in the sense that the values of the probability sequence don't jump too much between extremes or has long sequences of extremes. I wonder if all of this is correct and whether or not there is a better approach.

An example would be having an urn that always contain 30 marbles, each being either black or white. You are also told the proportion of black and white marbles. You pick one marble, then empty the urn and fill it with another 30 marbles which may not necessarily have the same proportion of black and white marbles as the previous trial. How many urns would one then need to go through on average until you pick the seventh white marble?