Combining independent probabilities in a meaningful way

daviddoria · Jan 27, 2009

I have 1000 experiments on the same data, each of which is trying to decide the probability that the data agrees with a model.

The problem is that even if each experiment REALLY agrees with the model (i.e., p(d1) = 0.99, p(d2) = 0.99, etc.), when I multiply these together to get P(d1 & d2 & d3 ... & dn), the result is a very small number (near 0). If I run two sets of such experiments, I can tell which set of data had a higher probability of matching the model (simply by seeing which of the two combined numbers is larger), but the actual numbers at that point are meaningless: they are only useful relative to one another.

Is there a way to combine these to get a "valid" probability without just multiplying them? A reasonable idea seems to be to just average them, i.e., [tex]\frac{1}{n}\sum_{i=1}^{n} p(d_i)[/tex], but everyone tells me this is a terrible idea and that the result is no longer a real probability.

Thoughts?

Thanks,

Dave

quadraphonics · Jan 27, 2009

Use the log likelihood instead.
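A minimal sketch of why the log form is usually preferred numerically. The function names and the example values (1000 likelihoods of 0.99, and 1000 likelihoods of 1e-5) are assumptions chosen for illustration, not anything from the experiments described above.

```python
import math

def product_likelihood(likelihoods):
    """Multiply the per-experiment likelihoods directly."""
    result = 1.0
    for p in likelihoods:
        result *= p
    return result

def sum_log_likelihood(likelihoods):
    """Sum the natural logs of the per-experiment likelihoods."""
    return sum(math.log(p) for p in likelihoods)

# Hypothetical inputs: 1000 experiments that each agree well with the model,
# and 1000 experiments whose individual likelihoods are very small.
good_fit = [0.99] * 1000
tiny = [1e-5] * 1000

print(product_likelihood(good_fit))  # small, but still representable
print(sum_log_likelihood(good_fit))  # same information on a log scale
print(product_likelihood(tiny))      # underflows to 0.0 in double precision
print(sum_log_likelihood(tiny))      # still a finite, comparable number
```

The log sum carries the same ordering as the product, so it can be compared between models even when the product itself is no longer representable.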

daviddoria · Jan 27, 2009

But if you sum log likelihoods instead of multiplying regular likelihoods, you still get the same problem of only having tiny numbers that don't mean anything unless you compare them to other tiny numbers (they're just larger than if you didn't take logs). Know what I mean?

David

quadraphonics · Jan 27, 2009

daviddoria said:

But if you sum log likelihoods instead of multiplying regular likelihoods, you still get the same problem of only having tiny numbers that don't mean anything unless you compare them to other tiny numbers (they're just larger than if you didn't take logs). Know what I mean?

I'm afraid not. Why do you care about what the numbers look like in an absolute sense? You just want the model that gives you the highest likelihood, so it's not clear to me what meaning the likelihoods could have absent some comparison to one another.

If you just want to get nicely-scaled numbers, in addition to using the log-likelihood you should divide it by the number of data points (i.e., the average log likelihood).

daviddoria · Jan 28, 2009

Here is a separate problem:

As the number of experiments increases, the likelihood can only go down. It seems like it should go the other way, i.e., if you have 5 experiments that each "agree" with the model with 0.8 likelihood, intuitively you are less sure about the agreement than if you have 1000 experiments that each agree with 0.99 likelihood, but the sum of the log-likelihoods shows the inverse is true. See what I mean?

Is there a way to overcome this?

Dave

quadraphonics · Jan 28, 2009

daviddoria said:

Here is a separate problem:

As the number of experiments increases, the likelihood can only go down. It seems like it should go the other way, i.e., if you have 5 experiments that each "agree" with the model with 0.8 likelihood, intuitively you are less sure about the agreement than if you have 1000 experiments that each agree with 0.99 likelihood, but the sum of the log-likelihoods shows the inverse is true. See what I mean?

Is there a way to overcome this?

Yes, use the average log likelihood. In the first case (5 data points with 0.8 likelihood), the average log likelihood is -0.2231. In the second case (1000 data points with 0.99 likelihood), the average log likelihood is -0.0101. Both results are insensitive to the number of data points, and so can be meaningfully compared.
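The two figures above can be reproduced with a short sketch. The function name is an assumption for illustration, and natural logarithms are assumed since they match the quoted values.

```python
import math

def average_log_likelihood(likelihoods):
    """Mean of the natural-log likelihoods (sum of logs divided by count)."""
    return sum(math.log(p) for p in likelihoods) / len(likelihoods)

few = [0.8] * 5        # 5 data points, each with likelihood 0.8
many = [0.99] * 1000   # 1000 data points, each with likelihood 0.99

avg_few = average_log_likelihood(few)
avg_many = average_log_likelihood(many)

print(round(avg_few, 4))   # -0.2231
print(round(avg_many, 4))  # -0.0101
```

Unlike the raw sums, the averages rank the 1000-point case as the better fit, which matches the intuition in the question.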

Also, after thinking about it, there are useful interpretations of the absolute value of the average log likelihood. Specifically, under certain assumptions of stationarity/ergodicity on the data, the negative average log likelihood will converge to the cross-entropy between the true distribution and the model distribution. The cross-entropy has a number of interesting properties and interpretations, one of which is that it is lower-bounded by the entropy of the true distribution. You can think of the cross-entropy as telling you the average number of bits you'd need to send to specify a given piece of data, under the estimated model (bits if the logs are base 2; with natural logs, as in the figures above, the unit is nats). This, in turn, is equal to the true entropy plus the KL divergence between the true distribution and the estimated one, so we see that maximum likelihood estimation is equivalent to minimization of the KL divergence between the true distribution and the model distribution.
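As a rough numerical check of the decomposition described here (cross-entropy = entropy + KL divergence), here is a small sketch on two made-up discrete distributions. The distributions `true_dist` and `model_dist` and the helper names are assumptions for illustration only; natural logs are used, so the values are in nats.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i * ln(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * ln(q_i), with p the true distribution."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three outcomes.
true_dist = [0.5, 0.3, 0.2]
model_dist = [0.4, 0.4, 0.2]

h = entropy(true_dist)
ce = cross_entropy(true_dist, model_dist)
kl = kl_divergence(true_dist, model_dist)

print(h, ce, kl)
print(math.isclose(ce, h + kl))  # the decomposition holds
print(ce >= h)                   # cross-entropy is bounded below by entropy
```

Since the entropy of the true distribution does not depend on the model, lowering the cross-entropy and lowering the KL divergence amount to the same thing.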

winterfors · Jan 30, 2009

quadraphonics is right, multiplying all likelihood functions p(d1|m)p(d2|m)... (or taking the sum of their logarithms) is the right approach.

This product will decrease if you increase the number of experiments (at least if you are dealing with discrete variables). This is because it is a probability in data space (the space spanned by all of d1, d2, d3, etc.), where the probability distribution becomes ever more "spread out" as the size of the space increases.

If you are interested in the probability of your model parameter m, to get the actual probability you need to multiply by a normalization constant: K*p(d1|m)p(d2|m)... so that if you sum up the probability of all possible values of m, the result will be equal to 1. (This treats all values of m as equally likely beforehand, i.e., a flat prior on m.)

However, if you just want to find the m with the highest probability rather than the actual value of the probability itself, you can ignore calculating the constant and just look for the m that maximizes the product p(d1|m)p(d2|m)...
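A minimal sketch of the normalization idea, under several assumptions that are not in the thread: the data are coin-flip style outcomes (0 or 1), m is the probability of a 1, only a small grid of candidate m values is considered, and every candidate is treated as equally likely beforehand. All names are hypothetical. The normalization is done on the log scale (subtracting the largest log-likelihood before exponentiating) so that it does not underflow.

```python
import math

def log_likelihood(m, data):
    """Sum of ln p(d_i | m) for 0/1 outcomes, where m = P(d_i = 1)."""
    return sum(math.log(m if d == 1 else 1.0 - m) for d in data)

def normalized_posterior(candidates, data):
    """Return K * p(d1|m) * p(d2|m) * ... for each candidate m, summing to 1."""
    log_liks = [log_likelihood(m, data) for m in candidates]
    peak = max(log_liks)
    # Subtracting the peak rescales every term by the same factor, so the
    # normalized result is unchanged but exp() stays in a safe range.
    weights = [math.exp(ll - peak) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical data: 8 ones out of 10 outcomes.
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
candidates = [0.1 * k for k in range(1, 10)]  # m = 0.1, 0.2, ..., 0.9

posterior = normalized_posterior(candidates, data)
best_by_posterior = max(range(len(candidates)), key=lambda i: posterior[i])
best_by_likelihood = max(
    range(len(candidates)), key=lambda i: log_likelihood(candidates[i], data)
)

print(sum(posterior))                            # approximately 1
print(candidates[best_by_posterior])             # the most probable candidate
print(best_by_posterior == best_by_likelihood)   # same m with or without K
```

The last line illustrates the point about ignoring the constant: normalizing changes the values but not which m comes out on top.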

-Emanuel

daviddoria · Jan 30, 2009

Yeah, the problem with just maximizing the sum of log likelihoods is that if I then run a second experiment, the two resulting numbers are not comparable. I guess that's where finding that normalization constant comes in: I can compare the normalized values. That could be far from trivial to find, though!

Thanks for the discussion, all.

Dave

quadraphonics · Jan 30, 2009

daviddoria said:

Yeah, the problem with just maximizing the sum of log likelihoods is that if I then run a second experiment, the two resulting numbers are not comparable. I guess that's where finding that normalization constant comes in: I can compare the normalized values. That could be far from trivial to find, though!

That's only a problem if the experiments have different numbers of data points. But it's trivial to compensate for this: use the *average* log likelihood (i.e., normalized by the number of data points). This will not be sensitive to the number of data points (to first order; if the number of data points ever becomes small relative to the degrees of freedom of the model, you will then get over-fitting effects).
