Combining independent probabilities in a meaningful way

In summary, multiplying all likelihood functions p(d1|m)p(d2|m)... (or taking the sum of their logarithms) is the right approach to computing the probability of a model parameter across multiple data sets. However, this product will decrease as the number of experiments increases, because it is a probability in data space (the space spanned by all of d1, d2, d3, etc.), and that distribution becomes ever more "spread out" as the size of the space grows.
  • #1
daviddoria
I have 1000 experiments on the same data, each of which is trying to decide the probability that the data agrees with a model.

The problem is, even if each experiment REALLY agrees with the model (i.e., p(d1) = .99, p(d2) = .99, etc.), when I multiply these together to get P(d1 & d2 & d3 ... & dn), the result is some really, really small number (near 0). If I do 2 sets of such experiments, I can tell which set of data had a higher probability of matching the model (simply by seeing which of the two combined numbers is larger), but the actual numbers at that point are meaningless - they are only useful in a "relative to a different one" sense.
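For example, here is a minimal sketch of what I mean (Python, with made-up per-experiment likelihood values):

[code]
# Made-up per-experiment likelihoods: every experiment "really agrees" with the model.
likelihoods_a = [0.99] * 1000   # first set of experiments
likelihoods_b = [0.95] * 1000   # second set of experiments

def product(values):
    out = 1.0
    for v in values:
        out *= v
    return out

p_a = product(likelihoods_a)   # ~4.3e-5
p_b = product(likelihoods_b)   # ~5.3e-23

# Both joint values are tiny even though every factor is close to 1;
# only their relative size says which set matches the model better.
print(p_a, p_b, p_a > p_b)
[/code]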

Is there a way to combine these to get a "valid" probability without just multiplying them? A reasonable idea seems to be to just "average" them, i.e. [tex]\frac{1}{n}\sum_{i=1}^{n} p(d_i)[/tex], but everyone tells me this is a really terrible idea and that it is not a real probability any more.

Thoughts?

Thanks,

Dave
 
  • #2
Use the log likelihood instead.
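For instance, a minimal sketch (Python, likelihood values made up): summing the logs gives the same ordering without ever forming the tiny product.

[code]
import math

likelihoods_a = [0.99] * 1000   # made-up per-experiment likelihoods
likelihoods_b = [0.95] * 1000

# log p(d1|m) + log p(d2|m) + ... is the log of the joint likelihood
# and stays in a numerically comfortable range.
log_like_a = sum(math.log(p) for p in likelihoods_a)   # ~ -10.05
log_like_b = sum(math.log(p) for p in likelihoods_b)   # ~ -51.29

print(log_like_a, log_like_b, log_like_a > log_like_b)
[/code]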
 
  • #3
But if you sum log likelihoods instead of multiplying regular likelihoods, you still get the same problem of numbers that don't mean anything unless you compare them to one another (they're just large negative numbers instead of tiny positive ones). Know what I mean?

David
 
  • #4
daviddoria said:
But if you sum log likelihoods instead of multiplying regular likelihoods, you still get the same problem of numbers that don't mean anything unless you compare them to one another (they're just large negative numbers instead of tiny positive ones). Know what I mean?

I'm afraid not. Why do you care about what the numbers look like in an absolute sense? You just want the model that gives you the highest likelihood, so it's not clear to me what meaning the likelihoods could have absent some comparison to one another.

If you just want to get nicely-scaled numbers, in addition to using the log-likelihood you should divide it by the number of data points (i.e., the average log likelihood).
 
  • #5
Here is a separate problem:

As the number of experiments increases, the likelihood can only go down. It seems like it should go the other way... i.e., if you have 5 experiments that each "agree" with the model with .8 likelihood, intuitively you are less sure about the agreement than if you have 1000 experiments that each agree with .99 likelihood, but the sum of the log-likelihoods suggests the opposite... see what I mean?

Is there a way to overcome this?

Dave
 
  • #6
daviddoria said:
Here is a separate problem:

As the number of experiments increases, the likelihood can only go down. It seems like it should go the other way... i.e., if you have 5 experiments that each "agree" with the model with .8 likelihood, intuitively you are less sure about the agreement than if you have 1000 experiments that each agree with .99 likelihood, but the sum of the log-likelihoods suggests the opposite... see what I mean?

Is there a way to overcome this?

Yes, use the average log likelihood. In the first case (5 data points with 0.8 likelihood), the average log likelihood is -0.2231. In the second case (1000 data points with 0.99 likelihood), the average log likelihood is -0.0101. Both results are insensitive to the number of data points, and so can be meaningfully compared.
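A quick check of those two numbers (Python):

[code]
import math

# 5 data points, each with likelihood 0.8 under the model
avg_a = sum(math.log(0.8) for _ in range(5)) / 5          # = log(0.8) ≈ -0.2231

# 1000 data points, each with likelihood 0.99 under the model
avg_b = sum(math.log(0.99) for _ in range(1000)) / 1000   # = log(0.99) ≈ -0.0101

print(avg_a, avg_b)   # the 1000-point case scores better per data point
[/code]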

Also, after thinking about it, there are useful interpretations of the absolute value of the average log likelihood. Specifically, under certain assumptions of stationarity/ergodicity on the data, the negative average log likelihood will converge to the cross-entropy between the model distribution and the true distribution. The cross-entropy has a number of interesting properties and interpretations, one of which is that it is lower-bounded by the entropy of the true distribution. You can think of the cross-entropy as telling you the average number of bits you'd need to send to specify a given piece of data, under the estimated model. This, in turn, is equal to the true entropy plus the KL divergence between the true distribution and the estimated one, so we see that maximum likelihood estimation is equivalent to minimization of the KL divergence between the model distribution and the true distribution.
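Here is a small numerical illustration of that (Python, with a made-up true distribution and model over three outcomes): the negative average log likelihood of a large i.i.d. sample approaches the cross-entropy, which decomposes into the true entropy plus the KL divergence.

[code]
import math
import random

random.seed(0)

outcomes = ["a", "b", "c"]
p_true  = {"a": 0.5, "b": 0.3, "c": 0.2}   # true distribution (made up)
p_model = {"a": 0.4, "b": 0.4, "c": 0.2}   # estimated model (made up)

# Cross-entropy H(p_true, p_model) and its decomposition into entropy + KL
cross_entropy = -sum(p_true[x] * math.log(p_model[x]) for x in outcomes)
entropy       = -sum(p_true[x] * math.log(p_true[x]) for x in outcomes)
kl            =  sum(p_true[x] * math.log(p_true[x] / p_model[x]) for x in outcomes)
assert abs(cross_entropy - (entropy + kl)) < 1e-12

# Negative average log likelihood of a large sample drawn from p_true
sample = random.choices(outcomes, weights=[p_true[x] for x in outcomes], k=100_000)
neg_avg_ll = -sum(math.log(p_model[x]) for x in sample) / len(sample)

print(cross_entropy, neg_avg_ll)   # these two numbers should be close
[/code]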
 
  • #7
quadraphonics is right: multiplying all likelihood functions p(d1|m)p(d2|m)... (or taking the sum of their logarithms) is the right approach.

This product will decrease if you increase the number of experiments (at least if you are dealing with discrete variables). This is because it is a probability in data space (the space spanned by all of d1, d2, d3, etc.), where the probability distribution will be ever more "spread out" as the size of the space increases.

If you are interested in the probability of your model parameter m, to get the actual probability you need to multiply by a normalization constant: K*p(d1|m)p(d2|m)... chosen so that if you sum up the probability of all possible values of m, the result is equal to 1 (this implicitly assumes a flat prior over m).

However, if you just want to find the m with the highest probability rather than the actual value of the probability itself, you can ignore calculating the constant and just look for the m that maximizes the product p(d1|m)p(d2|m)...

-Emanuel
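A sketch of that normalization step (Python; the coin-flip model, the grid of m values, and the data are all made up):

[code]
import math

# Hypothetical setup: m is a coin's probability of heads, the data are flips.
grid = [i / 100 for i in range(1, 100)]    # candidate values of m
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]      # 1 = heads, 0 = tails (made up)

def log_likelihood(m, data):
    # sum of log p(d_i | m) for independent Bernoulli observations
    return sum(math.log(m if d == 1 else 1.0 - m) for d in data)

log_liks = [log_likelihood(m, data) for m in grid]

# Normalize so the probabilities over the grid sum to 1 (flat prior over m assumed).
liks = [math.exp(ll) for ll in log_liks]
K = 1.0 / sum(liks)
posterior = [K * lk for lk in liks]

best_m = max(grid, key=lambda m: log_likelihood(m, data))
print(best_m, sum(posterior))   # the maximizing m, and a total of ~1.0
[/code]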
 
  • #8
Yeah, the problem with just maximizing the sum of log likelihoods is that if I then run a second experiment, the two resulting numbers are not comparable. I guess that's where finding that normalization constant comes in... I can compare the normalized values. That constant could be far from trivial to find, though!

Thanks for the discussion, all.

Dave
 
  • #9
daviddoria said:
Yeah, the problem with just maximizing the sum of log likelihoods is that if I then run a second experiment, the two resulting numbers are not comparable. I guess that's where finding that normalization constant comes in... I can compare the normalized values. That constant could be far from trivial to find, though!

That's only a problem if the experiments have different numbers of data points. But it's trivial to compensate for this: use the *average* log likelihood (i.e., normalized by the number of data points). This will not be sensitive to the number of data points (to first order; if the number of data points ever becomes small relative to the degrees of freedom of the model, you will then get over-fitting effects).
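A sketch of that comparison (Python, two made-up experiments of different sizes scored under the same model):

[code]
import math

# Made-up per-point likelihoods under the same model
experiment_1 = [0.9] * 50     # 50 data points
experiment_2 = [0.8] * 5000   # 5000 data points

total_1 = sum(math.log(p) for p in experiment_1)   # ~ -5.3
total_2 = sum(math.log(p) for p in experiment_2)   # ~ -1115.7

avg_1 = total_1 / len(experiment_1)   # ~ -0.105
avg_2 = total_2 / len(experiment_2)   # ~ -0.223

# The totals mostly reflect the sample sizes; the averages are directly
# comparable and show that experiment 1's data fit the model better per point.
print(avg_1, avg_2)
[/code]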
 

What is meant by "combining independent probabilities in a meaningful way"?

Combining independent probabilities in a meaningful way refers to the process of calculating the likelihood of two or more independent events occurring together. It involves using mathematical methods to determine the probability of a compound event made up of two or more independent events.

Why is it important to combine independent probabilities in a meaningful way?

Combining independent probabilities in a meaningful way is important because it allows us to make more accurate predictions and decisions based on the likelihood of multiple events occurring together. It is especially useful in fields such as statistics, finance, and risk management.

What are some common methods for combining independent probabilities?

Some common methods for combining independent probabilities include the multiplication rule (for the probability that all of the events occur), the addition rule (for the probability that at least one of them occurs), and the binomial distribution (for the number of successes in repeated independent trials). These methods take the individual probabilities of each event and combine them to determine the probability of the compound event.
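A small sketch of those rules (Python, with made-up probabilities):

[code]
from math import comb

p_a, p_b = 0.6, 0.3   # made-up probabilities of independent events A and B

both   = p_a * p_b               # multiplication rule: P(A and B)
either = p_a + p_b - p_a * p_b   # addition rule for independent events: P(A or B)

# Binomial distribution: probability of exactly k successes in n independent trials
n, k, p = 10, 3, 0.3
exactly_k = comb(n, k) * p**k * (1 - p)**(n - k)

print(both, either, exactly_k)
[/code]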

How does combining independent probabilities differ from combining dependent probabilities?

Combining independent probabilities differs from combining dependent probabilities in that independent events are not affected by each other, while dependent events are. This means that the probability of independent events occurring together is calculated differently than the probability of dependent events occurring together.

What are some real-world applications of combining independent probabilities?

Combining independent probabilities has many real-world applications, including predicting stock market trends, estimating the likelihood of success in medical treatments, and determining the probability of winning in gambling games. It is also used in risk management to assess the likelihood of multiple events occurring together and to make informed decisions based on that information.
