Asymmetry between probability distributions


Discussion Overview

The discussion revolves around the asymmetry observed in Kullback-Leibler divergence (KLD) between prior and posterior probability distributions defined on a finite event space. Participants explore the implications of this asymmetry, its geometric representation, and the conditions under which it manifests, without reaching a consensus on the underlying reasons or implications.

Discussion Character

  • Exploratory, Technical explanation, Debate/contested, Mathematical reasoning

Main Points Raised

  • One participant notes that KLD is not symmetric, leading to different values when switching the roles of prior and posterior probabilities.
  • Another participant proposes a definition of ">" based on KLD values and questions its transitivity, suggesting that if P>Q and Q>R, then P>R should hold.
  • A participant highlights that KLD indicates how quickly one learns that the true distribution is not the assumed one, emphasizing the asymmetry in this learning process.
  • Continuity between extreme and non-extreme probabilities is mentioned as a justification for the observed asymmetry, though some participants argue against this view.
  • Counter-examples are presented to challenge the idea of transitivity in KLD relationships, with specific probability distributions demonstrating this violation.
  • Participants discuss the implications of these findings on the understanding of KLD as a measure of dissimilarity between probability distributions.

Areas of Agreement / Disagreement

Participants express differing views on the implications of KLD asymmetry and its transitivity, with no consensus reached on the validity of certain geometric models or the conditions under which the asymmetry holds.

Contextual Notes

Some participants acknowledge the complexity of proving certain relationships in KLD, particularly in higher dimensions, and note that the triangle inequality may not hold in the same way as with symmetric measures.

Who May Find This Useful

Readers interested in probability theory, information theory, and the mathematical properties of divergence measures may find this discussion relevant.

noowutah
I have made an interesting observation that I can't explain to myself. Think about a prior probability P and a posterior probability Q. They are defined on an event space W with only three elements: w1, w2, and w3 (the number of elements won't matter as long as it's finite). The Kullback-Leibler divergence measures how far apart these probability distributions are, i.e. how much information it takes to get from P to Q. If P(w1)=p1, Q(w1)=q1, and so on, then

KLD(Q,P)=q1*log(q1/p1)+q2*log(q2/p2)+q3*log(q3/p3)

The KLD is not symmetric, so if P and Q switch roles (Q is now the prior and P the posterior), the divergence will be different. If you think of P and Q as points on a simplex (all points in R3 with r1+r2+r3=1 with rj>0; the simplex in R3 looks like an equilateral triangle), the KLD does NOT define a metric topology on this simplex, because KLD(Q,P) is in general not equal to KLD(P,Q).
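The asymmetry is easy to see numerically. Below is a minimal sketch of the formula above, assuming natural logarithms and strictly positive priors; the function name `kld` and the example value of Q are illustrative choices, not taken from the thread.

```python
import math

def kld(q, p):
    """KLD(Q,P) = sum_i q_i * log(q_i / p_i); the prior P is the second argument.

    Assumes every p_i > 0. Terms with q_i == 0 are skipped (0 * log 0 := 0).
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

P = (0.4, 0.4, 0.2)   # prior
Q = (0.7, 0.2, 0.1)   # posterior (hypothetical example point)

print(kld(Q, P))  # information needed to get from P to Q
print(kld(P, Q))  # information needed to get from Q to P; a different number
```

Switching to another logarithm base only rescales both numbers by the same factor, so it does not change which direction is "harder".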

My original intuition was that the asymmetry works like this: going from the centre to the periphery of the equilateral triangle (i.e. decreasing the entropy of the probability distribution) takes less information than going the other way, from the periphery to the centre, so

H(P)>H(Q) implies that KLD(Q,P)<KLD(P,Q)

Note that the prior is the second argument of KLD, which is a bit counterintuitive. Here H is the Shannon entropy H(P)=-p1*log(p1)-p2*log(p2)-p3*log(p3).
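The conjecture can be tested on a single pair of distributions. This is a rough sketch, assuming natural logarithms; the helper names `kld` and `entropy` are illustrative, and the pair P, Q is just one convenient test case (the uniform distribution and a point part-way toward a corner).

```python
import math

def kld(q, p):
    """KLD(Q,P) = sum_i q_i * log(q_i / p_i), prior P as second argument."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def entropy(p):
    """Shannon entropy H(P) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

P = (1/3, 1/3, 1/3)      # centre of the simplex, maximal entropy
Q = (0.5, 0.25, 0.25)    # lower entropy than P

print(entropy(P) > entropy(Q))   # True: the premise H(P)>H(Q) holds
print(kld(Q, P) < kld(P, Q))     # False: the conjectured conclusion does not
```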

In any case, my intuition is wrong. Let P (the prior) be fixed. Then you can partition the simplex into those points Q for which KLD(Q,P)>KLD(P,Q) (colour them red) and those for which KLD(Q,P)<KLD(P,Q) (colour them blue). The partitions are pretty and far from trivial. How could you defend this in terms of intuitions about probability distributions? Is there any way to explain, without recourse to information theory, why going from P to Q1 is harder than going from Q1 to P; while it is easier going from P to Q2 than going from Q2 to P? Q1 is an arbitrary red point, while Q2 is an arbitrary blue point.

Here is the partition for P=(1/3,1/3,1/3):

http://streetgreek.com/lpublic/various/asym-eq.png

And here for P=(0.4,0.4,0.2):

http://streetgreek.com/lpublic/various/asym422.png

And here for P=(0.242,0.604,0.154):

http://streetgreek.com/lpublic/various/asym262.png

And here for P=(0.741,0.087,0.172):

http://streetgreek.com/lpublic/various/asym712.png
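For readers who cannot load the images, the red/blue classification can be approximated on a grid. This is only a sketch under stated assumptions: the grid resolution `n`, the tie tolerance, and the function names `colour` and `partition_counts` are all illustrative, and it counts points rather than drawing them.

```python
import math

def kld(q, p):
    """KLD(Q,P) = sum_i q_i * log(q_i / p_i), prior P as second argument."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def colour(q, p):
    """'red' if KLD(Q,P) > KLD(P,Q), 'blue' if it is smaller, 'tie' if equal."""
    d = kld(q, p) - kld(p, q)
    if abs(d) < 1e-12:
        return "tie"
    return "red" if d > 0 else "blue"

def partition_counts(p, n=60):
    """Classify all interior grid points Q = (i/n, j/n, k/n) with i+j+k = n."""
    counts = {"red": 0, "blue": 0, "tie": 0}
    for i in range(1, n):
        for j in range(1, n - i):
            k = n - i - j            # k >= 1, so Q stays inside the simplex
            counts[colour((i / n, j / n, k / n), p)] += 1
    return counts

for p in [(1/3, 1/3, 1/3), (0.4, 0.4, 0.2)]:
    print(p, partition_counts(p))
```

Feeding the grid points and their colours to any plotting tool should give a picture comparable to the ones linked above.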
 
An interesting problem!

Let's define ">" on points on the plane as P>Q iff KLD(Q,P)<KLD(P,Q), i.e. going from Q to P needs more information than the opposite.
Is this transitive? If P>Q and Q>R, is P>R?
If yes, there should be a "smallest" point, one where the whole plane is colored blue.

Your points don't seem to suggest this. If the relation is not transitive, it has a weird consequence: you can find a set P, Q, R where going P->Q->R->P needs more information than going P->R->Q->P.

For an event space with just two events, the solution should be:
P>Q for (p<q and p+q>1) or (p>q and p+q<1) where p=p1 and q=q1.
In other words, P>Q if |p-1/2| < |q-1/2|.
Going closer to the middle needs more information than going outwards. This is transitive.
(I'm sure there is at least one sign error in it)

Example:
p -> q:
0.3 -> 0.1: KLD(Q,P)=0.0505
0.1 -> 0.3: KLD(Q,P)=0.0667
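The two numbers above come out when base-10 logarithms are used. Here is a small sketch that reproduces them and then checks the "closer to the middle" rule on a grid; the function name `kld2`, the grid size `n`, and the choice of base 10 are assumptions made for this illustration.

```python
import math

def kld2(q, p, log=math.log10):
    """Two-event KLD(Q,P) with P = (p, 1-p) and Q = (q, 1-q)."""
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

print(round(kld2(0.1, 0.3), 4))  # prior 0.3 -> posterior 0.1: 0.0505
print(round(kld2(0.3, 0.1), 4))  # prior 0.1 -> posterior 0.3: 0.0667

# Check on a grid: whenever P is strictly closer to 1/2 than Q,
# KLD(Q,P) < KLD(P,Q) should hold, i.e. P > Q in the sense defined above.
n = 50
violations = 0
for i in range(1, n):
    for j in range(1, n):
        if abs(2 * i - n) < abs(2 * j - n):   # integer test for |p-1/2| < |q-1/2|
            p, q = i / n, j / n
            if not kld2(q, p) < kld2(p, q):
                violations += 1
print(violations)  # 0
```

The comparison of distances to 1/2 is done on integers so that rounding in `i / n` cannot blur the borderline cases p=q and p=1-q, where the two divergences coincide.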
 
Notice that in the definition of the KLD there is an expectation value taken with respect to Q: KLD(Q,P)=q1*log(q1/p1)+q2*log(q2/p2)+q3*log(q3/p3). And of what? The logarithm of the ratio of the probabilities according to Q and to P.

KLD tells us how fast you learn, when the true distribution is Q, that it isn't P. This quantity is asymmetric, which is pretty obvious when you think about examples in which one of the p's or one of the q's is zero.
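The zero-probability case can be made concrete. This is a sketch assuming the usual conventions (a term with q_i=0 contributes nothing, and the divergence is infinite when Q gives weight to an event that P rules out); the distributions are hypothetical examples.

```python
import math

def kld(q, p):
    """KLD(Q,P) = E_Q[log(q/p)], with the usual conventions for zeros."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0:
            continue             # 0 * log(0/p) is taken to be 0
        if pi == 0:
            return math.inf      # Q allows an event that P declares impossible
        total += qi * math.log(qi / pi)
    return total

P = (0.5, 0.5, 0.0)      # P says w3 never happens
Q = (1/3, 1/3, 1/3)

print(kld(Q, P))  # inf: if Q is true, one observation of w3 refutes P outright
print(kld(P, Q))  # finite (log 1.5): if P is true, w3 just never shows up
```

If Q is the truth, P can be refuted by a single observation; if P is the truth, the evidence against Q accumulates only gradually.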
 
gill1109 -- yes, you are absolutely right. This is precisely what I am trying to show: that the asymmetry is also justified when one of the p's or q's is NOT zero. This is not as obvious as it appears. Continuity between the extreme probability case and the non-extreme probability case is, of course, one argument for asymmetry. But some people have invested a lot of time into a geometric model of non-extreme probabilities where all the distances are symmetric. I am trying to show that they are wrong. There is a sense in which my argument isn't doing so well -- the asymmetries, as the diagrams show, are all over the place and intuitively unpredictable.

mfb -- excellent point. I've been trying since I saw your post to prove that in the two-dimensional case H(p)>H(q) implies the kind of asymmetry you suggest. It's turning out to be a more difficult proof than I envisioned but you must be right about this. Be that as it may, it's not true for the three-dimensional case; there are lots of counter-examples, as my diagrams show. Transitivity should not be at issue here, but more so the triangle inequality: it should definitely be harder to get from P->Q->R than to go from P->R directly, but that's true for both symmetric measures and the KLD.

I will keep working on this. If anybody has ideas please let me know.
 
What does your plot look like for P=(0.5, 0.25, 0.25)? That seems to be the center of one of the three lobes in the first plot.
 
http://www.streetgreek.com/lpublic/various/asym533.png
P=(0.5,0.25,0.25)
 
Interesting. No transitivity then.

P = (1/3, 1/3, 1/3)
Q = (1/2, 1/4, 1/4)
R = (0.4,0.4,0.2)

KLD(Q,P)>KLD(P,Q)
KLD(R,Q)>KLD(Q,R)
KLD(P,R)>KLD(R,P)
KLD(Q,P)+KLD(R,Q)+KLD(P,R) > KLD(P,Q)+KLD(Q,R)+KLD(R,P)
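These comparisons can be checked directly by comparing each divergence with its reverse. This is a minimal sketch with natural logarithms; the helper `kld` and the variable names are illustrative.

```python
import math

def kld(q, p):
    """KLD(Q,P) = sum_i q_i * log(q_i / p_i), prior P as second argument."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

P = (1/3, 1/3, 1/3)
Q = (1/2, 1/4, 1/4)
R = (0.4, 0.4, 0.2)

# Each divergence next to its reverse.
for name, (a, b) in {"Q,P": (Q, P), "R,Q": (R, Q), "P,R": (P, R)}.items():
    print(f"KLD({name}) = {kld(a, b):.4f}   reversed: {kld(b, a):.4f}")

one_way = kld(Q, P) + kld(R, Q) + kld(P, R)     # P -> Q -> R -> P
other_way = kld(P, Q) + kld(Q, R) + kld(R, P)   # P -> R -> Q -> P
print(one_way > other_way)  # True
```

In terms of the relation ">" defined earlier, the three inequalities read Q>P, R>Q and P>R, which is a cycle, so ">" cannot be transitive.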
 
Fascinating, mfb! That should be another problem for the Kullback-Leibler divergence as a measure of dissimilarity between probability distributions. Violation of this kind of transitivity is even harder to square with epistemic intuitions we have about probabilities and updating them than the non-trivial asymmetry patterns that I pointed out.
 
