# Asymmetry between probability distributions

1. Sep 4, 2015

### stlukits

I have made an interesting observation that I can't explain to myself. Think about a prior probability P and a posterior probability Q. They are defined on an event space W with only three elements: w1, w2, and w3 (the number of elements won't matter as long as it's finite). The Kullback-Leibler divergence measures how far apart these probability distributions are, i.e. how much information it takes to get from P to Q. If P(w1)=p1, Q(w1)=q1, etc., then

KLD(Q,P)=q1*log(q1/p1)+q2*log(q2/p2)+q3*log(q3/p3)

The KLD is not symmetric, so if P and Q switch roles (Q is now the prior and P the posterior), the divergence will be different. If you think of P and Q as points on a simplex (all points in R3 with r1+r2+r3=1 with rj>0; the simplex in R3 looks like an equilateral triangle), the KLD does NOT define a metric topology on this simplex, because KLD(Q,P) is in general not equal to KLD(P,Q).
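As a minimal sketch of the asymmetry, the snippet below evaluates the divergence in both directions for one pair of points. The function name `kld`, the use of natural logarithms, and the particular P and Q are assumptions made for illustration; the argument order follows the post (prior as second argument).

```python
import math

def kld(q, p):
    """KLD(Q,P) = sum_i q_i * log(q_i / p_i), with the prior P as second argument."""
    # Terms with q_i == 0 contribute 0 by the usual convention.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Illustrative points on the simplex (chosen here, not prescribed by the post).
P = (1/3, 1/3, 1/3)
Q = (0.5, 0.25, 0.25)

forward = kld(Q, P)   # going from prior P to posterior Q
backward = kld(P, Q)  # going from prior Q to posterior P
print(forward, backward)
```

The two printed values differ, so swapping the roles of prior and posterior changes the divergence even when no probability is zero.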

My original intuition was that the way this asymmetry works is that as you go from the centre to the periphery of the equilateral triangle (i.e. the entropy of the probability distribution decreases), less information is necessary compared to the other way around going from the periphery to the centre, so

H(P)>H(Q) implies that KLD(Q,P)<KLD(P,Q)

Note that the prior is the second argument for KLD -- that's a bit counterintuitive. H is here the Shannon entropy H(P)=-p1log(p1)-p2log(p2)-p3log(p3).

In any case, my intuition is wrong. Let P (the prior) be fixed. Then you can partition the simplex into those points Q for which KLD(Q,P)>KLD(P,Q) (colour them red) and those for which KLD(Q,P)<KLD(P,Q) (colour them blue). The partitions are pretty and far from trivial. How could you defend this in terms of intuitions about probability distributions? Is there any way to explain, without recourse to information theory, why going from P to Q1 is harder than going from Q1 to P; while it is easier going from P to Q2 than going from Q2 to P? Q1 is an arbitrary red point, while Q2 is an arbitrary blue point.
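The red/blue colouring can be sketched in a few lines. The helper names `kld`, `entropy` and `colour`, and the test points Q1 and Q2, are assumptions chosen for illustration rather than anything taken from the original plots. For the uniform prior, every other point has lower entropy, so the intuition above would predict a single colour; the sketch shows one point of each colour.

```python
import math

def kld(q, p):
    """KLD(Q,P) with the prior P as second argument (natural log)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def entropy(p):
    """Shannon entropy H(P) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def colour(q, p):
    """'red' if KLD(Q,P) > KLD(P,Q), 'blue' if it is smaller, 'tie' if equal."""
    forward, backward = kld(q, p), kld(p, q)
    if forward > backward:
        return "red"
    if forward < backward:
        return "blue"
    return "tie"

P = (1/3, 1/3, 1/3)     # fixed prior
Q1 = (0.5, 0.25, 0.25)  # hypothetical test point
Q2 = (0.1, 0.1, 0.8)    # hypothetical test point

for Q in (Q1, Q2):
    # Print the point, its colour, and whether H(P) > H(Q) holds.
    print(Q, colour(Q, P), entropy(P) > entropy(Q))
```

Both points have lower entropy than P, yet Q1 comes out red and Q2 blue, so comparing entropies alone does not determine the direction of the asymmetry. Evaluating `colour` over a grid of points on the simplex would give a picture of the partition.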

Here is the partition for P=(1/3,1/3,1/3):

http://streetgreek.com/lpublic/various/asym-eq.png [Broken]

And here for P=(0.4,0.4,0.2):

http://streetgreek.com/lpublic/various/asym422.png [Broken]

And here for P=(0.242,0.604,0.154):

http://streetgreek.com/lpublic/various/asym262.png [Broken]

And here for P=(0.741,0.087,0.172):

http://streetgreek.com/lpublic/various/asym712.png [Broken]

Last edited by a moderator: May 7, 2017
2. Sep 4, 2015

### Staff: Mentor

An interesting problem!

Let's define ">" on points on the plane as P>Q iff KLD(Q,P)<KLD(P,Q), i.e. going from Q to P needs more information than the opposite.
Is this transitive? If P>Q and Q>R, is P>R?
If yes, there should be a "smallest" point, one where the whole plane is colored blue.

Your plots don't seem to suggest this. If the relation is not transitive, it has a weird consequence: you can find a set P, Q, R where going P->Q->R->P needs more information than going P->R->Q->P.

For an event space with just two events, the solution should be:
P>Q for (p>q and p+q<1) or (p<q and p+q>1) where p=p1 and q=q1.
In other words, P>Q if |p-1/2| < |q-1/2|.
Going closer to the middle needs more information than going outwards. This is transitive.
(I'm sure there is at least one sign error in it)
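The link between the absolute-value form and the sign conditions can be checked by squaring both sides and factoring. This is only the algebra relating the two ways of writing the region; it does not prove that the KLD asymmetry follows it.

```latex
\begin{align*}
\left|p-\tfrac12\right| < \left|q-\tfrac12\right|
  &\iff \left(p-\tfrac12\right)^2 - \left(q-\tfrac12\right)^2 < 0 \\
  &\iff (p-q)(p+q-1) < 0 \\
  &\iff (p>q \text{ and } p+q<1) \ \text{ or } \ (p<q \text{ and } p+q>1).
\end{align*}
```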

Example (base-10 logarithms):
p -> q:
0.3 -> 0.1: KLD(Q,P)=0.0505
0.1 -> 0.3: KLD(Q,P)=0.0667
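The two numbers in this example can be reproduced with a short script. The function name `kld_binary` is an assumption for illustration, and base-10 logarithms are assumed because that is the base that reproduces the quoted values.

```python
import math

def kld_binary(q, p, base=10):
    """KLD(Q,P) for two-event distributions Q=(q, 1-q) and P=(p, 1-p)."""
    return (q * math.log(q / p, base)
            + (1 - q) * math.log((1 - q) / (1 - p), base))

outward = kld_binary(q=0.1, p=0.3)  # 0.3 -> 0.1, away from the middle
inward = kld_binary(q=0.3, p=0.1)   # 0.1 -> 0.3, towards the middle
print(round(outward, 4), round(inward, 4))  # 0.0505 0.0667
```

Moving towards the middle (0.1 -> 0.3) costs more than moving outwards (0.3 -> 0.1), in line with the statement above.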

3. Sep 5, 2015

### gill1109

Notice that in the definition of the KLD there is an expectation value taken with respect to Q: KLD(Q,P)=q1*log(q1/p1)+q2*log(q2/p2)+q3*log(q3/p3). And of what? The logarithm of the ratios of the probabilities according to Q and according to P.

KLD tells us how fast you learn, when the true distribution is Q, that it isn't P. And this quantity is asymmetric. Which is pretty obvious (the asymmetry) when you think about some examples in which one of the p's or one of the q's is zero.

4. Sep 6, 2015

### stlukits

gill1109 -- yes, you are absolutely right. This is precisely what I am trying to show: that the asymmetry is also justified when one of the p's or q's is NOT zero. This is not as obvious as it appears. Continuity between the extreme probability case and the non-extreme probability case is, of course, one argument for asymmetry. But some people have invested a lot of time into a geometric model of non-extreme probabilities where all the distances are symmetric. I am trying to show that they are wrong. There is a sense in which my argument isn't doing so well -- the asymmetries, as the diagrams show, are all over the place and intuitively unpredictable.

mfb -- excellent point. I've been trying since I saw your post to prove that in the two-dimensional case H(p)>H(q) implies the kind of asymmetry you suggest. It's turning out to be a more difficult proof than I envisioned but you must be right about this. Be that as it may, it's not true for the three-dimensional case; there are lots of counter-examples, as my diagrams show. Transitivity should not be at issue here, but more so the triangle inequality: it should definitely be harder to get from P->Q->R than to go from P->R directly, but that's true for both symmetric measures and the KLD.

I will keep working on this. If anybody has ideas please let me know.

Last edited: Sep 6, 2015
5. Sep 6, 2015

### Staff: Mentor

What does your plot look like for P=(0.5, 0.25, 0.25)? That seems to be the center of one of the three lobes in the first plot.

6. Sep 6, 2015

### stlukits

http://www.streetgreek.com/lpublic/various/asym533.png [Broken]
P=(0.5,0.25,0.25)

Last edited by a moderator: May 7, 2017
7. Sep 6, 2015

### Staff: Mentor

Interesting. No transitivity then.

P = (1/3, 1/3, 1/3)
Q = (1/2, 1/4, 1/4)
R = (0.4,0.4,0.2)

KLD(Q,P)>KLD(P,Q)
KLD(R,Q)>KLD(Q,R)
KLD(P,R)>KLD(R,P)
KLD(Q,P)+KLD(R,Q)+KLD(P,R) > KLD(P,Q)+KLD(Q,R)+KLD(R,P)
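A quick numerical check of this cycle, comparing each divergence with its reverse for the three points given. The helper `kld` and the variable names are assumptions for illustration; natural logarithms are used, and the base does not affect the direction of the inequalities.

```python
import math

def kld(q, p):
    """KLD(Q,P) with the prior P as second argument (natural log)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

points = {
    "P": (1/3, 1/3, 1/3),
    "Q": (0.5, 0.25, 0.25),
    "R": (0.4, 0.4, 0.2),
}

# Each pair (a, b) compares KLD(a,b) with the reversed KLD(b,a).
steps = [("Q", "P"), ("R", "Q"), ("P", "R")]
one_way = [kld(points[a], points[b]) for a, b in steps]
other_way = [kld(points[b], points[a]) for a, b in steps]

for (a, b), f, r in zip(steps, one_way, other_way):
    print(f"KLD({a},{b})={f:.4f}  KLD({b},{a})={r:.4f}")

# Total cost around the cycle in each direction.
print(sum(one_way), sum(other_way))
```

In each of the three pairs the first divergence is the larger one, so the relation ">" defined earlier cycles through P, R and Q instead of ordering them, and the total around the cycle is larger in one direction than in the other.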

8. Sep 8, 2015

### stlukits

Fascinating, mfb! That should be another problem for the Kullback-Leibler divergence as a measure of dissimilarity between probability distributions. Violation of this kind of transitivity is even harder to square with epistemic intuitions we have about probabilities and updating them than the non-trivial asymmetry patterns that I pointed out.