# [KL-divergence] comparison of pdf's

1. Jul 23, 2011

### pawelch

Hi all,
I am trying to devise a mathematical model for my project I am working at. Description is as follows:
we have a sample space
$$\Omega=\{w_1,w_2,\cdots, w_N\}$$
It is very large. Suppose further, that we have some assumption of frequency of occurrence of each $w_i$ , stored in probability vector $\pi$ .

In general, suppose we observe occurrences of $w_i$ , stored as a sequence $\{S\}$ . Now, we would like to compare $\{S\}$ against $\pi$ .
One of the solution would be application of Kullback–Leibler divergence. However, the problem is that $|S| < |\Omega|$ (by $||$ I mean cardinality) and as a result, we will observe that for some $w_i \in \Omega$ , we have corresponding $w_s \in S$ that are 0 (it stems from the fact that $\{S\}$ has not managed to explore $\Omega$ throughout). In this case we have undefined element $0\cdot \log \frac{0}{s_i} = \infty$ .
In principal, I generate multiple $\{S\}$ that are of two types, say, $\{1,0\}$ . The underlying concept for my project is that $\{S\}_1$ of the first type will hit $w_i$ that posses high frequency in $\Omega$ , on the contrary $\{S\}_2$ of the second type would generate $w_i$ that have low frequency. And, it is unlikely that $|S| = |\Omega|$
Thus, again, I thought I would compare $\{S\}_1$ against $\pi$ and then $\{S\}_2$ against $\pi$ and observe the differences. But, because of the assumption of $0\cdot \log \frac{0}{s_i} = \infty$ , I could not get it right. Thus, I thought that maybe I "normalise" (i.e. shrink) $\Omega$, so that the new $\Omega$ contains only elements that have occurred in $\{S\}$. but I have been told it is not a good idea either.

So hmm.. well, the question is how should I compare both types $\{S\}_1$ and $\{S\}_2$ against $\pi$ if their are of different length ?

Thank you for any suggestions, and accept my apology for poor mathematical language of this description,

cheers!

2. Jul 23, 2011

### pmsrw3

In applications like this (entropy, information theory, etc), 0 log 0 = 0.

Does that help?

3. Jul 23, 2011

### pawelch

Hi pmsrw3,

That is what I thought as well. However, we think of $P$ as a "true" distribution, in my case that would be $\pi$. According to people form http://www-nlp.stanford.edu/fsnlp/mathfound/fsnlp-slides-kl.pdf" [Broken] when there is 0 occurrence, then it is infinity (and I trust them because I have seen it many times before). and if two sequences $\{S\}_1$ and $\{S\}_2$ result in infinity, then it is hard to compare them against each other and draw a conclusion.

Please correct me if my thinking is wrong.
Thank you for your help and time

Last edited by a moderator: May 5, 2017
4. Jul 23, 2011

### pmsrw3

According to the PDF at that link, 0 log (0/q) = 0. It's on the first page, just below eq 5.

Maybe the problem is that you haven't actually defined si, which appears in your original post.

What is it that you have "seen many times before"?

Last edited by a moderator: May 5, 2017
5. Jul 23, 2011

### pawelch

Maybe I have not defined it well, maybe your are right. However, if you have a look at the definition I have posted, there is $\pi$ that corresponds to some expected frequency - and its components are positive. Also $\pi$ is our true distribution, thus in this sense it is reasonable that we observe cases when $\pi_i \log \frac{\pi_i}{0} = \infty$.

I have seen many times before the conclusion that it is infinity.

6. Jul 23, 2011

### pmsrw3

Well, I confess I'm sailing a little blind here, because I'm not really familiar with KL divergence. But it seems clear to me that, when comparing a theoretical (what you call "true") distribution against a sampled distribution, you should be using $\log\frac{s}{\pi}$, not $\log\frac{\pi}{s}$. It is perfectly reasonable to get $\infty$ when $\pi=0$. That would mean you had seen something happen that your theory says should never happen. That observation alone is enough to throw out the theory. But seeing 0 in your sample when theory predicts it can happen is OK, and should not lead to an infinite divergence.

Once again, I have to ask for clarification. You have seen many times that WHAT is infinity? I agree that infinities can emerge from a calculation of KL divergence. But both the reference you cited and the Wikipedia page agree that 0 log (0/q) should be interpreted as 0.

7. Jul 23, 2011

### pawelch

Hmm... maybe you are right here, I am not saying you are not because I am not an expert for sure.. According to wikipedia

Does it mean that my expected ${\pi}$ I have calculated from past data, is actually $Q$, and as you are saying ${\pi}\log\frac{0}{\pi}$? If that is the case, then you are right. However, I might have been lead into confusion, by interpreting "true distribution of data,or a precisely calculated theoretical distribution" as my ${\pi}$ .

Thanks again for bearing with me :)

8. Jul 23, 2011

### pmsrw3

Umm. Well, if $\pi$ is also an empirical frequency from sampling a binomial (or multinomial) distribution, then yes, you have a problem. In that case, I wonder if KL divergence is the tool to use. Is there some particular reason you chose it? There's a pretty simple information theoretic measure of the difference between two sampled distributions. It's called (at least sometimes) the G-test of independence. You can find a basic explanation http://en.wikipedia.org/wiki/G-test" [Broken]. You'd have to adapt it to your case, but that is straightforward.

Last edited by a moderator: May 5, 2017
9. Jul 23, 2011

### pawelch

Ok let's me put it this way. suppose there is a transition matrix $A$, that defines probability of transition between states. That matrix has special properties, and in my model after some simplifications, is actualy an ergodic Markov chain, having $\pi$ as the limiting probability vector. Thus in general the expected frequency of each state $s_i$ should correspond to $\pi_i$, as long as the states are visited according to $A$. Of course - to some extent.
Because, as I mentioned, we expect to observe $\{S\}_1$ to hit high frequencies more frequently, in general the difference between $\{S\}_1$ and $\pi$ should be small. On the contrary, $\{S\}_2$ generates low frequencies more frequently, thus deviates from $\pi$ substantialy. One of the way to compare that deviation is the application of KL divergence.
Thus I think that $\pi$ is clearly our original distribution we use in comparison.

Thank you for your suggestions and help

10. Jul 23, 2011

### pmsrw3

OK. Well, since A is ergodic, it will predict a finite probability of visiting every state. So if you use the pi's as denominators in your KL divergence computation, you'll never get an infinite answer. Problem solved.

What are {S}1 and {S}2? You never defined either one. I'm guessing that you think {S}1 is a time series evolving according to A, so it should (after sufficient time to reach SS) have frequencies that match the pi's, if you let it go on long enough. And then {S}2 must be something different, but I have no clue what.

11. Jul 23, 2011

### pawelch

Hmm could you please tell me why it is so?

You might think of them as time series that generate states according to A. The only difference between them is the number of occurence of low-frequency or high-frequency events from $\Omega$. Because both sequences can generate arbitrary numbers, I would like to see the difference between {S}1 and {S}2 in reference to $\pi$. The problem is that in general A is huge and I do not have access to "original" previous data regarding hitting of any state. but it seems that the ergodic assumption solves the problem of the expected frequency, by providing $\pi$ ... if if holds..

12. Jul 23, 2011

### pmsrw3

Well, that's essentially the definition of ergodic. Every state is reachable from every other state. So the SS is going to have a finite probability for every state.

This is still not making sense to me. If both of them generate states according to A, and if A is ergodic, then both of them should have exactly the same behavior at steady-state.

13. Jul 23, 2011

### pawelch

You are right here again, I put it in wrong words.
In general $\{S\}_2$ are agents who made mistakes and errors and I would like to see when they make errors (I do not care about state order,but only about the deviatiom from $\pi$, because as long as they make errors, I need to fix them). Normaly, agents should perform a random walk on a graph according to $A$. The only information I get is the sequence of states. If it is close to $\pi$ I am happy, because I presume that the correct walk shoud be (and is) according to $A$. Because the only information I get are states, then the only possible distinction is difference between two frequencies. am I right here.. ? That is why I am after some method of pdfs' comparison..
Also, the important fact here is that I need to know as soon as possible which sequence I am likely to observe (either $\{S\}_2$ or $\{S\}_1$) because, I am allowed to observe them for a limited number of changes, say 50 and the number of possible states is comparingly larger say 10,000, they will visit only limited number of states out of all range. What follows, is the problem that they will have zero frequency at components $s_i$ in reference to $\pi_i$

Thank you for helping me here..

14. Jul 23, 2011

### pmsrw3

Actually, I would recommend an entirely different method. The answer to this question: "Because the only information I get are states, then the only possible distinction is difference between two frequencies. am I right here.. ?" is "No". By looking only at the frequency of states and ignoring the sequence, you're throwing out a great deal of information. Using the information embedded in the sequence would allow you to do what you say you need to do: figure out whether you have sequence 1 or sequence 2 as quickly as possible.

There are established, very effective methods for doing exactly this. They go by the name of Hidden Markov Models (HMMs). You fit a state sequence to a Markov model and get a probability that the model can account for the behavior you see. Furthermore, you can get this information for every time point.

The "hidden" in "Hidden Markov Model" refers to the fact that one generally can't see the state -- you can only observe something that is imperfectly correlated with it. In your case you apparently CAN observe state with perfect accuracy, so you don't need the H of HMMs. But that wouldn't invalidate the models -- it would just reduce them to a simpler special case.

HMMs are used widely and there are algorithms and code out there. My main source for HMM coding is section 16.3 of Numerical Recipes, 3rd ed.

15. Jul 25, 2011

### pawelch

I have been told that HMM might be of use here.. However, I find it hard to come up with $b_j(k)$ for my model.
and I'll just hang around with a thought that higher education is for chosen ones :)

16. Jul 25, 2011

### pmsrw3

That's what I was getting at with my comments about your model not using the H of HMM. The answer in your case is trivial. You have a whole lot of states, which are the j's. You (unlike most folks doing HMM) know what state you're in. So the emissions of your model are exactly the same as the states -- the k's are the same as the j's. And the probability that state j will emit symbol k is 1 if j and k are the same, 0 otherwise: $b_j(k)=\delta_{jk}$.

17. Jul 26, 2011

### pawelch

Following what you suggested I was trying to translate my problem into HMM. Suppose, that I ran multiple times (say 1,000,000) agents of type $\{S\}_1$ (these are the clever ones) on $A$, in order to estimate the transition matrix based on their behavior $A_{est}$. Obviously, they do not follow $A$ exactly, and sometimes they make mistakes. those mistakes are minor, yet still occur. That is why $A_{est}$ is slightly different than $A$.
Now, suppose that at each state$k$, we have a subset of possible states $Q_k$, from which an agent can pick a state, and move forward. Suppose further that $\{S\}_2$ have not learned $A_{est}$, thus if they are at $k$ state, then they pick any further state $j$ from a set $Q_{err}=Q_k \cup Q_i$, where $Q_i$ is made up from an arbitrary number of inaccessible states. As a result, I can come up with another transition matrix $A_{err}$.
Therefore, if I observe a sequence of states $\{S\}_u$, generated by an unknown agent, I want to find out whether it is more probable that it has been generated by $\{S\}_1$ or $\{S\}_2$. Thus, I want to calculate whether: $P(S_u,A_{est}|A_{est},\pi) \leq P(S_u,A_{est}|A_{err},\pi)$ or $P(S_u,A_{est}|A_{est},\pi) > P(S_u,A_{est}|A_{err},\pi)$. In words, given $A_{est}$ and $\{S\}_u$ I want to discover whether it has been generated using $A_{est}$ or $A_{err}$.
Does it make any sense..?

Thank you again for helping me out here..

18. Jul 26, 2011

### pmsrw3

"Obviously" they don't follow A exactly? Why is that obvious?

Well, never mind. I wouldn't think you'd have to get Aest and Aerr from simulation. You ought to be able to figure them exactly. You know A, and you know the kinds of errors they can make (at least in the simulation) and you know the probability of error. This ought to be enough to derive the exact stochastic matrices Aest and Aerr. That would be useful since you have, as you've emphasized before, an extremely high-dimensional state space and you can't really completely fill it out by simulation.

Looks right, but I don't entirely understand your notation. What is $P(S_u,A_{est}|A_{err},\pi)$? What's the Aest doing there and the pi?

19. Jul 26, 2011

### pawelch

I always have problems to put it in right words.. However, what I meant was, since we know Aest and Aerr we want to find out whether an observed sequence Su, has been generated by using either Aest or Aerr.
We can put it this way $P(S_u,S_{observable}|A_{err},\pi)$ is a marginal pmf of an unobserved sequence, and observable one, given that it has been generated from a $A_{err}$ matrix, and $P(S_u,S_{observable}|A_{est},\pi)$ is the same for $A_{est}$ matrix. Therefore, I actually want to compare $P(S_u,S_{observable}|A_{err},B_{err},\pi)$ and $P(S_u,S_{observable}|A_{est},B_{est},\pi)$
I might be making mistake again here, but for a HMM, and my problem, if we want to calculate $P(S_u,S_{observable}|A_{err},B_{err},\pi)$we would describe the emission matrix as $A_{est}$, thus $B_{err}=A_{est}$. However, for $P(S_u,S_{observable}|A_{est},B_{est},\pi)$, we have $A_{est}=B_{est}$ and $P(S_{observable}|A_{est},\pi) = \pi_{s_1}\Pi a_{si,sj}$, because I assume it has been produced from the $A_{est}$. Thus, effectively I want to compare: $P(S_u,S_{observable}|A_{err},A_{est},\pi)$ against $P(S_{observable}|A_{est},\pi)$. Does it make any sense?

Thank you

20. Jul 26, 2011

### pmsrw3

Oh, I see (mostly). I had completely forgotten that we started with HMMs and non-trivial emission probabilities. I would say so just want the probability of getting state sequence Su if the transition matrix was Aerr, which I would call $P(S_u | A_\mbox{err})$, or maybe $P_{A_\mbox{err}}(S_u)$. But these are just trivial issues of notation. I think we're talking about the same thing.