[KL-divergence] comparison of pdf's

pawelch · Jul 23, 2011

Hi all,
I am trying to devise a mathematical model for my project I am working at. Description is as follows:
we have a sample space
<br /> \Omega=\{w_1,w_2,\cdots, w_N\}<br />
It is very large. Suppose further, that we have some assumption of frequency of occurrence of each w_i , stored in probability vector \pi .

In general, suppose we observe occurrences of w_i , stored as a sequence \{S\} . Now, we would like to compare \{S\} against \pi .
One of the solution would be application of Kullback–Leibler divergence. However, the problem is that |S| < |\Omega| (by || I mean cardinality) and as a result, we will observe that for some w_i \in \Omega , we have corresponding w_s \in S that are 0 (it stems from the fact that \{S\} has not managed to explore \Omega throughout). In this case we have undefined element 0\cdot \log \frac{0}{s_i} = \infty .
In principal, I generate multiple \{S\} that are of two types, say, \{1,0\} . The underlying concept for my project is that \{S\}_1 of the first type will hit w_i that posses high frequency in \Omega , on the contrary \{S\}_2 of the second type would generate w_i that have low frequency. And, it is unlikely that |S| = |\Omega|
Thus, again, I thought I would compare \{S\}_1 against \pi and then \{S\}_2 against \pi and observe the differences. But, because of the assumption of 0\cdot \log \frac{0}{s_i} = \infty , I could not get it right. Thus, I thought that maybe I "normalise" (i.e. shrink) \Omega, so that the new \Omega contains only elements that have occurred in \{S\}. but I have been told it is not a good idea either.

So hmm.. well, the question is how should I compare both types \{S\}_1 and \{S\}_2 against \pi if their are of different length ?

Thank you for any suggestions, and accept my apology for poor mathematical language of this description,

cheers!

pmsrw3 · Jul 23, 2011

pawelch said:

0\cdot \log \frac{0}{s_i} = \infty

In applications like this (entropy, information theory, etc), 0 log 0 = 0.

Does that help?

pawelch · Jul 23, 2011

Hi pmsrw3,
Thank you for your quick answer

pmsrw3 said:

In applications like this (entropy, information theory, etc), 0 log 0 = 0.

Does that help?

That is what I thought as well. However, we think of P as a "true" distribution, in my case that would be \pi. According to people form http://www-nlp.stanford.edu/fsnlp/mathfound/fsnlp-slides-kl.pdf" when there is 0 occurrence, then it is infinity (and I trust them because I have seen it many times before). and if two sequences \{S\}_1 and \{S\}_2 result in infinity, then it is hard to compare them against each other and draw a conclusion.

Please correct me if my thinking is wrong.
Thank you for your help and time

pmsrw3 · Jul 23, 2011

pawelch said:

That is what I thought as well. However, we think of P as a "true" distribution, in my case that would be \pi. According to people form http://www-nlp.stanford.edu/fsnlp/mathfound/fsnlp-slides-kl.pdf" when there is 0 occurrence, then it is infinity

According to the PDF at that link, 0 log (0/q) = 0. It's on the first page, just below eq 5.

Maybe the problem is that you haven't actually defined s_i, which appears in your original post.

and I trust them because I have seen it many times before.

What is it that you have "seen many times before"?

pawelch · Jul 23, 2011

pmsrw3 said:

According to the PDF at that link, 0 log (0/q) = 0. It's on the first page, just below eq 5.Maybe the problem is that you haven't actually defined s_i, which appears in your original post.

Maybe I have not defined it well, maybe your are right. However, if you have a look at the definition I have posted, there is \pi that corresponds to some expected frequency - and its components are positive. Also \pi is our true distribution, thus in this sense it is reasonable that we observe cases when \pi_i \log \frac{\pi_i}{0} = \infty.

pmsrw3 said:

What is it that you have "seen many times before"?

I have seen many times before the conclusion that it is infinity.

Thank you for your answer.

pmsrw3 · Jul 23, 2011

Well, I confess I'm sailing a little blind here, because I'm not really familiar with KL divergence. But it seems clear to me that, when comparing a theoretical (what you call "true") distribution against a sampled distribution, you should be using \log\frac{s}{\pi}, not \log\frac{\pi}{s}. It is perfectly reasonable to get \infty when \pi=0. That would mean you had seen something happen that your theory says should never happen. That observation alone is enough to throw out the theory. But seeing 0 in your sample when theory predicts it can happen is OK, and should not lead to an infinite divergence.

I have seen many times before the conclusion that it is infinity.

Once again, I have to ask for clarification. You have seen many times that WHAT is infinity? I agree that infinities can emerge from a calculation of KL divergence. But both the reference you cited and the Wikipedia page agree that 0 log (0/q) should be interpreted as 0.

pawelch · Jul 23, 2011

pmsrw3 said:

Well, I confess I'm sailing a little blind here, because I'm not really familiar with KL divergence. But it seems clear to me that, when comparing a theoretical (what you call "true") distribution against a sampled distribution, you should be using \log\frac{s}{\pi}, not \log\frac{\pi}{s}.

Hmm... maybe you are right here, I am not saying you are not because I am not an expert for sure.. According to wikipedia

Typically P represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution. The measure Q typically represents a theory, model, description, or approximation of P.

Does it mean that my expected {\pi} I have calculated from past data, is actually Q, and as you are saying {\pi}\log\frac{0}{\pi}? If that is the case, then you are right. However, I might have been lead into confusion, by interpreting "true distribution of data,or a precisely calculated theoretical distribution" as my {\pi} .

Thanks again for bearing with me :)

pmsrw3 · Jul 23, 2011

pawelch said:

Does it mean that my expected {\pi} I have calculated from past data, is actually Q, and as you are saying {\pi}\log\frac{0}{\pi}? If that is the case, then you are right. However, I might have been lead into confusion, by interpreting "true distribution of data,or a precisely calculated theoretical distribution" as my {\pi} .

Umm. Well, if \pi is also an empirical frequency from sampling a binomial (or multinomial) distribution, then yes, you have a problem. In that case, I wonder if KL divergence is the tool to use. Is there some particular reason you chose it? There's a pretty simple information theoretic measure of the difference between two sampled distributions. It's called (at least sometimes) the G-test of independence. You can find a basic explanation http://en.wikipedia.org/wiki/G-test" . You'd have to adapt it to your case, but that is straightforward.

pawelch · Jul 23, 2011

pmsrw3 said:

Umm. Well, if \pi is also an empirical frequency from sampling a binomial (or multinomial) distribution, then yes, you have a problem.

Ok let's me put it this way. suppose there is a transition matrix A, that defines probability of transition between states. That matrix has special properties, and in my model after some simplifications, is actualy an ergodic Markov chain, having \pi as the limiting probability vector. Thus in general the expected frequency of each state s_i should correspond to \pi_i, as long as the states are visited according to A. Of course - to some extent.
Because, as I mentioned, we expect to observe \{S\}_1 to hit high frequencies more frequently, in general the difference between \{S\}_1 and \pi should be small. On the contrary, \{S\}_2 generates low frequencies more frequently, thus deviates from \pi substantialy. One of the way to compare that deviation is the application of KL divergence.
Thus I think that \pi is clearly our original distribution we use in comparison.

Thank you for your suggestions and help

pmsrw3 · Jul 23, 2011

OK. Well, since A is ergodic, it will predict a finite probability of visiting every state. So if you use the pi's as denominators in your KL divergence computation, you'll never get an infinite answer. Problem solved.

What are {S}₁ and {S}₂? You never defined either one. I'm guessing that you think {S}₁ is a time series evolving according to A, so it should (after sufficient time to reach SS) have frequencies that match the pi's, if you let it go on long enough. And then {S}₂ must be something different, but I have no clue what.

pawelch · Jul 23, 2011

pmsrw3 said:

So if you use the pi's as denominators in your KL divergence computation, you'll never get an infinite answer. Problem solved.

Hmm could you please tell me why it is so?

pmsrw3 said:

What are {S}₁ and {S}₂? {...}{S}₁ is a time series evolving according to A, so it should (after sufficient time to reach SS) have frequencies that match the pi's, if you let it go on long enough. And then {S}₂ must be something different, but I have no clue what.

You might think of them as time series that generate states according to A. The only difference between them is the number of occurence of low-frequency or high-frequency events from \Omega. Because both sequences can generate arbitrary numbers, I would like to see the difference between {S}₁ and {S}₂ in reference to \pi. The problem is that in general A is huge and I do not have access to "original" previous data regarding hitting of any state. but it seems that the ergodic assumption solves the problem of the expected frequency, by providing \pi ... if if holds..

pmsrw3 · Jul 23, 2011

pawelch said:

Hmm could you please tell me why it is so?

Well, that's essentially the definition of ergodic. Every state is reachable from every other state. So the SS is going to have a finite probability for every state.

You might think of them as time series that generate states according to A. The only difference between them is the number of occurence of low-frequency or high-frequency events from \Omega. Because both sequences can generate arbitrary numbers, I would like to see the difference between {S}₁ and {S}₂ in reference to \pi. The problem is that in general A is huge and I do not have access to "original" previous data regarding hitting of any state. but it seems that the ergodic assumption solves the problem of the expected frequency, by providing \pi ... if if holds..

This is still not making sense to me. If both of them generate states according to A, and if A is ergodic, then both of them should have exactly the same behavior at steady-state.

pawelch · Jul 23, 2011

pmsrw3 said:

This is still not making sense to me. If both of them generate states according to A, and if A is ergodic, then both of them should have exactly the same behavior at steady-state.

You are right here again, I put it in wrong words.
In general \{S\}_2 are agents who made mistakes and errors and I would like to see when they make errors (I do not care about state order,but only about the deviatiom from \pi, because as long as they make errors, I need to fix them). Normaly, agents should perform a random walk on a graph according to A. The only information I get is the sequence of states. If it is close to \pi I am happy, because I presume that the correct walk shoud be (and is) according to A. Because the only information I get are states, then the only possible distinction is difference between two frequencies. am I right here.. ? That is why I am after some method of pdfs' comparison..
Also, the important fact here is that I need to know as soon as possible which sequence I am likely to observe (either \{S\}_2 or \{S\}_1) because, I am allowed to observe them for a limited number of changes, say 50 and the number of possible states is comparingly larger say 10,000, they will visit only limited number of states out of all range. What follows, is the problem that they will have zero frequency at components s_i in reference to \pi_i

Thank you for helping me here..

pmsrw3 · Jul 23, 2011

pawelch said:

In general \{S\}_2 are agents who made mistakes and errors and I would like to see when they make errors (I do not care about state order,but only about the deviatiom from \pi, because as long as they make errors, I need to fix them). Normaly, agents should perform a random walk on a graph according to A. The only information I get is the sequence of states. If it is close to \pi I am happy, because I presume that the correct walk shoud be (and is) according to A. Because the only information I get are states, then the only possible distinction is difference between two frequencies. am I right here.. ? That is why I am after some method of pdfs' comparison..
Also, the important fact here is that I need to know as soon as possible which sequence I am likely to observe (either \{S\}_2 or \{S\}_1) because, I am allowed to observe them for a limited number of changes, say 50 and the number of possible states is comparingly larger say 10,000, they will visit only limited number of states out of all range. What follows, is the problem that they will have zero frequency at components s_i in reference to \pi_i

Actually, I would recommend an entirely different method. The answer to this question: "Because the only information I get are states, then the only possible distinction is difference between two frequencies. am I right here.. ?" is "No". By looking only at the frequency of states and ignoring the sequence, you're throwing out a great deal of information. Using the information embedded in the sequence would allow you to do what you say you need to do: figure out whether you have sequence 1 or sequence 2 as quickly as possible.

There are established, very effective methods for doing exactly this. They go by the name of Hidden Markov Models (HMMs). You fit a state sequence to a Markov model and get a probability that the model can account for the behavior you see. Furthermore, you can get this information for every time point.

The "hidden" in "Hidden Markov Model" refers to the fact that one generally can't see the state -- you can only observe something that is imperfectly correlated with it. In your case you apparently CAN observe state with perfect accuracy, so you don't need the H of HMMs. But that wouldn't invalidate the models -- it would just reduce them to a simpler special case.

HMMs are used widely and there are algorithms and code out there. My main source for HMM coding is section 16.3 of Numerical Recipes, 3rd ed.

pawelch · Jul 25, 2011

I have been told that HMM might be of use here.. However, I find it hard to come up with b_j(k) for my model.
However, thank you for your commitment. your remarks were really helpful throughout!
and I'll just hang around with a thought that higher education is for chosen ones :)

pmsrw3 · Jul 25, 2011

pawelch said:

I have been told that HMM might be of use here.. However, I find it hard to come up with b_j(k) for my model.

That's what I was getting at with my comments about your model not using the H of HMM. The answer in your case is trivial. You have a whole lot of states, which are the j's. You (unlike most folks doing HMM) know what state you're in. So the emissions of your model are exactly the same as the states -- the k's are the same as the j's. And the probability that state j will emit symbol k is 1 if j and k are the same, 0 otherwise: b_j(k)=\delta_{jk}.

pawelch · Jul 26, 2011

pmsrw3 said:

You have a whole lot of states, which are the j's. You (unlike most folks doing HMM) know what state you're in. So the emissions of your model are exactly the same as the states -- the k's are the same as the j's. And the probability that state j will emit symbol k is 1 if j and k are the same, 0 otherwise: b_j(k)=\delta_{jk}.

Following what you suggested I was trying to translate my problem into HMM. Suppose, that I ran multiple times (say 1,000,000) agents of type \{S\}_1 (these are the clever ones) on A, in order to estimate the transition matrix based on their behavior A_{est}. Obviously, they do not follow A exactly, and sometimes they make mistakes. those mistakes are minor, yet still occur. That is why A_{est} is slightly different than A.
Now, suppose that at each statek, we have a subset of possible states Q_k, from which an agent can pick a state, and move forward. Suppose further that \{S\}_2 have not learned A_{est}, thus if they are at k state, then they pick any further state j from a set Q_{err}=Q_k \cup Q_i, where Q_i is made up from an arbitrary number of inaccessible states. As a result, I can come up with another transition matrix A_{err}.
Therefore, if I observe a sequence of states \{S\}_u, generated by an unknown agent, I want to find out whether it is more probable that it has been generated by \{S\}_1 or \{S\}_2. Thus, I want to calculate whether: P(S_u,A_{est}|A_{est},\pi) \leq P(S_u,A_{est}|A_{err},\pi) or P(S_u,A_{est}|A_{est},\pi) > P(S_u,A_{est}|A_{err},\pi). In words, given A_{est} and \{S\}_u I want to discover whether it has been generated using A_{est} or A_{err}.
Does it make any sense..?

Thank you again for helping me out here..

pmsrw3 · Jul 26, 2011

pawelch said:

Following what you suggested I was trying to translate my problem into HMM. Suppose, that I ran multiple times (say 1,000,000) agents of type \{S\}_1 (these are the clever ones) on A, in order to estimate the transition matrix based on their behavior A_{est}. Obviously, they do not follow A exactly, and sometimes they make mistakes. those mistakes are minor, yet still occur. That is why A_{est} is slightly different than A.
Now, suppose that at each statek, we have a subset of possible states Q_k, from which an agent can pick a state, and move forward. Suppose further that \{S\}_2 have not learned A_{est}, thus if they are at k state, then they pick any further state j from a set Q_{err}=Q_k \cup Q_i, where Q_i is made up from an arbitrary number of inaccessible states. As a result, I can come up with another transition matrix A_{err}.

"Obviously" they don't follow A exactly? Why is that obvious?

Well, never mind. I wouldn't think you'd have to get A_est and A_err from simulation. You ought to be able to figure them exactly. You know A, and you know the kinds of errors they can make (at least in the simulation) and you know the probability of error. This ought to be enough to derive the exact stochastic matrices A_est and A_err. That would be useful since you have, as you've emphasized before, an extremely high-dimensional state space and you can't really completely fill it out by simulation.

Therefore, if I observe a sequence of states \{S\}_u, generated by an unknown agent, I want to find out whether it is more probable that it has been generated by \{S\}_1 or \{S\}_2. Thus, I want to calculate whether: P(S_u,A_{est}|A_{est},\pi) \leq P(S_u,A_{est}|A_{err},\pi) or P(S_u,A_{est}|A_{est},\pi) > P(S_u,A_{est}|A_{err},\pi). In words, given A_{est} and \{S\}_u I want to discover whether it has been generated using A_{est} or A_{err}.
Does it make any sense..?

Looks right, but I don't entirely understand your notation. What is P(S_u,A_{est}|A_{err},\pi)? What's the A_est doing there and the pi?

pawelch · Jul 26, 2011

pmsrw3 said:

Looks right, but I don't entirely understand your notation. What is P(S_u,A_{est}|A_{err},\pi)? What's the A_est doing there and the pi?

I always have problems to put it in right words.. However, what I meant was, since we know A_est and A_err we want to find out whether an observed sequence S_u, has been generated by using either A_est or A_err.
We can put it this way P(S_u,S_{observable}|A_{err},\pi) is a marginal pmf of an unobserved sequence, and observable one, given that it has been generated from a A_{err} matrix, and P(S_u,S_{observable}|A_{est},\pi) is the same for A_{est} matrix. Therefore, I actually want to compare P(S_u,S_{observable}|A_{err},B_{err},\pi) and P(S_u,S_{observable}|A_{est},B_{est},\pi)
I might be making mistake again here, but for a HMM, and my problem, if we want to calculate P(S_u,S_{observable}|A_{err},B_{err},\pi)we would describe the emission matrix as A_{est}, thus B_{err}=A_{est}. However, for P(S_u,S_{observable}|A_{est},B_{est},\pi), we have A_{est}=B_{est} and P(S_{observable}|A_{est},\pi) = \pi_{s_1}\Pi a_{si,sj}, because I assume it has been produced from the A_{est}. Thus, effectively I want to compare: P(S_u,S_{observable}|A_{err},A_{est},\pi) against P(S_{observable}|A_{est},\pi). Does it make any sense?

Thank you

pmsrw3 · Jul 26, 2011

Oh, I see (mostly). I had completely forgotten that we started with HMMs and non-trivial emission probabilities. I would say so just want the probability of getting state sequence S_u if the transition matrix was A_err, which I would call P(S_u | A_\mbox{err}), or maybe P_{A_\mbox{err}}(S_u). But these are just trivial issues of notation. I think we're talking about the same thing.

[KL-divergence] comparison of pdf's

Similar threads

Hot Threads

B A Little Probability Puzzle

I Need help solving this Existence Algorithm for truth

A Does this computation satisfy LTL formulas?

A Prove that points which are indistinguishable from 0 exist (using logic)

A Mathematical Connection between Cosmic Expansion and Exponential Growth

Recent Insights

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers

Insights Fermat's Last Theorem

Insights Why Vector Spaces Explain The World: A Historical Perspective