[KL-divergence] comparison of pdf's

In summary, the problem is that an observed sequence is much shorter than the sample space is large, so some outcomes never occur in it; their empirical frequencies are zero, and a naive Kullback–Leibler divergence computation then returns infinity.
  • #1
pawelch
Hi all,
I am trying to devise a mathematical model for a project I am working on. The description is as follows:
we have a sample space
[tex]
\Omega=\{w_1,w_2,\cdots, w_N\}
[/tex]
It is very large. Suppose further that we have some assumed frequency of occurrence for each [itex]w_i[/itex], stored in a probability vector [itex]\pi[/itex].

In general, suppose we observe occurrences of the [itex]w_i[/itex], stored as a sequence [itex]\{S\}[/itex]. Now we would like to compare [itex]\{S\}[/itex] against [itex]\pi[/itex].
One solution would be to apply the Kullback–Leibler divergence. However, the problem is that [itex]|S| < |\Omega|[/itex] (by [itex]|\cdot|[/itex] I mean cardinality), and as a result we will observe that for some [itex]w_i \in \Omega[/itex] the corresponding [itex]w_s \in S[/itex] are 0 (this stems from the fact that [itex]\{S\}[/itex] has not managed to explore all of [itex]\Omega[/itex]). In this case we have the undefined element [itex]0\cdot \log \frac{0}{s_i} = \infty[/itex].
In principle, I generate multiple sequences [itex]\{S\}[/itex] of two types, labelled 1 and 2. The underlying idea of my project is that sequences [itex]\{S\}_1[/itex] of the first type will hit the [itex]w_i[/itex] that possess high frequency in [itex]\Omega[/itex], whereas sequences [itex]\{S\}_2[/itex] of the second type will generate [itex]w_i[/itex] that have low frequency. And it is unlikely that [itex]|S| = |\Omega|[/itex].
Thus, again, I thought I would compare [itex]\{S\}_1[/itex] against [itex]\pi[/itex] and then [itex]\{S\}_2[/itex] against [itex]\pi[/itex] and observe the differences. But because of the element [itex]0\cdot \log \frac{0}{s_i} = \infty[/itex], I could not get it right. I then thought that maybe I should "normalise" (i.e. shrink) [itex]\Omega[/itex], so that the new [itex]\Omega[/itex] contains only the elements that have occurred in [itex]\{S\}[/itex], but I have been told that is not a good idea either.
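To make the failure concrete, here is a minimal numerical sketch (Python; the vector and the counts are made up for illustration, not my real data):

[code=python]
import numpy as np

# Hypothetical "true" frequencies over a small sample space
pi = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

# Counts from a short observed sequence S: the last two outcomes
# were never visited, so their empirical frequencies are 0
counts = np.array([5, 3, 2, 0, 0])
freq = counts / counts.sum()

# Naive KL(pi || freq): terms with zero frequency involve log(pi_i / 0)
terms = pi * np.log(pi / freq)  # emits a divide-by-zero warning
print(terms)                    # last two terms are inf
print(terms.sum())              # inf: the whole divergence blows up
[/code]

As soon as a single outcome is unseen, the whole sum is infinite, no matter how well the rest of the sample matches [itex]\pi[/itex].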

So, hmm... well, the question is: how should I compare both types [itex]\{S\}_1[/itex] and [itex]\{S\}_2[/itex] against [itex]\pi[/itex] if they are of different lengths?

Thank you for any suggestions, and please accept my apology for the poor mathematical language of this description.

cheers!
 
  • #2
pawelch said:
[itex] 0\cdot \log \frac{0}{s_i} = \infty[/itex]
In applications like this (entropy, information theory, etc.), 0 log 0 = 0.

Does that help?
 
  • #3
Hi pmsrw3,
Thank you for your quick answer.

pmsrw3 said:
In applications like this (entropy, information theory, etc.), 0 log 0 = 0.

Does that help?

That is what I thought as well. However, we think of [itex]P[/itex] as the "true" distribution; in my case that would be [itex]\pi[/itex]. According to these Stanford slides (http://www-nlp.stanford.edu/fsnlp/mathfound/fsnlp-slides-kl.pdf), when there is a 0 occurrence, the result is infinity (and I trust them because I have seen this many times before). And if two sequences [itex]\{S\}_1[/itex] and [itex]\{S\}_2[/itex] both result in infinity, then it is hard to compare them against each other and draw a conclusion.

Please correct me if my thinking is wrong.
Thank you for your help and time
 
  • #4
pawelch said:
That is what I thought as well. However, we think of [itex]P[/itex] as the "true" distribution; in my case that would be [itex]\pi[/itex]. According to these Stanford slides (http://www-nlp.stanford.edu/fsnlp/mathfound/fsnlp-slides-kl.pdf), when there is a 0 occurrence, the result is infinity
According to the PDF at that link, 0 log (0/q) = 0. It's on the first page, just below eq 5.

Maybe the problem is that you haven't actually defined [itex]s_i[/itex], which appears in your original post.

and I trust them because I have seen this many times before.
What is it that you have "seen many times before"?
 
  • #5
pmsrw3 said:
According to the PDF at that link, 0 log (0/q) = 0. It's on the first page, just below eq 5. Maybe the problem is that you haven't actually defined [itex]s_i[/itex], which appears in your original post.
Maybe I have not defined it well; maybe you are right. However, if you have a look at the definition I posted, there is [itex]\pi[/itex], which corresponds to some expected frequency, and its components are positive. Also, [itex]\pi[/itex] is our true distribution, so in this sense it is reasonable that we observe cases where [itex]\pi_i \log \frac{\pi_i}{0} = \infty[/itex].


pmsrw3 said:
What is it that you have "seen many times before"?
I have seen many times before the conclusion that it is infinity.

Thank you for your answer.
 
  • #6
Well, I confess I'm sailing a little blind here, because I'm not really familiar with KL divergence. But it seems clear to me that, when comparing a theoretical (what you call "true") distribution against a sampled distribution, you should be using [itex]\log\frac{s}{\pi}[/itex], not [itex]\log\frac{\pi}{s}[/itex]. It is perfectly reasonable to get [itex]\infty[/itex] when [itex]\pi=0[/itex]. That would mean you had seen something happen that your theory says should never happen. That observation alone is enough to throw out the theory. But seeing 0 in your sample when theory predicts it can happen is OK, and should not lead to an infinite divergence.
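To illustrate with a quick sketch (Python; the numbers are invented):

[code=python]
import numpy as np

pi = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # theory: every entry positive
s  = np.array([0.5, 0.3, 0.2,  0.0, 0.0 ])  # sample: two outcomes unseen

def kl(p, q):
    # KL(p || q) with the standard convention 0 * log(0/q) = 0
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

print(kl(s, pi))  # finite: unseen outcomes contribute 0 by convention
print(kl(pi, s))  # inf: pi_i > 0 where s_i = 0
[/code]

Zeros in the sample are harmless in one direction and fatal in the other.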

I have seen many times before the conclusion that it is infinity.
Once again, I have to ask for clarification. You have seen many times that WHAT is infinity? I agree that infinities can emerge from a calculation of KL divergence. But both the reference you cited and the Wikipedia page agree that 0 log (0/q) should be interpreted as 0.
 
  • #7
pmsrw3 said:
Well, I confess I'm sailing a little blind here, because I'm not really familiar with KL divergence. But it seems clear to me that, when comparing a theoretical (what you call "true") distribution against a sampled distribution, you should be using [itex]\log\frac{s}{\pi}[/itex], not [itex]\log\frac{\pi}{s}[/itex].

Hmm... maybe you are right here; I am not saying you are not, because I am certainly not an expert. According to Wikipedia:

Typically P represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution. The measure Q typically represents a theory, model, description, or approximation of P.

Does that mean that my expected [itex]\pi[/itex], which I have calculated from past data, is actually [itex]Q[/itex], and, as you are saying, the zero terms become [itex]0\cdot\log\frac{0}{\pi_i}[/itex]? If that is the case, then you are right. However, I might have been led into confusion by interpreting "true distribution of data, or a precisely calculated theoretical distribution" as my [itex]\pi[/itex].

Thanks again for bearing with me :)
 
  • #8
pawelch said:
Does that mean that my expected [itex]\pi[/itex], which I have calculated from past data, is actually [itex]Q[/itex], and, as you are saying, the zero terms become [itex]0\cdot\log\frac{0}{\pi_i}[/itex]? If that is the case, then you are right. However, I might have been led into confusion by interpreting "true distribution of data, or a precisely calculated theoretical distribution" as my [itex]\pi[/itex].
Umm. Well, if [itex]\pi[/itex] is also an empirical frequency from sampling a binomial (or multinomial) distribution, then yes, you have a problem. In that case, I wonder if KL divergence is the tool to use. Is there some particular reason you chose it? There's a pretty simple information-theoretic measure of the difference between two sampled distributions. It's called (at least sometimes) the G-test of independence. You can find a basic explanation at http://en.wikipedia.org/wiki/G-test. You'd have to adapt it to your case, but that is straightforward.
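Here's a sketch of the goodness-of-fit flavor, which compares observed counts against expected frequencies (Python; the counts are invented, and I'm leaning on scipy's power_divergence, which computes the G statistic when you pass lambda_="log-likelihood"):

[code=python]
import numpy as np
from scipy.stats import power_divergence

pi = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # expected frequencies
observed = np.array([18, 12, 7, 2, 1])      # made-up counts from a sequence
expected = observed.sum() * pi

# G = 2 * sum(O_i * ln(O_i / E_i)), asymptotically chi-squared
g, p_value = power_divergence(observed, f_exp=expected,
                              lambda_="log-likelihood")
print(g, p_value)
[/code]

Note that G is just 2N times the KL divergence of the observed frequencies from the expected ones, so the two ideas are closely related.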
 
  • #9
pmsrw3 said:
Umm. Well, if [itex]\pi[/itex] is also an empirical frequency from sampling a binomial (or multinomial) distribution, then yes, you have a problem.

OK, let me put it this way. Suppose there is a transition matrix [itex]A[/itex] that defines the probability of transitions between states. That matrix has special properties, and in my model, after some simplifications, it is actually an ergodic Markov chain, having [itex]\pi[/itex] as the limiting probability vector. Thus in general the expected frequency of each state [itex]s_i[/itex] should correspond to [itex]\pi_i[/itex], as long as the states are visited according to [itex]A[/itex]. Of course, only to some extent.
Because, as I mentioned, we expect [itex]\{S\}_1[/itex] to hit high-frequency states more often, in general the difference between [itex]\{S\}_1[/itex] and [itex]\pi[/itex] should be small. On the contrary, [itex]\{S\}_2[/itex] generates low-frequency states more often, and thus deviates from [itex]\pi[/itex] substantially. One way to compare that deviation is to apply the KL divergence.
Thus I think that [itex]\pi[/itex] is clearly the original distribution we use in the comparison.

Thank you for your suggestions and help
 
  • #10
OK. Well, since A is ergodic, it predicts a nonzero probability of visiting every state. So if you use the pi's as denominators in your KL divergence computation, you'll never get an infinite answer. Problem solved.

What are [itex]\{S\}_1[/itex] and [itex]\{S\}_2[/itex]? You never defined either one. I'm guessing that you think [itex]\{S\}_1[/itex] is a time series evolving according to A, so it should (after sufficient time to reach steady state) have frequencies that match the pi's, if you let it go on long enough. And then [itex]\{S\}_2[/itex] must be something different, but I have no clue what.
 
  • #11
pmsrw3 said:
So if you use the pi's as denominators in your KL divergence computation, you'll never get an infinite answer. Problem solved.
Hmm, could you please tell me why that is so?

pmsrw3 said:
What are [itex]\{S\}_1[/itex] and [itex]\{S\}_2[/itex]? [...] [itex]\{S\}_1[/itex] is a time series evolving according to A, so it should (after sufficient time to reach steady state) have frequencies that match the pi's, if you let it go on long enough. And then [itex]\{S\}_2[/itex] must be something different, but I have no clue what.

You might think of them as time series that generate states according to A. The only difference between them is the number of occurrences of low-frequency or high-frequency events from [itex]\Omega[/itex]. Because both sequences can generate arbitrary states, I would like to see the difference between [itex]\{S\}_1[/itex] and [itex]\{S\}_2[/itex] with reference to [itex]\pi[/itex]. The problem is that in general A is huge and I do not have access to the "original" previous data regarding hits on any state. But it seems that the ergodic assumption solves the problem of the expected frequency by providing [itex]\pi[/itex]... if it holds.
 
  • #12
pawelch said:
Hmm, could you please tell me why that is so?
Well, that's essentially the definition of ergodic. Every state is reachable from every other state. So the steady state is going to have a nonzero probability for every state.
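For instance (a toy sketch; the 3-state matrix is made up, but any ergodic chain behaves the same way):

[code=python]
import numpy as np

# Toy ergodic transition matrix: rows sum to 1, every state reachable
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

# Stationary distribution = left eigenvector of A with eigenvalue 1
w, v = np.linalg.eig(A.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()
print(pi)  # every component strictly positive, so no zero denominators
[/code]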

You might think of them as time series that generate states according to A. The only difference between them is the number of occurrences of low-frequency or high-frequency events from [itex]\Omega[/itex]. Because both sequences can generate arbitrary states, I would like to see the difference between [itex]\{S\}_1[/itex] and [itex]\{S\}_2[/itex] with reference to [itex]\pi[/itex]. The problem is that in general A is huge and I do not have access to the "original" previous data regarding hits on any state. But it seems that the ergodic assumption solves the problem of the expected frequency by providing [itex]\pi[/itex]... if it holds.
This is still not making sense to me. If both of them generate states according to A, and if A is ergodic, then both of them should have exactly the same behavior at steady-state.
 
  • #13
pmsrw3 said:
This is still not making sense to me. If both of them generate states according to A, and if A is ergodic, then both of them should have exactly the same behavior at steady-state.
You are right here again; I put it in the wrong words.
In general, [itex]\{S\}_2[/itex] are agents who make mistakes and errors, and I would like to see when they make errors (I do not care about the state order, only about the deviation from [itex]\pi[/itex]; as long as they make errors, I need to fix them). Normally, agents should perform a random walk on a graph according to [itex]A[/itex]. The only information I get is the sequence of states. If it is close to [itex]\pi[/itex], I am happy, because I presume that the correct walk should be (and is) according to [itex]A[/itex]. Because the only information I get are the states, the only possible distinction is the difference between two frequencies. Am I right here? That is why I am after some method of comparing pdf's.
Also, an important fact here is that I need to know as soon as possible which sequence I am likely observing (either [itex]\{S\}_2[/itex] or [itex]\{S\}_1[/itex]), because I am allowed to observe them for only a limited number of transitions, say 50, while the number of possible states is comparatively large, say 10,000; they will visit only a limited number of states out of the whole range. What follows is the problem that they will have zero frequencies at components [itex]s_i[/itex] relative to the [itex]\pi_i[/itex].

Thank you for helping me here.
 
  • #14
pawelch said:
In general, [itex]\{S\}_2[/itex] are agents who make mistakes and errors, and I would like to see when they make errors (I do not care about the state order, only about the deviation from [itex]\pi[/itex]; as long as they make errors, I need to fix them). Normally, agents should perform a random walk on a graph according to [itex]A[/itex]. The only information I get is the sequence of states. If it is close to [itex]\pi[/itex], I am happy, because I presume that the correct walk should be (and is) according to [itex]A[/itex]. Because the only information I get are the states, the only possible distinction is the difference between two frequencies. Am I right here? That is why I am after some method of comparing pdf's.
Also, an important fact here is that I need to know as soon as possible which sequence I am likely observing (either [itex]\{S\}_2[/itex] or [itex]\{S\}_1[/itex]), because I am allowed to observe them for only a limited number of transitions, say 50, while the number of possible states is comparatively large, say 10,000; they will visit only a limited number of states out of the whole range. What follows is the problem that they will have zero frequencies at components [itex]s_i[/itex] relative to the [itex]\pi_i[/itex].
Actually, I would recommend an entirely different method. The answer to this question: "Because the only information I get are the states, the only possible distinction is the difference between two frequencies. Am I right here?" is "No". By looking only at the frequency of states and ignoring the sequence, you're throwing out a great deal of information. Using the information embedded in the sequence would allow you to do what you say you need to do: figure out whether you have sequence 1 or sequence 2 as quickly as possible.

There are established, very effective methods for doing exactly this. They go by the name of Hidden Markov Models (HMMs). You fit a state sequence to a Markov model and get a probability that the model can account for the behavior you see. Furthermore, you can get this information for every time point.

The "hidden" in "Hidden Markov Model" refers to the fact that one generally can't see the state -- you can only observe something that is imperfectly correlated with it. In your case you apparently CAN observe state with perfect accuracy, so you don't need the H of HMMs. But that wouldn't invalidate the models -- it would just reduce them to a simpler special case.

HMMs are used widely and there are algorithms and code out there. My main source for HMM coding is section 16.3 of Numerical Recipes, 3rd ed.
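Since in your case the states are observed directly, the forward algorithm collapses to a simple product of transition probabilities, so scoring a sequence is a few lines. A minimal sketch (Python; the matrix, initial vector, and sequence are placeholders):

[code=python]
import numpy as np

def log_likelihood(seq, A, pi):
    # log P(seq | A, pi) = log pi[s_0] + sum over t of log A[s_t, s_{t+1}]
    ll = np.log(pi[seq[0]])
    for a, b in zip(seq[:-1], seq[1:]):
        ll += np.log(A[a, b])
    return ll

A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
pi = np.array([0.45, 0.32, 0.23])  # made-up initial/stationary vector
seq = [0, 0, 1, 2, 2, 0]           # an observed state sequence
print(log_likelihood(seq, A, pi))
[/code]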
 
  • #15
I have been told that HMMs might be of use here, but I find it hard to come up with [itex]b_j(k)[/itex] for my model.
Thank you for your commitment; your remarks were really helpful throughout!
And I'll just hang around with the thought that higher education is for the chosen ones :)
 
  • #16
pawelch said:
I have been told that HMMs might be of use here, but I find it hard to come up with [itex]b_j(k)[/itex] for my model.
That's what I was getting at with my comments about your model not using the H of HMM. The answer in your case is trivial. You have a whole lot of states, which are the j's. You (unlike most folks doing HMM) know what state you're in. So the emissions of your model are exactly the same as the states -- the k's are the same as the j's. And the probability that state j will emit symbol k is 1 if j and k are the same, 0 otherwise: [itex]b_j(k)=\delta_{jk}[/itex].
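In code, that degenerate emission matrix is just an identity matrix (a trivial sketch; N stands for your state count):

[code=python]
import numpy as np

N = 4          # number of states (illustrative)
B = np.eye(N)  # b_j(k) = delta_jk: state j emits symbol j with certainty
[/code]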
 
  • #17
pmsrw3 said:
You have a whole lot of states, which are the j's. You (unlike most folks doing HMM) know what state you're in. So the emissions of your model are exactly the same as the states -- the k's are the same as the j's. And the probability that state j will emit symbol k is 1 if j and k are the same, 0 otherwise: [itex]b_j(k)=\delta_{jk}[/itex].

Following what you suggested, I was trying to translate my problem into an HMM. Suppose that I run agents of type [itex]\{S\}_1[/itex] (these are the clever ones) on [itex]A[/itex] many times (say 1,000,000), in order to estimate a transition matrix [itex]A_{est}[/itex] from their behavior. Obviously, they do not follow [itex]A[/itex] exactly, and sometimes they make mistakes. Those mistakes are minor, yet they still occur. That is why [itex]A_{est}[/itex] is slightly different from [itex]A[/itex].
Now, suppose that at each state [itex]k[/itex] we have a subset of possible states [itex]Q_k[/itex] from which an agent can pick a state and move forward. Suppose further that the [itex]\{S\}_2[/itex] agents have not learned [itex]A_{est}[/itex]; thus, if they are at state [itex]k[/itex], they pick the next state [itex]j[/itex] from a set [itex]Q_{err}=Q_k \cup Q_i[/itex], where [itex]Q_i[/itex] is made up of an arbitrary number of inaccessible states. As a result, I can come up with another transition matrix [itex]A_{err}[/itex].
Therefore, if I observe a sequence of states [itex]\{S\}_u[/itex] generated by an unknown agent, I want to find out whether it is more probable that it was generated by [itex]\{S\}_1[/itex] or [itex]\{S\}_2[/itex]. Thus, I want to determine whether [itex]P(S_u,A_{est}|A_{est},\pi) \leq P(S_u,A_{est}|A_{err},\pi)[/itex] or [itex]P(S_u,A_{est}|A_{est},\pi) > P(S_u,A_{est}|A_{err},\pi)[/itex]. In words: given [itex]A_{est}[/itex] and [itex]\{S\}_u[/itex], I want to discover whether the sequence was generated using [itex]A_{est}[/itex] or [itex]A_{err}[/itex].
Does that make any sense?

Thank you again for helping me out here.
 
  • #18
pawelch said:
Following what you suggested, I was trying to translate my problem into an HMM. Suppose that I run agents of type [itex]\{S\}_1[/itex] (these are the clever ones) on [itex]A[/itex] many times (say 1,000,000), in order to estimate a transition matrix [itex]A_{est}[/itex] from their behavior. Obviously, they do not follow [itex]A[/itex] exactly, and sometimes they make mistakes. Those mistakes are minor, yet they still occur. That is why [itex]A_{est}[/itex] is slightly different from [itex]A[/itex].
Now, suppose that at each state [itex]k[/itex] we have a subset of possible states [itex]Q_k[/itex] from which an agent can pick a state and move forward. Suppose further that the [itex]\{S\}_2[/itex] agents have not learned [itex]A_{est}[/itex]; thus, if they are at state [itex]k[/itex], they pick the next state [itex]j[/itex] from a set [itex]Q_{err}=Q_k \cup Q_i[/itex], where [itex]Q_i[/itex] is made up of an arbitrary number of inaccessible states. As a result, I can come up with another transition matrix [itex]A_{err}[/itex].
"Obviously" they don't follow A exactly? Why is that obvious?

Well, never mind. I wouldn't think you'd have to get [itex]A_{est}[/itex] and [itex]A_{err}[/itex] from simulation. You ought to be able to figure them out exactly. You know A, you know the kinds of errors the agents can make (at least in the simulation), and you know the probability of error. This ought to be enough to derive the exact stochastic matrices [itex]A_{est}[/itex] and [itex]A_{err}[/itex]. That would be useful, since you have, as you've emphasized before, an extremely high-dimensional state space, and you can't really fill it out completely by simulation.

Therefore, if I observe a sequence of states [itex]\{S\}_u[/itex] generated by an unknown agent, I want to find out whether it is more probable that it was generated by [itex]\{S\}_1[/itex] or [itex]\{S\}_2[/itex]. Thus, I want to determine whether [itex]P(S_u,A_{est}|A_{est},\pi) \leq P(S_u,A_{est}|A_{err},\pi)[/itex] or [itex]P(S_u,A_{est}|A_{est},\pi) > P(S_u,A_{est}|A_{err},\pi)[/itex]. In words: given [itex]A_{est}[/itex] and [itex]\{S\}_u[/itex], I want to discover whether the sequence was generated using [itex]A_{est}[/itex] or [itex]A_{err}[/itex].
Does that make any sense?
Looks right, but I don't entirely understand your notation. What is [itex]P(S_u,A_{est}|A_{err},\pi)[/itex]? What's the Aest doing there and the pi?
 
  • #19
pmsrw3 said:
Looks right, but I don't entirely understand your notation. What is [itex]P(S_u,A_{est}|A_{err},\pi)[/itex]? What's the Aest doing there and the pi?
I always have problems putting it in the right words. What I meant was: since we know [itex]A_{est}[/itex] and [itex]A_{err}[/itex], we want to find out whether an observed sequence [itex]S_u[/itex] was generated using [itex]A_{est}[/itex] or [itex]A_{err}[/itex].
We can put it this way: [itex]P(S_u,S_{observable}|A_{err},\pi)[/itex] is the marginal pmf of an unobserved sequence and an observable one, given that they were generated from the [itex]A_{err}[/itex] matrix, and [itex]P(S_u,S_{observable}|A_{est},\pi)[/itex] is the same for the [itex]A_{est}[/itex] matrix. Therefore, I actually want to compare [itex]P(S_u,S_{observable}|A_{err},B_{err},\pi)[/itex] and [itex]P(S_u,S_{observable}|A_{est},B_{est},\pi)[/itex].
I might be making a mistake again here, but for an HMM, and for my problem, if we want to calculate [itex]P(S_u,S_{observable}|A_{err},B_{err},\pi)[/itex], we would describe the emission matrix as [itex]A_{est}[/itex], thus [itex]B_{err}=A_{est}[/itex]. However, for [itex]P(S_u,S_{observable}|A_{est},B_{est},\pi)[/itex], we have [itex]A_{est}=B_{est}[/itex] and [itex]P(S_{observable}|A_{est},\pi) = \pi_{s_1}\prod a_{s_i,s_j}[/itex], because I assume it was produced from [itex]A_{est}[/itex]. Thus, effectively, I want to compare [itex]P(S_u,S_{observable}|A_{err},A_{est},\pi)[/itex] against [itex]P(S_{observable}|A_{est},\pi)[/itex]. Does that make any sense?

Thank you
 
  • #20
Oh, I see (mostly). I had completely forgotten that we started with HMMs and non-trivial emission probabilities. I would say you just want the probability of getting state sequence [itex]S_u[/itex] if the transition matrix was [itex]A_{err}[/itex], which I would call [itex]P(S_u | A_\mbox{err})[/itex], or maybe [itex]P_{A_\mbox{err}}(S_u)[/itex]. But these are just trivial issues of notation. I think we're talking about the same thing.
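Concretely, the comparison then boils down to a log-likelihood ratio: compute [itex]\log P(S_u|A_{est}) - \log P(S_u|A_{err})[/itex] and look at its sign. A sketch (Python; the 2-state matrices and the sequence are placeholders for your real [itex]A_{est}[/itex], [itex]A_{err}[/itex], and data, and the scoring function is the same one as in the earlier sketch):

[code=python]
import numpy as np

def seq_loglik(seq, A, pi):
    # log P(seq | A, pi) for a fully observed state sequence
    return np.log(pi[seq[0]]) + sum(np.log(A[a, b])
                                    for a, b in zip(seq[:-1], seq[1:]))

A_est = np.array([[0.8, 0.2],   # placeholder "learned" chain
                  [0.3, 0.7]])
A_err = np.array([[0.5, 0.5],   # placeholder "error-prone" chain
                  [0.5, 0.5]])
pi  = np.array([0.6, 0.4])      # placeholder initial distribution
seq = [0, 0, 0, 1, 1, 0, 0]     # the observed sequence S_u

llr = seq_loglik(seq, A_est, pi) - seq_loglik(seq, A_err, pi)
print("more consistent with A_est" if llr > 0 else
      "more consistent with A_err")
[/code]

Note that if the observed sequence uses a transition that has probability 0 under one of the matrices, its log-likelihood is [itex]-\infty[/itex], which settles the question immediately.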
 

What is KL-divergence?

KL-divergence, also known as Kullback-Leibler divergence, is a mathematical measure of how one probability distribution differs from another. It is commonly used to compare two probability density functions (pdf's).

How is KL-divergence calculated?

KL-divergence is calculated as the expectation, under the first distribution, of the logarithm of the ratio of the two densities:
[tex]
D_{KL}(P\,\|\,Q)=\int p(x)\log\frac{p(x)}{q(x)}\,dx
[/tex]
with the integral replaced by a sum for discrete distributions. The resulting value represents the divergence of the second distribution from the first.

What is the purpose of comparing pdf's using KL-divergence?

The purpose of comparing pdf's using KL-divergence is to quantitatively measure how different two probability distributions are. This can be useful in various fields such as statistics, data analysis, and machine learning.

What are the limitations of using KL-divergence to compare pdf's?

One limitation of KL-divergence is that it is not a symmetric measure: the divergence of distribution B from A is generally not the same as the divergence of A from B. It is also infinite whenever the second distribution assigns zero probability to an outcome that the first one allows (exactly the issue discussed in this thread), and it is sensitive to the choice of the reference distribution.

Can KL-divergence be used to compare more than two pdf's?

Yes, KL-divergence can be used to compare multiple pdf's. However, when comparing more than two distributions, it is important to choose a reference distribution and calculate the KL-divergence for each distribution compared to the reference. This allows for a more meaningful comparison between the distributions.
