A question regarding conditional probabilities

Lajka · May 30, 2011

Hello,

I'm hoping I'm asking this in the right place. If not, I apologize.
Anyway, I have a dilemma about some basics in probability and pattern recognition, and hopefully someone can help me.

I'm not sure I understand what the class-conditional pdf f(x|w_{i}) really means, and it's bothering me. Let me elaborate...

When we use terms such as 'conditional pdf' and 'conditional cdf', we mean:

F(x|A) = P(X ≤ x | A) = P({X ≤ x} ∩ A) / P(A),   f(x|A) = dF(x|A)/dx

where A is some event, a subset of the sample space. This event A must also be the domain of the functions defined above. It's a 'new universe', so to speak, for conditional cdfs and pdfs, and they only make sense if we look at them over this event A. For example, if we take a random variable X with a Gaussian distribution and denote the event A as A={1.5<X<4.5}, then the corresponding conditional functions (pdf & cdf) look like

As you can see, they're defined only over the interval {1.5<x<4.5}, otherwise they wouldn't make sense.
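A minimal numerical sketch of this example, for readers who want to check it. The mean 3 and standard deviation 1 are assumptions (the post only says "Gaussian"), and `cond_pdf`, `cond_cdf` and `integrate` are hypothetical helper names. It treats f(x|A) as the original pdf rescaled by 1/P(A) inside A and set to zero outside A.

```python
import math

# Assumed parameters (not given in the post): X ~ N(MU, SIGMA^2), A = {LO < X < HI}.
MU, SIGMA = 3.0, 1.0
LO, HI = 1.5, 4.5

def normal_pdf(x, mu=MU, sigma=SIGMA):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=MU, sigma=SIGMA):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

P_A = normal_cdf(HI) - normal_cdf(LO)  # P(A)

def cond_pdf(x):
    """f(x|A): the unconditional pdf divided by P(A) inside A, zero outside."""
    return normal_pdf(x) / P_A if LO < x < HI else 0.0

def cond_cdf(x):
    """F(x|A) = P({X <= x} and A) / P(A)."""
    if x <= LO:
        return 0.0
    if x >= HI:
        return 1.0
    return (normal_cdf(x) - normal_cdf(LO)) / P_A

def integrate(f, a, b, n=20000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# All of the conditional probability mass sits on A.
total_mass = integrate(cond_pdf, LO, HI)
```

Under these assumptions `total_mass` comes out at 1 up to numerical error, which is the 'new universe' idea: dividing by P(A) renormalizes the piece of the Gaussian that lies over A.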
Often we are interested in conditional probability functions where the event is A = (Y=y_{0}), and then we have

f(x|y)= f(x|Y=y_{0})=f(x,y_{0})/f_{Y}(y_{0})

We can interpret the function f(x|y) as the intersection of the joint pdf f(x,y) with the plane y=y_{0} (with f_{Y}(y_{0}) as a normalization factor).
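The slice-and-normalize reading can be checked numerically. This is only a sketch: the bivariate standard normal with correlation `RHO = 0.6` and the value `Y0 = 1.0` are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math

RHO = 0.6  # assumed correlation between X and Y
Y0 = 1.0   # assumed conditioning value y_0

def joint_pdf(x, y, rho=RHO):
    """Bivariate standard normal density f(x, y) with correlation rho."""
    q = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

def integrate(f, a, b, n=20000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# f_Y(y_0): the area under the raw slice of the joint pdf at y = y_0.
F_Y0 = integrate(lambda x: joint_pdf(x, Y0), -10.0, 10.0)

def cond_pdf(x):
    """f(x | Y = y_0) = f(x, y_0) / f_Y(y_0)."""
    return joint_pdf(x, Y0) / F_Y0

cond_area = integrate(cond_pdf, -10.0, 10.0)
cond_mean = integrate(lambda x: x * cond_pdf(x), -10.0, 10.0)
```

The raw slice has area f_{Y}(y_{0}), which is less than 1, and dividing by it gives a proper density in x. For this assumed joint pdf the conditional mean equals RHO * Y0.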

This is all fairly basic stuff, I reckon. These are the only types of conditional probability functions I know of, and they're all defined over the region corresponding to the event that serves as the condition (I can't stress this enough, for reasons seen later). But class-conditional probability functions, such as f(x|w_{i}) in Bayes classifier theory, seem like a different beast to me.

First of all, let me say that everything about these conditional pdfs, and the naive Bayes classifier in general, is perfectly intuitive to me, and I don't have a problem with the logic presented here. But when I try to define everything rigorously from a mathematical POV, I get stuck.
In other words, I understand what p(x|w_{i}) represents, and why, for instance, p(x|w_{1}) is non-zero even over the region w_{2}. However, I don't know how to explain all that using rigorous mathematical apparatus. Let me elaborate even more...

So, we have these classes w_{i}. What exactly are they, mathematically speaking?! Their priors sum up to one, and they will eventually be represented by regions in our sample space, so I will define them as events in my sample space. If we look at the simplest example in 1-D, the conditional probability density functions would look something like this

And then we could say that event w_{1} is (-inf, x_{0}) and event w_{2} is (x_{0}, +inf).

If you ask me, this doesn't make sense if you consider the definitions of conditional pdfs above. A conditional probability density function, by its very definition, must be confined to the region of the event it's conditioned on. In other words, p(x|w_{1}) should be constrained to the w_{1} region! Not only is it not, it spreads out over the w_{2} region as well! That shouldn't be possible, because w_{1} and w_{2} are mutually exclusive events, and their respective regions do not overlap, which makes sense. But the conditional probability density functions defined over them do? Wait, what?!

Of course, this is how we define the error of our classification, but all this doesn't look very convincing to me, strictly mathematically speaking.
A conditional pdf p(x|A) must be defined over the region which corresponds to the event A, period. The functions p(x|w_{1}) and p(x|w_{2}) shouldn't overlap each other like that, because the regions w_{1} and w_{2} are mutually exclusive. This is what the basic theory of conditional probability density functions tells us.

So, this is why I think that p(x|w_{i}) is not an ordinary conditional pdf like the one defined in the beginning of this post. But what is it then?! I don't know, I'm confused. Or maybe I shouldn't interpret classes w_{i} as regions in the sample space, and that's the mistake I'm making here. But what are they then, how should I interpret them?

Also, if I assume that it's okay to interpret classes w_{i} as regions in space, isn't there a circularity problem? We first define p(x|w_{i}) over a supposedly known event w_{i}, but we don't actually know what region the event w_{i} occupies in the sample space. Determining these regions is the whole point of classification, after all.
But is it really okay to define a function at the outset whose domain is actually unknown?

Hopefully, I made at least some sense here, and thanks in advance for any help I can get.
Cheers.

Stephen Tashi · May 30, 2011

This is an interesting question! I tend to think of practical problems in verbal terms (events, statements, information, etc.) instead of measure theory terms, so my attempt to explain this may be lacking, which won't deter me from trying!

Suppose we generate a random variable X by the following process:

We are given two different pdfs f_1(x) and f_2(x). (Imagine a normal and an exponential pdf, if you like.) We throw a fair die. If the die lands with the 1-face up, we set X equal to a random realization drawn from f_1. Otherwise we set X equal to a random realization drawn from f_2.

It is possible to define a pdf for X on the real line by (1/6) f_1(x) + (5/6) f_2(x). However, this is not the most fundamental way to look at the "probability space". The fundamental way is to look at a joint density whose domain is the space of pairs (x,k), where x is a real number and k is an integer from 1 to 6. If you look at it that way, then the conditional density of X given that k = 1 is formed by confining our attention to a subset of that domain, namely the pairs with k = 1.
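A small sketch of the die construction may help. It assumes f_1 is a standard normal and f_2 an exponential with rate 1 (the post only suggests "a normal and an exponential"), and the function names are hypothetical. The density lives on pairs (x, k), the mixture is its marginal in x, and conditioning on k = 1 recovers f_1.

```python
import math

def f1(x):
    """Assumed f_1: standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f2(x):
    """Assumed f_2: exponential pdf with rate 1."""
    return math.exp(-x) if x >= 0.0 else 0.0

def joint(x, k):
    """f(x, k): density on pairs (x, k), with k the face of a fair die."""
    if k not in range(1, 7):
        return 0.0
    return (1.0 / 6.0) * (f1(x) if k == 1 else f2(x))

def marginal_x(x):
    """pdf of X alone: sum the joint over the six faces."""
    return sum(joint(x, k) for k in range(1, 7))

def integrate(f, a, b, n=52000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# P(k): integrate the joint over x for each face.
P_K = {k: integrate(lambda x, k=k: joint(x, k), -12.0, 40.0) for k in range(1, 7)}

def cond_x_given_k(x, k):
    """f(x | k) = f(x, k) / P(k): a slice of the joint, renormalized."""
    return joint(x, k) / P_K[k]
```

Note that `cond_x_given_k(x, 1)` is positive for every real x. The event {k = 1} restricts the second coordinate of the pair, not the range of x, so nothing forces f(x | k = 1) and f(x | k = 2) to live on disjoint parts of the x-axis.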

People who take a practical approach to probability are used to dealing with "point masses". For example, consider an abstract dart game which is scored as follows. For each throw, let R be the distance from the center of the dartboard at which your dart lands. You get 1/R points if your dart lands more than 3 inches from the center, and you get 10 points if your dart lands within 3 inches of the center. If S is the random variable representing a person's score on one dart throw, then a pdf f(s) for S defined on the real numbers would be non-zero on the interval (0,1/3) and at the point s = 10. If you try to do a Riemann integral of such a function, it won't integrate to 1, since the non-zero value f(10) occurs at a single point, not over an interval. So, from the practical point of view, you declare that the value s = 10 is a "point mass", which means that the type of integration you do must add the probability that s = 10 to the total.
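To see the point mass numerically, here is a sketch under one loud assumption: R is exponentially distributed with mean 3 inches. The post does not specify how R is distributed, so this choice is purely illustrative, as are the names below.

```python
import math

MEAN_R = 3.0  # assumed: R ~ exponential with mean 3 inches (not from the post)

# Point mass: P(S = 10) = P(R <= 3).
P_POINT = 1.0 - math.exp(-3.0 / MEAN_R)

def density_s(s):
    """Continuous part of the distribution of S = 1/R, supported on (0, 1/3)."""
    if not (0.0 < s < 1.0 / 3.0):
        return 0.0
    r = 1.0 / s
    # Change of variables: f_S(s) = f_R(1/s) / s^2.
    return (math.exp(-r / MEAN_R) / MEAN_R) / (s * s)

def integrate(f, a, b, n=20000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

continuous_mass = integrate(density_s, 0.0, 1.0 / 3.0)  # equals P(R > 3)
total = continuous_mass + P_POINT
```

Under this assumption the ordinary integral over (0, 1/3) only accounts for P(R > 3). The total reaches 1 only once the probability sitting at s = 10 is added separately.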

The need for probability theory to deal with a combination of continuous and discrete situations is one of the main motivations for developing advanced theories of integration. You won't be able to visualize the mathematics of such situations by thinking only in terms of Riemann integration.

Lajka · Jun 2, 2011

Hey Stephen, thanks for your answer! I really like this example, with the joint density f(x,k) and the mixture (1/6) f_1(x) + (5/6) f_2(x) as the pdf of X. But does that mean that we could visualise f(x,k) as

Am I right? Or is this wrong too?
Because I think you're trying to tell me that I'm wasting my time trying to present everything in terms of probability and sample spaces. For example, how would I present f_1(x) here, as a section of f(x,k) and a plane k=1?
However, I now think I was wrong to visualise classes w_{i} as regions. I don't think that's what they are; regions are just our way of representing classes. That's why there is room for error, after all. I'm still wrestling with this idea, tho.

Thanks again!

Stephen Tashi · Jun 2, 2011

It is correct to visualize f(x,k) in the coordinate system you illustrated.

"how would I present f1(x) here, as a section of f(x,k) and a plane k=1?"

Yes, with the understanding that f_1 means the density on the condition that k = 1.

"I think you're trying to tell me that I'm wasting my time trying to present everything in terms of probability and sample spaces."

No, I don't mean that. In fact, I don't know how you can understand this material in terms other than sample spaces and probability. It is correct to visualize conditional densities as being defined by regions of sample spaces. But you can't insist that the sample space is always just the n-dimensional space of real numbers. The sample space might be the Cartesian product of the n-dimensional real space with a set of discrete things.
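Applied to the original two-class question, this product-space view can be sketched as follows. Everything numeric here is an assumption for illustration: equal priors, unit-variance Gaussian class-conditional densities with means 0 and 2, and a decision threshold x_0 = 1. The class w_i is the discrete coordinate of the pair (x, w), while the intervals on the x-axis are only the classifier's decision regions.

```python
import math

# Assumed setup (not from the thread): two classes, equal priors,
# Gaussian class-conditional densities with a common standard deviation.
PRIOR = {1: 0.5, 2: 0.5}
MEAN = {1: 0.0, 2: 2.0}
SIGMA = 1.0
X0 = 1.0  # decision threshold: decide w_1 if x < X0, else w_2

def class_cond_pdf(x, w):
    """p(x | w): the slice of the joint density at label w, divided by the prior."""
    z = (x - MEAN[w]) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))

def joint(x, w):
    """Density on pairs (x, w), where w is a discrete class label."""
    return PRIOR[w] * class_cond_pdf(x, w)

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def decide(x):
    """Decision regions on the x-axis; these are not the events w_1 and w_2."""
    return 1 if x < X0 else 2

# p(x | w_1) is positive at a point that lies in the decision region of w_2.
overlap_value = class_cond_pdf(1.5, 1)

# Error probability: mass of {label 1, x >= X0} plus mass of {label 2, x < X0}.
p_error = (PRIOR[1] * (1.0 - normal_cdf(X0, MEAN[1], SIGMA))
           + PRIOR[2] * normal_cdf(X0, MEAN[2], SIGMA))
```

In this sketch the events w_1 and w_2 are still mutually exclusive, since a pair (x, w) carries exactly one label. The overlap of p(x|w_1) and p(x|w_2) along the x-axis is what makes `p_error` non-zero.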

Geometric visualization has limited usefulness. You can have complicated conditionals, for example the conditional density of x given that (x < 2 and the die landed 1) or (x is between two consecutive odd integers and the die landed 6). I think algebra is a better guide in such situations than geometry.
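As a rough illustration of the algebraic route, assume the die setup from earlier in the thread, so that the density on pairs is f(x,1) = (1/6) f_1(x) and f(x,k) = (1/6) f_2(x) for k = 2, ..., 6. Write B for whatever set of x values the second clause is meant to pick out (B is a placeholder, not something the post defines). Conditioning on an event E is then the same recipe as before: multiply by the indicator of E and divide by P(E).

```latex
% Sketch under the assumptions stated above; B is a placeholder set.
\[
E = \bigl(\{x < 2\} \times \{1\}\bigr) \cup \bigl(B \times \{6\}\bigr),
\qquad
f(x,k \mid E) = \frac{f(x,k)\,\mathbf{1}_E(x,k)}{P(E)},
\]
\[
P(E) = \frac{1}{6}\int_{-\infty}^{2} f_1(x)\,dx
     + \frac{1}{6}\int_{B} f_2(x)\,dx .
\]
```

No picture is needed for this; the indicator function does the bookkeeping that the geometry struggles with.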

Lajka · Jun 6, 2011

Yeah, that's my understanding of it now, too. Maybe I don't need to force geometric interpretations if algebraic ones are a better fit under the circumstances, like in those examples you mentioned above. But then again, just for consistency's sake, I like to know that I'm able to do both, if I need to. I'm still wrestling with this, but I think it'll all blend in if I give it time.

Thanks again for all your help!

P.S. Sorry for late responses, I was out of town for the whole week (I was checking forums on my cell, tho).
