Probability: why can we use the Dirac delta function for a conditional pdf?

Master1022 · Jan 8, 2021

Hi,

I have a quick question about something which I have read regarding the use of dirac delta functions to represent conditional pdfs. I have heard the word 'mask' thrown around, but I am not sure whether that is related or not.

The source I am reading from states:
[tex] p(x) = \lim_{\sigma \to 0} \mathcal{N}(x; \mu , \sigma^2) = \delta(x - \mu) [/tex]
which makes sense from a graphical standpoint. As we reduce the variance of a normal distribution, it tends towards the shape of a dirac delta function.

Now it says: "the dirac delta is useful as a conditional pdf when we know that [itex] y = f(x) [/itex], giving:
[tex] p(y | x) = \delta(y - f(x)) [/tex]"
and it is here where I am confused. There is no further explanation of this result. So I know that when the argument of the delta function is 0, it evaluates to 1. However, I cannot understand what is going on here and how to interpret this expression.

Any help or guidance would be greatly appreciated.

Thanks

Twigg · Jan 8, 2021

When ##y=f(x)##, there is one-to-one correlation so you don't expect a continuous pdf like you would if say, ##y = x + \epsilon## where ##\epsilon## was a normally distributed random variable with ##\mu = 0## and ##\sigma = 1##. In the latter case, you would expect a gaussian pdf, because the correlation between x and y is mixed up by the randomness of ##\epsilon##. In the case where ##y = f(x)##, the pdf must reflect the fact that y is a deterministic function of x. When you evaluate the probability of y being in the interval from ##a## to ##b##, all that matters is whether or not ##f(x)## falls in between a and b. In other words $$\int_{a}^{b} p(y|x)\,dy = \begin{cases} 1 & \text{if } a<f(x)<b \\ 0 & \text{otherwise} \end{cases}$$

Look familiar?
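A numerical sketch of this idea, added for illustration: take the noisy model ##y = f(x) + \sigma\epsilon## with standard normal ##\epsilon## and shrink ##\sigma##. The map `f`, the interval endpoints, and the helper names below are assumptions chosen for the demo, not anything from the lecture notes.

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_y_in_interval(x, a, b, f, sigma):
    """P(a < y < b | x) when y = f(x) + sigma * eps, with eps ~ N(0, 1)."""
    mean = f(x)
    return normal_cdf((b - mean) / sigma) - normal_cdf((a - mean) / sigma)

def f(x):
    # Hypothetical deterministic map, chosen only for the demo.
    return x ** 2

a, b = 3.0, 5.0
for sigma in (1.0, 0.1, 0.01):
    inside = prob_y_in_interval(2.0, a, b, f, sigma)   # f(2) = 4 lies in (3, 5)
    outside = prob_y_in_interval(3.0, a, b, f, sigma)  # f(3) = 9 lies outside
    print(f"sigma={sigma}: inside={inside:.6f}, outside={outside:.6f}")
```

As ##\sigma## shrinks, the first probability should approach 1 and the second should approach 0, which is the all-or-nothing behaviour of the integral above.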

Master1022 · Jan 9, 2021

Thank you very much for your reply @Twigg . Apologies for my delayed reply, I have been re-reading this over and over and something hasn't quite clicked yet.

Twigg said:

When ##y=f(x)##, there is one-to-one correlation so you don't expect a continuous pdf like you would if say, ##y = x + \epsilon## where ##\epsilon## was a normally distributed random variable with ##\mu = 0## and ##\sigma = 1##.

Yes, this makes sense

Twigg said:

In the latter case, you would expect a gaussian pdf, because the correlation between x and y is mixed up by the randomness of ##\epsilon##.

Understood

Twigg said:

In the case where ##y = f(x)##, the pdf must reflect the fact that y is a deterministic function of x.

I think this makes sense. Would you be able to elaborate on this a bit more?

Twigg said:

When you evaluate the probability of y being in the interval from ##a## to ##b##, all that matters is whether or not ##f(x)## falls in between a and b.

Sure, and this is because [itex] y = f(x) [/itex]

Twigg said:

In other words $$\int_{a}^{b} p(y|x)\,dy = \begin{cases} 1 & \text{if } a<f(x)<b \\ 0 & \text{otherwise} \end{cases}$$

Just to check, are we limiting the 'range' of [itex] f(x) [/itex] to be between [itex] a [/itex] and [itex] b [/itex]?

Twigg said:

Look familiar?

Unfortunately not... I'm guessing I should see something resembling the delta function, but it just isn't apparent to me at the moment. If we are defining f(x) to be 0 outside of a certain range, then that reminds me of passing a rectangular window over [itex] f(x) [/itex], but this seems to be irrelevant. Otherwise, I suppose as [itex] a [/itex] and [itex] b [/itex] become very close, then that integral does resemble the result from a delta function.

Stephen Tashi · Jan 9, 2021

Master1022 said:

The source I am reading from states:

What is the source? An engineering article or applied math text may make statements that are not mathematically precise.

[tex] p(x) = \lim_{\sigma \to 0} \mathcal{N}(x; \mu , \sigma^2) = \delta(x - \mu) [/tex]
which makes sense from a graphical standpoint. As we reduce the variance of a normal distribution, it tends towards the shape of a dirac delta function.

If we define ##\delta(x-y)## to be the function given by
##\delta(x-y) = 1 ## if ##x = y##
##\delta(x-y) = 0 ## if ##x \ne y##

Then ## \int_{-\infty}^{\infty} \delta(x-y) dx = 0 ##. So ##\delta(x-y)## is not a probability distribution.

The informal idea of a delta function is to imagine (as you indicated) a function that is defined to be zero except at one value, but which still has the property that ##\int_{-\infty}^{\infty}\delta(x-y) dx = 1##. In engineering or applied math, an article may expect the reader to use this type of informal and imprecise concept.

We can look at the practical use of such a concept in this general way:
Suppose we are solving a problem that asks for a result of the form:

Answer ##= \lim_{\sigma \rightarrow 0} A( f(x,\mu,\sigma), r,s,...)##

where ##A## is some algorithm that produces a result that is a function of several variables, including the variable ##\sigma##. The usual way of finding Answer involves working out the formula ##A(...)## and then taking the limit of that formula as ##\sigma \rightarrow 0##. We can daydream about a shortcut method where we substitute in some expression for ##f(x,\mu,\sigma)## that doesn't involve the variable ##\sigma##, but nevertheless gives us the Answer to the problem and saves us the steps of keeping ##\sigma## in the formula.
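As one concrete instance of such a shortcut, added here for illustration: let ##A## be "integrate a test function ##g## against the normal density", so that Answer ##= \lim_{\sigma \rightarrow 0} \int g(x)\, \mathcal{N}(x; \mu, \sigma^2)\, dx##. The sketch below keeps ##\sigma## in the computation and prints the result for shrinking ##\sigma##, next to the value ##g(\mu)## that substituting ##\delta(x - \mu)## would give in one step. The choice ##g = \cos##, the value of ##\mu##, and the midpoint-rule integrator are all assumptions made for the demo.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def expectation(g, mu, sigma, n=20000, width=10.0):
    """Midpoint-rule estimate of the integral of g(x) * N(x; mu, sigma^2).

    Integrates over mu +/- width * sigma, which holds essentially all the mass.
    """
    lo = mu - width * sigma
    h = 2.0 * width * sigma / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += g(x) * normal_pdf(x, mu, sigma)
    return total * h

mu = 0.7  # hypothetical location of the spike
for sigma in (1.0, 0.1, 0.01):
    print(sigma, expectation(math.cos, mu, sigma))
print("shortcut g(mu):", math.cos(mu))
```

For this particular ##g## the integral is known in closed form, ##\cos(\mu)\, e^{-\sigma^2/2}##, so the printed values can be checked by hand.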

Twigg · Jan 9, 2021

When I was talking about y being a deterministic function of x, all I really mean is that if x has a well-defined value, then y has a well-defined value. That's really all there is to it. If you know x, you know y exactly.

The limits a and b here are arbitrary numbers and do not have to be in the range of f(x). What happens when the interval between a and b does not overlap the range of f(x)? The conditional probability of finding y in that interval, given any x, is 0. Consequently, the non-conditional probability of finding y between a and b is also 0 (##p(y) = 0## there).

By definition, the delta function is the distribution that satisfies $$\int_{a}^{b} \delta(x-c)\, dx = \begin{cases} 1 & \text{if } a<c<b \\ 0 & \text{otherwise} \end{cases}$$

vela · Jan 9, 2021

Master1022 said:

Now it says: "the dirac delta is useful as a conditional pdf when we know that [itex] y = f(x) [/itex], giving:
[tex] p(y | x) = \delta(y - f(x)) [/tex]"
and it is here where I am confused. There is no further explanation of this result. So I know that when the argument of the delta function is 0, it evaluates to 1. However, I cannot understand what is going on here and how to interpret this expression.

It would help if you could provide a bit more context for this statement.

One minor error in what you wrote is that ##\delta(0) \ne 1##. Rather, you have ##\delta(0) = \infty## and
$$\int_{-\infty}^\infty \delta(x)\,dx = 1.$$
The Dirac delta function can be used to describe a discrete probability distribution in a "continuous" way. For example, take rolling a six-sided die. We can describe the probability of each outcome discretely, i.e., ##p(n) = 1/6## for ##n=1,2,3,4,5,6##. Or equivalently, we can write down a probability density function
$$f(x) = \sum_{n=1}^6 \frac 16\, \delta(x-n).$$ Since ##\delta(x-x_0) = 0## when ##x \ne x_0##, this function vanishes when ##x## is not equal to one of 1, 2, 3, 4, 5, or 6 which makes sense as rolling a die can't produce any value other than those six possibilities.

Probability density functions are usually used for continuous distributions, but using the delta function, we can write one down for discrete distributions as well.
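To see how this density behaves under an integral, here is a short sketch (an illustration, not from the lecture notes). It relies only on the sifting property ##\int g(x)\,\delta(x-n)\,dx = g(n)##, so each delta term is stored as a (location, weight) pair. The names `die_density`, `integrate`, and `prob_interval` are made up for the example.

```python
from fractions import Fraction

# The density f(x) = sum_n (1/6) * delta(x - n), stored as (location, weight) pairs.
die_density = [(n, Fraction(1, 6)) for n in range(1, 7)]

def integrate(g, density, a=float("-inf"), b=float("inf")):
    """Integral of g(x) * f(x) over (a, b), using the sifting property.

    Each delta(x - n) contributes weight * g(n) if a < n < b, and nothing otherwise.
    """
    return sum(w * g(n) for n, w in density if a < n < b)

def prob_interval(density, a, b):
    """P(a < X < b): the integral of the density itself over (a, b)."""
    return integrate(lambda x: 1, density, a, b)

print(prob_interval(die_density, float("-inf"), float("inf")))  # total probability
print(prob_interval(die_density, 2.5, 4.5))                     # outcomes 3 and 4
print(integrate(lambda x: x, die_density))                      # mean of the roll
```

The same `integrate` call covers normalisation, interval probabilities, and expectations, which is the practical payoff of writing the discrete distribution as a density.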

Apologies if I'm saying stuff you already know, but it wasn't clear to me what your question was.

Master1022 · Jan 10, 2021

Thanks for the reply @Stephen Tashi

Stephen Tashi said:

What is the source? An engineering article or applied math text may make statements that are not mathematically precise.

It was some lecture notes. The previous slide was the definition of conditional probability and then this slide appeared. The text I put in the original post was the only text that the slide contained. There was no explanation of this concept.

Stephen Tashi said:

If we define ##\delta(x-y)## to be the function given by
##\delta(x-y) = 1 ## if ##x = y##
##\delta(x-y) = 0 ## if ##x \ne y##

Then ## \int_{-\infty}^{\infty} \delta(x-y) dx = 0 ##. So ##\delta(x-y)## is not a probability distribution.

Are we saying that if such an integral exists, then ## \delta(x - y) ## is not a probability distribution?

Stephen Tashi said:

The informal idea of a delta function is to imagine (as you indicated) a function that is defined to be zero except at one value, but which still has the property that ##\int_{-\infty}^{\infty}\delta(x-y) dx = 1##. In engineering or applied math, an article may expect the reader to use this type of informal and imprecise concept.

Agree

Stephen Tashi said:

We can look at the practical use of such a concept in this general way:
Suppose we are solving a problem that asks for a result of the form:

Answer ##= \lim_{\sigma \rightarrow 0} A( f(x,\mu,\sigma), r,s,...)##

where ##A## is some algorithm that produces a result that is a function of several variables, including the variable ##\sigma##. The usual way of finding Answer involves working out the formula ##A(...)## and then taking the limit of that formula as ##\sigma \rightarrow 0##. We can daydream about a shortcut method where we substitute in some expression for ##f(x,\mu,\sigma)## that doesn't involve the variable ##\sigma##, but nevertheless gives us the Answer to the problem and saves us the steps of keeping ##\sigma## in the formula.

Thanks - I will need to think about this one for a while. I'll get back to you if I have any questions once I have understood this

Master1022 · Jan 10, 2021

Thanks @Twigg and @vela for your responses. I have just combined this into one post to ask the same follow up question.

vela said:

It would help if you could provide a bit more context for this statement.

This slide showed up whilst learning about conditional probability. This was the entirety of the text on this slide and there was no explanation. I wouldn't have minded so much if this concept wasn't used in later proofs.

vela said:

One minor error in what you wrote is that ##\delta(0) \ne 1##. Rather, you have ##\delta(0) = \infty## and
$$\int_{-\infty}^\infty \delta(x)\,dx = 1.$$

Yes, you are correct. Apologies.

Twigg said:

When I was talking about y being a deterministic function of x, all I really mean is that if x has a well-defined value, then y has a well-defined value. That's really all there is to it. If you know x, you know y exactly.

Okay thank you very much, that is clear

Twigg said:

The limits a and b here are arbitrary numbers and do not have to be in the range of f(x). What happens when the interval between a and b does not overlap the range of f(x)? The conditional probability of finding y in that interval, given any x, is 0. Consequently, the non-conditional probability of finding y between a and b is also 0 (##p(y) = 0## there).

By definition, the delta function is the distribution that satisfies $$\int_{a}^{b} \delta(x-c)\, dx = \begin{cases} 1 & \text{if } a<c<b \\ 0 & \text{otherwise} \end{cases}$$

vela said:

Apologies if I'm saying stuff you already know, but it wasn't clear to me what your question was.

Thanks for the clarification. However, I am still confused as to what the point of using ## \delta( y - f(x)) ## is? I know that the delta function will be non-zero for ## y = f(x) ##, but what is the point of that information? From my point of view, we have just let some variable ## z = a ## and then have written ## \delta(z - a) ## which doesn't seem to provide any new information... I feel like I am missing something subtle.
Is it a shorthand way of writing that the pdf of ## p(y) ## is only defined over the range of ## f(x) ## and zero elsewhere? Would an extended way of writing this be:
[tex] p(y) = \begin{cases} \text{something}, & \text{when we are within the range of } f(x) \\
0, & \text{elsewhere} \end{cases} [/tex]

Thank you for taking the time to help me, I do really appreciate it.

Twigg · Jan 11, 2021

You're right about it not being new information. The delta function pdf is just what you already know about exact correlations translated into the language of pdfs. My guess is if they bothered to introduce it in your lectures, then they intend to use it for proofs.

vela · Jan 11, 2021

Master1022 said:

Is it a shorthand way of writing that the pdf of ## p(y) ## is only defined over the range of ## f(x) ## and zero elsewhere?

Defined isn't the right word since presumably ##p(y)## is defined for all real ##y##, but your basic idea is correct.

Could you provide us with an example of how the statement was used on the later slides? I do find the statement as given confusing. It seems to rely on unstated assumptions.

Stephen Tashi · Jan 11, 2021

Master1022 said:

Thanks for the clarification. However, I am still confused as to what the point of using ## \delta( y - f(x)) ## is?

It's a useful fiction. Fictions are often used in applied math. For example, there is the fiction of pretending that the derivative of a function ##f## can be expressed as the ratio of two "infinitely small" real numbers as ##\frac{df}{dx}## and using manipulations where the symbols ##df## and ##dx## are manipulated like members of the real number system.

Keep in mind that when we define the properties of a mathematical object, we have not proved that an object with those properties can exist. Using the usual definition of integration, there is no real valued function ##\delta(x)## with the property that ##\int_{a}^{b} \delta(x) dx = 1 ## if ##a < 0 < b## and ## \int_a^b \delta(x) dx = 0 ## otherwise. A "delta function" is only a (real valued) function in the sense that "counterfeit currency" is a type of currency.

It is possible to define "delta functions" so that they exist in mathematical contexts that are more general than functions on ##\mathbb{R}##. (The terminology for this is "generalized functions".) However, in applied math books, the student may be told to imagine that a delta function is a real valued function with certain (impossible) properties.

A basic convenience of imagining delta functions as real valued functions is that it simplifies the notation for mathematical statements about probability distributions. Elementary probability theory usually begins by defining discrete probability distributions. Then it introduces continuous probability densities. A discrete probability distribution is defined on a discrete set such as ##\{1,2,3,4,5,6\}## and satisfies a property like ##\sum_{i=1}^6 f(i) = 1##. By contrast, continuous probability densities satisfy a property that is stated in terms of integration. It would be convenient to use only one type of notation (integration or summation) for stating theorems about probability distributions. In the advanced mathematical topic called "measure theory", this is accomplished in a rigorous way using only the symbol for integration and using a very general definition of "integration". A book written for students who have not studied measure theory may still want the convenience of stating all theorems about probability distributions using the notation of integration. This can be accomplished by asking students to imagine delta functions.

Master1022 · Jan 13, 2021

vela said:

Defined isn't the right word since presumably ##p(y)## is defined for all real ##y##, but your basic idea is correct.

Could you provide us with an example of how the statement was used on the later slides? I do find the statement as given confusing. It seems to rely on unstated assumptions.

Sure, so it is used in a proof that the pdf of a sum of independent random variables is the convolution of their pdfs (which might warrant another thread from me at a later time)

The 'proof' says:
[itex] p(a, b) [/itex] is the joint distribution of [itex] a [/itex] and [itex] b [/itex] and [itex] u = a + b [/itex]. Determine [itex] p(u) [/itex].

Write the probability of everything
$$ p(u, a, b) = p(u | a,b)\, p(a, b)
= \delta(u - (a + b))\, p(a, b) $$
Then, as [itex] b = u - a [/itex], we can marginalise out [itex] a [/itex] and [itex] b [/itex] as
$$ p(u) = \iint \delta(u-(a+b))\, p(a, b)\, da\, db $$
$$ = \int p(A = a, B = u - a)\, da $$
OR $$ = \int p(A = u - b, B = b)\, db $$

And then the proof continues...
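A discrete analogue may make the marginalisation step easier to see (this sketch is an addition, not part of the lecture notes). Replace the Dirac delta by a Kronecker delta and the integrals by sums, and take ##a## and ##b## to be two independent fair dice, which is an assumption made purely for the demo. The double sum containing the delta and the single sum with ##b = u - a## substituted should agree for every ##u##.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def p_joint(a, b):
    """Joint pmf p(a, b) of two independent fair dice (assumed for the demo)."""
    return Fraction(1, 36) if a in faces and b in faces else Fraction(0)

def kronecker(k):
    """Discrete stand-in for the Dirac delta: 1 at k == 0, else 0."""
    return 1 if k == 0 else 0

def p_u_double_sum(u):
    """p(u) = sum_a sum_b delta[u - (a + b)] p(a, b)."""
    return sum(kronecker(u - (a + b)) * p_joint(a, b) for a, b in product(faces, faces))

def p_u_single_sum(u):
    """p(u) = sum_a p(A = a, B = u - a): the delta has removed the sum over b."""
    return sum(p_joint(a, u - a) for a in faces)

for u in range(2, 13):
    print(u, p_u_double_sum(u), p_u_single_sum(u))
```

The delta only keeps the terms with ##b = u - a##, so one of the two sums disappears. In the continuous case the same thing happens to one of the two integrals.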

Stephen Tashi · Jan 14, 2021

Master1022 said:

Sure, so it is used in a proof that the pdf of a sum of independent random variables is the convolution of their pdfs (which might warrant another thread from me at a later time)

The 'proof' says:

The proof is not rigorous if we interpret it from the point of view of a student in an introductory probability course. From that point of view, notations like "##p(a,b)##" and "##p(u)##" denote real valued functions of real variables.

The delta function can be defined as a "distribution" in a different sense than the "distributions" of elementary probability theory. From the viewpoint of Measure Theory, the delta function is a "measure". A measure is a type of function whose domain is a collection of sets of points.

For example, if ##f(x)## is a probability density in the sense of distributions in elementary probability theory, then it is not technically correct to write ##f([0,3] \cup [4,7] )## since the domain of ##f## does not contain sets of numbers such as ## [0,3] \cup [4,7]##, but only single numbers. If we want to denote the probability of the set ##[0,3] \cup [4,7]## we write ##\int_{[0,3] \cup [4,7]}f(x) dx ## and evaluate it as ##\int_0^3 f(x) dx + \int_4^7 f(x) dx##.

A probability density ##f## implies a method for assigning probabilities to sets of points through the process of integration. So a probability density implies a measure, but a probability density function itself is not a measure.

The "p-notation" is used ambiguously in many probability books. If we think of "p" as a measure then notation like "##p(x > 3)##" makes sense since a set like ##\{x:x > 3\} = (3,\infty)## can be in the domain of a measure. By contrast, if we think of "p" as a probability density function then "##p(x > 3)##" is not correct notation. If we consider ##p## to be a real valued function, we should use notation like ##\int_3^\infty p(x) dx ##.

So in the proof you quoted, a fundamental question is: how is the p-notation to be interpreted?
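The distinction between a density and a measure can be made concrete with a small sketch (added for illustration; the exponential density with rate 2 and all function names are assumptions, not anything from the quoted proof). A density takes a single number, while a measure takes a set, here restricted to open intervals ##(a, b)##. A point mass is easy to write as a measure even though it has no ordinary density.

```python
import math

RATE = 2.0  # hypothetical rate of the exponential distribution used in the demo

def density(x):
    """A pdf: maps a single real number to a density value."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0

def measure_from_density(a, b):
    """The measure implied by `density`: P((a, b)) = integral of density over (a, b).

    Uses the closed-form exponential CDF, 1 - exp(-RATE * x) for x > 0.
    """
    def cdf(x):
        return 1.0 - math.exp(-RATE * x) if x > 0 else 0.0
    return cdf(b) - cdf(a)

def point_mass(c):
    """Return the Dirac measure at c as a function of an open interval (a, b)."""
    def measure(a, b):
        return 1.0 if a < c < b else 0.0
    return measure

print(density(3.0))                         # a density value, not a probability
print(measure_from_density(3.0, math.inf))  # P(x > 3), a probability
delta_at_2 = point_mass(2.0)
print(delta_at_2(1.0, 3.0), delta_at_2(3.0, 4.0))
```

Read this way, an expression such as ##\delta(u - (a+b))## is shorthand for "put all the probability on the set where ##u = a + b##", which is a statement about a measure rather than about a real valued function.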
