Conditional expectation and covariance of a function of RVs

In summary: I think I understand now. Conditioning fixes y at a realized value of its distribution, so after conditioning, y is no longer treated as random and can be differentiated like an ordinary variable.
  • #1
perplexabot
Hey all, I have been doing some math lately where I need to find the conditional expectation of a function of random variables. I also at some point need to find a derivative with respect to the variable that has been conditioned. I am not sure of my work and would appreciate it if you guys can maybe have a look.

Let us assume you have the following function (an AR(1) process): [itex]z_{k+1} = x_{k+1}z_k + \epsilon_{k+1}[/itex], where
[itex]x_{k+1}=x_k + v_{k+1}[/itex],
[itex]v_{k+1}\sim\mathcal{N}(0,\sigma_v^2)[/itex],
[itex]\epsilon_{k+1}\sim\mathcal{N}(0,\sigma_{\epsilon}^2)[/itex], and the [itex]\epsilon_k[/itex] are iid.

So first, here is the conditional expectation of [itex]z_{k+1}[/itex] given [itex]x_{k+1}[/itex] and [itex]z_k[/itex]
[itex]E[z_{k+1}|x_{k+1},z_k]=x_{k+1}z_k = \mu[/itex] (is this correct?)
Here is the conditional covariance of [itex]z_{k+1}[/itex] given [itex]x_{k+1}[/itex] and [itex]z_k[/itex]
[itex]
\begin{equation}
\begin{split}
cov(z_{k+1}|x_{k+1},z_k) &= E[(z_{k+1}-\mu)(z_{k+1}-\mu)^T|x_{k+1},z_k] \\
&= E[z_{k+1}^2|x_{k+1},z_k]-E[z_{k+1}\mu|x_{k+1},z_k]-E[\mu z_{k+1}|x_{k+1},z_k]+E[\mu^2|x_{k+1},z_k] \\
&= E[z_{k+1}^2|x_{k+1},z_k]-\mu^2-\mu^2+\mu^2 \\
&= E[z_{k+1}^2|x_{k+1},z_k]-\mu^2 \\
&= E[(x_{k+1}z_k)^2 + 2x_{k+1}z_k\epsilon_{k+1}+\epsilon_{k+1}^2|x_{k+1},z_k] - \mu^2 \\
&= \mu^2 + \sigma_{\epsilon}^2 - \mu^2 \\
&= \sigma_{\epsilon}^2
\end{split}
\end{equation}
[/itex]
(is this correct?)
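For a quick numerical sanity check of these two conditional moments, here is a minimal simulation sketch (Python/NumPy; the values of [itex]x_{k+1}[/itex], [itex]z_k[/itex] and [itex]\sigma_\epsilon[/itex] are arbitrary, and [itex]\epsilon_{k+1}[/itex] is drawn independently of everything else, the assumption discussed below):

[code]
import numpy as np

rng = np.random.default_rng(0)

# Condition by fixing x_{k+1} and z_k at arbitrary numbers:
x_next, z_k = 0.8, 1.5       # hypothetical realized values
sigma_eps = 0.3              # std dev of epsilon

# Draw many realizations of z_{k+1} = x_{k+1} z_k + eps:
eps = rng.normal(0.0, sigma_eps, size=1_000_000)
z_next = x_next * z_k + eps

print(z_next.mean())         # ~ x_next * z_k = 1.2   (conditional mean, mu)
print(z_next.var())          # ~ sigma_eps**2 = 0.09  (conditional variance)
[/code]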

As for my derivative of conditioned RV, I think I will wait for your opinion/feedback on the above before I ask.

Any help and or comments will be greatly appreciated.
Thank you for reading : )

EDIT: Fixed an error
 
  • #2
perplexabot said:
[itex]E[z_{k+1}|x_{k+1},z_k]=x_{k+1}z_k = \mu[/itex] (is this correct?)
Not necessarily, because ##\epsilon_{k+1}## may be affected by ##z_{k+1}## or ##z_k##. A sufficient condition for it to be correct would be if, in addition to what you have specified, ##\epsilon_{k+1}## is independent of ##v_j## for ##1\leq j\leq k+1##. The iid condition only tells us about dependencies within the sequence of ##\epsilon##s, not between elements of the sequence and other random variables.
Here is the conditional covariance of [itex]z_{k+1}[/itex] given [itex]x_{k+1}[/itex] and [itex]z_k[/itex]
It is a variance, not a covariance. This is only a matter of naming and does not affect the calculations.
[itex]
\begin{equation}
\begin{split}
&= E[(x_{k+1}z_k)^2 + 2x_{k+1}z_k\epsilon_{k+1}+\epsilon_{k+1}^2|x_{k+1},z_k] - \mu^2 \\
&= \mu^2 + \sigma_{\epsilon}^2 - \mu^2 \\
\end{split}
\end{equation}
[/itex]
This step also relies on the above additional assumption that the ##\epsilon##s are independent of the ##v##s.

In summary, by adding the extra assumption about independence, and replacing 'covariance' by 'variance', it will become sound.
 
  • #3
andrewkirk said:
Not necessarily, because ##\epsilon_{k+1}## may be affected by ##z_{k+1}## or ##z_k##. A sufficient condition for it to be correct would be if, in addition to what you have specified, ##\epsilon_{k+1}## is independent of ##v_j## for ##1\leq j\leq k+1##. The iid condition only tells us about dependencies within the sequence of ##\epsilon##s, not between elements of the sequence and other random variables.

It is a variance, not a covariance. This is only a matter of naming and does not affect the calculations.

This step also relies on the above additional assumption that the ##\epsilon##s are independent of the ##v##s.

In summary, by adding the extra assumption about independence, and replacing 'covariance' by 'variance', it will become sound.

Thank you very much! I now realize why [itex]v_{k+1}[/itex] and [itex]\epsilon_{k+1}[/itex] must be independent and that I must include this in my assumptions. : ) That's great help!

Now if I may ask my follow-up question about differentiating with respect to a conditioned variable. Let [itex]f(x,y)[/itex] be a function of random variables [itex]x[/itex] and [itex]y[/itex]. Let us say that [itex]E[f(x,y)|y]=y^2[/itex]. What is the derivative of [itex]E[f(x,y)|y][/itex] with respect to the random variable [itex]y[/itex]? In other words, what is [itex]\frac{\partial E[f(x,y)|y]}{\partial y}[/itex]?

Is it simply, [itex]\frac{\partial y^2}{\partial y} = 2y[/itex]?

I ask this question because I am a little confused about [itex]y[/itex] after conditioning. I know that when you condition the random variable, it is setting it to some possible value of its distribution ([itex]E[f(x,y)|y]=E[f(x,y)|y=y_i][/itex], where [itex]y_i[/itex] is some instance of random variable [itex]y[/itex]). So is [itex]y[/itex] still considered when differentiating with respect to it after conditioning it (woops, that is definitely badly worded, sorry), or is it treated as a constant so that [itex]\frac{\partial y_i^2}{\partial y} = 0[/itex]? You may just disregard this paragraph : /

Thanks again.
 
  • #4
perplexabot said:
Is it simply, [itex]\frac{\partial y^2}{\partial y} = 2y[/itex]?
Yes it is.
I find these things appear clearer if one is fairly formal with one's symbolism, for instance using capital letters for random variables and lower case for items that are not random. Then the expression you wrote as ##E[f(x,y)|y] =y^2## is written as:
$$E\left[f(X,Y)\ |\ Y=y\right]=y^2$$
Since ##y## is an ordinary old non-random variable, we are free to differentiate both sides with respect to it, to get
$$\frac\partial{\partial y}E\left[f(X,Y)\ |\ Y=y\right]=\frac\partial{\partial y}y^2=2y$$
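As a concrete numerical illustration, here is a minimal sketch; the choice ##f(x,y)=xy^2## with ##X\sim\mathcal N(1,1)## independent of ##Y## is just one example for which ##E\left[f(X,Y)\ |\ Y=y\right]=y^2## exactly:

[code]
import numpy as np

rng = np.random.default_rng(1)

# One fixed sample of X, reused for every y (common random numbers
# keep the finite-difference estimate low-noise):
x = rng.normal(1.0, 1.0, size=2_000_000)

def cond_exp(y):
    # Monte Carlo estimate of E[X * y^2 | Y = y] = E[X] * y^2 = y^2
    return np.mean(x * y**2)

y, h = 1.7, 1e-3
deriv = (cond_exp(y + h) - cond_exp(y - h)) / (2 * h)  # central difference
print(deriv, 2 * y)          # both come out ~ 3.4
[/code]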
 
  • #5
perplexabot said:
Let us assume you have the following function (an AR(1) process): [itex]z_{k+1} = x_{k+1}z_k + \epsilon_{k+1}[/itex], where
[itex]x_{k+1}=x_k + v_{k+1}[/itex],
[itex]v_{k+1}\sim\mathcal{N}(0,\sigma_v^2)[/itex],
[itex]\epsilon_{k+1}\sim\mathcal{N}(0,\sigma_{\epsilon}^2)[/itex], and the [itex]\epsilon_k[/itex] are iid.

That model has a "multiplicative shock" due to ##v_k##, so should you call it an AR(1) process?

So first, here is the conditional expectation of [itex]z_{k+1}[/itex] given [itex]x_{k+1}[/itex] and [itex]z_k[/itex]
[itex]E[z_{k+1}|x_{k+1},z_k]=x_{k+1}z_k = \mu[/itex] (is this correct?)

Does your notation imply that ##\mu## is a number that can take a different value for (say) k = 5 than for k = 6?
 
  • #6
andrewkirk said:
Yes it is.
I find these things appear clearer if one is fairly formal with one's symbolism, for instance using capital letters for random variables and lower case for items that are not random. Then the expression you wrote as ##E[f(x,y)|y] =y^2## is written as:
$$E\left[f(X,Y)\ |\ Y=y\right]=y^2$$
Since ##y## is an ordinary old non-random variable, we are free to differentiate both sides with respect to it, to get
$$\frac\partial{\partial y}E\left[f(X,Y)\ |\ Y=y\right]=\frac\partial{\partial y}y^2=2y$$

Thank you! That makes more sense now. Does one ever differentiate with respect to the random variable (using the notation you provided, [itex]\frac{\partial}{\partial Y}[/itex] rather than [itex]\frac{\partial}{\partial y}[/itex])? I am thinking of cases such as the Fisher information matrix.

Stephen Tashi said:
That model has a "multiplicative shock" due to ##v_k##, so should you call it an AR(1) process?
Does your notation imply that ##\mu## is a number that can take a different value for (say) k = 5 than for k = 6?
Hmmm, I see your point, so you are saying since it has a multiplicative shock component then maybe it is an ARMA process? I'm not sure what to call it. I still think it is an AR process.

I have not assumed stationarity so I guess [itex]\mu[/itex] should be [itex]\mu_{k+1}[/itex], right? I guess my work above only applies when [itex]|x_{k+1}|< 1[/itex].

Thank you!
 
  • #7
perplexabot said:
Hmmm, I see your point, so you are saying since it has a multiplicative shock component then maybe it is an ARMA process?

No, if it has a multiplicative shock it is not an "ARMA" process. I think the terminology "ARMA" is restricted to models where the current value of the process is a linear combination of past values plus an additive noise.

In searching the web, "multiplicative shock" produces many hits dealing with economic modeling, so people have ways of dealing with multiplicative shock in particular situations. However, I haven't found any math notes on stochastic modelling that give general procedures for fitting models with multiplicative shock. By contrast, the method of dealing with ARMA models is standardized ("Box-Jenkins").

I have not assumed stationarity so I guess [itex]\mu[/itex] should be [itex]\mu_{k+1}[/itex], right?

Yes.
 
  • #8
Stephen Tashi said:
No, if it has a multiplicative shock it is not an "ARMA" process. I think the terminology "ARMA" is restricted to models where the current value of the process is a linear combination of past values plus an additive noise.

In searching the web, "multiplicative shock" produces many hits dealing with economic modeling, so people have ways of dealing with multiplicative shock in particular situations. However, I haven't found any math notes on stochastic modelling that give general procedures for fitting models with multiplicative shock. By contrast, the method of dealing with ARMA models is standardized ("Box-Jenkins").
Yes.
First, thank you for all the help and clarification.

ARMA, as you said, is where the current value of the process is a linear combination of past values plus an additive noise. BUT, if I am not mistaken, ARMA also depends on past shocks, or past noise values. This can be seen on Wikipedia. I still think that the model above is an AR process. It is an AR process which has random variables as parameters. At least, that is how I am looking at it.
 
  • #9
perplexabot said:
Does one ever differentiate with respect to the random variable (using the notation you provided, [itex]\frac{\partial}{\partial Y}[/itex] rather than [itex]\frac{\partial}{\partial y}[/itex])? I am thinking of cases such as the Fisher information matrix.
A real random variable is actually a function from the sample space ##\Omega## to the real numbers. One cannot perform differentiation, as understood in its 'vanilla' form, with respect to a function. There are extensions, such as in Calculus of Variations, where one can differentiate with respect to functions. There is also the Radon-Nikodym derivative, which is a generalisation of the usual notion of derivative, that is used in probability theory. The latter is important in the analysis of Wiener processes (Brownian Motions), which are closely related to your example. However, IIRC, the Radon-Nikodym derivative differentiates with respect to a measure, not a random variable.

I think most simple cases in probability theory can be expressed in terms of vanilla differentiation with respect to non-random variables.
 
  • #10
perplexabot said:
First, thank you for all the help and clarification.

ARMA, as you said, is where the current value of the process is a linear combination of past values plus an additive noise. BUT, if I am not mistaken, ARMA also depends on past shocks, or past noise values.

If you have multiplicative noise, a linear combination of past values gives a dependence on products of past noise values. In an ARMA process the expression for the current value doesn't involve products of past noise values.
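For instance, a single substitution step using the model above makes the products explicit:
$$z_{k+1}=(x_k+v_{k+1})z_k+\epsilon_{k+1}=(x_k+v_{k+1})(x_k z_{k-1}+\epsilon_k)+\epsilon_{k+1},$$
which already contains the cross term ##v_{k+1}\epsilon_k##; an ARMA expression is linear in the noise terms and has no such products.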
 
  • #11
andrewkirk said:
A real random variable is actually a function from the sample space ##\Omega## to the real numbers. One cannot perform differentiation, as understood in its 'vanilla' form, with respect to a function. There are extensions, such as in Calculus of Variations, where one can differentiate with respect to functions. There is also the Radon-Nikodym derivative, which is a generalisation of the usual notion of derivative, that is used in probability theory. The latter is important in the analysis of Wiener processes (Brownian Motions), which are closely related to your example. However, IIRC, the Radon-Nikodym derivative differentiates with respect to a measure, not a random variable.

I think most simple cases in probability theory can be expressed in terms of vanilla differentiation with respect to non-random variables.
Wow, that definitely expanded my general knowledge of derivatives. I have never even heard of these extensions you speak of. Very cool! Thank you : )

Stephen Tashi said:
If you have multiplicative noise, a linear combination of past values gives a dependence on products of past noise values. In an ARMA process the expression for the current value doesn't involve products of past noise values.

Once again thank you for directing me in the right direction. Your help is much appreciated.

Thank you all.
 
  • #12
I have one more question if I may. Let us use the following equation (the same one from above): [itex]x_{k+1}=x_k + v_{k+1}[/itex], where once again, [itex]v_{k+1}\sim\mathcal{N}(0,\sigma_v^2)[/itex]. It can be shown that
[itex]E[x_{k+1}|x_k]=x_k[/itex]
and that
[itex]var(x_{k+1}|x_k)=\sigma_v^2[/itex].
Now, substituting and setting up the normal distribution, we get:
[itex]p(x_{k+1}|x_k)=\frac{1}{\sqrt{2\sigma_v^2\pi}}e^{-\frac{(x_{k+1}-x_k)^2}{2\sigma_v^2}}[/itex]

Now for my question, can you do the following substitution:
[itex]p(x_{k+1}|x_k)=\frac{1}{\sqrt{2\sigma_v^2\pi}}e^{-\frac{((x_k+v_{k+1})-x_k)^2}{2\sigma_v^2}}=\frac{1}{\sqrt{2\sigma_v^2\pi}}e^{-\frac{v_{k+1}^2}{2\sigma_v^2}}[/itex] ?

Surely that substitution (or the following subtraction) is wrong. Why is it wrong? I have a feeling that the subtraction of the two [itex]x_k[/itex] does not apply because one is a random variable while the other is a conditioned version of the random variable. Is that what is wrong?

Thank you.
 
  • #13
@perplexabot Again, resorting to formalism can help us to clarify what is otherwise a fairly confusing move.

What you have written as ##p(x_{k+1}|x_k)## is actually
$$p_{X_{k+1}|(X_k=x_k)}(x_{k+1})$$
where ##p_{X_{k+1}|(X_k=x_k)}## is a function from ##\mathbb R## to ##\mathbb R## that is the conditional probability density function of the random variable ##X_{k+1}##, conditioned on the information that ##X_k=x_k##. Here ##x_k,x_{k+1}## are ordinary, non-random variables. Take careful note of which items are upper and which are lower case. It's important.

So we have
$$p_{X_{k+1}|(X_k=x_k)}(x_{k+1})=\frac{1}{\sqrt{2\sigma_v^2\pi}}e^{-\frac{(x_{k+1}-x_k)^2}{2\sigma_v^2}}$$
This is an ordinary old non-random equation, with the only bit that relates to random variables at all being that they were used to identify the function ##p_{X_{k+1}|(X_k=x_k)}##. But having identified that function, it is now a perfectly ordinary, non-random, function.

Hence, if we define ##v_{k+1}## to be ##x_{k+1}-x_k## it follows that:
$$p_{X_{k+1}|(X_k=x_k)}(x_{k+1})=\frac{1}{\sqrt{2\sigma_v^2\pi}}e^{-\frac{v_{k+1}^2}{2\sigma_v^2}}$$
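A quick numerical check of that substitution (a sketch using scipy.stats; the particular values of ##x_k##, ##\sigma_v## and ##v_{k+1}## are arbitrary):

[code]
from scipy.stats import norm

sigma_v = 0.5
x_k = 2.0                    # a fixed realized value of X_k

def p_cond(x_next):
    # density of X_{k+1} | X_k = x_k, evaluated at the ordinary number x_next
    return norm.pdf(x_next, loc=x_k, scale=sigma_v)

v = 0.3                      # the ordinary number v_{k+1} = x_{k+1} - x_k
print(p_cond(x_k + v))                       # these two agree:
print(norm.pdf(v, loc=0.0, scale=sigma_v))   # N(0, sigma_v^2) density at v
[/code]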
 
  • #14
perplexabot said:
I have a feeling that the subtraction of the two [itex]x_k[/itex] does not apply because one is a random variable while the other is a conditioned version of the random variable. Is that what is wrong?

To add to what andrewkirk wrote, there is a saying:

Random variables are not random and they are not variables.

A "random variable" is defined by a distribution on a probability space, not by a single value. So if ##X## is a random variable, it has a distribution and this distribution is not "randomly" changing from one distribution to another.

Furthermore, if ##X## denotes a "random variable" then it is not a "variable" in the ordinary sense of the word - i.e. it is not a symbol that represents a single (but perhaps unknown) number.

A probability density function is not "a function of a random variable". It is a function of ordinary variables that represent numbers. The interpretation of the value of a probability density function can be related to an event in the probability space of the associated random variable.

For a random variable ##X## we can define a "function of a random variable" like ## Y = X + e^{(5+X)}##. Defining ##Y## in this manner makes ##Y## another random variable - and not a function of ordinary variables.

It would be fair to say that we use the notation for a function of ordinary variables (##y = f(x) = x + e^{(5+x)}##) when we give the definition of ##Y##. However, when you define an ordinary function ##f(x)##, you don't automatically get that there is some other function associated with ##f(x)## that is a "distribution" of ##f(x)##. When you define a function ##f(X)## of a random variable ##X##, you do automatically know that ##f(X)## has some distribution.
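In a simulation this is very tangible: pushing samples of ##X## through ##f## produces samples of ##Y##, so the distribution of ##Y## comes along for free. A Python/NumPy sketch (the sample size and the statistics printed are arbitrary choices):

[code]
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=100_000)   # samples of X ~ N(0, 1)
y = x + np.exp(5 + x)                    # samples of Y = X + e^(5+X)

# Any statistic of Y's distribution is now just a statistic of the samples:
print(np.mean(y), np.median(y))
[/code]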

To enforce those distinctions, some people prefer to denote random variables by capital letters and possible realizations of those variables by lower case letters. So instead of ##p(x)## they write ##p(X=x)##.

This isn't a perfect scheme because continuous probability density functions have values that are probability densities - not probabilities. (It is analogous to the difference between the physical units of kilograms (which is a mass) vs the physical units of kilograms per meter (which is a linear mass density).)

For example, if you use ##p(...)## to denote "the probability of" then it's tempting to write ##p(X = x) =\frac{1}{\sqrt{2\sigma^2\pi}} \ e^{-\frac{x^2}{2\sigma^2}}##. However, the normal density function evaluated at ##x## doesn't give you the probability that ##X## is exactly equal to ##x##. Instead, the density function gives the probability density at ##x##. (By analogy, a physical object cannot have a mass of 5 kg at a point, but it can have a mass density of 5 kg/meter at a point.)
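The distinction is easy to demonstrate numerically (a short scipy sketch; the small ##\sigma=0.01## is chosen deliberately so that the density exceeds 1):

[code]
from scipy.stats import norm

X = norm(loc=0.0, scale=0.01)            # a very concentrated normal

print(X.pdf(0.0))                        # ~ 39.9: a density, not a probability
print(X.cdf(0.005) - X.cdf(-0.005))      # ~ 0.38: probability of an interval
[/code]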

Using the notation convention in andrewkirk's post, we write ##p_Y(x)## to denote the probability density of the random variable ##Y## evaluated at the value given by the ordinary variable ##x##.

So ##p_{X_{k+1}|(X_k=x_k)}(x_{k+1})## denotes the probability density of the random variable "##X_{k+1}| (X_k=x_k)##" evaluated at the ordinary variable denoted by ##x_{k+1}##. And to completely appreciate that notation, you have to understand that "conditioning" is another way to define a new random variable in terms of other random variables.

Not all texts use such precise notation.
 
  • #15
Wow! Just WOW! I cannot explain how much both your last posts have helped me. You have cleared up some serious amount of confusion I have been carrying for a while. I am very thankful for your time. Tremendously valuable information!

Thank you andrewkirk!
Thank you Stephen Tashi!
 

What is conditional expectation?

Conditional expectation is a statistical concept that refers to the expected value of a random variable given certain conditions or information. It is denoted as E[X|Y], where X is the random variable and Y is the condition or information.

How is conditional expectation calculated?

For a discrete random variable, conditional expectation is calculated as E[X|Y=y] = ∑ₓ x · P(X=x|Y=y), where x ranges over the possible values of the random variable X and P(X=x|Y=y) is the probability that X takes the value x given the condition Y=y. For continuous variables, the sum is replaced by an integral against the conditional density.
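A toy computation of this formula (the pmf values are made up for illustration):

[code]
import numpy as np

xs = np.array([0, 1, 2])            # possible values of X
p = np.array([0.2, 0.5, 0.3])       # hypothetical conditional pmf P(X=x | Y=y)
print(np.sum(xs * p))               # E[X | Y=y] = 1.1
[/code]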

What is covariance of functions of random variables?

Covariance of functions of random variables refers to the measure of how two functions of random variables vary together. It is denoted as Cov[f(X), g(Y)] and is calculated using the formula Cov[f(X), g(Y)] = E[(f(X) - E[f(X)]) * (g(Y) - E[g(Y)])].

How is covariance related to correlation?

Covariance and correlation are related measures, but they have different interpretations. While covariance measures the degree of linear relationship between two variables, correlation measures the strength and direction of the linear relationship. Correlation is a standardized version of covariance, with values ranging from -1 to 1, while covariance can have any value.

Why is understanding conditional expectation and covariance important in statistics?

Conditional expectation and covariance are important concepts in statistics because they help in understanding the relationship between variables and in making predictions. They are also used in various statistical models and techniques, such as regression analysis and time series analysis, to analyze data and make informed decisions.
