# Conditional expectation and covariance of function of RVs

perplexabot
Gold Member
Hey all, I have been doing some math lately where I need to find the conditional expectation of a function of random variables. I also at some point need to find a derivative with respect to the variable that has been conditioned. I am not sure of my work and would appreciate it if you guys can maybe have a look.

Let us assume you have the following function (an AR(1) process): $z_{k+1} = x_{k+1}z_k + \epsilon_{k+1}$, where
$x_{k+1}=x_k + v_{k+1}$,
$v_{k+1}\sim\mathcal{N}(0,\sigma_v^2)$,
$\epsilon_{k+1}\sim\mathcal{N}(0,\sigma_{\epsilon}^2)$ and $\epsilon$ is iid.

So first, here is the conditional expectation of $z_{k+1}$ given $x_{k+1}$ and $z_k$
$E[z_{k+1}|x_{k+1},z_k]=x_{k+1}z_k = \mu$ (is this correct?)
Here is the conditional covariance of $z_{k+1}$ given $x_{k+1}$ and $z_k$
$\begin{equation} \begin{split} cov(z_{k+1}|x_{k+1},z_k) &= E[(z_{k+1}-\mu)(z_{k+1}-\mu)^T|x_{k+1},z_k] \\ &= E[z_{k+1}^2|x_{k+1},z_k]-E[z_{k+1}\mu|x_{k+1},z_k]-E[\mu z_{k+1}|x_{k+1},z_k]+E[\mu^2|x_{k+1},z_k] \\ &= E[z_{k+1}^2|x_{k+1},z_k]-\mu^2-\mu^2+\mu^2 \\ &= E[z_{k+1}^2|x_{k+1},z_k]-\mu^2 \\ &= E[(x_{k+1}z_k)^2 + 2x_{k+1}z_k\epsilon_{k+1}+\epsilon_{k+1}^2|x_{k+1},z_k] - \mu^2 \\ &= \mu^2 + \sigma_{\epsilon}^2 - \mu^2 \\ &= \sigma_{\epsilon}^2 \end{split} \end{equation}$
(is this correct???)

As for my derivative of conditioned RV, I think I will wait for your opinion/feedback on the above before I ask.

Any help and/or comments will be greatly appreciated.
Thank you for reading : )

EDIT: Fixed an error

andrewkirk
Homework Helper
Gold Member
$E[z_{k+1}|x_{k+1},z_k]=x_{k+1}z_k = \mu$ (is this correct?)
Not necessarily, because ##\epsilon_{k+1}## may be affected by ##x_{k+1}## or ##z_k##. A sufficient condition for it to be correct would be if, in addition to what you have specified, ##\epsilon_{k+1}## is independent of ##v_j## for ##1\leq j\leq k+1##. The iid condition only tells us about dependencies within the sequence of ##\epsilon##s, not between elements of the sequence and other random variables.
Here is the conditional covariance of $z_{k+1}$ given $x_{k+1}$ and $z_k$
It is a variance, not a covariance. This is only a matter of naming and does not affect the calculations.
$\begin{equation} \begin{split} &= E[(x_{k+1}z_k)^2 + 2x_{k+1}z_k\epsilon_{k+1}+\epsilon_{k+1}^2|x_{k+1},z_k] - \mu^2 \\ &= \mu^2 + \sigma_{\epsilon}^2 - \mu^2 \end{split} \end{equation}$
This step also relies on the above additional assumption that the ##\epsilon##s are independent of the ##v##s.

In summary, by adding the extra assumption about independence, and replacing 'covariance' by 'variance', it will become sound.
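
A quick numerical sanity check of this result can be sketched as follows. It holds ##x_{k+1}## and ##z_k## fixed at arbitrary illustrative values (assumptions, not taken from the thread), draws ##\epsilon_{k+1}## independently of them, and compares the sample mean and variance of ##z_{k+1}## with ##x_{k+1}z_k## and ##\sigma_\epsilon^2##. The function name `simulate_z_next` is hypothetical.

```python
import random

def simulate_z_next(x_next, z_k, sigma_eps, n_samples, seed=0):
    """Draw z_{k+1} = x_{k+1} * z_k + eps_{k+1} with x_{k+1} and z_k held fixed."""
    rng = random.Random(seed)
    # eps_{k+1} is drawn independently of x_{k+1} and z_k (the extra assumption).
    return [x_next * z_k + rng.gauss(0.0, sigma_eps) for _ in range(n_samples)]

# Illustrative values (assumptions, not from the thread).
x_next, z_k, sigma_eps = 0.8, 2.5, 0.5
samples = simulate_z_next(x_next, z_k, sigma_eps, n_samples=100_000)

# Sample estimates of the conditional mean and conditional variance.
cond_mean = sum(samples) / len(samples)
cond_var = sum((s - cond_mean) ** 2 for s in samples) / len(samples)

print(f"sample mean     = {cond_mean:.4f}  (theory: {x_next * z_k:.4f})")
print(f"sample variance = {cond_var:.4f}  (theory: {sigma_eps ** 2:.4f})")
```

If ##\epsilon_{k+1}## were instead generated in a way that depends on ##x_{k+1}## or ##z_k##, the two sample values would generally no longer match the theoretical ones.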

perplexabot
Gold Member
Thank you very much! I now realize why $v_{k+1}$ and $\epsilon_{k+1}$ must be independent and that I must include this in my assumptions. : ) That's great help!

Now if I may ask my follow up question about differentiating with respect to a conditioned variable. Let $f(x,y)$ be a function of random variables $x$ and $y$. Let us say that the $E[f(x,y)|y]=y^2$. What is the derivative of $E[f(x,y)|y]$ with respect to random variable $y$? In other words, what is the answer to $\frac{\partial E[f(x,y)|y]}{\partial y}$?

Is it simply, $\frac{\partial y^2}{\partial y} = 2y$?

I ask this question because I am a little confused about $y$ after conditioning. I know that when you condition the random variable, it is setting it to some possible value of its distribution ($E[f(x,y)|y]=E[f(x,y)|y=y_i]$, where $y_i$ is some instance of random variable $y$). So is $y$ still considered when differentiating with respect to it after conditioning it (woops, that is definitely badly worded, sorry), or is it treated as a constant so that $\frac{\partial y_i^2}{\partial y} = 0$? You may just disregard this paragraph : /

Thanks again.

andrewkirk
Homework Helper
Gold Member
Is it simply, $\frac{\partial y^2}{\partial y} = 2y$?
Yes it is.
I find these things appear clearer if one is fairly formal with one's symbolism, for instance using capital letters for random variables and lower case for items that are not random. Then the expression you wrote as ##E[f(x,y)|y] =y^2## is written as:
$$E\left[f(X,Y)\ |\ Y=y\right]=y^2$$
Since ##y## is an ordinary old non-random variable, we are free to differentiate both sides with respect to it, to get
$$\frac\partial{\partial y}E\left[f(X,Y)\ |\ Y=y\right]=\frac\partial{\partial y}y^2=2y$$
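
As a rough numerical illustration, here is a sketch that differentiates a Monte Carlo estimate of ##g(y)=E[f(X,Y)\ |\ Y=y]## with respect to the ordinary variable ##y##. The particular ##f(x,y)=y^2+xy## with ##X\sim\mathcal N(0,1)## independent of ##Y## is a hypothetical choice, picked only because it gives ##g(y)=y^2##; the helper name `cond_expectation` is likewise an assumption.

```python
import random

def f(x, y):
    # Hypothetical example: since E[X] = 0, E[f(X, Y) | Y = y] = y**2.
    return y * y + x * y

def cond_expectation(y, x_samples):
    """Monte Carlo estimate of g(y) = E[f(X, Y) | Y = y], with X independent of Y."""
    return sum(f(x, y) for x in x_samples) / len(x_samples)

rng = random.Random(1)
x_samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

y, h = 1.5, 1e-3
# Central difference in the ordinary (non-random) variable y, reusing the same
# X samples on both sides so the sampling noise largely cancels.
dg_dy = (cond_expectation(y + h, x_samples) - cond_expectation(y - h, x_samples)) / (2 * h)

print(f"estimated derivative = {dg_dy:.3f}, 2*y = {2 * y:.3f}")
```

Here `y` is just a number passed to a function, which is exactly why differentiating with respect to it is unproblematic.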

Stephen Tashi
Let us assume you have the following function (an AR(1) process): $z_{k+1} = x_{k+1}z_k + \epsilon_{k+1}$, where
$x_{k+1}=x_k + v_{k+1}$,
$v_{k+1}\sim\mathcal{N}(0,\sigma_v^2)$,
$\epsilon_{k+1}\sim\mathcal{N}(0,\sigma_{\epsilon}^2)$ and $\epsilon$ is iid.
That model has a "multiplicative shock" due to ##v_k##, so should you call it an AR(1) process?

So first, here is the conditional expectation of $z_{k+1}$ given $x_{k+1}$ and $z_k$
$E[z_{k+1}|x_{k+1},z_k]=x_{k+1}z_k = \mu$ (is this correct?)
Does your notation imply that ##\mu## is a number that can be a different value for (say) k = 5 than it is for k = 6?

perplexabot
Gold Member
Thank you! That makes more sense now. Does one ever differentiate with respect to the random variable (using the notation you provided, $\frac{\partial}{\partial Y}$ rather than $\frac{\partial}{\partial y}$)? I am thinking of cases such as the Fisher information matrix.

That model has a "multiplicative shock" due to ##v_k##, so should you call it an AR(1) process?

Does your notation imply that ##\mu## is a number that can be a different value for (say) k = 5 than it is for k = 6?
Hmmm, I see your point, so you are saying that since it has a multiplicative shock component, maybe it is an ARMA process? I'm not sure what to call it. I still think it is an AR process.

I have not assumed stationarity so I guess $\mu$ should be $\mu_{k+1}$, right? I guess my work above only applies when $|x_{k+1}|< 1$.

Thank you!

Stephen Tashi
Hmmm, I see your point, so you are saying that since it has a multiplicative shock component, maybe it is an ARMA process?
No, if it has a multiplicative shock it is not an "ARMA" process. I think the terminology "ARMA" is restricted to models where the current value of the process is a linear combination of past values plus an additive noise.

In searching the web, "multiplicative shock" produces many hits dealing with economic modeling, so people have ways of dealing with multiplicative shock in particular situations. However, I haven't found any math notes on stochastic modelling that give general procedures for fitting models with multiplicative shock. By contrast, the method of dealing with ARMA models is standardized ("Box-Jenkins").

I have not assumed stationarity so I guess $\mu$ should be $\mu_{k+1}$, right?
Yes.

perplexabot
Gold Member
First, thank you for all the help and clarification.

ARMA, as you said, is where the current value of the process is a linear combination of past values plus an additive noise. BUT, if I am not mistaken, ARMA also depends on past shocks, or past noise values. This can be seen in wiki. I still think that the model above is an AR process. It is an AR process which has random variables as parameters. At least, that is how I am looking at it.

andrewkirk
Homework Helper
Gold Member
Does one ever differentiate with respect to the random variable (using the notation you provided, $\frac{\partial}{\partial Y}$ rather than $\frac{\partial}{\partial y}$)? I am thinking of cases such as the Fisher information matrix.
A real random variable is actually a function from the sample space ##\Omega## to the real numbers. One cannot perform differentiation, as understood in its 'vanilla' form, with respect to a function. There are extensions, such as in Calculus of Variations, where one can differentiate with respect to functions. There is also the Radon-Nikodym derivative, which is a generalisation of the usual notion of derivative, that is used in probability theory. The latter is important in the analysis of Wiener processes (Brownian Motions), which are closely related to your example. However, IIRC, the Radon-Nikodym derivative differentiates with respect to a measure, not a random variable.

I think most simple cases in probability theory can be expressed in terms of vanilla differentiation with respect to non-random variables.
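
The Fisher information mentioned in the question appears to fit this pattern. As a sketch for a scalar parameter, under the usual regularity conditions (the symbols ##\theta##, ##p(x;\theta)##, ##s## and ##I(\theta)## are introduced here for illustration and do not come from the thread), the derivative is taken with respect to the non-random parameter ##\theta##, and the random variable ##X## is substituted only afterwards:

$$s(x;\theta)=\frac{\partial}{\partial\theta}\log p(x;\theta),\qquad I(\theta)=E\left[s(X;\theta)^2\right]$$

So ##s## is an ordinary function of the ordinary variables ##x## and ##\theta##, and ##s(X;\theta)## is a random variable obtained by plugging ##X## into it.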

Stephen Tashi
First, thank you for all the help and clarification.

ARMA, as you said, is where the current value of the process is a linear combination of past values plus an additive noise. BUT, if I am not mistaken, ARMA also depends on past shocks, or past noise values.
If you have multiplicative noise, a linear combination of past values gives a dependence on products of past noise values. In an ARMA process the expression for the current value doesn't involve products of past noise values.
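
To make this concrete for the model in this thread, here is a sketch of two steps of the recursion, using only the definitions ##z_{k+1}=x_{k+1}z_k+\epsilon_{k+1}## and ##x_{k+1}=x_k+v_{k+1}## from the first post:

$$\begin{aligned} z_{k+2} &= x_{k+2}z_{k+1}+\epsilon_{k+2} \\ &= (x_{k+1}+v_{k+2})(x_{k+1}z_k+\epsilon_{k+1})+\epsilon_{k+2} \\ &= x_{k+1}^2z_k+x_{k+1}\epsilon_{k+1}+v_{k+2}x_{k+1}z_k+v_{k+2}\epsilon_{k+1}+\epsilon_{k+2} \end{aligned}$$

The term ##v_{k+2}\epsilon_{k+1}## is a product of two noise values, which is the kind of term an ARMA recursion does not contain.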

perplexabot
Gold Member
Wow, that definitely expanded my general knowledge of derivatives. I have never even heard of these extensions you speak of. Very cool! Thank you : )

If you have multiplicative noise, a linear combination of past values gives a dependence on products of past noise values. In an ARMA process the expression for the current value doesn't involve products of past noise values.
Once again thank you for directing me in the right direction. Your help is much appreciated.

Thank you all.

perplexabot
Gold Member
I have one more question if I may. Let us use the following equation (the same one from above): $x_{k+1}=x_k + v_{k+1}$, where once again, $v_{k+1}\sim\mathcal{N}(0,\sigma_v^2)$. It can be shown that
$E[x_{k+1}|x_k]=x_k$
and that
$cov(x_{k+1}|x_k)=\sigma_v^2$.
Now, substituting and setting up the normal distribution, we get:
$p(x_{k+1}|x_k)=\frac{1}{\sqrt{2\sigma_v^2\pi}}e^{-\frac{(x_{k+1}-x_k)^2}{2\sigma_v^2}}$

Now for my question, can you do the following substitution:
$p(x_{k+1}|x_k)=\frac{1}{\sqrt{2\sigma_v^2\pi}}e^{-\frac{((x_k+v_{k+1})-x_k)^2}{2\sigma_v^2}}=\frac{1}{\sqrt{2\sigma_v^2\pi}}e^{-\frac{(v_{k+1})^2}{2\sigma_v^2}}$ ?

Surely that substitution (or the following subtraction) is wrong. Why is it wrong? I have a feeling that the subtraction of the two $x_k$ does not apply because one is a random variable while the other is a conditioned version of the random variable. Is that what is wrong?

Thank you.

andrewkirk
Homework Helper
Gold Member
@perplexabot Again, resorting to formalism can help us to clarify what is otherwise a fairly confusing move.

What you have written as ##p(x_{k+1}|x_k)## is actually
$$p_{X_{k+1}|(X_k=x_k)}(x_{k+1})$$
where ##p_{X_{k+1}|(X_k=x_k)}## is a function from ##\mathbb R## to ##\mathbb R## that is the conditional probability density function of the random variable ##X_{k+1}##, conditioned on the information that ##X_k=x_k##. Here ##x_k,x_{k+1}## are ordinary, non-random variables. Take careful note of which items are upper and which are lower case. It's important.

So we have
$$p_{X_{k+1}|(X_k=x_k)}(x_{k+1})=\frac{1}{\sqrt{2\sigma_v^2\pi}}e^{-\frac{(x_{k+1}-x_k)^2}{2\sigma_v^2}}$$
This is an ordinary old non-random equation, with the only bit that relates to random variables at all being that they were used to identify the function ##p_{X_{k+1}|(X_k=x_k)}##. But having identified that function, it is now a perfectly ordinary, non-random, function.

Hence, if we define ##v_{k+1}## to be ##x_{k+1}-x_k## it follows that:
$$p_{X_{k+1}|(X_k=x_k)}(x_{k+1})=\frac{1}{\sqrt{2\sigma_v^2\pi}}e^{-\frac{v_{k+1}^2}{2\sigma_v^2}}$$
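
One way to see that this is an ordinary function of ordinary numbers is to code it up. The sketch below uses the standard normal density (whose exponent carries a negative sign), computes the plain number ##v=x_{k+1}-x_k## inside the function, and checks numerically that the density integrates to 1 over ##x_{k+1}##. The name `transition_density` and the values of ##x_k## and ##\sigma_v## are illustrative assumptions.

```python
import math

def transition_density(x_next, x_k, sigma_v):
    """Normal density of X_{k+1} given X_k = x_k, evaluated at the number x_next."""
    v = x_next - x_k  # an ordinary number, not a random variable
    return math.exp(-v * v / (2.0 * sigma_v ** 2)) / math.sqrt(2.0 * math.pi * sigma_v ** 2)

# Illustrative values (assumptions, not from the thread).
x_k, sigma_v = 1.2, 0.7

# Trapezoidal rule over x_k +/- 8 standard deviations.
n = 4000
lo, hi = x_k - 8 * sigma_v, x_k + 8 * sigma_v
step = (hi - lo) / n
grid = [lo + i * step for i in range(n + 1)]
values = [transition_density(x, x_k, sigma_v) for x in grid]
total = step * (sum(values) - 0.5 * (values[0] + values[-1]))

print(f"integral over x_next = {total:.6f}")
```

No random number is drawn anywhere in this code: the randomness was only used to decide which function to write down.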

Stephen Tashi
I have a feeling that the subtraction of the two $x_k$ does not apply because one is a random variable while the other is a conditioned version of the random variable. Is that what is wrong?
To add to what andrewkirk wrote, there is a saying:

Random variables are not random and they are not variables.
A "random variable" is defined by a distribution on a probability space, not by a single value. So if ##X## is a random variable, it has a distribution and this distribution is not "randomly" changing from one distribution to another.

Furthermore, if ##X## denotes a "random variable" then it is not a "variable" in the ordinary sense of the word - i.e. it is not a symbol that represents a single (but perhaps unknown) number.

A probability density function is not "a function of a random variable". It is a function of ordinary variables that represent numbers. The interpretation of the value of a probability density function can be related to an event in the probability space of the associated random variable.

For a random variable ##X## we can define a "function of a random variable" like ## Y = X + e^{(5+X)}##. Defining ##Y## in this manner makes ##Y## another random variable - and not a function of ordinary variables.

It would be fair to say that we use the notation for a function of ordinary variables ( ## y = f(x) = x + e^{(5+x)} ##) when we give the definition of ##Y##. However, when you define an ordinary function ##f(x)##, you don't automatically get some other function associated with ##f(x)## that is a "distribution" of ##f(x)##. When you define a function ##f(X)## of a random variable ##X##, you do automatically know that ##f(X)## has some distribution.

To enforce those distinctions, some people prefer to denote random variables by capital letters and possible realizations of those variables by lower case letters. So instead of ##p( x) ## they write ##p(X=x)##.

This isn't a perfect scheme because continuous probability density functions have values that are probability densities - not probabilities. (It is analogous to the difference between the physical unit of kilograms, which measures mass, and the physical unit of kilograms per meter, which measures linear mass density.)

For example, if you use ##p(...)## to denote "the probability of" then it's tempting to write ##p(X = x) =\frac{1}{\sqrt{2\sigma^2\pi}} \ e^{ -\frac{x^2}{2\sigma^2}}##. However, the normal density function evaluated at ##x## doesn't give you the probability that ##X## is exactly equal to ##x##. Instead, the density function gives the probability density at ##x##. (By analogy, a physical object cannot have a mass of 5 kg at a point, but it can have a mass density of 5 kg/meter at a point.)
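
A small numerical sketch of the density-versus-probability distinction, for a zero-mean normal with an illustrative ##\sigma = 0.1## (an assumption chosen so the effect is visible): the density at 0 is larger than 1, so it cannot be a probability, while the probability of an interval, obtained from the cumulative distribution function, stays between 0 and 1. The helper names `normal_pdf` and `normal_cdf` are hypothetical.

```python
import math

def normal_pdf(x, sigma):
    """Density of a zero-mean normal with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

def normal_cdf(x, sigma):
    """P(X <= x) for a zero-mean normal with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

sigma = 0.1  # illustrative value (an assumption)

density_at_zero = normal_pdf(0.0, sigma)  # a density, in units of "per unit of x"
prob_interval = normal_cdf(0.05, sigma) - normal_cdf(-0.05, sigma)  # a genuine probability

print(f"density at 0        = {density_at_zero:.4f}")
print(f"P(-0.05 < X < 0.05) = {prob_interval:.4f}")
```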

Using the notation convention in andrewkirk's post, we write ##p_Y(x)## to denote the probability density of the random variable ##Y## evaluated at the value given by the ordinary variable ##x##.

So ##p_{X_{k+1}|(X_k=x_k)}(x_{k+1})## denotes the probability density of the random variable "##X_{k+1}| (X_k=x_k)##" evaluated at the ordinary variable denoted by ##x_{k+1}##. And to completely appreciate that notation, you have to understand that "conditioning" is another way to define a new random variable in terms of other random variables.

Not all texts use such precise notation.

perplexabot
Gold Member
Wow! Just WOW!!! I cannot explain how much both your last posts have helped me. You have cleared up some serious amount of confusion I have been carrying for a while. I am very thankful for your time. Tremendously valuable information!

Thank you andrewkirk!
Thank you Stephen Tashi!