Adding Multivariate Normal Distributions


Discussion Overview

The discussion centers around the properties and manipulations of multivariate normal distributions, particularly in the context of regression analysis. Participants explore the distribution of residuals derived from observed data and fitted values, questioning the validity of certain mathematical operations involving covariance matrices.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant asserts that the residuals ##\hat{\vec e}## can be expressed as a multivariate normal distribution, suggesting that the distribution of ##\vec Y## minus the distribution of ##\hat{\vec Y}## leads to ##\hat{\vec e}\sim MVN(\vec 0, \sigma^2(I-H))##.
  • Another participant challenges the validity of subtracting correlation matrices directly, indicating that such operations typically involve more complex relationships, including Cholesky decompositions.
  • Clarifications are provided regarding the notation used, with one participant explaining that ##\mathbf X## is a design matrix in a regression context, and not a square matrix, which influences the interpretation of the multivariate normal distribution.
  • There is a discussion about the implications of the properties of the hat matrix ##H##, including its rank and eigenvalues, and how these properties affect the variance of the error term vector.
  • One participant emphasizes the importance of understanding the variance of the error term and its relationship to the design matrix and the covariance structure of the data.
  • Another participant suggests using singular value decomposition to gain deeper insights into the underlying structure of the matrices involved.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of certain mathematical operations involving covariance matrices. There is no consensus on the validity of the initial claim regarding the distribution of residuals, and the discussion remains unresolved regarding the correct approach to handling these distributions.

Contextual Notes

Participants note that the notation used in the discussion may not be standard, which contributes to some confusion. Additionally, the rank deficiencies of the matrices involved are highlighted as a potential concern when applying the multivariate Gaussian distribution.

FallenApple
So ##\vec Y \sim MVN(X\vec\beta, \sigma^2 I)##

and

##\hat{\vec Y} \sim MVN(X\vec\beta, \sigma^2 H)##

and I want to show

##\hat{\vec e} \sim MVN(\vec 0, \sigma^2(I-H))##,
where ##\hat{\vec e}## is the vector of observed residuals (##\hat{\vec e} = \vec Y - \hat{\vec Y} = (I-H)\vec Y##).

And ##H = X(X'X)^{-1}X'## is the projection matrix, where the prime denotes transpose.

Since ##\hat{\vec e} = \vec Y - \hat{\vec Y}##, it's just the distribution of ##\vec Y## minus the distribution of ##\hat{\vec Y}##.

So ##MVN(X\vec\beta, \sigma^2 I) - MVN(X\vec\beta, \sigma^2 H) = MVN(\vec 0, \sigma^2(I-H))##.

This uses the fact that ##X\vec\beta - X\vec\beta = \vec 0## and ##\sigma^2 I - \sigma^2 H = \sigma^2(I-H)##, by simple vector and matrix subtraction.

Can one subtract that simply? I mean, ##\vec Y## and ##\hat {\vec Y}## are not independent, so how can I just total up the expectations and the variances like that?
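
For what it's worth, a quick simulation with a made-up design matrix (numbers chosen arbitrarily, not from any attachment) does give an empirical residual covariance close to ##\sigma^2(I-H)##, so the question is really whether the reasoning above is a legitimate way to get there:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 6, 2, 1.5                       # made-up sizes for illustration
X = rng.normal(size=(n, p))                   # hypothetical design matrix
beta = rng.normal(size=p)
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat (projection) matrix

reps = 200_000
eps = sigma * rng.normal(size=(reps, n))      # errors ~ MVN(0, sigma^2 I)
Y = X @ beta + eps
resid = Y @ (np.eye(n) - H).T                 # e_hat = (I - H) Y, one row per replication

print(np.round(np.cov(resid, rowvar=False), 3))   # empirical covariance of the residuals ...
print(np.round(sigma**2 * (np.eye(n) - H), 3))    # ... is close to sigma^2 (I - H)
```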
 
One cannot subtract correlation matrices like that. There will be square roots and quadratic combinations involved, using Cholesky decompositions of the correlation matrices.

It's hard to be sure though, because your notation appears to be non-standard.
What is meant by ##\vec Y\sim MVN(\mathbf X\vec \beta,\sigma^2\mathbf H)##? In particular, what is ##\mathbf X##? If ##\mathbf X## is simply a constant square matrix, ##\mathbf X\vec \beta## will be just a constant vector, in which case what information is added by writing it in that form rather than simply as a vector ##\vec \beta##?
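(For reference, the general identity for dependent vectors is ##Var(\vec U - \vec V) = Var(\vec U) + Var(\vec V) - Cov(\vec U, \vec V) - Cov(\vec V, \vec U)##; whether anything as simple as a difference of covariance matrices comes out at the end depends entirely on those cross-covariance terms.)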
 
Just for the benefit of those listening in, dear Apple, can you provide a list of symbols and their meaning? Would be so nice!
 
Boldface terms are matrices and hooks denote vectors. ##\vec Y## is random and ##\vec{\epsilon}## is random as well. ##\mathbf X## is constant and ##\vec{\beta}## is constant as well. ##\hat{\vec{\beta}}## is the best linear unbiased estimator for ##\vec{\beta}## and is random. It is also the MLE. ##\hat{\vec{\beta}} \sim MVN(\vec{\beta}, \sigma^2(\mathbf X'\mathbf X)^{-1})##.

##\vec Y = \mathbf X\vec{\beta} + \vec{\epsilon}## is a regression equation where there are n observations and p regressors. ##\mathbf X## is the design matrix and is of dimension n by p. ##\vec{\beta}## is the vector of regression coefficients and ##\vec{\epsilon}## is the vector of errors, where ##\vec{\epsilon} \sim MVN(\vec 0, \sigma^2 I)##.
##I## is the identity matrix and is n by n. This implies that the errors are uncorrelated.
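
For concreteness, here is a tiny simulated version of this setup (all numbers made up, just to fix ideas); it also checks the stated distribution of ##\hat{\vec{\beta}}## numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 40, 3, 1.0                      # made-up sizes for illustration
X = rng.normal(size=(n, p))                   # design matrix, n by p (constant)
beta = np.array([2.0, -1.0, 0.5])             # true coefficients (constant)

# many replications of Y = X beta + eps, with eps ~ MVN(0, sigma^2 I)
reps = 100_000
eps = sigma * rng.normal(size=(reps, n))
Y = X @ beta + eps

# OLS / MLE estimate beta_hat = (X'X)^{-1} X' Y, computed for every replication
beta_hat = Y @ X @ np.linalg.inv(X.T @ X)     # shape (reps, p)

print(np.round(beta_hat.mean(axis=0), 4))                 # ~ beta (unbiased)
print(np.round(np.cov(beta_hat, rowvar=False), 4))         # ~ sigma^2 (X'X)^{-1}
print(np.round(sigma**2 * np.linalg.inv(X.T @ X), 4))
```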
 
andrewkirk said:
One cannot subtract correlation matrices like that. There will be square roots and quadratic combinations involved, using Cholesky decompositions of the correlation matrices.

It's hard to be sure though, because your notation appears to be non-standard.
What is meant by ##\vec Y\sim MVN(\mathbf X\vec \beta,\sigma^2\mathbf H)##? In particular, what is ##\mathbf X##? If ##\mathbf X## is simply a constant square matrix, ##\mathbf X\vec \beta## will be just a constant vector, in which case what information is added by writing it in that form rather than simply as a vector ##\vec \beta##?
##\mathbf X## is the design matrix for a regression equation. It is a shorthand way of helping to rewrite the regression equation. It is n by p, where n is the number of observations and p is the number of regressors used to help predict the observations. So it is not a square matrix, but it is constant. ##\hat{\vec Y}\sim MVN(\mathbf X\vec \beta,\sigma^2\mathbf H)## means that the vector of fitted values ##\hat{\vec Y}## (the MLE of the mean ##\mathbf X\vec\beta##) follows a multivariate normal distribution. This is because ##\vec{\epsilon}## is random and follows a MVN as well.
 
Here is the PDF for the background. The notation is slightly different from mine here. I used hooks to make it clear what is a vector and what is not; they didn't.
 


FallenApple said:
So ##\vec Y \sim MVN(X\vec\beta, \sigma^2 I)##

Can one subtract that simply? I mean, ##\vec Y## and ##\hat {\vec Y}## are not independent, so how can I just total up the expectations and the variances like that?

Note that via Linearity of Expectations, you can always add up the expected values. I found the PDF you attached to be mostly straightforward, though I didn't understand your notation.

Note your hat matrix ##\mathbf H## is not going to be full rank, and it only has eigenvalues of 0 and 1. It's real-valued and symmetric, so it can be diagonalized in the form ##\mathbf H = \mathbf {QDQ}^T##. (I found the use of ##\mathbf O## in the PDF awkward; it looks too much like the zero matrix.)
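
A quick numerical illustration of the eigenvalue claim, using an arbitrary random design (nothing from your PDF):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3                                   # arbitrary sizes
X = rng.normal(size=(n, p))                   # random full-column-rank design
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix

print(np.round(np.linalg.eigvalsh(H), 6))     # p eigenvalues ~1, n - p eigenvalues ~0
```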

What they want is for you to use 4.1b, which says ##Var(\mathbf a + \mathbf{B y}) = Var(\mathbf{B y}) = \mathbf B Var(\mathbf y) \mathbf B^T##. The notation is a touch confusing here, but what they're reminding you of is that variance is an application of a norm for measuring dispersion about a mean, so translating the mean by a constant has no effect. If we assume ##\mathbf y## has zero mean, it appears that they're just reminding you that the covariance matrix is given by ##E\big[\mathbf{y y}^T\big]## and in turn ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf B\, E\big[\mathbf{y y}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var\big(\mathbf y\big)\, \mathbf{B}^T##. (I think Feller is to blame for this being called both the variance matrix and the covariance matrix...)
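
Here's a minimal Monte Carlo sketch of 4.1b with a made-up ##\mathbf a##, ##\mathbf B## and covariance, just to watch the constant shift drop out:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
a = rng.normal(size=n)                        # arbitrary constant shift
B = rng.normal(size=(n, n))                   # arbitrary constant matrix
Sigma = np.eye(n) + 0.5                       # made-up positive-definite covariance for y

y = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)
z = a + y @ B.T                               # z = a + B y, row by row

# the constant a drops out: Cov(z) is close to B Sigma B^T
print(np.abs(np.cov(z, rowvar=False) - B @ Sigma @ B.T).max())
```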

From here they are saying, let's look at the variance of the error term vector ##\mathbf e = \mathbf y - \mathbf{\hat y} = \mathbf{I y} - \mathbf{H y} = \big(\mathbf{I} - \mathbf H\big) \mathbf y \neq \mathbf 0## (that is, we use least squares when we have over-determined systems of equations -- I've excluded the case of zero error from this analysis accordingly). Hence ##Var(\mathbf e) = Var\big(\big(\mathbf{I} - \mathbf H\big) \mathbf y\big) = \big(\mathbf{I} - \mathbf H\big) Var\big(\mathbf y\big) \big(\mathbf{I} - \mathbf H\big)^T##. What I never saw explicitly, but is tucked into the equation for ##Var(\hat \beta)##, is that ##Var\big(\mathbf y\big) = \sigma^2 \mathbf I##. Substituting this into the above, we get:

##\big(\mathbf{I} - \mathbf H\big) Var \big(\mathbf y \big) \big(\mathbf{I} - \mathbf H\big)^T = \big(\mathbf{I} - \mathbf H\big) \big(\sigma^2 \mathbf I \big) \big(\mathbf{I} - \mathbf H\big)^T = \sigma^2 \big(\mathbf{I} - \mathbf H\big) \big(\mathbf{I} - \mathbf H\big)^T##. From here notice that ##\mathbf {I - H}## is real valued and symmetric, so ##\mathbf {I - H} = \big(\mathbf {I - H}\big)^T##, simplifying the equation to: ## \sigma^2 \big(\mathbf{I} - \mathbf H\big)^2##

Then, they make use of the fact that ##\big(\mathbf{I} - \mathbf H\big)## is itself a projector, so ##\big(\mathbf I - \mathbf H\big)^2 = \big(\mathbf I - \mathbf H\big)##, leading to the expression for the covariance matrix of the error term: ##\sigma^2 \big(\mathbf{I} - \mathbf H\big)^2 = \sigma^2 \big(\mathbf{I} - \mathbf H\big)##.
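
Both properties of ##\mathbf{I - H}## (symmetry and idempotency) are easy to confirm numerically for any full-column-rank design:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))                  # arbitrary full-column-rank design
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(10) - H                            # the residual-maker matrix I - H

print(np.allclose(M, M.T), np.allclose(M @ M, M))   # True True: symmetric and idempotent
```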

If you know the singular value decomposition, I'd recommend using it on ##\mathbf X## and going slowly through all of the above. It will allow you to see exactly what is going on under the hood. (It also is quite useful in interpreting the modified Hat matrix that makes use of regularization penalties -- this is outside the scope of your current PDF, but no doubt coming at some point in the future.)
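
A sketch of what the SVD route shows, assuming ##\mathbf X## has full column rank: the hat matrix is exactly ##\mathbf U_p \mathbf U_p^T##, the projector onto the column space of ##\mathbf X##:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 10, 4
X = rng.normal(size=(n, p))                   # arbitrary full-column-rank design
H = X @ np.linalg.inv(X.T @ X) @ X.T

U, s, Vt = np.linalg.svd(X)                   # full SVD: X = U diag(s) Vt
Up = U[:, :p]                                 # orthonormal basis for col(X)

print(np.allclose(H, Up @ Up.T))                            # True: H projects onto col(X)
print(np.allclose(np.eye(n) - H, U[:, p:] @ U[:, p:].T))    # True: I - H projects onto its complement
```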

It's worth pointing out that 4.5a and 4.5b were a bit troubling to me at first. The rank of ##\mathbf H## equals ##trace\big(\mathbf H\big) = p##. The writeup states that ##\mathbf H## is n x n. The rank of ##\mathbf I - \mathbf H## is ##n - p##. (Note: using the trace operation considerably simplifies Theorems 4.5, 4.6, and 4.7, and in general, since trace is a linear operator, it can be interchanged with expectations (with some technical care needed for convergence in the non-finite case).)
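
The trace/rank bookkeeping is also easy to check on a made-up design:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 12, 5
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(round(np.trace(H), 6), np.linalg.matrix_rank(H))                             # both p (= 5)
print(round(np.trace(np.eye(n) - H), 6), np.linalg.matrix_rank(np.eye(n) - H))     # both n - p (= 7)
```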

On their own, each of these is rank deficient -- which is disturbing when you look at the formula for the multivariate Gaussian, as it has you inverting the covariance matrix inside the exponential function, and the normalizing constant has the determinant of the covariance matrix in the denominator. But what 4.5a and 4.5b are really saying is that you can cleanly partition the Gaussian into p items that relate to ##\mathbf {\hat y}## and n - p items that relate to ##\mathbf {\hat e}##.
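
To see the partitioning concretely, rotate the residuals into the eigenbasis of ##\mathbf H##: the ##p## directions spanning col(##\mathbf X##) carry no residual variance, while each of the remaining ##n - p## directions carries variance ##\sigma^2## (again, a made-up example rather than anything from the PDF):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 8, 3, 1.5
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T
w, Q = np.linalg.eigh(H)                      # eigenvalues (n - p zeros, then p ones) and eigenvectors

eps = sigma * rng.normal(size=(50_000, n))
resid = eps @ (np.eye(n) - H).T               # e_hat = (I - H) eps, since (I - H) X beta = 0
z = resid @ Q                                 # residuals expressed in the eigenbasis of H

print(np.round(w, 3))                         # n - p zeros, then p ones
print(np.round(z.var(axis=0), 3))             # ~sigma^2 on the zero-eigenvalue axes, ~0 on the rest
```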
 
StoneTemplePython said:
Note that via Linearity of Expectations, you can always add up the expected values. I found the PDF you attached to be mostly straightforward, though I didn't understand your notation.

Note your hat matrix ##\mathbf H## is not going to be full rank, and it only has eigenvalues of 0 and 1. It's real-valued and symmetric, so it can be diagonalized in the form ##\mathbf H = \mathbf {QDQ}^T##. (I found the use of ##\mathbf O## in the PDF awkward; it looks too much like the zero matrix.)

What they want is for you to use 4.1b, which says ##Var(\mathbf a + \mathbf{B y}) = Var(\mathbf{B y}) = \mathbf B Var(\mathbf y) \mathbf B^T##. The notation is a touch confusing here, but what they're reminding you of is that variance is an application of a norm for measuring dispersion about a mean, so translating the mean by a constant has no effect. If we assume ##\mathbf y## has zero mean, it appears that they're just reminding you that the covariance matrix is given by ##E\big[\mathbf{y y}^T\big]## and in turn ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf B\, E\big[\mathbf{y y}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var\big(\mathbf y\big)\, \mathbf{B}^T##. (I think Feller is to blame for this being called both the variance matrix and the covariance matrix...)

From here they are saying, let's look at the variance of the error term vector ##\mathbf e = \mathbf y - \mathbf{\hat y} = \mathbf{I y} - \mathbf{H y} = \big(\mathbf{I} - \mathbf H\big) \mathbf y \neq \mathbf 0## (that is, we use least squares when we have over-determined systems of equations -- I've excluded the case of zero error from this analysis accordingly). Hence ##Var(\mathbf e) = Var\big(\big(\mathbf{I} - \mathbf H\big) \mathbf y\big) = \big(\mathbf{I} - \mathbf H\big) Var\big(\mathbf y\big) \big(\mathbf{I} - \mathbf H\big)^T##. What I never saw explicitly, but is tucked into the equation for ##Var(\hat \beta)##, is that ##Var\big(\mathbf y\big) = \sigma^2 \mathbf I##. Substituting this into the above, we get:

##\big(\mathbf{I} - \mathbf H\big) Var \big(\mathbf y \big) \big(\mathbf{I} - \mathbf H\big)^T = \big(\mathbf{I} - \mathbf H\big) \big(\sigma^2 \mathbf I \big) \big(\mathbf{I} - \mathbf H\big)^T = \sigma^2 \big(\mathbf{I} - \mathbf H\big) \big(\mathbf{I} - \mathbf H\big)^T##. From here notice that ##\mathbf {I - H}## is real valued and symmetric, so ##\mathbf {I - H} = \big(\mathbf {I - H}\big)^T##, simplifying the equation to: ## \sigma^2 \big(\mathbf{I} - \mathbf H\big)^2##

Then, they make use of the fact that ##\big(\mathbf{I} - \mathbf H\big)## is itself a projector, so ##\big(\mathbf I - \mathbf H\big)^2 = \big(\mathbf I - \mathbf H\big)##, leading to the expression for the covariance matrix of the error term: ##\sigma^2 \big(\mathbf{I} - \mathbf H\big)^2 = \sigma^2 \big(\mathbf{I} - \mathbf H\big)##.

If you know the singular value decomposition, I'd recommend using it on ##\mathbf X## and going slowly through all of the above. It will allow you to see exactly what is going on under the hood. (It also is quite useful in interpreting the modified Hat matrix that makes use of regularization penalties -- this is outside the scope of your current PDF, but no doubt coming at some point in the future.)

It's worth pointing out that 4.5a and 4.5b were a bit troubling to me at first. The rank of ##\mathbf H## equals ##trace\big(\mathbf H\big) = p##. The writeup states that ##\mathbf H## is n x n. The rank of ##\mathbf I - \mathbf H## is ##n - p##. (Note: using the trace operation considerably simplifies Theorems 4.5, 4.6, and 4.7, and in general, since trace is a linear operator, it can be interchanged with expectations (with some technical care needed for convergence in the non-finite case).)

On their own, each of these is rank deficient -- which is disturbing when you look at the formula for the multivariate Gaussian, as it has you inverting the covariance matrix inside the exponential function, and the normalizing constant has the determinant of the covariance matrix in the denominator. But what 4.5a and 4.5b are really saying is that you can cleanly partition the Gaussian into p items that relate to ##\mathbf {\hat y}## and n - p items that relate to ##\mathbf {\hat e}##.

Thanks for your detailed reply.

What I'm not sure about is this: ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf B\, E\big[\mathbf{y y}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var\big(\mathbf y\big)\, \mathbf{B}^T##.

Why is there no ##\sigma^{2}## term in there? Usually there is such a term, since the variance of ##\vec y## is ##\sigma^{2} I##.
 
FallenApple said:
Thanks for your detailed reply.

What I'm not sure about is this: ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf B\, E\big[\mathbf{y y}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var\big(\mathbf y\big)\, \mathbf{B}^T##.

Why is there no ##\sigma^{2}## term in there? Usually there is such a term, since the variance of ##\vec y## is ##\sigma^{2} I##.

You've got it... they offhand mentioned that ##var\big(\mathbf y\big) = \sigma^2 \mathbf I## when calculating ##Var(\hat\beta)##. From here you just recognize this and do the substitution when appropriate:

##\big(\mathbf{I} - \mathbf H\big) Var \big(\mathbf y \big) \big(\mathbf{I} - \mathbf H\big)^T = \big(\mathbf{I} - \mathbf H\big) \big(\sigma^2 \mathbf I \big) \big(\mathbf{I} - \mathbf H\big)^T = \sigma^2 \big(\mathbf{I} - \mathbf H\big) \big(\mathbf{I} - \mathbf H\big)^T##
 
