Adding Multivariate Normal Distributions


Discussion Overview

The discussion centers around the properties and manipulations of multivariate normal distributions, particularly in the context of regression analysis. Participants explore the distribution of residuals derived from observed data and fitted values, questioning the validity of certain mathematical operations involving covariance matrices.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant asserts that the residuals ##\hat{\vec e}## can be expressed as a multivariate normal distribution, suggesting that the distribution of ##\vec Y## minus the distribution of ##\hat{\vec Y}## leads to ##\hat{\vec e}\sim MVN(\vec 0, \sigma^2(I-H))##.
  • Another participant challenges the validity of subtracting correlation matrices directly, indicating that such operations typically involve more complex relationships, including Cholesky decompositions.
  • Clarifications are provided regarding the notation used, with one participant explaining that ##\mathbf X## is a design matrix in a regression context, and not a square matrix, which influences the interpretation of the multivariate normal distribution.
  • There is a discussion about the implications of the properties of the hat matrix ##H##, including its rank and eigenvalues, and how these properties affect the variance of the error term vector.
  • One participant emphasizes the importance of understanding the variance of the error term and its relationship to the design matrix and the covariance structure of the data.
  • Another participant suggests using singular value decomposition to gain deeper insights into the underlying structure of the matrices involved.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of certain mathematical operations involving covariance matrices. There is no consensus on the validity of the initial claim regarding the distribution of residuals, and the discussion remains unresolved regarding the correct approach to handling these distributions.

Contextual Notes

Participants note that the notation used in the discussion may not be standard, which contributes to some confusion. Additionally, the rank deficiencies of the matrices involved are highlighted as a potential concern when applying the multivariate Gaussian distribution.

FallenApple
So ##\vec Y \sim MVN(X\vec\beta, \sigma^2 I)##

and

##\hat{\vec Y} \sim MVN(X\vec\beta, \sigma^2 H)##

and I want to show

##\hat{\vec e} \sim MVN(\vec 0, \sigma^2(I-H))##,
where ##\hat{\vec e}## is the vector of observed residuals (##\hat{\vec e} = \vec Y - \hat{\vec Y} = (I-H)\vec Y##).

And ##H = X(X'X)^{-1}X'## is the projection matrix, where the prime denotes transpose.

Since ##\hat{\vec e} = \vec Y - \hat{\vec Y}##, it's just the distribution of ##\vec Y## minus the distribution of ##\hat{\vec Y}##.

So ##MVN(X\vec\beta, \sigma^2 I) - MVN(X\vec\beta, \sigma^2 H) = MVN(\vec 0, \sigma^2(I-H))##.

This uses the fact that ##X\vec\beta - X\vec\beta = \vec 0## and ##\sigma^2 I - \sigma^2 H = \sigma^2(I-H)##, by simple vector and matrix subtraction.

Can one subtract that simply? I mean, ##\vec Y## and ##\hat {\vec Y}## are not independent, so how can I just total up the expectations and the variances like that?
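
For what it's worth, a quick simulation with a made-up design matrix (numbers chosen arbitrarily, not from any attachment) does give an empirical residual covariance close to ##\sigma^2(I-H)##, so the question is really whether the reasoning above is a legitimate way to get there:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 6, 2, 1.5                       # made-up sizes for illustration
X = rng.normal(size=(n, p))                   # hypothetical design matrix
beta = rng.normal(size=p)
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat (projection) matrix

reps = 200_000
eps = sigma * rng.normal(size=(reps, n))      # errors ~ MVN(0, sigma^2 I)
Y = X @ beta + eps
resid = Y @ (np.eye(n) - H).T                 # e_hat = (I - H) Y, one row per replication

print(np.round(np.cov(resid, rowvar=False), 3))   # empirical covariance of the residuals ...
print(np.round(sigma**2 * (np.eye(n) - H), 3))    # ... is close to sigma^2 (I - H)
```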
 
One cannot subtract correlation matrices like that. There will be square roots and quadratic combinations involved, using Cholesky decompositions of the correlation matrices.

It's hard to be sure though, because your notation appears to be non-standard.
What is meant by ##\vec Y\sim MVN(\mathbf X\vec \beta,\sigma^2\mathbf H)##? In particular, what is ##\mathbf X##? If ##\mathbf X## is simply a constant square matrix, ##\mathbf X\vec \beta## will be just a constant vector, in which case what information is added by writing it in that form rather than simply as a vector ##\vec \beta##?
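(For reference, the general identity for dependent vectors is ##Var(\vec U - \vec V) = Var(\vec U) + Var(\vec V) - Cov(\vec U, \vec V) - Cov(\vec V, \vec U)##; whether anything as simple as a difference of covariance matrices comes out at the end depends entirely on those cross-covariance terms.)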
 
Just for the benefit of those listening in, dear Apple, can you provide a list of symbols and their meaning? Would be so nice!
 
Boldface terms are matrices and hooks denote vectors. ##\vec Y## is random and ##\vec{\epsilon}## is random as well. ##\mathbf X## is constant and ##\vec{\beta}## is constant as well. ##\hat{\vec{\beta}}## is the best linear unbiased estimator for ##\vec{\beta}## and is random. It is also the MLE. ##\hat{\vec{\beta}} \sim MVN(\vec{\beta}, \sigma^2(\mathbf X'\mathbf X)^{-1})##.

##\vec Y = \mathbf X\vec{\beta} + \vec{\epsilon}## is a regression equation where there are n observations and p regressors. ##\mathbf X## is the design matrix and is of dimension n by p. ##\vec{\beta}## is the vector of regression coefficients and ##\vec{\epsilon}## is the vector of errors, where ##\vec{\epsilon} \sim MVN(\vec 0, \sigma^2 I)##.
##I## is the identity matrix and is n by n. This implies that the errors are uncorrelated.
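
For concreteness, here is a tiny simulated version of this setup (all numbers made up, just to fix ideas); it also checks the stated distribution of ##\hat{\vec{\beta}}## numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 40, 3, 1.0                      # made-up sizes for illustration
X = rng.normal(size=(n, p))                   # design matrix, n by p (constant)
beta = np.array([2.0, -1.0, 0.5])             # true coefficients (constant)

# many replications of Y = X beta + eps, with eps ~ MVN(0, sigma^2 I)
reps = 100_000
eps = sigma * rng.normal(size=(reps, n))
Y = X @ beta + eps

# OLS / MLE estimate beta_hat = (X'X)^{-1} X' Y, computed for every replication
beta_hat = Y @ X @ np.linalg.inv(X.T @ X)     # shape (reps, p)

print(np.round(beta_hat.mean(axis=0), 4))                 # ~ beta (unbiased)
print(np.round(np.cov(beta_hat, rowvar=False), 4))         # ~ sigma^2 (X'X)^{-1}
print(np.round(sigma**2 * np.linalg.inv(X.T @ X), 4))
```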
 
andrewkirk said:
One cannot subtract correlation matrices like that. There will be square roots and quadratic combinations involved, using Cholesky decompositions of the correlation matrices.

It's hard to be sure though, because your notation appears to be non-standard.
What is meant by ##\vec Y\sim MVN(\mathbf X\vec \beta,\sigma^2\mathbf H)##? In particular, what is ##\mathbf X##? If ##\mathbf X## is simply a constant square matrix, ##\mathbf X\vec \beta## will be just a constant vector, in which case what information is added by writing it in that form rather than simply as a vector ##\vec \beta##?
##\mathbf X## is the design matrix for a regression equation. It is a shorthand way of helping to rewrite the regression equation. It is n by p, where n is the number of observations and p is the number of regressors used to help predict the observations. So it is not a square matrix, but it is constant. ##\hat{\vec Y}\sim MVN(\mathbf X\vec \beta,\sigma^2\mathbf H)## means that the vector of fitted values ##\hat{\vec Y}## (the MLE of the mean ##\mathbf X\vec\beta##) follows a multivariate normal distribution. This is because ##\vec{\epsilon}## is random and follows a MVN as well.
 
Here is the PDF for the background. The notation is slightly different from mine here. I used hooks to make it clear what is a vector and what is not; they didn't.
 


FallenApple said:
So ##\vec Y \sim MVN(X\vec\beta, \sigma^2 I)##

Can one subtract that simply? I mean, ##\vec Y## and ##\hat {\vec Y}## are not independent, so how can I just total up the expectations and the variances like that?

Note that via Linearity of Expectations, you can always add up the expected values. I found the PDF you attached to be mostly straightforward, though I didn't understand your notation.

Note your hat matrix ##\mathbf H## is not going to be full rank, and it only has eigenvalues of 0 and 1. It's real-valued and symmetric, so it can be diagonalized in the form ##\mathbf H = \mathbf {QDQ}^T##. (I found the use of ##\mathbf O## in the PDF awkward; it looks too much like the zero matrix.)
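
A quick numerical illustration of the eigenvalue claim, using an arbitrary random design (nothing from your PDF):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3                                   # arbitrary sizes
X = rng.normal(size=(n, p))                   # random full-column-rank design
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix

print(np.round(np.linalg.eigvalsh(H), 6))     # p eigenvalues ~1, n - p eigenvalues ~0
```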

What they want is for you to use 4.1b, which says ##Var(\mathbf a + \mathbf{B y}) = Var(\mathbf{B y}) = \mathbf B Var(\mathbf y) \mathbf B^T##. The notation is a touch confusing here, but what they're reminding you of is that variance is an application of a norm for measuring dispersion about a mean, so translating the mean by a constant has no effect. If we assume ##\mathbf y## has zero mean, it appears that they're just reminding you that the covariance matrix is given by ##E\big[\mathbf{y y}^T\big]## and in turn ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf B\, E\big[\mathbf{y y}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var\big(\mathbf y\big)\, \mathbf{B}^T##. (I think Feller is to blame for this being called both the variance matrix and the covariance matrix...)
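
Here's a minimal Monte Carlo sketch of 4.1b with a made-up ##\mathbf a##, ##\mathbf B## and covariance, just to watch the constant shift drop out:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
a = rng.normal(size=n)                        # arbitrary constant shift
B = rng.normal(size=(n, n))                   # arbitrary constant matrix
Sigma = np.eye(n) + 0.5                       # made-up positive-definite covariance for y

y = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)
z = a + y @ B.T                               # z = a + B y, row by row

# the constant a drops out: Cov(z) is close to B Sigma B^T
print(np.abs(np.cov(z, rowvar=False) - B @ Sigma @ B.T).max())
```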

From here they are saying, let's look at the variance of the error term vector ##\mathbf e = \mathbf y - \mathbf{\hat y} = \mathbf{I y} - \mathbf{H y} = \big(\mathbf{I} - \mathbf H\big) \mathbf y \neq \mathbf 0## (that is, we use least squares when we have over-determined systems of equations -- I've excluded the case of zero error from this analysis accordingly). Hence ##Var(\mathbf e) = Var\big(\big(\mathbf{I} - \mathbf H\big) \mathbf y\big) = \big(\mathbf{I} - \mathbf H\big) Var\big(\mathbf y\big) \big(\mathbf{I} - \mathbf H\big)^T##. What I never saw explicitly, but is tucked into the equation for ##Var(\hat \beta)##, is that ##Var\big(\mathbf y\big) = \sigma^2 \mathbf I##. Substituting this into the above, we get:

##\big(\mathbf{I} - \mathbf H\big) Var \big(\mathbf y \big) \big(\mathbf{I} - \mathbf H\big)^T = \big(\mathbf{I} - \mathbf H\big) \big(\sigma^2 \mathbf I \big) \big(\mathbf{I} - \mathbf H\big)^T = \sigma^2 \big(\mathbf{I} - \mathbf H\big) \big(\mathbf{I} - \mathbf H\big)^T##. From here notice that ##\mathbf {I - H}## is real valued and symmetric, so ##\mathbf {I - H} = \big(\mathbf {I - H}\big)^T##, simplifying the equation to: ## \sigma^2 \big(\mathbf{I} - \mathbf H\big)^2##

Then, they make use of the fact that ##\big(\mathbf{I} - \mathbf H\big)## is itself a projector, so ##\big(\mathbf I - \mathbf H\big)^2 = \big(\mathbf I - \mathbf H\big)##, leading to the expression for the covariance matrix of the error term: ##\sigma^2 \big(\mathbf{I} - \mathbf H\big)^2 = \sigma^2 \big(\mathbf{I} - \mathbf H\big)##.
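
Both properties of ##\mathbf{I - H}## (symmetry and idempotency) are easy to confirm numerically for any full-column-rank design:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))                  # arbitrary full-column-rank design
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(10) - H                            # the residual-maker matrix I - H

print(np.allclose(M, M.T), np.allclose(M @ M, M))   # True True: symmetric and idempotent
```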

If you know the singular value decomposition, I'd recommend using it on ##\mathbf X## and going slowly through all of the above. It will allow you to see exactly what is going on under the hood. (It also is quite useful in interpreting the modified Hat matrix that makes use of regularization penalties -- this is outside the scope of your current PDF, but no doubt coming at some point in the future.)
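
A sketch of what the SVD route shows, assuming ##\mathbf X## has full column rank: the hat matrix is exactly ##\mathbf U_p \mathbf U_p^T##, the projector onto the column space of ##\mathbf X##:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 10, 4
X = rng.normal(size=(n, p))                   # arbitrary full-column-rank design
H = X @ np.linalg.inv(X.T @ X) @ X.T

U, s, Vt = np.linalg.svd(X)                   # full SVD: X = U diag(s) Vt
Up = U[:, :p]                                 # orthonormal basis for col(X)

print(np.allclose(H, Up @ Up.T))                            # True: H projects onto col(X)
print(np.allclose(np.eye(n) - H, U[:, p:] @ U[:, p:].T))    # True: I - H projects onto its complement
```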

It's worth pointing out that 4.5a and 4.5b were a bit troubling to me at first. The rank of ##\mathbf H## equals ##trace\big(\mathbf H\big) = p##. The writeup states that ##\mathbf H## is n x n. The rank of ##\mathbf I - \mathbf H## is ##n - p##. (Note: using the trace operation considerably simplifies Theorems 4.5, 4.6, and 4.7, and in general, since trace is a linear operator, it can be interchanged with expectations (with some technical care needed for convergence in the non-finite case).)
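
The trace/rank bookkeeping is also easy to check on a made-up design:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 12, 5
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(round(np.trace(H), 6), np.linalg.matrix_rank(H))                             # both p (= 5)
print(round(np.trace(np.eye(n) - H), 6), np.linalg.matrix_rank(np.eye(n) - H))     # both n - p (= 7)
```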

On their own, each of these is rank deficient -- which is disturbing when you look at the formula for the multivariate Gaussian, as it has you inverting the covariance matrix inside the exponential function, and the normalizing constant has the determinant of the covariance matrix in the denominator. But what 4.5a and 4.5b are really saying is that you can cleanly partition the Gaussian into p items that relate to ##\mathbf {\hat y}## and n - p items that relate to ##\mathbf {\hat e}##.
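
To see the partitioning concretely, rotate the residuals into the eigenbasis of ##\mathbf H##: the ##p## directions spanning col(##\mathbf X##) carry no residual variance, while each of the remaining ##n - p## directions carries variance ##\sigma^2## (again, a made-up example rather than anything from the PDF):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 8, 3, 1.5
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T
w, Q = np.linalg.eigh(H)                      # eigenvalues (n - p zeros, then p ones) and eigenvectors

eps = sigma * rng.normal(size=(50_000, n))
resid = eps @ (np.eye(n) - H).T               # e_hat = (I - H) eps, since (I - H) X beta = 0
z = resid @ Q                                 # residuals expressed in the eigenbasis of H

print(np.round(w, 3))                         # n - p zeros, then p ones
print(np.round(z.var(axis=0), 3))             # ~sigma^2 on the zero-eigenvalue axes, ~0 on the rest
```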
 
StoneTemplePython said:
Note that via Linearity of Expectations, you can always add up the expected values. I found the PDF you attached to be mostly straightforward, though I didn't understand your notation.

Note your hat matrix ##\mathbf H## is not going to be full rank, and it only has eigenvalues of 0 and 1. It's real-valued and symmetric, so it can be diagonalized in the form ##\mathbf H = \mathbf {QDQ}^T##. (I found the use of ##\mathbf O## in the PDF awkward; it looks too much like the zero matrix.)

What they want is for you to use 4.1b, which says ##Var(\mathbf a + \mathbf{B y}) = Var(\mathbf{B y}) = \mathbf B Var(\mathbf y) \mathbf B^T##. The notation is a touch confusing here, but what they're reminding you of is that variance is an application of a norm for measuring dispersion about a mean, so translating the mean by a constant has no effect. If we assume ##\mathbf y## has zero mean, it appears that they're just reminding you that the covariance matrix is given by ##E\big[\mathbf{y y}^T\big]## and in turn ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf B\, E\big[\mathbf{y y}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var\big(\mathbf y\big)\, \mathbf{B}^T##. (I think Feller is to blame for this being called both the variance matrix and the covariance matrix...)

From here they are saying, let's look at the variance of the error term vector ##\mathbf e = \mathbf y - \mathbf{\hat y} = \mathbf{I y} - \mathbf{H y} = \big(\mathbf{I} - \mathbf H\big) \mathbf y \neq \mathbf 0## (that is, we use least squares when we have over-determined systems of equations -- I've excluded the case of zero error from this analysis accordingly). Hence ##Var(\mathbf e) = Var\big(\big(\mathbf{I} - \mathbf H\big) \mathbf y\big) = \big(\mathbf{I} - \mathbf H\big) Var\big(\mathbf y\big) \big(\mathbf{I} - \mathbf H\big)^T##. What I never saw explicitly, but is tucked into the equation for ##Var(\hat \beta)##, is that ##Var\big(\mathbf y\big) = \sigma^2 \mathbf I##. Substituting this into the above, we get:

##\big(\mathbf{I} - \mathbf H\big) Var \big(\mathbf y \big) \big(\mathbf{I} - \mathbf H\big)^T = \big(\mathbf{I} - \mathbf H\big) \big(\sigma^2 \mathbf I \big) \big(\mathbf{I} - \mathbf H\big)^T = \sigma^2 \big(\mathbf{I} - \mathbf H\big) \big(\mathbf{I} - \mathbf H\big)^T##. From here notice that ##\mathbf {I - H}## is real valued and symmetric, so ##\mathbf {I - H} = \big(\mathbf {I - H}\big)^T##, simplifying the equation to: ## \sigma^2 \big(\mathbf{I} - \mathbf H\big)^2##

Then, they make use of the fact that ##\big(\mathbf{I} - \mathbf H\big)## is itself a projector, so ##\big(\mathbf I - \mathbf H\big)^2 = \big(\mathbf I - \mathbf H\big)##, leading to the expression for the covariance matrix of the error term: ##\sigma^2 \big(\mathbf{I} - \mathbf H\big)^2 = \sigma^2 \big(\mathbf{I} - \mathbf H\big)##.

If you know the singular value decomposition, I'd recommend using it on ##\mathbf X## and going slowly through all of the above. It will allow you to see exactly what is going on under the hood. (It also is quite useful in interpreting the modified Hat matrix that makes use of regularization penalties -- this is outside the scope of your current PDF, but no doubt coming at some point in the future.)

It's worth pointing out that 4.5a and 4.5b were a bit troubling to me at first. The rank of ##\mathbf H## equals ##trace\big(\mathbf H\big) = p##. The writeup states that ##\mathbf H## is n x n. The rank of ##\mathbf I - \mathbf H## is ##n - p##. (Note: using the trace operation considerably simplifies Theorems 4.5, 4.6, and 4.7, and in general, since trace is a linear operator, it can be interchanged with expectations (with some technical care needed for convergence in the non-finite case).)

On their own, each of these is rank deficient -- which is disturbing when you look at the formula for the multivariate Gaussian, as it has you inverting the covariance matrix inside the exponential function, and the normalizing constant has the determinant of the covariance matrix in the denominator. But what 4.5a and 4.5b are really saying is that you can cleanly partition the Gaussian into p items that relate to ##\mathbf {\hat y}## and n - p items that relate to ##\mathbf {\hat e}##.

Thanks for your detailed reply.

What I'm not sure about is this: ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf B\, E\big[\mathbf{y y}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var\big(\mathbf y\big)\, \mathbf{B}^T##.

Why is there no ##\sigma^{2}## term in there? Usually there is such a term, since the variance of ##\vec y## is ##\sigma^{2} I##.
 
FallenApple said:
Thanks for your detailed reply.

What I'm not sure about is this: ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf B\, E\big[\mathbf{y y}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var\big(\mathbf y\big)\, \mathbf{B}^T##.

Why is there no ##\sigma^{2}## term in there? Usually there is such a term, since the variance of ##\vec y## is ##\sigma^{2} I##.

You've got it... they offhand mentioned that ##var\big(\mathbf y\big) = \sigma^2 \mathbf I## when calculating ##Var(\hat\beta)##. From here you just recognize this and do the substitution when appropriate:

##\big(\mathbf{I} - \mathbf H\big) Var \big(\mathbf y \big) \big(\mathbf{I} - \mathbf H\big)^T = \big(\mathbf{I} - \mathbf H\big) \big(\sigma^2 \mathbf I \big) \big(\mathbf{I} - \mathbf H\big)^T = \sigma^2 \big(\mathbf{I} - \mathbf H\big) \big(\mathbf{I} - \mathbf H\big)^T##
 
