Adding Multivariate Normal Distributions

In summary, the conversation discusses the distribution of vectors with multivariate normal (MVN) properties: a random vector ##\vec Y## with mean ##\mathbf X\vec\beta## and covariance ##\sigma^2\mathbf I##, the fitted vector ##\hat {\vec Y}## obtained from the maximum likelihood estimate (MLE), and a vector ##\vec{\epsilon}## representing the errors in the regression equation. The conversation also uses the projection matrix ##\mathbf H = \mathbf X(\mathbf X'\mathbf X)^{-1}\mathbf X'## and the residual vector ##\hat{\vec e}##, which is the difference between ##\vec Y## and ##\hat {\vec Y}##. The main point of the conversation is to show that the residual vector satisfies ##\hat{\vec e}\sim MVN(\vec 0,\ \sigma^2(\mathbf I-\mathbf H))##.
  • #1
FallenApple
So ##\vec Y \sim MVN(\mathbf X\vec\beta,\ \sigma^2\mathbf I)##

and

##\hat {\vec Y} \sim MVN(\mathbf X\vec\beta,\ \sigma^2\mathbf H)##

and I want to show

##\hat{\vec e} \sim MVN(\vec 0,\ \sigma^2(\mathbf I-\mathbf H))##
where ##\hat{\vec e}## is the vector of observed residuals (##\hat{\vec e}=\vec Y- \hat{\vec Y}=(\mathbf I-\mathbf H) \vec Y##),

and ##\mathbf H=\mathbf X(\mathbf X'\mathbf X)^{-1}\mathbf X'## is the projection matrix, where the prime denotes transpose.

Since ##\hat{\vec e}=\vec Y- \hat{\vec Y}##, it's just the distribution of ##\vec Y## minus the distribution of ##\hat{\vec Y}##:

MVN(##\mathbf X\vec\beta##, ##\sigma^2\mathbf I##) - MVN(##\mathbf X\vec\beta##, ##\sigma^2\mathbf H##) = MVN(##\vec 0##, ##\sigma^2(\mathbf I-\mathbf H)##)

This uses the fact that ##\mathbf X\vec\beta-\mathbf X\vec\beta=\vec 0## and ##\sigma^2\mathbf I-\sigma^2\mathbf H=\sigma^2(\mathbf I-\mathbf H)## by simple matrix arithmetic.

Can one subtract that simply? I mean, ##\vec Y## and ##\hat {\vec Y}## are not independent, so how can I just total up the expectations and the variances like that?
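(For reference, the general identity lurking behind this step, written in the thread's notation, is
$$Var(\vec Y - \hat{\vec Y}) = Var(\vec Y) + Var(\hat{\vec Y}) - Cov(\vec Y, \hat{\vec Y}) - Cov(\hat{\vec Y}, \vec Y),$$
so simply subtracting the two covariance matrices is only valid once the cross-covariance terms are accounted for. In this particular setup ##Cov(\vec Y, \hat{\vec Y}) = \sigma^2\mathbf H##, because ##\hat{\vec Y} = \mathbf H\vec Y## and ##\mathbf H## is symmetric, which is why the final answer still comes out to ##\sigma^2(\mathbf I - \mathbf H)##; the replies below derive the same result via ##Var(\mathbf B\vec y)=\mathbf B\,Var(\vec y)\,\mathbf B^T##.)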
 
  • #2
One cannot subtract correlation matrices like that. There will be square roots and quadratic combinations involved, using Cholesky decompositions of the correlation matrices.

It's hard to be sure though, because your notation appears to be non-standard.
What is meant by ##\vec Y\sim MVN(\mathbf X\vec \beta,\sigma^2\mathbf H)##? In particular, what is ##\mathbf X##? If ##\mathbf X## is simply a constant square matrix, ##\mathbf X\vec \beta## will be just a constant vector, in which case what information is added by writing it in that form rather than simply as a vector ##\vec \beta##?
 
  • #3
Just for the benefit of those listening in, dear Apple, can you provide a list of symbols and their meaning? Would be so nice!
 
  • #4
Boldface terms are matrices and hooks (arrows) denote vectors. ##\vec Y## is random and ##\vec{\epsilon}## is random as well. ##\mathbf X## is constant and ##\vec{\beta}## is constant as well. ##\hat {\vec {\beta}}## is the best linear unbiased estimator for ##\vec{\beta}## and is random. It is also the MLE. ##\hat {\vec {\beta}}\sim MVN(\vec{\beta},\ \sigma^2 (\mathbf X'\mathbf X)^{-1})##

##\vec Y=\mathbf X\vec{\beta}+\vec {\epsilon}## is a regression equation where there are n observations and p regressors. ##\mathbf X## is the design matrix and is of dimension n by p. ##\vec{\beta}## is the vector of regression coefficients and ##\vec{\epsilon}## is the vector of errors, where ##\vec{\epsilon}\sim MVN(\vec 0,\ \sigma^2\mathbf I)##.
##\mathbf I## is the identity matrix and is n by n. This implies that the errors are uncorrelated.
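To make these symbols concrete, here is a minimal numerical sketch; the dimensions, coefficients, and noise level below are made-up illustrative values, not anything from the attached PDF:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 3             # n observations, p regressors (illustrative values)
sigma = 2.0              # noise standard deviation (illustrative)

X = rng.normal(size=(n, p))             # design matrix (constant, n by p)
beta = np.array([1.0, -0.5, 2.0])       # true coefficient vector (constant)
eps = rng.normal(scale=sigma, size=n)   # errors ~ MVN(0, sigma^2 I)

Y = X @ beta + eps                      # regression equation Y = X beta + eps

# Hat / projection matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # least-squares (and ML) estimate of beta
Y_hat = H @ Y                                 # fitted values, X beta_hat
e_hat = Y - Y_hat                             # observed residuals, (I - H) Y
```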
 
  • #5
andrewkirk said:
One cannot subtract correlation matrices like that. There will be square roots and quadratic combinations involved, using Cholesky decompositions of the correlation matrices.

It's hard to be sure though, because your notation appears to be non-standard.
What is meant by ##\vec Y\sim MVN(\mathbf X\vec \beta,\sigma^2\mathbf H)##? In particular, what is ##\mathbf X##? If ##\mathbf X## is simply a constant square matrix, ##\mathbf X\vec \beta## will be just a constant vector, in which case what information is added by writing it in that form rather than simply as a vector ##\vec \beta##?
##\mathbf X## is the design matrix for a regression equation. It is a shorthand way of helping to rewrite the regression equation. It is n by p, where n is the number of observations and p is the number of regressors used to help predict the observations. So it is not a square matrix, but it is constant. ##\hat {\vec Y}\sim MVN(\mathbf X\vec \beta,\sigma^2\mathbf H)## means that the vector ##\hat {\vec Y}## of fitted values (the MLE of the mean ##\mathbf X\vec\beta##) follows a multivariate normal distribution. This is because ##\vec{\epsilon}## is random and follows a MVN as well.
 
  • #6
Here is the PDF for the background. The notation is slightly different from the notation here. I used hooks so as to be clearer about what is a vector and what is not. They didn't.
 

Attachments

  • reg.pdf
  • #7
FallenApple said:
So ##\vec Y \sim MVN(\mathbf X\vec\beta,\ \sigma^2\mathbf I)##

Can one subtract that simply? I mean, ##\vec Y## and ##\hat {\vec Y}## are not independent, so how can I just total up the expectations and the variances like that?

Note that via linearity of expectations, you can always add up the expected values. I found the PDF you attached to be mostly straightforward, though I didn't understand your notation.

Note your hat matrix ##\mathbf H## is not going to be full rank, and it only has eigenvalues of 0 and 1. It's real-valued and symmetric, so it can be diagonalized in the form ##\mathbf H = \mathbf {QDQ}^T##. (I found the use of ##\mathbf O## in the PDF awkward, as it looks too much like the zero matrix.)

What they want is for you to use 4.1b, which says ##Var(\mathbf a + \mathbf{B y}) = Var(\mathbf{B y}) = \mathbf B Var(\mathbf y) \mathbf B^T##. The notation is a touch confusing here, but what they're reminding you of is that variance measures dispersion about the mean, so translating by a constant has no effect. If we assume ##\mathbf y## has zero mean, it appears that they're just reminding you that the covariance matrix is given by ##E[\mathbf {y y} ^T]## and in turn ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf{B}\, E\big[\mathbf{yy}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var \big(\mathbf y\big) \mathbf{B}^T ##. (I think Feller is to blame for this being called both the variance matrix and the covariance matrix...)
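A quick way to convince yourself of 4.1b is a Monte Carlo check; the particular ##\mathbf a##, ##\mathbf B##, and covariance below are arbitrary choices made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary illustrative choices of a, B, and a covariance matrix for y
a = np.array([1.0, -2.0])
B = np.array([[1.0, 0.5, 0.0],
              [0.0, 2.0, -1.0]])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)        # positive definite, so a valid Var(y)

# Sample y ~ N(0, Sigma) many times and form z = a + B y
y = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
z = a + y @ B.T

empirical = np.cov(z, rowvar=False)   # sample covariance of a + B y
theory = B @ Sigma @ B.T              # B Var(y) B^T; the constant a drops out

print(np.round(empirical, 2))
print(np.round(theory, 2))
```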

From here they are saying let's look at the variance of the error-term vector ##\mathbf e = \mathbf y - \mathbf{ \hat y} = \mathbf {I y} - \mathbf{ H y}= \big(\mathbf{I} - \mathbf H\big) \mathbf y \neq \mathbf 0## (that is, we use least squares when we have over-determined systems of equations; I've excluded the case of zero error from this analysis accordingly). Hence ##Var(\mathbf e ) = Var\big(\big(\mathbf{I} - \mathbf H\big) \mathbf y\big) = \big(\mathbf{I} - \mathbf H\big) Var \big(\mathbf y \big) \big(\mathbf{I} - \mathbf H\big)^T##. What I never saw explicitly, but is tucked into the equation for ## Var(\hat \beta)##, is that ##Var\big(\mathbf y\big) = \sigma^2 \mathbf I##. Substituting this into the above we get:

##\big(\mathbf{I} - \mathbf H\big) Var \big(\mathbf y \big) \big(\mathbf{I} - \mathbf H\big)^T = \big(\mathbf{I} - \mathbf H\big) \big(\sigma^2 \mathbf I \big) \big(\mathbf{I} - \mathbf H\big)^T = \sigma^2 \big(\mathbf{I} - \mathbf H\big) \big(\mathbf{I} - \mathbf H\big)^T##. From here notice that ##\mathbf {I - H}## is real valued and symmetric, so ##\mathbf {I - H} = \big(\mathbf {I - H}\big)^T##, simplifying the equation to: ## \sigma^2 \big(\mathbf{I} - \mathbf H\big)^2##

Then, they make use of the fact that ##\big(\mathbf{I} - \mathbf H\big)## is itself a projector, so ##\big(\mathbf I - \mathbf H\big)^2 = \big(\mathbf I - \mathbf H\big)##, leading to the expression for the covariance matrix of the error term: ##\sigma^2 \big(\mathbf{I} - \mathbf H\big)^2 = \sigma^2 \big(\mathbf{I} - \mathbf H\big)##.
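The projector facts used here (symmetry and idempotence of ##\mathbf H## and ##\mathbf I - \mathbf H##) are easy to check numerically for any full-column-rank design matrix; the ##\mathbf X## below is just a random illustrative one:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 3                               # illustrative dimensions
X = rng.normal(size=(n, p))                # assumed to have full column rank

H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix H = X (X'X)^{-1} X'
M = np.eye(n) - H                          # the "residual maker" I - H

print(np.allclose(H, H.T), np.allclose(H @ H, H))   # H is symmetric and idempotent
print(np.allclose(M, M.T), np.allclose(M @ M, M))   # so is I - H
```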

If you know the singular value decomposition, I'd recommend using it on ##\mathbf X## and going slowly through all of the above. It will allow you to see exactly what is going on under the hood. (It also is quite useful in interpreting the modified hat matrix that makes use of regularization penalties; this is outside the scope of your current PDF, but no doubt coming at some point in the future.)

It's worth pointing out that 4.5a and 4.5b were a bit troubling to me at first. The rank of ##\mathbf H## equals ##trace\big(\mathbf H \big) = p##. The writeup states that ##\mathbf H## is ##n \times n##. The rank of ##\mathbf I - \mathbf H## is ##n - p##. (Note: using the trace operation considerably simplifies Theorems 4.5, 4.6, and 4.7, and in general, since trace is a linear operator, it can be interchanged with expectations, with some technical care needed for convergence in the non-finite case.)

On their own, each of these is rank deficient, which is disturbing when you look at the formula for the multivariate Gaussian, as it has you inverting the covariance matrix inside the exponential function, and the normalizing constant has the determinant of the covariance matrix in the denominator. But what 4.5a and 4.5b are really saying is that you can cleanly partition the Gaussian into p items that relate to ##\mathbf {\hat y}## and n-p items that relate to ##\mathbf {\hat e}##.
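Following the SVD suggestion above, here is a small sketch of what it reveals: with a thin SVD ##\mathbf X = \mathbf U \mathbf S \mathbf V^T## and full column rank, ##\mathbf H = \mathbf U\mathbf U^T##, so its eigenvalues are all 0 or 1, ##trace(\mathbf H) = rank(\mathbf H) = p##, and ##rank(\mathbf I - \mathbf H) = n - p##. The design matrix below is again an arbitrary illustrative one:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 12, 4                                        # illustrative dimensions
X = rng.normal(size=(n, p))                         # assumed full column rank

U, s, Vt = np.linalg.svd(X, full_matrices=False)    # thin SVD: X = U diag(s) Vt
H = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(H, U @ U.T))                      # H = U U^T
print(np.round(np.linalg.eigvalsh(H), 6))           # eigenvalues are 0s and 1s
print(round(np.trace(H), 6), np.linalg.matrix_rank(H))   # both equal p
print(np.linalg.matrix_rank(np.eye(n) - H))              # n - p
```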
 
  • #8
StoneTemplePython said:
If we assume ##\mathbf y## has zero mean, it appears that they're just reminding you that the covariance matrix is given by ##E[\mathbf {y y} ^T]## and in turn ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf{B}\, E\big[\mathbf{yy}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var \big(\mathbf y\big) \mathbf{B}^T ##.

Thanks for your detailed reply.

What I'm not sure about is this: ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf{B}\, E\big[\mathbf{yy}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var \big(\mathbf y\big) \mathbf{B}^T ##.

Why is there no ##\sigma^{2}## term in there? Usually there is such a term, since the variance of ##\vec y## is ##\sigma^{2}\mathbf I##.
 
  • #9
FallenApple said:
Thanks for your detailed reply.

What I'm not sure about is this: ##var\big(\mathbf{B y}\big) = E\big[\mathbf{B y}\big(\mathbf{B y}\big)^T\big] = \mathbf{B}\, E\big[\mathbf{yy}^T\big]\, \mathbf{B}^T = \mathbf{B}\, Var \big(\mathbf y\big) \mathbf{B}^T ##.

Why is there no ##\sigma^{2}## term in there? Usually there is such a term, since the variance of ##\vec y## is ##\sigma^{2}\mathbf I##.

You've got it... they offhandedly mentioned that ##var \big( \mathbf y\big) = \sigma^2 \mathbf I## when calculating ##Var(\hat \beta)##. From here you just recognize this and do the substitution when appropriate:

##\big(\mathbf{I} - \mathbf H\big) Var \big(\mathbf y \big) \big(\mathbf{I} - \mathbf H\big)^T = \big(\mathbf{I} - \mathbf H\big) \big(\sigma^2 \mathbf I \big) \big(\mathbf{I} - \mathbf H\big)^T = \sigma^2 \big(\mathbf{I} - \mathbf H\big) \big(\mathbf{I} - \mathbf H\big)^T##
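For what it's worth, the end result ##Var(\hat{\vec e}) = \sigma^2(\mathbf I - \mathbf H)## can also be sanity-checked by simulation; the design matrix, coefficients, and ##\sigma## below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 8, 2, 1.5                      # illustrative dimensions and noise level
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
H = X @ np.linalg.solve(X.T @ X, X.T)

# Simulate many data sets and collect the residual vectors e = (I - H) Y
reps = 200_000
Y = X @ beta + sigma * rng.normal(size=(reps, n))   # each row is one draw of Y
E = Y @ (np.eye(n) - H)                             # residuals (I - H is symmetric)

empirical = np.cov(E, rowvar=False)                 # sample covariance of the residuals
theory = sigma**2 * (np.eye(n) - H)                 # sigma^2 (I - H)

print(np.max(np.abs(empirical - theory)))           # should be close to zero
```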
 

FAQ: Adding Multivariate Normal Distributions

1. What is a multivariate normal distribution?

A multivariate normal distribution is a probability distribution that describes a set of variables that are jointly normally distributed. It is defined by a mean vector and a covariance matrix and is often used in statistical analysis to model correlated data.

2. How do you add two multivariate normal distributions?

To add two independent multivariate normal distributions, you calculate the mean and covariance matrix of the resulting distribution: the mean of the new distribution is the sum of the two means, and the covariance matrix is the sum of the two covariance matrices. If the two random vectors are correlated (as ##\vec Y## and ##\hat{\vec Y}## are in this thread), the cross-covariance terms must be included as well, so the covariance matrices cannot simply be added or subtracted.

3. What is the purpose of adding multivariate normal distributions?

The purpose of adding multivariate normal distributions is to model the combined effects of multiple variables on a system. By adding distributions, we can better understand the overall behavior and variability of a system, and make more accurate predictions.

4. Can you add more than two multivariate normal distributions?

Yes, it is possible to add more than two multivariate normal distributions. The process is the same as adding two: provided the distributions are independent, you calculate the mean and covariance matrix of the result by adding the means and covariance matrices of all the individual distributions.

5. Are there any limitations to adding multivariate normal distributions?

One limitation is that the result is only guaranteed to be normal when the variables are jointly Gaussian; in that case any linear combination, including a sum, is again multivariate normal. If the variables are merely marginally normal but not jointly Gaussian, their sum need not be normal, which may impact the accuracy of the model. It is important to assess the validity of the resulting distribution before using it for analysis.
