# Multivariate Gaussian - Normalization factor via diagonalization

1. Jan 14, 2017

### binbagsss

1. The problem statement, all variables and given/known data
Hi,

I am trying to follow my book's hint that to find the normalization factor one should:

"Diagonalize $\Sigma^{-1}$ to get $n$ Gaussians, which will have variances given by the eigenvalues of $\Sigma$. Integrating each one then gives $\sqrt{2\pi\Lambda_i}$; then use that the product of the eigenvalues is the determinant."

2. Relevant equations

What I know: $\Sigma$ is symmetric and so it can be diagonalized, $\Sigma=PDP^{T}$, where $P$ is the orthogonal matrix of eigenvectors and $D$ is the diagonal matrix of eigenvalues.

3. The attempt at a solution

I'm stuck: to be honest, I am blank as to where to start explicitly, not having done any examples of this for at least four years or so.

Many thanks in advance!

2. Jan 14, 2017

### Ray Vickson

Just re-express your integral over $(x_1, x_2, \ldots, x_n)$ as an integral over $(y_1, y_2, \ldots, y_n)$, where $\vec{y} = (P^T)^{-1} \vec{x}$. Use the standard change-of-variable formula for multivariate integrals.

Last edited: Jan 14, 2017

3. Jan 16, 2017

### binbagsss

Mmm, okay, thanks. I think the general idea is making more sense: by transforming to a diagonal matrix we lose the cross terms in the Gaussian, so the integral reduces to a product over $n$ individual Gaussians. Because $D$ is the diagonal matrix giving the eigenvalues of $\Sigma$, the eigenvalues of $\Sigma^{-1}$ are given by $D^{-1}$, and then, reading off the form of the Gaussian, it is easy to see that the eigenvalues of $\Sigma$ are the variances.

As for the actual algebra of the transformation, it's not looking so good for me. (By the way, should the transformation be $\tilde{y}=P^T \tilde{x}$, where $\tilde{x}=x-\mu$?) I have:

$\tilde{y}=P^T \tilde{x} \implies \tilde{x}=(P^T)^{-1}\tilde{y}$

So

$\tilde{x}^T\Sigma^{-1}\tilde{x}=\tilde{y}^{T}((P^{T})^{-1})^{T}(P^{T})^{-1}D^{-1}P^{-1}(P^{T})^{-1}\tilde{y}$

using $\Sigma^{-1}=(PDP^T)^{-1}$.

Whilst we know $P^T=P^{-1}$, we cannot simplify $((P^{T})^{-1})^T$ or $(P^T)^{-1}$, can we?

Thanks in advance

4. Jan 16, 2017

### Ray Vickson

Since $\Sigma = P D P^T$, we have $x^T \Sigma^{-1} x = x^T (P^T)^{-1} D^{-1} P^{-1} x$, so I should have written $y = P^{-1} x$.
Since we do have $(P^T)^{-1} = (P^{-1})^T$ for any invertible matrix $P$, we have $x^T \Sigma^{-1} x = y^T D^{-1} y$. The Jacobian of the transformation gives

$$dy_1\, dy_2 \cdots dy_n = J \, dx_1 \, dx_2 \cdots dx_n,$$

where

$$J = \left| \det \left( \frac{\partial y_i}{\partial x_j} \right) \right| = |\det(P^{-1})| = 1.$$

5. Jan 17, 2017

### binbagsss

I am looking at Kardar, which then says that "similar manipulations can be used to find the characteristic function"; my lecture notes also say this, and mention that we should convert back too!

$\tilde{p}(\vec{k})=e^{-i k\cdot\lambda - \frac{1}{2} k\cdot C\cdot k}$

So the definition of $\tilde{p}(\vec{k})$ is $\tilde{p}(\vec{k}) = \langle e^{-i k\cdot x} \rangle$.

So, using the same transformation as above, I have

$\frac{1}{\sqrt{\det(2\pi C)}}\int e^{-\frac{1}{2} y^T D^{-1} y}\, e^{-i k\cdot(P y)}\, dy$

which doesn't look like it has simplified anything, since $P$ isn't of an easy form? I'm guessing I need a different substitution?

Any help much appreciated, anyone. Thank you!

6. Jan 17, 2017

### StoneTemplePython

I too find the multivariate Gaussian hard to interpret at times. What I like to do is start with easy problems and build up.

When working with single-variable Gaussians, you'll frequently use a standard normal r.v. -- i.e. zero mean, unit variance. Let's do the first part of that and make your variables zero mean:

$$p(v; \mu, \Sigma) \propto \exp \Big(-\frac{1}{2} v^T \Sigma^{-1} v\Big)$$

where $v$ is just the zero-mean (i.e. centered) version of $x$. To put this mathematically I'd probably say $v_i = x_i - \mu_i$, or something along those lines. Note the above setup assumes your covariance matrix is computed from $v$, not from $x$. (I would have used a new symbol for said covariance matrix, but $\Sigma$ is pretty much the notation for a covariance matrix.)

Can we assume these variables are independent? If so, then that means your $\Sigma$ is already diagonalized -- i.e. it is diagonal. Why?
Because diagonal entries refer to variance and off-diagonal entries refer to covariance. Recall that independent variables with zero mean have no covariance. (Technically we could skate by with pairwise independence, but let's assume they are mutually independent.) Is this interpretable now?

Note that if we were trying to make $v$ have unit variance we'd use $\Sigma^{-\frac{1}{2}} v$ -- this is possible because $\Sigma$ is symmetric positive definite (irrespective of independence concerns ... unless one of your variables or data sets is degenerate, in which case $\Sigma$ has an eigenvalue equal to 0, but this degeneracy needs to be dealt with one way or another before you can move ahead, so by assumption/necessity, all eigenvalues are $> 0$).

From here you can move on to the normalizing constant. It really is just a matter of recognizing that when you multiply the densities of a sequence of random variables that are each described by a constant times an exponential function, you add the stuff inside the exponential function and multiply the constants outside. So with respect to the normalizing constant, if you're looking at the joint pdf of $n$ Gaussians, the $\frac{1}{\sqrt{2\pi}}$ term gets multiplied $n$ times, so you could write that part of the constant as $\frac{1}{(2 \pi)^\frac{n}{2}}$.

Also recall from the single-variable case that the other part of the normalizing constant is one divided by the square root of the variance. From here, multiplying a bunch of square roots of variances can also be written as the square root of the product of a bunch of variances ... and the product of variances is the product of the diagonal entries of $\Sigma$ in our case -- and since $\Sigma$ is diagonal, we can also call this the determinant. Hence the other piece of the normalizing constant is $\frac{1}{\det(\Sigma^{\frac{1}{2}})}$ or equivalently $\frac{1}{\det(\Sigma)^{\frac{1}{2}}}$.
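
As a rough numerical check of the product argument above (a sketch, not something from the original posts): for a diagonal $\Sigma$, the $n$-dimensional integral factors into one-dimensional Gaussian integrals, so their product should match $(2\pi)^{n/2}\det(\Sigma)^{1/2}$. The variances and the helper `gaussian_integral_1d` below are made up for illustration, and each integral is only approximated with a simple midpoint rule on a truncated range.

```python
import math


def gaussian_integral_1d(variance, half_width_sds=10.0, steps=20000):
    """Midpoint-rule estimate of the integral of exp(-y^2 / (2 * variance)).

    The range is truncated to +/- half_width_sds standard deviations,
    which is an assumption of this sketch (the tails are negligible there).
    """
    sd = math.sqrt(variance)
    lower = -half_width_sds * sd
    h = 2.0 * half_width_sds * sd / steps
    total = 0.0
    for i in range(steps):
        y = lower + (i + 0.5) * h
        total += math.exp(-y * y / (2.0 * variance))
    return total * h


# Hypothetical diagonal entries (variances) of a diagonal Sigma.
variances = [0.5, 2.0, 3.0]
n = len(variances)

# Product of the n one-dimensional integrals, one per variable.
product_of_1d = math.prod(gaussian_integral_1d(v) for v in variances)

# Closed form: (2 pi)^(n/2) * sqrt(det Sigma), with det = product of the diagonal.
det_sigma = math.prod(variances)
closed_form = (2.0 * math.pi) ** (n / 2) * math.sqrt(det_sigma)

print(product_of_1d, closed_form)
```

The two printed numbers should agree closely; the reciprocal of either is the normalizing constant described above.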
It is perhaps worth pointing out that when I said "if we were trying to make $v$ have unit variance we'd use $\Sigma^{-\frac{1}{2}} v$" -- if we followed through on this, we would not have needed to put this determinant factor in the normalizing constant (as said determinant would be 1). If you're actually working with data that is supposed to be independent but perhaps isn't, you would be whitening your data (a term from signal processing) at such a step -- which can also be interpreted as being proportional to the result you get when setting all singular values equal to 1, assuming you have a non-singular data matrix and so forth.

Post Script: For avoidance of doubt, unless explicitly stated otherwise, when I say square root of something I mean the positive square root -- in the same way that standard deviations are square roots of variances.

Last edited: Jan 17, 2017

7. Jan 18, 2017

### binbagsss

So I got the correct normalization constant by doing $\tilde{y}=P^T \tilde{x}$, where $\tilde{x}=x-\mu$, and $\Sigma^{-1}=(PDP^T)^{-1}$, with $D$ the eigenvalue matrix of $\Sigma$ and $P$ the eigenvector matrix.

However, I did not take this fact into consideration: the covariance matrix is the one specified in the question and so must be the one computed from $x$? So this is wrong?

Thanks in advance

8. Jan 18, 2017

### StoneTemplePython

The thing is -- variance and covariance are distance metrics. It really doesn't matter if you have a mean in there or not -- the variances certainly don't change (they are defined as dispersion about a given mean using a Euclidean distance metric) and if you work through the math, the covariances don't either.

Quick walkthrough of the math: $\operatorname{cov}(X,Y) = E[XY] - E[X]E[Y]$

Consider the case where we shift $X$ by some fixed amount $b$:

$$\begin{aligned}
\operatorname{cov}(X + b,Y) &= E[(X+b)Y] - E[X+b]E[Y]\\
&= E[XY+ bY] - (E[X] + b)E[Y]\\
&= E[XY] +E[bY] - (E[X] + b)E[Y]\\
&= E[XY] + bE[Y] - (E[X]E[Y] + bE[Y])\\
&= E[XY] - E[X]E[Y] \\
&= \operatorname{cov}(X, Y)
\end{aligned}$$

Now set $b := -E[X]$ (minus the mean of $X$), and you have demonstrated that covariance doesn't change after centering $X$. Apply the same argument again to center $Y$.
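
The shift invariance derived above can also be sanity-checked on simulated data. This is only an illustrative sketch: the helper `cov`, the sample size, and the distribution parameters are all assumptions chosen for the example, and the covariance is the plain sample version of $E[XY] - E[X]E[Y]$.

```python
import random


def cov(xs, ys):
    """Sample analogue of E[XY] - E[X]E[Y] (dividing by n, not n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y


random.seed(0)

# Hypothetical correlated data with non-zero means.
xs = [random.gauss(3.0, 2.0) for _ in range(10_000)]
ys = [0.5 * x + random.gauss(-1.0, 1.0) for x in xs]

# Shift X by b = -(sample mean of X), i.e. center it.
b = -sum(xs) / len(xs)
centered_xs = [x + b for x in xs]

cov_original = cov(xs, ys)
cov_shifted = cov(centered_xs, ys)

print(cov_original, cov_shifted)
```

Up to floating-point rounding, the two printed values should coincide, which is the sample-level counterpart of $\operatorname{cov}(X + b, Y) = \operatorname{cov}(X, Y)$.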

That said I'm quite biased toward working with zero mean variables -- the expressions are a lot simpler, and if you're working with data, there are very compelling linear algebra related reasons to center the data as part of preprocessing. The only real assumption you make is that your underlying variables have a mean (and that your sampling isn't horrible) -- but you don't need to make any assumption beyond that, so it's quite general.

So if you twist my arm -- I'd say you're ok. I almost always try to work with zero-mean variables and then, if needed, map back at the end (adding the mean back is a bijection).

(Note for completeness: the only time I'm aware that you can't really work with zero mean variables, where a mean exists, is when you're looking at things in a different way -- i.e. expected time till failure or expected time till absorption or whatever stochastic process. That doesn't apply here.)

N.B. apparently putting square brackets around b triggers bold and blows up all the associated LaTeX. This kept happening when I was showing that the expectation of a b is b.

Last edited: Jan 18, 2017