I am a Computer Science student who wants to implement the EM statistical clustering algorithm. I'm doing this in my spare time, outside of any classes I'm taking. I've been doing a lot of reading and understand almost everything I need to. However, I only understand univariate normal distributions. Could anyone here give me an undergraduate-level mathematical rationale for the formal description of the multivariate Gaussian distribution's pdf? I might then ask some questions about the underlying math. Thanks to anyone who steps up to the plate.
I can't figure out what you are trying to find out. However, the multivariate Gaussian distribution is the joint distribution function for random variables each of which has a Gaussian distribution.
Thank you mathman! I am looking for an undergraduate-level explanation of how the multivariate Gaussian distribution's pdf was deduced. All the while I'm looking for the rationale behind the math: little sentences which explain why each step is justified if it isn't obvious, and little sentences which explain the general approach (like justifying the use of matrices). Your post makes sense to me. I understand what the function is. I just don't quite understand how we came to that understanding, what all of the parameters of the distribution are, and why they are formalized as vectors and matrices.
xnull, I've been studying the same thing, on and off, for a while now, in the context of combination of errors. Can you link to or post the equations?
Actually, that's a common mistake. While each individual random variable of a multivariate Gaussian is itself Gaussian, the converse is not true: there exist Gaussian random variables which do not jointly form a multivariate Gaussian. However, the following are true:
- Independent Gaussian r.v.s form a multivariate Gaussian.
- Any set of linear combinations of independent Gaussians is multivariate Gaussian.
- A vector of random variables is multivariate Gaussian if and only if all linear combinations of them are (individually) Gaussian r.v.s.
- Any pdf proportional to exp(-Q(x_{1},x_{2},...)) for a positive definite quadratic form Q is a multivariate Gaussian pdf.
- A r.v. X taking values in R^{d} is multivariate Gaussian if and only if its characteristic function has the form exp(i b·t - Q(t)) for some vector b and positive semidefinite quadratic form Q.
- The central limit theorem says that for a sequence of IID random variables X_{1}, X_{2}, ... taking values in R^{d}, with mean mu and finite variance, (X_{1}+...+X_{n} - n mu)/sqrt(n) converges in distribution to a multivariate Gaussian.
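As a quick numerical illustration of the second point above (a sketch in Python/NumPy; the matrix M below is just a made-up example, not anything from the thread): applying a linear map M to independent standard Gaussians produces a multivariate Gaussian whose covariance is M Mᵀ, which we can check against the sample covariance of many draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# n draws of 3 independent standard Gaussians, one draw per row
Z = rng.standard_normal((n, 3))

# An arbitrary linear map: each row of X is a vector of linear
# combinations of independent Gaussians, hence multivariate Gaussian
M = np.array([[1.0, 0.5, 0.0],
              [0.0, 2.0, 1.0]])
X = Z @ M.T

# The sample covariance should approach the theoretical M M^T
sample_cov = np.cov(X, rowvar=False)
theory_cov = M @ M.T
```

With 200,000 samples the agreement is good to roughly two decimal places; this is only a statistical check, of course, not a proof.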
I think Phrak wanted references/links on the expectation-maximization algorithm. Now you have a keyword, Phrak! Google away. One use of the EM algorithm (and this is what I suspect xnull is using it for) is to describe a multivariate data distribution as a density mixture, usually in the form of a weighted set of multivariate Gaussians. This is the approach used in a lot of Bayesian-based supervised and unsupervised learning techniques to create pattern classifiers.

The description in post #2 is overly simplistic: specifically, it ignores correlations. It is those correlations that necessitate the matrix/vector formulation of the multivariate Gaussian: [tex]f(\boldsymbol x) = \frac 1 {(2\pi)^{N/2}|\mathbf{\Sigma}|^{1/2}}\, \exp\left(-\,\frac 1 2\, (\boldsymbol x - \boldsymbol \mu)^T\,\mathbf{\Sigma}^{-1}\,(\boldsymbol x - \boldsymbol \mu)\right)[/tex] Examining each part: the leading factor makes the integral of the multivariate Gaussian PDF over all space equal one, which is required of any probability density. The important part is the exponential. This is the multidimensional analog of the one-dimensional form [itex]\exp(-\frac 1 2\,(x-\mu)^2/\sigma^2)[/itex]. The expression [itex](x-\mu)^2/\sigma^2[/itex] is a one-dimensional positive definite quadratic form. Extending to higher dimensions suggests using a multidimensional positive definite quadratic form, to wit [tex] (\boldsymbol x - \boldsymbol b)^T\mathbf A(\boldsymbol x - \boldsymbol b)[/tex] To be a positive definite quadratic form the matrix A has to be positive definite. In addition to being positive definite, this treatment takes A to be symmetric, since the covariance matrix is symmetric.
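Since xnull is implementing this, here is how that density can be computed directly from the formula, as a minimal NumPy sketch (the function name `mvn_pdf` and the example numbers are my own choices, not anything standard):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, evaluated straight from the formula."""
    N = len(mu)
    diff = x - mu
    # Normalizing factor: 1 / ((2*pi)^(N/2) * |Sigma|^(1/2))
    norm = 1.0 / np.sqrt((2 * np.pi) ** N * np.linalg.det(Sigma))
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    # rather than an explicit inverse
    quad = diff @ np.linalg.solve(Sigma, diff)
    return norm * np.exp(-0.5 * quad)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# At the mean the exponential is 1, so the density is just the
# normalizing factor 1 / ((2*pi)^(N/2) |Sigma|^(1/2))
p = mvn_pdf(mu, mu, Sigma)
```

Evaluating at the mean is a handy sanity check: the off-diagonal entries of Sigma (the correlations) change the determinant, and hence the peak height, even though the exponential is 1 there.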
To qualify as a PDF the function needs to integrate to one over all space, so we need to find a scale factor [itex]\alpha[/itex] such that [tex]\alpha \int \exp\left(-\,\frac 1 2\, (\boldsymbol x - \boldsymbol b)^T\,\mathbf A\,(\boldsymbol x - \boldsymbol b)\right) d\boldsymbol x= 1[/tex] The easiest way to evaluate this integral is to transform to the principal axis system, in which the transformed A matrix is diagonal. That such a transform exists is guaranteed by the positive definite, symmetric nature of the matrix A. The integral is separable in this system, resulting in [tex]\int \exp\left(-\,\frac 1 2\, (\boldsymbol x - \boldsymbol b)^T\,\mathbf A\,(\boldsymbol x - \boldsymbol b)\right) d\boldsymbol x= \frac{(2\pi)^{N/2}}{|\mathbf A|^{1/2}} [/tex] The scale factor alpha falls out from this. Scaling to form a PDF, [tex]f(\boldsymbol x) = \frac{|\mathbf A|^{1/2}}{(2\pi)^{N/2}}\, \exp\left(-\,\frac 1 2\, (\boldsymbol x - \boldsymbol b)^T\,\mathbf A\,(\boldsymbol x - \boldsymbol b)\right)[/tex] What are its mean and covariance? You can grind through the math if you want. The mean is simply the vector b, and the covariance is simply the inverse of the matrix A. Denoting the mean as mu and the covariance matrix as Sigma leads to the more standard form [tex]f(\boldsymbol x) = \frac1{(2\pi)^{N/2}|\mathbf{\Sigma}|^{1/2}}\, \exp\left(-\,\frac 1 2\, (\boldsymbol x - \boldsymbol \mu)^T\,\mathbf{\Sigma}^{-1}\,(\boldsymbol x - \boldsymbol \mu)\right)[/tex]
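For anyone who wants to see that normalization constant appear without doing the principal-axis transform by hand, here is a brute-force grid-integration check in NumPy (the example A, b, grid limits, and spacing are arbitrary choices of mine):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # positive definite, symmetric
b = np.array([0.5, -0.25])
N = 2

# Riemann-sum the unnormalized integrand exp(-1/2 (x-b)^T A (x-b))
# over a grid wide enough that the tails are negligible
xs = np.linspace(-6.0, 7.0, 801)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs, indexing="ij")
D = np.stack([X - b[0], Y - b[1]], axis=-1)
quad = np.einsum("...i,ij,...j->...", D, A, D)
integral = np.exp(-0.5 * quad).sum() * dx * dx

# The closed form derived above: (2*pi)^(N/2) / |A|^(1/2)
closed_form = (2 * np.pi) ** (N / 2) / np.sqrt(np.linalg.det(A))
```

The two numbers agree to several decimal places, so the scale factor alpha really is |A|^{1/2}/(2π)^{N/2}.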
Thanks DH. I think xnull and I have multivariate Gaussian distributions in common, but depart where EM begins. Both of us, however, utilize N random vectors in some M-dimensional basis space of independent probabilities. We seem to have the same goal of getting a better grasp of how this vector space relates to the space of random variables.
Yes, precisely. I'm using this for unsupervised learning. Phrak, if you want to get help on the EM algorithm you should start another thread so that it is more easily searchable by Googlers. I'll be sure to keep my eye out in case you do. Thank you D H. That was precisely the type of description I was looking for. I do not understand why the integral becomes separable when it's transformed to the principal axis system, nor what the principal axis system would be for this n-dimensional space, nor why the matrix A is diagonal in this axis system. But since I now understand where the pdf came from, and since these remaining questions seem to lie outside the scope of my original question, I'll extend my thanks to you (again!). You've made one unsupervised robot very happy. Thanks several times, D H.
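Since the end goal of this thread is EM clustering, here is a deliberately tiny sketch of EM fitting a two-component, one-dimensional Gaussian mixture (NumPy; the data, initial guesses, and iteration count are my own arbitrary choices, and a real implementation would work with log-densities and a convergence test rather than a fixed loop):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: two well-separated 1-d Gaussian clusters
data = np.concatenate([rng.normal(-5.0, 1.0, 400),
                       rng.normal(4.0, 1.0, 600)])

# Initial guesses for means, variances, and mixture weights
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def gauss(x, m, v):
    """1-d Gaussian density."""
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E step: posterior responsibility of each component for each point
    r = pi * gauss(data[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M step: re-estimate weights, means, and variances
    nk = r.sum(axis=0)
    pi = nk / len(data)
    mu = (r * data[:, None]).sum(axis=0) / nk
    var = (r * (data[:, None] - mu) ** 2).sum(axis=0) / nk

mu_sorted = np.sort(mu)
```

The multivariate version replaces `gauss` with the matrix/vector pdf derived above and the variance update with a covariance-matrix update; the E/M structure is unchanged.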
I edited my previous post. The modification is shown here, with the changes bold: These additions are needed to answer xnull's question in post #9. You're welcome. The reason the system is diagonalizable is that all positive definite symmetric matrices are diagonalizable. Moreover, the diagonalization is done with an orthogonal transformation matrix. This is a direct result of the spectral theorem. How to construct this matrix? Simple: use either the eigen decomposition or the singular value decomposition. (For a positive definite symmetric matrix these yield the same result.) What does it look like? The expression [itex](\boldsymbol x - \boldsymbol b)^T\mathbf A(\boldsymbol x - \boldsymbol b) = c[/itex] is the equation of an N-dimensional ellipsoid. The axes of this ellipsoid might point any which way, but they will be orthogonal to one another. The ellipsoid axes form an orthogonal basis for the N-space: call this the principal axis system. There exists some transformation matrix that transforms from the original system to the principal axis system. The covariance matrix in this principal axis system takes on a simple form: it is diagonal. The transformation matrix that transforms to the principal axis system is the orthogonal matrix that diagonalizes the covariance matrix (so, another means of construction). Why is the integral separable in the principal axis system? Simple: do the math. The original quadratic form [itex](\boldsymbol x - \boldsymbol b)^T\mathbf A(\boldsymbol x - \boldsymbol b)[/itex], upon transformation to the principal axis system, becomes the much simpler [itex]\sum_i a'_i {x'_i}^2 = a'_1 {x'_1}^2 + a'_2 {x'_2}^2+ \cdots + a'_N {x'_N}^2[/itex]. (Here the a'_{i} are the diagonal elements of the transformed A matrix and the x'_{i} are the transformed independent variables.)
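A concrete way to see the spectral theorem at work (a NumPy sketch; the matrix A and the test point x are just examples of mine): `np.linalg.eigh` gives the eigenvalues and an orthogonal eigenvector matrix for a symmetric matrix, and rotating into that eigenbasis turns the quadratic form into a plain weighted sum of squares.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite

# Spectral theorem: A = V diag(w) V^T with V orthogonal
w, V = np.linalg.eigh(A)

x = np.array([1.0, -2.0])
quad_original = x @ A @ x       # x^T A x in the original coordinates

xp = V.T @ x                    # coordinates in the principal axis system
quad_principal = (w * xp ** 2).sum()   # sum_i a'_i x'_i^2
```

The two quadratic forms agree exactly (up to floating-point error): the rotation changes the coordinates, not the value of the form, which is what makes the integral separable in the principal axis system.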
Now exponentiate this: [tex]\exp\left(-\,\frac 1 2\,(\boldsymbol x - \boldsymbol b)^T\mathbf A(\boldsymbol x - \boldsymbol b)\right) \;\to\; \exp\left(-\,\frac 1 2\,a'_1 {x'_1}^2\right)\exp\left(-\,\frac 1 2\,a'_2 {x'_2}^2\right)\cdots \exp\left(-\,\frac 1 2\,a'_N {x'_N}^2\right)[/tex] Integrating over N-space yields [tex]\prod_{i=1}^N\sqrt{\frac{2\pi}{a'_i}} = \frac{(2\pi)^{N/2}}{|\mathbf A'|^{1/2}}[/tex] Since the determinant of a matrix is invariant under orthogonal transformations, [itex]|\mathbf A'| = |\mathbf A|[/itex].
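This last step can also be confirmed numerically (same kind of arbitrary symmetric positive definite example matrix as above, chosen by me): the product of the N one-dimensional Gaussian integrals equals the closed form, and the determinant survives the orthogonal transformation unchanged.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Eigenvalues w are the diagonal entries a'_i of the transformed A'
w, V = np.linalg.eigh(A)

# Product of the N one-dimensional integrals sqrt(2*pi / a'_i)
product = np.prod(np.sqrt(2 * np.pi / w))
closed_form = (2 * np.pi) ** (len(w) / 2) / np.sqrt(np.linalg.det(A))

# |A'| = |A|: the determinant is invariant under orthogonal transforms
det_prime = np.prod(w)
```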
Wow, D H, you went all out. Thank you many times. I really understand this now! My experiences with this forum (the few) have been exceptional. I posted the same question on two different forums - one focused on math and the other with a focus on statistics. Only this forum responded. And the quality of the responses. Wow. Which means I'm going to try to stick around and help other people as best I can, where I can. Great community guys.