# Determining the covariance matrix of a multivariate normal distribution

1. Oct 19, 2009

### sandsifter

Hi all,

I have a stats problem I'm trying to figure out.

Suppose I have a very large population (~millions) of colored balls with exactly 50% red, 30% green, 20% blue. If I take a random sample of 1000 of these balls, the distribution of colors I end up with can be modeled as a multivariate normal distribution, with the components [r g b] denoting the numbers of red, green and blue balls.

I can calculate the expected values and variances for each variable:

The means are (obviously): Mu = [500 300 200]

And each variance will be given by n*P*(1-P)

Where P is the probability of picking a particular color (e.g. 0.5 for red) and n is the sample size, 1000 in this case. Note that this is a valid approximation as long as the population I am sampling is large compared to the sample size.
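As a quick numerical check, here is a minimal sketch of the means and variances of the counts, assuming each color count is treated as binomial (mean n*P, variance n*P*(1-P)), which is reasonable when the population is much larger than the sample. The variable names are only illustrative.

```python
import numpy as np

n = 1000                          # sample size
p = np.array([0.5, 0.3, 0.2])     # probabilities for red, green, blue

means = n * p                     # expected counts
variances = n * p * (1 - p)       # binomial variance of each count
std_devs = np.sqrt(variances)     # standard deviations, in balls

print("means:    ", means)
print("variances:", variances)
print("std devs: ", std_devs)
```

For these numbers the variances come out as 250, 210 and 160, so a typical fluctuation in each count is on the order of 13 to 16 balls.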

However, I'm not sure how to determine the covariance matrix. I know the diagonal elements are the variances of each variable, so that's easy. However, given that the variables must add up to 1000 (r+g+b=1000), and knowing the means and variances of each, shouldn't I be able to use an analytical expression to determine the covariances?

Intuitively, I know that for every extra red ball I pick, this translates to 300/(300+200) probability that there will be one less green ball, and 200/(300+200) probability that there will be one less blue ball, in the final distribution.

What am I missing here? I'm about to perform a Monte Carlo simulation to empirically determine the covariances, but I'm sure there's a neater way!
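One way to compare the two routes is sketched below. It assumes the counts can be treated as multinomial (sampling with replacement, a good approximation for a population this large), in which case the standard result is Cov(X_i, X_j) = -n*P_i*P_j for i ≠ j and Var(X_i) = n*P_i*(1-P_i). The seed and the number of simulated samples are arbitrary choices for illustration.

```python
import numpy as np

n = 1000
p = np.array([0.5, 0.3, 0.2])

# Analytical multinomial covariance: n * (diag(p) - p p^T).
# Diagonal entries are n*p_i*(1-p_i); off-diagonal entries are -n*p_i*p_j.
cov_exact = n * (np.diag(p) - np.outer(p, p))

# Monte Carlo estimate: draw many samples of 1000 balls each.
rng = np.random.default_rng(0)
samples = rng.multinomial(n, p, size=100_000)   # shape (100000, 3)
cov_mc = np.cov(samples, rowvar=False)          # columns are variables

print("analytical:\n", cov_exact)
print("Monte Carlo:\n", np.round(cov_mc, 1))
```

The analytical off-diagonal entries are -150 (red/green), -100 (red/blue) and -60 (green/blue), and the simulated matrix should agree with them to within sampling noise. The ratio Cov(r, g)/Var(r) = -150/250 = -0.6 matches the 300/(300+200) intuition above.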

Last edited: Oct 20, 2009

2. Oct 20, 2009

### sandsifter

I will try to rephrase the question in a more general way. I have 3 dependent variables, X, Y and Z. I know their marginal distributions (which are normal) and I know roughly how they move together:

X+Y+Z = constant

and if I increase X by one unit, Y will on average decrease by Y/(Z+Y), and Z will decrease by Z/(Z+Y), where Y and Z here stand for their mean values. I *think* this is the case (see the more detailed formulation above).

Given all this, isn't it possible to calculate the correlations and hence covariances between X, Y and Z?
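For exactly three variables with a constant sum, the variances alone pin down the covariances. A short derivation, using only X+Y+Z = c and the bilinearity of covariance:

$$
\operatorname{Var}(Z) = \operatorname{Var}(c - X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y)
$$

$$
\operatorname{Cov}(X,Y) = \tfrac{1}{2}\left[\operatorname{Var}(Z) - \operatorname{Var}(X) - \operatorname{Var}(Y)\right],
\qquad
\rho_{XY} = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}
$$

The other two covariances follow by permuting the symbols. With the ball example, taking the variances of the counts as n*P*(1-P), i.e. 250, 210 and 160 for red, green and blue, this gives Cov(r, g) = (160 - 250 - 210)/2 = -150. Note that this shortcut relies on there being only three variables; with more categories the constraint alone leaves the covariances underdetermined.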

3. Oct 22, 2009

### jasonRF

sandsifter,

I'm not sure what you need your model for, so I am not sure what approximations actually make sense here.

To me this looks like a classic Bernoulli trial problem. I choose N=1000 balls, each with probability .5 of being red, .3 green, and .2 blue. So the probability of getting kr red balls, kg green balls, and kb blue balls (with kr+kg+kb=N) should be multinomial. If you want, you can use the de Moivre-Laplace theorem to approximate this with a multivariate normal distribution, but that will only be accurate for a k within about sqrt(N) or so of the expected value. I don't know the goal of your model, so I don't know if that approximation is useful to you or not.

Note that under that approximation you will have a nonzero probability of kr+kg+kb>N (and of kr+kg+kb<0). Of course, your approach will also have a nonzero probability of kr+kg+kb>N (and of kr+kg+kb<0). Why? If you model X, Y, and Z as jointly normal with a full-rank covariance matrix, then X+Y+Z is also normal with nonzero variance (you should prove this if you don't already know it), so with this model you cannot impose the hard X+Y+Z=N constraint.
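Under a jointly normal model the sum X+Y+Z has variance 1ᵀΣ1, so how far the total can drift from N depends on the covariance matrix that is plugged in. The sketch below compares two candidate matrices, both assumptions made for illustration rather than anything specified in the thread: a diagonal matrix that ignores the covariances, and the multinomial covariance n*(diag(p) - p pᵀ).

```python
import numpy as np

n = 1000
p = np.array([0.5, 0.3, 0.2])
ones = np.ones(3)

# Candidate 1: treat the three counts as independent (covariances ignored).
cov_indep = np.diag(n * p * (1 - p))

# Candidate 2: multinomial covariance, with negative off-diagonal entries.
cov_multi = n * (np.diag(p) - np.outer(p, p))

# Variance of X + Y + Z under each candidate: 1^T Sigma 1.
var_sum_indep = ones @ cov_indep @ ones
var_sum_multi = ones @ cov_multi @ ones

print("Var(X+Y+Z), independent model:", var_sum_indep)
print("Var(X+Y+Z), multinomial cov:  ", var_sum_multi)
print("rank of multinomial cov:", np.linalg.matrix_rank(cov_multi))
```

With the diagonal matrix the sum has variance 620, so the total strays from N with positive probability. With the multinomial covariance the variance of the sum is zero up to rounding, at the price of a singular (rank 2) matrix that cannot be inverted directly. In either case the individual normal components can still fall below 0 or above N, if only with tiny probability here.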

jason

4. Oct 22, 2009

### sandsifter

Jason,

You are right. I took it for granted that the normal approximation is valid (it probably is close enough, but your last statement is making me think a little bit...).

My aim was to use the Mahalanobis distance to see if certain data sets were close to a certain model (e.g. in my example here, the model would be the multivariate distribution with means of 50% red, 30% green, 20% blue and the appropriate variances and covariances).
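A minimal sketch of that Mahalanobis check follows, assuming the multinomial covariance n*(diag(p) - p pᵀ) for the model. Because the counts sum to n, that matrix is singular, so the sketch drops the last color and works with the remaining 2x2 block; the function name and the observed counts are hypothetical.

```python
import numpy as np

def mahalanobis_sq(counts, p):
    """Squared Mahalanobis distance of observed counts from the model.

    Drops the last category, since the full covariance matrix is singular
    when the counts are constrained to sum to n.
    """
    counts = np.asarray(counts, dtype=float)
    p = np.asarray(p, dtype=float)
    n = counts.sum()
    cov = n * (np.diag(p) - np.outer(p, p))   # multinomial covariance
    diff = (counts - n * p)[:-1]              # deviation from expected counts
    cov_reduced = cov[:-1, :-1]               # invertible sub-block
    return float(diff @ np.linalg.solve(cov_reduced, diff))

observed = [520, 290, 190]                    # hypothetical data set
model_p = [0.5, 0.3, 0.2]

d2 = mahalanobis_sq(observed, model_p)

# Pearson's chi-square statistic for the same data, for comparison.
expected = sum(observed) * np.array(model_p)
pearson = float(np.sum((np.array(observed) - expected) ** 2 / expected))

print("squared Mahalanobis distance:", d2)
print("Pearson chi-square:          ", pearson)
```

The two numbers agree, which reflects the fact that for a multinomial model this squared distance is the same quantity as Pearson's chi-square statistic. Under the normal approximation it can be compared against a chi-square distribution with 2 degrees of freedom.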

Edit: It's clear to me that the normal approximation is indeed valid, as long as the sample size is large enough. I know that for the binomial (two-category) case there are rules of thumb for using the approximation. Not sure how these scale to the multivariate case, but in my example where I'm using n=1000, I would probably have a good approximation regardless of the variables' proportions.

Thanks for pointing me in the right direction and for your reply. I really appreciate it!

Last edited: Oct 22, 2009