Determining the covariance matrix of a multivariate normal distribution

SUMMARY

The discussion focuses on determining the covariance matrix of the multivariate normal approximation to a sample of colored balls drawn from a population with known proportions: 50% red, 30% green, and 20% blue. For a sample of n = 1000, the expected counts are Mu = [500, 300, 200], with variances given by n*P*(1-P), where P is the probability of selecting a color. The participants agree that the diagonal elements of the covariance matrix are the variances, and that the covariances can be derived analytically from the sampling model, particularly under the constraint that the total number of balls in the sample is constant. The conversation highlights the use of the multinomial distribution and the de Moivre-Laplace theorem for approximating it with a multivariate normal distribution.

PREREQUISITES
  • Understanding of multivariate normal distribution
  • Knowledge of covariance and variance calculations
  • Familiarity with multinomial distribution
  • Experience with the de Moivre-Laplace theorem
NEXT STEPS
  • Research the derivation of the covariance matrix for multivariate distributions
  • Learn about the Mahalanobis distance and its applications in statistical modeling
  • Explore the implications of the de Moivre-Laplace theorem in multivariate statistics
  • Investigate the conditions under which normal approximations are valid for multinomial distributions
USEFUL FOR

Statisticians, data scientists, and researchers working with multivariate distributions, particularly those interested in modeling categorical data and understanding covariance structures.

sandsifter
Hi all,

I have a stats problem I'm trying to figure out.

Suppose I have a very large population (~millions) of colored balls with exactly 50% red, 30% green, 20% blue. If I take a random sample of 1000 of these balls, the distribution of colors I end up with can be modeled as a multivariate normal distribution, with the dependent variables [r g b] denoting the numbers of red, green and blue balls.

I can calculate the expected values and variances for each variable:

The means are (obviously): Mu = [500 300 200]

And each variance will be given by n*P*(1-P)

where P is the probability of picking a particular color (e.g. 0.5 for red) and n is the sample size, 1000 in this case. Note that this is a valid approximation as long as the population I am sampling from is large compared to the sample size.
However, I'm not sure how to determine the covariance matrix. I know the diagonal elements are the variances of each variable, so that part is easy. But given that the variables must add up to 1000 (r+g+b=1000), and knowing the means and variances of each, shouldn't I be able to use an analytical expression to determine the covariances?

Intuitively, I know that for every extra red ball I pick, there is a 300/(300+200) probability that there will be one less green ball, and a 200/(300+200) probability that there will be one less blue ball, in the final distribution.

What am I missing here? I'm about to run a Monte Carlo simulation to determine the covariances empirically, but I'm sure there's a neater way! Thanks in advance!
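For reference, the covariances being asked about do have a closed form under multinomial sampling, Cov(X_i, X_j) = -n*p_i*p_j for i != j, and a Monte Carlo run of the kind mentioned above can be used to check it. Below is a minimal NumPy sketch, assuming the very large population can be treated as sampling with replacement:

```python
import numpy as np

# Proportions and sample size from the problem statement
p = np.array([0.5, 0.3, 0.2])   # red, green, blue
n = 1000

# Closed-form multinomial moments:
#   E[X_i]        = n * p_i
#   Var(X_i)      = n * p_i * (1 - p_i)
#   Cov(X_i, X_j) = -n * p_i * p_j   for i != j
mean_analytic = n * p
cov_analytic = n * (np.diag(p) - np.outer(p, p))

# Monte Carlo check: many samples of 1000 balls, drawn with replacement
# (a good approximation when the population is in the millions)
rng = np.random.default_rng(0)
draws = rng.multinomial(n, p, size=100_000)   # shape (100000, 3)

print("analytic covariance:\n", cov_analytic)
print("empirical covariance:\n", np.cov(draws, rowvar=False))
```

The two matrices should agree up to Monte Carlo noise.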
 
I will try to rephrase the question in a more general way. I have 3 dependent variables, X, Y and Z. I know their marginal distributions (which are normal) and I know roughly how they move together:

X+Y+Z = constant

and if I increase X by one, Y will on average decrease by the proportional amount Y/(Z+Y), and Z will decrease by Z/(Z+Y). I *think* this is the case (see the more detailed formulation above).

Given all this, isn't it possible to calculate the correlations and hence covariances between X, Y and Z?
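Yes: the standard indicator-variable derivation for multinomial counts gives the covariances directly. A sketch, using p_r, p_g for the red and green probabilities and I_k for per-draw indicator variables (notation introduced only for this sketch):

```latex
% X = number of red balls, Y = number of green balls in n independent draws.
% Write X = \sum_{k=1}^{n} I_k^{r} and Y = \sum_{k=1}^{n} I_k^{g}, where
% I_k^{r} = 1 if draw k is red (probability p_r), and similarly for green.
\begin{align*}
\operatorname{Cov}(X,Y)
  &= \sum_{k=1}^{n}\sum_{\ell=1}^{n} \operatorname{Cov}\bigl(I_k^{r}, I_\ell^{g}\bigr)
   = \sum_{k=1}^{n} \operatorname{Cov}\bigl(I_k^{r}, I_k^{g}\bigr)
     && \text{(different draws are independent)} \\
  &= n\bigl(\operatorname{E}[I^{r} I^{g}] - p_r p_g\bigr)
   = -\,n\, p_r p_g
     && \text{(one draw cannot be both red and green).}
\end{align*}
```

With the numbers in this thread that gives Cov(r,g) = -1000(0.5)(0.3) = -150, Cov(r,b) = -100 and Cov(g,b) = -60, and each row of the resulting covariance matrix sums to zero, which is exactly the X+Y+Z = constant relation (Var(X+Y+Z) = 0).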
 
sandsifter,

I'm not sure what you need your model for, so I am not sure what approximations actually make sense here.

To me this looks like a classic Bernoulli trial problem. I choose N=1000 balls, each with probability .5 of being red, .3 green, and .2 blue. So the probability of getting kr red balls, kg green balls and kb blue balls (with kr+kg+kb=N) should be a multinomial. If you want, you can use the de Moivre-Laplace theorem to approximate this with a multivariate normal distribution, but that will only be accurate for a k within about sqrt(N) or so of the expected value. I don't know the goal of your model, so I don't know if that approximation is useful to you or not. Note that under that approximation you will have a nonzero probability of kr+kg+kb>N (and of kr+kg+kb<0). Of course, your approach will also have a nonzero probability of kr+kg+kb>N (and of kr+kg+kb<0). Why? If you model X, Y, and Z as jointly normal, then X+Y+Z is also normal (you should prove this if you don't already know it), so with this model you cannot impose the hard X+Y+Z=N constraint.

jason
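To make the constraint point concrete: the exact multinomial covariance matrix of (kr, kg, kb) has rank 2, since the counts always sum to N, so a normal approximation either has to be degenerate or has to drop one component. A small sketch of the drop-one-component option, assuming SciPy is available (dropping the blue count is an arbitrary choice made only for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

p = np.array([0.5, 0.3, 0.2])   # red, green, blue
n = 1000

# Exact multinomial covariance of the full count vector; its rows sum to
# zero because kr + kg + kb = n, so the matrix is singular (rank 2)
cov_full = n * (np.diag(p) - np.outer(p, p))
print("rank:", np.linalg.matrix_rank(cov_full))   # 2, not 3

# Normal approximation on (kr, kg) only; kb = n - kr - kg is then implied
approx = multivariate_normal(mean=n * p[:2], cov=cov_full[:2, :2])

# Density near the expected outcome (a continuous stand-in for the pmf)
print(approx.pdf([500, 300]))
```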
 
jasonRF said:
If you model X, Y, and Z as jointly normal, then X+Y+Z is also normal ... so with this model you cannot impose the hard X+Y+Z=N constraint.

Jason,

You are right. I took it for granted that the normal approximation is valid (it probably is close enough, but your last statement is making me think a little bit...).

My aim was to use the Mahalanobis distance to see whether certain data sets were close to a given model (e.g. in my example here, the model would be the multivariate normal with means corresponding to 50% red, 30% green, 20% blue and the appropriate variances and covariances).

Edit: It's clear to me that the normal approximation is indeed valid, as long as the sample size is large enough. I know that for the binomial (two-color) case there are rules of thumb for when the approximation is acceptable. I'm not sure how these scale to the multivariate case, but in my example, where I'm using n=1000, I would probably have a good approximation regardless of the variables' proportions.

Thanks for pointing me in the right direction and for your reply. I really appreciate it!
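As a closing sketch of the Mahalanobis-distance idea: because the full covariance matrix is singular, the distance can be computed on two of the three counts (the third is redundant given the sum constraint). Using the analytic multinomial covariance, this squared distance comes out equal to Pearson's chi-squared statistic, which the last line below checks numerically; the observed counts are made up purely for illustration.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # model proportions: red, green, blue
n = 1000

mean = n * p
cov_full = n * (np.diag(p) - np.outer(p, p))

def mahalanobis_sq(counts):
    """Squared Mahalanobis distance from the model, using only the first
    two counts (the third is determined by the sum constraint)."""
    d = np.asarray(counts[:2], dtype=float) - mean[:2]
    return float(d @ np.linalg.solve(cov_full[:2, :2], d))

observed = np.array([520, 290, 190])          # hypothetical data set
print(mahalanobis_sq(observed))               # ~1.63
print(np.sum((observed - mean) ** 2 / mean))  # Pearson chi-squared, ~1.63
```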
 
