Determining the covariance matrix of a multivariate normal distribution

In summary, the conversation revolves around a stats problem involving a large population of colored balls and a random sample of 1000 balls. The distribution of colors in the sample can be modeled as a multivariate normal distribution, and the means and variances for each variable can be calculated. However, there is confusion about how to determine the covariance matrix. The conversation discusses different approaches and approximations, including the use of the Mahalanobis distance to determine if certain data sets fit the expected model.
  • #1
sandsifter
Hi all,

I have a stats problem I'm trying to figure out.

Suppose I have a very large population (~millions) of colored balls with exactly 50% red, 30% green, 20% blue. If I take a random sample of 1000 of these balls, the distribution of colors I end up with can be modeled as a multivariate normal distribution, with the dependent variables [r g b] denoting the number of red, green and blue balls.

I can calculate the expected values and variances for each variable:

The means are (obviously): Mu = [500 300 200]

And each variance (of the counts) will be given by n*P*(1-P), e.g. 1000*0.5*0.5 = 250 for red

Where P is the probability of picking a particular color (e.g. 0.5 for red) and n is the sample size, 1000 in this case. Note that this is a valid approximation as long as the population I am sampling is large compared to the sample size.
However, I'm not sure how to determine the covariance matrix. I know the diagonal elements are the variances of each variable (not the standard deviations), so that's easy. However, given that the variables must add up to 1000 (r+g+b=1000), and knowing the means and variances of each, shouldn't I be able to use an analytical expression to determine the covariances?

Intuitively, I know that for every extra red ball I pick, this translates to 300/(300+200) probability that there will be one less green ball, and 200/(300+200) probability that there will be one less blue ball, in the final distribution.

What am I missing here? I'm about to perform a Monte Carlo simulation to empirically determine the covariances, but I'm sure there's a neater way! Thanks in advance!
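For reference, the Monte Carlo check mentioned above could look roughly like the sketch below. It assumes the population is large enough that each draw is independent with fixed probabilities (sampling with replacement); the function names, the number of trials and the seed are illustrative choices, not part of the original question.

```python
import random

def simulate_counts(n, probs, trials, seed=0):
    """Draw `trials` samples of n balls each and return a list of per-color counts."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    colors = list(range(len(probs)))
    samples = []
    for _ in range(trials):
        draws = rng.choices(colors, weights=probs, k=n)
        counts = [0] * len(probs)
        for c in draws:
            counts[c] += 1
        samples.append(counts)
    return samples

def sample_covariance(samples):
    """Unbiased sample covariance matrix (divides by m - 1)."""
    m = len(samples)
    k = len(samples[0])
    means = [sum(s[i] for s in samples) / m for i in range(k)]
    return [
        [
            sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / (m - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]

# 1000 balls per sample, 50% red, 30% green, 20% blue
samples = simulate_counts(1000, [0.5, 0.3, 0.2], trials=2000)
cov = sample_covariance(samples)
for row in cov:
    print([round(v, 1) for v in row])
```

Because r+g+b is fixed in every sample, each row of the estimated matrix sums to zero up to rounding, which is a quick sanity check on the simulation.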
 
  • #2
I will try to rephrase the question in a more general way. I have 3 dependent variables, X, Y and Z. I know their marginal distributions (which are normal) and I know roughly how they move together:

X+Y+Z = constant

and if I increase X, Y will on average decrease by a proportional amount, Y/(Z+Y), and Z will decrease by Z/(Z+Y). I *think* this is the case (see the more detailed formulation above).

Given all this, isn't it possible to calculate the correlations and hence covariances between X, Y and Z?
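For three variables, the sum constraint alone appears to be enough to fix the covariances once the variances are known. A short derivation, taking the count variances to be n p (1 - p), i.e. 250, 210 and 160 for the example above (that formula is an assumption of this sketch):

```latex
% Since X + Y + Z = c, we have X + Y = c - Z, so
\operatorname{Var}(X + Y) = \operatorname{Var}(c - Z) = \operatorname{Var}(Z)
% Expanding the left-hand side:
\operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y) = \operatorname{Var}(Z)
% Solving for the covariance:
\operatorname{Cov}(X, Y) = \tfrac{1}{2}\bigl(\operatorname{Var}(Z) - \operatorname{Var}(X) - \operatorname{Var}(Y)\bigr)
% Example: red and green counts
\operatorname{Cov}(r, g) = \tfrac{1}{2}(160 - 250 - 210) = -150
```

The same argument with the roles permuted gives Cov(r, b) = -100 and Cov(g, b) = -60. With more than three categories the constraint no longer determines every covariance by itself.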
 
  • #3
sandsifter,

I'm not sure what you need your model for, so I am not sure what approximations actually make sense, here.

To me this looks like a classic Bernoulli trial problem. I choose N=1000 balls, each with probability .5 of being red, .3 green, and .2 blue. So the probability of getting kr red balls and kg green balls and kb blue balls (with kr+kg+kb=N) should be a multinomial. If you want, you can use the de Moivre-Laplace theorem to approximate this with a multivariate normal distribution, but that will only be accurate for a k within about sqrt(N) or so of the expected value. I don't know your goal of the model, so I don't know if that approximation is useful to you or not.

Note that under that approximation you will have a nonzero probability of kr+kg+kb>N (and of kr+kg+kb<0). Of course, your approach will also have a nonzero probability of kr+kg+kb>N (and of kr+kg+kb<0). Why? If you model X, Y, and Z as jointly normal, then X+Y+Z is also normal (you should prove this if you don't already know it), so with this model you cannot impose the hard X+Y+Z=N constraint.

jason
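If the counts are treated as multinomial, as suggested above, the covariance matrix has a standard closed form: Var(X_i) = N p_i (1 - p_i) on the diagonal and Cov(X_i, X_j) = -N p_i p_j off it. A minimal sketch follows; the function name is illustrative and not from the thread.

```python
def multinomial_covariance(n, probs):
    """Covariance matrix of multinomial counts for n draws with category probabilities probs."""
    k = len(probs)
    return [
        [
            n * probs[i] * (1 - probs[i]) if i == j else -n * probs[i] * probs[j]
            for j in range(k)
        ]
        for i in range(k)
    ]

# N = 1000 balls, 50% red, 30% green, 20% blue
cov = multinomial_covariance(1000, [0.5, 0.3, 0.2])
for row in cov:
    print([round(v, 6) for v in row])
```

For the example this gives variances of 250, 210 and 160 and covariances of -150 (red/green), -100 (red/blue) and -60 (green/blue). Every row sums to zero, which reflects the fixed total: the matrix is singular, so it cannot be inverted directly.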
 
  • #4
Jason,

You are right. I took it for granted that the normal approximation is valid (it probably is close enough, but your last statement is making me think a little bit...).

My aim was to use the Mahalanobis distance to see if certain data sets were close to a certain model (e.g. in my example here, the model would be the multivariate distribution with means of 50% red, 30% green, 20% blue and the appropriate variances and covariances).

Edit: It's clear to me that the normal approximation is indeed valid, as long as the sample size is large enough. I know that for the binomial (two-color) case there are rules of thumb for using the approximation. Not sure how these scale to the multivariate case, but in my example where I'm using n=1000, I would probably have a good approximation regardless of the variables' proportions.

Thanks for pointing me in the right direction and for your reply. I really appreciate it!
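A rough sketch of the Mahalanobis check described above. Because the three counts always sum to N, the full 3x3 covariance matrix is singular, so this version drops the last color and works with the 2x2 covariance of the first two counts, assuming the multinomial variances N p (1 - p) and covariances -N p_i p_j. The function names and the example counts are made up for illustration.

```python
def mahalanobis_sq(counts, probs):
    """Squared Mahalanobis distance of observed counts from the model,
    computed on the first two categories only (the third is redundant)."""
    n = sum(counts)
    p1, p2 = probs[0], probs[1]
    d1 = counts[0] - n * p1  # deviation of first count from its mean
    d2 = counts[1] - n * p2  # deviation of second count from its mean
    # 2x2 covariance matrix [[a, b], [b, c]] of the first two counts
    a = n * p1 * (1 - p1)
    b = -n * p1 * p2
    c = n * p2 * (1 - p2)
    det = a * c - b * b
    # d^T * inverse([[a, b], [b, c]]) * d, written out explicitly
    return (c * d1 * d1 - 2 * b * d1 * d2 + a * d2 * d2) / det

def pearson_chi_sq(counts, probs):
    """Pearson chi-square statistic, for comparison."""
    n = sum(counts)
    return sum((k - n * p) ** 2 / (n * p) for k, p in zip(counts, probs))

probs = [0.5, 0.3, 0.2]
observed = [520, 290, 190]  # hypothetical data set
print(mahalanobis_sq(observed, probs))
print(pearson_chi_sq(observed, probs))
```

Both functions return the same value for this input, which is expected: the squared Mahalanobis distance on the reduced variables coincides with Pearson's chi-square statistic, so the usual chi-square goodness-of-fit test (2 degrees of freedom here) is one way to judge whether a data set is close to the model.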
 

1. What is a multivariate normal distribution?

A multivariate normal distribution is the generalization of the normal distribution to a vector of random variables. It is fully specified by a mean vector and a covariance matrix, and every linear combination of its components is itself normally distributed.

2. Why is it important to determine the covariance matrix of a multivariate normal distribution?

The covariance matrix is important because it contains information about the relationships between the different variables in the distribution. This information can be used to understand the data better, make predictions, and detect patterns and trends.

3. How is the covariance matrix calculated?

The covariance matrix holds the covariance between each pair of variables, with the variances on the diagonal. The covariance is a measure of how two variables change together. From data, it is estimated by multiplying each variable's deviation from its mean, summing these products over all observations, and dividing by the number of observations (or by one less than that number for the unbiased estimate).

4. What does the covariance matrix tell us about the distribution?

The covariance matrix tells us about the strength and direction of the linear relationships between the variables in the distribution. A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates that they tend to change in opposite directions. A covariance of zero means there is no linear relationship between the variables; for jointly normal variables it also implies independence.

5. Can the covariance matrix be used to determine the probability of specific outcomes?

No, the covariance matrix alone cannot be used to determine the probability of specific outcomes. It describes the spread of the variables and how they vary together, but calculating probabilities also requires the mean vector. For a multivariate normal distribution, the mean vector and the covariance matrix together determine the distribution completely.
