Determining the covariance matrix of a multivariate normal distribution

Discussion Overview

The discussion revolves around determining the covariance matrix of a multivariate normal distribution in the context of sampling colored balls from a large population. Participants explore the relationship between the counts of different colored balls and the implications of their distributions, focusing on statistical modeling and approximations.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes a scenario involving a large population of colored balls and proposes that the distribution of colors can be modeled as a multivariate normal distribution, seeking to determine the covariance matrix.
  • Another participant suggests that the problem resembles a classic Bernoulli trial and proposes using a multinomial distribution to model the counts of colored balls, noting the limitations of the normal approximation.
  • Some participants discuss the implications of the constraint that the total number of balls must equal a constant, questioning how this affects the covariance calculations.
  • One participant expresses uncertainty about the validity of the normal approximation and its applicability to their model, while another acknowledges the potential usefulness of the approximation under certain conditions.
  • There is mention of using the Mahalanobis distance to assess how closely certain datasets align with the proposed multivariate distribution model.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of the normal approximation for the given problem. While some acknowledge its potential validity under large sample sizes, others raise concerns about the implications of the constant sum constraint on the covariance calculations. The discussion remains unresolved regarding the best approach to determine the covariance matrix.

Contextual Notes

Participants highlight the need for caution regarding the assumptions underlying the normal approximation, particularly in relation to the constraints imposed by the total count of colored balls. There is also mention of the potential for nonzero probabilities of exceeding or falling below the total count in the proposed models.

sandsifter
Hi all,

I have a stats problem I'm trying to figure out.

Suppose I have a very large population (~millions) of colored balls with exactly 50% red, 30% green, 20% blue. If I take a random sample of 1000 of these balls, the distribution of colors I end up with can be modeled as a multivariate normal distribution, with the dependent variables [r g b] denoting the numbers of red, green and blue balls.

I can calculate the expected values and variances for each variable:

The means are (obviously): Mu = [500 300 200]

And each variance will be given by n*P*(1-P), i.e. 250, 210 and 160

Where P is the probability of picking a particular color (e.g. 0.5 for red) and n is the sample size, 1000 in this case. Note that this is a valid approximation as long as the population I am sampling is large compared to the sample size.

However, I'm not sure how to determine the covariance matrix. I know the diagonal elements are the variances of each variable, so that's easy. However, given that the variables must add up to 1000 (r+g+b=1000), and knowing the means and variances of each, shouldn't I be able to use an analytical expression to determine the covariances?

Intuitively, I know that for every extra red ball I pick, this translates to 300/(300+200) probability that there will be one less green ball, and 200/(300+200) probability that there will be one less blue ball, in the final distribution.

What am I missing here? I'm about to perform a Monte Carlo simulation to empirically determine the covariances, but I'm sure there's a neater way!

Thanks in advance!
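The Monte Carlo check mentioned above could look something like the following sketch. It assumes the population is large enough that the 1000 draws can be treated as independent, so each sample is a single multinomial draw; the function name, the number of trials, and the seed are arbitrary choices made for illustration.

```python
import numpy as np

def simulate_counts(n=1000, p=(0.5, 0.3, 0.2), trials=20000, seed=0):
    """Draw `trials` samples of n balls; each row holds [r, g, b] counts."""
    rng = np.random.default_rng(seed)
    return rng.multinomial(n, p, size=trials)

counts = simulate_counts()

# Empirical mean vector and covariance matrix (columns are the variables).
emp_mean = counts.mean(axis=0)
emp_cov = np.cov(counts, rowvar=False)

print("empirical means:", emp_mean)
print("empirical covariance matrix:")
print(emp_cov)
```

The off-diagonal entries of `emp_cov` are the covariances being asked about; more trials tighten the estimate.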
 
I will try to rephrase the question in a more general way. I have 3 dependent variables, X, Y and Z. I know their marginal distributions (which are normal) and I know roughly how they move together:

X+Y+Z = constant

and if I increase X, Y will on average decrease by a proportional amount, Y/(Z+Y), and Z will decrease by Z/(Z+Y). I *think* this is the case (see the more detailed formulation above).

Given all this, isn't it possible to calculate the correlations and hence covariances between X, Y and Z?
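One way to make this intuition concrete is sketched below. It assumes the counts follow a multinomial distribution with sample size n and category probabilities p_X, p_Y, p_Z (an assumption; the post does not name the model). Given X, the remaining n - X draws split between Y and Z in proportion to the fixed probabilities p_Y and p_Z, rather than the random counts Y and Z:

```latex
\begin{align*}
\mathbb{E}[Y \mid X] &= (n - X)\,\frac{p_Y}{p_Y + p_Z} = (n - X)\,\frac{p_Y}{1 - p_X}, \\
\operatorname{Cov}(X, Y) &= \operatorname{Cov}\bigl(X, \mathbb{E}[Y \mid X]\bigr)
  = -\frac{p_Y}{1 - p_X}\,\operatorname{Var}(X) \\
 &= -\frac{p_Y}{1 - p_X}\, n\,p_X (1 - p_X) = -n\,p_X\,p_Y .
\end{align*}
```

With n = 1000 and probabilities 0.5, 0.3, 0.2 this gives Cov(r, g) = -150, Cov(r, b) = -100 and Cov(g, b) = -60.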
 
sandsifter,

I'm not sure what you need your model for, so I am not sure what approximations actually make sense, here.

To me this looks like a classic Bernoulli trial problem. I choose N=1000 balls, each with probability .5 of being red, .3 green, and .2 blue. So the probability of getting kr red balls and kg green balls and kb blue balls (with kr+kg+kb=N) should follow a multinomial distribution. If you want, you can use the de Moivre-Laplace theorem to approximate this with a multivariate normal distribution, but that will only be accurate for a k within about sqrt(N) or so of the expected value. I don't know the goal of your model, so I don't know if that approximation is useful to you or not.

Note that under that approximation you will have a nonzero probability of kr+kg+kb>N (and of kr+kg+kb<0). Of course, your approach will also have a nonzero probability of kr+kg+kb>N (and of kr+kg+kb<0). Why? If you model X, Y, and Z as jointly normal, then X+Y+Z is also normal (you should prove this if you don't already know it), so with this model you cannot impose the hard X+Y+Z=N constraint.

jason
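For the multinomial model described above, the covariance has a standard closed form: Var(k_i) = N p_i (1 - p_i) on the diagonal and Cov(k_i, k_j) = -N p_i p_j off it. A minimal sketch in Python with NumPy follows; the function name `multinomial_cov` is made up for illustration. It also evaluates the variance of the sum, 1^T Σ 1, which is zero for this particular matrix, so this Σ is singular and cannot be inverted directly.

```python
import numpy as np

def multinomial_cov(n, p):
    """Covariance matrix of multinomial counts: n * (diag(p) - p p^T)."""
    p = np.asarray(p, dtype=float)
    return n * (np.diag(p) - np.outer(p, p))

N = 1000
p = [0.5, 0.3, 0.2]             # red, green, blue
cov = multinomial_cov(N, p)

# Variance of kr + kg + kb implied by this covariance matrix.
ones = np.ones(len(p))
var_of_sum = ones @ cov @ ones

print(np.round(cov, 6))
print("variance of the sum:", round(float(var_of_sum), 9))
```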
 
Jason,

You are right. I took it for granted that the normal approximation is valid (it probably is close enough, but your last statement is making me think a little bit...).

My aim was to use the Mahalanobis distance to see if certain data sets were close to a certain model (e.g. in my example here, the model would be the multivariate distribution with means of 50% red, 30% green, 20% blue and the appropriate variances and covariances).
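A minimal sketch of that check, assuming the multinomial covariance for the model (variances n p_i (1 - p_i), covariances -n p_i p_j). Because the three counts always sum to n, that 3x3 matrix is singular, so the sketch drops the last color and works with the invertible 2x2 block; this is one common workaround, not necessarily what was used here. The function name and the sample counts are hypothetical.

```python
import numpy as np

def mahalanobis_sq(counts, p):
    """Squared Mahalanobis distance of observed counts from the model."""
    counts = np.asarray(counts, dtype=float)
    p = np.asarray(p, dtype=float)
    n = counts.sum()

    # Assumed model covariance: n * (diag(p) - p p^T), singular as a 3x3.
    cov = n * (np.diag(p) - np.outer(p, p))

    # Drop the last category; it is determined by the others via the sum.
    d = (counts - n * p)[:-1]
    cov_reduced = cov[:-1, :-1]

    # Solve instead of forming the inverse explicitly.
    return float(d @ np.linalg.solve(cov_reduced, d))

p_model = [0.5, 0.3, 0.2]                       # red, green, blue
d2 = mahalanobis_sq([520, 290, 190], p_model)   # hypothetical data set
print(d2)
```

For counts that sum to n, this quantity coincides with Pearson's chi-square statistic, the sum of (x_i - n p_i)^2 / (n p_i) over the colors, so for large n it can be compared against a chi-square distribution with 2 degrees of freedom.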

Edit: It's clear to me that the normal approximation is indeed valid, as long as the sample size is large enough. I know that for the binomial (two-category) case there are rules of thumb for using the approximation. Not sure how these scale to the multivariate case, but in my example where I'm using n=1000, I would probably have a good approximation regardless of the variables' proportions.

Thanks for pointing me in the right direction and for your reply. I really appreciate it!
 