- #1
sandsifter
Hi all,
I have a stats problem I'm trying to figure out.
Suppose I have a very large population (~millions) of colored balls with exactly 50% red, 30% green, 20% blue. If I take a random sample of 1000 of these balls, the distribution of colors I end up with can be approximated by a multivariate normal distribution, with the three variables denoting the numbers of red, green and blue balls: [r g b].
I can calculate the expected values and variances for each variable:
The means are (obviously): Mu = [500 300 200]
And each variance will be given by n*P*(1-P)
Where P is the probability of picking a particular color (e.g. 0.5 for red) and n is the sample size, 1000 in this case. (The form (100-P)*P/n, with P in percent, gives the variance of the sample percentage rather than of the count.) Note that this is a valid approximation as long as the population I am sampling is large compared to the sample size.
However, I'm not sure how to determine the covariance matrix. I know the diagonal elements are the variances of each variable, so that's easy. But given that the variables must add up to 1000 (r+g+b=1000), and knowing the means and variances of each, shouldn't I be able to use an analytical expression to determine the covariances?
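As a quick numerical check of the means and variances above, here is a minimal sketch assuming NumPy is available. The variable names are illustrative and not from the original post. It treats each count as binomial, which relies on the stated assumption that the population is much larger than the sample, and it shows the count variance alongside the percentage variance so the two forms are not confused.

```python
import numpy as np

# Proportions of red, green, blue as stated in the post
p = np.array([0.5, 0.3, 0.2])
n = 1000  # sample size

# Expected counts of each color
mean_counts = n * p

# Variance of each count (binomial variance, P as a fraction)
var_counts = n * p * (1 - p)

# Variance of each sample percentage, i.e. (100 - P) * P / n with P in percent
p_percent = 100 * p
var_percent = (100 - p_percent) * p_percent / n

print("mean counts:     ", mean_counts)
print("count variances: ", var_counts)
print("percent variances:", var_percent)
```

The two variance vectors differ only by the factor (n/100)^2, since a count is the percentage multiplied by n/100.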
Intuitively, I know that every extra red ball I pick means one fewer ball of another color in the final distribution: with probability 300/(300+200) it is one fewer green ball, and with probability 200/(300+200) it is one fewer blue ball.
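One way to write this intuition down, assuming the draws are effectively independent (population much larger than the sample). The symbols p_r, p_g, p_b for the color probabilities are shorthand introduced here, not notation from the original post. Once the number of red balls r is fixed, the remaining n - r balls are split between green and blue in proportion to their probabilities:

```latex
\mathbb{E}[g \mid r] = (n - r)\,\frac{p_g}{p_g + p_b} = (1000 - r)\,\frac{0.3}{0.5},
\qquad
\mathbb{E}[b \mid r] = (n - r)\,\frac{p_b}{p_g + p_b} = (1000 - r)\,\frac{0.2}{0.5}
```

So each extra red ball lowers the expected green count by 0.6 and the expected blue count by 0.4, which matches the 300/(300+200) and 200/(300+200) figures.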
What am I missing here? I'm about to perform a Monte Carlo simulation to empirically determine the covariances, but I'm sure there's a neater way!
Thanks in advance!
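A minimal sketch of the kind of Monte Carlo check mentioned above, assuming NumPy. The number of trials and the seed are arbitrary choices, and drawing from a multinomial distribution stands in for sampling from the very large population (an assumption that holds only while the population is much larger than the sample).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable (arbitrary choice)

p = [0.5, 0.3, 0.2]  # red, green, blue proportions from the post
n = 1000             # balls per sample
trials = 20_000      # number of simulated samples (hypothetical value)

# Each row is one simulated sample: [r, g, b] counts summing to n
samples = rng.multinomial(n, p, size=trials)

# Empirical mean vector and 3x3 covariance matrix of the counts
empirical_mean = samples.mean(axis=0)
empirical_cov = np.cov(samples, rowvar=False)

print("empirical mean:", empirical_mean)
print("empirical covariance matrix:")
print(empirical_cov)
```

Because r+g+b is fixed at 1000 in every row, each row of the estimated covariance matrix should sum to approximately zero, which is a handy sanity check on the output.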