What is the Best Way to Determine Sampling Accuracy in Large Populations?

Virous
Good afternoon!

Suppose I have a box with N marbles of different colors and I want to know the ratio of the number of green ones to N (call it X). The number of marbles (N) is so huge that there's absolutely no way to get them all out of the box and count them. What I do instead is take a sample of M randomly picked marbles and compute the ratio of green ones within this sample. This gives me another value, Y.

It is obvious that as M approaches N, Y approaches X. The question is: to how many significant figures can I trust the value of Y?

I know that to get a proper answer I have to calculate the probability of each significant figure being correct and then set a trust threshold (let's say, if it is 90% or more, I consider the figure to be correct). And I do understand the process of calculation. My question is the following:

Is there an already well-established solution for this problem that I can just quote in my essay and avoid adding unnecessary information? I'm pretty sure that it exists, since I can't be the only one facing this problem. I just need the name :)

Deep Thanks!
 
Virous said:
Suppose I have a box with N marbles of different colors and I want to know the ratio of the number of green ones to N (call it X). The number of marbles (N) is so huge that there's absolutely no way to get them all out of the box and count them. What I do instead is take a sample of M randomly picked marbles and compute the ratio of green ones within this sample. This gives me another value, Y.
Say you draw n << N marbles and g of them are green. You then want to use this result as an estimator for the probability of drawing a green marble out of the population, i.e. P(g) = g/n. How good an estimator is this?
Well, if n=1, that's a pretty bad estimator ...

Working out population statistics from sample statistics is well studied and there are lots of ways to approach it - covered in standard statistics textbooks. For you, I think: look up "Bayesian Analysis" - it helps to think of the sample as n independent trials with possible outcomes g and not-g.
 
Virous said:
Is there an already well established solution for this problem that I can just quote in my essay and avoid adding unnecessary information? I`m pretty sure that it exists, since I can`t be the only one facing this problem. I just need the name :)

The name "confidence interval" is often associated with the scenario you describe, but it does not give you the ironclad guarantees that you want - even though many people misinterpret it as doing so. (For example "90% confidence" isn't synonymous with "90% probability".) http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
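
For anyone who wants to try this numerically, here is a minimal sketch of one interval from the linked article, the Wilson score interval. The function name and the example counts (120 green out of 1000 drawn) are made up for illustration, and it assumes the sample is much smaller than the box, so the draws can be treated as independent.

```python
import math
from statistics import NormalDist


def wilson_interval(g, n, confidence=0.90):
    """Wilson score confidence interval for a proportion (g successes in n draws)."""
    # Two-sided critical value of the standard normal, e.g. about 1.645 for 90%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = g / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical sample: 120 green marbles out of 1000 drawn.
low, high = wilson_interval(120, 1000)
print(f"90% Wilson interval: ({low:.4f}, {high:.4f})")
```

The interval describes how the procedure behaves over repeated sampling; it is not a statement that the true proportion lies inside with 90% probability.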

If you really want to make the claim that "There is a 90% probability that the actual proportion of the population is in the following interval:..." then, as Simon Bridge said, you must take a Bayesian approach.
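
As a rough illustration of the Bayesian route, the sketch below assumes a uniform prior on the green fraction (my assumption, not something stated in the thread), which gives a Beta(g + 1, n - g + 1) posterior. It finds an equal-tailed 90% credible interval by summing the posterior over a fine grid. The function name, grid size, and example counts are hypothetical.

```python
import math


def credible_interval(g, n, level=0.90, grid=100_000):
    """Equal-tailed credible interval for a proportion under a uniform prior."""
    # Evaluate the unnormalised log-posterior at the midpoint of each grid cell,
    # which avoids log(0) at the endpoints p = 0 and p = 1.
    ps = [(i + 0.5) / grid for i in range(grid)]
    logs = [g * math.log(p) + (n - g) * math.log1p(-p) for p in ps]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]  # rescaled to avoid underflow
    total = sum(weights)

    tail = (1.0 - level) / 2.0
    low = high = None
    cumulative = 0.0
    for p, w in zip(ps, weights):
        cumulative += w / total
        if low is None and cumulative >= tail:
            low = p
        if cumulative >= 1.0 - tail:
            high = p
            break
    return low, high


# Hypothetical sample: 120 green marbles out of 1000 drawn.
low, high = credible_interval(120, 1000)
print(f"90% credible interval: ({low:.4f}, {high:.4f})")
```

With a different prior the interval changes, which is the price of being able to make a direct probability statement about the proportion.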
 
Your case can be modeled by a binomial distribution. The standard deviation of the sample proportion is √(p(1-p)/n), where p is estimated by the sample proportion and n is the sample size.
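
To connect this formula to the original question about significant figures, here is a small sketch. It computes the standard deviation above and, using the normal approximation to the binomial, estimates how many marbles must be drawn for the sample proportion to land within a chosen margin of the true one. The function names and the numbers are illustrative assumptions, and n is still taken to be much smaller than N.

```python
import math
from statistics import NormalDist


def standard_error(p_hat, n):
    """Standard deviation of the sample proportion, sqrt(p(1-p)/n)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


def sample_size_for_margin(p_guess, margin, confidence=0.90):
    """Smallest n whose normal-approximation half-width is at most `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * p_guess * (1 - p_guess) / margin ** 2)


# Hypothetical example: about 12% green, 1000 marbles drawn.
print(f"standard error: {standard_error(0.12, 1000):.4f}")

# Draws needed to pin the proportion down to +/- 0.005 (two decimal places).
print("draws needed:", sample_size_for_margin(0.12, 0.005))
```

Because the margin enters squared, each extra decimal place of precision costs roughly 100 times as many draws.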
 
Thanks to everyone!

The solution came from an A-Level book in the form of confidence intervals. Special thanks to Stephen for telling me what to look for in the book's index. Unfortunately, overloading the essay with mathematical stuff turned out to be inevitable :(.
 