# Sampling accuracy estimation

1. Oct 28, 2014

### Virous

Good afternoon!

Suppose I have a box with N marbles of different colors and I want to know the ratio of the number of green ones to N (call it X). The number of marbles (N) is so huge that there's absolutely no way to get them all out of the box and count. What I do instead is take a sample of M randomly picked marbles and compute the ratio of green ones within this sample. I obtain another value, Y.

It is obvious that as M approaches N, Y approaches X. The question is: to how many significant figures can I trust the value of Y?

I know that to get a proper answer I have to calculate the probability of each significant figure being correct and then establish a trust boundary (let's say, if it is 90% or more, I consider the figure to be correct). And I do understand the process of calculation. My question is the following:

Is there an already well-established solution for this problem that I can just quote in my essay and avoid adding unnecessary information? I'm pretty sure that it exists, since I can't be the only one facing this problem. I just need the name :)

Deep Thanks!

2. Oct 29, 2014

### Simon Bridge

Say you draw n<<N marbles, and g of them are green, then you want to use this result as an estimator for the probability of drawing a green marble out of the population? i.e. P(g)=g/n ... how good an estimator is this?
Well, if n=1, that's a pretty bad estimator ...
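
To get a feel for how the quality of g/n depends on n, here is a minimal simulation sketch. The true green fraction `p_true = 0.3`, the sample sizes, and the function name `estimate_spread` are illustrative assumptions, not values from this thread; since n<<N, the draws are treated as independent.

```python
import random
import statistics

def estimate_spread(p_true, n, repeats=2000, seed=0):
    """Repeat the experiment 'draw n marbles, compute g/n' many times and
    return the mean and standard deviation of the resulting estimates.

    p_true is a hypothetical true fraction of green marbles (assumption)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        # Count green marbles among n independent draws.
        g = sum(1 for _ in range(n) if rng.random() < p_true)
        estimates.append(g / n)
    return statistics.mean(estimates), statistics.pstdev(estimates)

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        mean, sd = estimate_spread(0.3, n)
        print(f"n={n:5d}  mean of g/n = {mean:.3f}  spread = {sd:.3f}")
```

The spread of the estimates should shrink roughly like 1/√n, which is why n=1 is such a poor estimator.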

Working out population statistics from sample statistics is well studied and there are lots of ways to approach it, all covered in standard statistics textbooks. For you, I think: look up "Bayesian Analysis". It helps to think of the sample as n independent trials with possible outcomes g and not-g.

3. Oct 29, 2014

### Stephen Tashi

The name "confidence interval" is often associated with the scenario you describe, but it does not give you the ironclad guarantees that you want - even though many people misinterpret it as doing so. (For example "90% confidence" isn't synonymous with "90% probability".) http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

If you really want to make the claim that "There is a 90% probability that the actual proportion of the population is in the following interval:..." then, as Simon Bridge said, you must take a Bayesian approach.
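
As a rough illustration of the Bayesian route, the sketch below computes an equal-tailed 90% credible interval for the green fraction. It assumes a uniform Beta(1, 1) prior, which is a choice made here for illustration and not something specified in this thread; with that prior, g green marbles out of n give a Beta(g+1, n-g+1) posterior. The function name `credible_interval` and the example counts are likewise hypothetical.

```python
import random

def credible_interval(g, n, prob=0.90, draws=100_000, seed=0):
    """Equal-tailed credible interval for the fraction of green marbles.

    ASSUMPTION: uniform Beta(1, 1) prior, so the posterior is
    Beta(g + 1, n - g + 1). Quantiles are estimated by Monte Carlo
    sampling from that posterior using only the standard library."""
    if n <= 0 or not 0 <= g <= n:
        raise ValueError("need n > 0 and 0 <= g <= n")
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(g + 1, n - g + 1) for _ in range(draws))
    tail = (1 - prob) / 2
    lo = samples[int(tail * draws)]
    hi = samples[int((1 - tail) * draws) - 1]
    return lo, hi

if __name__ == "__main__":
    # Hypothetical sample: 300 green marbles out of 1000 drawn.
    lo, hi = credible_interval(300, 1000)
    print(f"90% credible interval: ({lo:.3f}, {hi:.3f})")
```

Under the stated prior, the resulting interval can be read as "the proportion lies in this range with 90% probability", which is the kind of statement a confidence interval does not license.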

4. Oct 29, 2014

### mathman

Your case can be modeled by a binomial distribution. The standard deviation of the sample proportion is √(p(1-p)/n), where p is estimated by the sample average (the observed fraction of green marbles) and n is the sample size.
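
A minimal sketch of how that formula might be used in practice: it computes √(p(1-p)/n) from a sample and turns it into an approximate interval using the normal approximation to the binomial. The normal approximation, the 90% level, the function name `proportion_interval`, and the example counts are assumptions added for illustration; the approximation is only reasonable when n is large and p is not too close to 0 or 1.

```python
import math
from statistics import NormalDist

def proportion_interval(g, n, confidence=0.90):
    """Return (p_hat, standard_error, (low, high)) for g green out of n.

    Uses the normal approximation to the binomial (an assumption that
    needs large n and p_hat not near 0 or 1)."""
    if n <= 0 or not 0 <= g <= n:
        raise ValueError("need n > 0 and 0 <= g <= n")
    p_hat = g / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided critical value, e.g. about 1.645 for 90%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    low = max(0.0, p_hat - z * se)
    high = min(1.0, p_hat + z * se)
    return p_hat, se, (low, high)

if __name__ == "__main__":
    # Hypothetical sample: 300 green marbles out of 1000 drawn.
    p_hat, se, (low, high) = proportion_interval(300, 1000)
    print(f"p = {p_hat:.3f}, se = {se:.4f}, interval = ({low:.3f}, {high:.3f})")
```

The half-width of the interval gives a rough idea of which decimal place of the sample ratio can still be trusted.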

5. Oct 29, 2014

### Virous

Thanks to everyone!

The solution came from an A-Level book in the form of confidence intervals. Special thanks to Stephen for telling me what to look for in the book's index. Unfortunately, overloading the essay with mathematical stuff turned out to be inevitable :(.