What is the Best Way to Determine Sampling Accuracy in Large Populations?


Discussion Overview

The discussion revolves around determining the accuracy of sampling methods in large populations, specifically focusing on estimating the ratio of a specific color of marbles (green) in a large box of marbles. Participants explore statistical methods and concepts relevant to sampling, including confidence intervals and Bayesian analysis.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes a scenario where they cannot count all marbles and seeks to understand how reliable their sample ratio (Y) is compared to the actual ratio (X).
  • Another participant suggests that working out population statistics from sample statistics is well-studied and recommends looking into "Bayesian Analysis" for better estimations.
  • A third participant mentions "confidence intervals" as a common solution but cautions that it does not provide absolute guarantees and highlights the difference between confidence and probability.
  • One participant provides a mathematical model using the binomial distribution, indicating how to calculate standard deviation based on sample size and estimated proportion.

Areas of Agreement / Disagreement

Participants express varying opinions on the reliability of different statistical approaches, with some advocating for Bayesian methods while others highlight the limitations of confidence intervals. There is no consensus on a single best method, and the discussion remains open-ended.

Contextual Notes

Participants note the complexity of interpreting confidence intervals and the need for a Bayesian approach to make stronger claims about population proportions. The discussion reflects the nuances and assumptions inherent in statistical estimation.

Who May Find This Useful

This discussion may be useful for students or researchers interested in statistical methods for estimating population parameters from samples, particularly in fields involving large datasets or probabilistic modeling.

Virous
Good afternoon!

Suppose I have a box with N marbles of different colors and I want to know the ratio of the number of green ones to N (call it X). The number of marbles (N) is so huge that there's absolutely no way to get them all out of the box and count them. What I do instead is take a sample of M randomly picked marbles and compute the ratio of green ones within this sample. This gives me another value, Y.

It is obvious that as M approaches N, Y approaches X. The question is: up to what significant figure can I trust the value of Y?

I know that to get a proper answer I have to calculate probabilities of each significant figure being correct and then establish a trust boundary (let's say, if it is 90% or more - I consider it to be correct). And I do understand the process of calculation. My question is the following:

Is there an already well established solution for this problem that I can just quote in my essay and avoid adding unnecessary information? I'm pretty sure that it exists, since I can't be the only one facing this problem. I just need the name :)

Deep Thanks!
 
Virous said:
Suppose I have a box with N marbles of different colors and I want to know the ratio of the number of green ones to N (call it X). The number of marbles (N) is so huge that there's absolutely no way to get them all out of the box and count them. What I do instead is take a sample of M randomly picked marbles and compute the ratio of green ones within this sample. This gives me another value, Y.
Say you draw n << N marbles and g of them are green. You then want to use this result as an estimator for the probability of drawing a green marble from the population, i.e. P(green) = g/n ... how good an estimator is this?
Well, if n=1, that's a pretty bad estimator ...

Working out population statistics from sample statistics is well studied and there are lots of ways to approach it - covered in standard statistics textbooks. For you, I think: look up "Bayesian Analysis" - it helps to think of the sample as n independent trials with possible outcomes g and not-g.
 
Virous said:
Is there an already well established solution for this problem that I can just quote in my essay and avoid adding unnecessary information? I'm pretty sure that it exists, since I can't be the only one facing this problem. I just need the name :)

The name "confidence interval" is often associated with the scenario you describe, but it does not give you the ironclad guarantees that you want - even though many people misinterpret it as doing so. (For example "90% confidence" isn't synonymous with "90% probability".) http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

If you really want to make the claim that "There is a 90% probability that the actual proportion of the population is in the following interval:..." then, as Simon Bridge said, you must take a Bayesian approach.
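
As a rough illustration of what such a Bayesian statement looks like, here is a minimal sketch. It assumes a uniform prior on the proportion (my choice for the example, not something the thread specifies), so the posterior after seeing g green marbles out of n is proportional to p^g (1-p)^(n-g). The function name `credible_interval` and the counts 230 and 1000 are hypothetical. The posterior is integrated numerically on a grid so that only the standard library is needed.

```python
import math

def credible_interval(g, n, prob=0.90, grid=20001):
    """Equal-tailed credible interval for a proportion, assuming a uniform prior."""
    # Grid of midpoints in (0, 1), which avoids log(0) at the endpoints.
    xs = [(i + 0.5) / grid for i in range(grid)]
    # Unnormalised log posterior: g*log(p) + (n - g)*log(1 - p).
    logs = [g * math.log(x) + (n - g) * math.log(1 - x) for x in xs]
    peak = max(logs)
    # Subtract the peak before exponentiating to avoid underflow.
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)

    tail = (1 - prob) / 2
    cum = 0.0
    low = high = None
    for x, w in zip(xs, weights):
        cum += w / total
        if low is None and cum >= tail:
            low = x
        if high is None and cum >= 1 - tail:
            high = x
            break
    return low, high

# Hypothetical sample: 230 green marbles out of 1000 drawn.
low, high = credible_interval(g=230, n=1000)
print(f"90% credible interval for the green proportion: ({low:.4f}, {high:.4f})")
```

With this prior, the printed interval can be read as "given the sample, there is a 90% probability that the population proportion lies in this range". A different prior would give a different interval, which is exactly the assumption a Bayesian analysis has to state.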
 
Your case can be modeled by a binomial distribution. The standard deviation of the sample proportion is √(p(1-p)/n), where p is estimated by the sample proportion of green marbles and n is the sample size.
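
A minimal sketch of how that formula turns into an interval, assuming the sample is large enough for the normal approximation to the binomial to be reasonable. The function name `proportion_interval` and the counts 230 and 1000 are hypothetical, chosen only for illustration. This is the simple normal-approximation interval; the Wikipedia page linked above lists alternatives that behave better for small n or for p near 0 or 1.

```python
import math
from statistics import NormalDist

def proportion_interval(g, n, confidence=0.90):
    """Normal-approximation confidence interval for a binomial proportion."""
    p_hat = g / n                                # sample proportion (Y in the question)
    se = math.sqrt(p_hat * (1 - p_hat) / n)      # sqrt(p(1-p)/n), with p estimated by p_hat
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# Hypothetical sample: 230 green marbles out of 1000 drawn.
p_hat, se, (low, high) = proportion_interval(g=230, n=1000)
print(f"estimate = {p_hat:.3f}, standard error = {se:.4f}")
print(f"90% confidence interval: ({low:.4f}, {high:.4f})")
```

Because the standard error shrinks like 1/√n, gaining one more trustworthy significant figure in Y takes roughly 100 times as many marbles.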
 
Thanks to everyone!

The solution came from an A-Level book in the form of confidence intervals. Special thanks to Stephen for telling me what to look for in the book's index. Unfortunately, overloading the essay with mathematical stuff appeared to be inevitable :(.
 
