Estimating group sample size

zut837 · Oct 22, 2010

I'm working on this puzzle:

The people in a country are partitioned into clans. In order to estimate the average size of a clan, a survey is conducted where 1000 randomly selected people are asked to state the size of the clan to which they belong. How does one compute an estimate of the average clan size from the data collected?

And I am a bit stuck. If you take a plain average of the survey responses, you will overestimate the clan size, because larger clans contribute more respondents. Each response therefore needs to be down-weighted in some way, to factor out multiple samples from the same clan.
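To make the overestimate concrete, here is a minimal sketch that compares the true mean clan size with the expected answer of a randomly chosen person. The clan sizes and the function names are made up for illustration; the only assumption is that every person is equally likely to be surveyed, so a clan of size s is hit with probability s / total.

```python
from fractions import Fraction

def true_mean_clan_size(clan_sizes):
    """Average size taken over clans (what the puzzle asks for)."""
    return Fraction(sum(clan_sizes), len(clan_sizes))

def expected_reported_size(clan_sizes):
    """Expected answer of one uniformly random person.

    A person belongs to a clan of size s with probability s / total,
    so the expected response is sum(s * s) / total.
    """
    total = sum(clan_sizes)
    return Fraction(sum(s * s for s in clan_sizes), total)

# Hypothetical country: two clans of 10 people and one clan of 100.
clans = [10, 10, 100]
print(true_mean_clan_size(clans))     # 40
print(expected_reported_size(clans))  # 85
```

In this toy country the plain average of responses would settle near 85 rather than 40, which is the bias described above.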

Any ideas?

Xerxes1986 · Oct 22, 2010

1. Create a histogram and group the results together to whatever degree of accuracy you want (i.e. lump all people who answered 100 plus/minus 10 together into the "100" bin).

2. Now if you plot this and squint you basically have a function f(x) where f is the number of clans and x is the size of the clan. f(x) is essentially the probability of finding a clan of size x.

3. Now we want to calculate the "expectation value" of x, or the average value of the size of the clan. This is usually done with an integrable function like this

[tex]\int_{-\infty}^{\infty} x \, f(x) \, dx[/tex]

Since we don't have a continuous function, you can use the summation:

[tex]\sum x P(x)[/tex]

Where P(x) is the probability of finding a clan of size x. Then you sum over all x's.

CRGreathouse · Oct 22, 2010

You can just weight each respondent as 1/n, where n is the group size reported.
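One way to read the 1/n weighting, sketched below, is as a weighted average of the reported sizes in which a respondent reporting size n gets weight 1/n. The numerator then collapses to the number of respondents, so the estimate is the harmonic mean of the responses. The function name and the sample data are hypothetical, and this is an interpretation of the suggestion rather than a confirmed statement of it.

```python
from fractions import Fraction

def estimate_mean_clan_size(responses):
    """Weighted average of reported clan sizes with weight 1/n each.

    sum(w * n) equals the number of respondents, so this is
    len(responses) / sum(1/n), i.e. the harmonic mean.
    """
    weights = [Fraction(1, n) for n in responses]
    weighted_sum = sum(w * n for w, n in zip(weights, responses))
    return weighted_sum / sum(weights)

# Hypothetical sample of 12 people from a country with clans of
# sizes 10, 10 and 100, drawn in exact proportion to clan size.
responses = [10] * 2 + [100] * 10
print(sum(responses) / len(responses))     # 85.0 (plain average, biased)
print(estimate_mean_clan_size(responses))  # 40 (true mean of 10, 10, 100)
```

The weights do not scale every response to 1 in the final answer, because the division by the sum of the weights is what turns the count of respondents into an average size.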

zut837 · Oct 22, 2010

With respect to the first response, how is that not the same as a weighted average? Since your sample will have more members of larger clans, you will still be overestimating the frequency of the larger clans.

For the second, if you weight each response n by 1/n, aren't you simply scaling all the responses to 1? I'm not 100% sure I understand your approach; could you try to be a bit more explicit?

SW VandeCarr · Oct 23, 2010

zut837 said:

With respect to the first response, how is that not the same as a weighted average? Since your sample will have more members of larger clans, you will still be overestimating the frequency of the larger clans.

For the second, if you weight each response n by 1/n, aren't you simply scaling all the responses to 1? I'm not 100% sure I understand your approach; could you try to be a bit more explicit?

If I understand this correctly, you have two numbers, n and N. For the number of people in a given clan you have a label n(i), in terms of sampling order, for each member i of the clan. There are j clans, so every individual in the sample space can be labeled n(i,j). Consider an array of i rows and j columns. What is a summation over columns, over rows, and over the whole array?

Note: In a true random sample, every individual will have an equal probability of being selected, so the proportional size of the clans in the sample will approach the true proportions as N grows large.

paradigm · Nov 2, 2010

A weighted average (i.e. the proportion of respondents per response multiplied by the magnitude of the response, where the "response" is the size of the respondent's clan) would absolutely work, given two assumptions:
1) No two clans are the same size.
2) Answers are 100% accurate, not estimated/rounded.

Difficulties may arise if either of the preceding two conditions is violated. Consider the following: of 3 persons surveyed, 2 respond that they come from a clan with 100 total members, while the remaining person responds that his/her clan has 50 members. Now, the first 2 individuals may or may not be from the same clan, and neither alternative is outside the realm of statistical possibility. Consequently, average clan size could be either (100+100+50)/3 = 83.33 people or (100+50)/2 = 75 people. We simply don't know... UNLESS we know the total population of the country. If the country has only 150 people, we know our second calculation must be correct.

Similarly, we must know the total population in order to determine whether there are multiple clans of identical size, which would otherwise confound our calculations. Given a sample size of 1000, the survey should "even itself out," and the percentage of respondents from a given clan should be roughly equal to the percentage of the national population made up of members of that clan. That is: R/1000 = S/T, where R is the number of respondents from clan A, S is the size of clan A, and T is the total population of the country. Given this information, we can estimate the number of clans of a given size by computing (R/1000)/(S/T), where R is now the number of respondents reporting that clan size. If 1/4 of the respondents come from a clan that supposedly constitutes 1/12 of the population, this is a good indication that there are 3 clans of this size. Depending on the number of different reported sizes, this may be cumbersome to do by hand, but it should give you the right answer.
# CLANS = (R₁/1000)/(S₁/T) + (R₂/1000)/(S₂/T)+ (R₃/1000)/(S₃/T) + ...
Once you have the number of clans, you can divide the total population by number of clans to get the average size!
PPL/CLAN = T/(# CLANS)
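The two formulas above can be sketched in a few lines. The function name, the sample data, and the population figures are hypothetical; the sketch also uses the actual number of respondents in place of the fixed 1000, and it assumes every answer is exact.

```python
from collections import Counter
from fractions import Fraction

def clans_via_population(responses, total_population):
    """Return (estimated number of clans, estimated people per clan).

    For each distinct reported size S with R respondents, the estimated
    number of clans of that size is (R / n_respondents) / (S / T).
    """
    n_respondents = len(responses)
    respondents_per_size = Counter(responses)  # maps S -> R
    num_clans = sum(
        Fraction(r, n_respondents) / Fraction(s, total_population)
        for s, r in respondents_per_size.items()
    )
    return num_clans, Fraction(total_population) / num_clans

# Hypothetical sample: 2 people report size 10, 10 people report size 100.
responses = [10] * 2 + [100] * 10
print(clans_via_population(responses, 120))   # 3 clans, 40 people per clan
print(clans_via_population(responses, 1200))  # 30 clans, 40 people per clan
```

Changing T changes the estimated number of clans but not the average size: T/(# CLANS) simplifies to 1000 / (R₁/S₁ + R₂/S₂ + ...), so T cancels, and the result is the same as weighting each respondent by 1/n.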
