How Can You Accurately Estimate Average Clan Size from Survey Data?

AI Thread Summary
Estimating the average clan size from survey data requires careful consideration to avoid overrepresentation of larger clans. A pure average of reported clan sizes will skew results, necessitating a weighted approach to account for multiple responses from the same clan. The discussion suggests creating a histogram to visualize clan sizes and using probability functions to calculate the expectation value of clan size. Accurate estimation hinges on knowing the total population and ensuring that survey responses reflect true clan sizes without rounding errors. Ultimately, dividing the total population by the estimated number of clans yields the average clan size.
zut837
I'm working on this puzzle:

The people in a country are partitioned into clans. In order to estimate the average size of a clan, a survey is conducted in which 1000 randomly selected people are asked to state the size of the clan to which they belong. How does one compute an estimate of the average clan size from the data collected?

And I am a bit stuck. If you take a pure average over the surveyed people, you will overestimate the clan size, because you will have more representatives from the larger clans. Thus each sampled value needs to be downweighted in some way, to factor out multiple samples from the same clan.

Any ideas?
 
1. Create a histogram and group the results together to whatever degree of accuracy you want (i.e. lump all people who answered 100 plus/minus 10 together into the "100" bin).

2. Now if you plot this and squint, you basically have a function f(x), where f is the number of clans and x is the size of the clan. Once normalized, f(x) is essentially the probability of finding a clan of size x.

3. Now we want to calculate the "expectation value" of x, or the average value of the size of the clan. This is usually done with an integrable function like this

$$\int_{-\infty}^{\infty} x \, f(x) \, dx$$

Since we don't have a continuous function, you can use the summation:

$$\sum_x x \, P(x)$$

Where P(x) is the probability of finding a clan of size x. Then you sum over all x's.
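
One caveat worth spelling out: the histogram of survey answers is a distribution over respondents, not over clans. As a rough sketch, assume every person is equally likely to be surveyed, and write Q(x) for the fraction of respondents who report size x and \mu for the true mean clan size (both symbols are introduced here for illustration, they are not from the posts above). Then Q(x) is proportional to x P(x), and the mean can be recovered from Q alone:

```latex
Q(x) = \frac{x \, P(x)}{\mu}, \qquad \mu = \sum_x x \, P(x)
\quad\Longrightarrow\quad
\sum_x \frac{Q(x)}{x} = \frac{1}{\mu} \sum_x P(x) = \frac{1}{\mu}
\quad\Longrightarrow\quad
\mu = \left( \sum_x \frac{Q(x)}{x} \right)^{-1}
```

So plugging the raw histogram into the sum as if it were P(x) would give the size-biased mean, while dividing each bin by x first undoes the bias.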
 
You can just weight each respondent as 1/n, where n is the group size reported.
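
To make the 1/n weighting concrete, here is a minimal sketch, assuming the responses are exact and every person is equally likely to be surveyed. The function name and the toy data are made up for illustration. Each respondent counts as 1/n of a clan, so the estimate is the number of respondents divided by the sum of the 1/n weights (the harmonic mean of the answers).

```python
def estimate_mean_clan_size(responses):
    """Estimate the mean clan size, weighting each respondent by 1/n."""
    if not responses:
        raise ValueError("need at least one response")
    # A clan of size n is n times as likely to be hit by the survey,
    # so each of its members is counted as 1/n of a clan.
    weighted_clan_count = sum(1.0 / n for n in responses)
    return len(responses) / weighted_clan_count


# Hypothetical country of 200 people: one clan of 100 and 50 clans of 2.
# An idealized sample mirrors the population: half answer 100, half answer 2.
responses = [100] * 5 + [2] * 5

naive_mean = sum(responses) / len(responses)      # plain average of the answers
estimate = estimate_mean_clan_size(responses)     # 1/n-weighted estimate
true_mean = 200 / 51                              # people / clans

print(naive_mean, estimate, true_mean)
```

In this toy case the plain average is far above the true mean, while the weighted estimate matches it.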
 
With respect to the first response: how is that not the same as a weighted average? Since your sample will have more members of larger clans, you will still be overestimating the frequency of the larger clans.

For the second: if you weight each response n by 1/n, aren't you simply scaling all the responses to 1? I'm not 100% sure I understand your approach; could you try to be a bit more explicit?
 
If I understand this correctly, you have two numbers, n and N. For the number of people in a given clan you have a label n(i), in terms of sampling order, for each member i of the clan. There are j clans, so every individual in the sample space can be labeled n(i,j). Consider an array of i rows and j columns. What is a summation over columns, over rows, and over the whole array?

Note: In a true random sample, every individual will have an equal probability of being selected, so the proportional size of the clans in the sample will approach the true proportions as N grows large.
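
A quick simulation can illustrate the note above. This is only a sketch with a made-up country (the clan sizes and the seed are assumptions chosen for illustration): the share of respondents from the big clans tracks their share of the population, which is exactly why the plain average of the answers lands far above the true mean clan size.

```python
import random

# Hypothetical country: 2 clans of 1000 people and 300 clans of 10 people.
clan_sizes = [1000] * 2 + [10] * 300

# One entry per person, holding the size of that person's clan.
population = [size for size in clan_sizes for _ in range(size)]

true_mean = len(population) / len(clan_sizes)   # people / clans
true_share_large = 2000 / len(population)       # fraction living in big clans

rng = random.Random(0)                          # fixed seed for repeatability
responses = rng.sample(population, 1000)        # survey 1000 distinct people

sample_share_large = sum(1 for n in responses if n == 1000) / len(responses)
naive_mean = sum(responses) / len(responses)

print(f"population share in big clans: {true_share_large:.3f}")
print(f"sample share in big clans:     {sample_share_large:.3f}")
print(f"true mean clan size:           {true_mean:.2f}")
print(f"plain average of answers:      {naive_mean:.2f}")
```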
 
A weighted average (i.e. the proportion of respondents per response multiplied by the magnitude of the response, where the "response" is the size of the respondent's clan) would absolutely work, given two assumptions:
1) No two clans are the same size.
2) Answers are 100% accurate, not estimated/rounded.

Difficulties may arise if either of the preceding two conditions is violated. Consider the following: of 3 persons surveyed, 2 respond that they come from a clan with 100 total members, while the remaining person responds that his/her clan has 50 members. Now, the first 2 individuals may or may not be from the same clan, and neither alternative is outside the realm of statistical possibility. Consequently, average clan size could be either (100+100+50)/3 = 83.33 people or (100+50)/2 = 75 people. We simply don't know... UNLESS we know the total population of the country. If the country has only 150 people, we know our second calculation must be correct.

Similarly, we must know the total population in order to determine whether there are multiple clans of identical size, which would otherwise confound our calculations. Given a sample size of 1000, the survey should "even itself out," and the percentage of respondents from a given clan should be roughly equal to the percentage of the national population made up of that clan's members. That is, R/1000 = S/T, where R is the number of respondents from clan A, S is the size of clan A, and T is the total population of the country. Since respondents report only a size, in practice we take R to be the number of respondents reporting clan size S, and estimate the number of clans of that size as (R/1000)/(S/T). If 1/4 of the respondents report a clan size that would make up 1/12 of the population, this is a good indication that there are 3 clans of this size. Depending on the number of different reported sizes, this may be cumbersome to do by hand, but it should give you the right answer.
# CLANS = (R1/1000)/(S1/T) + (R2/1000)/(S2/T)+ (R3/1000)/(S3/T) + ...
Once you have the number of clans, you can divide the total population by number of clans to get the average size!
PPL/CLAN = T/(# CLANS)
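
The two formulas above can be sketched in a few lines of Python. The function names and the toy numbers are invented for illustration, and the sketch assumes exact answers and a sample that mirrors the population. Note that T cancels when you divide: T/(# CLANS) reduces to (sample size)/(R1/S1 + R2/S2 + ...), which is the same as weighting each respondent by 1/n, so the total population is only needed if you want the number of clans itself.

```python
from collections import Counter


def count_clans(responses, total_population):
    """# CLANS = sum over reported sizes S of (R / sample size) / (S / T)."""
    sample_size = len(responses)
    respondents_by_size = Counter(responses)  # maps size S -> respondent count R
    return sum(
        (r / sample_size) / (s / total_population)
        for s, r in respondents_by_size.items()
    )


def mean_clan_size(responses, total_population):
    """PPL/CLAN = T / (# CLANS)."""
    return total_population / count_clans(responses, total_population)


# Hypothetical country of 1200 people: three clans of 100 and one clan of 900.
# 1/4 of respondents report size 100, which is 1/12 of the population,
# so the formula should find 3 clans of that size (plus 1 clan of 900).
responses = [100] * 3 + [900] * 9

n_clans = count_clans(responses, 1200)
avg_size = mean_clan_size(responses, 1200)
print(n_clans, avg_size)

# The average does not depend on the value of T that is plugged in.
avg_size_other_t = mean_clan_size(responses, 5000)
print(avg_size_other_t)
```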
 