How Can You Accurately Estimate Average Clan Size from Survey Data?

  • Context: Undergrad 
  • Thread starter: zut837
  • Tags: Group, Sample size

Discussion Overview

The discussion revolves around estimating the average size of clans based on survey data collected from a randomly selected group of individuals. Participants explore various methods and considerations for accurately calculating this average, addressing potential biases and assumptions inherent in the sampling process.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests that a simple average of the surveyed clan sizes would overestimate the average due to larger clans being overrepresented in the sample.
  • Another proposes creating a histogram of clan sizes to derive a probability function, then calculating the expectation value using integration or summation methods.
  • It is mentioned that weighting each respondent by the inverse of their reported clan size could help adjust for overrepresentation, though this approach is questioned by others.
  • Concerns are raised about the validity of using a weighted average without knowing if clans are of different sizes or if responses are accurate, highlighting the ambiguity in the results based on sample size and clan distribution.
  • A participant elaborates on the need for knowledge of the total population to accurately interpret survey results and calculate the number of clans of a given size, suggesting a formula to estimate average clan size based on respondent proportions.

Areas of Agreement / Disagreement

Participants express differing views on the methods for estimating average clan size, with no consensus reached on the best approach. There are ongoing debates about the implications of weighting responses and the assumptions required for accurate calculations.

Contextual Notes

Limitations include the dependence on the accuracy of reported clan sizes, the assumption of equal probability in sampling, and the potential for multiple clans of the same size, which complicates the estimation process.

zut837
I'm working on this puzzle:

The people in a country are partitioned into clans. In order to estimate the average size of a clan, a survey is conducted where 1000 randomly selected people are asked to state the size of the clan to which they belong. How does one compute an estimate of the average clan size from the data collected?

I am a bit stuck. If you take a plain average of the surveyed responses, you will overestimate the clan size, because larger clans contribute more respondents. So each response needs to be down-weighted in some way, to factor out multiple samples from the same clan.

Any ideas?
 
1. Create a histogram and group the results together to whatever degree of accuracy you want (i.e. lump all people who answered 100 plus/minus 10 together into the "100" bin).

2. Now if you plot this and squint, you basically have a function f(x), where f is the number of clans and x is the size of the clan. f(x) is essentially the probability of finding a clan of size x.

3. Now we want to calculate the "expectation value" of x, or the average value of the size of the clan. For a continuous function this is usually done with an integral like this:

$$\int_{-\infty}^{\infty} x \, f(x) \, dx$$

Since we don't have a continuous function, you can use the summation:

$$\sum_x x \, P(x)$$

Where P(x) is the probability of finding a clan of size x. Then you sum over all x's.
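The summation can be sketched in Python as follows. This is only an illustration: the responses are made up, no binning is applied, and P(x) is taken directly from the respondent counts, which is an assumption about how the histogram would be turned into probabilities. Built that way, P(x) is the fraction of respondents who report size x, so the sum reduces to the plain mean of the responses.

```python
from collections import Counter

def expectation_from_histogram(responses):
    """Return the sum over x of x * P(x), with P(x) taken from respondent counts."""
    counts = Counter(responses)   # histogram: reported size -> number of respondents
    total = len(responses)
    return sum(x * c / total for x, c in counts.items())

# Hypothetical responses, for illustration only.
responses = [100, 100, 100, 50, 50, 10]

histogram_estimate = expectation_from_histogram(responses)
plain_mean = sum(responses) / len(responses)

# The two numbers agree up to floating-point rounding.
print(histogram_estimate, plain_mean)
```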
 
You can just weight each respondent as 1/n, where n is the group size reported.
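One reading of this suggestion is a weighted average of the reported sizes with weights 1/n, which works out to the sample size divided by the sum of 1/n over all respondents. The sketch below tries that on a made-up country; the clan sizes, the function names, and the choice to sample with replacement are all assumptions for illustration, not part of the puzzle.

```python
import random

def estimate_mean_clan_size(responses):
    """Weighted mean of the responses with weights 1/n: N / sum(1/n_i)."""
    return len(responses) / sum(1.0 / n for n in responses)

def simulate_survey(clan_sizes, sample_size, seed=0):
    """Pick people uniformly at random (with replacement); each reports their clan's size."""
    rng = random.Random(seed)
    # One entry per person, holding the size of that person's clan.
    population = [size for size in clan_sizes for _ in range(size)]
    return [rng.choice(population) for _ in range(sample_size)]

# Hypothetical country: 30 clans of 20, 20 clans of 50, 10 clans of 100.
clan_sizes = [20] * 30 + [50] * 20 + [100] * 10
true_mean = sum(clan_sizes) / len(clan_sizes)   # 2600 people / 60 clans

responses = simulate_survey(clan_sizes, sample_size=1000)
plain_mean = sum(responses) / len(responses)    # pulled upward by the large clans
weighted = estimate_mean_clan_size(responses)

print(f"true {true_mean:.1f}, plain {plain_mean:.1f}, 1/n-weighted {weighted:.1f}")
```

The weights do not scale every response to 1 in the final answer: each term n * (1/n) contributes 1 to the numerator, but the denominator is the sum of the weights, so responses from small clans count for more.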
 
With respect to the first response: how is that not the same as a weighted average? Since your sample will have more members of larger clans, you will still be overestimating the frequency of the larger clans.

For the second: if you weight each response n by 1/n, aren't you simply scaling all the responses to 1? I'm not 100% sure I understand your approach; could you try to be a bit more explicit?
 
If I understand this correctly, you have two numbers, n and N. For the number of people in a given clan, you have a label n(i) in terms of sampling order for each member i of the clan. There are j clans, so every individual in the sample space can be labeled n(i,j). Consider an array of i rows and j columns. What is a summation over columns, over rows, and over the whole array?

Note: In a true random sample, every individual will have an equal probability of being selected, so the proportional size of the clans in the sample will approach the true proportions as N grows large.
 
A weighted average (i.e. the proportion of respondents per response multiplied by the magnitude of the response, where the "response" is the size of the respondent's clan) would absolutely work, given two assumptions:
1) No two clans are the same size.
2) Answers are 100% accurate, not estimated/rounded.

Difficulties may arise if either of the preceding two conditions is violated. Consider the following: of 3 people surveyed, 2 respond that they come from a clan with 100 total members, while the remaining person responds that his/her clan has 50 members. Now, the first 2 individuals may or may not be from the same clan, and neither alternative is outside the realm of statistical possibility. Consequently, the average clan size could be either (100+100+50)/3 = 83.33 people or (100+50)/2 = 75 people. We simply don't know... UNLESS we know the total population of the country. If the country has only 150 people, we know our second calculation must be correct.

Similarly, we must know the total population in order to determine whether there are multiple clans of identical size, which would otherwise confound our calculations. Given a sample size of 1000, the survey should "even itself out," and the percentage of respondents from a given clan should be roughly equal to the percentage of the national population made up of that clan's members. That is: R/1000 = S/T, where R is the number of respondents from clan A, S is the size of clan A, and T is the total population of the country. Given this information, we can determine the number of clans of a given size by computing (R/1000)/(S/T), where R is now the number of respondents reporting that clan size S. If 1/4 of the respondents report a clan size that constitutes 1/12 of the population, this is a good indication that there are 3 clans of this size. Depending on the number of different reported sizes, this may be cumbersome to do by hand, but it should give you the right answer.
# CLANS = (R1/1000)/(S1/T) + (R2/1000)/(S2/T)+ (R3/1000)/(S3/T) + ...
Once you have the number of clans, you can divide the total population by number of clans to get the average size!
PPL/CLAN = T/(# CLANS)
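
As a short worked step, assuming the relation R/1000 = S/T holds for every reported size, the clan count can be substituted into T/(# CLANS). Here k is an index added for illustration over the distinct reported sizes, R_k is the number of respondents reporting size S_k, and the R_k sum to 1000.

$$
\#\text{CLANS} = \sum_k \frac{R_k/1000}{S_k/T} = \frac{T}{1000} \sum_k \frac{R_k}{S_k},
\qquad
\frac{T}{\#\text{CLANS}} = \frac{1000}{\sum_k R_k / S_k}
$$

Under that assumption T cancels, so the average depends only on the survey responses. For the 3-person example above (with a sample size of 3 in place of 1000), this gives 3/(2/100 + 1/50) = 75, which is also the sample size divided by the sum of 1/n over the respondents.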
 
