# Practical problem: need a distribution!

Homework Helper
This should be an easy question, but I can't think of how to answer it! I never did take enough stats.

I'm looking at n = 77 returned surveys which come from a population of size 627. (At the moment I'm assuming that response is uncorrelated with the answers; it's actually somewhat reasonable in this case, and I don't have the background I'd need to assume otherwise!)

Each survey contains count data: I have
[ ] 0
[ ] 1
[ ] 2
[ ] 3
foos (for various foo). From this I can of course determine the number of foos in the sample, and the BLUE (best linear unbiased estimator) for the total across the population. But what sort of distribution should I use to determine a (say) 90% confidence interval? I was toying with misusing a Poisson model here (each respondent acting like a time interval), but even so I wasn't able to determine a CI (I must have been doing something very wrong; when I did a normal approximation of the Poisson I came up with a negative lower bound!). In summary:
1. What sort of distribution is appropriate? A simpler one would be better.
2. Very briefly (one sentence or just drop in a link; I'll work out the details) how do I find a CI with that distribution?

Hurkyl
Staff Emeritus
Gold Member
I think... what you want to do is to estimate the probability distribution over {0, 1, 2, 3} and then take the mean of that probability distribution?

So, you have the unknown variables X = P(0), Y = P(1), Z = P(2), from which you can compute the quantities:
P(My poll of N people got the frequencies A, B, C, (N-A-B-C) | X = x, Y = y, Z = z)

You could probably do some Bayesian thing to estimate

P(X = x, Y = y, Z = z | My poll of N people got the frequencies A, B, C, (N-A-B-C))

from which you can compute a probability distribution on

mu = y + 2z + 3(1-x-y-z)

At least, these are my first instincts. I don't know if this is the right way to go about it.
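A minimal sketch of the Bayesian idea above, in R. The counts are hypothetical (invented for illustration; they match only the thread's n = 77), and the uniform Dirichlet prior is an assumption, not something stated in the thread. With that prior, the posterior over (x, y, z, 1-x-y-z) is again Dirichlet, so mu can be simulated directly:

```r
# Hypothetical counts of respondents answering 0, 1, 2, 3.
# Invented for illustration; only their sum (77) comes from the thread.
counts <- c(32, 26, 11, 8)
stopifnot(sum(counts) == 77)

set.seed(1)            # reproducible draws
n_draws <- 10000
alpha <- counts + 1    # ASSUMED uniform Dirichlet(1, 1, 1, 1) prior

# Draw from the Dirichlet posterior by normalising independent gamma draws.
# byrow = TRUE lines the recycled shape vector up with the four columns.
g <- matrix(rgamma(n_draws * length(alpha), shape = alpha),
            nrow = n_draws, byrow = TRUE)
p <- g / rowSums(g)

# Posterior draws of mu = y + 2z + 3(1 - x - y - z).
mu <- as.vector(p %*% c(0, 1, 2, 3))

# Central 90% credible interval for the mean number of foos per respondent.
ci_mu <- quantile(mu, c(0.05, 0.95))
ci_mu
```

Multiplying `ci_mu` by the population size (627) would give a rough range for the population total, under the same assumptions.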

If it was the distribution over {0, 1, 2, 3} you wanted, you could do a Chi-square test with three degrees of freedom if each bucket has enough samples.

Maybe the right way to get a distribution on the mean is to start with this?

Homework Helper
> At least, these are my first instincts. I don't know if this is the right way to go about it.

You have no idea how far that went in making me feel non-stupid for asking the question.

Hurkyl
> You have no idea how far that went in making me feel non-stupid for asking the question.

Heh! Oh, if you didn't notice, I added something about doing a goodness-of-fit test to the end of my previous post.

Homework Helper
> If it was the distribution over {0, 1, 2, 3} you wanted, you could do a Chi-square test with three degrees of freedom if each bucket has enough samples.

I have a {0, 1, 2, 3, 4, 5} distribution (number of objects: discrete), a {0, 2.5, 4, 6, 12, 50, 250} (frequency: continuous, but broken into blocks for the survey), and several categorical or binary distributions.

I don't understand what I'd do with a chi-square. I thought that was for testing whether a prior distribution is reasonable given sample data, but I had no prior idea what the distribution was before starting.

Hurkyl
Well, the thing I'm hoping will work out is that you can somehow obtain a distribution on the mean by integrating over all possible distributions on {0,1,2,3,4,5} that would yield that mean. Maybe goodness-of-fit isn't the way to go about it, but I decided it was a place to start thinking.

For n=77, it's probably safe to assume that the mean number of foo is approximately normally distributed, and it's also probably safe to use the sample estimate of the variance when calculating the confidence interval for the mean.

Homework Helper
> For n=77, it's probably safe to assume that the mean number of foo is approximately normally distributed, and it's also probably safe to use the sample estimate of the variance when calculating the confidence interval for the mean.

Unfortunately that won't do. The sample standard deviation I calculated is greater than the mean, so to get even a 70% confidence interval I'd have to include 0. But the lower bound is the most important part of the estimation, and it can't be that low.

You know that you divide the sample std. dev. by sqrt(77) to get the mean std. dev?

Homework Helper
> You know that you divide the sample std. dev. by sqrt(77) to get the mean std. dev?

My calculation was

S = (0 - mean)^2 * [# of 0 responses] + (1 - mean)^2 * [# of 1 responses] + (2 - mean)^2 * [# of 2 responses]
sigma = sqrt(S/76)

where [# of 0 responses] + [# of 1 responses] + [# of 2 responses] = 77.

Edit: Actually this is simplified, since I have more possibilities than {0, 1, 2}, but you get what I mean. I used 76 rather than 77 because this is a sample and not the population.

If that doesn't give you a small enough confidence interval, I don't know then.

Homework Helper
> If that doesn't give you a small enough confidence interval, I don't know then.

My mean is just under 1, since most people report 0. My standard deviation is a touch over 1.

I tend to think that getting results like this means my approach is wrong and I need to model it differently. What do you think of my proposed (mis)use of the Poisson distribution here?
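For reference, the normal-approximation interval suggested earlier in the thread can be sketched in R as follows. The sample mean 0.96104 is the value used in the thread's own `qpois` call; the standard deviation `s = 1.05` is an assumed stand-in for "a touch over 1", not a reported figure.

```r
n    <- 77        # returned surveys (from the thread)
N    <- 627       # population size (from the thread)
xbar <- 0.96104   # sample mean, as used in the thread's R call
s    <- 1.05      # ASSUMED stand-in for "a touch over 1"

se <- s / sqrt(n)   # standard deviation of the sample mean (standard error)
z  <- qnorm(0.95)   # two-sided 90% interval: 5% in each tail

ci_mean  <- xbar + c(-1, 1) * z * se   # interval for the mean per respondent
ci_total <- N * ci_mean                # scaled up to the population total
ci_mean
ci_total
```

Because the interval is built from `s / sqrt(n)` rather than `s` itself, its lower limit stays well above zero even though the standard deviation exceeds the mean. A finite population correction, `sqrt(1 - n/N)`, would narrow the interval slightly; it is left out of this sketch.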

Homework Helper
I tried using R (my first time!) to calculate the 5% and 95% using a Poisson distribution:
Code:
```r
> qpois(.05*(1:19), 0.96104 * 627)
 [1] 562 571 577 582 586 590 593 596 599 602 605 609 612 615 619 623 628 634 643
```

I don't think this is a great approach, since it doesn't take into account the degree to which the sample will randomly deviate from the population. But at least it gives a reasonable bound: 562 to 643 with 90% confidence. (The true 90% bound should then be wider, though I don't know how much.)
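One way to account for the sample's own random variation, still under the Poisson assumption, is to put an interval on the rate estimated from the 77 responses and then scale it up. The sketch below uses the standard exact (chi-square based) limits for a Poisson mean. The sample total of 74 is an assumption inferred from 0.96104 × 77; it is not stated directly in the thread.

```r
n <- 77           # returned surveys (from the thread)
N <- 627          # population size (from the thread)
total_obs <- 74   # ASSUMED sample total, inferred from 0.96104 * 77

# Exact 90% limits for the Poisson mean of the sample total,
# using the chi-square relationship (5% in each tail).
lo_count <- qchisq(0.05, 2 * total_obs) / 2
hi_count <- qchisq(0.95, 2 * (total_obs + 1)) / 2

# Convert to a per-respondent rate, then scale to the population.
ci_total <- N * c(lo_count, hi_count) / n
ci_total
```

This is an interval for the expected population total (627 times the rate), so it reflects uncertainty in the estimated rate only. As anticipated above, it comes out wider than the 562 to 643 range.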

Hurkyl
I'm mildly confused about what you're actually trying to compute. (Something got lost or obfuscated in the abstraction.) But, assuming I have a good idea about it....

One thing to consider is that maybe you just don't have enough data to compute what you want.

Since you're interested in lower bounds, maybe you shouldn't be doing confidence intervals, but instead one-sided tests; i.e. find a 95% confidence interval of the form "A < X" rather than one of the form "A < X < B".

Since you've revealed that your actual data is mostly zeroes, a few ones, and sporadic higher values, it seems more plausible that the Poisson could be used. Does the thing you're actually testing for have qualities that suggest Poisson is accurate? You could always do a goodness-of-fit test to see if a Poisson distribution with the right mean is a decent description of the data.
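A rough sketch of such a goodness-of-fit check in R. The counts are hypothetical (invented for illustration; they match only n = 77 and an assumed sample total of 74), and pooling the values 3 and above into one cell is a choice made here so that no expected count is tiny.

```r
# Hypothetical counts for responses 0, 1, 2, 3, 4, 5.
# Invented for illustration; not the thread's actual data.
counts <- c(32, 26, 11, 6, 2, 0)
k <- 0:5
n <- sum(counts)
lambda <- sum(k * counts) / n   # Poisson mean estimated from the sample

# Pool 3, 4, 5 (and anything higher) into a single "3+" cell.
obs   <- c(counts[1:3], sum(counts[4:6]))
p_exp <- c(dpois(0:2, lambda), ppois(2, lambda, lower.tail = FALSE))
expected <- n * p_exp

# Pearson chi-square statistic against the fitted Poisson.
stat <- sum((obs - expected)^2 / expected)
df   <- length(obs) - 1 - 1     # one extra df lost for estimating lambda
p_value <- pchisq(stat, df, lower.tail = FALSE)
p_value
```

A small `p_value` would suggest the Poisson shape is a poor description of the data; a large one only means the test found no evidence against it. The degrees-of-freedom adjustment is the usual approximation when the mean is estimated from the same data.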

Homework Helper
Yes, that says to use the BLUE, which I was already using.

Homework Helper
> One thing to consider is that maybe you just don't have enough data to compute what you want.

I hope not. I think the main problem is my lack of statistical experience in choosing good models and techniques.

But at least now I have an estimate, even if it is narrower than I think is justified. There's at least one modification I could do to the test, but that would make the problem worse (narrow the range): reducing the population size by the sample size and adding in the known values.
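The modification described here can be sketched in R by modelling only the 627 - 77 = 550 non-respondents and adding the known sample total. The sample total of 74 is an assumption inferred from 0.96104 × 77, and the rate is treated as known exactly, as in the earlier `qpois` call.

```r
n <- 77           # returned surveys (from the thread)
N <- 627          # population size (from the thread)
total_obs <- 74   # ASSUMED sample total, inferred from 0.96104 * 77
lambda <- total_obs / n

# Model only the N - n non-respondents; the sampled total is already known.
q_rest <- qpois(c(0.05, 0.95), lambda * (N - n))
range_total <- total_obs + q_rest

# For comparison: the range when all N people are treated as unknown.
range_all <- qpois(c(0.05, 0.95), lambda * N)

range_total
range_all
```

Comparing `diff(range_total)` with `diff(range_all)` shows the narrowing: less of the total is left to chance once the sampled values are fixed.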