Practical problem: need a distribution

  • Context: Undergrad 
  • Thread starter: CRGreathouse
  • Tags: Distribution, Practical

Discussion Overview

The discussion revolves around determining an appropriate statistical distribution for analyzing count data from returned surveys. Participants explore methods for estimating confidence intervals and the suitability of various distributions, including Poisson and normal distributions, while addressing the challenges posed by the data characteristics.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant seeks guidance on the appropriate distribution for count data from surveys and expresses uncertainty about using a Poisson model.
  • Another participant suggests estimating the probability distribution over the observed counts and proposes a Bayesian approach to derive the mean.
  • Concerns are raised about the validity of using a Chi-square test without a prior distribution.
  • Some participants discuss the implications of having a sample standard deviation greater than the mean, complicating confidence interval calculations.
  • There is mention of using R to calculate confidence bounds with a Poisson distribution, though doubts are expressed about its adequacy in reflecting sample variability.
  • One participant proposes considering one-sided tests instead of two-sided confidence intervals due to the nature of the data, which is mostly zeroes.
  • Discussion includes the potential for goodness-of-fit tests to assess the appropriateness of the Poisson distribution for the data.
  • Participants express varying levels of confidence in their statistical approaches and the adequacy of their data for the intended analysis.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the best approach to model the data or the appropriate distribution to use. Multiple competing views and uncertainties remain regarding the statistical methods and interpretations.

Contextual Notes

Limitations include the small sample size relative to the population, the nature of the count data, and the challenges in calculating confidence intervals given the observed distributions.

CRGreathouse
This should be an easy question, but I can't think of how to answer it! I never did take enough stats.

I'm looking at n = 77 returned surveys which come from a population of size 627. (At the moment I'm assuming that response is uncorrelated with the answers; it's actually somewhat reasonable in this case, and I don't have the background I'd need to assume otherwise!)

Each survey contains count data: I have
[ ] 0
[ ] 1
[ ] 2
[ ] 3
foos (for various foo). From this I can of course determine the number of foos in the sample, and the BLUE for the total across the population. But what sort of distribution should I use to determine a (say) 90% confidence range? I was toying with misusing a Poisson model here (each respondent acting like a time interval), but even so I wasn't able to determine a CI (must have been doing something very wrong; when I did a normal approximation of the Poisson I came up with a negative lower bound!). In summary:
1. What sort of distribution is appropriate? A simpler one would be better.
2. Very briefly (one sentence or just drop in a link; I'll work out the details) how do I find a CI with that distribution?
 
I think... what you want to do is to estimate the probability distribution over {0, 1, 2, 3} and then take the mean of that probability distribution?

So, you have the unknown variables X = P(0), Y = P(1), Z = P(2), from which you can compute the quantities:
P(My poll of N people got the frequencies A, B, C, (N-A-B-C) | X = x, Y = y, Z = z)

You could probably do some Bayesian thing to estimate

P(X = x, Y = y, Z = z | My poll of N people got the frequencies A, B, C, (N-A-B-C))

from which you can compute a probability distribution on

mu = y + 2z + 3(1-x-y-z)



At least, these are my first instincts. I don't know if this is the right way to go about it.
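
One way to make the Bayesian idea above concrete is a Monte Carlo sketch: put a Dirichlet prior on the probabilities of {0, 1, 2, 3}, sample from the posterior, and read off quantiles of mu. The counts below are hypothetical (the thread never gives the actual frequencies), and the flat Dirichlet(1, 1, 1, 1) prior and the helper name posterior_mean_interval are assumptions made for illustration, not something proposed in the thread.

```python
import random

def posterior_mean_interval(counts, level=0.90, draws=20000, seed=0):
    """Credible interval for mu = sum(k * p_k) under a flat Dirichlet prior.

    counts[k] is the number of respondents who answered k.
    """
    rng = random.Random(seed)
    # Flat prior Dirichlet(1, ..., 1) plus multinomial data gives a
    # Dirichlet(counts + 1) posterior.
    alphas = [c + 1 for c in counts]
    mus = []
    for _ in range(draws):
        # A Dirichlet draw is a vector of independent Gamma draws, normalised.
        g = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(g)
        mus.append(sum(k * gk / total for k, gk in enumerate(g)))
    mus.sort()
    tail = (1.0 - level) / 2.0
    lo = mus[int(tail * draws)]
    hi = mus[int((1.0 - tail) * draws) - 1]
    return lo, hi

# HYPOTHETICAL counts for answers 0, 1, 2, 3 (77 respondents in total).
counts = [33, 24, 12, 8]
lo, hi = posterior_mean_interval(counts)
print(f"90% credible interval for the mean: {lo:.3f} to {hi:.3f}")
```

Multiplying both ends by the population size (627 here) would turn this into an interval for the population total, under the same assumption that respondents are representative.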


If it was the distribution over {0, 1, 2, 3} you wanted, you could do a Chi-square test with three degrees of freedom if each bucket has enough samples.

Maybe the right way to get a distribution on the mean is to start with this?
 
Hurkyl said:
At least, these are my first instincts. I don't know if this is the right way to go about it.

You have no idea how far that went in making me feel non-stupid for asking the question.
 
CRGreathouse said:
You have no idea how far that went in making me feel non-stupid for asking the question.
Heh! Oh, if you didn't notice, I added something about doing a goodness-of-fit test to the end of my previous post.
 
Hurkyl said:
If it was the distribution over {0, 1, 2, 3} you wanted, you could do a Chi-square test with three degrees of freedom if each bucket has enough samples.

I have a {0, 1, 2, 3, 4, 5} distribution (number of objects: discrete), a {0, 2.5, 4, 6, 12, 50, 250} (frequency: continuous, but broken into blocks for the survey), and several categorical or binary distributions.

I don't understand what I'd do with a chi-square. I thought that was when you wanted to test if a prior distribution was reasonable given sample data, but I have no prior idea what the distribution was before starting.
 
Well, the thing I'm hoping will work out is that you can somehow obtain a distribution on the mean by integrating over all possible distributions on {0,1,2,3,4,5} that would yield that mean. Maybe goodness-of-fit isn't the way to go about it, but I decided it was a place to start thinking.
 
For n=77, it's probably safe to assume that the mean number of foo is approximately normally distributed, and it's also probably safe to use the sample estimate of the variance when calculating the confidence interval for the mean.
 
mXSCNT said:
For n=77, it's probably safe to assume that the mean number of foo is approximately normally distributed, and it's also probably safe to use the sample estimate of the variance when calculating the confidence interval for the mean.

Unfortunately that won't do. The sample standard deviation I calculated is greater than the mean, so to get even a 70% confidence interval I'd have to include 0. But the lower bound is the most important part of the estimation, and it can't be that low.
 
You know that you divide the sample std. dev. by sqrt(77) to get the mean std. dev?
 
  • #10
mXSCNT said:
You know that you divide the sample std. dev. by sqrt(77) to get the mean std. dev?

My calculation was

S = (0 - mean)^2 * [# of 0 responses] + (1 - mean)^2 * [# of 1 responses] + (2 - mean)^2 * [# of 2 responses]
sigma = sqrt(S/76)

where [# of 0 responses] + [# of 1 responses] + [# of 2 responses] = 77.

Edit: Actually this is simplified, since I have more possibilities than {0, 1, 2}, but you get what I mean. I used 76 rather than 77 because this is a sample and not the population.
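
The calculation above can be checked in a few lines. This is a sketch with hypothetical counts (chosen only so that n = 77 and the sample total is 74, which matches a mean just under 1); it also computes the standard error of the mean, sigma / sqrt(n), which is the quantity the sqrt(77) remark refers to.

```python
import math

# HYPOTHETICAL frequency table: value -> number of respondents.
# Chosen so that n = 77 and the sample total is 74; not the real survey data.
freq = {0: 33, 1: 24, 2: 12, 3: 6, 4: 2}

n = sum(freq.values())
mean = sum(value * count for value, count in freq.items()) / n

# S = sum of (value - mean)^2, weighted by how often each value occurs.
S = sum((value - mean) ** 2 * count for value, count in freq.items())
sigma = math.sqrt(S / (n - 1))      # sample standard deviation
std_err = sigma / math.sqrt(n)      # standard deviation of the sample mean

print(f"n = {n}, mean = {mean:.4f}")
print(f"sample std. dev. = {sigma:.4f}, std. error of mean = {std_err:.4f}")
```

With these made-up counts the standard deviation exceeds the mean, while the standard error is smaller by a factor of sqrt(77), roughly 8.8, so an interval built from the standard error stays well above zero.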
 
  • #11
If that doesn't give you a small enough confidence interval, I don't know then.
 
  • #12
mXSCNT said:
If that doesn't give you a small enough confidence interval, I don't know then.

My mean is just under 1, since most people report 0. My standard deviation is a touch over 1.

I tend to think that getting results like this means my approach is wrong and I need to model it differently. What do you think of my proposed (mis)use of the Poisson distribution here?
 
  • #13
I tried using R (my first time!) to calculate the 5% and 95% quantiles using a Poisson distribution:
Code:
> qpois(.05*(1:19),0.96104 * 627)
[1] 562 571 577 582 586 590 593 596 599 602 605 609 612 615 619 623 628 634 643

I don't think this is a great approach, since it doesn't take into account the degree to which the sample will randomly deviate from the population. But at least it gives a reasonable bound: 562 to 643 with 90% confidence. (The true 90% bound should then be wider, though I don't know how much.)
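
A hedged sketch of one way to fold in the sampling variability while staying with the Poisson assumption: treat the sample total (0.96104 × 77 ≈ 74 foos) as a single Poisson count, compute an exact 90% interval for its mean by inverting the Poisson CDF, and scale by 627/77. The Python below is an illustration, not what was run in the thread; the helper names are made up, and qpois is re-implemented only to confirm that it reproduces the 562 and 643 from the R output.

```python
import math

def pois_pmf(k, mu):
    # Computed in log space so large means do not underflow or overflow.
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def pois_cdf(k, mu):
    return sum(pois_pmf(i, mu) for i in range(k + 1))

def qpois(p, mu):
    """Smallest k with P(X <= k) >= p, the convention R's qpois uses."""
    k, cum = 0, pois_pmf(0, mu)
    while cum < p:
        k += 1
        cum += pois_pmf(k, mu)
    return k

def solve(f, lo, hi, steps=100):
    """Bisection for a root of f on [lo, hi]; f must change sign there."""
    f_lo_positive = f(lo) > 0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_poisson_interval(x, level=0.90):
    """Exact interval for a Poisson mean given one observed count x > 0."""
    tail = (1 - level) / 2
    lower = solve(lambda mu: (1 - pois_cdf(x - 1, mu)) - tail, 1e-9, x)
    upper = solve(lambda mu: pois_cdf(x, mu) - tail, x, 2 * x + 20)
    return lower, upper

N, n, x = 627, 77, 74          # population, sample size, foos in sample

# Plug-in quantiles, as in the R call (ignores sampling error in the rate).
plug_lo, plug_hi = qpois(0.05, 0.96104 * N), qpois(0.95, 0.96104 * N)

# Interval for the mean of the sample total, scaled up to the population.
mu_lo, mu_hi = exact_poisson_interval(x)
print("plug-in:", plug_lo, plug_hi)
print("scaled exact interval:", round(mu_lo * N / n), round(mu_hi * N / n))
```

The scaled interval is for the expected population total under the Poisson model. It comes out wider than the plug-in range because it carries the uncertainty in the estimated rate.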
 
  • #14
I'm mildly confused about what you're actually trying to compute. (Something got lost or obfuscated in the abstraction.) But, assuming I have a good idea about it...



One thing to consider is that maybe you just don't have enough data to compute what you want.


Since you're interested in lower bounds, maybe you shouldn't be doing confidence intervals, but instead one-sided tests; i.e. find a 95% confidence interval of the form "A < X" rather than one of the form "A < X < B".
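
A minimal sketch of the one-sided version under the normal approximation, using the summary figures quoted in the thread (n = 77, mean 0.96104, population 627). The standard deviation of 1.05 is an assumption standing in for 'a touch over 1'; the exact value is not given.

```python
import math
from statistics import NormalDist

n, N = 77, 627
mean = 0.96104          # sample mean from the thread
sd = 1.05               # ASSUMED: the thread only says "a touch over 1"

std_err = sd / math.sqrt(n)

# One-sided 95% bound: all 5% of the error probability sits in the lower tail.
z = NormalDist().inv_cdf(0.95)
lower_mean = mean - z * std_err
lower_total = N * lower_mean

print(f"95% one-sided lower bound: mean > {lower_mean:.3f}, total > {lower_total:.0f}")
```

The same z value (about 1.645) gives the ends of a two-sided 90% interval, so the one-sided 95% lower bound coincides with the lower end of the two-sided 90% interval. The gain is in the interpretation, not the number.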


Since you've revealed that your actual data is mostly zeroes, a few ones, and sporadic higher values, it seems more plausible that the Poisson could be used. Does the thing you're actually testing for have qualities that suggest Poisson is accurate? You could always do a goodness-of-fit test to see if a Poisson distribution with the right mean is a decent description of the data.
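
Such a goodness-of-fit check could look like the sketch below: estimate the Poisson mean from the data, pool the sparse upper categories, and compare a Pearson chi-square statistic with a tabulated critical value. The counts are hypothetical, and 5.991 is the standard 5% critical value for 2 degrees of freedom (4 bins, minus 1, minus 1 estimated parameter).

```python
import math

# HYPOTHETICAL counts (value -> respondents); not the real survey data.
freq = {0: 33, 1: 24, 2: 12, 3: 6, 4: 2}
n = sum(freq.values())
lam = sum(k * c for k, c in freq.items()) / n   # Poisson mean estimated from data

# Bins 0, 1, 2 and "3 or more" (pooled so every expected count exceeds 5 here).
observed = [freq.get(0, 0), freq.get(1, 0), freq.get(2, 0),
            sum(c for k, c in freq.items() if k >= 3)]
probs = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(3)]
probs.append(1.0 - sum(probs))                  # P(X >= 3)
expected = [n * p for p in probs]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

CRITICAL_5PCT_DF2 = 5.991   # chi-square table value, df = 4 bins - 1 - 1
print(f"chi-square = {chi2:.3f}, reject Poisson at 5%: {chi2 > CRITICAL_5PCT_DF2}")
```

A statistic below the critical value only means the data do not contradict a Poisson model at that level; with 77 responses the test has limited power to detect moderate departures.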
 
  • #16
Yes, that says to use the BLUE, which I was already using. :wink:
 
  • #17
Hurkyl said:
One thing to consider is that maybe you just don't have enough data to compute what you want.

I hope not. I think the main problem is my lack of statistical experience in choosing good models and techniques.

But at least now I have an estimate, even if it is narrower than I think is justified. There's at least one modification I could do to the test, but that would make the problem worse (narrow the range): reducing the population size by the sample size and adding in the known values.
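
The modification described above can be sketched as follows: keep the 74 foos already counted in the 77 returned surveys as known, and apply the plug-in Poisson quantiles only to the 627 − 77 = 550 unsurveyed members. This is an illustration in Python with a made-up helper, not code from the thread.

```python
import math

def qpois(p, mu):
    """Smallest k with P(X <= k) >= p for X ~ Poisson(mu)."""
    k = 0
    cum = math.exp(-mu)
    while cum < p:
        k += 1
        # Log-space pmf avoids overflow in mu**k and k!.
        cum += math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
    return k

N, n = 627, 77
rate = 0.96104                 # foos per respondent, from the thread
known = round(rate * n)        # foos actually observed in the sample

# Original plug-in range over the whole population.
full = (qpois(0.05, rate * N), qpois(0.95, rate * N))

# Modified range: known sample total plus Poisson quantiles for the other 550.
rest = (known + qpois(0.05, rate * (N - n)),
        known + qpois(0.95, rate * (N - n)))

print("whole population:", full, "width", full[1] - full[0])
print("known + remainder:", rest, "width", rest[1] - rest[0])
```

The second range is narrower than the first, which matches the remark that this change would narrow the estimate rather than widen it.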
 
