Practical problem: need a distribution

  • #1
CRGreathouse
This should be an easy question, but I can't think of how to answer it! I never did take enough stats.

I'm looking at n = 77 returned surveys which come from a population of size 627. (At the moment I'm assuming that response is uncorrelated with the answers; it's actually somewhat reasonable in this case, and I don't have the background I'd need to assume otherwise!)

Each survey contains count data: I have
[ ] 0
[ ] 1
[ ] 2
[ ] 3
foos (for various foo). From this I can of course determine the number of foos in the sample, and the BLUE for the total across the population. But what sort of distribution should I use to determine a (say) 90% confidence range? I was toying with misusing a Poisson model here (each respondent acting like a time interval), but even so I wasn't able to determine a CI (must have been doing something very wrong; when I did a normal approximation of the Poisson I came up with a negative lower bound!). In summary:
1. What sort of distribution is appropriate? A simpler one would be better.
2. Very briefly (one sentence or just drop in a link; I'll work out the details) how do I find a CI with that distribution?
 
  • #2
I think... what you want to do is to estimate the probability distribution over {0, 1, 2, 3} and then take the mean of that probability distribution?

So, you have the unknown variables X = P(0), Y = P(1), Z = P(2), from which you can compute the quantities:
P(My poll of N people got the frequencies A, B, C, (N-A-B-C) | X = x, Y = y, Z = z)

You could probably do some Bayesian thing to estimate

P(X = x, Y = y, Z = z | My poll of N people got the frequencies A, B, C, (N-A-B-C))

from which you can compute a probability distribution on

mu = y + 2z + 3(1-x-y-z)



At least, these are my first instincts. I don't know if this is the right way to go about it.
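The Bayesian idea above can be sketched numerically. This is only a rough illustration, not a worked-out method from the thread: it assumes a uniform Dirichlet prior on (X, Y, Z, 1-X-Y-Z), which makes the posterior a Dirichlet with parameters (count + 1), and it uses invented frequencies for A, B, C, N-A-B-C. The function name `posterior_mean_interval` is made up for the example.

```python
import random

def posterior_mean_interval(counts, values, draws=20000, conf=0.90, seed=0):
    """Monte Carlo credible interval for mu = sum(value * probability).

    Assumes a uniform Dirichlet prior (an assumption, not from the thread),
    so the posterior is Dirichlet(count + 1 for each category).
    """
    rng = random.Random(seed)
    mus = []
    for _ in range(draws):
        # A Dirichlet draw is a set of independent gamma draws, normalized.
        g = [rng.gammavariate(c + 1, 1.0) for c in counts]
        total = sum(g)
        mus.append(sum(v * gi / total for v, gi in zip(values, g)))
    mus.sort()
    tail = (1.0 - conf) / 2.0
    lo = mus[int(tail * draws)]
    hi = mus[int((1.0 - tail) * draws) - 1]
    return lo, hi

# Hypothetical frequencies A, B, C, N-A-B-C for the answers 0, 1, 2, 3 (N = 77).
counts = [39, 16, 11, 11]
values = [0, 1, 2, 3]
lo, hi = posterior_mean_interval(counts, values)
print(f"90% credible interval for mu: {lo:.3f} to {hi:.3f}")
```

Multiplying both ends by the population size would give a matching interval for the population total, under the same assumptions.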


If it was the distribution over {0, 1, 2, 3} you wanted, you could do a chi-square goodness-of-fit test with three degrees of freedom, provided each bucket has enough samples.

Maybe the right way to get a distribution on the mean is to start with this?
 
  • #3
Hurkyl said:
At least, these are my first instincts. I don't know if this is the right way to go about it.

You have no idea how far that went in making me feel non-stupid for asking the question.
 
  • #4
CRGreathouse said:
You have no idea how far that went in making me feel non-stupid for asking the question.
Heh! Oh, if you didn't notice, I added something about doing a goodness-of-fit test to the end of my previous post.
 
  • #5
Hurkyl said:
If it was the distribution over {0, 1, 2, 3} you wanted, you could do a Chi-square test with three degrees of freedom if each bucket has enough samples.

I have a {0, 1, 2, 3, 4, 5} distribution (number of objects: discrete), a {0, 2.5, 4, 6, 12, 50, 250} distribution (frequency: continuous, but broken into blocks for the survey), and several categorical or binary distributions.

I don't understand what I'd do with a chi-square. I thought that was for testing whether a prior distribution is reasonable given sample data, but I had no prior idea what the distribution was before starting.
 
  • #6
Well, the thing I'm hoping will work out is that you can somehow obtain a distribution on the mean by integrating over all possible distributions on {0,1,2,3,4,5} that would yield that mean. Maybe goodness-of-fit isn't the way to go about it, but I decided it was a place to start thinking.
 
  • #7
For n=77, it's probably safe to assume that the mean number of foo is approximately normally distributed, and it's also probably safe to use the sample estimate of the variance when calculating the confidence interval for the mean.
 
  • #8
mXSCNT said:
For n=77, it's probably safe to assume that the mean number of foo is approximately normally distributed, and it's also probably safe to use the sample estimate of the variance when calculating the confidence interval for the mean.

Unfortunately that won't do. The sample standard deviation I calculated is greater than the mean, so to get even a 70% confidence interval I'd have to include 0. But the lower bound is the most important part of the estimation, and it can't be that low.
 
  • #9
You know that you divide the sample std. dev. by sqrt(77) to get the std. dev. of the mean (the standard error)?
 
  • #10
mXSCNT said:
You know that you divide the sample std. dev. by sqrt(77) to get the std. dev. of the mean (the standard error)?

My calculation was

S = (0 - mean)^2 * [# of 0 responses] + (1 - mean)^2 * [# of 1 responses] + (2 - mean)^2 * [# of 2 responses]
sigma = sqrt(S/76)

where [# of 0 responses] + [# of 1 responses] + [# of 2 responses] = 77.

Edit: Actually this is simplified, since I have more possibilities than {0, 1, 2}, but you get what I mean. I used 76 rather than 77 because this is a sample and not the population.
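To make the sqrt(77) point concrete, here is a small sketch of the whole calculation. The response counts are hypothetical (the thread never gives the real ones); they are chosen only so that n = 77 and the mean is 74/77, about 0.961. The finite population correction sqrt(1 - n/N) is an addition not mentioned in the thread, included because 77 of 627 is a sizeable share of the population.

```python
import math
from statistics import NormalDist

def mean_interval(counts, population, conf=0.90):
    """Normal-approximation interval for the mean and the population total.

    counts maps each answer to the number of respondents who gave it.
    """
    n = sum(counts.values())
    mean = sum(k * c for k, c in counts.items()) / n
    # Same S and sigma as in the post: squared deviations, divided by n - 1.
    s = sum(c * (k - mean) ** 2 for k, c in counts.items())
    sigma = math.sqrt(s / (n - 1))
    # Standard error of the mean: sigma / sqrt(n), shrunk by the
    # finite population correction for sampling without replacement.
    se = sigma / math.sqrt(n) * math.sqrt(1 - n / population)
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    lo, hi = mean - z * se, mean + z * se
    return {"mean": mean, "sigma": sigma, "se": se,
            "mean_ci": (lo, hi),
            "total_ci": (population * lo, population * hi)}

# Hypothetical response counts (not the real survey data).
counts = {0: 39, 1: 16, 2: 11, 3: 8, 4: 3}
result = mean_interval(counts, population=627)
print(result)
```

With a sigma slightly above 1, the standard error is roughly a ninth of that, so the interval for the mean stays well clear of 0 even though sigma itself exceeds the mean.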
 
  • #11
If that doesn't give you a small enough confidence interval, I don't know then.
 
  • #12
mXSCNT said:
If that doesn't give you a small enough confidence interval, I don't know then.

My mean is just under 1, since most people report 0. My standard deviation is a touch over 1.

I tend to think that getting results like this means my approach is wrong and I need to model it differently. What do you think of my proposed (mis)use of the Poisson distribution here?
 
  • #13
I tried using R (my first time!) to calculate the quantiles from 5% to 95% using a Poisson distribution:
Code:
> qpois(.05*(1:19),0.96104 * 627)
[1] 562 571 577 582 586 590 593 596 599 602 605 609 612 615 619 623 628 634 643

I don't think this is a great approach, since it doesn't take into account the degree to which the sample will randomly deviate from the population. But at least it gives a reasonable bound: 562 to 643 with 90% confidence. (The true 90% bound should then be wider, though I don't know how much.)
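One possible way to account for the sampling error, sketched here in Python rather than R so it runs with the standard library alone: treat the sample total as a single Poisson observation, find an exact (Garwood-style) interval for its mean, and rescale from 77 respondents to 627. The sample total of 74 is inferred from 0.96104 * 77, which is an assumption, and the helper names are invented for the example. It still leans on the Poisson model and ignores any finite population effect.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), with each term computed in log space."""
    if k < 0:
        return 0.0
    return min(1.0, sum(math.exp(-mu + i * math.log(mu) - math.lgamma(i + 1))
                        for i in range(k + 1)))

def poisson_quantile(p, mu):
    """Smallest k with P(X <= k) >= p, the definition R's qpois uses."""
    k, cum = 0, 0.0
    while True:
        cum += math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
        if cum >= p:
            return k
        k += 1

def solve(f, lo, hi, tol=1e-9):
    """Bisection for a root of f, assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == (f_lo > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_poisson_interval(total, conf=0.90):
    """Exact interval for the Poisson mean behind an observed total > 0."""
    alpha = (1 - conf) / 2
    # Lower limit solves P(X >= total | mu) = alpha;
    # upper limit solves P(X <= total | mu) = alpha.
    lower = solve(lambda mu: (1 - poisson_cdf(total - 1, mu)) - alpha, 1e-9, total)
    upper = solve(lambda mu: poisson_cdf(total, mu) - alpha, total, 4 * total + 10)
    return lower, upper

n, population = 77, 627
sample_total = 74  # assumed: inferred from 0.96104 * 77

# Naive approach from the post: population total ~ Poisson(0.96104 * 627).
naive = (poisson_quantile(0.05, 0.96104 * population),
         poisson_quantile(0.95, 0.96104 * population))

# Allowing for sampling error: interval for the expected sample total,
# rescaled from 77 respondents to the population of 627.
mu_lo, mu_hi = exact_poisson_interval(sample_total)
scaled = (mu_lo * population / n, mu_hi * population / n)
print("naive:", naive, "rescaled:", scaled)
```

The rescaled interval comes out wider than 562 to 643, which fits the suspicion that the naive range is too narrow.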
 
  • #14
I'm mildly confused about what you're actually trying to compute. (Something got lost or obfuscated in the abstraction.) But, assuming I have a good idea about it...



One thing to consider is that maybe you just don't have enough data to compute what you want.


Since you're interested in lower bounds, maybe you shouldn't be doing confidence intervals, but instead one-sided tests; i.e. find a 95% confidence interval of the form "A < X" rather than one of the form "A < X < B".
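As a rough illustration of the difference, here is a sketch using a normal approximation for the mean. The inputs are illustrative only: the mean 74/77 matches the 0.96104 quoted earlier, while the sample standard deviation of 1.05 is a hypothetical stand-in for "a touch over 1".

```python
import math
from statistics import NormalDist

def lower_bound(mean, se, conf):
    """One-sided lower confidence bound: the claim is 'bound < true mean'."""
    return mean - NormalDist().inv_cdf(conf) * se

def two_sided(mean, se, conf):
    """Symmetric two-sided interval under the same normal approximation."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    return mean - z * se, mean + z * se

# Illustrative inputs only (the standard deviation is hypothetical).
n = 77
mean = 74 / n
se = 1.05 / math.sqrt(n)

one_sided_95 = lower_bound(mean, se, 0.95)
two_sided_90 = two_sided(mean, se, 0.90)
one_sided_90 = lower_bound(mean, se, 0.90)
print(one_sided_95, two_sided_90, one_sided_90)
```

The one-sided 95% bound coincides with the lower end of the two-sided 90% interval, and a one-sided 90% bound sits higher than either, so dropping the upper limit buys a tighter lower bound at the same stated confidence.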


Since you've revealed that your actual data is mostly zeroes, a few ones, and sporadic higher values, it seems more plausible that the Poisson could be used. Does the thing you're actually testing for have qualities that suggest Poisson is accurate? You could always do a goodness-of-fit test to see if a Poisson distribution with the right mean is a decent description of the data.
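A goodness-of-fit check along these lines might look like the sketch below. The counts are invented for illustration, so the verdict says nothing about the real survey; the pooling threshold and the function name `poisson_gof` are also assumptions. The statistic is compared against tabulated 5% critical values of the chi-square distribution.

```python
import math

def poisson_gof(counts, pool_from):
    """Pearson chi-square statistic for a Poisson fit to tabulated counts.

    counts[k] is the number of respondents answering k; answers of
    pool_from or more are merged into one cell to keep expected counts up.
    """
    n = sum(counts)
    lam = sum(k * c for k, c in enumerate(counts)) / n  # fitted Poisson mean
    probs = [math.exp(-lam) * lam ** k / math.factorial(k)
             for k in range(pool_from)]
    probs.append(1.0 - sum(probs))  # P(X >= pool_from)
    observed = list(counts[:pool_from]) + [sum(counts[pool_from:])]
    expected = [n * p for p in probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Degrees of freedom: cells, minus 1, minus 1 for the estimated mean.
    dof = len(observed) - 1 - 1
    return stat, dof, expected

# Hypothetical counts for answers 0..4 (not the real survey data).
counts = [39, 16, 11, 8, 3]
stat, dof, expected = poisson_gof(counts, pool_from=3)

# 5% critical values of the chi-square distribution for 1, 2, 3 df.
critical = {1: 3.841, 2: 5.991, 3: 7.815}
print(f"chi-square = {stat:.2f} on {dof} df; 5% critical value = {critical[dof]}")
```

A statistic above the critical value would suggest the Poisson model is a poor description of the counts, in which case an interval built on it should not be trusted.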
 
  • #16
Yes, that says to use the BLUE, which I was already using. :wink:
 
  • #17
Hurkyl said:
One thing to consider is that maybe you just don't have enough data to compute what you want.

I hope not. I think the main problem is my lack of statistical experience in choosing good models and techniques.

But at least now I have an estimate, even if it is narrower than I think is justified. There's at least one modification I could do to the test, but that would make the problem worse (narrow the range): reducing the population size by the sample size and adding in the known values.
 

1. What is a distribution?

A distribution refers to the way in which a set of data values is spread out or distributed. It provides information about the frequency and pattern of the data, and is often represented graphically as a histogram or a curve.

2. Why is understanding distribution important in scientific research?

Understanding the distribution of data is crucial for making accurate conclusions in scientific research. It helps to identify patterns and trends in the data, and allows researchers to make predictions and draw conclusions about the population being studied.

3. What factors affect the shape of a distribution?

The shape of a distribution can be affected by several factors, including the sample size, the type of data being collected, and the characteristics of the population being studied. It can also be influenced by outliers or extreme values in the data.

4. How do you determine the appropriate distribution for your data?

The appropriate distribution for your data depends on the type of data you have collected and the research question you are trying to answer. Some common distributions used in scientific research include normal, binomial, and Poisson distributions. It is important to carefully assess your data and consult with a statistician if necessary to determine the best fit distribution.

5. Can distributions be skewed and why is this important?

Yes, distributions can be skewed, meaning they are asymmetrical and do not have a normal bell-shaped curve. This is important because it can affect the interpretation of the data and the validity of statistical tests. Skewed distributions can also indicate the presence of outliers or unusual data points that may need to be investigated further.
