# Statistics - data follows x distribution

• Deimantas
In summary, the data contains 2500 integers with frequencies of zeroes: 1240, ones: 1014, and twos: 246. The mean and variance of the data are 0.6024 and 0.436314, respectively. After considering Poisson, geometric, and binomial distributions, it is determined that the data does not follow a Poisson or geometric distribution based on theoretical values and chi-squared tests. When considering the binomial distribution, using the sample mean and variance to estimate n and p yields values that at first look nonsensical. Additionally, when generating random binomial values with estimated parameters, the results differ significantly from the observed frequencies in the data. Therefore, it is inconclusive which distribution the data follows.
Deimantas

## Homework Statement

The data contains 2500 integers, each is either a 0, 1 or 2:

zeroes: 1240
ones: 1014
twos: 246

Does the data follow Poisson, geometric, binomial or negative-binomial distribution?

## The Attempt at a Solution

The mean of the data is 0.6024 and the variance is 0.436314.
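As a quick check of those two numbers, here is a minimal sketch that recomputes them from the frequency table. It assumes the variance quoted is the population variance (dividing by N rather than N - 1); the variable names are only illustrative.

```python
# Observed frequencies from the problem statement: value -> count
counts = {0: 1240, 1: 1014, 2: 246}

total = sum(counts.values())                                  # number of observations
mean = sum(k * c for k, c in counts.items()) / total          # first moment
second_moment = sum(k * k * c for k, c in counts.items()) / total
variance = second_moment - mean ** 2                          # population variance (divide by N)

print(f"N = {total}, mean = {mean:.4f}, variance = {variance:.6f}")
# prints: N = 2500, mean = 0.6024, variance = 0.436314
```

The variance is clearly smaller than the mean, which is what rules out the negative binomial and already hints against Poisson.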

A negative binomial distribution has a variance greater than its mean, so I only consider the Poisson, binomial and geometric distributions.

A Poisson distribution has its mean equal to its variance. I don't know if I should reject Poisson on that basis alone, though. Using the method of moments and setting λ = 0.6024, I get these theoretical frequencies:

zeroes:1369
ones:825
twos:248

That is not really far off. However, a chi-squared test gives χ² ≈ 55, which is very large and tells me the hypothesis that my data follows a Poisson distribution should be rejected.
I also tried generating random Poisson values with λ = 0.6024 and got

zeroes:1348
ones:880
twos:214

which gives χ² ≈ 33. Closer, but still too large.
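The Poisson figures can be reproduced with a short sketch like the one below. It assumes the value of about 55 was computed over the three observed cells only; the extra "3 or more" cell is an addition of mine, included because a Poisson model with this λ expects a noticeable number of such values while the data has none.

```python
import math

observed = [1240, 1014, 246]      # counts of 0, 1, 2 from the problem statement
N = sum(observed)
lam = 0.6024                      # method-of-moments estimate: lambda = sample mean

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

expected = [N * poisson_pmf(k, lam) for k in range(3)]
expected_tail = N - sum(expected)     # expected number of values >= 3

# Pearson chi-squared over the three observed cells only
chi2_three_cells = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# Same statistic with the ">= 3" cell added (0 observed there)
chi2_with_tail = chi2_three_cells + (0 - expected_tail) ** 2 / expected_tail

print("expected counts:", [round(e) for e in expected])
print("expected values >= 3:", round(expected_tail, 1))
print("chi2 (3 cells):", round(chi2_three_cells, 1))
print("chi2 (with tail cell):", round(chi2_with_tail, 1))
```

For comparison, the 5% critical value of the chi-squared distribution with 1 degree of freedom is about 3.84, so either version of the statistic rejects the Poisson hypothesis.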

As for the binomial distribution, using the method of moments I get 2500·p = 0.6024, so p = 0.000241.
With this estimator, using the theoretical formulas for binomial probabilities, I end up with these values:

zeroes:1369
ones:825
twos:248

These are identical to the theoretical Poisson values. However, when I try to generate 2500 random binomial values with p = 0.000241, I get very different results, something like:

zeroes:1850
ones:531
twos:97

I don't really know why it differs.

Finally, the geometric distribution. I really did not know which estimator to use for this one. I tried using (1-p)/p = 0.6024, which gives theoretical values of

zeroes:1506
ones:599
twos:238

The randomly generated values were very close to these, but they are quite far off from my data.

So, after all this, I still have no clue which distribution my data follows. Could you help me with that?
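A minimal sketch of the geometric fit, assuming the version of the distribution supported on 0, 1, 2, ... with P(X = k) = p(1-p)^k and mean (1-p)/p. Solving (1-p)/p = 0.6024 gives p = 1/1.6024 ≈ 0.624. The counts quoted above (1506, 599, 238) appear to correspond to p = 0.6024 instead, so both are computed here for comparison; neither is close to the data.

```python
observed = [1240, 1014, 246]      # counts of 0, 1, 2 from the problem statement
N = sum(observed)
mean = 0.6024

def geometric_pmf(k, p):
    """P(X = k) for a geometric distribution counting failures before the first success."""
    return p * (1 - p) ** k

# Method of moments: (1 - p) / p = mean  =>  p = 1 / (1 + mean)
p_hat = 1 / (1 + mean)
expected = [N * geometric_pmf(k, p_hat) for k in range(3)]

# Counts obtained if p is set to 0.6024 directly (this reproduces the quoted figures)
expected_alt = [N * geometric_pmf(k, 0.6024) for k in range(3)]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print("p_hat:", round(p_hat, 4))
print("expected with p_hat:", [round(e) for e in expected])
print("expected with p = 0.6024:", [round(e) for e in expected_alt])
print("chi2 with p_hat:", round(chi2, 1))
```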

For the binomial, why do you take n = 2500? Presumably, the data is a random sample of 2500 independent draws from a binomial distribution B(n,p), with both n and p unknown and to be estimated.

I don't think I could solve it with both n and p being unknown. I just took n=2500. Then it's easy to see that p=0.000241

Probabilities of B(2500, 0.000241):

probability of a 0: 0.547
probability of a 1: 0.33
probability of a 2: 0.099

Then I multiply each probability by the sample size, 2500, to get these theoretical frequencies:

zeroes: 1369
ones: 825
twos: 248

So the theoretical frequencies are somewhat close.

I don't think I'm supposed to consider both n and p unknown, because I have only been introduced to an exercise of finding p when n is known. So instead of saying that my sample is drawn from data that follows a binomial distribution, I declare that the sample itself follows a binomial distribution and take n equal to 2500. So I guess what I'm doing here is wrong? How should I estimate both n and p? I'd appreciate your input :)

Yes, I would say it is wrong. You have a sample (numerical) mean and variance, and for bin(n,p) you have formulas for those in terms of n and p.

Hello,

If I estimate n and p using formulas
mean=n*p
variance=n*p*(1-p)
it yields p=0.275707172 and n=2.18492684. I don't see a way to use this, sample n is 2500, so this estimate of n would be nonsense. I've read more about estimating binomial parameters when both n and p are unknown, and using other (more complex) formulas I still get p≈0.29 and n≈2.2. What's wrong?
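The two moment equations can be solved in closed form: dividing the variance by the mean gives 1 - p, and then n = mean / p. A minimal sketch, using the unrounded population variance 0.43631424 (so the last digits differ slightly from the figures quoted above):

```python
mean = 0.6024
variance = 0.43631424          # population variance from the frequency table

# mean = n * p and variance = n * p * (1 - p)  =>  variance / mean = 1 - p
p_hat = 1 - variance / mean
n_hat = mean / p_hat

print(f"p_hat = {p_hat:.6f}, n_hat = {n_hat:.6f}")
```

Note that n here is the binomial parameter (the number of trials behind each single observation), not the number of observations in the sample, so an estimate near 2 is not in conflict with having 2500 data points.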

Another question:
With n=2500 and p=0.00028, probabilities of x=0,1,2 would be
0: 0.4965366317
1: 0.347673
2: 0.1216709
each probability multiplied by n=2500, I get approximately
0: 1241
1: 869
2: 304
However, when I generate random binomial numbers with the SAME parameters, I get results that are VERY different. For example
0: 1857
1: 495
2: 113
Why?

No, what you are doing is nonsense.

The best estimate of n from the data is n = 2; anyway, this matches the fact that each experiment yields one of only three outcomes (k = 0,1,2 in the binomial with n = 2). If you used a binomial with n = 2500, you would absolutely have to find some outcomes > 2.

I will repeat once more: you seem to have a binomial with n = 2, and with some p that you can give a best estimate for. You are looking at a random sample of 2500 independent draws from that binomial.

I will not repeat myself any more; at this point I am out of here.
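The point about outcomes greater than 2 can be checked numerically. This is only a sketch under the parameters tried earlier in the thread, B(2500, 0.000241), with 2500 independent draws; `math.comb` needs Python 3.8 or later.

```python
import math

n, p = 2500, 0.000241      # binomial parameters tried earlier in the thread
N = 2500                   # number of observations in the sample

def binom_pmf(k, n, p):
    """P(X = k) for a binomial distribution B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

p_tail = 1 - sum(binom_pmf(k, n, p) for k in range(3))   # P(X >= 3) for one draw
expected_above_two = N * p_tail                          # expected count of values > 2
p_none_above_two = (1 - p_tail) ** N                     # chance that all N draws are <= 2

print("P(X >= 3):", round(p_tail, 4))
print("expected number of values > 2:", round(expected_above_two, 1))
print("P(no value > 2 in the sample):", p_none_above_two)
```

Under that model several dozen of the 2500 values would be expected to exceed 2, so a sample containing only 0, 1 and 2 is essentially incompatible with it.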

Ray Vickson said:
I will repeat once more: you seem to have a binomial with n = 2, and with some p that you can give a best estimate for. You are looking at a random sample of 2500 independent draws from that binomial.

Thanks for your help, I now realize my perception of binomial distribution was a bit off. This made me understand.
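For completeness, here is a rough sketch of the fit suggested above: a binomial with n = 2 and p estimated by the method of moments as mean / 2. The goodness-of-fit step is my own addition, not something worked out in the thread.

```python
import math

observed = [1240, 1014, 246]      # counts of 0, 1, 2 from the problem statement
N = sum(observed)
n = 2                             # binomial parameter suggested in the thread
p_hat = 0.6024 / n                # method of moments: mean = n * p

def binom_pmf(k, n, p):
    """P(X = k) for a binomial distribution B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

expected = [N * binom_pmf(k, n, p_hat) for k in range(n + 1)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print("p_hat:", p_hat)
print("expected counts:", [round(e) for e in expected])
print("chi2:", round(chi2, 2))
```

With three cells and one estimated parameter there is 1 degree of freedom, for which the 5% critical value is about 3.84. The statistic comes out below that, so B(2, p) with p ≈ 0.30 is not rejected at the 5% level, unlike the Poisson and geometric fits.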

## 1. What is a distribution in statistics?

A distribution in statistics describes how the values in a set of data are spread out. It shows the frequency or likelihood of different values occurring within a dataset. Common types of distributions include the normal, binomial, and Poisson distributions.

## 2. What does it mean when data follows a normal distribution?

When data follows a normal distribution, it means that the frequency of values in a dataset is symmetrically distributed around the mean, creating a bell-shaped curve. This type of distribution is commonly used in statistics because many natural phenomena tend to follow this pattern.

## 3. Can data follow more than one distribution?

Yes, data can follow more than one distribution. In some cases, a dataset may contain subgroups that follow different distributions, or it may not fit perfectly into any one distribution. In these situations, it may be necessary to use multiple distributions to accurately represent the data.

## 4. How do you determine which distribution a dataset follows?

There are a few methods for determining which distribution a dataset follows. One approach is to visually inspect the data and see if it resembles a known distribution. Another method is to use statistical tests, such as the Kolmogorov-Smirnov test, to compare the data to different distributions and see which one fits best.

## 5. Why is it important to know the distribution of your data?

Knowing the distribution of your data is important because it can inform the type of statistical analysis or model that is most appropriate for your data. Different distributions have different properties and assumptions, so using the wrong distribution can lead to inaccurate results. Additionally, understanding the distribution of your data can help you identify any outliers or unusual patterns that may affect your analysis.
