Fitting a geometric distribution to data

In summary: To check whether a sequence of coin tosses follows a geometric distribution, split it into non-overlapping runs by counting the heads between consecutive tails (as if you had tossed the coin until it came up tails and then started again), then compare the observed frequency of each run length with the frequency predicted by the geometric distribution.
  • #1
madness
Let's say I have a series of 100 coin tosses, heads or tails. In fact (for my actual data) I don't know if subsequent trials are correlated or what the actual probabilities of getting heads or tails are. Nevertheless, I want to fit a geometric distribution, which gives me the distribution of the number of tails seen before a head comes up.

Now I'm unsure how to actually approach this in practice. Can I take each point in the sequence and calculate how many tails come before a head, or would this overcount by using overlapping sequences? For example, 4 heads in a row would be counted once as 4, then as 3, and then as 2, 1 and 0 if I used this scheme. Alternatively, do I take random samples as starting points, or do I start each time the series alternates between heads and tails? If the series were uncorrelated (which is what the geometric distribution assumes) then it shouldn't matter which of these schemes I choose.
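To make the overcounting concern concrete, here is a minimal sketch of the "start at every point" scheme. The function name `overlapping_counts` and the `"H"`/`"T"` encoding are assumptions made for illustration, not something from the data described above.

```python
def overlapping_counts(tosses, stop="T"):
    """From every start index, count the outcomes seen before the next `stop`.

    Start positions that never reach a `stop` are skipped, because their
    run length is unknown.
    """
    counts = []
    for i in range(len(tosses)):
        k = 0
        j = i
        while j < len(tosses) and tosses[j] != stop:
            k += 1
            j += 1
        if j < len(tosses):  # keep only starts that actually reach a stop
            counts.append(k)
    return counts


print(overlapping_counts("HHHHT"))  # [4, 3, 2, 1, 0]
```

A single run of 4 heads contributes five counts here. If the tosses really are independent, each of those counts on its own still follows the geometric distribution, but the counts are strongly dependent on one another, so they cannot be treated as independent samples when judging the quality of a fit.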

Any advice? Thanks.
 
  • #2
madness said:
I don't know if subsequent trials are correlated or what the actual probabilities of getting heads or tails are. Nevertheless, I want to fit a geometric distribution

Can I take each point in the sequence and calculate how many tails come before a head, or would this overcount by using overlapping sequences?

Such questions don't have mathematical answers unless enough information is given. You say that you want to "fit" a geometric distribution, but not that you are willing to do the fit using the assumption that the data really is from a geometric distribution.

If we assume the data is from a geometric distribution then, in the jargon of statistics, you are asking what "estimator" to use for the parameter of the geometric distribution. This still doesn't define what you would consider a "good" estimator to be, but it tells us that what you should look up on the web is the topic: "estimators of the parameter of a geometric distribution".

Without looking that up myself, I suspect that the simple estimator [itex] \hat{p} = \frac{\text{number of successes}}{\text{total number of trials}} [/itex] is best at attaining the things people usually want from an estimator (small bias, small variance). I'm too lazy at the moment to confirm this, but we can investigate this further if it will answer your question.
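As a partial check on that suspicion, here is a sketch of the maximum-likelihood calculation. It assumes the data really are [itex]n[/itex] independent geometric counts [itex]k_1, \dots, k_n[/itex] (the number of failures before each success) with [itex]P(K = k) = (1-p)^k \, p[/itex]. The symbols [itex]n[/itex], [itex]k_i[/itex] and [itex]S[/itex] are introduced here for illustration only.

[itex]\log L(p) = n \log p + S \log(1-p), \quad S = \sum_{i=1}^{n} k_i[/itex]

[itex]\frac{d}{dp} \log L(p) = \frac{n}{p} - \frac{S}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{n}{n + S}[/itex]

Here [itex]n[/itex] is the number of successes and [itex]n + S[/itex] is the total number of trials, so under these assumptions the maximum-likelihood estimator coincides with the simple ratio. This by itself says nothing about its bias or variance.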

Let's suppose you don't want to assume the data is from a geometric distribution. To determine what estimator to use, you have to say something more definite about what family of distributions generates the data. Merely saying trials are correlated isn't enough to specify a family of probability distributions, even if you were to state the coefficient of correlation.

If you want to know what's done "in practice" you need to describe what your data is. Someone who has analyzed similar data might know. The practice for one kind of success-fail data is not necessarily the same as the practice for another kind.
 
  • #3
I understand what you're saying here. What I really wanted to do was to count up the number of sequences of heads of each length in the data and compare it to that generated by the geometric distribution. I decided to simply find the indices in the sequence which came up tails and count the gaps in between. This is the same as if I had tossed the coin until it came up tails, noted the number of heads that had come up, and then started again. And by the way, I'm doing this to try to replicate some analysis in a paper that was not at all clear.
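A minimal sketch of that parsing step, assuming the tosses are stored as a string or list of `"H"`/`"T"` symbols (the function name `run_lengths` and the encoding are illustrative choices, not taken from the paper being replicated):

```python
def run_lengths(tosses, stop="T"):
    """Split a toss sequence into non-overlapping runs, one per `stop` symbol.

    Returns the number of non-stop outcomes seen before each stop. Tosses
    after the last stop form an incomplete run and are dropped.
    """
    runs = []
    count = 0
    for t in tosses:
        if t == stop:
            runs.append(count)  # run is complete: record it and start again
            count = 0
        else:
            count += 1
    return runs


print(run_lengths("HHTHTTHHHT"))  # [2, 1, 0, 3]
```

Every toss belongs to exactly one run, so nothing is counted twice. Dropping the incomplete final run is one possible choice; whether the paper does the same is not known.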
 
  • #4
madness said:
What I really wanted to do was to count up the number of sequences of heads of each length in the data and compare it to that generated by the geometric distribution.

You're saying that you want to compute specific statistics from the data and compare them to statistics that would be generated by the geometric distribution. What do you have in mind when you say "compare"?

There are two common meanings of the word "statistic". One meaning is that a statistic is a specific numerical result of an algorithm (like 38.63) that is computed from given numerical data in a sample.

Another meaning of statistic is that it is an algorithm for computing a value from a sample of data. For example, computing the sample mean can be defined as an algorithm. In this sense of the word "statistic", a statistic is a random variable because the inputs to the algorithm are random samples. The statistics generated by a geometric distribution are random variables.

How can we compare a specific number to a random variable? There are various ways. One is to ask for the probability that the random variable is equal to the specific number, or within plus or minus some delta of it. Another way is to ask whether the mean of the random variable is equal to the specific number, or within plus or minus some delta of it. If the analysis you are checking makes a comparison, how is it done?
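As an illustration of the first kind of comparison, the probability can be estimated by simulation. The sketch below takes the sample mean as the statistic and assumes independent geometric draws with a known success probability `p`; the function names, the choice of statistic, and the default `reps` and `seed` values are all assumptions made for the example.

```python
import random


def draw_geometric(p, rng):
    """Number of failures before the first success, with success probability p."""
    k = 0
    while rng.random() >= p:  # rng.random() < p counts as a success
        k += 1
    return k


def prob_mean_within(observed_mean, p, n, delta, reps=2000, seed=0):
    """Estimate P(|sample mean of n geometric draws - observed_mean| <= delta)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sim_mean = sum(draw_geometric(p, rng) for _ in range(n)) / n
        if abs(sim_mean - observed_mean) <= delta:
            hits += 1
    return hits / reps
```

For example, `prob_mean_within(1.2, 0.5, 30, 0.1)` estimates how often 30 geometric counts with p = 0.5 would have a mean between 1.1 and 1.3.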
 
  • #5
I'm looking at the frequency of each outcome of the geometric distribution (0 heads, 1 head, 2 heads, ... before a tail comes up) in the actual data and comparing it to that predicted by the geometric distribution. The main problem I had originally is that the data does not come in the form 1,3,2,4,... (the outcomes of the geometric distribution) but rather 0,1,0,0,0,1,0,0,1,... (the results of the individual Bernoulli trials). I needed to figure out how to parse the sequence so that I extracted the correct subsequences and didn't overcount things. I believe I have now solved the issue.
 
  • #6
madness said:
I'm looking at the frequency of each outcome of the geometric distribution (0 heads, 1 head, 2 heads, ... before a tail comes up) in the actual data and comparing it to that predicted by the geometric distribution. The main problem I had originally is that the data does not come in the form 1,3,2,4,... (the outcomes of the geometric distribution) but rather 0,1,0,0,0,1,0,0,1,... (the results of the individual Bernoulli trials). I needed to figure out how to parse the sequence so that I extracted the correct subsequences and didn't overcount things. I believe I have now solved the issue.

Hi madness,

In your example you parse 0,1,0,0,0,1,0,0,1 as 2,4,3. To check whether your series follows a geometric distribution you can use (among others) a χ² goodness-of-fit test.
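A rough sketch of such a test, using only the Python standard library. It assumes the run lengths are expressed as failures before each success (0, 1, 2, ...), so lengths parsed as 2,4,3 would first have 1 subtracted from each. The function name `geometric_chi_square`, the pooling of long runs into a single tail bin, and the choice of `tail_start` are assumptions for illustration.

```python
from collections import Counter


def geometric_chi_square(runs, tail_start):
    """Chi-square statistic for run lengths against a fitted geometric pmf.

    `runs` holds failures-before-success counts. Values >= `tail_start`
    are pooled into one tail bin so that no expected count is tiny.
    """
    n = len(runs)
    p_hat = n / (n + sum(runs))  # successes / total trials
    q = 1.0 - p_hat
    if q == 0.0:
        raise ValueError("all runs have length 0; nothing to test")
    observed = Counter(min(k, tail_start) for k in runs)
    stat = 0.0
    for k in range(tail_start + 1):
        if k < tail_start:
            expected = n * p_hat * q ** k   # n * P(K = k)
        else:
            expected = n * q ** tail_start  # n * P(K >= tail_start)
        stat += (observed[k] - expected) ** 2 / expected
    dof = (tail_start + 1) - 1 - 1  # bins - 1 - one fitted parameter
    return stat, dof, p_hat
```

The statistic is then compared with a χ² distribution with `dof` degrees of freedom, for instance through a table of critical values. A common rule of thumb is to pick `tail_start` so that every expected count is at least about 5. The test also relies on the runs being independent, which is exactly the assumption in question if the tosses may be correlated.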
 

1. What is a geometric distribution?

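A geometric distribution is a probability distribution that represents the number of trials needed to achieve the first success in a sequence of independent trials, where each trial has a constant probability of success. An equivalent convention, used in the thread above, counts only the failures that occur before the first success.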

2. How is a geometric distribution fitted to data?

To fit a geometric distribution to a sequence of success/failure outcomes, first split the sequence into non-overlapping runs, each ending at a success. Then estimate the probability of success (p) by dividing the number of successes by the total number of trials. Finally, compare the observed frequency of each run length with the frequency predicted by the fitted distribution, for example by plotting both or by using a χ² goodness-of-fit test.

3. What are the assumptions of fitting a geometric distribution to data?

The assumptions of fitting a geometric distribution to data are that the data come from a sequence of independent trials (the outcome of one trial does not affect the outcome of another), that each trial has only two possible outcomes, and that the probability of success is the same on every trial.

4. What is the purpose of fitting a geometric distribution to data?

Fitting a geometric distribution to data allows us to estimate the probability of success on each trial and to make predictions about how long we will wait for the next success. Comparing the fit with the data also indicates whether the trials are plausibly independent with a constant probability of success.

5. Can a geometric distribution be used for continuous data?

No, a geometric distribution is only applicable to discrete count data, where each trial is either a success or a failure and the count can be any non-negative integer (there is no upper limit). For continuous waiting times, the exponential distribution is the usual analogue.
