Fitting a geometric distribution to data

Discussion Overview

The discussion revolves around fitting a geometric distribution to a series of coin tosses, specifically focusing on the number of tails observed before a head appears. Participants explore various methods for analyzing the data, addressing concerns about correlation between trials and the appropriate statistical estimators to use.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant expresses uncertainty about how to count the number of tails before a head, questioning whether overlapping sequences would lead to overcounting.
  • Another participant suggests that without assuming the data follows a geometric distribution, it is necessary to define the family of distributions generating the data to determine an appropriate estimator.
  • A participant proposes counting sequences of heads and comparing them to those predicted by the geometric distribution, indicating a method of finding indices of tails to avoid overcounting.
  • There is a discussion about the meaning of "statistic" and how to compare specific numerical results from the data to the random variables generated by the geometric distribution.
  • One participant clarifies that they are looking at the frequency of results from the geometric series and comparing it to actual data, noting challenges in parsing the sequence correctly.
  • A later reply mentions the use of a χ² goodness-of-fit test as a method to check if the series follows a geometric distribution.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the best approach to analyze the data or the assumptions regarding the distribution. Multiple competing views on how to handle the data and the statistical methods to apply remain present throughout the discussion.

Contextual Notes

Participants highlight limitations in defining the underlying distribution of the data and the potential for correlation between trials, which complicates the fitting process. There is also ambiguity in the terminology used regarding statistics and their comparison.

Who May Find This Useful

This discussion may be useful for individuals interested in statistical analysis of sequences, particularly in the context of fitting distributions to empirical data in probabilistic experiments.

madness
Let's say I have a series of 100 coin tosses, heads or tails. In fact (for my actual data) I don't know if subsequent trials are correlated or what the actual probabilities of getting heads or tails are. Nevertheless, I want to fit a geometric distribution, which gives me the distribution of the number of tails seen before a head comes up.

Now I'm unsure how to actually approach this in practice. Can I take each point in the sequence and calculate how many tails come before a head, or would this overcount by using overlapping sequences? For example, 4 heads in a row would be counted once as 4, then as 3, and then as 2, 1 and 0 if I used this scheme. Alternatively, do I take random samples as starting points, or do I start each time the series alternates between heads and tails? If the series were uncorrelated (which is how the geometric distribution models it), then it shouldn't matter which of these schemes I choose.

Any advice? Thanks.
 
madness said:
I don't know if subsequent trials are correlated or what the actual probabilities of getting heads or tails are. Nevertheless, I want to fit a geometric distribution

Can I take each point in the sequence and calculate how many tails come before a head, or would this overcount by using overlapping sequences?

Such questions don't have mathematical answers unless enough information is given. You say that you want to "fit" a geometric distribution, but not that you are willing to do the fit using the assumption that the data is really from a geometric distribution.

If we assume the data is from a geometric distribution then, in the jargon of statistics, you are asking what "estimator" to use for the parameter of the geometric distribution. This still doesn't define what you would consider a "good" estimator, but it tells us that what you should look up on the web is the topic: "estimators of the parameter of a geometric distribution".

Without looking that up myself, I suspect that the simple estimator \hat{p} = (number of successes) / (total number of trials) is best at attaining the things people usually want from an estimator (small bias, small variance). I'm too lazy at the moment to confirm this, but we can investigate this further if it will answer your question.
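
As a partial check on that suspicion, here is a short derivation sketch. It assumes the data have already been reduced to n independent run lengths k_1, ..., k_n, each counting the failures before a success (the symbols n and k_i are introduced here for illustration). Under that assumption the maximum-likelihood estimator works out to exactly successes over total trials:

```latex
% Assumption: k_1, ..., k_n are i.i.d. counts of failures before the first success
\begin{align*}
P(K = k) &= (1-p)^{k}\, p, \qquad k = 0, 1, 2, \dots \\
\ell(p) &= \sum_{i=1}^{n} \log P(K = k_i)
         = n \log p + \Big(\sum_{i=1}^{n} k_i\Big) \log(1-p) \\
\frac{d\ell}{dp} &= \frac{n}{p} - \frac{\sum_{i=1}^{n} k_i}{1-p} = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{n}{\,n + \sum_{i=1}^{n} k_i\,}
\end{align*}
```

Here n is the number of successes and n + \sum k_i is the total number of trials in the completed runs. This only shows that the ratio is the maximum-likelihood estimator under the stated assumption; it does not by itself settle the questions of bias or variance.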

Let's suppose you don't want to assume the data is from a geometric distribution. To determine what estimator to use, you have to say something more definite about what family of distributions generates the data. Merely saying trials are correlated isn't enough to specify a family of probability distributions, even if you were to state the coefficient of correlation.

If you want to know what's done "in practice" you need to describe what your data is. Someone who has analyzed similar data might know. The practice for one kind of success-fail data is not necessarily the same as the practice for another kind.
 
I understand what you're saying here. What I really wanted to do was to count up the number of sequences of heads of each length in the data and compare it to that generated by the geometric distribution. I decided to simply find the indices in the sequence which came up tails and count the gaps in between. This is the same as if I had tossed the coin until it came up tails, marked the number of heads that had come up, and then started again. And by the way, I'm doing this to try to replicate some analysis in a paper that was not at all clear.
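
A minimal sketch of that parsing scheme, assuming the tosses are stored as a string or list and that a chosen marker outcome ends each run. The function name `gaps_between_markers` and its parameters are made up for illustration; this is not the code from the paper.

```python
def gaps_between_markers(tosses, marker="T"):
    """Return the number of non-marker outcomes before each marker.

    Every toss belongs to exactly one run, so nothing is counted twice.
    A trailing run that never reaches a marker is incomplete and is dropped.
    """
    runs = []
    current = 0
    for toss in tosses:
        if toss == marker:
            runs.append(current)  # run closed by the marker
            current = 0           # start again, as if re-tossing from scratch
        else:
            current += 1
    return runs


# Heads counted before each tail
print(gaps_between_markers("HHTHTTHHHT"))                       # [2, 1, 0, 3]
# The same idea on 0/1 data, with 1 as the marker
print(gaps_between_markers([0, 1, 0, 0, 0, 1, 0, 0, 1], marker=1))  # [1, 3, 2]
```

Because each run restarts right after a marker, the runs do not overlap. Whether they are also independent still depends on the trials themselves being uncorrelated, which this parsing cannot establish.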
 
madness said:
What I really wanted to do was to count up the number of sequences of heads of each length in the data and compare it to that generated by the geometric distribution.

You're saying that you want to compute specific statistics from the data and compare them to statistics that would be generated by the geometric distribution. What do you have in mind when you say "compare"?

There are two common meanings of the word "statistic". One meaning is that a statistic is a specific numerical result of an algorithm (like 38.63) that is computed from given numerical data in a sample.

Another meaning of statistic is that it is an algorithm for computing a value from a sample of data. For example, computing the sample mean can be defined as an algorithm. In this sense of the word "statistic", a statistic is a random variable because the inputs to the algorithm are random samples. The statistics generated by a geometric distribution are random variables.

How can we compare a specific number to a random variable? There are various ways. One is to ask for the probability that the random variable is equal to the specific number, or within plus or minus some delta of it. Another way is to ask whether the mean of the random variable is equal to the specific number, or within plus or minus some delta of it. If the analysis you are checking makes a comparison, how is it done?
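
To make those two kinds of comparison concrete, here is a small sketch for a single run length K. It assumes K counts failures before the first success, so P(K = k) = (1 - p)^k p. The helper names `geometric_pmf`, `prob_within` and `geometric_mean` are invented for this illustration.

```python
import math


def geometric_pmf(k, p):
    """P(K = k) when K counts failures before the first success."""
    return (1 - p) ** k * p


def prob_within(center, delta, p):
    """P(center - delta <= K <= center + delta) for the same K."""
    lo = max(0, math.ceil(center - delta))
    hi = math.floor(center + delta)
    return sum(geometric_pmf(k, p) for k in range(lo, hi + 1))


def geometric_mean(p):
    """Mean of K, the expected number of failures before a success."""
    return (1 - p) / p


p = 0.5
print(prob_within(2, 0, p))   # probability that K equals 2 exactly
print(prob_within(1, 1, p))   # probability that K is within 1 of the value 1
print(geometric_mean(p))      # compare an observed average run length to this
```

The value of p has to come from somewhere, either a hypothesis or an estimate from the data, and which one is used changes what the comparison means.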
 
I'm looking at the frequency of each outcome of the geometric distribution (0 heads, 1 head, 2 heads, ... before a tail comes up) in the actual data and comparing it to that predicted by the geometric distribution. The main problem I had originally is that the series does not come in the form 1,3,2,4,... (the outcomes of the geometric distribution) but rather 0,1,0,0,0,1,0,0,1,... (the raw outcomes of the individual trials). Basically, I needed to figure out how to parse the sequence so that I extracted the correct subsequences and didn't overcount things. I believe I have now solved the issue.
 
madness said:
I'm looking at the frequency of each outcome of the geometric distribution (0 heads, 1 head, 2 heads, ... before a tail comes up) in the actual data and comparing it to that predicted by the geometric distribution. The main problem I had originally is that the series does not come in the form 1,3,2,4,... (the outcomes of the geometric distribution) but rather 0,1,0,0,0,1,0,0,1,... (the raw outcomes of the individual trials). Basically, I needed to figure out how to parse the sequence so that I extracted the correct subsequences and didn't overcount things. I believe I have now solved the issue.

Hi madness,

In your example you parse 0,1,0,0,0,1,0,0,1 as 2,4,3, that is, counting the trials up to and including each 1. To check whether your series follows a geometric distribution you can use, among other methods, a χ² goodness-of-fit test.
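
A rough sketch of how such a test statistic could be computed with plain Python, under several assumptions: the run lengths count failures before each success (subtract 1 from values like 2,4,3 to get that convention), the parameter is estimated from the same data, and long runs are pooled into one tail bin. The function name `geometric_chi_square` and the `max_k` parameter are made up for this example.

```python
from collections import Counter


def geometric_chi_square(runs, max_k=3):
    """Chi-square statistic for run lengths against a fitted geometric law.

    runs  : counts of failures before each success (0, 1, 2, ...)
    max_k : run lengths >= max_k are pooled into one final bin
    Returns (statistic, degrees_of_freedom, p_hat).
    """
    n = len(runs)
    total = sum(runs)
    if n == 0 or total == 0:
        raise ValueError("need at least one run and at least one failure")

    p_hat = n / (n + total)  # successes / total trials
    observed = Counter(min(k, max_k) for k in runs)

    statistic = 0.0
    for k in range(max_k + 1):
        if k < max_k:
            prob = (1 - p_hat) ** k * p_hat   # P(K = k)
        else:
            prob = (1 - p_hat) ** max_k       # P(K >= max_k), pooled tail
        expected = n * prob
        statistic += (observed[k] - expected) ** 2 / expected

    # Conventional count: bins - 1, minus 1 for the estimated parameter
    dof = (max_k + 1) - 1 - 1
    return statistic, dof, p_hat


runs = [0] * 50 + [1] * 25 + [2] * 13 + [3] * 7 + [4] * 5
stat, dof, p_hat = geometric_chi_square(runs, max_k=3)
print(stat, dof, p_hat)
```

The statistic would then be compared with a χ² critical value for `dof` degrees of freedom. The usual caveats apply: expected counts in each bin should not be too small (a common rule of thumb is at least 5), and the test assumes the runs are independent, which is exactly what is in doubt if successive trials are correlated.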
 
