Not quite Poisson - how do I derive a transform pdf/cdf?

In summary: A binomial distribution is a function of two things: the number of trials, and the probability of success on each trial. You don't want a binomial distribution, because you are making up the trials and success rates. You want a Poisson distribution.
  • #1
iambasil
Hello,

I'm looking at some sporting data (similar to goals in a match) and trying to figure out what distribution applies to their count per match.

Typically, Poisson is used in the industry to model the distribution. When I look at the historical events, Poisson isn't too bad, but it tends to overestimate both the lower and the higher counts. In reality the distribution sits more tightly around the mean number of 'goals' in the match, and the kurtosis of the pdf derived from observations is higher than Poisson would suggest.
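A quick way to check the pattern described above is to compare the sample variance with the sample mean: for Poisson data the ratio is close to 1, and a ratio well below 1 indicates counts that cluster more tightly around the mean. The sketch below is illustrative only; the `counts` list is made up and is not the attached data, and `dispersion_index` is a hypothetical helper name.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean

# Hypothetical per-match counts, made up for illustration (not the attached data).
counts = [7, 8, 6, 7, 9, 5, 7, 8, 6, 7, 10, 7, 6, 8, 7, 4]

mean = sum(counts) / len(counts)
print(f"mean = {mean:.2f}, dispersion index = {dispersion_index(counts):.2f}")

# Observed relative frequency next to the Poisson probability for each count.
for k in range(min(counts), max(counts) + 1):
    observed = counts.count(k) / len(counts)
    print(f"k={k:2d}  observed={observed:.3f}  poisson={poisson_pmf(k, mean):.3f}")
```

With real data, the same two numbers (mean and dispersion index) can be computed per year to see whether the departure from Poisson is stable.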

I've attached a worksheet (had to zip it as it was 125k) with data and analysis I did on this - my conclusion above was based on 4 years worth of data (804 events). However, in the attached I also broke this down into each year - and the general conclusion about the error in Poisson seems to hold true for each and every year.

What I would really like to do is to learn how to create a pdf based on a transformed Poisson - transformed based on my learnings looking at the historical data. I'll actually be applying the distribution to various other scenarios with different criteria, hence a transform of Poisson would be more useful to me than simply using the derived pdf based on historical observations around a single mean (e.g. E(x) is very different when it is raining).

It was many many years ago that I was a student but working on shaking off the rust. Forgive me if I've mis-used words such as 'transform'. Really appreciate your help!

Many thanks,

Basil
 

Attachments

  • DistributionData.zip
    24.7 KB
  • #2
iambasil said:
my conclusion above was based on 4 years worth of data (804 events). However, in the attached I also broke this down into each year - and the general conclusion about the error in Poisson seems to hold true for each and every year.

Basil

The Poisson works best with a large number of observations n and a small number of events k. You might try the binomial distribution and see how well it describes your data. As you seem to understand, any common distribution will not likely be a perfect fit to any particular data set, but will simply be a limiting theoretical distribution over many data sets. Since you say you have a persistent "error", the more general binomial distribution would be my next choice.

EDIT: Your data suggest a non-random component (a systematic error). In that case you can't expect any probability distribution to be fully satisfactory. However, I would still suggest a binomial approximation to see if it's a better fit.
 
  • #3
Thanks so much for the reply and thoughts Stephen.

The binomial does help to reduce the error - but in fact flips it in the other direction (using the assumptions I applied) - something between Poisson and binomial is what I'm after.

I already use the binomial to model other elements of sport (an example would be field goals in an American Football match, since there is a known success rate, and the number of trials is known as it is a function of the number of touchdowns expected). In this instance however, I'm effectively making up both the number of trials and the probability of success (e.g. for the number of touchdowns in a match, the trials vary in every match, and the definition of a trial is open to interpretation, as is the probability of success). The Poisson is theoretically the more appropriate distribution to use, isn't it?

The error is systematic - and I totally understand why some reasons would apply - such as the psychology of sport and winning. e.g. 0 events is less likely than statistical theory would suggest.

For my purposes in the real world, a generic approximation including known errors won't really do - there is a clear distribution in the data that is quite consistent... and this is what I need to model.

Is there not a way I can create my own custom distribution function based on my data - ideally applying a transformation to an established function such as Poisson?

Many thanks for your help with this Stephen.

Basil
 
  • #4
iambasil said:
The error is systematic - and I totally understand why some reasons would apply - such as the psychology of sport and winning. e.g. 0 events is less likely than statistical theory would suggest.

For my purposes in the real world, a generic approximation including known errors won't really do - there is a clear distribution in the data that is quite consistent... and this is what I need to model.

Is there not a way I can create my own custom distribution function based on my data - ideally applying a transformation to an established function such as Poisson?

Many thanks for your help with this Stephen.

Basil

If it's a probability distribution, then you are dealing with the values that a variable x can take under a function F(x), where F is a probability mass function (PMF) in your case because it's a discrete distribution. You could use a PDF for a continuous approximation and integrate, and that's often done with the binomial.

The basic problem here is that x needs to be a random variable for a PMF or PDF. You apparently don't have that, at least for certain observations. Each observation has to be independent. Troublesome observations like yours can occur randomly for single samples and they are called "outliers". In curve fitting, they can often be ignored, and this is a legitimate technique in many situations. However, when such outliers persist with repeated sampling, you must question the model.

Note, you can always fit curves to data. Polynomial regression can fit most any data set that can be described by an algebraic function just by adding terms. I don't think that's what you want. Why not just show the graphs and explain it as the shape of the data? There are known or unknown associations among certain observations that appear to violate the assumptions of a single probability distribution.
 
  • #5
Thanks for supporting Stephen. Indeed, I meant to have been stating PMF rather than PDF (it's been a long time since I studied stats!).

Regarding the point:
SW VandeCarr said:
Why not just show the graphs and explain it as the shape of the data?
This isn't actually an academic piece of work - I'm looking to use the derived distribution to model the probability of the occurrence of events under numerous scenarios/criteria, hence the need to have a model that is applicable under various scenarios. The key input variables that I will have to hand for each scenario are the mean number of events expected (lambda) and the number of events for which the probability is being returned (k) - hence I would need my model to be driven by these metrics (as with Poisson).

As such, I was investigating applying a coefficient within the Poisson formula to match the observed distributions (ugly, and I'm not proud), or looking at further methods such as MCMC or alternative distributions.
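One established way to put a coefficient inside the Poisson formula, not mentioned in the thread and named here only as a possible option, is the Conway-Maxwell-Poisson form, where P(k) is proportional to lambda^k / (k!)^nu. With nu = 1 it is the ordinary Poisson, and nu > 1 pulls probability towards the mean. The sketch below is a minimal illustration under stated assumptions: the value nu = 1.3 is made up, the support is truncated at `k_max`, and `solve_lam_for_mean` is a hypothetical helper that finds the lambda giving a requested mean, since lambda is no longer the mean when nu differs from 1.

```python
import math

def com_poisson_pmf(lam, nu, k_max=200):
    """PMF over 0..k_max with weights lam**k / (k!)**nu, normalised to sum to 1.

    nu = 1 gives the ordinary Poisson (up to truncation at k_max).
    Weights are computed in log space to avoid overflow.
    """
    log_w = [k * math.log(lam) - nu * math.lgamma(k + 1) for k in range(k_max + 1)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def mean_of(pmf):
    return sum(k * p for k, p in enumerate(pmf))

def variance_of(pmf):
    m = mean_of(pmf)
    return sum((k - m) ** 2 * p for k, p in enumerate(pmf))

def solve_lam_for_mean(target_mean, nu, lo=1e-6, hi=1e6, iters=100):
    """Bisection on a log scale: find lam so that the distribution has the target mean."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if mean_of(com_poisson_pmf(mid, nu)) < target_mean:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

target_mean = 7.26   # mean quoted later in the thread
nu = 1.3             # hypothetical shape value; it would need to be fitted to data

lam = solve_lam_for_mean(target_mean, nu)
pmf = com_poisson_pmf(lam, nu)
print(f"lam = {lam:.3f}, mean = {mean_of(pmf):.3f}, variance = {variance_of(pmf):.3f}")
```

The model is then driven by the expected mean and k, as with Poisson, plus one extra shape value that would be estimated once from historical data.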

However, in playing around with the binomial further (thanks to your encouragement), I did find that by assuming:
- trials (n) = 33
- P(success) (p) = 7.26/33 (7.26 being the expected number of events per match),
then this binomial distribution very closely matches the observed distribution across all years, as well as in each individual year (adjusting the 7.26 mean for each year's value).
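To see why this choice tightens the distribution, the two PMFs can be put side by side. The sketch below uses the n = 33 and mean 7.26 quoted above; the binomial variance n*p*(1-p) is smaller than the Poisson variance (which equals the mean), so the binomial puts less probability in both tails and more near the mean.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def binomial_pmf(k, n, p):
    """P(K = k) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, mean = 33, 7.26      # values quoted in the post
p = mean / n

print(f"Poisson variance  = {mean:.2f}")
print(f"binomial variance = {n * p * (1 - p):.2f}")

# Probabilities at a few counts from the lower tail to the upper tail.
for k in range(0, 17, 2):
    print(f"k={k:2d}  poisson={poisson_pmf(k, mean):.4f}  binomial={binomial_pmf(k, n, p):.4f}")
```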

I subsequently tried to apply this to the separated teams. Typically:
- team A expects 4.9 events per game
- team B expects 2.4 events per game

My logic would state that the teams would keep roughly equal trials, but team A would have a higher p value than team B. However, whether I:
- split trials equally and adjusted p for each team based on E(events), or
- kept p equal for both teams, but distributed n (33) between the teams based on E(events),
both methods still resulted in P(0) and P(1) being overestimated for both teams (although not for the total events distribution). Also, P(Team A 0 events) * P(Team B 0 events) ≠ P(0 Total events) regardless of what I do. Confusing.
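The mismatch can be reproduced numerically. If the two teams' counts are independent, the distribution of the total is the convolution of the two team distributions. For Poisson the result is again Poisson, and for binomials that share the same p the result is again binomial, but two binomials with different p do not add up to a binomial. The sketch below is a minimal check under assumptions: independence between teams, an integer split of 22 + 11 trials for the shared-p case (the exact mean-proportional split of 33 is not an integer, so the team means are only approximately 4.9 and 2.4), and a 17 + 16 split for the different-p case.

```python
import math

def poisson_pmf(lam, k_max):
    """Poisson probabilities for k = 0..k_max."""
    return [math.exp(-lam) * lam**k / math.factorial(k) for k in range(k_max + 1)]

def binomial_pmf(n, p):
    """Binomial probabilities for k = 0..n."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def convolve(pmf_a, pmf_b):
    """PMF of the sum of two independent counts."""
    out = [0.0] * (len(pmf_a) + len(pmf_b) - 1)
    for i, pa in enumerate(pmf_a):
        for j, pb in enumerate(pmf_b):
            out[i + j] += pa * pb
    return out

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

mean_a, mean_b = 4.9, 2.4          # per-team expectations quoted in the post
mean_total = mean_a + mean_b
n_total = 33                       # trial count quoted in the post

# Poisson: the sum of independent Poissons is Poisson with the summed mean.
k_max = 40
pois_sum = convolve(poisson_pmf(mean_a, k_max), poisson_pmf(mean_b, k_max))[: k_max + 1]
pois_diff = max_abs_diff(pois_sum, poisson_pmf(mean_total, k_max))

# Binomial with a shared p and an integer split of the trials (assumed 22 + 11).
p_shared = mean_total / n_total
same_p_sum = convolve(binomial_pmf(22, p_shared), binomial_pmf(11, p_shared))
same_p_diff = max_abs_diff(same_p_sum, binomial_pmf(n_total, p_shared))

# Binomial with a near-equal split (assumed 17 + 16) and p chosen to match each mean.
diff_p_sum = convolve(binomial_pmf(17, mean_a / 17), binomial_pmf(16, mean_b / 16))
diff_p_diff = max_abs_diff(diff_p_sum, binomial_pmf(n_total, p_shared))

print(f"Poisson, split mean:      max difference = {pois_diff:.2e}")
print(f"binomial, shared p:       max difference = {same_p_diff:.2e}")
print(f"binomial, different p:    max difference = {diff_p_diff:.2e}")
```

A spreadsheet binomial function that truncates a non-integer n would break the shared-p case too, which may be one reason neither split matched in practice.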

In your viewpoint, am I on a valid track with any of the above areas?
  • I'm conscious that I know little about MCMC and other potential distributions/transformations
  • The figure of 33 trials I applied to the binomial has no basis other than the fact that it makes the distributions closely match. And I obviously have the problems I ran into when trying to split the PMF into the separate teams.

Many thanks for your help today Stephen, it has really sped up my learning... I just hope I'm on the right track with the above. If I'm getting somewhere with the binomial, would really appreciate some pointers on the splitting by team issue I encountered.

Basil
 
  • #6
iambasil said:
  • I'm conscious that I know little about MCMC and other potential distributions/transformations
  • The figure of 33 trials I applied to the binomial has no basis other than the fact that it makes the distributions closely match. And I obviously have the problems I ran into when trying to split the PMF into the separate teams.

Many thanks for your help today Stephen, it has really sped up my learning... I just hope I'm on the right track with the above. If I'm getting somewhere with the binomial, would really appreciate some pointers on the splitting by team issue I encountered.

Basil

I thought you were working from data. Now you're saying that, essentially, you're interested in simulation. For that, you specify the parameters of your distributions and then you typically run random number generators (RNGs) to "sample" from your (fake) parametrized distributions. There are programs that will do this. So for a random Poisson with a specified λ, the RNG will generate a 'sample' of counts k that is consistent with that parameter. You can also use the parameters from a real data set, but you had better keep your simulated samples well defined and not mixed with real data. People tend to frown on mixing observed and simulated "samples".
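As a rough sketch of what such a generator does, the following draws Poisson counts from uniform random numbers using Knuth's multiplication method (a standard textbook algorithm, chosen here for illustration and not taken from the thread). The mean 7.26 is the value quoted earlier in the thread; the seed and sample size are arbitrary.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count by multiplying uniform draws (Knuth's method)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

rng = random.Random(42)            # fixed seed so the run is repeatable
lam = 7.26                         # mean quoted earlier in the thread
draws = [sample_poisson(lam, rng) for _ in range(20000)]

sim_mean = sum(draws) / len(draws)
sim_var = sum((d - sim_mean) ** 2 for d in draws) / (len(draws) - 1)
print(f"simulated mean = {sim_mean:.3f}, simulated variance = {sim_var:.3f}")
```

For a Poisson sample both figures should land near λ, which is a simple sanity check on the generator.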

One can use Markov chains together with Monte Carlo simulation (MCMC), but since I really don't know what your objectives are, I can't say whether you're on the right track. The following link may provide some useful information. Note that Monte Carlo simulations are usually generated from uniform distributions.

http://www.lexjansen.com/pharmasug/2007/sp/sp07.pdf
 
  • #7
Thanks for the feedback and link... it's been 13 years since I saw SAS code!

Sorry for not being clearer on my intended use of the data. I think I can clear that up pretty quickly...

Let's say the 'events' I've been referring to and have data for represent the number of touchdowns in a US Football game.

My end goal is to be able to estimate the % likelihood of there being x touchdowns in future matches when I know that E(x) is a known y (where y can be a decimal expectation).

The starting point was Poisson. This would allow me simply to enter y (which changes depending on whether I'm looking at team A, team B or total touchdowns - and the value for each will change for each match). I will always have a value for y (based on other analysis that I have already modeled accurately).
- but comparing against historical results, Poisson was shown to be less accurate than I'd hoped (overestimating the likelihood of very few and very high numbers of touchdowns)
- the errors in Poisson seemed to be very consistent, whether I looked at Team A, Team B or Total Touchdowns, and no matter what period of time I looked at
- this is where I started my research into the distributions and posted on the physics forum.

I now find that the binomial can accurately reflect the distributions observed in count of touchdowns per match. However, this requires parameters of n (count of trials) and p (probability of success).
- I know the value of n*p (which equals y - the expected number of touchdowns)
- I have no idea how to split the y between n and p (how does one ascertain the number of trials a team have in getting a touchdown? They are trying for every minute of the game)
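One possible way to pin down n without defining what a "trial" is, offered here as an assumption and not something proposed in the thread, is the method of moments: a binomial has mean n*p and variance n*p*(1-p), so an observed mean and variance determine both parameters. In the sketch below the mean is the 7.26 quoted earlier, while the variance of 5.66 is a hypothetical figure chosen for illustration; the real value would come from the historical data.

```python
def binomial_params_from_moments(mean, variance):
    """Solve mean = n*p and variance = n*p*(1-p) for (n, p)."""
    if not 0 < variance < mean:
        raise ValueError("a binomial needs 0 < variance < mean")
    p = 1 - variance / mean
    n = mean / p
    return n, p

# Mean from the thread; the variance is a hypothetical value for illustration.
n, p = binomial_params_from_moments(mean=7.26, variance=5.66)
print(f"n = {n:.1f}, p = {p:.3f}")
```

The resulting n is generally not an integer, so it would need rounding before being used in a binomial PMF.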

For the total touchdowns, using n of 33 in binomial (p is now derived since p = y/n) gave an accurate reflection of the observed distribution of touchdowns per match.
- Looking at data, on average there were 40-50 bouts of team possession per match, which leads me to believe that the 33 is a believable statistic for average 'trials' at scoring a touchdown.
- Running this distribution in yearly blocks of data (keeping n at 33 and varying y, hence p, based on the mean touchdowns scored in the year), the binomial with these parameters was consistently an accurate model of the distribution

However, unlike Poisson, where I could split y between home and away team and run the same distribution on each team to provide related results, this doesn't seem to be applicable to the binomial.
- For team A I know that y = E(touchdowns) = 0.66*E(Total Touchdowns)
- Splitting the n of Total Touchdowns equally between the two teams and deriving p from y/n for each team, the distributions are no longer as accurate (although still better than Poisson)
- Worse is that, unlike Poisson, the products of the derived probabilities for teams A and B do not equal the binomial estimates for Total Touchdowns.


So, now I find myself trying to figure out where I'm going wrong. I've attached an updated summary of my calculations.


P.S. The reason that I have left 'events' as being generic is that there are items other than touchdowns which have similar properties/distributions and I would like to replicate the methods used for touchdowns with those items also. The actual data I'm using isn't related to Touchdowns - but it may as well be as the scenario is identical.
 

Attachments

  • Updated DistributionData.zip
    94.7 KB
  • #8
I guess I don't follow your problem in "splitting" expected values. Instead of just touchdowns, let's look at actual NFL stats. To keep it simple, let's just look at the "points scored" stats for each team in the AFC (published in the sports section of newspapers or online). Divide by the number of games to get the expectation of points scored per game for each team. You calculate the variance from the actual data of past games. I would use the normal approximation with these parameters and set up a random normal simulation parametrized accordingly. This would give you a very basic simulation for generating scores. You can also add weights for "points against" to take into account the defensive capabilities of each team. These weights could be applied in the calculation of the expected value of points scored for each team in a given game. You run the simulation to get a simulated score. In general, simulations based on past data will reflect the past, but you can revise the parameters based on recent performance, etc. This is where a Markov chain model comes in, as you are basing your expectation on the probability calculation after the most recent game. Note, these types of simulations are not models of reality. They are simply attempts at "realistic" simulated outcomes, mostly for fantasy games.
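A very basic version of the simulation described above might look like the following. It is only a sketch: the team means and standard deviations are invented for illustration, scores are rounded normal draws floored at zero, and the "points against" weighting is left out.

```python
import random

def simulate_points(mean, sd, rng):
    """One simulated score: a normal draw rounded to a non-negative integer."""
    return max(0, round(rng.gauss(mean, sd)))

# Hypothetical per-game points (mean, standard deviation), made up for illustration.
teams = {"Team A": (24.0, 9.0), "Team B": (20.0, 10.0)}

rng = random.Random(0)             # fixed seed so the run is repeatable
n_games = 10000
wins_a = 0
for _ in range(n_games):
    score_a = simulate_points(*teams["Team A"], rng)
    score_b = simulate_points(*teams["Team B"], rng)
    wins_a += score_a > score_b

print(f"Team A wins {wins_a / n_games:.1%} of simulated games")
```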

This is about all I can say regarding what I believe you're trying to do. Perhaps someone else can comment if I'm not getting it.
 
  • #9
Thanks Stephen. Really appreciate you getting deep into this problem and specialised topic.

I've tried to add a bit to your response below to help clarify for yourself or anybody else who is interested.
SW VandeCarr said:
I guess I don't follow your problem in "splitting" expected values.
I'll try to simplify:
- I have a distribution (X) that works (binomial, n=33) for total touchdowns
- Total Touchdowns = Team A Touchdowns + Team B Touchdowns
- I can't find a way to use X (or a variant of it) to model accurate distributions for Teams A and B
- For any valid distribution model that can be applied across these groups, I would expect P(0 Team A)*P(0 Team B) to equal P(0 Total Touchdowns)
SW VandeCarr said:
let's just look at the "points scored" stats for each team in the AFC (published in the sports section of newspapers or online). Divide by the number of games to get the expectation of points scored per game for each team. You calculate the variance from the actual data of past games. I would use the normal approximation with these parameters and set up a random normal simulation parametrized accordingly. This would give you a very basic simulation for generating scores. You can also add weights for "points against" to take into account the defensive capabilities of each team. These weights could be applied in the calculation of the expected value of points scored for each team in a given game. You run the simulation to get a simulated score. In general, simulations based on past data will reflect the past, but you can revise the parameters based on recent performance, etc.
I have a complex model already that models points and takes into account all sorts of variables including recent form, weather, even down to player data. For NFL and many other sports, scoring is not uniform, hence in order to predict the probability of any individual score, I consider how every single score can be reached - the combination of the different scoring types (which have different point values) and probabilities of each occurring. Essentially I'm building up the probabilities of the total score from the likelihood of each team getting the combinations of touchdowns/conversions(1/2 pt)/field goals/safeties etc.
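The build-up described above can be sketched as a convolution over scoring types. The example below is a simplified illustration, not the poster's model: it assumes the counts of each scoring type are independent, uses Poisson counts purely as a placeholder for whatever count distribution is preferred, treats every touchdown as worth 7 points (conversions are not modeled separately), and uses made-up expected counts.

```python
import math

def poisson_pmf(lam, k_max):
    """Poisson probabilities for k = 0..k_max (placeholder count distribution)."""
    return [math.exp(-lam) * lam**k / math.factorial(k) for k in range(k_max + 1)]

def as_points(count_pmf, points_each):
    """Re-index a PMF over counts as a PMF over points (count * points_each)."""
    out = [0.0] * (points_each * (len(count_pmf) - 1) + 1)
    for count, prob in enumerate(count_pmf):
        out[count * points_each] = prob
    return out

def convolve(pmf_a, pmf_b):
    """PMF of the sum of two independent quantities."""
    out = [0.0] * (len(pmf_a) + len(pmf_b) - 1)
    for i, pa in enumerate(pmf_a):
        for j, pb in enumerate(pmf_b):
            out[i + j] += pa * pb
    return out

# (expected count per game, points per score): hypothetical values for one team.
scoring_types = {
    "touchdown + extra point": (2.4, 7),
    "field goal": (1.6, 3),
    "safety": (0.05, 2),
}

points_pmf = [1.0]                 # start from "0 points with certainty"
for lam, pts in scoring_types.values():
    points_pmf = convolve(points_pmf, as_points(poisson_pmf(lam, 15), pts))

for score in (0, 3, 7, 10, 14, 17):
    print(f"P(points = {score}) = {points_pmf[score]:.4f}")
```

Swapping `poisson_pmf` for a binomial or other count PMF changes only the inputs; the convolution step stays the same.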

By breaking the points down into their component scoring types, the expected values are generally very low, below 5 for each scoring type, hence I'm not sure the normal approximation would be valid. Poisson doesn't reflect the historic observations (overestimating small and big figures and underestimating figures around the mean).

As I mentioned above, having found a distribution (binomial, n=33) that accurately works for eg total Touchdowns, the problem now is that I can't relate this accurately to Team A or Team B Touchdowns (or work in the other direction to derive total touchdowns accurately from combining Team A and Team B distributions).

The details of these lower order scoring are valid and used in their own right (they do not exist solely to return probabilities of total points) - in fact, the intention is to include even more low level detail into the (already complex) model, for example by using similar distributions to model touchdowns by quarter. Small numbers!
SW VandeCarr said:
This is where a Markov chain model comes in, as you are basing your expectation on the probability calculation after the most recent game. Note, these types of simulations are not models of reality. They are simply attempts at "realistic" simulated outcomes, mostly for fantasy games.
This is what I'm looking to get to, but potentially for these small, detailed aspects of games, not just match results/points. It's going to take me a while to learn about all this and be confident enough that I'm applying the stats/theories correctly.
SW VandeCarr said:
This is about all I can say regarding what I believe you're trying to do. Perhaps someone else can comment if I'm not getting it.
Thank you so much for your help on a detailed problem and with a relative novice! I would really like to return the kind effort to yourself or the forum. Please let me know if there is anything I can do.

Basil
 
  • #10
iambasil said:
Thank you so much for your help on a detailed problem and with a relative novice! I would really like to return the kind effort to yourself or the forum. Please let me know if there is anything I can do.

Basil

No problem. Good luck.
 

FAQ: Not quite Poisson - how do I derive a transform pdf/cdf?

1. What is a transform pdf/cdf?

A transform pdf (probability density function) or cdf (cumulative distribution function) is a mathematical function used to describe the probability distribution of a random variable that has been transformed by a given function.

2. How is a transform pdf/cdf different from a regular pdf/cdf?

A regular pdf/cdf describes the probability distribution of a random variable in its original form, while a transform pdf/cdf describes the probability distribution of a random variable that has been transformed by a given function. This allows for more flexibility in modeling complex distributions.

3. How do I derive a transform pdf/cdf?

To derive a transform pdf/cdf, you must first determine the transformation function, then apply the appropriate mathematical operations to the original pdf/cdf. This can be done using methods such as the transformation rule for pdfs/cdfs or the method of substitution.
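For reference, the standard change-of-variables results for Y = g(X) can be written as follows. The symbols p_X, f_X and F_X (the PMF, PDF and CDF of X) are notation introduced here for illustration. The discrete form is the one relevant to count data such as the thread's; the continuous form assumes g is strictly monotone and differentiable.

```latex
% Discrete case: Y = g(X), where X has PMF p_X
p_Y(y) = \sum_{x \,:\, g(x) = y} p_X(x)

% Continuous case: Y = g(X), g strictly increasing, X has CDF F_X
F_Y(y) = F_X\!\left(g^{-1}(y)\right)

% Continuous case: Y = g(X), g strictly monotone and differentiable, X has PDF f_X
f_Y(y) = f_X\!\left(g^{-1}(y)\right) \left| \frac{d}{dy}\, g^{-1}(y) \right|
```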

4. What is the purpose of using a transform pdf/cdf?

Using a transform pdf/cdf allows for more accurate and flexible modeling of complex probability distributions. It also allows for easier calculations and analysis of these distributions.

5. Can a transform pdf/cdf be used for any type of random variable?

Yes, a transform pdf/cdf can be used for any type of random variable as long as an appropriate transformation function can be determined. However, it is most commonly used for continuous random variables.
