|Mar11-12, 01:33 AM||#1|
Not quite Poisson - how do I derive a transformed pdf/cdf?
I'm looking at some sporting data (similar to goals in a match) and trying to figure out what distribution applies to their count per match.
Typically, Poisson is used in the industry to model the distribution. When I look at the historical events, Poisson isn't too bad, but it tends to overestimate both the low counts and the high counts. In reality the distribution is tighter around the mean number of 'goals' in the match, and the kurtosis of the pdf derived from observations is higher than Poisson would suggest.
I've attached a worksheet (had to zip it as it was 125k) with the data and the analysis I did on this. My conclusion above was based on 4 years' worth of data (804 events). In the attachment I also broke this down by year, and the general conclusion about the error in Poisson seems to hold for each and every year.
What I would really like to do is learn how to create a pdf based on a transformed Poisson, with the transform informed by what I have learned from the historical data. I'll be applying the distribution to various other scenarios with different criteria, so a transform of Poisson would be more useful to me than simply using the pdf derived from historical observations around a single mean (e.g. E(x) is very different when it is raining).
It was many, many years ago that I was a student, but I'm working on shaking off the rust. Forgive me if I've misused words such as 'transform'. Really appreciate your help!
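One quick way to put a number on "tighter than Poisson" is the variance-to-mean ratio, which is 1 for a Poisson distribution and below 1 when the counts are underdispersed. The sketch below is only an illustration: the function name and the sample counts are made up, not taken from the attached worksheet.

```python
from statistics import mean, pvariance

def dispersion_index(counts):
    """Variance-to-mean ratio of a list of counts (1 for an ideal Poisson)."""
    return pvariance(counts) / mean(counts)

# Hypothetical per-match counts, invented purely for illustration.
sample_counts = [7, 8, 6, 7, 9, 5, 7, 8, 6, 7]

ratio = dispersion_index(sample_counts)
print(f"variance / mean = {ratio:.3f}")
```

A ratio clearly below 1 across every year of data would be consistent with the pattern described above (too little mass in the tails relative to Poisson).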
|Mar11-12, 01:53 PM||#2|
EDIT: Your data suggest a non-random component (a systematic error). In that case you can't expect any probability distribution to be fully satisfactory. However, I would still suggest a binomial approximation to see if it's a better fit.
|Mar11-12, 05:35 PM||#3|
Thanks so much for the reply and thoughts Stephen.
The binomial does help to reduce the error, but (using the assumptions I applied) it flips the error in the other direction. A result somewhere between Poisson and binomial is what I need.
I already use the binomial to model other elements of sport (an example would be field goals in an American football match, since there is a known success rate and the number of trials is known, being a function of the number of touchdowns expected). In this instance, however, I'm effectively making up both the number of trials and the probability of success (e.g. for the number of touchdowns in a match, the trials vary from match to match, the definition of a trial is open to interpretation, and so is the probability of success). Isn't the Poisson theoretically the more appropriate distribution to use?
The error is systematic, and I understand some of the reasons why it would be, such as the psychology of sport and winning; e.g. 0 events is less likely than statistical theory would suggest.
For my purposes in the real world, a generic approximation including known errors won't really do - there is a clear distribution in the data that is quite consistent... and this is what I need to model.
Is there not a way I can create my own custom distribution function based on my data - ideally applying a transformation to an established function such as Poisson?
Many thanks for your help with this Stephen.
|Mar11-12, 06:40 PM||#4|
The basic problem here is that x needs to be a random variable for a PMF or PDF. You apparently don't have that, at least for certain observations. Each observation has to be independent. Troublesome observations like yours can occur randomly in single samples, and they are called "outliers". In curve fitting they can often be ignored, and this is a legitimate technique in many situations. However, when such outliers persist with repeated sampling, you must question the model.
Note that you can always fit curves to data. Polynomial regression can fit almost any data set that can be described by an algebraic function just by adding terms. I don't think that's what you want. Why not just show the graphs and explain it as the shape of the data? There are known or unknown associations among certain observations that appear to violate the assumptions of a single probability distribution.
|Mar12-12, 12:45 AM||#5|
Thanks for the support, Stephen. Indeed, I meant PMF rather than PDF (it's been a long time since I studied stats!).
Regarding the point:
As such, I was investigating applying a coefficient within the Poisson formula to match the observed distributions (ugly, and I'm not proud), or looking at further methods such as MCMC or alternative distributions.
However, in playing around with the binomial further (thanks to your encouragement), I did find that by assuming:
- trials (n) = 33
- P(success) (p) = 7.26/33 (7.26 being the expected number of events per match),
then this binomial distribution very closely matches the observed distribution across all years, as well as in each individual year (adjusting the 7.26 mean for each year's value).
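For anyone who wants to see the difference without the worksheet, the two PMFs can be tabulated side by side. This is a minimal sketch using only the figures quoted above (mean 7.26, n = 33); the function and variable names are my own, and no observed frequencies are included because they live in the attachment.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

mean_events = 7.26                  # expected events per match, from the post
n_trials = 33                       # assumed number of trials, from the post
p_success = mean_events / n_trials  # chosen so that n * p equals the mean

for k in range(16):
    print(f"{k:2d}  Poisson={poisson_pmf(k, mean_events):.4f}  "
          f"Binomial={binomial_pmf(k, n_trials, p_success):.4f}")
```

Both distributions share the same mean, but the binomial variance is n*p*(1-p), which is smaller than the Poisson variance of n*p, so the binomial puts less probability in the tails and more near the mean.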
I subsequently tried to apply this to the separated teams. Typically:
- team A expects 4.9 events per game
- team B expects 2.4 events per game
My logic says that the teams would have roughly equal numbers of trials, but team A would have a higher p value than team B. However, whether I:
- split the trials equally and adjust p for each team based on E(events), or
- keep p equal for both teams and distribute n (33) between the teams based on E(events),
both methods still result in P(0) and P(1) being overestimated for both teams (although this is not the case for the total events distribution). Also, P(Team A 0 events) * P(Team B 0 events) does not equal P(0 Total events) regardless of what I do. Confusing.
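The product check can be reproduced numerically. The sketch below assumes the two teams' counts are independent (the product rule only holds under that assumption) and uses integer trial splits that I picked for illustration (22/11 and 17/16), which are not necessarily the splits used in the worksheet. The total mean here is 4.9 + 2.4 = 7.3 rather than 7.26.

```python
def p_zero(n, p):
    """P(X = 0) for a binomial with n trials and success probability p."""
    return (1 - p) ** n

mean_a, mean_b = 4.9, 2.4           # per-team expectations from the post
n_total = 33
p_total = (mean_a + mean_b) / n_total
target = p_zero(n_total, p_total)   # P(0 total events) under Binomial(33, p_total)

# Split 1: same p for both teams, trials divided 22/11 (hypothetical integer split).
# Team means become 22*p and 11*p, close to but not exactly 4.9 and 2.4.
same_p = p_zero(22, p_total) * p_zero(11, p_total)

# Split 2: trials roughly equal (17/16, hypothetical), p set from each team's mean.
diff_p = p_zero(17, mean_a / 17) * p_zero(16, mean_b / 16)

print(f"P(0 total)            = {target:.6f}")
print(f"same p, split n       = {same_p:.6f}")
print(f"split n, different p  = {diff_p:.6f}")
```

With a shared p the product reproduces the total exactly, because the sum of independent binomials with the same p is again binomial. With different p values the sum is no longer binomial, so a mismatch in that case is expected. If the shared-p split also fails to match in the worksheet, non-integer or rounded trial counts may be the cause.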
In your viewpoint, am I on a valid track with any of the above areas?
Many thanks for your help today Stephen, it has really sped up my learning... I just hope I'm on the right track with the above. If I'm getting somewhere with the binomial, would really appreciate some pointers on the splitting by team issue I encountered.
|Mar12-12, 11:09 AM||#6|
One can use Markov chains together with Monte Carlo simulation (MCMC), but since I really don't know what your objectives are, I can't say whether you're on the right track. The following link may provide some useful information. Note that Monte Carlo simulations are usually generated from uniform distributions.
|Mar12-12, 01:19 PM||#7|
Thanks for the feedback and link... it's been 13 years since I saw SAS code!
Sorry for not being clearer on my intended use of the data. I think I can clear that up pretty quickly...
Let's say the 'events' I've been referring to and have data for represent the number of touchdowns in a US Football game.
My end goal is to be able to estimate the % likelihood of there being x touchdowns in future matches when I know that E(x) is a known y (where y can be a decimal expectation).
The starting point was Poisson. This would allow me simply to enter y (which changes depending on whether I'm looking at team A, team B or total touchdowns, and the value for each will change for each match). I will always have a value for y (based on other analysis that I have already modelled accurately).
- but when compared with historical results, Poisson was shown to be less accurate than I'd hoped (overestimating the likelihood of very few and very high numbers of touchdowns)
- the errors in Poisson seemed to be very consistent, whether I looked at Team A, Team B or Total Touchdowns, and no matter what period of time I looked at
- this is where I started my research into the distributions and posted on the physics forum.
I now find that the binomial can accurately reflect the distributions observed in count of touchdowns per match. However, this requires parameters of n (count of trials) and p (probability of success).
- I know the value of n*p (which equals y - the expected number of touchdowns)
- I have no idea how to split the y between n and p (how does one ascertain the number of trials a team have in getting a touchdown? They are trying for every minute of the game)
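One possible way to pin down n and p without defining what a "trial" is would be the method of moments: a binomial has mean n*p and variance n*p*(1-p), so p = 1 - variance/mean and n = mean/p. This is a sketch under the assumption that a sample variance is available from the historical data. The function name is hypothetical, and the variance used below is not an observed value; it is simply the one implied by n = 33 and y = 7.26.

```python
def binomial_from_moments(mean, variance):
    """Return (n, p) of the binomial whose mean and variance match the inputs."""
    if not 0 < variance < mean:
        # A binomial always has variance below its mean.
        raise ValueError("binomial fit needs 0 < variance < mean")
    p = 1 - variance / mean
    n = mean / p
    return n, p

y = 7.26                             # expected touchdowns per match, from the post
implied_variance = y * (1 - y / 33)  # variance implied by n = 33, for illustration only
n_est, p_est = binomial_from_moments(y, implied_variance)
print(f"n = {n_est:.2f}, p = {p_est:.4f}")
```

In practice n_est will rarely be a whole number, so it would need rounding or a distribution that allows a non-integer dispersion parameter.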
For the total touchdowns, using n of 33 in binomial (p is now derived since p = y/n) gave an accurate reflection of the observed distribution of touchdowns per match.
- Looking at data, on average there were 40-50 bouts of team possession per match, which leads me to believe that the 33 is a believable statistic for average 'trials' at scoring a touchdown.
- Running this distribution on yearly blocks of data (keeping n at 33 and varying y, and hence p, based on the mean touchdowns scored in the year), the binomial with these parameters was consistently an accurate model of the distribution
However, unlike Poisson, where I could split y between home and away team and run the same distribution on each team to provide related results, this doesn't seem to be applicable to the binomial.
- For team A I know that y = E(touchdowns) = 0.66*E(Total Touchdowns)
- Splitting n of Total Touchdowns equally between the two teams and deriving p from y/n for each team, the distributions are no longer as accurate (although still better than Poisson)
- Worse is that, unlike Poisson, the products of the derived probabilities for teams A and B do not equal the binomial estimates for Total Touchdowns.
So, now I find myself trying to figure out where I'm going wrong. I've attached an updated summary of my calculations.
P.S. The reason that I have left 'events' as being generic is that there are items other than touchdowns which have similar properties/distributions and I would like to replicate the methods used for touchdowns with those items also. The actual data I'm using isn't related to Touchdowns - but it may as well be as the scenario is identical.
|Mar12-12, 02:22 PM||#8|
I guess I don't follow your problem in "splitting" expected values. Instead of just touchdowns, let's look at actual NFL stats. To keep it simple, let's just look at the "points scored" stats for each team in the AFC (published in the sports section of newspapers or online). Divide by the number of games to get the expectation of points scored per game for each team. You calculate the variance from the actual data of past games. I would use the normal approximation with these parameters and set up a random normal simulation parametrized accordingly. This would give you a very basic simulation for generating scores. You can also add weights for "points against" to take into account the defensive capabilities of each team. These weights could be applied in the calculation of the expected value of points scored for each team in a given game. You run the simulation to get a simulated score. In general, simulations based on past data will reflect the past, but you can revise the parameters based on recent performance, etc. This is where a Markov chain model comes in, as you are basing your expectation on the probability calculation after the most recent game. Note that these types of simulations are not models of reality. They are simply attempts at "realistic" simulated outcomes, mostly for fantasy games.
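A very basic version of the simulation described above might look like the following. It is only a sketch: the team names, means and standard deviations are hypothetical placeholders rather than real NFL figures, the defensive weights are left out, and rounding and flooring at zero are my own choices for turning a normal draw into a score.

```python
import random

def simulate_score(mean_points, sd_points, rng):
    """Draw one score from a normal approximation, rounded and floored at 0."""
    return max(0, round(rng.gauss(mean_points, sd_points)))

rng = random.Random(42)  # fixed seed so runs are repeatable

# Hypothetical (mean, standard deviation) of points per game for each team.
team_stats = {"Team A": (24.0, 9.0), "Team B": (20.0, 10.0)}

simulated = {
    team: [simulate_score(m, s, rng) for _ in range(10_000)]
    for team, (m, s) in team_stats.items()
}

for team, scores in simulated.items():
    print(team, "average simulated score:", sum(scores) / len(scores))
```

Because a normal draw can be negative and is continuous, this approach suits totals like points scored better than small counts such as touchdowns per team.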
This is about all I can say regarding what I believe you're trying to do. Perhaps someone else can comment if I'm not getting it.
|Mar12-12, 03:55 PM||#9|
Thanks Stephen. Really appreciate you getting deep into this problem and specialised topic.
I've tried to add a bit to your response below to help clarify for yourself or anybody else who is interested.
- I have a distribution (X) that works (binomial, n=33) for total touchdowns
- Total Touchdowns = Team A Touchdowns + Team B Touchdowns
- I can't find a way to use X (or a variant of it) to model accurate distributions for Teams A and B
- For any valid distribution model that can be applied across these groups, I would expect P(0 Team A)*P(0 Team B) to equal P(0 Total Touchdowns)
When the points are broken down into their component scoring items, the expected values are generally very low (below 5 for each scoring type), so I'm not sure the normal approximation would be valid. Poisson doesn't reflect the historical observations (it overestimates small and large counts and underestimates counts around the mean).
As I mentioned above, having found a distribution (binomial, n=33) that works accurately for, e.g., total touchdowns, the problem now is that I can't relate it accurately to Team A or Team B touchdowns (or work in the other direction and derive total touchdowns accurately by combining the Team A and Team B distributions).
The details of these lower order scoring are valid and used in their own right (they do not exist solely to return probabilities of total points) - in fact, the intention is to include even more low level detail into the (already complex) model, for example by using similar distributions to model touchdowns by quarter. Small numbers!