Combination probability of variables that are not independent

SUMMARY

This discussion focuses on creating a forecasting model for predicting the supremacy between two sports teams, specifically in US Football, by analyzing two dependent variables: team A's and team B's point hauls. The user, Basil, seeks to calculate the probability of various supremacy outcomes, acknowledging that traditional independent modeling methods yield poor results due to the interdependence of team performances. Recommendations include using logistic regression and generalized linear models (GLMs) with Poisson or binomial distributions to better account for the relationships between the teams' scoring. Tools mentioned for analysis include R and MATLAB, with a suggestion to explore the Expectation Maximization algorithm for fitting data.

PREREQUISITES
  • Understanding of logistic regression and generalized linear models (GLMs)
  • Familiarity with Poisson and binomial distributions
  • Basic knowledge of statistical computational tools such as R and MATLAB
  • Experience with data analysis and regression modeling techniques
NEXT STEPS
  • Learn about logistic regression and its applications in sports forecasting
  • Explore Poisson and binomial distributions in the context of statistical modeling
  • Investigate the Expectation Maximization algorithm for fitting statistical models
  • Download and practice using R for statistical analysis and modeling
USEFUL FOR

Data analysts, sports statisticians, and anyone involved in predictive modeling for sports outcomes will benefit from this discussion, particularly those looking to improve forecasting accuracy in competitive sports scenarios.

iambasil
Hello,

I'm hoping I might be able to get some help in creating a forecasting model (in sports) looking at 2 variables that are not independent of each other.

I'll take US Football (same applies to rugby) as an example. The specific forecast I'm interested in here is the expected supremacy between two teams at the end of a match (Team B points minus Team A points).

There's a fair bit out there others have done looking at how to forecast the most likely supremacy outcome (the 'line' which generally isn't just stats, but involves looking at prices set by the betting world).

However, what I'm most interested in is how to create a forecasted probability of supremacies that are different to the line. As an example, if it is assessed that the E(line) is -4.3, I want to work out the probability that the supremacy would actually be any of:
-10, -9, -8, ..., 0, 1, 2, ..., 9, 10 (etc.)

There is obviously error in the line itself, which needs to be taken into account, so I looked at historic data as a guide. As you might expect, combining the expectation of team A's point haul with that of team B's based on the line, but treating them as independent of each other, does not return a good enough fit (not negatively skewed enough and kurtosis too high). Team A's point haul will generally have an effect on team B's (and vice versa) - they are not independent of each other.

Could you please share some ideas on how to adjust for the fact that team A and team B are related when calculating supremacy? Really appreciate your help!

Basil
 
You are saying that team A's haul is conditional on team B's ... so you are looking for something involving conditional probabilities and Bayes' theorem.
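
As a rough illustration of the conditional-probability idea, the joint distribution of the two hauls can be estimated directly from paired historical scores, and the supremacy distribution read off it without assuming independence. This is only a sketch: the function names and the touchdown pairs below are made up for illustration and are not taken from the thread's data.

```python
from collections import Counter

def joint_pmf(pairs):
    """Empirical joint distribution P(A = a, B = b) from (a, b) score pairs."""
    n = len(pairs)
    return {pair: count / n for pair, count in Counter(pairs).items()}

def conditional_pmf(joint, a_value):
    """P(B = b | A = a_value) = P(A = a_value, B = b) / P(A = a_value)."""
    p_a = sum(p for (a, _), p in joint.items() if a == a_value)
    return {b: p / p_a for (a, b), p in joint.items() if a == a_value}

def supremacy_pmf(joint):
    """P(B - A = d), summed directly over the joint distribution."""
    out = {}
    for (a, b), p in joint.items():
        out[b - a] = out.get(b - a, 0.0) + p
    return dict(sorted(out.items()))

# Made-up touchdown pairs (team A, team B), for illustration only.
pairs = [(2, 3), (2, 1), (3, 3), (1, 2), (2, 3), (4, 2), (3, 4), (2, 2)]
joint = joint_pmf(pairs)
cond = conditional_pmf(joint, 2)   # distribution of B given A scored 2
supremacy = supremacy_pmf(joint)   # distribution of B - A
```

If the two hauls were independent, `conditional_pmf(joint, a)` would look the same for every value of `a`; how much it changes with `a` is one informal view of the dependence.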
 
Thanks for responding.

Well, it's not entirely conditional, but it is affected by it to varying degrees.

So what I'm struggling with is how to measure this effect between the two, and how to apply the measure to the statistics. Any ideas on that?

Many thanks.
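
One standard way to put a number on the link between the two hauls is the sample covariance (or correlation), since Var(B - A) = Var(A) + Var(B) - 2 Cov(A, B). The sketch below compares the supremacy variance implied by independence with the one that accounts for covariance. The scores are invented for illustration and the function names are hypothetical.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator) of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

def supremacy_variance(team_a, team_b):
    """Return (variance assuming independence, variance using the covariance)."""
    var_a = covariance(team_a, team_a)
    var_b = covariance(team_b, team_b)
    cov_ab = covariance(team_a, team_b)
    return var_a + var_b, var_a + var_b - 2 * cov_ab

# Made-up final scores, for illustration only.
team_a = [21, 14, 28, 10, 17, 24]
team_b = [17, 20, 24, 13, 27, 21]
independent_var, actual_var = supremacy_variance(team_a, team_b)
rho = correlation(team_a, team_b)
```

A positive correlation makes the supremacy less spread out than the independent model predicts, and a negative one makes it more spread out. Covariance only captures the linear part of the dependence, so it will not by itself explain a mismatch in skewness or kurtosis.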
 
Hey iambasil and welcome to the forums.

The first thing you will have to do is to come up with a model for your regression.

If you have correlations between observations in time you will need to consider a longitudinal form of analysis.

Given that you are measuring probabilities, you will probably have to use some form of logistic regression involving a generalized linear model.

Typically in simple linear models, the estimator of the response at a given point of your independent variables (i.e. the predictors) has a t-distribution or a Normal distribution.

But if you have conditional distributions, then the analysis will be a lot more complicated (a lot more).
 
Thanks very much for your response chiro.

I'll be honest - a lot of what you wrote is a bit beyond my understanding, even after looking up the terms you used.

I've attached some data and analysis I did as an example (zipped as was over 100k).

I modeled touchdowns of home and away teams based on Poisson distributions on the means of all games - and then summed up the probabilities of the supremacies based on any outcome. This is the closest I could get to matching the observed results, but the error is still beyond tolerable levels.

This method treats the home and away team touchdowns as being independent, which they aren't fully.
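
For reference, the independent calculation described above can be sketched roughly as follows. This is a minimal illustration rather than the attached analysis: the means are placeholders, the function names are hypothetical, and supremacy is taken as team B minus team A, as in the opening post.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)

def supremacy_pmf_independent(lam_a, lam_b, max_count=30):
    """P(B - A = d) when A and B are independent Poisson counts.

    Counts above max_count are ignored, so the result is very slightly
    truncated for means well below max_count.
    """
    pmf_a = [poisson_pmf(k, lam_a) for k in range(max_count + 1)]
    pmf_b = [poisson_pmf(k, lam_b) for k in range(max_count + 1)]
    out = {}
    for a, pa in enumerate(pmf_a):
        for b, pb in enumerate(pmf_b):
            # Independence: the joint probability is the product of marginals.
            out[b - a] = out.get(b - a, 0.0) + pa * pb
    return dict(sorted(out.items()))

# Placeholder touchdown means, not estimated from real data.
pmf = supremacy_pmf_independent(lam_a=2.4, lam_b=2.9)
```

The product `pa * pb` is exactly where the independence assumption enters; a dependent model would replace it with a joint probability for the pair (a, b).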

It would be great if you could suggest how to do the regressions/distributions based on what you see in the data? Forgive me for being unfamiliar and out of touch with methods.
 


So are you trying to find a model to predict supremacy in terms of away touchdowns and home touchdowns?

If so, you should look into logistic-type regressions with Poisson models for the independent variables (i.e. the away and home touchdowns).

What kind of experience do you have with statistical computational tools?

You can download R for free at http://www.r-project.org

Take a look at this:

http://www.lisa.stat.vt.edu/sites/default/files/Poisson.and_.Logistic.Regression.pdf

(Scroll down to Poisson regression)
 
Thanks Chiro,

My thought was that if I could get the touchdowns right, I could then do it for field goals and any other scoring types similarly. By combining each of these and multiplying by points value, I could effectively evaluate the supremacy in total points.
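
A rough sketch of that combination step is below. It assumes, purely for illustration, that each scoring type is an independent Poisson count, that a touchdown is worth 7 points (6 plus the extra point) and a field goal 3, and that the two teams are independent, which is the very assumption the thread questions. The means and function names are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)

def points_pmf(count_means, point_values, max_count=25):
    """Distribution of total points: sum of (count of each type) * (its value)."""
    total = {0: 1.0}
    for lam, value in zip(count_means, point_values):
        nxt = {}
        for pts, p in total.items():
            for k in range(max_count + 1):
                key = pts + k * value
                nxt[key] = nxt.get(key, 0.0) + p * poisson_pmf(k, lam)
        total = nxt
    return total

def difference_pmf(pmf_a, pmf_b):
    """P(B - A = d) for two independent point distributions."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[b - a] = out.get(b - a, 0.0) + pa * pb
    return dict(sorted(out.items()))

VALUES = (7, 3)  # assumed points per touchdown and per field goal
team_a = points_pmf((2.5, 1.5), VALUES)  # placeholder means per scoring type
team_b = points_pmf((3.0, 1.2), VALUES)
supremacy = difference_pmf(team_a, team_b)
```

Because points arrive in lumps of 7 and 3, the resulting supremacy distribution is spiky rather than smooth, with some margins much more likely than their neighbours.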

A friend can help with MATLAB, and I can download R too. I used SAS, Minitab, Maple 13 years ago and don't remember much! I'm nifty on Excel and know VBA.

Thanks for the link. I'll spend some time to understand this. One thing to highlight though: although I used Poisson in the example (because it gave the best results), the actual best fit (and it's very good!) for total/team total touchdowns (as opposed to supremacy) in this case was binomial.
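
A binomial fitting better than a Poisson usually points to counts that are less spread out than a Poisson with the same mean, because a binomial's variance is always below its mean. One way to check this is to compare the variance-to-mean ratio and the log-likelihoods of the two fits. The sketch below uses invented counts and an assumed number of binomial trials (`n_trials`, e.g. scoring opportunities per game); neither comes from the thread's data.

```python
from math import comb, exp, factorial, log
from statistics import mean, pvariance

def dispersion_index(counts):
    """Variance / mean: about 1 for Poisson data, below 1 for binomial data."""
    return pvariance(counts) / mean(counts)

def poisson_loglik(counts):
    """Log-likelihood at the Poisson maximum-likelihood estimate (the mean)."""
    lam = mean(counts)
    return sum(log(exp(-lam) * lam ** k / factorial(k)) for k in counts)

def binomial_loglik(counts, n_trials):
    """Log-likelihood at the binomial estimate p = mean / n_trials."""
    p = mean(counts) / n_trials
    return sum(
        log(comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k))
        for k in counts
    )

# Made-up touchdown counts and an assumed number of trials per game.
counts = [2, 3, 2, 4, 3, 2, 3, 3, 1, 3]
n_trials = 10
index = dispersion_index(counts)
ll_poisson = poisson_loglik(counts)
ll_binomial = binomial_loglik(counts, n_trials)
```

The model with the higher log-likelihood fits these counts better; for this toy sample that is the binomial, in line with its dispersion index being below 1.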
 
You'll probably have to look up what kind of link functions are supported in the GLM procedure for your particular package.

If they don't have direct support, then you will probably have to code a fitting procedure and use something like the Expectation Maximization algorithm (EM) or some other similar fitting algorithm to fit the data to some parametric distribution (which will be Poisson or Binomial).

I would be surprised if R didn't already have the functionality built in and I know from experience that SAS has a lot of built in options as well.
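
For a sense of what such a hand-coded fitting procedure involves, here is a minimal sketch of a Poisson regression with a log link, log(E[y]) = b0 + b1 * x, fitted by Newton-Raphson on the log-likelihood. Note that this is Newton's method, not the EM algorithm mentioned above, and the single covariate (a home-team indicator), the data, and the function name are all assumptions for illustration.

```python
from math import exp, log
from statistics import mean

def fit_poisson_regression(x, y, iterations=25):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = log(mean(y)), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        mu = [exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up touchdown counts; x = 1 marks the home team, x = 0 the away team.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2, 3, 1, 2, 3, 4, 3, 2]
b0, b1 = fit_poisson_regression(x, y)
away_mean, home_mean = exp(b0), exp(b0 + b1)
```

With a single 0/1 covariate the fitted means simply reproduce the two group averages, which makes a convenient sanity check before adding covariates such as the line. In practice a package's built-in GLM routine would normally be preferred over hand-rolled code like this.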
 
Thanks Chiro, sounds like I have a lot of reading and learning to do!

I'll check back into this thread with an update (and maybe another query if ok/needed?)

Thanks very much!

Basil
 
