# Combination probability of variables that are not independent

Hello,

I'm hoping I might be able to get some help in creating a forecasting model (in sports) looking at 2 variables that are not independent of each other.

I'll take US Football (same applies to rugby) as an example. The specific forecast I'm interested in here is the expected supremacy between two teams at the end of a match (Team B points minus Team A points).

There's a fair bit out there others have done looking at how to forecast the most likely supremacy outcome (the 'line' which generally isn't just stats, but involves looking at prices set by the betting world).

However, what I'm most interested in is how to create a forecasted probability of supremacies that are different to the line. As an example, if it is assessed that the E(line) is -4.3, I want to work out the probability that the supremacy would actually be any of:
-10, -9, -8, ..., 0, 1, 2, ..., 9, 10 (etc.)

There is obviously error in the line itself, which needs to be taken into account, so I looked at historic data as a guide. As you might expect, combining the expectation of team A's point haul with that of team B's, based on the line but treating the two as independent of each other, does not return a good enough fit (not negatively skewed enough, and kurtosis too high). Team A's point haul will generally have an effect on team B's (and vice versa); they are not independent of each other.

Are you able to share some ideas on how to adjust for the fact that team A and team B are related when calculating supremacy? I'd really appreciate your help!

Basil

Simon Bridge
Homework Helper
You are saying that team A's haul is conditional on team B's ... so you are looking for something involving conditional probabilities and Bayes' theorem.
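
One way to read this suggestion, sketched in notation that is not from the thread (A and B are the two teams' point totals and D = B - A is the supremacy), is through the law of total probability:

$$P(D = d) = \sum_{a} P(A = a)\, P(B = a + d \mid A = a)$$

If the two teams were independent, the conditional term would reduce to P(B = a + d), which is the calculation that did not fit the historic data. Any dependence between the teams enters through P(B | A).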

Thanks for responding.

Well, it's not entirely conditional, but it is affected by it to varying degrees.

So what I'm struggling with is how to measure this effect between the two, and how to apply the measure to the statistics. Any ideas on that?

Many thanks.
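
A simple first measure of how strongly the two point hauls move together is their sample covariance (or correlation). The sketch below uses made-up scores, not data from the thread, and shows how the covariance feeds directly into the spread of the supremacy through Var(B - A) = Var(A) + Var(B) - 2 Cov(A, B). The names `team_a`, `team_b` and `covariance` are illustrative only.

```python
from statistics import mean, variance

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator) of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical final scores for eight matches (NOT real data).
team_a = [17, 24, 10, 31, 20, 14, 27, 21]
team_b = [20, 13, 27, 10, 23, 24, 17, 16]

cov_ab = covariance(team_a, team_b)
corr_ab = cov_ab / (variance(team_a) * variance(team_b)) ** 0.5

# Supremacy as defined in the thread: Team B points minus Team A points.
supremacy = [b - a for a, b in zip(team_a, team_b)]

var_if_independent = variance(team_a) + variance(team_b)
var_observed = variance(supremacy)

print("correlation:", round(corr_ab, 3))
print("variance assuming independence:", round(var_if_independent, 2))
print("observed variance of supremacy:", round(var_observed, 2))
```

If the observed variance differs noticeably from the independence figure on real data, the gap is exactly twice the covariance, which gives a concrete number to carry into any later model.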

chiro
Hey iambasil and welcome to the forums.

The first thing you will have to do is to come up with a model for your regression.

If you have correlations between observations in time you will need to consider a longitudinal form of analysis.

Given that you are measuring probabilities, you will probably have to use some form of logistic regression involving a generalized linear model.

Typically in simple linear models, your estimator at a given point for your independent variables (i.e. not the predictor) has a t-distribution or a Normal distribution.

But if you have conditional distributions, then the analysis will be a lot more complicated (a lot more).

Thanks very much for your response chiro.

I'll be honest - a lot of what you wrote is a bit beyond my understanding, even after looking up the terms you used.

I've attached some data and analysis I did as an example (zipped as was over 100k).

I modelled touchdowns of home and away teams based on Poisson distributions on the means of all games, and then summed up the probabilities of the supremacies based on any outcome. This is the closest I could get to matching the observed results, but the error is still beyond tolerable levels.

This method treats the home and away team touchdowns as being independent, which they aren't fully.

It would be great if you could suggest how to do the regressions/distributions based on what you see in the data? Forgive me for being unfamiliar and out of touch with methods.
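
For reference, the independent-Poisson calculation described in this post can be sketched as follows. This is a minimal illustration, not the contents of the attached spreadsheet: the two means are hypothetical, and the function names are made up for the example.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def supremacy_distribution(mean_a, mean_b, max_count=30):
    """P(B - A = d), ASSUMING the two counts are independent Poisson.

    Counts above max_count are ignored; for means of a few touchdowns
    per game the truncated probability mass is negligible.
    """
    dist = {}
    for a in range(max_count + 1):
        pa = poisson_pmf(a, mean_a)
        for b in range(max_count + 1):
            d = b - a
            dist[d] = dist.get(d, 0.0) + pa * poisson_pmf(b, mean_b)
    return dict(sorted(dist.items()))

# Hypothetical mean touchdowns per game (NOT taken from the attachment).
MEAN_A, MEAN_B = 2.3, 2.8

dist = supremacy_distribution(MEAN_A, MEAN_B)
for d in range(-3, 4):
    print(f"P(supremacy = {d:+d}) = {dist[d]:.4f}")
```

Comparing this table against the observed frequencies of each supremacy is one way to see where the independence assumption breaks down.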

#### Attachments

• Example Analysis.xls.zip
32.5 KB
chiro
So are you trying to find a model to predict supremacy in terms of away touchdowns and home touchdowns?

If so you should look into regressions involving logistic types with Poisson models for the independent variables (i.e. the away and home touchdowns).

What kind of experience do you have with statistical computational tools?

Take a look at this:

http://www.lisa.stat.vt.edu/sites/default/files/Poisson.and_.Logistic.Regression.pdf

(Scroll down to Poisson regression)
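
To give a feel for what a Poisson regression does, here is a minimal sketch that fits log(E[touchdowns]) = b0 + b1 * home by Newton-Raphson using only the Python standard library. The data and the single home/away indicator are assumptions made for illustration; in practice a package such as R's `glm` would do this fit, and this is not the method from the linked document.

```python
import math

def fit_poisson_regression(x, y, iterations=25):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson (log link)."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # 2x2 Fisher information matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g, with the 2x2 inverse written out.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical touchdown counts (NOT real data).
away_tds = [2, 1, 3, 2, 0, 2, 3, 1]
home_tds = [3, 2, 4, 3, 1, 3, 2, 4]

x = [0] * len(away_tds) + [1] * len(home_tds)  # 1 = home team
y = away_tds + home_tds

b0, b1 = fit_poisson_regression(x, y)
print("fitted mean, away:", math.exp(b0))
print("fitted mean, home:", math.exp(b0 + b1))
```

With a single yes/no predictor the fitted means simply reproduce the two group averages; the value of the regression form is that further predictors (for example the line, or the opponent's strength) can be added in the same way.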

Thanks Chiro,

My thought was that if I could get the touchdowns right, I could then do it for field goals and any other scoring types similarly. By combining each of these and multiplying by points value, I could effectively evaluate the supremacy in total points.
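
A rough sketch of that combination step is below. It makes several simplifying assumptions that are not from the thread: a touchdown is worth 7 points (touchdown plus a converted extra point), a field goal is worth 3, every count is Poisson with a made-up mean, and all counts are independent of each other, which is the very assumption under question here.

```python
import math
from collections import defaultdict

TOUCHDOWN_POINTS = 7   # assumption: touchdown plus a successful extra point
FIELD_GOAL_POINTS = 3

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean**k / math.factorial(k)

def count_difference(mean_a, mean_b, max_count=25):
    """Distribution of (B count - A count) for independent Poisson counts."""
    dist = defaultdict(float)
    for a in range(max_count + 1):
        for b in range(max_count + 1):
            dist[b - a] += poisson_pmf(a, mean_a) * poisson_pmf(b, mean_b)
    return dist

def scale(dist, points):
    """Convert a difference in counts into a difference in points."""
    return {d * points: p for d, p in dist.items()}

def convolve(dist_x, dist_y):
    """Distribution of the sum of two independent quantities."""
    out = defaultdict(float)
    for x, px in dist_x.items():
        for y, py in dist_y.items():
            out[x + y] += px * py
    return dict(sorted(out.items()))

# Hypothetical mean counts per game: (team A, team B).
td_diff = scale(count_difference(2.3, 2.8), TOUCHDOWN_POINTS)
fg_diff = scale(count_difference(1.6, 1.4), FIELD_GOAL_POINTS)

points_supremacy = convolve(td_diff, fg_diff)
print("P(supremacy = +3):", round(points_supremacy.get(3, 0.0), 4))
```

Further scoring types could be folded in with additional `convolve` calls, each scaled by its own points value.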

A friend can help with MATLAB, I can download R too. I used SAS, Minitab, Maple 13 years ago and don't remember much! I'm nifty on Excel and know VBA.

Thanks for the link. I'll spend some time to understand this. One thing to highlight, though: although I used Poisson in the example (because it gave the best results), the actual best fit (and it's very good!) for total/team total touchdowns (as opposed to supremacy) in this case was binomial.
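
One way to compare the two candidate fits on the same footing is by log-likelihood, sketched below. The counts are made up, and the binomial needs a fixed number of trials; the value of 12 (think of it as scoring opportunities per game) is purely a hypothetical choice, as is estimating p from the sample mean.

```python
import math

def poisson_loglik(counts, mean):
    """Log-likelihood of the counts under Poisson(mean)."""
    return sum(-mean + k * math.log(mean) - math.lgamma(k + 1) for k in counts)

def binomial_loglik(counts, trials, p):
    """Log-likelihood of the counts under Binomial(trials, p)."""
    return sum(
        math.log(math.comb(trials, k))
        + k * math.log(p)
        + (trials - k) * math.log(1 - p)
        for k in counts
    )

# Hypothetical touchdowns per game for one team (NOT real data).
counts = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2, 2, 3]
TRIALS = 12  # hypothetical fixed number of scoring opportunities

mean_count = sum(counts) / len(counts)
ll_poisson = poisson_loglik(counts, mean_count)
ll_binomial = binomial_loglik(counts, TRIALS, mean_count / TRIALS)

print("Poisson log-likelihood: ", round(ll_poisson, 3))
print("Binomial log-likelihood:", round(ll_binomial, 3))
```

The model with the higher log-likelihood fits the sample better. A binomial has variance below its mean, so it tends to win when the counts are less spread out than a Poisson would predict.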

chiro
You'll probably have to look up what kind of link functions are supported in the GLM procedure for your particular package.

If they don't have direct support, then you will probably have to code a fitting procedure and use something like the Expectation Maximization algorithm (EM) or some other similar fitting algorithm to fit the data to some parametric distribution (which will be Poisson or Binomial).

I would be surprised if R didn't already have the functionality built in and I know from experience that SAS has a lot of built in options as well.

Thanks Chiro, sounds like I have a lot of reading and learning to do!

I'll check back in to this thread with an update (and maybe another query if ok/needed?)

Thanks very much!

Basil