# Modelling distribution and deriving probability of occurrence being within a value

## Main Question or Discussion Point

Hiya,

I have been looking at a problem, and just can’t get my head round the right way to tackle it.

I have a whole heap of data for an airline showing the taxi duration of each of its flights (this is the time taken to go from the gate to actual lift-off). As I see it, there are 3 variables that can affect this time:
1. Flight number (this takes account of both departure station and time of day)
2. Aircraft type
3. Month of year

Now, having had a look at the data, it does not follow a simple normal or chi-squared distribution; the shape actually varies from airport to airport. Some airports have two runways, one miles from the terminal and the other right next to it, hence you get two clusters.

What I need to have as an output (monthly) is:
Flight number: (set by user)
Significance: (set by user) – probability that a taxi duration will fall within the output taxi duration (below)
Taxi duration: (derived)
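One way to read this output spec, without assuming any particular distribution, is as an empirical quantile: for the chosen flight number, take the smallest observed duration that at least the requested fraction of past taxi times stayed within. The snippet below is only a minimal sketch of that reading; the function name `taxi_time_for_probability` and the sample durations are made up for illustration.

```python
import math

def taxi_time_for_probability(durations, probability):
    """Smallest observed duration d such that at least `probability`
    of the observed durations are <= d (an empirical quantile)."""
    if not durations:
        raise ValueError("need at least one observation")
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    ordered = sorted(durations)
    # k-th smallest observation, where k = ceil(probability * n).
    k = math.ceil(probability * len(ordered))
    return ordered[k - 1]

# Hypothetical taxi durations (minutes) for one flight number. The two
# clusters mimic an airport with one near and one far runway.
sample = [8, 9, 9, 10, 11, 21, 22, 23, 24, 25]

for p in (0.5, 0.85, 1.0):
    print(p, taxi_time_for_probability(sample, p))
```

Because this works directly on the sorted observations, it copes with the two-cluster case as well as with a single-peaked one, at the cost of needing enough data per flight number.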

What I have is years' worth of data. The data obviously changes from year to year, so whilst I want the month-of-year variable considered, it must also be in relation to how the previous month compared with the same month a year earlier (or some other way, e.g. deviation of the month from the annual mean?).
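The "deviation of month from annual mean" idea can be sketched as a seasonal index: each month's mean taxi time divided by the mean for that whole year, averaged over the years available. This is just one possible interpretation, not an established method for this data; `seasonal_indices`, the records and `recent_level` are all hypothetical, and the year-to-year jump in the sample is exaggerated on purpose.

```python
from collections import defaultdict
from statistics import mean

def seasonal_indices(records):
    """records: iterable of (year, month, duration).
    Returns {month: average over years of (month mean / that year's mean)}."""
    by_year = defaultdict(list)
    by_year_month = defaultdict(list)
    for year, month, duration in records:
        by_year[year].append(duration)
        by_year_month[(year, month)].append(duration)
    ratios = defaultdict(list)
    for (year, month), values in by_year_month.items():
        # Ratio > 1 means the month runs slower than that year's average.
        ratios[month].append(mean(values) / mean(by_year[year]))
    return {month: mean(r) for month, r in ratios.items()}

# Hypothetical (year, month, taxi minutes) records for one flight number.
records = [
    (2006, 1, 12.0), (2006, 1, 14.0), (2006, 7, 8.0), (2006, 7, 10.0),
    (2007, 1, 24.0), (2007, 1, 28.0), (2007, 7, 16.0), (2007, 7, 20.0),
]
indices = seasonal_indices(records)

recent_level = 20.0  # assumed current overall mean taxi time, in minutes
for month in sorted(indices):
    # Seasonal index, then the index applied to the recent level.
    print(month, round(indices[month], 3), round(indices[month] * recent_level, 1))
```

Dividing by the annual mean separates the month-of-year effect from the overall level, so a year in which everything got slower does not distort the monthly pattern.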

From a statistical perspective, how would I assess what type of distribution the data falls into, and how would I handle the sample to be used (sample size/the different variables)?

The purpose of the analysis is improved aircraft fuel management; currently a simple average taxi time of the previous month (by flight number) is used.

Thanks,

Basil

EnumaElish
What you want to do is to run a regression with taxiing duration on the left-hand side and all other variables that may affect that duration on the right-hand side. Since you don't really care about a "theory" that explains that relationship (am I correct that what you are really looking for is an empirical relationship?), I would start with "the kitchen sink" on the right-hand side. That is, put in every conceivable variable on the right-hand side and maximize the R-squared and/or the total F statistic for overall significance; i.e., maximize "equation significance" as opposed to "individual variable significance" (measured by individual t-statistic values). And you want to start with the Ordinary Least Squares regression technique; you can always refine it (graduate to more complex techniques if needed) further down the road.
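In equation form, a regression of this kind reads duration = b0 + b1·x1 + b2·x2 + ... + error, with the taxi duration on the left and the explanatory variables on the right. Since flight number, aircraft type and month are categories rather than numbers, a common approach is to turn each into 0/1 "dummy" columns, leaving one level out as the baseline. Below is a rough, dependency-free sketch of Ordinary Least Squares along those lines. The flight numbers, aircraft types and durations are invented, and in practice a statistics package would normally do this work.

```python
def dummy_columns(values):
    """0/1 columns for every category except the first (sorted) one,
    which serves as the baseline absorbed by the intercept."""
    levels = sorted(set(values))[1:]
    return levels, [[1.0 if v == lev else 0.0 for lev in levels] for v in values]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular (collinear columns)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(X, y, beta):
    """Share of the variation in y explained by the fitted equation."""
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical rows: (flight number, aircraft type, month, taxi minutes).
rows = [
    ("F100", "A320", "Jan", 9.5), ("F100", "A320", "Jul", 13.5),
    ("F100", "B737", "Jan", 12.5), ("F100", "B737", "Jul", 14.5),
    ("F200", "A320", "Jan", 15.5), ("F200", "A320", "Jul", 17.5),
    ("F200", "B737", "Jan", 16.5), ("F200", "B737", "Jul", 20.5),
]

y = [r[3] for r in rows]          # left-hand side: taxi duration
names = ["intercept"]
X = [[1.0] for _ in rows]         # right-hand side starts with a constant
for position, label in enumerate(("flight", "aircraft", "month")):
    levels, cols = dummy_columns([r[position] for r in rows])
    names += [f"{label}={lev}" for lev in levels]
    for x_row, extra in zip(X, cols):
        x_row.extend(extra)

beta = ols(X, y)
for name, b in zip(names, beta):
    print(f"{name:>15}: {b:6.2f}")
print("R-squared:", round(r_squared(X, y, beta), 3))
```

Each coefficient is then the estimated number of extra minutes relative to the baseline flight, aircraft type and month, and the R-squared is the overall fit measure mentioned above.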

Thank you so much for the response.

Sorry for being naive on this, but I don't quite understand what you mean by 'on the right' - are there any examples on the web of something similar being done?

Basil

EnumaElish