Modelling distribution and deriving probability of occurrence being within a value


I have been looking at a problem, and just can’t get my head round the right way to tackle it.

I have a whole heap of data for an airline showing the taxy duration of each of their flights (this is the time taken to go from the gate to actual lift off). As I see it, there are 3 variables that can affect this time:
1. Flight number (this takes account of both departure station and time of day)
2. Aircraft type
3. Month of year

Now, having had a look at the data, I can see it does not follow a simple normal or chi-squared distribution; it actually varies. Some airports have two runways, one miles from the terminal and the other right next to it, hence you get two clusters.

What I need to have as an output (monthly) is:
Flight number: (set by user)
Significance: (set by user), i.e. the probability that a taxy duration will fall within the output taxy duration (below)
Taxy duration: (derived)
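
To make the requested output concrete, here is a minimal sketch in Python, assuming the simplest distribution-free reading of it: the taxy duration reported for a flight is the empirical quantile of its past durations at the chosen significance. Nothing here comes from the airline's data; the function name `taxi_time_for_confidence`, the flight numbers and all durations are invented for illustration. An empirical quantile is only one possible approach, but it does not care whether the data form one cluster or two.

```python
import math

def taxi_time_for_confidence(durations, significance):
    """Smallest observed duration d such that at least a `significance`
    share of the observed durations are <= d (an empirical quantile)."""
    if not durations:
        raise ValueError("no observations for this flight/month")
    if not 0 < significance <= 1:
        raise ValueError("significance must be in (0, 1]")
    ordered = sorted(durations)
    # Small tolerance guards against float round-off such as 0.7 * 10 > 7.
    rank = max(1, math.ceil(significance * len(ordered) - 1e-9))
    return ordered[rank - 1]

# Hypothetical records: (flight number, month, taxy duration in minutes).
# The XY123 values form two clusters, as with a near and a far runway.
records = [
    ("XY123", 3, 9.0), ("XY123", 3, 10.0), ("XY123", 3, 10.0),
    ("XY123", 3, 11.0), ("XY123", 3, 12.0), ("XY123", 3, 22.0),
    ("XY123", 3, 23.0), ("XY123", 3, 24.0), ("XY123", 3, 25.0),
    ("XY123", 3, 26.0), ("ZZ999", 3, 15.0), ("ZZ999", 3, 16.0),
]

flight, month = "XY123", 3  # set by user
durations = [mins for f, mo, mins in records if f == flight and mo == month]
result = taxi_time_for_confidence(durations, 0.9)
print(f"{flight}, month {month}: taxy duration at 90% = {result} min")
```

Note that with two clusters the mean of these durations falls between the clusters, where few flights actually land, which is one reason a quantile may suit fuel planning better than a simple average.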

What I have is years' worth of data. The data obviously changes year to year, so whilst I want the month of year variable considered, it must also be in relation to how the previous month compared to the same month a year ago (or some other way, e.g. deviation of the month from the annual mean?).

From a statistical perspective, how would I assess what type of distribution the data falls into, and how would I handle the sample to be used (sample size/the different variables)?

Would really appreciate your help.

The purpose of analysis is for improved aircraft fuel management – currently a simple average taxy time of the previous month (by flight number) is used.




Science Advisor
Homework Helper
What you want to do is to run a regression with taxiing duration on the left hand side and all other variables that may affect that duration on the right hand side. Since you don't really care about a "theory" that explains that relationship (am I correct that what you are really looking for is an empirical relationship?) I would start with "the kitchen sink" on the right hand side. That is, put in every conceivable variable on the right hand side and maximize the R-squared, and/or the total F statistic for overall significance; i.e., maximize "equation significance" as opposed to "individual variable significance" (measured by individual t-statistic values). And, you want to start with the Ordinary Least Squares regression technique -- you can always refine it (graduate to more complex techniques if needed) further down the road.
Thank you so much for the response.

Sorry for being naive on this, but I don't quite understand what you mean by 'on the right' - are there any examples on the web of something similar being done?



Science Advisor
Homework Helper
Check this out. In a simple (single x) regression equation y = a + b x + u, y is the dependent or explained variable "on the left," and x is the independent or explanatory variable "on the right."

"a" and "b" are model coefficients (parameters) to be estimated; "u" is the error term.

A multiple regression equation has multiple variables on the right: y = a + b1x1 + ... + bkxk + u.

For example, x1 may be departure station, x2 may be time of day, x3 may be type of equipment (somehow numericized, e.g. large = 1, small = 0), x4 might be coded jet = 1, prop = 0, and so on, x5 may be month of the year, and x6 might be "the number of pizza slices served in the airport pizzeria during the flight day." :smile:
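
As a rough illustration of the multiple regression above, here is a minimal sketch in Python that fits y = a + b1x1 + ... + bkxk by ordinary least squares, using 0/1 dummy variables of the kind just described. The helper names (`solve_linear_system`, `ols`, `r_squared`), the three dummies (large, jet, winter) and every taxi time are assumptions made up for the example. In practice you would normally use a statistics package rather than hand-rolled linear algebra, and a variable with many categories (such as departure station) needs one dummy per category, minus one.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear regressors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(rows, y):
    """Return [a, b1, ..., bk] minimising the sum of squared errors."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def r_squared(rows, y, coef):
    """Share of the variation in y explained by the fitted equation."""
    fitted = [coef[0] + sum(b * x for b, x in zip(coef[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Invented data. Columns: large (1/0), jet (1/0), winter month (1/0).
rows = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1),
        (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
taxi_minutes = [10.5, 14.0, 12.2, 16.1, 13.1, 17.3, 14.8, 19.0]

coef = ols(rows, taxi_minutes)
r2 = r_squared(rows, taxi_minutes, coef)
print("a, b1, b2, b3 =", [round(c, 3) for c in coef])
print("R-squared =", round(r2, 4))
```

Each estimated b is the change in predicted taxi time, in minutes, when that dummy switches from 0 to 1 with the others held fixed; the R-squared is the "equation significance" style measure mentioned earlier in the thread.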
