Modelling Taxy Duration Distribution & Deriving Probability of Occurrence

Discussion Overview

The discussion revolves around modeling the distribution of taxi duration for airline flights, focusing on how various factors such as flight number, aircraft type, and month of the year influence this duration. Participants explore statistical methods for analyzing the data, including regression techniques and the challenges of fitting a distribution to the observed data.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • Basil presents a dataset of taxi durations and identifies three key variables that may affect these durations: flight number, aircraft type, and month of the year.
  • Basil notes that the data does not conform to a simple normal or chi-squared distribution and mentions the presence of clusters due to varying airport layouts.
  • One participant suggests running a regression analysis with taxi duration as the dependent variable and all other relevant factors as independent variables, emphasizing an empirical approach to establish relationships.
  • The same participant recommends starting with Ordinary Least Squares regression to maximize overall significance rather than focusing on individual variable significance.
  • Basil expresses confusion regarding the regression terminology and seeks clarification on the structure of regression equations.
  • Another participant provides an explanation of regression equations, illustrating the distinction between dependent and independent variables and offering a humorous example involving unrelated variables.

Areas of Agreement / Disagreement

The discussion includes varying levels of understanding regarding regression analysis, with Basil seeking clarification on the proposed methods. There is no consensus on the best approach to modeling the taxi duration distribution, and multiple viewpoints on the analysis techniques remain present.

Contextual Notes

Participants have not fully explored the implications of the data's non-standard distribution, and there are unresolved questions about the appropriate sample size and handling of variables in the regression analysis.

iambasil
Hiya,

I have been looking at a problem, and just can’t get my head round the right way to tackle it.

I have a whole heap of data for an airline showing the taxy duration of each of their flights (this is the time taken to go from the gate to actual lift off). As I see it, there are 3 variables that can affect this time:
1. Flight number (this takes account of both departure station and time of day)
2. Aircraft type
3. Month of year

Now, having had a look at the data, it does not follow a simple normal or chi-squared distribution – it actually varies. Some airports have two runways – one miles from the terminal and the other right next to it – hence you get two clusters.

What I need to have as an output (monthly) is:
Flight number: (set by user)
Significance: (set by user) – probability that a taxi duration will fall within the output taxy time duration (below)
Taxy duration: (derived)
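
One way to read this output spec, before any distribution is fitted, is as an empirical quantile: take the past taxy times for the chosen flight number and month, and report the smallest duration that covers the requested share of them. The sketch below assumes that reading. The function name `taxi_time_for_coverage` and the sample figures are made up for illustration, not taken from real data.

```python
import math

def taxi_time_for_coverage(durations, coverage):
    """Return the smallest observed duration d such that at least
    `coverage` (a fraction in (0, 1]) of the sample is <= d."""
    if not durations:
        raise ValueError("need at least one observation")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(durations)
    # Position of the order statistic that covers the requested share.
    k = math.ceil(coverage * len(ordered)) - 1
    return ordered[k]

# Hypothetical taxy times (minutes) for one flight number in one month.
# The two clusters mimic a near runway and a far runway.
sample = [8, 9, 9, 10, 10, 11, 21, 22, 23, 24]

print(taxi_time_for_coverage(sample, 0.5))
print(taxi_time_for_coverage(sample, 1.0))
```

Because it works directly on the sorted sample, this makes no assumption about the shape of the distribution, so the two-runway clusters do not break it. It does need enough observations per flight number and month to be stable.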

What I have is years' worth of data. The data obviously changes from year to year, so whilst I want the month of year variable considered, it must also be in relation to how the previous month compared to the same month a year earlier (or some other way, e.g. deviation of month from annual mean?).

From a statistical perspective, how would I assess what type of distribution the data falls into, and how would I handle the sample to be used (sample size/the different variables)?

Would really appreciate your help.

The purpose of analysis is for improved aircraft fuel management – currently a simple average taxy time of the previous month (by flight number) is used.

Thanks,

Basil
 
What you want to do is to run a regression with taxiing duration on the left-hand side and all other variables that may affect that duration on the right-hand side. Since you don't really care about a "theory" that explains that relationship (am I correct that what you are really looking for is an empirical relationship?) I would start with "the kitchen sink" on the right-hand side. That is, put in every conceivable variable on the right-hand side and maximize the R-squared, and/or the total F statistic for overall significance; i.e., maximize "equation significance" as opposed to "individual variable significance" (measured by individual t-statistic values). And, you want to start with the Ordinary Least Squares regression technique -- you can always refine it (graduate to more complex techniques if needed) further down the road.
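
As a rough illustration of the above, here is a minimal Ordinary Least Squares sketch using NumPy that reports the R-squared and the overall F statistic. Everything about the data is an assumption made for the example: the three 0/1 regressors (`far_runway`, `widebody`, `winter`), the "true" coefficients, and the noise are synthetic, not derived from any airline data.

```python
import numpy as np

def ols_fit(X, y):
    """Fit y = a + X b + u by least squares.
    Returns (coefficients with intercept first, R-squared, overall F)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])        # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)               # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum()) # total variation
    r2 = 1.0 - ss_res / ss_tot
    # Overall F: explained variance per regressor over residual variance.
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    return coef, r2, f_stat

# Synthetic example: columns are far_runway, widebody, winter (all 0/1).
rng = np.random.default_rng(0)
n = 200
X = rng.integers(0, 2, size=(n, 3))
y = 10 + 12 * X[:, 0] + 3 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 1, n)

coef, r2, f_stat = ols_fit(X, y)
print("coefficients:", coef)
print("R-squared:", r2)
print("overall F:", f_stat)
```

Note that R-squared never falls when another regressor is added, so comparing "kitchen sink" specifications on R-squared alone will always favour the bigger model. The F statistic, which accounts for the number of regressors, is the more informative of the two for that comparison.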
 
Thank you so much for the response.

Sorry for being naive on this, but I don't quite understand what you mean by 'on the right' - are there any examples on the web of something similar being done?

Basil
 
Check this out. In a simple (single x) regression equation y = a + b x + u, y is the dependent or explained variable "on the left," and x is the independent or explanatory variable "on the right."

"a" and "b" are model coefficients (parameters) to be estimated; "u" is the error term.

A multiple regression equation has multiple variables on the right: y = a + b1x1 + ... + bkxk + u.

For example, x1 may be departure station, x2 may be time of day, x3 may be type of equipment (somehow coded numerically, e.g. large = 1, small = 0, and then x4 might be coded jet = 1, prop = 0, etc.), x5 may be month of the year, and x6 might be "the number of pizza slices served in the airport pizzeria during the flight day." :smile:
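
The 0/1 coding mentioned above generalizes to variables with more than two categories (aircraft type, departure station, month) by giving each category except one baseline its own 0/1 column. The sketch below is one plain-Python way to do that. The helper name `dummy_code` and the aircraft labels are illustrative assumptions.

```python
def dummy_code(values, baseline=None):
    """Turn a list of category labels into 0/1 indicator columns.
    One level (the baseline) is left out so that the columns are not
    perfectly collinear with the regression intercept."""
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    kept = [lv for lv in levels if lv != baseline]
    rows = [[1 if v == lv else 0 for lv in kept] for v in values]
    return kept, rows

# Hypothetical aircraft types for four flights.
aircraft = ["A320", "B777", "A320", "E190"]
columns, rows = dummy_code(aircraft)

print(columns)
for row in rows:
    print(row)
```

Coding month this way, rather than as a single number from 1 to 12, avoids forcing the effect of month to be a straight line, at the cost of eleven coefficients instead of one.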
 