Simple statistical analysis of a limited data set

DaveC426913 · Mar 23, 2015

I should have studies statistics in school.

I have been collecting some trip data from you commute to and from work, and I want to determine what factors are influencing trip duration. There are multiple factors, which I cannot easily isolate.

The factors I'm most interested in are:
- duration of trip as a function of start time
- duration of trip as a function of day-of-week
- duration of trip as a function of route taken

What I want to do is to understand how best to analyze the data. For instance: if I want to examine a single factor, can I meaningfully 'normalize' the other factors?

Here is a snippet of the data I've collected.

Date AM/PM Start Finish Temp Route Notes (Duration)

Feb 11 Th A 0814 0857 -9 QEW Cold Dry 43m
Feb 11 Th P 1823 1859 -9 QEW Cold Dry 36m
Feb 12 Fr A 0813 0859 -18 QEW Freezing 46m
Feb 12 Fr P 1748 1820 -8 QEW Cold 32m
Feb 19 Th A 0811 0847 -16 Lakeshore Dry 36m
Feb 24 Tu A 0804 0849 -12 QEW Dry 45m
Mar 10 Tu A 0821 0904 +1 Lakeshore Wet 43mI've only got about 2 dozen entries, so it may be problematic to chop up the data into small sections and analyze each factor in isolation. Is there a way of averaging the data to make better use of it?

For example, when I'm analyzing duration as a function of day-of-week, the AM/PM parameter is a confounding factor. (Notice that PM times are significantly shorter than AM times, which could throw off my results.) But even if the absolute value of PM times is off, surely the trend is still there. Do I have to throw away all the PM data when analyzing AM data for day-of-week trends? Or can I somehow normalize the PM data to help me see overall trends over AM and PM?

This will be of particular importance in that I have very little data (< 6 entries) for alternate routes (Lakeshore).

wabbit · Mar 23, 2015

Haven't studied statistics much either but just a couple comments / suggestions

- you have at least 20 different situations (5 days x am/pm) and 24 data points so this may well prove inconclusive. Maybe a first step would be to assume all days are the same (unless you include weekends, then you need at least two categories, weekday/weekend).

Assuming only weekdays, a priori I would expect am and pm to belong to two different distributions and estimate them separately, and same for different routes so 6 different cases to estimate separately.

- what is the best route in the morning (compare average and stdev of driving times separately for each route, ignoring exact hour)

- for a given route in the morning, what is the best starting time (plot driving time as a function of starting time)

- may not work if best starting time is different for each route, but perhaps try to mix normalized times i.e.
Route 1 -> avg1, std1 of am route1 driving times -> t*=(t-avg1)/std1
Then plot the t*s from all routes against starting times

DaveC426913 · Mar 23, 2015

wabbit said:

assume all days are the same (unless you include weekends, then you need at least two categories, weekday/weekend).

Yep, weekdays only.

wabbit said:

Assuming only weekdays, a priori I would expect am and pm to belong to two different distributions and estimate them separately, and same for different routes so 6 different cases to estimate separately.

Yeah, agree. AM versus PM will have virtually no useful relation. Trying to force one would skew the data.

wabbit said:

- what is the best route in the morning (compare average and stdev of driving times separately for each route, ignoring exact hour)

Unfortunately, I have very little data on the alternate routes. My choice of route is not randomly selected; it is strongly biased toward QEW.

wabbit said:

- may not work if best starting time is different for each route, but perhaps try to mix normalized times i.e.
Route 1 -> avg1, std1 of am route1 driving times -> t*=(t-avg1)/std1
Then plot the t*s from all routes against starting times

This is what I'm wondering. Unfortunately, your shorthand formulae elude me.

wabbit · Mar 24, 2015

Ah sorry, ##t^*=\frac{t-avg_1^{AM}}{std_1^{AM}}## is the transformation applied to morning driving times t on route 1, t* the normalized time, ##avg_1^{AM}## the average driving time on route1 in the morning, ##std_1^{AM}## the standard deviation of same.
One transformation per sub-distribution, then mix normalized times.

But you say some routes have very few data points - in that case the above fails (or gives an unreliable result) but I think it just comes from trying to extract information that isn't there. At the extreme, if taking a new route one morning (say one that takes a detour or stops for an errand), the corresponding data point is worthless for estimating usual relations and should just be thrown out.*

* unless you have a model of the difference, e.g. if that single-occurence route differs by a 5-minute stop, then you can just deduce these 5 minutes and include the point.
You could try this strategy more generally too, though for differences of driving times between routes I can't think of a simple relevant model beyond the one implicit in the transformation above, which is $$t=avg(route,1_{AM})+std(route,1_{AM})*t^*$$

DaveC426913 · Mar 24, 2015

wabbit said:

Ah sorry, ##t^*=\frac{t-avg_1^{AM}}{std_1^{AM}}## is the transformation applied to morning driving times t on route 1, t* the normalized time, ##avg_1^{AM}## the average driving time on route1 in the morning, ##std_1^{AM}## the standard deviation of same.
One transformation per sub-distribution, then mix normalized times.

Wow. Er. I'll try to absorb that.

wabbit said:

But you say some routes have very few data points - in that case the above fails (or gives an unreliable result) but I think it just comes from trying to extract information that isn't there. At the extreme, if taking a new route one morning (say one that takes a detour or stops for an errand), the corresponding data point is worthless for estimating usual relations and should just be thrown out.*

Yah. I don't record data for days when I run errands. But for days when the delays are reasonably allotted to traffic, such as detours or accident, they count, since they are a realistic factor in choosing a route.

wabbit said:

* unless you have a model of the difference, e.g. if that single-occurence route differs by a 5-minute stop, then you can just deduce these 5 minutes and include the point.

Yes. Only once did I subtract out a known-duration traffic delay.

wabbit said:

You could try this strategy more generally too, though for differences of driving times between routes I can't think of a simple relevant model beyond the one implicit in the transformation above, which is $$t=avg(route,1_{AM})+std(route,1_{AM})*t^*$$

As above: Wow. Er.

DaveC426913 · Mar 26, 2015

Well, here's my data to-date. And I've made some primitive graphs from it.
High priority:
Break out into separate AM and PM groups.
Generate average data and plot.
Lower priority:
Edit X-axis labels to pull from lookup table in upper right corner. (Or figure out how to group data)

wabbit · Mar 26, 2015

Interesting. I would focus only on the AM data to start, PM is different and has few points.

There seems to be some variation by day of week. If this persits for AM, you might try computing average time separately for each day, and then look at the reduced times obtained by deducting from each drive the average time of the corresponding day,

DaveC426913 · Mar 26, 2015

I think there's variation by start time - that's really the primary factor of interest - especially since it's directly under my control.
It's hard to see because the chart graphs AM and PM together across 24 hours, so the variance is too small to see.

I'd like to figure out how to filter the data used in the charts. I read a lot about filtering the data in the table, but that's problematic if I want to show two filters on the same data.

What I'd like is to apply some sort of conditional on the datasource equation for the chart (eg. Y-axis: $K2-$K32 WHERE $D='A') while leaving the raw data alone. Then I can make another chart WHERE $D='P'.

wabbit · Mar 26, 2015

In Excel at least and I think in other spreadshhet software too you can select the data with column headings and apply a filter here, it hides dome data (also in the charts), but it doesn't lose it, and you can edit the filter to look at different parts of the data, is this what you're looking for? There are also filtered functions such as SUMIF(), etc. which are useful for this sort of thing.

DaveC426913 · Mar 26, 2015

I can apply that to the data, ye,s but that's interactive. What I was hoping to do was to capture multiple, static charts, all from the same data. I can't post an Excel spreadsheet online.

wabbit · Mar 26, 2015

Hmm I see. You can create several dynamically filtered tables and charts, all linked to the master table, using pivot tables. Actually that's something I use a lot when doing this sort of data analysis.

DaveC426913 · Mar 26, 2015

OK, I've filtered out PM entries, alternate route entries and 'special circumstance' entries. Less data but more informative.

One thing I'd like to do is make a 'lookup' table to correlate ordinal day-of-week (1,2,3,4,5) with name.

Some interesting trending happening by weekday...

wabbit · Mar 26, 2015

The duration by starting time looks like it has a decent fit to
duration = d0 + max( 0, k·(t-t0) ) for t0 ~ 0730

DaveC426913 · Mar 26, 2015

So, if I leave the house at 7:30AM, I should be able to make it to work in zero time?

wabbit · Mar 26, 2015

Just have to deal with that pesky d0 and you could :)

DaveC426913 · Mar 26, 2015

How do I formulaically plot average?

Basically, $K1-$Kn WHERE $B=(1,2,3,4,5)

So my DoW chart would have a single average point for each day amongst the other data.

I guess I start sprinkling sub total data throughout my page?

wabbit · Mar 26, 2015

with multiple conditions I think you might need to add an auxiliary column where you put the criterion, say $Gn formula : AND($Bn>=1,Bn<=5)

Then use. SUMIF($K1:$Kn, $G1:$Gn)/COUNTIF($K1:$Kn, $G1:$Gn)

Oh not what you meant - you want 5 sub averages.

Not sure about it, except by adding 5 auxiliary colums, one per day. There must be a better way but can't think of it now.

A pivot table would do that in one shot though.

DaveC426913 · Mar 26, 2015

I've created 5 cells in a corner of my spreadsheet with array formulae:

{=AVERAGE(IF(B2:B32=<1,2,3,4,5>,K2:K32))}

Hm. These do not render on the graph properly.

Awww. The formulae ignore the filters! My averages are calc'ed from ALL rows, regardless of whether or not they are filtered out.

*sigh* got to explicitly apply filters to the 5 formulae:
{=AVERAGE(IF(B2:B32=<1,2,3,4,5>,IF(D2:D32="A",IF(H2: H32=1,IF(J2:J32="",K2:K32)))))}

DaveC426913 · Mar 27, 2015

I find the DoW averages intriguing. I wonder whether it's Wednesday that's the anomaly, or if it's Tuesday and Thursday that are the anomalies.
And I wonder why. A lot of people take Wednesday off?

As for the Departure Time, how can I calculate the trend? What's it called? A 'best fit' curve?
Then I have to figure out if Excel can render it.

DaveC426913 · Mar 27, 2015

Found the trendline feature. Cool!

OK, so here's something I'm wondering. In real life, the commute will not decrease linearly with an earlier departure time, so the best fit should asymptotically approach some value of Y - likely somewhere between 0:35 and 0:30. But the best fit that Excel renders only has two points and a straight line between them. I don't see how I could make that line more representative.

DaveC426913 · Mar 27, 2015

Ah. A breakdown showing Day-of-Week AND Departure time reveals a very different story.
It shows me that I don't have enough data.

I notice that
- Wednesday's durations are biased toward earlier departure times,
- Tuesday's & Thursday's are biased toward later times.
- 2 data points for Monday do not constitute a trend.

wabbit · Mar 27, 2015

Indeed it looks like they might be drawn from the same distribution. So best would be to go back to just ignoring DOW distinctions and look only at other variables.

wabbit · Mar 27, 2015

DaveC426913 said:

Found the trendline feature. Cool!

OK, so here's something I'm wondering. In real life, the commute will not decrease linearly with an earlier departure time, so the best fit should asymptotically approach some value of Y - likely somewhere between 0:35 and 0:30. But the best fit that Excel renders only has two points and a straight line between them. I don't see how I could make that line more representative.

View attachment 81092

Looking at the chart, I don't think you have enough data to discriminate between different models. You can however fit any model that seems reasonably motivated, instead of a trendline (that's the kind of thing I was suggesting above with the d0 model, since where I live, if you leave before the crowd you get no traffic and a constant driving time, then after some tipping point the trip duration increases.)
The levelling off of duration at late time is an expected feature but I do not see it in your chart (it looks like you never leave late enough to encounter it).
In any case, one way to do this is to write your model with some parameters, calculte its standard prediction error, and then use the solver to adjust the parameters so as to minimize this error.
What you describe could be formulated as

duration=d0+k*min(t-t0, tmax - t0)+error

In this formulation t0 is here for convenience only, and can be set as the earliest time or an arbitrary time like 730; d0 and k are then your parameters, and you minimize

Sum(error^2)

DaveC426913 · Mar 27, 2015

wabbit said:

Looking at the chart, I don't think you have enough data to discriminate between different models.

Yeah. That's what I was getting at. When examined by day only, it looks like Wednesday is a fast day. In fact, what the data is really telling me is that I tend to leave earlier on Wednesday than on Tuesdays or Thursdays.

BTW, I had to choose the data sets manually. I don't know how to tell it that, for the Monday Series, only choose where B=1, etc.

wabbit said:

You can however fit any model that seems reasonably motivated, instead of a trendline (that's the kind of thing I was suggesting above with the d0 model, since where I live, if you leave before the crowd you get no traffic and a constant driving time, then after some tipping point the trip duration increases.)

Around here, there is no such thing as no traffic. Rush hour starts at 6AM and goes until - well - 9PM. There's no midday lull. But yeah, it should be worse between 7:30 and ... I don't know ... 11AM?

wabbit said:

The levelling off of duration at late time is an expected feature but I do not see it in your chart (it looks like you never leave late enough to encounter it).
In any case, one way to do this is to write your model with some parameters, calculte its standard prediction error, and then use the solver to adjust the parameters so as to minimize this error.
What you describe could be formulated as

In this formulation t0 is here for convenience only, and can be set as the earliest time or an arbitrary time like 730; d0 and k are then your parameters, and you minimize

I am not sure what I would do with that formula. What is 'error'? Or is that what I'm trying to solve for?

wabbit · Mar 27, 2015

You would calculate the error (difference between model prediction and observation) as

error = ( d0 + k·min(t-t0, tmax-t0) ) - duration

But I don't think your data actually supports that.
This is something you can test:
- fit a model without tmax (or equivalently force tmax = some very large value) : you get a first total error^2
- then fit the model allowing tmax to change : you get a second sum of error^2
If the second total error is significantly* less than the first, than your extra parameter may have earned its keep. Otherwise reject it.

* there are statistical tests for this, but I don't think you need them here. I expect you'll see no reduction in toral error or at best a small one.

DaveC426913 · Mar 27, 2015

Right. Yes. I was working that out while you were posting.

Although, I don't understand it

because, as I confessed, I never studied stats.
I mean, I get the idea of what's it supposed to do, I'm just kind of hazy on the application to the data and chart.
Would I create a second curve, using the formula you provided, drop it on the first, then tweak it until I got a good match? (well,, except there is no first curve).

wabbit · Mar 27, 2015

Yes but use the solver to do the tweaking.

DaveC426913 · Mar 27, 2015

Okay. Thanks.

Alas, my time has run out. I leave for Cuba in 17 hours.Won't be able to revisit this till well into April.

Simple statistical analysis of a limited data set

Attachments

1. What is simple statistical analysis?

2. Why is it important to conduct simple statistical analysis?

3. How do you choose the appropriate statistical test for a limited data set?

4. What are some common challenges when conducting simple statistical analysis?

5. How can I effectively communicate the results of simple statistical analysis?

Similar threads

Hot Threads

Recent Insights