1. Limited time only! Sign up for a free 30min personal tutor trial with Chegg Tutors
    Dismiss Notice
Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Simple statistical analysis of a limited data set

  1. Mar 23, 2015 #1

    DaveC426913

    User Avatar
    Gold Member

    I should have studies statistics in school.

    I have been collecting some trip data from you commute to and from work, and I want to determine what factors are influencing trip duration. There are multiple factors, which I cannot easily isolate.

    The factors I'm most interested in are:
    - duration of trip as a function of start time
    - duration of trip as a function of day-of-week
    - duration of trip as a function of route taken

    What I want to do is to understand how best to analyze the data. For instance: if I want to examine a single factor, can I meaningfully 'normalize' the other factors?

    Here is a snippet of the data I've collected.

    Date AM/PM Start Finish Temp Route Notes (Duration)

    Feb 11 Th A 0814 0857 -9 QEW Cold Dry 43m
    Feb 11 Th P 1823 1859 -9 QEW Cold Dry 36m
    Feb 12 Fr A 0813 0859 -18 QEW Freezing 46m
    Feb 12 Fr P 1748 1820 -8 QEW Cold 32m
    Feb 19 Th A 0811 0847 -16 Lakeshore Dry 36m
    Feb 24 Tu A 0804 0849 -12 QEW Dry 45m
    Mar 10 Tu A 0821 0904 +1 Lakeshore Wet 43m



    I've only got about 2 dozen entries, so it may be problematic to chop up the data into small sections and analyze each factor in isolation. Is there a way of averaging the data to make better use of it?

    For example, when I'm analyzing duration as a function of day-of-week, the AM/PM parameter is a confounding factor. (Notice that PM times are significantly shorter than AM times, which could throw off my results.) But even if the absolute value of PM times is off, surely the trend is still there. Do I have to throw away all the PM data when analyzing AM data for day-of-week trends? Or can I somehow normalize the PM data to help me see overall trends over AM and PM?

    This will be of particular importance in that I have very little data (< 6 entries) for alternate routes (Lakeshore).
     
  2. jcsd
  3. Mar 23, 2015 #2

    wabbit

    User Avatar
    Gold Member

    Haven't studied statistics much either but just a couple comments / suggestions

    - you have at least 20 different situations (5 days x am/pm) and 24 data points so this may well prove inconclusive. Maybe a first step would be to assume all days are the same (unless you include weekends, then you need at least two categories, weekday/weekend).

    Assuming only weekdays, a priori I would expect am and pm to belong to two different distributions and estimate them separately, and same for different routes so 6 different cases to estimate separately.

    - what is the best route in the morning (compare average and stdev of driving times separately for each route, ignoring exact hour)

    - for a given route in the morning, what is the best starting time (plot driving time as a function of starting time)

    - may not work if best starting time is different for each route, but perhaps try to mix normalized times i.e.
    Route 1 -> avg1, std1 of am route1 driving times -> t*=(t-avg1)/std1
    Then plot the t*s from all routes against starting times
     
  4. Mar 23, 2015 #3

    DaveC426913

    User Avatar
    Gold Member

    Yep, weekdays only.

    Yeah, agree. AM versus PM will have virtually no useful relation. Trying to force one would skew the data.

    Unfortunately, I have very little data on the alternate routes. My choice of route is not randomly selected; it is strongly biased toward QEW.

    This is what I'm wondering. Unfortunately, your shorthand formulae elude me.
     
  5. Mar 24, 2015 #4

    wabbit

    User Avatar
    Gold Member

    Ah sorry, ##t^*=\frac{t-avg_1^{AM}}{std_1^{AM}}## is the transformation applied to morning driving times t on route 1, t* the normalized time, ##avg_1^{AM}## the average driving time on route1 in the morning, ##std_1^{AM}## the standard deviation of same.
    One transformation per sub-distribution, then mix normalized times.

    But you say some routes have very few data points - in that case the above fails (or gives an unreliable result) but I think it just comes from trying to extract information that isn't there. At the extreme, if taking a new route one morning (say one that takes a detour or stops for an errand), the corresponding data point is worthless for estimating usual relations and should just be thrown out.*

    * unless you have a model of the difference, e.g. if that single-occurence route differs by a 5-minute stop, then you can just deduce these 5 minutes and include the point.
    You could try this strategy more generally too, though for differences of driving times between routes I can't think of a simple relevant model beyond the one implicit in the transformation above, which is $$t=avg(route,1_{AM})+std(route,1_{AM})*t^*$$
     
    Last edited: Mar 24, 2015
  6. Mar 24, 2015 #5

    DaveC426913

    User Avatar
    Gold Member

    Wow. Er. I'll try to absorb that.

    Yah. I don't record data for days when I run errands. But for days when the delays are reasonably allotted to traffic, such as detours or accident, they count, since they are a realistic factor in choosing a route.

    Yes. Only once did I subtract out a known-duration traffic delay.

    As above: Wow. Er.
     
  7. Mar 26, 2015 #6

    DaveC426913

    User Avatar
    Gold Member

    Well, here's my data to-date. And I've made some primitive graphs from it.
    High priority:
    Break out into separate AM and PM groups.
    Generate average data and plot.
    Lower priority:
    Edit X-axis labels to pull from lookup table in upper right corner. (Or figure out how to group data)

    excel.png
     
  8. Mar 26, 2015 #7

    wabbit

    User Avatar
    Gold Member

    Interesting. I would focus only on the AM data to start, PM is different and has few points.

    There seems to be some variation by day of week. If this persits for AM, you might try computing average time separately for each day, and then look at the reduced times obtained by deducting from each drive the average time of the corresponding day,
     
  9. Mar 26, 2015 #8

    DaveC426913

    User Avatar
    Gold Member

    I think there's variation by start time - that's really the primary factor of interest - especially since it's directly under my control.
    It's hard to see because the chart graphs AM and PM together across 24 hours, so the variance is too small to see.

    I'd like to figure out how to filter the data used in the charts. I read a lot about filtering the data in the table, but that's problematic if I want to show two filters on the same data.

    What I'd like is to apply some sort of conditional on the datasource equation for the chart (eg. Y-axis: $K2-$K32 WHERE $D='A') while leaving the raw data alone. Then I can make another chart WHERE $D='P'.
     
  10. Mar 26, 2015 #9

    wabbit

    User Avatar
    Gold Member

    In Excel at least and I think in other spreadshhet software too you can select the data with column headings and apply a filter here, it hides dome data (also in the charts), but it doesn't lose it, and you can edit the filter to look at different parts of the data, is this what you're looking for? There are also filtered functions such as SUMIF(), etc. which are useful for this sort of thing.
     
  11. Mar 26, 2015 #10

    DaveC426913

    User Avatar
    Gold Member

    I can apply that to the data, ye,s but that's interactive. What I was hoping to do was to capture multiple, static charts, all from the same data. I can't post an Excel spreadsheet online.
     
  12. Mar 26, 2015 #11

    wabbit

    User Avatar
    Gold Member

    Hmm I see. You can create several dynamically filtered tables and charts, all linked to the master table, using pivot tables. Actually that's something I use a lot when doing this sort of data analysis.
     
  13. Mar 26, 2015 #12

    DaveC426913

    User Avatar
    Gold Member

    OK, I've filtered out PM entries, alternate route entries and 'special circumstance' entries. Less data but more informative.

    One thing I'd like to do is make a 'lookup' table to correlate ordinal day-of-week (1,2,3,4,5) with name.

    Some interesting trending happening by weekday...
     

    Attached Files:

    Last edited: Mar 26, 2015
  14. Mar 26, 2015 #13

    wabbit

    User Avatar
    Gold Member

    The duration by starting time looks like it has a decent fit to
    duration = d0 + max( 0, k·(t-t0) ) for t0 ~ 0730
     
  15. Mar 26, 2015 #14

    DaveC426913

    User Avatar
    Gold Member

    So, if I leave the house at 7:30AM, I should be able to make it to work in zero time? :biggrin:
     
  16. Mar 26, 2015 #15

    wabbit

    User Avatar
    Gold Member

    Just have to deal with that pesky d0 and you could :)
     
  17. Mar 26, 2015 #16

    DaveC426913

    User Avatar
    Gold Member

    How do I formulaically plot average?

    Basically, $K1-$Kn WHERE $B=(1,2,3,4,5)

    So my DoW chart would have a single average point for each day amongst the other data.

    I guess I start sprinkling sub total data throughout my page?
     
  18. Mar 26, 2015 #17

    wabbit

    User Avatar
    Gold Member

    with multiple conditions I think you might need to add an auxiliary column where you put the criterion, say $Gn formula : AND($Bn>=1,Bn<=5)

    Then use. SUMIF($K1:$Kn, $G1:$Gn)/COUNTIF($K1:$Kn, $G1:$Gn)

    Oh not what you meant - you want 5 sub averages.

    Not sure about it, except by adding 5 auxiliary colums, one per day. There must be a better way but can't think of it now.

    A pivot table would do that in one shot though.
     
    Last edited: Mar 26, 2015
  19. Mar 26, 2015 #18

    DaveC426913

    User Avatar
    Gold Member

    I've created 5 cells in a corner of my spreadsheet with array formulae:

    {=AVERAGE(IF(B2:B32=<1,2,3,4,5>,K2:K32))}

    Hm. These do not render on the graph properly.

    Awww. The formulae ignore the filters! My averages are calc'ed from ALL rows, regardless of whether or not they are filtered out.

    *sigh* gotta explicitly apply filters to the 5 formulae:
    {=AVERAGE(IF(B2:B32=<1,2,3,4,5>,IF(D2:D32="A",IF(H2: H32=1,IF(J2:J32="",K2:K32)))))}
     
    Last edited: Mar 26, 2015
  20. Mar 27, 2015 #19

    DaveC426913

    User Avatar
    Gold Member

    I find the DoW averages intriguing. I wonder whether it's Wednesday that's the anomaly, or if it's Tuesday and Thursday that are the anomalies.
    And I wonder why. A lot of people take Wednesday off?

    As for the Departure Time, how can I calculate the trend? What's it called? A 'best fit' curve?
    Then I have to figure out if Excel can render it.

    commute.PNG
     
  21. Mar 27, 2015 #20

    DaveC426913

    User Avatar
    Gold Member

    Found the trendline feature. Cool!

    OK, so here's something I'm wondering. In real life, the commute will not decrease linearly with an earlier departure time, so the best fit should asymptotically approach some value of Y - likely somewhere between 0:35 and 0:30. But the best fit that Excel renders only has two points and a straight line between them. I don't see how I could make that line more representative.

    commute.PNG
     
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook




Similar Discussions: Simple statistical analysis of a limited data set
  1. Data analysis (Replies: 2)

  2. Excel data analysis (Replies: 3)

Loading...