
B Extrapolating from data

  Jan 12, 2018 #1

    DaveC426913

    Gold Member

    I've got a collection of data that contains observations over time. I want to predict when a given future observation is likely to occur.

    As a simple example: Say I'm watching billiard balls drop into a pocket. The billiard balls drop in with approximate regularity.

    My dataset:
    1 ball 0:00
    2 ball 0:57
    3 ball 2:02
    4 ball 2:48
    5 ball (not observed)
    6 ball (not observed)
    7 ball 5:55
    8 ball (not observed)
    9 ball 7:40
    ...
    n ball ?

    Understand that when the 9 ball drops at 7:40, I know it is the 9 ball. This means I know the 8 ball (and the 5 and 6) went into the pocket, even though I didn't observe them and don't know when they dropped.

    These observations are ongoing, so, after each ball, I will update the prediction, getting ever more accurate as I get more data.

    I want to create an algorithm that will predict when ball n (say, ball 25) is expected to drop.

    BTW, I am better at programming than at math, so I am more comfortable with an algorithm (sequential steps) than with a formula.
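
    For concreteness, something like this is what I have in mind for the data (Python, just a sketch):

    ```python
    # Observations as (ball_number, seconds_elapsed) pairs.
    # Unobserved balls simply have no entry, but once a later ball
    # is seen we know every earlier ball has already dropped.
    observations = [
        (1, 0),    # 0:00
        (2, 57),   # 0:57
        (3, 122),  # 2:02
        (4, 168),  # 2:48
        (7, 355),  # 5:55 (balls 5 and 6 dropped unobserved)
        (9, 460),  # 7:40 (ball 8 dropped unobserved)
    ]
    ```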
     
  Jan 12, 2018 #2

    DaveC426913

    Gold Member

    I'm wondering if it is simpler than I'm expecting.

    After ball 2 drops, the average is 57 seconds. So ball 15 will drop at 14x57 s.
    After ball 3 drops, the average is (2:02 / 2 =) 61 seconds. So ball 15 will drop at 14x61 s.

    Ball 7 dropped 3m7s after ball 4. That's (187 / 3 = 62.3) on average for those 3 balls...

    Is it possible that I can ignore all the intermediate observations and simply say 9 balls dropped in 7m40s - then project that forward? Does that lead to the most accurate prediction for ball n?
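
    In code, the naive idea would look something like this (Python sketch, using the (ball, seconds) pairs from my first post):

    ```python
    def predict_from_endpoints(observations, n):
        """Predict the drop time of ball n using only the average
        interval between the first and last observed balls."""
        first_ball, t_first = observations[0]
        last_ball, t_last = observations[-1]
        mean_interval = (t_last - t_first) / (last_ball - first_ball)
        return t_first + (n - first_ball) * mean_interval

    # With the (ball, seconds) list from post #1:
    #   predict_from_endpoints(observations, 25)
    # 9 balls in 7m40s -> 460 s / 8 intervals = 57.5 s per ball,
    # so ball 25 is predicted at 24 * 57.5 = 1380 s (23:00).
    ```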
     
  Jan 12, 2018 #3

    StoneTemplePython

    Science Advisor
    Gold Member

    Yes, you could take this approach, i.e. estimate the mean with the sample mean. It has theoretical merit, and it has the virtue of being quite simple.

    (Bayesian thought: is there prior information about balls dropping that is being left out here?)

    This is where it gets tricky. What does it mean to be most accurate? In practice, this typically comes down to choosing a cost function and figuring out how to minimize it. Minimizing squared errors corresponds to the mean; minimizing the L1 norm (absolute errors) corresponds to the median; and there are many other cost functions to choose from.
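
    A quick numerical check of that correspondence (Python, with made-up interval data):

    ```python
    import numpy as np

    # The value minimizing summed squared error over a sample is its
    # mean; the value minimizing summed absolute error is its median.
    intervals = np.array([57.0, 65.0, 46.0, 62.3, 52.5])  # made-up data

    grid = np.linspace(intervals.min(), intervals.max(), 100001)
    l2 = ((grid[:, None] - intervals) ** 2).sum(axis=1)
    l1 = np.abs(grid[:, None] - intervals).sum(axis=1)

    print(grid[l2.argmin()], intervals.mean())      # both ~56.56 (the mean)
    print(grid[l1.argmin()], np.median(intervals))  # both ~57.0 (the median)
    ```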
     
  Jan 13, 2018 at 1:43 AM #4

    mfb

    2017 Award

    Staff: Mentor

    A proper linear fit will typically give a better estimate for the next balls. There are tools that do this, and if you are not interested in the uncertainty of the estimate, there are simple closed-form formulas.
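
    The standard least-squares formulas, for example, are simple enough to code directly (a Python sketch; `observations` holds (ball, time) pairs as in the first post):

    ```python
    def linear_fit_predict(observations, n):
        """Ordinary least-squares fit of drop time t against ball
        number k, extrapolated to ball n. Closed form, no uncertainty."""
        ks = [k for k, _ in observations]
        ts = [t for _, t in observations]
        m = len(observations)
        k_bar = sum(ks) / m
        t_bar = sum(ts) / m
        slope = (sum((k - k_bar) * (t - t_bar) for k, t in observations)
                 / sum((k - k_bar) ** 2 for k in ks))
        intercept = t_bar - slope * k_bar
        return intercept + slope * n
    ```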
     
  Jan 14, 2018 at 6:32 PM #5

    DaveC426913

    Gold Member

    A single best prediction will suffice.
    As ball n approaches, the margin of error will shrink toward zero.

    And I'm gonna do it as an algorithm, rather than a formula.

    So,
    n=2: the predicted latency is 57 s.
    n=3: the predicted latency is (57+65)/2 = 61 s.
    n=4: (57+65+46)/3 = 56 s.

    When I get to n=7, I'm not sure whether to calculate it as (5:55 - 2:48 =) 187 s / 3. That would give equal weight to latencies that are derived rather than observed.
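
    As an algorithm, the equal-weight version would look like this (Python sketch; whether derived latencies deserve full weight is exactly what I'm unsure about):

    ```python
    def average_latency(observations):
        """Running average of per-ball latency, giving latencies derived
        from gaps (unobserved balls) the same weight as observed ones."""
        latencies = []
        prev_ball, prev_time = observations[0]
        for ball, time in observations[1:]:
            gap = ball - prev_ball               # >1 when balls were missed
            per_ball = (time - prev_time) / gap  # derived latency per ball
            latencies.extend([per_ball] * gap)
            prev_ball, prev_time = ball, time
        return sum(latencies) / len(latencies)
    ```

    (With equal weights this collapses to the overall first-to-last average, as it turns out.)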
     
    Last edited: Jan 14, 2018 at 6:40 PM
  Jan 14, 2018 at 9:39 PM #6

    mfb

    2017 Award

    Staff: Mentor

    If you sum all the differences, the sum is just the difference between the last and first observed events. You give zero weight to the events in between (throwing away most of the information), and if the first or last event is a bit off, that has a big impact on your extrapolation.
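
    The telescoping is easy to verify (Python, with the observed drop times from the first post):

    ```python
    times = [0, 57, 122, 168, 355, 460]               # observed drop times (s)
    diffs = [b - a for a, b in zip(times, times[1:])]
    print(sum(diffs), times[-1] - times[0])           # identical: 460 460
    ```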
     
  Jan 15, 2018 at 2:22 PM #7

    jbriggs444

    Science Advisor

    It would be good to know the rules of the game before playing. Do we have, for instance, a parameterized distribution of outcomes where we are trying to estimate the parameters based on the outcomes? Maybe something like "each ball drops in a normal distribution with mean m and standard deviation s centered on a nominal drop time at a regular interval i"?
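
    Under a model like that, the data would be generated roughly as follows (Python sketch; the model itself is only a guess):

    ```python
    import random

    def simulate_drops(n_balls, interval, sigma, seed=0):
        """Ball k (counting from 0) has nominal drop time k*interval,
        plus Gaussian jitter with standard deviation sigma."""
        rng = random.Random(seed)
        return [k * interval + rng.gauss(0, sigma) for k in range(n_balls)]
    ```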
     
  Jan 15, 2018 at 5:57 PM #8

    DaveC426913

    Gold Member

    I ... don't know.
     
  Jan 15, 2018 at 6:04 PM #9

    DaveC426913

    Gold Member

    I think the linear fit / least squares thing would do the trick.

    In looking it up, I see most places suggest you'd want a calculating machine (i.e. a computer) just to do the calculations, since they're not simple even for the simplest data sets.

    I was hoping my billiard balls example would elicit an answer so simple even I could grasp it.

    I'll show you what I'm actually doing:

    http://www.davesbrain.ca/science/plates/ (It takes ten seconds or so to render).

    I'd like to predict when a given plate (such as CRY) can be expected to be observed.

    My primitive calculation function simply averages the delays between successive data points (some of which are even negative).
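
    Concretely, the primitive function amounts to this (Python sketch; the day numbers are made up):

    ```python
    def average_delay(first_sightings):
        """Average the day-gaps between first sightings of successive
        plate series; a gap is negative when a later series was
        spotted before an earlier one."""
        delays = [b - a for a, b in zip(first_sightings, first_sightings[1:])]
        return sum(delays) / len(delays)

    # e.g. days-since-start of each series' first sighting, in series order:
    print(average_delay([0, 41, 39, 80, 118]))  # 29.5 days per series
    ```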
     
  Jan 16, 2018 at 8:02 PM #10

    mfb

    2017 Award

    Staff: Mentor

    A linear fit should work.

    Things to consider:
    - the rate at which license plates are handed out will vary with time, especially with the seasons, but also with long-term trends (population changes, cars per capita, and so on). Fit over a long timespan to smooth out fluctuations, but don't fit linearly over 10+ years.
    - the sighting efficiency might vary with time
    - if you have precise dates (like your own license plate) they can be treated differently because you know a date that is certainly in the introduction range.
    - you can estimate the time from introduction to spotting a plate by looking at the spread of the data points, or by looking how often you see a few given plates where you know their introduction period is over (e.g. all CC).
     
  Jan 16, 2018 at 8:43 PM #11

    DaveC426913

    Gold Member

    Yep. My current primitive algorithm predicts my desired target in 2024. Seven years.

    Definitely.
    I can never know whether a given plate went by without my observing it before my first recorded sighting. So, my diligence in observing has a direct impact on the data.

    I started this when my commute was 50km each way. Now it's 15.

    Yep. I wonder if it's possible to analyze the data to make an educated guess at the average delay from introduction to observation.
     
  Jan 16, 2018 at 8:58 PM #12

    mfb

    2017 Award

    Staff: Mentor

    You know Ontario gives out vanity plates, right? ;)

    Do you have the data in a more accessible format than this?
     
  Jan 18, 2018 at 7:34 PM #13

    DaveC426913

    Gold Member

    Yes. But that would be a false alarm. Heralding the Dawn of a New Age is not something one wants to cheat at.


    Whachoo talkin' 'bout, Willis? JSON is one of the most flexible, universal formats there is.
     
  Jan 18, 2018 at 9:04 PM #14

    mfb

    2017 Award

    Staff: Mentor

    There is no numbering that would directly correspond to the available plates, but I converted it to CSV, deleted the unused rows, and introduced a new numbering. Three entries are odd:

    CCG 2017-06-04 unused - what is this?
    What about CCK, where an observation has been missing for a very long time now?
    CBH has a very odd observation date.
     
    Last edited: Jan 18, 2018 at 11:27 PM