# B Extrapolating from data

1. Jan 12, 2018

### DaveC426913

I've got a collection of data that contains observations over time. I want to predict when a given future observation is likely to occur.

As a simple example: Say I'm watching billiard balls drop into a pocket. The billiard balls drop in with approximate regularity.

My dataset:
1 ball 0:00
2 ball 0:57
3 ball 2:02
4 ball 2:48
5 ball (not observed)
6 ball (not observed)
7 ball 5:55
8 ball (not observed)
9 ball 7:40
...
n ball ?

Understand that when the 9 ball drops at 7:40, I know it is the 9 ball. This means I know the 8 ball (and 5 and 6) went in the pocket, even though I didn't observe it and don't know when it did so.

These observations are ongoing, so, after each ball, I will update the prediction, getting ever more accurate as I get more data.

I want to create an algorithm that will predict when ball n (say, ball 25) is expected to drop.

BTW, I am better at programming than math, so I am more comfortable with an algorithm (sequential steps) than a formula.

2. Jan 12, 2018

### DaveC426913

I'm wondering if it is simpler than I'm expecting.

After ball 2 drops, the average is 57 seconds. So ball 15 will drop at 14x57 s.
After ball 3 drops the average is (2:02 / 2 = ) 61 seconds. So ball 15 will drop at 14x61 s.

Ball 7 dropped 3m7s after ball 4. That's (187 / 3 = 62.3) on average for those 3 balls...

Is it possible that I can ignore all the intermediate observations and simply say 9 balls dropped in 7m40s - then project that forward? Does that lead to the most accurate prediction for ball n?

3. Jan 12, 2018

### StoneTemplePython

Yes, you could take this approach, i.e. estimating the mean with the sample mean. It has theoretical merit, and the virtue of being quite simple.

(Bayesian thought: is there prior information about balls dropping that is being left out here?)

This is where it gets tricky. What does it mean to be most accurate? In programming, this typically comes down to coming up with a cost function and figuring out how to minimize it. If you are looking at minimizing squared errors, then that corresponds to the mean. There are many different cost functions you can choose from. (Another common one is that L1 norms correspond to medians, but there are many others.)

4. Jan 13, 2018

### Staff: Mentor

A proper linear fit will typically give a better estimate for the next balls. There are tools that do this, and if you are not interested in the uncertainty of the estimate, there are closed-form formulas.

5. Jan 14, 2018

### DaveC426913

A simple best single prediction will suffice.
As ball n approaches, the margin of error will approach zero.

And I'm gonna do it as an algorithm, rather than a formula.

So,
n=2, the predicted latency is 57s.
n=3, the predicted latency is (57+65)/2=61s.
n=4, the predicted latency is (57+65+46)/3 = 56s.

When I get to n=7 I'm not sure if I calculate it as (5:55-2:48=) 187s/3. That would be giving equal weight to latencies that are derived rather than observed.

Last edited: Jan 14, 2018
6. Jan 14, 2018

### Staff: Mentor

If you sum all differences, your sum is just the difference between the last and first observed event. You give zero weight to the events in between (you throw away most of the information), and if the first or last event are a bit off that has a big impact on your extrapolation.

7. Jan 15, 2018

### jbriggs444

It would be good to know the rules of the game before playing. Do we have, for instance, a parameterized distribution of outcomes where we are trying to estimate the parameters based on the outcomes? Maybe something like "each ball drops in a normal distribution with mean m and standard deviation s centered on a nominal drop time at a regular interval i"?

8. Jan 15, 2018

### DaveC426913

I ... don't know.

9. Jan 15, 2018

### DaveC426913

I think the linear fit / least squares thing would do the trick.

In looking it up, I see most places suggest you'd be wanting a calculating machine (i.e. computer) just to do the calcs, since it's not simple, even for the simplest data sets.

I was hoping my billiard-balls example would elicit an answer so simple even I could grasp it.

I'll show you what I'm actually doing:

http://www.davesbrain.ca/science/plates/ (It takes ten seconds or so to render).

I'd like to predict when a given plate (such as CRY) can be expected to be observed.

My primitive calculation function simply averages the delay between subsequent data points (some delays are even negative).

10. Jan 16, 2018

### Staff: Mentor

A linear fit should work.

Things to consider:
- the rate of license plates handed out will vary with time, especially with the seasons, but also with some long-term trend (population changes, cars per capita and so on). Fit over a long timespan to get rid of fluctuations, but don't fit linearly over 10+ years.
- the sighting efficiency might vary with time
- if you have precise dates (like your own license plate) they can be treated differently because you know a date that is certainly in the introduction range.
- you can estimate the time from introduction to spotting a plate by looking at the spread of the data points, or by looking how often you see a few given plates where you know their introduction period is over (e.g. all CC).

11. Jan 16, 2018

### DaveC426913

Yep. My current primitive algorithm predicts my desired target in 2024. Seven years.

Definitely.
I can never know whether I've already encountered, but not seen, a given plate. So my diligence in observing has a direct impact on the data.

I started this when my commute was 50km each way. Now it's 15.

Yep. I wonder if it's possible to analyze the data to make an educated guess at the average delay from introduction to observation.

12. Jan 16, 2018

### Staff: Mentor

You know Ontario gives out vanity plates, right? ;)

Do you have the data in a more accessible format than this?

13. Jan 18, 2018

### DaveC426913

Yes. But that would be a false alarm. Heralding the Dawn of a New Age is not something one wants to cheat at.

Whachoo talkin' bout Willis? JSON is one of the most flexible, universal formats there are.

14. Jan 18, 2018

### Staff: Mentor

There is no numbering that would directly correspond to the available plates, but I converted it to csv, deleted the unused rows and introduced a new numbering now. Three entries are odd:

CCG 2017-06-04 unused - what is this?
What about CCK, where an observation has been missing for a very long time now?
CBH has a very odd observation date.

Last edited: Jan 18, 2018
15. Jan 19, 2018

### DaveC426913

Architectural and data integrity.

I can never be certain that the 'unused' plates will never be observed. I have strong theory why they aren't observed (see next item) but my data should not assume such a bias. For all I know, the rules could change and those patterns could start showing up. Or current patterns could change and disappear.

CSV is OK if you're just dealing with a list, and all the items in the list are structurally identical.
But it doesn't deal well with:
• non-list items, such as extra meta-data that is not part of the data array itself (the footnotes could be another object in the JSON); CSV would need another file.
• items of indeterminate length (some rows have no tags, some have an array of multiple tags, JSON easily accommodates this).
• changes in functionality. I can always add more functionality to some rows as I expand the features of the app. New properties on a few rows will not hurt the existing functionality of other rows. (A CSV would require a new column for every single row, even if 99% of them have a null value).
Finally, the numbering column you added is not part of the data, as I see it, so I don't think it should be in the data. It can be derived from the data easily enough, since it's going to end up in an array, which means it will be indexed, but without imposing an artificial constraint on the raw data. (It's just another bias that might have to be re-engineered if the app changes.)

See the footnote:
Some plates (notably --G, --I, --O, --Q and --U) are never put into circulation. Presumably, they are disqualified as being too ambiguous and easily confused with other letters.
And indeed, I have yet to see any of these.

One of the most interesting. It is not on the standard list of disqualified combinations, yet I am certain it will never be seen.
I'll wager a large sum that it is disqualified by rules about profanity on plates.

Why? Because it lags behind its forebears? It's not the only one. CBA, CBX, CCA and several others are late to the table as well.
This is statistically unsurprising in such a scenario.

CDE has never been spotted to-date. That's quite possible - considering my observation method is far from perfect.

Last edited: Jan 19, 2018
16. Jan 19, 2018

### Staff: Mentor

I'm not saying anything against JSON for the website, but for fitting, a CSV or similar without the unused rows is just much more practical. That's why I asked. Anyway, I converted it now.
I read the footnote, but CCG has a date. Did you observe it (making the footnote wrong) or not (making the entry wrong)?

I don't find profanity for CCK (or anything else interesting), but let's assume it is unused for some reason.
I found the issue with CBH. You have two CBH in your data, the second one should be CDH. To get rid of unused items I sorted by plate, so CDH, which was introduced 8 months after CBH, appeared between the other CBH and CBJ, making it a massive outlier.

Okay, now everything is cleaned up apart from the CCG observation date issue.

17. Jan 19, 2018

### DaveC426913

I think the unused rows are almost as important as the used ones from a posterity POV. I'd hate to destroy valid information by making the data ambiguous.

Ah. That would be a lazy copy-paste error. Thanks for catching that.

It's the license plate version of a rude word. No deeper than that.

Ah! Thank you again!

18. Jan 19, 2018

### Staff: Mentor

Cleaning up more:

A simple fit suggests date(x) = 6.290 * x + 42585 where x is the plate index (CAA being 1, unused plates don't count) and date is the number of days since the 30th December 1899 (thanks, LibreOffice?). I started the fit at CAT to exclude the first plates. Starting it at CAZ leads to date'(x) = 6.068 * x + 42598. Excluding the very last data point I get date''(x) = 6.349 * x + 42583.

Excluding everything before CBJ which has surprisingly early dates and then a few jumps I get date'''(x) = 5.784 * x + 42615. As you can see there is still some large uncertainty in the fit. The last fit shows the smallest deviations for most of the range and ignores the initial data. I'll use that one.

In the fitted range, the RMS of fit minus observation date is 7.8 days, with some observations much later but not many much earlier, as we would expect.
CDW is an outlier (19.5 days early), but if two more plates in between turn out to be unused it becomes normal. Probably something to remove as long as the status of the plates in between is unclear.
CDA is an odd outlier, could the month be wrong there? One month later it would fit in nicely.
CDE is probably another one that didn't get used. A few months is more than we expect for non-observations by chance.

19. Jan 19, 2018

### DaveC426913

D'oh! Forgive my obtuseness for not realizing earlier that you were prepping the data as a prelude to helping me do the calcs. I'd have been a lot more gracious!

No, this is probably accurate.

This is an artifact of my technique. I am usually only looking for the latest few plates. It's very possible that I would not notice CDE once I'm looking for CDJ and its descendants. After a certain length of time, there's no point in putting a data point in even if I find it.

Which brings me to another issue:
The document is meant to update itself as I provide new data. Which means, rather than taking a fixed set of data and cleaning that up, it will always be working on the latest dataset. So, whatever algorithm I implement will give a fresh estimate every time the page is loaded - there's no data processing phase, which means no data getting thrown out.

I can't just ignore those outliers, because that's a human decision. Best I could do is either
• add a condition to the algorithm itself to reject outliers
• add another flag - say a 'lousy data point' flag, and then add that manually to certain records. Unfortunately, as has become apparent, I'm not analyzing the data sufficiently to spot errors - let alone outliers.
And I still gotta turn this into an algorithm...

Last edited: Jan 19, 2018
20. Jan 20, 2018

### DaveC426913

I'm using a Javascript plugin to do the linear regression. I don't care if it's accurate yet; I just want a line that approximates the data.

This is the result I'm getting:
• intercept:-67.21099273332828
• r2:0.986173087681175
• slope:4.988511467716297
I assume that means:
• the fitted line crosses the y axis at about -67.2 (its value at x = 0)
• r2 = 0.986 is a goodness-of-fit score: the line explains about 98.6% of the variance in the data (it is not part of the line itself)
• a given datapoint index i has a fitted y value of slope * i + intercept
(The intercept is negative because the line, extended back to x = 0, falls below zero; the chart presumably doesn't show that region.)

So, if I am to render the extrapolation line, I need to provide two data points.
One will be [0, intercept] and the other will be [iΩ, y]
where y is slope * iΩ + intercept

Last edited: Jan 20, 2018