# Approximating a data set

Not sure if "General Math" is the best place for this, although I'm honestly not sure which sub-forum would be right.

So, I've got a data set. It looks like it's a standard exponential curve, but I honestly don't remember how to figure out an equation that will approximate it well. Actually, I guess I DO remember how to do an Nth degree polynomial given N data points, but I don't trust the standard polynomial form to do the job here, since I want to predict the data set a ways out.

The 125 data points I have currently are:

33.67
36.8
39.6
50.92
52.8
54.72
55.2
64.68
72.52
76.72
85.47
87.2
99.96
106.78
123.2
132
145.36
147.2
166.1
175.95
204.37
212.38
226.6
230.42
271.22
283.08
315.1
358.6
391.6
416.9
440
461.1
532.4
565.8
622.4
652
697.23
789.95
813.78
832
912
957.84
1155.08
1255.8
1277.3
1474
1601.3
1676.22
1782.73
2034.12
2097.6
2307.24
2647.84
2683.64
2964
3402.6
3622.6
4040.4
4296.4
4605.3
4803.5
5863.7
6259
6509
7378.4
7711.2
8432
8903
9694.2
10488
11144.1
12198
13727
14739.2
16148.2
18921
20608.9
21128
21660
25281
26319.7
30084
32050.8
32554.2
35431.2 <=== It's possible that somewhere around here, the function changes!
36432
40404
40510.2
44484
47424
51604
55624
61759
66670
72228
78880
85042
94242
100080
111240
121040
129456
139840
152613
171600
181440
197776
215644
233280
258750
279900
302820
328510
357280
388750
429000
462300
506350
535300
590400
638400
701800
753960
810250
980900

I'd like to approximate the next 50 or so points (the next 47, to be precise). I've tried playing around with the basics of e^(Ax) + B or x^A + B, but these don't seem to give me the right curve. Also, there may be TWO growth formulas; I'm not sure. The first two-thirds or so might follow one pattern, and the last third or so might follow another. So if there really ARE two different formulas, I'm more interested in the last third.

Ideas anyone on how to go about approximating this? Is my best bet really to do some crazy 40th order polynomial (I sure hope not)?

DaveE

hotvette
Homework Helper

1. Extrapolation is very tricky business unless you have a very good handle on the functional relationship that describes the data and have confidence that the functional relationship holds outside the range you have actual data.

2. Clues can often be obtained by knowing the source of the data and what it represents. If some physical phenomenon, there may be known or accepted functional relationships that can be used.

3. Accuracy of fit vs simplicity of the function and consequence of inaccuracy are also considerations.

Having said that, I took a quick look and it sure looks exponential to me. If you plot the data on a log scale, it is remarkably linear, which suggests a function of the form ln(y) = ax + b or y = exp(ax + b) would come pretty close. You can get a 1st order approximation by just using the first and last data points. A least squares approach would result in a better approximation for a & b that minimizes the square error.
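The two estimates described above can be sketched roughly as follows. This is only an illustration: it assumes the points are equally spaced, so that x is simply the 0-based position in the list (the thread never says what x is), and the names `two_point_fit` and `fit_log_linear` are made up for this example.

```python
import math

# The 125 values from the original post, in order.
Y = [
    33.67, 36.8, 39.6, 50.92, 52.8, 54.72, 55.2, 64.68, 72.52, 76.72,
    85.47, 87.2, 99.96, 106.78, 123.2, 132, 145.36, 147.2, 166.1, 175.95,
    204.37, 212.38, 226.6, 230.42, 271.22, 283.08, 315.1, 358.6, 391.6, 416.9,
    440, 461.1, 532.4, 565.8, 622.4, 652, 697.23, 789.95, 813.78, 832,
    912, 957.84, 1155.08, 1255.8, 1277.3, 1474, 1601.3, 1676.22, 1782.73, 2034.12,
    2097.6, 2307.24, 2647.84, 2683.64, 2964, 3402.6, 3622.6, 4040.4, 4296.4, 4605.3,
    4803.5, 5863.7, 6259, 6509, 7378.4, 7711.2, 8432, 8903, 9694.2, 10488,
    11144.1, 12198, 13727, 14739.2, 16148.2, 18921, 20608.9, 21128, 21660, 25281,
    26319.7, 30084, 32050.8, 32554.2, 35431.2, 36432, 40404, 40510.2, 44484, 47424,
    51604, 55624, 61759, 66670, 72228, 78880, 85042, 94242, 100080, 111240,
    121040, 129456, 139840, 152613, 171600, 181440, 197776, 215644, 233280, 258750,
    279900, 302820, 328510, 357280, 388750, 429000, 462300, 506350, 535300, 590400,
    638400, 701800, 753960, 810250, 980900,
]

# Assumption: x is the 0-based position of each value in the list.
X = list(range(len(Y)))


def two_point_fit(xs, ys):
    """Straight line through the first and last points of (x, ln y)."""
    a = (math.log(ys[-1]) - math.log(ys[0])) / (xs[-1] - xs[0])
    b = math.log(ys[0]) - a * xs[0]
    return a, b


def fit_log_linear(xs, ys):
    """Ordinary least squares for ln(y) = a*x + b."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    a = sxl / sxx
    b = mean_l - a * mean_x
    return a, b


a2, b2 = two_point_fit(X, Y)
a_ls, b_ls = fit_log_linear(X, Y)
print(f"two-point:     a = {a2:.5f}, b = {b2:.4f}")
print(f"least squares: a = {a_ls:.5f}, b = {b_ls:.4f}")
```

Either pair (a, b) gives a model y = exp(a*x + b). The two-point version passes exactly through the first and last values, while the least-squares version spreads the error in ln(y) over all 125 points.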

> 1. Extrapolation is very tricky business unless you have a very good handle on the functional relationship that describes the data and have confidence that the functional relationship holds outside the range you have actual data.
>
> 2. Clues can often be obtained by knowing the source of the data and what it represents. If some physical phenomenon, there may be known or accepted functional relationships that can be used.

In this case, we're pretty sure that the pattern holds for the first 80-or-so data points and holds similarly for the 80-or-so data points beyond that. The data was actually collected from an online game. The first iteration of the game featured 80-or-so different "monsters" of increasing difficulty, whose stats are represented first. Later, the game was expanded with an additional 80-or-so monsters with additional stats. So we (the players) know a bit about what to expect, but we're curious how difficult the monsters are GOING to get in the future. It takes quite a while for people to progress, and the question is now: can anyone ever even hope to get to the top tier of monsters?

Anyway, suffice to say that there's a good chance that the math involved will be relatively basic and consistent. It's not a real world system that's subject to some crazy system dynamics model or anything that would crash and burn after experiencing exponential growth or anything like that. It's entirely theoretical.

However, I don't expect it to be perfect: there are two components to the data given, which, individually, sort of rise randomly, but when multiplied together produce this set of data, which is VERY suggestive of a simpler mathematical formula. Hence, that's what I'm hoping to find, but each of the two sub-components may suffer from some rounding errors or other slight human-level tweaking.

> Having said that, I took a quick look and it sure looks exponential to me. If you plot the data on a log scale, it is remarkably linear, which suggests a function of the form ln(y) = ax + b or y = exp(ax + b) would come pretty close. You can get a 1st order approximation by just using the first and last data points. A least squares approach would result in a better approximation for a & b that minimizes the square error.

Ahhh, thanks! I had played around with exp(ax)+b, but not with exp(ax+b), since I guess it's been too long for me to remember which constants are significant in which form. I'll give that a try and see if I can get something that works.

DaveE
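Since the original post is mostly interested in the last third of the data and in the next 47 points, here is one possible sketch of that step. It assumes the change of pattern, if any, happens at the value flagged in the list (35431.2, at 0-based position 84), which is only the poster's guess, and that x is the 0-based position in the full list. The name `TAIL_START` and the variable names are illustrative. The usual caveat about extrapolation applies: the output is only as good as the assumption that the same exponential keeps holding.

```python
import math

# Assumption: the "second" pattern starts at the flagged value 35431.2,
# which sits at 0-based position 84 in the full list.
TAIL_START = 84
TAIL = [
    35431.2, 36432, 40404, 40510.2, 44484, 47424, 51604, 55624, 61759, 66670,
    72228, 78880, 85042, 94242, 100080, 111240, 121040, 129456, 139840, 152613,
    171600, 181440, 197776, 215644, 233280, 258750, 279900, 302820, 328510, 357280,
    388750, 429000, 462300, 506350, 535300, 590400, 638400, 701800, 753960, 810250,
    980900,
]

xs = list(range(TAIL_START, TAIL_START + len(TAIL)))
logs = [math.log(y) for y in TAIL]

# Ordinary least squares for ln(y) = a*x + b on the tail only.
n = len(xs)
mean_x = sum(xs) / n
mean_l = sum(logs) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
a = sxl / sxx
b = mean_l - a * mean_x

# Extrapolate the next 47 positions after the last observed one.
N_FUTURE = 47
future_x = list(range(xs[-1] + 1, xs[-1] + 1 + N_FUTURE))
forecast = [math.exp(a * x + b) for x in future_x]

print(f"a = {a:.5f}, b = {b:.4f}, growth per step = {math.exp(a):.4f}")
for x, y in zip(future_x, forecast):
    print(x, round(y))
```

Running the same fit on the first 84 values and comparing the two slopes would be one simple way to check whether there really are two different growth formulas.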

daniel_i_l
Gold Member
You can solve ln(y) = ax+b using discrete least squares.

hotvette
Homework Helper
> You can solve ln(y) = ax+b using discrete least squares.

True and easy to do, but recognize that it does solve a different problem than the nonlinear version y = c*exp(a*x), where c = exp(b). The linear version will give a worse fit in the later data points than the nonlinear version.

HallsofIvy
Homework Helper
> > You can solve ln(y) = ax+b using discrete least squares.
>
> True and easy to do, but recognize that it does solve a different problem than the nonlinear version y = c*exp(a*x), where c = exp(b). The linear version will give a worse fit in the latter data points than the nonlinear version.

Why would that be true? y = c exp(ax) and ln(y) = ax + b are exactly the same equation.

hotvette
Homework Helper
Even though the equations are mathematically equivalent, the least squares formulations aren't. In one case, f1 = ax + b and the objective is to minimize F1 = sum((ln(y) - f1)^2), whereas in the other case, f2 = c*exp(a*x) and the objective is to minimize F2 = sum((y - f2)^2). They are different problems with different results.
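To see the difference between F1 and F2 in numbers, the sketch below fits the same points both ways. It is only an illustration under several assumptions: it uses the last 41 values from the original post with x re-indexed from 0, and it minimizes F2 with a plain Gauss-Newton iteration with step halving, started from the log-linear answer. That solver is a choice made for this example, not something proposed in the thread, and all function names are made up.

```python
import math

# Last 41 values from the original post; x is re-indexed from 0 (assumption).
ys = [
    35431.2, 36432, 40404, 40510.2, 44484, 47424, 51604, 55624, 61759, 66670,
    72228, 78880, 85042, 94242, 100080, 111240, 121040, 129456, 139840, 152613,
    171600, 181440, 197776, 215644, 233280, 258750, 279900, 302820, 328510, 357280,
    388750, 429000, 462300, 506350, 535300, 590400, 638400, 701800, 753960, 810250,
    980900,
]
xs = list(range(len(ys)))


def sse_log(a, c, xs, ys):
    """F1: squared error measured in ln(y), with b = ln(c)."""
    b = math.log(c)
    return sum((math.log(y) - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def sse_raw(a, c, xs, ys):
    """F2: squared error measured in y itself."""
    return sum((y - c * math.exp(a * x)) ** 2 for x, y in zip(xs, ys))


def fit_log_linear(xs, ys):
    """Minimize F1 in closed form; returns (a, c) with c = exp(b)."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x, mean_l = sum(xs) / n, sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    a = sxl / sxx
    return a, math.exp(mean_l - a * mean_x)


def fit_nonlinear(xs, ys, a, c, iterations=50):
    """Reduce F2 by Gauss-Newton steps, halving any step that does not help."""
    best = sse_raw(a, c, xs, ys)
    max_x = max(abs(x) for x in xs)
    for _ in range(iterations):
        jaa = jac = jcc = ga = gc = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(a * x)
            r = y - c * e              # residual
            da, dc = c * x * e, e      # d(model)/da, d(model)/dc
            jaa += da * da; jac += da * dc; jcc += dc * dc
            ga += da * r; gc += dc * r
        det = jaa * jcc - jac * jac
        if det == 0.0:
            break
        step_a = (ga * jcc - jac * gc) / det
        step_c = (jaa * gc - jac * ga) / det
        t, improved = 1.0, False
        while t > 1e-10 and not improved:
            na, nc = a + t * step_a, c + t * step_c
            # Keep c positive and avoid overflow in exp().
            if nc > 0 and abs(na) * max_x < 700:
                trial = sse_raw(na, nc, xs, ys)
                if trial < best:
                    a, c, best, improved = na, nc, trial, True
            t *= 0.5
        if not improved:
            break
    return a, c


a_lin, c_lin = fit_log_linear(xs, ys)
a_nl, c_nl = fit_nonlinear(xs, ys, a_lin, c_lin)
for name, a, c in (("log-linear", a_lin, c_lin), ("nonlinear", a_nl, c_nl)):
    print(f"{name:10s} a={a:.5f} c={c:.1f} "
          f"F1={sse_log(a, c, xs, ys):.5f} F2={sse_raw(a, c, xs, ys):.4g}")
```

By construction, the log-linear parameters give the smallest F1 and the nonlinear parameters give an F2 no larger than the log-linear ones. F2 is dominated by the largest y values, which is why the nonlinear fit tends to track the later points more closely, while F1 treats relative errors more evenly across the whole range.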
