Fitting functions based on imperfect data

In summary, the conversation revolves around finding functions to fit a set of data. The data is floored at its last significant digit; columns v1 and v3 have a linear relationship, while column v2 has a higher-order relationship. The input parameter is always one order of magnitude less than the results. The suggested method for fitting the data is linear regression, particularly using Excel's LINEST tool or other statistical packages. There is also mention of fitting a parabola, or of taking the log of the v2 values for a better fit.
  • #1
I have a set of values and I'm trying to come up with functions to fit that data.
Code:
x    v1    v2    v3
1    112    14.4    52
2    120    15    53
3    128    15.5    54
4    136    16.2    55
5    144    16.9    56
6    152    17.6    57
7    160    18.5    58
8    168    19.4    59
9    176    20.4    60
10    184    21.5    61

Here is what I know about the data:
It is rounded down / floored at the last significant digit (i.e. a step of 1 for v1 and v3, 0.1 for v2).
Columns v1 and v3 look linear (i.e. a first-order polynomial).
Column v2 looks higher order (probably second, but I'm not sure).
I think (but am not sure) that the input parameter is always one order of magnitude less than the results.

Now, for v1 and v3, this makes things fairly straightforward.
Our function is
y = ax + b
a = (vmax - vmin) / 9
b = vmin - a

Because the values are floored, a could potentially vary. The true value can exceed each floored value by up to 1, so we have
a = (vmax - vmin) / 9 ± (1/9)
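The endpoint estimate above can be written out in a few lines; this is a hypothetical Python sketch using the v1 column from the table (the variable names are mine, not from the thread):

```python
# Endpoint estimate for a floored linear column (v1 from the table).
# Assumes x runs over the exact integers 1..10, as stated later in the thread.
v1 = [112, 120, 128, 136, 144, 152, 160, 168, 176, 184]

a = (v1[-1] - v1[0]) / 9   # slope from the two endpoints
b = v1[0] - a              # intercept so that y(1) == v1[0]
a_err = 1 / 9              # flooring can shift each endpoint by up to 1

print(f"y = {a}x + {b}, with a known only to within ±{a_err:.3f}")
```

For v1 this happens to reproduce every value exactly (y = 8x + 104), but in general a is only pinned down to the stated ±1/9 band.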

2 things give me trouble:
1. Column v2
2. A good way to make the function fit the data exactly without brute forcing it, i.e. messing around until I find values for a and b that fit exactly.

Notes:
  • This isn't a homework question.
  • I don't have a strong mathematical background (I know the basics of calculus, but not enough for practical use). That said, not averse to trying to tackle complex subjects.
 
  • #2
Excel has the necessary functions built in (slope and intercept).
 
  • #3
And LINEST.

Also, do you remember that old Microsoft helicopter joke?

Technically true, and decidedly unhelpful.
 
  • #4
Truncation reduces a number, on average, by half a step of the last retained place. If a value in the table is 5.6, I would use the value 5.65; always place a 5 in the first truncated digit. You do not say whether the x values are truncated or exact; deal with them appropriately.

Other than that, this looks like a standard linear regression problem. The Excel LINEST tool can be used. There are several other statistical packages with linear regression tools. IMHO, they would be much more acceptable and standard than the process that you suggest.
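As a sketch of that suggestion (my own minimal Python/NumPy version of what LINEST computes, not FactChecker's actual workflow), fitting v1 after adding back half of the truncation step:

```python
import numpy as np

# Ordinary least-squares line fit, as LINEST would produce, after adding
# half of the truncation step (0.5 for v1/v3, which are floored to integers).
x = np.arange(1, 11)
v1 = np.array([112, 120, 128, 136, 144, 152, 160, 168, 176, 184]) + 0.5

slope, intercept = np.polyfit(x, v1, 1)
print(slope, intercept)  # the +0.5 only shifts the intercept, not the slope
```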
 
  • #5
FactChecker said:
Truncation reduces a number, on average, by half a step of the last retained place. If a value in the table is 5.6, I would use the value 5.65; always place a 5 in the first truncated digit. You do not say whether the x values are truncated or exact; deal with them appropriately.

Other than that, this looks like a standard linear regression problem. The Excel LINEST tool can be used. There are several other statistical packages with linear regression tools. IMHO, they would be much more acceptable and standard than the process that you suggest.
I get it already: linear regression. I'm already using LINEST (it seems to give more consistent results than my method).

On x: Exact integers from 1 to 10.

My problem isn't the linear relationship. There are tools for it; it has a name and a long wiki article.

My biggest problem is that one column that doesn't have a linear relationship.
 
  • #6
You can try taking the log of the v2 values or including a second variable in the linear regression whose values are the x values squared. Either approach may give a better fit to the curved v2 data.

PS. I just looked at the plot of ##x^2## and I don't think it will be helpful. But log(v2) looks better.
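Both options can be tried directly. Here is my own Python/NumPy comparison (a sketch, not the plots FactChecker made) of fitting v2 with an added x² regressor versus fitting a line to log(v2):

```python
import numpy as np

x = np.arange(1, 11)
v2 = np.array([14.4, 15.0, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5])

# Option 1: add x^2 as a regressor, i.e. fit a quadratic in x.
quad = np.polyfit(x, v2, 2)
quad_resid = v2 - np.polyval(quad, x)

# Option 2: fit a line to log(v2), i.e. an exponential model for v2.
loglin = np.polyfit(x, np.log(v2), 1)
log_resid = v2 - np.exp(np.polyval(loglin, x))

# The quadratic residuals stay within the 0.1 truncation step;
# the exponential fit turns out noticeably rougher on this data.
print(np.abs(quad_resid).max(), np.abs(log_resid).max())
```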
 
Last edited:
  • #7
Better in what way? Sorry, as I said, I don't have a good math background.
 
  • #8
FactChecker said:
You can try taking the log of the v2 values or including a second variable in the linear regression whose values are the x values squared. Either approach may give a better fit to the curved v2 data.

PS. I just looked at the plot of ##x^2## and I don't think it will be helpful. But log(v2) looks better.
What is wrong with a parabola? It can fit all the numbers nicely within the rounding errors.
 
  • #9
mfb said:
What is wrong with a parabola? It can fit all the numbers nicely within the rounding errors.
I stand corrected. I did the regression and I agree that it looks good. I guess that trying to "eyeball" it was a bad mistake.
 
  • #10
The curve fit below, for X vs V2, was done with SIECLATOR Ver. 4.1. The original hosting site is gone, but a Google search seems to find some copies that I haven't tried.
As of April 2009, the author's email was: revelator.programs@hotmail.com

Coefficients of Fit
1.40133333333334e+01 . X ^0
3.97272727272719e-01 . X ^1
3.48484848484855e-02 . X ^2
# Ydata Ycalc. Diff.
1 1.44000e+01 1.44454e+01 4.5e-02
2 1.50000e+01 1.49472e+01 -5.2e-02
3 1.55000e+01 1.55187e+01 1.8e-02
4 1.62000e+01 1.61600e+01 -4.0e-02
5 1.69000e+01 1.68709e+01 -2.9e-02
6 1.76000e+01 1.76515e+01 5.1e-02
7 1.85000e+01 1.85018e+01 1.8e-03
8 1.94000e+01 1.94218e+01 2.1e-02
9 2.04000e+01 2.04115e+01 1.1e-02
10 2.15000e+01 2.14709e+01 -2.9e-02

Cheers,
Tom
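The posted coefficients don't require SIECLATOR; any least-squares tool reproduces them. A minimal Python/NumPy check (my own sketch):

```python
import numpy as np

x = np.arange(1, 11)
v2 = np.array([14.4, 15.0, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5])

# Degree-2 least-squares fit; polyfit returns the highest power first.
c2, c1, c0 = np.polyfit(x, v2, 2)
print(c0, c1, c2)  # matches the X^0, X^1, X^2 coefficients posted above
```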
 
  • #11
Tom.G said:
Coefficients of Fit
1.40133333333334e+01 . X ^0
3.97272727272719e-01 . X ^1
3.48484848484855e-02 . X ^2
Good. My only recommended change would be to add 0.005 to the constant term. That will split the errors caused by the truncation of the data to two digits after the decimal point. If @martix plans to do much work like this, I recommend that he get a good statistical package. I like R, which is free and well established. R has a learning curve.
 
  • #12
Tom.G said:
The curve fit below, for X vs V2, was done with SIECLATOR Ver. 4.1. The original hosting site is gone, but a Google search seems to find some copies that I haven't tried.
As of April 2009, the author's email was: revelator.programs@hotmail.com

Coefficients of Fit
1.40133333333334e+01 . X ^0
3.97272727272719e-01 . X ^1
3.48484848484855e-02 . X ^2
# Ydata Ycalc. Diff.
1 1.44000e+01 1.44454e+01 4.5e-02
2 1.50000e+01 1.49472e+01 -5.2e-02
3 1.55000e+01 1.55187e+01 1.8e-02
4 1.62000e+01 1.61600e+01 -4.0e-02
5 1.69000e+01 1.68709e+01 -2.9e-02
6 1.76000e+01 1.76515e+01 5.1e-02
7 1.85000e+01 1.85018e+01 1.8e-03
8 1.94000e+01 1.94218e+01 2.1e-02
9 2.04000e+01 2.04115e+01 1.1e-02
10 2.15000e+01 2.14709e+01 -2.9e-02

Cheers,
Tom
I'll take a look at the program; it will probably be useful in general. And I've been reading up on "least squares regression", which is what we're using here.
But yes, Wolfram Alpha reports the exact same result, and has the benefit of offering a neat visualisation and alternate forms.
As @FactChecker said, I need to add a bit to the constant, though I get the best result with +0.05.
I just can't quite make it fit; there always remains a small difference on one or two datapoints...
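The "can't quite make it fit" observation can be checked mechanically. This is a hypothetical Python sketch of the test I believe martix is describing (my names and assumptions): shift the constant term by +0.05, floor the fitted values back to one decimal, and count disagreements.

```python
import numpy as np

x = np.arange(1, 11)
v2 = np.array([14.4, 15.0, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5])

c2, c1, c0 = np.polyfit(x, v2, 2)
fitted = np.polyval([c2, c1, c0 + 0.05], x)  # constant term shifted by +0.05

# Floor each fitted value to one decimal, mirroring how the data was floored.
floored = np.floor(fitted * 10) / 10
mismatches = int(np.sum(floored != v2))
print(mismatches)  # a couple of points still refuse to line up exactly
```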
 
  • #13
martix said:
though I get best result with +0.05.
I just can't quite make it fit, there always remains a small difference on one or two datapoints...
Do you mean 0.005? That is what I would expect. Since the amount that is truncated is only known to average 0.005, there is no way to be exact. If the "small difference" is less than 0.005, that is as good as you can hope for.
 
  • #14
FactChecker said:
Do you mean 0.005? That is what I would expect. Since the amount that is truncated is only known to average 0.005, there is no way to be exact. If the "small difference" is less than 0.005, that is as good as you can hope for.
No. Look at the column (v2): the values are truncated to the first decimal place, meaning the truncated amount averages 0.05.
Otherwise you are correct.

And Excel (LINEST) can do higher-order polynomials as well, as I found out here:
https://stackoverflow.com/questions/10857252/quadratic-and-cubic-regression-in-excel
I'm having some difficulty deciphering the inner workings of Excel, but it does work.
 
  • #15
martix said:
No. Look at the column (v2): the values are truncated to the first decimal place, meaning the truncated amount averages 0.05.
Oh, sorry. I must have been looking at some other data.
And Excel (LINEST) can do higher-order polynomials as well, as I found out here:
https://stackoverflow.com/questions/10857252/quadratic-and-cubic-regression-in-excel
I'm having some difficulty deciphering the inner workings of Excel, but it does work.
Yes. Practically any statistical package can do multiple linear regression; some hand-held calculators may not. I am not very familiar with Excel's LINEST, but it looks like it does what is called stepwise regression with backward elimination. (See https://support.office.com/en-us/article/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d and https://en.wikipedia.org/wiki/Stepwise_regression )
 
  • #16
That SIECLATOR program originally ran with a 21-term polynomial fit! After looking at the plot and playing with the number of terms, I posted the 3-term result; it seemed "close enough." :oldeyes:

Cheers,
Tom
 
  • #17
Tom.G said:
That SIECLATOR program originally ran with a 21-term polynomial fit!
That is kind of disturbing since there are only 10 data points.
 
  • #18
In addition to the data points it also fitted a cat.

All the different tools here give the same fit results for the parabola; I would expect all of them to use the same analytic formula.
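That "same analytic formula" is the closed-form normal-equations solution of ordinary least squares, which is why every tool lands on an identical parabola. A small Python/NumPy illustration (my own sketch, not from the thread):

```python
import numpy as np

x = np.arange(1, 11)
v2 = np.array([14.4, 15.0, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5])

# Design matrix with columns [1, x, x^2]; the least-squares parabola is the
# unique solution of the normal equations (X^T X) beta = X^T y.
X = np.vander(x, 3, increasing=True)
beta = np.linalg.solve(X.T @ X, X.T @ v2)
print(beta)  # [constant, linear, quadratic], identical to any OLS tool's fit
```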
 
  • #19
Ha! Funny cat data fit.

I agree that the different tools would give the same results for the parabola with the same independent variables. But if they use different criteria and methods to determine which independent variables to include, the results may differ. SIECLATOR must be very different if it considered a polynomial of degree 20.
 
Last edited:
  • #20
That 21-term fit in SIECLATOR was probably a leftover from the last time it was used. The documentation states the default as 4 terms, which is much more reasonable in the real world!
 
