Fitting functions based on imperfect data

  • Thread starter martix
  • #1
martix
I have a set of values and I'm trying to come up with functions to fit that data.
Code:
x   v1   v2    v3
1   112  14.4  52
2   120  15    53
3   128  15.5  54
4   136  16.2  55
5   144  16.9  56
6   152  17.6  57
7   160  18.5  58
8   168  19.4  59
9   176  20.4  60
10  184  21.5  61

Here is what I know about the data:
It is rounded down / floored to the resolution shown (i.e. a step of 1 for v1 and v3, 0.1 for v2).
Columns v1 and v3 look linear (i.e. a first-order polynomial).
Column v2 looks higher order (probably second, but I'm not sure).
I think (but am not sure) that the input parameters are always one order of magnitude less than the results.

Now, for v1 and v3, this makes things fairly straightforward.
Our function is
y = ax + b
a = (vmax - vmin) / 9
b = vmin - a

Because the values are floored, a could potentially vary. Each true value is within 1 of the tabulated one, so we have
a = (vmax - vmin) / 9 ± (1/9)
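
A minimal sketch of that calculation (Python; the endpoint_fit helper is purely illustrative, not from any library):
Code:
# Endpoint-based estimate of a and b for y = a*x + b, with the flooring uncertainty.
v1 = [112, 120, 128, 136, 144, 152, 160, 168, 176, 184]
v3 = [52, 53, 54, 55, 56, 57, 58, 59, 60, 61]

def endpoint_fit(v, step=1.0):
    a = (v[-1] - v[0]) / 9    # (vmax - vmin) / 9
    b = v[0] - a              # b = vmin - a
    a_err = step / 9          # flooring can shift either endpoint by up to `step`
    return a, b, a_err

for name, v in (("v1", v1), ("v3", v3)):
    a, b, a_err = endpoint_fit(v)
    print(name, "a =", a, "+/-", a_err, ", b =", b)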

Two things give me trouble:
1. Column v2.
2. A good way of making the function fit the data exactly without having to brute-force it, i.e. mess around until I find values for a and b that fit exactly.

Notes:
  • This isn't a homework question.
  • I don't have a strong mathematical background (I know the basics of calculus, but not enough for practical use). That said, I'm not averse to trying to tackle complex subjects.
 

Answers and Replies

  • #2
Svein
Excel has the necessary functions built in (slope and intercept).
 
  • #3
martix
And LINEST.

Also, do you remember that old Microsoft helicopter joke?

Technically true, and decidedly unhelpful.
 
  • #4
FactChecker
Science Advisor
Gold Member
6,371
2,517
The truncation of numbers on average reduces the number by half of the next place. If a value in the table is 5.6, I would use the value 5.65. Always place 5 in the first truncated digit. You do not say if the x values are truncated or if they are exact. Deal with them appropriately.

Other than that, this looks like a standard linear regression problem. The Excel LINEST tool can be used. There are several other statistical packages with linear regression tools. IMHO, they would be much more acceptable and standard than the process that you suggest.
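
A rough sketch of that recipe (Python/numpy standing in for Excel's LINEST; the half-step adjustment is the one described above):
Code:
import numpy as np

x = np.arange(1, 11)                      # exact integers 1..10
v1 = np.array([112, 120, 128, 136, 144, 152, 160, 168, 176, 184], float)
v3 = np.array([52, 53, 54, 55, 56, 57, 58, 59, 60, 61], float)

for name, v, step in (("v1", v1, 1.0), ("v3", v3, 1.0)):
    adjusted = v + step / 2               # truncation removes half a step on average
    a, b = np.polyfit(x, adjusted, 1)     # ordinary least-squares straight line
    print(name, "slope =", round(a, 3), "intercept =", round(b, 3))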
 
  • #5
martix
FactChecker said:
The truncation of numbers on average reduces the number by half of the next place. If a value in the table is 5.6, I would use the value 5.65. Always place 5 in the first truncated digit. You do not say if the x values are truncated or if they are exact. Deal with them appropriately.

Other than that, this looks like a standard linear regression problem. The Excel LINEST tool can be used. There are several other statistical packages with linear regression tools. IMHO, they would be much more acceptable and standard than the process that you suggest.
I get it already. Linear regression. Already using LINEST (seems to give more consistent results than my method).

On x: Exact integers from 1 to 10.

My problem isn't the linear relationship. There are tools for it, it has a name and a long wiki article.

My biggest problem is that one column that doesn't have a linear relationship.
 
  • #6
FactChecker
You can try taking the log of the v2 values or including a second variable in the linear regression whose values are the x values squared. Either approach may give a better fit to the curved v2 data.

PS. I just looked at the plot of ##x^2## and [CORRECTION] I don't think it will be helpful. But log(v2) looks better.
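
A small sketch of both options (numpy; half of the 0.1 truncation step is added to v2 first, per post #4):
Code:
import numpy as np

x = np.arange(1, 11)
v2 = np.array([14.4, 15, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5]) + 0.05

# Option 1: fit a straight line to log(v2), i.e. an exponential model for v2.
a_log, b_log = np.polyfit(x, np.log(v2), 1)

# Option 2: include x^2 as a second regressor, i.e. a quadratic model.
c2, c1, c0 = np.polyfit(x, v2, 2)

print("log fit:       log(v2) ~", a_log, "* x +", b_log)
print("quadratic fit: v2 ~", c2, "* x^2 +", c1, "* x +", c0)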
 
Last edited:
  • #7
martix
Better in what way? Sorry, as I said, I don't have a good math background.
 
  • #8
FactChecker said:
You can try taking the log of the v2 values or including a second variable in the linear regression whose values are the x values squared. Either approach may give a better fit to the curved v2 data.

PS. I just looked at the plot of ##x^2## and I don't think it will be helpful. But log(v2) looks better.
What is wrong with a parabola? It can fit to all numbers nicely within the rounding errors.
 
  • #9
FactChecker
What is wrong with a parabola? It can fit to all numbers nicely within the rounding errors.
I stand corrected. I did the regression and I agree that it looks good. I guess that trying to "eyeball" it was a bad mistake.
 
  • #10
Tom.G
The curve fit below for X vs V2 was done with SIECLATOR Ver. 4.1. The original hosting site is gone, but a Google search seems to find some copies that I haven't tried.
As of April 2009 the author's email was: revelator.programs@hotmail.com

Code:
Coefficients of Fit
1.40133333333334e+01 . X^0
3.97272727272719e-01 . X^1
3.48484848484855e-02 . X^2

 #   Ydata        Ycalc.       Diff.
 1   1.44000e+01  1.44454e+01   4.5e-02
 2   1.50000e+01  1.49472e+01  -5.2e-02
 3   1.55000e+01  1.55187e+01   1.8e-02
 4   1.62000e+01  1.61600e+01  -4.0e-02
 5   1.69000e+01  1.68709e+01  -2.9e-02
 6   1.76000e+01  1.76515e+01   5.1e-02
 7   1.85000e+01  1.85018e+01   1.8e-03
 8   1.94000e+01  1.94218e+01   2.1e-02
 9   2.04000e+01  2.04115e+01   1.1e-02
10   2.15000e+01  2.14709e+01  -2.9e-02
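
For comparison, assuming this is an ordinary least-squares fit, essentially the same coefficients should come out of any regression tool; here is a sketch with numpy (not SIECLATOR):
Code:
import numpy as np

x = np.arange(1, 11)
v2 = np.array([14.4, 15.0, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5])

c2, c1, c0 = np.polyfit(x, v2, 2)      # least-squares quadratic, highest power first
print(c0, c1, c2)                      # expect roughly 14.0133, 0.39727, 0.034848

residuals = np.polyval([c2, c1, c0], x) - v2
print(np.max(np.abs(residuals)))       # every |diff| stays inside the 0.1 rounding step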

Cheers,
Tom
 
  • #11
FactChecker
Tom.G said:
Coefficients of Fit
1.40133333333334e+01 . X^0
3.97272727272719e-01 . X^1
3.48484848484855e-02 . X^2
Good. My only recommended change would be to add 0.005 to the constant term. That will split the errors caused by the truncation of the data to two digits after the decimal point. If @martix plans to do much work like this, I recommend that he get a good statistical package. I like R, which is free and well established. R has a learning curve.
 
  • #12
martix
Tom.G said:
The curve fit below for X vs V2 was done with SIECLATOR Ver. 4.1. [...]
Coefficients of Fit
1.40133333333334e+01 . X^0
3.97272727272719e-01 . X^1
3.48484848484855e-02 . X^2
I'll take a look at the program; it will probably be useful in general. I've also been reading up on "least squares regression", which is what we're using here.
But yes, Wolfram Alpha reports the exact same result, and has the benefit of offering a neat visualisation and alternate forms.
As @FactChecker said, I need to add a bit to the constant, though I get the best result with +0.05.
I just can't quite make it fit; there always remains a small difference on one or two data points...
 
  • #13
FactChecker
martix said:
though I get the best result with +0.05.
I just can't quite make it fit; there always remains a small difference on one or two data points...
Do you mean 0.005? That is what I would expect. Since the amount that is truncated is only known to average 0.005, there is no way to be exact. If the "small difference" is less than 0.005, that is as good as you can hope for.
 
  • #14
martix
FactChecker said:
Do you mean 0.005? That is what I would expect. Since the amount that is truncated is only known to average 0.005, there is no way to be exact. If the "small difference" is less than 0.005, that is as good as you can hope for.
No. Look at the column (v2): the values are truncated to the first decimal place, meaning the truncated amount averages to 0.05.
Otherwise you are correct.

And Excel (LINEST) can do higher-order polynomials as well, as I found out here:
https://stackoverflow.com/questions/10857252/quadratic-and-cubic-regression-in-excel
I'm having some difficulty deciphering the inner workings of Excel, but it does work.
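
Outside Excel, that "LINEST with an x^2 column" idea is just least squares on a design matrix. A sketch with numpy (including the +0.05 half-step adjustment discussed above):
Code:
import numpy as np

x = np.arange(1, 11, dtype=float)
v2 = np.array([14.4, 15, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5]) + 0.05

# Columns [1, x, x^2], analogous to giving LINEST known_x's of x^{1,2}.
X = np.column_stack([np.ones_like(x), x, x**2])
coeffs, *_ = np.linalg.lstsq(X, v2, rcond=None)
c0, c1, c2 = coeffs
print("v2 ~", c2, "* x^2 +", c1, "* x +", c0)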
 
  • #15
FactChecker
martix said:
No. Look at the column (v2): the values are truncated to the first decimal place, meaning the truncated amount averages to 0.05.
Oh, sorry. I must have been looking at some other data.
martix said:
And Excel (LINEST) can do higher-order polynomials as well, as I found out here:
https://stackoverflow.com/questions/10857252/quadratic-and-cubic-regression-in-excel
I'm having some difficulty deciphering the inner workings of Excel, but it does work.
Yes. Practically any statistical package can do multiple linear regression. Some hand-held calculators may not. I am not very familiar with the EXCEL LINEST, but it looks like it does what is called stepwise regression with backward elimination. (see https://support.office.com/en-us/article/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d and https://en.wikipedia.org/wiki/Stepwise_regression )
 
  • #16
Tom.G
That SIECLATOR program originally ran with a 21-term polynomial fit! After looking at the plot and playing with the number of terms, I posted the 3-term result; seemed "close enough." :oldeyes:

Cheers,
Tom
 
  • #17
FactChecker
Tom.G said:
That SIECLATOR program originally ran with a 21-term polynomial fit!
That is kind of disturbing since there are only 10 data points.
 
  • #18
In addition to the data points it also fitted a cat.

All the different tools here give the same fit results for the parabola because I would expect all of them to use the same analytic formula.
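
For reference, the closed-form answer they would all be computing: with design matrix ##X## (columns ##1, x, x^2##) and observations ##y##, ordinary least squares has the unique solution
$$\hat\beta = (X^T X)^{-1} X^T y,$$
so any tool that solves the same normal equations on the same data lands on the same parabola.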
 
  • #19
FactChecker
Ha! Funny cat data fit.

I agree that the different tools would give the same results for the parabola with the same independent variables. But if they use different criteria and methods to determine which independent variables to include, the results may be different. SIECLATOR must be very different if it considered a polynomial of degree 20.
 
Last edited:
  • #20
Tom.G
That 21 term fit in SIECLATOR was probably a leftover from the last time it was used. The documentation states the default as 4 terms; much more reasonable in the real world!
 
