Fitting functions based on imperfect data

In summary, the conversation revolves around finding functions to fit a set of data. The data is floored to its last displayed digit; columns v1 and v3 show a linear relationship, while column v2 follows a higher-order relationship. The input parameter is always one order of magnitude smaller than the results. The suggested method for fitting the data is linear regression, particularly using Excel's LINEST tool or other statistical packages. There is also mention of fitting a parabola, or taking the log of the v2 values, for a better fit.
  • #1
martix
I have a set of values and I'm trying to come up with functions to fit that data.
Code:
x    v1    v2    v3
1    112    14.4    52
2    120    15    53
3    128    15.5    54
4    136    16.2    55
5    144    16.9    56
6    152    17.6    57
7    160    18.5    58
8    168    19.4    59
9    176    20.4    60
10    184    21.5    61

Here is what I know about the data:
It is rounded down / floored to the nearest significant digit (i.e. 1 for v1 and v3, 0.1 for v2).
Columns v1 and v3 look linear (e.g. first order polynomial).
Column v2 looks higher order (probably second order, but I'm not sure).
I think (but am not sure) that the input parameters are always 1 order of magnitude less than the results.

Now, for v1 and v3, this makes things fairly straightforward.
Our function is
y = ax + b
a = (vmax - vmin) / 9
b = vmin - a

Because the values are floored, a could vary. Each tabulated value may be up to 1 below the true value, so we have
a = (vmax - vmin) / 9 ±(1/9)
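The endpoint estimate above can be sketched in a few lines of Python (using the v1 column from the table; the variable names are mine, not from the thread):

```python
# v1 values from the table, floored to integers
v1 = [112, 120, 128, 136, 144, 152, 160, 168, 176, 184]

# Endpoint estimates: a = (vmax - vmin) / 9, b = vmin - a
a = (v1[-1] - v1[0]) / 9   # slope from the two endpoints
b = v1[0] - a              # intercept, since x starts at 1

# Flooring means each tabulated value is within 1 of the true value,
# so the slope is only pinned down to within about 1/9
a_tolerance = 1 / 9

print(a, b)  # 8.0 104.0
```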

Two things give me trouble:
1. Column v2
2. A good way to make the function fit the data exactly without brute-forcing it, i.e. messing around until I find values for a and b that fit exactly.

Notes:
  • This isn't a homework question.
  • I don't have a strong mathematical background (I know the basics of calculus, but not enough for practical use). That said, I'm not averse to tackling complex subjects.
 
  • #2
Excel has the necessary functions built in (slope and intercept).
 
  • #3
And LINEST.

Also, do you remember that old Microsoft helicopter joke?

Technically true, and decidedly unhelpful.
 
  • #4
The truncation of numbers on average reduces the number by half of the next place. If a value in the table is 5.6, I would use the value 5.65. Always place 5 in the first truncated digit. You do not say if the x values are truncated or if they are exact. Deal with them appropriately.

Other than that, this looks like a standard linear regression problem. The Excel LINEST tool can be used. There are several other statistical packages with linear regression tools. IMHO, they would be much more acceptable and standard than the process that you suggest.
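For readers without Excel, the same least-squares line that LINEST produces can be reproduced with NumPy. This sketch uses the v1 column and adds 0.5 (half the truncation step for integers) to undo the average bias of flooring, as suggested above:

```python
import numpy as np

x = np.arange(1, 11)
v1 = np.array([112, 120, 128, 136, 144, 152, 160, 168, 176, 184])

# Add half of the truncation step to center the flooring error
v1_adj = v1 + 0.5

# Degree-1 least-squares fit: the same computation LINEST performs
slope, intercept = np.polyfit(x, v1_adj, 1)
print(slope, intercept)  # slope ≈ 8.0, intercept ≈ 104.5
```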
 
  • #5
FactChecker said:
The truncation of numbers on average reduces the number by half of the next place. If a value in the table is 5.6, I would use the value 5.65. Always place 5 in the first truncated digit. You do not say if the x values are truncated or if they are exact. Deal with them appropriately.

Other than that, this looks like a standard linear regression problem. The Excel LINEST tool can be used. There are several other statistical packages with linear regression tools. IMHO, they would be much more acceptable and standard than the process that you suggest.
I get it already. Linear regression. Already using LINEST (seems to give more consistent results than my method).

On x: Exact integers from 1 to 10.

My problem isn't the linear relationship. There are tools for it; it has a name and a long wiki article.

My biggest problem is that one column that doesn't have a linear relationship.
 
  • #6
You can try taking the log of the v2 values or including a second variable in the linear regression whose values are the x values squared. Either approach may give a better fit to the curved v2 data.

PS. I just looked at the plot of ##x^2## and I don't think it will be helpful. But log(v2) looks better.
 
  • #7
Better in what way? Sorry, as I said, I don't have a good math background.
 
  • #8
FactChecker said:
You can try taking the log of the v2 values or including a second variable in the linear regression whose values are the x values squared. Either approach may give a better fit to the curved v2 data.

PS. I just looked at the plot of ##x^2## and I don't think it will be helpful. But log(v2) looks better.
What is wrong with a parabola? It can fit to all numbers nicely within the rounding errors.
 
  • #9
mfb said:
What is wrong with a parabola? It can fit to all numbers nicely within the rounding errors.
I stand corrected. I did the regression and I agree that it looks good. I guess that trying to "eyeball" it was a bad mistake.
 
  • #10
The curve fit below for X vs V2 was done with SIECLATOR Ver. 4.1. The original hosting site is gone, but a Google search seems to find some mirrors that I haven't tried.
As of April 2009, the author's email was: revelator.programs@hotmail.com

Coefficients of Fit
1.40133333333334e+01 . X ^0
3.97272727272719e-01 . X ^1
3.48484848484855e-02 . X ^2
# Ydata Ycalc. Diff.
1 1.44000e+01 1.44454e+01 4.5e-02
2 1.50000e+01 1.49472e+01 -5.2e-02
3 1.55000e+01 1.55187e+01 1.8e-02
4 1.62000e+01 1.61600e+01 -4.0e-02
5 1.69000e+01 1.68709e+01 -2.9e-02
6 1.76000e+01 1.76515e+01 5.1e-02
7 1.85000e+01 1.85018e+01 1.8e-03
8 1.94000e+01 1.94218e+01 2.1e-02
9 2.04000e+01 2.04115e+01 1.1e-02
10 2.15000e+01 2.14709e+01 -2.9e-02

Cheers,
Tom
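For anyone without access to SIECLATOR, the same parabola can be recovered with NumPy's polyfit, which performs the same least-squares computation (the least-squares quadratic is unique, so any correct tool should agree):

```python
import numpy as np

x = np.arange(1, 11)
v2 = np.array([14.4, 15.0, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5])

# Degree-2 least-squares fit; polyfit returns coefficients
# highest power first: [c2, c1, c0]
c2, c1, c0 = np.polyfit(x, v2, 2)
print(c0, c1, c2)  # ≈ 14.0133, 0.39727, 0.034848
```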
 
  • #11
Tom.G said:
Coefficients of Fit
1.40133333333334e+01 . X ^0
3.97272727272719e-01 . X ^1
3.48484848484855e-02 . X ^2
Good. My only recommended change would be to add 0.005 to the constant term. That will split the errors caused by the truncation of the data to two digits after the decimal point. If @martix plans to do much work like this, I recommend that he get a good statistical package. I like R, which is free and well established. R has a learning curve.
 
  • #12
Tom.G said:
Below curve fit for X vs V2 was done with SIECLATOR Ver. 4.1 The original hosting site is gone but a Google search seems to find some that I haven't tried.
I'll take a look at the program; it will probably be useful in general. And I've been reading up on "least squares regression", which is what we're using here.
But yes, Wolfram Alpha reports the exact same result. And it has the benefit of offering a neat visualisation and alternate forms.
As @FactChecker said, I need to add a bit to the constant, though I get the best result with +0.05.
I just can't quite make it fit, there always remains a small difference on one or two datapoints...
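The +0.05 adjustment and the leftover misfit can be checked numerically. A small NumPy sketch (an illustration, not the thread's exact spreadsheet): adding a constant to every value only shifts the fitted constant term, and the truncation noise still leaves small residuals.

```python
import numpy as np

x = np.arange(1, 11)
v2 = np.array([14.4, 15.0, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5])

# Center the truncation error by adding half the step (0.05)
coeffs = np.polyfit(x, v2 + 0.05, 2)  # [c2, c1, c0]

# Only the constant term changes relative to a fit of the raw data
print(coeffs[2])  # ≈ 14.0633 (raw-data constant + 0.05)

# Truncation destroys information, so small residuals always remain
residuals = np.polyval(coeffs, x) - (v2 + 0.05)
print(max(abs(residuals)))  # about 0.05, the truncation step's size
```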
 
  • #13
martix said:
though I get best result with +0.05.
I just can't quite make it fit, there always remains a small difference on one or two datapoints...
Do you mean 0.005? That is what I would expect. Since the amount that is truncated is only known to average 0.005, there is no way to be exact. If the "small difference" is less than 0.005, that is as good as you can hope for.
 
  • #14
FactChecker said:
Do you mean 0.005? That is what I would expect. Since the amount that is truncated is only known to average 0.005, there is no way to be exact. If the "small difference" is less than 0.005, that is as good as you can hope for.
No. Look at the column (v2) - the values are truncated to the first decimal place, meaning the truncated amount averages to 0.05.
Otherwise you are correct.

And Excel (LINEST) can do higher-order polynomials as well, as I found out here:
https://stackoverflow.com/questions/10857252/quadratic-and-cubic-regression-in-excel
I'm having some difficulty deciphering the inner workings of Excel, but it does work.
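The trick in the linked Stack Overflow answer is to hand LINEST powers of x as extra columns, turning a polynomial fit into a multiple linear regression. The equivalent computation with an explicit design matrix in NumPy (a sketch of the idea, not Excel's internals):

```python
import numpy as np

x = np.arange(1.0, 11.0)
v2 = np.array([14.4, 15.0, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5])

# Columns [1, x, x^2]: a quadratic fit expressed as an ordinary
# multiple linear regression, which is what LINEST(y, x^{1,2}) solves
A = np.column_stack([np.ones_like(x), x, x * x])
coeffs, *_ = np.linalg.lstsq(A, v2, rcond=None)
print(coeffs)  # [c0, c1, c2], matching the parabola posted earlier
```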
 
  • #15
martix said:
No. Look at the column (v2) - the values are truncated to the first decimal place, meaning the truncated amount averages to 0.05.
Oh, sorry. I must have been looking at some other data.

martix said:
And excel(LINEST) can do higher-order polynomials as well, as I found out here:
https://stackoverflow.com/questions/10857252/quadratic-and-cubic-regression-in-excel
I'm having some difficulty deciphering the inner workings of excel, but it does work.
Yes. Practically any statistical package can do multiple linear regression. Some hand-held calculators may not. I am not very familiar with Excel's LINEST, but it looks like it does what is called stepwise regression with backward elimination. (see https://support.office.com/en-us/article/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d and https://en.wikipedia.org/wiki/Stepwise_regression )
 
  • #16
That SIECLATOR program originally ran with a 21-term polynomial fit! After looking at the plot and playing with the number of terms, I posted the 3-term result; it seemed "close enough."

Cheers,
Tom
 
  • #17
Tom.G said:
That SIECLATOR program originally ran with a 21-term polynomial fit!
That is kind of disturbing since there are only 10 data points.
 
  • #18
In addition to the data points it also fitted a cat.

All the different tools here give the same fit results for the parabola; I would expect all of them to use the same analytic formula.
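The joke has a real point: with only 10 data points, a 10-term (degree-9) polynomial can interpolate the data exactly, noise and all, which is classic overfitting. A quick NumPy illustration (my own demonstration, not from the thread):

```python
import numpy as np

x = np.arange(1.0, 11.0)
v2 = np.array([14.4, 15.0, 15.5, 16.2, 16.9, 17.6, 18.5, 19.4, 20.4, 21.5])

# A degree-9 polynomial has 10 coefficients, one per data point,
# so it passes through every point: zero residual, no smoothing
coeffs = np.polyfit(x, v2, 9)
residuals = np.polyval(coeffs, x) - v2
print(max(abs(residuals)))  # essentially zero (floating-point noise)
```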
 
  • #19
Ha! Funny cat data fit.

I agree that the different tools would give the same results for the parabola with the same independent variables. But if they use different criteria and methods to determine which independent variables to include, the results may differ. SIECLATOR must be very different if it considered a polynomial of degree 20.
 
  • #20
That 21 term fit in SIECLATOR was probably a leftover from the last time it was used. The documentation states the default as 4 terms; much more reasonable in the real world!
 

1. What is the purpose of fitting functions based on imperfect data?

The purpose of fitting functions based on imperfect data is to find a mathematical model that best represents the relationship between the variables in the data. This model can then be used to make predictions and draw conclusions.

2. What are some common methods for fitting functions based on imperfect data?

Some common methods for fitting functions based on imperfect data include linear regression, polynomial regression, and non-linear regression. These methods involve finding the best-fitting curve or line that passes through the data points.

3. How do you determine the best-fitting function for imperfect data?

The best-fitting function for imperfect data is determined by minimizing the error or the difference between the actual data points and the predicted values from the function. This can be done using various statistical techniques such as least squares or maximum likelihood estimation.
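As a concrete instance, the least-squares line for a small made-up data set can be computed directly from the closed-form solution of that minimization (the data here is illustrative only):

```python
# Simple least-squares line fit from the closed-form normal equations
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Minimizing sum((y - (a*x + b))^2) over a and b gives:
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n
print(a, b)  # slope ≈ 1.99, intercept ≈ 0.05
```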

4. What are some challenges in fitting functions based on imperfect data?

One of the main challenges in fitting functions based on imperfect data is dealing with outliers or data points that do not follow the general trend. These points can significantly affect the accuracy of the fitted function. Another challenge is choosing the appropriate model and its parameters, as an overly complex model can lead to overfitting while an overly simple model may not capture the true relationship in the data.

5. How do you evaluate the performance of a fitted function on imperfect data?

The performance of a fitted function on imperfect data can be evaluated by calculating the residual sum of squares (RSS) or the root mean square error (RMSE). These metrics measure the overall difference between the actual data points and the predicted values from the function. Additionally, visualizing the fitted function and its predicted values against the actual data can also provide insights into its performance.
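For example, RMSE for a handful of actual and predicted values (the numbers below are made up for illustration):

```python
import numpy as np

y_actual = np.array([14.4, 15.0, 15.5, 16.2, 16.9])
y_pred = np.array([14.45, 14.95, 15.52, 16.16, 16.87])

# RMSE: square the residuals, average them, take the square root
rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
print(rmse)  # ≈ 0.0397
```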
