Determining the best fit regression for a set of data

Discussion Overview

The discussion revolves around determining the best fit regression for a set of data, specifically exploring methods to quickly identify whether a linear or non-linear regression model is more appropriate for the relationship between two variables. The conversation includes various statistical approaches and considerations related to model fitting.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests that any curve can be improved upon unless it passes through all observed points, implying that a better fit is always possible.
  • Another participant proposes that plotting the residuals against the predictor variable can indicate whether the relationship is linear or curvilinear.
  • A contribution mentions that nonlinear curves can be fitted using linear regression, and discusses the maximum likelihood fit in relation to Gaussian noise.
  • One participant outlines two approaches for determining the best fit: traditional statistical methods involving assumptions about data distributions and error resampling methods like k-fold cross-validation.
  • There is a mention of a theorem related to balancing variance in estimates and error variance, noting that higher-order fits may yield low error variance but high variability across trials, while lower-order fits may have the opposite characteristics.
  • A later reply speculates that the theorem discussed may relate to the bias-variance trade-off, describing how model complexity affects total error and the potential for overfitting.

Areas of Agreement / Disagreement

Participants express various viewpoints on methods for determining the best fit regression, with no consensus reached on a single approach or method. Multiple competing views remain regarding the effectiveness of different techniques.

Contextual Notes

Some discussions reference specific statistical methods and theorems without providing detailed definitions or explanations, which may limit understanding for those unfamiliar with the concepts. The conversation also highlights the complexity of model fitting and the potential trade-offs involved.

anisotropic
determining the "best fit" regression for a set of data

Is there a test one can perform to quickly determine what type of regression (linear vs. non-linear) will best fit the relationship between two variables?

i.e., how can one quickly determine the most probable relationship between two variables (like a sort of "probability of fit" test)?

(Linear vs. non-linear...)
 


If no restriction is placed on the functional form, one can always find a curve that fits better than a previously given one (sounds contradictory?), unless the given curve already passes through all the observed points.
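A minimal sketch of this point, assuming NumPy and six made-up data points (the values are hypothetical, chosen only for illustration). Raising the polynomial degree never increases the residual sum of squares, and at degree n-1 the curve passes through all n points:

```python
import numpy as np

# Hypothetical data: six noisy samples of a roughly straight line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1, 9.9])

def rss_for_degree(degree):
    """Residual sum of squares of a least-squares polynomial of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

# Degrees 1 through n-1; the last one interpolates every point.
rss = [rss_for_degree(d) for d in range(1, len(x))]
for d, r in enumerate(rss, start=1):
    print(f"degree {d}: RSS = {r:.6f}")
```

This is why a smaller in-sample error on its own says little about which form is "best": the data here are close to a line, yet the degree-5 fit has the lowest error.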
 


I think the easiest and quickest way to determine whether the relationship is linear is to fit a straight line and plot the residuals against the predictor variable. If the plot looks like a curve, then the relationship is curvilinear.
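A rough sketch of the residual check, assuming NumPy and synthetic data with a deliberately quadratic trend (the data and the "thirds" summary are my own illustration, not part of the suggestion above). Instead of drawing the plot, it prints the mean residual in the lower, middle, and upper third of the predictor range, which is a crude numeric stand-in for looking at the plot:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): quadratic trend plus Gaussian noise.
x = np.linspace(0.0, 10.0, 60)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0.0, 0.5, x.size)

# Fit a straight line and compute the residuals against the predictor.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# x is already sorted, so splitting the residuals in order splits them by x range.
thirds = np.array_split(residuals, 3)
means = [float(t.mean()) for t in thirds]
print("mean residual by third of x range:", means)
```

If the relationship were linear, the three means would hover around zero with no pattern. A sign pattern such as positive, negative, positive is the numeric counterpart of a residual plot that "looks like a curve".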
 


You can fit nonlinear curves using linear regression, as long as the model is linear in its coefficients (a polynomial, for example).

http://en.wikipedia.org/wiki/Linear_regression

Linear regression will give the maximum likelihood fit if your noise is Gaussian. If you make a histogram of your estimation errors, it should give you some idea of the statistics of the noise distribution. There are, of course, more advanced statistical tests.
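As a sketch of both ideas, assuming NumPy and a synthetic parabola with Gaussian noise (the data and coefficients are made up for illustration): the curve is nonlinear in x but linear in its coefficients, so ordinary least squares fits it, and a coarse text histogram of the residuals gives a first look at the noise distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed for illustration): a parabola with Gaussian noise.
x = np.linspace(-3.0, 3.0, 200)
y = 2.0 - 1.0 * x + 0.5 * x**2 + rng.normal(0.0, 0.3, x.size)

# Each column of the design matrix is one basis function: 1, x, x^2.
design = np.column_stack([np.ones_like(x), x, x**2])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)

residuals = y - design @ coeffs

# Coarse text histogram of the residuals. Roughly bell-shaped counts are
# consistent with Gaussian noise but do not prove it.
counts, edges = np.histogram(residuals, bins=9)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:+.2f}, {right:+.2f}) {'#' * int(count)}")
```

The same pattern works for any basis (logs, sines, and so on) by changing the columns of `design`.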
 


anisotropic said:
Is there a test one can perform to quickly determine what type of regression (linear vs. non-linear) will best fit the relationship between two variables?

i.e., how can one quickly determine the most probable relationship between two variables (like a sort of "probability of fit" test)?

(Linear vs. non-linear...)

There are two basic approaches to answering this question:

1. The traditional statistics approach is to make various assumptions about the distributions of the data, errors, etc., and calculate some diagnostic summary of the model.

2. The error resampling method involves holding out data to test the model, either once, as in an out-of-sample test, or multiple times, as in k-fold cross-validation or bootstrapping. Note that with error resampling, the model will not necessarily improve (per the test) simply because the model becomes more complex.
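A minimal sketch of the second approach, assuming NumPy and synthetic quadratic data. The helper `kfold_mse` is a hypothetical name written for this illustration, not a library function. It compares polynomial degrees by their average held-out error over k folds:

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Mean squared held-out error of a polynomial fit, averaged over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), k)
    fold_errors = []
    for test_idx in folds:
        # Train on everything outside the current fold.
        train_mask = np.ones(x.size, dtype=bool)
        train_mask[test_idx] = False
        coeffs = np.polyfit(x[train_mask], y[train_mask], degree)
        pred = np.polyval(coeffs, x[test_idx])
        fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(fold_errors))

# Synthetic data (assumed for illustration): quadratic trend plus Gaussian noise.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 4.0, 80)
y = 1.0 + 2.0 * x - 0.8 * x**2 + rng.normal(0.0, 0.5, x.size)

scores = {d: kfold_mse(x, y, d) for d in (1, 2, 3, 6)}
for degree, mse in scores.items():
    print(f"degree {degree}: held-out MSE = {mse:.3f}")
```

Because every degree is scored on data it was not fitted to, adding terms beyond the true form is not rewarded automatically, unlike the in-sample error.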

For a longer explanation of error resampling, see:

http://matlabdatamining.blogspot.com/2008/03/validating-predictive-models.html


-Will Dwinnell
http://matlabdatamining.blogspot.com/
 


Not mentioned yet is that there is a theorem (I forget its name) that gives a method for striking the best balance between the variance in the estimate and the variance in the error of the fit. If you use a higher-order fit, the error of the fit will have very low variance, but the fit will differ a lot from one trial to another. Conversely, if you use a lower-order fit, there may be a lot of variance in the error, but the fit will stay relatively constant from one trial to the next.
 


John Creighto said:
Not mentioned yet is that there is a theorem (I forget its name) that gives a method for striking the best balance between the variance in the estimate and the variance in the error of the fit. If you use a higher-order fit, the error of the fit will have very low variance, but the fit will differ a lot from one trial to another. Conversely, if you use a lower-order fit, there may be a lot of variance in the error, but the fit will stay relatively constant from one trial to the next.

I imagine that you are referring to the bias-variance trade-off? The total of these components would appear in error resampling: as model complexity increases, total error decreases (we reduce one component faster than the other increases) until the optimal fit, after which overfitting sets in and the test error begins to increase again (we begin to trade one component of error for the other). I have yet to find a non-academic who actually calculates the values of these components separately.
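For what it is worth, the two components can be estimated separately by simulation when the true curve is known. The sketch below assumes NumPy, a sine curve as the ground truth, and Gaussian noise. All of these, and the helper name `bias_variance`, are assumptions made for illustration:

```python
import numpy as np

def true_f(x):
    # Assumed ground truth for the simulation; not taken from the thread.
    return np.sin(x)

def bias_variance(degree, n_trials=300, n_points=30, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit over repeated trials."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 2.0 * np.pi, n_points)
    x_eval = np.linspace(0.5, 2.0 * np.pi - 0.5, 50)
    preds = np.empty((n_trials, x_eval.size))
    for t in range(n_trials):
        # Each trial draws fresh noise, refits, and predicts on the same grid.
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, n_points)
        preds[t] = np.polyval(np.polyfit(x_train, y_train, degree), x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_f(x_eval)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 7)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The low-order fit stays nearly the same from trial to trial but is systematically off, while the high-order fit tracks the curve on average but moves more between trials. With real data the true curve is unknown, which is one reason the total held-out error is what usually gets reported.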


-Will Dwinnell
http://matlabdatamining.blogspot.com/
 
