Analytical linear regression: is it possible?


Discussion Overview

The discussion revolves around the concept of finding a line of best fit for a dataset using various mathematical methods, particularly focusing on the merits of different definitions of "best" fit. Participants explore linear regression techniques, including least squares and alternative definitions, while questioning the validity of a proposed formula for fitting data.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants assert that there is no perfect mathematical method for obtaining a line of best fit from a population of data.
  • Others discuss the least squares method as a common approach for finding the best linear approximation, emphasizing the concept of orthogonality in this context.
  • There is a suggestion that the definition of "best" fit is subjective and varies based on the criteria used, such as minimizing the square of errors or using total least squares regression.
  • One participant proposes a definition of best fit based on minimizing absolute errors measured perpendicularly to the line, questioning its robustness compared to other methods.
  • Concerns are raised about the applicability of a proposed formula to certain datasets, particularly regarding division by zero issues.
  • Some participants highlight the distinction between minimizing absolute errors and minimizing perpendicular errors, suggesting that the latter may not yield a unique solution.
  • There is a discussion about the potential for multiple lines to minimize the sum of absolute perpendicular errors, indicating that this may not lead to a single best fit line.
  • One participant expresses doubt about the practicality and acceptance of the proposed formula in real-world applications.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the definition of a "best" fit or the validity of the proposed formula. Multiple competing views on the merits of different fitting methods remain, and the discussion is unresolved regarding the effectiveness of the proposed approach.

Contextual Notes

Participants note limitations in the definitions of best fit and the assumptions underlying various methods. The discussion reveals dependencies on specific criteria for error minimization and the implications of measurement precision on the choice of fitting method.

striphe
I've been told that there exists no perfect mathematical method of obtaining a line of best fit from a population of data.

This doesn't make a whole lot of sense to me, so I have made an attempt at doing so (see the Google Docs link)

https://docs.google.com/document/d/1_Ux4ypYtQcTtuO3bq8fTBCmHIiPuiwQXntJ5Y-KOuc0/edit?hl=en_US

Is there a way of determining whether the formula has any merit?
 
striphe said:
I've been told that there exists no perfect mathematical method of obtaining a line of best fit from a population of data.

This doesn't make a whole lot of sense to me, so I have made an attempt at doing so (see the Google Docs link)

https://docs.google.com/document/d/1_Ux4ypYtQcTtuO3bq8fTBCmHIiPuiwQXntJ5Y-KOuc0/edit?hl=en_US

Is there a way of determining whether the formula has any merit?

In linear algebra, there is a method known as least squares that finds the best linear approximation to some system.

The method uses the idea that the best approximation is the linear object at the closest distance, and this is expressed through orthogonality (if you are familiar with the equation of a linear object, the shortest distance from a point to that object is found via the inner product of the point with the object's "normal").

I can't really say with respect to that formula (it's 11 pm and I'm tired), but chances are that you may be able to prove it via a least-squares formalism.
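
The least-squares idea above can be sketched concretely. This is a minimal illustration of ordinary least squares for a line, not the formula from the linked document: the slope has the closed form cov(x, y)/var(x), and the fitted line passes through the mean point.

```python
def ols_fit(xs, ys):
    """Ordinary least squares: minimize the sum of squared vertical errors.

    Closed form: slope = cov(x, y) / var(x); the intercept is chosen so
    that the line passes through the mean point (x-bar, y-bar).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Collinear data y = 2x + 1 is recovered exactly.
slope, intercept = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
```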
 
striphe said:
Is there a way of determining whether the formula has any merit?

There is no single definition of what makes a curve fit "best" or "perfect", so unless you give a mathematical definition of those terms, nobody can say whether your formula accomplishes a "perfect" or "best" fit.

For various definitions of "best", there are known ways to attain a "best" fit. The most common criterion for "best" is that the curve minimize the sum of the squared errors in the dependent variable. But there is no mathematical theorem that says this is the only criterion for "best". An example of a method that pursues a different definition of "best" is "total least squares" regression.
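
To make the contrast concrete, here is a small sketch (with invented data) comparing the ordinary least-squares slope, which minimizes squared vertical errors, with the total least-squares slope, which minimizes squared perpendicular errors. Both lines pass through the mean point, and the two-dimensional closed forms below are the standard ones.

```python
import math

def slopes_ols_and_tls(pts):
    # Centered second moments of the data.
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    ols = sxy / sxx  # minimizes squared vertical errors
    # Total least squares: minimizes squared perpendicular errors
    # (closed form valid when sxy != 0).
    tls = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return ols, tls

# Invented example where the two criteria disagree.
ols, tls = slopes_ols_and_tls([(0, 0), (1, 2), (2, 1), (3, 3)])
```

On this data the OLS slope is 0.8 while the TLS slope is 1.0, so "best" genuinely depends on the criterion chosen.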
 
Say I state that the definition of the line of best fit is the line which achieves the least absolute errors when measurements from the line are taken perpendicular to the line (i.e. the measurement is at right angles to the line).

Is there some way of determining if this is a more robust definition of best?
 
striphe said:
Say I state that the definition of the line of best fit is the line which achieves the least absolute errors when measurements from the line are taken perpendicular to the line (i.e. the measurement is at right angles to the line).

This is the basic idea in least squares approximation.
 
The use of alternative definitions to OLS has already been explored; the literature is very rich. Just do a Google search.
 
chiro said:
This is the basic idea in least squares approximation.

To clarify that remark: least squares approximation is meant to be a method that is, in some sense, more robust than minimizing the total of the absolute errors. The line that minimizes the absolute errors need not be the same as the line that minimizes the mean square error.
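
A quick numeric check of this, with an invented data set: an optimal least-absolute-(vertical)-errors line can always be chosen to pass through two of the data points, so for tiny data sets one can simply brute-force all pairs. The resulting line differs from the least-squares one.

```python
from itertools import combinations

def l1_line_through_pairs(pts):
    # An optimal least-absolute-errors line can be chosen through two
    # data points, so brute-force all pairs (fine for tiny data sets).
    best = None
    for (x1, y1), (x2, y2) in combinations(pts, 2):
        if x1 == x2:
            continue  # this sketch skips vertical candidate lines
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        err = sum(abs(y - (m * x + b)) for x, y in pts)
        if best is None or err < best[0]:
            best = (err, m, b)
    return best  # (total absolute error, slope, intercept)

# Four collinear points plus a stray one: the least-absolute-errors
# line is y = x, while least squares gives y = 0.2x + 0.8.
err, m, b = l1_line_through_pairs([(0, 0), (1, 1), (2, 2), (3, 3), (4, 0)])
```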
 
striphe,

I don't think your formula can be applied to data sets such as (-1,-1), (0,0), (1,1), since it involves division by zero.
 
Stephen Tashi said:
I don't think your formula can be applied to data sets such as (-1,-1),(0,0),(1,1) since it involves division by zero.

The issue has to do with determining the gradient between the mean and an individual that has the same x and y values as the mean: the gradient is undefined (it could be anything and everything), since the two are at the same position.

As a result, I've had to change the definitions of y, x and n so that they exclude any individuals located at the same coordinates as the mean. Ignoring these individuals once the mean is calculated is the best policy (see the Google Docs link in the first post for more details).
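
The formula itself lives in the linked document, so the sketch below only illustrates the ingredient described here, on the assumption that it computes a gradient between each individual and the mean; any individual sitting exactly on the mean is dropped, as proposed above.

```python
def pointwise_gradients(pts):
    # Gradient between each individual and the mean of the data.
    # Individuals at the same position as the mean give 0/0 and are
    # skipped once the mean has been computed, as described above.
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    grads = []
    for x, y in pts:
        if x == mx and y == my:
            continue  # same position as the mean: gradient undefined
        # note: an individual with x == mx but y != my still divides by zero
        grads.append((y - my) / (x - mx))
    return grads

# For (-1,-1), (0,0), (1,1) the mean is (0, 0); the middle individual
# is skipped and the other two each give gradient 1.
```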

You must understand that minimising the absolute error is different from minimising the absolute perpendicular error. The distance between the line and an individual is measured at 90 degrees to the line, this being the minimum distance between the line and the individual.

Does this technique that I have described exist in the literature?
 
  • #10
striphe said:
You must understand that minimising the absolute error is different from minimising the absolute perpendicular error. The distance between the line and an individual is measured at 90 degrees to the line, this being the minimum distance between the line and the individual.

Does this technique that I have described exist in the literature?

You haven't explained what you are trying to minimize. Are you minimizing the size of the largest perpendicular distance between a data point (x,y) and the line you calculate? Or are you minimizing the average of those distances taken over all data points?

You also haven't explained why you think your formula minimizes whatever it is that you are trying to minimize.

I don't know if your formula exists in the literature. It isn't the mainstream way of fitting lines to data. Have you actually applied this formula to any real-world examples? I doubt many people would want to use your formula, since it is so strongly influenced by small errors in y when (x, y) is near (x̄, ȳ).
 
  • #11
Apply your formula to the dataset (-10.0, -10.0), (-0.1, -0.4), (10.1,10.4)
 
  • #12
I've clearly jumped the gun on this one: the formulation doesn't match up with the minimum sum of absolute errors. The best example of this is to compare the data sets [(-10,0), (-1,-1), (1,1), (10,0)] and [(-20,0), (-1,-1), (1,1), (20,0)]. They both have the same line of best fit under my formula, but you would intuitively expect the latter to have a line of best fit with a lower gradient.

I still do not see why the minimum sum of absolute perpendicular errors isn't a preferable method, as it is determined not by the relationship the individuals have with the x-axis but by the relationship the individuals have with each other.
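
For what it's worth, the criterion can be explored numerically even without a formula. This sketch brute-forces, over a grid of slopes, the line through the mean point that minimizes the sum of absolute perpendicular distances; restricting the search to lines through the mean is a simplifying assumption here, not a proven property of this criterion.

```python
import math

def perp_l1_slope(pts, lo=-2.0, hi=2.0, steps=40001):
    # Brute-force the slope m of a line through the mean point that
    # minimizes the sum of absolute perpendicular distances.
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n

    def total(m):
        # Perpendicular distance from (x, y) to y - my = m * (x - mx).
        return sum(abs((y - my) - m * (x - mx)) for x, y in pts) / math.sqrt(1 + m * m)

    candidates = (lo + i * (hi - lo) / (steps - 1) for i in range(steps))
    return min(candidates, key=total)

# Collinear data y = x: the grid contains m = 1 exactly and the search
# recovers it, since the perpendicular-error sum there is zero.
pts = [(0, 0), (1, 1), (2, 2), (3, 3)]
```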
 
  • #13
striphe said:
I still do not see how the minimum sum of absolute perpendicular errors isn't a preferable method

Then you haven't thought clearly about the question of why one method should be preferable to another.

If you are doing measurements where you are confident that x can be measured precisely and the y measurement is the one that is subject to "random errors" then measuring error along the y-axis instead of perpendicular to the regression line makes more sense.

If you are dealing with a situation where percentage errors are what matters (like calibrating a measuring instrument whose specs state a maximum percentage error in its reading), then percentage error is more important than absolute error.

That said, I agree that it would be interesting to investigate how to find lines that minimize the sum of the absolute perpendicular errors. However, notice that more than one line may have that property. Intuitively, if you have a line that runs through the data points without hitting any of them, and equally many points lie on each side of it, you can move that line perpendicular to itself without changing the sum of the perpendicular errors, as long as you don't cross any data points. So "the line" that minimizes the sum of the absolute perpendicular errors may be one of infinitely many lines with that property.

As I said, I haven't searched for whether algorithms to solve this problem have been written up. This is the type of problem that computers can solve numerically, by trial and error if need be. If you are interested in the problem, you should do some searching. The least-squares perpendicular-error problem is called "total least squares" curve fitting. You might find something if you search for "total absolute error" regression or curve fitting.
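
The shift argument can be checked numerically in the balanced case, i.e. assuming equally many points on each side of the line:

```python
def abs_perp_sum(pts, m, b):
    # Sum of absolute perpendicular distances from pts to y = m*x + b.
    return sum(abs(y - (m * x + b)) for x, y in pts) / (1 + m * m) ** 0.5

# Two points above and two below a horizontal line: sliding the line up
# or down without crossing a point leaves the sum unchanged, so there is
# no unique minimizing line.
pts = [(0, 1), (1, 1), (0, -1), (1, -1)]
sums = [abs_perp_sum(pts, 0.0, b) for b in (-0.5, 0.0, 0.5)]
```

All three candidate lines give the same total, which is exactly the non-uniqueness described above.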
 
  • #14
striphe said:
I've been told that there exists no perfect mathematical method of obtaining a line of best fit from a population of data.

This doesn't make a whole lot of sense to me, so I have made an attempt at doing so (see the Google Docs link)

https://docs.google.com/document/d/1_Ux4ypYtQcTtuO3bq8fTBCmHIiPuiwQXntJ5Y-KOuc0/edit?hl=en_US

Is there a way of determining whether the formula has any merit?

Typically there are two choices for fitting models to data: least squares (LSE) and maximum likelihood (MLE) estimation. The latter is considered better for parameter estimation and for nonlinear data. LSE is often preferred for linear data, particularly when the data are relatively sparse but still sufficient for hypothesis testing.

http://www.minitab.com/en-US/support/answers/answer.aspx?ID=767&langType=1033
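
One way to see the connection between the two: with independent Gaussian errors of known spread, the negative log-likelihood of a candidate line is a constant plus the sum of squared residuals, so the MLE line coincides with the least-squares line. A small sketch with invented data:

```python
import math

def neg_log_likelihood(pts, m, b, sigma=1.0):
    # Gaussian errors: a constant term plus the residual sum of squares,
    # so minimizing this over (m, b) is exactly least squares.
    rss = sum((y - (m * x + b)) ** 2 for x, y in pts)
    return rss / (2 * sigma ** 2) + len(pts) * math.log(sigma * math.sqrt(2 * math.pi))

# Data lying exactly on y = 2x + 1: that line beats nearby candidates.
pts = [(0, 1), (1, 3), (2, 5)]
```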
 
  • #15
Stephen Tashi said:
That said, I agree that it would be interesting to investigate how to find lines that minimize the sum of the absolute perpendicular errors. However, notice that more than one line may have that property. Intuitively, if you have a line that runs through the data points without hitting any of them, and equally many points lie on each side of it, you can move that line perpendicular to itself without changing the sum of the perpendicular errors, as long as you don't cross any data points. So "the line" that minimizes the sum of the absolute perpendicular errors may be one of infinitely many lines with that property.

The thing is, the line of best fit has to go through the mean of the population. I think you will find there are instances where multiple lines can exist, but for the most part they don't.
 
  • #16
striphe said:
The thing is, the line of best fit has to go through the mean of the population. I think you will find there are instances where multiple lines can exist, but for the most part they don't.

In case you overlooked my post: that's why MLE is often preferred. The single most likely line/curve, given the data, is selected by an iterative process which maximizes the likelihood function.
 
  • #17
striphe said:
Say I state that the definition of the line of best fit is the line which achieves the least absolute errors when measurements from the line are taken perpendicular to the line (i.e. the measurement is at right angles to the line).

Is there some way of determining if this is a more robust definition of best?

Yes, this is often a more robust definition of best.

In practice you often have outliers in your measurements, which can have various causes.
These points are weighted inordinately heavily by a least squares fit.
To reduce this effect, the method you propose (least absolute errors) is used.

This is documented in numerical literature.
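
A rough numeric illustration of that outlier effect, with invented data: four points on y = x plus one outlier. The least-squares slope is dragged well above 1, while a least-absolute-errors line (searched over lines through pairs of data points) stays at slope 1.

```python
from itertools import combinations

pts = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 20)]  # last point is an outlier

# Least-squares slope: pulled upward by the outlier.
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
ls_slope = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)

# Least-absolute-errors line, brute-forced over lines through point pairs.
best_err = best_m = best_b = None
for (x1, y1), (x2, y2) in combinations(pts, 2):
    if x1 == x2:
        continue  # skip vertical candidate lines in this sketch
    m = (y2 - y1) / (x2 - x1)
    b = y1 - m * x1
    e = sum(abs(y - (m * x + b)) for x, y in pts)
    if best_err is None or e < best_err:
        best_err, best_m, best_b = e, m, b
# ls_slope comes out at 4.2 here, while best_m remains 1.0
```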
 
  • #18
The OP asked about the possibility of analytic linear regression. I've answered his/her question. Is there any reason why this thread keeps on going? Please read the first sentence in the second paragraph:

http://www.itl.nist.gov/div898/handbook/apr/section4/apr412.htm

With small samples of linear data, LSE is better, but the fully analytic MLE is better in most other cases. LSE is not fully analytic in that it is (usually) a linear approximation to the MLE.
 
