Analytical linear regression: is it possible?

In summary, the conversation discusses the concept of finding a line of best fit for a population of data and the various methods available for doing so. The main focus is on least squares approximation and its advantages over alternatives such as minimizing the sum of absolute errors. The conversation also covers alternative definitions of a "best" fit, including minimizing perpendicular errors rather than vertical errors. It is unclear whether the specific formula mentioned exists in the literature or has been applied to real-world examples.
  • #1
striphe
I've been told that there exists no perfect mathematical method of obtaining a line of best fit from a population of data.

This doesn't make a whole lot of sense to me, so I have made an attempt at doing such (see google docs link)

https://docs.google.com/document/d/1_Ux4ypYtQcTtuO3bq8fTBCmHIiPuiwQXntJ5Y-KOuc0/edit?hl=en_US

Is there a way of determining if the formula has any merit?
 
  • #2
striphe said:
I've been told that there exists no perfect mathematical method of obtaining a line of best fit from a population of data.

This doesn't make a whole lot of sense to me, so I have made an attempt at doing such (see google docs link)

https://docs.google.com/document/d/1_Ux4ypYtQcTtuO3bq8fTBCmHIiPuiwQXntJ5Y-KOuc0/edit?hl=en_US

Is there a way of determining if the formula has any merit?

In linear algebra, there is a method known as least squares that finds the best linear approximation to some system.

The method uses the idea that the best approximation is the linear object at the closest distance, and this is expressed through measures of orthogonality: for a linear object, the shortest distance from a point to that object lies along the object's "normal" and can be computed via an inner product with the point.

I can't really say with respect to that formula (it's 11pm and I'm tired), but chances are that you may be able to prove it via a least-squares formalism.
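As a minimal sketch of what a least-squares line fit looks like in practice (the data values here are made up for illustration, and this is NumPy's standard solver, not the OP's formula):

```python
import numpy as np

# Made-up data: y is roughly linear in x with some noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Build the design matrix [x, 1] and solve for (slope, intercept);
# lstsq minimizes the sum of squared residuals ||A @ beta - y||^2.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(slope, intercept)  # slope ≈ 1.99, intercept ≈ 1.04
```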
 
  • #3
striphe said:
Is there a way of determining if the formula has any merit?

There is no single definition of what makes a curve fit "best" or "perfect", so unless you give a mathematical definition of those terms, nobody can say whether your formula accomplishes a "perfect" or "best" fit.

For various definitions of "best", there are known ways to attain a "best" fit. The most common criterion for "best" is that the curve minimize the sum of squared errors in the dependent variable. But there is no mathematical theorem that says this is the only criterion for "best". An example of a method that pursues a different definition of "best" is "total least squares" regression.
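As an illustration of "total least squares" (which minimizes squared perpendicular distances), here is a sketch on made-up data: the TLS line passes through the centroid of the points and runs along the first principal direction of the centered data, which can be read off an SVD.

```python
import numpy as np

# Made-up 2-D data points.
pts = np.array([[0.0, 0.2], [1.0, 0.9], [2.0, 2.1], [3.0, 2.9]])

# Center the data; the TLS line goes through the centroid and along the
# right singular vector belonging to the largest singular value.
centroid = pts.mean(axis=0)
_, _, vt = np.linalg.svd(pts - centroid)
direction = vt[0]  # unit vector along the TLS line
slope_tls = direction[1] / direction[0]
print(slope_tls)  # ≈ 0.934 for this data
```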
 
  • #4
Say I state that the line of best fit is defined as the line which achieves the least absolute error when measurements from the line are taken perpendicular to the line. (The measurement is at right angles to the line.)

Is there some way of determining if this is a more robust definition of best?
 
  • #5
striphe said:
Say I state that the line of best fit is defined as the line which achieves the least absolute error when measurements from the line are taken perpendicular to the line. (The measurement is at right angles to the line.)

This is the basic idea in least squares approximation.
 
  • #6
Alternative definitions to OLS have already been explored; the literature is very rich. Just do a Google search.
 
  • #7
chiro said:
This is the basic idea in least squares approximation.

To clarify that remark: least squares approximation is intended to be a method that is, in some sense, more robust than minimizing the total of the absolute errors. The line that minimizes the sum of absolute errors need not be the same as the line that minimizes the mean squared error.
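To see that the two criteria really pick different lines, here is a made-up one-parameter example (a line through the origin, with an outlier in y). The least-squares slope and the least-absolute-errors slope come out different, because the outlier is weighted much more heavily by squaring:

```python
import numpy as np

# Made-up data with an outlier in the last y value.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

# Least-squares slope through the origin (closed form).
slope_l2 = (x @ y) / (x @ x)

# Least-absolute-errors slope through the origin, via a coarse grid search.
grid = np.linspace(0.0, 3.0, 3001)
losses = [np.sum(np.abs(y - m * x)) for m in grid]
slope_l1 = grid[int(np.argmin(losses))]
print(slope_l2, slope_l1)  # 1.8 vs 1.0: the outlier drags the L2 fit further
```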
 
  • #8
striphe,

I don't think your formula can be applied to data sets such as (-1,-1), (0,0), (1,1), since it involves division by zero.
 
  • #9
Stephen Tashi said:
I don't think your formula can be applied to data sets such as (-1,-1),(0,0),(1,1) since it involves division by zero.

The issue has to do with determining the gradient between the mean and an individual that has the same x and y values as the mean: since the two occupy the same position, the gradient between them is undefined (it could be anything).

As a result, I've had to change the definitions of y, x and n so that they exclude any individuals at the same co-ordinates as the mean. Ignoring these individuals once the mean is calculated is the best policy (see the Google Doc link in the first post for more details).

You must understand that minimising the absolute error is different from minimising the absolute perpendicular error. The distance between the line and an individual is measured at 90 degrees to the line, this being the minimum distance between the line and the individual.

Does this technique that I have described exist in the literature?
 
  • #10
striphe said:
You must understand that minimising the absolute error is different from minimising the absolute perpendicular error. The distance between the line and an individual is measured at 90 degrees to the line, this being the minimum distance between the line and the individual.

Does this technique that I have described exist in the literature?

You haven't explained what you are trying to minimize. Are you minimizing the size of the largest perpendicular distance between a data point (x,y) and the line you calculate? Or are you minimizing the average of those distances taken over all data points?

You also haven't explained why you think your formula minimizes whatever it is that you are trying to minimize.

I don't know if your formula exists in literature. It isn't the mainstream way of fitting lines to data. Have you actually applied this formula to any real world examples? I doubt many people would want to use your formula since it is so strongly influenced by small errors in [itex] y [/itex] when [itex] (x,y) [/itex] is near [itex] ( \bar{x} , \bar{y} ) [/itex].
 
  • #11
Apply your formula to the dataset (-10.0, -10.0), (-0.1, -0.4), (10.1,10.4)
 
  • #12
I've clearly jumped the gun on this one; the formulation doesn't match up with the minimum sum of absolute errors. The best example would be to compare the data sets [(-10,0), (-1,-1), (1,1), (10,0)] and [(-20,0), (-1,-1), (1,1), (20,0)]: both come out with the same line of best fit, but you would intuitively expect the latter to have a line of best fit with a lower gradient.

I still do not see how the minimum sum of absolute perpendicular errors isn't a preferable method, as it isn't determined by the relationship the individuals have with the x-axis but by the relationship the individuals have with each other.
 
  • #13
striphe said:
I still do not see how the minimum sum of absolute perpendicular errors isn't a preferable method

Then you haven't thought clearly about the question of why one method should be preferable to another.

If you are doing measurements where you are confident that x can be measured precisely and the y measurement is the one that is subject to "random errors" then measuring error along the y-axis instead of perpendicular to the regression line makes more sense.

If you are dealing with a situation where percentage errors are what matter (like calibrating a measuring instrument whose specs state a maximum percentage error in its reading), then percentage error is more important than absolute error.

That said, I agree that it would be interesting to investigate how to find lines that minimize the sum of the absolute perpendicular errors. However, notice that more than one line may have that property. Intuitively, if you have a line that runs through the data points and doesn't hit any of them, you can move that line perpendicular to itself without changing the sum of the perpendicular errors, as long as you don't cross any data points. So "the line" that minimizes the sum of the absolute perpendicular errors may be one of infinitely many lines that have that property.

As I said, I haven't searched for whether algorithms to solve this problem have been written up. This is the type of problem that computers can solve numerically, by trial and error if need be. If you are interested in the problem, you should do some searching. The least-squares perpendicular-error problem is called "total least squares" curve fitting. You might find something if you search for "total absolute error" regression or curve fitting.
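A trial-and-error sketch along those lines, using one of the data sets mentioned earlier in the thread: search over the angle of a line through the centroid, scoring each candidate by its sum of absolute perpendicular distances. (Restricting attention to lines through the centroid is an assumption made here for simplicity, not something this criterion guarantees.)

```python
import numpy as np

# Data set from earlier in the thread.
pts = np.array([[-10.0, 0.0], [-1.0, -1.0], [1.0, 1.0], [10.0, 0.0]])
d = pts - pts.mean(axis=0)

# For a line at angle t through the centroid, a unit normal is
# (-sin t, cos t); each point's perpendicular distance is |d @ normal|.
def loss(t):
    normal = np.array([-np.sin(t), np.cos(t)])
    return np.sum(np.abs(d @ normal))

# Brute-force search over angles in [0, pi).
thetas = np.linspace(0.0, np.pi, 1801, endpoint=False)
best_theta = min(thetas, key=loss)
print(best_theta, np.tan(best_theta))  # best line here is horizontal
```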
 
  • #14
striphe said:
I've been told that their exists no perfect mathematical method of obtaining a line of best fit from a population of data.

This doesn't make a whole lot of sense to me, so I have made an attempt at doing such (see google docs link)

https://docs.google.com/document/d/1_Ux4ypYtQcTtuO3bq8fTBCmHIiPuiwQXntJ5Y-KOuc0/edit?hl=en_US

Is their a way of determining if the formula has any merit?

Typically there are two choices for fitting models to data: least squares estimation (LSE) and maximum likelihood estimation (MLE). The latter is considered better for parameter estimation and for nonlinear data. LSE is often preferred for linear data, particularly when the data are relatively sparse but still sufficient for hypothesis testing.

http://www.minitab.com/en-US/support/answers/answer.aspx?ID=767&langType=1033
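A small sketch of the connection between the two (on made-up data): under i.i.d. Gaussian errors, maximizing the likelihood is equivalent to minimizing the sum of squared residuals, so a brute-force likelihood maximization lands on the least-squares line.

```python
import numpy as np

# Made-up data.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.2, 1.9, 3.1])

# Closed-form least-squares fit of y = m*x + b.
A = np.column_stack([x, np.ones_like(x)])
(m_ls, b_ls), *_ = np.linalg.lstsq(A, y, rcond=None)

# Grid search maximizing the Gaussian log-likelihood (sigma fixed at 1;
# its value does not change where the maximum lies).
best = (-np.inf, None, None)
for m in np.linspace(0.5, 1.5, 101):
    for b in np.linspace(-0.5, 0.5, 101):
        ll = -0.5 * np.sum((y - (m * x + b)) ** 2)
        if ll > best[0]:
            best = (ll, m, b)
print(m_ls, b_ls, best[1], best[2])  # the two estimates agree
```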
 
  • #15
Stephen Tashi said:
That said, I agree that it would be interesting to investigate how to find lines that minimize the sum of the absolute perpendicular errors. However, notice that more than one line may have that property. Intuitively, if you have a line that runs through the data points and doesn't hit any of them, you can move that line perpendicular to itself without changing the sum of the perpendicular errors, as long as you don't cross any data points. So "the line" that minimizes the sum of the absolute perpendicular errors may be one of infinitely many lines that have that property.

The thing is, the line of best fit has to go through the mean of the population. I think you will find there are instances where multiple lines can exist, but for the most part they don't.
 
  • #16
striphe said:
The thing is, the line of best fit has to go through the mean of the population. I think you will find there are instances where multiple lines can exist, but for the most part they don't.

If you overlooked my post, that's why MLE is often preferred. The single most likely line/curve, given the data, is selected by an iterative process which maximizes the likelihood function.
 
  • #17
striphe said:
Say I state that the line of best fit is defined as the line which achieves the least absolute error when measurements from the line are taken perpendicular to the line. (The measurement is at right angles to the line.)

Is there some way of determining if this is a more robust definition of best?

Yes, this is often a more robust definition of best.

In practice you often have outliers in your measurements, which can have various causes.
These points are weighted inordinately heavily by a least-squares fit.
To reduce this effect, the method you propose (least absolute errors) is used.

This is documented in numerical literature.
 
  • #18
The OP asked about the possibility of analytic linear regression. I've answered his/her question. Is there any reason why this thread keeps on going? Please read the first sentence in the second paragraph:

http://www.itl.nist.gov/div898/handbook/apr/section4/apr412.htm

With small samples of linear data, LSE is better, but the fully analytic MLE is better in most other cases. LSE is not fully analytic in that it is (usually) a linear approximation to the MLE.
 

What is analytical linear regression?

Analytical linear regression is a statistical method used to analyze the relationship between two variables. It involves finding a line of best fit that represents the relationship between the variables and using this line to make predictions.

Is it possible to perform analytical linear regression on any type of data?

Yes, analytical linear regression can be performed on a wide range of data, including numerical, categorical, and even binary data. However, the data must satisfy certain assumptions, such as linearity and homoscedasticity, for the results to be valid.

What are the advantages of using analytical linear regression?

Analytical linear regression allows for the identification and quantification of the relationship between two variables. It also provides a way to make predictions and test hypotheses based on the data. Additionally, it is a relatively simple and straightforward method compared to other regression techniques.

Are there any limitations to using analytical linear regression?

Yes, there are some limitations to using analytical linear regression. It assumes a linear relationship between the variables, which may not always be the case. It also assumes that the errors are normally distributed and that the data do not contain outliers. Additionally, simple linear regression analyzes the relationship between only two variables, making it less useful for more complex data.

How can I assess the accuracy of the analytical linear regression model?

The accuracy of an analytical linear regression model can be assessed by looking at the coefficient of determination (R-squared) and the root mean square error (RMSE). R-squared measures the proportion of variability in the data that can be explained by the model, while RMSE measures the average distance between the actual data points and the predicted values. A higher R-squared and a lower RMSE indicate a more accurate model.
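Both quantities are straightforward to compute; a quick sketch on made-up actual and predicted values:

```python
import numpy as np

# Made-up actual values and predictions from a fitted line.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# RMSE: root of the mean squared residual.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 minus the residual sum of squares over the total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(rmse, r_squared)  # ≈ 0.158 and 0.98
```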
