# The least squares approximation - best fit lines revisited !

1. Feb 2, 2006

### cosmicminer

We all know the least squares method to find the best fit line for a collection of random data.
But I wonder if it is the best method.

Suppose we have two random variables y and x that appear to have a linear relation of the type y = ax+b.
What we want is, given the next type x signal to predict as close as possible the value of the y signal.
The well known method tells us to use our experimental readings and minimize the variance functional - so the values of a and b are easily computed.
That is we seek the minimum of

F = sum of [ ( Yi - aXi - b) ^ 2 ]

which works out easily - see your maths book.

But what if I go for the minimum of

F1 = sum of [ | Yi - aXi - b | ]

This one has no analytical solution but that does n't matter because it is very easy to work it out using any crude numerical approach.

So in general we get two different -or somewhat different- best fit lines.
Which one is the best best ?

After all if we go back to the standard normal distribution, the moment X^2 is the variance (sigma ^ 2) and the moment |x| is also proportional to the standard deviation (|x| moment = sigma x sqr ( 2 / pi ) ).

In a real problem using one method and then the other, the results are not likely to be identical.
In terms of probabilistic inference which method is better ?
And I don't believe in calculus books, because maybe what they wanted was to have an analytical solution and print it !

2. Feb 2, 2006

### Hurkyl

Staff Emeritus
You've overlooked a part of the model: the noise! While you might have the relationship

Y = aX + b

what you are measuring is the value

Z = Y + N

where N is the noise. So you don't have (X, Y) pairs, but instead have (X, Z) pairs.

Given an a and b, you can compute the noise from an (X, Z) pair via:

N = Z - aX - b

So, a reasonable way to attack the problem is to find the (a, b) pair that gives the most likely noise.

In other words, you want to maximize P(N).

If you model the noise on a gaussian, then you have:

P(N) = P exp(Q N²)

for some constants P and Q. But, since we can calculate N from our (X, Z) data points, given an assumption on a and b:

P(N) = P exp(Q (Z - aX - b)²)

But wait, we have several (independent) data points: we should really be looking at:

$$\prod_i P(N_i) = \prod_i P e^{Q (Z_i - aX_i - b)^2} = P e^{Q \sum_i (Z_i - aX_i - b)^2}$$

So we want to maximize this expression over all a and b. Look familiar? (Recall that Q < 0)

If you don't like my introduction of the Z variable, you should say that you are modelling your Y's as:

Y = aX + b + N

instead of simply being Y = aX + b. (Since that relationship is clearly not true, from the data!)

So there's the theoretical reason. (P.S. least squares is often given in stats books too)

Of course, empirical testing is important -- you might find that your particular channel is not gaussian noise, is better handled with least-absolute error approximation, instead of least-squared error.

Of course, I have not shown that the resulting estimator for future Z's is a good one -- there's no guarantee that the best a and b yield the best estimator for Z in terms of X. I'd have to spend a bit more time to figure that one out.

Last edited: Feb 2, 2006
3. Feb 2, 2006

### cosmicminer

So you are saying the least squares method is strictly speaking the correct one and not the absolute differences.
Looks like sound argument you made - the difference in real applications is likely to be small though to be readily obvious.
What might be a source of random variables of this type (other than computer generated random numbers) ?

Last edited: Feb 2, 2006
4. Feb 2, 2006

### Staff: Mentor

Least squares is the best method if the error source is Gaussian. A method other than least squares is "better" if the underlying random process is not Guassian. For example, minimizing the sum of the absolute errors produces the "best" fit (in a maximum likelihood sense) if the underlying error process is a two-sided exponential,
$$P(x) = k \exp\left(-\left|\frac{x-x_0}{\sigma}\right|\right)$$

Another kind of problem is finding the "best" representation of some function $f(x)$. Suppose you are asked to find some polynomial approximation of $\sin(x)$ over some interval. Least squares will produce one answer. A "better" answer is produced by minimizing the worst-case error over the interval, because that is the error I care about as a user of your approximuation.

5. Feb 4, 2006

### EnumaElish

If you believe/know that the least sq. estimator is biased for whatever reason (e.g. non-normal error), you should use a max. likelihood estimator. You need to set up the likelihood function then maximize it. It may not always have an analytic solution, though.

With normal (gaussian) errors, least sq. estimator is identical to the max. likelihood estimator.

6. Feb 4, 2006

### Hurkyl

Staff Emeritus
Anything, really, due to the central limit theorem.

Very loosely speaking, if there are enough independent factors that contribute to noise, the result will look normally distributed.

I do want to stress again that empiricism is useful -- practice has a nasty habit of defying theory's predictions, especially on the fine details! If you have the time and ability, it would almost certainly be worth trying out different methods to see which one gives the best results for your application.

Or, if you're really ambitious, you could carry out a detailed analysis of the noise to guide the design of a better way to estimate your y's.