# Least squares - why the square?

I have a presentation to give on the least squares method this week, and am finding the least squares method very intuitive with the exception of one thing: why is the square of the offset used, as opposed to the absolute value? I did a bit of searching, but could not find a detailed answer.

Wolfram says this:
The sum of the squares of the offsets is used instead of the offset absolute values because this allows the residuals to be treated as a continuous differentiable quantity. However, because squares of the offsets are used, outlying points can have a disproportionate effect on the fit, a property which may or may not be desirable depending on the problem at hand.
I can't quite visualize what this is saying (wouldn't it be continuous if the absolute values were used?)... I was wondering if someone could elaborate on this? Sorry if this is a vague question.

Thanks

AlephZero
Homework Helper
Yes, it would be continuous if absolute values were used, but it would not be differentiable if a data point was exactly on the fitted curve.

In fact there are methods to minimize either the sum of the absolute values
http://en.wikipedia.org/wiki/Least_absolute_deviations

or to minimize the largest absolute value, without bothering about the smaller "errors". (In practice, that means you make several of the absolute values equal to the largest one.)
http://en.wikipedia.org/wiki/Remez_algorithm

These are not so "simple" as least squares minimization, and the minimum may not be unique.

Stephen Tashi
wouldn't it be continuous
Are you asking if the total absolute error is a continuous function with respect to changes in the parameters of the curve that is being fit? Yes, it would be a continuous function in typical situations, such as fitting a polynomial to data.

The passage you quoted said "continuous differentiable". This is redundant since a differentiable function must be continuous. In typical situations, the total absolute error is continuous but not differentiable with respect to changes of the position of the curve. (Recall that the graph of |x| is continuous at x = 0 but not differentiable at x = 0.) The trick of forming simultaneous equations by taking partial derivatives of the total error with respect to the parameters of a curve and setting them equal to zero only finds values for the curve's parameters where the total error has such partial derivatives. If you use the absolute error, you must also consider all values where the partial derivatives fail to exist. This happens whenever the curve goes exactly through one of the data points.
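To see the kink numerically, here is a minimal sketch for the simplest possible "curve", a constant c fitted to three made-up data points. The function names and the data are purely illustrative. It compares left and right finite-difference slopes of the total absolute error:

```python
def total_abs_error(c, data):
    """Sum of absolute offsets when the constant c is fitted to the data."""
    return sum(abs(y - c) for y in data)

def one_sided_slopes(f, c, h=1e-6):
    """Left and right finite-difference slopes of f at c."""
    left = (f(c) - f(c - h)) / h
    right = (f(c + h) - f(c)) / h
    return left, right

data = [1.0, 2.0, 4.0]  # illustrative data, not from the thread

def error(c):
    return total_abs_error(c, data)

# Between data points the two slopes agree, so the derivative exists.
slopes_between = one_sided_slopes(error, 3.0)
# At c = 2.0 the fit passes exactly through a data point and the slopes differ.
slopes_at_point = one_sided_slopes(error, 2.0)

print("slopes at c = 3.0:", slopes_between)
print("slopes at c = 2.0:", slopes_at_point)
```

At c = 2.0 the slope jumps from about -1 to about +1, so setting "the derivative" to zero never locates that point, even though it is the minimizer here.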

D H
Staff Emeritus
why is the square of the offset used, as opposed to the absolute value?
One key reason: It's easier. Think of the joke about the drunk looking for his lost keys underneath the street light, even though he lost them over there in some dark alley.

More importantly, it oftentimes is the correct approach. If the random variables in question truly are normally distributed, a least squares technique is the approach that yields the best possible estimate. A lot of real-world random variables are approximately normal, and this is not just coincidence. Many real-world random variables are random due to a multiplicity of underlying causes. The central limit theorem dictates that those random variables will appear to be approximately normal.

One problem with a least squares approach is that it isn't particularly robust. If the data truly are normally distributed you will never, ever see a 100 sigma outlier. Yet they do happen. Data transmission errors, single event upsets, operator error: All of these (and others) can blow the normal assumption away. Dealing with those outliers is important but is also problematic.
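A small illustration of that lack of robustness, using the fact that the mean is the least squares fit of a constant and the median is the least absolute deviations fit of a constant. The numbers, including the single corrupted reading, are invented for the example:

```python
import statistics

clean = [9.8, 10.1, 10.0, 9.9, 10.2]
corrupted = clean + [1000.0]  # one hypothetical transmission error

# How far does each estimate move when the bad point is added?
mean_shift = abs(statistics.mean(corrupted) - statistics.mean(clean))
median_shift = abs(statistics.median(corrupted) - statistics.median(clean))

print("shift in mean (least squares):", mean_shift)
print("shift in median (least absolute deviations):", median_shift)
```

The single bad point drags the least squares estimate far from the bulk of the data, while the absolute-value estimate barely moves.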

On the other hand, if you are trying to find some magical "best fit" of a known but complex function to a simpler approximation, least squares may not be such a good approach. There's nothing random here, for one thing. For another, users of this approximation aren't so much concerned with the average error in a least squares sense as they are concerned with "what's the worst this approximation will do to me?" Here minimizing the maximum absolute deviation is arguably the better way to go. It is also a much harder way to go. The Remes formulation is a bit tough to implement.

Stephen Tashi
a least squares technique is the approach that yields the best possible estimate.
By what definition of "best" is it the "best possible estimate"?

D H
Staff Emeritus
Well, maximum likelihood, for one.

Given the nature of the question, it is best to assume this is for an introductory undergrad course. So, keep it short and simple. No reason to introduce the UMVUE/MVUE/BLUE alphabet soup. At least not yet.

Stephen Tashi
Well, maximum likelihood, for one.
That would be a good point for wheels1888 to include in his talk! He should ask about it if he doesn't feel confident in explaining the details.

I have a presentation to give on the least squares method this week, and am finding the least squares method very intuitive with the exception of one thing: why is the square of the offset used, as opposed to the absolute value? I did a bit of searching, but could not find a detailed answer.

Wolfram says this:

I can't quite visualize what this is saying (wouldn't it be continuous if the absolute values were used?)... I was wondering if someone could elaborate on this? Sorry if this is a vague question.

Thanks
There are several reasons.

Variances add, which is a very useful property. That's probably the main reason.

Suppose you have a scatter plot of XY data and are fitting a line to it. Then with least squares you get a unique solution, but with absolute value you can get an infinite number of solutions. Same thing with means. The mean is the unique value that minimizes the sum of squared deviations, while if you use absolute values there may be many values that minimize the sum (for an even number of points, anything between the two middle ones).
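A quick check of the "same thing with means" remark, as a sketch with made-up data and hypothetical helper names. It evaluates both cost functions on a few candidate centers between the two middle data points:

```python
def sum_abs(c, data):
    """Total absolute deviation of the data from the center c."""
    return sum(abs(x - c) for x in data)

def sum_sq(c, data):
    """Total squared deviation of the data from the center c."""
    return sum((x - c) ** 2 for x in data)

data = [1.0, 2.0, 4.0, 7.0]               # even number of points
candidates = [2.0, 2.5, 3.0, 3.5, 4.0]    # all lie between the middle two

abs_costs = [sum_abs(c, data) for c in candidates]
sq_costs = [sum_sq(c, data) for c in candidates]

print("absolute costs:", abs_costs)  # flat: every candidate is a minimizer
print("squared costs: ", sq_costs)   # single smallest value at the mean, 3.5
```

The absolute cost is the same for every candidate in that interval, while the squared cost has one smallest value, at the mean.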

On the other hand, any exponent greater than one would give a unique solution. There is reason to think that taking the absolute value to the 1.5th power would be better and avoid all this fiddling around called removing outliers. This wasn't done in the old days because it was too hard to compute, and now squaring is tradition. It's good enough and we've learned to live with the minor flaws.

------

D H
Staff Emeritus
On the other hand, any exponent greater than one would give a unique solution.
Not necessarily. You won't get a unique solution if, for example, you are trying to fit three parameters in a problem space that only has two degrees of freedom; one of those supposedly independent parameters is some function of the other two. You might get a solution with real data because the perfect correlation will be masked a bit by noise. The solution will be lousy, however.

This wasn't done in the old days because it was too hard to compute, and now squaring is tradition.
There's a lot more to it than that.

Least squares is the same as minimizing the L2 norm. The L2 norm is an inner product norm. This means that least squares minimization is in the domain of linear algebra, so all of the incredibly powerful and diverse tools of linear algebra can be brought to bear on the problem.

This is not true in general. Using some power p is the same as minimizing the Lp norm. The L2 norm is the only Lp norm that corresponds to a norm generated by an inner product. It takes a lot of finagling (and a lot of iteration) to apply the tools and techniques of linear algebra to something other than least squares minimization.
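As a sketch of what "linear algebra can be brought to bear" means in the simplest case: fitting a line y = a + b*x by least squares reduces to solving a 2x2 linear system (the normal equations) in closed form, with no iteration. The function name and the data below are illustrative only:

```python
def fit_line_least_squares(xs, ys):
    """Fit y = a + b*x by solving the 2x2 normal equations directly."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Determinant of the normal matrix [[n, sx], [sx, sxx]].
    det = n * sxx - sx * sx
    if det == 0:
        # All x values coincide, so the line is not uniquely determined.
        raise ValueError("all x values are identical; the fit is not unique")
    b = (n * sxy - sx * sy) / det
    a = (sy - b * sx) / n
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # lies exactly on y = 1 + 2x
a, b = fit_line_least_squares(xs, ys)
print("intercept:", a, "slope:", b)
```

The zero-determinant branch is the two-parameter version of the degenerate case mentioned earlier, where the parameters are not really independent. For other exponents there is no such closed form, and one typically has to iterate.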

MarcoD
Least squares is the same as minimizing the L2 norm. The L2 norm is an inner product norm. This means that least squares minimization is in the domain of linear algebra, so all of the incredibly powerful and diverse tools of linear algebra can be brought to bear on the problem.

This is not true in general. Using some power p is the same as minimizing the Lp norm. The L2 norm is the only Lp norm that corresponds to a norm generated by an inner product. It takes a lot of finagling (and a lot of iteration) to apply the tools and techniques of linear algebra to something other than least squares minimization.
I would find it interesting to see the presentation, if available in pdf format, and I would like to understand that L2 norm.

D H
Staff Emeritus
I would like to understand that L2 norm.
Given some number p≥1 and an n-vector x with components xi, i=1,2,…,n, the Lp norm (some just use p-norm) of the vector x is
$$||\mathbf x||_p \equiv \left(\sum_{i=1}^n\, |x_i|^p\right)^{1/p}$$
There are three cases of particular interest:
• p=1: The L1 norm is the sum of the absolute values of the components. Think of how a taxicab drives in Manhattan. A taxicab can't go as the crow flies. It instead has to follow the roads. The taxicab's odometer measures the L1 norm of the displacement between points A and B. This norm is often called the taxicab norm or Manhattan norm.

• p=2: For any even integer power p, there's no need to take the absolute value of the xi terms in the sum and then raise that to the pth power. Just raise the terms to the pth power, period. The L2 norm thus reduces to the Euclidean norm. What makes this case special is that the Euclidean norm is induced by the standard definition of the inner product of two vectors. The L2 norm is the only one of the Lp norms that is induced by an inner product. Inner product spaces are very special, and thus the L2 norm is a very special norm.

• p→∞: Let xmax be the maximum of the absolute values of the components of x. The larger elements (absolute sense) of {xi} will take a predominant role in that sum as p grows ever larger. As p→∞, the only terms that matter are those equal to xmax. Thus the L∞ norm selects the maximum absolute value. This norm is also called the max norm because of this.
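The three cases above can be checked with a few lines of Python. This is only a sketch: `lp_norm` is a made-up helper and the vector is arbitrary.

```python
import math

def lp_norm(x, p):
    """Lp norm of a sequence x for p >= 1; p = math.inf gives the max norm."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if math.isinf(p):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0, 1.0]
norms = {p: lp_norm(x, p) for p in (1, 2, 10, 100, math.inf)}

for p, value in norms.items():
    print("p =", p, "norm =", value)
```

For this vector the values decrease from the taxicab norm at p=1 toward the largest absolute component as p grows, which is the limiting behavior described in the last bullet.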

This website mentions that chapter 15.1 of Numerical Recipes explains where the square comes from. The Numerical Recipes Book can be viewed online here (e.g. click on the Fortran 90 book).

I have only browsed through the chapter but it seems that the square comes from assuming a Gaussian distribution.
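A sketch of that argument, under the assumption that each measurement is the model value plus an independent Gaussian error with the same variance. The symbols f, θ and σ are my own notation, not necessarily the book's. Suppose

$$y_i = f(x_i;\theta) + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2) \text{ independently}, \quad i=1,2,\ldots,n$$

Then the likelihood of the parameters θ given the data is

$$L(\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\left(y_i - f(x_i;\theta)\right)^2}{2\sigma^2}\right)$$

and its logarithm is

$$\ln L(\theta) = -\frac{n}{2}\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - f(x_i;\theta)\right)^2$$

The first term does not depend on θ, so maximizing the likelihood over θ is the same as minimizing the sum of squared offsets. If the errors instead had a density proportional to exp(-|ε|/b) (a Laplace distribution), the same steps would give the sum of absolute offsets.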