I have a presentation to give on the least squares method this week, and am finding the least squares method very intuitive with the exception of one thing: why is the square of the offset used, as opposed to the absolute value? I did a bit of searching, but could not find a detailed answer.

Wolfram says this:

I can't quite visualize what this is saying (wouldn't it be continuous if the absolute values were used?)... I was wondering if someone could elaborate on this? Sorry if this is a vague question.

or to minimize the largest absolute value, without bothering about the smaller "errors". (In practice, that means you make several of the absolute values equal to the largest one.) http://en.wikipedia.org/wiki/Remez_algorithm

These are not so "simple" as least squares minimization, and the minimum may not be unique.

Are you asking if the total absolute error is a continuous function with respect to changes in the parameters of the curve that is being fit? Yes, it would be a continuous function in typical situations, such as fitting a polynomial to data.

The passage you quoted said "continuous differentiable". This is redundant since a differentiable function must be continuous. In typical situations, the total absolute error is continuous but not differentiable with respect to changes of the position of the curve. (Recall that the graph of |x| is continuous at x = 0 but not differentiable at x = 0.) The trick of forming simultaneous equations by taking partial derivatives of the total error with respect to the parameters of a curve and setting them equal to zero only finds values for the curve's parameters where the total error has such partial derivatives. If you use the absolute error, you must also consider all values where the partial derivatives fail to exist. This happens whenever the curve goes exactly through one of the data points.
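To see the kink numerically, here is a minimal sketch for the simplest possible "curve", a single constant c fitted to a few numbers. The data values and the helper names (`total_abs_error`, `total_sq_error`, `one_sided_slopes`) are made up for illustration. It compares the slope of the total error just to the left and just to the right of a data point.

```python
def total_abs_error(c, data):
    """Sum of absolute offsets between the constant c and the data."""
    return sum(abs(y - c) for y in data)


def total_sq_error(c, data):
    """Sum of squared offsets between the constant c and the data."""
    return sum((y - c) ** 2 for y in data)


def one_sided_slopes(f, c, data, h=1e-6):
    """Finite-difference slopes of f just to the left and right of c."""
    left = (f(c, data) - f(c - h, data)) / h
    right = (f(c + h, data) - f(c, data)) / h
    return left, right


data = [1.0, 2.0, 4.0]  # hypothetical measurements

# Evaluate both error functions at c = 2.0, which coincides with a data point.
abs_left, abs_right = one_sided_slopes(total_abs_error, 2.0, data)
sq_left, sq_right = one_sided_slopes(total_sq_error, 2.0, data)

print("absolute error slopes:", abs_left, abs_right)  # roughly -1 and +1
print("squared error slopes: ", sq_left, sq_right)    # both roughly -2
```

The absolute error has two different one-sided slopes at c = 2, so "set the derivative to zero" has nothing to work with there, and in this example that kink is exactly where the minimum sits. The squared error has matching slopes on both sides.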

One key reason: It's easier. Think of the joke about the drunk looking for his lost keys underneath the street light, even though he lost them over there in some dark alley.

More importantly, it oftentimes is the correct approach. If the random variables in question truly are normally distributed, a least squares technique is the approach that yields the best possible estimate. A lot of real-world random variables are approximately normal, and this is not just coincidence. Many real-world random variables are random due to a multiplicity of underlying causes. The central limit theorem dictates that those random variables will appear to be approximately normal.
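A short sketch of why the normal assumption leads to squares. The notation is my own assumption, not anything from the quoted text: data points (x_{i}, y_{i}), a model f with parameters θ, and independent errors that are normal with mean zero and a common standard deviation σ. The likelihood of the data is then
[tex]L(\theta) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\left(y_i - f(x_i;\theta)\right)^2}{2\sigma^2}\right)[/tex]
and taking the negative logarithm gives
[tex]-\ln L(\theta) = n \ln\left(\sigma\sqrt{2\pi}\right) + \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i - f(x_i;\theta)\right)^2[/tex]
The first term does not depend on θ, so maximizing the likelihood is the same as minimizing the sum of squared offsets. The square comes straight from the exponent of the normal density.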

One problem with a least squares approach is that it isn't particularly robust. If the data truly are normally distributed you will never, ever see a 100 sigma outlier. Yet they do happen. Data transmission errors, single event upsets, operator error: All of these (and others) can blow the normal assumption away. Dealing with those outliers is important but is also problematic.
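A rough illustration of how little robustness there is, using invented numbers: five points that lie exactly on y = 2x + 1, and the same points with the last value corrupted by +100 (say, a transmission error). The function `fit_line` is just the textbook closed-form least squares line, written out here for the sketch.

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
clean_ys = [1.0, 3.0, 5.0, 7.0, 9.0]      # exactly y = 2x + 1
corrupt_ys = [1.0, 3.0, 5.0, 7.0, 109.0]  # last point corrupted by +100

clean_slope, clean_intercept = fit_line(xs, clean_ys)
bad_slope, bad_intercept = fit_line(xs, corrupt_ys)

print("clean fit:    ", clean_slope, clean_intercept)
print("corrupted fit:", bad_slope, bad_intercept)
```

With these numbers a single bad point drags the fitted slope from 2 to 22 and the intercept from 1 to -19, because the offending offset enters the sum squared.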

On the other hand, if you are trying to find some magical "best fit" of a simpler approximation to a known but complex function, least squares may not be such a good approach. There's nothing random here, for one thing. For another, users of this approximation aren't so much concerned with the average error in a least squares sense as they are concerned with "what's the worst this approximation will do to me?" Here minimizing the maximum absolute deviation is arguably the better way to go. It is also a much harder way to go. The Remes formulation is a bit tough to implement.

Given the nature of the question, it is best to assume this is for an introductory undergrad course. So, keep it short and simple. No reason to introduce the UMVUE/MVUE/BLUE alphabet soup. At least not yet.

Variances add, which is a very useful property. That's probably the main reason.

Suppose you have a scatter plot of XY data and are fitting a line to it. Then with least squares you get a unique solution, but with absolute value you can get an infinite number of solutions. Same thing with means. The mean is the unique value that minimizes the sum of squared deviations, while if you use absolute values the minimizer is a median, and there may be many values that minimize the sum.
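The "same thing with means" remark can be checked on a toy data set. This is only a sketch with made-up numbers: four values, and a handful of candidate "centers" between the two middle values.

```python
def sum_abs(c, data):
    """Sum of absolute deviations of the data from the candidate center c."""
    return sum(abs(y - c) for y in data)


def sum_sq(c, data):
    """Sum of squared deviations of the data from the candidate center c."""
    return sum((y - c) ** 2 for y in data)


data = [1.0, 2.0, 3.0, 4.0]                 # hypothetical data, even count
candidates = [2.0, 2.25, 2.5, 2.75, 3.0]    # centers between the middle two values

abs_costs = [sum_abs(c, data) for c in candidates]
sq_costs = [sum_sq(c, data) for c in candidates]

for c, a, s in zip(candidates, abs_costs, sq_costs):
    print(f"c = {c:4.2f}   sum |y - c| = {a:4.2f}   sum (y - c)^2 = {s:4.2f}")
```

Every candidate between 2 and 3 gives the same sum of absolute deviations, so the absolute-value criterion cannot choose among them. The sum of squares is smallest only at 2.5, the mean.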

On the other hand, any exponent greater than one would give a unique solution. There is reason to think that taking the absolute value to the 1.5th power would be better and avoid all this fiddling around called removing outliers. This wasn't done in the old days because it was too hard to compute, and now squaring is tradition. It's good enough and we've learned to live with the minor flaws.

Not necessarily. You won't get a unique solution if, for example, you are trying to fit three parameters in a problem space that only has two degrees of freedom; one of those supposedly independent parameters is some function of the other two. You might get a solution with real data because the perfect correlation will be masked a bit by noise. The solution will be lousy, however.

There's a lot more to it than that.

Least squares is the same as minimizing the L_{2} norm. The L_{2} norm is an inner product norm. This means that least squares minimization is in the domain of linear algebra, so all of the incredibly powerful and diverse tools of linear algebra can be brought to bear on the problem.

This is not true in general. Using some power p is the same as minimizing the L_{p} norm. The L_{2} norm is the only L_{p} norm that corresponds to a norm generated by an inner product. It takes a lot of finagling (and a lot of iteration) to apply the tools and techniques of linear algebra to something other than least squares minimization.
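One concrete piece of that linear algebra, as a small sketch: for a straight-line fit, the least squares solution comes from a 2x2 linear system (the normal equations) built entirely from inner products, and the resulting residual vector is orthogonal to every column of the model. The data and the helper names `dot` and `least_squares_line` are invented for the example.

```python
def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def least_squares_line(xs, ys):
    """Fit y ~ a + b*x by solving the 2x2 normal equations with Cramer's rule.

    The system is
        [ <1,1>  <1,x> ] [a]   [ <1,y> ]
        [ <x,1>  <x,x> ] [b] = [ <x,y> ]
    where <.,.> is the inner product above and 1 is the all-ones vector.
    """
    ones = [1.0] * len(xs)
    s11, s1x, sxx = dot(ones, ones), dot(ones, xs), dot(xs, xs)
    s1y, sxy = dot(ones, ys), dot(xs, ys)
    det = s11 * sxx - s1x * s1x
    a = (s1y * sxx - s1x * sxy) / det
    b = (s11 * sxy - s1x * s1y) / det
    return a, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 2.0, 4.0]  # hypothetical measurements

a, b = least_squares_line(xs, ys)
residual = [y - (a + b * x) for x, y in zip(xs, ys)]
ones = [1.0] * len(xs)

print("a, b =", a, b)
print("<1, residual> =", dot(ones, residual))  # zero up to rounding
print("<x, residual> =", dot(xs, residual))    # zero up to rounding
```

No iteration is needed: one linear solve gives the answer, and the orthogonality check at the end is the geometric statement that the fit is the projection of the data onto the span of the model columns. Nothing comparable is available for p other than 2.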

Given some number p≥1 and an n-vector x with components x_{i}, i=1,2,…,n, the L_{p} norm (some just use p-norm) of the vector x is
[tex]||\mathbf x||_p \equiv \left(\sum_{i=1}^n\, |x_i|^p\right)^{1/p}[/tex]
There are three cases of particular interest:

p=1: The L_{1} norm is the sum of the absolute values of the components. Think of how a taxicab drives in Manhattan. A taxicab can't go as the crow flies. It instead has to follow the roads. The taxicab's odometer measures the L_{1} norm of the displacement between points A and B. This norm is often called the taxicab norm or Manhattan norm.

p=2: For any even integer power p, there's no need to take the absolute value of the x_{i} terms in the sum and then raise that to the p^{th} power. Just raise the terms to the p^{th} power, period. The L_{2} norm thus reduces to the Euclidean norm. What makes this case special is that the Euclidean norm is induced by the standard definition of the inner product of two vectors. The L_{2} norm is the only one of the L_{p} norms that is induced by an inner product. Inner product spaces are very special, and thus the L_{2} norm is a very special norm.

p→∞: Let x_{max} be the maximum of the absolute values of the components of x. The larger elements (absolute sense) of {x_{i}} will take a predominant role in that sum as p grows ever larger. As p→∞, the only terms that matter are those whose absolute value equals x_{max}. Thus the L_{∞} norm selects the maximum absolute value. This norm is also called the maxnorm because of this.
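The three cases can be tried out directly from the definition. A minimal sketch follows; `p_norm` and `max_norm` are names chosen here for illustration and the vector is arbitrary.

```python
def p_norm(x, p):
    """L_p norm of the vector x, straight from the definition (p >= 1)."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def max_norm(x):
    """L_infinity norm: the largest absolute value among the components."""
    return max(abs(xi) for xi in x)


x = [3.0, -4.0, 1.0]  # arbitrary example vector

print("p = 1:  ", p_norm(x, 1))    # taxicab norm: |3| + |-4| + |1|
print("p = 2:  ", p_norm(x, 2))    # Euclidean norm: sqrt(9 + 16 + 1)
print("p = 10: ", p_norm(x, 10))
print("p = 100:", p_norm(x, 100))  # already very close to the max norm
print("max:    ", max_norm(x))
```

As p grows, the printed values drop toward 4, the largest absolute component, which is the limit described in the p→∞ case.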

This website mentions that chapter 15.1 of Numerical Recipes explains where the square comes from. The Numerical Recipes Book can be viewed online here (e.g. click on the Fortran 90 book).

I have only browsed through the chapter but it seems that the square comes from assuming a Gaussian distribution.