Least squares approximation: Is smaller normal distance always better?

SUMMARY

The discussion focuses on the application of the least squares method for modeling ticket sales based on jackpot amounts in the Mega Millions lottery. Two models were evaluated: Model 1 (y = β_0 + β_1 x_1 + β_2 x_1^2) and Model 2 (y = β_0 * e^{β_1 * x_1}). Despite Model 1 having a larger normal distance of 52.94 compared to Model 2's 0.7446, it achieved a lower average percent error of 5.83% versus 6.97%. The conversation highlights the complexities of comparing models of different dimensions and the implications of using various norms in regression analysis.

PREREQUISITES
  • Understanding of least squares regression
  • Familiarity with regression coefficients and model fitting
  • Knowledge of statistical norms (1-norm, 2-norm)
  • Basic concepts of numerical analysis and interpolation
NEXT STEPS
  • Explore singular value decomposition and its applications in regression
  • Learn about principal component analysis (PCA) for dimensionality reduction
  • Study advanced interpolation techniques such as BSPLINES and NURBS
  • Investigate optimization problems in the context of general linear models
USEFUL FOR

Data scientists, statisticians, and anyone involved in predictive modeling and regression analysis will benefit from this discussion, particularly those interested in understanding model accuracy and the implications of different fitting techniques.

DaleSwanson
I took a linear algebra course in the spring and was interested in the least squares method for building models. I decided to practice this concept by attempting to build a model that would predict ticket sales for the Mega Millions lottery given the jackpot amount. I have 249 data pairs of jackpot amount and ticket sales from past drawings. I tried a bunch of possible models and came up with two that work reasonably well:
Model 1: y = β_0 + β_1 x_1 + β_2 x_1^2
Model 2: y = β_0 * e^{β_1 * x_1}
where x_1 is the jackpot and y is ticket sales.

I found the coefficients for both as well as the normal distance from the vector representing the actual data. After that I put the models into a spreadsheet and had it calculate the predicted sales for all the past drawings I had data for. I then took the average of the percent errors of all those predictions.

The normal distance for model 1 was 52.94, and for model 2 was 0.7446. The percent errors were reversed though. Model 1 had the lower average error of 5.83%, and model 2 had an average error of 6.97%.

I realize the model is best when the normal distance is minimized; however, I'm wondering if I can compare that distance between models of different dimensions? If not, then that would explain why model 1 seems to do a better job based on percent errors but has a much larger normal distance. On the other hand, if comparisons between models of different dimensions are valid, then why the discrepancy between what these two methods of evaluating model accuracy tell me?
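For readers who want to reproduce the comparison, here is a minimal sketch in Python with NumPy (not the original spreadsheet work). The `jackpot` and `sales` arrays are hypothetical placeholders for the 249 real data pairs, and Model 2 is fitted by one common route, taking logs so that ordinary least squares applies; note that the residual norm of that fit is then measured in log units, so it lives on a different scale than Model 1's.

```python
# Minimal sketch: fit both candidate models by least squares and compute the
# two accuracy measures discussed above.  The data below are placeholders.
import numpy as np

jackpot = np.array([12.0, 25.0, 41.0, 68.0, 105.0, 160.0])  # hypothetical jackpots
sales = np.array([18.0, 21.0, 27.0, 38.0, 61.0, 110.0])     # hypothetical ticket sales

# Model 1: y = b0 + b1*x + b2*x^2 is linear in its coefficients,
# so ordinary least squares applies directly.
A1 = np.column_stack([np.ones_like(jackpot), jackpot, jackpot**2])
beta, *_ = np.linalg.lstsq(A1, sales, rcond=None)
pred1 = A1 @ beta
dist1 = np.linalg.norm(sales - pred1)               # "normal distance" = 2-norm of the residual

# Model 2: y = b0 * exp(b1*x).  Taking logs gives ln(y) = ln(b0) + b1*x,
# which is again linear, so the least-squares fit is done in log space.
A2 = np.column_stack([np.ones_like(jackpot), jackpot])
gamma, *_ = np.linalg.lstsq(A2, np.log(sales), rcond=None)
pred2 = np.exp(A2 @ gamma)
dist2 = np.linalg.norm(np.log(sales) - A2 @ gamma)  # residual norm, but in *log* units

# Average absolute percent error for each model.
ape1 = np.mean(np.abs(sales - pred1) / sales) * 100
ape2 = np.mean(np.abs(sales - pred2) / sales) * 100
print(dist1, dist2, ape1, ape2)
```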
 
DaleSwanson said:
I'm wondering if I can compare that distance between models of different dimensions?

Hey DaleSwanson.

For fitting models in a regression, there are different ways of calculating the regression coefficients.

The different ways depend on the decomposition used. Some methods use what is called a pseudo-inverse.

The different kinds of decomposition include the singular value decomposition, and some others I can't recall offhand. The pseudo-inverse comes into play in situations where the random variables are linearly dependent, or close enough to it that the matrix becomes ill-conditioned.
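As a rough illustration (my own sketch with synthetic data, not a method prescribed in the thread), the pseudo-inverse can be built explicitly from the SVD of the design matrix and reproduces the ordinary least-squares coefficients:

```python
# Pseudo-inverse via the SVD: A+ = V diag(1/s) U^T, and beta = A+ y solves
# the least-squares problem.  Synthetic data, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 200, size=50)
y = 5.0 + 0.3 * x + 0.002 * x**2 + rng.normal(0, 2.0, size=50)

A = np.column_stack([np.ones_like(x), x, x**2])     # quadratic design matrix

# Explicit SVD-based pseudo-inverse (small singular values would be truncated
# in an ill-conditioned problem).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
beta_svd = Vt.T @ np.diag(1.0 / s) @ U.T @ y

# It matches NumPy's built-in pinv and lstsq up to rounding.
beta_pinv = np.linalg.pinv(A) @ y
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(beta_svd, beta_lstsq), np.allclose(beta_pinv, beta_lstsq))
```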

You can also do regression using principal components, in what is known as principal component analysis (PCA). PCA effectively creates an orthogonal basis for the random variables, where the first linear combination maximizes variance and each subsequent one does the same while remaining orthogonal to (uncorrelated with) the earlier ones.
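A small synthetic sketch of that idea, assuming NumPy: center the predictors, take the SVD, and the right singular vectors give orthogonal directions ordered by the variance they capture; a regression could then be run on the leading component scores.

```python
# PCA via the SVD of the centered data matrix.  Synthetic, nearly collinear
# predictors, purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)    # nearly collinear with x1
X = np.column_stack([x1, x2])

Xc = X - X.mean(axis=0)                       # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                            # principal component scores (orthogonal)
explained = s**2 / np.sum(s**2)               # fraction of variance per component
print(explained)                              # the first component dominates here
```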

The key question is what you choose to minimize. With least squares, the idea is to minimize the total error of the fit, but that is not the only option, since there are many possible norms: the standard 1-norm and 2-norm, or other norms whose particular characteristics shape the resulting fit.

By engineering a custom norm, you may get features tied to that particular choice of norm, and an interpretation you would not get with the standard 1-norm or 2-norm. If you can construct such a norm with the properties you have in mind, then you are dealing with a more general fitting algorithm.
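To make the choice-of-norm point concrete, here is a hedged sketch (synthetic data, SciPy's general-purpose optimizer rather than any closed form) that fits the same straight line by minimizing the 1-norm and the 2-norm of the residual vector; the 1-norm fit is noticeably less sensitive to the injected outliers.

```python
# Fitting a line under two different residual norms with a generic optimizer.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=100)
y[::10] += 15.0                                   # a few large outliers

A = np.column_stack([np.ones_like(x), x])

def residual_norm(beta, p):
    """Norm of the residual vector under the chosen p-norm."""
    return np.linalg.norm(y - A @ beta, ord=p)

# Nelder-Mead is derivative-free, so it copes with the non-smooth 1-norm.
beta_2 = minimize(residual_norm, x0=np.zeros(2), args=(2,), method="Nelder-Mead").x
beta_1 = minimize(residual_norm, x0=np.zeros(2), args=(1,), method="Nelder-Mead").x
print(beta_2, beta_1)   # the 1-norm fit is pulled around less by the outliers
```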
 
As chiro said, there are many different ways to approximate a value, and they may well give different answers. It is, after all, only an "approximation". The reason we typically use "least squares" is that it relates to our usual idea of "distance" via the formula $\sqrt{x^2 + y^2 + z^2}$.

Other commonly used distance measures are $|x| + |y| + |z|$ (the 1-norm) and $\max(|x|, |y|, |z|)$ (the infinity norm).
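For concreteness, these three measures correspond to NumPy's vector norms with `ord=2`, `ord=1`, and `ord=np.inf` (a tiny illustrative example, not from the thread):

```python
# The three distance measures above, evaluated for one residual vector.
import numpy as np

r = np.array([3.0, -4.0, 1.0])
print(np.linalg.norm(r, ord=2))       # sqrt(x^2 + y^2 + z^2)  ->  about 5.10
print(np.linalg.norm(r, ord=1))       # |x| + |y| + |z|        ->  8.0
print(np.linalg.norm(r, ord=np.inf))  # max(|x|, |y|, |z|)     ->  4.0
```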
 
Well, thanks for both replies. It would seem my exposure to this is just the tip of a much larger iceberg.

I find the concept of building models to fit data very interesting. What sort of course would cover this in more detail? Would a calc-based intro to statistics course be good?
 
DaleSwanson said:
What sort of course would cover this in more detail?

For fitting deterministic models, the ideas of numerical analysis are very useful. Within that area there is the study of interpolation.

Interpolation, at its simplest, is the practice of finding a function that passes through the data points, given certain conditions. The simplest scheme is the Lagrange polynomial, which constructs a polynomial of degree at most n - 1 that passes through all n data points.
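A quick sketch of that (using SciPy and made-up points): for n points there is a unique interpolating polynomial of degree at most n - 1, and `scipy.interpolate.lagrange` builds it directly.

```python
# Lagrange interpolation through four made-up data points.
import numpy as np
from scipy.interpolate import lagrange

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 5.0])

p = lagrange(x, y)           # degree-3 polynomial through the 4 points
print(np.allclose(p(x), y))  # True: it reproduces the data exactly
```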

The more advanced interpolation schemes are things like B-splines and NURBS, which are more flexible but harder to work with. Some schemes can also be supplied with extra information, such as derivatives.
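As a minimal B-spline sketch (SciPy, made-up data): `scipy.interpolate.make_interp_spline` fits a cubic spline through the points, and boundary or derivative information can be supplied through its `bc_type` argument.

```python
# Cubic B-spline interpolation of made-up data.
import numpy as np
from scipy.interpolate import make_interp_spline

x = np.linspace(0, 10, 11)
y = np.sin(x)

spl = make_interp_spline(x, y, k=3)   # cubic B-spline through the points
xs = np.linspace(0, 10, 101)
print(spl(xs[:5]))                    # evaluate the spline on a finer grid
```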

One thing you can do is take sample data, smooth it using time-series and expectation techniques, and then use the above framework to create a fit. I wouldn't do this as a general principle, though, because understanding not only the data output but also the context of the data and the underlying processes is far more important than trying to find the best fit.

In terms of the general statistical field, you can look at regression and generalized linear models.

If you wanted to use more general norm conditions, with some more exotic norm rather than the 1-norm or 2-norm, then you will end up with an optimization problem under the constraint of that norm. This means you will be doing an optimization problem, with its own nomenclature and ideas about whether global minima exist. Recall that you are finding a fit with minimum residual, so essentially you are solving an optimization problem.

Tensor calculus is also worth looking at if you are using exotic norms, since under certain conditions you can convert between coordinate systems, which means you can build a bridge between the Euclidean norms (like the 2-norm) and your exotic norm.

So, for a general understanding of finding fits, look at linear and generalized linear models, optimization (so you can understand how general optimization problems are set up and solved under general conditions), numerical analysis, and tensor theory. All of these in combination will give an idea of how to solve more general kinds of residual minimization and thus obtain a specific model fit given your general requirements.
 
