Least squares approximation: Is smaller normal distance always better?

DaleSwanson
I took a LA course in the spring, and was interested by the least squares method for building models. I decided to practice this concept by attempting to build a model that would predict ticket sales for the Mega Millions lottery given the jackpot amount. I have 249 data pairs of jackpot and ticket sales for past data. I tried a bunch of possible models and came up with two that work reasonably well:
Model 1: ##y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2##
Model 2: ##y = \beta_0 e^{\beta_1 x_1}##
where ##x_1## is the jackpot and ##y## is ticket sales.

I found the coefficients for both as well as the normal distance from the vector representing the actual data. After that I put the models into a spreadsheet and had it calculate the predicted sales for all the past drawings I had data for. I then took the average of the percent errors of all those predictions.

The normal distance for model 1 was 52.94, and for model 2 was 0.7446. The percent errors were reversed though. Model 1 had the lower average error of 5.83%, and model 2 had an average error of 6.97%.

I realize the model is best when the normal distance is minimized; however, I'm wondering whether I can compare that distance between models of different dimensions. If not, that would explain why model 1 seems to do a better job based on percent error while having a much larger normal distance. On the other hand, if comparisons between models of different dimensions are valid, why is there a discrepancy between what these two ways of evaluating model accuracy tell me?
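For concreteness, here is a minimal sketch of the kind of fit described above, assuming the data sit in NumPy arrays (the array names and values below are made up for illustration) and that Model 2 is fitted by log-linearization, which is one common way to do it but not necessarily how the original fit was computed:

```python
import numpy as np

# Hypothetical data: jackpot (x, in millions) and ticket sales (y, in millions).
jackpot = np.array([12, 25, 41, 70, 105, 156, 214], dtype=float)
sales = np.array([18, 22, 30, 45, 70, 110, 175], dtype=float)

# Model 1: y = b0 + b1*x + b2*x^2, fitted by ordinary least squares.
A1 = np.column_stack([np.ones_like(jackpot), jackpot, jackpot**2])
coef1, *_ = np.linalg.lstsq(A1, sales, rcond=None)
pred1 = A1 @ coef1

# Model 2: y = b0 * exp(b1*x).  Taking logs gives ln y = ln b0 + b1*x, which is
# linear in the parameters, so it can also be fitted by least squares.  Note
# that this minimizes residuals of ln y, not of y itself, so its residual norm
# lives in different units than Model 1's.
A2 = np.column_stack([np.ones_like(jackpot), jackpot])
coef2, *_ = np.linalg.lstsq(A2, np.log(sales), rcond=None)
pred2 = np.exp(coef2[0]) * np.exp(coef2[1] * jackpot)

# Average of the percent errors for each model.
ape1 = np.mean(np.abs(pred1 - sales) / sales) * 100
ape2 = np.mean(np.abs(pred2 - sales) / sales) * 100
print(f"Model 1: {ape1:.2f}%   Model 2: {ape2:.2f}%")
```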
 

Hey DaleSwanson.

For fitting models in a regression, there are different ways of calculating the regression coefficients.

The different ways depend on the decomposition used. Some methods use what is called a pseudo-inverse.

The different kinds of decomposition include the QR decomposition and the singular value decomposition (SVD). The pseudo-inverse comes into play in situations where the random variables are linearly dependent, or close enough to it that the usual normal-equations matrix becomes ill-conditioned.
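As a sketch of the pseudo-inverse approach (a simple simulated example, not tied to the lottery data):

```python
import numpy as np

# Simulated data: y = 2 + 0.5*x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)

# Design matrix with an intercept column.
A = np.column_stack([np.ones_like(x), x])

# pinv is computed from the SVD of A; it stays well behaved even when the
# columns of A are (nearly) linearly dependent, which is exactly where the
# plain normal-equations solve (A^T A)^{-1} A^T y can blow up.
beta = np.linalg.pinv(A) @ y
print(beta)   # approximately [2.0, 0.5]
```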

You can also do regression using principal components, in what is known as principal component analysis (PCA). PCA effectively creates an orthogonal basis for the random variables: the first linear combination maximizes variance, and each subsequent one maximizes the remaining variance while staying orthogonal to (i.e., uncorrelated with) the ones before it.
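A minimal PCA sketch via the SVD of the centered data (the numbers are simulated, just to show the mechanics):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated variables, 200 observations.
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.4]])

# Center the data, then take the SVD: the right singular vectors are the
# principal directions (an orthogonal basis), and the singular values give
# the variance captured along each direction.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_var = s**2 / (len(X) - 1)

print(Vt)              # orthonormal principal directions, one per row
print(explained_var)   # variance along each direction, largest first
```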

I guess the key thing is asking yourself what to minimize. With least squares, the idea is to minimize the total error of the model against the data, but that is not the whole story, since you can measure that error with many different norms. You can use the standard 1-norm or 2-norm, or you can use other norms whose particular characteristics shape the fit you get.

By engineering a custom norm, you may get features of the fit, tied to that particular choice of norm, that you would not get with the standard 1-norm or 2-norm; if you can construct such a norm with the properties you have in mind, then you are dealing with a more general fitting algorithm.
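A rough sketch of what choosing a different norm does in practice, assuming SciPy is available; the model, data, and starting point are made up, and a general-purpose optimizer is used because only the 2-norm case has a closed-form least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 60)
y = 1.0 + 0.8 * x + rng.normal(scale=0.5, size=x.size)
y[::15] += 8.0          # a few large outliers

def fit(p):
    """Fit y = b0 + b1*x by minimizing the p-norm of the residual vector."""
    def objective(beta):
        r = y - (beta[0] + beta[1] * x)
        return np.linalg.norm(r, ord=p)
    return minimize(objective, x0=[0.0, 0.0], method="Nelder-Mead").x

print("2-norm fit:", fit(2))   # least squares: pulled toward the outliers
print("1-norm fit:", fit(1))   # least absolute deviations: more robust
```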
 
As chiro said, there are many different ways to approximate a value, and they may well give different answers. It is, after all, only an "approximation". The reason we typically use "least squares" is that it relates to our usual idea of "distance" via the formula ##\sqrt{x^2 + y^2 + z^2}##.

Other commonly used measures of distance are ##|x| + |y| + |z|## and ##\max(|x|, |y|, |z|)##.
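For a concrete comparison, here is the same residual vector measured with each of these (values made up):

```python
import numpy as np

r = np.array([1.5, -0.3, 2.0, -4.1])   # a residual vector

print(np.linalg.norm(r, 2))        # 2-norm: square root of the sum of squares
print(np.linalg.norm(r, 1))        # 1-norm: sum of absolute values
print(np.linalg.norm(r, np.inf))   # infinity-norm: largest absolute value
```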
 
Well thanks for both the replies. It would seem my exposure to this is just the tip of a much larger iceberg.

I find the concept of building models to fit data very interesting. What sort of course would cover this in more detail? Would a calc based intro to statistics course be good?
 

For fitting deterministic models, the ideas of numerical analysis are very useful. Within this, there is the study of interpolation.

Interpolation, at the simplest level, is the practice of finding a fit that passes through the data points, given certain conditions. The simplest scheme is the Lagrange polynomial, which is the polynomial of degree at most n − 1 that passes through all n data points.
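A small sketch using SciPy's ready-made Lagrange routine (the data points are made up):

```python
import numpy as np
from scipy.interpolate import lagrange

# Four data points -> the interpolating polynomial has degree at most 3.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 5.0])

poly = lagrange(x, y)        # returns a numpy.poly1d object
print(poly)                  # the polynomial's coefficients
print(poly(x))               # reproduces y exactly at the data points
```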

More advanced interpolation schemes include B-splines and NURBS, which are more flexible but harder to work with. There are also schemes, such as Hermite interpolation, that are supplied with derivative information as well as the function values.
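A cubic B-spline sketch with SciPy, assuming `make_interp_spline` as the entry point (the sine data are just for illustration):

```python
import numpy as np
from scipy.interpolate import make_interp_spline

x = np.linspace(0, 2 * np.pi, 10)
y = np.sin(x)

# Cubic B-spline through the data points (k=3).  End-point derivative
# conditions can also be imposed via the bc_type argument.
spline = make_interp_spline(x, y, k=3)

x_fine = np.linspace(0, 2 * np.pi, 200)
y_fine = spline(x_fine)            # smooth evaluation between the data points
print(spline(x) - y)               # ~0: the spline interpolates the data
```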

One thing you can do is take sample data, smooth it using time-series and expectation techniques, and then use the above framework to create a fit. I wouldn't do this as a general principle, though, because understanding not only the data itself but the context of the data and the underlying processes is far more important than chasing the best fit.

In terms of the general statistical field, you can look at regression and generalized linear models.

If you want to use some more exotic norm rather than the 1-norm or 2-norm, you end up with an optimization problem defined by that norm. Optimization has its own nomenclature and its own theory about when global minima exist. Recall that you are finding a fit with minimum residual, so essentially you are solving an optimization problem either way.
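In symbols, the general problem being described is (my notation: ##f## is the model, ##X## the data, and ##\lVert \cdot \rVert_p## the chosen norm):

$$\hat{\beta} = \arg\min_{\beta} \left\lVert y - f(X, \beta) \right\rVert_p$$

With ##p = 2## this is ordinary least squares; with ##p = 1## it is least absolute deviations; other choices of norm generally require a numerical optimizer.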

Tensor calculus is also a good thing to look at if you are using exotic norms, since under certain conditions you can convert between coordinate systems, which lets you build a bridge between the Euclidean norms (like the 2-norm) and your exotic norm.

So for a general understanding of finding fits, look at linear and generalized linear models, optimization (so you understand how general optimization problems are set up and solved under general conditions), numerical analysis, and tensor theory. In combination, these will give you an idea of how to solve more general kinds of residual minimization and thus get a specific model fit given your general requirements.
 