Multiple least squares regression

zzmanzz · Apr 11, 2013

Homework Statement

Design a regression model for the following dataset:

y, trial, x1, x2, x3

0.08536, 1, -1, -1, -1.00000
0.09026, 2, -1, -1, -1.00000
0.10188, 1, -1, -1, -0.33333
0.09301, 2, -1, -1, -0.33333
0.10362, 1, -1, -1, 0.33333
0.09920, 2, -1, -1, 0.33333
0.11033, 1, -1, -1, 1.00000
0.10744, 2, -1, -1, 1.00000
0.10172, 1, -1, 0, -1.00000
0.09360, 2, -1, 0, -1.00000
0.10800, 1, -1, 0, -0.33333
0.11685, 2, -1, 0, -0.33333
0.11002, 1, -1, 0, 0.33333
0.11221, 2, -1, 0, 0.33333
0.11533, 1, -1, 0, 1.00000
0.12328, 2, -1, 0, 1.00000
0.21908, 1, -1, 1, -1.00000
0.19675, 2, -1, 1, -1.00000
0.22744, 1, -1, 1, -0.33333
0.21138, 2, -1, 1, -0.33333
0.28118, 1, -1, 1, 0.33333
0.26413, 2, -1, 1, 0.33333
0.32416, 1, -1, 1, 1.00000
0.30590, 2, -1, 1, 1.00000
0.32390, 1, 1, -1, -1.00000
0.34938, 2, 1, -1, -1.00000
0.13669, 1, 1, -1, -0.33333
0.12953, 2, 1, -1, -0.33333
0.07987, 1, 1, -1, 0.33333
0.07884, 2, 1, -1, 0.33333
0.05959, 1, 1, -1, 1.00000
0.06172, 2, 1, -1, 1.00000
0.21624, 1, 1, 0, -1.00000
0.21925, 2, 1, 0, -1.00000
0.11777, 1, 1, 0, -0.33333
0.11127, 2, 1, 0, -0.33333
0.07338, 1, 1, 0, 0.33333
0.07354, 2, 1, 0, 0.33333
0.05601, 1, 1, 0, 1.00000
0.05622, 2, 1, 0, 1.00000
0.69966, 1, 1, 1, -1.00000
1.58131, 2, 1, 1, -1.00000
0.18522, 1, 1, 1, -0.33333
0.17043, 2, 1, 1, -0.33333
0.09530, 1, 1, 1, 0.33333
0.10060, 2, 1, 1, 0.33333
0.06655, 1, 1, 1, 1.00000
0.06814, 2, 1, 1, 1.00000

Homework Equations

I loaded the dataset and calculated

c = (X'*X)^(-1)*X' * y

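where

X = [ones X1 X2 X3]

is the 48×4 data matrix (a column of ones followed by the three input columns), y is the 48×1 column vector of responses, and c = [c_0 c_1 c_2 c_3]' is the column vector of coefficients to solve for.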

The Attempt at a Solution

I got the regression coefficients but the predictions are terrible for my model. Am I doing something wrong?
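For reference, the calculation described above can be sketched in NumPy. This is only a sketch, not the original poster's code: the choice of Python/NumPy, the variable names, and the rebuilding of the x columns from the factorial layout of the table are all assumptions. It solves the same normal equations, cross-checks them against `np.linalg.lstsq` (usually the numerically safer route), and compares the residual size with the spread of y.

```python
import numpy as np

# y values from the table, in posted order
# (x1 varies slowest, then x2, then x3, with two trials per setting).
y = np.array([
    0.08536, 0.09026, 0.10188, 0.09301, 0.10362, 0.09920, 0.11033, 0.10744,
    0.10172, 0.09360, 0.10800, 0.11685, 0.11002, 0.11221, 0.11533, 0.12328,
    0.21908, 0.19675, 0.22744, 0.21138, 0.28118, 0.26413, 0.32416, 0.30590,
    0.32390, 0.34938, 0.13669, 0.12953, 0.07987, 0.07884, 0.05959, 0.06172,
    0.21624, 0.21925, 0.11777, 0.11127, 0.07338, 0.07354, 0.05601, 0.05622,
    0.69966, 1.58131, 0.18522, 0.17043, 0.09530, 0.10060, 0.06655, 0.06814,
])

# Rebuild the x columns from the factorial layout instead of retyping them.
grid = np.array([(a, b, c)
                 for a in (-1.0, 1.0)
                 for b in (-1.0, 0.0, 1.0)
                 for c in (-1.0, -0.33333, 0.33333, 1.0)
                 for _trial in (1, 2)])
x1, x2, x3 = grid.T

# Design matrix X = [ones X1 X2 X3], 48 x 4.
X = np.column_stack([np.ones_like(y), x1, x2, x3])

# Normal equations, exactly as written in the post.
c_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Same least-squares problem solved without forming an explicit inverse.
c, *_ = np.linalg.lstsq(X, y, rcond=None)

# How large are the residuals compared with the spread of y?
y_hat = X @ c
rss = float(np.sum((y - y_hat) ** 2))
tss = float(np.sum((y - y.mean()) ** 2))
r2 = 1.0 - rss / tss
rmse = float(np.sqrt(rss / y.size))

print("coefficients [c_0 c_1 c_2 c_3]:", c)
print("R^2:", r2)
print("RMSE:", rmse, " range of y:", y.max() - y.min())
```

If the two coefficient vectors agree, the arithmetic is fine and any poor predictions come from the form of the model rather than from the solution method.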

BruceW · Apr 13, 2013

Your method looks correct to me. It is not surprising that the predictions are not very good: a correctly fitted model can still be terrible at making predictions if its form does not match the data. In this case, there are 3 'dimensions', and the input variables take on only a few different values. Maybe you can try plotting the data, looking at one dimension at a time, to see intuitively whether it looks linear or not.

Edit: by 3 'dimensions' I mean the 3 input variables. For example, (temperature, size, colour) might be the three 'dimensions', i.e. the 3 input variables that correspond to a particular value of y. I mention this because I am not sure how widely the word 'dimensions' is used in this context.
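As a rough numerical stand-in for the one-dimension-at-a-time plot suggested above, one could tabulate the mean of y at each level of each input. This is a sketch under assumptions: Python/NumPy and the names `grid` and `level_means` are not from the thread, and the x columns are rebuilt from the factorial layout of the posted table.

```python
import numpy as np

# y values from the table, in posted order
# (x1 varies slowest, then x2, then x3, with two trials per setting).
y = np.array([
    0.08536, 0.09026, 0.10188, 0.09301, 0.10362, 0.09920, 0.11033, 0.10744,
    0.10172, 0.09360, 0.10800, 0.11685, 0.11002, 0.11221, 0.11533, 0.12328,
    0.21908, 0.19675, 0.22744, 0.21138, 0.28118, 0.26413, 0.32416, 0.30590,
    0.32390, 0.34938, 0.13669, 0.12953, 0.07987, 0.07884, 0.05959, 0.06172,
    0.21624, 0.21925, 0.11777, 0.11127, 0.07338, 0.07354, 0.05601, 0.05622,
    0.69966, 1.58131, 0.18522, 0.17043, 0.09530, 0.10060, 0.06655, 0.06814,
])

grid = np.array([(a, b, c)
                 for a in (-1.0, 1.0)
                 for b in (-1.0, 0.0, 1.0)
                 for c in (-1.0, -0.33333, 0.33333, 1.0)
                 for _trial in (1, 2)])
x1, x2, x3 = grid.T

# Mean response at each level of each input, averaging over the other two.
level_means = {}
for name, x in (("x1", x1), ("x2", x2), ("x3", x3)):
    level_means[name] = {float(v): float(y[x == v].mean()) for v in np.unique(x)}
    for v, mean_y in level_means[name].items():
        print(f"{name} = {v:+.5f}: mean y = {mean_y:.5f}")
```

If the level means of an input do not fall roughly on a straight line, a purely linear term in that input is unlikely to describe it well. Averaging over the other inputs hides interactions, so this is only a first look.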

Ray Vickson · Apr 13, 2013

zzmanzz said:

I got the regression coefficients but the predictions are terrible for my model. Am I doing something wrong?

When you have such limited ranges of variables (values like -1, 0, 1, etc.) it starts to look like an experimental design problem for a *quadratic* fit. I suggest you re-run the model with added columns ##x_2^2, x_3^2, x_1 x_2, x_1 x_3, x_2 x_3##. That will give you a total of 1 + 3 + 2 + 3 = 9 terms in your expression for y: the constant, 3 linear terms, 2 squared terms and 3 cross-product terms. If you have the x-values already, you can (depending on the software you use) calculate those extra columns and add them to the data set.

Note: the data set has only the two values -1 and +1 for ##x_1##, so ##x_1^2 = 1## in every row and the data cannot distinguish between 1 and ##x_1^2##; that is why we omit ##x_1^2##.
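A sketch of the 9-term fit suggested above, assuming Python/NumPy (the thread does not say which software is used); the helper `rss` and the matrix names are hypothetical. It also checks the note about ##x_1^2##: appending that column leaves the rank of the design matrix unchanged.

```python
import numpy as np

# y values from the table, in posted order
# (x1 varies slowest, then x2, then x3, with two trials per setting).
y = np.array([
    0.08536, 0.09026, 0.10188, 0.09301, 0.10362, 0.09920, 0.11033, 0.10744,
    0.10172, 0.09360, 0.10800, 0.11685, 0.11002, 0.11221, 0.11533, 0.12328,
    0.21908, 0.19675, 0.22744, 0.21138, 0.28118, 0.26413, 0.32416, 0.30590,
    0.32390, 0.34938, 0.13669, 0.12953, 0.07987, 0.07884, 0.05959, 0.06172,
    0.21624, 0.21925, 0.11777, 0.11127, 0.07338, 0.07354, 0.05601, 0.05622,
    0.69966, 1.58131, 0.18522, 0.17043, 0.09530, 0.10060, 0.06655, 0.06814,
])

grid = np.array([(a, b, c)
                 for a in (-1.0, 1.0)
                 for b in (-1.0, 0.0, 1.0)
                 for c in (-1.0, -0.33333, 0.33333, 1.0)
                 for _trial in (1, 2)])
x1, x2, x3 = grid.T
ones = np.ones_like(y)


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


# Linear model (4 terms) and the suggested quadratic model (1 + 3 + 2 + 3 = 9 terms).
X_lin = np.column_stack([ones, x1, x2, x3])
X_quad = np.column_stack([ones, x1, x2, x3,
                          x2**2, x3**2,
                          x1*x2, x1*x3, x2*x3])

# Appending x1^2 adds nothing: it equals the column of ones in every row.
X_with_x1sq = np.column_stack([X_quad, x1**2])

print("RSS linear:   ", rss(X_lin, y))
print("RSS quadratic:", rss(X_quad, y))
print("rank without x1^2:", np.linalg.matrix_rank(X_quad))
print("rank with x1^2:   ", np.linalg.matrix_rank(X_with_x1sq))
```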

BruceW · Apr 13, 2013

mm, it depends on what kind of behaviour we would believe the underlying system actually has. Using a quadratic fit might give no better than a linear fit (maybe even worse). Trying a quadratic fit is a good way to extend the homework though.

(@zzmanzz) Also, when you say the predictions are 'terrible', they shouldn't be ridiculously far off. How bad are the predictions compared to the range of the y data?

Ray Vickson · Apr 14, 2013

BruceW said:

mm, it depends on what kind of behaviour we would believe the underlying system actually has. Using a quadratic fit might give no better than a linear fit (maybe even worse). Trying a quadratic fit is a good way to extend the homework though.

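The best quadratic fit cannot be worse than a linear fit, at least as measured by the residual sum of squares on the data being fitted. The linear model is a special case of the quadratic one: if the best quadratic happens to have zero coefficients for the squared and product terms, it reduces to the linear fit.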

Anyway, no matter what the exact form is, when the x-values are limited one cannot tell the difference between a more general model and a quadratic. The variable ##x_1## just takes the two values +1 and -1, so any ##f(x_1)## is indistinguishable from a linear function. The variable ##x_2## takes only the three values -1, 0 and +1, so any ##f(x_2)## is indistinguishable from a quadratic. The variable ##x_3## takes 4 values, so one can think of going to a cubic, but any other function ##f(x_3)## would give the same results as that cubic.
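One way to check the counting argument above is to look at the rank of the matrix of powers of each input. This is a sketch assuming Python/NumPy; the name `ranks` and the choice of powers 0 through 5 are illustrative only. With k distinct levels, only k of the power columns are linearly independent, so higher powers add nothing the data can detect.

```python
import numpy as np

# x columns rebuilt from the factorial layout of the posted table (48 rows).
grid = np.array([(a, b, c)
                 for a in (-1.0, 1.0)
                 for b in (-1.0, 0.0, 1.0)
                 for c in (-1.0, -0.33333, 0.33333, 1.0)
                 for _trial in (1, 2)])
x1, x2, x3 = grid.T

ranks = {}
for name, x in (("x1", x1), ("x2", x2), ("x3", x3)):
    # Columns 1, x, x^2, ..., x^5 evaluated on all 48 rows.
    powers = np.vander(x, N=6, increasing=True)
    ranks[name] = int(np.linalg.matrix_rank(powers))
    print(f"{name}: {np.unique(x).size} distinct levels, "
          f"rank of [1, x, ..., x^5] = {ranks[name]}")
```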

Where we have some wriggle room is in the "interaction" terms: we could include terms like ##x_1 x_2^2, x_1 x_3^2, x_2 x_3^2, x_2^2 x_3, x_1 x_2 x_3##, etc. Including them would improve the fit (or at least not worsen it), but whether or not the coefficients would be statistically significant is another matter.
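A sketch of adding the interaction terms listed above to the 9-term quadratic model, assuming Python/NumPy; the helper `rss` and the names `X_quad` and `X_ext` are hypothetical. It only compares residual sums of squares and residual degrees of freedom, and it does not test whether the extra coefficients are statistically significant.

```python
import numpy as np

# y values from the table, in posted order
# (x1 varies slowest, then x2, then x3, with two trials per setting).
y = np.array([
    0.08536, 0.09026, 0.10188, 0.09301, 0.10362, 0.09920, 0.11033, 0.10744,
    0.10172, 0.09360, 0.10800, 0.11685, 0.11002, 0.11221, 0.11533, 0.12328,
    0.21908, 0.19675, 0.22744, 0.21138, 0.28118, 0.26413, 0.32416, 0.30590,
    0.32390, 0.34938, 0.13669, 0.12953, 0.07987, 0.07884, 0.05959, 0.06172,
    0.21624, 0.21925, 0.11777, 0.11127, 0.07338, 0.07354, 0.05601, 0.05622,
    0.69966, 1.58131, 0.18522, 0.17043, 0.09530, 0.10060, 0.06655, 0.06814,
])

grid = np.array([(a, b, c)
                 for a in (-1.0, 1.0)
                 for b in (-1.0, 0.0, 1.0)
                 for c in (-1.0, -0.33333, 0.33333, 1.0)
                 for _trial in (1, 2)])
x1, x2, x3 = grid.T
ones = np.ones_like(y)


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


quad_cols = [ones, x1, x2, x3, x2**2, x3**2, x1*x2, x1*x3, x2*x3]
extra_cols = [x1 * x2**2, x1 * x3**2, x2 * x3**2, x2**2 * x3, x1 * x2 * x3]

X_quad = np.column_stack(quad_cols)
X_ext = np.column_stack(quad_cols + extra_cols)

rss_quad = rss(X_quad, y)
rss_ext = rss(X_ext, y)

# Each added column costs one residual degree of freedom.
print("quadratic: RSS =", rss_quad, " residual dof =", y.size - X_quad.shape[1])
print("extended:  RSS =", rss_ext, " residual dof =", y.size - X_ext.shape[1])
```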

BruceW · Apr 14, 2013

Ray Vickson said:

The best quadratic fit cannot be worse than a linear fit

Yeah, I wasn't clear on what I meant. I mean that the prediction accuracy of the best quadratic fit can be worse than that of the best linear fit (which is the most meaningful way to test a model, I think). For example, if the underlying data is y=mx+c with some noise added, then the linear fit will give closer predictions, on average.
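The y=mx+c example can be illustrated with a small simulation. This is a sketch with made-up values: the slope, intercept, noise level, design points and number of replications below are assumptions chosen for illustration, not taken from the thread. Each replication fits a line and a quadratic to the same noisy data and measures how far each fit is from the true line at the design points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" system: a straight line observed with Gaussian noise.
m, c0, sigma = 2.0, 1.0, 0.5
x = np.linspace(-1.0, 1.0, 10)      # fixed design points
truth = m * x + c0

n_rep = 2000
err_lin = np.empty(n_rep)
err_quad = np.empty(n_rep)

for i in range(n_rep):
    y_obs = truth + rng.normal(0.0, sigma, size=x.size)

    # Least-squares polynomial fits of degree 1 and degree 2 to the same data.
    fit_lin = np.polyval(np.polyfit(x, y_obs, 1), x)
    fit_quad = np.polyval(np.polyfit(x, y_obs, 2), x)

    # Mean squared distance from the true (noise-free) line.
    err_lin[i] = np.mean((fit_lin - truth) ** 2)
    err_quad[i] = np.mean((fit_quad - truth) ** 2)

print("mean squared error vs truth, linear fit:   ", err_lin.mean())
print("mean squared error vs truth, quadratic fit:", err_quad.mean())
```

In this setup the quadratic fit has the smaller residuals on the noisy data, yet it sits further from the true line, because the extra coefficient is fitted to noise. For a fixed design with p fitted coefficients and n points, the expected squared error at the design points is ##\sigma^2 p / n##, so it grows with every unnecessary term.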

Ray Vickson said:

Anyway, no matter what the exact form is, when the x-values are limited one cannot tell the difference between a more general model and a quadratic.

Ah, yeah, that is a good point. As long as we treat the dimensions independently, if the data in a certain dimension take on only n values, then a polynomial fit of degree n-1 (for that dimension) will give the same predictions as a polynomial of any higher order (assuming the data values we have already seen are the only possible data values; I don't know where his data is coming from, but it does look that way).

So in this case, if we use a polynomial fit of higher order, we never make our model worse: it will always give predictions at least as good as a lower-order polynomial (under the conditions in my last paragraph). I guess this is a consequence of the fact that in this situation, a discrete distribution should really be used. But the problem was to use a regression model and, AFAIK, that implies using a continuous distribution, so I guess he should not use a discrete distribution.
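One reading of the "discrete" alternative is to treat each of the 24 input settings as its own category and predict the mean of its two trials. The sketch below assumes that reading, and Python/NumPy; the helper `rss` and the names `cells` and `pure_error_ss` are hypothetical. No function of the inputs can have a smaller residual sum of squares on this data than the cell means, so the result is a floor for any polynomial model.

```python
import numpy as np

# y values from the table, in posted order
# (x1 varies slowest, then x2, then x3, with two trials per setting).
y = np.array([
    0.08536, 0.09026, 0.10188, 0.09301, 0.10362, 0.09920, 0.11033, 0.10744,
    0.10172, 0.09360, 0.10800, 0.11685, 0.11002, 0.11221, 0.11533, 0.12328,
    0.21908, 0.19675, 0.22744, 0.21138, 0.28118, 0.26413, 0.32416, 0.30590,
    0.32390, 0.34938, 0.13669, 0.12953, 0.07987, 0.07884, 0.05959, 0.06172,
    0.21624, 0.21925, 0.11777, 0.11127, 0.07338, 0.07354, 0.05601, 0.05622,
    0.69966, 1.58131, 0.18522, 0.17043, 0.09530, 0.10060, 0.06655, 0.06814,
])

grid = np.array([(a, b, c)
                 for a in (-1.0, 1.0)
                 for b in (-1.0, 0.0, 1.0)
                 for c in (-1.0, -0.33333, 0.33333, 1.0)
                 for _trial in (1, 2)])
x1, x2, x3 = grid.T
ones = np.ones_like(y)


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


# The two trials of each setting are adjacent rows, so reshape to 24 cells x 2 trials.
cells = y.reshape(24, 2)
cell_means = cells.mean(axis=1)

# "Pure error": scatter of the replicates around their own cell mean.
pure_error_ss = float(np.sum((cells - cell_means[:, None]) ** 2))

X_quad = np.column_stack([ones, x1, x2, x3, x2**2, x3**2, x1*x2, x1*x3, x2*x3])
rss_quad = rss(X_quad, y)

print("pure-error sum of squares (cell-means model):", pure_error_ss)
print("RSS of the 9-term quadratic model:           ", rss_quad)
```

Note that the two trials at x1 = 1, x2 = 1, x3 = -1 are 0.69966 and 1.58131, a difference of 0.88165, which is far larger than the trial-to-trial difference at any other setting in the table. That single pair puts a floor under the residuals of any model fitted to this data.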
