# Degrees of Freedom

1. Mar 14, 2014

### emmasaunders12

Hi all

If my model consists of two steps,

e.g., multiple linear regression to get an estimate of an intermediate response variable,

followed by a further regression to get the final estimate of the response variable.

To estimate the degrees of freedom for the total model can I simply sum the degrees of freedom for the individual models?

Thanks

Emma

2. Mar 14, 2014

### FactChecker

I always hate it when someone suggests a different approach instead of answering the original question, but I have to suggest this: Why not apply multiple linear regression directly to estimate the response variable? It seems like either way will end up with a linear estimator, but the direct approach will allow you to apply existing tools and obtain all the relevant statistical information directly.

3. Mar 15, 2014

### emmasaunders12

Hi FactChecker, the reason I phrased the question that way is that it's slightly more complicated than regression. I'm comparing two "models", one of which requires preprocessing of the data, so I want to know whether, given this preprocessing step, I can simply add the degrees of freedom for each individual step.

4. Mar 15, 2014

### FactChecker

Emma, I see. My opinion is that you can only add the independent degrees of freedom of the second process from variables that are not the result of the first process. Anything more than that is beyond my abilities.

5. Mar 15, 2014

### Stephen Tashi

Emma,

"Degrees of freedom" has no precise meaning until a particular context is specified. (This is the case with many mathematical terms like "dual", "conjugate", "closed", "homogeneous".) I assume you are using a particular statistic or procedure which requires a "degrees of freedom" number. Explain exactly what procedure or formula you intend to use.

6. Mar 15, 2014

### emmasaunders12

Hi, I'm using the F test for the comparison of two models.

7. Mar 16, 2014

### Stephen Tashi

As I understand your first post, it talks about a single linear model that is created in stages. This model is not (in general) the same model that you would obtain by a least-squares linear regression because you did the fit in two stages. Your final model is something like z = A x1 + B y + C where x1 is from the data, and y is not. The "intermediate" variable y is the result of a least squares fit to the data that gave y = D x2 + E x3 + F where x2 and x3 are values from the data.

In least squares regression, to obtain a model z = A x1 + By + C, we assume there are no "errors" in the x1 and y measurements. So you can't say that your procedure produces the same model as you would have obtained if you had done the regression in a single step using the data (x1, x2, x3) because there are "errors" in the y values. (The method of "total least squares" regression is often used when the model assumes errors exist in several of the variables.)

One technicality to investigate is whether the F-test comparison of two linear models actually applies if one model is not the result of a least squares fit.

If you are comparing two models, then I assume the two models predict the same variable, which is z in my example. If so, my example describes only one of the models. Where does the other model come from?

8. Mar 17, 2014

### emmasaunders12

Thanks for the reply stephen, but without drifting off the topic too much, with respect to the degrees of freedom, is it legitimate to add the degrees of freedom for each individual step?

Thanks

Emma

9. Mar 17, 2014

### FactChecker

The math of "degrees of freedom" allows you to count up the number of variables in an equation that are independent of others and are free to vary. In that context, you can add them as long as you do not count the variables that are a result of your first step. Those variables are not free to vary since they are calculated in the first step.

However, using the degrees of freedom in statistics like F or chi-squared requires additional assumptions about the distribution of the free variables and about the equation of the statistic being calculated. Since your calculation is not one of the usual ones (sample mean, sample variance, linear regression, goodness of fit, etc.), it is not clear what statistics are valid to use, even if your degrees of freedom is correct. To use the standard distributions, you will have to use one of the processes that they apply to.

10. Mar 18, 2014

### Stephen Tashi

Let's say you are dealing with a linear model and "degrees of freedom" in your context means the number of parameters in the model that were determined when you fit the model to data.

Using the previous example, z = A x1 + B y + C can be written as z = A x1 + B( D x2 + E x3 + F) + C = A x1 + BD x2 + BE x3 + BF + C. This amounts to a linear model with 4 parameters P1 = A, P2 = BD, P3 = BE and P4 = (BF + C). So there are 4 degrees of freedom.

There are 3 parameters in z = A x1 + B y + C and 3 parameters in y = D x2 + E x3 + F but there are only 4 parameters in the model that expresses z as a linear function of x1,x2,x3.
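A quick numerical check of that count, as a minimal sketch: the coefficient values are made up for illustration, and `two_stage` and `collapsed` are hypothetical helper names. It only shows that the two-stage prediction coincides with a single linear function of x1, x2, x3 that has four coefficients.

```python
# Hypothetical coefficient values, chosen only for illustration.
A, B, C = 2.0, 3.0, 1.0      # second stage: z = A*x1 + B*y + C
D, E, F = 0.5, -1.0, 4.0     # first stage:  y = D*x2 + E*x3 + F


def two_stage(x1, x2, x3):
    """Predict z by first computing the intermediate variable y."""
    y = D * x2 + E * x3 + F
    return A * x1 + B * y + C


# The six stage coefficients collapse into four composite parameters.
P1, P2, P3, P4 = A, B * D, B * E, B * F + C


def collapsed(x1, x2, x3):
    """Predict z directly from x1, x2, x3 using the four parameters."""
    return P1 * x1 + P2 * x2 + P3 * x3 + P4


# Compare the two forms at a few arbitrary points.
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (-1.5, 0.25, 7.0)]
max_gap = max(abs(two_stage(*p) - collapsed(*p)) for p in points)
print((P1, P2, P3, P4), max_gap)
```

Different choices of the six stage coefficients can give the same four composite values, which is why the six are not separately identifiable from the composite model.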

11. Mar 18, 2014

### emmasaunders12

Thanks stephen

I'm a little confused as to why you're not counting, e.g., BD as two parameters. In MLR where, e.g., z=AX1, A will now have many parameters; aren't all elements of A counted in this case?

In the example you give above, simply adding the numbers of parameters (6 parameters above) would always give more parameters, and hence even fewer degrees of freedom.

When comparing the sums of squared residuals of two models using the F test, with a simple model (S1) having degrees of freedom DF1 and a more complex model (S2) having fewer degrees of freedom (DF2):

F=[(S1-S2)/S2]/[(DF1-DF2)/DF2]

estimating fewer degrees of freedom in the complex model than may actually exist would give a smaller F ratio and thus favour the simpler model. Would this assumption be correct?
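To make the question concrete, here is a minimal sketch of the ratio above. The numbers are made up, `f_ratio` is a hypothetical helper name, and DF is read as the residual degrees of freedom, which is an assumption. It keeps S1, S2 and DF1 fixed and only changes DF2.

```python
def f_ratio(rss_simple, df_simple, rss_complex, df_complex):
    """F = [(S1 - S2) / S2] / [(DF1 - DF2) / DF2], DF = residual degrees of freedom."""
    if df_complex <= 0 or df_simple <= df_complex:
        raise ValueError("need df_simple > df_complex > 0")
    relative_rss_drop = (rss_simple - rss_complex) / rss_complex
    relative_df_drop = (df_simple - df_complex) / df_complex
    return relative_rss_drop / relative_df_drop


# Made-up numbers: same residual sums of squares, two candidate values for DF2.
s1, df1, s2 = 120.0, 48, 100.0
f_with_df2_46 = f_ratio(s1, df1, s2, 46)   # complex model charged 2 extra parameters
f_with_df2_43 = f_ratio(s1, df1, s2, 43)   # complex model charged 5 extra parameters
print(f_with_df2_46, f_with_df2_43)
```

With these inputs the ratio is smaller for the smaller DF2. The reference F distribution also changes with the degrees of freedom, so the ratio alone does not settle which model the test favours.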

Emma

12. Mar 18, 2014

### Stephen Tashi

My understanding of applying the F test to compare linear models is that we assume the models are nested and that they are each a least squares fit to the data (...and a lot of other assumptions). So it's hard to answer your question because you are not fitting a model to data by a method that is guaranteed to produce a least squares fit. (And you have not mentioned a second model to which your first model is being compared.)

But suppose we have a linear model of the form z = P x + Q y + R where P, Q, R are constants. Suppose we fit this to data by some procedure that goes in stages and is guaranteed to produce a least squares fit in the end. We write the model as z = (A)(B)(C) x + (D)(E)(F) y + (G)(H)(I). In stage 1, we find A, D, G. In stage 2, we find B, E, H. In stage 3, we find C, F, I. This does not change the fact that the final result of the process is a linear model that is a least squares fit to the data and has the form z = P x + Q y + R, which involves 3 constants.

13. Mar 18, 2014

### emmasaunders12

I was however under the impression that the F test can still be used if the models were not fitted using least squares. Quoting wiki

"It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. Exact "F-tests" mainly arise when the models have been fitted to the data using least squares. "

I understand your proposition with respect to the number of constants. Could you please, however, clarify my earlier point:

"I'm a little confused as to why you're not counting, e.g., BD as two parameters. In MLR where, e.g., z=AX1, A will now have many parameters; aren't all elements of A counted in this case?"

Thanks

Emma

14. Mar 18, 2014

### Stephen Tashi

We'd have to investigate whether "inexact" F-tests are a good idea. I don't know if the wiki is merely stating that they are a customary practice or whether it is stating they are a mathematically justifiable practice.

The counting of parameters in the linear model counts the number of constants in the model, with each numerical coefficient of a variable counted as a single constant and the "constant term" of the model counted as a single constant. So for the term 16 x, the numerical value 16 is one constant even though it could be factored as (8)(2) or (2)(2)(2)(2).

15. Mar 19, 2014

### emmasaunders12

Hi Stephen thanks for your help:

The models I'm comparing are:

Model1: Y=A(B^-1)X

Model2: Y=A([ Z-CH]D+ C(B^-1)X)

How would you best proceed in this instance?

Thanks

Emma

16. Mar 19, 2014

### Stephen Tashi

Which of those letters represent independent variables? X and Z ? Z isn't a function of X?

17. Mar 19, 2014

### emmasaunders12

X is the independent variable. Z is a linear function of a projection of X (i.e. S) in a lower dimensional domain, i.e. an autoregressive describing the evolution in time of S.

S=(A^-1)X;

18. Mar 19, 2014

### Stephen Tashi

I suggest you give a precise definition of things involved.

I don't know what "an autoregressive" might be. The term "autoregressive" suggests your independent variables might be values indexed with time. Is X a vector of values indexed by a "time" ? Or is the kth component of X the value of something at time k? Are the values of the dependent variable Y also indexed by time?

19. Mar 20, 2014

### emmasaunders12

Hi Stephen,

Thanks for your patience with my problem:

Model1: Y=A(B^-1)X

A are the eigenvectors of Y,
B are the eigenvectors of X,
So the above is a total least squares type problem.
There are no time indices in the above method

Model2: Y=A([ Z-CH]D+ C(B^-1)X)

[ Z-CH]D+ C - is a Kalman filter

some preliminaries:
S=(A^-1)X;
D=(B^-1)Y
S is an estimate of X in a lower dimensional domain
D is an estimate of Y in a lower dimensional domain

C is the Kalman gain
H is a linear model between S and D
Z is an autoregressive fit of D

From the above A and B are fixed, all other parameters can vary, as the Kalman filter is adaptive, dependent upon Y using an EM algorithm.

20. Mar 20, 2014

### Stephen Tashi

It isn't clear what you are doing, because you haven't described the format of the observed data you are using.

One effort at mind reading says that your data consists of M ordered pairs of vectors (X,Y), so to exhibit one pair of vectors as scalars: (X[k],Y[k]) = ( (X[k][1],X[k][2],...,X[k][nx]), (Y[k][1],Y[k][2],...,Y[k][ny]) ).

Another effort at mind reading says your data consists of M ordered pairs of scalars (x[k],y[k]) and that there is a single vector Y = (y[1],y[2],...y[M]) and a single vector X = (x[1],x[2],...x[M]).

You haven't written an equation which shows any random errors, so it isn't clear why you say that the fit is a total least squares problem. I assume you mean that the model assumes a random additive error in both the X and Y terms.

What do X and Y represent in this model? (Are they the same variables in this model as they are in Model1?)

The term Kalman filter suggests that there are time indices involved in this model. Which index represents time?

21. Mar 21, 2014

### emmasaunders12

Hi Stephen,

You're correct: M ordered pairs of vectors (X,Y) represent the data.

The fit is total least squares as the eigen domain is used, i.e. orthogonal regression.

X and Y are the same in model 1 and model 2. Model 2, however, is dynamic, as mentioned previously:

Yk=A([ Zk-CkHk]Dk+ Ck(B^-1)Xk), with k as a time index

similarly model 1 can be written as

Yk=A(B^-1)Xk

if one wishes; it just depends on how the model is being used, i.e. with a batch of Xk's or just incremental data points.

Any idea how to proceed here?

22. Mar 22, 2014

### Stephen Tashi

Use simulation.

For simulation you need stochastic models, not mere curve fits. Each model should specify a method for making a deterministic prediction (such as Y = AX), but it also must specify a model for how the observed data arises in a stochastic manner (such as Y = AX + B*err(k), where err(k) is an independent random draw at each time k from a normal distribution with mean 0 and variance 1).

I think your x-data is a time series of vectors. You need to generate representative examples of the x-data by simulation or have such examples from actual observations (i.e. one "example" is an entire time series of vectors). So you might need a stochastic model for the x-data.

I am assuming your predictive models give the predicted y-value as a function of the observed x-values, not as a function of the underlying "true" x-values. Of course a model may use the observed x-values to predict the "true" x-values and then make its prediction based on those estimates.

Once you have the capability to do simulations, you can investigate various statistics by the Monte-Carlo method.

-------------
For example:

Let model_X be the stochastic model for generating the x-data.

Create a Monte-Carlo simulation involving two (possibly identical) models, model_A and model_B, as follows. One replication of the simulation is:

1) Generate the X-data using model_X
2) Generate the Y-data using the stochastic model associated with model_A
3) Generate the predicted Y-data using the deterministic model associated with model_A
4) Compute RSS_A = the sum of the squared residuals between the Y-data of step 2 and the predicted Y-values of step 3.

(I'm assuming that when using the F-test, your intent was to define the "residual" between two vectors as the Euclidean distance between them. Whether this is wise depends on details of the real world problem.)

5) Generate the Y-data using the stochastic model associated with model_B
6) Generate the predicted Y-data using the deterministic model associated with model_B
7) Compute RSS_B = the sum of the squared residuals between the Y-data of step 5 and the predicted Y-values of step 6.
8) Compute G = (RSS_A - RSS_B)/RSS_B

G is an obvious imitation of the F-statistic. We don't know that G has the same distribution as any F-statistic, so we shouldn't call it one.

We can set model_A = model_B = your model2 and use the Monte-Carlo simulation to estimate the distribution of G. (When the stochastic model associated with model 2 is applied to the same X-data twice, it probably won't produce the same residuals due to the stochastic terms. Hence the value of G will vary on different replications.)

Take the "null hypothesis" to be that model1 is the same as model2 (as far as producing residuals goes). Compute the single numerical value G_obs = (RSS_1 - RSS_2)/RSS_2 by applying the two models to the actually observed X-data. Use the distribution of G to compute how likely it is to get a value of G equal or greater than G_obs. Then "accept" or "reject" the null hypothesis based on how this probability compares to whatever "significance level" you have chosen.
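The whole procedure can be sketched in a few lines. This is only an illustration under loud assumptions: the scalar toy models (a random walk for the x-data, and y = slope*x plus Gaussian noise) stand in for the real model_X, model1 and model2, every function name is hypothetical, and `g_obs` is a placeholder rather than a value computed from real data. G is taken as (RSS_A - RSS_B)/RSS_B, the same form as G_obs above.

```python
import random


def model_x(n, rng):
    """Hypothetical model_X: a Gaussian random walk of length n."""
    x, series = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)
        series.append(x)
    return series


def simulate_y(x_series, slope, noise_scale, rng):
    """Stochastic model: y[k] = slope*x[k] + noise_scale*err(k), err(k) ~ N(0, 1)."""
    return [slope * x + noise_scale * rng.gauss(0.0, 1.0) for x in x_series]


def predict_y(x_series, slope):
    """Deterministic model: predicted y[k] = slope*x[k]."""
    return [slope * x for x in x_series]


def rss(observed, predicted):
    """Sum of squared residuals between two equal-length series."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def replicate_g(model_a, model_b, n, rng):
    """One replication: steps 1 to 7, then G = (RSS_A - RSS_B) / RSS_B."""
    x = model_x(n, rng)
    slope_a, noise_a = model_a
    slope_b, noise_b = model_b
    rss_a = rss(simulate_y(x, slope_a, noise_a, rng), predict_y(x, slope_a))
    rss_b = rss(simulate_y(x, slope_b, noise_b, rng), predict_y(x, slope_b))
    return (rss_a - rss_b) / rss_b


def tail_probability(g_samples, g_obs):
    """Fraction of simulated G values that are at least g_obs."""
    return sum(g >= g_obs for g in g_samples) / len(g_samples)


rng = random.Random(0)          # fixed seed so the run is repeatable
model_2 = (1.5, 0.8)            # hypothetical (slope, noise_scale)
g_null = [replicate_g(model_2, model_2, 50, rng) for _ in range(2000)]
g_obs = 0.9                     # placeholder; would come from the observed data
p_value = tail_probability(g_null, g_obs)
print(f"estimated P(G >= {g_obs}) = {p_value:.3f}")
```

The estimated tail probability is then compared with the chosen significance level. No degrees-of-freedom figure enters anywhere; the price is that the answer depends on how well the stochastic models represent the real data.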

23. Mar 22, 2014

### emmasaunders12

Wouldn't I still, however, be in the same predicament, since testing the null hypothesis requires the degrees of freedom for each model, which is what I am not sure about?

Thanks

Emma

24. Mar 22, 2014

### Stephen Tashi

No, you don't need to know any "degrees of freedom" information. You use the empirical distribution of G to do the test. There is never any need to deal with F statistics.
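One way to do that, as a rough sketch: read a critical value straight off the simulated G values and reject when the observed G exceeds it. The function name and the sample values below are made up for illustration; in practice the list would hold thousands of simulated replications.

```python
import math


def empirical_critical_value(g_samples, significance):
    """Return a sampled value c such that at most a `significance`
    fraction of the samples lie strictly above c."""
    if not g_samples:
        raise ValueError("need at least one simulated G value")
    if not 0.0 < significance < 1.0:
        raise ValueError("significance must lie strictly between 0 and 1")
    ordered = sorted(g_samples)
    allowed_above = math.floor(significance * len(ordered))
    return ordered[len(ordered) - 1 - allowed_above]


# Made-up simulated G values, far too few for real use.
g_samples = [-0.4, -0.1, 0.0, 0.2, 0.3, 0.7, 1.1, 2.5]
critical = empirical_critical_value(g_samples, significance=0.25)
g_obs = 0.9                                # placeholder observed value
reject_null = g_obs > critical
print(critical, reject_null)
```

The cutoff comes from sorting the simulated values, not from an F table, so it stays meaningful even if G does not follow any F distribution.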

25. Mar 23, 2014

### emmasaunders12

I actually have 10 examples of the ground truth Y. Would this be enough to avoid performing simulation, as I am calculating the residuals (RSS_A - RSS_B) / RSS_B using them? Also, the technique you mention would be dependent upon what one chooses as err(k) in Y = AX + B*err(k).

I'm also a little confused about how you would use the empirical distribution of G to do the test. If you could clarify that, it would be great.

Thanks so much for your help

Emma