
Quadratic form and degrees of freedom- fixed title

  1. Jan 15, 2013 #1
    My question is: Is there any gain in intuitive mathematical understanding of degrees of freedom from learning their expression using the 'quadratic form' and matrix algebra techniques?

    This sort of explanation is at least understandable and self-consistent, if not rigorous mathematically: "Degrees of freedom are a way of keeping score. A data set contains a number of observations, say, n. They constitute n individual pieces of information. These pieces of information can be used to estimate either parameters or variability. In general, each item being estimated costs one degree of freedom. The remaining degrees of freedom are used to estimate variability. All we have to do is count properly. "

    However, I don't like that I am doing relatively simple mathematical operations without an understanding of what the mathematical justification is. If I learned matrix algebra enough to understand the 'quadratic form' sense of d.f., will it make more sense to me why we use a denominator of e.g. n-1 for estimating average variance, or will I just know a more complicated way of deriving degrees of freedom?

    (I have very limited exposure to matrix algebra, and it didn't seem that intuitive, i.e. hard to translate into non-matrix terms, but maybe that would change if I studied it more seriously and at least got used to its rules).

    Thanks in advance!
  2. jcsd
  3. Jan 16, 2013 #2

    Stephen Tashi

    Science Advisor

    In my opinion, you won't find a single mathematical theorem that justifies all the results that people state using the terminology "degrees of freedom". It's also an interesting question whether it is even possible to state a single mathematical definition for "degrees of freedom" that applies in all situations where that terminology is used.

    The general idea is that if you have a set S of things that can be defined by a description that uses n variables and each arbitrary choice of values for the n variables defines a distinct element of S then you can claim that S has n degrees of freedom.

    In the typical applied math situation, you have a set S defined by M variables and K constraints on the variables. (For example, an element of S might be defined by M real numbers x1, x2, ..., xM subject to the constraints x1 + x2 + ... + xM = 1 and (x1)^2 + (x2)^2 + ... + (xM)^2 = 7.) So you don't have M degrees of freedom because you can't assign the x's an arbitrary set of M values.

    The way things usually work out is that the number of degrees of freedom you have is M minus the number of constraints. This is not a foolproof rule, because it won't work if some of the constraints are "dependent" on each other. (This brings up the further problem of how to define "dependent", since we aren't necessarily talking about constraints as vectors.)
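    For the special case where every constraint is linear, the counting rule and its "dependent" caveat can be sketched as follows. This is only an illustration under that assumption: the helper names rank and degrees_of_freedom are made up here, the constraints are assumed to be consistent with each other, and nonlinear constraints such as the sum-of-squares example are not covered.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows), by Gaussian elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    r = 0  # number of pivot rows found so far
    for col in range(n_cols):
        # Look for a row at or below position r with a nonzero entry in this column.
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        # Eliminate this column from every row below the pivot row.
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def degrees_of_freedom(n_vars, constraints):
    """M minus the number of independent linear constraints (hypothetical helper)."""
    return n_vars - rank(constraints)

# Each row holds the coefficients of one linear constraint on (x1, x2, x3).
independent = [[1, 1, 1], [1, -1, 0]]  # x1 + x2 + x3 = c1 and x1 - x2 = c2
dependent = [[1, 1, 1], [2, 2, 2]]     # the second row is twice the first

print(degrees_of_freedom(3, independent))  # 3 variables, 2 independent constraints
print(degrees_of_freedom(3, dependent))    # naive M - K gives 1, the rank-based count gives 2
```

    Counting the rank of the coefficient rows instead of the number of rows is one way to make "dependent" precise in the linear case.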

    Most useful formulas involving degrees of freedom give us the convenience of not having to rewrite the set S explicitly in terms of a new set of M-K variables. We just use the values of the original set of M variables and fix the answer by some adjustment involving K.

    In a typical statistics problem, each element in the set S is a single real number that is the value of some function. Often S is a statistic, which by definition is a function of the values in a sample.

    I think you can find a mathematical justification for the formulas you encounter that use "degrees of freedom". It certainly might involve matrix algebra and quadratic forms if it is a statistics formula. But I don't think you'll find a single mathematical theory that justifies all the "degrees of freedom" formulas that pop up.

    Explain which particular "degrees of freedom" formula that you want to justify.
    Last edited: Jan 16, 2013
  4. Jan 16, 2013 #3
    Very quickly, it would be great if I could see why, with a three-point sample of items a, b, and c, you have to divide the total variance into n-1 portions instead of n when you calculate the sample variance. I am reading H. W. Walker's frequently cited 1940 paper (Journal of Educational Psychology, 31(4) (1940) 253-269) on d.f., wherein it is explained that degrees of freedom arise from the dimensions of your sample, minus the number of constraints you have placed on your data:

    "Consider now a point (x, y, z) in three-dimensional space (N = 3). If no restrictions are placed on its coördinates, it can move with freedom in each of three directions, has three degrees of freedom. All three variables are independent. If we set up the restriction
    x + y + z = c, where c is any constant, only two of the numbers can be freely chosen, only two are independent observations. For example, let x − y − z = 10. If now we choose, say, x = 7 and y = 9, then z is forced to be −12. The equation x − y − z = c is the equation of a plane, a two-dimensional space cutting across the original three-dimensional space, and a point lying on this space has two degrees of freedom. N − r = 3 − 1 = 2. If the coördinates of the (x, y, z) point are made to conform to the condition x^2 + y^2 + z^2 = k, the point will be forced to lie on the surface of a sphere whose center is at the origin and whose radius is √k. The surface of a sphere is a two-dimensional space. (N = 3, r = 1, N − r = 3 − 1 = 2.)."

    It would help me, I think, if the above paragraph were written explicitly for the case where you have 3 data points, showing how the constraints emerge for those three data points and cause a decrease in d.f. when you calculate the sample variance after calculating the sample mean. For example, I see that x^2 + y^2 + z^2 = k means you have restricted the degrees of freedom to 3-1 for any k, but x, y, z are sample points, right, not residuals, in that example? So that's not directly analogous to calculating variance.

    Thanks a lot for your response, it already gives a lot to think about.
  5. Jan 17, 2013 #4

    Stephen Tashi

    Science Advisor

    The first thing to straighten out is the distinction between "the sample variance" and "an estimator of the population variance".

    "The sample variance" is actually an ambiguous phrase. Some books define it with the formula [itex] \sum_{i=1}^n \frac{ (X_i - \bar{X})^2} {n} [/itex] (e.g. http://mathworld.wolfram.com/SampleVariance.html) and some books define it to be [itex] \sum_{i=1}^n \frac{ (X_i - \bar{X})^2} {n-1} [/itex], where the n independent sample values are the [itex] X_i [/itex] and [itex] \bar{X} [/itex] is the sample mean.

    An "estimator" is a function of the sample values whose purpose is to estimate some parameter of the population. Since an estimator is a function of the random values in the sample, the estimator itself is a random variable. An "unbiased estimator" is an estimator whose expected value is exactly equal to the population parameter it is intended to estimate.

    One "unbiased estimator" for the population variance is given by [itex] \sum_{i=1}^n \frac{ (X_i - \bar{X})^2} {n-1} [/itex]. To prove it is unbiased, you must prove the expected value of this estimator is the population variance.
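    A sketch of the standard argument, under the assumption that the [itex] X_i [/itex] are independent with a common mean and a finite variance. The symbols [itex] \mu [/itex] and [itex] \sigma^2 [/itex] for the population mean and variance are notation introduced here, and this is not necessarily the route taken by any particular textbook.

```latex
\begin{align*}
\sum_{i=1}^n (X_i - \bar{X})^2
  &= \sum_{i=1}^n \bigl( (X_i - \mu) - (\bar{X} - \mu) \bigr)^2 \\
  &= \sum_{i=1}^n (X_i - \mu)^2
     - 2(\bar{X} - \mu) \sum_{i=1}^n (X_i - \mu)
     + n(\bar{X} - \mu)^2 \\
  &= \sum_{i=1}^n (X_i - \mu)^2 - n(\bar{X} - \mu)^2
  \qquad \text{since } \sum_{i=1}^n (X_i - \mu) = n(\bar{X} - \mu). \\[6pt]
E\!\left[ \sum_{i=1}^n (X_i - \bar{X})^2 \right]
  &= n\sigma^2 - n \operatorname{Var}(\bar{X})
   = n\sigma^2 - n \cdot \frac{\sigma^2}{n}
   = (n-1)\sigma^2 . \\[6pt]
E\!\left[ \sum_{i=1}^n \frac{(X_i - \bar{X})^2}{n-1} \right]
  &= \sigma^2 .
\end{align*}
```

    The n-1 appears because measuring deviations from the sample mean instead of the population mean removes, on average, exactly one [itex] \sigma^2 [/itex] from the sum of squares.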

    The estimator [itex] \sum_{i=1}^n \frac{ (X_i - \bar{X})^2} {n} [/itex] is not an unbiased estimator of the population variance.

    So you shouldn't look for any mathematical argument that "proves" the formula for the sample variance. Its formula is simply a matter of convention. You may look for a mathematical argument proving that a formula is an unbiased estimator of the population variance.
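    As a small illustration that the two conventions really do coexist, Python's standard statistics module provides both under different names. The three data values below are made up for the example.

```python
import math
import statistics

sample = [2.0, 4.0, 9.0]  # illustrative values only
n = len(sample)
mean = sum(sample) / n
sum_sq_dev = sum((x - mean) ** 2 for x in sample)  # 9 + 1 + 16 = 26

divide_by_n = sum_sq_dev / n                # the formula with n in the denominator
divide_by_n_minus_1 = sum_sq_dev / (n - 1)  # the formula with n-1 in the denominator

# statistics.pvariance divides by n; statistics.variance divides by n-1.
assert math.isclose(statistics.pvariance(sample), divide_by_n)
assert math.isclose(statistics.variance(sample), divide_by_n_minus_1)

print(divide_by_n, divide_by_n_minus_1)
```

    Neither function is "the" sample variance. Which one is appropriate depends on what you intend to estimate.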

    Mathematically correct arguments about estimators can be complicated and non-intuitive. For example, the estimator [itex] \sum_{i=1}^n \frac{ (X_i - \bar{X})^2} {n-1} [/itex] is an unbiased estimator of the population variance, but the estimator [itex] \sqrt{ \sum_{i=1}^n \frac{ (X_i - \bar{X})^2} {n-1}} [/itex] is not necessarily an unbiased estimator for the population standard deviation.
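    These claims can be checked exactly on a toy population. The sketch below assumes the population is a fair six-sided die and enumerates every ordered sample of size 3 drawn with replacement. The die, the sample size, and all variable names are choices made for this illustration.

```python
import math
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # a fair die: an illustrative population
n = 3

mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean, as an exact fraction."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

# All 6**3 equally likely ordered samples of size 3.
samples = list(product(population, repeat=n))

mean_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
mean_div_n1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
mean_sqrt = sum(math.sqrt(sum_sq_dev(s) / (n - 1)) for s in samples) / len(samples)

print(sigma2, mean_div_n1, mean_div_n)  # 35/12 35/12 35/18
print(math.sqrt(sigma2), mean_sqrt)     # the second number is the smaller one
```

    On this population the n-1 version averages to exactly the population variance, the n version averages to 2/3 of it, and the square root of the n-1 version averages to less than the population standard deviation.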
  6. Jan 17, 2013 #5

    Stephen Tashi

    Science Advisor

  7. Jan 17, 2013 #6
    So there is no way to take the language from Walker's paper and explicitly make it about how calculating an unbiased estimate of the population variance causes you to divide by n-1 instead of n? What I'm getting is that there are proofs that n-1 works, which is fine, and there is this sense that degrees of freedom are about loss of dimensionality when estimating (?), but I don't see how the latter sense arises mathematically.
  8. Jan 17, 2013 #7

    Stephen Tashi

    Science Advisor

    I don't know if Walker continued her argument or what her conclusion was. What you quoted explains why a constraint reduces the number of "free" choices we have for the variables' values from 3 to 2. The passage quoted does not explain anything about unbiased estimation.

    I don't know. As I interpret your question, you are asking if we can look at the proof given in the PDF I linked and make a geometric interpretation of some of its steps as showing a loss of dimensionality in some surface. I suspect we can, but I haven't done so. It's about my bed time, so I won't attempt it tonight. Perhaps some other forum member sees it. I'll try to look at it later.
  9. Jan 17, 2013 #8
    This seems to be it. However, I don't see why we need to use that particular proof. Walker's paper is here if you want to take a look (http://www.nohsteachers.info/pcaso/ap_statistics/PDFs/DegreesOfFreedom.pdf). What is the 'constant' that you calculate on your way to calculating variance? (x1 +x2 +x3)/3 = X? How does this constant X then impose a loss of geometric dimensionality when using it to calculate residuals and the variance from them?
  10. Jan 17, 2013 #9

    Stephen Tashi

    Science Advisor

    As far as I can see, her discussion of the figure does not lead to a proof of any of the formulas that involve the degrees of freedom. I think her exposition would have been clearer if she had reconciled her algebra with the geometry of the picture by setting N = 3 and mu = 0. A complicated situation is often broken up into simpler cases. She is considering a case where all the samples (of size 3) have the same mean. Within such a case, the possible samples are points that lie on the shaded triangle, which is a 2D figure. The geometry shows that for the sample S, the sum of the squares of the deviations from the population mean is the square of length OS. The sum of the squares of the deviations from the sample mean is the square of length AS. I don't see how this leads to a specific formula for an unbiased estimator of the variance. It simply illustrates that the two ways of calculating the squares of the deviations are different.

    If there is a proof for the formula for the unbiased estimator of the variance based on geometry and reducing degrees of freedom, my guess is that it would involve conditional expectation. The expected value of a random variable can be computed by dividing the possible outcomes up into mutually exclusive sets of outcomes. In the case of a function of 3 sample values, the 3D solid of possible outcomes can be divided up into "layers", each of which is a 2D figure. We compute the mean of the variable on each layer and add up the result. To compute the mean value of the variable X on the layer L, compute the mean value X using the conditional probability density that restricts X to be in L. Then we multiply this mean value by the probability that X is in L. But this general pattern of proof doesn't produce any universal rule that says "always use n-1 instead of n".
    Last edited: Jan 17, 2013
  11. Jan 17, 2013 #10
    I don't understand. Are the sample points, X1, X2, X3, to be found along those axes? Or are those just the most extreme possible cases? If they are along those axes, then what is the triangle? If you could just explain what the algebraic constraints are (what are the equations for ), because I don't understand her figure.

    I don't understand how you create a layer, or even a 3d solid.
  12. Jan 18, 2013 #11

    Stephen Tashi

    Science Advisor

    Instead of an x,y,z axis system, she has an X1,X2,X3 axis system. A point (a,b,c) represents a sample of size 3 (three realizations of the same random variable). If we ask for the surface where X1+X2+X3 = k then we get a plane [itex] P_k [/itex] that includes the points (0,0,k), (0,k,0), (k,0,0). Those points are where the shaded triangle hits the 3 axes. The shaded triangle is the part of the plane where each sample value is non-negative. (There is no requirement that each sample value must be non-negative; it just makes it easier to visualize the plane if you only show the shaded triangle.) Everywhere on the plane [itex] P_k [/itex], the sample mean is (X1+X2+X3)/3 = k/3.

    The point A represents the sample where all 3 sample values are equal to each other. This is the sample (k/3, k/3, k/3). She is assuming the origin [itex] O [/itex] = (0,0,0) represents a sample where each realization of the random variable was exactly equal to the mean of the population. So, the picture assumes the mean of the population is 0.

    The point S represents a sample where the sum of the sample values is k, but the sample values are not equal to each other. The sum of the square deviations from the sample values in S to the population mean (0,0,0) is the square of the length of line segment OS. This follows from using the distance formula for the distance between two points in 3 dimensions. The sum of the square deviations of the sample values in S from the sample mean is the square of the length of line segment AS, by using the distance formula.
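    The distances in the picture can be checked numerically. The sketch below uses a made-up sample S = (1, 2, 6) and takes the population mean to be 0, as the picture assumes. It also checks two facts not stated above but which follow from the same setup: the deviations from the sample mean sum to zero, so the vector from A to S is perpendicular to the line through O and A, and therefore OS^2 = OA^2 + AS^2.

```python
import math

sample = (1.0, 2.0, 6.0)  # the point S; illustrative values only
n = len(sample)
k = sum(sample)           # S lies on the plane X1 + X2 + X3 = k
xbar = k / n              # sample mean, equal to k/3 everywhere on that plane
A = (xbar,) * n           # the point (k/3, k/3, k/3)
O = (0.0,) * n            # the origin, i.e. population mean 0 in every coordinate

def dist_sq(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

OS2 = dist_sq(O, sample)  # sum of squared deviations from the population mean
AS2 = dist_sq(A, sample)  # sum of squared deviations from the sample mean
OA2 = dist_sq(O, A)       # n * xbar**2

residuals = [x - xbar for x in sample]

print(OS2, AS2, OA2)      # 41.0 14.0 27.0
print(sum(residuals))     # 0.0: the residuals are confined to a 2D plane
assert math.isclose(OS2, OA2 + AS2)
```

    Because the three residuals must sum to zero, choosing any two of them fixes the third. That is the one constraint the sample mean imposes on the deviations.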

    Using some convoluted language she says (I think) that the ratio OA/OS is the value of the t-statistic for the sample represented by S.

    As far as I can see, she hasn't proved any theorem. She didn't give any particular reason for only considering samples where the values sum to k.

    My reaction is "Why the heck is this an often-cited paper?". Perhaps some other forum member has more insight.

    To illustrate my remarks about conditional expectation, if you look at the planes [itex] P_k [/itex] and let k range over all possible values, these planes would include all possible 3-value samples. If you know a formula for the mean of some estimator on the plane [itex] P_k [/itex] then you can find the mean value of the estimator (over all samples) by "averaging up" its mean value over all possible [itex] P_k [/itex].
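    Here is a sketch of that "averaging up" on a toy example. It swaps the continuous picture for a discrete one, assuming the population is a fair six-sided die and the sample size is 3, so that each layer is the finite set of samples with X1+X2+X3 = k. The estimator function and all names are choices made for this illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # a fair die: an illustrative population
n = 3
samples = list(product(population, repeat=n))  # all equally likely ordered samples

def estimator(sample):
    """Sum of squared deviations from the sample mean, divided by n-1."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample) / (len(sample) - 1)

# Layer k holds the samples lying on the plane X1 + X2 + X3 = k.
layers = defaultdict(list)
for s in samples:
    layers[sum(s)].append(s)

# Mean of the estimator over all samples, computed directly.
overall = sum(estimator(s) for s in samples) / len(samples)

# The same mean, computed layer by layer.
layered = Fraction(0)
for k, members in layers.items():
    prob_k = Fraction(len(members), len(samples))  # probability of landing on layer k
    mean_on_layer = sum(estimator(s) for s in members) / len(members)  # conditional mean
    layered += prob_k * mean_on_layer

print(overall, layered)  # 35/12 35/12
```

    The two numbers agree, which is just the law of total expectation. As noted earlier, this pattern by itself does not single out n-1; it only shows how an expectation can be assembled from the layers.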