
Chi Squared vs. Coefficient of Determination

  1. Dec 7, 2008 #1
    I am confused about the meanings of these two goodness-of-fit descriptors.

    If I have two independent variables and I wish to test the validity of a proposed functional relation between them (say I have measured electric potentials and currents and wish to test if the data reflects a linear relationship), which should I use?

    What are the differences between these tests?

    What texts would you recommend that have full proofs of such arguments?

    Forgive me if this is a stupid question, but I have had difficulty in finding a text explaining the differences.
  3. Dec 8, 2008 #2


    Science Advisor

    You'll probably want to perform a regression analysis (which will give you the coefficient of determination) and then see what can be inferred from your results. A Chi-square test statistic can be used in a hypothesis test.

    The coefficient of determination tells you how well your linear model accounts for the variation in the data (i.e. the larger the value, the more of the variation in the dependent variable the model explains).
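As a rough illustration of the regression step described above, here is a minimal sketch that fits a straight line by ordinary least squares and computes the coefficient of determination as 1 - SS_residual / SS_total. The function names and the current/potential readings are made up for illustration; they are not from the thread.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y ~ a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(x, y):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    a, b = linear_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical current (A) and potential (V) readings, for illustration only.
current = [0.1, 0.2, 0.3, 0.4, 0.5]
potential = [1.1, 1.9, 3.2, 3.9, 5.1]
a, b = linear_fit(current, potential)
r2 = r_squared(current, potential)
print(f"intercept={a:.3f}, slope={b:.3f}, R^2={r2:.4f}")
```

A value close to 1 only says the line accounts for most of the spread in the potentials; by itself it is not a hypothesis test.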

    Any statistics book should have some sections on linear regression and hypothesis testing. Here is some info on the net that may help: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm

    Hope this helps.

  4. Dec 8, 2008 #3
    Your reply and link seem to confirm that which I have read and thought but deny what people around me seem to say.

    In my mind, the logical test for assessing whether or not a curve fit is justified is the coefficient of determination. One then converts the coefficient of determination into a probability of correlation by considering the degrees of freedom in the problem in question. Is this correct? Are there limitations on this test?

    I am well aware that Wikipedia is by NO means a source, but I was disturbed to read the claims that there is no consensus on what the coefficient of determination means. From my limited understanding, the meaning of it seems crystal clear, and the Wikipedia description to be nonsense. However, I am by no means an expert in statistics, just a solid state theory graduate student.

    Additionally, the link you gave describes the chi squared test as a test of whether a certain statistical distribution applies to the data. I am used to this description. It makes sense to me. However, at one point in Taylor's introductory book on statistics and error, he briefly mentions that chi squared can be used to assess linear fits. I have also heard different physicists whose intelligence I respect invoke chi squared to judge curve fits. Can chi squared be used to judge an arbitrary curve fit? Conceptually, I don't see why it cannot, considering how it judges a distribution, but then I do not know which is better, chi squared or r squared, to judge a curve fit.
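For reference, the curve-fit use of chi squared mentioned here is usually written as the sum of squared residuals, each divided by the squared measurement uncertainty, and then compared with the number of degrees of freedom (data points minus fitted parameters). The sketch below assumes every y value has a known uncertainty `sigma`; the data, the helper names, and the choice of a straight-line model are all hypothetical.

```python
def weighted_linear_fit(x, y, sigma):
    """Least-squares line y ~ a + b*x with weights 1/sigma^2; returns (a, b)."""
    w = [1.0 / s ** 2 for s in sigma]
    s0 = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = s0 * sxx - sx ** 2
    a = (sxx * sy - sx * sxy) / delta
    b = (s0 * sxy - sx * sy) / delta
    return a, b


def chi_squared(x, y, sigma, a, b):
    """Sum of squared residuals, each scaled by its measurement uncertainty."""
    return sum(((yi - (a + b * xi)) / si) ** 2
               for xi, yi, si in zip(x, y, sigma))


# Hypothetical measurements, each with an assumed known uncertainty of 0.1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.1, 7.9, 10.1]
sigma = [0.1] * 5

a, b = weighted_linear_fit(x, y, sigma)
chi2 = chi_squared(x, y, sigma, a, b)
dof = len(x) - 2  # two fitted parameters: intercept and slope
chi2_red = chi2 / dof
print(f"a={a:.3f}, b={b:.3f}, chi2={chi2:.2f}, reduced chi2={chi2_red:.2f}")
```

A reduced chi squared of order 1 suggests the scatter about the line is consistent with the stated uncertainties. Unlike r squared, this check needs those uncertainties as an input.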

    Do these observations make sense to you? Can chi squared be used to judge arbitrary curve fits? If so, how? If so, when should one use each test? What are their differences? Or is it that applying chi squared to curve fitting is not logical?

    Please forgive me if this is personal, but would you mind also stating your background (as in what field you work, degree) in case there are different conventions in different fields? The people I know who use chi squared as a curve fit test are both spectroscopists, one in chemical physics, one in quantum optics.
  5. Dec 8, 2008 #4


    Homework Helper

    "One then converts the coefficient of determination into a probability of correlation by considering the degrees of freedom in the problem in question"

    No; [tex] R^2 [/tex] is not, in itself, a measure of probability. It has the classical interpretation of "explaining the percentage of variability in Y that is due to the variable x",
    but even this is limited, for at least three reasons.
    1. If the regression is forced to go through the origin, the coefficient is meaningless.
    2. The coefficient is easily influenced by outliers and what we (statisticians) refer to as leverage points. This means that a large value of [tex] R^2 [/tex] can be due entirely to a single data value.
    3. If you move past simple regression (adding more x variables) [tex] R^2 [/tex] will never decrease (and in practice almost always increases), even if those variables add nothing to your model. A modified, or adjusted, coefficient is usually suggested then.
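Points 2 and 3 can be illustrated with a short sketch. The data are invented: six points with no linear trend at all, then the same six points plus one far-away leverage point. The adjusted coefficient uses the common formula 1 - (1 - R^2)(n - 1)/(n - p - 1) for p predictors; the function names are hypothetical.

```python
def r_squared(x, y):
    """R^2 for a straight-line fit with intercept (equals squared correlation)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictor variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# A cluster with no linear relationship between x and y.
x = [1, 2, 3, 1, 2, 3]
y = [1, 3, 2, 3, 1, 2]
r2_cluster = r_squared(x, y)

# The same cluster plus a single leverage point far from the rest.
r2_lever = r_squared(x + [20], y + [20])

print(f"R^2 without leverage point: {r2_cluster:.3f}")
print(f"R^2 with leverage point:    {r2_lever:.3f}")
print(f"adjusted R^2 (R^2=0.5, n=11, p=2): {adjusted_r_squared(0.5, 11, 2):.3f}")
```

The single added point takes the coefficient from zero to nearly one, even though the other six points show no relationship.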
  6. Dec 8, 2008 #5
    I am very grateful for your response, thank you very much for your time.

    I do not mean to impugn anything you have written, I merely ask some questions to complete my understanding of the material for myself.

    If you are familiar with "An Introduction to Error Analysis" by John Taylor, can you comment on the content of p. 215-220? If I understand the material, a big if, it seems that Taylor is directly claiming that the coefficient of determination's square root is the correlation coefficient, and the correlation coefficient is converted to the probability of correlation by the degrees of freedom in the table in those pages. Do I not understand the material? Is Taylor wrong?

    Regarding your individual points of weakness,

    1.) I don't see how this is a weakness. I have been taught, and it seems reasonable to me, that one should NEVER force a curve fit to go through any point if one wants an honest statistical analysis. In my mind, forcing additional points on the data in the curve fit defeats the purpose of analyzing them objectively. Is this naive?
    2.) Can you direct me to a text which gives some sort of proof of this statement so I might better understand the justification? Is there a compensation method? When you say outlier, my immediate thought is: can't the outliers be eliminated anyway using Chauvenet's criterion? If so, doesn't this remove this weakness?
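For readers unfamiliar with it, Chauvenet's criterion as it is commonly stated flags a reading when the expected number of equally deviant readings, N times the two-sided Gaussian tail probability, falls below one half. A minimal sketch with illustrative values (the function name and data are assumptions, and whether flagged points should actually be discarded is a separate question):

```python
import math


def chauvenet_outliers(values):
    """Return the values flagged by Chauvenet's criterion."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 in the denominator).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    flagged = []
    for v in values:
        z = abs(v - mean) / std
        # Two-sided Gaussian probability of a deviation at least this large.
        tail = math.erfc(z / math.sqrt(2.0))
        # Flag the value if fewer than half a reading this extreme is expected.
        if n * tail < 0.5:
            flagged.append(v)
    return flagged


readings = [3.8, 3.5, 3.9, 3.9, 3.4, 1.8]
flagged = chauvenet_outliers(readings)
print(flagged)
```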
    3.) Makes sense, I have seen this argument before, I was being sloppy and should have explicitly stated this condition being met.

    Can you comment on using chi squared?
  7. Dec 8, 2008 #6


    Homework Helper

    I think I have some answers, but I need to head to class and may add more later.
  8. Dec 21, 2008 #7
    Thanks StatDad, this is immensely helpful. So what exactly do you do when there are one or more influential outliers that are not due to human error? Can you name, or perhaps explain, a robust regression method?

    Thank you kindly!
  9. Dec 22, 2008 #8


    Homework Helper

    If all you have at your disposal is software that does Least Squares regression, you are rather stuck. It can be helpful to rerun the analysis without the problem data values, but only to see how much the results change without them. If the coefficients change by more than a couple of standard errors from their original values (or worse, change sign, or cause a significant relationship to become non-significant, etc.) you have a serious problem.
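The rerun-and-compare check can be sketched as follows. The code fits the line twice, with and without one suspect point, and reports how far each coefficient moved in units of its original standard error. The data, the function names, and the choice of which point to drop are hypothetical; the standard errors are the usual ones for a straight-line least-squares fit.

```python
import math


def fit_with_se(x, y):
    """OLS fit y ~ a + b*x; returns (a, b, se_a, se_b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = ss_res / (n - 2)  # residual variance estimate
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + mx ** 2 / sxx))
    return a, b, se_a, se_b


def shift_without(x, y, drop):
    """Change in (a, b) after dropping index `drop`, in original standard errors."""
    a, b, se_a, se_b = fit_with_se(x, y)
    x2 = [xi for i, xi in enumerate(x) if i != drop]
    y2 = [yi for i, yi in enumerate(y) if i != drop]
    a2, b2, _, _ = fit_with_se(x2, y2)
    return (a2 - a) / se_a, (b2 - b) / se_b


# Hypothetical data: roughly y = 1 + 2x, with a suspicious last point.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 30.0]
intercept_shift, slope_shift = shift_without(x, y, drop=7)
print(f"intercept moved {intercept_shift:+.2f} SE, slope moved {slope_shift:+.2f} SE")
```

Note that the suspect point also inflates the original standard errors, so a shift measured this way can understate how much the fit really changed.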

    Remember that least squares regression works by finding the estimates that minimize the sum of the squared residuals:

    [tex] S(\hat \alpha, \hat \beta) = \min_{(a,b) \in \mathbb{R}^2} \sum \left(y_i - (a + b x_i)\right)^2 [/tex]

    The problem with squaring the residuals is that the resulting estimates are easily influenced by outliers - just as the sample mean is. The notion of robust regression is this: replace the operation of squaring with a function that has similar interpretations (can be thought of as a distance function, essentially) but which will downplay the role of outliers. One early choice was to use absolute values (think of the median compared to the mean)

    [tex] S_{L1}(\hat \alpha, \hat \beta) = \min_{(a,b) \in \mathbb{R}^2} \sum \left|y_i - (a + bx_i)\right| [/tex]

    This can be solved via linear programming and other methods, but there is no closed form expression for the estimates. This provides some protection against [tex] y [/tex] outliers but not against [tex] x [/tex] outliers.
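For small data sets the L1 fit can be found without a linear programming solver. Because the problem is a linear program, a minimizing line can always be chosen to pass through at least two of the data points, so one can simply try every pair. This brute-force sketch is an illustration under that assumption, not an efficient method; the data and function name are made up.

```python
from itertools import combinations


def l1_fit(x, y):
    """Least-absolute-deviations line y ~ a + b*x, by checking all point pairs."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line cannot be written as a + b*x
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        cost = sum(abs(yk - (a + b * xk)) for xk, yk in zip(x, y))
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]


# Four points exactly on y = 1 + 2x, plus one large y outlier.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 30]
a_l1, b_l1 = l1_fit(x, y)
print(f"L1 fit: intercept={a_l1:.3f}, slope={b_l1:.3f}")
```

On this data an ordinary least-squares fit gives a slope of 6.2, dragged upward by the last point, while the L1 line stays on the four well-behaved points.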

    Robust regression starts from this observation: both least squares and L1 regression can be viewed as minimizing the following sum, a function of the residuals

    [tex] \sum \rho(y_i - (a+bx_i)) [/tex]

    For least squares [tex] \rho(x) = x^2 [/tex], for L1 [tex] \rho(x) = |x| [/tex]. Huber (and others) had the idea of generalizing this to other forms of the function [tex] \rho [/tex].

    The "simplest" case is Huber's function, which is "least squares in the middle and L1 on the ends". To ease typing I will use [tex] e_i [/tex] as an abbreviation for [tex] y_i - (a + bx_i) [/tex]

    [tex] \rho(e_i) = \begin{cases}
    \frac{e_i^2}{2} & \text{ if } |e_i| \le k\\
    k |e_i| - \frac{1}{2} k^2 & \text{ if } |e_i| > k
    \end{cases} [/tex]

    The number [tex] k [/tex] is a "tuning constant" - a common choice is [tex] 1.345 [/tex], chosen for a number of theoretical and practical reasons. (Admission: this formulation assumes that the standard deviation of [tex] y [/tex] is known: if not, things can get a little messier.) The solution can be found using Weighted Least Squares.
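One common way to carry out the weighted least squares step is to iterate it: fit, compute residuals, give each point weight 1 if its scaled residual is within k and weight k/|e| otherwise, then refit until the coefficients stop changing. The sketch below assumes, as in the formulation above, that the standard deviation of y is known and passed in as `sigma`. The function names, iteration limits, and data are illustrative assumptions.

```python
def weighted_linear_fit(x, y, w):
    """Weighted least-squares line y ~ a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def huber_fit(x, y, k=1.345, sigma=1.0, n_iter=200, tol=1e-10):
    """Huber regression by iteratively reweighted least squares."""
    # Start from the ordinary least-squares fit (all weights equal to 1).
    a, b = weighted_linear_fit(x, y, [1.0] * len(x))
    for _ in range(n_iter):
        w = []
        for xi, yi in zip(x, y):
            e = abs(yi - (a + b * xi)) / sigma
            # Full weight in the middle, down-weighted on the ends.
            w.append(1.0 if e <= k else k / e)
        a_new, b_new = weighted_linear_fit(x, y, w)
        converged = abs(a_new - a) < tol and abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b


# Four points exactly on y = 1 + 2x, plus one large y outlier.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 30]
a_ols, b_ols = weighted_linear_fit(x, y, [1.0] * len(x))
a_hub, b_hub = huber_fit(x, y)
print(f"least squares slope={b_ols:.3f}, Huber slope={b_hub:.3f}")
```

The Huber slope lands much closer to 2 than the least-squares slope does, though it is still pulled slightly toward the outlier, since the outlier's influence is bounded rather than removed.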

    Other choices for the [tex] \rho [/tex] function are available, offering different benefits.

    There are other robust regression methods. Regression based on ranks can be motivated by this starting point. Here is the least squares function again.

    [tex] \sum \left(y_i - (a+bx_i)\right)^2 = \sum \left(y_i - (a+bx_i)\right) \cdot \left(y_i - (a+bx_i)\right) [/tex]

    The idea is to replace one of the terms in the second form by an expression that involves a function of the ranks of the residuals.

    [tex] \sum \phi\left(\text{R}(y_i - (a+bx_i))\right) \cdot \left(y_i - (a+bx_i)\right) [/tex]

    Judicious choice of the score function [tex] \phi [/tex] can result in estimates that have reasonable break-down values and bounded influence. The 'drawback' here is that the intercept cannot be directly estimated; this must be done with some function of the residuals (the median of the residuals, for example).

    All the robust procedures I've mentioned give estimates that are asymptotically normal in distribution, which means (from a practical point of view) confidence intervals and other estimates can be easily obtained.

    The down side? I don't know of any add-ins for Excel that will do any of these methods - there may be some, but since I don't use Excel in my teaching or other work I don't know of them.

    There is a free statistics package, R, that has packages for robust regression. Don't let the word "free" lead you to thinking that this is a shabby program: it is very powerful. It runs on Windows, OS X, and Linux. You can find more information here.


    I hope some (most?) of this is the type of response you wanted.
  10. Dec 22, 2008 #9
    To note more about R, it's pretty much a free version of S-PLUS if the OP is more familiar with that.
  11. Dec 22, 2008 #10


    Homework Helper

    True - I didn't mention that because only Excel had been discussed earlier. R doesn't have the pretty GUI of S-Plus, but the lack of the price tag makes up for it in a massive way.
  12. Dec 22, 2008 #11
    If the user likes pretty GUIs then he/she should look into SPSS. I prefer the layout of SAS/S-PLUS/R.

    Not to hijack the thread, but what can SAS do that R cannot? (Or is it more the fact that SAS has packages already developed, whereas the ones for R come out whenever people create them?)
  13. Dec 22, 2008 #12


    Homework Helper

    SAS is a little quicker with massive data sets, and since it has been around much longer than R, it has the reputation of being an "industry standard".
    Programming in SAS is also a little more flexible than it is in R. I'm not sure how the availability of extra packages for the two compare. I do know that R is growing in popularity in a variety of biostatistics applications.

    Note that my comment about the GUI was not aimed at the OP, it was meant to be a humorous comparison between R and S-Plus.