Chi Squared vs. Coefficient of Determination

Enjolras1789 · Dec 7, 2008

I am confused about the meanings of these two goodness-of-fit descriptors.

If I have two independent variables and I wish to test the validity of a proposed functional relation between them (say I have measured electric potentials and currents and wish to test if the data reflects a linear relationship), which should I use?

What are the differences between these tests?

What texts would you recommend that have full proofs of such arguments?

Forgive me if this is a stupid question, but I have had difficulty in finding a text explaining the differences.

stewartcs · Dec 8, 2008

You'll probably want to perform a regression analysis (which will give you the coefficient of determination) and then see what can be inferred from your results. A Chi-square test statistic can be used in a hypothesis test.

The coefficient of determination tells you how well your linear model accounts for the variation in the data (i.e. the larger the value, the more of the variation in the response the model explains).
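As an illustration of that definition, here is a minimal Python sketch (not from the original posts) that fits a straight line by least squares and computes the coefficient of determination as 1 - SS_res / SS_tot. The current and potential readings are made-up values, and the function names are hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b


def r_squared(x, y):
    """Coefficient of determination of the least squares line."""
    a, b = linear_fit(x, y)
    ybar = sum(y) / len(y)
    # Variation left unexplained by the line
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation of y about its mean
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical readings: current in amperes, potential in volts
current = [0.1, 0.2, 0.3, 0.4, 0.5]
potential = [1.02, 1.98, 3.05, 3.96, 5.01]
print("R^2 =", r_squared(current, potential))
```

A value close to 1 only says that the line accounts for most of the variation in these particular readings; it is not by itself a probability.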

Any statistics book should have some sections on linear regression and hypothesis testing. Here is some info on the net that may help: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm

Hope this helps.

CS

Enjolras1789 · Dec 8, 2008

Your reply and link seem to confirm that which I have read and thought but deny what people around me seem to say.

In my mind, the logical test for assessing whether a curve fit is justified is the coefficient of determination. One then converts the coefficient of determination into a probability of correlation by considering the degrees of freedom in the problem in question. Is this correct? Are there limitations on this test?

I am well aware that Wikipedia is by NO means a source, but I was disturbed to read the claims that there is no consensus on what the coefficient of determination means. From my limited understanding, its meaning seems crystal clear, and the Wikipedia description seems to be nonsense. However, I am by no means an expert in statistics, just a solid state theory graduate student.

Additionally, the link you gave describes the chi squared test as a test of whether a certain statistical distribution validly applies to the data. I am used to this description. It makes sense to me. However, at one point in Taylor's introductory book on statistics and error, he briefly mentions that chi squared can be used to assess linear fits. I have also heard different physicists whose intelligence I respect invoke chi squared to judge curve fits. Can chi squared be used to judge an arbitrary curve fit? Conceptually, I don't see why it cannot, considering how it judges a distribution, but then I do not know which is better, chi squared or r squared, to judge a curve fit.

Do these observations make sense to you? Can chi squared be used to judge arbitrary curve fits? If so, how? If so, when should one use each test? What are their differences? Or is it that applying chi squared to curve fitting is not logical?

Please forgive me if this is personal, but would you mind also stating your background (as in what field you work, degree) in case there are different conventions in different fields? The people I know who use chi squared as a curve fit test are both spectroscopists, one in chemical physics, one in quantum optics.

statdad · Dec 8, 2008

"One then converts the coefficient of determination into a probability of correlation by considering the degrees of freedom in the problem in question"

No; [tex]R^2[/tex] is not, in itself, a measure of probability. It has the classical interpretation of "explaining the percentage of variability in Y that is due to the variable x", but even this is limited, for at least three reasons.

1.) If the regression is forced to go through the origin, the coefficient is meaningless.
2.) The coefficient is easily influenced by outliers and what we (statisticians) refer to as leverage points. This means that a large value of [tex]R^2[/tex] can be due entirely to a single data value.
3.) If you move past simple regression (adding more x variables) [tex]R^2[/tex] will never decrease, and will almost always increase, even if those variables add nothing to your model. A modified, or adjusted, coefficient is usually suggested then.
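The point about a single data value can be seen in a small Python sketch. This is only an illustration with made-up numbers (not data from the thread): five points with no linear trend at all, and then the same five points plus one far-away leverage point. The function name is hypothetical.

```python
def r_squared_simple(x, y):
    """R^2 of a simple least squares line, via r^2 = Sxy^2 / (Sxx * Syy)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Made-up cluster with no linear relationship between x and y
x_cluster = [1, 2, 3, 4, 5]
y_cluster = [2, 1, 3, 1, 2]

# The same cluster plus one hypothetical point far from the rest
x_lever = x_cluster + [50]
y_lever = y_cluster + [50]

print("R^2 without the far point:", r_squared_simple(x_cluster, y_cluster))
print("R^2 with the far point:   ", r_squared_simple(x_lever, y_lever))
```

The second value is close to 1 even though five of the six points show no relationship, which is why a large [tex]R^2[/tex] should be checked against a plot of the data.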

Enjolras1789 · Dec 8, 2008

I am very grateful for your response, thank you very much for your time.

I do not mean to impugn anything you have written, I merely ask some questions to complete my understanding of the material for myself.

If you are familiar with "An Introduction to Error Analysis" by John Taylor, can you comment on the content of p. 215-220? If I understand the material, a big if, it seems that Taylor is directly claiming that the coefficient of determination's square root is the correlation coefficient, and the correlation coefficient is converted to the probability of correlation by the degrees of freedom in the table in those pages. Do I not understand the material? Is Taylor wrong?

Regarding your individual points of weakness,

1.) I don't see how this is a weakness. I have been taught, and it seems reasonable to me, that one should NEVER force a curve fit to go through any point if you want an honest statistical analysis. In my mind, it defeats the point of objectively analyzing the data to force additional points on them in the curve fit. Is this naive?
2.) Can you direct me to a text which gives some sort of proof of this statement so I might better understand the justification? Is there a compensation method? When you say outlier, my immediate thought is: can't the outliers be eliminated anyway using Chauvenet's criterion? If so, doesn't this remove this weakness?
3.) Makes sense, I have seen this argument before, I was being sloppy and should have explicitly stated this condition being met.

Can you comment on using chi squared?

statdad · Dec 8, 2008

I think I have some answers, but I need to head to class and may add more later.

Enjolras1789 said:

I am very grateful for your response, thank you very much for your time.

I do not mean to impugn anything you have written, I merely ask some questions to complete my understanding of the material for myself.

If you are familiar with "An Introduction to Error Analysis" by John Taylor, can you comment on the content of p. 215-220? If I understand the material, a big if, it seems that Taylor is directly claiming that the coefficient of determination's square root is the correlation coefficient, and the correlation coefficient is converted to the probability of correlation by the degrees of freedom in the table in those pages. Do I not understand the material? Is Taylor wrong?
*********************************
I'm not an engineer, but a statistician, so am not familiar with the text of which you speak.
It is ALMOST true that the square root of [tex]R^2[/tex] is the correlation coefficient: the correct statement is

[tex] r = \text{sign of regression slope} \cdot \sqrt{\, R^2}[/tex]

where I use [tex]r[/tex] as the correlation. Neither of these is a probability: both can be used in test statistics to determine the "validity" of the regression. In simple linear regression the F-test for slope can be written in terms of the appropriate sums of squares or in terms of [tex]R^2[/tex], and both are equivalent to testing for correlation.
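These relationships can be checked numerically. The following Python sketch is an illustration only, using made-up data with a negative slope; the function name and dictionary keys are hypothetical. It computes r both directly and as sign(slope) times the square root of [tex]R^2[/tex], and computes the F statistic for the slope both from the sums of squares and from [tex]R^2[/tex].

```python
import math


def regression_summary(x, y):
    """Slope, R^2, r and the F statistic for a simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - ss_res / syy
    return {
        "slope": slope,
        "r2": r2,
        # r = (sign of regression slope) * sqrt(R^2)
        "r_from_r2": math.copysign(math.sqrt(r2), slope),
        # r computed directly from the sums of squares and cross products
        "r_direct": sxy / math.sqrt(sxx * syy),
        # F statistic for the slope, written in terms of sums of squares
        "f_from_ss": (syy - ss_res) / (ss_res / (n - 2)),
        # The same F statistic written in terms of R^2
        "f_from_r2": r2 * (n - 2) / (1.0 - r2),
        # t statistic for testing zero correlation; its square equals F
        "t_from_r": math.copysign(math.sqrt(r2 * (n - 2) / (1.0 - r2)), slope),
    }


# Made-up data with a decreasing trend
x = [1, 2, 3, 4, 5, 6]
y = [9.8, 8.1, 6.2, 3.9, 2.1, 0.2]
for name, value in regression_summary(x, y).items():
    print(name, "=", value)
```

The probability statement comes from comparing F (or t) with its reference distribution for the given degrees of freedom, not from [tex]R^2[/tex] or r alone.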
*********************************************
Regarding your individual points of weakness,

1.) I don't see how this is a weakness. I have been taught, and it seems reasonable to me, that one should NEVER force a curve fit to go through any point if you want an honest statistical analysis. In my mind, it defeats the point of objectively analyzing the data to force additional points on them in the curve fit. Is this naive?
********************************
It is not that you are forcing a regression line through a particular point. Remember that linear regression (simple linear regression) essentially splits the plane into quadrants, with 'origin' at the single point [tex](\overline x, \overline y)[/tex] (the center of mass). A leverage point is any point that is significantly far removed from this point, and a leverage point will attract the regression line toward it as a magnet attracts shavings. You don't have to do this; it is a consequence of the underlying process.
*****************************************
2.) Can you direct me to a text which gives some sort of proof of this statement so I might better understand the justification? Is there a compensation method? When you say outlier, my immediate thought is: can't the outliers be eliminated anyway using Chauvenet's criterion? If so, doesn't this remove this weakness?

No, you should NEVER remove a point believed to be an outlier, or a leverage point, simply because you or some test identifies it as such. The only time it is appropriate to remove outliers is when you can conclude that there has been a problem with the process of recording that value (usually not fixable) or with the data entry (transposition of digits, misplaced decimal, etc). If there is no "problem" that can be seriously identified as the cause of an outlier, removing the offending point from the analysis is inappropriate. This is (only) one of the reasons there has been so much development of robust (resistant) regression methods.
********************************************************
3.) Makes sense, I have seen this argument before, I was being sloppy and should have explicitly stated this condition being met.

Can you comment on using chi squared?

I may need a little more information on your goal about this: one major use of chi square analysis is in the analysis of contingency tables (counts); how were you thinking this would relate to regression?
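For reference, the usage that physicists usually have in mind when they apply chi squared to a curve fit is the sum of squared residuals, each divided by the variance of that measurement, often reported per degree of freedom. The sketch below is a minimal Python illustration under the assumption that every measurement comes with a known standard uncertainty; that assumption, the data, the fitted parameters, and the function names are all hypothetical and not taken from the posts above.

```python
def chi_squared(x, y, sigma, model):
    """Sum of squared residuals, each scaled by its measurement uncertainty."""
    return sum(((yi - model(xi)) / si) ** 2 for xi, yi, si in zip(x, y, sigma))


def reduced_chi_squared(x, y, sigma, model, n_params):
    """Chi squared divided by the degrees of freedom (points minus fitted parameters)."""
    dof = len(x) - n_params
    return chi_squared(x, y, sigma, model) / dof


# Hypothetical readings: current (A), potential (V), and an assumed
# standard uncertainty of 0.05 V on each potential reading
current = [0.1, 0.2, 0.3, 0.4, 0.5]
potential = [1.02, 1.98, 3.05, 3.96, 5.01]
sigma_v = [0.05] * len(current)


def fitted_line(i):
    # Hypothetical fitted parameters (intercept and slope), for illustration only
    return 0.02 + 9.96 * i


print("chi squared         =", chi_squared(current, potential, sigma_v, fitted_line))
print("reduced chi squared =", reduced_chi_squared(current, potential, sigma_v, fitted_line, 2))
```

Under these assumptions, a reduced chi squared near 1 is usually read as the scatter about the curve being consistent with the stated uncertainties, while a much larger value points to a poor model or underestimated uncertainties. Unlike [tex]R^2[/tex], this requires knowing the measurement uncertainties in advance.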

Hope at least some of this makes sense.

flying_young · Dec 21, 2008

statdad said:

No, you should NEVER remove a point believed to be an outlier, or a leverage point, simply because you or some test identifies it as such. The only time it is appropriate to remove outliers is when you can conclude that there has been a problem with the process of recording that value (usually not fixable) or with the data entry (transposition of digits, misplaced decimal, etc). If there is no "problem" that can be seriously identified as the cause of an outlier, removing the offending point from the analysis is inappropriate. This is (only) one of the reasons there has been so much development of robust (resistant) regression methods.

Thanks StatDad, this is immensely helpful. So what exactly do you do when there are one or more influential outliers that are not due to human error? Can you name, or perhaps explain, a robust regression method?

Thank you kindly!

statdad · Dec 22, 2008

If all you have at your disposal is software that does Least Squares regression, you are rather stuck. It can be helpful to rerun the analysis without the problem data values, but only to see how much the results change without them. If the coefficients change by more than a couple of standard errors of their original values (or worse, change sign, or cause a significant relationship to become non-significant, etc.) you have a serious problem.
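That diagnostic can be sketched in a few lines of Python. This is an illustration with made-up data, not a standard routine: it refits the line without one suspect point and reports how far the slope moved, measured in standard errors of the original slope. The function names are hypothetical.

```python
import math


def fit_with_se(x, y):
    """Least squares line y = a + b*x with the standard error of the slope."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = ss_res / (n - 2)          # residual variance estimate
    se_b = math.sqrt(s2 / sxx)     # standard error of the slope
    return a, b, se_b


def slope_shift_in_se(x, y, drop_index):
    """Change in slope after dropping one point, in units of the original SE."""
    _, b_full, se_full = fit_with_se(x, y)
    x_drop = x[:drop_index] + x[drop_index + 1:]
    y_drop = y[:drop_index] + y[drop_index + 1:]
    _, b_drop, _ = fit_with_se(x_drop, y_drop)
    return abs(b_drop - b_full) / se_full


# Made-up data: seven points near y = 2x and one suspect point at x = 20
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 10.0]
print("slope shift in standard errors:", slope_shift_in_se(x, y, drop_index=7))
```

The refit is used only to measure sensitivity. A shift of more than about two standard errors is the warning sign described above, not a license to discard the point.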

Remember that least squares regression works by finding the estimates that minimize the sum of the squared residuals:

[tex] S(\hat \alpha, \hat \beta) = \min_{(a,b) \in \mathbb{R}^2} \sum \left(y_i - (a + b x_i)\right)^2[/tex]

The problem with squaring the residuals is that the resulting estimates are easily influenced by outliers, just as the sample mean is. The notion of robust regression is this: replace the operation of squaring with a function that has similar interpretations (can be thought of as a distance function, essentially) but which will downplay the role of outliers. One early choice was to use absolute values (think of the median compared to the mean):

[tex] S_{L1}(\hat \alpha, \hat \beta) = \min_{(a,b) \in \mathbb{R}^2} \sum |y_i - (a + bx_i)|[/tex]

This can be solved via linear programming and other methods, but there is no closed form expression for the estimates. This provides some protection against [tex]y[/tex] outliers but not against [tex]x[/tex] outliers.
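For small data sets the L1 fit can be found without a linear programming solver. The Python sketch below is an illustration, not the method a statistics package would use: it relies on the fact that, for a straight line, some minimizer of the sum of absolute residuals passes through at least two of the data points, so it simply tries every pair. The data and the function name are made up.

```python
from itertools import combinations


def l1_fit(x, y):
    """Least absolute deviations line, by exhaustive search over pairs of points."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line cannot be written as y = a + b*x
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        cost = sum(abs(yk - (a + b * xk)) for xk, yk in zip(x, y))
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]


# Made-up data: four points exactly on y = x and one gross y outlier
x = [0, 1, 2, 3, 4]
y = [0, 1, 2, 3, 100]
a, b = l1_fit(x, y)
print("L1 intercept:", a, "L1 slope:", b)
```

On this example the L1 line stays on the four well-behaved points, whereas the ordinary least squares slope for the same data is 20.2. The search costs on the order of n cubed operations, so it is only sensible for small n.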

Robust regression starts from this observation: both least squares and L1 regression can be viewed as a process of minimizing the following sum, a function of the residuals:

[tex] \sum \rho(y_i - (a+bx_i))[/tex]

For least squares [tex]\rho(x) = x^2[/tex], for L1 [tex]\rho(x) = |x|[/tex]. Huber (and others) had the idea of generalizing this to other forms of the function [tex]\rho[/tex].

The "simplest" case is Huber's function, which is "least squares in the middle and L1 on the ends". To ease typing I will use [tex]e_i[/tex] as an abbreviation for [tex]y_i - (a + bx_i)[/tex]:

[tex] \rho(e_i) = \begin{cases} \frac{e_i^2}{2} & \text{if } |e_i| \le k\\ k |e_i| - \frac{1}{2} k^2 & \text{if } |e_i| > k \end{cases}[/tex]

The number [tex]k[/tex] is a "tuning constant" - a common choice is [tex]1.345[/tex], chosen for a number of theoretical and practical reasons. (Admission: this formulation assumes that the standard deviation of [tex]y[/tex] is known: if not, things can get a little messier.) The solution can be found using Weighted Least Squares.
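A minimal Python sketch of the weighted least squares route follows. It is an illustration under stated assumptions, not a reference implementation: the scale of the residuals is taken as known (the parameter sigma, default 1, is an assumption matching the admission above), the iteration starts from the ordinary least squares fit, and the function names are hypothetical. Each pass gives weight 1 to residuals inside the threshold and weight k/|e| to those outside it.

```python
def weighted_fit(x, y, w):
    """Weighted least squares line y = a + b*x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b


def huber_fit(x, y, sigma=1.0, k=1.345, max_iter=500, tol=1e-10):
    """Huber regression by iteratively reweighted least squares (known scale)."""
    a, b = weighted_fit(x, y, [1.0] * len(x))  # start from ordinary least squares
    for _ in range(max_iter):
        weights = []
        for xi, yi in zip(x, y):
            e = abs(yi - (a + b * xi)) / sigma   # scaled residual size
            weights.append(1.0 if e <= k else k / e)
        a_new, b_new = weighted_fit(x, y, weights)
        converged = abs(a_new - a) < tol and abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b


# Made-up data: four points on y = x and one gross outlier
x = [0, 1, 2, 3, 4]
y = [0, 1, 2, 3, 100]
print("Huber fit (intercept, slope):", huber_fit(x, y))
```

The outlier still pulls on the line, but its pull is capped by the tuning constant rather than growing with the size of the residual. In practice the scale would usually be estimated from the data, which this sketch does not attempt.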

Other choices for the [tex]\rho[/tex] function are available, offering different benefits.

There are other robust regression methods. Regression based on ranks can be motivated by this starting point. Here is the least squares function again.

[tex] \sum \left(y_i - (a+bx_i)\right)^2 = \sum {\left(y_i - (a+bx_i)\right) \cdot \left(y_i - (a+bx_i)\right)}[/tex]

The idea is to replace one of the terms in the second form by an expression that involves a function of the ranks of the residuals.

[tex] \sum \phi\left(\text{R}(y_i - (a+bx_i))\right) \left(y_i - (a+bx_i)\right)[/tex]

Judicious choice of the score function [tex]\phi[/tex] can result in estimates that have reasonable break-down values and bounded influence. The 'drawback' here is that the intercept cannot be directly estimated; this must be done with some function of the residuals (the median of the residuals, for example).
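As one concrete and hedged example of such a score function, the Python sketch below uses centered linear scores, rank minus (n + 1)/2, which are proportional to the Wilcoxon scores. This choice is an assumption made for illustration, as are the data and function names. With these scores the rank-weighted sum is piecewise linear and convex in the slope, with kinks only at the slopes between pairs of points, so the sketch evaluates it at every pairwise slope and keeps the smallest. The intercept is then taken as the median of the residuals.

```python
from itertools import combinations
from statistics import median


def rank_dispersion(b, x, y):
    """Sum of (centered rank of residual) * residual for slope b."""
    e = [yi - b * xi for xi, yi in zip(x, y)]
    order = sorted(range(len(e)), key=lambda i: e[i])
    n = len(e)
    total = 0.0
    for rank, i in enumerate(order, start=1):
        total += (rank - (n + 1) / 2) * e[i]
    return total


def rank_fit(x, y):
    """Rank-based line: slope from the dispersion, intercept from the residual median."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    ]
    b = min(slopes, key=lambda s: rank_dispersion(s, x, y))
    a = median([yi - b * xi for xi, yi in zip(x, y)])
    return a, b


# Made-up data: five points on y = x and one gross outlier
x = [0, 1, 2, 3, 4, 5]
y = [0, 1, 2, 3, 4, 100]
print("rank-based fit (intercept, slope):", rank_fit(x, y))
```

The intercept never enters the dispersion, because the centered scores sum to zero, which is why it has to be estimated separately from the residuals.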

All the robust procedures I've mentioned give estimates that are asymptotically normal in distribution, which means (from a practical point of view) confidence intervals and other estimates can be easily obtained.

The down side? I don't know of any add-ins for Excel that will do any of these methods - there may be some, but since I don't use Excel in my teaching or other work I don't know of them.

There is a free statistics package, R, that has packages for robust regression. Don't let the word "free" lead you to thinking that this is a shabby program: it is very powerful. It runs on Windows, OS X, and Linux. You can find more information here.

http://cran.r-project.org/

I hope some (most?) of this is the type of response you wanted.

NoMoreExams · Dec 22, 2008

To note more about R, it's pretty much a free version of S-PLUS if the OP is more familiar with that.

statdad · Dec 22, 2008

True - I didn't mention that because only Excel had been discussed earlier. R doesn't have the pretty GUI of S-Plus, but the lack of the price tag makes up for it in a massive way.

NoMoreExams · Dec 22, 2008

If the user likes pretty GUIs then he/she should look into SPSS. I prefer the layout of SAS/S-PLUS/R.

Not to hijack the thread, but what can SAS do that R cannot (or is it more the fact that SAS has packages already developed whereas the ones for R come out whenever people create them)?

statdad · Dec 22, 2008

SAS is a little quicker with massive data sets, and since it has been around much longer than R, it has the reputation of being an "industry standard".
Programming in SAS is also a little more flexible than it is in R. I'm not sure how the availability of extra packages for the two compare. I do know that R is growing in popularity in a variety of biostatistics applications.

Note that my comment about the GUI was not aimed at the OP, it was meant to be a humorous comparison between R and S-Plus.
