Checking if the residues are normal ad nauseam?

  • Context: Graduate
  • Thread starter: nomadreid
  • Tags: Normal

Discussion Overview

The discussion revolves around the process of checking the normality of residuals when fitting data to a curve. Participants explore the implications of normal distribution in statistical analysis, the necessity of such checks, and the potential for developing new statistical tests.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant questions the necessity of checking if residuals are normally distributed, suggesting that having a mean of 0 and standard deviation of 1 may suffice.
  • Another participant emphasizes the lack of a definitive statistical test for normality, noting that all standard tests only provide probabilities based on the assumption of normal distribution.
  • A participant expresses confusion about the process of checking residuals against normality and suggests that statistical tests like Pearson's correlation coefficient could be relevant.
  • There is mention of the subjective nature of applying statistics to real-life data and the distinction between estimation and hypothesis testing in statistical analysis.
  • Concerns are raised about the potential for misleading results if residuals are not normally distributed, referencing examples like Anscombe's quartet.

Areas of Agreement / Disagreement

Participants express differing views on the necessity and methodology of checking residuals for normality. There is no consensus on whether such checks are essential or how they should be conducted.

Contextual Notes

Participants highlight limitations in existing methods for testing normality and the complexity of defining the "power" of statistical tests. The discussion also touches on the subjective nature of statistical analysis in practical applications.

nomadreid
If I am checking whether my data fits a curve C1, I have to check to see whether the residues R1n are normally distributed, which is checking R1n against a normal curve C2, giving me residues R2m; which must be normally distributed, that is, must be checked against a normal curve C3, giving me residues R3p, and so on ad nauseam. Where does this end?
 
The normal distribution of your residues is not necessary: if their mean is 0 and their standard deviation is 1, you are done. Any deviation from a normal distribution there would indicate some weird (non-Gaussian) uncertainties for the individual data points.
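As a concrete illustration of the point above, here is a minimal sketch in Python (the data are made up for illustration, not from the thread): fit a line by ordinary least squares and inspect the residuals' mean and standardized spread.

```python
# Minimal sketch (hypothetical data): fit y = a + b*x by least squares,
# then inspect the residuals' mean and their standardized values.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]  # roughly y = 2x

n = len(x)
mx = sum(x) / n
my = sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

mean_r = sum(residuals) / n
# Residual standard error, with n - 2 degrees of freedom for a line fit.
sd_r = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
standardized = [r / sd_r for r in residuals]

print(mean_r)  # ≈ 0, up to floating-point noise
```

Note that with an intercept in the model the residual mean is zero by construction, so the informative check is whether the standardized residuals look roughly like draws from N(0, 1).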
 
nomadreid said:
If I am checking whether my data fits a curve C1, I have to check to see whether the residues R1n are normally distributed, which is checking R1n against a normal curve C2, giving me residues R2m; which must be normally distributed, that is, must be checked against a normal curve C3, giving me residues R3p, and so on ad nauseam. Where does this end?

In the first place, what do you mean when you say you are "checking"? You aren't describing a definite statistical test. I can appreciate your general train of thought. If there were some method of determining whether a given sample definitely did or did not come from a normal distribution, then a similar method could be applied to the residues of plotting the histogram of the data vs the normal probability density. Then a similar method could also be applied to residues of the residues, etc. However, there is no such foolproof method. All that standard statistical hypothesis tests for normality compute is the probability of certain aspects of the observed data given that we assume it came from a normal distribution. If you don't assume it came from a given distribution, you can't compute anything. (If this is upsetting, see Bayesian statistics.)

It is possible that you could invent a statistical hypothesis test based on residues-of-residues. To compare the utility of that test to the customary tests, people would look at the "power" of your test. The "power" of a test is complicated to define: it isn't a single number, but a curve or surface that depends on how you parameterize the shapes of the non-normal distributions you consider.
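A rough sketch of the point about normality tests (the statistic below is the standard Jarque–Bera form, not any test proposed in the thread, and the samples and seed are made up): the statistic combines sample skewness and excess kurtosis, and its reference distribution (chi-squared with 2 degrees of freedom) is derived *under the assumption* that the sample is normal.

```python
# Sketch: a Jarque-Bera-style normality statistic on two synthetic samples.
# Sample sizes and seed are arbitrary choices for illustration.
import random

def jarque_bera(sample):
    # Sample skewness and kurtosis from central moments.
    n = len(sample)
    m = sum(sample) / n
    m2 = sum((v - m) ** 2 for v in sample) / n
    m3 = sum((v - m) ** 3 for v in sample) / n
    m4 = sum((v - m) ** 4 for v in sample) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

rng = random.Random(0)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]  # clearly non-normal

jb_norm = jarque_bera(normal_sample)
jb_skew = jarque_bera(skewed_sample)
# Under normality, the statistic is roughly chi-squared(2); large values count
# as evidence against normality -- but "large" is only calibrated assuming H0.
print(jb_norm, jb_skew)
```

The heavily skewed sample produces a far larger statistic than the normal one, but the translation of that number into a probability still rests on the normality assumption, which is exactly the point made above.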
 
Thanks for the answers, mfb and Stephen Tashi. (Sorry for the delayed response.) Apparently statisticians rely quite a bit on "hm, looks OK". (I'm not at all a statistician, which you can certainly tell from my beginner's questions; I'm more used to those strange places in mathematics where correlation is a yes/no affair unless you are doing perturbation theory. On the other hand, prior assumptions are the heart and soul of mathematics: "Er, well, let's call (N, <) consistent, and have done with it."):smile:
More seriously: the statistical test I had in mind for the beginning set of points was Pearson's correlation coefficient or something similar, where the residues should (I think) be more or less normally distributed, because otherwise (it appears at first glance at the formula) one could construct some wild mismatch between data and a line yet come up with a high r². It might not even be too difficult to construct such an example with a 0 mean and sd = 1. But as was pointed out, such a counter-example would probably look weird (something like Anscombe's quartet). Or, to a blind computer, there would be other tests (which I haven't got to yet in my self-study of statistics) to check whether it was weird. But then I was not sure about a test for the following steps to check data (residues) against normality; your answers indicate that there is none. Interesting.
 
nomadreid said:
But then I was not sure about a test for the following steps to check data (residues) against normality; your answers indicate that there is none. Interesting.

Curve fitting falls under the statistical topic of "estimation". This is a distinct topic from "hypothesis testing", which involves procedures that specify yes-or-no decisions. So if your goal is to find the best possible fit of a curve to an empirical distribution, you should approach it as a problem of estimation.

In the standard sort of statistics ("frequentist" statistics), people do sometimes employ several hypothesis tests to analyze data. (Wikipedia has an article about this under the topic of "Multiple Comparisons", which I haven't read carefully.)

Applying statistics to real life data is a subjective matter. The nature of hypothesis testing is that it is a procedure for producing a decision, not a proof that the decision is correct. In most cases, all that can be quantified is the probability of making the wrong decision given that the "null hypothesis" is assumed to be correct. (From the point of view of a proof, if one assumes the null hypothesis is true then there is nothing to decide about whether it is true or not.)
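That last quantity can be illustrated by simulation (a sketch with made-up settings, not anything from the thread): when the null hypothesis really holds, a test run at the 5% level should wrongly reject about 5% of the time, and that rate is the only thing being quantified.

```python
# Sketch: Type I error rate of a simple z-test when the null is true.
# Hypothetical setup: H0 says the data are N(0, 1) with known sigma = 1;
# we test mean = 0 at the 5% level and count how often we wrongly reject.
import math
import random

rng = random.Random(42)
n, trials, rejections = 30, 2000, 0
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = (sum(sample) / n) * math.sqrt(n)  # sqrt(n) * sample mean / sigma
    if abs(z) > 1.96:                     # two-sided 5% critical value
        rejections += 1

rate = rejections / trials
print(rate)  # should hover near 0.05
```

The simulation only tells us how often the procedure errs *given* the null; it says nothing about whether the null is true for any particular dataset, which is the subjective part.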
 
