Test for Normality of Residuals -- Is this how it Works?

In summary, the proposed test compares the standardized residuals (z-values) to what would be expected under a normal distribution. The 68-95-99.7 rule will not be followed exactly even for normal data, but various statistical tests are available to assess whether data come from a normal distribution. The Central Limit Theorem does not guarantee that sample data will fall within a certain range of values, and the general game plan for statistical hypothesis tests is to compute a statistic and compare it to its distribution under the null hypothesis. Regression lines need not have intercepts, and it is not possible to prove that a set of numbers comes from a normal distribution.
  • #1
WWGD
Science Advisor
Gold Member
Hi, let ##r_1, \dots, r_n## be the residuals from a given regression. I am trying to understand how the test for normality works. This is how I think it works:

We take the sample mean, i.e., ##\bar r := \frac{1}{n}\sum_i r_i##, and the sample standard deviation ##\sigma_r##. Then, if the residuals are normally distributed, we could find the z-values for each of the ##r_i##, using ##\bar r## and ##\sigma_r/\sqrt{n}##, and compare these to the actual z-values of the actual data. And, overall (this is an observation, not part of the test), around 68% of the data values should be within ##1\sigma_r## of the sample mean ##\bar r## (i.e., the sample data should follow the 68-95-99.7 rule). Is that it?
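For concreteness, here is a minimal sketch of this kind of check in Python (NumPy only). The residuals are simulated stand-ins for the residuals of an actual regression, and the standardization uses the sample standard deviation rather than ##\sigma_r/\sqrt{n}## (later posts touch on that distinction).

```python
import numpy as np

# Hypothetical residuals from some regression; replace with your own.
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=2.0, size=200)

# Sample mean and sample standard deviation of the residuals.
r_bar = residuals.mean()
s_r = residuals.std(ddof=1)

# Standardize each residual to a z-value.
z = (residuals - r_bar) / s_r

# If the residuals are roughly normal, about 68%, 95%, and 99.7% of the
# z-values should fall within 1, 2, and 3 standard deviations respectively.
for k in (1, 2, 3):
    coverage = np.mean(np.abs(z) <= k)
    print(f"within {k} sd: {coverage:.1%}")
```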
 
  • #2
If your regression has an intercept, the residuals will sum to zero.
 
  • #3
What do you mean, don't all regression lines have an intercept?
 
  • #4
If you do a least squares linear regression on a set of data, the mean of the residuals computed from that data will be zero - that's just because the formulas for a least squares fit with an intercept force this to happen.

If you normalize the residuals by dividing them by their sample standard deviation, then a histogram of the residuals should resemble a normal distribution, but the 68% rule may not be followed exactly even if the residuals really are a sample from a normal distribution. You can look up various statistical hypothesis tests for whether a set of data comes from a normal distribution.
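Here is a small sketch of the intercept point in Python (NumPy only, with made-up data): with a column of ones in the design matrix the residuals sum to zero up to floating-point error; without it they generally do not.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=100)

# Fit with an intercept: design matrix has a column of ones.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
res_with = y - X @ beta
print("sum of residuals, with intercept:   ", res_with.sum())

# Fit through the origin: no column of ones.
b, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
res_without = y - x * b[0]
print("sum of residuals, without intercept:", res_without.sum())
```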
 
  • #5
Thanks, but I am trying to understand how the test for normality of residuals is carried out. I assumed that if the residuals came from a normal population, then the sample residuals (meaning the residuals from the regression) would have mean equal to the sample mean and standard deviation equal to the standard error. Then I guess most residuals would fall within 3 SEs of the sample mean; I think this last point is a corollary of the CLT. I know that with almost any statistical package I could just do the test by double-clicking, but I am trying to understand the interpretation of whether the residual values fall along the line in the test.
 
  • #6
WWGD said:
Thanks, but I am trying to understand how the test for normality of residuals is carried out. I assumed that if the residuals came from a normal population, then the sample residuals (meaning the residuals from the regression) would have mean equal to the sample mean and standard deviation equal to the standard error.

It's improbable that a sample from a continuous population will have a sample mean exactly equal to the population mean and a sample standard deviation exactly equal to the population standard deviation.

Then I guess most residuals would fall within 3 SEs of the sample mean; I think this last point is a corollary of the CLT.

The Central Limit Theorem, like most theorems about probability, doesn't guarantee anything about observed frequencies in sample data. It tells you something about the probability distribution of a random variable. Also, there are various non-normal distributions for which there is a high probability that sample data fall within 3 standard deviations of the mean.

I know that with almost any statistical package I could just do the test by double-clicking, but I am trying to understand the interpretation of whether the residual values fall along the line in the test.

You have the correct intuitive idea of what the plot of the residuals should look like.

If you need to understand statistical tests for normality (as opposed to "eyeball" inspections), the first thing to understand about statistical tests is the way the "game" is played.

The way typical statistical hypothesis tests work is that a formula is applied to the data to compute a numerical value. This formula defines a "statistic". Since the statistic depends on the values in the sample, which are random, the statistic itself is a random variable. From one given set of data, you compute a single numerical value of the statistic. This gives you a sample of one value of the statistic, and from that sample of one value you attempt to draw a conclusion.

Under the null hypothesis (for example, the null hypothesis that the sample is from a normal distribution) the statistic has a probability distribution. Sometimes you can look up this distribution. If you made up the formula for the statistic yourself, you might have to determine the distribution empirically. For example, if the null hypothesis is that the data come from a normal distribution and you want to know the distribution of a home-made statistic computed from 100 data points, you could run a Monte Carlo simulation, generating thousands and thousands of random batches of 100 data points from a normal distribution. Computing the statistic for each batch and histogramming the results estimates the distribution of the statistic.

The graph of this distribution shows that the statistic is more likely to fall in a certain range of values. You pick a probable range of these values as the "acceptance" region. If the single numerical value you got from your actual data falls in this region, you "accept" that the null hypothesis is true. (This is not a proof that the null hypothesis is true; it is just a procedure.)
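As a sketch of that game plan (Python with NumPy; the statistic, sample sizes, and cutoffs below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100          # sample size per batch
n_sim = 10_000   # Monte Carlo batches

def my_statistic(x):
    # A home-made statistic: fourth central moment over squared variance
    # (this is close to 3 for normal data).
    m2 = np.mean((x - x.mean()) ** 2)
    m4 = np.mean((x - x.mean()) ** 4)
    return m4 / m2**2

# Estimate the null distribution of the statistic by simulating
# many batches of n points from a normal distribution.
null_stats = np.array([my_statistic(rng.normal(size=n)) for _ in range(n_sim)])

# Take the central 95% of the null distribution as the "acceptance" region.
lo, hi = np.quantile(null_stats, [0.025, 0.975])
print(f"acceptance region: [{lo:.2f}, {hi:.2f}]")

# Apply the test to one observed data set (here, simulated uniform data,
# which should usually land outside the region).
observed = my_statistic(rng.uniform(-1.0, 1.0, size=n))
verdict = "inside acceptance region" if lo <= observed <= hi else "outside (reject H0)"
print(f"observed statistic: {observed:.2f} -> {verdict}")
```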

Is the general game plan clear?
 
  • #7
"Don't all regression lines have intercepts?"
It is possible to fit a regression line without an intercept (rarely a good idea, however). If the intercept is included, the residuals will sum to zero. If the intercept is not included, the residuals generally won't sum to zero.

It isn't possible to prove any set of numbers comes from a normal distribution, for many reasons. Two of the most important:

  • no set of data is ever really normally distributed - the normal distribution is a mathematical model that tells us what patterns we can expect to see when data are very nicely mound-shaped and symmetric. In reality, many data sets resemble a normal distribution near the center but differ in the tails - where all the exciting stuff happens.
  • A test of normality puts "The residuals are normally distributed" in the null hypothesis, and a test does not prove the null hypothesis is true: if it doesn't give a strong indication that the null is false, it leaves us with "we can't say it is false".
In short - a test of normality really tells us either that the residuals aren't normal (H0 is rejected) or that "the differences between them and a normal distribution are small enough to ignore".

And you never, ever, "accept" the null hypothesis.
 
  • #8
Stephen Tashi said:
Is the general game plan clear?

Thanks, I am aware of this, but it was my understanding that in the test I was referring to, the Central Limit Theorem was necessary. Sorry, I should have given the name of the test, but I was not able to find it. The test comes down to seeing how well some function of the residuals falls along the line y = x. My best guess was that we were using the CLT this way: each residual is considered a random variable, and all of them are identically distributed and independent. Then, if we had enough residuals, say more than 15, the distribution of the sum of the residuals is normal (converges in distribution to a normal distribution), I believe with mean the sample mean and standard deviation the standard error SE := ##s/\sqrt{n}##, where s is the standard deviation of the sample data. But I cannot make sense of how this would become a test for normality.
 
  • #9
WWGD said:
Thanks, I am aware of this, but it was my understanding that in the test I was referring to, the Central Limit Theorem was necessary. Sorry, I should have given the name of the test, but I was not able to find it.
The title of your thread asks how a test for the normality of residuals works. There are many different tests for normality (look up "normality tests" on Wikipedia). If you can't remember which test is used, I don't think anyone can explain to you how the test works.

The test comes down to seeing how well some function of the residuals falls along the line y = x. My best guess was that we were using the CLT this way: each residual is considered a random variable, and all of them are identically distributed and independent. Then, if we had enough residuals, say more than 15, the distribution of the sum of the residuals is normal (converges in distribution to a normal distribution),

The CLT would tell you about the distribution of the sample mean of the observations, not about the distribution of the individual observations. If the residuals were independently and identically distributed with some non-normal distribution (for example, a uniform distribution), then the CLT tells you that the sample mean of a large number of observations is a random variable with an approximately normal distribution.

I believe with mean the sample mean and standard deviation the standard error SE := ##s/\sqrt{n}##, where s is the standard deviation of the sample data. But I cannot make sense of how this would become a test for normality.

You are correct, the CLT (by itself) isn't a basis for a test for normality.
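A quick sketch of that distinction (simulated uniform data; nothing here is specific to regression): the individual observations stay uniform, but their sample mean behaves approximately normally with standard deviation ##\sigma/\sqrt{n}##.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sim = 30, 10_000

# Individual observations: uniform on (0, 1), decidedly non-normal.
samples = rng.uniform(0.0, 1.0, size=(n_sim, n))

# Sample means across many batches: approximately normal by the CLT,
# with standard deviation close to sigma / sqrt(n).
means = samples.mean(axis=1)
print("sd of sample means:", means.std(ddof=1))
print("sigma / sqrt(n):   ", np.sqrt(1.0 / 12.0) / np.sqrt(n))
```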
 
  • #10
Sorry for being vague about the test; I was hoping someone would recognize which test it was from the description. The test measures how well (meaning how closely) the residual values fall along the line y = x. I will keep looking for the actual name.
 
  • #11
If you are referring to a graph, you want a normal probability plot, available in most any statistics package. If you want a test, you could look at the Kolmogorov-Smirnov test - it works by examining the largest difference between the empirical distribution function of the residuals and the hypothesized (model) distribution.
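A sketch of both suggestions using SciPy (the residuals are simulated placeholders). Note that feeding residuals standardized with their own estimated mean and standard deviation into a plain Kolmogorov-Smirnov test is only approximate; the Lilliefors variant adjusts for the estimated parameters.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
residuals = rng.normal(scale=2.0, size=200)  # stand-in for real residuals

# Normal probability plot data: ordered residuals vs normal quantiles.
# Points along a straight line (slope ~ sd, intercept ~ mean) suggest normality.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print("correlation of points with the fitted line:", r)

# Kolmogorov-Smirnov test against a standard normal, after standardizing
# with the sample's own mean and sd (approximate, as noted above).
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
statistic, p_value = stats.kstest(z, "norm")
print("KS statistic:", statistic, "p-value:", p_value)
```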

" the distribution of the sum of the residuals is normal"

Probably not. Unless your residuals are already reasonably symmetric, it would be rare for a sum of as few as 15 of them to resemble a normal distribution in any noteworthy way, and the sample size would be far too small to get an indication from a histogram (histograms are horribly unreliable for small samples).

It is worth noting that the classical statement of the assumptions for regression requires normality of the errors; that can be relaxed to symmetry alone for larger samples. If you have a fairly large sample, and the residuals' distribution is reasonably symmetric with no outliers, that is likely sufficient.
 
  • #12

1. What is the purpose of testing for normality of residuals?

The purpose of testing for normality of residuals is to assess whether the residuals of a statistical model follow a normal distribution. This is important because many statistical tests and models assume normality of the residuals, and violating this assumption can lead to incorrect conclusions.

2. How is normality of residuals typically assessed?

Normality of residuals is typically assessed using visual methods, such as a histogram or a normal probability plot, or statistical tests, such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test.
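For instance, a minimal SciPy sketch (the residuals are a simulated placeholder):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
residuals = rng.normal(size=150)  # stand-in for real residuals

# Shapiro-Wilk test: small p-values indicate departure from normality.
statistic, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk statistic:", statistic, "p-value:", p_value)
```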

3. What does it mean if the residuals are not normally distributed?

If the residuals are not normally distributed, it means that the model may not be capturing all of the important relationships in the data and may need to be revised. It also means that any statistical tests or inferences made using the model may not be reliable.

4. Can non-normal residuals be transformed to achieve normality?

Yes, normality can sometimes be achieved by transforming the data (most often the response variable) and refitting the model. Common transformations include logarithmic, square root, and inverse transformations. However, a transformation also changes the interpretation of the model and should be applied with caution.
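A sketch of the idea with simulated data: when the noise is multiplicative, fitting on the log scale typically leaves residuals that are much closer to symmetric and normal. The model and numbers below are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.uniform(1.0, 10.0, size=200)
# Multiplicative (lognormal) noise: residuals on the raw scale are skewed.
y = 5.0 * x * rng.lognormal(sigma=0.5, size=200)

def residuals_of_linear_fit(x, y):
    # Ordinary least squares with an intercept; returns the residuals.
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

raw_res = residuals_of_linear_fit(x, y)
log_res = residuals_of_linear_fit(np.log(x), np.log(y))

# Compare skewness: values near zero are what a normal distribution gives.
print("skewness of residuals, raw scale:", stats.skew(raw_res))
print("skewness of residuals, log scale:", stats.skew(log_res))
```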

5. Is it always necessary to test for normality of residuals?

No, it is not always necessary to test for normality of residuals. If the assumptions of the statistical test or model being used do not include normality of residuals, then testing for normality is not required. Additionally, for large samples the Central Limit Theorem implies that the sampling distributions of quantities such as the estimated regression coefficients tend towards normality even when the errors themselves are not normal, so inference about the coefficients can be fairly robust to moderate non-normality of the residuals.
