Test for Normality of Residuals -- Is this how it Works?

In summary, the proposed test compares the standardized residuals (z-values) to what would be expected under a normal distribution. The 68-95-99.7 rule will not be followed exactly even for normal data, but various statistical tests are available to assess whether data come from a normal distribution. The Central Limit Theorem does not guarantee that sample data will fall within a certain range of values, and the general game plan for statistical hypothesis tests is to compute a statistic and compare it to its distribution under the null hypothesis. Regression lines need not have intercepts, and it is not possible to prove that a set of numbers comes from a normal distribution.
  • #1
WWGD
Science Advisor
Gold Member
Hi, let ##r_1, \dots, r_n## be the residuals from a given regression. I am trying to understand how the test for normality works. This is how I think it works:

We take the sample mean, i.e., ##\bar r := \frac{1}{n}\sum_i r_i##, and the sample standard deviation ##\sigma_r##. Then, if the residuals are normally distributed, we could find the z-values for each of the ##r_i##, using ##\bar r## and ##\sigma_r/\sqrt{n}##, and compare these to the actual z-values of the actual data. And, overall (this is an observation, not part of the test), around 68% of the data values should be within ##1\sigma_r## of the sample mean ##\bar r## (i.e., the sample data should follow the 68-95-99.7 rule). Is that it?
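For concreteness, here is a minimal sketch of this kind of check in Python (NumPy only). The residuals are simulated stand-ins for the residuals of an actual regression, and the standardization uses the sample standard deviation rather than ##\sigma_r/\sqrt{n}## (later posts touch on that distinction).

```python
import numpy as np

# Hypothetical residuals from some regression; replace with your own.
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=2.0, size=200)

# Sample mean and sample standard deviation of the residuals.
r_bar = residuals.mean()
s_r = residuals.std(ddof=1)

# Standardize each residual to a z-value.
z = (residuals - r_bar) / s_r

# If the residuals are roughly normal, about 68%, 95%, and 99.7% of the
# z-values should fall within 1, 2, and 3 standard deviations respectively.
for k in (1, 2, 3):
    coverage = np.mean(np.abs(z) <= k)
    print(f"within {k} sd: {coverage:.1%}")
```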
 
  • #2
If your regression has an intercept, the residuals will sum to zero.
 
  • #3
What do you mean, don't all regression lines have an intercept?
 
  • #4
If you do a least squares linear regression on a set of data, the mean of the residuals computed from that data will be zero - that's just because the formulas for a least squares fit with an intercept force this to happen.

If you normalize the residuals by dividing them by their sample standard deviation, then a histogram of the residuals should resemble a normal distribution, but the 68% rule may not be followed exactly even if the residuals really are a sample from a normal distribution. You can look up various statistical hypothesis tests for whether a set of data comes from a normal distribution.
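Here is a small sketch of the intercept point in Python (NumPy only, with made-up data): with a column of ones in the design matrix the residuals sum to zero up to floating-point error; without it they generally do not.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=100)

# Fit with an intercept: design matrix has a column of ones.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
res_with = y - X @ beta
print("sum of residuals, with intercept:   ", res_with.sum())

# Fit through the origin: no column of ones.
b, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
res_without = y - x * b[0]
print("sum of residuals, without intercept:", res_without.sum())
```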
 
  • #5
Thanks, but I am trying to understand how the test for normality of residuals is carried out. I assumed that if the residuals came from a normal population, then the sample residuals (meaning the residuals from the regression) would have mean equal to the sample mean and standard deviation equal to the standard error. Then I guess most residuals would fall within 3 SEs of the sample mean; I think this last point is a corollary of the CLT. I know that with almost any statistical package I could just do the test by double-clicking, but I am trying to understand the interpretation of whether the residual values fall along the line in the test.
 
  • #6
WWGD said:
Thanks, but I am trying to understand how the test for normality of residuals is carried out. I assumed that if the residuals came from a normal population, then the sample residuals (meaning the residuals from the regression) would have mean equal to the sample mean and standard deviation equal to the standard error.

It's improbable that a sample from a continuous population will have a sample mean exactly equal to the population mean and a sample standard deviation exactly equal to the population standard deviation.

Then I guess most residuals would fall within 3 SEs of the sample mean; I think this last point is a corollary of the CLT.

The Central Limit Theorem, like most theorems about probability, doesn't guarantee anything about observed frequencies in sample data. It tells you something about the probability distribution of a random variable. Also, there are various non-normal distributions for which there is a high probability that sample data fall within 3 standard deviations of the mean.

I know that with almost any statistical package I could just do the test by double-clicking, but I am trying to understand the interpretation of whether the residual values fall along the line in the test.

You have the correct intuitive idea of what the plot of the residuals should look like.

If you need to understand statistical tests for normality (as opposed to "eyeball" inspections), the first thing to understand about statistical tests is the way the "game" is played.

The way typical statistical hypothesis tests work is that a formula is applied to the data to compute a numerical value. This formula defines a "statistic". Since the statistic depends on the values in the sample, which are random, the statistic itself is a random variable. From one given set of data, you compute a single numerical value of the statistic. This gives you a sample of one value of the statistic, and from that sample of one value you attempt to draw a conclusion.

Under the null hypothesis (for example, the null hypothesis that the sample is from a normal distribution) the statistic has a probability distribution. Sometimes you can look up this distribution. If you made up the formula for the statistic yourself, you might have to determine the distribution empirically. For example, if the null hypothesis is that the data come from a normal distribution and you want to know the distribution of a home-made statistic computed from 100 data points, you could run a Monte Carlo simulation, generating thousands and thousands of random batches of 100 data points from a normal distribution. Computing the statistic for each batch and histogramming the results estimates the distribution of the statistic.

The graph of this distribution shows that the statistic is more likely to fall in a certain range of values. You pick a probable range of these values as the "acceptance" region. If the single numerical value you got from your actual data falls in this region, you "accept" that the null hypothesis is true. (This is not a proof that the null hypothesis is true; it is just a procedure.)
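As a sketch of that game plan (Python with NumPy; the statistic, sample sizes, and cutoffs below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100          # sample size per batch
n_sim = 10_000   # Monte Carlo batches

def my_statistic(x):
    # A home-made statistic: fourth central moment over squared variance
    # (this is close to 3 for normal data).
    m2 = np.mean((x - x.mean()) ** 2)
    m4 = np.mean((x - x.mean()) ** 4)
    return m4 / m2**2

# Estimate the null distribution of the statistic by simulating
# many batches of n points from a normal distribution.
null_stats = np.array([my_statistic(rng.normal(size=n)) for _ in range(n_sim)])

# Take the central 95% of the null distribution as the "acceptance" region.
lo, hi = np.quantile(null_stats, [0.025, 0.975])
print(f"acceptance region: [{lo:.2f}, {hi:.2f}]")

# Apply the test to one observed data set (here, simulated uniform data,
# which should usually land outside the region).
observed = my_statistic(rng.uniform(-1.0, 1.0, size=n))
verdict = "inside acceptance region" if lo <= observed <= hi else "outside (reject H0)"
print(f"observed statistic: {observed:.2f} -> {verdict}")
```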

Is the general game plan clear?
 
  • #7
"Don't all regression lines have intercepts?"
It is possible to fit a regression line without an intercept (rarely a good idea, however). If the intercept is included, the residuals will sum to zero. If the intercept is not included, the residuals generally won't sum to zero.

It isn't possible to prove any set of numbers comes from a normal distribution, for many reasons. Two of the most important:

  • no set of data is ever really normally distributed - the normal distribution is a mathematical model that tells us what patterns we can expect to see when data are very nicely mound-shaped and symmetric. In reality, many data sets resemble a normal distribution near the center but differ in the tails - where all the exciting stuff happens.
  • A test of normality puts "The residuals are normally distributed" in the null hypothesis, and a test does not prove the null hypothesis is true: if it doesn't give a strong indication that the null is false, it leaves us with "we can't say it is false".
In short - a test of normality really tells us either that the residuals aren't normal (H0 is rejected) or that "the differences between them and a normal distribution are small enough to ignore".

And you never, ever, "accept" the null hypothesis.
 
  • #8
Stephen Tashi said:
Is the general game plan clear?

Thanks, I am aware of this, but it was my understanding that in the test I was referring to, the Central Limit Theorem was necessary. Sorry, I should have given the name of the test, but I was not able to find it. The test comes down to seeing how well some function of the residuals falls along the line y = x. My best guess was that we were using the CLT this way: each residual is considered a random variable, and all of them are identically distributed and independent. Then, if we had enough residuals, say more than 15, the distribution of the sum of the residuals is normal (converges in distribution to a normal distribution), I believe with mean the sample mean and standard deviation the standard error SE := ##s/\sqrt{n}##, where s is the standard deviation of the sample data. But I cannot make sense of how this would become a test for normality.
 
  • #9
WWGD said:
Thanks, I am aware of this, but it was my understanding that in the test I was referring to, the Central Limit Theorem was necessary. Sorry, I should have given the name of the test, but I was not able to find it.
The title of your thread asks how a test for the normality of residuals works. There are many different tests for normality (look up "normality tests" on Wikipedia). If you can't remember which test is used, I don't think anyone can explain to you how the test works.

The test comes down to seeing how well some function of the residuals falls along the line y = x. My best guess was that we were using the CLT this way: each residual is considered a random variable, and all of them are identically distributed and independent. Then, if we had enough residuals, say more than 15, the distribution of the sum of the residuals is normal (converges in distribution to a normal distribution),

The CLT would tell you about the distribution of the sample mean of the observations, not about the distribution of the individual observations. If the residuals were independently and identically distributed with some non-normal distribution (for example, a uniform distribution), then the CLT tells you that the sample mean of a large number of observations is a random variable with an approximately normal distribution.

I believe with mean the sample mean and standard deviation the standard error SE := ##s/\sqrt{n}##, where s is the standard deviation of the sample data. But I cannot make sense of how this would become a test for normality.

You are correct, the CLT (by itself) isn't a basis for a test for normality.
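A quick sketch of that distinction (simulated uniform data; nothing here is specific to regression): the individual observations stay uniform, but their sample mean behaves approximately normally with standard deviation ##\sigma/\sqrt{n}##.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sim = 30, 10_000

# Individual observations: uniform on (0, 1), decidedly non-normal.
samples = rng.uniform(0.0, 1.0, size=(n_sim, n))

# Sample means across many batches: approximately normal by the CLT,
# with standard deviation close to sigma / sqrt(n).
means = samples.mean(axis=1)
print("sd of sample means:", means.std(ddof=1))
print("sigma / sqrt(n):   ", np.sqrt(1.0 / 12.0) / np.sqrt(n))
```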
 
  • #10
Sorry for being vague about the test; I was hoping someone would recognize which test it was from the description. The test measures how well (meaning how closely) the residual values fall along the line y = x. I will keep looking for the actual name.
 
  • #11
If you are referring to a graph, you want a normal probability plot, available in most any statistics package. If you want a test, you could look at the Kolmogorov-Smirnov test - it works by examining the largest difference between the empirical distribution function of the residuals and the hypothesized (model) distribution.
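A sketch of both suggestions using SciPy (the residuals are simulated placeholders). Note that feeding residuals standardized with their own estimated mean and standard deviation into a plain Kolmogorov-Smirnov test is only approximate; the Lilliefors variant adjusts for the estimated parameters.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
residuals = rng.normal(scale=2.0, size=200)  # stand-in for real residuals

# Normal probability plot data: ordered residuals vs normal quantiles.
# Points along a straight line (slope ~ sd, intercept ~ mean) suggest normality.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print("correlation of points with the fitted line:", r)

# Kolmogorov-Smirnov test against a standard normal, after standardizing
# with the sample's own mean and sd (approximate, as noted above).
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
statistic, p_value = stats.kstest(z, "norm")
print("KS statistic:", statistic, "p-value:", p_value)
```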

" the distribution of the sum of the residuals is normal"

Probably not. Unless your residuals are already reasonably symmetric, it would be rare for a sum of as few as 15 of them to resemble a normal distribution in any noteworthy way, and the sample size would be far too small to get an indication from a histogram (histograms are horribly unreliable for small samples).

It is worth noting that the classical statement of the assumptions for regression requires normality of the errors; that can be relaxed to symmetry alone for larger samples. If you have a fairly large sample, and the residuals' distribution is reasonably symmetric with no outliers, that is likely sufficient.
 
  • #12

1. What is the purpose of testing for normality of residuals?

The purpose of testing for normality of residuals is to assess whether the residuals of a statistical model follow a normal distribution. This is important because many statistical tests and models assume normality of the residuals, and violating this assumption can lead to incorrect conclusions.

2. How is normality of residuals typically assessed?

Normality of residuals is typically assessed using visual methods, such as a histogram or a normal probability plot, or statistical tests, such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test.
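For instance, a minimal SciPy sketch (the residuals are a simulated placeholder):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
residuals = rng.normal(size=150)  # stand-in for real residuals

# Shapiro-Wilk test: small p-values indicate departure from normality.
statistic, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk statistic:", statistic, "p-value:", p_value)
```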

3. What does it mean if the residuals are not normally distributed?

If the residuals are not normally distributed, it means that the model may not be capturing all of the important relationships in the data and may need to be revised. It also means that any statistical tests or inferences made using the model may not be reliable.

4. Can non-normal residuals be transformed to achieve normality?

Yes, normality can sometimes be achieved by transforming the data (most often the response variable) and refitting the model. Common transformations include logarithmic, square root, and inverse transformations. However, a transformation also changes the interpretation of the model and should be applied with caution.
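A sketch of the idea with simulated data: when the noise is multiplicative, fitting on the log scale typically leaves residuals that are much closer to symmetric and normal. The model and numbers below are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.uniform(1.0, 10.0, size=200)
# Multiplicative (lognormal) noise: residuals on the raw scale are skewed.
y = 5.0 * x * rng.lognormal(sigma=0.5, size=200)

def residuals_of_linear_fit(x, y):
    # Ordinary least squares with an intercept; returns the residuals.
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

raw_res = residuals_of_linear_fit(x, y)
log_res = residuals_of_linear_fit(np.log(x), np.log(y))

# Compare skewness: values near zero are what a normal distribution gives.
print("skewness of residuals, raw scale:", stats.skew(raw_res))
print("skewness of residuals, log scale:", stats.skew(log_res))
```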

5. Is it always necessary to test for normality of residuals?

No, it is not always necessary to test for normality of residuals. If the assumptions of the statistical test or model being used do not include normality of residuals, then testing for normality is not required. Additionally, for large samples the Central Limit Theorem implies that the sampling distributions of quantities such as the estimated regression coefficients tend towards normality even when the errors themselves are not normal, so inference about the coefficients can be fairly robust to moderate non-normality of the residuals.
