WWGD said:
Thanks, but I am trying to understand how the test for normality of residuals is made. I assumed if the residuals came from a normal population, then the sample residuals (meaning the residuals from the regression) would have mean equal to the sample mean and standard deviation equal to the standard error.
It's improbable that a sample from a continuous population will have a sample mean exactly equal to the population mean and sample standard deviation exactly equal to the population standard deviation.
WWGD said:
Then I guess most residuals would fall within 3 SE's from the sample mean. I think this last is a corollary of the CLT.
The Central Limit Theorem, like most theorems about probability, doesn't guarantee anything about observed frequencies in sample data. It tells you something about the probability distribution of a random variable. Also, there are various non-normal distributions where there is a high probability that sample data is within 3 standard deviations of the mean.
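To see this numerically, here is a small Python sketch (my own illustration, not something from the thread): data drawn from a uniform distribution, which is certainly not normal, still lies entirely within 3 standard deviations of its sample mean, because a uniform random variable can never be farther than about 1.73 standard deviations from its mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10_000)      # a decidedly non-normal sample
z = np.abs(x - x.mean()) / x.std()      # distances from the mean, in sample SDs
print((z <= 3).mean())                  # prints 1.0 -- every point is within 3 SDs
```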
WWGD said:
I know with most any statistical package I could just do the test by double-clicking, but I am trying to understand how to interpret whether the residual values fall along the line in the test.
You have the correct intuitive idea of what the plot of the residuals should look like.
If you need to understand statistical tests for normality (as opposed to "eyeball" inspections), the first thing to understand about statistical tests is the way the "game" is played.
The way typical statistical hypothesis tests work is that there is a formula applied to the data to compute a numerical value. This formula defines a "statistic". Since the statistic depends on values in the sample, which are random, the statistic itself is a random variable. From one given set of data, you compute a numerical value for the statistic. This gives you a single sample of the statistic. From that sample of one numerical value, you attempt to draw a conclusion.
Under the null hypothesis (for example, the null hypothesis that the sample is from a normal distribution) the statistic will have a probability distribution. Sometimes you can look up this distribution. If you made up the formula for the statistic yourself, you might have to determine the distribution empirically. For example, if the null hypothesis is that the data is from a normal distribution and you want to know the distribution of a home-made statistic computed from 100 data points, you could do a Monte-Carlo simulation, simulating thousands and thousands of random batches of 100 data points from a normal distribution. Computing the statistic for each batch and histogramming the results estimates the distribution of the statistic.
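To make the Monte-Carlo idea concrete, here is a rough sketch in Python. The particular "home made" statistic below (the largest absolute z-score in a batch) is just something I invented for illustration; you would substitute whatever formula you actually made up.

```python
import numpy as np

rng = np.random.default_rng(42)

def my_statistic(data):
    """An invented home-made statistic: the largest absolute z-score in the sample."""
    return np.max(np.abs(data - data.mean()) / data.std(ddof=1))

n = 100             # size of each simulated batch, matching the 100 data points above
n_batches = 20_000  # number of simulated batches

# Simulate many batches from the null hypothesis (a normal distribution)
# and compute the statistic for each batch.
null_values = np.array([my_statistic(rng.normal(size=n)) for _ in range(n_batches)])

# Histogramming these values estimates the distribution of the statistic
# under the null hypothesis.
counts, bin_edges = np.histogram(null_values, bins=50)
```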
The graph of this distribution of the statistic shows that the statistic is more likely to fall in a certain range of values. You pick a probable range of these values as the "acceptance" region. If the single numerical value you got from your actual data falls in this region, you "accept" that the null hypothesis is true. (This is not a proof that the null hypothesis is true, it is just a procedure.)
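Continuing the sketch above, one simple choice of acceptance region is the central 95% of the simulated values (other choices are possible; for a statistic like the one above you might prefer a one-sided region). The `my_actual_data` batch below is generated just so the example runs; in practice it would be your real 100 data points (or residuals).

```python
# For demonstration only: stand-in for the real batch of 100 data points.
my_actual_data = rng.normal(size=n)

# Take the central 95% of the simulated null values as the "acceptance" region.
lower, upper = np.percentile(null_values, [2.5, 97.5])

observed = my_statistic(my_actual_data)
if lower <= observed <= upper:
    print("statistic falls in the acceptance region -> 'accept' the null hypothesis")
else:
    print("statistic falls outside the acceptance region -> reject the null hypothesis")
```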
Is the general game plan clear?