Why are definitions for the studentized residual so confusing?

1. Nov 17, 2012

meanrev

I cannot find a consistent definition of the studentized residual and the RMSEP, because I've noticed that various websites, lecture notes and software packages mix up 1 or 2 definitions along the way to the point that a "compound" definition ends up very different between one reference source and another!

So I'm going to write all of my definitions from the ground up. Would someone be so kind as to confirm to me if my definitions 4, 5, 7 and 8 are correct?
> Regarding (4) and (5), should I divide my PRESS by the sample size $n$ or should I divide it by the degrees of freedom, as I would calculate the RMSE?
> Regarding (7) and (8), am I correct to use the jackknifed residual in the numerator and the RMSEP (instead of the RMSE) in the denominator? Is there an intuitive explanation as to why I should prefer the jackknifed residual over the internally studentized residual?

DEFINITION 1. My raw residuals are $\hat{e}_{i}=Y_{i}-\hat{Y}_{i}$ where $Y_{i}$'s are the actual values and $\hat{Y}_{i}$ are the values predicted by the regression equation.

DEFINITION 2. The hat matrix is defined as $H$ such that the vector of values predicted by the regression equation $\hat{Y}=HY$, where $Y$ is the vector of actual values.

DEFINITION 3. The jackknifed residuals are defined as $\hat{e}_{i,-i}=Y_{i}-\hat{Y}_{i,-i}$ where $\hat{Y}_{i,-i}$ are the values predicted by the regression equation estimated while excluding $Y_{i}$.

DEFINITION 4. Given a sample size of $n$ data points and $k$ predictor variables, the RMSE is the square root of the SSE divided by the degrees of freedom, $RMSE=\sqrt{\dfrac{SSE}{n-k-1}}$.

DEFINITION 5. Given a sample size of $n$ data points, the predicted residual sum of squares (PRESS) is $PRESS=\sum_{i=1}^{n}\hat{e}_{i,-i}^{2}=\sum_{i=1}^{n}\left(Y_{i}-\hat{Y}_{i,-i}\right)^{2}$, so the root mean squared error of prediction (RMSEP) is $RMSEP=\sqrt{\dfrac{PRESS}{n}}$.

DEFINITION 6. The standardized residual is the raw residual divided by the RMSE, i.e. $\dfrac{\hat{e}_{i}}{RMSE}$.

DEFINITION 7. The internally studentized residual is $\dfrac{\hat{e}_{i}}{RMSE\sqrt{1-h_{ii}}}$ where the leverage $h_{ii}\in\left[0,1\right]$ is the $i$th diagonal entry of the hat matrix.

DEFINITION 8. The studentized deleted residual is calculated using the jackknifed residuals, so it is computed as $\dfrac{\hat{e}_{i,-i}}{RMSEP\sqrt{1-h_{ii}}}$.
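To make the definitions concrete, here is a minimal numerical sketch of Definitions 1 to 7, assuming an ordinary least-squares fit with an intercept. The synthetic data, the seed and all variable names are made up for illustration. It also checks the standard identity $\hat{e}_{i,-i}=\hat{e}_{i}/(1-h_{ii})$ against a brute-force leave-one-out refit, which means PRESS can be computed without running $n$ separate regressions.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, illustration only
n, k = 30, 2                            # n data points, k predictors
# Design matrix: intercept column plus k made-up predictor columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)

# Definition 2: hat matrix H such that Y_hat = H Y
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                          # leverages h_ii

# Definition 1: raw residuals
e = Y - H @ Y

# Definition 3: jackknifed residuals, refitting without point i each time
e_jack = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
    e_jack[i] = Y[i] - X[i] @ beta_i

# Same quantity without refitting: e_i / (1 - h_ii)
e_jack_shortcut = e / (1.0 - h)

# Definition 4: RMSE with n - k - 1 degrees of freedom
rmse = np.sqrt(np.sum(e**2) / (n - k - 1))

# Definition 5: PRESS and RMSEP (dividing by n, as written above)
press = np.sum(e_jack**2)
rmsep = np.sqrt(press / n)

# Definitions 6 and 7
standardized = e / rmse
internally_studentized = e / (rmse * np.sqrt(1.0 - h))

print(np.allclose(e_jack, e_jack_shortcut))
```

Because $0<1-h_{ii}\le 1$, each internally studentized residual is at least as large in magnitude as the corresponding standardized residual, and PRESS is never smaller than the SSE.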

2. Nov 18, 2012

ImaLooser

The basic idea is that the sample size is small, i.e. less than 30. Then the canonical estimate for the variance of the residuals is not correct: the actual variance is larger, and you have to use Student's t tables to correct to the actual variance. That's called "studentizing" the residuals.

3. Nov 18, 2012

Stephen Tashi

Contrary to popular opinion, in different specialized areas of mathematics the same terms may have different meanings (for example: "conjugate", "dual problem", "homogeneous form"). The correctness of a definition is a cultural matter. It has to do with an agreement by a group of human beings. So if you have been reading a wide variety of specialized mathematical literature it isn't surprising that you find some terms defined inconsistently. Definitions for some things like "an abelian group", "Lebesgue measure" and "complement of a set" are universal in the sense that mathematicians in various specialties mean the same things by them. I don't think the phrase "studentized residuals" has such a widely accepted meaning. A specialist in a particular field (such as linear models or jackknife estimates) might be able to evaluate your definitions with respect to that particular field. If you want the forum-at-large to help, you could give links for the definitions you've found.

The general idea of "studentizing" an estimator of something is to modify the formula for it in order to convert it into an unbiased estimator. (There is nothing "incorrect" about biased estimators. A realization of any sort of estimator, studentized or otherwise, isn't guaranteed to be correct for what it's trying to estimate. For example, see post #12 in this thread: https://www.physicsforums.com/showthread.php?t=616643 .)

4. Nov 23, 2012

meanrev

@ImaLooser

Thanks very much for the explanation! If I understand correctly, doesn't the "standardized residual" already take the biased estimates into account? It seems like the difference between "standardized" and "studentized" lies in whether (1) the leverage is applied, and (2) the whole thing is "jackknifed" (and there doesn't seem to be agreement on what exactly to exclude).

@Stephen Tashi

That's a great post, I'm grateful for the link.

Yes! There is a weird dual usage of RMSE to refer to both the "root mean square error" and the "root mean square (fitting) error", which apparently differ by whether you divide by the sample size or the degrees of freedom. I've seen it done both ways, e.g. the former in: http://en.wikipedia.org/wiki/Root_mean_square_deviation and http://www.ltrr.arizona.edu/~dmeko/notes_12.pdf, and the latter in http://www.math.uah.edu/stat/sample/Variance.html and http://statmaster.sdu.dk/courses/ST02/module10/module.pdf. I am *guessing* that the convention is to use the latter when doing regression analysis, i.e. the same way it is calculated in software packages like MATLAB and R, which agrees with the little note at: http://en.wikipedia.org/wiki/Mean_squared_error

I got Definition 5 from: http://www.vub.ac.be/fabi/multi/pcr/chaps/chap13.html which suggests that I divide by the sample size rather than the degrees of freedom. I am convinced this is correct, as corroborated by: http://www.physiol.ox.ac.uk/Computing/Online_Documentation/Matlab/toolbox/mbc/model/techdo11.html

There are three parts of studentizing (Definitions 7 and 8) that get confusing. (1) Do you use the RMSE or RMSEP? (2) Do you use the original residual or the jackknifed residual? (3) Does the jackknifed residual imply removing the ith data point altogether or taking the difference between the ith data point and the regression line value estimated without i? According to: http://www-stat.wharton.upenn.edu/~waterman/Teaching/701f99/Class04/class04.pdf, the studentized residual "is just a standardized jackknifed residual". Which leads to an ambiguity in Definition 7 and 8: do I "standardize it" by dividing by the RMSE or the RMSEP? Here: https://stat.ethz.ch/pipermail/r-help/2011-August/286427.html, I see someone else facing the same issues and Breheny gives a good explanation of the different definitions. He says, 'The "studentized" residuals are similar, but involve estimating sigma in a way that leaves out the ith data point when calculating the ith residual' - which makes it ambiguous whether I should take the RMSE or RMSEP for Definitions 7 and 8 to estimate the sigma.

Regarding how the jackknifed residual is taken and calculated: In some software packages like http://support.sas.com/documentatio...lt/viewer.htm#statug_intromod_a0000000355.htm, and articles like http://en.wikipedia.org/wiki/Studentized_residual#Internal_and_external_studentization, it is suggested that I "remove" the ith data point altogether when calculating the sigma at i, and reduce the "sample size" by 1, in both Definitions 7 and 8. But they leave the numerator (raw residual) alone. This doesn't agree with what I've been taught - I understand that you should include the ith data point, but take the difference between the value of the ith data point and the value obtained from the regression line excluding the ith data point (Definition 3) in the numerator. In the denominator, it seems that they take the RMSE excluding the ith data point (exclude ith data point and divide by degrees of freedom-1) rather than the RMSEP (include every residual except estimate them as the difference between the estimated regression line and the raw residual, then divide by the number of samples-1). Some ignore the leverage altogether! (http://statistika.vse.cz/konference/amse/PDF/Blatna.pdf)

****

To put it in simple words, I really want to know whether to use:
1. the raw residual or the jackknifed residual in the numerator
2. the RMSE or RMSEP in the denominator
for Definitions 7 and 8.
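For what it's worth, under the convention described in the Wikipedia article linked above (sigma estimated without the ith point), the choice of numerator seems to matter less than it looks: the raw residual divided by $s_{(i)}\sqrt{1-h_{ii}}$ and the jackknifed residual divided by its own standard error $s_{(i)}/\sqrt{1-h_{ii}}$ give the same number. Here is a small sketch that checks this numerically. The data, the seed and the names `s_del`, `t_raw` and `t_jack` are made up, and this only illustrates that one convention; it is not a statement about which definition is "correct".

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed, illustration only
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([0.5, 1.0, 3.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                          # leverages
e = Y - H @ Y                           # raw residuals
sse = np.sum(e**2)

# Brute force: refit without point i, then estimate sigma from that fit
s_del = np.empty(n)                     # leave-one-out sigma estimates
e_jack = np.empty(n)                    # jackknifed residuals
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
    res_i = Y[keep] - X[keep] @ beta_i
    s_del[i] = np.sqrt(np.sum(res_i**2) / ((n - 1) - k - 1))
    e_jack[i] = Y[i] - X[i] @ beta_i

# Closed form for the same leave-one-out sigma, no refitting needed
s_del_closed = np.sqrt((sse - e**2 / (1.0 - h)) / (n - k - 2))

# Version A: raw residual in the numerator
t_raw = e / (s_del * np.sqrt(1.0 - h))
# Version B: jackknifed residual divided by its own standard error
t_jack = e_jack / (s_del / np.sqrt(1.0 - h))

print(np.allclose(t_raw, t_jack), np.allclose(s_del, s_del_closed))
```

Note that in Version B the jackknifed residual is multiplied by $\sqrt{1-h_{ii}}$ rather than divided by it, which is where this convention differs from Definition 8 as I wrote it.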

Last edited by a moderator: May 6, 2017

5. Nov 23, 2012

Stephen Tashi

There is a lot of ground to cover. Let's start with "mean square error".

As far as I can tell, the term "mean square error" has some ambiguity to it apart from particular formulas for it. Among those references, you may have discovered different formulas for the same meaning of "mean square error", but some of the differences you found may simply be due to different meanings of "mean square error".

Take the more familiar term "standard deviation". It has (at least!) the following different meanings, depending on the context in which it is used.

1. A random variable with a given probability density has a "standard deviation" which is calculated by using the values of the density function. It would be done by integrations in the case of a continuous density.

2. A sample of N realizations of a random variable has a "standard deviation", which ought to be called the "sample standard deviation". The "sample standard deviation", since it is a function of the random values in the sample, is itself a random variable. So it has a probability density function, which is, in general, different from the density from which the individual samples are drawn. From this point of view, the "sample standard deviation" is a function of the N values in the sample. There IS ambiguity about how this formula is defined. Many texts and software programs define the formula to have the denominator N-1. Other texts define the formula to have the denominator N.

3. An estimator is a function of the sample values that is used to estimate some property of the distribution from which the samples are taken. As the post I linked to illustrated, there are various estimators of the "standard deviation" of the distribution from which the samples are taken. Different formulas define different estimators, so it wouldn't be correct to say that there is "ambiguity" in the usual estimators for the standard deviation. They are simply different estimators.
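As a concrete illustration of the ambiguity in meaning 2, NumPy's `np.std` exposes the denominator through its `ddof` argument: the default `ddof=0` divides by N, while `ddof=1` divides by N-1. The eight data values below are made up.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up sample, N = 8

sd_n = np.std(x)             # ddof=0 (default): divides by N
sd_n1 = np.std(x, ddof=1)    # ddof=1: divides by N - 1

print(sd_n)    # 2.0, since the squared deviations sum to 32 and 32/8 = 4
print(sd_n1)   # about 2.138, i.e. sqrt(32/7)
```

Both numbers are legitimately called "the standard deviation of the sample"; which one a text or package reports depends on its convention.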

With reference to "mean square error", I think there are differences in meaning similar to those above.

1. If we consider a prediction function and a specific distribution for the errors then there will be a "mean square error" that one could calculate from the distribution function of the errors.

2. For a sample of data, there is a "mean square error" that one computes from the sample values. (This will have the ambiguity about whether to use N or N-1 in the denominator.)

3. There are also different estimators of the mean square error, which use various formulas to estimate the value of the mean square error (as in meaning 1) from the sample values.
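To illustrate meaning 3 in the regression setting, here is a small simulation sketch under assumed conditions: a fixed design, normal errors with known variance $\sigma^{2}=4$, and made-up coefficients. It compares two estimators of the error variance, SSE/n and SSE/(n-k-1), by averaging each over many simulated samples.

```python
import numpy as np

rng = np.random.default_rng(42)         # arbitrary seed, illustration only
n, k, sigma2 = 20, 2, 4.0               # assumed true error variance is 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, -2.0, 0.5])       # made-up coefficients

# Residual-maker matrix M = I - H, so residuals = M @ Y
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

reps = 20000
errors = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
Ys = X @ beta + errors                  # one simulated sample per row
resid = Ys @ M                          # M is symmetric, so each row is M @ Y
sse = np.sum(resid**2, axis=1)

mean_div_n = np.mean(sse / n)           # average of the "divide by n" estimator
mean_div_df = np.mean(sse / (n - k - 1))  # average of the "divide by d.f." one

print(mean_div_n, mean_div_df)
```

Under these assumptions the degrees-of-freedom version averages close to the true value 4, while the divide-by-n version averages close to $4\cdot 17/20=3.4$. Neither is "wrong"; they are different estimators with different bias.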

Last edited by a moderator: May 6, 2017