I've calculated the mean difference of my (normally distributed) data set. The mean difference is defined as:

Now I'm trying to calculate the "mean difference deviation" in order to generate a confidence interval for this quantity ("95% of the differences in the set are greater than ____"). My question is: can I generalize the standard deviation formula to calculate this? If we take the following concepts to be parallel:

Mean <---------> Mean Difference
Single value <---------> Single Difference

... can I use the standard deviation equation to calculate the "mean difference deviation"? Namely turning this:

to this:

I've looked up more direct ways to calculate this quantity, and all of them are contained in statistics articles that are way(1) over(2) my head(3); so much so that I can't even determine whether it's what I'm looking for, much less how to go about translating it into code (and I haven't even touched on efficiency concerns).

Can anyone provide some insight? And if it turns out this can't be done, would anyone mind taking a crack at translating the derived equations in those articles into English?

It sounds like you are doing pretty much the same thing for the second case with the exception that the random variable is defined in a more complex way.

It seems like you are finding a measure of variation, but that you are referring to different things (one random variable involves a relationship of other random variables whereas the first is just a normal random variable).

It might help if you describe what you mean by the 'mean distribution' mathematically, as a formula in terms of random variables and means (you can use E(X) to denote the mean of a particular random variable X).

We actually do this in statistical applications quite a bit. Although we deal only with the mean and the variance/standard deviation, we do so in different contexts, and the same quantity has a different interpretation from one context to another.

If you read or do further statistics, it will help to understand how to create a new random variable from existing random variables by using a formula to relate them. That way you will be able to see mathematically that, although you are just doing a "normal standard deviation calculation", defining your new random variable in a certain way encodes "a specific kind of information" relevant to the actual formula.

Thanks, chiro, for the insight. I feel a lot more confident using the formula now. The formula for the mean difference (which is what I assume you meant by "mean distribution") is:

I don't know whether you are doing this work for any serious purpose. In case you are, I think you had better read those articles. (Unless a forum member is a JSTOR subscriber, that person cannot read the articles in your links. I can't, and I'm not promising to read them even if they become available!)

You didn't say whether "the set" is the data or the population from which the data is drawn.

I don't think you are talking about a "confidence interval" in a technically correct way. The current Wikipedia article on confidence interval might straighten you out.

You are apparently planning to assume the differences are independent and normally distributed. It isn't clear that they are independent. For example |x1 - x2| and |x1 - x3| share the value x1. There may be some theory that says that they have a normal distribution and even that they are independent. If so, you should learn that theory - at least its results.
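The dependence between differences that share a value can be checked numerically. The sketch below is only an illustration, assuming standard normal data; the function names, the seed, and the number of trials are arbitrary choices, not anything from the thread. It estimates the correlation between |x1 - x2| and |x1 - x3|.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def shared_value_correlation(trials=20000, seed=0):
    """Correlation between |x1 - x2| and |x1 - x3| for independent N(0, 1) draws."""
    rng = random.Random(seed)
    d12, d13 = [], []
    for _ in range(trials):
        x1, x2, x3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        d12.append(abs(x1 - x2))
        d13.append(abs(x1 - x3))
    return pearson(d12, d13)

print(shared_value_correlation())
```

Independent quantities would give a correlation near zero. A clearly positive estimate here would indicate that the two differences are not independent, although zero correlation alone would not have proved independence.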

What do you mean by "mean difference deviation"? Do you mean "the standard deviation of the differences"? (If so, you could just say "difference standard deviation".)

Let's look at an example. (Check my work.)

Suppose there are 3 data values {1, 2, 4}.

The formula you have differs from the one in Wikipedia only in a trivial way. The formula you give excludes the case i = j. Since |x_i - x_i| = 0, it seems unnecessary to do that.
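To spell out why the i = j terms do not change the sum, here it is for the three data values above. The normalizing constant is left out on purpose because the formulas themselves are not shown here; if one version divides by n^2 and the other by n(n-1), the resulting means would differ even though the sums agree.

```latex
\sum_{i=1}^{n}\sum_{j=1}^{n} |x_i - x_j|
  = \sum_{i \neq j} |x_i - x_j| + \sum_{i=1}^{n} \underbrace{|x_i - x_i|}_{=\,0}
  = \sum_{i \neq j} |x_i - x_j|

% For the data {1, 2, 4}, each unordered pair appears twice:
\sum_{i \neq j} |x_i - x_j| = 2\,(|1-2| + |1-4| + |2-4|) = 2\,(1 + 3 + 2) = 12
```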

How do you define the variance of "the differences"? Are you going to let each difference appear twice or just once? I don't think it matters for the usual definition of "sample variance".

If you count each difference once, you compute the variance of the data set {1,3,2}.
You get a mean of 2 and a variance of [itex] (1 + 1 + 0) [/itex] divided by 3, which is 2/3.

If you count each difference twice, you compute the variance of the data set {1,3,2,1,2,3}.
You get a mean of 2 and a variance of (1 + 1 + 0 + 1 + 0 + 1) divided by 6, which is also 2/3.

However, some people define the sample variance to be [itex] \frac{ \sum_{i=1}^n(x_i - \bar{x})^2 }{n-1} [/itex] instead of [itex] \frac{ \sum_{i=1}^n(x_i - \bar{x})^2 }{n} [/itex]. Those people would get 2/2 for the first answer and 4/5 for the second answer.

So our candidates for the sample standard deviation are [itex] \sqrt{\frac{2}{3}} ,\sqrt{\frac{2}{2}}, \sqrt{\frac{4}{5}} [/itex].
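The arithmetic above can be checked with a short Python sketch. The helper names `abs_differences` and `variance` and the `ddof` parameter (0 to divide by n, 1 to divide by n - 1) are illustrative choices, not anything from the thread; exact fractions are used so the results can be compared directly with 2/3, 2/2 and 4/5.

```python
from fractions import Fraction
from itertools import combinations

def abs_differences(data, count_twice=False):
    """Absolute differences |x_i - x_j| over pairs i < j, optionally listed twice."""
    diffs = [abs(a - b) for a, b in combinations(data, 2)]
    return diffs * 2 if count_twice else diffs

def variance(values, ddof=0):
    """Variance with divisor n - ddof, computed exactly with fractions."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / (n - ddof)

data = [1, 2, 4]
once = abs_differences(data)                     # [1, 3, 2]
twice = abs_differences(data, count_twice=True)  # each difference listed twice

print(variance(once), variance(twice))                  # divisor n
print(variance(once, ddof=1), variance(twice, ddof=1))  # divisor n - 1
```

With the divisor n, counting each difference once or twice gives the same variance; with n - 1 the two counts give different answers, which is where the extra candidates come from. The standard deviations are the square roots of these values.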

The method you propose:

agrees with the answer [itex] \sqrt{\frac{2}{3}} [/itex], doesn't it?