Can the standard deviation calculation be generalized for other statistics?

klawson88 · Feb 1, 2012

I've calculated the mean difference of my (normally distributed) data set. The mean difference is defined as:

The average absolute difference of any two independent values in a data set

Now, I'm trying to calculate the "mean difference deviation" in order to generate a confidence interval for this quantity ( "95% of the differences in the set are greater than ____"). My question is: can I generalize the standard deviation formula to calculate this? If we take the following concepts to be parallel:

Code:

Mean         <---------> Mean Difference
Single value <---------> Single Difference

... can I use the standard deviation equation to calculate the "mean difference deviation"? Namely turning this:

Standard deviation calculation steps

1. Take the difference of the mean and each single value
2. Square each result and add up the resulting numbers
3. Divide by the total number of values
4. Take the square root of #3

to this:

Proposed "mean difference deviation" calculation steps

1. Take the difference of the mean difference and each single difference
2. Square each result and add up the resulting numbers
3. Divide by the total number of differences
4. Take the square root of #3
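
For anyone following along in code, the four proposed steps can be sketched in Python roughly as below. This is only an illustration under stated assumptions: the function names and the sample data are made up for this example, each unordered pair is counted once, and the divisor is the number of differences (not one less).

```python
from itertools import combinations
from math import sqrt


def pairwise_abs_differences(values):
    """Absolute difference of every unordered pair, each pair counted once."""
    return [abs(a - b) for a, b in combinations(values, 2)]


def mean_difference(values):
    """Average of the pairwise absolute differences."""
    diffs = pairwise_abs_differences(values)
    return sum(diffs) / len(diffs)


def difference_deviation(values):
    """Proposed 'mean difference deviation': the standard deviation formula
    applied to the pairwise differences instead of the raw values."""
    diffs = pairwise_abs_differences(values)
    mean_diff = sum(diffs) / len(diffs)
    # Steps 1 and 2: deviation of each difference from the mean difference,
    # squared and summed.
    squared_sum = sum((d - mean_diff) ** 2 for d in diffs)
    # Steps 3 and 4: divide by the number of differences, take the square root.
    return sqrt(squared_sum / len(diffs))


data = [3, 7, 8, 12]  # hypothetical data, not from the thread
print(mean_difference(data))
print(difference_deviation(data))
```

Note that the number of pairs grows like n(n-1)/2, so this brute-force version is only practical for modest data sets.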

I've looked up more direct ways to calculate this quantity, and all of them are contained in statistics articles (http://www.jstor.org/stable/pdfplus/2333957.pdf?acceptTC=true (1), http://www.jstor.org/stable/pdfplus/2236592.pdf (2), http://www.jstor.org/stable/pdfplus/2282402.pdf (3)) that are hard for me to follow; so much so that I can't even determine if it's what I'm looking for, much less how to go about translating it into code (and I haven't even touched on efficiency concerns).

Can anyone provide some insight? And if it turns out this can't be done, would anyone mind taking a crack at translating the derived equations in those articles into English?

chiro · Feb 1, 2012

Hey klawson88 and welcome to the forums.

It sounds like you are doing pretty much the same thing for the second case with the exception that the random variable is defined in a more complex way.

It seems like you are finding a measure of variation, but that you are referring to different things (one random variable involves a relationship of other random variables whereas the first is just a normal random variable).

It might help if you describe what you mean for the 'mean distribution' to be mathematically in terms of a formula of random variables and means (you can use E(X) to denote the mean of a particular random variable X).

We actually do this in statistical applications quite a bit. Although we deal only with the mean and variance/standard deviation, we apply them in different contexts, where they have a particular interpretation in one context versus another.

If you read or do further statistics, it will help to understand how to create a new random variable from existing random variables using a formula that relates them. That way you will be able to see mathematically that although you are just doing a "normal standard deviation calculation", defining your new random variable in a certain way encodes "a specific kind of information" relevant to the actual formula.

klawson88 · Feb 1, 2012

Thanks chiro for the insight. I feel a lot more confident using the formula now. The formula for the mean difference (which is what I assume you meant by "mean distribution") is:

[tex] \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} |x_i - x_j| [/tex]

Stephen Tashi · Feb 3, 2012

klawson,

I don't know whether you are doing this work for any serious purpose. In case you are, I think you had better read those articles. (Unless a forum member is a JSTOR subscriber, that person cannot read the articles in your links. I can't, and I'm not promising to read them if they become available!)

klawson88 said:

Now, I'm trying to calculate the "mean difference deviation" in order to generate a confidence interval for this quantity ( "95% of the differences in the set are greater than ____").

You didn't say whether "the set" is the data or the population from which the data is drawn.

I don't think you are talking about a "confidence interval" in a technically correct way. The current Wikipedia article on confidence interval might straighten you out.

You are apparently planning to assume the differences are independent and normally distributed. It isn't clear that they are independent. For example |x1 - x2| and |x1 - x3| share the value x1. There may be some theory that says that they have a normal distribution and even that they are independent. If so, you should learn that theory - at least its results.
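
One way to check the independence concern empirically is a quick simulation. The sketch below is only an illustration: it assumes the data are independent standard normal draws, and the function names, seed, and trial count are arbitrary choices for this example. It estimates the correlation between |x1 - x2| and |x1 - x3|; a value clearly above zero means the two differences are not independent.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def shared_value_correlation(trials=100_000, seed=0):
    """Correlation between |x1 - x2| and |x1 - x3| over many simulated triples.

    Assumption: x1, x2, x3 are independent standard normal values.
    """
    rng = random.Random(seed)
    d12, d13 = [], []
    for _ in range(trials):
        x1 = rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        x3 = rng.gauss(0, 1)
        d12.append(abs(x1 - x2))  # both differences share the value x1
        d13.append(abs(x1 - x3))
    return pearson(d12, d13)


print(shared_value_correlation())
```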

Stephen Tashi · Feb 3, 2012

klawson88 said:

... can I use the standard deviation equation to calculate the "mean difference deviation"?

What do you mean by "mean difference deviation"? Do you mean "the standard deviation of the differences"? (If so, you could just say "difference standard deviation".)

Let's look at an example. (Check my work.)

Suppose there are 3 data values {1, 2, 4}.

The formula you have differs from the one in Wikipedia in a trivial way. The formula you give excludes the case i = j. Since |x_i - x_i| = 0, it seems unnecessary to do that.

The "GMD" is
[tex] \frac{ |1-2| + |1-4| + |2-1| + |2-4| + |4-1| + |4-2| }{(3)(3-1) } = \frac{12}{6} = 2 [/tex]

How do you define the variance of "the differences"? Are you going to let each difference appear twice or just once? I don't think it matters for the usual definition of "sample variance".

If you count each difference once, you compute the variance of the data set {1,3,2}.
You get a mean of 2 and a variance of [itex] ( 1 + 1 +0) [/itex] divided by 3, which is 2/3.

If you count each difference twice, you compute the variance of the data set {1,3,2,1,2,3}.
You get a mean of 2 and a variance of (1 + 1 + 0 + 1 + 0 + 1) divided by 6, which is also 2/3.

However, some people define the sample variance to be [itex] \frac{ \sum_{i=1}^n(x_i - \bar{x})^2 }{n-1} [/itex] instead of [itex] \frac{ \sum_{i=1}^n(x_i - \bar{x})^2 }{n} [/itex]. Those people would get 2/2 for the first answer and 4/5 for the second answer.

So our candidates for the sample standard deviation are [itex] \sqrt{\frac{2}{3}} ,\sqrt{\frac{2}{2}}, \sqrt{\frac{4}{5}} [/itex].
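
The arithmetic above can be double-checked with a short Python sketch. It uses exact fractions and the standard library's `pvariance` (divide by n) and `variance` (divide by n-1); the variable names are just for illustration.

```python
from fractions import Fraction
from itertools import combinations, permutations
from statistics import pvariance, variance

data = [1, 2, 4]

# Each difference counted once: unordered pairs give {1, 3, 2}.
once = [Fraction(abs(a - b)) for a, b in combinations(data, 2)]
# Each difference counted twice: ordered pairs with i != j.
twice = [Fraction(abs(a - b)) for a, b in permutations(data, 2)]

gmd = sum(twice) / len(twice)
print(gmd)                                # 2

# Dividing by n: both ways of counting give the same variance.
print(pvariance(once), pvariance(twice))  # 2/3 2/3

# Dividing by n-1: the two ways of counting disagree.
print(variance(once), variance(twice))    # 1 4/5
```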

The method you propose:

1. Take the difference of the mean difference and each single difference
2. Square each result and add up the resulting numbers
3. Divide by the total number of differences
4. Take the square root of #3

agrees with the answer [itex] \sqrt{\frac{2}{3}} [/itex], doesn't it?
