Can the standard deviation calculation be generalized for other statistics?

In summary, the individual is trying to calculate the "mean difference deviation" in order to generate a confidence interval for this quantity. They are wondering if they can use the standard deviation equation to calculate this "mean difference deviation" and have provided proposed steps for doing so. They have also looked up other methods for calculating this quantity but are having difficulty translating the equations into code. They are seeking insight and clarification on this topic.
  • #1
klawson88
I've calculated the mean difference of my (normally distributed) data set. The mean difference is defined as:
The average absolute difference of any two independent values in a data set

Now, I'm trying to calculate the "mean difference deviation" in order to generate a confidence interval for this quantity ( "95% of the differences in the set are greater than ____"). My question is: can I generalize the standard deviation formula to calculate this? If we take the following concepts to be parallel:

Code:
Mean         <---------> Mean Difference
Single value <---------> Single Difference


... can I use the standard deviation equation to calculate the "mean difference deviation"? Namely turning this:
Standard deviation calculation steps

1. Take the difference of the mean and each single value
2. Square each result and add up the resulting numbers
3. Divide by the total number of values
4. Take the square root of #3

to this:

Proposed "mean difference deviation" calculation steps

1. Take the difference of the mean difference and each single difference
2. Square each result and add up the resulting numbers
3. Divide by the total number of differences
4. Take the square root of #3
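
The proposed steps can be sketched in Python roughly as follows. This is only an illustration, not a vetted method: it assumes each unordered pair of values is counted once and that the divisor is the number of differences (not that number minus one). The function names and the sample data are made up for the example.

```python
from itertools import combinations
from math import sqrt


def pairwise_abs_differences(data):
    """Absolute difference |x_i - x_j| for every unordered pair i < j."""
    return [abs(a - b) for a, b in combinations(data, 2)]


def mean_difference(data):
    """Average absolute difference over all pairs of distinct values."""
    diffs = pairwise_abs_differences(data)
    return sum(diffs) / len(diffs)


def mean_difference_deviation(data):
    """Steps 1-4 of the proposal, applied to the pairwise differences."""
    diffs = pairwise_abs_differences(data)
    md = sum(diffs) / len(diffs)
    # Steps 1-2: squared distance of each difference from the mean difference.
    total = sum((d - md) ** 2 for d in diffs)
    # Steps 3-4: divide by the number of differences, then take the root.
    return sqrt(total / len(diffs))


sample = [2.0, 3.5, 5.0, 9.0]  # hypothetical data
print(mean_difference(sample), mean_difference_deviation(sample))
```

Note that this enumerates all n(n-1)/2 pairs, so the work grows quadratically with the size of the data set.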

I've looked up more direct ways to calculate this quantity, and all of them are contained in statistics articles (1: http://www.jstor.org/stable/pdfplus/2333957.pdf?acceptTC=true, 2: http://www.jstor.org/stable/pdfplus/2236592.pdf, 3: http://www.jstor.org/stable/pdfplus/2282402.pdf) that are very dense; so much so that I can't even determine if it's what I'm looking for, much less how to go about translating it into code (and I haven't even touched on efficiency concerns).

Can anyone provide some insight? And if it turns out this can't be done, would anyone mind taking a crack at translating the derived equations in those articles into English?
 
  • #2
Hey klawson88 and welcome to the forums.

It sounds like you are doing pretty much the same thing for the second case with the exception that the random variable is defined in a more complex way.

It seems like you are finding a measure of variation, but that you are referring to different things (one random variable involves a relationship of other random variables whereas the first is just a normal random variable).

It might help if you describe what you mean for the 'mean distribution' to be mathematically in terms of a formula of random variables and means (you can use E(X) to denote the mean of a particular random variable X).

We actually do this in statistical applications quite a bit. Although we deal only with the mean and the variance/standard deviation, we do so in different contexts, and the same quantity has a particular interpretation in one context versus another.

If you go on to study more statistics, it will help to understand how to create a new random variable from existing random variables using a formula that relates them. That way you will be able to see mathematically that although you are just doing a "normal standard deviation calculation", defining your new random variable in a certain way encodes "a specific kind of information" relevant to the actual formula.
 
  • #3
Thanks chiro for the insight. I feel a lot more confident using the formula now. The formula for the mean difference (which is what I assume you meant by "mean distribution") is:

[tex] \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} |x_i - x_j| [/tex]
 
  • #4
klawson,

I don't know whether you are doing this work for any serious purpose. In case you are, I think you had better read those articles. (Unless a forum member is a JSTOR subscriber, that person cannot read the articles in your links. I can't - not that I'm promising to do so if they become available!)

klawson88 said:
Now, I'm trying to calculate the "mean difference deviation" in order to generate a confidence interval for this quantity ( "95% of the differences in the set are greater than ____").
You didn't say whether "the set" is the data or the population from which the data is drawn.

I don't think you are talking about a "confidence interval" in a technically correct way. The current Wikipedia article on confidence interval might straighten you out.
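
To illustrate the distinction: a statement of the form "95% of the differences in the set are greater than ____" reads like a percentile of the observed differences rather than a confidence interval for the mean difference. A rough sketch of the percentile reading, assuming "the set" means the data itself and using a simple nearest-rank rule (the function name and the sample data are hypothetical):

```python
from itertools import combinations
from math import ceil


def lower_percentile_of_differences(data, fraction=0.05):
    """Nearest-rank percentile of the pairwise absolute differences.

    Returns the smallest observed difference d such that at least
    `fraction` of the differences are <= d.
    """
    diffs = sorted(abs(a - b) for a, b in combinations(data, 2))
    rank = max(1, ceil(fraction * len(diffs)))  # 1-based nearest rank
    return diffs[rank - 1]


data = [1, 2, 4, 7, 11]  # hypothetical data
print(lower_percentile_of_differences(data))        # 5th percentile
print(lower_percentile_of_differences(data, 0.5))   # median-like cut
```

A confidence interval, by contrast, would be a statement about the unknown mean difference of the population from which the data were drawn.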

You are apparently planning to assume the differences are independent and normally distributed. It isn't clear that they are independent. For example |x1 - x2| and |x1 - x3| share the value x1. There may be some theory that says that they have a normal distribution and even that they are independent. If so, you should learn that theory - at least its results.
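
A quick simulation can make the dependence concrete. The sketch below assumes independent standard normal draws (an assumption made purely for illustration, with a fixed seed) and estimates the correlation between |x1 - x2| and |x1 - x3|. A value clearly above zero indicates that the two differences are not independent.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_shared_value_correlation(trials=100_000, seed=0):
    """Correlation of |x1 - x2| and |x1 - x3| over many simulated triples."""
    rng = random.Random(seed)
    d12, d13 = [], []
    for _ in range(trials):
        x1, x2, x3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        d12.append(abs(x1 - x2))  # both differences share x1
        d13.append(abs(x1 - x3))
    return pearson(d12, d13)


print(simulate_shared_value_correlation())
```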
 
  • #5
klawson88 said:
... can I use the standard deviation equation to calculate the "mean difference deviation"?

What do you mean by "mean difference deviation"? Do you mean "the standard deviation of the differences"? ( If so you could just say "difference standard deviation".)

Let's look at an example. (Check my work.)

Suppose there are 3 data values {1, 2, 4}.

The formula you have differs from the one on Wikipedia in a trivial way. The formula you give excludes the case i = j. Since |x_i - x_i| = 0, it seems unnecessary to do that.

The "GMD" is
[tex] \frac{ |1-2| + |1-4| + |2-1| + |2-4| + |4-1| + |4-2| }{(3)(3-1)} = \frac{12}{6} = 2 [/tex]
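
The value of 2 can be checked numerically by summing |x_i - x_j| over all ordered pairs with i ≠ j, of which there are six when n = 3. A minimal sketch:

```python
from itertools import permutations

data = [1, 2, 4]
n = len(data)

# Sum |x_i - x_j| over all ordered pairs with i != j (6 pairs for n = 3).
total = sum(abs(a - b) for a, b in permutations(data, 2))
gmd = total / (n * (n - 1))
print(total, gmd)  # 12 2.0
```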

How do you define the variance of "the differences"? Are you going to let each difference appear twice or just once? I don't think it matters for the usual definition of "sample variance".

If you count each difference once, you compute the variance of the data set {1,3,2}.
You get a mean of 2 and a variance of [itex] ( 1 + 1 +0) [/itex] divided by 3, which is 2/3.

If you count each difference twice, you compute the variance of the data set {1,3,2,1,2,3}.
You get a mean of 2 and a variance of (1 + 1 + 0 + 1 + 0 + 1) divided by 6, which is also 2/3.

However, some people define the sample variance to be [itex] \frac{ \sum_{i=1}^n(x_i - \bar{x})^2 }{n-1} [/itex] instead of [itex] \frac{ \sum_{i=1}^n(x_i - \bar{x})^2 }{n} [/itex]. Those people would get 2/2 for the first answer and 4/5 for the second answer.

So our candidates for the sample standard deviation are [itex] \sqrt{\frac{2}{3}} ,\sqrt{\frac{2}{2}}, \sqrt{\frac{4}{5}} [/itex].
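
These candidate variances can be reproduced with Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The sketch below uses `Fraction` inputs only so that the results print as exact fractions; the dictionary labels are made up for this example.

```python
from fractions import Fraction
from statistics import pvariance, variance

once = [Fraction(d) for d in (1, 3, 2)]            # each difference once
twice = [Fraction(d) for d in (1, 3, 2, 1, 2, 3)]  # each difference twice

results = {
    "once, divide by n": pvariance(once),
    "twice, divide by n": pvariance(twice),
    "once, divide by n-1": variance(once),
    "twice, divide by n-1": variance(twice),
}
for label, value in results.items():
    print(label, value)
```

Taking the square root of each variance gives the corresponding candidate for the standard deviation.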

The method you propose:
1. Take the difference of the mean difference and each single difference
2. Square each result and add up the resulting numbers
3. Divide by the total number of differences
4. Take the square root of #3

agrees with the answer [itex] \sqrt{\frac{2}{3}} [/itex], doesn't it?
 

1. Can the standard deviation be used for any type of data?

Yes, the standard deviation can be used for any type of numerical data, including both discrete and continuous variables.

2. Is the standard deviation calculation affected by extreme values in the data?

Yes, extreme values can have a significant impact on the standard deviation, as it is calculated by taking the square root of the variance, which is heavily influenced by outliers.

3. Can the standard deviation be negative?

No, the standard deviation cannot be negative as it represents a measure of the spread or variability of the data from the mean. If the result of the calculation is negative, it is likely due to an error in the data or the calculation itself.

4. Is the standard deviation a robust measure of variability?

No, the standard deviation is not a robust measure of variability, as it is highly sensitive to extreme values. A more robust alternative is the interquartile range, which is far less affected by outliers.

5. Can the standard deviation calculation be generalized for non-normal distributions?

Yes, the standard deviation can be computed for non-normal distributions, but for skewed or heavy-tailed data it may not represent the spread well. In these cases, alternative measures of variability such as the interquartile range may be more appropriate.
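
The sensitivity to extreme values described in the answers above can be seen in a small example with made-up numbers: replacing the largest of nine values with an extreme one inflates the standard deviation, while the interquartile range happens to stay the same in this particular case. The helper `iqr` is defined here for illustration and relies on the default method of `statistics.quantiles`.

```python
from statistics import pstdev, quantiles


def iqr(data):
    """Interquartile range using the default method of statistics.quantiles."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 1000]  # largest value replaced

print(pstdev(clean), iqr(clean))
print(pstdev(with_outlier), iqr(with_outlier))
```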
