Calculating the standard deviation of the standard deviation

In summary, the standard deviation of the standard deviation is the uncertainty in a standard deviation estimated from a finite sample: because the sample standard deviation is itself a random variable, it has its own spread, and that spread shrinks as the sample size grows.
  • #1
Chain
I was wondering if anyone could help me with calculating the standard deviation of the standard deviation. What I mean by this is, say, for example, I roll a die 100 times and then calculate the mean and standard deviation from the results I collected. The results are not going to be exact because I took a finite sample size [itex] N [/itex]. I could calculate the standard deviation of the result for the mean, which would be:

[itex] \sigma/\sqrt{N} [/itex]

where [itex] \sigma [/itex] is the true standard deviation, not the measured one. Since it's possible to calculate the standard deviation of the mean, I was wondering whether it's also possible to do it for the standard deviation itself. Essentially, I want to calculate the error in the standard deviation calculated from a finite sample size.
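To see the [itex] \sigma/\sqrt{N} [/itex] behaviour numerically, here is a minimal Python sketch (the fair-die distribution and the number of repeated experiments are assumptions chosen purely for illustration):

[code]
import numpy as np

rng = np.random.default_rng(0)
N = 100          # rolls per experiment
trials = 20000   # number of repeated experiments

# True parameters of a fair six-sided die
faces = np.arange(1, 7)
true_sd = np.sqrt(np.mean((faces - faces.mean()) ** 2))   # about 1.708

# Simulate many experiments and look at the spread of the sample means
rolls = rng.integers(1, 7, size=(trials, N))
sample_means = rolls.mean(axis=1)

print("observed SD of the sample mean:", sample_means.std())
print("predicted sigma/sqrt(N):       ", true_sd / np.sqrt(N))
[/code]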
 
  • #2
You need to state your goal using more precise language. It isn't clear whether you are asking a question about "estimation" or about theoretical calculations, or whether you are merely asking a question about convention.

In your post you made the distinction between "the true standard deviation" and "the measured one". This is an important distinction. It applies to all parameters of the population being sampled - for example, the "true mean" (which is the population mean) is a different concept from the "measured mean", which is the sample mean. A common goal in statistics is to estimate a population parameter by doing computations on a sample. For a population parameter such as the standard deviation, there are several different formulas that can be used to estimate the population parameter. Which formula is "best" depends on how you define the precise meaning of "best".

On the other hand, there are conventional meanings for terms like "the sample mean" and "the sample standard deviation". Unfortunately, different textbooks define "the sample standard deviation" in different ways. But once you select a definite meaning for that term, you can compute the sample standard deviation from a given sample of data. It doesn't matter what the data represents. Each single value in the sample might itself be a sample standard deviation computed from sample values of a different random variable.

The term "sample standard deviation" is often used to indicate a single number such as when we say "the sample standard deviation was 23.8". This is technically not correct. The sample standard deviation is a formula applied to values in the sample. The values in the sample are random variables. Hence the "sample standard deviation" is a random variable. What we should say is that "The realization of the sample standard deviation was 23.8" since this refers to one observation of a random variable. Since the sample standard deviation is a random variable, this random variable has a probability distribution and the distribution has parameters that specify its own population mean and population standard deviation. (This is what makes statistics complicated and where students in introductory courses get confused. It seems to be a snake swallowing its tail.)

If X is a given random variable, the standard deviation of the "population standard deviation" of X would have to be defined as zero, because the population standard deviation of X is a constant. It doesn't depend on samples.

If we define the random variable Y to be the sample standard deviation of a set of N independent measurements of X then we can do theoretical calculations to compute the standard deviation of Y as a function of the population parameters of the distribution of X. These calculations don't involve using specific numbers from sample data.

If we don't assume the distribution of X is known, then we can ask how to best estimate the parameters of the distributions of X and Y from data in a sample. However, this is not a precise question. The problem must be fleshed out by specifying what we do know about the distribution of X and how we intend to define "best". (The common definitions of "best" involve the technical definitions of "unbiased", "maximum likelihood", and "minimum expected squared error". Different goals can lead to different formulas.)
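To make the "several different formulas" point concrete, here is a small Python sketch (the normal population, the sample size, and the number of repetitions are assumptions made purely for illustration) comparing the divide-by-n and divide-by-(n-1) versions of the sample variance as estimators of the population variance:

[code]
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0    # population variance of an assumed N(0, 2^2) population
n = 10            # small sample, where the difference is easy to see
trials = 100_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

# Two common "sample variance" formulas applied to the same data
biased = samples.var(axis=1, ddof=0)      # divide by n
unbiased = samples.var(axis=1, ddof=1)    # divide by n-1

print("average divide-by-n estimate:    ", biased.mean())    # close to (n-1)/n * 4 = 3.6
print("average divide-by-(n-1) estimate:", unbiased.mean())  # close to 4.0
[/code]

Which of these is "best" depends on the criterion: the n-1 version is unbiased for the variance, while the divide-by-n version is the maximum-likelihood estimator for a normal population.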
 
  • #3
Okay, so if I roll a die 100 times I would predict the mean value of the sum of all the rolls to be:

[itex] \langle x \rangle = \sum_{n=1}^{100} \sum_{i=1}^6 P_i i = \sum_{n=1}^{100} \sum_{i=1}^6 i/6 = 350 [/itex]

and we would expect the mean squared deviation to be:

[itex] \langle x^2 \rangle - \langle x \rangle^2 = \sum_{n=1}^{100} \left(\sum_{i=1}^6 P_i i^2 - \left(\sum_{i=1}^6 P_i i\right)^2\right) = 100(15.17-3.5^2) \approx 292[/itex]

Which gives a standard deviation of:

[itex] \sqrt{292}\approx 17 [/itex]

So I would expect the sum of all the dice to be 350[itex]\pm[/itex]17. Now this has all been pen and paper, so let's say I actually roll 100 dice and calculate the average and standard deviation using the data I collected. I may end up with a result like 360[itex]\pm[/itex]15. The mean and standard deviation values I get from actually rolling 100 dice are also random variables.

I want to know with what certainty I can claim my result for the standard deviation is correct. Obviously this will be a function of the sample size [itex]N[/itex], since if I rolled 1000 dice I would get much more accurate values for the mean and standard deviation of the sample.
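To check these numbers, here is a short Python sketch (the number of simulated experiments is an arbitrary choice for illustration) that simulates rolling 100 dice many times and looks at both the sum and the per-experiment sample standard deviation:

[code]
import numpy as np

rng = np.random.default_rng(2)
n_rolls = 100
trials = 20000

rolls = rng.integers(1, 7, size=(trials, n_rolls))
sums = rolls.sum(axis=1)
sample_sds = rolls.std(axis=1, ddof=1)   # measured SD of a single roll, per experiment

print("mean of the sums:", sums.mean())   # close to 350
print("SD of the sums:  ", sums.std())    # close to sqrt(100 * 35/12), about 17.1
print("spread of the measured per-experiment SDs:", sample_sds.std())
[/code]

The last line is exactly the quantity being asked about: how much the measured standard deviation itself wanders from one 100-roll experiment to the next.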
 
  • #4
The sample standard deviation is typically defined by
[tex] \sigma = \sqrt{ \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \overline{X}) } [/tex]
where [itex] \overline{X}[/itex] is the sample mean.

To find the standard deviation of sigma you just want to calculate
[tex] \sqrt{ E(\sigma^2) - E(\sigma)^2} [/tex]
Interestingly, [itex] E(\sigma^2)[/itex] is just the true variance of the random variable (because, with the [itex]n-1[/itex] in the definition, [itex] \sigma^2[/itex] is an unbiased estimator of the variance). Calculating [itex] E(\sigma)^2[/itex] looks challenging, which is interesting because usually this is the easier term to deal with. It would be easier to deal with the sample variance,
[tex] \sigma^2 = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \overline{X}) [/tex]
and the variance of the sample variance,
[tex] E(\sigma^4) - E(\sigma^2)^2[/tex]
Everything is now just a polynomial in the [itex]X_k[/itex], so you should be able to calculate it in terms of the moments of your random variable.

Alternatively, you can start with an unbiased estimator of the standard deviation:
http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

This is highly dependent on which random variable you start with, and hard to calculate, but an asymptotically correct formula is given. Once you have that, [itex] E(\sigma)^2[/itex] is known and [itex]E(\sigma^2)[/itex] is something you can calculate because it will be some constant times the true variance.

I guess the moral of the story is that variance is a lot nicer than standard deviation
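For reference, the closed form that this moment calculation leads to for the unbiased sample variance is not derived in the thread, but the standard result is

[tex] \operatorname{Var}(s^2) = \frac{1}{n}\left(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\right) [/tex]

where [itex]\mu_4[/itex] is the population's fourth central moment and [itex]\sigma^2[/itex] its variance; for a normal population this reduces to [itex]2\sigma^4/(n-1)[/itex].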
 
  • #5
Office_Shredder said:
The sample standard deviation is typically defined by
[tex] \sigma = \sqrt{ \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \overline{X}) } [/tex]
...
It would be easier to deal with the sample variance,
[tex] \sigma^2 = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \overline{X}) [/tex]

You forgot to square the terms in the summation.
 
  • #6
Thanks for the replies guys that was helpful :)
 
  • #7
Actually, sorry, I think I need a bit more help: I have no idea how to go about evaluating

[itex] E(\sigma^2) [/itex] and [itex] E(\sigma^4) [/itex]

The expectation value is usually defined as

[itex] \frac{1}{N}\sum_{n=1}^N X_n [/itex]

where [itex] X_n [/itex] is the [itex] n^{th} [/itex] value of the sample data. However, for a given sample you only get one value of the variance, so would I just use that one value for all the [itex] X_n [/itex]?

If that were the case, then the standard deviation of my measured value of the sample variance would always come out to be zero, so I'm kind of confused.
 
  • #8
Okay I just did some computer simulations and I found that the standard deviation of the measured variance of my sample of random numbers seemed to depend on the sample size as:

[itex] \sigma^2/\sqrt{N} [/itex]

where [itex] \sigma^2 [/itex] is the mean value of the measured variance of the samples and [itex] N [/itex] is the sample size used to calculate the measured variance. To measure the standard deviation of my value of the measured variance, I simply produced a large number of samples, calculated the variance for each one, and then calculated the standard deviation of all of those values.

[itex] \sigma^2/\sqrt{N} [/itex] tended to slightly overestimate the measured value of the standard deviation of the variances; however, the values were always close.

I think I've found my answer but some theoretical justification would be nice.
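For anyone who wants to reproduce this, here is a minimal Python sketch of the kind of simulation described above (the fair-die distribution, the sample sizes, and the number of repetitions are assumptions chosen purely for illustration); it compares the observed spread of the measured variances with the [itex] \sigma^2/\sqrt{N} [/itex] rule of thumb and with the exact moment formula for the variance of the sample variance quoted earlier in the thread:

[code]
import numpy as np

rng = np.random.default_rng(3)
faces = np.arange(1, 7)
sigma2 = np.mean((faces - faces.mean()) ** 2)   # population variance, 35/12
mu4 = np.mean((faces - faces.mean()) ** 4)      # population 4th central moment

for N in (10, 100, 1000):
    trials = 50000
    rolls = rng.integers(1, 7, size=(trials, N))
    s2 = rolls.var(axis=1, ddof=1)               # sample variance of each sample

    observed = s2.std()                          # spread of the measured variances
    rough = sigma2 / np.sqrt(N)                  # the sigma^2/sqrt(N) rule of thumb
    exact = np.sqrt((mu4 - (N - 3) / (N - 1) * sigma2**2) / N)

    print(f"N={N:5d}  observed={observed:.4f}  sigma^2/sqrt(N)={rough:.4f}  exact={exact:.4f}")
[/code]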
 
  • #9
First note that the sample standard deviation and variance are conventionally written [tex] s [/tex] and [tex] s^2 [/tex] (not sigma and sigma squared as written above, which usually denote the population values).

Then note that
[tex]
E(s^2) = E\left(\frac 1 {n-1} \sum_{i=1}^n (x_i - \bar x)^2\right) = \frac 1 {n-1} \sum_{i=1}^n E\left((x_i - \bar x)^2\right)
[/tex]
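To finish the step (this continuation is not in the original post, but it is the standard calculation): because [itex]\bar x[/itex] is computed from the same data and is therefore correlated with each [itex]x_i[/itex],

[tex] E\left((x_i - \bar x)^2\right) = \frac{n-1}{n}\,\sigma^2 [/tex]

so [itex] E(s^2) = \frac{1}{n-1}\cdot n\cdot\frac{n-1}{n}\,\sigma^2 = \sigma^2 [/itex]. The [itex]n-1[/itex] in the definition is exactly what makes the sample variance an unbiased estimator of the population variance.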
 
  • #10
Okay, so that would give [itex] E(s^2)=\sigma^2 [/itex], the true variance, and for the other term something like

[itex] E(s^4)= \frac{1}{(n-1)^2} E\left(\left(\sum_{i=1}^{n} (x_i-\bar{x})^2\right)^2\right) [/itex]

which has to be expanded in terms of the moments of the [itex] x_i [/itex].

Okay thank you for the responses :)
 

1. What is the formula for calculating the standard deviation of the standard deviation?

There is no single universal formula; it depends on the population being sampled. In general, if [itex]s[/itex] is the sample standard deviation, its standard deviation is [itex]\sqrt{E(s^2)-E(s)^2}[/itex], which has to be worked out from the moments of the population. For an approximately normal population, a standard approximation is that the standard deviation of [itex]s[/itex] is about [itex]\sigma/\sqrt{2(N-1)}[/itex], where [itex]\sigma[/itex] is the population standard deviation and [itex]N[/itex] is the sample size.
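As a quick numerical sanity check of that approximation, here is a short Python sketch (the normal population, its standard deviation, and the sample size are assumptions made purely for illustration):

[code]
import numpy as np

rng = np.random.default_rng(4)
sigma = 2.0      # assumed population standard deviation
N = 50           # assumed sample size
trials = 100_000

samples = rng.normal(0.0, sigma, size=(trials, N))
s = samples.std(axis=1, ddof=1)      # sample standard deviation of each sample

print("observed SD of s:          ", s.std())
print("approx sigma/sqrt(2(N-1)): ", sigma / np.sqrt(2 * (N - 1)))
[/code]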

2. Why is it important to calculate the standard deviation of the standard deviation?

It tells you how much you can trust a standard deviation that was estimated from a finite sample. A quoted spread of, say, 15 means much more if you also know that this number is itself uncertain only by about 1 rather than by 5. In other words, it measures the reliability of the reported spread, not the spread of the data itself.

3. How is the standard deviation of the standard deviation affected by outliers?

Outliers can greatly inflate the standard deviation of the standard deviation, because the variability of the sample variance depends on the population's fourth moment. Heavy tails and extreme values make the sample standard deviation fluctuate much more from sample to sample, so the estimate becomes considerably less reliable.

4. Can the standard deviation of the standard deviation be negative?

No. Like any standard deviation, it is the square root of a variance, and a variance is never negative, so the result is always zero or positive.

5. How does the sample size affect the calculation of the standard deviation of the standard deviation?

The sample size [itex]N[/itex] has a direct effect: the standard deviation of the sample standard deviation shrinks roughly as [itex]1/\sqrt{N}[/itex], much like the standard error of the mean. Larger samples therefore give estimates of the standard deviation that fluctuate less from sample to sample and are correspondingly more reliable.
