# B What's wrong with these formulas of standard deviation?

1. Nov 11, 2016

### Prem1998

I just looked up the standard deviation formula. It makes quite a lot of sense to me because we're actually trying to 'measure' how much all the observations deviate from the mean value. To do this, we calculate all the possible differences, [X(mean) - x(i)], then we square all the differences to get rid of negative differences, then we find the arithmetic mean of them, and ultimately take the square root to undo the squaring that we did to the differences. This makes sense, but I could think of some other methods if our goal is just to determine how much the observations deviate. These are:
1. Arithmetic mean of all the values of Modulus[X(mean) - x(i)]
2. Geometric mean of all the values of Modulus[X(mean) - x(i)]
These two methods also make sense. And, we don't have to do the square root in the end, because basically we're just finding the average of the differences. So, I'm talking about using Modulus instead of squaring to get rid of negative differences.
Actually, there's another one I could think of:
3. Why don't we just calculate all the values of [X(mean) - x(i)] raised to any even power n (to get rid of negative differences), then find the arithmetic mean of these values and then calculate the nth root of that arithmetic mean? Why do we just use n=2? Basically, the standard deviation formula would now look like:
(arithmetic mean ((X(mean) - x(i))^n))^(1/n) , where n is even. Why do we just use n=2?

So, what's wrong with these 3 formulas of standard deviation?
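
To make the three proposals concrete, here is a minimal sketch that evaluates them on a small sample. The function names and the data are made up for illustration and are not from the thread. One thing it shows: the geometric-mean version collapses to zero as soon as any observation equals the mean.

```python
import math

def mean(values):
    return sum(values) / len(values)

def abs_deviations(xs):
    m = mean(xs)
    return [abs(m - x) for x in xs]

def mean_abs_dev(xs):
    """Formula 1: arithmetic mean of |X(mean) - x(i)|."""
    return mean(abs_deviations(xs))

def geometric_abs_dev(xs):
    """Formula 2: geometric mean of |X(mean) - x(i)|."""
    return math.prod(abs_deviations(xs)) ** (1 / len(xs))

def power_dev(xs, n=2):
    """Formula 3: (mean of (X(mean) - x(i))^n)^(1/n), n even; n=2 is the usual formula."""
    m = mean(xs)
    return mean([(m - x) ** n for x in xs]) ** (1 / n)

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

print(mean_abs_dev(sample))       # formula 1
print(geometric_abs_dev(sample))  # formula 2: zero, since two values equal the mean
print(power_dev(sample, 2))       # usual standard deviation
print(power_dev(sample, 4))       # formula 3 with n = 4
```

All of these grow when the data are more spread out, so as pure "which is wider" indicators they agree in direction. They differ in how strongly they weight the observations far from the mean.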

2. Nov 11, 2016

### Simon Bridge

At "B" level:
The variance (which the std deviation is the square root of) is something we use that way for historical reasons.
To get an idea why, try relating the definitions to the Normal distribution, particularly hypothesis testing and comparing statistical distributions.
Bottom line: we use these definitions because they are the simplest forms that are useful beyond just what they are.

3. Nov 11, 2016

### Prem1998

Talking about being simple, this definition definitely looks simpler:
1. Arithmetic mean of all the values of Modulus[X(mean) - x(i)]
It has less calculations, if that's what being simple means. The second formula which I've written is definitely not simple but it also makes sense if the goal is to find the average of deviations from the central or mean value.
Talking about usefulness, I don't understand why this one is less useful:
(arithmetic mean ((X(mean) - x(i))^n))^(1/n) , where n is an even natural number. It seems like a logical and sensible formula for standard deviation. I don't know why does it become useful only when n=2.

4. Nov 11, 2016

### Simon Bridge

Sure - but it is not as useful.
... you have to realize what it is useful for. Hint: not just as a measure of spread ...

Like I said - look at how it would fit in with the Normal distribution, combining distributions, and hypothesis testing... also "goodness of fit" formulae.
The trouble is that a decent answer is not at B level, which you asked for.

5. Nov 11, 2016

### Prem1998

I don't want to start another thread on A-level. Just tell me the decent answer. Pretend I'll understand it.

6. Nov 11, 2016

### Simon Bridge

You realise that, in general, the mean and variance are given by integrals?

For probability distribution p(x), $P(a<x<b)=\int_a^bp(x)\; dx$
p(x) has mean: $\mu = \int_\infty xp(x)\; dx$ and variance $\text{var}(X) = \sigma^2 = \int_\infty (x-\mu)^2p(x)\; dx$

If you wanted to use something else as a measure of spread - say: $s=\int |\mu - x|p(x)\; dx$ then try to use it.

e.g. the gaussian: $p(x)=Ne^{-k(x-\mu)^2}$ where $N$ and $k$ are constants so that $\int_\infty p(x)\; dx =1$ and the units come out right.
Using the variance as the measure of spread - you get $k=1/(2\sigma^2)$ and $N=1/\sqrt{2\pi\sigma^2}$
Your task is to find s, and represent N and k in terms of s.
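
For reference, a sketch of how that task works out. It is not part of the original post; it assumes the normalisation $N=\sqrt{k/\pi}$ and the standard integral $\int_0^\infty u e^{-ku^2}\,du = 1/(2k)$.

$$
s=\int_{-\infty}^{\infty}|x-\mu|\,Ne^{-k(x-\mu)^2}\,dx
 = 2N\int_0^\infty u\,e^{-ku^2}\,du
 = \frac{N}{k}
 = \frac{1}{\sqrt{\pi k}}
$$

$$
k=\frac{1}{\pi s^2},\qquad N=\frac{1}{\pi s},\qquad s=\sigma\sqrt{2/\pi}
$$

So for the gaussian, s is just a fixed multiple of $\sigma$ and the curve can be written in terms of either. The harder questions are the ones that follow, about what happens when distributions are combined.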

If you have two distributions with different "s" values, how do they combine? How do you compare them?
If you add two independent, normally distributed variables, then the new distribution has a variance that is the sum of the variances of the two that were added - does that work out for s?

OK - maybe could choose $2\sigma^2$ for the spread measure - but now we are splitting hairs.

Ultimately it does kinda boil down to what mathematicians in history thought was a good idea and fit well with other parts of maths - like we have 12x3600 seconds on a 12hr clock face but 360x3600 seconds on a protractor even though it is the same circle. All I tried to do was demonstrate that there are uses beyond just saying which distribution is more spread out.

7. Nov 11, 2016

### Prem1998

I'm sorry. I shouldn't have asked for that. I understood little but you're saying that we can't get N and k (maybe some important things) in terms of 's' where 's' is the measure of spread that I'm talking about.
One more thing, if we have x= (arithmetic mean ((X(mean) - x(i))^4))^1/4, then what percentage of values in the normal distribution will lie in the range (-x, x) and in (-3x, 3x)? Just asking.
EDIT: Again, one more thing, if we had p(x) = Ne^(-k*(x-u)^4), then could we use this n=4 definition of standard deviation? I'm guessing that the normal definition is used because of the exponent 2 in the expression of p(x).

Last edited: Nov 11, 2016
8. Nov 11, 2016

### Staff: Mentor

You are trying to infer a simple answer based on some "special" numbers you see. You are trying to make something "simple" because, I guess, you do not have the math background to see why that is not always the general answer. @Simon Bridge gave you a very good answer.

If you want to go an alternate route with a less steep learning curve, please try the cookbook method. Special numbers get you in trouble.

I once took a horribly boring grad course in stats. Kind of like 'statistics for poets', I guess. Clearly not from the math department. Anyway.
There are some references for statistics approaches that are like that. Try them.

If you know some Excel, http://www.dummies.com/software/mic...ow-to-use-excels-descriptive-statistics-tool/

This book is probably great for you - it is based on precisely the idea you have. Make it simple to understand. So. They made it like a recipe book. You learn to boil water. And slowly progress until you learn about adding spices and new flavoring veggies.

Then, you can go back to taking a more rigorous approach if you need it. But you won't be tempted to go over to the dark side.

9. Nov 11, 2016

### Stephen Tashi

The term "standard deviation" is ambiguous. What you are talking about are formulas that are applied to values in a sample of a random variable. (A "sample" can consist of several measurements of the same random variable. A single measurement of a random variable is often called an "outcome".) These formulas are formulas for "estimators".

The standard deviation "of a population" or "of a random variable" is a different concept than an "estimator" of those standard deviations. You are discussing a formula that estimates the "standard deviation of a population", so you are discussing a formula for an "estimator". The number this formula produces is called "the sample standard deviation". (In fact, there are several slightly different formulas that can be used to estimate the population standard deviation, so it isn't correct to think that the formula you mentioned is "the" (i.e. the only) estimator of the population standard deviation.)

To further complicate matters, a formula for an estimator of the standard deviation of a random variable can be regarded as defining another random variable (because its value depends on the random values in a sample.) So there is an ambiguity in the phrase "sample standard deviation". The phrase might mean the estimator considered as a random variable, or it might mean a specific value of that random variable, like 230.91.

The question of why people apply the formula
eq. 1) $\sigma_{estimated} = \sqrt{ (1/N) \sum_{i=1}^N (x_{estimated\ mean} - x_i)^2}$
to estimate the population standard deviation has two parts.

A) Why is the population standard deviation defined the way it is?
B) In what sense is the formula eq. 1) a "good" way of estimating it?

As to question A), definitions in mathematics don't have proofs. They are (technically) arbitrary stipulations. However, we may ask the sociological question: "What motivated mathematicians to define the population standard deviation in the way they chose?"

The situation frequently arises when a random variable X (e.g. the length of some assembly) is the sum of other independent random variables Y and Z (e.g. the lengths of two independently manufactured parts). Denoting the standard deviations of these random variables by $\sigma_X, \sigma_Y, \sigma_Z$, there is a theorem that tells us $\sigma_X^2 = \sigma_Y^2 + \sigma_Z^2$. This handy formula would not be correct if we used a different definition for "population standard deviation". So mathematicians are motivated to define "population standard deviation" the way it is defined because that definition leads to a useful statement about the variability of the sum of independent random variables.
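
As an illustration of that theorem (a sketch, not part of the original post), take Y and Z to be two independent fair six-sided dice and compute everything exactly with fractions. The helper names are hypothetical. The variance of the sum equals the sum of the variances, while the mean absolute deviation does not add in the same way.

```python
from fractions import Fraction
from itertools import product

def moments(dist):
    """Return (variance, mean absolute deviation) of a {value: probability} dict."""
    mu = sum(v * p for v, p in dist.items())
    var = sum((v - mu) ** 2 * p for v, p in dist.items())
    mad = sum(abs(v - mu) * p for v, p in dist.items())
    return var, mad

def sum_of_independent(dist_y, dist_z):
    """Exact distribution of Y + Z when Y and Z are independent."""
    out = {}
    for (y, py), (z, pz) in product(dist_y.items(), dist_z.items()):
        out[y + z] = out.get(y + z, 0) + py * pz
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}
var_y, mad_y = moments(die)
var_x, mad_x = moments(sum_of_independent(die, die))

print(var_y, var_x, var_x == var_y + var_y)  # variances add
print(mad_y, mad_x, mad_x == mad_y + mad_y)  # mean absolute deviations do not
```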

The question of why the formula eq. 1) above is a "good" way of estimating the population standard deviation is complicated and can't be understood without a complete appreciation of the basics of probability and statistics. We can observe that the word "good" is mathematically ambiguous. Likewise it would be ambiguous to ask if the formula eq. 1) for an estimate of the population standard deviation is "correct" or "accurate". An estimator depends on the values in a sample. Any estimation formula might be "way off" because the values in a sample are random selections from the population. (E.g. the average height of 5 randomly chosen people might be "way off" from the average height of the total population.)

In the early history of statistics there was a concept of a "consistent" estimator. In modern statistics the definition of a "consistent" estimator is very technical and it is not the same as the old concept, but the old concept has intuitive appeal. The old concept was that a "consistent" estimator of a population parameter is an estimator whose formula resembles the formula for the population parameter that it was trying to estimate - in the sense that we use observed frequencies $f(v_i)$ of values in the formula for the estimator in the places where probabilities $p(v_i)$ appear in the formula for the population parameter.

For example, the formula for the usual estimator of the population mean can be expressed as $\mu_{estimate} = \sum_{i=1}^M v_i f(v_i)$ where $f(v_i)$ is the fraction of values in the sample where the outcome was $v_i$. The definition of the population mean (for a discrete random variable) can be expressed as $\mu = \sum_{i=1}^M v_i p(v_i)$ where $p(v_i)$ is the probability that the value $v_i$ occurs. The observed frequency $f(v_i)$ of a value in a sample will not necessarily be equal to the probability $p(v_i)$ of that value occurring. However, there is an intuitive appeal in using $f(v_i)$ as an estimate of $p(v_i)$.

So, without going into the technicalities of question B), you can use the old fashioned concept of a "consistent" estimator to get an intuitive idea why the formula eq. 1) is "good" for estimating the population standard deviation. Rewrite eq. 1) as an equation involving frequencies $f(v_i)$ and you can see it resembles the definition of the population standard deviation, which is given in terms of probabilities $p(v_i)$.
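
A small sketch of that rewriting, with made-up sample values and hypothetical function names. It computes eq. 1) once directly from the sample and once from the observed frequencies $f(v_i)$, and the two agree up to floating-point rounding.

```python
from collections import Counter
import math

def sd_from_values(xs):
    """eq. 1) applied directly to the sample values."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((mean - x) ** 2 for x in xs) / n)

def sd_from_frequencies(xs):
    """eq. 1) rewritten with observed frequencies f(v_i) where p(v_i) would appear."""
    n = len(xs)
    freq = {v: count / n for v, count in Counter(xs).items()}
    mean = sum(v * f for v, f in freq.items())
    return math.sqrt(sum((v - mean) ** 2 * f for v, f in freq.items()))

sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]  # hypothetical outcomes
print(sd_from_values(sample), sd_from_frequencies(sample))
```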

10. Nov 11, 2016

### Simon Bridge

You can - it's just a lot of work.

Let's try a bit of motivation:
Statistical distributions are tricky to talk about - especially in the continuous case. To describe them properly can take a long time.

When we want a shorthand to describe statistics we talk about the central value of the distribution and the spread it has about that central value.
So we use two numbers (c,s) ... c= central value, s=spread value. (I want to avoid $\mu$ and $\sigma$ - reserving those names for the mean and standard deviation) ... we can define a deviation value d = f(s), if you like. I want you to take your focus off the standard deviation for a bit, it's misleading you.

For a discrete distribution you may have variable X can take on values x picked from $x\in\{ x_1, x_2, \cdots , x_N\}$
Let's make this an ordered set so $x_n \leq x_{n+1}: 1\leq n < N$ ... so values can repeat, but, in general, lower n means smaller x.
(Are you following the notation?)

What I want to do is look at ways of deciding how to assign values to c and s without necessarily referring to standard methods like mean and variance.

So what sort of ways are there ... here's some samples:

Some ways to estimate the central value:
1. $c = \frac{1}{2}(x_N+x_1)$ ... this would be the value half way between the extreme values.
2. $c=x_{(N+1)/2}$ if N is odd, and $c=\frac{1}{2}[x_{N/2} + x_{1+(N/2)}]$ if N is even. (the median - like 1. but it accounts for skewness)
3. $c = \frac{1}{N}\sum_n x_n$ (the mean you are used to)

... of these, the first is the simplest to use... why not just use that one all the time?
You should recognise the other two as commonly used estimates ... there are lots of other ways, e.g. you could just pick the value that appears most often, you could take the average of the middle 6 values ... anything you like. So how do you decide which one to use?

Some ways to estimate the spread:
1. $s = x_{N}- x_{1}$
2. $s = x_{3N/4}-x_{N/4}$ ( - need rules to deal with N not divisible by 4)
3. $s = \frac{1}{N}\sum_n|c-x_n|$ ($d=s/2$?)
4. $s = \frac{1}{N}\sum_n (c-x_n)^2$ ($d = \sqrt{s}$?)
5. $s = \frac{1}{N}\sum_n (c-x_n)^4$ ($d=\sqrt[4]{s}$?)

... of these, the first is the simplest. So why not just use it all the time? (You recognise 1 and 2 from box and whisker plots right?)
There are lots of other ways, so how do you decide which one to use?
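
To see how much these choices can disagree, here is a rough sketch that applies the three central-value estimates and the first four spread estimates to one made-up data set. The function names and the data are illustrative only. For simplicity it requires N divisible by 4, so the quartile rule needs no special cases.

```python
def central_estimates(xs):
    """Return (midrange, median, mean), i.e. estimates 1-3 for c."""
    x = sorted(xs)
    n = len(x)
    midrange = (x[-1] + x[0]) / 2
    if n % 2 == 1:
        median = x[(n + 1) // 2 - 1]          # x_{(N+1)/2}, shifted to 0-based indexing
    else:
        median = (x[n // 2 - 1] + x[n // 2]) / 2
    mean = sum(x) / n
    return midrange, median, mean

def spread_estimates(xs):
    """Return spread estimates 1-4 for s, using the mean as the central value c."""
    x = sorted(xs)
    n = len(x)
    if n % 4 != 0:
        raise ValueError("this sketch only handles N divisible by 4")
    c = sum(x) / n
    full_range = x[-1] - x[0]
    quartile_gap = x[3 * n // 4 - 1] - x[n // 4 - 1]   # x_{3N/4} - x_{N/4}
    mean_abs = sum(abs(c - v) for v in x) / n
    mean_sq = sum((c - v) ** 2 for v in x) / n
    return full_range, quartile_gap, mean_abs, mean_sq

data = [1, 2, 2, 3, 4, 5, 7, 12]  # hypothetical, skewed by one large value
print(central_estimates(data))
print(spread_estimates(data))
```

With a single large value in the data, the midrange and the full range move the most, which is one practical reason not to use the simplest estimate all the time.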

For both central value and spread, a good choice for a general approach is to pick the estimate that works well in most of the situations you will encounter - that's why there are 360° in a circle... it's kinda arbitrary but not completely: it just divides by a lot of numbers so most angles you have to deal with will come out to a whole number of degrees so it's handy.

The most common distribution is the normal distribution ... that is why it is called "normal", though physicists usually prefer "gaussian".
It has general shape like a bell curve, you've seen these: $p(x) = Ae^{-(ax-b)^2}$ ... notice I simplified the expression from last time so it is more generic.

This curve is perfectly symmetric about a central maximum at $x=b/a$ ... so it seems reasonable to put $c=b/a$ for that distribution right?

To be a probability distribution, $A = a/\sqrt{\pi}$ ... so $a$ will have something to do with the way the curve spreads out.
(Try plotting the curve in a graphing program to see what I mean.)

Taking a closer look, turns out the bigger $a$ is the smaller the spread - so $s=1/a$ makes sense perhaps?

What I have tried to do above is make no assumptions about how to find a central value and a spread value, just using intuitive ideas about what those things mean and what they may be for.

The result is that the normal distribution now looks like this: $p(x) = \frac{1}{s\sqrt{\pi}}e^{-(x-c)^2/s^2}$

There's more - but that will do for now. See if you can absorb this...
The next step is to compare this intuitive approach to what is actually done.
You can probably guess that the intuitive s and c will turn out to be something like the usual definition of variance and mean ;)

Last edited: Nov 11, 2016
11. Nov 11, 2016

### Prem1998

So, since we frequently encounter distributions of the form p(x)=Ne^-(ax-b)^2 so it is convenient to use s=1/a or s=(summation[X(mean)-x(i)]^2) and c=b/a or c= X(mean). And, these two definitions are good for comparing and combining the normal distributions. Maybe, they are good for combining because of the property s(x)^2=s(y)^2+s(z)^2 pointed out by Stephen Tashi but I don't agree with the comparing part. In comparing, these quantities are just to compare which distribution is more spread, right? And, the value of spread given by each of these formulas of spread will be higher when the observations vary more and will be lower when the observations vary less.
What about non-continuous discrete distributions? Is the usual formula also more convenient there or Will all these formulas be equivalent there for calculating spread? And what if the frequently encountered distribution in practical applications was Ne^(-k*(x-u)^4) ? Do different formulas become convenient for different curves?

12. Nov 12, 2016

### Simon Bridge

Hold up ... not finished yet.

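Once deciding on $c=b/a$ and $s=1/a^2$ (the square of the $s=1/a$ used earlier, so that the spread measure has the same units as a variance) ... we could use an understanding of how to extract those parameters from an arbitrary equation.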
To get c out of p(x), it turns out we need to do $c = \int_\infty xp(x)\; dx$ ... which you see from the previous posts is the definition of the mean.
So for the gaussian, the intuitive estimate for the central value is exactly the mean.

To get s out, you have to do: $s = 2\int_{\infty} (x-\mu)^2p(x)\; dx$ ... from prev posts the RHS is twice the definition of the variance.
So $s=2\sigma^2$ ... now do you see how mean and standard deviation come about?
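
A sketch of the working behind those two statements, using $p(x)=\frac{a}{\sqrt{\pi}}e^{-(ax-b)^2}$ from the earlier post and the substitution $u=ax-b$ (so $dx=du/a$). The standard integrals $\int_{-\infty}^{\infty}e^{-u^2}du=\sqrt{\pi}$ and $\int_{-\infty}^{\infty}u^2e^{-u^2}du=\sqrt{\pi}/2$ are assumed.

$$
\mu=\int_{-\infty}^{\infty}x\,\frac{a}{\sqrt{\pi}}e^{-(ax-b)^2}\,dx
   =\frac{1}{\sqrt{\pi}}\int_{-\infty}^{\infty}\frac{u+b}{a}\,e^{-u^2}\,du
   =\frac{b}{a}
$$

$$
\sigma^2=\int_{-\infty}^{\infty}\left(x-\frac{b}{a}\right)^2\frac{a}{\sqrt{\pi}}e^{-(ax-b)^2}\,dx
        =\frac{1}{a^2\sqrt{\pi}}\int_{-\infty}^{\infty}u^2e^{-u^2}\,du
        =\frac{1}{2a^2}
$$

So $2\sigma^2 = 1/a^2$, which is the relation quoted above.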

Since the normal distribution is so very very common, this way of estimating central values and spread is very useful a lot of the time.
But it is not always useful ... which you will learn when you get to the weirder distributions. Right now it looks like you are just starting out on your journey and you are asking about stuff from much farther on. The best I can do is the picture postcard version.

Wrong. We don't use the parameters just to describe the distribution, we use them to do things.

Example:
When you do an experiment to measure the value of something, you do not get a single exact value (in fact, that would be suspicious). Instead you get a distribution of values which has a mean and a standard deviation (usually quoted as a value and an uncertainty). You will want to compare that value with what other researchers get for the same measurement, but using different methods. So they get different distributions of values.
You want to be able to say if your measurement agrees with the other researcher's measurements.
You need to use the spread to decide that. It is probably easiest to see by comparing box plots.

The short answer is "yes" (often) - the reason the normal distribution crops up so much is because when you combine lots of different things that are each distributed randomly, you end up with the result being distributed, well, normally. A normal distribution.

An example of a non continuous distribution would be rolling some dice and adding them up.
One 6-sided die has a flat distribution so the mean and standard deviation don't work well for describing it
... but what if you roll 3x 6-sided dice and add them up?
Draw a histogram of the frequency a number appears vs the number and see what shape you get.
What if you use 100 dice? What if the dice are all different shapes and sizes?
Now google for "mean value theorem".
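
If you have no dice to hand, the histogram can be computed exactly instead. The following is a minimal sketch with made-up function names: it builds the distribution of the total of n fair dice by repeated convolution and measures how far it is from a bell curve with the same mean and variance. Comparing probabilities with a density is reasonable here because the totals are spaced one unit apart.

```python
import math

def dice_sum_pmf(n_dice, sides=6):
    """Exact probabilities for the total of n_dice fair dice, by repeated convolution."""
    pmf = {0: 1.0}
    for _ in range(n_dice):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, sides + 1):
                nxt[total + face] = nxt.get(total + face, 0.0) + p / sides
        pmf = nxt
    return pmf

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def max_gap(n_dice):
    """Largest difference between the exact histogram and the matching bell curve."""
    pmf = dice_sum_pmf(n_dice)
    mean = sum(t * p for t, p in pmf.items())
    var = sum((t - mean) ** 2 * p for t, p in pmf.items())
    return max(abs(p - normal_pdf(t, mean, var)) for t, p in pmf.items())

for n in (1, 3, 10):
    print(n, round(max_gap(n), 4))
```

Even three dice are already much closer to the bell shape than the flat single-die case.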

That is speculation - the reality is that it is not frequently encountered. But lets say you happened to find one and you wanted to describe it:
You'd probably intuitively want to use s=1/k for a spread estimate ... but IRL we would express 1/k in terms of the defined variance since that is a standard.
You notice I keep calling "s" a spread estimate ... ie it is not to be considered an exact value or a specific property of something in real life.
It also means that any method that gets sort-of the same thing also counts as well as a spread estimate.
So, usually, the variance works just fine anyway.

Go back to the examples for central value and spread ... think about why you'd want to pick different ones.
Just for central value - there are lots of ways. How do you decide which way to choose just for that one?
For discrete distribution, which you seem to be more familiar with, you know that the average is not always the best approach for finding a central value, you are often better to use the median instead (and the interquartile range is better as an estimate of spread). Unless you haven't got as far as box and whisker plots yet?

So the short answer is "yes" you use the estimate that is best for the situation and what you want it to tell you.
It happens to be (see "mean value theorem") that the gaussian form is extremely common.

In physics though there are other very common distributions ... ie the laplace distribution.
That one has form $p(x) = \frac{1}{2b}e^{-|x-\mu|/b}$ ... here the parameter "b" (called the diversity) is a handy measure of spread ...

We encounter the Laplace distribution in any particle detector ... the detected energy of the particle is just about always distributed like that and the diversity is a characteristic of the detector.
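
For reference (this working is not part of the original post), the two spread measures discussed in this thread come out as follows for the Laplace form above, using $u=x-\mu$ and the standard integrals $\int_0^\infty u\,e^{-u/b}\,du=b^2$ and $\int_0^\infty u^2e^{-u/b}\,du=2b^3$:

$$
\int_{-\infty}^{\infty}|x-\mu|\,\frac{1}{2b}e^{-|x-\mu|/b}\,dx=\frac{1}{b}\int_0^\infty u\,e^{-u/b}\,du=b
$$

$$
\sigma^2=\int_{-\infty}^{\infty}(x-\mu)^2\,\frac{1}{2b}e^{-|x-\mu|/b}\,dx=\frac{1}{b}\int_0^\infty u^2e^{-u/b}\,du=2b^2
$$

So for this distribution the mean absolute deviation is exactly $b$, and the standard deviation is $b\sqrt{2}$.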

Another example, occurring whenever something is being counted over time, is the Poisson distribution ... for large numbers it is well approximated by a normal distribution but for small numbers it is not - so other things need to be done to deal with them.

Those "other things" are college level, will be covered in your course in due time.
Judging from your questions, you seem to be starting secondary school statistics and have just covered the concept of discrete samples.
Right now they are just teaching you definitions - this part of the course is aimed at people who won't continue with the subject.

13. Nov 12, 2016

### Prem1998

Yeah, you're right. But if everyone is using my method to get the value and the uncertainty range then it should be fine, right?

Are you saying that my modulus method is the more convenient one for Laplace distribution?
And, I know about the mean value theorem. It only says that if the values of a function continuous in an interval is the same for the end points of the interval, then its derivative becomes zero at at least one point in the interval. What does it have to do with this?
And, I have no dice right now to plot the graphs but are you suggesting that it will tend to become a bell shaped normal distribution as the no. of dice increases? I guessed it because we would get an average sum quite more frequently than the lower and higher sums, so there would be a maximum in the middle. So, I think it means that the mean and standard deviation become a better way to express the distribution as a discrete distribution tends toward a continuous distribution.
EDIT: Oh, I think I figured out what it has to do with mean value theorem. So, you're saying that as the distribution becomes more and more continuous, then the mean value theorem becomes more applicable to it. Since the probabilities of getting the lowest and highest values are the same and are the lowest, so the slope of the graph becomes zero somewhere in the middle and it attains a maximum there, so it would form a normal distribution.
EDIT2: But the observations in this dice example are based on probabilities. Here probability of getting the average sum is highest because a comparatively higher no. of outcomes result in the average sum.
But, some distributions are completely discrete. For example, the observations recorded in a survey or a sport-person's career graph. These may consist of ups and downs and even if we try to make it somewhat continuous by recording higher no. of observations, then there is no chance that we would get a bell shaped curve with a maximum in the middle.

Last edited: Nov 12, 2016
14. Nov 12, 2016

### Simon Bridge

As long as they account for what it actually does. The power-four is pretty flat across the central value compared with a quadratic, so comparisons will involve different intervals. I'm not bothering to crunch the numbers right now because there's nothing in it for me to do so.
You should give it a go though - use the quartic definition of the spread function on the gaussian, do the integration, look at the relationship between what you end up with and the usual standard deviation. That will show you more than I can just by typing stuff out here.
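
A sketch of where that exercise leads, for anyone who wants to check their integration. It relies on the standard result that the fourth central moment of a normal distribution is $3\sigma^4$, so the quartic measure is $3^{1/4}\sigma$. The function and variable names are made up for illustration.

```python
import math

def normal_coverage(z):
    """Fraction of a normal distribution lying within z standard deviations of the mean."""
    return math.erf(z / math.sqrt(2))

# Standard result used here: E[(x - mu)^4] = 3 * sigma**4 for a normal distribution,
# so (mean of fourth powers)^(1/4) equals 3**(1/4) * sigma.
quartic_over_sigma = 3 ** 0.25

print(quartic_over_sigma)                        # quartic spread in units of sigma
print(normal_coverage(quartic_over_sigma))       # fraction within one quartic spread
print(normal_coverage(3 * quartic_over_sigma))   # fraction within three quartic spreads
print(normal_coverage(1), normal_coverage(3))    # the usual one- and three-sigma figures
```

The quartic measure is a fixed multiple of the standard deviation for a gaussian, so it covers a wider interval and the familiar percentages change accordingly.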

No. Though it lends itself better there than it does in the gaussian case. It's still trouble in terms of the work you have to do to use it. A better, and more intuitive, measure is the diversity parameter (b) - which I believe I said in so many words.

Sorry about that - I meant "central limit theorem".

Yes.
It does not matter what you use as a random number generator - use coins H=1, T=0 ... you have those right?
The experiment is done at year 1 college level (because we can force students to spend 2 hours throwing dice) and the dice in question are just unevenly cut bits of wood with numbered sides (not even trying to be cubes). There are lots of ways to generate random numbers - there doesn't even need to be an even chance for any number to appear for it to work.

The example of rolling dice and adding them up results in a completely discrete distribution.
Sure - it's never continuous. In fact we will never collect data that results in a completely continuous distribution because we have no way to measure to infinite accuracy. It does not matter; I am saying that as we include more random variables, the data we collect will be better represented by a gaussian. The data points will lie on a gaussian curve.

This is how you do statistics - one or two data points do not tell you the distribution. The distribution is revealed gradually as more data is collected.

How you do sampling, the relationship between continuous distributions and the actual data, and the ontological status of the theoretical distribution functions... all this is way beyond the scope of these forums. Please understand, there is a reason it takes 4 years study (I could maybe get it down to a year if you are prepared to work really hard) to get to the full answers to the questions you are asking. All I can do here is help you get a feeling for how these things come about. For more detail on this, I am going to have to refer you to a college statistics course. (It's not that I can't tell you, it's just that I'd end up having to give you the course and someone else is being paid to do that.)

In the end - you have to realise that a lot of very smart mathematicians have, over centuries, wrestled with the same issues you have brought up ... you are being taught the end result of this.

The bottom line is that you use the spread function you feel most comfortable with which also best suits the purpose you want to put it to.
Best practise is to advise people to use the standard deviation or variance in the absence of a good reason not to.

15. Nov 12, 2016

### Simon Bridge

Aside: you can sort-of turn a discrete distribution into a continuous one using the Dirac delta function $\delta(x)$...
It has the property that $\int_a^b f(x)\delta (x)\;dx = f(0)$ on any interval that includes x=0, but zero otherwise.

So for the discrete random variable X that was defined earlier in the thread ... $p(x) = \frac{1}{N}\sum_n \delta(x-x_n)$
So the previous definitions of probability etc in terms of integrals can now be applied to X.
$P(a<x<b) = \int_a^b p(x)\; dx$, $\mu = \int_\infty xp(x)\; dx$, $\sigma^2 = \int_\infty (x-\mu)^2p(x)\; dx$
We also want to ask about the probability of an individual result ... that would be:
$P(x=a)=\lim_{\epsilon\to 0}P(a-\epsilon < x < a+\epsilon)$ ...

If we make $x_n < x_{n+1}$ so $\{x_n\}$ is an ordered set with no repeats ... then we have to write:
$p(x)=\sum_n p_n \delta(x-x_n)$ where you can see from above that $p_n = P(x=x_n)$.
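
As a quick check (a sketch, not part of the original post), applying the sifting property to the $\frac{1}{N}\sum_n\delta(x-x_n)$ form above recovers the familiar discrete formulas:

$$
\mu=\int_{-\infty}^{\infty}x\,\frac{1}{N}\sum_n\delta(x-x_n)\,dx=\frac{1}{N}\sum_n x_n
$$

$$
\sigma^2=\int_{-\infty}^{\infty}(x-\mu)^2\,\frac{1}{N}\sum_n\delta(x-x_n)\,dx=\frac{1}{N}\sum_n (x_n-\mu)^2
$$

With the weighted form, the same steps give $\mu=\sum_n p_n x_n$ and $\sigma^2=\sum_n p_n(x_n-\mu)^2$.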

When the discrete distribution is well modelled by the normal distribution - what we are talking about is a probability density function like this:
$p(x)= \frac{A}{\sqrt{2\pi \sigma^2}}\sum_n \delta(x-x_n)e^{-(x-\mu)^2/2\sigma^2}$
(... where A is needed so that $\int_\infty p(x)\; dx = 1$)

Notice that it is not continuous ... this sort of thing is what we mean when we talk about gaussian statistics.
(It's actually a lot tidier than what we'd get IRL - to model the real life function we need to add a "noise function", and then we need to model the statistics of the noise etc.)

Are you getting an idea for how big this subject is yet?

16. Nov 12, 2016

### Stephen Tashi

I think you don't yet appreciate the conceptual complexity of statistical scenarios. The typical scenario involves more than one probability distribution. Suppose you have a random variable X whose distribution has "ups and downs". The distribution of X is not the same as the distribution of data that is computed from taking samples of X. An example of typical sampling data would be taking 100 independent measurements of X and recording the average of the measurements. If Y is the average of the measurements, Y has a probability distribution, but it is not the same as the distribution of X.

Even if the distribution of X is not bell shaped, the distribution of sample averages Y tends to be bell shaped if Y consists of averaging a large number of independent outcomes of X.

For example if X is the random variable describing the roll of a die, X only has the possible outcomes 1,2,3,4,5,6. But if Y is the average of 5 rolls of a die, a possible outcome for Y is (1 + 1 + 2 + 3 + 6)/5 = 2.6 . So the distribution of Y has outcomes that are not possible outcomes of the distribution of X.
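
The die example is small enough to enumerate exactly. Below is a minimal sketch with illustrative variable names that lists all $6^5$ sequences of five rolls, builds the distribution of the average Y, and compares it with the distribution of a single roll X.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)
n_rolls = 5

# Count how often each average occurs over all equally likely sequences of rolls.
counts = Counter(Fraction(sum(rolls), n_rolls) for rolls in product(faces, repeat=n_rolls))
total = 6 ** n_rolls
dist_y = {y: Fraction(c, total) for y, c in counts.items()}
dist_x = {Fraction(f): Fraction(1, 6) for f in faces}

def variance(dist):
    mu = sum(v * p for v, p in dist.items())
    return sum((v - mu) ** 2 * p for v, p in dist.items())

print(Fraction(13, 5) in dist_y)   # 2.6 is a possible average
print(Fraction(13, 5) in dist_x)   # but not a possible single roll
print(len(dist_x), len(dist_y))    # number of distinct outcomes of X and of Y
print(variance(dist_x), variance(dist_y))
```

Y takes many more distinct values than X, and its variance is the variance of X divided by the number of rolls averaged.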