I don't understand the standard deviation.

In summary, the standard deviation (SD) and the variance are two commonly used measures for describing a distribution. The mean is a measure of central tendency, while the standard deviation is a measure of dispersion. The standard deviation is the square root of the variance; it is used because it returns the original units and can be interpreted in real-world terms. The variance is built from squared deviations because squares have useful properties, such as the variances of independent quantities adding together, but squaring is not the only possible choice. Other measures, such as using absolute deviations or a different power, could also indicate spread, but the standard deviation is a widely accepted and useful measure.
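To make those definitions concrete, here is a minimal Python sketch (the data values are purely illustrative) showing that the variance comes out in squared units while the standard deviation comes back in the original units:

[code]
import math

# Hypothetical measurements in centimetres (illustrative values only)
data = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7]

n = len(data)
mean = sum(data) / n                               # central tendency, in cm
variance = sum((x - mean) ** 2 for x in data) / n  # population variance, in cm^2
sd = math.sqrt(variance)                           # standard deviation, back in cm

print(mean, variance, sd)
[/code]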
  • #36
Also, the distribution of errors in nature "would" follow a perfect normal distribution, if everything were "truly" random and behaved perfectly according to long-run patterns.

The pedant would disagree with this rash statement.
 
  • #37
The standard deviation is not the expected distance to the mean.

The primary reason the mean and standard deviation have been used together for so long is the primacy of the assumption of normality for data (rightly or wrongly, usually wrongly). IF your data are normally distributed, or you are willing to believe they are, these are the natural choices for measures of location and spread.

If you prefer to work backwards and say "The best measure of location is the one that gives me the smallest measure of variability from that number to my data", then

a) If you measure variability by using the sum of the squares of the residuals, then it turns out that the mean is the measure that gives the minimum dispersion - that is, you end up working with

[tex]
\sum (x-\bar x)^2
[/tex]

b) If you decide to measure variability using the sum of the absolute values of the residuals, then it turns out that the appropriate measure of location (appropriate meaning it gives the lowest value of variability) is the MEDIAN, not the mean. These two go together, but they are not as "efficient" for normally distributed data as the mean and the standard deviation.
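A quick numerical check of both claims, as a sketch only (the sample values are arbitrary): sweep over candidate locations c and see which one minimizes each criterion.

[code]
import numpy as np

data = np.array([1.0, 2.0, 2.5, 4.0, 10.0])   # arbitrary sample with a mild outlier
candidates = np.linspace(0, 12, 12001)        # candidate "measures of location"

ss  = [np.sum((data - c) ** 2)  for c in candidates]  # sum of squared residuals
sad = [np.sum(np.abs(data - c)) for c in candidates]  # sum of absolute residuals

print("minimizer of squared residuals: ", candidates[np.argmin(ss)],  "mean:  ", data.mean())
print("minimizer of absolute residuals:", candidates[np.argmin(sad)], "median:", np.median(data))
[/code]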

Interesting side point: R. A. Fisher and A. Eddington had a similar discussion early in the 20th century. The "dispute" centered on this: IF you assume data are normally distributed, what is the best way to estimate the population standard deviation?

Fisher argued that an appropriate multiple of

[tex]
\sqrt{ \frac 1 n \sum (x-\bar x)^2 }
[/tex]

was the answer, while Eddington argued that a multiple of

[tex]
\frac 1 n \sum |x - \bar x|
[/tex]

was better. It has since been shown that in this limited case (strict assumption of normality) Fisher was correct (his estimate has certain optimum properties as long as normality is assumed).
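As a rough illustration (a simulation sketch, not a proof of any optimality property): draw many small normal samples, rescale each statistic so that it is empirically unbiased for the true sigma, and compare how much the two rescaled estimates fluctuate. Under normality, the RMS-based estimate that Fisher backed fluctuates less.

[code]
import numpy as np

rng = np.random.default_rng(0)
sigma, n, reps = 2.0, 10, 100_000           # true sigma, sample size, simulated samples

x = rng.normal(0.0, sigma, size=(reps, n))
resid = x - x.mean(axis=1, keepdims=True)   # deviations from each sample's mean

rms = np.sqrt(np.mean(resid ** 2, axis=1))  # Fisher's statistic, one value per sample
mad = np.mean(np.abs(resid), axis=1)        # Eddington's statistic, one value per sample

# Rescale each so it is (empirically) unbiased for sigma, then compare spreads.
rms_scaled = rms * sigma / rms.mean()
mad_scaled = mad * sigma / mad.mean()
print("spread of RMS-based estimates:", rms_scaled.std())
print("spread of |.|-based estimates:", mad_scaled.std())
[/code]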
 
  • #38
Is there a link to somewhere which shows Fisher is correct?
 
  • #39
A bit off topic, but Fisher was interested in eugenics.

Also he did not believe smoking caused lung cancer, perhaps he got his analysis of the statistics wrong ;)

http://en.wikipedia.org/wiki/Ronald_Fisher

Fisher was opposed to the conclusions of Richard Doll and A.B. Hill that smoking caused lung cancer. He compared the correlations in their papers to a correlation between the import of apples and the rise of divorce in order to show that correlation does not imply causation.

I have to say that is pretty poor form for someone who is supposed to be an expert statistician.
Perhaps it was because he was using the root mean square method. :tongue::rofl:

"He was legendary in being able to produce mathematical results without setting down the intermediate steps."

Well that does not surprise me!
 
  • #40
The reason Fisher was correct is this: for the problem stated, his estimate - that is, the one he backed - has the property of being Uniformly Minimum Variance Unbiased, or UMVU, for the standard deviation.

"Fisher was opposed to the conclusions of Richard Doll and A.B. Hill that smoking caused lung cancer. He compared the correlations in their papers to a correlation between the import of apples and the rise of divorce in order to show that correlation does not imply causation.
I have to say that is pretty poor form for someone who is supposed to be an expert statistician.
Perhaps it was because he was using the root mean square method. "

Remember that it wasn't until much later that the link between smoking and cancer was generally accepted. Fisher was not alone in this - and nobody has claimed he was omniscient.
 
  • #41
Anecdote about Eddington

Throughout this period Eddington lectured on relativity, and was particularly well known for his ability to explain the concepts in lay as well as scientific terms. He collected many of these lectures into The Mathematical Theory of Relativity in 1923, which Albert Einstein suggested was "the finest presentation of the subject in any language." He was an early advocate of Einstein's general relativity, and an interesting anecdote illustrates his humour and personal intellectual investment: Ludwig Silberstein, a physicist who thought of himself as an expert on relativity, approached Eddington at the Royal Society meeting of 6 November 1919 (where Eddington had defended Einstein's relativity with his Brazil-Principe solar eclipse calculations) with some degree of scepticism, and ruefully charged that Eddington claimed to be one of only three men who actually understood the theory (Silberstein, of course, was including himself and Einstein as the other two). When Eddington refrained from replying, Silberstein insisted he not be "so shy", whereupon Eddington replied, "Oh, no! I was wondering who the third one might be!"

Anyway, interesting reading about these two, as I had never heard of either before.
 
  • #42
statdad said:
The reason Fisher was correct is this: for the problem stated, his estimate - that is, the one he backed - has the property of being Uniformly Minimum Variance Unbiased, or UMVU, for the standard deviation.

"Fisher was opposed to the conclusions of Richard Doll and A.B. Hill that smoking caused lung cancer. He compared the correlations in their papers to a correlation between the import of apples and the rise of divorce in order to show that correlation does not imply causation.
I have to say that is pretty poor form for someone who is supposed to be an expert statistician.
Perhaps it was because he was using the root mean square method. "

Remember that it wasn't until much later that the link between smoking and cancer was generally accepted. Fisher was not alone in this - and nobody has claimed he was omniscient.

Well I am unfamiliar with the term UMVU so I can't comment on that now.

Perhaps the reason the link was not accepted was the work of people such as Fisher, who incidentally was employed by the tobacco firms as a consultant, so he had a significant conflict of interest. That could perhaps be used as an excuse for his failure to see the correlation; the alternative is being seen as a poor statistician!
He also, I think, would be seen as racist these days.
 
  • #43
Phizo is right to continue with this question - so far nobody here has meaningfully explained why the standard deviation is somehow "best" or "most natural" as an approach to a measure of spread for data. Having useful mathematical properties, or neat interpretations in some other context, is not unique to the standard deviation, so "best" or "natural" or "the appropriate choice" is not inherent in such observations. A mathematical expression having certain optimal properties is not necessarily an explanation either, in the absence of any clear proof of uniqueness.

Also, so far nobody has mentioned the very important data-aspect of the degrees of freedom and associated denominator of (n-1) in the formulas for the sample-level variance and standard deviation -- as opposed to the denominator of (n) in the corresponding population-level formulas.

Phizo, you are asking a very good question here, and it is good of you to persist! The fact is that, in practice, standard deviation is not essentially or necessarily "best" or most natural as a measure of spread. If statistics as an applied science were to be reborn anew tomorrow - alongside all of our current widely and readily available computing technology - it is certainly possible that the standard deviation formula as we see & use it today, would not be the most popular or default choice for a measure of spread.
 
  • #44
G-U-E-S-T said:
Phizo is right to continue with this question - so far nobody here has meaningfully explained why the standard deviation is somehow "best" or "most natural" as an approach to a measure of spread for data. Having useful mathematical properties, or neat interpretations in some other context, is not unique to the standard deviation, so "best" or "natural" or "the appropriate choice" is not inherent in such observations. A mathematical expression having certain optimal properties is not necessarily an explanation either, in the absence of any clear proof of uniqueness.
In this context "best" has a certain statistical meaning. IF you assume your data comes from the normal distribution, then the best estimates of mean, variance, and standard deviation are the ones being discussed. If you make different assumptions, you get different answers.

Also, so far nobody has mentioned the very important data-aspect of the degrees of freedom and associated denominator of (n-1) in the formulas for the sample-level variance and standard deviation -- as opposed to the denominator of (n) in the corresponding population-level formulas.
The factor in the sample variance is chosen to be [itex] \frac 1 {n-1} [/itex] in order to make the statistic unbiased - so that its expectation equals the population variance.
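A small simulation sketch of that unbiasedness claim (the sample size and repetition count are chosen arbitrarily): averaging the two versions of the sample variance over many samples, the 1/(n-1) version centres on the population variance while the 1/n version underestimates it.

[code]
import numpy as np

rng = np.random.default_rng(1)
sigma, n, reps = 2.0, 5, 200_000            # population variance is sigma^2 = 4

x = rng.normal(0.0, sigma, size=(reps, n))
var_n   = x.var(axis=1)                     # divides by n
var_nm1 = x.var(axis=1, ddof=1)             # divides by n - 1 (Bessel's correction)

print("average of 1/n version:    ", var_n.mean())    # about 3.2 = (n-1)/n * 4
print("average of 1/(n-1) version:", var_nm1.mean())  # about 4.0
[/code]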
Phizo, you are asking a very good question here, and it is good of you to persist! The fact is that, in practice, standard deviation is not essentially or necessarily "best" or most natural as a measure of spread. If statistics as an applied science were to be reborn anew tomorrow - alongside all of our current widely and readily available computing technology - it is certainly possible that the standard deviation formula as we see & use it today, would not be the most popular or default choice for a measure of spread.

Possibly - there are many other methods for measuring variability now. However,
a) It would still be the case that the same quantities would be found as "most natural" to use when people assume normality
b) It would probably be the (unfortunate) case that the normal distribution would rise to prominence as the most used (and so, mis-used) distributional assumption
c) It would be the case that non-parametric and robust measures would be adopted more readily than they have been (even though their use is becoming more common), as a consequence of the widely available computing power
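As one illustration of such a robust alternative (a sketch only, with made-up numbers): the median absolute deviation barely moves when a single wild value is appended to the data, while the standard deviation changes dramatically.

[code]
import numpy as np

clean    = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0])
with_out = np.append(clean, 45.0)            # the same data plus one wild observation

def mad(x):
    # median absolute deviation about the median (unscaled)
    return np.median(np.abs(x - np.median(x)))

print("SD :", clean.std(ddof=1), "->", with_out.std(ddof=1))
print("MAD:", mad(clean), "->", mad(with_out))
[/code]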
 
  • #45
G-U-E-S-T said:
Phizo is right to continue with this question - so far nobody here has meaningfully explained why the standard deviation is somehow "best" or "most natural" as an approach to a measure of spread for data. Having useful mathematical properties, or neat interpretations in some other context, is not unique to the standard deviation, so "best" or "natural" or "the appropriate choice" is not inherent in such observations. A mathematical expression having certain optimal properties is not necessarily an explanation either, in the absence of any clear proof of uniqueness.

Also, so far nobody has mentioned the very important data-aspect of the degrees of freedom and associated denominator of (n-1) in the formulas for the sample-level variance and standard deviation -- as opposed to the denominator of (n) in the corresponding population-level formulas.

Phizo, you are asking a very good question here, and it is good of you to persist! The fact is that, in practice, standard deviation is not essentially or necessarily "best" or most natural as a measure of spread. If statistics as an applied science were to be reborn anew tomorrow - alongside all of our current widely and readily available computing technology - it is certainly possible that the standard deviation formula as we see & use it today, would not be the most popular or default choice for a measure of spread.

Well, as I said initially, I just do not really see where it comes from, and most of the answers I get seem to be based on some other dubious and unexplained concept.

I mean, measuring spread is a somewhat vague concept anyway; it seems to be a process of measuring the unmeasurable. For example, I think they use it in opinion polls, and those are pretty much pot luck in that you hope to pick a representative sample.
 
  • #46
phizo said:
That also seems a bit of a circular argument.
You need to explain why this is so first I think.

I need to explain why the mean and standard deviation define a normal distribution?!?
 
  • #47
Mark44 is correct - I will be a little less diplomatic: try actually studying and learning about the material BEFORE you write all of it off. It will take some work (unlike referring to the world's largest repository of unreliable material, Wikipedia).
 
  • #48
CRGreathouse said:
I need to explain why the mean and standard deviation define a normal distribution?!?

Yes please. You seem to indicate it is a simple answer, so why not just explain it?
 
  • #49
Mark44 said:
We have been laboring away, trying to explain to you that it was not plucked out of thin air. Unfortunately, your response to most of the explanations seems to be that they involve mathematics that you don't understand, or that an article is too dense with links to too many other sites, or that a sentence that seems crystal clear to me is "circular reasoning."


Just because you have never seen the need to work with variance or standard deviation doesn't mean that these statistics are unneeded. To use your example of a can of beans, manufacturers and food processors are very interested in making sure that the variability of what goes into a can or package is tightly controlled. If they put more beans in the can than the advertised weight, they are losing money. If they put in too few beans, they can be liable to lawsuits for failing to deliver the advertised amount. You had better bet that they are keeping track of the standard deviation here.


It's not the mathematics I don't understand but the language used to hide the mathematics.
 
  • #50
phizo said:
I have had a look at that article and it is pretty 'dense' and drags me all over the place via links, so it's hard work.

Anyway, it's going a bit too much into the maths of it; it does not seem to get at the root of the problem.
I had hoped there would be a simple explanation. It is not looking like I will get an answer I am happy with here; I will probably have to work out an answer myself before I will be happy (as is sometimes the case).
As I am being pointed to wiki pages it does not seem like anyone will be able to post an answer here I am happy with.

If you cannot understand the math, and are not willing to do the work necessary to understand it, then there is really no point in continuing this discussion. Seems to me that you are way eager to argue and reluctant to put any effort into learning.

I am not interested in reading any more of your arguments. The answers you seek are in this thread. Please read it over a few times. Try opening your mind while making an effort to understand.

Thread locked
 
