1. Jan 24, 2017

### KFC

Hi all,
I learned some statistics some time ago. It has been a while, but I still remember some properties of the normal distribution. One of them is that the standard deviation can be used to estimate the probability of finding a value around the mean: the probability of falling in the range $[-n\sigma, +n\sigma]$ around the mean with $n=1, 2, 3$ is about 68%, 95% and 99.7%. So for a normal distribution, the standard deviation is used to estimate the dispersion of the data. But today I got some data from somewhere. I plotted it in MATLAB, and the distribution is not symmetric: one side has a very steep edge and the other side is a long tail. As I understand it, the standard deviation (SD) is just a calculation, so it is not a feature of the normal distribution only.

So can I say that the SD of an arbitrary distribution still has significance as an estimate of the dispersion of the data?

For a non-symmetric distribution where one side is a long tail, what is the significance of the SD? In my data, the mean is about x=40 (a sharp peak appears at 55 when I plot it as a histogram) and the SD is about 52, but to the left of the mean the distribution drops to zero at x=29, while on the right it decreases gradually to zero at x = 300. One thing I know is that x must be a positive number. So I can't say that the probability of finding the data in the range [40-52=-12, 40+52=92] is 68%. This confuses me.

2. Jan 24, 2017

### Stephen Tashi

You have to distinguish between the population parameters (mean and standard deviation) and the parameters computed from a sample of data (sample mean and sample standard deviation). What you say is true if you know the population mean and population standard deviation. However, you seem to be talking about the sample mean and sample standard deviation. For large samples, people often assume the sample mean and sample standard deviation are accurate approximations of the population mean and population standard deviation. Perhaps that is what you are doing.

You are correct that there are customary formulas for computing the sample standard deviation that are applied regardless of whether the population from which the sample is taken is a normal distribution or some other distribution.

The SD for an arbitrary distribution may not be sufficient to make statements about the probability that a random sample will fall within a certain interval. The normal distribution has the special property that it can be completely determined by two parameters, the mean and the standard deviation. Other distributions require several parameters to completely determine them.

If you select a specific family of distributions and assume you can estimate their parameters from the sample data, then you can accomplish the goal of computing the probability that a random sample falls in a particular interval. However, you can't rely on the .68, .95, .997 numbers that are used with the normal distribution. You'd have to use numbers that apply to the particular family of distributions that you have selected.
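As a rough illustration of that last point, the sketch below compares the probability of falling within $n$ standard deviations of the mean for a normal distribution and for an exponential distribution. The exponential family is only an assumed stand-in for "a skewed distribution of a positive quantity"; nothing in the thread says the data actually follow it, and the function names are made up for this example.

```python
import math
from statistics import NormalDist

def normal_coverage(n):
    """P(|X - mu| < n*sigma) for any normal distribution."""
    z = NormalDist()  # standard normal; the result does not depend on mu, sigma
    return z.cdf(n) - z.cdf(-n)

def exponential_coverage(n, rate=1.0):
    """P(|X - mu| < n*sigma) for an exponential distribution.

    Assumed example family: for Exp(rate), mu = sigma = 1/rate.
    """
    mu = sigma = 1.0 / rate
    lo = max(mu - n * sigma, 0.0)  # the variable cannot be negative
    hi = mu + n * sigma
    cdf = lambda x: 1.0 - math.exp(-rate * x)
    return cdf(hi) - cdf(lo)

for n in (1, 2, 3):
    print(n, round(normal_coverage(n), 4), round(exponential_coverage(n), 4))
```

For the exponential case the one-SD interval already holds about 86% of the probability rather than 68%, because the lower end of the interval is cut off at zero while the long tail sits on the other side.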

3. Jan 24, 2017

### KFC

Thank you so much for your explanation. I think I get some of the points. In your first comment, when you say population mean and population SD, do you mean the real mean and SD that would be computed from every single entity in the sample space? In practice it is not possible to sample every entity, only a finite group of data, so we are actually calculating the mean and SD for that group only. However, if the group is large enough, the sample mean/SD should be pretty close to the population mean/SD. Is that what you mean?

OK, then I have another question. Let's say I make 100 groups, and all the data in each group are sampled independently from the population. I calculate 100 separate means and SDs, and I find that all those 100 means and 100 SDs are very close to each other. Can I conclude that they are good enough to reflect the population mean and SD? It confuses me, because how do we know that 100 groups is enough? For example, I try 100 groups and calculate 100 mean values, and they are about the same. But I could try 1000 groups, and those 1000 means may not all be close to each other; some may be far off from the others. So what is the criterion for choosing how many groups of samples are enough, and what size should each group be to give a good approximation?

Second, if I get a bunch of samples, is there any way, other than plotting the numbers, to tell whether they are normally distributed or not?

Sorry for the long question. I appreciate again your help.

4. Jan 25, 2017

### Stephen Tashi

Yes.
In practical problems, we often assume they are close. There is no mathematical proof that they are always close.

What you must understand about scenarios that involve probability is that there can be no absolute guarantees. The assumption that an event has a probability (different than 1.0) contradicts any attempts to give absolute guarantees about what the result will be or what will happen if the event is repeated many times. If we put our minds to it, we could compute the probability that the mean of the population is within such-and-such of the estimate produced by each of the procedures you described above. But we cannot give an absolute guarantee that this degree of accuracy will be accomplished. The mathematics of probability lets us compute "the probability of" things. In practical applications, people assume that if the probability of something is very high then it will actually happen. That assumption is not based on probability theory. It's based on the feeling people have that they will enjoy at least average luck.
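To make "the probability that the estimate is within such-and-such of the population mean" concrete, here is a minimal simulation sketch. It assumes a population that is known in advance (an exponential distribution with mean 1, chosen only for illustration), which is exactly what one does not have with real data; the function name, tolerance and trial count are likewise hypothetical.

```python
import random
import statistics

def fraction_within(sample_size, tolerance, trials=2000, seed=0):
    """Estimate P(|sample mean - population mean| <= tolerance) by simulation.

    Assumption: the population is exponential with rate 1, so its true
    mean is exactly 1.0. Real data would not come with this knowledge.
    """
    rng = random.Random(seed)
    true_mean = 1.0
    hits = 0
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        if abs(statistics.fmean(sample) - true_mean) <= tolerance:
            hits += 1
    return hits / trials

# Larger groups make a close estimate more probable, but never certain.
for n in (10, 100, 1000):
    print(n, fraction_within(n, tolerance=0.1))
```

The output is a probability for each group size, not a guarantee: even for the largest group, a rare sample can still land outside the tolerance.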

5. Jan 29, 2017

### FactChecker

If you have enough data, you can group your data into "bins", each covering a range of values, and apply a Chi-squared Goodness of Fit Test. Compare the number of data points in each bin with what you would expect from a Normal distribution.
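A minimal sketch of that procedure follows, using only the Python standard library. Several choices here are assumptions made for the example rather than part of the suggestion above: the bins are chosen to have equal expected probability under the fitted normal, the mean and SD are estimated from the same data (so two degrees of freedom are subtracted), and the test data are synthetic.

```python
import random
from statistics import NormalDist, fmean, stdev

def chi_squared_normal_gof(data, n_bins=10):
    """Return (chi-squared statistic, degrees of freedom) for a normal fit."""
    fitted = NormalDist(fmean(data), stdev(data))
    observed = [0] * n_bins
    for x in data:
        # Bins are equal-probability slices of the fitted normal's CDF.
        idx = min(int(fitted.cdf(x) * n_bins), n_bins - 1)
        observed[idx] += 1
    expected = len(data) / n_bins  # same expected count in every bin
    stat = sum((o - expected) ** 2 / expected for o in observed)
    dof = n_bins - 1 - 2  # two parameters (mean, SD) estimated from the data
    return stat, dof

rng = random.Random(1)
normal_data = [rng.gauss(40, 5) for _ in range(2000)]          # synthetic
skewed_data = [rng.expovariate(1 / 40) for _ in range(2000)]   # synthetic

CRITICAL_5PCT_DOF7 = 14.067  # chi-squared 95th percentile for 7 dof (table value)
for name, data in (("normal", normal_data), ("skewed", skewed_data)):
    stat, dof = chi_squared_normal_gof(data)
    print(name, round(stat, 1), dof, "reject" if stat > CRITICAL_5PCT_DOF7 else "no evidence against")
```

A statistic above the critical value suggests the data are not consistent with a normal distribution; a statistic below it only means the test found no evidence against normality.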

Regarding the use of standard deviation of a highly asymmetric distribution: I recommend that you specify an asymmetric interval around the mean with equal probability on each side of the mean. I think that would be better than using the standard deviation.
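One way to read that recommendation is sketched below: pick the lower and upper limits from the empirical quantiles of the data so that the same fraction of points lies between each limit and the mean. The 34% per side is an assumed choice that mimics the one-SD interval of a normal distribution, and the helper name and synthetic data are hypothetical.

```python
import random
from statistics import fmean

def asymmetric_interval(data, prob_each_side=0.34):
    """Return (lo, mean, hi) with about prob_each_side of the data on each side.

    If the mean sits too close to one end of the data, the requested
    probability is clamped there and that side will hold less than asked.
    """
    xs = sorted(data)
    n = len(xs)
    mean = fmean(xs)
    p_mean = sum(1 for x in xs if x < mean) / n  # empirical P(X < mean)

    def quantile(p):
        p = min(max(p, 0.0), 1.0)
        return xs[min(int(p * n), n - 1)]

    return quantile(p_mean - prob_each_side), mean, quantile(p_mean + prob_each_side)

rng = random.Random(2)
skewed = [rng.expovariate(1.0) for _ in range(5000)]  # synthetic right-skewed data
lo, mean, hi = asymmetric_interval(skewed)
print(round(lo, 3), round(mean, 3), round(hi, 3))
```

For right-skewed data like this, the upper limit ends up much farther from the mean than the lower limit, which describes the spread more faithfully than a symmetric mean ± SD band.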