Estimating the standard deviation from sampled data 
#1
Dec4-12, 07:25 AM

Sci Advisor
PF Gold
P: 2,245

I am an experimentalist, and in most of my experiments I am interested in measuring the properties of distributions, i.e. the phenomenon I am measuring is stochastic and the parameters I am interested in are (in the simplest case) the mean value and the width of the distribution (variance or standard deviation).
My input data is a time series with n samples, sampled at some frequency fs, which is then post-processed in Matlab. I often deal with quite long time series (millions of points) that take hours or days to acquire, and I am therefore interested in understanding how much there really is to gain by, say, doubling the number of acquired points. My question is a practical one: how many samples do I need in order to estimate the shape of the distribution? I know that the accuracy with which I can estimate the mean improves as √n (i.e. its uncertainty shrinks as 1/√n), at least if one assumes a normal distribution. But how quickly does the estimate of the std improve? Also, can one say something about how many samples one needs to estimate the parameters of other common distributions (Poissonian etc.)?


#2
Dec4-12, 12:58 PM

Mentor
P: 11,819

If I remember correctly, the relative uncertainty for the variance scales with ##\frac{1}{\sqrt{n}}## (probably with some prefactor close to 1). You can check this with your data. This should be similar for other distributions with a well-defined standard deviation.
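One way to check the scaling numerically is a small simulation. The sketch below assumes independent, normally distributed samples (not a real time series), for which the textbook relative standard error of the sample variance is ##\sqrt{2/(n-1)}## and that of the sample standard deviation is approximately ##1/\sqrt{2(n-1)}##. The function names and the choice of n values are illustrative only.

```python
import math
import random


def sample_std(xs):
    """Sample standard deviation with the n-1 denominator."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))


def relative_scatter_of_std(n, trials=1000, sigma=1.0, seed=0):
    """Empirical relative spread of the std estimate over many runs of n iid normal points."""
    rng = random.Random(seed)
    estimates = [
        sample_std([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(trials)
    ]
    return sample_std(estimates) / sigma


for n in (25, 100, 400):
    observed = relative_scatter_of_std(n)
    predicted = 1.0 / math.sqrt(2.0 * (n - 1))  # normal-theory approximation
    print(f"n={n:4d}  observed={observed:.4f}  predicted={predicted:.4f}")
```

Under these assumptions, quadrupling n should roughly halve the scatter of the std estimate. Correlated data (e.g. 1/f noise) will generally converge more slowly than this.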



#3
Dec4-12, 02:10 PM

Sci Advisor
PF Gold
P: 2,245

This will of course depend on the type of distribution, so I guess what I am looking for is e.g. a table where this is listed; so far I haven't been able to find one. The best I have found so far was a very incomplete table (with no proper references) which, for various reasons, I do not quite trust. The 1/√n for the variance seems reasonable, but I'd like to have more details (including the prefactors).


#4
Dec4-12, 11:19 PM

Sci Advisor
P: 3,282

More precise terminology may clarify what you want.
To estimate a parameter of a distribution, we use functions of the sample data. These functions are called "estimators". When you say you want to estimate a parameter of a distribution, this does not automatically say what estimator you are using. For example, post #12 of the thread http://www.physicsforums.com/showthread.php?t=616643 mentions 3 different estimators for the variance of a normal distribution, each of which is, in some sense, the best one. To inquire about confidence intervals for a parameter you must give more information than simply stating the parameter. You have to say what estimator you are asking about.
A parameter of a distribution is not necessarily a "moment" of the distribution. Moments (like the variance) are a special set of parameters. For some distributions, various moments don't exist since they would involve evaluating divergent integrals.
The shape of the graph of a distribution doesn't necessarily have a simple relation to the moments of the distribution. If you are actually interested in shape, you should try to define the criteria for judging it, i.e. if you were given the true formula of a distribution and two other formulas that gave imperfect approximations of the shape, what computation would you do to decide which of the imperfect approximations was better?
A confidence interval can have a definite size but it doesn't have a definite center. For example, suppose you have enough samples to be 95% "confident" that an interval of plus or minus 2.3 about the estimated standard deviation of the distribution contains the true value of the standard deviation of the distribution. If the estimate of the standard deviation from your sample is 6.0, you can't claim there is a 95% probability that the population standard deviation is in the interval 6.0 plus or minus 2.3. (If you could make such a claim, then we could speak of a "95% probability interval" instead of having to say "confidence interval". In mathematical statistics, "confidence" is not a synonym for "probability".)
Given all the above, if you are still interested in moments and confidence, the relevant search terms are "confidence intervals for estimators of moments".
If you have a 95% confidence interval for estimating one parameter (say the mean) and a 95% confidence interval for estimating another parameter (say the standard deviation), this does not imply that the two intervals together give you 95% confidence for estimating both parameters jointly. Search for "joint confidence intervals" if that's what you want.
You mention time series. Most results you find about estimators of moments are going to assume independent random samples. If your sample data is processed from a time series by some method that uses overlapping sequences of the raw data, then it isn't plausible that the processed values are independent. For example, if the raw data is x[1],x[2],x[3],... and the processed data is y[1] = f(x[1],x[2],x[3]), y[2] = f(x[2],x[3],x[4]),... then y[1] and y[2] both have a dependence on x[2] and x[3].
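The dependence introduced by overlapping windows can be seen in a small simulation. This is only a sketch under assumed conditions: the raw data are independent standard normal values, and f is taken to be a 3-point average (both choices are illustrative, not from the original question). For that case the lag-1 correlation of the overlapping averages is 2/3 in theory, while non-overlapping blocks stay uncorrelated.

```python
import random


def lag1_autocorr(ys):
    """Lag-1 sample autocorrelation of a sequence."""
    n = len(ys)
    mean = sum(ys) / n
    num = sum((ys[i] - mean) * (ys[i + 1] - mean) for i in range(n - 1))
    den = sum((y - mean) ** 2 for y in ys)
    return num / den


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # independent raw data (assumed)

# Overlapping windows: y[i] uses x[i], x[i+1], x[i+2], so neighbours share two points.
y_overlap = [(x[i] + x[i + 1] + x[i + 2]) / 3 for i in range(len(x) - 2)]

# Non-overlapping blocks: each raw point is used exactly once.
y_blocks = [(x[i] + x[i + 1] + x[i + 2]) / 3 for i in range(0, len(x) - 2, 3)]

print("raw data        :", round(lag1_autocorr(x), 3))
print("overlapping avg :", round(lag1_autocorr(y_overlap), 3))
print("block avg       :", round(lag1_autocorr(y_blocks), 3))
```

Formulas for confidence intervals that assume independent samples would understate the uncertainty if applied to something like `y_overlap`.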


#5
Dec5-12, 05:32 AM

P: 570

> how many samples do I need in order to estimate the shape of the distribution?
Well, it depends. The simplest way is to make a running graph of the distribution as the data comes in. Once the graph stops changing shape, you have enough data.
For large samples the sample mean is usually approximately normally distributed, regardless of the distribution of the population. (The exceptions are some distributions that occasionally give a wild outlier.) I seem to recall that the variance has a chi-squared distribution, but I've forgotten. That should be easy to find.
You have a time series, so that may change things. Often there is autocorrelation. If so, then you have to handle it differently than a random sample.
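For reference, the textbook result for independent samples from a normal distribution (an assumption; it does not hold for autocorrelated or strongly non-normal data) is that the scaled sample variance follows a chi-squared distribution with n-1 degrees of freedom:

$$\frac{(n-1)\,s^2}{\sigma^2} \sim \chi^2_{n-1}, \qquad \operatorname{Var}(s^2) = \frac{2\sigma^4}{n-1}$$

so the relative standard error of ##s^2## is ##\sqrt{2/(n-1)}##, and a ##1-\alpha## confidence interval for ##\sigma^2## is

$$\left[\frac{(n-1)\,s^2}{\chi^2_{n-1,\,1-\alpha/2}},\ \frac{(n-1)\,s^2}{\chi^2_{n-1,\,\alpha/2}}\right]$$

Here ##s^2## is the sample variance with the n-1 denominator and ##\chi^2_{n-1,\,p}## denotes the p-quantile of the chi-squared distribution with n-1 degrees of freedom.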


#6
Dec6-12, 10:28 AM

Sci Advisor
PF Gold
P: 2,245

When I wrote "sample" I meant a data point in my time series (the data is sampled at some rate fs). Also, the "enough data" bit is my problem. At the moment each measurement run takes about 40 hours, and we have a large parameter space to explore, meaning we have to think hard about what we measure and for how long; increasing the number of points by, say, a factor of ten is expensive in terms of both time and money, and is also experimentally very difficult. This is why I am interested in trying to understand more about how much we actually gain by measuring longer etc.
In case someone is interested: we are actually measuring noise. Over some timescales white noise dominates, but for longer times (seconds to days) we have [itex]1/f^\alpha[/itex] noise with [itex]\alpha[/itex] in the range 1-2. One thing we are interested in is getting good estimates for alpha, which we can get by fitting to plots of the Allan deviation (which is less well known than, but very similar to, the STD plotted as a function of time). So ultimately we want to produce Allan deviation plots with small error bars, and at the moment I only have some very crude estimates of the size of those and how they scale with n.
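For readers unfamiliar with the Allan deviation, here is a minimal sketch of the simple non-overlapping estimator, assuming the time series itself is the quantity being averaged. The data are split into consecutive blocks of m points (averaging time tau = m/fs), and the Allan variance is half the mean squared difference between adjacent block averages. The function name and the white-noise test signal are illustrative assumptions, not our actual analysis code.

```python
import math
import random


def allan_deviation(y, m):
    """Non-overlapping Allan deviation of y at averaging factor m (tau = m / fs)."""
    n_blocks = len(y) // m
    if n_blocks < 2:
        raise ValueError("need at least two blocks of length m")
    # Average over consecutive, non-overlapping blocks of m samples.
    means = [sum(y[i * m:(i + 1) * m]) / m for i in range(n_blocks)]
    # Half the mean squared difference of adjacent block averages.
    diffs_sq = sum((means[i + 1] - means[i]) ** 2 for i in range(n_blocks - 1))
    return math.sqrt(diffs_sq / (2 * (n_blocks - 1)))


rng = random.Random(2)
white = [rng.gauss(0.0, 1.0) for _ in range(100000)]  # synthetic white noise

for m in (1, 10, 100, 1000):
    n_diffs = len(white) // m - 1  # number of differences entering the estimate
    print(f"m={m:5d}  adev={allan_deviation(white, m):.4f}  differences={n_diffs}")
```

For white noise the result should fall roughly as 1/√m. Note that the number of differences available at averaging factor m is only about n/m - 1, so the points at the longest averaging times rest on very few differences, which is why the error bars are largest there.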


#7
Dec6-12, 12:19 PM

Mentor
P: 15,147

That answer is "it depends." If I can do a quick back-of-the-envelope calculation and show that the operation is obviously safe even under a worst-case combination of errors and random fluctuations, well, it's safe. The answer is that zero Monte Carlo runs are needed. The operation is almost certainly unsafe if a failure occurs during a small number of Monte Carlo trials. I shouldn't see any failures with a small number of trials. The operation is probably safe if this small number of trials indicates that a failure would require a ten standard deviation departure from the mean. On the other hand, we have our work cut out for us if this small number of trials indicates that the yes/no answer is right on the cusp. Back to your problem, I have some questions:



#8
Dec10-12, 07:54 AM

Sci Advisor
P: 1,716

If you have no idea what the distribution is, you could try bootstrapping the data or taking averages of the data to approximate a normal distribution. For time series, averaging is risky since you run the risk of reusing the same data points in different averaged points. That messes everything up.
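A minimal sketch of the bootstrap idea for the standard deviation is given below. It uses a plain percentile bootstrap, which resamples individual points with replacement and therefore assumes the points are independent; for autocorrelated time series a block-based variant would be needed instead. The function names and the synthetic data are illustrative assumptions.

```python
import math
import random


def sample_std(xs):
    """Sample standard deviation with the n-1 denominator."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))


def bootstrap_std_interval(data, n_resamples=1000, level=0.95, seed=0):
    """Percentile-bootstrap interval for the standard deviation (assumes iid data)."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the estimator on resamples drawn with replacement.
    estimates = sorted(
        sample_std(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


rng = random.Random(3)
data = [rng.gauss(0.0, 2.0) for _ in range(500)]  # synthetic data, true std = 2
low, high = bootstrap_std_interval(data)
print(f"std estimate = {sample_std(data):.3f}, 95% bootstrap interval = ({low:.3f}, {high:.3f})")
```

Repeating this for different data lengths gives an empirical picture of how the interval width shrinks with n without assuming a particular distribution.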


#9
Dec10-12, 09:09 AM

Sci Advisor
PF Gold
P: 2,245

Sorry for not replying sooner. Had a very busy weekend.



#10
Dec11-12, 03:14 PM

Sci Advisor
P: 3,282

It would help the statistical audience to know some things about the specialized statistics you've just mentioned.


