# Estimating the standard deviation from sampled data

1. Dec 4, 2012

### f95toli

I am an experimentalist, and in most of my experiments I am interested in measuring the properties of distributions, i.e. the phenomenon I am measuring is stochastic and the parameters I am interested in are (in the simplest case) say the mean value and the width of the distribution (variance or standard deviation).

My "in" data is a time series with n samples sampled at some frequency fs, which is then post-processed in Matlab. I often deal with quite long time series (millions of points) that take hours or days to acquire, and I am therefore interested in understanding how much there really is to gain by, say, doubling the number of acquired points.

My question is a practical one: how many samples do I need in order to estimate the shape of the distribution?
I know that the accuracy by which I can estimate the mean improves as √n, at least if one assumes a normal distribution.
But how quickly does the estimate of the std improve?

Also, can one say something about how many samples one needs to estimate the parameters for other common distributions (Poisson, etc.)?

2. Dec 4, 2012

### Staff: Mentor

If I remember correctly, the relative uncertainty for the variance scales with $\frac{1}{\sqrt{n}}$ (probably with some prefactor close to 1). You can check this with your data. This should be similar for other distributions with a well-defined standard deviation.
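For a normal distribution the prefactor can be pinned down: the standard error of the sample standard deviation is approximately $\sigma/\sqrt{2n}$, i.e. a relative uncertainty of $1/\sqrt{2n}$. A quick simulation sketch of that scaling (in Python rather than Matlab, purely for illustration):

```python
import numpy as np

# Sketch: measure how the spread of the sample standard deviation
# shrinks as the number of samples n grows.
rng = np.random.default_rng(0)
sigma = 1.0      # true standard deviation of the simulated process
reps = 2000      # independent repetitions per sample size

rel_err = {}
for n in (100, 400, 1600):
    # reps independent samples of size n from N(0, sigma^2)
    stds = rng.normal(0.0, sigma, size=(reps, n)).std(axis=1, ddof=1)
    rel_err[n] = stds.std() / sigma
    # normal-theory prediction: relative error ~ 1/sqrt(2n)
    print(n, rel_err[n], 1.0 / np.sqrt(2.0 * n))
```

Quadrupling $n$ should roughly halve the spread, consistent with the $1/\sqrt{n}$ scaling above.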

That is a very vague question.

3. Dec 4, 2012

### f95toli

Sorry, what I meant was: how does the accuracy (or confidence interval, if that is easier) by which I can estimate parameters for various distributions scale with n?
This will of course depend on the type of distribution, so I guess what I am looking for is e.g. a table where this is listed; so far I haven't been able to find one.
The best I have found so far was a very incomplete table (with no proper references) which I for various reasons do not quite trust.

The 1/√n for the variance seems reasonable, but I'd like to have more details (including the prefactors).

4. Dec 4, 2012

### Stephen Tashi

The distribution of what?

More precise terminology may clarify what you want.

To estimate a parameter of a distribution, we use functions of the sample data. These functions are called "estimators". When you say you want to estimate a parameter of a distribution, this does not automatically say what estimator you are using. For example post #12 of the thread https://www.physicsforums.com/showthread.php?t=616643 mentions 3 different estimators for the variance of a normal distribution, each of which is, in some sense, the best one.

To inquire about confidence intervals for a parameter you must give more information than simply stating the parameter. You have to say what estimator you are asking about.

A parameter of a distribution is not necessarily a "moment" of the distribution. Moments (like the variance) are a special set of parameters. For some distributions, various moments don't exist since they would involve evaluating divergent integrals.

The shape of the graph of a distribution doesn't necessarily have a simple relation to the moments of the distribution. If you are actually interested in shape, you should try to define the criteria for judging it. I.e., if you were given the true formula of a distribution and two other formulas that gave imperfect approximations of the shape, what computation would you do to decide which of the imperfect approximations was better?

A confidence interval can have a definite size but it doesn't have a definite center. For example, suppose you have enough samples to be 95% "confident" that an interval of plus or minus 2.3 about the estimated standard deviation of the distribution contains the true value of the standard deviation of the distribution. If the estimate of the standard deviation from your sample is 6.0, you can't claim there is a 95% probability that the population standard deviation is in the interval 6.0 plus or minus 2.3. (If you could make such a claim, then we could speak of a "95% probability interval" instead of having to say "confidence interval". In mathematical statistics, "confidence" is not a synonym for "probability".)
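For the concrete case of independent normal data and the usual sample variance estimator, the interval comes from the fact that $(n-1)s^2/\sigma^2$ has a chi-squared distribution with $n-1$ degrees of freedom. A sketch of that calculation (Python with SciPy, purely for illustration):

```python
import numpy as np
from scipy import stats

def std_confidence_interval(x, confidence=0.95):
    """Confidence interval for the population standard deviation,
    assuming x holds independent samples from a normal distribution.
    Based on (n-1) s^2 / sigma^2 ~ chi-squared(n-1)."""
    n = len(x)
    s2 = np.var(x, ddof=1)
    alpha = 1.0 - confidence
    lo = np.sqrt((n - 1) * s2 / stats.chi2.ppf(1.0 - alpha / 2.0, n - 1))
    hi = np.sqrt((n - 1) * s2 / stats.chi2.ppf(alpha / 2.0, n - 1))
    return lo, hi

rng = np.random.default_rng(1)
x = rng.normal(0.0, 6.0, size=500)   # simulated stand-in for real data
print(std_confidence_interval(x))
```

Note the interval is not symmetric about the point estimate, and for markedly non-normal data it can be badly miscalibrated.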

Given all the above, if you are still interested in moments and confidence, the relevant search terms are "confidence intervals for estimators of moments".

If you have a 95% confidence interval for estimating one parameter (say the mean) and a 95% confidence for estimating another parameter (say the standard deviation), this does not imply that the two intervals together give you 95% confidence for estimating both parameters jointly.
Search for "joint confidence intervals" if that's what you want.

You mention time series. Most results you find about estimators of moments are going to assume independent random samples. If your sample data is processed from a time series by some method that uses overlapping sequences of the raw data, then it isn't plausible that the processed values are independent. For example, if the raw data is x[1],x[2],x[3],... and the processed data is y[1] = f(x[1],x[2],x[3]), y[2] = f(x[2],x[3],x[4]),... then y[1] and y[2] both depend on x[2] and x[3].
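A tiny numerical illustration of that dependence, using 3-point moving averages of white noise (Python, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)   # independent raw samples

# processed data: y[k] = mean of (x[k], x[k+1], x[k+2])
y = (x[:-2] + x[1:-1] + x[2:]) / 3.0

# neighbouring processed values share two raw points, so they correlate
r = np.corrcoef(y[:-1], y[1:])[0, 1]
print(r)   # theory: 2/3 for adjacent overlapping 3-point means
```

Any formula that assumes n independent samples will overstate the information content of such processed data.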

Last edited: Dec 5, 2012
5. Dec 5, 2012

### ImaLooser

Like most people, you seem to be confusing the sample mean and variance with the population mean and variance. The population mean and variance are the parameters you are estimating. The sample mean and variance are random variables, so they vary from sample to sample.

> how many samples do I need in order to estimate the shape of the distribution?

Well, it depends. The simplest way is to make a running graph of the distribution as the data come in. Once the graph stops changing shape, you have enough data.

The sample mean is usually approximately normally distributed for large samples, regardless of the distribution of the population. (The exceptions are some distributions that occasionally give a wild outlier.) I seem to recall that for normal data the sample variance has a (scaled) chi-squared distribution, but I've forgotten the details. That should be easy to find.

You have a time series, so that may change things. Often there is auto-correlation. If so, then you have to handle it differently than a random sample.

6. Dec 6, 2012

### f95toli

I think it is the terminology that is a bit confusing. My fault.
When I wrote "sample" I meant a data point in my time series (the data is sampled at some rate fs).

Also, the "enough data" bit is my problem. At the moment each measurement run takes about 40 hours, and we have a large parameter space to explore, meaning we have to think hard about what and for how long we measure; increasing the number of points by, say, a factor of ten is expensive in terms of both time and money, and is also experimentally very difficult.
This is why I am interested in trying to understand more about how much we actually gain by measuring longer etc.

In case someone is interested: we are actually measuring noise. Over some time-scales white noise dominates, but for longer times (seconds to days) we have $1/f^\alpha$ noise with $\alpha$ in the range 1-2. One thing we are interested in is getting good estimates of $\alpha$, which we can do by fitting plots of the Allan deviation (which is less well known than, but very similar to, the standard deviation plotted as a function of averaging time).

So ultimately we want to produce Allan deviation plots with small error bars, and at the moment I only have some very crude estimates of the size of those and how they scale with n.
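For concreteness, a minimal sketch of the basic non-overlapping Allan deviation we are talking about (in Python rather than Matlab, just to illustrate the definition):

```python
import numpy as np

def allan_deviation(y, fs, m):
    """Non-overlapping Allan deviation of frequency data y sampled
    at rate fs, for averaging time tau = m / fs."""
    tau = m / fs
    n_blocks = len(y) // m
    # average the data in consecutive blocks of m points
    block_means = y[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
    # ADEV^2 = 0.5 * <(ybar_{k+1} - ybar_k)^2>
    d = np.diff(block_means)
    return tau, np.sqrt(0.5 * np.mean(d ** 2))
```

For pure white frequency noise this falls off as $1/\sqrt{\tau}$; a flattening or rise at long $\tau$ is the signature of the $1/f^\alpha$ part. The long-$\tau$ points are built from few blocks, which is exactly why their error bars are large.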

Last edited by a moderator: Dec 6, 2012
7. Dec 6, 2012

### D H

Staff Emeritus
I'm going to give you the same answer I give to people who ask "how many Monte Carlo runs do I need in order to determine whether operation X is 'safe' or not?" (Here, 'safeness' is defined in a statistical sense.)

If I can do a quick back of the envelope calculation and show that the operation is obviously safe even under a worst case combination of errors and random fluctuations, well, it's safe. The answer is zero Monte Carlo runs are needed.

The operation is almost certainly unsafe if a failure occurs during a small number of Monte Carlo trials. I shouldn't see any failures with a small number of trials. The operation is probably safe if this small number of trials indicates that a failure would require a ten standard deviation departure from the mean. On the other hand, we have our work cut out for us if this small number of trials indicates that the yes/no answer is right on the cusp.

Back to your problem, I have some questions:
If you don't know the answer to this question, it is very tough to tell you how much data are needed.

• Can you slice and dice your data, examining just a subset rather than the whole enchilada?
If you can subset the data, resampling techniques such as jackknife and bootstrap will give you an idea of the variance in the parameters you are trying to estimate.

• Do you have any controls that affect the outcome of the experiment?
If this is the case, a whole new set of techniques become available that are a whole lot better than the typical 1/√N awfulness that results from simple statistical analyses. You can use a particle filter (aka sequential Monte Carlo) to move the experiment in a direction that will best improve your estimates.
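The resampling idea above, sketched for a standard-deviation estimate (Python, for illustration; the plain bootstrap assumes independent samples, so for strongly autocorrelated time series a block bootstrap would be needed instead):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 5.0, size=2000)   # stand-in for a data subset

# plain bootstrap: resample with replacement, recompute the estimator
n_boot = 1000
boot_stds = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(x, size=x.size, replace=True)
    boot_stds[i] = np.std(resample, ddof=1)

# the spread of the bootstrap replicates estimates the uncertainty
# of the std estimate, without assuming any particular distribution
print(np.std(x, ddof=1), boot_stds.std())
```

The appeal is that this works for any estimator (ADEV included), at the cost of recomputing it many times.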

8. Dec 10, 2012

### lavinia

Maybe the first thing to do is some data analysis to see how close your data is to normal. If it is close then standard estimates are fine.

If you have no idea what the distribution is, you could try bootstrapping the data, or taking averages of the data to approximate a normal distribution. For time series, averaging is risky since you run the risk of reusing the same data points in different averaged values. That messes everything up.

9. Dec 10, 2012

### f95toli

• That is a good question. In the experiment we are essentially measuring the jitter of a 15 MHz signal (as standard or Allan deviation) for different gate times (up to, say, 1000 s). We do this as a function of a couple of experimental parameters (temperature and power), and as we vary those we can clearly see a dependence. I am not quite sure how to quantify the uncertainty I need. However, if I plot the jitter for t=10 s as a function of e.g. temperature, the resulting curve is very "noisy"; the range of the y axis (=the deviation) is usually in the range 5-10 Hz to maybe 200-300 Hz.

In order to make a reasonably good fit I probably need an estimate of the std (or ADEV, which is more or less the same thing) that is good to within maybe 5 Hz or so.

And before anyone asks, there is no theory for how it "should" behave (this is research, after all); meaning I don't even know which function to fit to. One of the goals of the experiment is to try to figure out the type of behaviour, since this might give us some clues about the underlying phenomena.

It depends on the measurement we are doing. In some measurements we can, and then the data (the frequency sampled at some rate fs, which is usually around 300 Hz) is saved in a file. We usually save 5-10 million points, which is enough to get good statistics for short times (up to 0.1 s or so), but the data above 50-100 s is not great (since it is effectively calculated from fewer points). Acquiring more data takes more time, but it also makes the files difficult to work with (I am already having to use a workstation to handle our 10 million point files; calculating the ADEV takes time).

The same measurement can instead be done at a fixed gate time using a counter which averages, for example, 100 measurements, and then all we get is a single number. Averaging 100 times is clearly not enough, since the curve is too noisy (see above), but increasing the time is difficult: 100 averages at 0.1 s is fine (100*0.1 s = 10 s per point), but 100*10 s starts to become a very long measurement if you need to do it as a function of two variables.

We can change the jitter using temperature and power. But there is as yet no theory for HOW this should affect the outcome.

10. Dec 11, 2012

### Stephen Tashi

It would help the statistical audience to know some things about the specialized statistics you've just mentioned.

I take it that "jitter", "Allan deviation" and "standard deviation" all have exactly the same meaning in this problem. From those three, I'll use "Allan Deviation". The Wikipedia article on Allan variance http://en.wikipedia.org/wiki/Allan_variance gives several estimators for Allan variance. Which estimator are you using? I haven't searched for articles on the behaviors of these estimators yet - has anyone else?

Mathematical modeling can still be useful just to get the statistical techniques right. If you model the situation with a wrong deterministic relation to power and temperature and understand how best to recover that wrong deterministic relation from simulated data, then you get a better understanding of how to recover the correct deterministic relation from actual data. So if you are skilled at writing simulations, don't worry about knowing the right function to fit. Just guess at one. See how well your statistical methods can recover the guess from simulated data that has "noise".
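A minimal version of that exercise (the linear relation and noise level below are arbitrary placeholders, not a claim about the physics):

```python
import numpy as np

rng = np.random.default_rng(5)

# 1. guess *some* deterministic relation (arbitrary placeholder)
true_slope, true_intercept = 2.5, 1.0
temperature = np.linspace(0.0, 10.0, 50)
clean = true_intercept + true_slope * temperature

# 2. add noise at roughly the level seen in the experiment
noisy = clean + rng.normal(0.0, 1.0, size=temperature.size)

# 3. run the same fitting procedure you would use on real data,
#    and check how well it recovers the known inputs
slope, intercept = np.polyfit(temperature, noisy, 1)
print(slope, intercept)
```

If the procedure cannot recover a relation you put in by hand, it certainly cannot be trusted on the real data; repeating this over many noise realizations also gives a direct estimate of the fit's error bars.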

The Allan variance has a very technical definition and several different estimators, so it is hard for a non-specialist to understand the definition of the data that is being sampled. Perhaps the instrument you are using has some estimator hard-coded in its firmware, and only the manufacturer's documentation can explain what a measurement measures. For example, in a 300 Hz sample from a 15 MHz signal, what does one data value taken at time T represent? Is it an estimate of the Allan variance over the "observation period" of 1/300 s? Or is it an estimate for a smaller observation period, centered about T?