# Confidence in the means despite high standard deviation?

1. Jul 5, 2013

### Hypatio

I have a one-dimensional data set, shown in the attached image. The data is an intensity z as a function of time t. There are many data points over time with a large amount of scatter. However, if the data is binned over very short wavelengths, we see that the bin means vary smoothly over time. The standard deviation is large because of the scatter in the data, but it is also about the same at all times. In addition, we can create a model of the phenomenon at work and fit the means very well. On the other hand, if we simply change parameters in the model, we can get a model which fits the means poorly but still lies within the standard deviation, because the standard deviation is so large.

What I want to know is what kind of statistical problems are encountered if you want to argue that a model which does not fit the bin means well (even if it lies within the standard deviation) is actually a bad fit. I can think of at least two things, but I don't know how to talk about them as well as I would like:

1) There is clearly a systematic long-wavelength variation which can be fit by a model, but how does that translate to a statistical argument for confidence that the means are more important than the standard deviation?

2) The roughness of the data may simply be noise which is randomly distributed. If the 'roughness' of the noise (e.g. the root-mean-square residual) is about the same magnitude as the standard deviation, would this not prove that the means are more robust than they would seem, given the large standard deviation?

In short, how do you argue about the robustness of the means in data with a large standard deviation?

What do you think?

2. Jul 5, 2013

### Stephen Tashi

Your questions about standard deviation seem to assume that all the "error" is in the luminosity measurement and that there is no error in the time measurement. Is that your assumption?

3. Jul 6, 2013

### Hypatio

Actually, the assumption is that the "error" (standard deviation) lies in a source of non-experimental random noise. I say non-experimental because the noise is real but could be linked to a second-order process. The measurements themselves (luminosity and time) could be treated as exact.

So we have a main process A which generates the large-scale variation, and then a small-scale process B which generates the "noise".

So if I want to constrain the variation due to process A, it is correct to fit to the means only and incorrect to just fit any curve within the standard deviation. But what kind of statistical ideas are involved here? How can you statistically argue that you must fit to the means?

Last edited: Jul 6, 2013
4. Jul 6, 2013

### chiro

So basically, are you asking this: given some assumption about the error residual, how does that residual affect the measured distribution (with respect to the true population), and how should we deal with it when we want to estimate the population distribution (without the noise)?

5. Jul 6, 2013

### Stephen Tashi

The "best fit" is a subjective decision until you have an objective definition of "best". Some people define "best" by expressing faith in some statistic, like the (estimator of the) mean. Some people define a "loss function" or "merit function" and try to minimize the expected loss or maximize the expected merit. As I recall, using the mean value of a population to predict samples from that population minimizes a quadratic loss function. The mean value of a sample is one estimator of the mean value of the population.
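The quadratic-loss remark can be checked numerically. The following is only a minimal sketch on synthetic data: the sample, the candidate grid, and the helper name `quadratic_loss` are all assumptions made for illustration, not anything from the data set under discussion. It scans candidate predictions and confirms that the one with the smallest mean squared error sits at the sample mean, to within the grid spacing.

```python
import random
import statistics

random.seed(0)

# Synthetic sample (an assumption for illustration): mean 5, standard deviation 3.
sample = [random.gauss(5.0, 3.0) for _ in range(200)]


def quadratic_loss(c, data):
    """Mean squared error when every data point is predicted by the constant c."""
    return sum((x - c) ** 2 for x in data) / len(data)


# Scan a grid of candidate predictions spanning the range of the sample.
lo, hi = min(sample), max(sample)
step = (hi - lo) / 2000
candidates = [lo + i * step for i in range(2001)]

# The candidate with the smallest quadratic loss.
best = min(candidates, key=lambda c: quadratic_loss(c, sample))
sample_mean = statistics.fmean(sample)

print("grid minimizer:", best)
print("sample mean:   ", sample_mean)
```

The agreement follows from the identity that the mean squared error about c equals the sample variance plus (c - mean)^2, so any other choice of c only adds loss.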

If you are working on something that you'll try to publish, then look in the journals that you're submitting it to and see what kind of statistical methods the editors of the journal have accepted.

6. Jul 6, 2013

### Hypatio

Chiro,

I think that sounds right. I want to say that the error residual, with respect to the sample mean (as an approximation of the true population mean), is superfluous to "process A". I suppose that this requires an estimate of the population distribution without the noise by determining the effect of the error residual on the measured distribution?

7. Jul 6, 2013

### chiro

You might need to invent some theory yourself but what I would recommend you do is basically look at the standard goodness of fit tests and allow some "slack" space to take into account the random noise.

Without noise, something like a Chi-Square Goodness of Fit test would be the exact way to detect a disturbance. When you add noise, the uncertainty (and hence the variance) becomes larger, because you are essentially adding more variance to the distribution.
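One possible way to make this concrete is to compute a chi-square statistic on the bin means, where each bin mean is weighted by its standard error (the bin's standard deviation divided by the square root of the number of points in the bin) rather than by the standard deviation itself. This is only a sketch under stated assumptions: the data are synthetic, the noise is assumed independent from point to point, and `process_a`, `good_model`, and `bad_model` are hypothetical stand-ins, not the actual models in this thread.

```python
import math
import random
import statistics

random.seed(0)

# All values below are assumptions chosen for illustration.
N_POINTS, N_BINS, T_MAX, NOISE_SD = 5000, 20, 10.0, 2.0


def process_a(t):
    """Hypothetical smooth large-scale signal (process A)."""
    return math.sin(t)


def good_model(t):
    """Hypothetical model that matches process A."""
    return math.sin(t)


def bad_model(t):
    """Hypothetical model with the wrong amplitude, well inside the scatter."""
    return 0.5 * math.sin(t)


# Synthetic data: process A plus independent Gaussian noise (process B).
t_values = [random.uniform(0.0, T_MAX) for _ in range(N_POINTS)]
z_values = [process_a(t) + random.gauss(0.0, NOISE_SD) for t in t_values]

# Assign each point to one of N_BINS equal-width time bins.
bins = [[] for _ in range(N_BINS)]
for t, z in zip(t_values, z_values):
    idx = min(int(t / T_MAX * N_BINS), N_BINS - 1)
    bins[idx].append((t, z))


def bin_summary(members, model):
    """Return the bin mean, bin standard deviation, and bin-averaged model value."""
    zs = [z for _, z in members]
    mean_z = statistics.fmean(zs)
    sd_z = statistics.stdev(zs)
    mean_model = statistics.fmean(model(t) for t, _ in members)
    return mean_z, sd_z, mean_model


def chi_square(model):
    """Chi-square of the bin means, using the standard error of each bin mean."""
    total = 0.0
    for members in bins:
        mean_z, sd_z, mean_model = bin_summary(members, model)
        sem = sd_z / math.sqrt(len(members))
        total += ((mean_z - mean_model) / sem) ** 2
    return total


chi2_good = chi_square(good_model)
chi2_bad = chi_square(bad_model)

# Check whether the bad model stays within one standard deviation in every bin.
bad_within_scatter = all(
    abs(mean_z - mean_model) < sd_z
    for mean_z, sd_z, mean_model in (bin_summary(m, bad_model) for m in bins)
)

print("chi-square, good model:", chi2_good)
print("chi-square, bad model: ", chi2_bad)
print("bins:", N_BINS)
print("bad model within one standard deviation in every bin:", bad_within_scatter)
```

With no fitted parameters, a model that describes the means should give a chi-square comparable to the number of bins. A model can sit inside the scatter of the individual points in every bin and still give a much larger value, because the uncertainty of a bin mean shrinks as more points are averaged. If the noise is correlated between points, the standard error used here would be too small and the comparison would need adjusting.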

One thing I definitely think you should do is to use some statistical theory to "average" over all possible distributions to get a "mean" distribution and then use that for your chi-square goodness of fit.

You could also use a Bayesian approach.

The above ideas should all provide some way of factoring in the random noise.