Confidence in the means despite high standard deviation?


Discussion Overview

The discussion revolves around statistical challenges in interpreting data with high standard deviation, particularly in the context of fitting models to means versus considering the variability of the data. Participants explore the implications of noise in measurements and how it affects the robustness of the means in statistical analysis.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes a dataset with high scatter and questions how to statistically argue that a model fitting the means is more valid than one fitting within the standard deviation.
  • Another participant challenges the assumption that all error is in luminosity measurement, suggesting that time measurement may also contain error.
  • A participant clarifies that the error is attributed to non-experimental random noise linked to a secondary process, distinguishing between large-scale variation and small-scale noise.
  • There is a discussion about the implications of residual errors on the measured distribution and how to estimate the population distribution without noise.
  • One participant emphasizes that defining the "best fit" is subjective and can depend on the statistical methods or loss functions used.
  • Another suggests that goodness of fit tests should account for random noise, proposing the use of Chi-Square tests and Bayesian approaches to factor in this uncertainty.

Areas of Agreement / Disagreement

Participants express differing views on the nature of error in measurements and the implications for model fitting. There is no consensus on the best approach to argue for the robustness of means in the presence of high standard deviation and noise.

Contextual Notes

Participants highlight the need for assumptions regarding the nature of noise and error in measurements, as well as the potential impact of these assumptions on statistical arguments and model fitting.

Hypatio
I have a one-dimensional set of data, an image of which is attached. The data is an intensity z as a function of time t. There are many data points over time with a large amount of scatter. However, if the data is binned over very short intervals in time, we see that there is smooth variation over time. The standard deviation is large due to the scatter in the data, but it is also about the same for all times. In addition, we can create a model of the phenomenon at work and fit the means very well. On the other hand, if we simply change parameters in the model, we can get a model which fits the means poorly but still lies within the standard deviation, because the standard deviation is so large.

What I want to know is what kind of statistical problems are encountered if you want to argue that a model which does not fit the bins well (even if it lies within the standard deviation) is actually a bad fit? I can think of at least two things but I don't know how to talk about them as well as I would like:

1) There is clearly a systematic long-wavelength variation which can be fit by a model, but how does that translate to a statistical argument for confidence that the means are more important than the standard deviation?

2) The roughness of the data may simply be noise which is randomly distributed. If the 'roughness' of the noise (e.g. the root-mean-square residual) is about the same magnitude as the standard deviation, would this not show that the means are more robust than they would seem, given the large standard deviation? (See the sketch at the end of this post.)

In short, how do you argue for the robustness of the means in data with a large standard deviation?
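A minimal simulation sketch of point 2 (Python; the signal shape, noise level, and bin count are made up purely for illustration): if the scatter really is independent random noise with standard deviation sigma, then the bin means only fluctuate by about sigma/sqrt(n), so they are far more robust than the raw standard deviation suggests.

Code:
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical smooth "process A" signal plus i.i.d. "process B" noise
t = np.linspace(0.0, 10.0, 10_000)
signal = 1.0 + 0.3 * np.sin(2 * np.pi * t / 10.0)  # slow, smooth variation
sigma = 0.5                                        # large point-to-point scatter
z = signal + rng.normal(0.0, sigma, size=t.size)

# Bin over short intervals and compare the scatter of raw data vs. bin means
n_bins = 100
bins = np.array_split(np.arange(t.size), n_bins)
bin_means = np.array([z[idx].mean() for idx in bins])
bin_true = np.array([signal[idx].mean() for idx in bins])

print("raw scatter about the signal:   ", np.std(z - signal))            # ~ sigma
print("scatter of bin means:           ", np.std(bin_means - bin_true))  # ~ sigma/10
print("predicted sigma/sqrt(n per bin):", sigma / np.sqrt(t.size / n_bins))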


Attachments

  • TEST.jpg (27 KB)
Your questions about standard deviation seem to assume that all the "error" is in the luminosity measurement and that there is no error in the time measurement. Is that your assumption?
 
Actually, the assumption is that the "error" (standard deviation) lies in a source of non-experimental random noise. I say non-experimental because the noise is real but could be linked to a second-order process. The measurements themselves (luminosity and time) can be treated as exact.

So we have a main process A which generates the large-scale variation, and then a small-scale process B which generates the "noise".

So if I want to constrain the variation due to process A, it is correct to fit to the means only and incorrect to just fit any curve within the standard deviation. But what kind of statistical ideas are involved here? How can you statistically argue that you must fit to the means?
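One way to make that argument concrete (a sketch, assuming independent noise; all numbers are placeholders): weight each bin mean by its standard error sigma/sqrt(n) rather than by the raw sigma. A curve that stays inside the +/- sigma band but misses the means then produces an enormous chi-square, which is the statistical sense in which "within the standard deviation" is not good enough.

Code:
import numpy as np

rng = np.random.default_rng(1)

# Simulated bin means: a smooth true signal, n raw points per bin
n, sigma = 100, 0.5
t_bins = np.linspace(0.0, 10.0, 50)
true_signal = 1.0 + 0.3 * np.sin(2 * np.pi * t_bins / 10.0)
sem = sigma / np.sqrt(n)  # standard error of each bin mean
bin_means = true_signal + rng.normal(0.0, sem, size=t_bins.size)

def chi2(model):
    # Chi-square of the bin means against a model, weighted by the SEM
    return np.sum(((bin_means - model) / sem) ** 2)

good_model = 1.0 + 0.3 * np.sin(2 * np.pi * t_bins / 10.0)  # tracks the means
bad_model = np.full_like(t_bins, 1.0)  # flat: within +/- sigma, misses the means

print("chi2, model fitting the means:  ", chi2(good_model))  # ~ number of bins
print("chi2, model within the scatter: ", chi2(bad_model))   # enormous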
 
So basically, are you saying: given some assumption for an error residual, what is the effect of this residual on the measured distribution (with respect to the true population), and how should we deal with it when we want to estimate the population distribution (without the noise)?
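One simple special case of this is worth stating (a sketch; additive, independent noise is an assumption, not something established in the thread): if the measured value is Z = A + B with B independent of A, then Var(Z) = Var(A) + Var(B), so an estimate of the noise variance lets you recover the spread of the underlying process by subtraction.

Code:
import numpy as np

rng = np.random.default_rng(2)

a = rng.normal(5.0, 1.0, size=100_000)  # hypothetical "process A" values
b = rng.normal(0.0, 0.5, size=a.size)   # independent additive "process B" noise
z = a + b                               # what is actually measured

# For independent A and B: Var(Z) = Var(A) + Var(B)
print("Var(Z):          ", np.var(z))
print("Var(A) + Var(B): ", np.var(a) + np.var(b))
print("recovered Var(A):", np.var(z) - 0.5**2)  # subtract known noise variance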
 
Hypatio said:
So if I want to constrain the variation due to process A, it is correct to fit to the means only and incorrect to just fit any curve within the standard deviation. But what kind of statistical ideas are involved here? How can you statistically argue that you must fit to the means?

The "best fit" is a subjective decision until you have an objective definition of "best". Some people define "best" by expressing faith in some statistic, like the (estimator of the) mean. Some people define a "loss function" or "merit function" and try to minimize the expected loss or maximize the expected merit. As I recall, using the mean value of a population to predict samples from the population minimizes a quadratic loss function (there is a small numerical check of this after this post). The mean value of a sample is one estimator of the mean value of the population.

If you are working on something that you'll try to publish, then look in the journals that you're submitting it to and see what kind of statistical methods the editors of the journal have accepted.
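A quick numerical check of the loss-function point above (a sketch; the skewed sample is arbitrary): the constant that minimizes the mean quadratic loss over a sample is the sample mean, while the sample median minimizes the mean absolute loss, so "best" really does change with the loss function.

Code:
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(2.0, size=10_000)  # arbitrary skewed sample

c = np.linspace(x.min(), x.max(), 2001)  # candidate constant predictors
quad_loss = [np.mean((x - ci) ** 2) for ci in c]
abs_loss = [np.mean(np.abs(x - ci)) for ci in c]

print("argmin quadratic loss:", c[np.argmin(quad_loss)], " sample mean:  ", x.mean())
print("argmin absolute loss: ", c[np.argmin(abs_loss)], " sample median:", np.median(x))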
 
Chiro,

I think that sounds right. I want to say that the error residual, with respect to the sample mean (as an approximation of the true population mean), is superfluous to "process A". I suppose that this requires an estimate of the population distribution without the noise by determining the effect of the error residual on the measured distribution?
 
You might need to invent some theory yourself but what I would recommend you do is basically look at the standard goodness of fit tests and allow some "slack" space to take into account the random noise.

Without noise, something like a Chi-Square goodness-of-fit test would be the exact way to detect a disturbance. When you add noise, this uncertainty (and hence the variance) becomes larger, since you are essentially adding more variance to the distribution.

One thing I definitely think you should do is to use some statistical theory to "average" over all possible distributions to get a "mean" distribution and then use that for your chi-square goodness of fit.

You could also use a Bayesian approach.

The above ideas should all provide some way of factoring in the random noise.
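One concrete reading of the "slack space" suggestion above (a sketch; the SEM and noise level are placeholders, and this assumes the extra noise is independent and Gaussian): inflate each bin's variance by the estimated noise variance before forming the chi-square statistic, then take the p-value from the chi-square distribution.

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Placeholder bin means: measurement SEM plus extra "process B" noise
sem, sigma_noise = 0.05, 0.10
t_bins = np.linspace(0.0, 10.0, 50)
model = 1.0 + 0.3 * np.sin(2 * np.pi * t_bins / 10.0)
data = model + rng.normal(0.0, np.hypot(sem, sigma_noise), size=t_bins.size)

# Chi-square with "slack": total per-bin variance = SEM^2 + noise variance
var_total = sem**2 + sigma_noise**2
chi2_stat = np.sum((data - model) ** 2 / var_total)
dof = t_bins.size  # minus the number of fitted parameters, if any were fit

print("chi2 =", chi2_stat, " p-value =", stats.chi2.sf(chi2_stat, dof))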
 
