Computing the standard deviation in y


Discussion Overview

The discussion revolves around the computation of standard deviations for x and y errors in a dataset that has been binned for analysis. Participants explore the implications of binning data, the calculation of weighted averages, and the potential errors introduced through these processes. The context includes statistical methods and error analysis in data fitting, particularly in relation to orthogonal distance regression.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes their method of calculating x-errors and y-errors for binned data, questioning whether treating these errors independently is appropriate.
  • Another participant challenges the need for binning, suggesting that a regular fit to the original data might yield better results and simplify error handling.
  • Some participants express concern that binning could lead to loss of information and complicate the analysis, particularly when dealing with numerous data points.
  • There is a discussion about the source of weights used in the weighted average and whether averaging over the entire dataset might be more effective.
  • One participant acknowledges the potential for increased error due to binning and seeks sources on the errors associated with this process.
  • Questions arise regarding how to handle multiple sources of error, particularly when errors on x and y are considered separately.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the appropriateness of binning data or the best method for calculating errors. Multiple competing views remain regarding the handling of errors and the validity of the chosen statistical methods.

Contextual Notes

Participants note that the errors associated with binning data and the calculation of weighted averages may not be straightforward, and there is uncertainty about how to combine these errors effectively.

TheCanadian
Hi,

I've made a "probability" histogram for my data, based on 14000 datapoints in total, BUT the bins are not equally populated (e.g. bin 1 might be composed of 200 datapoints while bin 50 contains only 3). You can find it in image 1. Now, based on those relative frequencies, I computed the weighted average of each bin independently and plotted the results up to a maximum of 1.7 on the x-axis (this results in 34 bins). This can be found in image 2.

There are a couple of things to note, though. I've attached a third image, which just shows a bunch of the random points plotted before they were ever put into bins or averaged (my apologies for the different scales used in each of the images). All of these points have an error in the independent variable that is not constant (not even within a single bin), but none in the dependent variable. Thus, when computing the average x-error in each particular bin, I used:

## \sigma_x = \frac{\sqrt{\sum_{n=1}^{N} (\sigma_{x,n})^2}}{N} ##, where ##N## is simply the number of elements in that particular bin. Now, when I have 34 independent bins that I each calculate an x-error for (in this prescribed method), I also want to find the y-error for combining the datapoints in each particular bin. I did this by the method shown here (i.e. from Cochran (1977)): http://stats.stackexchange.com/questions/25895/computing-standard-error-in-weighted-mean-estimation.
Now, by doing this, I have calculated 34 y-errors and 34 x-errors, one pair corresponding to each bin's datapoint in the plot of the weighted averages.
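For concreteness, here is a small Python sketch of both quantities described above: the per-bin x-error from the quadrature formula, and the standard error of a weighted mean in the Cochran (1977) ratio-estimator form given in the linked stats.SE answer. The function names and the NumPy dependency are my own choices, not from the thread.

```python
import numpy as np

def bin_x_error(sigma_x):
    """Average x-error of one bin: quadrature sum of the individual
    per-point x-errors, divided by the number of points in the bin."""
    sigma_x = np.asarray(sigma_x, dtype=float)
    return np.sqrt(np.sum(sigma_x**2)) / len(sigma_x)

def weighted_mean_se(y, w):
    """Standard error of a weighted mean of y with weights w,
    Cochran (1977) ratio-estimator approximation."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)
    wbar = w.mean()
    ybar_w = np.sum(w * y) / np.sum(w)      # the weighted mean itself
    d = w * y - wbar * ybar_w
    var = (n / ((n - 1) * np.sum(w)**2)) * (
        np.sum(d**2)
        - 2 * ybar_w * np.sum((w - wbar) * d)
        + ybar_w**2 * np.sum((w - wbar)**2)
    )
    return np.sqrt(var)
```

With equal weights, `weighted_mean_se` reduces to the ordinary standard error of the mean, ##s/\sqrt{n}##, which is a useful sanity check.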

Now I am simply wondering: have I done anything wrong? I believe I've essentially been treating the x- and y-errors independently, but is that the correct approach in this case? For example, since the y-error depends on the samples in each bin, the x-error indicates how likely it is that a datapoint actually lies outside the bin's limits. Shouldn't this be accounted for somehow in the y-error, or shouldn't there be an additional covariance parameter? If so, how exactly would I go about doing that?

I am ultimately then trying to fit a linear function to this data by orthogonal distance regression. Any suggestions and help with regards to what I've done (possibly incorrectly) and maybe better formulas to use would be greatly appreciated!
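An orthogonal distance regression with independent x- and y-errors on each point can be done with `scipy.odr`. The sketch below uses made-up stand-in data (34 points with constant errors, chosen here purely for illustration); the real thread's binned values and per-bin errors would replace them.

```python
import numpy as np
from scipy import odr

# Hypothetical stand-ins for the 34 binned points and their errors.
rng = np.random.default_rng(0)
x = np.linspace(0.05, 1.7, 34)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, x.size)
sx = np.full_like(x, 0.01)   # per-bin x-errors
sy = np.full_like(x, 0.05)   # per-bin y-errors

def linear(beta, x):
    return beta[0] * x + beta[1]

# RealData turns the standard deviations into weights 1/s**2.
data = odr.RealData(x, y, sx=sx, sy=sy)
model = odr.Model(linear)
fit = odr.ODR(data, model, beta0=[1.0, 0.0]).run()
slope, intercept = fit.beta
```

The `Output` object also carries `fit.sd_beta`, the standard deviations of the fitted parameters, which is what you would quote as the uncertainty of the slope and intercept.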
 

Attachments

Why did you introduce those bins? A regular fit to the original data should be the best thing you can do, and it also makes error handling easier.
 
mfb said:
Why did you introduce those bins? A regular fit to the original data should be the best thing you can do, and it also makes error handling easier.

Introducing those bins was needed to find a weighted average. The data points are numerous and go to a precision of 5 decimal places, so to group them, I introduced the bins. The weighted average essentially gives me the average value I can expect--whether this is a good method for figuring out the most probable outcome is unknown, but hopefully the errors can tell me that. Any other suggestions you have would be great, too.
 
A weighted average in which way? Where do the weights come from?
What would be wrong with an average over the whole sample?

You have a set of x/y-data. If you don't have to (e.g. millions of datapoints), you should not bin them. In the best case you lose some information, in the worst case you mess up the whole analysis.
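The "regular fit to the original data" suggested above can be sketched in a few lines; the stand-in data below is made up (the thread's 14000 raw points are not available here), and the fit is an ordinary least-squares line that ignores the per-point x-errors for simplicity.

```python
import numpy as np

# Hypothetical stand-ins for the ~14000 raw (x, y) points.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.7, 14000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.3, x.size)

# A plain least-squares line through the unbinned data:
slope, intercept = np.polyfit(x, y, 1)
```

With all 14000 points in the fit, no information is discarded and no binning error is introduced; the x-errors could still be handled afterwards with an ODR on the same raw data.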
 
mfb said:
A weighted average in which way? Where do the weights come from?
What would be wrong with an average over the whole sample?

You have a set of x/y-data. If you don't have to (e.g. millions of datapoints), you should not bin them. In the best case you lose some information, in the worst case you mess up the whole analysis.

In the "histogram", the different colours represent the different weights. Essentially, based on my independent variable alone, I want to be able to predict the most likely answer (i.e. the value of y). As you might be able to see in the general scatter plot, this is not a one-to-one function even though I wish to treat it as such, so I binned the data in intervals of 0.05, but of course this does introduce an additional error. (If you don't mind me asking, could you refer me to any sources that discuss the errors associated with binning data? http://stats.stackexchange.com/questions/60256/standard-deviation-of-binned-observations seems to be the only source I can find.)

Your suggestion certainly does make sense. I am definitely losing information and introducing more error. A simple linear fit to the original data would accomplish the same thing, but with fewer sources of error.

Just out of curiosity: how are all these different errors handled? Each datapoint has an inherent error that is not constant (not even when the x-input is the same), binning the data introduces a second error, and finding the weighted mean introduces a third, so I have 3 separate errors here. The first two seem like errors on x, while the third seems like an error on y. Do I just add the two x-errors together (they appear independent)?
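If the two x-error contributions really are independent, the standard recipe is to add them in quadrature. A minimal sketch, using the thread's bin width of 0.05 and a hypothetical per-point measurement error (the numbers are illustrative only):

```python
import math

# Two independent x-error contributions for one bin (illustrative values).
sigma_measurement = 0.010              # propagated per-point x-error of the bin
sigma_binning = 0.05 / math.sqrt(12)   # std dev of a uniform bin of width 0.05

# Independent errors add in quadrature:
sigma_x_total = math.hypot(sigma_measurement, sigma_binning)
```

The `width / sqrt(12)` term is the standard deviation of a uniform distribution over one bin, which is the usual estimate of the error introduced by binning alone (as in the stats.SE link above).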
 
