
Computing the standard deviation in y

  1. Jul 27, 2015 #1

    I've made a "probability" histogram for my data based on 14,000 data points in total, BUT the bins are not equally populated (e.g. bin 1 might be composed of 200 data points while bin 50 contains only 3). You can find it in image 1. Now, based on those relative frequencies, I constructed the weighted average of each bin independently and plotted the result up to a maximum of 1.7 on the x-axis (this gives 34 bins). This can be found in image 2.

    There are a couple things to note, though. I've attached a third image, and this essentially just shows a bunch of random points plotted, before they were ever put into bins or averaged (my apologies for the different scales used in each of the images). All these points have an error in the independent variable that is not constant (not even in a particular bin), but none in the dependent variable. Thus, when computing the average x-error in each particular bin, I used:

    ## \sigma_x = \frac {\sqrt {\sum_{n=1}^{N} \sigma_{x,n}^2}}{N} ## where ##N## is simply the number of elements in that particular bin. Now that I have 34 independent bins, each with an x-error calculated this way, I also want to find the y-error from combining the data points in each particular bin. I did this using the method shown here (i.e. from Cochran (1977)): http://stats.stackexchange.com/questions/25895/computing-standard-error-in-weighted-mean-estimation.
    Now, by doing this, I have calculated 34 y-errors and 34 x-errors, one corresponding to each bin's datapoint for the plot of the weighted sum.
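
    The two per-bin error calculations described above can be sketched as follows. This is a minimal reconstruction, not the poster's actual code: the x-error is the quadrature sum divided by N from the formula above, and the y-error is the approximate standard error of a weighted mean attributed to Cochran (1977) in the linked stats.stackexchange answer (for equal weights it reduces to the ordinary standard error of the mean).

    ```python
    import numpy as np

    def combined_x_error(sigma_x):
        """Error on a bin's mean when each point carries its own x-error:
        sigma = sqrt(sum(sigma_i^2)) / N, the formula given in the post."""
        sigma_x = np.asarray(sigma_x, dtype=float)
        return np.sqrt(np.sum(sigma_x**2)) / len(sigma_x)

    def weighted_mean_se(y, w):
        """Approximate standard error of a weighted mean (Cochran 1977),
        as quoted in the linked stats.stackexchange answer."""
        y = np.asarray(y, dtype=float)
        w = np.asarray(w, dtype=float)
        n = len(y)
        wbar = w.mean()
        ybar_w = np.sum(w * y) / np.sum(w)   # the weighted mean itself
        t = w * y - wbar * ybar_w
        var = (n / ((n - 1) * np.sum(w)**2)) * (
            np.sum(t**2)
            - 2.0 * ybar_w * np.sum((w - wbar) * t)
            + ybar_w**2 * np.sum((w - wbar)**2)
        )
        return np.sqrt(var)

    # Sanity check: with equal weights this matches std/sqrt(n)
    y = [1.0, 2.0, 3.0, 4.0]
    se_equal = weighted_mean_se(y, [2.0, 2.0, 2.0, 2.0])
    se_classic = np.std(y, ddof=1) / np.sqrt(len(y))
    ```

    Applying these two functions bin by bin would yield the 34 x-errors and 34 y-errors mentioned below.
    
    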

    Now I am simply wondering: have I done anything wrong? I have essentially been treating the x- and y-errors independently, but is that the correct approach in this case? For example, the y-error depends on the samples in each bin, while the x-error indicates how likely the data are to lie outside the bin's limits. Shouldn't this be accounted for somewhat in the y-error, or shouldn't there be an additional covariance parameter? If so, how exactly would I go about doing that?

    I am ultimately then trying to fit a linear function to this data by orthogonal distance regression. Any suggestions and help with regards to what I've done (possibly incorrectly) and maybe better formulas to use would be greatly appreciated!
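
    For the orthogonal distance regression step, `scipy.odr` accepts exactly this kind of input: per-point errors on both x and y. A minimal sketch with synthetic stand-in data (the true slope 2.0, intercept 0.5, and the error magnitudes are illustrative assumptions, not values from the thread):

    ```python
    import numpy as np
    from scipy import odr

    # Synthetic stand-in for the 34 binned points with their errors
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.7, 34)
    y = 2.0 * x + 0.5 + rng.normal(0.0, 0.05, x.size)
    sx = np.full(x.size, 0.02)   # the 34 per-bin x-errors
    sy = np.full(x.size, 0.05)   # the 34 per-bin y-errors

    def linear(beta, x):
        return beta[0] * x + beta[1]

    # RealData weights each point by 1/sx^2 and 1/sy^2 internally
    data = odr.RealData(x, y, sx=sx, sy=sy)
    fit = odr.ODR(data, odr.Model(linear), beta0=[1.0, 0.0]).run()
    slope, intercept = fit.beta
    ```

    `fit.sd_beta` then gives the standard errors on the fitted slope and intercept.
    
    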

    Attached Files:

    Last edited: Jul 27, 2015
  3. Jul 28, 2015 #2



    Staff: Mentor

    Why did you introduce those bins? A regular fit to the original data should be the best thing you can do, and it also makes error handling easier.
  4. Jul 28, 2015 #3
    Introducing those bins was needed to find a weighted average. The data points are numerous and given to a precision of 5 decimal places, so to group comparable points together, I introduced the bins. The weighted average gives me the value I can expect on average; whether this is a good method for figuring out the most probable outcome is unknown, but hopefully the errors can tell me that. Any other suggestions you have would be great, too.
  5. Jul 28, 2015 #4



    Staff: Mentor

    A weighted average in which way? Where do the weights come from?
    What would be wrong with an average over the whole sample?

    You have a set of x/y-data. Unless you have to bin it (e.g. with millions of data points), you should not: in the best case you lose some information, and in the worst case you mess up the whole analysis.
  6. Jul 28, 2015 #5
    In the "histogram", the different colours represent the different weights. Essentially, based on my independent variable alone, I want to be able to predict what the most likely answer is (i.e. the answer on y). As you might be able to see in the general scatter plot, this is not a one-to-one function despite me wishing to treat it as such, so I binned the data in intervals of 0.05 but of course this does result in an additional error. (If you don't mind me asking, could you refer me to any sources that discuss the errors associated with binning data?--http://stats.stackexchange.com/questions/60256/standard-deviation-of-binned-observations this seems to be the only source I can find.)

    Your suggestion certainly makes sense. I am definitely losing information and introducing more error. A simple linear fit would be doing the same thing, but with fewer sources of error.

    Just out of curiosity: how are all these different errors handled? I have an inherent error on each data point that is not constant (not even when the x-input is the same), an error from simply binning the data, and another error from finding the weighted mean; that is 3 separate errors. The first 2 seem like errors on x, while the third seems like an error on y. Do I simply add the two errors on x (they appear to be independent)?