Computing the standard deviation in y

In summary, the conversation discusses building a "probability" histogram for a dataset and using weighted averages over bins to estimate the most likely outcome. The original poster describes their method for finding the weighted average and asks how to handle the different errors introduced by binning the data and computing the weighted mean. The respondent suggests not binning the data at all and fitting the original points directly, which also simplifies the error handling.
  • #1
TheCanadian
Hi,

I've made a "probability" histogram for my data based on 14,000 data points in total, BUT the bins are not equally populated (e.g. bin 1 might be composed of 200 data points while bin 50 contains only 3). You can find it in image 1. Based on those relative frequencies, I then computed the weighted average of each bin independently and plotted the results up to a maximum of 1.7 on the x-axis (this gives 34 bins). This can be found in image 2.

There are a couple of things to note, though. I've attached a third image, which simply shows a set of the raw points plotted before they were ever binned or averaged (my apologies for the different scales used in each of the images). All of these points have an error in the independent variable that is not constant (not even within a particular bin), but no error in the dependent variable. Thus, when computing the average x-error in each particular bin, I used:

## \sigma_x = \frac{\sqrt{\sum_{n=1}^{N} \sigma_{x,n}^2}}{N} ## where ## N ## is simply the number of elements in that particular bin. For each of the 34 independent bins I calculate an x-error in this way, and I also want to find the y-error associated with combining the data points in each particular bin. I did this by the method shown here (i.e. from Cochran (1977)): http://stats.stackexchange.com/questions/25895/computing-standard-error-in-weighted-mean-estimation.
Now, by doing this, I have calculated 34 y-errors and 34 x-errors, one corresponding to each bin's datapoint for the plot of the weighted sum.
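For concreteness, here is a minimal sketch (not the poster's actual code) of how these two per-bin quantities could be computed in Python, assuming each bin holds arrays of y-values, weights, and per-point x-errors; the weighted-mean standard error uses the Cochran (1977) approximation quoted in the linked Stack Exchange thread, and all numbers are illustrative only:

```python
import numpy as np

def bin_x_error(sigma_x):
    """Average x-error of a bin: quadrature sum of the per-point
    x-errors divided by the number of points in the bin."""
    sigma_x = np.asarray(sigma_x, dtype=float)
    n = len(sigma_x)
    return np.sqrt(np.sum(sigma_x**2)) / n

def weighted_mean_and_se(y, w):
    """Weighted mean of the bin's y-values and its standard error,
    using the Cochran (1977) approximation from the linked thread."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    n = len(y)
    ybar_w = np.sum(w * y) / np.sum(w)      # weighted mean
    wbar = np.mean(w)                       # mean weight
    var = (n / ((n - 1) * np.sum(w)**2)) * (
        np.sum((w * y - wbar * ybar_w)**2)
        - 2 * ybar_w * np.sum((w - wbar) * (w * y - wbar * ybar_w))
        + ybar_w**2 * np.sum((w - wbar)**2)
    )
    return ybar_w, np.sqrt(var)

# Example: one bin with 5 points (hypothetical numbers)
y = [1.02, 0.98, 1.10, 0.95, 1.05]
w = [200, 180, 150, 90, 60]           # e.g. relative frequencies as weights
sx = [0.03, 0.02, 0.05, 0.04, 0.03]   # per-point x-errors
print(bin_x_error(sx), weighted_mean_and_se(y, w))
```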

Now I am simply wondering: have I done anything wrong? I have essentially been treating the x- and y-errors independently, but is that the correct approach here? For example, the y-error depends on the samples in each bin, while the error in x reflects the possibility that a point actually belongs outside the bin's limits. Shouldn't this be accounted for somehow in the error in y, or shouldn't there be an additional covariance term? If so, how exactly would I go about doing that?

I am ultimately trying to fit a linear function to this data by orthogonal distance regression. Any suggestions about what I may have done incorrectly, and any better formulas to use, would be greatly appreciated!
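As a point of reference, here is a minimal sketch of an orthogonal distance regression of a straight line using scipy.odr, assuming the 34 binned points and their x- and y-errors are available as arrays (the data here are synthetic stand-ins, not the actual measurements):

```python
import numpy as np
from scipy import odr

rng = np.random.default_rng(0)

# Synthetic stand-in for the 34 binned points and their errors.
x = np.linspace(0.025, 1.675, 34)        # bin centres (0.05-wide bins)
sx = rng.uniform(0.01, 0.05, size=34)    # per-bin x-errors
sy = rng.uniform(0.02, 0.10, size=34)    # per-bin y-errors
y = 2.0 * x + 0.5 + rng.normal(0, sy)    # "measured" weighted means

def linear(beta, x):
    """Straight line y = beta[0] * x + beta[1]."""
    return beta[0] * x + beta[1]

# RealData accepts the standard deviations in x and y directly.
data = odr.RealData(x, y, sx=sx, sy=sy)
model = odr.Model(linear)
fit = odr.ODR(data, model, beta0=[1.0, 0.0]).run()

print("slope, intercept:", fit.beta)
print("parameter std errors:", fit.sd_beta)
```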
 

Attachments

  • Combined Probability Chart.pdf
  • PARTIAL_SIGMAChartlinear_S=1-7.pdf
  • Noise_SIGMA_R0-4_S17_A=all.pdf
  • #2
Why did you introduce those bins? A regular fit to the original data should be the best thing you can do, and it also makes error handling easier.
 
  • #3
mfb said:
Why did you introduce those bins? A regular fit to the original data should be the best thing you can do, and it also makes error handling easier.

Introducing those bins was needed to find a weighted average. The data points are numerous and are given to a precision of 5 decimal places, so to group them I introduced the bins. The weighted average essentially gives me the average value I can expect. Whether this is a good method for identifying the most probable outcome I don't know, but hopefully the errors can tell me that. Any other suggestions you have would be great, too.
 
  • #4
A weighted average in which way? Where do the weights come from?
What would be wrong with an average over the whole sample?

You have a set of x/y-data. Unless you have to bin them (e.g. with millions of data points), you should not. In the best case you lose some information; in the worst case you mess up the whole analysis.
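For illustration, a minimal sketch of such a direct fit to the raw, unbinned points with scipy.optimize.curve_fit (the data arrays here are synthetic placeholders, and note that curve_fit only weights by y-errors; since the errors in this thread are on x, the scipy.odr approach sketched earlier could be applied to the unbinned points instead):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

# Synthetic stand-in for the ~14,000 raw (x, y) points.
x = rng.uniform(0.0, 1.7, size=14000)
y = 2.0 * x + 0.5 + rng.normal(0.0, 0.1, size=14000)

def line(x, m, b):
    return m * x + b

popt, pcov = curve_fit(line, x, y)
perr = np.sqrt(np.diag(pcov))   # 1-sigma parameter uncertainties
print("slope, intercept:", popt, "+/-", perr)
```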
 
  • #5
mfb said:
A weighted average in which way? Where do the weights come from?
What would be wrong with an average over the whole sample?

You have a set of x/y-data. Unless you have to bin them (e.g. with millions of data points), you should not. In the best case you lose some information; in the worst case you mess up the whole analysis.

In the "histogram", the different colours represent the different weights. Essentially, based on my independent variable alone, I want to be able to predict what the most likely answer is (i.e. the answer on y). As you might be able to see in the general scatter plot, this is not a one-to-one function despite me wishing to treat it as such, so I binned the data in intervals of 0.05 but of course this does result in an additional error. (If you don't mind me asking, could you refer me to any sources that discuss the errors associated with binning data?--http://stats.stackexchange.com/questions/60256/standard-deviation-of-binned-observations this seems to be the only source I can find.)

Your suggestion certainly makes sense. I am definitely losing information and allowing for more error. A simple linear fit to the raw data would accomplish the same thing, but with less error.

Just out of curiosity: how are all these different errors handled? I have an inherent error on each data point that is not constant (not even when the x-input is the same), an error from binning the data, and another error from finding the weighted mean, so there are 3 separate errors here. The first 2 seem like errors on x, while the third seems like an error on y. Do I simply add the two errors on x (they appear to be independent)?
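If the two x-error contributions really are independent, the usual propagation rule is to add them in quadrature rather than linearly. A short sketch of that, where the measurement error is an arbitrary example value and the binning error is approximated (one common choice, not necessarily the poster's) by the standard deviation of a uniform distribution across the 0.05-wide bin:

```python
import numpy as np

sigma_measurement = 0.03              # inherent per-point x-error (example value)
sigma_binning = 0.05 / np.sqrt(12)    # std. dev. of a uniform 0.05-wide bin

# Independent errors combine in quadrature, not by simple addition.
sigma_x_total = np.sqrt(sigma_measurement**2 + sigma_binning**2)
print(sigma_x_total)
```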
 

1. What is the formula for computing the standard deviation in y?

The standard deviation in y is given by ## \sigma_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}} ##, where ## y_i ## are the data points, ## \bar{y} ## is the mean of the data set, and ## n ## is the total number of data points. (Replacing ## n ## with ## n - 1 ## gives the sample standard deviation, the usual choice when estimating the spread from a sample.)
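For concreteness, a minimal numpy sketch of both forms (the data values are arbitrary):

```python
import numpy as np

y = np.array([1.2, 0.9, 1.5, 1.1, 1.3])

# Population form: divide by n (numpy's default, ddof=0).
sigma_pop = np.sqrt(np.sum((y - y.mean())**2) / len(y))

# Sample form: divide by n - 1 (ddof=1).
sigma_sample = y.std(ddof=1)

print(sigma_pop, np.isclose(sigma_pop, y.std()), sigma_sample)
```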

2. Why is it important to compute the standard deviation in y?

Computing the standard deviation in y allows us to understand the spread or variability of the data set. It helps us to determine how much the data points deviate from the mean, and thus gives us a better understanding of the distribution of the data.

3. How is the standard deviation in y different from the standard deviation in x?

The standard deviation in y measures the spread or variability of the dependent variable (y) in a data set, while the standard deviation in x measures the spread of the independent variable (x). This means that the standard deviation in y is influenced by changes in the y-values, while the standard deviation in x is influenced by changes in the x-values.

4. Can the standard deviation in y be negative?

No, the standard deviation in y cannot be negative. It is always non-negative, because it is the square root of the average of the squared differences between the data points and the mean (it is zero only when all points are identical). A low standard deviation in y indicates that the data points lie close to the mean, while a high standard deviation in y indicates that they are spread out.

5. How does the sample size affect the computation of the standard deviation in y?

The sample size (n) affects how reliably the standard deviation in y can be estimated. The standard deviation itself does not systematically shrink as n grows; rather, a larger sample gives a more accurate estimate of the population's true spread. What does decrease with sample size is the standard error of the mean, ## \sigma_y / \sqrt{n} ##, which quantifies how precisely the mean itself is known.
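A small numpy experiment illustrating the distinction (the underlying distribution and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

for n in (10, 100, 10000):
    y = rng.normal(loc=5.0, scale=2.0, size=n)
    sd = y.std(ddof=1)       # estimate of the spread: stays near 2
    sem = sd / np.sqrt(n)    # uncertainty on the mean: shrinks with n
    print(n, round(sd, 3), round(sem, 4))
```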
