# Homework Help: Basic data analysis problem

1. May 9, 2014

### aaaa202

I have a lot of measurements of some quantity y as a function of x, and no two y_i are taken at the same x_i.
I want to fit some kind of function to these data points, but I also want an uncertainty on the y_i's. Normally, if I had say 10 values of y measured at the same x_i, I would calculate their standard deviation and use that as the uncertainty. But since all my y_i are taken at different x_i, I can't do that. How do I assign a meaningful uncertainty to the y_i's?
It also seems weird that 10 measurements at each of 10 different x_i (100 data points in total) should give me something much less uncertain than 100 measurements at 100 all-different x_i.

2. May 9, 2014

### voko

If you postulate some function $f(x)$, then you have $f(x_i)$. Then you could consider, for example, $\sum \left( f(x_i) - y_i \right)^2$.

One thing to note is that there is probably some error in $x_i$, too.
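
A minimal numerical sketch of that sum of squares (the data and the straight-line model below are made up for illustration):

```python
import numpy as np

# Illustrative data: one (x_i, y_i) pair per x value.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def f(x, a, b):
    """Postulated model f(x); here a straight line a*x + b."""
    return a * x + b

# Sum of squared residuals for one candidate parameter pair (a, b).
a, b = 2.0, 1.0
rss = np.sum((f(x, a, b) - y) ** 2)
print(rss)  # small when f matches the data well
```

A least-squares fit is then simply the choice of (a, b) that minimises this sum.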

3. May 9, 2014

### aaaa202

I assume that the error in x_i is negligible. The problem is that in the sum above I only have one y_i for each x_i, since all my measurements are taken at different pairs (x_i, y_i).

4. May 9, 2014

### voko

Again. You are fitting a function. That means you have a function. You have points, calculated values of the function at the points, and the measured values. That gives you the sum of squares I mentioned above.

Actually, how do you even fit the function in the first place? Aren't you using a least-squares approach?

5. May 9, 2014

### haruspex

It depends whether the form of the function is known (by theory, say) or whether you are making up the form of the function based on the data.
If the form is given, then as voko indicates, there is nothing special about having multiple y values at the same x. The least-squares method gives you a fit, and the residual statistic (the R value) tells you about the uncertainty in the y's. It is unnecessary to evaluate the uncertainty at each x separately.
One complication that can arise is when the uncertainty in y is also a function of x (heteroscedasticity). To handle that, you also need a model for that relationship, e.g. linear with x.

If you are inventing the form to match the data, then you need model-selection theory to justify the number of free parameters in your model. Otherwise you may end up drawing a nonsense curve that passes through every data point.
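
A sketch of the known-form case in Python (synthetic data: a straight line with noise of known size, so we can see the pooled residual scatter recover it):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)                        # 100 distinct x values
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)  # true sigma = 0.5

# Fit the postulated form (a straight line) by least squares.
coeffs = np.polyfit(x, y, deg=1)
resid = y - np.polyval(coeffs, x)

# Pooled residual standard deviation: one uncertainty for all the y_i,
# with n - 2 degrees of freedom (two fitted parameters).
sigma = np.sqrt(np.sum(resid ** 2) / (x.size - 2))
print(coeffs, sigma)  # sigma should come out near the true value 0.5
```

Note that no repeated x values were needed: the scatter of all points about the fitted curve plays the role that the scatter within a repeated-x cluster would play.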

6. May 10, 2014

### aaaa202

Oh okay, I get it now. So I fit my data to my model, and the least-squares approach gives a standard deviation of the residuals. That standard deviation should be my error bars on the y_i in the plot, right?

7. May 10, 2014

### voko

8. May 10, 2014

### haruspex

Error bars are at their most useful when you have some other basis for estimating a different uncertainty for each data point.
When you have no such basis, and you are going to show all the data points on the chart (including multiple y values at the same x), error bars are of limited value. You could instead show two extra curves, one a standard deviation (from the R value) above the fitted curve and one below, but this just creates a constant-width band around the mean curve. Where there are multiple y values at the same (or very close together) x values, the 'local uncertainty' is already visible from the spread of the y values.
To make the chart less cluttered, it is common to represent multiple y's at the same x by a single data point with an error bar illustrating the spread of that cluster. (Correspondingly, for the purposes of the curve fitting, they can be represented by a single data point and a weighting related to the spread. It should produce the same curve as weighting equally and letting each data point be counted separately.)
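
The claim in the parenthesis can be checked numerically: collapsing each cluster of repeated-x measurements to its mean, weighted by the cluster size, reproduces the equal-weight fit over all the raw points (the data below are made up; note that `np.polyfit`'s weights multiply the residuals, hence the square root of the count):

```python
import numpy as np

# Illustrative clustered data: repeated y measurements at three x values.
x = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0])
y = np.array([2.1, 1.9, 2.0, 3.9, 4.1, 6.2, 5.8, 6.0, 6.0])

# Fit 1: every raw point counted separately, equal weights.
fit_all = np.polyfit(x, y, deg=1)

# Fit 2: one point per cluster (the mean), weighted by cluster size.
xs = np.unique(x)
means = np.array([y[x == v].mean() for v in xs])
counts = np.array([np.sum(x == v) for v in xs])
fit_means = np.polyfit(xs, means, deg=1, w=np.sqrt(counts))

print(fit_all, fit_means)  # the two parameter sets coincide
```

Strictly, the weight that reproduces the equal-weight fit exactly is the cluster *count*; weighting by the observed spread instead gives an inverse-variance fit, which can differ slightly.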

Of course, what you'd really like to show is the confidence with which the value of y at other x values can be interpolated. This suggests to me a pair of bounding curves which would not merely be parallel to the mean curve. They would be closer together in regions where the observed y values are close to the mean curve - whether the x values are identical or merely close together. I've never seen this done, but it's an interesting idea. E.g., you could imagine adding another datapoint (x, y), and seeing how sensitive the R value is to the value of y. The more sensitive, the more confident you are of the fitted curve at co-ordinate x.
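
Something close to this idea is in fact standard for the straight-line case: the confidence band for the fitted mean response, $\sigma \sqrt{1/n + (x - \bar{x})^2 / \sum_i (x_i - \bar{x})^2}$, which is narrow where the data pin the line down and widens away from them. A sketch with synthetic data (a gap in x makes the widening visible):

```python
import numpy as np

rng = np.random.default_rng(1)
# Dense data on [0, 4], sparse data on [8, 10], nothing in between.
x = np.concatenate([rng.uniform(0.0, 4.0, 40), rng.uniform(8.0, 10.0, 10)])
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)

n = x.size
slope, intercept = np.polyfit(x, y, deg=1)
sigma = np.sqrt(np.sum((y - (slope * x + intercept)) ** 2) / (n - 2))

# Standard error of the fitted line at each grid point: smallest near the
# centre of mass of the x data, growing with distance from it.
xg = np.linspace(0.0, 10.0, 50)
sxx = np.sum((x - x.mean()) ** 2)
se_fit = sigma * np.sqrt(1.0 / n + (xg - x.mean()) ** 2 / sxx)

lower = slope * xg + intercept - 2.0 * se_fit  # roughly 95% band
upper = slope * xg + intercept + 2.0 * se_fit
```

Plotting `lower` and `upper` alongside the fitted line gives exactly the kind of non-parallel bounding curves described above.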

Last edited: May 10, 2014