# Linear fit on the difference of data points


## Main Question or Discussion Point

Hello! I have some data points obtained from a measurement, and one of them is defined as the reference point. I need to compute the difference between that reference point and all the others (including itself), plot the differences as a function of another variable (which has no error associated with it), and make a linear fit to the data (I know from theory that the relation is linear).

For concreteness, say I have 3 data points with values $10\pm1$, $20\pm2$, $30\pm3$ (the ratio between errors and values is not constant in my case; I just picked these numbers for simplicity) and that the middle one, 20, is the reference. The differences are thus: 10, 0 and $-10$. I am not sure what errors to put on these differences:

- If I treat the measured reference value as a constant, I end up with $10 \pm 1$, $0 \pm 2$, $-10 \pm 3$. A linear chi-square fit is then straightforward, but I am not sure defining the errors like this is correct. Mainly I am unsure about the error on the reference point: that point is DEFINED as the reference point, so it shouldn't have an error associated with it, right?
- I could instead propagate the error on the reference value (the $\pm 2$ on the 20) into each difference and get $10 \pm \sqrt 5$, $0 \pm 2\sqrt 2$, $-10 \pm \sqrt{13}$, but again the reference point carries an error, and I am not sure that is right.
- Lastly, I could use either of the two methods above but drop the error on the reference point entirely. In that case, however, a simple chi-square fit would give infinities (I would divide by an error of zero), so I am not sure that is correct either.

Can someone advise me on the right way to do this?
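For illustration only, here is a minimal NumPy sketch of the two non-degenerate error assignments described above, using the made-up numbers from the question:

```python
import numpy as np

# Measured values and uncertainties; index 1 (the value 20) is the reference.
y = np.array([10.0, 20.0, 30.0])
sigma = np.array([1.0, 2.0, 3.0])
ref = 1

# Differences in the question's convention (reference minus others): 10, 0, -10
d = y[ref] - y

# Option 1: treat the measured reference value as an exact constant
err_const = sigma.copy()

# Option 2: propagate the reference's error into every difference
# (gives sqrt(5), 2*sqrt(2), sqrt(13) for these numbers)
err_prop = np.sqrt(sigma**2 + sigma[ref]**2)

print(d)
print(err_const)
print(err_prop)
```

Note that option 2 assigns a nonzero error even to the reference point's own difference, which is identically zero by construction; that tension is exactly what the question is about.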

For completeness, the data I have is the transition frequency between 2 energy levels in an atom. I measure this frequency for different isotopes of that atom, and one of these isotopes is defined as the reference one (this is called an isotope shift measurement).

BvU (Homework Helper):
Hi,

I think you are basically asking for error analysis on simple linear regression.
For the errors: http://en.wikipedia.org/wiki/Simple_linear_regression#Normality_assumption

More tutorial-like: http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
and http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis (with excel recipes)
and https://people.duke.edu/~rnau/Notes_on_linear_regression_analysis--Robert_Nau.pdf

Quoting 'myself' from an old thread:
Let
$$\quad \overline Y = \sum {y_i\over \sigma_i^2} \Big / \sum {1\over \sigma_i^2}, \\ \quad \overline X = \sum {x_i\over \sigma_i^2} \Big / \sum {1\over \sigma_i^2},\\ \quad \overline {XY} = \sum {x_i y_i\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2}, \\ \quad \overline {Y^2} = \sum {y_i^2\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2} \\ \quad \overline {X^2} = \sum {x_i^2\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2}$$
Then let
$$\quad SS_X = \overline {X^2} - \overline {X}^2,\qquad SS_Y = \overline {Y^2} - \overline {Y}^2,\qquad SS_{XY} = \overline {XY} - \overline {X}\;\overline {Y}$$
These are Kirchner's (11)-(13) but divided by $n$, so from here on we can use his expressions as long as they have the same power of $n$ in numerator and denominator.

For the record:
$$\quad r^2 = {SS_{XY}^2 \over SS_X\;SS_Y} \\ \quad b = {SS_{XY} \over SS_X} \qquad \left ( \; = r\,\sqrt{SS_Y \over SS_X}\, \right )\\ \quad a = \overline Y - b \overline X$$
And now come the all-important $\sigma$'s:
$$\qquad\sigma_b^2 = {SS_Y/SS_X - b^2\over n-2}\\ \ \\ \qquad\sigma_a^2 = \sigma_b^2 \left ( {SS_X\over n} +\overline {X}^2 \right )$$
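The barred averages and the expressions above can be collected into a short function. This is a sketch assuming Python with NumPy; the variable names follow the barred quantities:

```python
import numpy as np

def weighted_linear_fit(x, y, sigma):
    """Weighted least-squares fit y = a + b*x using the barred averages above."""
    w = 1.0 / sigma**2
    W = w.sum()
    # weighted means of x, y, xy, x^2, y^2
    Xb, Yb = (w * x).sum() / W, (w * y).sum() / W
    XYb = (w * x * y).sum() / W
    X2b, Y2b = (w * x**2).sum() / W, (w * y**2).sum() / W

    SSX, SSY, SSXY = X2b - Xb**2, Y2b - Yb**2, XYb - Xb * Yb
    b = SSXY / SSX                 # slope
    a = Yb - b * Xb                # intercept
    r2 = SSXY**2 / (SSX * SSY)     # correlation coefficient squared

    n = len(x)
    var_b = (SSY / SSX - b**2) / (n - 2)
    var_a = var_b * (SSX / n + Xb**2)
    return a, b, np.sqrt(var_a), np.sqrt(var_b), r2

# Example: an exact line y = 2x + 1 with unit errors recovers a = 1, b = 2
print(weighted_linear_fit(np.array([0.0, 1.0, 2.0, 3.0]),
                          np.array([1.0, 3.0, 5.0, 7.0]),
                          np.ones(4)))
```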
The Bevington reference by QD (mine is from the seventies :) ) is really excellent: it has everything, even clearer and more extensive (at various levels), plus a Fortran (!) listing that shouldn't be too difficult to re-engineer in Python.

Again: if possible, use physics sense: your $y$ look like calculation results; systematic errors don't average out, so be sure to keep common factors separate. You can even analyze the results for this: if the weighted and unweighted fits give quite different results, there might be something wrong with the error estimates.

Your results don't really need ten-digit accuracy. And you have to ask whether your sigmas are really distinct: the relative accuracy of a sigma based on averaging $k$ measurements is around $1/\sqrt k$. The 0.02 differs considerably from the 0.08 -- there might be an experimental reason for that.

To top it all off, I add a few pictures of what I got using your data.
The red dot is the unweighted centre of gravity, the green one the weighted. The unweighted result is identical to Excel's linear trend. The dashed lines are the fit (middle one) and the same $\pm$ the uncertainty in the predicted $y_i$ (Kirchner (20)).

I'll be glad to share the numerical results (in the sense of sharing: you show your working and then I'll do the same :) ). I am also interested in the context: what are $x$ and $y$, and how did the $y_i$ and $\sigma_i$ come about?

Oh, and: anyone with comments/corrections: very welcome!

Key points:
A linear fit ($y = ax + b$) goes through the point (average $x$, average $y$). It can wiggle about that point ($a$) and slide up and down parallel to itself ($b$).

If the (average $x$, average $y$) point is your origin, the errors on the fit parameters ($\sigma_a$, $\sigma_b$) are uncorrelated. Otherwise you have to take their correlation coefficient into account when estimating the error on a predicted $y$.
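This decorrelation is easy to check numerically. A hypothetical sketch (the line, noise level, and $x$ values are made up for illustration) using `np.polyfit`, which can return the parameter covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(5.0, 15.0, 20)          # deliberately far from x = 0
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)

# Fit in the original coordinates: slope and intercept errors are correlated,
# so the off-diagonal covariance term is clearly non-zero.
_, cov_raw = np.polyfit(x, y, 1, cov=True)

# Fit with x centred on its mean: the off-diagonal term (almost) vanishes.
_, cov_centred = np.polyfit(x - x.mean(), y, 1, cov=True)

print(cov_raw[0, 1])
print(cov_centred[0, 1])
```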

Your artificial example data are worthless to me, but if you want to, you can work them out and solicit further comments.

Dale (Mentor):
> Mainly I am not sure about the error on the reference point. That point is DEFINED as the reference point, so it shouldn't have an error associated to it, right?
If you were to repeat the experiment a hundred times would you measure the same value for the reference each time?

When you are in doubt about something like this you can simply simulate it both ways and see what happens. Do your analysis and look for bias in particular.
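This suggestion can be tried directly. A hypothetical sketch (the true line, abscissae, and error sizes are made up for illustration): simulate the measurement many times, form the differences against the reference point, fit with each error convention, and compare the recovered slopes to the truth.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical truth: shift linear in x, reference at index 1
x = np.array([0.0, 1.0, 2.0, 3.0])
true_y = 5.0 * x + 100.0
sigma = np.array([1.0, 2.0, 3.0, 1.5])
ref = 1

slopes_const, slopes_prop = [], []
for _ in range(2000):
    y = true_y + rng.normal(0.0, sigma)
    d = y - y[ref]
    # Convention 1: reference treated as an exact constant
    w1 = 1.0 / sigma
    # Convention 2: reference error propagated into each difference
    w2 = 1.0 / np.sqrt(sigma**2 + sigma[ref]**2)
    slopes_const.append(np.polyfit(x, d, 1, w=w1)[0])
    slopes_prop.append(np.polyfit(x, d, 1, w=w2)[0])

# Both conventions should recover a mean slope close to the true value 5;
# the conventions differ mainly in the quoted uncertainties, not in bias here.
print(np.mean(slopes_const), np.mean(slopes_prop))
```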

> If you were to repeat the experiment a hundred times would you measure the same value for the reference each time?
>
> When you are in doubt about something like this you can simply simulate it both ways and see what happens. Do your analysis and look for bias in particular.
Thank you for your reply. Yes, the reference value would be the same every time (it is given by a quantum transition), though of course each individual measurement of it has some error. But within each measurement, the difference between the reference and itself is zero (obviously), so the average of that shift over any number of measurements will always be zero. I am not sure whether I should associate an error with that zero or not.
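One way to make the shared reference error explicit (an illustration, not from the thread): drop the reference point itself, since its difference is identically zero, and note that the remaining differences share the reference's error, so they are correlated: $\mathrm{Cov}(d_i, d_j) = \sigma_\mathrm{ref}^2 + \delta_{ij}\sigma_i^2$. A generalized least-squares fit can use that covariance directly. A sketch with the question's numbers and made-up abscissae:

```python
import numpy as np

# Hypothetical numbers from the question; the middle point is the reference.
y = np.array([10.0, 20.0, 30.0])
sigma = np.array([1.0, 2.0, 3.0])
x = np.array([0.0, 1.0, 2.0])   # made-up abscissae for illustration
ref = 1

# Keep only the non-reference points; differences in the question's
# convention (reference minus others).
keep = np.array([i for i in range(len(y)) if i != ref])
d = y[ref] - y[keep]

# Covariance of the differences: independent errors on the diagonal,
# plus the common reference variance everywhere.
m = len(keep)
C = np.diag(sigma[keep]**2) + sigma[ref]**2 * np.ones((m, m))

# Generalized least squares for d = a + b*x
X = np.column_stack([np.ones_like(x[keep]), x[keep]])
Cinv = np.linalg.inv(C)
cov_ab = np.linalg.inv(X.T @ Cinv @ X)
a, b = cov_ab @ X.T @ Cinv @ d
print(a, b)
```

With only two remaining points the fit is exact whatever $C$ is; the covariance machinery starts to matter once there are more isotopes than fit parameters.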

> I think you are basically asking for error analysis on simple linear regression. [...]
My question is not mainly about the fit. It is about what error to associate with the data points, especially with the reference one, whose difference is zero by definition. Would that point have an error at all?
