Linear fit on the difference of data points

In summary: the thread discusses the error analysis for a simple linear regression on differenced data. The context is a measurement of the transition frequency between two energy levels in an atom; one isotope is defined as the reference point, and the goal is to plot the difference between each measured value and the reference as a function of another variable and make a linear fit. The open issue is what errors to assign to the differences, in particular to the reference point itself; several methods and calculations are considered, and links to resources on simple linear regression and error analysis are given.
  • #1
kelly0303
Hello! I have some data points obtained from a measurement and one of them is defined as the reference point. I need to compute the difference between that reference point and all the others (including itself) and plot the difference as a function of another variable (which doesn't have an error associated with it), then make a linear fit to the data (I know from theory that I need a linear fit). For concreteness, say that I have 3 data points with values ##10\pm1##, ##20\pm2##, ##30\pm3## (the ratio between errors and actual values is not constant in my case, I just picked these numbers for simplicity) and that the middle one, 20, is the reference. The differences are thus: 10, 0 and -10. I am not sure what errors to put on these differences. If I treat the measured reference value as a constant, I end up with ##10 \pm 1##, ##0 \pm 2##, ##-10 \pm 3##. Doing a linear chi-square fit is then straightforward, but I am not sure if defining the errors like this is correct. Mainly I am not sure about the error on the reference point: that point is DEFINED as the reference point, so it shouldn't have an error associated with it, right? I could also propagate the error on the 20 through each difference, i.e. ##\sigma_{\Delta_i} = \sqrt{\sigma_i^2 + \sigma_{\rm ref}^2}##, and get ##10 \pm \sqrt 5##, ##0 \pm 2\sqrt 2##, ##-10 \pm \sqrt{13}##, but again I end up with an error on the reference point and I am not sure that is right. Lastly, I could use either of the two methods above but drop the error on the reference point entirely; in that case, though, a simple chi-square fit would give me infinities (as I would divide by an error of zero), so I am not sure that is correct either. Can someone advise me on the right way to do this?

For completeness, the data I have is the transition frequency between two energy levels in an atom. I measure this frequency for different isotopes of that atom, and one of these isotopes is defined as the reference one (this is called an isotope shift measurement).
 
  • #2
Hi,

I think you are basically asking for error analysis on simple linear regression.
A few links: http://en.wikipedia.org/wiki/Simple_linear_regression
for the errors: http://en.wikipedia.org/wiki/Simple_linear_regression#Normality_assumption

More tutorial-like: http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
and http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis (with excel recipes)
and https://people.duke.edu/~rnau/Notes_on_linear_regression_analysis--Robert_Nau.pdf

Quoting 'myself' from an old thread:
BvU said:
Let
$$
\quad \overline Y = \sum {y_i\over \sigma_i^2} \Big / \sum {1\over \sigma_i^2}, \\
\quad \overline X = \sum {x_i\over \sigma_i^2} \Big / \sum {1\over \sigma_i^2},\\
\quad \overline {XY} = \sum {x_i y_i\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2}, \\
\quad \overline {Y^2} = \sum {y_i^2\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2} \\
\quad \overline {X^2} = \sum {x_i^2\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2}
$$
Then let
$$
\quad SS_X = \overline {X^2} - \overline {X}^2,\qquad SS_Y = \overline {Y^2} - \overline {Y}^2,\qquad SS_{XY} = \overline {XY} - \overline {X}\;\overline {Y}
$$
These are Kirchner's (11)-(13) but divided by n, so from here on we can use his expressions as long as we keep the same power of n in numerator and denominator.

For the record:
$$
\quad r^2 = {SS_{XY}^2 \over SS_X\;SS_Y} \\
\quad b = r\,\sqrt{SS_Y \over SS_X} \qquad \left ( \; = {SS_{XY} \over SS_X} \right )\\
\quad a = \overline Y - b\, \overline X
$$
And now come the all-important ##\sigma##:
$$
\qquad\sigma_b^2 = {SS_Y/SS_X - b^2\over n-2}\\ \ \\
\qquad\sigma_a^2 = \sigma_b^2 \left ( SS_X +\overline {X}^2 \right )
$$
The Bevington reference by QD (mine is from the seventies :) ) is really excellent: it has everything, even clearer and more extensive (at various levels), plus a Fortran (!) listing that shouldn't be too difficult to re-engineer into Python.
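For what it's worth, here is a minimal NumPy sketch of the weighted-fit expressions above (a sketch following the formulas in this post, not Bevington's listing; the function name and structure are illustrative):

```python
import numpy as np

def weighted_linear_fit(x, y, sigma):
    """Fit y = a + b*x with per-point errors sigma, using the
    sigma-weighted means and SS quantities defined above."""
    w = 1.0 / sigma**2
    W = w.sum()
    Xbar  = (w * x).sum() / W
    Ybar  = (w * y).sum() / W
    XYbar = (w * x * y).sum() / W
    X2bar = (w * x * x).sum() / W
    Y2bar = (w * y * y).sum() / W

    SSx  = X2bar - Xbar**2
    SSy  = Y2bar - Ybar**2
    SSxy = XYbar - Xbar * Ybar

    b = SSxy / SSx                        # slope
    a = Ybar - b * Xbar                   # intercept
    n = len(x)
    var_b = (SSy / SSx - b**2) / (n - 2)  # sigma_b^2
    var_a = var_b * (SSx + Xbar**2)       # sigma_a^2
    return a, b, np.sqrt(var_a), np.sqrt(var_b)
```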

Again: if possible use physics sense. Your y values look like calculation results; systematic errors don't average out, so be sure you keep common factors separate. You can even analyze the results for this: if the weighted and unweighted fits show quite different results, there might be something wrong with the error estimates.

Your results don't really need ten-digit accuracy. And you have to ask if your sigmas are really distinct: the relative accuracy of a sigma based on averaging k measurements is around ##1/\sqrt k ##. The 0.02 differs considerably from the 0.08 -- there might be an experimental reason for that.

To top it all off, I add a few pictures from what I got using your data.
Red dot is the center of gravity unweighted, green weighted. The unweighted result is identical to the Excel linear trend. Dashed lines are the fit (middle one) and the same +/- the uncertainty in the predicted ##y_i## (Kirchner (20)).

I'll be glad to share the numerical results (in the sense of: sharing. You show your working and then I'll do the same :) ). I am also interested in the context: what are x and y, and how did the y and ##\sigma_i## come about?

Oh, and: anyone with comments/corrections: very welcome!
Key points:
A linear fit (y = ax + b) goes through the average y, average x point. It can wiggle (a) and slide up and down parallel (b).

If the average y, average x point is your origin, the fit (y = ax + b) parameter errors (##\sigma_\sf a, \sigma_\sf b##) are uncorrelated. Otherwise you have to take their correlation coefficient into account when estimating the error on a predicted y.

Your artificial example data are worthless to me, but if you want to, you can work them out and solicit further comments.
 
  • #3
kelly0303 said:
Mainly I am not sure about the error on the reference point. That point is DEFINED as the reference point, so it shouldn't have an error associated to it, right?
If you were to repeat the experiment a hundred times would you measure the same value for the reference each time?

When you are in doubt about something like this you can simply simulate it both ways and see what happens. Do your analysis and look for bias in particular.
 
  • #4
Dale said:
If you were to repeat the experiment a hundred times would you measure the same value for the reference each time?

When you are in doubt about something like this you can simply simulate it both ways and see what happens. Do your analysis and look for bias in particular.
Thank you for your reply. Yes, the reference value would be the same every time (as it is given by a quantum transition). Of course, the measurement itself will have some error. But for each individual measurement, the difference between the reference and itself will be zero (obviously). So the average of the shift over any number of measurements will always be zero. I am not sure if I should associate an error with that zero or not.
 
  • #5
BvU said:
I think you are basically asking for error analysis on simple linear regression. [...] Your artificial example data are worthless to me, but if you want to, you can work them out and solicit further comments.
My question is not mainly about the fit. It is about what error to associate with the data points, especially the reference one, whose difference is zero by definition. Would that point have an error attached to it?
 
  • #6
Yes
 
  • #7
kelly0303 said:
Of course, the measurement itself will have some error
Then you need to include its uncertainty
 
  • #8
Dale said:
Then you need to include its uncertainty
I am just not sure what uncertainty means in this case. The value of the reference frequency has an error associated with it. But the difference between that value and itself is zero, no matter what the actual value is. What I mean is: the error on a value basically reflects the fact that we don't know the true value. But in this case we know the true value of the difference, which is zero. We can never measure the absolute frequency perfectly, but no matter what the absolute value is, the difference between that value and itself must obviously be zero. So what would an error associated with that zero mean in this case? Thank you!
 
  • #9
BvU said:
Your artificial example data are worthless to me, but if you want to, you can work them out and solicit further comments.
 
  • #10
From your lack of reaction I gather the formulas in post #2 mean nothing to you. That's what the links are for: to work through them to get a feeling for fitting errors. If you can, post your data and we'll work through them to come to an answer. Talking in general terms is a waste of time and too error :wink: prone
 
  • #11
From a general standpoint, whenever you don’t know what approach to take in a situation like this, one good approach is Monte Carlo simulation. Simulate 10000 complete data sets with known parameters, analyze them both ways, and see what happens to the parameter estimates.
 
  • #12
Bit early for giving up linear regression and interpreting the results, I would say. To me MC is something of a last resort before quitting altogether (no offense intended for the HEP colleagues).
 
  • #13
BvU said:
Bit early for giving up linear regression and interpreting the results, I would say. To me MC is something of a last resort before quitting altogether (no offense intended for the HEP colleagues).
Here is the actual data. The first column is the x-axis (its error is insignificantly small), the second column is the dependent variable, and the third column is the error on the measured value (the frequency). I need to plot the frequencies relative to the fourth one (13281.586), i.e. the differences, fit a straight line to the differences vs x, and extract the slope. My main question is: what error should I quote on the 5 data points for the fit? Thank you!

x             frequency    error
-0.31240977   13281.864    0.006
-0.21655197   13281.764    0.015
-0.08060223   13281.687    0.023
 0            13281.586    0.006
 0.21088      13281.408    0.016
 
  • #14
Good. So we are looking at this:

[Attached plot: the measured frequencies vs x, with a linear trendline]

Excel (:oldruck:) and I added a trendline without taking the individual errors into account (because I don't know how to coerce XL into doing that).
A linear fit is acceptable; assuming equal weights is disputable (how come the errors vary so much?).
You can be the judge of whether that is sensible -- if not, retreat to the formulas in post #2.

I see nothing suspicious in the data - but a sample size of 5 is of course not all that overwhelming.

Regression (equal weights, same reason as above :smile:) gives:

              Coefficients     Standard Error
Intercept     13281.59318      0.008506
X Variable 1  -0.860583638     0.043019

One can subtract the averages (##\bar x = -0.07973679##, ##\bar y = 13281.6618##) to ensure that the fit goes through (0,0) and repeat the regression:

              Coefficients     Standard Error
Intercept     1.45519E-12      0.007783
X Variable 1  -0.860583638     0.043019

As you see, the slope and its error remain the same; the intercept is 0 and its error is a little bit smaller. Big advantage: the errors are now uncorrelated.
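One quick way to check the decorrelation numerically (a sketch, assuming NumPy; cov=True makes polyfit return the parameter covariance matrix):

```python
import numpy as np

x = np.array([-0.31240977, -0.21655197, -0.08060223, 0.0, 0.21088])
y = np.array([13281.864, 13281.764, 13281.687, 13281.586, 13281.408])

# unweighted fit on the raw data: slope/intercept errors are correlated
_, cov_raw = np.polyfit(x, y, 1, cov=True)
# after centering on the averages the off-diagonal term vanishes
_, cov_centered = np.polyfit(x - x.mean(), y - y.mean(), 1, cov=True)

print(cov_raw[0, 1], cov_centered[0, 1])   # nonzero vs ~0
```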

kelly0303 said:
My main question is what error should I quote on the 5 data points for the fit?
This would be for the fit as you describe it: wrt (0, 13281.586)

I would add 0.08 in quadrature with 0.043 x and take the square root. Then round off to one significant digit (two if the first is a 1).

You will also have to worry about systematic errors: the 0.08 is less than one in a million ! But that's physics, not statistics :wink:
 
  • #15
BvU said:
Bit early for giving up linear regression and interpreting the results, I would say.
That is absolutely not what I am suggesting. I am suggesting that he simulate his data *and* the linear regression using both of his proposed approaches so that he can understand how linear regression treats his data under those two approaches. This has nothing at all to do with giving up linear regression. It has to do with him developing a practical understanding of his choices.
 
  • #16
BvU said:
One can subtract the averages (##\bar x##, ##\bar y##) to ensure that the fit goes through (0,0) and repeat the regression [...] the slope and its error remain the same, the intercept is 0 and its error is a little bit smaller. Big advantage: the errors are now uncorrelated.
Thank you for your reply! I am a bit confused as to why you subtracted the mean of the values. As we use the 4th point as the reference, shouldn't we subtract 13281.586? I assume that the value of the slope, as well as its error, will be the same, but I was just curious: why the mean?

BvU said:
I would add 0.08 in quadrature with 0.043 x and take the square root. Then round off to one significant digit (two if the first is a 1).

I am not sure I understand this part. What error would this be? And where is the 0.08 coming from?

Thank you!
 
  • #17
Dale said:
From a general standpoint, whenever you don’t know what approach to take in a situation like this, one good approach is Monte Carlo simulation. Simulate 10000 complete data sets with known parameters, analyze them both ways, and see what happens to the parameter estimates.
Thank you for your reply! I am not sure what kind of simulations I should do. The experiment is quite complex and I don't really have access to the simulations that were made (and doing them myself would probably take a very long time).
 
  • #18
kelly0303 said:
why you subtracted the mean of the values
Was explained in #2:
BvU said:
Key points:
A linear fit (y = ax + b) goes through the average y, average x point. It can wiggle (a) and slide up and down parallel (b).

If the average y, average x point is your origin, the fit (y = ax + b) parameter errors (##\sigma_\sf a, \sigma_\sf b##) are uncorrelated. Otherwise you have to take their correlation coefficient into account when estimating the error on a predicted y.

kelly0303 said:
I am not sure I understand this part. What error would be this? And where is the 0.08 coming from?
Do you know a little bit about simple error propagation ? If you have an estimate ##y = ax +b## with ##a\pm\sigma_a## and ##b\pm\sigma_b## from a fit and ##\sigma_a## and ##\sigma_b## are uncorrelated, then your best guess for the predicted ##y## is $$y = (ax +b) \pm \sqrt{\sigma_a^2\; x^2 +\sigma_b^2} $$
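For example, with the numbers from post #14 (##\sigma_a \approx 0.043##, ##\sigma_b \approx 0.008##), the predicted ##y## at the last data point ##x = 0.21088## carries an error of ##\sqrt{(0.043\cdot 0.21088)^2 + 0.008^2} \approx 0.012##.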

The 0.008 is the estimated standard deviation of the intercept (it says 0.0077, but the error itself is only ##\approx## 50% accurate with a five-point sample). This error is independent of the error in the slope when the origin is ##(x_{\sf average}, y_{\sf average})##. (I see I typed 0.08 and confused you, sorry.) Caveat: as said, for a fit with individual weights you have to retreat to the expressions in #2.

Still curious: why do the individual points have different errors? Widely different intensities? Or do they come out of a complicated fitting procedure, too?
 
  • #19
kelly0303 said:
Thank you for your reply! I am not sure what kind of simulations I should do. The experiment is quite complex and I don't really have access to the simulations that were made (and doing them myself would probably take a very long time).
You should do the simplest simulation possible to answer your specific question. You have some data with associated errors that you believe follows a linear fit and you want to know if you should treat the reference point as having zero error.

So simulate a fake data set with a known true linear effect and some added noise that is approximately consistent with the real errors. Then analyze that simulated data both ways and compare the result to the known true linear fit. Repeat 10000 times and compare.
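For illustration, a minimal Python sketch of this recipe, using the post #13 data as the "true" inputs (everything here, from the variable names and the reference index to the use of np.polyfit, is an assumed implementation, not something from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical "truth", modeled on the data in post #13
x = np.array([-0.31240977, -0.21655197, -0.08060223, 0.0, 0.21088])
sigma = np.array([0.006, 0.015, 0.023, 0.006, 0.016])
true_slope, true_intercept = -0.86, 13281.586
ref = 3                                   # index of the reference point

slopes = {"raw errors": [], "propagated errors": []}
for _ in range(10_000):
    # one simulated experiment: the true line plus Gaussian noise
    y = true_slope * x + true_intercept + rng.normal(0.0, sigma)
    d = y - y[ref]                        # differences w.r.t. the reference

    # the two candidate error assignments from post #1
    errs = {"raw errors": sigma,
            "propagated errors": np.sqrt(sigma**2 + sigma[ref]**2)}
    for name, err in errs.items():
        # weighted least squares; numpy's polyfit wants weights 1/sigma
        slope = np.polyfit(x, d, 1, w=1.0 / err)[0]
        slopes[name].append(slope)

for name, s in slopes.items():
    s = np.asarray(s)
    print(f"{name}: mean slope {s.mean():.4f}, "
          f"spread {s.std():.4f} (truth {true_slope})")
```

Histogramming the slopes for the two methods then shows directly whether either error assignment biases the slope or misstates its spread.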
 
  • #20
BvU said:
Was explained in #2:
Do you know a little bit about simple error propagation ? If you have an estimate ##y = ax +b## with ##a\pm\sigma_a## and ##b\pm\sigma_b## from a fit and ##\sigma_a## and ##\sigma_b## are uncorrelated, then your best guess for the predicted ##y## is $$y = (ax +b) \pm \sqrt{\sigma_a^2\; x^2 +\sigma_b^2} $$

The 0.008 is the estimated standard deviation of the intercept (It says 0.0077, but the error is only ##\approx## 50% accurate with five points sample). This error is independent of the error in the slope when the origin is ##(x_{\sf average}, y_{\sf average})## . (I see I typed 0.08 and confused you, sorry)Caveat: as said, for a fit with individual weights you have to retreat to the expressions in #2.

Still curious: why do the individual points have different errrors ? Widely different intensities ? Or do they come out of a complicated fitting procedure, too ?
Thank you for this. The points came from another fitting procedure: each peak of a spectrum was fitted with a skewed Voigt profile, and some peaks were clearer than others, hence the different errors.

So about subtracting the mean: I understand mathematically what you mean, but physically, assuming our data were perfect, the way we define the reference point means the (0,0) point in an isotope shift plot like the one I mentioned must coincide with the 4th data point above. If I subtract the mean, the reference point is no longer at (0,0), which doesn't make physical sense.
 
  • #21
kelly0303 said:
physically, assuming our data was perfect
:smile:
 
  • #22
Dale said:
:smile:
Fair! What I meant to say was theoretically speaking.
 
  • #23
Dale said:
You should do the simplest simulation possible to answer your specific question. You have some data with associated errors that you believe follows a linear fit and you want to know if you should treat the reference point as having zero error.

So simulate a fake data set with a known true linear effect and some added noise that is approximately consistent with the real errors. Then analyze that simulated data both ways and compare the result to the known true linear fit. Repeat 10000 times and compare.
Thank you for this. So I tried to do a basic simulation. I generated 5 points on a straight line of slope 30 and intercept 0 and attached some errors to them: ##(x_i, y_i, err_{y_i})##. Then, 10000 times, I did this: for each point I generated a number from a Gaussian of mean ##y_i## and standard deviation ##err_{y_i}##, call it ##y_i'##, and associated with it the same error ##err_{y_i}##, so I have 10000 sets of 5 points of the form ##(x_i, y_i', err_{y_i})##. For each of these 10000 sets I took the difference between each y' value and the 4th one (same as in my real data), so I end up with 10000 sets of 5 points of the form ##(x_i, y_i'-y_4', err_{y_i})##. I fit this with a straight line and I get 10000 values for the slope and intercept. I paste a few of them below in the order: slope, intercept, slope error, intercept error (they are as given by my code; I wouldn't keep all the significant digits).

[ 3.07557250e+01, 1.78472959e-02, 3.23231782e-01, 1.27835065e-02]
[ 3.09080924e+01, 6.65764699e-02, 1.78275245e+00, 7.05059741e-02]
[ 2.87776702e+01, -1.56217556e-01, 1.28856958e+00, 5.09615288e-02]
[ 3.09878549e+01, -1.24135958e-03, 9.83929421e-01, 3.89139376e-02]
[ 3.04860604e+01, 1.09770147e-01, 4.06615747e-01, 1.60812180e-02]

They look consistent with the input values, but now I am not sure how to combine all the 10000 estimates in a single final number. I tried to do a weighted average like this:

$$slope_{mean}= \frac{\sum_i slope_i/error_i}{\sum_i 1/error_i}$$
$$slope_{error }= \frac{1}{\sum_i 1/error_i}$$

and the same for the intercept, but using this formula I get these values:

$$30.021878997654973 \pm 2.122516723462284\times10^{-5}$$
$$-0.00018113826319896513 \pm 8.394340792573846\times10^{-7}$$

The mean values are not that bad, but looking at the errors associated with them, they seem to be a few thousand sigmas away from the true values. What am I doing wrong? How should I combine all the 10000 values into a single one, and what error should I put on it? Also, why is the weighted average failing? Thank you!
 
  • #24
kelly0303 said:
They look consistent with the input values, but now I am not sure how to combine all the 10000 estimates in a single final number.
Excellent work! I wouldn’t combine the estimates, I would just plot a histogram of the results and compare that to the histogram obtained with the other method.

kelly0303 said:
I fit this with a straight line and I get 10000 values for the slope and intercept.
Did you do a weighted least squares with a weighting determined by the errors?
 
  • #25
Dale said:
Excellent work! I wouldn’t combine the estimates, I would just plot a histogram of the results and compare that to the histogram obtained with the other method.

Did you do a weighted least squares with a weighting determined by the errors?
Thanks for this! But just out of curiosity, assuming these 10000 points came from (a lot of) different experiments, how would you combine them into one value with an error on it? Why does the weighted average give such a bad result? (I am more interested for future cases where I'll actually have to do something similar.)
 
  • #26
Well, you don’t actually want to combine them at all. If you combine them then you will lose important information. What you want to do is plot them as a histogram and mark or compare the histogram to the known true value. The histogram will show you if there is any bias as well as the range and any skewness.

If you really want just a few numbers then the minimum I would boil it down to would be five: the mean (no weighting) and standard deviation, the median, and the first and third quartiles.
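As a sketch (assuming the 10000 simulated slope estimates sit in a NumPy array called slopes; the name is illustrative):

```python
import numpy as np

# slopes: hypothetical array of the 10000 simulated slope estimates
five_numbers = {
    "mean":   np.mean(slopes),
    "std":    np.std(slopes, ddof=1),   # sample standard deviation
    "median": np.median(slopes),
    "q1":     np.percentile(slopes, 25),
    "q3":     np.percentile(slopes, 75),
}
print(five_numbers)
```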
 
  • #27
Dale said:
Well, you don’t actually want to combine them at all. If you combine them then you will lose important information. What you want to do is plot them as a histogram and mark or compare the histogram to the known true value. The histogram will show you if there is any bias as well as the range and any skewness.

If you really want just a few numbers then the minimum I would boil it down to would be five: the mean (no weighting) and standard deviation, the median, and the first and third quartiles.
Oh I see. So when is using a weighted average a good idea? Using the standard deviation here works well, but in real life I would probably have values of the slope from only 2-3 experiments. How would I quote a value and an error based on those 2-3 values (to use in theoretical calculations, for example)? Thank you!
 

1. What is a linear fit on the difference of data points?

A linear fit on the difference of data points is a statistical method used to find the best-fitting line that represents the relationship between two variables. It involves finding the slope and intercept of the line that minimizes the sum of squared differences between the observed data points and the predicted values.

2. How is a linear fit on the difference of data points calculated?

To calculate a linear fit on the difference of data points, you can use statistical software or a calculator with a linear regression function. Such a function takes in the data points and computes the slope and intercept of the best-fitting line by minimizing the sum of squared residuals.
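For example, a minimal Python sketch (the data arrays are illustrative placeholders):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])      # independent variable
d = np.array([0.1, 0.9, 2.1, 2.9])      # differences w.r.t. the reference

slope, intercept = np.polyfit(x, d, 1)  # degree-1 (linear) least squares
print(slope, intercept)
```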

3. What does the slope of a linear fit on the difference of data points represent?

The slope of a linear fit on the difference of data points represents the rate of change between the two variables. It indicates how much the dependent variable changes for every unit increase in the independent variable. A positive slope indicates a positive relationship, while a negative slope indicates a negative relationship.

4. How do you interpret the R-squared value in a linear fit on the difference of data points?

The R-squared value, also known as the coefficient of determination, represents the proportion of variation in the dependent variable that is explained by the independent variable in a linear fit. It ranges from 0 to 1, with higher values indicating a better fit. For example, an R-squared value of 0.8 means that 80% of the variation in the dependent variable can be explained by the independent variable.

5. What are the assumptions of a linear fit on the difference of data points?

The main assumptions of a linear fit on the difference of data points are that the relationship between the two variables is linear, the data points are independent, and the errors are normally distributed with constant variance. It is also important to check for outliers and influential data points that can affect the results of the linear fit.
