Linear fit on the difference of data points

In summary: the thread discusses the error analysis for a simple linear regression on differenced data. The context is a measurement of the transition frequency between two energy levels in an atom; one isotope is defined as the reference point, and the goal is to plot the difference between each measured value and the reference as a function of another variable and make a linear fit. The open issue is what errors to assign to the differences, in particular to the reference point itself; several methods and calculations are considered, and links to resources on simple linear regression and error analysis are given.
  • #1
kelly0303
Hello! I have some data points obtained from a measurement and one of them is defined as the reference point. I need to compute the difference between that reference point and all the others (including itself) and plot the difference as a function of another variable (which doesn't have an error associated with it), then make a linear fit to the data (I know from theory that I need a linear fit). For concreteness, say that I have 3 data points with values ##10\pm1##, ##20\pm2##, ##30\pm3## (the ratio between errors and actual values is not constant in my case, I just picked these numbers for simplicity) and that the middle one, 20, is the reference. The differences are thus: 10, 0 and -10. I am not sure what errors to put on these differences. If I treat the measured reference value as a constant, I end up with ##10 \pm 1##, ##0 \pm 2##, ##-10 \pm 3##. Doing a linear chi-square fit is then straightforward, but I am not sure if defining the errors like this is correct. Mainly I am not sure about the error on the reference point: that point is DEFINED as the reference point, so it shouldn't have an error associated with it, right? I could also propagate the error on the 20 through each difference, i.e. ##\sigma_{\Delta_i} = \sqrt{\sigma_i^2 + \sigma_{\rm ref}^2}##, and get ##10 \pm \sqrt 5##, ##0 \pm 2\sqrt 2##, ##-10 \pm \sqrt{13}##, but again I end up with an error on the reference point and I am not sure that is right. Lastly, I could use either of the two methods above but drop the error on the reference point entirely; in that case, though, a simple chi-square fit would give me infinities (as I would divide by an error of zero), so I am not sure that is correct either. Can someone advise me on the right way to do this?

For completeness, the data I have is the transition frequency between two energy levels in an atom. I measure this frequency for different isotopes of that atom, and one of these isotopes is defined as the reference one (this is called an isotope shift measurement).
 
  • #2
Hi,

I think you are basically asking for error analysis on simple linear regression.
A few links: http://en.wikipedia.org/wiki/Simple_linear_regression
for the errors: http://en.wikipedia.org/wiki/Simple_linear_regression#Normality_assumption

More tutorial-like: http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
and http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis (with excel recipes)
and https://people.duke.edu/~rnau/Notes_on_linear_regression_analysis--Robert_Nau.pdf

Quoting 'myself' from an old thread:
BvU said:
Let
$$
\quad \overline Y = \sum {y_i\over \sigma_i^2} \Big / \sum {1\over \sigma_i^2}, \\
\quad \overline X = \sum {x_i\over \sigma_i^2} \Big / \sum {1\over \sigma_i^2},\\
\quad \overline {XY} = \sum {x_i y_i\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2}, \\
\quad \overline {Y^2} = \sum {y_i^2\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2} \\
\quad \overline {X^2} = \sum {x_i^2\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2}
$$
Then let
$$
\quad SS_X = \overline {X^2} - \overline {X}^2,\qquad SS_Y = \overline {Y^2} - \overline {Y}^2,\qquad SS_{XY} = \overline {XY} - \overline {X}\;\overline {Y}
$$
These are Kirchner's (11)-(13) but divided by n, so from here on we can use his expressions as long as we keep the same power of n in numerator and denominator.

For the record:
$$
\quad r^2 = {SS_{XY}^2 \over SS_X\;SS_Y} \\
\quad b = r\,\sqrt{SS_Y \over SS_X} \qquad \left ( \; = {SS_{XY} \over SS_X} \right )\\
\quad a = \overline Y - b\, \overline X
$$
And now come the all-important ##\sigma##:
$$
\qquad\sigma_b^2 = {SS_Y/SS_X - b^2\over n-2}\\ \ \\
\qquad\sigma_a^2 = \sigma_b^2 \left ( SS_X +\overline {X}^2 \right )
$$
The Bevington reference by QD (mine is from the seventies :) ) is really excellent: it has everything, even clearer and more extensive (at various levels), plus a Fortran (!) listing that shouldn't be too difficult to re-engineer into Python.
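For what it's worth, here is a minimal NumPy sketch of the weighted-fit expressions above (a sketch following the formulas in this post, not Bevington's listing; the function name and structure are illustrative):

```python
import numpy as np

def weighted_linear_fit(x, y, sigma):
    """Fit y = a + b*x with per-point errors sigma, using the
    sigma-weighted means and SS quantities defined above."""
    w = 1.0 / sigma**2
    W = w.sum()
    Xbar  = (w * x).sum() / W
    Ybar  = (w * y).sum() / W
    XYbar = (w * x * y).sum() / W
    X2bar = (w * x * x).sum() / W
    Y2bar = (w * y * y).sum() / W

    SSx  = X2bar - Xbar**2
    SSy  = Y2bar - Ybar**2
    SSxy = XYbar - Xbar * Ybar

    b = SSxy / SSx                        # slope
    a = Ybar - b * Xbar                   # intercept
    n = len(x)
    var_b = (SSy / SSx - b**2) / (n - 2)  # sigma_b^2
    var_a = var_b * (SSx + Xbar**2)       # sigma_a^2
    return a, b, np.sqrt(var_a), np.sqrt(var_b)
```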

Again: if possible use physics sense. Your y values look like calculation results; systematic errors don't average out, so be sure you keep common factors separate. You can even analyze the results for this: if the weighted and unweighted fits show quite different results, there might be something wrong with the error estimates.

Your results don't really need ten-digit accuracy. And you have to ask if your sigmas are really distinct: the relative accuracy of a sigma based on averaging k measurements is around ##1/\sqrt k ##. The 0.02 differs considerably from the 0.08 -- there might be an experimental reason for that.

To top it all off, I add a few pictures from what I got using your data.
Red dot is the center of gravity unweighted, green weighted. The unweighted result is identical to the Excel linear trend. Dashed lines are the fit (middle one) and the same +/- the uncertainty in the predicted ##y_i## (Kirchner (20)).

I'll be glad to share the numerical results (in the sense of: sharing. You show your working and then I'll do the same :) ). I am also interested in the context: what are x and y, and how did the y and ##\sigma_i## come about?

Oh, and: anyone with comments/corrections: very welcome!
Key points:
A linear fit (y = ax + b) goes through the average y, average x point. It can wiggle (a) and slide up and down parallel (b).

If the average y, average x point is your origin, the fit (y = ax + b) parameter errors (##\sigma_\sf a, \sigma_\sf b##) are uncorrelated. Otherwise you have to take their correlation coefficient into account when estimating the error on a predicted y.

Your artificial example data are worthless to me, but if you want to, you can work them out and solicit further comments.
 
  • #3
kelly0303 said:
Mainly I am not sure about the error on the reference point. That point is DEFINED as the reference point, so it shouldn't have an error associated to it, right?
If you were to repeat the experiment a hundred times would you measure the same value for the reference each time?

When you are in doubt about something like this you can simply simulate it both ways and see what happens. Do your analysis and look for bias in particular.
 
  • #4
Dale said:
If you were to repeat the experiment a hundred times would you measure the same value for the reference each time?

When you are in doubt about something like this you can simply simulate it both ways and see what happens. Do your analysis and look for bias in particular.
Thank you for your reply. Yes, the reference value would be the same every time (as it is given by a quantum transition). Of course, the measurement itself will have some error. But for each individual measurement, the difference between the reference and itself will be zero (obviously). So the average of the shift over any number of measurements will always be zero. I am not sure if I should associate an error with that zero or not.
 
  • #5
BvU said:
I think you are basically asking for error analysis on simple linear regression. [...] Your artificial example data are worthless to me, but if you want to, you can work them out and solicit further comments.
My question is not mainly about the fit. It is about what error to associate with the data points, especially the reference one, whose difference is zero by definition. Would that point have an error attached to it?
 
  • #6
Yes
 
  • #7
kelly0303 said:
Of course, the measurement itself will have some error
Then you need to include its uncertainty
 
  • #8
Dale said:
Then you need to include its uncertainty
I am just not sure what uncertainty means in this case. The value of the reference frequency has an error associated with it. But the difference between that value and itself is zero, no matter what the actual value is. What I mean is: the error on a value basically reflects the fact that we don't know the true value. But in this case we know the true value of the difference, which is zero. We can never measure the absolute frequency perfectly, but no matter what the absolute value is, the difference between that value and itself must obviously be zero. So what would an error associated with that zero mean in this case? Thank you!
 
  • #9
BvU said:
Your artificial example data are worthless to me, but if you want to, you can work them out and solicit further comments.
 
  • #10
From your lack of reaction I gather the formulas in post #2 mean nothing to you. That's what the links are for: to work through them to get a feeling for fitting errors. If you can, post your data and we'll work through them to come to an answer. Talking in general terms is a waste of time and too error :wink: prone
 
  • #11
From a general standpoint, whenever you don’t know what approach to take in a situation like this, one good approach is Monte Carlo simulation. Simulate 10000 complete data sets with known parameters, analyze them both ways, and see what happens to the parameter estimates.
 
  • #12
Bit early for giving up linear regression and interpreting the results, I would say. To me MC is something of a last resort before quitting altogether (no offense intended for the HEP colleagues).
 
  • #13
BvU said:
Bit early for giving up linear regression and interpreting the results, I would say. To me MC is something of a last resort before quitting altogether (no offense intended for the HEP colleagues).
Here is the actual data. The first column is the x-axis (its error is insignificantly small), the second column is the dependent variable, and the third column is the error on the measured value (the frequency). I need to plot the frequencies relative to the fourth one (13281.586), i.e. the differences, fit a straight line to the differences vs x, and extract the slope. My main question is: what error should I quote on the 5 data points for the fit? Thank you!

x             frequency    error
-0.31240977   13281.864    0.006
-0.21655197   13281.764    0.015
-0.08060223   13281.687    0.023
 0            13281.586    0.006
 0.21088      13281.408    0.016
 
  • #14
Good. So we are looking at this:

[Attached plot: the measured frequencies vs x, with a linear trendline]

Excel (:oldruck:) and I added a trendline without taking the individual errors into account (because I don't know how to coerce XL into doing that).
A linear fit is acceptable; assuming equal weights is disputable (how come the errors vary so much?).
You can be the judge of whether that is sensible -- if not, retreat to the formulas in post #2.

I see nothing suspicious in the data - but a sample size of 5 is of course not all that overwhelming.

Regression (equal weights, same reason as above :smile:) gives:

              Coefficients     Standard Error
Intercept     13281.59318      0.008506
X Variable 1  -0.860583638     0.043019

One can subtract the averages (##\bar x = -0.07973679##, ##\bar y = 13281.6618##) to ensure that the fit goes through (0,0) and repeat the regression:

              Coefficients     Standard Error
Intercept     1.45519E-12      0.007783
X Variable 1  -0.860583638     0.043019

As you see, the slope and its error remain the same; the intercept is 0 and its error is a little bit smaller. Big advantage: the errors are now uncorrelated.
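One quick way to check the decorrelation numerically (a sketch, assuming NumPy; cov=True makes polyfit return the parameter covariance matrix):

```python
import numpy as np

x = np.array([-0.31240977, -0.21655197, -0.08060223, 0.0, 0.21088])
y = np.array([13281.864, 13281.764, 13281.687, 13281.586, 13281.408])

# unweighted fit on the raw data: slope/intercept errors are correlated
_, cov_raw = np.polyfit(x, y, 1, cov=True)
# after centering on the averages the off-diagonal term vanishes
_, cov_centered = np.polyfit(x - x.mean(), y - y.mean(), 1, cov=True)

print(cov_raw[0, 1], cov_centered[0, 1])   # nonzero vs ~0
```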

kelly0303 said:
My main question is what error should I quote on the 5 data points for the fit?
This would be for the fit as you describe it: wrt (0, 13281.586)

I would add 0.08 in quadrature with 0.043 x and take the square root. Then round off to one significant digit (two if the first is a 1).

You will also have to worry about systematic errors: the 0.08 is less than one in a million ! But that's physics, not statistics :wink:
 
  • #15
BvU said:
Bit early for giving up linear regression and interpreting the results, I would say.
That is absolutely not what I am suggesting. I am suggesting that he simulate his data *and* the linear regression using both of his proposed approaches so that he can understand how linear regression treats his data under those two approaches. This has nothing at all to do with giving up linear regression. It has to do with him developing a practical understanding of his choices.
 
  • #16
BvU said:
One can subtract the averages (##\bar x##, ##\bar y##) to ensure that the fit goes through (0,0) and repeat the regression [...] the slope and its error remain the same, the intercept is 0 and its error is a little bit smaller. Big advantage: the errors are now uncorrelated.
Thank you for your reply! I am a bit confused as to why you subtracted the mean of the values. As we use the 4th point as the reference, shouldn't we subtract 13281.586? I assume that the value of the slope, as well as its error, will be the same, but I was just curious: why the mean?

BvU said:
I would add 0.08 in quadrature with 0.043 x and take the square root. Then round off to one significant digit (two if the first is a 1).

I am not sure I understand this part. What error would this be? And where is the 0.08 coming from?

Thank you!
 
  • #17
Dale said:
From a general standpoint, whenever you don’t know what approach to take in a situation like this, one good approach is Monte Carlo simulation. Simulate 10000 complete data sets with known parameters, analyze them both ways, and see what happens to the parameter estimates.
Thank you for your reply! I am not sure what kind of simulations I should do. The experiment is quite complex and I don't really have access to the simulations that were made (and doing them myself would probably take a very long time).
 
  • #18
kelly0303 said:
why you subtracted the mean of the values
Was explained in #2:
BvU said:
Key points:
A linear fit (y = ax + b) goes through the average y, average x point. It can wiggle (a) and slide up and down parallel (b).

If the average y, average x point is your origin, the fit (y = ax + b) parameter errors (##\sigma_\sf a, \sigma_\sf b##) are uncorrelated. Otherwise you have to take their correlation coefficient into account when estimating the error on a predicted y.

kelly0303 said:
I am not sure I understand this part. What error would be this? And where is the 0.08 coming from?
Do you know a little bit about simple error propagation ? If you have an estimate ##y = ax +b## with ##a\pm\sigma_a## and ##b\pm\sigma_b## from a fit and ##\sigma_a## and ##\sigma_b## are uncorrelated, then your best guess for the predicted ##y## is $$y = (ax +b) \pm \sqrt{\sigma_a^2\; x^2 +\sigma_b^2} $$
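For example, with the numbers from post #14 (##\sigma_a \approx 0.043##, ##\sigma_b \approx 0.008##), the predicted ##y## at the last data point ##x = 0.21088## carries an error of ##\sqrt{(0.043\cdot 0.21088)^2 + 0.008^2} \approx 0.012##.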

The 0.008 is the estimated standard deviation of the intercept (it says 0.0077, but the error itself is only ##\approx## 50% accurate with a five-point sample). This error is independent of the error in the slope when the origin is ##(x_{\sf average}, y_{\sf average})##. (I see I typed 0.08 and confused you, sorry.) Caveat: as said, for a fit with individual weights you have to retreat to the expressions in #2.

Still curious: why do the individual points have different errors? Widely different intensities? Or do they come out of a complicated fitting procedure, too?
 
  • #19
kelly0303 said:
Thank you for your reply! I am not sure what kind of simulations I should do. The experiment is quite complex and I don't really have access to the simulations that were made (and doing them myself would probably take a very long time).
You should do the simplest simulation possible to answer your specific question. You have some data with associated errors that you believe follows a linear fit and you want to know if you should treat the reference point as having zero error.

So simulate a fake data set with a known true linear effect and some added noise that is approximately consistent with the real errors. Then analyze that simulated data both ways and compare the result to the known true linear fit. Repeat 10000 times and compare.
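For illustration, a minimal Python sketch of this recipe, using the post #13 data as the "true" inputs (everything here, from the variable names and the reference index to the use of np.polyfit, is an assumed implementation, not something from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical "truth", modeled on the data in post #13
x = np.array([-0.31240977, -0.21655197, -0.08060223, 0.0, 0.21088])
sigma = np.array([0.006, 0.015, 0.023, 0.006, 0.016])
true_slope, true_intercept = -0.86, 13281.586
ref = 3                                   # index of the reference point

slopes = {"raw errors": [], "propagated errors": []}
for _ in range(10_000):
    # one simulated experiment: the true line plus Gaussian noise
    y = true_slope * x + true_intercept + rng.normal(0.0, sigma)
    d = y - y[ref]                        # differences w.r.t. the reference

    # the two candidate error assignments from post #1
    errs = {"raw errors": sigma,
            "propagated errors": np.sqrt(sigma**2 + sigma[ref]**2)}
    for name, err in errs.items():
        # weighted least squares; numpy's polyfit wants weights 1/sigma
        slope = np.polyfit(x, d, 1, w=1.0 / err)[0]
        slopes[name].append(slope)

for name, s in slopes.items():
    s = np.asarray(s)
    print(f"{name}: mean slope {s.mean():.4f}, "
          f"spread {s.std():.4f} (truth {true_slope})")
```

Histogramming the slopes for the two methods then shows directly whether either error assignment biases the slope or misstates its spread.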
 
  • #20
BvU said:
Was explained in #2:
Do you know a little bit about simple error propagation ? If you have an estimate ##y = ax +b## with ##a\pm\sigma_a## and ##b\pm\sigma_b## from a fit and ##\sigma_a## and ##\sigma_b## are uncorrelated, then your best guess for the predicted ##y## is $$y = (ax +b) \pm \sqrt{\sigma_a^2\; x^2 +\sigma_b^2} $$

The 0.008 is the estimated standard deviation of the intercept (It says 0.0077, but the error is only ##\approx## 50% accurate with five points sample). This error is independent of the error in the slope when the origin is ##(x_{\sf average}, y_{\sf average})## . (I see I typed 0.08 and confused you, sorry)Caveat: as said, for a fit with individual weights you have to retreat to the expressions in #2.

Still curious: why do the individual points have different errrors ? Widely different intensities ? Or do they come out of a complicated fitting procedure, too ?
Thank you for this. The points came from another fitting procedure: each peak of a spectrum was fitted with a skewed Voigt profile, and some peaks were clearer than others, hence the different errors.

So about subtracting the mean: I understand mathematically what you mean, but physically, assuming our data were perfect, the way we define the reference point means the (0,0) point in an isotope shift plot like the one I mentioned must coincide with the 4th data point above. If I subtract the mean, the reference point is no longer at (0,0), which doesn't make physical sense.
 
  • #21
kelly0303 said:
physically, assuming our data was perfect
:smile:
 
  • #22
Dale said:
:smile:
Fair! What I meant to say was theoretically speaking.
 
  • #23
Dale said:
You should do the simplest simulation possible to answer your specific question. You have some data with associated errors that you believe follows a linear fit and you want to know if you should treat the reference point as having zero error.

So simulate a fake data set with a known true linear effect and some added noise that is approximately consistent with the real errors. Then analyze that simulated data both ways and compare the result to the known true linear fit. Repeat 10000 times and compare.
Thank you for this. So I tried to do a basic simulation. I generated 5 points on a straight line of slope 30 and intercept 0 and attached some errors to them: ##(x_i, y_i, err_{y_i})##. Then, 10000 times, I did this: for each point I generated a number from a Gaussian of mean ##y_i## and standard deviation ##err_{y_i}##, call it ##y_i'##, and associated with it the same error ##err_{y_i}##, so I have 10000 sets of 5 points of the form ##(x_i, y_i', err_{y_i})##. For each of these 10000 sets I took the difference between each y' value and the 4th one (same as in my real data), so I end up with 10000 sets of 5 points of the form ##(x_i, y_i'-y_4', err_{y_i})##. I fit this with a straight line and I get 10000 values for the slope and intercept. I paste a few of them below in the order: slope, intercept, slope error, intercept error (they are as given by my code; I wouldn't keep all the significant digits).

[ 3.07557250e+01, 1.78472959e-02, 3.23231782e-01, 1.27835065e-02]
[ 3.09080924e+01, 6.65764699e-02, 1.78275245e+00, 7.05059741e-02]
[ 2.87776702e+01, -1.56217556e-01, 1.28856958e+00, 5.09615288e-02]
[ 3.09878549e+01, -1.24135958e-03, 9.83929421e-01, 3.89139376e-02]
[ 3.04860604e+01, 1.09770147e-01, 4.06615747e-01, 1.60812180e-02]

They look consistent with the input values, but now I am not sure how to combine all the 10000 estimates in a single final number. I tried to do a weighted average like this:

$$slope_{mean}= \frac{\sum_i slope_i/error_i}{\sum_i 1/error_i}$$
$$slope_{error }= \frac{1}{\sum_i 1/error_i}$$

and the same for the intercept, but using this formula I get these values:

$$30.021878997654973 \pm 2.122516723462284\times10^{-5}$$
$$-0.00018113826319896513 \pm 8.394340792573846\times10^{-7}$$

The mean values are not that bad, but looking at the errors associated with them, they seem to be a few thousand sigmas away from the true values. What am I doing wrong? How should I combine all the 10000 values into a single one, and what error should I put on it? Also, why is the weighted average failing? Thank you!
 
  • #24
kelly0303 said:
They look consistent with the input values, but now I am not sure how to combine all the 10000 estimates in a single final number.
Excellent work! I wouldn’t combine the estimates, I would just plot a histogram of the results and compare that to the histogram obtained with the other method.

kelly0303 said:
I fit this with a straight line and I get 10000 values for the slope and intercept.
Did you do a weighted least squares with a weighting determined by the errors?
 
  • #25
Dale said:
Excellent work! I wouldn’t combine the estimates, I would just plot a histogram of the results and compare that to the histogram obtained with the other method.

Did you do a weighted least squares with a weighting determined by the errors?
Thanks for this! But just out of curiosity, assuming these 10000 points came from (a lot of) different experiments, how would you combine them into one value with an error on it? Why does the weighted average give such a bad result? (I am more interested for future cases where I'll actually have to do something similar.)
 
  • #26
Well, you don’t actually want to combine them at all. If you combine them then you will lose important information. What you want to do is plot them as a histogram and mark or compare the histogram to the known true value. The histogram will show you if there is any bias as well as the range and any skewness.

If you really want just a few numbers then the minimum I would boil it down to would be five: the mean (no weighting) and standard deviation, the median, and the first and third quartiles.
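As a sketch (assuming the 10000 simulated slope estimates sit in a NumPy array called slopes; the name is illustrative):

```python
import numpy as np

# slopes: hypothetical array of the 10000 simulated slope estimates
five_numbers = {
    "mean":   np.mean(slopes),
    "std":    np.std(slopes, ddof=1),   # sample standard deviation
    "median": np.median(slopes),
    "q1":     np.percentile(slopes, 25),
    "q3":     np.percentile(slopes, 75),
}
print(five_numbers)
```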
 
  • #27
Dale said:
Well, you don’t actually want to combine them at all. If you combine them then you will lose important information. What you want to do is plot them as a histogram and mark or compare the histogram to the known true value. The histogram will show you if there is any bias as well as the range and any skewness.

If you really want just a few numbers then the minimum I would boil it down to would be five: the mean (no weighting) and standard deviation, the median, and the first and third quartiles.
Oh I see. So when is using a weighted average a good idea? Using the standard deviation here works well, but in real life I would probably have values of the slope from only 2-3 experiments. How would I quote a value and an error based on those 2-3 values (to use in theoretical calculations, for example)? Thank you!
 

1. What is a linear fit on the difference of data points?

A linear fit on the difference of data points is a statistical method used to find the best-fitting line that represents the relationship between two variables. It involves finding the slope and intercept of the line that minimizes the sum of squared differences between the observed data points and the predicted values.

2. How is a linear fit on the difference of data points calculated?

To calculate a linear fit on the difference of data points, you can use statistical software or a calculator with a linear regression function. Such a function takes in the data points and computes the slope and intercept of the best-fitting line by minimizing the sum of squared residuals.
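For example, a minimal Python sketch (the data arrays are illustrative placeholders):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])      # independent variable
d = np.array([0.1, 0.9, 2.1, 2.9])      # differences w.r.t. the reference

slope, intercept = np.polyfit(x, d, 1)  # degree-1 (linear) least squares
print(slope, intercept)
```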

3. What does the slope of a linear fit on the difference of data points represent?

The slope of a linear fit on the difference of data points represents the rate of change between the two variables. It indicates how much the dependent variable changes for every unit increase in the independent variable. A positive slope indicates a positive relationship, while a negative slope indicates a negative relationship.

4. How do you interpret the R-squared value in a linear fit on the difference of data points?

The R-squared value, also known as the coefficient of determination, represents the proportion of variation in the dependent variable that is explained by the independent variable in a linear fit. It ranges from 0 to 1, with higher values indicating a better fit. For example, an R-squared value of 0.8 means that 80% of the variation in the dependent variable can be explained by the independent variable.

5. What are the assumptions of a linear fit on the difference of data points?

The main assumptions of a linear fit on the difference of data points are that the relationship between the two variables is linear, the data points are independent, and the errors are normally distributed with constant variance. It is also important to check for outliers and influential data points that can affect the results of the linear fit.
