# Right way to do a linear fit

Hello! I have some data of the form (x, y, z) which I know is described by a function of the form ##z=y(a+bx)##, where a and b are the parameters to be fitted. z and y have errors associated with them while x doesn't (x is actually an integer going from 0 to 3 for each value of y).

I tried to do the fit in 2 different ways. First, for each value of x I made a linear fit of the form ##z=yA## for A (I used this package, which accounts for the errors on both z and y: https://docs.scipy.org/doc/scipy/reference/odr.html), then I made a fit of the form ##A=a+bx## for a and b, with the errors on A obtained from the first fit. In the end I get a value and error for a and b. The second method was to fit ##z=y(a+bx)## directly to the whole data at once (it is not really a linear fit anymore, but it can easily be done in Python with the same package as above). This gives a new set of values and errors for a and b.

The values obtained using the 2 methods are consistent with each other (within the errors on a and b), but the first method gives a smaller error than the second. Is there anything I am missing? Shouldn't I get exactly the same result both ways? And in case the answer is no, which method should I use and why? Thank you!
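To make the comparison concrete, here is a minimal sketch of the two approaches with hypothetical stand-in numbers (not my real data). Step 1 uses scipy.odr as I described; for simplicity the sketch uses plain weighted least squares (curve_fit) for step 2 and for the direct fit, which ignores the y errors:

```python
import numpy as np
from scipy import odr
from scipy.optimize import curve_fit

# Hypothetical stand-in numbers (NOT the real data): z = y*(a + b*x),
# with a = 1.0, b = 0.5, and Gaussian noise added to z.
a_true, b_true = 1.0, 0.5
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([-0.3, -0.2, -0.1, 0.1, 0.2])
y_err = np.full_like(y, 0.005)
z_err = np.full((4, 5), 0.01)
rng = np.random.default_rng(42)
z = y[None, :] * (a_true + b_true * x[:, None]) + rng.normal(0.0, z_err)

# Method 1: two steps.
# Step 1: for each x value, fit z = A*y with ODR (errors on both y and z).
proportional = odr.Model(lambda beta, yy: beta[0] * yy)
A, A_err = [], []
for i in range(len(x)):
    data = odr.RealData(y, z[i], sx=y_err, sy=z_err[i])
    out = odr.ODR(data, proportional, beta0=[1.0]).run()
    A.append(out.beta[0])
    A_err.append(out.sd_beta[0])
# Step 2: fit A = a + b*x (x is exact) with weighted least squares.
popt1, pcov1 = curve_fit(lambda xx, a, b: a + b * xx, x, A,
                         sigma=A_err, absolute_sigma=True)

# Method 2: one direct fit of z = y*(a + b*x) to all points at once.
X, Y = np.meshgrid(x, y, indexing="ij")
popt2, pcov2 = curve_fit(lambda xy, a, b: xy[1] * (a + b * xy[0]),
                         np.vstack([X.ravel(), Y.ravel()]), z.ravel(),
                         sigma=z_err.ravel(), absolute_sigma=True)

print("two-step:", popt1, "+/-", np.sqrt(np.diag(pcov1)))
print("direct:  ", popt2, "+/-", np.sqrt(np.diag(pcov2)))
```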

jedishrfu
Mentor
Can you provide some context here? What is the data you’re fitting? Where does it come from?

Knowing that, we might find that certain fields of research prefer certain methods to be used over other methods.

Dale
Mentor
Shouldn't I get exactly the same result both ways?
No, definitely not. If different methods always gave exactly the same result then there would be no point in having different methods at all.

And in case the answer is no, which method should I use and why?
The errors in y, are they large or can they be neglected?

Can you provide some context here? What is the data you’re fitting? Where does it come from?

Knowing that, we might find that certain fields of research prefer certain methods to be used over other methods.
The data is from a molecular spectroscopy experiment. For people working in the field, this is similar to a King plot fit, but for molecular terms (when the field shift is important). z corresponds to a frequency shift between different molecules, y is the change in radius of one of the atoms of the molecules between different molecules and x is the frequency level that is being tested.

No, definitely not. If different methods always gave exactly the same result then there would be no point in having different methods at all.

The errors in y, are they large or can they be neglected?
Thank you for your reply! To be honest I wasn't even sure if they count as different methods; I assumed they are the same method, but in one case I do it in 2 steps while in the other in one step only.

The errors on y are a lot smaller than the errors on z. From what I've seen, ignoring them doesn't make a big difference. The errors on z also contain systematic uncertainties, and the statistics for z are a lot lower, so the error is quite big.

Dale
Mentor
The errors on y are a lot smaller than the errors on z.
Then doing a standard least squares fit should be fine. Stepwise fits are always a little sketchy, so I would avoid them. The smaller error most likely comes with a larger bias.

I would probably fit to the following model ##z= ay + bx + cxy + d## with a standard linear model. In R this model would be written
Code:
z~x*y
where the inclusion of the other terms is so standard that they are simply assumed. Leaving out intercept terms and lower order terms can introduce bias. This model will give you the best unbiased linear estimator.
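As a sketch of the same idea in Python (since you are working there): the full-interaction model can be fitted by building the design matrix with columns for the intercept, the main effects, and the interaction, then solving with ordinary least squares. The grid data below is hypothetical, just to show the mechanics:

```python
import numpy as np

# Hypothetical noise-free grid data generated from
# z = d + b*x + a*y + c*x*y with d = 0, b = 0, a = 1.0, c = 0.5.
# This is the model the R formula z ~ x*y expands to.
x = np.repeat([0.0, 1.0, 2.0, 3.0], 5)
y = np.tile([-0.3, -0.2, -0.1, 0.1, 0.2], 4)
z = 1.0 * y + 0.5 * x * y

# Design matrix with columns [1, x, y, x*y]; ordinary least squares.
X = np.column_stack([np.ones_like(x), x, y, x * y])
coef, *_ = np.linalg.lstsq(X, z, rcond=None)
print(coef)  # coefficients in the order [d, b, a, c]
```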

Then doing a standard least squares fit should be fine. Stepwise fits are always a little sketchy, so I would avoid them. The smaller error most likely comes with a larger bias.

I would probably fit to the following model ##z= ay + bx + cxy + d## with a standard linear model. In R this model would be written
Code:
z~x*y
where the inclusion of the other terms is so standard that they are simply assumed. Leaving out intercept terms and lower order terms can introduce bias. This model will give you the best unbiased linear estimator.
Oh I see! So if the fit is good b and d should be consistent with zero, right? Thanks a lot! Could you please explain to me a bit more why doing it in 2 steps gives me a different error (it is actually ~3 times smaller)?

Dale
Mentor
Oh I see! So if the fit is good b and d should be consistent with zero, right? Thanks a lot! Could you please explain to me a bit more why doing it in 2 steps gives me a different error (it is actually ~3 times smaller)?
I am surprised that it is that much different. Without the data I can’t really tell. There might be some substantial covariance or multicollinearity that is constrained away in the stepwise approach.

I am surprised that it is that much different. Without the data I can’t really tell. There might be some substantial covariance or multicollinearity that is constrained away in the stepwise approach.
Please find the data I am using below. The errors are combined statistical and systematic; they come from different experiments (hence the different ranges of errors). To give a bit more detail, the function I actually need to fit is ##z=y(a+b(x+0.5)/4.186)## (just a redefinition of a and b, for completeness). Each sub-array of z corresponds to a value of x. For example, the second entry of the first sub-array of z should be written as ##0.176=-0.216(a+b(0+0.5)/4.186)##. Please let me know if I can provide further details.

$$y = [-0.312, -0.216, -0.080, 0. , 0.210 ]$$
$$y_{err}=[0.015, 0.010, 0.004, 0.00001,0.01]$$
$$x=[0,1,2,3]$$
$$z = [[ 0.268, 0.176, 0.117, -0. , -0.184], [ 0.277, 0.177, 0.100, -0. , -0.179], [ 0.274, 0.178, 0.121, -0. , -0.250], [ 0.298, 0.063, 0.001, -0. , -0.374 ]]$$
$$z_{err}=[[0.008, 0.015, 0.028, 0.008, 0.021], [0.005, 0.013 , 0.018, 0.004, 0.012], [0.014, 0.016, 0.053, 0.016, 0.042], [0.059, 0.088, 0.163, 0.055, 0.151]]$$
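For reference, a sketch of the direct fit on these numbers with plain weighted least squares (scipy's curve_fit; unlike the ODR approach it ignores the errors on y, which as discussed above are small compared to those on z):

```python
import numpy as np
from scipy.optimize import curve_fit

# The data posted above; rows of z correspond to x = 0..3.
y = np.array([-0.312, -0.216, -0.080, 0.0, 0.210])
x = np.array([0.0, 1.0, 2.0, 3.0])
z = np.array([[0.268, 0.176, 0.117, -0.0, -0.184],
              [0.277, 0.177, 0.100, -0.0, -0.179],
              [0.274, 0.178, 0.121, -0.0, -0.250],
              [0.298, 0.063, 0.001, -0.0, -0.374]])
z_err = np.array([[0.008, 0.015, 0.028, 0.008, 0.021],
                  [0.005, 0.013, 0.018, 0.004, 0.012],
                  [0.014, 0.016, 0.053, 0.016, 0.042],
                  [0.059, 0.088, 0.163, 0.055, 0.151]])

# Flatten the grid so every (x, y, z) triple becomes one data point.
X, Y = np.meshgrid(x, y, indexing="ij")

def model(xy, a, b):
    # z = y*(a + b*(x + 0.5)/4.186), the redefined model above.
    xx, yy = xy
    return yy * (a + b * (xx + 0.5) / 4.186)

popt, pcov = curve_fit(model, np.vstack([X.ravel(), Y.ravel()]),
                       z.ravel(), sigma=z_err.ravel(), absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))
print("a = %.3f +/- %.3f" % (popt[0], perr[0]))
print("b = %.3f +/- %.3f" % (popt[1], perr[1]))
```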