How to handle the infinity when making a least squares fit for the first point?

In summary, the question is about fitting a straight line to a set of data points with known uncertainties, where the line is required to pass through a specific point. The issue arises when computing the least squares fit with a fitting program: the uncertainty assigned to that point is zero, so its weight is infinite and the chi-square blows up. Possible solutions include assigning a small, non-zero uncertainty to that point or finding a smarter way to handle the constraint.
  • #1
Malamala
Hello! I have 5 data points with errors associated with them, ##y_i \pm dy_i##, and the corresponding ##x_i## values (which don't have uncertainties associated with them). I need to calculate the difference between the first of these points, ##y_1##, and the rest, and fit a straight line to it (basically the plot will be ##\Delta y## vs ##x##). For the other 4 points the error associated with them is just ##d(\Delta y_i)=\sqrt{dy_1^2+dy_i^2}## for ##i## from 2 to 5. About the first point itself, at that value of ##x_1## the ##\Delta y## value should be zero. Also, given that this is the reference point, the error associated with it should be zero, too (right?).

Now when I want to make a least squares fit, I need to weight the difference between the model and the data by ##1/(d(\Delta y))##. However that is infinity in the case of the first point (which I guess makes sense, as I am sure that the line should pass through that point). But I am not sure how to make it work numerically, i.e. using a fitting program. Should I just replace 0 by something like ##10^{-15}##? Does anyone have any advice on how I should handle this infinity? Thank you!
 
  • #2
As I understand your question, you wish to find the least squares fit of a straight line to data, with the requirement that the line must pass through a given point and that known weights are used on the squared deviations from the line at the other points. Someone who understands curve fitting could do the calculation or code a program to do that for you. If you are determined to use a particular software package to solve the problem, you should name it. Someone who understands the particular software might know how to use it for that purpose, or how to "trick" it into solving the problem using a routine designed for a different problem.
 
  • #3
Malamala said:
About the first point itself, at that value of ##x_1## the ##\Delta y## value should be zero.
It is my very strong recommendation to not constrain the fit to go through zero. Your data should be zero at that point, so if the model is valid then the fit will produce a value which is not significantly different from zero. If the fit produces a value that is significantly different from zero then that tells you something important. Also, constraining the fit to pass through zero can lead to bias in the estimation of the slope as well as inflating the significance of the slope.

In short, do not treat that first point special in any way. Simply do the fit and check that the naturally fitted value is indeed not significantly different from zero.
 
  • Like
Likes hutchphd
  • #4
Dale said:
It is my very strong recommendation to not constrain the fit to go through zero. Your data should be zero at that point, so if the model is valid then the fit will produce a value which is not significantly different from zero. If the fit produces a value that is significantly different from zero then that tells you something important. Also, constraining the fit to pass through zero can lead to bias in the estimation of the slope as well as inflating the significance of the slope.

In short, do not treat that first point special in any way. Simply do the fit and check that the naturally fitted value is indeed not significantly different from zero.
Thank you for your reply! Actually my question is about doing the fit itself (in Python for example). The function to be minimized for the fit is $$\sum_{i=1}^5\frac{(\Delta y_i - y_{pred})^2}{(d(\Delta y_i))^2}$$ But for the first point, ##d(\Delta y_i) = 0##, so that equation blows up. Mathematically, the only way things can stay finite is if the numerator is zero, i.e. the fit must pass through that point; otherwise the chi-square is infinite. So my question is: should I assign by hand a very small, but non-zero, uncertainty to that first point, so that the code works? Or is there some other smarter way to handle this?
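For concreteness, here is a minimal Python sketch of the objective described above, with invented numbers. It shows where the division by zero enters and the pragmatic workaround of assigning a tiny non-zero uncertainty to the reference point (which is numerically the same as pinning the line to that point). The data values and the use of `scipy.optimize.curve_fit` are only illustrative assumptions, not part of the original post.

```python
# A minimal sketch of the objective described above, with invented numbers.
import numpy as np
from scipy.optimize import curve_fit

x       = np.array([0.0, 1.0, 2.0, 3.0, 4.0])      # x_i (assumed exact)
delta_y = np.array([0.0, 0.22, 0.41, 0.58, 0.83])  # Delta y_i = y_1 - y_i (made up)
d_dy    = np.array([0.0, 0.14, 0.14, 0.15, 0.16])  # d(Delta y_i); the first is exactly 0

def line(x, a, b):
    return a * x + b

def chi2(a, b):
    # The sum from the post: squared residuals weighted by 1/d(Delta y_i)^2.
    return np.sum((delta_y - line(x, a, b))**2 / d_dy**2)

with np.errstate(divide="ignore", invalid="ignore"):
    print(chi2(0.2, 0.1))   # inf: any line that misses the first point has infinite chi-square

# Pragmatic workaround mentioned in the post: give the reference point an
# uncertainty that is tiny compared to the others, so the routine runs.
# Numerically this is the same as pinning the line to (x_1, 0).
sigma = d_dy.copy()
sigma[0] = 1e-6
popt, pcov = curve_fit(line, x, delta_y, sigma=sigma, absolute_sigma=True)
print("slope = %.3f, intercept = %.3g" % tuple(popt))
```

Whether pinning the line to the reference point is a good idea at all is exactly what the following replies discuss.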
 
  • #5
Malamala said:
The function to be minimized for the fit is $$\sum_{i=1}^5\frac{\Delta y_i - y_{pred}}{d(\Delta y_i)}$$
Did you mean to say ##\sum_{i=1}^5\frac{(\Delta y_i - y_{pred})^2}{d(\Delta y_i)}## ?

But for the first point, ##d(\Delta y_i) = 0## so that equation blows up.

The typical thread about statistics on the forum produces a page of answers before the problem is defined - if it ever is defined. As an exception to that trend, let's try to define your problem.

You say you have 5 pairs of data points ##(x_i, y_i)##. Why are you introducing the symbol "##\Delta##"? Do you want "##\Delta y_i##" to denote ##y_{i+1} - y_i##? And what do you mean by the notation "##d(\Delta y_i)##"?
 
  • #6
Stephen Tashi said:
Did you mean to say ##\sum_{i=1}^5\frac{(\Delta y_i - y_{pred})^2}{d(\Delta y_i)}## ?
The typical thread about statistics on the forum produces a page of answers before the problem is defined - if it ever is defined. As an exception to that trend, let's try to define your problem.

You say you have 5 pairs of data points ##(x_i, y_i)##. Why are you introducing the symbol "##\Delta##"? Do you want "##\Delta y_i##" to denote ##y_{i+1} - y_i##? And what do you mean by the notation "##d(\Delta y_i)##"?
Yes, my bad, that is squared! I am sorry if I didn't explain the problem well enough. ##\Delta y_i = y_1-y_i##. So ##\Delta y_1 = 0##. By ##d(\Delta y_i)## I mean the error associated with this difference, which by propagation of errors should be given by ##\sqrt{dy_i^2+dy_1^2}##, except for the first point, for which the error is by definition zero.
 
  • #7
Malamala said:
##\Delta y_i = y_1-y_i##. So ##\Delta y_1 = 0##.

So you wish a least squares fit of a straight line to the points ##(x_i, y_1 - y_i)##.

For me, it's easier to think about ##y_i - y_1##, a difference with the opposite sign. I'll imagine a situation where this makes sense. Suppose I have film of an object moving in a straight line, from which I infer the position of the object, possibly making an error in that estimate. I know the object is moving at a constant velocity and I wish to estimate it from these measurements. I imagine that ##x_i## is time and ##y_i## is position.

A different situation, mathematically, is the case where the object moves at a random velocity according to some stochastic process and I wish to estimate the average velocity produced by this process.

The slope of a least squares fit of a line to the ##(x_i, \Delta y_i)## data may not be the estimator of velocity that has minimum variance. For example, a conceptually different estimator is to do a least squares fit of a line to the ##(x_i, y_i)## data and use the slope of that line as the estimator of velocity. Another conceptually different estimator is ##\frac{ \sum_{i=1}^4 (y_{i+1} - y_i)}{4}##.

If you are determined to do a least squares fit to ##(x_i, y_1 - y_i)## with a line that must pass through ##(x_1, 0)##, we can figure that out. Picking a weight for the datum ##(x_1,0)## won't be necessary.

If you want to fit a straight line to the ##(x_i, y_1 - y_i)## data and leave open the possibility that the line does not pass through ##(x_1,0)##, then omit the point ##(x_1, y_1 - y_1) = (x_1, 0)## from consideration, since the 0 is not dependent on a particular measurement. That's my suggestion. Any weight assigned to the datum ##(x_1,0)## would be a subjective choice.
 
  • #8
Malamala said:
Yes, my bad, that is squared! I am sorry if I didn't explain the problem well enough. ##\Delta y_i = y_1-y_i##. So ##\Delta y_1 = 0##. By ##d(\Delta y_i)## I mean the error associated with this difference, which by propagation of errors should be given by ##\sqrt{dy_i^2+dy_1^2}##, except for the first point, for which the error is by definition zero.
Am I missing something? Subtracting a constant does not change the error. The errors for the points don’t change here.

If the errors change from point to point then most fitting routines will have problems. You really want to avoid that.
 
  • #9
Dale said:
Am I missing something? Subtracting a constant does not change the error. The errors for the points don’t change here.

If the errors change from point to point then most fitting routines will have problems. You really want to avoid that.
I am not subtracting a constant. I am subtracting ##y_1##, i.e. the first measured point. A simple example would be this: say we measure several positions (with a certain error associated with them) and we want to calculate the distance traveled from the first measurement. For all the points except the first one, i.e. ##y_1##, the error gets calculated using error propagation. But for the first one, even if we had an error associated with its measurement, when we plot the differences relative to it, that error doesn't matter anymore, as we define that point as the initial point, so the difference between the point and itself is zero, by definition, regardless of our measurements.
 
  • #10
By subtracting off the first point, you are introducing correlated errors on the other points. There have been a lot of equations written in this thread, but I don't believe I have seen any that cover this case. It's not simple.

You also need to better specify what you are looking for with "best fit line". For example, your best estimate of the slope is not the slope of the line that minimizes the residual for the remaining N-1 points. Indeed, I'd not use the words "best fit" as they are sure to introduce misunderstandings.
 
  • #11
Malamala said:
I am not subtracting a constant.
Don’t do that. Subtract a constant. If you like, set that constant equal to the value of the first data point, but subtract a constant. Otherwise you are going to be doing all sorts of bad things and violate many assumptions.

You should have ##d(\Delta y)_1=dy_1##
 
  • #12
I suspect some statistician has already worked on the problem you have proposed. Perhaps a forum member familiar with statistical journals can find an article about it. The problem resembles analyzing data from Galileo's famous experiment involving a ball rolling down an inclined plane. Unfortunately, when I search the internet for articles on that topic, I don't find any that apply a weighted least squares fit. So, in the meantime, we must rely on reasoning.

I don't like the notation ##\Delta y_i = y_1 - y_i## because ## y_i > y_1## produces a negative "displacement" instead of a positive displacement. Hence, I'm going to use the definition ##\Delta y_k = y_k - y_1##

If we want to do weighted least squares properly, we have to consider whether the errors in measurement of ##\Delta y_i## and ##\Delta y_{i+1}## are uncorrelated. Both ##\Delta y_i## and ##\Delta y_{i+1}## depend on the measurement ##y_1##. For example, if the measurement of ##y_1## errs by being too large, it is likely to cause the measurements ##y_2 - y_1## and ##y_3 - y_1## to err by being too small.

Articles on the web show the role of covariance matrix for errors in least squares regression. (Unfortunately, the current Wikipedia article https://en.wikipedia.org/wiki/Weighted_least_squares is in a bad state. It appears to be a section cut out from a different article and it lacks a coherent introduction. )

If you estimate ##A## by fitting a line ##y= Ax + B## to the ##(x_i,y_i)## data, you don't have to worry about correlations between errors if the errors in the measurements ##y_i## are independent.

If you estimate ##A## by fitting a line ##\Delta y = A x## to the ##(x_i, \Delta y_i)## data, you do need to compute the off-diagonal elements of a covariance matrix. I think it's simple to compute the covariance matrix if you want to get into that. For ##j \ne k, j > 1, k > 1##, I think the covariance between the errors in ##\Delta y_j## and ##\Delta y_k## is ##\sigma^2_{e_1}##, the variance of the error ##e_1## in the measurement ##y_1##.

Deciding what to do in a coherent manner is not a conceptually simple task, because the conceptual foundations of statistics are not simple. The goal, in common speech, is to find the "best" way to estimate something. The technical questions include: 1) What quantity do you wish to estimate? 2) What properties of estimators make one "better" than another? (Unfortunately, saying we want to find the "correct" estimate or the "true" estimate isn't a specific criterion. Any estimate we make probably has some error to it.) 3) Will two given ways of estimating ##A## amount to computing the same estimator, or do they produce different estimators?
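To make the covariance-matrix route concrete, here is a short generalized least squares sketch. It is an illustration with made-up numbers, not code from the thread, and it fits a slope and an intercept to the four differenced points purely for concreteness. The covariance matrix of the differences has ##dy_1^2 + dy_i^2## on the diagonal and ##dy_1^2## everywhere off the diagonal, and the parameters follow from the standard GLS formula ##\beta = (X^T C^{-1} X)^{-1} X^T C^{-1}\,\Delta y##.

```python
# Generalized least squares with the covariance structure described above
# (illustrative numbers; a slope + intercept model is used just for concreteness).
import numpy as np

x   = np.array([1.0, 2.0, 3.0, 4.0])      # x_2 .. x_5
dly = np.array([0.22, 0.41, 0.58, 0.83])  # Delta y_2 .. Delta y_5 (made up)
dy1 = 0.10                                # uncertainty of the reference measurement y_1
dyi = np.array([0.10, 0.10, 0.11, 0.12])  # uncertainties of y_2 .. y_5

# Covariance of the differences: dy1^2 + dyi^2 on the diagonal, dy1^2 elsewhere,
# because every difference shares the same y_1 error.
C = np.full((4, 4), dy1**2) + np.diag(dyi**2)

X = np.column_stack([x, np.ones_like(x)])  # design matrix for Delta y = A*x + B

Cinv = np.linalg.inv(C)
beta_cov = np.linalg.inv(X.T @ Cinv @ X)   # covariance of the fitted (A, B)
beta = beta_cov @ (X.T @ Cinv @ dly)       # GLS estimate
print("A, B =", beta)
print("uncertainties:", np.sqrt(np.diag(beta_cov)))
```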
 
  • #13
@Stephen Tashi , @Dale here is the problem that I am trying to solve (although the specific problem is not that important, I am more interested in a general approach). I want to calculate the isotope shift for a given element i.e. how much the radius of the nucleus of a certain element changes for different isotopes, relative to one of the isotopes. I attached a plot of one of these examples (in this case it is the 4th point that is the reference one). I need to fit a straight line to this (in order to do some further calculations using the parameters of this line) and as I said, I am mainly not sure how to handle the fact that the 4th point has zero error. @Dale I am not sure what you mean by using a constant. Why would I have an error on the 4th point (in this case), if I know for sure that the isotope shift of that point, by definition, is zero?
 

Attachments

  • plot.png (isotope shift data for one example; the 4th point is the reference)
  • #14
Malamala said:
Why would I have an error on the 4th point (in this case), if I know for sure that the isotope shift of that point, by definition, is zero?
Because the best fit line may not go through the 4th point. So forcing it to do so is a bad idea.
What you are doing here is removing the intercept from the fit, or equivalently forcing the line to pass through a specific point. It is a bad idea numerically and statistically regardless of the theoretical justification. Your problem with the division by zero is a symptom telling you that this approach is making the math ill-behaved. My strong opinion is that you should never ever remove or coerce the intercept. Doing so introduces bias, makes your residuals non-zero mean, and can hide problems with your model.
 
  • #15
Dale said:
Because the best fit line may not go through the 4th point. So forcing it to do so is a bad idea.
What you are doing here is removing the intercept from the fit, or equivalently forcing the line to pass through a specific point. It is a bad idea numerically and statistically regardless of the theoretical justification. Your problem with the division by zero is a symptom telling you that this approach is making the math ill-behaved. My strong opinion is that you should never ever remove or coerce the intercept. Doing so introduces bias, makes your residuals non-zero mean, and can hide problems with your model.
So you are saying that for the 4th point I should just attach the original error, i.e. ##dy_1##, while for the others I should still do what I am already doing, i.e. ##\sqrt{dy_1^2+dy_i^2}##? I see what you mean, but I am still not totally convinced. The fact that I have that infinity is clearly not ideal, but why wouldn't I want my fit to go through that point? Intuitively, a fit should reflect the data as well as possible, so if I am confident that the isotope shift for that point is exactly zero, don't I want my fit to go through that point? In general you want your fit to be as close as possible to the points you are confident about. I guess what confuses me is: what is the meaning of an error term associated with that 4th point? I am 100% confident that subtracting that radius value (whatever that is) from itself should give me zero. It's like measuring the speed of an accelerating car versus time. You might have some error in measuring the velocity, but when the car starts, that's your definition of zero, and you want a fit to your data to reflect that the velocity at ##t=0## is zero, because that is the point you are the most confident about.
 
  • #16
Malamala said:
I am mainly not sure how to handle the fact that the 4th point has zero error.

As I mentioned in my previous post, if you are doing weighted least squares regression you need to employ the covariance matrix if the errors are correlated. If you do a weighted least squares fit to the ##(x_i,\Delta y_i)## data, the errors are correlated. Hence it is insufficient to consider only the errors whose standard deviations you denote by ##\sqrt{ dy^2_1 + dy^2_i}##.

In a weighted least squares curve fit, the only way to incorporate the information that ##\Delta y_1## is always 0 is not to fit a curve of the form ##\Delta y = Ax + B## to 5 data points. Instead you must fit the curve ##\Delta y = A(x - x_1) + \Delta y_1## to the 4 data points left after excluding ##(x_1, \Delta y_1)##.

You should consider doing a linear regression to the ##(x_i, y_i)## data and then using the slope of that line to estimate the curve that fits the ##(x_i, \Delta y_i)## data. I can imagine reasons for not considering that. Suppose the experiment has an uncontrolled aspect that can add the same random constant to each ##y_i## measurement; then, to make a prediction that is useful in a repetition of the experiment, it would be natural to measure departures in ##\Delta y## from some reference value ##\Delta y_1##.
 
  • Like
Likes Dale
  • #17
Stephen Tashi said:
As I mentioned in my previous post, if you are doing weighted least squares regression you need to employ the covariance matrix if the errors are correlated. If you do a weighted least squares fit to the ##(x_i,\Delta y_i)## data, the errors are correlated. Hence it is insufficient to consider only the errors whose standard deviations you denote by ##\sqrt{ dy^2_1 + dy^2_i}##.

In a weighted least squares curve fit, the only way to incorporate the information that ##\Delta y_1## is always 0 is not to fit a curve of the form ##\Delta y = Ax + B## to 5 data points. Instead you must fit the curve ##\Delta y = A(x - x_1) + \Delta y_1## to the 4 data points left after excluding ##(x_1, \Delta y_1)##.

You should consider doing a linear regression to the ##(x_i, y_i)## data and then using the slope of that line to estimate the curve that fits the ##(x_i, \Delta y_i)## data. I can imagine reasons for not considering that. Suppose the experiment has an uncontrolled aspect that can add the same random constant to each ##y_i## measurement; then, to make a prediction that is useful in a repetition of the experiment, it would be natural to measure departures in ##\Delta y## from some reference value ##\Delta y_1##.
I am not sure I understand why the errors are correlated. Each point ##y_i## is measured independently.
 
  • #18
Malamala said:
I am not sure I understand why the errors are correlated. Each point ##y_i## is measured independently.
But the points ##\Delta y_i## are not measured independently.
 
  • Like
Likes Vanadium 50
  • #19
Dale said:
But the points ##\Delta y_i## are not measured independently.
What do you mean? ##\Delta y_i = y_4 - y_i## (in the isotope shift case), so if ##y_i## and ##y_j## are independent, why would subtracting a constant value (in this case ##y_4##, which is a constant in the sense that it has the same value for all ##\Delta y_i##) make them not independent anymore? Does the degree of correlation between two variables change when a constant term is added to both of them?
 
  • #20
Malamala said:
if ##y_i## and ##y_j## are independent, why would subtracting a constant value (in this case ##y_4##, which is a constant in the sense that it has the same value for all ##\Delta y_i##) make them not independent anymore?

Assume the linear model for the ##(x,y)## data is:
##y_i = A X_i + B + e_i ##
where ##y_i## is the measured value of the actual value ##Y_i = AX_i + B## and ##e_i## is a random error with mean zero. Assume ##e_i,\ e_j## are independent random variables for ##i \ne j##.

The linear model for the ##(x,\Delta y)## data is:

##\Delta y_k = (y_1 - y_k) = ( AX_1 + B + e_1) - (AX_k + B + e_k)##
## = A(X_1 - X_k) + (e_1 - e_k)##

The errors in ##\Delta y_j## and ##\Delta y_k## each contain the term ##e_1##.

Let ##E## denote the expectation of a random variable over repetitions of the experiment that generated the data.
##E(\Delta y_k) = A(X_1 - X_k) + 0##

For ##j \ne k##, ##j \ne 1##, ##k \ne 1## :

##E(\Delta y_j \Delta y_k) = E\big( (A(X_1 - X_j) + e_1 - e_j)(A(X_1 - X_k) + e_1 - e_k) \big)##
## = A(X_1 - X_j) A(X_1 - X_k) + E\big( (e_1 - e_j)(e_1 - e_k) \big)## (the cross terms involving a single error have zero mean)
## = A(X_1 - X_j) A(X_1 - X_k) + E(e^2_1) - E(e_1 e_k) - E(e_j e_1) + E(e_j e_k)##
## = A(X_1 - X_j) A(X_1 - X_k) + \sigma^2_{e_1} - 0 - 0 + 0 ##

##COV( \Delta y_j, \Delta y_k) = E(\Delta y_j \Delta y_k) - E(\Delta y_j) E(\Delta y_k)##
## = A(X_1 - X_j) A(X_1 - X_k) + \sigma^2_{e_1} - A(X_1 - X_j) A(X_1 - X_k) = \sigma^2_{e_1}##
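A quick Monte Carlo check of this result is easy to run (an illustrative sketch with arbitrary parameters, not part of the derivation): simulate many repetitions of the experiment, form the differences, and compare the sample covariance of two of them with ##\sigma^2_{e_1}##.

```python
# Monte Carlo check of Cov(Delta y_j, Delta y_k) = sigma_{e_1}^2 (arbitrary parameters).
import numpy as np

rng = np.random.default_rng(0)
A, B = 0.7, 1.3
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
sigma_e = np.array([0.10, 0.12, 0.15, 0.12, 0.11])   # std devs of e_1 .. e_5

n = 200_000
e = rng.normal(0.0, sigma_e, size=(n, 5))    # independent errors, one row per "experiment"
y = A * X + B + e                            # simulated measurements
dly = y[:, [0]] - y                          # Delta y_k = y_1 - y_k, one column per k

cov_23 = np.cov(dly[:, 1], dly[:, 2])[0, 1]  # sample Cov(Delta y_2, Delta y_3)
print(cov_23, "vs sigma_e1^2 =", sigma_e[0]**2)   # both ~0.01
```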
 
  • Like
Likes WWGD
  • #21
Malamala said:
why would subtracting a constant value (in this case ##y_4##, which is a constant in the sense that it has the same value for all ##\Delta y_i##) make them not independent anymore?
I was recommending subtracting a constant; you were the one suggesting not to subtract a constant. If you subtract a constant then the errors are independent and unchanged, including the error on the reference point, so ##d(\Delta y)_1=dy_1##, and you will fit a model with both a slope and an intercept. If you are treating the reference point as having no error, so ##d(\Delta y)_1=0##, then you are not subtracting a constant, the errors are no longer independent, and you will fit a model with a slope but no intercept (at least, the intercept is not a free parameter).
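As an illustration of this recommendation (with invented numbers, and assuming `scipy.optimize.curve_fit` as the fitting routine): subtract a constant equal to the first measured value, keep every point's original uncertainty ##dy_i##, fit both a slope and an intercept, and then check whether the fitted intercept is consistent with zero.

```python
# Sketch of this recommendation with invented numbers: subtract a constant,
# keep the original per-point uncertainties, and fit slope AND intercept.
import numpy as np
from scipy.optimize import curve_fit

x  = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y  = np.array([5.02, 5.24, 5.39, 5.61, 5.85])   # made-up measurements
dy = np.array([0.10, 0.10, 0.11, 0.12, 0.12])   # original uncertainties, unchanged

c = y[0]            # treated as a fixed number from here on
shifted = y - c     # shifting the data does not change the errors

def line(x, a, b):
    return a * x + b

popt, pcov = curve_fit(line, x, shifted, sigma=dy, absolute_sigma=True)
(a, b), (da, db) = popt, np.sqrt(np.diag(pcov))
# Check that the fitted intercept is consistent with zero instead of forcing it.
print(f"slope = {a:.3f} +/- {da:.3f}, intercept = {b:.3f} +/- {db:.3f}")
```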
 
  • #22
Malamala said:
if ##y_i## and ##y_j## are independent, why would subtracting a constant value (in this case ##y_4##, which is a constant in the sense that it has the same value for all ##\Delta y_i##) make them not independent anymore?

Imagine 100 different experiments where the ##(x_i,y_i)## data is taken. If we picked the value ##y_4## from the first experiment (or picked an arbitrary value for ##y_4##) and always used it to compute ##\Delta y_j = (y_4 - y_j)## on subsequent experiments, then we would call that subtracting a constant from the data. However, if ##y_4## includes the result of a possibly random measurement error and we use a possibly different value of ##y_4## to compute ##(y_4 - y_j)## on subsequent experiments, then ##y_4## is a random variable, not a constant.

Considering ##y_4## as a random variable, is it surprising that the random quantities ##y_4 - y_2## and ##y_4 - y_3## show some dependence?

If you don't want to think about repeated experiments, then the concepts of "independent" and "dependent" have no meaning because they refer to properties of random variables. Likewise the things you denote by ##dy_4## and ##dy_3## have no meaning as properties of random variables unless we are permitted to imagine repeated measurements of ##y_4## and ##y_3##.
 
  • Like
Likes Dale
  • #23
You don't want to include a zero error point in your calculation. You want to handle it as a constraint. That can be done by only considering solutions that pass through your intercept. Thus your problem is to find the value of the slope that minimizes the weighted root sum square of the deviations given your intercept. I don't know if there is a simple closed form solution to this, so as a lazy engineer I would just iterate to a solution. The slope from the unconstrained solution will be a good starting guess. Calculate the weighted sum of squares for each slope and pick the smallest.
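If the correlations discussed earlier in the thread are ignored and the weights are taken as independent, ##w_i = 1/d(\Delta y_i)^2##, this constrained problem does have a closed form: minimizing ##\sum_i w_i(\Delta y_i - A(x_i - x_1))^2## over the single parameter ##A## gives ##A = \sum_i w_i (x_i - x_1)\Delta y_i \,/\, \sum_i w_i (x_i - x_1)^2##. A small sketch with made-up numbers (the diagonal-weight assumption is mine, not the poster's):

```python
# Closed-form constrained slope (diagonal weights assumed; invented numbers).
import numpy as np

x1  = 0.0
x   = np.array([1.0, 2.0, 3.0, 4.0])       # the other x_i
dly = np.array([0.22, 0.41, 0.58, 0.83])   # Delta y_i (made up)
sig = np.array([0.14, 0.14, 0.15, 0.16])   # d(Delta y_i)

w = 1.0 / sig**2
u = x - x1
A  = np.sum(w * u * dly) / np.sum(w * u**2)   # minimizer of sum w_i (dly_i - A*u_i)^2
dA = 1.0 / np.sqrt(np.sum(w * u**2))          # its standard error (treating sig as absolute)
print(f"constrained slope A = {A:.3f} +/- {dA:.3f}")
```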
 
  • #24
GlenLaRocca said:
Thus your problem is to find the value of the slope that minimizes the root sum square of your measurements given your intercept.

The problem described in the original post asks how to do a weighted least squares fit. A weighted least squares fit takes into account the different standard deviations associated with the different y-values. Hence it is more complicated than minimizing the root sum square of the deviations from a line. Minimizing the root sum square of the deviations without using weights is appropriate when the y-values all have the same standard deviation.
 
  • Like
Likes Dale

1. What does it mean to "weight" the data for a fit?

Weighting the data for a fit means assigning different levels of importance to different data points in a dataset. This is typically done in regression analysis to account for the fact that some data points may have more variability or uncertainty than others.

2. Why is weighting important in data fitting?

Weighting is important in data fitting because it allows for a more accurate representation of the data. By assigning different weights to data points, the fitting algorithm can take into account the variability and uncertainty of the data and produce a more precise fit.

3. How is the weight of a data point determined?

The weight of a data point is typically determined from the inverse of the variance (the squared uncertainty) associated with that data point. This means that data points with a larger variance or error will be given a lower weight, while those with a smaller variance or error will be given a higher weight.
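As a tiny illustration of inverse-variance weighting (made-up numbers), the familiar weighted mean uses exactly these weights:

```python
# Inverse-variance weights in the simplest setting: a weighted average.
import numpy as np

y     = np.array([10.2, 9.8, 10.5])
sigma = np.array([0.1, 0.3, 0.2])      # per-point uncertainties

w = 1.0 / sigma**2                     # inverse-variance weights
y_bar = np.sum(w * y) / np.sum(w)      # weighted mean
err   = 1.0 / np.sqrt(np.sum(w))       # its standard error
print(y_bar, "+/-", err)
```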

4. What are the benefits of weighting data for a fit?

Weighting data for a fit can lead to a more accurate and precise fit, as it takes into account the variability and uncertainty of the data. This can also help to reduce the impact of outliers in the dataset, resulting in a more robust fit.

5. Are there any potential drawbacks to weighting data for a fit?

One potential drawback of weighting data for a fit is that it can introduce bias into the results if the weights are not chosen carefully. Additionally, weighting can also increase the complexity and computation time of the fitting process, especially for large datasets.
