How to perform a fit with correlated variables

  • Context: Graduate
  • Thread starter: BillKet
  • Tags: Fit, Variables

Discussion Overview

The discussion revolves around the challenges of fitting a model with several variables, particularly focusing on two highly correlated variables. Participants explore the implications of multicollinearity on parameter estimation and fitting techniques, including least squares fitting and regression methods. The context includes theoretical considerations from a physics model related to diatomic molecules.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes the fitting process and the impact of correlation on the uncertainty of parameters p and q, noting that fixing one variable reduces uncertainty but increases RMS error.
  • Another participant suggests using forward stepwise linear regression to handle correlated variables, mentioning techniques like forward selection and backward elimination.
  • Some participants clarify that linear regression can be applied to non-linear relationships by transforming the data, though one participant expresses skepticism about fitting a complex model with a simple linear approach.
  • Concerns about multicollinearity are raised, with suggestions to eliminate one of the correlated variables or use model selection criteria like AIC or BIC.
  • Participants discuss the necessity of both parameters based on the underlying physics model, questioning how to appropriately manage the correlation without compromising the model's integrity.
  • One participant proposes applying a standard linear regression to a reformulated model to accommodate both parameters, acknowledging the ill-conditioned nature of the problem.

Areas of Agreement / Disagreement

Participants express differing views on how to handle the correlated variables, with some advocating for elimination or fixing one parameter, while others emphasize the need to retain both due to theoretical considerations. The discussion remains unresolved regarding the best approach to fitting the model.

Contextual Notes

Participants note that the correlation between variables leads to instability in parameter estimates, and the choice of fitting method may depend on the specific characteristics of the data and model. There is an acknowledgment of the limitations of statistical methods in addressing multicollinearity.

BillKet
Hello! I need to perform a fit with several variables, and two of them are very strongly correlated (correlation above 0.99). The two parameters enter the model as ##(p+q)x+qf(x)##, where ##f(x)## contains polynomials and some square roots of ##x##, but the coefficients appearing in ##f(x)## are much smaller than one, for example something like ##10^{-7} x^2##. (For completeness, though not very relevant to my question: this comes from fitting the p and q parameters of the lambda doubling in a ##^2\Pi_{1/2}## state of a diatomic molecule.)

If I keep both p and q as free parameters, I end up with values around p=0.1 and q=0.001, with an error on both of the order of 0.0001, and a very good RMS error for the points used in the fit. If I fix q at zero, the uncertainty on p becomes 10 times smaller, but the RMS error is about 50% bigger. I also tried fixing q at its fitted value, i.e. q=0.001, and fitting only p. In this case the RMS was as good as initially (even slightly better) and the uncertainty on p was 10 times smaller than initially.

I am not sure what the best way to present my results is. If I let both p and q vary, the uncertainty on p is big, but it feels like that doesn't reflect the truth, as that error is mainly driven by q, since they appear as p+q. If I fixed q=0.001, the errors on q and p would differ by a factor of 10, and I am not sure that makes sense mathematically, since they do appear as p+q. Can someone advise me on the best way to proceed? Thank you!
 
You do not say if you are using linear regression or some other technique. Forward stepwise linear regression would make a model with the highest correlated variable first. Then it would remove the correlated part from the other variables and see if it is statistically reasonable to introduce the remainder into the model. There are techniques called forward selection, backward elimination, and bidirectional elimination.
See https://en.wikipedia.org/wiki/Stepwise_regression
There are critics of these methods, but that is true of all statistical methods. All statistical methods should be used wisely.
If you are using some non-linear model, I think that you could still remove the correlated part of one of your variables and see if the remainder is statistically reasonable to add to the model after the first one is included.
 
FactChecker said:
You do not say if you are using linear regression or some other technique. Forward stepwise linear regression would make a model with the highest correlated variable first. Then it would remove the correlated part from the other variables and see if it is statistically reasonable to introduce the remainder into the model. There are techniques called forward selection, backward elimination, and bidirectional elimination.
See https://en.wikipedia.org/wiki/Stepwise_regression
There are critics of these methods, but that is true of all statistical methods. All statistical methods should be used wisely.
I am not sure what you mean by linear regression. Isn't that meant only if the dependence is linear? I am using least squares fitting.
 
BillKet said:
I am not sure what you mean by linear regression. Isn't that meant only if the dependence is linear? I am using least squares fitting.
Linear regression uses least-squares fitting and is not as restrictive as you might initially think.
Suppose you are looking for the relationship between ##X## and ##Y##, with ##Y## a function of ##X##.
The regression finds the least-squares linear model, but you can apply it to non-linear relationships. You can try linear regression on a model ##Y = aX+b##, but if the relationship looks more like ##Y = aX^2+b##, you can apply linear regression on that. Just square all the ##X## data.
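A minimal sketch of that idea, assuming NumPy and synthetic data. "Linear" here means linear in the coefficients ##a## and ##b##, so squaring the ##X## data first turns ##Y = aX^2+b## into an ordinary linear least-squares problem. The values `a_true`, `b_true`, and the noise level are made up for illustration and do not come from the thread.

```python
import numpy as np

# Synthetic data following Y = a*X^2 + b plus Gaussian noise (all values made up).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
a_true, b_true = 2.0, -1.0
y = a_true * x**2 + b_true + rng.normal(scale=0.1, size=x.size)

# "Just square all the X data": the regressors are X^2 and a constant column.
design = np.column_stack([x**2, np.ones_like(x)])

# Ordinary linear least squares for the coefficients (a, b).
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a_fit, b_fit = coef
print(f"a = {a_fit:.4f}, b = {b_fit:.4f}")
```

The same pattern works for any model that is a sum of unknown coefficients times known functions of ##X##: each known function becomes one column of the design matrix.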
 
FactChecker said:
Linear regression uses least-squares fitting and is not as restrictive as you might initially think.
Suppose you are looking for the relationship between ##X## and ##Y##, with ##Y## a function of ##X##.
The regression finds the least-squares linear model, but you can apply it to non-linear relationships. You can try linear regression on a model ##Y = aX+b##, but if the relationship looks more like ##Y = aX^2+b##, you can apply linear regression on that. Just square all the ##X## data.
But my relationship is a lot more complicated than that. For example I have something of the form:

$$Bx(x+1)+D\left(x(x+1)\right)^2+(p+q)x+q\cdot 10^{-7}\sqrt{x}$$

I do know the functional form of my equation; I don't understand how I can fit a line to this.
 
BillKet said:
Can someone advise me on what is the best way to proceed?
What you are running into is called multicollinearity. Or maybe, since it is just two correlated variables, just collinearity.

The easiest thing to do is to just eliminate one of the collinear variables. You can use the AIC or the BIC to choose which model is better if you don’t have a good theoretical reason for choosing one. Or you can use a more rigorous model-building approach like stepwise regression.

You can keep both parameters as long as you are not trying to make inferences about the parameter values. Keeping both will still give good fits to the data, but the parameter values themselves are fundamentally unstable.
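As a rough sketch of the AIC/BIC comparison mentioned above: for least-squares fits with Gaussian errors, both criteria can be written in terms of the residual sum of squares (RSS), up to a constant shared by the models being compared. The only number taken from the thread is that the RMS error is about 50% bigger with q fixed at zero. The number of data points `n`, the definition RMS = sqrt(RSS/n), and the parameter counts (4 free parameters versus 3) are assumptions made for illustration.

```python
import math

def aic_bic(rss, n, k):
    """AIC and BIC for a Gaussian least-squares fit, up to a common constant."""
    base = n * math.log(rss / n)
    return base + 2 * k, base + k * math.log(n)

n = 50                      # hypothetical number of fitted points
rms_full = 1.0              # RMS of the fit with p and q free (arbitrary units)
rms_q0 = 1.5 * rms_full     # "about 50% bigger" with q fixed at zero

rss_full = n * rms_full**2  # assumes RMS = sqrt(RSS / n)
rss_q0 = n * rms_q0**2

aic_full, bic_full = aic_bic(rss_full, n, k=4)  # assumed: 4 free parameters
aic_q0, bic_q0 = aic_bic(rss_q0, n, k=3)        # assumed: 3 free parameters

print(f"AIC: full = {aic_full:.2f}, q=0 = {aic_q0:.2f}")
print(f"BIC: full = {bic_full:.2f}, q=0 = {bic_q0:.2f}")
```

Lower values are preferred. With these assumed inputs the extra parameter costs far less than the improvement in RSS buys, but the conclusion depends on the real `n` and RSS values.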
 
Sorry. I jumped to conclusions before thoroughly reading your initial post. It looks to me as though the last term will be hard to determine since its contribution is so small. I assume that ##B##, ##D##, ##p##, and ##q## are the unknown parameters. In that case you might consider applying linear regression to the model ##Y=BX_1+DX_2+pX_3+qX_4##, where ##X_1=x(x+1)##, ##X_2=(x(x+1))^2##, ##X_3=x##, and ##X_4=x+10^{-7}\sqrt x##.
I see your point that ##pX_3+qX_4## is problematic, nearly redundant. I wonder what a stepwise linear regression would do with it.
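To get a feel for how redundant ##X_3## and ##X_4## are, one can build the design matrix of this model and look at its conditioning. This is only a sketch, assuming NumPy; the grid of `x` values (1 to 40) is hypothetical, since the thread does not give the actual data.

```python
import numpy as np

x = np.arange(1.0, 41.0)          # hypothetical x values, not from the thread

# Regressors of the model Y = B*X1 + D*X2 + p*X3 + q*X4.
X1 = x * (x + 1)
X2 = (x * (x + 1)) ** 2
X3 = x
X4 = x + 1e-7 * np.sqrt(x)
A = np.column_stack([X1, X2, X3, X4])

# Correlation between the two nearly redundant regressors.
corr = np.corrcoef(X3, X4)[0, 1]

# Condition number after scaling each column to unit length, so that the
# result reflects collinearity rather than the different magnitudes of X1..X4.
A_scaled = A / np.linalg.norm(A, axis=0)
cond = np.linalg.cond(A_scaled)

print(f"corr(X3, X4) = {corr:.12f}")
print(f"condition number = {cond:.3e}")
```

A large condition number means that small changes in the data can produce large changes in the individual fitted values of ##p## and ##q##, which is the ill-conditioning discussed in the thread.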
 
Dale said:
What you are running into is called multicollinearity. Or maybe, since it is just two correlated variables, just collinearity.

The easiest thing to do is to just eliminate one of the collinear variables. You can use the AIC or the BIC to choose which model is better if you don’t have a good theoretical reason for choosing one. Or you can use a more rigorous model-building approach like stepwise regression.

You can keep both parameters as long as you are not trying to make inferences about the parameter values. Keeping both will still give good fits to the data, but the parameter values themselves are fundamentally unstable.
So by eliminating one parameter, do you mean setting it to zero? Based on the physics model upon which this equation is built, I do need both parameters. Basically, if I set q=0, p can take over the initial q value in the p+q term, but the sqrt term will vanish, and hence the model will be wrong. If I set p to zero, q would need to become 2 orders of magnitude bigger to take over the p+q part, but then the sqrt part will be too big. I am not sure how I can get rid of one of the parameters without using a wrong model.
 
BillKet said:
So by eliminating one parameter, do you mean setting it to zero? Based on the physics model upon which this equation is built, I do need both parameters. Basically, if I set q=0, p can take over the initial q value in the p+q term, but the sqrt term will vanish, and hence the model will be wrong. If I set p to zero, q would need to become 2 orders of magnitude bigger to take over the p+q part, but then the sqrt part will be too big. I am not sure how I can get rid of one of the parameters without using a wrong model.
Then I think you should try a standard linear regression that will force both terms into the model of post #7 and see what you get. At least it would fit your theory. It would be the least-squares model.
As @Dale says, it would be a very ill-conditioned problem.
 
  • #10
FactChecker said:
Then I think you should try a standard linear regression that will force both terms into the model of post #7 and see what you get. At least it would fit your theory. It would be the least-squares model.
As @Dale says, it would be a very ill-conditioned problem.
I see what you mean by linear regression in this case, thanks! But that is basically how I did the fit, i.e. I forced both terms into the fit. The fit looks great, and the values of p and q are close to what I would expect from theory. My only concern is with the uncertainties on p and q. I saw in other molecular physics papers people fixing one of the parameters when it was very correlated with another, but I am not sure how to quote the errors in that case. I guess it depends on the field (and hence the readers), but I was wondering how you would quote the values and uncertainties in this situation.
 
  • #11
BillKet said:
I see what you mean by linear regression in this case, thanks! But that is basically how I did the fit, i.e. I forced both terms into the fit. The fit looks great, and the values of p and q are close to what I would expect from theory. My only concern is with the uncertainties on p and q. I saw in other molecular physics papers people fixing one of the parameters when it was very correlated with another, but I am not sure how to quote the errors in that case. I guess it depends on the field (and hence the readers), but I was wondering how you would quote the values and uncertainties in this situation.
I'm sorry that I don't feel qualified to answer that question. Perhaps others with knowledge of the molecular physics papers that you refer to can give you better advice. You might want to provide links to those papers and ask specific questions about them. In that case, there might be a better section of this forum to ask the question.
 
  • #12
FactChecker said:
I'm sorry that I don't feel qualified to answer that question. Perhaps others with knowledge of the molecular physics papers that you refer to can give you better advice. You might want to provide links to those papers and ask specific questions about them. In that case, there might be a better section of this forum to ask the question.
Oh sorry for the confusion, I meant, assuming you were to publish this in your own field (not molecular spectroscopy), how would you present your results.
 
  • #13
BillKet said:
Oh sorry for the confusion, I meant, assuming you were to publish this in your own field (not molecular spectroscopy), how would you present your results.
Sorry. This is a very extreme case, where the difference between the two terms is seven orders of magnitude smaller than the terms themselves. I have no experience with that, other than numerical issues on the computer.
 
  • #14
FactChecker said:
Sorry. This is a very extreme case, where the difference between the two terms is seven orders of magnitude smaller than the terms themselves. I have no experience with that, other than numerical issues on the computer.
That's totally ok, thanks a lot for the insights! Just for reference (and for others reading), the resolution of the experiment is good enough that the sqrt term does make a difference when performing the fit.
 
  • #15
BillKet said:
That's totally ok, thanks a lot for the insights! Just for reference (and for others reading), the resolution of the experiment is good enough that the sqrt term does make a difference when performing the fit.
Stepwise regression and analysis of variance (ANOVA) methods would calculate the coefficient of partial determination to see if the additional term (with appropriately adjusted coefficients) is statistically justified. There is a probability associated with that ratio, but I do not know if that is appropriate for your application.
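For illustration only, the coefficient of partial determination and the associated partial F statistic can be computed from the residual sums of squares (RSS) of the reduced and full fits. The sketch below uses the one relevant number given in the thread (RMS about 50% bigger with q fixed at zero) and assumes RMS = sqrt(RSS/n), a hypothetical number of points `n`, and 4 free parameters in the full model. Turning the F statistic into a probability would need an F-distribution routine, which is left out here.

```python
n = 50                       # hypothetical number of fitted points
k_full, k_reduced = 4, 3     # assumed parameter counts: with q, without q

rss_full = n * 1.0**2        # full model, RMS = 1.0 in arbitrary units
rss_reduced = n * 1.5**2     # reduced model, RMS about 50% bigger

# Fraction of the reduced model's unexplained variation removed by adding q.
partial_r2 = (rss_reduced - rss_full) / rss_reduced

# Partial F statistic for adding (k_full - k_reduced) parameters.
f_stat = ((rss_reduced - rss_full) / (k_full - k_reduced)) / (
    rss_full / (n - k_full)
)

print(f"coefficient of partial determination = {partial_r2:.4f}")
print(f"partial F statistic = {f_stat:.2f}")
```

Note that the coefficient of partial determination depends only on the RMS ratio under these assumptions, while the F statistic also depends on the assumed `n`.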
 
  • #16
BillKet said:
So by eliminating one parameter, do you mean setting it to zero?
I actually mean a model without that parameter at all. Sometimes a model without a given parameter is equivalent to a model with the parameter set to zero, sometimes set to one, sometimes some other value. It depends on the model.

BillKet said:
Based on the physics model upon which this equation is built, I do need both parameters.
OK, that is fine then. But you cannot make inferences about the values of the two parameters. You need to restrict your use of the model to making inferences about predictions. The predictions will still be valid even though the parameter estimates will not.
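A toy illustration of that distinction, assuming NumPy: the same noise-free curve is fitted twice with two different noise realizations. The `x` grid, the noise level `sigma`, and the use of only the two collinear regressors are assumptions; the "true" values p=0.1 and q=0.001 are the approximate fitted values quoted earlier in the thread. In this toy the noise is far too large to resolve the sqrt term, which the original poster says is not the case for the real data.

```python
import numpy as np

x = np.arange(1.0, 41.0)                          # hypothetical x grid
A = np.column_stack([x, x + 1e-7 * np.sqrt(x)])   # regressors multiplying p and q
p_true, q_true = 0.1, 0.001                       # approximate values from the thread
sigma = 1e-3                                      # assumed noise level
y_clean = A @ np.array([p_true, q_true])

fits = []
for seed in (0, 1):
    rng = np.random.default_rng(seed)
    y = y_clean + rng.normal(scale=sigma, size=x.size)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fits.append(coef)

coef_a, coef_b = fits
param_gap = np.abs(coef_a - coef_b)                  # change in (p, q) between fits
pred_gap = np.max(np.abs(A @ coef_a - A @ coef_b))   # largest change in predictions

print("fit A (p, q):", coef_a)
print("fit B (p, q):", coef_b)
print("largest difference between predictions:", pred_gap)
```

The individual values of p and q can move a lot between the two fits while the predicted curves stay within a few noise standard deviations of each other.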
 
  • #17
BillKet said:
If I let both p and q vary, the uncertainty on p is big, but it feels like that doesn't reflect the truth, as that error is mainly influenced by q, as they appear as p+q. If I fixed q=0.001, the errors on q and p would be different by a factor of 10 and I am not sure if that makes sense mathematically, as they do appear as p+q.
If I understand you correctly, I believe your analysis is spot on and there isn't anything you can do about it. The model doesn't have any more information than this. It only has a high degree of certainty on p+q, not p. I believe no trickery will get around that fundamental issue.
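One way to see this, as a sketch under stated assumptions rather than a statement about the actual fit: write ##s = p+q##, so that ##p = s - q##. Standard propagation of uncertainty then gives

$$\sigma_p^2 = \sigma_s^2 + \sigma_q^2 - 2\,\mathrm{Cov}(s,q), \qquad \mathrm{Cov}(p,q) = \mathrm{Cov}(s,q) - \sigma_q^2 .$$

If one assumes ##\mathrm{Cov}(s,q)\approx 0## and takes ##\sigma_q \approx 10^{-4}## and ##\sigma_s \approx 10^{-5}## (the uncertainty on p reported in the thread when q is held fixed), then

$$\sigma_p \approx \sqrt{10^{-10}+10^{-8}} \approx 1.005\times 10^{-4}, \qquad \rho_{pq} = \frac{\mathrm{Cov}(p,q)}{\sigma_p\,\sigma_q} \approx -\frac{\sigma_q}{\sigma_p} \approx -0.995 .$$

These numbers are consistent with what is reported in the thread: similar errors on p and q when both are free, and a correlation above 0.99 in magnitude. The assumption ##\mathrm{Cov}(s,q)\approx 0## is made here only for illustration; the real value would come from the fit's covariance matrix.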

If you want to test it, try comparing the covariance matrices you get when you fit with the parameters (p, q) and with (p+q, q). In the former case you should see large off-diagonal terms, and in the latter case I believe the off-diagonal (correlation) term will be small. If you want to present your results with no ambiguity, I would present the whole covariance matrix for (p, q). Alternatively, you could present just the errors on p+q and q (assuming their correlation is small) and put a note in the supplementary materials of your paper (if it has any). I think both of those would be very honest and upfront presentations of your result.
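A minimal sketch of that comparison, assuming NumPy and a linear least-squares model with independent Gaussian noise, for which the parameter covariance is ##\sigma^2 (A^T A)^{-1}## with ##A## the design matrix. The `x` grid, the noise level `sigma`, and the function `f` are hypothetical stand-ins, not the actual model from the thread, and the other parameters (##B##, ##D##) are left out for brevity.

```python
import numpy as np

def param_cov(A, sigma):
    """Parameter covariance of a linear least-squares fit with i.i.d. noise."""
    return sigma**2 * np.linalg.inv(A.T @ A)

def corr_from_cov(C):
    """Convert a covariance matrix into a correlation matrix."""
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

x = np.arange(1.0, 41.0)       # hypothetical x grid
f = 1e-2 * np.sqrt(x)          # hypothetical f(x), NOT the thread's actual function
sigma = 1e-3                   # assumed measurement noise

A_pq = np.column_stack([x, x + f])   # y = p*x + q*(x + f(x))
A_sq = np.column_stack([x, f])       # y = s*x + q*f(x), with s = p + q

C_pq = param_cov(A_pq, sigma)
C_sq = param_cov(A_sq, sigma)
rho_pq = corr_from_cov(C_pq)[0, 1]
rho_sq = corr_from_cov(C_sq)[0, 1]

print(f"corr(p, q)   = {rho_pq:.8f}")
print(f"corr(p+q, q) = {rho_sq:.4f}")
print(f"sigma_p = {np.sqrt(C_pq[0, 0]):.3e}, sigma_(p+q) = {np.sqrt(C_sq[0, 0]):.3e}")
```

In this toy the (p+q, q) parameterization gives a much smaller uncertainty on the first parameter and a weaker correlation than (p, q), but how small the remaining correlation gets depends on the actual ##f(x)##, the range of ##x##, and the other terms in the model, so it is worth checking on the real fit.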

I hope that was helpful!
 