# How to perform a fit with correlated variables

#### BillKet

Hello! I need to perform a fit with several parameters, and two of them are very strongly correlated (above 0.99). They enter the model in a form like ##(p+q)x+qf(x)##, where ##f(x)## contains polynomials and some square roots of ##x##, but the coefficients appearing in ##f(x)## are much smaller than one, for example something like ##10^{-7} x^2##. (For completeness, though not very relevant to my question, this comes from fitting the p and q parameters of the lambda doubling in a ##^2\Pi_{1/2}## state of a diatomic molecule.)

If I keep both p and q free, I end up with values around p=0.1 and q=0.001, with an error on both of the order of 0.0001 and a very good RMS error for the points used in the fit. If I fix q=0, the uncertainty on p becomes 10 times smaller, but the RMS error is about 50% bigger. I also tried fixing q at its fitted value, i.e. q=0.001, and fitting only p. In this case the RMS was as good as initially (even slightly better) and the uncertainty on p was 10 times smaller than initially.

I am not sure what the best way to present my results is. If I let both p and q vary, the uncertainty on p is big, but it feels like that doesn't reflect the truth, as that error is mainly driven by q, since they appear as p+q. If I fix q=0.001, the errors on q and p would differ by a factor of 10, and I am not sure that makes sense mathematically, as they do appear as p+q. Can someone advise me on the best way to proceed? Thank you!

You do not say whether you are using linear regression or some other technique. Forward stepwise linear regression builds the model starting with the variable most highly correlated with the response. It then removes the correlated part from the other variables and checks whether it is statistically reasonable to introduce the remainder into the model. The related techniques are called forward selection, backward elimination, and bidirectional elimination.
See https://en.wikipedia.org/wiki/Stepwise_regression
There are critics of these methods, but that is true of all statistical methods. All statistical methods should be used wisely.
If you are using some non-linear model, I think that you could still remove the correlated part of one of your variables and see if the remainder is statistically reasonable to add to the model after the first one is included.

I am not sure what you mean by linear regression. Isn't that only applicable if the dependence is linear? I am using least squares fitting.

Linear regression uses least-squares fitting and is not as restrictive as you might initially think.
Suppose you are looking for the relationship between ##X## and ##Y##, with ##Y## a function of ##X##.
The regression finds the least-squares linear model, but you can apply it to non-linear relationships. You can try linear regression on a model ##Y = aX+b##, but if the relationship looks more like ##Y = aX^2+b##, you can apply linear regression on that. Just square all the ##X## data.
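A minimal sketch of that idea, assuming NumPy and made-up data: the model ##Y = aX^2+b## is nonlinear in ##X## but linear in the unknown coefficients ##a## and ##b##, so ordinary least squares applies once the ##X## column is squared. The true values 2.0 and 1.0 and the noise level are illustrative assumptions only.

```python
import numpy as np

# Synthetic data for illustration (a = 2, b = 1 are assumed values).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x**2 + 1.0 + rng.normal(scale=0.5, size=x.size)

# Design matrix: the squared regressor and a constant column.
A = np.column_stack([x**2, np.ones_like(x)])

# Ordinary least squares on the transformed regressor.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a_hat, b_hat = coef
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```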

But my relationship is a lot more complicated than that. For example I have something of the form:

$$Bx(x+1)+D(x(x+1))^2+(p+q)x+q\cdot 10^{-7}\sqrt{x}$$

I do know the functional form of my equation, but I don't understand how I can fit a line to this.

> Can someone advise me on the best way to proceed?

What you are running into is called multicollinearity. Or maybe, since it is just two correlated variables, simply collinearity.

The easiest thing to do is to eliminate one of the collinear variables. If you don't have a good theoretical reason for choosing one model over the other, you can use the AIC or the BIC to decide which is better, or a more rigorous model-building approach like stepwise regression.

You can keep both parameters as long as you are not trying to make inferences about the parameter values. Keeping both will still give good fits to the data, but the parameter values themselves are fundamentally unstable.
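As a rough sketch of an AIC/BIC comparison for least-squares fits, assuming Gaussian errors of unknown variance (in which case both criteria reduce to ##n\ln(\mathrm{RSS}/n)## plus a penalty, up to a constant shared by all models): the helper names `aic_bic` and `rss_of` are hypothetical, the data are synthetic, and the square-root coefficient is deliberately exaggerated so that the example is numerically tame.

```python
import numpy as np

def aic_bic(rss, n, k):
    """AIC and BIC for a least-squares fit, assuming Gaussian errors of
    unknown variance; additive constants common to all models are dropped."""
    log_term = n * np.log(rss / n)
    return log_term + 2 * k, log_term + k * np.log(n)

def rss_of(A, y):
    """Residual sum of squares of the least-squares fit of y on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

# Synthetic data (all numbers hypothetical; sqrt coefficient exaggerated).
rng = np.random.default_rng(0)
x = np.arange(1.0, 41.0)
g = 50.0 * np.sqrt(x)
y = 0.1 * x + 0.001 * (x + g) + rng.normal(scale=1e-3, size=x.size)

A_full = np.column_stack([x, x + g])  # model with both p and q
A_reduced = x[:, None]                # model with q dropped

n = x.size
aic_full, bic_full = aic_bic(rss_of(A_full, y), n, k=2)
aic_reduced, bic_reduced = aic_bic(rss_of(A_reduced, y), n, k=1)
print("full:   ", aic_full, bic_full)
print("reduced:", aic_reduced, bic_reduced)
```

The model with the lower AIC (or BIC) is preferred. Some conventions also count the noise variance as a parameter, which shifts every model's score by the same amount and so does not change the ranking.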

Sorry. I jumped to conclusions before thoroughly reading your initial post. It looks to me as though the last term will be hard to determine since its contribution is so small. I assume that ##B##, ##D##, ##p##, and ##q## are the unknown parameters. In that case you might consider applying linear regression to the model ##Y=BX_1+DX_2+pX_3+qX_4##, where ##X_1=x(x+1)##, ##X_2=(x(x+1))^2##, ##X_3=x##, and ##X_4=x+10^{-7}\sqrt x##.
I see your point that ##pX_3+qX_4## is problematic, nearly redundant. I wonder what a stepwise linear regression would do with it.
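To get a feel for how redundant ##X_3## and ##X_4## are, one could build the design matrix and inspect the column correlation and the condition number. This is only a sketch: the grid of ##x## values (1 to 40) is an assumption, not the actual data.

```python
import numpy as np

# Hypothetical x values; the real measurement grid is not given in the thread.
x = np.arange(1.0, 41.0)

X1 = x * (x + 1)
X2 = (x * (x + 1)) ** 2
X3 = x
X4 = x + 1e-7 * np.sqrt(x)
A = np.column_stack([X1, X2, X3, X4])

# Correlation between the two nearly identical regressors.
corr_34 = np.corrcoef(X3, X4)[0, 1]

# Condition number of the raw design matrix, and after scaling each
# column to unit norm (which removes pure scale differences).
cond_raw = np.linalg.cond(A)
A_scaled = A / np.linalg.norm(A, axis=0)
cond_scaled = np.linalg.cond(A_scaled)

print("corr(X3, X4) =", corr_34)
print("cond(A) =", cond_raw, " cond(A scaled) =", cond_scaled)
```

A very large condition number, even after column scaling, indicates that small changes in the data can produce large changes in the individual estimates of ##p## and ##q##.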

So by eliminating one parameter, do you mean setting it to zero? Based on the physics model on which this equation is built, I do need both parameters. Basically, if I set q=0, p can absorb the original q value in the p+q term, but the sqrt term vanishes, and hence the model is wrong. If I set p to zero, q would need to become two orders of magnitude bigger to take over the p+q part, but then the sqrt part would be too big. I am not sure how I can get rid of one of the parameters without using a wrong model.

Then I think you should try a standard linear regression that will force both terms into the model of post #7 and see what you get. At least it would fit your theory. It would be the least-squares model.
As @Dale says, it would be a very ill-conditioned problem.

I see what you mean by linear regression in this case, thanks! But that is basically how I did the fit, i.e. I forced both terms into it. The fit looks great, and the values of p and q are close to what I would expect from theory. My only concern is the uncertainties on p and q. I have seen other molecular physics papers in which one parameter was fixed when it was very correlated with another, but I am not sure how to quote the errors in that case. I guess it depends on the field (and hence the readers), but I was wondering how you would quote the values and uncertainties in this situation.

I'm sorry that I don't feel qualified to answer that question. Perhaps others with knowledge of the molecular physics papers that you refer to can give you better advice. You might want to provide links to those papers and ask specific questions about them. In that case, there might be a better section of this forum to ask the question.

Oh, sorry for the confusion. I meant: assuming you were to publish this in your own field (not molecular spectroscopy), how would you present your results?

Sorry. This is a very extreme case, where the two terms differ only at a level seven orders of magnitude below their own size. I have no experience with that, other than with numerical issues on the computer.

That's totally OK, thanks a lot for the insights! Just for reference (and for others reading), the resolution of the experiment is good enough that the sqrt term does make a difference when performing the fit.

Stepwise regression and analysis of variance (ANOVA) methods calculate the coefficient of partial determination to see whether the additional term (with appropriately adjusted coefficients) is statistically justified. There is a probability associated with that ratio, but I do not know if that is appropriate for your application.
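For reference, a small sketch of how those two quantities are usually computed from the residual sums of squares of the reduced model (without the extra term) and the full model (with it). The function name and the example numbers are hypothetical.

```python
def partial_r2_and_f(rss_reduced, rss_full, n, k_full, k_added=1):
    """Coefficient of partial determination and partial F statistic.

    rss_reduced: residual sum of squares without the added term(s)
    rss_full:    residual sum of squares with the added term(s)
    n:           number of data points
    k_full:      number of fitted parameters in the full model
    k_added:     number of parameters added going from reduced to full
    """
    partial_r2 = (rss_reduced - rss_full) / rss_reduced
    f_stat = ((rss_reduced - rss_full) / k_added) / (rss_full / (n - k_full))
    return partial_r2, f_stat

# Made-up numbers purely for illustration.
r2, f = partial_r2_and_f(rss_reduced=10.0, rss_full=4.0, n=24, k_full=4)
print(r2, f)
```

The F statistic would then be compared with an F distribution with `k_added` and `n - k_full` degrees of freedom to obtain the probability mentioned above.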

> So by eliminating one parameter, do you mean setting it to zero?

I actually mean a model without that parameter at all. Sometimes a model without a given parameter is equivalent to a model with the parameter set to zero, sometimes set to one, sometimes some other value. It depends on the model.

> Based on the physics model upon which this equation is built, I do need both parameters.

OK, that is fine then. But you cannot make inferences about the values of the two parameters. You need to restrict your use of the model to making inferences about predictions. The predictions will still be valid even though the parameter estimates will not.

> If I let both p and q vary, the uncertainty on p is big, but it feels like that doesn't reflect the truth, as that error is mainly influenced by q, as they appear as p+q. If I fixed q=0.001, the errors on q and p would be different by a factor of 10 and I am not sure if that makes sense mathematically, as they do appear as p+q.

If I understand you correctly, I believe your analysis is spot on and there isn't anything you can do about it. The model doesn't have any more information than this. It only has a high degree of certainty on p+q, not p. I believe no trickery will get around that fundamental issue.
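As a rough worked version of this point, standard error propagation for the sum is shown below. The value ##\rho\approx-0.99## is an assumption: the thread only says the correlation is above 0.99 in magnitude, and a negative sign is what one would expect when the sum is well determined. The uncertainties ##\sigma_p\approx\sigma_q\approx10^{-4}## are the figures quoted in the first post.

$$\sigma_{p+q}^2=\sigma_p^2+\sigma_q^2+2\rho\,\sigma_p\sigma_q$$

$$\sigma_{p+q}\approx 10^{-4}\sqrt{2\,(1-0.99)}\approx 1.4\times10^{-5}$$

Under these assumed numbers, the sum is known roughly seven times better than either parameter alone.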

If you want to test it, try comparing the covariance matrices you get when you fit with the parameters (p,q) and with (p+q,q). In the former case you should see large off-diagonal terms, and in the latter case I believe the off-diagonal (correlation) term will be small. If you want to present your results with no ambiguity, I would present the whole covariance matrix for (p,q). Alternatively, you could just present the errors on p+q and q (assuming the correlation between them is small) and put a note in the supplementary materials of your paper (if it has any). I think both of those would be very honest and upfront presentations of your result.
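A sketch of that comparison on synthetic data, assuming NumPy and the usual linear least-squares covariance estimate ##s^2(A^TA)^{-1}##. Everything here is illustrative: the ##B## and ##D## terms are left out for brevity, the helper names are hypothetical, and the square-root coefficient `c` is set far larger than ##10^{-7}## so the example stays numerically tame.

```python
import numpy as np

def fit_with_cov(A, y):
    """Least-squares fit; returns the estimates and their covariance matrix."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = A.shape[0] - A.shape[1]
    s2 = resid @ resid / dof           # residual variance estimate
    cov = s2 * np.linalg.inv(A.T @ A)  # standard linear least-squares covariance
    return coef, cov

def corr_from_cov(cov):
    """Convert a covariance matrix into a correlation matrix."""
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

# Synthetic data (all values are assumptions, not the thread's data).
rng = np.random.default_rng(1)
x = np.arange(1.0, 41.0)
c = 0.05                      # hypothetical sqrt-term coefficient
p_true, q_true = 0.1, 0.001
g = c * np.sqrt(x)
y = (p_true + q_true) * x + q_true * g + rng.normal(scale=1e-5, size=x.size)

# Parametrization (p, q): y = p*x + q*(x + g)
est_pq, cov_pq = fit_with_cov(np.column_stack([x, x + g]), y)
# Parametrization (s, q) with s = p + q: y = s*x + q*g
est_sq, cov_sq = fit_with_cov(np.column_stack([x, g]), y)

corr_pq = corr_from_cov(cov_pq)[0, 1]
corr_sq = corr_from_cov(cov_sq)[0, 1]
sigma_p = np.sqrt(cov_pq[0, 0])
sigma_sum = np.sqrt(cov_sq[0, 0])

print("corr(p, q)   =", corr_pq, " sigma_p     =", sigma_p)
print("corr(p+q, q) =", corr_sq, " sigma_(p+q) =", sigma_sum)
```

Both parametrizations describe the same fit, so the predictions are identical and only the way the uncertainty is reported changes. Whether the off-diagonal term in the (p+q, q) parametrization actually becomes small depends on the data and on how distinct the ##\sqrt{x}## column is from ##x## over the measured range, so it is worth checking rather than assuming.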
