Help with a difficult (?) regression

In summary, the conversation discusses a nonlinear regression equation and the difficulties the poster is having in finding accurate estimates for the parameters K and n. A linearisation-based method did not work well, owing to the distribution of the errors and the near-dependence between the terms in the equation, and a nonlinear method did not give good results either. The poster is unsure why this is happening and asks for suggestions on how to estimate the parameters more reliably. The conversation ends with a potential way forward: fix n, which seems to be the most theoretically constrained parameter, and fit K alone.
  • #1
lavoisier
I wonder if someone can please help me understand why a nonlinear regression I'm attempting doesn't work, and suggest how I can tackle it.

It's based on this equation:

[itex](1-y) \cdot (D \cdot y+K) = n \cdot x \cdot y[/itex]

where x is the independent variable, a real number usually between 6400 and 640000, and y is the dependent variable, a real number ∈ (0,1).
D, K and n are real constants. D is known and is usually 1000.
K and n are the ones we want to estimate. K is a real positive number; n is also positive and is usually expected to be a small integer (1, 2 or 3 most of the time). But we accept that n may be non-integer.

The book I took the theory from says the equation can be linearised, so one can run the usual least squares fit.
Here's the equation they suggest:

[itex]\frac {x} {1-y} = \frac {K} {n} \cdot \frac {1} {y} + \frac {D} {n} [/itex]

Of course the idea is to plot the LHS vs 1/y, find a line fitting the points, and back-calculate K and n from the slope, intercept and D.
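In R, this linearised fit is just an ordinary lm() call — a minimal sketch, assuming x and y are numeric vectors holding the data and D is known:

[code]
# Linearised form: x/(1-y) = (K/n)*(1/y) + D/n
D <- 1000                  # known constant
lhs  <- x / (1 - y)        # left-hand side
invy <- 1 / y              # regressor
fit  <- lm(lhs ~ invy)
b0 <- coef(fit)[1]         # intercept, estimates D/n
b1 <- coef(fit)[2]         # slope, estimates K/n
n_hat <- D / b0            # back-calculate n
K_hat <- b1 * n_hat        # back-calculate K
[/code]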

However, when I tried this on my data, in some cases I got a negative value of the intercept, which gave a negative n, and a negative K, as the slope was always positive.
This is physically impossible, given the theory from which the initial equation is derived.
So I'm guessing there is something wrong with the regression (or with the data, or both?).

Then I remembered reading that linearisation-based methods are not so good, as they cause problems with the distribution of the error, or something to that effect, and that nonlinear methods are often preferable.

So I solved the initial equation for y, plotted the experimental y vs the calculated y and tried (using Excel's solver) to find the values of K and n that gave the smallest sum of squared differences between the two.
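Explicitly, the original equation is quadratic in y, [itex]D y^2 + (K + n x - D)\, y - K = 0[/itex], and since y must lie in (0,1) the relevant root is

[itex]y = \frac{-(K + n x - D) + \sqrt{(K + n x - D)^2 + 4 D K}}{2 D}[/itex]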

Unfortunately this gave even worse results than the linear method!
Although the sum of squares did reach a minimum as required, n became extremely large, and so did K. Again, the theory does not agree with that: n should be quite small and probably an integer.
I also noticed that, by setting n to a small value (say 1 or 2) and fitting K alone, the result was not too bad, but I could put any value for n, and K was always found with a satisfactory sum of squares. Same if I used a fixed reasonable value for K and fitted n alone.
So basically it looked like I could decide myself the value of K or n, which sounds absurd.
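For concreteness, the equivalent in R of that solver-based minimisation would look something like this (a sketch; the starting values are just illustrative):

[code]
# Direct least squares on the explicit solution for y
D <- 1000
y_model <- function(x, K, n) {
  b <- K + n * x - D
  (-b + sqrt(b^2 + 4 * D * K)) / (2 * D)   # positive root of the quadratic
}
ssq <- function(p) sum((y - y_model(x, p[1], p[2]))^2)  # p = c(K, n)
fit <- optim(c(K = 5000, n = 1), ssq)                   # Nelder-Mead by default
fit$par
[/code]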

Further analysis revealed that in the linearised equation, the term D/n was usually several orders of magnitude smaller than K/(n y) (e.g. with the simulated values below — K=5000, n=1, D=1000 — at the largest x we have y ≈ 0.0078, so K/(n y) ≈ 640000 while D/n = 1000), so I guess small experimental errors in the measurement of y can have an enormous effect on the intercept and cause a really bad estimation of n.
But for the nonlinear method, I have no idea why it shouldn't work.

Does anyone know why this is happening?
And can you please suggest what I should do to estimate my parameters K and n more 'reliably'?

Thank you!
L

PS
Here's some simulated data, calculated using K=5000, D=1000, n=1:
x        y
640000   0.007763881556107663
256000   0.019229347046974
64000    0.07345007480164349
6400     0.4603886792339626

The real data I have are often extremely close to the simulated ones, with very small relative errors, but the estimation of K and n as independent parameters suffers from the problem I described above.

Here's some real data for 4 separate experiments y1, y2, y3 and y4 (sorry, poor formatting, I don't know how to paste tables):
x        y1      y2      y3      y4
640000   0.003   0.058   0.009   0.007
256000   0.011   0.133   0.022   0.014
64000    0.028   0.371   0.065   0.047
6400     0.224   0.856   0.51    0.339
 
  • #2
Where did the first equation you wrote come from? Is it supposed to be an exact equation, like Newton's 2nd Law of Motion (F=ma), or is it from a statistical model? If it's a statistical model then it's incomplete, because it needs at least one random variation term, typically denoted by ##\epsilon##, and it needs to specify the distribution of that random term.

One possibility is that the equation comes from a model in which the random term has a one-sided distribution such as lognormal or Weibull, or maybe even a distribution with compact support, like the beta distribution. If so then just doing an ordinary linear regression is not going to be appropriate because that uses the normal distribution, which has unbounded support, for the error term.
 
  • #3
lavoisier said:
Then I remembered reading that linearisation-based methods are not so good, as they cause problems with the distribution of the error, or something to that effect, and that nonlinear methods are often preferable.
Yes, that was my first thought. Linear least squares makes some assumptions about the errors that are probably violated in your experiment. Specifically, it assumes that the errors are independent and normally distributed with zero mean and the same standard deviation. With a dependent variable in the range from 0 to 1 it is highly unlikely that those conditions are met.

lavoisier said:
I also noticed that, by setting n to a small value (say 1 or 2) and fitting K alone, the result was not too bad, but I could put any value for n, and K was always found with a satisfactory sum of squares.
That is actually pretty common too, and there isn't really much that you can do about it. Think about plotting your fit error as a function of n and K. Ideally you would like there to be one sharp minimum in this surface, with all other values clearly worse. It sounds like what you have instead is a broad valley where the errors are all pretty similar. If the valley is "skinny" then you have a pretty good idea of the relationship between n and K, but the data simply don't allow you to distinguish any more than that.
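A quick way to sketch that surface in R, reusing the explicit solution for y from post #1 (the grid ranges here are illustrative, not recommendations):

[code]
# Error surface over a grid of (n, K) values; assumes data vectors x and y
D <- 1000
y_model <- function(x, K, n) {
  b <- K + n * x - D
  (-b + sqrt(b^2 + 4 * D * K)) / (2 * D)
}
ssq <- Vectorize(function(n, K) sum((y - y_model(x, K, n))^2))
n_grid <- seq(0.5, 5, length.out = 60)       # trial n values
K_grid <- seq(500, 50000, length.out = 60)   # trial K values
contour(n_grid, K_grid, outer(n_grid, K_grid, ssq),
        xlab = "n", ylab = "K")
[/code]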

Since you seem to believe that you have a good theoretical basis for this equation, and since n seems to be the most theoretically constrained parameter, I would recommend that you fix n and then fit K.
 
  • #4
lavoisier said:
where x is the independent variable, a real number usually between 6400 and 640000

Is x always precisely measured? The usual technique of regression assumes there is no error measuring the independent variable. For example, a regression of the height of a child versus days of its life would assume the day was always precisely measured - e.g. 940 days would not be mistaken for 941 days.
 
  • #5
Thank you all for your replies. Here are my answers.

@andrewkirk
The equation comes from a book explaining the theory of binding of drug molecules to plasma proteins. I simplified the symbols, but D is the total concentration of drug, K is the dissociation constant of the protein-drug complex, y is the unbound fraction of drug, x is the total concentration of plasma protein (albumin, for instance) and n is the number of drug molecules that can bind to a single molecule of plasma protein. Whether it's exact, I don't know; there are always approximations when dealing with chemical systems, but as far as this theory is concerned it should give a pretty good representation of the behaviour of the plasma protein + drug + medium system.
And actually, apart from the fact that it's difficult to separate K and n, the r-squared, sums of squares etc. are all very good (by my standards at least).

@Dale
The dependent variable y is the 'fraction unbound', or fu. It is calculated as follows. The concentration of drug in two compartments A and B, separated by a special membrane, is measured. The concentration in A is the 'total' one; the one in B is the 'free' one. A value called PPB is calculated as (A-B)/A, and fu is calculated as 1-PPB (yes, I know it would be just as well to do B/A, but that's how they do this). I'm pretty sure the error on A and B is normal. Maybe you can tell from this what kind of error fu would have? I think the error on log(fu) is normal, but I'm not sure.
And assuming we know what kind of error there is, would that help in putting together a better method to do this regression?

The fit error vs n and K is interesting; I will see if I can figure out how to do that. Maybe R has already got some neat function that does it.

Setting n and regressing K: yes, that's almost what I had desperately thought of doing, but the trouble is we need an estimate of both parameters, as we then need to calculate another important parameter i = s*K/(n-s), where s is the slope from another experiment; s is always >0 and <n.

In most cases s is so much smaller than n that we can just use the K/n ratio and say i ≈ s*K/n. And indeed, it seems here that I am able to determine the ratio between K and n quite accurately, but not the two separately. There are however cases where n and s are of comparable magnitude, and then it makes a difference to say n=1 and K=3000 or n=2 and K=6000. Suppose s=0.9. In the first case i = 0.9 * 3000 / (1-0.9) = 27000. In the second case i = 0.9 * 6000 / (2-0.9) ≈ 4909.
In fact as I'm writing this, I'm thinking that I'll have to check this one, because if we are lucky our slopes are generally quite small and we can do without all the fuss.
But as a general concept I would still like to learn how one can get around this issue. I didn't know the regression parameters could behave like this.

@Stephen Tashi
Yes, x is known exactly, with negligible error (dilution).
---
Addendum: I tried the nls command in R.
Silly mistake on my part: at first I tried to regress x as a function of y, and of course got badly battered. In some cases I got semi-decent numbers, like n close to 1 or 2 and K in a reasonable range (maybe by chance), but in other cases 'singular gradient' errors or values of n close to 0.
I then tried to regress y as a function of x, and it was always a singular gradient, no matter what data I plugged in! :frown:
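For reference, the regression of y on x that keeps failing looks roughly like this (a sketch; the start values are just illustrative):

[code]
# nls on the explicit positive root of the quadratic in y
# (this is the kind of call that returns 'singular gradient' on my data)
D <- 1000
fit <- nls(y ~ (-(K + n*x - D) + sqrt((K + n*x - D)^2 + 4*D*K)) / (2*D),
           start = list(K = 5000, n = 1))
summary(fit)
[/code]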

It would be interesting to know if a case like this one is 'bad' in general to handle, whatever one does, or if appropriate manipulation of the equation and choice of the solving algorithm may have an impact on the result.
 
  • #6
lavoisier said:
The concentration in A is the 'total' one, the one in B is the 'free' one. A value called PPB is calculated as (A-B)/A. fu is calculated as 1-PPB (yes, I know it would be just as well to do B/A, but that's how they do this). I'm pretty sure the error on A and B is normal. Maybe you can tell from this what kind of error fu would have?
Yes. My favorite approach in such cases is to do a quick Monte Carlo simulation. Just assume a reasonable distribution for A and B and calculate fu. Look to see if that calculated value looks similar to your actual measured data. Check to see if fu or log(fu) have normal distributions. Be sure to vary your parameters over a reasonable range to make sure the assumptions hold.

For something simple like B/A there is probably a theoretical solution, but I can do a Monte Carlo simulation faster than I could figure out what search term to even use.
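A minimal sketch of such a simulation in R (the means and standard deviations below are illustrative placeholders, not your assay's values):

[code]
# Monte Carlo check of the error distribution of fu = 1 - (A-B)/A = B/A
set.seed(1)
N <- 1e5
A <- rnorm(N, mean = 100, sd = 5)    # total concentration, assumed normal
B <- rnorm(N, mean = 10,  sd = 1)    # free concentration, assumed normal
fu <- B / A
hist(fu, breaks = 100)               # does fu itself look normal?
qqnorm(log(fu)); qqline(log(fu))     # or is log(fu) closer to normal?
[/code]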

lavoisier said:
but trouble is we need an estimate of both parameters
I don't think that you can use this data to estimate both. I think you will need a different experiment.

lavoisier said:
as a general concept I would still like to learn how one can get around this issue
Usually this is not an issue that you can get around with nonlinear regression. Basically you can just linearize it (which has its own problems) or perform a different experiment.
 
  • #7
Funny, I am currently analysing data on immunoglobulins binding to albumin.

The equilibrium law reads
##K=\frac{(nx-D(1-y))Dy}{D(1-y)}##
or
##K(1-y)+Dy(1-y)=nxy##, which follows from your expression.

Edit: My first attempt was erroneous.
 
  • #8
So, let's rescale this, setting K'=K/D and x'=x/D.
We get ## K'(1-y) +(1-y-nx')y=0##.
For all but your smallest x, nx' is at least 64, so 1-y is negligible in the second bracket. All you can then recover is K/n = xy/(1-y), which you already know.
You should run some experiments where x ≈ D.
 

FAQ: Help with a difficult (?) regression

1. What is regression analysis?

Regression analysis is a statistical method used to examine the relationship between a dependent variable (usually denoted as Y) and one or more independent variables (usually denoted as X). It helps to understand how the value of the dependent variable changes when the values of the independent variables are altered.

2. What is a difficult regression?

A difficult regression is one where the relationship between the dependent and independent variables is not easily determined. This can happen due to a large amount of noise, non-linear relationships, or a small sample size.

3. How can I improve the accuracy of a difficult regression?

There are a few ways to improve the accuracy of a difficult regression. One approach is to collect more data to reduce the effects of noise. Another approach is to use non-linear regression techniques if the relationship between variables is not linear. Additionally, it is important to carefully select the independent variables and perform feature engineering to ensure they are relevant to the dependent variable.

4. Can I use regression for prediction?

Yes, regression can be used for prediction. However, it is important to note that the accuracy of the predictions depends on the quality of the data and the assumptions made during the regression analysis. It is always recommended to validate the results of a regression model before using it for prediction.

5. How do I interpret the results of a difficult regression?

The interpretation of a difficult regression can be challenging as the relationship between variables may not be clear. It is important to carefully analyze the coefficients and their significance to understand the impact of each independent variable on the dependent variable. Additionally, graphical representations such as scatter plots and residual plots can also aid in interpreting the results of a difficult regression.
