# I Help with a difficult (?) regression

1. Oct 11, 2016

### lavoisier

I wonder if someone can please help me understand why a nonlinear regression I'm attempting doesn't work, and suggest how I can tackle it.

It's based on this equation:

$(1-y) \cdot (D \cdot y+K) = n \cdot x \cdot y$

where x is the independent variable, a real number usually between 6400 and 640000, and y is the dependent variable, a real number ∈ (0,1).
D, K and n are real constants. D is known and is usually 1000.
K and n are the ones we want to estimate. K is a real positive number; n is also positive and is usually expected to be a small integer (1, 2 or 3 most of the time). But we accept that n may be non-integer.

The book I took the theory from says the equation can be linearised, so one can run the usual least squares fit.
Here's the equation they suggest:

$\frac {x} {1-y} = \frac {K} {n} \cdot \frac {1} {y} + \frac {D} {n}$

Of course the idea is to plot the LHS vs 1/y, find a line fitting the points, and back-calculate K and n from the slope, intercept and D.

However, when I tried this on my data, in some cases I got a negative value of the intercept, which gave a negative n, and a negative K, as the slope was always positive.
This is physically impossible, given the theory from which the initial equation is derived.
So I'm guessing there is something wrong with the regression (or with the data, or both?).

Then I remembered reading that linearisation-based methods are not so good, as they cause problems with the distribution of the error, or something to that effect, and that nonlinear methods are often preferable.

So I solved the initial equation for y, plotted the experimental y vs the calculated y and tried (using Excel's solver) to find the values of K and n that gave the smallest sum of squared differences between the two.

Unfortunately this gave even worse results than the linear method!
Despite the sum of squares did reach a minimum as required, n became extremely large, and so did K. Again, the theory does not agree with that: n should be quite small and probably an integer.
I also noticed that, by setting n to a small value (say 1 or 2) and fitting K alone, the result was not too bad, but I could put any value for n, and K was always found with a satisfactory sum of squares. Same if I used a fixed reasonable value for K and fitted n alone.
So basically it looked like I could decide myself the value of K or n, which sounds absurd.

Further analysis revealed that in the linearised equation, the term D/n was usually several orders of magnitude smaller than K/(n y), so I guess small experimental errors in the measurement of y can have an enormous effect on the intercept and cause a really bad estimation of n.
But for the nonlinear method, I have no idea why it shouldn't work.

Does anyone know why this is happening?
And can you please suggest what I should do to estimate my parameters K and n more 'reliably'?

Thank you!
L

PS
Here's some simulated data, calculated using K=5000, D=1000, n=1.
[x=640000,y=0.007763881556107663],[x=256000.0,y=0.019229347046974],[x=64000.0,y=0.07345007480164349],[x=6400.0,y=
0.4603886792339626]

The real data I have are often extremely close to the simulated ones, with very small relative errors, but the estimation of K and n as independent parameters suffers from the problem I described above.

Here's some real data for 4 separate experiments y1, y2, y3 and y4 (sorry, poor formatting, I don't know how to paste tables):
x y1 y2 y3 y4
640000 0.003 0.058 0.009 0.007
256000 0.011 0.133 0.022 0.014
64000 0.028 0.371 0.065 0.047
6400 0.224 0.856 0.51 0.339

2. Oct 11, 2016

### andrewkirk

Where did the first equation you wrote come from? Is it supposed to be an exact equation, like Newton's 3rd Law of Motion (F=ma) or is it from a statistical model? If it's a statistical model then it's incomplete because it needs at least one random variation term, typically denoted by $\epsilon$, and it needs to specify the distribution of that random term.

One possibility is that the equation comes from a model in which the random term has a one-sided distribution such as lognormal or Weibull, or maybe even a distribution with compact support, like the beta distribution. If so then just doing an ordinary linear regression is not going to be appropriate because that uses the normal distribution, which has unbounded support, for the error term.

3. Oct 11, 2016

### Staff: Mentor

Yes, that was my first thought. The linear least squares makes some assumptions about the errors that are probably violated in your experiment. Specifically, it assumes that the errors are independent, normally distributed with 0 mean, and the same standard deviation. With a dependent value in the range from 0 to 1 it is highly unlikely that those conditions are met.

That is actually pretty common also, and there isn't really much that you can do about it. Think about plotting your fit error as a function of n and K. Ideally you would like for there to be one sharp peak in this function where the error is minimized and other values are clearly worse. It sounds like what you have instead is some broad ridge where the errors are pretty similar. If the ridge is "skinny" then that means that you have a pretty good idea of the relationship between n and K, but the data simply don't allow you to distinguish any more than that.

Since you seem to believe that you have a good theoretical basis for this equation and since n seems like the most theoretically constrained then I would recommend that you set n and then fit K

Last edited: Oct 11, 2016
4. Oct 12, 2016

### Stephen Tashi

Is x always precisely measured? The usual technique of regression assumes there is no error measuring the independent variable. For example, a regression of the height of a child versus days of its life would assume the day was always precisely measured - e.g. 940 days would not be mistaken for 941 days.

5. Oct 12, 2016

### lavoisier

@andrewkirk
The equation comes from a book explaining the theory of binding of drug molecules to plasma proteins. I simplified the symbols, but D is the total concentration of drug, K is the dissociation constant of the protein-drug complex, y is the unbound fraction of drug, x is the total concentration of plasma protein (albumin for instance) and n is the number of drug molecules that can bind to a single molecule of plasma protein. Exact, I don't know, there are always approximations when dealing with chemical systems, but as far as this theory is concerned it should give a pretty good representation of the behaviour of the plasma protein + drug + medium system.
And actually, apart from the fact that it's difficult to separate K and n, the r squared, sums of squares etc are all very good (for my standards at least).

@Dale
The independent variable y is the 'fraction unbound' or fu. It is calculated as follows. The concentration of drug in two compartments A and B separated by a special membrane is measured. The concentration in A is the 'total' one, the one in B is the 'free' one. A value called PPB is calculated as (A-B)/A. fu is calculated as 1-PPB (yes, I know it would be just as well to do B/A, but that's how they do this). I'm pretty sure the error on A and B is normal. Maybe you can tell from this what kind of error fu would have? I think the error on log(fu) is normal, but I'm not sure.
And assuming we know what kind of error there is, would that help in putting together a better method to do this regression?

The fit error vs n and K is interesting; I will see if I can figure out how to do that. Maybe R has already got some neat function that does it.

Setting n and regressing K, yes, that's almost what I desperately thought to do, but trouble is we need an estimate of both parameters, as we then need to calculate another important parameter i = s*K/(n-s), where s is the slope from another experiment. s is always >0 and <n. In most cases s is so much smaller than n that we can just use the K/n ratio and say i ≈ s*K/n. And indeed, it seems here that I am able to determine the ratio between K and n quite accurately, but not the two separately. There are however cases where n and s are of comparable magnitude, and then it makes a difference to say n=1 and K=3000 or n=2 and K=6000. Suppose s=0.9. In the first case i = 0.9 * 3000 / (1-0.9) = 27000. In the second case i = 0.9 * 6000 / (2-0.9) = 4909.
In fact as I'm writing this, I'm thinking that I'll have to check this one, because if we are lucky our slopes are generally quite small and we can do without all the fuss.
But as a general concept I would still like to learn how one can get around this issue. I didn't know the regression parameters could behave like this.

@Stephen Tashi
Yes, x is known exactly, with negligible error (dilution).
---
Addendum: I tried the nls command in R.
Silly trick from my part, I tried to regress x as a function of y, and of course got badly battered. In some cases I got semi-decent numbers, like n close to 1 or 2 and K in a reasonable range (maybe by chance), but in other cases 'singular gradient' errors or values of n close to 0.
I tried to regress y as a function of x, and then it was always singular gradient, no matter what data I plugged in!

It would be interesting to know if a case like this one is 'bad' in general to handle, whatever one does, or if appropriate manipulation of the equation and choice of the solving algorithm may have an impact on the result.

6. Oct 12, 2016

### Staff: Mentor

Yes. My favorite approach in such cases is to do a quick Monte Carlo simulation. Just assume a reasonable distribution for A and B and calculate fu. Look to see if that calculated value looks similar to your actual measured data. Check to see if fu or log(fu) have normal distributions. Be sure to vary your parameters over a reasonable range to make sure the assumptions hold.

For something simple like B/A there is probably a theoretical solution, but I can do a Monte Carlo simulation faster than I could figure out what search term to even use.

I don't think that you can use this data to estimate both. I think you will need a different experiment.

Usually this is not an issue that you can get around with nonlinear regression. Basically you can just linearize it (which has its own problems) or perform a different experiment.

7. Oct 13, 2016

### DrDu

Funny, I am momentarily analysing data on Immunoglobulines binding to Albumine.

$K=\frac{(nx-D(1-y))Dy}{D(1-y)}$
or
$K(1-y)+Dy(1-y)=nxy$, which is from your expression.

Edit: First attempt of me was erroneous.

Last edited: Oct 13, 2016
8. Oct 13, 2016

### DrDu

So, let's rescale this, setting K'=K/D and x'=x/D.
We get $K'(1-y) +(1-y-nx')y=0$.
nx' is larger than 64, hence 1-y is negligible in the second bracket. All you can get is K/n=xy/(1-y), what you allready know.
You should make some experiments where x~D.