# Can I get thresholds from logistic regression coefficients?

Hello,
I remember an example of logistic regression applied to medicine / epidemiology, which said (more or less) that the probability of a person having a myocardial infarction was related to variables such as age, cholesterol level, etc., and the equation included various 'thresholds' for each of these variables.
Something like: ##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##
This was the ##x## in the logistic formula ##P = 1/(1+e^{-x})##.
If the coefficients are all positive, it follows that when age > 50 and chol > 200, these two variables make a positive contribution to ##x##, which makes ##e^{-x}## smaller and ##P## closer to 1.

Now my question is, how did they find the thresholds (50 and 200) for age and chol?
If I had data on age, cholesterol, etc, vs presence/absence of the disease, and ran a logistic regression, I think I would get something like this:
##x = a_0 + a_{age} \text{age} + a_{chol} \text{chol} + \ldots##

I.e. I would only know that age and chol increase P, but not 'when' someone should start to worry about their age and cholesterol.

Am I completely off the mark here, or is there a technique to calculate these thresholds from the data?

Thanks!
L

##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##
I expect those parameters, ##\text{age}_0=50## & ##\text{chol}_0=200##, were picked from some other knowledge (or arbitrarily). They don't come out of the regression. If they did, you would have a regression model like

\begin{align} x & = c_0 & + & c_{age} (\text{age} - \text{age}_0) & + & c_{chol} (\text{chol} - \text{chol}_0) & + \ldots \\ & = a_0 & + & c_{age} \text{age} & + & c_{chol} \text{chol} & + \ldots \end{align}

where those parameters can all be wrapped up into a single parameter, ##a_0 = c_0 - c_{age} \text{age}_0 - c_{chol} \text{chol}_0##. The regression can tell you what ##a_0## should be, but not how that breaks down into ##c_0, \text{age}_0, \text{chol}_0##.
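A minimal numerical sketch of that point, assuming made-up slopes and reference values (none of these numbers come from a real study): two different choices of ##c_0, \text{age}_0, \text{chol}_0## that collapse to the same ##a_0## give identical predictions, so data alone cannot tell them apart.

```python
import math

def logistic(x):
    """Logistic function P = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical slopes, chosen only for illustration.
C_AGE, C_CHOL = 0.05, 0.01

def raw_intercept(c0, age0, chol0):
    """Collapse c0 and the reference values into the single intercept a0."""
    return c0 - C_AGE * age0 - C_CHOL * chol0

def predict(age, chol, c0, age0, chol0):
    """Probability from the centred form c0 + c_age*(age-age0) + c_chol*(chol-chol0)."""
    return logistic(c0 + C_AGE * (age - age0) + C_CHOL * (chol - chol0))

# Two different parametrisations (both hypothetical) with the same a0.
form_a = {"c0": -1.0, "age0": 50.0, "chol0": 200.0}
form_b = {"c0": -2.0, "age0": 40.0, "chol0": 150.0}

a0_a = raw_intercept(**form_a)
a0_b = raw_intercept(**form_b)

# Same person, two parametrisations, same predicted probability.
p_a = predict(63.0, 240.0, **form_a)
p_b = predict(63.0, 240.0, **form_b)
print(a0_a, a0_b, p_a, p_b)
```

Any other pair of reference values works equally well, provided ##c_0## is adjusted to keep ##a_0## fixed.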

Dale
Mentor
2020 Award
Yes you are right. Those numbers do not come from the regression. You can use any numbers you like there. It just changes the meaning and value of the constant term.

EnumaElish
Thank you both for your replies.

On the subject matter: ouch! I feared this would be the case.

Maybe I could estimate non-arbitrary thresholds by calculating what value of each variable corresponds to P=50%, when all other variables are set to their own population means.

E.g. if we have only age and cholesterol as explanatory variables, to calculate the age threshold:
##P = 1/2## when ##x = 0##
##a_0 + a_{age} \text{age}_{thr} + a_{chol} \text{chol}_{mean} = 0##
##\text{age}_{thr} = (- a_0 - a_{chol} \text{chol}_{mean}) / a_{age}##

I guess this would be like asking: what value of the age threshold makes an individual that is 'average' in all other respects reach 50% probability of disease?

Or maybe I should first calculate the probability of disease when all variables are set to their means, and use that as the probability threshold rather than a generic 50%.
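Both ideas can be sketched in a few lines, assuming hypothetical fitted coefficients and population means (the numbers below are invented for illustration). The function solves the linear predictor for age at a chosen probability level. Note that if the probability level is taken to be the model's prediction at the means, the resulting age is the mean age by construction, so that second variant adds nothing beyond the sample mean.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the logistic function: the x at which P = p."""
    return math.log(p / (1.0 - p))

def age_at_probability(p, a0, a_age, a_chol, chol_value):
    """Age at which the model predicts probability p, holding chol fixed."""
    return (logit(p) - a0 - a_chol * chol_value) / a_age

# Hypothetical coefficients and population means, for illustration only.
A0, A_AGE, A_CHOL = -5.5, 0.05, 0.01
AGE_MEAN, CHOL_MEAN = 55.0, 210.0

# First idea: the age giving P = 50% for a person with mean cholesterol.
age_thr_50 = age_at_probability(0.5, A0, A_AGE, A_CHOL, CHOL_MEAN)

# Second idea: use the predicted probability at the means as the level.
p_at_means = logistic(A0 + A_AGE * AGE_MEAN + A_CHOL * CHOL_MEAN)
age_thr_means = age_at_probability(p_at_means, A0, A_AGE, A_CHOL, CHOL_MEAN)

print(age_thr_50, p_at_means, age_thr_means)
```

Either way, the answer depends on the probability level chosen, which is a decision made outside the regression.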

In a related problem, where the probability was reduced to a binary outcome, the approach to find a threshold was based on comparing the distributions of the explanatory variables separated between the two outcome classes (in this case, diseased or not diseased).
If each explanatory variable is normally distributed in the two classes, apparently one can use standard formulae to compute in each case the value of the variable for which the probability of belonging to one class is the same as the probability of belonging to the other class; and that would be the threshold.
Not sure if this would be applicable here. Worth a try maybe.
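A rough sketch of that idea for a single variable, assuming each class is normally distributed with known mean and standard deviation (all numbers below are hypothetical). Setting the two prior-weighted densities equal gives a quadratic in the variable; with equal standard deviations and equal priors it reduces to the midpoint of the two means.

```python
import math

def normal_log_density(x, mean, sd):
    """Log of the normal density N(x; mean, sd)."""
    return -math.log(sd) - 0.5 * math.log(2 * math.pi) - (x - mean) ** 2 / (2 * sd ** 2)

def equal_posterior_points(m1, s1, m2, s2, prior1=0.5):
    """Values of x where prior1*N(x; m1, s1) equals (1 - prior1)*N(x; m2, s2)."""
    prior2 = 1.0 - prior1
    # Coefficients of a*x^2 + b*x + c = 0, from equating the log densities.
    a = 1.0 / (2 * s2 ** 2) - 1.0 / (2 * s1 ** 2)
    b = m1 / s1 ** 2 - m2 / s2 ** 2
    c = (m2 ** 2 / (2 * s2 ** 2) - m1 ** 2 / (2 * s1 ** 2)
         + math.log(prior1 / prior2) - math.log(s1 / s2))
    if math.isclose(a, 0.0, abs_tol=1e-12):
        # Equal variances: the equation is linear, with at most one crossing.
        if math.isclose(b, 0.0, abs_tol=1e-12):
            return []
        return [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

# Hypothetical cholesterol distributions: not diseased vs diseased.
same_sd = equal_posterior_points(190.0, 30.0, 240.0, 30.0)
diff_sd = equal_posterior_points(190.0, 25.0, 240.0, 40.0)
print(same_sd, diff_sd)
```

With unequal standard deviations there can be two crossing points, so a single "threshold" is not guaranteed even under the normality assumption.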

Dale
Maybe I could estimate non-arbitrary thresholds by calculating what value of each variable corresponds to P=50%, when all other variables are set to their own population means.
But those numbers are not thresholds of any kind.

All they mean is that the intercept term, ##c_0## in your original equation, is the value of ##x## (the log-odds) expected for someone with age=50 and chol=200. There is no cutoff or threshold involved.

In your other equation, the intercept term, ##a_0##, is instead the value of ##x## expected for someone with age=0 and chol=0. But again there is no threshold; it just tells you how to interpret the intercept term.
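As a worked substitution, using only the two forms of ##x## already written in this thread and assuming any further variables sit at their own reference values:

\begin{align} \text{age}=50,\ \text{chol}=200: \quad & x = c_0, & P &= \frac{1}{1+e^{-c_0}} \\ \text{age}=0,\ \text{chol}=0: \quad & x = a_0, & P &= \frac{1}{1+e^{-a_0}} \end{align}

Each intercept is simply the log-odds at its own reference point, and the reference point can be moved anywhere without changing any prediction.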

Stephen Tashi
If I had data on age, cholesterol, etc, vs presence/absence of the disease, and ran a logistic regression, I think I would get something like this:
##x = a_0 + a_{age} \text{age} + a_{chol} \text{chol} + \ldots##

I.e. I would only know that age and chol increase P, but not 'when' someone should start to worry about their age and cholesterol.
We should first be clear on whether logistic regression can only produce models where an increase in the value of an independent (explanatory) variable either always increases the predicted probability or always decreases it. From glancing at the web, this appears to be true when the exponent is linear in the variables, but I myself have never used logistic regression.

There are situations where the probability of being healthy depends on keeping some variables near a "healthy" mean value. This must be an often encountered situation in the biological sciences. Surely people employ methods that handle non-monotonic effects.

Are you sure the paper you recall didn't define the dependent variable ##y## as the probability of being disease free and use an independent variable like ##x = c_0 + c_{age} (\text{age} - \text{age}_0)^2 + c_{chol} (\text{chol} - \text{chol}_0)^2## ?
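A small sketch of such a quadratic form for one variable, with invented coefficients. Unlike the linear case, the centre ##\text{age}_0## here is recoverable: expanding ##c_{age}(\text{age} - \text{age}_0)^2## gives ##b_0 + b_1 \text{age} + b_2 \text{age}^2##, and the vertex is ##-b_1 / (2 b_2)##.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def healthy_probability(age, c0, c_age, age0):
    """P(disease free) with a squared, centred age term in the exponent."""
    return logistic(c0 + c_age * (age - age0) ** 2)

def expand_quadratic(c0, c_age, age0):
    """Rewrite c0 + c_age*(age - age0)^2 as b0 + b1*age + b2*age^2."""
    return c0 + c_age * age0 ** 2, -2.0 * c_age * age0, c_age

def vertex(b1, b2):
    """Location of the extremum of b0 + b1*age + b2*age^2."""
    return -b1 / (2.0 * b2)

# Hypothetical values: a negative c_age makes the probability peak at age0.
C0, C_AGE, AGE0 = 2.0, -0.004, 40.0

b0, b1, b2 = expand_quadratic(C0, C_AGE, AGE0)
recovered_age0 = vertex(b1, b2)
print(recovered_age0, healthy_probability(30.0, C0, C_AGE, AGE0),
      healthy_probability(40.0, C0, C_AGE, AGE0))
```

So a regression on age and age squared would estimate ##b_1## and ##b_2##, and the centre follows from them.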

OK, I see, thanks.
I will not try to read into this more than the numbers tell me.
Yes, I am pretty sure x was a linear function of the explanatory variables.

chiro
Hey lavoisier.

Is there anything that might suggest that the numbers are derived from some principle or some experimental data?

I think you might find that the numbers are representative of something - like an average/median figure that is used to calibrate the model.

It may be that the age of 50 and cholesterol of 200 are benchmarks for the experiment or science being discussed and there is probably a good reason for it.

Hi chiro,
I don't know, maybe. See message #4 where I mentioned the population mean, which is sort of close to what you're saying, at least conceptually.
In any case, those numbers should make sense, or 'mean' something, otherwise why put them there?
Because from the above discussion it's clear that one is in principle free to rewrite the linear exponent in any way that is consistent with the total.
And I believe there are infinitely many ways of doing that.

chiro
I think it has more to do with a "cut-off" value, if it isn't a mean or median.

I don't want to go further because that would be too much speculation on my part.

Dale
Again, those values are completely arbitrary and only define the interpretation of the intercept term. They are not derived from the data, nor do they represent any sort of threshold or cutoff.

If an important value is known from prior data (population average, decision point, etc), then it is certainly possible to arbitrarily choose to use that important value. Then the intercept would be interpreted as referring to that value.

But again, that is not from the logistic regression and does not itself imply anything special like a cutoff.

Stephen Tashi
Something like: ##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##
This morning's web browsing says that there are various regression models, like the Tobit, that estimate thresholds from data. So if the authors used such a model they might present their findings with the above formula accompanied by instructions like "if the person's age is less than 50, set age = 50" etc.
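To illustrate only the clamping instruction (this is not an implementation of a Tobit model, and the coefficients are invented), here is a sketch of an exponent in which ages below the threshold are set to the threshold. In this form the threshold does change the predictions, so it can no longer be absorbed into the intercept.

```python
def clamped_predictor(age, c0, c_age, age_threshold):
    """Linear predictor where ages below the threshold are set to the threshold."""
    effective_age = max(age, age_threshold)
    return c0 + c_age * (effective_age - age_threshold)

# Hypothetical coefficients and threshold, for illustration only.
C0, C_AGE, AGE_THRESHOLD = -1.0, 0.05, 50.0

x_at_30 = clamped_predictor(30.0, C0, C_AGE, AGE_THRESHOLD)
x_at_50 = clamped_predictor(50.0, C0, C_AGE, AGE_THRESHOLD)
x_at_60 = clamped_predictor(60.0, C0, C_AGE, AGE_THRESHOLD)
print(x_at_30, x_at_50, x_at_60)
```

Below the threshold the exponent is flat at ##c_0##; above it, it rises with slope ##c_{age}##. Estimating the threshold itself from data would need a dedicated fitting procedure, which this sketch does not attempt.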

FactChecker