Can I get thresholds from logistic regression coefficients?

lavoisier · Jun 21, 2016

Hello,
I remember an example of the application of logistic regression to medicine / epidemiology, which said (more or less) that the probability of a person having a myocardial infarction was related to variables such as age, cholesterol level, etc., and the equation included various 'thresholds' for each of these variables.
Something like: c₀ + c_age (age - 50) + c_chol (chol - 200) + ...
This was the x in the logistic formula P=1/(1+e^-x).
If the coefficients are all positive, it follows that when age > 50 and chol > 200, a positive contribution is given to x by these two variables, which makes e^-x smaller, and P closer to 1.
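
To make the formula concrete, here is a minimal sketch. The coefficient values are invented purely for illustration and are not taken from any real study; only the form of the equation comes from the example above.

```python
import math

def logistic(x):
    """Logistic function P = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical coefficients, all positive, chosen only for illustration.
c0, c_age, c_chol = -2.0, 0.05, 0.01

def p_infarction(age, chol):
    # Linear predictor written with the 'thresholds' 50 and 200.
    x = c0 + c_age * (age - 50) + c_chol * (chol - 200)
    return logistic(x)

p_ref = p_infarction(50, 200)   # both centered terms vanish, so x = c0
p_high = p_infarction(65, 260)  # both terms are positive, so x > c0
p_low = p_infarction(40, 180)   # both terms are negative, so x < c0

print(p_low, p_ref, p_high)
```

With positive coefficients the three probabilities come out in increasing order, and the middle one is simply the logistic function evaluated at c₀.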

Now my question is, how did they find the thresholds (50 and 200) for age and chol?
If I had data on age, cholesterol, etc, vs presence/absence of the disease, and ran a logistic regression, I think I would get something like this:
x = a₀ + a_age age + a_chol chol + ...

I.e. I would only know that age and chol increase P, but not 'when' someone should start to worry about their age and cholesterol.

Am I completely off the mark here, or is there a technique to calculate these thresholds from the data?

Thanks!
L

rikblok · Jun 21, 2016

lavoisier said:

c₀ + c_age (age - 50) + c_chol (chol - 200) + ...

I expect those parameters, ##\text{age}_0=50## & ##\text{chol}_0=200##, were picked from some other knowledge (or arbitrarily). They don't come out of the regression. If they did, you would have a regression model like

$$\begin{align}
x & = c_0 & + & c_{age} (\text{age} - \text{age}_0) & + & c_{chol} (\text{chol} - \text{chol}_0) & + \ldots \\
& = a_0 & + & c_{age} \text{age} & + & c_{chol} \text{chol} & + \ldots
\end{align}$$

where those parameters can all be wrapped up into a single parameter, ##a_0 = c_0 - c_{age} \text{age}_0 - c_{chol} \text{chol}_0##. The regression can tell you what ##a_0## should be, but not how that breaks down into ##c_0, \text{age}_0, \text{chol}_0##.
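
A quick numerical check of this, as a sketch only: the intercept and slopes below are made-up values standing in for a fitted model, and `p_raw` / `p_centered` are hypothetical helper names. Any choice of centers gives the same predicted probability once the intercept is adjusted to match.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical fitted values on the raw scale: x = a0 + c_age*age + c_chol*chol
a0 = -6.5
slopes = {"age": 0.05, "chol": 0.01}

def p_raw(person):
    x = a0 + sum(slopes[k] * person[k] for k in slopes)
    return logistic(x)

def p_centered(person, centers):
    # Intercept implied by this choice of centers: c0 = a0 + sum(c_k * center_k)
    c0 = a0 + sum(slopes[k] * centers[k] for k in slopes)
    x = c0 + sum(slopes[k] * (person[k] - centers[k]) for k in slopes)
    return logistic(x)

person = {"age": 63, "chol": 245}
choice_a = {"age": 50, "chol": 200}
choice_b = {"age": 37, "chol": 181}  # an equally valid, arbitrary choice

print(p_raw(person), p_centered(person, choice_a), p_centered(person, choice_b))
```

All three printed values agree up to rounding, which is why the data cannot distinguish one set of centers from another.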

Dale · Jun 21, 2016

Yes, you are right. Those numbers do not come from the regression. You can use any numbers you like there. Doing so only changes the meaning and value of the constant term.

lavoisier · Jun 22, 2016

Thank you both for your replies.

On the subject matter: ouch! I feared this would be the case.

Maybe I could estimate non-arbitrary thresholds by calculating what value of each variable corresponds to P=50%, when all other variables are set to their own population means.

E.g. if we have only age and cholesterol as explanatory variables, to calculate the age threshold:
P = 1/2 when x = 0
a₀ + a_age age_thr + a_chol chol_mean = 0
age_thr = (- a₀ - a_chol chol_mean) / a_age

I guess this would be like asking: what value of the age threshold makes an individual that is 'average' in all other respects reach 50% probability of disease?
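
As a sketch of that calculation (the coefficients and the cholesterol mean are invented for illustration, and `age_at_probability` is a hypothetical helper name), generalised slightly so the target probability need not be 50%:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical fitted coefficients and population mean, for illustration only.
a0, a_age, a_chol = -6.5, 0.05, 0.01
chol_mean = 210.0

def age_at_probability(p_target, chol):
    """Age at which the model predicts p_target, holding chol fixed."""
    x_target = math.log(p_target / (1.0 - p_target))  # logit; 0 when p_target = 0.5
    return (x_target - a0 - a_chol * chol) / a_age

age_thr = age_at_probability(0.5, chol_mean)
print(age_thr, logistic(a0 + a_age * age_thr + a_chol * chol_mean))
```

The answer moves with the target probability and with the value at which the other variables are held, so it describes one chosen reference scenario rather than a property of the fit itself.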

Or maybe I should first calculate the probability of disease when all variables are set to their means, and use that as the probability threshold rather than a generic 50%.

In a related problem, where the probability was reduced to a binary outcome, the approach to find a threshold was based on comparing the distributions of the explanatory variables separated between the two outcome classes (in this case, diseased or not diseased).
If each explanatory variable is normally distributed in the two classes, apparently one can use standard formulae to compute in each case the value of the variable for which the probability of belonging to one class is the same as the probability of belonging to the other class; and that would be the threshold.
Not sure if this would be applicable here. Worth a try maybe.
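
For what it's worth, here is a sketch of that two-class calculation for a single variable. It assumes the variable is normally distributed within each class and that a prior probability for class 1 is supplied; the function name and all numbers are hypothetical. Setting prior × density equal for the two classes and taking logs gives a quadratic in the variable.

```python
import math

def equal_posterior_points(mu1, sd1, mu2, sd2, prior1=0.5):
    """Values where prior1*N(mu1, sd1) equals (1 - prior1)*N(mu2, sd2)."""
    prior2 = 1.0 - prior1
    # Coefficients of a*x^2 + b*x + c = 0 from equating the log posteriors.
    a = 1.0 / (2 * sd2**2) - 1.0 / (2 * sd1**2)
    b = mu1 / sd1**2 - mu2 / sd2**2
    c = (mu2**2 / (2 * sd2**2) - mu1**2 / (2 * sd1**2)
         + math.log((prior1 * sd2) / (prior2 * sd1)))
    if abs(a) < 1e-12:            # equal variances: the equation is linear
        if abs(b) < 1e-12:
            return []             # identical means: no single crossing point
        return [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:
        return []                 # one class is more probable everywhere
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

# Hypothetical cholesterol distributions: non-diseased vs diseased.
same_sd = equal_posterior_points(190, 30, 230, 30)
diff_sd = equal_posterior_points(190, 25, 230, 35)
print(same_sd, diff_sd)
```

With equal standard deviations and equal priors the crossing point is just the midpoint of the two class means. With unequal standard deviations there can be two crossing points, and only the one lying between the means behaves like a cutoff.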

Dale · Jun 22, 2016

lavoisier said:

Maybe I could estimate non-arbitrary thresholds by calculating what value of each variable corresponds to P=50%, when all other variables are set to their own population means.

But those numbers are not thresholds of any kind.

All they mean is that the intercept term, ##c_0## in your original equation, is the value expected for someone with age=50 and chol=200. There is no cutoff or threshold involved.

For your other equation, the intercept term, ##a_0##, is instead the value expected for someone with age=0 and chol=0. But again there is no threshold; it just tells you how to interpret the intercept term.

Stephen Tashi · Jun 23, 2016

lavoisier said:

If I had data on age, cholesterol, etc, vs presence/absence of the disease, and ran a logistic regression, I think I would get something like this:
x = a₀ + a_age age + a_chol chol + ...

I.e. I would only know that age and chol increase P, but not 'when' someone should start to worry about their age and cholesterol.

We should first be clear on whether logistic regression can only produce models where an increase in the value of an explanatory variable either always increases the predicted probability or always decreases it. From glancing at the web, this appears to be true when the exponent is linear in the variables, but I myself have never used logistic regression.

There are situations where the probability of being healthy depends on keeping some variables near a "healthy" mean value. This must be an often encountered situation in the biological sciences. Surely people employ methods that handle non-monotonic effects.

Are you sure the paper you recall didn't define the dependent variable ##y## as the probability of being disease free and use an independent variable like ##x = c_0 + c_{age}( age - age_0)^2 + c_{chol} (chol - chol_0)^2 ## ?
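
One point worth noting about a squared form like this, offered as a sketch and not as what any particular paper did: here the centers would be identifiable from a fit, unlike in the linear case. Expanding one term (the symbols ##b_1## and ##b_2## are introduced only for this illustration):

$$\begin{align}
c_{age} (\text{age} - \text{age}_0)^2 &= c_{age}\,\text{age}^2 - 2\,c_{age}\,\text{age}_0\,\text{age} + c_{age}\,\text{age}_0^2 \\
&= b_2\,\text{age}^2 + b_1\,\text{age} + \text{const}
\end{align}$$

So a regression on ##\text{age}## and ##\text{age}^2## would give ##b_1## and ##b_2##, and then ##\text{age}_0 = -b_1 / (2 b_2)## provided ##b_2 \neq 0##. Only the constant is absorbed into the intercept.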

lavoisier · Jun 23, 2016

OK, I see, thanks.
I will not try to read into this more than the numbers tell me.
Yes, I am pretty sure x was a linear function of the explanatory variables.

chiro · Jul 8, 2016

Hey lavoisier.

Is there anything that might suggest that the numbers are derived from some principle or some experimental data?

I think you might find that the numbers are representative of something - like an average/median figure that is used to calibrate the model.

It may be that the age of 50 and cholesterol of 200 are benchmarks for the experiment or science being discussed and there is probably a good reason for it.

lavoisier · Jul 13, 2016

Hi chiro,
I don't know, maybe. See message #4 where I mentioned the population mean, which is sort of close to what you're saying, at least conceptually.
In any case, those numbers should make sense, or 'mean' something, otherwise why put them there?
Because from the above discussion it's clear that one is in principle free to rewrite the linear exponent in any way that is consistent with the total.
And I believe there are infinitely many ways of doing that.

chiro · Jul 14, 2016

I think it is more likely to be a "cut-off" value, if it isn't a mean or median.

I don't want to go further because that would be too much speculation on my part.

Dale · Jul 14, 2016

Again, those values are completely arbitrary and only define the interpretation of the intercept term. They are not derived from the data, nor do they represent any sort of threshold or cutoff.

If an important value is known from prior data (population average, decision point, etc), then it is certainly possible to arbitrarily choose to use that important value. Then the intercept would be interpreted as referring to that value.

But again, that is not from the logistic regression and does not itself imply anything special like a cutoff.

Stephen Tashi · Jul 14, 2016

lavoisier said:

Something like: c₀ + c_age (age - 50) + c_chol (chol - 200) + ...

This morning's web browsing says that there are various regression models, like the Tobit, that estimate thresholds from data. So if the authors used such a model they might present their findings with the above formula accompanied by instructions like "if the person's age is less than 50, set age = 50" etc.

FactChecker · Jul 14, 2016

If the "thresholds" were calculated using any model, it is very unlikely that the results would be such round numbers as age_threshold=50, chol_threshold=200. Those "threshold" numbers look like convenient, human-chosen values.

It seems to be an unresolved question as to how the thresholds are really used in the regression:
Option 1: x' = x-x_threshold (No real model change. Just using intermediate variables in the regression.)
Option 2: x' = max( x, x_threshold) (A significant model change, not adequately indicated by the regression equation in the original post.)
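
A small sketch of how differently the two options behave, using a made-up slope and the age threshold of 50 from the original post (the function names are hypothetical):

```python
def option1(value, threshold):
    # Plain shift: the slope is unchanged, only the intercept's meaning moves.
    return value - threshold

def option2(value, threshold):
    # Clamp from below: values under the threshold are treated as the threshold.
    return max(value, threshold)

c_age = 0.05  # hypothetical coefficient, for illustration only
ages = [30, 40, 50, 60, 70]

contrib1 = [c_age * option1(a, 50) for a in ages]
contrib2 = [c_age * option2(a, 50) for a in ages]

print(contrib1)
print(contrib2)
```

Under Option 1 the contribution to x changes by the same amount for every ten years of age, across the whole range. Under Option 2 it is flat up to age 50 and only rises afterwards, which is a genuinely different model.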
