Undergrad Can I get thresholds from logistic regression coefficients?

Logistic regression coefficients do not inherently provide thresholds for variables like age and cholesterol; these values are often chosen based on prior knowledge or arbitrary decisions. While regression can indicate how variables influence disease probability, it does not specify critical cutoff points for concern. To estimate non-arbitrary thresholds, one could calculate the values at which the probability of disease reaches a specific level, such as 50%, using population means for other variables. The discussion highlights that thresholds may be more about interpretation rather than derived from the regression itself, and any specific values used should have meaningful context. Ultimately, the thresholds mentioned in logistic regression models are not derived from the data but are often set for interpretative purposes.
lavoisier
Hello,
I remember an example of the application of logistic regression to medicine / epidemiology, which said (more or less) that the probability of a person having a myocardial infarction was related to some variables such as age, cholesterol level, etc., and the equation included various 'thresholds' for each of these variables.
Something like: ##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##
This was the ##x## in the logistic formula ##P = 1/(1+e^{-x})##.
If the coefficients are all positive, it follows that when age > 50 and chol > 200, a positive contribution is given to ##x## by these two variables, which makes ##e^{-x}## smaller, and ##P## closer to 1.

Now my question is, how did they find the thresholds (50 and 200) for age and chol?
If I had data on age, cholesterol, etc, vs presence/absence of the disease, and ran a logistic regression, I think I would get something like this:
##x = a_0 + a_{age}\,\text{age} + a_{chol}\,\text{chol} + \ldots##

I.e. I would only know that age and chol increase P, but not 'when' someone should start to worry about their age and cholesterol.

Am I completely off the mark here, or is there a technique to calculate these thresholds from the data?

Thanks!
L
 
lavoisier said:
##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##

I expect those parameters, ##\text{age}_0=50## & ##\text{chol}_0=200##, were picked from some other knowledge (or arbitrarily). They don't come out of the regression. If they did, you would have a regression model like

$$\begin{align}
x & = c_0 & + & c_{age} (\text{age} - \text{age}_0) & + & c_{chol} (\text{chol} - \text{chol}_0) & + \ldots \\
& = a_0 & + & c_{age} \text{age} & + & c_{chol} \text{chol} & + \ldots
\end{align}$$

where those parameters can all be wrapped up into a single parameter, ##a_0 = c_0 - c_{age} \text{age}_0 - c_{chol} \text{chol}_0##. The regression can tell you what ##a_0## should be, but not how that breaks down into ##c_0, \text{age}_0, \text{chol}_0##.
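
To make this concrete, here is a minimal Python sketch. The coefficient values are made up for illustration (they are not from any real data set), and the helper names are hypothetical. It checks that two different choices of reference values, with the constant adjusted via ##a_0 = c_0 - c_{age} \text{age}_0 - c_{chol} \text{chol}_0##, give the same predicted probabilities, which is why the data cannot distinguish between them.

```python
import math

def logistic(x):
    """Logistic function P = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def intercept_for(a0, c_age, c_chol, age0, chol0):
    """Constant c0 that goes with the reference values (age0, chol0),
    obtained by rearranging a0 = c0 - c_age*age0 - c_chol*chol0."""
    return a0 + c_age * age0 + c_chol * chol0

def predictor_centered(age, chol, c0, c_age, c_chol, age0, chol0):
    """Linear predictor written with reference values subtracted."""
    return c0 + c_age * (age - age0) + c_chol * (chol - chol0)

# Hypothetical fitted values, chosen only for illustration.
A0, C_AGE, C_CHOL = -9.0, 0.08, 0.02

# Two arbitrary choices of reference values (age0, chol0).
REFERENCES = [(50.0, 200.0), (37.0, 163.0)]

def probabilities(age, chol):
    """Predicted probability under each choice of reference values."""
    result = []
    for age0, chol0 in REFERENCES:
        c0 = intercept_for(A0, C_AGE, C_CHOL, age0, chol0)
        x = predictor_centered(age, chol, c0, C_AGE, C_CHOL, age0, chol0)
        result.append(logistic(x))
    return result

for age, chol in [(40, 180), (50, 200), (65, 260)]:
    print(age, chol, probabilities(age, chol))
```

Each printed pair should agree up to floating-point rounding: only the constant changes with the reference values, not the fitted model.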
 
Yes you are right. Those numbers do not come from the regression. You can use any numbers you like there. It just changes the meaning and value of the constant term.
 
Thank you both for your replies.

On the subject matter: ouch! I feared this would be the case.

Maybe I could estimate non-arbitrary thresholds by calculating what value of each variable corresponds to P=50%, when all other variables are set to their own population means.

E.g. if we have only age and cholesterol as explanatory variables, to calculate the age threshold:
##P = 1/2## when ##x = 0##
##a_0 + a_{age}\,\text{age}_{thr} + a_{chol}\,\text{chol}_{mean} = 0##
##\text{age}_{thr} = (-a_0 - a_{chol}\,\text{chol}_{mean}) / a_{age}##

I guess this would be like asking: what value of the age threshold makes an individual that is 'average' in all other respects reach 50% probability of disease?

Or maybe I should first calculate the probability of disease when all variables are set to their means, and use that as the probability threshold rather than a generic 50%.
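
A rough Python sketch of both ideas, assuming made-up coefficients and population means (none of these numbers come from real data, and `age_at_probability` is a hypothetical helper). It solves ##a_0 + a_{age}\,\text{age} + a_{chol}\,\text{chol} = \text{logit}(p)## for age, with chol held fixed.

```python
import math

def logit(p):
    """Inverse of the logistic function: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def logistic(x):
    """Logistic function P = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def age_at_probability(p_target, a0, a_age, a_chol, chol_value):
    """Age at which P reaches p_target with chol held at chol_value.
    Solves a0 + a_age*age + a_chol*chol_value = logit(p_target)."""
    if a_age == 0:
        raise ValueError("age has no effect on P, so there is no solution")
    return (logit(p_target) - a0 - a_chol * chol_value) / a_age

# Hypothetical coefficients and population means (illustration only).
A0, A_AGE, A_CHOL = -9.0, 0.08, 0.02
AGE_MEAN, CHOL_MEAN = 52.0, 210.0

# Idea 1: age at which an otherwise "average" person reaches P = 50%.
age_50 = age_at_probability(0.5, A0, A_AGE, A_CHOL, CHOL_MEAN)

# Idea 2: use P at the population means as the target instead of 50%.
p_mean = logistic(A0 + A_AGE * AGE_MEAN + A_CHOL * CHOL_MEAN)
age_at_p_mean = age_at_probability(p_mean, A0, A_AGE, A_CHOL, CHOL_MEAN)

print(age_50, p_mean, age_at_p_mean)
```

Note that the second idea just returns the mean age again: if the target is the probability at the means, the equation is satisfied by the mean itself, so it adds no information beyond the population means.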

In a related problem, where the probability was reduced to a binary outcome, the approach to find a threshold was based on comparing the distributions of the explanatory variables separated between the two outcome classes (in this case, diseased or not diseased).
If each explanatory variable is normally distributed in the two classes, apparently one can use standard formulae to compute in each case the value of the variable for which the probability of belonging to one class is the same as the probability of belonging to the other class; and that would be the threshold.
Not sure if this would be applicable here. Worth a try maybe.
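
For reference, a sketch of that two-class calculation as I understand it, assuming one variable at a time, normal distributions in each class, and known class proportions. The means and standard deviations below are invented for illustration, and `equal_posterior_points` is a hypothetical name. It finds the values where prior times density is equal for the two classes, which reduces to a quadratic in the variable.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def equal_posterior_points(mu1, s1, mu2, s2, prior1=0.5):
    """Values x where prior1*N(x; mu1, s1) == (1 - prior1)*N(x; mu2, s2).
    Taking logs of both sides gives a*x^2 + b*x + c = 0."""
    prior2 = 1.0 - prior1
    a = 1.0 / (2.0 * s2 ** 2) - 1.0 / (2.0 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = (mu2 ** 2 / (2.0 * s2 ** 2) - mu1 ** 2 / (2.0 * s1 ** 2)
         + math.log(prior1 / s1) - math.log(prior2 / s2))
    if abs(a) < 1e-12:
        # Equal variances: the equation is linear, at most one crossing.
        if abs(b) < 1e-12:
            return []
        return [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []  # one class dominates everywhere
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])

# Hypothetical cholesterol distributions: (mean, std dev) per class.
same_spread = equal_posterior_points(190.0, 30.0, 230.0, 30.0)
diff_spread = equal_posterior_points(190.0, 30.0, 230.0, 40.0)
print(same_spread, diff_spread)
```

With equal spreads and equal class proportions the crossing is simply the midpoint of the two means. With unequal spreads there can be two crossing points, so a single "threshold" is not guaranteed. Either way this is a classification cutoff for one variable taken alone, not something read off the logistic regression coefficients.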
 
lavoisier said:
Maybe I could estimate non-arbitrary thresholds by calculating what value of each variable corresponds to P=50%, when all other variables are set to their own population means.
But those numbers are not thresholds of any kind.

All they mean is that the intercept term, ##c_0## in your original equation, is the value expected for someone with age=50 and chol=200. There is no cutoff or threshold involved.

For your other equation, the intercept term, ##a_0##, is instead the value expected for someone with age=0 and chol=0. But again there is no threshold; it just tells you how to interpret the intercept term.
 
lavoisier said:
If I had data on age, cholesterol, etc, vs presence/absence of the disease, and ran a logistic regression, I think I would get something like this:
##x = a_0 + a_{age}\,\text{age} + a_{chol}\,\text{chol} + \ldots##

I.e. I would only know that age and chol increase P, but not 'when' someone should start to worry about their age and cholesterol.

We should first be clear on whether logistic regression can only produce models where an increase in the value of an independent variable either always increases the predicted probability or always decreases it. From glancing at the web, this appears to be true when ##x## is linear in that variable, but I myself have never used logistic regression.

There are situations where the probability of being healthy depends on keeping some variables near a "healthy" mean value. This must be an often encountered situation in the biological sciences. Surely people employ methods that handle non-monotonic effects.

Are you sure the paper you recall didn't define the dependent variable ##y## as the probability of being disease free and use an independent variable like ##x = c_0 + c_{age}( age - age_0)^2 + c_{chol} (chol - chol_0)^2 ## ?
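
To illustrate the difference, a small Python sketch with invented coefficients (the function names and numbers are hypothetical). A linear term gives a probability that only moves one way with age, while a squared term centred on a reference value gives a probability that turns around at that value.

```python
import math

def logistic(x):
    """Logistic function P = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def p_linear(age, a0=-4.0, a_age=0.06):
    """Probability when x is linear in age: monotonic in age."""
    return logistic(a0 + a_age * age)

def p_quadratic(age, c0=-3.0, c_age=0.004, age0=40.0):
    """Probability when x = c0 + c_age*(age - age0)^2.
    With c_age > 0 it is lowest at age0 and rises on both sides."""
    return logistic(c0 + c_age * (age - age0) ** 2)

def is_monotonic(values):
    """True if the sequence never changes direction."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

def vertex_from_coefficients(b1, b2):
    """Reference value age0 recovered from a fit of the form
    x = b0 + b1*age + b2*age^2, using c*(age - age0)^2 expanded:
    b2 = c and b1 = -2*c*age0, hence age0 = -b1 / (2*b2)."""
    return -b1 / (2.0 * b2)

ages = list(range(20, 90))
print(is_monotonic([p_linear(a) for a in ages]))     # linear model
print(is_monotonic([p_quadratic(a) for a in ages]))  # squared-term model
print(vertex_from_coefficients(b1=-0.32, b2=0.004))
```

In the squared-term form the reference value is no longer arbitrary: expanding the square gives separate age and age² coefficients, and their ratio pins down ##age_0##. That is a different situation from the linear model, where the reference value is absorbed into the constant.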
 
OK, I see, thanks.
I will not try to read into this more than the numbers tell me.
Yes, I am pretty sure x was a linear function of the explanatory variables.
 
Hey lavoisier.

Is there anything that might suggest that the numbers are derived from some principle or some experimental data?

I think you might find that the numbers are representative of something - like an average/median figure that is used to calibrate the model.

It may be that the age of 50 and cholesterol of 200 are benchmarks for the experiment or science being discussed and there is probably a good reason for it.
 
Hi chiro,
I don't know, maybe. See message #4 where I mentioned the population mean, which is sort of close to what you're saying, at least conceptually.
In any case, those numbers should make sense, or 'mean' something, otherwise why put them there?
Because from the above discussion it's clear that one is in principle free to rewrite the linear exponent in any way that is consistent with the total.
And I believe there are infinitely many ways of doing that.
 
  • #10
I think it is more likely a "cut-off" value, if it isn't a mean or median.

I don't want to go further because that would be too much speculation on my part.
 
  • #11
Again, those values are completely arbitrary and only define the interpretation of the intercept term. They are not derived from the data, nor do they represent any sort of threshold or cutoff.

If an important value is known from prior data (population average, decision point, etc), then it is certainly possible to arbitrarily choose to use that important value. Then the intercept would be interpreted as referring to that value.

But again, that is not from the logistic regression and does not itself imply anything special like a cutoff.
 
  • #12
lavoisier said:
Something like: ##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##

This morning's web browsing says that there are various regression models, like the Tobit, that estimate thresholds from data. So if the authors used such a model they might present their findings with the above formula accompanied by instructions like "if the person's age is less than 50, set age = 50" etc.
 
  • #13
If the "thresholds" were calculated using any model, it is very unlikely that the results would be such round numbers as age_threshold=50, chol_threshold=200. Those "threshold" numbers seem to be convenient, human-chosen numbers.

It seems to be an unresolved question as to how the thresholds are really used in the regression:
Option 1: x' = x-x_threshold (No real model change. Just using intermediate variables in the regression.)
Option 2: x' = max( x, x_threshold) (A significant model change, not adequately indicated by the regression equation in the original post.)
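
A short Python sketch of the difference between the two options, using a single variable and invented numbers (the coefficients, threshold, and function names are all hypothetical):

```python
import math

def logistic(x):
    """Logistic function P = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical intercept, slope, and threshold (illustration only).
A0, A_AGE = -5.0, 0.08
AGE_THR = 50.0

def p_plain(age):
    """Model written without any threshold."""
    return logistic(A0 + A_AGE * age)

def p_option1(age):
    """Option 1: x' = x - x_threshold, with the constant adjusted.
    Same model as p_plain, only the intercept is re-expressed."""
    c0 = A0 + A_AGE * AGE_THR
    return logistic(c0 + A_AGE * (age - AGE_THR))

def p_option2(age):
    """Option 2: x' = max(x, x_threshold).
    Ages below the threshold are treated as if equal to it."""
    return logistic(A0 + A_AGE * max(age, AGE_THR))

for age in (30, 40, 50, 60, 70):
    print(age, p_plain(age), p_option1(age), p_option2(age))
```

Option 1 reproduces the plain model at every age. Option 2 agrees with it only at or above the threshold and is flat below, so in that case the threshold really is part of the model and changing it changes the predictions.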
 
