Can I get thresholds from logistic regression coefficients?


Discussion Overview

The discussion revolves around the use of logistic regression in determining thresholds for variables such as age and cholesterol levels in predicting the probability of myocardial infarction. Participants explore how these thresholds are derived and whether they can be calculated from logistic regression data.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant recalls an example where logistic regression included thresholds for age and cholesterol, questioning how these thresholds are determined.
  • Another participant suggests that the thresholds (e.g., age=50, cholesterol=200) are likely chosen based on prior knowledge or arbitrarily, rather than derived from the regression itself.
  • It is proposed that one could estimate thresholds by calculating values that correspond to a probability of 50% when other variables are held at their means.
  • A participant mentions that the intercept term in the regression can be interpreted differently depending on the chosen thresholds, but emphasizes that these do not represent actual cutoffs.
  • There is a suggestion that the thresholds might represent average or median figures used to calibrate the model, but this remains speculative.
  • Some participants express uncertainty about whether logistic regression can accommodate non-monotonic relationships between variables and outcomes.
  • Another participant notes that if thresholds were derived from any model, they would likely not be exact numbers, suggesting they are human-generated for convenience.

Areas of Agreement / Disagreement

Participants generally agree that the thresholds discussed are not derived from the logistic regression itself and that their interpretation can vary. However, there is no consensus on how these thresholds should be determined or their significance in the context of the regression model.

Contextual Notes

Participants highlight the potential for arbitrary selection of threshold values and the implications of using different interpretations of the intercept term in the regression model. There is also mention of alternative regression models that might estimate thresholds from data, but this is not directly applicable to the logistic regression context discussed.

lavoisier
Hello,
I remember an example of logistic regression applied to medicine / epidemiology, which said (more or less) that the probability of a person having a myocardial infarction was related to variables such as age, cholesterol level, etc., and the equation included a 'threshold' for each of these variables.
Something like: ##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##
This was the ##x## in the logistic formula ##P = 1/(1+e^{-x})##.
If the coefficients are all positive, it follows that when age > 50 and chol > 200, a positive contribution is given to ##x## by these two variables, which makes ##e^{-x}## smaller, and ##P## closer to 1.

Now my question is, how did they find the thresholds (50 and 200) for age and chol?
If I had data on age, cholesterol, etc, vs presence/absence of the disease, and ran a logistic regression, I think I would get something like this:
##x = a_0 + a_{age}\,\text{age} + a_{chol}\,\text{chol} + \ldots##

I.e. I would only know that age and chol increase P, but not 'when' someone should start to worry about their age and cholesterol.

Am I completely off the mark here, or is there a technique to calculate these thresholds from the data?

Thanks!
L
 
lavoisier said:
##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##

I expect those parameters, ##\text{age}_0=50## & ##\text{chol}_0=200##, were picked from some other knowledge (or arbitrarily). They don't come out of the regression. If they did, you would have a regression model like

$$\begin{align}
x & = c_0 & + & c_{age} (\text{age} - \text{age}_0) & + & c_{chol} (\text{chol} - \text{chol}_0) & + \ldots \\
& = a_0 & + & c_{age} \text{age} & + & c_{chol} \text{chol} & + \ldots
\end{align}$$

where those parameters can all be wrapped up into a single parameter, ##a_0 = c_0 - c_{age} \text{age}_0 - c_{chol} \text{chol}_0##. The regression can tell you what ##a_0## should be, but not how that breaks down into ##c_0, \text{age}_0, \text{chol}_0##.
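As a quick numerical check of this point, here is a minimal sketch. The coefficient values are made up purely for illustration (they are not from any study). It shows that every choice of reference values gives exactly the same predicted probabilities, provided the intercept is adjusted via ##c_0 = a_0 + c_{age} \text{age}_0 + c_{chol} \text{chol}_0##.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical coefficients, chosen only for illustration.
a0, a_age, a_chol = -7.0, 0.06, 0.015

def p_raw(age, chol):
    # x = a0 + a_age*age + a_chol*chol
    return logistic(a0 + a_age * age + a_chol * chol)

def p_centered(age, chol, age0, chol0):
    # The intercept absorbs the reference values:
    # c0 = a0 + a_age*age0 + a_chol*chol0
    c0 = a0 + a_age * age0 + a_chol * chol0
    return logistic(c0 + a_age * (age - age0) + a_chol * (chol - chol0))

age = np.array([35.0, 50.0, 72.0])
chol = np.array([180.0, 200.0, 260.0])

# Any reference point gives the same probabilities as the raw form.
for age0, chol0 in [(50.0, 200.0), (0.0, 0.0), (41.3, 187.9)]:
    same = np.allclose(p_raw(age, chol), p_centered(age, chol, age0, chol0))
    print(age0, chol0, same)
```

Since the data only ever see the probabilities, they cannot distinguish one pair of reference values from another.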
 
Yes you are right. Those numbers do not come from the regression. You can use any numbers you like there. It just changes the meaning and value of the constant term.
 
Thank you both for your replies.

On the subject matter: ouch! I feared this would be the case.

Maybe I could estimate non-arbitrary thresholds by calculating what value of each variable corresponds to P=50%, when all other variables are set to their own population means.

E.g. if we have only age and cholesterol as explanatory variables, to calculate the age threshold:
##P = 1/2## when ##x = 0##
##a_0 + a_{age}\,\text{age}_{thr} + a_{chol}\,\text{chol}_{mean} = 0##
##\text{age}_{thr} = (-a_0 - a_{chol}\,\text{chol}_{mean}) / a_{age}##

I guess this would be like asking: what value of the age threshold makes an individual that is 'average' in all other respects reach 50% probability of disease?

Or maybe I should first calculate the probability of disease when all variables are set to their means, and use that as the probability threshold rather than a generic 50%.

In a related problem, where the probability was reduced to a binary outcome, the approach to find a threshold was based on comparing the distributions of the explanatory variables separated between the two outcome classes (in this case, diseased or not diseased).
If each explanatory variable is normally distributed in the two classes, apparently one can use standard formulae to compute in each case the value of the variable for which the probability of belonging to one class is the same as the probability of belonging to the other class; and that would be the threshold.
Not sure if this would be applicable here. Worth a try maybe.
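
The P = 50% calculation from this post can be sketched as below. The coefficients and population means are hypothetical values picked for illustration, and `threshold_for` is just a helper name invented here. The helper generalises the formula to any target probability by solving ##a_0 + a_{age}\,v + \text{(other terms)} = \ln\frac{P}{1-P}## for ##v##.

```python
import math

def threshold_for(p_target, a0, a_k, other_terms):
    """Value v that solves a0 + a_k * v + other_terms = logit(p_target)."""
    logit = math.log(p_target / (1.0 - p_target))
    return (logit - a0 - other_terms) / a_k

# Hypothetical fitted coefficients and population means (illustration only).
a0, a_age, a_chol = -7.0, 0.06, 0.015
age_mean, chol_mean = 55.0, 210.0

# Age at which a person with mean cholesterol reaches P = 50%.
age_thr_50 = threshold_for(0.5, a0, a_age, a_chol * chol_mean)

# Variant: use the probability at the all-means point as the target instead.
x_mean = a0 + a_age * age_mean + a_chol * chol_mean
p_mean = 1.0 / (1.0 + math.exp(-x_mean))
age_thr_mean = threshold_for(p_mean, a0, a_age, a_chol * chol_mean)

print(age_thr_50, age_thr_mean)
```

With these made-up numbers the 50% version gives an age of about 64. The second variant returns the mean age itself, by construction, so using the probability at the means as the target just reproduces the population means. Either way, the result depends entirely on the target probability one chooses to plug in.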
 
lavoisier said:
Maybe I could estimate non-arbitrary thresholds by calculating what value of each variable corresponds to P=50%, when all other variables are set to their own population means.
But those numbers are not thresholds of any kind.

All they mean is that the intercept term, ##c_0## in your original equation, is the value of ##x## (the log-odds) expected for someone with age=50 and chol=200. There is no cutoff or threshold involved.

For your other equation, the intercept term ##a_0## is instead the value expected for someone with age=0 and chol=0. Again, there is no threshold; it just tells you how to interpret the intercept term.
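
One way to see this on data is the sketch below, under stated assumptions. The data are simulated with made-up coefficients, and `fit_logistic` is a bare-bones Newton-Raphson fit written only for this illustration, not a library routine. Fitting the same data with raw and with shifted predictors should give identical slopes, with only the intercept changing.

```python
import numpy as np

def fit_logistic(X, y, n_iter=30):
    """Newton-Raphson (IRLS) fit; X must already contain a column of ones."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)                  # per-observation weights
        grad = X.T @ (y - p)               # gradient of the log-likelihood
        hess = X.T @ (X * w[:, None])      # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated data with hypothetical "true" coefficients.
rng = np.random.default_rng(0)
n = 5000
age = rng.normal(55.0, 10.0, n)
chol = rng.normal(210.0, 30.0, n)
true_x = -7.0 + 0.06 * age + 0.015 * chol
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_x))).astype(float)

ones = np.ones(n)
raw = fit_logistic(np.column_stack([ones, age, chol]), y)
cen = fit_logistic(np.column_stack([ones, age - 50.0, chol - 200.0]), y)

print("slopes (raw):     ", raw[1:])
print("slopes (shifted): ", cen[1:])
print("a0:", raw[0], " c0:", cen[0])
# c0 is the fitted log-odds at age=50, chol=200:
print("a0 + 50*a_age + 200*a_chol:", raw[0] + 50.0 * raw[1] + 200.0 * raw[2])
```

The shifted intercept is just the raw fit evaluated at the reference point (age=50, chol=200), which is the interpretation described above.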
 
lavoisier said:
If I had data on age, cholesterol, etc, vs presence/absence of the disease, and ran a logistic regression, I think I would get something like this:
##x = a_0 + a_{age}\,\text{age} + a_{chol}\,\text{chol} + \ldots##

I.e. I would only know that age and chol increase P, but not 'when' someone should start to worry about their age and cholesterol.

We should first be clear on whether logistic regression can only produce models where an increase in the value of an explanatory (independent) variable either always increases the predicted probability or always decreases it. From glancing at the web, this appears to be true when ##x## is linear in the variables, but I myself have never used logistic regression.

There are situations where the probability of being healthy depends on keeping some variables near a "healthy" mean value. This must be an often encountered situation in the biological sciences. Surely people employ methods that handle non-monotonic effects.

Are you sure the paper you recall didn't define the dependent variable ##y## as the probability of being disease free and use an independent variable like ##x = c_0 + c_{age}( age - age_0)^2 + c_{chol} (chol - chol_0)^2 ## ?
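
To illustrate the quadratic form suggested here, a small sketch follows. The numbers are hypothetical and chosen only so the effect is visible. With a negative coefficient on the squared term, the probability of being disease free peaks at ##age_0## and falls off on either side, so the relationship is no longer monotonic.

```python
import numpy as np

def p_quadratic(age, c0, c_age, age0):
    # x = c0 + c_age * (age - age0)^2, then the usual logistic transform
    x = c0 + c_age * (age - age0) ** 2
    return 1.0 / (1.0 + np.exp(-x))

ages = np.arange(20, 81, 10, dtype=float)   # 20, 30, ..., 80
# Hypothetical values: peak probability of being disease free at age 50.
p = p_quadratic(ages, c0=2.0, c_age=-0.004, age0=50.0)

for a, prob in zip(ages, p):
    print(int(a), round(float(prob), 3))
```

If ##age_0## is not fixed in advance, the square can be expanded as ##c_{age}\,age^2 - 2 c_{age}\,age_0\,age + c_{age}\,age_0^2##, so an ordinary logistic regression on ##age## and ##age^2## would recover it as ##age_0 = -b_1 / (2 b_2)##, where ##b_1## and ##b_2## are the fitted coefficients of ##age## and ##age^2##. In that setting, unlike the linear one, the data do determine the reference value.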
 
OK, I see, thanks.
I will not try to read into this more than the numbers tell me.
Yes, I am pretty sure x was a linear function of the explanatory variables.
 
Hey lavoisier.

Is there anything that might suggest that the numbers are derived from some principle or some experimental data?

I think you might find that the numbers are representative of something - like an average/median figure that is used to calibrate the model.

It may be that the age of 50 and cholesterol of 200 are benchmarks for the experiment or science being discussed and there is probably a good reason for it.
 
Hi chiro,
I don't know, maybe. See message #4 where I mentioned the population mean, which is sort of close to what you're saying, at least conceptually.
In any case, those numbers should make sense, or 'mean' something, otherwise why put them there?
Because from the above discussion it's clear that one is in principle free to rewrite the linear exponent in any way that is consistent with the total.
And I believe there are infinitely many ways of doing that.
 
  • #10
I think it has more to do with a "cut-off" value, if it isn't a mean or median.

I don't want to go further because that would be too much speculation on my part.
 
  • #11
Again, those values are completely arbitrary and only define the interpretation of the intercept term. They are not derived from the data, nor do they represent any sort of threshold or cutoff.

If an important value is known from prior data (population average, decision point, etc), then it is certainly possible to arbitrarily choose to use that important value. Then the intercept would be interpreted as referring to that value.

But again, that is not from the logistic regression and does not itself imply anything special like a cutoff.
 
  • #12
lavoisier said:
Something like: ##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##

This morning's web browsing says that there are various regression models, like the Tobit, that estimate thresholds from data. So if the authors used such a model they might present their findings with the above formula accompanied by instructions like "if the person's age is less than 50, set age = 50" etc.
 
  • #13
If the "thresholds" were calculated using any model, it is very unlikely that the results would be such round numbers as age_threshold=50, chol_threshold=200. Those "threshold" numbers seem to be convenient, human-generated numbers.

It seems to be an unresolved question as to how the thresholds are really used in the regression:
Option 1: x' = x-x_threshold (No real model change. Just using intermediate variables in the regression.)
Option 2: x' = max( x, x_threshold) (A significant model change, not adequately indicated by the regression equation in the original post.)
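
The difference between the two options can be sketched with a single predictor. The coefficients below are hypothetical, for illustration only. Option 1 reproduces the plain linear model exactly once the intercept is adjusted, whereas Option 2 flattens the predicted probability for everyone below the threshold.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical coefficients and threshold.
a0, a_age = -4.0, 0.07
age_thr = 50.0
ages = np.array([30.0, 45.0, 50.0, 65.0])

# Plain linear model in age.
p_plain = logistic(a0 + a_age * ages)

# Option 1: x' = x - x_threshold, with intercept c0 = a0 + a_age * age_thr.
p_shift = logistic((a0 + a_age * age_thr) + a_age * (ages - age_thr))

# Option 2: x' = max(x, x_threshold); ages below 50 are treated as 50.
p_clamp = logistic(a0 + a_age * np.maximum(ages, age_thr))

print("plain:", np.round(p_plain, 3))
print("shift:", np.round(p_shift, 3))
print("clamp:", np.round(p_clamp, 3))
```

Only under Option 2 does the value 50 change the predictions, so only there could the data say anything about where the threshold lies.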
 
