Can I get thresholds from logistic regression coefficients?

In summary, the conversation discussed an example of applying logistic regression to medicine/epidemiology, where the probability of a person having a heart attack was related to variables such as age and cholesterol level. The equation included a different threshold for each variable, but it was pointed out that these thresholds were not determined by the regression; they were picked from previous knowledge or arbitrarily. The conversation then explored the idea of estimating non-arbitrary thresholds by calculating the value of each variable that corresponds to a certain probability, but it was noted that such values are not actual thresholds and only affect how the intercept term in the equation is interpreted. Finally, there was a discussion about whether logistic regression can handle non-monotonic effects and whether the numbers used in the equation were derived from some principle or experimental data.
  • #1
lavoisier
Hello,
I remember an example of the application of logistic regression to medicine / epidemiology, which said (more or less) that the probability of a person having a myocardial infarction was related to some variables such as age, cholesterol level, etc., and the equation included various 'thresholds' for each of these variables.
Something like: ##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##
This was the ##x## in the logistic formula ##P = 1/(1+e^{-x})##.
If the coefficients are all positive, it follows that when age > 50 and chol > 200, a positive contribution is given to ##x## by these two variables, which makes ##e^{-x}## smaller, and ##P## closer to 1.

Now my question is, how did they find the thresholds (50 and 200) for age and chol?
If I had data on age, cholesterol, etc, vs presence/absence of the disease, and ran a logistic regression, I think I would get something like this:
##x = a_0 + a_{age}\,\text{age} + a_{chol}\,\text{chol} + \ldots##

I.e. I would only know that age and chol increase P, but not 'when' someone should start to worry about their age and cholesterol.

Am I completely off the mark here, or is there a technique to calculate these thresholds from the data?

Thanks!
L
 
  • #2
lavoisier said:
##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##

I expect those parameters, ##\text{age}_0=50## & ##\text{chol}_0=200##, were picked from some other knowledge (or arbitrarily). They don't come out of the regression. If they did, you would have a regression model like

$$\begin{align}
x & = c_0 & + & c_{age} (\text{age} - \text{age}_0) & + & c_{chol} (\text{chol} - \text{chol}_0) & + \ldots \\
& = a_0 & + & c_{age} \text{age} & + & c_{chol} \text{chol} & + \ldots
\end{align}$$

where those parameters can all be wrapped up into a single parameter, ##a_0 = c_0 - c_{age} \text{age}_0 - c_{chol} \text{chol}_0##. The regression can tell you what ##a_0## should be, but not how that breaks down into ##c_0, \text{age}_0, \text{chol}_0##.
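
To make the point concrete, here is a minimal Python sketch. The coefficient values are hypothetical, picked only for illustration, and the function names are mine. It shows that any choice of reference point (age_0, chol_0) gives exactly the same ##x##, as long as the intercept is adjusted to match.

```python
# Hypothetical fitted coefficients, chosen only for illustration.
A0, C_AGE, C_CHOL = -9.0, 0.08, 0.02


def x_raw(age, chol):
    """Linear predictor in the uncentered form a0 + c_age*age + c_chol*chol."""
    return A0 + C_AGE * age + C_CHOL * chol


def intercept_for(age0, chol0):
    """Intercept c0 implied by a chosen reference point (age0, chol0).

    Rearranged from a0 = c0 - c_age*age0 - c_chol*chol0.
    """
    return A0 + C_AGE * age0 + C_CHOL * chol0


def x_centered(age, chol, age0, chol0):
    """Linear predictor in the centered form c0 + c_age*(age - age0) + c_chol*(chol - chol0)."""
    c0 = intercept_for(age0, chol0)
    return c0 + C_AGE * (age - age0) + C_CHOL * (chol - chol0)


if __name__ == "__main__":
    person = (63, 240)
    # Three arbitrary reference points: the intercept changes, x does not.
    for ref in [(50, 200), (0, 0), (37.5, 181.0)]:
        print(ref, intercept_for(*ref), x_centered(*person, *ref), x_raw(*person))
```

Only the intercept column changes from one reference point to the next, which is why the data cannot single out one reference point over another.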
 
  • #3
Yes you are right. Those numbers do not come from the regression. You can use any numbers you like there. It just changes the meaning and value of the constant term.
 
  • #4
Thank you both for your replies.

On the subject matter: ouch! I feared this would be the case.

Maybe I could estimate non-arbitrary thresholds by calculating what value of each variable corresponds to P=50%, when all other variables are set to their own population means.

E.g. if we have only age and cholesterol as explanatory variables, to calculate the age threshold:
##P = 1/2## when ##x = 0##
##a_0 + a_{age}\,\text{age}_{thr} + a_{chol}\,\text{chol}_{mean} = 0##
##\text{age}_{thr} = (-a_0 - a_{chol}\,\text{chol}_{mean}) / a_{age}##

I guess this would be like asking: what value of the age threshold makes an individual that is 'average' in all other respects reach 50% probability of disease?

Or maybe I should first calculate the probability of disease when all variables are set to their means, and use that as the probability threshold rather than a generic 50%.
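
As a rough sketch of the calculation just described, for two explanatory variables only: the coefficients and population means below are hypothetical, and `age_at_probability` is a name made up for this example. It solves the linear predictor for age at a chosen target probability, which reduces to the formula above when the target is 50%.

```python
import math


def sigmoid(x):
    """Logistic function P = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of the logistic function: the x that gives probability p."""
    return math.log(p / (1.0 - p))


def age_at_probability(p_target, a0, a_age, a_chol, chol_mean):
    """Age at which P equals p_target for a person with mean cholesterol.

    Solves a0 + a_age*age + a_chol*chol_mean = logit(p_target) for age.
    """
    return (logit(p_target) - a0 - a_chol * chol_mean) / a_age


if __name__ == "__main__":
    # Hypothetical coefficients and population means (assumptions, not fitted values).
    a0, a_age, a_chol = -9.0, 0.08, 0.02
    age_mean, chol_mean = 55.0, 210.0

    print(age_at_probability(0.5, a0, a_age, a_chol, chol_mean))

    # Second idea: use the probability at the population means as the target.
    p_mean = sigmoid(a0 + a_age * age_mean + a_chol * chol_mean)
    print(age_at_probability(p_mean, a0, a_age, a_chol, chol_mean))
```

Note that with the second idea the 'threshold' for each variable comes out as that variable's own mean, by construction, so it adds no information beyond the means themselves.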

In a related problem, where the probability was reduced to a binary outcome, the approach to find a threshold was based on comparing the distributions of the explanatory variables separated between the two outcome classes (in this case, diseased or not diseased).
If each explanatory variable is normally distributed in the two classes, apparently one can use standard formulae to compute in each case the value of the variable for which the probability of belonging to one class is the same as the probability of belonging to the other class; and that would be the threshold.
Not sure if this would be applicable here. Worth a try maybe.
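
One such formula, sketched here under the extra assumption that the variable has the same standard deviation in both classes (with unequal standard deviations a quadratic has to be solved instead). The means, standard deviation, and prevalence are hypothetical, and the function names are made up for this example.

```python
import math


def normal_pdf(v, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def equal_posterior_threshold(mean0, mean1, sd, prior1=0.5):
    """Value at which both classes are equally probable.

    Assumes a normal distribution with a common sd in both classes.
    mean0 / mean1 are the class means (not diseased / diseased);
    prior1 is the prevalence of class 1.
    """
    prior0 = 1.0 - prior1
    midpoint = 0.5 * (mean0 + mean1)
    return midpoint + sd ** 2 * math.log(prior0 / prior1) / (mean1 - mean0)


if __name__ == "__main__":
    # Hypothetical cholesterol distributions in the two classes.
    print(equal_posterior_threshold(190.0, 230.0, 30.0))              # equal priors
    print(equal_posterior_threshold(190.0, 230.0, 30.0, prior1=0.2))  # rarer disease
```

With equal priors the result is simply the midpoint of the two class means; a rarer disease pushes the crossover towards the diseased-class mean and beyond.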
 
  • #5
lavoisier said:
Maybe I could estimate non-arbitrary thresholds by calculating what value of each variable corresponds to P=50%, when all other variables are set to their own population means.
But those numbers are not thresholds of any kind.

All they mean is that the intercept term, ##c_0## in your original equation, is the value of ##x## (the log-odds) expected for someone with age=50 and chol=200. There is no cutoff or threshold involved.

In your other equation, the intercept term, ##a_0##, is instead the value of ##x## expected for someone with age=0 and chol=0. But again there is no threshold; it is just telling you how to interpret the intercept term.
 
  • #6
lavoisier said:
If I had data on age, cholesterol, etc, vs presence/absence of the disease, and ran a logistic regression, I think I would get something like this:
##x = a_0 + a_{age}\,\text{age} + a_{chol}\,\text{chol} + \ldots##

I.e. I would only know that age and chol increase P, but not 'when' someone should start to worry about their age and cholesterol.

We should first be clear on whether logistic regression can only produce models where an increase in the value of an explanatory variable either always increases the predicted probability or always decreases it. From glancing at the web, this appears to be true when ##x## is linear in the explanatory variables, but I myself have never used logistic regression.

There are situations where the probability of being healthy depends on keeping some variables near a "healthy" mean value. This must be an often encountered situation in the biological sciences. Surely people employ methods that handle non-monotonic effects.

Are you sure the paper you recall didn't define the dependent variable ##y## as the probability of being disease free and use an independent variable like ##x = c_0 + c_{age}( age - age_0)^2 + c_{chol} (chol - chol_0)^2 ## ?
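
A minimal sketch of what such a squared-term model would look like, with hypothetical numbers and made-up function names. Expanding the square turns it into a model that is linear in chol and chol², so in this form the centre ##chol_0## can be recovered from the two fitted coefficients, unlike the reference value in the purely linear case.

```python
import math


def sigmoid(x):
    """Logistic function P = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def p_healthy(chol, c0, c_chol, chol0):
    """P(disease free) with a squared term; c_chol < 0 makes it peak at chol0."""
    return sigmoid(c0 + c_chol * (chol - chol0) ** 2)


def optimum_from_quadratic(b1, b2):
    """Extremum of b0 + b1*v + b2*v**2, located at v = -b1 / (2*b2).

    b1 and b2 stand for the coefficients of v and v**2 after expanding
    c*(v - v0)**2 = c*v**2 - 2*c*v0*v + c*v0**2.
    """
    return -b1 / (2.0 * b2)


if __name__ == "__main__":
    # Hypothetical parameters: probability of being healthy peaks at chol = 180.
    c0, c_chol, chol0 = 2.0, -0.001, 180.0
    for chol in (140, 180, 220):
        print(chol, p_healthy(chol, c0, c_chol, chol0))

    # Coefficients of chol and chol**2 after expanding the square.
    b2 = c_chol
    b1 = -2.0 * c_chol * chol0
    print(optimum_from_quadratic(b1, b2))
```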
 
  • #7
OK, I see, thanks.
I will not try to read into this more than the numbers tell me.
Yes, I am pretty sure x was a linear function of the explanatory variables.
 
  • #8
Hey lavoisier.

Is there anything that might suggest that the numbers are derived from some principle or some experimental data?

I think you might find that the numbers are representative of something - like an average/median figure that is used to calibrate the model.

It may be that the age of 50 and cholesterol of 200 are benchmarks for the experiment or science being discussed and there is probably a good reason for it.
 
  • #9
Hi chiro,
I don't know, maybe. See message #4 where I mentioned the population mean, which is sort of close to what you're saying, at least conceptually.
In any case, those numbers should make sense, or 'mean' something, otherwise why put them there?
Because from the above discussion it's clear that one is in principle free to rewrite the linear exponent in any way that is consistent with the total.
And I believe there are infinitely many ways of doing that.
 
  • #10
I think it has more to do with a "cut-off" value, if it isn't a mean or median.

I don't want to go further because that would be too much speculation on my part.
 
  • #11
Again, those values are completely arbitrary and only define the interpretation of the intercept term. They are not derived from the data, nor do they represent any sort of threshold or cutoff.

If an important value is known from prior data (population average, decision point, etc), then it is certainly possible to arbitrarily choose to use that important value. Then the intercept would be interpreted as referring to that value.

But again, that is not from the logistic regression and does not itself imply anything special like a cutoff.
 
  • #12
lavoisier said:
Something like: ##c_0 + c_{age} (\text{age} - 50) + c_{chol} (\text{chol} - 200) + \ldots##

This morning's web browsing says that there are various regression models, like the Tobit, that estimate thresholds from data. So if the authors used such a model they might present their findings with the above formula accompanied by instructions like "if the person's age is less than 50, set age = 50" etc.
 
  • #13
If the "thresholds" were calculated using any model, it is very unlikely that the results would be such round numbers as age_threshold=50, chol_threshold=200. Those "threshold" numbers look like convenient values chosen by a person.

It seems to be an unresolved question as to how the thresholds are really used in the regression:
Option 1: x' = x-x_threshold (No real model change. Just using intermediate variables in the regression.)
Option 2: x' = max( x, x_threshold) (A significant model change, not adequately indicated by the regression equation in the original post.)
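
The difference between the two options can be sketched in a few lines of Python. The coefficients and the threshold are hypothetical, and only one variable (age) is shown.

```python
# Hypothetical intercept, slope, and threshold, for illustration only.
C0, C_AGE, AGE_THR = -1.0, 0.08, 50.0


def x_option1(age):
    """Option 1: plain shift. The slope in age is the same everywhere."""
    return C0 + C_AGE * (age - AGE_THR)


def x_option2(age):
    """Option 2: ages below the threshold are replaced by the threshold,
    so the predictor is flat below AGE_THR and linear above it."""
    return C0 + C_AGE * (max(age, AGE_THR) - AGE_THR)


if __name__ == "__main__":
    for age in (30, 40, 50, 60, 70):
        print(age, x_option1(age), x_option2(age))
```

Under Option 1 a 30-year-old still gets a lower ##x## than a 50-year-old; under Option 2 both get ##c_0##, so the threshold genuinely changes the predictions instead of just relabelling the intercept.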
 

1. What are thresholds in logistic regression?

Thresholds in logistic regression refer to the cut-off points used to classify data into different categories. They are typically based on the predicted probabilities from the logistic regression model and are used to determine if an observation belongs to one category or another.

2. Can I get thresholds from logistic regression coefficients?

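Not directly. The logistic function takes the coefficients, the intercept, and the input variables and returns the predicted probability that an observation belongs to a certain category. The classification threshold is a separate cut-off applied to that probability and is chosen by the analyst; it is not one of the fitted coefficients. Likewise, reference values written inside the linear predictor, such as (age - 50), are not estimated by the regression.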

3. How do I interpret thresholds in logistic regression?

Thresholds in logistic regression can be interpreted as the probability at which an observation is classified into one category or another. For example, if the threshold is 0.5, it means that an observation with a predicted probability of 0.5 or higher will be classified into one category, while an observation with a predicted probability lower than 0.5 will be classified into the other category.
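
A small sketch of this classification rule, using made-up predicted probabilities and outcomes; the helper names are hypothetical. It also reports sensitivity and specificity so the effect of moving the cut-off is visible.

```python
def classify(probabilities, cutoff=0.5):
    """Assign class 1 when the predicted probability is at or above the cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, yhat in pairs if y == 1 and yhat == 1)
    fn = sum(1 for y, yhat in pairs if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in pairs if y == 0 and yhat == 0)
    fp = sum(1 for y, yhat in pairs if y == 0 and yhat == 1)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Made-up predicted probabilities and true outcomes (1 = diseased).
    probs = [0.05, 0.20, 0.35, 0.45, 0.55, 0.70, 0.90]
    labels = [0, 0, 1, 0, 1, 1, 1]
    for cutoff in (0.5, 0.3):
        preds = classify(probs, cutoff)
        print(cutoff, preds, sensitivity_specificity(labels, preds))
```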

4. Can I change the thresholds in logistic regression?

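Yes, you can change the thresholds in logistic regression. The cut-off is chosen by the analyst, not estimated by the model. Changing it trades sensitivity against specificity, so it can significantly affect how the classifier performs. The choice should reflect the relative costs of false positives and false negatives; maximizing overall accuracy is one option, but it can be misleading when one outcome is rare.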

5. What factors affect the thresholds in logistic regression?

The cut-off is a choice made by the analyst, so it is driven mainly by the purpose of the classification: the relative costs of the two kinds of error and how common the outcome is. The number of input variables, the distribution of the data, and the sample size affect the predicted probabilities, and therefore which observations fall on each side of a given cut-off.
