Logistic Regression: Estimating Probability of Survival

In summary, the conversation discusses a logistic regression model with one predictor and one response variable. The estimated intercept and predictor coefficient are ##\hat{\beta_0}## and ##\hat{\beta_1}##, respectively. The model's estimates give ##\hat{p}(X=1)+\hat{p}(X=0)\neq 1##, which initially causes confusion about why the survival rates for males and females do not sum to 1. The conversation also covers how p-values are calculated for logistic regression models, which Python libraries report them, and how p-values can be used to judge the significance of individual predictors and decide whether to drop them from the model.
  • #1
EngWiPy
I have a simple dataset that consists of one predictor, Sex, and one response variable, Survived. Let's say the estimated coefficients are ##\hat{\beta_0}## and ##\hat{\beta_1}## for the intercept and the coefficient of the Sex predictor, respectively. Mathematically, this means that:

[tex]\hat{p}(X) = \frac{1}{1+e^{-(\hat{\beta_0}+\hat{\beta_1}X)}}[/tex]

where X is 1 for male, and 0 for female, and where ##\hat{p}## is the estimated probability. This model's estimates give ##\hat{p}(X=1)+\hat{p}(X=0)\neq 1##. Why? Is it because these are estimates, or because there is/are other reason(s) like Survived depends on other predictors?
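
For concreteness, here is a minimal sketch in Python of what I mean, with hypothetical coefficient values rather than my actual estimates:

[code]
import numpy as np

# Hypothetical coefficient estimates, for illustration only
beta0_hat, beta1_hat = 1.0, -2.5

def p_hat(x):
    # Estimated probability of survival from the fitted logistic model
    return 1.0 / (1.0 + np.exp(-(beta0_hat + beta1_hat * x)))

p_male, p_female = p_hat(1), p_hat(0)        # X = 1 for male, X = 0 for female
print(p_male, p_female, p_male + p_female)   # the sum is generally not 1
[/code]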
 
  • #2
S_David said:
This model's estimates give ##\hat{p}(X=1)+\hat{p}(X=0)\neq 1##. Why?
Why would the survival rates for males and females sum to 1? On a safe mode of transport, the survival rate for all passengers would be very close to 1. Summing to 1 would mean that the only way for 90% of the female passengers to survive would be for 90% of the male passengers to die. That would be bad. Even worse, in order to improve male survival you would have to kill more females!
 
  • #3
I got this wrong. For females, the probability of surviving is

[tex]p(X = 0) = Pr[S = 1 \mid X = 0] = 1-Pr[S = 0 \mid X=0][/tex]

Similarly, the probability of surviving for males is

[tex]p(X = 1) = Pr[S = 1 \mid X = 1] = 1-Pr[S = 0 \mid X=1][/tex]
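
As a sanity check, the empirical conditional rates can be computed directly from the data. A rough sketch, assuming a Titanic-style DataFrame with 'Sex' and 'Survived' columns (the file name is hypothetical):

[code]
import pandas as pd

# Hypothetical file; assumes columns 'Sex' (male/female) and 'Survived' (0/1)
df = pd.read_csv("titanic.csv")

# Empirical estimates of Pr[S = 1 | X = x] for each group
survival_by_sex = df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)   # the two rates need not sum to 1
[/code]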
 
  • #5
I found the estimates using Python. However, the library I use doesn't automatically compute the p-value for ##\hat{\beta_1}##, which is needed to decide whether this relationship is due to pure chance. To compute the p-value I need to find (the steps are summarized from this book):

[tex]t = \frac{\hat{\beta_1}}{SE\{\hat{\beta_1}\}}[/tex]

where

[tex]SE\{\hat{\beta_1}\} = \frac{\sigma}{\sqrt{\sum_i(x_i-\mu_X)^2}}[/tex]

is the standard error of ##\hat{\beta_1}##, and ##\sigma^2## is the variance of the noise in the model ##y=f(X)+\epsilon##, which we don't have. So, we estimate it from the residual standard error as

[tex]\hat{\sigma} = \sqrt{\frac{1}{n-2}\sum_i(y_i-\hat{y_i})^2}[/tex]

Now we calculate t, and then find

[tex]Pr[T>|t|] = 1-Pr[T\leq |t|] = 1-F_{t,n-2}(|t|)[/tex]

where ##F_{t,n-2}(x)## is the CDF of t-distribution with n-2 degrees of freedom. Is this correct?

If so, I have a couple of questions:

1- why ##SE(\hat{\beta_1}) = \frac{\sigma}{\sqrt{\sum_i(x_i-\mu_X)^2}}##?
2- why ##\hat{\sigma} = \sqrt{\frac{1}{n-2}\sum_i(y_i-\hat{y_i})^2}##?
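
To make the steps above concrete, here is a rough sketch of the calculation in Python for simple linear regression with made-up data (note the factor of 2 for a two-sided p-value):

[code]
import numpy as np
from scipy import stats

# Made-up data: one binary predictor x and one response y, for illustration only
x = np.array([0., 0., 0., 0., 1., 1., 1., 1., 1., 0.])
y = np.array([1., 1., 0., 1., 0., 0., 1., 0., 0., 1.])
n = len(x)

# Least-squares estimates of the coefficients
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residual standard error as an estimate of sigma
resid = y - (beta0_hat + beta1_hat * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Standard error of beta1_hat, t-statistic, and two-sided p-value
se_beta1 = sigma_hat / np.sqrt(np.sum((x - x_bar) ** 2))
t_stat = beta1_hat / se_beta1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))
print(beta1_hat, se_beta1, t_stat, p_value)
[/code]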
 
  • #6
@S_David - Please, when you say 'I got this from a book or journal', be sure to tell us book title, author, and page or section. If what you say happens to be off the mark, we cannot tell what in the world is going on. You gave us part of that, good start. Thanks.
 
  • #7
jim mcnamara said:
@S_David - Please, when you say 'I got this from a book or journal', be sure to tell us book title, author, and page or section. If what you say happens to be off the mark, we cannot tell what in the world is going on. You gave us part of that, good start. Thanks.

You are right, my fault. The book is available online via the link I provided, and the section I am referring to is 3.1.2, starting from page 63. I summarized that subsection in the steps above. I should also mention that these steps are for simple linear regression, but I believe they are the same for logistic regression with one predictor.
 
  • #8
OK. Are you considering just one outcome variable, a categorical variable with only two states to begin with? I think so.
In linear regression models the dependent variable y is supposed to be continuous, whereas in logistic regression it is categorical, i.e., discrete. So I am not sure about what you are doing, or quite how you got there... maybe I did not have enough coffee this morning.
 
  • #9
That is correct, so, what changes in this case? In the book in section 4.3.2 it says the following when addressing the accuracy of the logistic regression model:

Many aspects of the logistic regression output shown in Table 4.1 are similar to the linear regression output of Chapter 3. For example, we can measure the accuracy of the coefficient estimates by computing their standard errors. The z-statistic in Table 4.1 plays the same role as the t-statistic in the linear regression output, for example in Table 3.1 on page 68. For instance, the z-statistic associated with ##\beta_1## is equal to ##\hat{\beta}_1/SE(\hat{\beta}_1)##, and so a large (absolute) value of the z-statistic indicates evidence against the null hypothesis ##H_0: \beta_1 = 0##. This null hypothesis implies that ...

From this paragraph, I understand that instead of the t-statistic used in linear regression, we use the z-statistic for logistic regression (I mentioned the t-statistic in my post above, which must be replaced by the z-statistic). Am I getting this wrong?

Let me ask the question in another way: how can I find the p-value of the estimated coefficients in logistic regression with one predictor?
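
In other words, is the following the right way to compute it (a sketch with made-up numbers, using the standard normal as the reference distribution)?

[code]
from scipy import stats

# Hypothetical coefficient estimate and standard error from a logistic regression fit
beta1_hat = 2.0
se_beta1 = 0.4

z = beta1_hat / se_beta1
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-sided p-value
print(z, p_value)
[/code]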
 
  • #10
S_David said:
how can I find the p-value of the estimated coefficients in logistic regression with one predictor?
I just use R. I have never calculated this manually.
 
  • #11
I think it should be incorporated in Python as well for the different methods, because these are important outputs for judging the accuracy of models. I'm learning R, but Python seems to be more common among data scientists for data analysis, for some reason.
 
  • #12
It turns out that there is another Python library, besides the one I use, that calculates the p-values for logistic regression. I use sklearn, but the other library is statsmodels.
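
For example, a minimal sketch of how I would do it with statsmodels (the file name and column encoding are just assumptions about my data):

[code]
import pandas as pd
import statsmodels.api as sm

# Hypothetical file; assumes a binary 'Survived' response and a 'Sex' column
df = pd.read_csv("titanic.csv")
df["Sex"] = (df["Sex"] == "male").astype(int)   # encode male = 1, female = 0

X = sm.add_constant(df[["Sex"]])   # add the intercept column
y = df["Survived"]

result = sm.Logit(y, X).fit()
print(result.summary())    # coefficients, z-statistics, and p-values
print(result.pvalues)
[/code]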
 
  • #13
S_David said:
I'm learning R, but Python seems to be more common among data scientists for data analysis, for some reason.
Overall I think that Python is a much better language. The only advantage that R has is that it has so many libraries. It is basically the de-facto standard for statistics. But the language itself is awkward.
 
  • #14
For a more complex model (with more than 5 predictors), suppose I have the estimates of the logistic regression coefficients, their z-statistics, and their p-values. What do I do with them? For example, if the coefficient of a predictor is small (< 0.05 in absolute value), whether or not its p-value is below 0.001, can I just drop it from the model (since ##e^{x+y}\simeq e^x## if y is small)? And what about coefficients whose p-value is greater than 0.001? What should I do with them?
 
  • #15
S_David said:
For example, if the coefficient of a predictor is small (< 0.05 in absolute value)
Yes, but you also have to check the range of the predictor to see if the product of the predictor and the coefficient is small.
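
For instance, a rough way to gauge this (with made-up numbers):

[code]
import numpy as np

# Hypothetical coefficient and predictor values, for illustration only
coef = 0.04
x = np.array([0.0, 12.5, 30.0, 80.0])

# Largest swing the predictor can produce in the log-odds
max_effect = abs(coef) * (x.max() - x.min())
print(max_effect)   # small only if the coefficient AND the predictor's range are small
[/code]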
 
  • #16
Dale said:
Yes, but you also have to check the range of the predictor to see if the product of the predictor and the coefficient is small.

You are right. I need to make sure the product is small, not just the coefficient.

What if a coefficient is not very small but not large either (for example, 0.5), but its p-value is > 0.001 and its 95% confidence interval includes 0? What do I do with it? Can I drop it from the model, too, since the null hypothesis cannot be rejected for the associated predictor?
 
  • #17
S_David said:
What if a coefficient is not very small but not large either (for example, 0.5), but its p-value is > 0.001? What do I do with it?
That is a judgement call. Do you have a theoretical reason to think that it is a real effect? Or is there expert opinion that could guide you? Does the purpose that you intend to use it for require the additional precision?

This is where scientists have to make tough choices that can change the analysis and can open a scientist up to criticism either way.
 
  • #18
I am trying to enhance the performance of the model by eliminating the least significant predictors. For example, there is little logical reason, in my opinion, to think that the port of embarkation in the Titanic dataset has a significant effect on the chances of survival. This intuition, combined with the statistics I have, makes me think that it is probably not a significant predictor. Is this a logical judgment?
 
  • #19
S_David said:
For example, there is little logical reason, in my opinion, to think that the port of embarkation in the Titanic dataset has a significant effect on the chances of survival. This intuition, combined with the statistics I have, makes me think that it is probably not a significant predictor. Is this a logical judgment?
Yes, sounds reasonable. However, if you are writing a paper on this analysis, then you would want to justify that decision well.

For example, a critic might suggest that different ports could have different populations with different values of altruism and selfishness. You could preemptively state that such a connection is not described in the literature and so was not accounted for in your study.

The key is not just to make your point, but also to honestly consider how the opposite position could be argued.
 
  • #20
Dale said:
Yes, sounds reasonable. However, if you are writing a paper on this analysis, then you would want to justify that decision well.

For example, a critic might suggest that different ports could have different populations with different values of altruism and selfishness. You could preemptively state that such a connection is not described in the literature and so was not accounted for in your study.

The key is not just to make your point, but also to honestly consider how the opposite position could be argued.

Yes, but there is nothing in the dataset to support these counterclaims, so I think that in the absence of any further information it is reasonable to assume there is no significant difference between the populations from different ports. Ultimately, these populations were probably grouped into classes (class 3 passengers from all ports were grouped together, and similarly for classes 1 and 2), and within each class there were males and females. I think gender and class logically played a more significant role, and the coefficients and statistics of the model support this.
 

1. What is logistic regression?

Logistic regression is a statistical method used to model the relationship between a binary dependent variable (such as survival or death) and one or more independent variables (such as age, gender, and medical condition).

2. How does logistic regression estimate probability of survival?

Logistic regression uses a mathematical function called the logistic function to convert the linear combination of independent variables into a probability value between 0 and 1. This probability represents the likelihood of an individual surviving based on their specific characteristics.

3. What are the assumptions of logistic regression?

The main assumptions of logistic regression include: a linear relationship between the independent variables and the log odds of the dependent variable, absence of multicollinearity among the independent variables, and independence of observations. Additionally, logistic regression assumes a large enough sample size for accurate estimation and that the dependent variable is binary and measured without error.

4. How is the accuracy of logistic regression evaluated?

The accuracy of logistic regression can be evaluated using various metrics such as classification accuracy, sensitivity, specificity, and Area Under the Curve (AUC). These metrics measure the performance of the model in predicting the correct outcome (survival or death) and can help determine the overall effectiveness of the model.
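
A minimal sketch of computing such metrics with scikit-learn (toy values only):

[code]
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Toy values: observed 0/1 outcomes and predicted survival probabilities
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_prob = [0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.1, 0.7]
y_pred = [int(p >= 0.5) for p in y_prob]   # threshold at 0.5

print(accuracy_score(y_true, y_pred))      # classification accuracy
print(confusion_matrix(y_true, y_pred))    # counts for sensitivity/specificity
print(roc_auc_score(y_true, y_prob))       # area under the ROC curve
[/code]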

5. What are some common applications of logistic regression in survival analysis?

Logistic regression has a wide range of applications in survival analysis, including medical research, epidemiology, and social sciences. It can be used to predict the probability of survival for patients with a specific disease, identify risk factors for mortality, and evaluate the effectiveness of interventions or treatments in improving survival rates.
