# Logistic Regression

I have a simple dataset with one predictor, Sex, and one response variable, Survived. Let's say the estimated coefficients are ##\hat{\beta_0}## and ##\hat{\beta_1}## for the intercept and the Sex predictor, respectively. Mathematically this means that:

$$\hat{p}(X) = \frac{1}{1+e^{-(\hat{\beta_0}+\hat{\beta_1}X)}}$$

where X is 1 for male, and 0 for female, and where ##\hat{p}## is the estimated probability. This model's estimates give ##\hat{p}(X=1)+\hat{p}(X=0)\neq 1##. Why? Is it because these are estimates, or because there is/are other reason(s) like Survived depends on other predictors?
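To see this concretely, here is a minimal sketch with hypothetical coefficient values (##\hat{\beta_0} = 1.0##, ##\hat{\beta_1} = -2.5##, not fitted to any real data):

```python
import math

def p_hat(x, b0, b1):
    """Estimated survival probability from the fitted logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients, for illustration only (not fitted to real data)
b0, b1 = 1.0, -2.5

p_female = p_hat(0, b0, b1)  # about 0.731
p_male = p_hat(1, b0, b1)    # about 0.182

# These are two separate conditional probabilities, one per group,
# so nothing in the model forces them to sum to 1.
print(p_female, p_male, p_female + p_male)
```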

Dale
Mentor
2020 Award
This model's estimates give ##\hat{p}(X=1)+\hat{p}(X=0)\neq 1##. Why?
Why would the survival rates for males and females sum to 1? In a good transportation method the survival probability for all passengers would be very close to 1. Summing to 1 would mean that the only way for 90% of the female passengers to survive would be for 90% of the male passengers to die. That would be bad. Even worse, in order to improve male survival you would have to kill more females!

I got this wrong. For females, the probability of surviving is

$$p(X = 0) = Pr[S = 1 \mid X = 0] = 1-Pr[S = 0 \mid X=0]$$

Similarly, the probability of surviving for males is

$$p(X = 1) = Pr[S = 1 \mid X = 1] = 1-Pr[S = 0 \mid X=1]$$

Dale
Mentor
2020 Award
Yes, that is right.

EngWiPy
I found the estimates using Python. However, Python doesn't automatically compute the p-value for ##\hat{\beta_1}## to decide that this relationship is not due to pure chance. To compute the p-value I need to find (the steps were summarized from this book):

$$t = \frac{\hat{\beta_1}}{SE\{\hat{\beta_1}\}}$$

where

$$SE\{\hat{\beta_1}\}^2 = \frac{\sigma^2}{\sum_i(x_i-\mu_X)^2}$$

is the squared standard error of ##\hat{\beta_1}##, and ##\sigma^2## is the variance of the noise ##\epsilon## in the model ##y=f(X)+\epsilon##, which we don't have. So, we estimate it from the residual standard error as

$$\hat{\sigma} = \sqrt{\frac{1}{n-2}\sum_i(y_i-\hat{y_i})^2}$$

Now we calculate t, and then find the two-sided p-value

$$2\,Pr[T>|t|] = 2\left(1-F_{t,n-2}(|t|)\right)$$

where ##F_{t,n-2}(x)## is the CDF of t-distribution with n-2 degrees of freedom. Is this correct?

If so, I have a couple of questions:

1- why ##SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_i(x_i-\mu_X)^2}##?
2- why ##\hat{\sigma} = \sqrt{\frac{1}{n-2}\sum_i(y_i-\hat{y_i})^2}##?
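The steps above can be sketched numerically. This is only an illustration on synthetic simple-linear-regression data; the toy data and true coefficients below are made up, not from the book:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data for y = 2 + 3x + noise (illustrative only)
n = 50
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1.5, n)

# Least-squares estimates of the coefficients
x_bar, y_bar = x.mean(), y.mean()
b1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0_hat = y_bar - b1_hat * x_bar

# Residual standard error, with n - 2 degrees of freedom
y_hat = b0_hat + b1_hat * x
sigma_hat = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))

# Standard error of b1_hat: the square root of sigma^2 / sum((x_i - x_bar)^2)
se_b1 = sigma_hat / np.sqrt(np.sum((x - x_bar) ** 2))

# t-statistic and two-sided p-value from the t-distribution tail
t = b1_hat / se_b1
p_value = 2 * stats.t.sf(abs(t), df=n - 2)
print(b1_hat, se_b1, t, p_value)
```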

jim mcnamara
Mentor
@S_David - Please, when you say 'I got this from a book or journal', be sure to tell us book title, author, and page or section. If what you say happens to be off the mark, we cannot tell what in the world is going on. You gave us part of that, good start. Thanks.

You are right, my fault. The book is available online via the link I provided, and the section I am referring to is 3.1.2, starting from page 63. I summarized that subsection in the steps above. I should mention that these steps are for simple linear regression, but I believe they are the same for logistic regression with one predictor.

jim mcnamara
Mentor
OK. Are you considering just one outcome variable that is categorical, i.e., has only two states to start with? I think so.
In linear regression models the dependent variable y is supposed to be continuous, whereas in logistic regression it is categorical, i.e., discrete. So I am not sure what you are doing, or quite how you got there. ... maybe I did not have enough coffee this morning.

That is correct, so, what changes in this case? In the book in section 4.3.2 it says the following when addressing the accuracy of the logistic regression model:

Many aspects of the logistic regression output shown in Table 4.1 are similar to the linear regression output of Chapter 3. For example, we can measure the accuracy of the coefficient estimates by computing their standard errors. The z-statistic in Table 4.1 plays the same role as the t-statistic in the linear regression output, for example in Table 3.1 on page 68. For instance, the z-statistic associated with ##\beta_1## is equal to ##\hat{\beta}_1/SE(\hat{\beta}_1)##, and so a large (absolute) value of the z-statistic indicates evidence against the null hypothesis ##H_0: \beta_1 = 0##. This null hypothesis implies that ...

From this paragraph I understand that instead of the t-statistic used in linear regression, we use the z-statistic for logistic regression (I mentioned the t-statistic in my post, which should be replaced by the z-statistic). Am I getting this wrong?
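If I am reading this right, the z-statistic step can be sketched like this. The numbers are hypothetical, just to show the mechanics:

```python
from scipy.stats import norm

# Hypothetical output of a fitted logistic regression (illustrative values)
beta1_hat = -2.51  # estimated coefficient of Sex
se_beta1 = 0.17    # its standard error

# z-statistic against H0: beta_1 = 0, and the two-sided p-value
# from the standard normal tail
z = beta1_hat / se_beta1
p_value = 2 * norm.sf(abs(z))
print(z, p_value)
```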

Let me ask the question in another way: how can I find the p-value of the estimated coefficients in logistic regression with one predictor?

Dale
Mentor
2020 Award
how can I find the p-value of the estimated coefficients in logistic regression with one predictor?
I just use R. I have never calculated this manually.

I think it should be incorporated in Python as well, because these are important outputs for judging the accuracy of models. I'm learning R, but Python seems to be more common among data scientists for data analysis, for some reason.

It turns out that there is another Python library, besides the one I use, that calculates the p-values for logistic regression: I use sklearn, but statsmodels reports them.

Dale
Mentor
2020 Award
I'm learning R, but Python seems to be more common for data analysis for some reasons among data scientists.
Overall I think that Python is a much better language. The only advantage R has is its many libraries; it is basically the de facto standard for statistics. But the language itself is awkward.

EngWiPy
For a more complex model (with more than 5 predictors), suppose I have the estimated coefficients of the logistic regression, the z-statistics, and the p-values of these coefficients; what do I do with them? For example, if the coefficient of a predictor is small (< 0.05 in absolute value, whether or not p < 0.001), can I just drop it from the model (since ##e^{x+y}\simeq e^x## if y is small)? What about coefficients whose p-value is > 0.001?

Dale
Mentor
2020 Award
For example, if a coefficient of a predictor is small (< 0.05 in absolute value
Yes, but you also have to check the range of the predictor to see if the product of the predictor and the coefficient is small.

EngWiPy
Yes, but you also have to check the range of the predictor to see if the product of the predictor and the coefficient is small.

You are right. I need to make sure the product is small, not just the coefficient.
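A quick sketch of that check, with made-up coefficients and predictor ranges (not from any fitted model):

```python
# Hypothetical coefficients and predictor ranges (max - min), for illustration
coefs = {"Sex": -2.5, "Age": -0.04, "Fare": 0.002}
ranges = {"Sex": 1.0, "Age": 80.0, "Fare": 512.0}

# A small coefficient can still contribute a large term to the log-odds if
# its predictor spans a wide range, so compare coefficient * range,
# not the coefficient alone.
for name in coefs:
    contribution = coefs[name] * ranges[name]
    print(f"{name}: max contribution to the log-odds = {contribution:.3f}")
```

Here the Age coefficient is tiny, but Age spans 80 units, so its maximum contribution to the log-odds is actually larger than Sex's.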

What if a coefficient is not very small but not large either (for example, 0.5), its p-value is > 0.001, and its 95% confidence interval includes 0? Can I drop it from the model, too, since the null hypothesis cannot be rejected for the associated predictor?

Dale
Mentor
2020 Award
What if a coefficient is not very small but not large either (for example, 0.5), but its p-value is > 0.001? What to do with it?
That is a judgement call. Do you have a theoretical reason to think that it is a real effect? Or is there expert opinion that could guide you? Does the purpose that you intend to use it for require the additional precision?

This is where scientists have to make tough choices that can change the analysis and can open a scientist up to criticism either way.

I am trying to enhance the performance of the model by eliminating the least significant predictors. For example, there is little logical reason, in my opinion, to think that the port of embarkation in the Titanic dataset has a significant effect on the survival chances. This intuition, combined with the statistics I have, makes me think that it is probably not a significant predictor. Is this a logical judgment?

Dale
Mentor
2020 Award
For example, there is little logical reason, in my opinion, to think that the port of embarkation in the Titanic dataset has a significant effect on the survival chances. This intuition, combined with the statistics I have, makes me think that it is probably not a significant predictor. Is this a logical judgment?
Yes, sounds reasonable. However, if you are writing a paper on this analysis, then you would want to justify that decision well.

For example, a critic might suggest that different ports could have different populations with different values of altruism and selfishness. You could preemptively state that such a connection is not described in the literature and so was not accounted for in your study.

The key is not to just make your point, but also honestly consider why the opposite could be considered.


Yes, but there is nothing in the dataset to support these counterclaims, so in the absence of further information I think it is reasonable to assume that there is no significant difference between the populations from different ports. Ultimately, these populations were probably grouped into classes (class 3 passengers from all ports were grouped together, and likewise for classes 1 and 2), and within each class there were males and females. I think gender and class played the more significant role, and the coefficients and statistics of the model support this.