Chapter 11 Logistic Regression Model

library(tidyverse)
library(mlbench) # for some datasets

Logistic regression is one of the most common methods for binary classification.

Suppose we observe \(\{(x_i, y_i): i = 1, \ldots, n\}\), where \(y_i\) is a binary variable and \(x_i\) is a vector of covariates (including the intercept \(1\)). In logistic regression, we assume that \(Y_i \sim \text{Bernoulli}(p_i)\) with \(p_i = \frac{e^{x_i^T \beta}}{1+e^{x_i^T \beta}}\). That is, \[\begin{equation*} P(Y_i = 1 \mid x_i, \beta) = \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} = \frac{e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}} {1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}}. \end{equation*}\]

Since \(Y_i\) takes only two values, \[\begin{equation*} P(Y_i = 0 \mid x_i, \beta) = \frac{1}{1 + e^{x_i^T \beta}}. \end{equation*}\]

We can write both cases \(y_i = 0, 1\) in a single formula: \[\begin{equation*} P(Y_i = y_i \mid x_i, \beta) = \frac{e^{(x_i^T \beta) y_i}}{1 + e^{x_i^T \beta}}. \end{equation*}\]

The likelihood function (conditional on \(x_1, \ldots, x_n\)) is \[\begin{equation*} L(\beta \mid y_1, \ldots, y_n, x_1, \ldots, x_n) = \prod_{i=1}^n P(Y_i = y_i \mid x_i, \beta) = \prod_{i=1}^n \frac{e^{(x_i^T \beta) y_i}}{1 + e^{x_i^T \beta}}. \end{equation*}\]

The MLE of \(\beta\) is obtained by maximizing \(L(\beta \mid y, x)\) with respect to \(\beta\). In practice, we maximize the natural logarithm of the likelihood, which is easier to work with. The log-likelihood function is \[\begin{equation*} \ell(\beta) = \sum_{i=1}^n \left( (x_i^T \beta) y_i - \log(1 + e^{x_i^T \beta}) \right). \end{equation*}\] There is no closed-form solution for the maximizer, so \(\ell(\beta)\) is maximized numerically (R's glm() uses Fisher scoring, a variant of Newton's method).
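As a sanity check, the log-likelihood above can be maximized directly with a general-purpose optimizer and the result compared to glm(). The sketch below uses simulated data; the sample size, true coefficients, and variable names are illustrative, not from the text.

```r
set.seed(1)
n <- 500
x <- cbind(1, rnorm(n))                  # design matrix with intercept column
beta_true <- c(-0.5, 1.2)
p <- 1 / (1 + exp(-x %*% beta_true))     # logistic probabilities
y <- rbinom(n, size = 1, prob = p)

# Negative log-likelihood: -sum( (x_i' beta) y_i - log(1 + exp(x_i' beta)) )
neg_loglik <- function(beta) {
  eta <- x %*% beta
  -sum(eta * y - log(1 + exp(eta)))
}

opt <- optim(c(0, 0), neg_loglik, method = "BFGS")
fit <- glm(y ~ x[, 2], family = binomial)

round(opt$par, 4)        # numerical MLE from optim()
round(coef(fit), 4)      # glm() estimates: should agree closely
```

Both approaches maximize the same log-likelihood, so the estimates should differ only by numerical tolerance.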

Odds

In statistics, the odds of an event are defined as the ratio of the probability that the event occurs to the probability that it does not occur.

For example, if the event of interest is getting a \(6\) when rolling a die, then \[\begin{equation*} \text{odds} = \frac{1/6}{5/6} = \frac{1}{5}. \end{equation*}\]

In the logistic regression model, the odds of \(Y = 1\) given \(x\) are \[\begin{equation*} \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} = e^{\beta^T x}. \end{equation*}\]

Interpretation of \(\beta_j\): Increasing \(x_j\) by one unit, while keeping all other predictors fixed, multiplies the odds by \(e^{\beta_j}\).
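This multiplicative effect on the odds can be verified numerically. In the small sketch below, the coefficient vector and covariate values are made up purely for illustration.

```r
# Odds of Y = 1 at covariate vector x: exp(beta' x)
beta <- c(-1, 0.5, 2)     # illustrative coefficients (first entry = intercept)
x    <- c(1, 3, -1)       # illustrative covariate vector (first entry = 1)

odds <- function(x, beta) exp(sum(beta * x))

x_plus <- x
x_plus[2] <- x_plus[2] + 1   # increase x_1 by one unit, others fixed

odds(x_plus, beta) / odds(x, beta)   # equals exp(beta[2]) = exp(0.5)
exp(beta[2])
```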

Log odds

The log odds is the logarithm of the odds.

In the logistic regression model, the log odds of \(Y = 1\) given \(x\) are \[\begin{equation*} \log \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} = \beta^T x. \end{equation*}\]

Interpretation of \(\beta_j\): Increasing \(x_j\) by one unit, while keeping all other predictors fixed, increases the log odds by \(\beta_j\).

logit function

The logit function, \(\text{logit} : (0,1) \to \mathbb{R}\), is defined as \[\begin{equation*} \text{logit}(p) = \log \frac{p}{1 - p}. \end{equation*}\]

logistic function

The standard logistic function is defined as \[\begin{equation*} f(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}. \end{equation*}\]

It is the inverse of the logit function.
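In R, these two functions are available as qlogis() (the logit) and plogis() (the standard logistic function); they are the quantile function and CDF of the standard logistic distribution, and each undoes the other:

```r
p <- c(0.1, 0.5, 0.9)
qlogis(p)           # logit: log(p / (1 - p))
plogis(qlogis(p))   # logistic inverts logit, recovering p

# Checking against the definitions directly
all.equal(qlogis(p), log(p / (1 - p)))
all.equal(plogis(2), 1 / (1 + exp(-2)))
```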

Example

The package mlbench contains some artificial and real-world machine learning benchmark problems (datasets). See https://cran.r-project.org/web/packages/mlbench/mlbench.pdf

We will use the dataset PimaIndiansDiabetes2 from mlbench.

Details:

  • pregnant: Number of times pregnant
  • glucose: Plasma glucose concentration (glucose tolerance test)
  • pressure: Diastolic blood pressure (mm Hg)
  • triceps: Triceps skin fold thickness (mm)
  • insulin: 2-Hour serum insulin (mu U/ml)
  • mass: Body mass index (weight in kg/(height in m)^2)
  • pedigree: Diabetes pedigree function
  • age: Age (years)
  • diabetes: Class variable (test for diabetes)

data(PimaIndiansDiabetes2)

# Remove rows with missing values (PimaIndiansDiabetes2 contains NAs)
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)

# Split into training and test sets
set.seed(1)
random_index <- sample(nrow(PimaIndiansDiabetes2),
                       size = floor(0.7 * nrow(PimaIndiansDiabetes2)))

train_data <- PimaIndiansDiabetes2[random_index, ]
test_data  <- PimaIndiansDiabetes2[-random_index, ]

# Sample proportions in test set
table(test_data$diabetes) / nrow(test_data)
## 
##       neg       pos 
## 0.6694915 0.3305085

Simple model:

  • Fit a logistic regression model using glm() with family = binomial.

  • predict(fit_simple, test_data, type = "response") returns the estimated probabilities \(\hat{P}(Y_i = 1 \mid x_i)\) for the test cases.

  • ifelse(prob > 0.5, "pos", "neg") classifies a case as pos when its estimated probability exceeds \(0.5\). The threshold \(0.5\) is conventional but not necessarily optimal.

# Estimation
fit_simple <- glm(diabetes ~ glucose, 
                  data = train_data, 
                  family = binomial)

summary(fit_simple)
## 
## Call:
## glm(formula = diabetes ~ glucose, family = binomial, data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2089  -0.7182  -0.4459   0.7004   2.4145  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.419871   0.774032  -8.294  < 2e-16 ***
## glucose      0.044509   0.005775   7.707 1.29e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 348.35  on 273  degrees of freedom
## Residual deviance: 262.14  on 272  degrees of freedom
## AIC: 266.14
## 
## Number of Fisher Scoring iterations: 4

# Prediction
prob <- predict(fit_simple, test_data, type = "response")
predicted_class <- ifelse(prob > 0.5, "pos", "neg")

# Confusion matrix
table(predicted_class, test_data$diabetes)
##                
## predicted_class neg pos
##             neg  71  23
##             pos   8  16

# Accuracy
mean(predicted_class == test_data$diabetes)
## [1] 0.7372881

For this simple model, the accuracy is \(73.7\%\).
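The fitted coefficient can also be read on the odds scale. Using the estimate \(\hat{\beta}_{\text{glucose}} \approx 0.0445\) reported in summary(fit_simple) above, each additional unit of plasma glucose multiplies the estimated odds of diabetes by \(e^{0.0445}\):

```r
# Odds multiplier for a one-unit increase in glucose,
# using the coefficient estimate from summary(fit_simple)
exp(0.044509)   # approximately 1.0455
```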

Model with all covariates

# Estimation
fit <- glm(diabetes ~ ., data = train_data, family = binomial)

# Summary
summary(fit)
## 
## Call:
## glm(formula = diabetes ~ ., family = binomial, data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6139  -0.6178  -0.3523   0.5804   2.1311  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.130e+01  1.558e+00  -7.250 4.18e-13 ***
## pregnant     2.511e-02  6.582e-02   0.381   0.7028    
## glucose      4.038e-02  7.028e-03   5.745 9.17e-09 ***
## pressure     1.368e-02  1.562e-02   0.876   0.3811    
## triceps      1.381e-02  2.151e-02   0.642   0.5209    
## insulin     -9.559e-04  1.808e-03  -0.529   0.5970    
## mass         7.334e-02  3.586e-02   2.045   0.0408 *  
## pedigree     8.706e-01  5.093e-01   1.709   0.0874 .  
## age          3.612e-02  2.323e-02   1.555   0.1200    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 348.35  on 273  degrees of freedom
## Residual deviance: 233.54  on 265  degrees of freedom
## AIC: 251.54
## 
## Number of Fisher Scoring iterations: 5

# Prediction
prob <- predict(fit, test_data, type = "response")
predicted_class <- ifelse(prob > 0.5, "pos", "neg")

# Confusion Matrix
table(predicted_class, test_data$diabetes)
##                
## predicted_class neg pos
##             neg  69  15
##             pos  10  24

# Accuracy
mean(predicted_class == test_data$diabetes)
## [1] 0.7881356

The accuracy is \(78.8\%\).