Logistic Regression
A classification algorithm that predicts probabilities using the sigmoid function
Logistic Regression predicts probabilities for binary outcomes—spam or not spam, fraud or legitimate, churn or stay. The name is misleading; it's a classification algorithm, not regression.
How It Works
Apply the sigmoid function to a linear combination of features:
p = 1 / (1 + e^(-(β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ)))
Where:
- p: probability of the positive class (0 to 1)
- βᵢ: coefficients weighting each feature
- xᵢ: input features
If p > 0.5, predict class 1. Otherwise, predict class 0.
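A minimal sketch of these mechanics in NumPy; the coefficients and feature values are made up purely for illustration:

import numpy as np

def sigmoid(z):
    # Squash any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([-1.0, 0.8, 2.5])   # hypothetical β₀ (intercept), β₁, β₂
x = np.array([1.0, 0.5, 0.3])       # leading 1.0 pairs with the intercept

z = beta @ x                # linear combination β₀ + β₁x₁ + β₂x₂
p = sigmoid(z)              # probability of the positive class
prediction = int(p > 0.5)   # threshold at 0.5
print(f"z = {z:.2f}, p = {p:.2f}, class = {prediction}")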
Estimating Probabilities
Logistic Regression doesn't just classify—it estimates how confident it is. The output p is a genuine probability that an observation belongs to the positive class.
Say you're predicting email spam. An output of p = 0.95 means "95% confident this is spam." An output of p = 0.51 means "barely spam, could go either way." This probabilistic output is valuable because:
- You can rank predictions by confidence
- You can adjust the decision threshold based on cost (e.g., use 0.7 instead of 0.5 if false positives are expensive)
- The probabilities themselves carry information beyond the binary label
The model learns coefficients that push p toward 1 for positive examples and toward 0 for negative ones. Training adjusts these coefficients to maximize the likelihood of the observed labels.
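In scikit-learn these probabilities come from predict_proba. A minimal sketch on a synthetic dataset; the 0.7 cutoff is just an example of a stricter threshold:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data for illustration
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]   # probability of the positive class
default_preds = (proba >= 0.5).astype(int)  # standard threshold
strict_preds = (proba >= 0.7).astype(int)   # stricter cutoff when false positives are expensive
print(proba[:5])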
The Sigmoid Function
The sigmoid squeezes any real number into [0, 1]:
σ(z) = 1 / (1 + e^(-z))
Without this transformation, linear combinations could output -10 or 5, which can't be probabilities.
The sigmoid curve shows how any linear input z gets mapped to a probability. Large negative values approach 0, large positive values approach 1, and z = 0 maps to exactly 0.5. The S-shape creates a smooth transition between the two classes.
Log-Odds (Logit)
Logistic Regression models log-odds, which is linear in the coefficients:
log(p / (1 - p)) = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Odds = p/(1-p) tells you how much more likely something is to happen than not. The logarithm makes this relationship linear.
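A quick numeric check with made-up coefficients: push the linear predictor through the sigmoid, convert the resulting probability back to log-odds, and you recover the same linear value.

import numpy as np

beta = np.array([-2.0, 0.5, 1.5])   # hypothetical β₀, β₁, β₂
x = np.array([1.0, 2.0, 1.0])       # leading 1.0 for the intercept

z = beta @ x                        # linear predictor (the log-odds)
p = 1.0 / (1.0 + np.exp(-z))        # sigmoid turns log-odds into a probability
log_odds = np.log(p / (1 - p))      # inverting recovers the log-odds

print(z, log_odds)                  # equal up to floating-point error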
Assumptions
1. Binary or Categorical Target
Binary classification uses standard Logistic Regression. Multiple classes need Multinomial Logistic Regression.
2. Independent Observations
Data points must be independent. If you have repeated measurements from the same person or correlated samples, use Mixed-Effects Logistic Regression instead.
3. Linear Log-Odds
Features must relate linearly to log-odds, not probability. If the relationship is curved, transform features or switch to a non-linear model.
4. No Multicollinearity
Highly correlated features muddy coefficient estimates. Check Variance Inflation Factor (VIF)—values above 5-10 mean trouble. Fix by removing features, using PCA, or adding regularization.
When features are highly correlated (multicollinearity), coefficient estimates become unstable: small changes in the data lead to large swings in the coefficients, because the model can't reliably determine which feature deserves which weight. Regularization or feature removal stabilizes these estimates.
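A sketch of the VIF check with statsmodels, on toy data where one column is deliberately a near-copy of another:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),  # nearly identical to x1
    "x3": rng.normal(size=200),                   # independent feature
})

X = add_constant(df)  # include an intercept column so the VIFs aren't distorted
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # expect large values for x1 and x2, around 1 for x3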
5. Enough Data
You need roughly 10-15 events (observations of the rarer class) per predictor. Maximum Likelihood Estimation needs enough data to estimate the coefficients reliably.
Training and the Cost Function
Training finds coefficients that make the model's probability predictions match the actual labels. This happens through Maximum Likelihood Estimation—pick coefficients that maximize the likelihood of observing your training data.
For each observation, the model predicts probability p. The likelihood across all data is:
L(β) = ∏ᵢ pᵢ^(yᵢ) × (1 - pᵢ)^(1 - yᵢ)
This expression elegantly handles both classes:
- When yᵢ = 1 (positive), the term simplifies to pᵢ
- When yᵢ = 0 (negative), it becomes 1 - pᵢ
Taking the logarithm turns products into sums (easier to optimize):
log L(β) = Σᵢ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)]
Maximizing log-likelihood = minimizing Binary Cross-Entropy Loss (also called Log Loss):
J(β) = -(1/n) Σᵢ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)]
This cost function penalizes confident mistakes aggressively. If you predict p = 0.9 for a negative example, the loss is -log(1 - 0.9) = -log(0.1) ≈ 2.3. If you predict p = 0.6 for the same example, the loss is -log(0.4) ≈ 0.9. The first mistake costs more than twice as much.
This asymmetric penalty encourages the model to be honest about uncertainty. Better to say "60% sure" when wrong than "90% sure."
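The same arithmetic as a quick check for a single negative example (y = 0):

import numpy as np

y = 0  # true label: negative class

for p in (0.9, 0.6):
    # Binary cross-entropy for one observation
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    print(f"predicted p = {p}: loss = {loss:.2f}")
# p = 0.9 costs about 2.30, p = 0.6 about 0.92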
Why Cross-Entropy?
Cross-entropy measures the difference between two probability distributions: the true distribution (actual labels) and the predicted distribution (model outputs).
For a binary classification problem, the true distribution is either [1, 0] (positive class) or [0, 1] (negative class). The model predicts [p, 1-p]. Cross-entropy quantifies how far apart these distributions are.
When the model predicts the correct distribution perfectly, cross-entropy = 0 (minimum). When predictions diverge from truth, cross-entropy increases. This makes it ideal for training classifiers that output probabilities.
Cross-entropy also has nice mathematical properties:
- Convex function (gradient descent finds the global minimum)
- Derivative has a simple form (error × input)
- Directly related to likelihood maximization
That's why nearly every probabilistic classifier—from logistic regression to neural networks—uses cross-entropy as the loss function.
Gradient Descent
The sigmoid is non-linear, so there is no closed-form solution for the coefficients. Gradient descent adjusts them iteratively:
βⱼ := βⱼ - α × ∂J/∂βⱼ, where ∂J/∂βⱼ = (1/n) Σᵢ (pᵢ - yᵢ) xᵢⱼ
α is the learning rate—how big each update step is.
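A bare-bones version of this loop in NumPy on synthetic data; the learning rate and iteration count are arbitrary choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: a column of ones for the intercept plus two features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(3)
alpha = 0.1  # learning rate

for _ in range(5000):
    p = sigmoid(X @ beta)               # current probability predictions
    gradient = X.T @ (p - y) / len(y)   # gradient of binary cross-entropy: error × input
    beta -= alpha * gradient            # step downhill

print(beta)  # should land near true_beta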
Regularization
Penalize large coefficients by adding a penalty term to the cost function; this prevents overfitting:
Ridge (L2): adds λ Σⱼ βⱼ². The squared penalty keeps coefficients small.
Lasso (L1): adds λ Σⱼ |βⱼ|. The absolute-value penalty drives some coefficients to exactly zero, doing automatic feature selection.
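In scikit-learn both penalties are controlled by the penalty and C parameters (C is the inverse of regularization strength, so smaller C means a stronger penalty). A minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ridge = LogisticRegression(penalty="l2", C=0.1).fit(X, y)                      # Ridge (the default penalty)
lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)  # Lasso needs an L1-capable solver

print((ridge.coef_ == 0).sum(), (lasso.coef_ == 0).sum())  # L1 typically produces exact zeros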
Decision Boundaries
The decision boundary is where the model is perfectly uncertain—where p = 0.5. At this line (or hyperplane in higher dimensions), the model switches from predicting one class to the other.
Since p = 0.5 exactly when the sigmoid's input is 0, the boundary satisfies:
β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ = 0
This is the equation of the decision boundary. For two features, it's a straight line. For three features, a plane. For more, a hyperplane.
Example: Say you're predicting loan approval based on income (x₁) and credit score (x₂), and the model learns something like (coefficients chosen for illustration):
z = -300 + 0.1 × income + 1 × credit
The boundary is at z = 0:
-300 + 0.1 × income + credit = 0
Solving: credit = 300 - 0.1 × income. This line separates approved loans from denied ones. Higher income lets you get approved with a lower credit score.
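Plugging two hypothetical applicants with the same credit score into these illustrative coefficients shows the effect:

def loan_score(income, credit):
    # Coefficients consistent with the boundary credit = 300 - 0.1 × income (illustrative only)
    return -300 + 0.1 * income + credit

for income, credit in [(200, 270), (800, 270)]:  # made-up applicants
    z = loan_score(income, credit)
    print(income, credit, z, "approve" if z > 0 else "deny")

Same credit score, different incomes, opposite decisions, which is exactly what the boundary equation implies.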
The boundary is linear because the relationship between features and log-odds is linear. If you need curved boundaries, add polynomial features or use a non-linear model.
In two dimensions, the decision boundary is the line where p = 0.5 (perfect uncertainty). Points on one side are classified as positive, points on the other as negative, and the probability contours fan out smoothly from the line. The boundary is perfectly straight because logistic regression creates linear decision boundaries in the original feature space.
Evaluation
See Classification Evaluation Metrics for details on accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrices.
Quick summary (a usage sketch follows the list):
- Accuracy: Misleading with imbalanced data
- Precision: Are positive predictions trustworthy?
- Recall: Do we catch most positives?
- F1-score: Balance between precision and recall
- ROC-AUC: Performance across all thresholds
- Log Loss: How well-calibrated are probabilities?
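A minimal sketch of computing these with scikit-learn on a synthetic, imbalanced dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
preds = model.predict(X_test)

print(classification_report(y_test, preds))       # precision, recall, F1 per class
print("ROC-AUC:", roc_auc_score(y_test, proba))   # ranking quality across all thresholds
print("Log loss:", log_loss(y_test, proba))       # penalizes poorly calibrated probabilities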
Practical Tips
Feature Scaling
Gradient descent is sensitive to scale. Features with large ranges (like income: 0-100,000) dominate small ones (like age: 0-100). Always standardize:
from sklearn.preprocessing import StandardScaler

# Standardize each feature to mean 0, standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Categorical Variables
One-hot encode nominal categories (colors, cities). Ordinal encode ordered categories (low/medium/high). Drop one category to avoid the dummy variable trap—perfect multicollinearity crashes the model.
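A sketch with pandas; the city column and its values are made up, and drop_first=True drops one category to avoid the dummy variable trap:

import pandas as pd

df = pd.DataFrame({
    "city": ["paris", "tokyo", "tokyo", "lima"],   # hypothetical nominal feature
    "income": [40, 55, 62, 38],
})

encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded)  # one binary column per remaining city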
Imbalanced Data
When one class dominates (95% non-fraud, 5% fraud), a few tactics help (see the sketch after this list):
- Stratify train-test splits
- Use precision, recall, F1, or ROC-AUC—not accuracy
- Weight classes or adjust thresholds
- Try SMOTE for synthetic minority samples
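A minimal sketch combining these tactics in scikit-learn; the 95/5 split, the class_weight setting, and the 0.3 threshold are all illustrative choices:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 95/5 class split
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Stratify so train and test keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the minority class in the loss
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Lowering the threshold trades precision for recall on the rare class
proba = model.predict_proba(X_test)[:, 1]
preds = (proba >= 0.3).astype(int)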
Feature Interactions
Logistic Regression combines features additively and can't capture interactions on its own. If income and education interact (high income plus high education might behave differently than either alone), add interaction terms:
from sklearn.preprocessing import PolynomialFeatures

# interaction_only=True keeps the original features and adds pairwise products, without squared terms
poly = PolynomialFeatures(interaction_only=True, include_bias=False)
X_interact = poly.fit_transform(X)

Interpreting Coefficients
Coefficients live in log-odds space. Exponentiate them to get odds ratios:
odds ratio for feature xⱼ = e^(βⱼ)
An odds ratio of 1.5 means a 1-unit increase in xⱼ boosts the odds by 50%, holding everything else constant.
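With a fitted scikit-learn model (synthetic data here, so the numbers only demonstrate the mechanics), the odds ratios are one exponentiation away:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

odds_ratios = np.exp(model.coef_[0])  # one odds ratio per feature
print(odds_ratios)  # a value of 1.5 would mean +50% odds per 1-unit increase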
Calibration
A calibrated model means predicted probabilities match reality. If the model says 70%, roughly 70% of those cases should be positive. Check with calibration curves. Fix with Platt scaling or isotonic regression.
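A sketch of both the check and the fix with scikit-learn's calibration utilities, on synthetic data:

from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Check: compare predicted probabilities to observed positive rates, bin by bin
prob_true, prob_pred = calibration_curve(y_test, model.predict_proba(X_test)[:, 1], n_bins=10)

# Fix: wrap the model with Platt scaling (method="sigmoid") or isotonic regression
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)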
Softmax Regression (Multi-Class Extension)
Binary Logistic Regression handles two classes. For multiple classes, use Softmax Regression (also called Multinomial Logistic Regression).
Instead of one sigmoid, compute a score for each class:
zₖ = βₖ₀ + βₖ₁x₁ + ... + βₖₙxₙ   for k = 1, ..., K
Then apply the softmax function to turn the scores into probabilities:
P(y = k) = e^(zₖ) / Σⱼ e^(zⱼ)
Where K is the number of classes. The softmax ensures probabilities sum to 1 across all classes.
Example: Classifying images as cat, dog, or bird. The model computes three scores, then softmax converts them to probabilities: [0.7, 0.2, 0.1]. The model predicts cat (highest probability).
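The conversion itself is just a normalized exponential. With scores made up to land near the probabilities above:

import numpy as np

scores = np.array([1.946, 0.693, 0.0])          # raw class scores for cat, dog, bird (made up)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax
print(probs.round(2))                           # roughly [0.7, 0.2, 0.1]; highest wins, so predict cat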
The cost function becomes Categorical Cross-Entropy:
J = -(1/n) Σᵢ Σₖ yᵢₖ log(pᵢₖ)
Where yᵢₖ is 1 if observation i belongs to class k, 0 otherwise (one-hot encoding).
Softmax is the multi-class generalization of logistic regression. The sigmoid is just softmax with K = 2.