Linear Regression
Modeling relationships with lines and hyperplanes: assumptions, optimization, and practical tips
Linear regression models relationships between input variables (features) and an output variable (target) using a straight line or hyperplane.
Simple Linear Regression
One feature, one target:

y = β₀ + β₁x + ε
Where:
- y: target variable
- x: input feature
- β₀: intercept (y when x=0)
- β₁: slope (change in y per unit change in x)
- ε: error term (unexplained variance)
Multiple Linear Regression
Multiple features:

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ + ε
Fits a hyperplane in n-dimensional space. Each coefficient βᵢ shows how much the target changes when that feature increases by 1 unit, holding other features constant. This interpretability makes linear regression valuable—understanding coefficients is as important as making predictions.
This visualization shows how linear regression fits a line (or hyperplane) through the data points by minimizing the distance between predictions and actual values. The line represents the learned relationship between features and target.
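To make this concrete, here is a minimal sketch of fitting a multiple linear regression and reading off the intercept and coefficients. The use of scikit-learn and the synthetic data are assumptions for illustration, not part of any particular dataset discussed here.

```python
# Minimal sketch: fit a multiple linear regression and inspect its coefficients.
# Assumes scikit-learn and NumPy are installed; the data below is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))                                  # two features
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print("intercept (β₀):", model.intercept_)                     # should be close to 3.0
print("coefficients (β₁, β₂):", model.coef_)                   # should be close to [2.0, -1.5]
```

Each printed coefficient is the estimated change in the target per one-unit change in that feature, holding the other feature constant.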
Assumptions of Linear Regression
1. Linearity
The relationship between features and target must be linear. For example, if predicting BMI from height and weight, linear regression assumes a constant rate of change. However, since BMI = weight/height², the true relationship is non-linear, which would violate this assumption.
How to test: Plot residuals against predicted values. Random scatter around zero indicates linearity. Curves or patterns suggest non-linear relationships.
Residual plots diagnose linearity and homoscedasticity violations. Random scatter indicates good fit. Patterns (curves, funnels) indicate problems.
Fixes: Apply transformations (log, square root), add polynomial features, or use non-linear models.
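As a rough sketch of the residual plot described above (assuming matplotlib is available and reusing `model`, `X`, and `y` from the earlier sketch):

```python
# Sketch: residuals vs. predicted values to check linearity (and later, homoscedasticity).
# Assumes matplotlib is installed and `model`, `X`, `y` are defined as in the sketch above.
import matplotlib.pyplot as plt

y_pred = model.predict(X)
residuals = y - y_pred

plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")     # reference line at zero
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. predicted values")
plt.show()
```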
2. Independence of Errors
Residuals for one data point should not correlate with others. Often violated in time-series data where yesterday's error influences today's, leading to overconfident predictions.
How to test: Durbin-Watson test. Values around 2 indicate independence; values near 0 or 4 suggest autocorrelation.
Fixes: Use time-series models (ARIMA) or add lag variables.
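One way to compute the Durbin-Watson statistic, assuming statsmodels is available and `residuals` comes from the residual-plot sketch above:

```python
# Sketch: Durbin-Watson statistic on the residuals (values near 2 suggest independence).
# Assumes statsmodels is installed and `residuals` is defined as above.
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(residuals)
print("Durbin-Watson statistic:", dw)
```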
3. Homoscedasticity
Errors must have constant variance. The spread of residuals should be similar across all predicted values. Heteroscedasticity (non-constant variance) causes the model to overweight certain data regions.
How to test: Plot residuals vs. predicted values. Even spread indicates homoscedasticity. Funnel shapes indicate heteroscedasticity. Use Breusch-Pagan or White's test.
This visualization shows different model fits. Underfitting (left) occurs when the model is too simple. Good fit (center) captures the underlying pattern with appropriate complexity. Overfitting (right) captures noise and shows high variance in predictions.
Fixes: Transform the target variable (log(y)) or use Weighted Least Squares.
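A possible way to run the Breusch-Pagan test, assuming statsmodels is available and reusing `residuals` and `X` from the sketches above:

```python
# Sketch: Breusch-Pagan test for heteroscedasticity.
# Assumes statsmodels is installed; a small p-value suggests non-constant error variance.
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

exog = sm.add_constant(X)                       # design matrix with an intercept column
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, exog)
print("Breusch-Pagan p-value:", lm_pvalue)
```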
4. Normality of Errors
Residuals should follow a normal distribution. Important for hypothesis testing and confidence intervals. The target variable itself doesn't need to be normal.
How to test: Use Q-Q plots (residuals should follow diagonal line) or Shapiro-Wilk test.
Fixes: Less critical with large datasets (Central Limit Theorem). Otherwise, apply transformations or use non-parametric methods.
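Both checks can be sketched as follows, assuming SciPy and matplotlib are available and `residuals` is defined as above:

```python
# Sketch: normality checks on the residuals.
# Assumes SciPy and matplotlib are installed; `residuals` is defined as above.
import matplotlib.pyplot as plt
from scipy import stats

stats.probplot(residuals, dist="norm", plot=plt)   # Q-Q plot against the normal distribution
plt.show()

stat, p_value = stats.shapiro(residuals)           # Shapiro-Wilk test
print("Shapiro-Wilk p-value:", p_value)
```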
5. No Multicollinearity
Features should not be highly correlated. Multicollinearity (e.g., using both "age" and "years of experience" to predict salary) inflates coefficient variance, making estimates unstable and hard to interpret.
How to test: Compute Variance Inflation Factor (VIF). VIF > 5 (or 10) indicates problems. Use correlation heatmaps to spot correlated features.
Fixes: Remove correlated variables, combine them, or use regularization (Ridge, Lasso).
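A sketch of the VIF computation, assuming statsmodels and pandas are available; the feature names are hypothetical and `X` comes from the earlier sketch:

```python
# Sketch: Variance Inflation Factor for each feature.
# Assumes statsmodels and pandas are installed; `X` is the feature matrix from above.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_df = pd.DataFrame(X, columns=["x1", "x2"])       # hypothetical feature names
exog = sm.add_constant(X_df)                       # VIF is computed on the design matrix
vif = pd.Series(
    [variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
    index=X_df.columns,
)
print(vif)                                         # VIF > 5 (or 10) flags multicollinearity
```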
These assumptions rarely hold perfectly in practice. The key is detecting violations and addressing them appropriately.
How Linear Regression is Optimized
Once we set up the linear regression equation, the next question is: how do we find the best values for the coefficients β₀, β₁, …, βₙ? In other words, how does the model "learn"? The key idea is to choose the coefficients that make the predictions as close as possible to the actual values in the data. This is where optimization comes in.
The most common method is Ordinary Least Squares (OLS). OLS works by minimizing the sum of squared residuals, the squared difference between the actual value yᵢ and the predicted value ŷᵢ for each data point. The cost function looks like this:

J(β) = Σᵢ (yᵢ − ŷᵢ)²
Why squared residuals? Squaring ensures that positive and negative errors don't cancel out, and it penalizes larger errors more heavily than smaller ones. By minimizing this function, OLS finds the "line of best fit."
There are two main ways to perform this minimization:
Analytical Solution (Closed-Form)
For relatively small datasets with not too many features, we can directly compute the optimal coefficients using linear algebra. The closed-form solution is:

β̂ = (XᵀX)⁻¹ Xᵀ y
Here, X is the feature matrix, y is the target vector, and β̂ is the vector of coefficients. This solution comes from setting the derivative of the cost function to zero and solving for β. It's exact, fast for low-dimensional data, and forms the mathematical foundation of linear regression.
However, the matrix inversion step becomes computationally expensive when the number of features is very large, and sometimes the matrix is not even invertible (especially with multicollinearity).
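A bare-bones NumPy version of the normal equations follows. This is a sketch rather than production code; it reuses `X` and `y` from the earlier sketches and uses `np.linalg.solve` instead of forming an explicit inverse, which is numerically preferable.

```python
# Sketch: closed-form OLS via the normal equations, β̂ = (XᵀX)⁻¹ Xᵀ y.
# Assumes NumPy; `X` and `y` are defined as in the sketches above.
import numpy as np

X_design = np.column_stack([np.ones(len(X)), X])            # prepend a column of 1s for β₀
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print("closed-form coefficients:", beta_hat)                # [β₀, β₁, β₂]
```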
Alternative: Singular Value Decomposition (SVD)
SVD provides a more numerically stable way to compute the solution. It decomposes the feature matrix X into:

X = U Σ Vᵀ

Where U and V are orthogonal matrices, and Σ is a diagonal matrix containing the singular values. The regression coefficients become:

β̂ = V Σ⁺ Uᵀ y

where Σ⁺ inverts the nonzero singular values.
Advantages of SVD:
- More numerically stable than direct matrix inversion
- Handles multicollinearity better
- Works even when XᵀX is singular or near-singular
- Used in dimensionality reduction and regularization methods
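A minimal NumPy sketch of the SVD route, reusing `X_design` and `y` from the closed-form sketch above (and ignoring near-zero singular values for simplicity):

```python
# Sketch: OLS coefficients via SVD, β̂ = V Σ⁺ Uᵀ y.
# Assumes NumPy; `X_design` and `y` are defined as in the closed-form sketch above.
import numpy as np

U, s, Vt = np.linalg.svd(X_design, full_matrices=False)
beta_svd = Vt.T @ (U.T @ y / s)     # dividing by the singular values applies Σ⁺
print("SVD-based coefficients:", beta_svd)

# Equivalently, np.linalg.pinv(X_design) @ y uses the same decomposition internally.
```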
Gradient Descent (Iterative Solution)
For larger datasets or high-dimensional problems, gradient descent is often preferred. Instead of solving the equations directly, gradient descent takes small steps in the direction that reduces the cost function the most.
The update rule for each coefficient looks like this:

βⱼ := βⱼ − α · ∂J(β)/∂βⱼ
Here, α is the learning rate, which controls how big each step is. If α is too small, training is slow; if it's too large, the algorithm may overshoot and fail to converge.
This visualization shows gradient descent with different learning rates. Small learning rate (left) takes many small steps and converges slowly. Optimal learning rate (center) converges efficiently. Large learning rate (right) takes large steps and may overshoot the minimum, failing to converge.
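Here is a minimal sketch of batch gradient descent for the squared-error cost, reusing `X_design` and `y` from the earlier sketches. The learning rate and iteration count are assumed values for illustration.

```python
# Sketch: batch gradient descent minimizing the mean squared-error cost.
# Assumes NumPy; `X_design` and `y` are defined as in the closed-form sketch above.
import numpy as np

alpha = 0.1                                   # learning rate (assumed value)
beta = np.zeros(X_design.shape[1])            # start with all coefficients at zero
n = len(y)

for _ in range(1000):                         # fixed number of iterations for simplicity
    errors = X_design @ beta - y              # prediction errors at the current step
    gradient = (2.0 / n) * (X_design.T @ errors)
    beta -= alpha * gradient                  # step against the gradient

print("gradient-descent coefficients:", beta)
```

With a well-chosen learning rate, the result should closely match the closed-form coefficients computed earlier.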
Gradient Descent Variants:
- Batch Gradient Descent: Uses all data points to compute gradients. Accurate but slow for large datasets.
- Stochastic Gradient Descent (SGD): Uses one data point at a time. Fast but noisy updates.
- Mini-batch Gradient Descent: Uses small batches of data. Balances speed and stability. Most commonly used in practice.
This visualization compares the convergence paths of different gradient descent variants. Batch Gradient Descent (smooth path) computes gradients using all data points, resulting in stable but slow updates. Stochastic Gradient Descent (noisy path) updates after each sample, converging faster but with erratic steps. Mini-batch Gradient Descent (moderate path) balances both approaches, providing relatively smooth convergence with good computational efficiency.
Gradient descent is more flexible than the closed-form solution. It can handle massive datasets, can be parallelized, and forms the basis of how many modern machine learning algorithms (like neural networks) are trained.
Interpretation of Coefficients
An important aspect of optimization in regression is not just finding the coefficients, but understanding them. Each coefficient represents the change in the target variable for a one-unit change in the corresponding feature, assuming all other features remain constant. This interpretability is one of the biggest strengths of linear regression compared to more complex models.
Evaluation Metrics for Regression
Once a regression model is trained, evaluate its performance using appropriate metrics. Different evaluation metrics capture different aspects of model performance.
This plot compares predicted values against actual values. Points along the diagonal line indicate perfect predictions. Deviations from this line show prediction errors, helping visualize model accuracy.
For detailed explanations of regression evaluation metrics including MSE, RMSE, MAE, R², and Adjusted R², see Regression Evaluation Metrics.
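As a quick sketch (assuming scikit-learn, and reusing `y` and `y_pred` from the earlier sketches), these metrics can be computed as follows:

```python
# Sketch: common regression metrics.
# Assumes scikit-learn and NumPy; `y` and `y_pred` are defined as in the sketches above.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mse = mean_squared_error(y, y_pred)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("MAE: ", mean_absolute_error(y, y_pred))
print("R²:  ", r2_score(y, y_pred))
```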
Practical Tips and Pitfalls
Feature Scaling
Linear regression doesn't require feature scaling, but scaling is essential with regularization (Ridge, Lasso). Without scaling, features with larger numeric ranges dominate the penalty terms. Always standardize or normalize features when using regularization.
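One convenient pattern is to bundle scaling and the regularized model together, for example with a scikit-learn pipeline (a sketch; the alpha value is an assumption):

```python
# Sketch: standardize features before Ridge regression using a pipeline.
# Assumes scikit-learn; `X` and `y` are defined as in the sketches above.
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ridge_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_pipeline.fit(X, y)    # scaling is fit on the training data and applied before Ridge
```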
Categorical Variables
Linear regression requires numeric inputs. Encode categorical features (city, job title) using one-hot encoding—each category becomes a binary column.
Dummy variable trap: Including all categories creates perfect multicollinearity. Drop one category during encoding.
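A small sketch of one-hot encoding with one category dropped, assuming pandas; the DataFrame and column names are hypothetical:

```python
# Sketch: one-hot encoding with one category dropped to avoid the dummy variable trap.
# Assumes pandas; the data and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "SF", "NY"], "salary": [90, 85, 110, 95]})
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)  # drops one city column
print(encoded.head())
```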
Overfitting
Too many predictors cause the model to capture noise instead of signal. Use regularization:
Ridge regression penalizes squared coefficients:

J(β) = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ βⱼ²
Discourages large coefficients and stabilizes models with multicollinearity.
Lasso regression penalizes absolute coefficients:

J(β) = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|
Drives some coefficients to zero, useful for feature selection.
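Both penalties are available in scikit-learn; here is a minimal sketch, reusing `X` and `y` from the earlier sketches (the alpha values, which play the role of λ, are assumptions):

```python
# Sketch: Ridge and Lasso regression; alpha controls the penalty strength (λ).
# Assumes scikit-learn; `X` and `y` are defined as in the sketches above.
from sklearn.linear_model import Lasso, Ridge

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)   # some may be exactly zero
```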
Outliers
Linear regression minimizes squared errors, making it sensitive to outliers. For example, when predicting BMI, an extreme data point (very tall athlete with high muscle mass) can shift the fitted line dramatically.
Solutions: Check residual plots and use robust regression techniques.
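One robust option is regression with a Huber loss, which down-weights large residuals. A minimal sketch, assuming scikit-learn and reusing `X` and `y` from the earlier sketches:

```python
# Sketch: robust regression with a Huber loss, less sensitive to outliers than OLS.
# Assumes scikit-learn; `X` and `y` are defined as in the sketches above.
from sklearn.linear_model import HuberRegressor

robust = HuberRegressor().fit(X, y)
print("robust coefficients:", robust.coef_)
```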
Coefficient Interpretability
Coefficients show associations, not causation. Assumption violations (such as correlated errors) and omitted variables can make interpretations misleading. Acknowledge these limitations.