Support Vector Machines (SVM)
Finding optimal decision boundaries by maximizing margin between classes
Support Vector Machines (SVMs) find the decision boundary that best separates classes by maximizing the margin—the distance between the boundary and the nearest points from each class. The same core principle of margin maximization applies to both classification and regression, but with different objectives and outputs.
Core Principle
SVMs don't just find any separating boundary—they find the one that stays as far as possible from both classes. This maximum margin principle provides better generalization by creating a buffer zone around the decision boundary.
The points closest to the boundary are called support vectors. These are the critical points that define the margin. The rest of the training data could be removed without changing the decision boundary—only support vectors matter.
The visualization shows the maximum margin hyperplane (solid line) separating two classes. The dashed lines indicate the margin boundaries, and the circled points are support vectors—the only training points that influence the decision boundary. All other points could be moved or removed without affecting the model.
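A rough sketch of this idea, assuming scikit-learn and a synthetic two-blob dataset (the C value is illustrative): fit a linear SVC, read off which points became support vectors, then refit on those points alone to confirm the boundary is essentially unchanged.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters (illustrative data)
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print("support vector indices:", clf.support_)
print("w:", clf.coef_[0], "b:", clf.intercept_[0])

# Refit on the support vectors only: the hyperplane is essentially the same,
# because the discarded points had no influence on the original solution.
X_sv, y_sv = clf.support_vectors_, y[clf.support_]
clf_sv = SVC(kernel="linear", C=10.0).fit(X_sv, y_sv)
print("w (SVs only):", clf_sv.coef_[0], "b:", clf_sv.intercept_[0])
```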
The Model Family
SVC (Support Vector Classification)
Finds a hyperplane that separates classes with maximum margin. For linearly separable data, the decision boundary is the hyperplane where f(x) = w·x + b = 0.
Predict class +1 if f(x) ≥ 0, class -1 otherwise. The margin is 2/||w||, so maximizing margin means minimizing ||w||.
Soft margin: When data isn't perfectly separable, allow some points to violate the margin using slack variables. The C parameter controls the tradeoff between margin width and classification errors.
Multi-class: Extend to multiple classes using one-vs-one or one-vs-all strategies.
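A minimal soft-margin sketch, assuming scikit-learn and synthetic three-class data (the parameter values are illustrative, not recommendations); SVC handles the multi-class case internally with a one-vs-one scheme.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, slightly overlapping three-class data (illustrative values throughout)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Soft-margin classifier; C trades margin width against training errors.
# Multi-class is handled internally (one-vs-one in scikit-learn's SVC).
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```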
SVR (Support Vector Regression)
Applies the margin concept to regression. Instead of maximizing the distance between classes, SVR fits a tube around the predicted function inside which errors are tolerated.
Points outside the tube contribute to the loss, much as margin-violating points do in SVC. The ε parameter sets the tube width: residuals smaller than ε incur no penalty.
The prediction is still a linear combination of support vectors—the points that fall outside or on the edge of the tube.
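A small SVR sketch under the same assumptions (scikit-learn, a synthetic noisy sine curve, illustrative C and ε): points whose residual stays within ε incur no loss and do not become support vectors.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve (synthetic data)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80))[:, None]
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon sets the tube width: residuals smaller than epsilon carry no loss
svr = SVR(kernel="rbf", C=10.0, epsilon=0.2).fit(X, y)
print("number of support vectors:", len(svr.support_))  # points on/outside the tube
print("prediction at x=2.5:", svr.predict([[2.5]]))
```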
The Kernel Trick
SVMs handle nonlinear boundaries by mapping data to a higher-dimensional space where it becomes linearly separable. The kernel trick computes these mappings implicitly without explicitly constructing the high-dimensional space.
Common kernels:
Linear: K(x, x′) = x·x′ — Standard dot product, for linearly separable data.
Polynomial: K(x, x′) = (γ x·x′ + r)^d — Captures polynomial relationships up to degree d.
RBF (Radial Basis Function): K(x, x′) = exp(−γ ||x − x′||²) — Most popular, creates smooth nonlinear boundaries. The γ parameter controls influence radius.
The kernel function measures similarity between points. RBF gives high similarity to nearby points, polynomial captures feature interactions.
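One way to see that the mapping is implicit, sketched with scikit-learn on synthetic data (γ here is an arbitrary illustrative value): SVC only ever needs pairwise kernel values, so handing it a precomputed RBF Gram matrix gives the same model as letting it evaluate the kernel itself.

```python
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 0.5  # illustrative value

# Standard usage: let SVC evaluate the RBF kernel internally
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

# Equivalent: hand SVC the Gram matrix of pairwise similarities K(x, x')
K = rbf_kernel(X, X, gamma=gamma)
clf_pre = SVC(kernel="precomputed").fit(K, y)

# Predictions agree: the model never needs the high-dimensional mapping itself
K_test = rbf_kernel(X[:5], X, gamma=gamma)
print(clf.predict(X[:5]), clf_pre.predict(K_test))
```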
Key Hyperparameters
C (regularization): Controls the penalty for margin violations. Large C penalizes violations heavily (approaching a hard margin; can overfit). Small C tolerates more violations (softer margin, more robust).
kernel: Choice of kernel function determines boundary flexibility.
γ (gamma): For RBF/polynomial kernels, controls how far the influence of a single training point reaches. Large γ means small influence radius (complex boundaries, can overfit). Small γ means wide influence (smoother boundaries).
ε (epsilon): For SVR, defines the width of the tube where no penalty is given.
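A typical way to tune these, sketched with scikit-learn (the dataset and grid values are illustrative): scale the features, then cross-validate over log-spaced C and γ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scaling matters: RBF distances are meaningless if features have wildly different ranges
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Log-spaced grids are the usual starting point for C and gamma (values are illustrative)
grid = GridSearchCV(pipe,
                    param_grid={"svc__C": [0.1, 1, 10, 100],
                                "svc__gamma": [0.001, 0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```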
Shared Characteristics
All SVM variants share:
Margin maximization: The core optimization principle. SVC maximizes the margin around the decision boundary; SVR keeps the fitted function as flat as possible (small ||w||) while respecting the ε tube.
Support vectors: Only a subset of the training points (the support vectors) determines the final model; the remaining points have no influence on predictions.
Kernel trick: Handle nonlinearity by implicitly mapping to higher dimensions.
Convex optimization: Training solves a convex quadratic programming problem with a unique global optimum.
Sparsity: Only support vectors contribute to predictions, making the model compact.
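The sparsity point can be checked directly. This sketch (scikit-learn, binary synthetic data, illustrative γ) rebuilds the decision function from nothing but the stored support vectors, their dual coefficients, and the intercept.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

# Decision function rebuilt from support vectors alone (binary case):
# f(x) = sum_i dual_coef_i * K(sv_i, x) + b
K = rbf_kernel(X[:5], clf.support_vectors_, gamma=0.5)
manual = K @ clf.dual_coef_[0] + clf.intercept_[0]
print(np.allclose(manual, clf.decision_function(X[:5])))  # True
```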
When to Use SVMs
SVMs work well when:
- Data is high-dimensional (text, genomics)
- Clear margin exists between classes
- Dataset is small to medium-sized (thousands to tens of thousands of points)
- Need robust decision boundaries that generalize well
- Kernel methods provide a computational advantage (nonlinear boundaries without explicit feature expansion)
SVMs struggle when:
- Dataset is very large (millions of points)—training becomes slow
- Data is very noisy—support vectors can be unstable
- Features are not meaningful under distance metrics
- Need probability estimates (requires additional calibration; see the sketch after this list)
- Interpretability is critical (kernelized SVMs are black boxes)
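On the probability point, a brief sketch with scikit-learn (data and settings are illustrative): a plain SVC only exposes signed margin distances, while probability=True adds Platt-scaling calibration via internal cross-validation, at a noticeable cost in training time.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain SVC exposes decision_function (signed margin distances), not probabilities.
# probability=True fits a Platt-scaling calibrator via internal cross-validation,
# which makes training noticeably slower.
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))
print(clf.decision_function(X_test[:3]))
```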
Comparison Within Family
SVC vs Logistic Regression: Both find linear boundaries, but SVC maximizes margin while logistic regression minimizes log loss. SVC is more robust to outliers near the boundary.
SVR vs Linear Regression: Linear regression minimizes squared errors everywhere. SVR ignores small errors (within ε tube) and focuses on large deviations, making it more robust to outliers.
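A quick illustration of that robustness claim, assuming scikit-learn and synthetic linear data with a handful of injected outliers (all values illustrative): the squared loss is pulled toward the outliers, while the ε-insensitive loss, which grows only linearly, tends to be pulled far less.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.linspace(0, 10, 100)[:, None]
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.3, size=100)
y[-15:] += 15.0  # a cluster of large outliers at the high end

lr = LinearRegression().fit(X, y)
svr = SVR(kernel="linear", C=1.0, epsilon=0.5).fit(X, y)

# Squared error grows quadratically with the residual, so the OLS slope gets
# dragged upward; the epsilon-insensitive loss grows only linearly, so the
# SVR fit tends to stay closer to the true slope of 2.
print("linear regression slope:", lr.coef_[0])
print("SVR slope:", svr.coef_.ravel()[0])
```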
SVM vs Neural Networks: SVMs have fewer hyperparameters and often work better on small datasets. Neural networks scale better to very large datasets and can learn complex feature representations.
SVM vs Tree Models: Trees handle mixed features and missing data naturally. SVMs require numeric, scaled features but often generalize better in high dimensions.
The SVM family demonstrates how a single principle—margin maximization—can be adapted to different tasks through different loss functions and output types, while maintaining the core strengths of the approach.