
K-Nearest Neighbors (KNN)

Classification by finding the most similar training examples

K-Nearest Neighbors (KNN) is a simple, intuitive algorithm that classifies new data points based on similarity to training examples. Instead of learning a model, KNN stores the training data and makes predictions by looking at the k most similar examples.

How KNN Works

Given a new data point, KNN:

  1. Computes distance to every training point
  2. Selects the k closest neighbors
  3. Takes a majority vote of their labels

The prediction is the most common class among the k nearest neighbors. That's it. No training phase, no parameters to learn—just direct comparison to stored examples.
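
A minimal sketch of this procedure in NumPy; the arrays train_X, train_y and the query point below are illustrative placeholders, not from a real dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify one query point by majority vote of its k nearest neighbors."""
    # 1. Distance from the query to every training point (Euclidean)
    distances = np.linalg.norm(train_X - query, axis=1)
    # 2. Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote over their labels
    return Counter(train_y[nearest]).most_common(1)[0][0]

# Toy data: two features per point, binary labels
train_X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
train_y = np.array([0, 0, 1, 1])
print(knn_predict(train_X, train_y, query=np.array([1.2, 1.9]), k=3))  # -> 0
```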

Example: Classifying houses as "expensive" or "affordable" based on size and location. A new house is compared to all houses in the dataset, and the k most similar houses (by size and location) vote on its label.

Distance Metrics

Distance determines what "nearest" means. The choice dramatically affects which neighbors are selected.

Euclidean distance: Straight-line distance. Most common, assumes features trade off smoothly.

d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

Creates circular neighborhoods. A house can be similar by being slightly smaller with more bedrooms, or larger with fewer.

Manhattan distance: Sum of absolute differences along each axis. Features contribute independently.

d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|

Creates diamond-shaped neighborhoods. Large changes in one feature can't be compensated by small changes in another.

Cosine similarity: Measures angle between vectors, ignoring magnitude. Common for text and embeddings.

\text{sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}

Hamming distance: Counts differing features. Used for categorical or binary data.

The distance metric encodes your assumptions about similarity. KNN enforces those assumptions exactly—it doesn't adapt or correct them.
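
For reference, a sketch of these four metrics in plain NumPy; the example vectors are made up for illustration:

```python
import numpy as np

x = np.array([3.0, 4.0, 1.0])
y = np.array([1.0, 1.0, 1.0])

# Euclidean: straight-line distance, circular neighborhoods
euclidean = np.sqrt(np.sum((x - y) ** 2))          # ~3.606

# Manhattan: sum of per-axis differences, diamond-shaped neighborhoods
manhattan = np.sum(np.abs(x - y))                  # 5.0

# Cosine similarity: angle between vectors, magnitude ignored
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Hamming: number of positions that differ, for categorical data
a = np.array(["red", "small", "round"])
b = np.array(["red", "large", "round"])
hamming = np.sum(a != b)                           # 1 differing feature
```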

Choosing k: Bias-Variance Tradeoff

The value of k controls how local the predictions are.

Small k (e.g., k=1): Very local, follows noise closely. High variance, low bias. Decision boundaries are jagged and sensitive to outliers.

Large k (e.g., k=50): Smooths over noise by averaging over large neighborhoods. Low variance, high bias. Decision boundaries are smooth but may miss local patterns.

In practice: Choose k through cross-validation. Start with k ≈ sqrt(n), where n is the number of training points, as a baseline. Odd values of k avoid ties in binary classification.
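
A sketch of this selection procedure with scikit-learn; the bundled iris dataset here is just a stand-in for your own (already scaled) data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate odd values of k with 5-fold cross-validation
candidates = range(1, 32, 2)
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in candidates}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```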

Feature Scaling

KNN relies entirely on distance. If features have different scales, the largest-scale feature dominates distance calculations regardless of importance.

Example: Houses described by square footage (500-5000) and bedroom count (1-5). Euclidean distance will be driven almost entirely by square footage: two houses with the same square footage but different bedroom counts appear nearly identical, while even a modest difference in square footage outweighs any difference in bedrooms.

Solution: Standardize features to zero mean and unit variance, or normalize them to the [0, 1] range. This ensures all features contribute comparably to the distance.

Feature scaling is not optional for KNN. Without it, predictions are driven by measurement units, not meaningful similarity.
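
A sketch of the usual fix with scikit-learn: put the scaler and the classifier in one pipeline so scaling is always fit on the training data only. The tiny house dataset is hypothetical:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: [square footage, bedrooms] with very different scales
X = np.array([[1500, 3], [1550, 1], [1480, 2], [3200, 4], [3000, 5]])
y = np.array(["affordable", "affordable", "affordable", "expensive", "expensive"])

# StandardScaler gives each feature zero mean and unit variance before distances
# are computed, so bedrooms are no longer drowned out by square footage.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[1600, 2]]))   # -> ['affordable']
```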

Weighted KNN

Standard KNN gives all k neighbors equal votes. Weighted KNN assigns more influence to closer neighbors and less to distant ones.

Inverse distance weighting: Each neighbor's contribution is proportional to 1/distance. Very close points have high weight, distant points have low weight.

Benefits:

  • More robust to noise (distant outliers have less impact)
  • Reduces sensitivity to exact choice of k
  • Allows closer points to override marginal neighbors

Weighted KNN doesn't change which neighbors are selected—it changes how much each neighbor matters.
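
A sketch of inverse distance weighting in NumPy; the function and variable names are illustrative:

```python
import numpy as np

def weighted_knn_predict(train_X, train_y, query, k=5, eps=1e-9):
    """Classify by inverse-distance-weighted vote: closer neighbors count more."""
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    weights = 1.0 / (distances[nearest] + eps)       # eps guards against zero distance
    labels = train_y[nearest]
    # Sum the weights per class; the class with the largest total wins
    totals = {c: weights[labels == c].sum() for c in np.unique(labels)}
    return max(totals, key=totals.get)
```

scikit-learn exposes the same idea through KNeighborsClassifier(weights="distance").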

The Curse of Dimensionality

KNN struggles as the number of features grows. Each feature adds another way for points to differ. Small differences across many dimensions accumulate into large total distances.

In high dimensions:

  • All points become far apart
  • All distances become similar in magnitude
  • "Nearest" neighbors aren't meaningfully near
  • Distance loses its ability to measure similarity

Result: KNN predictions become unreliable. More data doesn't fix this—you'd need exponentially more points to find truly similar neighbors.
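
A quick way to see this effect is to measure how the nearest and farthest distances converge as dimensionality grows; this sketch uses uniformly random points purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((2000, d))                      # uniform points in the unit hypercube
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    # As d grows, the nearest and farthest neighbors of points[0] converge:
    print(d, round(dists.min() / dists.max(), 3))
```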

Mitigation: Use dimensionality reduction (PCA, feature selection) before applying KNN. Remove irrelevant features.
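
A sketch of that mitigation with scikit-learn: scale, project with PCA, then run KNN in the reduced space. The choice of 10 components is arbitrary here:

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale, project onto the leading principal components, then classify by
# neighbors in the reduced space instead of the original high-dimensional one.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),        # illustrative choice; tune like any hyperparameter
    KNeighborsClassifier(n_neighbors=5),
)
# model.fit(X_train, y_train); model.predict(X_new)
```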

Computational Efficiency

Brute force: Compute the distance to every training point. O(n · d) per query, where n is the number of training points and d the number of features. Simple but slow for large datasets.

KD-Trees: Partition space along axes to skip distant regions. Effective in low dimensions (≤20 features) but degrades in high dimensions.

Ball Trees: Group points into nested spheres. Similar to KD-trees but with different geometry. Better for some distance metrics.

Approximate Nearest Neighbors: Trade accuracy for speed. Methods like LSH (locality-sensitive hashing) return "close enough" neighbors much faster. Works well in high dimensions where exact methods fail.

Both KD-trees and Ball trees rely on pruning—skipping regions that can't contain closer neighbors. In high dimensions, distances become too similar for effective pruning.
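
scikit-learn's NearestNeighbors index makes the choice of search structure explicit; a sketch on random low-dimensional data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((10_000, 8))      # low-dimensional data, where tree indexes pay off

# algorithm can be "brute", "kd_tree", "ball_tree", or "auto" (let scikit-learn pick)
index = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)
distances, indices = index.kneighbors(rng.random((1, 8)))
print(indices[0])                # indices of the 5 nearest training points
```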

Imbalanced Datasets

When one class is much more common, most neighborhoods consist largely of majority-class points. KNN can then fail to identify minority-class examples even when they are present.

Mitigation:

  • Use smaller k to preserve local structure
  • Use weighted KNN to emphasize closest neighbors
  • Adjust decision rules with class weights or cost-sensitive voting
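
A sketch of cost-sensitive voting in plain Python, where each neighbor's vote is scaled by the inverse frequency of its class; all names and counts below are hypothetical:

```python
def class_weighted_vote(neighbor_labels, class_counts):
    """Scale each neighbor's vote by 1 / class frequency, so minority-class
    neighbors are not drowned out by the majority class."""
    totals = {}
    for label in neighbor_labels:
        totals[label] = totals.get(label, 0.0) + 1.0 / class_counts[label]
    return max(totals, key=totals.get)

# Hypothetical neighborhood of k=5 labels drawn from an imbalanced dataset
class_counts = {"majority": 950, "minority": 50}
print(class_weighted_vote(["majority"] * 3 + ["minority"] * 2, class_counts))  # -> "minority"
```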

When to Use KNN

KNN works well when:

  • Similarity is well-defined and meaningful
  • Dimensionality is low to moderate (≤20 features)
  • Dataset is small to medium sized
  • Decision boundaries are complex and nonlinear
  • You need a simple, interpretable baseline

KNN struggles when:

  • Features are high-dimensional or noisy
  • Dataset is very large (millions of points)
  • Real-time, low-latency predictions are required
  • Features don't naturally encode similarity
  • Data is very imbalanced

Evaluation Metrics

For evaluation metrics, see Classification Evaluation Metrics.

Comparison to Other Methods

vs Linear Models: KNN is more flexible (no global structure assumption) but slower at prediction and sensitive to scaling.

vs Decision Trees: Trees handle mixed features and irrelevant features better. KNN is better when smooth similarity matters.

vs SVMs: SVMs learn decision boundaries and handle high dimensions better. KNN is simpler but relies on raw distance.

vs Neural Networks: Neural networks invest in training to learn representations. KNN stores data and pays cost at prediction time.

KNN often appears in modern systems as a retrieval component operating over learned embeddings from neural networks, combining the flexibility of learned representations with the simplicity of neighbor-based decisions.
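
A sketch of that retrieval pattern; the embeddings here are random placeholders standing in for the output of some upstream encoder:

```python
import numpy as np

def retrieve(query_vec, embeddings, k=5):
    """Return indices of the k stored embeddings most similar to the query (cosine)."""
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    sims = emb_norm @ q_norm
    return np.argsort(sims)[::-1][:k]

# Random placeholders standing in for vectors from an upstream neural encoder
embeddings = np.random.default_rng(0).normal(size=(1000, 128))
print(retrieve(embeddings[42], embeddings, k=5))   # index 42 ranks first (similarity 1.0)
```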

