
Decision Trees

Splitting criteria and preventing overfitting

Decision Trees (DT) are the foundation for ensemble methods like Random Forest and Gradient Boosting Machines. See Tree Ensembles for more on these methods.

DTs break a complex problem into smaller, simpler ones by asking a series of questions that gradually separate the data into groups. Each question (= split) aims to make the resulting groups as pure as possible: for classification, each subset should mostly contain data points from a single class; for regression, the target values within each subset should be similar. In the end we reach groups of data points that behave the same way.

Each question is an internal node and represents a decision rule, each branch a possible outcome, and each leaf node a final prediction. This makes DTs interpretable, because we can follow the path from the root to a leaf for any prediction.
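As a minimal sketch of this (using scikit-learn and its bundled Iris dataset, which are chosen here purely for illustration), we can fit a shallow tree and print its rules, so each root-to-leaf path can be read directly:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a deliberately shallow tree on a small example dataset
data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Each printed line is one decision rule; indentation shows the path from root to leaf
print(export_text(clf, feature_names=list(data.feature_names)))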

Left unconstrained, however, a tree keeps growing until it has modeled every detail of the training data, which is called overfitting.

Assumptions of Decision Trees (DT)

DTs can model nonlinear relationships and feature interactions because they do not assume any particular shape of the data distribution, whereas linear models assume straight lines or planes. As a result, DTs perform well even when the relationships between input features and output labels are complex or irregular.

Still they have some underlying assumptions:

  1. Feature Relevance: DTs assume that the input features contain information that meaningfully separates classes or explains the target. If features are mostly noisy or irrelevant, the tree may learn unstable or misleading splits. Feature selection or feature importance analysis is used to mitigate this.

  2. Independent and Identically Distributed (i.i.d.) Samples: Training samples are assumed to be independent and drawn from the same distribution as future data. Violations (e.g., temporal or domain shifts) can reduce generalization performance.

  3. Sufficient Data for Each Split: Reliable splits require enough data at each node. Small datasets or rare categories lead to unstable thresholds and overfitting. Parameters such as min_samples_split and min_samples_leaf enforce minimum sample sizes to improve stability.

  4. Axis-Aligned Decision Boundaries: DTs split on one feature at a time, producing axis-aligned boundaries (e.g., Age < 35). This limits their ability to model diagonal or complex multivariate boundaries, a limitation often addressed by ensemble methods such as Random Forests.

This image shows how a decision tree makes predictions by splitting the data space into rectangular regions using simple "if–then" rules. Each colored area represents the class the model would predict for points in that region, while the dots show the actual training data. The straight horizontal and vertical lines come from the tree checking one feature at a time (for example, "is feature_1 greater than a certain value?"). This makes decision trees easy to understand, but also explains why their boundaries look blocky rather than smooth.

  5. Monotonic Relationships Within Splits: Each split assumes that a simple threshold (e.g., < or >) meaningfully separates the data. Trees perform best when such threshold-based differences exist and are less effective for relationships that vary smoothly or require fine-grained continuous modeling (the sketch after this list shows what these threshold rules look like in practice).
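To make the last two points concrete, here is a small sketch (scikit-learn plus synthetic data from make_classification, both chosen only for illustration) that prints the splits of a shallow tree. Every internal node tests exactly one feature against one threshold, which is what produces the axis-aligned, blocky regions described above.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Two informative features, so every split is a vertical or horizontal line in feature space
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node stores one feature index and one threshold
tree = clf.tree_
for node in range(tree.node_count):
    if tree.children_left[node] != -1:  # -1 marks a leaf
        print(f"node {node}: feature_{tree.feature[node]} <= {tree.threshold[node]:.2f}")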

How a Decision Tree is Built and Optimized

At each node, the DT chooses the most informative question by evaluating all features and candidate split points. The algorithm then selects the split that most reduces impurity (classification) or variance (regression).

The Splitting Criteria

At every node, the tree searches through all available features and possible split points to find the one that best separates the data. The “best” split depends on how we measure impurity — how mixed the classes (or values) are within a node.

Entropy and Information Gain (Classification Trees)

Entropy measures the level of disorder in a dataset. For a dataset S with c classes, entropy is defined as:

$$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

Here, $p_i$ is the proportion of samples belonging to class $i$. A node with perfectly mixed classes (e.g., 50% A and 50% B) has high entropy, while a node containing only one class has entropy 0.
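As a quick numerical check, here is a sketch using only NumPy, with class proportions passed in as plain lists rather than taken from any particular dataset:

import numpy as np

def entropy(p):
    """Entropy in bits for a vector of class proportions (zero proportions are ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # perfectly mixed two-class node -> 1 bit
print(entropy([1.0]))        # pure node -> zero entropy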

When we split a node based on a feature $A$, we calculate Information Gain (IG) — the reduction in entropy after the split:

$$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

where $S_v$ is the subset of samples where feature $A$ has value $v$.

The higher the information gain, the better the feature and threshold for splitting.
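A self-contained sketch of this calculation (plain NumPy, with a small hypothetical binary feature used only for illustration): split the labels by the feature value and weight each child's entropy by its share of the samples.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """IG(S, A) = H(S) - sum over v of |S_v|/|S| * H(S_v)."""
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return total - weighted

# Hypothetical toy data: the feature separates the classes fairly well
y = np.array([0, 0, 0, 1, 1, 1])
a = np.array([0, 0, 0, 0, 1, 1])
print(information_gain(y, a))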

Gini Impurity (CART Algorithm)

Another popular metric used in the CART (Classification and Regression Trees) algorithm is Gini impurity, which measures how often a randomly chosen sample would be misclassified if it were labeled randomly according to the class distribution in the node.

$$\text{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^2$$

Like entropy, Gini impurity is 0 for a pure node and higher for mixed ones. It's computationally simpler and often performs comparably to information gain.
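The same kind of sketch for Gini impurity (NumPy only, class proportions assumed as input):

import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(gini([0.5, 0.5]))  # maximally mixed binary node -> 0.5
print(gini([1.0, 0.0]))  # pure node -> 0.0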

Variance Reduction (Regression Trees)

For regression tasks, DTs use variance reduction instead of class impurity. The goal is to minimize the variability of target values in each split:

$$\text{Var}(S) = \frac{1}{|S|} \sum_{i=1}^{|S|} (y_i - \bar{y})^2$$

$$\Delta\text{Var} = \text{Var}(S) - \sum_{v} \frac{|S_v|}{|S|} \text{Var}(S_v)$$

The algorithm selects the split that leads to the largest reduction in variance across the child nodes.
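A minimal sketch of variance reduction for a single candidate threshold (NumPy only; the toy targets and the threshold value are made up for illustration):

import numpy as np

def variance_reduction(y, feature, threshold):
    """Var(S) minus the weighted variances of the two children created by the split."""
    left, right = y[feature <= threshold], y[feature > threshold]
    weighted = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted

# Hypothetical regression targets that jump once the feature exceeds 3
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(variance_reduction(y, feature, threshold=3.0))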

Stopping Criteria

If we let the tree keep splitting, it can easily memorize the training data. To prevent this, we define stopping criteria, such as:

  • Maximum depth – Limits how deep the tree can grow.
  • Minimum samples per split/leaf – Ensures each decision is supported by enough data.
  • Minimum impurity decrease – Requires a certain level of improvement to justify further splits.

These constraints act as regularization, helping control overfitting and improving generalization.
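In scikit-learn, these stopping criteria map directly onto constructor parameters. A sketch with illustrative values (the numbers are arbitrary examples, not recommendations):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,                 # limit how deep the tree can grow
    min_samples_split=20,        # a node needs at least 20 samples to be split
    min_samples_leaf=10,         # every leaf must keep at least 10 samples
    min_impurity_decrease=0.01,  # a split must reduce impurity by at least this much
    random_state=0,
)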

This visualization shows trees with different maximum depths. Shallow trees (lower depth) create simpler decision boundaries that generalize well but may underfit the data. At depth 0 we can still see some rejected samples below the depth-0 split line that would have been classified as approved. As depth increases, the tree captures more complex patterns and fits the training data more closely. However, very deep trees create overly complex boundaries that memorize noise in the training data, leading to poor performance on new data. In our case the split at depth 2 no longer separates the classes and adds no new value, yet a single outlier there could cause the wrong class to be predicted.

The maximum depth parameter helps find the right balance between model complexity and generalization.

Pruning to Avoid Overfitting

Even with stopping criteria, trees can still become overly complex. Pruning helps simplify them without sacrificing too much accuracy.

Pre-pruning (Early Stopping): Stop growing the tree once certain conditions are met (like reaching a maximum depth or minimum leaf size).

Post-pruning (Cost Complexity Pruning): Grow the full tree first, then trim back branches that add little predictive power.

Post-pruning uses a penalty term for complexity, balancing accuracy and simplicity. The cost-complexity function is:

$$C_\alpha(T) = R(T) + \alpha |T|$$

Here, $R(T)$ is the training error of the tree $T$, $|T|$ is the number of leaves, and $\alpha$ controls the trade-off between model fit and simplicity. A higher $\alpha$ leads to a smaller, more general tree.
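A sketch of evaluating this cost for a fitted scikit-learn tree, where R(T) is taken as the training error and |T| as the leaf count (the Iris data and the alpha value are arbitrary illustrations):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Grow a full tree, then evaluate the cost-complexity objective for one alpha
X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

R_T = 1.0 - clf.score(X_train, y_train)  # R(T): training error of the full tree
alpha = 0.01                             # illustrative penalty per leaf
print(R_T + alpha * clf.get_n_leaves())  # C_alpha(T) = R(T) + alpha * |T|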

Evaluation Metrics

Once a decision tree is trained, evaluate its performance using appropriate metrics. Trees can easily overfit, so using the right metric is critical to judge both accuracy and generalization.

The choice of evaluation metric depends on whether you're solving a classification problem (accuracy, precision, recall, F1-score, ROC-AUC) or a regression problem (MAE, MSE/RMSE, R²).
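As an illustration for the classification case, here is a sketch on scikit-learn's bundled breast-cancer dataset (used only to show the API; regression metrics such as mean_squared_error and r2_score follow the same pattern):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(*load_breast_cancer(return_X_y=True), random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Precision, recall and F1 per class, plus ROC-AUC from predicted probabilities
print(classification_report(y_test, clf.predict(X_test)))
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))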

Practical Tips and Common Pitfalls

Controlling Overfitting

Without constraints, trees keep splitting until they memorize the training data, including noise.

Control tree growth with these hyperparameters:

  • max_depth: Limits tree depth. Shallower trees generalize better.
  • min_samples_split: Minimum samples required to split a node.
  • min_samples_leaf: Minimum samples required in each leaf.
  • max_leaf_nodes: Maximum number of leaf nodes.

Use cost-complexity pruning to find optimal depth:

# clf is a fitted (unpruned) DecisionTreeClassifier
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas  # candidate alphas, ordered from weakest to strongest pruning

Test different ccp_alpha values with cross-validation to balance simplicity and accuracy.
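One way to do this, continuing the snippet above (so X_train, y_train and ccp_alphas are assumed to exist): score each candidate alpha with cross-validation and refit with the best one.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Mean cross-validated accuracy for each candidate alpha
cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                    X_train, y_train, cv=5).mean()
    for a in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(cv_scores))]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)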

Model Stability

Decision trees are high-variance models. Small changes in training data can produce completely different tree structures.

Always use k-fold cross-validation:

from sklearn.model_selection import cross_val_score

# clf is the decision tree under evaluation; report mean and spread across folds
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())

If scores vary significantly across folds, the model is unstable. This indicates the tree may be too deep or the dataset too small. Random Forests address this by averaging multiple trees.

Imbalanced Data

Trees favor majority classes because impurity-based splits work better with larger groups.

High accuracy with poor minority class recall indicates imbalance. Solutions:

  • Use class_weight='balanced' in DecisionTreeClassifier to adjust weights by class frequency (see the sketch after this list).
  • Apply resampling: SMOTE (oversample minority) or RandomUnderSampler (downsample majority).
  • Use F1-score or ROC-AUC instead of accuracy.
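A sketch of the class_weight option from the list above, on a synthetic imbalanced problem (the 95/5 ratio and the make_classification data are made up for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Imbalanced toy problem: roughly 95% majority class, 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights samples inversely to class frequency
clf = DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0)
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test)))  # judge the minority class with F1, not accuracy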

Feature Preprocessing

Decision trees are scale-invariant. They only care about order, not magnitude, so feature scaling is unnecessary.

Categorical features must be encoded numerically:

  • One-hot encoding: For nominal categories (color, city).
  • Ordinal encoding: For ordered categories (education level).

High-cardinality features (ZIP codes, user IDs) create too many small splits. Combine rare categories into "Other" or use target encoding with cross-validation to prevent leakage.
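A sketch of encoding categorical columns before a scikit-learn tree (the column names, category lists, and toy labels here are hypothetical, and pandas is assumed to be available):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data frame with one nominal and one ordered categorical column
df = pd.DataFrame({
    "city": ["Berlin", "Munich", "Berlin", "Hamburg"],
    "education": ["BSc", "MSc", "PhD", "BSc"],
})
y = [0, 1, 1, 0]

encode = ColumnTransformer([
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("ordinal", OrdinalEncoder(categories=[["BSc", "MSc", "PhD"]]), ["education"]),
])
model = make_pipeline(encode, DecisionTreeClassifier(random_state=0))
model.fit(df, y)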

Model Interpretability

Decision trees are highly interpretable. Each prediction follows a clear path of conditions, useful for regulated industries (finance, healthcare).

Visualize trees in scikit-learn:

from sklearn.tree import plot_tree, export_text
import matplotlib.pyplot as plt

plot_tree(clf, filled=True)  # graphical view of the fitted tree
plt.show()
print(export_text(clf, feature_names=feature_names))  # text rules; feature_names is a list of column names

For ensemble models, use SHAP (SHapley Additive Explanations) to quantify feature influence on specific predictions.
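A minimal sketch, assuming the third-party shap package is installed, model is an already-fitted tree ensemble, and X_test holds the rows to explain:

import shap

# TreeExplainer is specialized for tree-based models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # one contribution per feature per prediction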

Ensemble Methods

Single decision trees are unstable. Ensemble methods improve performance:

  • Random Forests: Reduce variance by averaging multiple trees trained on random data and feature subsets.
  • Gradient Boosting: Reduce bias by sequentially training trees that correct previous errors.

Ensembles balance bias and variance while maintaining interpretability of individual trees.
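As a rough sketch of the difference (default settings on scikit-learn's Iris data, used only to show the API, not as a benchmark):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0),       # reduces variance by averaging trees
              GradientBoostingClassifier(random_state=0)):  # reduces bias by sequential fitting
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())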

Feature Importance

Decision trees compute feature importance based on impurity reduction across splits:

importances = clf.feature_importances_

Impurity-based importance can favor continuous or high-cardinality features. Use permutation importance for a more robust measure—it shows how randomizing each feature affects accuracy.

Combine both methods for better feature selection.
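A sketch combining both measures, assuming a fitted classifier clf and held-out X_test, y_test (not defined on this page):

from sklearn.inspection import permutation_importance

# Impurity-based importance, computed from the training data during fitting
print(clf.feature_importances_)

# Permutation importance: drop in test score when each feature is shuffled
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)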

Data Leakage

Decision trees memorize patterns effectively, making them vulnerable to data leakage. Avoid:

  • Features that depend on the target (future sales data).
  • Random splitting of time-series data (use time-based validation instead).

Use preprocessing pipelines (sklearn.pipeline) to ensure identical transformations for training and testing.
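A sketch of both ideas together, with a hypothetical feature matrix X and target y (TimeSeriesSplit assumes the rows are already in chronological order):

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# The imputer is fitted inside each training fold only, so no statistics leak from the test fold
model = make_pipeline(SimpleImputer(strategy="median"), DecisionTreeRegressor(random_state=0))

# Time-ordered validation: every test fold lies strictly after its training fold
scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))
print(scores.mean())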

Decision trees work best in ensembles (Random Forests, AdaBoost, Gradient Boosting) that combine multiple trees to balance bias and variance.

