Decision Trees: Splitting Data to Make Predictions

A depth-3 decision tree trained on 80 points whose true class boundary is a circle reaches 93.8% training accuracy with 8 leaf nodes. Its rectangular partitions approximate the curved boundary through axis-aligned splits.

Formulas

Entropy: H(S) = −Σ p_i × log₂(p_i)
Information gain: IG = H(parent) − Σ (|S_j|/|S|) × H(S_j)
Gini impurity: G = 1 − Σ p_i²

Here p_i is the proportion of class i in the set S at a node, and the S_j are the subsets produced by a candidate split.
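
These measures translate directly into code. Below is a minimal NumPy sketch (the names entropy, gini, and information_gain are just illustrative) with a small worked example: splitting a balanced 8-point parent into two 3-to-1 children yields a gain of about 0.19 bits.

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the class proportions in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """G = 1 - sum(p_i^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    """IG = H(parent) - sum(|S_j|/|S| * H(S_j)) over child subsets S_j."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Worked example: a split that turns a mixed parent into purer children.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])              # balanced: H = 1.0 bit
children = [np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])]
print(entropy(parent))                     # 1.0
print(gini(parent))                        # 0.5
print(information_gain(parent, children))  # ~0.189
```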

Divide and Classify

A decision tree is one of the most intuitive machine learning models — it mirrors how humans make decisions through a series of questions. Given data with features and labels, the algorithm finds the best feature and threshold to split the data into purer groups. It repeats recursively on each subset until a stopping criterion is met. The result is a tree of if-then rules that can classify new data points by traversing from root to leaf.
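
To make the recursion concrete, here is a from-scratch sketch of greedy tree growing (not the simulator's actual implementation): best_split scans every feature/threshold pair for the largest entropy reduction, and grow recurses until a node is pure, a depth limit is hit, or no split helps.

```python
import numpy as np

def entropy(labels):
    """Impurity of a node: H(S) = -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Greedy search: try every feature/threshold pair, keep the largest gain."""
    best_f, best_t, best_gain = None, None, 0.0
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = entropy(y) - (len(left) * entropy(left)
                                 + len(right) * entropy(right)) / len(y)
            if gain > best_gain:
                best_f, best_t, best_gain = f, t, gain
    return best_f, best_t

def grow(X, y, depth=0, max_depth=3):
    """Recurse until the node is pure, the depth limit is hit, or no split helps."""
    if depth == max_depth or len(np.unique(y)) == 1:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    f, t = best_split(X, y)
    if f is None:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    mask = X[:, f] <= t
    return {"leaf": False, "feature": f, "threshold": t,
            "left": grow(X[mask], y[mask], depth + 1, max_depth),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth)}
```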

Axis-Aligned Partitions

Each split in a standard decision tree tests a single feature against a threshold, creating a boundary that is perpendicular to one feature axis. The simulation shows these rectangular partitions overlaid on the 2D data — you can see how the tree approximates complex boundaries (circles, XOR patterns) through many small rectangular regions. This axis-alignment is both a strength (simple, interpretable) and a limitation (requires many splits for diagonal boundaries).
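
A few lines of scikit-learn reproduce this setup. The dataset below is a hypothetical stand-in for the simulator's 80 points (uniform points labeled by a circle), so the leaf count and accuracy it prints will approximate rather than match the 93.8% figure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the simulator's data: 80 points labeled by a circular boundary.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

# A depth-3 tree has at most 2^3 = 8 leaves, each an axis-aligned rectangle.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.get_n_leaves(), tree.score(X, y))  # leaf count, training accuracy
```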

Growing and Pruning

An unpruned tree will keep splitting until every leaf is pure — perfectly classifying the training data but memorizing noise. The max_depth parameter controls this: shallower trees underfit (too simple) while deeper trees overfit (too complex). The sweet spot depends on the data complexity and noise level. In production, cross-validation is used to find the optimal depth, and ensemble methods aggregate many trees to get robust predictions.
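
As a sketch of that tuning loop, the snippet below scores each candidate max_depth with 5-fold cross-validation on a synthetic circles dataset (make_circles stands in for the simulator's data) and keeps the best one.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=0)

# Score each depth by 5-fold cross-validation and keep the best.
depths = range(1, 11)
cv_scores = [cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                             X, y, cv=5).mean() for d in depths]
best_depth = list(depths)[int(np.argmax(cv_scores))]
print(best_depth, max(cv_scores))
```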

From Single Trees to Forests

The power of decision trees was amplified by ensemble methods. Random Forests (Breiman, 2001) train hundreds of trees on bootstrap samples with random feature subsets, then vote on predictions — dramatically reducing overfitting. Gradient Boosting (XGBoost, LightGBM) builds trees sequentially, with each tree correcting the errors of the previous ones. These ensemble methods consistently win machine learning competitions on structured/tabular data.
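
A quick comparison on noisy synthetic data illustrates the claim; exact scores depend on the seed and noise level, but the ensembles typically beat a single unconstrained tree.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_circles(n_samples=400, noise=0.15, factor=0.5, random_state=0)

# Single tree vs. bagged forest vs. boosted trees, 5-fold cross-validation.
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(score, 3))
```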

FAQ

How does a decision tree work?

A decision tree makes predictions by asking a series of yes/no questions about the input features. At each internal node, it tests one feature against a threshold and branches left or right. The process repeats until reaching a leaf node, which contains the prediction. Training involves finding the best feature and threshold at each node to maximize information gain or minimize impurity.
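
Prediction is just this traversal. A minimal sketch, assuming the nested-dict node format from the tree-growing example earlier:

```python
def predict_one(node, x):
    """Walk from root to leaf, testing one feature against a threshold per step."""
    while not node["leaf"]:
        key = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[key]
    return node["label"]

# Tiny hand-built tree: "is feature 0 <= 0.5?" splits into two pure leaves.
root = {"leaf": False, "feature": 0, "threshold": 0.5,
        "left": {"leaf": True, "label": 0},
        "right": {"leaf": True, "label": 1}}
print(predict_one(root, [0.2]))  # 0
print(predict_one(root, [0.9]))  # 1
```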

What is information gain?

Information gain measures how much a split reduces uncertainty (entropy) about the class labels. A perfect split separates classes completely (gain equals initial entropy). The tree algorithm greedily picks the split with the highest information gain at each node. Gini impurity is an alternative measure — computationally cheaper and nearly identical in practice.
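
A tiny numeric check of the perfect-split claim: a balanced two-class node has entropy 1 bit, each pure child contributes zero entropy, so the gain is exactly 1 bit.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Perfect split: pure children, so the gain equals the parent's entropy.
parent = np.array([0, 0, 1, 1])
left, right = np.array([0, 0]), np.array([1, 1])
gain = entropy(parent) - (0.5 * entropy(left) + 0.5 * entropy(right))
print(gain)  # 1.0
```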

Why do decision trees overfit?

Without constraints, a tree can grow until every training point has its own leaf — achieving 100% training accuracy but memorizing noise. This is why depth limits, minimum samples per leaf, and pruning are essential. Ensemble methods like Random Forests and Gradient Boosting use many shallow trees to get the expressive power of deep trees without overfitting.
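
The effect is easy to demonstrate. In the sketch below, an unconstrained scikit-learn tree fits the training set (nearly) perfectly while a depth- and leaf-size-limited tree trades a little training accuracy for better test accuracy; exact numbers vary with the random seed.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_circles(n_samples=400, noise=0.2, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)    # no limits
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                random_state=0).fit(X_tr, y_tr)  # constrained

print(full.score(X_tr, y_tr), full.score(X_te, y_te))      # ~1.0 train, lower test
print(pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))  # less train, better test
```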

What are Random Forests?

A Random Forest trains many decision trees on random subsets of the data and features, then averages their predictions. This reduces the high variance (overfitting) of individual trees while preserving their ability to capture complex patterns. Random Forests are among the most reliable out-of-the-box ML algorithms, competitive with neural networks on tabular data.
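
For intuition, here is a toy bagging sketch built from scikit-learn trees (bagged_forest and forest_predict are hypothetical helpers, not a library API): each tree trains on a bootstrap sample, max_features="sqrt" randomizes the features considered at each split, and predictions are a majority vote over binary 0/1 labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_forest(X, y, n_trees=100, seed=0):
    """Toy Random Forest: each tree sees a bootstrap sample of the rows,
    and max_features='sqrt' randomizes the features tried at each split."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        t = DecisionTreeClassifier(max_features="sqrt",
                                   random_state=int(rng.integers(1 << 31)))
        trees.append(t.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    """Majority vote across trees (binary 0/1 labels)."""
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```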
