Decision Trees: How Algorithms Make Classification Decisions


A depth-4 decision tree typically achieves ~92% training accuracy on well-separated 2-class data, creating interpretable rectangular decision regions in feature space.

Formula

Gini(node) = 1 - Σᵢ pᵢ²
Entropy(node) = -Σᵢ pᵢ log₂(pᵢ)
Gain(split) = Impurity(parent) - Σⱼ (nⱼ/n) × Impurity(childⱼ), where Impurity is Gini or Entropy (with Entropy this is the information gain)

The Most Interpretable Classifier

Decision trees are one of the few machine learning models that mirror the way humans naturally reason: by asking a sequence of yes/no questions. 'Is the email longer than 50 words? Does it contain the word "free"? Is the sender unknown?' Each answer leads to another question or a final classification. This makes decision trees exceptionally interpretable: you can explain every prediction by tracing the path from root to leaf.

Building the Tree: Greedy Splitting

The algorithm builds the tree top-down by choosing the best split at each node. 'Best' means the split that most reduces impurity — typically measured by Gini impurity or information gain (entropy). The algorithm considers every feature and every possible threshold, evaluating each split's ability to separate the classes. This greedy approach is fast but doesn't guarantee the globally optimal tree.
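To make the greedy search concrete, here is a minimal sketch of how a single split could be chosen: it tries every feature and every observed threshold, keeping the split with the lowest weighted Gini impurity of the two children. The function names (`gini`, `best_split`) are illustrative, not any library's API.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustively search every feature and threshold for the split
    that minimizes the weighted Gini impurity of the two children."""
    n, d = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, weighted impurity)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # skip splits that leave one side empty
            w_impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if w_impurity < best[2]:
                best = (j, t, w_impurity)
    return best
```

Production implementations typically sort each feature once and sweep thresholds incrementally rather than re-evaluating every candidate from scratch, but the idea is the same: pick the locally best split, then recurse on each child.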

The Overfitting Problem

An unconstrained decision tree will perfectly classify every training point by creating a leaf for each one. This memorization of noise produces beautiful training accuracy and terrible real-world performance. The simulation above shows how increasing tree depth improves training accuracy but creates increasingly complex, jagged decision boundaries. Controlling depth is the key to balancing accuracy and generalization.
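As a rough illustration of the trade-off, this sketch uses scikit-learn's DecisionTreeClassifier on a noisy synthetic dataset. Exact scores will vary, but the unconstrained tree should show the largest gap between training and test accuracy.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data; a deep tree will memorize the noise.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 4, None):  # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```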

From Single Trees to Forests

Individual decision trees are simple but fragile — small changes in data can produce completely different trees. Random Forests (Breiman, 2001) solve this by growing hundreds of trees on random data subsets and averaging their predictions. Gradient Boosted Trees (XGBoost, LightGBM) build trees sequentially, each correcting the errors of the previous ones. These ensemble methods dominate Kaggle competitions and real-world applications alike.
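A minimal comparison sketch, assuming scikit-learn is available; both ensembles use mostly default settings, so the numbers are only indicative.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

# Bagged ensemble: each tree is grown on a bootstrap sample with randomized feature choices.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
# Boosted ensemble: trees are added sequentially, each fitting the current errors.
boost = GradientBoostingClassifier(n_estimators=200, random_state=0)

for name, model in [("random forest", forest), ("gradient boosting", boost)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {score:.2f}")
```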

FAQ

How does a decision tree classify data?

A decision tree recursively splits the data by choosing the feature and threshold that best separates the classes. At each node, it asks a yes/no question (e.g., 'Is x₁ > 3.5?'). Following the answers from root to leaf gives the predicted class.
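The prediction step itself is just a walk down the tree. A toy sketch follows; the Node structure and predict function are illustrative, not any particular library's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the feature to test (None at a leaf)
    threshold: Optional[float] = None  # split threshold
    left: Optional["Node"] = None      # subtree where x[feature] <= threshold
    right: Optional["Node"] = None     # subtree where x[feature] > threshold
    label: Optional[int] = None        # predicted class at a leaf

def predict(node, x):
    """Follow the yes/no answers from the root to a leaf and return its class."""
    if node.label is not None:
        return node.label
    branch = node.left if x[node.feature] <= node.threshold else node.right
    return predict(branch, x)

# 'Is x1 > 3.5?' as a one-split tree: class 0 if x1 <= 3.5, else class 1.
root = Node(feature=0, threshold=3.5, left=Node(label=0), right=Node(label=1))
print(predict(root, [2.0]))  # -> 0
```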

What is Gini impurity?

Gini impurity measures how often a randomly chosen element would be misclassified. For a node with class proportions p₁, p₂, ..., Gini = 1 - Σpᵢ². Pure nodes (all one class) have Gini = 0. The tree chooses splits that minimize weighted Gini of child nodes.
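A quick worked check of the formula, in plain Python with illustrative numbers:

```python
def gini(proportions):
    """Gini impurity for a node with the given class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)

print(gini([1.0, 0.0]))  # pure node            -> 0.0
print(gini([0.5, 0.5]))  # maximally mixed node -> 0.5
print(gini([0.9, 0.1]))  # mostly one class     -> 0.18
```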

Why do decision trees overfit?

Without constraints, a tree will keep splitting until every training point is correctly classified — memorizing the data including noise. This produces high training accuracy but poor generalization. Pruning, maximum depth limits, and minimum samples per leaf prevent overfitting.
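In scikit-learn these controls map directly to constructor parameters; the values below are illustrative, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Common ways to constrain a tree before or after growing it:
tree = DecisionTreeClassifier(
    max_depth=4,          # cap the number of questions on any root-to-leaf path
    min_samples_leaf=10,  # refuse splits that leave fewer than 10 samples in a leaf
    ccp_alpha=0.01,       # cost-complexity (post-)pruning strength
)
```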

What are Random Forests?

Random Forests combine many decision trees, each trained on a random subset of data and features. Individual trees may overfit, but their averaged predictions are robust. This ensemble method typically outperforms single trees and is one of the most reliable off-the-shelf classifiers.

