The Explore-Exploit Dilemma
Imagine you are in a casino with five slot machines, each with a different (and unknown) payout rate. Should you keep pulling the lever that has paid out the most so far, or try the others, which might be even better? This is the multi-armed bandit problem, one of the most fundamental challenges in sequential decision-making. The tension between exploration (gathering information) and exploitation (using what you already know) appears everywhere, from clinical trials deciding which treatment to give patients to websites running A/B tests.
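To make the setup concrete, here is a minimal sketch of the casino as a Bernoulli bandit: each machine pays 1 with a fixed hidden probability and 0 otherwise. The class name and the example payout rates are illustrative assumptions, not part of the simulator itself.

```python
import random

class BernoulliBandit:
    """Casino with several slot machines; each pays 1 with a fixed, hidden probability."""

    def __init__(self, payout_rates):
        self.payout_rates = payout_rates    # hidden from the player
        self.best_rate = max(payout_rates)  # used later to measure regret

    def pull(self, arm):
        """Pull one lever; return 1 (win) or 0 (loss)."""
        return 1 if random.random() < self.payout_rates[arm] else 0

# Five machines; the player never sees these rates, only the wins and losses.
bandit = BernoulliBandit([0.10, 0.25, 0.30, 0.45, 0.50])
```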
Strategies Compared
This simulator lets you compare four classic strategies. Random selection serves as a baseline with linear regret. Epsilon-greedy exploits the best-known arm most of the time but randomly explores with probability ε. UCB1 adds an optimistic confidence bonus to under-explored arms, achieving logarithmic regret without any tuning parameter. Thompson sampling uses Bayesian inference, drawing samples from posterior distributions and selecting the arm with the highest sample.
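Each selection rule fits in a few lines. The sketch below uses assumed names and signatures, not the simulator's own code: counts holds the number of pulls per arm, values the running mean reward per arm, and t the current round. For Bernoulli rewards, Thompson sampling's posterior is a Beta distribution, so it tracks win/loss tallies instead.

```python
import math
import random

def random_choice(counts, values, t):
    """Baseline: ignore all evidence and pick an arm uniformly at random."""
    return random.randrange(len(counts))

def epsilon_greedy(counts, values, t, epsilon=0.1):
    """Exploit the arm with the best running mean, but explore uniformly
    with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(counts))
    return max(range(len(values)), key=lambda a: values[a])

def ucb1(counts, values, t):
    """Play the arm with the highest optimistic index,
    mean + sqrt(2 ln t / n_a); any untried arm is played first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(values)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

def thompson_sampling(wins, losses):
    """Draw one plausible payout rate per arm from its Beta posterior
    (uniform Beta(1, 1) prior) and play the arm with the highest draw."""
    samples = [random.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
    return max(range(len(samples)), key=lambda a: samples[a])
```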
Understanding Regret
The key metric in bandit problems is cumulative regret: how much reward you have lost compared to an oracle that always picks the best arm. Formally, if μ* is the best arm's expected payout, each pull of arm a adds μ* − μ_a to the regret, so after T pulls the total is T·μ* minus the expected reward you actually collected. Random selection accumulates regret linearly forever. Smart algorithms like UCB1 and Thompson sampling achieve logarithmic regret: the regret curve flattens over time as the algorithm converges on the best arm. Watch the regret curves in the simulation; the difference between strategies becomes dramatically clear over hundreds of pulls.
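A simulation loop that tracks this quantity might look like the following sketch, reusing the BernoulliBandit and the strategy functions assumed above. Regret is charged per pull as the gap between the best arm's mean and the chosen arm's mean, which is exactly why random selection's curve is a straight line while UCB1's flattens.

```python
def run(bandit, choose, horizon=1000):
    """Play `horizon` rounds and record cumulative regret after each pull."""
    k = len(bandit.payout_rates)
    counts = [0] * k      # pulls per arm
    values = [0.0] * k    # running mean reward per arm
    regret, curve = 0.0, []
    for t in range(1, horizon + 1):
        arm = choose(counts, values, t)
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        # Each pull adds the gap between the best arm's mean and the chosen arm's mean.
        regret += bandit.best_rate - bandit.payout_rates[arm]
        curve.append(regret)
    return curve

# Random selection keeps climbing; UCB1's curve flattens as it locks onto the best arm.
print(run(bandit, random_choice)[-1], run(bandit, ucb1)[-1])
```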
Real-World Applications
Multi-armed bandits are the engine behind modern tech platforms. Google uses bandit algorithms to optimize ad placement across billions of impressions. Netflix and Spotify use contextual bandits for recommendations. Pharmaceutical companies use adaptive clinical trial designs based on bandit theory to allocate more patients to effective treatments. The mathematical insights from this simple slot machine model have transformed how we make sequential decisions at scale.