Multi-Armed Bandit Simulator: Explore vs Exploit Strategies Compared

Simulator · Intermediate · ~8 min
UCB1 Strategy: Achieves logarithmic regret by balancing confidence bounds

With 5 arms and the UCB1 strategy over 500 pulls, the algorithm typically identifies the best arm within 50-100 pulls, and its cumulative regret grows only logarithmically in the number of pulls, O(ln T). UCB1 systematically explores under-sampled arms by adding an optimistic confidence bonus to each arm's empirical mean, giving near-optimal exploration without any tuning parameters.

Formula

UCB1 selection: aₜ = argmaxᵢ [x̄ᵢ + √(2·ln(t)/nᵢ)]
Cumulative regret: R(T) = T·μ* - Σ rewards
Thompson sampling: sample θᵢ ~ Beta(αᵢ, βᵢ), select argmax θᵢ
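
Read literally, those three formulas are only a few lines of code. Below is a minimal NumPy sketch, assuming Bernoulli-reward arms and 0-indexed arrays; the function names are illustrative, not taken from the simulator:

import numpy as np

rng = np.random.default_rng(0)

def ucb1_select(means, counts, t):
    # UCB1: empirical mean plus an optimistic confidence bonus
    untried = np.flatnonzero(counts == 0)
    if untried.size:                       # pull every arm once before using the bound
        return int(untried[0])
    return int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))

def thompson_select(alpha, beta):
    # Thompson sampling: draw theta_i ~ Beta(alpha_i, beta_i), play the largest draw
    return int(np.argmax(rng.beta(alpha, beta)))

def cumulative_regret(mu, rewards):
    # R(T) = T * mu_star - sum of collected rewards
    return len(rewards) * float(np.max(mu)) - float(np.sum(rewards))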

The Explore-Exploit Dilemma

Imagine you are in a casino with five slot machines, each with a different (unknown) payout rate. Should you keep pulling the lever that has paid out most so far, or try others that might be even better? This is the multi-armed bandit problem, one of the most fundamental challenges in sequential decision-making. The tension between exploration (gathering information) and exploitation (using what you know) appears everywhere — from clinical trials deciding which treatment to give patients, to websites running A/B tests.

Strategies Compared

This simulator lets you compare four classic strategies. Random selection serves as a baseline with linear regret. Epsilon-greedy exploits the best-known arm most of the time but randomly explores with probability ε. UCB1 adds an optimistic confidence bonus to under-explored arms, achieving logarithmic regret without any tuning parameter. Thompson sampling uses Bayesian inference, drawing samples from posterior distributions and selecting the arm with the highest sample.
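
To make the comparison concrete, here is a rough harness for the two simplest strategies; the five payout rates, the seed, and ε = 0.1 are arbitrary values chosen for illustration, not the simulator's settings. UCB1 and Thompson sampling slot into the same loop using the selection rules given above and in the FAQ.

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.10, 0.25, 0.40, 0.55, 0.70])   # hidden payout rates for 5 arms (illustrative)
T, eps = 500, 0.1

def run(select):
    # Play T rounds with the given selection rule and return cumulative regret
    counts, means, total = np.zeros(5), np.zeros(5), 0.0
    for t in range(1, T + 1):
        a = select(means, counts, t)
        r = float(rng.random() < mu[a])          # Bernoulli reward from the chosen arm
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental update of the empirical mean
        total += r
    return T * mu.max() - total

random_pick = lambda m, c, t: int(rng.integers(5))
eps_greedy  = lambda m, c, t: int(rng.integers(5)) if rng.random() < eps else int(np.argmax(m))

print("random regret:    ", run(random_pick))
print("eps-greedy regret:", run(eps_greedy))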

Understanding Regret

The key metric in bandit problems is cumulative regret — how much reward you have lost compared to an oracle that always picks the best arm. Random selection accumulates regret linearly forever. Smart algorithms like UCB1 and Thompson sampling achieve logarithmic regret: the regret curve flattens over time as the algorithm converges on the best arm. Watch the regret curves in the simulation — the difference between strategies becomes dramatically clear over hundreds of pulls.

Real-World Applications

Multi-armed bandits are the engine behind modern tech platforms. Google uses bandit algorithms to optimize ad placement across billions of impressions. Netflix and Spotify use contextual bandits for recommendations. Pharmaceutical companies use adaptive clinical trial designs based on bandit theory to allocate more patients to effective treatments. The mathematical insights from this simple slot machine model have transformed how we make sequential decisions at scale.

FAQ

What is the multi-armed bandit problem?

The multi-armed bandit problem models the dilemma of choosing between exploring new options (to learn their value) and exploiting the best-known option (to maximize reward). Named after a gambler facing a row of slot machines, it formalizes the explore-exploit tradeoff that appears in clinical trials, A/B testing, ad placement, and recommendation systems.

How does the UCB1 algorithm work?

UCB1 (Upper Confidence Bound) selects the arm with the highest upper confidence bound: UCB = x̄ᵢ + √(2·ln(n)/nᵢ), where x̄ᵢ is the empirical mean reward of arm i, n is the total pulls, and nᵢ is the number of times arm i was pulled. The confidence term naturally decreases as an arm is sampled more, providing automatic exploration.
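
One way to see that automatic exploration at work is to print the confidence term for a fixed total n as an arm's own count nᵢ grows; a toy calculation, not tied to the simulator:

import math

n = 500                                    # total pulls so far
for n_i in (1, 5, 25, 100, 400):
    bonus = math.sqrt(2 * math.log(n) / n_i)
    print(f"n_i = {n_i:3d}  ->  confidence bonus = {bonus:.3f}")

The bonus drops from roughly 3.5 at a single pull to under 0.2 at 400 pulls, so an arm stops being favored merely for being under-sampled once it has accumulated enough data.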

What is Thompson sampling?

Thompson sampling is a Bayesian approach where each arm's reward probability is modeled with a Beta distribution. At each step, a sample is drawn from each arm's posterior distribution, and the arm with the highest sample is selected. This naturally balances exploration (uncertain arms get lucky draws) and exploitation (well-known good arms consistently draw high).
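
A sketch of that loop for Bernoulli rewards, assuming a uniform Beta(1, 1) prior on every arm; the payout rates below are made up for illustration:

import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.10, 0.25, 0.40, 0.55, 0.70])   # hidden payout rates (illustrative)
alpha = np.ones(5)                              # Beta posterior: 1 + successes
beta  = np.ones(5)                              # Beta posterior: 1 + failures

for t in range(500):
    theta = rng.beta(alpha, beta)               # one sample per arm from its posterior
    a = int(np.argmax(theta))                   # play the arm with the highest sample
    r = float(rng.random() < mu[a])             # observe a Bernoulli reward
    alpha[a] += r                               # success makes alpha grow
    beta[a]  += 1.0 - r                         # failure makes beta grow

print("posterior means:", np.round(alpha / (alpha + beta), 2))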

What is regret in bandit problems?

Regret measures the total reward lost compared to always pulling the optimal arm. Cumulative (expected) regret is R(T) = T·μ* - Σᵢ μᵢ·nᵢ, where μ* is the best arm's mean reward, μᵢ is arm i's mean reward, and nᵢ is the number of times arm i was pulled. Optimal algorithms achieve O(ln T) regret, meaning the per-round regret approaches zero as the algorithm learns.
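
As a quick worked example (the per-arm means and pull counts are invented for illustration):

mu = [0.10, 0.25, 0.40, 0.55, 0.70]       # per-arm mean rewards
n  = [20, 25, 30, 75, 350]                # how often each arm was pulled; T = 500
T, mu_star = sum(n), max(mu)

regret = T * mu_star - sum(m * k for m, k in zip(mu, n))   # R(T) = T·μ* - Σᵢ μᵢ·nᵢ
print(regret)                             # 43.5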

Embed

<iframe src="https://homo-deus.com/lab/decision-theory/multi-armed-bandit/embed" width="100%" height="400" frameborder="0"></iframe>
View source on GitHub