Multi-Armed Bandit Simulator: Explore vs Exploit Strategies Compared

Simulator · Intermediate · ~8 min
UCB1 Strategy: Achieves logarithmic regret by balancing confidence bounds

With 5 arms and the UCB1 strategy over 500 pulls, the algorithm typically identifies the best arm within 50-100 pulls, and its cumulative regret grows only logarithmically in the number of pulls, O(ln T). UCB1 systematically explores under-sampled arms by adding an optimistic confidence bonus to each arm's empirical mean, giving near-optimal exploration without any tuning parameters.

Formula

UCB1 selection: aₜ = argmaxᵢ [x̄ᵢ + √(2·ln(t)/nᵢ)]
Cumulative regret: R(T) = T·μ* - Σ rewards
Thompson sampling: sample θᵢ ~ Beta(αᵢ, βᵢ), select argmax θᵢ
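
Read literally, those three formulas are only a few lines of code. Below is a minimal NumPy sketch, assuming Bernoulli-reward arms and 0-indexed arrays; the function names are illustrative, not taken from the simulator:

import numpy as np

rng = np.random.default_rng(0)

def ucb1_select(means, counts, t):
    # UCB1: empirical mean plus an optimistic confidence bonus
    untried = np.flatnonzero(counts == 0)
    if untried.size:                       # pull every arm once before using the bound
        return int(untried[0])
    return int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))

def thompson_select(alpha, beta):
    # Thompson sampling: draw theta_i ~ Beta(alpha_i, beta_i), play the largest draw
    return int(np.argmax(rng.beta(alpha, beta)))

def cumulative_regret(mu, rewards):
    # R(T) = T * mu_star - sum of collected rewards
    return len(rewards) * float(np.max(mu)) - float(np.sum(rewards))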

The Explore-Exploit Dilemma

Imagine you are in a casino with five slot machines, each with a different (unknown) payout rate. Should you keep pulling the lever that has paid out most so far, or try others that might be even better? This is the multi-armed bandit problem, one of the most fundamental challenges in sequential decision-making. The tension between exploration (gathering information) and exploitation (using what you know) appears everywhere — from clinical trials deciding which treatment to give patients, to websites running A/B tests.

Strategies Compared

This simulator lets you compare four classic strategies. Random selection serves as a baseline with linear regret. Epsilon-greedy exploits the best-known arm most of the time but randomly explores with probability ε. UCB1 adds an optimistic confidence bonus to under-explored arms, achieving logarithmic regret without any tuning parameter. Thompson sampling uses Bayesian inference, drawing samples from posterior distributions and selecting the arm with the highest sample.
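
To make the comparison concrete, here is a rough harness for the two simplest strategies; the five payout rates, the seed, and ε = 0.1 are arbitrary values chosen for illustration, not the simulator's settings. UCB1 and Thompson sampling slot into the same loop using the selection rules given above and in the FAQ.

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.10, 0.25, 0.40, 0.55, 0.70])   # hidden payout rates for 5 arms (illustrative)
T, eps = 500, 0.1

def run(select):
    # Play T rounds with the given selection rule and return cumulative regret
    counts, means, total = np.zeros(5), np.zeros(5), 0.0
    for t in range(1, T + 1):
        a = select(means, counts, t)
        r = float(rng.random() < mu[a])          # Bernoulli reward from the chosen arm
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental update of the empirical mean
        total += r
    return T * mu.max() - total

random_pick = lambda m, c, t: int(rng.integers(5))
eps_greedy  = lambda m, c, t: int(rng.integers(5)) if rng.random() < eps else int(np.argmax(m))

print("random regret:    ", run(random_pick))
print("eps-greedy regret:", run(eps_greedy))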

Understanding Regret

The key metric in bandit problems is cumulative regret — how much reward you have lost compared to an oracle that always picks the best arm. Random selection accumulates regret linearly forever. Smart algorithms like UCB1 and Thompson sampling achieve logarithmic regret: the regret curve flattens over time as the algorithm converges on the best arm. Watch the regret curves in the simulation — the difference between strategies becomes dramatically clear over hundreds of pulls.

Real-World Applications

Multi-armed bandits are the engine behind modern tech platforms. Google uses bandit algorithms to optimize ad placement across billions of impressions. Netflix and Spotify use contextual bandits for recommendations. Pharmaceutical companies use adaptive clinical trial designs based on bandit theory to allocate more patients to effective treatments. The mathematical insights from this simple slot machine model have transformed how we make sequential decisions at scale.

FAQ

What is the multi-armed bandit problem?

The multi-armed bandit problem models the dilemma of choosing between exploring new options (to learn their value) and exploiting the best-known option (to maximize reward). Named after a gambler facing a row of slot machines, it formalizes the explore-exploit tradeoff that appears in clinical trials, A/B testing, ad placement, and recommendation systems.

How does the UCB1 algorithm work?

UCB1 (Upper Confidence Bound) selects the arm with the highest upper confidence bound: UCB = x̄ᵢ + √(2·ln(n)/nᵢ), where x̄ᵢ is the empirical mean reward of arm i, n is the total pulls, and nᵢ is the number of times arm i was pulled. The confidence term naturally decreases as an arm is sampled more, providing automatic exploration.
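
One way to see that automatic exploration at work is to print the confidence term for a fixed total n as an arm's own count nᵢ grows; a toy calculation, not tied to the simulator:

import math

n = 500                                    # total pulls so far
for n_i in (1, 5, 25, 100, 400):
    bonus = math.sqrt(2 * math.log(n) / n_i)
    print(f"n_i = {n_i:3d}  ->  confidence bonus = {bonus:.3f}")

The bonus drops from roughly 3.5 at a single pull to under 0.2 at 400 pulls, so an arm stops being favored merely for being under-sampled once it has accumulated enough data.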

What is Thompson sampling?

Thompson sampling is a Bayesian approach where each arm's reward probability is modeled with a Beta distribution. At each step, a sample is drawn from each arm's posterior distribution, and the arm with the highest sample is selected. This naturally balances exploration (uncertain arms get lucky draws) and exploitation (well-known good arms consistently draw high).
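
A sketch of that loop for Bernoulli rewards, assuming a uniform Beta(1, 1) prior on every arm; the payout rates below are made up for illustration:

import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.10, 0.25, 0.40, 0.55, 0.70])   # hidden payout rates (illustrative)
alpha = np.ones(5)                              # Beta posterior: 1 + successes
beta  = np.ones(5)                              # Beta posterior: 1 + failures

for t in range(500):
    theta = rng.beta(alpha, beta)               # one sample per arm from its posterior
    a = int(np.argmax(theta))                   # play the arm with the highest sample
    r = float(rng.random() < mu[a])             # observe a Bernoulli reward
    alpha[a] += r                               # success makes alpha grow
    beta[a]  += 1.0 - r                         # failure makes beta grow

print("posterior means:", np.round(alpha / (alpha + beta), 2))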

What is regret in bandit problems?

Regret measures the total reward lost compared to always pulling the optimal arm. Cumulative (expected) regret is R(T) = T·μ* - Σᵢ μᵢ·nᵢ, where μ* is the best arm's mean reward, μᵢ is arm i's mean reward, and nᵢ is the number of times arm i was pulled. Optimal algorithms achieve O(ln T) regret, meaning the per-round regret approaches zero as the algorithm learns.
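
As a quick worked example (the per-arm means and pull counts are invented for illustration):

mu = [0.10, 0.25, 0.40, 0.55, 0.70]       # per-arm mean rewards
n  = [20, 25, 30, 75, 350]                # how often each arm was pulled; T = 500
T, mu_star = sum(n), max(mu)

regret = T * mu_star - sum(m * k for m, k in zip(mu, n))   # R(T) = T·μ* - Σᵢ μᵢ·nᵢ
print(regret)                             # 43.5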

Embed

<iframe src="https://homo-deus.com/lab/decision-theory/multi-armed-bandit/embed" width="100%" height="400" frameborder="0"></iframe>
View source on GitHub