Upper confidence bounds
Today we’re going to prove the upper bound that results from the upper confidence bound (UCB) strategy. The UCB strategy allowed us to bound the pseudo-regret by a term which is logarithmic in the number of rounds played. Generally speaking, “good” regret bounds for bandit problems are sublinear in the number of rounds played.
Stochastic bandit problems
As a quick review, here’s the set-up for the stochastic multi-armed bandit problem.
- There are $K$ arms.
- Each arm $i \in \{1, \dots, K\}$ corresponds to an unknown probability distribution $\nu_i$.
- At each time $t = 1, \dots, n$, the player selects an arm $I_t$ and receives a reward $X_{I_t, t}$ drawn from $\nu_{I_t}$, independently of the past.
- We assess a player’s performance by her regret, which compares her performance to (something like) the best-possible performance.
- After the first $n$ rounds of the game, the player’s regret is
  $$ R_n = \max_{i = 1, \dots, K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t, t}, $$
  where $K$ is the number of bandit arms, $X_{i,t}$ is the reward of arm $i$ at time $t$, and $I_t$ is the player’s (possibly randomly) chosen arm.
We’ll denote the mean of distribution $\nu_i$ by $\mu_i$ and define
$$ \mu^* = \max_{i = 1, \dots, K} \mu_i \quad \text{and} \quad i^* \in \operatorname*{arg\,max}_{i = 1, \dots, K} \mu_i $$
as the optimal parameters. Instead of using the regret to assess our player’s performance, we’ll use the pseudo-regret, which is defined as
$$ \bar{R}_n = n \mu^* - \mathbb{E} \sum_{t=1}^{n} \mu_{I_t}. $$
We know that $\bar{R}_n \le \mathbb{E}[R_n]$ because of Jensen’s inequality for convex functions (the expectation of a maximum is at least the maximum of the expectations). For our purposes, a more informative form of the pseudo-regret is
$$ \bar{R}_n = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(n)], $$
where $\Delta_i = \mu^* - \mu_i$ is the suboptimality of arm $i$ and the random variable $T_i(n)$ is the number of times arm $i$ is selected in the first $n$ rounds.
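To see where this decomposition comes from, just condition on which arm is played at each round:
$$ \bar{R}_n = \mathbb{E} \sum_{t=1}^{n} \left( \mu^* - \mu_{I_t} \right) = \mathbb{E} \sum_{t=1}^{n} \sum_{i=1}^{K} \mathbb{1}\{I_t = i\} \left( \mu^* - \mu_i \right) = \sum_{i=1}^{K} \Delta_i \, \mathbb{E} \sum_{t=1}^{n} \mathbb{1}\{I_t = i\} = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(n)]. $$
So bounding the pseudo-regret reduces to bounding $\mathbb{E}[T_i(n)]$ for each suboptimal arm, which is exactly what the proof below does.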
Upper confidence bounds
One approach for simultaneously performing exploration and exploitation is to use upper confidence bound (UCB) strategies. A UCB strategy uses the current estimate of each arm’s mean reward to produce an upper bound (at some confidence level) on that arm’s true mean. The player then selects the arm with the largest UCB-adjusted estimate.
In the spirit of brevity, we’re going to gloss over a few important details (e.g., Legendre–Fenchel duals and Hoeffding’s lemma) but here’s the gist of the UCB set-up.
- Define $\hat{\mu}_{i,s}$ as the sample mean of arm $i$ after being played $s$ times.
- We assume there is a convex function $\psi$ with convex conjugate $\psi^*$ such that, for every arm $i$ and every parameter $\lambda \ge 0$, the cumulant generating function of the centered reward is bounded by $\psi(\lambda)$:
  $$ \ln \mathbb{E}\, e^{\lambda (X_{i,t} - \mu_i)} \le \psi(\lambda) \quad \text{and} \quad \ln \mathbb{E}\, e^{\lambda (\mu_i - X_{i,t})} \le \psi(\lambda). $$
- By Markov’s inequality and the Cramér–Chernoff technique (a short derivation appears just after this list),
  $$ \mathbb{P}\left( \hat{\mu}_{i,s} - \mu_i > \varepsilon \right) \le e^{-s \psi^*(\varepsilon)} \quad \text{and} \quad \mathbb{P}\left( \mu_i - \hat{\mu}_{i,s} > \varepsilon \right) \le e^{-s \psi^*(\varepsilon)}. $$
- This concentration bound means that, with probability at least $1 - \delta$, we have an upper bound on the mean of the $i$th arm:
  $$ \hat{\mu}_{i,s} + (\psi^*)^{-1}\!\left( \frac{\ln(1/\delta)}{s} \right) > \mu_i. $$
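Here’s the Cramér–Chernoff step in slightly more detail, where $X_1, \dots, X_s$ denote the rewards observed from arm $i$: for any $\lambda \ge 0$, Markov’s inequality and the bound on the cumulant generating function (applied $s$ times, using independence of the rewards) give
$$ \mathbb{P}\left( \hat{\mu}_{i,s} - \mu_i > \varepsilon \right) = \mathbb{P}\left( \sum_{j=1}^{s} (X_j - \mu_i) > s \varepsilon \right) \le e^{-\lambda s \varepsilon}\, \mathbb{E}\, e^{\lambda \sum_{j=1}^{s} (X_j - \mu_i)} \le e^{-s \left( \lambda \varepsilon - \psi(\lambda) \right)}. $$
Optimizing over $\lambda$ yields $e^{-s \sup_{\lambda \ge 0} (\lambda \varepsilon - \psi(\lambda))} = e^{-s \psi^*(\varepsilon)}$; setting $\delta = e^{-s \psi^*(\varepsilon)}$ and solving for $\varepsilon$ gives the confidence bound in the last bullet.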
The UCB strategy is (very) simple; at each time $t$, select the arm
$$ I_t \in \operatorname*{arg\,max}_{i = 1, \dots, K} \left[ \hat{\mu}_{i, T_i(t-1)} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_i(t-1)} \right) \right], $$
where $\alpha > 0$ is a tuning parameter and $T_i(t-1)$ is the number of times arm $i$ has been played before time $t$.
The tradeoff. Let’s pick apart this term and see how it encourages exploration and exploitation. We’re going to use $\psi(\lambda) = \lambda^2 / 8$ (the choice justified by Hoeffding’s lemma for rewards in $[0, 1]$) with convex conjugate $\psi^*(\varepsilon) = 2 \varepsilon^2$, so that
$$ (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_i(t-1)} \right) = \sqrt{\frac{\alpha \ln t}{2\, T_i(t-1)}}. $$
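As a quick check of that convex conjugate (just the Legendre–Fenchel dual we glossed over above), for $\varepsilon \ge 0$,
$$ \psi^*(\varepsilon) = \sup_{\lambda \ge 0} \left( \lambda \varepsilon - \frac{\lambda^2}{8} \right) = 4 \varepsilon \cdot \varepsilon - \frac{(4 \varepsilon)^2}{8} = 2 \varepsilon^2, $$
since the supremum is attained at $\lambda = 4 \varepsilon$; inverting $x = 2 \varepsilon^2$ gives $(\psi^*)^{-1}(x) = \sqrt{x / 2}$.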
When arm $i$ has not been played many times before time $t$, i.e., when $T_i(t-1)$ is small, the padding term $\sqrt{\alpha \ln t / (2\, T_i(t-1))}$ dominates the index and the UCB-adjusted estimate of arm $i$ is likely to be relatively high, which encourages arm $i$ to be chosen. In other words, it encourages exploration.
As arm $i$ is chosen more, $T_i(t-1)$ increases, the estimate $\hat{\mu}_{i, T_i(t-1)}$ improves, and the padding term shrinks. If arm $i$ is the optimal arm, then the term $\hat{\mu}_{i, T_i(t-1)}$ eventually dominates the UCB-adjusted estimates for the other arms, which further encourages selection of this arm. This means that the algorithm exploits high-yielding arms.
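To make the strategy concrete, here’s a minimal simulation sketch. It isn’t taken from any of the references; the Bernoulli arms, their means, $\alpha = 2.5$, and the “play each arm once first” initialization (which avoids dividing by $T_i(t-1) = 0$) are illustrative choices of mine.

```python
import math
import random


def ucb(means, n_rounds, alpha=2.5, seed=0):
    """(alpha, psi)-UCB with the Hoeffding choice psi(l) = l^2 / 8, so the
    padding term is sqrt(alpha * ln t / (2 * T_i(t-1))). Arms are Bernoulli
    with the given means (unknown to the player)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K    # T_i(t-1): number of times arm i has been played
    sums = [0.0] * K    # running reward totals, so sums[i] / counts[i] = mu_hat_i

    for t in range(1, n_rounds + 1):
        if t <= K:
            arm = t - 1  # play every arm once before trusting the index
        else:
            # Select the arm maximizing mu_hat_i + sqrt(alpha * ln t / (2 * T_i)).
            arm = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(alpha * math.log(t) / (2 * counts[i])),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward

    # Realized version of the decomposition sum_i Delta_i * T_i(n).
    mu_star = max(means)
    return sum((mu_star - means[i]) * counts[i] for i in range(K)), counts


# Example: three Bernoulli arms; the regret should grow roughly like log(n).
print(ucb([0.3, 0.5, 0.7], n_rounds=10_000))
```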
Upper bound on pseudo-regret. If our player follows the UCB scheme specified above, then for $\alpha > 2$ her pseudo-regret is upper bounded as
$$ \bar{R}_n \le \sum_{i :\, \Delta_i > 0} \left( \frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)} \ln n + \frac{\alpha}{\alpha - 2} \Delta_i \right). $$
Proof
Okay, here goes. We incur (pseudo-)regret when we don’t pick an optimal arm, i.e., when we pick an arm $i$ with $\Delta_i > 0$. So, we need to analyze the conditions that lead us to choose a suboptimal arm.
Suppose that, just before time $t$, we’ve played suboptimal arm $i$ a total of $s$ times and the optimal arm $i^*$ a total of $s^*$ times. By the set-up of our UCB strategy, if $I_t = i$, i.e., if
$$ \hat{\mu}_{i, s} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right) \ge \hat{\mu}_{i^*, s^*} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s^*} \right), $$
then at least one of the following holds:
$$ \text{#1:} \quad \hat{\mu}_{i^*, s^*} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s^*} \right) \le \mu^*, $$
$$ \text{#2:} \quad \hat{\mu}_{i, s} > \mu_i + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right), $$
$$ \text{#3:} \quad s < \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)}. $$
- The first indicates that our current UCB-adjusted estimate of the mean for the optimal arm is less than its true value. For a fixed count $s^*$, the concentration bound above says it occurs with probability at most $e^{-s^* \psi^*\left( (\psi^*)^{-1}\left( \alpha \ln t / s^* \right) \right)} = t^{-\alpha}$; taking a union bound over the possible counts $s^* \in \{1, \dots, t\}$, it occurs with probability at most $t^{1 - \alpha}$.
- The second indicates that our estimate for arm $i$ exceeds its true value by an amount greater than the UCB padding; hence the UCB adjustment is not yet helpful. By the same argument, it occurs with probability at most $t^{-\alpha}$ for a fixed $s$, and at most $t^{1 - \alpha}$ after a union bound over $s \in \{1, \dots, t\}$.
- The third says that arm $i$ has not yet been played often enough, and is illuminated by the rewriting
  $$ s < \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} \iff \frac{\Delta_i}{2} < (\psi^*)^{-1}\!\left( \frac{\alpha \ln n}{s} \right), $$
  i.e., the UCB padding for arm $i$ still exceeds half of its suboptimality gap. The probability of the third event is 0 when $s \ge \alpha \ln n / \psi^*(\Delta_i / 2)$.
If #1, #2, and #3 are all false, then
$$ \hat{\mu}_{i^*, s^*} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s^*} \right) > \mu^* = \mu_i + \Delta_i \ge \mu_i + 2\, (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right) \ge \hat{\mu}_{i, s} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right), $$
so we pick the optimal arm over arm $i$, meaning we incur no regret.
Now, let’s consider the implications of these events on $\mathbb{E}[T_i(n)]$, because this will allow us to bound the pseudo-regret. Writing $u = \lceil \alpha \ln n / \psi^*(\Delta_i / 2) \rceil$, we have that
$$ \begin{aligned} \mathbb{E}[T_i(n)] &= \mathbb{E} \sum_{t=1}^{n} \mathbb{1}\{I_t = i\} \le u + \mathbb{E} \sum_{t = u + 1}^{n} \mathbb{1}\{ I_t = i \text{ and #3 is false} \} \\ &\le u + \sum_{t = u + 1}^{n} \Big( \mathbb{P}(\text{#1 is true}) + \mathbb{P}(\text{#2 is true}) \Big) \le u + \sum_{t = u + 1}^{n} \frac{2}{t^{\alpha - 1}} \\ &\le \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} + 1 + \int_{1}^{\infty} \frac{2}{t^{\alpha - 1}} \, dt = \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} + \frac{\alpha}{\alpha - 2}, \end{aligned} $$
where the last line uses $\alpha > 2$. (It took me a few passes to follow that argument, so I would re-read it a few times. I also put some notes in the intermediate steps subsection below.)
Now, we just sum over the suboptimal arms to get the pseudo-regret bound
$$ \bar{R}_n = \sum_{i :\, \Delta_i > 0} \Delta_i\, \mathbb{E}[T_i(n)] \le \sum_{i :\, \Delta_i > 0} \left( \frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)} \ln n + \frac{\alpha}{\alpha - 2} \Delta_i \right). $$
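Plugging in the Hoeffding choice from above, $\psi^*(\Delta_i / 2) = 2 (\Delta_i / 2)^2 = \Delta_i^2 / 2$, so the bound specializes to
$$ \bar{R}_n \le \sum_{i :\, \Delta_i > 0} \left( \frac{2 \alpha}{\Delta_i} \ln n + \frac{\alpha}{\alpha - 2} \Delta_i \right), $$
which is logarithmic in the number of rounds, as promised at the top of the post.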
This is the proof given in Bubeck and Cesa-Bianchi’s bandits survey, which is an adaptation of the proof of Auer, Cesa-Bianchi, and Fischer’s original analysis.
Next up
… a proof of that cool inequality that helped us prove the concentration inequality implied by the logarithmic Sobolev inequality a few posts back!
References
This post is based on material from:
- Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems by S. Bubeck and N. Cesa-Bianchi. A digital copy can be found here.
- Finite-time Analysis of the Multiarmed Bandit Problem by P. Auer, N. Cesa-Bianchi, and P. Fischer. A digital copy can be found here.
- Prediction, Learning, and Games (a.k.a. “Perdition, Burning, and Flames”) by N. Cesa-Bianchi and G. Lugosi. Find a description of the book and its table of contents here.
The intermediate steps
We can interpret the relationship
$$ \mathbb{E} \sum_{t=1}^{n} \mathbb{1}\{I_t = i\} \le u + \mathbb{E} \sum_{t = u + 1}^{n} \mathbb{1}\{ I_t = i \text{ and #3 is false} \}, \qquad u = \left\lceil \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} \right\rceil, $$
by observing that the indicator $\mathbb{1}\{I_t = i\}$ can be written as
$$ \mathbb{1}\{I_t = i \text{ and #3 is true}\} + \mathbb{1}\{I_t = i \text{ and #3 is false}\}. $$
Inequality #3 is true when $s = T_i(t-1) < \alpha \ln n / \psi^*(\Delta_i / 2)$, and we can play arm $i$ at most $u$ times while its play count is still below that threshold. So, the bound follows by assuming that the first indicator is active on (at most) $u$ of the rounds.
Because, on the rounds where $I_t = i$ and #3 is false, either #1 or #2 must be true, we have the event inclusion
$$ \{ I_t = i \text{ and #3 is false} \} \subseteq \{ \text{#1 is true} \} \cup \{ \text{#2 is true} \}, $$
meaning the probability of the second (larger) event is at least that of the first; a union bound then gives the sum $\mathbb{P}(\text{#1 is true}) + \mathbb{P}(\text{#2 is true})$ used above.