Upper confidence bounds
Today we’re going to prove the upper bound that results from the upper confidence bound (UCB) strategy. The UCB strategy allowed us to bound the pseudo-regret by a term which is logarithmic in the number of rounds played. Generally speaking, “good” regret bounds for bandit problems are sublinear in the number of rounds played.
Stochastic bandit problems
As a quick review, here’s the set-up for the stochastic multi-armed bandit problem.
- There are $K$ arms.
- Each arm $i \in \{1, \dots, K\}$ corresponds to an unknown probability distribution $\nu_i$.
- At each time $t = 1, \dots, n$, the player selects an arm $I_t$ and receives a reward $X_{I_t, t}$ drawn from $\nu_{I_t}$, independently of the past.
- We assess a player’s performance by her regret, which compares her performance to (something like) the best-possible performance.
- After the first $n$ rounds of the game, the player’s regret is
  $$ R_n = \max_{i = 1, \dots, K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t, t}, $$
  where $K$ is the number of bandit arms, $X_{i,t}$ is the reward of arm $i$ at time $t$, and $I_t$ is the player’s (possibly randomly) chosen arm.
We’ll denote the mean of distribution $\nu_i$ by $\mu_i$ and define
$$ \mu^* = \max_{i = 1, \dots, K} \mu_i \quad \text{and} \quad i^* \in \operatorname*{arg\,max}_{i = 1, \dots, K} \mu_i $$
as the optimal parameters. Instead of using the regret to assess our player’s performance, we’ll use the pseudo-regret, which is defined as
$$ \bar{R}_n = n \mu^* - \mathbb{E} \sum_{t=1}^{n} \mu_{I_t}. $$
We know that $\bar{R}_n \le \mathbb{E}[R_n]$ because of Jensen’s inequality for convex functions (the expectation of a maximum is at least the maximum of the expectations). For our purposes, a more informative form of the pseudo-regret is
$$ \bar{R}_n = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(n)], $$
where $\Delta_i = \mu^* - \mu_i$ is the suboptimality of arm $i$ and the random variable $T_i(n)$ is the number of times arm $i$ is selected in the first $n$ rounds.
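To see where this decomposition comes from, just condition on which arm is played at each round:
$$ \bar{R}_n = \mathbb{E} \sum_{t=1}^{n} \left( \mu^* - \mu_{I_t} \right) = \mathbb{E} \sum_{t=1}^{n} \sum_{i=1}^{K} \mathbb{1}\{I_t = i\} \left( \mu^* - \mu_i \right) = \sum_{i=1}^{K} \Delta_i \, \mathbb{E} \sum_{t=1}^{n} \mathbb{1}\{I_t = i\} = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(n)]. $$
So bounding the pseudo-regret reduces to bounding $\mathbb{E}[T_i(n)]$ for each suboptimal arm, which is exactly what the proof below does.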
Upper confidence bounds
One approach for simultaneously performing exploration and exploitation is to use upper confidence bound (UCB) strategies. A UCB strategy uses the current estimate of each arm’s mean reward to produce an upper bound (at some confidence level) on that arm’s true mean. The player then selects the arm with the largest UCB-adjusted estimate.
In the spirit of brevity, we’re going to gloss over a few important details (e.g., Legendre–Fenchel duals and Hoeffding’s lemma) but here’s the gist of the UCB set-up.
- Define $\hat{\mu}_{i,s}$ as the sample mean of arm $i$ after being played $s$ times.
- We assume there is a convex function $\psi$ with convex conjugate $\psi^*$ such that, for every arm $i$ and every parameter $\lambda \ge 0$, the cumulant generating function of the centered reward is bounded by $\psi(\lambda)$:
  $$ \ln \mathbb{E}\, e^{\lambda (X_{i,t} - \mu_i)} \le \psi(\lambda) \quad \text{and} \quad \ln \mathbb{E}\, e^{\lambda (\mu_i - X_{i,t})} \le \psi(\lambda). $$
- By Markov’s inequality and the Cramér–Chernoff technique (a short derivation appears just after this list),
  $$ \mathbb{P}\left( \hat{\mu}_{i,s} - \mu_i > \varepsilon \right) \le e^{-s \psi^*(\varepsilon)} \quad \text{and} \quad \mathbb{P}\left( \mu_i - \hat{\mu}_{i,s} > \varepsilon \right) \le e^{-s \psi^*(\varepsilon)}. $$
- This concentration bound means that, with probability at least $1 - \delta$, we have an upper bound on the mean of the $i$th arm:
  $$ \hat{\mu}_{i,s} + (\psi^*)^{-1}\!\left( \frac{\ln(1/\delta)}{s} \right) > \mu_i. $$
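Here’s the Cramér–Chernoff step in slightly more detail, where $X_1, \dots, X_s$ denote the rewards observed from arm $i$: for any $\lambda \ge 0$, Markov’s inequality and the bound on the cumulant generating function (applied $s$ times, using independence of the rewards) give
$$ \mathbb{P}\left( \hat{\mu}_{i,s} - \mu_i > \varepsilon \right) = \mathbb{P}\left( \sum_{j=1}^{s} (X_j - \mu_i) > s \varepsilon \right) \le e^{-\lambda s \varepsilon}\, \mathbb{E}\, e^{\lambda \sum_{j=1}^{s} (X_j - \mu_i)} \le e^{-s \left( \lambda \varepsilon - \psi(\lambda) \right)}. $$
Optimizing over $\lambda$ yields $e^{-s \sup_{\lambda \ge 0} (\lambda \varepsilon - \psi(\lambda))} = e^{-s \psi^*(\varepsilon)}$; setting $\delta = e^{-s \psi^*(\varepsilon)}$ and solving for $\varepsilon$ gives the confidence bound in the last bullet.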
The UCB strategy is (very) simple; at each time $t$, select the arm
$$ I_t \in \operatorname*{arg\,max}_{i = 1, \dots, K} \left[ \hat{\mu}_{i, T_i(t-1)} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_i(t-1)} \right) \right], $$
where $\alpha > 0$ is a tuning parameter and $T_i(t-1)$ is the number of times arm $i$ has been played before time $t$.
The tradeoff. Let’s pick apart this term and see how it encourages exploration and exploitation. We’re going to use $\psi(\lambda) = \lambda^2 / 8$ (the choice justified by Hoeffding’s lemma for rewards in $[0, 1]$) with convex conjugate $\psi^*(\varepsilon) = 2 \varepsilon^2$, so that
$$ (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_i(t-1)} \right) = \sqrt{\frac{\alpha \ln t}{2\, T_i(t-1)}}. $$
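As a quick check of that convex conjugate (just the Legendre–Fenchel dual we glossed over above), for $\varepsilon \ge 0$,
$$ \psi^*(\varepsilon) = \sup_{\lambda \ge 0} \left( \lambda \varepsilon - \frac{\lambda^2}{8} \right) = 4 \varepsilon \cdot \varepsilon - \frac{(4 \varepsilon)^2}{8} = 2 \varepsilon^2, $$
since the supremum is attained at $\lambda = 4 \varepsilon$; inverting $x = 2 \varepsilon^2$ gives $(\psi^*)^{-1}(x) = \sqrt{x / 2}$.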
When arm $i$ has not been played many times before time $t$, i.e., when $T_i(t-1)$ is small, the padding term $\sqrt{\alpha \ln t / (2\, T_i(t-1))}$ dominates the index and the UCB-adjusted estimate of arm $i$ is likely to be relatively high, which encourages arm $i$ to be chosen. In other words, it encourages exploration.
As arm $i$ is chosen more, $T_i(t-1)$ increases, the estimate $\hat{\mu}_{i, T_i(t-1)}$ improves, and the padding term shrinks. If arm $i$ is the optimal arm, then the term $\hat{\mu}_{i, T_i(t-1)}$ eventually dominates the UCB-adjusted estimates for the other arms, which further encourages selection of this arm. This means that the algorithm exploits high-yielding arms.
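To make the strategy concrete, here’s a minimal simulation sketch. It isn’t taken from any of the references; the Bernoulli arms, their means, $\alpha = 2.5$, and the “play each arm once first” initialization (which avoids dividing by $T_i(t-1) = 0$) are illustrative choices of mine.

```python
import math
import random


def ucb(means, n_rounds, alpha=2.5, seed=0):
    """(alpha, psi)-UCB with the Hoeffding choice psi(l) = l^2 / 8, so the
    padding term is sqrt(alpha * ln t / (2 * T_i(t-1))). Arms are Bernoulli
    with the given means (unknown to the player)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K    # T_i(t-1): number of times arm i has been played
    sums = [0.0] * K    # running reward totals, so sums[i] / counts[i] = mu_hat_i

    for t in range(1, n_rounds + 1):
        if t <= K:
            arm = t - 1  # play every arm once before trusting the index
        else:
            # Select the arm maximizing mu_hat_i + sqrt(alpha * ln t / (2 * T_i)).
            arm = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(alpha * math.log(t) / (2 * counts[i])),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward

    # Realized version of the decomposition sum_i Delta_i * T_i(n).
    mu_star = max(means)
    return sum((mu_star - means[i]) * counts[i] for i in range(K)), counts


# Example: three Bernoulli arms; the regret should grow roughly like log(n).
print(ucb([0.3, 0.5, 0.7], n_rounds=10_000))
```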
Upper bound on pseudo-regret. If our player follows the UCB scheme specified above, then for $\alpha > 2$ her pseudo-regret is upper bounded as
$$ \bar{R}_n \le \sum_{i :\, \Delta_i > 0} \left( \frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)} \ln n + \frac{\alpha}{\alpha - 2} \Delta_i \right). $$
Proof
Okay, here goes. We incur (pseudo-)regret when we don’t pick an optimal arm, i.e., when we pick an arm $i$ with $\Delta_i > 0$. So, we need to analyze the conditions that lead us to choose a suboptimal arm.
Suppose that, just before time $t$, we’ve played suboptimal arm $i$ a total of $s$ times and the optimal arm $i^*$ a total of $s^*$ times. By the set-up of our UCB strategy, if $I_t = i$, i.e., if
$$ \hat{\mu}_{i, s} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right) \ge \hat{\mu}_{i^*, s^*} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s^*} \right), $$
then at least one of the following holds:
$$ \text{#1:} \quad \hat{\mu}_{i^*, s^*} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s^*} \right) \le \mu^*, $$
$$ \text{#2:} \quad \hat{\mu}_{i, s} > \mu_i + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right), $$
$$ \text{#3:} \quad s < \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)}. $$
- The first indicates that our current UCB-adjusted estimate of the mean for the optimal arm is less than its true value. For a fixed count $s^*$, the concentration bound above says it occurs with probability at most $e^{-s^* \psi^*\left( (\psi^*)^{-1}\left( \alpha \ln t / s^* \right) \right)} = t^{-\alpha}$; taking a union bound over the possible counts $s^* \in \{1, \dots, t\}$, it occurs with probability at most $t^{1 - \alpha}$.
- The second indicates that our estimate for arm $i$ exceeds its true value by an amount greater than the UCB padding; hence the UCB adjustment is not yet helpful. By the same argument, it occurs with probability at most $t^{-\alpha}$ for a fixed $s$, and at most $t^{1 - \alpha}$ after a union bound over $s \in \{1, \dots, t\}$.
- The third says that arm $i$ has not yet been played often enough, and is illuminated by the rewriting
  $$ s < \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} \iff \frac{\Delta_i}{2} < (\psi^*)^{-1}\!\left( \frac{\alpha \ln n}{s} \right), $$
  i.e., the UCB padding for arm $i$ still exceeds half of its suboptimality gap. The probability of the third event is 0 when $s \ge \alpha \ln n / \psi^*(\Delta_i / 2)$.
If #1, #2, and #3 are all false, then
$$ \hat{\mu}_{i^*, s^*} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s^*} \right) > \mu^* = \mu_i + \Delta_i \ge \mu_i + 2\, (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right) \ge \hat{\mu}_{i, s} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right), $$
so we pick the optimal arm over arm $i$, meaning we incur no regret.
Now, let’s consider the implications of these events on $\mathbb{E}[T_i(n)]$, because this will allow us to bound the pseudo-regret. Writing $u = \lceil \alpha \ln n / \psi^*(\Delta_i / 2) \rceil$, we have that
$$ \begin{aligned} \mathbb{E}[T_i(n)] &= \mathbb{E} \sum_{t=1}^{n} \mathbb{1}\{I_t = i\} \le u + \mathbb{E} \sum_{t = u + 1}^{n} \mathbb{1}\{ I_t = i \text{ and #3 is false} \} \\ &\le u + \sum_{t = u + 1}^{n} \Big( \mathbb{P}(\text{#1 is true}) + \mathbb{P}(\text{#2 is true}) \Big) \le u + \sum_{t = u + 1}^{n} \frac{2}{t^{\alpha - 1}} \\ &\le \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} + 1 + \int_{1}^{\infty} \frac{2}{t^{\alpha - 1}} \, dt = \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} + \frac{\alpha}{\alpha - 2}, \end{aligned} $$
where the last line uses $\alpha > 2$. (It took me a few passes to follow that argument, so I would re-read it a few times. I also put some notes in the intermediate steps subsection below.)
Now, we just sum over the suboptimal arms to get the pseudo-regret bound
$$ \bar{R}_n = \sum_{i :\, \Delta_i > 0} \Delta_i\, \mathbb{E}[T_i(n)] \le \sum_{i :\, \Delta_i > 0} \left( \frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)} \ln n + \frac{\alpha}{\alpha - 2} \Delta_i \right). $$
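Plugging in the Hoeffding choice from above, $\psi^*(\Delta_i / 2) = 2 (\Delta_i / 2)^2 = \Delta_i^2 / 2$, so the bound specializes to
$$ \bar{R}_n \le \sum_{i :\, \Delta_i > 0} \left( \frac{2 \alpha}{\Delta_i} \ln n + \frac{\alpha}{\alpha - 2} \Delta_i \right), $$
which is logarithmic in the number of rounds, as promised at the top of the post.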
This is the proof given in Bubeck and Cesa-Bianchi’s bandits survey, which is an adaptation of the proof of Auer, Cesa-Bianchi, and Fischer’s original analysis.
Next up
… a proof of that cool inequality that helped us prove the concentration inequality implied by the logarithmic Sobolev inequality a few posts back!
References
This post is based on material from:
- Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems by S. Bubeck and N. Cesa-Bianchi. A digital copy can be found here.
- Finite-time Analysis of the Multiarmed Bandit Problem by P. Auer, N. Cesa-Bianchi, and P. Fischer. A digital copy can be found here.
- Prediction, Learning, and Games (a.k.a. “Perdition, Burning, and Flames”) by N. Cesa-Bianchi and G. Lugosi. Find a description of the book and its table of contents here.
The intermediate steps
We can interpret the relationship
$$ \mathbb{E} \sum_{t=1}^{n} \mathbb{1}\{I_t = i\} \le u + \mathbb{E} \sum_{t = u + 1}^{n} \mathbb{1}\{ I_t = i \text{ and #3 is false} \}, \qquad u = \left\lceil \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} \right\rceil, $$
by observing that the indicator $\mathbb{1}\{I_t = i\}$ can be written as
$$ \mathbb{1}\{I_t = i \text{ and #3 is true}\} + \mathbb{1}\{I_t = i \text{ and #3 is false}\}. $$
Inequality #3 is true when $s = T_i(t-1) < \alpha \ln n / \psi^*(\Delta_i / 2)$, and we can play arm $i$ at most $u$ times while its play count is still below that threshold. So, the bound follows by assuming that the first indicator is active on (at most) $u$ of the rounds.
Because, on the rounds where $I_t = i$ and #3 is false, either #1 or #2 must be true, we have the event inclusion
$$ \{ I_t = i \text{ and #3 is false} \} \subseteq \{ \text{#1 is true} \} \cup \{ \text{#2 is true} \}, $$
meaning the probability of the second (larger) event is at least that of the first; a union bound then gives the sum $\mathbb{P}(\text{#1 is true}) + \mathbb{P}(\text{#2 is true})$ used above.