In this blog, I will give a summary of the paper “Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization” along with some of my own understanding. As the title suggests, the paper proposes an efficient algorithm for bandit linear optimization that achieves the optimal \(O^*(\sqrt{T})\) regret. The method uses the Hessian matrix of a regularization function \(R(x)\) to build estimates of the cost functions, which reveals a connection to interior point methods.

1. Bandit Linear Optimization

As introduced in the previous blog, the online linear optimization problem is defined as the following repeated game between the learner (Player) and the environment (Adversary).

At each time step \(t = 1\) to \(T\):

  • Player chooses \(x_t \in \mathcal{K}\)
  • Adversary independently chooses \(f_t\in \mathbb{R}^n\)
  • Player suffers loss \(f_t^T x_t\) and observes some feedback \(s\).

The Player’s goal is to minimize his regret \(R_T\) defined as

\[R_T = \sum_{t=1}^T f_t^Tx_t - \min_{x^*\in\mathcal{K}}\sum_{t=1}^T f_t^T x^*\]

The above problem has two versions: full-information and bandit. In the full-information setting, the Player observes the entire function \(f_t\) as feedback, as discussed in the earlier online gradient descent (OGD) blog. This paper considers the more challenging bandit setting, where only the scalar feedback \(s = f_t^T x_t\) is revealed to the Player, but not the function \(f_t\) itself. At first glance, the bandit setting seems much harder than the full-information setting because of the lack of information. But surprisingly, the optimal regret bound is still \(O^*(\sqrt{T})\), just as in the full-information setting, and the method proposed in this paper actually achieves this bound. The price we pay for the missing information depends only on the dimension of the decision variable \(n\), not on the number of rounds \(T\).
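To make the two feedback models concrete, below is a minimal Python sketch of the interaction protocol. The `player` and `adversary` objects and their methods are hypothetical placeholders of my own, not part of the paper.

```python
import numpy as np

def play_bandit_game(player, adversary, T):
    """Run T rounds of bandit linear optimization and return the losses.

    `player` exposes choose() -> x_t and update(s); `adversary` exposes
    choose() -> f_t.  In the bandit setting the player only ever sees the
    scalar s = f_t^T x_t, never the vector f_t itself.
    """
    losses = []
    for t in range(T):
        x_t = player.choose()          # x_t in K
        f_t = adversary.choose()       # f_t in R^n, hidden from the player
        s = float(np.dot(f_t, x_t))    # scalar bandit feedback
        player.update(s)               # full-information would pass f_t here instead
        losses.append(s)
    return losses
```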

2. Explore and Exploit

To cope with the lack of information, one reasonable direction is to construct an unbiased estimate \(\tilde{f}_t\) of the cost function \(f_t\) and feed these estimates \(\tilde{f}_t\) to a full-information algorithm \(\mathcal{A}\) such as OGD or follow the regularized leader (FTRL). Formally, we introduce the concepts of explore and exploit:

  • Explore: Construct some random estimate \(\tilde{f}_t\) in such a way that \(E[\tilde{f}_t]=f_t\).
  • Exploit: Query some full-information algorithm \(\mathcal{A}\) to get the decision variable \(x_t\).

In the literature, there are roughly two categories of approaches:

  • Alternating Explore/Exploit: Flip an \(\epsilon\)-biased coin to determine whether to explore or exploit. On explore rounds, sample uniformly over some wide region around \(\mathcal{K}\), estimate \(f_t\) accordingly, and feed this estimate into \(\mathcal{A}\). On exploit rounds, query \(\mathcal{A}\) for \(x_t\) and predict it. These methods fail to obtain the desired \(O(\sqrt{T})\) bound since they do not make full use of the information available at each round; in fact, \(\Omega(T^{2/3})\) regret is unavoidable for such algorithms.
  • Simultaneous Explore/Exploit: Query \(\mathcal{A}\) for \(x_t\) and construct a random vector \(X_t\) such that \(E[X_t] = x_t\). Construct \(\tilde{f}_t\) randomly based on the outcome of \(X_t\) and the observed value \(f_t^T X_t\). The methods in this category are more sophisticated and motivate the approach in this paper; a simple illustration is given below.
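As a one-dimensional illustration of the second category (my own example, not from the paper), take \(n = 1\), \(\mathcal{K} = [-1, 1]\), a sampling radius \(\delta > 0\), draw \(\epsilon_t = \pm 1\) with probability \(1/2\) each, play \(X_t = x_t + \epsilon_t\delta\), and set \(\tilde{f}_t = \epsilon_t (f_t X_t)/\delta\). Then

\[E[X_t] = x_t, \qquad E[\tilde{f}_t] = \frac{1}{2}\cdot\frac{f_t(x_t+\delta)}{\delta} - \frac{1}{2}\cdot\frac{f_t(x_t-\delta)}{\delta} = f_t,\]

so a single observed value \(f_t X_t\) is enough to explore and to produce an unbiased estimate in the same round.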

3. The Curse of High Variance and the Blessing of Regularization

The authors then review two previous works that fall in the second category. They point out that one of the main difficulties in the bandit setting comes from the high variance of the estimate \(\tilde{f}_t\). That is, the regret bound of OGD run on \(\tilde{f}_t\) involves terms \(E[\|\tilde{f}_t\|^2] = O(1/r^2)\), where \(r\) is the distance to the boundary. This means that the regret scales as the inverse of the squared distance to the boundary. If we require \(\tilde{f}_t\) to be unbiased and \(x_t\) to be the center of the sampling distribution, this high-variance behavior can be shown to be intrinsic to the problem and thus unavoidable.
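Continuing the one-dimensional illustration from the previous section, the second moment of that estimate is

\[E[\tilde{f}_t^2] = \frac{f_t^2(x_t+\delta)^2 + f_t^2(x_t-\delta)^2}{2\delta^2} = O\!\left(\frac{1}{\delta^2}\right),\]

and since the sampling radius \(\delta\) cannot exceed the distance \(r\) from \(x_t\) to the boundary of \(\mathcal{K}\), the bound degrades as \(1/r^2\) exactly when \(x_t\) approaches the boundary.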

Fortunately, if we choose the regularization function \(R\) carefully, the algorithm behaves much better; the authors call this “the blessing of regularization”. Indeed, the formulation of regularized minimization as dual-space mirror descent comes to the rescue. The choice of regularization is motivated by the entropy function \(R(x) = \sum_{i=1}^n x[i]\log(x[i])\): taking the second derivative, the curvature of the entropy function grows as \(1/x[i]\) near the boundary. In this paper, the authors instead choose a regularization function \(R(x)\) whose curvature grows as the inverse squared distance to the boundary, namely a so-called self-concordant barrier from the theory of interior point methods.
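As a quick comparison (a standard calculation, not spelled out in the paper), on the positive orthant the entropy function and the log-barrier \(-\sum_{i=1}^n \log(x[i])\) have second derivatives

\[\frac{\partial^2}{\partial x[i]^2}\Big(\sum_{j} x[j]\log(x[j])\Big) = \frac{1}{x[i]}, \qquad \frac{\partial^2}{\partial x[i]^2}\Big(-\sum_{j} \log(x[j])\Big) = \frac{1}{x[i]^2},\]

so it is the log-barrier whose curvature blows up like the inverse squared distance to the boundary, matching the \(O(1/r^2)\) variance it needs to absorb.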

A self-concordant function \(R: \mathrm{int}\,\mathcal{K} \rightarrow \mathbb{R}\) is a \(C^3\) convex function such that

\[| D^3 R(x)[h,h,h] | \leq 2 \left( D^2R(x)[h,h] \right)^{3/2}\]

And the third-order differential is defined as

\[D^3R(x)[h_1,h_2,h_3] = \frac{\partial^3}{\partial t_1 \partial t_2\partial t_3} |_{t_1=t_2=t_3=0} R(x+t_1h_1+t_2h_2+t_3h_3)\]

where \(h_1,h_2,h_3 \in \mathbb{R}^n\) are arbitrary direction vectors. In addition, a self-concordant function \(R\) is called a \(\theta\)-self-concordant barrier if

\[|DR(x)[h]| \leq \theta^{1/2} [D^2 R(x)[h,h]]^{1/2}\]

\(\theta\)-self-concordant barriers lie at the foundation of the method proposed in this paper. See Section 5 of the original paper for a detailed discussion of their properties.
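As a concrete example (standard in the interior point literature, not worked out in the paper), consider the one-dimensional log-barrier \(R(x) = -\log(x)\) on the half-line \(x > 0\):

\[R''(x) = \frac{1}{x^2}, \qquad |R'''(x)| = \frac{2}{x^3} = 2\left(R''(x)\right)^{3/2}, \qquad |R'(x)| = \frac{1}{x} = \left(R''(x)\right)^{1/2},\]

so it satisfies the self-concordance inequality with equality and is a \(1\)-self-concordant barrier. More generally, \(-\sum_{i=1}^m \log(b_i - a_i^T x)\) is an \(m\)-self-concordant barrier for the polytope \(\{x : a_i^T x \leq b_i,\ i = 1,\dots,m\}\).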

4. Main Results

The algorithm for bandit online linear optimization is stated as follows. For initialization, we choose \(\eta > 0\) and a \(\theta\)-self-concordant barrier \(R\), and let \(x_1 = \arg\min_{x\in\mathcal{K}} R(x)\). Then at each time step \(t=1\) to \(T\):

  1. Let \(\{ e_1,\dots,e_n \}\) and \(\{ \lambda_1,\dots,\lambda_n \}\) be the set of eigenvectors and eigenvalues of \(\nabla^2 R(x_t)\).
  2. Choose \(i_t\) uniformly at random from \(\{1,\dots,n\}\) and \(\epsilon_t = \pm 1\) with probability \(1/2\) each.
  3. Predict \(y_t = x_t + \epsilon_t \lambda_{i_t}^{-1/2}e_{i_t}\)
  4. Observe the loss \(f_t^T y_t \in \mathbb{R}\).
  5. Define \(\tilde{f}_t = n(f_t^T y_t)\epsilon_t\lambda_{i_t}^{1/2}\cdot e_{i_t}\)
  6. Update \(x_{t+1} = \arg\min_{x\in\mathcal{K}} \left[ \eta \sum_{s=1}^t \tilde{f}_s^T x + R(x) \right]\).
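To make these steps concrete, here is a minimal Python sketch of one round for the box \(\mathcal{K} = [0,1]^n\) with the standard log-barrier. The specific domain, the helper names, and the scipy minimizer are my own choices for illustration, not details from the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Running example: K = [0, 1]^n with the standard log-barrier, which is a
# theta-self-concordant barrier with theta = 2n (one term per face).

def log_barrier(x):
    return -np.sum(np.log(x)) - np.sum(np.log(1.0 - x))

def barrier_hessian(x):
    # Hessian of the log-barrier on the box is diagonal: 1/x_i^2 + 1/(1-x_i)^2.
    return np.diag(1.0 / x**2 + 1.0 / (1.0 - x)**2)

def ftrl_step(f_sum, eta, n):
    # Step 6: x_{t+1} = argmin_x  eta * <sum_s f_tilde_s, x> + R(x).
    obj = lambda x: eta * (f_sum @ x) + log_barrier(x)
    x0 = np.full(n, 0.5)
    return minimize(obj, x0, bounds=[(1e-6, 1.0 - 1e-6)] * n).x

def bandit_round(x_t, f_sum, eta, loss_oracle, rng):
    n = len(x_t)
    lam, e = np.linalg.eigh(barrier_hessian(x_t))       # step 1: eigendecomposition
    i = rng.integers(n)                                  # step 2: uniform coordinate
    eps = rng.choice([-1.0, 1.0])                        #         Rademacher sign
    y_t = x_t + eps * lam[i] ** -0.5 * e[:, i]           # step 3: point actually played
    loss = loss_oracle(y_t)                              # step 4: scalar feedback f_t^T y_t
    f_tilde = n * loss * eps * lam[i] ** 0.5 * e[:, i]   # step 5: one-point unbiased estimate
    f_sum = f_sum + f_tilde
    x_next = ftrl_step(f_sum, eta, n)                    # step 6: FTRL update
    return y_t, x_next, f_sum
```

Because \(R\) blows up at the boundary, the minimizer in step 6 lies in the interior of \(\mathcal{K}\) and the problem is effectively unconstrained, which is what makes Newton-type interior-point machinery applicable; the scipy call above is only a convenient stand-in.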

Basically, this method uses the Hessian matrix of the \(\theta\)-self-concordant regularizer \(R(x)\) to capture the curvature of the boundary: the sampling in step 3 is along the axes of the Dikin ellipsoid at \(x_t\), and the \(\lambda_{i_t}^{\pm 1/2}\) scaling makes both the prediction and the estimate unbiased, i.e. \(E[y_t] = x_t\) and \(E[\tilde{f}_t] = f_t\). With some technical assumptions, the regret of the algorithm is bounded by the following inequality:

\[E\sum_{t=1}^T f_t^T y_t \leq \min_{u \in \mathcal{K}^\prime} E\left( \sum_{t=1}^T f_t^T u \right) + 16n\sqrt{\theta T \log(T)}\]

where \(\mathcal{K}^\prime \subset \mathcal{K}\) is some subset of \(\mathcal{K}\). The expected regret over the original set \(\mathcal{K}\) is within an additive \(O(\sqrt{nT})\) term of the above guarantee. Although \(O(\sqrt{T\log(T)})\) is quite close to the desired regret bound, it is not exactly \(O(\sqrt{T})\); presumably the \(O^*(\cdot)\) notation is meant to hide such logarithmic factors, but I am not sure if I am missing anything.

The remaining part of the paper mainly focuses on the proof of the above regret bound and is omitted here for brevity.

5. Some Thoughts

To be honest, this is quite a hard paper to read. There are tons of mathematical concepts and proofs that are not easy to follow, and I can hardly say I fully understand all the insights behind this paper. But it is also a good paper that is worth reading multiple times, since there are many insights and techniques we can learn from it.