In this blog post, I will summarize the paper “Projection-free Online Learning” and share some of my understanding of it. As its title suggests, the paper proposes an efficient projection-free algorithm for online convex optimization. As we discussed earlier, the traditional online gradient descent method requires a projection step onto the feasible set. In Euclidean space, this projection amounts to solving a convex quadratic program over a convex set, which can be computationally expensive for certain constraint sets \(\mathcal{K}\). In this paper, the authors eschew projections in favor of much cheaper linear optimization steps based on the Frank-Wolfe technique. A range of regret bounds is derived for different settings, and experiments on collaborative filtering show improvements on standard datasets.

1. Convex sets on which projection is expensive

For a convex set \(\mathcal{K}\), a projection step in Euclidean space is given by

\[P_{\mathcal{K}}(y) = \arg \min_{x\in\mathcal{K}} ||x-y||^2_2\]

Moreover, a linear optimization step involves solving

\[\min_{x\in\mathcal{K}} v^\top x\]

for some vector \(v\). Next, I will give some examples of decision sets on which linear optimization is much more efficient than projection.
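Before turning to those examples, here is a small toy illustration (my own, not an example from the paper) contrasting the two primitives on the probability simplex: linear optimization just picks the best vertex, while Euclidean projection needs a sorting-based algorithm.

```python
import numpy as np

def linear_opt_simplex(v):
    """argmin_{x in simplex} v.x : put all mass on the smallest coordinate."""
    x = np.zeros_like(v, dtype=float)
    x[np.argmin(v)] = 1.0
    return x

def project_simplex(y):
    """Euclidean projection onto the probability simplex (sorting-based)."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(y) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

v = np.array([0.3, -1.2, 0.5])
print(linear_opt_simplex(v))   # a vertex: (0, 1, 0)
print(project_simplex(v))      # a point on the simplex, generally not a vertex
```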

Bounded Trace Norm Matrices. The set of matrices with bounded trace norm is a common decision set for applications like matrix completion and collaborative filtering. Formally, the set is given by

\[\mathcal{K} = \{ A\in \mathbb{R}^{m\times n}: \|A\|_1 = \text{trace}(\sqrt{A^{\top} A}) \leq \tau \}\]

The projection of a matrix \(X\) onto \(\mathcal{K}\) involves computing the SVD of \(X\), which requires \(O(nm^2)\) time, while linear optimization over \(\mathcal{K}\) only needs the top singular vector pair and can therefore be done in time linear in the number of non-zero entries of the matrix \(X\).
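As a rough sketch of my own (not from the paper): the minimizer of \(\langle V, A\rangle\) over the trace-norm ball of radius \(\tau\) is \(-\tau\, u_1 v_1^{\top}\), where \((u_1, v_1)\) is the top singular vector pair of \(V\), which power iteration finds in time roughly proportional to the number of non-zeros of \(V\).

```python
import numpy as np

def linear_opt_trace_ball(V, tau, iters=200):
    """min <V, A> over the trace-norm ball of radius tau (power-iteration sketch)."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(V.shape[1])
    for _ in range(iters):            # power iteration on V^T V
        v = V.T @ (V @ v)
        v /= np.linalg.norm(v)
    u = V @ v
    u /= np.linalg.norm(u)
    return -tau * np.outer(u, v)      # rank-one minimizer

V = np.random.randn(50, 40)
A = linear_opt_trace_ball(V, tau=1.0)
print(np.sum(V * A), -np.linalg.norm(V, 2))   # both ≈ -tau * sigma_max(V)
```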

Flow polytope. Given a directed acyclic graph \(G\) with \(n\) nodes and \(m\) edges, a source node \(s\), and a sink node \(t\), consider the set of all paths from \(s\) to \(t\), and let \(\mathcal{K}\) be the convex hull of their edge-indicator vectors. The set \(\mathcal{K}\) can be equivalently represented as a polytope with \(O(m+n)\) linear inequalities and arises in the online shortest path problem. Computing a projection onto \(\mathcal{K}\) takes \(O(n^{3.5})\) time, while linear optimization over \(\mathcal{K}\) amounts to finding a shortest path in the graph, which can be done efficiently.
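A toy sketch of this linear optimization step (my own, assuming the nodes are numbered in topological order and edge costs come from the gradient; since those costs may be negative, it uses DAG dynamic programming rather than Dijkstra):

```python
import math

def shortest_path_dag(n_nodes, edges, cost, s, t):
    """Return the edge-indicator vector of a min-cost s-t path in a DAG.

    edges: list of (u, v) pairs; cost[i] is the (possibly negative) cost of edges[i];
    nodes are assumed to be numbered in topological order.
    """
    adj = [[] for _ in range(n_nodes)]
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
    dist = [math.inf] * n_nodes
    parent_edge = [None] * n_nodes
    dist[s] = 0.0
    for u in range(n_nodes):          # relax edges in topological order
        if dist[u] == math.inf:
            continue
        for v, i in adj[u]:
            if dist[u] + cost[i] < dist[v]:
                dist[v] = dist[u] + cost[i]
                parent_edge[v] = i
    x = [0.0] * len(edges)            # recover a vertex of K
    v = t
    while v != s:
        i = parent_edge[v]
        x[i] = 1.0
        v = edges[i][0]
    return x

# a tiny diamond DAG with s = 0 and t = 3
print(shortest_path_dag(4, [(0, 1), (0, 2), (1, 3), (2, 3)], [1.0, -2.0, 1.0, 3.0], 0, 3))
```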

Matroid polytope. Given a matroid \(M = (E, I)\) with \(|E| = n\), any independent set \(A\in I\) can be represented as a vector in \(\mathbb{R}^n\) via its indicator vector. The corresponding polytope \(\mathcal{K}\), the convex hull of these indicator vectors, is described by on the order of \(2^n\) linear inequalities, so projection onto \(\mathcal{K}\) is a difficult operation. Linear optimization over \(\mathcal{K}\), on the other hand, reduces to a greedy procedure that sorts the objective vector and therefore runs in \(O(n\log n)\) time.
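A minimal sketch of this greedy step (mine, not the paper's, with a rank-\(k\) uniform matroid standing in for the independence oracle):

```python
import numpy as np

def linear_opt_matroid(v, is_independent):
    """Minimize v.x over the matroid polytope via the greedy algorithm."""
    order = np.argsort(v)                 # cheapest elements first
    chosen = []
    for e in order:
        if v[e] >= 0:                     # non-negative elements only hurt the objective
            break
        if is_independent(chosen + [int(e)]):
            chosen.append(int(e))
    x = np.zeros(len(v))
    x[chosen] = 1.0
    return x

k = 2
uniform_matroid = lambda S: len(S) <= k   # rank-k uniform matroid oracle
print(linear_opt_matroid(np.array([-3.0, 1.0, -0.5, -2.0]), uniform_matroid))
# -> [1. 0. 0. 1.]: the two most negative coordinates
```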

Rotations. Rotation matrices are \(n\times n\) orthogonal matrices with determinant \(1\). Their convex hull \(\mathcal{K}\) arises in the online learning of rotations problem. Projection onto \(\mathcal{K}\) is very difficult, while linear optimization over \(\mathcal{K}\) can be carried out with a single SVD.
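A rough sketch of my own, following the standard SVD solution to this linear problem (Wahba's problem / orthogonal Procrustes): to minimize \(\operatorname{tr}(V^\top R)\) over rotations, take the SVD of \(-V\) and correct the determinant sign.

```python
import numpy as np

def linear_opt_rotations(V):
    """min tr(V^T R) over n x n rotation matrices, via a single SVD."""
    U, _, Wt = np.linalg.svd(-V)                  # maximize tr((-V)^T R)
    d = np.sign(np.linalg.det(U @ Wt))            # force det(R) = +1
    D = np.diag([1.0] * (V.shape[0] - 1) + [d])
    return U @ D @ Wt

V = np.random.randn(3, 3)
R = linear_opt_rotations(V)
print(np.linalg.det(R))                           # ≈ 1
print(np.allclose(R.T @ R, np.eye(3)))            # orthogonal
```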

2. Preliminaries

The online convex optimization setting is similar to that in previous blog posts. In each round \(t = 1,2,\dots,T\), the player chooses a decision \(x_t \in \mathcal{K}\), the adversary reveals a convex cost function \(f_t\), and the player suffers the cost \(f_t(x_t)\). The goal of the player is to produce points \(x_t\) such that the regret

\[\text{Regret} := \sum_{t=1}^T f_t(x_t) - \min_{x\in \mathcal{K}} \sum_{t=1}^T f_t(x)\]

is sub-linear in \(T\).

A key concept introduced in the paper is the smoothed function. Given \(\delta >0\), the \(\delta\)-smoothing of a function \(f\) is defined by:

\[\hat{f}_\delta(x) = \mathbb{E}_{u\sim B} [f(x+\delta u)]\]

where \(u\) is chosen uniformly at random from the unit ball \(B \subset \mathbb{R}^n\). If the original function \(f\) is convex and L-Lipschitz, the smoothed function has several important properties: \(\hat{f}_{\delta}\) is convex, L-Lipschitz, and \(\frac{dL}{\delta}\)-smooth; its gradient is bounded by \(dL\); and the distance between the smoothed and the original function is bounded by \(\delta L\). See Lemma 2.1 in the original paper.
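For intuition, here is a quick Monte Carlo check of my own of the \(\delta\)-smoothing and the \(\delta L\) bound, using the Euclidean norm (1-Lipschitz, non-smooth at the origin) as \(f\):

```python
import numpy as np

def smoothed(f, x, delta, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_u[f(x + delta*u)] with u uniform on the unit ball."""
    rng = np.random.default_rng(seed)
    d = len(x)
    g = rng.standard_normal((n_samples, d))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the sphere
    r = rng.random(n_samples) ** (1.0 / d)             # radius for a uniform ball sample
    return np.mean([f(x + delta * ri * ui) for ri, ui in zip(r, u)])

f = np.linalg.norm          # L = 1, non-smooth at 0
x = np.zeros(3)
print(f(x), smoothed(f, x, delta=0.1))   # the two values differ by at most delta * L
```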

3. The algorithm

The algorithm is based on the Frank-Wolfe technique, so the authors call it Online Frank-Wolfe (OFW). For initialization, we choose an input parameter \(a\geq 0\) and an arbitrary initial point \(x_1 \in \mathcal{K}\). Then in each round \(t=1,2,\dots,T\), we perform the following main steps (a minimal code sketch follows the list):

  1. Play \(x_t\) and observe \(f_t\).
  2. Compute \(F_t = \frac{1}{t} \sum_{\tau = 1}^{t} f_{\tau}\).
  3. Compute \(v_t = \arg\min_{x\in \mathcal{K}} \{ \nabla F_t(x_t)\cdot x \}\).
  4. Set \(x_{t+1}=(1-t^{-a})x_t + t^{-a}v_t\).
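Here is a minimal sketch of the OFW loop (my simplification, not the authors' code): losses are assumed linear, \(f_t(x) = c_t\cdot x\), so \(\nabla F_t(x_t)\) is just the running average of the observed \(c_t\)'s, and the set \(\mathcal{K}\) enters only through a linear optimization oracle, here the simplex oracle from the first sketch.

```python
import numpy as np

def linear_opt_simplex(v):                    # same oracle as in the first sketch
    x = np.zeros_like(v, dtype=float)
    x[np.argmin(v)] = 1.0
    return x

def ofw(linear_opt, costs, x1, a=0.5):
    """Online Frank-Wolfe with step exponent a (Theorem 3.1 sets a = d - b)."""
    x = np.array(x1, dtype=float)
    grad_sum = np.zeros_like(x)
    total_cost = 0.0
    for t, c in enumerate(costs, start=1):
        total_cost += float(c @ x)            # play x_t and suffer f_t(x_t) = c_t . x_t
        grad_sum += c                         # grad F_t(x_t) = grad_sum / t for linear losses
        v = linear_opt(grad_sum / t)          # step 3: linear optimization over K
        eta = t ** (-a)
        x = (1.0 - eta) * x + eta * v         # step 4: convex combination stays inside K
    return total_cost

# usage: 100 rounds of random linear costs over the simplex
costs = np.random.default_rng(1).standard_normal((100, 3))
print(ofw(linear_opt_simplex, costs, x1=np.ones(3) / 3))
```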

The linear optimization step appears in step 3. Define

\[\Delta_t = F_t(x_t) - F_t(x^{\star}_t)\]

where \(x^{\star}_t = \arg\min_{x\in \mathcal{K}} F_t(x)\). The following general theorem lays the foundation for the regret analysis.

Theorem 3.1. Assume that for \(t=1,2,\dots,T\), the function \(f_t\) is L-Lipschitz, \(Bt^{-b}\)-smooth for some constants \(b\in[-1,1/2]\) and \(B\geq 0\), and \(St^{-s}\)-strongly convex for some constants \(s\in[0,1)\) and \(S\geq 0\). Then for all \(t>1\), we have

\[\Delta_t \leq Ct^{-d}\]

for both of the following choices of \(C\) and \(d\), where \(D\) denotes the diameter of \(\mathcal{K}\):

\[(C,d) = \left(\max\{9D^2B,\, 3LD\},\ \frac{1+b}{2}\right)\]

and

\[(C,d) = \left(\max\{9D^2B,\, 36L^2/S,\, 3LD\},\ \frac{2+2b-s}{3}\right)\]

In either case, the bound is obtained by setting \(a = d - b\) in the above algorithm.

With the above theorem, we can obtain regret bounds in the different settings by choosing the parameters appropriately. For example, for \(\beta\)-smooth stochastic convex loss functions \(f_t\), we can set \(B=\beta\), \(b=S=s=0\), and \(d=a=1/2\); a sub-linear regret bound of \(O(\sqrt{T})\) then follows from the theorem (cf. Section 4.1.1 of the paper).
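To spell out where these numbers come from, plugging \(b=0\) into the first case of Theorem 3.1 gives

\[\Delta_t \leq \max\{9D^2\beta,\ 3LD\}\, t^{-1/2}, \qquad d = \frac{1+b}{2} = \frac{1}{2}, \qquad a = d - b = \frac{1}{2},\]

and it is this \(t^{-1/2}\) decay of the approximation error that Section 4.1.1 of the paper turns into the \(O(\sqrt{T})\) regret bound.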

In summary, the authors give the following table of regret bounds for the different settings. Although all of the regret bounds are sub-linear, the algorithm achieves the optimal regret only in the stochastic smooth setting; the additional regret in the adversarial settings is the price paid for avoiding projections.

|                  | Stochastic                 | Adversarial    |
| ---------------- | -------------------------- | -------------- |
| Smooth costs     | \(\tilde{O}(\sqrt{T})\)    | \(O(T^{3/4})\) |
| Non-smooth costs | \(\tilde{O}(T^{2/3})\)     | \(O(T^{3/4})\) |

4. Conclusion

In the experiments section of the paper, the authors evaluate the proposed algorithm on two standard datasets, and the results show the effectiveness of OFW compared with OGD. In conclusion, the authors propose an efficient algorithmic scheme for online convex optimization that performs one linear optimization per iteration rather than one quadratic optimization (projection). Its advantages include computational efficiency, being parameter-free, and efficient, sparse representations of the iterates. The regret analysis covers several settings, but the bounds are not always optimal, so improving them is left as an open problem for future work.