Online learning theory - strong convexity
Today’s post is a neat example of what exploiting structure can do. It ties back to the post that introduced online learning. In that discussion, our online gradient descent algorithm relied on an auxiliary sequence of models: we computed the auxiliary sequence with gradient descent steps and then projected back into our model space. That was all well and good, but in this post we’re going to show that we can skip the projection step entirely when the loss functions are strongly convex.
Strong convexity
Before we define strong convexity, try to guess what it could look like. If you imagined a function whose curvature is bounded below by a positive constant everywhere, then you’re spot on!
We’ll assume that each loss function $\ell_t$ is $\sigma$-strongly convex, meaning

$$\ell_t(y) \ge \ell_t(x) + \nabla \ell_t(x)^\top (y - x) + \frac{\sigma}{2}\,\|y - x\|_2^2 \quad \text{for all } x, y.$$

(Technically, strong convexity is defined with respect to a norm; in this case, it’s the $\ell_2$ norm.) So, strong convexity means that our function grows faster than linearly in all directions.
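As a quick sanity check of this definition (a worked example of mine, not from the lectures), take $f(w) = \tfrac{\sigma}{2}\|w\|_2^2$, whose gradient is $\nabla f(x) = \sigma x$. The gap between $f$ and its linearization is exactly the quadratic term in the definition:

$$f(y) - f(x) - \nabla f(x)^\top (y - x) = \frac{\sigma}{2}\|y\|_2^2 - \frac{\sigma}{2}\|x\|_2^2 - \sigma x^\top (y - x) = \frac{\sigma}{2}\|y - x\|_2^2,$$

so $f$ is $\sigma$-strongly convex. Adding any convex function to $f$ preserves this, which is why $\ell_2$-regularized losses fit the setting below.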
Online gradient descent
With strongly convex losses, the online gradient descent algorithm is a one-liner.
Algorithm: online gradient descent with strongly convex losses.
- given step sizes $\eta_t > 0$, an initial model $w_1$, and a bound on the norm of the gradients, i.e., $\|\nabla \ell_t(w_t)\|_2 \le G$ for all $t$.
- repeat for $t = 1, 2, \ldots, T$
- compute a gradient step: $w_{t+1} = w_t - \eta_t \nabla \ell_t(w_t)$
That’s it! Now here’s the cool part. If we know the minimum eigenvalue $\mu > 0$ associated with the Hessian of each $\ell_t$ (so each loss is $\mu$-strongly convex), then we can choose $\eta_t = \frac{1}{\mu t}$, which gives

$$\sum_{t=1}^{T} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} \ell_t(w) \le \frac{G^2}{2\mu}\left(1 + \log T\right).$$
We can interpret this bound as saying that convergence is fast, logarithmic in $T$ rather than the $O(\sqrt{T})$ regret we get for general convex losses, when we have global curvature information about the sequence of loss functions. No proof this time (although it’s very similar to the online gradient descent proof), but here’s a quick sketch of the update in code, and that’s it for today!
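To make the algorithm concrete, here’s a minimal sketch in Python. It is an illustration under my own assumptions, not code from the lectures: the helper name `ogd_strongly_convex`, the regularized squared loss, the synthetic data stream, and the value of `sigma` are all made up for the example. It runs the update $w_{t+1} = w_t - \frac{1}{\sigma t}\nabla \ell_t(w_t)$ with no projection step.

```python
# Minimal sketch of online gradient descent with strongly convex losses
# (no projection step), using step sizes eta_t = 1 / (sigma * t).
# The loss, data stream, and sigma below are illustrative assumptions.
import numpy as np

def ogd_strongly_convex(grad, w1, sigma, T):
    """grad(t, w) returns the gradient of the round-t loss at w;
    w1 is the initial model, sigma the strong convexity constant,
    T the number of rounds. Returns the sequence of models."""
    w = np.asarray(w1, dtype=float)
    models = [w.copy()]
    for t in range(1, T + 1):
        eta_t = 1.0 / (sigma * t)      # step size from the regret bound
        w = w - eta_t * grad(t, w)     # gradient step, no projection
        models.append(w.copy())
    return models

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, T, sigma = 5, 1000, 0.1
    xs, ys = rng.normal(size=(T, d)), rng.normal(size=T)

    # Round-t loss: 0.5 * (x_t . w - y_t)^2 + (sigma / 2) * ||w||^2,
    # which is sigma-strongly convex thanks to the regularizer.
    def grad(t, w):
        x, y = xs[t - 1], ys[t - 1]
        return (x @ w - y) * x + sigma * w

    models = ogd_strongly_convex(grad, np.zeros(d), sigma, T)
    print("final model:", models[-1])
```

The only difference from plain online gradient descent is the $1/(\sigma t)$ step-size schedule and the absence of a projection; note the Gaussian data here doesn’t strictly satisfy the bounded-gradient assumption, so it’s a demonstration of the update rather than of the regret bound itself.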
Next up
… a foray into the mind of R.A. Fisher with Brad Efron’s help!
References
This post is related to material from Nicolò Cesa-Bianchi’s lectures at the Bordeaux Machine Learning Summer School in 2011. Watch them here: MLSS 2011 lectures.