Learning theory - online learning
Welcome back! Today’s post focuses on the online learning perspective. Online learning differs from statistical learning because it makes no assumptions about the data source. In other words, the process generating the data is arbitrary! Instead of constraining the data source, online learning makes assumptions on the sequence of loss functions to attain some level of tractability. Yep, you read that correctly: we will have a sequence of loss functions, not just one to rule them all.
A protocol for online learning
Here’s how our online learning world works.
- An algorithm generates an initial model $w_1$.
- At each time $t = 1, 2, \ldots$, the model $w_t$ is tested on a data point.
- After being tested, the algorithm updates $w_t$ to produce $w_{t+1}$.
The second step suggests that we’re going to need some notion of loss (i.e., something like risk) to help us test our model. For our purposes, we’ll assume that the loss functions $\ell_1, \ell_2, \ldots$ we encounter are convex, nonnegative, and differentiable.
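To make the protocol concrete, here’s a minimal sketch in Python. The names `initial_model`, `update`, and `losses` are hypothetical placeholders for this post, not notation from the lectures; the point is just the test-then-update loop.

```python
def online_protocol(initial_model, update, losses):
    """A sketch of the online learning protocol.

    initial_model: the starting model w_1
    update:        a callback (model, loss_t) -> next model, i.e., the algorithm
    losses:        the sequence of loss functions, one per round
    """
    model = initial_model
    total_loss = 0.0
    for loss_t in losses:
        total_loss += loss_t(model)    # step 2: the current model is tested
        model = update(model, loss_t)  # step 3: the algorithm updates the model
    return total_loss / len(losses)    # the loss rate (defined in the next section)
```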
For application-minded readers, model spaces are typically linear but can be finite dimensional (e.g., regression or support vector machine parameters) or infinite dimensional, via reproducing kernel Hilbert spaces.
Assessing performance
We’re going to assess the performance of our (model-generating) algorithm via its loss rate, defined as

$$\frac{1}{T} \sum_{t=1}^{T} \ell_t(w_t).$$
But this is only part of the story. A loss rate by itself is uninteresting because the losses are arbitrary. So, we’ll focus on controlling our regret

$$R_T = \sum_{t=1}^{T} \ell_t(w_t) - \sum_{t=1}^{T} \ell_t(w^\ast),$$

where

$$w^\ast = \arg\min_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(w)$$

is the best model in the class $\mathcal{W}$. When the average regret $R_T / T$ goes to zero, then we are learning and also experiencing some type of consistency.
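As a sanity check on the definition, here’s a hypothetical helper that computes the empirical regret. Brute-forcing the best-in-hindsight model over a finite list of candidates is purely illustrative; in the theory, the minimum runs over the whole model class.

```python
def regret(played_models, losses, candidate_models):
    """Cumulative loss of the models we actually played, minus the cumulative
    loss of the best fixed model in hindsight (approximated by brute force)."""
    algorithm_loss = sum(loss_t(w_t) for w_t, loss_t in zip(played_models, losses))
    best_fixed_loss = min(sum(loss_t(u) for loss_t in losses) for u in candidate_models)
    return algorithm_loss - best_fixed_loss
```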
The algorithm of choice - online gradient descent
Online gradient descent (OGD) is akin to empirical risk minimization in statistical learning theory, but it’s a bit different from typical gradient descent because the losses can vary over time. Nonetheless, the algorithm is quite simple.
Algorithm (online gradient descent).
- given learning rates $\eta_1, \eta_2, \ldots > 0$, a radius $R > 0$ defining the model space $\mathcal{W} = \{ w : \|w\| \le R \}$, and an initial model $w_1 = 0$.
- repeat for $t = 1, 2, \ldots$
  - compute a tentative gradient step: $w_{t+1}' = w_t - \eta_t \nabla \ell_t(w_t)$
  - project back to the model space: $w_{t+1} = \arg\min_{w \in \mathcal{W}} \|w - w_{t+1}'\|$
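Here’s a sketch of OGD in Python/NumPy under the ball-shaped model space above, where projection is just rescaling; the gradient oracle `grad` stands in for whatever loss sequence the environment serves up.

```python
import numpy as np

def ogd(dim, R, eta, grad, T):
    """Online gradient descent on the ball {w : ||w|| <= R}.

    eta:  function t -> step size eta_t
    grad: function (t, w) -> gradient of the round-t loss at w
    """
    w = np.zeros(dim)  # initial model w_1 = 0
    models = [w]
    for t in range(1, T + 1):
        w_tentative = w - eta(t) * grad(t, w)  # tentative gradient step
        norm = np.linalg.norm(w_tentative)
        # Projection onto the Euclidean ball: rescale if we stepped outside.
        w = w_tentative if norm <= R else (R / norm) * w_tentative
        models.append(w)
    return models
```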
Neural connection. Online gradient descent is similar to stochastic gradient descent, except there’s no stochasticity assumed in the data.
Food for thought. Online gradient descent takes a local optimization perspective, processing one data point at a time, rather than the global optimization perspective of empirical risk minimization. Because of this, online gradient descent scales gracefully to large problems.
Analyzing online gradient descent
Let’s fix a time $t$ and analyze the instantaneous regret

$$r_t = \ell_t(w_t) - \ell_t(w^\ast).$$

By convexity, the first-order Taylor expansion of $\ell_t$ around $w_t$ gives

$$r_t \le \nabla \ell_t(w_t)^\top (w_t - w^\ast).$$

This is pretty much the definition of convexity. By the OGD algorithm, we calculate

$$\begin{aligned}
\nabla \ell_t(w_t)^\top (w_t - w^\ast)
&\overset{(a)}{=} \frac{1}{2\eta_t} \left( \|w_t - w^\ast\|^2 - \|w_{t+1}' - w^\ast\|^2 \right) + \frac{\eta_t}{2} \|\nabla \ell_t(w_t)\|^2 \\
&\overset{(b)}{\le} \frac{1}{2\eta_t} \left( \|w_t - w^\ast\|^2 - \|w_{t+1} - w^\ast\|^2 \right) + \frac{\eta_t}{2} \|\nabla \ell_t(w_t)\|^2 \\
&\overset{(c)}{=} \frac{\|w_t - w^\ast\|^2}{2\eta_{t-1}} - \frac{\|w_{t+1} - w^\ast\|^2}{2\eta_t} + \left( \frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} \right) \|w_t - w^\ast\|^2 + \frac{\eta_t}{2} \|\nabla \ell_t(w_t)\|^2.
\end{aligned}$$

In (a) we expand the square $\|w_{t+1}' - w^\ast\|^2 = \|w_t - \eta_t \nabla \ell_t(w_t) - w^\ast\|^2$, comparing the current model to the best model plus a penalty for changing models. In (b) we use the fact that projecting reduces the distance to $w^\ast$, so $\|w_{t+1} - w^\ast\| \le \|w_{t+1}' - w^\ast\|$; since that term appears with a negative sign, we get the inequality. In (c) we add and subtract the term $\|w_t - w^\ast\|^2 / (2\eta_{t-1})$ for reasons which will become clear shortly.
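If step (a) looks mysterious, a quick numerical check (a throwaway sketch with random vectors, nothing from the lectures) confirms it’s an exact identity:

```python
import numpy as np

rng = np.random.default_rng(0)
w_t, w_star, g = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
eta = 0.1
w_tentative = w_t - eta * g  # the tentative step w'_{t+1}

lhs = g @ (w_t - w_star)
rhs = (np.sum((w_t - w_star) ** 2) - np.sum((w_tentative - w_star) ** 2)) / (2 * eta) \
      + (eta / 2) * np.sum(g ** 2)
assert np.isclose(lhs, rhs)  # (a) holds with equality
```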
Summing it up
Now, we need to sum over $t = 1, \ldots, T$:

$$R_T \le \sum_{t=1}^{T} \left[ \frac{\|w_t - w^\ast\|^2}{2\eta_{t-1}} - \frac{\|w_{t+1} - w^\ast\|^2}{2\eta_t} \right] + \sum_{t=1}^{T} \left( \frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} \right) \|w_t - w^\ast\|^2 + \frac{G^2}{2} \sum_{t=1}^{T} \eta_t,$$

where $G = \max_t \|\nabla \ell_t(w_t)\|$, $D = \max_{u, v \in \mathcal{W}} \|u - v\| = 2R$ is the diameter of the model space, and we adopt the convention $1/\eta_0 = 0$. Continuing from our last line and using $\|w_t - w^\ast\| \le D$ together with nonincreasing step sizes, we calculate

$$R_T \le \frac{\|w_1 - w^\ast\|^2}{2\eta_0} - \frac{\|w_{T+1} - w^\ast\|^2}{2\eta_T} + D^2 \left( \frac{1}{2\eta_T} - \frac{1}{2\eta_0} \right) + \frac{G^2}{2} \sum_{t=1}^{T} \eta_t \overset{(a)}{\le} \frac{D^2}{2\eta_T} + \frac{G^2}{2} \sum_{t=1}^{T} \eta_t,$$

where both sums telescope, the $1/\eta_0$ terms vanish by our convention, and we throw away the $\|w_{T+1} - w^\ast\|^2$ term since it enters negatively; telescoping plus the throw-away-a-negative-term trick gives the inequality (a).
Putting it all together
We can pick our step sizes to balance the two terms in the bound. Specifically, we choose them as

$$\eta_t = \frac{D}{G \sqrt{t}}.$$

So, choosing $\eta_t$ as above, using $\sum_{t=1}^{T} 1/\sqrt{t} \le 2\sqrt{T}$, and dividing by $T$, our final bound, i.e., our rate, is

$$\frac{R_T}{T} \le \frac{3 D G}{2 \sqrt{T}} = \frac{3 G R}{\sqrt{T}}.$$
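The only arithmetic fact smuggled in there is $\sum_{t=1}^{T} 1/\sqrt{t} \le 2\sqrt{T}$, which follows by comparing the sum with the integral $\int_0^T dt/\sqrt{t} = 2\sqrt{T}$. Here’s a quick illustrative check:

```python
import math

# Check sum_{t=1}^T 1/sqrt(t) <= 2*sqrt(T) for a few horizons.
for T in (10, 1_000, 100_000):
    s = sum(1 / math.sqrt(t) for t in range(1, T + 1))
    assert s <= 2 * math.sqrt(T)
    print(f"T={T}: sum={s:.2f} <= bound={2 * math.sqrt(T):.2f}")
```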
Linear regression with squared loss
Let’s do linear regression with squared loss using the assumptions
- $\ell_t(w) = (w^\top x_t - y_t)^2$,
- $\|x_t\| \le X$ for all $t$, and
- $|y_t| \le RX$ for all $t$.

Under this setup our regret rate is

$$\frac{R_T}{T} \le \frac{3 G R}{\sqrt{T}} \le \frac{12 R^2 X^2}{\sqrt{T}},$$

since $\|\nabla \ell_t(w_t)\| = 2 \, |w_t^\top x_t - y_t| \, \|x_t\| \le 2(RX + RX)X = 4 R X^2$, i.e., $G \le 4RX^2$. (The norm bound on $w_t^\top x_t$ makes use of the Cauchy–Schwarz inequality.) Now, hold on a sec… this is the same $1/\sqrt{T}$ rate as the Rademacher complexity bound for the statistical learning approach!
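To close the loop, here’s a self-contained sketch that runs OGD on a synthetic squared-loss regression stream and compares the empirical regret to the $12 R^2 X^2 \sqrt{T}$ bound. The data generation and constants are made up for illustration, and the best-in-hindsight comparator is approximated by projected least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, R, X = 5, 2000, 1.0, 1.0

# Synthetic stream satisfying the assumptions: ||x_t|| <= X and |y_t| <= R*X.
xs = rng.normal(size=(T, d))
xs /= np.maximum(np.linalg.norm(xs, axis=1, keepdims=True) / X, 1.0)
w_true = rng.normal(size=d)
w_true *= R / np.linalg.norm(w_true)
ys = np.clip(xs @ w_true + 0.1 * rng.normal(size=T), -R * X, R * X)

G, D = 4 * R * X**2, 2 * R  # gradient bound and diameter from the analysis
w = np.zeros(d)
algorithm_loss = 0.0
for t in range(1, T + 1):
    x, y = xs[t - 1], ys[t - 1]
    algorithm_loss += (w @ x - y) ** 2                     # test the current model
    w = w - (D / (G * np.sqrt(t))) * 2 * (w @ x - y) * x   # step with eta_t = D/(G sqrt(t))
    norm = np.linalg.norm(w)
    if norm > R:
        w *= R / norm                                      # project back onto the ball

# Approximate comparator: least squares, projected onto the ball.
w_star, *_ = np.linalg.lstsq(xs, ys, rcond=None)
norm = np.linalg.norm(w_star)
if norm > R:
    w_star *= R / norm
best_loss = np.sum((xs @ w_star - ys) ** 2)

print(f"regret = {algorithm_loss - best_loss:.1f}")
print(f"bound  = {12 * R**2 * X**2 * np.sqrt(T):.1f}")
```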
Next up
… reconciling and connecting statistical learning and online learning!
References
This post is related to material from Nicolò Cesa-Bianchi’s lectures at the Bordeaux Machine Learning Summer School in 2011. Watch them here: MLSS 2011 lectures.