Learning theory - reconciliations and connections
Howdy learning theorists! Today we’re going to connect statistical learning and online learning. We’ll do so via model averaging, exploiting online learning’s local perspective. To that end, this post could easily have been called “the power of local optimization”, which has a nice ring to it.
Our goal is to address a few questions that pop up when the online learning approach is unfamiliar.
- What does the online learning regret bound mean?
- How can that bound be used?
- How well does the loss rate on the model sequence generalize?
The approach
Because online learning generates a sequence of models rather than a single model, we have to pick a model to compare against the empirical risk minimizer. In particular, we’ll choose the average model. Then we can use the online analysis machinery to provide a risk bound on that model.
Assume we have
- a sample of $m$ examples $(X_1, Y_1), \ldots, (X_m, Y_m)$ drawn i.i.d. from a distribution $\mathcal{D}$, giving a training set $S$;
- a loss function for linear models $\ell(w^\top x, y)$, convex in $w$;
- a “best” model $w^*$.
We want to control the difference
$$\mathrm{risk}(\bar{w}) - \mathrm{risk}(w^*), \qquad \text{where } \mathrm{risk}(w) = \mathbb{E}_{(X, Y) \sim \mathcal{D}}\big[\ell(w^\top X, Y)\big]$$
and $\bar{w}$ is the average model. Here’s how we get $\bar{w}$:
- run online gradient descent on our sample set $S$;
- obtain a sequence of models $w_1, w_2, \ldots, w_m$, where $w_t$ is the model in hand before example $(X_t, Y_t)$ arrives;
- compute the average model
  $$\bar{w} = \frac{1}{m} \sum_{t=1}^{m} w_t.$$
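To make the procedure concrete, here’s a minimal sketch of this loop for the squared loss. The $1/\sqrt{t}$ step size and the projection radius are illustrative assumptions on my part, not part of the recipe above; all the analysis needs is that the models stay bounded.

```python
import numpy as np

def ogd_with_averaging(X, y, radius=1.0):
    """Online gradient descent on the squared loss, returning the
    model sequence w_1, ..., w_m and the average model w_bar."""
    m, d = X.shape
    w = np.zeros(d)                            # w_1: the model before any example
    models = np.zeros((m, d))
    for t in range(m):
        models[t] = w                          # w_t depends only on examples 1..t-1
        grad = 2.0 * (w @ X[t] - y[t]) * X[t]  # gradient of (w^T x_t - y_t)^2 in w
        w = w - grad / np.sqrt(t + 1)          # decaying step size (an assumption)
        norm = np.linalg.norm(w)
        if norm > radius:                      # project back onto a bounded ball,
            w = w * (radius / norm)            # keeping the model sequence bounded
    return models, models.mean(axis=0)
```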
Aside on averaging. Online analysis gives insight into the behavior of the model sequence but not into any one particular model. So, we need to consolidate our sequence into something we can analyze. Moreover, averaging is a good way (think of Leo Breiman and bagging) to reduce variance without impacting an estimator’s bias.
Expectation bounds
We start by showing that the expected value of the loss of the average model is less than or equal to the average loss on the model sequence via Jensen’s inequality (the loss is convex in $w$):
$$\mathrm{risk}(\bar{w}) = \mathbb{E}\big[\ell(\bar{w}^\top X, Y)\big] \le \frac{1}{m} \sum_{t=1}^{m} \mathbb{E}\big[\ell(w_t^\top X, Y)\big] = \frac{1}{m} \sum_{t=1}^{m} \mathrm{risk}(w_t).$$
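In fact, since the squared loss is convex in $w$, the Jensen step holds pointwise for every single example $(x, y)$, before we take any expectation. A quick numeric sanity check, with a made-up model sequence and a made-up example:

```python
import numpy as np

rng = np.random.default_rng(1)
models = rng.normal(size=(50, 5))   # a made-up sequence w_1, ..., w_50
x = rng.normal(size=5)              # one arbitrary example
y = 0.3

w_bar = models.mean(axis=0)
loss_of_average = (w_bar @ x - y) ** 2
average_of_losses = np.mean((models @ x - y) ** 2)
assert loss_of_average <= average_of_losses   # Jensen, pointwise in (x, y)
```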
Cool, now we have something to compare to the best model in the class! But something feels funny because we have something i.i.d. and something sequential—martingales to the rescue!
Martingale differences
Martingales are perfectly equipped for taming our sequential process. In particular, we’ll define the conditional expectation up to time $t$ as
$$\mathbb{E}_t[\,\cdot\,] = \mathbb{E}\big[\,\cdot \mid (X_1, Y_1), \ldots, (X_{t-1}, Y_{t-1})\big].$$
For now, let’s focus on time $t$ and take the expectation
$$\mathbb{E}_t\big[\ell(w_t^\top X_t, Y_t) - \mathrm{risk}(w_t)\big] = 0,$$
where $w_t$ is the $t$th model and $\mathrm{risk}(w_t)$ is the expected loss on the next example. The expectation is zero because
- the first $t - 1$ examples are held fixed, meaning $w_t$ is a constant; and
- the expected loss, i.e., risk, at time $t$ is $\mathbb{E}_t\big[\ell(w_t^\top X_t, Y_t)\big] = \mathrm{risk}(w_t)$ because our data is i.i.d. from $\mathcal{D}$.
Now, we average across our $m$ examples,
$$\frac{1}{m} \sum_{t=1}^{m} Z_t, \qquad Z_t = \ell(w_t^\top X_t, Y_t) - \mathrm{risk}(w_t),$$
and the average has expectation zero because $\mathbb{E}_t[Z_t] = 0$ for each $t$. The random variables $Z_1, \ldots, Z_m$ form a martingale difference sequence, which means we can use martingale concentration results to bound the average. In particular, with probability at least $1 - \delta$,
$$\frac{1}{m} \sum_{t=1}^{m} \mathrm{risk}(w_t) \le \frac{1}{m} \sum_{t=1}^{m} \ell(w_t^\top X_t, Y_t) + \sqrt{\frac{2}{m} \ln \frac{1}{\delta}}.$$
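For reference, the concentration result in play is the Hoeffding–Azuma inequality. Here it is for losses bounded in $[0, 1]$ (so $|Z_t| \le 1$), an assumption the constant above implicitly relies on:

```latex
% Hoeffding–Azuma inequality for a martingale difference sequence
% Z_1, ..., Z_m with |Z_t| <= 1:
\Pr\left( \frac{1}{m} \sum_{t=1}^{m} Z_t \ge \varepsilon \right)
  \le \exp\left( -\frac{m \varepsilon^2}{2} \right)
% Setting the right-hand side to delta and solving for epsilon gives,
% with probability at least 1 - delta:
\frac{1}{m} \sum_{t=1}^{m} Z_t \le \sqrt{\frac{2}{m} \ln \frac{1}{\delta}}
```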
With this fact, we can bound the risk of the average model!
Bounding via online analysis
Thanks to our martingale concentration result, this is a plug-and-play step. So, with probability at least $1 - \delta$ with respect to the random draw of our sample $S$,
$$\mathrm{risk}(\bar{w}) \le \frac{1}{m} \sum_{t=1}^{m} \mathrm{risk}(w_t) \le \frac{1}{m} \sum_{t=1}^{m} \ell(w_t^\top X_t, Y_t) + \sqrt{\frac{2}{m} \ln \frac{1}{\delta}}.$$
Neural connection. This is the same object we had in the online learning analysis, because we have a loss rate on the first $m$ models in the sequence, and all we needed was a slightly modified (i.e., martingalified) concentration result!
Our last step is to use online analysis to bound the loss rate
$$\frac{1}{m} \sum_{t=1}^{m} \ell(w_t^\top X_t, Y_t),$$
which will give us something that we can compare to the empirical risk minimizer. To make things concrete, we’ll look at a specific example.
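Before diving in, here’s the online gradient descent guarantee we’ll lean on, in the standard form: assuming the comparator’s norm is at most $U$, gradient norms are at most $G$, and the step size is tuned accordingly ($U$ and $G$ are this refresher’s notation, not anything introduced above), for any fixed model $u$ with $\|u\| \le U$:

```latex
% Online gradient descent regret bound, divided through by m:
\frac{1}{m} \sum_{t=1}^{m} \ell(w_t^\top X_t, Y_t)
  \le \frac{1}{m} \sum_{t=1}^{m} \ell(u^\top X_t, Y_t) + \frac{U G}{\sqrt{m}}
```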
Linear regression with squared loss
We’re going to show that, for the squared loss $\ell(w^\top x, y) = (w^\top x - y)^2$, with probability at least $1 - \delta$,
$$\mathrm{risk}(\bar{w}) \le \mathrm{risk}(w^*) + \frac{c}{\sqrt{m}} \sqrt{\ln \frac{1}{\delta}}$$
for a constant $c$.
We’ll define $w^*$ as the minimizer of the risk $\mathbb{E}_{(X, Y) \sim \mathcal{D}}\big[\ell(w^\top X, Y)\big]$, which allows us to say that
$$\frac{1}{m} \sum_{t=1}^{m} \ell(w_t^\top X_t, Y_t) \le \frac{1}{m} \sum_{t=1}^{m} \ell({w^*}^\top X_t, Y_t) + O\!\left(\frac{1}{\sqrt{m}}\right),$$
where $w^*$ is the model with the smallest statistical risk over $\mathcal{D}$, not just over the sample $S$. So, $w^*$ may not have the smallest empirical risk on $S$. (Neural connection! This is the same trick we used in the Rademacher analysis in the statistical learning theory post!)
Via the Chernoff bound (an ordinary i.i.d. concentration step works here, since $w^*$ is a fixed model), with probability at least $1 - \delta$ we have that
$$\frac{1}{m} \sum_{t=1}^{m} \ell({w^*}^\top X_t, Y_t) \le \mathrm{risk}(w^*) + \sqrt{\frac{2}{m} \ln \frac{1}{\delta}}.$$
Chaining this last expression with the regret bound and the martingale bound (and taking a union bound over the two probabilistic steps) yields the promised bound on the average model:
$$\mathrm{risk}(\bar{w}) \le \mathrm{risk}(w^*) + \frac{c}{\sqrt{m}} \sqrt{\ln \frac{1}{\delta}},$$
where $c$ is a constant. This is the same bound obtained from the Rademacher analysis!
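To see this bound at work, here’s a small end-to-end experiment. The synthetic data model, dimensions, and step sizes are illustrative choices of mine, not anything from the lectures: it runs averaged online gradient descent on the squared loss (the same update as the sketch above) and compares held-out risk against the exact least-squares solution, i.e., the empirical risk minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 2000, 10
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)            # ||w_true|| = 1 by construction

def sample(n):
    X = rng.normal(size=(n, d)) / np.sqrt(d)
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

X_train, y_train = sample(m)
X_test, y_test = sample(10_000)             # large held-out set to estimate risk

# Averaged online gradient descent on the squared loss.
w = np.zeros(d)
avg = np.zeros(d)
for t in range(m):
    avg += w                                # accumulate w_t before the update
    grad = 2.0 * (w @ X_train[t] - y_train[t]) * X_train[t]
    w -= grad / np.sqrt(t + 1)              # decaying step size
    norm = np.linalg.norm(w)
    if norm > 1.0:                          # project onto the unit ball
        w /= norm
w_bar = avg / m

# Empirical risk minimizer: ordinary least squares on the same sample.
w_erm, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def risk(v):
    return np.mean((X_test @ v - y_test) ** 2)

print(f"risk(average model) = {risk(w_bar):.4f}")
print(f"risk(ERM)           = {risk(w_erm):.4f}")
```

The exact numbers vary by seed, but the averaged model’s risk should track the ERM’s up to the $O(1/\sqrt{m})$ slack the bound predicts.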
Final thoughts
- The average loss on a sequence of models is a good proxy for the risk of the average model.
- Local optimization performs well (up to constant factors) compared to empirical risk minimization, i.e., global optimization.
- So, local optimization isn’t doing anything crazy—it’s doing something quite similar to global optimization.
- In other words, doing local optimization and then averaging is a good and viable strategy, especially when we’re dealing with big problems.
Now, let’s revisit our questions.
- What does the online learning regret bound mean?
- In some sense, it means that we’re bounding an object not unlike the empirical risk.
- How can that bound be used?
- The online learning bound can be used to bound an average model, which is comparable to a model found via empirical risk minimization.
- How well does the loss rate on the model sequence generalize?
- Up to constant factors, it generalizes as well as the model found through empirical risk minimization.
Next up
… a short and sweet example of how strong convexity speeds things up in the online learning setting! Then a foray into the mind of R.A. Fisher with Brad Efron’s help.
References
This post is related to material from Nicolò Cesa-Bianchi’s lectures at the Bordeaux Machine Learning Summer School in 2011. Watch them here: MLSS 2011 lectures.