Statistical learning theory - complexity example
This post illustrates how we use Rademacher complexities in statistical learning theory. To that end, we’ll assume that we’re working with the toolbox we developed in the last post.
A quick recap
Last post, our main result was that

$$\mathbb{E}\left[\sup_{f \in \mathcal{F}} \Big( R(f) - \hat{R}_n(f) \Big)\right] \;\le\; 2\,\mathcal{R}_n(\ell \circ \mathcal{F}),$$

which allowed us to bound the statistical risk as

$$R(f) \;\le\; \hat{R}_n(f) \;+\; 2\,\mathcal{R}_n(\ell \circ \mathcal{F}) \;+\; M\sqrt{\frac{\log(1/\delta)}{2n}} \qquad \text{simultaneously for all } f \in \mathcal{F},$$

with probability at least $1-\delta$, where $R$ is the statistical risk, $\hat{R}_n$ the empirical risk on $n$ samples, $\mathcal{R}_n$ the Rademacher complexity, and $M$ an upper bound on the loss. In other words, Rademacher complexity helps us measure the tendency of a model class to overfit a sample.
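If you like to watch definitions run, here is a minimal numerical sketch of how one could estimate the empirical Rademacher complexity of a small finite class by Monte Carlo. Everything in it (the helper name `empirical_rademacher`, the toy class, the sample) is an illustrative choice of mine, not something from the last post.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(preds, n_draws=10_000, rng=rng):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ].

    preds has shape (num_functions, n): row f holds the values f(x_1), ..., f(x_n)
    of one function in a finite class, evaluated on a fixed sample.
    """
    n = preds.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # i.i.d. Rademacher signs
    # For each draw of signs, take the sup over the class, then average over draws.
    return (sigma @ preds.T / n).max(axis=1).mean()

# Toy finite class evaluated on n = 50 fixed sample points.
n = 50
x = rng.normal(size=n)
preds = np.stack([np.sign(x), np.tanh(x), 0.5 * x])
print(empirical_rademacher(preds))
```

The bigger the number this prints, the more easily the class can chase random signs on the sample, which is exactly the overfitting tendency the bound above charges us for.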
Warm up
Let’s Rademacherize our minds with a straightforward application of last post’s bound. We’ll assume that our loss functions $\ell(\cdot, y)$ are $L$-Lipschitz in their first argument. With this and this alone, we can show that

$$\mathcal{R}_n(\ell \circ \mathcal{F}) \;\le\; L\,\mathcal{R}_n(\mathcal{F}).$$

This is cool for at least two reasons. The first is that the righthand side doesn’t depend on the loss $\ell$ anymore (except through the constant $L$), only on $\mathcal{F}$. The second is that the expectation no longer depends on the labels $y_i$, but only on the inputs $x_i$.
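To convince yourself the inequality isn’t lying, here is a small sanity check with a toy finite class and the absolute loss, which is $1$-Lipschitz in its first argument. The class, the data, and the helper are all invented for illustration; the check only confirms that the Monte Carlo estimate of the left side sits below the right side.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_rademacher(values, n_draws=20_000, rng=rng):
    """E_sigma[ sup_f (1/n) sum_i sigma_i * values[f, i] ], estimated by Monte Carlo."""
    n = values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return (sigma @ values.T / n).max(axis=1).mean()

# Toy class {x -> a*x : a in [-1, 1]} (discretized) and the 1-Lipschitz absolute loss.
n = 100
x = rng.normal(size=n)
y = np.sin(x) + 0.1 * rng.normal(size=n)
preds = np.stack([a * x for a in np.linspace(-1.0, 1.0, 21)])  # rows: f(x_1), ..., f(x_n)
L = 1.0

lhs = empirical_rademacher(np.abs(preds - y))   # complexity of the loss class
rhs = L * empirical_rademacher(preds)           # L times the complexity of F
print(lhs, rhs)  # lhs should sit below rhs, up to Monte Carlo noise
```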
Linear regression with squared loss
To make these ideas concrete, we’ll look at one of the most common models arising in statistics / machine learning: ordinary linear regression. Here’s our setup.
- Our class of functions is $\mathcal{F} = \{x \mapsto \langle w, x\rangle\}$, where $\|w\|_2 \le W$ and $\|x\|_2 \le X$.
- By assumption on $w$ and $x$, we have $|\langle w, x\rangle| \le \|w\|_2\,\|x\|_2 \le WX$ by Cauchy–Schwarz.
- Our loss function is the squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$, with $|y| \le WX$. On this bounded domain the squared loss is $L$-Lipschitz in $\hat{y}$ with $L = 4WX$ (its derivative is $2(\hat{y} - y)$, which is at most $4WX$ in absolute value; see the quick check below), so the warm-up applies.
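A two-line numerical check of that Lipschitz claim, over a grid of the bounded domain (the grid and the constants are arbitrary illustrative choices):

```python
import numpy as np

W, X = 1.0, 1.0
grid = np.linspace(-W * X, W * X, 401)
y_hat, y = np.meshgrid(grid, grid)       # both arguments range over [-W*X, W*X]
slope = np.abs(2 * (y_hat - y))          # |d/d y_hat (y_hat - y)^2|
print(slope.max(), 4 * W * X)            # max slope stays below 4*W*X
```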
We’re interested in the Rademacher complexity of our function class on samples $x_1, \dots, x_n$, i.e.,

$$\mathcal{R}_n(\mathcal{F}) \;=\; \mathbb{E}\left[\sup_{\|w\|_2 \le W} \frac{1}{n}\sum_{i=1}^n \sigma_i \langle w, x_i\rangle\right] \;=\; \frac{W}{n}\,\mathbb{E}\left\|\sum_{i=1}^n \sigma_i x_i\right\|_2 \;\le\; \frac{W}{n}\sqrt{\mathbb{E}\left\|\sum_{i=1}^n \sigma_i x_i\right\|_2^2} \;=\; \frac{W}{n}\sqrt{\sum_{i=1}^n \|x_i\|_2^2} \;\le\; \frac{WX}{\sqrt{n}}.$$

(The supremum over the ball is evaluated with Cauchy–Schwarz, and the first inequality is Jensen’s.) The trickiest step is the one that exploits the independence of the Rademacher random variables and the bound on the norm of the $x_i$. It’s easy to see if we write out the double sum

$$\mathbb{E}\left\|\sum_{i=1}^n \sigma_i x_i\right\|_2^2 \;=\; \sum_{i=1}^n \sum_{j=1}^n \mathbb{E}[\sigma_i \sigma_j]\,\langle x_i, x_j\rangle \;=\; \sum_{i=1}^n \|x_i\|_2^2 \;\le\; nX^2,$$

because the $\sigma_i$ and $\sigma_j$ are independent for $i \ne j$, so all the cross terms vanish in expectation.
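Since the supremum over the ball $\{\|w\|_2 \le W\}$ has the closed form $\sup_{\|w\|_2 \le W}\langle w, v\rangle = W\|v\|_2$, we can even check the $WX/\sqrt{n}$ bound numerically. Here is a small sketch; the data and constants are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, W = 200, 5, 2.0
x = rng.normal(size=(n, d))
x /= np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1.0)  # force ||x_i|| <= 1
X = np.linalg.norm(x, axis=1).max()

# sup_{||w|| <= W} (1/n) sum_i sigma_i <w, x_i>  =  (W/n) * || sum_i sigma_i x_i ||
n_draws = 20_000
sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
rad_hat = (W / n) * np.linalg.norm(sigma @ x, axis=1).mean()

print(f"Monte Carlo estimate of R_n(F): {rad_hat:.4f}")
print(f"Bound W * X / sqrt(n):          {W * X / np.sqrt(n):.4f}")
```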
Putting it all together
From last post and the above calculation, we conclude that with probability at least $1-\delta$,

$$R(\hat{f}) \;\le\; \hat{R}_n(\hat{f}) \;+\; 2L\,\mathcal{R}_n(\mathcal{F}) \;+\; M\sqrt{\frac{\log(1/\delta)}{2n}} \;\le\; \hat{R}_n(\hat{f}) \;+\; \frac{8W^2X^2}{\sqrt{n}} \;+\; 4W^2X^2\sqrt{\frac{\log(1/\delta)}{2n}}$$

for empirical risk minimization, where we used $L = 4WX$ and the fact that the squared loss is bounded by $M = 4W^2X^2$ on our domain. Hot diggity!
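Because it’s fun to watch a bound do its job, here is a quick experiment comparing the actual test-minus-train gap of a norm-constrained least-squares fit against the two $1/\sqrt{n}$ terms above. The data-generating process, the projection step, and all the constants are invented for illustration; the true ERM over the ball would need a constrained solver, so the projected fit below is just a convenient member of the class (the uniform bound covers it all the same).

```python
import numpy as np

rng = np.random.default_rng(3)

# Constants matching the setup above: ||w|| <= W, ||x|| <= X, |y| <= W * X.
W, X, d, delta = 1.0, 1.0, 10, 0.05
L = 4 * W * X            # Lipschitz constant of the squared loss on the bounded domain
M = 4 * W ** 2 * X ** 2  # upper bound on the loss, used in the concentration term

def bound(n):
    """The two deviation terms from the bound above, evaluated at sample size n."""
    return 2 * L * W * X / np.sqrt(n) + M * np.sqrt(np.log(1 / delta) / (2 * n))

def gap(n, n_test=50_000):
    """Test risk minus training risk for a norm-constrained least-squares fit."""
    w_star = rng.normal(size=d)
    w_star *= W / np.linalg.norm(w_star)

    def sample(m):
        x = rng.normal(size=(m, d))
        x *= X / np.linalg.norm(x, axis=1, keepdims=True)      # put each x_i on the radius-X sphere
        y = np.clip(x @ w_star + 0.1 * rng.normal(size=m), -W * X, W * X)
        return x, y

    x_tr, y_tr = sample(n)
    x_te, y_te = sample(n_test)
    w_hat = np.linalg.lstsq(x_tr, y_tr, rcond=None)[0]          # unconstrained fit ...
    if np.linalg.norm(w_hat) > W:                               # ... projected back onto the ball
        w_hat *= W / np.linalg.norm(w_hat)
    return np.mean((x_te @ w_hat - y_te) ** 2) - np.mean((x_tr @ w_hat - y_tr) ** 2)

for n in (100, 1_000, 10_000):
    print(f"n = {n:6d}   gap = {gap(n):+.4f}   bound = {bound(n):.4f}")
```

With high probability over the training sample, the printed gap should sit comfortably below the bound, and both shrink at the $1/\sqrt{n}$ rate.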
Next up
… the online learning perspective!
References
This post is related to material from Nicolò Cesa-Bianchi’s lectures at the Bordeaux Machine Learning Summer School in 2011. Watch them here: MLSS 2011 lectures.