This post illustrates how Rademacher complexities get used in statistical learning theory. To that end, we'll assume we're working with the toolbox developed in the last post.

A quick recap

Last post, our main result was that

$$
\mathbb{E}\Big[\sup_{f \in \mathcal{F}} \big(R(f) - \hat{R}_n(f)\big)\Big] \;\le\; 2\,\mathcal{R}_n(\ell \circ \mathcal{F}),
$$

which allowed us to bound the statistical risk as

$$
R(f) \;\le\; \hat{R}_n(f) + 2\,\mathcal{R}_n(\ell \circ \mathcal{F}) + M\sqrt{\frac{\ln(1/\delta)}{2n}}
\qquad \text{simultaneously for all } f \in \mathcal{F},
$$

with probability at least $1 - \delta$. Here $R(f) = \mathbb{E}[\ell(f(X), Y)]$ is the risk, $\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i)$ is the empirical risk on the sample $(X_1, Y_1), \dots, (X_n, Y_n)$, $M$ bounds the range of the loss, and

$$
\mathcal{R}_n(\ell \circ \mathcal{F}) = \mathbb{E}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i\, \ell(f(X_i), Y_i)\Big]
$$

is the Rademacher complexity of the loss class, with $\sigma_1, \dots, \sigma_n$ i.i.d. uniform $\pm 1$ (Rademacher) random variables. In other words, Rademacher complexity helps us measure the tendency of a model class to overfit a sample.
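
To make that last sentence less abstract, here's a minimal numerical sketch (my own toy setup, not something from last post): it Monte Carlo–estimates the empirical Rademacher complexity of a small finite class of decision stumps. Watch how the class's ability to correlate with pure random signs shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(F_values, n_draws=2000):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ]
    for a finite class, given F_values[k, i] = f_k(x_i)."""
    num_f, n = F_values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # random signs
    correlations = sigma @ F_values.T / n               # shape (n_draws, num_f)
    return correlations.max(axis=1).mean()              # sup over f, average over sigma

# Toy class: decision stumps x -> sign(x - t) on n points in [0, 1].
for n in (20, 200, 2000):
    x = rng.uniform(0.0, 1.0, size=n)
    thresholds = np.linspace(0.0, 1.0, 101)
    stump_values = np.sign(x[None, :] - thresholds[:, None])  # shape (101, n)
    print(n, round(empirical_rademacher(stump_values), 3))
```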

Warm up

Let's Rademacherize our minds with a straightforward application of last post's bound. We'll assume that our loss function $\ell(\cdot, y)$ is $L$-Lipschitz in its first argument for every $y$. With this and this alone, we can show (via the contraction lemma) that

$$
\mathcal{R}_n(\ell \circ \mathcal{F})
= \mathbb{E}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i\, \ell(f(X_i), Y_i)\Big]
\;\le\; L\, \mathbb{E}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i\, f(X_i)\Big]
= L\, \mathcal{R}_n(\mathcal{F}).
$$

This is cool for at least two reasons. The first is that the right-hand side doesn't depend on the loss $\ell$ anymore—only on $\mathcal{F}$ (up to the Lipschitz constant $L$). The second is that the expectation no longer depends on the labels $Y_i$, but only on the inputs $X_i$.
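
Here's a quick sanity check of that inequality—just a sketch under assumptions of my own choosing: a finite class of random linear predictors and the absolute loss, which is $1$-Lipschitz in its first argument. Both sides are estimated by Monte Carlo over the Rademacher signs.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 200, 5
x = rng.normal(size=(n, d))
y = rng.normal(size=n)

# A small finite class of linear predictors f_k(x) = <w_k, x>.
W = rng.normal(size=(50, d))
preds = x @ W.T                       # preds[i, k] = f_k(x_i)

L = 1.0                               # the absolute loss is 1-Lipschitz in its first argument
losses = np.abs(preds - y[:, None])   # losses[i, k] = |f_k(x_i) - y_i|

def rademacher(values, n_draws=5000):
    """E_sigma[ sup_k (1/n) sum_i sigma_i * values[i, k] ], by Monte Carlo."""
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, values.shape[0]))
    return (sigma @ values / values.shape[0]).max(axis=1).mean()

lhs = rademacher(losses)              # R_n(loss o F): depends on the labels y
rhs = L * rademacher(preds)           # L * R_n(F): no loss, no labels in sight
print(f"R_n(l o F) ~ {lhs:.4f}  <=  L * R_n(F) ~ {rhs:.4f}")
```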

Linear regression with squared loss

To make these ideas concrete, we'll look at one of the most common models arising in statistics and machine learning: ordinary linear regression. Here's our setup.

  • Our class of functions is $\mathcal{F} = \{\, x \mapsto \langle w, x \rangle : w \in \mathbb{R}^d,\ \|w\|_2 \le \Lambda \,\}$, where the inputs satisfy $\|x\|_2 \le X$ and the labels satisfy $|y| \le Y$.
  • By assumption on $w$ and $x$, Cauchy–Schwarz gives $|\langle w, x \rangle| \le \|w\|_2 \|x\|_2 \le \Lambda X$.
  • Our loss function is the squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$, which on the range $|\hat{y}| \le \Lambda X$, $|y| \le Y$ is Lipschitz in $\hat{y}$ with constant $L = 2(\Lambda X + Y)$ (a quick numerical check follows this list).
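
The promised check—a minimal sketch with illustrative values of $d$, $\Lambda$, $X$, $Y$ (nothing canonical about them). It confirms the Cauchy–Schwarz bound on the predictions and spells out where the Lipschitz constant $L = 2(\Lambda X + Y)$ comes from.

```python
import numpy as np

rng = np.random.default_rng(2)

d, Lambda, X, Y = 10, 2.0, 1.0, 3.0   # illustrative values only

# Cauchy-Schwarz: |<w, x>| <= ||w|| * ||x|| <= Lambda * X.
w = rng.normal(size=d)
w *= Lambda / np.linalg.norm(w)       # put w on the boundary ||w|| = Lambda
x = rng.normal(size=d)
x *= X / np.linalg.norm(x)            # put x on the boundary ||x|| = X
assert abs(w @ x) <= Lambda * X + 1e-12

# On the range |yhat| <= Lambda * X, |y| <= Y, the squared loss (yhat - y)^2 has
# derivative 2 * (yhat - y), so it is Lipschitz in yhat with constant 2 * (Lambda*X + Y).
L = 2 * (Lambda * X + Y)
print("Lipschitz constant L =", L)
```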

We're interested in the Rademacher complexity of our function class on the sample $x_1, \dots, x_n$, i.e.,

$$
\mathcal{R}_n(\mathcal{F})
= \mathbb{E}\Big[\sup_{\|w\|_2 \le \Lambda} \frac{1}{n} \sum_{i=1}^n \sigma_i \langle w, x_i \rangle\Big]
= \frac{\Lambda}{n}\, \mathbb{E}\Big\|\sum_{i=1}^n \sigma_i x_i\Big\|_2
\;\le\; \frac{\Lambda}{n} \sqrt{\mathbb{E}\Big\|\sum_{i=1}^n \sigma_i x_i\Big\|_2^2}
\;\le\; \frac{\Lambda X}{\sqrt{n}}.
$$

(The first equality pulls the sum inside the inner product and takes the supremum over the ball via Cauchy–Schwarz; the first inequality is Jensen's.) The trickiest step is the last one, which exploits the independence of the Rademacher random variables and the bound on the norm of the $x_i$. It's easy to see if we write out the double sum

$$
\mathbb{E}\Big\|\sum_{i=1}^n \sigma_i x_i\Big\|_2^2
= \mathbb{E}\Big[\sum_{i=1}^n \sum_{j=1}^n \sigma_i \sigma_j \langle x_i, x_j \rangle\Big]
= \sum_{i=1}^n \|x_i\|_2^2
\;\le\; n X^2,
$$

because the $\sigma_i$ and $\sigma_j$ are zero-mean and independent for $i \ne j$, so all the cross terms vanish, while $\mathbb{E}[\sigma_i^2] = 1$.
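
Here's a numerical sketch of exactly this calculation, with illustrative values of $n$, $d$, $\Lambda$, $X$: it checks the double-sum identity $\mathbb{E}\|\sum_i \sigma_i x_i\|_2^2 = \sum_i \|x_i\|_2^2$ and the resulting bound $\mathcal{R}_n(\mathcal{F}) \le \Lambda X / \sqrt{n}$, using the closed form $\sup_{\|w\|_2 \le \Lambda} \langle w, v \rangle = \Lambda \|v\|_2$ for the $\ell_2$ ball.

```python
import numpy as np

rng = np.random.default_rng(3)

n, d, Lambda, X = 500, 10, 2.0, 1.0   # illustrative values only

# A fixed sample with ||x_i|| <= X.
x = rng.normal(size=(n, d))
x *= X * rng.uniform(size=(n, 1)) / np.linalg.norm(x, axis=1, keepdims=True)

sigma = rng.choice([-1.0, 1.0], size=(5000, n))          # Rademacher draws

# Double-sum identity: E||sum_i sigma_i x_i||^2 = sum_i ||x_i||^2,
# because the cross terms E[sigma_i sigma_j <x_i, x_j>] vanish for i != j.
lhs = (np.linalg.norm(sigma @ x, axis=1) ** 2).mean()
rhs = (np.linalg.norm(x, axis=1) ** 2).sum()
print(f"E||sum sigma_i x_i||^2 ~ {lhs:.1f}   vs   sum_i ||x_i||^2 = {rhs:.1f}")

# For the l2 ball, the sup inside the Rademacher complexity has a closed form
# (Cauchy-Schwarz): sup_{||w|| <= Lambda} <w, v> = Lambda * ||v||.
emp_rad = (Lambda / n) * np.linalg.norm(sigma @ x, axis=1).mean()
print(f"R_n(F) ~ {emp_rad:.4f}  <=  Lambda * X / sqrt(n) = {Lambda * X / np.sqrt(n):.4f}")
```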

Putting it all together

From last post and the above calculation, we conclude that with probability at least $1 - \delta$,

$$
R(\hat{w}) \;\le\; \hat{R}_n(\hat{w}) + \frac{4(\Lambda X + Y)\,\Lambda X}{\sqrt{n}} + (\Lambda X + Y)^2 \sqrt{\frac{\ln(1/\delta)}{2n}}
$$

for empirical risk minimization, since the squared loss is bounded by $M = (\Lambda X + Y)^2$ on our range and $2 L\, \mathcal{R}_n(\mathcal{F}) \le 4(\Lambda X + Y)\, \Lambda X / \sqrt{n}$. Hot diggity!
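
If you like concrete numbers, here's what the bound looks like when you plug in some illustrative values for $n$, $\Lambda$, $X$, $Y$, and $\delta$ (my numbers, not anything canonical):

```python
import numpy as np

# Plugging illustrative numbers into the bound above.
n, Lambda, X, Y, delta = 10_000, 2.0, 1.0, 3.0, 0.05

L = 2 * (Lambda * X + Y)        # Lipschitz constant of the squared loss on our range
M = (Lambda * X + Y) ** 2       # bound on the loss, which scales the deviation term

rad_term = 2 * L * Lambda * X / np.sqrt(n)            # 2 * L * R_n(F), with R_n(F) <= Lambda*X/sqrt(n)
dev_term = M * np.sqrt(np.log(1 / delta) / (2 * n))   # the high-probability (McDiarmid) term

print(f"risk <= empirical risk + {rad_term + dev_term:.3f}")
```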

Next up

… the online learning perspective!

References

This post is related to material from Nicolò Cesa-Bianchi’s lectures at the Bordeaux Machine Learning Summer School in 2011. Watch them here: MLSS 2011 lectures.