Learning theory - statistical learning
Today’s post is the first in a set of three that will discuss two theories of learning. In particular, today’s post will cover (a very narrow aspect of) statistical learning theory. The next post will present a game-theoretic view of learning, using online learning as its primary analytical framework. The final post will connect the two perspectives.
A bit of background
Learning problems, specifically machine learning problems, can be analyzed from at least two perspectives. In either case, our goal is to have a toolbox with which we can analyze how well a learning system works. For example, we want to be able to say when learning occurs, when it doesn't, and how quickly it does (or doesn't). To make this problem tractable, we'll require
- a model for the data;
- assumptions on the source of that data;
- functions to generate predictions;
- loss functions to assess the quality of our predictions; and
- learning algorithms to help us improve our predictions by understanding our losses.
The set-up
We assume that our data consists of feature-output pairs $(x, y)$ with $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. The data comes from some fixed but unknown source that could be stochastic (as in the statistical learning framework) or nonstochastic (as in the online learning framework).
We use a function $f \colon \mathcal{X} \to \mathcal{Y}$ that maps data to estimates/forecasts/predictions. So, $\hat{y} = f(x)$ gives us a mechanism to generate our own $\hat{y}$'s that we can compare against the true $y$'s; we assess our estimates via a loss function $\ell(\hat{y}, y)$. You probably have many loss functions that you know and love (a quick numerical sketch of a few of them follows the list)…
- squared-error loss: $\ell(\hat{y}, y) = (\hat{y} - y)^2$.
- absolute-error: $\ell(\hat{y}, y) = |\hat{y} - y|$.
- Kullback–Leibler: $\ell(p, q) = \sum_{i} p_i \log \frac{p_i}{q_i}$, for distributions $p$ and $q$.
- hinge loss: $\ell(\hat{y}, y) = \max\{0, 1 - y\hat{y}\}$, for $y \in \{-1, +1\}$.
- logistic loss: $\ell(\hat{y}, y) = \log\left(1 + e^{-y\hat{y}}\right)$, for $y \in \{-1, +1\}$.
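To make these concrete, here is a minimal numerical sketch of the losses above (the function names and the NumPy-based implementations are our own, not from any particular library):

```python
import numpy as np

def squared_error(y_hat, y):
    return (y_hat - y) ** 2

def absolute_error(y_hat, y):
    return np.abs(y_hat - y)

def kl_divergence(p, q):
    # p and q are discrete probability vectors on the same support
    return np.sum(p * np.log(p / q))

def hinge(y_hat, y):
    # y in {-1, +1}, y_hat a real-valued score
    return np.maximum(0.0, 1.0 - y * y_hat)

def logistic(y_hat, y):
    # y in {-1, +1}, y_hat a real-valued score
    return np.log1p(np.exp(-y * y_hat))

# Example: a score of 0.3 on a positive example
print(hinge(0.3, 1), logistic(0.3, 1))  # 0.7 and roughly 0.55
```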
The final ingredient is a learning algorithm, which we can think of as a map from data to models. This is the way that V. Vapnik and A. Chervonenkis conceptualized learning algorithms, and we think that it's a useful mental model.
Before diving into either framework, let's give ourselves a learning goal: our algorithms should output good models with respect to the data source.
Statistical learning theory
In the statistical learning setting, we assume that our data is generated by a probability (i.e., stochastic) model with distribution $\mathcal{D}$. In particular, we'll assume
$$(x_1, y_1), \ldots, (x_n, y_n) \overset{\text{i.i.d.}}{\sim} \mathcal{D}.$$
We are interested in learning by controlling the statistical risk
$$R(f) = \mathbb{E}\big[\ell(f(x), y)\big],$$
where the expectation is over the randomness in the data and the model $f$ belongs to some class of models $\mathcal{F}$. The class of models is a design parameter that we specify. For example, we could choose a set of linear predictors
$$\mathcal{F} = \big\{ f(x) = w^\top x : w \in \mathbb{R}^d \big\}.$$
Since we're restricting our models to the class $\mathcal{F}$, we're really interested in controlling our risk over that class. So we want a model
$$f^* = \operatorname*{arg\,min}_{f \in \mathcal{F}} R(f).$$
But this is statistics, so we don't have access to $\mathcal{D}$ at the population level. Instead, we have knowledge of $\mathcal{D}$ through the samples $(x_1, y_1), \ldots, (x_n, y_n)$. Hmmm…
Empirical risk minimization
As statistically-minded individuals, we'll use the empirical risk
$$\hat{R}(f) = \frac{1}{n} \sum_{t=1}^{n} \ell(f(x_t), y_t)$$
as a surrogate for the population risk and pick our model as
$$\hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \hat{R}(f).$$
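As a small sketch (assuming the squared-error loss and the linear class above, with made-up data), empirical risk minimization reduces to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stochastic source: y = w_true . x + noise
n, d = 200, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

# Empirical risk of a linear predictor f(x) = w . x under squared-error loss
def empirical_risk(w, X, y):
    return np.mean((X @ w - y) ** 2)

# ERM over the linear class: argmin_w (1/n) * sum_t (w . x_t - y_t)^2, i.e., least squares
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("empirical risk at the ERM solution:", empirical_risk(w_hat, X, y))
print("empirical risk at the true w:      ", empirical_risk(w_true, X, y))
```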
We want the empirical risk to be a good proxy for the statistical risk. More specifically, we want it to be uniformly close to the statistical risk. By uniformly close, we mean that we're close over all models within our class and all data.
This might seem a bit extreme, but it prevents us from being lulled into believing we have a “good” model by getting easy training data or having an overly complex model class—think interpolating the data.
Rademacher complexity
To help us assess our algorithms and models, we turn to Rademacher averages, a Vapnik–Chervonenkis-type notion of complexity that (also) has its origins in the Russian school of statistics (~1970s). We define the empirical Rademacher average
$$\hat{\mathcal{R}}_n(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \sigma_t f(x_t)\right]$$
for a class of functions $\mathcal{F}$ and samples $x_1, \ldots, x_n$, where $\sigma_1, \ldots, \sigma_n$ are i.i.d. Rademacher random variables taking the values $-1$ and $+1$ with equal probability. Before moving on, note that the empirical Rademacher average does not depend on $\mathcal{D}$, but only on the data. This will prove useful in the sequel.
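For intuition, here is a small Monte Carlo sketch of the empirical Rademacher average for a toy finite class of threshold functions on the line; the class, the sample, and the helper names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# A fixed sample x_1, ..., x_n and a finite class of threshold functions
# f_b(x) = sign(x - b) for a handful of thresholds b.
x = rng.uniform(-1, 1, size=50)
thresholds = np.linspace(-1, 1, 21)
F = np.sign(x[None, :] - thresholds[:, None])  # shape (|F|, n): row b holds (f_b(x_1), ..., f_b(x_n))

def empirical_rademacher(F, num_draws=5000):
    # Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_t sigma_t f(x_t) ]
    n = F.shape[1]
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        total += np.max(F @ sigma) / n           # sup over the (finite) class
    return total / num_draws

# Depends only on the data x, not on the distribution that produced it
print("empirical Rademacher average:", empirical_rademacher(F))
```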
Aside: see the Rademachers are rad - part I and Rademachers are rad - part II posts for Rademacher fun in the spirit of Rademacher complexity.
Assessing empirical risk
Okay, now we're ready to assess how well empirical risk minimization works by studying the difference
$$R(\hat{f}) - R(f^*),$$
which is the difference in risk between our selected model and the optimal model in the class; note that we don't have an empirical risk term… yet.
Nonetheless, here goes. We can add and subtract the empirical risk, getting
$$R(\hat{f}) - R(f^*) = \big(R(\hat{f}) - \hat{R}(\hat{f})\big) + \big(\hat{R}(\hat{f}) - R(f^*)\big) \le \big(R(\hat{f}) - \hat{R}(\hat{f})\big) + \big(\hat{R}(f^*) - R(f^*)\big).$$
The first term, $R(\hat{f}) - \hat{R}(\hat{f})$, is a comparison of the selected model on the whole distribution versus the sample. The second term is the error of using the "best" model $f^*$ on the training sample versus the whole distribution. The inequality follows from the fact that $\hat{R}(\hat{f}) \le \hat{R}(f^*)$, which holds by definition of $\hat{f}$ as the empirical risk minimizer. Also, we can interpret the first term as a measurement of how much the model overfits the data. And, in some sense, this is what we want to understand!
High probability bounds
Both terms on the righthand side are random variables because they depend on the training data. So via Chernoff bounds (which we're excluding from the current discussion), we can say that with probability at least $1 - \delta$,
$$R(\hat{f}) - R(f^*) \le \mathbb{E}\left[\sup_{f \in \mathcal{F}} \big(R(f) - \hat{R}(f)\big)\right] + \mathbb{E}\big[\hat{R}(f^*) - R(f^*)\big] + O\!\left(\sqrt{\frac{\ln(1/\delta)}{n}}\right).$$
The expected value of the second term is 0 because the expected value of the empirical risk, by our i.i.d. assumption, is the statistical risk. You can think of this term as a bias-like term. The first term is more interesting; it may have caused your Rademacher neurons to fire, but it’s not quite a Rademacher average… yet.
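As a quick sanity check on the second term (a sketch with an invented distribution and a fixed predictor, chosen only for illustration): for a fixed $f$, the empirical risk is an i.i.d. average whose mean is the statistical risk, so its deviation is centered at zero and shrinks like $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed predictor f(x) = 0 under squared-error loss, with y ~ N(0, 1),
# so the statistical risk is R(f) = E[(0 - y)^2] = 1.
def empirical_minus_true_risk(n, reps=2000):
    y = rng.normal(size=(reps, n))
    return np.mean(y ** 2, axis=1) - 1.0  # hat-R(f) - R(f), one value per replication

for n in [100, 1000, 10000]:
    dev = empirical_minus_true_risk(n)
    print(n, "mean deviation:", round(dev.mean(), 4), "spread:", round(dev.std(), 4))
```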
The ghost sample
Now, we're going to use one of "modern" statistics' favorite tricks: the ghost sample. The ghost sample allows us to replace a (population) quantity by a sample quantity by pretending that we have access to a second, made-up sample $(x_1', y_1'), \ldots, (x_n', y_n')$ from the distribution; it's like we have a second training set. Write $\hat{R}'(f)$ for the empirical risk of $f$ on the ghost sample. By the ghost sample trick (first equality) and Jensen's inequality, our term is bounded as
$$\mathbb{E}\left[\sup_{f \in \mathcal{F}} \big(R(f) - \hat{R}(f)\big)\right] = \mathbb{E}\left[\sup_{f \in \mathcal{F}} \Big(\mathbb{E}\big[\hat{R}'(f)\big] - \hat{R}(f)\Big)\right] \le \mathbb{E}\left[\sup_{f \in \mathcal{F}} \big(\hat{R}'(f) - \hat{R}(f)\big)\right].$$
Okay, cool. The righthand side is the difference between two empirical risk quantities, one for the original training sample and the other for the ghost sample. We’re getting closer to bounding the overfitting term…
Rademacher magic
Now, we introduce Rademacher random variables into the problem, which will allow us to bound the overfitting. Think of the Rademachers as randomly swapping an example from the training set with the corresponding example from the ghost sample. We want to control how much the risk changes when we exchange examples, and we do so via the symmetrized quantity
$$\mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \sigma_t \big(\ell_t'(f) - \ell_t(f)\big)\right],$$
where $\sigma_1, \ldots, \sigma_n$ are i.i.d. Rademacher random variables, $\ell_t(f) = \ell(f(x_t), y_t)$ is the loss for $f$ evaluated on example $(x_t, y_t)$, and $\ell_t'(f)$ is the analogous loss on the ghost example $(x_t', y_t')$. (Swapping the training and ghost examples at any position $t$ does not change the joint distribution, so this quantity equals the bound above.) Now let's manipulate this quantity, first distributing terms and then exchanging the supremum with the sum to get an inequality:
$$\mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \sigma_t \big(\ell_t'(f) - \ell_t(f)\big)\right] \le \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \sigma_t \ell_t'(f)\right] + \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} (-\sigma_t) \ell_t(f)\right].$$
So, this last calculation, the fact that $-\sigma_t$ has the same distribution as $\sigma_t$, and the fact that the ghost sample has the same distribution as the training sample, imply that
$$\mathbb{E}\left[\sup_{f \in \mathcal{F}} \big(R(f) - \hat{R}(f)\big)\right] \le 2\, \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \sigma_t \ell_t(f)\right] = 2\, \mathbb{E}\big[\hat{\mathcal{R}}_n(\ell \circ \mathcal{F})\big],$$
where $\ell \circ \mathcal{F} = \{ (x, y) \mapsto \ell(f(x), y) : f \in \mathcal{F} \}$ is the class of loss functions induced by $\mathcal{F}$. In other words, the overfitting term is controlled by (twice) the expected Rademacher average of the loss class.
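To close the loop, here is a rough numerical sketch (toy threshold class, invented distribution, 0-1 loss) that estimates both sides of this bound by Monte Carlo; everything here is an illustration we made up, not part of the derivation itself:

```python
import numpy as np

rng = np.random.default_rng(3)

thresholds = np.linspace(-1, 1, 21)  # finite class f_b(x) = sign(x - b)
n = 50

def losses(x, y):
    # 0-1 losses of every threshold classifier on the sample; shape (|F|, n)
    preds = np.sign(x[None, :] - thresholds[:, None])
    return (preds != y[None, :]).astype(float)

# For x ~ Uniform(-1, 1) and y = sign(x), the true risk of f_b is |b| / 2
true_risk = np.abs(thresholds) / 2.0

def one_draw():
    x = rng.uniform(-1, 1, size=n)
    y = np.sign(x)
    L = losses(x, y)                        # loss of each f on each example
    emp_risk = L.mean(axis=1)
    overfit = np.max(true_risk - emp_risk)  # sup_f ( R(f) - hat-R(f) )
    sigma = rng.choice([-1.0, 1.0], size=n)
    rad = np.max(L @ sigma) / n             # one sigma draw of the Rademacher sup for the loss class
    return overfit, rad

draws = np.array([one_draw() for _ in range(3000)])  # averaging over draws estimates both expectations
print("E[ sup_f (R - hat-R) ] ~", round(draws[:, 0].mean(), 3))
print("2 * E[ Rademacher ]    ~", round(2 * draws[:, 1].mean(), 3))
```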
Next up
… a short-and-sweet example of the material from this post and then the online learning perspective!
References
This post is related to material from Nicolò Cesa-Bianchi’s lectures at the Bordeaux Machine Learning Summer School in 2011. Watch them here: MLSS 2011 lectures.