In this post, we’re going to talk about statistical decision theory, which is a topic with its origins in the 1950s. Decision theory is cool because it gives us a way to compare statistical procedures. Moreover, it connects frequentist and Bayesian inference.

This post sets us on the path towards minimax rules and estimators. Caveat: In no way is this an attempt to fully cover this topic—it’s huge and philosophically deep.

A bit of background

Let’s say we have a parameter $\theta$ that lives in some parameter space $\Theta$. We’re going to estimate our parameter of interest with $\hat{\theta}$. Since we’re in the business of comparing estimators, we need a way to assess their quality. We’ll do this using loss functions.

A loss function $L(\theta, \hat{\theta})$ measures the discrepancy between $\theta$ and $\hat{\theta}$. Formally,

$$L : \Theta \times \Theta \to [0, \infty).$$

Some common loss functions are:

  • squared-error loss: $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$.
  • absolute-error: $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$.
  • Kullback–Leibler: $L(\theta, \hat{\theta}) = \int \log\left(\frac{f(x; \theta)}{f(x; \hat{\theta})}\right) f(x; \theta)\, dx$.
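
To make these concrete, here is a minimal Python sketch (my own illustration, not from the original post). The Kullback–Leibler version is specialized to a Bernoulli model, an assumption I make here so the integral collapses to a closed form.

```python
import numpy as np

def squared_error_loss(theta, theta_hat):
    return (theta - theta_hat) ** 2

def absolute_error_loss(theta, theta_hat):
    return abs(theta - theta_hat)

def kl_loss_bernoulli(theta, theta_hat):
    # KL divergence from Bernoulli(theta) to Bernoulli(theta_hat); this closed
    # form is an illustrative special case of the general integral above.
    return theta * np.log(theta / theta_hat) + (1 - theta) * np.log((1 - theta) / (1 - theta_hat))

print(squared_error_loss(0.5, 0.4), absolute_error_loss(0.5, 0.4), kl_loss_bernoulli(0.5, 0.4))
```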

Now we have some tools to assess the quality of our estimators, but we still have some work to do because our estimators (most likely) depend on random variables. To remove the randomness, we’ll take an average, which will give us the risk of our estimator. The risk of $\hat{\theta}$ is given by

$$R(\theta, \hat{\theta}) = \mathbb{E}_\theta\big[L(\theta, \hat{\theta})\big],$$
where the expectation is over the data. (The subscript $\theta$ on the expectation is just a way for us to index our distribution; in other words, we can read it as “the expectation when the parameter is $\theta$.”) The risk quantifies how good or bad our estimator is, where “low risk” is good and “high risk” is bad.
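
To ground this, here is a small Monte Carlo sketch (my own, assuming a Bernoulli model like the one in the warm-up below): it approximates the risk of an estimator under squared-error loss by averaging the loss over many simulated datasets. The sample size, number of simulations, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_risk(theta, estimator, n=20, n_sims=100_000):
    data = rng.binomial(1, theta, size=(n_sims, n))   # n_sims Bernoulli(theta) datasets of size n
    estimates = estimator(data)                        # one estimate per dataset
    return np.mean((estimates - theta) ** 2)           # average squared-error loss

# Example: risk of the sample mean at theta = 0.3 (should be near 0.3 * 0.7 / 20 = 0.0105).
print(monte_carlo_risk(0.3, lambda x: x.mean(axis=1)))
```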

If we took an additional expectation across the parameter (your Bayesian neurons should be lighting up), then we have the Bayes risk of the estimator:

$$r(\pi, \hat{\theta}) = \int_\Theta R(\theta, \hat{\theta})\, \pi(\theta)\, d\theta,$$
where $\pi(\theta)$ is a prior distribution for $\theta$. But don’t get too attached to the name because this is an expectation over the data (frequentist) and over the parameter (Bayesian).

A true Bayesian statistician would be more interested in the posterior risk

$$r(\hat{\theta} \mid x) = \int_\Theta L(\theta, \hat{\theta})\, \pi(\theta \mid x)\, d\theta,$$
where $\pi(\theta \mid x)$ is the posterior density of $\theta$.

So, now we have three notions of risk at our disposal:

  • the (regular) risk, which has a frequentist feel;
  • the posterior risk, which has a Bayesian feel; and
  • the Bayes risk, which has an independent-of-statistical-philosophy feel.
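
To see the three notions side by side, here is a hedged sketch (my own example, not from the post): it approximates the Bayes risk of the sample mean under a Beta prior by Monte Carlo, and computes the posterior risk of the sample-mean estimate for one observed dataset. The Beta(2, 2) prior, the sample size, and squared-error loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n = 2.0, 2.0, 20                       # Beta(2, 2) prior and sample size (arbitrary choices)

def risk_sample_mean(theta, n):
    # Exact frequentist risk of the sample mean under squared-error loss.
    return theta * (1 - theta) / n

# Bayes risk: average the frequentist risk over draws of theta from the prior.
thetas = rng.beta(a, b, size=200_000)
bayes_risk = np.mean(risk_sample_mean(thetas, n))     # exactly 0.01 for this prior and n

# Posterior risk: condition on one observed dataset. With a Beta prior the posterior is
# Beta(a + sum(x), b + n - sum(x)); under squared-error loss the posterior risk of a point
# estimate equals the posterior variance plus (posterior mean - estimate)^2.
x = rng.binomial(1, 0.3, size=n)                      # one dataset from an arbitrary "true" p
a_post, b_post = a + x.sum(), b + n - x.sum()
post_mean = a_post / (a_post + b_post)
post_var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
posterior_risk = post_var + (post_mean - x.mean()) ** 2

print(bayes_risk, posterior_risk)
```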

Neural connection! When we use the squared-error loss (also known as the $\ell_2$-loss or the mean squared error), we can write the risk using the bias and variance of our estimator $\hat{\theta}$. We’ll use the “add-and-subtract” trick to show this relationship. Let $\bar{\theta} = \mathbb{E}_\theta\big[\hat{\theta}\big]$; then

$$
\begin{aligned}
R(\theta, \hat{\theta}) &= \mathbb{E}_\theta\big[(\hat{\theta} - \theta)^2\big] \\
&= \mathbb{E}_\theta\big[(\hat{\theta} - \bar{\theta} + \bar{\theta} - \theta)^2\big] \\
&= \mathbb{E}_\theta\big[(\hat{\theta} - \bar{\theta})^2\big] + 2(\bar{\theta} - \theta)\,\mathbb{E}_\theta\big[\hat{\theta} - \bar{\theta}\big] + (\bar{\theta} - \theta)^2 \\
&= \operatorname{Var}_\theta\big(\hat{\theta}\big) + \operatorname{bias}\big(\hat{\theta}\big)^2,
\end{aligned}
$$

where the cross term vanishes because $\mathbb{E}_\theta\big[\hat{\theta} - \bar{\theta}\big] = 0$.
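
As a quick numerical check of this decomposition (my own sketch, not from the post, using an arbitrary biased estimator of a Bernoulli success probability), the simulated risk should match the exact variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, n_sims = 0.3, 20, 200_000                # arbitrary true parameter, sample size, replications

data = rng.binomial(1, p, size=(n_sims, n))
estimates = (data.sum(axis=1) + 1) / (n + 2)   # a deliberately biased estimator (shrinks toward 1/2)

mc_risk = np.mean((estimates - p) ** 2)        # Monte Carlo estimate of the risk

exact_var = n * p * (1 - p) / (n + 2) ** 2     # Var[(S + 1)/(n + 2)] with S ~ Binomial(n, p)
exact_bias_sq = ((1 - 2 * p) / (n + 2)) ** 2   # bias = E[estimator] - p = (1 - 2p)/(n + 2)
print(mc_risk, exact_var + exact_bias_sq)      # the two should agree up to Monte Carlo error
```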

A warm-up calculation

Let’s assume we have $n$ data points, $X_1, \dots, X_n$, sampled from a Bernoulli distribution with an unknown probability of success $p$. We’ll estimate $p$ with two different estimators,

$$\hat{p}_1 = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad \text{and} \qquad \hat{p}_2 = \frac{1}{2},$$
and compare their risk functions, using the squared-error loss function.

The estimator $\hat{p}_1$ is an unbiased estimator of the true parameter $p$. So, the risk of $\hat{p}_1$ is

$$R(p, \hat{p}_1) = \operatorname{Var}_p\big(\hat{p}_1\big) = \frac{p(1 - p)}{n}.$$

The estimator $\hat{p}_2$ has a bias but no variance term. Its risk is given by

$$R(p, \hat{p}_2) = \operatorname{bias}\big(\hat{p}_2\big)^2 = \left(\frac{1}{2} - p\right)^2.$$

The risk of the first estimator is a concave quadratic in $p$ and the risk of the second estimator is a convex quadratic. So, neither one of these risk functions is strictly better than (i.e., uniformly dominates) the other one as $p$ ranges from 0 to 1.
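
A short sketch (my own) makes this concrete: evaluate both exact risk curves on a grid of $p$ and check that each estimator beats the other somewhere on the interval. The sample size $n = 20$ is an arbitrary choice.

```python
import numpy as np

n = 20                                   # arbitrary sample size for illustration
p = np.linspace(0, 1, 101)
risk_1 = p * (1 - p) / n                 # risk of the sample mean: concave in p
risk_2 = (0.5 - p) ** 2                  # risk of the constant estimate 1/2: convex in p

# Each estimator beats the other somewhere on [0, 1], so neither dominates.
print(np.any(risk_1 < risk_2), np.any(risk_2 < risk_1))   # True True
```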

Maximum risk

In the spirit of (working towards) minimax, we could ask: what value of $p$ maximizes the risk of our estimator?

For $\hat{p}_1$, we can expand the quadratic as $\frac{p - p^2}{n}$, take the derivative $\frac{1 - 2p}{n}$, set it equal to 0, and find that $p = \frac{1}{2}$ maximizes the risk. (Note that this result holds independent of $n$; however, the maximum value of the risk, $\frac{1}{4n}$, decreases at a rate of $1/n$.)

For $\hat{p}_2$, the story is a bit different because we’re dealing with a convex quadratic, so think endpoints. Because we chose $\hat{p}_2 = \frac{1}{2}$, both $p = 0$ and $p = 1$ maximize the risk of this expression, giving a maximum risk of $\frac{1}{4}$.
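
Numerically, the worst-case risks come out as expected (a small sketch of mine, again with $n = 20$ as an arbitrary choice):

```python
import numpy as np

n = 20                                     # arbitrary sample size for illustration
p = np.linspace(0, 1, 1001)
risk_1 = p * (1 - p) / n                   # risk of the sample mean
risk_2 = (0.5 - p) ** 2                    # risk of the constant estimator 1/2

print(p[np.argmax(risk_1)], risk_1.max())  # 0.5, 1/(4n) = 0.0125
print(risk_2.max())                        # 0.25, attained at p = 0 and p = 1
```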

Setting up the minimax game

The central object of this post is the minimax risk, which is a (hopefully) minimized, worst-case risk, defined as

$$R_n = \inf_{\hat{\theta}} \sup_{\theta \in \Theta} R(\theta, \hat{\theta}),$$

where the infimum is taken over all estimators $\hat{\theta}$.

We can interpret the minimax risk as a type of game: it’s us (the statisticians) against nature.

  • Nature gets to choose $\theta$.
  • We get to choose $\hat{\theta}$.
  • Nature chooses $\theta$ adversarially to maximize our risk.
  • We do our best to minimize our risk given nature’s choice of $\theta$.
    • That is, we pick the estimator $\hat{\theta}$ that attains $\inf_{\hat{\theta}} \sup_{\theta \in \Theta} R(\theta, \hat{\theta})$.
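
To see the game in action, here is a toy sketch (my own illustration, not from the post): restrict attention to a small, hand-picked family of candidate estimators, compute each one’s worst-case risk over a grid of $p$, and choose the candidate whose worst case is smallest. The third candidate, $(S + 1)/(n + 2)$ with $S = \sum_i X_i$, is an arbitrary shrinkage estimator thrown in to show that a little bias can lower the worst-case risk.

```python
import numpy as np

n = 20                                     # arbitrary sample size for illustration
p = np.linspace(0, 1, 1001)

# Exact squared-error risk curves (variance + bias^2) for each candidate estimator.
candidates = {
    "sample mean":     p * (1 - p) / n,
    "constant 1/2":    (0.5 - p) ** 2,
    "(S + 1)/(n + 2)": (n * p * (1 - p) + (1 - 2 * p) ** 2) / (n + 2) ** 2,
}

worst_case = {name: risk.max() for name, risk in candidates.items()}
print(worst_case)
print("minimax choice within this family:", min(worst_case, key=worst_case.get))
```

Within this tiny family, the shrinkage estimator has the smallest worst-case risk. The true minimax problem takes the infimum over all estimators, which is much harder and is where the next posts are headed.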

References

This post draws on material from

  • Theoretical Statistics by Robert Keener;
  • All of Statistics by Larry Wasserman;
  • lectures by Michael Jordan.