Today’s post is our first (short) vignette of the Fisher series. These vignettes are taken from Brad Efron’s paper and they illustrate Fisher’s continued impact on modern statistics. For the first one, we’re going to connect the Fisher information with the bootstrap.

In what follows, we’re going to assume that we have data $x_1, \dots, x_n$ (i.e., realizations of random variables) from a fixed and unknown distribution with density function $f_\theta(x)$. The density is indexed by our parameter-of-interest $\theta$.

Fisher information

The Fisher information $\mathcal{I}(\theta)$ is the negative expected value of the second derivative of the log-density (which is also the derivative of the so-called score function):

$$\mathcal{I}(\theta) = -\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial\theta^2}\log f_\theta(x)\right] = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f_\theta(x)\right)^{2}\right]$$

The Fisher information for the full sample is $n\,\mathcal{I}(\theta)$ because the observations are (assumed to be) i.i.d.
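
As a quick worked example (mine, not Efron’s), take a Poisson($\theta$) model. The log-density of one observation is $\log f_\theta(x) = x\log\theta - \theta - \log x!$, so

$$\frac{\partial^2}{\partial\theta^2}\log f_\theta(x) = -\frac{x}{\theta^2}
\quad\Longrightarrow\quad
\mathcal{I}(\theta) = \mathbb{E}_\theta\!\left[\frac{x}{\theta^2}\right] = \frac{1}{\theta},$$

and the full sample of size $n$ carries information $n/\theta$.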

So, why do we care about the Fisher information? Well, Fisher showed us that the asymptotic standard errors of his maximum likelihood estimates (MLEs) are related to the information in the sample by

$$\mathrm{se}(\hat{\theta}) \approx \frac{1}{\sqrt{n\,\mathcal{I}(\theta)}}.$$

Moreover, by the asymptotic efficiency of MLEs, no other (asymptotically) unbiased estimator can do better: this is the Cramér–Rao lower bound at work.

Okay, that’s cool, but there’s a pesky $\theta$ in the denominator. So let’s do what Fisher would do and replace it by its plug-in estimate $\hat{\theta}$, giving

$$\widehat{\mathrm{se}}(\hat{\theta}) = \frac{1}{\sqrt{n\,\mathcal{I}(\hat{\theta})}}.$$

Heck yeah. Now we’ve got something actionable that we can use on MLEs as we please.
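
To make the plug-in recipe concrete, here’s a minimal sketch in Python (my own illustration, continuing the Poisson example; the simulated data, true rate, and sample size are arbitrary choices):

```python
import numpy as np

# Illustration only: plug-in Fisher standard error for the Poisson MLE.
# For Poisson(theta), the MLE is the sample mean and I(theta) = 1 / theta,
# so se_hat = 1 / sqrt(n * I(theta_hat)) = sqrt(theta_hat / n).

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.0, size=200)      # simulated data; true theta = 3

n = len(x)
theta_hat = x.mean()                    # MLE of theta
se_fisher = np.sqrt(theta_hat / n)      # plug-in Fisher standard error

print(f"theta_hat = {theta_hat:.3f}, plug-in Fisher SE = {se_fisher:.3f}")
```

For comparison, the true asymptotic standard error in this setup is $\sqrt{3/200} \approx 0.122$.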

The bootstrap

If you haven’t met the bootstrap yet, do so pronto. It’s an unbelievably useful tool that every statistician, machine learner, data scientist, computer scientist, and person-who-works-with-real-data should know, use, and love.

Here’s the gist of it.

  • We compute our statistic of interest $\hat{\theta} = s(x)$ from the data $x = (x_1, \dots, x_n)$ we have.
  • Then we resample $n$ points from our data with replacement, which gives us a bootstrap sample $x^*$. Some data points may appear multiple times and others not at all.
  • From our bootstrap sample, we compute a bootstrap statistic $\hat{\theta}^*_b = s(x^*_b)$, where $x^*_b$ is the $b$-th bootstrap sample.
  • (Note that $\hat{\theta}^*_b \neq \hat{\theta}$ in general.)
  • Do this $B$ times and estimate the standard error of the statistic by the sample standard deviation of $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$.

For a large number of bootstrap samples $B$ (and when the statistic is the MLE), the bootstrap standard error estimate is approximately equal to Fisher’s standard error estimate based on the Fisher information.
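
Here’s a minimal sketch of that comparison in Python (again my own illustration, reusing the Poisson setup; the number of resamples $B = 2000$ is an arbitrary choice):

```python
import numpy as np

# Illustration only: bootstrap SE vs. plug-in Fisher SE for the Poisson MLE.
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.0, size=200)      # simulated data; true theta = 3
n = len(x)

def statistic(sample):
    # Statistic of interest: here the Poisson MLE (the sample mean),
    # but it could be any nasty function of the data.
    return sample.mean()

theta_hat = statistic(x)

# Bootstrap: resample n points with replacement, recompute the statistic, repeat B times.
B = 2000
boot_stats = np.array([
    statistic(rng.choice(x, size=n, replace=True)) for _ in range(B)
])
se_boot = boot_stats.std(ddof=1)        # sample standard deviation of the bootstrap statistics

# Fisher: plug-in standard error, 1 / sqrt(n * I(theta_hat)) = sqrt(theta_hat / n) for Poisson.
se_fisher = np.sqrt(theta_hat / n)

print(f"bootstrap SE = {se_boot:.3f}, Fisher plug-in SE = {se_fisher:.3f}")
```

Because the statistic here is the Poisson MLE, both estimates land near $\sqrt{\hat{\theta}/n}$ and should agree closely; notably, the bootstrap never needed the Fisher information to get there.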

The bootstrap approach exploits computer-based computation, which is nice when our statistic is some nasty function of the data; on the other hand, Fisher’s approach often exploits clever manipulations, i.e., human-based computation, to produce standard error estimates. Both approaches are worth having in your pocket when standard errors come a-knocking.

That’s it for today.

Next up

… we’ll dig into the plug-in principle (which we had a flavor of today) for our second vignette!

References

This post is related to material from:

  • “R.A. Fisher in the 21st Century” by Bradley Efron.
  • Computer Age Statistical Inference: Algorithms, Evidence, and Data Science by Bradley Efron and Trevor Hastie. A digital copy lives here: CASI.