A few flavors of Fisher's style
Today’s post is the first in a sequence about Sir Ronald Aylmer Fisher and his approach to statistics. These posts are very, very much indebted to Brad Efron’s wonderful 1998 paper, “R.A. Fisher in the 21st Century”. A digital copy lives here: R. A. Fisher in the 21st century. To quote Efron’s last paragraph:
“Let me say finally that Fisher was a genius of the first rank, who has a solid claim to being the most important applied mathematician of the 20th century. His work has a unique quality of daring mathematical synthesis combined with the utmost practicality.”
If that doesn’t whet your appetite, then these posts aren’t for you. 😉
Fisherian inference
Statistics is often divided into two camps, the Bayesians (i.e., optimists, integrators, subjectivists) and the frequentists (i.e., pessimists, “differentiators”, objectivists). Although historically divided, we can relate the two camps via decision theory, but that’s not our purpose here. Instead, we’re going to talk about a third flavor of inference, Fisherian inference.
As Efron and Hastie say, Fisherian inference “often drew on ideas neither Bayesian nor frequentist in nature, or sometimes the two in combination.” Fisher’s style balanced the correctness (“coherency”) of Bayesian inference with the accuracy (“optimality”) of frequentist inference. Fisher’s style of statistics isn’t simply a convex combination of the Bayesian and frequentist schools, but something unique with its own merits and disadvantages. To make things less philosophical and more concrete, let’s review Fisher’s primary tool, and the one that we’ve almost surely all used: maximum likelihood.
Maximum likelihood
Assume we have a family of probability distributions indexed by a finite-dimensional parameter $\theta \in \Theta$, e.g., exponential family distributions. We’ll write the density functions as $f_\theta(x)$ and the log-likelihood as

$$\ell_x(\theta) = \log f_\theta(x),$$

where the notation suggests that the likelihood is a function of $\theta$ with the data $x$ held fixed. As a statistician, our goal is to estimate the parameter $\theta$. To do so, we’ll use Fisher’s maximum likelihood estimate

$$\hat{\theta} = \operatorname*{arg\,max}_{\theta \in \Theta} \, \ell_x(\theta).$$
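For concreteness, here’s a quick worked example (mine, not Efron’s): take the exponential model $f_\theta(x) = \theta e^{-\theta x}$ with i.i.d. data $x_1, \dots, x_n$. Then

$$\ell_x(\theta) = n \log \theta - \theta \sum_{i=1}^n x_i, \qquad \ell_x'(\theta) = \frac{n}{\theta} - \sum_{i=1}^n x_i = 0 \;\Longrightarrow\; \hat{\theta} = \frac{1}{\bar{x}},$$

the reciprocal of the sample mean.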
For many distributions of “practical interest”, $\Theta$ is some subset of the real numbers and the log-likelihood is a concave function. So, the maximization problem is well posed (even if there are multiple maximizers). Moreover, maximum likelihood estimation has achieved “iconic status” because it has
- a straightforward recipe—take derivatives, set them equal to zero, solve;
- good frequentist properties;
- a Bayesian interpretation;
- a notion of optimality; and
- a natural relationship with information and geometry.
Straightforward recipe. Maximum likelihood estimates are more or less automatic to construct: numerical methods may be required, but Newton’s method will (often) solve the problem in just a few iterations.
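Here’s a minimal sketch of that recipe in code (an illustration of mine, with simulated data): Newton’s method for the shape parameter of a gamma model, a case with no closed-form solution.

```python
# Newton's method for the MLE of the shape parameter alpha of a
# Gamma(alpha, 1) model; the data and starting point are made up.
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=1.0, size=1000)  # simulated data, true alpha = 3

def score(alpha):
    # first derivative of the log-likelihood: sum(log x) - n * digamma(alpha)
    return np.sum(np.log(x)) - x.size * digamma(alpha)

def hessian(alpha):
    # second derivative: -n * trigamma(alpha), always negative (concave problem)
    return -x.size * polygamma(1, alpha)

alpha = 1.0  # arbitrary starting guess
for _ in range(20):
    step = score(alpha) / hessian(alpha)
    alpha -= step
    if abs(step) < 1e-10:
        break

print(f"MLE of alpha: {alpha:.4f}")  # lands near the true value of 3
```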
Good frequentist properties. Under “suitable” regularity conditions, which we’ll scoot under the rug, maximum likelihood estimates are asymptotically normal, asymptotically efficient (i.e., optimal, i.e., they achieve the Cramér-Rao lower bound), and amenable to frequentist-style confidence intervals. Moreover, the maximum likelihood estimator is consistent, meaning it converges in probability to the true parameter value:

$$\Pr\left( \left| \hat{\theta}_n - \theta \right| > \varepsilon \right) \to 0$$

for every $\varepsilon > 0$ as $n \to \infty$, where $\hat{\theta}_n$ denotes the estimate built from $n$ observations.
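To see consistency in action, here’s a tiny simulation (mine again, reusing the exponential example from above), where $\hat{\theta}_n = 1/\bar{x}$ visibly settles down around the true rate:

```python
# Consistency in action: the exponential-rate MLE 1/mean(x) concentrates
# around the true rate as the sample size grows. Illustrative simulation.
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0  # true rate parameter
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.exponential(scale=1 / theta, size=n)
    print(f"n = {n:>6}   theta_hat = {1 / x.mean():.4f}")
```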
Bayesian interpretation. Maximum likelihood estimates can be viewed as Bayesian posterior estimates when the prior distribution is uniform (flat) over the parameter space $\Theta$.
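The justification is one line (standard, though the phrasing is mine): with a flat prior $\pi(\theta) \equiv c$, Bayes’ rule gives

$$\pi(\theta \mid x) \propto f_\theta(x) \, \pi(\theta) \propto f_\theta(x) = e^{\ell_x(\theta)},$$

so the posterior is maximized exactly where the likelihood is, and the posterior mode (the MAP estimate) coincides with $\hat{\theta}$.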
Notion of optimality. As mentioned above, these guys have the smallest possible asymptotic variance (the Cramér-Rao lower bound again), meaning they use the data in an asymptotically optimal way.
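Concretely (the standard statement, under the usual regularity conditions): any unbiased estimator $\tilde{\theta}$ built from $n$ i.i.d. samples satisfies

$$\operatorname{Var}\left( \tilde{\theta} \right) \ge \frac{1}{n \, I(\theta)},$$

where $I(\theta)$ is the Fisher information, defined next; the maximum likelihood estimator attains this bound asymptotically.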
Natural relationship with information and geometry. The Fisher information of a random variable quantifies how much information that variable carries about the parameter of interest; it is closely related to the information theoretic idea of relative entropy. Moreover, the Fisher information is a measure of curvature in the log-likelihood, where more curvature means more information. Cool, huh?
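For the record, here are the standard definitions in the notation above:

$$I(\theta) = \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f_\theta(X) \right)^2 \right] = -\,\mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f_\theta(X) \right],$$

where the second equality holds under regularity conditions and makes the curvature reading explicit: $I(\theta)$ is the expected negative second derivative of the log-likelihood. The link to relative entropy comes from the second-order Taylor expansion $D_{\mathrm{KL}}(f_\theta \,\|\, f_{\theta + \delta}) \approx \tfrac{1}{2} I(\theta) \, \delta^2$ for small $\delta$.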
Next up
… our first vignette, connecting the ideas of Fisher information and the bootstrap!
References
This post is related to material from:
- “R.A. Fisher in the 21st Century” by Bradley Efron, Statistical Science, 1998.
- Computer Age Statistical Inference: Algorithms, Evidence, and Data Science by Bradley Efron and Trevor Hastie. A digital copy lives here: CASI.