In this post, we’re putting on our information theory hats and digging into a simple but useful inequality gifted to us by Te Sun Han in 1978. Han’s inequality is a generalization of the Efron–Stein inequality with roots in information theory rather than the statistical analysis of the jackknife.1 Before teeing up Han’s inequality, we’ll cover a few ideas from information theory. Andiamo!

A modicum of information theory

We’ll need the idea of (Shannon’s) entropy in Han’s inequality, so here it is for a discrete random variable $X$ with probability mass function $p$:

$$H(X) = -\sum_{x} p(x) \log p(x),$$

with the standard convention that $0 \log 0 = 0$.
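
To make the definition concrete, here’s a minimal Python sketch (the `entropy` helper is just an illustrative name, not anything from the post) that computes the entropy of a small pmf:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a pmf, skipping zero entries per the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

print(entropy([0.5, 0.5]))    # fair coin: log 2 ≈ 0.693
print(entropy([1.0, 0.0]))    # deterministic outcome: 0.0
print(entropy([0.25] * 4))    # uniform over 4 outcomes: log 4 ≈ 1.386
```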

We’ll also use conditional entropy, defined as

$$H(X \mid Y) = H(X, Y) - H(Y).$$

We can write the conditional entropy as

$$
\begin{aligned}
H(X \mid Y) &= -\sum_{x, y} p(x, y) \log p(x, y) + \sum_{y} p(y) \log p(y) \\
&= -\sum_{x, y} p(x, y) \log p(x \mid y),
\end{aligned}
$$

where we use the relationship between conditional and joint probability, $p(x \mid y) = p(x, y) / p(y)$, in the second line.
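
As a quick sanity check, here’s a small self-contained sketch (the 2×2 joint pmf is made up purely for illustration) verifying that the two expressions for $H(X \mid Y)$ agree:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a pmf, with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Made-up joint pmf p(x, y): rows index x, columns index y.
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
p_y = p_xy.sum(axis=0)  # marginal of Y

# Definition: H(X | Y) = H(X, Y) - H(Y).
H_def = entropy(p_xy) - entropy(p_y)

# Direct form: -sum_{x, y} p(x, y) log p(x | y), using p(x | y) = p(x, y) / p(y).
H_direct = float(-np.sum(p_xy * np.log(p_xy / p_y)))

print(np.isclose(H_def, H_direct))  # True
```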

Lastly, we’ll use the intuitive idea that conditioning decreases entropy. This result follows from a relative entropy (i.e., Kullback–Leibler) calculation. For two random variables $X$ and $Y$, with marginal distributions $p_X$ and $p_Y$, and joint distribution $p_{XY}$, the relative entropy between $p_{XY}$ and the product $p_X p_Y$ is

$$D(p_{XY} \,\|\, p_X p_Y) = \sum_{x, y} p_{XY}(x, y) \log \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)} = H(X) - H(X \mid Y).$$

Since relative entropy is nonnegative, we have $H(X \mid Y) \le H(X)$, meaning that conditioning reduces entropy.
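
The same kind of numerical check works here. The sketch below (restating the made-up joint pmf so it runs on its own) confirms that $D(p_{XY} \,\|\, p_X p_Y) = H(X) - H(X \mid Y) \ge 0$:

```python
import numpy as np

# Made-up joint pmf p(x, y): rows index x, columns index y.
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
p_x = p_xy.sum(axis=1)  # marginal of X
p_y = p_xy.sum(axis=0)  # marginal of Y

# Relative entropy between the joint and the product of marginals.
kl = float(np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y))))

# Entropies: H(X) and H(X | Y) = H(X, Y) - H(Y).
H_x = float(-np.sum(p_x * np.log(p_x)))
H_x_given_y = float(-np.sum(p_xy * np.log(p_xy)) + np.sum(p_y * np.log(p_y)))

print(np.isclose(kl, H_x - H_x_given_y))  # True
print(kl >= 0 and H_x_given_y <= H_x)     # True: conditioning reduces entropy
```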

Han’s inequality

In its simplest form, Han’s inequality applies only to discrete random variables, and that’s the case we’ll look at here.

Han’s inequality. Assume $X_1, \dots, X_n$ are discrete random variables. Then

$$H(X_1, \dots, X_n) \le \frac{1}{n - 1} \sum_{i=1}^{n} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n).$$

Before proving the inequality, let’s interpret it. Roughly, it says that the joint entropy is less than the average entropy with the $i$th observation removed. So, we’re better off (i.e., have less uncertainty) if we know everything instead of accumulating (averaging) partial information.

Proof. Han’s inequality relies on two things:

  1. the definition of conditional entropy and
  2. the fact that conditioning reduces entropy.

For $i = 1, \dots, n$, we have

$$
\begin{aligned}
H(X_1, \dots, X_n) &\overset{(a)}{=} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + H(X_i \mid X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) \\
&\overset{(b)}{\le} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + H(X_i \mid X_1, \dots, X_{i-1}),
\end{aligned}
$$

where in (a) we used the definition of conditional entropy and in (b) we used the fact that conditioning reduces entropy. In other words, we have

$$H(X_1, \dots, X_n) \le H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + H(X_i \mid X_1, \dots, X_{i-1}).$$

Now let’s sum both sides of the inequality over $i$, which gives

$$n H(X_1, \dots, X_n) \le \sum_{i=1}^{n} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1}).$$

By the chain rule for entropy, the last sum equals $H(X_1, \dots, X_n)$, so this can be rewritten as

$$(n - 1) H(X_1, \dots, X_n) \le \sum_{i=1}^{n} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n),$$

which is Han’s inequality after dividing by $n - 1$.
And that’s it!
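
To see the inequality in action, here’s a self-contained numerical check (a sketch only; the random joint pmf and the `entropy` helper are made up for illustration) over three binary random variables:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Shannon entropy (in nats) of a pmf, with the 0 log 0 = 0 convention."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# A random joint pmf over three binary variables (X1, X2, X3).
p = rng.random((2, 2, 2))
p /= p.sum()

n = p.ndim
H_joint = entropy(p.ravel())

# Entropy with the i-th variable removed: marginalize X_i out of the joint pmf.
H_leave_one_out = [entropy(p.sum(axis=i).ravel()) for i in range(n)]

lhs = H_joint
rhs = sum(H_leave_one_out) / (n - 1)
print(lhs <= rhs)  # True: Han's inequality holds
print(lhs, rhs)
```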

Next up

… we’ll put Han’s inequality to work and look at the sub-additivity of entropy! Stay tuned.

References

This post is based on material from Chapter 4 of:

  • Concentration Inequalities: A Nonasymptotic Theory of Independence by S. Boucheron, G. Lugosi, and P. Massart.
  1. That said, we now know and appreciate how closely related information theory and statistics really are.