In this post, we’re putting on our information theory hats and digging into a simple but useful inequality gifted to us by Te Sun Han in 1978. Han’s inequality is a generalization of the Efron–Stein inequality with roots in information theory rather than the statistical analysis of the jackknife.1 Before teeing up Han’s inequality, we’ll cover a few ideas from information theory. Andiamo!

A modicum of information theory

We’ll need the idea of (Shannon’s) entropy in Han’s inequality, so here it is for a discrete random variable $X$ with probability mass function $p$:

$$H(X) = -\sum_{x} p(x) \log p(x),$$

with the standard convention that $0 \log 0 = 0$.
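
To make the definition concrete, here’s a minimal Python sketch (the `entropy` helper is just an illustrative name, not anything from the post) that computes the entropy of a small pmf:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a pmf, skipping zero entries per the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

print(entropy([0.5, 0.5]))    # fair coin: log 2 ≈ 0.693
print(entropy([1.0, 0.0]))    # deterministic outcome: 0.0
print(entropy([0.25] * 4))    # uniform over 4 outcomes: log 4 ≈ 1.386
```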

We’ll also use conditional entropy, defined as

$$H(X \mid Y) = H(X, Y) - H(Y).$$

We can write the conditional entropy as

$$
\begin{aligned}
H(X \mid Y) &= -\sum_{x, y} p(x, y) \log p(x, y) + \sum_{y} p(y) \log p(y) \\
&= -\sum_{x, y} p(x, y) \log p(x \mid y),
\end{aligned}
$$

where we use the relationship between conditional and joint probability, $p(x \mid y) = p(x, y) / p(y)$, in the second line.
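
As a quick sanity check, here’s a small self-contained sketch (the 2×2 joint pmf is made up purely for illustration) verifying that the two expressions for $H(X \mid Y)$ agree:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a pmf, with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Made-up joint pmf p(x, y): rows index x, columns index y.
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
p_y = p_xy.sum(axis=0)  # marginal of Y

# Definition: H(X | Y) = H(X, Y) - H(Y).
H_def = entropy(p_xy) - entropy(p_y)

# Direct form: -sum_{x, y} p(x, y) log p(x | y), using p(x | y) = p(x, y) / p(y).
H_direct = float(-np.sum(p_xy * np.log(p_xy / p_y)))

print(np.isclose(H_def, H_direct))  # True
```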

Lastly, we’ll use the intuitive idea that conditioning decreases entropy. This result follows from a relative entropy (i.e., Kullback–Leibler) calculation. For two random variables $X$ and $Y$, with marginal distributions $p_X$ and $p_Y$, and joint distribution $p_{XY}$, the relative entropy between $p_{XY}$ and the product $p_X p_Y$ is

$$D(p_{XY} \,\|\, p_X p_Y) = \sum_{x, y} p_{XY}(x, y) \log \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)} = H(X) - H(X \mid Y).$$

Since relative entropy is nonnegative, we have $H(X \mid Y) \le H(X)$, meaning that conditioning reduces entropy.
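
The same kind of numerical check works here. The sketch below (restating the made-up joint pmf so it runs on its own) confirms that $D(p_{XY} \,\|\, p_X p_Y) = H(X) - H(X \mid Y) \ge 0$:

```python
import numpy as np

# Made-up joint pmf p(x, y): rows index x, columns index y.
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
p_x = p_xy.sum(axis=1)  # marginal of X
p_y = p_xy.sum(axis=0)  # marginal of Y

# Relative entropy between the joint and the product of marginals.
kl = float(np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y))))

# Entropies: H(X) and H(X | Y) = H(X, Y) - H(Y).
H_x = float(-np.sum(p_x * np.log(p_x)))
H_x_given_y = float(-np.sum(p_xy * np.log(p_xy)) + np.sum(p_y * np.log(p_y)))

print(np.isclose(kl, H_x - H_x_given_y))  # True
print(kl >= 0 and H_x_given_y <= H_x)     # True: conditioning reduces entropy
```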

Han’s inequality

In its simplest form, Han’s inequality applies only to discrete random variables, and that’s the case we’ll look at here.

Han’s inequality. Assume $X_1, \dots, X_n$ are discrete random variables. Then

$$H(X_1, \dots, X_n) \le \frac{1}{n - 1} \sum_{i=1}^{n} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n).$$

Before proving the inequality, let’s interpret it. Roughly, it says that the joint entropy is less than the average entropy with the $i$th observation removed. So, we’re better off (i.e., have less uncertainty) if we know everything instead of accumulating (averaging) partial information.

Proof. Han’s inequality relies on two things:

  1. the definition of conditional entropy and
  2. the fact that conditioning reduces entropy.

For $i = 1, \dots, n$, we have

$$
\begin{aligned}
H(X_1, \dots, X_n) &\overset{(a)}{=} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + H(X_i \mid X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) \\
&\overset{(b)}{\le} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + H(X_i \mid X_1, \dots, X_{i-1}),
\end{aligned}
$$

where in (a) we used the definition of conditional entropy and in (b) we used the fact that conditioning reduces entropy. In other words, we have

$$H(X_1, \dots, X_n) \le H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + H(X_i \mid X_1, \dots, X_{i-1}).$$

Now let’s sum both sides of the inequality over $i$, which gives

$$n H(X_1, \dots, X_n) \le \sum_{i=1}^{n} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1}).$$

By the chain rule for entropy, the last sum equals $H(X_1, \dots, X_n)$, so this can be rewritten as

$$(n - 1) H(X_1, \dots, X_n) \le \sum_{i=1}^{n} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n),$$

which is Han’s inequality after dividing by $n - 1$.
And that’s it!
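
To see the inequality in action, here’s a self-contained numerical check (a sketch only; the random joint pmf and the `entropy` helper are made up for illustration) over three binary random variables:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Shannon entropy (in nats) of a pmf, with the 0 log 0 = 0 convention."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# A random joint pmf over three binary variables (X1, X2, X3).
p = rng.random((2, 2, 2))
p /= p.sum()

n = p.ndim
H_joint = entropy(p.ravel())

# Entropy with the i-th variable removed: marginalize X_i out of the joint pmf.
H_leave_one_out = [entropy(p.sum(axis=i).ravel()) for i in range(n)]

lhs = H_joint
rhs = sum(H_leave_one_out) / (n - 1)
print(lhs <= rhs)  # True: Han's inequality holds
print(lhs, rhs)
```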

Next up

… we’ll put Han’s inequality to work and look at the sub-additivity of entropy! Stay tuned.

References

This post is based on material from Chapter 4 of:

  • Concentration Inequalities: A Nonasymptotic Theory of Independence by S. Boucheron, G. Lugosi, and P. Massart.
  1. That said, we now know and appreciate how closely related information theory and statistics really are.