Mutual Information & KL Divergence
The Kullback-Leibler divergence $D_{KL}(P\|Q)$ measures how much a distribution $P$ differs from a reference distribution $Q$. Mutual information $I(X;Y)$ quantifies the information shared between two random variables; it equals the KL divergence between the joint distribution and the product of the marginals. These quantities are the backbone of information-theoretic proofs in statistics, machine learning, and communications.
KL Divergence & Mutual Information
For distributions $P,Q$ on $\mathcal{X}$: $$D_{KL}(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}=\mathbb{E}_P\left[\log\frac{P(X)}{Q(X)}\right]\geq 0$$ with equality iff $P=Q$ (Gibbs' inequality). KL divergence is not symmetric. Mutual information: $$I(X;Y)=D_{KL}(P_{XY}\|P_X P_Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)=H(X)+H(Y)-H(X,Y).$$
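These identities can be checked numerically. A minimal Python sketch (the $2\times 2$ joint distribution below is made up for illustration) computes $I(X;Y)$ both as $D_{KL}(P_{XY}\|P_X P_Y)$ and as $H(X)+H(Y)-H(X,Y)$:

```python
import math

def kl(p, q):
    """D_KL(p||q) in bits; terms with p(x) = 0 contribute 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def H(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# A hypothetical joint distribution P_XY (rows = values of X, cols = values of Y).
joint = [[0.3, 0.2],
         [0.1, 0.4]]
px = [sum(row) for row in joint]            # marginal of X
py = [sum(col) for col in zip(*joint)]      # marginal of Y

flat_joint = [p for row in joint for p in row]
product = [a * b for a in px for b in py]   # P_X * P_Y, same index order as flat_joint

mi = kl(flat_joint, product)                # I(X;Y) = D_KL(P_XY || P_X P_Y)
# Cross-check against the entropy identity I = H(X) + H(Y) - H(X,Y)
assert abs(mi - (H(px) + H(py) - H(flat_joint))) < 1e-12
print(f"I(X;Y) = {mi:.4f} bits")
```

Both routes agree, as the definition guarantees; the KL form also makes nonnegativity of $I(X;Y)$ immediate via Gibbs' inequality.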
Data Processing Inequality
If $X\to Y\to Z$ form a Markov chain (i.e., $Z\perp X\mid Y$), then $I(X;Z)\leq I(X;Y)$. In particular, for any deterministic function $g$, $X\to Y\to g(Y)$ is a Markov chain, so $I(X;g(Y))\leq I(X;Y)$. This formalizes the intuition that processing data can only lose information about $X$, never gain it.
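The inequality can be observed exactly (no sampling) for a cascade of two binary symmetric channels; the crossover probabilities below are made up for illustration:

```python
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def bsc_mi(prior, flip):
    """I(X;Y) for X ~ Bernoulli(prior) sent through a BSC with crossover prob `flip`."""
    py1 = prior*(1-flip) + (1-prior)*flip   # P(Y = 1)
    return h(py1) - h(flip)                  # H(Y) - H(Y|X)

# Hypothetical Markov chain X -> Y -> Z: two BSCs with crossover probs a, b.
a, b = 0.1, 0.2
i_xy = bsc_mi(0.5, a)
c = a*(1-b) + (1-a)*b                        # effective crossover prob of X -> Z
i_xz = bsc_mi(0.5, c)
assert i_xz <= i_xy + 1e-12                  # data processing inequality
print(f"I(X;Y) = {i_xy:.4f} bits, I(X;Z) = {i_xz:.4f} bits")
```

Composing the two channels strictly degrades the information about $X$ whenever both crossover probabilities are in $(0,1/2)$.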
Example 1
Compute $I(X;Y)$ for $X\sim\text{Bernoulli}(1/2)$ and $Y=X\oplus Z$, $Z\sim\text{Bernoulli}(p)$, $Z\perp X$.
Solution: $H(X)=1$. Given $X=x$, $Y=x\oplus Z$ is Bernoulli($p$) up to relabeling, so $H(Y|X)=h(p)$, where $h$ is the binary entropy function. For the marginal, $P(Y=1)=P(X=0)P(Z=1)+P(X=1)P(Z=0)=p/2+(1-p)/2=1/2$, so $H(Y)=1$. Hence $I(X;Y)=H(Y)-H(Y|X)=1-h(p)$. At $p=0$: $I=1$ (perfect channel); at $p=1/2$: $I=0$ (pure noise); at $p=1$: $I=1$ (noiseless, just flipped). This is the capacity of the binary symmetric channel.
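The closed form $1-h(p)$ can be verified against a direct computation of $I(X;Y)$ from the joint distribution of $(X,Y)$; a minimal sketch:

```python
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def mi_direct(p):
    """I(X;Y) from the joint of X ~ Bern(1/2), Y = X XOR Z, Z ~ Bern(p)."""
    # joint[x][y] = P(X=x, Y=y) = 0.5 * P(Z = x XOR y)
    joint = [[0.5*(1-p), 0.5*p],
             [0.5*p, 0.5*(1-p)]]
    px = [0.5, 0.5]
    py = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    return sum(joint[x][y] * math.log2(joint[x][y] / (px[x] * py[y]))
               for x in (0, 1) for y in (0, 1) if joint[x][y] > 0)

# The direct computation matches 1 - h(p), including the edge cases.
for p in (0.0, 0.11, 0.5, 1.0):
    assert abs(mi_direct(p) - (1 - h(p))) < 1e-12
```

Note the symmetry: $I$ depends on $p$ only through $h(p)$, so $p$ and $1-p$ give the same capacity.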
Example 2
Prove Gibbs' inequality: $D_{KL}(P\|Q)\geq 0$ with equality iff $P=Q$.
Solution: $D_{KL}(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}=-\sum_x P(x)\log\frac{Q(x)}{P(x)}$. Apply Jensen's inequality to the convex function $-\log$: $D_{KL}(P\|Q)\geq -\log\sum_x P(x)\cdot\frac{Q(x)}{P(x)}=-\log\sum_x Q(x)=-\log 1=0$. Equality in Jensen holds iff $Q(x)/P(x)$ is constant, i.e., $Q=P$ (since both are normalized). Alternatively: use $\log t\leq t-1$ for $t>0$, so $-\log(Q/P)\geq 1-Q/P$, summing over $x$ gives $D_{KL}\geq\sum P(1-Q/P)=1-\sum Q=0$.
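Gibbs' inequality can be sanity-checked numerically on randomly generated distributions, along with the equality case $Q=P$:

```python
import math
import random

def kl(p, q):
    """D_KL(p||q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rand_dist(n):
    """A random probability vector of length n (illustrative helper)."""
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(1000):
    p, q = rand_dist(5), rand_dist(5)
    assert kl(p, q) >= 0           # Gibbs' inequality
assert abs(kl(p, p)) < 1e-12        # equality when Q = P (each log ratio is 0)
```

Every trial lands on the nonnegative side, and $D_{KL}(P\|P)=0$ exactly since each term's log ratio vanishes.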
Practice
- Show that mutual information is symmetric: $I(X;Y)=I(Y;X)$.
- Prove the chain rule for mutual information: $I(X;Y,Z)=I(X;Z)+I(X;Y|Z)$.
- For a given reference distribution $P$ on $\{1,\ldots,n\}$ with full support, find the distribution $Q$ minimizing $D_{KL}(Q\|P)$ subject to $\mathbb{E}_Q[X]=\mu$.
- Show that $f$-divergences (of which KL is a special case with $f(t)=t\log t$) satisfy the data processing inequality.
Answer Key
1. $I(X;Y)=H(X)-H(X|Y)=H(X)+H(Y)-H(X,Y)=H(Y)-H(Y|X)=I(Y;X)$. The expression $H(X)+H(Y)-H(X,Y)$ is symmetric in $X,Y$.
2. $I(X;Y,Z)=H(X)-H(X|Y,Z)=H(X)-H(X|Z)+H(X|Z)-H(X|Y,Z)=I(X;Z)+I(X;Y|Z)$.
3. Minimize $D_{KL}(Q\|P)=\sum Q_i\log(Q_i/P_i)$ over $Q$ with $\sum Q_i=1$ and $\sum iQ_i=\mu$. Lagrange multipliers give $Q_i\propto P_i e^{\lambda i}$, an exponential tilting of $P$; for uniform $P$ this is a truncated geometric (exponential-family) distribution. The parameter $\lambda$ is determined by the mean constraint.
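The multiplier $\lambda$ has no closed form in general, but the mean of $Q_i\propto e^{\lambda i}$ is strictly increasing in $\lambda$, so bisection solves the mean constraint. A minimal sketch for the uniform-reference case, with made-up values $n=6$, $\mu=2$:

```python
import math

def tilted_mean(lam, n):
    """Mean of the tilted distribution Q_i proportional to exp(lam * i) on {1, ..., n}."""
    w = [math.exp(lam * i) for i in range(1, n + 1)]
    z = sum(w)
    return sum(i * wi for i, wi in zip(range(1, n + 1), w)) / z

def solve_lambda(mu, n, lo=-50.0, hi=50.0):
    """Bisection for lam with tilted_mean(lam, n) = mu (the mean is increasing in lam)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if tilted_mean(mid, n) < mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lam = solve_lambda(mu=2.0, n=6)
assert abs(tilted_mean(lam, 6) - 2.0) < 1e-9
print(f"lambda = {lam:.4f}")
```

Here $\mu=2$ sits below the uniform mean $3.5$, so the solver returns a negative $\lambda$, i.e., a decaying geometric shape.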
4. An $f$-divergence is $D_f(P\|Q)=\sum Q_i f(P_i/Q_i)$ with $f$ convex and $f(1)=0$; KL corresponds to $f(t)=t\log t$. If the same channel (stochastic kernel) maps $P\to P'$ and $Q\to Q'$, then $D_f(P'\|Q')\le D_f(P\|Q)$. This is the data processing inequality: processing cannot increase divergence. The proof applies Jensen's inequality to the conditional expectation of the likelihood ratio.
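The finite-alphabet statement can be checked numerically by pushing both distributions through the same channel and comparing divergences; the $3\to 2$ channel matrix below is made up for illustration:

```python
import math

def f_div(p, q, f):
    """D_f(P||Q) = sum_i q_i * f(p_i / q_i), assuming q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def push(dist, K):
    """Push a distribution through channel K, where K[i][j] = P(out = j | in = i)."""
    return [sum(dist[i] * K[i][j] for i in range(len(dist)))
            for j in range(len(K[0]))]

kl_f = lambda t: t * math.log(t) if t > 0 else 0.0   # f(t) = t log t gives KL

p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]
K = [[0.8, 0.2],
     [0.5, 0.5],
     [0.1, 0.9]]                                      # hypothetical 3 -> 2 channel

# Data processing inequality for f-divergences: the channel shrinks the divergence.
assert f_div(push(p, K), push(q, K), kl_f) <= f_div(p, q, kl_f)
```

Swapping in another convex $f$ with $f(1)=0$ (e.g., $f(t)=\tfrac12|t-1|$ for total variation) preserves the inequality, which is the point of the general statement.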