Information Theory in ML: MDL, Mutual Information & Compression
Information theory provides a fundamental language for ML: Shannon entropy quantifies uncertainty, mutual information measures feature relevance, and MDL unifies compression with model selection. PAC-Bayes bounds connect these to generalization.
Entropy, KL & Mutual Information
Shannon entropy: \(H(X)=-\sum_x p(x)\log p(x)\). KL divergence: \(D_{\mathrm{KL}}(p\|q)=\sum_x p(x)\log\frac{p(x)}{q(x)}\ge 0\) (Gibbs' inequality). Mutual information: \(I(X;Y)=H(X)-H(X|Y)=D_{\mathrm{KL}}(p(x,y)\|p(x)p(y))\ge 0\), with \(I(X;Y)=0\) iff \(X\perp Y\).
MDL & PAC-Bayes Bound
MDL: prefer \(h\) minimizing \(L(h)+L(\mathcal{D}|h)\); equivalent to MAP with prior \(p(h)\propto 2^{-L(h)}\). PAC-Bayes: for any prior \(P\), with prob. \(\ge 1-\delta\), for all posteriors \(Q\): \[\mathbb{E}_{h\sim Q}[R(h)]\le\mathbb{E}_{h\sim Q}[\hat{R}(h)]+\sqrt{\frac{D_{\mathrm{KL}}(Q\|P)+\ln(n/\delta)}{2(n-1)}}\]
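The bound's behavior is easy to tabulate numerically; in this sketch the function name and the sample values for the empirical risk and KL term are illustrative assumptions, chosen only to show the complexity term shrinking as \(n\) grows:

```python
import numpy as np

def pac_bayes_bound(emp_risk, kl_qp, n, delta=0.05):
    """McAllester-style PAC-Bayes bound from the text:
    E_Q[R] <= E_Q[R_hat] + sqrt((KL(Q||P) + ln(n/delta)) / (2(n-1)))."""
    return emp_risk + np.sqrt((kl_qp + np.log(n / delta)) / (2 * (n - 1)))

# The bound tightens as n grows and loosens as the posterior drifts from the prior.
for n in (100, 1000, 10000):
    print(n, pac_bayes_bound(emp_risk=0.05, kl_qp=10.0, n=n))
```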
Example 1: Cross-Entropy & MLE
Show MLE equals minimizing cross-entropy.
Solution: Log-likelihood \(\sum_i\log q_\theta(y_i)=-nH(\hat{p},q_\theta)\). Maximizing likelihood minimizes \(H(\hat{p},q_\theta)=H(\hat{p})+D_{\mathrm{KL}}(\hat{p}\|q_\theta)\). Since \(H(\hat{p})\) is fixed, MLE minimizes the KL divergence from empirical to model distribution.
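A quick numeric check of this identity on a hypothetical label sample: the average NLL of any categorical model \(q\) equals \(H(\hat{p},q)\), and the empirical frequencies (the MLE) minimize it:

```python
import numpy as np

# Empirical distribution from observed labels (hypothetical data).
labels = np.array([0, 0, 1, 2, 0, 1])
counts = np.bincount(labels, minlength=3)
p_hat = counts / counts.sum()  # MLE of a categorical model

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

def avg_nll(q):
    """Average negative log-likelihood of the sample under model q."""
    return -np.mean(np.log(q[labels]))

q_mle = p_hat
q_other = np.array([1/3, 1/3, 1/3])
assert np.isclose(avg_nll(q_mle), cross_entropy(p_hat, q_mle))
assert avg_nll(q_mle) <= avg_nll(q_other)  # MLE attains the minimum
```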
Example 2: MI Feature Selection
Why is \(I(X_j;Y)\) a principled feature relevance criterion?
Solution: \(I(X_j;Y)\) measures the reduction in uncertainty about \(Y\) from observing \(X_j\), and it detects nonlinear dependencies that correlation misses. \(I(X_j;Y)=0\) means \(X_j\) is marginally irrelevant (it may still be informative jointly with other features, as in XOR). mRMR (minimum-redundancy maximum-relevance) maximizes relevance \(I(X_j;Y)\) while minimizing redundancy \(I(X_j;X_k)\) with already-selected features.
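A sketch of greedy mRMR with a plug-in MI estimator for discrete features (all names and the synthetic data are illustrative assumptions): a redundant duplicate of the strongest feature loses out to a weaker but non-redundant signal on the second pick:

```python
import numpy as np

def mi_discrete(x, y):
    """Plug-in estimate of I(X;Y) in nats for discrete arrays x, y."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def mrmr(X, y, k):
    """Greedy mRMR: pick features maximizing relevance minus mean redundancy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        def score(j):
            rel = mi_discrete(X[:, j], y)
            red = (np.mean([mi_discrete(X[:, j], X[:, s]) for s in selected])
                   if selected else 0.0)
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)
f0 = np.where(rng.random(n) < 0.1, 1 - y, y)   # strong noisy copy of y
f1 = f0.copy()                                  # redundant duplicate of f0
f2 = np.where(rng.random(n) < 0.3, 1 - y, y)   # weaker, non-redundant signal
X = np.column_stack([f0, f1, f2])
print(mrmr(X, y, 2))  # redundancy steers the second pick away from f1
```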
Practice
- Prove the data processing inequality: \(I(X;Z)\le I(X;Y)\) for Markov chain \(X\to Y\to Z\).
- Show \(D_{\mathrm{KL}}(p\|q)\ge 0\) using Jensen's inequality applied to \(-\log\).
- Derive the PAC-Bayes bound for a Gaussian posterior over linear models with Gaussian prior.
- Explain how MDL selects model complexity and compare to AIC and BIC criteria.
Answer Key
1. Markov chain $X \to Y \to Z$: $p(x,y,z)=p(x)p(y|x)p(z|y)$, so $X \perp Z \mid Y$ and hence $I(X;Z|Y)=0$. Expand $I(X;Y,Z)$ by the chain rule in two ways: $I(X;Y,Z)=I(X;Y)+I(X;Z|Y)=I(X;Z)+I(X;Y|Z)$. Substituting $I(X;Z|Y)=0$ gives $I(X;Z)=I(X;Y)-I(X;Y|Z)\leq I(X;Y)$, since $I(X;Y|Z)\geq 0$. ✓
2. $D_{KL}(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = -\sum p(x)\log\frac{q(x)}{p(x)}$. By Jensen's inequality (since $-\log$ is convex): $\geq -\log\sum p(x)\frac{q(x)}{p(x)} = -\log\sum q(x) = -\log 1 = 0$. Equality iff $q(x)/p(x)$ is constant a.s., i.e., $p=q$. ✓ (Gibbs' inequality.)
3. For Gaussian prior $\pi = N(0,\sigma_0^2 I)$ and posterior $\rho = N(w, \sigma^2 I)$ over $\mathbb{R}^d$: $D_{KL}(\rho\|\pi) = \frac{d}{2}\left[\frac{\sigma^2}{\sigma_0^2}-1-\ln\frac{\sigma^2}{\sigma_0^2}\right]+\frac{\|w\|^2}{2\sigma_0^2}$. Plugging into the PAC-Bayes bound: with probability $\geq 1-\delta$, $E_{h\sim\rho}[R(h)] \leq E_{h\sim\rho}[\hat{R}(h)] + \sqrt{\frac{D_{KL}(\rho\|\pi)+\ln(n/\delta)}{2(n-1)}}$. The bound penalizes large weights (the $\|w\|^2/2\sigma_0^2$ term) and posteriors far from the prior, giving an $\ell_2$-regularization flavor for linear models.
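The closed-form KL between isotropic Gaussians in answer 3 can be sketched directly (the function name is mine), e.g. to check that it vanishes when the posterior equals the prior:

```python
import numpy as np

def kl_isotropic_gauss(w, sigma, sigma0, d):
    """KL(N(w, sigma^2 I) || N(0, sigma0^2 I)) in d dimensions:
    (d/2)(r - 1 - ln r) + ||w||^2 / (2 sigma0^2), with r = sigma^2/sigma0^2."""
    r = sigma**2 / sigma0**2
    return 0.5 * d * (r - 1 - np.log(r)) + np.dot(w, w) / (2 * sigma0**2)

print(kl_isotropic_gauss(np.zeros(3), 1.0, 1.0, 3))        # 0: posterior = prior
print(kl_isotropic_gauss(np.array([1.0, -2.0, 0.5]), 0.5, 1.0, 3))
```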
4. MDL: choose the model $M$ minimizing $L(M)+L(D|M)$ (total description length = model encoding + data encoding given the model); complexity is penalized automatically, since complex models cost more bits to describe. AIC: $-2\ln L + 2k$ penalizes the $k$ parameters linearly and tends to select overly complex models (it is not consistent). BIC: $-2\ln L + k\ln n$ applies a heavier, sample-size-dependent penalty and is consistent: it selects the true model as $n\to\infty$ if it lies in the candidate class. BIC can in fact be derived as an asymptotic approximation to two-part MDL, while AIC instead targets predictive (KL) risk.
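A sketch comparing AIC and BIC on synthetic polynomial regression (the data and setup are illustrative assumptions): the log-likelihood uses the Gaussian MLE of the noise variance, and BIC's $\ln n$ penalty typically recovers the true degree:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.linspace(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.3, n)  # true degree: 2

aics, bics = [], []
for d in range(6):
    coeffs = np.polyfit(x, y, d)
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid**2)                 # Gaussian MLE of noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = d + 2                                  # polynomial coefficients + variance
    aics.append(-2 * loglik + 2 * k)
    bics.append(-2 * loglik + k * np.log(n))

print("AIC picks degree", int(np.argmin(aics)))
print("BIC picks degree", int(np.argmin(bics)))
```

Note how the two criteria differ only in the penalty term; with $n=200$, BIC's per-parameter cost is $\ln 200 \approx 5.3$ versus AIC's constant 2.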