Information Theory in ML: MDL, Mutual Information & Compression
Information theory provides a fundamental language for ML: Shannon entropy quantifies uncertainty, mutual information measures feature relevance, and MDL unifies compression with model selection. PAC-Bayes bounds connect these to generalization.
Entropy, KL & Mutual Information
Shannon entropy: \(H(X)=-\sum_x p(x)\log p(x)\). KL divergence: \(D_{\mathrm{KL}}(p\|q)=\sum_x p(x)\log\frac{p(x)}{q(x)}\ge 0\) (Gibbs' inequality). Mutual information: \(I(X;Y)=H(X)-H(X|Y)=D_{\mathrm{KL}}(p(x,y)\|p(x)p(y))\ge 0\), with \(I(X;Y)=0\) iff \(X\perp Y\).
MDL & PAC-Bayes Bound
MDL: prefer \(h\) minimizing \(L(h)+L(\mathcal{D}|h)\); equivalent to MAP with prior \(p(h)\propto 2^{-L(h)}\). PAC-Bayes: for any prior \(P\), with prob. \(\ge 1-\delta\), for all posteriors \(Q\): \[\mathbb{E}_{h\sim Q}[R(h)]\le\mathbb{E}_{h\sim Q}[\hat{R}(h)]+\sqrt{\frac{D_{\mathrm{KL}}(Q\|P)+\ln(n/\delta)}{2(n-1)}}\]
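The bound's behavior is easy to tabulate numerically; in this sketch the function name and the sample values for the empirical risk and KL term are illustrative assumptions, chosen only to show the complexity term shrinking as \(n\) grows:

```python
import numpy as np

def pac_bayes_bound(emp_risk, kl_qp, n, delta=0.05):
    """McAllester-style PAC-Bayes bound from the text:
    E_Q[R] <= E_Q[R_hat] + sqrt((KL(Q||P) + ln(n/delta)) / (2(n-1)))."""
    return emp_risk + np.sqrt((kl_qp + np.log(n / delta)) / (2 * (n - 1)))

# The bound tightens as n grows and loosens as the posterior drifts from the prior.
for n in (100, 1000, 10000):
    print(n, pac_bayes_bound(emp_risk=0.05, kl_qp=10.0, n=n))
```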
Example 1: Cross-Entropy & MLE
Show MLE equals minimizing cross-entropy.
Solution: Log-likelihood \(\sum_i\log q_\theta(y_i)=-nH(\hat{p},q_\theta)\). Maximizing likelihood minimizes \(H(\hat{p},q_\theta)=H(\hat{p})+D_{\mathrm{KL}}(\hat{p}\|q_\theta)\). Since \(H(\hat{p})\) is fixed, MLE minimizes the KL divergence from empirical to model distribution.
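A quick numeric check of this identity on a hypothetical label sample: the average NLL of any categorical model \(q\) equals \(H(\hat{p},q)\), and the empirical frequencies (the MLE) minimize it:

```python
import numpy as np

# Empirical distribution from observed labels (hypothetical data).
labels = np.array([0, 0, 1, 2, 0, 1])
counts = np.bincount(labels, minlength=3)
p_hat = counts / counts.sum()  # MLE of a categorical model

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

def avg_nll(q):
    """Average negative log-likelihood of the sample under model q."""
    return -np.mean(np.log(q[labels]))

q_mle = p_hat
q_other = np.array([1/3, 1/3, 1/3])
assert np.isclose(avg_nll(q_mle), cross_entropy(p_hat, q_mle))
assert avg_nll(q_mle) <= avg_nll(q_other)  # MLE attains the minimum
```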
Example 2: MI Feature Selection
Why is \(I(X_j;Y)\) a principled feature relevance criterion?
Solution: \(I(X_j;Y)\) measures the reduction in uncertainty about \(Y\) from observing \(X_j\), and it detects nonlinear dependencies that correlation misses. \(I(X_j;Y)=0\) means \(X_j\) is marginally irrelevant (it may still be informative jointly with other features, as in XOR). mRMR (minimum-redundancy maximum-relevance) maximizes relevance \(I(X_j;Y)\) while minimizing redundancy \(I(X_j;X_k)\) with already-selected features.
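A sketch of greedy mRMR with a plug-in MI estimator for discrete features (all names and the synthetic data are illustrative assumptions): a redundant duplicate of the strongest feature loses out to a weaker but non-redundant signal on the second pick:

```python
import numpy as np

def mi_discrete(x, y):
    """Plug-in estimate of I(X;Y) in nats for discrete arrays x, y."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def mrmr(X, y, k):
    """Greedy mRMR: pick features maximizing relevance minus mean redundancy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        def score(j):
            rel = mi_discrete(X[:, j], y)
            red = (np.mean([mi_discrete(X[:, j], X[:, s]) for s in selected])
                   if selected else 0.0)
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)
f0 = np.where(rng.random(n) < 0.1, 1 - y, y)   # strong noisy copy of y
f1 = f0.copy()                                  # redundant duplicate of f0
f2 = np.where(rng.random(n) < 0.3, 1 - y, y)   # weaker, non-redundant signal
X = np.column_stack([f0, f1, f2])
print(mrmr(X, y, 2))  # redundancy steers the second pick away from f1
```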
Practice
- Prove the data processing inequality: \(I(X;Z)\le I(X;Y)\) for Markov chain \(X\to Y\to Z\).
- Show \(D_{\mathrm{KL}}(p\|q)\ge 0\) using Jensen's inequality applied to \(-\log\).
- Derive the PAC-Bayes bound for a Gaussian posterior over linear models with Gaussian prior.
- Explain how MDL selects model complexity and compare to AIC and BIC criteria.
Answer Key
1. Markov chain $X \to Y \to Z$: $p(x,y,z)=p(x)p(y|x)p(z|y)$, so $X \perp Z \mid Y$ and hence $I(X;Z|Y)=0$. Expand $I(X;Y,Z)$ by the chain rule in two ways: $I(X;Y,Z)=I(X;Y)+I(X;Z|Y)=I(X;Z)+I(X;Y|Z)$. Substituting $I(X;Z|Y)=0$ gives $I(X;Z)=I(X;Y)-I(X;Y|Z)\leq I(X;Y)$, since $I(X;Y|Z)\geq 0$. ✓
2. $D_{KL}(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = -\sum p(x)\log\frac{q(x)}{p(x)}$. By Jensen's inequality (since $-\log$ is convex): $\geq -\log\sum p(x)\frac{q(x)}{p(x)} = -\log\sum q(x) = -\log 1 = 0$. Equality iff $q(x)/p(x)$ is constant a.s., i.e., $p=q$. ✓ (Gibbs' inequality.)
3. For Gaussian prior $\pi = N(0,\sigma_0^2 I)$ and posterior $\rho = N(w, \sigma^2 I)$ over $\mathbb{R}^d$: $D_{KL}(\rho\|\pi) = \frac{d}{2}\left[\frac{\sigma^2}{\sigma_0^2}-1-\ln\frac{\sigma^2}{\sigma_0^2}\right]+\frac{\|w\|^2}{2\sigma_0^2}$. Plugging into the PAC-Bayes bound: with probability $\geq 1-\delta$, $E_{h\sim\rho}[R(h)] \leq E_{h\sim\rho}[\hat{R}(h)] + \sqrt{\frac{D_{KL}(\rho\|\pi)+\ln(n/\delta)}{2(n-1)}}$. The bound penalizes large weights (the $\|w\|^2/2\sigma_0^2$ term) and posteriors far from the prior, giving an $\ell_2$-regularization flavor for linear models.
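The closed-form KL between isotropic Gaussians in answer 3 can be sketched directly (the function name is mine), e.g. to check that it vanishes when the posterior equals the prior:

```python
import numpy as np

def kl_isotropic_gauss(w, sigma, sigma0, d):
    """KL(N(w, sigma^2 I) || N(0, sigma0^2 I)) in d dimensions:
    (d/2)(r - 1 - ln r) + ||w||^2 / (2 sigma0^2), with r = sigma^2/sigma0^2."""
    r = sigma**2 / sigma0**2
    return 0.5 * d * (r - 1 - np.log(r)) + np.dot(w, w) / (2 * sigma0**2)

print(kl_isotropic_gauss(np.zeros(3), 1.0, 1.0, 3))        # 0: posterior = prior
print(kl_isotropic_gauss(np.array([1.0, -2.0, 0.5]), 0.5, 1.0, 3))
```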
4. MDL: choose the model $M$ minimizing $L(M)+L(D|M)$ (total description length = model encoding + data encoding given the model); complexity is penalized automatically, since complex models cost more bits to describe. AIC: $-2\ln L + 2k$ penalizes the $k$ parameters linearly and tends to select overly complex models (it is not consistent). BIC: $-2\ln L + k\ln n$ applies a heavier, sample-size-dependent penalty and is consistent: it selects the true model as $n\to\infty$ if it lies in the candidate class. BIC can in fact be derived as an asymptotic approximation to two-part MDL, while AIC instead targets predictive (KL) risk.
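A sketch comparing AIC and BIC on synthetic polynomial regression (the data and setup are illustrative assumptions): the log-likelihood uses the Gaussian MLE of the noise variance, and BIC's $\ln n$ penalty typically recovers the true degree:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.linspace(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.3, n)  # true degree: 2

aics, bics = [], []
for d in range(6):
    coeffs = np.polyfit(x, y, d)
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid**2)                 # Gaussian MLE of noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = d + 2                                  # polynomial coefficients + variance
    aics.append(-2 * loglik + 2 * k)
    bics.append(-2 * loglik + k * np.log(n))

print("AIC picks degree", int(np.argmin(aics)))
print("BIC picks degree", int(np.argmin(bics)))
```

Note how the two criteria differ only in the penalty term; with $n=200$, BIC's per-parameter cost is $\ln 200 \approx 5.3$ versus AIC's constant 2.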