Statistical Learning Theory: Risk & Generalization
Statistical learning theory formalizes when a model trained on finite data performs well on unseen examples. The central objects are the true risk, empirical risk, and the generalization gap between them.
Risk and Empirical Risk
With \(n\) i.i.d. samples from \(P\): true risk \(R(h)=\mathbb{E}[\ell(h,z)]\); empirical risk \(\hat{R}(h)=\frac{1}{n}\sum_i\ell(h,z_i)\). ERM: \(\hat{h}=\arg\min_{h\in\mathcal{H}}\hat{R}(h)\). Excess risk: \(R(\hat{h})-\min_h R(h)\le 2\sup_h|R(h)-\hat{R}(h)|\).
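A minimal sketch of ERM over a finite hypothesis class makes these definitions concrete. The setup below (1-D threshold classifiers \(h_t(x)=\mathrm{sign}(x-t)\) with 0/1 loss, true threshold 0.3) is illustrative, not from the text:

```python
import numpy as np

# Sketch: ERM over a finite class of 1-D threshold classifiers.
# All names and the threshold setup are illustrative assumptions.
rng = np.random.default_rng(0)

n = 200
x = rng.uniform(-1, 1, size=n)
y = np.where(x >= 0.3, 1.0, -1.0)  # labels from true threshold 0.3

thresholds = np.linspace(-1, 1, 41)  # finite class H, |H| = 41

def empirical_risk(t):
    """hat{R}(h_t): average 0/1 loss of the threshold classifier on the sample."""
    preds = np.where(x >= t, 1.0, -1.0)
    return float(np.mean(preds != y))

risks = [empirical_risk(t) for t in thresholds]
t_hat = thresholds[int(np.argmin(risks))]  # the ERM hypothesis
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {min(risks):.3f}")
```

Because the labels are noiseless, the ERM threshold recovers the true one (up to the grid resolution) once \(n\) is moderately large.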
Uniform Convergence & Rademacher
For \(|\mathcal{H}|<\infty\) and \(\ell\in[0,1]\), with probability \(\ge 1-\delta\): \(\sup_h|R(h)-\hat{R}(h)|\le\sqrt{(\log|\mathcal{H}|+\log(2/\delta))/(2n)}\). The Rademacher complexity \(\mathfrak{R}_n(\mathcal{H})=\mathbb{E}_{\sigma,S}\sup_h\frac{1}{n}\sum_i\sigma_ih(z_i)\) extends this beyond finite classes: with probability \(\ge 1-\delta\), \(R(h)\le\hat{R}(h)+2\mathfrak{R}_n(\mathcal{H})+3\sqrt{\log(2/\delta)/(2n)}\) for all \(h\in\mathcal{H}\).
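For a finite class, \(\mathfrak{R}_n\) can be estimated by Monte Carlo and compared against the \(\sqrt{\log|\mathcal{H}|/n}\) scaling of the finite-class bound. The sketch below represents \(\mathcal{H}\) as an \(|\mathcal{H}|\times n\) matrix of function values \(h(z_i)\); the class, sizes, and seed are illustrative assumptions:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of a finite
# class, represented as an |H| x n matrix of values h(z_i) in {-1, +1}.
rng = np.random.default_rng(1)

n, H_size = 500, 50
values = rng.choice([-1.0, 1.0], size=(H_size, n))  # an arbitrary finite class

def rademacher_estimate(values, n_trials=2000):
    """Average of sup_h (1/n) sum_i sigma_i h(z_i) over random sign draws."""
    n = values.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(values @ sigma) / n
    return total / n_trials

est = rademacher_estimate(values)
bound = np.sqrt(2 * np.log(H_size) / n)  # Massart's finite-class bound
print(f"MC estimate: {est:.3f}  finite-class bound: {bound:.3f}")
```

The estimate sits below the bound, as it must, and both shrink like \(1/\sqrt{n}\).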
Example 1: Bias-Variance Decomposition
Decompose \(R(\hat{h})\) into approximation and estimation error.
Solution: \(R(\hat{h})=\underbrace{R(h^*_{\mathcal{H}})}_\text{approx. error}+\underbrace{R(\hat{h})-R(h^*_{\mathcal{H}})}_\text{estimation error}\). Larger \(\mathcal{H}\) reduces approximation error but raises estimation error \(O(\sqrt{\log|\mathcal{H}|/n})\).
Example 2: Linear Classifier Rademacher Bound
Bound \(\mathfrak{R}_n\) for \(\{x\mapsto w^\top x:\|w\|\le B\}\) with \(\|x\|\le R\).
Solution: By Cauchy-Schwarz and Jensen's inequality: \(\mathfrak{R}_n\le BR/\sqrt{n}\). The generalization gap is \(O(BR/\sqrt{n})\), independent of the dimension.
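The Cauchy-Schwarz step gives the supremum in closed form, \(\sup_{\|w\|\le B}\frac{1}{n}\sum_i\sigma_iw^\top x_i=\frac{B}{n}\|\sum_i\sigma_ix_i\|\), so the Rademacher complexity of this class can be estimated directly and checked against \(BR/\sqrt{n}\). The dimensions, sample size, and seed below are illustrative assumptions:

```python
import numpy as np

# Sketch: Monte Carlo check of R_n <= BR/sqrt(n) for norm-bounded linear
# classifiers, using the closed form sup_{||w||<=B} = (B/n)||sum_i sigma_i x_i||.
rng = np.random.default_rng(2)

d, n, B = 5, 400, 2.0
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # enforce ||x_i|| = R = 1
R = 1.0

trials = 2000
total = 0.0
for _ in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)
    total += B * np.linalg.norm(sigma @ x) / n  # sup over ||w|| <= B, closed form
estimate = total / trials

bound = B * R / np.sqrt(n)
print(f"estimate: {estimate:.4f}  bound BR/sqrt(n): {bound:.4f}")
```

By Jensen, \(\mathbb{E}\|\sum_i\sigma_ix_i\|\le\sqrt{\mathbb{E}\|\sum_i\sigma_ix_i\|^2}=\sqrt{n}\), so the estimate lands just under the bound, with no dependence on \(d\).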
Practice
- Prove the union bound step in the finite-\(\mathcal{H}\) uniform convergence theorem.
- Show \(\mathfrak{R}_n\le BR/\sqrt{n}\) for norm-bounded linear classifiers.
- Explain why training error alone is insufficient to assess model quality.
- How does cross-validation relate to the bias-variance tradeoff in learning theory?
Answer Key
1. For finite \(|\mathcal{H}|\): \(P(\exists h: |\hat{R}(h)-R(h)|>\epsilon) \le \sum_{h\in\mathcal{H}} P(|\hat{R}(h)-R(h)|>\epsilon)\) (union bound) \(\le |\mathcal{H}|\cdot 2e^{-2n\epsilon^2}\) (Hoeffding per \(h\)). Setting this equal to \(\delta\) gives \(\epsilon = \sqrt{\frac{\ln|\mathcal{H}|+\ln(2/\delta)}{2n}}\). With probability \(\ge 1-\delta\): \(R(h) \le \hat{R}(h)+\sqrt{\frac{\ln|\mathcal{H}|+\ln(2/\delta)}{2n}}\) for all \(h\).
2. Rademacher complexity of \(\mathcal{F} = \{x \mapsto w^\top x : \|w\|\le B\}\) with \(\|x\|\le R\): \(\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}[\sup_{\|w\|\le B}\frac{1}{n}\sum_i \sigma_i w^\top x_i] = \mathbb{E}[\frac{B}{n}\|\sum_i\sigma_i x_i\|] \le \frac{B}{n}\sqrt{\mathbb{E}[\|\sum_i\sigma_i x_i\|^2]} = \frac{B}{n}\sqrt{\sum_i\|x_i\|^2} \le \frac{BR}{\sqrt{n}}\), using Cauchy-Schwarz for the second equality and Jensen for the inequality.
3. Low training error does not guarantee low test error: a model that memorizes the training data has zero training error but high test error (overfitting). The generalization gap \(R(h)-\hat{R}(h)\) depends on the complexity of the hypothesis class (VC dimension, Rademacher complexity). Assessing quality therefore requires either (1) a complexity bound plus sufficient data (PAC learning), (2) evaluation on a held-out validation/test set, or (3) cross-validation.
4. Cross-validation estimates generalization error by repeatedly splitting the data into training and validation sets. In bias-variance terms: simple models have high bias (underfitting) but low variance; complex models have low bias but high variance (overfitting). CV selects the model complexity that minimizes the estimated test error, balancing the two. \(k\)-fold CV averages over \(k\) splits, reducing the variance of the estimate.
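The cross-validation procedure in answer 4 can be sketched end to end: \(k\)-fold CV selecting a polynomial degree on noisy data. The target function, noise level, degree range, and seed are all illustrative assumptions:

```python
import numpy as np

# Sketch of k-fold cross-validation for model selection (polynomial degree),
# illustrating the bias-variance tradeoff: low degrees underfit, high
# degrees overfit, and CV picks a degree in between.
rng = np.random.default_rng(3)

n, k = 60, 5
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)  # noisy nonlinear target
folds = np.array_split(rng.permutation(n), k)  # fixed splits for all models

def cv_error(degree):
    """Mean squared validation error of a degree-`degree` polynomial fit."""
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[val]) - y[val]) ** 2))
    return float(np.mean(errs))

scores = {d: cv_error(d) for d in range(1, 11)}
best = min(scores, key=scores.get)
print(f"selected degree: {best}")
```

Fixing the folds once before the loop ensures every degree is scored on the same splits, so differences in CV error reflect model complexity rather than split noise.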