Statistical Learning Theory: Risk & Generalization
Statistical learning theory formalizes when a model trained on finite data performs well on unseen examples. The central objects are the true risk, empirical risk, and the generalization gap between them.
Risk and Empirical Risk
With \(n\) i.i.d. samples from \(P\): true risk \(R(h)=\mathbb{E}[\ell(h,z)]\); empirical risk \(\hat{R}(h)=\frac{1}{n}\sum_i\ell(h,z_i)\). ERM: \(\hat{h}=\arg\min_{h\in\mathcal{H}}\hat{R}(h)\). Excess risk: \(R(\hat{h})-\min_h R(h)\le 2\sup_h|R(h)-\hat{R}(h)|\).
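A minimal sketch of ERM over a finite hypothesis class makes these definitions concrete. The setup below (1-D threshold classifiers \(h_t(x)=\mathrm{sign}(x-t)\) with 0/1 loss, true threshold 0.3) is illustrative, not from the text:

```python
import numpy as np

# Sketch: ERM over a finite class of 1-D threshold classifiers.
# All names and the threshold setup are illustrative assumptions.
rng = np.random.default_rng(0)

n = 200
x = rng.uniform(-1, 1, size=n)
y = np.where(x >= 0.3, 1.0, -1.0)  # labels from true threshold 0.3

thresholds = np.linspace(-1, 1, 41)  # finite class H, |H| = 41

def empirical_risk(t):
    """hat{R}(h_t): average 0/1 loss of the threshold classifier on the sample."""
    preds = np.where(x >= t, 1.0, -1.0)
    return float(np.mean(preds != y))

risks = [empirical_risk(t) for t in thresholds]
t_hat = thresholds[int(np.argmin(risks))]  # the ERM hypothesis
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {min(risks):.3f}")
```

Because the labels are noiseless, the ERM threshold recovers the true one (up to the grid resolution) once \(n\) is moderately large.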
Uniform Convergence & Rademacher
For \(|\mathcal{H}|<\infty\) and \(\ell\in[0,1]\), with probability \(\ge 1-\delta\): \(\sup_h|R(h)-\hat{R}(h)|\le\sqrt{(\log|\mathcal{H}|+\log(2/\delta))/(2n)}\). The Rademacher complexity \(\mathfrak{R}_n(\mathcal{H})=\mathbb{E}_{\sigma,S}\sup_h\frac{1}{n}\sum_i\sigma_ih(z_i)\) extends this beyond finite classes: with probability \(\ge 1-\delta\), \(R(h)\le\hat{R}(h)+2\mathfrak{R}_n(\mathcal{H})+3\sqrt{\log(2/\delta)/(2n)}\) for all \(h\in\mathcal{H}\).
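For a finite class, \(\mathfrak{R}_n\) can be estimated by Monte Carlo and compared against the \(\sqrt{\log|\mathcal{H}|/n}\) scaling of the finite-class bound. The sketch below represents \(\mathcal{H}\) as an \(|\mathcal{H}|\times n\) matrix of function values \(h(z_i)\); the class, sizes, and seed are illustrative assumptions:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of a finite
# class, represented as an |H| x n matrix of values h(z_i) in {-1, +1}.
rng = np.random.default_rng(1)

n, H_size = 500, 50
values = rng.choice([-1.0, 1.0], size=(H_size, n))  # an arbitrary finite class

def rademacher_estimate(values, n_trials=2000):
    """Average of sup_h (1/n) sum_i sigma_i h(z_i) over random sign draws."""
    n = values.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(values @ sigma) / n
    return total / n_trials

est = rademacher_estimate(values)
bound = np.sqrt(2 * np.log(H_size) / n)  # Massart's finite-class bound
print(f"MC estimate: {est:.3f}  finite-class bound: {bound:.3f}")
```

The estimate sits below the bound, as it must, and both shrink like \(1/\sqrt{n}\).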
Example 1: Bias-Variance Decomposition
Decompose \(R(\hat{h})\) into approximation and estimation error.
Solution: \(R(\hat{h})=\underbrace{R(h^*_{\mathcal{H}})}_\text{approx. error}+\underbrace{R(\hat{h})-R(h^*_{\mathcal{H}})}_\text{estimation error}\). Larger \(\mathcal{H}\) reduces approximation error but raises estimation error \(O(\sqrt{\log|\mathcal{H}|/n})\).
Example 2: Linear Classifier Rademacher Bound
Bound \(\mathfrak{R}_n\) for \(\{x\mapsto w^\top x:\|w\|\le B\}\) with \(\|x\|\le R\).
Solution: By Cauchy-Schwarz and Jensen's inequality: \(\mathfrak{R}_n\le BR/\sqrt{n}\). The generalization gap is \(O(BR/\sqrt{n})\), independent of the dimension.
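The Cauchy-Schwarz step gives the supremum in closed form, \(\sup_{\|w\|\le B}\frac{1}{n}\sum_i\sigma_iw^\top x_i=\frac{B}{n}\|\sum_i\sigma_ix_i\|\), so the Rademacher complexity of this class can be estimated directly and checked against \(BR/\sqrt{n}\). The dimensions, sample size, and seed below are illustrative assumptions:

```python
import numpy as np

# Sketch: Monte Carlo check of R_n <= BR/sqrt(n) for norm-bounded linear
# classifiers, using the closed form sup_{||w||<=B} = (B/n)||sum_i sigma_i x_i||.
rng = np.random.default_rng(2)

d, n, B = 5, 400, 2.0
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # enforce ||x_i|| = R = 1
R = 1.0

trials = 2000
total = 0.0
for _ in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)
    total += B * np.linalg.norm(sigma @ x) / n  # sup over ||w|| <= B, closed form
estimate = total / trials

bound = B * R / np.sqrt(n)
print(f"estimate: {estimate:.4f}  bound BR/sqrt(n): {bound:.4f}")
```

By Jensen, \(\mathbb{E}\|\sum_i\sigma_ix_i\|\le\sqrt{\mathbb{E}\|\sum_i\sigma_ix_i\|^2}=\sqrt{n}\), so the estimate lands just under the bound, with no dependence on \(d\).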
Practice
- Prove the union bound step in the finite-\(\mathcal{H}\) uniform convergence theorem.
- Show \(\mathfrak{R}_n\le BR/\sqrt{n}\) for norm-bounded linear classifiers.
- Explain why training error alone is insufficient to assess model quality.
- How does cross-validation relate to the bias-variance tradeoff in learning theory?
Answer Key
1. For finite \(|\mathcal{H}|\): \(P(\exists h: |\hat{R}(h)-R(h)|>\epsilon) \le \sum_{h\in\mathcal{H}} P(|\hat{R}(h)-R(h)|>\epsilon)\) (union bound) \(\le |\mathcal{H}|\cdot 2e^{-2n\epsilon^2}\) (Hoeffding per \(h\)). Setting this equal to \(\delta\) gives \(\epsilon = \sqrt{\frac{\ln|\mathcal{H}|+\ln(2/\delta)}{2n}}\). With probability \(\ge 1-\delta\): \(R(h) \le \hat{R}(h)+\sqrt{\frac{\ln|\mathcal{H}|+\ln(2/\delta)}{2n}}\) for all \(h\).
2. Rademacher complexity of \(\mathcal{F} = \{x \mapsto w^\top x : \|w\|\le B\}\) with \(\|x\|\le R\): \(\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}[\sup_{\|w\|\le B}\frac{1}{n}\sum_i \sigma_i w^\top x_i] = \mathbb{E}[\frac{B}{n}\|\sum_i\sigma_i x_i\|] \le \frac{B}{n}\sqrt{\mathbb{E}[\|\sum_i\sigma_i x_i\|^2]} = \frac{B}{n}\sqrt{\sum_i\|x_i\|^2} \le \frac{BR}{\sqrt{n}}\), using Cauchy-Schwarz for the second equality and Jensen for the inequality.
3. Low training error does not guarantee low test error: a model that memorizes the training data has zero training error but high test error (overfitting). The generalization gap \(R(h)-\hat{R}(h)\) depends on the complexity of the hypothesis class (VC dimension, Rademacher complexity). Assessing quality therefore requires either (1) a complexity bound plus sufficient data (PAC learning), (2) evaluation on a held-out validation/test set, or (3) cross-validation.
4. Cross-validation estimates generalization error by repeatedly splitting the data into training and validation sets. In bias-variance terms: simple models have high bias (underfitting) but low variance; complex models have low bias but high variance (overfitting). CV selects the model complexity that minimizes the estimated test error, balancing the two. \(k\)-fold CV averages over \(k\) splits, reducing the variance of the estimate.
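The cross-validation procedure in answer 4 can be sketched end to end: \(k\)-fold CV selecting a polynomial degree on noisy data. The target function, noise level, degree range, and seed are all illustrative assumptions:

```python
import numpy as np

# Sketch of k-fold cross-validation for model selection (polynomial degree),
# illustrating the bias-variance tradeoff: low degrees underfit, high
# degrees overfit, and CV picks a degree in between.
rng = np.random.default_rng(3)

n, k = 60, 5
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)  # noisy nonlinear target
folds = np.array_split(rng.permutation(n), k)  # fixed splits for all models

def cv_error(degree):
    """Mean squared validation error of a degree-`degree` polynomial fit."""
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[val]) - y[val]) ** 2))
    return float(np.mean(errs))

scores = {d: cv_error(d) for d in range(1, 11)}
best = min(scores, key=scores.get)
print(f"selected degree: {best}")
```

Fixing the folds once before the loop ensures every degree is scored on the same splits, so differences in CV error reflect model complexity rather than split noise.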