
VC Dimension & PAC Learning

Mathematics of Machine Learning · 42 min


PAC (Probably Approximately Correct) learning provides a rigorous framework for when a hypothesis class is learnable from finite samples. The VC dimension characterizes combinatorial complexity and governs sample complexity.

VC Dimension

A set \(C=\{x_1,\ldots,x_m\}\) is shattered by \(\mathcal{H}\) if every binary labeling of \(C\) is realized by some \(h\in\mathcal{H}\). Then \(\mathrm{VCdim}(\mathcal{H})=\max\{m:\exists\,C\text{ of size }m\text{ shattered by }\mathcal{H}\}\). The growth function is \(\Pi_{\mathcal{H}}(m)=\max_{|C|=m}|\{(h(x_1),\ldots,h(x_m)):h\in\mathcal{H}\}|\le 2^m\).
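These definitions can be checked by brute force for a simple class. A minimal sketch using a finite grid of threshold classifiers \(h_t(x)=\mathbf{1}[x\ge t]\) (the grid and helper names are illustrative assumptions, not part of the lesson):

```python
def realized_labelings(points, hypotheses):
    """Distinct labelings the hypothesis class induces on the given points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def is_shattered(points, hypotheses):
    """C is shattered iff all 2^|C| binary labelings are realized."""
    return len(realized_labelings(points, hypotheses)) == 2 ** len(points)

# Finite approximation of thresholds h_t(x) = 1[x >= t] on a grid of t values.
thresholds = [-10] + [i / 10 for i in range(-50, 51)] + [10]
H = [(lambda t: (lambda x: int(x >= t)))(t) for t in thresholds]

print(is_shattered([0.0], H))       # True: one point realizes both labels
print(is_shattered([0.0, 1.0], H))  # False: (1, 0) is impossible for thresholds
```

On two points only three of the four labelings appear, illustrating \(\mathrm{VCdim}=1\) for thresholds and a growth function strictly below \(2^m\).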

Fundamental Theorem of PAC Learning

Let \(d=\mathrm{VCdim}(\mathcal{H})<\infty\). Sauer's lemma: \(\Pi_{\mathcal{H}}(m)\le(em/d)^d\) for \(m\ge d\). With probability \(\ge 1-\delta\) over \(n\) i.i.d. samples, empirical risk minimization returns \(\hat{h}\) with \(R(\hat{h})\le\min_{h\in\mathcal{H}} R(h)+O\!\left(\sqrt{\frac{d\log(n/d)+\log(1/\delta)}{n}}\right)\). Conversely, \(\mathrm{VCdim}(\mathcal{H})=\infty\) implies \(\mathcal{H}\) is not PAC learnable.
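A quick numeric sketch of Sauer's lemma for intervals on the line (VCdim 2, see Example 2), comparing the exact growth function with the \((em/d)^d\) bound and \(2^m\); the helper names are assumptions:

```python
import math

def sauer_bound(m, d):
    """Sauer's lemma bound (em/d)^d on the growth function, valid for m >= d."""
    return (math.e * m / d) ** d

def interval_growth(m):
    """Exact growth function of intervals on m distinct points: each labeling
    is a run of consecutive 1s (or all zeros), giving m(m+1)/2 + 1."""
    return m * (m + 1) // 2 + 1

for m in (2, 5, 10, 20):
    print(m, interval_growth(m), round(sauer_bound(m, 2)), 2 ** m)
```

The growth function is quadratic in \(m\) while \(2^m\) is exponential — exactly the gap that makes the generalization bound nontrivial.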

Example 1: Halfspaces in \(\mathbb{R}^d\)

VC dimension of linear classifiers \(h(x)=\mathrm{sign}(w^\top x+b)\)?

Solution: \(\mathrm{VCdim}=d+1\). Any \(d+1\) affinely independent points can be shattered: each sign pattern is realized by solving a linear system for \((w,b)\). No \(d+2\) points can be shattered: by Radon's theorem they split into two subsets whose convex hulls intersect, and no halfspace can separate such subsets.
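The shattering half can be sanity-checked numerically for \(d=2\): the perceptron finds a separating halfspace whenever one exists, so running it on all 8 labelings of 3 affinely independent points is a sketch of the argument (not a proof; names and epoch budget are assumptions):

```python
import itertools

def perceptron_fits(points, labels, epochs=1000):
    """Search for (w, b) with sign(w.x + b) matching labels in {-1, +1}.
    The perceptron converges whenever the labeling is linearly separable;
    non-convergence within `epochs` only suggests non-separability."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1; w[1] += y * x2; b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # 3 affinely independent points in R^2
shattered = all(perceptron_fits(pts, labs)
                for labs in itertools.product([-1, 1], repeat=3))
print(shattered)  # True: every labeling of the 3 points is linearly separable

# XOR on 4 points is not linearly separable, so the search fails:
print(perceptron_fits([(0, 0), (1, 1), (1, 0), (0, 1)], [1, 1, -1, -1]))  # False
```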

Example 2: Intervals on \(\mathbb{R}\)

Find VC dimension of classifiers \(h_{a,b}(x)=\mathbf{1}[x\in[a,b]]\).

Solution: \(\mathrm{VCdim}=2\). Two points \(x_1<x_2\) are shattered. For three points, the labeling \((+,-,+)\) requires two disjoint intervals, impossible with one.
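Both claims can be verified exhaustively: membership in \([a,b]\) depends only on where the endpoints fall relative to the points, so it suffices to try endpoints at midpoints between the sorted points plus one candidate on each side (a brute-force sketch; the helper name is an assumption):

```python
def interval_labelings(points):
    """All labelings realizable by h_{a,b}(x) = 1[x in [a, b]], using candidate
    endpoints at midpoints between sorted points plus one beyond each end."""
    xs = sorted(points)
    cuts = [xs[0] - 1] + [(u + v) / 2 for u, v in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(int(a <= x <= b) for x in points)
            for a in cuts for b in cuts}

print(len(interval_labelings([1.0, 2.0])))  # 4 = 2^2: two points are shattered
three = interval_labelings([1.0, 2.0, 3.0])
print(len(three), (1, 0, 1) in three)       # 7 False: (+,-,+) is not realizable
```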

Practice

  1. Prove \(\mathrm{VCdim}\) of axis-aligned rectangles in \(\mathbb{R}^2\) equals 4.
  2. State the No-Free-Lunch theorem and explain its connection to VC theory.
  3. Compute sample complexity for \(\epsilon=0.05\), \(\delta=0.05\), \(d=10\).
  4. How does VC dimension of a ReLU network scale with depth and width?
Answer Key

1. Need to shatter 4 points but not 5. Shatter 4: place points at $(\pm 1,0),(0,\pm 1)$, so each point is the unique extreme in one direction; each side of the rectangle can then include or exclude one point independently, realizing all 16 labelings. (The corners of a square do not work: any rectangle containing two opposite corners contains the other two.) Can't shatter 5: among any 5 points, pick ones attaining the largest $x$, smallest $x$, largest $y$, and smallest $y$; at least one remaining point lies inside their bounding box, and the labeling 'extremes positive, that point negative' cannot be realized, since an axis-aligned rectangle containing the extremes contains the whole bounding box. VCdim $= 4$.
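Both halves of the argument can be checked by brute force, using candidate rectangle boundaries at midpoints between coordinate values (which covers every distinct behavior; the names are illustrative assumptions):

```python
def cuts(vals):
    """Candidate boundary positions: midpoints plus one beyond each end."""
    vs = sorted(set(vals))
    return [vs[0] - 1] + [(u + v) / 2 for u, v in zip(vs, vs[1:])] + [vs[-1] + 1]

def rect_labelings(points):
    """Labelings of `points` realizable by axis-aligned rectangles [a,b] x [c,d]."""
    xc, yc = cuts([p[0] for p in points]), cuts([p[1] for p in points])
    return {tuple(int(a <= x <= b and c <= y <= d) for x, y in points)
            for a in xc for b in xc for c in yc for d in yc}

diamond = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # each point extreme in one direction
print(len(rect_labelings(diamond)))            # 16 = 2^4: shattered

five = diamond + [(0, 0)]                      # fifth point inside the bounding box
print((1, 1, 1, 1, 0) in rect_labelings(five)) # False: extremes in, center out fails
```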

2. No-Free-Lunch: averaged over all possible target functions, every learning algorithm has the same expected error. Formally: $\sum_f E_{\text{test}}[L(h_A,f)] = \sum_f E_{\text{test}}[L(h_B,f)]$ for any algorithms $A,B$. Connection to VC: NFL says no universal learner exists; VC theory says for a specific hypothesis class, sample complexity depends on VCdim — restricting the class of functions makes learning possible.
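The averaging argument can be seen in miniature. The 4-point domain and the two toy learners below are illustrative assumptions: averaged over all 16 target functions, both learners have the same expected test error on the unseen points:

```python
from itertools import product

domain, train, test = [0, 1, 2, 3], [0, 1], [2, 3]  # train on {0,1}, test on {2,3}

def avg_test_error(predict):
    """Mean 0-1 test error of a learner, averaged over ALL target functions f."""
    total = 0.0
    for f in product([0, 1], repeat=len(domain)):  # f as a truth table
        h = predict({x: f[x] for x in train})      # learner sees training labels only
        total += sum(h(x) != f[x] for x in test) / len(test)
    return total / 2 ** len(domain)

always_zero = lambda data: (lambda x: 0)           # ignores the data entirely
copy_first = lambda data: (lambda x: data[0])      # predicts the label of x=0 everywhere
print(avg_test_error(always_zero), avg_test_error(copy_first))  # both 0.5
```

Every learner, however clever, averages to chance-level error — learning only becomes possible once the class of targets is restricted, which is where VC dimension enters.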

3. PAC sample complexity (realizable case): $n \geq \frac{c}{\epsilon}(d\ln(1/\epsilon)+\ln(1/\delta))$ where $d = $ VCdim $= 10$. For $\epsilon=\delta=0.05$: $n \geq \frac{c}{0.05}(10\ln 20+\ln 20) = 20 \cdot 11\ln 20 \approx 660c$, i.e., several hundred samples for a modest constant $c$. The constant depends on which bound is used, and VC bounds are often loose; in practice far fewer samples may suffice.
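The arithmetic can be reproduced directly; since the leading constant $c$ depends on the proof of the bound, it is left as an explicit parameter (an assumption):

```python
import math

def pac_sample_bound(eps, delta, d, c=1.0):
    """Realizable-case PAC bound n >= (c/eps) * (d*ln(1/eps) + ln(1/delta)).
    The constant c varies with the proof; c = 1 here is an assumption."""
    return (c / eps) * (d * math.log(1 / eps) + math.log(1 / delta))

n = pac_sample_bound(eps=0.05, delta=0.05, d=10)
print(round(n))  # 659 when c = 1, matching the ~660c estimate above
```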

4. For a fully connected ReLU network with $L$ layers and width $w$, the parameter count is $W \sim Lw^2$. Known bounds give VCdim $= O(WL\log W)$, with a nearly matching lower bound $\Omega(WL\log(W/L))$ — roughly the number of parameters times depth, up to a log factor. Since $W$ itself grows with $L$, the VC dimension grows faster than linearly in depth at fixed width. However, these bounds are typically vacuous for overparameterized networks; in practice, implicit regularization (SGD, dropout) controls generalization better than VC theory predicts.