
VC Dimension & PAC Learning

Mathematics of Machine Learning · 42 min


PAC (Probably Approximately Correct) learning provides a rigorous framework for when a hypothesis class is learnable from finite samples. The VC dimension characterizes combinatorial complexity and governs sample complexity.

VC Dimension

A set \(C=\{x_1,\ldots,x_m\}\) is shattered by \(\mathcal{H}\) if every binary labeling of \(C\) is realized by some \(h\in\mathcal{H}\). Then \(\mathrm{VCdim}(\mathcal{H})=\max\{m:\exists\,C\text{ of size }m\text{ shattered by }\mathcal{H}\}\). The growth function is \(\Pi_{\mathcal{H}}(m)=\max_{|C|=m}|\{(h(x_1),\ldots,h(x_m)):h\in\mathcal{H}\}|\le 2^m\).
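These definitions can be checked by brute force for a simple class. A minimal sketch using a finite grid of threshold classifiers \(h_t(x)=\mathbf{1}[x\ge t]\) (the grid and helper names are illustrative assumptions, not part of the lesson):

```python
def realized_labelings(points, hypotheses):
    """Distinct labelings the hypothesis class induces on the given points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def is_shattered(points, hypotheses):
    """C is shattered iff all 2^|C| binary labelings are realized."""
    return len(realized_labelings(points, hypotheses)) == 2 ** len(points)

# Finite approximation of thresholds h_t(x) = 1[x >= t] on a grid of t values.
thresholds = [-10] + [i / 10 for i in range(-50, 51)] + [10]
H = [(lambda t: (lambda x: int(x >= t)))(t) for t in thresholds]

print(is_shattered([0.0], H))       # True: one point realizes both labels
print(is_shattered([0.0, 1.0], H))  # False: (1, 0) is impossible for thresholds
```

On two points only three of the four labelings appear, illustrating \(\mathrm{VCdim}=1\) for thresholds and a growth function strictly below \(2^m\).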

Fundamental Theorem of PAC Learning

Let \(d=\mathrm{VCdim}(\mathcal{H})<\infty\). Sauer's lemma: \(\Pi_{\mathcal{H}}(m)\le(em/d)^d\) for \(m\ge d\). With probability \(\ge 1-\delta\) over \(n\) i.i.d. samples, empirical risk minimization returns \(\hat{h}\) with \(R(\hat{h})\le\min_{h\in\mathcal{H}} R(h)+O\!\left(\sqrt{\frac{d\log(n/d)+\log(1/\delta)}{n}}\right)\). Conversely, \(\mathrm{VCdim}(\mathcal{H})=\infty\) implies \(\mathcal{H}\) is not PAC learnable.
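A quick numeric sketch of Sauer's lemma for intervals on the line (VCdim 2, see Example 2), comparing the exact growth function with the \((em/d)^d\) bound and \(2^m\); the helper names are assumptions:

```python
import math

def sauer_bound(m, d):
    """Sauer's lemma bound (em/d)^d on the growth function, valid for m >= d."""
    return (math.e * m / d) ** d

def interval_growth(m):
    """Exact growth function of intervals on m distinct points: each labeling
    is a run of consecutive 1s (or all zeros), giving m(m+1)/2 + 1."""
    return m * (m + 1) // 2 + 1

for m in (2, 5, 10, 20):
    print(m, interval_growth(m), round(sauer_bound(m, 2)), 2 ** m)
```

The growth function is quadratic in \(m\) while \(2^m\) is exponential — exactly the gap that makes the generalization bound nontrivial.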

Example 1: Halfspaces in \(\mathbb{R}^d\)

VC dimension of linear classifiers \(h(x)=\mathrm{sign}(w^\top x+b)\)?

Solution: \(\mathrm{VCdim}=d+1\). Any \(d+1\) affinely independent points can be shattered: each sign pattern is realized by solving a linear system for \((w,b)\). No \(d+2\) points can be shattered: by Radon's theorem they split into two subsets whose convex hulls intersect, and no halfspace can separate such subsets.
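The shattering half can be sanity-checked numerically for \(d=2\): the perceptron finds a separating halfspace whenever one exists, so running it on all 8 labelings of 3 affinely independent points is a sketch of the argument (not a proof; names and epoch budget are assumptions):

```python
import itertools

def perceptron_fits(points, labels, epochs=1000):
    """Search for (w, b) with sign(w.x + b) matching labels in {-1, +1}.
    The perceptron converges whenever the labeling is linearly separable;
    non-convergence within `epochs` only suggests non-separability."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1; w[1] += y * x2; b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # 3 affinely independent points in R^2
shattered = all(perceptron_fits(pts, labs)
                for labs in itertools.product([-1, 1], repeat=3))
print(shattered)  # True: every labeling of the 3 points is linearly separable

# XOR on 4 points is not linearly separable, so the search fails:
print(perceptron_fits([(0, 0), (1, 1), (1, 0), (0, 1)], [1, 1, -1, -1]))  # False
```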

Example 2: Intervals on \(\mathbb{R}\)

Find VC dimension of classifiers \(h_{a,b}(x)=\mathbf{1}[x\in[a,b]]\).

Solution: \(\mathrm{VCdim}=2\). Two points \(x_1<x_2\) are shattered. For three points, the labeling \((+,-,+)\) requires two disjoint intervals, impossible with one.
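Both claims can be verified exhaustively: membership in \([a,b]\) depends only on where the endpoints fall relative to the points, so it suffices to try endpoints at midpoints between the sorted points plus one candidate on each side (a brute-force sketch; the helper name is an assumption):

```python
def interval_labelings(points):
    """All labelings realizable by h_{a,b}(x) = 1[x in [a, b]], using candidate
    endpoints at midpoints between sorted points plus one beyond each end."""
    xs = sorted(points)
    cuts = [xs[0] - 1] + [(u + v) / 2 for u, v in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(int(a <= x <= b) for x in points)
            for a in cuts for b in cuts}

print(len(interval_labelings([1.0, 2.0])))  # 4 = 2^2: two points are shattered
three = interval_labelings([1.0, 2.0, 3.0])
print(len(three), (1, 0, 1) in three)       # 7 False: (+,-,+) is not realizable
```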

Practice

  1. Prove \(\mathrm{VCdim}\) of axis-aligned rectangles in \(\mathbb{R}^2\) equals 4.
  2. State the No-Free-Lunch theorem and explain its connection to VC theory.
  3. Compute sample complexity for \(\epsilon=0.05\), \(\delta=0.05\), \(d=10\).
  4. How does VC dimension of a ReLU network scale with depth and width?
Answer Key

1. Need to shatter 4 points but not 5. Shatter 4: place points at $(\pm 1,0),(0,\pm 1)$, so each point is the unique extreme in one direction; each side of the rectangle can then include or exclude one point independently, realizing all 16 labelings. (The corners of a square do not work: any rectangle containing two opposite corners contains the other two.) Can't shatter 5: among any 5 points, pick ones attaining the largest $x$, smallest $x$, largest $y$, and smallest $y$; at least one remaining point lies inside their bounding box, and the labeling 'extremes positive, that point negative' cannot be realized, since an axis-aligned rectangle containing the extremes contains the whole bounding box. VCdim $= 4$.
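Both halves of the argument can be checked by brute force, using candidate rectangle boundaries at midpoints between coordinate values (which covers every distinct behavior; the names are illustrative assumptions):

```python
def cuts(vals):
    """Candidate boundary positions: midpoints plus one beyond each end."""
    vs = sorted(set(vals))
    return [vs[0] - 1] + [(u + v) / 2 for u, v in zip(vs, vs[1:])] + [vs[-1] + 1]

def rect_labelings(points):
    """Labelings of `points` realizable by axis-aligned rectangles [a,b] x [c,d]."""
    xc, yc = cuts([p[0] for p in points]), cuts([p[1] for p in points])
    return {tuple(int(a <= x <= b and c <= y <= d) for x, y in points)
            for a in xc for b in xc for c in yc for d in yc}

diamond = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # each point extreme in one direction
print(len(rect_labelings(diamond)))            # 16 = 2^4: shattered

five = diamond + [(0, 0)]                      # fifth point inside the bounding box
print((1, 1, 1, 1, 0) in rect_labelings(five)) # False: extremes in, center out fails
```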

2. No-Free-Lunch: averaged over all possible target functions, every learning algorithm has the same expected error. Formally: $\sum_f E_{\text{test}}[L(h_A,f)] = \sum_f E_{\text{test}}[L(h_B,f)]$ for any algorithms $A,B$. Connection to VC: NFL says no universal learner exists; VC theory says for a specific hypothesis class, sample complexity depends on VCdim — restricting the class of functions makes learning possible.
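The averaging argument can be seen in miniature. The 4-point domain and the two toy learners below are illustrative assumptions: averaged over all 16 target functions, both learners have the same expected test error on the unseen points:

```python
from itertools import product

domain, train, test = [0, 1, 2, 3], [0, 1], [2, 3]  # train on {0,1}, test on {2,3}

def avg_test_error(predict):
    """Mean 0-1 test error of a learner, averaged over ALL target functions f."""
    total = 0.0
    for f in product([0, 1], repeat=len(domain)):  # f as a truth table
        h = predict({x: f[x] for x in train})      # learner sees training labels only
        total += sum(h(x) != f[x] for x in test) / len(test)
    return total / 2 ** len(domain)

always_zero = lambda data: (lambda x: 0)           # ignores the data entirely
copy_first = lambda data: (lambda x: data[0])      # predicts the label of x=0 everywhere
print(avg_test_error(always_zero), avg_test_error(copy_first))  # both 0.5
```

Every learner, however clever, averages to chance-level error — learning only becomes possible once the class of targets is restricted, which is where VC dimension enters.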

3. PAC sample complexity (realizable case): $n \geq \frac{c}{\epsilon}(d\ln(1/\epsilon)+\ln(1/\delta))$ where $d = $ VCdim $= 10$. For $\epsilon=\delta=0.05$: $n \geq \frac{c}{0.05}(10\ln 20+\ln 20) = 20 \cdot 11\ln 20 \approx 660c$, i.e., several hundred samples for a modest constant $c$. The constant depends on which bound is used, and VC bounds are often loose; in practice far fewer samples may suffice.
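The arithmetic can be reproduced directly; since the leading constant $c$ depends on the proof of the bound, it is left as an explicit parameter (an assumption):

```python
import math

def pac_sample_bound(eps, delta, d, c=1.0):
    """Realizable-case PAC bound n >= (c/eps) * (d*ln(1/eps) + ln(1/delta)).
    The constant c varies with the proof; c = 1 here is an assumption."""
    return (c / eps) * (d * math.log(1 / eps) + math.log(1 / delta))

n = pac_sample_bound(eps=0.05, delta=0.05, d=10)
print(round(n))  # 659 when c = 1, matching the ~660c estimate above
```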

4. For a fully connected ReLU network with $L$ layers and width $w$, the parameter count is $W \sim Lw^2$. Known bounds give VCdim $= O(WL\log W)$, with a nearly matching lower bound $\Omega(WL\log(W/L))$ — roughly the number of parameters times depth, up to a log factor. Since $W$ itself grows with $L$, the VC dimension grows faster than linearly in depth at fixed width. However, these bounds are typically vacuous for overparameterized networks; in practice, implicit regularization (SGD, dropout) controls generalization better than VC theory predicts.