Kernel Methods & Reproducing Kernel Hilbert Spaces
Kernel methods implicitly map data into high- or infinite-dimensional feature spaces via the kernel trick, enabling nonlinear learning at the computational cost of linear methods. The RKHS framework provides the rigorous functional-analytic foundation.
Kernel & RKHS
\(k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}\) is positive semi-definite if \(\sum_{i,j}c_ic_jk(x_i,x_j)\ge 0\) for every finite set \(\{x_i\}\subset\mathcal{X}\) and coefficients \(\{c_i\}\subset\mathbb{R}\). By Mercer's theorem, such a kernel admits a feature-space representation \(k(x,x')=\langle\phi(x),\phi(x')\rangle_{\mathcal{H}}\). The associated RKHS \(\mathcal{H}_k\) satisfies the reproducing property \(f(x)=\langle f,k(\cdot,x)\rangle_{\mathcal{H}_k}\) for all \(f\in\mathcal{H}_k\).
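The PSD condition can be sanity-checked numerically: for any finite point set, the Gram matrix of a valid kernel must have nonnegative eigenvalues. A minimal sketch with the RBF kernel (the points and bandwidth below are arbitrary illustrative choices):

```python
import numpy as np

# Empirically check positive semi-definiteness of an RBF Gram matrix
# by inspecting its eigenvalue spectrum.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 points in R^3
sigma = 1.0

# K_ij = k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-8)   # True: eigenvalues are (numerically) nonnegative
```

A nonnegative spectrum for every sample is exactly the condition \(\sum_{i,j}c_ic_jk(x_i,x_j)=c^\top Kc\ge 0\).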
Representer Theorem
For strictly increasing \(g\), any minimizer of \(\frac{1}{n}\sum_i\ell(y_i,f(x_i))+g(\|f\|_{\mathcal{H}_k})\) over \(f\in\mathcal{H}_k\) has the form \(f^*(x)=\sum_{i=1}^n\alpha_ik(x_i,x)\). This reduces the infinite-dimensional problem to a finite-dimensional one solvable via the \(n\times n\) kernel matrix \(K_{ij}=k(x_i,x_j)\).
Example 1: RBF Kernel
Identify the feature space for \(k(x,x')=e^{-\|x-x'\|^2/(2\sigma^2)}\).
Solution: Factor \(k(x,x')=e^{-\|x\|^2/(2\sigma^2)}\,e^{-\|x'\|^2/(2\sigma^2)}\,e^{x^\top x'/\sigma^2}\). Taylor expanding the last factor gives an infinite-dimensional feature map containing scaled monomials of every polynomial degree. Each kernel evaluation still costs only \(O(d)\), so inner products in this infinite-dimensional space are computed implicitly.
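This expansion can be checked directly in one dimension: truncating the Taylor series of \(e^{xx'/\sigma^2}\) gives an explicit finite feature map whose inner product converges to the kernel value as the degree grows. A minimal sketch (the evaluation points and bandwidth are arbitrary):

```python
import numpy as np
from math import exp, factorial

def rbf(x, xp, sigma):
    # Exact 1-D RBF kernel value.
    return exp(-(x - xp) ** 2 / (2 * sigma ** 2))

def phi(x, sigma, degree):
    # Truncated feature map: phi_n(x) = e^{-x^2/(2 sigma^2)} x^n / (sigma^n sqrt(n!))
    return np.array([
        exp(-x ** 2 / (2 * sigma ** 2)) * x ** n / (sigma ** n * np.sqrt(factorial(n)))
        for n in range(degree + 1)
    ])

x, xp, sigma = 0.7, -0.3, 1.0
approx = phi(x, sigma, 15) @ phi(xp, sigma, 15)
print(abs(approx - rbf(x, xp, sigma)) < 1e-10)  # True: series converges rapidly
```

Degree 15 already matches the exact kernel to machine precision here because the series for \(e^{xx'/\sigma^2}\) converges factorially fast.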
Example 2: Kernel Ridge Regression
Solve \(\min_{f\in\mathcal{H}_k}\frac{1}{n}\|f(X)-y\|^2+\lambda\|f\|^2\).
Solution: The representer theorem gives \(\hat{\alpha}=(K+n\lambda I)^{-1}y\) and predictions \(\hat{f}(x)=k(x,X)^\top\hat{\alpha}\), where \(k(x,X)_i=k(x,x_i)\). Training costs \(O(n^3)\); prediction costs \(O(n)\) kernel evaluations per point.
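The closed-form solution is a few lines of linear algebra. A minimal sketch implementing \(\hat{\alpha}=(K+n\lambda I)^{-1}y\) with an RBF kernel; the data, bandwidth, and \(\lambda\) are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, sigma = 50, 1e-3, 0.5
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)

def rbf_gram(A, B):
    # K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

K = rbf_gram(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # O(n^3) training solve

X_test = np.linspace(-2, 2, 200)[:, None]
y_hat = rbf_gram(X_test, X) @ alpha                   # n kernel evals per point
mse = np.mean((y_hat - np.sin(2 * X_test[:, 0])) ** 2)
print(mse)   # small: the fit tracks sin(2x)
```

Note that only the \(n\times n\) Gram matrix is ever formed; the infinite-dimensional RBF feature space never appears explicitly.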
Practice
- Verify that the polynomial kernel \(k(x,x')=(x^\top x'+c)^p\) is PSD by constructing its feature map.
- Prove the reproducing property from the RKHS axioms.
- Why is the kernel trick computationally advantageous when \(\dim(\phi)\gg n\)?
- Analyze RBF bandwidth: what happens to the learned function as \(\sigma\to 0\) and \(\sigma\to\infty\)?
Answer Key
1. Polynomial kernel: $k(x,x')=(x^\top x'+c)^p = \sum_{|\alpha|\leq p}\binom{p}{\alpha}c^{p-|\alpha|}\prod_j (x_jx'_j)^{\alpha_j}$. The feature map $\phi(x)$ consists of all monomials $\sqrt{\binom{p}{\alpha}c^{p-|\alpha|}}\prod_j x_j^{\alpha_j}$ up to degree $p$. Then $k(x,x')=\phi(x)^\top\phi(x')$ is an inner product, so every Gram matrix satisfies $K=\Phi\Phi^\top \succeq 0$ with $K_{ij}=\phi(x_i)^\top\phi(x_j)$, i.e., $k$ is PSD.
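The feature-map construction can be verified concretely for $p=2$, $d=2$ (with $c=1$ as an illustrative choice): the six weighted monomials reproduce the kernel exactly.

```python
import numpy as np

c = 1.0

def phi(x):
    # Explicit feature map for (x^T x' + c)^2 in d = 2:
    # monomials up to degree 2, weighted by sqrt of multinomial coefficients.
    x1, x2 = x
    return np.array([
        c,                     # constant term, sqrt(c^2)
        np.sqrt(2 * c) * x1,   # degree-1 terms
        np.sqrt(2 * c) * x2,
        x1 ** 2,               # degree-2 terms
        np.sqrt(2) * x1 * x2,
        x2 ** 2,
    ])

x = np.array([0.5, -1.0])
xp = np.array([2.0, 0.3])
k_direct = (x @ xp + c) ** 2
k_feat = phi(x) @ phi(xp)
print(np.isclose(k_direct, k_feat))  # True
```

The same bookkeeping extends to any $p$ and $d$; the feature dimension grows as $\binom{d+p}{p}$, which is why the kernel trick matters.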
2. RKHS axiom: the evaluation functional $f \mapsto f(x)$ is bounded (hence continuous). By the Riesz representation theorem, $f(x) = \langle f, k_x\rangle_{\mathcal{H}}$ where $k_x = k(x,\cdot)$; this is the reproducing property. As a special case, taking $f = k_{x'}$ gives $k(x,x') = \langle k(x,\cdot), k(x',\cdot)\rangle$.
3. Without the kernel trick: must compute $\phi(x) \in \mathbb{R}^D$ explicitly ($D \gg n$, possibly infinite), then form $D \times D$ matrices. With kernel trick: only compute $k(x_i,x_j) = \phi(x_i)^T\phi(x_j)$, forming an $n \times n$ kernel matrix. Cost: $O(n^2)$ kernel evaluations vs $O(nD+D^2)$ in feature space. When $D \gg n$ (e.g., Gaussian kernel: $D=\infty$), the kernel trick is essential.
4. As $\sigma \to 0$: $k(x,x') = e^{-\|x-x'\|^2/2\sigma^2} \to \delta_{xx'}$ (identity kernel). Each training point gets its own basis function → extreme overfitting (memorization, zero bias, high variance). As $\sigma \to \infty$: $k \to 1$ (constant kernel). All points look identical → extreme underfitting (constant prediction, high bias, zero variance). Optimal $\sigma$ balances: captures true structure without memorizing noise.
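Both extremes are visible in a few lines of kernel ridge regression. A minimal sketch (evenly spaced 1-D points; the $\sigma$ values and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 30, 1e-6
X = np.linspace(-1, 1, n)[:, None]
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

def krr_train_predictions(sigma):
    # Fit alpha = (K + n*lam*I)^{-1} y and predict at the training points.
    sq = (X - X.T) ** 2
    K = np.exp(-sq / (2 * sigma ** 2))
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return K @ alpha

tiny = krr_train_predictions(1e-3)  # sigma -> 0: K -> I, memorizes noisy targets
huge = krr_train_predictions(1e4)   # sigma -> inf: K -> all-ones, collapses to a constant
print(np.max(np.abs(tiny - y)))     # near 0: interpolation (high variance)
print(np.std(huge))                 # near 0: essentially constant output (high bias)
```

The small-$\sigma$ fit reproduces the noisy targets exactly, while the large-$\sigma$ fit has almost no variation across inputs, matching the bias-variance analysis above.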