Kernel Methods & Reproducing Kernel Hilbert Spaces
Kernel methods implicitly map data into high- or infinite-dimensional feature spaces via the kernel trick, enabling nonlinear learning at the computational cost of linear methods. The RKHS framework provides the rigorous functional-analytic foundation.
Kernel & RKHS
\(k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}\) is positive semi-definite if \(\sum_{i,j}c_ic_jk(x_i,x_j)\ge 0\) for every finite set \(\{x_i\}\subset\mathcal{X}\) and coefficients \(\{c_i\}\subset\mathbb{R}\). By Mercer's theorem, such a kernel admits a feature-space representation \(k(x,x')=\langle\phi(x),\phi(x')\rangle_{\mathcal{H}}\). The associated RKHS \(\mathcal{H}_k\) satisfies the reproducing property \(f(x)=\langle f,k(\cdot,x)\rangle_{\mathcal{H}_k}\) for all \(f\in\mathcal{H}_k\).
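The PSD condition can be sanity-checked numerically: for any finite point set, the Gram matrix of a valid kernel must have nonnegative eigenvalues. A minimal sketch with the RBF kernel (the points and bandwidth below are arbitrary illustrative choices):

```python
import numpy as np

# Empirically check positive semi-definiteness of an RBF Gram matrix
# by inspecting its eigenvalue spectrum.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 points in R^3
sigma = 1.0

# K_ij = k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-8)   # True: eigenvalues are (numerically) nonnegative
```

A nonnegative spectrum for every sample is exactly the condition \(\sum_{i,j}c_ic_jk(x_i,x_j)=c^\top Kc\ge 0\).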
Representer Theorem
For strictly increasing \(g\), any minimizer of \(\frac{1}{n}\sum_i\ell(y_i,f(x_i))+g(\|f\|_{\mathcal{H}_k})\) over \(f\in\mathcal{H}_k\) has the form \(f^*(x)=\sum_{i=1}^n\alpha_ik(x_i,x)\). This reduces the infinite-dimensional problem to a finite-dimensional one solvable via the \(n\times n\) kernel matrix \(K_{ij}=k(x_i,x_j)\).
Example 1: RBF Kernel
Identify the feature space for \(k(x,x')=e^{-\|x-x'\|^2/(2\sigma^2)}\).
Solution: Factor \(k(x,x')=e^{-\|x\|^2/(2\sigma^2)}\,e^{-\|x'\|^2/(2\sigma^2)}\,e^{x^\top x'/\sigma^2}\). Taylor expanding the last factor gives an infinite-dimensional feature map containing scaled monomials of every polynomial degree. Each kernel evaluation still costs only \(O(d)\), so inner products in this infinite-dimensional space are computed implicitly.
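This expansion can be checked directly in one dimension: truncating the Taylor series of \(e^{xx'/\sigma^2}\) gives an explicit finite feature map whose inner product converges to the kernel value as the degree grows. A minimal sketch (the evaluation points and bandwidth are arbitrary):

```python
import numpy as np
from math import exp, factorial

def rbf(x, xp, sigma):
    # Exact 1-D RBF kernel value.
    return exp(-(x - xp) ** 2 / (2 * sigma ** 2))

def phi(x, sigma, degree):
    # Truncated feature map: phi_n(x) = e^{-x^2/(2 sigma^2)} x^n / (sigma^n sqrt(n!))
    return np.array([
        exp(-x ** 2 / (2 * sigma ** 2)) * x ** n / (sigma ** n * np.sqrt(factorial(n)))
        for n in range(degree + 1)
    ])

x, xp, sigma = 0.7, -0.3, 1.0
approx = phi(x, sigma, 15) @ phi(xp, sigma, 15)
print(abs(approx - rbf(x, xp, sigma)) < 1e-10)  # True: series converges rapidly
```

Degree 15 already matches the exact kernel to machine precision here because the series for \(e^{xx'/\sigma^2}\) converges factorially fast.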
Example 2: Kernel Ridge Regression
Solve \(\min_{f\in\mathcal{H}_k}\frac{1}{n}\|f(X)-y\|^2+\lambda\|f\|^2\).
Solution: The representer theorem gives \(\hat{\alpha}=(K+n\lambda I)^{-1}y\) and predictions \(\hat{f}(x)=k(x,X)^\top\hat{\alpha}\), where \(k(x,X)_i=k(x,x_i)\). Training costs \(O(n^3)\); prediction costs \(O(n)\) kernel evaluations per point.
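The closed-form solution is a few lines of linear algebra. A minimal sketch implementing \(\hat{\alpha}=(K+n\lambda I)^{-1}y\) with an RBF kernel; the data, bandwidth, and \(\lambda\) are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, sigma = 50, 1e-3, 0.5
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)

def rbf_gram(A, B):
    # K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

K = rbf_gram(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # O(n^3) training solve

X_test = np.linspace(-2, 2, 200)[:, None]
y_hat = rbf_gram(X_test, X) @ alpha                   # n kernel evals per point
mse = np.mean((y_hat - np.sin(2 * X_test[:, 0])) ** 2)
print(mse)   # small: the fit tracks sin(2x)
```

Note that only the \(n\times n\) Gram matrix is ever formed; the infinite-dimensional RBF feature space never appears explicitly.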
Practice
- Verify that the polynomial kernel \(k(x,x')=(x^\top x'+c)^p\) is PSD by constructing its feature map.
- Prove the reproducing property from the RKHS axioms.
- Why is the kernel trick computationally advantageous when \(\dim(\phi)\gg n\)?
- Analyze RBF bandwidth: what happens to the learned function as \(\sigma\to 0\) and \(\sigma\to\infty\)?
Answer Key
1. Polynomial kernel: $k(x,x')=(x^\top x'+c)^p = \sum_{|\alpha|\leq p}\binom{p}{\alpha}c^{p-|\alpha|}\prod_j (x_jx'_j)^{\alpha_j}$. The feature map $\phi(x)$ consists of all monomials $\sqrt{\binom{p}{\alpha}c^{p-|\alpha|}}\prod_j x_j^{\alpha_j}$ up to degree $p$. Then $k(x,x')=\phi(x)^\top\phi(x')$ is an inner product, so every Gram matrix satisfies $K=\Phi\Phi^\top \succeq 0$ with $K_{ij}=\phi(x_i)^\top\phi(x_j)$, i.e., $k$ is PSD.
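The feature-map construction can be verified concretely for $p=2$, $d=2$ (with $c=1$ as an illustrative choice): the six weighted monomials reproduce the kernel exactly.

```python
import numpy as np

c = 1.0

def phi(x):
    # Explicit feature map for (x^T x' + c)^2 in d = 2:
    # monomials up to degree 2, weighted by sqrt of multinomial coefficients.
    x1, x2 = x
    return np.array([
        c,                     # constant term, sqrt(c^2)
        np.sqrt(2 * c) * x1,   # degree-1 terms
        np.sqrt(2 * c) * x2,
        x1 ** 2,               # degree-2 terms
        np.sqrt(2) * x1 * x2,
        x2 ** 2,
    ])

x = np.array([0.5, -1.0])
xp = np.array([2.0, 0.3])
k_direct = (x @ xp + c) ** 2
k_feat = phi(x) @ phi(xp)
print(np.isclose(k_direct, k_feat))  # True
```

The same bookkeeping extends to any $p$ and $d$; the feature dimension grows as $\binom{d+p}{p}$, which is why the kernel trick matters.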
2. RKHS axiom: the evaluation functional $f \mapsto f(x)$ is bounded (hence continuous). By the Riesz representation theorem, $f(x) = \langle f, k_x\rangle_{\mathcal{H}}$ where $k_x = k(x,\cdot)$; this is the reproducing property. As a special case, taking $f = k_{x'}$ gives $k(x,x') = \langle k(x,\cdot), k(x',\cdot)\rangle$.
3. Without the kernel trick: must compute $\phi(x) \in \mathbb{R}^D$ explicitly ($D \gg n$, possibly infinite), then form $D \times D$ matrices. With kernel trick: only compute $k(x_i,x_j) = \phi(x_i)^T\phi(x_j)$, forming an $n \times n$ kernel matrix. Cost: $O(n^2)$ kernel evaluations vs $O(nD+D^2)$ in feature space. When $D \gg n$ (e.g., Gaussian kernel: $D=\infty$), the kernel trick is essential.
4. As $\sigma \to 0$: $k(x,x') = e^{-\|x-x'\|^2/2\sigma^2} \to \delta_{xx'}$ (identity kernel). Each training point gets its own basis function → extreme overfitting (memorization, zero bias, high variance). As $\sigma \to \infty$: $k \to 1$ (constant kernel). All points look identical → extreme underfitting (constant prediction, high bias, zero variance). Optimal $\sigma$ balances: captures true structure without memorizing noise.
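Both extremes are visible in a few lines of kernel ridge regression. A minimal sketch (evenly spaced 1-D points; the $\sigma$ values and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 30, 1e-6
X = np.linspace(-1, 1, n)[:, None]
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

def krr_train_predictions(sigma):
    # Fit alpha = (K + n*lam*I)^{-1} y and predict at the training points.
    sq = (X - X.T) ** 2
    K = np.exp(-sq / (2 * sigma ** 2))
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return K @ alpha

tiny = krr_train_predictions(1e-3)  # sigma -> 0: K -> I, memorizes noisy targets
huge = krr_train_predictions(1e4)   # sigma -> inf: K -> all-ones, collapses to a constant
print(np.max(np.abs(tiny - y)))     # near 0: interpolation (high variance)
print(np.std(huge))                 # near 0: essentially constant output (high bias)
```

The small-$\sigma$ fit reproduces the noisy targets exactly, while the large-$\sigma$ fit has almost no variation across inputs, matching the bias-variance analysis above.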