
Support Vector Machines: Primal & Dual

Mathematics of Machine Learning · 42 min


SVMs find the maximum-margin hyperplane separating classes, combining geometric intuition with convex duality. The dual formulation reveals the kernel trick and sparse support vector structure.

Hard- and Soft-Margin SVM

Primal: \(\min_{w,b}\frac{1}{2}\|w\|^2\) s.t. \(y_i(w^\top x_i+b)\ge 1\). The geometric margin width is \(2/\|w^*\|\). The soft-margin variant adds slacks \(\xi_i\ge 0\): minimize \(\frac{1}{2}\|w\|^2+C\sum_i\xi_i\) s.t. \(y_i(w^\top x_i+b)\ge 1-\xi_i\). \(C\) trades margin width against misclassification penalty.
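The soft-margin primal can be minimized directly by subgradient descent on \(w\) and \(b\), since the objective is convex but non-smooth at the margin. A minimal NumPy sketch on hypothetical toy data (the clusters, learning rate, and epoch count are illustrative choices, not prescribed by the text):

```python
import numpy as np

# Toy 2D data: two well-separated clusters, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

# Subgradient descent on the soft-margin objective
#   (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
C, lr, epochs = 1.0, 0.01, 500
w, b = np.zeros(2), 0.0
for _ in range(epochs):
    margins = y * (X @ w + b)
    viol = margins < 1                               # points inside the margin
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("w =", w, "b =", b)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
print("geometric margin width 2/||w|| =", 2 / np.linalg.norm(w))
```

Only points with \(y_i(w^\top x_i+b)<1\) contribute to the subgradient, which is the same sparsity that the dual makes explicit below.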

Dual & KKT

Lagrangian yields: \(\max_\alpha\sum_i\alpha_i-\frac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_jx_i^\top x_j\) s.t. \(0\le\alpha_i\le C\), \(\sum_i\alpha_iy_i=0\). KKT: \(w^*=\sum_i\alpha_i^*y_ix_i\). Points with \(\alpha_i^*>0\) are support vectors. Substitute \(k(x_i,x_j)\) for the dot product to kernelize.
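The dual can be solved numerically by projected gradient ascent. As a simplifying assumption for this sketch, the bias \(b\) is dropped, which removes the equality constraint \(\sum_i\alpha_iy_i=0\) and leaves only the box \(0\le\alpha_i\le C\), so projection is just clipping; the data and step size are illustrative:

```python
import numpy as np

# Bias-free SVM dual:  max_a  sum_i a_i - (1/2) a^T K a,  0 <= a_i <= C,
# where K_ij = y_i y_j x_i . x_j. Dropping b removes the equality
# constraint, so the feasible set is a box and projection is clipping.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (15, 2)), rng.normal(2, 0.5, (15, 2))])
y = np.array([-1.0] * 15 + [1.0] * 15)

C = 1.0
Z = y[:, None] * X                       # rows are y_i x_i
K = Z @ Z.T                              # Gram matrix with labels folded in
a = np.zeros(len(y))
for _ in range(5000):
    grad = 1.0 - K @ a                   # gradient of the dual objective
    a = np.clip(a + 0.001 * grad, 0.0, C)  # ascent step + box projection

w = (a * y) @ X                          # KKT: w* = sum_i a_i y_i x_i
sv = np.flatnonzero(a > 1e-6)            # support vectors have a_i > 0
print("support vectors:", sv, "out of", len(y), "points")
print("w =", w)
```

At the optimum, points far from the margin have a strictly negative dual gradient and get clipped to \(\alpha_i=0\), which is exactly the sparse support-vector structure the section describes.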

Example 1: Margin Generalization Bound

Show the SVM margin implies a generalization bound.

Solution: For margin \(\gamma=1/\|w^*\|\) and \(\|x\|\le R\), Rademacher complexity \(\mathfrak{R}_n\le R/(\gamma\sqrt{n})\). Thus \(R(\hat{h})\le\hat{R}(\hat{h})+2R/(\gamma\sqrt{n})+O(\sqrt{\log(1/\delta)/n})\). Maximizing margin minimizes this bound.

Example 2: Hinge Loss Equivalence

Show soft-margin SVM minimizes regularized hinge loss.

Solution: Hinge loss: \(\max(0,1-y_i(w^\top x_i+b))\). For any fixed \((w,b)\), the smallest feasible slack is \(\xi_i=\max(0,1-y_i(w^\top x_i+b))\), so the primal equals \(\frac{1}{2}\|w\|^2+C\sum_i\max(0,1-y_i(w^\top x_i+b))\), a convex surrogate for 0-1 loss with L2 regularization.
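The equivalence can be checked numerically: at any fixed \((w,b)\), setting each slack to the hinge loss satisfies both primal constraints with no room to shrink. A sketch on hypothetical random data:

```python
import numpy as np

# For fixed (w, b), the smallest slack satisfying both
# y_i(w.x_i + b) >= 1 - xi_i and xi_i >= 0 is
# xi_i = max(0, 1 - y_i(w.x_i + b)), i.e. the hinge loss.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
y = rng.choice([-1.0, 1.0], size=10)
w, b, C = rng.normal(size=3), 0.3, 2.0

margins = y * (X @ w + b)
xi = np.maximum(0.0, 1.0 - margins)      # optimal slacks = hinge losses

# Both constraints hold, and whichever constraint binds is tight,
# so no xi_i can be reduced further.
assert np.all(xi >= 0)
assert np.all(margins >= 1 - xi - 1e-12)

# Hence the constrained primal objective equals the regularized
# hinge-loss objective at these slacks.
primal = 0.5 * w @ w + C * xi.sum()
print("objective value:", primal)
```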

Practice

  1. Derive the SVM dual from the Lagrangian using KKT conditions step by step.
  2. Explain the role of \(C\): what is its effect on model bias and variance?
  3. For kernel SVM, how many support vectors do you expect as \(C\to\infty\)?
  4. Compare hinge loss and logistic loss: which is more robust to outliers?
Answer Key

1. Primal: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2+C\sum\xi_i$ s.t. $y_i(w^Tx_i+b)\geq 1-\xi_i$, $\xi_i\geq 0$. Lagrangian: $L = \frac{1}{2}\|w\|^2+C\sum\xi_i-\sum\alpha_i[y_i(w^Tx_i+b)-1+\xi_i]-\sum\mu_i\xi_i$. KKT: $\partial L/\partial w=0 \Rightarrow w=\sum\alpha_iy_ix_i$. $\partial L/\partial b=0 \Rightarrow \sum\alpha_iy_i=0$. $\partial L/\partial\xi_i=0 \Rightarrow \alpha_i+\mu_i=C$, so $0\leq\alpha_i\leq C$. Dual: $\max_\alpha \sum\alpha_i-\frac{1}{2}\sum_{ij}\alpha_i\alpha_jy_iy_jx_i^Tx_j$ s.t. $0\leq\alpha_i\leq C$, $\sum\alpha_iy_i=0$.

2. Small $C$: large margin, many slack violations allowed → smoother boundary, higher bias, lower variance (underfitting risk). Large $C$: penalizes violations heavily → narrower margin, fits training data closely, lower bias, higher variance (overfitting risk). $C=\infty$: hard-margin SVM (no violations allowed). $C$ controls the bias-variance tradeoff.

3. As $C \to \infty$: hard-margin SVM. Only the points closest to the boundary (support vectors on the margin) matter, so the number of SVs shrinks to the minimal set defining the maximum-margin hyperplane, as few as $d+1$ points in general position in $\mathbb{R}^d$. For non-separable data the hard-margin problem is infeasible, and the soft-margin objective diverges as $C \to \infty$.

4. Hinge loss: $\max(0, 1-yf(x))$ — flat beyond margin ($yf\geq 1$), zero gradient for correctly classified points far from boundary. Logistic loss: $\log(1+e^{-yf})$ — always nonzero, asymptotically linear. Hinge is more robust to outliers because it ignores well-classified points entirely (sparse SVs). Logistic is smoother, differentiable everywhere, but every point influences the solution. Both are convex surrogates for 0-1 loss.
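The contrast in answer 4 is easy to see numerically: hinge loss vanishes exactly once $yf \geq 1$, while logistic loss stays positive everywhere. A small sketch (the grid of margin values is illustrative):

```python
import numpy as np

# Hinge is exactly zero once yf >= 1, so well-classified points stop
# influencing the solution; logistic loss is positive everywhere,
# so every point keeps a (small) pull on the fit.
yf = np.array([-2.0, 0.0, 1.0, 3.0, 10.0])
hinge = np.maximum(0.0, 1.0 - yf)
logistic = np.log1p(np.exp(-yf))      # log(1 + e^{-yf}), numerically stable

for m, h, l in zip(yf, hinge, logistic):
    print(f"yf = {m:5.1f}   hinge = {h:4.1f}   logistic = {l:.6f}")
```

`log1p` keeps the logistic loss accurate for large margins, where `np.log(1 + np.exp(-yf))` would underflow to exactly zero.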