
Support Vector Machines: Primal & Dual

Mathematics of Machine Learning · 42 min


SVMs find the maximum-margin hyperplane separating classes, combining geometric intuition with convex duality. The dual formulation reveals the kernel trick and sparse support vector structure.

Hard- and Soft-Margin SVM

Primal: \(\min_{w,b}\frac{1}{2}\|w\|^2\) s.t. \(y_i(w^\top x_i+b)\ge 1\). The geometric margin width is \(2/\|w^*\|\). The soft-margin variant adds slacks \(\xi_i\ge 0\): minimize \(\frac{1}{2}\|w\|^2+C\sum_i\xi_i\) s.t. \(y_i(w^\top x_i+b)\ge 1-\xi_i\). \(C\) trades margin width against misclassification penalty.
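The soft-margin primal can be minimized directly by subgradient descent on \(w\) and \(b\), since the objective is convex but non-smooth at the margin. A minimal NumPy sketch on hypothetical toy data (the clusters, learning rate, and epoch count are illustrative choices, not prescribed by the text):

```python
import numpy as np

# Toy 2D data: two well-separated clusters, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

# Subgradient descent on the soft-margin objective
#   (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
C, lr, epochs = 1.0, 0.01, 500
w, b = np.zeros(2), 0.0
for _ in range(epochs):
    margins = y * (X @ w + b)
    viol = margins < 1                               # points inside the margin
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("w =", w, "b =", b)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
print("geometric margin width 2/||w|| =", 2 / np.linalg.norm(w))
```

Only points with \(y_i(w^\top x_i+b)<1\) contribute to the subgradient, which is the same sparsity that the dual makes explicit below.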

Dual & KKT

Lagrangian yields: \(\max_\alpha\sum_i\alpha_i-\frac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_jx_i^\top x_j\) s.t. \(0\le\alpha_i\le C\), \(\sum_i\alpha_iy_i=0\). KKT: \(w^*=\sum_i\alpha_i^*y_ix_i\). Points with \(\alpha_i^*>0\) are support vectors. Substitute \(k(x_i,x_j)\) for the dot product to kernelize.
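The dual can be solved numerically by projected gradient ascent. As a simplifying assumption for this sketch, the bias \(b\) is dropped, which removes the equality constraint \(\sum_i\alpha_iy_i=0\) and leaves only the box \(0\le\alpha_i\le C\), so projection is just clipping; the data and step size are illustrative:

```python
import numpy as np

# Bias-free SVM dual:  max_a  sum_i a_i - (1/2) a^T K a,  0 <= a_i <= C,
# where K_ij = y_i y_j x_i . x_j. Dropping b removes the equality
# constraint, so the feasible set is a box and projection is clipping.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (15, 2)), rng.normal(2, 0.5, (15, 2))])
y = np.array([-1.0] * 15 + [1.0] * 15)

C = 1.0
Z = y[:, None] * X                       # rows are y_i x_i
K = Z @ Z.T                              # Gram matrix with labels folded in
a = np.zeros(len(y))
for _ in range(5000):
    grad = 1.0 - K @ a                   # gradient of the dual objective
    a = np.clip(a + 0.001 * grad, 0.0, C)  # ascent step + box projection

w = (a * y) @ X                          # KKT: w* = sum_i a_i y_i x_i
sv = np.flatnonzero(a > 1e-6)            # support vectors have a_i > 0
print("support vectors:", sv, "out of", len(y), "points")
print("w =", w)
```

At the optimum, points far from the margin have a strictly negative dual gradient and get clipped to \(\alpha_i=0\), which is exactly the sparse support-vector structure the section describes.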

Example 1: Margin Generalization Bound

Show the SVM margin implies a generalization bound.

Solution: For margin \(\gamma=1/\|w^*\|\) and \(\|x\|\le R\), Rademacher complexity \(\mathfrak{R}_n\le R/(\gamma\sqrt{n})\). Thus \(R(\hat{h})\le\hat{R}(\hat{h})+2R/(\gamma\sqrt{n})+O(\sqrt{\log(1/\delta)/n})\). Maximizing margin minimizes this bound.

Example 2: Hinge Loss Equivalence

Show soft-margin SVM minimizes regularized hinge loss.

Solution: Hinge loss: \(\max(0,1-y_i(w^\top x_i+b))\). For any fixed \((w,b)\), the smallest feasible slack is \(\xi_i=\max(0,1-y_i(w^\top x_i+b))\), so the primal equals \(\frac{1}{2}\|w\|^2+C\sum_i\max(0,1-y_i(w^\top x_i+b))\), a convex surrogate for 0-1 loss with L2 regularization.
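The equivalence can be checked numerically: at any fixed \((w,b)\), setting each slack to the hinge loss satisfies both primal constraints with no room to shrink. A sketch on hypothetical random data:

```python
import numpy as np

# For fixed (w, b), the smallest slack satisfying both
# y_i(w.x_i + b) >= 1 - xi_i and xi_i >= 0 is
# xi_i = max(0, 1 - y_i(w.x_i + b)), i.e. the hinge loss.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
y = rng.choice([-1.0, 1.0], size=10)
w, b, C = rng.normal(size=3), 0.3, 2.0

margins = y * (X @ w + b)
xi = np.maximum(0.0, 1.0 - margins)      # optimal slacks = hinge losses

# Both constraints hold, and whichever constraint binds is tight,
# so no xi_i can be reduced further.
assert np.all(xi >= 0)
assert np.all(margins >= 1 - xi - 1e-12)

# Hence the constrained primal objective equals the regularized
# hinge-loss objective at these slacks.
primal = 0.5 * w @ w + C * xi.sum()
print("objective value:", primal)
```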

Practice

  1. Derive the SVM dual from the Lagrangian using KKT conditions step by step.
  2. Explain the role of \(C\): what is its effect on model bias and variance?
  3. For kernel SVM, how many support vectors do you expect as \(C\to\infty\)?
  4. Compare hinge loss and logistic loss: which is more robust to outliers?
Answer Key

1. Primal: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2+C\sum\xi_i$ s.t. $y_i(w^Tx_i+b)\geq 1-\xi_i$, $\xi_i\geq 0$. Lagrangian: $L = \frac{1}{2}\|w\|^2+C\sum\xi_i-\sum\alpha_i[y_i(w^Tx_i+b)-1+\xi_i]-\sum\mu_i\xi_i$. KKT: $\partial L/\partial w=0 \Rightarrow w=\sum\alpha_iy_ix_i$. $\partial L/\partial b=0 \Rightarrow \sum\alpha_iy_i=0$. $\partial L/\partial\xi_i=0 \Rightarrow \alpha_i+\mu_i=C$, so $0\leq\alpha_i\leq C$. Dual: $\max_\alpha \sum\alpha_i-\frac{1}{2}\sum_{ij}\alpha_i\alpha_jy_iy_jx_i^Tx_j$ s.t. $0\leq\alpha_i\leq C$, $\sum\alpha_iy_i=0$.

2. Small $C$: large margin, many slack violations allowed → smoother boundary, higher bias, lower variance (underfitting risk). Large $C$: penalizes violations heavily → narrower margin, fits training data closely, lower bias, higher variance (overfitting risk). $C=\infty$: hard-margin SVM (no violations allowed). $C$ controls the bias-variance tradeoff.

3. As $C \to \infty$: hard-margin SVM. Only the points closest to the boundary (support vectors on the margin) matter, so the number of SVs shrinks to the minimal set defining the maximum-margin hyperplane, as few as $d+1$ points in general position in $\mathbb{R}^d$. For non-separable data the hard-margin problem is infeasible, and the soft-margin objective diverges as $C \to \infty$.

4. Hinge loss: $\max(0, 1-yf(x))$ — flat beyond margin ($yf\geq 1$), zero gradient for correctly classified points far from boundary. Logistic loss: $\log(1+e^{-yf})$ — always nonzero, asymptotically linear. Hinge is more robust to outliers because it ignores well-classified points entirely (sparse SVs). Logistic is smoother, differentiable everywhere, but every point influences the solution. Both are convex surrogates for 0-1 loss.
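The contrast in answer 4 is easy to see numerically: hinge loss vanishes exactly once $yf \geq 1$, while logistic loss stays positive everywhere. A small sketch (the grid of margin values is illustrative):

```python
import numpy as np

# Hinge is exactly zero once yf >= 1, so well-classified points stop
# influencing the solution; logistic loss is positive everywhere,
# so every point keeps a (small) pull on the fit.
yf = np.array([-2.0, 0.0, 1.0, 3.0, 10.0])
hinge = np.maximum(0.0, 1.0 - yf)
logistic = np.log1p(np.exp(-yf))      # log(1 + e^{-yf}), numerically stable

for m, h, l in zip(yf, hinge, logistic):
    print(f"yf = {m:5.1f}   hinge = {h:4.1f}   logistic = {l:.6f}")
```

`log1p` keeps the logistic loss accurate for large margins, where `np.log(1 + np.exp(-yf))` would underflow to exactly zero.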