Regularization: Ridge, Lasso & Bayesian Perspective
Regularization constrains hypothesis complexity to reduce overfitting. Ridge and Lasso penalties admit both optimization and Bayesian interpretations, with Lasso inducing sparsity via non-smooth \(\ell_1\) geometry.
Ridge and Lasso
Ridge (L2): \(\hat{w}=\arg\min_w\|Xw-y\|^2+\lambda\|w\|^2\), with closed form \(\hat{w}=(X^\top X+\lambda I)^{-1}X^\top y\). Via the SVD \(X=U\Sigma V^\top\): \(\hat{w}=\sum_i\frac{\sigma_i}{\sigma_i^2+\lambda}(u_i^\top y)v_i\), so every singular direction is shrunk, most strongly those with small \(\sigma_i\). Lasso (L1): \(\hat{w}=\arg\min_w\frac{1}{2}\|Xw-y\|^2+\lambda\|w\|_1\). In the orthogonal case (\(X^\top X=I\)), writing \(\tilde{w}=X^\top y\) for the OLS solution, each coordinate is soft-thresholded: \(\hat{w}_j=\mathrm{sign}(\tilde{w}_j)(|\tilde{w}_j|-\lambda)_+\), which sets small coefficients exactly to zero (sparsity).
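The two ridge formulas above can be checked against each other numerically. A minimal numpy sketch (all data sizes and \(\lambda\) are illustrative, not from the text):

```python
import numpy as np

# Ridge via the normal equations and via the SVD; the two must agree.
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 0.5

# Closed form: (X^T X + lam I)^{-1} X^T y
w_normal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# SVD form: sum_i sigma_i / (sigma_i^2 + lam) * (u_i^T y) v_i
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

assert np.allclose(w_normal, w_svd)
```

Note that the SVD form also reveals the shrinkage factor \(\sigma_i/(\sigma_i^2+\lambda)\) direction by direction, which the normal-equations form hides.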
Bayesian Equivalence
MAP with Gaussian prior \(w\sim\mathcal{N}(0,\tau^2 I)\) and likelihood \(y|X,w\sim\mathcal{N}(Xw,\sigma^2 I)\) yields ridge with \(\lambda=\sigma^2/\tau^2\). Laplace prior \(p(w)\propto e^{-\lambda\|w\|_1}\) yields Lasso. Full Bayesian inference propagates uncertainty to predictions via the posterior predictive distribution.
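The MAP-equals-ridge claim can be verified by minimizing the negative log posterior directly. A sketch, assuming illustrative values for \(\sigma^2\), \(\tau^2\), and the data (gradient descent is just one convenient minimizer here):

```python
import numpy as np

# MAP under a Gaussian prior N(0, tau^2 I) with Gaussian likelihood recovers
# ridge with lam = sigma^2 / tau^2. Minimize the negative log posterior by
# gradient descent and compare to the ridge closed form.
rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
sigma2, tau2 = 1.0, 2.0
lam = sigma2 / tau2

# Negative log posterior (up to constants):
# (1/(2 sigma^2)) ||Xw - y||^2 + (1/(2 tau^2)) ||w||^2
def grad(w):
    return (X.T @ (X @ w - y)) / sigma2 + w / tau2

w = np.zeros(p)
step = 1e-3
for _ in range(50_000):
    w -= step * grad(w)

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(w, w_ridge, atol=1e-6)
```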
Example 1: Sparsity Geometry
Why does \(\ell_1\) regularization produce sparse solutions?
Solution: The \(\ell_1\) ball has corners on coordinate axes. Elliptical loss contours generically first contact the \(\ell_1\) constraint at a corner, setting a coordinate to zero. The smooth \(\ell_2\) ball yields tangency away from axes, giving dense solutions.
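The corner geometry shows up numerically: coefficients land *exactly* at zero, not merely near it. A sketch using proximal gradient descent (ISTA) on a small synthetic lasso problem; the design, noise level, and \(\lambda\) are illustrative:

```python
import numpy as np

# ISTA on a lasso problem with 3 truly active features out of 10.
rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.5, 1.0]          # only 3 truly active features
y = X @ w_true + 0.1 * rng.standard_normal(n)
lam = 10.0

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L, L = largest eigenvalue of X^T X
w = np.zeros(p)
for _ in range(5000):
    w = soft_threshold(w - step * X.T @ (X @ w - y), step * lam)

# Inactive coordinates are exactly 0.0; the active ones survive, shrunk.
```

Ridge run on the same data would give all ten coefficients small but nonzero, which is the \(\ell_2\)-tangency picture in numerical form.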
Example 2: Elastic Net
What problem does Elastic Net solve?
Solution: \(\hat{w}=\arg\min_w\|Xw-y\|^2+\lambda_1\|w\|_1+\lambda_2\|w\|^2\). Lasso arbitrarily selects one among correlated predictors; Elastic Net groups them. The \(\ell_2\) term provides strong convexity, enabling stable selection when \(p\gg n\).
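One way to sketch an Elastic Net solver is proximal gradient on the objective above: the prox of \(\eta(\lambda_1|w|+\lambda_2 w^2)\) is soft-thresholding followed by a \(1/(1+2\eta\lambda_2)\) shrink. With \(\lambda_1=0\) the solver should match the ridge closed form, a handy sanity check. Data and penalty values are illustrative:

```python
import numpy as np

# Proximal gradient for ||Xw-y||^2 + lam1*||w||_1 + lam2*||w||^2.
rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def elastic_net(X, y, lam1, lam2, iters=20_000):
    step = 0.5 / np.linalg.norm(X, 2) ** 2  # 1/L for the loss ||Xw-y||^2
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        v = w - step * 2.0 * X.T @ (X @ w - y)              # gradient step
        w = np.sign(v) * np.maximum(np.abs(v) - step * lam1, 0.0)  # l1 prox
        w /= 1.0 + 2.0 * step * lam2                        # l2 prox
    return w

# lam1 = 0 reduces to ridge with penalty lam2*||w||^2
w_en = elastic_net(X, y, lam1=0.0, lam2=1.5)
w_ridge = np.linalg.solve(X.T @ X + 1.5 * np.eye(p), X.T @ y)
assert np.allclose(w_en, w_ridge, atol=1e-8)
```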
Practice
- Derive the ridge solution via normal equations and verify using the SVD form.
- Prove that soft-thresholding is the proximal operator of \(\lambda\|\cdot\|_1\).
- Compute bias and variance of ridge regression for each principal component.
- Compare cross-validation for selecting \(\lambda\) in ridge vs Lasso: which is more stable?
Show Answer Key
1. Normal equations: $(X^TX+\lambda I)w^* = X^Ty$ → $w^* = (X^TX+\lambda I)^{-1}X^Ty$. SVD: $X = U\Sigma V^T$. Then $X^TX = V\Sigma^2V^T$, $X^Ty = V\Sigma U^Ty$. So $w^* = V(\Sigma^2+\lambda I)^{-1}\Sigma U^Ty = \sum_i \frac{\sigma_i}{\sigma_i^2+\lambda}(u_i^Ty)v_i$. Ridge shrinks each singular component by $\sigma_i/(\sigma_i^2+\lambda)$ — small singular values are shrunk most. ✓
2. Proximal operator of $\lambda\|\cdot\|_1$: $\text{prox}_{\lambda\|\cdot\|_1}(v) = \arg\min_w \frac{1}{2}\|w-v\|^2+\lambda\|w\|_1$. Separable in each coordinate: $w_j^* = \arg\min \frac{1}{2}(w_j-v_j)^2+\lambda|w_j|$. Solution: $w_j^* = \text{sign}(v_j)\max(|v_j|-\lambda, 0)$ (soft-thresholding). Proof: subgradient condition $w_j^*-v_j+\lambda\partial|w_j^*| \ni 0$. Three cases: $v_j>\lambda$, $v_j<-\lambda$, $|v_j|\leq\lambda$. ✓
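The subgradient proof in item 2 can be cross-checked by brute force: minimize the scalar objective over a dense grid and compare to the soft-thresholding formula. The particular \(\lambda\) and test values of \(v\) are arbitrary:

```python
import numpy as np

# Confirm that soft-thresholding minimizes (1/2)(w - v)^2 + lam*|w|
# by dense grid search, covering all three subgradient cases.
lam = 0.7
grid = np.linspace(-5.0, 5.0, 200_001)
for v in [-2.0, -0.3, 0.0, 0.5, 3.1]:
    objective = 0.5 * (grid - v) ** 2 + lam * np.abs(grid)
    w_grid = grid[np.argmin(objective)]
    w_prox = np.sign(v) * max(abs(v) - lam, 0.0)
    assert abs(w_grid - w_prox) < 1e-4
```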
3. Ridge regression along the $i$-th principal direction $v_i$: the expected coefficient is $\mathbb{E}[v_i^T\hat{w}] = \frac{\sigma_i^2}{\sigma_i^2+\lambda}\,v_i^Tw_{\text{true}}$. Bias magnitude $= \frac{\lambda}{\sigma_i^2+\lambda}|v_i^Tw_{\text{true}}|$: increases with $\lambda$ (more shrinkage). Variance $= \frac{\sigma^2\sigma_i^2}{(\sigma_i^2+\lambda)^2}$: decreases with $\lambda$. Small singular values (poorly determined directions) contribute high variance in OLS; ridge dramatically reduces their variance at the cost of bias.
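Both per-component formulas in item 3 can be checked by Monte Carlo: fix the design, redraw the noise many times, and compare empirical mean and variance of $v_i^T\hat{w}$ to theory. All sizes, seeds, and constants below are illustrative:

```python
import numpy as np

# Monte Carlo check of per-component ridge bias and variance.
rng = np.random.default_rng(4)
n, p = 30, 3
X = rng.standard_normal((n, p))
w_true = np.array([1.0, -2.0, 0.5])
sigma, lam = 0.5, 2.0

U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T  # ridge estimator matrix

samples = []
for _ in range(20_000):
    y = X @ w_true + sigma * rng.standard_normal(n)
    samples.append(Vt @ (A @ y))   # coefficients along each direction v_i
samples = np.array(samples)

mean_theory = (s**2 / (s**2 + lam)) * (Vt @ w_true)   # shrunk true signal
var_theory = sigma**2 * s**2 / (s**2 + lam) ** 2

assert np.allclose(samples.mean(axis=0), mean_theory, atol=0.02)
assert np.allclose(samples.var(axis=0), var_theory, rtol=0.1)
```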
4. Ridge: $\lambda$ smoothly shrinks all coefficients. The MSE curve (as function of $\lambda$) is smooth → CV estimate is stable, optimal $\lambda$ well-determined. Lasso: $\lambda$ causes discrete variable selection (coefficients jump to zero). Small $\lambda$ changes can add/remove variables → CV error is less smooth, optimal $\lambda$ selection is noisier. Remedy: use repeated CV or regularization path algorithms (LARS) for Lasso.