Regularization: Ridge, Lasso & Bayesian Perspective
Regularization constrains hypothesis complexity to reduce overfitting. Ridge and Lasso penalties admit both optimization and Bayesian interpretations, with Lasso inducing sparsity via non-smooth \(\ell_1\) geometry.
Ridge and Lasso
Ridge (L2): \(\hat{w}=\arg\min_w\|Xw-y\|^2+\lambda\|w\|^2\), with closed form \(\hat{w}=(X^\top X+\lambda I)^{-1}X^\top y\). Via the SVD \(X=U\Sigma V^\top\): \(\hat{w}=\sum_i\frac{\sigma_i}{\sigma_i^2+\lambda}(u_i^\top y)v_i\), so every singular direction is shrunk, most strongly those with small \(\sigma_i\). Lasso (L1): \(\hat{w}=\arg\min_w\frac{1}{2}\|Xw-y\|^2+\lambda\|w\|_1\). In the orthogonal case (\(X^\top X=I\)), writing \(\tilde{w}=X^\top y\) for the OLS solution, each coordinate is soft-thresholded: \(\hat{w}_j=\mathrm{sign}(\tilde{w}_j)(|\tilde{w}_j|-\lambda)_+\), which sets small coefficients exactly to zero (sparsity).
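The two ridge formulas above can be checked against each other numerically. A minimal numpy sketch (all data sizes and \(\lambda\) are illustrative, not from the text):

```python
import numpy as np

# Ridge via the normal equations and via the SVD; the two must agree.
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 0.5

# Closed form: (X^T X + lam I)^{-1} X^T y
w_normal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# SVD form: sum_i sigma_i / (sigma_i^2 + lam) * (u_i^T y) v_i
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

assert np.allclose(w_normal, w_svd)
```

Note that the SVD form also reveals the shrinkage factor \(\sigma_i/(\sigma_i^2+\lambda)\) direction by direction, which the normal-equations form hides.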
Bayesian Equivalence
MAP with Gaussian prior \(w\sim\mathcal{N}(0,\tau^2 I)\) and likelihood \(y|X,w\sim\mathcal{N}(Xw,\sigma^2 I)\) yields ridge with \(\lambda=\sigma^2/\tau^2\). Laplace prior \(p(w)\propto e^{-\lambda\|w\|_1}\) yields Lasso. Full Bayesian inference propagates uncertainty to predictions via the posterior predictive distribution.
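The MAP-equals-ridge claim can be verified by minimizing the negative log posterior directly. A sketch, assuming illustrative values for \(\sigma^2\), \(\tau^2\), and the data (gradient descent is just one convenient minimizer here):

```python
import numpy as np

# MAP under a Gaussian prior N(0, tau^2 I) with Gaussian likelihood recovers
# ridge with lam = sigma^2 / tau^2. Minimize the negative log posterior by
# gradient descent and compare to the ridge closed form.
rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
sigma2, tau2 = 1.0, 2.0
lam = sigma2 / tau2

# Negative log posterior (up to constants):
# (1/(2 sigma^2)) ||Xw - y||^2 + (1/(2 tau^2)) ||w||^2
def grad(w):
    return (X.T @ (X @ w - y)) / sigma2 + w / tau2

w = np.zeros(p)
step = 1e-3
for _ in range(50_000):
    w -= step * grad(w)

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(w, w_ridge, atol=1e-6)
```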
Example 1: Sparsity Geometry
Why does \(\ell_1\) regularization produce sparse solutions?
Solution: The \(\ell_1\) ball has corners on coordinate axes. Elliptical loss contours generically first contact the \(\ell_1\) constraint at a corner, setting a coordinate to zero. The smooth \(\ell_2\) ball yields tangency away from axes, giving dense solutions.
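The corner geometry shows up numerically: coefficients land *exactly* at zero, not merely near it. A sketch using proximal gradient descent (ISTA) on a small synthetic lasso problem; the design, noise level, and \(\lambda\) are illustrative:

```python
import numpy as np

# ISTA on a lasso problem with 3 truly active features out of 10.
rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.5, 1.0]          # only 3 truly active features
y = X @ w_true + 0.1 * rng.standard_normal(n)
lam = 10.0

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L, L = largest eigenvalue of X^T X
w = np.zeros(p)
for _ in range(5000):
    w = soft_threshold(w - step * X.T @ (X @ w - y), step * lam)

# Inactive coordinates are exactly 0.0; the active ones survive, shrunk.
```

Ridge run on the same data would give all ten coefficients small but nonzero, which is the \(\ell_2\)-tangency picture in numerical form.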
Example 2: Elastic Net
What problem does Elastic Net solve?
Solution: \(\hat{w}=\arg\min_w\|Xw-y\|^2+\lambda_1\|w\|_1+\lambda_2\|w\|^2\). Lasso arbitrarily selects one among correlated predictors; Elastic Net groups them. The \(\ell_2\) term provides strong convexity, enabling stable selection when \(p\gg n\).
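One way to sketch an Elastic Net solver is proximal gradient on the objective above: the prox of \(\eta(\lambda_1|w|+\lambda_2 w^2)\) is soft-thresholding followed by a \(1/(1+2\eta\lambda_2)\) shrink. With \(\lambda_1=0\) the solver should match the ridge closed form, a handy sanity check. Data and penalty values are illustrative:

```python
import numpy as np

# Proximal gradient for ||Xw-y||^2 + lam1*||w||_1 + lam2*||w||^2.
rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def elastic_net(X, y, lam1, lam2, iters=20_000):
    step = 0.5 / np.linalg.norm(X, 2) ** 2  # 1/L for the loss ||Xw-y||^2
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        v = w - step * 2.0 * X.T @ (X @ w - y)              # gradient step
        w = np.sign(v) * np.maximum(np.abs(v) - step * lam1, 0.0)  # l1 prox
        w /= 1.0 + 2.0 * step * lam2                        # l2 prox
    return w

# lam1 = 0 reduces to ridge with penalty lam2*||w||^2
w_en = elastic_net(X, y, lam1=0.0, lam2=1.5)
w_ridge = np.linalg.solve(X.T @ X + 1.5 * np.eye(p), X.T @ y)
assert np.allclose(w_en, w_ridge, atol=1e-8)
```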
Practice
- Derive the ridge solution via normal equations and verify using the SVD form.
- Prove that soft-thresholding is the proximal operator of \(\lambda\|\cdot\|_1\).
- Compute bias and variance of ridge regression for each principal component.
- Compare cross-validation for selecting \(\lambda\) in ridge vs Lasso: which is more stable?
Show Answer Key
1. Normal equations: $(X^TX+\lambda I)w^* = X^Ty$ → $w^* = (X^TX+\lambda I)^{-1}X^Ty$. SVD: $X = U\Sigma V^T$. Then $X^TX = V\Sigma^2V^T$, $X^Ty = V\Sigma U^Ty$. So $w^* = V(\Sigma^2+\lambda I)^{-1}\Sigma U^Ty = \sum_i \frac{\sigma_i}{\sigma_i^2+\lambda}(u_i^Ty)v_i$. Ridge shrinks each singular component by $\sigma_i/(\sigma_i^2+\lambda)$ — small singular values are shrunk most. ✓
2. Proximal operator of $\lambda\|\cdot\|_1$: $\text{prox}_{\lambda\|\cdot\|_1}(v) = \arg\min_w \frac{1}{2}\|w-v\|^2+\lambda\|w\|_1$. Separable in each coordinate: $w_j^* = \arg\min \frac{1}{2}(w_j-v_j)^2+\lambda|w_j|$. Solution: $w_j^* = \text{sign}(v_j)\max(|v_j|-\lambda, 0)$ (soft-thresholding). Proof: subgradient condition $w_j^*-v_j+\lambda\partial|w_j^*| \ni 0$. Three cases: $v_j>\lambda$, $v_j<-\lambda$, $|v_j|\leq\lambda$. ✓
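The subgradient proof in item 2 can be cross-checked by brute force: minimize the scalar objective over a dense grid and compare to the soft-thresholding formula. The particular \(\lambda\) and test values of \(v\) are arbitrary:

```python
import numpy as np

# Confirm that soft-thresholding minimizes (1/2)(w - v)^2 + lam*|w|
# by dense grid search, covering all three subgradient cases.
lam = 0.7
grid = np.linspace(-5.0, 5.0, 200_001)
for v in [-2.0, -0.3, 0.0, 0.5, 3.1]:
    objective = 0.5 * (grid - v) ** 2 + lam * np.abs(grid)
    w_grid = grid[np.argmin(objective)]
    w_prox = np.sign(v) * max(abs(v) - lam, 0.0)
    assert abs(w_grid - w_prox) < 1e-4
```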
3. Ridge regression along the $i$-th principal direction $v_i$: the expected coefficient is $\mathbb{E}[v_i^T\hat{w}] = \frac{\sigma_i^2}{\sigma_i^2+\lambda}\,v_i^Tw_{\text{true}}$. Bias magnitude $= \frac{\lambda}{\sigma_i^2+\lambda}|v_i^Tw_{\text{true}}|$: increases with $\lambda$ (more shrinkage). Variance $= \frac{\sigma^2\sigma_i^2}{(\sigma_i^2+\lambda)^2}$: decreases with $\lambda$. Small singular values (poorly determined directions) contribute high variance in OLS; ridge dramatically reduces their variance at the cost of bias.
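Both per-component formulas in item 3 can be checked by Monte Carlo: fix the design, redraw the noise many times, and compare empirical mean and variance of $v_i^T\hat{w}$ to theory. All sizes, seeds, and constants below are illustrative:

```python
import numpy as np

# Monte Carlo check of per-component ridge bias and variance.
rng = np.random.default_rng(4)
n, p = 30, 3
X = rng.standard_normal((n, p))
w_true = np.array([1.0, -2.0, 0.5])
sigma, lam = 0.5, 2.0

U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T  # ridge estimator matrix

samples = []
for _ in range(20_000):
    y = X @ w_true + sigma * rng.standard_normal(n)
    samples.append(Vt @ (A @ y))   # coefficients along each direction v_i
samples = np.array(samples)

mean_theory = (s**2 / (s**2 + lam)) * (Vt @ w_true)   # shrunk true signal
var_theory = sigma**2 * s**2 / (s**2 + lam) ** 2

assert np.allclose(samples.mean(axis=0), mean_theory, atol=0.02)
assert np.allclose(samples.var(axis=0), var_theory, rtol=0.1)
```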
4. Ridge: $\lambda$ smoothly shrinks all coefficients. The MSE curve (as function of $\lambda$) is smooth → CV estimate is stable, optimal $\lambda$ well-determined. Lasso: $\lambda$ causes discrete variable selection (coefficients jump to zero). Small $\lambda$ changes can add/remove variables → CV error is less smooth, optimal $\lambda$ selection is noisier. Remedy: use repeated CV or regularization path algorithms (LARS) for Lasso.