Convex Optimization & Gradient Descent
Convex optimization underpins much of machine learning: classical training objectives such as least squares, logistic regression, and SVMs are convex, which guarantees that every local minimum is global. Gradient descent exploits the geometry of the loss landscape to iteratively minimize the objective.
Convex Function
\(f: \mathbb{R}^d \to \mathbb{R}\) is convex if \(f(\lambda x+(1-\lambda)y)\le\lambda f(x)+(1-\lambda)f(y)\) for all \(x,y\) and \(\lambda\in[0,1]\). For differentiable \(f\) this is equivalent to \(f(y)\ge f(x)+\nabla f(x)^\top(y-x)\). A function is \(\beta\)-smooth if \(\|\nabla f(x)-\nabla f(y)\|\le\beta\|x-y\|\), and \(\alpha\)-strongly convex if \(f(y)\ge f(x)+\nabla f(x)^\top(y-x)+\frac{\alpha}{2}\|y-x\|^2\).
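A quick numerical sanity check of these inequalities, using \(f(x)=\|x\|^2\) as a test function (the data and function choice here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative convex test function: f(x) = ||x||^2, whose gradient is 2x.
def f(x):
    return float(x @ x)

def grad_f(x):
    return 2.0 * x

x, y = rng.standard_normal(3), rng.standard_normal(3)

# Chord definition of convexity: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y).
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-12

# First-order characterization: f lies above every tangent plane.
assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
```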
GD Convergence
For \(\beta\)-smooth convex \(f\), GD with step size \(\eta=1/\beta\) satisfies \(f(x_T)-f^*\le\frac{\beta\|x_0-x^*\|^2}{2T}\). If \(f\) is additionally \(\alpha\)-strongly convex, the same step size gives \(\|x_T-x^*\|^2\le(1-\alpha/\beta)^T\|x_0-x^*\|^2\) (linear convergence). The condition number \(\kappa=\beta/\alpha\) governs the speed.
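The linear rate is easy to observe empirically. A minimal sketch on a strongly convex quadratic \(f(x)=\frac{1}{2}x^\top Ax\) (the particular matrix \(A\) is an assumption for illustration), checking that the squared distance to \(x^*=0\) contracts by at least \(1-\alpha/\beta\) per step:

```python
import numpy as np

# A = diag(1, 10): alpha = 1, beta = 10, kappa = 10. Gradient of f is A x.
A = np.diag([1.0, 10.0])
alpha, beta = 1.0, 10.0
eta = 1.0 / beta

x = np.array([1.0, 1.0])          # x* = 0 for this f
dists = [x @ x]                   # squared distances ||x_t - x*||^2
for _ in range(50):
    x = x - eta * (A @ x)
    dists.append(x @ x)

# Per-step contraction of the squared distance never exceeds (1 - alpha/beta).
ratios = [d1 / d0 for d0, d1 in zip(dists, dists[1:]) if d0 > 0]
assert max(ratios) <= (1 - alpha / beta) + 1e-12
```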
Example 1: Least Squares
Minimize \(f(w)=\frac{1}{2n}\|Xw-y\|^2\). Is \(f\) convex?
Solution: \(\nabla^2 f=X^\top X/n\succeq 0\), so \(f\) is convex with \(\beta=\lambda_{\max}(X^\top X)/n\). GD converges at rate \(O(1/T)\); if \(X\) has full column rank, \(f\) is \(\alpha\)-strongly convex with \(\alpha=\lambda_{\min}(X^\top X)/n>0\), and GD converges linearly.
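A sketch of this example on synthetic data (sizes and seed are arbitrary), running GD with \(\eta=1/\beta\) and checking that it reaches the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.standard_normal((n, d))   # full column rank with probability 1
y = rng.standard_normal(n)

beta = np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant
eta = 1.0 / beta

w = np.zeros(d)
for _ in range(2000):
    w -= eta * X.T @ (X @ w - y) / n           # gradient of ||Xw-y||^2 / (2n)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # closed-form minimizer
assert np.allclose(w, w_star, atol=1e-6)
```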
Example 2: Logistic Loss
Show \(f(w)=\frac{1}{n}\sum_i\log(1+e^{-y_i w^\top x_i})\) is convex.
Solution: with \(z_i=y_i w^\top x_i\), \(\nabla^2 f=\frac{1}{n}\sum_i\sigma(z_i)(1-\sigma(z_i))x_ix_i^\top\succeq 0\), since each term is a nonnegative multiple of a PSD rank-one matrix. Thus \(f\) is convex, and because \(\sigma(z)(1-\sigma(z))\le 1/4\), it is \(\beta\)-smooth with \(\beta\le\|X\|^2/(4n)\).
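Both claims can be verified numerically; the following sketch (synthetic data, arbitrary sizes) checks that the logistic-loss Hessian is PSD and that its top eigenvalue respects the bound \(\|X\|^2/(4n)\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 3
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.standard_normal(d)

z = y * (X @ w)
s = 1.0 / (1.0 + np.exp(-z))                   # sigma(z_i)
H = (X.T * (s * (1 - s))) @ X / n              # Hessian: (1/n) sum_i s_i(1-s_i) x_i x_i^T

eigs = np.linalg.eigvalsh(H)
assert eigs.min() >= -1e-10                    # PSD up to rounding
spec = np.linalg.norm(X, 2) ** 2               # ||X||^2 = lambda_max(X^T X)
assert eigs.max() <= spec / (4 * n) + 1e-10    # smoothness bound beta <= ||X||^2/(4n)
```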
Practice
- Prove a twice-differentiable function is convex iff \(\nabla^2 f(x)\succeq 0\) everywhere.
- Derive the gradient of \(f(w)=\|Xw-y\|^2+\lambda\|w\|^2\) and find \(w^*\) in closed form.
- Show that if \(f\) is \(\alpha\)-strongly convex, it has a unique minimizer.
- Compare convergence \(O(1/T)\) vs \(O(\rho^T)\). For \(\kappa=10\), how many steps for \(10^{-6}\) accuracy?
Answer Key
1. (\(\Rightarrow\)) If \(f\) is convex, then \(f(y)\ge f(x)+\nabla f(x)^\top(y-x)\) for all \(x,y\). Take \(y=x+\epsilon v\): \(f(x+\epsilon v)\ge f(x)+\epsilon\nabla f(x)^\top v\). Taylor-expanding the left side: \(f(x)+\epsilon\nabla f(x)^\top v+\frac{\epsilon^2}{2}v^\top\nabla^2 f(x)\,v+O(\epsilon^3)\ge f(x)+\epsilon\nabla f(x)^\top v\), so \(v^\top\nabla^2 f(x)\,v\ge 0\) for all \(v\). (\(\Leftarrow\)) If \(\nabla^2 f\succeq 0\), Taylor's theorem with remainder gives \(f(y)=f(x)+\nabla f(x)^\top(y-x)+\frac{1}{2}(y-x)^\top\nabla^2 f(z)(y-x)\ge f(x)+\nabla f(x)^\top(y-x)\) for some \(z\) on the segment between \(x\) and \(y\). ✓
2. \(\nabla f = 2X^\top(Xw-y)+2\lambda w\). Setting it to zero gives \((X^\top X+\lambda I)w^* = X^\top y\), so \(w^* = (X^\top X+\lambda I)^{-1}X^\top y\). The matrix \(X^\top X+\lambda I\) is always invertible for \(\lambda>0\), since its eigenvalues are at least \(\lambda>0\).
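A sketch verifying this closed form on synthetic data (sizes, seed, and \(\lambda\) are arbitrary choices): the computed \(w^*\) should zero the gradient.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
lam = 0.5

# Closed-form ridge solution: solve (X^T X + lam I) w = X^T y.
w_star = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# The gradient 2 X^T (X w - y) + 2 lam w vanishes at w*.
grad = 2 * X.T @ (X @ w_star - y) + 2 * lam * w_star
assert np.allclose(grad, 0.0, atol=1e-10)
```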
3. Strong convexity gives \(f(y)\ge f(x)+\nabla f(x)^\top(y-x)+\frac{\alpha}{2}\|y-x\|^2\). Suppose there are two minimizers \(x^*,y^*\) with \(f(x^*)=f(y^*)=f^*\). Applying the inequality at \(x=x^*\), where \(\nabla f(x^*)=0\): \(f^*\ge f^*+\frac{\alpha}{2}\|y^*-x^*\|^2\), so \(\|y^*-x^*\|^2\le 0\), hence \(x^*=y^*\). ✓
4. GD on \(\beta\)-smooth, \(\alpha\)-strongly convex \(f\): \(f(x_T)-f^*\le(1-\alpha/\beta)^T(f(x_0)-f^*)\), i.e. \(O(\rho^T)\) with \(\rho=1-1/\kappa\) and \(\kappa=\beta/\alpha\). For \(\kappa=10\): \((0.9)^T<10^{-6}\) requires \(T>6\ln 10/\ln(10/9)\approx 131\) steps. Under plain convexity (or subgradient methods), the \(O(1/T)\) rate needs on the order of \(10^6\) steps for the same accuracy. Linear convergence is dramatically faster.
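The arithmetic in answer 4 can be checked directly:

```python
import math

# Linear rate at kappa = 10: rho = 1 - 1/kappa = 0.9.
kappa = 10
rho = 1 - 1 / kappa
T_linear = math.log(1e-6) / math.log(rho)   # solve rho^T = 1e-6 for T
assert 130 < T_linear < 132                 # about 131 steps

# O(1/T) rate: error ~ C/T, so 1e-6 accuracy needs on the order of 1e6 steps.
T_sublinear = 1e6
assert T_sublinear / T_linear > 7000        # linear convergence wins by ~4 orders
```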