
Convex Optimization & Gradient Descent

42 min · Mathematics of Machine Learning


Convex optimization is the backbone of classical machine learning: many training objectives (least squares, logistic regression, SVMs) are convex, guaranteeing that every local minimum is global. Gradient descent exploits the geometry of the loss landscape to iteratively minimize the objective.

Convex Function

\(f: \mathbb{R}^d \to \mathbb{R}\) is convex if \(f(\lambda x+(1-\lambda)y)\le\lambda f(x)+(1-\lambda)f(y)\) for all \(x,y\) and \(\lambda\in[0,1]\). Equivalently, for differentiable \(f\): \(f(y)\ge f(x)+\nabla f(x)^\top(y-x)\) for all \(x,y\). A function is \(\beta\)-smooth if \(\|\nabla f(x)-\nabla f(y)\|\le\beta\|x-y\|\), and \(\alpha\)-strongly convex if \(f(y)\ge f(x)+\nabla f(x)^\top(y-x)+\frac{\alpha}{2}\|y-x\|^2\).
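These three conditions can be checked numerically on a simple test function. A minimal sketch, using the quadratic \(f(x)=\frac{1}{2}\|x\|^2\) (an illustrative choice, for which \(\alpha=\beta=1\)):

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: 0.5 * np.dot(x, x)    # alpha = beta = 1 for this quadratic
grad = lambda x: x                  # its gradient

x, y_ = rng.standard_normal(5), rng.standard_normal(5)
lam = 0.3

# Convexity: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
assert f(lam * x + (1 - lam) * y_) <= lam * f(x) + (1 - lam) * f(y_) + 1e-12
# First-order condition: f(y) >= f(x) + grad f(x)^T (y - x)
assert f(y_) >= f(x) + grad(x) @ (y_ - x) - 1e-12
# 1-strong convexity: f(y) >= f(x) + grad f(x)^T (y-x) + 0.5*||y-x||^2
assert f(y_) >= f(x) + grad(x) @ (y_ - x) + 0.5 * np.dot(y_ - x, y_ - x) - 1e-12
```

For this particular quadratic the strong-convexity inequality holds with equality, since \(f(y)=f(x)+\nabla f(x)^\top(y-x)+\frac{1}{2}\|y-x\|^2\) exactly.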

GD Convergence

For \(\beta\)-smooth convex \(f\), GD with step size \(\eta=1/\beta\) satisfies \(f(x_T)-f^*\le\frac{\beta\|x_0-x^*\|^2}{2T}\). If \(f\) is additionally \(\alpha\)-strongly convex: \(\|x_T-x^*\|^2\le(1-\alpha/\beta)^T\|x_0-x^*\|^2\) (linear convergence). The condition number \(\kappa=\beta/\alpha\) governs the speed.
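A quick numerical sanity check of the strongly convex rate (a sketch; the diagonal quadratic and its eigenvalues are illustrative choices, not from the text):

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x with eigenvalues 1 and 10:
# alpha = 1 (strong convexity), beta = 10 (smoothness), kappa = 10.
A = np.diag([1.0, 10.0])
alpha, beta = 1.0, 10.0
x = np.array([1.0, 1.0])   # x_0; the minimizer is x* = 0
eta = 1.0 / beta           # step size from the theorem

dists = [np.dot(x, x)]     # track ||x_t - x*||^2
for _ in range(50):
    x = x - eta * (A @ x)  # gradient step: grad f(x) = A x
    dists.append(np.dot(x, x))

# Linear convergence: ||x_T - x*||^2 <= (1 - alpha/beta)^T ||x_0 - x*||^2
rho = 1 - alpha / beta
assert all(d <= rho**t * dists[0] + 1e-12 for t, d in enumerate(dists))
```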

Example 1: Least Squares

Minimize \(f(w)=\frac{1}{2n}\|Xw-y\|^2\). Is \(f\) convex?

Solution: \(\nabla^2 f=X^\top X/n\succeq 0\), so \(f\) is convex, with smoothness constant \(\beta=\lambda_{\max}(X^\top X)/n\). GD converges at rate \(O(1/T)\), or linearly if \(X\) has full column rank (then \(X^\top X\succ 0\), so \(f\) is strongly convex).
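A sketch of GD on a synthetic least-squares instance, using the step size \(\eta=1/\beta\) from above (the data and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                                 # noiseless, so f* = 0 at w_true

beta = np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant
eta = 1.0 / beta
w = np.zeros(d)
for _ in range(500):
    w -= eta * (X.T @ (X @ w - y)) / n         # grad f(w) = X^T(Xw - y)/n

# Full column rank => strong convexity => linear convergence to w_true
assert np.allclose(w, w_true, atol=1e-6)
```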

Example 2: Logistic Loss

Show \(f(w)=\frac{1}{n}\sum_i\log(1+e^{-y_i w^\top x_i})\) is convex.

Solution: \(\nabla^2 f=\frac{1}{n}\sum_i\sigma(z_i)(1-\sigma(z_i))x_ix_i^\top\succeq 0\), where \(z_i=y_i w^\top x_i\) and \(\sigma\) is the sigmoid. Since \(\sigma(z)(1-\sigma(z))\le 1/4\), \(f\) is convex with \(\beta\le\|X\|^2/(4n)\).
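The Hessian formula and the smoothness bound can be verified numerically; the data below are randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.standard_normal(d)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
z = y * (X @ w)                      # margins z_i = y_i w^T x_i
s = sigma(z) * (1 - sigma(z))        # Hessian weights, each <= 1/4
H = (X.T * s) @ X / n                # (1/n) sum_i s_i x_i x_i^T

eigs = np.linalg.eigvalsh(H)
assert eigs.min() >= -1e-12          # PSD: f is convex
# Smoothness bound: beta <= ||X||^2 / (4n), ||X|| the spectral norm
assert eigs.max() <= np.linalg.norm(X, 2) ** 2 / (4 * n) + 1e-12
```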

Practice

  1. Prove a twice-differentiable function is convex iff \(\nabla^2 f(x)\succeq 0\) everywhere.
  2. Derive the gradient of \(f(w)=\|Xw-y\|^2+\lambda\|w\|^2\) and find \(w^*\) in closed form.
  3. Show that if \(f\) is \(\alpha\)-strongly convex, it has a unique minimizer.
  4. Compare convergence \(O(1/T)\) vs \(O(\rho^T)\). For \(\kappa=10\), how many steps for \(10^{-6}\) accuracy?
Answer Key

1. ($\Rightarrow$) If $f$ is convex, then $f(y) \geq f(x)+\nabla f(x)^T(y-x)$ for all $x,y$. Take $y=x+\epsilon v$: $f(x+\epsilon v) \geq f(x)+\epsilon\nabla f(x)^Tv$. Taylor expand LHS: $f(x)+\epsilon\nabla f^Tv+\frac{\epsilon^2}{2}v^T\nabla^2 f\,v+O(\epsilon^3) \geq f(x)+\epsilon\nabla f^Tv$. So $v^T\nabla^2 f\,v \geq 0$ for all $v$. ($\Leftarrow$) If $\nabla^2 f \succeq 0$, use Taylor with integral remainder: $f(y)=f(x)+\nabla f^T(y-x)+\frac{1}{2}(y-x)^T\nabla^2 f(z)(y-x) \geq f(x)+\nabla f^T(y-x)$. ✓

2. $\nabla f = 2X^T(Xw-y)+2\lambda w$. Set to 0: $(X^TX+\lambda I)w^* = X^Ty$, so $w^* = (X^TX+\lambda I)^{-1}X^Ty$. The matrix $(X^TX+\lambda I)$ is always invertible for $\lambda>0$ (eigenvalues $\geq \lambda > 0$).
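A quick numerical check of this closed form (random data for illustration): at \(w^*\) the gradient must vanish.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
lam = 0.5

# Closed form: w* = (X^T X + lam I)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The gradient 2 X^T (X w - y) + 2 lam w must be zero at w*
grad = 2 * X.T @ (X @ w_star - y) + 2 * lam * w_star
assert np.allclose(grad, 0, atol=1e-10)
```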

3. Strongly convex: $f(y) \geq f(x)+\nabla f(x)^T(y-x)+\frac{\alpha}{2}\|y-x\|^2$. Suppose two minimizers $x^*,y^*$ with $f(x^*)=f(y^*)=f^*$. Apply the inequality at $x^*$: $f^* \geq f^*+0+\frac{\alpha}{2}\|y^*-x^*\|^2$, so $\|y^*-x^*\|^2 \leq 0$, hence $x^*=y^*$. Uniqueness follows. ✓

4. GD on $\beta$-smooth, $\alpha$-strongly convex $f$: $f(w_T)-f^* \leq (1-\alpha/\beta)^T(f(w_0)-f^*)$ → converges as $O(\rho^T)$ with $\rho = 1-1/\kappa$ ($\kappa = \beta/\alpha$). For $\kappa=10$: need $(0.9)^T < 10^{-6}$ → $T > 6\ln 10/\ln(10/9) \approx 131$ steps. Subgradient/plain convex: $O(1/T)$ → need $T > 10^6$ steps. Linear convergence is dramatically faster.
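The step counts above can be reproduced directly; the smallest integer $T$ with $\rho^T < 10^{-6}$ is 132, matching the threshold $T > 131.1$:

```python
import math

kappa = 10
rho = 1 - 1 / kappa                 # contraction factor per step, 0.9
eps = 1e-6

# Linear rate: smallest integer T with rho^T < eps
T_linear = math.ceil(math.log(eps) / math.log(rho))
# Sublinear O(1/T): need roughly T > 1/eps steps
T_sublinear = math.ceil(1 / eps)

print(T_linear, T_sublinear)        # prints: 132 1000000
```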