
Convex Optimization & Gradient Descent

42 min · Mathematics of Machine Learning


Convex optimization is the backbone of classical machine learning: many training objectives (least squares, logistic regression, SVMs) are convex, guaranteeing that every local minimum is global. Gradient descent exploits the geometry of the loss landscape to iteratively minimize the objective.

Convex Function

\(f: \mathbb{R}^d \to \mathbb{R}\) is convex if \(f(\lambda x+(1-\lambda)y)\le\lambda f(x)+(1-\lambda)f(y)\) for all \(x,y\) and \(\lambda\in[0,1]\). Equivalently, for differentiable \(f\): \(f(y)\ge f(x)+\nabla f(x)^\top(y-x)\) for all \(x,y\). A function is \(\beta\)-smooth if \(\|\nabla f(x)-\nabla f(y)\|\le\beta\|x-y\|\), and \(\alpha\)-strongly convex if \(f(y)\ge f(x)+\nabla f(x)^\top(y-x)+\frac{\alpha}{2}\|y-x\|^2\).
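These three conditions can be checked numerically on a simple test function. A minimal sketch, using the quadratic \(f(x)=\frac{1}{2}\|x\|^2\) (an illustrative choice, for which \(\alpha=\beta=1\)):

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: 0.5 * np.dot(x, x)    # alpha = beta = 1 for this quadratic
grad = lambda x: x                  # its gradient

x, y_ = rng.standard_normal(5), rng.standard_normal(5)
lam = 0.3

# Convexity: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
assert f(lam * x + (1 - lam) * y_) <= lam * f(x) + (1 - lam) * f(y_) + 1e-12
# First-order condition: f(y) >= f(x) + grad f(x)^T (y - x)
assert f(y_) >= f(x) + grad(x) @ (y_ - x) - 1e-12
# 1-strong convexity: f(y) >= f(x) + grad f(x)^T (y-x) + 0.5*||y-x||^2
assert f(y_) >= f(x) + grad(x) @ (y_ - x) + 0.5 * np.dot(y_ - x, y_ - x) - 1e-12
```

For this particular quadratic the strong-convexity inequality holds with equality, since \(f(y)=f(x)+\nabla f(x)^\top(y-x)+\frac{1}{2}\|y-x\|^2\) exactly.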

GD Convergence

For \(\beta\)-smooth convex \(f\), GD with step size \(\eta=1/\beta\) satisfies \(f(x_T)-f^*\le\frac{\beta\|x_0-x^*\|^2}{2T}\). If \(f\) is additionally \(\alpha\)-strongly convex: \(\|x_T-x^*\|^2\le(1-\alpha/\beta)^T\|x_0-x^*\|^2\) (linear convergence). The condition number \(\kappa=\beta/\alpha\) governs the speed.
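A quick numerical sanity check of the strongly convex rate (a sketch; the diagonal quadratic and its eigenvalues are illustrative choices, not from the text):

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x with eigenvalues 1 and 10:
# alpha = 1 (strong convexity), beta = 10 (smoothness), kappa = 10.
A = np.diag([1.0, 10.0])
alpha, beta = 1.0, 10.0
x = np.array([1.0, 1.0])   # x_0; the minimizer is x* = 0
eta = 1.0 / beta           # step size from the theorem

dists = [np.dot(x, x)]     # track ||x_t - x*||^2
for _ in range(50):
    x = x - eta * (A @ x)  # gradient step: grad f(x) = A x
    dists.append(np.dot(x, x))

# Linear convergence: ||x_T - x*||^2 <= (1 - alpha/beta)^T ||x_0 - x*||^2
rho = 1 - alpha / beta
assert all(d <= rho**t * dists[0] + 1e-12 for t, d in enumerate(dists))
```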

Example 1: Least Squares

Minimize \(f(w)=\frac{1}{2n}\|Xw-y\|^2\). Is \(f\) convex?

Solution: \(\nabla^2 f=X^\top X/n\succeq 0\), so \(f\) is convex, with smoothness constant \(\beta=\lambda_{\max}(X^\top X)/n\). GD converges at rate \(O(1/T)\), or linearly if \(X\) has full column rank (then \(X^\top X\succ 0\), so \(f\) is strongly convex).
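A sketch of GD on a synthetic least-squares instance, using the step size \(\eta=1/\beta\) from above (the data and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                                 # noiseless, so f* = 0 at w_true

beta = np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant
eta = 1.0 / beta
w = np.zeros(d)
for _ in range(500):
    w -= eta * (X.T @ (X @ w - y)) / n         # grad f(w) = X^T(Xw - y)/n

# Full column rank => strong convexity => linear convergence to w_true
assert np.allclose(w, w_true, atol=1e-6)
```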

Example 2: Logistic Loss

Show \(f(w)=\frac{1}{n}\sum_i\log(1+e^{-y_i w^\top x_i})\) is convex.

Solution: \(\nabla^2 f=\frac{1}{n}\sum_i\sigma(z_i)(1-\sigma(z_i))x_ix_i^\top\succeq 0\), where \(z_i=y_i w^\top x_i\) and \(\sigma\) is the sigmoid. Since \(\sigma(z)(1-\sigma(z))\le 1/4\), \(f\) is convex with \(\beta\le\|X\|^2/(4n)\).
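The Hessian formula and the smoothness bound can be verified numerically; the data below are randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.standard_normal(d)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
z = y * (X @ w)                      # margins z_i = y_i w^T x_i
s = sigma(z) * (1 - sigma(z))        # Hessian weights, each <= 1/4
H = (X.T * s) @ X / n                # (1/n) sum_i s_i x_i x_i^T

eigs = np.linalg.eigvalsh(H)
assert eigs.min() >= -1e-12          # PSD: f is convex
# Smoothness bound: beta <= ||X||^2 / (4n), ||X|| the spectral norm
assert eigs.max() <= np.linalg.norm(X, 2) ** 2 / (4 * n) + 1e-12
```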

Practice

  1. Prove a twice-differentiable function is convex iff \(\nabla^2 f(x)\succeq 0\) everywhere.
  2. Derive the gradient of \(f(w)=\|Xw-y\|^2+\lambda\|w\|^2\) and find \(w^*\) in closed form.
  3. Show that if \(f\) is \(\alpha\)-strongly convex, it has a unique minimizer.
  4. Compare convergence \(O(1/T)\) vs \(O(\rho^T)\). For \(\kappa=10\), how many steps for \(10^{-6}\) accuracy?
Answer Key

1. ($\Rightarrow$) If $f$ is convex, then $f(y) \geq f(x)+\nabla f(x)^T(y-x)$ for all $x,y$. Take $y=x+\epsilon v$: $f(x+\epsilon v) \geq f(x)+\epsilon\nabla f(x)^Tv$. Taylor expand LHS: $f(x)+\epsilon\nabla f^Tv+\frac{\epsilon^2}{2}v^T\nabla^2 f\,v+O(\epsilon^3) \geq f(x)+\epsilon\nabla f^Tv$. So $v^T\nabla^2 f\,v \geq 0$ for all $v$. ($\Leftarrow$) If $\nabla^2 f \succeq 0$, use Taylor with integral remainder: $f(y)=f(x)+\nabla f^T(y-x)+\frac{1}{2}(y-x)^T\nabla^2 f(z)(y-x) \geq f(x)+\nabla f^T(y-x)$. ✓

2. $\nabla f = 2X^T(Xw-y)+2\lambda w$. Set to 0: $(X^TX+\lambda I)w^* = X^Ty$, so $w^* = (X^TX+\lambda I)^{-1}X^Ty$. The matrix $(X^TX+\lambda I)$ is always invertible for $\lambda>0$ (eigenvalues $\geq \lambda > 0$).
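A quick numerical check of this closed form (random data for illustration): at \(w^*\) the gradient must vanish.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
lam = 0.5

# Closed form: w* = (X^T X + lam I)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The gradient 2 X^T (X w - y) + 2 lam w must be zero at w*
grad = 2 * X.T @ (X @ w_star - y) + 2 * lam * w_star
assert np.allclose(grad, 0, atol=1e-10)
```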

3. Strongly convex: $f(y) \geq f(x)+\nabla f(x)^T(y-x)+\frac{\alpha}{2}\|y-x\|^2$. Suppose two minimizers $x^*,y^*$ with $f(x^*)=f(y^*)=f^*$. Apply the inequality at $x^*$: $f^* \geq f^*+0+\frac{\alpha}{2}\|y^*-x^*\|^2$, so $\|y^*-x^*\|^2 \leq 0$, hence $x^*=y^*$. Uniqueness follows. ✓

4. GD on $\beta$-smooth, $\alpha$-strongly convex $f$: $f(w_T)-f^* \leq (1-\alpha/\beta)^T(f(w_0)-f^*)$ → converges as $O(\rho^T)$ with $\rho = 1-1/\kappa$ ($\kappa = \beta/\alpha$). For $\kappa=10$: need $(0.9)^T < 10^{-6}$ → $T > 6\ln 10/\ln(10/9) \approx 131$ steps. Subgradient/plain convex: $O(1/T)$ → need $T > 10^6$ steps. Linear convergence is dramatically faster.
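The step counts above can be reproduced directly; the smallest integer $T$ with $\rho^T < 10^{-6}$ is 132, matching the threshold $T > 131.1$:

```python
import math

kappa = 10
rho = 1 - 1 / kappa                 # contraction factor per step, 0.9
eps = 1e-6

# Linear rate: smallest integer T with rho^T < eps
T_linear = math.ceil(math.log(eps) / math.log(rho))
# Sublinear O(1/T): need roughly T > 1/eps steps
T_sublinear = math.ceil(1 / eps)

print(T_linear, T_sublinear)        # prints: 132 1000000
```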