Mathematics of Machine Learning

Stochastic Gradient Descent & Convergence

SGD replaces the full gradient with a noisy mini-batch estimate, enabling scalable optimization over massive datasets. The interplay between noise and step size governs convergence speed and final accuracy.

SGD Update

Sample \(i_t\) uniformly; update \(w_{t+1}=w_t-\eta_t\nabla\ell(w_t;z_{i_t})\), where \(\mathbb{E}_i[\nabla\ell(w;z_i)]=\nabla f(w)\) (the stochastic gradient is unbiased). Gradient variance: \(\sigma^2=\mathbb{E}\|\nabla\ell(w;z)-\nabla f(w)\|^2\). Common step-size schedules: constant \(\eta\), decaying \(\eta_t=\eta_0/\sqrt{t}\), or cosine annealing.
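A minimal NumPy sketch of this loop, with the three schedules just listed (the function name, defaults, and cosine form here are illustrative, not from a particular library):

```python
import numpy as np

def sgd(grad_sample, w0, data, eta0=0.1, schedule="sqrt", T=1000, rng=None):
    """Generic SGD loop: at step t, sample one data point uniformly and take
    a noisy gradient step. schedule selects constant, 1/sqrt(t), or cosine."""
    rng = rng if rng is not None else np.random.default_rng(0)
    w = np.asarray(w0, dtype=float).copy()
    n = len(data)
    for t in range(1, T + 1):
        z = data[rng.integers(n)]                           # sample i_t uniformly
        if schedule == "const":
            eta = eta0                                      # constant step
        elif schedule == "sqrt":
            eta = eta0 / np.sqrt(t)                         # eta_t = eta0 / sqrt(t)
        else:
            eta = 0.5 * eta0 * (1 + np.cos(np.pi * t / T))  # cosine annealing
        w = w - eta * grad_sample(w, z)                     # unbiased gradient step
    return w
```

With the decaying schedule the iterate approaches the minimizer of the average loss, e.g. the data mean for \(\ell(w;z)=\frac{1}{2}(w-z)^2\).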

SGD Convergence (Convex)

For \(\beta\)-smooth convex \(f\) with gradient variance \(\sigma^2\), SGD with \(\eta_t=c/\sqrt{t}\) and averaged iterate \(\bar{w}_T=\frac{1}{T}\sum_{t=1}^T w_t\) satisfies \(\mathbb{E}[f(\bar{w}_T)]-f^*\le\frac{R^2+c^2\sigma^2\log T}{2c\sqrt{T}}\), where \(R=\|w_0-w^*\|\). For strongly convex objectives with constant \(\eta\): convergence to a noise ball of radius \(O(\sigma^2\eta/\alpha)\).
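The two regimes in this result can be seen in a toy simulation on \(f(w)=w^2/2\) (so \(\alpha=\beta=1\)); the setup below is a hypothetical illustration, not part of the formal statement:

```python
import numpy as np

def run_sgd_1d(eta_fn, T=20000, sigma=1.0, seed=0):
    """SGD on f(w) = w^2 / 2 (alpha = beta = 1); the stochastic gradient
    w + N(0, sigma^2) is an unbiased estimate of grad f(w) = w."""
    rng = np.random.default_rng(seed)
    w = 5.0
    for t in range(1, T + 1):
        w -= eta_fn(t) * (w + sigma * rng.normal())
    return w

# Constant step: the iterate stalls in a noise ball around w* = 0.
w_const = run_sgd_1d(lambda t: 0.1)
# Decaying step eta_t = 1/t: the error keeps shrinking toward 0.
w_decay = run_sgd_1d(lambda t: 1.0 / t)
```

With the constant step the final iterate fluctuates at a scale set by \(\eta\sigma^2\); with the decaying step it continues to contract toward \(w^*\).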

Example 1: Online Regression

SGD on \(\ell(w;(x,y))=\frac{1}{2}(w^\top x-y)^2\). Derive the update.

Solution: Gradient: \((w^\top x-y)x\). Update: \(w_{t+1}=w_t-\eta_t(w_t^\top x_t-y_t)x_t\). Smoothness \(\beta=R_x^2\). Variance bounded by \(4\|w^*\|^2 R_x^4\).
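The derived update can be run directly on synthetic data (the dimension, step-size constant, and noise level below are illustrative assumptions):

```python
import numpy as np

# Online least-squares SGD: w_{t+1} = w_t - eta_t (w_t^T x_t - y_t) x_t,
# on synthetic data y = x^T w_true + noise.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for t in range(1, 20001):
    x = rng.normal(size=3)                   # fresh sample each step
    y = x @ w_true + 0.1 * rng.normal()      # noisy label
    eta = 0.2 / np.sqrt(t)                   # decaying schedule eta_t = eta0/sqrt(t)
    w -= eta * (w @ x - y) * x               # single-sample gradient step
```

After one pass the iterate should sit close to \(w^*=w_\text{true}\), up to noise-ball error.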

Example 2: Momentum

How does momentum accelerate SGD on strongly convex quadratics?

Solution: Heavy-ball: \(v_{t+1}=\mu v_t+\nabla f(w_t)\), \(w_{t+1}=w_t-\eta v_{t+1}\). On quadratics with condition number \(\kappa=\beta/\alpha\), the tuned choice \(\eta=4/(\sqrt{\alpha}+\sqrt{\beta})^2\), \(\mu=\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2\) gives rate \(O\!\left(\left(1-\frac{2}{\sqrt{\kappa}+1}\right)^T\right)\), improving the iteration complexity from \(O(\kappa)\) to \(O(\sqrt{\kappa})\).
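A sketch comparing plain GD with tuned heavy-ball on a 2-D quadratic (the matrix \(A\) and iteration count are illustrative):

```python
import numpy as np

# Plain GD vs tuned heavy-ball on f(w) = 0.5 w^T A w with kappa = 100.
A = np.diag([1.0, 100.0])                      # alpha = 1, beta = 100
alpha, beta = 1.0, 100.0
kappa = beta / alpha
eta_hb = 4.0 / (np.sqrt(alpha) + np.sqrt(beta)) ** 2
mu = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2

w_gd = np.array([1.0, 1.0])
w_hb = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(200):
    w_gd = w_gd - (1.0 / beta) * (A @ w_gd)    # GD: rate ~ (1 - 1/kappa)^T
    v = mu * v + A @ w_hb                      # v_{t+1} = mu v_t + grad f(w_t)
    w_hb = w_hb - eta_hb * v                   # w_{t+1} = w_t - eta v_{t+1}
```

After 200 iterations GD still retains roughly \((1-1/\kappa)^{200}\approx 0.13\) of the error in the slow eigendirection, while heavy-ball has contracted by \(\approx\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{200}\), many orders of magnitude smaller.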

Practice

  1. Why does constant step-size SGD not converge to \(f^*\) for strongly convex objectives?
  2. Compare convergence of SGD vs GD when data is highly redundant.
  3. Derive the optimal batch size minimizing wall-clock time given parallel hardware.
  4. Show Adam's adaptive step is equivalent to preconditioning by \(\mathrm{diag}(v_t)^{-1/2}\).
Answer Key

1. With constant step \(\eta\), the gradient estimate has irreducible variance \(\sigma^2\), so the iterate fluctuates in a ball of radius \(\sim\eta\sigma^2/\alpha\) around \(w^*\). It converges to a neighborhood of \(f^*\), not \(f^*\) itself: \(\mathbb{E}[f(w_T)]-f^*\le(1-\alpha\eta)^T(f(w_0)-f^*)+\frac{\eta\beta\sigma^2}{2\alpha}\). The second term is an irreducible error floor, vanishing only as \(\eta\to 0\) (e.g., diminishing steps \(\eta_t=O(1/t)\)).

2. With redundant data, full-batch GD recomputes nearly identical gradients across many similar examples, while SGD makes progress from a single sample per step. Per epoch (one pass over the data, \(n\) single-sample steps), SGD compounds \(\sim n\) cheap updates where GD takes just one step at the same gradient-evaluation cost. When data is highly redundant, a single-sample gradient is already a good estimate of the full gradient (low effective variance), so SGD's per-epoch convergence is far faster; GD is preferable per gradient evaluation only when the sample variance is very high.
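This effect is easy to see on an artificially redundant dataset (the mean-estimation objective and all constants below are illustrative; both methods use conservative step sizes, as if the smoothness constant were unknown):

```python
import numpy as np

# A dataset of 1000 points that are just 100 copies each of 10 distinct points.
# Objective: f(w) = mean_i 0.5 * ||w - x_i||^2, minimized at the data mean.
rng = np.random.default_rng(0)
distinct = rng.normal(size=(10, 2)) + 5.0
X = np.repeat(distinct, 100, axis=0)       # n = 1000, highly redundant
target = X.mean(axis=0)

# One full-batch GD step costs n gradient evaluations...
w_gd = np.zeros(2)
w_gd -= 0.1 * (w_gd - target)              # full gradient: w - mean

# ...while for the same cost SGD takes n single-sample steps.
w_sgd = np.zeros(2)
for i in rng.permutation(len(X)):
    w_sgd -= 0.02 * (w_sgd - X[i])         # single-sample gradient
```

At equal gradient-evaluation budget, one epoch of SGD lands near the minimizer while a single GD step has barely moved.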

3. Time per step: \(t(B)=t_0+B/P\) (fixed overhead plus compute at parallel throughput \(P\)). Mini-batch variance: \(\mathrm{Var}\propto\sigma^2/B\), so the number of steps to reach a target accuracy behaves like \(N(B)\approx a+b/B\) (a deterministic term plus a noise term with \(b\propto\sigma^2\)). Total wall-clock time: \(T(B)=N(B)\,t(B)=at_0+aB/P+bt_0/B+b/P\). Setting \(T'(B)=0\) gives \(B^*=\sqrt{b\,t_0P/a}\propto\sqrt{Pt_0}\): the geometric mean of parallelism and overhead scales. Beyond \(B^*\), the variance term no longer dominates, so larger batches waste compute (linear scaling breaks down).
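The optimum can be sanity-checked numerically; all constants below (\(t_0\), \(P\), \(a\), \(b\)) are hypothetical:

```python
import numpy as np

# t(B) = t0 + B/P; steps to target accuracy N(B) = a + b/B (b ~ sigma^2).
t0, P = 1e-3, 1e4           # hypothetical per-step overhead (s) and throughput
a, b = 100.0, 1e4           # hypothetical step counts: bias term, noise term

B = np.arange(1.0, 2001.0)
total_time = (a + b / B) * (t0 + B / P)     # wall-clock time to target accuracy
B_star_numeric = B[np.argmin(total_time)]
B_star_formula = np.sqrt(b * t0 * P / a)    # closed-form optimum ~ sqrt(P * t0)
```

Here the numeric minimizer and the closed form agree (both near \(\sqrt{1000}\approx 31.6\)).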

4. Adam update: \(w_{t+1}=w_t-\eta\,\hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon)\), where \(\hat{v}_t\approx\mathrm{diag}(\mathbb{E}[g_t^2])\). This is equivalent to \(w_{t+1}=w_t-\eta D_t^{-1}\hat{m}_t\) with preconditioner \(D_t=\mathrm{diag}(\sqrt{\hat{v}_t}+\epsilon)\): an approximate natural-gradient/Newton step using only diagonal second-moment information, adapting the learning rate per parameter.
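A sketch of a single Adam step written explicitly as a preconditioned gradient step (the helper name is illustrative; the hyperparameter defaults are the common ones):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step on parameters w with gradient g; returns (w, m, v)."""
    m = b1 * m + (1 - b1) * g                  # first-moment EMA
    v = b2 * v + (1 - b2) * g * g              # diagonal second-moment EMA
    m_hat = m / (1 - b1 ** t)                  # bias corrections
    v_hat = v / (1 - b2 ** t)
    # Preconditioned step: w - eta * D^{-1} m_hat with D = diag(sqrt(v_hat) + eps).
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v
```

At \(t=1\) the bias corrections make \(\hat{m}_1=g\) and \(\hat{v}_1=g^2\), so the first step is \(\approx-\eta\,\mathrm{sign}(g)\) per coordinate: the per-parameter adaptation in action.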