Mathematics of Machine Learning

Stochastic Gradient Descent & Convergence

SGD replaces the full gradient with a noisy mini-batch estimate, enabling scalable optimization over massive datasets. The interplay between noise and step size governs convergence speed and final accuracy.

SGD Update

Sample \(i_t\) uniformly; update \(w_{t+1}=w_t-\eta_t\nabla\ell(w_t;z_{i_t})\), where \(\mathbb{E}_i[\nabla\ell(w;z_i)]=\nabla f(w)\) (the stochastic gradient is unbiased). Gradient variance: \(\sigma^2=\mathbb{E}\|\nabla\ell(w;z)-\nabla f(w)\|^2\). Common step-size schedules: constant \(\eta\), decaying \(\eta_t=\eta_0/\sqrt{t}\), or cosine annealing.
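A minimal NumPy sketch of this loop, with the three schedules just listed (the function name, defaults, and cosine form here are illustrative, not from a particular library):

```python
import numpy as np

def sgd(grad_sample, w0, data, eta0=0.1, schedule="sqrt", T=1000, rng=None):
    """Generic SGD loop: at step t, sample one data point uniformly and take
    a noisy gradient step. schedule selects constant, 1/sqrt(t), or cosine."""
    rng = rng if rng is not None else np.random.default_rng(0)
    w = np.asarray(w0, dtype=float).copy()
    n = len(data)
    for t in range(1, T + 1):
        z = data[rng.integers(n)]                           # sample i_t uniformly
        if schedule == "const":
            eta = eta0                                      # constant step
        elif schedule == "sqrt":
            eta = eta0 / np.sqrt(t)                         # eta_t = eta0 / sqrt(t)
        else:
            eta = 0.5 * eta0 * (1 + np.cos(np.pi * t / T))  # cosine annealing
        w = w - eta * grad_sample(w, z)                     # unbiased gradient step
    return w
```

With the decaying schedule the iterate approaches the minimizer of the average loss, e.g. the data mean for \(\ell(w;z)=\frac{1}{2}(w-z)^2\).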

SGD Convergence (Convex)

For \(\beta\)-smooth convex \(f\) with gradient variance \(\sigma^2\), SGD with \(\eta_t=c/\sqrt{t}\) and averaged iterate \(\bar{w}_T=\frac{1}{T}\sum_{t=1}^T w_t\) satisfies \(\mathbb{E}[f(\bar{w}_T)]-f^*\le\frac{R^2+c^2\sigma^2\log T}{2c\sqrt{T}}\), where \(R=\|w_0-w^*\|\). For strongly convex objectives with constant \(\eta\): convergence to a noise ball of radius \(O(\sigma^2\eta/\alpha)\).
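The two regimes in this result can be seen in a toy simulation on \(f(w)=w^2/2\) (so \(\alpha=\beta=1\)); the setup below is a hypothetical illustration, not part of the formal statement:

```python
import numpy as np

def run_sgd_1d(eta_fn, T=20000, sigma=1.0, seed=0):
    """SGD on f(w) = w^2 / 2 (alpha = beta = 1); the stochastic gradient
    w + N(0, sigma^2) is an unbiased estimate of grad f(w) = w."""
    rng = np.random.default_rng(seed)
    w = 5.0
    for t in range(1, T + 1):
        w -= eta_fn(t) * (w + sigma * rng.normal())
    return w

# Constant step: the iterate stalls in a noise ball around w* = 0.
w_const = run_sgd_1d(lambda t: 0.1)
# Decaying step eta_t = 1/t: the error keeps shrinking toward 0.
w_decay = run_sgd_1d(lambda t: 1.0 / t)
```

With the constant step the final iterate fluctuates at a scale set by \(\eta\sigma^2\); with the decaying step it continues to contract toward \(w^*\).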

Example 1: Online Regression

SGD on \(\ell(w;(x,y))=\frac{1}{2}(w^\top x-y)^2\). Derive the update.

Solution: Gradient: \((w^\top x-y)x\). Update: \(w_{t+1}=w_t-\eta_t(w_t^\top x_t-y_t)x_t\). Smoothness \(\beta=R_x^2\). Variance bounded by \(4\|w^*\|^2 R_x^4\).
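The derived update can be run directly on synthetic data (the dimension, step-size constant, and noise level below are illustrative assumptions):

```python
import numpy as np

# Online least-squares SGD: w_{t+1} = w_t - eta_t (w_t^T x_t - y_t) x_t,
# on synthetic data y = x^T w_true + noise.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for t in range(1, 20001):
    x = rng.normal(size=3)                   # fresh sample each step
    y = x @ w_true + 0.1 * rng.normal()      # noisy label
    eta = 0.2 / np.sqrt(t)                   # decaying schedule eta_t = eta0/sqrt(t)
    w -= eta * (w @ x - y) * x               # single-sample gradient step
```

After one pass the iterate should sit close to \(w^*=w_\text{true}\), up to noise-ball error.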

Example 2: Momentum

How does momentum accelerate SGD on strongly convex quadratics?

Solution: Heavy-ball: \(v_{t+1}=\mu v_t+\nabla f(w_t)\), \(w_{t+1}=w_t-\eta v_{t+1}\). On quadratics with condition number \(\kappa=\beta/\alpha\), the tuned choice \(\eta=4/(\sqrt{\alpha}+\sqrt{\beta})^2\), \(\mu=\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2\) gives rate \(O\!\left(\left(1-\frac{2}{\sqrt{\kappa}+1}\right)^T\right)\), improving the iteration complexity from \(O(\kappa)\) to \(O(\sqrt{\kappa})\).
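A sketch comparing plain GD with tuned heavy-ball on a 2-D quadratic (the matrix \(A\) and iteration count are illustrative):

```python
import numpy as np

# Plain GD vs tuned heavy-ball on f(w) = 0.5 w^T A w with kappa = 100.
A = np.diag([1.0, 100.0])                      # alpha = 1, beta = 100
alpha, beta = 1.0, 100.0
kappa = beta / alpha
eta_hb = 4.0 / (np.sqrt(alpha) + np.sqrt(beta)) ** 2
mu = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2

w_gd = np.array([1.0, 1.0])
w_hb = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(200):
    w_gd = w_gd - (1.0 / beta) * (A @ w_gd)    # GD: rate ~ (1 - 1/kappa)^T
    v = mu * v + A @ w_hb                      # v_{t+1} = mu v_t + grad f(w_t)
    w_hb = w_hb - eta_hb * v                   # w_{t+1} = w_t - eta v_{t+1}
```

After 200 iterations GD still retains roughly \((1-1/\kappa)^{200}\approx 0.13\) of the error in the slow eigendirection, while heavy-ball has contracted by \(\approx\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{200}\), many orders of magnitude smaller.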

Practice

  1. Why does constant step-size SGD not converge to \(f^*\) for strongly convex objectives?
  2. Compare convergence of SGD vs GD when data is highly redundant.
  3. Derive the optimal batch size minimizing wall-clock time given parallel hardware.
  4. Show Adam's adaptive step is equivalent to preconditioning by \(\mathrm{diag}(v_t)^{-1/2}\).
Answer Key

1. With constant step \(\eta\), the gradient estimate has irreducible variance \(\sigma^2\), so the iterate fluctuates in a ball of radius \(\sim\eta\sigma^2/\alpha\) around \(w^*\). It converges to a neighborhood of \(f^*\), not \(f^*\) itself: \(\mathbb{E}[f(w_T)]-f^*\le(1-\alpha\eta)^T(f(w_0)-f^*)+\frac{\eta\beta\sigma^2}{2\alpha}\). The second term is an irreducible error floor, vanishing only as \(\eta\to 0\) (e.g., diminishing steps \(\eta_t=O(1/t)\)).

2. With redundant data, full-batch GD recomputes nearly identical gradients across many similar examples, while SGD makes progress from a single sample per step. Per epoch (one pass over the data, \(n\) single-sample steps), SGD compounds \(\sim n\) cheap updates where GD takes just one step at the same gradient-evaluation cost. When data is highly redundant, a single-sample gradient is already a good estimate of the full gradient (low effective variance), so SGD's per-epoch convergence is far faster; GD is preferable per gradient evaluation only when the sample variance is very high.
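This effect is easy to see on an artificially redundant dataset (the mean-estimation objective and all constants below are illustrative; both methods use conservative step sizes, as if the smoothness constant were unknown):

```python
import numpy as np

# A dataset of 1000 points that are just 100 copies each of 10 distinct points.
# Objective: f(w) = mean_i 0.5 * ||w - x_i||^2, minimized at the data mean.
rng = np.random.default_rng(0)
distinct = rng.normal(size=(10, 2)) + 5.0
X = np.repeat(distinct, 100, axis=0)       # n = 1000, highly redundant
target = X.mean(axis=0)

# One full-batch GD step costs n gradient evaluations...
w_gd = np.zeros(2)
w_gd -= 0.1 * (w_gd - target)              # full gradient: w - mean

# ...while for the same cost SGD takes n single-sample steps.
w_sgd = np.zeros(2)
for i in rng.permutation(len(X)):
    w_sgd -= 0.02 * (w_sgd - X[i])         # single-sample gradient
```

At equal gradient-evaluation budget, one epoch of SGD lands near the minimizer while a single GD step has barely moved.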

3. Time per step: \(t(B)=t_0+B/P\) (fixed overhead plus compute at parallel throughput \(P\)). Mini-batch variance: \(\mathrm{Var}\propto\sigma^2/B\), so the number of steps to reach a target accuracy behaves like \(N(B)\approx a+b/B\) (a deterministic term plus a noise term with \(b\propto\sigma^2\)). Total wall-clock time: \(T(B)=N(B)\,t(B)=at_0+aB/P+bt_0/B+b/P\). Setting \(T'(B)=0\) gives \(B^*=\sqrt{b\,t_0P/a}\propto\sqrt{Pt_0}\): the geometric mean of parallelism and overhead scales. Beyond \(B^*\), the variance term no longer dominates, so larger batches waste compute (linear scaling breaks down).
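The optimum can be sanity-checked numerically; all constants below (\(t_0\), \(P\), \(a\), \(b\)) are hypothetical:

```python
import numpy as np

# t(B) = t0 + B/P; steps to target accuracy N(B) = a + b/B (b ~ sigma^2).
t0, P = 1e-3, 1e4           # hypothetical per-step overhead (s) and throughput
a, b = 100.0, 1e4           # hypothetical step counts: bias term, noise term

B = np.arange(1.0, 2001.0)
total_time = (a + b / B) * (t0 + B / P)     # wall-clock time to target accuracy
B_star_numeric = B[np.argmin(total_time)]
B_star_formula = np.sqrt(b * t0 * P / a)    # closed-form optimum ~ sqrt(P * t0)
```

Here the numeric minimizer and the closed form agree (both near \(\sqrt{1000}\approx 31.6\)).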

4. Adam update: \(w_{t+1}=w_t-\eta\,\hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon)\), where \(\hat{v}_t\approx\mathrm{diag}(\mathbb{E}[g_t^2])\). This is equivalent to \(w_{t+1}=w_t-\eta D_t^{-1}\hat{m}_t\) with preconditioner \(D_t=\mathrm{diag}(\sqrt{\hat{v}_t}+\epsilon)\): an approximate natural-gradient/Newton step using only diagonal second-moment information, adapting the learning rate per parameter.
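A sketch of a single Adam step written explicitly as a preconditioned gradient step (the helper name is illustrative; the hyperparameter defaults are the common ones):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step on parameters w with gradient g; returns (w, m, v)."""
    m = b1 * m + (1 - b1) * g                  # first-moment EMA
    v = b2 * v + (1 - b2) * g * g              # diagonal second-moment EMA
    m_hat = m / (1 - b1 ** t)                  # bias corrections
    v_hat = v / (1 - b2 ** t)
    # Preconditioned step: w - eta * D^{-1} m_hat with D = diag(sqrt(v_hat) + eps).
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v
```

At \(t=1\) the bias corrections make \(\hat{m}_1=g\) and \(\hat{v}_1=g^2\), so the first step is \(\approx-\eta\,\mathrm{sign}(g)\) per coordinate: the per-parameter adaptation in action.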