Stochastic Gradient Descent & Convergence
SGD replaces the full gradient with a noisy mini-batch estimate, enabling scalable optimization over massive datasets. The interplay between noise and step size governs convergence speed and final accuracy.
SGD Update
Sample \(i_t\) uniformly; update \(w_{t+1}=w_t-\eta_t\nabla\ell(w_t;z_{i_t})\) where \(\mathbb{E}[\nabla\ell(w;z_i)]=\nabla f(w)\) (unbiased). Gradient variance \(\sigma^2=\mathbb{E}\|\nabla\ell(w;z)-\nabla f(w)\|^2\). Schedules: constant \(\eta\), decaying \(\eta_t=\eta_0/\sqrt{t}\), or cosine annealing.
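A minimal NumPy sketch of this update loop (the `grad_sample` callback, the schedule names, and all constants are illustrative, not from the text):

```python
import numpy as np

def sgd(grad_sample, w0, data, eta0=0.1, schedule="sqrt", T=1000, seed=0):
    """Generic SGD loop. grad_sample(w, z) returns the per-example gradient."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    n = len(data)
    for t in range(1, T + 1):
        z = data[rng.integers(n)]          # sample i_t uniformly
        eta = eta0 if schedule == "const" else eta0 / np.sqrt(t)
        w -= eta * grad_sample(w, z)       # w_{t+1} = w_t - eta_t * grad
    return w
```

For example, with per-example loss \(\frac{1}{2}(w-z)^2\) the gradient is \(w-z\) and SGD converges toward the data mean.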
SGD Convergence (Convex)
For \(\beta\)-smooth convex \(f\) with variance \(\sigma^2\), SGD with \(\eta_t=c/\sqrt{t}\) and averaged iterate \(\bar{w}_T\) satisfies: \(\mathbb{E}[f(\bar{w}_T)]-f^*\le\frac{R^2+c^2\sigma^2\log T}{2c\sqrt{T}}\). For strongly convex objectives with constant \(\eta\): convergence to noise ball of radius \(O(\sigma^2\eta/\alpha)\).
Example 1: Online Regression
SGD on \(\ell(w;(x,y))=\frac{1}{2}(w^\top x-y)^2\). Derive the update.
Solution: Per-example gradient: \((w^\top x-y)x\). Update: \(w_{t+1}=w_t-\eta_t(w_t^\top x_t-y_t)x_t\). With \(\|x\|\le R_x\), per-example smoothness is \(\beta=R_x^2\); for realizable targets \(y=w^{*\top}x\) and iterates with \(\|w\|\le\|w^*\|\), the gradient variance is bounded by \(4\|w^*\|^2 R_x^4\).
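The derived update can be checked numerically on synthetic data (dimensions, step-size constant, and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)   # noisy linear targets

w = np.zeros(d)
for t in range(1, 5001):
    i = rng.integers(n)                     # sample one example uniformly
    eta = 0.05 / np.sqrt(t)                 # eta_t = eta_0 / sqrt(t)
    w -= eta * (w @ X[i] - y[i]) * X[i]     # the derived update
print(np.linalg.norm(w - w_star))           # small recovery error
```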
Example 2: Momentum
How does momentum accelerate SGD on strongly convex quadratics?
Solution: Heavy-ball: \(v_{t+1}=\mu v_t+\nabla f(w_t)\), \(w_{t+1}=w_t-\eta v_{t+1}\). On a quadratic with condition number \(\kappa=\beta/\alpha\), the optimal parameters are \(\eta=4/(\sqrt{\alpha}+\sqrt{\beta})^2\) and \(\mu=\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2\), giving rate \(O((1-2\sqrt{\alpha/\beta})^T)\): the dependence on \(\kappa\) improves to \(\sqrt{\kappa}\).
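A quick numerical check of heavy-ball on a diagonal quadratic, using the optimal parameters for a known spectrum (the matrix, horizon, and spectrum here are chosen purely for illustration):

```python
import numpy as np

# Quadratic f(w) = 0.5 w^T A w with eigenvalues in [alpha, beta]
alpha, beta = 1.0, 100.0
A = np.diag(np.linspace(alpha, beta, 10))
kappa = beta / alpha

# Optimal heavy-ball parameters for quadratics (assumes the spectrum is known)
eta = 4.0 / (np.sqrt(alpha) + np.sqrt(beta)) ** 2
mu = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2

w = np.ones(10)
v = np.zeros(10)
for _ in range(200):
    v = mu * v + A @ w          # v_{t+1} = mu v_t + grad f(w_t)
    w = w - eta * v             # w_{t+1} = w_t - eta v_{t+1}
print(np.linalg.norm(w))        # contracts at roughly (1 - 2/sqrt(kappa)) per step
```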
Practice
- Why does constant step-size SGD not converge to \(f^*\) for strongly convex objectives?
- Compare convergence of SGD vs GD when data is highly redundant.
- Derive the optimal batch size minimizing wall-clock time given parallel hardware.
- Show Adam's adaptive step is equivalent to preconditioning by \(\mathrm{diag}(v_t)^{-1/2}\).
Answer Key
1. SGD with constant step $\eta$: the gradient estimate has irreducible variance $\sigma^2$, so the iterate fluctuates in a ball of radius $\sim \eta\sigma^2/\alpha$ around $w^*$. It converges to a neighborhood of $f^*$, not $f^*$ itself: $E[f(w_T)]-f^* \leq (1-\alpha\eta)^T(f(w_0)-f^*) + \frac{\eta\beta\sigma^2}{2\alpha}$. The second term is the irreducible error floor, vanishing only as $\eta \to 0$ (e.g., diminishing step sizes $\eta_t = O(1/t)$).
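The noise floor can be seen on a toy one-dimensional problem (all constants are illustrative; for this loss the $1/t$ schedule reduces to a running average of the samples):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)  # f(w) = E[(w - z)^2]/2, w* = E[z]

def run(schedule, T=20_000):
    w = 0.0
    for t in range(1, T + 1):
        z = data[rng.integers(data.size)]
        eta = 0.2 if schedule == "const" else 1.0 / t  # constant vs O(1/t)
        w -= eta * (w - z)                             # stochastic gradient step
    return w

w_star = data.mean()
print(abs(run("const") - w_star))   # hovers in a noise ball around w*
print(abs(run("decay") - w_star))   # keeps shrinking toward w*
```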
2. With redundant data, a full-batch gradient averages nearly identical per-example gradients, so GD pays $n$ gradient evaluations for essentially the information in one. SGD makes progress after every single sample: per epoch ($n$ gradient evaluations), SGD takes $n$ update steps while GD takes one. In the extreme of $n$ copies of the same example, one SGD epoch matches $n$ GD steps at the cost of one. GD's lower variance pays off per evaluation only when examples are diverse and the noise term dominates.
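An extreme illustration of this point (hypothetical setup: one example copied $n$ times, both methods given the same budget of $n$ per-example gradient evaluations):

```python
import numpy as np

# Fully redundant dataset: n identical rows. A full-batch gradient costs n
# per-example gradients but carries exactly the same information as one.
rng = np.random.default_rng(0)
n, d = 1000, 5
x0 = rng.normal(size=d)
X = np.tile(x0, (n, 1))                      # n identical rows
y = np.full(n, x0 @ np.ones(d))              # consistent scalar target

eta = 0.1 / (x0 @ x0)
w_gd, w_sgd = np.zeros(d), np.zeros(d)

w_gd -= eta * (X.T @ (X @ w_gd - y)) / n     # GD: 1 step = n evaluations
for i in range(n):                           # SGD: n steps = n evaluations
    w_sgd -= eta * (w_sgd @ X[i] - y[i]) * X[i]

f = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
print(f(w_gd), f(w_sgd))                     # SGD is far ahead at equal compute
```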
3. Model wall-clock time per step as $t(B) = t_0 + B/P$ (launch/communication overhead plus compute at parallel throughput $P$), and steps to a target error as $N(B) \propto 1 + B_c/B$, where the critical batch size $B_c \propto \sigma^2$ captures diminishing returns from variance reduction ($\text{Var} \propto \sigma^2/B$). Total time $T(B) \propto (1 + B_c/B)(t_0 + B/P)$; setting $\partial T/\partial B = 0$ gives $B^* = \sqrt{P\,t_0\,B_c}$, the geometric mean of the hardware scale $P t_0$ and the noise scale $B_c$. Beyond $B^*$, larger batches waste compute (linear scaling breaks down).
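One way to make this tradeoff concrete is a numeric sweep under a hypothetical model (steps needed $N(B)\propto 1+B_c/B$, per-step time $t_0+B/P$; the constants $t_0$, $P$, $B_c$ below are made up):

```python
import numpy as np

# Hypothetical constants: per-step overhead, parallel throughput, critical batch
t0, P, B_c = 1e-3, 1e6, 512.0

B = np.arange(1, 20001, dtype=float)
wall_clock = (1 + B_c / B) * (t0 + B / P)   # total time ~ N(B) * t(B)
B_star = B[np.argmin(wall_clock)]
print(B_star, np.sqrt(P * t0 * B_c))        # numeric minimizer vs closed form
```

The sweep's minimizer matches the closed-form geometric mean; past it, the curve flattens and extra batch size stops buying wall-clock time.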
4. Adam update: $w_{t+1} = w_t - \eta \hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon)$ where $\hat{v}_t \approx \text{diag}(E[g_t^2])$. This is equivalent to $w_{t+1} = w_t - \eta D_t^{-1}\hat{m}_t$ with preconditioner $D_t = \text{diag}(\sqrt{\hat{v}_t}+\epsilon)$. This approximates natural gradient / Newton's method using only diagonal second-moment information, adapting the learning rate per-parameter.
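A sketch of one Adam step written so the diagonal preconditioner is explicit (hyperparameters are the common defaults; this is a from-scratch illustration, not tied to any library's API):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step, expressed as a diagonal preconditioned update."""
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias corrections
    v_hat = v / (1 - b2 ** t)
    D = np.sqrt(v_hat) + eps             # D_t = diag(sqrt(v_hat)) + eps
    return w - eta * m_hat / D, m, v     # w_{t+1} = w_t - eta * D_t^{-1} m_hat
```

On a badly scaled quadratic (curvatures differing by 1000x), the per-parameter division by `D` equalizes the effective step across coordinates, which is exactly the preconditioning interpretation above.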