Neural Networks: Universal Approximation & Backpropagation
The universal approximation theorem establishes that even shallow networks can approximate any continuous function on a compact domain to arbitrary accuracy. Backpropagation provides an efficient \(O(\text{params})\) algorithm for computing gradients via the chain rule through the computational graph.
Feedforward Network
\(L\)-layer network: \(h^{(0)}=x\), \(h^{(l)}=\sigma(W^{(l)}h^{(l-1)}+b^{(l)})\), output \(\hat{y}=h^{(L)}\). Params \(\theta=\{W^{(l)},b^{(l)}\}\). Total params: \(\sum_l(n_ln_{l-1}+n_l)\). Activations: ReLU \(\max(0,z)\), sigmoid, tanh. ReLU mitigates vanishing gradients (its derivative is 0 or 1) and is piecewise linear.
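The forward recursion above can be sketched in a few lines of NumPy. The 2-3-1 architecture and random weights below are illustrative, not from the text; note the parameter count matches \(\sum_l(n_ln_{l-1}+n_l)\):

```python
import numpy as np

def forward(x, params, act=lambda z: np.maximum(0, z)):
    """h^(0) = x; h^(l) = act(W^(l) @ h^(l-1) + b^(l)); returns h^(L)."""
    h = x
    for W, b in params:
        h = act(W @ h + b)
    return h

# Illustrative 2-3-1 network; total params = (3*2 + 3) + (1*3 + 1) = 13
rng = np.random.default_rng(0)
params = [(rng.standard_normal((3, 2)), np.zeros(3)),
          (rng.standard_normal((1, 3)), np.zeros(1))]
n_params = sum(W.size + b.size for W, b in params)
y_hat = forward(np.array([1.0, -0.5]), params)
```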
Universal Approximation Theorem
For any continuous \(f:[0,1]^d\to\mathbb{R}\) and \(\epsilon>0\), some single-hidden-layer network with a non-polynomial activation approximates \(f\) uniformly within \(\epsilon\). The required width can be large: for ReLU, \(d\)-variable functions may require \(O(\epsilon^{-d})\) neurons. Depth can exponentially reduce the width needed for hierarchically structured functions.
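In 1D the ReLU case is constructive: a one-hidden-layer ReLU network can exactly reproduce the piecewise-linear interpolant of \(f\) on a grid, with one hidden unit per knot (each unit contributing one slope change). A sketch, assuming a uniform grid on \([0,1]\); the helper name `relu_interpolant` is mine:

```python
import numpy as np

def relu_interpolant(f, n):
    """One-hidden-layer ReLU net equal to the piecewise-linear interpolant
    of f on a uniform grid over [0, 1]: f(0) + sum_i a_i * max(0, x - x_i)."""
    xs = np.linspace(0.0, 1.0, n + 1)
    slopes = np.diff(f(xs)) / np.diff(xs)
    a = np.diff(slopes, prepend=0.0)   # a_i = change in slope at knot x_i
    return lambda x: f(xs[0]) + np.sum(a * np.maximum(0.0, x - xs[:-1]))
```

As \(n\) grows the interpolant converges uniformly to \(f\), which is the 1D shape of the \(O(\epsilon^{-d})\) width bound.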
Example 1: Backpropagation
Compute \(\partial\mathcal{L}/\partial W^{(l)}\) for squared loss via chain rule.
Solution: Error signal \(\delta^{(l)}=\partial\mathcal{L}/\partial z^{(l)}\). Output: \(\delta^{(L)}=\nabla_{h^{(L)}}\mathcal{L}\odot\sigma'(z^{(L)})\). Recursion: \(\delta^{(l)}=(W^{(l+1)})^\top\delta^{(l+1)}\odot\sigma'(z^{(l)})\). Gradient: \(\partial\mathcal{L}/\partial W^{(l)}=\delta^{(l)}(h^{(l-1)})^\top\).
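The delta recursion in the solution can be sketched directly. ReLU is used for \(\sigma\) throughout, and the 2-3-1 weights below are hand-picked illustrations, not from the text:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)

def backprop(x, y, params):
    """Gradients of L = 0.5 * ||h^(L) - y||^2 via the delta recursion."""
    hs, zs = [x], []                  # cache h^(l) and pre-activations z^(l)
    for W, b in params:
        zs.append(W @ hs[-1] + b)
        hs.append(relu(zs[-1]))
    delta = (hs[-1] - y) * relu_prime(zs[-1])        # delta^(L)
    grads = [None] * len(params)
    for l in range(len(params) - 1, -1, -1):
        grads[l] = (np.outer(delta, hs[l]), delta)   # (dL/dW^(l), dL/db^(l))
        if l > 0:                                    # delta^(l) from delta^(l+1)
            delta = (params[l][0].T @ delta) * relu_prime(zs[l - 1])
    return grads

# Illustrative 2-3-1 network
params = [(np.array([[0.2, -0.4], [0.5, 0.3], [-0.1, 0.7]]),
           np.array([0.1, -0.2, 0.05])),
          (np.array([[0.4, -0.3, 0.6]]), np.array([0.2]))]
grads = backprop(np.array([0.5, -1.0]), np.array([1.0]), params)
```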
Example 2: Vanishing Gradients
Why do deep sigmoid networks suffer from vanishing gradients?
Solution: \(\|\delta^{(l)}\|\le\|\delta^{(L)}\|\prod_{k=l+1}^L\|W^{(k)}\|\cdot\max|\sigma'|\). Sigmoid: \(\max|\sigma'|=0.25\). For \(L=10\): factor \(0.25^{10}\approx 10^{-6}\). ReLU (\(\sigma'\in\{0,1\}\)), batch norm, and residual connections mitigate this.
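A quick numerical check of the \(0.25\) bound and the 10-layer factor:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigma'(z) = sigma(z)(1 - sigma(z)) peaks at z = 0 with value 0.25,
# so each sigmoid layer can shrink the error signal by a factor of 4 or more.
z = np.linspace(-10, 10, 10001)
max_slope = (sigmoid(z) * (1 - sigmoid(z))).max()
factor_10_layers = max_slope ** 10   # the ~1e-6 factor for L = 10
```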
Practice
- Implement backprop by hand for a 2-layer network on a 2D binary classification problem.
- Show a depth-2 ReLU network can represent any piecewise linear function.
- Derive the variance-preserving (Xavier/He) weight initialization from first principles.
- Derive the gradient of batch normalization with respect to its pre-normalized inputs.
Answer Key
1. 2-layer network: $z = W_2\sigma(W_1x+b_1)+b_2$. Forward: compute $h=\sigma(W_1x+b_1)$, $z=W_2h+b_2$, loss $L=\frac{1}{2}(z-y)^2$. Backward: $\delta_2 = z-y$, $\partial L/\partial W_2 = \delta_2 h^T$, $\partial L/\partial b_2 = \delta_2$. $\delta_1 = (W_2^T\delta_2)\odot\sigma'(W_1x+b_1)$, $\partial L/\partial W_1 = \delta_1 x^T$, $\partial L/\partial b_1 = \delta_1$. For 2D classification with sigmoid output, replace squared loss with cross-entropy.
2. A single ReLU neuron $\max(0, w^Tx+b)$ is a 'hinge' — zero on one side, linear on the other. A sum of $k$ such neurons (depth-2 network): $f(x) = \sum_{i=1}^k a_i\max(0, w_i^Tx+b_i)$ can represent any piecewise linear function with $\leq k$ pieces (in 1D). Each neuron adds a breakpoint. In $d$ dimensions, the arrangement of hyperplanes $\{w_i^Tx+b_i=0\}$ partitions space into polyhedral regions where $f$ is affine.
3. Layer output variance: $\text{Var}(y) = n_{\text{in}}\cdot\text{Var}(w)\cdot\text{Var}(x)$ (assuming i.i.d. zero-mean weights and inputs). Xavier: set $\text{Var}(w) = 1/n_{\text{in}}$ (or $2/(n_{\text{in}}+n_{\text{out}})$ to balance the forward and backward passes) to preserve variance across layers. For ReLU (He init): ReLU zeros out half the activations on average, so $\text{Var}(w) = 2/n_{\text{in}}$ compensates. This prevents vanishing/exploding activations and gradients at initialization.
4. Batch norm: $\hat{x}_i = (x_i-\mu_B)/\sqrt{\sigma_B^2+\epsilon}$ where $\mu_B = \frac{1}{m}\sum x_i$, $\sigma_B^2 = \frac{1}{m}\sum(x_i-\mu_B)^2$. Gradient $\partial L/\partial x_i$: by chain rule through $\hat{x}_i$, $\mu_B$, $\sigma_B^2$. Result: $\frac{\partial L}{\partial x_i} = \frac{1}{m\sqrt{\sigma_B^2+\epsilon}}\left(m\frac{\partial L}{\partial\hat{x}_i}-\sum_j\frac{\partial L}{\partial\hat{x}_j}-\hat{x}_i\sum_j\frac{\partial L}{\partial\hat{x}_j}\hat{x}_j\right)$. The gradient subtracts off the batch-mean component and the component along $\hat{x}$, reflecting that batch norm's output is invariant to shifting and rescaling its inputs.
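The closed-form gradient in answer 4 can be checked against finite differences. A sketch; `bn_backward` is a hypothetical helper name, operating per feature over a batch of size $m$:

```python
import numpy as np

def bn_backward(x, dL_dxhat, eps=1e-5):
    """Closed-form dL/dx for batch norm, rows = batch, columns = features."""
    m = x.shape[0]
    var = x.var(axis=0)                      # biased variance, as in batch norm
    xhat = (x - x.mean(axis=0)) / np.sqrt(var + eps)
    return (m * dL_dxhat - dL_dxhat.sum(axis=0)
            - xhat * (dL_dxhat * xhat).sum(axis=0)) / (m * np.sqrt(var + eps))
```

Taking a loss of the form $L = \sum_{i} g_i \hat{x}_i$ for a fixed $g$ (so $\partial L/\partial\hat{x}_i = g_i$), the output of `bn_backward` matches a central-difference numerical gradient.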