Neural Networks: Universal Approximation & Backpropagation
The universal approximation theorem establishes that even shallow networks can approximate any continuous function on a compact domain to arbitrary accuracy. Backpropagation provides an efficient \(O(\text{params})\) algorithm for computing gradients via the chain rule through the computational graph.
Feedforward Network
\(L\)-layer network: \(h^{(0)}=x\), \(h^{(l)}=\sigma(W^{(l)}h^{(l-1)}+b^{(l)})\), output \(\hat{y}=h^{(L)}\). Params \(\theta=\{W^{(l)},b^{(l)}\}\). Total params: \(\sum_l(n_ln_{l-1}+n_l)\). Activations: ReLU \(\max(0,z)\), sigmoid, tanh. ReLU mitigates vanishing gradients (its derivative is 0 or 1) and is piecewise linear.
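The forward recursion above can be sketched in a few lines of NumPy. The 2-3-1 architecture and random weights below are illustrative, not from the text; note the parameter count matches \(\sum_l(n_ln_{l-1}+n_l)\):

```python
import numpy as np

def forward(x, params, act=lambda z: np.maximum(0, z)):
    """h^(0) = x; h^(l) = act(W^(l) @ h^(l-1) + b^(l)); returns h^(L)."""
    h = x
    for W, b in params:
        h = act(W @ h + b)
    return h

# Illustrative 2-3-1 network; total params = (3*2 + 3) + (1*3 + 1) = 13
rng = np.random.default_rng(0)
params = [(rng.standard_normal((3, 2)), np.zeros(3)),
          (rng.standard_normal((1, 3)), np.zeros(1))]
n_params = sum(W.size + b.size for W, b in params)
y_hat = forward(np.array([1.0, -0.5]), params)
```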
Universal Approximation Theorem
For any continuous \(f:[0,1]^d\to\mathbb{R}\) and \(\epsilon>0\), some single-hidden-layer network with a non-polynomial activation approximates \(f\) uniformly within \(\epsilon\). The required width can be large: for ReLU, \(d\)-variable functions may require \(O(\epsilon^{-d})\) neurons. Depth can exponentially reduce the width needed for hierarchically structured functions.
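In 1D the ReLU case is constructive: a one-hidden-layer ReLU network can exactly reproduce the piecewise-linear interpolant of \(f\) on a grid, with one hidden unit per knot (each unit contributing one slope change). A sketch, assuming a uniform grid on \([0,1]\); the helper name `relu_interpolant` is mine:

```python
import numpy as np

def relu_interpolant(f, n):
    """One-hidden-layer ReLU net equal to the piecewise-linear interpolant
    of f on a uniform grid over [0, 1]: f(0) + sum_i a_i * max(0, x - x_i)."""
    xs = np.linspace(0.0, 1.0, n + 1)
    slopes = np.diff(f(xs)) / np.diff(xs)
    a = np.diff(slopes, prepend=0.0)   # a_i = change in slope at knot x_i
    return lambda x: f(xs[0]) + np.sum(a * np.maximum(0.0, x - xs[:-1]))
```

As \(n\) grows the interpolant converges uniformly to \(f\), which is the 1D shape of the \(O(\epsilon^{-d})\) width bound.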
Example 1: Backpropagation
Compute \(\partial\mathcal{L}/\partial W^{(l)}\) for squared loss via chain rule.
Solution: Error signal \(\delta^{(l)}=\partial\mathcal{L}/\partial z^{(l)}\). Output: \(\delta^{(L)}=\nabla_{h^{(L)}}\mathcal{L}\odot\sigma'(z^{(L)})\). Recursion: \(\delta^{(l)}=(W^{(l+1)})^\top\delta^{(l+1)}\odot\sigma'(z^{(l)})\). Gradient: \(\partial\mathcal{L}/\partial W^{(l)}=\delta^{(l)}(h^{(l-1)})^\top\).
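The delta recursion in the solution can be sketched directly. ReLU is used for \(\sigma\) throughout, and the 2-3-1 weights below are hand-picked illustrations, not from the text:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)

def backprop(x, y, params):
    """Gradients of L = 0.5 * ||h^(L) - y||^2 via the delta recursion."""
    hs, zs = [x], []                  # cache h^(l) and pre-activations z^(l)
    for W, b in params:
        zs.append(W @ hs[-1] + b)
        hs.append(relu(zs[-1]))
    delta = (hs[-1] - y) * relu_prime(zs[-1])        # delta^(L)
    grads = [None] * len(params)
    for l in range(len(params) - 1, -1, -1):
        grads[l] = (np.outer(delta, hs[l]), delta)   # (dL/dW^(l), dL/db^(l))
        if l > 0:                                    # delta^(l) from delta^(l+1)
            delta = (params[l][0].T @ delta) * relu_prime(zs[l - 1])
    return grads

# Illustrative 2-3-1 network
params = [(np.array([[0.2, -0.4], [0.5, 0.3], [-0.1, 0.7]]),
           np.array([0.1, -0.2, 0.05])),
          (np.array([[0.4, -0.3, 0.6]]), np.array([0.2]))]
grads = backprop(np.array([0.5, -1.0]), np.array([1.0]), params)
```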
Example 2: Vanishing Gradients
Why do deep sigmoid networks suffer from vanishing gradients?
Solution: \(\|\delta^{(l)}\|\le\|\delta^{(L)}\|\prod_{k=l+1}^L\|W^{(k)}\|\cdot\max|\sigma'|\). Sigmoid: \(\max|\sigma'|=0.25\). For \(L=10\): factor \(0.25^{10}\approx 10^{-6}\). ReLU (\(\sigma'\in\{0,1\}\)), batch norm, and residual connections mitigate this.
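A quick numerical check of the \(0.25\) bound and the 10-layer factor:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigma'(z) = sigma(z)(1 - sigma(z)) peaks at z = 0 with value 0.25,
# so each sigmoid layer can shrink the error signal by a factor of 4 or more.
z = np.linspace(-10, 10, 10001)
max_slope = (sigmoid(z) * (1 - sigmoid(z))).max()
factor_10_layers = max_slope ** 10   # the ~1e-6 factor for L = 10
```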
Practice
- Implement backprop by hand for a 2-layer network on a 2D binary classification problem.
- Show a depth-2 ReLU network can represent any piecewise linear function.
- Derive the variance-preserving (Xavier/He) weight initialization from first principles.
- Derive the gradient of batch normalization with respect to its pre-normalized inputs.
Answer Key
1. 2-layer network: $z = W_2\sigma(W_1x+b_1)+b_2$. Forward: compute $h=\sigma(W_1x+b_1)$, $z=W_2h+b_2$, loss $L=\frac{1}{2}(z-y)^2$. Backward: $\delta_2 = z-y$, $\partial L/\partial W_2 = \delta_2 h^T$, $\partial L/\partial b_2 = \delta_2$. $\delta_1 = (W_2^T\delta_2)\odot\sigma'(W_1x+b_1)$, $\partial L/\partial W_1 = \delta_1 x^T$, $\partial L/\partial b_1 = \delta_1$. For 2D classification with sigmoid output, replace squared loss with cross-entropy.
2. A single ReLU neuron $\max(0, w^Tx+b)$ is a 'hinge' — zero on one side, linear on the other. A sum of $k$ such neurons (depth-2 network): $f(x) = \sum_{i=1}^k a_i\max(0, w_i^Tx+b_i)$ can represent any piecewise linear function with $\leq k$ pieces (in 1D). Each neuron adds a breakpoint. In $d$ dimensions, the arrangement of hyperplanes $\{w_i^Tx+b_i=0\}$ partitions space into polyhedral regions where $f$ is affine.
3. Layer output variance: $\text{Var}(y) = n_{\text{in}}\cdot\text{Var}(w)\cdot\text{Var}(x)$ (assuming i.i.d. zero-mean weights and inputs). Xavier: set $\text{Var}(w) = 1/n_{\text{in}}$ (or $2/(n_{\text{in}}+n_{\text{out}})$ to balance the forward and backward passes) to preserve variance across layers. For ReLU (He init): ReLU zeros out half the activations on average, so $\text{Var}(w) = 2/n_{\text{in}}$ compensates. This prevents vanishing/exploding activations and gradients at initialization.
4. Batch norm: $\hat{x}_i = (x_i-\mu_B)/\sqrt{\sigma_B^2+\epsilon}$ where $\mu_B = \frac{1}{m}\sum x_i$, $\sigma_B^2 = \frac{1}{m}\sum(x_i-\mu_B)^2$. Gradient $\partial L/\partial x_i$: by chain rule through $\hat{x}_i$, $\mu_B$, $\sigma_B^2$. Result: $\frac{\partial L}{\partial x_i} = \frac{1}{m\sqrt{\sigma_B^2+\epsilon}}\left(m\frac{\partial L}{\partial\hat{x}_i}-\sum_j\frac{\partial L}{\partial\hat{x}_j}-\hat{x}_i\sum_j\frac{\partial L}{\partial\hat{x}_j}\hat{x}_j\right)$. The gradient subtracts off the batch-mean component and the component along $\hat{x}$, reflecting that batch norm's output is invariant to shifting and rescaling its inputs.
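The closed-form gradient in answer 4 can be checked against finite differences. A sketch; `bn_backward` is a hypothetical helper name, operating per feature over a batch of size $m$:

```python
import numpy as np

def bn_backward(x, dL_dxhat, eps=1e-5):
    """Closed-form dL/dx for batch norm, rows = batch, columns = features."""
    m = x.shape[0]
    var = x.var(axis=0)                      # biased variance, as in batch norm
    xhat = (x - x.mean(axis=0)) / np.sqrt(var + eps)
    return (m * dL_dxhat - dL_dxhat.sum(axis=0)
            - xhat * (dL_dxhat * xhat).sum(axis=0)) / (m * np.sqrt(var + eps))
```

Taking a loss of the form $L = \sum_{i} g_i \hat{x}_i$ for a fixed $g$ (so $\partial L/\partial\hat{x}_i = g_i$), the output of `bn_backward` matches a central-difference numerical gradient.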