Markov Chain Monte Carlo: Metropolis-Hastings
For most real-world models the posterior normalizing constant $p(\mathcal{D})$ is intractable, making direct sampling impossible. Markov Chain Monte Carlo (MCMC) sidesteps this by constructing a Markov chain whose stationary distribution equals the target posterior. After a burn-in period, the chain's trajectory yields approximate posterior samples for estimation and uncertainty quantification.
The Metropolis-Hastings (MH) algorithm is the foundational MCMC method, requiring only the ability to evaluate the posterior up to a constant.
Metropolis-Hastings Algorithm
Given current state $\theta^{(t)}$ and proposal distribution $q(\theta^*\mid\theta^{(t)})$:
1. Draw $\theta^*\sim q(\cdot\mid\theta^{(t)})$.
2. Compute acceptance ratio: $$\alpha=\min\!\left(1,\;\frac{p(\theta^*\mid\mathcal{D})\,q(\theta^{(t)}\mid\theta^*)}{p(\theta^{(t)}\mid\mathcal{D})\,q(\theta^*\mid\theta^{(t)})}\right)$$
3. Set $\theta^{(t+1)}=\theta^*$ with probability $\alpha$, else $\theta^{(t+1)}=\theta^{(t)}$.
Normalizing constants cancel in the ratio $p(\theta^*\mid\mathcal{D})/p(\theta^{(t)}\mid\mathcal{D})$.
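The three steps above can be sketched as a single update function. This is an illustrative skeleton, not a reference implementation: the names `log_post`, `propose`, and `log_q` are hypothetical, and the ratio is computed in log-space to avoid numerical under/overflow.

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_step(theta, log_post, propose, log_q):
    """One Metropolis-Hastings update.

    log_post : unnormalized log-posterior (the normalizer cancels in the ratio)
    propose  : draws theta* ~ q(. | theta)
    log_q    : log q(b | a); only its asymmetric part matters
    """
    theta_star = propose(theta)
    # log acceptance ratio, matching the formula for alpha above
    log_alpha = (log_post(theta_star) - log_post(theta)
                 + log_q(theta, theta_star) - log_q(theta_star, theta))
    if np.log(rng.uniform()) < log_alpha:
        return theta_star, True   # accept
    return theta, False           # reject
```

Accepting when $\log u < \log\alpha$ for $u\sim\mathrm{Uniform}(0,1)$ is equivalent to accepting with probability $\min(1,\alpha)$.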
Detailed Balance & Ergodicity
MH satisfies detailed balance: $p(\theta)\,K(\theta,\theta^*)=p(\theta^*)\,K(\theta^*,\theta)$ where $K$ is the MH transition kernel. Detailed balance ensures $p(\theta\mid\mathcal{D})$ is a stationary distribution of the chain; if the chain is additionally irreducible and aperiodic, this stationary distribution is unique and the ergodic theorem guarantees: $$\frac{1}{T}\sum_{t=1}^T f(\theta^{(t)})\xrightarrow{a.s.}\mathbb{E}_{p(\theta\mid\mathcal{D})}[f(\theta)]$$
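On a finite state space, detailed balance can be verified numerically by building the MH kernel explicitly. A minimal sketch, assuming a hypothetical 3-state target and a uniform proposal:

```python
import numpy as np

# Toy 3-state target (hypothetical values) and a uniform symmetric proposal.
p = np.array([0.2, 0.5, 0.3])        # target distribution
q = np.full((3, 3), 1 / 3)           # proposal matrix q(j | i)

# MH transition kernel: K(i, j) = q(i, j) * alpha(i, j) for i != j,
# with all rejected mass placed on the diagonal.
K = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            alpha = min(1.0, (p[j] * q[j, i]) / (p[i] * q[i, j]))
            K[i, j] = q[i, j] * alpha
    K[i, i] = 1.0 - K[i].sum()

# Detailed balance: p_i K(i, j) == p_j K(j, i) for all pairs
flow = p[:, None] * K
assert np.allclose(flow, flow.T)

# Stationarity follows by summing detailed balance over i: p K == p
assert np.allclose(p @ K, p)
```

The final check makes the logical chain concrete: summing $p(\theta)K(\theta,\theta^*)=p(\theta^*)K(\theta^*,\theta)$ over $\theta$ gives $pK=p$, i.e. stationarity.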
Example 1: Random-Walk Metropolis
Symmetric proposal $q(\theta^*\mid\theta)=\mathcal{N}(\theta,\sigma_q^2)$ gives $\alpha=\min(1,p(\theta^*\mid\mathcal{D})/p(\theta\mid\mathcal{D}))$. Optimal $\sigma_q$ targets acceptance rate $\approx0.234$ in high dimensions (Roberts & Rosenthal, 2001). Too small $\sigma_q$: high acceptance but slow mixing; too large: most proposals rejected.
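The step-size trade-off is easy to observe empirically. A sketch of random-walk Metropolis on a standard-normal target (the function name and step sizes are illustrative):

```python
import numpy as np

def rw_metropolis(log_post, theta0, sigma_q, n_steps, seed=0):
    """Random-walk Metropolis; returns samples and the acceptance rate."""
    rng = np.random.default_rng(seed)
    theta, samples, accepts = theta0, [], 0
    for _ in range(n_steps):
        theta_star = theta + rng.normal(0.0, sigma_q)
        # symmetric proposal: the q-ratio cancels from alpha
        if np.log(rng.uniform()) < log_post(theta_star) - log_post(theta):
            theta, accepts = theta_star, accepts + 1
        samples.append(theta)
    return np.array(samples), accepts / n_steps

# Standard-normal target: tiny steps accept almost always but crawl,
# huge steps are nearly always rejected.
log_post = lambda t: -0.5 * t**2
_, acc_small = rw_metropolis(log_post, 0.0, 0.1, 5000)
_, acc_large = rw_metropolis(log_post, 0.0, 25.0, 5000)
```

In practice one tunes $\sigma_q$ (or adapts it during burn-in) until the acceptance rate lands near the 0.234 target for high-dimensional problems, or roughly 0.44 in one dimension.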
Example 2: Convergence Diagnostics
Run $m$ chains from dispersed starting points. Gelman-Rubin statistic: $$\hat{R}=\sqrt{\frac{\hat{V}}{W}}$$ where $W$ is the average within-chain variance and $\hat{V}=\frac{T-1}{T}W+\frac{B}{T}$ is the pooled variance estimate combining $W$ with the between-chain variance $B$. $\hat{R}\approx1$ indicates convergence; $\hat{R}>1.1$ signals non-convergence. Effective sample size $n_{eff}=T/(1+2\sum_{k=1}^\infty\rho_k)$ adjusts for autocorrelation $\rho_k$.
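Both diagnostics can be sketched in a few lines. This is a simplified version (no chain splitting, and the autocorrelation sum is truncated at the first non-positive term, a Geyer-style cutoff); production code should prefer an established implementation.

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat from an (m, T) array: rows are chains of length T."""
    m, T = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = T * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    V_hat = (T - 1) / T * W + B / T              # pooled variance estimate
    return np.sqrt(V_hat / W)

def ess(x, max_lag=200):
    """Effective sample size from the empirical autocorrelation of one chain."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf = acf / acf[0]
    tau = 1.0                                    # integrated autocorrelation time
    for k in range(1, max_lag):
        if acf[k] <= 0:                          # truncate the noisy tail
            break
        tau += 2 * acf[k]
    return len(x) / tau
```

For well-mixed, nearly independent chains, `gelman_rubin` returns a value very close to 1 and `ess` returns close to the nominal sample size; strong autocorrelation drives $\hat{R}$ up and $n_{eff}$ down.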
Practice
- Prove that a symmetric proposal $q(\theta^*\mid\theta)=q(\theta\mid\theta^*)$ simplifies MH to the original Metropolis algorithm.
- Explain why posterior samples from MCMC are correlated, and how this affects Monte Carlo error estimates.
Answer Key
1. MH acceptance ratio: $\alpha=\min(1,\frac{\pi(\theta^*)q(\theta|\theta^*)}{\pi(\theta)q(\theta^*|\theta)})$. With symmetric $q$, the $q$-ratio is 1, so $\alpha=\min(1,\pi(\theta^*)/\pi(\theta))$, which is the original Metropolis algorithm.
2. MCMC samples are correlated because each new sample depends on the previous one (Markov chain). The effective sample size (ESS) is $n/(1+2\sum_k\rho_k)$ where $\rho_k$ is the lag-$k$ autocorrelation. Monte Carlo error is $\sigma/\sqrt{\text{ESS}}$, larger than for i.i.d. samples.
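The error inflation in answer 2 can be checked on a chain with known autocorrelation. An AR(1) process $x_t=\rho x_{t-1}+\sqrt{1-\rho^2}\,\epsilon_t$ has $\rho_k=\rho^k$, so $1+2\sum_k\rho_k=(1+\rho)/(1-\rho)$ and the ESS is available in closed form (the values of $\rho$ and $n$ here are illustrative):

```python
import numpy as np

rho, n = 0.9, 200_000
rng = np.random.default_rng(2)

# Simulate a stationary AR(1) chain with unit marginal variance.
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

# Closed-form ESS: n / (1 + 2 * sum_k rho^k) = n * (1 - rho) / (1 + rho)
ess_theory = n * (1 - rho) / (1 + rho)

# Monte Carlo standard error uses ESS, not n
se_iid = 1.0 / np.sqrt(n)
se_mcmc = 1.0 / np.sqrt(ess_theory)
```

With $\rho=0.9$ the ESS is $n/19$, so the Monte Carlo standard error is $\sqrt{19}\approx4.4$ times larger than the i.i.d. formula would suggest: ignoring autocorrelation badly understates uncertainty.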