
Markov Chain Monte Carlo: Metropolis-Hastings

35 min · Bayesian Inference & Statistical Modeling


For most real-world models the posterior normalizing constant $p(\mathcal{D})$ is intractable, making direct sampling impossible. Markov Chain Monte Carlo (MCMC) sidesteps this by constructing a Markov chain whose stationary distribution equals the target posterior. After a burn-in period, the chain's trajectory yields approximate posterior samples for estimation and uncertainty quantification.

The Metropolis-Hastings (MH) algorithm is the foundational MCMC method, requiring only the ability to evaluate the posterior up to a constant.

Metropolis-Hastings Algorithm

Given current state $\theta^{(t)}$ and proposal distribution $q(\theta^*\mid\theta^{(t)})$:
1. Draw $\theta^*\sim q(\cdot\mid\theta^{(t)})$.
2. Compute acceptance ratio: $$\alpha=\min\!\left(1,\;\frac{p(\theta^*\mid\mathcal{D})\,q(\theta^{(t)}\mid\theta^*)}{p(\theta^{(t)}\mid\mathcal{D})\,q(\theta^*\mid\theta^{(t)})}\right)$$
3. Set $\theta^{(t+1)}=\theta^*$ with probability $\alpha$, else $\theta^{(t+1)}=\theta^{(t)}$.
Normalizing constants cancel in the ratio $p(\theta^*\mid\mathcal{D})/p(\theta^{(t)}\mid\mathcal{D})$.
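The three steps above translate almost line-for-line into code. Below is a minimal sketch in Python/NumPy (not part of the lesson): the Gamma(3, 1) target and the multiplicative log-normal proposal are illustrative choices, picked because an asymmetric proposal makes the Hastings correction $q(\theta^{(t)}\mid\theta^*)/q(\theta^*\mid\theta^{(t)})$ actually matter.

```python
import numpy as np

def metropolis_hastings(log_post, sample_q, log_q, theta0, n_steps, rng):
    """Generic MH sampler; log_post is the log posterior up to an additive constant."""
    theta = theta0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        theta_star = sample_q(theta, rng)          # step 1: propose
        # step 2: log acceptance ratio -- the normalizing constant cancels
        log_alpha = (log_post(theta_star) - log_post(theta)
                     + log_q(theta, theta_star) - log_q(theta_star, theta))
        if np.log(rng.uniform()) < log_alpha:      # step 3: accept or stay put
            theta = theta_star
        samples[t] = theta
    return samples

# Illustrative target: Gamma(shape=3, rate=1), known only up to a constant.
log_post = lambda th: 2.0 * np.log(th) - th

# Asymmetric multiplicative proposal: theta* = theta * exp(s * z), z ~ N(0, 1).
s = 0.5
sample_q = lambda th, rng: th * np.exp(s * rng.standard_normal())
# log q(to | frm): log-normal density in `to`, centered at log(frm).
log_q = lambda to, frm: -np.log(to) - (np.log(to) - np.log(frm))**2 / (2 * s**2)

draws = metropolis_hastings(log_post, sample_q, log_q, 1.0, 30_000,
                            np.random.default_rng(1))
print(draws[5_000:].mean())  # should be near the Gamma(3, 1) mean of 3
```

Dropping the first 5,000 draws discards the burn-in. With a symmetric proposal the two `log_q` terms cancel and the sampler reduces to random-walk Metropolis, as in Example 1 below.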

Detailed Balance & Ergodicity

MH satisfies detailed balance: $p(\theta)\,K(\theta,\theta^*)=p(\theta^*)\,K(\theta^*,\theta)$, where $K$ is the MH transition kernel. This ensures $p(\theta\mid\mathcal{D})$ is the unique stationary distribution. If the chain is irreducible and aperiodic, the ergodic theorem guarantees that, as $T\to\infty$: $$\frac{1}{T}\sum_{t=1}^T f(\theta^{(t)})\xrightarrow{a.s.}\mathbb{E}_{p(\theta\mid\mathcal{D})}[f(\theta)]$$
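To see why detailed balance holds, note that for $\theta^*\neq\theta$ the kernel factors as $K(\theta,\theta^*)=q(\theta^*\mid\theta)\,\alpha(\theta,\theta^*)$, so the left-hand side of the balance equation becomes:

```latex
p(\theta \mid \mathcal{D})\, q(\theta^* \mid \theta)\,
\min\!\left(1,\;
  \frac{p(\theta^* \mid \mathcal{D})\, q(\theta \mid \theta^*)}
       {p(\theta \mid \mathcal{D})\, q(\theta^* \mid \theta)}\right)
= \min\!\bigl(
  p(\theta \mid \mathcal{D})\, q(\theta^* \mid \theta),\;
  p(\theta^* \mid \mathcal{D})\, q(\theta \mid \theta^*)\bigr)
```

The right-hand side is symmetric under swapping $\theta$ and $\theta^*$, so it also equals $p(\theta^*\mid\mathcal{D})\,K(\theta^*,\theta)$; rejected moves ($\theta^*=\theta$) satisfy the equation trivially.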

Example 1: Random-Walk Metropolis

Symmetric proposal $q(\theta^*\mid\theta)=\mathcal{N}(\theta,\sigma_q^2)$ gives $\alpha=\min(1,p(\theta^*\mid\mathcal{D})/p(\theta\mid\mathcal{D}))$. Optimal $\sigma_q$ targets acceptance rate $\approx0.234$ in high dimensions (Roberts & Rosenthal, 2001). Too small $\sigma_q$: high acceptance but slow mixing; too large: most proposals rejected.
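A quick experiment makes this trade-off concrete (a sketch assuming NumPy; the standard-normal target and the three step sizes are illustrative choices). Note that in one dimension the optimal acceptance rate is closer to 0.44 than to 0.234, but the qualitative pattern is the same:

```python
import numpy as np

def rw_metropolis(log_target, theta0, sigma_q, n_steps, rng):
    """Random-walk Metropolis with Gaussian proposal N(theta, sigma_q^2)."""
    theta = theta0
    samples = np.empty(n_steps)
    n_accept = 0
    for t in range(n_steps):
        theta_star = theta + sigma_q * rng.standard_normal()
        # symmetric proposal: the acceptance ratio is just a posterior ratio
        if np.log(rng.uniform()) < log_target(theta_star) - log_target(theta):
            theta, n_accept = theta_star, n_accept + 1
        samples[t] = theta
    return samples, n_accept / n_steps

log_target = lambda th: -0.5 * th**2  # standard normal, up to a constant

rates = {}
for sigma_q in (0.1, 2.4, 25.0):
    _, rate = rw_metropolis(log_target, 0.0, sigma_q, 10_000,
                            np.random.default_rng(0))
    rates[sigma_q] = rate
    print(f"sigma_q={sigma_q:5.1f}  acceptance rate={rate:.2f}")
```

The tiny step size accepts almost everything but crawls through the posterior; the huge one proposes mostly into the tails and stalls; the intermediate one mixes best.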

Example 2: Convergence Diagnostics

Run $m$ chains from dispersed starting points. Gelman-Rubin statistic: $$\hat{R}=\sqrt{\frac{\hat{V}}{W}}$$ where $W$ is the average within-chain variance and $\hat{V}=\frac{T-1}{T}W+\frac{B}{T}$ pools it with the between-chain variance $B$. $\hat{R}\approx1$ indicates convergence; $\hat{R}>1.1$ signals non-convergence. Effective sample size $n_{\text{eff}}=T/(1+2\sum_{k=1}^\infty\rho_k)$ adjusts for the lag-$k$ autocorrelation $\rho_k$.
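Both diagnostics are short to implement directly from these formulas. The sketch below assumes NumPy; `gelman_rubin` and `effective_sample_size` are illustrative names, the infinite autocorrelation sum is truncated at the first negative estimate, and the AR(1) chains merely stand in for autocorrelated MCMC output:

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat for an (m, T) array of chains."""
    _, T = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = T * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    V_hat = (T - 1) / T * W + B / T              # pooled variance estimate
    return np.sqrt(V_hat / W)

def effective_sample_size(x, max_lag=200):
    """ESS = T / (1 + 2 * sum of autocorrelations), truncating the sum
    at the first negative estimated autocorrelation."""
    x = x - x.mean()
    T = len(x)
    rho = np.array([x[:-k] @ x[k:] for k in range(1, max_lag)]) / (x @ x)
    neg = np.where(rho < 0)[0]
    if neg.size:
        rho = rho[:neg[0]]
    return T / (1 + 2 * rho.sum())

# Four AR(1) chains sharing one stationary distribution (well-mixed case).
rng = np.random.default_rng(0)
phi, T = 0.9, 10_000
chains = np.empty((4, T))
for i in range(4):
    x = 0.0
    for t in range(T):
        x = phi * x + rng.standard_normal()
        chains[i, t] = x

r_hat = gelman_rubin(chains)
ess = effective_sample_size(chains[0])
print(f"R-hat = {r_hat:.3f}, ESS = {ess:.0f} of T = {T}")
```

For these chains $\hat R$ is close to 1 while the ESS is roughly $T(1-\phi)/(1+\phi)\approx T/19$, showing how strongly autocorrelation shrinks the effective information content.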

Practice

  1. Prove that a symmetric proposal $q(\theta^*\mid\theta)=q(\theta\mid\theta^*)$ simplifies MH to the original Metropolis algorithm.
  2. Explain why posterior samples from MCMC are correlated, and how this affects Monte Carlo error estimates.
Answer Key

1. The MH acceptance ratio is $\alpha=\min\!\left(1,\frac{p(\theta^*\mid\mathcal{D})\,q(\theta\mid\theta^*)}{p(\theta\mid\mathcal{D})\,q(\theta^*\mid\theta)}\right)$. With symmetric $q$, the $q$-ratio equals 1, so $\alpha=\min(1,p(\theta^*\mid\mathcal{D})/p(\theta\mid\mathcal{D}))$, which is exactly the original Metropolis acceptance rule.

2. MCMC samples are correlated because each state depends on the previous one (the Markov property). The effective sample size is $\text{ESS}=T/(1+2\sum_{k=1}^\infty\rho_k)$, where $\rho_k$ is the lag-$k$ autocorrelation, so the Monte Carlo standard error is $\sigma/\sqrt{\text{ESS}}$ rather than $\sigma/\sqrt{T}$ — larger than for i.i.d. samples.
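This error inflation is easy to demonstrate on an AR(1) chain, for which the autocorrelation $\rho_k=\phi^k$ is known exactly and $1+2\sum_{k\geq1}\rho_k=(1+\phi)/(1-\phi)$ (a sketch assuming NumPy; the AR(1) chain stands in for slowly mixing MCMC output):

```python
import numpy as np

rng = np.random.default_rng(0)
phi, T = 0.95, 50_000

# AR(1) chain: stationary, but strongly autocorrelated like slow MCMC output.
x = np.empty(T)
v = 0.0
for t in range(T):
    v = phi * v + rng.standard_normal()
    x[t] = v

naive_se = x.std(ddof=1) / np.sqrt(T)          # pretends samples are i.i.d.
ess = T * (1 - phi) / (1 + phi)                # exact ESS for an AR(1) chain
corrected_se = x.std(ddof=1) / np.sqrt(ess)    # honest Monte Carlo error
print(f"naive SE = {naive_se:.4f}, corrected SE = {corrected_se:.4f}")
```

With $\phi=0.95$ the corrected standard error is $\sqrt{39}\approx6.2$ times the naive one: quoting the naive value would drastically overstate the precision of posterior estimates.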