Gibbs Sampling & Hamiltonian Monte Carlo
Gibbs sampling exploits conditional structure: in models where full conditional distributions $p(\theta_j\mid\theta_{-j},\mathcal{D})$ are tractable, updating each coordinate in turn is a special case of MH with acceptance probability exactly one. Hamiltonian Monte Carlo (HMC) uses gradient information to make large moves through parameter space, dramatically reducing autocorrelation and enabling efficient inference in high dimensions.
Gibbs Sampler
For $\theta=(\theta_1,\ldots,\theta_p)$, each iteration cycles through: $$\theta_j^{(t+1)}\sim p\!\left(\theta_j\,\Big|\,\theta_1^{(t+1)},\ldots,\theta_{j-1}^{(t+1)},\theta_{j+1}^{(t)},\ldots,\theta_p^{(t)},\mathcal{D}\right)$$ Gibbs is an MH sampler whose proposal is the full conditional itself, so the acceptance probability is always 1. It is exact and requires no tuning, but it mixes slowly when parameters are highly correlated.
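As a concrete sketch of the coordinate-wise updates, here is a Gibbs sampler for a normal model with unknown mean and variance, using conjugate normal and inverse-gamma full conditionals. The data, priors, and hyperparameter values are illustrative assumptions, not part of the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from N(2, 1.5^2); priors (assumed): mu ~ N(m0, s0sq), sigma^2 ~ Inv-Gamma(a0, b0)
x = rng.normal(2.0, 1.5, size=200)
n, xbar = len(x), x.mean()
m0, s0sq = 0.0, 100.0   # prior mean and variance for mu
a0, b0 = 2.0, 2.0       # inverse-gamma hyperparameters

mu, sig2 = 0.0, 1.0     # initial state
samples = []
for t in range(5000):
    # mu | sig2, x : conjugate normal-normal full conditional
    prec = n / sig2 + 1.0 / s0sq
    mean = (n * xbar / sig2 + m0 / s0sq) / prec
    mu = rng.normal(mean, np.sqrt(1.0 / prec))
    # sig2 | mu, x : conjugate inverse-gamma full conditional
    a = a0 + n / 2.0
    b = b0 + 0.5 * np.sum((x - mu) ** 2)
    sig2 = 1.0 / rng.gamma(a, 1.0 / b)   # Inv-Gamma(a, b) via reciprocal of Gamma(a, rate=b)
    samples.append((mu, sig2))

post = np.array(samples[1000:])          # discard burn-in
print(post[:, 0].mean(), np.sqrt(post[:, 1]).mean())
```

Every draw is accepted, as expected for Gibbs: each update samples exactly from its full conditional, so the MH ratio is identically 1.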
Hamiltonian Monte Carlo
Introduce auxiliary momentum $r\sim\mathcal{N}(0,M)$. Define the Hamiltonian $H(\theta,r)=U(\theta)+\frac{1}{2}r^TM^{-1}r$ where $U(\theta)=-\log p(\theta\mid\mathcal{D})$. Simulate Hamiltonian dynamics via leapfrog integration ($L$ steps, step size $\varepsilon$): $$r_{t+\varepsilon/2}=r_t-\frac{\varepsilon}{2}\nabla U(\theta_t), \quad \theta_{t+\varepsilon}=\theta_t+\varepsilon M^{-1}r_{t+\varepsilon/2}, \quad r_{t+\varepsilon}=r_{t+\varepsilon/2}-\frac{\varepsilon}{2}\nabla U(\theta_{t+\varepsilon})$$ The proposal $(\theta^*,r^*)$ is accepted with probability $\min(1,e^{-H(\theta^*,r^*)+H(\theta,r)})$. Because the leapfrog integrator nearly conserves $H$, the acceptance probability is close to 1 (and would be exactly 1 under exact integration).
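The leapfrog recursion and Metropolis correction above can be sketched as follows for an identity mass matrix $M=I$, with a standard 2-D Gaussian as an assumed illustrative target (so $U(\theta)=\tfrac12\|\theta\|^2$ and $\nabla U(\theta)=\theta$):

```python
import numpy as np

def leapfrog(theta, r, grad_U, eps, L):
    """L leapfrog steps: half-step in r, alternating full steps, final half-step in r."""
    theta, r = theta.copy(), r.copy()
    r -= 0.5 * eps * grad_U(theta)
    for i in range(L):
        theta += eps * r
        if i < L - 1:
            r -= eps * grad_U(theta)
    r -= 0.5 * eps * grad_U(theta)
    return theta, r

def hmc(U, grad_U, theta0, eps=0.2, L=10, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    theta, draws, accepts = theta0.copy(), [], 0
    for _ in range(iters):
        r = rng.standard_normal(theta.shape)               # resample momentum ~ N(0, I)
        theta_star, r_star = leapfrog(theta, r, grad_U, eps, L)
        # Metropolis correction for the discretization error in H
        dH = (U(theta_star) + 0.5 * r_star @ r_star) - (U(theta) + 0.5 * r @ r)
        if np.log(rng.random()) < -dH:
            theta, accepts = theta_star, accepts + 1
        draws.append(theta.copy())
    return np.array(draws), accepts / iters

draws, acc = hmc(U=lambda th: 0.5 * th @ th, grad_U=lambda th: th,
                 theta0=np.zeros(2))
print(acc, draws.mean(axis=0), draws.var(axis=0))
```

For this well-conditioned target the leapfrog error is tiny, so the observed acceptance rate is near 1, illustrating the claim above.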
Example 1: Gibbs for Gaussian Mixture
Two-component Gaussian mixture with latent indicators $z_i\in\{1,2\}$. Full conditionals: $p(z_i=k\mid\cdot)\propto\pi_k\mathcal{N}(x_i;\mu_k,\sigma_k^2)$ (categorical); $\mu_k\mid\cdot\sim\mathcal{N}$ (conjugate normal-normal). Cycling through the $z_i$ and $\mu_k$ updates implements stochastic analogues of the E and M steps of EM, while propagating posterior uncertainty instead of point estimates.
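A minimal sketch of this sampler, assuming known mixture weights $\pi=(0.5,0.5)$, unit component variances, and a $\mathcal{N}(0,\tau^2)$ prior on each $\mu_k$ (all illustrative simplifications):

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated data from 0.5*N(-2,1) + 0.5*N(3,1) (assumed ground truth)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])
n, pi, tau2 = len(x), np.array([0.5, 0.5]), 25.0

mu = np.array([-1.0, 1.0])                # initial component means
for t in range(500):
    # z_i | mu : categorical with probs proportional to pi_k * N(x_i; mu_k, 1)
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2 + np.log(pi)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = (rng.random(n) < p[:, 1]).astype(int)   # draw component 1 with prob p[:, 1]
    # mu_k | z, x : conjugate normal-normal update (unit likelihood variance)
    for k in (0, 1):
        nk = (z == k).sum()
        prec = nk + 1.0 / tau2
        mean = x[z == k].sum() / prec
        mu[k] = rng.normal(mean, np.sqrt(1.0 / prec))

print(sorted(mu))
```

Unlike EM, each sweep samples hard assignments $z_i$ and then samples $\mu_k$ from its full conditional, so the chain explores the posterior rather than converging to a single mode.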
Example 2: NUTS
The No-U-Turn Sampler (Hoffman & Gelman, 2014) automatically selects $L$ by building a binary tree of leapfrog steps until the trajectory begins to double back on itself (the no-U-turn criterion: $(\theta^+-\theta^-)^Tr<0$ at either trajectory endpoint). NUTS eliminates hand-tuning of the trajectory length, making HMC practical as a black-box method. Stan implements NUTS with dual averaging to adapt $\varepsilon$ during warm-up.
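The stopping check itself is simple even though the full tree-building recursion is not; a sketch of the criterion (function name and argument layout are my own, not Stan's API):

```python
import numpy as np

def u_turn(theta_minus, theta_plus, r_minus, r_plus):
    """NUTS stopping criterion: the trajectory is doubling back when the
    momentum at either endpoint points against the endpoint-to-endpoint
    displacement theta_plus - theta_minus."""
    d = theta_plus - theta_minus
    return np.dot(d, r_minus) < 0 or np.dot(d, r_plus) < 0
```

In NUTS this check is applied to every subtree as the trajectory doubles in length, and expansion stops at the first U-turn (or when the leapfrog error diverges).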
Practice
- Show that Gibbs sampling for a bivariate normal with correlation $\rho$ has autocorrelation $\rho^2$ at lag 1, explaining slow mixing when $|\rho|\approx1$.
- Explain the role of the leapfrog step-size $\varepsilon$ and number of steps $L$ in HMC efficiency, and the trade-off between them.
Answer Key
1. For a bivariate normal $(X,Y)$ with correlation $\rho$: each full Gibbs scan updates $X\mid Y$ then $Y\mid X$, and each conditional mean is linear in the conditioning variable with slope $\rho$. Composing the two updates gives $\mathbb{E}[X_{t+1}\mid X_t]=\rho^2 X_t$, so the lag-1 autocorrelation of each component is $\rho^2$. When $|\rho|\approx1$, mixing is very slow because successive samples barely move.
2. Step size $\varepsilon$: too large causes leapfrog instability and high rejection rates; too small wastes gradient evaluations. Number of steps $L$: too few gives proposals close to the current point (random-walk behavior); too many wastes computation or loops the trajectory back on itself. The trajectory length $\varepsilon L$ should be roughly the diameter of the typical set. The commonly cited optimal acceptance rate for HMC is $\approx0.65$.
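Answer 1 can be checked empirically by running the two-conditional Gibbs scan and measuring the lag-1 autocorrelation (the $\rho=0.9$ value is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.9
x = y = 0.0
xs = []
for t in range(20000):
    # One full Gibbs scan: X|Y ~ N(rho*y, 1-rho^2), then Y|X ~ N(rho*x, 1-rho^2)
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    xs.append(x)

xs = np.array(xs[1000:])                          # discard burn-in
lag1 = np.corrcoef(xs[:-1], xs[1:])[0, 1]
print(lag1, rho**2)                               # lag-1 autocorrelation vs rho^2
```

With $\rho=0.9$ the measured lag-1 autocorrelation comes out near $\rho^2=0.81$, matching the derivation.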