
Likelihood Functions & Posterior Derivation

35 min · Bayesian Inference & Statistical Modeling


The likelihood function $\mathcal{L}(\theta;\mathcal{D})=p(\mathcal{D}\mid\theta)$, viewed as a function of $\theta$ for fixed data $\mathcal{D}$, is the bridge between data and inference. By the likelihood principle, it encodes all the information the data contain about $\theta$. The posterior is proportional to the product of likelihood and prior, making proportionality arguments the workhorse of Bayesian computation.

Identifying kernel forms — the $\theta$-dependent factors — allows us to recognize posterior families without evaluating the normalizing constant.

Likelihood & Log-Likelihood

For i.i.d. data $\mathcal{D}=\{x_1,\ldots,x_n\}$: $$\mathcal{L}(\theta;\mathcal{D})=\prod_{i=1}^n p(x_i\mid\theta)$$ $$\ell(\theta)=\log\mathcal{L}(\theta)=\sum_{i=1}^n\log p(x_i\mid\theta)$$ The MLE solves $\nabla_\theta\ell(\theta)=0$. The Bayesian posterior is $p(\theta\mid\mathcal{D})\propto\mathcal{L}(\theta;\mathcal{D})\,p(\theta)$.
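A minimal numerical sketch of the i.i.d. log-likelihood, assuming an illustrative Normal$(\mu,1)$ model (the data values below are made up) in which solving $\nabla_\mu\ell(\mu)=0$ gives the sample mean:

```python
import numpy as np

# Illustrative i.i.d. data, assumed to follow a Normal(mu, 1) model
x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])

def log_likelihood(mu, x):
    # ell(mu) = sum_i log N(x_i | mu, 1)
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

# Setting grad ell(mu) = 0 gives the sample mean for this model
mle = x.mean()

# The log-likelihood is maximized at the MLE
assert log_likelihood(mle, x) >= log_likelihood(mle + 0.1, x)
```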

Bernstein-von Mises Theorem

Under regularity conditions, as $n\to\infty$, the posterior concentrates around the MLE $\hat\theta$ and converges in total variation to a normal distribution: $$p(\theta\mid\mathcal{D})\xrightarrow{TV}\mathcal{N}\!\left(\hat\theta,\;[n\,\mathcal{I}(\hat\theta)]^{-1}\right)$$ where $\mathcal{I}$ is the Fisher information. This justifies Laplace approximations and shows that Bayesian and frequentist inference agree asymptotically.
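The theorem can be illustrated numerically. The sketch below uses a Bernoulli model with a uniform prior (an illustrative choice, not from the text) and compares the exact Beta posterior to $\mathcal{N}(\hat\theta,\,[n\,\mathcal{I}(\hat\theta)]^{-1})$, using the sup-CDF (Kolmogorov) distance as a cheap stand-in for total variation:

```python
import numpy as np
from scipy import stats

# Bernoulli model: Fisher information I(theta) = 1 / (theta * (1 - theta))
def posterior_vs_normal(n, theta_true=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, theta_true, size=n)
    k = x.sum()
    theta_hat = k / n                        # MLE
    post = stats.beta(1 + k, 1 + n - k)      # exact posterior under Uniform(0,1) prior
    var = theta_hat * (1 - theta_hat) / n    # [n I(theta_hat)]^{-1}
    approx = stats.norm(theta_hat, np.sqrt(var))
    grid = np.linspace(0.01, 0.99, 500)
    # sup |F_post - F_approx| over the grid: Kolmogorov distance as a
    # rough proxy for total variation
    return np.max(np.abs(post.cdf(grid) - approx.cdf(grid)))

# The discrepancy shrinks as n grows, as Bernstein-von Mises predicts
assert posterior_vs_normal(5000) < posterior_vs_normal(50)
```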

Example 1: Exponential Likelihood

Data $x_i\overset{iid}{\sim}\text{Exp}(\lambda)$. Likelihood: $\mathcal{L}(\lambda)=\lambda^n e^{-\lambda n\bar x}$. With Gamma$(a,b)$ prior $p(\lambda)\propto\lambda^{a-1}e^{-b\lambda}$: $$p(\lambda\mid\mathcal{D})\propto\lambda^{n+a-1}e^{-(n\bar x+b)\lambda}\equiv\text{Gamma}(n+a,\;n\bar x+b)$$ Posterior mean $=(n+a)/(n\bar x+b)$; MLE $=1/\bar x$ recovered as $a,b\to0$.
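The conjugate update above can be checked with `scipy.stats` (the prior hyperparameters and simulated data below are illustrative):

```python
import numpy as np
from scipy import stats

# Gamma(a, b) prior on the Exponential rate lambda (hyperparameters illustrative)
a, b = 2.0, 1.0
rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 1.5, size=200)  # simulated data, true rate 1.5
n, xbar = len(x), x.mean()

# Conjugate update from the derivation: Gamma(n + a, rate = n*xbar + b)
post = stats.gamma(a=n + a, scale=1 / (n * xbar + b))

# Posterior mean (n + a) / (n*xbar + b); MLE 1/xbar is recovered as a, b -> 0
post_mean = (n + a) / (n * xbar + b)
assert np.isclose(post.mean(), post_mean)
```

With 200 observations the posterior mean already sits close to the MLE $1/\bar x$, since the prior pseudo-counts $a,b$ are dwarfed by $n$ and $n\bar x$.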

Example 2: Laplace Approximation

When conjugacy fails, approximate the posterior by expanding $\log p(\theta\mid\mathcal{D})$ to second order around its mode $\hat\theta_{MAP}$: $$p(\theta\mid\mathcal{D})\approx\mathcal{N}\!\left(\hat\theta_{MAP},\;\left[-\nabla^2\log p(\theta\mid\mathcal{D})\big|_{\hat\theta_{MAP}}\right]^{-1}\right)$$ The Hessian of the negative log-posterior gives the approximate posterior covariance.
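A one-dimensional sketch of this recipe, applied to an unnormalized Gamma log-posterior like the one in Example 1 (the shape and rate values below are illustrative): the mode is found numerically and the Hessian is taken by finite differences, then the resulting Laplace variance is compared to the exact Gamma variance.

```python
import numpy as np
from scipy import optimize

# Unnormalized Gamma(shape, rate) log-posterior for a rate parameter lam;
# shape/rate stand in for n + a and n*xbar + b (illustrative values)
shape, rate = 52.0, 34.0

def neg_log_post(lam):
    # negative unnormalized log density: -[(shape-1) log lam - rate * lam]
    return -((shape - 1) * np.log(lam) - rate * lam)

# MAP via numerical optimization (analytically (shape - 1) / rate)
res = optimize.minimize_scalar(neg_log_post, bounds=(1e-6, 10.0), method="bounded")
lam_map = res.x

# Finite-difference Hessian of the negative log-posterior at the mode;
# its inverse is the approximate posterior variance
h = 1e-5
hess = (neg_log_post(lam_map + h) - 2 * neg_log_post(lam_map)
        + neg_log_post(lam_map - h)) / h**2
laplace_var = 1 / hess

# Exact Gamma variance shape / rate**2, for comparison
exact_var = shape / rate**2
```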

Practice

  1. Derive the posterior for $\sigma^2$ in a Normal model with known $\mu$, using an Inverse-Gamma prior $\sigma^2\sim\text{IG}(a,b)$.
  2. Explain why the likelihood principle implies that stopping rules are irrelevant for Bayesian inference but not for frequentist $p$-values.
Answer Key

1. Likelihood: $\prod(2\pi\sigma^2)^{-1/2}\exp(-\sum(x_i-\mu)^2/(2\sigma^2))\propto(\sigma^2)^{-n/2}\exp(-S/(2\sigma^2))$ where $S=\sum(x_i-\mu)^2$. With prior $\text{IG}(a,b)\propto(\sigma^2)^{-a-1}e^{-b/\sigma^2}$, the posterior is $\text{IG}(a+n/2,\,b+S/2)$.
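This conjugate result admits a quick numerical check: the unnormalized log-posterior (log-likelihood plus log-prior) should differ from the $\text{IG}(a+n/2,\,b+S/2)$ log-density only by a constant (the normalizer). All numbers below are illustrative.

```python
import numpy as np
from scipy import stats

# Normal model with known mu; Inverse-Gamma(a, b) prior on sigma^2
mu, a, b = 0.0, 3.0, 2.0
rng = np.random.default_rng(2)
x = rng.normal(mu, 1.3, size=50)  # simulated data (true sigma illustrative)
n, S = len(x), np.sum((x - mu) ** 2)

grid = np.linspace(0.5, 5.0, 200)  # grid of sigma^2 values
# unnormalized log posterior: log-likelihood kernel + log IG(a, b) prior kernel
log_unnorm = (-(n / 2) * np.log(grid) - S / (2 * grid)
              - (a + 1) * np.log(grid) - b / grid)
# claimed posterior: IG(a + n/2, b + S/2)
log_post = stats.invgamma(a + n / 2, scale=b + S / 2).logpdf(grid)

# The two should differ by a constant across the whole grid
diff = log_unnorm - log_post
assert np.allclose(diff, diff[0])
```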

2. The likelihood principle states inference depends only on the observed data through the likelihood function. Since the stopping rule affects the sample space but not the likelihood, Bayesian posteriors are the same regardless of the stopping rule. Frequentist $p$-values depend on the sample space (all possible outcomes), so the stopping rule matters.