Bayes' Theorem & Probability Fundamentals

Bayesian Inference & Statistical Modeling · 35 min

Bayesian inference treats unknown parameters as random variables endowed with probability distributions, inverting the classical frequentist paradigm. Rather than asking what data would look like under a fixed parameter, we ask how our belief about the parameter should change after observing data. This inversion is accomplished via Bayes' theorem, one of the most consequential results in all of mathematics.

The key insight is that probability quantifies degree of belief, not just long-run frequency. This allows coherent reasoning under uncertainty even when repeated experiments are impossible.

Bayes' Theorem

For events $A$ and $B$ with $P(B)>0$: $$P(A\mid B)=\frac{P(B\mid A)\,P(A)}{P(B)}$$ In parameter estimation with data $\mathcal{D}$ and parameter $\theta$: $$\underbrace{p(\theta\mid\mathcal{D})}_{\text{posterior}}=\frac{\underbrace{p(\mathcal{D}\mid\theta)}_{\text{likelihood}}\;\underbrace{p(\theta)}_{\text{prior}}}{\underbrace{p(\mathcal{D})}_{\text{evidence}}}$$ where $p(\mathcal{D})=\int p(\mathcal{D}\mid\theta)\,p(\theta)\,d\theta$ is the marginal likelihood.
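As a concrete sketch of the posterior/likelihood/prior/evidence decomposition, the integral for $p(\mathcal{D})$ can be approximated by a sum over a grid of $\theta$ values. The numbers below (a uniform prior, 7 heads in 10 Bernoulli flips) are illustrative choices, not from the text above:

```python
import numpy as np

# Grid approximation of the posterior for a coin's bias theta,
# assuming a uniform prior and 7 heads in 10 i.i.d. flips (illustrative).
theta = np.linspace(0.001, 0.999, 999)      # grid of parameter values
prior = np.ones_like(theta) / len(theta)    # uniform prior p(theta)
likelihood = theta**7 * (1 - theta)**3      # p(D | theta): 7 heads, 3 tails
evidence = np.sum(likelihood * prior)       # p(D), the marginal likelihood
posterior = likelihood * prior / evidence   # Bayes' theorem

# The posterior is a valid distribution and peaks at the MLE, theta = 0.7.
assert np.isclose(posterior.sum(), 1.0)
```

Dividing by the evidence is what normalizes the posterior; in practice this integral is often intractable, which motivates the approximation methods covered later in this course.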

Law of Total Probability

If $\{B_i\}$ partitions the sample space, then for any event $A$: $$P(A)=\sum_i P(A\mid B_i)\,P(B_i)$$ This underpins computation of the evidence $p(\mathcal{D})$ and is essential for marginalizing over nuisance parameters.
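A minimal numeric illustration of the law, with a made-up three-way partition (the probabilities below are assumptions for the example):

```python
# Law of total probability over a 3-way partition {B_1, B_2, B_3}.
P_B = [0.5, 0.3, 0.2]            # P(B_i); a partition, so these sum to 1
P_A_given_B = [0.9, 0.4, 0.1]    # conditional probabilities P(A | B_i)

# P(A) = sum_i P(A | B_i) P(B_i)
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))
# 0.9*0.5 + 0.4*0.3 + 0.1*0.2 = 0.45 + 0.12 + 0.02 = 0.59
```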

Example 1: Medical Diagnostic Test

Disease prevalence $P(D)=0.01$. Test sensitivity $P(+\mid D)=0.95$, specificity $P(-\mid D^c)=0.90$. A patient tests positive. Find $P(D\mid +)$.
$P(+)=0.95(0.01)+0.10(0.99)=0.1085$.
$P(D\mid +)=\frac{0.95\times 0.01}{0.1085}\approx 0.0876$.
Despite the 95% sensitivity, the positive predictive value is only about 8.8%, because the low prevalence means false positives from the healthy majority swamp the true positives.
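The same calculation, using exactly the numbers from Example 1:

```python
# Example 1: medical diagnostic test.
prev = 0.01                                   # P(D), prevalence
sens = 0.95                                   # P(+ | D), sensitivity
spec = 0.90                                   # P(- | D^c), specificity

# Evidence via the law of total probability: P(+) = P(+|D)P(D) + P(+|D^c)P(D^c)
p_pos = sens * prev + (1 - spec) * (1 - prev)

# Positive predictive value via Bayes' theorem: P(D | +)
ppv = sens * prev / p_pos

print(round(p_pos, 4), round(ppv, 4))         # 0.1085 0.0876
```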

Example 2: Sequential Updating

Start with $P(\theta=0.7)=0.5$, $P(\theta=0.3)=0.5$. Observe one head from a coin flip. Update: $P(\theta=0.7\mid H)=\frac{0.7\times0.5}{0.7\times0.5+0.3\times0.5}=0.7$. Observe a second head: $P(\theta=0.7\mid HH)=\frac{0.7\times0.7}{0.7^2+0.3^2}\approx0.845$. Sequential Bayesian updating is equivalent to one-shot updating on the full dataset.
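A short sketch of Example 2's two-hypothesis update, confirming that updating head-by-head matches a one-shot update on both heads:

```python
# Two-hypothesis coin: theta in {0.7, 0.3}, equal prior probabilities.
def update(prior, likelihoods):
    """Bayes' rule for a discrete hypothesis space: normalize prior x likelihood."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)                      # evidence (normalizing constant)
    return [u / z for u in unnorm]

prior = [0.5, 0.5]                       # P(theta=0.7), P(theta=0.3)
lik_head = [0.7, 0.3]                    # P(H | theta) under each hypothesis

post1 = update(prior, lik_head)          # after one head: [0.7, 0.3]
post2 = update(post1, lik_head)          # after a second head: ~[0.845, 0.155]

# One-shot update on the joint likelihood of HH gives the same posterior.
post_joint = update(prior, [0.7**2, 0.3**2])
```

Yesterday's posterior serving as today's prior is exactly what `update(post1, lik_head)` expresses.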

Practice

  1. Derive the odds form of Bayes' theorem: $\frac{P(H\mid D)}{P(H^c\mid D)}=\frac{P(D\mid H)}{P(D\mid H^c)}\cdot\frac{P(H)}{P(H^c)}$.
  2. Show that Bayesian updating is order-independent: updating on $(x_1, x_2)$ jointly equals updating on $x_1$ then $x_2$.
Answer Key

1. Divide Bayes' theorem $P(H|D)=P(D|H)P(H)/P(D)$ by $P(H^c|D)=P(D|H^c)P(H^c)/P(D)$. The $P(D)$ cancels, yielding the odds form.

2. Joint likelihood $P(x_1,x_2|\theta)=P(x_1|\theta)P(x_2|\theta)$ (i.i.d.). Updating on $(x_1,x_2)$: posterior $\propto P(x_1|\theta)P(x_2|\theta)\pi(\theta)$. Updating sequentially: first get $\pi_1(\theta)\propto P(x_1|\theta)\pi(\theta)$, then $\pi_2(\theta)\propto P(x_2|\theta)\pi_1(\theta)$. Both give the same result.
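The order-independence argument in Answer 2 can also be checked numerically. This sketch uses an assumed setup: a grid over $\theta$, a uniform prior, and two Bernoulli observations $x_1 = 1$ (head) and $x_2 = 0$ (tail):

```python
import numpy as np

# Numeric check of order-independence (Practice 2) on a theta grid.
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)   # uniform prior

def update(p, lik):
    """Multiply by the likelihood and renormalize."""
    p = p * lik
    return p / p.sum()

seq = update(update(prior, theta), 1 - theta)    # x1 = head, then x2 = tail
rev = update(update(prior, 1 - theta), theta)    # x2 first, then x1
joint = update(prior, theta * (1 - theta))       # both observations at once

assert np.allclose(seq, joint) and np.allclose(rev, joint)
```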