Entropy & Measures of Information
Shannon entropy quantifies the average uncertainty in a random variable. For a discrete random variable $X$ with distribution $P$, the entropy $H(X)=-\sum_x P(x)\log_2 P(x)$ (in bits) measures how many binary questions on average are needed to determine the outcome. Shannon's 1948 paper established entropy as the fundamental measure of information, unifying communication, compression, and statistical inference.
Shannon Entropy
For a discrete random variable $X$ taking values in $\mathcal{X}$ with PMF $P$, the entropy is $$H(X)=-\sum_{x\in\mathcal{X}}P(x)\log P(x)=\mathbb{E}[-\log P(X)]$$ (using $0\log 0=0$ by continuity). Entropy is measured in bits ($\log$ base 2) or nats (base $e$). Properties: $H(X)\geq 0$; $H(X)=0$ iff $X$ is deterministic; $H(X)\leq\log|\mathcal{X}|$ with equality iff $X$ is uniform; $H(X_1,\ldots,X_n)\leq\sum H(X_i)$ with equality iff the $X_i$ are independent.
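These properties are easy to check numerically. A minimal sketch in Python (the helper name `entropy` is ours, not from any library):

```python
import math

def entropy(pmf):
    """Shannon entropy in bits: H = -sum p log2 p, with 0 log 0 := 0."""
    return sum(-p * math.log2(p) for p in pmf if p > 0)

# H(X) <= log2|X|, with equality for the uniform distribution
print(entropy([0.25] * 4))       # 2.0 = log2(4)
print(entropy([1.0, 0.0, 0.0]))  # 0.0 -- deterministic variable

# Subadditivity: H(X, Y) = H(X) + H(Y) when X and Y are independent
px, py = [0.5, 0.5], [0.75, 0.25]
joint = [a * b for a in px for b in py]
assert abs(entropy(joint) - (entropy(px) + entropy(py))) < 1e-12
```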
Maximum Entropy Principle
Among all distributions on $\mathcal{X}$ with $|\mathcal{X}|=n$, entropy is maximized by the uniform distribution: $H(X)\leq\log n$ with equality iff $P(x)=1/n$ for all $x$. More generally, given moment constraints $\mathbb{E}[f_k(X)]=\mu_k$, the maximum-entropy distribution is the exponential family $P(x)\propto\exp(\sum_k\lambda_k f_k(x))$. This principle underlies statistical mechanics (Boltzmann distribution) and Bayesian priors.
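Both claims can be illustrated numerically (a sketch; `entropy` is our own helper): random distributions never exceed the uniform bound $\log_2 n$, and a mean-preserving perturbation of the exponential-family distribution can only lower entropy.

```python
import math
import random

def entropy(pmf):
    return sum(-p * math.log2(p) for p in pmf if p > 0)

n = 6
random.seed(0)
# No distribution on n outcomes exceeds log2(n) bits
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    total = sum(w)
    assert entropy([x / total for x in w]) <= math.log2(n) + 1e-12

# With a mean constraint E[X] = mu, the maximizer is exponential-family:
# p_i proportional to exp(-lam * i) (Boltzmann form)
lam = 0.5
z = sum(math.exp(-lam * i) for i in range(1, n + 1))
p = [math.exp(-lam * i) / z for i in range(1, n + 1)]

# A mean-preserving perturbation (1 + 3 = 2 * 2 keeps the mean) lowers entropy
q = p[:]
d = 1e-3
q[0] += d; q[1] -= 2 * d; q[2] += d
assert entropy(q) < entropy(p)
```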
Example 1
Compute $H(X)$ for $X\sim\text{Bernoulli}(p)$. Find the value of $p$ that maximizes entropy.
Solution: $H(X)=-p\log_2 p-(1-p)\log_2(1-p)=:h(p)$. This is the binary entropy function. Setting $h'(p)=-\log_2 p+\log_2(1-p)=0$ gives $p=1/2$, $H=1$ bit (maximum). At $p=0$ or $p=1$, $H=0$. For $p=1/4$: $H=-\frac{1}{4}\log_2\frac{1}{4}-\frac{3}{4}\log_2\frac{3}{4}=\frac{1}{2}+\frac{3}{4}(\log_2 4-\log_2 3)=\frac{1}{2}+\frac{3}{4}(2-1.585)=\frac{1}{2}+0.311=0.811$ bits.
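The binary entropy function is easy to evaluate directly (a minimal sketch; the name `h` matches the notation above):

```python
import math

def h(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0 by continuity."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(h(0.5))   # 1.0 bit -- the maximum, attained at p = 1/2
print(h(0.25))  # ~0.811 bits, matching the computation above
```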
Example 2
Verify $H(X,Y)=H(X)+H(Y|X)$ for $X\sim\text{Bernoulli}(1/2)$ and $Y=X\oplus Z$ where $Z\sim\text{Bernoulli}(1/4)$, $Z\perp X$ (binary symmetric channel with crossover $1/4$).
Solution: $H(X)=1$ bit (uniform). $H(Y|X=0)=h(1/4)=0.811$ bits; $H(Y|X=1)=h(1/4)=0.811$ bits; so $H(Y|X)=0.811$ bits. Chain rule: $H(X,Y)=H(X)+H(Y|X)=1+0.811=1.811$ bits. Verify directly: $P(0,0)=P(1,1)=3/8$, $P(0,1)=P(1,0)=1/8$. $H(X,Y)=-2\cdot\frac{3}{8}\log_2\frac{3}{8}-2\cdot\frac{1}{8}\log_2\frac{1}{8}=\frac{3}{4}\cdot 1.415+\frac{1}{4}\cdot 3=1.061+0.75=1.811$ bits. ✓
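The chain-rule check can be reproduced in a few lines (a sketch; the joint PMF is the one derived above, and `H` is our own helper):

```python
import math

def H(pmf):
    """Entropy in bits of a PMF given as a list of probabilities."""
    return sum(-p * math.log2(p) for p in pmf if p > 0)

# Joint PMF of (X, Y): X ~ Bern(1/2), Y = X xor Z, Z ~ Bern(1/4)
joint = [3/8, 1/8, 1/8, 3/8]   # P(0,0), P(0,1), P(1,0), P(1,1)

hx = H([1/2, 1/2])             # 1 bit
h_y_given_x = H([3/4, 1/4])    # h(1/4), same for X = 0 and X = 1
assert abs(H(joint) - (hx + h_y_given_x)) < 1e-12
print(f"H(X,Y) = {H(joint):.3f} bits")  # 1.811
```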
Practice
- Prove the chain rule: $H(X_1,\ldots,X_n)=\sum_{i=1}^n H(X_i|X_1,\ldots,X_{i-1})$.
- Show $H(X|Y)\leq H(X)$ with equality iff $X\perp Y$ (conditioning reduces entropy).
- Compute the entropy of a fair six-sided die. How many binary questions suffice to determine the outcome with certainty?
- Find the maximum entropy distribution on $\{1,2,\ldots,n\}$ subject to fixed mean $\mu$. Show it is geometric.
Answer Key
1. By definition $H(X_1,\ldots,X_n)=-\sum p(x_1,\ldots,x_n)\log p(x_1,\ldots,x_n)$. Factor: $p(x_1,\ldots,x_n)=p(x_1)p(x_2|x_1)\cdots p(x_n|x_1,\ldots,x_{n-1})$. Take $-\log$ and sum: each factor contributes $H(X_i|X_1,\ldots,X_{i-1})$.
2. $H(X|Y)=H(X,Y)-H(Y)$ and $H(X,Y)\le H(X)+H(Y)$ with equality iff $X\perp Y$. So $H(X|Y)\le H(X)$ with equality iff $X\perp Y$.
3. $H=\log_2 6\approx2.585$ bits. Since $\lceil\log_2 6\rceil=3$, at most 3 binary questions suffice (e.g., binary search: "Is it $\le3$?", then narrow down).
4. Maximize $H=-\sum p_i\log p_i$ subject to $\sum p_i=1$ and $\sum ip_i=\mu$. Lagrange multipliers give $p_i\propto e^{-\lambda i}$, a (truncated) geometric distribution on $\{1,\ldots,n\}$; as $n\to\infty$ this is the geometric $p_i=(1-e^{-\lambda})e^{-\lambda(i-1)}$ for $i=1,2,\ldots$, with $\lambda$ set by the mean constraint.
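The mean constraint only pins down $\lambda$ implicitly, so in practice it is solved numerically. A sketch for the finite support $\{1,\ldots,n\}$ (the function name is ours), using bisection on the strictly decreasing map $\lambda\mapsto\mathbb{E}[X]$:

```python
import math

def maxent_fixed_mean(n, mu, lo=-50.0, hi=50.0):
    """Max-entropy PMF on {1,...,n} with mean mu: truncated geometric
    p_i proportional to exp(-lam * i), with lam found by bisection
    (the mean is strictly decreasing in lam)."""
    def mean(lam):
        w = [math.exp(-lam * i) for i in range(1, n + 1)]
        return sum(i * wi for i, wi in zip(range(1, n + 1), w)) / sum(w)

    for _ in range(200):
        mid = (lo + hi) / 2
        if mean(mid) > mu:
            lo = mid   # mean too large -> increase lam
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(-lam * i) for i in range(1, n + 1)]
    z = sum(w)
    return [wi / z for wi in w]

p = maxent_fixed_mean(6, 2.5)
assert abs(sum(i * q for i, q in zip(range(1, 7), p)) - 2.5) < 1e-9

# mu = (n + 1) / 2 forces lam = 0, i.e. the uniform distribution
u = maxent_fixed_mean(6, 3.5)
assert all(abs(q - 1/6) < 1e-9 for q in u)
```

When $\mu$ equals the midpoint of the support, the constraint is inactive and the answer collapses back to the uniform distribution, consistent with the maximum entropy principle above.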