
Rate-Distortion Theory & Lossy Compression


When lossless compression is insufficient or unnecessary, rate-distortion theory addresses the fundamental tradeoff between compression rate and distortion. For a source $X$ with distortion measure $d(x,\hat{x})$, the rate-distortion function $R(D)$ gives the minimum bits/sample needed to reconstruct $X$ within expected distortion $D$. This theory underpins audio (MP3), image (JPEG), and video (H.264) compression standards.

Rate-Distortion Function

Given source $X$ with distribution $P_X$ and distortion measure $d:\mathcal{X}\times\hat{\mathcal{X}}\to[0,\infty)$, the rate-distortion function is $$R(D)=\min_{P_{\hat{X}|X}:\mathbb{E}[d(X,\hat{X})]\leq D}I(X;\hat{X}).$$ Properties: $R(D)$ is convex and non-increasing in $D$; $R(0)=H(X)$ for a discrete source (when $d(x,\hat{x})=0$ only if $x=\hat{x}$); $R(D_{\max})=0$ where $D_{\max}=\min_{\hat{x}}\mathbb{E}[d(X,\hat{x})]$. The rate-distortion theorem: for any $R>R(D)$ there exist codes of rate $R$ achieving expected distortion arbitrarily close to $D$; conversely, no code with rate $R<R(D)$ can achieve distortion $D$.

Gaussian Rate-Distortion

For $X\sim\mathcal{N}(0,\sigma^2)$ with squared-error distortion $d(x,\hat{x})=(x-\hat{x})^2$, the rate-distortion function is $$R(D)=\frac{1}{2}\log_2\frac{\sigma^2}{D}, \quad 0\leq D\leq\sigma^2.$$ The optimal test channel is described backwards: $X=\hat{X}+Z$ with $\hat{X}\sim\mathcal{N}(0,\sigma^2-D)$ and $Z\sim\mathcal{N}(0,D)$ independent of $\hat{X}$, giving $\text{MSE}=D$. The Gaussian source is the hardest to compress at a given distortion level: for any source with variance $\sigma^2$, $R(D)\leq\frac{1}{2}\log_2(\sigma^2/D)$.
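The two directions of this formula can be evaluated directly. A small sketch (function names `gaussian_rd` and `gaussian_dr` are chosen here for illustration) computes $R(D)$ and its inverse, the distortion-rate function $D(R)=\sigma^2 2^{-2R}$:

```python
import math

def gaussian_rd(sigma2: float, D: float) -> float:
    """R(D) = 0.5 * log2(sigma2 / D) bits/sample for a N(0, sigma2) source
    under squared error; zero once D >= sigma2 (just output the mean)."""
    if D >= sigma2:
        return 0.0
    return 0.5 * math.log2(sigma2 / D)

def gaussian_dr(sigma2: float, R: float) -> float:
    """Distortion-rate function: minimum achievable MSE at R bits/sample."""
    return sigma2 * 2.0 ** (-2.0 * R)

print(gaussian_rd(1.0, 0.25))   # 1.0 bit/sample
print(gaussian_dr(1.0, 1.0))    # 0.25
```

Note the two functions are exact inverses: halving the distortion (in dB terms, every 6.02 dB of SNR) costs one extra bit per sample.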

Example 1

For $X\sim\mathcal{N}(0,1)$, find the rate (in bits) required to represent $X$ with MSE $\leq 0.01$.

Solution: $R(D)=\frac{1}{2}\log_2\frac{\sigma^2}{D}=\frac{1}{2}\log_2\frac{1}{0.01}=\frac{1}{2}\log_2 100=\frac{1}{2}(6.644)=3.32$ bits per sample. So approximately 3.32 bits/sample suffices for MSE $\leq 0.01$ on standard normal samples. Note that this rate is achievable only in the limit of long block lengths (vector quantization): a scalar quantizer with $2^{3.32}\approx 10$ levels operates at the same rate but, by the converse, cannot reach MSE $0.01$; in practice it lands noticeably above the bound.
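The gap between scalar quantization and the Shannon bound can be checked empirically. The following Monte Carlo sketch (written for this lesson, not part of any standard) trains a 10-level scalar quantizer with Lloyd's algorithm and measures its MSE, which stays above 0.01:

```python
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# 10 reconstruction levels, initialized uniformly over the bulk of N(0, 1).
levels = [-2.25 + 0.5 * i for i in range(10)]

for _ in range(25):  # Lloyd iterations: nearest-level partition, then centroid update
    buckets = [[] for _ in levels]
    for x in xs:
        j = min(range(10), key=lambda k: (x - levels[k]) ** 2)
        buckets[j].append(x)
    levels = [sum(b) / len(b) if b else l for b, l in zip(buckets, levels)]

mse = sum(min((x - l) ** 2 for l in levels) for x in xs) / len(xs)
print(f"10-level Lloyd-Max MSE ~ {mse:.4f}")  # above the Shannon bound of 0.01
```

The measured MSE comes out around 0.03, roughly 3x the rate-distortion limit at the same rate; closing the gap requires quantizing long blocks of samples jointly.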

Example 2

Compute the rate-distortion function for $X\sim\text{Bernoulli}(1/2)$ with Hamming distortion $d(x,\hat{x})=1_{x\neq\hat{x}}$.

Solution: $D_{\max}=1/2$: outputting a constant (say $\hat{x}=0$) already achieves expected Hamming distortion $1/2$. For $0\leq D\leq 1/2$: $R(D)=H(X)-h(D)=1-h(D)$, where $h(D)=-D\log_2 D-(1-D)\log_2(1-D)$ is the binary entropy function. At $D=0$: $R=1$ bit (lossless). At $D=1/4$: $R=1-h(1/4)=1-0.811=0.189$ bits. At $D=1/2$: $R=0$. Note that the formula coincides with the capacity of a binary symmetric channel, $C_{\text{BSC}(D)}=1-h(D)$: the optimal test channel for a Bernoulli$(1/2)$ source at distortion $D$ is a BSC$(D)$, an instance of source-channel duality.
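A two-line evaluation of this curve (illustrative helpers, names chosen here):

```python
import math

def binary_entropy(p: float) -> float:
    """h(p) in bits; h(0) = h(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bernoulli_half_rd(D: float) -> float:
    """R(D) = 1 - h(D) for a Bernoulli(1/2) source under Hamming distortion."""
    if D >= 0.5:
        return 0.0
    return 1.0 - binary_entropy(D)

print(bernoulli_half_rd(0.25))  # ~0.1887 bits, matching the worked value above
```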

Practice

  1. Derive the water-filling solution for the vector Gaussian source with $n$ components and per-component distortions $D_i$.
  2. State the Blahut-Arimoto algorithm for computing $R(D)$ numerically and prove its convergence.
  3. Show that $R(D)$ is convex in $D$ using the convexity of mutual information in conditional distributions.
  4. Describe how JPEG compression approximates rate-distortion theoretic compression for images.
Answer Key

1. For a vector Gaussian source with covariance $\Lambda=\text{diag}(\sigma_1^2,\ldots,\sigma_n^2)$ and total distortion budget $D$: minimize $R=\sum_i\frac{1}{2}\log_2(\sigma_i^2/D_i)$ subject to $\sum_i D_i=D$. Solving the Lagrangian gives reverse water-filling: $D_i=\min(\sigma_i^2,\theta)$, with the water level $\theta$ chosen so that $\sum_i D_i=D$. Components with $\sigma_i^2\leq\theta$ are not coded at all (their distortion budget absorbs them entirely).
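Since $\sum_i\min(\sigma_i^2,\theta)$ is increasing in $\theta$, the water level can be found by bisection. A minimal sketch (the name `reverse_waterfill` is chosen here for illustration):

```python
import math

def reverse_waterfill(variances, D_total, tol=1e-10):
    """Reverse water-filling for independent Gaussian components:
    D_i = min(sigma_i^2, theta), with theta found by bisection so that
    sum(D_i) == D_total.  Returns (theta, per-component distortions, rate)."""
    lo, hi = 0.0, max(variances)
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if sum(min(v, theta) for v in variances) < D_total:
            lo = theta          # water level too low: distortion budget unused
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    Ds = [min(v, theta) for v in variances]
    # Components with sigma_i^2 <= theta contribute zero rate (not coded).
    rate = sum(0.5 * math.log2(v / d) for v, d in zip(variances, Ds) if v > d)
    return theta, Ds, rate

# Variances (4, 1) with total budget D = 2: water level ~1, so the weak
# component is not coded and the strong one gets R = 0.5*log2(4/1) = 1 bit.
theta, Ds, rate = reverse_waterfill([4.0, 1.0], 2.0)
print(theta, Ds, rate)
```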

2. Blahut-Arimoto for $R(D)$ keeps the source $p(x)$ fixed and alternates over the test channel and the reproduction marginal: (1) given the marginal $q(\hat{x})$, set $q(\hat{x}|x)\propto q(\hat{x})e^{-\beta d(x,\hat{x})}$; (2) update $q(\hat{x})=\sum_x p(x)q(\hat{x}|x)$. Each step minimizes a common convex objective, so the iterates decrease monotonically and converge to a point on the $R(D)$ curve, with $\beta$ parameterizing the curve's slope.
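The two alternating updates fit in a short routine. A sketch for finite alphabets (the function name and the choice of $\beta=\ln 3$ below are illustrative):

```python
import math

def blahut_arimoto_rd(p_x, dist, beta, n_iter=200):
    """Blahut-Arimoto sketch for R(D) over finite alphabets.
    p_x: source pmf; dist[x][xh]: distortion matrix; beta > 0 sets the
    slope of the R(D) curve.  Returns a (distortion, rate-in-bits) point."""
    n_x, n_xh = len(dist), len(dist[0])
    q = [1.0 / n_xh] * n_xh                       # reproduction marginal q(xh)
    for _ in range(n_iter):
        # Step 1: test channel q(xh|x) proportional to q(xh) * exp(-beta * d)
        cond = []
        for x in range(n_x):
            w = [q[xh] * math.exp(-beta * dist[x][xh]) for xh in range(n_xh)]
            s = sum(w)
            cond.append([wi / s for wi in w])
        # Step 2: marginal update q(xh) = sum_x p(x) q(xh|x)
        q = [sum(p_x[x] * cond[x][xh] for x in range(n_x)) for xh in range(n_xh)]
    D = sum(p_x[x] * cond[x][xh] * dist[x][xh]
            for x in range(n_x) for xh in range(n_xh))
    R = sum(p_x[x] * cond[x][xh] * math.log2(cond[x][xh] / q[xh])
            for x in range(n_x) for xh in range(n_xh) if cond[x][xh] > 0)
    return D, R

# Bernoulli(1/2) with Hamming distortion: beta = ln 3 lands at D = 1/4,
# where R should match 1 - h(1/4) ~ 0.1887 bits from Example 2.
D, R = blahut_arimoto_rd([0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]], math.log(3))
print(D, R)
```

Sweeping $\beta$ from large to small traces the whole $R(D)$ curve from the lossless end toward $D_{\max}$.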

3. $R(D)=\min_{p(\hat{x}|x):\mathbb{E}[d]\le D}I(X;\hat{X})$. Take test channels $p_1,p_2$ achieving $(R(D_1),D_1)$ and $(R(D_2),D_2)$. The mixture $\lambda p_1(\hat{x}|x)+(1-\lambda)p_2(\hat{x}|x)$ has expected distortion $\lambda D_1+(1-\lambda)D_2$, and since $I(X;\hat{X})$ is convex in $p(\hat{x}|x)$ for fixed $p(x)$, its mutual information is at most $\lambda R(D_1)+(1-\lambda)R(D_2)$. Hence $R(\lambda D_1+(1-\lambda)D_2)\leq\lambda R(D_1)+(1-\lambda)R(D_2)$.

4. JPEG: (1) blockwise DCT, (2) quantize the coefficients (the lossy step), (3) entropy-code the quantized values. The DCT concentrates energy into a few low-frequency coefficients, and the quantization tables use finer steps (more bits) for those perceptually important coefficients and coarser steps for high frequencies, a practical analog of reverse water-filling bit allocation.