Source Coding & Data Compression
Shannon's source coding theorem establishes the fundamental limit of lossless data compression: on average, no code can compress below the entropy, and rates arbitrarily close to the entropy are achievable. Prefix-free codes are characterized by Kraft's inequality (which McMillan extended to all uniquely decodable codes), and Huffman coding constructs an optimal prefix-free code whose expected length is within one bit of the entropy. Arithmetic coding and Lempel-Ziv algorithms approach the entropy rate even for sources with memory.
Kraft's Inequality & Optimal Codes
A prefix-free code with codeword lengths $\ell_1,\ldots,\ell_n$ over a $D$-ary alphabet exists iff $$\sum_{i=1}^n D^{-\ell_i}\leq 1.$$ Conversely, any lengths satisfying this inequality can be realized by a prefix-free code, and McMillan showed that the same inequality holds for every uniquely decodable code. The expected code length is $L=\sum_i P(x_i)\ell_i$. The optimal code minimizes $L$; choosing the Shannon lengths $\ell_i=\lceil-\log_D P(x_i)\rceil$ gives $H_D(X)\leq L<H_D(X)+1$.
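These checks can be carried out numerically. A minimal Python sketch, using an illustrative dyadic distribution (the Kraft sum and Shannon lengths here are computed directly from the definitions above):

```python
import math

def kraft_sum(lengths, D=2):
    """Sum of D^{-l_i}; a prefix-free code with these lengths exists iff <= 1."""
    return sum(D ** -l for l in lengths)

def shannon_lengths(probs, D=2):
    """Shannon code lengths l_i = ceil(-log_D p_i); these always satisfy Kraft."""
    return [math.ceil(-math.log(p, D)) for p in probs]

probs = [0.5, 0.25, 0.125, 0.125]          # illustrative dyadic source
lengths = shannon_lengths(probs)            # [1, 2, 3, 3]
assert kraft_sum(lengths) <= 1              # a valid prefix-free code exists
L = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * math.log2(p) for p in probs)
assert H <= L < H + 1                       # within one bit of entropy
```

Because this distribution is dyadic, the Shannon lengths are exact ($L=H$); for non-dyadic sources the ceiling introduces up to one bit of overhead.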
Shannon's Source Coding Theorem
For an i.i.d. source $X^n\sim P^n$ with entropy $H(X)$, there exists a lossless compression scheme with rate $R$ bits/symbol for any $R>H(X)$, while any scheme with rate $R<H(X)$ has error probability bounded away from zero as $n\to\infty$.
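The theorem can be illustrated numerically: for $n$ i.i.d. Bernoulli($p$) bits, the smallest set of sequences capturing, say, 99% of the probability can be indexed with far fewer than $n$ bits. A Python sketch (the parameters $n=100$, $p=0.1$, and the 99% threshold are chosen for illustration):

```python
import math

# For Bernoulli(p), every sequence with k ones has probability p^k (1-p)^(n-k),
# so the highest-probability sequences are those with the fewest ones (p < 0.5).
p, n = 0.1, 100
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # H(0.1) ≈ 0.469 bits

by_prob = sorted(
    ((math.comb(n, k), p**k * (1 - p)**(n - k)) for k in range(n + 1)),
    key=lambda t: -t[1],                  # greedily take most probable first
)
count = mass = 0
for cnt, prob in by_prob:
    count += cnt                          # sequences included so far
    mass += cnt * prob                    # probability covered so far
    if mass >= 0.99:
        break
rate = math.log2(count) / n               # bits/symbol to index the 99% set
# rate is well below the naive 1 bit/symbol and tends to H(p) as n grows
```

At $n=100$ the rate is still noticeably above $H(0.1)\approx 0.469$ (convergence is slow), but already well below 1 bit/symbol, in line with the theorem.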
Example 1
Construct the Huffman code for source with $P=(0.5,0.25,0.125,0.125)$ and compute its expected length.
Solution: Huffman algorithm: repeatedly merge the two least probable nodes. Merge $0.125+0.125=0.25$, leaving $\{0.5,0.25,0.25\}$. Merge $0.25+0.25=0.5$, leaving $\{0.5,0.5\}$. Merge $0.5+0.5=1.0$ to form the root. Assign codes: Symbol 1 ($p=0.5$): code 0 (length 1); Symbol 2 ($p=0.25$): code 10 (length 2); Symbols 3, 4 ($p=0.125$): codes 110, 111 (length 3). $L=0.5(1)+0.25(2)+0.125(3)+0.125(3)=0.5+0.5+0.375+0.375=1.75$ bits. Entropy: $H=-0.5\log_2 0.5-0.25\log_2 0.25-2(0.125)\log_2 0.125=0.5+0.5+0.75=1.75$ bits. Because every probability is a power of $2$, the Huffman code achieves the entropy exactly.
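The merge procedure above can be sketched in Python with a min-heap. This sketch computes only the codeword lengths (recovering the actual codewords would require keeping the tree), and ties are broken arbitrarily by insertion order:

```python
import heapq
import itertools

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code (lengths only; ties arbitrary)."""
    counter = itertools.count()            # tie-breaker so heapq never compares lists
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)    # merge the two least probable subtrees;
        p2, _, s2 = heapq.heappop(heap)    # every symbol inside them gains one bit
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(counter), s1 + s2))
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_lengths(probs)           # [1, 2, 3, 3], matching the example
L = sum(p * l for p, l in zip(probs, lengths))   # 1.75 bits, equal to H here
```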
Example 2
Show that no uniquely decodable binary code for source $P=(0.6,0.3,0.1)$ can achieve $L=H(P)$ exactly.
Solution: $H(P)=-0.6\log_2 0.6-0.3\log_2 0.3-0.1\log_2 0.1=0.6(0.737)+0.3(1.737)+0.1(3.322)=0.442+0.521+0.332=1.295$ bits. Shannon lengths: $\ell_1=\lceil-\log_2 0.6\rceil=1$, $\ell_2=\lceil-\log_2 0.3\rceil=2$, $\ell_3=\lceil-\log_2 0.1\rceil=4$. Check Kraft: $2^{-1}+2^{-2}+2^{-4}=0.5+0.25+0.0625=0.8125\leq 1$. $L=0.6(1)+0.3(2)+0.1(4)=0.6+0.6+0.4=1.6$ bits; even the optimal Huffman code, with lengths $(1,2,2)$, gives $L=1.4>H=1.295$. Equality $L=H$ would require $\ell_i=-\log_2 P(x_i)$ for every $i$, i.e. every probability a power of $2$; none is here, so $L>H$ for any uniquely decodable code.
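A numeric check of this example in Python (the Huffman lengths $(1,2,2)$ follow from one merge of the two smallest probabilities, $0.3+0.1=0.4$, then a final merge with $0.6$; they are hard-coded below rather than recomputed):

```python
import math

probs = [0.6, 0.3, 0.1]
H = -sum(p * math.log2(p) for p in probs)            # ≈ 1.295 bits

shannon = [math.ceil(-math.log2(p)) for p in probs]  # [1, 2, 4]
L_shannon = sum(p * l for p, l in zip(probs, shannon))   # 1.6 bits

huffman = [1, 2, 2]                                  # optimal lengths (by hand)
L_huffman = sum(p * l for p, l in zip(probs, huffman))   # 1.4 bits

# No probability is a power of 2, so even the optimal code stays above entropy.
assert H < L_huffman < L_shannon
```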
Practice
- Prove that $H_D(X)\leq L^*<H_D(X)+1$, where $L^*$ is the expected length of the optimal prefix-free code.
- Describe the Lempel-Ziv algorithm. State (without proof) that it achieves the entropy rate of any stationary ergodic source.
- For $n$ i.i.d. Bernoulli($p$) bits, compute the size of the $\epsilon$-typical set and the compression ratio achieved.
- Show that arithmetic coding can achieve expected length $L$ satisfying $H(X^n)/n\leq L/n<H(X^n)/n+2/n$.
Answer Key
1. $L^*\ge H_D(X)$ (noiseless coding theorem lower bound), and the Shannon-Fano or Huffman code achieves $L^*<H_D(X)+1$, giving $H_D(X)\leq L^*<H_D(X)+1$.
2. LZ77/LZ78: parse the source into phrases, each extending a previous phrase by one symbol. The dictionary grows adaptively. For any stationary ergodic source, $\lim_{n\to\infty}L_n/n=h$ (the entropy rate) almost surely. The algorithm is universal: it achieves the entropy rate without knowing the source distribution.
3. The typical set $A_\epsilon^{(n)}$ has $|A_\epsilon^{(n)}|\approx2^{nH(p)}$ elements. For $p=0.5$: $|A|\approx2^n$ (no compression). For $p=0.1$: $|A|\approx2^{0.469n}$. Compression ratio $\approx H(p)$ bits per source bit.
4. Arithmetic coding encodes the entire sequence as a single sub-interval of $[0,1)$ of width $\prod_i P(x_i)$. The codeword length is $\lceil-\log_2\prod_i P(x_i)\rceil+1$ bits, so the expected length satisfies $L\le H(X^n)+2$, i.e. $L/n\le H(X^n)/n+2/n$, approaching the entropy rate as $n\to\infty$.
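The interval picture in answer 4 can be made concrete with a toy encoder sketch in Python. This is an illustration only (two symbols with assumed probabilities 0.8/0.2, encoder side only, no renormalization or decoder, and floating-point arithmetic would not be used in a real implementation):

```python
import math

probs = {"a": 0.8, "b": 0.2}       # illustrative two-symbol source
cum = {"a": 0.0, "b": 0.8}         # cumulative distribution: start of each slot

def encode_interval(msg):
    """Return (low, width) of the sub-interval of [0,1) representing msg."""
    low, width = 0.0, 1.0
    for s in msg:
        low += width * cum[s]       # zoom into the sub-interval for symbol s
        width *= probs[s]           # interval shrinks by P(s) at each step
    return low, width

msg = "aababa"
low, width = encode_interval(msg)   # width == product of symbol probabilities
bits = math.ceil(-math.log2(width)) + 1   # bits to name a point in the interval
```

The final `width` equals $\prod_i P(x_i)$, so `bits` is within two bits of $-\log_2 P(x^n)$, matching the bound in answer 4.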