Algorithmic Information Theory & Applications

Kolmogorov complexity $K(x)$ — the length of the shortest program that outputs $x$ — provides a universal, distribution-free measure of the information content of individual objects. Unlike Shannon entropy (which applies to distributions), Kolmogorov complexity applies to single strings. While uncomputable, it yields powerful theoretical results: most strings are incompressible, and the incompressibility method provides elegant combinatorial proofs.

Kolmogorov Complexity

Fix a universal Turing machine $U$. The Kolmogorov complexity of a string $x\in\{0,1\}^*$ is $$K(x)=\min\{|p|:U(p)=x\},$$ the length of the shortest program $p$ that outputs $x$. Properties: $K(x)\leq|x|+O(1)$ (a copy program always works), and most strings are incompressible: there are $|\{x:|x|=n\}|=2^n$ strings of length $n$ but only $\sum_{l=0}^{n-c}2^l=2^{n-c+1}-1<2^{n-c+1}$ programs of length at most $n-c$, so fewer than a $2^{-c+1}$ fraction of length-$n$ strings satisfy $K(x)\leq n-c$.
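$K(x)$ itself is uncomputable, but any lossless compressor gives a computable upper bound on description length, which makes the incompressibility claim easy to see empirically. A minimal sketch using Python's zlib (the choice of compressor is illustrative, not part of the theory):

```python
import os
import zlib

# zlib output length is a computable upper bound on description length;
# K(x) itself is uncomputable. Incompressible (random) data should not
# shrink, while highly regular data should shrink dramatically.
random_bytes = os.urandom(10_000)
regular_bytes = b"ab" * 5_000  # same length, very low complexity

c_random = len(zlib.compress(random_bytes, 9))
c_regular = len(zlib.compress(regular_bytes, 9))

print(c_random, c_regular)  # random stays near 10000 bytes; regular collapses
```

The random input resists compression (zlib even adds a few bytes of framing overhead), while the periodic string compresses to a tiny fraction of its length, as the counting argument predicts for typical versus atypical strings.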

Relationship to Shannon Entropy

For an i.i.d. source $X^n\sim P^n$: $\mathbb{E}[K(X^n)]\approx n H(X)$ for large $n$ (the Kolmogorov-Shannon connection). More precisely: $H(X)\leq\frac{1}{n}\mathbb{E}[K(X^n)]\leq H(X)+O(\log n/n)$. The universal probability $m(x)=2^{-K(x)}$ (the Solomonoff-Levin measure, defined with the prefix version of $K$ so that the weights sum to at most 1) dominates every computable probability measure: for any computable $P$ there is a constant $c>0$, depending only on $P$, with $m(x)\geq cP(x)$ for all $x$. Minimum description length (MDL) model selection uses $K$ as the idealized model cost.
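The $\frac{1}{n}\mathbb{E}[K(X^n)]\to H(X)$ connection can be illustrated with a real compressor standing in for $K$. The compressed rate is only an upper bound, and zlib is far from optimal for this source, so expect a gap above $H(X)$ rather than equality. A sketch, assuming a Bernoulli(0.1) source over a two-letter byte alphabet:

```python
import math
import random
import zlib

random.seed(0)

# i.i.d. source: symbol 'b' with probability p, else 'a', one byte each.
p = 0.1
n = 200_000
H_X = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # Shannon entropy, bits/symbol

sample = bytes(ord('b') if random.random() < p else ord('a') for _ in range(n))
rate = 8 * len(zlib.compress(sample, 9)) / n  # compressed bits per symbol

# The compressed rate upper-bounds (1/n)K(sample) up to O(1/n) terms and,
# for large n, cannot fall below H(X); zlib's gap reflects its suboptimality.
print(f"H(X) = {H_X:.3f} bits/symbol, zlib rate = {rate:.3f} bits/symbol")
```

Here $H(X)\approx 0.469$ bits/symbol; zlib lands above that but well below the naive 8 bits/symbol, bracketing the rate exactly as the inequality chain predicts.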

Example 1

Show that a random string $x$ of length $n$ satisfies $K(x)\geq n-c$ with high probability for small constant $c$.

Solution: Counting: the number of binary strings with $K(x)<n-c$ is at most the number of programs of length less than $n-c$, namely $\sum_{l=0}^{n-c-1}2^l=2^{n-c}-1<2^{n-c}$. So at most a $2^{-c}$ fraction of the $2^n$ strings of length $n$ are that compressible, and a uniformly random $x$ satisfies $K(x)\geq n-c$ with probability at least $1-2^{-c}$.
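The counting step is pure arithmetic and easy to sanity-check; a sketch with illustrative values $n=30$, $c=8$:

```python
# Counting check: programs of length < n - c number 2**(n-c) - 1, so at
# most that many strings x of length n can have K(x) < n - c.
n, c = 30, 8  # illustrative values
num_strings = 2 ** n
num_short_programs = sum(2 ** l for l in range(n - c))  # lengths 0 .. n-c-1

assert num_short_programs == 2 ** (n - c) - 1  # geometric series
fraction_compressible = num_short_programs / num_strings
print(fraction_compressible, 2.0 ** -c)  # fraction stays below 2^-c
```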

Example 2

Use the incompressibility method to prove there are at least $n!$ permutations of $\{1,\ldots,n\}$.

Solution: The standard proof gives $|S_n|=n!$ by induction. Incompressibility viewpoint: a permutation $\sigma$ of $[n]$ can be encoded by its rank among all permutations in lexicographic order (an integer from $0$ to $|S_n|-1$), and the worst-case length of any such encoding is at least $\log_2|S_n|$ bits. Conversely, any $\sigma$ can be described by listing its values $(\sigma(1),\ldots,\sigma(n))$: naively this takes $n\lceil\log_2 n\rceil$ bits (each value fits in $\lceil\log_2 n\rceil$ bits), but position $k$ has only $n-k+1$ remaining choices and so needs just $\lceil\log_2(n-k+1)\rceil$ bits, for a total of (ignoring ceilings) $\sum_{k=1}^n\log_2 k=\log_2 n!$. By Stirling: $\log_2 n!\approx n\log_2 n-n\log_2 e$, consistent with $n!\approx(n/e)^n\sqrt{2\pi n}$.
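The two encodings in the solution can be checked concretely. The sketch below ranks permutations via the Lehmer code (a standard bijection onto $\{0,\ldots,n!-1\}$; the helper name is ours) and compares the three bit costs for $n=5$:

```python
import math
from itertools import permutations

def lehmer_rank(sigma):
    """Lexicographic rank of sigma: digit k counts later entries below sigma[k]."""
    n = len(sigma)
    rank = 0
    for k in range(n):
        smaller_later = sum(1 for j in range(k + 1, n) if sigma[j] < sigma[k])
        rank += smaller_later * math.factorial(n - 1 - k)
    return rank

n = 5
ranks = sorted(lehmer_rank(s) for s in permutations(range(n)))
assert ranks == list(range(math.factorial(n)))  # bijection onto {0, ..., n!-1}

# Bit costs from the solution: fixed-width vs. position-aware vs. optimal rank.
naive_bits = n * math.ceil(math.log2(n))
positional_bits = sum(math.ceil(math.log2(n - k + 1)) for k in range(1, n + 1))
optimal_bits = math.ceil(math.log2(math.factorial(n)))
print(naive_bits, positional_bits, optimal_bits)  # 15, 8, 7
```

For $n=5$ the naive listing costs 15 bits, the position-aware listing 8 bits, and the rank encoding $\lceil\log_2 120\rceil=7$ bits, matching the chain of inequalities in the solution.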

Practice

  1. Prove that Kolmogorov complexity is uncomputable (the uncomputability of the halting problem implies $K$ is not computable).
  2. State the MDL principle for model selection. How does it compare to BIC (Bayesian Information Criterion)?
  3. Show that $K(x,y)\leq K(x)+K(y|x)+O(\log\min(K(x),K(y)))$.
  4. Describe how information-theoretic methods (min-entropy, Rényi entropy) are used in cryptographic key extraction from noisy sources.
Answer Key

1. Suppose $K$ is computable by a program $P$. Then construct a program $Q$ that, on input $n$, enumerates strings $x$ in order and outputs the first $x$ with $K(x)>n$. Together with its input, $Q$ has length $O(\log n)$, yet it outputs a string $x$ with $K(x)>n$, so $n<K(x)\leq O(\log n)$, a contradiction for large $n$ (a Berry-paradox argument). Hence $K$ is uncomputable.

2. MDL selects the model $M$ minimizing $L(D|M)+L(M)$ (description length of data given model + model complexity). BIC approximates $-2\log P(D|\hat{\theta})+k\log n$, which equals MDL up to $O(1)$ for exponential families. MDL is more general (applies beyond parametric models) but harder to compute.
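A two-part MDL comparison can be made concrete for a Bernoulli model. The sketch below uses the standard $\frac{1}{2}\log_2 n$ parameter cost (the BIC penalty expressed in bits) with illustrative coin-flip data; the scenario is ours, not a prescribed procedure:

```python
import math

def nll_bits(k_heads, n, p):
    """Code length of the data in bits under a Bernoulli(p) model."""
    return -(k_heads * math.log2(p) + (n - k_heads) * math.log2(1 - p))

n = 100
# Model 0: fair coin, no parameters. Model 1: Bernoulli(p) with p fitted,
# charged the standard (1/2) log2 n bits of parameter cost (BIC penalty).
param_cost = 0.5 * math.log2(n)

# Clearly biased data (70 heads): the fitted model wins despite its cost.
mdl_fair_70 = nll_bits(70, n, 0.5)
mdl_fit_70 = nll_bits(70, n, 0.70) + param_cost

# Near-fair data (52 heads): the parameter cost outweighs the tiny fit gain.
mdl_fair_52 = nll_bits(52, n, 0.5)
mdl_fit_52 = nll_bits(52, n, 0.52) + param_cost

print(mdl_fair_70, mdl_fit_70, mdl_fair_52, mdl_fit_52)
```

The fair-coin model always costs exactly $n$ bits; MDL prefers the fitted model only when the likelihood gain exceeds the $\frac{1}{2}\log_2 n$ parameter cost, exactly the trade-off BIC approximates.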

3. $K(x,y)\le K(x)+K(y|x^*)+O(\log K(x))$, where $x^*$ denotes a shortest program for $x$. The $O(\log)$ term encodes the length of the program for $x$ so that the concatenated program can be parsed. This is the chain rule for Kolmogorov complexity, analogous to $H(X,Y)=H(X)+H(Y|X)$ in Shannon theory.

4. Cryptographic key extraction: a noisy physical source produces bits with entropy $H_\infty(X)$ (min-entropy). A randomness extractor converts $n$ bits with min-entropy $k$ into $m\approx k-2\log(1/\epsilon)$ nearly-uniform bits. Rényi entropy $H_\alpha$ (especially $\alpha=\infty$) gives tighter bounds than Shannon entropy for worst-case security guarantees.
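The entropy ordering $H_\infty\leq H_2\leq H$ behind these worst-case bounds is easy to compute for a concrete distribution; a sketch with an illustrative 8-outcome biased source (the distribution and parameters are ours):

```python
import math

def shannon(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def renyi(p, alpha):
    """Renyi entropy H_alpha (alpha != 1); alpha -> infinity gives min-entropy."""
    return math.log2(sum(q ** alpha for q in p)) / (1 - alpha)

def min_entropy(p):
    return -math.log2(max(p))

# Illustrative biased source: one heavy outcome, seven equally light ones.
p = [0.5] + [0.5 / 7] * 7
H1, H2, Hinf = shannon(p), renyi(p, 2), min_entropy(p)
print(f"H = {H1:.3f}, H_2 = {H2:.3f}, H_inf = {Hinf:.3f}")

# Extractor-style count: n independent samples carry min-entropy n * H_inf,
# from which roughly n*H_inf - 2*log2(1/eps) nearly-uniform bits can be
# extracted (here eps = 2**-64; the numbers are illustrative).
n_samples = 1000
m = n_samples * Hinf - 2 * 64
```

Shannon entropy overstates the extractable randomness here ($H\approx 2.4$ bits versus $H_\infty=1$ bit per sample), which is why extractor security is stated in terms of min-entropy rather than $H$.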