
Dimensionality Reduction: PCA, SVD & Manifold Learning

42 min · Mathematics of Machine Learning


High-dimensional data often lies near low-dimensional manifolds. PCA finds the optimal linear subspace via the SVD, while manifold learning captures nonlinear structure. These tools reveal intrinsic geometry and enable compression.

PCA & Optimal Subspace

Given centered \(X\in\mathbb{R}^{n\times d}\) with SVD \(X=U\Sigma V^\top\), the rank-\(k\) approximation \(X_k=U_k\Sigma_k V_k^\top\) minimizes \(\|X-\hat{X}\|_F\) over rank-\(k\) matrices (Eckart-Young). Principal components: columns of \(V_k\). Scores: \(Z=XV_k\). Variance explained: \(\sum_{i=1}^k\sigma_i^2/\sum_{i=1}^d\sigma_i^2\).
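The pipeline above — center, take the SVD, project onto the top-\(k\) right singular vectors — is a few lines of NumPy. A minimal sketch on synthetic data (the dimensions and noise level are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with approximate rank-2 structure plus small noise
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5)) \
    + 0.01 * rng.normal(size=(100, 5))
X = X - X.mean(axis=0)            # center: PCA assumes mean-zero columns

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Vk = Vt[:k].T                     # principal components: columns of V_k
Z = X @ Vk                        # scores Z = X V_k  (equals U_k Sigma_k)
Xk = (U[:, :k] * s[:k]) @ Vt[:k]  # rank-k approximation U_k Sigma_k V_k^T
var_explained = (s[:k] ** 2).sum() / (s ** 2).sum()
```

Since \(Z = XV_k = U\Sigma V^\top V_k = U_k\Sigma_k\), the scores can also be read directly off the SVD factors.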

Eckart-Young Theorem

\(\min_{\mathrm{rank}(\hat{X})\le k}\|X-\hat{X}\|_F=\sqrt{\sum_{i=k+1}^r\sigma_i^2}\), achieved by \(X_k\). Also: \(\|X-X_k\|_2=\sigma_{k+1}\). Sample covariance: \(\hat{\Sigma}=X^\top X/n=V(\Sigma^2/n)V^\top\), so the PCA directions are the eigenvectors of \(\hat{\Sigma}\), ordered by eigenvalue.
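Both Eckart-Young identities are easy to verify numerically: truncate the SVD at rank \(k\) and compare the Frobenius and spectral errors against the tail singular values. A quick check on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
Xk = (U[:, :k] * s[:k]) @ Vt[:k]          # best rank-k approximation

fro_err = np.linalg.norm(X - Xk, 'fro')   # should equal sqrt(sum of tail sigma_i^2)
spec_err = np.linalg.norm(X - Xk, 2)      # should equal sigma_{k+1} (s[k], 0-indexed)
```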

Example 1: Scree Plot

\(X\in\mathbb{R}^{100\times 50}\) has \(\sigma_1=10,\sigma_2=8\), rest \(\le 2\). Select \(k\).

Solution: Top-2 variance fraction, taking the remaining 48 singular values at their bound of 2: \((100+64)/(100+64+48\times 4)\approx 46\%\); if they are smaller, the fraction is higher. The sharp elbow at \(k=2\) indicates genuine rank-2 structure. Retain \(k=2\) for a parsimonious model.
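The arithmetic in the solution, spelled out (the remaining 48 singular values are set to their bound of 2, which gives the worst case for the top-2 fraction):

```python
# Squared singular values: sigma_1^2 = 100, sigma_2^2 = 64, rest at the bound 2^2 = 4
sig2 = [10 ** 2, 8 ** 2] + [2 ** 2] * 48
frac = sum(sig2[:2]) / sum(sig2)   # 164 / 356, about 0.46
```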

Example 2: Kernel PCA

How does Kernel PCA extend linear PCA?

Solution: Center the kernel matrix: \(\tilde{K}=K-\mathbf{1}_nK-K\mathbf{1}_n+\mathbf{1}_nK\mathbf{1}_n\), where \(\mathbf{1}_n\) is the \(n\times n\) matrix with every entry \(1/n\). Eigendecompose: \(\tilde{K}=U\Lambda U^\top\). Projections: \(z_i^{(l)}=\sqrt{\lambda_l}\,U_{il}\). This performs PCA implicitly in the feature space of \(\phi\) without ever computing \(\phi(x)\), and so captures nonlinear manifold structure.
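The centering-then-eigendecomposition recipe can be sketched directly in NumPy. The RBF kernel and its bandwidth below are illustrative choices, not part of the example above:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
n = len(X)

# RBF kernel K_ij = exp(-gamma * ||x_i - x_j||^2); gamma = 0.5 is illustrative
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)

one = np.full((n, n), 1.0 / n)               # 1_n: every entry equals 1/n
Kc = K - one @ K - K @ one + one @ K @ one   # double-centered kernel

lam, U = np.linalg.eigh(Kc)                  # eigh returns ascending order
lam, U = lam[::-1], U[:, ::-1]               # sort descending
Z = U[:, :2] * np.sqrt(np.maximum(lam[:2], 0.0))  # z_i^(l) = sqrt(lambda_l) U_il
```

Double centering makes every row and column of \(\tilde{K}\) sum to zero, mirroring mean-subtraction in feature space.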

Practice

  1. Prove the first PC maximizes projected variance and equals the leading eigenvector of \(X^\top X\).
  2. Show the effect of failing to center data before PCA.
  3. Compare t-SNE and PCA: when does t-SNE reveal cluster structure PCA misses?
  4. Derive the relationship between SVD of \(X\) and eigendecompositions of \(X^\top X\) and \(XX^\top\).
Answer Key

1. Maximize $\text{Var}(w^TX) = w^T\Sigma w$ subject to $\|w\|=1$ where $\Sigma = \frac{1}{n}X^TX$ (centered data). Lagrangian: $w^T\Sigma w - \lambda(w^Tw-1)$. Gradient $= 0$: $\Sigma w = \lambda w$. So $w$ is an eigenvector, variance $= w^T\Sigma w = \lambda$. Maximum at the largest eigenvalue $\lambda_1$, with $w = $ corresponding eigenvector. ✓
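The Rayleigh-quotient argument can be checked numerically: no unit vector achieves a larger projected variance than the leading eigenvector of the sample covariance. A small sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
X = X - X.mean(0)                 # center
S = X.T @ X / len(X)              # sample covariance

lam, V = np.linalg.eigh(S)        # ascending eigenvalues
w1 = V[:, -1]                     # leading eigenvector = first PC direction
top = w1 @ S @ w1                 # its projected variance: lambda_1

# Rayleigh quotients of random unit vectors never exceed lambda_1
quotients = []
for _ in range(200):
    w = rng.normal(size=4)
    w /= np.linalg.norm(w)
    quotients.append(w @ S @ w)
```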

2. If data is not centered ($\bar{x} \neq 0$): $X^TX$ captures both variance and the squared mean. PCA on uncentered data finds the direction of maximum second moment, not maximum variance. The first PC may point toward the data's mean rather than the direction of greatest spread. This distorts the principal components and the explained variance ratios. Always center: $\tilde{X} = X - \mathbf{1}\bar{x}^T$.
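The failure mode is easy to demonstrate: give the data a large mean offset along one axis and its spread along another. Uncentered "PCA" picks the mean direction; proper PCA picks the spread. The offsets and scales below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
# Spread along axis 0 (std 3.0 vs 0.3), large mean offset along axis 1
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.3]) + np.array([0.0, 50.0])

# "PCA" without centering: top right-singular vector of raw X
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)

# Proper PCA: center first
Xc = X - X.mean(0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Vt_raw[0] points toward the mean (axis 1); Vt[0] toward the spread (axis 0)
```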

3. PCA preserves global linear structure (maximizes variance). For data with nonlinear manifold structure (e.g., clusters on a curved surface, nested rings), PCA projects onto a subspace that may mix clusters. t-SNE uses a nonlinear embedding that preserves local neighborhood structure (KL divergence between pairwise probability distributions in high-D and low-D). t-SNE reveals cluster separation that PCA misses. Trade-off: t-SNE is non-convex, non-parametric, and doesn't preserve global distances.

4. SVD: $X = U\Sigma V^T$ ($X$ is $n\times d$). $X^TX = V\Sigma^2V^T$ → eigenvalues of $X^TX$ are $\sigma_i^2$, eigenvectors are columns of $V$ (right singular vectors = principal component directions). $XX^T = U\Sigma^2U^T$ → eigenvalues are the same $\sigma_i^2$, eigenvectors are columns of $U$ (left singular vectors = principal component scores, normalized). The $i$-th PC score: $Xv_i = \sigma_i u_i$.
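All three identities in the derivation — shared eigenvalues \(\sigma_i^2\), eigenvectors \(V\) and \(U\), and \(Xv_i=\sigma_i u_i\) — can be confirmed on a small random matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(7, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

lam_gram, _ = np.linalg.eigh(X.T @ X)   # 4 eigenvalues, ascending
lam_outer, _ = np.linalg.eigh(X @ X.T)  # 7 eigenvalues; 3 are ~0

# Nonzero eigenvalues of both products equal the squared singular values
pc_scores_ok = all(np.allclose(X @ Vt[i], s[i] * U[:, i]) for i in range(4))
```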