15-05 — Numerical Linear Algebra for ML
Phase: Numerical Methods for ML | Subject: 15-05 Prerequisites: 9-1 through 9-10 (linear algebra series), 15-02-condition-stability.md Next subject: 15-06-mixed-precision-training.md
Learning Objectives
By the end of this subject, you will be able to:
- Choose the right matrix factorization for ML tasks (SVD, QR, Cholesky, LU)
- Apply SVD for dimensionality reduction, pseudoinverse, and low-rank approximation
- Explain why iterative methods (CG, Lanczos) beat direct methods for large sparse systems
- Compute the SVD-based pseudoinverse and connect it to ridge regression
- Use randomized SVD for approximate low-rank factorization of large matrices
Core Content
The ML Matrix Factorization Toolkit
| Factorization | Form | Cost | ML Use Case |
|---|---|---|---|
| LU | $A = LU$ | $O(n^3)$ | Solving $A\mathbf{x} = \mathbf{b}$ for dense square $A$ |
| Cholesky | $A = LL^T$ ($A \succ 0$) | $O(n^3/3)$ | Faster than LU for PSD matrices, Gaussian processes |
| QR | $A = QR$, $Q^T Q = I$ | $O(mn^2)$ | Least squares, orthogonalization |
| SVD | $A = U\Sigma V^T$ | $O(mn^2)$ | PCA, pseudoinverse, low-rank approximation, matrix completion |
| Eigenvalue | $A = V\Lambda V^{-1}$ | $O(n^3)$ | Spectral clustering, PCA (via covariance) |
SVD: The Swiss Army Knife
For $A \in \mathbb{R}^{m \times n}$:
$$A = U\Sigma V^T = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$
- $U \in \mathbb{R}^{m \times m}$: left singular vectors (orthonormal columns)
- $\Sigma \in \mathbb{R}^{m \times n}$: diagonal of singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$
- $V \in \mathbb{R}^{n \times n}$: right singular vectors (orthonormal columns)
- $r = \operatorname{rank}(A)$
⚠️ CRITICAL — SVD reveals everything about a matrix: Rank = number of nonzero $\sigma_i$. Condition number = $\sigma_{\max}/\sigma_{\min}$. Null space = right singular vectors with $\sigma_i = 0$. Range = left singular vectors with $\sigma_i > 0$.
Key SVD Applications for ML
1. PCA via SVD: For centered data matrix $X \in \mathbb{R}^{N \times d}$: $X = U\Sigma V^T$. Principal components = columns of $V$. Explained variance = $\sigma_i^2 / \sum_j \sigma_j^2$. No need to form the $d \times d$ covariance matrix — SVD directly gives PCs.
2. Pseudoinverse: $$A^+ = V\Sigma^+ U^T$$ where $\Sigma^+$ inverts nonzero singular values and transposes. For $\sigma_i > \tau$ (threshold): $(\Sigma^+){ii} = 1/\sigma_i$. For $\sigma_i \leq \tau$: $(\Sigma^+){ii} = 0$.
$\mathbf{x} = A^+ \mathbf{b}$ gives the minimum-norm least-squares solution to $A\mathbf{x} \approx \mathbf{b}$.
3. Low-Rank Approximation (Eckart-Young): The best rank-$k$ approximation to $A$ (in Frobenius norm) is: $$A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$
Keeps the $k$ largest singular values and their vectors. Error: $|A - A_k|F = \sqrt{\sum{i=k+1}^r \sigma_i^2}$.
This is the math behind compressed models: keep only the top singular components.
4. Ridge Regression via SVD: If $X = U\Sigma V^T$, the ridge solution $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$ becomes: $$\hat{\beta} = V \operatorname{diag}\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right) U^T \mathbf{y}$$
The filter factor $f_i = \sigma_i^2/(\sigma_i^2 + \lambda)$ shows how ridge shrinks small singular value directions — exactly what we want for ill-conditioned $X^T X$.
Iterative Methods for Large Systems
For large sparse $A$ (e.g., graph Laplacian, PageRank), $O(n^3)$ direct methods are infeasible. Iterative methods apply $A$ repeatedly to vectors.
Conjugate Gradient (CG): For $A \succ 0$, solves $A\mathbf{x} = \mathbf{b}$ in at most $n$ iterations (theoretically) or much fewer with good conditioning. Each iteration costs one matrix-vector product — $O(\text{nnz})$ for sparse $A$.
Key idea: CG constructs $\mathbf{x}_k$ in the Krylov subspace $\operatorname{span}{\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \ldots, A^{k-1}\mathbf{b}}$ that minimizes $|\mathbf{x}_k - \mathbf{x}^*|_A$.
Lanczos: Computes the top $k$ eigenvalues/eigenvectors of a large sparse symmetric matrix in $O(k \cdot \text{nnz})$ — crucial for spectral methods on large graphs.
Randomized SVD
For $A \in \mathbb{R}^{m \times n}$ where computing the full SVD is too expensive:
- Draw random matrix $\Omega \in \mathbb{R}^{n \times (k+p)}$ (Gaussian or structured)
- Form $Y = A\Omega$ (captures range of $A$)
- QR-factorize $Y = QR$
- Form $B = Q^T A \in \mathbb{R}^{(k+p) \times n}$ (small!)
- SVD of $B = \tilde{U}\Sigma V^T$
- Approximate SVD of $A \approx (Q\tilde{U})\Sigma V^T$
Cost: $O(mn(k+p))$ vs $O(mn^2)$ for full SVD. For $k=100$, $n=10^6$, this is ~10,000× faster.
Key Terms
- Cholesky
- Eigenvalue
Worked Examples
Example 1: PCA via SVD
Data: $X = \begin{pmatrix} 1 & 2 \ 2 & 4 \ 3 & 6 \end{pmatrix}$ (3 points in 2D). Center and compute PCA.
Solution: Mean: $(2, 4)$. Center: $\tilde{X} = \begin{pmatrix} -1 & -2 \ 0 & 0 \ 1 & 2 \end{pmatrix}$.
$\tilde{X} = U\Sigma V^T = \begin{pmatrix} -0.447 & \cdots \ 0 & \cdots \ 0.447 & \cdots \end{pmatrix} \begin{pmatrix} 3.162 & 0 \ 0 & 0 \end{pmatrix} \begin{pmatrix} -0.447 & -0.894 \ -0.894 & 0.447 \end{pmatrix}$
All data lies on a line ($\sigma_2 = 0$). First PC: $\mathbf{v}_1 = (-0.447, -0.894)^T =$ direction $(1, 2)$. 100% of variance explained by 1 component.
Click for answer
$\sigma_1 = \sqrt{10} \approx 3.162$, $\sigma_2 = 0$. Data is perfectly collinear (all points lie on $y=2x$). One PC captures all variance.Example 2: Ridge Regression Filter Factors
For design matrix $X$ with singular values $\sigma = [10, 5, 1, 0.1]$, compute ridge filter factors for $\lambda = 1$ and $\lambda = 0.01$.
Solution: $f_i = \sigma_i^2 / (\sigma_i^2 + \lambda)$.
| $\sigma_i$ | $\lambda=1$ | $\lambda=0.01$ |
|---|---|---|
| 10 | 100/101 ≈ 0.990 | 100/100.01 ≈ 1.000 |
| 5 | 25/26 ≈ 0.962 | 25/25.01 ≈ 1.000 |
| 1 | 1/2 = 0.500 | 1/1.01 ≈ 0.990 |
| 0.1 | 0.01/1.01 ≈ 0.010 | 0.01/0.02 = 0.500 |
Large singular values pass through almost unattenuated. Small singular values are heavily shrunk — this is ridge regression's regularization mechanism.
Click for answer
$\lambda=1$: near-complete attenuation of the $\sigma=0.1$ direction (factor 0.01). $\lambda=0.01$: mild shrinkage (factor 0.5). Ridge selectively shrinks low-variance (likely noise) directions.Example 3: Low-Rank Approximation
Approximate $A = \begin{pmatrix} 3 & 1 & 0 \ 1 & 3 & 1 \ 0 & 1 & 3 \end{pmatrix}$ with a rank-2 matrix.
Solution: SVD: $\sigma_1 \approx 4.414$, $\sigma_2 \approx 2.000$, $\sigma_3 \approx 0.586$. $A_2 = \sigma_1 \mathbf{u}_1\mathbf{v}_1^T + \sigma_2 \mathbf{u}_2\mathbf{v}_2^T$ (keeping top 2 components).
$|A - A_2|_F = \sigma_3 \approx 0.586$. The rank-2 approximation captures $(4.414^2 + 2^2)/(4.414^2 + 2^2 + 0.586^2) \approx 98.4\%$ of the Frobenius norm energy.
Click for answer
Rank-2 approximation captures ~98.4% of the energy. Error = $\sigma_3 \approx 0.586$. The third singular component contributes little.Quiz
Q1: What does the concept of Cholesky primarily refer to in this subject?
A) A computational error related to Cholesky B) The definition and application of Cholesky C) A historical anecdote about Cholesky D) A visual representation of Cholesky
Correct: B)
- If you chose A: This is incorrect. Cholesky is defined as: the definition and application of cholesky. The other options describe different aspects that are not the primary focus.
- If you chose B: Cholesky is defined as: the definition and application of cholesky. The other options describe different aspects that are not the primary focus. Correct!
- If you chose C: This is incorrect. Cholesky is defined as: the definition and application of cholesky. The other options describe different aspects that are not the primary focus.
- If you chose D: This is incorrect. Cholesky is defined as: the definition and application of cholesky. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) The inverse operation of the formula in question B) A = LU C) An unrelated formula from a different topic D) A simplified version of A = LU...
Correct: B)
- If you chose A: This is incorrect. The formula A = LU is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: The formula A = LU is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose C: This is incorrect. The formula A = LU is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: This is incorrect. The formula A = LU is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of Eigenvalue?
A) It replaces all other methods in this domain B) It is used to eigenvalue in mathematical analysis C) It is used only in advanced research contexts D) It is primarily a historical notation system
Correct: B)
- If you chose A: This is incorrect. Eigenvalue serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: Eigenvalue serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose C: This is incorrect. Eigenvalue serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Eigenvalue serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about The Ml Matrix Factorization Toolkit is TRUE?
A) The Ml Matrix Factorization Toolkit is an advanced topic beyond this subject's scope B) The Ml Matrix Factorization Toolkit is not related to this subject C) The Ml Matrix Factorization Toolkit is mentioned only as a historical footnote D) The Ml Matrix Factorization Toolkit is a fundamental concept covered in this subject
Correct: D)
- If you chose A: This is incorrect. The Ml Matrix Factorization Toolkit is a fundamental concept covered in this subject. This subject covers The Ml Matrix Factorization Toolkit as part of its core content.
- If you chose B: This is incorrect. The Ml Matrix Factorization Toolkit is a fundamental concept covered in this subject. This subject covers The Ml Matrix Factorization Toolkit as part of its core content.
- If you chose C: This is incorrect. The Ml Matrix Factorization Toolkit is a fundamental concept covered in this subject. This subject covers The Ml Matrix Factorization Toolkit as part of its core content.
- If you chose D: The Ml Matrix Factorization Toolkit is a fundamental concept covered in this subject. This subject covers The Ml Matrix Factorization Toolkit as part of its core content. Correct!
Q5: Based on the worked examples in this subject, what is the correct result?
A) An unrelated numerical value B) A different result from a common mistake C) The inverse of the correct answer D) Ridge Regression Filter Factors
Correct: D)
- If you chose A: This is incorrect. The worked examples show that the result is Ridge Regression Filter Factors. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is Ridge Regression Filter Factors. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is Ridge Regression Filter Factors. The other options represent common errors.
- If you chose D: The worked examples show that the result is Ridge Regression Filter Factors. The other options represent common errors. Correct!
Q6: How are The Ml Matrix Factorization Toolkit and Svd: The Swiss Army Knife related?
A) The Ml Matrix Factorization Toolkit is the inverse of Svd: The Swiss Army Knife B) The Ml Matrix Factorization Toolkit is a special case of Svd: The Swiss Army Knife C) The Ml Matrix Factorization Toolkit and Svd: The Swiss Army Knife are completely unrelated topics D) The Ml Matrix Factorization Toolkit and Svd: The Swiss Army Knife are closely related concepts
Correct: D)
- If you chose A: This is incorrect. Both The Ml Matrix Factorization Toolkit and Svd: The Swiss Army Knife are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both The Ml Matrix Factorization Toolkit and Svd: The Swiss Army Knife are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both The Ml Matrix Factorization Toolkit and Svd: The Swiss Army Knife are covered in this subject as interconnected topics.
- If you chose D: Both The Ml Matrix Factorization Toolkit and Svd: The Swiss Army Knife are covered in this subject as interconnected topics. Correct!
Q7: What is a common pitfall when working with Key Svd Applications For Ml?
A) The main error with Key Svd Applications For Ml is using it when it is not needed B) Key Svd Applications For Ml has no common misconceptions C) Key Svd Applications For Ml is always computed the same way in all contexts D) A common mistake is confusing Key Svd Applications For Ml with a similar concept
Correct: D)
- If you chose A: This is incorrect. Students often confuse Key Svd Applications For Ml with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Key Svd Applications For Ml with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Key Svd Applications For Ml with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: Students often confuse Key Svd Applications For Ml with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
Q8: When should you apply Iterative Methods For Large Systems?
A) Iterative Methods For Large Systems is not practically useful B) Apply Iterative Methods For Large Systems to solve problems in this subject's domain C) Use Iterative Methods For Large Systems only in pure mathematics contexts D) Avoid Iterative Methods For Large Systems unless explicitly instructed
Correct: B)
- If you chose A: This is incorrect. Iterative Methods For Large Systems is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: Iterative Methods For Large Systems is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose C: This is incorrect. Iterative Methods For Large Systems is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. Iterative Methods For Large Systems is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
-
Show that the pseudoinverse $A^+$ satisfies the four Moore-Penrose conditions: $AA^+ A = A$, $A^+A A^+ = A^+$, $(AA^+)^T = AA^+$, $(A^+A)^T = A^+A$.
Click for answer
Let $A = U\Sigma V^T$, $A^+ = V\Sigma^+ U^T$. 1. $AA^+A = U\Sigma V^T V\Sigma^+ U^T U\Sigma V^T = U\Sigma\Sigma^+\Sigma V^T = U\Sigma V^T = A$ (since $\Sigma\Sigma^+\Sigma = \Sigma$ — the diagonal entries are $\sigma_i \cdot (1/\sigma_i) \cdot \sigma_i = \sigma_i$ for $\sigma_i > 0$, and $0 \cdot 0 \cdot 0 = 0$ for $\sigma_i = 0$). 2-4 follow similarly from the diagonal structure. -
Explain why the condition number of $X^T X + \lambda I$ improves as $\lambda$ increases.
Click for answer
If singular values of $X$ are $\sigma_i$, then $X^T X + \lambda I$ has eigenvalues $\sigma_i^2 + \lambda$. $\kappa = (\sigma_{\max}^2 + \lambda)/(\sigma_{\min}^2 + \lambda)$. As $\lambda$ increases: numerator and denominator both increase, but the ratio shrinks toward 1. For $\lambda \gg \sigma_{\min}^2$, $\kappa \approx 1$ — the problem becomes perfectly conditioned. -
For a sparse matrix $A$ with $10^6$ rows/columns and $10^7$ nonzeros, compare the cost per iteration of CG vs cost of one LU factorization.
Click for answer
CG per iteration: one matrix-vector product → $O(\text{nnz}) = 10^7$ ops. LU factorization: $O(n^3) = 10^{18}$ ops for dense, or $O(n \cdot \text{nnz})$ for sparse (with fill-in). Even best-case sparse LU: $10^6 \times 10^7 = 10^{13}$ ops, far more expensive. CG with good preconditioning converges in ~100 iterations for well-conditioned problems → $10^9$ ops total — 10,000× faster than sparse LU. -
Compute the rank-1 approximation of the matrix of all ones $J_{3 \times 3}$ using SVD.
Click for answer
$J_{3 \times 3}$ is rank-1 (all rows identical). Its only nonzero singular value: $\sigma_1 = 3$. $u_1 = (1,1,1)^T/\sqrt{3}$, $v_1 = (1,1,1)^T/\sqrt{3}$. $J = 3 \cdot \frac{1}{\sqrt{3}}(1,1,1)^T \cdot \frac{1}{\sqrt{3}}(1,1,1)^T = 3 \cdot \frac{1}{3} J_{3\times 3}$. Since $3 \cdot (1/3) = 1$, we have $J = \mathbf{1}\mathbf{1}^T$ where $\mathbf{1} = (1,1,1)^T$. The rank-1 approximation IS the exact matrix (since it's exactly rank-1). -
Randomized SVD with $k+p = 20$ on a $10^5 \times 10^5$ matrix. Approximately how much faster than full SVD?
Click for answer
Full SVD: $O(mn^2) = 10^5 \times 10^{10} = 10^{15}$ ops. Randomized SVD: $O(mn(k+p))$ for the matrix-multiply step = $10^5 \times 10^5 \times 20 = 2 \times 10^{11}$ ops, plus $O((k+p)^2 n)$ for the small SVD ≈ $400 \times 10^5 = 4 \times 10^7$ ops (negligible). Total: ~$2 \times 10^{11}$ ops — about 5,000× faster than full SVD.
Summary
Key takeaways:
- SVD $A = U\Sigma V^T$ is the most versatile factorization for ML: PCA, pseudoinverse, low-rank approximation, conditioning analysis
- PCA: right singular vectors of centered data matrix = principal components
- Pseudoinverse: $A^+ = V\Sigma^+ U^T$ — minimum-norm least-squares solution
- Low-rank approximation: truncate to top $k$ singular values (Eckart-Young optimality)
- Ridge regression via SVD: filter factors $\sigma_i^2/(\sigma_i^2 + \lambda)$ show regularization mechanism
- Iterative methods (CG, Lanczos) scale to large sparse matrices where direct $O(n^3)$ methods can't
- Randomized SVD enables approximate factorization of matrices too large for exact SVD
Pitfalls
- Using direct $O(n^3)$ methods on large sparse matrices: For a $10^6 \times 10^6$ sparse matrix with $10^7$ nonzeros, LU factorization is computationally infeasible. Iterative methods like Conjugate Gradient cost $O(\text{nnz})$ per iteration and converge in far fewer than $n$ steps with good preconditioning. Always check sparsity before reaching for
numpy.linalg.solveortorch.linalg.solve. - Forgetting to center data before PCA via SVD: The SVD of the centered data matrix gives principal components directly. Computing SVD on uncentered data captures the mean, not the variance structure. The first singular vector of uncentered data often points toward the data centroid rather than the direction of maximum variance. Always subtract the mean before SVD-based PCA.
- Using the pseudoinverse without thresholding small singular values: $A^+ = V\Sigma^+ U^T$ with $(\Sigma^+){ii} = 1/\sigma_i$ amplifies noise when $\sigma_i$ is tiny. A singular value of $10^{-10}$ becomes $10^{10}$ in the pseudoinverse, blowing up rounding errors. Always threshold: set $(\Sigma^+){ii} = 0$ for $\sigma_i < \tau \cdot \sigma_{\max}$ (with $\tau \approx 10^{-12}$ for float64).
- Choosing the wrong factorization for the task: Cholesky ($LL^T$) requires positive definiteness — it will fail silently or crash on indefinite matrices. QR is stable for least squares but slower than the normal equations for well-conditioned problems. SVD is the most robust but also the most expensive. Know the tradeoffs: LU for general square, Cholesky for PSD, QR for least squares, SVD for rank/conditioning analysis.
- Expecting randomized SVD accuracy without oversampling: Randomized SVD with target rank $k$ should use $k+p$ random projections where $p \approx 5$–$10$ is the oversampling parameter. Without oversampling ($p=0$), the randomized approximation can miss significant singular components. For matrices with slow singular value decay, power iteration ($q=1$–$2$ steps of $(AA^T)^q A\Omega$) dramatically improves accuracy.
Next Steps
Next up: 15-06-mixed-precision-training.md — float16/bfloat16 training, loss scaling, and how modern GPUs balance speed and accuracy.