Math graphic
📐 Concept diagram

15-05 — Numerical Linear Algebra for ML

Phase: Numerical Methods for ML | Subject: 15-05 Prerequisites: 9-1 through 9-10 (linear algebra series), 15-02-condition-stability.md Next subject: 15-06-mixed-precision-training.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Choose the right matrix factorization for ML tasks (SVD, QR, Cholesky, LU)
  2. Apply SVD for dimensionality reduction, pseudoinverse, and low-rank approximation
  3. Explain why iterative methods (CG, Lanczos) beat direct methods for large sparse systems
  4. Compute the SVD-based pseudoinverse and connect it to ridge regression
  5. Use randomized SVD for approximate low-rank factorization of large matrices

Core Content

The ML Matrix Factorization Toolkit

Factorization Form Cost ML Use Case
LU $A = LU$ $O(n^3)$ Solving $A\mathbf{x} = \mathbf{b}$ for dense square $A$
Cholesky $A = LL^T$ ($A \succ 0$) $O(n^3/3)$ Faster than LU for PSD matrices, Gaussian processes
QR $A = QR$, $Q^T Q = I$ $O(mn^2)$ Least squares, orthogonalization
SVD $A = U\Sigma V^T$ $O(mn^2)$ PCA, pseudoinverse, low-rank approximation, matrix completion
Eigenvalue $A = V\Lambda V^{-1}$ $O(n^3)$ Spectral clustering, PCA (via covariance)

SVD: The Swiss Army Knife

For $A \in \mathbb{R}^{m \times n}$:

$$A = U\Sigma V^T = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$

⚠️ CRITICAL — SVD reveals everything about a matrix: Rank = number of nonzero $\sigma_i$. Condition number = $\sigma_{\max}/\sigma_{\min}$. Null space = right singular vectors with $\sigma_i = 0$. Range = left singular vectors with $\sigma_i > 0$.

Key SVD Applications for ML

1. PCA via SVD: For centered data matrix $X \in \mathbb{R}^{N \times d}$: $X = U\Sigma V^T$. Principal components = columns of $V$. Explained variance = $\sigma_i^2 / \sum_j \sigma_j^2$. No need to form the $d \times d$ covariance matrix — SVD directly gives PCs.

2. Pseudoinverse: $$A^+ = V\Sigma^+ U^T$$ where $\Sigma^+$ inverts nonzero singular values and transposes. For $\sigma_i > \tau$ (threshold): $(\Sigma^+){ii} = 1/\sigma_i$. For $\sigma_i \leq \tau$: $(\Sigma^+){ii} = 0$.

$\mathbf{x} = A^+ \mathbf{b}$ gives the minimum-norm least-squares solution to $A\mathbf{x} \approx \mathbf{b}$.

3. Low-Rank Approximation (Eckart-Young): The best rank-$k$ approximation to $A$ (in Frobenius norm) is: $$A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$

Keeps the $k$ largest singular values and their vectors. Error: $|A - A_k|F = \sqrt{\sum{i=k+1}^r \sigma_i^2}$.

This is the math behind compressed models: keep only the top singular components.

4. Ridge Regression via SVD: If $X = U\Sigma V^T$, the ridge solution $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$ becomes: $$\hat{\beta} = V \operatorname{diag}\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right) U^T \mathbf{y}$$

The filter factor $f_i = \sigma_i^2/(\sigma_i^2 + \lambda)$ shows how ridge shrinks small singular value directions — exactly what we want for ill-conditioned $X^T X$.

Iterative Methods for Large Systems

For large sparse $A$ (e.g., graph Laplacian, PageRank), $O(n^3)$ direct methods are infeasible. Iterative methods apply $A$ repeatedly to vectors.

Conjugate Gradient (CG): For $A \succ 0$, solves $A\mathbf{x} = \mathbf{b}$ in at most $n$ iterations (theoretically) or much fewer with good conditioning. Each iteration costs one matrix-vector product — $O(\text{nnz})$ for sparse $A$.

Key idea: CG constructs $\mathbf{x}_k$ in the Krylov subspace $\operatorname{span}{\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \ldots, A^{k-1}\mathbf{b}}$ that minimizes $|\mathbf{x}_k - \mathbf{x}^*|_A$.

Lanczos: Computes the top $k$ eigenvalues/eigenvectors of a large sparse symmetric matrix in $O(k \cdot \text{nnz})$ — crucial for spectral methods on large graphs.

Randomized SVD

For $A \in \mathbb{R}^{m \times n}$ where computing the full SVD is too expensive:

  1. Draw random matrix $\Omega \in \mathbb{R}^{n \times (k+p)}$ (Gaussian or structured)
  2. Form $Y = A\Omega$ (captures range of $A$)
  3. QR-factorize $Y = QR$
  4. Form $B = Q^T A \in \mathbb{R}^{(k+p) \times n}$ (small!)
  5. SVD of $B = \tilde{U}\Sigma V^T$
  6. Approximate SVD of $A \approx (Q\tilde{U})\Sigma V^T$

Cost: $O(mn(k+p))$ vs $O(mn^2)$ for full SVD. For $k=100$, $n=10^6$, this is ~10,000× faster.



Key Terms

Worked Examples

Example 1: PCA via SVD

Data: $X = \begin{pmatrix} 1 & 2 \ 2 & 4 \ 3 & 6 \end{pmatrix}$ (3 points in 2D). Center and compute PCA.

Solution: Mean: $(2, 4)$. Center: $\tilde{X} = \begin{pmatrix} -1 & -2 \ 0 & 0 \ 1 & 2 \end{pmatrix}$.

$\tilde{X} = U\Sigma V^T = \begin{pmatrix} -0.447 & \cdots \ 0 & \cdots \ 0.447 & \cdots \end{pmatrix} \begin{pmatrix} 3.162 & 0 \ 0 & 0 \end{pmatrix} \begin{pmatrix} -0.447 & -0.894 \ -0.894 & 0.447 \end{pmatrix}$

All data lies on a line ($\sigma_2 = 0$). First PC: $\mathbf{v}_1 = (-0.447, -0.894)^T =$ direction $(1, 2)$. 100% of variance explained by 1 component.

Click for answer $\sigma_1 = \sqrt{10} \approx 3.162$, $\sigma_2 = 0$. Data is perfectly collinear (all points lie on $y=2x$). One PC captures all variance.

Example 2: Ridge Regression Filter Factors

For design matrix $X$ with singular values $\sigma = [10, 5, 1, 0.1]$, compute ridge filter factors for $\lambda = 1$ and $\lambda = 0.01$.

Solution: $f_i = \sigma_i^2 / (\sigma_i^2 + \lambda)$.

$\sigma_i$ $\lambda=1$ $\lambda=0.01$
10 100/101 ≈ 0.990 100/100.01 ≈ 1.000
5 25/26 ≈ 0.962 25/25.01 ≈ 1.000
1 1/2 = 0.500 1/1.01 ≈ 0.990
0.1 0.01/1.01 ≈ 0.010 0.01/0.02 = 0.500

Large singular values pass through almost unattenuated. Small singular values are heavily shrunk — this is ridge regression's regularization mechanism.

Click for answer $\lambda=1$: near-complete attenuation of the $\sigma=0.1$ direction (factor 0.01). $\lambda=0.01$: mild shrinkage (factor 0.5). Ridge selectively shrinks low-variance (likely noise) directions.

Example 3: Low-Rank Approximation

Approximate $A = \begin{pmatrix} 3 & 1 & 0 \ 1 & 3 & 1 \ 0 & 1 & 3 \end{pmatrix}$ with a rank-2 matrix.

Solution: SVD: $\sigma_1 \approx 4.414$, $\sigma_2 \approx 2.000$, $\sigma_3 \approx 0.586$. $A_2 = \sigma_1 \mathbf{u}_1\mathbf{v}_1^T + \sigma_2 \mathbf{u}_2\mathbf{v}_2^T$ (keeping top 2 components).

$|A - A_2|_F = \sigma_3 \approx 0.586$. The rank-2 approximation captures $(4.414^2 + 2^2)/(4.414^2 + 2^2 + 0.586^2) \approx 98.4\%$ of the Frobenius norm energy.

Click for answer Rank-2 approximation captures ~98.4% of the energy. Error = $\sigma_3 \approx 0.586$. The third singular component contributes little.


Quiz

Q1: What does the concept of Cholesky primarily refer to in this subject?

A) A computational error related to Cholesky B) The definition and application of Cholesky C) A historical anecdote about Cholesky D) A visual representation of Cholesky

Correct: B)

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question B) A = LU C) An unrelated formula from a different topic D) A simplified version of A = LU...

Correct: B)

Q3: What is the primary purpose of Eigenvalue?

A) It replaces all other methods in this domain B) It is used to eigenvalue in mathematical analysis C) It is used only in advanced research contexts D) It is primarily a historical notation system

Correct: B)

Q4: Which statement about The Ml Matrix Factorization Toolkit is TRUE?

A) The Ml Matrix Factorization Toolkit is an advanced topic beyond this subject's scope B) The Ml Matrix Factorization Toolkit is not related to this subject C) The Ml Matrix Factorization Toolkit is mentioned only as a historical footnote D) The Ml Matrix Factorization Toolkit is a fundamental concept covered in this subject

Correct: D)

Q5: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) A different result from a common mistake C) The inverse of the correct answer D) Ridge Regression Filter Factors

Correct: D)

Q6: How are The Ml Matrix Factorization Toolkit and Svd: The Swiss Army Knife related?

A) The Ml Matrix Factorization Toolkit is the inverse of Svd: The Swiss Army Knife B) The Ml Matrix Factorization Toolkit is a special case of Svd: The Swiss Army Knife C) The Ml Matrix Factorization Toolkit and Svd: The Swiss Army Knife are completely unrelated topics D) The Ml Matrix Factorization Toolkit and Svd: The Swiss Army Knife are closely related concepts

Correct: D)

Q7: What is a common pitfall when working with Key Svd Applications For Ml?

A) The main error with Key Svd Applications For Ml is using it when it is not needed B) Key Svd Applications For Ml has no common misconceptions C) Key Svd Applications For Ml is always computed the same way in all contexts D) A common mistake is confusing Key Svd Applications For Ml with a similar concept

Correct: D)

Q8: When should you apply Iterative Methods For Large Systems?

A) Iterative Methods For Large Systems is not practically useful B) Apply Iterative Methods For Large Systems to solve problems in this subject's domain C) Use Iterative Methods For Large Systems only in pure mathematics contexts D) Avoid Iterative Methods For Large Systems unless explicitly instructed

Correct: B)

Practice Problems

  1. Show that the pseudoinverse $A^+$ satisfies the four Moore-Penrose conditions: $AA^+ A = A$, $A^+A A^+ = A^+$, $(AA^+)^T = AA^+$, $(A^+A)^T = A^+A$.

    Click for answer Let $A = U\Sigma V^T$, $A^+ = V\Sigma^+ U^T$. 1. $AA^+A = U\Sigma V^T V\Sigma^+ U^T U\Sigma V^T = U\Sigma\Sigma^+\Sigma V^T = U\Sigma V^T = A$ (since $\Sigma\Sigma^+\Sigma = \Sigma$ — the diagonal entries are $\sigma_i \cdot (1/\sigma_i) \cdot \sigma_i = \sigma_i$ for $\sigma_i > 0$, and $0 \cdot 0 \cdot 0 = 0$ for $\sigma_i = 0$). 2-4 follow similarly from the diagonal structure.

  2. Explain why the condition number of $X^T X + \lambda I$ improves as $\lambda$ increases.

    Click for answer If singular values of $X$ are $\sigma_i$, then $X^T X + \lambda I$ has eigenvalues $\sigma_i^2 + \lambda$. $\kappa = (\sigma_{\max}^2 + \lambda)/(\sigma_{\min}^2 + \lambda)$. As $\lambda$ increases: numerator and denominator both increase, but the ratio shrinks toward 1. For $\lambda \gg \sigma_{\min}^2$, $\kappa \approx 1$ — the problem becomes perfectly conditioned.

  3. For a sparse matrix $A$ with $10^6$ rows/columns and $10^7$ nonzeros, compare the cost per iteration of CG vs cost of one LU factorization.

    Click for answer CG per iteration: one matrix-vector product → $O(\text{nnz}) = 10^7$ ops. LU factorization: $O(n^3) = 10^{18}$ ops for dense, or $O(n \cdot \text{nnz})$ for sparse (with fill-in). Even best-case sparse LU: $10^6 \times 10^7 = 10^{13}$ ops, far more expensive. CG with good preconditioning converges in ~100 iterations for well-conditioned problems → $10^9$ ops total — 10,000× faster than sparse LU.

  4. Compute the rank-1 approximation of the matrix of all ones $J_{3 \times 3}$ using SVD.

    Click for answer $J_{3 \times 3}$ is rank-1 (all rows identical). Its only nonzero singular value: $\sigma_1 = 3$. $u_1 = (1,1,1)^T/\sqrt{3}$, $v_1 = (1,1,1)^T/\sqrt{3}$. $J = 3 \cdot \frac{1}{\sqrt{3}}(1,1,1)^T \cdot \frac{1}{\sqrt{3}}(1,1,1)^T = 3 \cdot \frac{1}{3} J_{3\times 3}$. Since $3 \cdot (1/3) = 1$, we have $J = \mathbf{1}\mathbf{1}^T$ where $\mathbf{1} = (1,1,1)^T$. The rank-1 approximation IS the exact matrix (since it's exactly rank-1).

  5. Randomized SVD with $k+p = 20$ on a $10^5 \times 10^5$ matrix. Approximately how much faster than full SVD?

    Click for answer Full SVD: $O(mn^2) = 10^5 \times 10^{10} = 10^{15}$ ops. Randomized SVD: $O(mn(k+p))$ for the matrix-multiply step = $10^5 \times 10^5 \times 20 = 2 \times 10^{11}$ ops, plus $O((k+p)^2 n)$ for the small SVD ≈ $400 \times 10^5 = 4 \times 10^7$ ops (negligible). Total: ~$2 \times 10^{11}$ ops — about 5,000× faster than full SVD.


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 15-06-mixed-precision-training.md — float16/bfloat16 training, loss scaling, and how modern GPUs balance speed and accuracy.