09-10 - Matrix Calculus
Phase: 9 - Matrix Decompositions & Advanced Linear Algebra | Subject: 09-10 | Prerequisites: 09-09 - Numerical Linear Algebra | Next subject: 10-01 - Probability Foundations
Learning Objectives
- Compute derivatives of scalar-valued functions with respect to vectors (gradients) under a clearly stated layout convention
- Differentiate vector-valued functions of scalars and compute derivatives of vectors with respect to vectors (Jacobians)
- Differentiate scalar functions with respect to matrices, including trace-based differentiation tricks
- Apply the matrix chain rule and product rule to composite expressions
- Derive gradients of common ML functions: linear forms, quadratic forms, log-determinant, and matrix inverse
Core Content
Matrix calculus is the systematic extension of scalar calculus to functions involving vectors and matrices. It is foundational for optimization (gradient descent requires gradients with respect to parameters), backpropagation, and many ML derivations.
CRITICAL -- Foundational: Matrix calculus extends scalar calculus to matrix variables. Key identities: d(x^T A x)/dx = (A+A^T)x, d(tr(AX))/dX = A^T. Used everywhere in optimization and ML backprop.
1. Derivative of Scalar with Respect to Vector (Gradient)
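Let $f: \mathbb{R}^n \to \mathbb{R}$. The gradient is a column vector:
$$\nabla_x f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^T$$
Numerator vs. denominator layout: Two competing conventions exist:
- Numerator layout: the derivative takes the shape of the numerator, so the derivative of a scalar with respect to $x \in \mathbb{R}^n$ is a $1 \times n$ row vector and the Jacobian of $f: \mathbb{R}^n \to \mathbb{R}^m$ is $m \times n$.
- Denominator layout: the derivative takes the shape of the denominator, so the derivative of a scalar with respect to $x$ is an $n \times 1$ column vector and the Jacobian is transposed to $n \times m$.

This document writes Jacobians in numerator layout ($m \times n$) and gradients as column vectors, $\nabla_x f = (\partial f/\partial x)^T$, a mixed convention that is common in ML texts.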
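Key derivatives:
$$
\begin{aligned}
\frac{\partial}{\partial x}(a^T x) &= a && \text{($a$ is a constant vector)} \\
\frac{\partial}{\partial x}(x^T a) &= a \\
\frac{\partial}{\partial x}(x^T x) &= 2x && \text{(since $\tfrac{\partial}{\partial x_i} \sum_j x_j^2 = 2x_i$)} \\
\frac{\partial}{\partial x}(x^T A x) &= (A + A^T)x && \text{(quadratic form)}
\end{aligned}
$$
If A is symmetric: $\frac{\partial}{\partial x}(x^T A x) = 2Ax$.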
Derivation of quadratic form gradient:
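$$
x^T A x = \sum_i \sum_j x_i A_{ij} x_j, \qquad
\frac{\partial}{\partial x_k}(x^T A x) = \frac{\partial}{\partial x_k} \sum_i \sum_j x_i A_{ij} x_j
$$
Terms involving $x_k$:
- When $i = k$: the term $x_k A_{kj} x_j$ contributes $A_{kj} x_j$ (differentiating the first factor).
- When $j = k$: the term $x_i A_{ik} x_k$ contributes $x_i A_{ik}$ (differentiating the second factor).

The diagonal term $i = j = k$ appears in both cases, which matches $\partial(A_{kk} x_k^2)/\partial x_k = 2 A_{kk} x_k$.
$$
\frac{\partial}{\partial x_k}(x^T A x) = \sum_j A_{kj} x_j + \sum_i x_i A_{ik} = (Ax)_k + (A^T x)_k
$$
So: $\nabla_x (x^T A x) = Ax + A^T x = (A + A^T)x$.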
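As a sanity check, the quadratic-form gradient can be compared against a central-difference approximation. This is a minimal sketch assuming NumPy is available; `numerical_gradient` is an illustrative helper written for this example, not a library function.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x, dtype=float)
    for k in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[k] = eps
        grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))       # deliberately non-symmetric
x = rng.normal(size=4)

analytic = (A + A.T) @ x          # formula derived above
numeric = numerical_gradient(lambda v: v @ A @ v, x)
symmetric_only = 2 * A @ x        # valid only when A equals its transpose

print("(A + A^T) x matches finite differences:", np.allclose(analytic, numeric, atol=1e-5))
print("2 A x matches finite differences:      ", np.allclose(symmetric_only, numeric, atol=1e-5))
```

Because `A` is not symmetric here, the shortcut `2 A x` should disagree with the finite-difference gradient while `(A + A^T) x` agrees.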
Chain rule for scalar-vector:
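For a composition $f(g(x))$ with $g: \mathbb{R}^m \to \mathbb{R}^n$ and $f: \mathbb{R}^n \to \mathbb{R}$:
$$\nabla_x f = \left(\frac{\partial g}{\partial x}\right)^T \nabla_g f$$
where $\partial g/\partial x$ is the Jacobian (an $n \times m$ matrix).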
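The chain rule above can be illustrated numerically. The sketch below assumes NumPy and picks two hypothetical functions for the purpose, $g(x) = \sin(Bx)$ and $f(u) = \tfrac{1}{2} u^T u$; neither comes from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
B = rng.normal(size=(n, m))
x = rng.normal(size=m)

def g(v):
    """Inner map R^m -> R^n (illustrative choice)."""
    return np.sin(B @ v)

def f(u):
    """Outer map R^n -> R (illustrative choice)."""
    return 0.5 * u @ u

jac_g = np.cos(B @ x)[:, None] * B   # n x m Jacobian: diag(cos(Bx)) B
grad_f = g(x)                        # gradient of f with respect to its argument, at u = g(x)
chain = jac_g.T @ grad_f             # (dg/dx)^T grad_g f, a vector in R^m

eps = 1e-6
numeric = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(m)
])
print(np.allclose(chain, numeric, atol=1e-5))
```

The transpose on the Jacobian is what turns an $n$-vector ($\nabla_g f$) into an $m$-vector ($\nabla_x f$).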
2. Derivative of Vector with Respect to Scalar
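Let $f: \mathbb{R} \to \mathbb{R}^m$. Then $\partial f/\partial t$ is an $m \times 1$ vector:
$$\frac{\partial f}{\partial t} = \left[\frac{\partial f_1}{\partial t}, \frac{\partial f_2}{\partial t}, \ldots, \frac{\partial f_m}{\partial t}\right]^T$$
Key example (matrix exponential flow): If $x(t) = e^{At} x_0$, then $dx/dt = A e^{At} x_0 = A x(t)$.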
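The flow identity can be checked on a small example. The sketch below assumes NumPy and uses a truncated Taylor series for the matrix exponential (`expm_taylor` is an illustrative helper, adequate only for small, well-scaled matrices); the rotation generator `A` and initial state `x0` are arbitrary choices.

```python
import numpy as np

def expm_taylor(M, terms=40):
    """Truncated Taylor series for exp(M); adequate only for small, well-scaled M."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k           # accumulates M^k / k!
        result = result + term
    return result

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])           # generator of a planar rotation
x0 = np.array([1.0, 0.0])

def x_of_t(t):
    return expm_taylor(A * t) @ x0

t, h = 0.7, 1e-5
derivative_fd = (x_of_t(t + h) - x_of_t(t - h)) / (2 * h)   # finite-difference dx/dt
derivative_formula = A @ x_of_t(t)                          # A x(t)
print(np.allclose(derivative_fd, derivative_formula, atol=1e-6))
```

For this particular `A`, the solution is $x(t) = [\cos t, -\sin t]^T$, so the result can also be verified by hand.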
3. Derivative of Vector with Respect to Vector (Jacobian)
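Let $f: \mathbb{R}^n \to \mathbb{R}^m$. The Jacobian matrix (numerator layout) is $m \times n$:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
$$
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
$$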
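Key Jacobians:
$$
\frac{\partial}{\partial x}(Ax) = A \quad \text{(linear transformation)}, \qquad
\frac{\partial}{\partial x}(x^T A) = A^T \quad \text{($A$ is constant)}
$$
The second identity treats the row vector $x^T A$ as the column vector $A^T x$.
Product rule for vector-vector: If $h(x) = f(x) \odot g(x)$ where $\odot$ is the element-wise product:
$$\frac{\partial h}{\partial x} = \mathrm{diag}(g)\,\frac{\partial f}{\partial x} + \mathrm{diag}(f)\,\frac{\partial g}{\partial x}$$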
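The Jacobian of a linear map and the element-wise product rule can both be spot-checked with finite differences. This sketch assumes NumPy; `numerical_jacobian` is an illustrative helper, and the linear maps `f` and `g` are arbitrary test functions chosen for the example.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference Jacobian with J[i, j] ~ d func_i / d x_j (shape m x n)."""
    columns = [(func(x + eps * e) - func(x - eps * e)) / (2 * eps)
               for e in np.eye(x.size)]
    return np.stack(columns, axis=1)

rng = np.random.default_rng(2)
m, n = 3, 4
A = rng.normal(size=(m, n))
B = rng.normal(size=(m, n))
x = rng.normal(size=n)

def f(v):
    return A @ v

def g(v):
    return B @ v

def h(v):
    return f(v) * g(v)               # element-wise product

jac_linear = numerical_jacobian(f, x)                        # should reproduce A
jac_product = numerical_jacobian(h, x)
analytic_product = np.diag(g(x)) @ A + np.diag(f(x)) @ B     # diag(g) df/dx + diag(f) dg/dx

print(np.allclose(jac_linear, A, atol=1e-5))
print(np.allclose(jac_product, analytic_product, atol=1e-5))
```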
4. Derivative of Scalar with Respect to Matrix
Let $f: \mathbb{R}^{m \times n} \to \mathbb{R}$. The gradient is an $m \times n$ matrix:
$$\left(\frac{\partial f}{\partial X}\right)_{ij} = \frac{\partial f}{\partial X_{ij}}$$
Key matrix derivatives:
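Trace tricks: The trace is central because $\mathrm{tr}(A^T B) = \sum_{i,j} A_{ij} B_{ij}$ is the entrywise inner product of two matrices, and the trace is cyclic: $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$ (whenever the products are defined).
$$
\begin{aligned}
\frac{\partial}{\partial X}\,\mathrm{tr}(AX) &= A^T \\
\frac{\partial}{\partial X}\,\mathrm{tr}(X^T A) &= A \\
\frac{\partial}{\partial X}\,\mathrm{tr}(X A X^T) &= X A^T + X A && \text{(for symmetric $A$: $2XA$)} \\
\frac{\partial}{\partial X}\,\mathrm{tr}(X^T A X) &= A X + A^T X && \text{(for symmetric $A$: $2AX$)}
\end{aligned}
$$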
Linear form:
$$\frac{\partial}{\partial X}(a^T X b) = a b^T$$
Derivation: $a^T X b = \sum_i \sum_j a_i X_{ij} b_j$, so $\partial(a^T X b)/\partial X_{ij} = a_i b_j$. The gradient matrix is therefore $a b^T$.
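The trace identities and the linear-form gradient above can be spot-checked by perturbing one entry of $X$ at a time. This is a sketch assuming NumPy; `numerical_matrix_gradient` is an illustrative helper, and the random matrices are sized only so that each product is defined.

```python
import numpy as np

def numerical_matrix_gradient(f, X, eps=1e-6):
    """Entry-by-entry central differences: G[i, j] ~ d f / d X[i, j]."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(3)
m, n = 3, 4
X = rng.normal(size=(m, n))
A_nm = rng.normal(size=(n, m))    # for tr(A X)
A_mn = rng.normal(size=(m, n))    # for tr(X^T A)
A_nn = rng.normal(size=(n, n))    # for tr(X A X^T), non-symmetric
A_mm = rng.normal(size=(m, m))    # for tr(X^T A X), non-symmetric
a = rng.normal(size=m)
b = rng.normal(size=n)

# Each entry pairs a scalar function of Z with the claimed gradient at Z = X.
checks = {
    "tr(AX)":      (lambda Z: np.trace(A_nm @ Z),       A_nm.T),
    "tr(X^T A)":   (lambda Z: np.trace(Z.T @ A_mn),     A_mn),
    "tr(X A X^T)": (lambda Z: np.trace(Z @ A_nn @ Z.T), X @ A_nn.T + X @ A_nn),
    "tr(X^T A X)": (lambda Z: np.trace(Z.T @ A_mm @ Z), A_mm @ X + A_mm.T @ X),
    "a^T X b":     (lambda Z: a @ Z @ b,                np.outer(a, b)),
}
results = {name: bool(np.allclose(numerical_matrix_gradient(f, X), grad, atol=1e-5))
           for name, (f, grad) in checks.items()}
print(results)
```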
Quadratic form with matrix:
$$
\frac{\partial}{\partial X}\,\mathrm{tr}(X^T X) = 2X, \qquad
\frac{\partial}{\partial X}\,\|X\|_F^2 = 2X \quad \text{(since $\|X\|_F^2 = \mathrm{tr}(X^T X)$)}
$$
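Determinant:
$$
\frac{\partial}{\partial X}\det(X) = \det(X)\,(X^{-1})^T \quad \text{(if $X$ is invertible)}, \qquad
\frac{\partial}{\partial X}\log\det(X) = (X^{-1})^T \quad \text{(for invertible $X$ with $\det(X) > 0$)}
$$
Matrix inverse:
$$\frac{\partial}{\partial X}(a^T X^{-1} b) = -X^{-T} a b^T X^{-T}$$
More generally, the differential of the matrix inverse is $d(X^{-1}) = -X^{-1} (dX) X^{-1}$.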
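The determinant, log-determinant, and inverse identities can be verified the same way. The sketch below assumes NumPy; the matrix `X` and the vectors `a`, `b` are arbitrary test values (with `X` non-symmetric so that a missing transpose would be detected), and `numerical_matrix_gradient` is an illustrative helper.

```python
import numpy as np

def numerical_matrix_gradient(f, X, eps=1e-6):
    """Entry-by-entry central differences: G[i, j] ~ d f / d X[i, j]."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

X = np.array([[2.0, 1.0, 0.0],
              [0.5, 3.0, 1.0],
              [1.0, 0.0, 4.0]])      # non-symmetric, det(X) = 23 > 0
a = np.array([1.0, -2.0, 0.5])
b = np.array([0.3, 1.0, -1.0])
X_inv_T = np.linalg.inv(X).T

grad_det = np.linalg.det(X) * X_inv_T               # det(X) X^{-T}
grad_logdet = X_inv_T                               # X^{-T}
grad_inverse = -X_inv_T @ np.outer(a, b) @ X_inv_T  # -X^{-T} a b^T X^{-T}

num_det = numerical_matrix_gradient(np.linalg.det, X)
num_logdet = numerical_matrix_gradient(lambda Z: np.log(np.linalg.det(Z)), X)
num_inverse = numerical_matrix_gradient(lambda Z: a @ np.linalg.solve(Z, b), X)

print(np.allclose(grad_det, num_det, atol=1e-5))
print(np.allclose(grad_logdet, num_logdet, atol=1e-5))
print(np.allclose(grad_inverse, num_inverse, atol=1e-5))
```

Replacing `X_inv_T` with the untransposed inverse should make all three checks fail for this `X`.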
5. Chain Rule in Matrix Form
Suppose $f(Y(X))$ where $X$ is an $m \times n$ matrix, $Y$ is a $p \times q$ matrix, and $f$ returns a scalar. The chain rule is:
$$\frac{\partial f}{\partial X_{ij}} = \sum_{k=1}^{p} \sum_{l=1}^{q} \frac{\partial f}{\partial Y_{kl}} \cdot \frac{\partial Y_{kl}}{\partial X_{ij}}$$
Common Pitfall: Two conventions exist for the derivative of a vector by a vector: numerator layout (the Jacobian) and denominator layout (its transpose). $d(Ax)/dx = A$ in numerator layout and $A^T$ in denominator layout. Always check which convention a source uses.
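In matrix form (numerator layout), for vector functions:
$$\frac{\partial h}{\partial x} = \frac{\partial h}{\partial g} \cdot \frac{\partial g}{\partial x}$$
where the inner dimensions must align: for $g: \mathbb{R}^m \to \mathbb{R}^n$ and $h: \mathbb{R}^n \to \mathbb{R}^p$, $\partial h/\partial g$ is $p \times n$ and $\partial g/\partial x$ is $n \times m$.
For neural network layers: $z = Wx + b$, $a = \sigma(z)$ (applied element-wise), $L = \mathrm{loss}(a)$:
$$
\frac{\partial L}{\partial W} = \left(\frac{\partial L}{\partial a} \odot \sigma'(z)\right) x^T \quad \text{(outer product)}, \qquad
\frac{\partial L}{\partial x} = W^T \left(\frac{\partial L}{\partial a} \odot \sigma'(z)\right)
$$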
This is the essence of backpropagation.
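A gradient check makes the layer formulas concrete. The sketch below assumes NumPy, takes $\sigma$ to be the logistic sigmoid, and uses $L = \tfrac{1}{2}\|a - y\|^2$ as the loss; both choices are assumptions made for illustration, since the text leaves $\sigma$ and the loss generic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, x, b, y):
    """Assumed loss for this sketch: L = 0.5 * ||sigmoid(W x + b) - y||^2."""
    a = sigmoid(W @ x + b)
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 2))
x = rng.normal(size=2)
b = rng.normal(size=3)
y = rng.normal(size=3)

# Analytic backward pass
z = W @ x + b
a = sigmoid(z)
delta = (a - y) * a * (1 - a)      # dL/da ⊙ sigma'(z), using sigma' = sigma (1 - sigma)
grad_W = np.outer(delta, x)        # delta x^T
grad_x = W.T @ delta               # W^T delta

# Finite-difference comparison
eps = 1e-6
num_W = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    E = np.zeros_like(W)
    E[idx] = eps
    num_W[idx] = (loss(W + E, x, b, y) - loss(W - E, x, b, y)) / (2 * eps)
num_x = np.array([
    (loss(W, x + eps * e, b, y) - loss(W, x - eps * e, b, y)) / (2 * eps)
    for e in np.eye(x.size)
])

print(np.allclose(grad_W, num_W, atol=1e-5), np.allclose(grad_x, num_x, atol=1e-5))
```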
6. Common ML Gradient Patterns
| Expression | Gradient w.r.t. | Result |
|---|---|---|
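| $\lVert Ax - b \rVert^2$ | x | $2 A^T (Ax - b)$ |
| $x^T A x$ | x | $(A + A^T) x$ |
| $\log\det(X)$ | X | $X^{-T}$ (for symmetric X: $X^{-1}$) |
| $\lVert X \rVert_F^2$ | X | $2X$ |
| $\lVert W \rVert_F^2$ (L2 regularization) | W | $2W$ |
| $-\log \mathrm{softmax}(z)_y$ (cross-entropy) | z | $\mathrm{softmax}(z) - e_y$, where $e_y$ is the one-hot vector for class $y$ |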
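The cross-entropy row is the one most often used in practice, so it is worth checking numerically. This sketch assumes NumPy; the logits `z` and label `y` are arbitrary test values, and `softmax` subtracts the maximum logit purely for numerical stability.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift by the max logit for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    """Negative log-probability of class y under softmax(z)."""
    return -np.log(softmax(z)[y])

z = np.array([2.0, -1.0, 0.5, 0.0])
y = 2

analytic = softmax(z)
analytic[y] -= 1.0                 # softmax(z) - e_y

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * e, y) - cross_entropy(z - eps * e, y)) / (2 * eps)
    for e in np.eye(z.size)
])
print(np.allclose(analytic, numeric, atol=1e-5))
```

Since the softmax outputs sum to 1, the entries of this gradient always sum to 0.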
Key Terms
- Jacobian matrix
Worked Examples
Example 1: Gradient of Linear Regression Loss
Given $X \in \mathbb{R}^{m \times n}$, $w \in \mathbb{R}^n$, $y \in \mathbb{R}^m$:
$$L(w) = \|Xw - y\|^2 = (Xw - y)^T (Xw - y)$$
Expand:
$$L = w^T X^T X w - 2 y^T X w + y^T y$$
Using $\frac{\partial}{\partial w}(w^T A w) = (A + A^T) w$ with $A = X^T X$, and $\frac{\partial}{\partial w}(b^T w) = b$ with $b = X^T y$:
$$\nabla L = (X^T X + X^T X) w - 2 X^T y = 2 X^T X w - 2 X^T y = 2 X^T (Xw - y)$$
Setting the gradient to zero gives the normal equations: $X^T X w = X^T y$.
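The result can be checked on random data: the solution of the normal equations should make the gradient vanish and agree with a standard least-squares solver. This sketch assumes NumPy; the data are synthetic and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 20, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

def grad_loss(w):
    """Gradient of ||X w - y||^2 from the derivation above."""
    return 2 * X.T @ (X @ w - y)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)       # solve X^T X w = X^T y
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]     # reference least-squares solution

print("gradient norm at solution:", np.linalg.norm(grad_loss(w_normal)))
print("agrees with lstsq:", np.allclose(w_normal, w_lstsq))
```

Forming $X^T X$ squares the condition number of $X$, so for ill-conditioned problems a QR- or SVD-based solver such as `np.linalg.lstsq` is usually the safer choice.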
Example 2: Gradient of log det
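Let $f(X) = \log\det(X)$ for an invertible $X$ with $\det(X) > 0$ (for example, a symmetric positive definite matrix).
We can use the trace trick. For a small perturbation $dX$:
$$
\begin{aligned}
f(X + dX) &= \log\det(X + dX) \\
&= \log\det\big(X (I + X^{-1} dX)\big) \\
&= \log\det(X) + \log\det(I + X^{-1} dX) \\
&\approx f(X) + \mathrm{tr}(X^{-1} dX) \qquad \text{(since $\log\det(I + \varepsilon A) \approx \mathrm{tr}(\varepsilon A)$ to first order in $\varepsilon$)}
\end{aligned}
$$
Thus: $df = \mathrm{tr}(X^{-1} dX) = \mathrm{tr}((X^{-T})^T dX)$, so $\partial f/\partial X = X^{-T}$.
If X is symmetric: $\partial f/\partial X = X^{-1}$ when all entries are treated as independent. If only the independent entries of a symmetric matrix are varied (so $X_{ij}$ and $X_{ji}$ move together), each off-diagonal entry is counted twice and the result is $2X^{-1} - \mathrm{diag}(X^{-1})$.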
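The difference between the two symmetric-case answers can be seen numerically by perturbing either a single entry or a symmetric pair of entries. This is a sketch assuming NumPy; the matrix `X` is an arbitrary symmetric positive definite test matrix.

```python
import numpy as np

X = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])     # symmetric positive definite (det = 18)
X_inv = np.linalg.inv(X)

def logdet(Z):
    return np.log(np.linalg.det(Z))

eps = 1e-6
n = X.shape[0]
grad_independent = np.zeros((n, n))   # each entry perturbed on its own
grad_constrained = np.zeros((n, n))   # X[i, j] and X[j, i] perturbed together
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad_independent[i, j] = (logdet(X + E) - logdet(X - E)) / (2 * eps)
        S = E.copy()
        S[j, i] = eps                 # no change on the diagonal, where i == j
        grad_constrained[i, j] = (logdet(X + S) - logdet(X - S)) / (2 * eps)

print(np.allclose(grad_independent, X_inv, atol=1e-5))
print(np.allclose(grad_constrained, 2 * X_inv - np.diag(np.diag(X_inv)), atol=1e-5))
```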
Example 3: Gradient of a Neural Network Layer
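Forward: $z = Wx$, $a = \mathrm{ReLU}(z)$, $L = \tfrac{1}{2}\|a - y\|^2$.
Backward:
$$
\begin{aligned}
\partial L/\partial a &= a - y \\
\partial L/\partial z &= (\partial L/\partial a) \odot \mathrm{ReLU}'(z) && \text{where $\mathrm{ReLU}'(z_i) = 1$ if $z_i > 0$, else $0$} \\
\partial L/\partial W &= (\partial L/\partial z)\, x^T && \text{(outer product: $m \times 1$ times $1 \times n$ gives $m \times n$)} \\
\partial L/\partial x &= W^T (\partial L/\partial z)
\end{aligned}
$$
For $W = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, $x = [1, 1]^T$, $y = [5, 11]^T$:
- $z = Wx = [3, 7]^T$. Both entries are positive, so $\mathrm{ReLU}'(z) = [1, 1]^T$.
- $a = [3, 7]^T$.
- $\partial L/\partial a = a - y = [-2, -4]^T$.
- $\partial L/\partial z = [-2, -4]^T$ (element-wise product with $[1, 1]^T$).
- $\partial L/\partial W = [-2, -4]^T [1, 1] = \begin{bmatrix} -2 & -2 \\ -4 & -4 \end{bmatrix}$.
- $\partial L/\partial x = W^T (\partial L/\partial z) = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} [-2, -4]^T = [-14, -20]^T$.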
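The same numbers can be reproduced in a few lines. This is a minimal sketch assuming NumPy; the variable names are chosen for this example only.

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([1.0, 1.0])
y = np.array([5.0, 11.0])

# Forward pass
z = W @ x                      # [3, 7]
a = np.maximum(z, 0.0)         # ReLU leaves positive entries unchanged

# Backward pass
dL_da = a - y                  # [-2, -4]
dL_dz = dL_da * (z > 0)        # ReLU'(z) is 1 where z > 0, else 0
dL_dW = np.outer(dL_dz, x)     # [[-2, -2], [-4, -4]]
dL_dx = W.T @ dL_dz            # [-14, -20]

print(dL_dW)
print(dL_dx)
```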
Example 4: Product Rule Verification
Let $f(x) = x^T A x$. Compute $\nabla f$ using the product rule.
Rewrite: $f(x) = (x)^T (Ax)$. Treat this as $u(x)^T v(x)$ where $u(x) = x$ and $v(x) = Ax$.
Product rule: $\nabla(u^T v) = (\partial u/\partial x)^T v + (\partial v/\partial x)^T u = I \cdot Ax + A^T \cdot x = Ax + A^T x$.
Matches the direct computation. ✓
Quiz
Q1: What does the Jacobian matrix refer to in this subject?
A) A visual representation of a vector-valued function
B) A historical anecdote about matrix calculus
C) The $m \times n$ matrix of first-order partial derivatives $J_{ij} = \partial f_i/\partial x_j$ of a function $f: \mathbb{R}^n \to \mathbb{R}^m$
D) A computational error that occurs when differentiating
Correct: C)
- If you chose A: This is incorrect. The Jacobian is the matrix of partial derivatives $J_{ij} = \partial f_i/\partial x_j$, not a visualization.
- If you chose B: This is incorrect. The Jacobian is a mathematical object defined by $J_{ij} = \partial f_i/\partial x_j$, not a historical note.
- If you chose C: Correct! For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian collects every partial derivative $\partial f_i/\partial x_j$ into an $m \times n$ matrix (numerator layout).
- If you chose D: This is incorrect. The Jacobian is a well-defined derivative, not an error.
Q2: What is the primary purpose of Common Pitfalls?
A) It replaces all other methods in this domain
B) It is primarily a historical notation system
C) It flags mistakes that commonly occur in matrix calculus derivations, such as mixing layout conventions
D) It is used only in advanced research contexts
Correct: C)
- If you chose A: This is incorrect. Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose D: This is incorrect. Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role.
Q3: Based on the worked examples in this subject, what is $\partial \log\det(X)/\partial X$ for a symmetric $X$ whose entries are treated as independent?
A) $X^{-1}$
B) An unrelated numerical value
C) A different result from a common mistake
D) The inverse of the correct answer
Correct: A)
- If you chose A: The worked examples show that the result is $X^{-T}$, which equals $X^{-1}$ for symmetric $X$. The other options represent common errors. Correct!
- If you chose B: This is incorrect. The worked examples show that the result is $X^{-T}$, which equals $X^{-1}$ for symmetric $X$. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is $X^{-T}$, which equals $X^{-1}$ for symmetric $X$. The other options represent common errors.
- If you chose D: This is incorrect. The worked examples show that the result is $X^{-T}$, which equals $X^{-1}$ for symmetric $X$. The other options represent common errors.
Practice Problems
1. Compute $\nabla_x f$ where $f(x) = \|Ax\|^2 = x^T A^T A x$. Use the quadratic form gradient formula.
2. Find the Jacobian $\partial f/\partial x$ for $f(x) = [x_1^2 + x_2, \sin(x_1 x_2)]^T$.
3. Compute $\frac{\partial}{\partial X}\,\mathrm{tr}(AXB)$ where A, B are constant matrices.
4. Derive $\frac{\partial}{\partial W}\,\|WX - Y\|_F^2$ with respect to W. (X and Y are constant matrices.)
5. For $f(X) = \det(X)$, use the first-order trace expansion to derive $\partial f/\partial X = \det(X)\,(X^{-1})^T$.
6. Compute $\nabla_x \log(1 + \exp(w^T x))$ (the gradient of the softplus function applied to $w^T x$).
Answers
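1. Using $\frac{\partial}{\partial x}(x^T M x) = (M + M^T)x$ with $M = A^T A$: since $A^T A$ is symmetric, $\nabla f = 2 A^T A x$.
2. $f_1 = x_1^2 + x_2$, $f_2 = \sin(x_1 x_2)$. $\partial f_1/\partial x_1 = 2x_1$, $\partial f_1/\partial x_2 = 1$. $\partial f_2/\partial x_1 = x_2 \cos(x_1 x_2)$, $\partial f_2/\partial x_2 = x_1 \cos(x_1 x_2)$. So $J = \begin{bmatrix} 2x_1 & 1 \\ x_2 \cos(x_1 x_2) & x_1 \cos(x_1 x_2) \end{bmatrix}$.
3. $\mathrm{tr}(AXB) = \mathrm{tr}(BAX)$ (cyclic property). $\frac{\partial}{\partial X}\,\mathrm{tr}(CX) = C^T$ where $C = BA$. So $\frac{\partial}{\partial X}\,\mathrm{tr}(AXB) = (BA)^T = A^T B^T$.
4. $f(W) = \|WX - Y\|_F^2 = \mathrm{tr}((WX - Y)^T (WX - Y)) = \mathrm{tr}(X^T W^T W X) - 2\,\mathrm{tr}(Y^T W X) + \mathrm{tr}(Y^T Y)$. For the first term, $\frac{\partial}{\partial W}\,\mathrm{tr}(X^T W^T W X) = \frac{\partial}{\partial W}\,\mathrm{tr}(W X X^T W^T) = W (X X^T)^T + W X X^T = 2 W X X^T$ (since $X X^T$ is symmetric). For the second term, $\frac{\partial}{\partial W}\big(-2\,\mathrm{tr}(Y^T W X)\big) = -2 \frac{\partial}{\partial W}\,\mathrm{tr}(X Y^T W) = -2 (X Y^T)^T = -2 Y X^T$. So: $\partial f/\partial W = 2 W X X^T - 2 Y X^T = 2 (WX - Y) X^T$.
5. For small $dX$: $\det(X + dX) \approx \det(X) + \det(X)\,\mathrm{tr}(X^{-1} dX)$. So $df = \det(X)\,\mathrm{tr}(X^{-1} dX) = \mathrm{tr}\big(\det(X) (X^{-T})^T dX\big)$. Therefore: $\partial \det(X)/\partial X = \det(X)\, X^{-T}$.
6. $f(x) = \log(1 + \exp(w^T x))$. Let $s = w^T x$. Then $\partial f/\partial s = \exp(s)/(1 + \exp(s)) = \sigma(s)$ (the sigmoid function). By the chain rule: $\nabla_x f = (\partial f/\partial s)\, \nabla_x (w^T x) = \sigma(w^T x) \cdot w$.

Summary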
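- Scalar-vector gradient: $\nabla_x f = [\partial f/\partial x_1, \ldots, \partial f/\partial x_n]^T$. Key formulas: $\partial(a^T x)/\partial x = a$, $\partial(x^T A x)/\partial x = (A + A^T)x$
- Vector-vector Jacobian: $J_{ij} = \partial f_i/\partial x_j$. Key: $\partial(Ax)/\partial x = A$
- Scalar-matrix derivatives use trace tricks: $\mathrm{tr}(A^T B) = \mathrm{vec}(A)^T \mathrm{vec}(B)$. Key: $\partial\,\mathrm{tr}(AX)/\partial X = A^T$, $\partial \log\det(X)/\partial X = X^{-T}$
- Matrix chain rule generalizes the scalar chain rule: $\partial h/\partial x = (\partial h/\partial g)(\partial g/\partial x)$ with careful dimension alignment
- Neural network gradients (e.g., $\partial L/\partial W = \delta x^T$ with $\delta = \partial L/\partial z$) are direct applications of matrix calculus and are essential for understanding backpropagation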
Pitfalls
- Mixing numerator and denominator layout conventions. They are transposes of each other: $\partial(Ax)/\partial x = A$ in numerator layout but $A^T$ in denominator layout. Pick one convention and apply it consistently throughout a derivation.
- Forgetting the transpose in the quadratic form gradient. $\partial(x^T A x)/\partial x = (A + A^T)x$, not $2Ax$, unless A is symmetric. Dropping the $A^T x$ term gives wrong gradients for non-symmetric matrices.
- Treating trace derivative rules as matrix product rules. $\partial\,\mathrm{tr}(AX)/\partial X = A^T$, not $A$. The transpose is a consequence of how the trace inner product works: $df = \mathrm{tr}(C^T dX) \Rightarrow \partial f/\partial X = C$.
- Misapplying the chain rule in matrix form. Dimensions must align: $\partial h/\partial x = (\partial h/\partial g)(\partial g/\partial x)$. For scalar-vector-vector compositions, the Jacobian of the outer function must multiply from the left.
- Assuming $\partial \log\det(X)/\partial X = X^{-1}$ for all matrices. For symmetric X, the gradient that accounts for the symmetry constraint is $2X^{-1} - \mathrm{diag}(X^{-1})$. For general invertible X (treating all entries independently), it is $X^{-T}$.
Next Steps
Congratulations! You have completed Phase 9 (Matrix Decompositions & Advanced Linear Algebra). This concludes the core linear algebra sequence. Your next steps depend on your learning path:
- For optimization: Continue to Phase 14 (Optimization Theory) to build on matrix calculus
- For deep learning: Continue to Phase 16 (Neural Network Foundations) where these matrix calculus skills are directly applied
- For numerical methods: Topics from 09-09 (Numerical Linear Algebra) are expanded in Phase 15 (Numerical Methods for ML)
- For review: Return to Phase 8's diagonalization section to solidify the eigendecomposition intuition before moving to applications