09-10 — Matrix Calculus

Phase: 9 - Matrix Decompositions & Advanced Linear Algebra
Subject: 09-10
Prerequisites: 09-09 - Numerical Linear Algebra
Next subject: 10-01 - Probability Foundations

Learning Objectives

Compute derivatives of scalar-valued functions with respect to vectors (gradients) and state the layout convention being used
Differentiate vector-valued functions of scalars and compute derivatives of vectors with respect to vectors (Jacobians)
Differentiate scalar functions with respect to matrices, including trace-based differentiation tricks
Apply the matrix chain rule and product rule to composite expressions
Derive gradients of common ML functions: linear forms, quadratic forms, log-determinant, and matrix inverse

Core Content

Matrix calculus is the systematic extension of scalar calculus to functions involving vectors and matrices. It is foundational for optimization (gradient descent requires gradients with respect to parameters), backpropagation, and many ML derivations.

CRITICAL -- Foundational: Matrix calculus extends scalar calculus to matrix variables. Key identities: d(x^T A x)/dx = (A+A^T)x, d(tr(AX))/dX = A^T. Used everywhere in optimization and ML backprop.

1. Derivative of Scalar with Respect to Vector (Gradient)

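Let f: ℝ^n → ℝ. The gradient collects the partial derivatives into a column vector:

$∇_x f = ∂f/∂x = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂x_n]^T
$

Numerator vs. denominator layout: Two competing conventions exist:
- Numerator layout: the derivative takes the shape of the numerator, so the derivative of a scalar with respect to a column vector is a 1 × n row vector, and the Jacobian of f: ℝ^n → ℝ^m is m × n.
- Denominator layout: the derivative takes the shape of the denominator, so the derivative of a scalar with respect to a column vector is an n × 1 column vector (the gradient), and the derivative of a vector with respect to a vector is the n × m transpose of the Jacobian.

This document writes Jacobians in numerator layout (m × n) and writes the gradient ∇_x f as a column vector, i.e. the transpose of the 1 × n numerator-layout derivative. This mixed choice is common in ML.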

Key derivatives:

$∂/∂x (a^T x) = a                         (a is a constant vector)
∂/∂x (x^T a) = a
∂/∂x (x^T x) = 2x                        (since ∂/∂x_i Σ x_j² = 2x_i)
∂/∂x (x^T A x) = (A + A^T)x              (quadratic form)
$

If A is symmetric: $∂/∂x (x^T A x) = 2Ax$.

Derivation of quadratic form gradient:

$x^T A x = Σ_i Σ_j x_i A_{ij} x_j

∂/∂x_k (x^T A x) = ∂/∂x_k (Σ_i Σ_j x_i A_{ij} x_j)
$

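Terms involving x_k (by the product rule, the diagonal term i = j = k contributes once to each case):
- When i = k: $∂(x_k A_{kj} x_j)/∂x_k = A_{kj} x_j$, holding the second factor x_j fixed.
- When j = k: $∂(x_i A_{ik} x_k)/∂x_k = x_i A_{ik}$, holding the first factor x_i fixed.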

$∂/∂x_k (x^T A x) = Σ_j A_{kj} x_j + Σ_i x_i A_{ik}
                 = (Ax)_k + (A^T x)_k
$

So: $∇_x (x^T A x) = Ax + A^T x = (A + A^T)x$.
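The result can be sanity-checked numerically. The following is a minimal sketch in plain Python (no external libraries) that compares (A + A^T)x with a central-difference gradient for a deliberately non-symmetric A. The helper names (`matvec`, `quad_form`, `numeric_grad`, `shifted`) and the test values are illustrative choices, not part of the original material.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def quad_form(A, x):
    """Evaluate the scalar x^T A x."""
    return sum(x_i * ax_i for x_i, ax_i in zip(x, matvec(A, x)))

def quad_form_grad(A, x):
    """Analytic gradient (A + A^T) x."""
    A_T = [list(col) for col in zip(*A)]
    return [p + q for p, q in zip(matvec(A, x), matvec(A_T, x))]

def shifted(x, k, delta):
    """Copy of x with delta added to component k."""
    y = list(x)
    y[k] += delta
    return y

def numeric_grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    return [(f(shifted(x, k, h)) - f(shifted(x, k, -h))) / (2 * h)
            for k in range(len(x))]

A = [[1.0, 2.0], [0.0, 3.0]]   # deliberately non-symmetric
x = [1.0, -2.0]

analytic = quad_form_grad(A, x)                  # [-2.0, -10.0]
numeric = numeric_grad(lambda v: quad_form(A, v), x)
naive = [2 * t for t in matvec(A, x)]            # 2Ax = [-6.0, -12.0], wrong here
print(analytic, numeric, naive)
```

Because A is not symmetric, the shortcut 2Ax disagrees with both the analytic formula and the finite-difference estimate.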

Chain rule for scalar-vector: If f(g(x)) where g: ℝ^m → ℝ^n and f: ℝ^n → ℝ:

$∇_x f = (∂g/∂x)^T ∇_g f
$

where $∂g/∂x$ is the Jacobian (n × m matrix).
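As a short worked instance of this chain rule, take the illustrative choice g(x) = Ax with A an n × m matrix and f(g) = g^T g (these specific functions are assumptions made for the example):

```latex
\begin{aligned}
g(x) &= Ax, & \frac{\partial g}{\partial x} &= A \quad (n \times m) \\
f(g) &= g^T g, & \nabla_g f &= 2g \\
\nabla_x f &= \left(\frac{\partial g}{\partial x}\right)^T \nabla_g f
            = A^T (2Ax) = 2 A^T A x
\end{aligned}
```

The result agrees with the quadratic form rule applied to x^T (A^T A) x, since A^T A is symmetric.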

2. Derivative of Vector with Respect to Scalar

Let f: ℝ → ℝ^m. Then $∂f/∂t$ is an m × 1 vector:

$∂f/∂t = [∂f₁/∂t, ∂f₂/∂t, ..., ∂f_m/∂t]^T
$

Key example — matrix exponential flow: If $x(t) = e^{At} x₀$, then $dx/dt = A e^{At} x₀ = A x(t)$.

3. Derivative of Vector with Respect to Vector (Jacobian)

Let f: ℝ^n → ℝ^m. The Jacobian matrix (numerator layout) is m × n:

$J_{ij} = ∂f_i / ∂x_j
$

$∂f/∂x = [∂f₁/∂x₁  ∂f₁/∂x₂  ...  ∂f₁/∂x_n]
        [∂f₂/∂x₁  ∂f₂/∂x₂  ...  ∂f₂/∂x_n]
        [  ...       ...     ...    ...   ]
        [∂f_m/∂x₁ ∂f_m/∂x₂ ... ∂f_m/∂x_n]
$

Key Jacobians:

$∂/∂x (Ax) = A            (linear transformation)
∂/∂x (x^T A) = A^T       (A is constant)
$

Product rule for vector-vector: If $h(x) = f(x) ⊙ g(x)$ where ⊙ is element-wise product:

$∂h/∂x = diag(g) · ∂f/∂x + diag(f) · ∂g/∂x
$
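A finite-difference Jacobian makes both the shape convention and the element-wise product rule easy to check. The sketch below uses plain Python; the matrices `A` and `B`, the point `x`, and the helper names are hypothetical values chosen for illustration.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def shifted(x, k, delta):
    """Copy of x with delta added to component k."""
    y = list(x)
    y[k] += delta
    return y

def numeric_jacobian(func, x, h=1e-6):
    """Central-difference Jacobian with J[i][j] = d func_i / d x_j."""
    cols = []
    for j in range(len(x)):
        fp, fm = func(shifted(x, j, h)), func(shifted(x, j, -h))
        cols.append([(p - q) / (2 * h) for p, q in zip(fp, fm)])
    # cols[j][i] holds d func_i / d x_j, so transpose to get an m x n matrix
    return [list(row) for row in zip(*cols)]

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # maps R^2 -> R^3
B = [[0.0, 1.0], [1.0, 0.0], [2.0, -1.0]]
x = [0.5, -1.5]

f = lambda v: matvec(A, v)
g = lambda v: matvec(B, v)
h_func = lambda v: [fi * gi for fi, gi in zip(f(v), g(v))]   # f(x) ⊙ g(x)

J_linear = numeric_jacobian(f, x)     # 3 x 2, should match A
J_h = numeric_jacobian(h_func, x)

# diag(g) A + diag(f) B, written entry by entry
fx, gx = f(x), g(x)
J_rule = [[gx[i] * A[i][j] + fx[i] * B[i][j] for j in range(2)]
          for i in range(3)]
```

The Jacobian of a map from ℝ^2 to ℝ^3 comes out as a 3 × 2 matrix, matching the m × n shape stated above.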

4. Derivative of Scalar with Respect to Matrix

Let $f: ℝ^{m×n} → ℝ$. The gradient is an m × n matrix:

(∂f/∂X)_{ij} = ∂f/∂X_{ij}

Key matrix derivatives:

Trace tricks: The trace is key because tr(A^T B) = Σ_{i,j} A_{ij} B_{ij} and trace is cyclic: $tr(ABC) = tr(BCA) = tr(CAB)$ (when dimensions match).

∂/∂X tr(AX) = A^T
∂/∂X tr(X^T A) = A
∂/∂X tr(X A X^T) = X A^T + X A  (for symmetric A: 2XA)
∂/∂X tr(X^T A X) = A X + A^T X   (for symmetric A: 2AX)

Linear form:

$∂/∂X (a^T X b) = a b^T
$

Derivation: $a^T X b = Σ_i Σ_j a_i X_{ij} b_j$. $∂/∂X_{ij} = a_i b_j$. So the gradient matrix is a b^T.
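The trace and linear-form rules can be checked entry by entry with finite differences. This is a sketch in plain Python; the helper `numeric_matrix_grad` and all numeric values are assumptions made for the example.

```python
def matmul(P, Q):
    """Matrix product of P (p x k) and Q (k x q), both lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def trace(M):
    """Sum of the diagonal entries of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))

def shifted(X, i, j, delta):
    """Copy of matrix X with delta added to entry (i, j)."""
    Y = [row[:] for row in X]
    Y[i][j] += delta
    return Y

def numeric_matrix_grad(f, X, h=1e-6):
    """Entry (i, j) approximates d f / d X_ij by central differences."""
    return [[(f(shifted(X, i, j, h)) - f(shifted(X, i, j, -h))) / (2 * h)
             for j in range(len(X[0]))] for i in range(len(X))]

X = [[1.0, 2.0, 0.0], [-1.0, 0.5, 3.0]]      # 2 x 3
A = [[1.0, 0.0], [2.0, -1.0], [0.0, 4.0]]    # 3 x 2, so AX is square
a = [1.0, -2.0]
b = [3.0, 0.0, 1.0]

# d tr(AX) / dX should equal A^T (same 2 x 3 shape as X)
grad_trace = numeric_matrix_grad(lambda M: trace(matmul(A, M)), X)
A_T = [list(col) for col in zip(*A)]

# d (a^T X b) / dX should equal the outer product a b^T
bilinear = lambda M: sum(a[i] * M[i][j] * b[j]
                         for i in range(2) for j in range(3))
grad_bilinear = numeric_matrix_grad(bilinear, X)
outer_ab = [[ai * bj for bj in b] for ai in a]
```

In both cases the gradient has the same shape as X, which is why tr(AX) differentiates to A^T rather than A.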

Quadratic form with matrix:

$∂/∂X tr(X^T X) = 2X
∂/∂X ||X||_F² = 2X   (since ||X||_F² = tr(X^T X))
$

Determinant:

∂/∂X det(X) = det(X) (X^{-1})^T        (if X is invertible)
∂/∂X log det(X) = (X^{-1})^T           (for invertible X)
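For a 2 × 2 matrix the determinant and inverse have closed forms, so both determinant rules can be verified directly. The following plain-Python sketch uses a hypothetical non-symmetric X with positive determinant; the helper names are illustrative.

```python
import math

def det2(X):
    """Determinant of a 2 x 2 matrix."""
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]

def inv2(X):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    d = det2(X)
    return [[X[1][1] / d, -X[0][1] / d], [-X[1][0] / d, X[0][0] / d]]

def shifted(X, i, j, delta):
    """Copy of matrix X with delta added to entry (i, j)."""
    Y = [row[:] for row in X]
    Y[i][j] += delta
    return Y

def numeric_matrix_grad(f, X, h=1e-6):
    """Entry (i, j) approximates d f / d X_ij by central differences."""
    return [[(f(shifted(X, i, j, h)) - f(shifted(X, i, j, -h))) / (2 * h)
             for j in range(2)] for i in range(2)]

X = [[3.0, 1.0], [0.5, 2.0]]                 # non-symmetric, det = 5.5 > 0
X_inv_T = [list(col) for col in zip(*inv2(X))]

grad_logdet = numeric_matrix_grad(lambda M: math.log(det2(M)), X)
grad_det = numeric_matrix_grad(det2, X)
expected_det = [[det2(X) * v for v in row] for row in X_inv_T]
```

Since X is not symmetric here, X^{-T} and X^{-1} differ in their off-diagonal entries, and the finite-difference gradient matches the transposed inverse.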

Matrix inverse:

∂/∂X (a^T X^{-1} b) = -X^{-T} a b^T X^{-T}

Or more generally for matrix-valued functions: d(X^{-1}) = -X^{-1} (dX) X^{-1}.
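Both statements follow from differentiating X X^{-1} = I and then applying the cyclic property of the trace. A sketch of the steps, with C introduced here only as a name for the resulting gradient matrix:

```latex
\begin{aligned}
X X^{-1} = I \;&\Rightarrow\; (dX)\, X^{-1} + X \, d(X^{-1}) = 0
  \;\Rightarrow\; d(X^{-1}) = -X^{-1} (dX)\, X^{-1} \\
d\left(a^T X^{-1} b\right) &= -a^T X^{-1} (dX)\, X^{-1} b
  = -\operatorname{tr}\!\left(X^{-1} b\, a^T X^{-1}\, dX\right) \\
&= \operatorname{tr}\!\left(C^T dX\right), \qquad
C = -\left(X^{-1} b\, a^T X^{-1}\right)^T = -X^{-T} a\, b^T X^{-T}
\end{aligned}
```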

5. Chain Rule in Matrix Form

Suppose f(Y(X)) where X is an m×n matrix, Y is a p×q matrix, and f returns a scalar. The chain rule, written entry by entry, is:

$∂f/∂X_{ij} = Σ_{k=1}^{p} Σ_{l=1}^{q} (∂f/∂Y_{kl}) · (∂Y_{kl}/∂X_{ij})
$

Common Pitfall: Two conventions exist: numerator layout (Jacobian) and denominator layout (gradient). They are transposes of each other: d(Ax)/dx = A in numerator layout and A^T in denominator layout. Always check which convention a source uses.

In matrix form (numerator layout), for vector functions:

$∂h/∂x = (∂h/∂g) · (∂g/∂x)
$

where dimensions must align.

For neural network layers: $z = W x + b$, $a = σ(z)$, $L = loss(a)$:

$∂L/∂W = (∂L/∂a ⊙ σ'(z)) x^T       (outer product)
∂L/∂x = W^T (∂L/∂a ⊙ σ'(z))
$

This is the essence of backpropagation.
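The two layer formulas can be exercised on a small example. The sketch below assumes a sigmoid activation and a squared-error loss L = ½||a - y||² purely for illustration (the text leaves σ and the loss generic), and compares one entry of each analytic gradient with a central-difference estimate. All numeric values are made up.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(W, b, x, y):
    """z = W x + b, a = sigmoid(z), L = 0.5 * ||a - y||^2 (assumed loss)."""
    z = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    a = [sigmoid(zi) for zi in z]
    loss = 0.5 * sum((ai - yi) ** 2 for ai, yi in zip(a, y))
    return z, a, loss

def backward(W, b, x, y):
    """Return dL/dW = delta x^T and dL/dx = W^T delta."""
    _, a, _ = forward(W, b, x, y)
    dL_da = [ai - yi for ai, yi in zip(a, y)]
    # sigmoid'(z) = a * (1 - a); delta = dL/da ⊙ sigmoid'(z)
    delta = [g * ai * (1.0 - ai) for g, ai in zip(dL_da, a)]
    dL_dW = [[d * xj for xj in x] for d in delta]
    dL_dx = [sum(W[i][j] * delta[i] for i in range(len(W)))
             for j in range(len(x))]
    return dL_dW, dL_dx

W = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.5]]
b = [0.1, -0.2]
x = [1.0, 2.0, -1.0]
y = [0.0, 1.0]

dL_dW, dL_dx = backward(W, b, x, y)

# Central-difference check of the entries W[1][2] and x[0]
h = 1e-6
W_plus = [row[:] for row in W]
W_plus[1][2] += h
W_minus = [row[:] for row in W]
W_minus[1][2] -= h
fd_W12 = (forward(W_plus, b, x, y)[2] - forward(W_minus, b, x, y)[2]) / (2 * h)

x_plus, x_minus = x[:], x[:]
x_plus[0] += h
x_minus[0] -= h
fd_x0 = (forward(W, b, x_plus, y)[2] - forward(W, b, x_minus, y)[2]) / (2 * h)
```

Note that dL/dW has the shape of W (2 × 3) and dL/dx has the shape of x, as the outer product and W^T forms require.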

6. Common ML Gradient Patterns

| Expression | Gradient w.r.t. | Result |
|---|---|---|
| $\lVert Ax - b \rVert^2$ | x | $2 A^T (Ax - b)$ |
| $x^T A x$ | x | $(A + A^T) x$ |
| $log det(X)$ | X | $X^{-T}$ (for symmetric X: $X^{-1}$) |
| $\lVert X \rVert_F^2$ | X | $2X$ |
| $\lVert W \rVert_F^2$ (L2 regularization) | W | $2W$ |
| $-log softmax(z)_y$ (cross-entropy) | z | $softmax(z) - e_y$ |
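The cross-entropy pattern is the one most often relied on in practice, so it is worth a numerical check. This is a minimal plain-Python sketch; the logits `z` and the class index `y` are arbitrary illustrative values.

```python
import math

def softmax(z):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(z)
    exps = [math.exp(t - m) for t in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """-log softmax(z)_y for an integer class index y."""
    return -math.log(softmax(z)[y])

z = [2.0, -1.0, 0.5]
y = 2

# Analytic gradient: softmax(z) - e_y, where e_y is the one-hot vector for y
p = softmax(z)
analytic = [p_k - (1.0 if k == y else 0.0) for k, p_k in enumerate(p)]

# Central-difference gradient for comparison
h = 1e-6
numeric = []
for k in range(len(z)):
    z_plus, z_minus = z[:], z[:]
    z_plus[k] += h
    z_minus[k] -= h
    numeric.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * h))
```

The components of the gradient sum to zero, because the softmax probabilities sum to one and e_y has a single unit entry.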

Key Terms

Jacobian matrix

Worked Examples

Example 1: Gradient of Linear Regression Loss

Given X ∈ ℝ^{m×n}, w ∈ ℝ^n, y ∈ ℝ^m:

$L(w) = ||X w - y||² = (Xw - y)^T (Xw - y)
$

Expand:

$L = w^T X^T X w - 2 y^T X w + y^T y
$

Using $∂/∂w (w^T A w) = (A + A^T) w$ and $∂/∂w (b^T w) = b$:

$∇L = (X^T X + X^T X) w - 2 X^T y = 2 X^T X w - 2 X^T y = 2 X^T (X w - y)
$

Setting to zero: $X^T X w = X^T y$ (normal equations).
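The derivation can be checked on a tiny data set by solving the normal equations and confirming that the gradient vanishes at the solution. The sketch below uses plain Python and Cramer's rule for the 2 × 2 system; the data values are made up for illustration.

```python
# Three samples, two parameters: a column of ones plus one feature
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 2.0]

def grad_loss(w):
    """Gradient 2 X^T (X w - y) of L(w) = ||X w - y||^2."""
    residual = [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) - y_i
                for row, y_i in zip(X, y)]
    return [2.0 * sum(X[i][j] * residual[i] for i in range(len(X)))
            for j in range(len(w))]

# Normal equations X^T X w = X^T y
XtX = [[sum(X[i][p] * X[i][q] for i in range(3)) for q in range(2)]
       for p in range(2)]
Xty = [sum(X[i][p] * y[i] for i in range(3)) for p in range(2)]

# Solve the 2 x 2 system with Cramer's rule
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
w_star = [(Xty[0] * XtX[1][1] - XtX[0][1] * Xty[1]) / det,
          (XtX[0][0] * Xty[1] - XtX[1][0] * Xty[0]) / det]

print(w_star, grad_loss(w_star))   # w_star is approximately [0.6667, 0.5]
```

Cramer's rule is used only to keep the example dependency-free; for larger problems a linear solver or a QR-based least-squares routine would be the usual choice.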

Example 2: Gradient of log det

Let $f(X) = log det(X)$ where X is SPD (or, more generally, invertible with det(X) > 0 so that the logarithm is defined).

We can use the trace trick. For small perturbation dX:

$f(X + dX) = log det(X + dX)
          = log det(X (I + X^{-1} dX))
          = log det(X) + log det(I + X^{-1} dX)
          ≈ f(X) + tr(X^{-1} dX)      (since log det(I + εA) ≈ tr(εA))
$

Thus: $df = tr(X^{-1} dX) = tr((X^{-T})^T dX)$, so ∂f/∂X = X^{-T}.

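If X is symmetric and all entries are treated as independent, then X^{-T} = X^{-1}, so ∂f/∂X = X^{-1}. If instead only the independent entries (the diagonal and one triangle) are varied, each off-diagonal entry appears twice in X and the result is $2X^{-1} - diag(X^{-1})$.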

Example 3: Gradient of a Neural Network Layer

Forward: $z = W x$, $a = ReLU(z)$, $L = ½||a - y||²$.

Backward:

∂L/∂a = a - y
∂L/∂z = (∂L/∂a) ⊙ ReLU'(z)  where ReLU'(z) = 1 if z > 0, else 0
∂L/∂W = (∂L/∂z) x^T           (outer product: m×1 · 1×n = m×n)
∂L/∂x = W^T (∂L/∂z)

For $W = [[1,2],[3,4]]$, $x = [1,1]^T$, $y = [5,11]^T$:
- $z = Wx = [3, 7]^T$. Both positive, so $ReLU'(z) = [1, 1]^T$.
- $a = [3, 7]^T$.
- $∂L/∂a = [-2, -4]^T$.
- $∂L/∂z = [-2, -4]^T$ (element-wise product with $[1,1]^T$).
- $∂L/∂W = [-2,-4]^T [1,1] = [[-2,-2],[-4,-4]]$.
- $∂L/∂x = W^T (∂L/∂z) = [[1,3],[2,4]] [-2,-4]^T = [-14, -20]^T$.
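The numbers in this example can be reproduced with a few lines of plain Python. This is a direct transcription of the forward and backward steps above; only the variable names are new.

```python
W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
y = [5.0, 11.0]

# Forward pass: z = W x, a = ReLU(z)
z = [sum(w * xj for w, xj in zip(row, x)) for row in W]
a = [max(0.0, zi) for zi in z]

# Backward pass for L = 0.5 * ||a - y||^2
dL_da = [ai - yi for ai, yi in zip(a, y)]
relu_grad = [1.0 if zi > 0 else 0.0 for zi in z]
dL_dz = [g * r for g, r in zip(dL_da, relu_grad)]
dL_dW = [[d * xj for xj in x] for d in dL_dz]              # (dL/dz) x^T
dL_dx = [sum(W[i][j] * dL_dz[i] for i in range(2))         # W^T (dL/dz)
         for j in range(2)]

print(dL_dW)   # [[-2.0, -2.0], [-4.0, -4.0]]
print(dL_dx)   # [-14.0, -20.0]
```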

Example 4: Product Rule Verification

Let $f(x) = x^T A x$. Compute $∇f$ using the product rule.

Rewrite: $f(x) = (x)^T (A x)$. Treat as u(x)^T v(x) where $u(x) = x$, $v(x) = A x$.

Product rule: $∇(u^T v) = (∂u/∂x)^T v + (∂v/∂x)^T u = I · A x + A^T · x = A x + A^T x$.

Matches the direct computation. ✓

Quiz

Q1: What does the Jacobian matrix refer to in this subject?

A) A visual representation of a vector-valued function B) A historical notation with no computational use C) The m × n matrix of first partial derivatives $J_{ij} = ∂f_i/∂x_j$ of a function f: ℝ^n → ℝ^m D) A computational error that arises when differentiating vectors

Correct: C)

If you chose A: This is incorrect. The Jacobian of f: ℝ^n → ℝ^m is the m × n matrix with entries $J_{ij} = ∂f_i/∂x_j$. It is a matrix of derivatives, not a picture.
If you chose B: This is incorrect. The Jacobian of f: ℝ^n → ℝ^m is the m × n matrix with entries $J_{ij} = ∂f_i/∂x_j$. It is used directly in the chain rule and in backpropagation.
If you chose C: Correct! The Jacobian of f: ℝ^n → ℝ^m is the m × n matrix with entries $J_{ij} = ∂f_i/∂x_j$, for example $∂(Ax)/∂x = A$.
If you chose D: This is incorrect. The Jacobian of f: ℝ^n → ℝ^m is the m × n matrix with entries $J_{ij} = ∂f_i/∂x_j$. It is a well-defined object, not an error.

Q2: What is the primary purpose of the Common Pitfall note in this subject?

A) It replaces all other methods in this domain B) It is primarily a historical notation system C) It warns that numerator and denominator layouts are transposes of each other, so the convention must always be checked D) It is used only in advanced research contexts

Correct: C)

If you chose A: This is incorrect. The note does not replace any method. It warns that the two layout conventions give transposed results, e.g. d(Ax)/dx = A in numerator layout and A^T in denominator layout.
If you chose B: This is incorrect. The note is not about history. It warns that the two layout conventions give transposed results, e.g. d(Ax)/dx = A in numerator layout and A^T in denominator layout.
If you chose C: Correct! The note warns that the two layout conventions give transposed results, e.g. d(Ax)/dx = A in numerator layout and A^T in denominator layout.
If you chose D: This is incorrect. The warning applies to every derivation in this subject, not only to research contexts: the two layout conventions give transposed results.

Q3: Based on the worked examples in this subject, what is ∂ log det(X)/∂X for a symmetric invertible X when all entries are treated as independent?

A) $X^{-1}$ B) A scalar such as $det(X)$ C) $det(X) X^{-1}$, which is the gradient of det(X) rather than of log det(X) D) $X$, the inverse of the correct answer

Correct: A)

If you chose A: Correct! Example 2 shows that ∂ log det(X)/∂X = X^{-T}, which equals X^{-1} when X is symmetric.
If you chose B: This is incorrect. The gradient with respect to a matrix has the same shape as X. Example 2 shows that the result is X^{-T}, which equals X^{-1} when X is symmetric.
If you chose C: This is incorrect. The factor det(X) belongs to the gradient of det(X). Example 2 shows that the gradient of log det(X) is X^{-T}, which equals X^{-1} when X is symmetric.
If you chose D: This is incorrect. Example 2 shows that the result is X^{-T}, which equals X^{-1} when X is symmetric, not X itself.

Practice Problems

1. Compute $∇_x f$ where $f(x) = ||Ax||² = x^T A^T A x$. Use the quadratic form gradient formula.
2. Find the Jacobian $∂f/∂x$ for $f(x) = [x₁² + x₂, sin(x₁ x₂)]^T$.
3. Compute $∂/∂X tr(A X B)$ where A, B are constant matrices.
4. Derive $∂/∂W ||W X - Y||_F²$ with respect to W. (X and Y are constant matrices.)
5. For $f(X) = det(X)$, use the trace approximation method to derive $∂f/∂X = det(X) (X^{-1})^T$.
6. Compute $∇_x log(1 + exp(w^T x))$ (the gradient of the softplus function applied to w^T x).

Answers

1. Using `∂/∂x (x^T M x) = (M + M^T) x` with `M = A^T A`: since A^T A is symmetric, ∇f = 2 A^T A x.
2. f₁ = x₁² + x₂, f₂ = sin(x₁ x₂). ∂f₁/∂x₁ = 2x₁, ∂f₁/∂x₂ = 1. ∂f₂/∂x₁ = x₂ cos(x₁ x₂), ∂f₂/∂x₂ = x₁ cos(x₁ x₂). J = [[2x₁, 1], [x₂ cos(x₁ x₂), x₁ cos(x₁ x₂)]].
3. `tr(A X B) = tr(B A X)` (cyclic property). `∂/∂X tr(C X) = C^T` where C = B A. So `∂/∂X tr(A X B) = (B A)^T = A^T B^T`.
4. `f(W) = ||W X - Y||_F² = tr((WX - Y)^T (WX - Y)) = tr(X^T W^T W X) - 2 tr(Y^T W X) + tr(Y^T Y)`. Using `∂/∂W tr(W M W^T) = W M^T + W M` with `M = X X^T`, which is symmetric: `∂/∂W tr(X^T W^T W X) = ∂/∂W tr(W X X^T W^T) = 2 W X X^T`. `∂/∂W (-2 tr(Y^T W X)) = -2 ∂/∂W tr(X Y^T W) = -2 (X Y^T)^T = -2 Y X^T`. So: `∂f/∂W = 2 W X X^T - 2 Y X^T = 2 (W X - Y) X^T`.
5. For small dX: `det(X + dX) ≈ det(X) + det(X) tr(X^{-1} dX)`. So `df = det(X) tr(X^{-1} dX) = tr(det(X) (X^{-T})^T dX)`. Therefore: `∂ det(X)/∂X = det(X) X^{-T}`.
6. `f(x) = log(1 + exp(w^T x))`. Let `s = w^T x`. `∂f/∂s = exp(s)/(1 + exp(s)) = σ(s)` (the sigmoid function). By the chain rule: `∇_x f = (∂f/∂s) ∇_x (w^T x) = σ(w^T x) · w`.
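Answers of this kind are easy to verify numerically. As one example, the sketch below checks the Jacobian from answer 2 against central differences at an arbitrarily chosen point; the helper names and the evaluation point are illustrative.

```python
import math

def f(x):
    """f(x) = [x1^2 + x2, sin(x1 x2)] from practice problem 2."""
    return [x[0] ** 2 + x[1], math.sin(x[0] * x[1])]

def jacobian_analytic(x):
    """Jacobian from answer 2, with J[i][j] = d f_i / d x_j."""
    c = math.cos(x[0] * x[1])
    return [[2 * x[0], 1.0], [x[1] * c, x[0] * c]]

def shifted(x, k, delta):
    """Copy of x with delta added to component k."""
    y = list(x)
    y[k] += delta
    return y

def jacobian_numeric(func, x, h=1e-6):
    """Central-difference Jacobian in the same m x n layout."""
    cols = []
    for j in range(len(x)):
        fp, fm = func(shifted(x, j, h)), func(shifted(x, j, -h))
        cols.append([(p - q) / (2 * h) for p, q in zip(fp, fm)])
    # cols[j][i] holds d f_i / d x_j, so transpose to get J[i][j]
    return [list(row) for row in zip(*cols)]

point = [0.5, 2.0]
J_exact = jacobian_analytic(point)
J_approx = jacobian_numeric(f, point)
```

The same finite-difference helper can be pointed at any of the other answers by swapping in the corresponding function.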

Summary

Scalar-vector gradient: $∇_x f = [∂f/∂x₁, ..., ∂f/∂x_n]^T$. Key formulas: $∂(a^T x)/∂x = a$, $∂(x^T A x)/∂x = (A+A^T)x$
Vector-vector Jacobian: $J_{ij} = ∂f_i/∂x_j$. Key: $∂(Ax)/∂x = A$
Scalar-matrix derivatives use trace tricks: $tr(A^T B) = vec(A)^T vec(B)$. Key: $∂ tr(AX)/∂X = A^T$, ∂ log det(X)/∂X = X^{-T}
Matrix chain rule generalizes scalar chain rule: $∂h/∂x = (∂h/∂g)(∂g/∂x)$ with careful dimension alignment
Neural network gradients (e.g., $∂L/∂W = δ x^T$) are direct applications of matrix calculus — essential for understanding backpropagation

Pitfalls

Mixing numerator and denominator layout conventions. They are transposes of each other: $∂(Ax)/∂x = A$ in numerator layout but A^T in denominator layout. Pick one convention and apply it consistently throughout a derivation.
Forgetting the transpose in the quadratic form gradient. $∂(x^T A x)/∂x = (A + A^T)x$, not 2Ax, unless A is symmetric. The transpose contribution matters — dropping it gives wrong gradients for non-symmetric matrices.
Treating trace derivative rules as matrix product rules. $∂ tr(AX)/∂X = A^T$, not A. The transpose follows from the trace inner product: if $df = tr(C^T dX)$, then $∂f/∂X = C$, and writing $tr(A dX) = tr((A^T)^T dX)$ gives C = A^T.
Misapplying the chain rule in matrix form. Dimensions must align: $∂h/∂x = (∂h/∂g)(∂g/∂x)$. For scalar-vector-vector compositions, the Jacobian of the outer function must multiply from the left.
Assuming ∂ log det(X)/∂X = X^{-1} for all matrices. For symmetric X, the correct gradient accounting for symmetry constraints is $2X^{-1} - diag(X^{-1})$. For general invertible X (treating all entries independently), it's X^{-T}.

Next Steps

Congratulations! You have completed Phase 9 (Matrix Decompositions & Advanced Linear Algebra). This concludes the core linear algebra sequence. Your next steps depend on your learning path:

For optimization: Continue to Phase 14 (Optimization Theory) to build on matrix calculus
For deep learning: Continue to Phase 16 (Neural Network Foundations) where these matrix calculus skills are directly applied
For numerical methods: Topics from 09-09 (Numerical Linear Algebra) are expanded in Phase 15 (Numerical Methods for ML)
For review: Return to Phase 8's diagonalization section to solidify the eigendecomposition intuition before moving to applications
