09-10: Matrix Calculus

Phase: 9 (Matrix Decompositions & Advanced Linear Algebra)
Subject: 09-10
Prerequisites: 09-09 (Numerical Linear Algebra)
Next subject: 10-01 (Probability Foundations)


Learning Objectives

  1. Compute derivatives of scalar-valued functions with respect to vectors (gradients), keeping track of the layout convention (numerator vs. denominator)
  2. Differentiate vector-valued functions of scalars and compute derivatives of vectors with respect to vectors (Jacobians)
  3. Differentiate scalar functions with respect to matrices, including trace-based differentiation tricks
  4. Apply the matrix chain rule and product rule to composite expressions
  5. Derive gradients of common ML functions: linear forms, quadratic forms, log-determinant, and matrix inverse

Core Content

Matrix calculus is the systematic extension of scalar calculus to functions involving vectors and matrices. It is foundational for optimization (gradient descent requires gradients with respect to parameters), backpropagation, and many ML derivations.

CRITICAL -- Foundational: Matrix calculus extends scalar calculus to matrix variables. Key identities: d(x^T A x)/dx = (A+A^T)x, d(tr(AX))/dX = A^T. Used everywhere in optimization and ML backprop.

1. Derivative of Scalar with Respect to Vector (Gradient)

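Let f: ℝ^n → ℝ. The gradient is a column vector with the same shape as x:

$∇_x f = ∂f/∂x = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂x_n]^T
$

Numerator vs. denominator layout: Two competing conventions exist:
- Numerator layout: The derivative takes its shape from the numerator. The derivative of a scalar with respect to a column vector is a row vector, and the Jacobian of f: ℝ^n → ℝ^m is m × n.
- Denominator layout: The derivative takes its shape from the denominator. The derivative of a scalar with respect to a column vector is a column vector, and the vector-by-vector derivative is n × m (the transpose of the Jacobian).

This document writes gradients of scalar functions as column vectors (∇_x f, the transpose of the numerator-layout row vector) and Jacobians of vector-valued functions as m × n matrices (numerator layout).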

Key derivatives:

$∂/∂x (a^T x) = a                         (a is a constant vector)
∂/∂x (x^T a) = a
∂/∂x (x^T x) = 2x                        (since ∂/∂x_i Σ_j x_j² = 2x_i)
∂/∂x (x^T A x) = (A + A^T)x              (quadratic form)
$

If A is symmetric: $∂/∂x (x^T A x) = 2Ax$.
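These identities can be sanity-checked numerically. The following is a minimal sketch in Python with NumPy; the helper `numerical_gradient`, the random test data, and the step size `h` are illustrative choices rather than part of the lesson.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x, dtype=float)
    for k in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[k] = h
        grad[k] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))   # deliberately not symmetric
x = rng.standard_normal(n)

# Each entry pairs a scalar function with its closed-form gradient at x.
checks = {
    "a^T x":   (lambda v: a @ v,     a),
    "x^T x":   (lambda v: v @ v,     2 * x),
    "x^T A x": (lambda v: v @ A @ v, (A + A.T) @ x),
}

max_errors = {}
for name, (f, analytic) in checks.items():
    max_errors[name] = np.max(np.abs(numerical_gradient(f, x) - analytic))
    print(f"{name}: max abs error = {max_errors[name]:.2e}")
```

Using a non-symmetric A matters here: with a symmetric A the check could not distinguish (A + A^T)x from 2Ax.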

Derivation of quadratic form gradient:

$x^T A x = Σ_i Σ_j x_i A_{ij} x_j

∂/∂x_k (x^T A x) = ∂/∂x_k (Σ_i Σ_j x_i A_{ij} x_j)
$

Terms involving x_k:
- When i = k: A_{kj} x_j, from $∂(x_k A_{kj} x_j)/∂x_k = A_{kj} x_j$
- When j = k: x_i A_{ik}, from $∂(x_i A_{ik} x_k)/∂x_k = x_i A_{ik}$

The diagonal term i = j = k, $A_{kk} x_k²$, has derivative $2 A_{kk} x_k$, so it contributes once to each of the two sums below.

$∂/∂x_k (x^T A x) = Σ_j A_{kj} x_j + Σ_i x_i A_{ik}
                 = (Ax)_k + (A^T x)_k
$

So: $∇_x (x^T A x) = Ax + A^T x = (A + A^T)x$.

Chain rule for scalar-vector: For the composition f(g(x)), where g: ℝ^m → ℝ^n and f: ℝ^n → ℝ:

$∇_x f = (∂g/∂x)^T ∇_g f
$

where $∂g/∂x$ is the Jacobian (n × m matrix) and $∇_g f$ is evaluated at g(x).
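As an illustration of this chain rule, the sketch below uses the hypothetical choices g(x) = tanh(Bx) and f(u) = ½‖u‖² (neither appears in the lesson) and compares (∂g/∂x)^T ∇_g f with a central-difference gradient of the composition.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
B = rng.standard_normal((n, m))   # g maps R^m -> R^n
x = rng.standard_normal(m)

def g(v):
    return np.tanh(B @ v)

def f(u):
    return 0.5 * u @ u

# Analytic pieces of the chain rule.
u = g(x)
jacobian_g = (1.0 - u**2)[:, None] * B     # n x m, since d tanh(s)/ds = 1 - tanh(s)^2
grad_f_wrt_g = u                           # gradient of 0.5 * ||u||^2 is u
grad_chain = jacobian_g.T @ grad_f_wrt_g   # (dg/dx)^T grad_g f, length m

# Central-difference gradient of the composition for comparison.
h = 1e-6
grad_fd = np.array([
    (f(g(x + h * e)) - f(g(x - h * e))) / (2 * h)
    for e in np.eye(m)
])

chain_rule_error = np.max(np.abs(grad_chain - grad_fd))
print(chain_rule_error)
```

The transpose is what turns the n × m Jacobian into a map from the length-n gradient with respect to g back to a length-m gradient with respect to x.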

2. Derivative of Vector with Respect to Scalar

Let f: ℝ → ℝ^m. Then $∂f/∂t$ is an m × 1 vector:

$∂f/∂t = [∂f₁/∂t, ∂f₂/∂t, ..., ∂f_m/∂t]^T
$

Key example (matrix exponential flow): If $x(t) = e^{At} x₀$, then $dx/dt = A e^{At} x₀ = A x(t)$.
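A small numerical check of the flow identity, as a sketch under one simplifying assumption: A is chosen to be the 2 × 2 rotation generator, for which e^{At} has a known closed form (a rotation by angle t), so no general matrix-exponential routine is needed. The names `expm_rotation`, `x0`, and the sample time are illustrative.

```python
import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])      # generator of planar rotations
x0 = np.array([1.0, 2.0])

def expm_rotation(t):
    """Closed form of e^{At} for the specific A above (rotation by angle t)."""
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def x(t):
    return expm_rotation(t) @ x0

t, h = 0.7, 1e-6
dx_dt_fd = (x(t + h) - x(t - h)) / (2 * h)   # finite-difference derivative in t
dx_dt_analytic = A @ x(t)                    # the claimed derivative A x(t)

flow_error = np.max(np.abs(dx_dt_fd - dx_dt_analytic))
print(flow_error)
```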

3. Derivative of Vector with Respect to Vector (Jacobian)

Let f: ℝ^n → ℝ^m. The Jacobian matrix (numerator layout) is m × n:

$J_{ij} = ∂f_i / ∂x_j
$
$∂f/∂x = [∂f₁/∂x₁  ∂f₁/∂x₂  ...  ∂f₁/∂x_n]
        [∂f₂/∂x₁  ∂f₂/∂x₂  ...  ∂f₂/∂x_n]
        [  ...       ...     ...    ...   ]
        [∂f_m/∂x₁ ∂f_m/∂x₂ ... ∂f_m/∂x_n]
$

Key Jacobians:

$∂/∂x (Ax) = A            (linear transformation)
∂/∂x (x^T A) = A^T       (A is constant; the row vector x^T A is read as the column vector A^T x)
$
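To make the m × n shape convention concrete, here is a minimal sketch of a central-difference Jacobian. The helper `numerical_jacobian` and the matrices `A` and `C` are hypothetical names introduced for this check.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian with J[i, j] = d f_i / d x_j (shape m x n)."""
    columns = []
    for j in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[j] = h
        columns.append((f(x + step) - f(x - step)) / (2 * h))
    return np.stack(columns, axis=1)

rng = np.random.default_rng(2)
m, n = 3, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Jacobian of the linear map x -> A x should be A itself.
J_linear = numerical_jacobian(lambda v: A @ v, x)
print(J_linear.shape)                 # (3, 4): rows follow f, columns follow x
jacobian_error = np.max(np.abs(J_linear - A))

# x^T C, read as the column vector C^T x, should have Jacobian C^T.
C = rng.standard_normal((n, m))
J_row = numerical_jacobian(lambda v: v @ C, x)
row_error = np.max(np.abs(J_row - C.T))
print(jacobian_error, row_error)
```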

Product rule for vector-vector: If $h(x) = f(x) ⊙ g(x)$ where ⊙ is the element-wise product:

$∂h/∂x = diag(g) · ∂f/∂x + diag(f) · ∂g/∂x
$
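A quick check of the element-wise product rule, sketched with the assumed linear choices f(x) = Ax and g(x) = Bx so that both Jacobians are known exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def f(v):
    return A @ v          # Jacobian is A

def g(v):
    return B @ v          # Jacobian is B

def h_fn(v):
    return f(v) * g(v)    # element-wise product

# Product rule: diag(g) @ df/dx + diag(f) @ dg/dx
J_rule = np.diag(g(x)) @ A + np.diag(f(x)) @ B

# Central-difference Jacobian, one column per input coordinate.
eps = 1e-6
J_fd = np.stack(
    [(h_fn(x + eps * e) - h_fn(x - eps * e)) / (2 * eps) for e in np.eye(n)],
    axis=1,
)

product_rule_error = np.max(np.abs(J_rule - J_fd))
print(product_rule_error)
```

In larger problems the product `np.diag(g(x)) @ A` is usually written as `g(x)[:, None] * A`, which scales the rows of A without building the diagonal matrix.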

4. Derivative of Scalar with Respect to Matrix

Let $f: ℝ^{m×n} → ℝ$. The gradient is an m × n matrix:

(∂f/∂X)_{ij} = ∂f/∂X_{ij}

Key matrix derivatives:

Trace tricks: The trace is useful because tr(A^T B) = Σ_{i,j} A_{ij} B_{ij} is the matrix inner product, so a differential written as df = tr(G^T dX) identifies the gradient ∂f/∂X = G. The trace is also cyclic: $tr(ABC) = tr(BCA) = tr(CAB)$ (when dimensions match).

∂/∂X tr(AX) = A^T
∂/∂X tr(X^T A) = A
∂/∂X tr(X A X^T) = X A^T + X A  (for symmetric A: 2XA)
∂/∂X tr(X^T A X) = A X + A^T X   (for symmetric A: 2AX)
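The four trace identities can be verified entry by entry. This is a sketch only: `numerical_matrix_gradient` is a hypothetical helper, and the constant matrices are given separate names (`A`, `B`, `S`, `T`) because each identity needs a different shape.

```python
import numpy as np

def numerical_matrix_gradient(f, X, h=1e-6):
    """Central-difference gradient G with G[i, j] = d f / d X[i, j]."""
    G = np.zeros_like(X, dtype=float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X, dtype=float)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(4)
m, n = 3, 2
X = rng.standard_normal((m, n))
A = rng.standard_normal((n, m))    # tr(A X): A is n x m
B = rng.standard_normal((m, n))    # tr(X^T B): B is m x n
S = rng.standard_normal((n, n))    # tr(X S X^T): S is n x n
T = rng.standard_normal((m, m))    # tr(X^T T X): T is m x m

# Each pair: scalar function of Z, and the closed-form gradient at Z = X.
cases = [
    (lambda Z: np.trace(A @ Z),       A.T),
    (lambda Z: np.trace(Z.T @ B),     B),
    (lambda Z: np.trace(Z @ S @ Z.T), X @ S.T + X @ S),
    (lambda Z: np.trace(Z.T @ T @ Z), T @ X + T.T @ X),
]
trace_errors = [
    np.max(np.abs(numerical_matrix_gradient(f, X) - G)) for f, G in cases
]
print(trace_errors)
```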

Linear form:

$∂/∂X (a^T X b) = a b^T
$

Derivation: $a^T X b = Σ_i Σ_j a_i X_{ij} b_j$, so $∂(a^T X b)/∂X_{ij} = a_i b_j$. The gradient matrix is therefore a b^T.

Quadratic form with matrix:

$∂/∂X tr(X^T X) = 2X
∂/∂X ||X||_F² = 2X   (since ||X||_F² = tr(X^T X))
$

Determinant:

∂/∂X det(X) = det(X) (X^{-1})^T        (if X is invertible)
∂/∂X log det(X) = (X^{-1})^T           (for invertible X with det(X) > 0)
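Both determinant formulas can be checked with finite differences. In this sketch the test matrix is an assumption: a small non-symmetric perturbation of the identity, chosen so that det(X) > 0 and the transpose in X^{-T} is actually exercised. `np.linalg.slogdet` returns the sign and the log of the absolute determinant; only the second value is used.

```python
import numpy as np

def numerical_matrix_gradient(f, X, h=1e-6):
    """Central-difference gradient G with G[i, j] = d f / d X[i, j]."""
    G = np.zeros_like(X, dtype=float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X, dtype=float)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(5)
n = 3
X = np.eye(n) + 0.05 * rng.standard_normal((n, n))   # non-symmetric, close to I
X_inv_T = np.linalg.inv(X).T

grad_det = numerical_matrix_gradient(np.linalg.det, X)
grad_logdet = numerical_matrix_gradient(lambda Z: np.linalg.slogdet(Z)[1], X)

det_error = np.max(np.abs(grad_det - np.linalg.det(X) * X_inv_T))
logdet_error = np.max(np.abs(grad_logdet - X_inv_T))
print(det_error, logdet_error)
```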

Matrix inverse:

∂/∂X (a^T X^{-1} b) = -X^{-T} a b^T X^{-T}

More generally, for the matrix-valued function X^{-1}, the differential is d(X^{-1}) = -X^{-1} (dX) X^{-1}.
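The differential of the inverse and the scalar gradient that follows from it can both be checked numerically. This is a sketch under assumed test data (a well-conditioned X near the identity and a perturbation of size about 1e-6); the names `scalar_fn`, `dX`, and the tolerances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
X = np.eye(n) + 0.05 * rng.standard_normal((n, n))   # well-conditioned, invertible
dX = 1e-6 * rng.standard_normal((n, n))              # small perturbation
a = rng.standard_normal(n)
b = rng.standard_normal(n)

X_inv = np.linalg.inv(X)

# Differential: inv(X + dX) - inv(X) should be close to -inv(X) dX inv(X).
actual_change = np.linalg.inv(X + dX) - X_inv
predicted_change = -X_inv @ dX @ X_inv
differential_error = np.max(np.abs(actual_change - predicted_change))

# Gradient of the scalar a^T X^{-1} b via central differences.
def scalar_fn(Z):
    return a @ np.linalg.solve(Z, b)

h = 1e-6
grad_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = h
        grad_fd[i, j] = (scalar_fn(X + E) - scalar_fn(X - E)) / (2 * h)

grad_analytic = -X_inv.T @ np.outer(a, b) @ X_inv.T
gradient_error = np.max(np.abs(grad_fd - grad_analytic))
print(differential_error, gradient_error)
```

The first error is second order in the size of dX, which is why it is far smaller than the perturbation itself.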

5. Chain Rule in Matrix Form

Suppose f(Y(X)) where X is an m×n matrix, Y is a p×q matrix, and f returns a scalar. The chain rule is:

$∂f/∂X_{ij} = Σ_{k=1}^{p} Σ_{l=1}^{q} (∂f/∂Y_{kl}) · (∂Y_{kl}/∂X_{ij})
$

In matrix form (numerator layout), for vector functions h(g(x)):

$∂h/∂x = (∂h/∂g) · (∂g/∂x)
$

where the Jacobian dimensions must align: if g: ℝ^n → ℝ^k and h: ℝ^k → ℝ^m, then ∂h/∂g is m × k, ∂g/∂x is k × n, and the product is m × n.

Common Pitfall: Two conventions exist: numerator layout (Jacobian) and denominator layout (gradient). They are transposes of each other: d(Ax)/dx = A in numerator layout, A^T in denominator layout. Always check which convention a source uses.

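For neural network layers: $z = W x + b$, $a = σ(z)$, $L = loss(a)$:

$∂L/∂W = (∂L/∂a ⊙ σ'(z)) x^T       (outer product)
∂L/∂x = W^T (∂L/∂a ⊙ σ'(z))
$

Here $∂L/∂a ⊙ σ'(z)$ equals $∂L/∂z$ (σ acts element-wise), and it is also the bias gradient $∂L/∂b$. Reusing this one vector for every downstream gradient is the essence of backpropagation.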
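The layer formulas translate almost line for line into code. The sketch below assumes a sigmoid activation and a squared-error loss L = ½‖a − y‖²; both are illustrative choices, since the text leaves σ and the loss generic, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, b, x, y):
    z = W @ x + b
    a = sigmoid(z)
    loss = 0.5 * np.sum((a - y) ** 2)   # assumed loss for this sketch
    return z, a, loss

def backward(W, x, y, a):
    dL_da = a - y                        # gradient of the assumed squared-error loss
    delta = dL_da * a * (1.0 - a)        # dL/dz: dL/da times sigma'(z), element-wise
    dL_dW = np.outer(delta, x)           # delta x^T
    dL_db = delta
    dL_dx = W.T @ delta
    return dL_dW, dL_db, dL_dx

rng = np.random.default_rng(7)
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)
x = rng.standard_normal(3)
y = rng.standard_normal(2)

_, a, _ = forward(W, b, x, y)
dL_dW, dL_db, dL_dx = backward(W, x, y, a)

# Finite-difference check of a single weight entry.
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += h
W_minus[0, 1] -= h
fd_entry = (forward(W_plus, b, x, y)[2] - forward(W_minus, b, x, y)[2]) / (2 * h)
weight_check_error = abs(fd_entry - dL_dW[0, 1])
print(weight_check_error)
```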

6. Common ML Gradient Patterns

| Expression | Gradient w.r.t. | Result |
|---|---|---|
| $‖Ax - b‖²$ | x | $2 A^T (Ax - b)$ |
| $x^T A x$ | x | $(A + A^T) x$ |
| $log det(X)$ | X | $X^{-T}$ (for symmetric X: $X^{-1}$) |
| $‖X‖_F²$ | X | $2X$ |
| $‖W‖_F²$ (L2 reg) | W | $2W$ |
| $-log softmax(z)_y$ (cross-entropy) | z | $softmax(z) - e_y$ |
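The cross-entropy pattern is the one most often implemented by hand, so a short check may help. In this sketch e_y is taken to be the one-hot vector for class y, and the logits `z` and label `y` are made-up example values.

```python
import numpy as np

def log_softmax(z):
    shifted = z - np.max(z)              # subtract the max for numerical stability
    return shifted - np.log(np.sum(np.exp(shifted)))

def cross_entropy(z, y):
    return -log_softmax(z)[y]

z = np.array([2.0, -1.0, 0.5, 0.0])      # example logits
y = 2                                    # index of the true class

one_hot = np.zeros_like(z)
one_hot[y] = 1.0
grad_analytic = np.exp(log_softmax(z)) - one_hot   # softmax(z) - e_y

h = 1e-6
grad_fd = np.array([
    (cross_entropy(z + h * e, y) - cross_entropy(z - h * e, y)) / (2 * h)
    for e in np.eye(z.size)
])

softmax_ce_error = np.max(np.abs(grad_analytic - grad_fd))
print(softmax_ce_error)
```

The entries of this gradient sum to zero, because the softmax probabilities sum to one and e_y has a single entry equal to one.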


Key Terms

Worked Examples

Example 1: Gradient of Linear Regression Loss

Given X ∈ ℝ^{m×n}, w ∈ ℝ^n, y ∈ ℝ^m:

$L(w) = ||X w - y||² = (Xw - y)^T (Xw - y)
$

Expand:

$L = w^T X^T X w - 2 y^T X w + y^T y
$

Using $∂/∂w (w^T A w) = (A + A^T) w$ with $A = X^T X$, and $∂/∂w (b^T w) = b$ with $b = X^T y$:

$∇L = (X^T X + X^T X) w - 2 X^T y = 2 X^T X w - 2 X^T y = 2 X^T (X w - y)
$

Setting the gradient to zero: $X^T X w = X^T y$ (normal equations).
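A minimal sketch of the normal equations in NumPy, using randomly generated data as a stand-in for a real design matrix. It solves X^T X w = X^T y directly, confirms that the gradient vanishes there, and compares against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 20, 3
X = rng.standard_normal((m, n))          # random data; full column rank in practice
y = rng.standard_normal(m)

def grad_loss(w):
    return 2 * X.T @ (X @ w - y)

# Solve the normal equations X^T X w = X^T y directly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

grad_norm_at_solution = np.linalg.norm(grad_loss(w_normal))
solution_gap = np.max(np.abs(w_normal - w_lstsq))
print(grad_norm_at_solution, solution_gap)
```

Forming X^T X squares the condition number of X, so for ill-conditioned problems a QR- or SVD-based solver such as `lstsq` is usually the safer choice.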

Example 2: Gradient of log det

Let $f(X) = log det(X)$ for X symmetric positive definite (or, more generally, invertible with det(X) > 0 so that the logarithm is defined).

We can use the trace trick. For a small perturbation dX:

$f(X + dX) = log det(X + dX)
          = log det(X (I + X^{-1} dX))
          = log det(X) + log det(I + X^{-1} dX)
          ≈ f(X) + tr(X^{-1} dX)      (since log det(I + εA) ≈ tr(εA))
$

Thus: $df = tr(X^{-1} dX) = tr((X^{-T})^T dX)$, so ∂f/∂X = X^{-T}.

If X is symmetric: ∂f/∂X = X^{-1}. If instead only the independent entries of a symmetric X are treated as variables, the result is $2X^{-1} - diag(X^{-1})$, because each off-diagonal entry appears twice in X.
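The first-order step log det(I + εA) ≈ ε tr(A) carries the whole derivation, so it is worth seeing how quickly the approximation improves. The sketch below uses an arbitrary random A and three illustrative values of ε; `np.linalg.slogdet` returns the sign and the log of the absolute determinant, and only the latter is used.

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((3, 3))
trace_A = np.trace(A)

gaps = []
for eps in (1e-1, 1e-2, 1e-3):
    exact = np.linalg.slogdet(np.eye(3) + eps * A)[1]   # log|det(I + eps*A)|
    approx = eps * trace_A                              # first-order approximation
    gaps.append(abs(exact - approx))
    print(f"eps={eps:g}  exact={exact:.8f}  approx={approx:.8f}  gap={gaps[-1]:.2e}")
```

The gap is expected to shrink roughly like ε², consistent with the neglected term being second order in the perturbation.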

Example 3: Gradient of a Neural Network Layer

Forward: $z = W x$, $a = ReLU(z)$, $L = ½||a - y||²$.

Backward:

∂L/∂a = a - y
∂L/∂z = (∂L/∂a) ⊙ ReLU'(z)  where ReLU'(z) = 1 if z > 0, else 0
∂L/∂W = (∂L/∂z) x^T           (outer product: m×1 · 1×n = m×n)
∂L/∂x = W^T (∂L/∂z)

For $W = [[1,2],[3,4]]$, $x = [1,1]^T$, $y = [5,11]^T$:
- $z = Wx = [3, 7]^T$. Both entries are positive, so $ReLU'(z) = [1, 1]^T$.
- $a = [3, 7]^T$.
- $∂L/∂a = [-2, -4]^T$.
- $∂L/∂z = [-2, -4]^T$ (element-wise product with [1,1]^T).
- $∂L/∂W = [-2,-4]^T [1,1] = [[-2,-2],[-4,-4]]$.
- $∂L/∂x = W^T [-2,-4]^T = [[1,3],[2,4]] [-2,-4]^T = [-14, -20]^T$.
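The numbers in this example can be reproduced directly. A short NumPy sketch, with variable names chosen here to mirror the symbols in the text:

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([1.0, 1.0])
y = np.array([5.0, 11.0])

z = W @ x                          # [3, 7]
a = np.maximum(z, 0.0)             # ReLU leaves positive entries unchanged
dL_da = a - y                      # [-2, -4]
dL_dz = dL_da * (z > 0)            # ReLU'(z) is 1 where z > 0, else 0
dL_dW = np.outer(dL_dz, x)         # [[-2, -2], [-4, -4]]
dL_dx = W.T @ dL_dz                # [-14, -20]

print(dL_dW)
print(dL_dx)
```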

Example 4: Product Rule Verification

Let $f(x) = x^T A x$. Compute $∇f$ using the product rule.

Rewrite: $f(x) = (x)^T (A x)$. Treat as u(x)^T v(x) where $u(x) = x$, $v(x) = A x$.

Product rule: $∇(u^T v) = (∂u/∂x)^T v + (∂v/∂x)^T u = I · A x + A^T · x = A x + A^T x$.

Matches the direct computation. ✓


Quiz

Q1: In this subject, what does the Jacobian matrix refer to?

A) A visual representation of a vector-valued function
B) A historical anecdote about matrix calculus
C) The m × n matrix of partial derivatives ∂f_i/∂x_j of a function f: ℝ^n → ℝ^m
D) A computational error that arises when differentiating

Correct: C)

Q2: What is the purpose of the Common Pitfall note in this subject?

A) It replaces all other methods in this domain
B) It introduces a historical notation system
C) It warns that numerator and denominator layouts are transposes of each other, so the convention must always be checked
D) It applies only in advanced research contexts

Correct: C)

Q3: Based on Worked Example 2, what is ∂/∂X log det(X) for a symmetric invertible X?

A) X^{-1}
B) A scalar constant
C) det(X) X^{-1} (the gradient of det(X), not of log det(X))
D) X (the inverse of the correct answer)

Correct: A)

Practice Problems

  1. Compute $∇_x f$ where $f(x) = ||Ax||² = x^T A^T A x$. Use the quadratic form gradient formula.

  2. Find the Jacobian $∂f/∂x$ for $f(x) = [x₁² + x₂, sin(x₁ x₂)]^T$.

  3. Compute $∂/∂X tr(A X B)$ where A, B are constant matrices.

  4. Derive $∂/∂W ||W X - Y||_F²$. (X and Y are constant matrices.)

  5. For $f(X) = det(X)$, use the trace approximation method to derive $∂f/∂X = det(X) (X^{-1})^T$.

  6. Compute $∇_x log(1 + exp(w^T x))$ (the gradient of the softplus function applied to w^T x).

Answers

  1. Using `∂/∂x (x^T M x) = (M + M^T) x` with `M = A^T A`: since A^T A is symmetric, ∇f = 2 A^T A x.

  2. f₁ = x₁² + x₂, f₂ = sin(x₁ x₂). ∂f₁/∂x₁ = 2x₁, ∂f₁/∂x₂ = 1. ∂f₂/∂x₁ = x₂ cos(x₁ x₂), ∂f₂/∂x₂ = x₁ cos(x₁ x₂). J = [[2x₁, 1], [x₂ cos(x₁ x₂), x₁ cos(x₁ x₂)]].

  3. `tr(A X B) = tr(B A X)` (cyclic property). `∂/∂X tr(C X) = C^T` where C = B A. So `∂/∂X tr(A X B) = (B A)^T = A^T B^T`.

  4. `f(W) = ||W X - Y||_F² = tr((WX - Y)^T (WX - Y)) = tr(X^T W^T W X) - 2 tr(Y^T W X) + tr(Y^T Y)`. For the first term, `∂/∂W tr(X^T W^T W X) = ∂/∂W tr(W X X^T W^T) = W X X^T + W X X^T = 2 W X X^T`, using `∂/∂W tr(W M W^T) = W M^T + W M` with the symmetric matrix `M = X X^T`. For the second term, `∂/∂W (-2 tr(Y^T W X)) = -2 ∂/∂W tr(X Y^T W) = -2 (X Y^T)^T = -2 Y X^T`. So: `∂f/∂W = 2 W X X^T - 2 Y X^T = 2 (W X - Y) X^T`.

  5. For small dX: `det(X + dX) = det(X) det(I + X^{-1} dX) ≈ det(X) + det(X) tr(X^{-1} dX)`. So `df = det(X) tr(X^{-1} dX) = tr(det(X) (X^{-T})^T dX)`. Therefore: `∂ det(X)/∂X = det(X) X^{-T}`.

  6. `f(x) = log(1 + exp(w^T x))`. Let `s = w^T x`. `∂f/∂s = exp(s)/(1 + exp(s)) = σ(s)` (the sigmoid function). By the chain rule: `∇_x f = (∂f/∂s) ∇_x (w^T x) = σ(w^T x) · w`.
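Answers 4 and 6 can be spot-checked with finite differences. This is a sketch with randomly generated inputs; the shapes, the helper names `frob_loss` and `softplus_of_dot`, and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
h = 1e-6

# Answer 4: gradient of ||W X - Y||_F^2 with respect to W.
W = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 5))
Y = rng.standard_normal((2, 5))

def frob_loss(M):
    return np.sum((M @ X - Y) ** 2)

grad4_fd = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = h
        grad4_fd[i, j] = (frob_loss(W + E) - frob_loss(W - E)) / (2 * h)
answer4_error = np.max(np.abs(grad4_fd - 2 * (W @ X - Y) @ X.T))

# Answer 6: gradient of log(1 + exp(w^T x)) with respect to x.
w = rng.standard_normal(4)
x = rng.standard_normal(4)

def softplus_of_dot(v):
    return np.log1p(np.exp(w @ v))

sigma = 1.0 / (1.0 + np.exp(-(w @ x)))   # sigmoid of w^T x
grad6_fd = np.array([
    (softplus_of_dot(x + h * e) - softplus_of_dot(x - h * e)) / (2 * h)
    for e in np.eye(4)
])
answer6_error = np.max(np.abs(grad6_fd - sigma * w))

print(answer4_error, answer6_error)
```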

Summary


Pitfalls



Next Steps

Congratulations! You have completed Phase 9 (Matrix Decompositions & Advanced Linear Algebra). This concludes the core linear algebra sequence. Your next steps depend on your learning path: