Math graphic
📐 Concept diagram

15-03 — Automatic Differentiation (AD)

Phase: Numerical Methods for ML | Subject: 15-03 Prerequisites: 4-4-differentiation-rules, 14-02-gradient-descent.md Next subject: 15-04-backpropagation-implementation.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Distinguish AD from symbolic differentiation and numerical differentiation
  2. Explain forward-mode AD using dual numbers
  3. Explain reverse-mode AD and why it's $O(1)$ in output dimension
  4. Construct computational graphs and trace gradient flow through them
  5. Justify why reverse-mode AD (backpropagation) is the right choice for neural network training

Core Content

Three Ways to Compute Derivatives

Method How It Works Complexity Issues
Numerical $f'(x) \approx \frac{f(x+h)-f(x)}{h}$ $O(1)$ per derivative Truncation + roundoff error; need to choose $h$
Symbolic Manipulate expressions algebraically Exponential expression swell Code blowup, can't handle control flow
Automatic Chain rule through elementary operations $O(1)$ per operation Exact to machine precision; handles control flow

⚠️ CRITICAL — AD is NOT the same as symbolic or numerical differentiation. AD computes exact derivatives (up to floating-point accuracy) by applying the chain rule to the trace of elementary operations. It works through loops, conditionals, and recursion — unlike symbolic differentiation, which operates on the expression tree.

AD is what powers PyTorch's autograd, TensorFlow's GradientTape, and JAX's grad.

Forward-Mode AD

Idea: Propagate derivatives forward through the computation alongside the values. Each variable is replaced by a dual number $(v, \dot{v})$ where $\dot{v} = \partial v / \partial x$ for the input of interest.

Chain rule in forward mode: $\dot{y} = \frac{\partial f}{\partial x} = \sum_i \frac{\partial f}{\partial x_i} \dot{x}_i$

For $f(x) = \sin(x^2) + e^x$: - $v_1 = x$, $\dot{v}_1 = 1$ (seed) - $v_2 = v_1^2$, $\dot{v}_2 = 2v_1 \cdot \dot{v}_1 = 2x$ - $v_3 = \sin(v_2)$, $\dot{v}_3 = \cos(v_2) \cdot \dot{v}_2 = \cos(x^2) \cdot 2x$ - $v_4 = e^{v_1}$, $\dot{v}_4 = e^{v_1} \cdot \dot{v}_1 = e^x$ - $v_5 = v_3 + v_4$, $\dot{v}_5 = \dot{v}_3 + \dot{v}_4 = 2x\cos(x^2) + e^x$ ✓

Cost: $O(n)$ to compute $\partial f / \partial x$ when $f: \mathbb{R}^n \to \mathbb{R}$ (one forward pass per input variable). Great for $n$ small, $m$ large ($f: \mathbb{R}^n \to \mathbb{R}^m$ with $n \ll m$).

Reverse-Mode AD (Backpropagation)

Idea: First compute all values forward (the "forward pass"), then propagate gradients backward from the output (the "backward pass").

For $f(x) = \sin(x^2) + e^x$:

Forward pass (build computation graph, store intermediate values): 1. $v_1 = x$ 2. $v_2 = v_1^2$ 3. $v_3 = \sin(v_2)$ 4. $v_4 = e^{v_1}$ 5. $v_5 = v_3 + v_4$ (output)

Backward pass (propagate $\bar{v}_i = \partial f / \partial v_i$ from output to input): - $\bar{v}_5 = 1$ (seed the output) - $\bar{v}_3 = \bar{v}_5 \cdot \frac{\partial v_5}{\partial v_3} = 1 \cdot 1 = 1$ - $\bar{v}_4 = \bar{v}_5 \cdot \frac{\partial v_5}{\partial v_4} = 1 \cdot 1 = 1$ - $\bar{v}_2 = \bar{v}_3 \cdot \frac{\partial v_3}{\partial v_2} = 1 \cdot \cos(v_2) = \cos(x^2)$ - $\bar{v}_1 = \bar{v}_2 \cdot \frac{\partial v_2}{\partial v_1} + \bar{v}_4 \cdot \frac{\partial v_4}{\partial v_1} = \cos(x^2) \cdot 2x + 1 \cdot e^x = 2x\cos(x^2) + e^x$ ✓

Cost: $O(1)$ to compute $\nabla f$ for $f: \mathbb{R}^n \to \mathbb{R}$, regardless of $n$. This is exactly what neural network training needs: compute the gradient of a scalar loss w.r.t. millions of parameters.

Forward vs Reverse: The Key Tradeoff

Forward Mode Reverse Mode
Cost for $f: \mathbb{R}^n \to \mathbb{R}^m$ $O(n)$ forward passes $O(m)$ backward passes
Memory $O(1)$ extra (no need to store intermediates) $O(\text{ops})$ — must store entire forward trace
Best when $n$ small, $m$ large (e.g., computing Jacobian) $n$ large, $m=1$ (e.g., neural network training)
ML use cases Computing Jacobian-vector products, Hessian-vector products Backpropagation (the default)

The Wengert Tape

In practice, reverse-mode AD implementations use a Wengert tape (or "trace"): a sequential record of every elementary operation performed during the forward pass, along with the intermediate values.

During the backward pass, the tape is read in reverse order. Each operation knows how to compute its local Jacobian and how to multiply it with the incoming gradient (the "vector-Jacobian product" or VJP).

⚠️ Critical limitation of reverse mode: You must store EVERY intermediate value from the forward pass. For a neural network with $10^9$ activations, this is a lot of memory. This is why gradient checkpointing exists: trade recomputation for memory by only storing checkpoints and recomputing intermediates during the backward pass.



Key Terms

Worked Examples

Example 1: Forward Mode for a Simple Function

Use forward-mode AD to compute $\partial f / \partial x$ at $x=2$ for $f(x) = x^3 + \ln x$.

Solution: Initialize dual number: $(v, \dot{v}) = (2, 1)$.

  1. $v_1 = x = (2, 1)$
  2. $v_2 = v_1^3 = (2^3, 3 \cdot 2^2 \cdot 1) = (8, 12)$
  3. $v_3 = \ln(v_1) = (\ln 2, 1/2 \cdot 1) = (0.6931, 0.5)$
  4. $v_4 = v_2 + v_3 = (8 + 0.6931, 12 + 0.5) = (8.6931, 12.5)$

$f(2) = 8.6931$, $f'(2) = 12.5$. Check: $f'(x) = 3x^2 + 1/x$, so $f'(2) = 12 + 0.5 = 12.5$ ✓.

Click for answer $f(2) \approx 8.6931$, $f'(2) = 12.5$. Forward mode traces the computation and applies the chain rule simultaneously.

Example 2: Reverse Mode on a 2D Function

Use reverse-mode AD to compute $\nabla g$ where $g(x,y) = x \cdot \sin(y)$ at $(x,y) = (3, \pi/6)$.

Solution: Forward: $v_1 = 3$, $v_2 = \pi/6$, $v_3 = \sin(v_2) = 0.5$, $v_4 = v_1 \cdot v_3 = 1.5$.

Backward: $\bar{v}_4 = 1$ $\bar{v}_3 = \bar{v}_4 \cdot v_1 = 1 \cdot 3 = 3$ $\bar{v}_1 = \bar{v}_4 \cdot v_3 = 1 \cdot 0.5 = 0.5$ $\bar{v}_2 = \bar{v}_3 \cdot \cos(v_2) = 3 \cdot \cos(\pi/6) = 3 \cdot 0.8660 = 2.598$

$\nabla g = (\partial g/\partial x,\; \partial g/\partial y) = (\bar{v}_1, \bar{v}_2) = (0.5, 2.598)$

Check: $\partial g/\partial x = \sin y = \sin(\pi/6) = 0.5$ ✓, $\partial g/\partial y = x\cos y = 3\cos(\pi/6) = 3 \cdot 0.8660 = 2.598$ ✓.

Click for answer $\nabla g(3, \pi/6) = (\sin(\pi/6),\; 3\cos(\pi/6)) = (0.5, 2.598)$. Reverse mode computes the full gradient in one backward pass.

Example 3: Control Flow in AD

$f(x) = \begin{cases} x^2 & \text{if } x > 0 \ 0 & \text{otherwise} \end{cases}$. What is the AD derivative at $x=0$?

Solution: AD traces the actual execution path. For $x=0$, the else branch is taken → $f(0) = 0$. The computation is $y = 0$, so $\partial f/\partial x = 0$.

For $x=0.001$, the then-branch is taken → $f(x) = x^2$, $\partial f/\partial x = 2x = 0.002$.

The AD derivative is the derivative of the executed program, not the mathematical derivative (which doesn't exist at $x=0$ due to the discontinuity in the derivative). This is both a feature (handles arbitrary code) and a caution (you get "subdifferential-adjacent" results at non-differentiable points).

Click for answer At $x=0$: AD returns $0$ (the derivative of the executed else-branch, which is the constant $0$). The true mathematical derivative doesn't exist at this point — the subdifferential is $\{0\}$ (both branches approach derivative 0 near $x=0$), so AD happens to return a valid subgradient. This is coincidental — in general, AD gives the derivative of the executed branch, which may or may not be a valid subgradient.


Quiz

Q1: What does the concept of Numerical primarily refer to in this subject?

A) A computational error related to Numerical B) The definition and application of Numerical C) A historical anecdote about Numerical D) A visual representation of Numerical

Correct: B)

Q2: Which of the following is the key formula discussed in this subject?

A) An unrelated formula from a different topic B) A simplified version of f'(x) \approx \frac{f(x+h)-f... C) The inverse operation of the formula in question D) f'(x) \approx \frac{f(x+h)-f(x)}{h}

Correct: D)

Q3: What is the primary purpose of Symbolic?

A) It is primarily a historical notation system B) It is used to symbolic in mathematical analysis C) It replaces all other methods in this domain D) It is used only in advanced research contexts

Correct: B)

Q4: Which statement about Automatic is TRUE?

A) Automatic is an advanced topic beyond this subject's scope B) Automatic is not related to this subject C) Automatic is mentioned only as a historical footnote D) Automatic is a fundamental concept covered in this subject

Correct: D)

Q5: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) Reverse Mode on a 2D Function C) The inverse of the correct answer D) A different result from a common mistake

Correct: B)

Q6: How are Automatic and Forward pass related?

A) Automatic and Forward pass are closely related concepts B) Automatic is a special case of Forward pass C) Automatic and Forward pass are completely unrelated topics D) Automatic is the inverse of Forward pass

Correct: A)

Q7: What is a common pitfall when working with Backward pass?

A) A common mistake is confusing Backward pass with a similar concept B) Backward pass is always computed the same way in all contexts C) Backward pass has no common misconceptions D) The main error with Backward pass is using it when it is not needed

Correct: A)

Q8: When should you apply Wengert tape?

A) Apply Wengert tape to solve problems in this subject's domain B) Use Wengert tape only in pure mathematics contexts C) Avoid Wengert tape unless explicitly instructed D) Wengert tape is not practically useful

Correct: A)

Practice Problems

  1. Trace forward-mode AD for $f(x) = \sqrt{x^2 + 1}$ at $x=3$.

    Click for answer $(v_1, \dot{v}_1) = (3, 1)$ $(v_2, \dot{v}_2) = (v_1^2, 2v_1\dot{v}_1) = (9, 6)$ $(v_3, \dot{v}_3) = (v_2 + 1, \dot{v}_2) = (10, 6)$ $(v_4, \dot{v}_4) = (\sqrt{v_3}, \dot{v}_3/(2\sqrt{v_3})) = (\sqrt{10}, 6/(2\sqrt{10})) = (3.1623, 0.9487)$ $f'(3) = 3/\sqrt{10} \approx 0.9487$ ✓.

  2. Derive the VJP (vector-Jacobian product) rule for the matrix multiplication $C = AB$. Given $\bar{C}$ (gradient of loss w.r.t. $C$), compute $\bar{A}$ and $\bar{B}$.

    Click for answer $C_{ij} = \sum_k A_{ik} B_{kj}$. $\frac{\partial L}{\partial A_{ik}} = \sum_j \frac{\partial L}{\partial C_{ij}} \frac{\partial C_{ij}}{\partial A_{ik}} = \sum_j \bar{C}_{ij} B_{kj} = (\bar{C} B^T)_{ik}$ So $\bar{A} = \bar{C} B^T$. Similarly, $\bar{B} = A^T \bar{C}$. These are the backward rules used in every deep learning framework.

  3. Explain why reverse-mode AD for $f: \mathbb{R}^n \to \mathbb{R}^m$ with $n \gg m$ is more efficient in the other direction (forward mode would be better when $m \gg n$).

    Click for answer Reverse mode: one backward pass computes $\nabla f_i$ for a single output $f_i$. To get the full Jacobian ($m$ outputs), you need $m$ backward passes. Total cost: $O(m \cdot \text{ops})$. Forward mode: one forward pass computes one column of the Jacobian. To get the full Jacobian, you need $n$ forward passes. Total cost: $O(n \cdot \text{ops})$. For $n \gg m$ (neural networks: $n \sim 10^9$, $m = 1$): reverse mode wins massively — $O(1 \cdot \text{ops})$ vs $O(10^9 \cdot \text{ops})$. For $m \gg n$ (e.g., computing Jacobian of a small parameter set w.r.t. many outputs): forward mode wins.

  4. In PyTorch, what does .detach() do and why is it used?

    Click for answer `.detach()` returns a new tensor that shares the same data but is disconnected from the computation graph. Gradients won't flow through it during backward(). Common use cases: (1) Freezing part of a model (preventing gradient updates); (2) Using a network's output as a target without backpropagating through it (e.g., target networks in DQN); (3) Computing metrics or logging values without building graph overhead.

  5. Compute the memory footprint of reverse-mode AD for a feedforward network with $L$ layers, each having $h$ hidden units, batch size $B$. Assume activations per layer are $Bh$ floats.

    Click for answer Reverse-mode AD must store ALL activations from the forward pass: $L$ layers × $Bh$ floats per layer = $LBh$ floats total. Plus the weights ($L \times h^2$ approx) and the Wengert tape metadata. For $L=100$, $h=1024$, $B=32$: $100 \times 32 \times 1024 \approx 3.3 \times 10^6$ floats ≈ 26 MB (at float32). Manageable. For $L=100$, $h=4096$, $B=512$: $100 \times 512 \times 4096 \approx 2.1 \times 10^8$ floats ≈ 840 MB. Painful but possible on modern GPUs (with checkpointing). This is why gradient checkpointing trades compute for memory: store only every $k$-th activation and recompute intermediates.


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 15-04-backpropagation-implementation.md — implementing backprop from scratch for feedforward networks, understanding the backward pass in detail.