15-03 — Automatic Differentiation (AD)
Phase: Numerical Methods for ML | Subject: 15-03 Prerequisites: 4-4-differentiation-rules, 14-02-gradient-descent.md Next subject: 15-04-backpropagation-implementation.md
Learning Objectives
By the end of this subject, you will be able to:
- Distinguish AD from symbolic differentiation and numerical differentiation
- Explain forward-mode AD using dual numbers
- Explain reverse-mode AD and why it's $O(1)$ in output dimension
- Construct computational graphs and trace gradient flow through them
- Justify why reverse-mode AD (backpropagation) is the right choice for neural network training
Core Content
Three Ways to Compute Derivatives
| Method | How It Works | Complexity | Issues |
|---|---|---|---|
| Numerical | $f'(x) \approx \frac{f(x+h)-f(x)}{h}$ | $O(1)$ per derivative | Truncation + roundoff error; need to choose $h$ |
| Symbolic | Manipulate expressions algebraically | Exponential expression swell | Code blowup, can't handle control flow |
| Automatic | Chain rule through elementary operations | $O(1)$ per operation | Exact to machine precision; handles control flow |
⚠️ CRITICAL — AD is NOT the same as symbolic or numerical differentiation. AD computes exact derivatives (up to floating-point accuracy) by applying the chain rule to the trace of elementary operations. It works through loops, conditionals, and recursion — unlike symbolic differentiation, which operates on the expression tree.
AD is what powers PyTorch's autograd, TensorFlow's GradientTape, and JAX's grad.
Forward-Mode AD
Idea: Propagate derivatives forward through the computation alongside the values. Each variable is replaced by a dual number $(v, \dot{v})$ where $\dot{v} = \partial v / \partial x$ for the input of interest.
Chain rule in forward mode: $\dot{y} = \frac{\partial f}{\partial x} = \sum_i \frac{\partial f}{\partial x_i} \dot{x}_i$
For $f(x) = \sin(x^2) + e^x$: - $v_1 = x$, $\dot{v}_1 = 1$ (seed) - $v_2 = v_1^2$, $\dot{v}_2 = 2v_1 \cdot \dot{v}_1 = 2x$ - $v_3 = \sin(v_2)$, $\dot{v}_3 = \cos(v_2) \cdot \dot{v}_2 = \cos(x^2) \cdot 2x$ - $v_4 = e^{v_1}$, $\dot{v}_4 = e^{v_1} \cdot \dot{v}_1 = e^x$ - $v_5 = v_3 + v_4$, $\dot{v}_5 = \dot{v}_3 + \dot{v}_4 = 2x\cos(x^2) + e^x$ ✓
Cost: $O(n)$ to compute $\partial f / \partial x$ when $f: \mathbb{R}^n \to \mathbb{R}$ (one forward pass per input variable). Great for $n$ small, $m$ large ($f: \mathbb{R}^n \to \mathbb{R}^m$ with $n \ll m$).
Reverse-Mode AD (Backpropagation)
Idea: First compute all values forward (the "forward pass"), then propagate gradients backward from the output (the "backward pass").
For $f(x) = \sin(x^2) + e^x$:
Forward pass (build computation graph, store intermediate values): 1. $v_1 = x$ 2. $v_2 = v_1^2$ 3. $v_3 = \sin(v_2)$ 4. $v_4 = e^{v_1}$ 5. $v_5 = v_3 + v_4$ (output)
Backward pass (propagate $\bar{v}_i = \partial f / \partial v_i$ from output to input): - $\bar{v}_5 = 1$ (seed the output) - $\bar{v}_3 = \bar{v}_5 \cdot \frac{\partial v_5}{\partial v_3} = 1 \cdot 1 = 1$ - $\bar{v}_4 = \bar{v}_5 \cdot \frac{\partial v_5}{\partial v_4} = 1 \cdot 1 = 1$ - $\bar{v}_2 = \bar{v}_3 \cdot \frac{\partial v_3}{\partial v_2} = 1 \cdot \cos(v_2) = \cos(x^2)$ - $\bar{v}_1 = \bar{v}_2 \cdot \frac{\partial v_2}{\partial v_1} + \bar{v}_4 \cdot \frac{\partial v_4}{\partial v_1} = \cos(x^2) \cdot 2x + 1 \cdot e^x = 2x\cos(x^2) + e^x$ ✓
Cost: $O(1)$ to compute $\nabla f$ for $f: \mathbb{R}^n \to \mathbb{R}$, regardless of $n$. This is exactly what neural network training needs: compute the gradient of a scalar loss w.r.t. millions of parameters.
Forward vs Reverse: The Key Tradeoff
| Forward Mode | Reverse Mode | |
|---|---|---|
| Cost for $f: \mathbb{R}^n \to \mathbb{R}^m$ | $O(n)$ forward passes | $O(m)$ backward passes |
| Memory | $O(1)$ extra (no need to store intermediates) | $O(\text{ops})$ — must store entire forward trace |
| Best when | $n$ small, $m$ large (e.g., computing Jacobian) | $n$ large, $m=1$ (e.g., neural network training) |
| ML use cases | Computing Jacobian-vector products, Hessian-vector products | Backpropagation (the default) |
The Wengert Tape
In practice, reverse-mode AD implementations use a Wengert tape (or "trace"): a sequential record of every elementary operation performed during the forward pass, along with the intermediate values.
During the backward pass, the tape is read in reverse order. Each operation knows how to compute its local Jacobian and how to multiply it with the incoming gradient (the "vector-Jacobian product" or VJP).
⚠️ Critical limitation of reverse mode: You must store EVERY intermediate value from the forward pass. For a neural network with $10^9$ activations, this is a lot of memory. This is why gradient checkpointing exists: trade recomputation for memory by only storing checkpoints and recomputing intermediates during the backward pass.
Key Terms
- Automatic
- Backward pass
- Forward pass
- Numerical
- Symbolic
- Wengert tape
Worked Examples
Example 1: Forward Mode for a Simple Function
Use forward-mode AD to compute $\partial f / \partial x$ at $x=2$ for $f(x) = x^3 + \ln x$.
Solution: Initialize dual number: $(v, \dot{v}) = (2, 1)$.
- $v_1 = x = (2, 1)$
- $v_2 = v_1^3 = (2^3, 3 \cdot 2^2 \cdot 1) = (8, 12)$
- $v_3 = \ln(v_1) = (\ln 2, 1/2 \cdot 1) = (0.6931, 0.5)$
- $v_4 = v_2 + v_3 = (8 + 0.6931, 12 + 0.5) = (8.6931, 12.5)$
$f(2) = 8.6931$, $f'(2) = 12.5$. Check: $f'(x) = 3x^2 + 1/x$, so $f'(2) = 12 + 0.5 = 12.5$ ✓.
Click for answer
$f(2) \approx 8.6931$, $f'(2) = 12.5$. Forward mode traces the computation and applies the chain rule simultaneously.Example 2: Reverse Mode on a 2D Function
Use reverse-mode AD to compute $\nabla g$ where $g(x,y) = x \cdot \sin(y)$ at $(x,y) = (3, \pi/6)$.
Solution: Forward: $v_1 = 3$, $v_2 = \pi/6$, $v_3 = \sin(v_2) = 0.5$, $v_4 = v_1 \cdot v_3 = 1.5$.
Backward: $\bar{v}_4 = 1$ $\bar{v}_3 = \bar{v}_4 \cdot v_1 = 1 \cdot 3 = 3$ $\bar{v}_1 = \bar{v}_4 \cdot v_3 = 1 \cdot 0.5 = 0.5$ $\bar{v}_2 = \bar{v}_3 \cdot \cos(v_2) = 3 \cdot \cos(\pi/6) = 3 \cdot 0.8660 = 2.598$
$\nabla g = (\partial g/\partial x,\; \partial g/\partial y) = (\bar{v}_1, \bar{v}_2) = (0.5, 2.598)$
Check: $\partial g/\partial x = \sin y = \sin(\pi/6) = 0.5$ ✓, $\partial g/\partial y = x\cos y = 3\cos(\pi/6) = 3 \cdot 0.8660 = 2.598$ ✓.
Click for answer
$\nabla g(3, \pi/6) = (\sin(\pi/6),\; 3\cos(\pi/6)) = (0.5, 2.598)$. Reverse mode computes the full gradient in one backward pass.Example 3: Control Flow in AD
$f(x) = \begin{cases} x^2 & \text{if } x > 0 \ 0 & \text{otherwise} \end{cases}$. What is the AD derivative at $x=0$?
Solution: AD traces the actual execution path. For $x=0$, the else branch is taken → $f(0) = 0$. The computation is $y = 0$, so $\partial f/\partial x = 0$.
For $x=0.001$, the then-branch is taken → $f(x) = x^2$, $\partial f/\partial x = 2x = 0.002$.
The AD derivative is the derivative of the executed program, not the mathematical derivative (which doesn't exist at $x=0$ due to the discontinuity in the derivative). This is both a feature (handles arbitrary code) and a caution (you get "subdifferential-adjacent" results at non-differentiable points).
Click for answer
At $x=0$: AD returns $0$ (the derivative of the executed else-branch, which is the constant $0$). The true mathematical derivative doesn't exist at this point — the subdifferential is $\{0\}$ (both branches approach derivative 0 near $x=0$), so AD happens to return a valid subgradient. This is coincidental — in general, AD gives the derivative of the executed branch, which may or may not be a valid subgradient.Quiz
Q1: What does the concept of Numerical primarily refer to in this subject?
A) A computational error related to Numerical B) The definition and application of Numerical C) A historical anecdote about Numerical D) A visual representation of Numerical
Correct: B)
- If you chose A: This is incorrect. Numerical is defined as: the definition and application of numerical. The other options describe different aspects that are not the primary focus.
- If you chose B: Numerical is defined as: the definition and application of numerical. The other options describe different aspects that are not the primary focus. Correct!
- If you chose C: This is incorrect. Numerical is defined as: the definition and application of numerical. The other options describe different aspects that are not the primary focus.
- If you chose D: This is incorrect. Numerical is defined as: the definition and application of numerical. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) An unrelated formula from a different topic B) A simplified version of f'(x) \approx \frac{f(x+h)-f... C) The inverse operation of the formula in question D) f'(x) \approx \frac{f(x+h)-f(x)}{h}
Correct: D)
- If you chose A: This is incorrect. The formula f'(x) \approx \frac{f(x+h)-f(x)}{h} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: This is incorrect. The formula f'(x) \approx \frac{f(x+h)-f(x)}{h} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose C: This is incorrect. The formula f'(x) \approx \frac{f(x+h)-f(x)}{h} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: The formula f'(x) \approx \frac{f(x+h)-f(x)}{h} is central to this subject. The other options are either simplified versions or unrelated. Correct!
Q3: What is the primary purpose of Symbolic?
A) It is primarily a historical notation system B) It is used to symbolic in mathematical analysis C) It replaces all other methods in this domain D) It is used only in advanced research contexts
Correct: B)
- If you chose A: This is incorrect. Symbolic serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: Symbolic serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose C: This is incorrect. Symbolic serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Symbolic serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Automatic is TRUE?
A) Automatic is an advanced topic beyond this subject's scope B) Automatic is not related to this subject C) Automatic is mentioned only as a historical footnote D) Automatic is a fundamental concept covered in this subject
Correct: D)
- If you chose A: This is incorrect. Automatic is a fundamental concept covered in this subject. This subject covers Automatic as part of its core content.
- If you chose B: This is incorrect. Automatic is a fundamental concept covered in this subject. This subject covers Automatic as part of its core content.
- If you chose C: This is incorrect. Automatic is a fundamental concept covered in this subject. This subject covers Automatic as part of its core content.
- If you chose D: Automatic is a fundamental concept covered in this subject. This subject covers Automatic as part of its core content. Correct!
Q5: Based on the worked examples in this subject, what is the correct result?
A) An unrelated numerical value B) Reverse Mode on a 2D Function C) The inverse of the correct answer D) A different result from a common mistake
Correct: B)
- If you chose A: This is incorrect. The worked examples show that the result is Reverse Mode on a 2D Function. The other options represent common errors.
- If you chose B: The worked examples show that the result is Reverse Mode on a 2D Function. The other options represent common errors. Correct!
- If you chose C: This is incorrect. The worked examples show that the result is Reverse Mode on a 2D Function. The other options represent common errors.
- If you chose D: This is incorrect. The worked examples show that the result is Reverse Mode on a 2D Function. The other options represent common errors.
Q6: How are Automatic and Forward pass related?
A) Automatic and Forward pass are closely related concepts B) Automatic is a special case of Forward pass C) Automatic and Forward pass are completely unrelated topics D) Automatic is the inverse of Forward pass
Correct: A)
- If you chose A: Both Automatic and Forward pass are covered in this subject as interconnected topics. Correct!
- If you chose B: This is incorrect. Both Automatic and Forward pass are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Automatic and Forward pass are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Automatic and Forward pass are covered in this subject as interconnected topics.
Q7: What is a common pitfall when working with Backward pass?
A) A common mistake is confusing Backward pass with a similar concept B) Backward pass is always computed the same way in all contexts C) Backward pass has no common misconceptions D) The main error with Backward pass is using it when it is not needed
Correct: A)
- If you chose A: Students often confuse Backward pass with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose B: This is incorrect. Students often confuse Backward pass with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Backward pass with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse Backward pass with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply Wengert tape?
A) Apply Wengert tape to solve problems in this subject's domain B) Use Wengert tape only in pure mathematics contexts C) Avoid Wengert tape unless explicitly instructed D) Wengert tape is not practically useful
Correct: A)
- If you chose A: Wengert tape is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose B: This is incorrect. Wengert tape is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. Wengert tape is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. Wengert tape is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
-
Trace forward-mode AD for $f(x) = \sqrt{x^2 + 1}$ at $x=3$.
Click for answer
$(v_1, \dot{v}_1) = (3, 1)$ $(v_2, \dot{v}_2) = (v_1^2, 2v_1\dot{v}_1) = (9, 6)$ $(v_3, \dot{v}_3) = (v_2 + 1, \dot{v}_2) = (10, 6)$ $(v_4, \dot{v}_4) = (\sqrt{v_3}, \dot{v}_3/(2\sqrt{v_3})) = (\sqrt{10}, 6/(2\sqrt{10})) = (3.1623, 0.9487)$ $f'(3) = 3/\sqrt{10} \approx 0.9487$ ✓. -
Derive the VJP (vector-Jacobian product) rule for the matrix multiplication $C = AB$. Given $\bar{C}$ (gradient of loss w.r.t. $C$), compute $\bar{A}$ and $\bar{B}$.
Click for answer
$C_{ij} = \sum_k A_{ik} B_{kj}$. $\frac{\partial L}{\partial A_{ik}} = \sum_j \frac{\partial L}{\partial C_{ij}} \frac{\partial C_{ij}}{\partial A_{ik}} = \sum_j \bar{C}_{ij} B_{kj} = (\bar{C} B^T)_{ik}$ So $\bar{A} = \bar{C} B^T$. Similarly, $\bar{B} = A^T \bar{C}$. These are the backward rules used in every deep learning framework. -
Explain why reverse-mode AD for $f: \mathbb{R}^n \to \mathbb{R}^m$ with $n \gg m$ is more efficient in the other direction (forward mode would be better when $m \gg n$).
Click for answer
Reverse mode: one backward pass computes $\nabla f_i$ for a single output $f_i$. To get the full Jacobian ($m$ outputs), you need $m$ backward passes. Total cost: $O(m \cdot \text{ops})$. Forward mode: one forward pass computes one column of the Jacobian. To get the full Jacobian, you need $n$ forward passes. Total cost: $O(n \cdot \text{ops})$. For $n \gg m$ (neural networks: $n \sim 10^9$, $m = 1$): reverse mode wins massively — $O(1 \cdot \text{ops})$ vs $O(10^9 \cdot \text{ops})$. For $m \gg n$ (e.g., computing Jacobian of a small parameter set w.r.t. many outputs): forward mode wins. -
In PyTorch, what does
.detach()do and why is it used?
Click for answer
`.detach()` returns a new tensor that shares the same data but is disconnected from the computation graph. Gradients won't flow through it during backward(). Common use cases: (1) Freezing part of a model (preventing gradient updates); (2) Using a network's output as a target without backpropagating through it (e.g., target networks in DQN); (3) Computing metrics or logging values without building graph overhead. -
Compute the memory footprint of reverse-mode AD for a feedforward network with $L$ layers, each having $h$ hidden units, batch size $B$. Assume activations per layer are $Bh$ floats.
Click for answer
Reverse-mode AD must store ALL activations from the forward pass: $L$ layers × $Bh$ floats per layer = $LBh$ floats total. Plus the weights ($L \times h^2$ approx) and the Wengert tape metadata. For $L=100$, $h=1024$, $B=32$: $100 \times 32 \times 1024 \approx 3.3 \times 10^6$ floats ≈ 26 MB (at float32). Manageable. For $L=100$, $h=4096$, $B=512$: $100 \times 512 \times 4096 \approx 2.1 \times 10^8$ floats ≈ 840 MB. Painful but possible on modern GPUs (with checkpointing). This is why gradient checkpointing trades compute for memory: store only every $k$-th activation and recompute intermediates.
Summary
Key takeaways:
- AD computes exact derivatives by applying chain rule to the trace of elementary operations
- Forward mode: propagates derivatives alongside values — $O(n)$ for $\nabla f$, good for $n$ small
- Reverse mode: forward pass records trace, backward pass propagates gradients — $O(1)$ for $\nabla f$ when $f: \mathbb{R}^n \to \mathbb{R}$
- Reverse mode (= backpropagation) is the right choice for neural networks ($n$ huge, scalar loss)
- The Wengert tape stores every intermediate value — memory is the bottleneck for reverse mode
- AD handles control flow (loops, branches) by differentiating the executed path
- VJP is the primitive: given $\bar{y} = \partial L/\partial y$, compute $\partial L/\partial x = \bar{y} \cdot \partial y/\partial x$
Pitfalls
- Confusing AD with numerical or symbolic differentiation: AD computes exact derivatives (to machine precision) through the execution trace, not approximations via finite differences or algebraic manipulation of expressions. Saying "AD is like symbolic diff but automated" misses that AD handles control flow (loops, branches, recursion) that symbolic differentiation fundamentally cannot.
- Forgetting the memory cost of reverse-mode AD: Reverse mode must store every intermediate value from the forward pass on the Wengert tape. For large models, this is the dominant memory consumer — not the weights. Students often prototype on tiny networks and are surprised when training a real model exhausts GPU memory. Always estimate activation memory: batch_size × hidden_dim × num_layers × bytes_per_float.
- Using reverse mode when forward mode is cheaper: If your function maps few inputs to many outputs ($f: \mathbb{R}^n \to \mathbb{R}^m$ with $n \ll m$), forward mode requires $O(n)$ passes vs reverse mode's $O(m)$ passes. This comes up in computing full Jacobians, sensitivity analysis, and certain ODE adjoint methods. Don't default to reverse mode without checking dimensions.
- Trusting AD derivatives at non-differentiable points: AD differentiates the executed code path, not the mathematical function. At a ReLU's $z=0$ or an
ifboundary, the AD derivative is the derivative of whichever branch was taken — which may not be a valid subgradient. For functions withabs, $max$,relu, or conditional logic, verify that AD returns what you expect at the boundaries. - Forgetting to
.detach()tensors that shouldn't receive gradients: In PyTorch, any tensor with $requires_grad=True$ that participates in a computation will receive gradients during.backward(). For metrics, logging, target networks, or frozen layers, forgetting.detach()silently builds unnecessary computation graph nodes, wastes memory, and can cause incorrect gradients if those values are used downstream.
Next Steps
Next up: 15-04-backpropagation-implementation.md — implementing backprop from scratch for feedforward networks, understanding the backward pass in detail.