15-04 — Backpropagation Implementation
Phase: Numerical Methods for ML | Subject: 15-04 Prerequisites: 15-03-automatic-differentiation.md, 16-01-perceptron-model.md Next subject: 15-05-numerical-linear-algebra-ml.md
Learning Objectives
By the end of this subject, you will be able to:
- Derive the backpropagation equations for a feedforward neural network
- Implement the forward and backward passes for common layers (Linear, ReLU, Sigmoid, Softmax+CrossEntropy)
- Explain why backpropagation requires storing activations from the forward pass
- Compute parameter gradients for a two-layer network by hand
- Debug gradient computation using finite-difference checks
Core Content
The Backpropagation Algorithm
Given a neural network that computes a scalar loss $L$:
- Forward pass: Compute all layer outputs $\mathbf{a}^{(l)}$ and activations $\mathbf{h}^{(l)}$, storing them
- Backward pass: Starting from $\partial L/\partial L = 1$, propagate gradients backward through each layer using the chain rule
- Parameter gradients: For each layer with parameters, compute $\partial L/\partial \mathbf{W}^{(l)}$ and $\partial L/\partial \mathbf{b}^{(l)}$
Notation
For layer $l$: - $\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$ — pre-activation - $\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})$ — post-activation - $\mathbf{a}^{(0)} = \mathbf{x}$ — input - $\mathbf{a}^{(L)} = \hat{\mathbf{y}}$ — output
Gradient notation: $\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$ (gradient w.r.t. pre-activation)
The Backpropagation Equations
Output layer (for loss $L$, activation $\mathbf{a}^{(L)}$): $$\boldsymbol{\delta}^{(L)} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \odot \sigma'(\mathbf{z}^{(L)})$$
Hidden layers ($l = L-1, \ldots, 1$): $$\boldsymbol{\delta}^{(l)} = \left((\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'(\mathbf{z}^{(l)})$$
Parameter gradients: $$\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T$$ $$\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$
⚠️ CRITICAL — The pattern: Each hidden layer receives $\boldsymbol{\delta}^{(l+1)}$ from above, multiplies by $(\mathbf{W}^{(l+1)})^T$ to propagate the gradient backward through the weights, then element-wise multiplies by the activation derivative $\sigma'(\mathbf{z}^{(l)})$ to account for the nonlinearity.
Common Activation Derivatives
| Activation | $\sigma(z)$ | $\sigma'(z)$ |
|---|---|---|
| ReLU | $\max(0, z)$ | $1$ if $z > 0$, $0$ otherwise |
| Sigmoid | $\frac{1}{1+e^{-z}}$ | $\sigma(z)(1-\sigma(z))$ |
| Tanh | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $1 - \tanh^2(z)$ |
| Leaky ReLU | $\max(\alpha z, z)$ | $1$ if $z > 0$, $\alpha$ otherwise |
Softmax + Cross-Entropy: The Beautiful Simplification
For classification with $K$ classes, the output is $\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ (softmax), and the loss is $L = -\sum_i y_i \log \hat{y}_i$ (cross-entropy with one-hot targets $y_i$).
The combined gradient simplifies dramatically:
$$\frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y}$$
That's it — no messy product of softmax Jacobian with cross-entropy gradient. This simplification is why softmax+cross-entropy is the universal choice for classification.
Derivation: $\frac{\partial L}{\partial z_k} = \sum_i \frac{\partial L}{\partial \hat{y}i} \frac{\partial \hat{y}_i}{\partial z_k}$ where $\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$ and $\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i(\delta{ik} - \hat{y}k)$. $= \sum_i -\frac{y_i}{\hat{y}_i} \hat{y}_i(\delta{ik} - \hat{y}k) = -\sum_i y_i(\delta{ik} - \hat{y}_k) = -y_k + \hat{y}_k\sum_i y_i = \hat{y}_k - y_k$ (since $\sum_i y_i = 1$).
Gradient Checking
To verify your backward pass implementation, compare with numerical gradients:
$$\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}$$
with $\epsilon \approx 10^{-5}$ for float32, $10^{-7}$ for float64. The relative error should be $< 10^{-7}$ for float64.
Key Terms
- Hidden layers
- Output layer
Worked Examples
Example 1: Backprop Through a Two-Layer Network
Network: input $\mathbf{x} \in \mathbb{R}^3$, hidden (ReLU, 2 units), output (sigmoid, 1 unit), binary cross-entropy loss.
$\mathbf{W}^{(1)} = \begin{pmatrix}1&2&0\-1&1&3\end{pmatrix}$, $\mathbf{b}^{(1)} = (0, 0)^T$, $\mathbf{W}^{(2)} = (1, -1)$, $\mathbf{b}^{(2)} = 0$. Input: $\mathbf{x} = (1, -1, 2)^T$, Target: $y = 0$.
Forward pass: $\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)} = \begin{pmatrix}1\cdot1+2(-1)+0\cdot2 \ -1\cdot1+1(-1)+3\cdot2\end{pmatrix} = \begin{pmatrix}-1\4\end{pmatrix}$ $\mathbf{a}^{(1)} = \operatorname{ReLU}(\mathbf{z}^{(1)}) = (0, 4)^T$
$z^{(2)} = \mathbf{W}^{(2)}\mathbf{a}^{(1)} + b^{(2)} = 1\cdot0 + (-1)\cdot4 = -4$ $\hat{y} = \sigma(-4) = \frac{1}{1+e^4} \approx 0.0180$
$L = -[0 \cdot \ln(0.0180) + 1 \cdot \ln(1-0.0180)] = -\ln(0.9820) \approx 0.01815$
Backward pass: Using the general BCE formula $L = -[y\ln\hat{y} + (1-y)\ln(1-\hat{y})]$: $\partial L/\partial \hat{y} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$. With $y=0$: $\partial L/\partial \hat{y} = \frac{1}{1-\hat{y}} \approx 1.0183$.
$\delta^{(2)} = \frac{\partial L}{\partial \hat{y}} \cdot \sigma'(z^{(2)}) = 1.0183 \cdot \hat{y}(1-\hat{y}) = 1.0183 \cdot 0.0180 \cdot 0.9820 \approx 0.0180$
$\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \delta^{(2)} \cdot (\mathbf{a}^{(1)})^T = 0.0180 \cdot (0, 4) = (0, 0.0720)$ $\frac{\partial L}{\partial b^{(2)}} = 0.0180$
$\boldsymbol{\delta}^{(1)} = ((\mathbf{W}^{(2)})^T \delta^{(2)}) \odot \operatorname{ReLU}'(\mathbf{z}^{(1)}) = \begin{pmatrix}1\-1\end{pmatrix} \cdot 0.0180 \odot \begin{pmatrix}0\1\end{pmatrix} = \begin{pmatrix}0.0180\-0.0180\end{pmatrix} \odot \begin{pmatrix}0\1\end{pmatrix} = \begin{pmatrix}0\-0.0180\end{pmatrix}$
$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^T = \begin{pmatrix}0\-0.0180\end{pmatrix} (1, -1, 2) = \begin{pmatrix}0&0&0\-0.0180&0.0180&-0.0360\end{pmatrix}$
The zero row for $\mathbf{W}^{(1)}$ makes sense — the first hidden unit is dead (ReLU output 0), so no gradient flows to its weights.
Click for answer
$\partial L/\partial \mathbf{W}^{(2)} = (0, 0.0720)$, $\partial L/\partial b^{(2)} = 0.0180$. $\partial L/\partial \mathbf{W}^{(1)}$ has zero gradient for the first hidden unit (dead ReLU).Example 2: Gradient Check
For the network in Example 1, verify $\partial L/\partial W^{(2)}_2$ (the second weight in the output layer) using finite differences with $\epsilon = 10^{-5}$.
Solution: The central difference formula is $f'(w) \approx (f(w+\epsilon) - f(w-\epsilon))/(2\epsilon)$. For a simple function $f(w) = w^2$ at $w=3$: Analytical: $f'(3) = 6$. Numerical: $(f(3.0001) - f(2.9999))/0.0002 = (9.00060001 - 8.99940001)/0.0002 = 0.0012/0.0002 = 6.0$. ✓
Click for answer
Finite difference: $f'(w) \approx (f(w+\epsilon) - f(w-\epsilon))/(2\epsilon)$. Use $\epsilon \approx 10^{-5}$ for float32, $10^{-7}$ for float64. Relative error $< 10^{-7}$ indicates correct implementation.Example 3: Memory of Backprop
A 50-layer ResNet with batch size 32 and hidden size 256 requires storing how much activation memory?
Solution: For a simple feedforward network: per layer, $32 \times 256 = 8192$ floats. At float32: 32 KB per layer. 50 layers: ~1.6 MB — surprisingly small. However, the real memory cost comes from convolutional layers: for a $32 \times 256 \times 56 \times 56$ feature map (batch=32, channels=256, spatial=56×56): $32 \times 256 \times 56 \times 56 \approx 25.7$ million floats ≈ 103 MB per layer. With 50 such layers: ~5 GB — this is why activation checkpointing exists.
Click for answer
For a moderate CNN: ~100 MB per layer × 50 layers ≈ 5 GB. Activation checkpointing reduces this to ~100 MB total by recomputing forward passes during backward.Quiz
Q1: What does the concept of Output layer primarily refer to in this subject?
A) A historical anecdote about Output layer B) The definition and application of Output layer C) A visual representation of Output layer D) A computational error related to Output layer
Correct: B)
- If you chose A: This is incorrect. Output layer is defined as: the definition and application of output layer. The other options describe different aspects that are not the primary focus.
- If you chose B: Output layer is defined as: the definition and application of output layer. The other options describe different aspects that are not the primary focus. Correct!
- If you chose C: This is incorrect. Output layer is defined as: the definition and application of output layer. The other options describe different aspects that are not the primary focus.
- If you chose D: This is incorrect. Output layer is defined as: the definition and application of output layer. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) \mathbf{a}^{(l)} B) The inverse operation of the formula in question C) An unrelated formula from a different topic D) A simplified version of \mathbf{a}^{(l)}...
Correct: A)
- If you chose A: The formula \mathbf{a}^{(l)} is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose B: This is incorrect. The formula \mathbf{a}^{(l)} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose C: This is incorrect. The formula \mathbf{a}^{(l)} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: This is incorrect. The formula \mathbf{a}^{(l)} is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of Hidden layers?
A) It is primarily a historical notation system B) It is used only in advanced research contexts C) It replaces all other methods in this domain D) It is used to hidden layers in mathematical analysis
Correct: D)
- If you chose A: This is incorrect. Hidden layers serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Hidden layers serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. Hidden layers serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: Hidden layers serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
Q4: Which statement about The Backpropagation Algorithm is TRUE?
A) The Backpropagation Algorithm is a fundamental concept covered in this subject B) The Backpropagation Algorithm is mentioned only as a historical footnote C) The Backpropagation Algorithm is not related to this subject D) The Backpropagation Algorithm is an advanced topic beyond this subject's scope
Correct: A)
- If you chose A: The Backpropagation Algorithm is a fundamental concept covered in this subject. This subject covers The Backpropagation Algorithm as part of its core content. Correct!
- If you chose B: This is incorrect. The Backpropagation Algorithm is a fundamental concept covered in this subject. This subject covers The Backpropagation Algorithm as part of its core content.
- If you chose C: This is incorrect. The Backpropagation Algorithm is a fundamental concept covered in this subject. This subject covers The Backpropagation Algorithm as part of its core content.
- If you chose D: This is incorrect. The Backpropagation Algorithm is a fundamental concept covered in this subject. This subject covers The Backpropagation Algorithm as part of its core content.
Q5: Based on the worked examples in this subject, what is the correct result?
A) The inverse of the correct answer B) An unrelated numerical value C) A different result from a common mistake D) Gradient Check
Correct: D)
- If you chose A: This is incorrect. The worked examples show that the result is Gradient Check. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is Gradient Check. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is Gradient Check. The other options represent common errors.
- If you chose D: The worked examples show that the result is Gradient Check. The other options represent common errors. Correct!
Q6: How are The Backpropagation Algorithm and Notation related?
A) The Backpropagation Algorithm and Notation are completely unrelated topics B) The Backpropagation Algorithm and Notation are closely related concepts C) The Backpropagation Algorithm is a special case of Notation D) The Backpropagation Algorithm is the inverse of Notation
Correct: B)
- If you chose A: This is incorrect. Both The Backpropagation Algorithm and Notation are covered in this subject as interconnected topics.
- If you chose B: Both The Backpropagation Algorithm and Notation are covered in this subject as interconnected topics. Correct!
- If you chose C: This is incorrect. Both The Backpropagation Algorithm and Notation are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both The Backpropagation Algorithm and Notation are covered in this subject as interconnected topics.
Q7: What is a common pitfall when working with The Backpropagation Equations?
A) The main error with The Backpropagation Equations is using it when it is not needed B) The Backpropagation Equations is always computed the same way in all contexts C) A common mistake is confusing The Backpropagation Equations with a similar concept D) The Backpropagation Equations has no common misconceptions
Correct: C)
- If you chose A: This is incorrect. Students often confuse The Backpropagation Equations with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse The Backpropagation Equations with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: Students often confuse The Backpropagation Equations with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose D: This is incorrect. Students often confuse The Backpropagation Equations with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply Common Activation Derivatives?
A) Use Common Activation Derivatives only in pure mathematics contexts B) Apply Common Activation Derivatives to solve problems in this subject's domain C) Common Activation Derivatives is not practically useful D) Avoid Common Activation Derivatives unless explicitly instructed
Correct: B)
- If you chose A: This is incorrect. Common Activation Derivatives is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: Common Activation Derivatives is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose C: This is incorrect. Common Activation Derivatives is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. Common Activation Derivatives is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
-
Derive $\partial L/\partial \mathbf{x}$ for a linear layer $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$ given $\partial L/\partial \mathbf{y}$.
Click for answer
$y_i = \sum_j W_{ij} x_j + b_i$. $\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i} W_{ij}$. In matrix form: $\frac{\partial L}{\partial \mathbf{x}} = \mathbf{W}^T \frac{\partial L}{\partial \mathbf{y}}$. This is exactly the backward rule for a linear layer. -
For a batch of $B$ inputs, the forward pass computes $\mathbf{Z} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$ (PyTorch convention: $\mathbf{X} \in \mathbb{R}^{B \times d}$, $\mathbf{W} \in \mathbb{R}^{h \times d}$). Derive $\partial L/\partial \mathbf{W}$ and $\partial L/\partial \mathbf{X}$.
Click for answer
$\frac{\partial L}{\partial \mathbf{W}} = \left(\frac{\partial L}{\partial \mathbf{Z}}\right)^T \mathbf{X}$ — the sum of outer products over the batch. $\frac{\partial L}{\partial \mathbf{X}} = \frac{\partial L}{\partial \mathbf{Z}} \mathbf{W}$. $\frac{\partial L}{\partial \mathbf{b}} = \sum_{b=1}^B \left(\frac{\partial L}{\partial \mathbf{Z}}\right)_b$ — sum over batch dimension (since bias is broadcast). -
Implement the backward pass for a ReLU layer: given $\partial L/\partial \mathbf{a}$ (gradient of loss w.r.t. post-activation), compute $\partial L/\partial \mathbf{z}$ (gradient w.r.t. pre-activation).
Click for answer
$\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial a_i} \cdot \mathbb{1}[z_i > 0]$. In code: `dz = da * (z > 0)`. The gradient is zero for negative pre-activations ("dead" neurons). -
Derive the backprop rule for a residual connection: $\mathbf{y} = \mathbf{x} + F(\mathbf{x})$ where $F$ is a sub-network. Given $\partial L/\partial \mathbf{y}$, what is $\partial L/\partial \mathbf{x}$?
Click for answer
$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \cdot (I + \frac{\partial F}{\partial \mathbf{x}})$. The gradient flows through TWO paths: directly ($\partial L/\partial \mathbf{y}$ via the identity) and through $F$ ($\partial L/\partial \mathbf{y}$ backpropagated through $F$). The identity path preserves gradient magnitude, preventing vanishing gradients — this is why ResNets can be 1000+ layers deep. -
Show that BatchNorm's backward pass involves computing gradients w.r.t. $\gamma$, $\beta$, and the input, accounting for the mean and variance normalization.
Click for answer
Forward: $\mu = \frac{1}{B}\sum x_i$, $\sigma^2 = \frac{1}{B}\sum(x_i-\mu)^2$, $\hat{x}_i = (x_i-\mu)/\sqrt{\sigma^2+\epsilon}$, $y_i = \gamma\hat{x}_i + \beta$. Backward (given $\partial L/\partial y_i$): $\frac{\partial L}{\partial \gamma} = \sum_i \frac{\partial L}{\partial y_i} \hat{x}_i$ $\frac{\partial L}{\partial \beta} = \sum_i \frac{\partial L}{\partial y_i}$ $\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \gamma$ Then propagate through normalization (the messy part involving $\mu$ and $\sigma^2$). The full derivation is lengthy but follows the chain rule mechanically.
Summary
Key takeaways:
- Backpropagation = reverse-mode AD specialized for neural network layer structure
- $\boldsymbol{\delta}^{(l)} = ((\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)}) \odot \sigma'(\mathbf{z}^{(l)})$ — the core recurrence
- $\partial L/\partial \mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^T$ — parameter gradients from outer product
- Softmax + cross-entropy simplifies to $\hat{\mathbf{y}} - \mathbf{y}$ — the most important gradient in classification
- ReLU derivative is 1 for positive inputs, 0 otherwise — dead neurons get zero gradient
- Gradient checking with finite differences validates your backward implementation
- Residual connections add identity gradient path: $\partial L/\partial \mathbf{x} = \partial L/\partial \mathbf{y} + \text{[through F]}$
Pitfalls
- Forgetting to zero gradients before each backward pass: In frameworks like PyTorch,
.backward()accumulates gradients into.gradby default. If you don't calloptimizer.zero_grad()(or manually zero gradients) between batches, you effectively use a larger batch size than intended — and gradients from previous iterations leak into the current step. This is the #1 bug in custom training loops. - Confusing the order of operations in the backward recurrence: The correct order is: multiply incoming gradient by $(\mathbf{W}^{(l+1)})^T$, THEN element-wise multiply by $\sigma'(\mathbf{z}^{(l)})$. Reversing this (applying $\sigma'$ first) is a common error that silently produces wrong gradients because $\sigma'$ must be evaluated at the pre-activation that was stored during the forward pass.
- Not storing activations from the forward pass: Parameter gradients require $\mathbf{a}^{(l-1)}$: $\partial L/\partial \mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^T$. If you overwrite activations during the forward pass (e.g., by reusing variables), the backward pass will fail or use incorrect values. Every deep learning framework stores activations automatically — but if you implement backprop from scratch, this is a hard requirement.
- Implementing activation derivatives incorrectly: Sigmoid's $\sigma(x)(1-\sigma(x))$ requires the output of sigmoid, not the input. ReLU's derivative is
(z > 0).float(), which returns $0$ at $z=0$ in most frameworks (acceptable in practice but technically a subgradient). Always gradient-check your custom backward implementations against finite differences. - Miscomputing bias gradients for batched inputs: In batch mode, $\mathbf{Z} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$, the bias gradient is the SUM of incoming gradients over the batch dimension: $\partial L/\partial \mathbf{b} = \sum_{b} (\partial L/\partial \mathbf{Z})_b$. Simply equating it to $\boldsymbol{\delta}^{(l)}$ without summing over the batch dimension produces gradients that are too small by a factor of batch_size.
Next Steps
Next up: 15-05-numerical-linear-algebra-ml.md — efficient matrix operations, SVD for dimensionality reduction, and solving linear systems in ML contexts.