📐 Concept diagram

15-04 — Backpropagation Implementation

Phase: Numerical Methods for ML | Subject: 15-04 Prerequisites: 15-03-automatic-differentiation.md, 16-01-perceptron-model.md Next subject: 15-05-numerical-linear-algebra-ml.md

Learning Objectives

By the end of this subject, you will be able to:

Derive the backpropagation equations for a feedforward neural network
Implement the forward and backward passes for common layers (Linear, ReLU, Sigmoid, Softmax+CrossEntropy)
Explain why backpropagation requires storing activations from the forward pass
Compute parameter gradients for a two-layer network by hand
Debug gradient computation using finite-difference checks

Core Content

The Backpropagation Algorithm

Given a neural network that computes a scalar loss $L$:

Forward pass: Compute all layer outputs $\mathbf{a}^{(l)}$ and activations $\mathbf{h}^{(l)}$, storing them
Backward pass: Starting from $\partial L/\partial L = 1$, propagate gradients backward through each layer using the chain rule
Parameter gradients: For each layer with parameters, compute $\partial L/\partial \mathbf{W}^{(l)}$ and $\partial L/\partial \mathbf{b}^{(l)}$

Notation

For layer $l$: - $\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$ — pre-activation - $\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})$ — post-activation - $\mathbf{a}^{(0)} = \mathbf{x}$ — input - $\mathbf{a}^{(L)} = \hat{\mathbf{y}}$ — output

Gradient notation: $\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$ (gradient w.r.t. pre-activation)

The Backpropagation Equations

Output layer (for loss $L$, activation $\mathbf{a}^{(L)}$): $$\boldsymbol{\delta}^{(L)} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \odot \sigma'(\mathbf{z}^{(L)})$$

Hidden layers ($l = L-1, \ldots, 1$): $$\boldsymbol{\delta}^{(l)} = \left((\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'(\mathbf{z}^{(l)})$$

Parameter gradients: $$\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T$$ $$\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$

⚠️ CRITICAL — The pattern: Each hidden layer receives $\boldsymbol{\delta}^{(l+1)}$ from above, multiplies by $(\mathbf{W}^{(l+1)})^T$ to propagate the gradient backward through the weights, then element-wise multiplies by the activation derivative $\sigma'(\mathbf{z}^{(l)})$ to account for the nonlinearity.

Common Activation Derivatives

Activation	$\sigma(z)$	$\sigma'(z)$
ReLU	$\max(0, z)$	$1$ if $z > 0$, $0$ otherwise
Sigmoid	$\frac{1}{1+e^{-z}}$	$\sigma(z)(1-\sigma(z))$
Tanh	$\frac{e^z - e^{-z}}{e^z + e^{-z}}$	$1 - \tanh^2(z)$
Leaky ReLU	$\max(\alpha z, z)$	$1$ if $z > 0$, $\alpha$ otherwise

Softmax + Cross-Entropy: The Beautiful Simplification

For classification with $K$ classes, the output is $\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ (softmax), and the loss is $L = -\sum_i y_i \log \hat{y}_i$ (cross-entropy with one-hot targets $y_i$).

The combined gradient simplifies dramatically:

$$\frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y}$$

That's it — no messy product of softmax Jacobian with cross-entropy gradient. This simplification is why softmax+cross-entropy is the universal choice for classification.

Derivation: $\frac{\partial L}{\partial z_k} = \sum_i \frac{\partial L}{\partial \hat{y}i} \frac{\partial \hat{y}_i}{\partial z_k}$ where $\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$ and $\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i(\delta{ik} - \hat{y}k)$. $= \sum_i -\frac{y_i}{\hat{y}_i} \hat{y}_i(\delta{ik} - \hat{y}k) = -\sum_i y_i(\delta{ik} - \hat{y}_k) = -y_k + \hat{y}_k\sum_i y_i = \hat{y}_k - y_k$ (since $\sum_i y_i = 1$).

Gradient Checking

To verify your backward pass implementation, compare with numerical gradients:

$$\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}$$

with $\epsilon \approx 10^{-5}$ for float32, $10^{-7}$ for float64. The relative error should be $< 10^{-7}$ for float64.

Key Terms

Hidden layers
Output layer

Worked Examples

Example 1: Backprop Through a Two-Layer Network

Network: input $\mathbf{x} \in \mathbb{R}^3$, hidden (ReLU, 2 units), output (sigmoid, 1 unit), binary cross-entropy loss.

$\mathbf{W}^{(1)} = \begin{pmatrix}1&2&0\-1&1&3\end{pmatrix}$, $\mathbf{b}^{(1)} = (0, 0)^T$, $\mathbf{W}^{(2)} = (1, -1)$, $\mathbf{b}^{(2)} = 0$. Input: $\mathbf{x} = (1, -1, 2)^T$, Target: $y = 0$.

Forward pass: $\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)} = \begin{pmatrix}1\cdot1+2(-1)+0\cdot2 \ -1\cdot1+1(-1)+3\cdot2\end{pmatrix} = \begin{pmatrix}-1\4\end{pmatrix}$ $\mathbf{a}^{(1)} = \operatorname{ReLU}(\mathbf{z}^{(1)}) = (0, 4)^T$

$z^{(2)} = \mathbf{W}^{(2)}\mathbf{a}^{(1)} + b^{(2)} = 1\cdot0 + (-1)\cdot4 = -4$ $\hat{y} = \sigma(-4) = \frac{1}{1+e^4} \approx 0.0180$

$L = -[0 \cdot \ln(0.0180) + 1 \cdot \ln(1-0.0180)] = -\ln(0.9820) \approx 0.01815$

Backward pass: Using the general BCE formula $L = -[y\ln\hat{y} + (1-y)\ln(1-\hat{y})]$: $\partial L/\partial \hat{y} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$. With $y=0$: $\partial L/\partial \hat{y} = \frac{1}{1-\hat{y}} \approx 1.0183$.

$\delta^{(2)} = \frac{\partial L}{\partial \hat{y}} \cdot \sigma'(z^{(2)}) = 1.0183 \cdot \hat{y}(1-\hat{y}) = 1.0183 \cdot 0.0180 \cdot 0.9820 \approx 0.0180$

$\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \delta^{(2)} \cdot (\mathbf{a}^{(1)})^T = 0.0180 \cdot (0, 4) = (0, 0.0720)$ $\frac{\partial L}{\partial b^{(2)}} = 0.0180$

$\boldsymbol{\delta}^{(1)} = ((\mathbf{W}^{(2)})^T \delta^{(2)}) \odot \operatorname{ReLU}'(\mathbf{z}^{(1)}) = \begin{pmatrix}1\-1\end{pmatrix} \cdot 0.0180 \odot \begin{pmatrix}0\1\end{pmatrix} = \begin{pmatrix}0.0180\-0.0180\end{pmatrix} \odot \begin{pmatrix}0\1\end{pmatrix} = \begin{pmatrix}0\-0.0180\end{pmatrix}$

$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^T = \begin{pmatrix}0\-0.0180\end{pmatrix} (1, -1, 2) = \begin{pmatrix}0&0&0\-0.0180&0.0180&-0.0360\end{pmatrix}$

The zero row for $\mathbf{W}^{(1)}$ makes sense — the first hidden unit is dead (ReLU output 0), so no gradient flows to its weights.

Click for answer

$\partial L/\partial \mathbf{W}^{(2)} = (0, 0.0720)$, $\partial L/\partial b^{(2)} = 0.0180$. $\partial L/\partial \mathbf{W}^{(1)}$ has zero gradient for the first hidden unit (dead ReLU).

Example 2: Gradient Check

For the network in Example 1, verify $\partial L/\partial W^{(2)}_2$ (the second weight in the output layer) using finite differences with $\epsilon = 10^{-5}$.

Solution: The central difference formula is $f'(w) \approx (f(w+\epsilon) - f(w-\epsilon))/(2\epsilon)$. For a simple function $f(w) = w^2$ at $w=3$: Analytical: $f'(3) = 6$. Numerical: $(f(3.0001) - f(2.9999))/0.0002 = (9.00060001 - 8.99940001)/0.0002 = 0.0012/0.0002 = 6.0$. ✓

Click for answer

Finite difference: $f'(w) \approx (f(w+\epsilon) - f(w-\epsilon))/(2\epsilon)$. Use $\epsilon \approx 10^{-5}$ for float32, $10^{-7}$ for float64. Relative error $< 10^{-7}$ indicates correct implementation.

Example 3: Memory of Backprop

A 50-layer ResNet with batch size 32 and hidden size 256 requires storing how much activation memory?

Solution: For a simple feedforward network: per layer, $32 \times 256 = 8192$ floats. At float32: 32 KB per layer. 50 layers: ~1.6 MB — surprisingly small. However, the real memory cost comes from convolutional layers: for a $32 \times 256 \times 56 \times 56$ feature map (batch=32, channels=256, spatial=56×56): $32 \times 256 \times 56 \times 56 \approx 25.7$ million floats ≈ 103 MB per layer. With 50 such layers: ~5 GB — this is why activation checkpointing exists.

Click for answer

For a moderate CNN: ~100 MB per layer × 50 layers ≈ 5 GB. Activation checkpointing reduces this to ~100 MB total by recomputing forward passes during backward.

Quiz

Q1: What does the concept of Output layer primarily refer to in this subject?

A) A historical anecdote about Output layer B) The definition and application of Output layer C) A visual representation of Output layer D) A computational error related to Output layer

Correct: B)

If you chose A: This is incorrect. Output layer is defined as: the definition and application of output layer. The other options describe different aspects that are not the primary focus.
If you chose B: Output layer is defined as: the definition and application of output layer. The other options describe different aspects that are not the primary focus. Correct!
If you chose C: This is incorrect. Output layer is defined as: the definition and application of output layer. The other options describe different aspects that are not the primary focus.
If you chose D: This is incorrect. Output layer is defined as: the definition and application of output layer. The other options describe different aspects that are not the primary focus.

Q2: Which of the following is the key formula discussed in this subject?

A) \mathbf{a}^{(l)} B) The inverse operation of the formula in question C) An unrelated formula from a different topic D) A simplified version of \mathbf{a}^{(l)}...

Correct: A)

If you chose A: The formula \mathbf{a}^{(l)} is central to this subject. The other options are either simplified versions or unrelated. Correct!
If you chose B: This is incorrect. The formula \mathbf{a}^{(l)} is central to this subject. The other options are either simplified versions or unrelated.
If you chose C: This is incorrect. The formula \mathbf{a}^{(l)} is central to this subject. The other options are either simplified versions or unrelated.
If you chose D: This is incorrect. The formula \mathbf{a}^{(l)} is central to this subject. The other options are either simplified versions or unrelated.

Q3: What is the primary purpose of Hidden layers?

A) It is primarily a historical notation system B) It is used only in advanced research contexts C) It replaces all other methods in this domain D) It is used to hidden layers in mathematical analysis

Correct: D)

If you chose A: This is incorrect. Hidden layers serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose B: This is incorrect. Hidden layers serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose C: This is incorrect. Hidden layers serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose D: Hidden layers serves the purpose described in the correct answer. The other options misrepresent its role. Correct!

Q4: Which statement about The Backpropagation Algorithm is TRUE?

A) The Backpropagation Algorithm is a fundamental concept covered in this subject B) The Backpropagation Algorithm is mentioned only as a historical footnote C) The Backpropagation Algorithm is not related to this subject D) The Backpropagation Algorithm is an advanced topic beyond this subject's scope

Correct: A)

If you chose A: The Backpropagation Algorithm is a fundamental concept covered in this subject. This subject covers The Backpropagation Algorithm as part of its core content. Correct!
If you chose B: This is incorrect. The Backpropagation Algorithm is a fundamental concept covered in this subject. This subject covers The Backpropagation Algorithm as part of its core content.
If you chose C: This is incorrect. The Backpropagation Algorithm is a fundamental concept covered in this subject. This subject covers The Backpropagation Algorithm as part of its core content.
If you chose D: This is incorrect. The Backpropagation Algorithm is a fundamental concept covered in this subject. This subject covers The Backpropagation Algorithm as part of its core content.

Q5: Based on the worked examples in this subject, what is the correct result?

A) The inverse of the correct answer B) An unrelated numerical value C) A different result from a common mistake D) Gradient Check

Correct: D)

If you chose A: This is incorrect. The worked examples show that the result is Gradient Check. The other options represent common errors.
If you chose B: This is incorrect. The worked examples show that the result is Gradient Check. The other options represent common errors.
If you chose C: This is incorrect. The worked examples show that the result is Gradient Check. The other options represent common errors.
If you chose D: The worked examples show that the result is Gradient Check. The other options represent common errors. Correct!

Q6: How are The Backpropagation Algorithm and Notation related?

A) The Backpropagation Algorithm and Notation are completely unrelated topics B) The Backpropagation Algorithm and Notation are closely related concepts C) The Backpropagation Algorithm is a special case of Notation D) The Backpropagation Algorithm is the inverse of Notation

Correct: B)

If you chose A: This is incorrect. Both The Backpropagation Algorithm and Notation are covered in this subject as interconnected topics.
If you chose B: Both The Backpropagation Algorithm and Notation are covered in this subject as interconnected topics. Correct!
If you chose C: This is incorrect. Both The Backpropagation Algorithm and Notation are covered in this subject as interconnected topics.
If you chose D: This is incorrect. Both The Backpropagation Algorithm and Notation are covered in this subject as interconnected topics.

Q7: What is a common pitfall when working with The Backpropagation Equations?

A) The main error with The Backpropagation Equations is using it when it is not needed B) The Backpropagation Equations is always computed the same way in all contexts C) A common mistake is confusing The Backpropagation Equations with a similar concept D) The Backpropagation Equations has no common misconceptions

Correct: C)

If you chose A: This is incorrect. Students often confuse The Backpropagation Equations with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose B: This is incorrect. Students often confuse The Backpropagation Equations with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose C: Students often confuse The Backpropagation Equations with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
If you chose D: This is incorrect. Students often confuse The Backpropagation Equations with similar-sounding or related concepts. Pay attention to the precise definitions.

Q8: When should you apply Common Activation Derivatives?

A) Use Common Activation Derivatives only in pure mathematics contexts B) Apply Common Activation Derivatives to solve problems in this subject's domain C) Common Activation Derivatives is not practically useful D) Avoid Common Activation Derivatives unless explicitly instructed

Correct: B)

If you chose A: This is incorrect. Common Activation Derivatives is a practical tool used throughout this subject to solve relevant problems.
If you chose B: Common Activation Derivatives is a practical tool used throughout this subject to solve relevant problems. Correct!
If you chose C: This is incorrect. Common Activation Derivatives is a practical tool used throughout this subject to solve relevant problems.
If you chose D: This is incorrect. Common Activation Derivatives is a practical tool used throughout this subject to solve relevant problems.

Practice Problems

Derive $\partial L/\partial \mathbf{x}$ for a linear layer $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$ given $\partial L/\partial \mathbf{y}$.

Click for answer
$y_i = \sum_j W_{ij} x_j + b_i$. $\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i} W_{ij}$. In matrix form: $\frac{\partial L}{\partial \mathbf{x}} = \mathbf{W}^T \frac{\partial L}{\partial \mathbf{y}}$. This is exactly the backward rule for a linear layer.
For a batch of $B$ inputs, the forward pass computes $\mathbf{Z} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$ (PyTorch convention: $\mathbf{X} \in \mathbb{R}^{B \times d}$, $\mathbf{W} \in \mathbb{R}^{h \times d}$). Derive $\partial L/\partial \mathbf{W}$ and $\partial L/\partial \mathbf{X}$.

Click for answer
$\frac{\partial L}{\partial \mathbf{W}} = \left(\frac{\partial L}{\partial \mathbf{Z}}\right)^T \mathbf{X}$ — the sum of outer products over the batch. $\frac{\partial L}{\partial \mathbf{X}} = \frac{\partial L}{\partial \mathbf{Z}} \mathbf{W}$. $\frac{\partial L}{\partial \mathbf{b}} = \sum_{b=1}^B \left(\frac{\partial L}{\partial \mathbf{Z}}\right)_b$ — sum over batch dimension (since bias is broadcast).
Implement the backward pass for a ReLU layer: given $\partial L/\partial \mathbf{a}$ (gradient of loss w.r.t. post-activation), compute $\partial L/\partial \mathbf{z}$ (gradient w.r.t. pre-activation).

Click for answer
$\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial a_i} \cdot \mathbb{1}[z_i > 0]$. In code: `dz = da * (z > 0)`. The gradient is zero for negative pre-activations ("dead" neurons).
Derive the backprop rule for a residual connection: $\mathbf{y} = \mathbf{x} + F(\mathbf{x})$ where $F$ is a sub-network. Given $\partial L/\partial \mathbf{y}$, what is $\partial L/\partial \mathbf{x}$?

Click for answer
$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \cdot (I + \frac{\partial F}{\partial \mathbf{x}})$. The gradient flows through TWO paths: directly ($\partial L/\partial \mathbf{y}$ via the identity) and through $F$ ($\partial L/\partial \mathbf{y}$ backpropagated through $F$). The identity path preserves gradient magnitude, preventing vanishing gradients — this is why ResNets can be 1000+ layers deep.
Show that BatchNorm's backward pass involves computing gradients w.r.t. $\gamma$, $\beta$, and the input, accounting for the mean and variance normalization.

Click for answer
Forward: $\mu = \frac{1}{B}\sum x_i$, $\sigma^2 = \frac{1}{B}\sum(x_i-\mu)^2$, $\hat{x}_i = (x_i-\mu)/\sqrt{\sigma^2+\epsilon}$, $y_i = \gamma\hat{x}_i + \beta$. Backward (given $\partial L/\partial y_i$): $\frac{\partial L}{\partial \gamma} = \sum_i \frac{\partial L}{\partial y_i} \hat{x}_i$ $\frac{\partial L}{\partial \beta} = \sum_i \frac{\partial L}{\partial y_i}$ $\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \gamma$ Then propagate through normalization (the messy part involving $\mu$ and $\sigma^2$). The full derivation is lengthy but follows the chain rule mechanically.

Summary

Key takeaways:

Backpropagation = reverse-mode AD specialized for neural network layer structure
$\boldsymbol{\delta}^{(l)} = ((\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)}) \odot \sigma'(\mathbf{z}^{(l)})$ — the core recurrence
$\partial L/\partial \mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^T$ — parameter gradients from outer product
Softmax + cross-entropy simplifies to $\hat{\mathbf{y}} - \mathbf{y}$ — the most important gradient in classification
ReLU derivative is 1 for positive inputs, 0 otherwise — dead neurons get zero gradient
Gradient checking with finite differences validates your backward implementation
Residual connections add identity gradient path: $\partial L/\partial \mathbf{x} = \partial L/\partial \mathbf{y} + \text{[through F]}$

Pitfalls

Forgetting to zero gradients before each backward pass: In frameworks like PyTorch, .backward() accumulates gradients into .grad by default. If you don't call optimizer.zero_grad() (or manually zero gradients) between batches, you effectively use a larger batch size than intended — and gradients from previous iterations leak into the current step. This is the #1 bug in custom training loops.
Confusing the order of operations in the backward recurrence: The correct order is: multiply incoming gradient by $(\mathbf{W}^{(l+1)})^T$, THEN element-wise multiply by $\sigma'(\mathbf{z}^{(l)})$. Reversing this (applying $\sigma'$ first) is a common error that silently produces wrong gradients because $\sigma'$ must be evaluated at the pre-activation that was stored during the forward pass.
Not storing activations from the forward pass: Parameter gradients require $\mathbf{a}^{(l-1)}$: $\partial L/\partial \mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^T$. If you overwrite activations during the forward pass (e.g., by reusing variables), the backward pass will fail or use incorrect values. Every deep learning framework stores activations automatically — but if you implement backprop from scratch, this is a hard requirement.
Implementing activation derivatives incorrectly: Sigmoid's $\sigma(x)(1-\sigma(x))$ requires the output of sigmoid, not the input. ReLU's derivative is (z > 0).float(), which returns $0$ at $z=0$ in most frameworks (acceptable in practice but technically a subgradient). Always gradient-check your custom backward implementations against finite differences.
Miscomputing bias gradients for batched inputs: In batch mode, $\mathbf{Z} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$, the bias gradient is the SUM of incoming gradients over the batch dimension: $\partial L/\partial \mathbf{b} = \sum_{b} (\partial L/\partial \mathbf{Z})_b$. Simply equating it to $\boldsymbol{\delta}^{(l)}$ without summing over the batch dimension produces gradients that are too small by a factor of batch_size.

Next Steps

Next up: 15-05-numerical-linear-algebra-ml.md — efficient matrix operations, SVD for dimensionality reduction, and solving linear systems in ML contexts.

Progress

Phases

15-04 — Backpropagation Implementation

Learning Objectives

Core Content

The Backpropagation Algorithm

Notation

The Backpropagation Equations

Common Activation Derivatives

Softmax + Cross-Entropy: The Beautiful Simplification

Gradient Checking

Key Terms

Worked Examples

Example 1: Backprop Through a Two-Layer Network

Example 2: Gradient Check

Example 3: Memory of Backprop

Quiz

Practice Problems

Summary

Pitfalls

Next Steps