Math graphic
📐 Concept diagram

15-04 — Backpropagation Implementation

Phase: Numerical Methods for ML | Subject: 15-04 Prerequisites: 15-03-automatic-differentiation.md, 16-01-perceptron-model.md Next subject: 15-05-numerical-linear-algebra-ml.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the backpropagation equations for a feedforward neural network
  2. Implement the forward and backward passes for common layers (Linear, ReLU, Sigmoid, Softmax+CrossEntropy)
  3. Explain why backpropagation requires storing activations from the forward pass
  4. Compute parameter gradients for a two-layer network by hand
  5. Debug gradient computation using finite-difference checks

Core Content

The Backpropagation Algorithm

Given a neural network that computes a scalar loss $L$:

  1. Forward pass: Compute all layer outputs $\mathbf{a}^{(l)}$ and activations $\mathbf{h}^{(l)}$, storing them
  2. Backward pass: Starting from $\partial L/\partial L = 1$, propagate gradients backward through each layer using the chain rule
  3. Parameter gradients: For each layer with parameters, compute $\partial L/\partial \mathbf{W}^{(l)}$ and $\partial L/\partial \mathbf{b}^{(l)}$

Notation

For layer $l$: - $\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$ — pre-activation - $\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})$ — post-activation - $\mathbf{a}^{(0)} = \mathbf{x}$ — input - $\mathbf{a}^{(L)} = \hat{\mathbf{y}}$ — output

Gradient notation: $\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$ (gradient w.r.t. pre-activation)

The Backpropagation Equations

Output layer (for loss $L$, activation $\mathbf{a}^{(L)}$): $$\boldsymbol{\delta}^{(L)} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \odot \sigma'(\mathbf{z}^{(L)})$$

Hidden layers ($l = L-1, \ldots, 1$): $$\boldsymbol{\delta}^{(l)} = \left((\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'(\mathbf{z}^{(l)})$$

Parameter gradients: $$\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T$$ $$\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$

⚠️ CRITICAL — The pattern: Each hidden layer receives $\boldsymbol{\delta}^{(l+1)}$ from above, multiplies by $(\mathbf{W}^{(l+1)})^T$ to propagate the gradient backward through the weights, then element-wise multiplies by the activation derivative $\sigma'(\mathbf{z}^{(l)})$ to account for the nonlinearity.

Common Activation Derivatives

Activation $\sigma(z)$ $\sigma'(z)$
ReLU $\max(0, z)$ $1$ if $z > 0$, $0$ otherwise
Sigmoid $\frac{1}{1+e^{-z}}$ $\sigma(z)(1-\sigma(z))$
Tanh $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ $1 - \tanh^2(z)$
Leaky ReLU $\max(\alpha z, z)$ $1$ if $z > 0$, $\alpha$ otherwise

Softmax + Cross-Entropy: The Beautiful Simplification

For classification with $K$ classes, the output is $\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ (softmax), and the loss is $L = -\sum_i y_i \log \hat{y}_i$ (cross-entropy with one-hot targets $y_i$).

The combined gradient simplifies dramatically:

$$\frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y}$$

That's it — no messy product of softmax Jacobian with cross-entropy gradient. This simplification is why softmax+cross-entropy is the universal choice for classification.

Derivation: $\frac{\partial L}{\partial z_k} = \sum_i \frac{\partial L}{\partial \hat{y}i} \frac{\partial \hat{y}_i}{\partial z_k}$ where $\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$ and $\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i(\delta{ik} - \hat{y}k)$. $= \sum_i -\frac{y_i}{\hat{y}_i} \hat{y}_i(\delta{ik} - \hat{y}k) = -\sum_i y_i(\delta{ik} - \hat{y}_k) = -y_k + \hat{y}_k\sum_i y_i = \hat{y}_k - y_k$ (since $\sum_i y_i = 1$).

Gradient Checking

To verify your backward pass implementation, compare with numerical gradients:

$$\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}$$

with $\epsilon \approx 10^{-5}$ for float32, $10^{-7}$ for float64. The relative error should be $< 10^{-7}$ for float64.



Key Terms

Worked Examples

Example 1: Backprop Through a Two-Layer Network

Network: input $\mathbf{x} \in \mathbb{R}^3$, hidden (ReLU, 2 units), output (sigmoid, 1 unit), binary cross-entropy loss.

$\mathbf{W}^{(1)} = \begin{pmatrix}1&2&0\-1&1&3\end{pmatrix}$, $\mathbf{b}^{(1)} = (0, 0)^T$, $\mathbf{W}^{(2)} = (1, -1)$, $\mathbf{b}^{(2)} = 0$. Input: $\mathbf{x} = (1, -1, 2)^T$, Target: $y = 0$.

Forward pass: $\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)} = \begin{pmatrix}1\cdot1+2(-1)+0\cdot2 \ -1\cdot1+1(-1)+3\cdot2\end{pmatrix} = \begin{pmatrix}-1\4\end{pmatrix}$ $\mathbf{a}^{(1)} = \operatorname{ReLU}(\mathbf{z}^{(1)}) = (0, 4)^T$

$z^{(2)} = \mathbf{W}^{(2)}\mathbf{a}^{(1)} + b^{(2)} = 1\cdot0 + (-1)\cdot4 = -4$ $\hat{y} = \sigma(-4) = \frac{1}{1+e^4} \approx 0.0180$

$L = -[0 \cdot \ln(0.0180) + 1 \cdot \ln(1-0.0180)] = -\ln(0.9820) \approx 0.01815$

Backward pass: Using the general BCE formula $L = -[y\ln\hat{y} + (1-y)\ln(1-\hat{y})]$: $\partial L/\partial \hat{y} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$. With $y=0$: $\partial L/\partial \hat{y} = \frac{1}{1-\hat{y}} \approx 1.0183$.

$\delta^{(2)} = \frac{\partial L}{\partial \hat{y}} \cdot \sigma'(z^{(2)}) = 1.0183 \cdot \hat{y}(1-\hat{y}) = 1.0183 \cdot 0.0180 \cdot 0.9820 \approx 0.0180$

$\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \delta^{(2)} \cdot (\mathbf{a}^{(1)})^T = 0.0180 \cdot (0, 4) = (0, 0.0720)$ $\frac{\partial L}{\partial b^{(2)}} = 0.0180$

$\boldsymbol{\delta}^{(1)} = ((\mathbf{W}^{(2)})^T \delta^{(2)}) \odot \operatorname{ReLU}'(\mathbf{z}^{(1)}) = \begin{pmatrix}1\-1\end{pmatrix} \cdot 0.0180 \odot \begin{pmatrix}0\1\end{pmatrix} = \begin{pmatrix}0.0180\-0.0180\end{pmatrix} \odot \begin{pmatrix}0\1\end{pmatrix} = \begin{pmatrix}0\-0.0180\end{pmatrix}$

$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^T = \begin{pmatrix}0\-0.0180\end{pmatrix} (1, -1, 2) = \begin{pmatrix}0&0&0\-0.0180&0.0180&-0.0360\end{pmatrix}$

The zero row for $\mathbf{W}^{(1)}$ makes sense — the first hidden unit is dead (ReLU output 0), so no gradient flows to its weights.

Click for answer $\partial L/\partial \mathbf{W}^{(2)} = (0, 0.0720)$, $\partial L/\partial b^{(2)} = 0.0180$. $\partial L/\partial \mathbf{W}^{(1)}$ has zero gradient for the first hidden unit (dead ReLU).

Example 2: Gradient Check

For the network in Example 1, verify $\partial L/\partial W^{(2)}_2$ (the second weight in the output layer) using finite differences with $\epsilon = 10^{-5}$.

Solution: The central difference formula is $f'(w) \approx (f(w+\epsilon) - f(w-\epsilon))/(2\epsilon)$. For a simple function $f(w) = w^2$ at $w=3$: Analytical: $f'(3) = 6$. Numerical: $(f(3.0001) - f(2.9999))/0.0002 = (9.00060001 - 8.99940001)/0.0002 = 0.0012/0.0002 = 6.0$. ✓

Click for answer Finite difference: $f'(w) \approx (f(w+\epsilon) - f(w-\epsilon))/(2\epsilon)$. Use $\epsilon \approx 10^{-5}$ for float32, $10^{-7}$ for float64. Relative error $< 10^{-7}$ indicates correct implementation.

Example 3: Memory of Backprop

A 50-layer ResNet with batch size 32 and hidden size 256 requires storing how much activation memory?

Solution: For a simple feedforward network: per layer, $32 \times 256 = 8192$ floats. At float32: 32 KB per layer. 50 layers: ~1.6 MB — surprisingly small. However, the real memory cost comes from convolutional layers: for a $32 \times 256 \times 56 \times 56$ feature map (batch=32, channels=256, spatial=56×56): $32 \times 256 \times 56 \times 56 \approx 25.7$ million floats ≈ 103 MB per layer. With 50 such layers: ~5 GB — this is why activation checkpointing exists.

Click for answer For a moderate CNN: ~100 MB per layer × 50 layers ≈ 5 GB. Activation checkpointing reduces this to ~100 MB total by recomputing forward passes during backward.


Quiz

Q1: What does the concept of Output layer primarily refer to in this subject?

A) A historical anecdote about Output layer B) The definition and application of Output layer C) A visual representation of Output layer D) A computational error related to Output layer

Correct: B)

Q2: Which of the following is the key formula discussed in this subject?

A) \mathbf{a}^{(l)} B) The inverse operation of the formula in question C) An unrelated formula from a different topic D) A simplified version of \mathbf{a}^{(l)}...

Correct: A)

Q3: What is the primary purpose of Hidden layers?

A) It is primarily a historical notation system B) It is used only in advanced research contexts C) It replaces all other methods in this domain D) It is used to hidden layers in mathematical analysis

Correct: D)

Q4: Which statement about The Backpropagation Algorithm is TRUE?

A) The Backpropagation Algorithm is a fundamental concept covered in this subject B) The Backpropagation Algorithm is mentioned only as a historical footnote C) The Backpropagation Algorithm is not related to this subject D) The Backpropagation Algorithm is an advanced topic beyond this subject's scope

Correct: A)

Q5: Based on the worked examples in this subject, what is the correct result?

A) The inverse of the correct answer B) An unrelated numerical value C) A different result from a common mistake D) Gradient Check

Correct: D)

Q6: How are The Backpropagation Algorithm and Notation related?

A) The Backpropagation Algorithm and Notation are completely unrelated topics B) The Backpropagation Algorithm and Notation are closely related concepts C) The Backpropagation Algorithm is a special case of Notation D) The Backpropagation Algorithm is the inverse of Notation

Correct: B)

Q7: What is a common pitfall when working with The Backpropagation Equations?

A) The main error with The Backpropagation Equations is using it when it is not needed B) The Backpropagation Equations is always computed the same way in all contexts C) A common mistake is confusing The Backpropagation Equations with a similar concept D) The Backpropagation Equations has no common misconceptions

Correct: C)

Q8: When should you apply Common Activation Derivatives?

A) Use Common Activation Derivatives only in pure mathematics contexts B) Apply Common Activation Derivatives to solve problems in this subject's domain C) Common Activation Derivatives is not practically useful D) Avoid Common Activation Derivatives unless explicitly instructed

Correct: B)

Practice Problems

  1. Derive $\partial L/\partial \mathbf{x}$ for a linear layer $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$ given $\partial L/\partial \mathbf{y}$.

    Click for answer $y_i = \sum_j W_{ij} x_j + b_i$. $\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i} W_{ij}$. In matrix form: $\frac{\partial L}{\partial \mathbf{x}} = \mathbf{W}^T \frac{\partial L}{\partial \mathbf{y}}$. This is exactly the backward rule for a linear layer.

  2. For a batch of $B$ inputs, the forward pass computes $\mathbf{Z} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$ (PyTorch convention: $\mathbf{X} \in \mathbb{R}^{B \times d}$, $\mathbf{W} \in \mathbb{R}^{h \times d}$). Derive $\partial L/\partial \mathbf{W}$ and $\partial L/\partial \mathbf{X}$.

    Click for answer $\frac{\partial L}{\partial \mathbf{W}} = \left(\frac{\partial L}{\partial \mathbf{Z}}\right)^T \mathbf{X}$ — the sum of outer products over the batch. $\frac{\partial L}{\partial \mathbf{X}} = \frac{\partial L}{\partial \mathbf{Z}} \mathbf{W}$. $\frac{\partial L}{\partial \mathbf{b}} = \sum_{b=1}^B \left(\frac{\partial L}{\partial \mathbf{Z}}\right)_b$ — sum over batch dimension (since bias is broadcast).

  3. Implement the backward pass for a ReLU layer: given $\partial L/\partial \mathbf{a}$ (gradient of loss w.r.t. post-activation), compute $\partial L/\partial \mathbf{z}$ (gradient w.r.t. pre-activation).

    Click for answer $\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial a_i} \cdot \mathbb{1}[z_i > 0]$. In code: `dz = da * (z > 0)`. The gradient is zero for negative pre-activations ("dead" neurons).

  4. Derive the backprop rule for a residual connection: $\mathbf{y} = \mathbf{x} + F(\mathbf{x})$ where $F$ is a sub-network. Given $\partial L/\partial \mathbf{y}$, what is $\partial L/\partial \mathbf{x}$?

    Click for answer $\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \cdot (I + \frac{\partial F}{\partial \mathbf{x}})$. The gradient flows through TWO paths: directly ($\partial L/\partial \mathbf{y}$ via the identity) and through $F$ ($\partial L/\partial \mathbf{y}$ backpropagated through $F$). The identity path preserves gradient magnitude, preventing vanishing gradients — this is why ResNets can be 1000+ layers deep.

  5. Show that BatchNorm's backward pass involves computing gradients w.r.t. $\gamma$, $\beta$, and the input, accounting for the mean and variance normalization.

    Click for answer Forward: $\mu = \frac{1}{B}\sum x_i$, $\sigma^2 = \frac{1}{B}\sum(x_i-\mu)^2$, $\hat{x}_i = (x_i-\mu)/\sqrt{\sigma^2+\epsilon}$, $y_i = \gamma\hat{x}_i + \beta$. Backward (given $\partial L/\partial y_i$): $\frac{\partial L}{\partial \gamma} = \sum_i \frac{\partial L}{\partial y_i} \hat{x}_i$ $\frac{\partial L}{\partial \beta} = \sum_i \frac{\partial L}{\partial y_i}$ $\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \gamma$ Then propagate through normalization (the messy part involving $\mu$ and $\sigma^2$). The full derivation is lengthy but follows the chain rule mechanically.


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 15-05-numerical-linear-algebra-ml.md — efficient matrix operations, SVD for dimensionality reduction, and solving linear systems in ML contexts.