
16-05: Backpropagation (Mathematics)

Phase: 16, Neural Network Mathematics
Subject: 16-05
Prerequisites: 16-01 to 16-04 (perceptron, activations, softmax, losses), Phase 4 (chain rule), Phase 6 (partial derivatives), Phase 9 (matrix calculus basics)
Next subject: 16-06, Gradient Flow in Deep Networks


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the backpropagation equations for a feedforward neural network using the chain rule
  2. Compute ∂L/∂W^(ℓ) and ∂L/∂b^(ℓ) for any layer ℓ of a multi-layer perceptron
  3. Explain the "local gradient" concept and how backprop stores intermediate results to avoid recomputation
  4. Trace gradient flow through common layer types: linear, activation, and loss
  5. Understand why backpropagation is reverse-mode automatic differentiation and compute its computational complexity

Core Content

1. The Problem: How Do We Train a Deep Network?

A feedforward neural network with L layers computes:

a^(0) = x   (input)
z^(ℓ) = W^(ℓ) a^(ℓ−1) + b^(ℓ)   (pre-activation, layer ℓ)
a^(ℓ) = f_ℓ(z^(ℓ))   (activation, layer ℓ)

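The final output is a^(L) = ŷ. We compute a loss L(ŷ, y). (Note that L denotes both the number of layers, as in a^(L), and the loss; the context distinguishes the two.)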
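The forward recurrence above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `forward`, the layer sizes, and the choice of ReLU for the hidden layer and a linear output are assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def forward(x, weights, biases, activations):
    """Run the forward pass; returns pre-activations zs and activations acts (acts[0] is x)."""
    acts = [x]
    zs = []
    for W, b, f in zip(weights, biases, activations):
        z = W @ acts[-1] + b      # z^(l) = W^(l) a^(l-1) + b^(l)
        zs.append(z)
        acts.append(f(z))         # a^(l) = f_l(z^(l))
    return zs, acts

# Hypothetical network: 3 inputs -> 4 hidden units (ReLU) -> 2 outputs (linear)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
x = rng.normal(size=3)

zs, acts = forward(x, weights, biases, [relu, identity])
y_hat = acts[-1]                  # a^(L), the network output
```

Storing every z^(ℓ) and a^(ℓ) during the forward pass is what later lets the backward pass reuse them instead of recomputing.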

To train the network, we need the gradient of the loss with respect to EVERY parameter (W^(ℓ) and b^(ℓ) for all ℓ). With millions of parameters, computing each gradient individually via finite differences would need at least one extra forward pass per parameter, which is far too slow.

⚠️ THIS IS CRITICAL: Backpropagation is the algorithm that makes training deep networks computationally feasible. It computes all gradients in ONE forward pass and ONE backward pass, at a total cost that is only a small constant factor times the cost of the forward pass itself.

2. The Chain Rule on Computational Graphs

The key insight: the loss L is a composition of many functions. By the multivariate chain rule:

∂L/∂W^(ℓ) = ∂L/∂z^(ℓ) · ∂z^(ℓ)/∂W^(ℓ)

The first factor (∂L/∂z^(ℓ)) depends on ALL layers after ℓ. Backprop computes these "errors" working backward from the output.

Local gradients: Each layer only needs to know:

  1. Its own local gradients: ∂z^(ℓ)/∂a^(ℓ−1), ∂z^(ℓ)/∂W^(ℓ), ∂z^(ℓ)/∂b^(ℓ)
  2. The "error signal" coming from later layers: ∂L/∂z^(ℓ)

We define δ^(ℓ) = ∂L/∂z^(ℓ) (the error at layer ℓ's pre-activation). These are the quantities we propagate backward.

3. The Backpropagation Equations

For a network with L layers, loss L, and activation function f:

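Output layer error (ℓ = L):

δ^(L) = ∂L/∂z^(L) = ∂L/∂a^(L) ⊙ f'_L(z^(L))

where ⊙ denotes element-wise multiplication. This form assumes the output activation f_L acts element-wise (for example, sigmoid or linear).

For CCE + softmax, where the activation is not element-wise: ∂L/∂z^(L) = ŷ − y (the clean gradient we derived in 16-04).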
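As a quick sanity check of the softmax + CCE result, the sketch below compares ŷ − y against a central-difference estimate. The logits `z` and the one-hot target `y` are made-up values chosen only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

def cce(z, y):
    """Categorical cross-entropy of softmax(z) against a one-hot target y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])    # hypothetical output pre-activations
y = np.array([0.0, 0.0, 1.0])     # hypothetical one-hot target

analytic = softmax(z) - y         # the claimed gradient dL/dz

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cce(z + step, y) - cce(z - step, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
```

Under these assumptions `max_abs_diff` should be tiny, which is consistent with δ^(L) = ŷ − y.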

Backward recurrence (ℓ = L−1, ..., 1):

δ^(ℓ) = ((W^(ℓ+1))ᵀ δ^(ℓ+1)) ⊙ f'_ℓ(z^(ℓ))

Parameter gradients:

∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ
∂L/∂b^(ℓ) = δ^(ℓ)

4. Deriving the Backward Recurrence

Let's derive δ^(ℓ) from δ^(ℓ+1) step by step.

Step 1: Apply the chain rule through the activation.

a^(ℓ) = f_ℓ(z^(ℓ))
∂L/∂z^(ℓ) = ∂L/∂a^(ℓ) · ∂a^(ℓ)/∂z^(ℓ)

The Jacobian of an element-wise activation is diagonal: ∂aᵢ^(ℓ)/∂zⱼ^(ℓ) = f'_ℓ(zᵢ^(ℓ)) · δ_{ij}, where δ_{ij} is the Kronecker delta (not the error δ^(ℓ)). So:

∂L/∂z^(ℓ) = ∂L/∂a^(ℓ) ⊙ f'_ℓ(z^(ℓ))

Step 2: Relate ∂L/∂a^(ℓ) to δ^(ℓ+1).

z^(ℓ+1) = W^(ℓ+1) a^(ℓ) + b^(ℓ+1)
∂zᵢ^(ℓ+1)/∂aⱼ^(ℓ) = W_{ij}^(ℓ+1)

By the chain rule:

∂L/∂aⱼ^(ℓ) = Σᵢ ∂L/∂zᵢ^(ℓ+1) · ∂zᵢ^(ℓ+1)/∂aⱼ^(ℓ)
           = Σᵢ δᵢ^(ℓ+1) · W_{ij}^(ℓ+1)
           = ((W^(ℓ+1))ᵀ δ^(ℓ+1))_j

Putting it together:

δ^(ℓ) = ((W^(ℓ+1))ᵀ δ^(ℓ+1)) ⊙ f'_ℓ(z^(ℓ)) ✓

5. Deriving Parameter Gradients

Weight gradient:

∂zᵢ^(ℓ)/∂W_{jk}^(ℓ) = { aₖ^(ℓ−1)   if i = j
                      { 0           if i ≠ j

Each weight W_{ij}^(ℓ) affects only zᵢ^(ℓ). By the chain rule:

∂L/∂W_{ij}^(ℓ) = ∂L/∂zᵢ^(ℓ) · ∂zᵢ^(ℓ)/∂W_{ij}^(ℓ) = δᵢ^(ℓ) · aⱼ^(ℓ−1)

In matrix form: ∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ ✓

Bias gradient:

zᵢ^(ℓ) = ... + bᵢ^(ℓ), so ∂zᵢ^(ℓ)/∂bᵢ^(ℓ) = 1.

∂L/∂bᵢ^(ℓ) = δᵢ^(ℓ) · 1 = δᵢ^(ℓ)
∂L/∂b^(ℓ) = δ^(ℓ) ✓

6. Backpropagation Algorithm (Step by Step)

Forward pass:

  1. Set a^(0) = x
  2. For ℓ = 1 to L:
     - z^(ℓ) = W^(ℓ) a^(ℓ−1) + b^(ℓ)
     - a^(ℓ) = f_ℓ(z^(ℓ))
  3. Compute L = loss(a^(L), y)

Backward pass:

  4. Compute δ^(L) = ∂L/∂z^(L) (using loss and activation derivatives)
  5. For ℓ = L down to 1:
     - ∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ
     - ∂L/∂b^(ℓ) = δ^(ℓ)
     - If ℓ > 1: δ^(ℓ−1) = ((W^(ℓ))ᵀ δ^(ℓ)) ⊙ f'_{ℓ−1}(z^(ℓ−1))

Parameter update:

  6. For ℓ = 1 to L:
     - W^(ℓ) ← W^(ℓ) − η · ∂L/∂W^(ℓ)
     - b^(ℓ) ← b^(ℓ) − η · ∂L/∂b^(ℓ)
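The six steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: hidden layers use ReLU, the output layer is linear, the loss is the squared error L = ‖ŷ − y‖², and the weights, input, target, and learning rate `eta` are hypothetical values chosen so the result can be checked by hand. Note that the Python lists are 0-indexed, so `Ws[l]` holds W^(l+1).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)      # convention: derivative 0 at z = 0

def forward(x, Ws, bs):
    """Steps 1-2: ReLU on hidden layers, linear output; stores every z and a."""
    acts, zs = [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ acts[-1] + b
        zs.append(z)
        is_last = (l == len(Ws) - 1)
        acts.append(z if is_last else relu(z))
    return zs, acts

def backward(y, Ws, zs, acts):
    """Steps 4-5 for the squared-error loss L = ||y_hat - y||^2."""
    grads_W = [None] * len(Ws)
    grads_b = [None] * len(Ws)
    delta = 2.0 * (acts[-1] - y)                  # output error (linear output layer)
    for l in reversed(range(len(Ws))):
        grads_W[l] = np.outer(delta, acts[l])     # delta times (input activation)^T
        grads_b[l] = delta
        if l > 0:                                 # propagate the error one layer back
            delta = (Ws[l].T @ delta) * relu_prime(zs[l - 1])
    return grads_W, grads_b

# Hypothetical 2-2-1 network
Ws = [np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([[2.0, -1.0]])]
bs = [np.array([0.0, -1.0]), np.array([0.5])]
x, y = np.array([1.0, 2.0]), np.array([1.0])

zs, acts = forward(x, Ws, bs)
loss_before = float(np.sum((acts[-1] - y) ** 2))  # step 3
grads_W, grads_b = backward(y, Ws, zs, acts)

eta = 0.01                                        # step 6: one gradient descent update
Ws = [W - eta * gW for W, gW in zip(Ws, grads_W)]
bs = [b - eta * gb for b, gb in zip(bs, grads_b)]
loss_after = float(np.sum((forward(x, Ws, bs)[1][-1] - y) ** 2))
```

For these values the first hidden unit is inactive (z = −1), so its row of the first-layer weight gradient is zero, and one small update step lowers the loss.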

7. Why "Back"-propagation?

The algorithm computes gradients from the OUTPUT backward toward the INPUT. This is reverse-mode automatic differentiation.

Consider the computation graph where each node is an operation. Forward-mode AD (starting from inputs) computes the derivatives of ALL outputs with respect to one input per pass, so it is efficient when there are few inputs. Reverse-mode AD computes the derivatives of one output with respect to ALL inputs and intermediate values per pass, so it is efficient when there are few outputs. This is ideal for neural networks, where we have ONE scalar output (the loss) and MANY parameters.

Computational complexity: The backward pass has the same asymptotic cost as the forward pass, and a full forward-plus-backward computation typically costs about 2-3× the forward pass alone. For a linear layer, for example, the forward pass needs one matrix product (W a), while the backward pass needs two ((W)ᵀ δ and δ aᵀ). More generally, each operation in the forward graph has a corresponding backward operation of similar complexity.
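To make the constant factor concrete, the sketch below counts multiply-accumulate operations for a single linear layer. It only counts the matrix products and ignores biases and activations, and the layer sizes are hypothetical.

```python
def matmul_macs(rows, inner, cols):
    """Multiply-accumulate count of a (rows x inner) @ (inner x cols) product."""
    return rows * inner * cols

d_in, d_out, batch = 64, 128, 32      # hypothetical layer sizes

forward_macs = matmul_macs(d_out, d_in, batch)       # Z = W A
backward_macs = (
    matmul_macs(d_in, d_out, batch)                  # W^T Delta: error for the previous layer
    + matmul_macs(d_out, batch, d_in)                # Delta A^T: weight gradient
)

ratio = backward_macs / forward_macs                        # backward vs forward
total_ratio = (forward_macs + backward_macs) / forward_macs # forward + backward vs forward
```

Under this simplified count the backward pass costs twice the forward pass, so one training step costs about three forward passes, independent of the layer sizes.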

8. Handling Batches

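In practice, we process mini-batches of M examples simultaneously. The forward pass processes a matrix X ∈ ℝ^{n×M} whose columns are the examples, with A^(0) = X:

Z^(ℓ) = W^(ℓ) A^(ℓ−1) + b^(ℓ)   (b^(ℓ) broadcast across the M columns)

Let Δ^(ℓ) be the matrix whose m-th column is the error δ^(ℓ) of example m, computed from that example's own loss. When the batch loss is the mean of the per-example losses, the parameter gradients become averages:

∂L/∂W^(ℓ) = (1/M) Δ^(ℓ) (A^(ℓ−1))ᵀ
∂L/∂b^(ℓ) = (1/M) Σ_{examples} δ^(ℓ)   (sum across the batch dimension)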
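The batched formulas can be checked against a per-example loop. In this sketch we take Δ^(ℓ) to be the matrix whose columns are the per-example errors δ^(ℓ) (an assumption about the notation), and fill it and A^(ℓ−1) with random placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, M = 3, 2, 5                  # hypothetical sizes

A_prev = rng.normal(size=(d_in, M))       # columns: per-example activations a^(l-1)
Delta = rng.normal(size=(d_out, M))       # columns: per-example errors delta^(l)

# Batched form: one matrix product, then divide by the batch size
grad_W = (Delta @ A_prev.T) / M
grad_b = Delta.sum(axis=1) / M

# Reference: average of the per-example outer products delta a^T
grad_W_loop = np.mean(
    [np.outer(Delta[:, m], A_prev[:, m]) for m in range(M)], axis=0
)
grad_b_loop = np.mean([Delta[:, m] for m in range(M)], axis=0)
```

The matrix product Δ Aᵀ sums the M outer products in one call, which is why the batched form is preferred in practice.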

9. Gradient Checking

How do we verify our backprop implementation is correct? Gradient checking via finite differences:

∂L/∂W_{ij} ≈ [L(W_{ij} + ε) − L(W_{ij} − ε)] / (2ε)

For ε ≈ 10⁻⁴, the relative error between analytical and numerical gradients should be < 10⁻⁷ in double precision. In float32, rounding noise is much larger, so only about < 10⁻³ can be expected. In double precision, an error around 10⁻³ or larger likely indicates a bug, while an error around 10⁻⁷ or smaller indicates that your backprop is correct.
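A gradient check along these lines might look like the sketch below. The single tanh layer, the squared-error loss, the particular definition of relative error, and all numeric values are assumptions made for the example, and the "buggy" gradient is deliberately wrong to show what a failed check looks like.

```python
import numpy as np

def loss_fn(W, x, y):
    """Squared error of a single tanh layer: ||tanh(W x) - y||^2."""
    return float(np.sum((np.tanh(W @ x) - y) ** 2))

def analytic_grad(W, x, y):
    """Backprop gradient: delta = dL/dz, then dL/dW = delta x^T."""
    a = np.tanh(W @ x)
    delta = 2.0 * (a - y) * (1.0 - a ** 2)
    return np.outer(delta, x)

def numerical_grad(W, x, y, eps=1e-4):
    """Central-difference estimate of dL/dW, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (loss_fn(W_plus, x, y) - loss_fn(W_minus, x, y)) / (2 * eps)
    return grad

def relative_error(g1, g2):
    return np.linalg.norm(g1 - g2) / (np.linalg.norm(g1) + np.linalg.norm(g2))

# Hypothetical double-precision values
W = np.array([[0.1, -0.2, 0.3], [0.4, 0.0, -0.5]])
x = np.array([0.5, -1.0, 2.0])
y = np.array([0.2, -0.1])

err_ok = relative_error(analytic_grad(W, x, y), numerical_grad(W, x, y))

# Deliberately buggy gradient: forgets the activation derivative (1 - a^2)
buggy = np.outer(2.0 * (np.tanh(W @ x) - y), x)
err_bug = relative_error(buggy, numerical_grad(W, x, y))
```

Here `err_ok` should be very small while `err_bug` is orders of magnitude larger, which is the pattern that separates a correct backward pass from a broken one.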



Key Terms

Worked Examples

Example 1: 2-Layer Network Backprop by Hand

Problem: A tiny network: input x = [1] (scalar), one hidden neuron, one output neuron. Parameters: w₁ = 2, b₁ = 0, w₂ = 3, b₂ = 1. Hidden activation: ReLU. Output activation: linear. Loss: MSE, here the squared error L = (ŷ − y)² of the single example. Target: y = 10.

Compute all gradients by hand.

Solution:

Forward pass:

z₁ = w₁·x + b₁ = 2·1 + 0 = 2
a₁ = ReLU(2) = 2
z₂ = w₂·a₁ + b₂ = 3·2 + 1 = 7
ŷ = z₂ = 7
L = (7 − 10)² = 9

Backward pass:

∂L/∂z₂ = ∂L/∂ŷ · ∂ŷ/∂z₂ = 2(7−10)·1 = −6

∂L/∂w₂ = ∂L/∂z₂ · a₁ = −6·2 = −12
∂L/∂b₂ = ∂L/∂z₂ = −6

δ₁ = ∂L/∂z₁ = ∂L/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ = −6 · w₂ · ReLU'(2) = −6 · 3 · 1 = −18

∂L/∂w₁ = δ₁ · x = −18·1 = −18
∂L/∂b₁ = δ₁ = −18

Verification: If we increase w₁ slightly, z₁ → a₁ → z₂ all increase, and ŷ gets closer to 10, so L decreases. The negative gradient ∂L/∂w₁ = −18 is consistent with this: gradient descent will ADD to w₁ (w₁ ← w₁ − η·(−18) = w₁ + 18η).
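The hand computation can also be cross-checked numerically. The sketch below re-implements the tiny network as a plain Python function (the function and dictionary names are our own) and compares the hand-derived gradients with central differences.

```python
def loss(w1, b1, w2, b2, x=1.0, y=10.0):
    """Squared error of the tiny ReLU network from this example."""
    a1 = max(0.0, w1 * x + b1)        # hidden neuron with ReLU
    y_hat = w2 * a1 + b2              # linear output neuron
    return (y_hat - y) ** 2

params = {"w1": 2.0, "b1": 0.0, "w2": 3.0, "b2": 1.0}

# Gradients derived by hand in the solution above
analytic = {"w1": -18.0, "b1": -18.0, "w2": -12.0, "b2": -6.0}

eps = 1e-5
numeric = {}
for name in params:
    plus = dict(params)
    minus = dict(params)
    plus[name] += eps
    minus[name] -= eps
    numeric[name] = (loss(**plus) - loss(**minus)) / (2 * eps)

max_diff = max(abs(numeric[k] - analytic[k]) for k in params)
```

Because the loss is quadratic in each parameter near this point, the central difference agrees with the hand-derived values up to rounding error.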

Example 2: Backprop Through Sigmoid + BCE

Problem: For a single neuron with sigmoid activation and BCE loss, x = [2], w = 1, b = −1, y = 0, compute ∂L/∂w.

Solution:

Forward:

z = w·x + b = 1·2 − 1 = 1
ŷ = σ(1) ≈ 0.7311
L = −[0·log(0.7311) + 1·log(0.2689)] = −log(0.2689) ≈ 1.313

Backward (using clean BCE+sigmoid gradient):

∂L/∂z = ŷ − y = 0.7311 − 0 = 0.7311
∂L/∂w = ∂L/∂z · ∂z/∂w = 0.7311 · 2 = 1.4622
∂L/∂b = ∂L/∂z · 1 = 0.7311
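The same numbers can be reproduced in a few lines of Python. This is a sketch using only the standard library; the variable names are our own, and `math.log` is the natural logarithm, matching the loss value above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w, b, y = 2.0, 1.0, -1.0, 0.0

# Forward pass
z = w * x + b
y_hat = sigmoid(z)
loss = -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

# Backward pass using the combined BCE + sigmoid gradient dL/dz = y_hat - y
dL_dz = y_hat - y
dL_dw = dL_dz * x
dL_db = dL_dz
```

Computed without rounding the intermediate ŷ, ∂L/∂w comes out at about 1.4621; the 1.4622 in the worked solution results from doubling the rounded value 0.7311.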

Example 3: Matrix Shape Analysis

Problem: A layer has 64 inputs and 128 outputs. Batch size is 32. Determine the shapes of W, a^(ℓ−1), z^(ℓ), δ^(ℓ), and ∂L/∂W^(ℓ).

Solution:

W^(ℓ): 128 × 64 (output_dim × input_dim)
a^(ℓ−1): 64 × 32 (input_dim × batch_size)
z^(ℓ) = Wa: 128 × 32
δ^(ℓ) = ∂L/∂z^(ℓ): 128 × 32 (same shape as z)
∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ: (128×32) × (32×64) = 128 × 64 ✓ (same shape as W)
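These shapes are easy to confirm with placeholder arrays. In the sketch below the array contents are irrelevant (zeros and ones); only the shapes matter, and the bias is stored as a column so that it broadcasts across the batch.

```python
import numpy as np

d_in, d_out, batch = 64, 128, 32

W = np.zeros((d_out, d_in))           # weights: output_dim x input_dim
A_prev = np.zeros((d_in, batch))      # previous activations: input_dim x batch_size
b = np.zeros((d_out, 1))              # bias column, broadcast across the batch

Z = W @ A_prev + b                    # pre-activations
Delta = np.ones_like(Z)               # placeholder error, same shape as Z

grad_W = Delta @ A_prev.T             # weight gradient, same shape as W
grad_A_prev = W.T @ Delta             # error passed to the previous layer
```

Checking that every gradient has the shape of the quantity it differentiates is a cheap first test of any backward pass.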

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: For the network x → z₁=w₁x+b₁ → a₁=ReLU(z₁) → z₂=w₂a₁+b₂ → ŷ=σ(z₂), derive the full expression for ∂L/∂w₁ when using BCE loss.

Problem 2: A linear layer has W ∈ ℝ^{d_out×d_in}. Show that the backward computation (W)ᵀ δ has complexity O(d_out·d_in·batch_size). Express this in terms of the forward pass complexity.

Problem 3: In backprop, we compute ∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ. Derive this from the element-wise chain rule: ∂L/∂W_{ij}^(ℓ) = Σₖ ∂L/∂zₖ^(ℓ) · ∂zₖ^(ℓ)/∂W_{ij}^(ℓ).

Problem 4: A network has 3 linear+ReLU layers. The gradient ∂L/∂z^(3) = [0.5, −0.3]ᵀ, and z^(2) = [1, −2, 0]ᵀ. W^(3) = [[1,2,3],[−1,0,1]] (2×3). Compute δ^(2) = ∂L/∂z^(2).

Problem 5: Prove that if all activations are linear (f_ℓ(x) = x) and the loss is MSE, a deep network is equivalent to a single linear layer, so that the gradient of L w.r.t. the effective weight matrix can be computed without backprop through intermediate layers.

Answers

**Problem 1:**

∂L/∂w₁ = ∂L/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂w₁ = (ŷ − y) · w₂ · ReLU'(z₁) · x

Substituting ReLU', this is (ŷ − y)·w₂·x if z₁ > 0, and 0 if z₁ ≤ 0.

**Problem 2:**

Forward pass **W** **a**: (d_out × d_in) × (d_in × B) = O(d_out·d_in·B)
Backward pass **W**ᵀ **δ**: (d_in × d_out) × (d_out × B) = O(d_in·d_out·B)

Same asymptotic complexity! This is the key property of reverse-mode AD: the gradients of all parameters can be computed with the same asymptotic cost as the forward pass.

**Problem 3:**

zₖ^(ℓ) = Σ_m W_{km}^(ℓ) a_m^(ℓ−1) + bₖ^(ℓ)

W_{ij}^(ℓ) appears only in row k = i of this sum, in the term with m = j, so ∂zₖ^(ℓ)/∂W_{ij}^(ℓ) = aⱼ^(ℓ−1) if k = i, else 0.

So ∂L/∂W_{ij}^(ℓ) = Σₖ δₖ · (aⱼ if k = i, else 0) = δᵢ · aⱼ.

In matrix form: ∂L/∂**W** = **δ** (**a**)ᵀ ✓

**Problem 4:**

**δ**^(2) = ((**W**^(3))ᵀ **δ**^(3)) ⊙ ReLU'(**z**^(2))
= ([[1,−1], [2,0], [3,1]] · [0.5, −0.3]ᵀ) ⊙ [ReLU'(1), ReLU'(−2), ReLU'(0)]ᵀ
= [0.5+0.3, 1+0, 1.5−0.3]ᵀ ⊙ [1, 0, 0]ᵀ
= [0.8, 1.0, 1.2]ᵀ ⊙ [1, 0, 0]ᵀ
= [0.8, 0, 0]ᵀ

(This uses the convention ReLU'(0) = 0.)

**Problem 5:**

With linear activations: **a**^(ℓ) = **z**^(ℓ) = **W**^(ℓ)**a**^(ℓ−1) + **b**^(ℓ). The entire network: **ŷ** = **W**^(L)(...**W**^(2)(**W**^(1)**x**+**b**^(1))+**b**^(2)...) + **b**^(L) = **W**_eff **x** + **b**_eff, where **W**_eff = **W**^(L)···**W**^(2)**W**^(1).

With the squared-error loss L = ‖ŷ − y‖² (the convention of Example 1), the gradient w.r.t. **W**_eff is 2(ŷ−y)**x**ᵀ; with the ½‖ŷ − y‖² convention the factor 2 drops out. No backprop through individual layers is needed because every activation Jacobian is the identity, so the chain of layer Jacobians collapses into the single matrix **W**_eff. The deep architecture is redundant: a single linear layer can represent every function the deep linear network can.
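The arithmetic in the answer to Problem 4 can be confirmed with a short NumPy check. The sketch assumes the convention ReLU'(0) = 0, which the comparison `z2 > 0` encodes.

```python
import numpy as np

W3 = np.array([[1.0, 2.0, 3.0],
               [-1.0, 0.0, 1.0]])     # W^(3), shape 2 x 3
delta3 = np.array([0.5, -0.3])        # dL/dz^(3)
z2 = np.array([1.0, -2.0, 0.0])       # pre-activations of layer 2

backprop_signal = W3.T @ delta3       # (W^(3))^T delta^(3)
relu_grad = (z2 > 0).astype(float)    # ReLU'(z^(2)), taking ReLU'(0) = 0
delta2 = backprop_signal * relu_grad  # element-wise gating
```

Only the first component survives, because the other two units have z ≤ 0 and therefore pass no gradient.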

Summary

  1. Backpropagation applies the chain rule through a computational graph from output to input, computing ∂L/∂W^(ℓ) and ∂L/∂b^(ℓ) for all layers in one backward pass.
  2. The error signal δ^(ℓ) = ∂L/∂z^(ℓ) propagates via δ^(ℓ) = ((W^(ℓ+1))ᵀδ^(ℓ+1)) ⊙ f'_ℓ(z^(ℓ)).
  3. The weight gradient is an outer product, ∂L/∂W^(ℓ) = δ^(ℓ)(a^(ℓ−1))ᵀ, and the bias gradient is the error itself, ∂L/∂b^(ℓ) = δ^(ℓ).
  4. Backprop is reverse-mode automatic differentiation, computing all gradients at ~2-3× the forward pass cost regardless of parameter count.
  5. Gradient checking (finite differences) should be used to verify implementations: in double precision, a relative error < 10⁻⁷ indicates a correct implementation.

Pitfalls


Quiz

Q1: In backpropagation, what does δ^(ℓ) represent?

A) The activation of layer ℓ
B) The weights of layer ℓ
C) The gradient of the loss with respect to the pre-activation of layer ℓ
D) The gradient of the loss with respect to the weights of layer ℓ

Answer and Explanations

**Correct: C) The gradient of the loss with respect to the pre-activation of layer ℓ**

**δ**^(ℓ) = ∂L/∂**z**^(ℓ). This is the "error signal" that is propagated backward through the network. It's the central quantity in backprop.

- A) That's **a**^(ℓ), not δ.
- B) That's **W**^(ℓ).
- C) ✓ Correct. δ^(ℓ) = ∂L/∂z^(ℓ) is the error at the pre-activation.
- D) That's ∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ, derived FROM δ^(ℓ).

Q2: What is the gradient ∂L/∂W^(ℓ) in terms of δ^(ℓ) and a^(ℓ−1)?

A) (W^(ℓ))ᵀ δ^(ℓ)
B) δ^(ℓ) (a^(ℓ−1))ᵀ
C) (a^(ℓ−1))ᵀ δ^(ℓ)
D) δ^(ℓ) ⊙ a^(ℓ−1)

Answer and Explanations

**Correct: B) δ^(ℓ) (a^(ℓ−1))ᵀ**

This is an outer product: (d_out × 1) × (1 × d_in) = (d_out × d_in), matching the shape of **W**^(ℓ). Each element is δᵢ^(ℓ) · aⱼ^(ℓ−1): how much changing weight W_{ij} affects the loss depends on the error at output i and the activation of input j.

- A) That's part of the backward recurrence for δ^(ℓ−1), not ∂L/∂W.
- B) ✓ Correct. Outer product of error and input activation.
- C) Wrong shape: (1 × d_in) times (d_out × 1) is not a valid product unless d_in = d_out, and even then it yields a scalar rather than a d_out × d_in matrix.
- D) Element-wise product, not the correct gradient for a linear layer.

Q3: Why is backpropagation more efficient than computing each gradient independently via finite differences?

A) Backprop uses fewer bits of precision
B) Backprop reuses intermediate results via the chain rule, computing all gradients in O(forward_cost) time
C) Backprop doesn't require the forward pass
D) Backprop only computes gradients for parameters that matter

Answer and Explanations

**Correct: B) Backprop reuses intermediate results via the chain rule, computing all gradients in O(forward_cost) time**

With P parameters, finite differences requires at least P+1 forward passes (one-sided differences), or 2P with the central difference used for gradient checking. Backprop computes ALL P gradients in ~2-3 forward passes worth of computation. For a network with millions of parameters, this is the difference between feasible and impossible.

- A) Incorrect. Precision is unrelated to efficiency.
- B) ✓ Correct. Reverse-mode AD computes all gradients with cost proportional to one forward pass.
- C) Incorrect. Backprop REQUIRES the forward pass (to store activations).
- D) Incorrect. Backprop computes ALL gradients, not a subset.

Q4: For a linear layer z = Wa + b, what is ∂zᵢ/∂aⱼ?

A) δ_{ij} (Kronecker delta)
B) W_{ij}
C) W_{ji}
D) It depends on the activation function

Answer and Explanations

**Correct: B) W_{ij}**

zᵢ = Σₖ W_{ik} aₖ + bᵢ. ∂zᵢ/∂aⱼ = W_{ij} (only the j-th term in the sum depends on aⱼ). The Jacobian ∂**z**/∂**a** is exactly the weight matrix **W**.

- A) That would be true if zᵢ = aᵢ, which it isn't.
- B) ✓ Correct. The Jacobian of a linear transformation is the weight matrix itself.
- C) This is ∂zⱼ/∂aᵢ, which equals W_{ji}. Not the same as ∂zᵢ/∂aⱼ.
- D) The linear layer comes BEFORE the activation, so it doesn't depend on f.

Q5: When computing backprop through a ReLU activation a = max(0, z), what is the local gradient ∂a/∂z used in the backward pass?

A) Always 1
B) Always 0
C) 1 if z > 0, 0 if z < 0
D) σ(z)(1 − σ(z))

Answer and Explanations

**Correct: C) 1 if z > 0, 0 if z < 0**

ReLU'(z) = 1 for z > 0, 0 for z < 0. At z = 0, ReLU is not differentiable: any value in [0, 1] is a valid subgradient, and in practice frameworks simply pick a fixed value there, commonly 0. This binary behavior is what causes "dying ReLU" when z ≤ 0 for all inputs.

- A) That would be linear activation, not ReLU.
- B) Always 0 means no gradient flows, so the network can't learn.
- C) ✓ Correct. The gradient is gated by the sign of the pre-activation.
- D) That's the derivative of sigmoid, not ReLU.

Next Steps

Move on to 16-06, Gradient Flow in Deep Networks, to understand why gradients vanish or explode in deep networks, and how to diagnose and mitigate these problems.