📐 Concept diagram

17-02 — Recurrent Neural Networks (RNNs)

Phase: 17 — Deep Learning Architectures (Math) Subject: 17-02 Prerequisites: 16-05 (Backpropagation), 16-02 (Activation Functions), 16-06 (Gradient Flow), 17-01 (CNNs — for sequence-to-architecture thinking) Next subject: 17-03 — LSTM Mathematics

Learning Objectives

By the end of this subject, you will be able to:

Derive the RNN recurrence relation and unroll it into a deep computational graph
Explain the shared-weight structure across time steps and its implications
Derive Backpropagation Through Time (BPTT) analytically for a simple RNN
Quantify the vanishing/exploding gradient problem using the product-of-Jacobians analysis
Distinguish between the RNN's short-term memory limitation and the exploding gradient problem

Core Content

1. Motivation: Why Feedforward Isn't Enough

A feedforward network maps a fixed-size input to a fixed-size output. But sequences — sentences, time series, audio — have variable length and temporal dependencies. An RNN processes one element at a time while maintaining a hidden state that summarizes all past inputs.

The key idea: use the same weights at every time step (parameter sharing across time), and let the hidden state carry information forward.

2. The RNN Recurrence

For input sequence x₁, x₂, ..., x_T (each x_t ∈ ℝ^d_in), the RNN computes:

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h) y_t = W_hy · h_t + b_y

Where: - h_t ∈ ℝ^d_h is the hidden state at time t - W_hh ∈ ℝ^(d_h × d_h) — recurrent weight matrix (hidden-to-hidden) - W_xh ∈ ℝ^(d_h × d_in) — input weight matrix (input-to-hidden) - W_hy ∈ ℝ^(d_out × d_h) — output weight matrix (hidden-to-output) - b_h, b_y are biases

⚠️ THIS IS CRITICAL — The SAME weights W_hh and W_xh are used at every time step. This is what makes the RNN handle variable-length sequences. It's also what makes training hard: the same W_hh gets multiplied repeatedly in the unrolled computation graph.

The hidden state h_0 is typically initialized to 0 (or learned).

3. Unrolling Through Time

To understand training, we "unroll" the RNN into a feedforward network of depth T:

$x₁ → [W_xh] → + → [tanh] → h₁ → [W_hy] → y₁
                ↑
          h₀ → [W_hh]

x₂ → [W_xh] → + → [tanh] → h₂ → [W_hy] → y₂
                ↑
          h₁ → [W_hh]

x₃ → [W_xh] → + → [tanh] → h₃ → [W_hy] → y₃
                ↑
          h₂ → [W_hh]
$

This is a very deep network (depth T) with shared weights across layers. Training requires computing gradients through this entire unrolled structure.

4. Backpropagation Through Time (BPTT)

Given a loss L = Σ_{t=1}^T L_t(y_t, target_t), we need ∂L/∂W_hh, ∂L/∂W_xh, ∂L/∂W_hy.

Step 1: Gradient w.r.t. hidden states

Let δ_t = ∂L/∂h_t. From the computation graph:

δ_t = (∂L_t/∂h_t) + (∂h_{t+1}/∂h_t)ᵀ · δ_{t+1}

Where ∂L_t/∂h_t = W_hyᵀ · ∂L_t/∂y_t.

And the Jacobian of the recurrence:

∂h_{t+1}/∂h_t = diag(tanh'(z_{t+1})) · W_hh

where z_{t+1} = W_hh · h_t + W_xh · x_{t+1} + b_h (the pre-activation).

Step 2: Expanding the recurrence

δ_t = W_hyᵀ · ∂L_t/∂y_t + W_hhᵀ · diag(tanh'(z_{t+1})) · δ_{t+1}

Unrolling from t back to 1:

δ₁ = Σ_{k=1}^T (∏_{j=k}^{2} W_hhᵀ · diag(tanh'(z_j))) · W_hyᵀ · ∂L_k/∂y_k

This is a long product of Jacobians — and that's the smoking gun for the vanishing gradient problem.

Step 3: Gradient w.r.t. weights

Since W_hh is shared across all time steps, its gradient is the SUM of contributions from each step:

∂L/∂W_hh = Σ_{t=1}^T ∂L_t/∂W_hh

Where at each step t, the gradient flows through all paths from L_t back to the occurrence of W_hh:

∂L_t/∂W_hh = Σ_{k=1}^t (∂L_t/∂h_t · ∂h_t/∂h_k) · (∂h_k/∂W_hh)

The factor (∂h_t/∂h_k) is the product of Jacobians from step k to t.

5. The Vanishing/Exploding Gradient Problem

Consider the long-range dependency where we propagate gradient from time t back to time 1:

∂h_t/∂h₁ = ∏_{i=2}^t diag(tanh'(z_i)) · W_hh

Taking the spectral norm:

||∂h_t/∂h₁|| ≤ ∏_{i=2}^t ||diag(tanh'(z_i))|| · ||W_hh|| ≤ (||W_hh|| · max |tanh'|)^(t-1) ≤ (||W_hh|| · 1)^(t-1)

If ||W_hh|| < 1: gradient vanishes exponentially. Long-range dependencies are lost. If ||W_hh|| > 1: gradient explodes exponentially. Training becomes unstable.

The fundamental problem: For the RNN to have long-term memory, the recurrent Jacobian must have eigenvalues near 1. But standard gradient descent can't maintain this — the recurrent weight matrix inevitably drifts into the vanishing or exploding regime.

Why it's worse than deep feedforward: In a feedforward net of depth 100, each layer has its OWN weight matrix. In an RNN of 100 time steps, the SAME weight matrix is multiplied 100 times. If its spectral norm is 0.99, then 0.99^100 ≈ 0.37 (manageable). If it's 0.95, 0.95^100 ≈ 0.006 (problematic). This sensitivity to the recurrent weight's spectral radius is the core difficulty.

6. Truncated BPTT

The practical solution before LSTMs: truncate backpropagation to k steps.

Only propagate gradients back k < T steps

This avoids the vanishing/exploding problem but means the RNN can only learn dependencies spanning at most k time steps. It treats the RNN as having a fixed short-term memory window.

7. RNN Output Modes

RNNs support different input-output configurations:

Mode	Input	Output	Example
Many-to-one	Sequence	Single value	Sentiment classification
One-to-many	Single value	Sequence	Image captioning
Many-to-many (synced)	Sequence	Sequence (same length)	Part-of-speech tagging
Many-to-many (seq2seq)	Sequence	Sequence (different length)	Machine translation

Key Terms

17 02 Recurrent Neural Networks
Backpropagation Through Time (BPTT)
End-of-Subject Quiz
Example 1: Unrolling a Tiny RNN
Example 2: BPTT Gradient for One Weight Occurrence
Example 3: Vanishing Gradient Demonstration
Mode
Motivation: Why Feedforward Isn't Enough
Problem 1
Problem 2
Problem 3
Problem 4

Worked Examples

Example 1: Unrolling a Tiny RNN

Problem: An RNN with d_h=2, d_in=1, d_out=1. W_hh = [[0.5, 0.1],[0.1, 0.5]], W_xh = [[0.2],[0.3]], b_h = [0,0], W_hy = [1, 1], b_y = 0. h₀ = [0,0]. Input sequence: x = [1, 2]. Compute h₁, h₂ and y₁, y₂.

Solution: h₁ = tanh(W_hh·h₀ + W_xh·1 + b_h) = tanh([0.2, 0.3]ᵀ) = [tanh(0.2), tanh(0.3)] = [0.1974, 0.2913]

h₂ = tanh(W_hh·h₁ + W_xh·2 + b_h) W_hh·h₁ = [0.5·0.1974+0.1·0.2913, 0.1·0.1974+0.5·0.2913] = [0.1278, 0.1654] W_xh·2 = [0.4, 0.6] z = [0.5278, 0.7654] h₂ = [tanh(0.5278), tanh(0.7654)] = [0.4832, 0.6437]

y₁ = W_hy·h₁ + b_y = 0.1974 + 0.2913 = 0.4887 y₂ = 0.4832 + 0.6437 = 1.1269

Example 2: BPTT Gradient for One Weight Occurrence

Problem: For the RNN in Example 1, compute ∂y₂/∂W_xh[0] (the first element of W_xh) analytically at the first time step.

Solution: y₂ = W_hy · h₂, and h₂ = tanh(W_hh·h₁ + W_xh·x₂ + b_h)

So ∂y₂/∂W_xh[0,0] depends on how W_xh[0,0] = 0.2 affects h₂. But W_xh[0,0] also affects h₁! So there are two paths: Path 1 (through h₁): ∂h₁/∂W_xh[0,0] · ∂h₂/∂h₁ · ∂y₂/∂h₂ Path 2 (direct at time 2): depends only on x₂

∂h₁/∂W_xh[0,0] = tanh'(z₁₁) · x₁ · [1,0]ᵀ where z₁₁ is the first component of z₁ tanh'(0.2) = 1 − tanh²(0.2) = 1 − 0.0390 = 0.9610 So ∂h₁/∂W_xh[0,0] = [0.9610, 0]ᵀ

∂h₂/∂h₁ = diag(tanh'(z₂)) · W_hh tanh'(0.5278) = 1 − 0.2335 = 0.7665 tanh'(0.7654) = 1 − 0.4144 = 0.5856 diag = [[0.7665, 0],[0, 0.5856]] ∂h₂/∂h₁ = [[0.7665·0.5, 0.7665·0.1],[0.5856·0.1, 0.5856·0.5]] = [[0.3833, 0.0767],[0.0586, 0.2928]]

∂y₂/∂h₂ = W_hy = [1, 1]

Path 1 contribution: [1,1] · [[0.3833, 0.0767],[0.0586, 0.2928]] · [0.9610, 0]ᵀ = [0.4419, 0.3695] · [0.9610, 0]ᵀ = 0.4247

This shows how the gradient from y₂ back to weights at time 1 flows through h₁ → h₂ → y₂, being attenuated by the Jacobian product.

Example 3: Vanishing Gradient Demonstration

Problem: For an RNN with one-dimensional hidden state, W_hh = 0.5, and tanh acting as activation. If tanh'(z) ≈ 0.8 on average (z is near zero), what fraction of the gradient at time 100 reaches time 1?

Solution: ∂h_100/∂h₁ = ∏_{i=2}^{100} tanh'(z_i) · W_hh ≈ (0.8 · 0.5)^99 = (0.4)^99 ≈ 10^(−39)

The gradient is effectively zero. The RNN cannot learn dependencies spanning 100 time steps with these parameters.

Quiz

Q1: What does the concept of Backpropagation Through Time (BPTT) primarily refer to in this subject?

A) A historical anecdote about Backpropagation Through Time (BPTT) B) A visual representation of Backpropagation Through Time (BPTT) C) A computational error related to Backpropagation Through Time (BPTT) D) The definition and application of Backpropagation Through Time (BPTT)

Correct: D)

If you chose A: This is incorrect. Backpropagation Through Time (BPTT) is defined as: the definition and application of backpropagation through time (bptt). The other options describe different aspects that are not the primary focus.
If you chose B: This is incorrect. Backpropagation Through Time (BPTT) is defined as: the definition and application of backpropagation through time (bptt). The other options describe different aspects that are not the primary focus.
If you chose C: This is incorrect. Backpropagation Through Time (BPTT) is defined as: the definition and application of backpropagation through time (bptt). The other options describe different aspects that are not the primary focus.
If you chose D: Backpropagation Through Time (BPTT) is defined as: the definition and application of backpropagation through time (bptt). The other options describe different aspects that are not the primary focus. Correct!

Q2: What is the primary purpose of End-of-Subject Quiz?

A) It is used to end-of-subject quiz in mathematical analysis B) It is used only in advanced research contexts C) It is primarily a historical notation system D) It replaces all other methods in this domain

Correct: A)

If you chose A: End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
If you chose B: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose C: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose D: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.

Q3: Which statement about Motivation: Why Feedforward Isn'T Enough is TRUE?

A) Motivation: Why Feedforward Isn'T Enough is mentioned only as a historical footnote B) Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject C) Motivation: Why Feedforward Isn'T Enough is not related to this subject D) Motivation: Why Feedforward Isn'T Enough is an advanced topic beyond this subject's scope

Correct: B)

If you chose A: This is incorrect. Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject. This subject covers Motivation: Why Feedforward Isn'T Enough as part of its core content.
If you chose B: Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject. This subject covers Motivation: Why Feedforward Isn'T Enough as part of its core content. Correct!
If you chose C: This is incorrect. Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject. This subject covers Motivation: Why Feedforward Isn'T Enough as part of its core content.
If you chose D: This is incorrect. Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject. This subject covers Motivation: Why Feedforward Isn'T Enough as part of its core content.

Q4: Based on the worked examples in this subject, what is the correct result?

A) 3×3, W_xh: 3×2, b_h: 3×1. B) An unrelated numerical value C) A different result from a common mistake D) The inverse of the correct answer

Correct: A)

If you chose A: The worked examples show that the result is 3×3, W_xh: 3×2, b_h: 3×1.. The other options represent common errors. Correct!
If you chose B: This is incorrect. The worked examples show that the result is 3×3, W_xh: 3×2, b_h: 3×1.. The other options represent common errors.
If you chose C: This is incorrect. The worked examples show that the result is 3×3, W_xh: 3×2, b_h: 3×1.. The other options represent common errors.
If you chose D: This is incorrect. The worked examples show that the result is 3×3, W_xh: 3×2, b_h: 3×1.. The other options represent common errors.

Q5: How are Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence related?

A) Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are completely unrelated topics B) Motivation: Why Feedforward Isn'T Enough is the inverse of The Rnn Recurrence C) Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are closely related concepts D) Motivation: Why Feedforward Isn'T Enough is a special case of The Rnn Recurrence

Correct: C)

If you chose A: This is incorrect. Both Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are covered in this subject as interconnected topics.
If you chose B: This is incorrect. Both Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are covered in this subject as interconnected topics.
If you chose C: Both Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are covered in this subject as interconnected topics. Correct!
If you chose D: This is incorrect. Both Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are covered in this subject as interconnected topics.

Q6: What is a common pitfall when working with Unrolling Through Time?

A) A common mistake is confusing Unrolling Through Time with a similar concept B) Unrolling Through Time is always computed the same way in all contexts C) The main error with Unrolling Through Time is using it when it is not needed D) Unrolling Through Time has no common misconceptions

Correct: A)

If you chose A: Students often confuse Unrolling Through Time with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
If you chose B: This is incorrect. Students often confuse Unrolling Through Time with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose C: This is incorrect. Students often confuse Unrolling Through Time with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose D: This is incorrect. Students often confuse Unrolling Through Time with similar-sounding or related concepts. Pay attention to the precise definitions.

Q7: When should you apply The Vanishing/Exploding Gradient Problem?

A) The Vanishing/Exploding Gradient Problem is not practically useful B) Avoid The Vanishing/Exploding Gradient Problem unless explicitly instructed C) Use The Vanishing/Exploding Gradient Problem only in pure mathematics contexts D) Apply The Vanishing/Exploding Gradient Problem to solve problems in this subject's domain

Correct: D)

If you chose A: This is incorrect. The Vanishing/Exploding Gradient Problem is a practical tool used throughout this subject to solve relevant problems.
If you chose B: This is incorrect. The Vanishing/Exploding Gradient Problem is a practical tool used throughout this subject to solve relevant problems.
If you chose C: This is incorrect. The Vanishing/Exploding Gradient Problem is a practical tool used throughout this subject to solve relevant problems.
If you chose D: The Vanishing/Exploding Gradient Problem is a practical tool used throughout this subject to solve relevant problems. Correct!

Practice Problems

Problem 1

An RNN has d_h = 3, d_in = 2. What are the dimensions of W_hh, W_xh, b_h?

Answer

W_hh: 3×3, W_xh: 3×2, b_h: 3×1.

Problem 2

If ||W_hh|| = 1.2 and max |tanh'| = 1, what happens to gradients over 50 time steps?

Answer

||∂h_50/∂h₁|| ≈ 1.2^49 ≈ 7.5×10³ — gradients explode. Training will be unstable.

Problem 3

Explain why weight sharing in RNNs makes the vanishing gradient problem worse than in deep feedforward nets.

Answer

In feedforward nets, each layer has its own weight matrix, so eigenvalues can vary across layers; some can be >1 while others <1, partially balancing. In RNNs, the same W_hh is multiplied T times, so if ||W_hh|| < 1, ALL T multiplications shrink the gradient. There's no opportunity for balancing — the same matrix's spectral properties are compounded T times.

Problem 4

Compute ∂L/∂W_hh for an RNN with T=3 time steps. Write it as a sum of contributions from t=1,2,3.

Answer

∂L/∂W_hh = (∂L₁/∂h₁)(∂h₁/∂W_hh) + (∂L₂/∂h₂)(∂h₂/∂h₁)(∂h₁/∂W_hh) + (∂L₂/∂h₂)(∂h₂/∂W_hh at t=2) + (∂L₃/∂h₃)(∂h₃/∂h₂)(∂h₂/∂h₁)(∂h₁/∂W_hh) + ... Each occurrence of W_hh in the unrolled graph contributes a term, and gradients flow back through the entire chain of hidden states.

Problem 5

If we use truncated BPTT with k=10, what is the maximum temporal dependency the RNN can learn? What if k=T?

Answer

With k=10, at most 10 steps. With k=T (full BPTT), the RNN can theoretically learn dependencies of any length, but the vanishing gradient makes this impossible in practice with standard RNNs.

Summary

RNNs process sequences by applying the same weights at each time step: h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h)
Training unrolls the RNN into a depth-T feedforward net; BPTT computes gradients through all T steps
The recurrent Jacobian product ∏ diag(tanh')·W_hh causes vanishing gradients if ||W_hh|| < 1 and exploding gradients if ||W_hh|| > 1
Weight sharing amplifies the problem: the same matrix is multiplied T times with no layer-to-layer diversity
Truncated BPTT limits the effective memory to k steps, which motivated the development of gated architectures (LSTM, GRU)

Pitfalls

Using ReLU in vanilla RNNs. ReLU's output is unbounded above (max(0, z)). Since the same W_hh multiplies the hidden state at every time step, ReLU activations can grow without bound over long sequences, causing exploding hidden states. tanh's bounded output (−1, 1) provides inherent stability — use tanh with vanilla RNNs.
Truncating BPTT too aggressively. Truncation length k is a hard limit on learnable temporal dependencies. If your task genuinely requires 50-step dependencies but you truncate to k=20, even a perfect architecture cannot learn the task. Match truncation length to the longest relevant dependency in your data.
Forgetting to mask loss and hidden state at padding positions. RNN frameworks require fixed-shape tensors, so sequences are padded with zeros. If the loss is computed over padding positions, the model learns to predict zeros. If hidden states aren't reset at sequence boundaries, information leaks between unrelated sequences. Always use masking.
Expecting a single RNN layer to capture multi-scale temporal patterns. One RNN layer processes the sequence at a single timescale. Stacking 2–3 RNN layers allows higher layers to capture slower, more abstract temporal dynamics — this is standard practice for complex sequence modeling.
Making the initial hidden state h₀ a learned parameter without considering its role. A learned h₀ can absorb dataset-wide biases, but it's a single vector shared across all sequences. For tasks where initial conditions vary (e.g., conditioning on a context vector), h₀ should be computed from context, not treated as a static learned parameter.

Next Steps

Continue to 17-03 — LSTM Mathematics to see how gating mechanisms solve the vanishing gradient problem.

Progress

Phases

17-02 — Recurrent Neural Networks (RNNs)

Learning Objectives

Core Content

1. Motivation: Why Feedforward Isn't Enough

2. The RNN Recurrence

3. Unrolling Through Time

4. Backpropagation Through Time (BPTT)

5. The Vanishing/Exploding Gradient Problem

6. Truncated BPTT

7. RNN Output Modes

Key Terms

Worked Examples

Example 1: Unrolling a Tiny RNN

Example 2: BPTT Gradient for One Weight Occurrence

Example 3: Vanishing Gradient Demonstration

Quiz

Practice Problems

Problem 1

Problem 2

Problem 3

Problem 4

Problem 5

Summary

Pitfalls

Next Steps