17-02 β Recurrent Neural Networks (RNNs)
Phase: 17 β Deep Learning Architectures (Math) Subject: 17-02 Prerequisites: 16-05 (Backpropagation), 16-02 (Activation Functions), 16-06 (Gradient Flow), 17-01 (CNNs β for sequence-to-architecture thinking) Next subject: 17-03 β LSTM Mathematics
Learning Objectives
By the end of this subject, you will be able to:
- Derive the RNN recurrence relation and unroll it into a deep computational graph
- Explain the shared-weight structure across time steps and its implications
- Derive Backpropagation Through Time (BPTT) analytically for a simple RNN
- Quantify the vanishing/exploding gradient problem using the product-of-Jacobians analysis
- Distinguish between the RNN's short-term memory limitation and the exploding gradient problem
Core Content
1. Motivation: Why Feedforward Isn't Enough
A feedforward network maps a fixed-size input to a fixed-size output. But sequences β sentences, time series, audio β have variable length and temporal dependencies. An RNN processes one element at a time while maintaining a hidden state that summarizes all past inputs.
The key idea: use the same weights at every time step (parameter sharing across time), and let the hidden state carry information forward.
2. The RNN Recurrence
For input sequence xβ, xβ, ..., x_T (each x_t β β^d_in), the RNN computes:
h_t = tanh(W_hh Β· h_{t-1} + W_xh Β· x_t + b_h) y_t = W_hy Β· h_t + b_y
Where: - h_t β β^d_h is the hidden state at time t - W_hh β β^(d_h Γ d_h) β recurrent weight matrix (hidden-to-hidden) - W_xh β β^(d_h Γ d_in) β input weight matrix (input-to-hidden) - W_hy β β^(d_out Γ d_h) β output weight matrix (hidden-to-output) - b_h, b_y are biases
β οΈ THIS IS CRITICAL β The SAME weights W_hh and W_xh are used at every time step. This is what makes the RNN handle variable-length sequences. It's also what makes training hard: the same W_hh gets multiplied repeatedly in the unrolled computation graph.
The hidden state h_0 is typically initialized to 0 (or learned).
3. Unrolling Through Time
To understand training, we "unroll" the RNN into a feedforward network of depth T:
$xβ β [W_xh] β + β [tanh] β hβ β [W_hy] β yβ
β
hβ β [W_hh]
xβ β [W_xh] β + β [tanh] β hβ β [W_hy] β yβ
β
hβ β [W_hh]
xβ β [W_xh] β + β [tanh] β hβ β [W_hy] β yβ
β
hβ β [W_hh]
$
This is a very deep network (depth T) with shared weights across layers. Training requires computing gradients through this entire unrolled structure.
4. Backpropagation Through Time (BPTT)
Given a loss L = Ξ£_{t=1}^T L_t(y_t, target_t), we need βL/βW_hh, βL/βW_xh, βL/βW_hy.
Step 1: Gradient w.r.t. hidden states
Let Ξ΄_t = βL/βh_t. From the computation graph:
Ξ΄_t = (βL_t/βh_t) + (βh_{t+1}/βh_t)α΅ Β· Ξ΄_{t+1}
Where βL_t/βh_t = W_hyα΅ Β· βL_t/βy_t.
And the Jacobian of the recurrence:
βh_{t+1}/βh_t = diag(tanh'(z_{t+1})) Β· W_hh
where z_{t+1} = W_hh Β· h_t + W_xh Β· x_{t+1} + b_h (the pre-activation).
Step 2: Expanding the recurrence
Ξ΄_t = W_hyα΅ Β· βL_t/βy_t + W_hhα΅ Β· diag(tanh'(z_{t+1})) Β· Ξ΄_{t+1}
Unrolling from t back to 1:
Ξ΄β = Ξ£_{k=1}^T (β_{j=k}^{2} W_hhα΅ Β· diag(tanh'(z_j))) Β· W_hyα΅ Β· βL_k/βy_k
This is a long product of Jacobians β and that's the smoking gun for the vanishing gradient problem.
Step 3: Gradient w.r.t. weights
Since W_hh is shared across all time steps, its gradient is the SUM of contributions from each step:
βL/βW_hh = Ξ£_{t=1}^T βL_t/βW_hh
Where at each step t, the gradient flows through all paths from L_t back to the occurrence of W_hh:
βL_t/βW_hh = Ξ£_{k=1}^t (βL_t/βh_t Β· βh_t/βh_k) Β· (βh_k/βW_hh)
The factor (βh_t/βh_k) is the product of Jacobians from step k to t.
5. The Vanishing/Exploding Gradient Problem
Consider the long-range dependency where we propagate gradient from time t back to time 1:
βh_t/βhβ = β_{i=2}^t diag(tanh'(z_i)) Β· W_hh
Taking the spectral norm:
||βh_t/βhβ|| β€ β_{i=2}^t ||diag(tanh'(z_i))|| Β· ||W_hh|| β€ (||W_hh|| Β· max |tanh'|)^(t-1) β€ (||W_hh|| Β· 1)^(t-1)
If ||W_hh|| < 1: gradient vanishes exponentially. Long-range dependencies are lost. If ||W_hh|| > 1: gradient explodes exponentially. Training becomes unstable.
The fundamental problem: For the RNN to have long-term memory, the recurrent Jacobian must have eigenvalues near 1. But standard gradient descent can't maintain this β the recurrent weight matrix inevitably drifts into the vanishing or exploding regime.
Why it's worse than deep feedforward: In a feedforward net of depth 100, each layer has its OWN weight matrix. In an RNN of 100 time steps, the SAME weight matrix is multiplied 100 times. If its spectral norm is 0.99, then 0.99^100 β 0.37 (manageable). If it's 0.95, 0.95^100 β 0.006 (problematic). This sensitivity to the recurrent weight's spectral radius is the core difficulty.
6. Truncated BPTT
The practical solution before LSTMs: truncate backpropagation to k steps.
Only propagate gradients back k < T steps
This avoids the vanishing/exploding problem but means the RNN can only learn dependencies spanning at most k time steps. It treats the RNN as having a fixed short-term memory window.
7. RNN Output Modes
RNNs support different input-output configurations:
| Mode | Input | Output | Example |
|---|---|---|---|
| Many-to-one | Sequence | Single value | Sentiment classification |
| One-to-many | Single value | Sequence | Image captioning |
| Many-to-many (synced) | Sequence | Sequence (same length) | Part-of-speech tagging |
| Many-to-many (seq2seq) | Sequence | Sequence (different length) | Machine translation |
Key Terms
- 17 02 Recurrent Neural Networks
- Backpropagation Through Time (BPTT)
- End-of-Subject Quiz
- Example 1: Unrolling a Tiny RNN
- Example 2: BPTT Gradient for One Weight Occurrence
- Example 3: Vanishing Gradient Demonstration
- Mode
- Motivation: Why Feedforward Isn't Enough
- Problem 1
- Problem 2
- Problem 3
- Problem 4
Worked Examples
Example 1: Unrolling a Tiny RNN
Problem: An RNN with d_h=2, d_in=1, d_out=1. W_hh = [[0.5, 0.1],[0.1, 0.5]], W_xh = [[0.2],[0.3]], b_h = [0,0], W_hy = [1, 1], b_y = 0. hβ = [0,0]. Input sequence: x = [1, 2]. Compute hβ, hβ and yβ, yβ.
Solution: hβ = tanh(W_hhΒ·hβ + W_xhΒ·1 + b_h) = tanh([0.2, 0.3]α΅) = [tanh(0.2), tanh(0.3)] = [0.1974, 0.2913]
hβ = tanh(W_hhΒ·hβ + W_xhΒ·2 + b_h) W_hhΒ·hβ = [0.5Β·0.1974+0.1Β·0.2913, 0.1Β·0.1974+0.5Β·0.2913] = [0.1278, 0.1654] W_xhΒ·2 = [0.4, 0.6] z = [0.5278, 0.7654] hβ = [tanh(0.5278), tanh(0.7654)] = [0.4832, 0.6437]
yβ = W_hyΒ·hβ + b_y = 0.1974 + 0.2913 = 0.4887 yβ = 0.4832 + 0.6437 = 1.1269
Example 2: BPTT Gradient for One Weight Occurrence
Problem: For the RNN in Example 1, compute βyβ/βW_xh[0] (the first element of W_xh) analytically at the first time step.
Solution: yβ = W_hy Β· hβ, and hβ = tanh(W_hhΒ·hβ + W_xhΒ·xβ + b_h)
So βyβ/βW_xh[0,0] depends on how W_xh[0,0] = 0.2 affects hβ. But W_xh[0,0] also affects hβ! So there are two paths: Path 1 (through hβ): βhβ/βW_xh[0,0] Β· βhβ/βhβ Β· βyβ/βhβ Path 2 (direct at time 2): depends only on xβ
βhβ/βW_xh[0,0] = tanh'(zββ) Β· xβ Β· [1,0]α΅ where zββ is the first component of zβ tanh'(0.2) = 1 β tanhΒ²(0.2) = 1 β 0.0390 = 0.9610 So βhβ/βW_xh[0,0] = [0.9610, 0]α΅
βhβ/βhβ = diag(tanh'(zβ)) Β· W_hh tanh'(0.5278) = 1 β 0.2335 = 0.7665 tanh'(0.7654) = 1 β 0.4144 = 0.5856 diag = [[0.7665, 0],[0, 0.5856]] βhβ/βhβ = [[0.7665Β·0.5, 0.7665Β·0.1],[0.5856Β·0.1, 0.5856Β·0.5]] = [[0.3833, 0.0767],[0.0586, 0.2928]]
βyβ/βhβ = W_hy = [1, 1]
Path 1 contribution: [1,1] Β· [[0.3833, 0.0767],[0.0586, 0.2928]] Β· [0.9610, 0]α΅ = [0.4419, 0.3695] Β· [0.9610, 0]α΅ = 0.4247
This shows how the gradient from yβ back to weights at time 1 flows through hβ β hβ β yβ, being attenuated by the Jacobian product.
Example 3: Vanishing Gradient Demonstration
Problem: For an RNN with one-dimensional hidden state, W_hh = 0.5, and tanh acting as activation. If tanh'(z) β 0.8 on average (z is near zero), what fraction of the gradient at time 100 reaches time 1?
Solution: βh_100/βhβ = β_{i=2}^{100} tanh'(z_i) Β· W_hh β (0.8 Β· 0.5)^99 = (0.4)^99 β 10^(β39)
The gradient is effectively zero. The RNN cannot learn dependencies spanning 100 time steps with these parameters.
Quiz
Q1: What does the concept of Backpropagation Through Time (BPTT) primarily refer to in this subject?
A) A historical anecdote about Backpropagation Through Time (BPTT) B) A visual representation of Backpropagation Through Time (BPTT) C) A computational error related to Backpropagation Through Time (BPTT) D) The definition and application of Backpropagation Through Time (BPTT)
Correct: D)
- If you chose A: This is incorrect. Backpropagation Through Time (BPTT) is defined as: the definition and application of backpropagation through time (bptt). The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Backpropagation Through Time (BPTT) is defined as: the definition and application of backpropagation through time (bptt). The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. Backpropagation Through Time (BPTT) is defined as: the definition and application of backpropagation through time (bptt). The other options describe different aspects that are not the primary focus.
- If you chose D: Backpropagation Through Time (BPTT) is defined as: the definition and application of backpropagation through time (bptt). The other options describe different aspects that are not the primary focus. Correct!
Q2: What is the primary purpose of End-of-Subject Quiz?
A) It is used to end-of-subject quiz in mathematical analysis B) It is used only in advanced research contexts C) It is primarily a historical notation system D) It replaces all other methods in this domain
Correct: A)
- If you chose A: End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose B: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.
Q3: Which statement about Motivation: Why Feedforward Isn'T Enough is TRUE?
A) Motivation: Why Feedforward Isn'T Enough is mentioned only as a historical footnote B) Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject C) Motivation: Why Feedforward Isn'T Enough is not related to this subject D) Motivation: Why Feedforward Isn'T Enough is an advanced topic beyond this subject's scope
Correct: B)
- If you chose A: This is incorrect. Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject. This subject covers Motivation: Why Feedforward Isn'T Enough as part of its core content.
- If you chose B: Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject. This subject covers Motivation: Why Feedforward Isn'T Enough as part of its core content. Correct!
- If you chose C: This is incorrect. Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject. This subject covers Motivation: Why Feedforward Isn'T Enough as part of its core content.
- If you chose D: This is incorrect. Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject. This subject covers Motivation: Why Feedforward Isn'T Enough as part of its core content.
Q4: Based on the worked examples in this subject, what is the correct result?
A) 3Γ3, W_xh: 3Γ2, b_h: 3Γ1. B) An unrelated numerical value C) A different result from a common mistake D) The inverse of the correct answer
Correct: A)
- If you chose A: The worked examples show that the result is 3Γ3, W_xh: 3Γ2, b_h: 3Γ1.. The other options represent common errors. Correct!
- If you chose B: This is incorrect. The worked examples show that the result is 3Γ3, W_xh: 3Γ2, b_h: 3Γ1.. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is 3Γ3, W_xh: 3Γ2, b_h: 3Γ1.. The other options represent common errors.
- If you chose D: This is incorrect. The worked examples show that the result is 3Γ3, W_xh: 3Γ2, b_h: 3Γ1.. The other options represent common errors.
Q5: How are Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence related?
A) Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are completely unrelated topics B) Motivation: Why Feedforward Isn'T Enough is the inverse of The Rnn Recurrence C) Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are closely related concepts D) Motivation: Why Feedforward Isn'T Enough is a special case of The Rnn Recurrence
Correct: C)
- If you chose A: This is incorrect. Both Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are covered in this subject as interconnected topics.
- If you chose C: Both Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are covered in this subject as interconnected topics. Correct!
- If you chose D: This is incorrect. Both Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are covered in this subject as interconnected topics.
Q6: What is a common pitfall when working with Unrolling Through Time?
A) A common mistake is confusing Unrolling Through Time with a similar concept B) Unrolling Through Time is always computed the same way in all contexts C) The main error with Unrolling Through Time is using it when it is not needed D) Unrolling Through Time has no common misconceptions
Correct: A)
- If you chose A: Students often confuse Unrolling Through Time with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose B: This is incorrect. Students often confuse Unrolling Through Time with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Unrolling Through Time with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse Unrolling Through Time with similar-sounding or related concepts. Pay attention to the precise definitions.
Q7: When should you apply The Vanishing/Exploding Gradient Problem?
A) The Vanishing/Exploding Gradient Problem is not practically useful B) Avoid The Vanishing/Exploding Gradient Problem unless explicitly instructed C) Use The Vanishing/Exploding Gradient Problem only in pure mathematics contexts D) Apply The Vanishing/Exploding Gradient Problem to solve problems in this subject's domain
Correct: D)
- If you chose A: This is incorrect. The Vanishing/Exploding Gradient Problem is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. The Vanishing/Exploding Gradient Problem is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. The Vanishing/Exploding Gradient Problem is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: The Vanishing/Exploding Gradient Problem is a practical tool used throughout this subject to solve relevant problems. Correct!
Practice Problems
Problem 1
An RNN has d_h = 3, d_in = 2. What are the dimensions of W_hh, W_xh, b_h?
Answer
W_hh: 3Γ3, W_xh: 3Γ2, b_h: 3Γ1.Problem 2
If ||W_hh|| = 1.2 and max |tanh'| = 1, what happens to gradients over 50 time steps?
Answer
||βh_50/βhβ|| β 1.2^49 β 7.5Γ10Β³ β gradients explode. Training will be unstable.Problem 3
Explain why weight sharing in RNNs makes the vanishing gradient problem worse than in deep feedforward nets.
Answer
In feedforward nets, each layer has its own weight matrix, so eigenvalues can vary across layers; some can be >1 while others <1, partially balancing. In RNNs, the same W_hh is multiplied T times, so if ||W_hh|| < 1, ALL T multiplications shrink the gradient. There's no opportunity for balancing β the same matrix's spectral properties are compounded T times.Problem 4
Compute βL/βW_hh for an RNN with T=3 time steps. Write it as a sum of contributions from t=1,2,3.
Answer
βL/βW_hh = (βLβ/βhβ)(βhβ/βW_hh) + (βLβ/βhβ)(βhβ/βhβ)(βhβ/βW_hh) + (βLβ/βhβ)(βhβ/βW_hh at t=2) + (βLβ/βhβ)(βhβ/βhβ)(βhβ/βhβ)(βhβ/βW_hh) + ... Each occurrence of W_hh in the unrolled graph contributes a term, and gradients flow back through the entire chain of hidden states.Problem 5
If we use truncated BPTT with k=10, what is the maximum temporal dependency the RNN can learn? What if k=T?
Answer
With k=10, at most 10 steps. With k=T (full BPTT), the RNN can theoretically learn dependencies of any length, but the vanishing gradient makes this impossible in practice with standard RNNs.Summary
- RNNs process sequences by applying the same weights at each time step: h_t = tanh(W_hhΒ·h_{t-1} + W_xhΒ·x_t + b_h)
- Training unrolls the RNN into a depth-T feedforward net; BPTT computes gradients through all T steps
- The recurrent Jacobian product β diag(tanh')Β·W_hh causes vanishing gradients if ||W_hh|| < 1 and exploding gradients if ||W_hh|| > 1
- Weight sharing amplifies the problem: the same matrix is multiplied T times with no layer-to-layer diversity
- Truncated BPTT limits the effective memory to k steps, which motivated the development of gated architectures (LSTM, GRU)
Pitfalls
- Using ReLU in vanilla RNNs. ReLU's output is unbounded above (max(0, z)). Since the same W_hh multiplies the hidden state at every time step, ReLU activations can grow without bound over long sequences, causing exploding hidden states. tanh's bounded output (β1, 1) provides inherent stability β use tanh with vanilla RNNs.
- Truncating BPTT too aggressively. Truncation length k is a hard limit on learnable temporal dependencies. If your task genuinely requires 50-step dependencies but you truncate to k=20, even a perfect architecture cannot learn the task. Match truncation length to the longest relevant dependency in your data.
- Forgetting to mask loss and hidden state at padding positions. RNN frameworks require fixed-shape tensors, so sequences are padded with zeros. If the loss is computed over padding positions, the model learns to predict zeros. If hidden states aren't reset at sequence boundaries, information leaks between unrelated sequences. Always use masking.
- Expecting a single RNN layer to capture multi-scale temporal patterns. One RNN layer processes the sequence at a single timescale. Stacking 2β3 RNN layers allows higher layers to capture slower, more abstract temporal dynamics β this is standard practice for complex sequence modeling.
- Making the initial hidden state hβ a learned parameter without considering its role. A learned hβ can absorb dataset-wide biases, but it's a single vector shared across all sequences. For tasks where initial conditions vary (e.g., conditioning on a context vector), hβ should be computed from context, not treated as a static learned parameter.
Next Steps
Continue to 17-03 β LSTM Mathematics to see how gating mechanisms solve the vanishing gradient problem.