Math graphic
πŸ“ Concept diagram

17-02 β€” Recurrent Neural Networks (RNNs)

Phase: 17 β€” Deep Learning Architectures (Math) Subject: 17-02 Prerequisites: 16-05 (Backpropagation), 16-02 (Activation Functions), 16-06 (Gradient Flow), 17-01 (CNNs β€” for sequence-to-architecture thinking) Next subject: 17-03 β€” LSTM Mathematics


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the RNN recurrence relation and unroll it into a deep computational graph
  2. Explain the shared-weight structure across time steps and its implications
  3. Derive Backpropagation Through Time (BPTT) analytically for a simple RNN
  4. Quantify the vanishing/exploding gradient problem using the product-of-Jacobians analysis
  5. Distinguish between the RNN's short-term memory limitation and the exploding gradient problem

Core Content

1. Motivation: Why Feedforward Isn't Enough

A feedforward network maps a fixed-size input to a fixed-size output. But sequences β€” sentences, time series, audio β€” have variable length and temporal dependencies. An RNN processes one element at a time while maintaining a hidden state that summarizes all past inputs.

The key idea: use the same weights at every time step (parameter sharing across time), and let the hidden state carry information forward.

2. The RNN Recurrence

For input sequence x₁, xβ‚‚, ..., x_T (each x_t ∈ ℝ^d_in), the RNN computes:

h_t = tanh(W_hh Β· h_{t-1} + W_xh Β· x_t + b_h) y_t = W_hy Β· h_t + b_y

Where: - h_t ∈ ℝ^d_h is the hidden state at time t - W_hh ∈ ℝ^(d_h Γ— d_h) β€” recurrent weight matrix (hidden-to-hidden) - W_xh ∈ ℝ^(d_h Γ— d_in) β€” input weight matrix (input-to-hidden) - W_hy ∈ ℝ^(d_out Γ— d_h) β€” output weight matrix (hidden-to-output) - b_h, b_y are biases

⚠️ THIS IS CRITICAL β€” The SAME weights W_hh and W_xh are used at every time step. This is what makes the RNN handle variable-length sequences. It's also what makes training hard: the same W_hh gets multiplied repeatedly in the unrolled computation graph.

The hidden state h_0 is typically initialized to 0 (or learned).

3. Unrolling Through Time

To understand training, we "unroll" the RNN into a feedforward network of depth T:

$x₁ β†’ [W_xh] β†’ + β†’ [tanh] β†’ h₁ β†’ [W_hy] β†’ y₁
                ↑
          hβ‚€ β†’ [W_hh]

xβ‚‚ β†’ [W_xh] β†’ + β†’ [tanh] β†’ hβ‚‚ β†’ [W_hy] β†’ yβ‚‚
                ↑
          h₁ β†’ [W_hh]

x₃ β†’ [W_xh] β†’ + β†’ [tanh] β†’ h₃ β†’ [W_hy] β†’ y₃
                ↑
          hβ‚‚ β†’ [W_hh]
$

This is a very deep network (depth T) with shared weights across layers. Training requires computing gradients through this entire unrolled structure.

4. Backpropagation Through Time (BPTT)

Given a loss L = Ξ£_{t=1}^T L_t(y_t, target_t), we need βˆ‚L/βˆ‚W_hh, βˆ‚L/βˆ‚W_xh, βˆ‚L/βˆ‚W_hy.

Step 1: Gradient w.r.t. hidden states

Let Ξ΄_t = βˆ‚L/βˆ‚h_t. From the computation graph:

Ξ΄_t = (βˆ‚L_t/βˆ‚h_t) + (βˆ‚h_{t+1}/βˆ‚h_t)α΅€ Β· Ξ΄_{t+1}

Where βˆ‚L_t/βˆ‚h_t = W_hyα΅€ Β· βˆ‚L_t/βˆ‚y_t.

And the Jacobian of the recurrence:

βˆ‚h_{t+1}/βˆ‚h_t = diag(tanh'(z_{t+1})) Β· W_hh

where z_{t+1} = W_hh Β· h_t + W_xh Β· x_{t+1} + b_h (the pre-activation).

Step 2: Expanding the recurrence

Ξ΄_t = W_hyα΅€ Β· βˆ‚L_t/βˆ‚y_t + W_hhα΅€ Β· diag(tanh'(z_{t+1})) Β· Ξ΄_{t+1}

Unrolling from t back to 1:

δ₁ = Ξ£_{k=1}^T (∏_{j=k}^{2} W_hhα΅€ Β· diag(tanh'(z_j))) Β· W_hyα΅€ Β· βˆ‚L_k/βˆ‚y_k

This is a long product of Jacobians β€” and that's the smoking gun for the vanishing gradient problem.

Step 3: Gradient w.r.t. weights

Since W_hh is shared across all time steps, its gradient is the SUM of contributions from each step:

βˆ‚L/βˆ‚W_hh = Ξ£_{t=1}^T βˆ‚L_t/βˆ‚W_hh

Where at each step t, the gradient flows through all paths from L_t back to the occurrence of W_hh:

βˆ‚L_t/βˆ‚W_hh = Ξ£_{k=1}^t (βˆ‚L_t/βˆ‚h_t Β· βˆ‚h_t/βˆ‚h_k) Β· (βˆ‚h_k/βˆ‚W_hh)

The factor (βˆ‚h_t/βˆ‚h_k) is the product of Jacobians from step k to t.

5. The Vanishing/Exploding Gradient Problem

Consider the long-range dependency where we propagate gradient from time t back to time 1:

βˆ‚h_t/βˆ‚h₁ = ∏_{i=2}^t diag(tanh'(z_i)) Β· W_hh

Taking the spectral norm:

||βˆ‚h_t/βˆ‚h₁|| ≀ ∏_{i=2}^t ||diag(tanh'(z_i))|| Β· ||W_hh|| ≀ (||W_hh|| Β· max |tanh'|)^(t-1) ≀ (||W_hh|| Β· 1)^(t-1)

If ||W_hh|| < 1: gradient vanishes exponentially. Long-range dependencies are lost. If ||W_hh|| > 1: gradient explodes exponentially. Training becomes unstable.

The fundamental problem: For the RNN to have long-term memory, the recurrent Jacobian must have eigenvalues near 1. But standard gradient descent can't maintain this β€” the recurrent weight matrix inevitably drifts into the vanishing or exploding regime.

Why it's worse than deep feedforward: In a feedforward net of depth 100, each layer has its OWN weight matrix. In an RNN of 100 time steps, the SAME weight matrix is multiplied 100 times. If its spectral norm is 0.99, then 0.99^100 β‰ˆ 0.37 (manageable). If it's 0.95, 0.95^100 β‰ˆ 0.006 (problematic). This sensitivity to the recurrent weight's spectral radius is the core difficulty.

6. Truncated BPTT

The practical solution before LSTMs: truncate backpropagation to k steps.

Only propagate gradients back k < T steps

This avoids the vanishing/exploding problem but means the RNN can only learn dependencies spanning at most k time steps. It treats the RNN as having a fixed short-term memory window.

7. RNN Output Modes

RNNs support different input-output configurations:

Mode Input Output Example
Many-to-one Sequence Single value Sentiment classification
One-to-many Single value Sequence Image captioning
Many-to-many (synced) Sequence Sequence (same length) Part-of-speech tagging
Many-to-many (seq2seq) Sequence Sequence (different length) Machine translation


Key Terms

Worked Examples

Example 1: Unrolling a Tiny RNN

Problem: An RNN with d_h=2, d_in=1, d_out=1. W_hh = [[0.5, 0.1],[0.1, 0.5]], W_xh = [[0.2],[0.3]], b_h = [0,0], W_hy = [1, 1], b_y = 0. hβ‚€ = [0,0]. Input sequence: x = [1, 2]. Compute h₁, hβ‚‚ and y₁, yβ‚‚.

Solution: h₁ = tanh(W_hhΒ·hβ‚€ + W_xhΒ·1 + b_h) = tanh([0.2, 0.3]α΅€) = [tanh(0.2), tanh(0.3)] = [0.1974, 0.2913]

hβ‚‚ = tanh(W_hhΒ·h₁ + W_xhΒ·2 + b_h) W_hhΒ·h₁ = [0.5Β·0.1974+0.1Β·0.2913, 0.1Β·0.1974+0.5Β·0.2913] = [0.1278, 0.1654] W_xhΒ·2 = [0.4, 0.6] z = [0.5278, 0.7654] hβ‚‚ = [tanh(0.5278), tanh(0.7654)] = [0.4832, 0.6437]

y₁ = W_hyΒ·h₁ + b_y = 0.1974 + 0.2913 = 0.4887 yβ‚‚ = 0.4832 + 0.6437 = 1.1269

Example 2: BPTT Gradient for One Weight Occurrence

Problem: For the RNN in Example 1, compute βˆ‚yβ‚‚/βˆ‚W_xh[0] (the first element of W_xh) analytically at the first time step.

Solution: yβ‚‚ = W_hy Β· hβ‚‚, and hβ‚‚ = tanh(W_hhΒ·h₁ + W_xhΒ·xβ‚‚ + b_h)

So βˆ‚yβ‚‚/βˆ‚W_xh[0,0] depends on how W_xh[0,0] = 0.2 affects hβ‚‚. But W_xh[0,0] also affects h₁! So there are two paths: Path 1 (through h₁): βˆ‚h₁/βˆ‚W_xh[0,0] Β· βˆ‚hβ‚‚/βˆ‚h₁ Β· βˆ‚yβ‚‚/βˆ‚hβ‚‚ Path 2 (direct at time 2): depends only on xβ‚‚

βˆ‚h₁/βˆ‚W_xh[0,0] = tanh'(z₁₁) Β· x₁ Β· [1,0]α΅€ where z₁₁ is the first component of z₁ tanh'(0.2) = 1 βˆ’ tanhΒ²(0.2) = 1 βˆ’ 0.0390 = 0.9610 So βˆ‚h₁/βˆ‚W_xh[0,0] = [0.9610, 0]α΅€

βˆ‚hβ‚‚/βˆ‚h₁ = diag(tanh'(zβ‚‚)) Β· W_hh tanh'(0.5278) = 1 βˆ’ 0.2335 = 0.7665 tanh'(0.7654) = 1 βˆ’ 0.4144 = 0.5856 diag = [[0.7665, 0],[0, 0.5856]] βˆ‚hβ‚‚/βˆ‚h₁ = [[0.7665Β·0.5, 0.7665Β·0.1],[0.5856Β·0.1, 0.5856Β·0.5]] = [[0.3833, 0.0767],[0.0586, 0.2928]]

βˆ‚yβ‚‚/βˆ‚hβ‚‚ = W_hy = [1, 1]

Path 1 contribution: [1,1] Β· [[0.3833, 0.0767],[0.0586, 0.2928]] Β· [0.9610, 0]α΅€ = [0.4419, 0.3695] Β· [0.9610, 0]α΅€ = 0.4247

This shows how the gradient from yβ‚‚ back to weights at time 1 flows through h₁ β†’ hβ‚‚ β†’ yβ‚‚, being attenuated by the Jacobian product.

Example 3: Vanishing Gradient Demonstration

Problem: For an RNN with one-dimensional hidden state, W_hh = 0.5, and tanh acting as activation. If tanh'(z) β‰ˆ 0.8 on average (z is near zero), what fraction of the gradient at time 100 reaches time 1?

Solution: βˆ‚h_100/βˆ‚h₁ = ∏_{i=2}^{100} tanh'(z_i) Β· W_hh β‰ˆ (0.8 Β· 0.5)^99 = (0.4)^99 β‰ˆ 10^(βˆ’39)

The gradient is effectively zero. The RNN cannot learn dependencies spanning 100 time steps with these parameters.


Quiz

Q1: What does the concept of Backpropagation Through Time (BPTT) primarily refer to in this subject?

A) A historical anecdote about Backpropagation Through Time (BPTT) B) A visual representation of Backpropagation Through Time (BPTT) C) A computational error related to Backpropagation Through Time (BPTT) D) The definition and application of Backpropagation Through Time (BPTT)

Correct: D)

Q2: What is the primary purpose of End-of-Subject Quiz?

A) It is used to end-of-subject quiz in mathematical analysis B) It is used only in advanced research contexts C) It is primarily a historical notation system D) It replaces all other methods in this domain

Correct: A)

Q3: Which statement about Motivation: Why Feedforward Isn'T Enough is TRUE?

A) Motivation: Why Feedforward Isn'T Enough is mentioned only as a historical footnote B) Motivation: Why Feedforward Isn'T Enough is a fundamental concept covered in this subject C) Motivation: Why Feedforward Isn'T Enough is not related to this subject D) Motivation: Why Feedforward Isn'T Enough is an advanced topic beyond this subject's scope

Correct: B)

Q4: Based on the worked examples in this subject, what is the correct result?

A) 3Γ—3, W_xh: 3Γ—2, b_h: 3Γ—1. B) An unrelated numerical value C) A different result from a common mistake D) The inverse of the correct answer

Correct: A)

Q5: How are Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence related?

A) Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are completely unrelated topics B) Motivation: Why Feedforward Isn'T Enough is the inverse of The Rnn Recurrence C) Motivation: Why Feedforward Isn'T Enough and The Rnn Recurrence are closely related concepts D) Motivation: Why Feedforward Isn'T Enough is a special case of The Rnn Recurrence

Correct: C)

Q6: What is a common pitfall when working with Unrolling Through Time?

A) A common mistake is confusing Unrolling Through Time with a similar concept B) Unrolling Through Time is always computed the same way in all contexts C) The main error with Unrolling Through Time is using it when it is not needed D) Unrolling Through Time has no common misconceptions

Correct: A)

Q7: When should you apply The Vanishing/Exploding Gradient Problem?

A) The Vanishing/Exploding Gradient Problem is not practically useful B) Avoid The Vanishing/Exploding Gradient Problem unless explicitly instructed C) Use The Vanishing/Exploding Gradient Problem only in pure mathematics contexts D) Apply The Vanishing/Exploding Gradient Problem to solve problems in this subject's domain

Correct: D)

Practice Problems

Problem 1

An RNN has d_h = 3, d_in = 2. What are the dimensions of W_hh, W_xh, b_h?

Answer W_hh: 3Γ—3, W_xh: 3Γ—2, b_h: 3Γ—1.

Problem 2

If ||W_hh|| = 1.2 and max |tanh'| = 1, what happens to gradients over 50 time steps?

Answer ||βˆ‚h_50/βˆ‚h₁|| β‰ˆ 1.2^49 β‰ˆ 7.5Γ—10Β³ β€” gradients explode. Training will be unstable.

Problem 3

Explain why weight sharing in RNNs makes the vanishing gradient problem worse than in deep feedforward nets.

Answer In feedforward nets, each layer has its own weight matrix, so eigenvalues can vary across layers; some can be >1 while others <1, partially balancing. In RNNs, the same W_hh is multiplied T times, so if ||W_hh|| < 1, ALL T multiplications shrink the gradient. There's no opportunity for balancing β€” the same matrix's spectral properties are compounded T times.

Problem 4

Compute βˆ‚L/βˆ‚W_hh for an RNN with T=3 time steps. Write it as a sum of contributions from t=1,2,3.

Answer βˆ‚L/βˆ‚W_hh = (βˆ‚L₁/βˆ‚h₁)(βˆ‚h₁/βˆ‚W_hh) + (βˆ‚Lβ‚‚/βˆ‚hβ‚‚)(βˆ‚hβ‚‚/βˆ‚h₁)(βˆ‚h₁/βˆ‚W_hh) + (βˆ‚Lβ‚‚/βˆ‚hβ‚‚)(βˆ‚hβ‚‚/βˆ‚W_hh at t=2) + (βˆ‚L₃/βˆ‚h₃)(βˆ‚h₃/βˆ‚hβ‚‚)(βˆ‚hβ‚‚/βˆ‚h₁)(βˆ‚h₁/βˆ‚W_hh) + ... Each occurrence of W_hh in the unrolled graph contributes a term, and gradients flow back through the entire chain of hidden states.

Problem 5

If we use truncated BPTT with k=10, what is the maximum temporal dependency the RNN can learn? What if k=T?

Answer With k=10, at most 10 steps. With k=T (full BPTT), the RNN can theoretically learn dependencies of any length, but the vanishing gradient makes this impossible in practice with standard RNNs.

Summary

  1. RNNs process sequences by applying the same weights at each time step: h_t = tanh(W_hhΒ·h_{t-1} + W_xhΒ·x_t + b_h)
  2. Training unrolls the RNN into a depth-T feedforward net; BPTT computes gradients through all T steps
  3. The recurrent Jacobian product ∏ diag(tanh')·W_hh causes vanishing gradients if ||W_hh|| < 1 and exploding gradients if ||W_hh|| > 1
  4. Weight sharing amplifies the problem: the same matrix is multiplied T times with no layer-to-layer diversity
  5. Truncated BPTT limits the effective memory to k steps, which motivated the development of gated architectures (LSTM, GRU)

Pitfalls



Next Steps

Continue to 17-03 β€” LSTM Mathematics to see how gating mechanisms solve the vanishing gradient problem.