17-04: GRU Mathematics
Phase: 17, Deep Learning Architectures (Math)
Subject: 17-04
Prerequisites: 17-02 (RNNs), 17-03 (LSTM, for gating concepts), 16-02 (Activation Functions)
Next subject: 17-05, Residual Connections
Learning Objectives
By the end of this subject, you will be able to:
- Derive the full GRU equations from first principles and contrast with LSTM
- Explain how the update gate z_t performs the role of BOTH forget and input gates
- Derive the gradient ∂h_t/∂h_{t-1} for the GRU and show how the additive update mitigates vanishing gradients
- Quantify the parameter reduction of GRU vs. LSTM (three weight blocks instead of four = 25% fewer recurrent parameters)
- Identify when the GRU's simplified gating fails compared to LSTM
Core Content
1. GRU Philosophy: Do More With Less
The GRU (Gated Recurrent Unit) asks: do we really need a separate cell state? The answer: not necessarily, if we design the gating cleverly.
The GRU uses only TWO gates (vs. the LSTM's three) and a SINGLE hidden state (no separate cell state). Counting the candidate transformation, that is three weight blocks against the LSTM's four. It achieves comparable performance with fewer parameters.
2. GRU Equations
For input x_t and previous hidden state h_{t-1}:
Reset gate r_t: Controls how much of the previous hidden state to "forget" when computing the candidate
r_t = σ(W_r · [h_{t-1} ∥ x_t] + b_r)
Update gate z_t: Controls the interpolation between old state and new candidate (combines LSTM's forget + input gates)
z_t = σ(W_z · [h_{t-1} ∥ x_t] + b_z)
Candidate hidden state h̃_t: The proposed new state, computed with gated access to h_{t-1}
h̃_t = tanh(W · [(r_t ⊙ h_{t-1}) ∥ x_t] + b)
Hidden state update h_t: Interpolation between old and candidate
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
Where:
- All gates are in (0,1) via sigmoid
- ⊙ = element-wise product
- [a ∥ b] = concatenation
- Each weight matrix (W_r, W_z, W) ∈ ℝ^(d × (d+d_in)) and each bias (b_r, b_z, b) ∈ ℝ^d
⚠️ THIS IS CRITICAL: The update h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t is an element-wise LINEAR INTERPOLATION between old and new. When z_t ≈ 0, the state is copied verbatim (gradient highway). When z_t ≈ 1, the state is fully updated. This is like the LSTM cell state update, but using the SAME state for both memory and output.
3. The Two-Gate Design
Reset gate r_t: When r_t ≈ 0, the candidate h̃_t is computed as if h_{t-1} were all zeros: the GRU "resets" and treats the current input as the start of a new sequence. When r_t ≈ 1, the candidate considers the full history.
Update gate z_t: When z_t ≈ 0, h_t ≈ h_{t-1}: perfect memory, no gradient decay. When z_t ≈ 1, h_t ≈ h̃_t: full update using the candidate.
This is elegant: one gate (r) controls what to forget when computing the candidate; the other (z) controls the final interpolation.
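The four equations above can be sketched as a single step function. This is a minimal illustration using only the Python standard library; the helper names (`sigmoid`, `matvec`, `gru_step`) are made up for this sketch, and the numbers are the ones used in Worked Example 1 of this subject.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(h_prev, x, W_r, W_z, W, b_r, b_z, b):
    """One GRU step following the equations above; returns (r, z, h_cand, h_new)."""
    hx = h_prev + x                                      # [h_{t-1} || x_t]
    r = [sigmoid(a + bi) for a, bi in zip(matvec(W_r, hx), b_r)]
    z = [sigmoid(a + bi) for a, bi in zip(matvec(W_z, hx), b_z)]
    rh_x = [ri * hi for ri, hi in zip(r, h_prev)] + x    # [(r_t * h_{t-1}) || x_t]
    h_cand = [math.tanh(a + bi) for a, bi in zip(matvec(W, rh_x), b)]
    # Element-wise interpolation between the old state and the candidate.
    h_new = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]
    return r, z, h_cand, h_new

# Values from Worked Example 1 (d=2, d_in=1, all biases zero).
W_r = [[0.5, 0.1, 0.1], [0.1, 0.5, 0.1]]
W_z = W_r
W = [[0.2, 0.3, 0.1], [0.3, 0.2, 0.1]]
zeros = [0.0, 0.0]

r, z, h_cand, h_new = gru_step([0.5, 0.5], [1.0], W_r, W_z, W, zeros, zeros, zeros)
print([round(v, 4) for v in h_new])   # [0.3471, 0.3471]
```

Passing the gates back alongside the new state is a choice made here for inspection only; a practical implementation would normally return just h_t.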
4. Gradient Flow in GRU
Let's analyze ∂h_t/∂h_{t-1}. From the update equation:
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
Applying the product rule, the first term is the direct dependency and the other two come from h̃_t and z_t depending on h_{t-1}:
∂h_t/∂h_{t-1} = diag(1 − z_t) + diag(z_t) · ∂h̃_t/∂h_{t-1} + diag(h̃_t − h_{t-1}) · ∂z_t/∂h_{t-1}
The critical observation: diag(1 − z_t) is a DIAGONAL matrix! Just like the LSTM, the GRU has an additive update that creates a gradient highway.
When z_t ≈ 0: ∂h_t/∂h_{t-1} ≈ I (identity, i.e. near-perfect gradient flow)
When z_t ≈ 1: ∂h_t/∂h_{t-1} ≈ ∂h̃_t/∂h_{t-1} (gradient flows through the candidate computation)
Long-range gradient (across k steps):
∂h_t/∂h_{t-k} for the direct path ≈ ∏_{j=0}^{k-1} diag(1 − z_{t-j})
If z values stay small (the GRU keeps old information), the product stays near I.
But there's a subtlety: Because the GRU uses the SAME state for everything, the gradient also flows through the gate computations (∂z_t/∂h_{t-1} and ∂r_t/∂h_{t-1}, which DO involve W_z and W_r). These paths can still suffer from vanishing gradients. However, the dominant path is the additive one.
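One way to see both the highway and the extra gate paths is to estimate ∂h_t/∂h_{t-1} numerically. The sketch below is an illustration, not a derivation: it uses central finite differences, reuses the weights from Worked Example 1, and adds a hypothetical scalar bias `b_z` on the update gate so that z_t can be pushed toward 0.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(h_prev, x, W_r, W_z, W, b_z):
    """GRU step from the equations in section 2; only the update gate has a bias here."""
    hx = h_prev + x
    r = [sigmoid(a) for a in matvec(W_r, hx)]
    z = [sigmoid(a + b_z) for a in matvec(W_z, hx)]
    rh_x = [ri * hi for ri, hi in zip(r, h_prev)] + x
    h_cand = [math.tanh(a) for a in matvec(W, rh_x)]
    h_new = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]
    return h_new, z

def jacobian_fd(h_prev, x, W_r, W_z, W, b_z, eps=1e-6):
    """Central-difference estimate with J[i][j] = d h_t[i] / d h_{t-1}[j]."""
    d = len(h_prev)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        h_plus, h_minus = list(h_prev), list(h_prev)
        h_plus[j] += eps
        h_minus[j] -= eps
        f_plus, _ = gru_step(h_plus, x, W_r, W_z, W, b_z)
        f_minus, _ = gru_step(h_minus, x, W_r, W_z, W, b_z)
        for i in range(d):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

W_r = [[0.5, 0.1, 0.1], [0.1, 0.5, 0.1]]
W_z = W_r
W = [[0.2, 0.3, 0.1], [0.3, 0.2, 0.1]]
h_prev, x = [0.5, 0.5], [1.0]

results = {}
for b_z in (0.0, -10.0):      # -10 drives z_t close to 0
    _, z = gru_step(h_prev, x, W_r, W_z, W, b_z)
    J = jacobian_fd(h_prev, x, W_r, W_z, W, b_z)
    results[b_z] = (J, z)
    print(f"b_z = {b_z}: 1 - z_t = {[round(1 - zi, 4) for zi in z]}")
    for row in J:
        print("   ", [round(v, 4) for v in row])
```

With the large negative bias the estimated Jacobian should be close to the identity. With zero bias its diagonal differs noticeably from 1 − z_t and its off-diagonal entries are nonzero, which is the contribution of the candidate and gate paths described above.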
5. GRU vs. LSTM: Mathematical Comparison
| Aspect | LSTM | GRU |
|---|---|---|
| States | c_t (cell), h_t (hidden) | h_t only |
| Gates | f, i, o (3 gates) + candidate = 4 weight blocks | r, z (2 gates) + candidate = 3 weight blocks |
| Memory update | c_t = f⊙c_{t-1} + i⊙C̃ | h_t = (1−z)⊙h_{t-1} + z⊙h̃ |
| Output | h_t = o⊙tanh(c_t) | h_t already carries output |
| Gradient highway | Through c_t (diag(f)) | Through h_t (diag(1−z)) |
| Parameters | 4d(d+d_in) | 3d(d+d_in) |
GRU advantage: 25% fewer recurrent parameters (3 weight matrices vs 4). LSTM advantage: Separate cell state means the gradient highway (c_t) is decoupled from the output gating (o_t). The GRU forces the same state to serve both purposes.
When GRU may fail: For tasks requiring the LSTM to remember information but NOT output it (output gate o_t ≈ 0), the LSTM can store in c_t without revealing it in h_t. The GRU can't do this because h_t is both memory and output.
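A scalar toy case makes the "remember but do not output" difference concrete. The numbers below are invented for illustration; the only formulas used are h_t = o⊙tanh(c_t) and h_t = (1−z)⊙h_{t-1} + z⊙h̃ from the comparison table.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# LSTM readout: the cell state holds a value while the output gate is nearly shut.
c_t = 2.0                       # stored memory (illustrative value)
o_t = sigmoid(-10.0)            # output gate close to 0
h_lstm = o_t * math.tanh(c_t)   # what the rest of the network sees
print(f"LSTM: c_t = {c_t}, exposed h_t = {h_lstm:.6f}")

# GRU: keeping the value means z_t close to 0, and the kept value IS the output.
h_prev = 0.9                    # stored memory (illustrative value)
z_t = sigmoid(-10.0)            # update gate close to 0, so the state is held
h_cand = 0.0                    # candidate value, barely used when z_t is near 0
h_gru = (1 - z_t) * h_prev + z_t * h_cand
print(f"GRU:  stored and exposed h_t = {h_gru:.6f}")
```

In the LSTM case the exposed value is near zero while c_t is untouched; in the GRU case the only way to expose a near-zero value is to overwrite the memory itself.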
6. Parameter Count Example
For d=512, d_in=256:
- LSTM: 4 × 512 × 768 = 1,572,864 recurrent parameters
- GRU: 3 × 512 × 768 = 1,179,648 recurrent parameters
GRU saves ~393K parameters, or 25%.
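The same arithmetic as a small helper, in case other sizes are of interest. The function name `weight_count` and the optional bias flag are assumptions of this sketch; the figures above exclude biases.

```python
def weight_count(n_blocks, d, d_in, include_bias=False):
    """Parameters in n_blocks weight matrices of shape d x (d + d_in), optionally with biases."""
    per_block = d * (d + d_in) + (d if include_bias else 0)
    return n_blocks * per_block

d, d_in = 512, 256
lstm = weight_count(4, d, d_in)   # f, i, o and candidate blocks
gru = weight_count(3, d, d_in)    # r, z and candidate blocks
print(lstm, gru, lstm - gru, 1 - gru / lstm)   # 1572864 1179648 393216 0.25
```

Including biases changes the absolute counts slightly but not the ratio, since every block gains the same d extra parameters.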
7. The Reset Gate's Role in Gradient Flow
The reset gate affects gradient flow through the candidate computation:
h̃_t = tanh(W · [(r_t ⊙ h_{t-1}) ∥ x_t] + b)
When r_t ≈ 0: h̃_t doesn't depend on h_{t-1} at all, so the gradient through the candidate path is zero. The only remaining gradient flow is through (1−z_t)⊙h_{t-1} (plus the update gate's own dependence on h_{t-1}).
When r_t ≈ 1: h̃_t depends fully on h_{t-1}, and gradients flow through both the additive path AND the candidate path.
This gives the GRU fine-grained control: even when updating (z_t large), it can choose whether the new state should depend on history (r_t large) or be a fresh start (r_t small).
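A quick check of the "fresh start" behaviour: the sketch below feeds two different histories into the candidate equation with the reset gate set by hand. Fixing r_t to exactly 0 or 1 is an idealization (a sigmoid only approaches those values), and the weights and histories are illustrative.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def candidate(r, h_prev, x, W, b):
    """Candidate state tanh(W . [(r * h_prev) || x] + b), with r supplied directly."""
    rh_x = [ri * hi for ri, hi in zip(r, h_prev)] + x
    return [math.tanh(a + bi) for a, bi in zip(matvec(W, rh_x), b)]

W = [[0.2, 0.3, 0.1], [0.3, 0.2, 0.1]]
b = [0.0, 0.0]
x = [1.0]
history_a = [0.5, 0.5]
history_b = [-0.9, 0.9]

cand = {}
for r_val in (0.0, 1.0):
    r = [r_val, r_val]
    cand[r_val] = (candidate(r, history_a, x, W, b),
                   candidate(r, history_b, x, W, b))
    print(f"r = {r_val}:",
          [round(v, 4) for v in cand[r_val][0]],
          [round(v, 4) for v in cand[r_val][1]])
```

With r = 0 the two candidates coincide because only x_t reaches the tanh; with r = 1 they differ, reflecting the two histories.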
8. Empirical Notes
In practice, GRU and LSTM perform similarly on most tasks. GRU often wins on smaller datasets (fewer parameters = less overfitting). LSTM sometimes wins on very long sequences (separate cell state provides cleaner gradient highway). For modern applications, both have been largely superseded by Transformers, but understanding their gating mechanisms is essential: the same principles appear in attention gating, highway networks, and adaptive computation.
Worked Examples
Example 1: Forward Pass of a 2D GRU
Problem: A GRU with d=2, d_in=1. W_r = [[0.5,0.1,0.1],[0.1,0.5,0.1]], W_z = same, W = [[0.2,0.3,0.1],[0.3,0.2,0.1]]. All biases zero. h_{t-1}=[0.5, 0.5], x_t=1. Compute r_t, z_t, h̃_t, h_t.
Solution: [h_{t-1} ∥ x_t] = [0.5, 0.5, 1]
r_preact = W_r · [0.5, 0.5, 1] = [0.25+0.05+0.1, 0.05+0.25+0.1] = [0.40, 0.40]
r_t = [σ(0.40), σ(0.40)] = [0.5987, 0.5987]
z_preact = same = [0.40, 0.40]
z_t = [0.5987, 0.5987]
r_t⊙h_{t-1} = [0.2993, 0.2993]
h̃_preact = W · [0.2993, 0.2993, 1] = [0.0599+0.0898+0.1, 0.0898+0.0599+0.1] = [0.2497, 0.2497]
h̃_t = [tanh(0.2497), tanh(0.2497)] = [0.2446, 0.2446]
h_t = (1−0.5987)·[0.5, 0.5] + 0.5987·[0.2446, 0.2446] = [0.2007+0.1464, 0.2007+0.1464] = [0.3471, 0.3471]
Example 2: Gradient Through the Additive Path
Problem: If ∂L/∂h_t = [1, 0] at time t and z_t = [0.2, 0.8], what is the direct contribution to ∂L/∂h_{t-1}?
Solution: The direct path: h_t = (1−z_t)⊙h_{t-1} + ... So ∂h_t/∂h_{t-1} (direct) = diag(1−z_t) = diag([0.8, 0.2])
∂L/∂h_{t-1} (direct) = diag(1−z_t)ᵀ · [1,0] = [0.8·1, 0.2·0] = [0.8, 0]
The first dimension preserves 80% of the gradient; the second dimension (which was heavily updated) would preserve only 20% of any gradient arriving there (here the incoming gradient in that dimension is 0).
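Because diag(1−z_t) is diagonal, the direct-path backward step is just an element-wise scaling. A tiny sketch (the function name `direct_backprop` is made up here) that reproduces the numbers in this example:

```python
def direct_backprop(grad_h_t, z_t):
    """Direct-path contribution diag(1 - z_t)^T . dL/dh_t, computed element-wise."""
    return [(1 - z) * g for z, g in zip(z_t, grad_h_t)]

grad_prev = direct_backprop([1.0, 0.0], [0.2, 0.8])
print([round(v, 4) for v in grad_prev])   # [0.8, 0.0]
```

This covers only the additive path; the contributions through z_t, r_t and the candidate are deliberately left out, as in the problem statement.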
Example 3: Long-Sequence Gradient Decay
Problem: A GRU dimension consistently has z_t = 0.1 for 100 time steps (rarely updated, so strong memory). What fraction of gradient survives the direct path?
Solution: ∏_{t=1}^{100} (1−0.1) = 0.9^100 ≈ 2.66 × 10^{-5}
This is surprisingly small! Even with z=0.1 (90% preserved each step), after 100 steps only about 0.003% survives. This shows that even gated architectures aren't magic: very long sequences still challenge them, which is one reason Transformers with direct attention won.
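To get a feel for how quickly the direct path decays, the sketch below tabulates (1 − z)^T for a few constant gate values and sequence lengths. Holding z constant over time is the same simplifying assumption made in this example; the function name is illustrative.

```python
def direct_path_survival(z, steps):
    """Fraction of gradient left on the direct path if z_t = z at every step."""
    return (1.0 - z) ** steps

for z in (0.01, 0.1, 0.5):
    for steps in (10, 100, 1000):
        frac = direct_path_survival(z, steps)
        print(f"z = {z:<5} steps = {steps:<5} survival = {frac:.3e}")
```

The row for z = 0.1 and 100 steps matches the value computed above, and the z = 0.01 rows show how much smaller the update gate has to stay for the highway to span hundreds of steps.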
Quiz
Q1: How many gates does a GRU have, and what are they?
A) 4 gates: forget, input, output, and candidate
B) 2 gates: reset gate r_t and update gate z_t
C) 3 gates: reset, update, and output
D) 1 gate: the update gate
Answer & Explanation
**B**: The GRU has exactly two gates: r_t (reset gate) and z_t (update gate). The LSTM has three gates (forget, input, output) plus a candidate transformation, for four weight blocks in total. The update gate combines the roles of LSTM's forget and input gates into a single interpolation mechanism.
Q2: What creates the gradient highway in a GRU?
A) The tanh activation in the candidate computation
B) The term (1 − z_t) ⊙ h_{t-1} in the hidden state update
C) The reset gate multiplying h_{t-1}
D) The concatenation of h_{t-1} and x_t
Answer & Explanation
**B**: h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t. The term (1 − z_t) ⊙ h_{t-1} gives ∂h_t/∂h_{t-1} containing diag(1 − z_t), an additive term without weight matrix multiplication. When z_t ≈ 0, this is near-identity.
Q3: What happens when the reset gate r_t ≈ 0?
A) The hidden state is copied verbatim from the previous step
B) The candidate h̃_t is computed as if h_{t-1} were all zeros: the GRU "resets"
C) The update gate is forced to 0
D) The GRU switches to LSTM mode
Answer & Explanation
**B**: h̃_t = tanh(W · [(r_t ⊙ h_{t-1}) ∥ x_t] + b). When r_t ≈ 0, r_t ⊙ h_{t-1} ≈ 0, so the candidate depends only on x_t. A describes z_t ≈ 0 behavior.
Q4: How many recurrent weight matrices does a GRU have versus an LSTM?
A) GRU: 2, LSTM: 4
B) GRU: 3, LSTM: 4
C) GRU: 4, LSTM: 4
D) GRU: 2, LSTM: 2
Answer & Explanation
**B**: GRU has W_r, W_z, W (3 matrices). LSTM has W_f, W_i, W_C, W_o (4 matrices). GRU saves 25% on recurrent parameters (3d(d+d_in) vs 4d(d+d_in)).
Q5: What is a limitation of the GRU compared to the LSTM?
A) The GRU cannot handle sequences longer than 10 steps
B) The GRU cannot separately control what to remember and what to output, since it has a single hidden state
C) The GRU cannot use sigmoid gates
D) The GRU requires more memory than the LSTM
Answer & Explanation
**B**: LSTM has separate c_t (memory) and h_t = o_t ⊙ tanh(c_t) (output). If o_t ≈ 0, information is stored in c_t without exposure. GRU's single state h_t serves as both memory and output: everything remembered is always exposed.
Practice Problems
Problem 1
Write the GRU update equation and identify which part creates the gradient highway.
Answer
h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t. The term (1−z_t)⊙h_{t-1} creates the gradient highway: it's additive and doesn't multiply h_{t-1} by a weight matrix. ∂h_t/∂h_{t-1} includes diag(1−z_t) from this path.
Problem 2
How many weight matrices does a GRU have, and what are they?
Answer
Three: W_r (reset gate), W_z (update gate), and W (candidate hidden state). Each is d × (d+d_in).
Problem 3
What happens when z_t = 0 for all t?
Answer
h_t = 1·h_{t-1} + 0·h̃_t = h_{t-1}. The hidden state becomes constant (h_t = h_0 for all t). The GRU ignores all inputs: perfect memory, but no new information is ever written into the state.
Problem 4
Derive ∂h_t/∂r_t (gradient of hidden state w.r.t. reset gate).
Answer
h_t depends on r_t only through h̃_t = tanh(W[(r_t⊙h_{t-1}) ∥ x_t] + b). Let a_t be the tanh pre-activation and W_h = W_{:, :d} the columns of W that multiply r_t⊙h_{t-1}. As a d × d Jacobian: ∂h_t/∂r_t = diag(z_t) · ∂h̃_t/∂r_t = diag(z_t) · diag(tanh'(a_t)) · W_h · diag(h_{t-1}), with tanh'(a_t) = 1 − h̃_t². Entry (i, j) is z_{t,i} · (1 − h̃_{t,i}²) · (W_h)_{ij} · h_{t-1,j}.
Problem 5
Compare GRU and LSTM: which can "remember without revealing"? Explain mathematically.
Answer
LSTM can, GRU can't. LSTM has separate c_t (memory) and h_t = o_t⊙tanh(c_t) (output). If o_t ≈ 0, h_t ≈ 0 regardless of c_t, so the LSTM hides its memory. GRU has one state h_t that serves as both memory and output. If h_t ≈ 0, the memory is lost. If h_t is nonzero, the information is exposed.
Summary
- GRU simplifies LSTM with 2 gates instead of 3 (3 weight blocks instead of 4): update gate z_t (interpolation weight) and reset gate r_t (controls history influence on candidate)
- The update h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t is a linear interpolation creating a gradient highway: ∂h_t/∂h_{t-1} contains diag(1−z_t)
- When z_t ≈ 0, gradients flow freely (near-identity Jacobian); when z_t ≈ 1, the GRU fully updates
- GRU has 25% fewer recurrent parameters than LSTM (3d(d+d_in) vs 4d(d+d_in))
- The single-state design means GRU can't "remember without revealing" like LSTM can with its output gate
Pitfalls
- Assuming GRU always matches LSTM performance. While GRU and LSTM perform similarly on most benchmarks, GRU can underperform on tasks requiring the "remember without revealing" capability: LSTM's output gate o_t lets it store information in c_t while outputting h_t ≈ 0. The GRU's single state h_t cannot separately control memory retention and information exposure.
- Misunderstanding the coupling between z_t and (1 − z_t). Since the update gate z_t controls both how much to retain (1 − z_t) and how much to update (z_t) through a single sigmoid output, you cannot independently set "keep 90%" and "add 50%." This coupling is the GRU's fundamental architectural constraint relative to LSTM.
- Expecting perfect gradient flow over very long sequences (1000+ steps). Even with z_t = 0.1 (90% preserved per step), gradient survival is 0.9^1000 ≈ 10^−46. Additive gating helps enormously but doesn't create infinite memory; Transformers with direct attention ultimately won for very long-range dependencies.
- Confusing the reset gate r_t with the update gate z_t. r_t only affects the candidate computation h̃_t: when r_t ≈ 0, the candidate ignores history entirely. z_t controls the final interpolation between old state and candidate. A common debugging mistake: expecting r_t to control the output when it only gates information flow into h̃_t.
- Treating GRU as strictly "simpler than LSTM." While GRU has fewer gates (3 weight matrices vs. 4), the single-state design means all gradient paths funnel through one state vector. The gradient dynamics through gate computations are actually more entangled than in LSTM, where the cell state and hidden state provide partially decoupled paths.
Next Steps
Continue to 17-05 (Residual Connections) to learn about skip connections and how they enable very deep networks.