
17-04: GRU Mathematics

Phase: 17, Deep Learning Architectures (Math)
Subject: 17-04
Prerequisites: 17-02 (RNNs), 17-03 (LSTM, for gating concepts), 16-02 (Activation Functions)
Next subject: 17-05 (Residual Connections)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the full GRU equations from first principles and contrast with LSTM
  2. Explain how the update gate z_t performs the role of BOTH forget and input gates
  3. Derive the gradient ∂h_t/∂h_{t-1} for the GRU and show how it mitigates vanishing gradients
  4. Quantify the parameter reduction of GRU vs. LSTM (3 weight matrices instead of 4 = 25% fewer recurrent parameters)
  5. Identify when the GRU's simplified gating fails compared to LSTM

Core Content

1. GRU Philosophy: Do More With Less

The GRU (Gated Recurrent Unit) asks: do we really need a separate cell state? The answer: not necessarily, if we design the gating cleverly.

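The GRU uses only TWO gates (vs. the LSTM's three) plus a candidate computation, and a SINGLE hidden state (no separate cell state). That gives three weight blocks where the LSTM has four, and in practice it often achieves comparable performance with fewer parameters.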

2. GRU Equations

For input x_t and previous hidden state h_{t-1}:

Reset gate r_t: Controls how much of the previous hidden state to "forget" when computing the candidate

r_t = σ(W_r · [h_{t-1} ∥ x_t] + b_r)

Update gate z_t: Controls the interpolation between old state and new candidate (combines LSTM's forget + input gates)

z_t = σ(W_z · [h_{t-1} ∥ x_t] + b_z)

Candidate hidden state h̃_t: The proposed new state, computed with gated access to h_{t-1}

h̃_t = tanh(W · [r_t ⊙ h_{t-1} ∥ x_t] + b)

Hidden state update h_t: Interpolation between old and candidate

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

Where:
- All gates are in (0,1) via the sigmoid σ
- ⊙ = element-wise product
- [a ∥ b] = concatenation
- Each weight matrix (W_r, W_z, W) ∈ ℝ^(d × (d+d_in)); each bias (b_r, b_z, b) ∈ ℝ^d

⚠️ THIS IS CRITICAL: The update h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t is an element-wise LINEAR INTERPOLATION between old and new. When z_t ≈ 0, the state is copied verbatim (gradient highway). When z_t ≈ 1, the state is fully updated. This is like the LSTM cell state update, but using the SAME state for both memory and output.

3. The Two-Gate Design

Reset gate r_t: When r_t ≈ 0, the candidate h̃_t is computed as if h_{t-1} were all zeros: the GRU "resets" and treats the current input as the start of a new sequence. When r_t ≈ 1, the candidate considers the full history.

Update gate z_t: When z_t ≈ 0, h_t ≈ h_{t-1}: perfect memory, no gradient decay. When z_t ≈ 1, h_t ≈ h̃_t: full update using the candidate.

This is elegant: one gate (r) controls what to forget when computing the candidate; the other (z) controls the final interpolation.
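To see the update gate's extremes in isolation, the sketch below skips the gate computation and supplies z_t by hand. The function name `interpolate` and all numbers are hypothetical; a real sigmoid gate only approaches 0 or 1 and never reaches them exactly.

```python
def interpolate(h_prev, h_cand, z):
    """h_t = (1 - z) * h_prev + z * h_cand, element-wise."""
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

h_prev = [0.9, -0.4, 0.2]
h_cand = [-0.1, 0.7, 0.5]

keep = interpolate(h_prev, h_cand, [0.0, 0.0, 0.0])     # z = 0: copy the old state
replace = interpolate(h_prev, h_cand, [1.0, 1.0, 1.0])  # z = 1: take the candidate
mixed = interpolate(h_prev, h_cand, [0.0, 1.0, 0.5])    # a separate choice per dimension

print(keep)
print(replace)
print(mixed)
```

The `mixed` case is the one that matters in practice: each hidden dimension has its own z value, so some dimensions can act as long-term memory while others track the current input.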

4. Gradient Flow in GRU

Let's analyze ∂h_t/∂h_{t-1}. From the update equation:

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

The first term contributes a direct dependency on h_{t-1}; the remaining terms arise because z_t and h̃_t also depend on h_{t-1}:

∂h_t/∂h_{t-1} = diag(1 − z_t) + (terms from ∂z_t/∂h_{t-1} and ∂h̃_t/∂h_{t-1})

The critical observation: diag(1 − z_t) is a DIAGONAL matrix, with no weight matrix and no activation derivative in it. Just like the LSTM, the GRU has an additive update that creates a gradient highway.

When z_t ≈ 0: ∂h_t/∂h_{t-1} ≈ I (identity, perfect gradient flow)
When z_t ≈ 1: ∂h_t/∂h_{t-1} ≈ ∂h̃_t/∂h_{t-1} (gradient flows through the candidate computation)
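For reference, the terms left implicit above can be written out with the chain rule. The symbols U_r, U_z, and U are notation introduced only for this sketch: they denote the first d columns of W_r, W_z, and W, that is, the parts that multiply the hidden state. Squares and products of vectors are element-wise.

```latex
\begin{aligned}
\frac{\partial h_t}{\partial h_{t-1}}
  &= \underbrace{\operatorname{diag}(1 - z_t)}_{\text{direct (additive) path}}
   + \underbrace{\operatorname{diag}(\tilde h_t - h_{t-1})\,
       \operatorname{diag}\big(z_t \odot (1 - z_t)\big)\,U_z}_{\text{through } z_t}
   + \underbrace{\operatorname{diag}(z_t)\,
       \frac{\partial \tilde h_t}{\partial h_{t-1}}}_{\text{through } \tilde h_t}, \\[4pt]
\frac{\partial \tilde h_t}{\partial h_{t-1}}
  &= \operatorname{diag}\big(1 - \tilde h_t^{2}\big)\,U\,
     \Big[\operatorname{diag}(r_t)
       + \operatorname{diag}(h_{t-1})\,
         \operatorname{diag}\big(r_t \odot (1 - r_t)\big)\,U_r\Big].
\end{aligned}
```

Both limits follow directly. When z_t ≈ 0, the factors z_t ⊙ (1 − z_t) and z_t vanish and only the identity-like first term remains. When z_t ≈ 1, the first two terms vanish and the Jacobian reduces to ∂h̃_t/∂h_{t-1}.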

Long-range gradient:

∂h_t/∂h_{t-k} for the direct path = ∏_{j=0}^{k-1} diag(1 − z_{t-j})

If the z values stay small (the GRU keeps old information), the product stays close to I. How close depends on both z and k, since every factor is slightly below 1.

But there's a subtlety: Because the GRU uses the SAME state for everything, the gradient also flows through the gate computations (∂z_t/∂h_{t-1} and ∂r_t/∂h_{t-1}, which DO involve W_z and W_r). These paths can still suffer from vanishing gradients. When z_t is small, however, the additive path is usually the dominant one.
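The claim that the full Jacobian, gate paths included, approaches the identity when z_t is small can be checked numerically. The sketch below estimates ∂h_t/∂h_{t-1} by central finite differences for three update-gate biases. All names, weights, and biases are illustrative assumptions, and the reset and candidate biases are set to zero for brevity.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(h_prev, x, W_r, W_z, W, b_z):
    """GRU step with zero reset/candidate biases; returns (h_t, z_t)."""
    hx = h_prev + x
    r = [sigmoid(a) for a in matvec(W_r, hx)]
    z = [sigmoid(a + bi) for a, bi in zip(matvec(W_z, hx), b_z)]
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
    h_cand = [math.tanh(a) for a in matvec(W, gated)]
    h = [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]
    return h, z

def jacobian_fd(h_prev, x, W_r, W_z, W, b_z, eps=1e-6):
    """Central finite-difference estimate, J[i][j] = d h_t[i] / d h_{t-1}[j]."""
    d = len(h_prev)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        up, dn = list(h_prev), list(h_prev)
        up[j] += eps
        dn[j] -= eps
        h_up, _ = gru_step(up, x, W_r, W_z, W, b_z)
        h_dn, _ = gru_step(dn, x, W_r, W_z, W, b_z)
        for i in range(d):
            J[i][j] = (h_up[i] - h_dn[i]) / (2 * eps)
    return J

W_r = [[0.4, -0.2, 0.3], [0.1, 0.5, -0.3]]
W_z = [[-0.3, 0.2, 0.6], [0.2, -0.4, 0.1]]
W = [[0.7, -0.5, 0.2], [-0.1, 0.3, 0.8]]
h_prev, x = [0.5, -0.3], [1.0]

# A strongly negative b_z pushes z toward 0, a strongly positive one toward 1.
for b_z in ([-6.0, -6.0], [0.0, 0.0], [6.0, 6.0]):
    _, z = gru_step(h_prev, x, W_r, W_z, W, b_z)
    J = jacobian_fd(h_prev, x, W_r, W_z, W, b_z)
    print("z =", [round(v, 3) for v in z])
    print("J =", [[round(v, 3) for v in row] for row in J])
```

With the strongly negative bias the printed matrix should be close to the identity. With the strongly positive bias the diagonal highway term almost disappears and what remains is the candidate-path Jacobian, which depends on W.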

5. GRU vs. LSTM: Mathematical Comparison

| Aspect | LSTM | GRU |
|---|---|---|
| States | c_t (cell), h_t (hidden) | h_t only |
| Gates | f, i, o (3 gates) + candidate = 4 weight blocks | r, z (2 gates) + candidate = 3 weight blocks |
| Memory update | c_t = f⊙c_{t-1} + i⊙C̃ | h_t = (1−z)⊙h_{t-1} + z⊙h̃ |
| Output | h_t = o⊙tanh(c_t) | h_t already carries output |
| Gradient highway | Through c_t (diag(f)) | Through h_t (diag(1−z)) |
| Parameters (weights only) | 4d(d+d_in) | 3d(d+d_in) |

GRU advantage: 25% fewer recurrent parameters (3 weight matrices vs. 4).

LSTM advantage: Separate cell state means the gradient highway (c_t) is decoupled from the output gating (o_t). The GRU forces the same state to serve both purposes.

When GRU may fail: For tasks that require remembering information but NOT outputting it, the LSTM can close its output gate (o_t ≈ 0) and keep the information in c_t without revealing it in h_t. The GRU can't do this because h_t is both memory and output.
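The "remember without revealing" behaviour comes down to the LSTM read-out h_t = o_t ⊙ tanh(c_t). The following toy sketch uses hand-picked, hypothetical values for the cell state and the output gate; it does not train or run a full LSTM.

```python
import math

def lstm_output(c, o):
    """LSTM read-out h_t = o_t * tanh(c_t), element-wise."""
    return [oi * math.tanh(ci) for oi, ci in zip(o, c)]

c_t = [2.0, -1.5]       # information stored in the cell state
closed = [1e-4, 1e-4]   # output gate nearly shut
opened = [0.99, 0.99]   # output gate nearly open

h_hidden = lstm_output(c_t, closed)  # almost nothing is exposed
h_shown = lstm_output(c_t, opened)   # the stored content is read out

print(h_hidden)
print(h_shown)
```

In both cases `c_t` is unchanged, so the memory survives even while the output is close to zero. A GRU has no equivalent read-out step: whatever is stored in h_t is also what the next layer sees.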

6. Parameter Count Example

For d=512, d_in=256 (weights only, biases excluded):
- LSTM: 4 × 512 × 768 = 1,572,864 recurrent parameters
- GRU: 3 × 512 × 768 = 1,179,648 recurrent parameters

GRU saves 393,216 parameters (~393K), or 25%.
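The same arithmetic generalizes to any d and d_in. The helper below is a hypothetical convenience function, not part of any library, and it counts parameters exactly as this section does: one d × (d + d_in) matrix per block, with biases optional.

```python
def gated_rnn_weights(d, d_in, n_blocks, include_bias=False):
    """Parameters in n_blocks matrices of shape d x (d + d_in), optionally plus biases."""
    per_block = d * (d + d_in) + (d if include_bias else 0)
    return n_blocks * per_block

d, d_in = 512, 256
lstm = gated_rnn_weights(d, d_in, n_blocks=4)  # f, i, o, candidate
gru = gated_rnn_weights(d, d_in, n_blocks=3)   # r, z, candidate

print(lstm, gru, lstm - gru, (lstm - gru) / lstm)
```

The ratio is always 1/4 under this counting, independent of d and d_in. Some library implementations keep separate input and recurrent bias vectors, so their reported totals can differ slightly from these figures.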

7. The Reset Gate's Role in Gradient Flow

The reset gate affects gradient flow through the candidate computation:

h̃_t = tanh(W · [r_t⊙h_{t-1} ∥ x_t] + b)

When r_t ≈ 0: h̃_t effectively doesn't depend on h_{t-1}, so the gradient through the candidate path is approximately zero. The only remaining gradient flow is through (1−z_t)⊙h_{t-1}.

When r_t ≈ 1: h̃_t depends fully on h_{t-1}, and gradients flow through both the additive path AND the candidate path.

This gives the GRU fine-grained control: even when updating (z_t large), it can choose whether the new state should depend on history (r_t large) or be a fresh start (r_t small).
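The "fresh start" behaviour can be illustrated by feeding two different histories through the candidate computation with the reset gate fixed by hand. The function name `candidate`, the weights, and the histories are illustrative assumptions; the bias is zero for brevity.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def candidate(h_prev, x, r, W):
    """Candidate tanh(W . [r * h_prev || x]) with the reset gate r supplied by hand."""
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
    return [math.tanh(a) for a in matvec(W, gated)]

W = [[0.7, -0.5, 0.2], [-0.1, 0.3, 0.8]]
x = [1.0]
history_a = [0.9, -0.4]
history_b = [-0.6, 0.8]

# r = 0: both histories give the same candidate (a "fresh start").
fresh_a = candidate(history_a, x, [0.0, 0.0], W)
fresh_b = candidate(history_b, x, [0.0, 0.0], W)

# r = 1: the candidate now depends on which history came before.
full_a = candidate(history_a, x, [1.0, 1.0], W)
full_b = candidate(history_b, x, [1.0, 1.0], W)

print(fresh_a, fresh_b)
print(full_a, full_b)
```

Since the candidate is identical for both histories when r = 0, its derivative with respect to h_{t-1} is zero there, which is the gradient statement made above.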

8. Empirical Notes

In practice, GRU and LSTM perform similarly on many tasks. GRU sometimes does better on smaller datasets (fewer parameters can mean less overfitting), and LSTM sometimes does better on very long sequences (the separate cell state provides a cleaner gradient highway); neither is a reliable rule. For modern applications, both have been largely superseded by Transformers, but understanding their gating mechanisms is still valuable: the same principles appear in attention gating, highway networks, and adaptive computation.



Key Terms

Worked Examples

Example 1: Forward Pass of a 2D GRU

Problem: A GRU with d=2, d_in=1. W_r = [[0.5,0.1,0.1],[0.1,0.5,0.1]], W_z = W_r, W = [[0.2,0.3,0.1],[0.3,0.2,0.1]]. All biases zero. h_{t-1}=[0.5, 0.5], x_t=1. Compute r_t, z_t, h̃_t, h_t.

Solution: [h_{t-1} ∥ x_t] = [0.5, 0.5, 1]

r_preact = W_r · [0.5, 0.5, 1] = [0.25+0.05+0.1, 0.05+0.25+0.1] = [0.40, 0.40]
r_t = [σ(0.40), σ(0.40)] = [0.5987, 0.5987]

z_preact = r_preact = [0.40, 0.40] (because W_z = W_r)
z_t = [0.5987, 0.5987]

r_t⊙h_{t-1} = [0.2993, 0.2993]
h̃_preact = W · [0.2993, 0.2993, 1] = [0.0599+0.0898+0.1, 0.0898+0.0599+0.1] = [0.2497, 0.2497]
h̃_t = [tanh(0.2497), tanh(0.2497)] = [0.2446, 0.2446]

h_t = (1−0.5987)·0.5 + 0.5987·0.2446 (per component) = [0.2007+0.1464, 0.2007+0.1464] = [0.3471, 0.3471]
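The hand calculation can be double-checked with a few lines of standard-library Python. The helper names are assumptions made for this sketch; the weights and inputs are the ones given in the problem statement.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

W_r = [[0.5, 0.1, 0.1], [0.1, 0.5, 0.1]]
W_z = W_r  # the problem uses the same matrix for both gates
W = [[0.2, 0.3, 0.1], [0.3, 0.2, 0.1]]
h_prev, x = [0.5, 0.5], [1.0]

r = [sigmoid(a) for a in matvec(W_r, h_prev + x)]
z = [sigmoid(a) for a in matvec(W_z, h_prev + x)]
gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
h_cand = [math.tanh(a) for a in matvec(W, gated)]
h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]

for name, vec in [("r_t", r), ("z_t", z), ("h_cand", h_cand), ("h_t", h)]:
    print(name, [round(v, 4) for v in vec])
```

Rounded to four decimals, the printed values should agree with the solution above.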

Example 2: Gradient Through the Additive Path

Problem: If ∂L/∂h_t = [1, 0] at time t and z_t = [0.2, 0.8], what is the direct contribution to ∂L/∂h_{t-1}?

Solution: The direct path: h_t = (1−z_t)⊙h_{t-1} + ...
So ∂h_t/∂h_{t-1} (direct) = diag(1−z_t) = diag([0.8, 0.2])

∂L/∂h_{t-1} (direct) = diag(1−z_t)ᵀ · [1, 0] = [0.8·1, 0.2·0] = [0.8, 0]

The first dimension passes 80% of its gradient back. The second dimension (which was heavily updated) would pass only 20%, but here its incoming gradient is 0, so it contributes nothing.
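Because diag(1 − z_t) is diagonal, the direct-path backward step is just an element-wise product. A minimal sketch, with `backprop_direct` as a hypothetical helper name:

```python
def backprop_direct(grad_h_t, z_t):
    """Direct-path gradient diag(1 - z_t)^T @ grad_h_t, computed element-wise."""
    return [(1.0 - zi) * gi for zi, gi in zip(z_t, grad_h_t)]

z_t = [0.2, 0.8]
print(backprop_direct([1.0, 0.0], z_t))  # the case from this example
print(backprop_direct([1.0, 1.0], z_t))  # both dimensions receive gradient
```

The second call makes the 80% versus 20% contrast visible, since both dimensions now have a nonzero incoming gradient.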

Example 3: Long-Sequence Gradient Decay

Problem: A GRU dimension consistently has z_t = 0.1 for 100 time steps (rarely updated, strong memory). What fraction of gradient survives the direct path?

Solution: ∏ (1−0.1) = 0.9^100 ≈ 2.66 × 10^{-5}

This is surprisingly small! Even with z = 0.1 (90% preserved each step), only about 0.003% survives 100 steps. Gated architectures aren't magic: very long sequences still challenge them, which is one reason attention-based Transformers, with direct connections between distant positions, became dominant.
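How quickly the direct path decays depends strongly on z. The short sketch below repeats the calculation for a few constant gate values; `surviving_fraction` is a hypothetical helper, and holding z constant over all steps is a simplifying assumption.

```python
def surviving_fraction(z, steps):
    """Fraction of gradient left on the direct path after `steps` steps with constant z."""
    return (1.0 - z) ** steps

for z in (0.1, 0.01, 0.001):
    print(f"z = {z}: {surviving_fraction(z, 100):.3e}")
```

Reducing z by a factor of ten changes the outcome from almost total loss to retaining roughly a third of the gradient, so long-range memory requires gates that stay very close to 0 rather than merely small.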


Quiz

Q1: How many gates does a GRU have, and what are they?

A) 4 gates: forget, input, output, and candidate
B) 2 gates: reset gate r_t and update gate z_t
C) 3 gates: reset, update, and output
D) 1 gate: the update gate

Answer & Explanation: **B**. The GRU has exactly two gates: r_t (reset gate) and z_t (update gate). The LSTM, by comparison, has three gates (forget, input, output) plus a candidate computation, giving four weight blocks. The update gate combines the roles of LSTM's forget and input gates into a single interpolation mechanism.

Q2: What creates the gradient highway in a GRU?

A) The tanh activation in the candidate computation
B) The term (1 − z_t) ⊙ h_{t−1} in the hidden state update
C) The reset gate multiplying h_{t−1}
D) The concatenation of h_{t−1} and x_t

Answer & Explanation: **B**. h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t. The term (1 − z_t) ⊙ h_{t−1} gives ∂h_t/∂h_{t−1} containing diag(1 − z_t), an additive term without weight matrix multiplication. When z_t ≈ 0, this is near-identity.

Q3: What happens when the reset gate r_t β‰ˆ 0?

A) The hidden state is copied verbatim from the previous step
B) The candidate h̃_t is computed as if h_{t−1} were all zeros: the GRU "resets"
C) The update gate is forced to 0
D) The GRU switches to LSTM mode

Answer & Explanation: **B**. h̃_t = tanh(W · [r_t ⊙ h_{t−1} ∥ x_t] + b). When r_t ≈ 0, r_t ⊙ h_{t−1} ≈ 0, so the candidate depends only on x_t. Option A describes the z_t ≈ 0 behavior.

Q4: How many recurrent weight matrices does a GRU have versus an LSTM?

A) GRU: 2, LSTM: 4
B) GRU: 3, LSTM: 4
C) GRU: 4, LSTM: 4
D) GRU: 2, LSTM: 2

Answer & Explanation: **B**. GRU has W_r, W_z, W (3 matrices). LSTM has W_f, W_i, W_C, W_o (4 matrices). GRU saves 25% on recurrent parameters (3d(d+d_in) vs 4d(d+d_in)).

Q5: What is a limitation of the GRU compared to the LSTM?

A) The GRU cannot handle sequences longer than 10 steps
B) The GRU cannot separately control what to remember and what to output, since it has a single hidden state
C) The GRU cannot use sigmoid gates
D) The GRU requires more memory than the LSTM

Answer & Explanation: **B**. LSTM has separate c_t (memory) and h_t = o_t ⊙ tanh(c_t) (output). If o_t ≈ 0, information is stored in c_t without exposure. GRU's single state h_t serves as both memory and output, so everything remembered is always exposed.

Practice Problems

Problem 1

Write the GRU update equation and identify which part creates the gradient highway.

Answer: h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t. The term (1−z_t)⊙h_{t-1} creates the gradient highway: it's additive and doesn't multiply h_{t-1} by a weight matrix. ∂h_t/∂h_{t-1} includes diag(1−z_t) from this path.

Problem 2

How many weight matrices does a GRU have, and what are they?

Answer: Three: W_r (reset gate), W_z (update gate), and W (candidate hidden state). Each is d × (d+d_in).

Problem 3

What happens when z_t = 0 for all t?

Answer: h_t = 1·h_{t-1} + 0·h̃_t = h_{t-1}. The hidden state becomes constant (h_t = h_0 for all t). The GRU ignores all inputs: perfect memory, but no new information is ever written into the state.

Problem 4

Derive ∂h_t/∂r_t (gradient of hidden state w.r.t. reset gate).

Answer: h_t depends on r_t only through h̃_t = tanh(W·[r_t⊙h_{t-1} ∥ x_t] + b), and h̃_t enters h_t with weight z_t. By the chain rule: ∂h_t/∂r_t = diag(z_t) · ∂h̃_t/∂r_t = diag(z_t) · diag(1 − h̃_t²) · W_{:, :d} · diag(h_{t-1}), where 1 − h̃_t² is tanh' evaluated at the pre-activation and W_{:, :d} denotes the first d columns of W (the ones that multiply r_t⊙h_{t-1}). Entry (i, j) is z_{t,i} · (1 − h̃_{t,i}²) · W_{ij} · h_{t-1,j}.
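A derivative like this is easy to get subtly wrong, so a finite-difference check is worthwhile. The sketch below compares the entry-wise form J[i][j] = z_i · (1 − h̃_i²) · W[i][j] · h_{t-1,j} with a numerical estimate. It treats r_t and z_t as free inputs rather than computing them from weights, and all names and values are assumptions chosen for illustration.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def h_from_r(r, h_prev, x, z, W):
    """h_t as a function of r_t, with z_t held fixed (zero bias for brevity)."""
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
    h_cand = [math.tanh(a) for a in matvec(W, gated)]
    h = [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]
    return h, h_cand

def jacobian_analytic(r, h_prev, x, z, W):
    """J[i][j] = z_i * (1 - h_cand_i^2) * W[i][j] * h_prev[j] for j < d."""
    _, h_cand = h_from_r(r, h_prev, x, z, W)
    d = len(h_prev)
    return [[z[i] * (1.0 - h_cand[i] ** 2) * W[i][j] * h_prev[j] for j in range(d)]
            for i in range(d)]

def jacobian_fd(r, h_prev, x, z, W, eps=1e-6):
    """Central finite differences with respect to r_t."""
    d = len(r)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        up, dn = list(r), list(r)
        up[j] += eps
        dn[j] -= eps
        h_up, _ = h_from_r(up, h_prev, x, z, W)
        h_dn, _ = h_from_r(dn, h_prev, x, z, W)
        for i in range(d):
            J[i][j] = (h_up[i] - h_dn[i]) / (2 * eps)
    return J

W = [[0.7, -0.5, 0.2], [-0.1, 0.3, 0.8]]
r, z = [0.6, 0.3], [0.25, 0.75]
h_prev, x = [0.5, -0.3], [1.0]

J_a = jacobian_analytic(r, h_prev, x, z, W)
J_n = jacobian_fd(r, h_prev, x, z, W)
max_err = max(abs(a - n) for ra, rn in zip(J_a, J_n) for a, n in zip(ra, rn))
print(max_err)
```

A maximum discrepancy near floating-point noise indicates that the analytic expression and the numerical estimate agree for these values.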

Problem 5

Compare GRU and LSTM: which can "remember without revealing"? Explain mathematically.

Answer: LSTM can, GRU can't. LSTM has separate c_t (memory) and h_t = o_t⊙tanh(c_t) (output). If o_t ≈ 0, h_t ≈ 0 regardless of c_t, so the LSTM hides its memory. GRU has one state h_t that serves as both memory and output. If h_t ≈ 0, the memory is lost. If h_t is nonzero, the information is exposed.

Summary

  1. GRU simplifies LSTM with 2 gates instead of 3 (and 3 weight blocks instead of 4): update gate z_t (interpolation weight) and reset gate r_t (controls history influence on candidate)
  2. The update h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t is a linear interpolation creating a gradient highway: ∂h_t/∂h_{t-1} contains diag(1−z_t)
  3. When z_t ≈ 0, gradients flow freely (near-identity Jacobian); when z_t ≈ 1, the GRU fully updates
  4. GRU has 25% fewer recurrent parameters than LSTM (3d(d+d_in) vs 4d(d+d_in))
  5. The single-state design means GRU can't "remember without revealing" like LSTM can with its output gate

Pitfalls


Next Steps

Continue to 17-05 (Residual Connections) to learn about skip connections and how they enable very deep networks.