17-04 — GRU Mathematics

Phase: 17 — Deep Learning Architectures (Math)
Subject: 17-04
Prerequisites: 17-02 (RNNs), 17-03 (LSTM, for gating concepts), 16-02 (Activation Functions)
Next subject: 17-05 — Residual Connections

Learning Objectives

By the end of this subject, you will be able to:

Derive the full GRU equations from first principles and contrast with LSTM
Explain how the update gate z_t performs the role of BOTH forget and input gates
Derive the gradient ∂h_t/∂h_{t-1} for the GRU and show how it avoids vanishing gradients
Quantify the parameter reduction of GRU vs. LSTM (3 weight blocks instead of 4 = 25% fewer recurrent params)
Identify when the GRU's simplified gating fails compared to LSTM

Core Content

1. GRU Philosophy: Do More With Less

The GRU (Gated Recurrent Unit) asks: do we really need a separate cell state? The answer: not necessarily, if we design the gating cleverly.

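The GRU uses only TWO gates (vs. the LSTM's three) and a SINGLE hidden state (no separate cell state). Counting the candidate transformation, that is three weight blocks instead of the LSTM's four. It achieves comparable performance with fewer parameters.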

2. GRU Equations

For input x_t and previous hidden state h_{t-1}:

Reset gate r_t: Controls how much of the previous hidden state to "forget" when computing the candidate

r_t = σ(W_r · [h_{t-1} ∥ x_t] + b_r)

Update gate z_t: Controls the interpolation between old state and new candidate (combines LSTM's forget + input gates)

z_t = σ(W_z · [h_{t-1} ∥ x_t] + b_z)

Candidate hidden state h̃_t: The proposed new state, computed with gated access to h_{t-1}

h̃_t = tanh(W · [r_t ⊙ h_{t-1} ∥ x_t] + b)

Hidden state update h_t: Interpolation between old and candidate

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

Where:
- All gates are in (0,1) via sigmoid
- ⊙ = element-wise product
- [a ∥ b] = concatenation
- Each of W_r, W_z, W ∈ ℝ^(d × (d+d_in)); each of b_r, b_z, b ∈ ℝ^d

⚠️ THIS IS CRITICAL — The update h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t is an element-wise LINEAR INTERPOLATION between old and new. When z_t ≈ 0, the state is copied verbatim (gradient highway). When z_t ≈ 1, the state is fully updated. This is like the LSTM cell state update, but using the SAME state for both memory and output.
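The four equations above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`sigmoid`, `matvec`, `gru_step`) and the demo weights are made up here, and vectors are plain lists. It follows this section's convention, where z_t weights the candidate; some libraries define the update gate with the opposite sense, so compare the equations before mapping this onto a framework.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(h_prev, x, W_r, b_r, W_z, b_z, W, b):
    """One GRU step using h_t = (1 - z_t) * h_prev + z_t * h_cand."""
    hx = h_prev + x  # list concatenation: [h_{t-1} || x_t]
    r = [sigmoid(a + bi) for a, bi in zip(matvec(W_r, hx), b_r)]
    z = [sigmoid(a + bi) for a, bi in zip(matvec(W_z, hx), b_z)]
    # The reset gate scales h_{t-1} before it enters the candidate.
    rhx = [ri * hi for ri, hi in zip(r, h_prev)] + x
    h_cand = [math.tanh(a + bi) for a, bi in zip(matvec(W, rhx), b)]
    # Element-wise interpolation between old state and candidate.
    h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]
    return h, r, z, h_cand


# Illustrative call with made-up weights (d = 2, d_in = 1).
h, r, z, h_cand = gru_step(
    h_prev=[0.1, -0.2], x=[0.5],
    W_r=[[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]], b_r=[0.0, 0.0],
    W_z=[[0.3, 0.1, -0.2], [0.2, 0.2, 0.1]], b_z=[0.0, 0.0],
    W=[[0.5, -0.4, 0.1], [0.1, 0.3, -0.2]], b=[0.0, 0.0],
)
print(h)
```

Returning the gates and the candidate alongside h_t is a convenience for inspection; a training implementation would typically keep them only for the backward pass.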

3. The Two-Gate Design

Reset gate r_t: When r_t ≈ 0, the candidate h̃_t is computed as if h_{t-1} were all zeros: the GRU "resets" and treats the current input as the start of a new sequence. When r_t ≈ 1, the candidate considers the full history.

Update gate z_t: When z_t ≈ 0, h_t ≈ h_{t-1} — perfect memory, no gradient decay. When z_t ≈ 1, h_t ≈ h̃_t — full update using the candidate.

This is elegant: one gate (r) controls what to forget when computing the candidate; the other (z) controls the final interpolation.
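The limiting cases of the update gate are easy to see numerically. The sketch below supplies z_t by hand instead of computing it from weights; the function name `interpolate` and the state values are illustrative assumptions.

```python
def interpolate(h_prev, h_cand, z):
    """h_t = (1 - z) * h_prev + z * h_cand, element-wise."""
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


h_prev = [0.9, -0.4]  # made-up previous state
h_cand = [0.1, 0.7]   # made-up candidate

keep = interpolate(h_prev, h_cand, [0.0, 0.0])     # z = 0: old state is copied
replace = interpolate(h_prev, h_cand, [1.0, 1.0])  # z = 1: candidate takes over
mixed = interpolate(h_prev, h_cand, [0.0, 1.0])    # the choice is per dimension

print(keep, replace, mixed)
```

Because the retain weight (1 − z) and the write weight z always sum to 1 in each dimension, the GRU can keep one coordinate while overwriting another, but it cannot both fully keep and fully write the same coordinate.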

4. Gradient Flow in GRU

Let's analyze ∂h_t/∂h_{t-1}. From the update equation:

h_t = (1 − z_t)⊙h_{t-1} + z_t⊙h̃_t

The direct dependency through the first term:

∂h_t/∂h_{t-1} = diag(1 − z_t) + (terms from ∂z_t/∂h_{t-1} and ∂h̃_t/∂h_{t-1})

The critical observation: diag(1 − z_t) is a DIAGONAL matrix! Just like the LSTM, the GRU has an additive update that creates a gradient highway.

When z_t ≈ 0: ∂h_t/∂h_{t-1} ≈ I (identity: perfect gradient flow)
When z_t ≈ 1: ∂h_t/∂h_{t-1} ≈ ∂h̃_t/∂h_{t-1} (gradient flows through the candidate computation)

Long-range gradient:

∂h_t/∂h_{t-k} (direct path) ≈ ∏_{j=0}^{k-1} diag(1 − z_{t-j})

If z values stay small (the GRU keeps old information), the product stays near I.

But there's a subtlety: Because the GRU uses the SAME state for everything, the gradient also flows through the gate computations (∂z_t/∂h_{t-1} and ∂r_t/∂h_{t-1}, which DO involve W_z and W_r). These paths can still suffer from vanishing gradients. However, the dominant path is the additive one.
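Spelling out the chain rule makes the extra paths explicit. In the block below, W_z^{(h)}, W_r^{(h)} and W^{(h)} denote the first d columns of W_z, W_r and W (the columns that multiply h_{t-1}); this superscript notation is introduced here for illustration and does not appear elsewhere in this subject.

```latex
\begin{aligned}
\frac{\partial h_t}{\partial h_{t-1}}
  &= \operatorname{diag}(1 - z_t)
   + \operatorname{diag}(\tilde h_t - h_{t-1})\,\frac{\partial z_t}{\partial h_{t-1}}
   + \operatorname{diag}(z_t)\,\frac{\partial \tilde h_t}{\partial h_{t-1}}, \\
\frac{\partial z_t}{\partial h_{t-1}}
  &= \operatorname{diag}\big(z_t \odot (1 - z_t)\big)\, W_z^{(h)}, \\
\frac{\partial r_t}{\partial h_{t-1}}
  &= \operatorname{diag}\big(r_t \odot (1 - r_t)\big)\, W_r^{(h)}, \\
\frac{\partial \tilde h_t}{\partial h_{t-1}}
  &= \operatorname{diag}\big(1 - \tilde h_t^{\,2}\big)\, W^{(h)}
     \Big[\operatorname{diag}(r_t)
        + \operatorname{diag}(h_{t-1})\,\frac{\partial r_t}{\partial h_{t-1}}\Big].
\end{aligned}
```

Only the first term, diag(1 − z_t), is free of weight matrices. Every other term carries a weight matrix together with a saturating derivative factor (z(1 − z) ≤ 1/4 for the sigmoid, 1 − h̃² ≤ 1 for tanh), which is why those paths can still shrink over many steps.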

5. GRU vs. LSTM: Mathematical Comparison

Aspect	LSTM	GRU
States	c_t (cell), h_t (hidden)	h_t only
Gates	f, i, o (3 gates; 4 weight blocks with the candidate)	r, z (2 gates; 3 weight blocks with the candidate)
Memory update	c_t = f⊙c_{t-1} + i⊙C̃	h_t = (1−z)⊙h_{t-1} + z⊙h̃
Output	h_t = o⊙tanh(c_t)	h_t already carries output
Gradient highway	Through c_t (diag(f))	Through h_t (diag(1−z))
Recurrent weights (biases excluded)	4d(d+d_in)	3d(d+d_in)

GRU advantage: 25% fewer recurrent parameters (3 weight matrices vs. 4).

LSTM advantage: The separate cell state means the gradient highway (c_t) is decoupled from the output gating (o_t). The GRU forces the same state to serve both purposes.

When GRU may fail: On tasks that require remembering information without outputting it, the LSTM can close its output gate (o_t ≈ 0) and keep the information in c_t without revealing it in h_t. The GRU can't do this because h_t is both memory and output.

6. Parameter Count Example

For d=512, d_in=256:
- LSTM: 4 × 512 × 768 = 1,572,864 recurrent parameters
- GRU: 3 × 512 × 768 = 1,179,648 recurrent parameters

GRU saves ~393K parameters, or 25%.
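The same count can be reproduced for any layer size. The helper below is a small sketch (the name `recurrent_params` is an assumption); like the figures above it counts only the weight matrices of shape d × (d + d_in) and leaves out bias vectors.

```python
def recurrent_params(d, d_in, n_blocks):
    """Weights in n_blocks matrices of shape d x (d + d_in); biases excluded."""
    return n_blocks * d * (d + d_in)


lstm = recurrent_params(512, 256, n_blocks=4)  # f, i, o gates + candidate
gru = recurrent_params(512, 256, n_blocks=3)   # r, z gates + candidate
saved = lstm - gru

print(lstm, gru, saved, saved / lstm)  # 1572864 1179648 393216 0.25
```

The 25% ratio does not depend on d or d_in, because both counts share the factor d(d + d_in).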

7. The Reset Gate's Role in Gradient Flow

The reset gate affects gradient flow through the candidate computation:

h̃_t = tanh(W · [r_t⊙h_{t-1} ∥ x_t] + b)

When r_t ≈ 0: h̃_t doesn't depend on h_{t-1} at all, so the gradient through the candidate path is zero. The only gradient flow is through (1−z_t)⊙h_{t-1}.

When r_t ≈ 1: h̃_t depends fully on h_{t-1}, and gradients flow through both the additive path AND the candidate path.

This gives the GRU fine-grained control: even when updating (z_t large), it can choose whether the new state should depend on history (r_t large) or be a fresh start (r_t small).
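To see the reset gate cut the candidate off from history, the sketch below feeds r_t in by hand and evaluates the candidate for two different previous states. The function name `candidate` is an assumption, and the weights are the illustrative W reused from Example 1.

```python
import math


def candidate(W, b, r, h_prev, x):
    """h_cand = tanh(W . [r * h_prev || x] + b), with the gate r supplied directly."""
    v = [ri * hi for ri, hi in zip(r, h_prev)] + x
    return [math.tanh(sum(w * vi for w, vi in zip(row, v)) + bi)
            for row, bi in zip(W, b)]


W = [[0.2, 0.3, 0.1], [0.3, 0.2, 0.1]]
b = [0.0, 0.0]
x = [1.0]

# Reset gate closed: two very different histories give the same candidate.
closed_a = candidate(W, b, [0.0, 0.0], [0.5, 0.5], x)
closed_b = candidate(W, b, [0.0, 0.0], [-0.9, 0.3], x)

# Reset gate open: the candidate now depends on the history.
open_a = candidate(W, b, [1.0, 1.0], [0.5, 0.5], x)
open_b = candidate(W, b, [1.0, 1.0], [-0.9, 0.3], x)

print(closed_a, closed_b)
print(open_a, open_b)
```

With the gate closed, both candidates reduce to the tanh of the input contribution alone, so no gradient can reach h_{t-1} through this path.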

8. Empirical Notes

In practice, GRU and LSTM perform similarly on most tasks. GRU often wins on smaller datasets (fewer parameters = less overfitting). LSTM sometimes wins on very long sequences (separate cell state provides cleaner gradient highway). For modern applications, both have been largely superseded by Transformers, but understanding their gating mechanisms is essential — the same principles appear in attention gating, highway networks, and adaptive computation.

Worked Examples

Example 1: Forward Pass of a 2D GRU

Problem: A GRU with d=2, d_in=1. W_r = [[0.5,0.1,0.1],[0.1,0.5,0.1]], W_z = same, W = [[0.2,0.3,0.1],[0.3,0.2,0.1]]. All biases zero. h_{t-1}=[0.5, 0.5], x_t=1. Compute r_t, z_t, h̃_t, h_t.

Solution: [h_{t-1} ∥ x_t] = [0.5, 0.5, 1]

r_preact = W_r · [0.5, 0.5, 1] = [0.25+0.05+0.1, 0.05+0.25+0.1] = [0.40, 0.40]
r_t = [σ(0.40), σ(0.40)] = [0.5987, 0.5987]

z_preact = same = [0.40, 0.40]
z_t = [0.5987, 0.5987]

r_t⊙h_{t-1} = [0.2993, 0.2993]
h̃_preact = W · [0.2993, 0.2993, 1] = [0.0599+0.0898+0.1, 0.0898+0.0599+0.1] = [0.2497, 0.2497]
h̃_t = [tanh(0.2497), tanh(0.2497)] = [0.2446, 0.2446]

1 − z_t = [0.4013, 0.4013]
h_t = (1 − z_t)⊙h_{t-1} + z_t⊙h̃_t = [0.4013·0.5 + 0.5987·0.2446, 0.4013·0.5 + 0.5987·0.2446] = [0.2007+0.1464, 0.2007+0.1464] = [0.3471, 0.3471]

Example 2: Gradient Through the Additive Path

Problem: If ∂L/∂h_t = [1, 0] at time t and z_t = [0.2, 0.8], what is the direct contribution to ∂L/∂h_{t-1}?

Solution: The direct path: h_t = (1−z_t)⊙h_{t-1} + ... So ∂h_t/∂h_{t-1} (direct) = diag(1−z_t) = diag([0.8, 0.2])

∂L/∂h_{t-1} (direct) = diag(1−z_t)ᵀ · [1,0] = [0.8·1, 0.2·0] = [0.8, 0]

The first dimension preserves 80% of the gradient; the second dimension (which was heavily updated) only preserves 20%.
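Since diag(1 − z_t) is diagonal, backpropagating along the direct path is just an element-wise scaling. A minimal sketch (the function name `backprop_direct` is an assumption) that covers only this one path and ignores the gate and candidate terms:

```python
def backprop_direct(grad_h, z):
    """Direct-path gradient diag(1 - z)^T @ grad_h, computed element-wise."""
    return [(1 - zi) * gi for zi, gi in zip(z, grad_h)]


z_t = [0.2, 0.8]

# The upstream gradient from the problem statement.
print(backprop_direct([1.0, 0.0], z_t))

# A gradient with weight on both dimensions shows the 80% / 20% split.
print(backprop_direct([1.0, 1.0], z_t))
```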

Example 3: Long-Sequence Gradient Decay

Problem: A GRU dimension consistently has z_t = 0.1 for 100 time steps (rarely updated — strong memory). What fraction of gradient survives the direct path?

Solution: ∏ (1−0.1) = 0.9^100 ≈ 2.66 × 10^{-5}

This is surprisingly small! Even with z=0.1 (90% preserved each step), after 100 steps only 0.003% survives. This shows that even gated architectures aren't magic — very long sequences still challenge them, which is one reason Transformers with direct attention won.
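The decay is easy to explore for other gate values and horizons. The sketch below assumes a constant z on the direct path only; both function names are illustrative. The second helper inverts the formula to ask how small z must stay for a given fraction of the gradient to survive.

```python
def survival(z, steps):
    """Fraction of gradient left on the direct path for a constant gate z."""
    return (1 - z) ** steps


def max_gate_for(fraction, steps):
    """Largest constant z for which at least `fraction` survives `steps` steps."""
    return 1 - fraction ** (1 / steps)


print(survival(0.1, 100))       # the case worked above, about 2.66e-5
print(max_gate_for(0.5, 100))   # z needed to keep half the gradient
```

Keeping half the gradient over 100 steps requires z below roughly 0.007, which means the dimension is almost never written to.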

Quiz

Q1: How many gates does a GRU have, and what are they?

A) 4 gates: forget, input, output, and candidate
B) 2 gates: reset gate r_t and update gate z_t
C) 3 gates: reset, update, and output
D) 1 gate: the update gate

Answer & Explanation

**B** — The GRU has exactly two gates: r_t (reset gate) and z_t (update gate). The LSTM has three gates (forget, input, output) plus a candidate transformation, which is not itself a gate. The update gate combines the roles of LSTM's forget and input gates into a single interpolation mechanism.

Q2: What creates the gradient highway in a GRU?

A) The tanh activation in the candidate computation
B) The term (1 − z_t) ⊙ h_{t−1} in the hidden state update
C) The reset gate multiplying h_{t−1}
D) The concatenation of h_{t−1} and x_t

Answer & Explanation

**B** — h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t. The term (1 − z_t) ⊙ h_{t−1} gives ∂h_t/∂h_{t−1} containing diag(1 − z_t) — an additive term without weight matrix multiplication. When z_t ≈ 0, this is near-identity.

Q3: What happens when the reset gate r_t ≈ 0?

A) The hidden state is copied verbatim from the previous step
B) The candidate h̃_t is computed as if h_{t−1} were all zeros; the GRU "resets"
C) The update gate is forced to 0
D) The GRU switches to LSTM mode

Answer & Explanation

**B** — h̃_t = tanh(W · [r_t ⊙ h_{t−1} ∥ x_t] + b). When r_t ≈ 0, r_t ⊙ h_{t−1} ≈ 0, so the candidate depends only on x_t. A describes z_t ≈ 0 behavior.

Q4: How many recurrent weight matrices does a GRU have versus an LSTM?

A) GRU: 2, LSTM: 4
B) GRU: 3, LSTM: 4
C) GRU: 4, LSTM: 4
D) GRU: 2, LSTM: 2

Answer & Explanation

**B** — GRU has W_r, W_z, W (3 matrices). LSTM has W_f, W_i, W_C, W_o (4 matrices). GRU saves ~25% on recurrent parameters (3d(d+d_in) vs 4d(d+d_in)).

Q5: What is a limitation of the GRU compared to the LSTM?

A) The GRU cannot handle sequences longer than 10 steps
B) The GRU cannot separately control what to remember and what to output, since it has a single hidden state
C) The GRU cannot use sigmoid gates
D) The GRU requires more memory than the LSTM

Answer & Explanation

**B** — LSTM has separate c_t (memory) and h_t = o_t ⊙ tanh(c_t) (output). If o_t ≈ 0, information is stored in c_t without exposure. GRU's single state h_t serves as both memory and output — everything remembered is always exposed.

Practice Problems

Problem 1

Write the GRU update equation and identify which part creates the gradient highway.

Answer

h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t. The term (1−z_t)⊙h_{t-1} creates the gradient highway — it's additive and doesn't multiply h_{t-1} by a weight matrix. ∂h_t/∂h_{t-1} includes diag(1−z_t) from this path.

Problem 2

How many weight matrices does a GRU have, and what are they?

Answer

Three: W_r (reset gate), W_z (update gate), and W (candidate hidden state). Each is d × (d+d_in).

Problem 3

What happens when z_t = 0 for all t?

Answer

h_t = 1·h_{t-1} + 0·h̃_t = h_{t-1}. The hidden state becomes constant (h_t = h_0 for all t). The GRU ignores all inputs: perfect memory, but no new information is ever written to the state.

Problem 4

Derive ∂h_t/∂r_t (gradient of hidden state w.r.t reset gate).

Answer

h_t depends on r_t only through h̃_t = tanh(a_t), where a_t = W[r_t⊙h_{t-1} ∥ x_t] + b. Treating r_t as an independent variable, the chain rule gives the Jacobian ∂h_t/∂r_t = diag(z_t) · diag(1 − h̃_t²) · W_{:, :d} · diag(h_{t-1}), where W_{:, :d} are the columns of W that multiply r_t⊙h_{t-1} and 1 − h̃_t² = tanh'(a_t) element-wise. Entry (i, j) is z_{t,i} · (1 − h̃_{t,i}²) · W_{ij} · h_{t-1,j}.
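A quick way to gain confidence in a hand-derived Jacobian is to compare it with central finite differences. The sketch below checks the entry-wise form J[i][j] = z_i · (1 − h̃_i²) · W[i][j] · h_prev[j], treating r_t as a free input with z_t, h_{t-1} and x_t held fixed. All function names and numeric values are illustrative assumptions.

```python
import math


def h_from_r(r, z, h_prev, x, W, b):
    """Return (h_t, h_cand) as a function of the reset gate r."""
    v = [ri * hi for ri, hi in zip(r, h_prev)] + x
    cand = [math.tanh(sum(w * vi for w, vi in zip(row, v)) + bi)
            for row, bi in zip(W, b)]
    h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]
    return h, cand


def analytic_jacobian(r, z, h_prev, x, W, b):
    """J[i][j] = z_i * (1 - cand_i^2) * W[i][j] * h_prev[j]."""
    _, cand = h_from_r(r, z, h_prev, x, W, b)
    d = len(h_prev)
    return [[z[i] * (1 - cand[i] ** 2) * W[i][j] * h_prev[j] for j in range(d)]
            for i in range(d)]


def numeric_jacobian(r, z, h_prev, x, W, b, eps=1e-6):
    """Central finite differences, perturbing one component of r at a time."""
    d = len(h_prev)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        r_hi = list(r)
        r_lo = list(r)
        r_hi[j] += eps
        r_lo[j] -= eps
        h_hi, _ = h_from_r(r_hi, z, h_prev, x, W, b)
        h_lo, _ = h_from_r(r_lo, z, h_prev, x, W, b)
        for i in range(d):
            J[i][j] = (h_hi[i] - h_lo[i]) / (2 * eps)
    return J


# Made-up values for d = 2, d_in = 1.
args = ([0.6, 0.3], [0.7, 0.2], [0.5, -0.4], [1.0],
        [[0.2, 0.3, 0.1], [0.3, 0.2, 0.1]], [0.0, 0.0])
J_a = analytic_jacobian(*args)
J_n = numeric_jacobian(*args)
max_err = max(abs(J_a[i][j] - J_n[i][j]) for i in range(2) for j in range(2))
print(max_err)
```

Unlike diag(1 − z_t), this Jacobian is generally a full matrix: each component of r_t can influence every component of h_t through W.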

Problem 5

Compare GRU and LSTM: which can "remember without revealing"? Explain mathematically.

Answer

LSTM can, GRU can't. LSTM has separate c_t (memory) and h_t = o_t⊙tanh(c_t) (output). If o_t ≈ 0, h_t ≈ 0 regardless of c_t — the LSTM hides its memory. GRU has one state h_t that serves as both memory and output. If h_t ≈ 0, the memory is lost. If h_t is nonzero, the information is exposed.

Summary

GRU simplifies LSTM with 2 gates instead of 3: update gate z_t (interpolation weight) and reset gate r_t (controls history influence on candidate)
The update h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t is a linear interpolation creating a gradient highway: ∂h_t/∂h_{t-1} contains diag(1−z_t)
When z_t ≈ 0, gradients flow freely (near-identity Jacobian); when z_t ≈ 1, the GRU fully updates
GRU has 25% fewer recurrent parameters than LSTM (3d(d+d_in) vs 4d(d+d_in))
The single-state design means GRU can't "remember without revealing" like LSTM can with its output gate

Pitfalls

Assuming GRU always matches LSTM performance. While GRU and LSTM perform similarly on most benchmarks, GRU can underperform on tasks requiring the "remember without revealing" capability — LSTM's output gate o_t lets it store information in c_t while outputting h_t ≈ 0. The GRU's single state h_t cannot separately control memory retention and information exposure.
Misunderstanding the coupling between z_t and (1 − z_t). Since the update gate z_t controls both how much to retain (1 − z_t) and how much to update (z_t) through a single sigmoid output, you cannot independently set "keep 90%" and "add 50%." This coupling is the GRU's fundamental architectural constraint relative to LSTM.
Expecting perfect gradient flow over very long sequences (1000+ steps). Even with z_t = 0.1 (90% preserved per step), gradient survival is 0.9^1000 ≈ 10^−46. Additive gating helps enormously but doesn't create infinite memory — Transformers with direct attention ultimately won for very long-range dependencies.
Confusing the reset gate r_t with the update gate z_t. r_t only affects the candidate computation h̃_t — when r_t ≈ 0, the candidate ignores history entirely. z_t controls the final interpolation between old state and candidate. A common debugging mistake: expecting r_t to control the output when it only gates information flow into h̃_t.
Treating GRU as strictly "simpler than LSTM." While GRU has fewer gates (2 vs. 3) and fewer weight matrices (3 vs. 4), the single-state design means all gradient paths funnel through one state vector. The gradient dynamics through gate computations are actually more entangled than in LSTM, where the cell state and hidden state provide partially decoupled paths.

Next Steps

Continue to 17-05 — Residual Connections to learn about skip connections and how they enable very deep networks.
