17-03 - LSTM Mathematics

Phase: 17 - Deep Learning Architectures (Math)
Subject: 17-03
Prerequisites: 17-02 (RNNs and BPTT), 16-02 (Activation Functions: sigmoid, tanh), 16-05 (Backpropagation)
Next subject: 17-04 - GRU Mathematics


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the full LSTM gate equations and explain the role of each gate in controlling information flow
  2. Explain mathematically how the cell state's additive update prevents vanishing gradients
  3. Compute the gradient of the loss with respect to the cell state and show why it doesn't vanish exponentially
  4. Compare LSTM gradient flow to vanilla RNN gradient flow analytically
  5. Identify failure modes: gate saturation and unbounded cell-state growth

Core Content

1. The Problem LSTMs Solve

Recall from 17-02: the vanilla RNN gradient path involves repeated multiplication by Jacobians built from the same weight matrix:

∂L/∂h₁ = ... ∏ diag(tanh')·W_hh

The key insight: if the state transition were the identity (tanh derivative = 1 everywhere and no repeated multiplication by W_hh), the gradient wouldn't vanish. But we need non-linearity for expressive power.

LSTM solution: Use gating to create a "gradient highway", an additive cell state update with identity-like connections that let gradients flow largely unimpeded.

2. LSTM Architecture and Equations

The LSTM maintains TWO states:
- Cell state c_t ∈ ℝ^d: the "long-term memory" (gradient highway)
- Hidden state h_t ∈ ℝ^d: the "working memory" (output)

At each time step, given input x_t and previous states (h_{t-1}, c_{t-1}):

Forget gate f_t: Controls what to REMOVE from previous cell state

f_t = σ(W_f · [h_{t-1} ∥ x_t] + b_f)

Input gate i_t: Controls what NEW information to ADD

i_t = σ(W_i · [h_{t-1} ∥ x_t] + b_i)

Candidate cell state C̃_t: The new information we MIGHT add

C̃_t = tanh(W_C · [h_{t-1} ∥ x_t] + b_C)

Cell state update c_t: COMBINE forget + add (THE KEY EQUATION)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t

Output gate o_t: Controls what to EXPOSE from cell state

o_t = σ(W_o · [h_{t-1} ∥ x_t] + b_o)

Hidden state h_t: Filtered version of cell state

h_t = o_t ⊙ tanh(c_t)

Where:
- σ = sigmoid, outputting values in (0,1) for gating
- ⊙ = element-wise (Hadamard) product
- [h_{t-1} ∥ x_t] = concatenation, dimension d + d_in
- Each weight matrix W_f, W_i, W_C, W_o ∈ ℝ^(d × (d+d_in)); each bias b_f, b_i, b_C, b_o ∈ ℝ^d

⚠️ THIS IS CRITICAL: The cell state update c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t is ADDITIVE. This creates a direct path where c_{t-1} flows into c_t without being multiplied by a weight matrix. Along this direct path, ∂c_t/∂c_{t-1} = diag(f_t), whose entries lie in (0,1) and which doesn't involve W at all.
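The gate equations above can be sketched as a single forward step in NumPy. This is a minimal illustration rather than a reference implementation: the names lstm_step and init_params, the dictionary layout of the parameters, and the small random initialization are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns (h_t, c_t, gate values)."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1} ∥ x_t], shape (d + d_in,)
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                     # additive cell state update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                               # exposed hidden state
    return h_t, c_t, (f_t, i_t, c_tilde, o_t)

def init_params(d, d_in, seed=0):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "C", "o"):
        params[f"W_{name}"] = rng.normal(0.0, 0.1, size=(d, d + d_in))
        params[f"b_{name}"] = np.zeros(d)
    return params

d, d_in = 4, 3
params = init_params(d, d_in)
h, c = np.zeros(d), np.zeros(d)
x = np.ones(d_in)
h, c, gates = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```

Note that c_prev enters c_t only through the element-wise product with f_t; no weight matrix touches it on that path.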

3. Why LSTMs Solve Vanishing Gradients

Let's trace the gradient through the cell state. The loss L depends on c_t through h_t (writing δ_t^h = ∂L/∂h_t):

∂L/∂c_t = ∂L/∂h_t · ∂h_t/∂c_t = δ_t^h ⊙ (o_t ⊙ tanh'(c_t))

Now, how does c_t depend on c_{t-1}?

∂c_t/∂c_{t-1} = diag(f_t)

This is the magic. The Jacobian ∂c_t/∂c_{t-1} is a DIAGONAL matrix whose diagonal entries are f_t (the forget gate values). It does NOT involve the recurrent weight matrix W at all!

The gradient from c_t back to c_{t-k}:

∂c_t/∂c_{t-k} = ∏_{j=0}^{k-1} diag(f_{t-j})

This is a product of k DIAGONAL matrices, each with entries in (0,1). The eigenvalues of the product are the per-dimension products of forget gate values.

If f_j ≈ 1 (the LSTM "remembers"), then ∏ f ≈ 1 and gradients don't vanish. If f_j ≈ 0 (the LSTM "forgets"), then ∏ f ≈ 0 and gradients vanish, but that's correct behavior, because we WANT to forget.

Contrast with vanilla RNN: ∂h_t/∂h_{t-1} = diag(tanh')·W_hh involves the full matrix W_hh, whose eigenvalues can be systematically below 1 in magnitude, on top of the tanh' ≤ 1 factor.

The LSTM's trick: The recurrent WEIGHTS W_f, W_i, W_C, W_o still suffer from vanishing gradient through the hidden state path (h_{t-1} → gates → c_t). But the CELL STATE itself provides a parallel, additive gradient highway.
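To make the contrast concrete, the following sketch multiplies out both kinds of Jacobian over 50 steps. All numbers are assumptions chosen for illustration: forget gates drawn from [0.95, 1), tanh' values drawn from [0.5, 1), and a W_hh built as 0.9 times an orthogonal matrix so that its spectral norm is exactly 0.9.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 50  # state size and number of time steps (arbitrary choices)

# LSTM cell-state path: product of k diagonal Jacobians diag(f_{t-j}).
# Diagonal matrices multiply element-wise, so vectors are enough.
f = rng.uniform(0.95, 1.0, size=(k, d))   # assumed forget gate values near 1
lstm_factor = np.prod(f, axis=0)          # diagonal of ∂c_t/∂c_{t-k}

# Vanilla RNN path: product of k Jacobians diag(tanh') · W_hh.
# Assumption: W_hh = 0.9 · (orthogonal matrix), so its spectral norm is 0.9.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
W_hh = 0.9 * Q
J = np.eye(d)
for _ in range(k):
    tanh_prime = rng.uniform(0.5, 1.0, size=d)  # assumed tanh' values
    J = np.diag(tanh_prime) @ W_hh @ J

rnn_norm = np.linalg.norm(J, 2)  # largest singular value of the product
print("smallest LSTM cell-path factor:", lstm_factor.min())
print("RNN Jacobian product norm:     ", rnn_norm)
```

Under these assumptions every cell-path factor is at least 0.95^50 ≈ 0.077, while the RNN product norm is bounded above by 0.9^50 ≈ 0.005 and is typically far smaller because of the tanh' factors.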

4. Full Gradient Analysis

The hidden state gradient still involves W_hh-like terms:

∂h_t/∂h_{t-1} = ∂h_t/∂c_t · ∂c_t/∂h_{t-1} + ∂h_t/∂o_t · ∂o_t/∂h_{t-1}

Here ∂c_t/∂h_{t-1} runs through f_t, i_t, and C̃_t, and the second term runs through the output gate, so every term contains a weight matrix.

But the cell state gradient is clean:

∂c_t/∂c_{t-1} = diag(f_t)

And importantly:

∂c_t/∂c_{t-k} = ∏_{j=0}^{k-1} diag(f_{t-j})

This is element-wise: each dimension of the cell state can have its OWN forget rate. The LSTM can learn to remember some features for long periods (f ≈ 1) while forgetting others (f ≈ 0).
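A small sketch of this per-dimension behavior, assuming three cell dimensions whose forget gates stay constant at 0.99, 0.9, and 0.5 (illustrative values, not taken from a trained model):

```python
import numpy as np

f = np.array([0.99, 0.9, 0.5])  # assumed constant forget gate per dimension
k = 20                          # number of steps back in time

# With constant gates, the diagonal of ∂c_t/∂c_{t-k} is simply f ** k.
factor = f ** k
for f_d, g_d in zip(f, factor):
    print(f"f={f_d:.2f}  gradient factor after {k} steps = {g_d:.3e}")
```

The first dimension keeps most of its gradient after 20 steps, while the last one has effectively forgotten, all within the same cell state vector.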

5. Gate Dynamics and Interpretations

Forget gate f_t ≈ 0: The LSTM completely erases the corresponding dimension of cell state. Previous information in that dimension doesn't affect future outputs.

Forget gate f_t ≈ 1: Perfect memory. The cell state dimension passes through unchanged, and the gradient flows freely.

Input gate i_t ≈ 1: The LSTM stores the candidate C̃_t into the cell state.

Output gate o_t ≈ 0: The cell state information is hidden from the output. The LSTM knows it but doesn't use it yet.

Typical initialization: b_f is initialized to a positive value (e.g., 1) so that the LSTM starts with a bias toward remembering (f_t ≈ σ(1) ≈ 0.73 initially, when the weighted input is near zero). This gives the LSTM a "prior" favoring long-term memory.
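The effect of the forget-bias initialization can be checked numerically. The sketch below assumes the weighted input W_f · [h_{t-1} ∥ x_t] is zero at initialization, so the gate value is just σ(b_f); the 20-step horizon is an arbitrary choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for b_f in (0.0, 1.0, 2.0):
    f0 = sigmoid(b_f)   # initial forget gate value when the weighted input is 0
    kept = f0 ** 20     # fraction surviving 20 steps along the cell-state path
    print(f"b_f={b_f:.1f}  f={f0:.3f}  f^20={kept:.2e}")
```

A larger positive bias keeps noticeably more of the cell state (and its gradient) alive over the same horizon.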

6. Parameter Count

Each gate has its own weight matrix of size d × (d + d_in), and so does the candidate. Four weight matrices → 4d(d + d_in) weight parameters (the biases add another 4d).

Example: d=512, d_in=256 → 4·512·768 = 1,572,864 weight parameters per LSTM layer.

Compare to vanilla RNN: d(d + d_in) = 512·768 = 393,216 parameters. LSTM has 4× the parameters.
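The counts above can be reproduced with two small helper functions. The function names and the include_bias flag are hypothetical, introduced only for this sketch.

```python
def lstm_weight_count(d, d_in, include_bias=False):
    """Parameters in the four LSTM blocks (forget, input, candidate, output)."""
    per_block = d * (d + d_in) + (d if include_bias else 0)
    return 4 * per_block

def rnn_weight_count(d, d_in, include_bias=False):
    """Parameters in a vanilla RNN layer with one d × (d + d_in) matrix."""
    return d * (d + d_in) + (d if include_bias else 0)

print(lstm_weight_count(512, 256))                     # 1572864
print(rnn_weight_count(512, 256))                      # 393216
print(lstm_weight_count(512, 256, include_bias=True))  # 1574912
```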

7. Gradient Through Gates

Backprop through the forget gate:

∂L/∂f_t = ∂L/∂c_t ⊙ c_{t-1} (from c_t = f_t ⊙ c_{t-1} + ...)

Then through the sigmoid, to the gate's pre-activation:

∂L/∂(W_f · [h_{t-1} ∥ x_t] + b_f) = ∂L/∂f_t ⊙ f_t ⊙ (1 − f_t)

The sigmoid derivative f_t(1 − f_t) peaks at 0.25 (at f_t = 0.5) and saturates: if f_t ≈ 0 or f_t ≈ 1, the gradient through the gate is tiny. This is a potential issue if the gates get "stuck" at extremes.
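The two backprop formulas for the forget gate can be sanity-checked against finite differences. This sketch makes several assumptions: random values stand in for the pre-activation, c_{t-1}, and the upstream gradient; the term i_t ⊙ C̃_t is held fixed; and the loss is a linear surrogate chosen so that ∂L/∂c_t equals a known vector g_c.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 5
z_f = rng.normal(size=d)        # forget gate pre-activation
c_prev = rng.normal(size=d)     # c_{t-1}
add_term = rng.normal(size=d)   # stands in for i_t ⊙ C̃_t, held fixed
g_c = rng.normal(size=d)        # assumed upstream gradient ∂L/∂c_t

def loss(z):
    c_t = sigmoid(z) * c_prev + add_term
    return float(g_c @ c_t)     # linear surrogate loss, so ∂L/∂c_t = g_c

# Analytic gradients from the formulas in the text
f_t = sigmoid(z_f)
grad_f = g_c * c_prev                 # ∂L/∂f_t
grad_z = grad_f * f_t * (1.0 - f_t)   # ∂L/∂(pre-activation)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(loss(z_f + eps * e) - loss(z_f - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.max(np.abs(grad_z - numeric)))
```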



Key Terms

Worked Examples

Example 1: Tracking a Single Cell State Dimension

Problem: For one dimension of cell state, f = [0.9, 0.5, 0.8], i = [0.1, 0.3, 0.2], C̃ = [1, 2, 3] over three time steps, with c₀ = 0. Compute c₁, c₂, c₃. (The candidate values are chosen for easy arithmetic; in an actual LSTM, C̃_t = tanh(·) lies in (−1, 1).)

Solution:
c₁ = 0.9·0 + 0.1·1 = 0.1
c₂ = 0.5·0.1 + 0.3·2 = 0.05 + 0.6 = 0.65
c₃ = 0.8·0.65 + 0.2·3 = 0.52 + 0.6 = 1.12
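The recurrence in Example 1 can be replayed in a few lines of Python. The variable names are chosen for this sketch; the numbers are the ones given in the problem.

```python
f = [0.9, 0.5, 0.8]        # forget gate values per step
i = [0.1, 0.3, 0.2]        # input gate values per step
c_tilde = [1.0, 2.0, 3.0]  # candidate values per step

c = 0.0                    # c_0
history = []
for f_t, i_t, cand_t in zip(f, i, c_tilde):
    c = f_t * c + i_t * cand_t   # c_t = f_t * c_{t-1} + i_t * C̃_t
    history.append(c)

print([round(v, 2) for v in history])  # [0.1, 0.65, 1.12]
```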

Example 2: Gradient Flow Over Time

Problem: For the cell state from Example 1, if ∂L/∂c₃ = 1, compute ∂L/∂c₁ (ignoring gate dependencies on c₁).

Solution: ∂L/∂c₁ = ∂L/∂c₃ · ∂c₃/∂c₂ · ∂c₂/∂c₁ = 1 · f₃ · f₂ = 1 · 0.8 · 0.5 = 0.40

The gradient is 0.40: attenuated, but not vanished. It passes through only two forget gates (f₂ and f₃, average 0.65), so the attenuation is manageable. Compare to a scalar vanilla RNN with W_hh = 0.5: (0.5)² = 0.25 of the gradient survives from the weight alone, before any further tanh' attenuation.
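As a check on Example 2, the analytic factor f₃·f₂ can be compared with a finite-difference estimate of ∂c₃/∂c₁. The helper c3_from_c1 is a hypothetical name; gates and candidates are treated as constants, matching the "ignoring gate dependencies" assumption in the problem.

```python
f = [0.9, 0.5, 0.8]
i = [0.1, 0.3, 0.2]
c_tilde = [1.0, 2.0, 3.0]

def c3_from_c1(c1):
    """Run steps 2 and 3 of the recurrence starting from a given c_1."""
    c2 = f[1] * c1 + i[1] * c_tilde[1]
    return f[2] * c2 + i[2] * c_tilde[2]

analytic = f[2] * f[1]   # ∂c_3/∂c_1 = f_3 · f_2

eps = 1e-6
numeric = (c3_from_c1(0.1 + eps) - c3_from_c1(0.1 - eps)) / (2 * eps)
print(analytic, numeric)
```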

Example 3: Gate Saturation Analysis

Problem: An LSTM learns such that f_t → 1.0 for all t for some dimension. What happens to: (a) the cell state, (b) the gradient?

Solution:
(a) c_t = 1·c_{t-1} + i_t·C̃_t = c_{t-1} + i_t·C̃_t. The cell state is a running sum, so it grows without bound if i_t·C̃_t is consistently positive.
(b) ∂L/∂c_{t-k} has factor ∏ f = 1^k = 1. Gradients don't vanish. But ∂f/∂(preactivation) = f(1−f) ≈ 0, so the forget gate stops learning! This is the saturation tradeoff: perfect gradient flow through c, but no learning in the gate.
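Both halves of Example 3 can be illustrated numerically. The constant write i_t·C̃_t = 0.5·0.8 and the list of pre-activation values are assumptions picked for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (a) With f_t = 1 the cell state is a running sum of i_t · C̃_t.
i_t, cand_t = 0.5, 0.8   # assumed constant, consistently positive write
c = 0.0
for _ in range(100):
    c = 1.0 * c + i_t * cand_t
print("c after 100 steps:", c)

# (b) The local sigmoid derivative f(1 - f) shrinks as the gate saturates.
for z in (0.0, 2.0, 5.0, 10.0):
    f = sigmoid(z)
    print(f"z={z:4.1f}  f={f:.5f}  f(1-f)={f * (1 - f):.2e}")
```

The cell state grows linearly with the number of steps, while the gate's local derivative drops by several orders of magnitude as its pre-activation increases.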


Quiz

Q1: What does the concept of End-of-Subject Quiz primarily refer to in this subject?

A) A historical anecdote about End-of-Subject Quiz
B) The definition and application of End-of-Subject Quiz
C) A computational error related to End-of-Subject Quiz
D) A visual representation of End-of-Subject Quiz

Correct: B)

Q2: What is the primary purpose of Full Gradient Analysis?

A) It is used only in advanced research contexts
B) It is primarily a historical notation system
C) It is used to analyze the full gradient in mathematical analysis
D) It replaces all other methods in this domain

Correct: C)

Q3: Which statement about Gate Dynamics and Interpretations is TRUE?

A) Gate Dynamics and Interpretations is an advanced topic beyond this subject's scope
B) Gate Dynamics and Interpretations is a fundamental concept covered in this subject
C) Gate Dynamics and Interpretations is not related to this subject
D) Gate Dynamics and Interpretations is mentioned only as a historical footnote

Correct: B)

Q4: Based on Practice Problem 3 in this subject, what is ∂L/∂i_t?

A) The inverse of the correct answer
B) ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values)
C) An unrelated numerical value
D) A different result from a common mistake

Correct: B)

Q5: How are Gate Dynamics and Interpretations and Gradient Through Gates related?

A) Gate Dynamics and Interpretations is the inverse of Gradient Through Gates
B) Gate Dynamics and Interpretations and Gradient Through Gates are closely related concepts
C) Gate Dynamics and Interpretations is a special case of Gradient Through Gates
D) Gate Dynamics and Interpretations and Gradient Through Gates are completely unrelated topics

Correct: B)

Q6: What is a common pitfall when working with LSTM Architecture and Equations?

A) LSTM Architecture and Equations has no common misconceptions
B) The main error with LSTM Architecture and Equations is using it when it is not needed
C) A common mistake is confusing LSTM Architecture and Equations with a similar concept
D) LSTM Architecture and Equations is always computed the same way in all contexts

Correct: C)

Q7: When should you apply Parameter Count?

A) Avoid Parameter Count unless explicitly instructed
B) Parameter Count is not practically useful
C) Use Parameter Count only in pure mathematics contexts
D) Apply Parameter Count to solve problems in this subject's domain

Correct: D)

Practice Problems

Problem 1

What are the dimensions of W_f in an LSTM with d_h=256, d_in=128?

Answer: 256 × (256 + 128) = 256 × 384. Each LSTM weight matrix maps the concatenation [h_{t-1} ∥ x_t] (dimension 384) to the hidden dimension (256).

Problem 2

If f_t = 0 for all t, what is c_t? What does the gradient ∂c_t/∂c₁ look like?

Answer: c_t = 0·c_{t-1} + i_t·C̃_t = i_t·C̃_t. The cell state has no memory of earlier cell states; the only recurrence left runs through h_{t-1} in the gates. ∂c_t/∂c₁ = 0 for t > 1: no gradient flows to earlier states along the cell-state path, which is correct because earlier states were deliberately forgotten.

Problem 3

Derive ∂L/∂i_t in terms of ∂L/∂c_t and C̃_t.

Answer: From c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t: ∂L/∂i_t = ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values).

Problem 4

Why is b_f typically initialized to 1 (or another positive value)?

Answer: With b_f = 1, the forget gate activation is σ(1) ≈ 0.73, biasing the LSTM toward remembering. This prevents the LSTM from forgetting everything at the start of training, allowing it to discover long-range dependencies before learning to forget selectively.

Problem 5

An LSTM processes a sequence of length 100. For a particular cell state dimension, f_t = 0.95 for all t. What fraction of the gradient at t=100 reaches t=1 through the cell state path?

Answer: 0.95^99 ≈ 0.0062, so about 0.6% survives. Compare to a scalar vanilla RNN with W_hh = 0.5: 0.5^99 ≈ 1.6 × 10^{-30}, which is completely gone. Even with forget gates at 0.95 rather than 1.0, the LSTM is many orders of magnitude better.
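The two fractions in this answer are easy to verify directly. The sketch assumes, as the answer does, a constant per-step factor and 99 steps between t=100 and t=1.

```python
steps = 99                     # from t=100 back to t=1
lstm_fraction = 0.95 ** steps  # cell-state path with f_t = 0.95
rnn_fraction = 0.5 ** steps    # scalar RNN with W_hh = 0.5

print(f"{lstm_fraction:.4f}  {rnn_fraction:.2e}")  # 0.0062  1.58e-30
```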

Summary

  1. LSTM adds a cell state c_t with additive update c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t, creating a gradient highway
  2. The forget gate f_t gives element-wise control over memory retention; ∂c_t/∂c_{t-1} = diag(f_t), with no recurrent weight matrix involved
  3. The cell state gradient ∂c_t/∂c_{t-k} = ∏ diag(f_{t-j}) is a product of diagonal matrices, so each dimension is modulated independently
  4. The hidden state h_t = o_t ⊙ tanh(c_t) gates what the LSTM outputs without affecting the memory cell
  5. Saturation tradeoff: f_t → 1 gives perfect gradient flow but stops gate learning; f_t → 0 prevents gradient flow (correctly, since we chose to forget)

Pitfalls



Next Steps

Continue to 17-04 - GRU Mathematics to see a simpler gated architecture that achieves similar performance with fewer parameters.