17-03 — LSTM Mathematics

Phase: 17 — Deep Learning Architectures (Math)
Subject: 17-03
Prerequisites: 17-02 (RNNs and BPTT), 16-02 (Activation Functions — sigmoid, tanh), 16-05 (Backpropagation)
Next subject: 17-04 — GRU Mathematics

Learning Objectives

By the end of this subject, you will be able to:

Derive the full LSTM gate equations and explain the role of each gate in controlling information flow
Explain mathematically how the cell state's additive update prevents vanishing gradients
Compute the gradient of the loss with respect to the cell state and show why it doesn't vanish exponentially
Compare LSTM gradient flow to vanilla RNN gradient flow analytically
Identify failure modes: gate saturation and unbounded growth of the cell state

Core Content

1. The Problem LSTMs Solve

Recall from 17-02: the vanilla RNN gradient path involves repeated multiplication by the same Jacobian:

∂L/∂h₁ = ... ∏ diag(tanh')·W_hh

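The key insight: if the state transition were the identity (no squashing non-linearity, so the tanh derivative is 1 everywhere, and no repeated multiplication by W_hh), the gradient would pass through unchanged and wouldn't vanish. But we need non-linearity for expressive power.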

LSTM solution: Use gating to create a "gradient highway" — an additive cell state update with identity connections that let gradients flow unimpeded.
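To make the repeated-Jacobian problem concrete, here is a minimal sketch for a scalar vanilla RNN h_t = tanh(w·h_{t-1} + x_t). The weight w = 0.9, the constant input 0.5, and the function name are illustrative assumptions, not values from 17-02.

```python
import math

def rnn_gradient_factor(w, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    factor = 1.0
    for x in xs:
        h = math.tanh(w * h + x)
        # Local Jacobian of one step: tanh'(pre-activation) * w = (1 - h^2) * w
        factor *= (1.0 - h * h) * w
    return factor

short = rnn_gradient_factor(0.9, [0.5] * 5)    # 5 time steps
long = rnn_gradient_factor(0.9, [0.5] * 50)    # 50 time steps
print(short, long)
```

Each step multiplies the gradient by a number below 1 in this setup, so the 50-step factor is many orders of magnitude smaller than the 5-step factor.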

2. LSTM Architecture and Equations

The LSTM maintains TWO states:
- Cell state c_t ∈ ℝ^d: the "long-term memory" (gradient highway)
- Hidden state h_t ∈ ℝ^d: the "working memory" (output)

At each time step, given input x_t and previous states (h_{t-1}, c_{t-1}):

Forget gate f_t: Controls what to REMOVE from previous cell state

f_t = σ(W_f · [h_{t-1} ∥ x_t] + b_f)

Input gate i_t: Controls what NEW information to ADD

i_t = σ(W_i · [h_{t-1} ∥ x_t] + b_i)

Candidate cell state C̃_t: The new information we MIGHT add

C̃_t = tanh(W_C · [h_{t-1} ∥ x_t] + b_C)

Cell state update c_t: COMBINE forget + add (THE KEY EQUATION)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t

Output gate o_t: Controls what to EXPOSE from cell state

o_t = σ(W_o · [h_{t-1} ∥ x_t] + b_o)

Hidden state h_t: Filtered version of cell state

h_t = o_t ⊙ tanh(c_t)

Where:
- σ = sigmoid, outputting values in (0,1) for gating
- ⊙ = element-wise (Hadamard) product
- [h_{t-1} ∥ x_t] = concatenation, dimension d + d_in
- Each W_* ∈ ℝ^(d × (d+d_in)), each b_* ∈ ℝ^d, for * ∈ {f, i, C, o}

⚠️ THIS IS CRITICAL: The cell state update c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t is ADDITIVE. This creates a direct path where c_{t-1} flows into c_t without being multiplied by a weight matrix. Along this direct path, ∂c_t/∂c_{t-1} = diag(f_t), which is bounded and doesn't involve W at all.
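The six equations above can be sketched as a single forward step in plain Python. This is an illustrative sketch, not a reference implementation: the helper names (`affine`, `lstm_step`), the dictionary layout for the parameters, and the toy weights are all assumptions chosen so the result is easy to check by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'C', 'o' to (W, b) pairs."""
    hx = h_prev + x_t  # list concatenation [h_{t-1} || x_t]
    f = [sigmoid(z) for z in affine(*params["f"], hx)]
    i = [sigmoid(z) for z in affine(*params["i"], hx)]
    c_tilde = [math.tanh(z) for z in affine(*params["C"], hx)]
    o = [sigmoid(z) for z in affine(*params["o"], hx)]
    # Additive cell update, then the gated, squashed output
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Toy sizes: d = 2, d_in = 1, so each W is 2 x 3. All weights are zero,
# so every gate value is determined by its bias alone.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {
    "f": (zeros, [1.0, 1.0]),
    "i": (zeros, [0.0, 0.0]),
    "C": (zeros, [0.5, -0.5]),
    "o": (zeros, [0.0, 0.0]),
}
h1, c1 = lstm_step([1.0], [0.0, 0.0], [2.0, -1.0], params)
print(h1, c1)
```

With these toy parameters, f = σ(1) and i = σ(0) = 0.5 in both dimensions, so c1[0] = σ(1)·2 + 0.5·tanh(0.5).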

3. Why LSTMs Solve Vanishing Gradients

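Let's trace the gradient through the cell state. The loss L reaches c_t through h_t (and, via later time steps, through c_{t+1}). The contribution through h_t is:

∂L/∂c_t = ∂L/∂h_t · ∂h_t/∂c_t = δ_t^h ⊙ (o_t ⊙ tanh'(c_t)), where δ_t^h = ∂L/∂h_t and tanh'(c_t) = 1 − tanh²(c_t)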

Now, how does c_t depend on c_{t-1}?

∂c_t/∂c_{t-1} = diag(f_t)

This is the magic. The Jacobian ∂c_t/∂c_{t-1} is a DIAGONAL matrix whose diagonal entries are f_t (the forget gate values). It does NOT involve the recurrent weight matrix W at all!

The gradient from c_t back to c_{t-k}:

∂c_t/∂c_{t-k} = ∏_{j=0}^{k-1} diag(f_{t-j})

This is a product of k DIAGONAL matrices, each with entries in (0,1). The eigenvalues are the products of forget gate values.

If f_j ≈ 1 (the LSTM "remembers"), then ∏ f ≈ 1 and gradients don't vanish. If f_j ≈ 0 (the LSTM "forgets"), then ∏ f ≈ 0 and gradients vanish — but that's correct behavior, because we WANT to forget.
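The product formula can be checked numerically for one cell dimension. The sketch below holds the gate values fixed (as the derivation does), perturbs the starting cell value, and compares the finite-difference slope to the product of forget gates. All gate and candidate values are made-up illustrative numbers.

```python
def roll_cell(c_start, f_seq, i_seq, g_seq):
    """Apply c <- f*c + i*g repeatedly with the gate values held fixed."""
    c = c_start
    for f, i, g in zip(f_seq, i_seq, g_seq):
        c = f * c + i * g
    return c

f_seq = [0.99, 0.95, 0.97, 0.98]   # hypothetical forget-gate values
i_seq = [0.2, 0.1, 0.3, 0.4]       # hypothetical input-gate values
g_seq = [0.5, -0.3, 0.8, 0.1]      # hypothetical candidate values in (-1, 1)

# Analytic Jacobian along the cell path: the product of the forget gates
analytic = 1.0
for f in f_seq:
    analytic *= f

# Central finite difference with respect to the starting cell value
eps = 1e-6
numeric = (roll_cell(1.0 + eps, f_seq, i_seq, g_seq)
           - roll_cell(1.0 - eps, f_seq, i_seq, g_seq)) / (2 * eps)
print(analytic, numeric)
```

The two numbers agree up to floating-point rounding, and neither depends on the input-gate or candidate values.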

Contrast with vanilla RNN: ∂h_t/∂h_{t-1} = diag(tanh')·W_hh — involves the full matrix W_hh, whose eigenvalues can systematically be < 1.

The LSTM's trick: The recurrent WEIGHTS W_f, W_i, W_C, W_o still suffer from vanishing gradient through the hidden state path (h_{t-1} → gates → c_t). But the CELL STATE itself provides a parallel, additive gradient highway.

4. Full Gradient Analysis

The hidden state gradient still involves W_hh-like terms:

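∂h_t/∂h_{t-1} = ∂h_t/∂c_t · ∂c_t/∂h_{t-1} + ∂h_t/∂o_t · ∂o_t/∂h_{t-1}

Here c_t depends on h_{t-1} through f_t, i_t, and C̃_t, and the second term is the path through the output gate.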
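Expanding the hidden-state Jacobian with the chain rule makes the weight-matrix dependence explicit. The block below is a sketch under one notational assumption that is not in the original: W_*^{(h)} denotes the first d columns of W_*, that is, the part that multiplies h_{t-1} in the concatenation.

```latex
\begin{align*}
\frac{\partial h_t}{\partial h_{t-1}}
  &= \operatorname{diag}\!\big(\tanh(c_t)\big)\,\frac{\partial o_t}{\partial h_{t-1}}
   + \operatorname{diag}\!\big(o_t \odot (1 - \tanh^2(c_t))\big)\,\frac{\partial c_t}{\partial h_{t-1}}, \\
\frac{\partial c_t}{\partial h_{t-1}}
  &= \operatorname{diag}(c_{t-1})\,\frac{\partial f_t}{\partial h_{t-1}}
   + \operatorname{diag}(\tilde{C}_t)\,\frac{\partial i_t}{\partial h_{t-1}}
   + \operatorname{diag}(i_t)\,\frac{\partial \tilde{C}_t}{\partial h_{t-1}}, \\
\frac{\partial g_t}{\partial h_{t-1}}
  &= \operatorname{diag}\!\big(g_t \odot (1 - g_t)\big)\,W_g^{(h)}
  \quad \text{for } g \in \{f, i, o\}, \\
\frac{\partial \tilde{C}_t}{\partial h_{t-1}}
  &= \operatorname{diag}\!\big(1 - \tilde{C}_t^{\,2}\big)\,W_C^{(h)}.
\end{align*}
```

Every term carries a weight-matrix factor and a sigmoid or tanh derivative, which is why this path behaves like the vanilla RNN case.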

But the cell state gradient is clean:

∂c_t/∂c_{t-1} = diag(f_t)

And importantly:

∂c_t/∂c_{t-k} = ∏_{j=0}^{k-1} diag(f_{t-j})

This is element-wise: each dimension of the cell state can have its OWN forget rate. The LSTM can learn to remember some features for long periods (f ≈ 1) while forgetting others (f ≈ 0).
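A small sketch of the per-dimension behaviour: multiplying forget-gate vectors element-wise gives the diagonal of ∂c_t/∂c_{t-k}. The three-dimensional cell and its gate values are hypothetical, chosen so that one dimension remembers and another forgets.

```python
def retention(f_per_step):
    """Element-wise product of forget-gate vectors: the diagonal of dc_t/dc_{t-k}."""
    diag = [1.0] * len(f_per_step[0])
    for f_t in f_per_step:
        diag = [d * f for d, f in zip(diag, f_t)]
    return diag

# Hypothetical cell with the same gate vector at each of 20 steps:
# dimension 0 remembers, dimension 1 fades, dimension 2 forgets almost at once.
f_per_step = [[0.99, 0.7, 0.1]] * 20
kept = retention(f_per_step)
print(kept)
```

After 20 steps, dimension 0 still passes most of its gradient, while dimensions 1 and 2 have effectively cut the path.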

5. Gate Dynamics and Interpretations

Forget gate f_t ≈ 0: The LSTM completely erases the corresponding dimension of cell state. Previous information in that dimension doesn't affect future outputs.

Forget gate f_t ≈ 1: Perfect memory — the cell state dimension passes through unchanged. Gradient flows freely.

Input gate i_t ≈ 1: The LSTM stores the candidate C̃_t into the cell state.

Output gate o_t ≈ 0: The cell state information is hidden from the output. The LSTM knows it but doesn't use it yet.

Typical initialization: b_f is initialized to a positive value (e.g., 1) so that the LSTM starts with a bias toward remembering (f_t ≈ σ(1) ≈ 0.73 initially). This gives the LSTM a "prior" favoring long-term memory.
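To see what the bias choice means in numbers, the sketch below evaluates the initial forget gate for a few values of b_f, assuming the weighted input W_f · [h_{t-1} ∥ x_t] is close to zero at the start of training. The 10-step horizon is an arbitrary illustrative choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for b_f in (0.0, 1.0, 2.0):
    f0 = sigmoid(b_f)   # gate value when the weighted input is about 0 (assumed)
    kept = f0 ** 10     # fraction of the cell value retained after 10 such steps
    print(f"b_f={b_f}: f={f0:.3f}, kept after 10 steps={kept:.2e}")
```

Moving b_f from 0 to 1 raises the 10-step retention by more than a factor of 40 under this assumption.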

6. Parameter Count

Each of the three gates and the candidate has its own weight matrix of size d × (d + d_in). Four matrices → 4d(d + d_in) weights, plus 4d bias terms.

Example: d=512, d_in=256 → 4·512·768 = 1,572,864 weights per LSTM layer (1,574,912 parameters once the 4·512 = 2,048 biases are included).

Compare to vanilla RNN: d(d + d_in) = 512·768 = 393,216 parameters. LSTM has 4× the parameters.
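The counts above can be reproduced with two small helper functions. The function names and the `include_bias` flag are illustrative assumptions; the formulas are the ones from this section, with one bias vector per weight matrix.

```python
def rnn_weight_count(d, d_in, include_bias=False):
    """Vanilla RNN: one d x (d + d_in) matrix, optionally one bias vector."""
    return d * (d + d_in) + (d if include_bias else 0)

def lstm_weight_count(d, d_in, include_bias=False):
    """LSTM: four blocks (f, i, C, o), each shaped like the vanilla RNN block."""
    return 4 * rnn_weight_count(d, d_in, include_bias)

print(lstm_weight_count(512, 256))        # 1572864
print(rnn_weight_count(512, 256))         # 393216
print(lstm_weight_count(512, 256, True))  # 1574912
```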

7. Gradient Through Gates

Backprop through the forget gate:

∂L/∂f_t = ∂L/∂c_t ⊙ c_{t-1} (from c_t = f_t⊙c_{t-1} + ...)

Then through the sigmoid:

∂L/∂z_t^f = ∂L/∂f_t ⊙ f_t ⊙ (1 − f_t), where z_t^f = W_f · [h_{t-1} ∥ x_t] + b_f is the forget gate pre-activation

The sigmoid derivative f_t(1−f_t) can saturate — if f_t ≈ 0 or f_t ≈ 1, the gradient through the gate is tiny. This is a potential issue if the gates get "stuck" at extremes.
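The two backprop steps can be sketched for a single cell dimension as follows. The function name and the sample values (incoming gradient 1.0, previous cell value 2.0) are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate_grads(dL_dc, c_prev, z_f):
    """Return (dL/df, dL/dz_f) for one dimension, where f = sigmoid(z_f)."""
    f = sigmoid(z_f)
    dL_df = dL_dc * c_prev           # from c_t = f_t * c_{t-1} + ...
    dL_dz = dL_df * f * (1.0 - f)    # through the sigmoid derivative
    return dL_df, dL_dz

# The gradient reaching the pre-activation shrinks as the gate saturates.
for z_f in (0.0, 3.0, 8.0):
    print(z_f, forget_gate_grads(dL_dc=1.0, c_prev=2.0, z_f=z_f))
```

The first component stays at 2.0 in all three cases; only the second, which is what updates W_f and b_f, collapses as z_f grows.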

Key Terms

Cell state c_t
Hidden state h_t

Worked Examples

Example 1: Tracking a Single Cell State Dimension

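Problem: For one dimension of cell state, f = [0.9, 0.5, 0.8], i = [0.1, 0.3, 0.2], C̃ = [1, 2, 3] over three time steps, with c₀ = 0. Compute c₁, c₂, c₃. (The candidate values are picked for easy arithmetic; in an actual LSTM, C̃_t = tanh(·) lies in (−1, 1).)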

Solution:
c₁ = 0.9·0 + 0.1·1 = 0.1
c₂ = 0.5·0.1 + 0.3·2 = 0.05 + 0.6 = 0.65
c₃ = 0.8·0.65 + 0.2·3 = 0.52 + 0.6 = 1.12
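The same recurrence takes a few lines of Python, using the numbers from the problem statement. The variable names are illustrative.

```python
f = [0.9, 0.5, 0.8]
i = [0.1, 0.3, 0.2]
c_tilde = [1.0, 2.0, 3.0]

c = 0.0          # c_0
history = []
for f_t, i_t, g_t in zip(f, i, c_tilde):
    c = f_t * c + i_t * g_t   # c_t = f_t * c_{t-1} + i_t * candidate_t
    history.append(c)
print([round(v, 4) for v in history])  # [0.1, 0.65, 1.12]
```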

Example 2: Gradient Flow Over Time

Problem: For the cell state from Example 1, if ∂L/∂c₃ = 1, compute ∂L/∂c₁ (ignoring gate dependencies on c₁).

Solution: ∂L/∂c₁ = ∂L/∂c₃ · ∂c₃/∂c₂ · ∂c₂/∂c₁ = 1 · f₃ · f₂ = 1 · 0.8 · 0.5 = 0.40

The gradient is 0.40: attenuated, but not vanished. The path from c₃ back to c₁ crosses two steps, with f₃ = 0.8 and f₂ = 0.5. Compare to a scalar vanilla RNN with W_hh = 0.5: over the same two steps, (0.5)² = 0.25 of the gradient survives from the weight alone, before any further attenuation from the tanh derivative.
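The backward pass along the cell path can be written as a short loop that walks from t = 3 down to t = 1. This sketch keeps the gates fixed, matching the "ignoring gate dependencies" assumption in the problem; the dictionary layout is an illustrative choice.

```python
f = {1: 0.9, 2: 0.5, 3: 0.8}   # forget-gate values from Example 1, keyed by time step
grad = {3: 1.0}                # dL/dc_3 is given
for t in (3, 2):
    # dc_t/dc_{t-1} = f_t when the gates are held fixed
    grad[t - 1] = grad[t] * f[t]
print(grad)
```

Note that f₁ never enters the result: it scales the step from c₀ to c₁, which lies outside the path from c₃ back to c₁.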

Example 3: Gate Saturation Analysis

Problem: An LSTM learns such that f_t → 1.0 for all t for some dimension. What happens to: (a) the cell state, (b) the gradient?

Solution:
(a) c_t = 1·c_{t-1} + i_t·C̃_t = c_{t-1} + i_t·C̃_t. The cell state grows unboundedly if i_t·C̃_t is consistently positive: it's a running sum.
(b) ∂L/∂c_{t-k} has factor ∏ f = 1^k = 1. Gradients don't vanish. But ∂f/∂(preactivation) = f(1−f) ≈ 0, so the forget gate stops learning! This is the saturation tradeoff: perfect gradient flow through c, but no learning in the gate.
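Both halves of the tradeoff show up in a short simulation. The pre-activation of 12, the constant input-gate value 0.5, the constant candidate 0.8, and the 1000-step horizon are all assumed values for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z_f = 12.0                  # hypothetical large pre-activation, so f is close to 1
f = sigmoid(z_f)
c = 0.0
for _ in range(1000):
    c = f * c + 0.5 * 0.8   # i_t = 0.5 and candidate = 0.8 at every step (assumed)

print(c)                    # (a) close to a running sum of 0.4 per step
print(f ** 1000)            # (b) cell-path gradient factor stays near 1
print(f * (1.0 - f))        # (b) sigmoid derivative: the gate itself barely learns
```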

Quiz

Q2: What is the primary purpose of Full Gradient Analysis?

A) It is used only in advanced research contexts
B) It is primarily a historical notation system
C) It traces how gradients flow through both the hidden-state and cell-state paths of the LSTM
D) It replaces all other methods in this domain

Correct: C)

If you chose A: This is incorrect. Full Gradient Analysis serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose B: This is incorrect. Full Gradient Analysis serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose C: Full Gradient Analysis serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
If you chose D: This is incorrect. Full Gradient Analysis serves the purpose described in the correct answer. The other options misrepresent its role.

Q3: Which statement about Gate Dynamics and Interpretations is TRUE?

A) Gate Dynamics and Interpretations is an advanced topic beyond this subject's scope
B) Gate Dynamics and Interpretations is a fundamental concept covered in this subject
C) Gate Dynamics and Interpretations is not related to this subject
D) Gate Dynamics and Interpretations is mentioned only as a historical footnote

Correct: B)

If you chose A: This is incorrect. Gate Dynamics and Interpretations is a fundamental concept covered in this subject. This subject covers Gate Dynamics and Interpretations as part of its core content.
If you chose B: Gate Dynamics and Interpretations is a fundamental concept covered in this subject. This subject covers Gate Dynamics and Interpretations as part of its core content. Correct!
If you chose C: This is incorrect. Gate Dynamics and Interpretations is a fundamental concept covered in this subject. This subject covers Gate Dynamics and Interpretations as part of its core content.
If you chose D: This is incorrect. Gate Dynamics and Interpretations is a fundamental concept covered in this subject. This subject covers Gate Dynamics and Interpretations as part of its core content.

Q4: From c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t, what is ∂L/∂i_t?

A) The inverse of the correct answer
B) ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values)
C) An unrelated numerical value
D) A different result from a common mistake

Correct: B)

If you chose A: This is incorrect. Practice Problem 3 shows that the result is ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values). The other options represent common errors.
If you chose B: Practice Problem 3 shows that the result is ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values). The other options represent common errors. Correct!
If you chose C: This is incorrect. Practice Problem 3 shows that the result is ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values). The other options represent common errors.
If you chose D: This is incorrect. Practice Problem 3 shows that the result is ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values). The other options represent common errors.

Q5: How are Gate Dynamics and Interpretations and Gradient Through Gates related?

A) Gate Dynamics and Interpretations is the inverse of Gradient Through Gates
B) Gate Dynamics and Interpretations and Gradient Through Gates are closely related concepts
C) Gate Dynamics and Interpretations is a special case of Gradient Through Gates
D) Gate Dynamics and Interpretations and Gradient Through Gates are completely unrelated topics

Correct: B)

If you chose A: This is incorrect. Both Gate Dynamics and Interpretations and Gradient Through Gates are covered in this subject as interconnected topics.
If you chose B: Both Gate Dynamics and Interpretations and Gradient Through Gates are covered in this subject as interconnected topics. Correct!
If you chose C: This is incorrect. Both Gate Dynamics and Interpretations and Gradient Through Gates are covered in this subject as interconnected topics.
If you chose D: This is incorrect. Both Gate Dynamics and Interpretations and Gradient Through Gates are covered in this subject as interconnected topics.

Q6: What is a common pitfall when working with LSTM Architecture and Equations?

A) LSTM Architecture and Equations has no common misconceptions
B) The main error with LSTM Architecture and Equations is using it when it is not needed
C) A common mistake is confusing LSTM Architecture and Equations with a similar concept
D) LSTM Architecture and Equations is always computed the same way in all contexts

Correct: C)

If you chose A: This is incorrect. Students often confuse LSTM Architecture and Equations with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose B: This is incorrect. Students often confuse LSTM Architecture and Equations with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose C: Students often confuse LSTM Architecture and Equations with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
If you chose D: This is incorrect. Students often confuse LSTM Architecture and Equations with similar-sounding or related concepts. Pay attention to the precise definitions.

Q7: When should you apply Parameter Count?

A) Avoid Parameter Count unless explicitly instructed
B) Parameter Count is not practically useful
C) Use Parameter Count only in pure mathematics contexts
D) Apply Parameter Count to solve problems in this subject's domain

Correct: D)

If you chose A: This is incorrect. Parameter Count is a practical tool used throughout this subject to solve relevant problems.
If you chose B: This is incorrect. Parameter Count is a practical tool used throughout this subject to solve relevant problems.
If you chose C: This is incorrect. Parameter Count is a practical tool used throughout this subject to solve relevant problems.
If you chose D: Parameter Count is a practical tool used throughout this subject to solve relevant problems. Correct!

Practice Problems

Problem 1

What are the dimensions of W_f in an LSTM with d_h=256, d_in=128?

Answer

256 × (256 + 128) = 256 × 384. Each LSTM weight matrix maps the concatenation [h_{t-1}, x_t] (dimension 384) to the hidden dimension (256).

Problem 2

If f_t = 0 for all t, what is c_t? What does the gradient ∂c_t/∂c₁ look like?

Answer

c_t = 0·c_{t-1} + i_t·C̃_t = i_t·C̃_t. The cell state carries no memory of its own: c_t depends only on the current gate and candidate values (which still see h_{t-1}). Along the cell state path, ∂c_t/∂c₁ = 0, so no gradient flows to earlier cell states, which is correct because they were deliberately forgotten.

Problem 3

Derive ∂L/∂i_t in terms of ∂L/∂c_t and C̃_t.

Answer

From c_t = f_t⊙c_{t-1} + i_t⊙C̃_t: ∂L/∂i_t = ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values).
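The derived formula can be sanity-checked with a finite difference. The sketch below assumes a made-up loss L = ½·Σ c_t², so that ∂L/∂c_t = c_t; the gate, candidate, and cell values are likewise illustrative.

```python
def cell(f, c_prev, i, g):
    """Cell update c_t = f * c_prev + i * g, element-wise."""
    return [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]

def loss(c):
    return 0.5 * sum(v * v for v in c)   # hypothetical loss, so dL/dc_t = c_t

f, c_prev = [0.9, 0.2], [1.0, -2.0]
i, g = [0.3, 0.6], [0.5, -0.7]           # g plays the role of the candidate

c_t = cell(f, c_prev, i, g)
analytic = [dc * gk for dc, gk in zip(c_t, g)]   # dL/dc_t times candidate, element-wise

eps = 1e-6
numeric = []
for k in range(len(i)):
    up = list(i)
    up[k] += eps
    dn = list(i)
    dn[k] -= eps
    numeric.append((loss(cell(f, c_prev, up, g)) - loss(cell(f, c_prev, dn, g))) / (2 * eps))
print(analytic, numeric)
```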

Problem 4

Why is b_f typically initialized to 1 (or another positive value)?

Answer

With b_f = 1, the forget gate activation is σ(1) ≈ 0.73, biasing the LSTM toward remembering. This prevents the LSTM from forgetting everything at the start of training, allowing it to discover long-range dependencies before learning to forget selectively.

Problem 5

An LSTM processes a sequence of length 100. For a particular cell state dimension, f_t = 0.95 for all t. What fraction of the gradient at t=100 reaches t=1 through the cell state path?

Answer

0.95^99 ≈ 0.0062 — about 0.6% survives. Compare to vanilla RNN with W_hh = 0.5: 0.5^99 ≈ 1.6 × 10^{-30} — completely gone. Even with suboptimal gates (0.95 vs 1.0), LSTM is orders of magnitude better.
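The two survival fractions can be computed directly:

```python
lstm_path = 0.95 ** 99   # 99 forget-gate factors between t = 100 and t = 1
rnn_path = 0.5 ** 99     # scalar vanilla RNN with W_hh = 0.5, weight part only
print(f"{lstm_path:.4f}")   # 0.0062
print(f"{rnn_path:.2e}")    # 1.58e-30
```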

Summary

LSTM adds a cell state c_t with additive update: c_t = f_t⊙c_{t-1} + i_t⊙C̃_t — creating a gradient highway
The forget gate f_t gives element-wise control over memory retention; ∂c_t/∂c_{t-1} = diag(f_t) — no recurrent weight matrix involved
The cell state gradient ∂c_t/∂c_{t-k} = ∏ diag(f_{t-j}) is a product of diagonal matrices — each dimension independently modulated
The hidden state h_t = o_t⊙tanh(c_t) gates what the LSTM outputs without affecting the memory cell
Saturation tradeoff: f_t → 1 gives perfect gradient flow but stops gate learning; f_t → 0 prevents gradient flow (correctly, since we chose to forget)

Pitfalls

Assuming LSTMs completely solve vanishing gradients. The cell state gradient highway (∂c_t/∂c_{t-1} = diag(f_t)) helps immensely, but gradients flowing through the gate computations (W_f, W_i, W_C, W_o) still pass through sigmoid and tanh derivatives and can vanish over long sequences. LSTMs extend the effective range but don't grant infinite memory — 500+ step dependencies still challenge them.
Initializing the forget gate bias b_f to zero. With b_f = 0, the forget gate activation starts at σ(0) = 0.5, biasing the LSTM toward rapid forgetting. Always initialize b_f to a positive value (typically 1.0, giving σ(1) ≈ 0.73) so the LSTM starts with a "remember first, learn to forget later" prior.
Forgetting that the cell state can grow unbounded. Unlike the hidden state h_t = o_t ⊙ tanh(c_t) which is bounded by tanh, the cell state c_t has no inherent bounding mechanism. With f_t ≈ 1 and consistently positive i_t ⊙ C̃_t, c_t becomes an unbounded running sum. This doesn't break the forward pass but can cause numerical issues in long sequences.
Confusing the roles of forget gate f_t and output gate o_t. f_t controls what information is retained in the cell state c_t (memory management). o_t controls what the LSTM exposes from c_t to the outside world via h_t (information release). They serve fundamentally different purposes — an LSTM can remember information (f_t ≈ 1, c_t large) while choosing not to reveal it (o_t ≈ 0, h_t ≈ 0).
Stacking too many LSTM layers. Each LSTM layer already has significant depth through time. Stacking more than 3–4 LSTM layers rarely improves performance and dramatically increases training time and memory usage due to the 4× parameter count per layer.
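Two of the pitfalls above, the unbounded cell state and the separate role of the output gate, can be seen together in a few lines. The cell value of 25 and the three output-gate values are hypothetical.

```python
import math

c_t = 25.0   # hypothetical large cell value after many additive updates
for o_t in (0.01, 0.5, 0.99):
    h_t = o_t * math.tanh(c_t)
    print(o_t, h_t)   # |h_t| never exceeds o_t, however large c_t gets

# The derivative of tanh at a large cell value is close to zero,
# so little gradient reaches c_t through h_t.
print(1.0 - math.tanh(c_t) ** 2)
```

With o_t = 0.01 the cell holds a large value while the hidden state stays near zero, which is the "remember but don't reveal" case.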

Next Steps

Continue to 17-04 — GRU Mathematics to see a simpler gated architecture that achieves similar performance with fewer parameters.
