17-03: LSTM Mathematics
Phase 17: Deep Learning Architectures (Math)
Subject: 17-03
Prerequisites: 17-02 (RNNs and BPTT), 16-02 (Activation Functions: sigmoid, tanh), 16-05 (Backpropagation)
Next subject: 17-04: GRU Mathematics
Learning Objectives
By the end of this subject, you will be able to:
- Derive the full LSTM gate equations and explain the role of each gate in controlling information flow
- Explain mathematically how the cell state's additive update prevents vanishing gradients
- Compute the gradient of the loss with respect to the cell state and show why it doesn't vanish exponentially
- Compare LSTM gradient flow to vanilla RNN gradient flow analytically
- Identify failure modes: gate saturation and unbounded cell state growth
Core Content
1. The Problem LSTMs Solve
Recall from 17-02: the vanilla RNN gradient path involves repeated multiplication by the same Jacobian:
∂L/∂h₁ = ... ∏ diag(tanh')·W_hh
The key insight: if the state transition were linear with an identity-like Jacobian (no tanh' factor and no repeated multiplication by W_hh), the gradient wouldn't vanish. But we need non-linearity for expressive power.
LSTM solution: Use gating to create a "gradient highway": an additive cell state update with identity-like connections that let gradients flow largely unimpeded.
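To make the repeated-multiplication problem concrete, here is a minimal sketch for a scalar RNN. The function name `rnn_gradient_factor` and the numbers (recurrent weight 0.9, pre-activation 0.5 at every step) are illustrative assumptions, not values taken from 17-02.

```python
import math

def rnn_gradient_factor(w_hh, pre_activations):
    """Product of the per-step factors tanh'(a_t) * w_hh for a scalar RNN."""
    factor = 1.0
    for a in pre_activations:
        tanh_prime = 1.0 - math.tanh(a) ** 2  # derivative of tanh at a
        factor *= tanh_prime * w_hh
    return factor

# Assumed values: recurrent weight 0.9, pre-activation 0.5 at every step.
for steps in (1, 10, 50):
    print(steps, rnn_gradient_factor(0.9, [0.5] * steps))
```

With these assumed values each step multiplies the gradient by about 0.71, so the factor is already below 0.04 after 10 steps.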
2. LSTM Architecture and Equations
The LSTM maintains TWO states:
- Cell state c_t ∈ ℝ^d: the "long-term memory" (gradient highway)
- Hidden state h_t ∈ ℝ^d: the "working memory" (output)
At each time step, given input x_t and previous states (h_{t-1}, c_{t-1}):
Forget gate f_t: Controls what to REMOVE from the previous cell state
f_t = σ(W_f · [h_{t-1} ‖ x_t] + b_f)
Input gate i_t: Controls what NEW information to ADD
i_t = σ(W_i · [h_{t-1} ‖ x_t] + b_i)
Candidate cell state C̃_t: The new information we MIGHT add
C̃_t = tanh(W_C · [h_{t-1} ‖ x_t] + b_C)
Cell state update c_t: COMBINE forget + add (THE KEY EQUATION)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t
Output gate o_t: Controls what to EXPOSE from the cell state
o_t = σ(W_o · [h_{t-1} ‖ x_t] + b_o)
Hidden state h_t: Filtered version of the cell state
h_t = o_t ⊙ tanh(c_t)
Where:
- σ = sigmoid, outputting values in (0,1) for gating
- ⊙ = element-wise (Hadamard) product
- [h_{t-1} ‖ x_t] = concatenation, dimension d + d_in
- Each W_* ∈ ℝ^(d × (d+d_in)), each b_* ∈ ℝ^d
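The equations above can be sketched in plain Python for small dimensions. This is a minimal illustration under assumptions: the helper names (`sigmoid`, `affine`, `lstm_step`), the layout of the `params` dictionary, and the all-zero example weights are hypothetical choices, and a practical implementation would use a tensor library instead of lists.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W·v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step. params maps 'f', 'i', 'C', 'o' to (W, b) pairs."""
    z = h_prev + x_t  # list concatenation [h_{t-1} ‖ x_t], length d + d_in

    def gate(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f = gate("f", sigmoid)         # forget gate
    i = gate("i", sigmoid)         # input gate
    c_tilde = gate("C", math.tanh) # candidate cell state
    o = gate("o", sigmoid)         # output gate
    # Additive cell state update, then the filtered hidden state
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Hypothetical example: d = 2, d_in = 1, all weights and biases zero.
d, d_in = 2, 1
zero = ([[0.0] * (d + d_in) for _ in range(d)], [0.0] * d)
params = {name: zero for name in ("f", "i", "C", "o")}
h, c = lstm_step([1.0], [0.0, 0.0], [2.0, -4.0], params)
print(c)  # [1.0, -2.0]: every gate is 0.5 and the candidate is 0
```

With zero weights every sigmoid gate evaluates to 0.5 and the candidate to 0, so the new cell state is exactly half the old one, which makes the additive update easy to verify by hand.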
⚠️ THIS IS CRITICAL: The cell state update c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t is ADDITIVE. This creates a direct path where c_{t-1} flows into c_t without being multiplied by a weight matrix. Along this direct path, the gradient ∂c_t/∂c_{t-1} = diag(f_t), which is bounded and doesn't involve W at all.
3. Why LSTMs Solve Vanishing Gradients
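Let's trace the gradient through the cell state. The loss L depends on c_t through h_t (and, along the cell state path, through c_{t+1}). Writing δ_t^h for ∂L/∂h_t, the contribution through h_t is:
∂L/∂c_t = ∂L/∂h_t · ∂h_t/∂c_t = δ_t^h ⊙ (o_t ⊙ tanh'(c_t))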
Now, how does c_t depend on c_{t-1}? Following only the direct cell state path (treating the gate values as fixed):
∂c_t/∂c_{t-1} = diag(f_t)
This is the magic. The Jacobian ∂c_t/∂c_{t-1} is a DIAGONAL matrix whose diagonal entries are f_t (the forget gate values). It does NOT involve the recurrent weight matrix W at all!
The gradient from c_t back to c_{t-k}:
∂c_t/∂c_{t-k} = ∏_{j=0}^{k-1} diag(f_{t-j})
This is a product of k DIAGONAL matrices, each with entries in (0,1). The product is itself diagonal, so its eigenvalues are the per-dimension products of forget gate values.
If f_j ≈ 1 (the LSTM "remembers"), then ∏ f ≈ 1 and gradients don't vanish. If f_j ≈ 0 (the LSTM "forgets"), then ∏ f ≈ 0 and gradients vanish, but that's correct behavior, because we WANT to forget.
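The product formula can be checked numerically for a single cell state dimension. In this sketch the gate values are held fixed, so only the direct cell state path is followed; the sequences and the function name are made-up illustrations.

```python
import math

def run_cell(c_start, f_seq, i_seq, cand_seq):
    """Apply c <- f*c + i*cand over a sequence, for one cell state dimension."""
    c = c_start
    for f, i, cand in zip(f_seq, i_seq, cand_seq):
        c = f * c + i * cand
    return c

# Hypothetical gate values for k = 4 steps
f_seq = [0.9, 0.5, 0.8, 0.99]
i_seq = [0.1, 0.3, 0.2, 0.05]
cand_seq = [1.0, -2.0, 0.5, 1.5]

analytic = math.prod(f_seq)  # product of the forget gate values
eps = 1e-6
# Central finite difference of the final cell state w.r.t. the starting one
numeric = (run_cell(1.0 + eps, f_seq, i_seq, cand_seq)
           - run_cell(1.0 - eps, f_seq, i_seq, cand_seq)) / (2 * eps)
print(analytic, numeric)  # both close to 0.3564
```

The input gate and candidate values shift the final cell state but drop out of the derivative, which is why only the forget gates appear in the product.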
Contrast with vanilla RNN: ∂h_t/∂h_{t-1} = diag(tanh')·W_hh involves the full matrix W_hh, whose eigenvalues can all have magnitude < 1, in which case every step shrinks the gradient regardless of what the network needs to remember.
The LSTM's trick: Gradients that travel through the hidden state path (h_{t-1} → gates → c_t), and therefore through the recurrent WEIGHTS W_f, W_i, W_C, W_o, can still vanish. But the CELL STATE itself provides a parallel, additive gradient highway.
4. Full Gradient Analysis
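The hidden state gradient still involves W_hh-like terms, because both c_t and o_t depend on h_{t-1} through the gate weight matrices:
∂h_t/∂h_{t-1} = ∂h_t/∂c_t · ∂c_t/∂h_{t-1} + ∂h_t/∂o_t · ∂o_t/∂h_{t-1}
The first term runs through the forget gate, input gate, and candidate; the second is the direct path through the output gate.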
But the cell state gradient is clean:
∂c_t/∂c_{t-1} = diag(f_t)
And importantly:
∂c_t/∂c_{t-k} = ∏_{j=0}^{k-1} diag(f_{t-j})
This is element-wise: each dimension of the cell state can have its OWN forget rate. The LSTM can learn to remember some features for long periods (f ≈ 1) while forgetting others (f ≈ 0).
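Because the product is element-wise, each dimension has its own effective memory length. A small sketch, using three made-up forget rates and hypothetical helper names, shows how differently they retain gradient over 50 steps:

```python
import math

def retention(forget_rates, k):
    """Per-dimension gradient factor f**k after k steps with a constant forget rate."""
    return [f ** k for f in forget_rates]

def half_life(f):
    """Number of steps until the factor f**k drops to 0.5 (requires 0 < f < 1)."""
    return math.log(0.5) / math.log(f)

rates = [0.999, 0.9, 0.5]  # hypothetical per-dimension forget gate values
for f, r in zip(rates, retention(rates, 50)):
    print(f"f={f}: factor after 50 steps = {r:.3g}, half-life = {half_life(f):.1f} steps")
```

The constant forget rate is a simplification; in a trained LSTM the gate values change at every step, so the product would be taken over the actual sequence of f_t.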
5. Gate Dynamics and Interpretations
Forget gate f_t ≈ 0: The LSTM completely erases the corresponding dimension of cell state. Previous information in that dimension doesn't affect future outputs.
Forget gate f_t ≈ 1: Perfect memory. The previous cell state value in that dimension is carried forward undiminished, and the gradient flows freely.
Input gate i_t ≈ 1: The LSTM writes the full candidate C̃_t into the cell state.
Output gate o_t ≈ 0: The cell state information is hidden from the output. The LSTM knows it but doesn't use it yet.
Typical initialization: b_f is initialized to a positive value (e.g., 1) so that the LSTM starts with a bias toward remembering (f_t ≈ σ(1) ≈ 0.73 initially, while the weighted input is still close to zero). This gives the LSTM a "prior" favoring long-term memory.
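The effect of the bias choice can be sketched numerically. The assumption here, labeled in the code, is that the weighted input W_f · [h_{t-1} ‖ x_t] is close to zero early in training, so the gate value is roughly σ(b_f); the bias values tried are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumption: the forget gate pre-activation is approximately b_f at initialization.
for b_f in (0.0, 1.0, 2.0):
    f0 = sigmoid(b_f)
    print(f"b_f={b_f}: f ≈ {f0:.3f}, cell state gradient factor over 10 steps ≈ {f0 ** 10:.4f}")
```

Under this assumption, moving b_f from 0 to 1 raises the 10-step factor from about 0.001 to about 0.04.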
6. Parameter Count
Each gate has its own weight matrix of size d × (d + d_in). Four gates → 4d(d + d_in) weight parameters, plus 4d bias parameters.
Example: d=512, d_in=256 → 4·512·768 = 1,572,864 weight parameters per LSTM layer (plus 4·512 = 2,048 biases).
Compare to vanilla RNN: d(d + d_in) = 512·768 = 393,216 weight parameters. LSTM has 4× the parameters.
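The counts above are easy to reproduce. The helper names below are hypothetical, and the optional bias terms follow the b_f, b_i, b_C, b_o vectors from the gate equations; library implementations may organize biases differently, so their reported totals can differ slightly.

```python
def lstm_param_count(d, d_in, include_bias=True):
    """Parameters of one LSTM layer: four gates, each d x (d + d_in) plus a bias."""
    weights = 4 * d * (d + d_in)
    biases = 4 * d if include_bias else 0
    return weights + biases

def rnn_param_count(d, d_in, include_bias=True):
    """Parameters of one vanilla RNN layer with the same dimensions."""
    return d * (d + d_in) + (d if include_bias else 0)

print(lstm_param_count(512, 256, include_bias=False))  # 1572864
print(rnn_param_count(512, 256, include_bias=False))   # 393216
print(lstm_param_count(512, 256))                      # 1574912
```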
7. Gradient Through Gates
Backprop through the forget gate:
∂L/∂f_t = ∂L/∂c_t ⊙ c_{t-1} (from c_t = f_t ⊙ c_{t-1} + ...)
Then through the sigmoid:
∂L/∂(W_f · [h_{t-1} ‖ x_t] + b_f) = ∂L/∂f_t ⊙ f_t ⊙ (1 − f_t)
The sigmoid derivative f_t(1 − f_t) can saturate: if f_t ≈ 0 or f_t ≈ 1, the gradient through the gate is tiny. This is a potential issue if the gates get "stuck" at extremes.
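For a single dimension, the two backprop steps above combine into one expression. The sketch below uses a hypothetical helper name and made-up values (incoming gradient 1.0, previous cell state 2.0) to show how the factor f(1 − f) shrinks the gradient as the gate saturates:

```python
def forget_preactivation_grad(dL_dc, c_prev, f):
    """dL/da_f for one dimension, where f = sigmoid(a_f) is the forget gate value."""
    dL_df = dL_dc * c_prev        # from c_t = f_t * c_{t-1} + ...
    return dL_df * f * (1.0 - f)  # sigmoid'(a_f) = f * (1 - f)

for f in (0.5, 0.9, 0.99, 0.999):
    print(f, forget_preactivation_grad(1.0, 2.0, f))
```

The gradient is largest at f = 0.5, where the sigmoid derivative peaks at 0.25, and falls toward zero as f approaches either extreme.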
Key Terms
- 17-03 LSTM Mathematics
- Cell state c_t
- End-of-Subject Quiz
- Example 1: Tracking a Single Cell State Dimension
- Example 2: Gradient Flow Over Time
- Example 3: Gate Saturation Analysis
- Full Gradient Analysis
- Gate Dynamics and Interpretations
- Gradient Through Gates
- Hidden state h_t
- LSTM Architecture and Equations
- Parameter Count
Worked Examples
Example 1: Tracking a Single Cell State Dimension
Problem: For one dimension of cell state, f = [0.9, 0.5, 0.8], i = [0.1, 0.3, 0.2], C̃ = [1, 2, 3] over three time steps, with c₀ = 0. Compute c₁, c₂, c₃.
Solution:
c₁ = 0.9·0 + 0.1·1 = 0.1
c₂ = 0.5·0.1 + 0.3·2 = 0.05 + 0.6 = 0.65
c₃ = 0.8·0.65 + 0.2·3 = 0.52 + 0.6 = 1.12
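The same recurrence can be run as a short loop. The function name is a hypothetical helper; the inputs are the values from the problem statement.

```python
def track_cell(c0, f_seq, i_seq, cand_seq):
    """Return the cell state after each step of c <- f*c + i*cand."""
    states = []
    c = c0
    for f, i, cand in zip(f_seq, i_seq, cand_seq):
        c = f * c + i * cand
        states.append(c)
    return states

states = track_cell(0.0, [0.9, 0.5, 0.8], [0.1, 0.3, 0.2], [1.0, 2.0, 3.0])
print([round(c, 4) for c in states])  # [0.1, 0.65, 1.12]
```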
Example 2: Gradient Flow Over Time
Problem: For the cell state from Example 1, if ∂L/∂c₃ = 1, compute ∂L/∂c₁ (ignoring gate dependencies on c₁).
Solution: ∂L/∂c₁ = ∂L/∂c₃ · ∂c₃/∂c₂ · ∂c₂/∂c₁ = 1 · f₃ · f₂ = 1 · 0.8 · 0.5 = 0.40
The gradient is 0.40: attenuated, but not vanished! It crosses two steps with an average forget gate value of 0.65, so the decay is manageable. Compare to a scalar vanilla RNN with W_hh = 0.5: only (0.5)² = 0.25 of the gradient survives from the weight alone, before any further attenuation from tanh'.
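The backward pass along the cell state path is just a running product, which the following sketch makes explicit. The helper name is hypothetical, and the gates are held fixed, as the problem statement specifies.

```python
def cell_grad_back(dL_dc_last, forget_gates):
    """Backpropagate dL/dc through the cell state path with gates held fixed.

    forget_gates lists f for each crossed step in forward order.
    Returns [dL/dc_last, ..., dL/dc_first].
    """
    grad = dL_dc_last
    history = [grad]
    for f in reversed(forget_gates):
        grad *= f
        history.append(grad)
    return history

# f_2 = 0.5 links c_1 to c_2, and f_3 = 0.8 links c_2 to c_3
history = cell_grad_back(1.0, [0.5, 0.8])
print(history)  # [1.0, 0.8, 0.4] = [dL/dc_3, dL/dc_2, dL/dc_1]
```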
Example 3: Gate Saturation Analysis
Problem: An LSTM learns such that f_t ≈ 1.0 for all t for some dimension. What happens to: (a) the cell state, (b) the gradient?
Solution:
(a) c_t = 1·c_{t-1} + i_t·C̃_t = c_{t-1} + i_t·C̃_t. The cell state grows unboundedly if i_t·C̃_t is consistently positive; it's a running sum.
(b) ∂L/∂c_{t-k} has factor ∏ f = 1^k = 1. Gradients don't vanish. But ∂f/∂(preactivation) = f(1-f) ≈ 0, so the forget gate stops learning! This is the saturation tradeoff: perfect gradient flow through c, but no learning in the gate.
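Both halves of this answer can be illustrated with a few lines. In this sketch the constant `write` value stands in for i_t·C̃_t and is an assumed number, as is the helper name:

```python
import math

def accumulate(steps, write=0.5, f=1.0):
    """Run c <- f*c + write for `steps` steps, where write stands in for i_t * C̃_t."""
    c = 0.0
    for _ in range(steps):
        c = f * c + write
    return c

for steps in (10, 100, 1000):
    c = accumulate(steps)
    print(steps, c, math.tanh(c))  # c grows linearly; tanh(c) saturates near 1

f = 1.0
print(f * (1.0 - f))  # 0.0: no gradient reaches the forget gate pre-activation
```

The cell state keeps growing while tanh(c_t) is pinned near 1, so the hidden state stops reflecting further changes in that dimension.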
Quiz
Q2: What is the primary purpose of Full Gradient Analysis?
A) It is used only in advanced research contexts B) It is primarily a historical notation system C) It is used to trace how gradients flow through both the cell state path and the hidden state path D) It replaces all other methods in this domain
Correct: C)
- If you chose A: This is incorrect. Full Gradient Analysis serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Full Gradient Analysis serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: Full Gradient Analysis serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose D: This is incorrect. Full Gradient Analysis serves the purpose described in the correct answer. The other options misrepresent its role.
Q3: Which statement about Gate Dynamics and Interpretations is TRUE?
A) Gate Dynamics and Interpretations is an advanced topic beyond this subject's scope B) Gate Dynamics and Interpretations is a fundamental concept covered in this subject C) Gate Dynamics and Interpretations is not related to this subject D) Gate Dynamics and Interpretations is mentioned only as a historical footnote
Correct: B)
- If you chose A: This is incorrect. Gate Dynamics and Interpretations is a fundamental concept covered in this subject. This subject covers Gate Dynamics and Interpretations as part of its core content.
- If you chose B: Gate Dynamics and Interpretations is a fundamental concept covered in this subject. This subject covers Gate Dynamics and Interpretations as part of its core content. Correct!
- If you chose C: This is incorrect. Gate Dynamics and Interpretations is a fundamental concept covered in this subject. This subject covers Gate Dynamics and Interpretations as part of its core content.
- If you chose D: This is incorrect. Gate Dynamics and Interpretations is a fundamental concept covered in this subject. This subject covers Gate Dynamics and Interpretations as part of its core content.
Q4: Based on the cell state update c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t, what is ∂L/∂i_t?
A) The inverse of the correct answer B) ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values) C) An unrelated numerical value D) A different result from a common mistake
Correct: B)
- If you chose A: This is incorrect. Differentiating the cell state update shows that the result is ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values). The other options represent common errors.
- If you chose B: Differentiating the cell state update shows that the result is ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values). The other options represent common errors. Correct!
- If you chose C: This is incorrect. Differentiating the cell state update shows that the result is ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values). The other options represent common errors.
- If you chose D: This is incorrect. Differentiating the cell state update shows that the result is ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values). The other options represent common errors.
Q5: How are Gate Dynamics and Interpretations and Gradient Through Gates related?
A) Gate Dynamics and Interpretations is the inverse of Gradient Through Gates B) Gate Dynamics and Interpretations and Gradient Through Gates are closely related concepts C) Gate Dynamics and Interpretations is a special case of Gradient Through Gates D) Gate Dynamics and Interpretations and Gradient Through Gates are completely unrelated topics
Correct: B)
- If you chose A: This is incorrect. Both Gate Dynamics and Interpretations and Gradient Through Gates are covered in this subject as interconnected topics.
- If you chose B: Both Gate Dynamics and Interpretations and Gradient Through Gates are covered in this subject as interconnected topics. Correct!
- If you chose C: This is incorrect. Both Gate Dynamics and Interpretations and Gradient Through Gates are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Gate Dynamics and Interpretations and Gradient Through Gates are covered in this subject as interconnected topics.
Q6: What is a common pitfall when working with LSTM Architecture and Equations?
A) LSTM Architecture and Equations has no common misconceptions B) The main error with LSTM Architecture and Equations is using it when it is not needed C) A common mistake is confusing LSTM Architecture and Equations with a similar concept D) LSTM Architecture and Equations is always computed the same way in all contexts
Correct: C)
- If you chose A: This is incorrect. Students often confuse LSTM Architecture and Equations with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse LSTM Architecture and Equations with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: Students often confuse LSTM Architecture and Equations with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose D: This is incorrect. Students often confuse LSTM Architecture and Equations with similar-sounding or related concepts. Pay attention to the precise definitions.
Q7: When should you apply Parameter Count?
A) Avoid Parameter Count unless explicitly instructed B) Parameter Count is not practically useful C) Use Parameter Count only in pure mathematics contexts D) Apply Parameter Count to solve problems in this subject's domain
Correct: D)
- If you chose A: This is incorrect. Parameter Count is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Parameter Count is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. Parameter Count is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: Parameter Count is a practical tool used throughout this subject to solve relevant problems. Correct!
Practice Problems
Problem 1
What are the dimensions of W_f in an LSTM with d_h=256, d_in=128?
Answer
256 × (256 + 128) = 256 × 384. Each LSTM weight matrix maps the concatenation [h_{t-1} ‖ x_t] (dimension 384) to the hidden dimension (256).
Problem 2
If f_t = 0 for all t, what is c_t? What does the gradient ∂c_t/∂c₀ look like?
Answer
c_t = 0·c_{t-1} + i_t·C̃_t = i_t·C̃_t. The cell state has no memory of its own and behaves like a feedforward computation of the current gate inputs. ∂c_t/∂c₀ = 0: no gradient flows to earlier cell states, which is correct because earlier states were deliberately forgotten.
Problem 3
Derive ∂L/∂i_t in terms of ∂L/∂c_t and C̃_t.
Answer
From c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t: ∂L/∂i_t = ∂L/∂c_t ⊙ C̃_t (element-wise product of the incoming gradient and the candidate values).
Problem 4
Why is b_f typically initialized to 1 (or another positive value)?
Answer
With b_f = 1, the forget gate activation is σ(1) ≈ 0.73, biasing the LSTM toward remembering. This prevents the LSTM from forgetting everything at the start of training, allowing it to discover long-range dependencies before learning to forget selectively.
Problem 5
An LSTM processes a sequence of length 100. For a particular cell state dimension, f_t = 0.95 for all t. What fraction of the gradient at t=100 reaches t=1 through the cell state path?
Answer
0.95^99 ≈ 0.0062, so about 0.6% survives. Compare to a scalar vanilla RNN with W_hh = 0.5: 0.5^99 ≈ 1.6 × 10^{-30}, which is completely gone. Even with suboptimal gates (0.95 vs 1.0), LSTM is orders of magnitude better.
Summary
- LSTM adds a cell state c_t with the additive update c_t = f_t ⊙ c_{t-1} + i_t ⊙ C̃_t, creating a gradient highway
- The forget gate f_t gives element-wise control over memory retention; ∂c_t/∂c_{t-1} = diag(f_t), with no recurrent weight matrix involved
- The cell state gradient ∂c_t/∂c_{t-k} = ∏ diag(f_{t-j}) is a product of diagonal matrices, so each dimension is modulated independently
- The hidden state h_t = o_t ⊙ tanh(c_t) gates what the LSTM outputs without affecting the memory cell
- Saturation tradeoff: f_t ≈ 1 gives perfect gradient flow but stops gate learning; f_t ≈ 0 prevents gradient flow (correctly, since we chose to forget)
Pitfalls
- Assuming LSTMs completely solve vanishing gradients. The cell state gradient highway (∂c_t/∂c_{t-1} = diag(f_t)) helps immensely, but gradients flowing through the gate computations (W_f, W_i, W_C, W_o) still pass through sigmoid and tanh derivatives and can vanish over long sequences. LSTMs extend the effective range but don't grant infinite memory; dependencies spanning many hundreds of steps can still be difficult to learn.
- Initializing the forget gate bias b_f to zero. With b_f = 0, the forget gate activation starts at σ(0) = 0.5, biasing the LSTM toward rapid forgetting. Prefer a positive initial value (typically 1.0, giving σ(1) ≈ 0.73) so the LSTM starts with a "remember first, learn to forget later" prior.
- Forgetting that the cell state can grow unbounded. Unlike the hidden state h_t = o_t ⊙ tanh(c_t), which is bounded by tanh, the cell state c_t has no inherent bounding mechanism. With f_t ≈ 1 and consistently positive i_t ⊙ C̃_t, c_t becomes an unbounded running sum. This doesn't break the forward pass, but it saturates tanh(c_t) and can cause numerical issues in long sequences.
- Confusing the roles of forget gate f_t and output gate o_t. f_t controls what information is retained in the cell state c_t (memory management). o_t controls what the LSTM exposes from c_t to the outside world via h_t (information release). They serve fundamentally different purposes: an LSTM can remember information (f_t ≈ 1, c_t large) while choosing not to reveal it (o_t ≈ 0, h_t ≈ 0).
- Stacking too many LSTM layers. Each LSTM layer already has significant depth through time. Stacking more than 3 to 4 LSTM layers often yields diminishing returns while increasing training time and memory usage, since every layer carries the 4× parameter count.
Next Steps
Continue to 17-04: GRU Mathematics to see a simpler gated architecture that often achieves comparable performance with fewer parameters.