25-09 — Continual Learning

Phase: 25 (Frontiers & Active Research Areas)
Subject: 25-09
Prerequisites: 16 (Neural Networks), 24-01 (Fisher Information), 20 (Training & Fine-tuning)
Next subject: 25-10 (Multimodal Models Mathematics)


Learning Objectives

By the end of this subject, you will be able to:

  1. Define catastrophic forgetting and explain why standard SGD causes it
  2. Derive Elastic Weight Consolidation (EWC) from a Bayesian perspective using the Laplace approximation
  3. Implement experience replay and explain the stability-plasticity dilemma
  4. Compare regularisation-based, replay-based, and architectural approaches to continual learning
  5. Analyse the trade-offs between forward and backward transfer in continual learning benchmarks

Core Content

1. The Catastrophic Forgetting Problem

When a neural network is trained sequentially on multiple tasks, learning task $B$ degrades performance on previously-learned task $A$ — sometimes catastrophically, dropping to chance level. This is catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999).

⚠️ CRITICAL: Standard SGD with no continual learning mechanisms optimises only the current task's loss. The parameter updates that reduce loss on task $B$ typically move parameters out of the region where task $A$'s loss is low — there is no mechanism to preserve prior knowledge.
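The effect can be reproduced in a few lines. The sketch below is illustrative only: it assumes two hypothetical one-parameter tasks with quadratic losses, and it uses deterministic gradient descent as a stand-in for SGD. The optima `THETA_A` and `THETA_B`, the learning rate, and the step count are made-up values, not taken from any benchmark.

```python
# Toy illustration of catastrophic forgetting with plain gradient descent.
# Assumption: each "task" is a 1-D quadratic loss with its own optimum.

THETA_A = 1.0   # hypothetical optimum for task A
THETA_B = -2.0  # hypothetical optimum for task B


def loss(theta: float, optimum: float) -> float:
    """Quadratic loss 0.5 * (theta - optimum)^2."""
    return 0.5 * (theta - optimum) ** 2


def grad(theta: float, optimum: float) -> float:
    """Gradient of the quadratic loss with respect to theta."""
    return theta - optimum


def train(theta: float, optimum: float, lr: float = 0.1, steps: int = 200) -> float:
    """Gradient descent on the current task's loss only."""
    for _ in range(steps):
        theta -= lr * grad(theta, optimum)
    return theta


theta = train(0.0, THETA_A)            # learn task A
loss_a_before = loss(theta, THETA_A)   # close to 0
theta = train(theta, THETA_B)          # learn task B with no protection for A
loss_a_after = loss(theta, THETA_A)    # close to 0.5 * (THETA_B - THETA_A)^2

print(f"loss on A after training on A: {loss_a_before:.4f}")
print(f"loss on A after training on B: {loss_a_after:.4f}")
```

Nothing in the second call to `train` refers to task A, so the parameter simply moves to task B's optimum and the loss on A rises accordingly.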

The stability-plasticity dilemma (Grossberg, 1987):

- Stability: the ability to retain knowledge of previous tasks
- Plasticity: the ability to learn new tasks quickly

A system needs both — too much stability prevents learning, too much plasticity causes forgetting. Continual learning algorithms navigate this trade-off.

Formal problem statement: Given a sequence of tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_T$, each with its own data distribution $P_t(\mathbf{x}, y)$, learn a model $f_\theta$ that performs well on all tasks seen so far, without storing all previous data and without knowing task boundaries in advance.

Continual learning comes in three scenarios:

- Task-incremental: Task identity is known at both training and test time (easiest)
- Domain-incremental: Task identity is NOT known at test time, but the output structure is the same
- Class-incremental: New classes are added over time, and the model must distinguish all classes seen so far

2. Elastic Weight Consolidation (EWC)

EWC (Kirkpatrick et al., 2017) is a regularisation-based approach that penalises changes to parameters that are important for previous tasks.

Bayesian derivation: After learning task $A$, the posterior over parameters is:

$$\log p(\boldsymbol{\theta} | D_A) = \log p(D_A | \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) - \text{const}$$

For task $B$, the posterior given both datasets is:

$$\log p(\boldsymbol{\theta} | D_A, D_B) = \log p(D_B | \boldsymbol{\theta}) + \log p(\boldsymbol{\theta} | D_A) - \text{const}$$

The term $\log p(\boldsymbol{\theta} | D_A)$ is intractable — EWC approximates it with a Gaussian centred at the task $A$ optimum $\boldsymbol{\theta}_A^*$, with precision given by the Fisher information matrix:

$$p(\boldsymbol{\theta} | D_A) \approx \mathcal{N}(\boldsymbol{\theta}_A^*, F^{-1})$$

where $F = \mathbb{E}_{(\mathbf{x},y)\sim D_A}\left[(\nabla_\theta \log p(y|\mathbf{x}, \boldsymbol{\theta}_A^*))(\nabla_\theta \log p(y|\mathbf{x}, \boldsymbol{\theta}_A^*))^\top\right]$ is the Fisher information matrix, evaluated at $\boldsymbol{\theta}_A^*$.

This is a Laplace approximation — the log-posterior is Taylor-expanded around the mode to second order, and the Hessian is approximated by the Fisher.
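As a concrete instance of the Fisher formula, the sketch below estimates its diagonal for a logistic regression model, where $\nabla_\theta \log p(y|\mathbf{x}, \boldsymbol{\theta}) = (y - \sigma(\boldsymbol{\theta}^\top \mathbf{x}))\,\mathbf{x}$. The model choice, the function name `diagonal_fisher`, and the two-example dataset are assumptions made for illustration. The expectation uses the labels in the dataset, as in the formula above; this variant is often called the empirical Fisher.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def diagonal_fisher(
    theta: Sequence[float],
    data: Sequence[Tuple[Sequence[float], int]],
) -> List[float]:
    """Mean squared per-example gradient of log p(y | x, theta), per parameter.

    Assumes logistic regression, for which
    d/d theta_i log p(y | x, theta) = (y - sigmoid(theta . x)) * x_i.
    """
    fisher = [0.0] * len(theta)
    for x, y in data:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for i, xi in enumerate(x):
            g = (y - p) * xi
            fisher[i] += g * g
    return [f / len(data) for f in fisher]


# Hypothetical data: only the first feature is ever non-zero, so only the
# first parameter receives any Fisher mass.
theta_a = [0.0, 0.0]
data_a = [([1.0, 0.0], 1), ([2.0, 0.0], 0)]
fisher_diag = diagonal_fisher(theta_a, data_a)
print(fisher_diag)
```

A parameter whose input feature never varies gets zero Fisher information, which is the sense in which the Fisher diagonal measures how much task A depends on each parameter.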

EWC loss for task $B$:

$$L_{\text{EWC}}(\boldsymbol{\theta}) = L_B(\boldsymbol{\theta}) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{A,i}^*)^2$$

where $F_i$ is the $i$-th diagonal entry of the Fisher (the diagonal approximation is used for computational tractability). Parameters with high Fisher information (important for task $A$) incur a large penalty when they move away from $\boldsymbol{\theta}_A^*$.
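A minimal sketch of training with this loss, assuming a hypothetical two-parameter model whose task $B$ loss is the quadratic $\frac{1}{2}\|\boldsymbol{\theta} - \boldsymbol{\theta}_B\|^2$. All numbers (`theta_star`, `fisher`, `theta_b`, `lam`, the step size) are illustrative. For this toy the minimiser has the closed form $\theta_i = (\theta_{B,i} + \lambda F_i \theta_{A,i}^*) / (1 + \lambda F_i)$, which the loop should approach.

```python
from typing import List, Sequence


def ewc_grad(
    theta: Sequence[float],
    grad_b: Sequence[float],
    theta_star: Sequence[float],
    fisher: Sequence[float],
    lam: float,
) -> List[float]:
    """Gradient of L_B + (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return [
        g + lam * f * (t - ts)
        for g, f, t, ts in zip(grad_b, fisher, theta, theta_star)
    ]


# Hypothetical setup: parameter 0 matters for task A, parameter 1 does not.
theta_star = [1.0, 1.0]   # task A optimum
fisher = [10.0, 0.0]      # diagonal Fisher from task A
theta_b = [0.0, 0.0]      # optimum of the toy task B loss
lam = 1.0
lr = 0.05

theta = list(theta_star)  # start task B from the task A solution
for _ in range(500):
    grad_b = [t - tb for t, tb in zip(theta, theta_b)]  # gradient of toy L_B
    g = ewc_grad(theta, grad_b, theta_star, fisher, lam)
    theta = [t - lr * gi for t, gi in zip(theta, g)]

print(theta)  # parameter 0 stays near 1.0, parameter 1 moves to task B's optimum
```

The protected parameter settles between the two optima, weighted towards task A by $\lambda F_i$, while the unprotected one is free to follow task B.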

Extending to $T$ tasks: Maintain the Fisher diagonals and optimal parameters for all previous tasks, or use online EWC which maintains a single running Fisher and reference parameter set.
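One way to picture the online variant is as a running, decayed sum of per-task Fisher diagonals, $\tilde{F}_t = \gamma \tilde{F}_{t-1} + F_t$ (the form used in Practice Problem 4). The sketch below assumes that form; the function name, the decay value, and the per-task Fisher values are hypothetical. In this sketch the reference parameters would simply be overwritten with the latest optimum after each task.

```python
from typing import List, Sequence


def update_running_fisher(
    running: Sequence[float],
    new: Sequence[float],
    gamma: float,
) -> List[float]:
    """One online-EWC style update: decay the old estimate, add the new one."""
    return [gamma * r + n for r, n in zip(running, new)]


# Hypothetical per-task Fisher diagonals for a two-parameter model.
per_task_fisher = [
    [1.0, 0.0],  # task 1 relies on parameter 0
    [0.0, 1.0],  # task 2 relies on parameter 1
    [0.0, 0.0],  # task 3 relies on neither
]

running = [0.0, 0.0]
for f in per_task_fisher:
    running = update_running_fisher(running, f, gamma=0.5)

print(running)
```

Storage stays at one Fisher vector and one parameter vector regardless of the number of tasks, at the price of gradually down-weighting older tasks.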

3. Experience Replay

Experience replay stores a small subset of previous-task data and interleaves it with new-task data during training.

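Reservoir sampling: Maintain a memory buffer $\mathcal{M}$ of fixed size $M$. Store the first $M$ examples directly. For the $n$-th example with $n > M$, replace a uniformly chosen buffer entry with probability $M/n$; otherwise discard the new example. Every example seen so far then has the same probability $M/n$ of being in the buffer.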
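A minimal sketch of reservoir sampling over a stream, using only the standard library. The function name `reservoir_sample` and the `capacity` and `rng` parameters are illustrative choices; the first `capacity` items fill the buffer before any replacement happens.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T],
    capacity: int,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Keep a uniform random sample of at most `capacity` items from a stream."""
    rng = rng or random.Random()
    buffer: List[T] = []
    for seen, item in enumerate(stream, start=1):
        if len(buffer) < capacity:
            buffer.append(item)       # fill phase: keep the first `capacity` items
        else:
            j = rng.randrange(seen)   # uniform integer in [0, seen)
            if j < capacity:          # true with probability capacity / seen
                buffer[j] = item      # overwrite a uniformly chosen slot
    return buffer


memory = reservoir_sample(range(1000), capacity=10, rng=random.Random(0))
print(len(memory), memory)
```

Drawing one index `j` serves two purposes: the test `j < capacity` decides whether to keep the item, and `j` itself names the slot to overwrite.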

Training with replay: On each SGD step, sample a mini-batch that mixes current-task data with replay data:

$$L(\boldsymbol{\theta}) = \mathbb{E}_{(\mathbf{x},y)\sim D_{\text{current}}}[L(f_\theta(\mathbf{x}), y)] + \alpha \cdot \mathbb{E}_{(\mathbf{x},y)\sim \mathcal{M}}[L(f_\theta(\mathbf{x}), y)]$$

The replay term prevents forgetting by keeping the loss on previous tasks in the optimisation objective.
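The combined objective can be sketched on mini-batches as follows. This is an illustration under stated assumptions: the model is any callable from input to prediction, the per-example loss is squared error, and the names `replay_loss`, `current_batch`, and `replay_batch` are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, float]  # (x, y) pair for a 1-D regression toy


def squared_error(pred: float, target: float) -> float:
    return (pred - target) ** 2


def replay_loss(
    model: Callable[[float], float],
    current_batch: Sequence[Example],
    replay_batch: Sequence[Example],
    alpha: float,
) -> float:
    """Mean loss on the current-task batch plus alpha times mean loss on replay."""

    def mean_loss(batch: Sequence[Example]) -> float:
        return sum(squared_error(model(x), y) for x, y in batch) / len(batch)

    return mean_loss(current_batch) + alpha * mean_loss(replay_batch)


def toy_model(x: float) -> float:
    """Hypothetical fixed model, used only to evaluate the objective."""
    return 2.0 * x


current = [(1.0, 2.0), (2.0, 5.0)]  # current-task examples
replayed = [(1.0, 0.0)]             # examples drawn from the memory buffer
total = replay_loss(toy_model, current, replayed, alpha=0.5)
print(total)
```

Setting `alpha` to zero recovers plain training on the current task, so `alpha` is one of the knobs that moves a method along the stability-plasticity trade-off.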

⚠️ Trade-off: Replay is simple and effective but requires storing data — which may violate privacy constraints or memory limitations. Regularisation methods like EWC avoid storing data but are generally less effective at preventing forgetting.

4. Progressive Neural Networks

Instead of modifying a single model, progressive networks (Rusu et al., 2016) add new capacity for each task:

$$h_i^{(k)} = f\left(W_i^{(k)} h_{i-1}^{(k)} + \sum_{j < k} U_i^{(k,j)} h_{i-1}^{(j)}\right)$$

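Here $h_i^{(k)}$ is the layer-$i$ activation of column $k$, $W_i^{(k)}$ is that column's own weight matrix, and $U_i^{(k,j)}$ is the lateral connection from layer $i-1$ of the earlier, frozen column $j$.

Advantage: Zero forgetting, because the parameters of all previous columns are frozen.

Disadvantage: The number of columns grows linearly with the number of tasks, and the lateral connections make the total parameter count grow quadratically.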
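The layer equation above can be sketched with plain Python lists. The sketch assumes $f$ is ReLU and omits biases, following the linear lateral form written in the equation; the helper names and the tiny weight values are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(v: Sequence[float]) -> Vector:
    return [max(0.0, x) for x in v]


def progressive_layer(
    w_own: Matrix,
    h_prev_own: Sequence[float],
    laterals: Sequence[Matrix],
    h_prev_frozen: Sequence[Sequence[float]],
) -> Vector:
    """h_i^(k) = ReLU(W h_{i-1}^(k) + sum_j U^(k,j) h_{i-1}^(j)).

    `laterals[j]` multiplies the previous-layer activation of frozen column j.
    """
    pre = matvec(w_own, h_prev_own)
    for u, h in zip(laterals, h_prev_frozen):
        pre = [p + q for p, q in zip(pre, matvec(u, h))]
    return relu(pre)


# One new column with a single lateral input from one frozen column.
out = progressive_layer(
    w_own=[[1.0, -1.0]],
    h_prev_own=[2.0, 1.0],
    laterals=[[[0.5]]],
    h_prev_frozen=[[4.0]],
)
print(out)
```

Only `w_own` and `laterals` would be trained for the new task; the frozen columns contribute activations but receive no updates.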

5. Other Approaches

Synaptic Intelligence (SI): Similar to EWC but estimates parameter importance online during training by accumulating the contribution of each parameter to the loss decrease — no Fisher computation needed.

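Memory Aware Synapses (MAS): Estimates importance by measuring the sensitivity of the learned function to parameter changes: $\Omega_i = \mathbb{E}_{\mathbf{x}\sim D}[|\partial f_\theta(\mathbf{x})/\partial \theta_i|]$, averaged over the available data points. For a vector-valued output, the derivative is taken of the squared $\ell_2$ norm of $f_\theta(\mathbf{x})$. The estimate needs no labels, so it can be computed on unlabelled data.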

Learning without Forgetting (LwF): Uses knowledge distillation — the old model's predictions on new-task data serve as soft targets, encouraging the new model to preserve old-task responses.
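The distillation idea behind LwF can be sketched as a cross-entropy between the old model's softened predictions and the new model's predictions on the same input. The `temperature` parameter and the function names are assumptions for illustration (the text above does not specify them), and a full LwF objective would add the ordinary new-task loss to this term.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    old_logits: Sequence[float],
    new_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """Cross-entropy from the old model's soft targets to the new model's output."""
    targets = softmax(old_logits, temperature)
    preds = softmax(new_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, preds))


# Hypothetical old-task logits produced by both models on one new-task input.
old = [2.0, 0.0]
agree = distillation_loss(old, [2.0, 0.0])     # new model matches the old one
disagree = distillation_loss(old, [0.0, 2.0])  # new model has drifted
print(agree, disagree)
```

The loss is smallest when the new model reproduces the old model's output distribution, which is what pushes training to preserve old-task responses without storing old-task data.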



Key Terms

Worked Examples

Example 1: EWC Weight Importance

Problem: A 3-parameter model has been trained on task A. The Fisher diagonals are $F = [0.01, 5.2, 0.3]$ and optimal parameters $\boldsymbol{\theta}_A^* = [1.0, -0.5, 2.0]$. During task B training, the SGD update (without EWC) would change parameters to $[1.2, 2.1, 1.8]$. Compute the EWC penalty and the effective gradient correction for $\lambda = 10$.

Solution:

EWC penalty term: $\frac{\lambda}{2}\sum_i F_i(\theta_i - \theta_{A,i}^*)^2$, with $\frac{\lambda}{2} = 5$.

At the proposed point:

- Parameter 1: $(1.2-1.0)^2 \cdot 0.01 \cdot 5 = 0.04 \cdot 0.05 = 0.002$
- Parameter 2: $(2.1-(-0.5))^2 \cdot 5.2 \cdot 5 = 6.76 \cdot 26 = 175.76$
- Parameter 3: $(1.8-2.0)^2 \cdot 0.3 \cdot 5 = 0.04 \cdot 1.5 = 0.06$

Total penalty $\approx 175.8$, dominated by parameter 2, whose Fisher value (5.2) indicates high importance. The gradient of the EWC penalty for parameter 2 is $\lambda F_2(\theta_2 - \theta_{A,2}^*) = 10 \cdot 5.2 \cdot 2.6 = 135.2$, strongly opposing the proposed change.

Parameter 1 (Fisher 0.01) is essentially free to move — the penalty gradient is only $10 \cdot 0.01 \cdot 0.2 = 0.02$.
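The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative; the numbers are those given in the problem statement.

```python
lam = 10.0
fisher = [0.01, 5.2, 0.3]
theta_star = [1.0, -0.5, 2.0]  # task A optimum
theta = [1.2, 2.1, 1.8]        # point proposed by the unregularised update

# Per-parameter contributions to (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2
terms = [
    0.5 * lam * f * (t - ts) ** 2
    for f, t, ts in zip(fisher, theta, theta_star)
]
penalty = sum(terms)

# Gradient of the penalty with respect to each parameter
penalty_grad = [lam * f * (t - ts) for f, t, ts in zip(fisher, theta, theta_star)]

print([round(x, 3) for x in terms])
print(round(penalty, 3))
print([round(g, 2) for g in penalty_grad])
```

The per-parameter terms are about 0.002, 175.76 and 0.06, so the penalty is about 175.82, and the penalty gradients are about 0.02, 135.2 and -0.6.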

Example 2: Progressive Network Lateral Connections

Problem: A progressive network has 2 columns. Column 1 (trained on MNIST) has hidden layer $h_1^{(1)} = \text{ReLU}(W_1^{(1)}\mathbf{x})$. Column 2 (trained on Fashion-MNIST) receives lateral input from column 1. If $W_1^{(2)} \in \mathbb{R}^{128 \times 784}$ and $U_1^{(2,1)} \in \mathbb{R}^{128 \times 64}$ where column 1's hidden size is 64, write the forward pass for column 2's first hidden layer and compute its parameter count.

Solution:

Forward pass: $h_1^{(2)} = \text{ReLU}(W_1^{(2)}\mathbf{x} + U_1^{(2,1)}h_1^{(1)})$

Parameter count: $W_1^{(2)}$ has $128 \times 784 = 100,352$ params, $U_1^{(2,1)}$ has $128 \times 64 = 8,192$ params. Total for this layer: 108,544. The lateral connection adds only 8% overhead compared to the standard weight matrix, while providing access to column 1's learned features. Note that $W_1^{(1)}$ is NOT counted — it's frozen.
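The count can be reproduced directly. The sketch assumes bias-free layers, matching the forward pass written above; `dense_params` is a hypothetical helper.

```python
def dense_params(out_features: int, in_features: int) -> int:
    """Number of weights in a bias-free dense layer."""
    return out_features * in_features


w_params = dense_params(128, 784)  # column 2's own weights W_1^(2)
u_params = dense_params(128, 64)   # lateral weights U_1^(2,1) from column 1
total = w_params + u_params
overhead = u_params / w_params     # lateral cost relative to the own weights

print(w_params, u_params, total, f"{overhead:.1%}")
```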

Example 3: Stability-Plasticity Measurement

Problem: After sequential training on tasks $A \to B \to C$, evaluate the model on all three tasks. Results: $A = 0.92 \to 0.31 \to 0.28$, $B = \text{N/A} \to 0.88 \to 0.42$, $C = \text{N/A} \to \text{N/A} \to 0.85$. Compute the forgetting measure for task A after learning task B, and for task B after learning task C.

Solution:

Forgetting = (best performance on task) − (current performance)

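- Task A after B: $0.92 - 0.31 = 0.61$ (a drop of 61 percentage points, which is catastrophic)
- Task A after C: $0.92 - 0.28 = 0.64$ (64 percentage points of total forgetting)
- Task B after C: $0.88 - 0.42 = 0.46$ (46 percentage points of forgetting)

Average forgetting = $(0.64 + 0.46)/2 = 0.55$: on average the model loses 55 percentage points of accuracy on earlier tasks. Forward transfer from A to B would compare B's accuracy when trained after A with its accuracy when trained from scratch, for example as the ratio $B_{\text{with A}}/B_{\text{scratch}}$. It is not computable here because the from-scratch accuracy is not given.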
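The same computation as a short script, assuming the accuracy history is stored per task with `None` for stages before the task was trained. The dictionary layout and the `forgetting` helper are illustrative choices.

```python
from typing import Dict, List, Optional

# acc[task][k] = accuracy on `task` after training stage k (None = not yet trained)
acc: Dict[str, List[Optional[float]]] = {
    "A": [0.92, 0.31, 0.28],
    "B": [None, 0.88, 0.42],
    "C": [None, None, 0.85],
}


def forgetting(history: List[Optional[float]]) -> float:
    """Best accuracy at any earlier stage minus the accuracy at the final stage."""
    earlier = [a for a in history[:-1] if a is not None]
    final = history[-1]
    assert final is not None
    return max(earlier) - final


# Only tasks learned before the final stage can have been forgotten.
per_task = {
    task: forgetting(history)
    for task, history in acc.items()
    if any(a is not None for a in history[:-1])
}
average = sum(per_task.values()) / len(per_task)

print({t: round(f, 2) for t, f in per_task.items()}, round(average, 2))
```

Task C is excluded because it was learned last, so there is no later stage at which it could have been forgotten.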


Practice Problems

Problem 1: Derive the EWC regularisation term from the Bayesian posterior. Show how the Laplace approximation converts $\log p(\boldsymbol{\theta}|D_A)$ into the quadratic penalty.

Problem 2: Compare the memory cost of experience replay (storing $M$ examples) vs EWC (storing Fisher diagonals). For ImageNet-sized models (25M parameters), which is more memory-efficient?

Problem 3: Explain why progressive networks avoid catastrophic forgetting entirely. What is the cost?

Problem 4: The online EWC algorithm merges Fisher matrices: $\tilde{F}_t = \gamma \tilde{F}_{t-1} + F_t$, where $F_t$ is the Fisher after task $t$. Explain the role of $\gamma < 1$ and what happens as $t \to \infty$.

Problem 5: Design a continual learning benchmark that distinguishes between a method that genuinely retains knowledge and one that simply learns a single representation that works for all tasks. What metrics would you use?

Answers

**Problem 1:** $\log p(\boldsymbol{\theta}|D_A, D_B) = \log p(D_B|\boldsymbol{\theta}) + \log p(\boldsymbol{\theta}|D_A) + C$. Taylor expand $\log p(\boldsymbol{\theta}|D_A)$ around the mode $\boldsymbol{\theta}_A^*$. The first-order term vanishes because the gradient is zero at the mode, leaving $\log p(\boldsymbol{\theta}|D_A) \approx \log p(\boldsymbol{\theta}_A^*|D_A) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_A^*)^\top H(\boldsymbol{\theta} - \boldsymbol{\theta}_A^*)$ where $H = \nabla^2 \log p(\boldsymbol{\theta}|D_A)|_{\boldsymbol{\theta}_A^*}$. For maximum likelihood, $H \approx -\mathbb{E}[\nabla\log p \cdot \nabla\log p^\top] = -F$ (Fisher information). Minimising the negative log-posterior therefore adds the regulariser $\frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_A^*)^\top F(\boldsymbol{\theta} - \boldsymbol{\theta}_A^*)$ to the task $B$ loss. Scaling it by $\lambda$ and keeping only the diagonal of $F$ gives the EWC penalty $\frac{\lambda}{2}\sum_i F_i(\theta_i - \theta_{A,i}^*)^2$.

**Problem 2:** Replay: $M$ images × image dimensions. For CIFAR-10 (32×32×3 × 1 byte ≈ 3 KB/image), storing 1000 images ≈ 3 MB. For ImageNet (224×224×3 × 1 byte ≈ 150 KB/image), 1000 images ≈ 150 MB. EWC: 2 floats per parameter (Fisher diagonal + optimal value) = $2 \times 25\text{M} \times 4\text{ bytes} = 200\text{ MB}$. At these sizes EWC costs more memory than a 1000-image replay buffer; replay becomes the more expensive option only once $M$ exceeds roughly 1,300 ImageNet-sized images. EWC's cost can be reduced by compressing the Fisher diagonal or pruning low-importance parameters.

**Problem 3:** Progressive networks freeze previous columns. Their parameters never change, so their outputs (and thus task performance) are immutable. Cost: each new task adds a full network column, so the number of columns grows linearly. After $T$ tasks with $P$ parameters per column, the total is $O(T^2 P)$ parameters once lateral connections are included. This becomes infeasible for many tasks.

**Problem 4:** $\gamma < 1$ implements exponential decay of older Fisher contributions, so recent tasks get higher importance weight. Unrolling the recursion gives $\tilde{F}_t = \sum_{i=1}^t \gamma^{t-i}F_i$. The effective weight of task $i$ is $\gamma^{t-i}$, so as $t \to \infty$ very old tasks are effectively forgotten (their Fisher contribution decays to zero). This is a pragmatic compromise: it is usually acceptable to gradually "retire" very old tasks' importance, analogous to biological forgetting of distant memories.

**Problem 5:** Benchmark: a sequence of 5 diverse tasks (e.g., CIFAR-10, SVHN, Fashion-MNIST, notMNIST, CIFAR-100 subset). Metrics:

- **Average accuracy** after all tasks: mean test accuracy across all 5 tasks
- **Backward transfer (forgetting):** accuracy on task $i$ after task $T$ minus best accuracy on task $i$; negative values indicate forgetting
- **Forward transfer:** accuracy on task $i$ when trained after previous tasks vs trained from scratch; positive values indicate beneficial transfer
- **Memory size:** storage required at test time

Average accuracy alone cannot separate the two cases, so report forgetting per task alongside it. A method that genuinely retains knowledge keeps forgetting low on every earlier task; a high average combined with large forgetting on some tasks signals that retention has failed.

Summary

  1. Catastrophic forgetting is the fundamental challenge of continual learning — new task training overwrites parameters crucial for old tasks.
  2. EWC uses a Bayesian Laplace approximation to penalise changes to parameters with high Fisher information — it estimates parameter importance from the curvature of the loss.
  3. Experience replay stores and interleaves past data, providing direct supervision on old tasks — effective but requires memory and may violate privacy.
  4. Progressive networks eliminate forgetting by freezing old columns and adding new ones: zero forgetting, at the cost of a parameter count that grows with every task (quadratically once lateral connections are included).
  5. The stability-plasticity dilemma governs all CL approaches — methods trade off knowledge retention against learning capacity.

Pitfalls

  1. Diagonal Fisher approximation: EWC uses only diagonal entries of the Fisher matrix, ignoring off-diagonal interactions between parameters. This is computationally necessary but loses information about correlated parameter importance.
  2. Task boundary knowledge: Many CL methods assume known task boundaries. In realistic scenarios, data arrives continuously without clear boundaries — this is the harder "task-free" continual learning setting.
  3. Fixed memory replay: Reservoir sampling is unbiased but discards examples. Prioritised replay weights storage toward examples the model is most likely to forget, improving retention for the same memory budget.

Quiz

Question 1: Catastrophic forgetting occurs when:

A. The model runs out of GPU memory during training
B. Learning a new task degrades performance on previously learned tasks, sometimes to chance level
C. The learning rate is set too low for convergence
D. The training dataset is too small to learn anything useful

Correct Answer: B

Explanation

- **If you chose A:** Memory exhaustion is a resource issue, not catastrophic forgetting, which is about knowledge retention rather than memory capacity.
- **If you chose B:** Correct. Catastrophic forgetting means that SGD updates for task B overwrite parameters crucial for task A, causing performance on A to plummet.
- **If you chose C:** A low learning rate may slow training but doesn't cause forgetting of previous tasks.
- **If you chose D:** A small dataset limits what can be learned in the first place, but it does not cause forgetting of previously learned tasks.

Question 2: Elastic Weight Consolidation (EWC) penalizes changes to parameters based on:

A. The absolute magnitude of each parameter
B. The Fisher information matrix: parameters with high Fisher information (important for previous tasks) are heavily constrained from moving
C. Random selection of parameters to freeze
D. The initialization values of the parameters

Correct Answer: B

Explanation

- **If you chose A:** EWC uses the Fisher diagonal $F_i$, not raw parameter magnitude. A large parameter may have low importance if the loss is insensitive to it.
- **If you chose B:** Correct. The EWC penalty is $\frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{A,i}^*)^2$. Parameters with high $F_i$ contributed strongly to task A's loss curvature and are penalized heavily for moving.
- **If you chose C:** EWC uses a principled Bayesian estimate (Fisher information), not random selection.
- **If you chose D:** Initialization values are irrelevant; EWC anchors to the *optimal* parameters $\boldsymbol{\theta}_A^*$ after training on task A.

Question 3: Why does EWC use the diagonal approximation of the Fisher information matrix rather than the full matrix?

A. The diagonal is more accurate for measuring parameter importance
B. The full Fisher is $O(P^2)$ in size for $P$ parameters, so storing and inverting it is infeasible for modern models with millions of parameters
C. All off-diagonal entries are exactly zero by definition
D. The diagonal provides stronger regularization

Correct Answer: B

Explanation

- **If you chose A:** The full Fisher captures parameter correlations (off-diagonal entries) that the diagonal misses; the diagonal is a *cruder* approximation.
- **If you chose B:** Correct. For $P = 25$M parameters, the full Fisher would be a $25\text{M} \times 25\text{M}$ matrix (625 trillion entries). The diagonal requires only $P$ values, which makes it a practical necessity.
- **If you chose C:** Off-diagonal entries represent correlations between parameters and are generally non-zero.
- **If you chose D:** The full Fisher would provide *stronger* constraints by accounting for parameter correlations; the diagonal is a weaker but tractable substitute.

Question 4: Experience replay prevents catastrophic forgetting by:

A. Freezing all model parameters after each task
B. Storing a subset of past-task data and interleaving it with current-task data during training, providing direct supervision on old tasks
C. Adding L2 weight decay to the loss function
D. Increasing the model size for each new task

Correct Answer: B

Explanation

- **If you chose A:** Freezing all parameters would prevent any new learning. Freezing old columns while adding new ones is the progressive-network approach, not experience replay.
- **If you chose B:** Correct. By mixing replay data $\mathcal{M}$ with current data $D_{\text{current}}$ in each mini-batch, the loss on previous tasks remains in the optimization objective, preventing the parameters from drifting away.
- **If you chose C:** L2 regularization penalizes large weights but doesn't specifically preserve task-specific knowledge.
- **If you chose D:** Increasing model size alone doesn't prevent forgetting, because the old parameters can still be overwritten unless they are frozen.

Question 5: Progressive neural networks achieve zero catastrophic forgetting by:

A. Storing all data from all previous tasks
B. Freezing previously trained columns and adding new columns with lateral connections for each new task; previous parameters are never modified
C. Using extremely high learning rates to rapidly encode new tasks
D. Training all tasks simultaneously in one batch

Correct Answer: B

Explanation

- **If you chose A:** Progressive networks store no data; they freeze parameters instead.
- **If you chose B:** Correct. Each task gets its own column $k$. Previous columns $\{1,\ldots,k-1\}$ are frozen (immutable). New column $k$ receives lateral connections from all previous columns, enabling knowledge transfer without risking forgetting.
- **If you chose C:** High learning rates would exacerbate forgetting, not prevent it.
- **If you chose D:** Simultaneous training is multi-task learning, not continual learning, because the tasks don't arrive sequentially.

Question 6: The stability-plasticity dilemma refers to the trade-off between:

A. Model size and inference speed
B. Retaining knowledge of previous tasks (stability) vs. learning new tasks effectively (plasticity)
C. Training time and final test accuracy
D. Batch size and learning rate selection

Correct Answer: B

Explanation

- **If you chose A:** This is an engineering trade-off, not the stability-plasticity dilemma.
- **If you chose B:** Correct. Too much stability (e.g., freezing all parameters) prevents learning new tasks. Too much plasticity (e.g., standard SGD) causes catastrophic forgetting. Continual learning algorithms navigate this fundamental tension.
- **If you chose C:** Training time vs. accuracy is a compute trade-off, unrelated to sequential task learning.
- **If you chose D:** Batch size and learning rate are standard hyperparameter choices, not the dilemma that defines continual learning.


Next Steps

Final subject: 25-10 — Multimodal Models Mathematics — the endpoint of the entire LLM Researcher Mathematics Curriculum.