25-02 — Sparse Autoencoders (SAEs)

Phase: 25 (Frontiers & Active Research Areas)
Subject: 25-02
Prerequisites: 25-01 Mechanistic Interpretability, Phase 22 (VAEs)
Next subject: 25-03 (Grokking)

Learning Objectives

By the end of this subject, you will be able to:

Derive the SAE architecture and explain how it differs from standard autoencoders
Understand the role of the L1 sparsity penalty in producing interpretable features
Implement feature dictionary learning via SAEs on neural network activations
Diagnose and handle dead features and the resampling procedure
Evaluate SAE quality using reconstruction fidelity and interpretability metrics

Core Content

Why Sparse Autoencoders?

In 25-01, we saw that neural network activations contain interpretable features represented as directions in activation space, often in superposition. The key challenge: how do we automatically discover these feature directions?

A Sparse Autoencoder (SAE) is an unsupervised method that decomposes a model's activations into a sparse linear combination of learned feature vectors. Each learned vector corresponds (ideally) to a single interpretable feature.

SAEs were popularized as a core tool for mechanistic interpretability by Bricken et al. (2023) at Anthropic and, concurrently, by Cunningham et al. (2023).

SAE Architecture

An SAE is a single-hidden-layer autoencoder trained on model activations $\mathbf{x} \in \mathbb{R}^d$ (the residual stream or MLP output at some layer):

Encoder: $$\mathbf{f} = \text{ReLU}(W_{\text{enc}} \cdot (\mathbf{x} - \mathbf{b}_{\text{dec}}) + \mathbf{b}_{\text{enc}})$$

Decoder: $$\hat{\mathbf{x}} = W_{\text{dec}} \cdot \mathbf{f} + \mathbf{b}_{\text{dec}}$$

Where:
- $\mathbf{x} \in \mathbb{R}^d$: input activation vector (e.g., residual stream at a token position)
- $\mathbf{f} \in \mathbb{R}^h$: learned feature activations, with $h \gg d$; the ratio $h/d$ is the expansion factor (typically 4× to 32×)
- $W_{\text{enc}} \in \mathbb{R}^{h \times d}$: encoder weight matrix
- $W_{\text{dec}} \in \mathbb{R}^{d \times h}$: decoder weight matrix (each column is a feature direction in activation space)
- $\mathbf{b}_{\text{enc}} \in \mathbb{R}^h$: encoder bias
- $\mathbf{b}_{\text{dec}} \in \mathbb{R}^d$: decoder bias (often initialized as the mean of $\mathbf{x}$)
- $\hat{\mathbf{x}} \in \mathbb{R}^d$: reconstructed activation

⚠️ CRITICAL: The decoder bias $\mathbf{b}_{\text{dec}}$ is subtracted before encoding and added back after decoding. This centering trick is essential — it allows the encoder to focus on deviations from the mean, which is where the interesting features live.
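The encoder and decoder equations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes, the random initialization, and the choice to start `W_enc` as the transpose of `W_dec` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 32  # illustrative sizes (expansion factor h/d = 4)

# Hypothetical initialization, for illustration only.
W_dec = rng.normal(size=(d, h))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm columns
W_enc = W_dec.T.copy()   # one possible starting point (assumption)
b_enc = np.zeros(h)
b_dec = np.zeros(d)      # often initialized to the mean activation instead

def encode(x):
    # f = ReLU(W_enc (x - b_dec) + b_enc)
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)

def decode(f):
    # x_hat = W_dec f + b_dec
    return W_dec @ f + b_dec

x = rng.normal(size=d)   # stand-in for one activation vector
f = encode(x)
x_hat = decode(f)
print(f.shape, x_hat.shape, int((f > 0).sum()))
```

Note that `b_dec` appears in both functions: it is subtracted inside `encode` and added back inside `decode`, matching the centering described above.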

The Loss Function

The SAE is trained to minimize:

$$\mathcal{L}(\mathbf{x}) = \underbrace{\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2}_{\text{Reconstruction}} + \lambda \underbrace{\|\mathbf{f}\|_1}_{\text{Sparsity}}$$

The reconstruction term $\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2$ ensures the SAE faithfully represents the activation space
The L1 sparsity penalty $\lambda\|\mathbf{f}\|_1$ encourages each input to be represented by only a few active features
$\lambda$ is the sparsity coefficient — a crucial hyperparameter

The L1 penalty pushes most feature activations to zero, creating a sparse representation. This is motivated by the superposition hypothesis: features are naturally sparse (only a few are active per input), and sparsity allows the SAE to disentangle them.
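As a rough sketch of how the two loss terms are computed for a single input, the function below evaluates the reconstruction and sparsity terms separately. The function name `sae_loss` and the tiny identity-matrix example are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def sae_loss(x, W_enc, W_dec, b_enc, b_dec, lam):
    """Return (total, reconstruction, sparsity) for one activation vector x."""
    f = np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)
    x_hat = W_dec @ f + b_dec
    recon = np.sum((x - x_hat) ** 2)   # squared L2 reconstruction error
    sparsity = np.sum(np.abs(f))       # L1 norm of the feature activations
    return recon + lam * sparsity, recon, sparsity

# Tiny hand-checkable case: identity weights, zero biases (assumed values).
I2 = np.eye(2)
total, recon, sparsity = sae_loss(
    x=np.array([1.0, -1.0]), W_enc=I2, W_dec=I2,
    b_enc=np.zeros(2), b_dec=np.zeros(2), lam=0.1,
)
print(total, recon, sparsity)
```

In this case ReLU clips the negative coordinate, so only one feature is active (sparsity term 1), the clipped coordinate is lost (reconstruction term 1), and the total is 1 + 0.1 × 1 = 1.1.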

Feature Dictionary Learning

The SAE can be understood as dictionary learning on activations:

Dictionary: The columns of $W_{\text{dec}}$ are $h$ feature vectors (directions in $\mathbb{R}^d$)
Sparse code: The vector $\mathbf{f} \in \mathbb{R}^h$ is a sparse coefficient vector — only a few entries are nonzero
Reconstruction: The activation is approximated as a sparse combination of dictionary columns plus the decoder bias: $\mathbf{x} \approx \hat{\mathbf{x}} = \sum_i f_i \cdot W_{\text{dec}}[:, i] + \mathbf{b}_{\text{dec}}$

This is analogous to compressed sensing / sparse coding, but applied to learned representations from language models rather than natural signals.
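To make the dictionary view concrete, the sketch below builds a reconstruction by summing only the active dictionary columns and checks that it agrees with the matrix form of the decoder. The dictionary and sparse code are hand-picked values chosen for illustration.

```python
import numpy as np

# Hypothetical dictionary with d = 2, h = 4 and unit-norm columns.
W_dec = np.array([[1.0, 0.0, 0.6, -0.8],
                  [0.0, 1.0, 0.8,  0.6]])
b_dec = np.zeros(2)
f = np.array([2.0, 0.0, 0.0, 0.5])   # sparse code: 2 of 4 entries nonzero

active = np.flatnonzero(f)            # indices of active features
x_hat_sum = b_dec + sum(f[i] * W_dec[:, i] for i in active)
x_hat_mat = W_dec @ f + b_dec         # same thing in matrix form

print(active, x_hat_sum, x_hat_mat)
```

Only the columns indexed by `active` contribute, which is what makes each reconstruction readable as "a few features, each with a strength".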

The Tied-Weights Variant

A common design choice: tie the encoder and decoder weights.

$$W_{\text{enc}} = W_{\text{dec}}^T$$

Combined with unit-norm decoder columns ($\|W_{\text{dec}}[:, i]\|_2 = 1$), this gives:

$$\mathbf{f} = \text{ReLU}(W_{\text{dec}}^T \cdot (\mathbf{x} - \mathbf{b}_{\text{dec}}) + \mathbf{b}_{\text{enc}})$$

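This ties the encoding and decoding directions: each feature has one direction used for both reading (encoder) and writing (decoder). Tying reduces capacity, but it halves the number of weight parameters and can produce cleaner features. It is not universal, though: many implementations, including Bricken et al. (2023), keep the encoder and decoder untied.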
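A small sketch of the tied variant, under assumed sizes and a constant encoder bias of -0.1: with unit-norm columns, each pre-activation is the projection of the centered input onto a feature direction, shifted by the bias.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 4, 12                                  # illustrative sizes
W_dec = rng.normal(size=(d, h))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)   # unit-norm columns
b_dec = np.zeros(d)
b_enc = np.full(h, -0.1)                      # assumed small negative bias

def encode_tied(x):
    # Tied weights: the encoder matrix is the transpose of the decoder matrix.
    return np.maximum(0.0, W_dec.T @ (x - b_dec) + b_enc)

# An input that is exactly 3 units along feature direction 5.
x = 3.0 * W_dec[:, 5]
f = encode_tied(x)
print(f[5], int(np.argmax(f)))
```

Feature 5 reads back its own strength minus the bias (3 - 0.1), and no other feature can exceed it because every other column has cosine similarity below 1 with the input direction.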

Dead Features and Resampling

A dead feature is one whose activation $f_i$ is zero across the entire training dataset (or nearly so — e.g., active on < 0.01% of tokens). Dead features waste capacity because they never contribute to reconstructions.

Why do features die?
- The ReLU activation outputs zero for any input below the bias threshold
- If $\mathbf{b}_{\text{enc}}[i]$ drifts too negative, the feature never activates
- Features initialized in "unlucky" directions may never receive gradient signal

Resampling procedure:
1. After every $K$ training steps (e.g., every 25K steps), identify dead features
2. For each dead feature $i$, reinitialize its decoder vector $W_{\text{dec}}[:, i]$ as a random direction scaled to unit norm
3. Set its encoder bias $\mathbf{b}_{\text{enc}}[i]$ to zero
4. The optimizer then fine-tunes the reinitialized features

Resampling gives dead features another chance to become useful. Without it, SAEs with large expansion factors can lose 50% or more of their features to death.
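The resampling steps above could be sketched as follows. The helper name `resample_dead_features`, the firing-count bookkeeping, and the choice to point the encoder row along the new decoder direction are assumptions for illustration; resetting optimizer state for the resampled features is not shown.

```python
import numpy as np

def resample_dead_features(W_enc, W_dec, b_enc, fire_counts, n_tokens, rng,
                           threshold=1e-4):
    """Reinitialize, in place, features that fired on < threshold of tokens."""
    dead = np.flatnonzero(fire_counts / n_tokens < threshold)
    if dead.size:
        new_dirs = rng.normal(size=(W_dec.shape[0], dead.size))
        new_dirs /= np.linalg.norm(new_dirs, axis=0, keepdims=True)
        W_dec[:, dead] = new_dirs     # step 2: random unit-norm decoder vectors
        W_enc[dead, :] = new_dirs.T   # assumption: align encoder rows with them
        b_enc[dead] = 0.0             # step 3: zero the encoder bias
    return dead

rng = np.random.default_rng(0)
d, h = 3, 5
W_dec = rng.normal(size=(d, h))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)
W_enc = W_dec.T.copy()
b_enc = np.full(h, -5.0)
fire_counts = np.array([0, 500, 3, 120, 0])   # hypothetical counts per feature
dead = resample_dead_features(W_enc, W_dec, b_enc, fire_counts, 100_000, rng)
print(dead, b_enc)
```

With a 0.01% threshold over 100,000 tokens, features 0, 2, and 4 fall below 10 firings and are reinitialized; the other two keep their weights and biases.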

Interpreting SAE Features

Once trained, each active feature $i$ in the SAE corresponds to a direction $W_{\text{dec}}[:, i]$ in activation space. To interpret what a feature represents:

Top-activating examples: Run the SAE on a large corpus and find the tokens/sentences where feature $i$ fires most strongly
Activation patching: Ablate or boost feature $i$'s activation and observe the effect on model output
Feature visualization: For vision models, optimize an input that maximally activates feature $i$
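As a sketch of the first technique, the snippet below ranks tokens by one feature's activation. The token list and activation matrix are made-up values; in practice `F` would come from running the SAE encoder over a large corpus.

```python
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat"]   # hypothetical corpus
# Hypothetical feature activations, shape (n_tokens, h) with h = 2.
F = np.array([[0.9, 0.0],
              [0.0, 2.5],
              [0.0, 0.1],
              [0.0, 0.0],
              [1.1, 0.0],
              [0.0, 1.7]])

def top_activating(F, tokens, feature, k):
    """Return up to k (token, activation) pairs, strongest first."""
    order = np.argsort(F[:, feature])[::-1][:k]
    return [(tokens[j], float(F[j, feature])) for j in order if F[j, feature] > 0]

top = top_activating(F, tokens, feature=1, k=2)
print(top)
```

Here feature 1 fires most strongly on "cat" and "mat", which would be the starting point for forming a hypothesis about what it represents.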

Evaluation Metrics

L0 sparsity: Average number of active features per token: $\mathbb{E}_{\mathbf{x}}[\|\mathbf{f}\|_0]$
Reconstruction loss: $\mathbb{E}[\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2]$, which must be low enough that the model's behavior is preserved when $\hat{\mathbf{x}}$ is used in place of $\mathbf{x}$
Fraction of variance unexplained (FVU): $\mathbb{E}[\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2] / \mathbb{E}[\|\mathbf{x}\|_2^2]$ for mean-centered $\mathbf{x}$; for uncentered activations the denominator is $\mathbb{E}[\|\mathbf{x} - \bar{\mathbf{x}}\|_2^2]$
Feature density: Fraction of tokens where feature $i$ is active — interpretable features typically have density 0.01%–5%
Dead feature fraction: Percentage of features with zero activations
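The metrics above can be computed together from a batch of activations, reconstructions, and feature activations. This is a minimal sketch: the function name `sae_metrics` is hypothetical, and the FVU follows the formula given above, which treats the activations as already mean-centered.

```python
import numpy as np

def sae_metrics(X, X_hat, F, dead_threshold=1e-4):
    """X, X_hat: (n, d) activations and reconstructions; F: (n, h) features."""
    active = F > 0
    mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    density = active.mean(axis=0)                 # per-feature firing rate
    return {
        "l0": active.sum(axis=1).mean(),          # mean active features per token
        "mse": mse,
        "fvu": mse / np.mean(np.sum(X ** 2, axis=1)),
        "density": density,
        "dead_fraction": np.mean(density < dead_threshold),
    }

# Tiny hand-checkable batch (made-up numbers).
X = np.array([[1.0, 0.0], [0.0, 2.0]])
X_hat = np.array([[1.0, 0.0], [0.0, 1.0]])
F = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
m = sae_metrics(X, X_hat, F)
print(m["l0"], m["mse"], m["fvu"], m["dead_fraction"])
```

For this batch the mean squared error is 0.5, the mean squared norm is 2.5 (so FVU = 0.2), one feature is active per token, and the third feature never fires.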

Key Terms

Dead features
Decoder columns
SAEs
Sparse Autoencoder (SAE)

Worked Examples

Example 1: Toy SAE on Synthetic Data

Problem: You have 2D activations from a model, and you suspect they encode two features. You are given the following:
- Feature A appears alone in inputs $\mathbf{x}_1 = [2, 0]^T$, $\mathbf{x}_2 = [3, 0.2]^T$
- Feature B appears alone in inputs $\mathbf{x}_3 = [0, 2]^T$, $\mathbf{x}_4 = [0.1, 2.5]^T$
- Both features appear together in $\mathbf{x}_5 = [2, 2]^T$

Set up a toy SAE with $d=2$, $h=4$, $\lambda=0.1$ (no gradient steps need to be computed by hand). What should the decoder columns converge to?

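Solution:
1. The activation space is $\mathbb{R}^2$, so $d=2$. With $h=4$, the dictionary is 2× overcomplete.
2. The mean of the five inputs is $\bar{\mathbf{x}} = [1.42, 1.34]^T$. Initializing $\mathbf{b}_{\text{dec}}$ to the mean is a common heuristic, but it is only a starting point. Centering gives $\mathbf{x}_1' = [0.58, -1.34]^T$, $\mathbf{x}_2' = [1.58, -1.14]^T$, $\mathbf{x}_3' = [-1.42, 0.66]^T$, $\mathbf{x}_4' = [-1.32, 1.16]^T$, $\mathbf{x}_5' = [0.58, 0.66]^T$, and the negative coordinates cannot be written as nonnegative combinations of $[1, 0]^T$ and $[0, 1]^T$.
3. Because every input here is (approximately) a nonnegative combination of two axis-aligned features, the sparsest solution has $\mathbf{b}_{\text{dec}} \approx \mathbf{0}$, and a trained SAE would be expected to move the bias in that direction.
4. The data varies along two main directions: $[1, 0]^T$ (Feature A) and $[0, 1]^T$ (Feature B).
5. Decoder columns should converge to unit vectors along these axes: $\mathbf{d}_1 \approx [1, 0]^T$, $\mathbf{d}_2 \approx [0, 1]^T$, with $\mathbf{d}_3, \mathbf{d}_4$ either dying or capturing residual variance.
6. For $\mathbf{x}_5$, the SAE should output $f_1 \approx 2$, $f_2 \approx 2$, $f_3=f_4=0$, and reconstruction $\hat{\mathbf{x}}_5 \approx 2\mathbf{d}_1 + 2\mathbf{d}_2 + \mathbf{b}_{\text{dec}} \approx [2, 2]^T$, which matches $\mathbf{x}_5$ (up to a small shrinkage of the activations caused by the L1 penalty).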

Answer

The decoder columns converge to the feature directions $[1,0]^T$ and $[0,1]^T$ (in either order). With $h=4 > d=2$, the SAE can learn a sparse overcomplete representation where each feature gets its own dictionary element. The extra 2 columns likely capture noise or die.
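The expected solution can be checked numerically by plugging in hand-set parameters. This is not a trained SAE: the decoder columns, the strongly negative biases that keep features 3 and 4 silent, and the choice $\mathbf{b}_{\text{dec}} = \mathbf{0}$ are all assumptions picked to mirror the axis-aligned solution.

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 0.2], [0.0, 2.0], [0.1, 2.5], [2.0, 2.0]])

# Hand-set parameters (assumptions, not a trained result).
W_dec = np.array([[1.0, 0.0, 0.6,  0.8],
                  [0.0, 1.0, 0.8, -0.6]])    # unit-norm columns d_1..d_4
W_enc = W_dec.T                              # tied weights for simplicity
b_enc = np.array([0.0, 0.0, -10.0, -10.0])   # features 3 and 4 never fire
b_dec = np.zeros(2)

F = np.maximum(0.0, (X - b_dec) @ W_enc.T + b_enc)   # shape (5, 4)
X_hat = F @ W_dec.T + b_dec                          # shape (5, 2)

print(F[4], X_hat[4])   # feature activations and reconstruction for x_5
```

With these values the two axis-aligned features carry every input on their own, and the input containing both features activates both of them with strength 2.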

Example 2: Dead Feature Detection

Problem: An SAE with $h = 8192$ features is trained on GPT-2 layer 6 activations ($d=768$). After training, feature activations are collected over 100K tokens. The following feature activity statistics are computed:

Feature 0: active on 0 tokens (0%)
Feature 127: active on 3 tokens (0.003%)
Feature 5000: active on 15,000 tokens (15%)
Feature 7200: active on 0 tokens (0%)

The threshold for "alive" is 0.01% (10 tokens out of 100K). Which of these features are dead? What is the dead fraction among the four listed?

Solution:
- Feature 0: 0 tokens → dead (0% < 0.01%)
- Feature 127: 3 tokens, 0.003% → dead (below threshold)
- Feature 5000: 15,000 tokens, 15% → alive (well above threshold)
- Feature 7200: 0 tokens → dead

Dead feature fraction among the four listed features: 3/4 = 75%. The SAE-wide fraction would need the same count over all 8,192 features. In a real SAE, dead features like these would trigger resampling.

Answer

Features 0, 127, and 7200 are dead, so 75% of the four listed features are dead. After resampling, their decoder directions would be randomly reinitialized and encoder biases reset to zero. Feature 5000 is unusually dense (15%): it may represent a very common feature or be an artifact.
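The same check takes only a few lines of plain Python. The dictionary below holds just the four features from the problem statement; a real run would hold counts for all 8,192.

```python
# Firing counts for the four features listed in the problem.
counts = {0: 0, 127: 3, 5000: 15_000, 7200: 0}
n_tokens = 100_000
threshold = 1e-4          # 0.01% of tokens

dead = [i for i, c in counts.items() if c / n_tokens < threshold]
dead_fraction = len(dead) / len(counts)
print(dead, dead_fraction)
```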

Example 3: Reconstruction Error Analysis

Problem: An SAE is evaluated on 1000 test activations $\mathbf{x}_i$. The reconstruction errors are:
- Mean squared error (MSE): 0.004
- Mean $\|\mathbf{x}\|^2$: 0.10

Compute the FVU (fraction of variance unexplained). If replacing activations with SAE reconstructions changes the model's next-token prediction accuracy from 75% to 74.2%, is the SAE good enough?

Solution: $$\text{FVU} = \frac{\mathbb{E}[\|\mathbf{x} - \hat{\mathbf{x}}\|^2]}{\mathbb{E}[\|\mathbf{x}\|^2]} = \frac{0.004}{0.10} = 0.04 = 4\%$$

This means the SAE explains 96% of activation variance — excellent. The model accuracy drops only 0.8 percentage points (75% → 74.2%), suggesting the SAE faithfully preserves model behavior.

Answer

FVU = 4%. This is very good reconstruction quality. The 0.8pp accuracy drop is small enough that downstream interpretability analyses using the SAE are likely valid. Typically, FVU < 10% and accuracy drops < 2% are considered acceptable.

Practice Problems

Problem 1: Derive the gradient of the SAE loss with respect to $W_{\text{dec}}[:, i]$ for a single input $\mathbf{x}$. Assume $\mathbf{f}$ has already been computed (no gradient through the encoder).

Answer

Let $\mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 + \lambda\|\mathbf{f}\|_1$ where $\hat{\mathbf{x}} = W_{\text{dec}}\mathbf{f} + \mathbf{b}_{\text{dec}}$. The sparsity term doesn't depend on $W_{\text{dec}}$, so: $$\frac{\partial \mathcal{L}}{\partial W_{\text{dec}}[:, i]} = \frac{\partial}{\partial W_{\text{dec}}[:, i]} \|\mathbf{x} - W_{\text{dec}}\mathbf{f} - \mathbf{b}_{\text{dec}}\|^2$$ Let $\mathbf{r} = \mathbf{x} - \hat{\mathbf{x}}$ (the residual). Then: $$\frac{\partial \mathcal{L}}{\partial W_{\text{dec}}[:, i]} = -2 f_i \cdot \mathbf{r}$$ So the update for column $i$ is proportional to the feature activation $f_i$ times the reconstruction error — features that are more active get larger gradients, and the gradient points in the direction that would reduce the error.
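The closed-form gradient can be sanity-checked against central finite differences. The sizes, random values, and step size below are arbitrary choices for illustration; $\mathbf{f}$ is held fixed, as the problem assumes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, lam = 3, 5, 0.1
x = rng.normal(size=d)
W_dec = rng.normal(size=(d, h))
b_dec = rng.normal(size=d)
f = np.abs(rng.normal(size=h))     # fixed, nonnegative feature activations

def loss(W):
    r = x - (W @ f + b_dec)
    return r @ r + lam * np.abs(f).sum()

r = x - (W_dec @ f + b_dec)        # residual
analytic = -2.0 * np.outer(r, f)   # column i equals -2 * f_i * r

eps = 1e-6
numeric = np.zeros_like(W_dec)
for a in range(d):
    for i in range(h):
        E = np.zeros_like(W_dec)
        E[a, i] = eps
        numeric[a, i] = (loss(W_dec + E) - loss(W_dec - E)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The printed value is the largest disagreement between the two gradients and should be close to zero, since the loss is quadratic in $W_{\text{dec}}$ and central differences are exact for quadratics up to rounding.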

Problem 2: An SAE has encoder bias $\mathbf{b}_{\text{enc}}$. For a particular feature $i$, $\mathbf{b}_{\text{enc}}[i] = -2.0$ and $W_{\text{enc}}[i, :] \cdot (\mathbf{x} - \mathbf{b}_{\text{dec}}) = 1.5$ for a typical input. Does feature $i$ activate? What does the bias value represent?

Answer

The pre-activation is $1.5 + (-2.0) = -0.5$. Since ReLU(-0.5) = 0, the feature does **not** activate. The encoder bias acts as a **threshold**: the feature only activates when the dot-product signal exceeds $-\mathbf{b}_{\text{enc}}[i] = 2.0$. In this example, the signal (1.5) is below the threshold (2.0). A more negative bias means the feature is harder to activate (higher threshold). Biases that are too negative cause dead features.

Problem 3: Explain why the expansion factor $h/d$ matters. What happens if $h = d$ (no expansion)? What if $h \gg d$?

Answer

- **$h = d$ (no expansion):** The SAE has exactly as many features as dimensions. In the no-superposition regime, this could work, with one feature per dimension. But if the model uses superposition (representing $n > d$ features), the SAE can't disentangle them. Features will be polysemantic or mixed.
- **$h \gg d$ (large expansion, e.g., 8×–32×):** The SAE has enough dictionary elements to assign each underlying feature its own direction. This enables disentanglement even when the model uses superposition. However, more features mean more training cost, more dead features, and larger dictionaries to interpret.
- **Sweet spot:** Typically $h/d \in [4, 16]$ works well in practice, balancing disentanglement quality and computational cost.

Problem 4: You train two SAEs on the same activations with different $\lambda$ values. SAE-A ($\lambda=0.001$) achieves FVU=2%, L0=45. SAE-B ($\lambda=0.01$) achieves FVU=8%, L0=8. Which is likely better for interpretability research? Why?

Answer

SAE-B (L0=8, FVU=8%) is likely better for interpretability research. While SAE-A has better reconstruction, its L0=45 means an average of 45 features are active per token — too many to easily interpret. SAE-B's sparser representation (8 features/token) makes it much easier to understand which features contribute to each prediction. The slightly higher FVU (8% vs 2%) is an acceptable tradeoff for interpretability. The goal of SAEs is understanding, not perfect reconstruction.

Problem 5: After training, you discover that feature 42's top activating examples all contain the token "the." Is this an interpretable feature? How would you investigate further?

Answer

Probably not interpretable as it stands: "the" is so common that a feature firing for it may be a junk/stopword artifact. To investigate further:
1. Look at examples where "the" appears but feature 42 does NOT activate. What's different?
2. Look at the strongest (highest activation) examples specifically. The top 10 may reveal a subtler pattern (e.g., "the" + legal context, or "the" at sentence start)
3. Check if the feature activates for inputs without "the". If so, something else is going on
4. If it truly only fires for "the" and nothing else, it's a monosemantic "the" feature: interpretable but not very interesting
5. The feature may be part of a positional or syntactic circuit that correlates heavily with "the"

Summary

SAEs automatically discover interpretable features from neural network activations by learning a sparse overcomplete decomposition
The architecture is a single-hidden-layer autoencoder with an L1 sparsity penalty that forces sparse feature activations: $\mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 + \lambda\|\mathbf{f}\|_1$
Decoder columns are the learned feature directions in activation space; each column corresponds to one dictionary element
Dead features occur when a feature never activates — resampling periodically reinitializes them to prevent capacity waste
The expansion factor $h/d$ controls capacity for disentangling superposition: larger $h$ enables better disentanglement at computational cost
SAE quality is measured by reconstruction fidelity (FVU) and sparsity (L0); the tradeoff between them is controlled by $\lambda$

Quiz

Question 1: In the SAE architecture, what is the purpose of subtracting $\mathbf{b}_{\text{dec}}$ before encoding?

A. It adds nonlinearity to the encoding
B. It centers the activations so the encoder focuses on deviations from the mean
C. It's a regularization trick with no semantic meaning
D. It ensures the decoder weights are orthogonal

Correct Answer: B

Explanation

- **If you chose A:** The centering is linear; the nonlinearity comes from ReLU.
- **If you chose B:** Correct. Subtracting the mean (decoder bias) centers the data, making it easier for the encoder to learn feature directions that capture meaningful deviations.
- **If you chose C:** It has a clear meaning: it is mean-centering.
- **If you chose D:** Centering does not enforce orthogonality, and neither do unit-norm constraints. With $h > d$, the decoder columns cannot all be mutually orthogonal.

Question 2: An SAE with $d=512$ uses an expansion factor of 8. How many features does it learn?

A. 512
B. 4,096
C. 64
D. 8

Correct Answer: B

Explanation

- **If you chose A:** That's the input dimension, not the number of features.
- **If you chose B:** Correct. $h = 8 \times 512 = 4,096$ features.
- **If you chose C:** That is $512 / 8$, which would be compression rather than expansion.
- **If you chose D:** This is the expansion factor, not the feature count.

Question 3: What does a dead feature in an SAE mean?

A. The feature fires for a very specific concept (like a "dead ringer")
B. The feature's decoder vector has zero norm
C. The feature never (or almost never) activates on any input during training
D. The feature's reconstruction error is zero

Correct Answer: C

Explanation

- **If you chose A:** Dead features don't fire at all, so they represent nothing.
- **If you chose B:** The decoder vector might have non-zero norm; the encoder threshold is the issue.
- **If you chose C:** Correct. Dead features contribute nothing to reconstructions. They're "dead weight."
- **If you chose D:** A feature with zero error would be perfectly useful, not dead.

Question 4: The L1 sparsity penalty $\lambda\|\mathbf{f}\|_1$ in the SAE loss:

A. Encourages all features to be active on every input
B. Encourages each feature to have the same activation magnitude
C. Encourages only a small number of features to be active on each input
D. Penalizes large decoder weights

Correct Answer: C

Explanation

- **If you chose A:** The opposite is true: L1 encourages sparsity (few active features), not density.
- **If you chose B:** L1 encourages many features to be exactly zero, not uniform magnitudes.
- **If you chose C:** Correct. Minimizing the L1 norm $\sum_i |f_i|$ alongside the reconstruction error drives most $f_i$ to exactly zero, leaving only a few nonzero.
- **If you chose D:** L1 is applied to the activations $\mathbf{f}$, not the weights. Weight decay would penalize weights.

Question 5: Why is the expansion factor $h/d > 1$ important for mechanistic interpretability?

A. It makes the SAE computationally cheaper to train
B. It ensures the model's activations can be perfectly reconstructed
C. It provides enough dictionary elements to disentangle features that are in superposition in the $d$-dimensional activation space
D. It guarantees that all learned features will be monosemantic

Correct Answer: C

Explanation

- **If you chose A:** Larger $h$ means more parameters and higher computational cost.
- **If you chose B:** Perfect reconstruction is possible with $h \geq d$, but the goal is disentanglement, not perfect reconstruction.
- **If you chose C:** Correct. The expansion provides extra capacity to assign separate dictionary elements to features that were mixed together (in superposition) in the original $d$-dimensional space.
- **If you chose D:** There's no guarantee. Features can still be polysemantic even with large expansion, especially if sparsity is too weak.

Question 6: What happens during resampling of a dead feature?

A. The feature is permanently removed from the SAE
B. The feature's decoder vector is reinitialized randomly and its encoder bias is set to zero
C. The entire SAE is retrained from scratch
D. The sparsity coefficient $\lambda$ is increased

Correct Answer: B

Explanation

- **If you chose A:** Features are never removed, so capacity is preserved.
- **If you chose B:** Correct. The dead feature gets a fresh random direction and a neutral bias, giving it a chance to learn a useful feature.
- **If you chose C:** Retraining from scratch would be wasteful; resampling is a lightweight intervention.
- **If you chose D:** Increasing $\lambda$ strengthens the sparsity penalty, which tends to produce more dead features rather than fewer.

Pitfalls

Choosing $\lambda$ too large: Overly aggressive sparsity causes many features to die and reconstruction quality to plummet. Start small and increase gradually.
Ignoring dead features: Without resampling, SAEs waste a huge fraction of their capacity. Always implement resampling.
Over-interpreting L0: Low L0 is good, but if FVU is high, the SAE isn't faithfully representing the model. Need both.
Forgetting the decoder bias: The $\mathbf{b}_{\text{dec}}$ subtraction is essential for good performance; omitting it leads to poor feature quality.

Choosing the wrong expansion factor $h/d$: Too small ($h/d \approx 1$) and the SAE can't disentangle superposition — features remain mixed. Too large ($h/d > 32$) and training becomes expensive, many features die, and the dictionary becomes hard to interpret. The sweet spot in practice is $h/d \in [4, 16]$, with larger expansions needed for layers known to have heavy superposition.
Evaluating SAEs on reconstruction alone: Low FVU (Fraction of Variance Unexplained) is necessary but not sufficient. An SAE with FVU = 1% but L0 = 200 (200 active features per token) is nearly useless for interpretability — you can't understand predictions from 200 features. Always evaluate the Pareto frontier of reconstruction vs. sparsity, not either metric in isolation.
Forgetting to normalize decoder columns: Without unit-norm constraints on $W_{\text{dec}}$ columns, the L1 penalty can be circumvented by scaling a decoder column up and its activation down by the same factor, which leaves the reconstruction unchanged while shrinking the L1 term. Always normalize decoder columns ($\|W_{\text{dec}}[:, i]\|_2 = 1$) and adjust the encoder accordingly. Many implementations enforce this via explicit renormalization after each gradient step.
Interpreting SAE features without ground-truth validation: Finding that feature 42 activates for "the" doesn't tell you much. Interpreting features requires: (a) top-activating examples from a diverse corpus, (b) activation patching to verify causal effects on the model's output, and (c) ideally comparison with known linguistic/semantic categories. A feature that fires for seemingly random tokens may be encoding something subtle (position, syntax, context) that single-token analysis misses.
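The decoder-normalization pitfall in the list above can be seen in a two-feature toy case. The numbers are arbitrary: scaling the decoder columns up by a factor `c` and the activations down by the same factor keeps the reconstruction fixed while cutting the L1 term, and renormalizing the columns undoes the trick.

```python
import numpy as np

W_dec = np.eye(2)                    # unit-norm decoder columns
f = np.array([2.0, 3.0])

c = 10.0                             # arbitrary rescaling factor
W_big, f_small = W_dec * c, f / c    # same reconstruction, smaller activations

print(W_dec @ f, np.abs(f).sum())              # reconstruction and L1 before
print(W_big @ f_small, np.abs(f_small).sum())  # reconstruction and L1 after

# Renormalizing columns (and rescaling activations to match) closes the loophole.
norms = np.linalg.norm(W_big, axis=0)
W_unit, f_unit = W_big / norms, f_small * norms
print(np.abs(f_unit).sum())
```

The reconstruction is the same in both cases, but the L1 term drops from 5 to 0.5 after rescaling, which is why the penalty only measures sparsity meaningfully when the column norms are pinned.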

Next Steps

25-03 — Grokking — where you'll explore the fascinating phenomenon of delayed generalization, a phase transition in training dynamics that reveals deep structure in how neural networks learn.
