25-01 — Mechanistic Interpretability

Phase: 25 — Frontiers & Active Research Areas
Subject: 25-01
Prerequisites: Phase 18–19 (LLM Math), Phase 14 (Optimization)
Next subject: 25-02 — Sparse Autoencoders (SAEs)


Learning Objectives

By the end of this subject, you will be able to:

  1. Define mechanistic interpretability and distinguish it from behavioral interpretability
  2. Explain the superposition hypothesis and its implications for neural network representations
  3. Understand features, circuits, and the distinction between polysemantic and monosemantic neurons
  4. Apply causal scrubbing to validate mechanistic hypotheses about model behavior
  5. Connect mechanistic interpretability to the broader agenda of AI alignment and safety

Core Content

What Is Mechanistic Interpretability?

Mechanistic interpretability is the study of reverse-engineering neural networks into human-understandable algorithms. Unlike behavioral interpretability (which treats the model as a black box and probes input-output relationships), mechanistic interpretability aims to understand the internal computations — the circuits of weights and activations that implement specific behaviors.

The central question: Can we decompose a neural network's computation into components we understand, in the same way we can decompile a binary into source code?

⚠️ CRITICAL: Mechanistic interpretability is fundamentally an empirical science of a mathematical object. A trained neural network is a deterministic function defined by its architecture and weights. There is a ground-truth answer to "how does this model compute output $y$ from input $x$?" — we just don't know how to extract it yet.

Features and Directions

A feature is a human-interpretable property of the input that the model represents internally. In the linear representation hypothesis, features correspond to directions in activation space:

$$\text{feature}_i(\mathbf{x}) = \mathbf{w}_i^T \cdot \mathbf{a}(\mathbf{x})$$

where $\mathbf{a}(\mathbf{x}) \in \mathbb{R}^d$ is the activation vector at some layer for input $\mathbf{x}$, and $\mathbf{w}_i \in \mathbb{R}^d$ is the feature direction.

Key empirical finding: In transformer language models, many interpretable features exist as nearly-linear directions. For example, you can find a direction that encodes "is the text in French" and intervene by adding/subtracting it.
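As a minimal sketch of reading and steering a linear feature, the snippet below projects an activation onto a direction and then adds a multiple of that direction. The 4-dimensional activation, the direction `w_french`, and the steering strength `alpha` are made-up illustrative values, not taken from any real model.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]

def feature_value(w, a):
    """Read a feature as the projection w^T a."""
    return dot(w, a)

def steer(a, w, alpha):
    """Intervene on the activation by adding alpha * w."""
    return [ai + alpha * wi for ai, wi in zip(a, w)]

# Hypothetical unit feature direction and activation vector.
w_french = normalize([1.0, 0.0, 1.0, 0.0])
activation = [0.2, -1.0, 0.4, 0.3]

before = feature_value(w_french, activation)
after = feature_value(w_french, steer(activation, w_french, alpha=2.0))
print(before, after)
```

Because `w_french` has unit norm, adding `alpha * w_french` shifts the feature readout by exactly `alpha` and leaves the components orthogonal to the direction unchanged.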

The Superposition Hypothesis

The superposition hypothesis (Anthropic, 2022) posits that neural networks represent more features than they have dimensions by exploiting a geometric fact: a high-dimensional space contains far more almost-orthogonal directions than it has dimensions.

If a model has $d$ neurons in a layer, classical thinking says it can represent at most $d$ independent features. But if features are sparse (only a few are active at once), the model can cram $n \gg d$ features into $d$-dimensional space using almost-orthogonal vectors.

Why this works: In $\mathbb{R}^d$, you can pack exponentially many almost-orthogonal vectors. By a Johnson-Lindenstrauss-style argument, for any fixed $\varepsilon \in (0, 1)$ there exist $\exp(\Omega(\varepsilon^2 d))$ unit vectors whose pairwise dot products are at most $\varepsilon$ in absolute value. When features are sparse, only a few are active at once, so the interference (dot products between simultaneously active features) stays small even though the vectors aren't perfectly orthogonal.
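The near-orthogonality claim can be checked empirically. The sketch below samples more random unit vectors than there are dimensions and measures their pairwise overlaps; the sizes `d = 128` and `n = 200` and the random seed are arbitrary choices for illustration.

```python
import math
import random

def random_unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def abs_dot_products(vectors):
    """Absolute dot product for every unordered pair of vectors."""
    out = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            out.append(abs(sum(a * b for a, b in zip(vectors[i], vectors[j]))))
    return out

rng = random.Random(0)
d, n = 128, 200  # more vectors than dimensions
vectors = [random_unit_vector(d, rng) for _ in range(n)]

overlaps = abs_dot_products(vectors)
mean_overlap = sum(overlaps) / len(overlaps)
max_overlap = max(overlaps)
print(f"mean |dot| = {mean_overlap:.3f}, max |dot| = {max_overlap:.3f}")
```

Random directions are not an optimal packing, but even they give overlaps on the order of $1/\sqrt{d}$, far below the value of 1 that two identical directions would produce.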

Mathematical model of superposition (toy setting):

Given $n$ features with sparsity $S$ (fraction active), we embed them into $\mathbb{R}^d$ with $d < n$:

$$\mathbf{a}(\mathbf{x}) = \mathbf{W}^T \mathbf{f}(\mathbf{x}) \quad \text{where} \quad \mathbf{W} \in \mathbb{R}^{n \times d}, \ |\mathbf{w}_i| = 1$$

The reconstruction of feature $i$ from the activation, where $\mathbf{w}_i$ is the $i$-th row of $\mathbf{W}$, is:

$$\hat{f}_i(\mathbf{x}) = \text{ReLU}(\mathbf{w}_i \cdot \mathbf{a}(\mathbf{x}) + b_i)$$

The model learns $\mathbf{W}$ to minimize reconstruction error subject to the bottleneck $d < n$. The optimal arrangement pushes feature vectors to be near-orthogonal, with interference manifesting as small-but-nonzero dot products.
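To make the toy setting concrete, here is a sketch that embeds $n = 5$ features into $d = 2$ dimensions and reconstructs them with the ReLU readout above. The pentagon arrangement of the rows of `W` and the bias value `-0.35` are hand-picked for illustration rather than learned by minimizing reconstruction error.

```python
import math

n, d = 5, 2

# Rows of W are unit vectors spaced evenly around a circle (a pentagon).
W = [[math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)]
     for k in range(n)]
b = [-0.35] * n  # hand-picked negative bias that suppresses small interference

def embed(f):
    """a = W^T f, mapping n feature values to a d-dimensional activation."""
    return [sum(W[i][j] * f[i] for i in range(n)) for j in range(d)]

def reconstruct(a):
    """f_hat_i = ReLU(w_i . a + b_i)."""
    return [max(0.0, sum(W[i][j] * a[j] for j in range(d)) + b[i])
            for i in range(n)]

# Sparse input: one active feature is recovered (at reduced scale) with no crosstalk.
one_hot = [0.0, 0.0, 1.0, 0.0, 0.0]
recon_sparse = reconstruct(embed(one_hot))

# Dense input: all features active at once cancel out and nothing is recovered.
dense = [1.0] * n
recon_dense = reconstruct(embed(dense))

print(recon_sparse)
print(recon_dense)
```

The contrast between the two inputs is the point: the same 2-dimensional code works when features are sparse and fails when they are dense.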

Polysemanticity emerges naturally from superposition: when a single neuron participates in multiple feature directions, it responds to multiple seemingly-unrelated concepts. A polysemantic neuron activates for several distinct features; a monosemantic neuron activates primarily for one.

Circuits

A circuit is a subgraph of the neural network's computational graph that implements a specific behavior. Formally, a circuit $\mathcal{C}$ for behavior $B$ is a subset of edges (weights) such that:

$$\text{Model}_{\text{with } \mathcal{C} \text{ ablated}}(x) \neq \text{Model}_{\text{full}}(x) \quad \text{for inputs relevant to } B$$

while ablating edges outside $\mathcal{C}$ preserves behavior $B$.
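The ablation criterion can be illustrated on a deliberately tiny model. In the sketch below, the edge names, weights, and the output label `y_B` are hypothetical, and zero-ablation is only one of several ablation choices (mean-ablation and resampling are alternatives).

```python
# Hypothetical two-path linear "model": each edge is (source, target) -> weight.
EDGES = {
    ("x0", "h0"): 2.0, ("h0", "y_B"): 1.5,       # path that computes behavior B
    ("x1", "h1"): -1.0, ("h1", "y_other"): 0.5,  # unrelated path
}

def forward(x, ablated=frozenset()):
    """Run the model with the given edges zero-ablated."""
    w = {edge: (0.0 if edge in ablated else value) for edge, value in EDGES.items()}
    h0 = w[("x0", "h0")] * x["x0"]
    h1 = w[("x1", "h1")] * x["x1"]
    return {"y_B": w[("h0", "y_B")] * h0, "y_other": w[("h1", "y_other")] * h1}

circuit = {("x0", "h0"), ("h0", "y_B")}
outside = set(EDGES) - circuit
x = {"x0": 1.0, "x1": 3.0}

full = forward(x)
without_circuit = forward(x, ablated=circuit)
without_rest = forward(x, ablated=outside)

print(full["y_B"], without_circuit["y_B"], without_rest["y_B"])
```

Ablating the circuit changes `y_B`, while ablating everything outside it leaves `y_B` intact, which is the pair of conditions in the definition.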

Example: Induction heads (Olsson et al., 2022). In transformers, a two-layer circuit composed of a "previous token head" and a "copy head" implements the pattern-matching behavior: "If I see [A][B]..., then [A] again, predict [B]." Mathematically:

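$$\text{Induction}(x_{1:t}) = \text{softmax}\left(\frac{Q_{\text{copy}} \cdot K_{\text{prev}}^T}{\sqrt{d_k}}\right) V_{\text{token}}$$

Here $Q_{\text{copy}}$ is the copy head's query for the current token, $K_{\text{prev}}$ denotes the copy head's keys computed from a residual stream that the previous-token head has already written to, and $V_{\text{token}}$ carries the token identity at each position. This is K-composition: the previous-token head's output feeds into the keys (not the queries) of the copy head, so each key advertises the token that precedes its position. The query for the current token [A] therefore matches the position right after the earlier [A], and the circuit attends to the token after each occurrence of the current token.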
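At the level of input-output behavior, the induction pattern can be sketched in a few lines. This describes what the circuit computes, not how attention implements it; the function name `induction_predict` and the word-level tokenization are assumptions made for illustration.

```python
def induction_predict(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the final token, or None if the final token has not appeared before."""
    current = tokens[-1]
    for pos in range(len(tokens) - 2, -1, -1):
        if tokens[pos] == current:
            return tokens[pos + 1]
    return None

sequence = ["The", "cat", "sat", "on", "the", "mat", ".", "The", "cat"]
prediction = induction_predict(sequence)
print(prediction)
```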

Universality Hypothesis

The universality hypothesis states that neural networks trained on similar data/tasks converge to similar internal circuits, regardless of random initialization. Evidence includes:

  1. Feature universality: The same interpretable features appear across different training runs
  2. Circuit universality: Induction heads, copy mechanisms, and other circuits emerge consistently
  3. Representation similarity metrics: CKA (Centered Kernel Alignment) shows high similarity between different runs

Mathematically, if two models $f_\theta$ and $f_\phi$ are trained on the same distribution, there often exists an orthogonal transformation $R$ such that activations align: $\mathbf{a}_\theta(\mathbf{x}) \approx R \cdot \mathbf{a}_\phi(\mathbf{x})$.
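A sketch of recovering such an alignment in the simplest case: two-dimensional activations related by a pure rotation. This is the rotation-only, 2D special case of the orthogonal Procrustes problem, where the least-squares angle has a closed form; the activation values and the angle `0.7` are synthetic.

```python
import math

def rotate(v, angle):
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def best_rotation_angle(acts_theta, acts_phi):
    """Angle of the 2D rotation R minimizing sum ||a_theta - R a_phi||^2."""
    dot_sum = sum(a[0] * b[0] + a[1] * b[1] for a, b in zip(acts_theta, acts_phi))
    cross_sum = sum(a[1] * b[0] - a[0] * b[1] for a, b in zip(acts_theta, acts_phi))
    return math.atan2(cross_sum, dot_sum)

# Synthetic activations: model theta is model phi rotated by a hidden angle.
acts_phi = [[1.0, 0.0], [0.5, 2.0], [-1.0, 0.3], [0.2, -0.7]]
true_angle = 0.7
acts_theta = [rotate(b, true_angle) for b in acts_phi]

recovered = best_rotation_angle(acts_theta, acts_phi)
aligned = [rotate(b, recovered) for b in acts_phi]
max_error = max(abs(x - y) for a, r in zip(acts_theta, aligned) for x, y in zip(a, r))
print(recovered, max_error)
```

In higher dimensions the same least-squares problem is usually solved with a singular value decomposition rather than a single angle.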

Causal Scrubbing

Causal scrubbing is a methodology for rigorously testing mechanistic hypotheses. The idea:

  1. Hypothesize a circuit $\mathcal{C}$ that implements behavior $B$
  2. Scrub: Replace activations in the model with activations from different inputs that should be equivalent under the hypothesis
  3. Measure: If the hypothesis is correct, the scrubbed model should produce the same output

Formally, given a hypothesis $H$ that identifies which activations encode what information, a scrubbing function $s: \mathcal{X} \to \mathcal{X}$ maps inputs to "equivalent" inputs. For each activation $a$ hypothesized to encode information $I(a)$, we replace it with the activation from $s(x)$ which (under $H$) encodes the same $I(a)$.

$$\text{Scrubbed output} = \text{Model}_{\text{with replaced activations}}(x)$$

If $H$ is correct: $\text{Model}(x) \approx \text{Scrubbed output}$.
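The scrubbing procedure can be sketched on a toy model whose mechanism is known. Everything here is hypothetical: the "model" maps an integer to "odd" or "even" through a single hidden activation, and the two hypotheses are chosen so that one is right and one is wrong.

```python
def hidden(x):
    """Toy intermediate activation: the parity of the input."""
    return x % 2

def readout(h):
    return "odd" if h == 1 else "even"

def model(x, patched_hidden=None):
    """Run the model, optionally replacing the hidden activation."""
    h = hidden(x) if patched_hidden is None else patched_hidden
    return readout(h)

def scrub(x, donor):
    """Run the model on x with the hidden activation taken from donor."""
    return model(x, patched_hidden=hidden(donor))

x = 7

# Hypothesis H1 (correct): the hidden activation encodes parity.
# Donor 13 is equivalent to 7 under H1 because both are odd.
h1_preserved = scrub(x, donor=13) == model(x)

# Hypothesis H2 (wrong): the hidden activation encodes "x > 5".
# Donor 8 is equivalent to 7 under H2, yet it has different parity.
h2_preserved = scrub(x, donor=8) == model(x)

print(h1_preserved, h2_preserved)
```

The correct hypothesis survives scrubbing and the wrong one does not, even though both donors were "equivalent" according to the hypothesis under test.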



Key Terms

Worked Examples

Example 1: Finding a Feature Direction via Linear Probing

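Problem: In a sentiment-classification transformer, you suspect a direction encodes "positive sentiment." The model's last-layer activations for two sentences, one positive and one negative, are:

$$\mathbf{a}_1 = [0.8, -0.3, 0.5]^T \ (\text{positive}), \qquad \mathbf{a}_2 = [-0.6, 0.2, -0.4]^T \ (\text{negative})$$

Find a candidate feature direction $\mathbf{w} \in \mathbb{R}^3$ that separates these.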

Solution:

  1. The vector difference points from negative to positive: $\mathbf{d} = \mathbf{a}_1 - \mathbf{a}_2 = [1.4, -0.5, 0.9]^T$
  2. Normalize: $|\mathbf{d}| = \sqrt{1.96 + 0.25 + 0.81} = \sqrt{3.02} \approx 1.738$
  3. $\mathbf{w} = \mathbf{d} / |\mathbf{d}| \approx [0.806, -0.288, 0.518]^T$
  4. Verify: $\mathbf{w}^T\mathbf{a}_1 \approx 0.644 + 0.086 + 0.259 \approx 0.99$ (positive)
  5. $\mathbf{w}^T\mathbf{a}_2 \approx -0.483 - 0.058 - 0.207 = -0.748$ (negative)
  6. The projections have opposite signs, so the direction separates these two examples.

Answer: Feature direction: $\mathbf{w} \approx [0.806, -0.288, 0.518]^T$. Positive sentiment projects to $\approx +0.99$, negative to $\approx -0.75$. In practice, you'd use many examples and logistic regression (linear probe) to find the optimal separating hyperplane.
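The arithmetic of this example can be reproduced with a short script. It assumes the activations $\mathbf{a}_1 = [0.8, -0.3, 0.5]^T$ and $\mathbf{a}_2 = [-0.6, 0.2, -0.4]^T$, which are the values consistent with the difference vector and projections in the solution; the variable names are illustrative.

```python
import math

a_pos = [0.8, -0.3, 0.5]   # activation for the positive sentence (assumed values)
a_neg = [-0.6, 0.2, -0.4]  # activation for the negative sentence (assumed values)

# Difference vector pointing from the negative to the positive example.
diff = [p - q for p, q in zip(a_pos, a_neg)]
norm = math.sqrt(sum(x * x for x in diff))
w = [x / norm for x in diff]

proj_pos = sum(wi * ai for wi, ai in zip(w, a_pos))
proj_neg = sum(wi * ai for wi, ai in zip(w, a_neg))

print([round(x, 3) for x in w])
print(round(proj_pos, 3), round(proj_neg, 3))
```

With many labeled examples, the same idea generalizes to the difference between class means, or to the weight vector of a trained logistic-regression probe.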

Example 2: Detecting Polysemanticity

Problem: A neuron has the following top-5 activating dataset examples:

  1. "The cat sat on the mat" (token "cat" → activation 8.2)
  2. "Heavy metal music is loud" (token "metal" → activation 7.9)
  3. "The bank of the river" (token "bank" → activation 8.5)
  4. "I need to deposit money at the bank" (token "bank" → activation 8.3)
  5. "My favorite cat breed is Siamese" (token "cat" → activation 7.8)

Is this neuron monosemantic or polysemantic?

Solution: The neuron activates strongly for: "cat" (animal), "metal" (music genre), and "bank" (both riverbank and financial bank). These concepts are semantically unrelated — there's no single concept that encompasses "cats, heavy metal music, and riverbanks." Therefore, this neuron is polysemantic, representing at least 3-4 distinct features in superposition.

Answer: Polysemantic. The neuron fires for at least 3-4 unrelated concepts (cat/feline, metal music, riverbank, financial bank). This is exactly what the superposition hypothesis predicts: a single neuron participates in representing multiple sparse features.

Example 3: Causal Scrubbing of an Induction Circuit

Problem: You hypothesize that attention head L1.H3 implements the pattern $[A] \to [B]$ (attend to the token after the current token's last occurrence). Design a scrubbing experiment to test this. Given the input sequence "The cat sat on the mat. The cat", the model predicts "sat". What would scrubbing look like?

Solution:

  1. Hypothesis: With 0-indexed word-level tokens and the period as its own token, the final "cat" sits at position 8. L1.H3 attends from it to the token after the previous "cat" (position 1), which is "sat" at position 2. The value at position 2 is then used to predict "sat."
  2. Scrubbing design: Replace the activations of L1.H3 at position 8 with activations from a different input that the hypothesis treats as equivalent. Because the hypothesis says L1.H3's output encodes "the token after the previous identical token," an equivalent input may repeat a different token but must keep the same follower, for instance "A dog sat in the park. The dog". There L1.H3 would attend from "dog" (position 8) to "sat" (position 2).
  3. Execution: Run the model twice. First, normally on "The cat sat on the mat. The cat". Second, with L1.H3's output at position 8 replaced by L1.H3's output from "A dog sat in the park. The dog" at the corresponding position.
  4. Prediction under correct hypothesis: The model should still predict "sat," because under the hypothesis L1.H3's output depends only on the token that followed the previous match, and that token is "sat" in both inputs.
  5. If hypothesis is wrong: The prediction would change, because the replaced activation carries information beyond what was hypothesized (for example, the identity of the matched token).

Answer: Scrubbing tests whether L1.H3 truly implements "attend to token-after-previous-match." By replacing its output with activations from a different sequence that should produce equivalent algorithmic output, we test whether the hypothesis captures all the information L1.H3 contributes. If the prediction remains "sat," the hypothesis is supported.

Practice Problems

Problem 1: A 3-dimensional activation space has features $\mathbf{f}_1 = [1, 0, 0]^T$, $\mathbf{f}_2 = [0, 1, 0]^T$, $\mathbf{f}_3 = [0, 0, 1]^T$, and $\mathbf{f}_4 = [0.6, 0.6, 0.528]^T$. Compute all pairwise dot products and determine if feature 4 can coexist with the others in superposition with minimal interference.

Answer: $\langle\mathbf{f}_1, \mathbf{f}_4\rangle = 0.6$, $\langle\mathbf{f}_2, \mathbf{f}_4\rangle = 0.6$, $\langle\mathbf{f}_3, \mathbf{f}_4\rangle = 0.528$, and $\|\mathbf{f}_4\|^2 = 0.36 + 0.36 + 0.279 = 0.999 \approx 1$. The base 3 features are orthogonal (dot products 0). Feature 4 has non-trivial dot products with all three base features, so it is in superposition with them. If features are sparse enough (only 1-2 active at a time), the interference $\mathbf{f}_4^T\mathbf{f}_{\text{active}}$ stays manageable (0.6 is moderate). This is a 4-in-3 superposition.
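The dot products in this answer can be checked mechanically. The sketch below uses the four feature vectors from the problem statement; the dictionary keys are just labels.

```python
from itertools import combinations

features = {
    "f1": [1.0, 0.0, 0.0],
    "f2": [0.0, 1.0, 0.0],
    "f3": [0.0, 0.0, 1.0],
    "f4": [0.6, 0.6, 0.528],
}

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Dot product for every unordered pair of features.
dots = {(a, b): dot(features[a], features[b]) for a, b in combinations(features, 2)}
f4_norm_sq = dot(features["f4"], features["f4"])

for pair, value in dots.items():
    print(pair, round(value, 3))
print("||f4||^2 =", round(f4_norm_sq, 3))
```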

Problem 2: In a probing experiment, you train a linear classifier to predict whether a sentence contains a negation from a model's layer-3 activations. The probe achieves 99% accuracy. Does this prove the model uses a negation feature for its predictions? Why or why not?

Answer: No, it does not prove the model *uses* the negation feature. A linear probe only shows that the information is *present* in the activations; it could be an epiphenomenon (correlated but not causally used). To prove usage, you need causal intervention: if you *remove* the negation direction from activations, does the model's behavior change on negation-dependent tasks? This is the correlation-vs-causation distinction at the core of mechanistic interpretability.
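Removing a direction from an activation is a one-line projection. The sketch below shows only that step, with a made-up probe direction and activation; rerunning a real model on the edited activation and comparing its behavior is the part that provides causal evidence, and it is not shown here.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def project_out(a, w):
    """Remove the component of a along direction w (w need not be unit-norm)."""
    scale = dot(a, w) / dot(w, w)
    return [ai - scale * wi for ai, wi in zip(a, w)]

# Hypothetical probe weight vector for "negation" and a layer-3 activation.
negation_direction = [0.0, 3.0, 4.0]
activation = [1.0, 2.0, -1.0]

ablated = project_out(activation, negation_direction)
print(ablated, dot(ablated, negation_direction))
```

After the projection, the probe's linear readout along `negation_direction` is zero, while components orthogonal to it are untouched.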

Problem 3: Suppose you discover that two attention heads, H1 and H2, form a circuit for factual recall from a subject-relation-object triplet. Describe the minimum evidence needed to claim you've "understood" this circuit.

Answer: Minimum evidence for a circuit claim:

  1. **Knockout:** Ablating either H1 or H2 degrades factual recall performance (necessity)
  2. **Sufficiency:** Activating H1→H2 on a different input transfers the behavior
  3. **Composition:** H1's output feeds into H2's input (structural connection)
  4. **Specificity:** The circuit handles factual recall specifically, not general language modeling
  5. **Causal scrubbing:** A hypothesis about what information flows through each edge survives scrubbing tests

Collectively, this establishes necessity, sufficiency, and mechanistic understanding.

Problem 4: A layer has dimensionality $d = 512$ and represents features that are each active with probability $p = 0.01$ (sparsity $S=0.01$, or 1% of features active per input). Estimate the maximum number of features that can be packed into this layer, assuming features are random unit vectors in $\mathbb{R}^d$ and interference $\geq 0.5$ becomes problematic.

Answer: For random unit vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$, the expected dot product is $0$ with variance $1/d$, so $\mathbb{E}[(\mathbf{u}^T\mathbf{v})^2] = 1/d$ and each dot product has typical magnitude $\approx 1/\sqrt{d}$. With $n$ features and $k = pn$ active on average, the interference a feature experiences is bounded by the sum of absolute values $\sum_{j \neq i, \text{active}} |\mathbf{w}_i^T \mathbf{w}_j|$, which is roughly $k / \sqrt{d}$. Setting $k / \sqrt{d} < 0.5$ gives $k < 0.5\sqrt{d} = 0.5\sqrt{512} \approx 11.3$. With $p = 0.01$: $n = k/p \approx 11.3 / 0.01 = 1130$. So roughly **~1,000 features** can be packed into 512 dimensions at 1% sparsity before interference becomes severe. This bound is conservative because it adds absolute values; signed interference terms partially cancel and grow only like $\sqrt{k/d}$. Under the worst-case estimate used here, capacity scales as $n \propto \sqrt{d}/p$.
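The worst-case estimate above is a two-line computation. The helper name `max_features` and its parameters are illustrative; the formula is the one from this answer, in which each of $k$ active features contributes about $1/\sqrt{d}$ of interference.

```python
import math

def max_features(d, p, threshold):
    """Worst-case capacity estimate: require k / sqrt(d) < threshold, then n = k / p.

    Returns (k_max, n_max): the largest tolerable number of simultaneously
    active features and the corresponding total number of features.
    """
    k_max = threshold * math.sqrt(d)
    return k_max, k_max / p

k_max, n_max = max_features(d=512, p=0.01, threshold=0.5)
print(round(k_max, 1), round(n_max))
```

Varying `d` in this helper makes the $\sqrt{d}$ scaling of the estimate visible: quadrupling the dimension only doubles the estimated capacity.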

Problem 5: Causal scrubbing: You hypothesize that a neuron N encodes "the subject is plural." Design two scrubbing experiments — one that should pass if your hypothesis is correct, and one that should fail (as a control).

Answer:

**Experiment 1 (should pass):** For input "The cats are sleeping," scrub N's activation by replacing it with N's activation from "The dogs are barking." Both have plural subjects, so the scrubbed activation should encode the same "plural subject" signal. If the hypothesis is correct, model output should still predict a plural-agreeing verb.

**Experiment 2 (should fail; control):** For input "The cats are sleeping," scrub N's activation with N's activation from "The cat is sleeping" (singular subject). If N truly encodes plurality, replacing it with a singular signal should cause the model to produce singular-agreeing output, showing that scrubbing breaks behavior when the hypothesis says it should.

Passing the first and failing the second provides strong causal evidence for the hypothesis.

Summary


Quiz

Question 1: What is the superposition hypothesis?

  A. Neural networks can represent at most $d$ features in a $d$-dimensional layer
  B. Neural networks can represent $n > d$ features in $d$ dimensions by exploiting sparsity and near-orthogonality
  C. All features in a network are orthogonal to each other by construction
  D. Superposition only occurs in attention layers, not MLP layers

Correct Answer: B

Explanation:

  - **If you chose A:** This is the classical (pre-superposition) view. The superposition hypothesis says networks can exceed this limit.
  - **If you chose B:** Correct. Sparse features can be represented with almost-orthogonal vectors, allowing $n \gg d$.
  - **If you chose C:** Feature vectors are near-orthogonal, not perfectly orthogonal. Perfect orthogonality would limit you to $d$ features.
  - **If you chose D:** Superposition occurs in both MLP and attention layers; it's a general property of high-dimensional representations.

Question 2: A neuron fires strongly for tokens "dog," "puppy," and "canine" — and nothing else. This neuron is:

  A. Polysemantic
  B. Monosemantic
  C. Dead
  D. A superposition artifact

Correct Answer: B

Explanation:

  - **If you chose A:** Polysemantic means firing for *unrelated* concepts. Dog/puppy/canine are all semantically related (dogs).
  - **If you chose B:** Correct. All activating concepts share the feature "dog-related," so this is a monosemantic neuron.
  - **If you chose C:** Dead neurons never fire. This neuron clearly fires.
  - **If you chose D:** This neuron represents a single clear feature; it's not an artifact of superposition.

Question 3: In causal scrubbing, what does it mean if scrubbing an activation does NOT change the model's output?

  A. The activation is irrelevant to the model's computation
  B. The scrubbing function was incorrectly designed
  C. The hypothesis about what the activation encodes is supported (the activation was correctly replaced with equivalent information)
  D. The model is too robust to be interpretable

Correct Answer: C

Explanation:

  - **If you chose A:** Not necessarily. The activation could be crucial but the scrubbing provided equivalent information.
  - **If you chose B:** This *could* be true, but it's not the default interpretation of a successful scrub.
  - **If you chose C:** Correct. If the hypothesis says "this activation encodes X," and you replace it with an activation from a different input that also encodes X, the model should behave the same. This validates the hypothesis.
  - **If you chose D:** Robustness is not directly tested by scrubbing.

Question 4: The universality hypothesis suggests:

  A. Every training run produces a completely different set of internal circuits
  B. Neural networks from different architectures always converge to identical weights
  C. Models trained on similar data tend to develop similar internal features and circuits across different random seeds
  D. Interpretability methods work equally well on all architectures

Correct Answer: C

Explanation:

  - **If you chose A:** This contradicts the universality hypothesis, which claims convergence to similar structure.
  - **If you chose B:** Weights can differ up to permutation/symmetry. The claim is about functional similarity, not weight identity.
  - **If you chose C:** Correct. Empirical evidence (e.g., CKA similarity, induction heads) supports this.
  - **If you chose D:** Universality is about the models, not the interpretability methods.

Question 5: Which of these is an example of a circuit in mechanistic interpretability?

  A. A single neuron that fires for cat images
  B. A learned weight matrix in an attention layer
  C. A specific subgraph of attention heads and MLP neurons that together implement indirect object identification
  D. The entire forward pass of the model

Correct Answer: C

Explanation:

  - **If you chose A:** That's a feature, not a circuit. A circuit involves multiple interacting components.
  - **If you chose B:** A single weight matrix is a component, not a circuit.
  - **If you chose C:** Correct. A circuit is a subgraph of interacting components (heads, neurons) that together implement an algorithm.
  - **If you chose D:** The whole model contains many circuits, not a single one.

Question 6: Why is linear probing alone insufficient to establish that a model uses a feature?

  A. Linear probes are computationally too expensive
  B. Linear probes can detect information that is present but not causally used by the model
  C. Linear probes only work on convolutional networks
  D. Linear probes always overfit

Correct Answer: B

Explanation:

  - **If you chose A:** Linear probes are computationally cheap.
  - **If you chose B:** Correct. A probe finding that "activation direction X correlates with concept Y" doesn't mean the model uses Y for its predictions; it could be an epiphenomenon. Causal intervention is needed.
  - **If you chose C:** Linear probes work on any architecture.
  - **If you chose D:** Linear probes don't inherently overfit more than other methods.

Pitfalls

  1. Confusing correlation with causation in probing experiments: Just because a linear probe can decode "French" from layer 5 activations doesn't mean the model uses French representations for its own predictions. Always follow probing with causal intervention (activation patching, ablation). This is the single most common error in interpretability research.

  2. Focusing exclusively on attention patterns: While attention patterns are visually interpretable, the residual stream and MLP layers carry substantial computation. Many circuits involve MLP neurons that transform information in ways attention patterns alone can't capture. Ignoring MLPs gives an incomplete picture of model computation.

  3. Over-relying on neuron-level analysis in the superposition regime: When features are in superposition, individual neurons are polysemantic: they respond to mixtures of features. Analyzing neurons one at a time, without considering their interactions, can lead to misleading conclusions about what the model "knows." Use dictionary learning (SAEs, 25-02) to disentangle features.

  4. Neglecting the residual stream as the primary communication channel: In transformers, attention heads and MLPs write to and read from the residual stream. It's the shared memory. Many interpretability efforts treat attention outputs as directly used by the next layer, missing that the residual stream accumulates information across layers. Circuits often span the residual stream, not just attention-to-attention connections.

  5. Assuming perfect orthogonality: Feature vectors are almost orthogonal, not perfectly orthogonal. The small dot products create interference patterns that matter.


Next Steps

25-02 — Sparse Autoencoders (SAEs) — where you'll learn how to automatically discover interpretable features from activations, providing a key tool for the mechanistic interpretability agenda.