📐 Concept diagram

17-06 — Attention Mechanism (General)

Phase: 17 — Deep Learning Architectures (Math) Subject: 17-06 Prerequisites: 16-02 (Activation Functions — especially softmax), 16-05 (Backpropagation), 8-10 (inner products, dot products) Next subject: 17-07 — Scaled Dot-Product Attention

Learning Objectives

By the end of this subject, you will be able to:

Derive the Query-Key-Value (QKV) abstraction from the concept of soft content-based lookup
Explain how attention weights are computed as normalized similarity scores between queries and keys
Compare additive (Bahdanau) and multiplicative (Luong) attention scoring functions mathematically
Derive the gradient of attention output with respect to the keys and queries
Explain the computational complexity of attention: O(n²d) and why it limits sequence length

Core Content

1. Motivation: Beyond Fixed-Length Context

RNNs (17-02) compress an entire input sequence into a fixed-size hidden state. This creates a bottleneck — information from early in the sequence must survive many recurrent steps to affect later outputs.

Attention solves this by giving the model DIRECT access to ALL positions in the input sequence. Instead of compressing everything into one vector, we compute a weighted sum where the weights are learned dynamically based on relevance.

2. The Query-Key-Value Abstraction

Attention is modeled after soft content-based lookup in a key-value database:

Query (Q): "What am I looking for?" — represents the current position's needs
Keys (K): "What do I contain?" — labels/indexes for each position in the source
Values (V): "What information do I provide?" — the actual content at each position

The process:

Compute similarity between Query and each Key
Normalize similarities to weights (attention distribution)
Compute weighted sum of Values

Mathematically:

α_i = score(q, k_i) — scalar similarity for each position i a = softmax(α) — attention weight vector (sums to 1) output = Σ_i a_i · v_i — weighted sum of value vectors

⚠️ THIS IS CRITICAL — Attention is differentiable soft lookup. Instead of selecting ONE key-value pair, we compute a weighted average over ALL pairs, where the weights come from query-key similarity. The softmax makes this fully differentiable, enabling end-to-end training.

3. Scoring Functions

The score function measures compatibility between query and key vectors. Two classic approaches:

Additive Attention (Bahdanau, 2015):

score(q, k) = vᵀ · tanh(W₁·q + W₂·k)

Where v ∈ ℝ^d_a is a learned weight vector, W₁, W₂ ∈ ℝ^(d_a × d) project query and key to the attention dimension d_a.

Additive attention projects Q and K into a shared space, adds them, applies non-linearity, then projects to a scalar with v. The addition and tanh provide expressive non-linear scoring.

Multiplicative/Dot-Product Attention (Luong, 2015):

score(q, k) = qᵀ k (dot product)

Or with a learned compatibility matrix:

score(q, k) = qᵀ W_a k (general/bilinear attention)

Comparison: - Additive: O(d_a) per pair, can model complex non-linear relationships - Dot-product: O(d) per pair, much faster (matrix-multiply friendly), but assumes dot product captures similarity adequately

In practice, dot-product attention is dramatically faster on GPUs and became the dominant approach for Transformers.

4. Full Attention Computation (Batch Form)

For a sequence of n positions, with query vectors q_i, key vectors k_i, value vectors v_i:

Stack into matrices: Q ∈ ℝ^(n×d_q), K ∈ ℝ^(n×d_k), V ∈ ℝ^(n×d_v)

For self-attention (Q, K, V all come from the same sequence): - Typically d_q = d_k = d_v = d

The attention computation:

S = score_matrix(Q, K) — n×n matrix of raw scores A = softmax(S, dim=1) — each ROW sums to 1 (softmax over keys for each query) output = A · V — n×d weighted sum

Each row of A is the attention distribution for one query position.

5. Attention as a Differentiable Dictionary

Think of attention as:

Q provides n search queries
K provides n dictionary keys
V provides n dictionary values
A[i,j] = how much query i "attends to" key/value j
Output[i] = Σ_j A[i,j] · V[j] — weighted average of values

When A[i,j] ≈ 1 for some j (sharp attention): output ≈ V[j] (hard lookup) When A[i,:] is diffuse (all entries ≈ 1/n): output ≈ mean(V) (averaging)

The model learns to make attention sharp when specific information is needed and diffuse when context aggregation is sufficient.

6. Backpropagation Through Attention

Let's derive gradients for dot-product attention:

S = QKᵀ, A = softmax(S), O = AV

Given ∂L/∂O (gradient from loss through output), we need ∂L/∂Q, ∂L/∂K, ∂L/∂V.

Gradient w.r.t. V:

∂L/∂V = Aᵀ · ∂L/∂O

Direct: O = AV, so ∂L/∂V = Aᵀ·∂L/∂O (standard matrix multiplication gradient).

Gradient w.r.t. A:

∂L/∂A = ∂L/∂O · Vᵀ

Gradient w.r.t. S (through softmax):

∂L/∂S = A ⊙ (∂L/∂A − (A ⊙ ∂L/∂A) · 1₁ᵀ)

This is the softmax gradient: ∂a_i/∂s_j = a_i(δ_{ij} − a_j). In matrix form:

∂L/∂S = A ⊙ (∂L/∂A − (A ⊙ ∂L/∂A) · 1₁ᵀ)

Or equivalently: ∂L/∂S[i,j] = A[i,j]·(∂L/∂A[i,j] − Σ_k A[i,k]·∂L/∂A[i,k])

Gradient w.r.t. Q and K:

S = QKᵀ, so:

∂L/∂Q = ∂L/∂S · K ∂L/∂K = (∂L/∂S)ᵀ · Q

7. Computational Complexity

For sequence length n and dimension d: - Computing S = QKᵀ: O(n²d) — n queries × n keys × d dimensions - Computing A = softmax(S): O(n²) — normalize each row - Computing AV: O(n²d) — n×n matrix times n×d matrix - Total: O(n²d)

The quadratic dependence on n is the fundamental limitation of full attention. For n=1000, n²=1M; for n=100K, n²=10B — infeasible.

This motivates sparse attention, linear attention, and other efficient variants (covered in Phase 18-19).

8. Self-Attention vs Cross-Attention

Self-attention: Q, K, V all come from the SAME sequence. Each position attends to all positions (including itself). Used in encoder layers of Transformers.

Cross-attention: Q comes from one sequence (e.g., decoder), K and V come from another (e.g., encoder output). Used in encoder-decoder attention.

The mathematical operations are identical — only the source of K and V changes.

Key Terms

Attention

Worked Examples

Example 1: Computing Simple Attention

Problem: Q = [[1,0],[0,1]] (2 queries, d=2), K = [[1,0],[0,1],[1,1]] (3 keys, d=2), V = [[1,2],[3,4],[5,6]] (3 values, d=2). Compute attention output with dot-product scoring.

Solution: S = QKᵀ = [[1,0],[0,1]] · [[1,0,1],[0,1,1]] = [[1,0,1],[0,1,1]]

Row 0: exp([1,0,1]) = [2.718, 1.0, 2.718], sum = 6.436 A[0] = [2.718/6.436, 1/6.436, 2.718/6.436] = [0.4223, 0.1554, 0.4223]

Row 1: exp([0,1,1]) = [1.0, 2.718, 2.718], sum = 6.436 A[1] = [1/6.436, 2.718/6.436, 2.718/6.436] = [0.1554, 0.4223, 0.4223]

O = AV = [[0.4223·1+0.1554·3+0.4223·5, 0.4223·2+0.1554·4+0.4223·6], [0.1554·1+0.4223·3+0.4223·5, 0.1554·2+0.4223·4+0.4223·6]] = [[3.000, 4.000], [3.533, 4.533]]

Example 2: Attention Weight Interpretation

Problem: For the attention weights from Example 1, which key does Q₁=[0,1] attend to most? Why?

Solution: A[1] = [0.1554, 0.4223, 0.4223]. Q₁ attends equally to K₂=[0,1] and K₃=[1,1], both with weight 0.4223. K₁=[1,0] gets 0.1554.

Q₁·K₁ = 0, Q₁·K₂ = 1, Q₁·K₃ = 1. Keys 2 and 3 have equal dot products with Q₁, so they split the attention mass. Key 1 is orthogonal to Q₁ (dot product = 0), so it gets less attention (but still non-zero because softmax never gives zero).

Example 3: Gradient Through Attention

Problem: For Example 1, given ∂L/∂O = [[1,0],[0,1]], compute ∂L/∂V.

Solution: ∂L/∂V = Aᵀ · ∂L/∂O Aᵀ = [[0.4223, 0.1554],[0.1554, 0.4223],[0.4223, 0.4223]]

∂L/∂V = [[0.4223·1+0.1554·0, 0.4223·0+0.1554·1], [0.1554·1+0.4223·0, 0.1554·0+0.4223·1], [0.4223·1+0.4223·0, 0.4223·0+0.4223·1]] = [[0.4223, 0.1554],[0.1554, 0.4223],[0.4223, 0.4223]]

Note: each key's value gradient is the attention-weighted sum of query gradients.

Quiz

Q1: In the attention mechanism, what are the three components and their roles?

A) Query: what to output; Key: what to forget; Value: what to remember B) Query: "what am I looking for"; Key: "what does each position contain"; Value: "what information to retrieve" C) Query: attention weights; Key: hidden states; Value: gradients D) Query: the input sequence; Key: the output sequence; Value: the loss

Answer & Explanation

**B** — The QKV abstraction models differentiable soft lookup: Query represents needs, Keys are indexes for matching, Values are content to retrieve. Attention computes Q-K similarity then takes a weighted sum of V.

Q2: What is the computational complexity of full attention for sequence length n and dimension d?

A) O(n·d) B) O(n·d²) C) O(n²·d) D) O(n²·d²)

Answer & Explanation

**C) O(n²d)** — Computing S = QKᵀ is n×d @ d×n → O(n²d), and AV is n×n @ n×d → O(n²d). The n² factor from all pairwise query-key interactions is the fundamental scaling bottleneck.

Q3: How does self-attention differ from cross-attention?

A) Self-attention uses dot-product; cross-attention uses additive scoring B) Self-attention is faster C) In self-attention, Q, K, V come from the same sequence; in cross-attention, K and V come from a different sequence than Q D) Self-attention uses softmax; cross-attention does not

Answer & Explanation

**C** — The mathematical operations are identical. Only the data source differs: self-attention has Q=K=V from one sequence; cross-attention takes Q from one sequence and K,V from another.

Q4: Why is softmax used on attention scores?

A) To make scores larger B) To normalize scores into a probability distribution (sums to 1) that is differentiable C) To reduce computational cost D) Because softmax is faster than other normalizations

Answer & Explanation

**B** — Softmax converts raw scores into a distribution (non-negative, sums to 1), allowing output to be a weighted average. Crucially, softmax is differentiable (unlike argmax), enabling gradient-based training.

Q5: In additive (Bahdanau) attention, what role does the learned vector v play?

A) It provides the final output B) It projects the tanh output to a scalar score C) It acts as the query D) It normalizes the attention weights

Answer & Explanation

**B** — score(q,k) = vᵀ · tanh(W₁q + W₂k). v ∈ ℝ^d_a maps the hidden representation to a scalar compatibility score. A = Values (V). C = Q. D = softmax.

Practice Problems

Problem 1

What are the dimensions of S = QKᵀ if Q ∈ ℝ^(5×64), K ∈ ℝ^(7×64)?

Answer

S ∈ ℝ^(5×7). Q has 5 queries, K has 7 keys; dot products produce a 5×7 matrix.

Problem 2

Explain why attention complexity is O(n²d) and which term dominates for large n.

Answer

Computing QKᵀ is n×d @ d×n = O(n²d). The softmax is O(n²). AV is n×n @ n×d = O(n²d). For large n, O(n²d) dominates. The n² comes from all pairwise query-key interactions.

Problem 3

What is the advantage of additive attention over dot-product attention? What is the disadvantage?

Answer

Advantage: additive can model non-linear query-key relationships via tanh(W₁q + W₂k). Disadvantage: slower — requires a separate feedforward per (q,k) pair instead of a single matrix multiply QKᵀ.

Problem 4

If attention weight A[i,j] = 1 and A[i,k≠j] = 0, what is the output at position i?

Answer

O[i] = Σ_k A[i,k]·V[k] = V[j]. Sharp attention = hard selection of one value. All other values are completely ignored.

Problem 5

Derive ∂L/∂Q given ∂L/∂S and K, where S = QKᵀ.

Answer

S = QKᵀ means S[i,j] = Σ_d Q[i,d]·K[j,d]. So ∂S[i,j]/∂Q[i,d] = K[j,d] and ∂S[i,j]/∂Q[p≠i,d] = 0. Therefore ∂L/∂Q = ∂L/∂S · K, where ∂L/∂S ∈ ℝ^(n×n) and K ∈ ℝ^(n×d).

Summary

Attention is differentiable soft lookup: output = Σ softmax(score(Q,K)) · V — weighted average of values based on query-key similarity
The QKV abstraction separates "what to look for" (Q), "what's available" (K), and "what to retrieve" (V)
Scoring functions: additive (vᵀtanh(W₁q+W₂k)) is more expressive; dot-product (qᵀk) is faster and GPU-friendly
Attention complexity is O(n²d) — the n² from pairwise Q-K interactions is the fundamental scaling bottleneck
Self-attention (Q=K=V from same sequence) vs. cross-attention (Q from one seq, K,V from another) differ only in data source, not computation

Pitfalls

Confusing the roles of Q, K, and V. The query is "what I'm looking for," keys are "what each position offers" (indexes), and values are "what information to retrieve." A common mistake is feeding the wrong tensor as K vs V — the attention weights are computed from Q-K similarity, but the weighted sum is taken over V. Swapping K and V produces numerically different outputs even though they may have the same shape.
Using unscaled dot-product attention in high dimensions. Without scaling by √d_k, the dot products qᵀk grow in variance like d_k. For d_k = 64, scores already reach magnitudes where softmax saturates into near-one-hot distributions, killing gradients for all but the "winning" key. Always use scaled dot-product attention (17-07) unless d_k is trivially small.
Assuming attention weights are interpretable or sparse. Softmax never outputs exactly zero, so every position attends to every other position with some positive weight. Interpreting attention weights as "this token looks at that token" is tempting but often misleading — the weights are a weighted average mechanism, not a causal explanation.
Ignoring the O(n²d) memory cost when choosing sequence length. The attention matrix A is n × n. For n = 2048, that's ~4M entries (32 MB in float32 for ONE head at ONE layer). For n = 100K, it's 10B entries (40 GB). Full attention is computationally infeasible at long context lengths without sparse or linear attention variants.
Forgetting that the attention gradient flows through the softmax Jacobian. When attention is nearly one-hot (sharp), the softmax Jacobian has entries ≈ 0 for all non-winning positions. Those key-value pairs stop receiving meaningful gradients, even though they might become relevant later. The √d_k scaling (17-07) keeps attention in a regime where gradients remain informative.

Next Steps

Continue to 17-07 — Scaled Dot-Product Attention to understand why scaling by √d_k is mathematically necessary and how masking enables autoregressive generation.

Progress

Phases

17-06 — Attention Mechanism (General)

Learning Objectives

Core Content

1. Motivation: Beyond Fixed-Length Context

2. The Query-Key-Value Abstraction

3. Scoring Functions

4. Full Attention Computation (Batch Form)

5. Attention as a Differentiable Dictionary

6. Backpropagation Through Attention

7. Computational Complexity

8. Self-Attention vs Cross-Attention

Key Terms

Worked Examples

Example 1: Computing Simple Attention

Example 2: Attention Weight Interpretation

Example 3: Gradient Through Attention

Quiz

Q1: In the attention mechanism, what are the three components and their roles?

Q2: What is the computational complexity of full attention for sequence length n and dimension d?

Q3: How does self-attention differ from cross-attention?

Q4: Why is softmax used on attention scores?

Q5: In additive (Bahdanau) attention, what role does the learned vector v play?

Practice Problems

Problem 1

Problem 2

Problem 3

Problem 4

Problem 5

Summary

Pitfalls

Next Steps