Math graphic
📐 Concept diagram

17-06 — Attention Mechanism (General)

Phase: 17 — Deep Learning Architectures (Math) Subject: 17-06 Prerequisites: 16-02 (Activation Functions — especially softmax), 16-05 (Backpropagation), 8-10 (inner products, dot products) Next subject: 17-07 — Scaled Dot-Product Attention


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the Query-Key-Value (QKV) abstraction from the concept of soft content-based lookup
  2. Explain how attention weights are computed as normalized similarity scores between queries and keys
  3. Compare additive (Bahdanau) and multiplicative (Luong) attention scoring functions mathematically
  4. Derive the gradient of attention output with respect to the keys and queries
  5. Explain the computational complexity of attention: O(n²d) and why it limits sequence length

Core Content

1. Motivation: Beyond Fixed-Length Context

RNNs (17-02) compress an entire input sequence into a fixed-size hidden state. This creates a bottleneck — information from early in the sequence must survive many recurrent steps to affect later outputs.

Attention solves this by giving the model DIRECT access to ALL positions in the input sequence. Instead of compressing everything into one vector, we compute a weighted sum where the weights are learned dynamically based on relevance.

2. The Query-Key-Value Abstraction

Attention is modeled after soft content-based lookup in a key-value database:

The process:

  1. Compute similarity between Query and each Key
  2. Normalize similarities to weights (attention distribution)
  3. Compute weighted sum of Values

Mathematically:

α_i = score(q, k_i) — scalar similarity for each position i a = softmax(α) — attention weight vector (sums to 1) output = Σ_i a_i · v_i — weighted sum of value vectors

⚠️ THIS IS CRITICAL — Attention is differentiable soft lookup. Instead of selecting ONE key-value pair, we compute a weighted average over ALL pairs, where the weights come from query-key similarity. The softmax makes this fully differentiable, enabling end-to-end training.

3. Scoring Functions

The score function measures compatibility between query and key vectors. Two classic approaches:

Additive Attention (Bahdanau, 2015):

score(q, k) = vᵀ · tanh(W₁·q + W₂·k)

Where v ∈ ℝ^d_a is a learned weight vector, W₁, W₂ ∈ ℝ^(d_a × d) project query and key to the attention dimension d_a.

Additive attention projects Q and K into a shared space, adds them, applies non-linearity, then projects to a scalar with v. The addition and tanh provide expressive non-linear scoring.

Multiplicative/Dot-Product Attention (Luong, 2015):

score(q, k) = qk (dot product)

Or with a learned compatibility matrix:

score(q, k) = qW_a k (general/bilinear attention)

Comparison: - Additive: O(d_a) per pair, can model complex non-linear relationships - Dot-product: O(d) per pair, much faster (matrix-multiply friendly), but assumes dot product captures similarity adequately

In practice, dot-product attention is dramatically faster on GPUs and became the dominant approach for Transformers.

4. Full Attention Computation (Batch Form)

For a sequence of n positions, with query vectors q_i, key vectors k_i, value vectors v_i:

Stack into matrices: Q ∈ ℝ^(n×d_q), K ∈ ℝ^(n×d_k), V ∈ ℝ^(n×d_v)

For self-attention (Q, K, V all come from the same sequence): - Typically d_q = d_k = d_v = d

The attention computation:

S = score_matrix(Q, K) — n×n matrix of raw scores A = softmax(S, dim=1) — each ROW sums to 1 (softmax over keys for each query) output = A · V — n×d weighted sum

Each row of A is the attention distribution for one query position.

5. Attention as a Differentiable Dictionary

Think of attention as:

  1. Q provides n search queries
  2. K provides n dictionary keys
  3. V provides n dictionary values
  4. A[i,j] = how much query i "attends to" key/value j
  5. Output[i] = Σ_j A[i,j] · V[j] — weighted average of values

When A[i,j] ≈ 1 for some j (sharp attention): output ≈ V[j] (hard lookup) When A[i,:] is diffuse (all entries ≈ 1/n): output ≈ mean(V) (averaging)

The model learns to make attention sharp when specific information is needed and diffuse when context aggregation is sufficient.

6. Backpropagation Through Attention

Let's derive gradients for dot-product attention:

S = QKᵀ, A = softmax(S), O = AV

Given ∂L/∂O (gradient from loss through output), we need ∂L/∂Q, ∂L/∂K, ∂L/∂V.

Gradient w.r.t. V:

∂L/∂V = Aᵀ · ∂L/∂O

Direct: O = AV, so ∂L/∂V = Aᵀ·∂L/∂O (standard matrix multiplication gradient).

Gradient w.r.t. A:

∂L/∂A = ∂L/∂O · V

Gradient w.r.t. S (through softmax):

∂L/∂S = A ⊙ (∂L/∂A − (A ⊙ ∂L/∂A) · 1₁ᵀ)

This is the softmax gradient: ∂a_i/∂s_j = a_i(δ_{ij} − a_j). In matrix form:

∂L/∂S = A ⊙ (∂L/∂A − (A ⊙ ∂L/∂A) · 1₁ᵀ)

Or equivalently: ∂L/∂S[i,j] = A[i,j]·(∂L/∂A[i,j] − Σ_k A[i,k]·∂L/∂A[i,k])

Gradient w.r.t. Q and K:

S = QKᵀ, so:

∂L/∂Q = ∂L/∂S · K ∂L/∂K = (∂L/∂S)ᵀ · Q

7. Computational Complexity

For sequence length n and dimension d: - Computing S = QKᵀ: O(n²d) — n queries × n keys × d dimensions - Computing A = softmax(S): O(n²) — normalize each row - Computing AV: O(n²d) — n×n matrix times n×d matrix - Total: O(n²d)

The quadratic dependence on n is the fundamental limitation of full attention. For n=1000, n²=1M; for n=100K, n²=10B — infeasible.

This motivates sparse attention, linear attention, and other efficient variants (covered in Phase 18-19).

8. Self-Attention vs Cross-Attention

Self-attention: Q, K, V all come from the SAME sequence. Each position attends to all positions (including itself). Used in encoder layers of Transformers.

Cross-attention: Q comes from one sequence (e.g., decoder), K and V come from another (e.g., encoder output). Used in encoder-decoder attention.

The mathematical operations are identical — only the source of K and V changes.



Key Terms

Worked Examples

Example 1: Computing Simple Attention

Problem: Q = [[1,0],[0,1]] (2 queries, d=2), K = [[1,0],[0,1],[1,1]] (3 keys, d=2), V = [[1,2],[3,4],[5,6]] (3 values, d=2). Compute attention output with dot-product scoring.

Solution: S = QKᵀ = [[1,0],[0,1]] · [[1,0,1],[0,1,1]] = [[1,0,1],[0,1,1]]

Row 0: exp([1,0,1]) = [2.718, 1.0, 2.718], sum = 6.436 A[0] = [2.718/6.436, 1/6.436, 2.718/6.436] = [0.4223, 0.1554, 0.4223]

Row 1: exp([0,1,1]) = [1.0, 2.718, 2.718], sum = 6.436 A[1] = [1/6.436, 2.718/6.436, 2.718/6.436] = [0.1554, 0.4223, 0.4223]

O = AV = [[0.4223·1+0.1554·3+0.4223·5, 0.4223·2+0.1554·4+0.4223·6], [0.1554·1+0.4223·3+0.4223·5, 0.1554·2+0.4223·4+0.4223·6]] = [[3.000, 4.000], [3.533, 4.533]]

Example 2: Attention Weight Interpretation

Problem: For the attention weights from Example 1, which key does Q₁=[0,1] attend to most? Why?

Solution: A[1] = [0.1554, 0.4223, 0.4223]. Q₁ attends equally to K₂=[0,1] and K₃=[1,1], both with weight 0.4223. K₁=[1,0] gets 0.1554.

Q₁·K₁ = 0, Q₁·K₂ = 1, Q₁·K₃ = 1. Keys 2 and 3 have equal dot products with Q₁, so they split the attention mass. Key 1 is orthogonal to Q₁ (dot product = 0), so it gets less attention (but still non-zero because softmax never gives zero).

Example 3: Gradient Through Attention

Problem: For Example 1, given ∂L/∂O = [[1,0],[0,1]], compute ∂L/∂V.

Solution: ∂L/∂V = Aᵀ · ∂L/∂O Aᵀ = [[0.4223, 0.1554],[0.1554, 0.4223],[0.4223, 0.4223]]

∂L/∂V = [[0.4223·1+0.1554·0, 0.4223·0+0.1554·1], [0.1554·1+0.4223·0, 0.1554·0+0.4223·1], [0.4223·1+0.4223·0, 0.4223·0+0.4223·1]] = [[0.4223, 0.1554],[0.1554, 0.4223],[0.4223, 0.4223]]

Note: each key's value gradient is the attention-weighted sum of query gradients.


Quiz

Q1: In the attention mechanism, what are the three components and their roles?

A) Query: what to output; Key: what to forget; Value: what to remember B) Query: "what am I looking for"; Key: "what does each position contain"; Value: "what information to retrieve" C) Query: attention weights; Key: hidden states; Value: gradients D) Query: the input sequence; Key: the output sequence; Value: the loss

Answer & Explanation **B** — The QKV abstraction models differentiable soft lookup: Query represents needs, Keys are indexes for matching, Values are content to retrieve. Attention computes Q-K similarity then takes a weighted sum of V.

Q2: What is the computational complexity of full attention for sequence length n and dimension d?

A) O(n·d) B) O(n·d²) C) O(n²·d) D) O(n²·d²)

Answer & Explanation **C) O(n²d)** — Computing S = QKᵀ is n×d @ d×n → O(n²d), and AV is n×n @ n×d → O(n²d). The n² factor from all pairwise query-key interactions is the fundamental scaling bottleneck.

Q3: How does self-attention differ from cross-attention?

A) Self-attention uses dot-product; cross-attention uses additive scoring B) Self-attention is faster C) In self-attention, Q, K, V come from the same sequence; in cross-attention, K and V come from a different sequence than Q D) Self-attention uses softmax; cross-attention does not

Answer & Explanation **C** — The mathematical operations are identical. Only the data source differs: self-attention has Q=K=V from one sequence; cross-attention takes Q from one sequence and K,V from another.

Q4: Why is softmax used on attention scores?

A) To make scores larger B) To normalize scores into a probability distribution (sums to 1) that is differentiable C) To reduce computational cost D) Because softmax is faster than other normalizations

Answer & Explanation **B** — Softmax converts raw scores into a distribution (non-negative, sums to 1), allowing output to be a weighted average. Crucially, softmax is differentiable (unlike argmax), enabling gradient-based training.

Q5: In additive (Bahdanau) attention, what role does the learned vector v play?

A) It provides the final output B) It projects the tanh output to a scalar score C) It acts as the query D) It normalizes the attention weights

Answer & Explanation **B** — score(q,k) = vᵀ · tanh(W₁q + W₂k). v ∈ ℝ^d_a maps the hidden representation to a scalar compatibility score. A = Values (V). C = Q. D = softmax.

Practice Problems

Problem 1

What are the dimensions of S = QKᵀ if Q ∈ ℝ^(5×64), K ∈ ℝ^(7×64)?

Answer S ∈ ℝ^(5×7). Q has 5 queries, K has 7 keys; dot products produce a 5×7 matrix.

Problem 2

Explain why attention complexity is O(n²d) and which term dominates for large n.

Answer Computing QKᵀ is n×d @ d×n = O(n²d). The softmax is O(n²). AV is n×n @ n×d = O(n²d). For large n, O(n²d) dominates. The n² comes from all pairwise query-key interactions.

Problem 3

What is the advantage of additive attention over dot-product attention? What is the disadvantage?

Answer Advantage: additive can model non-linear query-key relationships via tanh(W₁q + W₂k). Disadvantage: slower — requires a separate feedforward per (q,k) pair instead of a single matrix multiply QKᵀ.

Problem 4

If attention weight A[i,j] = 1 and A[i,k≠j] = 0, what is the output at position i?

Answer O[i] = Σ_k A[i,k]·V[k] = V[j]. Sharp attention = hard selection of one value. All other values are completely ignored.

Problem 5

Derive ∂L/∂Q given ∂L/∂S and K, where S = QKᵀ.

Answer S = QKᵀ means S[i,j] = Σ_d Q[i,d]·K[j,d]. So ∂S[i,j]/∂Q[i,d] = K[j,d] and ∂S[i,j]/∂Q[p≠i,d] = 0. Therefore ∂L/∂Q = ∂L/∂S · K, where ∂L/∂S ∈ ℝ^(n×n) and K ∈ ℝ^(n×d).

Summary

  1. Attention is differentiable soft lookup: output = Σ softmax(score(Q,K)) · V — weighted average of values based on query-key similarity
  2. The QKV abstraction separates "what to look for" (Q), "what's available" (K), and "what to retrieve" (V)
  3. Scoring functions: additive (vᵀtanh(W₁q+W₂k)) is more expressive; dot-product (qᵀk) is faster and GPU-friendly
  4. Attention complexity is O(n²d) — the n² from pairwise Q-K interactions is the fundamental scaling bottleneck
  5. Self-attention (Q=K=V from same sequence) vs. cross-attention (Q from one seq, K,V from another) differ only in data source, not computation

Pitfalls


Next Steps

Continue to 17-07 — Scaled Dot-Product Attention to understand why scaling by √d_k is mathematically necessary and how masking enables autoregressive generation.