18-04 — Rotary Position Embeddings (RoPE) — Deep
Phase: 18 — Large Language Model Mathematics
Subject: 18-04
Prerequisites: 18-03 (Positional Encodings), 08-04 (Matrices as Linear Transformations), 02-08 (Trigonometric Functions), 02-10 (Vectors — Dot Product), 17-07 (Scaled Dot-Product Attention)
Next subject: 18-05 — Decoder-Only Architecture
Learning Objectives
By the end of this subject, you will be able to:
- Construct the 2D rotation matrix R(θ) and prove it preserves vector norms and dot products
- Build the full RoPE rotation matrix for d-dimensional queries/keys as a block-diagonal of 2D rotations
- Prove rigorously that RoPE encodes relative position: q_mᵀRᵀ(θ,m)R(θ,n)k_n = q_mᵀR(θ, n−m)k_n
- Derive the frequency schedule for RoPE: θ_i = base^(−2i/d) and explain the geometric spacing
- Analyze RoPE extrapolation, including linear interpolation (PI), NTK-aware scaling, and YaRN
Core Content
1. The 2D Rotation Matrix (Building Block)
A rotation in 2D by angle φ is represented by:
R(φ) = [[cos(φ), −sin(φ)], [sin(φ), cos(φ)]]
Properties:
- Orthogonal: R(φ)ᵀR(φ) = I, so the transpose is the inverse, which is the reverse rotation: R(φ)ᵀ = R(φ)⁻¹ = R(−φ)
- Determinant 1: det(R(φ)) = cos²(φ) + sin²(φ) = 1
- Norm preserving: ||R(φ)·v|| = ||v|| for any vector v
- Additive: R(φ₁)·R(φ₂) = R(φ₁ + φ₂) — composing rotations adds angles
- Dot product preserving: (R(φ)u)ᵀ(R(φ)v) = uᵀR(φ)ᵀR(φ)v = uᵀv
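The properties listed above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names (`rot2d`, `matvec`, `matmul`, `dot`) and the test vectors and angles are illustrative assumptions, not part of any library.

```python
import math

def rot2d(phi):
    """Return the 2x2 rotation matrix R(phi) as nested lists."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s], [s, c]]

def matvec(M, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def matmul(A, B):
    """Multiply two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dot(u, v):
    """Dot product of two length-2 vectors."""
    return u[0] * v[0] + u[1] * v[1]

# Arbitrary test vectors and angles (assumptions for illustration).
u, v = [3.0, 4.0], [-1.0, 2.0]
phi1, phi2 = 0.7, 1.9

Ru, Rv = matvec(rot2d(phi1), u), matvec(rot2d(phi1), v)

norm_before, norm_after = math.hypot(*u), math.hypot(*Ru)  # norm preserving
dot_before, dot_after = dot(u, v), dot(Ru, Rv)             # dot product preserving
composed = matmul(rot2d(phi1), rot2d(phi2))                # R(phi1) R(phi2)
direct = rot2d(phi1 + phi2)                                # R(phi1 + phi2)

print("norms:", norm_before, norm_after)
print("dots: ", dot_before, dot_after)
```

The pairs printed on each line should agree up to floating-point error, and `composed` should match `direct` entry by entry.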
2. Applying Rotation to Vector Pairs
RoPE applies rotations in pairs of dimensions. For a d-dimensional vector x, reshape into d/2 pairs:
x = [x₀, x₁, x₂, x₃, ..., x_{d−2}, x_{d−1}]
Treat each consecutive pair (x_{2i}, x_{2i+1}) as a 2D vector and rotate it.
For a single pair with angle θ:
[x'_{2i}, x'_{2i+1}]ᵀ = R(θ) · [x_{2i}, x_{2i+1}]ᵀ = [x_{2i}·cos(θ) − x_{2i+1}·sin(θ), x_{2i}·sin(θ) + x_{2i+1}·cos(θ)]ᵀ
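As a sketch of the pairwise rotation, the function below rotates each consecutive pair by its own angle without ever building a matrix. The name `rotate_pairs` and the sample input are assumptions made for illustration.

```python
import math

def rotate_pairs(x, angles):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by angles[i]."""
    if len(x) % 2 != 0 or len(angles) != len(x) // 2:
        raise ValueError("need an even-length vector and one angle per pair")
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        # Same formula as the single-pair rotation above.
        out.extend([a * c - b * s, a * s + b * c])
    return out

# Pair 0 = (1, 0) rotated by pi/2; pair 1 = (0, 2) rotated by pi.
x = [1.0, 0.0, 0.0, 2.0]
y = rotate_pairs(x, [math.pi / 2, math.pi])
print(y)
```

Up to floating-point error, the first pair lands on (0, 1) and the second on (0, −2), as expected for quarter-turn and half-turn rotations.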
3. The Full RoPE Rotation Matrix
For d dimensions with per-pair angles θ₀, θ₁, ..., θ_{d/2−1}, the full rotation matrix is block-diagonal:
R(Θ) = [[R(θ₀), 0, ..., 0], [ 0, R(θ₁), ..., 0], [ ..., ..., ..., ...], [ 0, 0, ..., R(θ_{d/2−1})]]
Each block is a 2×2 rotation matrix. This is a d×d orthogonal matrix.
RoPE applies position-dependent rotation: For position pos, the angle for dimension pair i is:
θ_{i,pos} = pos · ω_i
where ω_i = base^(−2i/d) and typically base = 10000 (but modern LLMs often use larger bases like 500,000 or 1,000,000).
So R_Θ(pos) has blocks R(pos·ω_i).
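A minimal sketch of the position-dependent rotation, assuming the frequency schedule ω_i = base^(−2i/d) stated above. The function names `rope_frequencies` and `apply_rope` are hypothetical; real implementations typically vectorise this and may pair dimensions differently.

```python
import math

def rope_frequencies(d, base=10000.0):
    """omega_i = base**(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def apply_rope(x, pos, base=10000.0):
    """Apply R_Theta(pos) to x by rotating pair i through the angle pos * omega_i."""
    out = []
    for i, omega in enumerate(rope_frequencies(len(x), base)):
        theta = pos * omega
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

freqs = rope_frequencies(8)           # d = 8 gives four frequencies
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x_at_0 = apply_rope(x, pos=0)         # position 0: no rotation
x_at_5 = apply_rope(x, pos=5)
norm = lambda v: math.sqrt(sum(t * t for t in v))
print(freqs)
print(norm(x), norm(x_at_5))
```

For d = 8 and base = 10000 the frequencies are 1, 0.1, 0.01 and 0.001 (up to floating-point error), position 0 leaves the vector unchanged, and the norm is the same before and after rotation.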
4. The RoPE Operation
⚠️ THIS IS CRITICAL — RoPE does NOT modify the input embeddings. It modifies the QUERY and KEY vectors AFTER the linear projections, BEFORE computing attention scores.
For a query vector q at position m and a key vector k at position n:
q'_m = R_Θ(m) · q_m
k'_n = R_Θ(n) · k_n
The attention score between position m (query) and position n (key):
score(m, n) = q'_mᵀ k'_n
5. Proof: RoPE Encodes Relative Position
This is the central theorem of RoPE. Let's prove it.
score(m, n) = (R_Θ(m)·q_m)ᵀ (R_Θ(n)·k_n) = q_mᵀ R_Θ(m)ᵀ R_Θ(n) k_n
Now consider R_Θ(m)ᵀ R_Θ(n) block by block:
- For each 2×2 block i, the product is R(m·ω_i)ᵀ R(n·ω_i)
- R(m·ω_i)ᵀ = R(−m·ω_i) (the transpose of a rotation is the reverse rotation)
- R(−m·ω_i) · R(n·ω_i) = R((n−m)·ω_i) (composing rotations adds the angles)
Therefore:
R_Θ(m)ᵀ R_Θ(n) = R_Θ(n − m)
And:
score(m, n) = q_mᵀ R_Θ(n − m) k_n
The score depends on (n−m), the relative position! The position information has been "absorbed" into a single rotation applied to k_n, parameterized by the relative offset.
Expanding the dot product: For dimension pair i:
Contribution = (q_{2i}, q_{2i+1})ᵀ R((n−m)·ω_i) (k_{2i}, k_{2i+1})
= q_{2i}·[k_{2i}·cos((n−m)·ω_i) − k_{2i+1}·sin((n−m)·ω_i)] + q_{2i+1}·[k_{2i}·sin((n−m)·ω_i) + k_{2i+1}·cos((n−m)·ω_i)]
= (q_{2i}·k_{2i} + q_{2i+1}·k_{2i+1})·cos((n−m)·ω_i) + (q_{2i+1}·k_{2i} − q_{2i}·k_{2i+1})·sin((n−m)·ω_i)
The total score is the sum of these contributions over all i.
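The theorem can be sanity-checked numerically: shifting both positions by the same amount should leave the score unchanged. This is a sketch with hypothetical helper names (`rope`, `score`, `score_relative`) and arbitrary test vectors; `score_relative` sums the per-pair contribution qᵀR((n−m)·ω_i)k written out directly.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate pair i of x through the angle pos * base**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def score(q, k, m, n, base=10000.0):
    """Dot product of the query rotated to position m and the key rotated to n."""
    return sum(a * b for a, b in zip(rope(q, m, base), rope(k, n, base)))

def score_relative(q, k, delta, base=10000.0):
    """Sum over pairs of q_pair^T R(delta * omega_i) k_pair."""
    d = len(q)
    total = 0.0
    for i in range(d // 2):
        ang = delta * base ** (-2.0 * i / d)
        c, s = math.cos(ang), math.sin(ang)
        q0, q1, k0, k1 = q[2 * i], q[2 * i + 1], k[2 * i], k[2 * i + 1]
        total += q0 * (k0 * c - k1 * s) + q1 * (k0 * s + k1 * c)
    return total

# Arbitrary 4-dimensional test vectors (assumptions for illustration).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

s_near = score(q, k, m=3, n=7)        # offset n - m = 4
s_far = score(q, k, m=100, n=104)     # same offset, shifted positions
s_rel = score_relative(q, k, delta=4)
print(s_near, s_far, s_rel)
```

All three printed values should agree up to floating-point error, since each depends only on the offset n − m = 4.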
6. Frequency Schedule (Base Parameter)
The frequencies: ω_i = base^(−2i/d), for i = 0, 1, ..., d/2−1.
Geometric spacing: The ratios ω_i/ω_{i+1} = base^(2/d) are constant. This gives equal spacing on a log scale, covering a wide range of wavelengths.
Effect of base (using the approximation λ_max ≈ 2π·base; the exact value 2π·base^((d−2)/d) is slightly smaller):
- base = 10000: λ_max ≈ 2π·10000 ≈ 62,832 positions (original RoPE setting)
- base = 500,000: λ_max ≈ 2π·500,000 ≈ 3.14M positions (Llama 2 Long, Llama 3)
- base = 1,000,000: λ_max ≈ 6.28M positions (Code Llama)
Larger base → longer wavelengths → better extrapolation to long sequences. The lowest frequencies are the bottleneck for long-context performance.
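To make the wavelength range concrete, the sketch below computes λ_i = 2π/ω_i for a few bases. The head dimension d = 128 is an assumption chosen for illustration, and `wavelengths` is a hypothetical helper name.

```python
import math

def wavelengths(d, base):
    """lambda_i = 2*pi / omega_i, with omega_i = base**(-2i/d)."""
    return [2 * math.pi * base ** (2.0 * i / d) for i in range(d // 2)]

d = 128  # assumed head dimension
lam_10k = wavelengths(d, 10_000)

for base in (10_000, 500_000, 1_000_000):
    lam = wavelengths(d, base)
    # Shortest wavelength, exact longest wavelength, and the 2*pi*base approximation.
    print(base, round(lam[0], 2), round(lam[-1]), round(2 * math.pi * base))
```

The shortest wavelength is always 2π regardless of base, consecutive wavelengths differ by the constant factor base^(2/d), and the exact longest wavelength sits somewhat below the 2π·base approximation.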
7. Why RoPE Works So Well
Preserves dot product geometry: Since R_Θ is orthogonal, ||q'|| = ||q|| and ||k'|| = ||k||. RoPE doesn't change the magnitude of Q and K vectors — it only rotates them. This means the "content-based" attention behavior is preserved while position is injected.
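Tends to decay with relative distance: For a fixed (q,k) pair, each dimension pair contributes a term that oscillates with |n−m| at its own frequency ω_i. At n = m all pairs are in phase; as |n−m| grows they drift out of phase and partially cancel, so for well-aligned q and k the score tends to shrink on average, although not monotonically. At specific relative positions where cos((n−m)·ω_i) ≈ 1 for all i, the score is maximal; these are the "resonant" positions.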
No separate PE parameters: RoPE is parameter-free (given the base). No learned positional embeddings, no additional parameters.
Works with weight tying: Since RoPE is applied after Q/K projections, it doesn't interfere with input embeddings or output projections.
8. Extrapolation and Scaling
RoPE doesn't naturally extrapolate well past training length because the rotation angles for very large positions may not have been encountered during training. Several scaling methods exist:
Linear Interpolation (Position Interpolation, PI)
Scale positions down: pos' = pos · (L_train / L_target). For a model trained on 2K and evaluated on 4K, all positions are halved before computing RoPE. This compresses the rotations into the trained range but "crowds" nearby positions.
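A minimal sketch of Position Interpolation under the 2K to 4K setting described above. The function names and the small d = 8 are assumptions for illustration.

```python
def interpolated_position(pos, train_len, target_len):
    """Position Interpolation: squeeze [0, target_len) into [0, train_len)."""
    return pos * (train_len / target_len)

def rope_angles(pos, d, base=10000.0):
    """Rotation angle pos * omega_i for every dimension pair."""
    return [pos * base ** (-2.0 * i / d) for i in range(d // 2)]

train_len, target_len, d = 2048, 4096, 8
pos = 3000                                   # beyond the trained range
scaled = interpolated_position(pos, train_len, target_len)

angles_raw = rope_angles(pos, d)             # angles the model never saw
angles_pi = rope_angles(scaled, d)           # angles inside the trained range

# Neighbouring tokens are now only half a position apart ("crowding").
gap = (interpolated_position(11, train_len, target_len)
       - interpolated_position(10, train_len, target_len))
print(scaled, gap)
```

Every angle is halved, which keeps it within the range seen in training, at the cost of shrinking the angular gap between adjacent tokens from ω_i to ω_i/2.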
NTK-Aware Scaling
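Instead of scaling positions, scale the base: base' = base · α^(d/(d−2)), where α = L_target / L_train is the context extension factor. Each frequency becomes ω'_i = ω_i · α^(−2i/(d−2)): the highest frequency (i = 0, short-range) is unchanged, while the lowest frequency (i = d/2 − 1, long-range) is divided by α. This avoids crowding nearby positions.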
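The effect of the base change on individual frequencies can be seen in a short sketch. The values d = 128 and α = 4 are assumptions chosen for illustration, and the function names are hypothetical.

```python
def ntk_scaled_base(base, alpha, d):
    """NTK-aware scaling: base' = base * alpha**(d / (d - 2))."""
    return base * alpha ** (d / (d - 2))

def frequencies(d, base):
    """omega_i = base**(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

d, base, alpha = 128, 10000.0, 4.0   # assumed head dimension and extension factor
new_base = ntk_scaled_base(base, alpha, d)

old = frequencies(d, base)
new = frequencies(d, new_base)

ratio_high = new[0] / old[0]         # highest frequency: unchanged
ratio_low = new[-1] / old[-1]        # lowest frequency: divided by alpha
print(new_base, ratio_high, ratio_low)
```

The exponent d/(d−2) is what makes the lowest frequency shrink by exactly α, while intermediate frequencies shrink by a factor between 1 and α.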
YaRN (Yet another RoPE extensioN)
Combines NTK-aware scaling, applied selectively per frequency band ("NTK-by-parts"), with a temperature adjustment to the attention softmax, further improving long-context extrapolation.
Pitfalls
⚠️ Pitfall 1: Applying RoPE to the wrong vectors. RoPE rotates Q and K AFTER their linear projections, NOT the input embeddings. Rotating the embeddings directly would break the position-encoding property because RoPE needs to act on Q and K separately to produce the relative-position dot product.
⚠️ Pitfall 2: Mixing up the rotation direction. R(θ) = [[cosθ, -sinθ], [sinθ, cosθ]] rotates counterclockwise. But R(m)ᵀR(n) = R(n-m) regardless of the sign convention — the transpose reverses the rotation, so the composition always gives the difference. Just be consistent.
⚠️ Pitfall 3: Forgetting that RoPE's "decay" is oscillatory, not monotonic. The attention score as a function of distance Δ involves cos(Δ·ω) and sin(Δ·ω) terms, so it oscillates: a token at distance 100 might get MORE attention than one at distance 50, depending on the rotation angle. This is different from ALiBi, whose linear distance penalty on the attention logits is monotonic.
Worked Examples
Example 1: 2D RoPE Computation
Problem: For a query vector q = [1.0, 2.0] at position m = 3 and key vector k = [0.5, 1.5] at position n = 7, with ω = 0.5 rad/position, compute the RoPE attention score.
Solution:
Rotation angles:
θ_m = 3·0.5 = 1.5 rad
θ_n = 7·0.5 = 3.5 rad
Rotated q: q'₀ = 1.0·cos(1.5) − 2.0·sin(1.5) = 1.0·0.0707 − 2.0·0.9975 = 0.0707 − 1.9950 = −1.9243
q'₁ = 1.0·sin(1.5) + 2.0·cos(1.5) = 1.0·0.9975 + 2.0·0.0707 = 0.9975 + 0.1414 = 1.1389
Rotated k: k'₀ = 0.5·cos(3.5) − 1.5·sin(3.5) = 0.5·(−0.9365) − 1.5·(−0.3508) = −0.4682 + 0.5262 = 0.0580
k'₁ = 0.5·sin(3.5) + 1.5·cos(3.5) = 0.5·(−0.3508) + 1.5·(−0.9365) = −0.1754 − 1.4048 = −1.5802
Score = q'·k' = (−1.9243)(0.0580) + (1.1389)(−1.5802) = −0.1116 − 1.7997 = −1.9113
Check with the relative-position formula: Δθ = (7−3)·0.5 = 2.0 rad
q'ᵀk' = qᵀR(Δθ)k = q₀·(k₀·cos(Δθ) − k₁·sin(Δθ)) + q₁·(k₀·sin(Δθ) + k₁·cos(Δθ)) = (q₀k₀ + q₁k₁)·cos(Δθ) + (q₁k₀ − q₀k₁)·sin(Δθ)
Content term: q₀k₀ + q₁k₁ = 1.0·0.5 + 2.0·1.5 = 3.5
Cross term: q₁k₀ − q₀k₁ = 2.0·0.5 − 1.0·1.5 = −0.5
Score = 3.5·cos(2.0) − 0.5·sin(2.0) = 3.5·(−0.4161) − 0.5·(0.9093) = −1.4564 − 0.4547 = −1.9111 ✓
The two results agree up to rounding of the four-digit trig values (the unrounded score is −1.91116).
Watch the sign of the cross term: it is (q₁k₀ − q₀k₁), not (q₀k₁ − q₁k₀). Using the wrong order gives 3.5·(−0.4161) + 0.5·(0.9093) = −1.0018, which does not match the direct computation.
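The numbers in this example can be reproduced without hand-rounded trig values. The sketch below computes the score both ways; `rotate` is a hypothetical helper name.

```python
import math

def rotate(v, theta):
    """Apply the 2D rotation R(theta) to a length-2 vector."""
    c, s = math.cos(theta), math.sin(theta)
    return [v[0] * c - v[1] * s, v[0] * s + v[1] * c]

q, k = [1.0, 2.0], [0.5, 1.5]
m, n, omega = 3, 7, 0.5

# Direct route: rotate q and k separately, then take the dot product.
q_rot = rotate(q, m * omega)
k_rot = rotate(k, n * omega)
direct = q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]

# Relative route: one rotation by (n - m) * omega, written in closed form.
delta = (n - m) * omega
relative = ((q[0] * k[0] + q[1] * k[1]) * math.cos(delta)
            + (q[1] * k[0] - q[0] * k[1]) * math.sin(delta))

print(round(direct, 4), round(relative, 4))
```

Both routes give the same value at full precision, which confirms that the small difference between the hand computations comes from rounding the intermediate trig values.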
Example 2: Building the 4D RoPE Rotation Matrix
Problem: Construct R_Θ(pos) for d = 4, base = 10000, and pos = 5.
Solution:
d = 4, so i ∈ {0, 1}.
ω₀ = 10000^(−0/4) = 1
ω₁ = 10000^(−2/4) = 10000^(−0.5) = 0.01
Angles:
θ₀ = 5·1 = 5 rad
θ₁ = 5·0.01 = 0.05 rad
Using the standard convention R(θ) = [[cosθ, −sinθ], [sinθ, cosθ]] for each block:
R_Θ(5) = [[cos(5), −sin(5), 0, 0], [sin(5), cos(5), 0, 0], [0, 0, cos(0.05), −sin(0.05)], [0, 0, sin(0.05), cos(0.05)]]
= [[0.2837, 0.9589, 0, 0], [−0.9589, 0.2837, 0, 0], [0, 0, 0.9988, −0.0500], [0, 0, 0.0500, 0.9988]]
The signs in the first block look flipped only because sin(5) ≈ −0.9589 is negative.
Note: R_Θ(5) is orthogonal, so its transpose is its inverse.
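A sketch that builds the same block-diagonal matrix programmatically and checks orthogonality. The function name `rope_matrix` is an assumption; practical implementations rarely materialise this matrix and rotate the pairs directly instead.

```python
import math

def rope_matrix(pos, d, base=10000.0):
    """Build the d x d block-diagonal RoPE rotation matrix for one position."""
    R = [[0.0] * d for _ in range(d)]
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        r = 2 * i
        # Standard convention: [[cos, -sin], [sin, cos]] on the diagonal.
        R[r][r], R[r][r + 1] = c, -s
        R[r + 1][r], R[r + 1][r + 1] = s, c
    return R

R5 = rope_matrix(pos=5, d=4)
for row in R5:
    print([round(v, 4) for v in row])

# Orthogonality check: R^T R should be the identity.
RtR = [[sum(R5[k][i] * R5[k][j] for k in range(4)) for j in range(4)]
       for i in range(4)]
```

The printed rows should match the numeric matrix in the solution, and `RtR` should equal the 4×4 identity up to floating-point error.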
Example 3: RoPE Decay with Distance
Problem: For d = 2, ω = 0.5, and content vectors q = [1, 0], k = [1, 0] (perfectly aligned, unit length), compute the attention score as a function of relative distance Δ = n−m.
Solution:
qᵀk = 1·1 + 0·0 = 1 (content score, would be 1 without position)
With RoPE, Δθ = Δ·0.5: score(Δ) = (q₀k₀ + q₁k₁)·cos(Δθ) + (q₁k₀ − q₀k₁)·sin(Δθ) = 1·cos(0.5Δ) + 0·sin(0.5Δ) = cos(0.5Δ)
| Δ | score |
|---|---|
| 0 | cos(0) = 1.000 |
| 1 | cos(0.5) ≈ 0.8776 |
| 2 | cos(1.0) ≈ 0.5403 |
| 3 | cos(1.5) ≈ 0.0707 |
| 4 | cos(2.0) ≈ −0.4161 |
| 5 | cos(2.5) ≈ −0.8011 |
| 6 | cos(3.0) ≈ −0.9900 |
| 7 | cos(3.5) ≈ −0.9365 |
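The score oscillates with period 2π/ω ≈ 12.57 positions. At Δ = π/ω = π/0.5 ≈ 6.28, the score reaches −1 (complete anti-alignment), and at Δ ≈ 12.57 it returns to +1. With a single frequency the amplitude never shrinks; in a full d-dimensional head, many frequencies are summed and drift out of phase, which is what makes attention tend to fall off with distance in an oscillatory fashion, not monotonically.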
This means RoPE produces a "soft window" — tokens at certain distances receive less attention due to rotational misalignment.
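The table can be regenerated in a few lines, and the same idea extends to several frequencies at once. In this sketch, `multi_pair_score` assumes (purely for illustration) that q = k with every pair equal to [1, 0], normalised so the score at Δ = 0 is 1; d = 64 and base = 10000 are likewise assumptions.

```python
import math

def single_pair_score(delta, omega=0.5):
    """Score for q = k = [1, 0] at relative distance delta: cos(omega * delta)."""
    return math.cos(omega * delta)

def multi_pair_score(delta, d=64, base=10000.0):
    """Average of cos(delta * omega_i) over all d/2 pairs (illustrative setup)."""
    half = d // 2
    return sum(math.cos(delta * base ** (-2.0 * i / d)) for i in range(half)) / half

# Single frequency: reproduces the table above.
table = [(delta, round(single_pair_score(delta), 4)) for delta in range(8)]
for delta, s in table:
    print(delta, s)

# Many frequencies: the pairs drift out of phase as delta grows.
distances = (0, 1, 4, 16, 64, 256)
multi = [multi_pair_score(delta) for delta in distances]
for delta, s in zip(distances, multi):
    print(delta, round(s, 4))
```

The single-pair score is a pure cosine with constant amplitude, while the multi-pair average starts at 1 and can only stay at or below that value as the pairs fall out of phase.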
Quiz
Q1: In RoPE, what does "applying rotation to vector pairs" mean?
A) Rotating the whole d-dimensional vector by a single angle in one plane
B) Adding a sinusoidal position vector to the input embedding
C) Swapping the even and odd coordinates of the vector
D) Treating each consecutive pair (x_{2i}, x_{2i+1}) as a 2D vector and rotating it by its own angle
Correct: D)
- If you chose A: This is incorrect. RoPE uses d/2 independent 2D rotations, one per pair, each with its own angle pos·ω_i.
- If you chose B: This is incorrect. Nothing is added to the embeddings; RoPE rotates Q and K after their linear projections.
- If you chose C: This is incorrect. A rotation mixes the two coordinates of a pair through cos and sin; it is not a swap.
- If you chose D: Each pair (x_{2i}, x_{2i+1}) is rotated by R(pos·ω_i). Correct!
Q2: According to the Pitfalls section, how does a RoPE attention score behave as relative distance grows?
A) It decreases monotonically with distance
B) It oscillates, so a token at distance 100 can score higher than one at distance 50
C) It stays constant, because rotations preserve dot products
D) It grows without bound
Correct: B)
- If you chose A: This is incorrect. The score involves cos(Δ·ω) and sin(Δ·ω) terms, so it is not monotonic.
- If you chose B: The score involves cos(Δ·ω) and sin(Δ·ω) terms, so it oscillates with distance. Correct!
- If you chose C: This is incorrect. Q and K are rotated by different angles, so their dot product changes with the offset n−m.
- If you chose D: This is incorrect. Rotations preserve norms, so the score is bounded by ||q||·||k||.
Q3: Which statement about Extrapolation and Scaling is TRUE?
A) Position Interpolation multiplies the base by α^(d/(d−2))
B) Position Interpolation rescales positions by L_train / L_target before computing the rotation angles
C) NTK-aware scaling rescales positions and leaves the base unchanged
D) RoPE extrapolates reliably to any length without modification
Correct: B)
- If you chose A: This is incorrect. Multiplying the base by α^(d/(d−2)) is NTK-aware scaling, not Position Interpolation.
- If you chose B: Position Interpolation uses pos' = pos · (L_train / L_target), which compresses positions into the trained range. Correct!
- If you chose C: This is incorrect. NTK-aware scaling changes the base and leaves the positions unchanged.
- If you chose D: This is incorrect. Rotation angles at positions beyond the training length may not have been encountered during training.
Q4: What does the product R_Θ(m)ᵀ R_Θ(n) simplify to?
A) R_Θ(n − m)
B) R_Θ(n + m)
C) R_Θ(m − n)
D) The identity matrix I
Correct: A)
- If you chose A: The transpose reverses the first rotation, so the angles subtract: R_Θ(m)ᵀ R_Θ(n) = R_Θ(n − m). Correct!
- If you chose B: This is incorrect. The transpose turns R_Θ(m) into R_Θ(−m), so the angles subtract and do not add.
- If you chose C: This is incorrect. R_Θ(m − n) is the transpose of the correct answer; it would result from R_Θ(n)ᵀ R_Θ(m).
- If you chose D: This is incorrect. The product is the identity only when n = m.
Q5: How does NTK-aware scaling relate to the frequency schedule ω_i = base^(−2i/d)?
A) It leaves the schedule unchanged and rescales the positions
B) It enlarges the base, which lowers the low frequencies far more than the high ones
C) It multiplies every frequency by the same factor
D) It replaces the geometric schedule with learned frequencies
Correct: B)
- If you chose A: This is incorrect. Rescaling positions while keeping the schedule fixed is Position Interpolation.
- If you chose B: With base' = base · α^(d/(d−2)), the low frequencies are stretched more than the high frequencies. Correct!
- If you chose C: This is incorrect. A uniform factor on every frequency is equivalent to Position Interpolation; NTK-aware scaling changes each ω_i by a factor that depends on i.
- If you chose D: This is incorrect. The frequencies remain a fixed function of the base; RoPE has no learned positional parameters.
Q6: Which of the following is a genuine pitfall when working with the 2D rotation matrix in RoPE?
A) Mixing sign conventions for R(θ) between the query and key rotations
B) Expecting R(θ) to preserve vector norms
C) Expecting R(θ)ᵀ to equal R(−θ)
D) Expecting R(φ₁)·R(φ₂) to equal R(φ₁ + φ₂)
Correct: A)
- If you chose A: R(m)ᵀR(n) = R(n−m) holds under either sign convention, but only if the same convention is used for both Q and K. Correct!
- If you chose B: This is incorrect. Norm preservation is a true property of R(θ), not a pitfall.
- If you chose C: This is incorrect. R(θ)ᵀ = R(−θ) is a true property: the transpose of a rotation is the reverse rotation.
- If you chose D: This is incorrect. Additivity of angles is a true property of 2D rotations.
Q7: At what point in the attention computation is the full RoPE rotation matrix R_Θ(pos) applied?
A) To the token embeddings, before the Q/K/V projections
B) To Q and K after their linear projections, before the attention scores are computed
C) To the attention weights, after the softmax
D) To the output logits of the model
Correct: B)
- If you chose A: This is incorrect. RoPE does not modify the input embeddings.
- If you chose B: RoPE rotates the query and key vectors after the linear projections and before the scores q'ᵀk' are computed. Correct!
- If you chose C: This is incorrect. The rotation must act on Q and K so that their dot product depends on n−m; it is not applied after the softmax.
- If you chose D: This is incorrect. RoPE acts inside attention and does not interfere with the output projection.
Practice Problems
Problem 1
For d = 8, base = 10000, what are the rotation angles for dimension pair i = 2 at position pos = 100?
Answer
ω₂ = 10000^(−4/8) = 10000^(−0.5) = 0.01
θ = pos · ω₂ = 100 · 0.01 = 1.0 rad
The 2D vector at dimensions (4,5) is rotated by 1.0 rad.
Problem 2
Prove that RoPE preserves the norm of query and key vectors: ||R_Θ(pos)·q|| = ||q||.
Answer
R_Θ(pos) is block-diagonal with each block being a 2D rotation matrix R(θ_i). Each R(θ_i) is orthogonal: R(θ_i)ᵀR(θ_i) = I. The block-diagonal matrix is therefore orthogonal: R_ΘᵀR_Θ = I.
||R_Θ·q||² = (R_Θ·q)ᵀ(R_Θ·q) = qᵀR_ΘᵀR_Θ·q = qᵀ·q = ||q||²
Thus ||R_Θ·q|| = ||q||. RoPE rotates but doesn't scale.
Problem 3
Explain why larger base values help with long-context extrapolation.
Answer
The lowest frequency is ω_min = ω_{d/2−1} = base^(−(d−2)/d) ≈ 1/base, which gives the longest wavelength λ_max ≈ 2π·base. For base = 10000, λ_max ≈ 62,832 positions. In a sequence of length 100,000, even this slowest pair has completed more than one full rotation, so its angle no longer identifies the position unambiguously, and the faster pairs have wrapped around many times. A larger base lowers every frequency except ω₀ = 1, so the slow pairs stay within a single cycle over a longer range: with base = 1,000,000, λ_max ≈ 6.28M, and at position 100,000 the slowest pairs have covered only a small fraction of a turn.
Problem 4
Given q = [a, b] and k = [c, d], derive the explicit formula for q'ᵀk' under RoPE at relative distance Δ.
Answer
q' = [a·cos(θ_q) − b·sin(θ_q), a·sin(θ_q) + b·cos(θ_q)]
k' = [c·cos(θ_k) − d·sin(θ_k), c·sin(θ_k) + d·cos(θ_k)]
q'ᵀk' = (a·cosθ_q − b·sinθ_q)(c·cosθ_k − d·sinθ_k) + (a·sinθ_q + b·cosθ_q)(c·sinθ_k + d·cosθ_k)
Expanding and using the angle-difference identities:
= ac·cos(θ_k−θ_q) + bd·cos(θ_k−θ_q) + bc·sin(θ_k−θ_q) − ad·sin(θ_k−θ_q)
= (ac+bd)·cos(Δθ) + (bc−ad)·sin(Δθ)
where Δθ = θ_k − θ_q. The order matters: sin is odd, so swapping it flips the sign of the second term. With θ = pos·ω: Δθ = (n−m)·ω for query at m, key at n.
Problem 5
A model was trained with RoPE and base=10000 on sequences up to length 2048. You want to fine-tune it for length 8192 using NTK-aware scaling with α = 4. What is the new effective base?
Answer
NTK-aware scaling: base' = base · α^(d/(d−2))
For d = 128 (a typical head dimension): d/(d−2) = 128/126 ≈ 1.0159
base' = 10000 · 4^1.0159 ≈ 10000 · 4.089 ≈ 40,890
Each frequency changes by ω'_i = ω_i · α^(−2i/(d−2)). The lowest frequency (i = d/2 − 1) is divided by exactly α = 4, so the maximum wavelength grows fourfold; in the approximation λ_max ≈ 2π·base it goes from ~62.8K to ~257K positions. The highest frequency (i = 0) is unchanged, and the neighbouring high frequencies are barely affected.
Summary
- RoPE applies 2D rotations to Q and K vectors in pairs of dimensions: [q'_{2i}, q'_{2i+1}]ᵀ = R(pos·ω_i)·[q_{2i}, q_{2i+1}]ᵀ
- The central theorem: q_mᵀR_Θ(m)ᵀR_Θ(n)k_n = q_mᵀR_Θ(n−m)k_n — the dot product depends only on relative position
- Frequencies are geometrically spaced: ω_i = base^(−2i/d), giving wavelengths from 2π ≈ 6.28 up to roughly 2π·base positions
- RoPE produces oscillatory attention score decay with distance: score(Δ) involves cos(Δ·ω_i) and sin(Δ·ω_i) terms
- Extrapolation requires scaling: position interpolation compresses positions, NTK-aware scaling adjusts the base, and YaRN adds a softmax temperature adjustment on top of NTK-aware scaling
Next Steps
Continue to 18-05 — Decoder-Only Architecture to learn how RoPE, causal masking, and next-token prediction combine in the GPT-style decoder-only Transformer.