18-04 — Rotary Position Embeddings (RoPE) — Deep

Phase: 18 - Large Language Model Mathematics
Subject: 18-04
Prerequisites: 18-03 (Positional Encodings), 08-04 (Matrices as Linear Transformations), 02-08 (Trigonometric Functions), 02-10 (Vectors - Dot Product), 17-07 (Scaled Dot-Product Attention)
Next subject: 18-05 - Decoder-Only Architecture


Learning Objectives

By the end of this subject, you will be able to:

  1. Construct the 2D rotation matrix R(θ) and prove it preserves vector norms and dot products
  2. Build the full RoPE rotation matrix for d-dimensional queries/keys as a block-diagonal of 2D rotations
  3. Prove rigorously that RoPE encodes relative position: q_mᵀ R_Θ(m)ᵀ R_Θ(n) k_n = q_mᵀ R_Θ(n−m) k_n
  4. Derive the frequency schedule for RoPE: θ_i = base^(−2i/d) and explain the geometric spacing
  5. Analyze RoPE extrapolation, including linear interpolation (PI), NTK-aware scaling, and YaRN

Core Content

1. The 2D Rotation Matrix (Building Block)

A rotation in 2D by angle φ is represented by:

R(φ) = [[cos(φ), −sin(φ)], [sin(φ), cos(φ)]]

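Properties:

  - Orthogonal: R(φ)ᵀR(φ) = I, so ||R(φ)x|| = ||x|| and (R(φ)x)ᵀ(R(φ)y) = xᵀy.
  - The transpose is the reverse rotation: R(φ)ᵀ = R(−φ) = R(φ)⁻¹.
  - Angles add under composition: R(α)R(β) = R(α + β).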
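The norm and dot-product preservation named in the learning objectives can be checked numerically. The following is a minimal pure-Python sketch; the helper names (rot2d, matvec, dot) and the test angle and vectors are illustrative assumptions, not part of the original material.

```python
import math

def rot2d(phi):
    """Return the 2x2 rotation matrix R(phi) as nested lists."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s], [s, c]]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

phi = 0.7                        # arbitrary test angle (assumption)
u, v = [3.0, 4.0], [-1.0, 2.0]   # arbitrary test vectors (assumption)
Ru, Rv = matvec(rot2d(phi), u), matvec(rot2d(phi), v)

norm_before = math.sqrt(dot(u, u))
norm_after = math.sqrt(dot(Ru, Ru))
print(norm_before, norm_after)   # agree up to floating-point error
print(dot(u, v), dot(Ru, Rv))    # agree up to floating-point error
```

Rotating both vectors by the same angle leaves their dot product unchanged; RoPE gets its position signal from rotating q and k by different angles.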

2. Applying Rotation to Vector Pairs

RoPE applies rotations in pairs of dimensions. For a d-dimensional vector x, reshape into d/2 pairs:

x = [x₀, x₁, x₂, x₃, ..., x_{d−2}, x_{d−1}]

Treat each consecutive pair (x_{2i}, x_{2i+1}) as a 2D vector and rotate it.

For a single pair with angle θ:

[x'_{2i}, x'_{2i+1}]ᵀ = R(θ) · [x_{2i}, x_{2i+1}]ᵀ = [x_{2i}·cos(θ) − x_{2i+1}·sin(θ), x_{2i}·sin(θ) + x_{2i+1}·cos(θ)]ᵀ
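The pairwise rotation can be written without ever forming a matrix. This is a small sketch under the pairing convention used above (consecutive entries form a pair); the function name rotate_pairs and the sample input are hypothetical.

```python
import math

def rotate_pairs(x, angles):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by angles[i]."""
    if len(x) % 2 != 0 or len(angles) != len(x) // 2:
        raise ValueError("need an even-length x and one angle per pair")
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        # Same two components as the single-pair formula above.
        out.extend([a * c - b * s, a * s + b * c])
    return out

# Pair 0 is turned by a quarter turn, pair 1 by a half turn.
x = [1.0, 0.0, 0.0, 1.0]
y = rotate_pairs(x, [math.pi / 2, math.pi])
print([round(v, 6) for v in y])
```

Working pair by pair costs O(d) per vector, whereas multiplying by a dense d×d rotation matrix would cost O(d²).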

3. The Full RoPE Rotation Matrix

For d dimensions with per-pair angles θ₀, θ₁, ..., θ_{d/2−1}, the full rotation matrix is block-diagonal:

R(Θ) = [[R(θ₀), 0, ..., 0], [ 0, R(θ₁), ..., 0], [ ..., ..., ..., ...], [ 0, 0, ..., R(θ_{d/2−1})]]

Each block is a 2×2 rotation matrix. This is a d×d orthogonal matrix.

RoPE applies position-dependent rotation: For position pos, the angle for dimension pair i is:

θ_{i,pos} = pos · ω_i

where ω_i = base^(−2i/d) and typically base = 10000 (but modern LLMs often use larger bases like 500,000 or 1,000,000).

So R_Θ(pos) has blocks R(pos·ω_i).
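As a sketch of the angle schedule just described, the per-pair frequencies and angles can be computed directly from the formula ω_i = base^(−2i/d). The function names are illustrative assumptions; the values d = 4 and pos = 5 are chosen only to keep the output small.

```python
def rope_frequencies(d, base=10000.0):
    """omega_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    if d % 2 != 0:
        raise ValueError("d must be even")
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def rope_angles(pos, d, base=10000.0):
    """theta_{i,pos} = pos * omega_i for every dimension pair i."""
    return [pos * w for w in rope_frequencies(d, base)]

angles = rope_angles(pos=5, d=4)
print(angles)  # one angle per dimension pair, in radians
```

Pair 0 always has ω₀ = 1 regardless of the base, so changing the base only affects the slower pairs.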

4. The RoPE Operation

⚠️ THIS IS CRITICAL — RoPE does NOT modify the input embeddings. It modifies the QUERY and KEY vectors AFTER the linear projections, BEFORE computing attention scores.

For a query vector q at position m and a key vector k at position n:

q'_m = R_Θ(m) · q_m
k'_n = R_Θ(n) · k_n

The attention score between position m (query) and position n (key):

score(m, n) = q'_mᵀ k'_n
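The order of operations (project first, then rotate, then take the dot product) can be sketched as follows. The weights W_q and W_k, the embeddings, and the helper names are hypothetical toy values for illustration; a real model would use learned weights and a much larger d.

```python
import math

def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def apply_rope(v, pos, base=10000.0):
    """Rotate each pair (v[2i], v[2i+1]) by pos * base^(-2i/d)."""
    d = len(v)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [v[2 * i] * c - v[2 * i + 1] * s,
                v[2 * i] * s + v[2 * i + 1] * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy weights and embeddings (d = 2), for illustration only.
W_q = [[1.0, 0.5], [0.0, 1.0]]
W_k = [[0.5, 0.0], [1.0, 1.0]]
x_m, x_n = [1.0, 2.0], [0.5, -1.0]   # embeddings carry no position signal
m, n = 3, 7

q = matvec(W_q, x_m)        # 1) linear projections
k = matvec(W_k, x_n)
q_rot = apply_rope(q, m)    # 2) rotate Q and K by their own positions
k_rot = apply_rope(k, n)
score = dot(q_rot, k_rot)   # 3) attention logit, before scaling and softmax
print(score)
```

The value vectors are not rotated in this sketch, matching the description above that only queries and keys are modified.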

5. Proof: RoPE Encodes Relative Position

This is the central theorem of RoPE. Let's prove it.

score(m, n) = (R_Θ(m)·q_m)ᵀ (R_Θ(n)·k_n) = q_mᵀ R_Θ(m)ᵀ R_Θ(n) k_n

Now consider R_Θ(m)ᵀ R_Θ(n) block by block:

  - For each 2×2 block i, the product is R(m·ω_i)ᵀ R(n·ω_i).
  - R(m·ω_i)ᵀ = R(−m·ω_i) (the transpose of a rotation is the reverse rotation).
  - R(−m·ω_i) · R(n·ω_i) = R((n−m)·ω_i) (composing rotations adds their angles, here −m·ω_i and n·ω_i).

Therefore:

R_Θ(m)ᵀ R_Θ(n) = R_Θ(n − m)

And:

score(m, n) = q_mᵀ R_Θ(n − m) k_n

The score depends on (n−m), the relative position! The position information has been "absorbed" into a single rotation applied to k_n, parameterized by the relative offset.

Expanding the dot product: For dimension pair i:

Contribution = (q_{2i}, q_{2i+1})ᵀ R((n−m)·ω_i) (k_{2i}, k_{2i+1})

= q_{2i}·[k_{2i}·cos((n−m)·ω_i) − k_{2i+1}·sin((n−m)·ω_i)] + q_{2i+1}·[k_{2i}·sin((n−m)·ω_i) + k_{2i+1}·cos((n−m)·ω_i)]

= (q_{2i}·k_{2i} + q_{2i+1}·k_{2i+1})·cos((n−m)·ω_i) + (q_{2i+1}·k_{2i} − q_{2i}·k_{2i+1})·sin((n−m)·ω_i)

The total score is the sum of these contributions over all i.
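The theorem can be checked numerically by comparing the direct score with a closed form that only sees Δ = n − m. In this sketch the cross term is written as (q_{2i+1}·k_{2i} − q_{2i}·k_{2i+1}), which is what collecting the sine terms of the bracketed expansion gives. The frequencies, vectors, and function names are arbitrary assumptions chosen for the test.

```python
import math

def rope(x, pos, omegas):
    """Rotate pair i of x by pos * omegas[i]."""
    out = []
    for i, w in enumerate(omegas):
        c, s = math.cos(pos * w), math.sin(pos * w)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def direct_score(q, k, m, n, omegas):
    """Rotate q by position m and k by position n, then take the dot product."""
    return sum(a * b for a, b in zip(rope(q, m, omegas), rope(k, n, omegas)))

def relative_score(q, k, delta, omegas):
    """Closed form that depends only on delta = n - m."""
    total = 0.0
    for i, w in enumerate(omegas):
        q0, q1, k0, k1 = q[2 * i], q[2 * i + 1], k[2 * i], k[2 * i + 1]
        total += (q0 * k0 + q1 * k1) * math.cos(delta * w)
        total += (q1 * k0 - q0 * k1) * math.sin(delta * w)
    return total

omegas = [1.0, 0.1]                       # arbitrary test frequencies
q = [1.0, 2.0, -0.5, 0.3]
k = [0.5, 1.5, 0.8, -1.2]

s_near = direct_score(q, k, 3, 7, omegas)      # positions (3, 7)
s_far = direct_score(q, k, 103, 107, omegas)   # same offset, shifted by 100
s_rel = relative_score(q, k, 4, omegas)        # offset only
print(s_near, s_far, s_rel)                    # all three agree up to rounding
```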

6. Frequency Schedule (Base Parameter)

The frequencies: ω_i = base^(−2i/d), for i = 0, 1, ..., d/2−1.

Geometric spacing: The ratios ω_i/ω_{i+1} = base^(2/d) are constant. This gives equal spacing on a log scale, covering a wide range of wavelengths.

Effect of base (using λ_max ≈ 2π·base):

  - base = 10,000: λ_max ≈ 62,832 positions (original RoPE)
  - base = 500,000: λ_max ≈ 3.14M positions (Llama 2 Long, Llama 3)
  - base = 1,000,000: λ_max ≈ 6.28M positions (Code Llama)

Larger base → longer wavelengths → better extrapolation to long sequences. The lowest frequencies are the bottleneck for long-context performance.
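To make the wavelength range concrete, the sketch below lists the shortest and longest wavelengths for a few bases. The head dimension d = 128 is an assumption. The exact longest wavelength is 2π·base^((d−2)/d), which sits slightly below the 2π·base approximation.

```python
import math

def wavelengths(d, base):
    """Wavelength 2*pi / omega_i, in positions, for each pair i."""
    return [2 * math.pi * base ** (2.0 * i / d) for i in range(d // 2)]

d = 128  # assumed head dimension
for base in (10_000, 500_000, 1_000_000):
    lam = wavelengths(d, base)
    print(f"base={base:>9,}  shortest={lam[0]:.2f}  longest={lam[-1]:,.0f}")
```

Consecutive wavelengths differ by the constant factor base^(2/d), which is the geometric spacing described above.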

7. Why RoPE Works So Well

Preserves dot product geometry: Since R_Θ is orthogonal, ||q'|| = ||q|| and ||k'|| = ||k||. RoPE doesn't change the magnitude of Q and K vectors — it only rotates them. This means the "content-based" attention behavior is preserved while position is injected.

Decays with relative distance: For a fixed, content-aligned (q,k) pair, the score tends to shrink as |n−m| grows, because the dimension pairs rotate at different rates and drift out of phase, so their contributions increasingly cancel. The trend is not monotonic: at relative positions where cos((n−m)·ω_i) ≈ 1 for all i, the score returns to its maximum. These are the "resonant" positions.

No separate PE parameters: RoPE is parameter-free (given the base). No learned positional embeddings, no additional parameters.

Works with weight tying: Since RoPE is applied after Q/K projections, it doesn't interfere with input embeddings or output projections.

8. Extrapolation and Scaling

RoPE doesn't naturally extrapolate well past training length because the rotation angles for very large positions may not have been encountered during training. Several scaling methods exist:

Linear Interpolation (Position Interpolation, PI)

Scale positions down: pos' = pos · (L_train / L_target). For a model trained on 2K and evaluated on 4K, all positions are halved before computing RoPE. This compresses the rotations into the trained range but "crowds" nearby positions.
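A minimal sketch of position interpolation, assuming the rescaling is applied to the position before the angles are computed. The function name and the small d = 8 are illustrative assumptions.

```python
def interpolated_angles(pos, d, train_len, target_len, base=10000.0):
    """RoPE angles after position interpolation: pos is rescaled first."""
    scaled_pos = pos * (train_len / target_len)
    return [scaled_pos * base ** (-2.0 * i / d) for i in range(d // 2)]

# Trained on 2K, evaluated on 4K: position 4000 is treated like position 2000.
a_interp = interpolated_angles(4000, d=8, train_len=2048, target_len=4096)
a_plain = [2000 * 10000.0 ** (-2.0 * i / 8) for i in range(4)]
print(a_interp)
print(a_plain)

# Crowding: neighbouring positions are now only half a step apart in angle.
step = interpolated_angles(1, d=8, train_len=2048, target_len=4096)[0]
print(step)
```

Every frequency is compressed by the same factor here, including the fastest pairs that separate adjacent tokens.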

NTK-Aware Scaling

Instead of scaling positions, scale the base: base' = base · α^(d/(d−2)). This stretches the low frequencies (which handle long-range) more than high frequencies (which handle short-range). Avoids crowding nearby positions.
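The effect of the base change on each frequency can be sketched directly from the formula above. The function names are hypothetical, and d = 128 with α = 4 are assumed example values.

```python
def ntk_scaled_base(base, alpha, d):
    """base' = base * alpha^(d / (d - 2))."""
    return base * alpha ** (d / (d - 2))

def frequency_ratios(base, alpha, d):
    """omega'_i / omega_i for every pair i after NTK-aware scaling."""
    new_base = ntk_scaled_base(base, alpha, d)
    return [(new_base ** (-2.0 * i / d)) / (base ** (-2.0 * i / d))
            for i in range(d // 2)]

ratios = frequency_ratios(base=10000.0, alpha=4.0, d=128)
print(ratios[0], ratios[-1])  # fastest pair vs. slowest pair
```

Pair i is scaled by α^(−2i/(d−2)), so the fastest pair (i = 0) is untouched and the slowest pair (i = d/2 − 1) is slowed by the full factor α. That is the sense in which low frequencies are stretched more than high ones.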

YaRN (Yet another RoPE extensioN)

Combines NTK-aware scaling with a temperature adjustment to the softmax, further improving long-context extrapolation.



Pitfalls

⚠️ Pitfall 1: Applying RoPE to the wrong vectors. RoPE rotates Q and K AFTER their linear projections, NOT the input embeddings. Rotating the embeddings directly would break the position-encoding property because RoPE needs to act on Q and K separately to produce the relative-position dot product.

⚠️ Pitfall 2: Mixing up the rotation direction. R(θ) = [[cosθ, -sinθ], [sinθ, cosθ]] rotates counterclockwise. But R(m)ᵀR(n) = R(n-m) regardless of the sign convention — the transpose reverses the rotation, so the composition always gives the difference. Just be consistent.

⚠️ Pitfall 3: Forgetting that RoPE's "decay" is oscillatory, not monotonic. The attention score with distance involves cos(Δ·ω) and sin(Δ·ω) terms. It oscillates: a token at distance 100 might get MORE attention than one at distance 50, depending on the rotation angle. This is different from ALiBi, whose distance penalty grows monotonically (linear in the logits, hence exponential in the unnormalized attention weights).


Key Terms

Worked Examples

Example 1: 2D RoPE Computation

Problem: For a query vector q = [1.0, 2.0] at position m = 3 and key vector k = [0.5, 1.5] at position n = 7, with ω = 0.5 rad/position, compute the RoPE attention score.

Solution:

Rotation angles:
θ_m = 3·0.5 = 1.5 rad
θ_n = 7·0.5 = 3.5 rad

Rotated q: q'₀ = 1.0·cos(1.5) − 2.0·sin(1.5) = 1.0·0.0707 − 2.0·0.9975 = 0.0707 − 1.9950 = −1.9243

q'₁ = 1.0·sin(1.5) + 2.0·cos(1.5) = 1.0·0.9975 + 2.0·0.0707 = 0.9975 + 0.1414 = 1.1389

Rotated k: k'₀ = 0.5·cos(3.5) − 1.5·sin(3.5) = 0.5·(−0.9365) − 1.5·(−0.3508) = −0.4682 + 0.5262 = 0.0580

k'₁ = 0.5·sin(3.5) + 1.5·cos(3.5) = 0.5·(−0.3508) + 1.5·(−0.9365) = −0.1754 − 1.4048 = −1.5802

Score = q'·k' = (−1.9243)(0.0580) + (1.1389)(−1.5802) = −0.1116 − 1.7997 = −1.9113

Check with the relative-position formula: Δθ = (7−3)·0.5 = 2.0 rad

By the central theorem, q'ᵀk' = (R(1.5)·q)ᵀ(R(3.5)·k) = qᵀR(1.5)ᵀR(3.5)·k = qᵀR(2.0)·k. Expanding:

qᵀR(Δθ)k = q₀·(k₀·cos(Δθ) − k₁·sin(Δθ)) + q₁·(k₀·sin(Δθ) + k₁·cos(Δθ)) = (q₀k₀ + q₁k₁)·cos(Δθ) + (q₁k₀ − q₀k₁)·sin(Δθ)

Content term: q₀k₀ + q₁k₁ = 1.0·0.5 + 2.0·1.5 = 3.5
Cross term: q₁k₀ − q₀k₁ = 2.0·0.5 − 1.0·1.5 = −0.5

Score = 3.5·cos(2.0) − 0.5·sin(2.0) = 3.5·(−0.4161) − 0.5·(0.9093) = −1.4564 − 0.4547 = −1.9111 ✓

This agrees with the direct computation (−1.9113) up to rounding of the intermediate trig values.

Watch the sign of the cross term. Writing it as (q₀k₁ − q₁k₀) instead of (q₁k₀ − q₀k₁) gives 3.5·cos(2.0) + 0.5·sin(2.0) = −1.0018, which does not match the direct computation.
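Example 1 can be reproduced in a few lines. The sketch below computes the score three ways: by direct rotation, by the relative-position form with cross term (q₁k₀ − q₀k₁), and by the same form with the cross term's sign flipped, to show that only the first two agree. Variable and function names are illustrative.

```python
import math

q, k = [1.0, 2.0], [0.5, 1.5]
m, n, omega = 3, 7, 0.5

def rotate(v, theta):
    """Apply the 2D rotation R(theta) to a 2-vector."""
    c, s = math.cos(theta), math.sin(theta)
    return [v[0] * c - v[1] * s, v[0] * s + v[1] * c]

# Direct computation: rotate q and k by their own angles, then dot.
q_rot, k_rot = rotate(q, m * omega), rotate(k, n * omega)
direct = q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]

# Relative-position form, using only delta = (n - m) * omega.
delta = (n - m) * omega
content = q[0] * k[0] + q[1] * k[1]   # 3.5
cross = q[1] * k[0] - q[0] * k[1]     # -0.5
relative = content * math.cos(delta) + cross * math.sin(delta)

# Same form with the cross term's sign flipped (the mistake to avoid).
flipped = content * math.cos(delta) - cross * math.sin(delta)

print(round(direct, 4), round(relative, 4), round(flipped, 4))
```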

Example 2: Building the 4D RoPE Rotation Matrix

Problem: Construct R_Θ(pos) for d = 4, base = 10000, and pos = 5.

Solution:

d = 4, so i ∈ {0, 1}.

ω₀ = 10000^(−0/4) = 1
ω₁ = 10000^(−2/4) = 10000^(−0.5) = 0.01

Angles:
θ₀ = 5·1 = 5 rad
θ₁ = 5·0.01 = 0.05 rad

Using the standard convention R(θ) = [[cosθ, −sinθ], [sinθ, cosθ]] for each block:

R_Θ(5) = [[cos(5), −sin(5), 0, 0], [sin(5), cos(5), 0, 0], [0, 0, cos(0.05), −sin(0.05)], [0, 0, sin(0.05), cos(0.05)]]

= [[0.2837, 0.9589, 0, 0], [−0.9589, 0.2837, 0, 0], [0, 0, 0.9988, −0.0500], [0, 0, 0.0500, 0.9988]]

Here sin(5) ≈ −0.9589, so the entry −sin(5) is positive.

Note: R_Θ(5) is orthogonal, so its transpose is its inverse.
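The matrix in Example 2 and its orthogonality can be checked with a short sketch. It builds the dense block-diagonal matrix for illustration only, and the function names are assumptions.

```python
import math

def rope_matrix(pos, d, base=10000.0):
    """Dense d x d block-diagonal rotation matrix for position pos."""
    R = [[0.0] * d for _ in range(d)]
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        r = 2 * i
        R[r][r], R[r][r + 1] = c, -s
        R[r + 1][r], R[r + 1][r + 1] = s, c
    return R

def transpose_times(R):
    """Return R^T R as a dense matrix."""
    d = len(R)
    return [[sum(R[k][i] * R[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

R5 = rope_matrix(pos=5, d=4)
for row in R5:
    print([round(v, 4) for v in row])

gram = transpose_times(R5)  # should be the 4x4 identity up to rounding
```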

Example 3: RoPE Decay with Distance

Problem: For d = 2, ω = 0.5, and content vectors q = [1, 0], k = [1, 0] (perfectly aligned, unit length), compute the attention score as a function of relative distance Δ = n−m.

Solution:

qᵀk = 1·1 + 0·0 = 1 (content score, would be 1 without position)

With RoPE, Δθ = Δ·0.5: score(Δ) = (q₀k₀ + q₁k₁)·cos(Δθ) + (q₁k₀ − q₀k₁)·sin(Δθ) = 1·cos(0.5Δ) + 0·sin(0.5Δ) = cos(0.5Δ)

| Δ | score = cos(0.5Δ) |
|---|---|
| 0 | cos(0) = 1.000 |
| 1 | cos(0.5) ≈ 0.8776 |
| 2 | cos(1.0) ≈ 0.5403 |
| 3 | cos(1.5) ≈ 0.0707 |
| 4 | cos(2.0) ≈ −0.4161 |
| 5 | cos(2.5) ≈ −0.8011 |
| 6 | cos(3.0) ≈ −0.9900 |
| 7 | cos(3.5) ≈ −0.9365 |

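With a single frequency the score is purely periodic: it falls from 1 at Δ = 0 to −1 at Δ = π/ω = π/0.5 ≈ 6.28 (complete anti-alignment) and returns to 1 at Δ = 2π/ω ≈ 12.57. In higher dimensions the score sums many such cosines with different frequencies, and it is their drifting out of phase that makes attention fall off with distance, in an oscillatory rather than monotonic fashion.

This means RoPE produces a "soft window": tokens at certain distances receive less attention due to rotational misalignment.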
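The single-frequency scores of Example 3 can be reproduced as below. The second function is a hedged extension, not part of the example: it averages cos(Δ·ω_i) over all pairs for an assumed d = 128 and base = 10000, which is the RoPE score when every pair of q and k is the aligned unit vector [1, 0], divided by d/2.

```python
import math

# Example 3: single frequency omega = 0.5, q = k = [1, 0].
omega = 0.5
for delta in range(8):
    print(delta, round(math.cos(omega * delta), 4))

def aligned_score(delta, d=128, base=10000.0):
    """Mean of cos(delta * omega_i) over all d/2 pairs (assumed setup)."""
    omegas = [base ** (-2.0 * i / d) for i in range(d // 2)]
    return sum(math.cos(delta * w) for w in omegas) / len(omegas)

# Many frequencies at once: inspect how the normalized score varies with distance.
for delta in (0, 1, 10, 100, 1000):
    print(delta, round(aligned_score(delta), 4))
```

Printing the second loop is one way to see how mixing frequencies changes the picture compared with the single cosine of the example.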



Quiz

Q1: What does the concept of Applying Rotation to Vector Pairs primarily refer to in this subject?

A) A computational error related to Applying Rotation to Vector Pairs
B) A historical anecdote about Applying Rotation to Vector Pairs
C) A visual representation of Applying Rotation to Vector Pairs
D) The definition and application of Applying Rotation to Vector Pairs

Correct: D)

Q2: What is the primary purpose of the Pitfalls section?

A) It replaces all other methods in this domain
B) It is used to flag common pitfalls in mathematical analysis
C) It is primarily a historical notation system
D) It is used only in advanced research contexts

Correct: B)

Q3: Which statement about Extrapolation and Scaling is TRUE?

A) Extrapolation and Scaling is mentioned only as a historical footnote
B) Extrapolation and Scaling is a fundamental concept covered in this subject
C) Extrapolation and Scaling is an advanced topic beyond this subject's scope
D) Extrapolation and Scaling is not related to this subject

Correct: B)

Q4: Based on the central theorem and the worked examples in this subject, what does R_Θ(m)ᵀ R_Θ(n) simplify to?

A) R_Θ(n − m)
B) An unrelated numerical value
C) A different result from a common mistake
D) The inverse of the correct answer

Correct: A)

Q5: How are Extrapolation and Scaling and Frequency Schedule (Base Parameter) related?

A) Extrapolation and Scaling and Frequency Schedule (Base Parameter) are completely unrelated topics
B) Extrapolation and Scaling and Frequency Schedule (Base Parameter) are closely related concepts
C) Extrapolation and Scaling is a special case of Frequency Schedule (Base Parameter)
D) Extrapolation and Scaling is the inverse of Frequency Schedule (Base Parameter)

Correct: B)

Q6: What is a common pitfall when working with The 2D Rotation Matrix (Building Block)?

A) A common mistake is confusing The 2D Rotation Matrix (Building Block) with a similar concept
B) The 2D Rotation Matrix (Building Block) is always computed the same way in all contexts
C) The main error with The 2D Rotation Matrix (Building Block) is using it when it is not needed
D) The 2D Rotation Matrix (Building Block) has no common misconceptions

Correct: A)

Q7: When should you apply The Full RoPE Rotation Matrix?

A) Avoid The Full RoPE Rotation Matrix unless explicitly instructed
B) Apply The Full RoPE Rotation Matrix to solve problems in this subject's domain
C) Use The Full RoPE Rotation Matrix only in pure mathematics contexts
D) The Full RoPE Rotation Matrix is not practically useful

Correct: B)

Practice Problems

Problem 1

For d = 8, base = 10000, what is the rotation angle for dimension pair i = 2 at position pos = 100?

Answer
ω₂ = 10000^(−4/8) = 10000^(−0.5) = 0.01
θ = pos · ω₂ = 100 · 0.01 = 1.0 rad
The 2D vector at dimensions (4,5) is rotated by 1.0 rad.

Problem 2

Prove that RoPE preserves the norm of query and key vectors: ||R_Θ(pos)·q|| = ||q||.

Answer
R_Θ(pos) is block-diagonal with each block being a 2D rotation matrix R(θ_i).
Each R(θ_i) is orthogonal: R(θ_i)ᵀR(θ_i) = I.
The block-diagonal matrix is therefore orthogonal: R_ΘᵀR_Θ = I.
||R_Θ·q||² = (R_Θ·q)ᵀ(R_Θ·q) = qᵀR_ΘᵀR_Θ·q = qᵀ·q = ||q||²
Thus ||R_Θ·q|| = ||q||. RoPE rotates but doesn't scale.

Problem 3

Explain why larger base values help with long-context extrapolation.

Answer
The lowest frequency is ω_min = ω_{d/2−1} = base^(−(d−2)/d) ≈ 1/base, which gives the longest wavelength λ_max ≈ 2π·base. For base = 10000, λ_max ≈ 62,832, so in a sequence of length 100,000 even the slowest pair completes more than one full rotation, and every pair's angle is ambiguous between positions one wavelength apart. A larger base lowers every frequency except ω₀ = 1 and stretches the corresponding wavelengths. With base = 500,000 (λ_max ≈ 3.14M), the slowest pairs cover only a small fraction of a cycle over 100,000 positions, so they can still tell distant positions apart.

Problem 4

Given q = [a, b] and k = [c, d], derive the explicit formula for q'ᵀk' under RoPE at relative distance Δ.

Answer
Let θ_q and θ_k be the rotation angles applied to q and k.
q' = [a·cos(θ_q) − b·sin(θ_q), a·sin(θ_q) + b·cos(θ_q)]
k' = [c·cos(θ_k) − d·sin(θ_k), c·sin(θ_k) + d·cos(θ_k)]
q'ᵀk' = (a·cosθ_q − b·sinθ_q)(c·cosθ_k − d·sinθ_k) + (a·sinθ_q + b·cosθ_q)(c·sinθ_k + d·cosθ_k)
Expanding and using the angle-difference identities:
= (ac + bd)·cos(θ_k − θ_q) + ad·sin(θ_q − θ_k) + bc·sin(θ_k − θ_q)
= (ac + bd)·cos(Δθ) + (bc − ad)·sin(Δθ), where Δθ = θ_k − θ_q.
The sign of Δθ matters for the sine term: defining Δθ = θ_q − θ_k instead flips the cross term to (ad − bc).
With θ = pos·ω: Δθ = (n−m)·ω for query at m, key at n.

Problem 5

A model was trained with RoPE and base = 10000 on sequences up to length 2048. You want to fine-tune it for length 8192 using NTK-aware scaling with α = 4. What is the new effective base?

Answer
NTK-aware scaling: base' = base · α^(d/(d−2))
For d = 128 (typical head dim): d/(d−2) = 128/126 ≈ 1.0159
base' = 10000 · 4^1.0159 ≈ 10000 · 4.089 ≈ 40,890
With λ_max ≈ 2π·base, the maximum wavelength extends from ~62.8K to ~257K positions. Pair i has its frequency multiplied by α^(−2i/(d−2)), which is 1 at i = 0 and 1/α at i = d/2 − 1, so the highest frequencies are barely affected while the lowest is divided by 4.

Summary

  1. RoPE applies 2D rotations to Q and K vectors in pairs of dimensions: [q'_{2i}, q'_{2i+1}]ᵀ = R(pos·ω_i)·[q_{2i}, q_{2i+1}]ᵀ
  2. The central theorem: q_mᵀR_Θ(m)ᵀR_Θ(n)k_n = q_mᵀR_Θ(n−m)k_n — the dot product depends only on relative position
  3. Frequencies are geometrically spaced: ω_i = base^(−2i/d), giving wavelengths from ~6 to ~2π·base positions
  4. RoPE produces oscillatory attention score decay with distance: score(Δ) involves cos(Δ·ω_i) and sin(Δ·ω_i) terms
  5. Extrapolation requires scaling: position interpolation compresses positions, NTK-aware scaling adjusts the base, and YaRN adds a softmax temperature adjustment on top of NTK-aware scaling


Next Steps

Continue to 18-05 — Decoder-Only Architecture to learn how RoPE, causal masking, and next-token prediction combine in the GPT-style decoder-only Transformer.