18-03 — Positional Encodings

Phase: 18 — Large Language Model Mathematics
Subject: 18-03
Prerequisites: 18-02 (Embedding Layers), 02-08 (Trigonometric Functions), 09-10 (Matrix Calculus — dot products), 17-06 (Attention Mechanism)
Next subject: 18-04 — Rotary Position Embeddings (RoPE) — Deep


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the sinusoidal positional encoding formula and explain why each dimension uses a different frequency
  2. Prove that the dot product of sinusoidal PE vectors depends only on relative position, enabling the model to attend to relative offsets
  3. Compare absolute, relative, and sinusoidal position encoding paradigms with mathematical precision
  4. Derive the ALiBi (Attention with Linear Biases) formulation as an alternative to explicit PE
  5. Analyze the extrapolation properties of each encoding scheme

Core Content

1. Why Position Matters

The self-attention mechanism is permutation-equivariant: if you shuffle the input tokens, the output shuffles identically. Formally, for any permutation matrix P:

Attention(PX) = P · Attention(X)

This means attention, by itself, has NO notion of token order. "The cat sat on the mat" and "mat the on sat cat The" produce the same attention pattern (permuted). Positional encodings break this symmetry by injecting position information.
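The equivariance can be checked numerically. The sketch below is a toy single-head attention in plain Python that assumes identity projections (W_Q = W_K = W_V = I) to stay short; the function names and the toy vectors are illustrative, not taken from any library.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X):
    # Single-head self-attention with identity projections (an assumption):
    # Q = K = V = X, with scores scaled by sqrt(d).
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, X)) for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [1.5, -1.0]]  # three toy token vectors
perm = [2, 0, 1]                           # a permutation of the rows

# Attention(PX): shuffle the tokens first, then attend.
shuffled_then_attended = attention([X[p] for p in perm])
# P · Attention(X): attend first, then shuffle the outputs.
base_output = attention(X)
attended_then_shuffled = [base_output[p] for p in perm]
```

The two results agree up to floating-point rounding, so nothing in the output reveals the original token order.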

2. Sinusoidal Positional Encoding (Original Transformer)

The sinusoidal PE from Vaswani et al. (2017) defines:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where:
  - pos ∈ {0, 1, ..., L−1} is the position
  - i ∈ {0, 1, ..., d/2 − 1} is the dimension index
  - d is the model dimension (must be even)

Each dimension pair (2i, 2i+1) forms a sinusoid with a specific wavelength:

The angular frequency for dimension pair i is:

ω_i = 1 / 10000^(2i/d)

The wavelength (in positions) is: λ_i = 2π / ω_i = 2π · 10000^(2i/d)

For i = 0: ω₀ = 1, λ₀ = 2π ≈ 6.28 positions
For i = d/2 − 1: ω_min = 1/10000^((d−2)/d), which approaches 1/10000 as d grows, so λ_max approaches 2π · 10000 ≈ 62,832 positions (for d = 512, λ_max ≈ 60,600 positions)

The wavelengths range from very short (~6 positions) to very long (~60K positions). This is analogous to a Fourier basis: each frequency captures positional patterns at a different scale.
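The definition translates almost line for line into code. This is a minimal sketch in plain Python; the function names `sinusoidal_pe` and `wavelengths` and the `base` parameter are illustrative choices, not part of the original paper's notation.

```python
import math

def sinusoidal_pe(pos, d, base=10000.0):
    # Returns the length-d encoding of one position; d must be even.
    pe = []
    for i in range(d // 2):
        omega = 1.0 / base ** (2 * i / d)  # angular frequency of pair i
        pe.append(math.sin(pos * omega))   # dimension 2i
        pe.append(math.cos(pos * omega))   # dimension 2i + 1
    return pe

def wavelengths(d, base=10000.0):
    # Wavelength (in positions) of each dimension pair: 2*pi / omega_i.
    return [2 * math.pi * base ** (2 * i / d) for i in range(d // 2)]

pe1 = sinusoidal_pe(1, 4)   # encoding of position 1 with d = 4
lam = wavelengths(512)      # 256 wavelengths for d = 512
```

For d = 512, `lam[0]` is 2π and `lam[-1]` is a little below 2π · 10000, because the largest exponent is (d−2)/d rather than 1.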

3. The Relative Position Property

⚠️ THIS IS CRITICAL — The key mathematical property of sinusoidal PE is that the dot product PE(pos)·PE(pos+k) depends only on k (the offset), not on the absolute positions.

Proof:

Let's use a rotation-matrix formulation. Define the angle for position pos at dimension pair i:

θ_{i,pos} = pos · ω_i

Then:

[PE(pos, 2i), PE(pos, 2i+1)] = [sin(θ_{i,pos}), cos(θ_{i,pos})]

This is a point on the unit circle at angle θ_{i,pos}. The dot product between positions m and n at dimension pair i:

PE(m, 2i)·PE(n, 2i) + PE(m, 2i+1)·PE(n, 2i+1)
= sin(θ_{i,m})·sin(θ_{i,n}) + cos(θ_{i,m})·cos(θ_{i,n})
= cos(θ_{i,m} − θ_{i,n})
= cos((m − n) · ω_i)

This depends only on (m − n), the relative offset! The total dot product across all dimensions:

PE(m)ᵀPE(n) = Σ_{i=0}^{d/2−1} cos((m − n) · ω_i)

This is a function purely of (m − n). As a result, the attention score:

score_{ij} = (x_i + PE(i))ᵀW_QᵀW_K(x_j + PE(j))

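contains a position-position term, PE(i)ᵀW_QᵀW_K PE(j). If W_QᵀW_K were the identity, this term would be exactly the PE dot product above and would depend only on i − j. With learned projections it is no longer a pure function of the offset, but because PE(pos + k) is a fixed linear function of PE(pos), the model can learn projections that expose relative position, even though the PE is added absolutely.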
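A quick numerical check of the dot-product identity, as a sketch in plain Python. The dimension d = 16 and the helper names `pe`, `dot`, and `offset_formula` are arbitrary illustrative choices.

```python
import math

def pe(pos, d=16, base=10000.0):
    # Sinusoidal encoding: [sin, cos] for each dimension pair i.
    out = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        out += [math.sin(angle), math.cos(angle)]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def offset_formula(k, d=16, base=10000.0):
    # Closed form from the proof: sum over pairs of cos(k * omega_i).
    return sum(math.cos(k / base ** (2 * i / d)) for i in range(d // 2))

# Three position pairs with the same offset of 4.
same_offset = [dot(pe(3), pe(7)), dot(pe(10), pe(14)), dot(pe(500), pe(504))]
closed_form = offset_formula(4)
```

All three entries of `same_offset` match `closed_form` up to rounding, while a pair with a different offset, such as (3, 8), gives a different value.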

4. Why Sinusoids? Design Rationale

Extrapolation: Sinusoidal functions are defined for any real-valued pos, so they naturally extend to positions beyond those seen in training.

Determinism: Fixed, not learned. No parameters, no training needed.

Boundedness: All values in [-1, 1], so PE doesn't dominate the token embeddings.

Smoothness: Nearby positions have similar encodings (cos(Δ·ω_i) ≈ 1 whenever Δ·ω_i is small, which holds for the low-frequency pairs at small offsets Δ), which is a useful inductive bias.

Multi-scale: Different frequencies capture different ranges of relative distance.

5. Absolute vs. Relative Position Encoding

Absolute (add PE to token embeddings):
  - PE(pos) is added: x_pos = embed(token) + PE(pos)
  - Position is encoded in what the model "sees"
  - Examples: sinusoidal PE, learned PE

Relative (modify attention scores):
  - Instead of adding position to input, directly bias attention scores by relative distance
  - Example: Shaw et al. (2018) learn an embedding for each relative distance: a_{ij} += r_{j−i}
  - ALiBi and RoPE are modern relative variants

6. ALiBi (Attention with Linear Biases)

Press et al. (2021) proposed a remarkably simple alternative: subtract a linear penalty from attention scores based on distance.

score_{ij} = q_iᵀk_j / √d_k − m · |i − j|

where m > 0 is a head-specific slope. The subtraction reduces attention to distant tokens.

Why linear? A linear bias means the attention weights decay geometrically with distance. For query at position i:

Before softmax: score_{ij} = q_iᵀk_j/√d_k − m|i−j|
After softmax: α_{ij} ∝ exp(q_iᵀk_j/√d_k) · exp(−m|i−j|)

The exp(−m|i−j|) term exponentially decays with distance. This directly implements a locality bias: attend more to nearby tokens.

Slope selection: Different heads get different slopes. For 8 heads, the slopes are geometrically spaced: m ∈ {1/2^1, 1/2^2, ..., 1/2^8}. For n heads, Press et al. use a geometric sequence that starts at 2^(−8/n) and uses the same value as its ratio. Heads with larger m attend very locally; heads with smaller m can attend farther.

Key advantage: No positional embeddings at all. Just modify attention scores. Also, since the bias is linear in |i−j|, it extrapolates naturally — longer sequences just get larger biases for far-apart tokens, which is reasonable.
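The slopes and the bias matrix can be sketched in a few lines of plain Python. The function names are illustrative; the general slope rule 2^(−8h/n) follows the scheme Press et al. describe for head counts that are powers of two and should be treated as an assumption for other head counts.

```python
def alibi_slopes(n_heads):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    # For n_heads = 8 this gives 1/2, 1/4, ..., 1/256.
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * |i - j|; this matrix is added to the scaled
    # q·k scores of one head before the softmax.
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])  # bias matrix for the most local head (m = 0.5)
```

Nothing here depends on a maximum sequence length: `alibi_bias` can be called with any `seq_len`, which is the property behind ALiBi's extrapolation behavior.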

7. Rotary Position Embeddings (RoPE) — Preview

Covered exhaustively in 18-04. Here's the key insight:

Sinusoidal PE adds position information BEFORE attention. RoPE rotates the query and key vectors BY their position BEFORE computing attention scores:

q'_i = R(θ, i) · q_i
k'_j = R(θ, j) · k_j

where R(θ, pos) is a rotation matrix. Then:

q'_iᵀk'_j = (R(θ, i)·q_i)ᵀ(R(θ, j)·k_j)
= q_iᵀR(θ, i)ᵀR(θ, j)·k_j
= q_iᵀR(θ, j−i)·k_j

The dot product depends only on (j−i) — relative position!
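A toy check of this identity for a single 2D pair, in plain Python. The per-position angle θ = 0.3 and the vectors q and k are arbitrary illustrative values; a full RoPE layer applies one such rotation per dimension pair, each with its own frequency.

```python
import math

def rotate(vec, angle):
    # Apply the 2x2 rotation matrix R(angle) to a 2D vector.
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def rope_score(q, k, i, j, theta=0.3):
    # Rotate q by i*theta and k by j*theta, then take the dot product.
    qi = rotate(q, i * theta)
    kj = rotate(k, j * theta)
    return qi[0] * kj[0] + qi[1] * kj[1]

q, k = (1.0, 2.0), (0.5, -1.0)
score_a = rope_score(q, k, i=2, j=5)    # offset j - i = 3
score_b = rope_score(q, k, i=40, j=43)  # same offset, different absolute positions
score_c = rope_score(q, k, i=2, j=6)    # offset j - i = 4
```

`score_a` and `score_b` agree up to rounding because only the offset enters the score; `score_c` differs because the offset changed.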

8. Extrapolation Properties

A critical practical concern: can the model handle sequences longer than training?

Scheme | Extrapolation | Mechanism
Learned PE | ❌ Poor | No vectors for new positions
Sinusoidal PE | ⚠️ Moderate | Defined for any pos, but frequencies may not generalize
ALiBi | ✅ Good | Linear bias defined for any distance
RoPE | ⚠️ Moderate (better with YaRN/NTK scaling) | Rotation angles grow linearly
NoPE (No Positional Encoding) | ✅ Surprisingly good (Kazemnejad et al., 2023) | Model may learn position from causal mask + token patterns

NoPE result: Kazemnejad et al. (2023) show that Transformers CAN learn positional information from the causal mask alone (since tokens at different positions see context windows of different sizes) and from the sequential nature of text. However, explicit position encoding remains standard practice.



Pitfalls

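⚠️ Pitfall 1: Trying to read distance off individual PE values. Sinusoidal PE encodes relative position through dot products, NOT through any single component. In the high-frequency dimensions, adjacent positions can have very different values (sin(0) = 0 vs. sin(1) ≈ 0.84); it is the DOT PRODUCT across all dimensions that encodes distance. Don't try to interpret the raw PE values one dimension at a time.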

⚠️ Pitfall 2: Assuming sinusoidal PE extrapolates perfectly. While mathematically defined for any position, high-frequency components at unseen positions can produce attention patterns the model never encountered during training. ALiBi and RoPE with scaling are more robust in practice.

⚠️ Pitfall 3: Confusing ALiBi's linear bias with exponential decay. ALiBi SUBTRACTS m·|i-j| from scores. After softmax, this produces exp(-m·|i-j|) decay — exponential in distance, not linear. The "linear" refers to the bias, not the resulting attention weights.


Worked Examples

Example 1: Computing Sinusoidal PE

Problem: For d = 4, compute PE(0) and PE(1).

Solution:

d = 4 → i ∈ {0, 1}. ω₀ = 1/10000^(0/4) = 1. ω₁ = 1/10000^(2/4) = 1/10000^(0.5) = 1/100 = 0.01.

PE(0):
  - dim 0 (sin, i=0): sin(0·1) = sin(0) = 0
  - dim 1 (cos, i=0): cos(0·1) = cos(0) = 1
  - dim 2 (sin, i=1): sin(0·0.01) = sin(0) = 0
  - dim 3 (cos, i=1): cos(0·0.01) = cos(0) = 1
PE(0) = [0, 1, 0, 1]

PE(1):
  - dim 0 (sin, i=0): sin(1·1) = sin(1) ≈ 0.8415
  - dim 1 (cos, i=0): cos(1·1) = cos(1) ≈ 0.5403
  - dim 2 (sin, i=1): sin(1·0.01) = sin(0.01) ≈ 0.0100
  - dim 3 (cos, i=1): cos(1·0.01) = cos(0.01) ≈ 0.99995
PE(1) ≈ [0.8415, 0.5403, 0.0100, 0.99995]

Observations:
  - Dimension pair 0 changes significantly between pos 0 and 1 (high frequency)
  - Dimension pair 1 barely changes (low frequency)
  - All values stay in [-1, 1]

Example 2: Proving Relative Position Property

Problem: For d = 2 (one dimension pair), compute PE(3)ᵀPE(7) and PE(10)ᵀPE(14). Show they're equal.

Solution:

With d=2, ω₀ = 1/10000⁰ = 1.

PE(pos) = [sin(pos·1), cos(pos·1)] = [sin(pos), cos(pos)]

PE(3)ᵀPE(7) = sin(3)·sin(7) + cos(3)·cos(7) = cos(7−3) = cos(4) ≈ −0.6536

PE(10)ᵀPE(14) = sin(10)·sin(14) + cos(10)·cos(14) = cos(14−10) = cos(4) ≈ −0.6536

Both equal cos(4) — the dot product depends only on the offset (4 in both cases), not the absolute positions (3,7 vs. 10,14). ✓

Example 3: ALiBi Attention Weights

Problem: For a 5-token sequence, compute ALiBi attention weights for the last token (pos 4) with slopes m = 0.5 and m = 0.1. Assume q_iᵀk_j/√d_k is uniform for simplicity.

Solution:

For simplicity, assume all content-based scores are 0 (or equal). Then score_{4,j} = 0 − m·|4−j| = −m·(4−j) for j ≤ 4.

m = 0.5:
  scores: [−2.0, −1.5, −1.0, −0.5, 0]
  exp: [0.135, 0.223, 0.368, 0.607, 1.0]
  weights (normalized): [0.058, 0.096, 0.158, 0.260, 0.429]

m = 0.1:
  scores: [−0.4, −0.3, −0.2, −0.1, 0]
  exp: [0.670, 0.741, 0.819, 0.905, 1.0]
  weights (normalized): [0.162, 0.179, 0.198, 0.219, 0.242]

Observation: With m=0.5, the model attends much more to the most recent token (42.9%). With m=0.1, attention is nearly uniform. Different heads can have different "attention ranges" via different slopes.
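The numbers in this example can be reproduced with a short plain-Python sketch. The function name `alibi_weights` is illustrative, and, as in the example, all content scores are assumed to be 0.

```python
import math

def alibi_weights(query_pos, slope):
    # Content scores are assumed to be 0, so only the distance penalty
    # -slope * (query_pos - j) remains for each key position j <= query_pos.
    scores = [-slope * (query_pos - j) for j in range(query_pos + 1)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

local_head = alibi_weights(4, 0.5)  # steep slope: attention concentrated on recent tokens
broad_head = alibi_weights(4, 0.1)  # shallow slope: attention close to uniform
```

Rounded to three decimals, `local_head` and `broad_head` match the two weight vectors computed by hand above.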



Quiz

Q1: What does absolute position encoding refer to in this subject?

A) Adding a position-dependent vector PE(pos) to each token embedding
B) Biasing attention scores by the distance |i − j|
C) Rotating the query and key vectors by position-dependent angles
D) Relying on the causal mask alone to convey position

Correct: A)

Q2: What is the primary purpose of relative position encoding?

A) To assign each absolute position its own learned vector
B) To make attention scores depend on the offset between two tokens rather than on their absolute positions
C) To remove the softmax from the attention computation
D) To keep all encoding values in [-1, 1]

Correct: B)

Q3: Which statement about self-attention without positional encoding is TRUE?

A) It attends more strongly to nearby tokens by default
B) Its output is unchanged when the input tokens are shuffled
C) It is permutation-equivariant: shuffling the input tokens shuffles the output identically
D) It can only process sequences of a fixed length

Correct: C)

Q4: In Worked Example 2 (d = 2), what is PE(3)ᵀPE(7)?

A) cos(10) ≈ −0.8391
B) cos(4) ≈ −0.6536
C) sin(4) ≈ −0.7568
D) cos(3)·cos(7)

Correct: B)

Q5: How does sinusoidal positional encoding address the permutation-equivariance of self-attention?

A) It replaces the softmax with a distance-based kernel
B) It adds a position-dependent vector to each token embedding, so the same token at different positions produces different inputs
C) It masks out all tokens beyond a fixed window
D) It has no effect on permutation-equivariance

Correct: B)

Q6: What is a common pitfall when working with the relative position property of sinusoidal PE?

A) Interpreting individual raw PE values, when relative position is encoded in the dot product PE(m)ᵀPE(n)
B) Using a different frequency for each dimension pair
C) Adding the PE to the token embeddings before attention
D) Keeping the PE fixed instead of learning it

Correct: A)

Q7: Which of the following is a design rationale for sinusoidal PE given in this subject?

A) It requires learning one vector per position
B) Its values grow linearly with position
C) It uses a single frequency across all dimensions
D) It is fixed (no parameters), bounded in [-1, 1], and defined for any position

Correct: D)

Practice Problems

Problem 1

For d = 8, compute PE(100, 0) through PE(100, 7). What are the wavelengths for each dimension pair?

Answer:
d = 8, i ∈ {0, 1, 2, 3}
ω₀ = 1, λ₀ = 2π ≈ 6.28
ω₁ = 1/10000^(2/8) = 1/10000^0.25 = 1/10 = 0.1, λ₁ = 20π ≈ 62.8
ω₂ = 1/10000^(4/8) = 1/100, λ₂ = 200π ≈ 628.3
ω₃ = 1/10000^(6/8) = 1/1000, λ₃ = 2000π ≈ 6283.2

PE(100):
  - dim 0: sin(100) ≈ −0.5064
  - dim 1: cos(100) ≈ 0.8623
  - dim 2: sin(10) ≈ −0.5440
  - dim 3: cos(10) ≈ −0.8391
  - dim 4: sin(1) ≈ 0.8415
  - dim 5: cos(1) ≈ 0.5403
  - dim 6: sin(0.1) ≈ 0.0998
  - dim 7: cos(0.1) ≈ 0.9950

Problem 2

Prove that for sinusoidal PE, PE(pos + k) can be expressed as a linear transformation of PE(pos).

Answer:
For dimension pair i, PE(pos + k) has components [sin((pos+k)·ω_i), cos((pos+k)·ω_i)].

Using the angle-addition formulas:
sin(A+B) = sin(A)cos(B) + cos(A)sin(B)
cos(A+B) = cos(A)cos(B) − sin(A)sin(B)

with A = pos·ω_i and B = k·ω_i:

[ sin((pos+k)·ω_i) ]   [  cos(k·ω_i)  sin(k·ω_i) ]   [ sin(pos·ω_i) ]
[ cos((pos+k)·ω_i) ] = [ −sin(k·ω_i)  cos(k·ω_i) ] · [ cos(pos·ω_i) ]

So PE(pos+k) = R(k) · PE(pos), where R(k) is a block-diagonal rotation matrix that depends only on k. This is the core idea behind RoPE!
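The result can be verified numerically with a plain-Python sketch. The helper names `pe` and `shift`, the dimension d = 8, and the test positions are illustrative choices.

```python
import math

def pe(pos, d=8, base=10000.0):
    # Sinusoidal encoding: [sin, cos] for each dimension pair i.
    out = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        out += [math.sin(angle), math.cos(angle)]
    return out

def shift(pe_vec, k, base=10000.0):
    # Apply the block-diagonal matrix R(k): one 2x2 block per dimension pair.
    # Only k and the frequencies are used; the position itself is never needed.
    d = len(pe_vec)
    out = []
    for i in range(d // 2):
        s, c = pe_vec[2 * i], pe_vec[2 * i + 1]
        a = k / base ** (2 * i / d)                     # k * omega_i
        out.append(math.cos(a) * s + math.sin(a) * c)   # sin((pos + k) * omega_i)
        out.append(-math.sin(a) * s + math.cos(a) * c)  # cos((pos + k) * omega_i)
    return out

shifted = shift(pe(100), 7)  # R(7) applied to PE(100)
direct = pe(107)             # PE(107) computed from the definition
```

`shifted` and `direct` agree up to floating-point rounding, and the same `shift(..., 7)` works for any starting position.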

Problem 3

Explain why ALiBi with m = 0 (no bias) is equivalent to having no positional encoding at all. What does the model lose?

Answer:
With m = 0, score_{ij} = q_iᵀk_j/√d_k: pure content-based attention with no explicit positional signal. Without a causal mask, the model loses all notion of token order and cannot distinguish "dog bites man" from "man bites dog." With a causal mask, the model might still pick up some positional cues, because each token sees a different amount of context, and because some content patterns are position-dependent (e.g., capital letters at sentence start).

Problem 4

For sinusoidal PE with d = 512, what is the ratio of the highest frequency (i=0) to the lowest frequency (i=255)?

Answer:
ω₀ = 1/10000^(0/512) = 1
ω₂₅₅ = 1/10000^(510/512) = 1/10000^0.996 ≈ 1/9647
Ratio: ω₀/ω₂₅₅ ≈ 9647
The frequencies span nearly 4 orders of magnitude, from ~6-position wavelengths to ~60K-position wavelengths.

Problem 5

In ALiBi, why is the bias subtracted rather than added? What would happen if you added m|i−j| instead?

Answer:
Subtracting m|i−j| means attention scores DECREASE with distance. After softmax, far tokens get exponentially smaller weights, implementing a locality bias (nearby tokens are more relevant). If you ADDED m|i−j|, distant tokens would get HIGHER scores: the model would preferentially attend to far-away tokens, which is the opposite of the desired inductive bias and would likely perform poorly since most linguistic dependencies are local.

Summary

  1. Sinusoidal PE uses sine/cosine at geometrically spaced frequencies: PE(pos, 2i) = sin(pos/10000^(2i/d)), giving wavelengths from ~6 to ~60K positions
  2. The sine/cosine formulation guarantees PE(m)ᵀPE(n) = Σ cos((m−n)·ω_i), making the dot product depend only on relative offset
  3. ALiBi replaces explicit PE with a linear bias −m·|i−j| in attention scores, causing exponential decay of attention with distance and natural extrapolation
  4. RoPE (preview) rotates Q and K vectors by position-dependent angles, encoding relative position in the dot product: q_mᵀR(θ, n−m)k_n
  5. Extrapolation varies by scheme: learned PE fails, sinusoidal/ALiBi/RoPE all theoretically can extrapolate, with ALiBi being the most naturally robust


Next Steps

Continue to 18-04 — Rotary Position Embeddings (RoPE) — Deep for a thorough mathematical treatment of RoPE, including the rotation matrix formulation, proof of relative position encoding, and extrapolation properties.