18-02 — Embedding Layers
Phase: 18 — Large Language Model Mathematics Subject: 18-02 Prerequisites: 18-01 (Tokenization Mathematics), 09-03 (SVD — low-rank interpretation), 08-04 (Matrices as Linear Transformations), 14-02 (Gradient Descent) Next subject: 18-03 — Positional Encodings
Learning Objectives
By the end of this subject, you will be able to:
- Derive the input embedding lookup as a matrix multiplication with a one-hot vector, and explain why it's implemented as a direct lookup
- Analyze the token embedding matrix E ∈ ℝ^(V×d) as a learned mapping from discrete tokens to a continuous vector space
- Prove the equivalence (and the conditions for) weight tying between the input embedding and output projection layers
- Compute gradient flow through the embedding layer during backpropagation
- Compare learned vs. fixed positional embeddings mathematically and understand when each is preferred
Core Content
1. The Embedding Matrix as a Lookup Table
The token embedding layer maps a token index t ∈ {1, 2, ..., V} to a dense vector e_t ∈ ℝ^d.
The embedding matrix is:
E ∈ ℝ^(V × d)
where row t of E is the embedding of token t: e_t = E[t, :].
⚠️ THIS IS CRITICAL — Although embeddings are implemented as a lookup table, they are mathematically equivalent to matrix multiplication with a one-hot vector: e_t = (one_hot(t)) · E where one_hot(t) ∈ ℝ^(1×V) has a 1 at position t and 0 elsewhere. This perspective is essential for understanding gradient flow.
For a sequence of n tokens t₁, t₂, ..., t_n:
X = [e_{t₁}; e_{t₂}; ...; e_{t_n}] ∈ ℝ^(n × d)
This is formed by stacking the embedding vectors. In practice, this is a single gather/index operation.
2. What the Embedding Learns
Intuition: The embedding maps discrete symbols to a continuous vector space where semantically similar tokens have similar vectors (high cosine similarity). This is NOT explicitly trained — it emerges from the language modeling objective.
Formal view: The embedding matrix E is a linear map from the V-dimensional token space to the d-dimensional embedding space. Since tokens are one-hot, each row of E IS the representation of that token. The Transformer then transforms these initial representations through its layers.
Dimensionality: d is the model dimension (d_model). Typical values: GPT-2 (768), GPT-3 (12,288), Llama-7B (4,096), Llama-70B (8,192).
Initialization: Embeddings are typically initialized from N(0, 1/√d) or N(0, 0.02). Small initial values prevent the softmax in attention from saturating early in training.
3. Weight Tying
A key mathematical insight: the input embedding matrix and the output projection can share weights.
The output projection: After the final Transformer layer produces h_n ∈ ℝ^d for the last token, we compute logits:
z = h_n · W_outᵀ + b_out
where W_out ∈ ℝ^(V × d) and z ∈ ℝ^V gives unnormalized log-probabilities over the vocabulary.
Weight tying: Set W_out = E. Then:
z = h_n · Eᵀ
The logit for token j is: z_j = h_n · e_j (dot product of the final hidden state with token j's embedding).
Why this works mathematically: The embedding of token j serves both as its "input meaning" and as its "output target." If token j appears frequently in context C, its embedding e_j should have high dot product with the representation h that context C produces. Weight tying enforces this consistency.
Parameter savings: Without tying: |V|·d + |V|·d = 2|V|d parameters for embedding + output. With tying: |V|·d parameters (half). For |V|=32K, d=4096: savings of 131M parameters (~2% of a 7B model).
Proof of gradient with weight tying:
Let L be the cross-entropy loss. With weight tying, E receives gradients from both the input side (∂L/∂E_in) and the output side (∂L/∂E_out):
∂L/∂E = ∂L/∂E_in + ∂L/∂E_out
The output-side gradient for row j: ∂L/∂e_j = (∂L/∂z_j) · h_n, where ∂L/∂z_j = p_j − 1_{j=t} (for one-hot target t; p_j = softmax(z)_j).
The input-side gradient: for a token t at position k, ∂L/∂e_t accumulates ∂L/∂h_k · (∂h_k/∂e_t). Since h_k = e_t (identity mapping at input), this is straightforward.
4. Gradient Flow Through Embeddings
Consider the embedding lookup for a single position with token index t:
x = lookup(E, t) = E[t, :]
The forward pass is: E[t, :] → x.
Backward pass: Given ∂L/∂x ∈ ℝ^d (gradient of loss with respect to the embedding output):
Only row t of E receives gradient:
∂L/∂E[t, :] = ∂L/∂x (all other rows: zero) ∂L/∂E[i, :] = 0 for i ≠ t
Sparse gradient property: Since each token corresponds to exactly one row of E, the embedding gradient is extremely sparse. For a batch with B sequences of length S, only B·S rows of E get non-zero gradient per step, out of V possible rows. Typical ratio: for B=512, S=2048: B·S ≈ 1M, while V ≈ 32K–100K. Most tokens are updated, but heavily skewed — frequent tokens get many more gradient updates.
5. Positional Embeddings: Learned vs. Fixed
Besides token embeddings, we need to encode position information. There are two approaches for the positional contribution:
Learned Positional Embeddings (GPT, BERT pre-RoPE)
Add a learned matrix P ∈ ℝ^(max_seq_len × d):
X_input = X_token + P[0:n, :]
Row p of P encodes "position p." The model learns what position p means.
Advantages: Flexible, can learn any position encoding pattern. Disadvantages: Cannot extrapolate beyond max_seq_len seen during training. Every new position needs a new learned vector.
Fixed Sinusoidal Positional Encodings (Original Transformer)
Covered in depth in 18-03. The key point here is: these are NOT learned — they're fixed functions of position.
Rotary Position Embeddings (RoPE — modern LLMs)
Covered in depth in 18-03 and 18-04. RoPE encodes position via rotation rather than addition, so there is NO separate positional embedding matrix to learn.
6. Embedding Dimensionality and the Geometry of the Token Space
The embedding dimension d determines the capacity of the token representation. But there's a crucial insight from the SVD perspective:
Low-rank nature of learned embeddings: After training, the embedding matrix E often has effective rank much less than min(V, d). This is because language has structure — tokens cluster into semantic categories. The SVD: E = UΣVᵀ reveals that most variance is captured by the top k ≪ d singular values.
Cosine similarity semantics: For two tokens i and j, their similarity is:
sim(i, j) = (e_i · e_j) / (||e_i|| · ||e_j||) = cos(θ_{ij})
In a well-trained embedding space: synonyms have high cosine similarity, antonyms sometimes also have high cosine similarity (they share semantic context), and unrelated tokens have near-zero cosine similarity.
The embedding norm matters: ||e_t|| affects attention dot products (QKᵀ has terms involving e_t). Modern practice often normalizes embeddings or uses LayerNorm on the embedding output before feeding into the first Transformer block.
Pitfalls
⚠️ Pitfall 1: Forgetting that the embedding gradient is sparse. Only row t of E gets gradient for token t. All other rows get zero. This means rare tokens update extremely infrequently — their embeddings can remain near their random initialization for most of training.
⚠️ Pitfall 2: Misunderstanding weight tying. Weight tying sets W_out = E, but this means the SAME matrix receives gradients from TWO sources (input and output). The total gradient is the SUM: ∂L/∂E = ∂L/∂E_in + ∂L/∂E_out. This isn't just parameter sharing — it changes the optimization dynamics.
⚠️ Pitfall 3: Ignoring embedding norm explosion. Without LayerNorm/RMSNorm on the embedding output, tokens with large ||e_t|| dominate attention scores. A single outlier embedding can cause attention collapse where all queries attend to that token.
Key Terms
- 18 02 Embedding Layers
- Common Pitfalls
- Example 1: Embedding Lookup as Matrix Multiply
- Example 2: Weight Tying Gradient Calculation
- Example 3: Embedding Norm Impact on Attention
- Gradient Flow Through Embeddings
- Pitfall 2: Misunderstanding weight tying.
- Pitfall 3: Ignoring embedding norm explosion.
- Positional Embeddings: Learned vs. Fixed
- Problem 1
- Problem 2
- Problem 3
Worked Examples
Example 1: Embedding Lookup as Matrix Multiply
Problem: Show that embedding lookup is equivalent to matrix multiplication with a one-hot vector. For V = 3, d = 2, E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], what is the embedding of token 2?
Solution:
One-hot of token 2 (0-indexed: token 2 = index 1): h = [0, 1, 0]
h · E = [0, 1, 0] · [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]] = [0·0.1 + 1·0.3 + 0·0.5, 0·0.2 + 1·0.4 + 0·0.6] = [0.3, 0.4]
This equals E[1, :] — the direct lookup. The matrix multiply perspective shows that the embedding layer is a linear transformation applied to a one-hot representation.
Example 2: Weight Tying Gradient Calculation
Problem: With d = 2, V = 3, and weight tying, the final hidden state is h = [0.5, 0.8]. The embedding matrix rows are e₀=[0.1,0.2], e₁=[0.3,0.4], e₂=[0.5,0.6]. The true next token is token 1. Compute the output-side gradient for each row of E.
Solution:
Logits: z_j = h · e_j z₀ = 0.5·0.1 + 0.8·0.2 = 0.05 + 0.16 = 0.21 z₁ = 0.5·0.3 + 0.8·0.4 = 0.15 + 0.32 = 0.47 z₂ = 0.5·0.5 + 0.8·0.6 = 0.25 + 0.48 = 0.73
Softmax: exp(z₀)=1.234, exp(z₁)=1.600, exp(z₂)=2.075 sum = 4.909 p = [0.251, 0.326, 0.423]
∂L/∂z_j = p_j − 1_{j=1} = [0.251, 0.326−1, 0.423] = [0.251, −0.674, 0.423]
∂L/∂e_j = (∂L/∂z_j) · h ∂L/∂e₀ = 0.251 · [0.5, 0.8] = [0.126, 0.201] ∂L/∂e₁ = −0.674 · [0.5, 0.8] = [−0.337, −0.539] ∂L/∂e₂ = 0.423 · [0.5, 0.8] = [0.212, 0.338]
Interpretation: The correct token (1) gets a negative gradient to reduce its embedding's alignment with h (because p₁ should be higher — wait, let me check). Actually, p₁ = 0.326 but should be 1.0. ∂L/∂z₁ = p₁ − 1 = -0.674. This means we need to INCREASE z₁. The gradient pushes e₁ in the negative h direction, which with SGD update e₁ ← e₁ − η·(−0.674·h) = e₁ + η·0.674·h, INCREASES e₁'s alignment with h. Correct.
Example 3: Embedding Norm Impact on Attention
Problem: An input sequence has token embeddings with norms: ||e_a|| = 2.0, ||e_b|| = 0.5, ||e_c|| = 1.0. In the first attention layer, the query vector q comes from a linear projection: q_i = W_Q · e_i with W_Q initialized to small random values. Estimate the relative magnitudes of attention scores and explain the problem.
Solution:
q_a ≈ W_Q · e_a. Since W_Q has small entries (N(0, 1/√d)), the norms roughly scale with ||e_i||: ||q_a|| ≈ ||W_Q||_F · ||e_a|| ∝ 2.0 ||q_b|| ∝ 0.5 ||q_c|| ∝ 1.0
Attention scores: score_{ij} = q_iᵀk_j / √d_k ∝ ||q_i|| · ||k_j|| ∝ ||e_i|| · ||e_j||
So score_{a,j} is ~4× larger than score_{b,j} on average. Token a dominates attention regardless of content. This is why embedding LayerNorm or RMSNorm on the embedding output is critical — it normalizes ||e_i|| to ~1 for all tokens before the first attention layer.
Summary
- Token embeddings are a lookup table E ∈ ℝ^(V×d): each row is a learned d-dimensional representation of a token
- Embedding lookup is mathematically a matrix multiply with a one-hot vector, making gradient flow analysis clean: only the row corresponding to the input token receives non-zero gradient
- Weight tying sets the output projection equal to the input embedding transpose (W_out = Eᵀ), saving ~|V|·d parameters and enforcing representational consistency
- Embedding initialization from N(0, 1/√d) gives E[||e_t||²] = 1, preventing attention score magnitude issues early in training
- Positional information is added via learned positional embeddings, fixed sinusoidal encodings, or RoPE — the choice impacts extrapolation ability
Quiz
Q1: How is the embedding of token index t computed from matrix E?
A) e_t = E[t, :] (row lookup) — equivalent to one_hot(t) · E B) e_t = E[:, t] (column lookup) C) e_t = E · one_hot(t) D) e_t = softmax(E[t, :])
Correct: A)
- If you chose A: Rows correspond to tokens, columns to embedding dimensions. The one-hot vector multiplication perspective confirms this: [0,...,1,...,0] · E selects row t. B would use columns as tokens. C would produce a V-dimensional vector, not d-dimensional. Correct!
Q3: In the backward pass of an embedding lookup for token t at position k, which rows of E receive non-zero gradient?
A) All rows proportionally B) Only row t C) Rows t-1, t, t+1 D) All rows of tokens that appear in the sequence
Correct: B)
- If you chose B: The forward pass is x = E[t,:], an identity mapping from that row. The chain rule gives ∂L/∂E[t,:] = ∂L/∂x, and ∂L/∂E[i,:] = 0 for i ≠ t. This is why embedding gradients are sparse. D describes which rows get gradient across the full sequence, but for a single position, only row t gets gradient. Correct!
Q5: What happens if you use learned positional embeddings and evaluate on a sequence length 2× the training max?
A) The model gracefully extrapolates B) Positions beyond the training max have no learned embedding — undefined behavior C) The model automatically interpolates D) The attention mechanism ignores position for those tokens
B — Learned positional embeddings P have exactly L_max rows. For position p > L_max, there is no P[p,:]. You must either interpolate P to create entries for new positions, extrapolate by reusing P[L_max,:], or use a scheme like RoPE that naturally extrapolates.
Next Steps
Continue to 18-03 — Positional Encodings for a deep dive into sinusoidal positional encodings, why they work, and how they encode relative positions.