19-01 — Mixture of Experts (MoE) — Deep
Phase: 19 — Advanced LLM Mathematics Subject: 19-01 Prerequisites: 18-10 (Transformer Variants Math Focus), 18-05 (Decoder-Only Architecture), 17-07 (Scaled Dot-Product Attention), 14-02 (Gradient Descent), 13-04 (KL Divergence), 10-06 (Expectation of Discrete RVs) Next subject: 19-02 — Attention Variants
Learning Objectives
By the end of this subject, you will be able to:
- Derive the full MoE forward pass from the router logits through top-k selection to the weighted expert output, including the gate Jacobian for backpropagation
- Formulate and differentiate at least three load balancing loss variants (importance-based, CV-based, z-loss) and explain their effects on expert utilization entropy
- Compute expert capacity and analyze the token dropping rate as a function of capacity factor and batch-averaged routing distribution
- Prove why auxiliary losses are necessary from the gradient starvation argument and demonstrate the "rich get richer" collapse in a minimal setup
- Compare the FLOPs, parameter count, and memory footprint of dense vs MoE layers for arbitrary E (experts), k (top-k), and batch size B, including the all-to-all communication cost in distributed training
Core Content
1. The MoE Layer: Complete Forward Pass
A standard MoE layer replaces the feed-forward network (FFN) in a transformer block with E parallel expert FFNs and a learned router (gate) that selects a subset per token.
1.1 Router / Gate Function
Given input token representation x ∈ ℝ^d (after attention + residual):
$z = W_g · x ∈ ℝ^E (router logits) h = top-k(z) ∈ ℝ^E (sparse selection — k non-zero entries) g = softmax(h) ∈ ℝ^E (normalized gate scores) $
where W_g ∈ ℝ^(E×d) is the router weight matrix.
The $top-k$ operation zeroes out all but the k largest entries of z:
top-k(z)_j = { z_j if j ∈ argtop-k(z)
{ −∞ otherwise
After softmax, for j ∉ argtop-k(z): g_j = 0 (since e^(−∞) = 0).
For the selected experts T = argtop-k(z):
g_j = e^(z_j) / Σ_{t∈T} e^(z_t) for j ∈ T
Important: The softmax is computed ONLY over the selected experts, not over all E. This is called "sparse softmax" or "masked softmax."
1.2 Expert Forward Pass
Each expert j is an independent FFN with its own parameters:
FFN_j(x) = W_{out,j} · σ(W_{in,j} · x + b_{in,j}) + b_{out,j}
Typically: W_{in,j} ∈ ℝ^(d_ff×d), W_{out,j} ∈ ℝ^(d×d_ff) with d_ff = 4d (SwiGLU variant splits the intermediate dimension).
The MoE layer output is the weighted sum over selected experts:
$y = Σ_{j∈T} g_j · FFN_j(x)
$
For k=2 (most common):
$y = g_{e₁} · FFN_{e₁}(x) + g_{e₂} · FFN_{e₂}(x)
$
1.3 Parameter and FLOPs Accounting
Let each expert FFN have 2·d·d_ff parameters (ignoring biases). With d_ff = 4d:
- Parameters per expert: 8d²
- Total MoE FFN params: E · 8d²
- Router params: E·d
- Attention params (shared across experts): 4d²
Per-token FLOPs: - Router: 2·E·d (matmul + softmax) - Selected experts: k · (2 · 8d²) = 16k·d² - Total: ≈ 16k·d² (router is negligible for large d)
Comparison to dense: - Dense FFN: 8d² params, 16d² FLOPs per token - MoE (E experts, top-k): E·8d² params, 16k·d² FLOPs per token - Parameter ratio: E× - FLOPs ratio: k× (not E×!)
This is the core MoE value proposition: massive parameter increase at near-constant per-token compute.
2. Gradient Flow Through the Router
2.1 The Gate Jacobian
⚠️ THIS IS CRITICAL — understanding the router gradient explains why MoE needs auxiliary losses and why training is tricky.
The MoE output for a single token is:
$y = Σ_{j∈T} g_j · FFN_j(x) where g_j = e^(z_j) / Σ_{t∈T} e^(z_t)
$
The Jacobian of the gate scores with respect to the logits (for j,i ∈ T):
$∂g_j / ∂z_i = g_j·(δ_{ji} − g_i) (sparse softmax derivative)
$
where δ_{ji} = 1 if j=i else 0.
For j ∉ T: g_j = 0 and ∂g_j/∂z_i = 0 for all i (the gradient is blocked by top-k).
Chain rule for the loss L:
$∂L/∂z_i = Σ_{j∈T} (∂L/∂y · FFN_j(x)) · ∂g_j/∂z_i
$
For non-selected experts (j ∉ T): ∂L/∂z_j = 0. These experts receive zero gradient for this token.
2.2 The Gradient Starvation Problem
Consider a simplified 2-expert MoE with k=1. If expert 1 is selected for a token:
- Expert 1 receives the full gradient through g₁ = 1
- Expert 2 receives ZERO gradient for that token
The only way expert 2 ever gets gradient is from tokens where it is selected. This creates a feedback loop:
Expert 1 gets selected → Expert 1 gets better → Expert 1 selected more
Expert 2 not selected → Expert 2 gets worse → Expert 2 selected less
Formal argument: Let f_j(t) be the fraction of tokens routed to expert j at step t. The expected gradient norm received by expert j is proportional to f_j(t). The update rule (simplified):
$θ_j(t+1) = θ_j(t) − η · f_j(t) · g_j $
If f₁ is slightly larger than f₂ at initialization, Expert 1 improves faster, increasing f₁ further — a positive feedback loop that collapses the router.
3. Load Balancing Loss Functions
3.1 Importance-Based Auxiliary Loss (Switch Transformer)
Define the importance of expert j (fraction of tokens routed to it):
$f_j = (1/B) · Σ_{i=1}^{B} 𝟙[token i routes to expert j]
$
Define the gate probability mass for expert j:
P_j = (1/B) · Σ_{i=1}^{B} p_{ij}
where p_{ij} = softmax(z_i)_j (before top-k) is the "soft" gate score.
The auxiliary loss encourages uniform routing:
$L_aux = α · E · Σ_{j=1}^{E} f_j · P_j
$
Under uniform routing: f_j = 1/E and P_j = 1/E, so:
$L_aux_uniform = α · E · E · (1/E²) = α $
Under total collapse (all tokens to one expert): f₁ = 1, P₁ = 1, others = 0, so:
$L_aux_collapse = α · E · 1 · 1 = α·E $
This penalizes collapse proportionally to E — large numbers of experts need stronger balancing.
3.2 Coefficient of Variation (CV) Loss
Used in GShard and later MoE models. The CV of expert loads:
$CV(f) = σ(f) / μ(f) $
where μ(f) = 1/E (mean fraction per expert) and σ(f) = √((1/E)·Σ(f_j − 1/E)²).
$L_cv = α · CV(f)² = α · E · Σ_{j=1}^{E} (f_j − 1/E)²
$
This is more direct: it penalizes any deviation from uniform load, not just the imbalance × importance cross-product. Under uniform routing, L_cv = 0. Under collapse, L_cv = α·(1 − 1/E)²·E ≈ α·E (same asymptotic behavior).
3.3 Router Z-Loss
Added in ST-MoE to stabilize training in large MoE models:
$L_z = (1/B) · Σ_{i=1}^{B} (log Σ_{j=1}^{E} e^(z_{ij}))²
$
This penalizes large logit magnitudes in the router, encouraging z-score-like logit distributions. It prevents router logit explosion, which can cause softmax saturation (near-deterministic selection).
3.4 Total Training Loss
$L_total = L_LM + α_aux · L_aux + α_z · L_z $
where L_LM is the language modeling loss. Typical coefficients: α_aux ∈ [0.01, 0.1], α_z ∈ [0.001, 0.01].
4. Expert Capacity and Token Dropping
4.1 Capacity Formulation
Each expert has a fixed capacity — the maximum number of tokens it can process per batch. This is essential for efficient hardware utilization (fixed tensor sizes).
$C = capacity_factor · (B · k / E) $
where: - B = batch size (total tokens) - k = top-k (tokens per expert) - E = number of experts - capacity_factor > 1 (typically 1.25-1.5)
B·k/E is the expected number of tokens per expert under uniform routing.
4.2 Token Dropping
If more than C tokens route to an expert, the excess tokens are dropped — their expert outputs are zeroed, and the residual connection passes the token through unchanged.
The dropping rate depends on the routing distribution and capacity factor:
For capacity_factor = 1.25, up to 25% more tokens than expected can be accommodated. If routing variance is high (uneven distribution), some experts overflow while others are underfilled.
Expected dropping rate (approximate, assuming independent routing):
$P(drop)_j ≈ P(Poisson(B·p_j) > C) $
where p_j ≈ 1/E (targeted). With capacity_factor = 1.5 and E ≥ 8, the dropping rate is typically <1%.
4.3 Load Balancing Loss vs Capacity Interaction
Aggressive load balancing (high α_aux) → uniform routing → low dropping with small capacity_factor → memory efficient.
Weak load balancing → skewed distribution → needs large capacity_factor to avoid dropping → memory heavy.
This is a trilemma: parameter count (E), per-token FLOPs (k), and memory (C ≈ capacity_factor·B·k/E). You can't optimize all three simultaneously.
5. Distributed MoE: All-to-All Communication
5.1 Expert Parallelism
Experts are sharded across devices. Each device hosts E/D experts. Tokens must be routed across devices:
- All-to-all send: Each device sends tokens to the devices hosting their selected experts
- Expert computation: Each device processes the tokens assigned to its experts
- All-to-all recv: Tokens are sent back to their originating devices
5.2 Communication Cost
For D devices, B tokens, d model dimension:
$communication_volume = B · d · k (bytes per all-to-all, fp16) $
Compared to dense model communication (no expert routing needed), MoE adds significant network traffic. For large E and D, all-to-all can become the bottleneck.
Pitfalls
⚠️ Pitfall 1: Using softmax over ALL experts. The standard MoE implementation computes softmax ONLY over the selected k experts (masked softmax). Computing softmax over all E and then taking top-k gives different gate scores because the denominator includes all experts, not just the top-k.
⚠️ Pitfall 2: Forgetting that the top-k operation blocks gradient to non-selected experts. For experts j ∉ T, g_j = 0 and ∂y/∂z_j = 0. These experts learn NOTHING from this token. This is the "gradient starvation" problem — the load balancing loss is the only mechanism pushing tokens toward underused experts.
⚠️ Pitfall 3: Ignoring all-to-all communication cost in distributed MoE. When experts live on different GPUs, each token must be sent to the GPUs hosting its selected experts. This cross-device communication can dominate training time — an MoE with 8 experts across 8 GPUs spends significant time just moving tokens around.
Key Terms
- Expected dropping rate
Worked Examples
Example 1: Router Forward Pass
Given a router weight matrix W_g and token x:
$W_g = [[1, 0, 2, 1],
[0, 1, -1, 2],
[2, -1, 0, 1],
[1, 1, 1, 0]]
x = [1, 2, -1, 3]^T
$
Step 1: Compute logits z = W_g · x
$z₁ = 1·1 + 0·2 + 2·(−1) + 1·3 = 1 + 0 − 2 + 3 = 2 z₂ = 0·1 + 1·2 + (−1)·(−1) + 2·3 = 0 + 2 + 1 + 6 = 9 z₃ = 2·1 + (−1)·2 + 0·(−1) + 1·3 = 2 − 2 + 0 + 3 = 3 z₄ = 1·1 + 1·2 + 1·(−1) + 0·3 = 1 + 2 − 1 + 0 = 2 z = [2, 9, 3, 2] $
Step 2: Apply top-2
The two largest values are z₂ = 9 and z₃ = 3.
$h = [−∞, 9, 3, −∞] $
Step 3: Compute sparse softmax g = softmax(h)
$g₂ = e^9 / (e^9 + e^3) = 8103.08 / (8103.08 + 20.09) = 8103.08 / 8123.17 = 0.9975 g₃ = e^3 / (e^9 + e^3) = 20.09 / 8123.17 = 0.0025 g₁ = g₄ = 0 g = [0, 0.9975, 0.0025, 0] $
Token routes to experts 2 and 3, with expert 2 dominating.
Example 2: Gradient Through the Top-k Gate
For the same setup, let FFN₂(x) = a₂ ∈ ℝ^d and FFN₃(x) = a₃ ∈ ℝ^d. The MoE output:
$y = 0.9975 · a₂ + 0.0025 · a₃ $
Gradient of L with respect to z:
Only z₂ and z₃ receive gradient (they passed top-k).
$∂L/∂z₂ = ∂L/∂y · a₂ · g₂(1 − g₂) + ∂L/∂y · a₃ · (−g₂·g₃)
= ∂L/∂y · [a₂ · 0.0025 − a₃ · 0.0025]
= 0.0025 · ∂L/∂y · (a₂ − a₃)
∂L/∂z₃ = ∂L/∂y · a₃ · g₃(1 − g₃) + ∂L/∂y · a₂ · (−g₃·g₂)
= ∂L/∂y · [a₃ · 0.0025 − a₂ · 0.0025]
= 0.0025 · ∂L/∂y · (a₃ − a₂)
∂L/∂z₁ = 0 (non-selected, gradient blocked)
∂L/∂z₄ = 0 (non-selected, gradient blocked)
$
Experts 1 and 4 get zero gradient for this token. Their only hope is tokens that route to them.
Example 3: Load Balancing Loss Computation
Consider E=4 experts, B=100 tokens, top-2 routing. After one forward pass:
$Token distribution: f = [0.4, 0.3, 0.2, 0.1] (fractions routed to each) Gate probability: P = [0.35, 0.3, 0.25, 0.1] (average softmax score) $
Importance-based loss (α = 0.01):
$L_aux = α · E · Σ f_j · P_j
= 0.01 · 4 · (0.4·0.35 + 0.3·0.3 + 0.2·0.25 + 0.1·0.1)
= 0.04 · (0.14 + 0.09 + 0.05 + 0.01)
= 0.04 · 0.29
= 0.0116
$
Under uniform (f = [0.25, 0.25, 0.25, 0.25], P = [0.25, 0.25, 0.25, 0.25]):
L_aux_uniform = 0.04 · 4 · 0.0625 = 0.01 (= α, as derived earlier)
CV-based loss (α = 0.01):
$μ = 0.25 σ² = (1/4) · [(0.4−0.25)² + (0.3−0.25)² + (0.2−0.25)² + (0.1−0.25)²] = 0.25 · [0.0225 + 0.0025 + 0.0025 + 0.0225] = 0.25 · 0.05 = 0.0125 CV² = σ²/μ² = 0.0125/0.0625 = 0.2 L_cv = 0.01 · 0.2 = 0.002 $
Quiz
Q1: What does the concept of Expected dropping rate primarily refer to in this subject?
A) A historical anecdote about Expected dropping rate B) The definition and application of Expected dropping rate C) A visual representation of Expected dropping rate D) A computational error related to Expected dropping rate
Correct: B)
- If you chose A: This is incorrect. Expected dropping rate is defined as: the definition and application of expected dropping rate. The other options describe different aspects that are not the primary focus.
- If you chose B: Expected dropping rate is defined as: the definition and application of expected dropping rate. The other options describe different aspects that are not the primary focus. Correct!
- If you chose C: This is incorrect. Expected dropping rate is defined as: the definition and application of expected dropping rate. The other options describe different aspects that are not the primary focus.
- If you chose D: This is incorrect. Expected dropping rate is defined as: the definition and application of expected dropping rate. The other options describe different aspects that are not the primary focus.
Q2: What is the primary purpose of The Moe Layer: Complete Forward Pass?
A) It is used to the moe layer: complete forward pass in mathematical analysis B) It is primarily a historical notation system C) It replaces all other methods in this domain D) It is used only in advanced research contexts
Correct: A)
- If you chose A: The Moe Layer: Complete Forward Pass serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose B: This is incorrect. The Moe Layer: Complete Forward Pass serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. The Moe Layer: Complete Forward Pass serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. The Moe Layer: Complete Forward Pass serves the purpose described in the correct answer. The other options misrepresent its role.
Q3: Which statement about Gradient Flow Through The Router is TRUE?
A) Gradient Flow Through The Router is a fundamental concept covered in this subject B) Gradient Flow Through The Router is an advanced topic beyond this subject's scope C) Gradient Flow Through The Router is not related to this subject D) Gradient Flow Through The Router is mentioned only as a historical footnote
Correct: A)
- If you chose A: Gradient Flow Through The Router is a fundamental concept covered in this subject. This subject covers Gradient Flow Through The Router as part of its core content. Correct!
- If you chose B: This is incorrect. Gradient Flow Through The Router is a fundamental concept covered in this subject. This subject covers Gradient Flow Through The Router as part of its core content.
- If you chose C: This is incorrect. Gradient Flow Through The Router is a fundamental concept covered in this subject. This subject covers Gradient Flow Through The Router as part of its core content.
- If you chose D: This is incorrect. Gradient Flow Through The Router is a fundamental concept covered in this subject. This subject covers Gradient Flow Through The Router as part of its core content.
Q4: How are Gradient Flow Through The Router and Load Balancing Loss Functions related?
A) Gradient Flow Through The Router and Load Balancing Loss Functions are completely unrelated topics B) Gradient Flow Through The Router and Load Balancing Loss Functions are closely related concepts C) Gradient Flow Through The Router is a special case of Load Balancing Loss Functions D) Gradient Flow Through The Router is the inverse of Load Balancing Loss Functions
Correct: B)
- If you chose A: This is incorrect. Both Gradient Flow Through The Router and Load Balancing Loss Functions are covered in this subject as interconnected topics.
- If you chose B: Both Gradient Flow Through The Router and Load Balancing Loss Functions are covered in this subject as interconnected topics. Correct!
- If you chose C: This is incorrect. Both Gradient Flow Through The Router and Load Balancing Loss Functions are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Gradient Flow Through The Router and Load Balancing Loss Functions are covered in this subject as interconnected topics.
Q5: What is a common pitfall when working with Expert Capacity And Token Dropping?
A) Expert Capacity And Token Dropping has no common misconceptions B) The main error with Expert Capacity And Token Dropping is using it when it is not needed C) Expert Capacity And Token Dropping is always computed the same way in all contexts D) A common mistake is confusing Expert Capacity And Token Dropping with a similar concept
Correct: D)
- If you chose A: This is incorrect. Students often confuse Expert Capacity And Token Dropping with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Expert Capacity And Token Dropping with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Expert Capacity And Token Dropping with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: Students often confuse Expert Capacity And Token Dropping with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
Q6: When should you apply Distributed Moe: All-To-All Communication?
A) Avoid Distributed Moe: All-To-All Communication unless explicitly instructed B) Use Distributed Moe: All-To-All Communication only in pure mathematics contexts C) Apply Distributed Moe: All-To-All Communication to solve problems in this subject's domain D) Distributed Moe: All-To-All Communication is not practically useful
Correct: C)
- If you chose A: This is incorrect. Distributed Moe: All-To-All Communication is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Distributed Moe: All-To-All Communication is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: Distributed Moe: All-To-All Communication is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose D: This is incorrect. Distributed Moe: All-To-All Communication is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
Problem 1
A MoE layer has E=8 experts, d=512, d_ff=2048, k=2. Compute (a) parameters per expert FFN, (b) total MoE FFN parameters, (c) FLOPs per token for the MoE FFN, and (d) the parameter-to-FLOP ratio compared to a dense FFN.
Answer
(a) Per expert: W_in (2048×512) + W_out (512×2048) = 1,048,576 + 1,048,576 = 2,097,152 params (ignoring biases). Or 8d² = 8·512² = 2,097,152. (b) Total MoE FFN: 8 × 2,097,152 = 16,777,216 params. Router: 8×512 = 4,096 params. (c) Per-token FLOPs: 2 selected experts × (2 matmuls per expert, each 2·d·d_ff ops) = 2 × 2 × 2 × 512 × 2048 = 8,388,608 ≈ 8.4M FLOPs. (d) Dense FFN: 8d² = 2,097,152 params, 16d² = 4,194,304 FLOPs per token. - Parameter ratio: 16,777,216 / 2,097,152 = 8× - FLOPs ratio: 8,388,608 / 4,194,304 ≈ 2× The MoE model uses 8× more parameters but only 2× more FLOPs per token — exactly what sparse activation provides.Problem 2
For E=16 experts, B=256 tokens, k=2, capacity_factor=1.25. Compute (a) expected tokens per expert under uniform routing, (b) expert capacity C, and (c) how many total tokens can be dropped at most (if all capacity overflow is dropped).
Answer
(a) Expected tokens per expert = B·k/E = 256·2/16 = 32 tokens per expert. (b) Capacity C = capacity_factor · (B·k/E) = 1.25 · 32 = 40 tokens per expert. (c) Total token slots: E·C = 16·40 = 640. Total tokens routed: B·k = 512. So 640−512 = 128 slack tokens can be absorbed (tokens can be up to 128 over-allocated to heavy experts). Maximum droppable = B·k = 512 (worst case: all tokens route to one expert, capacity 40, so 512−40 = 472 dropped unless capacity factor compensates). Actually the question asks: if ALL capacity overflows are dropped, maximum total dropped = B·k − E·C = 512 − 640 = 0 (with capacity_factor=1.25 and uniform routing, no dropping is expected). Under worst-case imbalance where a single expert gets all 512 tokens, 512−40 = 472 tokens would be dropped across the other 15 experts. The other 15 experts get 0 tokens each, so total dropped = 472.Problem 3
Prove that an MoE layer with k=1 and E experts, trained with NO load balancing loss, will collapse to a single expert under gradient descent, assuming the experts have identical initialization but slightly different cumulative gradients.
Answer
Assume all E experts start with identical parameters θⱼ = θ₀. After one step, small floating-point differences or batch ordering cause slight differences. Suppose expert 1 receives slightly more gradient (f₁ > f₂ = ... = f_E). The update: θ₁(t+1) = θ₁(t) − η·f₁·∇L₁. Expert 1 improves faster than others since f₁ > f_j for j≠1. This makes its output more useful, causing the router to select it more often (f₁ increases). This creates a positive feedback loop: f₁(t+1) > f₁(t) The fixed point f₁ = 1, f_{j≠1} = 0 is an attractor of the dynamics. To see why: when expert 1 dominates, all gradients flow to it, making it even better. Other experts, starved of gradient, drift randomly (or remain frozen if no other signal reaches them). The router logits for expert 1 grow unboundedly while others decay, making the softmax and top-k selection deterministic for expert 1. This is the gradient starvation collapse — it requires an auxiliary loss to break the symmetry.Problem 4
Write the full MoE forward and backward pass pseudocode for a single layer with k=2, including the gate Jacobian computation.
Answer
**Forward pass:**z = W_g @ x # (E,) router logits
top2_idx = argtop2(z) # indices of 2 largest
h = fill(-inf, E); h[top2_idx] = z[top2_idx]
g = softmax(h[top2_idx]) # (2,) normalized gate scores
y = zeros(d)
for (j, idx) in enumerate(top2_idx):
y += g[j] * FFN[idx](x) # weighted expert output
**Backward pass (given ∂L/∂y):**
∂L/∂z = zeros(E)
# Gate Jacobian component: ∂L/∂y · FFN_j(x)
for (j, idx) in enumerate(top2_idx):
f_j = FFN[idx](x)
# Sparse softmax Jacobian: ∂g_j/∂z_i = g_j(δ_{ji} − g_i)
for (i, idx2) in enumerate(top2_idx):
∂L/∂z[idx2] += dot(∂L/∂y, f_j) * g[j] * ((1.0 if j==i else 0.0) − g[i])
# Backprop through the expert FFN
∂L/∂x += g[j] * FFN_backward[idx](∂L/∂y)
# Backprop through router weights
∂L/∂W_g = outer(∂L/∂z, x) # only non-zero for selected indices
The key insight: ∂L/∂z is non-zero ONLY for the selected expert indices. Non-selected experts get zero router gradient.
Problem 5
A training run uses α_aux = 0.1, E = 8, and the router stabilizes at f = [0.2, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1]. Compute the auxiliary loss and compare it to the uniform ideal. How much language modeling loss do you sacrifice? (Hint: compare to α_aux effect on total loss, assuming L_LM ≈ 2.0 nats).
Answer
Importance-based auxiliary loss:$L_aux = α_aux · E · Σ f_j · P_j $We don't have P_j directly (gate probabilities), but if P_j ≈ f_j (router tends to give high scores where it routes):
$Σ f_j · f_j = 0.2² + 0.15² + 0.15² + 6·0.1² = 0.04 + 0.0225 + 0.0225 + 0.06 = 0.145 L_aux ≈ 0.1 · 8 · 0.145 = 0.116 $Uniform (f_j = 0.125):
$Σ f_j² = 8 · 0.015625 = 0.125 L_aux_uniform = 0.1 · 8 · 0.125 = 0.1 $The actual distribution is slightly MORE concentrated (0.116 > 0.1 — wait, 0.145 > 0.125, but that's because P≈f here and f₁=0.2 is above uniform 0.125): Correct: Σ f_j² = 0.145 vs uniform 0.125. So L_aux = 0.1·8·0.145 = 0.116. At uniform: L_aux = 0.1·8·0.125 = 0.1. The cost over uniform: 0.016, which is only 0.8% of L_LM (2.0). This is a modest price — the load balancing loss is designed to be small enough not to dominate training but large enough to provide signal.
Summary
- MoE replaces dense FFN with E experts and a learned router g = softmax(top-k(W_g·x)); sparsity (k ≪ E) gives E× parameters at k× FLOPs per token
- The top-k operation creates a gradient bottleneck: non-selected experts get zero gradient, causing potential "rich get richer" router collapse — auxiliary losses are NOT optional
- Load balancing losses (importance-based, CV-based, z-loss) penalize uneven expert utilization; the typical auxiliary loss magnitude is ~1% of the language modeling loss
- Expert capacity C = capacity_factor · B·k/E caps per-expert tokens per batch; capacity_factor trades off memory vs dropping rate
- Distributed MoE requires all-to-all communication to route tokens to expert-hosting devices; this scales as B·d·k and can become the training bottleneck at large scale
Next Steps
Continue to 19-02 — Attention Variants to explore Multi-Query Attention, Grouped-Query Attention, FlashAttention optimizations, and ring attention for distributed long-context processing.