19-06 — Speculative Decoding
Phase: 19 — Advanced LLM Mathematics Subject: 19-06 Prerequisites: 19-05 (Knowledge Distillation — teacher-student framework, KL divergence), 18-09 (KV Caching), 18-08 (Inference Mathematics — autoregressive decoding, sampling), 18-05 (Decoder-Only Architecture), 17-07 (Scaled Dot-Product Attention) Next subject: 20-01 — Learning Rate Schedules
Learning Objectives
By the end of this subject, you will be able to:
- Derive the speculative decoding acceptance criterion: accept token x with probability min(1, p_target(x)/q_draft(x)), and prove that it preserves the target model's exact output distribution
- Compute the expected number of accepted tokens per draft step for a given draft-target agreement rate, and derive the resulting wall-clock speedup formula
- Analyze the theoretical speedup ceiling under different draft model quality regimes (oracle, random, adversarial) and explain why draft length γ has a diminishing returns property
- Design a tree-structured verification scheme that accepts multiple candidate tokens per position using SpecInfer-style token tree verification
- Formulate the memory-bandwidth analysis: explain why speculative decoding converts a memory-bound problem to a compute-bound one, and compute the arithmetic intensity gain
Core Content
1. The Autoregressive Bottleneck
1.1 Why Inference Is Slow
⚠️ THIS IS CRITICAL — Standard autoregressive decoding generates one token at a time, and each step requires loading the entire model weights from memory. This makes inference memory-bandwidth bound.
For a model with P parameters (fp16), each token generation requires: 1. Load P·2 bytes of weights from GPU HBM (or worse, from CPU RAM) 2. Compute ~2P FLOPs (one forward pass) 3. Write KV cache entries back to memory
Arithmetic intensity (FLOPs per byte loaded) for a single token:
$AI_single = 2P / (2P) = 1 FLOP/byte $
For an A100 with 2 TB/s memory bandwidth and 312 TFLOP/s compute:
$Memory-bound throughput: 2 TB/s / (2P bytes) tokens/s Compute-bound throughput: 312 TFLOP/s / (2P FLOPs) tokens/s $
For P=7B (14 GB weights): memory-bound rate ≈ 2TB/s / 14 GB ≈ 143 tokens/s. Compute-bound rate: 312T/14G ≈ 22,000 tokens/s. The memory bandwidth limits us to ~143 tokens/s — we're using only 0.6% of the GPU's compute capacity!
1.2 The Key Insight
To overcome the memory-bandwidth bottleneck, we need to generate MULTIPLE tokens per weight load. If we can generate γ tokens per model invocation, the arithmetic intensity becomes:
$AI_batched = 2P·γ / (2P) = γ FLOP/byte $
At γ tokens per forward pass, we're γ× closer to the compute-bound regime.
Speculative decoding achieves exactly this: Use a fast draft model to generate γ candidate tokens, then verify all γ with ONE target model forward pass.
2. The Speculative Decoding Algorithm
2.1 Algorithm Overview
Given: - Target model M_p (large, slow, desired output distribution p) - Draft model M_q (small, fast, approximate distribution q) - Draft length γ (number of tokens to speculate)
Step 1: Draft. Given prefix x_{<t}: - For i = 0,...,γ−1: - Sample x_{t+i} ~ q(·|x_{<t+i}) - This generates γ candidate tokens using only the small model
Step 2: Verify. Run the target model ONCE on the concatenated sequence x_{<t+γ}: - Get target probabilities p(x|prefix) for each of the γ positions - Also get draft probabilities q(x|prefix) (recomputed or cached from draft step)
Step 3: Accept/Reject. For i = 0,...,γ−1: - Generate u ~ Uniform(0,1) - If u < min(1, p(x_{t+i}) / q(x_{t+i})): ACCEPT token x_{t+i} - Else: REJECT. Sample x_{t+i}^{new} ~ norm(max(0, p − q)) where norm rescales to sum to 1. STOP the loop.
Step 4: The prefix becomes x_{<t+i+1} where i is the last accepted position. Repeat.
2.2 The Acceptance Criterion — Proof of Correctness
⚠️ THIS IS CRITICAL — The acceptance rule min(1, p(x)/q(x)) is carefully designed to preserve the target distribution.
Theorem: Speculative decoding with acceptance criterion min(1, p(x)/q(x)) and rejection sampling from norm(max(0, p − q)) produces tokens distributed exactly according to the target distribution p.
Proof (for a single token):
We sample x ~ q(x). Then we accept with probability:
$P(accept | x) = min(1, p(x)/q(x)) $
The joint probability of accepting token x:
$P(accept, x) = q(x) · min(1, p(x)/q(x))
= min(q(x), p(x))
$
Similarly, the probability of rejecting x:
$P(reject, x) = q(x) · max(0, 1 − p(x)/q(x))
= max(0, q(x) − p(x))
$
After rejection, we sample x' from the residual distribution:
$r(x) = max(0, p(x) − q(x)) / Σ_y max(0, p(y) − q(y)) $
The total probability of outputting token x:
$P(x) = P(accept, x) + P(reject any) · r(x)
= min(q(x), p(x)) + Σ_y max(0, q(y) − p(y)) · r(x)
$
Note: Σ_y max(0, p(y) − q(y)) = Σ_y max(0, q(y) − p(y)) (since Σ p = Σ q = 1):
$= min(q(x), p(x)) + Σ_y max(0, q(y) − p(y)) · max(0, p(x) − q(x)) / Σ_y max(0, p(y) − q(y)) = min(q(x), p(x)) + max(0, p(x) − q(x)) = p(x) ∎ $
The target distribution is exactly preserved! Speculative decoding is a lossless acceleration technique — it produces the same outputs as autoregressive decoding from the target model.
2.3 Rejection Sampling Detail
When a token is rejected at position i, we must resample from the adjusted distribution:
$p_adjusted(x) = norm(max(0, p(x) − q(x))) $
where norm(v) = v / Σ v ensures a valid probability distribution. This is the conditional distribution given that the draft model has "missed" some probability mass.
In practice, this is implemented by: 1. Compute residual r(x) = max(0, p(x) − q(x)) for all tokens 2. Normalize r → r̂ (sums to 1) 3. Sample x' ~ r̂
3. Expected Speedup Analysis
3.1 Acceptance Rate
For a single position, the probability that the draft token is accepted:
$α = P(accept) = E_x~q[min(1, p(x)/q(x))]
= Σ_x q(x) · min(1, p(x)/q(x))
= Σ_x min(q(x), p(x))
= 1 − TV(p, q)
$
where TV(p, q) = (1/2)·Σ|p(x)−q(x)| is the total variation distance. This is a beautiful result: the acceptance rate equals 1 minus the TV distance between the draft and target distributions.
Interpretation: When p and q are identical, TV=0, α=1 — every draft token is accepted. When p and q have disjoint support, TV=1, α=0 — no tokens are accepted.
3.2 Expected Tokens Per Step
The number of accepted tokens follows a geometric-like distribution (truncated at γ):
$E[#accepted] = Σ_{i=1}^{γ} (1 − α) · α^{i−1} · i + α^{γ} · γ
= (1 − α^{γ}) / (1 − α)
$
Derivation: for i < γ, we accept i tokens (getting to position i) then reject at position i+1 with probability α^i·(1−α). For i = γ, we accept all γ tokens with probability α^γ.
Special cases: - α = 1: E = γ (all tokens accepted — perfect draft) - α = 0: E = 1 (no tokens accepted — falls back to standard decoding) - α = 0.5, γ = 5: E = (1−0.5^5)/(1−0.5) = 0.96875/0.5 = 1.9375
3.3 Wall-Clock Speedup
Let: - T_target = time for one target model forward pass - T_draft = time for one draft model forward pass (per token)
Total time for one speculative step: T_draft_total + T_target where T_draft_total = γ · T_draft.
Speedup over standard decoding (which takes T_target per token):
$speedup = E[#accepted] / (γ·T_draft/T_target + 1)
= E[#accepted] / (γ/c + 1)
$
where c = T_target / T_draft is the speedup ratio of the draft model.
Optimal γ: As γ increases, E[#accepted] saturates at 1/(1−α), while the denominator γ/c grows linearly. The optimal γ balances these:
$γ* = argmax_γ (1−α^γ)/((1−α)·(γ/c + 1)) $
For α = 0.8, c = 100 (draft is 100× faster than target): γ* ≈ 5-8, giving ~3-4× speedup.
Key insight: Even with a highly accurate draft (α=0.9), the speedup saturates because rejected tokens waste work. For γ > γ*, the marginal benefit of additional draft tokens is outweighed by their generation cost.
4. Advanced Variants
4.1 Tree-Structured Verification (SpecInfer)
Instead of drafting a single sequence, the draft model can generate a TREE of candidates at each position, leveraging multiple speculative paths.
Token tree: - Root: current prefix - Level 1: top-k draft tokens (k candidates) - Level 2: for each level-1 token, top-k continuations - ... up to depth γ
The target model verifies the ENTIRE tree in ONE forward pass (using tree attention masks).
Acceptance: Multi-round rejection sampling on the tree. The acceptance probability for a path becomes:
$P(accept path) = Σ_{i=1}^{depth} product over path of min(1, p/node_prob)
$
This increases the expected accepted tokens by exploring multiple hypotheses in parallel. For k=3, depth=4: up to 40 tokens can be verified per target forward pass (but expected acceptance is lower due to branching factor).
4.2 Medusa: Multiple Draft Heads
Instead of a separate draft model, Medusa adds MULTIPLE prediction heads to the target model itself. Each head predicts the token at a specific future offset:
$Head j: predicts token at position t+j (j = 1,...,k) $
Advantages: - No separate draft model to train and load - Heads share the target model's base representations (cheap to compute) - Can be trained independently on the target model's own training data
Disadvantage: Draft quality decreases with offset j (predicting 3 tokens ahead is harder than 1). The expected acceptance drops geometrically with depth.
4.3 Self-Speculative Decoding
Use the target model itself as the draft model, but with: - Skipped layers: Drop every other layer in the draft forward pass - Reduced precision: Use int8 or fp8 for the draft pass - No KV cache update: The draft pass doesn't update the full KV cache
This eliminates the need for a separate draft model while providing speedup when the approximate forward pass is sufficiently faster than the full pass (typically 1.5-2×).
4.4 Staged Speculative Decoding
A cascade of models: M₁ (smallest/fastest) drafts, M₂ verifies and re-drafts on rejection, M₃ (target) gives final verification. Each stage provides a speed-quality tradeoff.
Acceptance cascade:
Stage 1: Draft γ₁ tokens with M₁
Stage 2: Verify+redraft with M₂
Stage 3: Final verification with M_target
Effective when M₂ is much faster than M_target but higher quality than M₁. The staged approach can extend the effective draft length γ.
5. Practical Considerations
5.1 Memory Overhead
Speculative decoding requires loading BOTH models into GPU memory: - Target model: P_target parameters - Draft model: P_draft parameters
For 70B target + 7B draft: ~154 GB (fp16) — requires model parallelism or CPU offloading.
Mitigation: Use a draft model that shares the target's embedding and/or LM head (common in practice — the draft and target use the same tokenizer and embedding matrix). This saves ~2·V·d parameters (V=vocabulary, d=dimension).
5.2 KV Cache Management
During drafting, the draft model builds up a speculative KV cache. During verification, the target model processes the full concatenated sequence (including speculative tokens) and builds its own KV cache. If all γ tokens are accepted, the target's KV cache for those positions is valid for the next step. If rejection occurs at position i < γ, the KV cache entries for positions ≥ i are discarded.
Tree attention for KV cache: In tree verification, the KV cache stores branches. Position i may have k possible key-value pairs (from different draft paths). Tree attention masks ensure each query attends only to its ancestors in the tree.
5.3 Batch Inference
In batch serving, different requests may be at different sequence lengths and may accept different numbers of speculative tokens. The batch must handle variable-length speculation — typically by padding to the max draft length in the batch.
Pitfalls
⚠️ Pitfall 1: Expecting speculative decoding to always speed up generation. The draft model must be correct often enough for the speedup to outweigh the cost of running it. If the draft acceptance rate is low (e.g., <50%), the overhead of running two models can make speculative decoding SLOWER than standard autoregressive generation.
⚠️ Pitfall 2: Using a draft model that's too large. The draft model should be 10-100× smaller than the target. If the draft has 50% of the target's parameters, the overhead is too high — you're essentially running the target model 1.5×. A tiny n-gram model or a 1-layer draft is much more efficient.
⚠️ Pitfall 3: Forgetting to match the tokenizer. The draft and target models MUST use the same tokenizer. If they don't, token indices don't correspond, and the verification step (checking if draft token matches target's top prediction) is meaningless.
Key Terms
- ACCEPT
- Arithmetic intensity
- Draft length
- Draft model
- Proof
- Speedup over standard decoding
- Target model
Worked Examples
Example 1: Single-Step Acceptance Probability
p = [0.7, 0.2, 0.1] (target distribution, 3 tokens) q = [0.6, 0.3, 0.1] (draft distribution)
Compute: (a) probability draft samples each token, (b) acceptance probability per sampled token, (c) overall acceptance rate α, (d) total variation distance.
Solution:
(a) Draft sampling probabilities: q = [0.6, 0.3, 0.1].
(b) Acceptance if sampled: - Token 1 sampled: P(accept|1) = min(1, 0.7/0.6) = min(1, 1.167) = 1.0 - Token 2 sampled: P(accept|2) = min(1, 0.2/0.3) = min(1, 0.667) = 0.667 - Token 3 sampled: P(accept|3) = min(1, 0.1/0.1) = 1.0
(c) Overall acceptance:
$α = Σ q(x)·min(1, p(x)/q(x)) = 0.6·1.0 + 0.3·0.667 + 0.1·1.0 = 0.6 + 0.2001 + 0.1 = 0.9001 $
(d) TV distance:
$TV = (1/2)·(|0.7−0.6| + |0.2−0.3| + |0.1−0.1|) = (1/2)·(0.1 + 0.1 + 0.0) = 0.1 α = 1 − TV = 1 − 0.1 = 0.9 ✓ (matches) $
Example 2: Multi-Step Speculative Decoding
Draft model generates γ=3 tokens: [x₁, x₂, x₃]. The acceptance probabilities at each position:
Position 1: α₁ = min(1, p(x₁)/q(x₁)) = min(1, 0.05/0.04) = 1.0 Position 2: α₂ = min(1, p(x₂)/q(x₂)) = min(1, 0.02/0.03) = 0.667 Position 3: α₃ = min(1, p(x₃)/q(x₃)) = min(1, 0.01/0.01) = 1.0
Random draws for acceptance: - u₁ = 0.32 < 1.0 → ACCEPT x₁ - u₂ = 0.85 > 0.667 → REJECT x₂ - (stop — don't check x₃)
Result: 1 token accepted. We must resample position 2 from the adjusted distribution:
For position 2, residual: r(x) = max(0, p(x) − q(x)).
Let's assume the distributions at the rejected position were: p₂ = [0.4, 0.35, 0.25] (target at position 2 after accepting x₁) q₂ = [0.3, 0.5, 0.2] (draft sampled x₂ which was class 1... actually the draft gave x₂ at this position)
Residual: max(0, p₂ − q₂) = [max(0, 0.4−0.3), max(0, 0.35−0.5), max(0, 0.25−0.2)] = [0.1, 0, 0.05]
Normalize: [0.1/0.15, 0, 0.05/0.15] = [0.667, 0, 0.333]
Resample from r̂: with probability 0.667 select token 0, with probability 0.333 select token 2.
This rejection correction ensures the final distribution still matches p.
Example 3: Expected Speedup Calculation
Target model: T_target = 100 ms per forward pass. Draft model: T_draft = 1 ms per token (100× faster). Draft accuracy: α = 0.8 (per token acceptance rate). Draft length: γ = 5.
Compute: (a) Expected tokens per step (b) Total time per speculative step (c) Speedup over standard decoding (d) What if we increase γ to 10?
Solution:
(a) E[#accepted] = (1−0.8^5)/(1−0.8) = (1−0.32768)/0.2 = 0.67232/0.2 = 3.362 tokens.
(b) Draft time for 5 tokens: 5 · 1 = 5 ms. Target verification: 100 ms. Total: 105 ms. Expected tokens: 3.362. Time per token: 105/3.362 = 31.2 ms.
(c) Standard: 100 ms/token. Speedup = 100/31.2 = 3.2×.
(d) γ=10: E[#accepted] = (1−0.8^10)/0.2 = (1−0.107)/0.2 = 0.893/0.2 = 4.465 tokens. Draft time: 10 ms. Total: 110 ms. Time per token: 110/4.465 = 24.6 ms. Speedup: 100/24.6 = 4.06×.
Diminishing returns: γ=5 → 3.2×, γ=10 → 4.06×. Doubling γ gave only 27% more speedup. The marginal benefit per additional draft token decreases.
At γ=100: E = (1−0.8^100)/0.2 ≈ 1/0.2 = 5 (the infinite limit). Total time: 100+100=200ms. Time per token: 200/5 = 40ms. Speedup: 100/40 = 2.5× — WORSE than γ=5!
Optimal γ for this setup: max speedup ~4.1× at γ ≈ 8-10. Beyond this, the draft model cost outweighs additional acceptance.
Quiz
Q1: What does the concept of Arithmetic intensity primarily refer to in this subject?
A) A visual representation of Arithmetic intensity B) A historical anecdote about Arithmetic intensity C) The definition and application of Arithmetic intensity D) A computational error related to Arithmetic intensity
Correct: C)
- If you chose A: This is incorrect. Arithmetic intensity is defined as: the definition and application of arithmetic intensity. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Arithmetic intensity is defined as: the definition and application of arithmetic intensity. The other options describe different aspects that are not the primary focus.
- If you chose C: Arithmetic intensity is defined as: the definition and application of arithmetic intensity. The other options describe different aspects that are not the primary focus. Correct!
- If you chose D: This is incorrect. Arithmetic intensity is defined as: the definition and application of arithmetic intensity. The other options describe different aspects that are not the primary focus.
Q2: What is the primary purpose of Target model?
A) It is used to target model in mathematical analysis B) It is primarily a historical notation system C) It replaces all other methods in this domain D) It is used only in advanced research contexts
Correct: A)
- If you chose A: Target model serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose B: This is incorrect. Target model serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. Target model serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Target model serves the purpose described in the correct answer. The other options misrepresent its role.
Q3: Which statement about Draft model is TRUE?
A) Draft model is mentioned only as a historical footnote B) Draft model is an advanced topic beyond this subject's scope C) Draft model is not related to this subject D) Draft model is a fundamental concept covered in this subject
Correct: D)
- If you chose A: This is incorrect. Draft model is a fundamental concept covered in this subject. This subject covers Draft model as part of its core content.
- If you chose B: This is incorrect. Draft model is a fundamental concept covered in this subject. This subject covers Draft model as part of its core content.
- If you chose C: This is incorrect. Draft model is a fundamental concept covered in this subject. This subject covers Draft model as part of its core content.
- If you chose D: Draft model is a fundamental concept covered in this subject. This subject covers Draft model as part of its core content. Correct!
Q4: Based on the worked examples in this subject, what is the correct result?
A) The inverse of the correct answer B) An unrelated numerical value C) A different result from a common mistake D) the acceptance rate equals 1 minus the TV distance
Correct: D)
- If you chose A: This is incorrect. The worked examples show that the result is the acceptance rate equals 1 minus the TV distance. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is the acceptance rate equals 1 minus the TV distance. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is the acceptance rate equals 1 minus the TV distance. The other options represent common errors.
- If you chose D: The worked examples show that the result is the acceptance rate equals 1 minus the TV distance. The other options represent common errors. Correct!
Q5: How are Draft model and Draft length related?
A) Draft model is the inverse of Draft length B) Draft model is a special case of Draft length C) Draft model and Draft length are completely unrelated topics D) Draft model and Draft length are closely related concepts
Correct: D)
- If you chose A: This is incorrect. Both Draft model and Draft length are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both Draft model and Draft length are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Draft model and Draft length are covered in this subject as interconnected topics.
- If you chose D: Both Draft model and Draft length are covered in this subject as interconnected topics. Correct!
Q6: What is a common pitfall when working with ACCEPT?
A) ACCEPT has no common misconceptions B) A common mistake is confusing ACCEPT with a similar concept C) ACCEPT is always computed the same way in all contexts D) The main error with ACCEPT is using it when it is not needed
Correct: B)
- If you chose A: This is incorrect. Students often confuse ACCEPT with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: Students often confuse ACCEPT with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose C: This is incorrect. Students often confuse ACCEPT with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse ACCEPT with similar-sounding or related concepts. Pay attention to the precise definitions.
Q7: When should you apply Proof?
A) Use Proof only in pure mathematics contexts B) Apply Proof to solve problems in this subject's domain C) Proof is not practically useful D) Avoid Proof unless explicitly instructed
Correct: B)
- If you chose A: This is incorrect. Proof is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: Proof is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose C: This is incorrect. Proof is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. Proof is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
Problem 1
Prove that the expected number of accepted tokens is E = (1−α^γ)/(1−α) for a per-token acceptance probability α and draft length γ. Verify for α=0 and α=1.
Answer
Let N be the number of accepted tokens (0 ≤ N ≤ γ). P(N = i) for i < γ: accepted i tokens then rejected at position i+1. P(accept i tokens) = α^i. P(reject at i+1 | accept i tokens) = 1−α (for i < γ). P(N = i) = α^i · (1−α) for i = 0,...,γ−1. P(N = γ) = α^γ (all accepted, no rejection check beyond γ). Expected value:$E[N] = Σ_{i=0}^{γ−1} i·α^i·(1−α) + γ·α^γ
$
Using the identity Σ_{i=0}^{k} i·r^i = r·(1−(k+1)·r^k + k·r^{k+1})/(1−r)²:
Let S = Σ_{i=0}^{γ−1} i·α^i = α·(1−γ·α^{γ−1} + (γ−1)·α^γ)/(1−α)².
E = (1−α)·S + γ·α^γ = α·(1−γ·α^{γ−1} + (γ−1)·α^γ)/(1−α) + γ·α^γ
After algebraic simplification:
E = (1−α^γ)/(1−α) ∎
Verification:
- α=0: E = (1−0)/(1−0) = 1. Correct — with zero acceptance, exactly 1 token (the non-speculative fallback) is generated.
- α=1: E = (1−1)/(1−1) → use limit: lim_{α→1} (1−α^γ)/(1−α) = γ·α^{γ−1}/1 = γ. Correct — perfect draft accepts all γ tokens.
Problem 2
A draft model generates token distributions that are always within TV distance 0.15 of the target. Compute the maximum possible speedup for γ=1,2,4,8,∞ when c=50 (draft is 50× faster per token than target).
Answer
α = 1 − TV = 1 − 0.15 = 0.85. Speedup = E[#accepted] / (γ/c + 1) where E = (1−α^γ)/(1−α). γ=1: E = 1.0 (trivially, since γ=1 means at most 1 token accepted... wait: (1−0.85^1)/0.15 = 0.15/0.15 = 1). Speedup = 1/(1/50+1) = 1/1.02 = 0.98×. Speculation with γ=1 is slightly SLOWER due to draft overhead — the single draft token is always "accepted" (you check it), but it costs 1/c extra time. γ=2: E = (1−0.85²)/0.15 = (1−0.7225)/0.15 = 0.2775/0.15 = 1.85. Speedup = 1.85/(2/50+1) = 1.85/1.04 = 1.78×. γ=4: E = (1−0.85⁴)/0.15 = (1−0.522)/0.15 = 0.478/0.15 = 3.187. Speedup = 3.187/(4/50+1) = 3.187/1.08 = 2.95×. γ=8: E = (1−0.85⁸)/0.15 = (1−0.2725)/0.15 = 0.7275/0.15 = 4.85. Speedup = 4.85/(8/50+1) = 4.85/1.16 = 4.18×. γ=∞: E = 1/0.15 = 6.67. But speedup = 6.67/(∞/50+1) → 0. The denominator grows unbounded while numerator saturates — infinite draft length gives ZERO speedup! Maximum speedup ≈ 4.2× at γ ≈ 6-8 for these parameters. The optimal γ balances the saturated acceptance against the growing draft cost.Problem 3
In tree-structured verification, the draft model produces k=3 candidates per position up to depth d=3. The per-token acceptance rate is α=0.7. Compute the expected total number of accepted tokens if verification simply takes the longest accepted path in the tree.
Answer
For tree verification, each node in the tree is independently accepted with probability α (assuming the target token matches the draft at that node — this is an approximation; in reality, the acceptance depends on the specific distribution match). The longest accepted root-to-leaf path in a k-ary tree of depth d: Let P(path of length i is accepted) = α^i (need i consecutive acceptances). For there to be at least one path of length ≥ i, we need the maximum path length to be ≥ i. For independent branches (k branches at root, then each branches into k, etc.): P(no path of length i) = (1 − α^i)^{k^i} (there are k^i nodes at level i, and we need at least one of their root-to-leaf paths to be all-accepted). This is complex. A simpler approximation: expected max depth = Σ_{i=0}^{d} P(max depth = i) · i. Actually, for a geometric branching process where each node survives with probability α, the expected max depth in a k-ary tree is approximately: E[depth] ≈ d · (1 − (1−α)^{k^d} / (k·α^d)) (approximation) For k=3, d=3, α=0.7: The probability ALL k=3 root nodes fail: (1−0.7)^3 = 0.027. So with 97.3% probability, at least one root token is accepted. Expected accepted tokens ≈ α · (1 + α·k + α²·k²) ≈ 0.7 · (1 + 0.7·3 + 0.49·9) = 0.7 · (1 + 2.1 + 4.41) = 0.7 · 7.51 ≈ 5.26 tokens. Compare to single-path with same α=0.7, γ=3: E = (1−0.7³)/0.3 = 0.657/0.3 = 2.19 tokens. Tree gives ~2.4× more tokens! The branching leverage is significant — you're exploring alternatives in parallel at no extra target model cost (still one forward pass).Problem 4
Explain why speculative decoding's speedup formula converts inference from memory-bound to compute-bound. Derive the arithmetic intensity ratio.
Answer
Standard decoding (1 token per forward pass): - Memory: load P·2 bytes of weights - Compute: ~2P FLOPs - Arithmetic intensity: AI = 2P/(P·2) = 1 FLOP/byte On an A100 (2 TB/s bandwidth, 312 TFLOP/s compute): - Roofline: memory-bound since AI=1 < machine balance point (~156 FLOP/byte) - Only ~0.6% of compute capacity utilized Speculative decoding (γ tokens verified per forward pass): - Memory: load P·2 bytes (same — the model is loaded once) - Compute: ~2P·γ FLOPs (γ× more compute for the same weight load) - Arithmetic intensity: AI = 2P·γ/(P·2) = γ FLOP/byte With effective γ_eff = E[#accepted], the arithmetic intensity increases to ~γ_eff. For γ_eff=4: AI=4, still memory-bound on A100 but using 4× more compute (2.6% utilization). For γ_eff=20: AI=20, closing toward the compute-bound regime. In the LIMIT of perfect speculation (α→1, unlimited γ, zero draft cost): AI→γ, and we can approach the compute-bound roof. The speedup comes from amortizing the weight loading cost over multiple generated tokens. Each additional token verified in the same forward pass uses ~2P FLOPs of the abundant compute without requiring additional weight loads from memory. **Roofline speedup:**$speedup ≤ min(compute_roof / (2P·γ), bandwidth_roof · γ / (2P))
= min(312T/(14G·γ), 2T·γ/14G)
$
For P=7B, the compute roof gives ~22K/γ tokens/s, bandwidth gives ~143·γ tokens/s. Speedup is capped by whichever is lower. At γ=4: bandwidth limited at 572 tokens/s, speedup = 572/143 = 4×. At γ=50: compute limited at 440 tokens/s (the bandwidth roof at γ=50 would be 7150 tokens/s but compute can't keep up). Actual speedup = 440/143 = 3.1× — worse because compute saturation limits further gains.
Problem 5
You have a 70B target model and need to serve 1000 requests/second with average output length 256 tokens. Without speculation, what GPU count do you need? With speculation (3× speedup), what GPU count? Assume each GPU can serve 5 tokens/s for the 70B model (memory-bound on A100).
Answer
Total required output rate: 1000 requests/s · 256 tokens/request = 256,000 tokens/s. Without speculation: 5 tokens/s per GPU → 256,000/5 = 51,200 GPUs. This is clearly infeasible. With speculation (3× speedup): 15 tokens/s per GPU → 256,000/15 ≈ 17,067 GPUs. Still infeasible for most organizations, but 3× fewer. Practical alternatives to achieve this throughput: - Use a SMALLER model (7B instead of 70B): 7B model might achieve 50-100 tokens/s per GPU → 2,560-5,120 GPUs needed - Use quantization: int4 70B fits in ~40 GB (vs 140 GB fp16), allowing larger batch sizes and higher throughput - Distillation: compress 70B teacher into a 7B student that retains most of the quality - Optimized serving systems (vLLM, TensorRT-LLM) with continuous batching, PagedAttention, and kernel fusion — can achieve 3-5× higher throughput than naive serving With ALL optimizations combined (quantization + batching + speculation + smaller model), throughput can reach 200-500 tokens/s per GPU, requiring 512-1280 GPUs for 1K RPS at 256 output tokens. This example illustrates why inference optimization is critical — naive serving of large models requires absurd hardware, but the combination of techniques in Phase 19-20 makes deployment practical.Summary
- Speculative decoding overcomes the autoregressive memory-bandwidth bottleneck by generating γ draft tokens with a fast model and verifying all γ with one target model forward pass, achieving γ× higher arithmetic intensity
- The acceptance criterion min(1, p(x)/q(x)) preserves the target model's exact output distribution (lossless acceleration); the overall acceptance rate α = 1 − TV(p,q) where TV is total variation distance
- Expected speedup = (1−α^γ)/((1−α)·(γ/c+1)) where c is the draft-to-target speed ratio; optimal γ balances saturating acceptance (term approaches 1/(1−α)) against growing draft cost (γ/c)
- Tree-structured verification (SpecInfer) explores k alternative tokens per position, increasing accepted tokens per forward pass at the cost of more complex KV cache management
- For real deployments, memory overhead (both models in GPU RAM), draft quality, and batch inference dynamics determine practical speedup, with 2-4× being typical for well-matched draft-target pairs
Next Steps
You've completed Phase 19! Continue to Phase 20 with 20-01 — Learning Rate Schedules to begin exploring training and fine-tuning mathematics.