Math graphic
📐 Concept diagram

19-04 — Pruning

Phase: 19 — Advanced LLM Mathematics Subject: 19-04 Prerequisites: 19-03 (Quantization Mathematics), 16-08 (Regularization — L1/L2), 14-06 (Convex Functions — Hessian), 09-03 (SVD — rank approximation concepts), 09-10 (Matrix Calculus) Next subject: 19-05 — Knowledge Distillation


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the Taylor expansion argument that justifies magnitude pruning and explain when it fails (correlated weights, scale variation)
  2. Formulate SparseGPT as an extension of GPTQ's OBS framework to pruning — derive the weight update rule for zeroing one weight and compensating the rest
  3. Prove that N:M sparsity in hardware maps to a constrained optimization problem: find the N largest magnitudes in each group of M consecutive weights
  4. State the Lottery Ticket Hypothesis precisely, define the conditions for winning ticket identification, and contrast iterative magnitude pruning with one-shot pruning
  5. Compute the speedup ceiling for structured vs unstructured sparsity on GPU hardware given memory bandwidth, compute throughput, and sparsity pattern constraints

Core Content

1. The Mathematics of Pruning

1.1 What is Pruning?

Pruning removes weights (or entire structures) from a neural network to reduce computation and memory while minimizing accuracy loss.

Types: - Unstructured: Zero individual weights; no pattern constraint. Achieves high sparsity with low accuracy loss but requires sparse matrix hardware for speedup. - Structured: Remove entire rows, columns, channels, or attention heads. Less accuracy per parameter removed, but directly reduces dense matmul dimensions → immediate speedup on standard hardware. - N:M sparsity: A hardware-friendly compromise: at most N non-zero weights in every contiguous group of M. E.g., 2:4 sparsity (NVIDIA Ampere) means exactly 2 non-zeros per group of 4.

1.2 Why Magnitude Pruning?

⚠️ THIS IS CRITICAL — Magnitude pruning is the simplest and most widely used criterion. Understanding its theoretical justification connects Taylor expansions to weight importance.

Consider the change in loss L when we set weight w_i to zero:

ΔL_i = L(w with w_i=0) − L(w)

Taylor expansion around the current weights:

$ΔL_i ≈ (∂L/∂w_i) · (−w_i) + (1/2) · H_{ii} · w_i² + O(||w||³)
$

where H_{ii} = ∂²L/∂w_i² is the diagonal Hessian entry.

At convergence (or near a local minimum): ∂L/∂w_i ≈ 0. The first-order term vanishes, leaving:

$ΔL_i ≈ (1/2) · H_{ii} · w_i²
$

This says: the damage from pruning weight w_i is proportional to w_i² (magnitude squared) times the local curvature H_{ii}.

When H_{ii} is roughly uniform (isotropic local geometry), |w_i| (magnitude) alone determines pruning safety. This is the theoretical justification for magnitude pruning.

When it fails: 1. Correlated weights: The diagonal Hessian assumption ignores off-diagonal terms — removing two correlated weights together is worse than the sum of individual effects 2. Scale variation across layers: Different layers have different Hessians; |w_i| normalizes poorly across layer boundaries 3. Early training: ∂L/∂w_i ≠ 0, so the first-order term dominates; magnitude isn't informative

2. SparseGPT: One-Shot Pruning with Compensation

2.1 From GPTQ to SparseGPT

GPTQ (covered in 19-03) uses OBS to compensate for quantization. SparseGPT adapts the same framework for pruning: instead of rounding to the nearest quantization level, we SET the pruned weight to ZERO. The compensation math is nearly identical.

For a row w of the weight matrix, pruning weight w_q:

$Error introduced: Δ_q = w_q − 0 = w_q
Compensation to remaining weights: δ_w = −H⁻¹_{:,q} / H⁻¹_{qq} · w_q
$

2.2 SparseGPT Algorithm

Preprocessing: Compute H⁻¹ from calibration data (same as GPTQ).

Pruning loop (per row, processing columns sequentially):

For each column q (ordered by some criterion): 1. Compute the error from pruning: Δ = w_q 2. Set w_q = 0 (prune it) 3. Compensate remaining weights: w_{q+1:} += −(w_q / H⁻¹_{qq}) · H⁻¹_{q, q+1:}

Column ordering: SparseGPT can use: - Sequential (natural): Prune left-to-right. Simple, reasonably effective. - By magnitude (descending): Prune largest weights FIRST, giving small weights maximum chance to absorb the error. Counter-intuitive but empirically better. - By H⁻¹ diagonal (ascending): Most \"sensitive\" last (same as GPTQ ordering).

Sparsity level: To achieve sparsity s (fraction of weights zeroed), we prune the first s·d_in columns.

Complexity: O(d_in³ + d_in²·d_out) — the Hessian inversion dominates. SparseGPT can prune a 175B model in ~4 hours on a single GPU.

2.3 Derivation: SparseGPT Update Rule

Given the output error: E = (1/2) · Σ_j (w_j − w_j') · H_{jk} · (w_k − w_k').

For pruning w_q → 0, the optimal adjustment to minimize output error:

$min_{δ} E(δ) = (1/2)·H_{qq}·w_q² + w_q · Σ_{j>q} H_{qj}·δ_j + (1/2)· Σ_{j,k>q} δ_j·H_{jk}·δ_k
$

Set ∂E/∂δ_j = 0:

H_{qj}·w_q + Σ_{k>q} H_{jk}·δ_k = 0  ⟹  δ = −w_q · H_{q+1:,q+1:}^{-1} · H_{q+1:,q}

Using block matrix inversion:

δ_j = −w_q · H⁻¹_{j,q} / H⁻¹_{qq}      for j > q

This is the SparseGPT update: redistribute the pruned weight's contribution to remaining weights, weighted by the inverse Hessian to correctly account for correlations.

3. The Lottery Ticket Hypothesis

3.1 Statement

The Lottery Ticket Hypothesis (Frankle & Carbin, 2019):

A randomly-initialized, dense neural network contains a subnetwork (a \"winning ticket\") that — when trained in isolation — can match the test accuracy of the original network after training for at most the same number of iterations.

More formally: Let f(x; θ₀) be a network with random initialization θ₀ ~ D. After training for T steps, the network achieves accuracy a. There exists a mask m ∈ {0,1}^{|θ|} and t ≤ T such that training f(x; m⊙θ₀) for t steps also achieves accuracy a.

3.2 Iterative Magnitude Pruning (IMP)

The winning ticket is found through IMP:

  1. Randomly initialize network θ₀
  2. Train for T steps to convergence
  3. Prune p% of remaining weights with smallest magnitude
  4. Critical: RESET surviving weights to their initial values from θ₀ (not their trained values)
  5. Repeat steps 2-4 until target sparsity

Why reset? The hypothesis claims the WINNING TICKET is defined by the INITIALIZATION + architecture, not the final trained weights. The mask identifies which random weights, when trained, form a winning combination.

3.3 Conditions and Caveats

Robust finding (stable): For small networks (LeNet, small ConvNets) trained with low learning rates, IMP reliably finds winning tickets at 50-90% sparsity.

Large-scale complications: For large models: - Rewinding: Replace step 4 with \"reset to early-training weights\" (e.g., epoch 1-5) rather than initialization. This is \"late resetting\" or \"learning rate rewinding.\" - Structured pruning combined with IMP: find winning tickets at the channel/head level for hardware efficiency.

For LLMs: The Lottery Ticket Hypothesis hasn't been demonstrated at scale (>1B params) with full IMP (computationally prohibitive). Instead, one-shot post-training pruning methods (SparseGPT, Wanda) are used.

4. Wanda: Pruning by Weights × Activations

4.1 The Wanda Criterion

SparseGPT requires the Hessian (expensive). Wanda (Sun et al., 2023) proposes a simpler saliency metric:

$S_{ij} = |W_{ij}| · ||X_j||₂
$

where ||X_j||₂ is the L2 norm of the j-th input feature's activations (averaged over calibration data).

Why it works: This combines magnitude pruning (|W_{ij}|) with activation awareness (||X_j||₂). A weight is important if BOTH: 1. It has large magnitude (intuitively important) 2. Its input feature has large activation norms (it affects the output significantly)

4.2 Wanda Algorithm

Per output neuron (row of W):

  1. Compute saliency: S_j = |W_j| · ||X_j||₂ for each input j
  2. Prune the weights with smallest S_j to achieve sparsity s
  3. NO compensation step (unlike SparseGPT) — just zero the weights

Why no compensation? The activation-aware saliency already identifies weights whose removal will cause minimal output perturbation. For moderate sparsity (50%), Wanda matches SparseGPT despite no compensation. At higher sparsity (70%+), SparseGPT's compensation provides better accuracy.

Complexity: O(d_out · d_in) — purely element-wise, no Hessian. Extremely fast: prunes a 7B model in seconds.

5. N:M Structured Sparsity

5.1 Hardware Motivation

NVIDIA Ampere+ GPUs accelerate 2:4 sparse matrix multiplications via hardware support. The constraint: every contiguous group of 4 values in the weight matrix can have AT MOST 2 non-zeros.

Speedup: 2:4 sparse matmul is 2× faster than dense (only 2 multiplies per group instead of 4). The hardware uses a compressed format that stores only the non-zero indices.

5.2 Mathematical Formulation

Given weight matrix W ∈ ℝ^{d_out × d_in}, the 2:4 constraint:

$For each row r and each group g = 0,...,d_in/4−1:
    ||W[r, 4g:4(g+1)]||₀ ≤ 2
$

The optimization: min ||W − W_sparse|| subject to the ||·||₀ constraints.

Greedy solution: Within each group of 4, keep the 2 largest magnitudes, zero the rest:

W_sparse[r, 4g + k] = W[r, 4g + k] if |W[r, 4g + k]| is in top-2 of the group, else 0

This is optimal for minimizing the L1 or Frobenius norm error element-wise, but NOT for minimizing output error (which would require considering correlations).

5.3 Extension to N:M

General N:M sparsity: at most N non-zeros per group of M. The Ampere GPU supports 2:4, allowing a clean 2× speedup. Future hardware may support other patterns (4:8, 1:2, etc.).

Speedup for N:M: M/N × (plus overhead for the compressed format decode).



Pitfalls

⚠️ Pitfall 1: Pruning by magnitude without fine-tuning. Removing the smallest-magnitude weights directly (one-shot pruning) destroys model quality. The remaining weights must be UPDATED to compensate — this is what OBS/OBD do with the Hessian. Magnitude pruning alone is a baseline that almost never works for LLMs.

⚠️ Pitfall 2: Pruning individual weights vs. structured pruning. Unstructured pruning (individual weights → sparse matrices) saves parameters but requires sparse matrix hardware for speedup. Structured pruning (whole neurons/heads) gives immediate speedup on any hardware at the cost of more quality loss.

⚠️ Pitfall 3: Assuming pruning removes the same fraction everywhere. Early layers and attention heads are often more sensitive to pruning than later FFN layers. Uniform pruning across all layers is suboptimal — use importance-based allocation.


Key Terms

Worked Examples

Example 1: Magnitude Pruning with Taylor Analysis

A weight vector w = [0.5, 0.02, 1.0, 0.001, 0.3] is at a local minimum (∂L/∂w ≈ 0). The diagonal Hessian at this point is H = diag([1.0, 100.0, 1.0, 50.0, 1.0]). Which weight causes the LEAST damage when pruned to zero, according to the second-order Taylor approximation?

Solution:

$ΔL_i ≈ (1/2) · H_{ii} · w_i²

ΔL₁ = 0.5 · 1.0 · 0.25 = 0.125
ΔL₂ = 0.5 · 100.0 · 0.0004 = 0.02
ΔL₃ = 0.5 · 1.0 · 1.0 = 0.5
ΔL₄ = 0.5 · 50.0 · (1×10⁻⁶) = 2.5×10⁻⁵
ΔL₅ = 0.5 · 1.0 · 0.09 = 0.045
$

Ranking (least damage first): w₄ (2.5×10⁻⁵), w₂ (0.02), w₅ (0.045), w₁ (0.125), w₃ (0.5).

Interesting: w₄ has tiny magnitude (0.001) but large curvature (50.0); w₂ has small magnitude (0.02) and huge curvature (100.0). Despite being 20× larger than w₄, w₂ has roughly the same predicted damage because H_{22} is 2× H_{44}. Magnitude alone would rank w₄ easiest (correct here) and w₂ second easiest (also correct). But magnitude alone could mis-rank w₁ (0.5, Δ=0.125) vs w₅ (0.3, Δ=0.045): magnitude says prune w₅ first, but combined with Hessian, w₅ IS less damaging — so magnitude gives the right answer here.

Where magnitude fails: If H were diag([1, 1, 1, 1, 1000]), pruning w₅ (magnitude 0.3, Δ=45) would be terrible, while pruning w₁ (magnitude 0.5, Δ=0.125) would be fine. Magnitude alone would prune w₅ first and cause massive damage.

Example 2: SparseGPT Compensation

Weight row w = [1.0, 2.0, 3.0, 4.0, 5.0], target sparsity 40% (prune 2 weights). Calibration Hessian inverse:

$H⁻¹ = [[4, 0, −2, 1, 0],
       [0, 1, 1, 0, −1],
       [−2, 1, 3, 0, 0],
       [1, 0, 0, 2, −1],
       [0, −1, 0, −1, 5]]
$

Step 1: Determine order. Prune by magnitude, descending: w₅=5.0 (prune first), w₄=4.0, w₃=3.0 (keep), w₂=2.0 (keep), w₁=1.0 (keep). Wait — we need to prune 2, so prune w₅ and w₄.

Step 2: Prune w₅ (q=5).

Δ = w₅ = 5.0. H⁻¹_{55} = 5. Compensation to columns 1-4:

$δ_j = −5.0 · H⁻¹_{j,5} / 5 = −H⁻¹_{j,5}

H⁻¹_{1,5}=0 → δ₁=0
H⁻¹_{2,5}=−1 → δ₂=1
H⁻¹_{3,5}=0 → δ₃=0
H⁻¹_{4,5}=−1 → δ₄=1
$

Updated w: [1.0, 2.0+1.0=3.0, 3.0, 4.0+1.0=5.0, 0].

Step 3: Prune w₄ (q=4, now value 5.0).

But we must update H⁻¹ for the reduced problem (column 5 removed). This requires the Schur complement update... In practice, SparseGPT processes sequentially one index at a time, removing it from consideration.

Simplified: after pruning w₅, we adjust and the weight is [1.0, 3.0, 3.0, 5.0]. Now prune w₄=5.0. With δ = −5.0 · H⁻¹_{1:3,4} / H⁻¹_{44} (on the reduced matrix): approximate H⁻¹_{44}=2.

$δ₁ = −5.0 · 1 / 2 = −2.5
δ₂ = −5.0 · 0 / 2 = 0
δ₃ = −5.0 · 0 / 2 = 0
$

Updated: [1.0−2.5=−1.5, 3.0, 3.0, 0, 0].

Final: [−1.5, 3.0, 3.0, 0, 0]. Without compensation: [1.0, 2.0, 3.0, 0, 0] (just zero w₄,w₅). SparseGPT's compensation made w₁ negative and w₂ larger to preserve the output.

Example 3: 2:4 Sparsity Greedy Selection

A 4×8 weight matrix:

$W = [[0.5, −0.3, 1.2, 0.0, −0.8, 0.1, 2.1, −0.4],
     [0.2, 0.9, 0.1, −1.1, 0.6, 0.3, −0.2, 0.7],
     [1.5, 0.4, −0.5, 0.8, 0.0, −1.3, 0.2, 0.9],
     [−0.1, 0.3, 1.1, 0.4, −0.6, 0.0, 0.8, −0.5]]
$

Apply 2:4 sparsity per row. For each row, process groups of 4.

Row 0: Group 1 [0.5, −0.3, 1.2, 0.0]: magnitudes [0.5, 0.3, 1.2, 0.0]. Top-2: 1.2, 0.5 → keep indices 2 and 0: [0.5, 0, 1.2, 0]. Group 2 [−0.8, 0.1, 2.1, −0.4]: magnitudes [0.8, 0.1, 2.1, 0.4]. Top-2: 2.1, 0.8 → keep indices 2 and 0: [−0.8, 0, 2.1, 0].

Row 1: Group 1 [0.2, 0.9, 0.1, −1.1]: magnitudes [0.2, 0.9, 0.1, 1.1]. Top-2: 1.1, 0.9 → keep indices 3 and 1: [0, 0.9, 0, −1.1]. Group 2 [0.6, 0.3, −0.2, 0.7]: magnitudes [0.6, 0.3, 0.2, 0.7]. Top-2: 0.7, 0.6 → keep indices 3 and 0: [0.6, 0, 0, 0.7].

Row 2: Group 1 [1.5, 0.4, −0.5, 0.8]: magnitudes [1.5, 0.4, 0.5, 0.8]. Top-2: 1.5, 0.8 → keep indices 0 and 3: [1.5, 0, 0, 0.8]. Group 2 [0.0, −1.3, 0.2, 0.9]: magnitudes [0.0, 1.3, 0.2, 0.9]. Top-2: 1.3, 0.9 → keep indices 1 and 3: [0, −1.3, 0, 0.9].

Row 3: Group 1 [−0.1, 0.3, 1.1, 0.4]: magnitudes [0.1, 0.3, 1.1, 0.4]. Top-2: 1.1, 0.4 → keep indices 2 and 3: [0, 0, 1.1, 0.4]. Group 2 [−0.6, 0.0, 0.8, −0.5]: magnitudes [0.6, 0.0, 0.8, 0.5]. Top-2: 0.8, 0.6 → keep indices 2 and 0: [−0.6, 0, 0.8, 0].

Final 2:4 sparse matrix has exactly 16 non-zeros (8 groups × 2/group = 16), down from 32. 50% sparsity.



Quiz

Q1: What does the concept of At convergence primarily refer to in this subject?

A) A computational error related to At convergence B) A visual representation of At convergence C) The definition and application of At convergence D) A historical anecdote about At convergence

Correct: C)

Q2: What is the primary purpose of Pruning loop?

A) It is used to pruning loop in mathematical analysis B) It replaces all other methods in this domain C) It is used only in advanced research contexts D) It is primarily a historical notation system

Correct: A)

Q3: Which statement about The Lottery Ticket Hypothesis is TRUE?

A) The Lottery Ticket Hypothesis is not related to this subject B) The Lottery Ticket Hypothesis is mentioned only as a historical footnote C) The Lottery Ticket Hypothesis is an advanced topic beyond this subject's scope D) The Lottery Ticket Hypothesis is a fundamental concept covered in this subject

Correct: D)

Q4: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) 2 MB. C) The inverse of the correct answer D) A different result from a common mistake

Correct: B)

Q5: How are The Lottery Ticket Hypothesis and Structured pruning related?

A) The Lottery Ticket Hypothesis is the inverse of Structured pruning B) The Lottery Ticket Hypothesis and Structured pruning are closely related concepts C) The Lottery Ticket Hypothesis is a special case of Structured pruning D) The Lottery Ticket Hypothesis and Structured pruning are completely unrelated topics

Correct: B)

Q6: What is a common pitfall when working with The Mathematics Of Pruning?

A) The main error with The Mathematics Of Pruning is using it when it is not needed B) The Mathematics Of Pruning is always computed the same way in all contexts C) The Mathematics Of Pruning has no common misconceptions D) A common mistake is confusing The Mathematics Of Pruning with a similar concept

Correct: D)

Q7: When should you apply Sparsegpt: One-Shot Pruning With Compensation?

A) Avoid Sparsegpt: One-Shot Pruning With Compensation unless explicitly instructed B) Use Sparsegpt: One-Shot Pruning With Compensation only in pure mathematics contexts C) Apply Sparsegpt: One-Shot Pruning With Compensation to solve problems in this subject's domain D) Sparsegpt: One-Shot Pruning With Compensation is not practically useful

Correct: C)

Practice Problems

Problem 1

A weight vector has been pruned to 80% sparsity (80% zeros) using unstructured magnitude pruning. The original matrix was 1024×1024 dense in fp16. Compute: (a) the memory reduction ratio (ignoring index overhead), (b) the memory with COO sparse format (each non-zero stored as row, col, value — 2 indices at int32 each + 1 fp16 value). At what sparsity would the compressed sparse format become LARGER than the original dense format?

Answer (a) Original: 1024² · 2 bytes = 2,097,152 bytes = 2 MB. Non-zeros at 80% sparsity: 0.2 · 1,048,576 = 209,715. Memory (just values): 209,715 · 2 = 419,430 bytes ≈ 0.4 MB. Ratio: 5× compression. (b) COO format: each non-zero = 2·4 bytes (int32 indices) + 2 bytes (fp16 value) = 10 bytes. Sparse memory: 209,715 · 10 = 2,097,150 bytes ≈ 2 MB. At 80% sparsity, the COO format is roughly THE SAME SIZE as dense! This illustrates why unstructured sparsity needs specialized hardware — naive sparse formats have high overhead. Break-even: 2 MB = N_nz · 10 bytes → N_nz = 209,715. That's exactly 20% non-zeros = 80% sparsity. So for any sparsity ≤ 80%, COO is LARGER than dense. At 90% sparsity: 10% non-zeros → 104,858 · 10 ≈ 1 MB (2× compression). At 95%: ~0.5 MB (4× compression). This is why pruning needs sparsity >80% for memory savings with COO, and why N:M formats (no index overhead for the in-group selection) are preferred.

Problem 2

Explain why structured pruning (removing entire rows/columns) often causes more accuracy loss than unstructured pruning at the same parameter reduction ratio. Derive the expected damage ratio.

Answer Unstructured pruning can choose the LEAST important individual weights. If we prune p% of weights, we get to select the p% with smallest saliency. Structured pruning must remove ENTIRE rows (or columns). To achieve p% parameter reduction, we remove p% of rows. But a row may contain both important and unimportant weights — removing it destroys its important contributions too. Expected damage ratio: For unstructured, the per-weight damage (from Taylor) is proportional to the saliency of the p%-ile weight. For structured, the per-row damage includes the sum of saliencies of ALL weights in that row. If saliency is uniformly distributed, structured pruning at p% parameter reduction has the same expected damage as unstructured pruning at the same sparsity. But if saliency CONCENTRATES in a few rows (common — some neurons matter more), structured pruning forces us to choose between: 1. Remove a high-saliency row (lots of damage) 2. Keep the high-saliency row and remove a low-saliency row (low damage per parameter, but we don't reach p% reduction) So structured pruning optimally selects rows with the smallest TOTAL saliency, while unstructured selects weights with the smallest individual saliency. The gap grows with saliency concentration. Formal bound: E_structured/E_unstructured ≥ 1, with equality only when saliency is uniform across rows. For power-law distributed saliency, the gap can be 2-10× at the same parameter reduction.

Problem 3

Use Wanda to rank the following weights for pruning (sparsity 50%). Weights W = [0.8, 0.05, 1.5, 0.3, 0.01, 0.6] with activation norms ||X||₂ = [1.0, 20.0, 0.5, 2.0, 100.0, 0.8]. Rank from least to most important.

Answer Saliency S_j = |W_j| · ||X_j||₂: S₁ = 0.8 · 1.0 = 0.80 S₂ = 0.05 · 20.0 = 1.00 S₃ = 1.5 · 0.5 = 0.75 S₄ = 0.3 · 2.0 = 0.60 S₅ = 0.01 · 100.0 = 1.00 S₆ = 0.6 · 0.8 = 0.48 Rank (least important → most important): w₆ (0.48), w₄ (0.60), w₃ (0.75), w₁ (0.80), w₂ (1.00), w₅ (1.00). Prune 50% → remove 3 weights: w₆, w₄, w₃. **Interesting:** w₅ has tiny magnitude (0.01) but huge activation (100.0), making it the most important! Pure magnitude pruning would have pruned it first, causing massive damage (it's connected to the most active input feature). w₃ has the LARGEST magnitude (1.5) but gets pruned because its activation is tiny (0.5) — removing it affects very little output. This illustrates WHY activation-aware pruning is critical.

Problem 4

In the Lottery Ticket Hypothesis IMP algorithm, what happens if we skip step 4 (resetting weights to initialization) and instead continue training the pruned network? Is the result still a "winning ticket"? Why or why not?

Answer If we skip resetting, we're doing standard iterative pruning (prune, fine-tune, repeat). The final subnetwork's weights are the TRAINED values, not the INITIAL values. This is NOT a "winning ticket" because the Lottery Ticket Hypothesis specifically claims that the ARCHITECTURE (mask) combined with INITIALIZATION alone is sufficient. The reset step tests whether the INITIAL random values, when trained, can succeed. Without resetting, we can't distinguish between: 1. The architecture is genuinely special (and would work from scratch) 2. The architecture + trained weights just happens to be a local optimum reachable from the initial point Skipping the reset makes the result a fine-tuned pruned network (which also works well in practice!), but it doesn't test the Hypothesis. The reset is the critical experimental condition. In practice, for LLMs, one-shot pruning + short fine-tuning (without resetting) is the standard. The Lottery Ticket Hypothesis has been demonstrated for small vision models but not for billion-parameter language models trained from scratch — the compute cost of IMP at scale is prohibitive.

Problem 5

A 2:4 sparse GEMM on an A100 runs at 2× the speed of a dense GEMM for the same matrix size (theoretical). The original model spends 70% of inference time in dense matmuls. After applying 2:4 sparsity to all weight matrices, what is the maximum possible speedup? What factors prevent achieving the theoretical maximum?

Answer Maximum theoretical speedup (Amdahl's Law):
$speedup = 1 / (1 − f + f/s) = 1 / (0.3 + 0.7/2) = 1 / 0.65 = 1.54×
$
where f = 0.7 (fraction in matmuls), s = 2 (speedup of matmuls). **Factors preventing theoretical max:** 1. **Not all matmuls are large enough:** 2:4 speedup requires sufficiently large matrices to saturate GPU. Small matmuls (e.g., projection in attention, small batch sizes) may not achieve 2×. 2. **Non-matmul overhead:** Attention softmax, layer norm, residual adds, activation functions, KV cache management — none benefit from weight sparsity. 3. **Memory bandwidth:** If the model was already memory-bound (common for decode), weight sparsity reduces weight loading but doesn't help KV cache bandwidth. The actual speedup is bounded by the memory bandwidth improvement. 4. **Compressed format overhead:** The 2:4 format requires decoding the compressed indices and gathering non-zero values. This adds ~5-10% overhead vs theoretical. 5. **2:4 pruning accuracy loss:** Pruning to 2:4 with PER-GROUP selection (greedy) is suboptimal vs global magnitude pruning. Some models require fine-tuning after 2:4 pruning, which incurs cost. Typical real-world speedup for 2:4 sparse inference: 1.3-1.5× end-to-end, vs 1.54× theoretical. The gap between structured (row/column) speedup and N:M speedup is large: removing 50% of rows (structured) gives 2× matmul speedup with 1.54× end-to-end, while 2:4 (50% sparse, unstructured-within-groups) gives the same theoretical matmul speedup but much better accuracy.

Summary

  1. Magnitude pruning is justified by the second-order Taylor expansion at convergence: ΔL ≈ (1/2)·H_{ii}·w_i², which reduces to |w_i| when the Hessian is uniform; it fails when curvatures vary or weights are correlated
  2. SparseGPT adapts the OBS framework for one-shot pruning: it zeros weights one by one and uses the inverse Hessian to optimally redistribute the error to remaining weights, achieving high sparsity without retraining
  3. Wanda provides a fast alternative: saliency = |W_{ij}| · ||X_j||₂, combining magnitude and activation awareness — it matches SparseGPT at moderate sparsity with no Hessian computation
  4. The Lottery Ticket Hypothesis asserts that a randomly initialized network contains a sparse subnetwork that can be trained to full accuracy; iterative magnitude pruning with weight resetting identifies it, but the technique is computationally impractical for LLMs
  5. N:M sparsity constraints (e.g., 2:4) enable hardware-accelerated sparse matmuls; the greedy per-group magnitude selection is near-optimal for element-wise error but suboptimal for output error


Next Steps

Continue to 19-05 — Knowledge Distillation to explore teacher-student training, temperature-scaled KL divergence, and distillation variants for compressing LLM knowledge into smaller models.