Math graphic
📐 Concept diagram

19-03 — Quantization Mathematics

Phase: 19 — Advanced LLM Mathematics Subject: 19-03 Prerequisites: 19-02 (Attention Variants), 15-01 (Floating Point Arithmetic), 15-06 (Mixed-Precision Training), 14-06 (Convex Sets and Functions — second-order condition), 09-3 (SVD — for low-rank approximation concepts) Next subject: 19-04 — Pruning


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the uniform affine quantization mapping q = round((r − z)/s) and compute the quantization error bounds as a function of scale s and bit-width b
  2. Formulate the Optimal Brain Surgeon (OBS) row-wise weight compensation problem that GPTQ solves, and derive the Hessian-weighted quantization error
  3. Prove why activation-awareness matters for quantization (AWQ) by deriving the per-channel weight saliency s_j = ||W_j|| · ||X_j|| and showing it preserves salient channels
  4. Compute the dequantization cost and memory savings for symmetric vs asymmetric, per-tensor vs per-channel vs per-group quantization schemes
  5. Decompose the NF4 (Normal Float 4-bit) data type — show why it's a quantile-based encoding that's optimal for normally distributed weights

Core Content

1. Uniform Affine Quantization

1.1 The Quantization Function

Uniform affine quantization maps a floating-point value r ∈ [r_min, r_max] to an integer q ∈ [0, 2^b − 1] (unsigned) or [−2^(b−1), 2^(b−1)−1] (signed):

$q = round((r − z) / s)
r̂ = s · q + z      (dequantization)
$

where: - s = (r_max − r_min) / (2^b − 1) — scale (step size) - z = round(−r_min / s) or z = 0 (symmetric) — zero point - b — bit width (4, 8, etc.)

Quantization error: Δ = r_max − r_min for the range. The rounding error per value:

$|r − r̂| ≤ s/2 = Δ / (2·(2^b−1))
$

For b=8: s/2 ≈ Δ/510. For b=4: s/2 ≈ Δ/30. Four-bit quantization has ~17× coarser granularity than 8-bit.

⚠️ THIS IS CRITICAL — The quantization error bound |r − r̂| ≤ s/2 means that reducing bit-width from 8 to 4 (a 2× memory savings) increases per-element error by ~17×. This is why naive 4-bit quantization destroys model quality, and advanced techniques (GPTQ, AWQ, NF4) are required to compensate. The core challenge: preserving model accuracy requires compensating for this dramatically coarser quantization grid.

1.2 Symmetric vs Asymmetric Quantization

Symmetric (z = 0):

$q = round(clip(r/s, −2^(b−1), 2^(b−1)−1))
r̂ = s · q
$

Range is symmetric around zero: [−s·2^(b−1), s·2^(b−1)]. Scale is typically chosen as s = max(|r|) / 2^(b−1). ZERO POINT is zero — one fewer parameter to store, simpler dequantization.

Asymmetric (z ≠ 0): Uses actual r_min and r_max. Better for distributions that are NOT symmetric around zero (e.g., ReLU activations which are ≥ 0). Cost: z must be stored per quantization group, and dequantization requires a subtraction.

When to use which: - Weights: usually symmetric (they're roughly zero-centered after normalization) - Activations: usually asymmetric (ReLU outputs are non-negative, other activations may have negative ranges) - For per-channel quantization of weights, symmetric is standard

1.3 Quantization Granularity

Per-tensor: One scale and zero point for the entire tensor. Memory: 2 scalars. Fastest dequantization. Risk: outlier channels dominate the range, wasting precision for small-valued channels.

s ∈ ℝ, z ∈ ℤ     (scalar for whole W ∈ ℝ^{m×n})

Per-channel (row/column): One scale per row or column. Memory: m or n scalars. Let each row have its own scale based on its own range — protects small rows from being drowned by outlier rows.

For weight matrix W ∈ ℝ^{out×in}, per-output-channel:

s_j = max(|W_{j,:}|) / 2^(b−1)       for j = 1,...,out
q_{j,k} = round(W_{j,k} / s_j)

Per-group: The most common configuration. Split each row into groups of size G and quantize each group independently. Memory: (out · in/G) scalars. Standard for 4-bit: G=128. Excellent precision at modest scale overhead.

For group size 128, a weight matrix with dim 4096×4096: 4096 · 4096/128 = 131,072 scales (vs 4,096 for per-channel). At fp16: 131K · 2 bytes = 256KB vs 8KB. Still negligible vs 16M weights (32MB fp16 → 8MB int4 + 256KB scales = ~4.1× compression).

1.4 Compression Ratio

For a tensor with N elements stored in b bits:

$compression = original_bits / (b + scale_overhead_bits/N)
$

For int4 (b=4) with group-128 and fp16 scales:

$scale_overhead = 16 bits / 128 elements = 0.125 bits/element
compression = 16 / (4 + 0.125) = 16/4.125 ≈ 3.88×
$

Compare: int8 per-tensor: 16/(8+ε) ≈ 2×. int4 per-channel: 16/(4+16/4096) ≈ 3.998× (nearly 4×). The group-128 overhead is small but real.

2. GPTQ: Layer-wise Quantization via Optimal Brain Surgeon

2.1 The Quantization Problem

Given a weight matrix W ∈ ℝ^{d_out × d_in} and calibration data X ∈ ℝ^{N_cal × d_in}, we want quantized weights Ŵ that minimize the output error:

min_{Ŵ} ||WX − ŴX||²_F

This is NOT element-wise quantization. The goal is to preserve layer outputs, not individual weights. This is crucial: some weights matter more than others for the output.

2.2 Optimal Brain Surgeon (OBS) Formulation

OBS provides a framework for pruning weights with error compensation. GPTQ adapts this for quantization: instead of pruning (setting weight to 0), we quantize (setting weight to nearest quantized value) and compensate.

Let H = 2XX^T ∈ ℝ^{d_in × d_in} be the Hessian of the squared error with respect to W (each row independently). For row w (size d_in), the error increase from quantizing weight w_q to ŵ_q:

E = (w_q − ŵ_q)² / [H⁻¹]_{qq}

The OPTIMAL update to other weights to compensate:

δ_w = −(w_q − ŵ_q) / [H⁻¹]_{qq} · H⁻¹_{:,q}

This minimizes the output error after quantization.

2.3 GPTQ Algorithm

Preprocessing: Compute H⁻¹ from calibration data. Add a small diagonal damping λ to H for numerical stability: H ← 2XX^T + λI.

Quantization loop (per row, processing columns sequentially):

For each column q = 1 to d_in: 1. Quantize w_q → ŵ_q = quant(w_q) 2. Compute error: Δ_q = w_q − ŵ_q 3. Compensate remaining weights in the row: w_{q+1:d_in} ← w_{q+1:d_in} − (Δ_q / [H⁻¹]{qq}) · H⁻¹{q, q+1:d_in}

Key properties: - Columns are processed in order of increasing diagonal H⁻¹ (most \"sensitive\" weights first) to minimize accumulated error - Each compensation step uses the INVERSE HESSIAN to optimally redistribute error - The Hessian captures correlations between weights — quantizing one weight can be compensated by adjusting others that are correlated with it

Complexity: O(d_in³) for H⁻¹ inversion (done once), O(d_in² · d_out) for the row-wise quantization loop. For large matrices, the Hessian inversion dominates.

2.4 Column Reordering Heuristic

GPTQ reorders columns by the inverse Hessian diagonal (H⁻¹_{qq}) — columns that are \"harder\" to compensate (larger diagonal) are quantized LAST, giving the maximum opportunity for compensation from not-yet-quantized columns.

This is a greedy approximation to the optimal ordering (which is NP-hard — subset selection from OBS). In practice, ordering by increasing H⁻¹_{qq} works well because it gives the largest \"budget\" of uncompensated columns to absorb errors from the most sensitive columns.

3. AWQ: Activation-Aware Weight Quantization

3.1 The Activation-Aware Insight

GPTQ treats all output channels equally. But in LLMs, some channels have disproportionately large activations (\"outlier channels\"). Quantizing these channels' weights loses information that gets multiplied by large activations, producing large output errors.

Key observation (from AWQ paper): Only ~1% of channels have significantly larger activations, but they dominate the quantization error.

3.2 Saliency and Scaling

Define per-channel saliency:

s_j = ||W_j|| · mean(|X_j|)    for output channel j

where W_j is row j of the weight matrix and X_j is the corresponding input activation.

AWQ's approach: Before quantization, SCALE UP the weights of salient channels and SCALE DOWN their inputs correspondingly (so the matmul output is unchanged):

$W_j' = W_j · scale_j
X_j' = X_j / scale_j
$

Now when we quantize W_j', the salient channels have larger values → the fixed quantization grid allocates more representable levels to the range that matters. During inference, we fold the scale into the weight (activated weight is permanent) so there's NO runtime overhead.

Optimal scale search:

$min_{scale_j} ||W_j·X − quant(W_j·scale_j)·(X/scale_j)||²
$

This is a per-channel grid search (AWQ uses ≈20 candidate scales per channel). The search is cheap because it's per-channel and only over a few candidates.

3.3 Why This Works: Scale Invariance

For linear layers: (W·s) · (X/s) = W·X (exact equality in floating point). The scale is a mathematical identity in full precision. After quantization:

$quant(W·s) · (X/s) ≈ W·X
$

The key: quant(W·s) allocates more quantization levels to the scaled-up range. After dividing activations by s, the output is preserved. This is a form of per-channel equivalent scaling — changing the quantization grid per channel without actually changing the scale parameter of the quantizer.

4. NF4: Normal Float 4-bit

4.1 The Problem with Uniform Quantization

Uniform quantization assumes values are uniformly distributed. Neural network weights are approximately normally distributed N(0, σ²). Uniform quantization wastes half the bins on the tails (which rarely occur) and under-represents the dense center.

For N(0,1): ~68% of weights in [−1, 1], but a uniform 4-bit quantizer with range [−3, 3] (capturing 3σ) gives only ~3 bins in [−1, 1] out of 16 total.

4.2 Quantile Quantization

NF4 is information-theoretically optimal for normally distributed data: it places quantization levels at the quantiles of the normal distribution so each bin has equal probability mass.

For b=4 bits (16 levels), we divide the standard normal CDF into 16 equal-probability regions:

q_i = Φ⁻¹((2i+1) / (2·2^b))       for i = 0,...,2^b−1

where Φ⁻¹ is the inverse CDF (probit function).

For 16 levels:

i=0:  Φ⁻¹(1/32)   = −1.96
i=1:  Φ⁻¹(3/32)   = −1.44
i=2:  Φ⁻¹(5/32)   = −1.13
i=3:  Φ⁻¹(7/32)   = −0.90
i=4:  Φ⁻¹(9/32)   = −0.70
i=5:  Φ⁻¹(11/32)  = −0.53
i=6:  Φ⁻¹(13/32)  = −0.37
i=7:  Φ⁻¹(15/32)  = −0.22
i=8:  Φ⁻¹(17/32)  = +0.22    (symmetric — use negatives of above for i=8..15 or mirror)
i=9:  Φ⁻¹(19/32)  = +0.37
...
i=15: Φ⁻¹(31/32)  = +1.96

These levels are fine near zero (many levels in the high-probability region) and coarse in the tails.

Key property: For a normally distributed source, the expected quantization error of NF4 is LOWER than uniform quantization of the same bit-width, because NF4 allocates representational capacity proportional to probability mass.

4.3 Implementation: k-bit NormalFloat

In practice, NF4 doesn't store these exact levels. Instead: 1. Quantize the weight tensor to the NF4 grid using a per-block scale 2. The quantized values are indices 0..15 3. Dequantization: lookup table maps index → NF4 value, multiply by scale

NF4 is the format used in QLoRA for 4-bit fine-tuning.



Pitfalls

⚠️ Pitfall 1: Applying per-tensor quantization to matrices with outlier channels. If one row of a weight matrix has values in [-10, 10] and another in [-0.1, 0.1], per-tensor quantization uses s = max(|all|)/127 ≈ 10/127 ≈ 0.079. The small row gets quantized to q ∈ {-1, 0, 1} — essentially losing all precision. Per-channel quantization fixes this.

⚠️ Pitfall 2: Quantizing weights without considering activation magnitudes. A weight that's 10× larger than others may SEEM important, but if the corresponding activation is near zero, it contributes nothing. AWQ's saliency s_j = ||W_j|| · ||X_j|| accounts for both.

⚠️ Pitfall 3: Using symmetric quantization for asymmetric distributions. ReLU activations are ≥ 0 — their distribution is inherently asymmetric. Symmetric quantization wastes half the quantized range on negative values that never occur. Use asymmetric quantization for activations.


Key Terms

Worked Examples

Example 1: Uniform Quantization Step-by-Step

Quantize the vector r = [0.5, −1.2, 3.7, −0.3, 2.1] to 4-bit signed integer (−8 to +7).

Step 1: Find range.

$r_min = −1.2, r_max = 3.7
Δ = r_max − r_min = 4.9
$

Step 2: Symmetric quantization. Scale uses max absolute value:

$s = max(|r|) / 2^(b−1) = 3.7 / 8 = 0.4625
$

Step 3: Quantize. q = round(r / s), clipped to [−8, 7]:

$q₁ = round(0.5/0.4625) = round(1.08) = 1
q₂ = round(−1.2/0.4625) = round(−2.59) = −3
q₃ = round(3.7/0.4625) = round(8.0) = 8 → clip to 7
q₄ = round(−0.3/0.4625) = round(−0.65) = −1
q₅ = round(2.1/0.4625) = round(4.54) = 5

q = [1, −3, 7, −1, 5]
$

Step 4: Dequantize. r̂ = s · q:

$r̂₁ = 0.4625 · 1 = 0.4625
r̂₂ = 0.4625 · (−3) = −1.3875
r̂₃ = 0.4625 · 7 = 3.2375
r̂₄ = 0.4625 · (−1) = −0.4625
r̂₅ = 0.4625 · 5 = 2.3125
$

Step 5: Errors.

$|r − r̂| = [0.0375, 0.1875, 0.4625, 0.1625, 0.2125]
$

The clipping of q₃ (8→7) caused the largest error (0.4625). The maximum theoretical error for non-clipped values is s/2 = 0.23125. q₃'s error exceeds this because of clipping!

With asymmetric quantization:

s = (3.7 − (−1.2)) / 15 = 4.9/15 = 0.3267
z = round(−(−1.2)/0.3267) = round(3.67) = 4

q'₁ = round((0.5 − 4·0.3267)/0.3267) works out differently. Let's compute:
q = round((r + 1.2)/0.3267)
q₁ = round(1.7/0.3267) = 5
q₂ = round(0/0.3267) = 0
q₃ = round(4.9/0.3267) = 15  (full range!)
q₄ = round(0.9/0.3267) = 3
q₅ = round(3.3/0.3267) = 10

r̂ = 0.3267·q − 1.2 (dequant)
Errors are roughly s/2 ≈ 0.163 — better than symmetric for this asymmetric distribution.

Example 2: GPTQ Compensation

For a simplified 1×3 weight row w = [1.0, 2.0, 3.0] and calibration Hessian:

$H⁻¹ = [[4, 0, −2],
       [0, 1, 1],
       [−2, 1, 3]]
$

Quantize to integers with scale s=1.0 (so quantization = rounding). Quantize in order of increasing H⁻¹ diagonal: q=2 (H⁻¹_22=1), then q=1 (H⁻¹_11=4), then q=0 (H⁻¹_00=4... break ties arbitrarily).

Step 1: Quantize w₂ = 3.0 → ŵ₂ = 3 (no error, integer already). Δ₂ = 0, no compensation needed. w = [1.0, 2.0, 3.0].

Step 2: Quantize w₁ = 2.0 → ŵ₁ = 2 (no error). Δ₁ = 0. w = [1.0, 2.0, 3.0].

Step 3: Quantize w₀ = 1.0 → ŵ₀ = 1. Δ₀ = 0. All done, no errors.

Now change to non-integer: w = [1.3, 2.7, 2.2].

Quantize in order H⁻¹ diagonal: index 1 (1), index 2 (3), index 0 (4).

Step 1: Index 1 (H⁻¹_11=1). w₁ = 2.7 → ŵ₁ = 3. Δ₁ = −0.3.

Compensate remaining columns (0, 2):

$δ_w₀ = −(−0.3) / 1 · H⁻¹_{1,0} = 0.3 · 0 = 0
δ_w₂ = −(−0.3) / 1 · H⁻¹_{1,2} = 0.3 · 1 = 0.3
$

Update: w₀ = 1.3, w₂ = 2.2 + 0.3 = 2.5.

Step 2: Index 2. w₂ = 2.5 → ŵ₂ = 3 (or 2? round to nearest: 3). Δ₂ = −0.5.

Compensate index 0:

$δ_w₀ = −(−0.5) / 3 · H⁻¹_{2,0} = 0.5/3 · (−2) = −0.333
$

Update: w₀ = 1.3 − 0.333 = 0.967.

Step 3: Index 0. w₀ = 0.967 → ŵ₀ = 1. Δ₀ = −0.033.

No columns left to compensate.

Final quantized: ŵ = [1, 3, 3]. Original: [1.3, 2.7, 2.2].

Without compensation, quantized would be: [1, 3, 2] (element-wise rounding). GPTQ changed w₂ from 2 to 3 through compensation.

Example 3: AWQ Scale Computation

Given a row w = [0.1, 0.15, 5.0, 0.08] (channel 3 is an outlier) and corresponding activation mean x̄ = [1.0, 1.2, 0.8, 0.9].

Compute saliencies and determine which channel gets an AWQ scale.

Saliencies:

$s₁ = |0.1| · 1.0 = 0.1
s₂ = |0.15| · 1.2 = 0.18
s₃ = |5.0| · 0.8 = 4.0    ← outlier!
s₄ = |0.08| · 0.9 = 0.072
$

Channel 3 has 22× the saliency of the next largest. If we use uniform per-channel quantization with scale based on max(|w|)=5.0, small channels suffer. AWQ applies a scale to channel 3:

Let s = 2.0 (example scale from grid search):

$w₃' = 5.0 · 2.0 = 10.0
x₃' = 0.8 / 2.0 = 0.4
$

After quantization (4-bit symmetric, range [−8,7]): - Quantize w₃': scale = max(|w'|)/8 = 10.0/8 = 1.25. q₃' = round(10.0/1.25) = 8 → clip to 7. Dequant: 7·1.25 = 8.75. Output: 8.75 · 0.4 = 3.5. - Without scaling: w₃=5.0 quantized with scale 5.0/8=0.625. q₃=round(5.0/0.625)=8→clip to 7. Output: 7·0.625·0.8 = 3.5.

Same output? Actually AWQ's benefit shows when the larger weight range allows better relative precision across channels. The key is that by scaling up the outlier channel, its quantization error relative to its magnitude decreases, AND the per-channel quantization scale for other channels can now be smaller (since the outlier doesn't inflate their scale). In per-tensor quantization, scaling the outlier up and activation down lets the shared scale be optimized for ALL channels.



Quiz

Q1: What does the concept of Quantization loop primarily refer to in this subject?

A) A historical anecdote about Quantization loop B) A visual representation of Quantization loop C) The definition and application of Quantization loop D) A computational error related to Quantization loop

Correct: C)

Q2: What is the primary purpose of Uniform Affine Quantization?

A) It is used to uniform affine quantization in mathematical analysis B) It is primarily a historical notation system C) It replaces all other methods in this domain D) It is used only in advanced research contexts

Correct: A)

Q3: Which statement about Gptq: Layer-Wise Quantization Via Optimal Brain Surgeon is TRUE?

A) Gptq: Layer-Wise Quantization Via Optimal Brain Surgeon is mentioned only as a historical footnote B) Gptq: Layer-Wise Quantization Via Optimal Brain Surgeon is a fundamental concept covered in this subject C) Gptq: Layer-Wise Quantization Via Optimal Brain Surgeon is an advanced topic beyond this subject's scope D) Gptq: Layer-Wise Quantization Via Optimal Brain Surgeon is not related to this subject

Correct: B)

Q4: Based on the worked examples in this subject, what is the correct result?

A) 6.0. B) A different result from a common mistake C) An unrelated numerical value D) The inverse of the correct answer

Correct: A)

Q5: How are Gptq: Layer-Wise Quantization Via Optimal Brain Surgeon and Awq: Activation-Aware Weight Quantization related?

A) Gptq: Layer-Wise Quantization Via Optimal Brain Surgeon is the inverse of Awq: Activation-Aware Weight Quantization B) Gptq: Layer-Wise Quantization Via Optimal Brain Surgeon and Awq: Activation-Aware Weight Quantization are completely unrelated topics C) Gptq: Layer-Wise Quantization Via Optimal Brain Surgeon and Awq: Activation-Aware Weight Quantization are closely related concepts D) Gptq: Layer-Wise Quantization Via Optimal Brain Surgeon is a special case of Awq: Activation-Aware Weight Quantization

Correct: C)

Q6: What is a common pitfall when working with Nf4: Normal Float 4-Bit?

A) The main error with Nf4: Normal Float 4-Bit is using it when it is not needed B) A common mistake is confusing Nf4: Normal Float 4-Bit with a similar concept C) Nf4: Normal Float 4-Bit is always computed the same way in all contexts D) Nf4: Normal Float 4-Bit has no common misconceptions

Correct: B)

Q7: When should you apply Common Pitfalls?

A) Common Pitfalls is not practically useful B) Apply Common Pitfalls to solve problems in this subject's domain C) Use Common Pitfalls only in pure mathematics contexts D) Avoid Common Pitfalls unless explicitly instructed

Correct: B)

Practice Problems

Problem 1

A weight vector has values normally distributed with μ=0, σ=2.0. Quantize it to 8-bit signed integer (range [−128, 127]) using symmetric per-tensor quantization. What is the optimal scale s? What is the maximum quantization error? What fraction of values are expected to be clipped (beyond 3σ)?

Answer For symmetric with normal distribution, we set the range to cover some multiple k of σ, balancing quantization error against clipping. Common choice: k=3 (3σ range). Then max = 3·2.0 = 6.0. Scale: s = 6.0 / 127 = 0.04724. Max error (non-clipped): s/2 = 0.0236. Clipping probability: P(|w| > 6.0) = 2 · P(Z > 3) where Z ~ N(0,1). From standard normal table: P(Z > 3) ≈ 0.00135, so about 0.27% of values are clipped. Alternative: k=4 gives s = 8.0/127 = 0.063, error s/2 = 0.0315, clipping ~0.0063%. Trade-off: wider range = more error per value but less clipping. For 8-bit, k=3 is common; for 4-bit, often k=2.5 to concentrate bins near zero.

Problem 2

Compute the NF4 quantization levels (16 values) for a standard normal distribution. List the first 8 positive levels (skip the exact negative symmetric ones). Show the spacing between consecutive levels.

Answer Using Φ⁻¹ at probabilities (2i+1)/32 for i=0..15: i=0: Φ⁻¹(1/32) = −1.96 i=1: Φ⁻¹(3/32) = −1.44 i=2: Φ⁻¹(5/32) = −1.13 i=3: Φ⁻¹(7/32) = −0.90 i=4: Φ⁻¹(9/32) = −0.70 i=5: Φ⁻¹(11/32) = −0.53 i=6: Φ⁻¹(13/32) = −0.37 i=7: Φ⁻¹(15/32) = −0.22 i=8: Φ⁻¹(17/32) = +0.22 i=9: Φ⁻¹(19/32) = +0.37 i=10: Φ⁻¹(21/32) = +0.53 i=11: Φ⁻¹(23/32) = +0.70 i=12: Φ⁻¹(25/32) = +0.90 i=13: Φ⁻¹(27/32) = +1.13 i=14: Φ⁻¹(29/32) = +1.44 i=15: Φ⁻¹(31/32) = +1.96 Positive levels (i=8..15): 0.22, 0.37, 0.53, 0.70, 0.90, 1.13, 1.44, 1.96. Spacing between consecutive positive levels: 0.37−0.22=0.15 0.53−0.37=0.16 0.70−0.53=0.17 0.90−0.70=0.20 1.13−0.90=0.23 1.44−1.13=0.31 1.96−1.44=0.52 Spacing increases toward the tails — more resolution near zero (high probability), less in the tails (low probability). This is the key property of quantile quantization. Compare to uniform over [−2,2]: spacing = 4/15 ≈ 0.267, equal everywhere.

Problem 3

In GPTQ, explain why quantizing columns in order of increasing H⁻¹ diagonal is a good heuristic. What would happen if you quantized in order of decreasing diagonal?

Answer The error from quantizing column q is E_q = (w_q − ŵ_q)² / H⁻¹_{qq}. This error gets compensated by updating the remaining (not-yet-quantized) columns. A column with SMALL H⁻¹_{qq} has a LARGE error (since it's in the denominator) — it's a \"sensitive\" column where small weight perturbations cause large output errors. By quantizing the LEAST sensitive columns first (large H⁻¹_{qq} → small error), the error they introduce is small AND there are many remaining columns available to absorb it. By the time we reach the MOST sensitive columns (small H⁻¹_{qq}), most other columns are already quantized and can't help, but the error is small because only a few columns remain to corrupt. If we did the opposite (decreasing diagonal = most sensitive first): the first quantization step would introduce a large error when most columns are unquantized, and the compensation would push this error into many remaining columns. Those columns then get quantized with extra accumulated error, cascading into poor quality. This is effectively quantizing the \"hard\" columns before you know what the \"easy\" columns will look like. The greedy heuristic of least-sensitive-first is not provably optimal but empirically works very well because it minimizes the cascade of compounded errors.

Problem 4

A model has 7B parameters in fp16. Compute the memory reduction from fp16 to: (a) int8 per-tensor, (b) int4 per-channel, (c) int4 per-group (group size 128). Express as absolute memory (GB) and compression ratio. Assume fp16 scales for all.

Answer fp16 baseline: 7B · 2 bytes = 14 GB. **(a) int8 per-tensor:** Weight data: 7B · 1 byte = 7 GB. Scale metadata: 2 scales per tensor. For ~1000 tensors (one per linear layer weight), 1000·2·2 = 4 KB — negligible. Total: ~7 GB. Compression: 14/7 = 2.0×. **(b) int4 per-channel:** Weight data: 7B · 0.5 bytes = 3.5 GB. Scale metadata: for a weight matrix out×in, one scale per output channel. Average matrix size ~4096×4096 in large models. Each has 4096 scales = 8 KB fp16. Across ~200 matrices: 1.6 MB — negligible. Total: ~3.5 GB. Compression: 14/3.5 = 4.0×. **(c) int4 per-group (G=128):** Weight data: 3.5 GB. Scale metadata: per row, in/128 scales. For out×in = 4096×4096, each matrix: 4096·4096/128 = 131,072 scales · 2 bytes = 256 KB. 200 matrices: ~50 MB. Total: 3.5 + 0.05 = ~3.55 GB. Compression: 14/3.55 ≈ 3.94×. The per-group overhead reduces compression from 4.0× to 3.94× — a very small price for much better accuracy (protecting outlier channels within a row).

Problem 5

Explain why AWQ's scaling trick (W·s, X/s before quantization) works only for linear operations, and why it would NOT work directly for layer normalization or attention softmax.

Answer AWQ exploits the linearity property: (W·s) · (X/s) = W·X. This holds because multiplication by a scalar commutes with the matrix multiplication. The scale factor is \"folded\" into the weight permanently, and the activation is proportionally reduced, preserving the mathematical output. For layer normalization: LN(x) = γ·(x−μ)/σ + β. If we scale the input x by 1/s, the mean μ scales by 1/s, and σ scales by 1/s: LN(x/s) = γ·(x/s−μ/s)/(σ/s) + β = γ·(x−μ)/σ + β = LN(x). So LN IS invariant to input scaling! However, the LN weights γ,β are parameters that may need their own quantization strategy. For attention softmax: softmax(QK^T/√d_k) is NOT scale invariant. If we scale Q by s and K by 1/s, the product QK^T is unchanged, so the softmax output is unchanged. But if we scale only Q and compensate in the output projection, the softmax non-linearity breaks the invariance. The general principle: AWQ's per-channel scaling works for any operation where output = weight · input (element-wise multiplication or matmul). Non-linear operations (softmax, ReLU, GELU) break the simple scaling invariance and require different handling (e.g., quantization-aware training of the non-linearities, or separate activation quantization).

Summary

  1. Uniform affine quantization maps floats to integers via q = round((r−z)/s); per-group quantization (e.g., group-128) provides the best accuracy-to-overhead tradeoff for 4-bit LLM weights (~3.9× compression)
  2. GPTQ frames quantization as a layer-wise output reconstruction problem using the Optimal Brain Surgeon framework — it compensates quantization error by adjusting not-yet-quantized weights according to the inverse Hessian
  3. AWQ identifies that a small fraction of \"salient\" channels dominate quantization error; by scaling these channels up (and their activations down), AWQ protects them within the fixed quantization grid with zero runtime overhead
  4. NF4 is a quantile-based non-uniform quantization format optimal for normally distributed weights — it allocates levels proportional to probability mass, placing more bins near zero
  5. For linear layers, scale invariance enables activation-aware techniques; non-linear operations (softmax, norm) require separate treatment in quantization pipelines


Next Steps

Continue to 19-04 — Pruning to explore magnitude pruning, structured vs unstructured sparsity, the Lottery Ticket Hypothesis, SparseGPT, and N:M sparsity patterns.