
20-09 — Parameter-Efficient Fine-Tuning (PEFT)

Phase: 20 (Training & Fine-tuning Mathematics)
Subject: 20-09
Prerequisites: 20-05 (Instruction Tuning SFT), 20-08 (Constitutional AI and RLAIF), 17-05 (Attention Mathematics), 09-03 (SVD, for low-rank concepts), 15-01 (Floating Point Arithmetic, for quantization)
Next subject: 21-01 (Bayesian Inference)

Learning Objectives

By the end of this subject, you will be able to:

Derive the LoRA weight update: W = W₀ + (α/r)·BA and prove it preserves the original model behavior at initialization (B=0)
Analyze why the intrinsic dimension of fine-tuning is much smaller than the full parameter count, connecting to the low-rank hypothesis
Compute memory savings of QLoRA (NF4 quantization + LoRA adapters) vs full fine-tuning for a given model size
Compare the five major PEFT methods (LoRA, QLoRA, Adapters, Prefix Tuning, Prompt Tuning) along the axes of trainable parameters, memory, and inference overhead
Design a PEFT configuration given model size, GPU memory budget, and task-switching requirements

Core Content

1. Why PEFT?

Full fine-tuning of large language models is prohibitively expensive. A 70B parameter model with Adam optimizer requires storing:

Parameters (fp16):     70B × 2 bytes  = 140 GB
Gradients (fp16):      70B × 2 bytes  = 140 GB
Optimizer m (fp32):    70B × 4 bytes  = 280 GB
Optimizer v (fp32):    70B × 4 bytes  = 280 GB
─────────────────────────────────────────────
TOTAL:                                  840 GB

This requires 11+ A100 80GB GPUs just for memory. PEFT (Parameter-Efficient Fine-Tuning) addresses this by FREEZING the base model and learning only a small auxiliary set of parameters, reducing trainable parameters from billions to millions.
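The arithmetic above can be reproduced with a short script. This is a minimal sketch that assumes exactly the byte sizes listed in the table (fp16 weights and gradients, fp32 Adam states) and ignores activations; the function name `full_finetune_memory_gb` is introduced here for illustration only.

```python
import math

def full_finetune_memory_gb(n_params):
    """Per-component memory in GB (1 GB = 1e9 bytes) for fp16 weights and
    gradients plus the two fp32 Adam states, as in the table above."""
    bytes_per_param = {
        "parameters_fp16": 2,
        "gradients_fp16": 2,
        "adam_m_fp32": 4,
        "adam_v_fp32": 4,
    }
    breakdown = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown

budget = full_finetune_memory_gb(70e9)
# Number of 80 GB GPUs needed to hold this state, counting memory only.
gpus_needed = math.ceil(budget["total"] / 80)
print(budget, gpus_needed)
```

Changing `n_params` shows how quickly the optimizer states dominate: they are two thirds of the total at any model size under these assumptions.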

⚠️ THIS IS CRITICAL — PEFT methods (especially LoRA/QLoRA) have become the standard approach for fine-tuning LLMs. Full fine-tuning is now rare outside of pre-training and large-scale RLHF runs.

2. LoRA (Low-Rank Adaptation)

The key insight (Hu et al., 2021): the weight update ΔW during fine-tuning has low "intrinsic rank." Rather than learning a full d×k update matrix, we factor it as the product of two low-rank matrices.

Mathematical formulation:

For a pre-trained weight matrix W₀ ∈ ℝ^{d×k}:

$$h = W_0 x + \Delta W x = W_0 x + B A x$$

where:
- B ∈ ℝ^{d×r} (low-rank, r ≪ min(d,k))
- A ∈ ℝ^{r×k}
- W₀ is FROZEN (no gradients)
- Only A and B are trained

With scaling factor α:

$$h = W_0 x + \frac{\alpha}{r} \, B A x$$

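Setting α = r makes the factor α/r equal to 1, so the update is applied unscaled. Because the factor multiplies the entire LoRA path, tuning α behaves much like tuning a learning rate specific to that path; keeping α fixed while varying r also reduces the need to retune other hyperparameters when the rank changes.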

Initialization:
- A: random Gaussian initialization (small values)
- B: ZEROS, so at initialization BA = 0 and the model behaves exactly like the pre-trained model
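The forward pass and the zero-initialization property can be checked numerically. The following is a minimal sketch in plain Python, not a training-ready layer: the helper names `matvec` and `lora_forward`, the toy dimensions, and the choice α = 16 are all assumptions made for illustration.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W0, A, B, x, alpha):
    """Return W0 x + (alpha / r) * B (A x), with r read off A's row count."""
    r = len(A)
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

random.seed(0)
d, k, r, alpha = 4, 3, 2, 16  # toy sizes, chosen only for the demo
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # small Gaussian init
B = [[0.0] * r for _ in range(d)]                                   # zero init
x = [1.0, -2.0, 0.5]

h_init = lora_forward(W0, A, B, x, alpha)
h_base = matvec(W0, x)
print(h_init == h_base)  # the adapter contributes nothing while B is all zeros
```

Replacing the zeros in `B` with random values makes the two outputs differ immediately, which is the perturbation the zero initialization is designed to avoid.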

Why low-rank works:
1. The pre-trained model already captures general knowledge; adaptation only needs to specialize
2. The intrinsic dimension of fine-tuning tasks is often much smaller than the parameter count (measured in hundreds, not billions)
3. Low-rank decomposition is a form of implicit regularization: it prevents overfitting to small fine-tuning datasets

Common practice: Apply LoRA to attention projection matrices (Q, K, V, O). Typical rank r ∈ {4, 8, 16, 32, 64}.

3. QLoRA (Quantized LoRA)

QLoRA (Dettmers et al., 2023) extends LoRA by quantizing the FROZEN base model to 4-bit precision, further reducing memory:

Base model W₀: stored in 4-bit NF4 (NormalFloat4)
LoRA adapters BA: stored and trained in BF16/FP16

Forward pass:
1. Dequantize W₀ from 4-bit → BF16 (on-the-fly)
2. Compute W₀x + BAx in BF16
3. Discard dequantized W₀

Memory breakdown (70B model):

Component	FP16 Full FT	QLoRA
Base model (W₀)	140 GB	35 GB (4-bit NF4)
Optimizer states	560 GB	— (frozen)
Gradients (W₀)	140 GB	— (frozen)
LoRA params	—	~0.17 GB (r=64)
LoRA opt states	—	~0.7 GB (fp32 m, v)
Total	840 GB	~36 GB

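A ~23× memory reduction. At ~36 GB, the 70B model now fits on a single 48 GB or 80 GB GPU. It still exceeds the 24 GB of a consumer RTX 3090/4090; on those cards QLoRA works for smaller models whose 4-bit weights leave room for adapters and activations (the 13B model in Example 2 needs about 7 GB).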
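The breakdown can be recomputed from parameter counts. This sketch assumes about 84M adapter parameters (which is what 0.17 GB at 2 bytes per parameter implies), 16-bit adapter gradients, and fp32 Adam states for the adapters only; it leaves out activations and quantization metadata. The function name is hypothetical.

```python
def qlora_training_memory_gb(n_base_params, n_lora_params):
    """Rough QLoRA training memory in GB (1 GB = 1e9 bytes), excluding activations."""
    gb = 1e9
    mem = {
        "base_nf4": n_base_params * 0.5 / gb,             # 4 bits per frozen weight
        "lora_weights_16bit": n_lora_params * 2 / gb,
        "lora_gradients_16bit": n_lora_params * 2 / gb,
        "lora_adam_fp32": n_lora_params * 2 * 4 / gb,     # m and v, 4 bytes each
    }
    mem["total"] = sum(mem.values())
    return mem

mem = qlora_training_memory_gb(n_base_params=70e9, n_lora_params=84e6)
full_ft_gb = 70e9 * (2 + 2 + 4 + 4) / 1e9   # the 840 GB figure from Section 1
reduction = full_ft_gb / mem["total"]
print(mem, round(reduction, 1))
```

The base weights account for almost all of the total, which is why the adapter rank has little influence on whether a given model fits.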

NF4 (NormalFloat4): A 4-bit data type optimized for normally distributed weights. It uses quantile-based boundaries where each of the 2⁴ = 16 bins has equal probability mass under a Gaussian, minimizing quantization error for neural network weights (which are approximately normally distributed).
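To make the quantile idea concrete, here is a simplified sketch of quantile-based 4-bit quantization with per-block absmax scaling. It is an illustration of the principle, not the exact NF4 code book: the real NF4 table is constructed differently (for example, it contains an exact zero), and the function names and the sample block below are assumptions for this example.

```python
from statistics import NormalDist

def quantile_levels(bits=4):
    """Levels at the midpoint quantile of each equal-probability bin of a
    standard normal, rescaled so the outermost levels are -1 and +1."""
    n = 2 ** bits
    std_normal = NormalDist()
    raw = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]

def quantize_block(weights, levels):
    """Scale a block into [-1, 1] by its absolute maximum, then store the
    index of the nearest level for each weight."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Map stored indices back to approximate weight values."""
    return [levels[c] * absmax for c in codes]

levels = quantile_levels(4)
block = [0.02, -0.15, 0.4, -0.01, 0.08, -0.3]   # hypothetical weight block
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
print(codes, [round(w, 3) for w in restored])
```

The levels are packed densely near zero, where most normally distributed weights fall, and spread out in the tails. Each weight is stored as a 4-bit index plus one shared scale per block.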

⚠️ THIS IS CRITICAL: QLoRA is the gateway that makes LLM fine-tuning accessible. Without QLoRA, fine-tuning a 70B model requires a multi-GPU cluster. With QLoRA, it fits on a single 48-80 GB GPU, and models in the 7B-13B range fit on a 24 GB consumer GPU.

4. Adapters

Adapters (Houlsby et al., 2019) insert small bottleneck layers between transformer sublayers:

Standard transformer block:
  x → LayerNorm → Attention → x + Attention(x)
  x → LayerNorm → FFN → x + FFN(x)

With adapters:
  x → LayerNorm → Attention → Adapter → x + Adapter(Attention(x))
  x → LayerNorm → FFN → Adapter → x + Adapter(FFN(x))

Each adapter is:

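$$\mathrm{Adapter}(x) = W_{up} \, \sigma(W_{down} \, x)$$

where W_down: ℝ^d → ℝ^r (down-projection), W_up: ℝ^r → ℝ^d (up-projection), and σ is a nonlinearity. Houlsby et al. also wrap this bottleneck in an internal skip connection, so the module outputs x + W_up · σ(W_down · x) and starts close to the identity when the projections are initialized near zero.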

Parameter count per adapter: 2dr weights plus r + d bias terms, so ≈ 2dr when the biases are ignored. With two adapters per transformer layer and L layers: ~4drL parameters. For d=4096, r=64, L=32: 4 × 4096 × 64 × 32 ≈ 33.6M trainable parameters, still small compared to the full model.
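The count can be verified with and without bias terms. This is a small sketch under the assumptions stated above (two adapters per layer, one down-projection and one up-projection each); the function names are introduced here for illustration.

```python
def adapter_params(d, r, include_bias=True):
    """Parameters in one bottleneck adapter: W_down (r x d) and W_up (d x r),
    plus an r-dimensional and a d-dimensional bias if requested."""
    weights = 2 * d * r
    biases = (r + d) if include_bias else 0
    return weights + biases

def total_adapter_params(d, r, n_layers, adapters_per_layer=2, include_bias=True):
    """Total adapter parameters across all transformer layers."""
    return n_layers * adapters_per_layer * adapter_params(d, r, include_bias)

approx = total_adapter_params(4096, 64, 32, include_bias=False)  # the ~4drL estimate
exact = total_adapter_params(4096, 64, 32)                       # with bias terms
print(approx, exact)
```

The bias terms add well under 1% here, which is why the ~4drL approximation is usually good enough for budgeting.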

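Tradeoff: Adapters add latency at inference because the adapter computations run sequentially after each sublayer and cannot be merged into the weight matrix the way LoRA can. The overhead is often quoted as a few percent (~1-3%), but it depends on batch size and sequence length and is most noticeable in small-batch, latency-sensitive serving.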

5. Prefix Tuning

Prefix tuning (Li & Liang, 2021) prepends p learned continuous vectors to the key/value in EVERY attention layer:

$$K_{\text{layer}} = [K_{\text{prefix}};\ K_{\text{input}}]$$
$$V_{\text{layer}} = [V_{\text{prefix}};\ V_{\text{input}}]$$

where K_prefix, V_prefix ∈ ℝ^{p×d} are learned parameters (one pair per layer). The attention mechanism attends to these prefix tokens as if they were additional context.

Why K and V (not Q)? The prefix influences what the model ATTENDS TO (through K, V) without changing what the model is "looking for" (Q remains computed from the actual input).
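The effect of a prefix on a single attention query can be sketched in a few lines. This toy example uses one head, one prefix vector (p = 1), and made-up numbers standing in for trained prefix parameters; the helper names `softmax` and `attend` are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over lists of key/value vectors."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return out, weights

# Keys and values computed from the real input by the frozen model.
K_input = [[1.0, 0.0], [0.0, 1.0]]
V_input = [[1.0, 2.0], [3.0, 4.0]]
# Learned prefix vectors (hypothetical values, p = 1).
K_prefix = [[2.0, 2.0]]
V_prefix = [[10.0, 10.0]]

q = [1.0, 1.0]  # the query is still computed from the actual input
out_plain, w_plain = attend(q, K_input, V_input)
out_prefix, w_prefix = attend(q, K_prefix + K_input, V_prefix + V_input)
print(out_plain, out_prefix, w_prefix)
```

The query is identical in both calls. Only the set of keys and values it can attend to changes, and the prefix position absorbs part of the attention weight and pulls the output toward its value vector.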

Parameter count: 2 × p × d × L. For GPT-2 Medium (L=24, d=1024, p=10): ~490K parameters.

Alternative — Prompt Tuning (Lester et al., 2021): Prefix tuning applied only to the INPUT embedding layer (p learned embedding vectors prepended to the input). The rest of the model stays frozen. Works best with larger models (>10B params).
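The two parameter counts can be compared directly. This is a minimal sketch using the GPT-2 Medium figures from above; the function names are introduced here, and it assumes prompt tuning uses the same p and d as the prefix setup.

```python
def prefix_tuning_params(p, d, n_layers):
    """One K prefix and one V prefix of shape (p, d) in every layer."""
    return 2 * p * d * n_layers

def prompt_tuning_params(p, d):
    """p learned embedding vectors of size d at the input layer only."""
    return p * d

prefix_count = prefix_tuning_params(p=10, d=1024, n_layers=24)
prompt_count = prompt_tuning_params(p=10, d=1024)
ratio = prefix_count // prompt_count   # equals 2 * n_layers
print(prefix_count, prompt_count, ratio)
```

Prompt tuning is smaller by a factor of 2L because it learns a single set of vectors rather than a key/value pair per layer.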

6. PEFT Method Comparison

Method	Trainable Params	Memory	Inference Overhead	Mergeable?	Best For
Full FT	All	Very High	None	—	Pre-training
LoRA	r(d+k) per adapted matrix	Medium	None (merged)	Yes	Most fine-tuning
QLoRA	r(d+k) per adapted matrix	Low	None (merged)	Yes	Consumer GPU FT
Adapters	~4drL	Medium	Small latency	No	Multi-task switching
Prefix Tuning	2pdL	Low	Small (p extra K/V positions per layer)	No	Task conditioning
Prompt Tuning	pd	Very Low	Small (p extra input tokens)	No	Large model FT

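Key insight: LoRA can be MERGED after training (W_merged = W₀ + (α/r)·BA), making inference cost identical to the base model. Adapters, prefix tuning, and prompt tuning cannot be merged: adapters add extra sequential layers, and prefix and prompt tuning lengthen the sequence the attention layers process by p positions.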
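Merging can be checked numerically: the merged matrix should give the same output as the two-path forward pass. The sketch below includes the α/r scaling from Section 2 in the merge; the helper names and the toy matrices are assumptions for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge_lora(W0, A, B, alpha):
    """Return W0 + (alpha / r) * B A, with r read off A's row count."""
    r = len(A)
    BA = matmul(B, A)
    return [[w + (alpha / r) * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W0, BA)]

W0 = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]   # d = 2, k = 3
A = [[0.5, 0.0, 1.0], [0.0, 2.0, 0.0]]    # r = 2
B = [[1.0, 0.0], [0.5, 1.0]]
alpha = 4.0
x = [1.0, 2.0, 3.0]

# Two-path forward pass: frozen weights plus scaled adapter path.
two_path = [b + (alpha / len(A)) * u
            for b, u in zip(matvec(W0, x), matvec(B, matvec(A, x)))]
# Single matrix multiply with the merged weights.
merged = matvec(merge_lora(W0, A, B, alpha), x)
print(two_path, merged)
```

After merging, inference uses one matrix of the original shape, so there is no extra computation. Keeping A and B around allows the update to be subtracted again if the base weights are needed for another task.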

Worked Examples

Example 1: LoRA Parameter Count

Problem: A 7B parameter model has hidden dimension d=4096. LoRA with r=8 is applied to all attention projection matrices (Q, K, V, O), each of shape 4096×4096. How many trainable parameters does this add? What fraction of the total model is that?

Solution:

For each weight matrix:

A: r × d = 8 × 4096 = 32,768
B: d × r = 4096 × 8 = 32,768
Per matrix: 65,536 params

4 matrices per layer, 32 layers (typical for 7B):

4 × 32 × 65,536 = 8,388,608 trainable params ≈ 8.4M

Total model: 7B params. Trainable fraction = 8.4M / 7B ≈ 0.0012 = 0.12%.

Only ~0.1% of parameters are trained — the rest are frozen.
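The same count as a reusable function, for trying other ranks or layer counts. This is a minimal sketch; the function name is hypothetical and it assumes every adapted matrix has the same d×k shape.

```python
def lora_trainable_params(d, k, r, matrices_per_layer, n_layers):
    """Trainable LoRA parameters: r * (d + k) per adapted matrix."""
    per_matrix = r * (d + k)
    return per_matrix * matrices_per_layer * n_layers

trainable = lora_trainable_params(d=4096, k=4096, r=8, matrices_per_layer=4, n_layers=32)
fraction = trainable / 7e9
print(trainable, f"{fraction:.2%}")
```

Because the count is linear in r, doubling the rank doubles the trainable parameters while the frozen 7B stay unchanged.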

Example 2: QLoRA Memory Budget

Problem: You have an RTX 4090 with 24GB VRAM. You want to fine-tune Llama 2 13B using QLoRA with r=16. The 13B model in NF4 takes 6.5 GB. LoRA adapters and optimizer states: ~0.3 GB. Batch size 1 activations: ~0.5 GB. Can this fit? What's the maximum rank r before exceeding memory?

Solution:

NF4 base model:    6.5 GB
LoRA + optimizer:  0.3 GB
Activations:       0.5 GB
────────────────────────
Total:             7.3 GB

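Easily fits in 24 GB, with 24 − 7.3 = 16.7 GB to spare. LoRA parameters scale as r×(d+k) per matrix, so the adapter weights, their gradients, and their optimizer states all grow linearly with r. Scaling the problem's ~0.3 GB at r=16 gives ~4.8 GB at r=256, for a total of about 6.5 + 4.8 + 0.5 = 11.8 GB.

Maximum rank estimate: with 24 − 6.5 − 0.5 = 17 GB available for adapters and optimizer states, linear scaling gives r ≈ 16 × 17 / 0.3 ≈ 900 before memory runs out (less in practice, since activations grow with batch size and sequence length). Memory is therefore NOT the practical bottleneck for rank; the choice is driven by how much capacity the task needs, and a very high rank may overfit on small datasets.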

Example 3: LoRA Forward Pass Computation

Problem: Given W₀ = [[1, 2], [3, 4]], A = [[0.1, 0.2]], B = [[0.3], [0.4]], α = 0.5, r = 1 (A has one row and B one column), input x = [1, 1]^T. Compute the LoRA output and compare to the unadapted output.

Solution:

W₀x = [[1, 2], [3, 4]] · [1, 1]^T = [1·1+2·1, 3·1+4·1] = [3, 7]

Ax = [0.1, 0.2] · [1, 1]^T = 0.3

BAx = B · (Ax) = [[0.3], [0.4]] · 0.3 = [0.09, 0.12]

Scaling: (α/r) = 0.5/1 = 0.5

Output = W₀x + (α/r)·BAx = [3, 7] + 0.5·[0.09, 0.12] = [3.045, 7.06]

The LoRA adaptation shifted the output slightly from [3, 7] to [3.045, 7.06], a change of about 1.5% in the first component and 0.9% in the second. This illustrates how LoRA makes small, targeted adjustments to the pre-trained model's behavior.
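The computation can be replayed in plain Python. This sketch takes the scaling factor α/r = 0.5 from the example as given and reads the adapter rank off the shape of A (one row, so the factorization is rank 1); the helper `matvec` is introduced here for illustration.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W0 = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.1, 0.2]]        # shape (r, k) = (1, 2)
B = [[0.3], [0.4]]      # shape (d, r) = (2, 1)
x = [1.0, 1.0]
scaling = 0.5           # alpha / r as used in the example

base = matvec(W0, x)                 # unadapted output
BAx = matvec(B, matvec(A, x))        # low-rank path
out = [b + scaling * u for b, u in zip(base, BAx)]
rank = len(A)
relative_change = [(o - b) / b for o, b in zip(out, base)]
print(base, out, rank, relative_change)
```

Computing B(Ax) rather than forming BA first is how the low-rank path is evaluated in practice: it costs r(d+k) multiplications instead of dk.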

Quiz

Q1: In the full fine-tuning memory estimate for a 70B model with Adam, what does the 840 GB total include?

A) Only the fp16 parameters
B) The fp16 parameters and fp16 gradients
C) Only the two fp32 Adam optimizer states
D) The fp16 parameters, the fp16 gradients, and both fp32 Adam states (m and v)

Correct: D)

If you chose A: This is incorrect. The fp16 parameters alone take 140 GB; the total also counts gradients and optimizer states.
If you chose B: This is incorrect. Parameters plus gradients come to 280 GB; the two fp32 Adam states add another 560 GB.
If you chose C: This is incorrect. The Adam states account for 560 GB; parameters and gradients add 140 GB each.
If you chose D: 140 + 140 + 280 + 280 = 840 GB. Correct!

Q2: What makes full fine-tuning (Full FT) so much more expensive than PEFT?

A) It adds extra layers that slow down inference
B) It cannot change the attention projection matrices
C) It updates every parameter, so gradients and optimizer states must be stored for the whole model
D) It requires quantizing the base model to 4 bits

Correct: C)

If you chose A: This is incorrect. Full FT leaves the architecture unchanged, so it adds no inference overhead. Its cost is in training memory.
If you chose B: This is incorrect. Full FT updates all weights, including the attention projections.
If you chose C: Gradients and Adam states for all parameters dominate the memory budget (700 of the 840 GB in the 70B example). Correct!
If you chose D: This is incorrect. 4-bit quantization of the base model is part of QLoRA, not of full fine-tuning.

Q3: Which statement about QLoRA is TRUE?

A) The frozen base model is stored in 4-bit NF4 and the LoRA adapters are trained in BF16/FP16
B) Both the base model and the LoRA adapters are trained in 4-bit precision
C) Only the LoRA adapters are quantized; the base model stays in FP16
D) QLoRA trains faster per step than LoRA because 4-bit weights are cheaper to compute with

Correct: A)

If you chose A: The base weights are frozen in NF4 and dequantized on the fly; only the 16-bit adapters receive gradients. Correct!
If you chose B: This is incorrect. The base model is frozen, and the adapters are trained in BF16/FP16, not in 4-bit.
If you chose C: This is incorrect. It is the other way around: the base model is quantized and the adapters stay in 16-bit.
If you chose D: This is incorrect. On-the-fly dequantization adds compute, so QLoRA is often slower per step. Its benefit is lower memory.

Q4: In Example 1 (d = 4096, r = 8), how many trainable parameters does LoRA add per 4096×4096 weight matrix?

A) 32,768
B) 16,777,216
C) 8,388,608
D) 65,536

Correct: D)

If you chose A: This is incorrect. 32,768 is the size of A alone (or B alone); the adapter needs both.
If you chose B: This is incorrect. 16,777,216 = 4096 × 4096 is the size of the full matrix, which stays frozen.
If you chose C: This is incorrect. 8,388,608 is the total across 4 matrices in each of 32 layers, not the count per matrix.
If you chose D: A and B contribute 8 × 4096 = 32,768 parameters each, for 65,536 per matrix. Correct!

Q5: How do QLoRA and Adapters compare?

A) Both update every weight of the base model during training
B) QLoRA inserts bottleneck layers between sublayers, while Adapters factor the weight update as BA
C) Adapters quantize the base model to NF4, while QLoRA keeps it in FP16
D) Both freeze the base model and train a small set of added parameters, but QLoRA's low-rank update can be merged while adapter layers remain as extra sequential computation

Correct: D)

If you chose A: This is incorrect. Both are PEFT methods, so the base model is frozen in each case.
If you chose B: This is incorrect. The descriptions are swapped: Adapters insert bottleneck layers, and QLoRA uses the low-rank update BA.
If you chose C: This is incorrect. NF4 quantization of the base model belongs to QLoRA, not to Adapters.
If you chose D: Both train only a small auxiliary parameter set; they differ in where it sits and whether it can be folded into the weights. Correct!

Q6: What is a common pitfall when working with Prefix Tuning?

A) Assuming the prefix changes the queries (Q) computed from the input
B) Confusing it with prompt tuning: prefix tuning learns K/V prefix vectors in every attention layer, while prompt tuning learns vectors only at the input embedding layer
C) Forgetting to initialize the B matrix to zero
D) Expecting the parameter count to be independent of the number of layers

Correct: B)

If you chose A: This is incorrect as the primary pitfall. The prefix is prepended to K and V only; Q is still computed from the actual input.
If you chose B: The two methods are easy to mix up. Prefix tuning has 2pdL parameters across all layers; prompt tuning has only pd at the input. Correct!
If you chose C: This is incorrect. Zero-initializing B is a LoRA concern and does not apply to prefix tuning.
If you chose D: This is incorrect as the primary pitfall. The count 2pdL does grow with L, but the more common confusion is with prompt tuning.

Q7: When is Prompt Tuning a reasonable choice?

A) When the base model is small and the task needs large changes to its weights
B) When the adapted weights must be merged into the base model for deployment
C) When the base model is large (above roughly 10B parameters) and you want the smallest possible set of trainable parameters
D) When the base model must be quantized to 4 bits to fit in memory

Correct: C)

If you chose A: This is incorrect. Prompt tuning works best with larger models and does not modify any weights.
If you chose B: This is incorrect. Prompt tuning cannot be merged; the learned vectors are prepended to the input at inference.
If you chose C: Prompt tuning trains only p × d parameters at the input layer and works best with models above roughly 10B parameters. Correct!
If you chose D: This is incorrect. Quantizing the base model is what QLoRA does; prompt tuning is unrelated to quantization.

Practice Problems

Problem 1

For a LoRA adapter on a weight matrix W ∈ ℝ^{1024×4096} with r=16, compute the trainable parameter count. What compression ratio does this achieve vs full fine-tuning of this matrix?

Answer

A: 16 × 4096 = 65,536
B: 1024 × 16 = 16,384
Total: 81,920 parameters

Full matrix: 1024 × 4096 = 4,194,304 parameters
Compression: 4,194,304 / 81,920 = 51.2×

LoRA uses 51× fewer parameters than full fine-tuning for this single matrix.

Problem 2

Prove that at initialization (B = 0), the LoRA-adapted model produces exactly the same output as the pre-trained model, regardless of A's initialization.

Answer

LoRA output: h = W₀x + (α/r)·BAx. At initialization, B = 0 (zero matrix), so for any A and any input x, BAx = 0·(Ax) = 0. Therefore h = W₀x + 0 = W₀x, identical to the pre-trained model. This is a key property: LoRA starts from the pre-trained model's exact behavior and deviates only as training updates B. Initializing both A and B randomly would instead perturb every adapted layer before the first gradient step.

Problem 3

You apply QLoRA with r=64 to a 70B model. The NF4 quantization stores each weight in 4 bits (0.5 bytes). The LoRA adapters use ~168 MB. What's the total GPU memory for inference-only (no optimizer states)? For training?

Answer

Inference only:

NF4 base: 70B × 0.5 bytes = 35 GB
LoRA adapters: ~0.17 GB kept alongside the 4-bit base
Dequantization buffer: about one layer in BF16 at a time, ~1-2 GB
Total: ~36-37 GB

The base model is dequantized on the fly, one layer at a time, so the full 140 GB BF16 copy is never resident in memory. Peak memory is the 35 GB of NF4 weights plus the adapters and the layer currently being processed.

Training:

NF4 base: 35 GB (frozen, no gradients/optimizer)
LoRA params (fp16): 0.17 GB
LoRA gradients (fp16): 0.17 GB
LoRA optimizer (fp32 m,v): 0.17 × 2 × 2 = 0.68 GB
Activations (batch size 1, seq=2048): ~2-3 GB
Total: ~38-39 GB

Fits on a single A100 80GB or H100 80GB with room to spare. Compare to full fine-tuning: 840 GB (11× A100s).

Problem 4

When would you choose adapters over LoRA for a production system? Provide at least two scenarios.

Answer

1. Multi-task serving with rapid switching: Adapters can be swapped without modifying the base model weights. For a system serving 20 different specialized tasks (e.g., translation, summarization, code generation), loading/unloading small adapter modules is faster than merging/unmerging LoRA weights for each request.
2. Continual learning / model updates: When you need to add new capabilities without retraining the entire system. New adapters can be added for new tasks without risking interference with existing adapters. LoRA merging would require storing full merged checkpoints per task.
3. Fine-tuning when you CANNOT modify the base model: Some deployment scenarios treat the base model as immutable. Adapters bolt on externally without touching the base weights. LoRA can also avoid touching base weights during training, but its typical use case involves eventual merging.

Problem 5

A prefix tuning setup uses p=20 prefix tokens per layer, d=4096, L=40 layers. How many trainable parameters? If the base model has 13B parameters, what's the trainable fraction?

Answer

Per layer: K_prefix (20×4096) + V_prefix (20×4096) = 163,840
Total: 40 × 163,840 = 6,553,600 ≈ 6.6M parameters

Trainable fraction: 6.6M / 13B ≈ 0.0005 = 0.05%

Even less than LoRA (~0.1%). Prefix tuning is extremely parameter-efficient but may underfit on complex tasks — it can only influence the model through attention patterns, not through weight modifications.

Summary

PEFT freezes the base model and learns a small auxiliary parameter set — reducing trainable parameters from billions to millions and memory from hundreds of GB to tens
LoRA factorizes weight updates as ΔW = BA where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, r ≪ min(d,k) — typically applied to attention projections with r=8-64
QLoRA quantizes the base model to 4-bit NF4 while training LoRA adapters in 16-bit precision, reducing memory ~23× and bringing 70B fine-tuning within reach of a single 48-80 GB GPU
Adapters, prefix tuning, and prompt tuning provide alternative PEFT approaches with different tradeoffs in latency, parameter count, and flexibility
LoRA can be merged (W = W₀ + (α/r)·BA) for zero-cost inference, a key advantage over adapters and prefix-based methods

Pitfalls

Setting LoRA rank too high for small datasets. Higher rank (r=64 or 128) gives more capacity but also more risk of overfitting. With only a few hundred training examples, r=4 or 8 often performs better than r=64 because the low-rank constraint acts as implicit regularization. Match rank to dataset size: small datasets (hundreds of examples) → r=4-8; large datasets (tens of thousands) → r=16-64.
Forgetting that B must be initialized to zero. A is initialized randomly, but B MUST be initialized to zeros. If both A and B are random, BAx contributes a random perturbation at step 0, immediately degrading the pre-trained model's behavior. Zero-initialized B ensures the adapted model starts from exactly the pre-trained weights, so training begins from known-good behavior rather than from a randomly perturbed model.
Applying LoRA to every linear layer indiscriminately. A common starting point is to apply LoRA only to attention projection matrices (Q, K, V, O). Adding LoRA to FFN layers or embedding matrices increases parameters and memory, and the benefit is not always proportional; LayerNorm has only scale and shift vectors, so there is no weight matrix to factorize. Some layers benefit more from adaptation than others, and results vary by task: the QLoRA paper, for instance, reports that adapting all linear layers was needed to match full fine-tuning in its experiments. Start with attention-only and add other layers only if needed.
Confusing QLoRA memory savings with training speed. QLoRA's 4-bit quantization reduces memory footprint by ~4×, but the on-the-fly dequantization during forward/backward passes adds computational overhead. QLoRA training is often SLOWER per step than full-precision LoRA on the same hardware. The benefit is that you can train MUCH larger models, not that training is faster.
Using PEFT when the task requires fundamental knowledge that the base model lacks. PEFT adapts existing knowledge — it cannot teach a model entirely new capabilities. If the base model has never seen code in a particular programming language, LoRA cannot teach it that language from scratch. For tasks requiring new knowledge domains, consider continued pre-training or full fine-tuning before applying PEFT for task-specific adaptation.

Key Terms

Term	Definition
PEFT	Parameter-Efficient Fine-Tuning — adapts LLMs by training a tiny fraction of parameters while freezing the base model
LoRA	Low-Rank Adaptation: ΔW = BA with rank r ≪ d,k — most popular PEFT method
Intrinsic dimension	The minimum number of parameters needed to achieve good fine-tuning performance — much smaller than total parameter count
QLoRA	Quantized LoRA: base model in 4-bit NF4 + 16-bit (BF16/FP16) LoRA adapters, for memory-constrained fine-tuning
NF4	NormalFloat4 — 4-bit data type with quantile-based bins optimized for normally distributed weights
Adapter	Bottleneck layer (W_up · σ(W_down · x)) inserted between transformer sublayers — adds small latency
Prefix tuning	Learned continuous vectors prepended to K and V in every attention layer — influences attention without changing weights
Prompt tuning	Prefix tuning applied only to the input embedding layer — simplest PEFT method
Merge	Absorbing LoRA into base weights (W_merged = W₀ + (α/r)·BA), which eliminates inference overhead

Next Steps

Continue to 21-01 — Bayesian Inference to learn the foundations of probabilistic reasoning — priors, posteriors, conjugate distributions, and how Bayesian thinking connects to modern ML.
