
20-09: Parameter-Efficient Fine-Tuning (PEFT)

Phase: 20 (Training & Fine-tuning Mathematics)
Subject: 20-09
Prerequisites: 20-05 (Instruction Tuning SFT), 20-08 (Constitutional AI and RLAIF), 17-05 (Attention Mathematics), 09-03 (SVD, for low-rank concepts), 15-01 (Floating Point Arithmetic, for quantization)
Next subject: 21-01 (Bayesian Inference)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the LoRA weight update W = W₀ + (α/r)·BA and prove it preserves the original model behavior at initialization (B = 0)
  2. Analyze why the intrinsic dimension of fine-tuning is much smaller than the full parameter count, connecting to the low-rank hypothesis
  3. Compute memory savings of QLoRA (NF4 quantization + LoRA adapters) vs full fine-tuning for a given model size
  4. Compare the five major PEFT methods (LoRA, QLoRA, Adapters, Prefix Tuning, Prompt Tuning) along the axes of trainable parameters, memory, and inference overhead
  5. Design a PEFT configuration given model size, GPU memory budget, and task-switching requirements

Core Content

1. Why PEFT?

Full fine-tuning of large language models is prohibitively expensive. A 70B parameter model with Adam optimizer requires storing:

Parameters (fp16):     70B × 2 bytes  = 140 GB
Gradients (fp16):      70B × 2 bytes  = 140 GB
Optimizer m (fp32):    70B × 4 bytes  = 280 GB
Optimizer v (fp32):    70B × 4 bytes  = 280 GB
─────────────────────────────────────────────
TOTAL:                                  840 GB

This requires 11+ A100 80GB GPUs just for memory. PEFT (Parameter-Efficient Fine-Tuning) addresses this by FREEZING the base model and learning only a small auxiliary set of parameters, reducing trainable parameters from billions to millions.
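The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `full_finetune_memory_gb` is made up for illustration, 1 GB is taken as 10⁹ bytes, and activations are deliberately left out, as in the table.

```python
import math


def full_finetune_memory_gb(n_params: float) -> dict:
    """Static training memory for fp16 weights/gradients plus fp32 Adam states."""
    bytes_per_param = {
        "parameters_fp16": 2,
        "gradients_fp16": 2,
        "adam_m_fp32": 4,
        "adam_v_fp32": 4,
    }
    # Convert bytes to GB (1 GB = 1e9 bytes, matching the table above).
    breakdown = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


mem = full_finetune_memory_gb(70e9)
gpus_needed = math.ceil(mem["total"] / 80)  # 80 GB per A100
print(mem)
print("A100 80GB GPUs needed for memory alone:", gpus_needed)
```

Changing the argument to `7e9` or `13e9` gives the corresponding budget for smaller models under the same assumptions.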

⚠️ THIS IS CRITICAL: PEFT methods (especially LoRA/QLoRA) have become the standard approach for fine-tuning LLMs. Full fine-tuning is now rare outside of pre-training and large-scale RLHF runs.


2. LoRA (Low-Rank Adaptation)

The key insight (Hu et al., 2021): the weight update ΔW during fine-tuning has low "intrinsic rank." Rather than learning a full d×k update matrix, we factor it as the product of two low-rank matrices.

Mathematical formulation:

For a pre-trained weight matrix W₀ ∈ ℝ^{d×k}:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

where:
- B ∈ ℝ^{d×r} (low-rank, r ≪ min(d,k))
- A ∈ ℝ^{r×k}
- W₀ is FROZEN (no gradients)
- Only A and B are trained

With scaling factor α:

$$h = W_0 x + \frac{\alpha}{r} BAx$$

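Setting α = r makes the scale factor α/r equal to 1, so the update reduces to plain BA. Treating α as a hyperparameter changes the size of the update without touching the optimizer settings: α acts as a learning-rate multiplier specific to the LoRA path, and dividing by r reduces the need to retune it when r changes.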

Initialization:
- A: random Gaussian initialization (small values)
- B: ZEROS, so at initialization BA = 0 and the model behaves exactly like the pre-trained model
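The formulation and initialization above can be sketched in plain Python. This is an illustrative toy, not a library implementation: the helper names `matvec` and `lora_forward`, the tiny dimensions, and the value α = 16 are all assumptions made for the example.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(W0, A, B, x, alpha):
    """h = W0 x + (alpha / r) * B (A x), with r read off as the number of rows of A."""
    r = len(A)
    base = matvec(W0, x)
    low_rank = matvec(B, matvec(A, x))  # apply A first, then B: never forms the d x k product
    return [b + (alpha / r) * u for b, u in zip(base, low_rank)]


random.seed(0)
d, k, r = 4, 3, 2
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]   # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # small Gaussian init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [1.0, -2.0, 0.5]

h_init = lora_forward(W0, A, B, x, alpha=16)
print(h_init == matvec(W0, x))  # adapted output equals the frozen model's output
```

Computing B(Ax) rather than (BA)x keeps the extra cost at r(d+k) multiply-adds per input, which is the same count as the number of trainable parameters.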

Why low-rank works:
1. The pre-trained model already captures general knowledge; adaptation only needs to specialize
2. The intrinsic dimension of fine-tuning tasks is often much smaller than the parameter count (measured in hundreds, not billions)
3. Low-rank decomposition is a form of implicit regularization: it prevents overfitting to small fine-tuning datasets

Common practice: Apply LoRA to attention projection matrices (Q, K, V, O). Typical rank r ∈ {4, 8, 16, 32, 64}.


3. QLoRA (Quantized LoRA)

QLoRA (Dettmers et al., 2023) extends LoRA by quantizing the FROZEN base model to 4-bit precision, further reducing memory:

Base model W₀: stored in 4-bit NF4 (NormalFloat4)
LoRA adapters BA: stored and trained in BF16/FP16

Forward pass:
1. Dequantize W₀ from 4-bit → BF16 (on-the-fly)
2. Compute W₀x + BAx in BF16
3. Discard dequantized W₀

Memory breakdown (70B model):

| Component | FP16 Full FT | QLoRA |
|---|---|---|
| Base model (W₀) | 140 GB | 35 GB (4-bit NF4) |
| Optimizer states | 560 GB | none (frozen) |
| Gradients (W₀) | 140 GB | none (frozen) |
| LoRA params | none | ~0.17 GB (r=64) |
| LoRA opt states | none | ~0.5 GB |
| Total | 840 GB | ~36 GB |

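A ~23× memory reduction. At roughly 36 GB, the 70B run fits on a single 48 GB or 80 GB GPU instead of a multi-GPU node. It does not fit in the 24 GB of a consumer RTX 3090/4090; those cards are enough for smaller models, such as the 13B case in Example 2.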
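A rough version of this budget can be computed directly. The sketch below is an approximation under stated assumptions: `qlora_memory_gb` is a hypothetical helper, the adapter size of 85M parameters is chosen only so that the BF16 adapter weights come to the ~0.17 GB quoted above, Adam states are counted at 8 bytes per adapter parameter, and activations and quantization constants are ignored.

```python
def qlora_memory_gb(n_base_params: float, n_lora_params: float) -> dict:
    """Approximate static memory for QLoRA training, in GB (1 GB = 1e9 bytes)."""
    breakdown = {
        "base_nf4": n_base_params * 0.5 / 1e9,     # 4 bits = 0.5 byte per frozen weight
        "lora_weights": n_lora_params * 2 / 1e9,   # BF16/FP16 adapter weights
        "lora_adam": n_lora_params * 8 / 1e9,      # FP32 m and v for adapters only
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


FULL_FT_GB = 840.0  # total from the full fine-tuning breakdown in section 1

qlora = qlora_memory_gb(n_base_params=70e9, n_lora_params=85e6)
reduction = FULL_FT_GB / qlora["total"]
print(qlora)
print(f"reduction vs full fine-tuning: {reduction:.1f}x")
```

Because the frozen base dominates the total, the result is insensitive to the exact adapter size: doubling `n_lora_params` moves the total by less than 1 GB.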

NF4 (NormalFloat4): A 4-bit data type optimized for normally distributed weights. It uses quantile-based boundaries where each of the 2⁴ = 16 bins has equal probability mass under a Gaussian, minimizing quantization error for neural network weights (which are approximately normally distributed).
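The quantile idea can be illustrated with the standard library alone. This is a simplified sketch of quantile-based 4-bit quantization, not the exact NF4 code table: the real data type is constructed asymmetrically so that zero is exactly representable, which this version does not do. The function names and the sample weights are hypothetical.

```python
from statistics import NormalDist


def gaussian_quantile_levels(bits: int = 4) -> list:
    """One level per equal-probability bin of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    # Use the probability midpoint of each bin as its representative value.
    q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(v) for v in q)
    return [v / scale for v in q]


def quantize_block(weights, levels):
    """Absmax-normalize a block, then store the index of the nearest level."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Map 4-bit codes back to approximate weights."""
    return [levels[c] * absmax for c in codes]


levels = gaussian_quantile_levels()
block = [0.12, -0.5, 0.03, 0.9, -0.27, 0.0, 0.41, -0.88]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
print(codes)
print([round(w, 3) for w in restored])
```

The levels are densest near zero, where most normally distributed weights fall, and sparsest in the tails. Each block stores its 4-bit codes plus one scale value (`absmax`).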

⚠️ THIS IS CRITICAL: QLoRA is the gateway that makes LLM fine-tuning accessible. Without QLoRA, fine-tuning a 70B model requires a multi-GPU cluster. With QLoRA, it fits on a single 48-80 GB GPU, and models around 13B fit on a high-end consumer GPU.


4. Adapters

Adapters (Houlsby et al., 2019) insert small bottleneck layers between transformer sublayers:

Standard transformer block:
  x → LayerNorm → Attention → x + Attention(x)
  x → LayerNorm → FFN → x + FFN(x)

With adapters:
  x → LayerNorm → Attention → Adapter → x + Adapter(Attention(x))
  x → LayerNorm → FFN → Adapter → x + Adapter(FFN(x))

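Each adapter is a bottleneck with an internal skip connection:

$$\mathrm{Adapter}(x) = x + W_{up}\,\sigma(W_{down}\,x)$$

where W_down: ℝ^d → ℝ^r (down-projection), W_up: ℝ^r → ℝ^d (up-projection), and σ is a nonlinearity. With the projections initialized near zero, the adapter starts out as approximately the identity, so the sublayer output passes through unchanged at the start of training.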

Parameter count per adapter: 2dr + r + d ≈ 2dr (ignoring bias terms). With two adapters per transformer layer and L layers: ~4drL parameters. For d=4096, r=64, L=32: 4 × 4096 × 64 × 32 ≈ 33.6M trainable parameters, still small compared to the full model.

Tradeoff: Adapters add a small latency overhead at inference (~1-3%) because the adapter computations are sequential (can't be merged into the weight matrix the way LoRA can).
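The parameter count and the bottleneck term W_up·σ(W_down·x) can be checked with a short sketch. Everything here is illustrative: the helper names are hypothetical, ReLU stands in for σ, bias vectors are counted but left out of the forward pass for brevity, and the zero-initialized W_up is an assumption of this sketch (mirroring LoRA's B = 0), not a claim about any particular adapter library.

```python
def adapter_param_count(d: int, r: int, with_bias: bool = True) -> int:
    """Parameters of one adapter: W_down (r x d), W_up (d x r), optional biases."""
    count = 2 * d * r
    if with_bias:
        count += r + d
    return count


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def bottleneck(h, W_down, W_up):
    """W_up * relu(W_down * h): the down-project, nonlinearity, up-project path."""
    hidden = [max(0.0, t) for t in matvec(W_down, h)]
    return matvec(W_up, hidden)


# Totals for d=4096, r=64, L=32 with two adapters per layer.
total_no_bias = 2 * 32 * adapter_param_count(4096, 64, with_bias=False)
total_with_bias = 2 * 32 * adapter_param_count(4096, 64, with_bias=True)
print(total_no_bias, total_with_bias)

# Tiny forward pass: d=3, r=2, with W_up initialized to zero.
W_down = [[0.5, -1.0, 0.25], [1.0, 0.0, -0.5]]
W_up = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
h = [1.0, 2.0, 3.0]
delta = bottleneck(h, W_down, W_up)
print(delta)  # all zeros: the adapter contributes nothing until W_up is trained
```

Unlike LoRA's update, this path contains a nonlinearity between the two projections, which is why it cannot be folded into a single weight matrix after training.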


5. Prefix Tuning

Prefix tuning (Li & Liang, 2021) prepends p learned continuous vectors to the key/value in EVERY attention layer:

$$K_{layer} = [K_{prefix};\ K_{input}], \qquad V_{layer} = [V_{prefix};\ V_{input}]$$

where K_prefix, V_prefix ∈ ℝ^{p×d} are learned parameters (one pair per layer). The attention mechanism attends to these prefix tokens as if they were additional context.

Why K and V (not Q)? The prefix influences what the model ATTENDS TO (through K, V) without changing what the model is "looking for" (Q remains computed from the actual input).

Parameter count: 2 × p × d × L. For GPT-2 Medium (L=24, d=1024, p=10): 2 × 10 × 1024 × 24 = 491,520 ≈ 490K parameters.
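A single-head toy example shows what the concatenation does to attention. This is a sketch under simplifying assumptions: one query, one head, made-up numbers, and hypothetical helper names (`attend`, `prefix_param_count`); real implementations work per head and per layer on batched tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(q, K, V):
    """Scaled dot-product attention for one query over rows of K and V."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
    return out, weights


def prefix_param_count(p: int, d: int, n_layers: int) -> int:
    """One K prefix and one V prefix of shape (p, d) in every layer."""
    return 2 * p * d * n_layers


K_input = [[1.0, 0.0], [0.0, 1.0]]
V_input = [[1.0, 2.0], [3.0, 4.0]]
K_prefix = [[0.5, 0.5]]     # learned; p = 1 here
V_prefix = [[10.0, 10.0]]   # learned
q = [1.0, 0.0]              # computed from the actual input, untouched by the prefix

out_plain, w_plain = attend(q, K_input, V_input)
out_prefixed, w_prefixed = attend(q, K_prefix + K_input, V_prefix + V_input)
print(w_plain, out_plain)
print(w_prefixed, out_prefixed)
print(prefix_param_count(p=10, d=1024, n_layers=24))
```

The query is identical in both calls. Only the set of keys and values it can attend to changes, which pulls the output toward the learned `V_prefix`.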

Alternative: Prompt Tuning (Lester et al., 2021) applies the same idea only at the INPUT embedding layer. It prepends p learned embedding vectors to the input, and the rest of the model stays frozen. It works best with larger models (>10B params).


6. PEFT Method Comparison

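| Method | Trainable Params | Memory | Inference Overhead | Mergeable? | Best For |
|---|---|---|---|---|---|
| Full FT | All | Very High | None | n/a | Pre-training |
| LoRA | r(d+k) per adapted matrix | Medium | None (merged) | Yes | Most fine-tuning |
| QLoRA | r(d+k) per adapted matrix | Low | None (merged) | Yes | Consumer GPU FT |
| Adapters | ~4drL | Medium | Small latency | No | Multi-task switching |
| Prefix Tuning | 2pdL | Low | Small (p extra K/V positions per layer) | No | Task conditioning |
| Prompt Tuning | pd | Very Low | Small (p extra input tokens) | No | Large model FT |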

Key insight: LoRA can be MERGED after training (W_merged = W₀ + (α/r)·BA), making inference cost identical to the base model. Adapters, prefix tuning, and prompt tuning cannot be merged, so they always add some overhead.
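Merging can be verified numerically on a toy example. The sketch below folds the α/r scaling from section 2 into the merged matrix and checks that the merged and unmerged paths agree. The helper names and the small matrices are made up for illustration.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def merge_lora(W0, A, B, alpha):
    """Return W0 + (alpha / r) * B A as a new d x k matrix."""
    r = len(A)
    scale = alpha / r
    d, k = len(W0), len(W0[0])
    return [
        [W0[i][j] + scale * sum(B[i][t] * A[t][j] for t in range(r)) for j in range(k)]
        for i in range(d)
    ]


W0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # d=2, k=3
A = [[0.1, 0.0, -0.2], [0.3, 0.1, 0.0]]    # r=2
B = [[0.5, -0.5], [0.2, 0.4]]
alpha = 4.0
x = [1.0, -1.0, 2.0]

# Unmerged path: two extra small matmuls at inference time.
scale = alpha / len(A)
unmerged = [b + scale * u for b, u in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Merged path: a single matmul with the same shape as the base model's.
W_merged = merge_lora(W0, A, B, alpha)
merged = matvec(W_merged, x)
print(unmerged, merged)
```

Subtracting the same term recovers W₀, so one base model can be switched between tasks by unmerging one adapter and merging another.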


Worked Examples

Example 1: LoRA Parameter Count

Problem: A 7B parameter model has hidden dimension d=4096. LoRA with r=8 is applied to all attention projection matrices (Q, K, V, O), each of shape 4096×4096. How many trainable parameters are there? What fraction of the total model is that?

Solution:

For each weight matrix:

A: r × d = 8 × 4096 = 32,768
B: d × r = 4096 × 8 = 32,768
Per matrix: 65,536 params

4 matrices per layer, 32 layers (typical for 7B):

4 × 32 × 65,536 = 8,388,608 trainable params ≈ 8.4M

Total model: 7B params. Trainable fraction = 8.4M / 7B ≈ 0.0012 = 0.12%.

Only ~0.1% of parameters are trained; the rest are frozen.
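The same count generalizes to other shapes and ranks. A minimal sketch, with `lora_param_count` as a hypothetical helper name:

```python
def lora_param_count(d: int, k: int, r: int, matrices_per_layer: int, n_layers: int) -> int:
    """Trainable LoRA parameters: r*(d + k) for each adapted d x k matrix."""
    per_matrix = r * k + d * r   # A is r x k, B is d x r
    return per_matrix * matrices_per_layer * n_layers


n_trainable = lora_param_count(d=4096, k=4096, r=8, matrices_per_layer=4, n_layers=32)
fraction = n_trainable / 7e9
print(n_trainable)
print(f"{fraction:.2%} of a 7B model")
```

Since the count is linear in r, moving from r=8 to r=64 multiplies it by 8, which is still under 1% of the 7B base model.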


Example 2: QLoRA Memory Budget

Problem: You have an RTX 4090 with 24GB VRAM. You want to fine-tune Llama 2 13B using QLoRA with r=16. The 13B model in NF4 takes 6.5 GB. LoRA adapters and optimizer states: ~0.3 GB. Batch size 1 activations: ~0.5 GB. Can this fit? What's the maximum rank r before exceeding memory?

Solution:

NF4 base model:    6.5 GB
LoRA + optimizer:  0.3 GB
Activations:       0.5 GB
────────────────────────
Total:             7.3 GB

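This fits in 24 GB with plenty of room: 24 − 7.3 = 16.7 GB remain. LoRA parameters scale as r×(d+k) per matrix, so adapter memory grows only linearly in r. For ranks in the usual range, memory is NOT the bottleneck; the choice of r is mainly about how much capacity the task needs.

Maximum rank estimate: assume Llama 2 13B dimensions (d=5120, 40 layers), LoRA on Q, K, V, O as in Example 1, and 12 bytes per trainable parameter (fp16 weight and gradient plus fp32 Adam m and v). Each unit of rank then costs 4 × 40 × (5120 + 5120) ≈ 1.64M params ≈ 20 MB. At r=16 that is ~26M params ≈ 0.3 GB, matching the figure in the problem. At r=256 it is ~419M params ≈ 5 GB, for a total of about 12 GB, still well within 24 GB. On this simplified accounting the budget is only exhausted around r ≈ 850, far beyond useful ranks, and higher rank may overfit on small datasets.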


Example 3: LoRA Forward Pass Computation

Problem: Given W₀ = [[1, 2], [3, 4]], A = [[0.1, 0.2]], B = [[0.3], [0.4]], α = 0.5, r = 1 (A is 1×2 and B is 2×1), input x = [1, 1]^T. Compute the LoRA output and compare to the unadapted output.

Solution:

W₀x = [[1, 2], [3, 4]] · [1, 1]^T = [1·1+2·1, 3·1+4·1] = [3, 7]

Ax = [0.1, 0.2] · [1, 1]^T = 0.3

BAx = B · (Ax) = [[0.3], [0.4]] · 0.3 = [0.09, 0.12]

Scaling: (α/r) = 0.5/1 = 0.5

Output = W₀x + (α/r)·BAx = [3, 7] + 0.5·[0.09, 0.12] = [3.045, 7.06]

The LoRA adaptation shifted the output slightly from [3, 7] to [3.045, 7.06], a change of about 1.5% in the first component and 0.9% in the second. This illustrates how LoRA makes small, targeted adjustments to the pre-trained model's behavior.



Quiz

Q1: For a 70B model trained with Adam, what is the total memory needed for parameters, gradients, and optimizer states under full fine-tuning?

A) 140 GB
B) 280 GB
C) 560 GB
D) 840 GB

Correct: D)

Q2: According to the method comparison, what is Full FT best suited for?

A) Fine-tuning on a consumer GPU
B) Multi-task switching
C) Pre-training
D) Task conditioning

Correct: C)

Q3: Which statement about QLoRA is TRUE?

A) QLoRA stores the frozen base model in 4-bit NF4 and trains LoRA adapters in BF16/FP16
B) QLoRA quantizes the LoRA adapters to 4 bits and keeps the base model in FP16
C) QLoRA updates the base model weights directly in 4-bit precision
D) QLoRA removes the need to dequantize weights in the forward pass

Correct: A)

Q4: In Example 1, how many trainable parameters does LoRA with r=8 add to a single 4096×4096 matrix?

A) 8,192
B) 32,768
C) 16,777,216
D) 65,536

Correct: D)

Q5: How are QLoRA and Adapters related?

A) QLoRA fine-tunes all base weights, while Adapters freeze them
B) QLoRA is a special case of Adapters
C) Adapters require a 4-bit quantized base model
D) Both are PEFT methods that freeze the base model and train a small set of added parameters

Correct: D)

Q6: What is a common pitfall when working with Prefix Tuning?

A) Assuming the prefixes change the base model's weight matrices
B) Confusing it with Prompt Tuning: Prefix Tuning learns K/V prefixes in every attention layer, while Prompt Tuning learns vectors only at the input embedding layer
C) Expecting it to need more trainable parameters than full fine-tuning
D) Assuming the prefixes modify the queries computed from the input

Correct: B)

Q7: When should you apply Prompt Tuning?

A) When the learned parameters must be merged into the weights for inference
B) When the base model is small and the task needs large weight changes
C) When the base model is large (>10B params) and you want the smallest set of trainable parameters
D) When you need learned keys and values in every attention layer

Correct: C)

Practice Problems

Problem 1

For a LoRA adapter on a weight matrix W ∈ ℝ^{1024×4096} with r=16, compute the trainable parameter count. What compression ratio does this achieve vs full fine-tuning of this matrix?

Answer
A: 16 × 4096 = 65,536
B: 1024 × 16 = 16,384
Total: 81,920 parameters

Full matrix: 1024 × 4096 = 4,194,304 parameters
Compression: 4,194,304 / 81,920 = 51.2×

LoRA uses ~51× fewer parameters than full fine-tuning for this single matrix.

Problem 2

Prove that at initialization (B = 0), the LoRA-adapted model produces exactly the same output as the pre-trained model, regardless of A's initialization.

Answer

The LoRA output is h = W₀x + (α/r)·BAx. At initialization B = 0 (the zero matrix), so BAx = 0·(Ax) = 0 for any A. Therefore h = W₀x + 0 = W₀x, identical to the pre-trained model. This is a key property: LoRA starts from the pre-trained model's exact behavior and deviates only as training progresses, so attaching the adapter cannot degrade the model at step 0, unlike an added module with a random nonzero initialization.
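The output is unchanged for any A, but training still needs A ≠ 0. A short derivation of the gradients at initialization makes this explicit. Here g = ∂ℒ/∂h denotes the upstream gradient for a loss ℒ; both symbols are notation introduced only for this sketch.

$$
\frac{\partial \mathcal{L}}{\partial B} = \frac{\alpha}{r}\, g\,(Ax)^{\top}, \qquad
\frac{\partial \mathcal{L}}{\partial A} = \frac{\alpha}{r}\, B^{\top} g\, x^{\top}
$$

At B = 0 the gradient with respect to A vanishes, while the gradient with respect to B is nonzero whenever Ax ≠ 0 and g ≠ 0. The first update therefore moves B only, and A starts to receive gradient once B is nonzero. If A were also initialized to zero, both gradients would vanish and the adapter would never train, which is why A gets a random initialization.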

Problem 3

You apply QLoRA with r=64 to a 70B model. The NF4 quantization stores each weight in 4 bits (0.5 bytes). The LoRA adapters use ~168 MB. What's the total GPU memory for inference-only (no optimizer states)? For training?

Answer

Inference only:
NF4 base: 70B × 0.5 bytes = 35 GB
LoRA adapters: ~0.17 GB if kept separate in BF16 (merging them into W₀ would require dequantized weights)
Dequantization buffer: W₀ is dequantized to BF16 on the fly, one layer at a time, so only the layer being processed is held in BF16
Total ≈ 35 GB + adapters + max_layer_size ≈ 36-37 GB

Training:
NF4 base: 35 GB (frozen, no gradients/optimizer)
LoRA params (fp16): 0.17 GB
LoRA gradients (fp16): 0.17 GB
LoRA optimizer (fp32 m,v): 0.17 × 2 × 2 = 0.68 GB
Activations (batch size 1, seq=2048): ~2-3 GB
Total: ~38-39 GB

Fits on a single A100 80GB or H100 80GB with room to spare. Compare to full fine-tuning: 840 GB (11× A100s).

Problem 4

When would you choose adapters over LoRA for a production system? Provide at least two scenarios.

Answer

1. Multi-task serving with rapid switching: Adapters can be swapped without modifying the base model weights. For a system serving 20 different specialized tasks (e.g., translation, summarization, code generation), loading/unloading small adapter modules is faster than merging/unmerging LoRA weights for each request.
2. Continual learning / model updates: When you need to add new capabilities without retraining the entire system. New adapters can be added for new tasks without risking interference with existing adapters. LoRA merging would require storing full merged checkpoints per task.
3. Fine-tuning when you CANNOT modify the base model: Some deployment scenarios treat the base model as immutable. Adapters bolt on externally without touching the base weights. LoRA can also avoid touching base weights during training, but its typical use case involves eventual merging.

Problem 5

A prefix tuning setup uses p=20 prefix tokens per layer, d=4096, L=40 layers. How many trainable parameters? If the base model has 13B parameters, what's the trainable fraction?

Answer
Per layer: K_prefix (20×4096) + V_prefix (20×4096) = 163,840
Total: 40 × 163,840 = 6,553,600 ≈ 6.6M parameters

Trainable fraction: 6.6M / 13B ≈ 0.0005 = 0.05%

Even less than LoRA (~0.1%). Prefix tuning is extremely parameter-efficient but may underfit on complex tasks: it can only influence the model through attention patterns, not through weight modifications.

Summary

  1. PEFT freezes the base model and learns a small auxiliary parameter set, reducing trainable parameters from billions to millions and memory from hundreds of GB to tens
  2. LoRA factorizes weight updates as ΔW = BA where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, r ≪ min(d,k); it is typically applied to attention projections with r=8-64
  3. QLoRA quantizes the base model to 4-bit NF4 while training LoRA adapters in BF16/FP16; this reduces memory ~23×, bringing 70B fine-tuning onto a single 48-80 GB GPU and ~13B models onto consumer GPUs
  4. Adapters, prefix tuning, and prompt tuning provide alternative PEFT approaches with different tradeoffs in latency, parameter count, and flexibility
  5. LoRA can be merged (W = W₀ + (α/r)·BA) for zero-cost inference, a key advantage over adapters and prefix-based methods


Key Terms

| Term | Definition |
|---|---|
| PEFT | Parameter-Efficient Fine-Tuning: adapts LLMs by training a tiny fraction of parameters while freezing the base model |
| LoRA | Low-Rank Adaptation: ΔW = BA with rank r ≪ min(d,k); the most popular PEFT method |
| Intrinsic dimension | The minimum number of parameters needed to achieve good fine-tuning performance; much smaller than the total parameter count |
| QLoRA | Quantized LoRA: base model in 4-bit NF4 plus BF16/FP16 LoRA adapters; brings fine-tuning within reach of single GPUs |
| NF4 | NormalFloat4: 4-bit data type with quantile-based bins optimized for normally distributed weights |
| Adapter | Bottleneck layer (W_up · σ(W_down · x)) inserted between transformer sublayers; adds small latency |
| Prefix tuning | Learned continuous vectors prepended to K and V in every attention layer; influences attention without changing weights |
| Prompt tuning | Prefix tuning applied only to the input embedding layer; the simplest PEFT method |
| Merge | Absorbing LoRA into base weights (W_merged = W₀ + (α/r)·BA); eliminates inference overhead |

Next Steps

Continue to 21-01 (Bayesian Inference) to learn the foundations of probabilistic reasoning: priors, posteriors, conjugate distributions, and how Bayesian thinking connects to modern ML.