📐 Concept diagram

15-07 — GPU Computation Model

Phase: Numerical Methods for ML | Subject: 15-07 Prerequisites: 15-05-numerical-linear-algebra-ml.md, 17-01-convolutional-neural-networks.md Next subject: 16-01-perceptron-model.md

Learning Objectives

By the end of this subject, you will be able to:

Explain the SIMT (Single Instruction, Multiple Thread) execution model
Describe the GPU memory hierarchy and its bandwidth characteristics
Define warps, thread blocks, and grid — and how they map to hardware
Explain memory coalescing and why it's critical for performance
Analyze why matrix multiplication achieves near-peak throughput on GPUs

Core Content

Why GPUs?

A GPU is a throughput-oriented processor — optimized for running thousands of simple threads simultaneously, not for running a few threads fast.

	CPU	GPU
Cores	Few (~8–64), complex	Thousands (~10,000), simple
Threads	Few, heavy (OS-managed)	Many, lightweight (HW-managed)
Cache	Large (multi-MB L3)	Small (KB-level per SM)
Bandwidth	~50–100 GB/s (DDR)	~1–2 TB/s (HBM)
Best at	Serial, branch-heavy, low latency	Parallel, data-parallel, high throughput

⚠️ CRITICAL — The GPU philosophy: Hide memory latency with massive parallelism, not with caches. When one warp stalls waiting for memory, the SM immediately switches to another warp. With enough warps in flight, the memory latency is completely hidden.

SIMT Execution Model

SIMT (Single Instruction, Multiple Threads): A single instruction is executed by many threads simultaneously, each on different data. This is the GPU's fundamental execution paradigm.

A warp = 32 threads executing in lockstep (NVIDIA). All threads in a warp execute the same instruction. If threads diverge (e.g., if/else), both paths are executed serially with disabled threads — this is warp divergence and kills performance.

A thread block = up to 1024 threads (32 warps). All threads in a block run on the same Streaming Multiprocessor (SM) and can share data through shared memory and synchronize via __syncthreads().

A grid = collection of thread blocks that execute a kernel. Blocks within a grid are independent — they can run in any order on any SM.

GPU Memory Hierarchy

Memory Type	Size (A100)	Bandwidth	Scope	Latency
Registers	256 KB/SM	~8 TB/s (aggregate)	Per-thread	~0 cycles
Shared Memory (SRAM)	164 KB/SM	~10 TB/s	Per-block	~20 cycles
L1 Cache	192 KB/SM (combined with shared)	—	Per-SM	~30 cycles
L2 Cache	40 MB	~4 TB/s	All SMs	~200 cycles
Global Memory (HBM2e)	40–80 GB	~2 TB/s	All SMs + CPU	~300–800 cycles

The ratio of compute to memory bandwidth determines whether a kernel is compute-bound or memory-bound.

Memory Coalescing

When threads in a warp access global memory, the hardware combines (coalesces) requests into as few cache-line transactions as possible.

Coalesced access (good): Thread $i$ accesses address $\text{base} + i \cdot \text{stride}$. All 32 accesses fit into a few 32-byte transactions → 1–4 memory transactions for the whole warp.

Strided/non-coalesced (bad): Thread $i$ accesses address $\text{base} + i \cdot S$ where $S$ is large. Each thread's access hits a different cache line → up to 32 memory transactions for one warp → 8–32× bandwidth reduction.

⚠️ CRITICAL — In ML frameworks, memory coalescing is why row-major layout matters. PyTorch stores tensors row-major (C order). When processing a batch, successive threads read successive columns of the same row — coalesced access. If you transpose your data layout incorrectly, you can silently lose 5–10× performance.

Why Matrix Multiply Is So Fast on GPUs

Matrix multiply $C = AB$ where $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$ is the poster child for GPU efficiency.

The tiling strategy: 1. Divide output $C$ into tiles (e.g., $128 \times 128$) 2. Each thread block computes one tile 3. Load a tile of $A$ and $B$ into shared memory 4. Compute the partial outer product using warp-level matrix operations (Tensor Cores) 5. Accumulate in registers 6. Write tile to global memory

Computational intensity: Each tile requires $M_{\text{tile}} \times N_{\text{tile}} \times K$ multiply-adds (2 ops each) with $K \times (M_{\text{tile}} + N_{\text{tile}})$ bytes loaded from global memory.

$$\text{Ops/byte} = \frac{2 \cdot M_{\text{tile}} N_{\text{tile}} K}{K (M_{\text{tile}} + N_{\text{tile}})} = \frac{2 M_{\text{tile}} N_{\text{tile}}}{M_{\text{tile}} + N_{\text{tile}}}$$

For $M_{\text{tile}} = N_{\text{tile}} = 128$: Ops/byte = $2 \cdot 16384 / 256 = 128$ ops/byte. A100 HBM bandwidth is 2 TB/s, so arithmetic required = $2 \times 10^{12} \times 128 \approx 256$ TFLOPS. A100 peak is 312 TFLOPS (bf16 tensor cores) — matrix multiply is near the roofline!

Tensor Cores

Modern NVIDIA GPUs (V100, A100, H100) have dedicated hardware for matrix multiply-accumulate:

A100 Tensor Core: Computes $D = AB + C$ where $A, B$ are 16-bit (fp16/bf16) and $C, D$ are 32-bit. One tensor core instruction processes a $16 \times 16 \times 16$ matrix multiply in a single cycle. With hundreds of tensor cores running simultaneously, this gives the ~312 TFLOPS figure.

This is why mixed-precision training is 2–8× faster than float32 — the tensor cores are designed for 16-bit inputs.

Roofline Model

The roofline model plots attainable performance vs operational intensity (FLOPs/byte):

Memory-bound region: Low ops/byte — performance limited by memory bandwidth
Compute-bound region: High ops/byte — performance limited by peak FLOPs

Matrix multiply: high ops/byte → compute-bound → achieves peak FLOPs. Element-wise ops (ReLU, add, multiply): low ops/byte → memory-bound → limited by bandwidth, not compute.

A ReLU does 1 op per element. With fp32 (4 bytes per element), that's 0.25 ops/byte — deeply memory-bound. On A100: $0.25 \times 2$ TB/s = 0.5 TFLOPS → only 0.16% of peak! This is why "fusing" kernels (combining multiple element-wise ops into one kernel that reads once, writes once) improves performance dramatically.

Key Terms

Coalesced access
Registers
SIMT
Shared Memory (SRAM)
Streaming Multiprocessor (SM)

Worked Examples

Example 1: Warp Divergence

Analyze the performance of this kernel where each thread processes one element:

if (x[threadIdx.x] > 0)
    y[threadIdx.x] = sqrt(x[threadIdx.x]);
else
    y[threadIdx.x] = 0;

Assume a warp where half the values are positive, half negative.

Solution: All 32 threads execute the condition. Threads 0–15 take the if branch (threads 16–31 are masked off and idle). Then threads 16–31 take the else branch (threads 0–15 idle). Both paths are executed by the full warp serially → 2× the instruction count of a non-divergent warp.

For a warp where threads 0, 2, 4, ... (even indices) take one branch and odd indices take the other: all 32 threads go through BOTH branches with their complement masked → still 2× cost. Worst case for N-way branch: N× cost.

Mitigation: Structure data so that threads within a warp follow the same path (sort inputs, batch by condition).

Click for answer

Half the warp takes each branch → both branches execute serially → 2× instruction count. Even a single divergent thread in a warp forces both paths to execute.

Example 2: Memory Coalescing

A warp of 32 threads accesses a float array. Thread $i$ reads $data[i * 1024]$. Analyze coalescing.

Solution: data is an array of floats (4 bytes each). Thread 0 reads &data[0], thread 1 reads &data[1024] = $&data[0] + 4096$, thread 2 reads &data[2048] = $&data[0] + 8192$.

Each access is 4096 bytes apart. A cache line is 32 bytes (8 floats). Each thread's access hits a DIFFERENT cache line → 32 separate memory transactions for the warp → 32× the latency and 1/32nd of the effective bandwidth of coalesced access.

Fix: Access data[i] instead — successive threads access successive elements → all 32 accesses fit in $32 \times 4 = 128$ bytes → 4 cache line transactions (128/32 = 4).

Click for answer

Stride-1024 access: 32 separate transactions → 32× bandwidth waste. Stride-1 access: 4 transactions → full coalescing. This is a 32× performance difference from data layout alone.

Example 3: Roofline for Layer Normalization

Layer norm computes: $\mu = \frac{1}{d}\sum x_i$, $\sigma^2 = \frac{1}{d}\sum(x_i-\mu)^2$, $y_i = \gamma\frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta$. Analyze ops/byte.

Solution: Per element: reads $x_i$, writes $y_i$ (8 bytes). Plus reads of $\gamma$, $\beta$ (amortized). ~5 FLOPs per element (subtract, multiply, add, divide, multiply). Ops/byte = $5/8 \approx 0.625$ — deeply memory-bound.

Performance on A100: $0.625 \times 2$ TB/s ≈ 1.25 TFLOPS → ~0.4% of peak. Layer norm is bandwidth-limited, not compute-limited.

This is why fused kernels (e.g., FlashAttention, fused layernorm) that combine multiple operations into a single kernel pass are crucial — they reduce memory traffic.

Click for answer

Layer norm: ~0.625 ops/byte — memory-bound. Achieves only ~0.4% of A100's peak compute. Fused implementations improve by reducing redundant memory reads/writes.

Quiz

Q1: What does the concept of Streaming Multiprocessor (SM) primarily refer to in this subject?

A) The definition and application of Streaming Multiprocessor (SM) B) A visual representation of Streaming Multiprocessor (SM) C) A historical anecdote about Streaming Multiprocessor (SM) D) A computational error related to Streaming Multiprocessor (SM)

Correct: A)

If you chose A: Streaming Multiprocessor (SM) is defined as: the definition and application of streaming multiprocessor (sm). The other options describe different aspects that are not the primary focus. Correct!
If you chose B: This is incorrect. Streaming Multiprocessor (SM) is defined as: the definition and application of streaming multiprocessor (sm). The other options describe different aspects that are not the primary focus.
If you chose C: This is incorrect. Streaming Multiprocessor (SM) is defined as: the definition and application of streaming multiprocessor (sm). The other options describe different aspects that are not the primary focus.
If you chose D: This is incorrect. Streaming Multiprocessor (SM) is defined as: the definition and application of streaming multiprocessor (sm). The other options describe different aspects that are not the primary focus.

Q2: Which of the following is the key formula discussed in this subject?

A) A simplified version of \text{base} + i \cdot \text... B) An unrelated formula from a different topic C) \text{base} + i \cdot \text{stride} D) The inverse operation of the formula in question

Correct: C)

If you chose A: This is incorrect. The formula \text{base} + i \cdot \text{stride} is central to this subject. The other options are either simplified versions or unrelated.
If you chose B: This is incorrect. The formula \text{base} + i \cdot \text{stride} is central to this subject. The other options are either simplified versions or unrelated.
If you chose C: The formula \text{base} + i \cdot \text{stride} is central to this subject. The other options are either simplified versions or unrelated. Correct!
If you chose D: This is incorrect. The formula \text{base} + i \cdot \text{stride} is central to this subject. The other options are either simplified versions or unrelated.

Q3: What is the primary purpose of Registers?

A) It is primarily a historical notation system B) It is used only in advanced research contexts C) It is used to registers in mathematical analysis D) It replaces all other methods in this domain

Correct: C)

If you chose A: This is incorrect. Registers serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose B: This is incorrect. Registers serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose C: Registers serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
If you chose D: This is incorrect. Registers serves the purpose described in the correct answer. The other options misrepresent its role.

Q4: Which statement about Shared Memory (SRAM) is TRUE?

A) Shared Memory (SRAM) is a fundamental concept covered in this subject B) Shared Memory (SRAM) is mentioned only as a historical footnote C) Shared Memory (SRAM) is not related to this subject D) Shared Memory (SRAM) is an advanced topic beyond this subject's scope

Correct: A)

If you chose A: Shared Memory (SRAM) is a fundamental concept covered in this subject. This subject covers Shared Memory (SRAM) as part of its core content. Correct!
If you chose B: This is incorrect. Shared Memory (SRAM) is a fundamental concept covered in this subject. This subject covers Shared Memory (SRAM) as part of its core content.
If you chose C: This is incorrect. Shared Memory (SRAM) is a fundamental concept covered in this subject. This subject covers Shared Memory (SRAM) as part of its core content.
If you chose D: This is incorrect. Shared Memory (SRAM) is a fundamental concept covered in this subject. This subject covers Shared Memory (SRAM) as part of its core content.

Q5: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) A different result from a common mistake C) The inverse of the correct answer D) Memory Coalescing

Correct: D)

If you chose A: This is incorrect. The worked examples show that the result is Memory Coalescing. The other options represent common errors.
If you chose B: This is incorrect. The worked examples show that the result is Memory Coalescing. The other options represent common errors.
If you chose C: This is incorrect. The worked examples show that the result is Memory Coalescing. The other options represent common errors.
If you chose D: The worked examples show that the result is Memory Coalescing. The other options represent common errors. Correct!

Q6: How are Shared Memory (SRAM) and Coalesced access related?

A) Shared Memory (SRAM) is the inverse of Coalesced access B) Shared Memory (SRAM) and Coalesced access are closely related concepts C) Shared Memory (SRAM) is a special case of Coalesced access D) Shared Memory (SRAM) and Coalesced access are completely unrelated topics

Correct: B)

If you chose A: This is incorrect. Both Shared Memory (SRAM) and Coalesced access are covered in this subject as interconnected topics.
If you chose B: Both Shared Memory (SRAM) and Coalesced access are covered in this subject as interconnected topics. Correct!
If you chose C: This is incorrect. Both Shared Memory (SRAM) and Coalesced access are covered in this subject as interconnected topics.
If you chose D: This is incorrect. Both Shared Memory (SRAM) and Coalesced access are covered in this subject as interconnected topics.

Q7: What is a common pitfall when working with Why Gpus??

A) The main error with Why Gpus? is using it when it is not needed B) Why Gpus? is always computed the same way in all contexts C) A common mistake is confusing Why Gpus? with a similar concept D) Why Gpus? has no common misconceptions

Correct: C)

If you chose A: This is incorrect. Students often confuse Why Gpus? with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose B: This is incorrect. Students often confuse Why Gpus? with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose C: Students often confuse Why Gpus? with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
If you chose D: This is incorrect. Students often confuse Why Gpus? with similar-sounding or related concepts. Pay attention to the precise definitions.

Q8: When should you apply Simt Execution Model?

A) Apply Simt Execution Model to solve problems in this subject's domain B) Use Simt Execution Model only in pure mathematics contexts C) Avoid Simt Execution Model unless explicitly instructed D) Simt Execution Model is not practically useful

Correct: A)

If you chose A: Simt Execution Model is a practical tool used throughout this subject to solve relevant problems. Correct!
If you chose B: This is incorrect. Simt Execution Model is a practical tool used throughout this subject to solve relevant problems.
If you chose C: This is incorrect. Simt Execution Model is a practical tool used throughout this subject to solve relevant problems.
If you chose D: This is incorrect. Simt Execution Model is a practical tool used throughout this subject to solve relevant problems.

Practice Problems

An A100 GPU has 108 SMs, each capable of 64 warps (2048 threads) in flight. What's the total number of threads that can be in flight simultaneously?

Click for answer
$108 \times 2048 = 221,184$ threads in flight. If each thread has 32 registers (32 × 4 bytes = 128 bytes), total register file size needed: $221,184 \times 128 \approx 28.3$ MB. A100 has 256 KB per SM × 108 SM = 27.6 MB total register file — matches closely. This massive thread count is how GPUs hide memory latency.
A kernel has ops/byte = 4. On an A100 (312 TFLOPS peak, 2 TB/s bandwidth), is it compute-bound or memory-bound? What's the maximum achievable TFLOPS?

Click for answer
Roofline ceiling from bandwidth: $4 \times 2$ TB/s = 8 TFLOPS. 8 TFLOPS < 312 TFLOPS → kernel is memory-bound. Maximum achievable is 8 TFLOPS — only 2.6% of peak. To become compute-bound, ops/byte must exceed $312/2 = 156$ ops/byte. Only operations like large matrix multiplies and convolutions reach this level.
Explain why a reduction (sum over a large array) has low ops/byte and how to optimize it.

Click for answer
Summing $N$ floats: 1 op per element (add), 4 bytes read (float), 4 bytes written (result — amortized over $N$). Ops/byte ≈ $1/4 = 0.25$ — extremely memory-bound. Optimization: tree-based reduction in shared memory. Load chunks into shared memory (coalesced), do hierarchical reduction within each block (exploiting shared memory bandwidth), then do a final atomic add. This reduces global memory traffic to ~1 read per element. Even so, reduction is always bandwidth-bound on GPUs.
A $1024 \times 1024$ matrix multiply uses tiles of $128 \times 128$. How many thread blocks are launched?

Click for answer
Output is $1024 \times 1024$. Each block computes a $128 \times 128$ tile. Blocks needed: $(1024/128) \times (1024/128) = 8 \times 8 = 64$ thread blocks. Each block may use e.g., $16 \times 16 = 256$ threads — each thread computes an $8 \times 8$ sub-tile of the output. Total threads: $64 \times 256 = 16,384$ — the GPU would launch many more blocks than SMs (occupancy) to hide latency.
Why do tensor cores require specific matrix dimensions (e.g., multiples of 16)?

Click for answer
Tensor cores process $16 \times 16 \times 16$ matrix multiplies (for fp16/bf16 on A100). The input matrices must align to this tile size. If dimensions aren't multiples of 16, the matrix must be padded, wasting some compute. This is why model hidden dimensions are often multiples of 64 or 128 — to maximize tensor core utilization. A hidden size of 1024 uses tensor cores perfectly; 1025 would leave the last row/column underutilized.

Summary

Key takeaways:

GPUs are throughput-oriented: thousands of simple cores, massive parallelism hides memory latency
SIMT: a single instruction executes across 32-thread warps in lockstep
Memory hierarchy: registers (fastest) → shared memory → L1/L2 → HBM (slowest, largest)
Memory coalescing: threads should access contiguous memory → 1–4 transactions per warp instead of 32
Warp divergence: if threads in a warp take different paths, both execute serially — avoid when possible
Matrix multiply achieves near-peak throughput via tiling in shared memory + tensor cores
Roofline model: ops/byte determines whether a kernel is compute-bound or memory-bound
Element-wise ops and reductions are memory-bound — fuse them to reduce memory traffic
Tensor cores accelerate $16 \times 16 \times 16$ matrix multiplies in hardware → model dims should be multiples of 16

Pitfalls

Ignoring memory coalescing in custom CUDA kernels: When threads in a warp access non-contiguous memory (e.g., $data[i * 1024]$), each thread hits a different cache line, resulting in 32 separate memory transactions instead of 1–4. This silently reduces effective bandwidth by 8–32×. In PyTorch, the most common cause is transposing tensors into a layout where the fastest-changing dimension is not the innermost. Always ensure row-major data layout for GPU kernels.
Assuming all GPU operations are fast: Element-wise ops (ReLU, add, multiply) have very low ops/byte (~0.25 for fp32) and are deeply memory-bound on modern GPUs. A 1000-layer network of pure element-wise ops would achieve < 1% of peak FLOPs. Fusing multiple element-wise ops into a single kernel (which reads once, writes once) is the solution — this is why frameworks provide fused operations like F.gelu() or fused layernorm.
Launching too few thread blocks to hide latency: GPUs hide memory latency by switching between warps. If you launch only enough blocks to fill each SM once, the GPU stalls whenever all warps are waiting for memory. As a rule of thumb, launch at least 4–8× more blocks than SMs to ensure sufficient occupancy. For an A100 with 108 SMs, aim for at least 400–800 blocks.
Writing kernels with warp divergence from data-dependent branching: When threads within a warp take different branches (e.g., if (x[i] > threshold)), both paths execute serially with masked threads. Even a single divergent thread forces the entire warp through both branches. For conditionally-executed code, restructure data so that threads in the same warp follow the same path, or use predicated instructions where possible.
Using non-aligned dimensions with tensor cores: Tensor cores on A100 process $16 \times 16 \times 16$ tiles (fp16/bf16). If your matrix dimensions are not multiples of 16, the hardware must pad or use slower fallback paths. Hidden dimensions like 1025 instead of 1024 waste ~6% of tensor core throughput on the last partial tile. Common practice is to use multiples of 64 or 128 for all model dimensions.

Next Steps

This concludes Section 15. The next phase is Neural Network Mathematics, starting with 16-01-perceptron-model.md — neuron models, activation functions, and the universal approximation theorem.

Progress

Phases

15-07 — GPU Computation Model

Learning Objectives

Core Content

Why GPUs?

SIMT Execution Model

GPU Memory Hierarchy

Memory Coalescing

Why Matrix Multiply Is So Fast on GPUs

Tensor Cores

Roofline Model

Key Terms

Worked Examples

Example 1: Warp Divergence

Example 2: Memory Coalescing

Example 3: Roofline for Layer Normalization

Quiz

Practice Problems

Summary

Pitfalls

Next Steps