Math graphic
📐 Concept diagram

15-07 — GPU Computation Model

Phase: Numerical Methods for ML | Subject: 15-07 Prerequisites: 15-05-numerical-linear-algebra-ml.md, 17-01-convolutional-neural-networks.md Next subject: 16-01-perceptron-model.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Explain the SIMT (Single Instruction, Multiple Thread) execution model
  2. Describe the GPU memory hierarchy and its bandwidth characteristics
  3. Define warps, thread blocks, and grid — and how they map to hardware
  4. Explain memory coalescing and why it's critical for performance
  5. Analyze why matrix multiplication achieves near-peak throughput on GPUs

Core Content

Why GPUs?

A GPU is a throughput-oriented processor — optimized for running thousands of simple threads simultaneously, not for running a few threads fast.

CPU GPU
Cores Few (~8–64), complex Thousands (~10,000), simple
Threads Few, heavy (OS-managed) Many, lightweight (HW-managed)
Cache Large (multi-MB L3) Small (KB-level per SM)
Bandwidth ~50–100 GB/s (DDR) ~1–2 TB/s (HBM)
Best at Serial, branch-heavy, low latency Parallel, data-parallel, high throughput

⚠️ CRITICAL — The GPU philosophy: Hide memory latency with massive parallelism, not with caches. When one warp stalls waiting for memory, the SM immediately switches to another warp. With enough warps in flight, the memory latency is completely hidden.

SIMT Execution Model

SIMT (Single Instruction, Multiple Threads): A single instruction is executed by many threads simultaneously, each on different data. This is the GPU's fundamental execution paradigm.

A warp = 32 threads executing in lockstep (NVIDIA). All threads in a warp execute the same instruction. If threads diverge (e.g., if/else), both paths are executed serially with disabled threads — this is warp divergence and kills performance.

A thread block = up to 1024 threads (32 warps). All threads in a block run on the same Streaming Multiprocessor (SM) and can share data through shared memory and synchronize via __syncthreads().

A grid = collection of thread blocks that execute a kernel. Blocks within a grid are independent — they can run in any order on any SM.

GPU Memory Hierarchy

Memory Type Size (A100) Bandwidth Scope Latency
Registers 256 KB/SM ~8 TB/s (aggregate) Per-thread ~0 cycles
Shared Memory (SRAM) 164 KB/SM ~10 TB/s Per-block ~20 cycles
L1 Cache 192 KB/SM (combined with shared) Per-SM ~30 cycles
L2 Cache 40 MB ~4 TB/s All SMs ~200 cycles
Global Memory (HBM2e) 40–80 GB ~2 TB/s All SMs + CPU ~300–800 cycles

The ratio of compute to memory bandwidth determines whether a kernel is compute-bound or memory-bound.

Memory Coalescing

When threads in a warp access global memory, the hardware combines (coalesces) requests into as few cache-line transactions as possible.

Coalesced access (good): Thread $i$ accesses address $\text{base} + i \cdot \text{stride}$. All 32 accesses fit into a few 32-byte transactions → 1–4 memory transactions for the whole warp.

Strided/non-coalesced (bad): Thread $i$ accesses address $\text{base} + i \cdot S$ where $S$ is large. Each thread's access hits a different cache line → up to 32 memory transactions for one warp → 8–32× bandwidth reduction.

⚠️ CRITICAL — In ML frameworks, memory coalescing is why row-major layout matters. PyTorch stores tensors row-major (C order). When processing a batch, successive threads read successive columns of the same row — coalesced access. If you transpose your data layout incorrectly, you can silently lose 5–10× performance.

Why Matrix Multiply Is So Fast on GPUs

Matrix multiply $C = AB$ where $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$ is the poster child for GPU efficiency.

The tiling strategy: 1. Divide output $C$ into tiles (e.g., $128 \times 128$) 2. Each thread block computes one tile 3. Load a tile of $A$ and $B$ into shared memory 4. Compute the partial outer product using warp-level matrix operations (Tensor Cores) 5. Accumulate in registers 6. Write tile to global memory

Computational intensity: Each tile requires $M_{\text{tile}} \times N_{\text{tile}} \times K$ multiply-adds (2 ops each) with $K \times (M_{\text{tile}} + N_{\text{tile}})$ bytes loaded from global memory.

$$\text{Ops/byte} = \frac{2 \cdot M_{\text{tile}} N_{\text{tile}} K}{K (M_{\text{tile}} + N_{\text{tile}})} = \frac{2 M_{\text{tile}} N_{\text{tile}}}{M_{\text{tile}} + N_{\text{tile}}}$$

For $M_{\text{tile}} = N_{\text{tile}} = 128$: Ops/byte = $2 \cdot 16384 / 256 = 128$ ops/byte. A100 HBM bandwidth is 2 TB/s, so arithmetic required = $2 \times 10^{12} \times 128 \approx 256$ TFLOPS. A100 peak is 312 TFLOPS (bf16 tensor cores) — matrix multiply is near the roofline!

Tensor Cores

Modern NVIDIA GPUs (V100, A100, H100) have dedicated hardware for matrix multiply-accumulate:

A100 Tensor Core: Computes $D = AB + C$ where $A, B$ are 16-bit (fp16/bf16) and $C, D$ are 32-bit. One tensor core instruction processes a $16 \times 16 \times 16$ matrix multiply in a single cycle. With hundreds of tensor cores running simultaneously, this gives the ~312 TFLOPS figure.

This is why mixed-precision training is 2–8× faster than float32 — the tensor cores are designed for 16-bit inputs.

Roofline Model

The roofline model plots attainable performance vs operational intensity (FLOPs/byte):

Matrix multiply: high ops/byte → compute-bound → achieves peak FLOPs. Element-wise ops (ReLU, add, multiply): low ops/byte → memory-bound → limited by bandwidth, not compute.

A ReLU does 1 op per element. With fp32 (4 bytes per element), that's 0.25 ops/byte — deeply memory-bound. On A100: $0.25 \times 2$ TB/s = 0.5 TFLOPS → only 0.16% of peak! This is why "fusing" kernels (combining multiple element-wise ops into one kernel that reads once, writes once) improves performance dramatically.



Key Terms

Worked Examples

Example 1: Warp Divergence

Analyze the performance of this kernel where each thread processes one element:

if (x[threadIdx.x] > 0)
    y[threadIdx.x] = sqrt(x[threadIdx.x]);
else
    y[threadIdx.x] = 0;

Assume a warp where half the values are positive, half negative.

Solution: All 32 threads execute the condition. Threads 0–15 take the if branch (threads 16–31 are masked off and idle). Then threads 16–31 take the else branch (threads 0–15 idle). Both paths are executed by the full warp serially → 2× the instruction count of a non-divergent warp.

For a warp where threads 0, 2, 4, ... (even indices) take one branch and odd indices take the other: all 32 threads go through BOTH branches with their complement masked → still 2× cost. Worst case for N-way branch: N× cost.

Mitigation: Structure data so that threads within a warp follow the same path (sort inputs, batch by condition).

Click for answer Half the warp takes each branch → both branches execute serially → 2× instruction count. Even a single divergent thread in a warp forces both paths to execute.

Example 2: Memory Coalescing

A warp of 32 threads accesses a float array. Thread $i$ reads $data[i * 1024]$. Analyze coalescing.

Solution: data is an array of floats (4 bytes each). Thread 0 reads &data[0], thread 1 reads &data[1024] = $&data[0] + 4096$, thread 2 reads &data[2048] = $&data[0] + 8192$.

Each access is 4096 bytes apart. A cache line is 32 bytes (8 floats). Each thread's access hits a DIFFERENT cache line → 32 separate memory transactions for the warp → 32× the latency and 1/32nd of the effective bandwidth of coalesced access.

Fix: Access data[i] instead — successive threads access successive elements → all 32 accesses fit in $32 \times 4 = 128$ bytes → 4 cache line transactions (128/32 = 4).

Click for answer Stride-1024 access: 32 separate transactions → 32× bandwidth waste. Stride-1 access: 4 transactions → full coalescing. This is a 32× performance difference from data layout alone.

Example 3: Roofline for Layer Normalization

Layer norm computes: $\mu = \frac{1}{d}\sum x_i$, $\sigma^2 = \frac{1}{d}\sum(x_i-\mu)^2$, $y_i = \gamma\frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta$. Analyze ops/byte.

Solution: Per element: reads $x_i$, writes $y_i$ (8 bytes). Plus reads of $\gamma$, $\beta$ (amortized). ~5 FLOPs per element (subtract, multiply, add, divide, multiply). Ops/byte = $5/8 \approx 0.625$ — deeply memory-bound.

Performance on A100: $0.625 \times 2$ TB/s ≈ 1.25 TFLOPS → ~0.4% of peak. Layer norm is bandwidth-limited, not compute-limited.

This is why fused kernels (e.g., FlashAttention, fused layernorm) that combine multiple operations into a single kernel pass are crucial — they reduce memory traffic.

Click for answer Layer norm: ~0.625 ops/byte — memory-bound. Achieves only ~0.4% of A100's peak compute. Fused implementations improve by reducing redundant memory reads/writes.


Quiz

Q1: What does the concept of Streaming Multiprocessor (SM) primarily refer to in this subject?

A) The definition and application of Streaming Multiprocessor (SM) B) A visual representation of Streaming Multiprocessor (SM) C) A historical anecdote about Streaming Multiprocessor (SM) D) A computational error related to Streaming Multiprocessor (SM)

Correct: A)

Q2: Which of the following is the key formula discussed in this subject?

A) A simplified version of \text{base} + i \cdot \text... B) An unrelated formula from a different topic C) \text{base} + i \cdot \text{stride} D) The inverse operation of the formula in question

Correct: C)

Q3: What is the primary purpose of Registers?

A) It is primarily a historical notation system B) It is used only in advanced research contexts C) It is used to registers in mathematical analysis D) It replaces all other methods in this domain

Correct: C)

Q4: Which statement about Shared Memory (SRAM) is TRUE?

A) Shared Memory (SRAM) is a fundamental concept covered in this subject B) Shared Memory (SRAM) is mentioned only as a historical footnote C) Shared Memory (SRAM) is not related to this subject D) Shared Memory (SRAM) is an advanced topic beyond this subject's scope

Correct: A)

Q5: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) A different result from a common mistake C) The inverse of the correct answer D) Memory Coalescing

Correct: D)

Q6: How are Shared Memory (SRAM) and Coalesced access related?

A) Shared Memory (SRAM) is the inverse of Coalesced access B) Shared Memory (SRAM) and Coalesced access are closely related concepts C) Shared Memory (SRAM) is a special case of Coalesced access D) Shared Memory (SRAM) and Coalesced access are completely unrelated topics

Correct: B)

Q7: What is a common pitfall when working with Why Gpus??

A) The main error with Why Gpus? is using it when it is not needed B) Why Gpus? is always computed the same way in all contexts C) A common mistake is confusing Why Gpus? with a similar concept D) Why Gpus? has no common misconceptions

Correct: C)

Q8: When should you apply Simt Execution Model?

A) Apply Simt Execution Model to solve problems in this subject's domain B) Use Simt Execution Model only in pure mathematics contexts C) Avoid Simt Execution Model unless explicitly instructed D) Simt Execution Model is not practically useful

Correct: A)

Practice Problems

  1. An A100 GPU has 108 SMs, each capable of 64 warps (2048 threads) in flight. What's the total number of threads that can be in flight simultaneously?

    Click for answer $108 \times 2048 = 221,184$ threads in flight. If each thread has 32 registers (32 × 4 bytes = 128 bytes), total register file size needed: $221,184 \times 128 \approx 28.3$ MB. A100 has 256 KB per SM × 108 SM = 27.6 MB total register file — matches closely. This massive thread count is how GPUs hide memory latency.

  2. A kernel has ops/byte = 4. On an A100 (312 TFLOPS peak, 2 TB/s bandwidth), is it compute-bound or memory-bound? What's the maximum achievable TFLOPS?

    Click for answer Roofline ceiling from bandwidth: $4 \times 2$ TB/s = 8 TFLOPS. 8 TFLOPS < 312 TFLOPS → kernel is memory-bound. Maximum achievable is 8 TFLOPS — only 2.6% of peak. To become compute-bound, ops/byte must exceed $312/2 = 156$ ops/byte. Only operations like large matrix multiplies and convolutions reach this level.

  3. Explain why a reduction (sum over a large array) has low ops/byte and how to optimize it.

    Click for answer Summing $N$ floats: 1 op per element (add), 4 bytes read (float), 4 bytes written (result — amortized over $N$). Ops/byte ≈ $1/4 = 0.25$ — extremely memory-bound. Optimization: tree-based reduction in shared memory. Load chunks into shared memory (coalesced), do hierarchical reduction within each block (exploiting shared memory bandwidth), then do a final atomic add. This reduces global memory traffic to ~1 read per element. Even so, reduction is always bandwidth-bound on GPUs.

  4. A $1024 \times 1024$ matrix multiply uses tiles of $128 \times 128$. How many thread blocks are launched?

    Click for answer Output is $1024 \times 1024$. Each block computes a $128 \times 128$ tile. Blocks needed: $(1024/128) \times (1024/128) = 8 \times 8 = 64$ thread blocks. Each block may use e.g., $16 \times 16 = 256$ threads — each thread computes an $8 \times 8$ sub-tile of the output. Total threads: $64 \times 256 = 16,384$ — the GPU would launch many more blocks than SMs (occupancy) to hide latency.

  5. Why do tensor cores require specific matrix dimensions (e.g., multiples of 16)?

    Click for answer Tensor cores process $16 \times 16 \times 16$ matrix multiplies (for fp16/bf16 on A100). The input matrices must align to this tile size. If dimensions aren't multiples of 16, the matrix must be padded, wasting some compute. This is why model hidden dimensions are often multiples of 64 or 128 — to maximize tensor core utilization. A hidden size of 1024 uses tensor cores perfectly; 1025 would leave the last row/column underutilized.


Summary

Key takeaways:


Pitfalls



Next Steps

This concludes Section 15. The next phase is Neural Network Mathematics, starting with 16-01-perceptron-model.md — neuron models, activation functions, and the universal approximation theorem.