17-01 — Convolutional Neural Networks
Phase: 17 — Deep Learning Architectures (Math) Subject: 17-01 Prerequisites: 16-05 (Backpropagation), 16-02 (Activation Functions), 16-06 (Gradient Flow), 16-08 (Regularization) Next subject: 17-02 — Recurrent Neural Networks (RNNs)
Learning Objectives
By the end of this subject, you will be able to:
- Derive the discrete convolution (cross-correlation) operation and explain its role as a feature detector
- Compute output spatial dimensions given input size, kernel size, stride, and padding
- Derive the backward pass through convolution — gradient with respect to input (transposed convolution) and gradient with respect to weights
- Explain parameter sharing and receptive field growth mathematically
- Analyze the computational efficiency of convolution vs. dense layers
Core Content
1. The Convolution Operation
In deep learning, what we call "convolution" is technically cross-correlation. The true mathematical convolution flips the kernel; deep learning frameworks skip the flip because the kernel is learned anyway. We'll follow the DL convention.
2D Convolution (cross-correlation):
For input X ∈ ℝ^(H×W), kernel K ∈ ℝ^(k×k), the output Y at position (i,j) is:
Y[i,j] = Σ_{p=0}^{k-1} Σ_{q=0}^{k-1} X[i+p, j+q] · K[p,q]
This is simply sliding the kernel over the input and computing a dot product at each position.
Multi-channel case:
For input X ∈ ℝ^(C_in×H×W) and kernel K ∈ ℝ^(C_out×C_in×k×k), with bias b ∈ ℝ^C_out:
Y[c_out, i, j] = b[c_out] + Σ_{c_in=0}^{C_in-1} Σ_{p=0}^{k-1} Σ_{q=0}^{k-1} X[c_in, i+p, j+q] · K[c_out, c_in, p, q]
Each output channel is a weighted sum of all input channels, each convolved with its own 2D kernel.
2. Stride and Padding
Stride s: How many pixels we shift the kernel at each step. Output dimensions:
H_out = ⌊(H_in − k + 2P) / s⌋ + 1 W_out = ⌊(W_in − k + 2P) / s⌋ + 1
Where P is the padding width (pixels added to each side).
Padding types: - "Valid" (P=0): No padding. Output shrinks: H_out = H_in − k + 1 - "Same" (P = ⌊k/2⌋): Output size = input size (for stride=1). Requires P = ⌊k/2⌋ so that H_out = H_in.
Common case: k=3, P=1, s=1 → output same size as input. This is the standard building block.
3. Parameter Sharing — The Key Insight
Consider a 256×256 RGB image with 64 output channels using 3×3 kernels:
- Dense layer: (256×256×3) × (256×256×3×64) = 9.66 billion parameters — impossible
- Convolution: 64 × 3 × 3 × 3 + 64 = 1,792 parameters
Why sharing works: A feature detector (e.g., edge detector) that's useful at one spatial location is useful everywhere. Translation equivariance is baked in: if you shift the input, the output shifts by the same amount (modulo stride effects).
⚠️ THIS IS CRITICAL — Parameter sharing is what makes CNNs feasible. It also encodes the inductive bias of translation equivariance, which is why CNNs work so well on images but poorly on data without spatial structure.
4. Receptive Field
The receptive field of a neuron is the region of the original input that affects its value.
Recursive formula: For layer ℓ with kernel size k_ℓ and stride s_ℓ:
r_ℓ = r_{ℓ-1} + (k_ℓ − 1) · ∏_{i=1}^{ℓ-1} s_i
For a network with only stride-1 3×3 convolutions: - Layer 1: r₁ = 3 - Layer 2: r₂ = 3 + 2 = 5 - Layer 3: r₃ = 5 + 2 = 7 - Layer L: r_L = 1 + 2·L
Geometric growth: Stacking 3×3 convs grows receptive field by 2 per layer. Using dilated convolutions or strided convolutions can grow it faster. Two 3×3 convs have same receptive field (5×5) as one 5×5 conv, but with fewer parameters and more non-linearity.
5. Pooling Layers
Max Pooling: For kernel size k and stride s:
Y[i,j] = max_{p,q ∈ {0,...,k-1}} X[i·s+p, j·s+q]
Max pooling provides local translation invariance — small shifts in the input don't change the output if the maximum stays within the pooling window.
Average Pooling:
Y[i,j] = (1/k²) Σ_{p,q} X[i·s+p, j·s+q]
Backward pass for Max Pooling: The gradient flows entirely to the neuron that achieved the maximum (argmax position). All other positions get zero gradient. This is stored as a "switch" or "mask" in the forward pass.
6. Backprop Through Convolution
This is the mathematical heart. We need two gradients: ∂L/∂X (to propagate to earlier layers) and ∂L/∂K (to update weights).
Gradient w.r.t. weights (∂L/∂K): This is itself a convolution! The gradient of the loss with respect to the kernel is the convolution of the input with the output gradient:
∂L/∂K[c_out, c_in, p, q] = Σ_{i,j} X[c_in, i+p, j+q] · ∂L/∂Y[c_out, i, j]
This is exactly a valid convolution of X with ∂L/∂Y.
Gradient w.r.t. input (∂L/∂X): This is a transposed convolution (sometimes called "deconvolution" or "fractionally-strided convolution"):
∂L/∂X[c_in, i, j] = Σ_{c_out} Σ_{p,q} K[c_out, c_in, p, q] · ∂L/∂Y[c_out, i-p, j-q]
If we flip the kernel and treat it as a convolution with the output gradient, we get the gradient with respect to the input. With stride and padding, this corresponds to the "transposed convolution" operation.
Key insight: Both forward and backward passes through convolution layers are themselves convolution operations — no matrix multiplication needed. This is why CNNs are efficient to compute on GPUs: convolution is a highly optimized primitive.
7. Feature Hierarchy
CNNs learn a hierarchy of features:
- Layer 1-2: Edges, corners, color blobs (Gabor-like filters)
- Layer 3-4: Textures, patterns, simple shapes
- Layer 5-6: Object parts (eyes, wheels, handles)
- Layer 7+: Whole objects and semantic concepts
This emerges naturally from composition: higher layers combine lower-level features into more complex ones. The mathematical reason is that convolution + non-linearity + pooling is a compositional operation that builds increasingly abstract representations.
Key Terms
- 17 01 Convolutional Neural Networks
- Backprop Through Convolution
- C) 11
- C) 26
- End-of-Subject Quiz
- Example 1: Computing Output Dimensions
- Example 2: Parameter Count Comparison
- Example 3: Gradient of Convolution (Numerical Check)
- Feature Hierarchy
- Parameter Sharing — The Key Insight
- Pooling Layers
- Problem 1
Worked Examples
Example 1: Computing Output Dimensions
Problem: Input is 224×224 with 3 channels. We apply a conv layer with 64 filters of size 7×7, stride 2, padding 3. What are the output dimensions?
Solution: H_out = ⌊(224 − 7 + 2·3) / 2⌋ + 1 = ⌊223/2⌋ + 1 = 111 + 1 = 112
So output is 64 × 112 × 112. This matches the first layer of a standard ResNet.
Example 2: Parameter Count Comparison
Problem: Compare the parameter count of replacing a 5×5 convolution with two 3×3 convolutions. Assume 64 input channels and 64 output channels throughout.
Solution: - One 5×5: 64 × 64 × 5 × 5 + 64 = 102,400 + 64 = 102,464 parameters - Two 3×3: 64 × 64 × 3 × 3 + 64 = 36,864 + 64 = 36,928 per layer. Two layers: 73,856 parameters.
The two 3×3 layers use 28% fewer parameters while having the same receptive field (5×5) and more non-linearity (two ReLUs instead of one).
Example 3: Gradient of Convolution (Numerical Check)
Problem: For a 1D convolution with kernel K = [1, 2], input X = [3, 4, 5]. Stride=1, no padding. Verify the forward pass and compute ∂L/∂K if ∂L/∂Y = [1, -1].
Solution:
Forward: Y[0] = 3·1 + 4·2 = 11 Y[1] = 4·1 + 5·2 = 14 Y = [11, 14]
Gradient w.r.t. kernel: ∂L/∂K[0] = X[0]·1 + X[1]·(-1) = 3·1 + 4·(-1) = -1 ∂L/∂K[1] = X[1]·1 + X[2]·(-1) = 4·1 + 5·(-1) = -1
So ∂L/∂K = [-1, -1]. This is the convolution of X (as input) with ∂L/∂Y (as "kernel").
Quiz
Q1: What does the concept of Backprop Through Convolution primarily refer to in this subject?
A) A historical anecdote about Backprop Through Convolution B) A visual representation of Backprop Through Convolution C) The definition and application of Backprop Through Convolution D) A computational error related to Backprop Through Convolution
Correct: C)
- If you chose A: This is incorrect. Backprop Through Convolution is defined as: the definition and application of backprop through convolution. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Backprop Through Convolution is defined as: the definition and application of backprop through convolution. The other options describe different aspects that are not the primary focus.
- If you chose C: Backprop Through Convolution is defined as: the definition and application of backprop through convolution. The other options describe different aspects that are not the primary focus. Correct!
- If you chose D: This is incorrect. Backprop Through Convolution is defined as: the definition and application of backprop through convolution. The other options describe different aspects that are not the primary focus.
Q2: What is the primary purpose of End-of-Subject Quiz?
A) It is primarily a historical notation system B) It is used to end-of-subject quiz in mathematical analysis C) It replaces all other methods in this domain D) It is used only in advanced research contexts
Correct: B)
- If you chose A: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose C: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.
Q3: Which statement about Feature Hierarchy is TRUE?
A) Feature Hierarchy is not related to this subject B) Feature Hierarchy is mentioned only as a historical footnote C) Feature Hierarchy is an advanced topic beyond this subject's scope D) Feature Hierarchy is a fundamental concept covered in this subject
Correct: D)
- If you chose A: This is incorrect. Feature Hierarchy is a fundamental concept covered in this subject. This subject covers Feature Hierarchy as part of its core content.
- If you chose B: This is incorrect. Feature Hierarchy is a fundamental concept covered in this subject. This subject covers Feature Hierarchy as part of its core content.
- If you chose C: This is incorrect. Feature Hierarchy is a fundamental concept covered in this subject. This subject covers Feature Hierarchy as part of its core content.
- If you chose D: Feature Hierarchy is a fundamental concept covered in this subject. This subject covers Feature Hierarchy as part of its core content. Correct!
Q4: Based on the worked examples in this subject, what is the correct result?
A) The inverse of the correct answer B) A different result from a common mistake C) An unrelated numerical value D) 32 × 32 × 32.
Correct: D)
- If you chose A: This is incorrect. The worked examples show that the result is 32 × 32 × 32.. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is 32 × 32 × 32.. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is 32 × 32 × 32.. The other options represent common errors.
- If you chose D: The worked examples show that the result is 32 × 32 × 32.. The other options represent common errors. Correct!
Q5: How are Feature Hierarchy and Pooling Layers related?
A) Feature Hierarchy is a special case of Pooling Layers B) Feature Hierarchy and Pooling Layers are closely related concepts C) Feature Hierarchy and Pooling Layers are completely unrelated topics D) Feature Hierarchy is the inverse of Pooling Layers
Correct: B)
- If you chose A: This is incorrect. Both Feature Hierarchy and Pooling Layers are covered in this subject as interconnected topics.
- If you chose B: Both Feature Hierarchy and Pooling Layers are covered in this subject as interconnected topics. Correct!
- If you chose C: This is incorrect. Both Feature Hierarchy and Pooling Layers are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Feature Hierarchy and Pooling Layers are covered in this subject as interconnected topics.
Q6: What is a common pitfall when working with The Convolution Operation?
A) The Convolution Operation has no common misconceptions B) A common mistake is confusing The Convolution Operation with a similar concept C) The main error with The Convolution Operation is using it when it is not needed D) The Convolution Operation is always computed the same way in all contexts
Correct: B)
- If you chose A: This is incorrect. Students often confuse The Convolution Operation with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: Students often confuse The Convolution Operation with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose C: This is incorrect. Students often confuse The Convolution Operation with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse The Convolution Operation with similar-sounding or related concepts. Pay attention to the precise definitions.
Q7: When should you apply Stride And Padding?
A) Stride And Padding is not practically useful B) Use Stride And Padding only in pure mathematics contexts C) Avoid Stride And Padding unless explicitly instructed D) Apply Stride And Padding to solve problems in this subject's domain
Correct: D)
- If you chose A: This is incorrect. Stride And Padding is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Stride And Padding is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. Stride And Padding is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: Stride And Padding is a practical tool used throughout this subject to solve relevant problems. Correct!
Practice Problems
Problem 1
Input 32×32 with 16 channels. Apply conv with 32 filters, 3×3 kernel, stride 1, padding 1. What is the output shape?
Answer
H_out = ⌊(32 − 3 + 2)/1⌋ + 1 = 32. Output: 32 × 32 × 32.Problem 2
Calculate the receptive field at layer 4 if each layer uses 3×3 kernels with stride 1 (no pooling).
Answer
r₁ = 3, r₂ = 3+2 = 5, r₃ = 5+2 = 7, r₄ = 7+2 = 9.Problem 3
A max pooling layer with kernel 2×2, stride 2 receives input [[1,3,2,4],[5,6,7,8],[9,2,1,0],[4,5,6,7]]. What's the output?
Answer
[[6,8],[9,7]]. Max of each 2×2 block.Problem 4
If the backward gradient for the max pool in Problem 3 is ∂L/∂Y = [[α,β],[γ,δ]], what are the gradients at the positions of the winning neurons?
Answer
The winning positions: (1,1) gets α, (1,3) gets β, (2,0) gets γ, (3,3) gets δ. All other positions get 0.Problem 5
How many parameters does a depthwise-separable convolution have, compared to a standard convolution? Consider C_in = C_out = C, kernel size k. (Depthwise: one k×k per input channel, then 1×1 to mix channels.)
Answer
Standard: C·C·k² + C = C²k² + C. Depthwise-separable: C·1·k² + C·C·1² + C = Ck² + C² + C = C² + Ck² + C. Ratio: (C² + Ck²)/(C²k²) ≈ 1/k² + 1/C. For k=3: ~1/9 the parameters.Summary
- Convolution in CNNs is cross-correlation: dot product between kernel and local input patch at each sliding position
- Stride controls downsampling, padding controls boundary handling; output size = ⌊(H_in − k + 2P)/s⌋ + 1
- Parameter sharing gives translation equivariance and massive parameter efficiency — a 3×3 kernel with 64 channels has only ~1.8K params vs billions for dense
- Backprop through convolution is itself convolution: ∂L/∂K = X ∗ ∂L/∂Y and ∂L/∂X is a transposed convolution
- The receptive field grows linearly with layer depth for 3×3 convs; two 3×3 layers match one 5×5 with fewer parameters and more non-linearity
Pitfalls
- Computing output dimensions incorrectly. The formula H_out = ⌊(H_in − k + 2P)/s⌋ + 1 is the single most common source of off-by-one errors in CNN implementations. Forgetting the +1, or miscomputing the floor division with asymmetric padding, produces silently wrong tensor shapes.
- Using "same" padding with stride > 1. "Same" padding only preserves spatial dimensions when stride = 1. With stride = 2, the output is still halved regardless of padding. The padding formula P = ⌊k/2⌋ ensures "same" only for s = 1.
- Treating pooling as a learnable operation. Max pooling and average pooling have no parameters. If you add learnable weights to a pooling-like operation, you've created a strided convolution — a fundamentally different operation with different gradient dynamics. Don't confuse the two.
- Implementing backprop through convolution as matrix multiplication. The backward pass of convolution is itself a convolution (transposed convolution for ∂L/∂X, standard convolution for ∂L/∂K). Implementing it as an explicit matrix multiplication is asymptotically correct but dramatically slower than using optimized convolution primitives.
- Applying dropout directly to convolutional feature maps. Neighboring pixels in CNN feature maps are highly correlated. Standard dropout (zeroing individual neurons independently) is much less effective than for fully-connected layers. Use Spatial Dropout (dropping entire channels) or BatchNorm for regularization in CNNs.
Next Steps
Continue to 17-02 — Recurrent Neural Networks (RNNs) to learn how neural networks handle sequential data through recurrence.