📐 Concept diagram

17-01 — Convolutional Neural Networks

Phase: 17 — Deep Learning Architectures (Math) Subject: 17-01 Prerequisites: 16-05 (Backpropagation), 16-02 (Activation Functions), 16-06 (Gradient Flow), 16-08 (Regularization) Next subject: 17-02 — Recurrent Neural Networks (RNNs)

Learning Objectives

By the end of this subject, you will be able to:

Derive the discrete convolution (cross-correlation) operation and explain its role as a feature detector
Compute output spatial dimensions given input size, kernel size, stride, and padding
Derive the backward pass through convolution — gradient with respect to input (transposed convolution) and gradient with respect to weights
Explain parameter sharing and receptive field growth mathematically
Analyze the computational efficiency of convolution vs. dense layers

Core Content

1. The Convolution Operation

In deep learning, what we call "convolution" is technically cross-correlation. The true mathematical convolution flips the kernel; deep learning frameworks skip the flip because the kernel is learned anyway. We'll follow the DL convention.

2D Convolution (cross-correlation):

For input X ∈ ℝ^(H×W), kernel K ∈ ℝ^(k×k), the output Y at position (i,j) is:

Y[i,j] = Σ_{p=0}^{k-1} Σ_{q=0}^{k-1} X[i+p, j+q] · K[p,q]

This is simply sliding the kernel over the input and computing a dot product at each position.

Multi-channel case:

For input X ∈ ℝ^(C_in×H×W) and kernel K ∈ ℝ^(C_out×C_in×k×k), with bias b ∈ ℝ^C_out:

Y[c_out, i, j] = b[c_out] + Σ_{c_in=0}^{C_in-1} Σ_{p=0}^{k-1} Σ_{q=0}^{k-1} X[c_in, i+p, j+q] · K[c_out, c_in, p, q]

Each output channel is a weighted sum of all input channels, each convolved with its own 2D kernel.

2. Stride and Padding

Stride s: How many pixels we shift the kernel at each step. Output dimensions:

H_out = ⌊(H_in − k + 2P) / s⌋ + 1 W_out = ⌊(W_in − k + 2P) / s⌋ + 1

Where P is the padding width (pixels added to each side).

Padding types: - "Valid" (P=0): No padding. Output shrinks: H_out = H_in − k + 1 - "Same" (P = ⌊k/2⌋): Output size = input size (for stride=1). Requires P = ⌊k/2⌋ so that H_out = H_in.

Common case: k=3, P=1, s=1 → output same size as input. This is the standard building block.

Consider a 256×256 RGB image with 64 output channels using 3×3 kernels:

Dense layer: (256×256×3) × (256×256×3×64) = 9.66 billion parameters — impossible
Convolution: 64 × 3 × 3 × 3 + 64 = 1,792 parameters

Why sharing works: A feature detector (e.g., edge detector) that's useful at one spatial location is useful everywhere. Translation equivariance is baked in: if you shift the input, the output shifts by the same amount (modulo stride effects).

⚠️ THIS IS CRITICAL — Parameter sharing is what makes CNNs feasible. It also encodes the inductive bias of translation equivariance, which is why CNNs work so well on images but poorly on data without spatial structure.

4. Receptive Field

The receptive field of a neuron is the region of the original input that affects its value.

Recursive formula: For layer ℓ with kernel size k_ℓ and stride s_ℓ:

r_ℓ = r_{ℓ-1} + (k_ℓ − 1) · ∏_{i=1}^{ℓ-1} s_i

For a network with only stride-1 3×3 convolutions: - Layer 1: r₁ = 3 - Layer 2: r₂ = 3 + 2 = 5 - Layer 3: r₃ = 5 + 2 = 7 - Layer L: r_L = 1 + 2·L

Geometric growth: Stacking 3×3 convs grows receptive field by 2 per layer. Using dilated convolutions or strided convolutions can grow it faster. Two 3×3 convs have same receptive field (5×5) as one 5×5 conv, but with fewer parameters and more non-linearity.

5. Pooling Layers

Max Pooling: For kernel size k and stride s:

Y[i,j] = max_{p,q ∈ {0,...,k-1}} X[i·s+p, j·s+q]

Max pooling provides local translation invariance — small shifts in the input don't change the output if the maximum stays within the pooling window.

Average Pooling:

Y[i,j] = (1/k²) Σ_{p,q} X[i·s+p, j·s+q]

Backward pass for Max Pooling: The gradient flows entirely to the neuron that achieved the maximum (argmax position). All other positions get zero gradient. This is stored as a "switch" or "mask" in the forward pass.

6. Backprop Through Convolution

This is the mathematical heart. We need two gradients: ∂L/∂X (to propagate to earlier layers) and ∂L/∂K (to update weights).

Gradient w.r.t. weights (∂L/∂K): This is itself a convolution! The gradient of the loss with respect to the kernel is the convolution of the input with the output gradient:

∂L/∂K[c_out, c_in, p, q] = Σ_{i,j} X[c_in, i+p, j+q] · ∂L/∂Y[c_out, i, j]

This is exactly a valid convolution of X with ∂L/∂Y.

Gradient w.r.t. input (∂L/∂X): This is a transposed convolution (sometimes called "deconvolution" or "fractionally-strided convolution"):

∂L/∂X[c_in, i, j] = Σ_{c_out} Σ_{p,q} K[c_out, c_in, p, q] · ∂L/∂Y[c_out, i-p, j-q]

If we flip the kernel and treat it as a convolution with the output gradient, we get the gradient with respect to the input. With stride and padding, this corresponds to the "transposed convolution" operation.

Key insight: Both forward and backward passes through convolution layers are themselves convolution operations — no matrix multiplication needed. This is why CNNs are efficient to compute on GPUs: convolution is a highly optimized primitive.

7. Feature Hierarchy

CNNs learn a hierarchy of features:

Layer 1-2: Edges, corners, color blobs (Gabor-like filters)
Layer 3-4: Textures, patterns, simple shapes
Layer 5-6: Object parts (eyes, wheels, handles)
Layer 7+: Whole objects and semantic concepts

This emerges naturally from composition: higher layers combine lower-level features into more complex ones. The mathematical reason is that convolution + non-linearity + pooling is a compositional operation that builds increasingly abstract representations.

Key Terms

17 01 Convolutional Neural Networks
Backprop Through Convolution
C) 11
C) 26
End-of-Subject Quiz
Example 1: Computing Output Dimensions
Example 2: Parameter Count Comparison
Example 3: Gradient of Convolution (Numerical Check)
Feature Hierarchy
Parameter Sharing — The Key Insight
Pooling Layers
Problem 1

Worked Examples

Example 1: Computing Output Dimensions

Problem: Input is 224×224 with 3 channels. We apply a conv layer with 64 filters of size 7×7, stride 2, padding 3. What are the output dimensions?

Solution: H_out = ⌊(224 − 7 + 2·3) / 2⌋ + 1 = ⌊223/2⌋ + 1 = 111 + 1 = 112

So output is 64 × 112 × 112. This matches the first layer of a standard ResNet.

Example 2: Parameter Count Comparison

Problem: Compare the parameter count of replacing a 5×5 convolution with two 3×3 convolutions. Assume 64 input channels and 64 output channels throughout.

Solution: - One 5×5: 64 × 64 × 5 × 5 + 64 = 102,400 + 64 = 102,464 parameters - Two 3×3: 64 × 64 × 3 × 3 + 64 = 36,864 + 64 = 36,928 per layer. Two layers: 73,856 parameters.

The two 3×3 layers use 28% fewer parameters while having the same receptive field (5×5) and more non-linearity (two ReLUs instead of one).

Example 3: Gradient of Convolution (Numerical Check)

Problem: For a 1D convolution with kernel K = [1, 2], input X = [3, 4, 5]. Stride=1, no padding. Verify the forward pass and compute ∂L/∂K if ∂L/∂Y = [1, -1].

Solution:

Forward: Y[0] = 3·1 + 4·2 = 11 Y[1] = 4·1 + 5·2 = 14 Y = [11, 14]

Gradient w.r.t. kernel: ∂L/∂K[0] = X[0]·1 + X[1]·(-1) = 3·1 + 4·(-1) = -1 ∂L/∂K[1] = X[1]·1 + X[2]·(-1) = 4·1 + 5·(-1) = -1

So ∂L/∂K = [-1, -1]. This is the convolution of X (as input) with ∂L/∂Y (as "kernel").

Quiz

Q1: What does the concept of Backprop Through Convolution primarily refer to in this subject?

A) A historical anecdote about Backprop Through Convolution B) A visual representation of Backprop Through Convolution C) The definition and application of Backprop Through Convolution D) A computational error related to Backprop Through Convolution

Correct: C)

If you chose A: This is incorrect. Backprop Through Convolution is defined as: the definition and application of backprop through convolution. The other options describe different aspects that are not the primary focus.
If you chose B: This is incorrect. Backprop Through Convolution is defined as: the definition and application of backprop through convolution. The other options describe different aspects that are not the primary focus.
If you chose C: Backprop Through Convolution is defined as: the definition and application of backprop through convolution. The other options describe different aspects that are not the primary focus. Correct!
If you chose D: This is incorrect. Backprop Through Convolution is defined as: the definition and application of backprop through convolution. The other options describe different aspects that are not the primary focus.

Q2: What is the primary purpose of End-of-Subject Quiz?

A) It is primarily a historical notation system B) It is used to end-of-subject quiz in mathematical analysis C) It replaces all other methods in this domain D) It is used only in advanced research contexts

Correct: B)

If you chose A: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose B: End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
If you chose C: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose D: This is incorrect. End-of-Subject Quiz serves the purpose described in the correct answer. The other options misrepresent its role.

Q3: Which statement about Feature Hierarchy is TRUE?

A) Feature Hierarchy is not related to this subject B) Feature Hierarchy is mentioned only as a historical footnote C) Feature Hierarchy is an advanced topic beyond this subject's scope D) Feature Hierarchy is a fundamental concept covered in this subject

Correct: D)

If you chose A: This is incorrect. Feature Hierarchy is a fundamental concept covered in this subject. This subject covers Feature Hierarchy as part of its core content.
If you chose B: This is incorrect. Feature Hierarchy is a fundamental concept covered in this subject. This subject covers Feature Hierarchy as part of its core content.
If you chose C: This is incorrect. Feature Hierarchy is a fundamental concept covered in this subject. This subject covers Feature Hierarchy as part of its core content.
If you chose D: Feature Hierarchy is a fundamental concept covered in this subject. This subject covers Feature Hierarchy as part of its core content. Correct!

Q4: Based on the worked examples in this subject, what is the correct result?

A) The inverse of the correct answer B) A different result from a common mistake C) An unrelated numerical value D) 32 × 32 × 32.

Correct: D)

If you chose A: This is incorrect. The worked examples show that the result is 32 × 32 × 32.. The other options represent common errors.
If you chose B: This is incorrect. The worked examples show that the result is 32 × 32 × 32.. The other options represent common errors.
If you chose C: This is incorrect. The worked examples show that the result is 32 × 32 × 32.. The other options represent common errors.
If you chose D: The worked examples show that the result is 32 × 32 × 32.. The other options represent common errors. Correct!

Q5: How are Feature Hierarchy and Pooling Layers related?

A) Feature Hierarchy is a special case of Pooling Layers B) Feature Hierarchy and Pooling Layers are closely related concepts C) Feature Hierarchy and Pooling Layers are completely unrelated topics D) Feature Hierarchy is the inverse of Pooling Layers

Correct: B)

If you chose A: This is incorrect. Both Feature Hierarchy and Pooling Layers are covered in this subject as interconnected topics.
If you chose B: Both Feature Hierarchy and Pooling Layers are covered in this subject as interconnected topics. Correct!
If you chose C: This is incorrect. Both Feature Hierarchy and Pooling Layers are covered in this subject as interconnected topics.
If you chose D: This is incorrect. Both Feature Hierarchy and Pooling Layers are covered in this subject as interconnected topics.

Q6: What is a common pitfall when working with The Convolution Operation?

A) The Convolution Operation has no common misconceptions B) A common mistake is confusing The Convolution Operation with a similar concept C) The main error with The Convolution Operation is using it when it is not needed D) The Convolution Operation is always computed the same way in all contexts

Correct: B)

If you chose A: This is incorrect. Students often confuse The Convolution Operation with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose B: Students often confuse The Convolution Operation with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
If you chose C: This is incorrect. Students often confuse The Convolution Operation with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose D: This is incorrect. Students often confuse The Convolution Operation with similar-sounding or related concepts. Pay attention to the precise definitions.

Q7: When should you apply Stride And Padding?

A) Stride And Padding is not practically useful B) Use Stride And Padding only in pure mathematics contexts C) Avoid Stride And Padding unless explicitly instructed D) Apply Stride And Padding to solve problems in this subject's domain

Correct: D)

If you chose A: This is incorrect. Stride And Padding is a practical tool used throughout this subject to solve relevant problems.
If you chose B: This is incorrect. Stride And Padding is a practical tool used throughout this subject to solve relevant problems.
If you chose C: This is incorrect. Stride And Padding is a practical tool used throughout this subject to solve relevant problems.
If you chose D: Stride And Padding is a practical tool used throughout this subject to solve relevant problems. Correct!

Practice Problems

Problem 1

Input 32×32 with 16 channels. Apply conv with 32 filters, 3×3 kernel, stride 1, padding 1. What is the output shape?

Answer

H_out = ⌊(32 − 3 + 2)/1⌋ + 1 = 32. Output: 32 × 32 × 32.

Problem 2

Calculate the receptive field at layer 4 if each layer uses 3×3 kernels with stride 1 (no pooling).

Answer

r₁ = 3, r₂ = 3+2 = 5, r₃ = 5+2 = 7, r₄ = 7+2 = 9.

Problem 3

A max pooling layer with kernel 2×2, stride 2 receives input [[1,3,2,4],[5,6,7,8],[9,2,1,0],[4,5,6,7]]. What's the output?

Answer

[[6,8],[9,7]]. Max of each 2×2 block.

Problem 4

If the backward gradient for the max pool in Problem 3 is ∂L/∂Y = [[α,β],[γ,δ]], what are the gradients at the positions of the winning neurons?

Answer

The winning positions: (1,1) gets α, (1,3) gets β, (2,0) gets γ, (3,3) gets δ. All other positions get 0.

Problem 5

How many parameters does a depthwise-separable convolution have, compared to a standard convolution? Consider C_in = C_out = C, kernel size k. (Depthwise: one k×k per input channel, then 1×1 to mix channels.)

Answer

Standard: C·C·k² + C = C²k² + C. Depthwise-separable: C·1·k² + C·C·1² + C = Ck² + C² + C = C² + Ck² + C. Ratio: (C² + Ck²)/(C²k²) ≈ 1/k² + 1/C. For k=3: ~1/9 the parameters.

Summary

Convolution in CNNs is cross-correlation: dot product between kernel and local input patch at each sliding position
Stride controls downsampling, padding controls boundary handling; output size = ⌊(H_in − k + 2P)/s⌋ + 1
Parameter sharing gives translation equivariance and massive parameter efficiency — a 3×3 kernel with 64 channels has only ~1.8K params vs billions for dense
Backprop through convolution is itself convolution: ∂L/∂K = X ∗ ∂L/∂Y and ∂L/∂X is a transposed convolution
The receptive field grows linearly with layer depth for 3×3 convs; two 3×3 layers match one 5×5 with fewer parameters and more non-linearity

Pitfalls

Computing output dimensions incorrectly. The formula H_out = ⌊(H_in − k + 2P)/s⌋ + 1 is the single most common source of off-by-one errors in CNN implementations. Forgetting the +1, or miscomputing the floor division with asymmetric padding, produces silently wrong tensor shapes.
Using "same" padding with stride > 1. "Same" padding only preserves spatial dimensions when stride = 1. With stride = 2, the output is still halved regardless of padding. The padding formula P = ⌊k/2⌋ ensures "same" only for s = 1.
Treating pooling as a learnable operation. Max pooling and average pooling have no parameters. If you add learnable weights to a pooling-like operation, you've created a strided convolution — a fundamentally different operation with different gradient dynamics. Don't confuse the two.
Implementing backprop through convolution as matrix multiplication. The backward pass of convolution is itself a convolution (transposed convolution for ∂L/∂X, standard convolution for ∂L/∂K). Implementing it as an explicit matrix multiplication is asymptotically correct but dramatically slower than using optimized convolution primitives.
Applying dropout directly to convolutional feature maps. Neighboring pixels in CNN feature maps are highly correlated. Standard dropout (zeroing individual neurons independently) is much less effective than for fully-connected layers. Use Spatial Dropout (dropping entire channels) or BatchNorm for regularization in CNNs.

Next Steps

Continue to 17-02 — Recurrent Neural Networks (RNNs) to learn how neural networks handle sequential data through recurrence.

Progress

Phases

17-01 — Convolutional Neural Networks

Learning Objectives

Core Content

1. The Convolution Operation

2. Stride and Padding

4. Receptive Field

5. Pooling Layers

6. Backprop Through Convolution

7. Feature Hierarchy

Key Terms

Worked Examples

Example 1: Computing Output Dimensions

Example 2: Parameter Count Comparison

Example 3: Gradient of Convolution (Numerical Check)

Quiz

Practice Problems

Problem 1

Problem 2

Problem 3

Problem 4

Problem 5

Summary

Pitfalls

Next Steps

Progress

Phases

17-01 — Convolutional Neural Networks

Learning Objectives

Core Content

1. The Convolution Operation

2. Stride and Padding

3. Parameter Sharing — The Key Insight

4. Receptive Field

5. Pooling Layers

6. Backprop Through Convolution

7. Feature Hierarchy

Key Terms

Worked Examples

Example 1: Computing Output Dimensions

Example 2: Parameter Count Comparison

Example 3: Gradient of Convolution (Numerical Check)

Quiz

Practice Problems

Problem 1

Problem 2

Problem 3

Problem 4

Problem 5

Summary

Pitfalls

Next Steps