Math graphic
📐 Concept diagram

17-01 — Convolutional Neural Networks

Phase: 17 — Deep Learning Architectures (Math) Subject: 17-01 Prerequisites: 16-05 (Backpropagation), 16-02 (Activation Functions), 16-06 (Gradient Flow), 16-08 (Regularization) Next subject: 17-02 — Recurrent Neural Networks (RNNs)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the discrete convolution (cross-correlation) operation and explain its role as a feature detector
  2. Compute output spatial dimensions given input size, kernel size, stride, and padding
  3. Derive the backward pass through convolution — gradient with respect to input (transposed convolution) and gradient with respect to weights
  4. Explain parameter sharing and receptive field growth mathematically
  5. Analyze the computational efficiency of convolution vs. dense layers

Core Content

1. The Convolution Operation

In deep learning, what we call "convolution" is technically cross-correlation. The true mathematical convolution flips the kernel; deep learning frameworks skip the flip because the kernel is learned anyway. We'll follow the DL convention.

2D Convolution (cross-correlation):

For input X ∈ ℝ^(H×W), kernel K ∈ ℝ^(k×k), the output Y at position (i,j) is:

Y[i,j] = Σ_{p=0}^{k-1} Σ_{q=0}^{k-1} X[i+p, j+q] · K[p,q]

This is simply sliding the kernel over the input and computing a dot product at each position.

Multi-channel case:

For input X ∈ ℝ^(C_in×H×W) and kernel K ∈ ℝ^(C_out×C_in×k×k), with bias b ∈ ℝ^C_out:

Y[c_out, i, j] = b[c_out] + Σ_{c_in=0}^{C_in-1} Σ_{p=0}^{k-1} Σ_{q=0}^{k-1} X[c_in, i+p, j+q] · K[c_out, c_in, p, q]

Each output channel is a weighted sum of all input channels, each convolved with its own 2D kernel.

2. Stride and Padding

Stride s: How many pixels we shift the kernel at each step. Output dimensions:

H_out = ⌊(H_in − k + 2P) / s⌋ + 1 W_out = ⌊(W_in − k + 2P) / s⌋ + 1

Where P is the padding width (pixels added to each side).

Padding types: - "Valid" (P=0): No padding. Output shrinks: H_out = H_in − k + 1 - "Same" (P = ⌊k/2⌋): Output size = input size (for stride=1). Requires P = ⌊k/2⌋ so that H_out = H_in.

Common case: k=3, P=1, s=1 → output same size as input. This is the standard building block.

3. Parameter Sharing — The Key Insight

Consider a 256×256 RGB image with 64 output channels using 3×3 kernels:

Why sharing works: A feature detector (e.g., edge detector) that's useful at one spatial location is useful everywhere. Translation equivariance is baked in: if you shift the input, the output shifts by the same amount (modulo stride effects).

⚠️ THIS IS CRITICAL — Parameter sharing is what makes CNNs feasible. It also encodes the inductive bias of translation equivariance, which is why CNNs work so well on images but poorly on data without spatial structure.

4. Receptive Field

The receptive field of a neuron is the region of the original input that affects its value.

Recursive formula: For layer ℓ with kernel size k_ℓ and stride s_ℓ:

r_ℓ = r_{ℓ-1} + (k_ℓ − 1) · ∏_{i=1}^{ℓ-1} s_i

For a network with only stride-1 3×3 convolutions: - Layer 1: r₁ = 3 - Layer 2: r₂ = 3 + 2 = 5 - Layer 3: r₃ = 5 + 2 = 7 - Layer L: r_L = 1 + 2·L

Geometric growth: Stacking 3×3 convs grows receptive field by 2 per layer. Using dilated convolutions or strided convolutions can grow it faster. Two 3×3 convs have same receptive field (5×5) as one 5×5 conv, but with fewer parameters and more non-linearity.

5. Pooling Layers

Max Pooling: For kernel size k and stride s:

Y[i,j] = max_{p,q ∈ {0,...,k-1}} X[i·s+p, j·s+q]

Max pooling provides local translation invariance — small shifts in the input don't change the output if the maximum stays within the pooling window.

Average Pooling:

Y[i,j] = (1/k²) Σ_{p,q} X[i·s+p, j·s+q]

Backward pass for Max Pooling: The gradient flows entirely to the neuron that achieved the maximum (argmax position). All other positions get zero gradient. This is stored as a "switch" or "mask" in the forward pass.

6. Backprop Through Convolution

This is the mathematical heart. We need two gradients: ∂L/∂X (to propagate to earlier layers) and ∂L/∂K (to update weights).

Gradient w.r.t. weights (∂L/∂K): This is itself a convolution! The gradient of the loss with respect to the kernel is the convolution of the input with the output gradient:

∂L/∂K[c_out, c_in, p, q] = Σ_{i,j} X[c_in, i+p, j+q] · ∂L/∂Y[c_out, i, j]

This is exactly a valid convolution of X with ∂L/∂Y.

Gradient w.r.t. input (∂L/∂X): This is a transposed convolution (sometimes called "deconvolution" or "fractionally-strided convolution"):

∂L/∂X[c_in, i, j] = Σ_{c_out} Σ_{p,q} K[c_out, c_in, p, q] · ∂L/∂Y[c_out, i-p, j-q]

If we flip the kernel and treat it as a convolution with the output gradient, we get the gradient with respect to the input. With stride and padding, this corresponds to the "transposed convolution" operation.

Key insight: Both forward and backward passes through convolution layers are themselves convolution operations — no matrix multiplication needed. This is why CNNs are efficient to compute on GPUs: convolution is a highly optimized primitive.

7. Feature Hierarchy

CNNs learn a hierarchy of features:

This emerges naturally from composition: higher layers combine lower-level features into more complex ones. The mathematical reason is that convolution + non-linearity + pooling is a compositional operation that builds increasingly abstract representations.



Key Terms

Worked Examples

Example 1: Computing Output Dimensions

Problem: Input is 224×224 with 3 channels. We apply a conv layer with 64 filters of size 7×7, stride 2, padding 3. What are the output dimensions?

Solution: H_out = ⌊(224 − 7 + 2·3) / 2⌋ + 1 = ⌊223/2⌋ + 1 = 111 + 1 = 112

So output is 64 × 112 × 112. This matches the first layer of a standard ResNet.

Example 2: Parameter Count Comparison

Problem: Compare the parameter count of replacing a 5×5 convolution with two 3×3 convolutions. Assume 64 input channels and 64 output channels throughout.

Solution: - One 5×5: 64 × 64 × 5 × 5 + 64 = 102,400 + 64 = 102,464 parameters - Two 3×3: 64 × 64 × 3 × 3 + 64 = 36,864 + 64 = 36,928 per layer. Two layers: 73,856 parameters.

The two 3×3 layers use 28% fewer parameters while having the same receptive field (5×5) and more non-linearity (two ReLUs instead of one).

Example 3: Gradient of Convolution (Numerical Check)

Problem: For a 1D convolution with kernel K = [1, 2], input X = [3, 4, 5]. Stride=1, no padding. Verify the forward pass and compute ∂L/∂K if ∂L/∂Y = [1, -1].

Solution:

Forward: Y[0] = 3·1 + 4·2 = 11 Y[1] = 4·1 + 5·2 = 14 Y = [11, 14]

Gradient w.r.t. kernel: ∂L/∂K[0] = X[0]·1 + X[1]·(-1) = 3·1 + 4·(-1) = -1 ∂L/∂K[1] = X[1]·1 + X[2]·(-1) = 4·1 + 5·(-1) = -1

So ∂L/∂K = [-1, -1]. This is the convolution of X (as input) with ∂L/∂Y (as "kernel").


Quiz

Q1: What does the concept of Backprop Through Convolution primarily refer to in this subject?

A) A historical anecdote about Backprop Through Convolution B) A visual representation of Backprop Through Convolution C) The definition and application of Backprop Through Convolution D) A computational error related to Backprop Through Convolution

Correct: C)

Q2: What is the primary purpose of End-of-Subject Quiz?

A) It is primarily a historical notation system B) It is used to end-of-subject quiz in mathematical analysis C) It replaces all other methods in this domain D) It is used only in advanced research contexts

Correct: B)

Q3: Which statement about Feature Hierarchy is TRUE?

A) Feature Hierarchy is not related to this subject B) Feature Hierarchy is mentioned only as a historical footnote C) Feature Hierarchy is an advanced topic beyond this subject's scope D) Feature Hierarchy is a fundamental concept covered in this subject

Correct: D)

Q4: Based on the worked examples in this subject, what is the correct result?

A) The inverse of the correct answer B) A different result from a common mistake C) An unrelated numerical value D) 32 × 32 × 32.

Correct: D)

Q5: How are Feature Hierarchy and Pooling Layers related?

A) Feature Hierarchy is a special case of Pooling Layers B) Feature Hierarchy and Pooling Layers are closely related concepts C) Feature Hierarchy and Pooling Layers are completely unrelated topics D) Feature Hierarchy is the inverse of Pooling Layers

Correct: B)

Q6: What is a common pitfall when working with The Convolution Operation?

A) The Convolution Operation has no common misconceptions B) A common mistake is confusing The Convolution Operation with a similar concept C) The main error with The Convolution Operation is using it when it is not needed D) The Convolution Operation is always computed the same way in all contexts

Correct: B)

Q7: When should you apply Stride And Padding?

A) Stride And Padding is not practically useful B) Use Stride And Padding only in pure mathematics contexts C) Avoid Stride And Padding unless explicitly instructed D) Apply Stride And Padding to solve problems in this subject's domain

Correct: D)

Practice Problems

Problem 1

Input 32×32 with 16 channels. Apply conv with 32 filters, 3×3 kernel, stride 1, padding 1. What is the output shape?

Answer H_out = ⌊(32 − 3 + 2)/1⌋ + 1 = 32. Output: 32 × 32 × 32.

Problem 2

Calculate the receptive field at layer 4 if each layer uses 3×3 kernels with stride 1 (no pooling).

Answer r₁ = 3, r₂ = 3+2 = 5, r₃ = 5+2 = 7, r₄ = 7+2 = 9.

Problem 3

A max pooling layer with kernel 2×2, stride 2 receives input [[1,3,2,4],[5,6,7,8],[9,2,1,0],[4,5,6,7]]. What's the output?

Answer [[6,8],[9,7]]. Max of each 2×2 block.

Problem 4

If the backward gradient for the max pool in Problem 3 is ∂L/∂Y = [[α,β],[γ,δ]], what are the gradients at the positions of the winning neurons?

Answer The winning positions: (1,1) gets α, (1,3) gets β, (2,0) gets γ, (3,3) gets δ. All other positions get 0.

Problem 5

How many parameters does a depthwise-separable convolution have, compared to a standard convolution? Consider C_in = C_out = C, kernel size k. (Depthwise: one k×k per input channel, then 1×1 to mix channels.)

Answer Standard: C·C·k² + C = C²k² + C. Depthwise-separable: C·1·k² + C·C·1² + C = Ck² + C² + C = C² + Ck² + C. Ratio: (C² + Ck²)/(C²k²) ≈ 1/k² + 1/C. For k=3: ~1/9 the parameters.

Summary

  1. Convolution in CNNs is cross-correlation: dot product between kernel and local input patch at each sliding position
  2. Stride controls downsampling, padding controls boundary handling; output size = ⌊(H_in − k + 2P)/s⌋ + 1
  3. Parameter sharing gives translation equivariance and massive parameter efficiency — a 3×3 kernel with 64 channels has only ~1.8K params vs billions for dense
  4. Backprop through convolution is itself convolution: ∂L/∂K = X ∗ ∂L/∂Y and ∂L/∂X is a transposed convolution
  5. The receptive field grows linearly with layer depth for 3×3 convs; two 3×3 layers match one 5×5 with fewer parameters and more non-linearity

Pitfalls



Next Steps

Continue to 17-02 — Recurrent Neural Networks (RNNs) to learn how neural networks handle sequential data through recurrence.