
16-08 — Regularization

Phase: 16 — Neural Network Mathematics Subject: 16-08 Prerequisites: 16-04 (Loss Functions), Phase 10-11 (probability, Bayesian concepts), Phase 14 (optimization) Next subject: 16-09 — Batch Normalization


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the L1 and L2 regularization penalties and their gradients, including their Bayesian interpretations
  2. Explain dropout mathematically, including the inverted dropout scaling and why it approximates model averaging
  3. Prove that L2 weight decay is equivalent to L2 regularization for SGD (but not for adaptive optimizers like Adam)
  4. Describe data augmentation, early stopping, and label smoothing as regularization techniques
  5. Choose appropriate regularization strategies based on network architecture and dataset characteristics

Core Content

1. What Is Regularization?

Regularization is any technique that reduces overfitting — when a model performs well on training data but poorly on unseen data. Formally, it modifies the learning objective to prefer "simpler" solutions:

L_reg(θ) = L_data(θ) + λ · R(θ)

Where L_data is the task loss, R(θ) is a regularization penalty on the parameters, and λ controls the regularization strength.

⚠️ THIS IS CRITICAL: Heavily overparameterized networks have enough capacity to memorize the training set, reaching near-zero training loss while generalizing poorly to new examples. Regularization, whether explicit (penalties, dropout) or implicit (data augmentation, early stopping), is what steers training toward solutions that generalize.

2. L2 Regularization (Weight Decay)

Penalty: R(θ) = (1/2) ||w||²₂ = (1/2) Σ wᵢ²

L_total = L_data + (λ/2) Σ wᵢ²

Note: Biases are typically NOT regularized — they don't cause overfitting in the same way.

Gradient: ∂L_total/∂w = ∂L_data/∂w + λ·w

Update rule (SGD): w ← w − η(∂L_data/∂w + λ·w) = (1 − ηλ)·w − η·∂L_data/∂w

The term (1 − ηλ) multiplies the weight at each step — hence the name "weight decay": weights decay toward zero by a factor of (1 − ηλ) each iteration.
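As a quick numerical check that the two forms of the SGD update agree, here is a minimal sketch in plain Python. The function names `sgd_l2_step` and `sgd_decay_step` and the sample values are illustrative assumptions, not part of the course material.

```python
def sgd_l2_step(w, grad_data, lr, lam):
    """SGD step on L_data + (lam/2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad_data + lam * w)


def sgd_decay_step(w, grad_data, lr, lam):
    """Same step in weight-decay form: shrink w by (1 - lr*lam), then apply the data gradient."""
    return (1.0 - lr * lam) * w - lr * grad_data


w, grad_data, lr, lam = 2.0, 0.5, 0.1, 0.01  # hypothetical values
print(sgd_l2_step(w, grad_data, lr, lam))
print(sgd_decay_step(w, grad_data, lr, lam))
```

Both calls should print the same number up to floating-point rounding, since the second form is just the first with the terms regrouped.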

Bayesian interpretation: L2 regularization corresponds to a Gaussian prior on weights: p(w) = N(0, 1/λ). Maximizing the posterior p(w|D) ∝ p(D|w) · p(w) is equivalent to minimizing L_data + (λ/2)||w||².
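The MAP step can be written out explicitly. The sketch below assumes that L_data is the negative log-likelihood −log p(D|w) and that the prior is isotropic with variance 1/λ on each weight; constants that do not depend on w are dropped.

```latex
\begin{aligned}
\hat{w}_{\text{MAP}}
  &= \arg\max_w \; \log p(D \mid w) + \log p(w) \\
  &= \arg\min_w \; \underbrace{-\log p(D \mid w)}_{L_{\text{data}}(w)}
     \; - \; \log \mathcal{N}\!\left(w;\, 0,\, \tfrac{1}{\lambda} I\right) \\
  &= \arg\min_w \; L_{\text{data}}(w) + \frac{\lambda}{2}\,\lVert w \rVert_2^2 + \text{const}
\end{aligned}
```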

Effect on optimization:
- Encourages all weights to be small but non-zero
- Creates a "bowl-shaped" quadratic basin around the optimum, improving conditioning
- Prevents any single weight from dominating (distributes importance across many weights)

3. L1 Regularization (Lasso)

Penalty: R(θ) = ||w||₁ = Σ |wᵢ|

L_total = L_data + λ Σ |wᵢ|

Gradient: ∂L_total/∂w = ∂L_data/∂w + λ · sign(w)

Where sign(w) = +1 if w > 0, −1 if w < 0, and subgradient [−1, 1] at w = 0.

Update rule: w ← w − η(∂L_data/∂w + λ·sign(w))

Key difference from L2: L1 pushes weights all the way to exactly ZERO, producing sparse solutions. This is because the L1 gradient is constant (±λ), while L2 gradient (λ·w) shrinks as w → 0.

Why sparsity? Consider a weight w far from zero. Both L1 and L2 push it toward zero. Near zero:
- L2 gradient: λ·w ≈ 0, so the push weakens and w settles near 0 but not exactly at 0
- L1 gradient: λ·sign(w) = ±λ, so the push stays CONSTANT and w is driven all the way to exactly 0
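The contrast between the two penalty gradients near zero can be tabulated with a few lines of plain Python. This is a sketch only: the helper names and the value of `lam` are assumptions, and returning 0 at w = 0 is just one valid choice from the subgradient interval.

```python
def l2_penalty_grad(w, lam):
    """Gradient of (lam/2) * w**2."""
    return lam * w


def l1_penalty_subgrad(w, lam):
    """A subgradient of lam * |w|; 0 is chosen at w == 0 (any value in [-lam, lam] is valid)."""
    if w > 0:
        return lam
    if w < 0:
        return -lam
    return 0.0


lam = 0.5  # hypothetical regularization strength
for w in (1.0, 0.1, 0.001):
    # columns: w, L2 push, L1 push
    print(w, l2_penalty_grad(w, lam), l1_penalty_subgrad(w, lam))
```

The L2 column shrinks with w while the L1 column stays at `lam`, which is the mechanism described above.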

Bayesian interpretation: L1 corresponds to a Laplace prior p(w) ∝ exp(−λ|w|), which has a sharp peak at 0, favoring exact zeros.

Use cases:
- Feature selection (sparse weights → only important features remain)
- Model compression (many weights are exactly 0)
- Interpretable models

4. L2 Weight Decay vs AdamW

Crucial distinction: For SGD, L2 regularization and weight decay are EQUIVALENT:

w ← (1 − ηλ)·w − η·∂L_data/∂w

For adaptive optimizers like Adam, they are NOT equivalent. Adam's per-parameter learning rates interact with the L2 penalty differently than simple weight decay.

AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the adaptive gradient:

Standard L2 in Adam: the gradient includes λ·w, so the penalty term is divided by the adaptive scaling √(v̂) + ε along with the rest of the gradient.

AdamW weight decay: ηλ·w is subtracted directly from the weights, separately from the adaptive gradient step.

AdamW: w ← w − η · (m̂/(√(v̂) + ε) + λ·w)

This is now the standard approach for training transformers and large models.
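To make the difference concrete, here is a minimal scalar sketch of one optimizer step with either coupled L2 or decoupled weight decay. The function `adam_step` is a hypothetical helper written for this illustration, not the API of any library, and the hyperparameter defaults are assumed typical values.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, weight_decay=0.0):
    """One scalar Adam step (hypothetical helper).

    l2 > 0           : coupled L2, lam*w is added to the gradient before the moments.
    weight_decay > 0 : decoupled (AdamW-style), lr*lam*w is subtracted separately.
    """
    g = grad + l2 * w                      # coupled penalty passes through m and v
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v


# First step from w = 1 with a large data gradient (made-up numbers).
w_l2, _, _ = adam_step(1.0, 10.0, 0.0, 0.0, t=1, l2=0.1)
w_wd, _, _ = adam_step(1.0, 10.0, 0.0, 0.0, t=1, weight_decay=0.1)
print(w_l2, w_wd)
```

On the first step the adaptive term normalizes the gradient to roughly ±1, so the coupled penalty barely changes the update, while the decoupled version still removes ηλ·w from the weight.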

5. Dropout

Idea: During training, randomly "drop" (zero out) each neuron independently with probability p (the dropout rate).

Forward pass during training:

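a_drop = (a ⊙ m) / (1 − p)

Where ⊙ is the element-wise product, m is a binary mask with mᵢ ∼ Bernoulli(1 − p) (each neuron is kept with probability 1 − p), and the division by (1 − p) is the inverted dropout scaling.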

Why scale by 1/(1 − p)? At inference, NO dropout is applied. The expected activation during training is E[a_drop] = a·(1−p)/(1−p) = a, matching the inference expectation. Without this "inverted" scaling, you'd need to scale DOWN at inference time.
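The expectation argument can be checked empirically. The following is a minimal sketch using only the Python standard library; `inverted_dropout`, the seed, and the sample activations are assumptions made for illustration.

```python
import random


def inverted_dropout(activations, p, rng):
    """Zero each activation with probability p and scale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    # mask m_i ~ Bernoulli(1 - p): keep the unit when the draw falls below keep
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)        # fixed seed so the run is repeatable
a = [1.0, 2.0, 3.0]           # hypothetical activations
trials = 20000
mean = [0.0] * len(a)
for _ in range(trials):
    out = inverted_dropout(a, p=0.5, rng=rng)
    mean = [s + o / trials for s, o in zip(mean, out)]
print(mean)                   # should be close to a
```

At inference the function is simply not called, which is the practical benefit of applying the scaling during training.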

Why dropout works: Dropout can be seen as:

  1. Implicit ensemble: Each dropout mask creates a different sub-network. Training with dropout trains an exponential number of sub-networks that share parameters. At inference, you get an approximation of averaging all these sub-networks.
  2. Prevents co-adaptation: A neuron cannot rely on other specific neurons being present, so it must learn features that are independently useful.
  3. Adds noise: Acts as a regularizer by injecting Bernoulli noise into activations.

Gradient through dropout: During backprop, dropped neurons have zero gradient (they didn't contribute). The surviving neurons get gradients scaled by 1/(1 − p).

Typical values: p = 0.5 for hidden layers (maximum regularization), p = 0.2 for input layer, no dropout on output layer.

6. Data Augmentation

Regularization doesn't have to come from the optimization objective — it can come from the DATA.

Principle: Apply label-preserving transformations to training examples, effectively increasing the dataset size and teaching the model desired invariances.

Image augmentations: Random crops, flips, rotations, color jitter, Cutout, Mixup (interpolate between two images), CutMix.

Mathematical view (Mixup):

x̃ = λ·xᵢ + (1−λ)·xⱼ
ỹ = λ·yᵢ + (1−λ)·yⱼ

Where λ ∼ Beta(α, α) is the mixing coefficient (not the regularization strength λ used in earlier sections). This trains the model to behave linearly between training examples, smoothing the decision boundary.
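A minimal sketch of the Mixup interpolation for one pair of examples, using plain Python lists in place of tensors. The function name `mixup`, the toy inputs, and α = 0.2 are assumptions for illustration; the mixing coefficient is drawn with the standard library's `random.betavariate`.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha, rng):
    """Blend two examples and their one-hot labels with a Beta(alpha, alpha) coefficient."""
    lam = rng.betavariate(alpha, alpha)   # mixing coefficient in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


rng = random.Random(0)                    # fixed seed so the run is repeatable
x, y, lam = mixup([0.0, 0.0], [1.0, 0.0],  # example i: features, one-hot label
                  [1.0, 1.0], [0.0, 1.0],  # example j
                  alpha=0.2, rng=rng)
print(lam, x, y)
```

Because both labels are one-hot, the mixed label still sums to 1 and places weight `lam` on the first example's class.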

Text augmentations: Back-translation, synonym replacement, random deletion/swap, paraphrasing (via another model).

7. Early Stopping

Idea: Monitor validation loss during training. Stop when validation loss stops improving (or starts increasing), even though training loss continues to decrease.

Why it regularizes: As training progresses, the model first learns general patterns (low validation loss), then starts memorizing noise in the training data (validation loss plateaus or increases). Early stopping halts training at the point of optimal generalization.
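The stopping rule is often implemented with a patience counter. The sketch below is one common variant, not the only one: the `patience` and `min_delta` parameters and the loss values are assumptions introduced for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than min_delta
    for `patience` consecutive epochs; the best epoch's weights would be restored.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


history = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.64, 0.70]  # made-up losses
print(early_stopping_epoch(history, patience=3))  # (3, 6)
```

Here the best validation loss occurs at epoch 3 and training halts at epoch 6, after three epochs without improvement.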

The connection to L2 regularization: For linear models with quadratic loss, early stopping with small learning rate has been shown to be equivalent to L2 regularization. The number of training iterations acts as an inverse regularization parameter.
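For a rough numerical feel, consider a single eigen-direction of XᵀX/N with eigenvalue s in linear least squares, using the loss (1/2N)||y − Xw||² plus (λ/2)||w||² for the regularized case. Gradient descent from w = 0 reaches the fraction 1 − (1 − ηs)^t of the unregularized solution after t steps, while L2 regularization keeps the fraction s/(s + λ). The sketch below compares the two under the heuristic matching λ ≈ 1/(ηt); the function names and numbers are assumptions, and the match is only approximate (it is closest when ηst is small).

```python
def gd_shrinkage(s, lr, t):
    """Fraction of the least-squares solution reached after t GD steps from w = 0,
    along an eigen-direction of X^T X / N with eigenvalue s."""
    return 1.0 - (1.0 - lr * s) ** t


def ridge_shrinkage(s, lam):
    """Fraction of the least-squares solution kept by L2 regularization of strength lam."""
    return s / (s + lam)


lr, s = 0.01, 1.0                      # hypothetical step size and eigenvalue
for t in (10, 100, 1000):
    lam = 1.0 / (lr * t)               # heuristic matching lam ~ 1/(lr * t)
    print(t, round(gd_shrinkage(s, lr, t), 4), round(ridge_shrinkage(s, lam), 4))
```

Both columns grow toward 1 as t increases, which is the sense in which more iterations correspond to weaker regularization.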

8. Label Smoothing

Idea: Instead of hard one-hot targets y = [0, ..., 0, 1, 0, ..., 0], use "soft" targets:

ỹₖ = 1 − ε       if k = true_class
ỹₖ = ε/(K−1)     otherwise

Where ε is the smoothing parameter (typically 0.1).
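The soft targets can be built directly from the definition. This is a minimal sketch; the helper name `smooth_labels` and the zero-based class index are assumptions.

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Soft target: 1 - eps on the true class, eps/(K-1) on each other class."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    off = eps / (num_classes - 1)
    return [1.0 - eps if k == true_class else off for k in range(num_classes)]


# 3 classes, true class at index 0: 0.9 on it, 0.05 on each of the others
print(smooth_labels(0, 3))
```

The entries always sum to 1, so the result is still a valid probability distribution.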

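Effect on cross-entropy: The model is no longer rewarded for becoming ever more confident. With hard targets, the loss keeps decreasing as the predicted probability for the correct class approaches 1.0 (which requires unbounded logits). With smoothed targets, the loss is minimized when that probability equals 1 − ε, so the logits stay finite.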

Connection to knowledge distillation: Label smoothing with ε is roughly equivalent to adding a uniform distribution as a "teacher" with weight ε, and the hard labels as another "teacher" with weight (1 − ε).



Key Terms

Worked Examples

Example 1: L1 vs L2 Sparsity

Problem: Consider the simple 1D optimization: minimize f(w) = (w − 1)² with L1 regularization λ|w| and L2 regularization (λ/2)w². Find the optimal w for each case with λ = 0.5. Compare the sparsity.

Solution:

L2: f(w) = (w − 1)² + 0.25w² = w² − 2w + 1 + 0.25w² = 1.25w² − 2w + 1
f'(w) = 2.5w − 2 = 0 → w = 0.8

Weight is shrunk from 1.0 to 0.8, but NOT to zero.

L1: f(w) = (w − 1)² + 0.5|w|

Case w > 0: f(w) = (w−1)² + 0.5w = w² − 2w + 1 + 0.5w = w² − 1.5w + 1
f'(w) = 2w − 1.5 = 0 → w = 0.75 > 0, valid.

Case w < 0: f(w) = (w−1)² − 0.5w = w² − 2w + 1 − 0.5w = w² − 2.5w + 1
f'(w) = 2w − 2.5 = 0 → w = 1.25 > 0, which contradicts the assumption w < 0.

Optimum: w = 0.75 for λ = 0.5.

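Now try λ = 2:
L1, w > 0: f'(w) = 2w − 2 + 2 = 2w, which is positive for every w > 0, so there is no stationary point in this region. For w < 0: f'(w) = 2w − 2 − 2 = 2w − 4 < 0, so there is none there either. Check the subgradient at w = 0: the one-sided derivatives are f'(0+) = 0 and f'(0−) = −4, so the subdifferential [−4, 0] contains 0. So w = 0 is optimal!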

L2 with λ = 2: f'(w) = 2w − 2 + 2w = 4w − 2 = 0 → w = 0.5. Still not zero!

Conclusion: L1 produces exact zeros when λ is large enough. L2 only asymptotically approaches zero — never reaches it.
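The minimizers in this example have simple closed forms, which makes the comparison easy to reproduce for other values of λ. The sketch below is specific to f(w) = (w − 1)²; the function names are assumptions.

```python
def l1_minimizer(lam):
    """argmin of (w - 1)**2 + lam * |w|: soft-threshold 1 by lam/2."""
    return max(1.0 - lam / 2.0, 0.0)


def l2_minimizer(lam):
    """argmin of (w - 1)**2 + (lam/2) * w**2, from 2(w - 1) + lam*w = 0."""
    return 2.0 / (2.0 + lam)


for lam in (0.5, 2.0, 4.0):
    # columns: lam, L1 optimum, L2 optimum
    print(lam, l1_minimizer(lam), l2_minimizer(lam))
```

For λ ≥ 2 the L1 optimum is exactly 0, while the L2 optimum 2/(2 + λ) stays positive for every finite λ.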

Example 2: Dropout as Ensemble Averaging

Problem: A tiny network with 2 hidden neurons (p = 0.5 dropout) has output y = w₁h₁ + w₂h₂. During training with dropout, enumerate all 4 possible dropout masks and compute the expected output. Show how this relates to the inference-time output.

Solution:

Masks (m₁, m₂) where mᵢ ∈ {0, 1} with P(0) = 0.5:

  1. (0,0): y = 0
  2. (0,1): y = w₂h₂/0.5 = 2w₂h₂ (scaled by 1/(1−p))
  3. (1,0): y = 2w₁h₁
  4. (1,1): y = 2w₁h₁ + 2w₂h₂

Expected output: E[y] = 0.25·0 + 0.25·2w₂h₂ + 0.25·2w₁h₁ + 0.25·(2w₁h₁+2w₂h₂) = w₁h₁ + w₂h₂

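This matches the inference-time output (where dropout is off)! The inverted scaling makes the expected training output equal to the inference output. The match is exact here because y is linear in the masked activations, so linearity of expectation applies. When nonlinear layers follow the dropout, the inference output only approximates the average over sub-networks.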
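The enumeration generalizes to any number of neurons and any dropout rate. Here is a minimal sketch that computes the exact expectation by looping over all masks; the function name and the sample weights and activations are assumptions.

```python
from itertools import product


def expected_dropout_output(w, h, p):
    """Exact E[y] for y = sum_i w_i * h_i * m_i / (1 - p), enumerating all masks."""
    keep = 1.0 - p
    expected = 0.0
    for mask in product((0, 1), repeat=len(w)):
        prob = 1.0
        for m in mask:
            prob *= keep if m else p       # probability of this particular mask
        y = sum(wi * hi * m / keep for wi, hi, m in zip(w, h, mask))
        expected += prob * y
    return expected


w, h = [0.3, -1.2], [2.0, 1.5]             # hypothetical weights and activations
print(expected_dropout_output(w, h, p=0.5))
print(sum(wi * hi for wi, hi in zip(w, h)))  # inference-time output
```

The two printed values should agree up to floating-point rounding, for this p and for any other p in [0, 1).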

Example 3: Label Smoothing Gradient

Problem: For a 3-class problem with true class 1, compare the CCE gradients with respect to the logits z, with and without label smoothing (ε = 0.1). Let ŷ = softmax(z) = [0.7, 0.2, 0.1]ᵀ.

Solution:

Without smoothing: y = [1, 0, 0]ᵀ
∇_z L = ŷ − y = [−0.3, 0.2, 0.1]ᵀ

With smoothing (ε = 0.1, K = 3): ỹ = [0.9, 0.05, 0.05]ᵀ
∇_z L_smooth = ŷ − ỹ = [−0.2, 0.15, 0.05]ᵀ

Every component shrinks in magnitude: the push on the correct-class logit is smaller (0.2 vs 0.3), and so are the pushes on the incorrect-class logits (0.15 vs 0.2 and 0.05 vs 0.1). The smoothed gradient vanishes at ŷ = ỹ rather than at the one-hot vector, so the model is not driven toward overconfident predictions.
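The two gradients can be reproduced with the softmax cross-entropy identity ∇_z L = ŷ − target, which holds for any target distribution that sums to 1. The helper name below is an assumption.

```python
def ce_grad_wrt_logits(probs, target):
    """Gradient of cross-entropy w.r.t. the logits for a softmax output: probs - target."""
    return [p - t for p, t in zip(probs, target)]


y_hat = [0.7, 0.2, 0.1]
hard = [1.0, 0.0, 0.0]
soft = [0.9, 0.05, 0.05]          # eps = 0.1, K = 3
print(ce_grad_wrt_logits(y_hat, hard))
print(ce_grad_wrt_logits(y_hat, soft))
```

The printed vectors match the hand computation up to floating-point rounding.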

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: Show that for SGD, the L2-regularized update w ← (1−ηλ)w − η·∂L_data/∂w is equivalent to first computing the unregularized gradient and then applying weight decay.

Problem 2: Derive the optimal L1-regularized solution for f(w) = (1/2)(w − μ)² + λ|w|. Express the solution as a soft-thresholding operator.

Problem 3: With dropout probability p and inverted scaling, show that for a zero-mean activation a the dropout neuron's output has variance Var(a)/(1−p) during training, so dropout adds an excess variance of Var(a)·(p/(1−p)).

Problem 4: Prove that early stopping for a linear model with quadratic loss and gradient descent is equivalent to L2 regularization. (Sketch the proof conceptually.)

Problem 5: A model trained with label smoothing (ε = 0.1) achieves training cross-entropy of 1.5. However, the same model achieves training accuracy of 99%. Are these results consistent? Explain.

Answers

Problem 1: Weight decay shrinks the current weight w_t by the factor (1 − ηλ) and applies the unregularized gradient step:

w_{t+1} = (1 − ηλ)·w_t − η·∂L_data/∂w

The L2-regularized SGD step expands to the same expression:

w_{t+1} = w_t − η(∂L_data/∂w + λw_t) = (1 − ηλ)·w_t − η·∂L_data/∂w ✓

So the two are EXACTLY equivalent for SGD. Note the order of operations: if the decay is instead applied after the gradient step, w_tmp = w_t − η·∂L_data/∂w followed by w_{t+1} = (1 − ηλ)·w_tmp = (1 − ηλ)·w_t − η(1 − ηλ)·∂L_data/∂w, the gradient picks up an extra factor (1 − ηλ). That version differs from the L2 update only by a term of order η²λ.

Problem 2:
For w > 0: ∂f/∂w = w − μ + λ = 0 → w = μ − λ (valid only if μ > λ)
For w < 0: ∂f/∂w = w − μ − λ = 0 → w = μ + λ (valid only if μ < −λ)
At w = 0: the subdifferential is [−μ−λ, −μ+λ], which contains 0 exactly when |μ| ≤ λ.

Solution: w* = sign(μ)·max(|μ| − λ, 0)

This is the soft-thresholding operator. It shrinks μ toward zero by λ and returns exactly 0 whenever |μ| ≤ λ.

Problem 3: Let a have mean μ and variance σ². Dropout output: d = a·m/(1−p), where m ∼ Bernoulli(1−p) is independent of a.

E[d] = E[a]·E[m]/(1−p) = μ·(1−p)/(1−p) = μ ✓
E[d²] = E[a²]·E[m²]/(1−p)² = E[a²]·(1−p)/(1−p)² = (μ²+σ²)/(1−p)
Var(d) = E[d²] − E[d]² = (μ²+σ²)/(1−p) − μ² = σ²/(1−p) + μ²·p/(1−p)

If μ = 0: Var(d) = σ²/(1−p). Compared to no dropout (Var = σ²), the variance is inflated by the factor 1/(1−p), and the excess variance is σ²/(1−p) − σ² = σ²·p/(1−p). ✓

Problem 4: For the linear model ŷ = Xw with quadratic loss L = (1/2N)||y − Xw||², run gradient descent with step size η from w₀ = 0. Write S = XᵀX/N (assumed invertible) and let w* be the least-squares solution. Each step gives w_{t+1} − w* = (I − ηS)(w_t − w*), so after t steps:

w_t = (I − (I − ηS)^t)·w*

Along an eigenvector of S with eigenvalue s, the component of w* is scaled by 1 − (1 − ηs)^t. L2 regularization with strength λ scales the same component by s/(s + λ). Both factors are close to 1 in directions with large s and close to 0 in directions with small s, and they roughly match when λ ≈ 1/(ηt). As t → ∞, the effective λ → 0 (no regularization). Small t acts like strong L2 regularization.

Problem 5: Yes, these results are consistent. With label smoothing, the model is trained to never output probability 1.0, because the maximum target is 0.9. Even a model that matches the smoothed targets exactly has non-zero loss: measured against the hard labels its cross-entropy is −log(0.9) ≈ 0.105 per example, and measured against the smoothed targets it is higher still (the entropy of ỹ).

The observed cross-entropy of 1.5 is well above that floor, so the predicted probabilities are far from the targets on many examples. Accuracy, however, depends only on the argmax: if the model puts 0.6 on the correct class and 0.2 on each of two wrong classes, the prediction is correct while the hard-label cross-entropy is −log(0.6) ≈ 0.51. With more classes, the correct class can win the argmax with much lower probability. These metrics measure different things: accuracy measures decisions, cross-entropy measures the confidence assigned to the targets. They can diverge significantly.

Summary

  1. L2 regularization adds (λ/2)||w||² to the loss (Gaussian prior), shrinking all weights toward zero; equivalent to weight decay in SGD but NOT in adaptive optimizers (use AdamW).
  2. L1 regularization adds λ||w||₁ (Laplace prior), producing exact zeros → sparse solutions; gradient is constant λ·sign(w), unlike L2's diminishing gradient.
  3. Dropout randomly zeros neurons with probability p during training, scaling survivors by 1/(1−p); approximates ensemble averaging and prevents co-adaptation.
  4. Other techniques: Data augmentation (label-preserving transforms), early stopping (halt at optimal validation loss), label smoothing (soft targets prevent overconfidence).
  5. Choose L2 as default for deep networks; add dropout for fully-connected layers; use AdamW for Transformers; data augmentation is nearly always beneficial.

Pitfalls


Quiz

Q1: Which statement about L1 vs L2 regularization is TRUE?

A) L2 produces sparser solutions than L1
B) L1 gradient is proportional to |w| while L2 gradient is constant
C) L1 corresponds to a Laplace prior; L2 corresponds to a Gaussian prior
D) L1 and L2 are equivalent when using SGD

Answer and Explanations

Correct: C) L1 corresponds to a Laplace prior; L2 corresponds to a Gaussian prior

p(w) ∝ exp(−λ|w|) is Laplace (L1), p(w) ∝ exp(−λ||w||²/2) is Gaussian (L2). The MAP estimate under these priors gives the L1 and L2 regularized objectives.

- A) Incorrect. L1 produces sparser solutions (exact zeros); L2 produces small but non-zero weights.
- B) Incorrect. It's the opposite: L2 gradient ∝ w, L1 gradient = λ·sign(w) (constant magnitude).
- C) ✓ Correct. This Bayesian connection unifies regularization with prior beliefs about parameters.
- D) Incorrect. They have fundamentally different behaviors regardless of optimizer.

Q2: In inverted dropout with probability p = 0.5, what scaling is applied to surviving neurons during training?

A) Multiply by 0.5
B) Multiply by 2.0
C) Multiply by 1.0 (no scaling)
D) Multiply by 0.25

Answer and Explanations

Correct: B) Multiply by 2.0

Inverted dropout scales survivors by 1/(1−p) = 1/0.5 = 2.0. This ensures E[a_drop] = a, so no scaling is needed at inference time.

- A) That would be scaling DOWN, which would reduce expected output.
- B) ✓ Correct. 1/(1−p) = 2.0 compensates for dropping half the neurons.
- C) No scaling means E[a_drop] = 0.5·a, requiring rescaling at inference.
- D) 0.25 = p², which is the probability that BOTH of two neurons are dropped, not a scaling factor.

Q3: Why is AdamW preferred over standard Adam with L2 regularization?

A) AdamW is faster to compute
B) AdamW decouples weight decay from the adaptive gradient scaling, giving more consistent regularization
C) AdamW uses L1 instead of L2 regularization
D) AdamW doesn't require momentum

Answer and Explanations

Correct: B) AdamW decouples weight decay from the adaptive gradient scaling, giving more consistent regularization

In standard Adam, the L2 gradient λw gets divided by √(v̂) + ε along with the data gradient. The effective decay is therefore weaker for parameters with large gradient magnitudes (large v̂) and stronger for parameters with small ones. AdamW applies weight decay directly: w ← w − ηλw, independent of the adaptive gradient normalization.

- A) Incorrect. AdamW is essentially the same computational cost.
- B) ✓ Correct. Decoupling makes regularization strength consistent across all parameters.
- C) Incorrect. AdamW still uses L2-style weight decay (subtracting ηλw).
- D) Incorrect. AdamW retains momentum (it's still Adam).

Q4: Early stopping prevents overfitting by:

A) Adding noise to the weights
B) Halting training before the model memorizes noise in the training data
C) Reducing the number of parameters
D) Increasing the learning rate over time

Answer and Explanations

Correct: B) Halting training before the model memorizes noise in the training data

As training continues, the model first fits the signal (general patterns, validation loss decreases) and then the noise (idiosyncrasies of the training data, validation loss stops improving). Early stopping halts at the point where generalization is best.

- A) Incorrect. That's dropout or weight noise.
- B) ✓ Correct. Early stopping finds the training duration that optimizes validation performance.
- C) Incorrect. Early stopping doesn't change model architecture.
- D) Incorrect. Early stopping doesn't modify the learning rate.

Q5: With label smoothing parameter ε = 0.1 and K = 10 classes, what is the target probability for the correct class?

A) 1.0
B) 0.9
C) 0.99
D) 0.1

Answer and Explanations

Correct: B) 0.9

Target for correct class = 1 − ε = 0.9. The remaining ε = 0.1 is distributed among K−1 = 9 incorrect classes: each gets ε/(K−1) = 0.1/9 ≈ 0.0111. This prevents the model from being infinitely confident about any prediction.

- A) Without label smoothing.
- B) ✓ Correct. 1 − ε = 0.9.
- C) This would be the case if ε = 0.01.
- D) This is ε itself, the total mass shared by the incorrect classes; each incorrect class gets ε/(K−1) ≈ 0.0111.

Next Steps

Move on to 16-09 — Batch Normalization to learn how normalizing activations within each mini-batch stabilizes and accelerates training, making deep networks much easier to optimize.