16-04 — Loss Functions

Phase: 16 (Neural Network Mathematics)
Subject: 16-04
Prerequisites: 16-03 (Softmax), Phase 10-11 (probability, MLE), Phase 13 (cross-entropy)
Next subject: 16-05 (Backpropagation Mathematics)

Learning Objectives

By the end of this subject, you will be able to:

Derive and compute MSE, binary cross-entropy (BCE), categorical cross-entropy (CCE), hinge loss, and focal loss
Prove the connection between minimizing cross-entropy + softmax and maximizing likelihood (MLE)
Derive the gradient of each loss function with respect to logits
Explain why MSE is inappropriate for classification and what problems it causes
Understand focal loss as a modification of cross-entropy that down-weights easy examples

Core Content

1. What Is a Loss Function?

A loss function L(ŷ, y) measures how badly a model's prediction ŷ differs from the true label y. Training a neural network means finding parameters that minimize the average loss over the training set:

θ* = argmin_θ (1/N) Σᵢ L(f_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)

⚠️ THIS IS CRITICAL — The choice of loss function determines what the model optimizes for. A poorly chosen loss leads to a model that solves the wrong problem, even if training converges perfectly. The loss function IS the specification of what you want the model to learn.

2. Mean Squared Error (MSE) — Regression

For regression tasks where y is a continuous value:

MSE = (1/N) Σᵢ (ŷᵢ − yᵢ)²

Or for a single example:

L_MSE(ŷ, y) = (ŷ − y)²

Gradient: ∂L/∂ŷ = 2(ŷ − y)

Probabilistic interpretation: Minimizing MSE is equivalent to maximum likelihood estimation assuming the target follows a Gaussian distribution with mean ŷ and constant variance:

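p(y | x) = N(y; f_θ(x), σ²)

Negative log-likelihood: −log p(y | x) = (y − f_θ(x))² / (2σ²) + const

With σ² fixed, neither the additive constant nor the 1/(2σ²) scale changes the minimizer, so minimizing the negative log-likelihood is the same as minimizing the squared error.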

Why NOT MSE for classification? Consider binary classification with y ∈ {0, 1} using a sigmoid output. The gradient of MSE combined with sigmoid is:

∂L/∂z = ∂L/∂ŷ · ∂ŷ/∂z = 2(ŷ − y) · σ'(z)

Since σ'(z) ≤ 0.25 and approaches 0 when saturated, the gradient is weak even when predictions are WRONG. This is the "vanishing gradient at saturation" problem — the model learns slowly when it's most confused, which is exactly when it needs to learn fastest.
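The saturation effect can be seen numerically. The following is a minimal sketch using only the Python standard library; the function names `sigmoid` and `mse_sigmoid_grad` are illustrative choices, not part of any library. It evaluates ∂L/∂z = 2(ŷ − y)·σ'(z) for a positive example (y = 1) at increasingly wrong logits.

```python
import math

def sigmoid(z):
    """Logistic function σ(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def mse_sigmoid_grad(z, y):
    """dL/dz for L = (σ(z) - y)^2, using σ'(z) = σ(z)(1 - σ(z))."""
    y_hat = sigmoid(z)
    return 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)

# True label is 1, so more negative logits mean more confidently wrong predictions.
for z in (0.0, -2.0, -5.0, -10.0):
    print(f"z = {z:6.1f}  y_hat = {sigmoid(z):.5f}  dL/dz = {mse_sigmoid_grad(z, 1.0):.6f}")
```

The printed gradient magnitudes shrink as z becomes more negative, even though the prediction error ŷ − y keeps growing toward −1.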

3. Binary Cross-Entropy (BCE) — Binary Classification

For binary classification with y ∈ {0, 1}:

L_BCE(ŷ, y) = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]

Where ŷ is the predicted probability of the positive class (typically via sigmoid).

Intuitive meaning:
- If y = 1: L = −log(ŷ). Penalizes low ŷ heavily; as ŷ → 0, L → ∞.
- If y = 0: L = −log(1 − ŷ). Penalizes ŷ close to 1 heavily; as ŷ → 1, L → ∞.

Gradient (with sigmoid): When ŷ = σ(z):

∂L/∂z = ŷ − y

Derivation: ∂L/∂ŷ = −[y/ŷ + (1−y)·(−1)/(1−ŷ)] = −y/ŷ + (1−y)/(1−ŷ) = (ŷ − y)/(ŷ(1−ŷ))

∂ŷ/∂z = σ'(z) = ŷ(1−ŷ)

∂L/∂z = (ŷ − y)/(ŷ(1−ŷ)) · ŷ(1−ŷ) = ŷ − y ✓

This is a beautiful, clean gradient that is LINEAR in the prediction error. Compare with MSE+sigmoid where the gradient had the σ'(z) factor suppressing it at saturation.
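One way to sanity-check the result ∂L/∂z = ŷ − y is to compare it against a central finite difference of the loss. The sketch below assumes nothing beyond the formulas in this section; the helper names (`bce_from_logit`, `bce_grad`, `numeric_grad`) and the step size `h` are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """BCE loss written as a function of the logit z."""
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

def bce_grad(z, y):
    """Analytic gradient derived above: dL/dz = σ(z) - y."""
    return sigmoid(z) - y

def numeric_grad(f, z, h=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z, y in [(-2.0, 1.0), (3.0, 0.0), (0.5, 1.0)]:
    analytic = bce_grad(z, y)
    numeric = numeric_grad(lambda t: bce_from_logit(t, y), z)
    print(f"z = {z:5.1f}, y = {y:.0f}: analytic = {analytic:.6f}, numeric = {numeric:.6f}")
```

The two columns should agree to several decimal places for moderate logits; for very large |z| this naive loss formula becomes numerically fragile.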

4. Categorical Cross-Entropy (CCE) — Multi-Class

For multi-class classification with K classes, y is a one-hot vector and ŷ = softmax(z):

L_CCE(ŷ, y) = −Σₖ yₖ·log(ŷₖ)

Since y is one-hot (only y_c = 1 for the true class c), this simplifies to:

L_CCE(ŷ, c) = −log(ŷ_c)

The loss is simply the negative log of the predicted probability for the CORRECT class.

Gradient (with softmax): We derived in 16-03:

∂L/∂zₖ = ŷₖ − yₖ for all k

In vector form: ∇_z L = ŷ − y.

Connection to MLE: For a categorical distribution with softmax probabilities:

p(y = c | x) = ∏ₖ ŷₖ^{yₖ} = ŷ_c (since y is one-hot)

Negative log-likelihood: −log p(y | x) = −Σₖ yₖ·log(ŷₖ) = CCE loss.

Minimizing categorical cross-entropy ≡ Maximizing the likelihood of the data.
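The loss −log(ŷ_c) and the gradient ŷ − y can be computed together from the logits. This is a minimal standard-library sketch, not a reference implementation: the function names are illustrative, classes are indexed from 0 (so "class 2" in the text is index 1), and the loss is evaluated as log-sum-exp(z) − z_c, which is algebraically equal to −log(softmax(z)_c).

```python
import math

def softmax(z):
    """Softmax with the maximum subtracted first to avoid overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cce_loss_and_grad(z, true_class):
    """Return (loss, gradient w.r.t. logits) for one example.

    true_class is a 0-based class index (an assumption of this sketch).
    """
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    loss = log_sum_exp - z[true_class]          # equals -log(softmax(z)[true_class])
    probs = softmax(z)
    grad = [p - (1.0 if k == true_class else 0.0) for k, p in enumerate(probs)]
    return loss, grad

# Logits from Worked Example 2; the true class is the second one (index 1).
loss, grad = cce_loss_and_grad([2.0, 1.0, 0.1], true_class=1)
print(f"loss = {loss:.4f}")
print("grad =", [round(g, 4) for g in grad])
```

Because the softmax probabilities sum to 1 and the one-hot target sums to 1, the gradient components always sum to zero.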

5. Connection to KL Divergence

Cross-entropy can be decomposed as:

H(y, ŷ) = H(y) + D_KL(y || ŷ)

Where H(y) is the entropy of the true distribution (constant w.r.t. model parameters) and D_KL is the KL divergence. Therefore:

Minimizing cross-entropy ≡ Minimizing KL divergence between true and predicted distributions.

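Because H(y) does not depend on the model, the two objectives share the same minimizer: training pushes the predicted distribution as close as possible to the true distribution in the information-theoretic sense. For one-hot labels H(y) = 0, so the cross-entropy and the KL divergence are numerically equal.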
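The decomposition H(y, ŷ) = H(y) + D_KL(y || ŷ) is easy to confirm numerically. The sketch below uses two made-up distributions `p` and `q` chosen purely for illustration, and follows the usual convention that terms with zero true probability contribute nothing.

```python
import math

def entropy(p):
    """H(p) = -Σ p·log p, skipping zero-probability terms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -Σ p·log q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = Σ p·log(p/q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # illustrative "true" distribution
q = [0.5, 0.3, 0.2]   # illustrative predicted distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(f"H(p, q) = {lhs:.6f}, H(p) + KL = {rhs:.6f}")

# With a one-hot target the entropy term vanishes.
one_hot = [0.0, 1.0, 0.0]
print(f"one-hot: CE = {cross_entropy(one_hot, q):.6f}, KL = {kl_divergence(one_hot, q):.6f}")
```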

6. Hinge Loss — Maximum Margin Classification

Used in Support Vector Machines (SVMs) and some neural network contexts:

For binary classification with y ∈ {+1, −1}:

L_hinge(z, y) = max(0, 1 − y·z)

Where z is the raw score (before any activation).

Interpretation:
- If y·z ≥ 1 (correct prediction with sufficient margin): L = 0 (no loss)
- If y·z < 1 (margin violation): L = 1 − y·z (linear penalty)

The "hinge" name comes from the shape: zero loss when confidently correct, linear loss otherwise. The gradient is:

∂L/∂z = { 0    if y·z ≥ 1
        { −y   if y·z < 1

Key difference from cross-entropy: Hinge loss stops caring once the model is "correct enough" (margin ≥ 1), encouraging sparse solutions (only the support vectors matter). Cross-entropy always pushes for higher confidence, even when already correct.
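The hinge loss and its piecewise gradient translate directly into code. This sketch uses illustrative function names and follows the convention in the gradient formula above of returning 0 at the kink y·z = 1, where any value between −y and 0 is a valid subgradient.

```python
def hinge_loss(z, y):
    """Hinge loss for a raw score z and a label y in {+1, -1}."""
    return max(0.0, 1.0 - y * z)

def hinge_subgradient(z, y):
    """Subgradient dL/dz: -y inside the margin, 0 once y·z >= 1."""
    return -y if y * z < 1.0 else 0.0

# Positive example at several scores: the loss and gradient vanish once z >= 1.
for z in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(f"z = {z:4.1f}  loss = {hinge_loss(z, +1):.2f}  dL/dz = {hinge_subgradient(z, +1)}")
```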

7. Focal Loss — Handling Class Imbalance

Focal loss modifies cross-entropy to down-weight the contribution of easy examples:

L_focal(ŷ, y) = −(1 − ŷ_t)^γ · log(ŷ_t)

Where ŷ_t = ŷ if y = 1, or 1−ŷ if y = 0 (the predicted probability of the CORRECT class), and γ ≥ 0 is a focusing parameter.

How it works:
- When ŷ_t ≈ 1 (easy, well-classified example): (1 − ŷ_t)^γ ≈ 0 → loss heavily down-weighted
- When ŷ_t ≈ 0 (hard, misclassified example): (1 − ŷ_t)^γ ≈ 1 → loss unaffected

Effect of γ:
- γ = 0: Reduces to standard cross-entropy
- γ = 2: Common default; an example with ŷ_t = 0.9 is scaled by (0.1)² = 0.01, so it contributes 100× less than under CE
- Large γ: Extreme down-weighting of easy examples

Use case: Object detection with extreme foreground/background class imbalance. Most "background" examples are easy (the model quickly learns to say "no object"), and their loss would dominate. Focal loss lets the model focus on the few hard examples.
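The modulating factor can be explored with a few lines of code. This is a minimal sketch of the formula in this section only: it omits the class-balancing weight α used in many practical variants, and the name `focal_loss` and its default `gamma` are illustrative.

```python
import math

def focal_loss(y_hat, y, gamma=2.0):
    """Binary focal loss -(1 - p_t)^γ · log(p_t) for one example.

    y_hat is the predicted probability of the positive class, y is 0 or 1.
    """
    p_t = y_hat if y == 1 else 1.0 - y_hat   # probability assigned to the correct class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

for y_hat in (0.95, 0.5, 0.05):
    ce = focal_loss(y_hat, 1, gamma=0.0)      # γ = 0 recovers cross-entropy
    fl = focal_loss(y_hat, 1, gamma=2.0)
    print(f"y_hat = {y_hat:.2f}  CE = {ce:.4f}  focal = {fl:.6f}  CE/focal = {ce / fl:.1f}")
```

The ratio CE/focal equals 1/(1 − ŷ_t)^γ, so it grows rapidly as the example becomes easier.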

8. Loss Function Summary Table

Loss	Formula (single example)	Typical Output Activation	Gradient w.r.t. logits
MSE	(ŷ − y)²	Linear/Identity	2(ŷ − y) (since ŷ = z)
MSE + Sigmoid	(σ(z) − y)²	Sigmoid	2(σ(z)−y)·σ'(z)
BCE + Sigmoid	−[y·logσ(z) + (1−y)·log(1−σ(z))]	Sigmoid	σ(z) − y
CCE + Softmax	−Σ yₖ·log(softmax(z)ₖ)	Softmax	softmax(z) − y
Hinge	max(0, 1 − y·z)	Linear	−y if y·z < 1, else 0
Focal	−(1−σ(z))^γ·logσ(z) (for y=1)	Sigmoid	(1−σ(z))^γ·[γ·σ(z)·logσ(z) + σ(z) − 1] (for y=1; reduces to σ(z) − 1 at γ=0)

Key Terms

Focal loss
Hinge loss
Loss functions

Worked Examples

Example 1: BCE Gradient Computation

Problem: For a single binary classification example with y = 1 and logit z = −2 (before sigmoid), compute the BCE loss and its gradient ∂L/∂z.

Solution:

ŷ = σ(−2) = 1/(1 + e²) ≈ 0.1192

L = −[1·log(0.1192) + 0·log(0.8808)] = −log(0.1192) = −(−2.1269) = 2.1269

The model assigns only about 12% probability to the true class, so it is confidently wrong and the loss is high.

∂L/∂z = ŷ − y = 0.1192 − 1 = −0.8808

The gradient is strongly negative, pushing z upward (toward positive territory, where ŷ → 1). Compare with MSE: ∂L_MSE/∂z = 2(0.1192−1)·0.1192·(0.8808) ≈ −0.1849 — much smaller gradient despite the same error magnitude. BCE provides a much stronger learning signal.

Example 2: CCE with Three Classes

Problem: A 3-class classifier produces logits z = [2.0, 1.0, 0.1]ᵀ. The true class is class 2 (one-hot y = [0, 1, 0]ᵀ). Compute the loss and gradient.

Solution:

Step 1 (softmax):
ŷ₁ = e²/(e²+e¹+e^{0.1}) = 7.389/(7.389+2.718+1.105) = 7.389/11.212 ≈ 0.6590
ŷ₂ = 2.718/11.212 ≈ 0.2424
ŷ₃ = 1.105/11.212 ≈ 0.0986

Step 2 (loss): L = −log(0.2424) ≈ 1.417

Step 3 (gradient): ∇_z L = ŷ − y = [0.6590, 0.2424−1, 0.0986] = [0.6590, −0.7576, 0.0986]

Interpretation: The gradient pushes z₁ downward (it's overconfident), z₂ strongly upward (it's underconfident for the true class), and z₃ slightly downward.

Example 3: Focal Loss vs Cross-Entropy

Problem: For a well-classified example (y = 1, ŷ = 0.95) and a poorly classified example (y = 1, ŷ = 0.05), compare cross-entropy and focal loss (γ = 2).

Solution:

Easy example (ŷ = 0.95):
CE: −log(0.95) ≈ 0.0513
Focal: −(1−0.95)²·log(0.95) = −0.0025·(−0.0513) ≈ 0.000128

Focal loss is 400× smaller! The model practically ignores this example.

Hard example (ŷ = 0.05):
CE: −log(0.05) ≈ 2.996
Focal: −(1−0.05)²·log(0.05) = −0.9025·(−2.996) ≈ 2.704

Focal loss is only ~10% smaller. The model still learns from this example.

Bottom line: Focal loss preserves the learning signal for hard examples while nearly eliminating it for easy ones — exactly what you want under severe class imbalance.

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: Compute the MSE and BCE losses for a binary classification example where y = 0, the logit z = 3, and ŷ = σ(z). Compare the gradients.

Problem 2: Show that the gradient of CCE with softmax simplifies to ∂L/∂zₖ = ŷₖ − yₖ. Start from L = −Σⱼ yⱼ·log(ŷⱼ) and use the softmax Jacobian from 16-03.

Problem 3: For hinge loss L = max(0, 1 − y·z) with y = +1, plot L as a function of z. At what value of z does the loss become zero? What is the gradient for z < 1?

Problem 4: Prove that BCE loss is a proper scoring rule: the expected loss E_{y~p}[L(p̂, y)] is minimized when p̂ = p (the predicted probability equals the true probability).

Problem 5: For CCE on a dataset where class 0 appears 90% of the time and class 1 appears 10%, what is the loss of a naive model that always predicts ŷ = [0.9, 0.1]? What about a model that always predicts the majority class ŷ = [1, 0]? What does this tell you about evaluating models on imbalanced data?

Answers

Problem 1:
ŷ = σ(3) ≈ 0.9526
MSE: (0.9526 − 0)² ≈ 0.9074
BCE: −[0·log(0.9526) + 1·log(0.0474)] = −log(0.0474) ≈ 3.049
MSE gradient: 2(0.9526−0)·0.9526·0.0474 ≈ 0.0860
BCE gradient: 0.9526 − 0 = 0.9526
The BCE gradient is ~11× larger, a much stronger learning signal when the model is confidently wrong.

Problem 2:
∂L/∂zₖ = Σⱼ ∂L/∂ŷⱼ · ∂ŷⱼ/∂zₖ
= Σⱼ (−yⱼ/ŷⱼ) · ŷⱼ(δ_{jk} − ŷₖ)
= Σⱼ (−yⱼ)(δ_{jk} − ŷₖ)
= −yₖ + ŷₖ Σⱼ yⱼ
= ŷₖ − yₖ (since Σⱼ yⱼ = 1 for a one-hot vector) ✓

Problem 3:
For z ≥ 1: L = 0 (margin satisfied)
For z < 1: L = 1 − z (linear penalty)
The loss reaches zero at z = 1.
Gradient: ∂L/∂z = −1 for z < 1 and 0 for z > 1. At z = 1 the loss is not differentiable; the subdifferential there is the interval [−1, 0].

Problem 4:
E[L(p̂, y)] = p·(−log p̂) + (1−p)·(−log(1−p̂))
∂/∂p̂ E[L] = −p/p̂ + (1−p)/(1−p̂)
Set to zero: p/p̂ = (1−p)/(1−p̂) → p(1−p̂) = p̂(1−p) → p − p·p̂ = p̂ − p·p̂ → p = p̂
The second derivative, p/p̂² + (1−p)/(1−p̂)², is positive, confirming a minimum. ✓

Problem 5:
Always predict [0.9, 0.1]: L = −0.9·log(0.9) − 0.1·log(0.1) ≈ 0.325 (per-example average)
Always predict [1, 0]: every class-1 example (10% of the data) incurs −log(0) = ∞, so the expected loss −0.9·log(1) − 0.1·log(0) is infinite.
This illustrates that cross-entropy heavily penalizes predicting probability 0 for a class that sometimes occurs. A good model must assign non-zero probability to all observed outcomes. On imbalanced data, accuracy alone can be misleading (always predicting the majority class scores 90%); the loss provides a more nuanced measure.
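The results of Problems 4 and 5 can be checked numerically. The sketch below is an illustration only: it scans a grid of candidate predictions p̂ (the grid resolution of 0.001 is an arbitrary choice) and picks the one with the lowest expected BCE when the true positive rate is p = 0.1.

```python
import math

def expected_bce(p_hat, p):
    """Expected BCE when y = 1 with probability p and the model always predicts p_hat."""
    return -(p * math.log(p_hat) + (1.0 - p) * math.log(1.0 - p_hat))

p = 0.1                                        # true positive rate, as in Problem 5
grid = [i / 1000 for i in range(1, 1000)]      # candidates strictly inside (0, 1)
best = min(grid, key=lambda q: expected_bce(q, p))

print(f"minimizing p_hat on the grid: {best}")
print(f"expected loss at p_hat = p:   {expected_bce(p, p):.4f}")
print(f"expected loss at p_hat = 0.5: {expected_bce(0.5, p):.4f}")
```

The minimizer lands on the true rate, and the loss there matches the 0.325 computed in Problem 5.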

Summary

Loss functions measure prediction error; the choice determines what the model optimizes for — they ARE the learning specification.
MSE for regression corresponds to Gaussian MLE; using MSE with sigmoid for classification gives weak gradients when the model is wrong (the vanishing gradient at saturation problem).
BCE + sigmoid gives clean gradient ŷ − y; CCE + softmax also gives ŷ − y — both correspond to maximum likelihood estimation.
Hinge loss (max-margin) stops at "good enough" (margin ≥ 1), unlike CE which always pushes for more confidence.
Focal loss down-weights easy examples via (1−ŷ_t)^γ, focusing learning on hard cases — essential for severe class imbalance.

Pitfalls

Using MSE for classification tasks: MSE + sigmoid produces gradients containing the factor σ'(z), which is at most 0.25 and approaches 0 when the model is saturated. This means the model learns slowest when it's most confidently wrong, which is the worst possible behavior. Always use binary cross-entropy with sigmoid for binary classification and categorical cross-entropy with softmax for multi-class. If your classification model trains sluggishly, check that you haven't accidentally used MSELoss.
Not accounting for class imbalance when choosing a loss function: On a dataset with 99% negatives and 1% positives, a model trained with standard cross-entropy can reach 99% accuracy by always predicting "negative." The average loss may still look reasonable because the model gets most examples right. Use class-weighted cross-entropy, focal loss, or balanced sampling. Always check per-class metrics (precision, recall), not just overall loss, when classes are imbalanced.
Forgetting that cross-entropy penalizes zero-probability predictions with infinite loss: If ŷ_c = 0 for the true class c, then −log(0) = ∞. In practice, this manifests as NaN or extremely large loss values when the model assigns exactly zero probability to the correct class (e.g., due to numerical underflow in softmax). The log-sum-exp trick prevents this, but if you compute softmax and cross-entropy separately, use ε-clipping: torch.clamp(y_hat, min=1e-7, max=1-1e-7).
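Confusing binary cross-entropy with categorical cross-entropy: BCE works on a single probability per example (shape [N] or [N, 1]); CCE works on a vector of K class scores per example (shape [N, K]). Library implementations also differ in whether they apply the activation for you. In PyTorch, nn.BCELoss expects probabilities (sigmoid already applied), nn.BCEWithLogitsLoss takes raw logits and applies sigmoid internally, and nn.CrossEntropyLoss takes raw logits and applies log-softmax internally. Applying sigmoid or softmax twice, or passing targets in the wrong format (one-hot vectors where class indices are expected, or the reverse), can raise shape errors or, worse, produce silently wrong results.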
Using default focal loss hyperparameters without tuning: Focal loss with γ = 2 is a sensible default for object detection, but for severe imbalance (e.g., 1:1000), γ = 3 or 4 may be needed to adequately suppress easy examples. For mild imbalance, γ = 1 may suffice. The α class-balancing weight interacts with γ, so tune both together on a validation set rather than using fixed defaults.
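For the infinite-loss pitfall above, an alternative to clipping is to compute BCE directly from the logit so that log(0) never appears. The sketch below uses the standard rearrangement L = max(z, 0) − z·y + log(1 + e^{−|z|}), which is algebraically equal to BCE applied to σ(z). The function names are illustrative, and this is a standard-library sketch rather than any framework's actual implementation.

```python
import math

def bce_with_logits(z, y):
    """Numerically stable BCE computed from the logit z and target y in {0, 1}."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def naive_bce(z, y):
    """Direct formula: apply sigmoid, then take logs. Breaks down for large |z|."""
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

# Moderate logit: both versions agree (this is Worked Example 1).
print(f"stable = {bce_with_logits(-2.0, 1.0):.4f}, naive = {naive_bce(-2.0, 1.0):.4f}")

# Extreme logit: sigmoid rounds to exactly 1.0, so the naive version hits log(0).
try:
    naive_bce(40.0, 0.0)
except ValueError as err:
    print("naive version failed:", err)
print(f"stable version at z = 40, y = 0: {bce_with_logits(40.0, 0.0):.4f}")
```

The stable form only ever exponentiates a non-positive number, so it cannot overflow, and `log1p` keeps precision when that exponential is tiny.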

Quiz

Q1: Why is MSE a poor choice for binary classification when using sigmoid activation?

A) MSE is not differentiable
B) The gradient of MSE+sigmoid is suppressed by σ'(z), which is small when predictions are confidently wrong
C) MSE can produce negative values
D) MSE requires the target to be normally distributed

Answer and Explanations

Correct: B) The gradient of MSE+sigmoid is suppressed by σ'(z), which is small when predictions are confidently wrong

∂L_MSE/∂z = 2(ŷ − y)·σ'(z). When the model is confidently wrong (e.g., y=1, z=−5, ŷ≈0), the error ŷ−y is large BUT σ'(z)≈0, so the gradient nearly vanishes. The model learns slowest when it's most wrong.

- A) Incorrect. MSE is everywhere differentiable.
- B) ✓ Correct. The σ'(z) factor kills the gradient exactly when learning should be fastest.
- C) Incorrect. MSE is always non-negative.
- D) Incorrect. While MSE corresponds to Gaussian MLE, that doesn't prevent its use; the actual problem is gradient suppression.

Q2: What is the gradient ∂L/∂z of BCE + sigmoid for a single example?

A) σ(z)(1 − σ(z))
B) −y·log(σ(z))
C) σ(z) − y
D) (σ(z) − y)/σ'(z)

Answer and Explanations

Correct: C) σ(z) − y

The derivation: ∂L/∂ŷ = −y/ŷ + (1−y)/(1−ŷ) = (ŷ − y)/(ŷ(1−ŷ)), and ∂ŷ/∂z = ŷ(1−ŷ). The ŷ(1−ŷ) factors cancel, leaving ŷ − y. This gradient, linear in the prediction error, is why BCE trains faster than MSE for classification.

- A) Incorrect. That's σ'(z), not the full gradient.
- B) Incorrect. That's not a gradient; it's part of the loss.
- C) ✓ Correct. The gradient is simply prediction minus target.
- D) Incorrect. This is ∂L/∂ŷ, the gradient with respect to the prediction. Multiplying it by ∂ŷ/∂z = σ'(z) cancels the denominator and gives σ(z) − y.

Q3: Cross-entropy H(p, q) can be decomposed as:

A) H(p, q) = H(p) + H(q)
B) H(p, q) = H(p) − D_KL(p || q)
C) H(p, q) = H(p) + D_KL(p || q)
D) H(p, q) = H(q) + D_KL(q || p)

Answer and Explanations

Correct: C) H(p, q) = H(p) + D_KL(p || q)

H(p, q) = −Σ p(x)·log q(x) = −Σ p(x)·log p(x) − Σ p(x)·log(q(x)/p(x)) = H(p) + D_KL(p||q). Since H(p) does not depend on the model distribution q, minimizing cross-entropy is equivalent to minimizing D_KL(p||q).

- A) Incorrect. H(p)+H(q) is just the sum of the two entropies, with no term comparing p to q.
- B) Incorrect. It's addition, not subtraction.
- C) ✓ Correct. This decomposition connects CE loss to KL divergence minimization.
- D) Incorrect. H(q) + D_KL(q || p) equals H(q, p), the cross-entropy with the arguments swapped. The loss uses p (true) as the reference, not q.

Q4: Focal loss modifies cross-entropy by multiplying by (1 − ŷ_t)^γ. When γ = 0, what does focal loss reduce to?

A) Hinge loss
B) Mean squared error
C) Standard cross-entropy
D) Zero

Answer and Explanations

Correct: C) Standard cross-entropy

(1 − ŷ_t)^0 = 1 regardless of ŷ_t, so focal loss = −log(ŷ_t) = standard cross-entropy. Focal loss is a generalization that recovers CE at γ=0.

- A) Incorrect. Hinge loss is a different functional form entirely.
- B) Incorrect. MSE is quadratic, not logarithmic.
- C) ✓ Correct. γ=0 removes the modulating factor.
- D) Incorrect. The log term remains, so the loss is not zero.

Q5: For hinge loss L = max(0, 1 − y·z) with y = −1, what is the loss and gradient when z = 0.5?

A) L = 0, ∂L/∂z = 0
B) L = 1.5, ∂L/∂z = 1
C) L = 1.5, ∂L/∂z = −1
D) L = 0.5, ∂L/∂z = 0

Answer and Explanations

Correct: B) L = 1.5, ∂L/∂z = 1

y·z = (−1)·0.5 = −0.5 < 1, so the margin is violated. L = 1 − (−0.5) = 1.5. ∂L/∂z = −y = −(−1) = 1.

Note: A gradient-descent step moves z opposite to the gradient, so a positive ∂L/∂z pushes z lower. Since y = −1 (negative class), a more negative z makes y·z = −z larger, which is the correct direction. For a linear score z = wᵀx + b, the chain rule gives ∂L/∂w = (∂L/∂z)·x = x, so the weight update is w ← w − η·x.

- A) Incorrect. The margin is violated.
- B) ✓ Correct. L = 1.5, ∂L/∂z = −y = 1.
- C) Incorrect. ∂L/∂z = −y, not y. (If ∂L/∂z were −1, the update would push z higher, worsening the margin.)
- D) Incorrect. The loss is 1.5, not 0.5.

Next Steps

Move on to 16-05 — Backpropagation Mathematics to learn how gradients flow backward through a neural network using the chain rule — the central algorithm that makes deep learning possible.
