
16-04 — Loss Functions

Phase: 16 (Neural Network Mathematics)
Subject: 16-04
Prerequisites: 16-03 (Softmax), Phase 10-11 (probability, MLE), Phase 13 (cross-entropy)
Next subject: 16-05 (Backpropagation Mathematics)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive and compute MSE, binary cross-entropy (BCE), categorical cross-entropy (CCE), hinge loss, and focal loss
  2. Prove the connection between minimizing cross-entropy + softmax and maximizing likelihood (MLE)
  3. Derive the gradient of each loss function with respect to logits
  4. Explain why MSE is inappropriate for classification and what problems it causes
  5. Understand focal loss as a modification of cross-entropy that down-weights easy examples

Core Content

1. What Is a Loss Function?

A loss function L(ŷ, y) measures how badly a model's prediction ŷ differs from the true label y. Training a neural network means finding parameters that minimize the average loss over the training set:

θ* = argmin_θ (1/N) Σᵢ L(f_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)

⚠️ THIS IS CRITICAL — The choice of loss function determines what the model optimizes for. A poorly chosen loss leads to a model that solves the wrong problem, even if training converges perfectly. The loss function IS the specification of what you want the model to learn.
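To make the argmin concrete, here is a minimal sketch that minimizes an average loss by brute force. The one-parameter model `linear_model`, the toy data, and the candidate grid are all hypothetical choices made for illustration; real training uses gradient-based optimization rather than a grid search.

```python
def squared_error(y_hat, y):
    """Per-example loss L(y_hat, y)."""
    return (y_hat - y) ** 2


def linear_model(w, x):
    """Hypothetical one-parameter model f_w(x) = w * x."""
    return w * x


def mean_loss(w, xs, ys):
    """Average loss (1/N) * sum_i L(f_w(x_i), y_i)."""
    total = sum(squared_error(linear_model(w, x), y) for x, y in zip(xs, ys))
    return total / len(xs)


# Toy data generated by y = 2x (an assumption for this sketch).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Brute-force argmin over a grid of candidate parameters 0.0, 0.1, ..., 4.0.
candidates = [i / 10 for i in range(41)]
best_w = min(candidates, key=lambda w: mean_loss(w, xs, ys))

print(best_w, mean_loss(best_w, xs, ys))
```

Swapping `squared_error` for a different loss changes which parameter wins, which is the sense in which the loss is the specification.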

2. Mean Squared Error (MSE) — Regression

For regression tasks where y is a continuous value:

MSE = (1/N) Σᵢ (ŷᵢ − yᵢ)²

Or for a single example:

L_MSE(ŷ, y) = (ŷ − y)²

Gradient: ∂L/∂ŷ = 2(ŷ − y)

Probabilistic interpretation: Minimizing MSE is equivalent to maximum likelihood estimation assuming the target follows a Gaussian distribution with mean ŷ and constant variance:

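p(y | x) = N(y; f_θ(x), σ²)

Negative log-likelihood: −log p(y | x) = (y − f_θ(x))² / (2σ²) + const

Because σ² is fixed, the factor 1/(2σ²) and the additive constant do not change the minimizer, so minimizing the negative log-likelihood is the same as minimizing the squared error.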
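The equivalence can be checked numerically. The sketch below assumes an arbitrary target `y` and standard deviation `sigma` (both hypothetical values) and confirms that the Gaussian negative log-likelihood equals the squared error scaled by 1/(2σ²) plus a constant, so both pick the same best prediction.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log of the Gaussian density N(y; mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)


def squared_error(y, mu):
    return (y - mu) ** 2


y, sigma = 1.5, 0.7  # illustrative values, not from the text
const = 0.5 * math.log(2 * math.pi * sigma ** 2)  # does not depend on mu

predictions = [0.0, 1.0, 1.5, 3.0]
for mu in predictions:
    nll = gaussian_nll(y, mu, sigma)
    rescaled_se = squared_error(y, mu) / (2 * sigma ** 2) + const
    print(mu, nll, rescaled_se)

# Both criteria select the same prediction.
best_by_nll = min(predictions, key=lambda mu: gaussian_nll(y, mu, sigma))
best_by_se = min(predictions, key=lambda mu: squared_error(y, mu))
```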

Why NOT MSE for classification? Consider binary classification with y ∈ {0, 1} using a sigmoid output. The gradient of MSE combined with sigmoid is:

∂L/∂z = ∂L/∂ŷ · ∂ŷ/∂z = 2(ŷ − y) · σ'(z)

Since σ'(z) ≤ 0.25 and approaches 0 when saturated, the gradient is weak even when predictions are WRONG. This is the "vanishing gradient at saturation" problem — the model learns slowly when it's most confused, which is exactly when it needs to learn fastest.
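A short sketch of the saturation effect, using the identity σ'(z) = σ(z)(1 − σ(z)). The logit values are arbitrary illustrative choices; the true label is fixed at y = 1, so more negative logits mean more confidently wrong predictions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_sigmoid_grad(z, y):
    """dL/dz for L = (sigmoid(z) - y)^2, via the chain rule."""
    y_hat = sigmoid(z)
    return 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)


y = 1.0
for z in [0.0, -2.0, -5.0, -10.0]:
    # The error (y_hat - y) grows toward -1, yet the gradient shrinks toward 0.
    print(z, sigmoid(z) - y, mse_sigmoid_grad(z, y))
```

The printed gradients shrink in magnitude as z becomes more negative, even though the error keeps growing.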

3. Binary Cross-Entropy (BCE) — Binary Classification

For binary classification with y ∈ {0, 1}:

L_BCE(ŷ, y) = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]

Where ŷ is the predicted probability of the positive class (typically via sigmoid).

Intuitive meaning:

- If y = 1: L = −log(ŷ). Penalizes low ŷ heavily; as ŷ → 0, L → ∞.
- If y = 0: L = −log(1 − ŷ). Penalizes ŷ close to 1 heavily; as ŷ → 1, L → ∞.

Gradient (with sigmoid): When ŷ = σ(z):

∂L/∂z = ŷ − y

Derivation: ∂L/∂ŷ = −[y/ŷ + (1−y)·(−1)/(1−ŷ)] = −y/ŷ + (1−y)/(1−ŷ) = (ŷ − y)/(ŷ(1−ŷ))

∂ŷ/∂z = σ'(z) = ŷ(1−ŷ)

∂L/∂z = (ŷ − y)/(ŷ(1−ŷ)) · ŷ(1−ŷ) = ŷ − y ✓

This is a beautiful, clean gradient that is LINEAR in the prediction error. Compare with MSE+sigmoid where the gradient had the σ'(z) factor suppressing it at saturation.
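The result ∂L/∂z = ŷ − y can be sanity-checked against a central finite difference. This is a minimal sketch; the test points and the step size `eps` are arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(z, y):
    """BCE loss with y_hat = sigmoid(z)."""
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def numerical_grad(f, z, eps=1e-6):
    """Central finite-difference approximation of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


results = []
for z, y in [(-2.0, 1), (3.0, 0), (0.5, 1)]:
    analytic = sigmoid(z) - y
    numeric = numerical_grad(lambda t: bce_from_logit(t, y), z)
    results.append((analytic, numeric))
    print(z, y, analytic, numeric)
```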

4. Categorical Cross-Entropy (CCE) — Multi-Class

For multi-class classification with K classes, y is a one-hot vector and ŷ = softmax(z):

L_CCE(ŷ, y) = −Σₖ yₖ·log(ŷₖ)

Since y is one-hot (only y_c = 1 for the true class c), this simplifies to:

L_CCE(ŷ, c) = −log(ŷ_c)

The loss is simply the negative log of the predicted probability for the CORRECT class.

Gradient (with softmax): We derived in 16-03:

∂L/∂zₖ = ŷₖ − yₖ for all k

In vector form: ∇_z L = ŷ − y.

Connection to MLE: For a categorical distribution with softmax probabilities:

p(y = c | x) = ∏ₖ ŷₖ^{yₖ} = ŷ_c (since y is one-hot)

Negative log-likelihood: −log p(y | x) = −Σₖ yₖ·log(ŷₖ) = CCE loss.

Minimizing categorical cross-entropy ≡ Maximizing the likelihood of the data.
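The following sketch computes the CCE loss directly from logits and checks the gradient ŷ − y against finite differences. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick added here, not something the text above prescribes; the logits and class index are arbitrary.

```python
import math


def softmax(z):
    m = max(z)  # shift by the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cce_from_logits(z, c):
    """-log(softmax(z)[c]) computed as logsumexp(z) - z[c]."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[c]


def cce_grad(z, c):
    """Analytic gradient: softmax(z) minus the one-hot target."""
    return [p - (1.0 if k == c else 0.0) for k, p in enumerate(softmax(z))]


def numerical_grad(z, c, eps=1e-6):
    grads = []
    for k in range(len(z)):
        up = list(z)
        down = list(z)
        up[k] += eps
        down[k] -= eps
        grads.append((cce_from_logits(up, c) - cce_from_logits(down, c)) / (2 * eps))
    return grads


z = [0.5, -1.2, 2.0, 0.0]  # illustrative logits
c = 2                      # index of the true class

print(cce_from_logits(z, c))
print(cce_grad(z, c))
print(numerical_grad(z, c))
```

Because the softmax probabilities sum to 1 and the one-hot target sums to 1, the gradient components always sum to zero.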

5. Connection to KL Divergence

Cross-entropy can be decomposed as:

H(y, ŷ) = H(y) + D_KL(y || ŷ)

Where H(y) is the entropy of the true distribution (constant w.r.t. model parameters) and D_KL is the KL divergence. Therefore:

Minimizing cross-entropy ≡ Minimizing KL divergence between true and predicted distributions.

Since H(y) is constant, minimizing CE is equivalent to minimizing D_KL(y || ŷ) — making the predicted distribution as close as possible to the true distribution in the information-theoretic sense.
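A quick numerical check of the decomposition, using two arbitrary distributions `p` and `q` chosen for illustration. For a one-hot target the entropy term vanishes, so cross-entropy and KL divergence coincide.

```python
import math


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]  # "true" distribution (illustrative)
q = [0.5, 0.3, 0.2]  # predicted distribution (illustrative)

print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))

# One-hot target: H(y) = 0, so cross-entropy equals the KL divergence.
one_hot = [0.0, 1.0, 0.0]
print(entropy(one_hot), cross_entropy(one_hot, q), kl_divergence(one_hot, q))
```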

6. Hinge Loss — Maximum Margin Classification

Used in Support Vector Machines (SVMs) and some neural network contexts:

For binary classification with y ∈ {+1, −1}:

L_hinge(z, y) = max(0, 1 − y·z)

Where z is the raw score (before any activation).

Interpretation:

- If y·z ≥ 1 (correct prediction with sufficient margin): L = 0 (no loss)
- If y·z < 1 (margin violation): L = 1 − y·z (linear penalty)

The "hinge" name comes from the shape: zero loss when confidently correct, linear loss otherwise. The gradient is:

∂L/∂z = {  0   if y·z ≥ 1
        { −y   if y·z < 1

Key difference from cross-entropy: Hinge loss stops caring once the model is "correct enough" (margin ≥ 1), encouraging sparse solutions (only the support vectors matter). Cross-entropy always pushes for higher confidence, even when already correct.
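A minimal sketch of hinge loss and its gradient for a single score. At the kink y·z = 1 the loss is not differentiable; this sketch returns 0 there, matching the piecewise definition above, which is one valid subgradient choice.

```python
def hinge_loss(z, y):
    """Hinge loss for raw score z and label y in {+1, -1}."""
    return max(0.0, 1.0 - y * z)


def hinge_subgrad(z, y):
    """A subgradient of the hinge loss with respect to z."""
    return -y if y * z < 1.0 else 0.0


# (score, label) pairs: inside the margin, beyond the margin, wrong side.
for z, y in [(0.5, 1.0), (2.0, 1.0), (0.5, -1.0)]:
    print(z, y, hinge_loss(z, y), hinge_subgrad(z, y))
```

Once y·z reaches 1, both the loss and the gradient are exactly zero, so such examples stop influencing the parameters.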

7. Focal Loss — Handling Class Imbalance

Focal loss modifies cross-entropy to down-weight the contribution of easy examples:

L_focal(ŷ, y) = −(1 − ŷ_t)^γ · log(ŷ_t)

Where ŷ_t = ŷ if y = 1, or 1−ŷ if y = 0 (the predicted probability of the CORRECT class), and γ ≥ 0 is a focusing parameter.

How it works:

- When ŷ_t ≈ 1 (easy, well-classified example): (1 − ŷ_t)^γ ≈ 0 → loss heavily down-weighted
- When ŷ_t ≈ 0 (hard, misclassified example): (1 − ŷ_t)^γ ≈ 1 → loss unaffected

Effect of γ:

- γ = 0: Reduces to standard cross-entropy
- γ = 2: Common default. The down-weighting depends on ŷ_t: an example with ŷ_t = 0.9 contributes (1 − 0.9)² = 0.01 of its CE loss, i.e. 100× less
- Large γ: Extreme down-weighting of easy examples

Use case: Object detection with extreme foreground/background class imbalance. Most "background" examples are easy (the model quickly learns to say "no object"), and their loss would dominate. Focal loss lets the model focus on the few hard examples.
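The sketch below implements the focal loss formula above and, as an addition not stated in the text, a gradient with respect to the logit obtained by applying the chain rule with ŷ = σ(z). That gradient formula is this sketch's own derivation, so it is checked against finite differences rather than taken on trust. The probabilities and logits used are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(y_hat, y, gamma=2.0):
    """-(1 - p_t)^gamma * log(p_t), with p_t the probability of the true class."""
    p_t = y_hat if y == 1 else 1.0 - y_hat
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def focal_grad_logit(z, y, gamma=2.0):
    """d(focal)/dz for y_hat = sigmoid(z), derived here via the chain rule."""
    y_hat = sigmoid(z)
    p_t = y_hat if y == 1 else 1.0 - y_hat
    sign = 1.0 if y == 1 else -1.0  # dp_t/dz = sign * p_t * (1 - p_t)
    return sign * (1.0 - p_t) ** gamma * (gamma * p_t * math.log(p_t) - (1.0 - p_t))


def numerical_grad(z, y, gamma, eps=1e-6):
    up = focal_loss(sigmoid(z + eps), y, gamma)
    down = focal_loss(sigmoid(z - eps), y, gamma)
    return (up - down) / (2 * eps)


# gamma = 0 recovers cross-entropy; gamma = 2 shrinks the easy example far more.
for y_hat in [0.9, 0.1]:
    ce = focal_loss(y_hat, 1, gamma=0.0)
    fl = focal_loss(y_hat, 1, gamma=2.0)
    print(y_hat, ce, fl, ce / fl)

# Gradient check at a few logits for both labels.
for z, y in [(0.3, 1), (-1.5, 1), (2.0, 0)]:
    print(z, y, focal_grad_logit(z, y), numerical_grad(z, y, 2.0))
```

With γ = 0 the gradient function reduces to σ(z) − y, the BCE gradient, which is a useful consistency check on the derivation.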

8. Loss Function Summary Table

| Loss | Formula (single example) | Typical Output Activation | Gradient w.r.t. logit z |
| --- | --- | --- | --- |
| MSE | (ŷ − y)² | Linear/Identity | 2(ŷ − y), since ŷ = z |
| MSE + Sigmoid | (σ(z) − y)² | Sigmoid | 2(σ(z) − y)·σ'(z) |
| BCE + Sigmoid | −[y·log σ(z) + (1−y)·log(1−σ(z))] | Sigmoid | σ(z) − y |
| CCE + Softmax | −Σₖ yₖ·log(softmax(z)ₖ) | Softmax | softmax(z) − y |
| Hinge | max(0, 1 − y·z) | Linear | 0 if y·z ≥ 1, else −y |
| Focal | −(1−σ(z))^γ·log σ(z) (for y=1) | Sigmoid | (1−σ(z))^γ·[γ·σ(z)·log σ(z) − (1−σ(z))] (for y=1); reduces to σ(z) − 1 at γ = 0 |


Key Terms

Worked Examples

Example 1: BCE Gradient Computation

Problem: For a single binary classification example with y = 1 and logit z = −2 (before sigmoid), compute the BCE loss and its gradient ∂L/∂z.

Solution:

ŷ = σ(−2) = 1/(1 + e²) ≈ 0.1192

L = −[1·log(0.1192) + 0·log(0.8808)] = −log(0.1192) = −(−2.1269) = 2.1269

The model is confidently wrong (it assigns only about 12% probability to the true class), so the loss is high.

∂L/∂z = ŷ − y = 0.1192 − 1 = −0.8808

The gradient is strongly negative, so a gradient-descent step pushes z upward (toward positive territory, where ŷ → 1). Compare with MSE: ∂L_MSE/∂z = 2(0.1192 − 1)·0.1192·0.8808 ≈ −0.1850, a gradient almost five times smaller for the same prediction error. BCE provides a much stronger learning signal.

Example 2: CCE with Three Classes

Problem: A 3-class classifier produces logits z = [2.0, 1.0, 0.1]ᵀ. The true class is class 2 (one-hot y = [0, 1, 0]ᵀ). Compute the loss and gradient.

Solution:

Step 1 (Softmax):

ŷ₁ = e²/(e² + e¹ + e^{0.1}) = 7.389/(7.389 + 2.718 + 1.105) = 7.389/11.212 ≈ 0.6590
ŷ₂ = 2.718/11.212 ≈ 0.2424
ŷ₃ = 1.105/11.212 ≈ 0.0986

Step 2 (Loss): L = −log(ŷ₂) = −log(0.2424) ≈ 1.417

Step 3 (Gradient): ∇_z L = ŷ − y = [0.6590 − 0, 0.2424 − 1, 0.0986 − 0] = [0.6590, −0.7576, 0.0986]

Interpretation: The gradient pushes z₁ downward (it's overconfident), z₂ strongly upward (it's underconfident for the true class), and z₃ slightly downward.

Example 3: Focal Loss vs Cross-Entropy

Problem: For a well-classified example (y = 1, ŷ = 0.95) and a poorly classified example (y = 1, ŷ = 0.05), compare cross-entropy and focal loss (γ = 2).

Solution:

Easy example (ŷ = 0.95):

CE: −log(0.95) ≈ 0.0513
Focal: −(1 − 0.95)²·log(0.95) = −0.0025·(−0.0513) ≈ 0.000128

Focal loss is 400× smaller! The model practically ignores this example.

Hard example (ŷ = 0.05):

CE: −log(0.05) ≈ 2.996
Focal: −(1 − 0.05)²·log(0.05) = −0.9025·(−2.996) ≈ 2.704

Focal loss is only ~10% smaller. The model still learns from this example.

Bottom line: Focal loss preserves the learning signal for hard examples while nearly eliminating it for easy ones — exactly what you want under severe class imbalance.

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: Compute the MSE and BCE losses for a binary classification example where y = 0, the logit z = 3, and ŷ = σ(z). Compare the gradients.

Problem 2: Show that the gradient of CCE with softmax simplifies to ∂L/∂zₖ = ŷₖ − yₖ. Start from L = −Σⱼ yⱼ·log(ŷⱼ) and use the softmax Jacobian from 16-03.

Problem 3: For hinge loss L = max(0, 1 − y·z) with y = +1, plot L as a function of z. At what value of z does the loss become zero? What is the gradient for z < 1?

Problem 4: Prove that BCE loss is a proper scoring rule: the expected loss E_{y~p}[L(p̂, y)] is minimized when p̂ = p (the predicted probability equals the true probability).

Problem 5: For CCE on a dataset where class 0 appears 90% of the time and class 1 appears 10%, what is the loss of a naive model that always predicts ŷ = [0.9, 0.1]? What about a model that always predicts the majority class ŷ = [1, 0]? What does this tell you about evaluating models on imbalanced data?

Answers

Problem 1:

ŷ = σ(3) ≈ 0.9526
MSE: (0.9526 − 0)² ≈ 0.9074
BCE: −[0·log(0.9526) + 1·log(0.0474)] = −log(0.0474) ≈ 3.049
MSE gradient: 2(0.9526 − 0)·0.9526·0.0474 ≈ 0.0860
BCE gradient: 0.9526 − 0 = 0.9526

The BCE gradient is ~11× larger, a much stronger learning signal when the model is confidently wrong.

Problem 2:

∂L/∂zₖ = Σⱼ ∂L/∂ŷⱼ · ∂ŷⱼ/∂zₖ
= Σⱼ (−yⱼ/ŷⱼ) · ŷⱼ(δ_{jk} − ŷₖ)
= Σⱼ (−yⱼ)(δ_{jk} − ŷₖ)
= −yₖ + ŷₖ Σⱼ yⱼ
= ŷₖ − yₖ

(Since Σⱼ yⱼ = 1 for a one-hot vector.) ✓

Problem 3:

For z ≥ 1: L = 0 (margin satisfied)
For z < 1: L = 1 − z (linear penalty)

The loss reaches zero at z = 1. Gradient: ∂L/∂z = −1 for z < 1 and 0 for z > 1. At z = 1 the loss is not differentiable; any value in [−1, 0] is a valid subgradient.

Problem 4:

E[L(p̂, y)] = p·(−log p̂) + (1−p)·(−log(1−p̂))
∂/∂p̂ E[L] = −p/p̂ + (1−p)/(1−p̂)

Set to zero: p/p̂ = (1−p)/(1−p̂) → p(1−p̂) = p̂(1−p) → p − p·p̂ = p̂ − p·p̂ → p = p̂

The second derivative, p/p̂² + (1−p)/(1−p̂)², is positive for all p̂ in (0, 1), confirming a minimum. ✓

Problem 5:

Always predict [0.9, 0.1]: L = −0.9·log(0.9) − 0.1·log(0.1) ≈ 0.325 (per-example average)

Always predict [1, 0]: L = −0.9·log(1) − 0.1·log(0). On the 10% of examples that belong to class 1 the loss is −log(0) = ∞, so the expected loss is infinite.

This illustrates that cross-entropy heavily penalizes predicting probability 0 for a class that sometimes occurs. A good model must assign non-zero probability to all observed outcomes. On imbalanced data, reporting accuracy can be misleading (always predicting the majority class gives 90% accuracy), whereas the loss provides a more nuanced measure.

Summary

  1. Loss functions measure prediction error; the choice determines what the model optimizes for — they ARE the learning specification.
  2. MSE for regression corresponds to Gaussian MLE; using MSE with sigmoid for classification gives weak gradients when the model is wrong (the vanishing gradient at saturation problem).
  3. BCE + sigmoid gives the clean gradient ŷ − y; CCE + softmax gives the same form, ŷ − y, in vector notation. Both correspond to maximum likelihood estimation.
  4. Hinge loss (max-margin) stops at "good enough" (margin ≥ 1), unlike CE which always pushes for more confidence.
  5. Focal loss down-weights easy examples via (1−ŷ_t)^γ, focusing learning on hard cases — essential for severe class imbalance.

Pitfalls


Quiz

Q1: Why is MSE a poor choice for binary classification when using sigmoid activation?

A) MSE is not differentiable
B) The gradient of MSE+sigmoid is suppressed by σ'(z), which is small when predictions are confidently wrong
C) MSE can produce negative values
D) MSE requires the target to be normally distributed

Answer and Explanations

Correct: B) The gradient of MSE+sigmoid is suppressed by σ'(z), which is small when predictions are confidently wrong

∂L_MSE/∂z = 2(ŷ − y)·σ'(z). When the model is confidently wrong (e.g., y=1, z=−5, ŷ≈0), the error ŷ−y is large BUT σ'(z)≈0, so the gradient nearly vanishes. The model learns slowest when it's most wrong.

- A) Incorrect. MSE is everywhere differentiable.
- B) ✓ Correct. The σ'(z) factor kills the gradient exactly when learning should be fastest.
- C) Incorrect. MSE is always non-negative.
- D) Incorrect. While MSE corresponds to Gaussian MLE, that doesn't prevent its use; the actual problem is gradient suppression.

Q2: What is the gradient ∂L/∂z of BCE + sigmoid for a single example?

A) σ(z)(1 − σ(z))
B) −y·log(σ(z))
C) σ(z) − y
D) (σ(z) − y)/σ'(z)

Answer and Explanations

Correct: C) σ(z) − y

The derivation: ∂L/∂ŷ = −y/ŷ + (1−y)/(1−ŷ) = (ŷ − y)/(ŷ(1−ŷ)), and ∂ŷ/∂z = ŷ(1−ŷ). The ŷ(1−ŷ) factors cancel, leaving ŷ − y. This linear gradient is why BCE trains faster than MSE for classification.

- A) Incorrect. That's σ'(z), not the full gradient.
- B) Incorrect. That's not a gradient; it's part of the loss.
- C) ✓ Correct. The gradient is simply prediction minus target.
- D) Incorrect. This is ∂L/∂ŷ, the gradient with respect to the prediction before multiplying by ∂ŷ/∂z = σ'(z). It equals ŷ − y only after that multiplication.

Q3: Cross-entropy H(p, q) can be decomposed as:

A) H(p, q) = H(p) + H(q)
B) H(p, q) = H(p) − D_KL(p || q)
C) H(p, q) = H(p) + D_KL(p || q)
D) H(p, q) = H(q) + D_KL(q || p)

Answer and Explanations

Correct: C) H(p, q) = H(p) + D_KL(p || q)

H(p, q) = −Σ p(x)·log q(x) = −Σ p(x)·log p(x) − Σ p(x)·log(q(x)/p(x)) = H(p) + D_KL(p || q). Since H(p) does not depend on the model distribution q, minimizing cross-entropy is equivalent to minimizing D_KL(p || q).

- A) Incorrect. H(p) + H(q) is just a sum of two separate entropies, with no term comparing p to q.
- B) Incorrect. It's addition, not subtraction.
- C) ✓ Correct. This decomposition connects CE loss to KL divergence minimization.
- D) Incorrect. This expression equals H(q, p), with the roles of p and q swapped; the decomposition of H(p, q) uses p (true) as the reference.

Q4: Focal loss modifies cross-entropy by multiplying by (1 − ŷ_t)^γ. When γ = 0, what does focal loss reduce to?

A) Hinge loss
B) Mean squared error
C) Standard cross-entropy
D) Zero

Answer and Explanations

Correct: C) Standard cross-entropy

(1 − ŷ_t)^0 = 1 regardless of ŷ_t, so focal loss = −log(ŷ_t) = standard cross-entropy. Focal loss is a generalization that recovers CE at γ=0.

- A) Incorrect. Hinge loss is a different functional form entirely.
- B) Incorrect. MSE is quadratic, not logarithmic.
- C) ✓ Correct. γ=0 removes the modulating factor.
- D) Incorrect. The log term remains; it's not zero.

Q5: For hinge loss L = max(0, 1 − y·z) with y = −1, what is the loss and gradient when z = 0.5?

A) L = 0, ∂L/∂z = 0
B) L = 1.5, ∂L/∂z = 1
C) L = 1.5, ∂L/∂z = −1
D) L = 0.5, ∂L/∂z = 0

Answer and Explanations

Correct: B) L = 1.5, ∂L/∂z = 1

y·z = (−1)·0.5 = −0.5 < 1, so the margin is violated. L = 1 − (−0.5) = 1.5. ∂L/∂z = −y = −(−1) = 1.

Note: Since y = −1 (negative class), z = wᵀx + b should be pushed downward (more negative) so that y·z = −z becomes larger. Gradient descent moves against the gradient: by the chain rule ∂L/∂w = ∂L/∂z · x = x, so the update w ← w − η·x lowers z by η‖x‖², which is the correct direction.

- A) Incorrect. Margin is violated.
- B) ✓ Correct. L = 1.5, ∂L/∂z = −y = 1.
- C) Incorrect. ∂L/∂z = −y, not y. (If ∂L/∂z = −1, gradient descent would push z higher, worsening the margin.)
- D) Incorrect. Loss is 1.5, not 0.5.

Next Steps

Move on to 16-05 — Backpropagation Mathematics to learn how gradients flow backward through a neural network using the chain rule — the central algorithm that makes deep learning possible.