### 13.5 — Cross-Entropy and Applications
Phase: Information Theory
Prerequisites: 13-04-kl-divergence, 13-01-entropy
Learning Objectives
By the end of this subject, you will be able to:
- Explain why cross-entropy is the standard loss function for classification
- Derive the relationship between cross-entropy minimisation and maximum likelihood
- Compute and interpret perplexity as a measure of language model quality
- Apply bits-per-character and bits-per-dimension metrics
- Recognise when cross-entropy is preferred over accuracy for model evaluation
Core Content
⚠️ CRITICAL: Cross-Entropy as Maximum Likelihood
Minimising cross-entropy loss is mathematically equivalent to maximising the likelihood (MLE).
For i.i.d. data $\{x_i\}_{i=1}^{n}$ and model $Q_\theta$:
Log-likelihood: $\ell(\theta) = \sum_{i=1}^{n} \log Q_\theta(x_i)$
Cross-entropy (empirical): $-\frac{1}{n}\sum_{i=1}^{n} \log Q_\theta(x_i) = -\frac{1}{n}\ell(\theta)$
Maximising $\ell(\theta)$ $\iff$ Minimising $-\frac{1}{n}\ell(\theta)$ $\iff$ Minimising cross-entropy.
This is why training a neural network with cross-entropy loss IS maximum likelihood estimation.
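The equivalence can be checked numerically. The following is a minimal sketch, assuming a Bernoulli model and a small made-up sample (the data, the parameter grid, and the function names are illustrative, not from the lesson): the parameter that maximises the log-likelihood is the same one that minimises the empirical cross-entropy.

```python
import math

def log_likelihood(theta, data):
    """Sum of log Q_theta(x_i) for a Bernoulli model with P(x=1) = theta."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

def empirical_cross_entropy(theta, data):
    """-(1/n) * log-likelihood, in nats."""
    return -log_likelihood(theta, data) / len(data)

# Hypothetical sample: 7 ones and 3 zeros.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Candidate parameters on a grid, avoiding 0 and 1 where the log diverges.
grid = [i / 100 for i in range(1, 100)]

theta_mle = max(grid, key=lambda t: log_likelihood(t, data))
theta_ce = min(grid, key=lambda t: empirical_cross_entropy(t, data))

# Both searches pick the same grid point, which is the sample mean (0.7).
print(theta_mle, theta_ce)
```

A grid search is used only to keep the sketch dependency-free; any optimiser would show the same agreement, because the two objectives differ only by the constant factor $-\frac{1}{n}$.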
Cross-Entropy in Classification
For $K$-class classification with true label $y$ and predicted probabilities $\hat{y}$:
$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
With one-hot encoding: $\mathcal{L}_{\text{CE}} = -\log \hat{y}_{\text{true}}$
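Gradient with softmax: When the predictions come from a softmax over logits $z$, i.e. $\hat{y} = \text{softmax}(z)$, the gradient of the cross-entropy with respect to each logit $z_k$ is remarkably simple: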
$$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{y}_k - y_k$$
This is the difference between predicted probability and true label — the network learns by correcting its mistakes proportionally to how wrong it is.
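One way to build confidence in this formula is a finite-difference check. The sketch below assumes arbitrary example logits and a one-hot target (both made up for illustration) and compares $\hat{y}_k - y_k$ against a central-difference estimate of the gradient.

```python
import math

def softmax(z):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """Cross-entropy (nats) between target y and softmax(z)."""
    y_hat = softmax(z)
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

def numerical_gradient(z, y, eps=1e-6):
    """Central-difference estimate of dL/dz_k for every k."""
    grad = []
    for k in range(len(z)):
        up = list(z)
        up[k] += eps
        down = list(z)
        down[k] -= eps
        grad.append((cross_entropy(up, y) - cross_entropy(down, y)) / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5]  # hypothetical logits
y = [0.0, 1.0, 0.0]   # one-hot target for class 1

analytic = [p - t for p, t in zip(softmax(z), y)]  # y_hat - y
numeric = numerical_gradient(z, y)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)  # tiny: the two gradients agree up to finite-difference error
```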
Perplexity
Perplexity is the standard evaluation metric for language models:
$$\text{PPL} = 2^{H(P, Q)} = 2^{-\frac{1}{N}\sum \log_2 Q(x_i)}$$
Interpretation: A perplexity of $K$ means the model is as confused on average as if it were choosing uniformly among $K$ equally likely options.
| Perplexity | Interpretation |
|---|---|
| 1 | Perfect prediction |
| 10 | Choosing among ~10 options each time |
| 100 | Very uncertain (~100 options) |
| 1000+ | Poor model |
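🚩 Common Pitfall: Perplexity is vocabulary- and tokeniser-dependent, so per-token perplexities from models with different vocabularies are not directly comparable. A 50k-vocabulary model typically covers more text per token than a 5k-vocabulary model, so an identical PPL of 50 generally corresponds to fewer bits per character, and therefore a better model, in the 50k case.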
Bits-per-Character (BPC) and Bits-per-Dimension
For comparing models across different tokenisers/vocabularies:
$$\text{BPC} = \frac{-\sum \log_2 P(\text{chars})}{\text{total characters}} = \frac{\text{total bits}}{\text{total characters}}$$
$$\text{BPD} = \frac{-\sum \log_2 P(\text{data})}{\text{total dimensions}}$$
These normalise by the actual amount of data, making comparisons fair across different tokenisation schemes.
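To illustrate why the normalisation matters, here is a small sketch with two hypothetical tokenisations of the same text. The token probabilities are invented so that both models spend exactly 10 bits in total; only the number of tokens differs.

```python
import math

def total_bits(token_probs):
    """Total code length of the sequence in bits."""
    return -sum(math.log2(p) for p in token_probs)

def perplexity(token_probs):
    """Per-token perplexity: 2 ** (bits per token)."""
    return 2 ** (total_bits(token_probs) / len(token_probs))

def bits_per_character(token_probs, text):
    """Total bits divided by the number of characters in the text."""
    return total_bits(token_probs) / len(text)

text = "the cat sat"  # 11 characters, spaces included

# Hypothetical word-level model: 3 tokens, 3 + 4 + 3 = 10 bits.
word_probs = [0.125, 0.0625, 0.125]
# Hypothetical subword model: 5 tokens, 5 * 2 = 10 bits.
subword_probs = [0.25, 0.25, 0.25, 0.25, 0.25]

ppl_word = perplexity(word_probs)
ppl_sub = perplexity(subword_probs)
bpc_word = bits_per_character(word_probs, text)
bpc_sub = bits_per_character(subword_probs, text)

print(ppl_word, ppl_sub)   # the subword model looks far better per token
print(bpc_word, bpc_sub)   # but both spend the same bits per character
```

The two models are equally good at compressing the text, which BPC reports correctly; the gap in perplexity comes purely from how many tokens the bits are averaged over.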
Why Cross-Entropy Beats Accuracy
| Metric | Granularity | Captures confidence? | Differentiable? |
|---|---|---|---|
| Accuracy | Binary (right/wrong) | No | No |
| Cross-entropy | Continuous | Yes | Yes |
A model that predicts the correct class with 51% confidence is scored the same as one with 99% confidence under accuracy. Cross-entropy penalises the uncertain prediction more heavily: $-\log 0.51 \approx 0.673$ nats versus $-\log 0.99 \approx 0.010$ nats.
Key Terms
- Accuracy
- Cross-entropy
- Perplexity
Worked Examples
Example 1: Cross-entropy computation
3-class problem, true label = class 0.
Model A predictions: $[0.8, 0.1, 0.1]$
Model B predictions: $[0.4, 0.3, 0.3]$
$\mathcal{L}_A = -\log 0.8 \approx 0.223$ nats
$\mathcal{L}_B = -\log 0.4 \approx 0.916$ nats
Both models classify this example correctly (the argmax is class 0), but Model A is much better: its loss is about 4× lower. Cross-entropy captures a difference in quality that accuracy misses.
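The same numbers can be reproduced in a few lines. This is a minimal sketch of Example 1; the helper names are illustrative.

```python
import math

def cross_entropy_one_hot(probs, true_class):
    """With a one-hot target, cross-entropy reduces to -log of the true-class probability."""
    return -math.log(probs[true_class])

def is_correct(probs, true_class):
    """Accuracy only asks whether the argmax matches the label."""
    return max(range(len(probs)), key=probs.__getitem__) == true_class

model_a = [0.8, 0.1, 0.1]
model_b = [0.4, 0.3, 0.3]
true_class = 0

loss_a = cross_entropy_one_hot(model_a, true_class)
loss_b = cross_entropy_one_hot(model_b, true_class)

print(is_correct(model_a, true_class), is_correct(model_b, true_class))  # True True
print(round(loss_a, 3), round(loss_b, 3))  # 0.223 0.916
```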
Example 2: Perplexity calculation
A language model assigns the following probabilities to a 5-word sequence:
| Word | $P(\text{word} \mid \text{context})$ |
|---|---|
| the | 0.15 |
| cat | 0.08 |
| sat | 0.12 |
| on | 0.20 |
| mat | 0.04 |
Cross-entropy (in bits): $-\frac{1}{5}(\log_2 0.15 + \log_2 0.08 + \log_2 0.12 + \log_2 0.20 + \log_2 0.04)$
$= -\frac{1}{5}(-2.737 - 3.644 - 3.059 - 2.322 - 4.644)$
$= -\frac{1}{5}(-16.406) = 3.281$ bits per word
Perplexity: $2^{3.281} \approx 9.72$
The model is as uncertain as choosing uniformly among ~10 options per word.
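The calculation in Example 2 can be verified with a short script. As a cross-check, the sketch also computes perplexity as the inverse geometric mean of the token probabilities, which is an equivalent form of the same definition.

```python
import math

# Conditional probabilities from the table in Example 2.
probs = {"the": 0.15, "cat": 0.08, "sat": 0.12, "on": 0.20, "mat": 0.04}

bits = [-math.log2(p) for p in probs.values()]
cross_entropy_bits = sum(bits) / len(bits)   # average bits per word
ppl = 2 ** cross_entropy_bits

# Equivalent form: inverse geometric mean of the probabilities.
ppl_geometric = math.prod(probs.values()) ** (-1 / len(probs))

print(round(cross_entropy_bits, 3))  # 3.281
print(round(ppl, 2))                 # 9.72
```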
Example 3: MLE = Cross-entropy minimisation
For a binary classifier with weight vector $w$:
$P(Y=1 \mid X) = \sigma(w^T X) = \hat{y}$, $P(Y=0 \mid X) = 1 - \hat{y}$
Log-likelihood for one example: $\ell = y\log\hat{y} + (1-y)\log(1-\hat{y})$
Negative log-likelihood: $-\ell = -[y\log\hat{y} + (1-y)\log(1-\hat{y})]$
This is exactly the binary cross-entropy loss. Minimising NLL $\iff$ maximising likelihood $\iff$ minimising cross-entropy.
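A quick numerical check of this identity, as a sketch: the weights and the three labelled examples below are made up for illustration. The negative log of the product of Bernoulli likelihoods matches the summed binary cross-entropy.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bernoulli_likelihood(y, y_hat):
    """P(Y = y | X) under the model: y_hat if y = 1, else 1 - y_hat."""
    return y_hat ** y * (1.0 - y_hat) ** (1 - y)

def binary_cross_entropy(y, y_hat):
    """-[y log y_hat + (1 - y) log(1 - y_hat)], in nats."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

w = [0.5, -1.0]  # hypothetical weights
examples = [([1.0, 2.0], 0), ([3.0, 0.5], 1), ([-1.0, -1.0], 1)]  # (features, label)

likelihood = 1.0
bce_total = 0.0
for x, y in examples:
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    likelihood *= bernoulli_likelihood(y, y_hat)
    bce_total += binary_cross_entropy(y, y_hat)

nll = -math.log(likelihood)
print(nll, bce_total)  # equal up to floating-point rounding
```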
Quiz
Q1: What does a perplexity of $K$ tell you about a language model?
A) On average, the model is as uncertain as if it were choosing uniformly among $K$ equally likely options
B) The model's vocabulary contains $K$ tokens
C) The model makes $K$ errors per sentence
D) The model needs $K$ bits per token
Correct: A)
- If you chose A: Correct! Perplexity is $2^{H(P,Q)}$, which can be read as an average branching factor of $K$ equally likely options.
- If you chose B: This is incorrect. Perplexity depends on the probabilities the model assigns, not on the vocabulary size itself.
- If you chose C: This is incorrect. Perplexity measures average uncertainty, not an error count.
- If you chose D: This is incorrect. The model needs $\log_2 K$ bits per token, not $K$ bits.
Q2: Which formula gives the empirical cross-entropy of a model $Q_\theta$ on i.i.d. data $\{x_i\}_{i=1}^{n}$?
A) $-\frac{1}{n}\sum_{i=1}^{n} \log Q_\theta(x_i)$
B) $\sum_{i=1}^{n} \log Q_\theta(x_i)$
C) $\frac{1}{n}\sum_{i=1}^{n} Q_\theta(x_i)$
D) $2^{\sum_{i=1}^{n} Q_\theta(x_i)}$
Correct: A)
- If you chose A: Correct! The empirical cross-entropy is the negative average log-probability, i.e. $-\frac{1}{n}\ell(\theta)$.
- If you chose B: This is incorrect. That is the log-likelihood $\ell(\theta)$; the cross-entropy negates it and divides by $n$.
- If you chose C: This is incorrect. That is the average probability, with no logarithm and no sign change.
- If you chose D: This is incorrect. That expression does not appear in this subject; perplexity exponentiates the cross-entropy, not a sum of probabilities.
Q3: Why is accuracy unsuitable as a training loss?
A) It can only be computed for binary classification
B) It is piecewise constant, so its gradient is zero almost everywhere
C) It always rewards overconfident predictions
D) It requires the true data distribution to be known
Correct: B)
- If you chose A: This is incorrect. Accuracy is defined for any number of classes.
- If you chose B: Correct! Accuracy only changes when the argmax changes, so it gives backpropagation no useful gradient.
- If you chose C: This is incorrect. Accuracy ignores confidence entirely; it only checks whether the argmax is right.
- If you chose D: This is incorrect. Accuracy is computed from labelled examples, just like cross-entropy.
Q4: Which statement about cross-entropy is TRUE?
A) It scores a 51%-confident correct prediction the same as a 99%-confident one
B) It cannot be used with softmax outputs
C) With one-hot labels it reduces to $-\log \hat{y}_{\text{true}}$
D) It is zero whenever the argmax prediction is correct
Correct: C)
- If you chose A: This is incorrect. That describes accuracy; cross-entropy penalises the less confident prediction more heavily.
- If you chose B: This is incorrect. Softmax with cross-entropy is the standard pairing, with gradient $\hat{y}_k - y_k$.
- If you chose C: Correct! All terms with $y_k = 0$ vanish, leaving only the log-probability of the true class.
- If you chose D: This is incorrect. The loss is zero only when the true class receives probability 1.
Q5: In the derivation of the softmax and cross-entropy gradient, what is $\frac{\partial \hat{y}_j}{\partial z_k}$?
A) $\hat{y}_k - y_k$
B) $-\frac{y_j}{\hat{y}_j}$
C) $\hat{y}_j(\delta_{jk} - \hat{y}_k)$
D) $\delta_{jk}$
Correct: C)
- If you chose A: This is incorrect. That is the final gradient $\frac{\partial \mathcal{L}}{\partial z_k}$, not the softmax derivative.
- If you chose B: This is incorrect. That is $\frac{\partial \mathcal{L}}{\partial \hat{y}_j}$, the derivative of the loss with respect to the predicted probability.
- If you chose C: Correct! The softmax derivative is $\hat{y}_j(\delta_{jk} - \hat{y}_k)$, which combines with $-\frac{y_j}{\hat{y}_j}$ to give $\hat{y}_k - y_k$.
- If you chose D: This is incorrect. The Kronecker delta alone ignores the coupling between outputs introduced by the softmax normalisation.
Q6: How are cross-entropy minimisation and maximum likelihood estimation related?
A) They are unrelated objectives
B) Minimising cross-entropy minimises the likelihood
C) They agree only for binary classification
D) Minimising the empirical cross-entropy is equivalent to maximising the log-likelihood
Correct: D)
- If you chose A: This is incorrect. The empirical cross-entropy equals $-\frac{1}{n}\ell(\theta)$, so the two objectives are directly linked.
- If you chose B: This is incorrect. The sign is reversed: minimising cross-entropy maximises the likelihood.
- If you chose C: This is incorrect. The equivalence holds for any model $Q_\theta$ and i.i.d. data, not only binary classifiers.
- If you chose D: Correct! Since the cross-entropy is $-\frac{1}{n}\ell(\theta)$, its minimiser is the maximum likelihood estimate.
Q7: What is a common pitfall when implementing cross-entropy for classification?
A) Passing raw logits to a fused cross-entropy loss
B) Using the log-sum-exp trick
C) Computing softmax first and then taking the log as a separate step
D) Averaging the loss over the batch
Correct: C)
- If you chose A: This is incorrect. Passing raw logits to a fused loss is the recommended practice.
- If you chose B: This is incorrect. The log-sum-exp trick is the remedy, not the pitfall.
- If you chose C: Correct! Separate softmax and log steps can underflow or overflow; computing $z_k - \log\sum_j e^{z_j}$ directly is numerically stable.
- If you chose D: This is incorrect. Averaging over the batch is standard and matches the $\frac{1}{n}$ in the empirical cross-entropy.
Q8: When should you report bits-per-character (BPC) or bits-per-dimension (BPD) instead of perplexity?
A) Only when the model is trained with accuracy as the loss
B) When comparing models that use different tokenisers or vocabularies
C) Only for models with a vocabulary smaller than 5k
D) Never; perplexity is always comparable across models
Correct: B)
- If you chose A: This is incorrect. BPC and BPD are derived from log-probabilities and are unrelated to training on accuracy.
- If you chose B: Correct! BPC and BPD normalise by the actual amount of data, so the comparison does not depend on how the data is tokenised.
- If you chose C: This is incorrect. The metrics apply to any vocabulary size.
- If you chose D: This is incorrect. Perplexity is measured per token, so it changes with the tokeniser.
Practice Problems
- A binary classifier outputs $\hat{y} = 0.7$ for a positive example ($y=1$). What is the cross-entropy loss?
  Answer: $\mathcal{L} = -[1 \cdot \log 0.7 + 0 \cdot \log 0.3] = -\log 0.7 \approx 0.357$ nats ($\approx 0.515$ bits)
- Language model A has perplexity 30 on a test set. Model B has perplexity 15. How much better is Model B?
  Answer: Model B uses $\log_2 15 \approx 3.91$ bits per token vs $\log_2 30 \approx 4.91$ bits for Model A. Model B saves 1 bit per token: it is twice as certain on average (half the effective number of choices per prediction).
- Why can't you compare perplexity across models with different tokenisers directly?
  Answer: Perplexity is measured per token, and different tokenisers split the same text into different numbers of tokens. A tokeniser that produces more tokens per sentence spreads the same total information over more, individually easier predictions, so its per-token perplexity is lower even when the model is no better. BPC and BPD normalise by the actual data length, enabling fair cross-tokeniser comparisons.
- Model A has 95% accuracy; Model B has 93% accuracy but lower cross-entropy loss. Which is better?
  Answer: Likely Model B, particularly when the predicted probabilities themselves are used downstream. Lower cross-entropy means that, on average, correct predictions are more confident and wrong ones less confident. Model A might be "just barely right" on many examples while Model B is confidently right. Cross-entropy is generally preferred as a training objective and evaluation metric.
- Show that the cross-entropy gradient with softmax is $\hat{y}_k - y_k$.
  Answer: Start from $\mathcal{L} = -\sum_j y_j \log \hat{y}_j$, where $\hat{y}_j = e^{z_j} / \sum_m e^{z_m}$. By the chain rule:
  $$\frac{\partial \mathcal{L}}{\partial z_k} = \sum_j \frac{\partial \mathcal{L}}{\partial \hat{y}_j} \frac{\partial \hat{y}_j}{\partial z_k}$$
  The two factors are:
  $$\frac{\partial \mathcal{L}}{\partial \hat{y}_j} = -\frac{y_j}{\hat{y}_j}, \qquad \frac{\partial \hat{y}_j}{\partial z_k} = \hat{y}_j(\delta_{jk} - \hat{y}_k)$$
  Substituting, and using $\sum_j y_j = 1$:
  $$\frac{\partial \mathcal{L}}{\partial z_k} = \sum_j \left(-\frac{y_j}{\hat{y}_j}\right)\hat{y}_j(\delta_{jk} - \hat{y}_k) = -\sum_j y_j\delta_{jk} + \hat{y}_k\sum_j y_j = -y_k + \hat{y}_k = \hat{y}_k - y_k$$ ✓
Summary
Key takeaways:
- Cross-entropy $H(P, Q) = -\sum P(x)\log Q(x)$ is the standard loss for classification and language modelling
- Cross-entropy minimisation $\iff$ MLE — they are mathematically identical
- Perplexity $= 2^{H(P,Q)}$ measures a language model's average branching factor
- BPC/BPD normalise by data length for fair cross-model comparison
- Cross-entropy captures confidence, not just correctness — better than accuracy for training
- The softmax + cross-entropy gradient is simply $\hat{y} - y$, enabling efficient backpropagation
Pitfalls
- Comparing perplexity across models with different vocabularies or tokenisers: Perplexity is measured per token. A model using subword tokenisation (more tokens per word) spreads the same total information over more, individually easier predictions, so its per-token perplexity is artificially lower. Use bits-per-character (BPC) or bits-per-dimension (BPD) to normalise by the actual data length for fair cross-model comparisons.
- Using accuracy instead of cross-entropy as a training loss: Accuracy is piecewise constant, so its gradient is zero almost everywhere, making it useless for backpropagation. Cross-entropy is smooth, differentiable, and rewards confident correct predictions. Even for evaluation, cross-entropy captures model calibration that accuracy misses: two models both at 95% accuracy can have vastly different cross-entropy.
- Forgetting that cross-entropy minimisation IS maximum likelihood estimation: $-\frac{1}{n}\sum \log Q_\theta(x_i) = -\frac{1}{n}\ell(\theta)$. Minimising cross-entropy loss is mathematically identical to maximising the log-likelihood. This equivalence is why neural network training with CE loss is doing MLE, a fact that connects deep learning to classical statistics.
- Using hard one-hot targets without label smoothing: When true labels are one-hot ($P(\text{true class}) = 1$, all others 0), the model is pushed toward infinite confidence, which can cause overfitting and poor calibration. Label smoothing ($P(\text{true class}) = 1 - \varepsilon$, others $= \varepsilon/(K-1)$) discourages overconfidence and often improves generalisation.
- Computing cross-entropy and softmax separately: Computing softmax first and then taking the log can cause numerical underflow/overflow. The correct approach is the log-sum-exp trick: $\log \text{softmax}(z)_k = z_k - \log\sum_j e^{z_j}$, implemented as `log_softmax` in all major frameworks. In PyTorch, for example, use the fused `CrossEntropyLoss` (which expects raw logits, not probabilities) rather than chaining a softmax, a log, and `NLLLoss`.
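The last two pitfalls can be illustrated together. The following is a dependency-free sketch, not any framework's implementation: `log_softmax`, `smooth_labels`, and `cross_entropy_from_logits` are illustrative names, and the extreme logits are chosen only to make the overflow problem visible.

```python
import math

def log_softmax(z):
    """log softmax(z)_k = z_k - logsumexp(z), subtracting max(z) for stability."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def smooth_labels(true_class, num_classes, eps=0.1):
    """Label smoothing: 1 - eps on the true class, eps / (K - 1) elsewhere."""
    off = eps / (num_classes - 1)
    return [1.0 - eps if k == true_class else off for k in range(num_classes)]

def cross_entropy_from_logits(z, target):
    """Cross-entropy (nats) computed directly from logits, never forming softmax."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(z)))

# Extreme logits: a naive math.exp(1000.0) would raise OverflowError.
logits = [1000.0, 0.0, -1000.0]

log_probs = log_softmax(logits)
hard_loss = cross_entropy_from_logits(logits, [1.0, 0.0, 0.0])
smooth_loss = cross_entropy_from_logits(logits, smooth_labels(0, 3))

print(log_probs)
print(hard_loss, smooth_loss)
```

With the one-hot target the extremely confident prediction incurs no loss at all, while the smoothed target assigns it a large finite loss. That is the mechanism by which label smoothing discourages unbounded confidence.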
Next Steps
Next up: 13-06-source-coding-theorem.md