### 13.5 — Cross-Entropy and Applications

Phase: Information Theory

Prerequisites: 13-04-kl-divergence, 13-01-entropy

Learning Objectives

By the end of this subject, you will be able to:

  1. Explain why cross-entropy is the standard loss function for classification
  2. Derive the relationship between cross-entropy minimisation and maximum likelihood
  3. Compute and interpret perplexity as a measure of language model quality
  4. Apply bits-per-character and bits-per-dimension metrics
  5. Recognise when cross-entropy is preferred over accuracy for model evaluation

Core Content

⚠️ CRITICAL: Cross-Entropy as Maximum Likelihood

Minimising cross-entropy loss is mathematically equivalent to maximising the likelihood (MLE).

For i.i.d. data $\{x_i\}_{i=1}^{n}$ and model $Q_\theta$:

Log-likelihood: $\ell(\theta) = \sum_{i=1}^{n} \log Q_\theta(x_i)$

Cross-entropy (empirical): $-\frac{1}{n}\sum_{i=1}^{n} \log Q_\theta(x_i) = -\frac{1}{n}\ell(\theta)$

Maximising $\ell(\theta)$ $\iff$ Minimising $-\frac{1}{n}\ell(\theta)$ $\iff$ Minimising cross-entropy.

This is why training a neural network with cross-entropy loss IS maximum likelihood estimation.
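
A minimal numerical sketch of this equivalence, assuming a Bernoulli model with a single parameter `theta` and a small made-up dataset (both are illustrative and not part of the lesson). It searches a grid of `theta` values twice, once maximising the log-likelihood and once minimising the empirical cross-entropy:

```python
import math

# Hypothetical i.i.d. coin-flip data: 1 = heads, 0 = tails (7 heads out of 10).
data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

def log_likelihood(theta, xs):
    """Sum of log Q_theta(x_i) for a Bernoulli(theta) model."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)

def cross_entropy(theta, xs):
    """Empirical cross-entropy in nats: -(1/n) * log-likelihood."""
    return -log_likelihood(theta, xs) / len(xs)

# Candidate parameters, excluding 0 and 1 where log() would diverge.
grid = [i / 100 for i in range(1, 100)]

theta_mle = max(grid, key=lambda t: log_likelihood(t, data))
theta_ce = min(grid, key=lambda t: cross_entropy(t, data))

print(theta_mle, theta_ce)
```

Both searches should select the same grid point, the sample frequency of heads (0.7), because the two objectives differ only by the constant factor $-\frac{1}{n}$.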

Cross-Entropy in Classification

For $K$-class classification with true label $y$ and predicted probabilities $\hat{y}$:

$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$

With one-hot encoding: $\mathcal{L}_{\text{CE}} = -\log \hat{y}_{\text{true}}$

Gradient with softmax: With $\hat{y} = \text{softmax}(z)$ for logits $z$, the gradient of cross-entropy with respect to each logit is simply:

$$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{y}_k - y_k$$

This is the difference between the predicted probability and the true label, so each logit is corrected in proportion to how far its prediction is from the target.
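
The gradient formula can be checked numerically. The sketch below uses illustrative logits and a one-hot label (chosen arbitrarily, not taken from the lesson) and compares $\hat{y}_k - y_k$ against central finite differences of the loss:

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """Cross-entropy in nats between a one-hot y and softmax(z)."""
    y_hat = softmax(z)
    return -sum(yk * math.log(pk) for yk, pk in zip(y, y_hat))

z = [2.0, -1.0, 0.5]   # illustrative logits
y = [0.0, 1.0, 0.0]    # one-hot true label (class 1)

# Analytic gradient: y_hat - y.
analytic = [pk - yk for pk, yk in zip(softmax(z), y)]

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for k in range(len(z)):
    z_plus = z[:k] + [z[k] + eps] + z[k + 1:]
    z_minus = z[:k] + [z[k] - eps] + z[k + 1:]
    numeric.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps))

max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, max_gap)
```

The gradient components sum to zero, since both $\hat{y}$ and $y$ sum to one, and the gap between the analytic and numeric values should be at the level of floating-point noise.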

Perplexity

Perplexity is the standard evaluation metric for language models:

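$$\text{PPL} = 2^{H(P, Q)} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 Q(x_i)}$$

Here $H(P, Q)$ is the cross-entropy in bits per token over $N$ test tokens. If the loss is measured in nats instead, the equivalent form is $\text{PPL} = e^{H(P, Q)}$.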

Interpretation: A perplexity of $K$ means the model is as confused on average as if it were choosing uniformly among $K$ equally likely options.

| Perplexity | Interpretation |
|---|---|
| 1 | Perfect prediction |
| 10 | Choosing among ~10 options each time |
| 100 | Very uncertain (~100 options) |
| 1000+ | Poor model |
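
The "uniform among $K$ options" reading can be confirmed directly. This is a small sketch with a hypothetical helper named `perplexity`; the token probabilities are synthetic:

```python
import math

def perplexity(token_probs):
    """Perplexity = 2 ** (average negative log2-probability per token)."""
    avg_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2.0 ** avg_bits

# A model that always spreads its mass uniformly over K options has PPL = K.
K = 10
uniform = perplexity([1.0 / K] * 25)

# A model that is certain of every token has PPL = 1.
certain = perplexity([1.0] * 25)

print(uniform, certain)
```

Up to floating-point rounding, the first value equals `K` and the second equals 1, matching the first two rows of the table.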

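🚩 Common Pitfall: Perplexity is vocabulary-dependent, so values from models with different vocabularies are not directly comparable. A model with a 50k vocabulary and PPL=50 is generally doing better than a model with a 5k vocabulary and PPL=50: the former narrows 50,000 candidates down to an effective 50, while the latter only narrows 5,000.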

Bits-per-Character (BPC) and Bits-per-Dimension

For comparing models across different tokenisers/vocabularies:

$$\text{BPC} = \frac{-\sum \log_2 P(\text{chars})}{\text{total characters}} = \frac{\text{total bits}}{\text{total characters}}$$

$$\text{BPD} = \frac{-\sum \log_2 P(\text{data})}{\text{total dimensions}}$$

These normalise by the actual amount of data, making comparisons fair across different tokenisation schemes.
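
A short sketch of why this normalisation matters, assuming two hypothetical models evaluated on the same 1000-character text with different tokenisers. The token counts and per-token losses are invented for illustration:

```python
import math

def bits_per_character(total_nll_nats, num_characters):
    """Convert a summed negative log-likelihood in nats to bits per character."""
    total_bits = total_nll_nats / math.log(2)  # 1 nat = 1/ln(2) bits
    return total_bits / num_characters

# Model A: 250 tokens at an average loss of 3.0 nats per token.
# Model B: 400 tokens at an average loss of 2.0 nats per token.
bpc_a = bits_per_character(250 * 3.0, 1000)
bpc_b = bits_per_character(400 * 2.0, 1000)

ppl_a = math.exp(3.0)  # per-token perplexity of model A
ppl_b = math.exp(2.0)  # per-token perplexity of model B

print(round(bpc_a, 3), round(bpc_b, 3), round(ppl_a, 1), round(ppl_b, 1))
```

In this made-up case Model B has the lower per-token perplexity, yet Model A spends fewer bits per character, so the per-token ranking is reversed once both are measured against the same amount of text.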

Why Cross-Entropy Beats Accuracy

| Metric | Granularity | Captures confidence? | Differentiable? |
|---|---|---|---|
| Accuracy | Binary (right/wrong) | No | No |
| Cross-entropy | Continuous | Yes | Yes |

A model that predicts the correct class with 51% confidence is scored the same as one with 99% confidence under accuracy. Cross-entropy penalises the uncertain prediction far more heavily: $-\log 0.51 \approx 0.673$ nats versus $-\log 0.99 \approx 0.010$ nats.



Key Terms

Worked Examples

Example 1: Cross-entropy computation

3-class problem, true label = class 0.

Model A predictions: $[0.8, 0.1, 0.1]$

Model B predictions: $[0.4, 0.3, 0.3]$

$\mathcal{L}_A = -\log 0.8 \approx 0.223$ nats

$\mathcal{L}_B = -\log 0.4 \approx 0.916$ nats

Both have 100% accuracy (argmax is correct), but Model A is much better: its loss is about 4× lower. Cross-entropy captures the confidence of a prediction, which accuracy misses.
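
The numbers in this example can be reproduced with a few lines of Python. The helper names `cross_entropy` and `is_correct` are illustrative:

```python
import math

def cross_entropy(probs, true_class):
    """Cross-entropy in nats for a one-hot label: -log of the true-class probability."""
    return -math.log(probs[true_class])

def is_correct(probs, true_class):
    """Accuracy only checks whether the argmax matches the label."""
    return probs.index(max(probs)) == true_class

model_a = [0.8, 0.1, 0.1]
model_b = [0.4, 0.3, 0.3]
true_class = 0

loss_a = cross_entropy(model_a, true_class)
loss_b = cross_entropy(model_b, true_class)

print(is_correct(model_a, true_class), is_correct(model_b, true_class))
print(round(loss_a, 3), round(loss_b, 3), round(loss_b / loss_a, 2))
```

Both predictions count as correct, while the losses differ by a factor of roughly four.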

Example 2: Perplexity calculation

A language model assigns the following probabilities to a 5-word sequence:

| Word | $P(\text{word} \mid \text{context})$ |
|---|---|
| the | 0.15 |
| cat | 0.08 |
| sat | 0.12 |
| on | 0.20 |
| mat | 0.04 |

Cross-entropy (in bits): $-\frac{1}{5}(\log_2 0.15 + \log_2 0.08 + \log_2 0.12 + \log_2 0.20 + \log_2 0.04)$

$= -\frac{1}{5}(-2.737 - 3.644 - 3.059 - 2.322 - 4.644)$

$= -\frac{1}{5}(-16.406) = 3.281$ bits per word

Perplexity: $2^{3.281} \approx 9.72$

The model is as uncertain as choosing uniformly among ~10 options per word.
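
As a cross-check, perplexity can equivalently be written as the inverse geometric mean of the per-word probabilities, which reaches the same figure without passing through logarithms:

$$\text{PPL} = \left(\prod_{i=1}^{5} P(w_i \mid \text{context})\right)^{-1/5} = (0.15 \times 0.08 \times 0.12 \times 0.20 \times 0.04)^{-1/5} = (1.152 \times 10^{-5})^{-1/5} \approx 9.72$$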

Example 3: MLE = Cross-entropy minimisation

For a binary classifier with weight vector $w$ (playing the role of $\theta$):

$P(Y=1 \mid X) = \sigma(w^T X) = \hat{y}$, $P(Y=0 \mid X) = 1 - \hat{y}$

Log-likelihood for one example: $\ell = y\log\hat{y} + (1-y)\log(1-\hat{y})$

Negative log-likelihood: $-\ell = -[y\log\hat{y} + (1-y)\log(1-\hat{y})]$

This is exactly the binary cross-entropy loss. Minimising NLL $\iff$ maximising likelihood $\iff$ minimising cross-entropy.
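
A minimal sketch of this identity in code, assuming illustrative weights and inputs (all values are hypothetical). For each example it computes the Bernoulli negative log-likelihood and the binary cross-entropy separately and records the difference:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def binary_cross_entropy(y, y_hat):
    """BCE in nats for one example: -[y log y_hat + (1 - y) log(1 - y_hat)]."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def bernoulli_likelihood(y, y_hat):
    """P(Y = y | X) under the model: y_hat if y == 1, else 1 - y_hat."""
    return y_hat if y == 1 else 1.0 - y_hat

# Hypothetical weights and (input, label) pairs.
w = [0.5, -1.0]
examples = [([2.0, 0.5], 1), ([0.0, 1.5], 0), ([1.0, 1.0], 1)]

gaps = []
for x, y in examples:
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    nll = -math.log(bernoulli_likelihood(y, y_hat))
    bce = binary_cross_entropy(y, y_hat)
    gaps.append(abs(nll - bce))
    print(y, round(y_hat, 3), round(nll, 6), round(bce, 6))
```

The two columns agree for every example, since the $y$ and $1-y$ factors in the cross-entropy simply select the log-probability of the observed label.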



Quiz

Q1: What does the concept of Perplexity primarily refer to in this subject?

A) The definition and application of Perplexity B) A visual representation of Perplexity C) A computational error related to Perplexity D) A historical anecdote about Perplexity

Correct: A)

Q2: Which formula gives the empirical cross-entropy of a model $Q_\theta$ on data $\{x_i\}$?

A) $-\frac{1}{n}\sum_{i=1}^{n} \log Q_\theta(x_i)$ B) $\frac{1}{n}\sum_{i=1}^{n} Q_\theta(x_i)$ C) $\sum_{i=1}^{n} \log Q_\theta(x_i)$, without the sign or the $\frac{1}{n}$ factor D) An unrelated formula from a different topic

Correct: A)

Q3: What is the primary purpose of accuracy?

A) It replaces all other methods in this domain B) It is used to measure the fraction of predictions that are correct C) It is used only in advanced research contexts D) It is primarily a historical notation system

Correct: B)

Q4: Which statement about Cross-entropy is TRUE?

A) Cross-entropy is an advanced topic beyond this subject's scope B) Cross-entropy is not related to this subject C) Cross-entropy is a fundamental concept covered in this subject D) Cross-entropy is mentioned only as a historical footnote

Correct: C)

Q5: In the derivation of the softmax cross-entropy gradient, what is $\frac{\partial \hat{y}_j}{\partial z_k}$?

A) The inverse of the correct answer B) A different result from a common mistake C) $\hat{y}_j(\delta_{jk} - \hat{y}_k)$ D) An unrelated numerical value

Correct: C)

Q6: How are cross-entropy minimisation and maximum likelihood estimation related?

A) They are completely unrelated topics B) Cross-entropy minimisation is the inverse of maximum likelihood estimation C) Cross-entropy minimisation applies only to binary classifiers, unlike maximum likelihood estimation D) They are closely related: minimising empirical cross-entropy is equivalent to maximising the likelihood

Correct: D)

Q7: What is a common pitfall when working with cross-entropy in classification?

A) Cross-entropy in classification has no common misconceptions B) Cross-entropy in classification is always computed the same way in all contexts C) A common mistake is confusing cross-entropy in classification with a similar concept D) The main error with cross-entropy in classification is using it when it is not needed

Correct: C)

Q8: When should you apply bits-per-character (BPC) and bits-per-dimension (BPD)?

A) Avoid BPC and BPD unless explicitly instructed B) Apply BPC and BPD when comparing models across different tokenisers or vocabularies C) BPC and BPD are not practically useful D) Use BPC and BPD only in pure mathematics contexts

Correct: B)

Practice Problems

  1. A binary classifier outputs $\hat{y} = 0.7$ for a positive example ($y=1$). What is the cross-entropy loss?

    Answer: $\mathcal{L} = -[1 \cdot \log 0.7 + 0 \cdot \log 0.3] = -\log 0.7 \approx 0.357$ nats ($\approx 0.515$ bits)

  2. Language model A has perplexity 30 on a test set. Model B has perplexity 15. How much better is Model B?

    Answer: Model B needs $\log_2 15 \approx 3.91$ bits per word versus $\log_2 30 \approx 4.91$ bits for Model A. Model B saves exactly 1 bit per word: its effective number of equally likely choices per prediction is halved.

  3. Why can't you compare perplexity across models with different tokenisers directly?

    Answer: Perplexity is measured per token. A tokeniser that splits the same text into more, smaller tokens (for example subword units instead of whole words) spreads the same total uncertainty over more predictions, each with fewer plausible choices, so per-token perplexity comes out lower without the model being any better. BPC and BPD normalise by the actual data length, enabling fair cross-tokeniser comparisons.

  4. Model A has 95% accuracy; Model B has 93% accuracy but lower cross-entropy loss. Which is better?

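    Answer: Likely Model B, although it depends on how the predictions are used. Lower cross-entropy means more probability is placed on the correct class overall: more confident correct predictions and less confident wrong ones. Model A might be "just barely right" on many examples while Model B is confidently right. If only the hard decision matters, Model A's higher accuracy can still be the deciding factor; otherwise cross-entropy is generally preferred as a training objective and evaluation metric.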

  5. Show that the cross-entropy gradient with softmax is $\hat{y}_k - y_k$.

    Answer: Start from $\mathcal{L} = -\sum_j y_j \log \hat{y}_j$, where $\hat{y}_j = e^{z_j} / \sum_m e^{z_m}$.

    Chain rule: $\frac{\partial \mathcal{L}}{\partial z_k} = \sum_j \frac{\partial \mathcal{L}}{\partial \hat{y}_j} \frac{\partial \hat{y}_j}{\partial z_k}$

    Loss term: $\frac{\partial \mathcal{L}}{\partial \hat{y}_j} = -\frac{y_j}{\hat{y}_j}$

    Softmax term: $\frac{\partial \hat{y}_j}{\partial z_k} = \hat{y}_j(\delta_{jk} - \hat{y}_k)$

    Substituting: $\frac{\partial \mathcal{L}}{\partial z_k} = \sum_j \left(-\frac{y_j}{\hat{y}_j}\right)\hat{y}_j(\delta_{jk} - \hat{y}_k) = -\sum_j y_j(\delta_{jk} - \hat{y}_k) = -\sum_j y_j\delta_{jk} + \hat{y}_k\sum_j y_j$

    Since $\sum_j y_j\delta_{jk} = y_k$ and the label vector sums to one ($\sum_j y_j = 1$): $\frac{\partial \mathcal{L}}{\partial z_k} = -y_k + \hat{y}_k = \hat{y}_k - y_k$ ✓


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 13-06-source-coding-theorem.md