16-03 — The Softmax Function

Phase: 16 — Neural Network Mathematics
Subject: 16-03
Prerequisites: 16-02 (Activation Functions), Phase 4 (derivatives), Phase 10 (probability basics)
Next subject: 16-04 — Loss Functions


Learning Objectives

By the end of this subject, you will be able to:

  1. Write the softmax definition and explain why it produces a valid probability distribution
  2. Derive the softmax Jacobian and explain why the off-diagonal terms exist (unlike sigmoid)
  3. Apply the log-sum-exp trick for numerically stable softmax computation
  4. Show that 2-class softmax reduces to the sigmoid function
  5. Explain the effect of the temperature parameter on the softmax distribution

Core Content

1. Definition and Motivation

In multi-class classification with K classes, we want a neural network to output a probability distribution over the K classes — K numbers that are all non-negative and sum to 1. The softmax function does exactly this.

Given a vector of raw scores (logits) z = [z₁, z₂, ..., z_K]ᵀ:

softmax(z)ᵢ = exp(zᵢ) / Σⱼ₌₁^K exp(zⱼ)

The output ŷ = softmax(z) satisfies:
  - ŷᵢ > 0 for all i (the exponential is always positive)
  - Σᵢ ŷᵢ = 1 (each term is divided by the sum of all the exponentials)
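The definition can be sketched directly in plain Python. This is a minimal illustration, not a production implementation: the function name `softmax` and the sample logits are choices made here, and this direct form is not protected against overflow for large logits (Section 4 covers that).

```python
import math

def softmax(z):
    """Direct softmax of a list of logits (no overflow protection)."""
    exps = [math.exp(v) for v in z]   # every term is strictly positive
    total = sum(exps)                 # normalizing constant
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])   # [0.09, 0.2447, 0.6652]
print(all(p > 0 for p in probs))      # True
```

The outputs are positive and sum to 1 up to floating-point rounding, which is the pair of properties listed above.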

⚠️ THIS IS CRITICAL: Softmax is the standard output layer of multi-class classifiers and of language models (where it produces next-token probabilities), and it normalizes the scores in standard attention mechanisms. You will use this constantly.

Why "softmax"? It is a "soft" (smooth, differentiable) version of the argmax function. As the temperature section shows, softmax interpolates smoothly between a uniform distribution and a hard, one-hot argmax.

2. Relationship to Sigmoid

For K = 2 classes, softmax reduces to the sigmoid function:

Let z = [z₁, z₂]ᵀ. Then:

softmax(z)₁ = σ(z₁ − z₂), where σ(x) = 1/(1 + e^(−x))

Derivation (divide the numerator and denominator by e^(z₁)):

softmax(z)₁ = e^(z₁)/(e^(z₁) + e^(z₂)) = 1/(1 + e^(z₂)/e^(z₁)) = 1/(1 + e^(z₂ − z₁)) = σ(z₁ − z₂)

This shows that for binary classification: P(class 1) = σ(z₁ − z₂), which depends only on the DIFFERENCE between logits. This is why binary classifiers often use a single output with sigmoid rather than two outputs with softmax — the extra degree of freedom is redundant (adding a constant to both logits doesn't change the softmax output).
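The identity can be checked numerically. The sketch below is illustrative only: `sigmoid`, `softmax2`, and the sample logits are names and values chosen here, not part of the course material.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax2(z1, z2):
    """Two-class softmax, returned as (P(class 1), P(class 2))."""
    e1, e2 = math.exp(z1), math.exp(z2)
    return e1 / (e1 + e2), e2 / (e1 + e2)

z1, z2 = 1.3, -0.4
p1, p2 = softmax2(z1, z2)

# P(class 1) should equal sigmoid of the logit difference.
print(math.isclose(p1, sigmoid(z1 - z2)))   # True

# Adding the same constant to both logits leaves the output unchanged.
q1, q2 = softmax2(z1 + 5.0, z2 + 5.0)
print(math.isclose(p1, q1) and math.isclose(p2, q2))   # True
```

The second check illustrates the redundant degree of freedom: only z₁ − z₂ matters.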

3. The Softmax Jacobian

Unlike element-wise activations (sigmoid, ReLU, tanh) where ∂ŷᵢ/∂zⱼ = 0 for i ≠ j, the softmax couples all inputs together. The derivative is a full K × K Jacobian matrix:

∂ŷᵢ/∂zⱼ = ŷᵢ(1 − ŷᵢ)    if i = j
∂ŷᵢ/∂zⱼ = −ŷᵢ·ŷⱼ        if i ≠ j

Derivation:

Let ŷᵢ = e^(zᵢ)/S where S = Σₖ e^(zₖ).

Case 1: i = j (diagonal):

∂ŷᵢ/∂zᵢ = ∂/∂zᵢ [e^(zᵢ)/S]
        = (e^(zᵢ)·S − e^(zᵢ)·e^(zᵢ))/S²    (quotient rule, using ∂S/∂zᵢ = e^(zᵢ))
        = e^(zᵢ)/S − (e^(zᵢ)/S)²
        = ŷᵢ − ŷᵢ²
        = ŷᵢ(1 − ŷᵢ)

Case 2: i ≠ j (off-diagonal):

∂ŷᵢ/∂zⱼ = ∂/∂zⱼ [e^(zᵢ)/S]
        = (0·S − e^(zᵢ)·e^(zⱼ))/S²    (e^(zᵢ) does not depend on zⱼ, so the numerator's derivative is 0; ∂S/∂zⱼ = e^(zⱼ))
        = −e^(zᵢ)·e^(zⱼ)/S²
        = −(e^(zᵢ)/S)(e^(zⱼ)/S)
        = −ŷᵢ·ŷⱼ

Compact matrix form: J = diag(ŷ) − ŷŷᵀ

Why off-diagonal terms are negative: Increasing zⱼ increases the denominator S for all outputs, which decreases ŷᵢ (for i ≠ j). The probabilities must sum to 1 — pushing one up pushes others down.
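One way to build confidence in the formula is to compare it with a finite-difference approximation. The sketch below assumes plain Python lists for vectors and matrices; `softmax_jacobian`, `numeric_jacobian`, and the step size `h` are illustrative choices made here.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = y_i * (delta_ij - y_j)."""
    y = softmax(z)
    k = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(k)]
            for i in range(k)]

def numeric_jacobian(z, h=1e-6):
    """Central-difference estimate of J[i][j] = d y_i / d z_j."""
    k = len(z)
    J = [[0.0] * k for _ in range(k)]
    for j in range(k):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        y_plus, y_minus = softmax(z_plus), softmax(z_minus)
        for i in range(k):
            J[i][j] = (y_plus[i] - y_minus[i]) / (2 * h)
    return J

z = [0.5, -1.0, 2.0]
J = softmax_jacobian(z)
J_num = numeric_jacobian(z)

max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
col_sums = [sum(J[i][j] for i in range(3)) for j in range(3)]
print(max_err < 1e-7)                        # True
print(all(abs(s) < 1e-12 for s in col_sums)) # True
```

Each column sums to zero because the outputs always sum to 1, so any change in one logit must redistribute probability rather than create it.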

4. Numerical Stability: The Log-Sum-Exp Trick

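Computing softmax naively can cause numerical overflow. If zᵢ is large (e.g., zᵢ = 1000), e^(1000) exceeds the largest representable floating-point number: in IEEE 754 double precision, e^x overflows once x exceeds roughly 709.8 (roughly 88.7 in single precision).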

The trick: Subtract the maximum from all logits before exponentiating:

softmax(z)ᵢ = exp(zᵢ − m) / Σⱼ exp(zⱼ − m)

where m = maxₖ(zₖ).

Proof of equivalence:

Let m = maxₖ(zₖ). Then:

softmax(z)ᵢ = e^(zᵢ)/Σⱼ e^(zⱼ) = (e^(zᵢ − m)·e^m)/(e^m·Σⱼ e^(zⱼ − m)) = e^(zᵢ − m)/Σⱼ e^(zⱼ − m) ✓

Since zᵢ − m ≤ 0 for all i, every exponentiated value lies in (0, 1], so nothing can overflow. The term for the largest logit is exactly e⁰ = 1, so the denominator is at least 1 and stays nonzero even if every other term underflows to 0.
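A short sketch contrasting the two computations in plain Python. The function names are illustrative; the only library behavior relied on is that `math.exp` raises `OverflowError` when the result does not fit in a double.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]      # may raise OverflowError
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)                           # shift so the largest logit becomes 0
    exps = [math.exp(v - m) for v in z]  # every value is in (0, 1]
    total = sum(exps)                    # at least 1, never zero
    return [e / total for e in exps]

z = [1000.0, 1001.0, 1002.0]

try:
    softmax_naive(z)
    naive_ok = True
except OverflowError:
    naive_ok = False

stable = softmax_stable(z)
print(naive_ok)                        # False
print([round(p, 4) for p in stable])   # [0.09, 0.2447, 0.6652]
```

On small logits the two functions agree to rounding error, so the shifted version can simply replace the naive one.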

In log space: The denominator's log is the log-sum-exp function:

LSE(z) = log(Σⱼ exp(zⱼ)) = m + log(Σⱼ exp(zⱼ − m))

This is a smooth approximation of the maximum function, bounded by max(z) ≤ LSE(z) ≤ max(z) + log K, and it appears widely in ML (log-likelihood computation, attention, etc.).
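The shifted formula for LSE translates directly into code, and subtracting it from the logits gives log-probabilities without ever forming a very large or very small ratio. This is a sketch in plain Python; `logsumexp` and `log_softmax` are names chosen here for illustration.

```python
import math

def logsumexp(z):
    """log(sum(exp(z))) computed as m + log(sum(exp(z - m)))."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def log_softmax(z):
    """log(softmax(z)_i) = z_i - LSE(z)."""
    lse = logsumexp(z)
    return [v - lse for v in z]

z = [1000.0, 1001.0, 1002.0]
lse = logsumexp(z)
log_probs = log_softmax(z)

print(round(lse, 4))                                # 1002.4076
print(max(z) <= lse <= max(z) + math.log(len(z)))   # True
```

Working with log-probabilities is convenient for cross-entropy, which needs log(ŷᵢ) rather than ŷᵢ itself.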

5. Temperature Parameter

The temperature T controls the "sharpness" of the softmax distribution:

softmax(z; T)ᵢ = exp(zᵢ/T) / Σⱼ exp(zⱼ/T)

Effect of temperature (for T > 0):
  - T → 0: The distribution approaches a one-hot vector at argmax(z), assuming a unique largest logit. Only the largest logit survives, so softmax → argmax.
  - T = 1: Standard softmax.
  - T → ∞: The distribution approaches uniform, so every class gets probability 1/K.
  - T > 1: "Softens" the distribution, making it more uniform (higher entropy). Used in knowledge distillation to reveal "dark knowledge" from teacher models, that is, the relative probabilities the teacher assigns to the incorrect classes.
  - T < 1: "Sharpens" the distribution, making it more peaked. Used in some sampling strategies.
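The limiting behavior is easy to observe numerically. The following sketch divides the logits by T and then applies the max-shifted softmax from Section 4; the function name, the sample logits, and the chosen temperatures are illustrative assumptions.

```python
import math

def softmax_with_temperature(z, T=1.0):
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / T for v in z]
    m = max(scaled)                           # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
results = {}
for T in (0.1, 1.0, 10.0):
    results[T] = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in results[T]])
```

At T = 0.1 almost all of the mass sits on the largest logit, while at T = 10 the three probabilities are close to 1/3 each.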

Why "temperature"? This comes from statistical mechanics: the Boltzmann distribution for a system at temperature T is pᵢ ∝ exp(−Eᵢ/(kT)). The softmax is a Boltzmann distribution over energy states Eᵢ = −zᵢ.

6. Softmax in the Context of Cross-Entropy

The combination of softmax + categorical cross-entropy loss is ubiquitous in classification and language modeling:

L = −Σᵢ yᵢ·log(ŷᵢ) where ŷ = softmax(z)

A beautiful simplification occurs when we compute ∂L/∂z (see 16-04 for the full derivation):

∂L/∂zᵢ = ŷᵢ − yᵢ

The gradient is simply the difference between the predicted and true probabilities. This elegant result is one reason why softmax + cross-entropy is so widely used — it provides clean, well-behaved gradients.
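The result ∂L/∂zᵢ = ŷᵢ − yᵢ can be sanity-checked against a finite-difference gradient before working through the derivation in 16-04. The sketch below assumes a one-hot target `y`; the function names, sample values, and step size `h` are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """L = -sum_i y_i * log(softmax(z)_i)."""
    y_hat = softmax(z)
    return -sum(t * math.log(p) for t, p in zip(y, y_hat))

z = [2.0, 1.0, 0.5]
y = [0.0, 1.0, 0.0]          # one-hot target: true class is the second one

analytic = [p - t for p, t in zip(softmax(z), y)]

h = 1e-6
numeric = []
for i in range(len(z)):
    z_plus, z_minus = list(z), list(z)
    z_plus[i] += h
    z_minus[i] -= h
    numeric.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * h))

grad_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(grad_err < 1e-7)                 # True
print(abs(sum(analytic)) < 1e-12)      # True
```

The gradient components sum to zero because ŷ and y both sum to 1, and the component for the true class is negative whenever ŷ for that class is below 1.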

7. Softmax in Attention Mechanisms

In scaled dot-product attention (Phase 17):

Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V

The softmax here normalizes the attention scores into a probability distribution over input positions — each position "attends" to others with weights summing to 1. The scaling factor √dₖ prevents the dot products from growing too large (which would push softmax into the saturated, nearly-one-hot regime where gradients vanish).
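To make the formula concrete, here is a deliberately small sketch of scaled dot-product attention using lists of lists as matrices (rows of `Q`, `K`, `V` are query, key, and value vectors). It omits masking, batching, and multiple heads; the function names are illustrative, and `K` here is the key matrix, not the number of classes.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for small list-of-lists matrices."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # One scaled score per key for this query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)          # non-negative, sums to 1
        # Weighted average of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output

Q = [[1.0, 0.0]]
K = [[1.0, 1.0], [1.0, 1.0]]   # identical keys, so the weights are uniform
V = [[0.0, 2.0], [4.0, 6.0]]
out = attention(Q, K, V)
print(out)                      # [[2.0, 4.0]]
```

With identical keys the query cannot distinguish the positions, so the output is the plain average of the value vectors.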



Key Terms

Worked Examples

Example 1: Computing Softmax

Problem: Compute softmax([2, 1, 0.5]ᵀ).

Solution:

Step 1 (exponentiate): e² ≈ 7.3891, e¹ ≈ 2.7183, e^(0.5) ≈ 1.6487

Step 2 (sum): S = 7.3891 + 2.7183 + 1.6487 = 11.7561

Step 3 (normalize):
  - ŷ₁ = 7.3891/11.7561 ≈ 0.6285
  - ŷ₂ = 2.7183/11.7561 ≈ 0.2312
  - ŷ₃ = 1.6487/11.7561 ≈ 0.1402

Check: 0.6285 + 0.2312 + 0.1402 = 0.9999 ≈ 1 ✓ (the gap is rounding)

The largest logit (2) gets the most probability mass, but all classes get some.

Example 2: Log-Sum-Exp Trick

Problem: Compute softmax([1000, 1001, 1002]ᵀ) safely.

Solution:

Without the trick: e^{1000}, e^{1001}, e^{1002} would all overflow.

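With m = 1002:
  - z₁ − m = 1000 − 1002 = −2 → e⁻² ≈ 0.1353
  - z₂ − m = 1001 − 1002 = −1 → e⁻¹ ≈ 0.3679
  - z₃ − m = 1002 − 1002 = 0 → e⁰ = 1

Sum: 0.1353 + 0.3679 + 1 = 1.5032

Normalize: ŷ ≈ [0.1353/1.5032, 0.3679/1.5032, 1/1.5032] ≈ [0.0900, 0.2447, 0.6652]

Every intermediate value is at most 1, so nothing overflows. The highest logit still receives the largest share of the probability mass.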

Example 3: Softmax with Temperature

Problem: Given logits z = [3, 1, 0]ᵀ, compute softmax at T = 0.5, T = 1, and T = 2. Interpret the results.

Solution:

T = 0.5 (cold/sharp):
  - z/T = [6, 2, 0] → exp: [403.43, 7.389, 1] → sum = 411.82
  - ŷ ≈ [0.9796, 0.0179, 0.0024], which is nearly one-hot

T = 1 (standard):
  - exp([3, 1, 0]) = [20.086, 2.718, 1] → sum = 23.804
  - ŷ ≈ [0.8438, 0.1142, 0.0420]

T = 2 (hot/soft):
  - z/T = [1.5, 0.5, 0] → exp: [4.482, 1.649, 1] → sum = 7.131
  - ŷ ≈ [0.6285, 0.2312, 0.1402]

As T increases, the distribution becomes more uniform (entropy increases). As T decreases, it approaches a hard argmax.
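The entropy claim can be checked with a few lines of Python. This sketch reuses the logits and temperatures of this example; `softmax_t` and `entropy` are illustrative names, and entropy is measured in nats (natural logarithm).

```python
import math

def softmax_t(z, T):
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

z = [3.0, 1.0, 0.0]
entropies = {T: entropy(softmax_t(z, T)) for T in (0.5, 1.0, 2.0)}
h_max = math.log(len(z))       # entropy of the uniform distribution

for T, H in entropies.items():
    print(f"T={T}: H={H:.3f} nats (uniform would give {h_max:.3f})")
```

The printed entropies increase with T and stay below log 3, the value reached only by the uniform distribution.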

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: Compute the softmax Jacobian for z = [1, 0, −1]ᵀ. Give the full 3×3 matrix.

Problem 2: Prove that softmax(z + c·1) = softmax(z) where 1 is a vector of all ones and c is any constant. This is the "translation invariance" of softmax. What does this imply about the number of effective parameters?

Problem 3: For K = 2, derive the full 2×2 softmax Jacobian and show that ∂ŷ₁/∂z₁ = ŷ₁ŷ₂ (note: this matches the diagonal formula ŷ₁(1 − ŷ₁) since ŷ₂ = 1 − ŷ₁).

Problem 4: Compute the entropy H(ŷ) = −Σ ŷᵢ·log(ŷᵢ) of the softmax distribution for z = [2, 2, 2]ᵀ. Compare with the entropy for z = [5, 0, −5]ᵀ.

Problem 5: Show that ∂²/∂zⱼ∂zₖ LSE(z) is the covariance matrix of the softmax distribution. (LSE = log-sum-exp.)

Answers

**Problem 1:** First compute ŷ: e¹ ≈ 2.7183, e⁰ = 1, e⁻¹ ≈ 0.3679, sum ≈ 4.0862, so ŷ₁ ≈ 0.6652, ŷ₂ ≈ 0.2447, ŷ₃ ≈ 0.0900.

Diagonal entries ŷᵢ(1 − ŷᵢ): [0.2227, 0.1848, 0.0819]

Off-diagonal entries −ŷᵢ·ŷⱼ (symmetric in i and j):
  - (1,2) and (2,1): −0.1628
  - (1,3) and (3,1): −0.0599
  - (2,3) and (3,2): −0.0220

J = [[ 0.2227, −0.1628, −0.0599],
     [−0.1628,  0.1848, −0.0220],
     [−0.0599, −0.0220,  0.0819]]

**Problem 2:** softmax(z + c·1)ᵢ = e^(zᵢ+c)/Σⱼ e^(zⱼ+c) = eᶜ·e^(zᵢ)/(eᶜ·Σⱼ e^(zⱼ)) = e^(zᵢ)/Σⱼ e^(zⱼ) = softmax(z)ᵢ

Implication: K logits have only K−1 effective degrees of freedom. You can fix one logit to 0 without loss of expressiveness.

**Problem 3:** For K = 2, ŷ₂ = 1 − ŷ₁.
  - ∂ŷ₁/∂z₁ = ŷ₁(1 − ŷ₁) = ŷ₁ŷ₂ ✓
  - ∂ŷ₁/∂z₂ = −ŷ₁ŷ₂
  - ∂ŷ₂/∂z₁ = −ŷ₁ŷ₂
  - ∂ŷ₂/∂z₂ = ŷ₂(1 − ŷ₂) = ŷ₁ŷ₂

Note that the columns sum to 0 (since ŷ₁ + ŷ₂ = 1, the derivatives must preserve this).

**Problem 4:** For [2, 2, 2]: ŷ = [⅓, ⅓, ⅓]. H = −3·⅓·log(⅓) = log(3) ≈ 1.099 nats.

For [5, 0, −5]: e⁵ ≈ 148.41, e⁰ = 1, e⁻⁵ ≈ 0.0067, sum ≈ 149.42, so ŷ ≈ [0.9933, 0.0067, 0.00005]. H ≈ −0.9933·log(0.9933) − 0.0067·log(0.0067) − 0.00005·log(0.00005) ≈ 0.0067 + 0.0335 + 0.0005 ≈ 0.041 nats.

The first is high entropy (uniform), the second is low entropy (peaked).

**Problem 5:** LSE(z) = log(Σⱼ exp(zⱼ))

∂LSE/∂zₖ = exp(zₖ)/Σⱼ exp(zⱼ) = ŷₖ (the softmax probability)

∂²LSE/∂zⱼ∂zₖ = ∂ŷₖ/∂zⱼ = ŷₖ·δ_{jk} − ŷⱼ·ŷₖ

This is exactly the softmax Jacobian. It is also the covariance matrix of the one-hot indicator vector e of a categorical random variable with probabilities ŷ: since E[eⱼeₖ] = ŷₖ·δ_{jk} and E[eⱼ]·E[eₖ] = ŷⱼ·ŷₖ, we get Cov(eⱼ, eₖ) = ŷₖ·δ_{jk} − ŷⱼ·ŷₖ. In particular, Var(eₖ) = ŷₖ(1 − ŷₖ) and Cov(eⱼ, eₖ) = −ŷⱼ·ŷₖ for j ≠ k. So the Hessian of LSE equals Cov(one-hot indicator variables). ✓

Summary

  1. Softmax converts logits to a probability distribution: ŷᵢ = exp(zᵢ)/Σⱼexp(zⱼ), with all outputs >0 and summing to 1.
  2. The Jacobian is ∂ŷᵢ/∂zⱼ = ŷᵢ(δ_{ij} − ŷⱼ) — diagonal terms are positive, off-diagonals negative (probabilities compete).
  3. The log-sum-exp trick (subtract max before exp) prevents numerical overflow without changing the result.
  4. Temperature T controls sharpness: T→0 gives argmax, T→∞ gives uniform; used in knowledge distillation and sampling.
  5. Combined with cross-entropy loss, the gradient simplifies beautifully to ∂L/∂z = ŷ − y.

Quiz

Q1: The softmax of [0, 0, 0] outputs what distribution?

A) [0, 0, 0] B) [⅓, ⅓, ⅓] C) [1, 0, 0] D) [0.5, 0.5, 0]

Answer & Explanation **B** — e⁰ = 1 for all entries. Sum = 3. Each output = 1/3. Equal logits produce a uniform distribution. A is false (softmax never outputs zero). C would require one logit to dominate. D is incorrect for three classes.

Q2: Why are the off-diagonal entries of the softmax Jacobian negative?

A) Because the exponential function decreases for negative values B) Because increasing zⱼ pushes probability mass toward class j, reducing it for other classes (sum must be 1) C) Because softmax outputs must sum to zero D) They are not always negative — it depends on the logits

Answer & Explanation **B** — ∂ŷᵢ/∂zⱼ = −ŷᵢ·ŷⱼ < 0 for i ≠ j. Increasing zⱼ increases ŷⱼ, and since probabilities must sum to 1, the others must decrease. A is false (e^(zᵢ) doesn't change with zⱼ for i ≠ j). D is false (the formula is always negative for positive probabilities).

Q3: What is the purpose of the temperature parameter T in softmax?

A) To prevent numerical overflow B) To control the variance of the logits C) To adjust the entropy/sharpness of the output distribution D) To normalize the input logits

Answer & Explanation **C** — softmax(z/T). Low T amplifies differences (sharper, one-hot-like). High T diminishes differences (softer, more uniform). Used in knowledge distillation (high T reveals "dark knowledge") and temperature sampling. A describes the log-sum-exp trick, not temperature.

Q4: The log-sum-exp trick computes softmax by subtracting m = max(zᵢ). What is the maximum possible value of exp(zᵢ − m)?

A) 0 B) e C) 1 D) Depends on the logits

Answer & Explanation **C** — Since m = max(zᵢ), we have zᵢ − m ≤ 0 for all i. The maximum is exactly 0 for at least one entry (the max itself), giving exp(0) = 1. No overflow possible — all exponentiated values are ≤ 1.

Q5: The gradient of cross-entropy loss combined with softmax simplifies to ∂L/∂z = ŷ − y. What is this gradient when the prediction is perfect (ŷ = y)?

A) 0 (all zeros) B) 1 (all ones) C) ŷ D) −ŷ

Answer & Explanation **A** — If ŷ = y exactly (perfect prediction), then ∂L/∂z = 0. The network has nothing to learn from this example — gradients vanish. This elegant simplification is one reason softmax + cross-entropy is so widely used.

Pitfalls



Next Steps

Move on to 16-04 — Loss Functions to understand MSE, cross-entropy, and how loss functions connect to maximum likelihood estimation and information theory.