16-03 — The Softmax Function
Phase: 16 — Neural Network Mathematics
Subject: 16-03
Prerequisites: 16-02 (Activation Functions), Phase 4 (derivatives), Phase 10 (probability basics)
Next subject: 16-04 — Loss Functions
Learning Objectives
By the end of this subject, you will be able to:
- Write the softmax definition and explain why it produces a valid probability distribution
- Derive the softmax Jacobian and explain why the off-diagonal terms exist (unlike sigmoid)
- Apply the log-sum-exp trick for numerically stable softmax computation
- Show that 2-class softmax reduces to the sigmoid function
- Explain the effect of the temperature parameter on the softmax distribution
Core Content
1. Definition and Motivation
In multi-class classification with K classes, we want a neural network to output a probability distribution over the K classes — K numbers that are all non-negative and sum to 1. The softmax function does exactly this.
Given a vector of raw scores (logits) z = [z₁, z₂, ..., z_K]ᵀ:
softmax(z)ᵢ = exp(zᵢ) / Σⱼ₌₁^K exp(zⱼ)
The output ŷ = softmax(z) has:
- ŷᵢ > 0 for all i (exponential is always positive)
- Σᵢ ŷᵢ = 1 (normalized by the sum of exponentials)
⚠️ THIS IS CRITICAL: Softmax is used in virtually every multi-class classifier, in language models (as the final layer producing next-token probabilities), and in standard attention mechanisms. You will use this constantly.
Why "softmax"? It's a "soft" (smooth, differentiable) version of the argmax function. As we'll see with temperature, softmax smoothly interpolates between a uniform distribution and a hard, one-hot argmax.
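The definition above can be transcribed almost literally. The following is a minimal sketch in plain Python (standard library only); the function name `softmax` and the sample logits are illustrative choices, and this direct version is only safe for moderately sized logits.

```python
import math

def softmax(z):
    """Direct transcription of softmax(z)_i = exp(z_i) / sum_j exp(z_j).

    Fine for small logits; large logits need the max-subtraction
    trick discussed in the numerical stability section.
    """
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.5])
print([round(p, 4) for p in probs])   # every entry is strictly positive
print(sum(probs))                     # 1.0 up to floating-point rounding
```

The two printed lines correspond to the two properties listed above: positivity of every output and normalization to 1.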
2. Relationship to Sigmoid
For K = 2 classes, softmax reduces to the sigmoid function:
Let z = [z₁, z₂]ᵀ. Then:
softmax(z)₁ = e^(z₁)/(e^(z₁) + e^(z₂)) = σ(z₁ − z₂), where σ(x) = 1/(1 + e^(−x))
Derivation (divide numerator and denominator by e^(z₁)):
softmax(z)₁ = e^(z₁)/(e^(z₁) + e^(z₂)) = 1/(1 + e^(z₂)/e^(z₁)) = 1/(1 + e^(z₂ − z₁)) = 1/(1 + e^(−(z₁ − z₂))) = σ(z₁ − z₂)
This shows that for binary classification: P(class 1) = σ(z₁ − z₂), which depends only on the DIFFERENCE between logits. This is why binary classifiers often use a single output with sigmoid rather than two outputs with softmax — the extra degree of freedom is redundant (adding a constant to both logits doesn't change the softmax output).
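A quick numerical check of this identity, as a sketch in plain Python. The helper names `sigmoid` and `softmax2` and the test logits are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax2(z1, z2):
    """Two-class softmax written out directly from the definition."""
    e1, e2 = math.exp(z1), math.exp(z2)
    return e1 / (e1 + e2), e2 / (e1 + e2)

# P(class 1) from softmax matches sigmoid of the logit difference.
for z1, z2 in [(2.0, 0.5), (-1.0, 3.0), (0.0, 0.0)]:
    p1, _ = softmax2(z1, z2)
    print(z1, z2, p1, sigmoid(z1 - z2))

# Adding the same constant to both logits leaves the output unchanged,
# which is the redundant degree of freedom mentioned above.
base = softmax2(2.0, 0.5)
shifted = softmax2(2.0 + 10.0, 0.5 + 10.0)
print(base, shifted)
```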
3. The Softmax Jacobian
Unlike element-wise activations (sigmoid, ReLU, tanh) where ∂ŷᵢ/∂zⱼ = 0 for i ≠ j, the softmax couples all inputs together. The derivative is a full K × K Jacobian matrix:
∂ŷᵢ/∂zⱼ = ŷᵢ(1 − ŷᵢ)   if i = j
∂ŷᵢ/∂zⱼ = −ŷᵢ·ŷⱼ       if i ≠ j
Derivation:
Let ŷᵢ = e^(zᵢ)/S where S = Σₖ e^(zₖ).
Case 1: i = j (diagonal):
∂ŷᵢ/∂zᵢ = ∂/∂zᵢ [e^(zᵢ)/S]
= (e^(zᵢ)·S − e^(zᵢ)·e^(zᵢ))/S²   (quotient rule; ∂S/∂zᵢ = e^(zᵢ))
= e^(zᵢ)/S − (e^(zᵢ)/S)²
= ŷᵢ − ŷᵢ²
= ŷᵢ(1 − ŷᵢ)
Case 2: i ≠ j (off-diagonal):
∂ŷᵢ/∂zⱼ = ∂/∂zⱼ [e^(zᵢ)/S]
= (0·S − e^(zᵢ)·e^(zⱼ))/S²   (e^(zᵢ) doesn't depend on zⱼ, so the numerator's derivative is 0; ∂S/∂zⱼ = e^(zⱼ))
= −e^(zᵢ)·e^(zⱼ)/S²
= −(e^(zᵢ)/S)(e^(zⱼ)/S)
= −ŷᵢ·ŷⱼ
Compact matrix form: J = diag(ŷ) − ŷŷᵀ
Why off-diagonal terms are negative: Increasing zⱼ increases the denominator S for all outputs, which decreases ŷᵢ (for i ≠ j). The probabilities must sum to 1 — pushing one up pushes others down.
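The closed form J = diag(ŷ) − ŷŷᵀ can be sanity-checked against finite differences. The sketch below uses plain Python lists; `softmax_jacobian` and `numerical_jacobian` are hypothetical helper names, and the step size `eps` is an arbitrary choice.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = y_i * (delta_ij - y_j)."""
    y = softmax(z)
    k = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(k)]
            for i in range(k)]

def numerical_jacobian(z, eps=1e-6):
    """Central finite differences, used only as a cross-check."""
    k = len(z)
    jac = [[0.0] * k for _ in range(k)]
    for j in range(k):
        z_plus = list(z)
        z_plus[j] += eps
        z_minus = list(z)
        z_minus[j] -= eps
        y_plus, y_minus = softmax(z_plus), softmax(z_minus)
        for i in range(k):
            jac[i][j] = (y_plus[i] - y_minus[i]) / (2 * eps)
    return jac

z = [1.0, 0.0, -1.0]
J = softmax_jacobian(z)
J_num = numerical_jacobian(z)
for row in J:
    print([round(v, 4) for v in row])
# Largest disagreement between the analytic and numerical Jacobians.
print(max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3)))
```

Each row of J sums to zero, which is the matrix version of the statement that pushing one probability up must push the others down.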
4. Numerical Stability: The Log-Sum-Exp Trick
Computing softmax naively can cause numerical overflow. If zᵢ is large (e.g., zᵢ = 1000), e^(1000) overflows floating-point representation.
The trick: Subtract the maximum from all logits before exponentiating:
softmax(z)ᵢ = exp(zᵢ − m) / Σⱼ exp(zⱼ − m)
where m = maxₖ(zₖ).
Proof of equivalence:
Let m = maxₖ(zₖ). Then:
softmax(z)ᵢ = e^(zᵢ)/Σⱼ e^(zⱼ) = (e^(zᵢ − m)·e^m)/(e^m·Σⱼ e^(zⱼ − m)) = e^(zᵢ − m)/Σⱼ e^(zⱼ − m) ✓
Since zᵢ − m ≤ 0 for all i, the largest exponentiated value is e⁰ = 1, so nothing can overflow. At least one term is exactly 1, so the denominator is at least 1 and cannot underflow to zero.
In log space: The denominator's log is the log-sum-exp function:
LSE(z) = log(Σⱼ exp(zⱼ)) = m + log(Σⱼ exp(zⱼ − m))
This is a smooth approximation of the maximum function and appears widely in ML (log-likelihood computation, attention, etc.).
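The max-subtraction trick and the log-sum-exp identity can be sketched in a few lines of plain Python. The names `stable_softmax` and `logsumexp` are illustrative. Note one assumption specific to this sketch: Python's `math.exp` raises `OverflowError` on overflow, whereas array libraries typically return `inf` instead.

```python
import math

def logsumexp(z):
    """LSE(z) = m + log(sum_j exp(z_j - m)) with m = max(z)."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def stable_softmax(z):
    """Softmax with the maximum subtracted before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]   # every value is in (0, 1]
    total = sum(exps)                      # at least 1, so never zero
    return [e / total for e in exps]

z = [1000.0, 1001.0, 1002.0]

# The naive route fails: exp(1002) is outside the double-precision range.
try:
    math.exp(z[-1])
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = stable_softmax(z)
lse = logsumexp(z)
print(naive_overflowed)
print([round(p, 4) for p in probs])
print(lse)

# Log-probabilities follow directly: log softmax(z)_i = z_i - LSE(z).
log_probs = [v - lse for v in z]
print(log_probs)
```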
5. Temperature Parameter
The temperature T controls the "sharpness" of the softmax distribution:
softmax(z; T)ᵢ = exp(zᵢ/T) / Σⱼ exp(zⱼ/T)
Effect of temperature:
- T → 0: The distribution approaches a one-hot vector at argmax(z). Only the largest logit survives: softmax → argmax.
- T = 1: Standard softmax.
- T → ∞: The distribution approaches uniform: every class gets probability 1/K.
- T > 1: "Softens" the distribution, making it more uniform (higher entropy). Used in knowledge distillation to reveal dark knowledge from teacher models.
- T < 1: "Sharpens" the distribution, making it more peaked. Used in some sampling strategies.
Why "temperature"? This comes from statistical mechanics: the Boltzmann distribution for a system at temperature T is pᵢ ∝ exp(−Eᵢ/(kT)). The softmax is a Boltzmann distribution over energy states Eᵢ = −zᵢ, in units where k = 1.
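The effect of T is easy to see by sweeping a few values and tracking the entropy of the result. This is a sketch in plain Python; `softmax_t`, `entropy`, and the particular temperatures are illustrative choices, not part of any library.

```python
import math

def softmax_t(z, T=1.0):
    """Softmax of z / T, computed with max subtraction for stability."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

z = [3.0, 1.0, 0.0]
for T in (0.1, 0.5, 1.0, 2.0, 100.0):
    p = softmax_t(z, T)
    print(T, [round(q, 4) for q in p], round(entropy(p), 4))
```

Reading the output from top to bottom, the distribution moves from nearly one-hot toward nearly uniform, and the entropy rises toward log(3).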
6. Softmax in the Context of Cross-Entropy
The combination of softmax + categorical cross-entropy loss is ubiquitous in classification and language modeling:
L = −Σᵢ yᵢ·log(ŷᵢ) where ŷ = softmax(z)
A beautiful simplification occurs when we compute ∂L/∂z (see 16-04 for the full derivation):
∂L/∂zᵢ = ŷᵢ − yᵢ
The gradient is simply the difference between the predicted and true probabilities. This elegant result is one reason why softmax + cross-entropy is so widely used — it provides clean, well-behaved gradients.
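The ŷ − y result can be checked numerically without deriving anything. The sketch below compares it against central finite differences of the loss; the helper names, the one-hot target, and the step size are assumptions made for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """L = -sum_i y_i * log(softmax(z)_i), using log softmax = z - LSE(z)."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return -sum(yi * (zi - lse) for zi, yi in zip(z, y))

z = [2.0, 1.0, 0.5]
y = [0.0, 1.0, 0.0]   # one-hot target: the true class is the second one

# Closed-form gradient: predicted probabilities minus targets.
analytic = [p - t for p, t in zip(softmax(z), y)]

# Central finite differences of the loss with respect to each logit.
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = list(z)
    z_plus[i] += eps
    z_minus = list(z)
    z_minus[i] -= eps
    numeric.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps))

print([round(g, 4) for g in analytic])
print([round(g, 4) for g in numeric])
```

The gradient is negative only for the true class, so a gradient-descent step raises that logit and lowers the others.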
7. Softmax in Attention Mechanisms
In scaled dot-product attention (Phase 17):
Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V
The softmax here normalizes the attention scores into a probability distribution over input positions: each query position attends to the others with weights summing to 1. Dividing by √dₖ keeps the dot products from growing with the key dimension dₖ; without it, large scores would push softmax into the saturated, nearly one-hot regime where gradients vanish.
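To make the role of softmax concrete, here is a toy sketch of the formula using plain Python lists as matrices. The function name `attention` and the small Q, K, V values are invented for illustration; real implementations are batched and use an array library.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for small list-of-lists matrices.

    Q is n x d_k, K is m x d_k, V is m x d_v.
    Returns (output, weights), where each row of weights sums to 1.
    """
    scale = math.sqrt(len(K[0]))
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))   # softmax over key positions
    d_v = len(V[0])
    output = [[sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
              for w in weights]
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, w = attention(Q, K, V)
for row in w:
    print([round(v, 4) for v in row], round(sum(row), 4))
print(out)
```

Each output row is a weighted average of the rows of V, with the weights given by the softmax over that query's scores.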
Key Terms
- Jacobian
- Softmax
- Temperature
Worked Examples
Example 1: Computing Softmax
Problem: Compute softmax([2, 1, 0.5]ᵀ).
Solution:
Step 1 — Exponentiate: e² ≈ 7.3891, e¹ ≈ 2.7183, e^{0.5} ≈ 1.6487
Step 2 — Sum: S = 7.3891 + 2.7183 + 1.6487 = 11.7561
Step 3 — Normalize:
- ŷ₁ = 7.3891/11.7561 ≈ 0.6285
- ŷ₂ = 2.7183/11.7561 ≈ 0.2312
- ŷ₃ = 1.6487/11.7561 ≈ 0.1402
Check: 0.6285 + 0.2312 + 0.1402 = 0.9999 ≈ 1 ✓ (the shortfall is rounding)
The largest logit (2) gets the most probability mass, but all classes get some.
Example 2: Log-Sum-Exp Trick
Problem: Compute softmax([1000, 1001, 1002]ᵀ) safely.
Solution:
Without the trick: e^{1000}, e^{1001}, e^{1002} would all overflow.
With m = 1002:
- z₁ − m = 1000 − 1002 = −2 → e⁻² ≈ 0.1353
- z₂ − m = 1001 − 1002 = −1 → e⁻¹ ≈ 0.3679
- z₃ − m = 1002 − 1002 = 0 → e⁰ = 1
Sum: 0.1353 + 0.3679 + 1 = 1.5032
- ŷ₁ = 0.1353/1.5032 ≈ 0.0900
- ŷ₂ = 0.3679/1.5032 ≈ 0.2447
- ŷ₃ = 1/1.5032 ≈ 0.6652
All stable, no overflow. The highest logit still dominates the distribution.
Example 3: Softmax with Temperature
Problem: Given logits z = [3, 1, 0]ᵀ, compute softmax at T = 0.5, T = 1, and T = 2. Interpret the results.
Solution:
T = 0.5 (cold/sharp):
- z/T = [6, 2, 0] → exp: [403.43, 7.389, 1] → sum = 411.82
- ŷ ≈ [0.9796, 0.0179, 0.0024] → nearly one-hot
T = 1 (standard):
- exp([3, 1, 0]) = [20.086, 2.718, 1] → sum = 23.804
- ŷ ≈ [0.8438, 0.1142, 0.0420]
T = 2 (hot/soft):
- z/T = [1.5, 0.5, 0] → exp: [4.482, 1.649, 1] → sum = 7.131
- ŷ ≈ [0.6285, 0.2312, 0.1402]
As T increases, the distribution becomes more uniform (entropy increases). As T decreases, it approaches a hard argmax.
Practice Problems
(Answers are below. Try each problem before checking.)
Problem 1: Compute the softmax Jacobian for z = [1, 0, −1]ᵀ. Give the full 3×3 matrix.
Problem 2: Prove that softmax(z + c·1) = softmax(z) where 1 is a vector of all ones and c is any constant. This is the "translation invariance" of softmax. What does this imply about the number of effective parameters?
Problem 3: For K = 2, derive the full 2×2 softmax Jacobian and show that ∂ŷ₁/∂z₁ = ŷ₁ŷ₂ (note: this matches the diagonal formula ŷ₁(1 − ŷ₁) since ŷ₂ = 1 − ŷ₁).
Problem 4: Compute the entropy H(ŷ) = −Σ ŷᵢ·log(ŷᵢ) of the softmax distribution for z = [2, 2, 2]ᵀ. Compare with the entropy for z = [5, 0, −5]ᵀ.
Problem 5: Show that ∂²/∂zⱼ∂zₖ LSE(z) is the covariance matrix of the softmax distribution. (LSE = log-sum-exp.)
Answers
**Problem 1:** First compute **ŷ**: e¹ ≈ 2.7183, e⁰ = 1, e⁻¹ ≈ 0.3679, sum ≈ 4.0862, so ŷ₁ ≈ 0.6652, ŷ₂ ≈ 0.2447, ŷ₃ ≈ 0.0900.
Diagonal entries ŷᵢ(1 − ŷᵢ): [0.2227, 0.1848, 0.0819]
Off-diagonal entries −ŷᵢ·ŷⱼ:
- (1,2) and (2,1): −0.1628
- (1,3) and (3,1): −0.0599
- (2,3) and (3,2): −0.0220
J = [[ 0.2227, −0.1628, −0.0599],
     [−0.1628,  0.1848, −0.0220],
     [−0.0599, −0.0220,  0.0819]]
**Problem 2:** softmax(z + c·1)ᵢ = e^(zᵢ+c)/Σⱼ e^(zⱼ+c) = eᶜ·e^(zᵢ)/(eᶜ·Σⱼ e^(zⱼ)) = e^(zᵢ)/Σⱼ e^(zⱼ) = softmax(z)ᵢ
Implication: K logits have only K−1 effective degrees of freedom. You can fix one logit to 0 without loss of expressiveness.
**Problem 3:** For K=2, ŷ₂ = 1−ŷ₁.
- ∂ŷ₁/∂z₁ = ŷ₁(1−ŷ₁) = ŷ₁ŷ₂ ✓
- ∂ŷ₁/∂z₂ = −ŷ₁ŷ₂
- ∂ŷ₂/∂z₁ = −ŷ₁ŷ₂
- ∂ŷ₂/∂z₂ = ŷ₂(1−ŷ₂) = ŷ₁ŷ₂
Note that columns sum to 0 (since ŷ₁+ŷ₂=1, the derivative must preserve this).
**Problem 4:** For [2,2,2]: ŷ = [⅓, ⅓, ⅓]. H = −3·⅓·log(⅓) = log(3) ≈ 1.099 nats.
For [5,0,−5]: e⁵ ≈ 148.41, e⁰ = 1, e⁻⁵ ≈ 0.0067, sum ≈ 149.42, so ŷ ≈ [0.9933, 0.0067, 0.00005]. H ≈ −0.9933·log(0.9933) − 0.0067·log(0.0067) − 0.00005·log(0.00005) ≈ 0.0067 + 0.0335 + 0.0005 ≈ 0.041 nats.
The first is high entropy (uniform), the second is low entropy (peaked).
**Problem 5:** LSE(z) = log(Σⱼ exp(zⱼ))
∂LSE/∂zₖ = exp(zₖ)/Σⱼ exp(zⱼ) = ŷₖ (the softmax probability)
∂²LSE/∂zⱼ∂zₖ = ∂ŷₖ/∂zⱼ = ŷₖ·δ_{jk} − ŷⱼ·ŷₖ
This is exactly the softmax Jacobian. It is also the covariance matrix of the one-hot indicator vector X of a categorical distribution with probabilities ŷ: Var(Xᵢ) = ŷᵢ(1 − ŷᵢ) and Cov(Xᵢ, Xⱼ) = −ŷᵢ·ŷⱼ for i ≠ j. So the Hessian of LSE equals Cov(X). ✓
Summary
- Softmax converts logits to a probability distribution: ŷᵢ = exp(zᵢ)/Σⱼexp(zⱼ), with all outputs >0 and summing to 1.
- The Jacobian is ∂ŷᵢ/∂zⱼ = ŷᵢ(δ_{ij} − ŷⱼ) — diagonal terms are positive, off-diagonals negative (probabilities compete).
- The log-sum-exp trick (subtract max before exp) prevents numerical overflow without changing the result.
- Temperature T controls sharpness: T→0 gives argmax, T→∞ gives uniform; used in knowledge distillation and sampling.
- Combined with cross-entropy loss, the gradient simplifies beautifully to ∂L/∂z = ŷ − y.
Quiz
Q1: The softmax of [0, 0, 0] outputs what distribution?
A) [0, 0, 0] B) [⅓, ⅓, ⅓] C) [1, 0, 0] D) [0.5, 0.5, 0]
Answer & Explanation
**B** — e⁰ = 1 for all entries. Sum = 3. Each output = 1/3. Equal logits produce a uniform distribution. A is false (softmax never outputs zero). C would require one logit to dominate. D is incorrect for three classes.
Q2: Why are the off-diagonal entries of the softmax Jacobian negative?
A) Because the exponential function decreases for negative values B) Because increasing zⱼ pushes probability mass toward class j, reducing it for other classes (sum must be 1) C) Because softmax outputs must sum to zero D) They are not always negative — it depends on the logits
Answer & Explanation
**B** — ∂ŷᵢ/∂zⱼ = −ŷᵢ·ŷⱼ < 0 for i ≠ j. Increasing zⱼ increases ŷⱼ, and since probabilities must sum to 1, the others must decrease. A is false (e^(zᵢ) doesn't change with zⱼ for i ≠ j). D is false (the formula is always negative for positive probabilities).
Q3: What is the purpose of the temperature parameter T in softmax?
A) To prevent numerical overflow B) To control the variance of the logits C) To adjust the entropy/sharpness of the output distribution D) To normalize the input logits
Answer & Explanation
**C** — softmax(z/T). Low T amplifies differences (sharper, one-hot-like). High T diminishes differences (softer, more uniform). Used in knowledge distillation (high T reveals "dark knowledge") and temperature sampling. A describes the log-sum-exp trick, not temperature.
Q4: The log-sum-exp trick computes softmax by subtracting m = max(zᵢ). What is the maximum possible value of exp(zᵢ − m)?
A) 0 B) e C) 1 D) Depends on the logits
Answer & Explanation
**C** — Since m = max(zᵢ), we have zᵢ − m ≤ 0 for all i. The difference is exactly 0 for at least one entry (the max itself), giving exp(0) = 1. No overflow is possible because all exponentiated values are ≤ 1.
Q5: The gradient of cross-entropy loss combined with softmax simplifies to ∂L/∂z = ŷ − y. What is this gradient when the prediction is perfect (ŷ = y)?
A) 0 (all zeros) B) 1 (all ones) C) ŷ D) −ŷ
Answer & Explanation
**A** — If ŷ = y exactly (perfect prediction), then ∂L/∂z = 0. The network has nothing to learn from this example — gradients vanish. This elegant simplification is one reason softmax + cross-entropy is so widely used.
Pitfalls
- Computing softmax naively without the log-sum-exp trick: For logits like $[1000, 1001, 1002]$, $e^{1002}$ exceeds float32's maximum ($\sim 3.4 \times 10^{38}$) and produces inf. The resulting softmax is [NaN, NaN, NaN], because inf/inf is undefined. Always subtract $\max(\mathbf{z})$ before exponentiating: this guarantees all exponentiated values are $\leq 1$ and at least one equals exactly $1$, preventing both overflow and a zero denominator.
- Forgetting temperature when sampling from language models: At $T=1$, softmax over logits from a trained language model produces a reasonable distribution. Typical sampling temperatures sit near 1 (e.g., $0.8$–$1.2$): $T > 1$ flattens the distribution for more diverse, creative text, while $T < 1$ sharpens it toward greedy, safer outputs. Using $T=1$ as a default without tuning is a missed opportunity. Temperature is a primary knob for controlling the diversity–quality tradeoff in text generation.
- Treating softmax as element-wise when computing gradients: Unlike sigmoid where $\partial \hat{y}_i / \partial z_j = 0$ for $i \neq j$, softmax has a full $K \times K$ Jacobian: increasing $z_j$ affects ALL outputs because of the shared denominator. If you compute per-element gradients without accounting for off-diagonal terms ($-\hat{y}_i \hat{y}_j$), your gradient is wrong. Fortunately, the combined softmax + cross-entropy gradient simplifies to $\mathbf{\hat{y}} - \mathbf{y}$, hiding this complexity.
- Using two-class softmax when sigmoid is simpler and more efficient: For $K=2$, $\text{softmax}(\mathbf{z})_1 = \sigma(z_1 - z_2)$. The two logits have only one effective degree of freedom. A single sigmoid output is equivalent but uses fewer parameters, avoids the redundant degree of freedom, and produces the same probability. Two-class softmax is correct but wasteful — use sigmoid for binary classification.
- Ignoring translation invariance when interpreting logits: $\text{softmax}(\mathbf{z} + c\mathbf{1}) = \text{softmax}(\mathbf{z})$ for any constant $c$. This means individual logit values are meaningless in isolation — only differences between logits matter. If you inspect a trained model's logits and try to interpret $z_3 = 5.2$ as "the model is confident about class 3", you're missing that $z_3 - \max(\mathbf{z})$ is what drives the probability. Always mean-center or max-center logits before interpretation.
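The translation-invariance pitfall is easy to demonstrate. The sketch below uses made-up logits to show that a shifted vector gives the same probabilities, and that a logit of 5.2 says little on its own when the other logits are close to it.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [5.2, 5.1, 4.9]                      # hypothetical logits
shifted = [v + 100.0 for v in z]          # same differences, different scale
centered = [v - max(z) for v in z]        # max-centered: largest entry is 0

p = softmax(z)
p_shifted = softmax(shifted)
p_centered = softmax(centered)

print([round(q, 4) for q in p])          # first entry is only about 0.38
print([round(q, 4) for q in p_shifted])  # same distribution after the shift
print(centered)                          # the differences that actually matter
```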
Next Steps
Move on to 16-04 — Loss Functions to understand MSE, cross-entropy, and how loss functions connect to maximum likelihood estimation and information theory.