16-03 — The Softmax Function

Phase: 16 — Neural Network Mathematics
Subject: 16-03
Prerequisites: 16-02 (Activation Functions), Phase 4 (derivatives), Phase 10 (probability basics)
Next subject: 16-04 — Loss Functions

Learning Objectives

By the end of this subject, you will be able to:

Write the softmax definition and explain why it produces a valid probability distribution
Derive the softmax Jacobian and explain why the off-diagonal terms exist (unlike sigmoid)
Apply the log-sum-exp trick for numerically stable softmax computation
Show that 2-class softmax reduces to the sigmoid function
Explain the effect of the temperature parameter on the softmax distribution

Core Content

1. Definition and Motivation

In multi-class classification with K classes, we want a neural network to output a probability distribution over the K classes — K numbers that are all non-negative and sum to 1. The softmax function does exactly this.

Given a vector of raw scores (logits) z = [z₁, z₂, ..., z_K]ᵀ:

softmax(z)ᵢ = exp(zᵢ) / Σⱼ₌₁^K exp(zⱼ)

The output ŷ = softmax(z) has:

- ŷᵢ > 0 for all i (exponential is always positive)
- Σᵢ ŷᵢ = 1 (normalized by the sum of exponentials)
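The definition translates almost line for line into code. The following is a minimal sketch using only Python's standard library; the function name `softmax` and the sample logits are illustrative choices, and this direct form is not yet protected against overflow (Section 4 addresses that).

```python
import math

def softmax(z):
    """Direct translation of the definition (not numerically hardened)."""
    exps = [math.exp(v) for v in z]   # every term is strictly positive
    total = sum(exps)                 # normalizing constant: sum of exponentials
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.5])
print(probs)                          # three positive numbers, largest first
print(sum(probs))                     # equals 1 up to floating-point rounding
```

Both properties listed above can be read off the code: each numerator is an exponential, so it is positive, and dividing by the shared total forces the outputs to sum to 1.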

⚠️ THIS IS CRITICAL: Softmax is the output layer of most multi-class classifiers, the final layer of language models (where it produces next-token probabilities), and the normalization step in standard attention mechanisms. You will use this constantly.

Why "softmax"? It is a "soft" (smooth, differentiable) version of the argmax function; strictly speaking it is a "soft argmax". As the temperature section shows, softmax smoothly interpolates between a uniform distribution and a hard, one-hot argmax.

2. Relationship to Sigmoid

For K = 2 classes, softmax reduces to the sigmoid function:

Let z = [z₁, z₂]ᵀ. Then:

softmax(z)₁ = σ(z₁ − z₂), where σ(x) = 1/(1 + e^(−x))

Derivation (divide numerator and denominator by e^(z₁)):

softmax(z)₁ = e^(z₁)/(e^(z₁) + e^(z₂))
            = 1/(1 + e^(z₂)/e^(z₁))
            = 1/(1 + e^(z₂ − z₁))
            = σ(z₁ − z₂)

This shows that for binary classification: P(class 1) = σ(z₁ − z₂), which depends only on the DIFFERENCE between logits. This is why binary classifiers often use a single output with sigmoid rather than two outputs with softmax — the extra degree of freedom is redundant (adding a constant to both logits doesn't change the softmax output).
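A quick numerical check can make the identity concrete. This is a small sketch under illustrative assumptions: the helper names `sigmoid` and `softmax2` and the test logits are made up for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax2(z1, z2):
    """Probability of class 1 under a two-class softmax."""
    e1, e2 = math.exp(z1), math.exp(z2)
    return e1 / (e1 + e2)

# The two columns should agree up to floating-point rounding.
for z1, z2 in [(0.3, -1.2), (2.0, 2.0), (-1.0, 4.0)]:
    print(softmax2(z1, z2), sigmoid(z1 - z2))

# Adding the same constant to both logits leaves the output unchanged,
# because only the difference z1 - z2 matters.
print(softmax2(0.3, -1.2), softmax2(0.3 + 5.0, -1.2 + 5.0))
```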

3. The Softmax Jacobian

Unlike element-wise activations (sigmoid, ReLU, tanh) where ∂ŷᵢ/∂zⱼ = 0 for i ≠ j, the softmax couples all inputs together. The derivative is a full K × K Jacobian matrix:

∂ŷᵢ/∂zⱼ = { ŷᵢ(1 − ŷᵢ)   if i = j
          { −ŷᵢ·ŷⱼ       if i ≠ j

Derivation:

Let ŷᵢ = e^(zᵢ)/S where S = Σₖ e^(zₖ).

Case 1: i = j (diagonal):

∂ŷᵢ/∂zᵢ = ∂/∂zᵢ [e^(zᵢ)/S]
        = (e^(zᵢ)·S − e^(zᵢ)·e^(zᵢ))/S²    (quotient rule: ∂S/∂zᵢ = e^(zᵢ))
        = e^(zᵢ)/S − (e^(zᵢ)/S)²
        = ŷᵢ − ŷᵢ²
        = ŷᵢ(1 − ŷᵢ)

Case 2: i ≠ j (off-diagonal):

∂ŷᵢ/∂zⱼ = ∂/∂zⱼ [e^(zᵢ)/S]
        = (0·S − e^(zᵢ)·e^(zⱼ))/S²    (e^(zᵢ) doesn't depend on zⱼ, so the numerator's derivative is 0; ∂S/∂zⱼ = e^(zⱼ))
        = −e^(zᵢ)·e^(zⱼ)/S²
        = −(e^(zᵢ)/S)(e^(zⱼ)/S)
        = −ŷᵢ·ŷⱼ

Compact matrix form: J = diag(ŷ) − ŷŷᵀ

Why off-diagonal terms are negative: Increasing zⱼ increases the denominator S for all outputs, which decreases ŷᵢ (for i ≠ j). The probabilities must sum to 1 — pushing one up pushes others down.
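One way to gain confidence in the formula is to compare it against finite differences. The sketch below assumes illustrative helper names (`softmax_jacobian`, `numeric_jacobian`), an arbitrary logit vector, and a step size `h` chosen for the example; it is a sanity check, not how frameworks compute gradients.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = y_i * (delta_ij - y_j)."""
    y = softmax(z)
    k = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(k)]
            for i in range(k)]

def numeric_jacobian(z, h=1e-6):
    """Central finite differences, used only as an independent check."""
    k = len(z)
    J = [[0.0] * k for _ in range(k)]
    for j in range(k):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += h
        z_minus[j] -= h
        y_plus, y_minus = softmax(z_plus), softmax(z_minus)
        for i in range(k):
            J[i][j] = (y_plus[i] - y_minus[i]) / (2 * h)
    return J

z = [0.5, -0.3, 1.2]
J = softmax_jacobian(z)
J_num = numeric_jacobian(z)
max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
print(max_err)                          # tiny: the two Jacobians agree
print([sum(col) for col in zip(*J)])    # each column sums to roughly 0
```

The column sums being zero reflects the constraint that the outputs always sum to 1: whatever one output gains from a change in zⱼ, the others lose.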

4. Numerical Stability: The Log-Sum-Exp Trick

Computing softmax naively can cause numerical overflow. If zᵢ is large (e.g., zᵢ = 1000), e^(1000) overflows floating-point representation.

The trick: Subtract the maximum from all logits before exponentiating:

softmax(z)ᵢ = exp(zᵢ − m) / Σⱼ exp(zⱼ − m)

where m = maxₖ(zₖ).

Proof of equivalence:

Let m = maxₖ(zₖ). Then:

softmax(z)ᵢ = e^(zᵢ)/Σⱼ e^(zⱼ)
            = (e^(zᵢ − m) · e^m) / (e^m · Σⱼ e^(zⱼ − m))
            = e^(zᵢ − m) / Σⱼ e^(zⱼ − m) ✓

Since zᵢ − m ≤ 0 for all i, the maximum exponentiated value is e⁰ = 1 — no overflow. And at least one term is exactly 1, preventing underflow to all zeros.
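To see the difference in practice, here is a small sketch comparing the naive and max-subtracted versions. The function names are illustrative. Note one assumption about the environment: Python's `math.exp` raises `OverflowError` on very large inputs, whereas array libraries such as NumPy typically return `inf`, after which the division yields NaN.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)                           # shift so the largest logit becomes 0
    exps = [math.exp(v - m) for v in z]  # every value is in (0, 1]
    total = sum(exps)                    # >= 1, since the max entry gives exp(0)
    return [e / total for e in exps]

small = [2.0, 1.0, 0.5]
print(softmax_naive(small))
print(softmax_stable(small))             # same result as the naive version

large = [1000.0, 1001.0, 1002.0]
try:
    softmax_naive(large)
except OverflowError as err:
    print("naive version failed:", err)
print(softmax_stable(large))             # roughly [0.0900, 0.2447, 0.6652]
```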

In log space: The denominator's log is the log-sum-exp function:

LSE(z) = log(Σⱼ exp(zⱼ)) = m + log(Σⱼ exp(zⱼ − m))

This is a smooth approximation of the maximum function and appears widely in ML (log-likelihood computation, attention, etc.).
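The same shift gives a stable log-sum-exp, and from it a stable log-softmax via log ŷᵢ = zᵢ − LSE(z). The following is a hedged sketch; `logsumexp` and `log_softmax` are names chosen for this example, not references to a particular library.

```python
import math

def logsumexp(z):
    """LSE(z) = m + log(sum_j exp(z_j - m)) with m = max(z)."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def log_softmax(z):
    """log softmax(z)_i = z_i - LSE(z); never exponentiates a large number."""
    lse = logsumexp(z)
    return [v - lse for v in z]

z = [1000.0, 1001.0, 1002.0]
lse = logsumexp(z)
print(lse)                                    # a little above max(z) = 1002
print(max(z) <= lse <= max(z) + math.log(len(z)))
print([math.exp(v) for v in log_softmax(z)])  # recovers the softmax probabilities
```

The second print illustrates the "smooth maximum" reading: LSE(z) always lies between max(z) and max(z) + log K.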

5. Temperature Parameter

The temperature T controls the "sharpness" of the softmax distribution:

softmax(z; T)ᵢ = exp(zᵢ/T) / Σⱼ exp(zⱼ/T)

Effect of temperature:

- T → 0: The distribution approaches a one-hot vector at argmax(z). Only the largest logit survives, so softmax → argmax.
- T = 1: Standard softmax.
- T → ∞: The distribution approaches uniform, with every class getting probability 1/K.
- T > 1: "Softens" the distribution, making it more uniform (higher entropy). Used in knowledge distillation to reveal dark knowledge from teacher models.
- T < 1: "Sharpens" the distribution, making it more peaked. Used in some sampling strategies.

Why "temperature"? This comes from statistical mechanics: the Boltzmann distribution for a system at temperature T is pᵢ ∝ exp(−Eᵢ/(kT)). The softmax is a Boltzmann distribution over energy states Eᵢ = −zᵢ.
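The effect of T is easy to observe numerically. The sketch below is illustrative only: `softmax_t` and `entropy` are hypothetical helper names, and the logits and temperatures are arbitrary choices for the demonstration.

```python
import math

def softmax_t(z, T=1.0):
    """Softmax of z / T, computed with the max-subtraction trick."""
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

z = [3.0, 1.0, 0.0]
temperatures = (0.1, 0.5, 1.0, 2.0, 100.0)
for T in temperatures:
    p = softmax_t(z, T)
    print(T, [round(q, 4) for q in p], round(entropy(p), 4))
```

Reading the output from top to bottom, the distribution moves from nearly one-hot toward nearly uniform, and the entropy rises toward its maximum of log K.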

6. Softmax in the Context of Cross-Entropy

The combination of softmax + categorical cross-entropy loss is ubiquitous in classification and language modeling:

L = −Σᵢ yᵢ·log(ŷᵢ) where ŷ = softmax(z)

A beautiful simplification occurs when we compute ∂L/∂z (see 16-04 for the full derivation):

∂L/∂zᵢ = ŷᵢ − yᵢ

The gradient is simply the difference between the predicted and true probabilities. This elegant result is one reason why softmax + cross-entropy is so widely used — it provides clean, well-behaved gradients.
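Before working through the derivation in 16-04, the result can be checked numerically. This is a minimal sketch with made-up logits and a one-hot target; the finite-difference step `h` is an assumption chosen for the example.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """L = -sum_i y_i * log(softmax(z)_i)."""
    p = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

z = [1.5, -0.5, 0.2]
y = [0.0, 1.0, 0.0]          # one-hot target: the true class is index 1

# Claimed closed form: dL/dz = y_hat - y
analytic = [pi - yi for pi, yi in zip(softmax(z), y)]

# Independent check with central finite differences
h = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[i] += h
    z_minus[i] -= h
    numeric.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * h))

print(analytic)
print(numeric)               # agrees with the analytic gradient to many digits
```

The gradient is negative for the true class (push its logit up) and positive for the others (push them down), and its entries sum to zero.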

7. Softmax in Attention Mechanisms

In scaled dot-product attention (Phase 17):

Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V

The softmax here normalizes the attention scores into a probability distribution over input positions — each position "attends" to others with weights summing to 1. The scaling factor √dₖ prevents the dot products from growing too large (which would push softmax into the saturated, nearly-one-hot regime where gradients vanish).
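As a rough illustration of where softmax sits inside attention, here is a plain-Python sketch of the formula above for small lists of row vectors. The function name `attention`, the return of the weights alongside the output, and the toy matrices are assumptions for the example; real implementations use batched tensor operations, masking, and multiple heads.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)              # weights over input positions, sum to 1
        weights.append(w)
        # Output row is the weighted average of the value vectors
        outputs.append([sum(wi * v[c] for wi, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
print(weights)   # one row of weights per query, each row summing to 1
print(out)       # each output row is a convex combination of the rows of V
```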

Key Terms

Jacobian
Softmax
Temperature

Worked Examples

Example 1: Computing Softmax

Problem: Compute softmax([2, 1, 0.5]ᵀ).

Solution:

Step 1 — Exponentiate: e² ≈ 7.3891, e¹ ≈ 2.7183, e^{0.5} ≈ 1.6487

Step 2 — Sum: S = 7.3891 + 2.7183 + 1.6487 = 11.7561

Step 3 — Normalize:

- ŷ₁ = 7.3891/11.7561 ≈ 0.6285
- ŷ₂ = 2.7183/11.7561 ≈ 0.2312
- ŷ₃ = 1.6487/11.7561 ≈ 0.1402

Check: 0.6285 + 0.2312 + 0.1402 = 0.9999 ≈ 1 ✓ (the gap is rounding)

The largest logit (2) gets the most probability mass, but all classes get some.

Example 2: Log-Sum-Exp Trick

Problem: Compute softmax([1000, 1001, 1002]ᵀ) safely.

Solution:

Without the trick: e^{1000}, e^{1001}, e^{1002} would all overflow.

With m = 1002:

- z₁ − m = 1000 − 1002 = −2 → e⁻² ≈ 0.1353
- z₂ − m = 1001 − 1002 = −1 → e⁻¹ ≈ 0.3679
- z₃ − m = 1002 − 1002 = 0 → e⁰ = 1

Sum: 0.1353 + 0.3679 + 1 = 1.5032

ŷ₁ = 0.1353/1.5032 ≈ 0.0900
ŷ₂ = 0.3679/1.5032 ≈ 0.2447
ŷ₃ = 1/1.5032 ≈ 0.6652

All stable, no overflow. The highest logit still dominates the distribution.

Example 3: Softmax with Temperature

Problem: Given logits z = [3, 1, 0]ᵀ, compute softmax at T = 0.5, T = 1, and T = 2. Interpret the results.

Solution:

T = 0.5 (cold/sharp):

- z/T = [6, 2, 0] → exp: [403.43, 7.389, 1] → sum = 411.82
- ŷ ≈ [0.9796, 0.0179, 0.0024] → nearly one-hot

T = 1 (standard):

- exp([3, 1, 0]) = [20.086, 2.718, 1] → sum = 23.804
- ŷ ≈ [0.8438, 0.1142, 0.0420]

T = 2 (hot/soft):

- z/T = [1.5, 0.5, 0] → exp: [4.482, 1.649, 1] → sum = 7.131
- ŷ ≈ [0.6285, 0.2312, 0.1402]

As T increases, the distribution becomes more uniform (entropy increases). As T decreases, it approaches a hard argmax.

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: Compute the softmax Jacobian for z = [1, 0, −1]ᵀ. Give the full 3×3 matrix.

Problem 2: Prove that softmax(z + c·1) = softmax(z) where 1 is a vector of all ones and c is any constant. This is the "translation invariance" of softmax. What does this imply about the number of effective parameters?

Problem 3: For K = 2, derive the full 2×2 softmax Jacobian and show that ∂ŷ₁/∂z₁ = ŷ₁ŷ₂ (note: this matches the diagonal formula ŷ₁(1 − ŷ₁) since ŷ₂ = 1 − ŷ₁).

Problem 4: Compute the entropy H(ŷ) = −Σ ŷᵢ·log(ŷᵢ) of the softmax distribution for z = [2, 2, 2]ᵀ. Compare with the entropy for z = [5, 0, −5]ᵀ.

Problem 5: Show that ∂²/∂zⱼ∂zₖ LSE(z) is the covariance matrix of the softmax distribution. (LSE = log-sum-exp.)

Answers

**Problem 1:** First compute **ŷ**: e¹ ≈ 2.7183, e⁰ = 1, e⁻¹ ≈ 0.3679, sum ≈ 4.0862, so ŷ₁ ≈ 0.6652, ŷ₂ ≈ 0.2447, ŷ₃ ≈ 0.0900.

Diagonal, ŷᵢ(1 − ŷᵢ): [0.2227, 0.1848, 0.0819]

Off-diagonal, −ŷᵢ·ŷⱼ (symmetric in i and j):

- (1,2) and (2,1): −0.1628
- (1,3) and (3,1): −0.0599
- (2,3) and (3,2): −0.0220

J = [[ 0.2227, −0.1628, −0.0599],
     [−0.1628,  0.1848, −0.0220],
     [−0.0599, −0.0220,  0.0819]]

**Problem 2:** softmax(z + c·1)ᵢ = e^(zᵢ+c)/Σⱼ e^(zⱼ+c) = eᶜ·e^(zᵢ)/(eᶜ·Σⱼ e^(zⱼ)) = e^(zᵢ)/Σⱼ e^(zⱼ) = softmax(z)ᵢ

Implication: K logits have only K−1 effective degrees of freedom. You can fix one logit to 0 without loss of expressiveness.

**Problem 3:** For K = 2, ŷ₂ = 1 − ŷ₁.

- ∂ŷ₁/∂z₁ = ŷ₁(1 − ŷ₁) = ŷ₁ŷ₂ ✓
- ∂ŷ₁/∂z₂ = −ŷ₁ŷ₂
- ∂ŷ₂/∂z₁ = −ŷ₁ŷ₂
- ∂ŷ₂/∂z₂ = ŷ₂(1 − ŷ₂) = ŷ₁ŷ₂

Note that columns sum to 0 (since ŷ₁ + ŷ₂ = 1, the derivative must preserve this).

**Problem 4:** For [2, 2, 2]: ŷ = [⅓, ⅓, ⅓]. H = −3·⅓·log(⅓) = log(3) ≈ 1.099 nats.

For [5, 0, −5]: e⁵ ≈ 148.41, e⁰ = 1, e⁻⁵ ≈ 0.0067, sum ≈ 149.42. ŷ ≈ [0.9933, 0.0067, 0.00005]. H ≈ −0.9933·log(0.9933) − 0.0067·log(0.0067) − 0.00005·log(0.00005) ≈ 0.0067 + 0.0335 + 0.0005 ≈ 0.041 nats.

The first is high entropy (uniform), the second is low entropy (peaked).

**Problem 5:** LSE(z) = log(Σⱼ exp(zⱼ))

∂LSE/∂zₖ = exp(zₖ)/Σⱼ exp(zⱼ) = ŷₖ (the softmax probability)

∂²LSE/∂zⱼ∂zₖ = ∂ŷₖ/∂zⱼ = ŷₖ·δ_{jk} − ŷⱼ·ŷₖ

This is exactly the softmax Jacobian, diag(ŷ) − ŷŷᵀ. It is also the covariance matrix of the one-hot indicator vector X of a categorical distribution with probabilities ŷ: Var(Xᵢ) = ŷᵢ(1 − ŷᵢ) and Cov(Xᵢ, Xⱼ) = −ŷᵢ·ŷⱼ for i ≠ j. So the Hessian of LSE equals Cov(X). ✓

Summary

Softmax converts logits to a probability distribution: ŷᵢ = exp(zᵢ)/Σⱼexp(zⱼ), with all outputs >0 and summing to 1.
The Jacobian is ∂ŷᵢ/∂zⱼ = ŷᵢ(δ_{ij} − ŷⱼ) — diagonal terms are positive, off-diagonals negative (probabilities compete).
The log-sum-exp trick (subtract max before exp) prevents numerical overflow without changing the result.
Temperature T controls sharpness: T→0 gives argmax, T→∞ gives uniform; used in knowledge distillation and sampling.
Combined with cross-entropy loss, the gradient simplifies beautifully to ∂L/∂z = ŷ − y.

Quiz

Q1: The softmax of [0, 0, 0] outputs what distribution?

A) [0, 0, 0]
B) [⅓, ⅓, ⅓]
C) [1, 0, 0]
D) [0.5, 0.5, 0]

Answer & Explanation

**B** — e⁰ = 1 for all entries. Sum = 3. Each output = 1/3. Equal logits produce a uniform distribution. A is false (softmax never outputs zero). C would require one logit to dominate. D is incorrect for three classes.

Q2: Why are the off-diagonal entries of the softmax Jacobian negative?

A) Because the exponential function decreases for negative values
B) Because increasing zⱼ pushes probability mass toward class j, reducing it for other classes (sum must be 1)
C) Because softmax outputs must sum to zero
D) They are not always negative; it depends on the logits

Answer & Explanation

**B** — ∂ŷᵢ/∂zⱼ = −ŷᵢ·ŷⱼ < 0 for i ≠ j. Increasing zⱼ increases ŷⱼ, and since probabilities must sum to 1, the others must decrease. A is false (e^(zᵢ) doesn't change with zⱼ for i ≠ j). D is false (the formula is always negative for positive probabilities).

Q3: What is the purpose of the temperature parameter T in softmax?

A) To prevent numerical overflow
B) To control the variance of the logits
C) To adjust the entropy/sharpness of the output distribution
D) To normalize the input logits

Answer & Explanation

**C** — softmax(z/T). Low T amplifies differences (sharper, one-hot-like). High T diminishes differences (softer, more uniform). Used in knowledge distillation (high T reveals "dark knowledge") and temperature sampling. A describes the log-sum-exp trick, not temperature.

Q4: The log-sum-exp trick computes softmax by subtracting m = max(zᵢ). What is the maximum possible value of exp(zᵢ − m)?

A) 0
B) e
C) 1
D) Depends on the logits

Answer & Explanation

**C** — Since m = max(zᵢ), we have zᵢ − m ≤ 0 for all i, with equality for the entry that attains the maximum. That entry gives exp(0) = 1, and every other exponentiated value is ≤ 1, so overflow is impossible.

Q5: The gradient of cross-entropy loss combined with softmax simplifies to ∂L/∂z = ŷ − y. What is this gradient when the prediction is perfect (ŷ = y)?

A) 0 (all zeros)
B) 1 (all ones)
C) ŷ
D) −ŷ

Answer & Explanation

**A** — If ŷ = y exactly (perfect prediction), then ∂L/∂z = 0. The network has nothing to learn from this example — gradients vanish. This elegant simplification is one reason softmax + cross-entropy is so widely used.

Pitfalls

Computing softmax naively without the log-sum-exp trick: For logits like $[1000, 1001, 1002]$, $e^{1002}$ exceeds float32's maximum ($\sim 3.4 \times 10^{38}$) and produces inf. The resulting softmax is [NaN, NaN, NaN]. Always subtract $\max(\mathbf{z})$ before exponentiating — this guarantees all exponentiated values are $\leq 1$ and at least one equals exactly $1$, preventing both overflow and underflow.
Forgetting temperature when sampling from language models: At $T=1$, softmax over logits from a trained language model produces a reasonable distribution. For more diverse or creative generation, $T > 1$ flattens the distribution; for more conservative, near-greedy outputs, $T < 1$ sharpens it. Practical settings often stay close to $1$ (e.g., $0.8$–$1.2$), so using $T=1$ as a default without tuning is a missed opportunity. Temperature is one of the main knobs for controlling the diversity–quality tradeoff in text generation.
Treating softmax as element-wise when computing gradients: Unlike sigmoid where $\partial \hat{y}_i / \partial z_j = 0$ for $i \neq j$, softmax has a full $K \times K$ Jacobian: increasing $z_j$ affects ALL outputs because of the shared denominator. If you compute per-element gradients without accounting for off-diagonal terms ($-\hat{y}_i \hat{y}_j$), your gradient is wrong. Fortunately, the combined softmax + cross-entropy gradient simplifies to $\mathbf{\hat{y}} - \mathbf{y}$, hiding this complexity.
Using two-class softmax when sigmoid is simpler and more efficient: For $K=2$, $\text{softmax}(\mathbf{z})_1 = \sigma(z_1 - z_2)$. The two logits have only one effective degree of freedom. A single sigmoid output is equivalent but uses fewer parameters, avoids the redundant degree of freedom, and produces the same probability. Two-class softmax is correct but wasteful — use sigmoid for binary classification.
Ignoring translation invariance when interpreting logits: $\text{softmax}(\mathbf{z} + c\mathbf{1}) = \text{softmax}(\mathbf{z})$ for any constant $c$. This means individual logit values are meaningless in isolation — only differences between logits matter. If you inspect a trained model's logits and try to interpret $z_3 = 5.2$ as "the model is confident about class 3", you're missing that $z_3 - \max(\mathbf{z})$ is what drives the probability. Always mean-center or max-center logits before interpretation.

Next Steps

Move on to 16-04 — Loss Functions to understand MSE, cross-entropy, and how loss functions connect to maximum likelihood estimation and information theory.
