16-02 — Activation Functions
Phase: 16 (Neural Network Mathematics)
Subject: 16-02
Prerequisites: 16-01 (Perceptron), Phase 4 (derivatives), Phase 3 (functions)
Next subject: 16-03 (The Softmax Function)
Learning Objectives
By the end of this subject, you will be able to:
- Write the mathematical definitions of sigmoid, tanh, ReLU, Leaky ReLU, GELU, and SiLU/Swish, and sketch their graphs
- Derive the derivatives of sigmoid and tanh in closed form, and explain why this matters for backpropagation
- Explain the vanishing gradient problem for sigmoid/tanh and how ReLU mitigates it
- Define the "dying ReLU" problem and explain how Leaky ReLU and parametric variants address it
- Choose appropriate activation functions for hidden layers vs. output layers based on mathematical properties
Core Content
1. Why Activation Functions?
In 16-01, we saw that a perceptron computes wᵀx + b and applies a step function (sign). If we stack multiple layers of linear transformations, the composition remains linear, so a deep network would be no more powerful than a single layer. Activation functions introduce non-linearity, which lets a sufficiently wide network approximate any continuous function on a compact domain to arbitrary accuracy (universal approximation theorem).
⚠️ THIS IS CRITICAL: Without non-linear activation functions, a multi-layer neural network collapses to a single linear transformation. The ability of deep networks to represent non-linear functions comes entirely from these non-linearities.
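The collapse can be checked numerically. The sketch below is a minimal pure-Python illustration; the weights `W1`, `W2`, biases `b1`, `b2`, and input `x` are arbitrary values chosen for the example, not taken from the text. It compares two stacked linear layers with the single layer W = W2·W1, b = W2·b1 + b2, and then shows that inserting a ReLU between the layers breaks the equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Illustrative weights and input (arbitrary, not from the text).
W1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.1, -0.2]
W2, b2 = [[2.0, 1.0], [-1.0, 4.0]], [0.3, 0.0]
x = [0.7, -1.3]

# Two stacked linear layers with no activation in between.
hidden = add(matvec(W1, x), b1)
two_layer = add(matvec(W2, hidden), b2)

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

# Inserting ReLU between the layers breaks the equivalence.
with_relu = add(matvec(W2, [max(0.0, h) for h in hidden]), b2)

max_gap = max(abs(p - q) for p, q in zip(two_layer, one_layer))
print("two layers :", two_layer)
print("one layer  :", one_layer)
print("with ReLU  :", with_relu)
```

The first two printed vectors agree up to floating-point rounding, while the third differs because one hidden unit is clipped to zero.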
2. The Sigmoid Function
The sigmoid (logistic) function squashes any real number into the range (0, 1):
σ(x) = 1 / (1 + e⁻ˣ)
Properties:
- Domain: (−∞, ∞)
- Range: (0, 1)
- σ(0) = 0.5
- σ(x) → 0 as x → −∞
- σ(x) → 1 as x → +∞
- Monotonically increasing
- Smooth (infinitely differentiable)
Interpreting as a probability: Because σ(x) outputs values in (0,1), it can be interpreted as a probability. This is why sigmoid is commonly used in the output layer for binary classification — it gives P(y = 1 | x).
Relationship to the logit (log-odds): If p = σ(x), then x = ln(p/(1−p)) is the log-odds, or logit. The sigmoid is the inverse of the logit function.
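As a quick illustration of the sigmoid/logit relationship, here is a minimal sketch using only the standard library. The function names `sigmoid` and `logit` are chosen for this example. The two-branch form of `sigmoid` is one common way to avoid overflow in `exp` for large negative inputs; it is an implementation choice, not something the definition requires.

```python
import math

def sigmoid(x):
    """Logistic function, written so exp never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) cannot overflow
    return z / (1.0 + z)

def logit(p):
    """Log-odds of a probability p in (0, 1); the inverse of sigmoid."""
    return math.log(p / (1.0 - p))

for x in (-3.0, 0.0, 2.5):
    p = sigmoid(x)
    print(f"x = {x:5.1f}   sigmoid(x) = {p:.4f}   logit(sigmoid(x)) = {logit(p):.4f}")
```

Applying `logit` to the output of `sigmoid` recovers the original input, up to rounding.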
Derivative of Sigmoid
The derivative has a remarkably clean form:
σ'(x) = σ(x)(1 − σ(x))
Derivation:
σ(x) = (1 + e⁻ˣ)⁻¹
Using the chain rule: σ'(x) = −(1 + e⁻ˣ)⁻² · (−e⁻ˣ) = e⁻ˣ/(1 + e⁻ˣ)²
Split the fraction into a product: e⁻ˣ/(1 + e⁻ˣ)² = [1/(1 + e⁻ˣ)] · [e⁻ˣ/(1 + e⁻ˣ)]
The first factor is σ(x). For the second, add and subtract 1 in the numerator: e⁻ˣ/(1 + e⁻ˣ) = (1 + e⁻ˣ − 1)/(1 + e⁻ˣ) = 1 − 1/(1 + e⁻ˣ) = 1 − σ(x)
Therefore: σ'(x) = σ(x)(1 − σ(x)) ✓
Why this matters: During backpropagation, you multiply by σ'(x). The maximum value of σ'(x) is 0.25 (at x = 0). For large |x|, σ'(x) ≈ 0. This causes the vanishing gradient problem — gradients become extremely small as they flow backward through many sigmoid layers.
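The closed form can be sanity-checked against a numerical derivative. The following is a small sketch under the assumption that a central finite difference with step `h = 1e-5` is accurate enough for this purpose; `central_diff` is a helper defined here for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Closed-form derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_diff(f, x, h=1e-5):
    """Numerical derivative via a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x = {x:5.1f}   closed form = {sigmoid_grad(x):.6f}   "
          f"numerical = {central_diff(sigmoid, x):.6f}")
```

The two columns should agree to several decimal places, with the largest value, 0.25, at x = 0.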
3. Hyperbolic Tangent (tanh)
The tanh function squashes inputs to (−1, 1):
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
Properties:
- Domain: (−∞, ∞)
- Range: (−1, 1)
- tanh(0) = 0 (zero-centered, an advantage over sigmoid)
- tanh(x) → −1 as x → −∞
- tanh(x) → +1 as x → +∞
- tanh(x) = 2σ(2x) − 1 (relationship to sigmoid)
Zero-centered property: Unlike sigmoid (which outputs only positive values), tanh outputs are centered around zero. When a layer's inputs are all positive, the gradients with respect to all incoming weights of a given neuron share the same sign, which forces zig-zag updates. Zero-centered activations avoid this and tend to make optimization smoother.
Derivative of tanh
tanh'(x) = 1 − tanh²(x)
Derivation:
Let y = tanh(x) = (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ)
Using quotient rule: (u'v − uv')/v² where u = eˣ − e⁻ˣ, v = eˣ + e⁻ˣ
u' = eˣ + e⁻ˣ = v
v' = eˣ − e⁻ˣ = u
tanh'(x) = (v·v − u·u)/v² = (v² − u²)/v² = 1 − (u/v)² = 1 − tanh²(x) ✓
Vanishing gradients: Maximum derivative is 1 (at x = 0), but it still approaches 0 rapidly at saturation. Deep networks with tanh still suffer from vanishing gradients.
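Both the identity tanh(x) = 2σ(2x) − 1 and the derivative formula can be checked numerically. This is a minimal sketch; the helper names `tanh_via_sigmoid` and `tanh_grad` are introduced for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Closed-form derivative: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-3.0, -0.8, 0.0, 0.8, 3.0):
    print(f"x = {x:5.1f}   tanh = {math.tanh(x): .6f}   "
          f"2*sigmoid(2x)-1 = {tanh_via_sigmoid(x): .6f}   tanh' = {tanh_grad(x):.6f}")
```

The derivative column peaks at 1 for x = 0 and is already below 0.01 at |x| = 3, which is the saturation behaviour described above.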
4. Rectified Linear Unit (ReLU)
ReLU is one of the most widely used activation functions in modern deep learning:
ReLU(x) = max(0, x)
Properties:
- Domain: (−∞, ∞)
- Range: [0, ∞)
- Not differentiable at x = 0 (but a subgradient exists, and in practice this almost never matters)
- Computationally cheap (just a max operation)
- For x > 0: derivative is 1
- For x < 0: derivative is 0
Why ReLU helps with vanishing gradients: For positive inputs, the derivative is exactly 1, so the activation does not shrink gradients as they flow backward through active neurons. This is in stark contrast to sigmoid, whose derivative never exceeds 0.25, and tanh, whose derivative is below 1 everywhere except x = 0 (and often ≪ 1).
⚠️ THIS IS CRITICAL: ReLU's constant gradient of 1 for positive activations is a key reason very deep networks can be trained. This property makes ReLU the default choice for hidden layers in many architectures.
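A forward and backward pass through ReLU can be sketched in a few lines. The function names are chosen for this example, and treating the derivative at exactly x = 0 as 0 is a convention assumed here (any value in [0, 1] is a valid subgradient).

```python
def relu(xs):
    """Element-wise ReLU over a list of pre-activations."""
    return [max(0.0, x) for x in xs]

def relu_backward(xs, upstream):
    """Pass the upstream gradient through where x > 0, block it elsewhere.

    The derivative at x == 0 is taken to be 0 (a convention, see lead-in).
    """
    return [g if x > 0 else 0.0 for x, g in zip(xs, upstream)]

pre_activations = [-1.5, 0.0, 2.0]   # illustrative values
upstream_grad = [0.3, 0.3, 0.3]      # gradient arriving from the next layer

print(relu(pre_activations))                          # forward pass
print(relu_backward(pre_activations, upstream_grad))  # backward pass
```

The gradient for the active unit passes through unchanged, which is the multiply-by-1 behaviour described above, while inactive units receive exactly zero.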
The Dying ReLU Problem
If a neuron's weights update such that its pre-activation is ≤ 0 for ALL inputs in the training set, the neuron outputs 0 everywhere and the gradient through it is ALWAYS 0. The neuron "dies": it stops learning permanently.
Formally: if wᵀx + b ≤ 0 for all x in the training set, then ∂L/∂w = 0 and ∂L/∂b = 0, so the weights and bias never update again.
Causes:
- Too-high learning rate pushing weights into the negative regime
- Poor initialization
- Large negative bias
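The formal condition translates directly into a check over a dataset. This is a minimal sketch: the helper `is_dead`, the two example neurons, and the three input vectors are all hypothetical values made up for illustration.

```python
def pre_activation(w, b, x):
    """Compute z = w^T x + b for one input vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def is_dead(w, b, inputs):
    """True if the ReLU neuron is inactive (z <= 0) on every given input."""
    return all(pre_activation(w, b, x) <= 0 for x in inputs)

# Hypothetical training inputs with non-negative features.
inputs = [[0.5, 1.0], [2.0, 0.1], [1.0, 1.0]]

dead_neuron = is_dead([-1.0, -0.5], -0.2, inputs)   # z < 0 on every input
live_neuron = is_dead([1.0, -0.5], 0.1, inputs)     # z > 0 on at least one input

print("first neuron dead: ", dead_neuron)
print("second neuron dead:", live_neuron)
```

In a real training loop the same test would be applied per neuron over a batch or the full training set to estimate the fraction of dead units.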
5. Leaky ReLU and Parametric ReLU
Leaky ReLU addresses dying ReLU by allowing a small, non-zero gradient when x < 0:
LeakyReLU(x) = { x,    if x ≥ 0
               { α·x,  if x < 0
where α is a small constant (typically 0.01).
Parametric ReLU (PReLU) makes α a learnable parameter:
PReLU(x) = { x,    if x ≥ 0
           { a·x,  if x < 0
where a is learned via gradient descent along with the other network parameters.
Derivative:
LeakyReLU'(x) = { 1,  if x ≥ 0
                { α,  if x < 0
Since α > 0, gradients can still flow when the pre-activation is negative, so a neuron cannot get permanently stuck with zero gradient.
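The definitions above map onto code almost one to one. The sketch below uses function names chosen for this example; `prelu_grad_wrt_slope` is the extra derivative PReLU needs in order to learn the negative slope, following directly from ∂(a·x)/∂a = x for x < 0.

```python
def leaky_relu(x, alpha=0.01):
    """x for x >= 0, alpha * x otherwise."""
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative with respect to the input x."""
    return 1.0 if x >= 0 else alpha

def prelu_grad_wrt_slope(x):
    """Derivative of PReLU with respect to its learnable slope a."""
    return 0.0 if x >= 0 else x

for x in (-2.0, -0.5, 0.0, 3.0):
    print(f"x = {x:5.1f}   f(x) = {leaky_relu(x): .4f}   "
          f"df/dx = {leaky_relu_grad(x):.2f}   df/da = {prelu_grad_wrt_slope(x): .1f}")
```

For negative inputs the input gradient is α rather than 0, so the weights feeding the neuron keep receiving a (small) update signal.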
6. GELU (Gaussian Error Linear Unit)
GELU is widely used in Transformer architectures (BERT, GPT):
GELU(x) = x · Φ(x) = x · P(X ≤ x) where X ~ N(0, 1)
Where Φ(x) is the CDF of the standard normal distribution: Φ(x) = ½[1 + erf(x/√2)].
Approximation used in practice:
GELU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
Or the simpler sigmoid approximation: GELU(x) ≈ x·σ(1.702x)
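Key property: Unlike ReLU, which has a hard cutoff at 0, GELU smoothly weights inputs by their probability under a Gaussian. It is smooth, non-convex, and non-monotonic: it dips slightly below zero for negative inputs, with a minimum near x ≈ −0.75, and its gradient is non-zero everywhere except at that minimum. This smooth gating, which the GELU formulation motivates as the expected value of a dropout-like stochastic mask, is often credited with helping training stability.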
Intuition: GELU multiplies x by the probability that x is "significant." If x is very negative (far left tail of Gaussian), Φ(x) ≈ 0, so GELU(x) ≈ 0. If x is very positive, Φ(x) ≈ 1, so GELU(x) ≈ x. For intermediate values, it smoothly interpolates.
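The exact form and the two approximations can be compared directly using `math.erf` from the standard library. This is a minimal sketch; the function names and the evaluation grid from −5 to 5 are choices made for this example.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_sigmoid(x):
    """Sigmoid-based approximation: x * sigma(1.702 x)."""
    return x / (1.0 + math.exp(-1.702 * x))

# Largest absolute deviation from the exact form on a grid over [-5, 5].
grid = [i / 100.0 for i in range(-500, 501)]
max_err_tanh = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
max_err_sigmoid = max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in grid)

print(f"max |exact - tanh approx|    = {max_err_tanh:.6f}")
print(f"max |exact - sigmoid approx| = {max_err_sigmoid:.6f}")
```

On this grid the tanh form tracks the exact function much more closely than the sigmoid form, which trades accuracy for simplicity.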
7. SiLU / Swish
SiLU (Sigmoid Linear Unit), also called Swish:
SiLU(x) = x · σ(x)
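The function was proposed under the name SiLU and became widely known as Swish after an automated search over candidate activation functions. Swish is sometimes written with a scale parameter, x · σ(βx); setting β = 1 gives SiLU.
Properties:
- Smooth and non-monotonic
- Bounded below (minimum ≈ −0.28 near x ≈ −1.28; approaches 0 from below as x → −∞) but not bounded above
- Self-gated: the sigmoid acts as a soft gate controlling how much of x passes through
- Derivative: SiLU'(x) = σ(x) + x·σ(x)(1 − σ(x)) = σ(x)(1 + x(1 − σ(x)))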
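The derivative formula and the location of the minimum can be checked numerically. The sketch below assumes a central finite difference with step `1e-5` and a grid search with spacing 0.001 over [−6, 0]; both are illustrative choices, and `central_diff` is a helper defined here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """SiLU / Swish with beta = 1: x * sigma(x)."""
    return x * sigmoid(x)

def silu_grad(x):
    """Closed-form derivative: sigma(x) * (1 + x * (1 - sigma(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

def central_diff(f, x, h=1e-5):
    """Numerical derivative via a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Grid search for the minimum on [-6, 0].
grid = [i / 1000.0 for i in range(-6000, 1)]
x_min = min(grid, key=silu)

print(f"minimum near x = {x_min:.3f}, SiLU(x) = {silu(x_min):.4f}")
for x in (-3.0, 0.0, 2.0):
    print(f"x = {x:4.1f}   closed form = {silu_grad(x): .6f}   "
          f"numerical = {central_diff(silu, x): .6f}")
```

The negative dip is what makes SiLU non-monotonic: to the left of the minimum the derivative is slightly negative.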
8. Choosing Activation Functions: A Practical Guide
Hidden layers:
- Default choice: ReLU (fast, mitigates vanishing gradients)
- If dying ReLU is a problem: Leaky ReLU or PReLU
- For Transformers and large language models: GELU or SiLU/Swish
- Legacy: tanh and sigmoid (still used inside LSTMs/GRUs: sigmoid for the gates, tanh for the candidate and cell-state updates)
Output layers:
- Binary classification: Sigmoid (outputs probability in (0, 1))
- Multi-class classification: Softmax (see 16-03)
- Regression (unbounded): Linear (no activation, or identity)
- Regression (non-negative outputs): ReLU or Softplus (log(1 + eˣ))
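Softplus is the one function in this guide not defined earlier, so here is a small sketch of it. Evaluating log(1 + eˣ) literally overflows for large x (in Python, `math.exp(1000)` raises `OverflowError`), so the version below uses the algebraically equivalent form max(x, 0) + log(1 + e^(−|x|)). That rearrangement is a common numerical trick assumed here, not something prescribed by the text.

```python
import math

def softplus(x):
    """log(1 + e^x), rearranged so exp never receives a large positive argument."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relu(x):
    return max(0.0, x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0, 1000.0):
    print(f"x = {x:7.1f}   softplus = {softplus(x):.6f}   relu = {relu(x):.1f}")
```

Softplus behaves like a smooth ReLU: it stays strictly positive for moderate negative inputs and approaches x for large positive inputs.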
Key Terms
- Activation functions
- GELU
- Leaky ReLU
- Linear
- Parametric ReLU (PReLU)
- ReLU
- Sigmoid
- SiLU/Swish
- Softmax
- Softplus
- Tanh
Worked Examples
Example 1: Derivative of Sigmoid
Problem: Compute σ'(3).
Solution:
σ(3) = 1/(1 + e⁻³) = 1/(1 + 0.0498) ≈ 0.9526
σ'(3) = σ(3)(1 − σ(3)) = 0.9526 × 0.0474 ≈ 0.0452
Notice how small this derivative is — only 4.5% of the signal passes through. If we chain 20 such layers, the gradient would be (0.045)²⁰ ≈ 10⁻²⁷ — practically zero. This is the vanishing gradient problem.
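The numbers in this example are easy to reproduce. The sketch below makes the same simplification as the example itself: every layer sits at a pre-activation of 3, and the weight matrices are ignored, so only the activation derivative is chained.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s = sigmoid(3.0)
grad = s * (1.0 - s)   # sigma'(3)
depth = 20             # number of chained sigmoid layers, as in the example
chained = grad ** depth

print(f"sigmoid(3)      = {s:.4f}")
print(f"sigmoid'(3)     = {grad:.4f}")
print(f"factor after {depth} layers = {chained:.2e}")
```

In a real network the weights also scale the gradient at each layer, so this product is an illustration of the trend rather than a prediction.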
Example 2: Comparing Activations at a Point
Problem: For x = −2, compute sigmoid(x), tanh(x), ReLU(x), LeakyReLU(x, α=0.01), and GELU(x). Compare their outputs.
Solution:
- sigmoid(−2) = 1/(1 + e²) = 1/(1 + 7.3891) ≈ 0.1192
- tanh(−2) = (e⁻² − e²)/(e⁻² + e²) ≈ (0.1353 − 7.3891)/(0.1353 + 7.3891) ≈ −0.9640
- ReLU(−2) = max(0, −2) = 0
- LeakyReLU(−2) = max(−2, 0.01×(−2)) = −0.02
- GELU(−2) = −2 × Φ(−2) = −2 × 0.02275 ≈ −0.0455
Comparison: tanh is close to its lower limit of −1 and sigmoid is moving toward 0, so both have small derivatives here. ReLU outputs exactly 0 with zero gradient. LeakyReLU and GELU pass through a small negative signal and keep a non-zero gradient.
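The five values can be reproduced with the standard library alone. This is a minimal sketch; the dictionary layout is just a convenient way to print the results, and GELU is computed in its exact form via `math.erf`.

```python
import math

x = -2.0
alpha = 0.01  # LeakyReLU slope from the problem statement

values = {
    "sigmoid":    1.0 / (1.0 + math.exp(-x)),
    "tanh":       math.tanh(x),
    "relu":       max(0.0, x),
    "leaky_relu": x if x >= 0 else alpha * x,
    "gelu":       0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))),
}

for name, value in values.items():
    print(f"{name:>10}({x}) = {value: .4f}")
```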
Example 3: GELU Approximation Accuracy
Problem: Compute the exact GELU(1.5) using the CDF and compare with the tanh approximation.
Solution:
Exact: Φ(1.5) = P(Z ≤ 1.5) where Z ~ N(0,1). From standard normal table: Φ(1.5) ≈ 0.9332. GELU(1.5) = 1.5 × 0.9332 = 1.3998
Approximation:
GELU(1.5) ≈ 0.5 × 1.5 × (1 + tanh(√(2/π)(1.5 + 0.044715 × 1.5³)))
= 0.75 × (1 + tanh(√(2/π)(1.5 + 0.044715 × 3.375)))
= 0.75 × (1 + tanh(0.7979 × 1.6509))
= 0.75 × (1 + tanh(1.3172))
= 0.75 × (1 + 0.8661)
= 0.75 × 1.8661 ≈ 1.3996
Error: |1.3998 − 1.3996| = 0.0002. The approximation is excellent: the relative error is about 0.02%.
Practice Problems
(Answers are below. Try each problem before checking.)
Problem 1: Derive the derivative of tanh(x) starting from tanh(x) = (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) and show that tanh'(x) = sech²(x) = 1/cosh²(x).
Problem 2: Show that tanh(x) = 2σ(2x) − 1, and use this relationship to express tanh'(x) in terms of σ and its derivative.
Problem 3: Compute ∂/∂a LeakyReLU(x; a) where the parameter a controls the negative slope. This is needed for PReLU backpropagation.
Problem 4: A deep ReLU network has 100 hidden layers, each with 256 neurons. If 40% of the neurons in each layer have negative pre-activations (output 0), and you treat the layers as independent, what fraction of gradient paths are completely blocked (multiplied by 0 at least once)?
Problem 5: Compare σ'(5), tanh'(5), and the derivative of ReLU at x = 5. Which one will propagate the strongest gradient signal?
Problem 6: Show that the derivative of SiLU(x) = x·σ(x) can also be written as SiLU'(x) = σ(x) + SiLU(x)(1 − σ(x)).
Answers
**Problem 1:** Using the quotient rule with u = eˣ − e⁻ˣ, v = eˣ + e⁻ˣ (so u' = v and v' = u):
tanh'(x) = (v·v − u·u)/v² = (v² − u²)/v²
Note that v² − u² = (eˣ + e⁻ˣ)² − (eˣ − e⁻ˣ)² = 4 (expand and simplify).
So tanh'(x) = 4/(eˣ + e⁻ˣ)² = 1/cosh²(x) = sech²(x). ✓
**Problem 2:** σ(2x) = 1/(1 + e⁻²ˣ)
2σ(2x) − 1 = 2/(1 + e⁻²ˣ) − 1 = (2 − 1 − e⁻²ˣ)/(1 + e⁻²ˣ) = (1 − e⁻²ˣ)/(1 + e⁻²ˣ)
Multiply numerator and denominator by eˣ: (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) = tanh(x). ✓
Differentiating with the chain rule: tanh'(x) = 4σ'(2x) = 4σ(2x)(1 − σ(2x)) = 1 − tanh²(x), as expected.
**Problem 3:** For x < 0: ∂/∂a LeakyReLU(x; a) = ∂/∂a (a·x) = x. For x ≥ 0: ∂/∂a LeakyReLU(x; a) = 0.
**Problem 4:** A gradient path is blocked if ANY neuron along the path has a zero gradient (negative pre-activation). Treating the layers as independent, the probability that a path survives one layer is 0.6. For 100 layers: P(survive) = 0.6¹⁰⁰ ≈ 6.5 × 10⁻²³. So essentially ALL paths are blocked. This illustrates why deep ReLU networks still need good initialization and normalization.
**Problem 5:**
- σ'(5) = σ(5)(1 − σ(5)) ≈ 0.9933 × 0.0067 ≈ 0.0066
- tanh'(5) = 1 − tanh²(5) ≈ 1 − 0.9999² ≈ 0.0002
- ReLU'(5) = 1
ReLU has the strongest gradient (1), followed by sigmoid (0.0066), then tanh (0.0002). This is why ReLU-based networks are easier to train at depth than sigmoid/tanh networks.
**Problem 6:** SiLU'(x) = d/dx[x·σ(x)] = σ(x) + x·σ'(x) = σ(x) + x·σ(x)(1 − σ(x)) = σ(x) + SiLU(x)(1 − σ(x)), since x·σ(x) = SiLU(x). ✓
Summary
- Activation functions introduce non-linearity; without them, deep networks reduce to linear transformations.
- Sigmoid σ(x) = 1/(1+e⁻ˣ) outputs (0,1) and has derivative σ(x)(1−σ(x)), but saturates → vanishing gradients.
- Tanh outputs (−1,1), is zero-centered (better than sigmoid), with derivative 1−tanh²(x); still saturates.
- ReLU = max(0,x) has constant gradient 1 for x>0, mitigating vanishing gradients, but its zero gradient for x<0 can cause the "dying ReLU" problem.
- LeakyReLU/GELU/SiLU soften ReLU's hard zero with small negative slopes or smooth weighting, improving gradient flow and performance in deep architectures.
Pitfalls
- Using sigmoid or tanh in deep networks without considering vanishing gradients: The derivative of sigmoid is at most 0.25, and tanh at most 1.0, but both decay rapidly toward zero when inputs are large in magnitude. In a 50-layer network, the gradient can shrink by factors of 10⁻²⁰ or worse. If you're building a deep network and training stalls with sigmoid/tanh, switch to ReLU or its variants. Sigmoid and tanh are now primarily used in output layers and gating mechanisms (LSTM/GRU), not hidden layers.
- Ignoring the dying ReLU problem until it's too late: If a large fraction of ReLU neurons consistently output zero across the training set, they receive zero gradient and never recover. Monitor the fraction of dead neurons during training. If >20–30% are dead after the first few epochs, reduce the learning rate, switch to LeakyReLU/PReLU, or improve weight initialization (e.g., He initialization instead of Xavier).
- Using linear (identity) activation in hidden layers: Stacking linear layers produces a single equivalent linear layer, so all depth is wasted. The network loses its universal approximation capability. If you accidentally set `activation=None` or `activation='linear'` in hidden layers, the model may still train but will perform no better than a shallow linear model. Always verify that hidden layers have non-linear activations.
- Choosing the wrong activation for the output layer: Binary classification needs sigmoid (output in (0,1)), multi-class needs softmax (probability distribution), regression with unbounded outputs needs linear/no activation, and regression requiring positive outputs needs ReLU or softplus. Using sigmoid for multi-class (sum of probabilities won't be 1) or softmax for binary (wastes a degree of freedom) are common mistakes.
- Overlooking GELU vs ReLU tradeoffs in transformers: GELU is standard in many transformer architectures (BERT, GPT), where its smooth probabilistic gating is generally credited with better training dynamics. But GELU is more expensive to compute than ReLU (it needs erf or a tanh/sigmoid approximation of the Gaussian CDF). For non-transformer architectures (CNNs, simple MLPs), ReLU often works equally well and is faster. Don't cargo-cult GELU into every architecture.
Quiz
Q1: What is σ'(x) expressed in terms of σ(x)?
A) σ(x)²
B) 1 − σ(x)²
C) σ(x)(1 − σ(x))
D) σ(x)/(1 − σ(x))
Answer and Explanations
**Correct: C) σ(x)(1 − σ(x))**
σ'(x) = e⁻ˣ/(1+e⁻ˣ)² = σ(x)(1−σ(x)). Maximum value 0.25 at x=0.
- A) Incorrect. σ(x)² matches the true value 0.25 at x=0 but keeps increasing toward 1 as x → +∞, whereas σ'(x) → 0.
- B) Incorrect. 1−σ(x)² equals 1−0.25=0.75 at x=0, which is too large (the actual maximum is 0.25).
- C) ✓ Correct. Because the derivative reuses σ(x) from the forward pass, it is cheap to compute in backprop.
- D) Incorrect. σ(x)/(1−σ(x)) = eˣ, which grows without bound, whereas σ'(x) never exceeds 0.25.
Q2: Why does ReLU help mitigate the vanishing gradient problem?
A) ReLU has a larger output range than sigmoid
B) The derivative of ReLU is exactly 1 for all positive inputs
C) ReLU is differentiable everywhere
D) ReLU has a smaller parameter count
Answer and Explanations
**Correct: B) The derivative of ReLU is exactly 1 for all positive inputs**
When gradients flow through many layers, each layer multiplies by the activation derivative. For sigmoid this derivative is at most 0.25, and for tanh it is below 1 everywhere except x=0 (often ≪1). The product of many such factors rapidly approaches 0. ReLU's derivative of exactly 1 (for active neurons) means the activation itself does not shrink the gradient.
- A) Incorrect. A larger output range does not by itself keep gradients from shrinking; what matters is the derivative.
- B) ✓ Correct. Multiplicative factor of 1 preserves gradient magnitude.
- C) Incorrect. ReLU is NOT differentiable at x=0 (though this rarely matters in practice).
- D) Incorrect. Parameter count is unrelated to gradient flow.
Q3: A ReLU neuron with weights w = [−2, −3]ᵀ and bias b = 1 receives only inputs from the set {[1,1]ᵀ, [2,2]ᵀ}. Is this neuron "dead"?
A) Yes, it outputs 0 for all training examples
B) No, it outputs positive values for all training examples
C) Yes, because its weights are all negative
D) No, because the bias is positive
Answer and Explanations
**Correct: A) Yes, it outputs 0 for all training examples**
For [1,1]: z = −2·1 + (−3)·1 + 1 = −4 → ReLU(−4) = 0
For [2,2]: z = −2·2 + (−3)·2 + 1 = −9 → ReLU(−9) = 0
The neuron outputs 0 for all training examples. During backprop, ∂L/∂z = 0 (because ReLU'(z) = 0 for z < 0), so no gradient reaches w or b and the neuron is dead.
- A) ✓ Correct. Zero output for all training data means no gradient, no learning.
- B) Incorrect. The outputs are all 0, not positive.
- C) Incorrect. Negative weights don't guarantee dead neurons; the inputs could be negative too.
- D) Incorrect. A positive bias doesn't save the neuron when the negative weighted sum wᵀx outweighs it.
Q4: Which activation function is most appropriate for the hidden layers of a very deep (100+ layer) convolutional network?
A) Sigmoid
B) Tanh
C) ReLU
D) Linear (identity)
Answer and Explanations
**Correct: C) ReLU**
ReLU's constant gradient of 1 for positive activations helps make very deep networks trainable. Sigmoid and tanh would cause gradients to vanish. Linear activation makes depth pointless (a composition of linear functions is linear).
- A) Incorrect. Sigmoid would cause severe vanishing gradients in 100 layers.
- B) Incorrect. Tanh also suffers from vanishing gradients at depth.
- C) ✓ Correct. ReLU (or its variants like LeakyReLU) is the standard for deep networks.
- D) Incorrect. Linear activations collapse the network to a single linear layer, so depth brings no benefit.
Q5: GELU(x) = x·Φ(x) can be viewed as a smooth version of which activation function?
A) Sigmoid
B) Tanh
C) ReLU
D) Softmax
Answer and Explanations
**Correct: C) ReLU**
For large positive x, Φ(x) ≈ 1, so GELU(x) ≈ x (like ReLU). For large negative x, Φ(x) ≈ 0, so GELU(x) ≈ 0 (like ReLU). The key difference is that GELU is smooth at the transition point. It can be seen as ReLU with a "soft" probabilistic gating.
- A) Incorrect. GELU is unbounded above, while sigmoid is bounded in (0, 1).
- B) Incorrect. GELU is unbounded above, while tanh is bounded in (−1, 1) and symmetric about the origin.
- C) ✓ Correct. GELU approximates ReLU but with smooth, probabilistic gating.
- D) Incorrect. Softmax normalizes across multiple values; GELU is element-wise.
Q6: What is the value of tanh'(3)?
A) ≈ 0.99
B) ≈ 0.01
C) ≈ 0.00 (essentially zero)
D) ≈ 1.00
Answer and Explanations
**Correct: B) ≈ 0.01**
tanh(3) = (e³ − e⁻³)/(e³ + e⁻³) ≈ (20.0855 − 0.0498)/(20.0855 + 0.0498) ≈ 0.99505
tanh'(3) = 1 − tanh²(3) ≈ 1 − 0.99013 ≈ 0.0099
- A) 0.99 is too large: tanh is saturated at x=3.
- B) ✓ Correct. ~0.01 means only about 1% of the gradient survives this neuron.
- C) Not quite zero, but very small. "Essentially zero" is close, but numerically it is about 0.01.
- D) 1.00 is tanh'(0), not tanh'(3).
Next Steps
Move on to 16-03 — The Softmax Function to learn how multi-class classification uses the softmax to produce a probability distribution over classes.