15-01 — Floating-Point Arithmetic
Phase: Numerical Methods for ML | Subject: 15-01 | Prerequisites: 00-03-decimals.md, basic binary representation | Next subject: 15-02-condition-stability.md
Learning Objectives
By the end of this subject, you will be able to:
- Explain the IEEE 754 floating-point representation (sign, exponent, mantissa)
- Compute machine epsilon and explain its significance
- Diagnose and avoid catastrophic cancellation
- Explain why $0.1 + 0.2 \neq 0.3$ in floating point
- Apply numerically stable alternatives to common formulas (e.g., softmax, log-sum-exp, variance)
Core Content
IEEE 754 Double Precision (float64)
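The default floating-point format in Python and NumPy, and the reference format throughout this subject (ML training often runs in float32 or lower, which follows the same scheme with fewer bits):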
$$\text{value} = (-1)^s \times (1.m) \times 2^{e - 1023}$$
| Component | Bits | Meaning |
|---|---|---|
| Sign $s$ | 1 bit | 0 = positive, 1 = negative |
| Exponent $e$ | 11 bits | Biased by 1023; range $[1, 2046]$ (0 and 2047 are special) |
| Mantissa $m$ | 52 bits | Fractional part; leading 1 is implicit (normalized numbers) |
Example: Represent 6.5 in float64. $6.5 = 110.1_2 = 1.101_2 \times 2^2$ Sign: $s = 0$. Exponent: $e = 2 + 1023 = 1025 = 10000000001_2$. Mantissa: $101000\ldots0_2$ (52 bits).
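The 6.5 example can be checked by pulling the three fields out of the stored bits. This is a minimal sketch using only the standard library; the helper name `float64_fields` is illustrative, not part of any library.

```python
import struct

def float64_fields(x):
    """Split a float64 into (sign, biased exponent, 52-bit mantissa field)."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # 11 exponent bits, still biased
    mantissa = bits & ((1 << 52) - 1)    # 52 fraction bits, implicit 1 not stored
    return sign, exponent, mantissa

sign, exponent, mantissa = float64_fields(6.5)
print(sign, exponent, format(mantissa, "052b")[:8])   # 0 1025 10100000
```

The printed mantissa prefix `101` matches the fractional part of $1.101_2$, and $1025 - 1023 = 2$ recovers the true exponent.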
⚠️ CRITICAL — Not all real numbers are exactly representable. Floating-point uses a finite set of rational numbers with denominator $2^k$. Numbers like $0.1$, $0.2$, and $0.3$ have infinite binary expansions (like $1/3 = 0.333\ldots$ in decimal), so they're rounded to the nearest representable float.
Machine Epsilon
Machine epsilon $\epsilon_{\text{mach}}$ is the gap between 1 and the next representable float. For float64:
$$\epsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$$
This means float64 has approximately 15–16 decimal digits of precision.
Practical implication: If you compute something and the answer is 1.0, the error could be as large as $\sim 10^{-16}$. Usually fine, but it accumulates over many operations.
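Machine epsilon can also be found experimentally. The sketch below assumes standard IEEE 754 double arithmetic with round-to-nearest (the usual case for Python floats); `machine_epsilon` is an illustrative name.

```python
import sys

def machine_epsilon():
    """Halve a candidate gap until adding half of it to 1.0 no longer changes 1.0."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

eps = machine_epsilon()
print(eps == 2.0 ** -52)               # True
print(eps == sys.float_info.epsilon)   # True
print(1.0 + eps > 1.0)                 # True: 1 + eps is the next float after 1.0
print(1.0 + eps / 2.0 == 1.0)          # True: half the gap rounds back to 1.0
```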
Why $0.1 + 0.2 \neq 0.3$
>>> 0.1 + 0.2
0.30000000000000004
$0.1$ in binary: $0.00011001100110011\ldots_2$ (repeating) $0.2$ in binary: $0.00110011001100110\ldots_2$ Both are rounded to 53 significant bits. Their sum differs from the rounded representation of $0.3$ by $\sim 5.5 \times 10^{-17}$.
Takeaway: Never compare floats with `==`. Use `abs(a - b) < 1e-12` for values of order 1, or a relative tolerance when the magnitudes vary.
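To see the rounding directly, `Decimal(float)` prints the exact value a float stores. The sketch below also shows two ways to compare with a tolerance; the tolerance values are illustrative choices, not universal constants.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

# Decimal(float) converts the stored binary value exactly, with no extra rounding.
print(Decimal(0.1))
print(Decimal(total))
print(Decimal(0.3))

print(total == 0.3)                             # False
print(abs(total - 0.3) < 1e-12)                 # True (absolute tolerance)
print(math.isclose(total, 0.3, rel_tol=1e-9))   # True (relative tolerance)
```

An absolute tolerance only makes sense when you know the scale of the values; `math.isclose` scales the tolerance with the operands.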
Catastrophic Cancellation
When subtracting two nearly equal numbers, most of the significant digits cancel, leaving only rounding error.
Example: Compute $\sqrt{x^2 + 1} - x$ for large $x$ (e.g., $x = 10^8$).
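Direct computation: $\sqrt{10^{16} + 1} \approx 10^8$ and $10^8$ are two nearly equal numbers whose true difference is $\sim 5 \times 10^{-9}$. Near $10^8$, adjacent float64 values are $2^{-26} \approx 1.5 \times 10^{-8}$ apart (float64 has ~16 significant digits and 9 of them go to the integer part), so the true difference is smaller than the spacing. In fact, $10^{16} + 1$ exceeds $2^{53}$ and rounds to $10^{16}$, so the computed result is exactly 0. Result: total loss of accuracy.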
Stable alternative: Rationalize the numerator: $$\sqrt{x^2 + 1} - x = \frac{(\sqrt{x^2+1} - x)(\sqrt{x^2+1} + x)}{\sqrt{x^2+1} + x} = \frac{1}{\sqrt{x^2+1} + x}$$
This avoids cancellation entirely. For $x = 10^8$, this gives $\approx 5 \times 10^{-9}$ accurately.
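A short sketch comparing the two forms at $x = 10^8$ (the function names `naive` and `stable` are illustrative):

```python
import math

def naive(x):
    # Subtracts two nearly equal values when x is large.
    return math.sqrt(x * x + 1.0) - x

def stable(x):
    # Rationalized form: only a sum of positive terms, so nothing cancels.
    return 1.0 / (math.sqrt(x * x + 1.0) + x)

x = 1e8
print(naive(x))    # 0.0, because x*x + 1 rounds to 1e16
print(stable(x))   # 5e-09
```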
The Log-Sum-Exp Trick
The softmax function $\operatorname{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$ can overflow when $x_i$ is large ($e^{709.78} \approx 1.8 \times 10^{308}$ is the float64 maximum).
Stable computation: $$\operatorname{softmax}(x_i) = \frac{e^{x_i - \max_k x_k}}{\sum_j e^{x_j - \max_k x_k}}$$
Subtracting the maximum ensures all exponents are $\leq 0$, so $e^{x_i - \max x_k} \in (0, 1]$. No overflow.
Similarly, $\log\sum_i e^{x_i} = \max_k x_k + \log\sum_i e^{x_i - \max_k x_k}$.
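The shifted softmax can be sketched in pure Python as follows. This is a minimal illustration for a non-empty list of finite floats; in practice you would normally call a library implementation.

```python
import math

def softmax(xs):
    """Softmax with the maximum subtracted before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # every term is at most 1.0
    total = sum(exps)                      # at least 1.0, since exp(0) = 1 is included
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])
print(probs)
print(sum(probs))
# Without the shift this input fails: math.exp(1000.0) raises OverflowError.
```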
Key Terms
- Machine epsilon
Worked Examples
Example 1: Catastrophic Cancellation in Variance
Compute the variance of $[10000, 10001, 10002]$ using the naive one-pass formula $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$.
Solution (naive): $\mathbb{E}[X] = (10000+10001+10002)/3 = 10001$. $\mathbb{E}[X^2] = (10^8 + 100020001 + 100040004)/3 = 300060005/3 = 100020001.666...$
$\operatorname{Var} = 100020001.666... - 10001^2 = 100020001.666... - 100020001 = 0.666...$
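In float64, $100020001.666...$ and $100020001$ differ by $\sim 2/3$, but both are ~9-digit numbers, so the subtraction cancels about 9 of the ~16 available digits and the result is accurate to only ~7 digits. Not catastrophic here, but the relative error grows roughly like $\epsilon_{\text{mach}} / \mathrm{CV}^2$, where CV is the coefficient of variation. Once CV approaches $\sqrt{\epsilon_{\text{mach}}} \approx 10^{-8}$, the formula can produce zero or even negative variances!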
Stable alternative: Welford's online algorithm or the two-pass algorithm (compute mean first, then sum squared deviations).
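The three approaches can be compared side by side. This is a sketch of the population variance (dividing by $n$, as in the example above); the function names are illustrative.

```python
def variance_naive(xs):
    """Naive formula E[X^2] - (E[X])^2."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean

def variance_two_pass(xs):
    """Compute the mean first, then average the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def variance_welford(xs):
    """Welford's online update: a single pass that never subtracts large squares."""
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for k, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)
    return m2 / len(xs)

data = [1e8, 1e8 + 1.0, 1e8 + 2.0]   # true variance is 2/3
for f in (variance_naive, variance_two_pass, variance_welford):
    print(f.__name__, f(data))
```

On this data the two stable versions agree with $2/3$, while the naive formula is visibly wrong.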
Answer:
The naive one-pass formula gives $\operatorname{Var} \approx 0.6666667$ (correct but mildly inaccurate for this small example). For data like $[10^8, 10^8+1, 10^8+2]$, the naive formula can give zero or negative variance due to catastrophic cancellation.
Example 2: Log-Sum-Exp
Compute $\log(e^{1000} + e^{1001} + e^{1002})$ stably.
Solution: Without trick: $e^{1000}$, $e^{1001}$, $e^{1002}$ all overflow float64 ($e^{709.78}$ is the max). With trick: $m = \max(1000, 1001, 1002) = 1002$. $\log(e^{1000} + e^{1001} + e^{1002}) = 1002 + \log(e^{-2} + e^{-1} + e^0) = 1002 + \log(0.1353 + 0.3679 + 1) = 1002 + \log(1.5032) = 1002 + 0.4076 = 1002.4076$
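The same calculation as a small function, assuming a non-empty list of finite floats (`log_sum_exp` is an illustrative name):

```python
import math

def log_sum_exp(xs):
    """log(sum(exp(x))) computed with the maximum factored out."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

print(round(log_sum_exp([1000.0, 1001.0, 1002.0]), 4))   # 1002.4076
```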
Answer:
$\approx 1002.4076$. Without the log-sum-exp trick, the computation overflows.
Example 3: Forward vs Backward Stability
Compute the roots of $x^2 - 10^6x + 1 = 0$ using the standard quadratic formula. Identify the unstable root.
Solution: Standard formula: $x = \frac{10^6 \pm \sqrt{10^{12} - 4}}{2}$ $\sqrt{10^{12} - 4} = \sqrt{999999999996} \approx 999999.999998$
$x_1 = \frac{10^6 + 999999.999998}{2} \approx 999999.999999$ — accurate (no cancellation). $x_2 = \frac{10^6 - 999999.999998}{2} \approx \frac{0.000002}{2} = 0.000001$ — catastrophic cancellation! We subtracted two nearly identical large numbers.
Stable alternative for the smaller root: Use $x_2 = (c/a) / x_1$ (Vieta's formula: product of roots = $c/a = 1$). $x_2 = 1 / 999999.999999 \approx 1.000000000001 \times 10^{-6}$ — much more accurate.
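A sketch of a solver that applies this idea for either sign of $b$. It assumes real roots, $a \neq 0$, and that $b$ and the discriminant are not both zero; the function names are illustrative.

```python
import math

def quadratic_roots(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0 without subtracting nearly equal terms."""
    disc = math.sqrt(b * b - 4.0 * a * c)
    # b and copysign(disc, b) share a sign, so this sum cannot cancel.
    q = -0.5 * (b + math.copysign(disc, b))
    x_big = q / a
    x_small = c / q      # Vieta: x_big * x_small = c / a
    return x_big, x_small

def small_root_naive(a, b, c):
    """Standard formula for the smaller root when b < 0."""
    return (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

x1, x2 = quadratic_roots(1.0, -1e6, 1.0)
print(x1, x2)
print(small_root_naive(1.0, -1e6, 1.0))
```

The stable small root agrees with $1.000000000001 \times 10^{-6}$ to nearly full precision, while the standard formula is wrong from about the sixth significant digit.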
Answer:
The smaller root ($x_2$) suffers from catastrophic cancellation in the standard formula. Vieta's formula $x_2 = c/(a x_1) = 1/x_1$ recovers it accurately.
Quiz
Q1: What does machine epsilon refer to for float64?
A) The smallest positive normalized float64, about $2.2 \times 10^{-308}$ B) The smallest positive subnormal float64, about $4.9 \times 10^{-324}$ C) The largest finite float64, about $1.8 \times 10^{308}$ D) The gap between 1 and the next representable float64, $2^{-52} \approx 2.22 \times 10^{-16}$
Correct: D)
- If you chose A: This is incorrect. $2^{-1022} \approx 2.2 \times 10^{-308}$ is the smallest positive normalized value. Machine epsilon measures the spacing at 1.0, not the smallest magnitude.
- If you chose B: This is incorrect. $2^{-1074} \approx 4.9 \times 10^{-324}$ is the smallest positive subnormal. Machine epsilon measures the spacing at 1.0.
- If you chose C: This is incorrect. That value is the overflow threshold. Machine epsilon describes precision, not range.
- If you chose D: Machine epsilon is the gap between 1 and the next representable float, which gives float64 about 15–16 decimal digits of precision. Correct!
Q2: Which statement about $0.1 + 0.2$ in float64 does this subject establish?
A) $0.1 + 0.2 = 0.3$ exactly B) $0.1 + 0.2 < 0.3$ C) $0.1 + 0.2$ overflows D) $0.1 + 0.2 \neq 0.3$
Correct: D)
- If you chose A: This is incorrect. The sum evaluates to 0.30000000000000004, which is not the float nearest to 0.3.
- If you chose B: This is incorrect. The computed sum is slightly greater than the float nearest to 0.3, by $\sim 5.5 \times 10^{-17}$.
- If you chose C: This is incorrect. Overflow only occurs near the float64 maximum of about $1.8 \times 10^{308}$.
- If you chose D: The operands are rounded to 53 significant bits, and their sum differs from the rounded representation of $0.3$ by $\sim 5.5 \times 10^{-17}$. Correct!
Q3: What is the primary purpose of IEEE 754 double precision (float64)?
A) It represents real numbers approximately in 64 bits: 1 sign bit, 11 exponent bits, and 52 mantissa bits B) It stores every decimal fraction exactly C) It represents integers only D) It provides unlimited precision for any computation
Correct: A)
- If you chose A: float64 stores a sign, a biased exponent, and a 52-bit mantissa with an implicit leading 1, giving about 15–16 decimal digits. Correct!
- If you chose B: This is incorrect. Decimals such as $0.1$ have infinite binary expansions and are rounded to the nearest representable float.
- If you chose C: This is incorrect. The exponent field lets float64 represent fractional values across a wide range of magnitudes.
- If you chose D: This is incorrect. The mantissa has 52 stored bits, so precision is limited to about 15–16 decimal digits.
Q4: Which statement about why $0.1 + 0.2 \neq 0.3$ is TRUE?
A) It is caused by a bug in Python's addition B) $0.1$, $0.2$, and $0.3$ have infinite binary expansions, so each is rounded to the nearest representable float C) It happens in float32 but not in float64 D) It is caused by overflow of the exponent
Correct: B)
- If you chose A: This is incorrect. The behavior follows from the binary representation itself, not from any particular language.
- If you chose B: None of these decimals has a finite binary expansion, so each is rounded to 53 significant bits and the rounded sum does not match the rounded $0.3$. Correct!
- If you chose C: This is incorrect. The example in this subject is computed in float64.
- If you chose D: This is incorrect. All three values are far from the overflow threshold. The cause is rounding of the mantissa.
Q5: What is the binary expansion of $0.1$?
A) $0.1_2$ (terminating) B) $0.0001_2$ (terminating) C) $0.00011001100110011\ldots_2$ (repeating) D) $0.00110011001100110\ldots_2$ (repeating)
Correct: C)
- If you chose A: This is incorrect. $0.1_2$ equals $1/2$, not $1/10$.
- If you chose B: This is incorrect. $0.0001_2$ equals $1/16 = 0.0625$.
- If you chose C: The expansion of $0.1$ is $0.00011001100110011\ldots_2$ and never terminates, which is why $0.1$ must be rounded. Correct!
- If you chose D: This is incorrect. That is the binary expansion of $0.2$.
Q6: How are the $0.1 + 0.2 \neq 0.3$ effect and catastrophic cancellation related?
A) They are completely unrelated B) $0.1 + 0.2 \neq 0.3$ is a special case of catastrophic cancellation C) Catastrophic cancellation is the inverse of $0.1 + 0.2 \neq 0.3$ D) Both stem from finite precision: rounding introduces small errors, and subtracting nearly equal numbers amplifies them
Correct: D)
- If you chose A: This is incorrect. Both are consequences of representing real numbers with a finite number of bits.
- If you chose B: This is incorrect. $0.1 + 0.2$ involves no subtraction of nearly equal values. It shows representation rounding only.
- If you chose C: This is incorrect. Neither effect undoes the other. Cancellation exposes rounding errors that are already present.
- If you chose D: Rounding to 53 bits creates errors near $10^{-16}$ in relative terms, and cancellation leaves those errors as the dominant part of the result. Correct!
Q7: What is a common pitfall when computing softmax or log-sum-exp?
A) Subtracting $\max_k x_k$ before exponentiating B) Adding $\max_k x_k$ back after taking the log C) Obtaining shifted exponentials that are all at most 1 D) Exponentiating the raw inputs directly, which overflows for inputs above about 709 in float64
Correct: D)
- If you chose A: This is incorrect. Subtracting the maximum is the fix, not the pitfall. It leaves the softmax value unchanged.
- If you chose B: This is incorrect. Adding the maximum back is part of the identity $\log\sum_i e^{x_i} = \max_k x_k + \log\sum_i e^{x_i - \max_k x_k}$.
- If you chose C: This is incorrect. Shifted exponentials at most 1 are exactly what makes the computation safe.
- If you chose D: $e^{x}$ exceeds the float64 maximum once $x$ passes about 709.78, so the raw exponentials overflow. Correct!
Q8: When does the formula $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$ from Example 1 become unreliable?
A) Never, because it is exact in float64 B) When the spread of the data is tiny relative to its mean, so the two terms are nearly equal C) Only when the data contain negative values D) Only when the dataset has exactly three points
Correct: B)
- If you chose A: This is incorrect. Even for $[10000, 10001, 10002]$ the formula is accurate to only ~7 digits.
- If you chose B: With a small coefficient of variation, $\mathbb{E}[X^2]$ and $(\mathbb{E}[X])^2$ agree in most digits and the subtraction cancels them. Correct!
- If you chose C: This is incorrect. The problem is cancellation between two large, nearly equal terms, regardless of sign.
- If you chose D: This is incorrect. The number of points is not the issue. The ratio of spread to mean is.
Practice Problems
- What's the next representable float64 after 1.0? What about before 1.0?
Answer:
After 1.0: $1 + 2^{-52} \approx 1 + 2.22 \times 10^{-16}$. Before 1.0: $1 - 2^{-53} \approx 1 - 1.11 \times 10^{-16}$. The gap is asymmetric because the exponent changes at powers of 2: the gap doubles at $2, 4, 8, \ldots$ and halves at $1/2, 1/4, \ldots$.
- Explain why $\tanh(x)$ is numerically stable for all $x$ but computing it as $(e^x - e^{-x})/(e^x + e^{-x})$ can overflow.
Answer:
For large positive $x$, $e^x$ overflows to $\infty$ while $e^{-x} \to 0$. The true ratio tends to 1, but the intermediate computation becomes $\infty/\infty$. Stable alternatives: (1) Use the library `tanh(x)`, which handles this internally. (2) For $x \geq 0$: $\tanh(x) = (1 - e^{-2x})/(1 + e^{-2x})$, where $e^{-2x} \leq 1$, so nothing overflows. (3) For negative $x$: use $\tanh(x) = -\tanh(-x)$.
- Compute the smallest positive normalized float64 and the smallest positive subnormal.
Answer:
Smallest normalized: biased exponent $e = 1$ (true exponent $1 - 1023 = -1022$), mantissa $= 0$. Value: $1.0 \times 2^{-1022} \approx 2.225 \times 10^{-308}$. Smallest subnormal: biased exponent $e = 0$, mantissa $= 000\ldots001$ (LSB only). Value: $0.00\ldots01_2 \times 2^{-1022} = 2^{-52} \times 2^{-1022} = 2^{-1074} \approx 4.94 \times 10^{-324}$. Subnormals allow gradual underflow. Without them, any value below $2^{-1022}$ would round to zero.
- The sigmoid function $\sigma(x) = 1/(1+e^{-x})$ can be numerically unstable. Write a stable version.
Answer:
For $x \geq 0$: $\sigma(x) = 1/(1+e^{-x})$ is safe, since $e^{-x} \in (0,1]$ cannot overflow. For $x < 0$: $\sigma(x) = e^x/(1+e^x)$, which is equivalent after multiplying numerator and denominator by $e^x$, and $e^x \in (0,1)$ is safe. Or equivalently: return `0.5 + 0.5 * tanh(x/2)`.
- Demonstrate catastrophic cancellation in computing $\frac{1-\cos x}{x^2}$ for $x = 10^{-8}$. Provide a stable alternative.
Answer:
For small $x$: $\cos x \approx 1 - x^2/2 + x^4/24 - \ldots$ (Taylor series). $1 - \cos x \approx x^2/2$ (for tiny $x$). So $(1-\cos x)/x^2 \approx 1/2$. Naive: For $x=10^{-8}$, the true value $\cos(10^{-8}) \approx 1 - 5 \times 10^{-17}$ lies closer to 1 than to the next float64 below it ($1 - 1.11 \times 10^{-16}$), so it rounds to 1.0. Then $1 - \cos x$ evaluates to 0 and the naive result is 0 instead of $1/2$: all significant digits cancel. Stable: Use $\frac{1-\cos x}{x^2} = \frac{2\sin^2(x/2)}{x^2} = \frac{1}{2}\left(\frac{\sin(x/2)}{x/2}\right)^2$. For small $x$, $\sin(x/2)/(x/2) \approx 1$, giving $1/2$ accurately.
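The stable forms from the sigmoid and $(1-\cos x)/x^2$ problems can be sketched as follows. The function names are illustrative, and `one_minus_cos_over_x2` assumes $x \neq 0$.

```python
import math

def sigmoid(x):
    """Logistic function, branching so that exp() only sees non-positive arguments."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so z is below 1 and cannot overflow
    return z / (1.0 + z)

def one_minus_cos_over_x2(x):
    """(1 - cos x) / x^2 via the half-angle identity, avoiding the subtraction."""
    h = 0.5 * x
    s = math.sin(h) / h
    return 0.5 * s * s

print(sigmoid(-1000.0), sigmoid(1000.0))   # 0.0 1.0
x = 1e-8
print((1.0 - math.cos(x)) / (x * x))       # naive form, loses all accuracy
print(one_minus_cos_over_x2(x))            # close to 0.5
```

Calling `1.0 / (1.0 + math.exp(-x))` directly with `x = -1000.0` would raise `OverflowError`, which is what the branch avoids.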
Summary
Key takeaways:
- IEEE 754 float64: 1 sign bit, 11 exponent bits, 52 mantissa bits → ~15-16 decimal digits
- Machine epsilon $\epsilon_{\text{mach}} = 2^{-52} \approx 2.2 \times 10^{-16}$ — the resolution at 1.0
- Floating-point has finite precision; not all decimals are exactly representable
- Catastrophic cancellation: subtracting nearly equal numbers destroys accuracy
- Log-sum-exp trick prevents overflow in softmax and related computations
- Always use numerically stable formulations: avoid $a-b$ when $a \approx b$, use rationalization or algebraically equivalent stable forms
Pitfalls
- Comparing floats with `==`. Due to rounding errors in binary representation, `0.1 + 0.2 == 0.3` evaluates to `False`. Always use `abs(a - b) < tol`, with a tolerance around $10^{-12}$ for float64 and $10^{-6}$ for float32 when the values are of order 1.
- Computing variance with the naive one-pass formula. $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$ subtracts two large nearly-equal numbers when the coefficient of variation is small. For data like $[10^8, 10^8+1, 10^8+2]$, the naive formula can produce zero or negative variance. Use Welford's online algorithm or the stable two-pass method.
- Computing softmax by directly exponentiating. $\operatorname{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$ overflows for $x_i > 709.78$ in float64. Always subtract $\max_k x_k$ from all inputs before exponentiating: the shifted exponents are all $\leq 0$, making overflow impossible.
- Subtracting nearly equal numbers without rationalization. $\sqrt{x^2 + 1} - x$ for large $x$ loses all significant digits. Rationalize: $1 / (\sqrt{x^2 + 1} + x)$ is mathematically equivalent but numerically stable. Always look for algebraically equivalent forms that avoid subtraction of close values.
- Assuming higher precision automatically solves all numerical problems. Switching from float32 to float64 gains ~9 more digits but doesn't fix algorithms that have inherent cancellation. A stable formulation in float32 often beats an unstable one in float64.
Next Steps
Next up: 15-02-condition-stability.md — condition numbers, numerical stability of algorithms, and why some problems are inherently hard regardless of precision.