15-01 — Floating-Point Arithmetic

Phase: Numerical Methods for ML | Subject: 15-01 | Prerequisites: 00-03-decimals.md, basic binary representation | Next subject: 15-02-condition-stability.md

Learning Objectives

By the end of this subject, you will be able to:

Explain the IEEE 754 floating-point representation (sign, exponent, mantissa)
Compute machine epsilon and explain its significance
Diagnose and avoid catastrophic cancellation
Explain why $0.1 + 0.2 \neq 0.3$ in floating point
Apply numerically stable alternatives to common formulas (e.g., softmax, log-sum-exp, variance)

Core Content

IEEE 754 Double Precision (float64)

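The default floating-point format in Python and NumPy, and the reference precision for most numerical work (ML frameworks often compute in float32 or lower, which use the same sign/exponent/mantissa layout with fewer bits):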

$$\text{value} = (-1)^s \times (1.m) \times 2^{e - 1023}$$

| Component | Bits | Meaning |
|---|---|---|
| Sign $s$ | 1 bit | 0 = positive, 1 = negative |
| Exponent $e$ | 11 bits | Biased by 1023; range $[1, 2046]$ (0 and 2047 are special) |
| Mantissa $m$ | 52 bits | Fractional part; leading 1 is implicit (normalized numbers) |

Example: Represent 6.5 in float64.

$6.5 = 110.1_2 = 1.101_2 \times 2^2$

Sign: $s = 0$. Exponent: $e = 2 + 1023 = 1025 = 10000000001_2$. Mantissa: $101000\ldots0_2$ (52 bits).
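
The three fields of the 6.5 example can be checked by reinterpreting the eight bytes of a double as an integer. This is a minimal sketch using only the standard library; `float64_fields` is an illustrative helper name, not something defined elsewhere in this course.

```python
import struct


def float64_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, 52-bit mantissa field) of a float64."""
    # Pack as a big-endian double, then read the same 8 bytes as a 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # top bit
    exponent = (bits >> 52) & 0x7FF      # next 11 bits, still biased by 1023
    mantissa = bits & ((1 << 52) - 1)    # low 52 bits, implicit leading 1 not stored
    return sign, exponent, mantissa


sign, exponent, mantissa = float64_fields(6.5)
print(sign, exponent, format(mantissa, "052b"))
print("unbiased exponent:", exponent - 1023)
```

The mantissa field is printed as a 52-character bit string so it can be compared directly with $101000\ldots0_2$.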

⚠️ CRITICAL — Not all real numbers are exactly representable. Floating-point uses a finite set of rational numbers with denominator $2^k$. Numbers like $0.1$, $0.2$, and $0.3$ have infinite binary expansions (like $1/3 = 0.333\ldots$ in decimal), so they're rounded to the nearest representable float.

Machine Epsilon

Machine epsilon $\epsilon_{\text{mach}}$ is the gap between 1 and the next representable float. For float64:

$$\epsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$$

This means float64 has approximately 15–16 decimal digits of precision.

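Practical implication: each correctly rounded operation carries a relative error of up to $\epsilon_{\text{mach}}/2 \approx 1.1 \times 10^{-16}$, so a computed result near 1.0 can be off by $\sim 10^{-16}$. That is usually fine, but the errors accumulate over many operations.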
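
As a quick check of the definition, the gap above 1.0 can be found by repeated halving and compared with the value Python reports. This is a sketch under the assumption of standard IEEE 754 doubles; `halving_epsilon` is a hypothetical helper name.

```python
import sys


def halving_epsilon() -> float:
    """Find the gap between 1.0 and the next float64 by repeated halving."""
    candidate = 1.0
    # Stop once adding half the candidate no longer changes 1.0.
    while 1.0 + candidate / 2 > 1.0:
        candidate /= 2
    return candidate


eps = sys.float_info.epsilon
print(eps == 2.0**-52)            # the value quoted above
print(halving_epsilon() == eps)   # the loop finds the same gap
print(1.0 + eps / 2 == 1.0)       # half the gap is rounded away
```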

Why $0.1 + 0.2 \neq 0.3$

>>> 0.1 + 0.2
0.30000000000000004

$0.1$ in binary: $0.00011001100110011\ldots_2$ (repeating)

$0.2$ in binary: $0.00110011001100110\ldots_2$ (repeating)

Both are rounded to 53 significant bits. Their sum differs from the rounded representation of $0.3$ by $\sim 5.5 \times 10^{-17}$.

Takeaway: Avoid comparing computed floats with `==`. Use `abs(a - b) < 1e-12` for values of order 1, or a relative test such as `math.isclose(a, b)` when magnitudes vary.
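
The rounding can be made visible with exact rational arithmetic. The sketch below uses `fractions.Fraction`, which recovers the exact value stored in a float, and `math.isclose` for a tolerance-based comparison; the variable names are illustrative.

```python
import math
from fractions import Fraction

# Fraction(float) gives the exact rational number the float actually stores.
stored_tenth = Fraction(0.1)
print(stored_tenth == Fraction(1, 10))   # False: 0.1 is not stored exactly

# Exact gap between the computed sum and the float nearest to 0.3.
gap = Fraction(0.1 + 0.2) - Fraction(0.3)
print(float(gap))                        # about 5.5e-17

print(0.1 + 0.2 == 0.3)                  # exact comparison fails
print(math.isclose(0.1 + 0.2, 0.3))      # relative tolerance (default 1e-9) succeeds
```

Here the gap is exactly $2^{-54}$, one unit in the last place for floats between 0.25 and 0.5.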

Catastrophic Cancellation

When subtracting two nearly equal numbers, most of the significant digits cancel, leaving only rounding error.

Example: Compute $\sqrt{x^2 + 1} - x$ for large $x$ (e.g., $x = 10^8$).

Direct computation: $x^2 + 1 = 10^{16} + 1$ is not representable in float64 (floats near $10^{16}$ are spaced 2 apart), so it rounds to $10^{16}$ and the square root returns exactly $10^8$. Even with an exact square root, floats near $10^8$ are spaced about $1.5 \times 10^{-8}$ apart, which is coarser than the true answer $\approx 5 \times 10^{-9}$. The subtraction therefore returns 0. Result: total loss of accuracy.

Stable alternative: Rationalize the numerator: $$\sqrt{x^2 + 1} - x = \frac{(\sqrt{x^2+1} - x)(\sqrt{x^2+1} + x)}{\sqrt{x^2+1} + x} = \frac{1}{\sqrt{x^2+1} + x}$$

This avoids cancellation entirely. For $x = 10^8$, this gives $\approx 5 \times 10^{-9}$ accurately.
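
A small sketch comparing the two forms at $x = 10^8$; the function names `sqrt_diff_naive` and `sqrt_diff_stable` are illustrative.

```python
import math


def sqrt_diff_naive(x: float) -> float:
    """sqrt(x^2 + 1) - x computed directly; cancels for large x."""
    return math.sqrt(x * x + 1.0) - x


def sqrt_diff_stable(x: float) -> float:
    """Same quantity after rationalizing: 1 / (sqrt(x^2 + 1) + x)."""
    return 1.0 / (math.sqrt(x * x + 1.0) + x)


x = 1e8
print(sqrt_diff_naive(x))    # all digits lost
print(sqrt_diff_stable(x))   # close to 5e-9
```

For moderate inputs such as $x = 1$ the two forms agree to within rounding, so the stable form costs nothing in accuracy.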

The Log-Sum-Exp Trick

The softmax function $\operatorname{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$ can overflow when $x_i$ is large ($e^{709.78} \approx 1.8 \times 10^{308}$ is the float64 maximum, so any $x_i$ above about 709 overflows).

Stable computation: $$\operatorname{softmax}(x_i) = \frac{e^{x_i - \max_k x_k}}{\sum_j e^{x_j - \max_k x_k}}$$

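Subtracting the maximum ensures all exponents are $\leq 0$, so $e^{x_i - \max_k x_k} \in (0, 1]$. No overflow. Very negative exponents may underflow to 0, which is harmless because the largest term is exactly 1.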

Similarly, $\log\sum_i e^{x_i} = \max_k x_k + \log\sum_i e^{x_i - \max_k x_k}$.
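
Both shifted formulas can be sketched in plain Python as follows. The function names are illustrative, and the sketch assumes a non-empty list of finite floats; production code would normally rely on a library routine instead.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Softmax with the maximum subtracted before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # every exponent is <= 0
    total = sum(exps)                      # >= 1, because the max term is exp(0) = 1
    return [e / total for e in exps]


def logsumexp(xs: list[float]) -> float:
    """log(sum(exp(x))) computed as max + log(sum(exp(x - max)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


scores = [1000.0, 1001.0, 1002.0]
print(softmax(scores))
print(logsumexp(scores))

# The direct formula fails on the same input: math.exp raises on overflow.
try:
    math.exp(1000.0)
except OverflowError as err:
    print("direct exp failed:", err)
```

Shifting does not change the softmax output: `softmax([1000, 1001, 1002])` matches `softmax([0, 1, 2])` up to rounding.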

Key Terms

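Machine epsilon: the gap between 1 and the next representable float; $2^{-52} \approx 2.22 \times 10^{-16}$ for float64.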

Worked Examples

Example 1: Catastrophic Cancellation in Variance

Compute the variance of $[10000, 10001, 10002]$ using the naive one-pass formula $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

Solution (naive): $\mathbb{E}[X] = (10000+10001+10002)/3 = 10001$. $\mathbb{E}[X^2] = (10^8 + 100020001 + 100040004)/3 = 300060005/3 = 100020001.666...$

$\operatorname{Var} = 100020001.666... - 10001^2 = 100020001.666... - 100020001 = 0.666...$

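In float64, $100020001.666...$ and $100020001$ differ by $\sim 2/3$, but we're subtracting two ~9-digit numbers. The subtraction loses about 8 of the ~16 available digits, so the result is accurate to roughly 7 digits. Not catastrophic here, but the relative error grows like $\epsilon_{\text{mach}}$ divided by the squared coefficient of variation. Once the coefficient of variation approaches $\sqrt{\epsilon_{\text{mach}}} \approx 10^{-8}$, the formula can produce zero or even negative variances!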

Stable alternative: Welford's online algorithm or the two-pass algorithm (compute mean first, then sum squared deviations).

Answer:

The naive one-pass formula gives $\operatorname{Var} \approx 0.6666667$ (correct to about 7 digits for this small example). For data like $[10^8, 10^8+1, 10^8+2]$, it can give a badly wrong, zero, or even negative variance due to catastrophic cancellation.
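
The three approaches mentioned in this example can be compared side by side. This is a minimal sketch of the population variance (dividing by $n$, as in the example); the function names are illustrative and the inputs are assumed to be non-empty.

```python
def variance_naive(xs: list[float]) -> float:
    """E[X^2] - (E[X])^2: subtracts two large, nearly equal numbers."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean


def variance_two_pass(xs: list[float]) -> float:
    """Compute the mean first, then average the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_welford(xs: list[float]) -> float:
    """Welford's online update: one pass, no large intermediate squares."""
    mean = 0.0
    m2 = 0.0   # running sum of squared deviations from the current mean
    count = 0
    for count, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / count


small = [10000.0, 10001.0, 10002.0]
large = [1e8, 1e8 + 1, 1e8 + 2]
for data in (small, large):
    print(variance_naive(data), variance_two_pass(data), variance_welford(data))
```

All three agree closely on the small data set; on the shifted data only the two-pass and Welford results stay near $2/3$.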

Example 2: Log-Sum-Exp

Compute $\log(e^{1000} + e^{1001} + e^{1002})$ stably.

Solution: Without the trick: $e^{1000}$, $e^{1001}$, and $e^{1002}$ all overflow float64 ($e^{709.78}$ is the max).

With the trick: $m = \max(1000, 1001, 1002) = 1002$.

$$\log(e^{1000} + e^{1001} + e^{1002}) = 1002 + \log(e^{-2} + e^{-1} + e^0) \approx 1002 + \log(0.1353 + 0.3679 + 1) = 1002 + \log(1.5032) \approx 1002 + 0.4076 = 1002.4076$$

Answer:

$\approx 1002.4076$. Without the log-sum-exp trick, the computation overflows.

Example 3: Forward vs Backward Stability

Compute the roots of $x^2 - 10^6x + 1 = 0$ using the standard quadratic formula. Identify the unstable root.

Solution: Standard formula: $x = \frac{10^6 \pm \sqrt{10^{12} - 4}}{2}$, with $\sqrt{10^{12} - 4} = \sqrt{999999999996} \approx 999999.999998$.

$x_1 = \frac{10^6 + 999999.999998}{2} \approx 999999.999999$: accurate (no cancellation).

$x_2 = \frac{10^6 - 999999.999998}{2} \approx \frac{0.000002}{2} = 0.000001$: catastrophic cancellation! We subtracted two nearly identical large numbers, so the rounding error of the square root dominates the result.

Stable alternative for the smaller root: Use $x_2 = (c/a) / x_1$ (Vieta's formula: product of roots = $c/a = 1$). $x_2 = 1 / 999999.999999 \approx 1.000000000001 \times 10^{-6}$ — much more accurate.

Answer:

The smaller root ($x_2$) suffers from catastrophic cancellation in the standard formula. Vieta's formula $x_2 = c/(a x_1) = 1/x_1$ recovers it accurately.
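
The comparison can be reproduced with a short sketch. It assumes real roots and $a \neq 0$; the function names are illustrative. The stable version computes the larger-magnitude root without cancellation and then applies Vieta's formula, as in the solution above.

```python
import math


def roots_naive(a: float, b: float, c: float) -> tuple[float, float]:
    """Standard quadratic formula for both roots."""
    d = math.sqrt(b * b - 4.0 * a * c)
    return (-b + d) / (2.0 * a), (-b - d) / (2.0 * a)


def roots_stable(a: float, b: float, c: float) -> tuple[float, float]:
    """Cancellation-free root first, then Vieta: x1 * x2 = c / a."""
    d = math.sqrt(b * b - 4.0 * a * c)
    # b and copysign(d, b) share a sign, so this sum never cancels.
    q = -0.5 * (b + math.copysign(d, b))
    return q / a, c / q


big_naive, small_naive = roots_naive(1.0, -1e6, 1.0)
big_stable, small_stable = roots_stable(1.0, -1e6, 1.0)
print(small_naive)    # only the first few digits are right
print(small_stable)   # close to 1.000000000001e-06
```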

Quiz

Q1: What does machine epsilon refer to in this subject?

A) The smallest positive float64
B) The largest finite float64
C) The rounding error in $0.1 + 0.2$
D) The gap between 1 and the next representable float

Correct: D)

If you chose A: This is incorrect. The smallest positive normalized float64 is $2^{-1022} \approx 2.225 \times 10^{-308}$, far smaller than machine epsilon.
If you chose B: This is incorrect. The largest finite float64 is about $1.8 \times 10^{308}$; machine epsilon measures spacing, not range.
If you chose C: This is incorrect. That error is about $5.5 \times 10^{-17}$ and is specific to those operands; machine epsilon is a property of the format.
If you chose D: Correct! For float64, $\epsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$.

Q2: Which of the following statements holds in float64 arithmetic?

A) $0.1 + 0.2 = 0.3$ exactly
B) $0.1$ is stored exactly
C) $0.1 + 0.2$ evaluates to a value smaller than $0.3$
D) $0.1 + 0.2 \neq 0.3$

Correct: D)

If you chose A: This is incorrect. The computed sum is 0.30000000000000004, which is not the float nearest to $0.3$.
If you chose B: This is incorrect. $0.1$ has an infinite binary expansion and is rounded to 53 significant bits.
If you chose C: This is incorrect. The computed sum 0.30000000000000004 is slightly larger than the stored value of $0.3$.
If you chose D: Correct! The sum differs from the rounded representation of $0.3$ by $\sim 5.5 \times 10^{-17}$.

Q3: What is the primary purpose of IEEE 754 double precision (float64)?

A) To approximate real numbers using a sign bit, an 11-bit exponent, and a 52-bit mantissa
B) To represent every real number exactly
C) To store integers only
D) To store numbers as base-10 digits

Correct: A)

If you chose A: Correct! The value is $(-1)^s \times (1.m) \times 2^{e - 1023}$.
If you chose B: This is incorrect. Only a finite set of rationals with denominator $2^k$ is representable; everything else is rounded.
If you chose C: This is incorrect. float64 also stores fractional values such as $6.5$.
If you chose D: This is incorrect. float64 is a binary format, which is why $0.1$ cannot be stored exactly.

Q4: Which statement about why $0.1 + 0.2 \neq 0.3$ is TRUE?

A) It is caused by a bug in the programming language
B) $0.1$ and $0.2$ have infinite binary expansions, so each is rounded before the addition
C) It is caused by overflow
D) It happens only in float32, not in float64

Correct: B)

If you chose A: This is incorrect. The behavior follows directly from the IEEE 754 format.
If you chose B: Correct! Both operands are rounded to 53 significant bits, and the rounded sum is not the float nearest to $0.3$.
If you chose C: This is incorrect. Overflow concerns values near $1.8 \times 10^{308}$; these operands are of order 1.
If you chose D: This is incorrect. The example in this subject is in float64.

Q5: What is the binary expansion of $0.1$?

A) $0.1_2$
B) $0.0001_2$ (terminating)
C) $0.00011001100110011\ldots_2$ (repeating)
D) $0.00110011001100110\ldots_2$ (repeating)

Correct: C)

If you chose A: This is incorrect. $0.1_2$ equals $1/2$ in decimal.
If you chose B: This is incorrect. $0.0001_2$ equals $1/16 = 0.0625$; the expansion of $0.1$ does not terminate.
If you chose C: Correct! The expansion repeats forever, so $0.1$ must be rounded to fit 53 significant bits.
If you chose D: This is incorrect. That is the binary expansion of $0.2$.

Q6: How are $0.1 + 0.2 \neq 0.3$ and catastrophic cancellation related?

A) They are completely unrelated
B) $0.1 + 0.2 \neq 0.3$ is a special case of catastrophic cancellation
C) Both occur only when a computation overflows
D) Both are consequences of rounding to finite precision

Correct: D)

If you chose A: This is incorrect. Both arise because floats store only a finite number of significant bits.
If you chose B: This is incorrect. $0.1 + 0.2$ involves no subtraction of nearly equal numbers; it shows representation error only.
If you chose C: This is incorrect. Neither effect involves overflow.
If you chose D: Correct! Rounding introduces small errors, and subtracting nearly equal numbers makes those errors dominate the result.

Q7: What is a common pitfall when working with the log-sum-exp trick?

A) Applying the shift when the inputs are too small to overflow
B) There are none; the trick cannot be misapplied
C) Subtracting $\max_k x_k$, because the shift changes the mathematical result
D) Confusing it with the stable softmax and forgetting to add $\max_k x_k$ back

Correct: D)

If you chose A: This is incorrect. The shift is harmless when it is not needed; the result is mathematically identical.
If you chose B: This is incorrect. The two shifted formulas in this subject are easy to mix up.
If you chose C: This is incorrect. The shift cancels in softmax and is added back in log-sum-exp, so the result is unchanged.
If you chose D: Correct! Softmax needs no correction after the shift, but $\log\sum_i e^{x_i} = \max_k x_k + \log\sum_i e^{x_i - \max_k x_k}$ does.

Q8: When does the lesson of Example 1 (catastrophic cancellation in variance) apply?

A) Never; the naive formula is always accurate enough
B) When computing the variance of data whose mean is large relative to its spread
C) Only in pure mathematics, where rounding does not occur
D) Only when the data contain negative values

Correct: B)

If you chose A: This is incorrect. For data like $[10^8, 10^8+1, 10^8+2]$ the naive formula can give zero or negative variance.
If you chose B: Correct! In that case $\mathbb{E}[X^2]$ and $(\mathbb{E}[X])^2$ are large and nearly equal, so their difference loses most of its digits.
If you chose C: This is incorrect. The problem is a floating-point effect and matters in practical computation.
If you chose D: This is incorrect. The example uses only positive values.

Practice Problems

What's the next representable float64 after 1.0? What about before 1.0?

Answer:
After 1.0: $1 + 2^{-52} \approx 1 + 2.22 \times 10^{-16}$. Before 1.0: $1 - 2^{-53} \approx 1 - 1.11 \times 10^{-16}$. The gap is asymmetric because the exponent changes at powers of 2: the gap doubles at $2, 4, 8, \ldots$ and halves at $1/2, 1/4, \ldots$.
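
The two neighbors of 1.0 can be inspected directly with `math.nextafter` (available from Python 3.9). This is a small check, with illustrative variable names.

```python
import math

above = math.nextafter(1.0, 2.0)   # next float64 above 1.0
below = math.nextafter(1.0, 0.0)   # next float64 below 1.0

print(above - 1.0 == 2.0**-52)     # gap above 1.0
print(1.0 - below == 2.0**-53)     # gap below 1.0 is half as large
```
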
Explain why $\tanh(x)$ is numerically stable for all $x$ but computing it as $(e^x - e^{-x})/(e^x + e^{-x})$ can overflow.

Answer:
For large positive $x$, $e^x$ overflows while $e^{-x} \to 0$. The ratio tends to 1, but the intermediate computation overflows. Stable alternatives: (1) Use the library `tanh(x)`, which handles this internally. (2) For $x \geq 0$: $\tanh(x) = 1 - 2/(e^{2x}+1) = (1 - e^{-2x})/(1 + e^{-2x})$; the second form uses only $e^{-2x} \in (0, 1]$, which cannot overflow. (3) For negative $x$: use $\tanh(x) = -\tanh(-x)$.
Compute the smallest positive normalized float64 and the smallest positive subnormal.

Answer:
Smallest normalized: exponent field $e = 1$ (actual exponent $1 - 1023 = -1022$), mantissa $= 0$. Value: $1.0 \times 2^{-1022} \approx 2.225 \times 10^{-308}$. Smallest subnormal: exponent field $= 0$, mantissa $= 000\ldots001$ (LSB only). Value: $0.00\ldots01_2 \times 2^{-1022} = 2^{-52} \times 2^{-1022} = 2^{-1074} \approx 4.94 \times 10^{-324}$. Subnormals allow gradual underflow: without them, nonzero values below $2^{-1022}$ in magnitude would be flushed to zero.
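
Both limits can be confirmed from Python, assuming the platform keeps subnormals enabled (the usual default). `math.ldexp(1.0, k)` builds $2^k$ exactly; the variable names are illustrative.

```python
import math
import sys

smallest_normal = math.ldexp(1.0, -1022)      # 2**-1022
smallest_subnormal = math.ldexp(1.0, -1074)   # 2**-1074

print(smallest_normal == sys.float_info.min)  # matches Python's reported minimum
print(smallest_subnormal)                     # about 4.94e-324
print(smallest_subnormal / 2 == 0.0)          # nothing smaller: underflows to zero
```
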
The sigmoid function $\sigma(x) = 1/(1+e^{-x})$ can be numerically unstable. Write a stable version.

Answer:
For $x \geq 0$: $\sigma(x) = 1/(1+e^{-x})$ is safe because $e^{-x} \in (0,1]$, so there is no overflow. For $x < 0$: $\sigma(x) = e^x/(1+e^x)$, which is equivalent after multiplying numerator and denominator by $e^x$, and $e^x \in (0,1)$ is safe. Or equivalently: `return 0.5 + 0.5 * tanh(x/2)`.
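
One possible way to write the piecewise form in Python; `stable_sigmoid` is an illustrative name, and the sketch handles scalar inputs only.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Sigmoid that only ever exponentiates a non-positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))   # exp(-x) is in (0, 1]
    z = math.exp(x)                         # exp(x) is in (0, 1) for x < 0
    return z / (1.0 + z)


print(stable_sigmoid(0.0))
print(stable_sigmoid(1000.0), stable_sigmoid(-1000.0))
```

For comparison, evaluating `1 / (1 + math.exp(-x))` directly at `x = -1000` raises `OverflowError` in Python.
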
Demonstrate catastrophic cancellation in computing $\frac{1-\cos x}{x^2}$ for $x = 10^{-8}$. Provide a stable alternative.

Answer:
For small $x$: $\cos x \approx 1 - x^2/2 + x^4/24 - \ldots$ (Taylor series). $1 - \cos x \approx x^2/2$ (for tiny $x$). So $(1-\cos x)/x^2 \approx 1/2$. Naive: For $x=10^{-8}$, the true value $\cos(10^{-8}) \approx 1 - 5 \times 10^{-17}$ lies closer to 1 than to the next float below 1, so it rounds to exactly 1.0 in float64. Then $1 - \cos x = 0$ and the naive formula returns 0 instead of $\approx 0.5$: all significant digits cancel. Stable: Use $\frac{1-\cos x}{x^2} = \frac{2\sin^2(x/2)}{x^2} = \frac{1}{2}\left(\frac{\sin(x/2)}{x/2}\right)^2$. For small $x$, $\sin(x/2)/(x/2) \approx 1$, giving $1/2$ accurately.
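
A short sketch of both versions at $x = 10^{-8}$; the function names are illustrative and $x \neq 0$ is assumed.

```python
import math


def one_minus_cos_naive(x: float) -> float:
    """(1 - cos x) / x^2 computed directly; cancels for tiny x."""
    return (1.0 - math.cos(x)) / (x * x)


def one_minus_cos_stable(x: float) -> float:
    """Same quantity via the half-angle identity 1 - cos x = 2 sin^2(x/2)."""
    half = x / 2.0
    ratio = math.sin(half) / half
    return 0.5 * ratio * ratio


x = 1e-8
print(one_minus_cos_naive(x))    # far from 0.5
print(one_minus_cos_stable(x))   # close to 0.5
```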

Summary

Key takeaways:

IEEE 754 float64: 1 sign bit, 11 exponent bits, 52 mantissa bits → ~15-16 decimal digits
Machine epsilon $\epsilon_{\text{mach}} = 2^{-52} \approx 2.2 \times 10^{-16}$ — the resolution at 1.0
Floating-point has finite precision; not all decimals are exactly representable
Catastrophic cancellation: subtracting nearly equal numbers destroys accuracy
Log-sum-exp trick prevents overflow in softmax and related computations
Always use numerically stable formulations: avoid $a-b$ when $a \approx b$, use rationalization or algebraically equivalent stable forms

Pitfalls

Comparing floats with `==`. Due to rounding errors in binary representation, `0.1 + 0.2 == 0.3` evaluates to `False`. Compare with a tolerance instead: `abs(a - b) < tol`, with `tol` around $10^{-12}$ for float64 or $10^{-6}$ for float32 when the values are of order 1, or a relative tolerance when magnitudes vary.
Computing variance with the naive one-pass formula. $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$ subtracts two large nearly-equal numbers when the coefficient of variation is small. For data like $[10^8, 10^8+1, 10^8+2]$, the naive formula can produce zero or negative variance. Use Welford's online algorithm or the stable two-pass method.
Computing softmax by directly exponentiating. $\operatorname{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$ overflows for $x_i > 709$ in float64. Always subtract $\max_k x_k$ from all inputs before exponentiating — the shifted exponents are all $\leq 0$, making overflow impossible.
Subtracting nearly equal numbers without rationalization. $\sqrt{x^2 + 1} - x$ for large $x$ loses all significant digits. Rationalize: $1 / (\sqrt{x^2 + 1} + x)$ — mathematically equivalent but numerically stable. Always look for algebraically equivalent forms that avoid subtraction of close values.
Assuming higher precision automatically solves all numerical problems. Switching from float32 to float64 gains ~9 more digits but doesn't fix algorithms that have inherent cancellation. A stable formulation in float32 often beats an unstable one in float64.

Next Steps

Next up: 15-02-condition-stability.md — condition numbers, numerical stability of algorithms, and why some problems are inherently hard regardless of precision.
