15-01 — Floating-Point Arithmetic

Phase: Numerical Methods for ML | Subject: 15-01 | Prerequisites: 00-03-decimals.md, basic binary representation | Next subject: 15-02-condition-stability.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Explain the IEEE 754 floating-point representation (sign, exponent, mantissa)
  2. Compute machine epsilon and explain its significance
  3. Diagnose and avoid catastrophic cancellation
  4. Explain why $0.1 + 0.2 \neq 0.3$ in floating point
  5. Apply numerically stable alternatives to common formulas (e.g., softmax, log-sum-exp, variance)

Core Content

IEEE 754 Double Precision (float64)

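The default floating-point format in Python and NumPy, and the usual reference precision for numerical work (deep-learning frameworks typically default to float32, which has the same layout with 8 exponent bits and 23 mantissa bits):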

$$\text{value} = (-1)^s \times (1.m) \times 2^{e - 1023}$$

| Component | Bits | Meaning |
|---|---|---|
| Sign $s$ | 1 bit | 0 = positive, 1 = negative |
| Exponent $e$ | 11 bits | Biased by 1023; range $[1, 2046]$ (0 and 2047 are special) |
| Mantissa $m$ | 52 bits | Fractional part; leading 1 is implicit (normalized numbers) |

Example: Represent 6.5 in float64. $6.5 = 110.1_2 = 1.101_2 \times 2^2$, so: sign $s = 0$; exponent $e = 2 + 1023 = 1025 = 10000000001_2$; mantissa $101000\ldots0_2$ (52 bits).
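The three fields of the 6.5 example can be inspected directly. The following is a minimal sketch using Python's standard `struct` module; the helper name `decompose_float64` is made up for this illustration.

```python
import struct

def decompose_float64(x: float) -> tuple[int, int, int]:
    """Return (sign, biased_exponent, mantissa) of x as stored in float64."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # top bit
    exponent = (bits >> 52) & 0x7FF      # next 11 bits (biased by 1023)
    mantissa = bits & ((1 << 52) - 1)    # low 52 bits (fraction after the implicit 1)
    return sign, exponent, mantissa

sign, exponent, mantissa = decompose_float64(6.5)
print(sign, exponent, f"{mantissa:052b}")

# Rebuild the value from the formula for normalized numbers.
value = (-1) ** sign * (1 + mantissa / 2**52) * 2.0 ** (exponent - 1023)
print(value)  # 6.5
```

The printed mantissa starts with `101` followed by zeros, matching the hand computation.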

⚠️ CRITICAL — Not all real numbers are exactly representable. Floating-point uses a finite set of rational numbers with denominator $2^k$. Numbers like $0.1$, $0.2$, and $0.3$ have infinite binary expansions (like $1/3 = 0.333\ldots$ in decimal), so they're rounded to the nearest representable float.

Machine Epsilon

Machine epsilon $\epsilon_{\text{mach}}$ is the gap between 1 and the next representable float. For float64:

$$\epsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$$

This means float64 has approximately 15–16 decimal digits of precision.

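Practical implication: Each arithmetic operation rounds its result with a relative error of at most $\epsilon_{\text{mach}}/2 \approx 1.1 \times 10^{-16}$, so a computed answer near 1.0 can be off by $\sim 10^{-16}$. Usually fine, but the error accumulates over many operations.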
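Machine epsilon can also be found experimentally. This sketch (the function name `machine_epsilon` is illustrative) halves a candidate until adding half of it to 1.0 no longer changes the result, then compares against the value Python reports in `sys.float_info.epsilon`.

```python
import sys

def machine_epsilon() -> float:
    """Halve eps until 1 + eps/2 is indistinguishable from 1."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

eps = machine_epsilon()
print(eps)                             # 2.220446049250313e-16
print(eps == sys.float_info.epsilon)   # True
print(1.0 + eps / 2 == 1.0)            # True: half the gap rounds back to 1.0
```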

Why $0.1 + 0.2 \neq 0.3$

>>> 0.1 + 0.2
0.30000000000000004

$0.1$ in binary: $0.00011001100110011\ldots_2$ (repeating).

$0.2$ in binary: $0.00110011001100110\ldots_2$ (repeating).

Both are rounded to 53 significant bits. Their sum differs from the rounded representation of $0.3$ by $\sim 5.5 \times 10^{-17}$.

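Takeaway: Never compare floats with `==`. Use a tolerance such as `abs(a - b) < 1e-12` for values of order 1, or a relative tolerance such as Python's `math.isclose(a, b)` when magnitudes vary.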
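To see the size of the discrepancy exactly, the stored binary values can be converted to exact rationals. This is a small sketch using the standard `fractions` module; the `rel_tol=1e-9` passed to `math.isclose` is only an illustrative choice.

```python
from fractions import Fraction
import math

# Fraction(float) converts the stored binary value exactly, with no rounding.
stored_sum = Fraction(0.1 + 0.2)
stored_03 = Fraction(0.3)

gap = stored_sum - stored_03
print(gap == Fraction(1, 2**54))   # True: the two results are adjacent floats
print(float(gap))                  # about 5.55e-17

# Exact comparison fails; a tolerance-based comparison succeeds.
print(0.1 + 0.2 == 0.3)                              # False
print(math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9))    # True
```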

Catastrophic Cancellation

When subtracting two nearly equal numbers, most of the significant digits cancel, leaving only rounding error.

Example: Compute $\sqrt{x^2 + 1} - x$ for large $x$ (e.g., $x = 10^8$).

Direct computation: the true value is $\approx 5 \times 10^{-9}$, but adjacent float64 values near $10^8$ are $2^{-26} \approx 1.5 \times 10^{-8}$ apart, so $\sqrt{x^2 + 1}$ cannot be distinguished from $x$. In fact $10^{16} + 1$ is itself not representable (above $2^{53} \approx 9 \times 10^{15}$ the spacing between floats is at least 2), so it rounds to $10^{16}$, the square root is exactly $10^8$, and the subtraction returns 0. Result: total loss of accuracy.

Stable alternative: Rationalize the numerator: $$\sqrt{x^2 + 1} - x = \frac{(\sqrt{x^2+1} - x)(\sqrt{x^2+1} + x)}{\sqrt{x^2+1} + x} = \frac{1}{\sqrt{x^2+1} + x}$$

This avoids cancellation entirely. For $x = 10^8$, this gives $\approx 5 \times 10^{-9}$ accurately.
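Both forms can be compared numerically. The sketch below uses only the standard `math` module; the function names `naive` and `rationalized` are chosen here for illustration.

```python
import math

def naive(x: float) -> float:
    """sqrt(x^2 + 1) - x, evaluated as written."""
    return math.sqrt(x * x + 1.0) - x

def rationalized(x: float) -> float:
    """Algebraically equal form with no subtraction of close values."""
    return 1.0 / (math.sqrt(x * x + 1.0) + x)

x = 1e8
print(naive(x))          # 0.0: every significant digit has cancelled
print(rationalized(x))   # about 5e-09
```

For moderate inputs such as `x = 1.0` the two functions agree closely; the difference only shows up once `x` is large enough for the `+ 1` to be lost.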

The Log-Sum-Exp Trick

The softmax function $\operatorname{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$ can overflow when $x_i$ is large: $e^{709.78} \approx 1.8 \times 10^{308}$ is already about the float64 maximum, so any larger exponent overflows.

Stable computation: $$\operatorname{softmax}(x_i) = \frac{e^{x_i - \max_k x_k}}{\sum_j e^{x_j - \max_k x_k}}$$

Subtracting the maximum ensures all exponents are $\leq 0$, so $e^{x_i - \max x_k} \in (0, 1]$. No overflow.

Similarly, $\log\sum_i e^{x_i} = \max_k x_k + \log\sum_i e^{x_i - \max_k x_k}$.
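A minimal pure-Python sketch of the shifted softmax follows. It assumes a non-empty list of finite floats; a production version would normally use an array library instead of lists.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Softmax with the maximum subtracted before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # every argument is <= 0
    total = sum(exps)                      # >= 1, because exp(0) = 1 is one of the terms
    return [e / total for e in exps]

# Exponentiating directly fails: math.exp raises OverflowError for large inputs.
try:
    math.exp(1000.0)
except OverflowError as err:
    print("direct exponentiation fails:", err)

probs = softmax([1000.0, 1001.0, 1002.0])
print(probs)
print(sum(probs))
```

Because only the differences `x - m` enter the computation, `softmax([1000.0, 1001.0, 1002.0])` returns the same list as `softmax([0.0, 1.0, 2.0])`.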



Worked Examples

Example 1: Catastrophic Cancellation in Variance

Compute the variance of $[10000, 10001, 10002]$ using the naive one-pass formula $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

Solution (naive): $\mathbb{E}[X] = (10000+10001+10002)/3 = 10001$. $\mathbb{E}[X^2] = (10^8 + 100020001 + 100040004)/3 = 300060005/3 = 100020001.666...$

$\operatorname{Var} = 100020001.666... - 10001^2 = 100020001.666... - 100020001 = 0.666...$

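In float64, $100020001.666...$ and $100020001$ differ by $\sim 2/3$, but both operands are ~9-digit numbers, so the subtraction cancels about 8 of float64's ~16 digits and the result is accurate to only ~7 digits. Not catastrophic here, but the relative error grows like $\epsilon_{\text{mach}} / \text{CV}^2$, where CV is the coefficient of variation (standard deviation divided by mean). Once CV approaches $\sqrt{\epsilon_{\text{mach}}} \approx 10^{-8}$, the formula can produce zero or even negative variances!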

Stable alternative: Welford's online algorithm or the two-pass algorithm (compute mean first, then sum squared deviations).

Answer: The naive formula gives $\operatorname{Var} \approx 0.6666667$ (correct to about 7 digits for this small example). For data like $[10^8, 10^8+1, 10^8+2]$, the naive formula can give zero or negative variance due to catastrophic cancellation.
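The three approaches mentioned in this example can be put side by side. This is a sketch computing the population variance (dividing by $n$, as in the worked solution) for a non-empty list; the function names are illustrative.

```python
def variance_naive(xs):
    """E[X^2] - (E[X])^2: a single pass over the data, but prone to cancellation."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean

def variance_two_pass(xs):
    """Compute the mean first, then average the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def variance_welford(xs):
    """Welford's online algorithm: one pass, updating the mean and M2 together."""
    mean = 0.0
    m2 = 0.0   # running sum of squared deviations from the current mean
    for k, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)
    return m2 / len(xs)

small = [10000.0, 10001.0, 10002.0]
large = [1e8, 1e8 + 1, 1e8 + 2]
for data in (small, large):
    print(variance_naive(data), variance_two_pass(data), variance_welford(data))
```

On the second data set the squares exceed $2^{53}$, where float64 can only represent even integers, so the naive result cannot come out close to $2/3$, while the other two functions stay accurate.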

Example 2: Log-Sum-Exp

Compute $\log(e^{1000} + e^{1001} + e^{1002})$ stably.

Solution: Without the trick, $e^{1000}$, $e^{1001}$, and $e^{1002}$ all overflow float64 ($e^{709.78}$ is the max).

With the trick: $m = \max(1000, 1001, 1002) = 1002$, so

$$\log(e^{1000} + e^{1001} + e^{1002}) = 1002 + \log(e^{-2} + e^{-1} + e^0) = 1002 + \log(0.1353 + 0.3679 + 1) = 1002 + \log(1.5032) = 1002 + 0.4076 = 1002.4076$$

Answer: $\approx 1002.4076$. Without the log-sum-exp trick, the computation overflows.
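The same calculation in code, as a minimal sketch that assumes a non-empty list of finite floats (the name `logsumexp` is chosen to match common usage, not taken from a specific library):

```python
import math

def logsumexp(xs: list[float]) -> float:
    """log(sum(exp(x))) computed by factoring out the maximum."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

result = logsumexp([1000.0, 1001.0, 1002.0])
print(result)   # about 1002.4076
```

For inputs small enough not to overflow, the function agrees with the direct formula, so the shift costs nothing in accuracy.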

Example 3: Catastrophic Cancellation in the Quadratic Formula

Compute the roots of $x^2 - 10^6x + 1 = 0$ using the standard quadratic formula. Identify the unstable root.

Solution: Standard formula: $x = \frac{10^6 \pm \sqrt{10^{12} - 4}}{2}$ $\sqrt{10^{12} - 4} = \sqrt{999999999996} \approx 999999.999998$

$x_1 = \frac{10^6 + 999999.999998}{2} \approx 999999.999999$ — accurate (no cancellation). $x_2 = \frac{10^6 - 999999.999998}{2} \approx \frac{0.000002}{2} = 0.000001$ — catastrophic cancellation! We subtracted two nearly identical large numbers.

Stable alternative for the smaller root: Use $x_2 = (c/a) / x_1$ (Vieta's formula: product of roots = $c/a = 1$). $x_2 = 1 / 999999.999999 \approx 1.000000000001 \times 10^{-6}$ — much more accurate.

Answer: The smaller root ($x_2$) suffers from catastrophic cancellation in the standard formula. Vieta's formula $x_2 = c/(a x_1) = 1/x_1$ recovers it accurately.
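A sketch of both approaches follows, assuming real roots and $a \neq 0$. The stable variant picks the sign that avoids subtraction and then obtains the other root from Vieta's product; the function names are illustrative.

```python
import math

def roots_naive(a: float, b: float, c: float) -> tuple[float, float]:
    """Standard quadratic formula, applied to both roots."""
    disc = math.sqrt(b * b - 4.0 * a * c)
    return (-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)

def roots_stable(a: float, b: float, c: float) -> tuple[float, float]:
    """Compute the cancellation-free root first, then the other via x1 * x2 = c / a."""
    disc = math.sqrt(b * b - 4.0 * a * c)
    # b and copysign(disc, b) share a sign, so this sum never cancels.
    q = -0.5 * (b + math.copysign(disc, b))
    return q / a, c / q

big_naive, small_naive = roots_naive(1.0, -1e6, 1.0)
big_stable, small_stable = roots_stable(1.0, -1e6, 1.0)
print(small_naive)    # about 1.0000076e-06: only ~5 correct digits
print(small_stable)   # about 1.000000000001e-06
```

The large root is essentially the same in both versions; only the small root differs.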


Quiz

Q1: What does the concept of machine epsilon primarily refer to in this subject?

A) A computational error related to machine epsilon
B) A historical anecdote about machine epsilon
C) A visual representation of machine epsilon
D) The definition and application of machine epsilon

Correct: D)

Q2: Which of the following is the key formula discussed in this subject?

A) A simplified version of $0.1 + 0.2 \neq 0.3$...
B) An unrelated formula from a different topic
C) The inverse operation of the formula in question
D) $0.1 + 0.2 \neq 0.3$

Correct: D)

Q3: What is the primary purpose of IEEE 754 double precision (float64)?

A) It is the standard format for representing floating-point numbers in numerical computation
B) It is primarily a historical notation system
C) It is used only in advanced research contexts
D) It replaces all other methods in this domain

Correct: A)

Q4: Which statement about the topic "Why $0.1 + 0.2 \neq 0.3$" is TRUE?

A) It is not related to this subject
B) It is a fundamental concept covered in this subject
C) It is mentioned only as a historical footnote
D) It is an advanced topic beyond this subject's scope

Correct: B)

Q5: Based on this subject, what is the binary expansion of $0.1$?

A) The inverse of the correct answer
B) An unrelated numerical value
C) $0.00011001100110011\ldots_2$ (repeating)
D) A different result from a common mistake

Correct: C)

Q6: How are the topics "Why $0.1 + 0.2 \neq 0.3$" and catastrophic cancellation related?

A) They are completely unrelated topics
B) "Why $0.1 + 0.2 \neq 0.3$" is a special case of catastrophic cancellation
C) "Why $0.1 + 0.2 \neq 0.3$" is the inverse of catastrophic cancellation
D) They are closely related concepts

Correct: D)

Q7: What is a common pitfall when working with the log-sum-exp trick?

A) The main error with the log-sum-exp trick is using it when it is not needed
B) The log-sum-exp trick has no common misconceptions
C) The log-sum-exp trick is always computed the same way in all contexts
D) A common mistake is confusing the log-sum-exp trick with a similar concept

Correct: D)

Q8: When should you apply the technique from Example 1 (Catastrophic Cancellation in Variance)?

A) It is not practically useful
B) Apply it to solve problems in this subject's domain
C) Use it only in pure mathematics contexts
D) Avoid it unless explicitly instructed

Correct: B)

Practice Problems

  1. What's the next representable float64 after 1.0? What about before 1.0?

    Answer: After 1.0: $1 + 2^{-52} \approx 1 + 2.22 \times 10^{-16}$. Before 1.0: $1 - 2^{-53} \approx 1 - 1.11 \times 10^{-16}$. The gap is asymmetric because the spacing between floats changes at powers of 2: it doubles at $2, 4, 8, \ldots$ and halves at $1/2, 1/4, \ldots$.

  2. Explain why $\tanh(x)$ is numerically stable for all $x$ but computing it as $(e^x - e^{-x})/(e^x + e^{-x})$ can overflow.

    Answer: For large positive $x$, $e^x \to \infty$ (overflow) while $e^{-x} \to 0$. The ratio tends to 1, but the intermediate $e^x$ overflows. Stable alternatives: (1) Use the library `tanh(x)`, which handles this internally. (2) For $x \geq 0$: $\tanh(x) = (1 - e^{-2x})/(1 + e^{-2x})$, where $e^{-2x} \in (0, 1]$ cannot overflow. (3) For negative $x$: use $\tanh(x) = -\tanh(-x)$.

  3. Compute the smallest positive normalized float64 and the smallest positive subnormal.

    Answer: Smallest normalized: biased exponent $e = 1$ (true exponent $1 - 1023 = -1022$), mantissa $= 0$. Value: $1.0 \times 2^{-1022} \approx 2.225 \times 10^{-308}$. Smallest subnormal: biased exponent $e = 0$, mantissa $= 000\ldots001$ (LSB only). Value: $0.00\ldots01_2 \times 2^{-1022} = 2^{-52} \times 2^{-1022} = 2^{-1074} \approx 4.94 \times 10^{-324}$. Subnormals allow gradual underflow: without them, any nonzero value below $2^{-1022}$ could only be represented as $0$ or $2^{-1022}$.

  4. The sigmoid function $\sigma(x) = 1/(1+e^{-x})$ can be numerically unstable. Write a stable version.

    Answer: For $x \geq 0$: $\sigma(x) = 1/(1+e^{-x})$ is safe, since $e^{-x} \in (0,1]$ cannot overflow. For $x < 0$: $\sigma(x) = e^x/(1+e^x)$, which is equivalent (multiply numerator and denominator by $e^x$) and safe because $e^x \in (0,1)$. Or equivalently: return `0.5 + 0.5 * tanh(x/2)`.

  5. Demonstrate catastrophic cancellation in computing $\frac{1-\cos x}{x^2}$ for $x = 10^{-8}$. Provide a stable alternative.

    Answer: For small $x$: $\cos x = 1 - x^2/2 + x^4/24 - \ldots$ (Taylor series), so $1 - \cos x \approx x^2/2$ and $(1-\cos x)/x^2 \approx 1/2$. Naive: For $x=10^{-8}$, the true value $\cos x \approx 1 - 5 \times 10^{-17}$ lies closer to $1$ than to the next float64 below it ($1 - 2^{-53} \approx 1 - 1.1 \times 10^{-16}$), so $\cos(10^{-8})$ rounds to $1.0$. Then $1 - \cos x$ evaluates to $0$ and the quotient to $0$ instead of $1/2$: all significant digits cancel. Stable: Use $\frac{1-\cos x}{x^2} = \frac{2\sin^2(x/2)}{x^2} = \frac{1}{2}\left(\frac{\sin(x/2)}{x/2}\right)^2$. For small $x$, $\sin(x/2)/(x/2) \approx 1$, giving $1/2$ accurately.
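Two of the practice answers translate directly into code. The sketch below covers the branch-based sigmoid from problem 4 and the half-angle form from problem 5; it uses only the standard `math` module, the function names are illustrative, and the cosine helpers assume $x \neq 0$.

```python
import math

def sigmoid(x: float) -> float:
    """Sigmoid that only ever exponentiates a non-positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)   # x < 0, so z < 1 and cannot overflow
    return z / (1.0 + z)

def cos_quotient_naive(x: float) -> float:
    """(1 - cos x) / x^2, evaluated as written."""
    return (1.0 - math.cos(x)) / (x * x)

def cos_quotient_stable(x: float) -> float:
    """Same quantity via 1 - cos x = 2 sin^2(x / 2)."""
    h = 0.5 * x
    s = math.sin(h) / h
    return 0.5 * s * s

print(sigmoid(-1000.0), sigmoid(0.0), sigmoid(1000.0))
print(cos_quotient_naive(1e-8))    # far from the true value of about 0.5
print(cos_quotient_stable(1e-8))   # about 0.5
```

Evaluating `1.0 / (1.0 + math.exp(-x))` directly at `x = -1000.0` would raise `OverflowError` in Python, which is what the branch avoids.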


Summary


Pitfalls

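  1. Comparing floats with `==`. Due to rounding errors in binary representation, `0.1 + 0.2 == 0.3` evaluates to `False`. Compare with a tolerance instead, e.g. `abs(a - b) < tol` with `tol` around $10^{-12}$ for float64 or $10^{-6}$ for float32 when the values are of order 1; for other magnitudes, scale the tolerance with the operands (a relative tolerance, as in Python's `math.isclose`).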

  2. Computing variance with the naive one-pass formula. $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$ subtracts two large nearly-equal numbers when the coefficient of variation is small. For data like $[10^8, 10^8+1, 10^8+2]$, the naive formula can produce zero or negative variance. Use Welford's online algorithm or the stable two-pass method (compute the mean first, then sum squared deviations).

  3. Computing softmax by directly exponentiating. $\operatorname{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$ overflows for $x_i > 709$ in float64. Always subtract $\max_k x_k$ from all inputs before exponentiating — the shifted exponents are all $\leq 0$, making overflow impossible.

  4. Subtracting nearly equal numbers without rationalization. $\sqrt{x^2 + 1} - x$ for large $x$ loses all significant digits. Rationalize: $1 / (\sqrt{x^2 + 1} + x)$ — mathematically equivalent but numerically stable. Always look for algebraically equivalent forms that avoid subtraction of close values.

  5. Assuming higher precision automatically solves all numerical problems. Switching from float32 to float64 gains ~9 more digits but doesn't fix algorithms that have inherent cancellation. A stable formulation in float32 often beats an unstable one in float64.



Next Steps

Next up: 15-02-condition-stability.md — condition numbers, numerical stability of algorithms, and why some problems are inherently hard regardless of precision.