15-02 — Condition and Stability
Phase: Numerical Methods for ML | Subject: 15-02 Prerequisites: 15-01-floating-point-arithmetic.md, 06-03-partial-derivatives.md Next subject: 15-03-automatic-differentiation.md
Learning Objectives
By the end of this subject, you will be able to:
- Distinguish between problem conditioning and algorithm stability
- Compute condition numbers for common operations (addition, root-finding, matrix inversion)
- Explain why ill-conditioned problems produce unreliable results regardless of algorithm
- Analyze the stability of forward and backward algorithms
- Recognize ill-conditioned formulations in ML (normal equations, Gram matrix)
Core Content
Two Kinds of Error
$$\text{Total error} = \underbrace{\text{Problem conditioning}}{\text{property of the problem}} \times \underbrace{\text{Algorithm stability}}{\text{property of the method}}$$
⚠️ CRITICAL — The distinction: An ill-conditioned problem will give inaccurate answers NO MATTER WHAT ALGORITHM YOU USE. No amount of clever programming or higher precision can fix a poorly conditioned problem — you need to reformulate it. A stable algorithm applied to a well-conditioned problem gives accurate results; an unstable algorithm gives inaccurate results even on a well-conditioned problem.
Condition Number
The absolute condition number of a function $f$ at $x$:
$$\kappa_{\text{abs}} = \lim_{\delta \to 0} \sup_{|\Delta x| \leq \delta} \frac{|f(x + \Delta x) - f(x)|}{|\Delta x|} \approx |J_f(x)|$$
The relative condition number:
$$\kappa_{\text{rel}} = \frac{|J_f(x)| \cdot |x|}{|f(x)|} = \left|\frac{x f'(x)}{f(x)}\right|$$
A problem with $\kappa \gg 1$ is ill-conditioned — small input errors are amplified into large output errors.
Examples of Condition Numbers
| Operation | Condition Number | When Ill-Conditioned |
|---|---|---|
| $f(x) = \sqrt{x}$ | $\kappa_{\text{rel}} = 1/2$ | Always well-conditioned |
| $f(x) = \ln x$ | $\kappa_{\text{rel}} = 1/ | \ln x |
| $f(x) = x_1 - x_2$ | $\kappa_{\text{rel}} = \frac{ | x_1 |
| Solving $A\mathbf{x} = \mathbf{b}$ | $\kappa(A) = |A||A^{-1}|$ | Nearly singular $A$ |
Matrix Condition Number
For a matrix $A$, the condition number in the $\ell_2$ norm:
$$\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$
where $\sigma_i$ are singular values. $\kappa_2(A) \geq 1$, with $\kappa_2(A) = 1$ for orthogonal matrices.
Ill-conditioned matrices: $\kappa$ large → small input perturbations can cause large output changes.
Rule of thumb: When solving $A\mathbf{x} = \mathbf{b}$ in float64 (16 digits), you lose approximately $\log_{10}\kappa(A)$ digits of accuracy. If $\kappa(A) = 10^{12}$, you have only ~4 accurate digits.
Normal Equations ARE Ill-Conditioned
The least-squares problem $\min |A\mathbf{x} - \mathbf{b}|_2$ can be solved via normal equations:
$$A^T A \mathbf{x} = A^T \mathbf{b}$$
But $\kappa(A^T A) = [\kappa(A)]^2$! If $\kappa(A) = 10^6$, then $\kappa(A^T A) = 10^{12}$ — you square the condition number.
The fix: Use QR decomposition instead. Solve $R\mathbf{x} = Q^T\mathbf{b}$ where $A = QR$. $\kappa(R) = \kappa(A)$ — no squaring.
Forward vs Backward Stability
An algorithm is backward stable if it gives the exact answer to a slightly perturbed problem:
$$\hat{y} = f(x + \Delta x) \quad \text{with} \quad \frac{|\Delta x|}{|x|} = O(\epsilon_{\text{mach}})$$
In other words: the computed result is the exact result for an input that differs from the true input by at most machine epsilon. This is the gold standard.
- Forward stable: error in output is $O(\epsilon_{\text{mach}} \cdot \kappa)$
- Backward stable: error in output is $O(\epsilon_{\text{mach}} \cdot \kappa$, same bound) but the error comes from input perturbation, not algorithmic amplification
Key Terms
- 15 02 Condition Stability
- Condition Number
- Correct: A)
- Correct: B)
- Correct: C)
- Example 1: Condition of Subtraction
- Example 2: Matrix Condition
- Example 3: Normal Equations Catastrophe
- Examples of Condition Numbers
- Forward vs Backward Stability
- Matrix Condition Number
- Normal Equations ARE Ill-Conditioned
Worked Examples
Example 1: Condition of Subtraction
Compute $\kappa_{\text{rel}}$ for $f(x_1, x_2) = x_1 - x_2$ at $(x_1, x_2) = (1.0001, 1.0000)$.
Solution: $\frac{\partial f}{\partial x_1} = 1$, $\frac{\partial f}{\partial x_2} = -1$. $J = [1, -1]$, $|J|_\infty = 2$.
$\kappa_{\text{abs}} = 2$. $f(1.0001, 1.0000) = 0.0001$.
$\kappa_{\text{rel}} = \frac{2 \cdot 1.0001}{0.0001} \approx 20,000$
$\frac{|\Delta f|}{|f|} \leq \kappa_{\text{rel}} \cdot \frac{|\Delta x|}{|x|} \approx 20000 \cdot 10^{-16} = 2 \times 10^{-12}$
So even with $\kappa = 20,000$, float64's 16 digits leave ~12 accurate digits. This is acceptable. But if $x_1 = 1.0000000000000001$ and $x_2 = 1.0000000000000000$, the difference $10^{-16}$ is entirely buried in rounding error — result is 0 or garbage.
Click for answer
$\kappa_{\text{rel}} \approx 20,000$. For $x_1$ and $x_2$ differing by $\sim 10^{-16}$, the subtraction is completely unreliable (all significant digits lost).Example 2: Matrix Condition
$A = \begin{pmatrix} 1 & 1 \ 1 & 1.0001 \end{pmatrix}$. Compute $\kappa_2(A)$.
Solution: This matrix is nearly singular — rows are almost identical. Eigenvalues of $A^T A$: approximately $\lambda_1 \approx 4$, $\lambda_2 \approx 2.5 \times 10^{-9}$. $\sigma_{\max} \approx 2$, $\sigma_{\min} \approx 5 \times 10^{-5}$. $\kappa_2(A) = \sigma_{\max}/\sigma_{\min} \approx 2 / (5 \times 10^{-5}) = 40,000$.
This means solving $A\mathbf{x} = \mathbf{b}$ will lose about $\log_{10}(40000) \approx 4.6$ decimal digits of accuracy in float64.
Click for answer
$\kappa_2(A) \approx 40,000$. Solving $A\mathbf{x} = \mathbf{b}$ loses ~5 digits of accuracy. The near-linear-dependence of rows causes the ill-conditioning.Example 3: Normal Equations Catastrophe
$A$ is a $100 \times 10$ matrix with $\kappa(A) = 10^4$. What is $\kappa(A^T A)$ and how many digits are lost using normal equations vs QR?
Solution: $\kappa(A^T A) = \kappa(A)^2 = 10^8$. Using normal equations loses $\log_{10}(10^8) = 8$ decimal digits → only ~8 accurate digits remain.
Using QR: $A = QR$, solve $R\mathbf{x} = Q^T\mathbf{b}$. $\kappa(R) = \kappa(A) = 10^4$. Lose only 4 digits → ~12 accurate digits.
For $\kappa(A) = 10^8$: normal equations lose 16 digits → completely useless in float64. QR loses 8 digits → still 8 accurate digits.
Click for answer
Normal equations: lose $\sim 8$ digits. QR: lose $\sim 4$ digits. For $\kappa(A) > 10^8$, normal equations are useless in float64 — QR is essential.Quiz
Q1: What does the concept of Condition Number primarily refer to in this subject?
A) The definition and application of Condition Number B) A historical anecdote about Condition Number C) A visual representation of Condition Number D) A computational error related to Condition Number
Correct: A)
- If you chose A: Condition Number is defined as: the definition and application of condition number. The other options describe different aspects that are not the primary focus. Correct!
- If you chose B: This is incorrect. Condition Number is defined as: the definition and application of condition number. The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. Condition Number is defined as: the definition and application of condition number. The other options describe different aspects that are not the primary focus.
- If you chose D: This is incorrect. Condition Number is defined as: the definition and application of condition number. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) A simplified version of \text{Total error} = \underb... B) The inverse operation of the formula in question C) An unrelated formula from a different topic D) \text{Total error} = \underbrace{\text{Problem conditioning}}{\text{property of the problem}} \times \underbrace{\text{Algorithm stability}}{\text{property of the method}}
Correct: D)
- If you chose A: This is incorrect. The formula \text{Total error} = \underbrace{\text{Problem conditioning}}{\text{property of the problem}} \times \underbrace{\text{Algorithm stability}}{\text{property of the method}} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: This is incorrect. The formula \text{Total error} = \underbrace{\text{Problem conditioning}}{\text{property of the problem}} \times \underbrace{\text{Algorithm stability}}{\text{property of the method}} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose C: This is incorrect. The formula \text{Total error} = \underbrace{\text{Problem conditioning}}{\text{property of the problem}} \times \underbrace{\text{Algorithm stability}}{\text{property of the method}} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: The formula \text{Total error} = \underbrace{\text{Problem conditioning}}{\text{property of the problem}} \times \underbrace{\text{Algorithm stability}}{\text{property of the method}} is central to this subject. The other options are either simplified versions or unrelated. Correct!
Q3: What is the primary purpose of Forward vs Backward Stability?
A) It is used only in advanced research contexts B) It is primarily a historical notation system C) It is used to forward vs backward stability in mathematical analysis D) It replaces all other methods in this domain
Correct: C)
- If you chose A: This is incorrect. Forward vs Backward Stability serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Forward vs Backward Stability serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: Forward vs Backward Stability serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose D: This is incorrect. Forward vs Backward Stability serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Matrix Condition Number is TRUE?
A) Matrix Condition Number is an advanced topic beyond this subject's scope B) Matrix Condition Number is a fundamental concept covered in this subject C) Matrix Condition Number is not related to this subject D) Matrix Condition Number is mentioned only as a historical footnote
Correct: B)
- If you chose A: This is incorrect. Matrix Condition Number is a fundamental concept covered in this subject. This subject covers Matrix Condition Number as part of its core content.
- If you chose B: Matrix Condition Number is a fundamental concept covered in this subject. This subject covers Matrix Condition Number as part of its core content. Correct!
- If you chose C: This is incorrect. Matrix Condition Number is a fundamental concept covered in this subject. This subject covers Matrix Condition Number as part of its core content.
- If you chose D: This is incorrect. Matrix Condition Number is a fundamental concept covered in this subject. This subject covers Matrix Condition Number as part of its core content.
Q5: Based on the worked examples in this subject, what is the correct result?
A) The inverse of the correct answer B) A different result from a common mistake C) An unrelated numerical value D) error in output is $O(\epsilon_{\text{mach}} \cdot
Correct: D)
- If you chose A: This is incorrect. The worked examples show that the result is error in output is $O(\epsilon_{\text{mach}} \cdot. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is error in output is $O(\epsilon_{\text{mach}} \cdot. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is error in output is $O(\epsilon_{\text{mach}} \cdot. The other options represent common errors.
- If you chose D: The worked examples show that the result is error in output is $O(\epsilon_{\text{mach}} \cdot. The other options represent common errors. Correct!
Q6: How are Matrix Condition Number and Normal Equations ARE Ill-Conditioned related?
A) Matrix Condition Number and Normal Equations ARE Ill-Conditioned are completely unrelated topics B) Matrix Condition Number is a special case of Normal Equations ARE Ill-Conditioned C) Matrix Condition Number is the inverse of Normal Equations ARE Ill-Conditioned D) Matrix Condition Number and Normal Equations ARE Ill-Conditioned are closely related concepts
Correct: D)
- If you chose A: This is incorrect. Both Matrix Condition Number and Normal Equations ARE Ill-Conditioned are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both Matrix Condition Number and Normal Equations ARE Ill-Conditioned are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Matrix Condition Number and Normal Equations ARE Ill-Conditioned are covered in this subject as interconnected topics.
- If you chose D: Both Matrix Condition Number and Normal Equations ARE Ill-Conditioned are covered in this subject as interconnected topics. Correct!
Q7: What is a common pitfall when working with Two Kinds Of Error?
A) The main error with Two Kinds Of Error is using it when it is not needed B) Two Kinds Of Error has no common misconceptions C) A common mistake is confusing Two Kinds Of Error with a similar concept D) Two Kinds Of Error is always computed the same way in all contexts
Correct: C)
- If you chose A: This is incorrect. Students often confuse Two Kinds Of Error with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Two Kinds Of Error with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: Students often confuse Two Kinds Of Error with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose D: This is incorrect. Students often confuse Two Kinds Of Error with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply Examples Of Condition Numbers?
A) Examples Of Condition Numbers is not practically useful B) Use Examples Of Condition Numbers only in pure mathematics contexts C) Avoid Examples Of Condition Numbers unless explicitly instructed D) Apply Examples Of Condition Numbers to solve problems in this subject's domain
Correct: D)
- If you chose A: This is incorrect. Examples Of Condition Numbers is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Examples Of Condition Numbers is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. Examples Of Condition Numbers is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: Examples Of Condition Numbers is a practical tool used throughout this subject to solve relevant problems. Correct!
Practice Problems
-
Compute the relative condition number of $f(x) = e^x$ at $x=100$. Why is this ill-conditioned?
Click for answer
$f'(x) = e^x$. $\kappa_{\text{rel}} = |x f'(x)/f(x)| = |x \cdot e^x / e^x| = |x|$. At $x=100$: $\kappa_{\text{rel}} = 100$. A 1% error in $x$ becomes a 100% error in $f(x)$. But the real issue for $x=100$ isn't conditioning — it's overflow: $e^{100} \approx 2.7 \times 10^{43}$ is representable, but $e^{710}$ overflows. More dramatic: at $x = -100$, $\kappa_{\text{rel}} = 100$ but $e^{-100} \approx 3.7 \times 10^{-44}$ is well within range. The function is well-behaved despite being ill-conditioned. -
Show that adding a small multiple of the identity ($A + \lambda I$) improves conditioning. Compute $\kappa(A + \lambda I)$ vs $\kappa(A)$ for $\lambda > 0$.
Click for answer
If $\sigma_i$ are singular values of $A$, then $\sigma_i + \lambda$ are singular values of $A + \lambda I$ (when $A$ is symmetric PSD; more generally, it's not this simple, but the idea holds for the spectrum shift). $\kappa(A + \lambda I) = (\sigma_{\max} + \lambda)/(\sigma_{\min} + \lambda) \leq \sigma_{\max}/\sigma_{\min} = \kappa(A)$. Adding $\lambda I$ shrinks the condition number. As $\lambda \to \infty$, $\kappa \to 1$. This is why ridge regression ($A^T A + \lambda I$) is better conditioned than OLS ($A^T A$) — the regularization term $\lambda I$ directly improves conditioning. -
A problem has $\kappa = 10^{10}$. You use float32 ($\epsilon_{\text{mach}} \approx 10^{-7}$). How many accurate digits can you expect?
Click for answer
Float32 has ~7 decimal digits. The condition number of $10^{10}$ means you lose $\log_{10}(10^{10}) = 10$ digits. $7 - 10 = -3$: you have ZERO accurate digits and potentially no correct sign. The result is pure noise. This is why float32 is insufficient for ill-conditioned problems — you MUST use float64 or reformulate. -
Is the inner product $\mathbf{x}^T\mathbf{y}$ backward stable?
Click for answer
Yes. The computed inner product satisfies $\operatorname{fl}(\mathbf{x}^T\mathbf{y}) = (\mathbf{x} + \Delta\mathbf{x})^T\mathbf{y}$ where $\|\Delta x_i\| \leq n \epsilon_{\text{mach}} |x_i|$. The result is the exact inner product for a slightly perturbed input. However, for large $n$, the error bound grows with $n$ — for $n=10^6$, you lose about $\log_{10}(10^6) = 6$ extra digits. -
Explain why training a linear regression with correlated features produces unstable coefficient estimates (high variance), linking to condition number.
Click for answer
Correlated features mean columns of the design matrix $X$ are nearly linearly dependent → $X^T X$ is nearly singular → $\kappa(X^T X)$ is large → the solution $\hat{\beta} = (X^T X)^{-1} X^T \mathbf{y}$ is highly sensitive to small changes in $\mathbf{y}$ (different training samples give very different $\hat{\beta}$). This is exactly the problem ridge regression ($\ell_2$ regularization) solves: adding $\lambda I$ reduces $\kappa$, stabilizing the coefficients at the cost of some bias. The condition number directly quantifies the instability.
Summary
Key takeaways:
- Conditioning is a property of the problem — cannot be fixed by a better algorithm
- Stability is a property of the algorithm — an unstable algorithm loses accuracy even on well-conditioned problems
- Relative condition number $\kappa_{\text{rel}} \approx |x f'(x) / f(x)|$ — large $\kappa$ amplifies input errors
- Matrix condition number $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ — nearly singular matrices are ill-conditioned
- Normal equations square the condition number: $\kappa(A^T A) = \kappa(A)^2$ — use QR instead
- A backward stable algorithm gives the exact answer for a slightly perturbed input
- Float64 loses $\approx \log_{10}\kappa$ digits; if $\kappa > 10^{16}$, results are meaningless
Pitfalls
-
Confusing conditioning with stability. Conditioning is a property of the problem — no algorithm can fix an ill-conditioned problem. Stability is a property of the algorithm — a stable algorithm applied to a well-conditioned problem gives accurate results. An unstable algorithm loses accuracy even on perfectly conditioned problems.
-
Using normal equations ($A^T A \mathbf{x} = A^T \mathbf{b}$) for least squares. Forming $A^T A$ squares the condition number: $\kappa(A^T A) = \kappa(A)^2$. If $\kappa(A) = 10^8$, normal equations lose 16 digits — total loss in float64. Always use QR decomposition ($R\mathbf{x} = Q^T\mathbf{b}$) where $\kappa(R) = \kappa(A)$.
-
Assuming higher precision fixes an ill-conditioned problem. Quad precision (float128, ~34 digits) buys you maybe 18 more digits, but an exponentially ill-conditioned problem ($\kappa = 10^{100}$) is meaningless at any practical precision. The only solution is to reformulate the problem.
-
Using the determinant to assess conditioning. A diagonal matrix $\operatorname{diag}(10^{-10}, 1)$ has $\det = 10^{-10}$ (tiny) but $\kappa = 10^{10}$ (huge). Conversely, $\operatorname{diag}(10^{10}, 10^{-10})$ has $\det = 1$ but $\kappa = 10^{20}$. The determinant is unrelated to conditioning — use singular values.
-
Ignoring that $A^T A$ squares the condition number in ML contexts. Training linear regression with correlated features produces a design matrix $X$ with large $\kappa$. The normal equations $(X^T X)^{-1} X^T \mathbf{y}$ become catastrophically unstable — this is why sklearn and other libraries use SVD or QR internally, and why ridge regression ($X^T X + \lambda I$) improves conditioning.
Next Steps
Next up: 15-03-automatic-differentiation.md — forward and reverse mode AD, computational graphs, and how PyTorch/TensorFlow/JAX compute gradients.