Math graphic
📐 Concept diagram

15-02 — Condition and Stability

Phase: Numerical Methods for ML | Subject: 15-02 Prerequisites: 15-01-floating-point-arithmetic.md, 06-03-partial-derivatives.md Next subject: 15-03-automatic-differentiation.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Distinguish between problem conditioning and algorithm stability
  2. Compute condition numbers for common operations (addition, root-finding, matrix inversion)
  3. Explain why ill-conditioned problems produce unreliable results regardless of algorithm
  4. Analyze the stability of forward and backward algorithms
  5. Recognize ill-conditioned formulations in ML (normal equations, Gram matrix)

Core Content

Two Kinds of Error

$$\text{Total error} = \underbrace{\text{Problem conditioning}}{\text{property of the problem}} \times \underbrace{\text{Algorithm stability}}{\text{property of the method}}$$

⚠️ CRITICAL — The distinction: An ill-conditioned problem will give inaccurate answers NO MATTER WHAT ALGORITHM YOU USE. No amount of clever programming or higher precision can fix a poorly conditioned problem — you need to reformulate it. A stable algorithm applied to a well-conditioned problem gives accurate results; an unstable algorithm gives inaccurate results even on a well-conditioned problem.

Condition Number

The absolute condition number of a function $f$ at $x$:

$$\kappa_{\text{abs}} = \lim_{\delta \to 0} \sup_{|\Delta x| \leq \delta} \frac{|f(x + \Delta x) - f(x)|}{|\Delta x|} \approx |J_f(x)|$$

The relative condition number:

$$\kappa_{\text{rel}} = \frac{|J_f(x)| \cdot |x|}{|f(x)|} = \left|\frac{x f'(x)}{f(x)}\right|$$

A problem with $\kappa \gg 1$ is ill-conditioned — small input errors are amplified into large output errors.

Examples of Condition Numbers

Operation Condition Number When Ill-Conditioned
$f(x) = \sqrt{x}$ $\kappa_{\text{rel}} = 1/2$ Always well-conditioned
$f(x) = \ln x$ $\kappa_{\text{rel}} = 1/ \ln x
$f(x) = x_1 - x_2$ $\kappa_{\text{rel}} = \frac{ x_1
Solving $A\mathbf{x} = \mathbf{b}$ $\kappa(A) = |A||A^{-1}|$ Nearly singular $A$

Matrix Condition Number

For a matrix $A$, the condition number in the $\ell_2$ norm:

$$\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$

where $\sigma_i$ are singular values. $\kappa_2(A) \geq 1$, with $\kappa_2(A) = 1$ for orthogonal matrices.

Ill-conditioned matrices: $\kappa$ large → small input perturbations can cause large output changes.

Rule of thumb: When solving $A\mathbf{x} = \mathbf{b}$ in float64 (16 digits), you lose approximately $\log_{10}\kappa(A)$ digits of accuracy. If $\kappa(A) = 10^{12}$, you have only ~4 accurate digits.

Normal Equations ARE Ill-Conditioned

The least-squares problem $\min |A\mathbf{x} - \mathbf{b}|_2$ can be solved via normal equations:

$$A^T A \mathbf{x} = A^T \mathbf{b}$$

But $\kappa(A^T A) = [\kappa(A)]^2$! If $\kappa(A) = 10^6$, then $\kappa(A^T A) = 10^{12}$ — you square the condition number.

The fix: Use QR decomposition instead. Solve $R\mathbf{x} = Q^T\mathbf{b}$ where $A = QR$. $\kappa(R) = \kappa(A)$ — no squaring.

Forward vs Backward Stability

An algorithm is backward stable if it gives the exact answer to a slightly perturbed problem:

$$\hat{y} = f(x + \Delta x) \quad \text{with} \quad \frac{|\Delta x|}{|x|} = O(\epsilon_{\text{mach}})$$

In other words: the computed result is the exact result for an input that differs from the true input by at most machine epsilon. This is the gold standard.



Key Terms

Worked Examples

Example 1: Condition of Subtraction

Compute $\kappa_{\text{rel}}$ for $f(x_1, x_2) = x_1 - x_2$ at $(x_1, x_2) = (1.0001, 1.0000)$.

Solution: $\frac{\partial f}{\partial x_1} = 1$, $\frac{\partial f}{\partial x_2} = -1$. $J = [1, -1]$, $|J|_\infty = 2$.

$\kappa_{\text{abs}} = 2$. $f(1.0001, 1.0000) = 0.0001$.

$\kappa_{\text{rel}} = \frac{2 \cdot 1.0001}{0.0001} \approx 20,000$

$\frac{|\Delta f|}{|f|} \leq \kappa_{\text{rel}} \cdot \frac{|\Delta x|}{|x|} \approx 20000 \cdot 10^{-16} = 2 \times 10^{-12}$

So even with $\kappa = 20,000$, float64's 16 digits leave ~12 accurate digits. This is acceptable. But if $x_1 = 1.0000000000000001$ and $x_2 = 1.0000000000000000$, the difference $10^{-16}$ is entirely buried in rounding error — result is 0 or garbage.

Click for answer $\kappa_{\text{rel}} \approx 20,000$. For $x_1$ and $x_2$ differing by $\sim 10^{-16}$, the subtraction is completely unreliable (all significant digits lost).

Example 2: Matrix Condition

$A = \begin{pmatrix} 1 & 1 \ 1 & 1.0001 \end{pmatrix}$. Compute $\kappa_2(A)$.

Solution: This matrix is nearly singular — rows are almost identical. Eigenvalues of $A^T A$: approximately $\lambda_1 \approx 4$, $\lambda_2 \approx 2.5 \times 10^{-9}$. $\sigma_{\max} \approx 2$, $\sigma_{\min} \approx 5 \times 10^{-5}$. $\kappa_2(A) = \sigma_{\max}/\sigma_{\min} \approx 2 / (5 \times 10^{-5}) = 40,000$.

This means solving $A\mathbf{x} = \mathbf{b}$ will lose about $\log_{10}(40000) \approx 4.6$ decimal digits of accuracy in float64.

Click for answer $\kappa_2(A) \approx 40,000$. Solving $A\mathbf{x} = \mathbf{b}$ loses ~5 digits of accuracy. The near-linear-dependence of rows causes the ill-conditioning.

Example 3: Normal Equations Catastrophe

$A$ is a $100 \times 10$ matrix with $\kappa(A) = 10^4$. What is $\kappa(A^T A)$ and how many digits are lost using normal equations vs QR?

Solution: $\kappa(A^T A) = \kappa(A)^2 = 10^8$. Using normal equations loses $\log_{10}(10^8) = 8$ decimal digits → only ~8 accurate digits remain.

Using QR: $A = QR$, solve $R\mathbf{x} = Q^T\mathbf{b}$. $\kappa(R) = \kappa(A) = 10^4$. Lose only 4 digits → ~12 accurate digits.

For $\kappa(A) = 10^8$: normal equations lose 16 digits → completely useless in float64. QR loses 8 digits → still 8 accurate digits.

Click for answer Normal equations: lose $\sim 8$ digits. QR: lose $\sim 4$ digits. For $\kappa(A) > 10^8$, normal equations are useless in float64 — QR is essential.


Quiz

Q1: What does the concept of Condition Number primarily refer to in this subject?

A) The definition and application of Condition Number B) A historical anecdote about Condition Number C) A visual representation of Condition Number D) A computational error related to Condition Number

Correct: A)

Q2: Which of the following is the key formula discussed in this subject?

A) A simplified version of \text{Total error} = \underb... B) The inverse operation of the formula in question C) An unrelated formula from a different topic D) \text{Total error} = \underbrace{\text{Problem conditioning}}{\text{property of the problem}} \times \underbrace{\text{Algorithm stability}}{\text{property of the method}}

Correct: D)

Q3: What is the primary purpose of Forward vs Backward Stability?

A) It is used only in advanced research contexts B) It is primarily a historical notation system C) It is used to forward vs backward stability in mathematical analysis D) It replaces all other methods in this domain

Correct: C)

Q4: Which statement about Matrix Condition Number is TRUE?

A) Matrix Condition Number is an advanced topic beyond this subject's scope B) Matrix Condition Number is a fundamental concept covered in this subject C) Matrix Condition Number is not related to this subject D) Matrix Condition Number is mentioned only as a historical footnote

Correct: B)

Q5: Based on the worked examples in this subject, what is the correct result?

A) The inverse of the correct answer B) A different result from a common mistake C) An unrelated numerical value D) error in output is $O(\epsilon_{\text{mach}} \cdot

Correct: D)

Q6: How are Matrix Condition Number and Normal Equations ARE Ill-Conditioned related?

A) Matrix Condition Number and Normal Equations ARE Ill-Conditioned are completely unrelated topics B) Matrix Condition Number is a special case of Normal Equations ARE Ill-Conditioned C) Matrix Condition Number is the inverse of Normal Equations ARE Ill-Conditioned D) Matrix Condition Number and Normal Equations ARE Ill-Conditioned are closely related concepts

Correct: D)

Q7: What is a common pitfall when working with Two Kinds Of Error?

A) The main error with Two Kinds Of Error is using it when it is not needed B) Two Kinds Of Error has no common misconceptions C) A common mistake is confusing Two Kinds Of Error with a similar concept D) Two Kinds Of Error is always computed the same way in all contexts

Correct: C)

Q8: When should you apply Examples Of Condition Numbers?

A) Examples Of Condition Numbers is not practically useful B) Use Examples Of Condition Numbers only in pure mathematics contexts C) Avoid Examples Of Condition Numbers unless explicitly instructed D) Apply Examples Of Condition Numbers to solve problems in this subject's domain

Correct: D)

Practice Problems

  1. Compute the relative condition number of $f(x) = e^x$ at $x=100$. Why is this ill-conditioned?

    Click for answer $f'(x) = e^x$. $\kappa_{\text{rel}} = |x f'(x)/f(x)| = |x \cdot e^x / e^x| = |x|$. At $x=100$: $\kappa_{\text{rel}} = 100$. A 1% error in $x$ becomes a 100% error in $f(x)$. But the real issue for $x=100$ isn't conditioning — it's overflow: $e^{100} \approx 2.7 \times 10^{43}$ is representable, but $e^{710}$ overflows. More dramatic: at $x = -100$, $\kappa_{\text{rel}} = 100$ but $e^{-100} \approx 3.7 \times 10^{-44}$ is well within range. The function is well-behaved despite being ill-conditioned.

  2. Show that adding a small multiple of the identity ($A + \lambda I$) improves conditioning. Compute $\kappa(A + \lambda I)$ vs $\kappa(A)$ for $\lambda > 0$.

    Click for answer If $\sigma_i$ are singular values of $A$, then $\sigma_i + \lambda$ are singular values of $A + \lambda I$ (when $A$ is symmetric PSD; more generally, it's not this simple, but the idea holds for the spectrum shift). $\kappa(A + \lambda I) = (\sigma_{\max} + \lambda)/(\sigma_{\min} + \lambda) \leq \sigma_{\max}/\sigma_{\min} = \kappa(A)$. Adding $\lambda I$ shrinks the condition number. As $\lambda \to \infty$, $\kappa \to 1$. This is why ridge regression ($A^T A + \lambda I$) is better conditioned than OLS ($A^T A$) — the regularization term $\lambda I$ directly improves conditioning.

  3. A problem has $\kappa = 10^{10}$. You use float32 ($\epsilon_{\text{mach}} \approx 10^{-7}$). How many accurate digits can you expect?

    Click for answer Float32 has ~7 decimal digits. The condition number of $10^{10}$ means you lose $\log_{10}(10^{10}) = 10$ digits. $7 - 10 = -3$: you have ZERO accurate digits and potentially no correct sign. The result is pure noise. This is why float32 is insufficient for ill-conditioned problems — you MUST use float64 or reformulate.

  4. Is the inner product $\mathbf{x}^T\mathbf{y}$ backward stable?

    Click for answer Yes. The computed inner product satisfies $\operatorname{fl}(\mathbf{x}^T\mathbf{y}) = (\mathbf{x} + \Delta\mathbf{x})^T\mathbf{y}$ where $\|\Delta x_i\| \leq n \epsilon_{\text{mach}} |x_i|$. The result is the exact inner product for a slightly perturbed input. However, for large $n$, the error bound grows with $n$ — for $n=10^6$, you lose about $\log_{10}(10^6) = 6$ extra digits.

  5. Explain why training a linear regression with correlated features produces unstable coefficient estimates (high variance), linking to condition number.

    Click for answer Correlated features mean columns of the design matrix $X$ are nearly linearly dependent → $X^T X$ is nearly singular → $\kappa(X^T X)$ is large → the solution $\hat{\beta} = (X^T X)^{-1} X^T \mathbf{y}$ is highly sensitive to small changes in $\mathbf{y}$ (different training samples give very different $\hat{\beta}$). This is exactly the problem ridge regression ($\ell_2$ regularization) solves: adding $\lambda I$ reduces $\kappa$, stabilizing the coefficients at the cost of some bias. The condition number directly quantifies the instability.


Summary

Key takeaways:


Pitfalls

  1. Confusing conditioning with stability. Conditioning is a property of the problem — no algorithm can fix an ill-conditioned problem. Stability is a property of the algorithm — a stable algorithm applied to a well-conditioned problem gives accurate results. An unstable algorithm loses accuracy even on perfectly conditioned problems.

  2. Using normal equations ($A^T A \mathbf{x} = A^T \mathbf{b}$) for least squares. Forming $A^T A$ squares the condition number: $\kappa(A^T A) = \kappa(A)^2$. If $\kappa(A) = 10^8$, normal equations lose 16 digits — total loss in float64. Always use QR decomposition ($R\mathbf{x} = Q^T\mathbf{b}$) where $\kappa(R) = \kappa(A)$.

  3. Assuming higher precision fixes an ill-conditioned problem. Quad precision (float128, ~34 digits) buys you maybe 18 more digits, but an exponentially ill-conditioned problem ($\kappa = 10^{100}$) is meaningless at any practical precision. The only solution is to reformulate the problem.

  4. Using the determinant to assess conditioning. A diagonal matrix $\operatorname{diag}(10^{-10}, 1)$ has $\det = 10^{-10}$ (tiny) but $\kappa = 10^{10}$ (huge). Conversely, $\operatorname{diag}(10^{10}, 10^{-10})$ has $\det = 1$ but $\kappa = 10^{20}$. The determinant is unrelated to conditioning — use singular values.

  5. Ignoring that $A^T A$ squares the condition number in ML contexts. Training linear regression with correlated features produces a design matrix $X$ with large $\kappa$. The normal equations $(X^T X)^{-1} X^T \mathbf{y}$ become catastrophically unstable — this is why sklearn and other libraries use SVD or QR internally, and why ridge regression ($X^T X + \lambda I$) improves conditioning.



Next Steps

Next up: 15-03-automatic-differentiation.md — forward and reverse mode AD, computational graphs, and how PyTorch/TensorFlow/JAX compute gradients.