📐 Concept diagram

25-08 — Adversarial Robustness

Phase: 25 — Frontiers & Active Research Areas Subject: 25-08 Prerequisites: 16 (Neural Networks), 14 (Optimization), 17 (Deep Learning) Next subject: 25-09 — Continual Learning

Learning Objectives

By the end of this subject, you will be able to:

Define adversarial examples and explain why neural networks are vulnerable to them
Implement FGSM and PGD attacks and compute their perturbation budgets
Formulate adversarial training as a minimax optimisation problem
Explain the concept of certified robustness and the trade-offs with standard accuracy
Analyse the relationship between adversarial robustness and interpretable features

Core Content

1. Adversarial Examples

An adversarial example is an input $\mathbf{x}'$ that is perceptually indistinguishable from a legitimate input $\mathbf{x}$ but causes a model to make a different (incorrect) prediction:

$$|\mathbf{x}' - \mathbf{x}|_p \leq \epsilon \quad \text{but} \quad f(\mathbf{x}') \neq y$$

where $|\cdot|p$ is typically the $\ell\infty$ norm (maximum per-pixel change) or $\ell_2$ norm.

Key discoveries (Szegedy et al., 2014; Goodfellow et al., 2015): - Neural networks are surprisingly vulnerable — imperceptible perturbations can flip predictions - Adversarial examples transfer between models — an attack crafted for one model often fools others - The vulnerability stems from the locally-linear behaviour of networks in high-dimensional input space

⚠️ CRITICAL: Adversarial vulnerability is not a bug — it's a fundamental consequence of high-dimensional geometry. In a $d$-dimensional input space (e.g., $d = 3072$ for CIFAR-10), a small $\ell_\infty$ perturbation of $\epsilon = 8/255$ accumulated across all dimensions can produce a large change in activation: $\mathbf{w}^\top(\mathbf{x} + \boldsymbol{\eta}) = \mathbf{w}^\top\mathbf{x} + \mathbf{w}^\top\boldsymbol{\eta}$. The worst-case $\boldsymbol{\eta}$ (with $|\boldsymbol{\eta}|_\infty \leq \epsilon$) is $\boldsymbol{\eta} = \epsilon \cdot \text{sign}(\mathbf{w})$, giving activation change $\epsilon |\mathbf{w}|_1$, which grows with input dimension.

2. Attack Methods

Fast Gradient Sign Method (FGSM) — Goodfellow et al., 2015:

$$\mathbf{x}' = \mathbf{x} + \epsilon \cdot \text{sign}(\nabla_\mathbf{x} L(\boldsymbol{\theta}, \mathbf{x}, y))$$

A single-step attack that perturbs each pixel by $\pm\epsilon$ in the direction that maximally increases the loss. Computationally cheap but weak against defended models.

Projected Gradient Descent (PGD) — Madry et al., 2018:

The strongest first-order attack. Iterative FGSM with projection back to the $\epsilon$-ball:

$$\mathbf{x}^{t+1} = \Pi_{\mathbf{x}+\mathcal{S}}(\mathbf{x}^t + \alpha \cdot \text{sign}(\nabla_{\mathbf{x}^t} L(\boldsymbol{\theta}, \mathbf{x}^t, y)))$$

where $\mathcal{S} = {\boldsymbol{\eta} : |\boldsymbol{\eta}|_\infty \leq \epsilon}$ and $\Pi$ projects onto the $\epsilon$-ball around the original $\mathbf{x}$:

$$\Pi_{\mathbf{x}+\mathcal{S}}(\mathbf{x}') = \text{clip}(\mathbf{x}', \mathbf{x} - \epsilon, \mathbf{x} + \epsilon)$$

Typically run for 7-40 iterations with step size $\alpha = \epsilon/4$ or $\epsilon/10$, often with random initialisation within the $\epsilon$-ball.

Carlini & Wagner (C&W) attack: Formulated as an optimisation problem:

$$\min_{\boldsymbol{\eta}} |\boldsymbol{\eta}|_p + c \cdot g(\mathbf{x} + \boldsymbol{\eta})$$

where $g(\cdot)$ is a margin-based loss (encouraging misclassification). Solved via gradient-based optimisation with a change-of-variables to handle box constraints.

3. Adversarial Training

The most effective empirical defence: train the model on adversarial examples.

Formulation (Madry et al., 2018):

$$\min_{\boldsymbol{\theta}} \mathbb{E}{(\mathbf{x}, y) \sim \mathcal{D}} \left[\max{|\boldsymbol{\eta}|_\infty \leq \epsilon} L(\boldsymbol{\theta}, \mathbf{x} + \boldsymbol{\eta}, y)\right]$$

This is a minimax (saddle-point) problem: - Inner maximisation: find the strongest adversarial perturbation within the $\epsilon$-ball (via PGD) - Outer minimisation: train model parameters to be robust against these perturbations

The PGD-adversarial training algorithm:

For each mini-batch: 1. For each example, run $k$ steps of PGD to find $\mathbf{x}^{\text{adv}}$ 2. Compute the loss on $\mathbf{x}^{\text{adv}}$ (NOT on clean $\mathbf{x}$) 3. Update $\boldsymbol{\theta}$ via SGD

Key empirical findings: - PGD adversarial training provides robustness against all first-order attacks - Requires larger model capacity (wider networks needed) - Training is 5-10× slower (inner loop of PGD steps) - Clean accuracy drops (typically 5-15% on CIFAR-10) — robustness-accuracy trade-off - Robust models learn interpretable, human-aligned features (Engstrom et al., 2019)

4. Certified Robustness

Empirical defences (like adversarial training) can be broken by stronger attacks. Certified robustness provides a mathematical guarantee: for a given input $\mathbf{x}$, the model's prediction is provably constant within an $\epsilon$-ball.

Interval Bound Propagation (IBP): Propagate bounds through the network layer by layer. For a linear layer $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ with input bounds $\mathbf{l} \leq \mathbf{x} \leq \mathbf{u}$:

Lower bound: $\mathbf{z}{\text{lower}} = W^+ \mathbf{l} + W^- \mathbf{u} + \mathbf{b}$ Upper bound: $\mathbf{z}{\text{upper}} = W^+ \mathbf{u} + W^- \mathbf{l} + \mathbf{b}$

where $W^+ = \max(W, 0)$ and $W^- = \min(W, 0)$. For ReLU: bound the output by $[\max(0, z_{\text{lower}}), \max(0, z_{\text{upper}})]$.

The model is certified robust at $\mathbf{x}$ if the lower bound of the correct logit exceeds the upper bound of all other logits throughout the $\epsilon$-ball.

Trade-off: Certified methods provide guarantees but are looser (pessimistic bounds) and harder to train. Empirical robustness (adversarial training) provides practical security but without formal guarantees.

5. Why Adversarial Examples Exist

Linear explanation (Goodfellow et al.): In high dimensions, models are effectively linear in local neighbourhoods. A perturbation aligned with the sign of the gradient accumulates across dimensions, causing large changes.

Non-robust features (Ilyas et al., 2019): Datasets contain both robust features (human-aligned, generalise well) and non-robust features (predictive but fragile to small perturbations). Standard training exploits BOTH types; adversarial training forces the model to rely only on robust features. Non-robust features are real patterns in the data — just extremely brittle ones.

Key Terms

Adversarial examples
Adversarial training
Certified robustness
FGSM

Worked Examples

Example 1: FGSM Perturbation

Problem: For a binary classifier with logit $z = w_1 x_1 + w_2 x_2 + b$ where $\mathbf{w} = [2.0, -1.5], b = 0.1$, input $\mathbf{x} = [0.3, 0.7]$, true label $y=1$, loss $L = -y\log\sigma(z) - (1-y)\log(1-\sigma(z))$, compute the FGSM adversarial example with $\epsilon = 0.1$.

Solution:

$z = 2(0.3) + (-1.5)(0.7) + 0.1 = 0.6 - 1.05 + 0.1 = -0.35$

$\sigma(z) = 1/(1+e^{0.35}) = 1/(1+1.419) = 0.413$

$\nabla_\mathbf{x} L$: For logistic regression, $\partial L/\partial \mathbf{x} = (\sigma(z) - y)\mathbf{w}$

$\partial L/\partial \mathbf{x} = (0.413 - 1)[2.0, -1.5] = (-0.587)[2.0, -1.5] = [-1.174, 0.881]$

$\mathbf{x}' = \mathbf{x} + 0.1 \cdot \text{sign}([-1.174, 0.881]) = [0.3, 0.7] + 0.1[-1, 1] = [0.2, 0.8]$

Check: $z' = 2(0.2) + (-1.5)(0.8) + 0.1 = 0.4 - 1.2 + 0.1 = -0.7$. $\sigma(z') = 0.332$, prediction shifts more strongly toward class 0 — successful attack ($\ell_\infty$ distance = 0.1).

Example 2: PGD Iteration

Problem: Starting from $\mathbf{x}^0 = \mathbf{x}$ with $\epsilon = 0.3, \alpha = 0.1$, run 3 iterations of PGD ($\ell_\infty$) on the same example, tracking each step.

Solution:

Step 1: $\mathbf{x}^1 = \Pi_{0.3}(\mathbf{x} + 0.1 \cdot \text{sign}(\nabla L))$ = same computation as FGSM but with $\alpha=0.1$: $\mathbf{x}^1 = [0.3 - 0.117, 0.7 + 0.088] = [0.183, 0.788]$

Step 2: Recompute gradient at $\mathbf{x}^1$: $z = 2(0.183) + (-1.5)(0.788) + 0.1 = 0.366 - 1.182 + 0.1 = -0.716$ $\sigma(z) = 0.328$, $\partial L/\partial \mathbf{x} = (0.328 - 1)[2.0, -1.5] = [-1.344, 1.008]$ $\mathbf{x}^2 = \Pi_{0.3}([0.183 - 0.134, 0.788 + 0.101]) = \Pi_{0.3}([0.049, 0.888]) = [0.049, 0.888]$ (within bounds)

Step 3: $z = 2(0.049) + (-1.5)(0.888) + 0.1 = -1.134$, $\sigma(z) = 0.243$, gradient $=[-1.514, 1.136]$ $\mathbf{x}^3 = \Pi_{0.3}([0.049 - 0.151, 0.888 + 0.114]) = \Pi_{0.3}([-0.102, 1.002]) = [0.0, 1.0]$

PGD finds a stronger attack (prediction confidence drops from 0.413 to 0.243) than single-step FGSM (0.332).

Example 3: Robustness-Accuracy Trade-off

Problem: Standard ResNet-18 achieves 95% accuracy on CIFAR-10. After adversarial training with $\epsilon = 8/255$, it achieves 87% clean accuracy and 47% robust accuracy (against PGD-20). Explain why clean accuracy drops and whether the trade-off is fundamental.

Solution:

Standard training exploits non-robust features — patterns in the data that are predictive but fragile. These features contribute ~8% to clean accuracy. Adversarial training discards these features, forcing the model to rely only on robust features. The 8% drop represents the contribution of non-robust features to standard accuracy.

Whether this trade-off is fundamental is an open question. Tsipras et al. (2019) formalised it as an information-theoretic trade-off between standard and robust error. However, larger models, more data, and better training methods can partially close the gap — the trade-off may be an artifact of limited capacity rather than fundamental.

Practice Problems

Problem 1: For a linear classifier $f(\mathbf{x}) = \text{sign}(\mathbf{w}^\top\mathbf{x} + b)$, show that the minimal $\ell_2$ perturbation to flip the prediction is $\boldsymbol{\eta}^* = -y(\mathbf{w}^\top\mathbf{x} + b)\mathbf{w}/|\mathbf{w}|_2^2$.

Problem 2: PGD with random initialisation (R-PGD) starts from $\mathbf{x}^0 = \mathbf{x} + \mathbf{u}$ where $\mathbf{u} \sim \text{Uniform}(-\epsilon, \epsilon)^d$. Why does this produce stronger attacks than starting from $\mathbf{x}$?

Problem 3: Explain the gradient masking/obfuscation problem — why a defence that produces near-zero gradients can be trivially broken by black-box attacks.

Problem 4: For certified robustness via IBP on a 2-layer ReLU network $f(\mathbf{x}) = W_2 \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$ with $\mathbf{x} \in [\mathbf{l}, \mathbf{u}]$, derive the bounds on the output logits.

Problem 5: Adversarially trained models often learn more interpretable features. Propose an experiment to test whether a model uses "robust" vs "non-robust" features.

Answers (click to expand)

**Problem 1:** The decision boundary is $\mathbf{w}^\top\mathbf{x} + b = 0$. Signed distance: $d = y(\mathbf{w}^\top\mathbf{x} + b)/\|\mathbf{w}\|_2$. Perturbation normal to boundary: $\boldsymbol{\eta} = -\alpha y \mathbf{w}/\|\mathbf{w}\|_2$. For $\mathbf{x}' = \mathbf{x} + \boldsymbol{\eta}$ to cross boundary: $\mathbf{w}^\top(\mathbf{x} - \alpha y\mathbf{w}/\|\mathbf{w}\|_2) + b = 0$. Solve: $\alpha = y(\mathbf{w}^\top\mathbf{x} + b)/\|\mathbf{w}\|_2$. $\|\boldsymbol{\eta}\|_2 = \alpha$. So $\boldsymbol{\eta}^* = -y(\mathbf{w}^\top\mathbf{x} + b)\mathbf{w}/\|\mathbf{w}\|_2^2$. ✓ **Problem 2:** Random initialisation prevents the attack from getting stuck in flat regions of the loss landscape near $\mathbf{x}$. The PGD trajectory near a clean input might have zero gradient (correctly classified, confident), but random perturbation moves to a region with non-zero gradient. Also tests robustness across the entire $\epsilon$-ball, not just along one trajectory. **Problem 3:** Gradient masking makes gradient-based attacks fail by producing small or misleading gradients. Black-box attacks use finite differences, random search, or transfer from a surrogate model — they don't need gradients and bypass gradient obfuscation. A defence that only works against white-box attacks is not truly robust. **Problem 4:** For linear layer: $\mathbf{z}_1 = W_1\mathbf{x} + \mathbf{b}_1$. Bounds: $\mathbf{z}_{1,\text{low}} = W_1^+\mathbf{l} + W_1^-\mathbf{u} + \mathbf{b}_1$, $\mathbf{z}_{1,\text{up}} = W_1^+\mathbf{u} + W_1^-\mathbf{l} + \mathbf{b}_1$. For ReLU: $\mathbf{a}_{1,\text{low}} = \max(0, \mathbf{z}_{1,\text{low}})$, $\mathbf{a}_{1,\text{up}} = \max(0, \mathbf{z}_{1,\text{up}})$. For second linear layer: same pattern with $\mathbf{a}_1$ bounds. The output logit bounds are $[f_{\text{low}}, f_{\text{up}}]$. Certified if correct class lower bound exceeds all other classes' upper bounds. **Problem 5:** Train two identical architectures: one with standard training, one with adversarial training. Visualise maximally-activating inputs for individual neurons. Adversarially trained neurons should show coherent, human-interpretable patterns (edges, textures, object parts). Standard neurons may show noisy, incomprehensible patterns. Also test transfer: adversarial examples from the standard model should NOT transfer well to the robust model if the robust model discards non-robust features.

Summary

Adversarial examples exploit the locally-linear nature of neural networks in high dimensions — small $\ell_p$-bounded perturbations cause large output changes.
FGSM provides fast single-step attacks; PGD is the strongest first-order iterative attack and the standard for evaluating empirical robustness.
Adversarial training (minimax formulation) is the most effective empirical defence but introduces a robustness-accuracy trade-off and higher computational cost.
Certified robustness (IBP, convex relaxations) provides mathematical guarantees but is currently looser than empirical methods.
The existence of adversarial examples is linked to non-robust features in data — patterns that are predictive but extremely fragile.

Pitfalls

Gradient masking ≠ robustness: If your defense produces zero gradients but fails against black-box attacks, it's not robust.
Weak attack evaluation: Many papers claim robustness against weak attacks (FGSM) but fail against PGD with sufficient iterations.
Ignoring the accuracy trade-off: A model with 30% clean accuracy and 29% robust accuracy is trivial — it hasn't learned anything useful. Always report both.

Quiz

Question 1: The Fast Gradient Sign Method (FGSM) perturbs an input by:

A. Adding random Gaussian noise scaled by $\epsilon$ B. Taking a single step in the direction of $\text{sign}(\nabla_{\mathbf{x}} L(\boldsymbol{\theta}, \mathbf{x}, y))$, scaled by $\epsilon$ C. Minimizing the loss function with respect to the input D. Changing all pixels by a fixed offset of $+\epsilon$

Correct Answer: B

Explanation

- **If you chose A:** FGSM uses the sign of the gradient, not random noise — the perturbation is adversarial, not random. - **If you chose B:** Correct. FGSM is $\mathbf{x}' = \mathbf{x} + \epsilon \cdot \text{sign}(\nabla_{\mathbf{x}} L)$ — a single-step attack in the direction that maximally increases the loss. - **If you chose C:** That would be gradient *descent* on the input, which would minimize the loss (opposite of an attack). - **If you chose D:** FGSM perturbs each pixel by $\pm\epsilon$ individually based on the gradient sign, not a uniform offset.

Question 2: Why is Projected Gradient Descent (PGD) considered a stronger attack than single-step FGSM?

A. PGD uses a larger perturbation budget $\epsilon$ B. PGD takes multiple gradient steps with projection back to the $\epsilon$-ball, finding better constrained local maxima of the loss C. PGD uses a fundamentally different loss function D. PGD doesn't require gradient information

Correct Answer: B

Explanation

- **If you chose A:** Both FGSM and PGD operate within the same $\epsilon$-ball. PGD is stronger within the same budget. - **If you chose B:** Correct. PGD iteratively climbs the loss landscape, each step followed by projection $\Pi_{\mathbf{x}+\mathcal{S}}$ to stay within the $\epsilon$-ball. This finds much higher-loss perturbations than a single step. - **If you chose C:** Both typically use the same cross-entropy loss; the difference is iterative vs. single-step optimization. - **If you chose D:** PGD explicitly requires gradient information — it's a *first-order* attack.

Question 3: Adversarial training (Madry et al., 2018) is formulated as what type of optimization problem?

A. $\min_{\boldsymbol{\theta}} L(f_{\boldsymbol{\theta}}(\mathbf{x}), y)$ — standard empirical risk minimization B. $\min_{\boldsymbol{\theta}} \max_{|\boldsymbol{\eta}|\infty \leq \epsilon} L(f{\boldsymbol{\theta}}(\mathbf{x} + \boldsymbol{\eta}), y)$ — a minimax (saddle-point) problem C. $\max_{\boldsymbol{\theta}} \min_{|\boldsymbol{\eta}|\infty \leq \epsilon} L(f{\boldsymbol{\theta}}(\mathbf{x} + \boldsymbol{\eta}), y)$ D. $\min_{\boldsymbol{\theta}} \mathbb{E}[L(f_{\boldsymbol{\theta}}(\mathbf{x}), y)] + \lambda|\boldsymbol{\theta}|_2^2$ — standard L2 regularization

Correct Answer: B

Explanation

- **If you chose A:** Standard ERM trains only on clean examples and offers no adversarial robustness. - **If you chose B:** Correct. The inner maximization finds the strongest adversarial perturbation; the outer minimization trains the model to be robust against it. This is the canonical minimax formulation. - **If you chose C:** This swaps min and max — the attacker would be minimizing loss (helping the model), which makes no sense. - **If you chose D:** L2 regularization (weight decay) is unrelated to adversarial robustness; it doesn't protect against adversarial perturbations.

Question 4: What is the robustness-accuracy trade-off in adversarial training?

A. Robust models always achieve higher clean accuracy than standard models B. Adversarial training typically reduces clean accuracy by 5–15% while substantially improving robust accuracy C. Accuracy and robustness are perfectly positively correlated D. The trade-off only exists for small models and disappears at scale

Correct Answer: B

Explanation

- **If you chose A:** The opposite — adversarial training reduces clean accuracy. The model discards non-robust features that helped on clean data. - **If you chose B:** Correct. Standard models exploit non-robust (brittle) features that contribute to clean accuracy. Adversarial training forces reliance on robust features only, trading some clean accuracy for adversarial robustness. - **If you chose C:** They are typically inversely related — improving one often degrades the other. - **If you chose D:** The trade-off persists even in large models, though more capacity and data can reduce the gap.

Question 5: Interval Bound Propagation (IBP) provides what kind of robustness guarantee?

A. Empirical defense — it makes PGD attacks fail B. Certified (provable) bounds — it mathematically guarantees that predictions won't change within an $\epsilon$-ball C. Faster adversarial training through bound approximation D. Better gradient-based attacks

Correct Answer: B

Explanation

- **If you chose A:** IBP is a certification method, not an empirical defense. It provides mathematical guarantees, not just empirical resistance. - **If you chose B:** Correct. IBP propagates interval bounds $[\mathbf{l}, \mathbf{u}]$ through each network layer. If the lower bound of the correct logit exceeds all other logits' upper bounds, the prediction is *provably* constant within the $\epsilon$-ball. - **If you chose C:** IBP is for certification, not for speeding up training. Training with IBP is actually slower and harder. - **If you chose D:** IBP provides robustness certificates; it doesn't generate attacks.

Question 6: According to Ilyas et al. (2019), why do adversarial examples exist?

A. Neural networks have fundamental architectural flaws that cannot be fixed B. Standard training exploits non-robust features — genuine but extremely brittle patterns in the data that are predictive yet fragile to small perturbations C. GPUs introduce subtle numerical errors that break predictions D. The loss landscape of neural networks is perfectly flat

Correct Answer: B

Explanation

- **If you chose A:** Adversarial vulnerability is not a bug — it's a consequence of how models use available features. Adversarial training can substantially fix it. - **If you chose B:** Correct. Datasets contain both robust features (human-aligned, stable under small perturbations) and non-robust features (predictive but fragile). Standard training uses both; adversarial training forces reliance on robust features only. - **If you chose C:** Numerical precision is not the cause — adversarial examples persist even with high-precision computation. - **If you chose D:** The loss landscape is highly non-convex with sharp directions that enable adversarial perturbations.

Next Steps

From attacks on models to models that forget: 25-09 — Continual Learning.

Progress

Phases

25-08 — Adversarial Robustness

Learning Objectives

Core Content

1. Adversarial Examples

2. Attack Methods

3. Adversarial Training

4. Certified Robustness

5. Why Adversarial Examples Exist

Key Terms

Worked Examples

Example 1: FGSM Perturbation

Example 2: PGD Iteration

Example 3: Robustness-Accuracy Trade-off

Practice Problems

Summary

Pitfalls

Quiz

Next Steps