25-07 — Differential Privacy
Phase: 25 — Frontiers & Active Research Areas Subject: 25-07 Prerequisites: 12 (Statistics), 20 (Training & Fine-tuning), 25-06 (Federated Learning) Next subject: 25-08 — Adversarial Robustness
Learning Objectives
By the end of this subject, you will be able to:
- Define $\varepsilon$-differential privacy and explain its semantic meaning
- Apply the Gaussian mechanism to queries with bounded sensitivity
- Derive the DP-SGD algorithm and analyse its privacy guarantees via the moments accountant
- Explain the privacy-utility trade-off and compose privacy budgets across multiple queries
- Distinguish between local and central differential privacy models
Core Content
1. What Is Differential Privacy?
Differential privacy (DP) is a mathematical framework for quantifying the privacy leakage of an algorithm. Informally: an algorithm is differentially private if its output distribution is nearly identical whether or not any single individual's data is included in the input.
Formal definition ($\varepsilon$-DP): A randomised mechanism $\mathcal{M}: \mathcal{D} \to \mathcal{R}$ satisfies $\varepsilon$-differential privacy if for all neighbouring datasets $D, D'$ (differing by exactly one record) and all measurable subsets $S \subseteq \mathcal{R}$:
$$P(\mathcal{M}(D) \in S) \leq e^\varepsilon \cdot P(\mathcal{M}(D') \in S)$$
⚠️ CRITICAL: This bounds the maximum influence any single individual can have on the output. An adversary observing the output cannot confidently infer whether a specific person's data was included — the log-likelihood ratio is bounded by $\varepsilon$.
Semantic meaning: If $\varepsilon = 0.1$, the output probabilities change by at most ~10.5% ($e^{0.1} \approx 1.105$) when a single record is added or removed. A smaller $\varepsilon$ provides stronger privacy.
($\varepsilon, \delta$)-DP (approximate DP): A relaxed definition allowing a small failure probability $\delta$:
$$P(\mathcal{M}(D) \in S) \leq e^\varepsilon \cdot P(\mathcal{M}(D') \in S) + \delta$$
Typical values: $\varepsilon \in [0.1, 10]$, $\delta = 10^{-5}$ or $10^{-6}$ (much smaller than $1/n$ where $n$ is dataset size). $\delta$ should be cryptographically small — it's the probability that the pure $\varepsilon$-DP guarantee fails.
2. Sensitivity and the Gaussian Mechanism
The $\ell_2$-sensitivity of a query $f: \mathcal{D} \to \mathbb{R}^d$ is:
$$\Delta_2(f) = \max_{D \sim D'} |f(D) - f(D')|_2$$
where $D \sim D'$ denotes neighbouring datasets. This is the maximum change in the query output when a single record is modified.
Gaussian Mechanism: For a query $f$ with $\ell_2$-sensitivity $\Delta_2$, the mechanism:
$$\mathcal{M}(D) = f(D) + \mathcal{N}(0, \sigma^2 \Delta_2^2 \mathbf{I})$$
satisfies $(\varepsilon, \delta)$-DP when $\sigma \geq \sqrt{2\log(1.25/\delta)}/\varepsilon$.
Example — counting query: $f(D) = \frac{1}{n}\sum_{i=1}^n x_i$ where $x_i \in [0, 1]$. Sensitivity: if one record changes from 0 to 1, the average changes by $1/n$, so $\Delta_2 = 1/n$. Noise scale $\sigma$ needed is inversely proportional to $n$ — larger datasets require less noise for the same privacy.
3. DP-SGD (Differentially Private Stochastic Gradient Descent)
DP-SGD (Abadi et al., 2016) is the standard algorithm for training neural networks with differential privacy guarantees. It modifies standard SGD by:
- Per-sample gradient clipping: For each example $(x_i, y_i)$ in a batch, clip the gradient to norm $C$:
$$\tilde{\mathbf{g}}_i = \mathbf{g}_i \cdot \min\left(1, \frac{C}{|\mathbf{g}_i|_2}\right)$$
- Aggregate and add noise:
$$\tilde{\mathbf{g}} = \frac{1}{B}\left(\sum_i \tilde{\mathbf{g}}_i + \mathcal{N}(0, \sigma^2 C^2 \mathbf{I})\right)$$
- Update: $\theta \leftarrow \theta - \eta \tilde{\mathbf{g}}$
The clipping bounds each example's influence (sensitivity = $C$), and the Gaussian noise masks the contribution of any single example.
Sensitivity analysis: Since each per-sample gradient is clipped to norm $C$, the $\ell_2$-sensitivity of the sum over a batch is bounded by $C$ (changing one example changes the sum by at most $C$). Noise proportional to $\sigma C$ is added before averaging.
Moments Accountant: Tracking the cumulative privacy loss over $T$ training steps is non-trivial because the simple composition theorem ($\varepsilon_{\text{total}} = T\varepsilon$) is loose. The moments accountant provides tighter bounds by tracking the moment-generating function of the privacy loss random variable:
$$\alpha_{\mathcal{M}}(\lambda) = \max_{D, D'} \log \mathbb{E}_{\mathcal{M}(D)}\left[\left(\frac{P(\mathcal{M}(D)=o)}{P(\mathcal{M}(D')=o)}\right)^\lambda\right]$$
For subsampled Gaussian mechanism with sampling probability $q = B/n$ and noise $\sigma$, the moments bound at order $\lambda$ is well-approximated (for moderate $\lambda$) as:
$$\alpha(\lambda) \lesssim \frac{q^2 \lambda(\lambda+1)}{(1-q)\sigma^2}$$
The total privacy cost: $\varepsilon = \min_\lambda (\alpha_{\text{total}}(\lambda) + \log(1/\delta))/(\lambda-1)$.
4. Privacy Budget Composition
Basic composition: Running $k$ mechanisms with guarantees $(\varepsilon_1, \delta_1), \ldots, (\varepsilon_k, \delta_k)$ yields $(\sum \varepsilon_i, \sum \delta_i)$-DP.
Advanced composition: For $k$ folds of $(\varepsilon, \delta)$-DP mechanisms, the total guarantee is approximately $(\tilde{\varepsilon}, k\delta + \delta')$ where:
$$\tilde{\varepsilon} = \varepsilon\sqrt{2k\log(1/\delta')} + k\varepsilon(e^\varepsilon - 1)$$
For small $\varepsilon$: $\tilde{\varepsilon} \approx O(\varepsilon\sqrt{k})$, significantly better than $O(k\varepsilon)$.
This is why practical DP training can use thousands of SGD steps while maintaining reasonable $\varepsilon$ — the subsampling amplification and advanced composition work together.
5. Central vs Local DP
Central DP (trusted curator): A trusted server collects raw data, applies DP mechanism, and releases noisy results. Users must trust the server. This is the model used in the US Census and by large tech companies.
Local DP (LDP): Each user randomises their own data before sending it. The server never sees raw data. This requires more noise per user since there's no trusted aggregator. Google's RAPPOR and Apple's data collection use LDP.
| Property | Central DP | Local DP |
|---|---|---|
| Trust model | Trusted curator | No trust required |
| Noise per user | Low ($O(1/n)$ total) | High ($O(1)$ per user) |
| Utility at scale | Excellent | Requires many users |
| Example | DP-SGD training | RAPPOR, Apple differential privacy |
Key Terms
- DP-SGD
- Differential privacy
- Gaussian mechanism
- Local vs central DP
- Privacy budget
Worked Examples
Example 1: Gaussian Mechanism Privacy Guarantee
Problem: A query has sensitivity $\Delta_2 = 0.01$ and noise $\sigma = 1.5$ is added. What $(\varepsilon, \delta)$ does this provide for $\delta = 10^{-5}$?
Solution:
For the Gaussian mechanism: $\varepsilon = \frac{\Delta_2}{\sigma}\sqrt{2\log(1.25/\delta)}$
$\sqrt{2\log(1.25/10^{-5})} = \sqrt{2\log(125000)} = \sqrt{2 \cdot 11.736} = \sqrt{23.472} = 4.845$
$\varepsilon = 0.01 \cdot 4.845 / 1.5 = 0.0485 / 1.5 = 0.0323$
This provides $(0.0323, 10^{-5})$-DP — quite strong privacy. Larger $\sigma$ or smaller $\Delta_2$ gives smaller $\varepsilon$ (stronger privacy).
Example 2: DP-SGD Privacy Budget
Problem: Training for $T=600$ steps with batch size $B=256$, dataset size $n=60000$, clipping norm $C=1.0$, noise $\sigma=1.0$. Estimate $\varepsilon$ at $\delta=10^{-5}$ using moments accountant approximation.
Solution:
$q = B/n = 256/60000 \approx 0.00427$
At order $\lambda=10$: $\alpha(10) \approx \frac{q^2 \cdot 10 \cdot 11}{(1-q)\sigma^2} \approx \frac{0.0000182 \cdot 110}{0.996 \cdot 1} \approx \frac{0.00200}{0.996} \approx 0.00201$
Total: $\alpha_{\text{total}}(10) = 600 \cdot 0.00201 = 1.206$
$\varepsilon \approx 1.206/10 + \log(10^5)/10 = 0.1206 + 1.151 = 1.27$
This gives $(1.27, 10^{-5})$-DP — moderate privacy, suitable for many applications.
If we increase noise to $\sigma=3.0$: $\alpha(10) \approx 0.00201/9 = 0.000223$, total $= 0.134$, $\varepsilon \approx 0.0134 + 1.151 = 1.16$ — modest improvement because the $\log(1/\delta)/\lambda$ term dominates for small $\varepsilon$.
Example 3: Laplace vs Gaussian for Counting
Problem: A counting query on $n=1000$ data points where each $x_i \in {0,1}$ has sensitivity $\Delta_1 = 1$ (for the sum). Compare Laplace mechanism (for pure $\varepsilon$-DP) vs Gaussian mechanism (for $(\varepsilon, \delta)$-DP) for $\varepsilon=0.5, \delta=10^{-5}$.
Solution:
Laplace mechanism: Add $\text{Lap}(\Delta_1/\varepsilon) = \text{Lap}(1/0.5) = \text{Lap}(2)$ noise to the sum. Standard deviation of Laplace(2) is $\sqrt{2} \cdot 2 = 2.828$.
Gaussian mechanism: $\sigma = \Delta_2 \sqrt{2\log(1.25/\delta)}/\varepsilon$. Here $\Delta_2 = 1$ (same as $\Delta_1$ for a scalar).
$\sigma = 1 \cdot 4.845 / 0.5 = 9.69$
Gaussian noise std dev = $9.69$, over 3× larger than Laplace. This is the cost of the $\delta$ relaxation: pure $\varepsilon$-DP (Laplace) is more efficient for simple queries. The Gaussian mechanism's advantage is that it composes better over multiple steps and handles vector-valued queries naturally.
Practice Problems
Problem 1: Show that if $\mathcal{M}$ satisfies $\varepsilon$-DP, then for any two neighbouring datasets $D, D'$, the privacy loss random variable $L = \log(P(\mathcal{M}(D)=o)/P(\mathcal{M}(D')=o))$ satisfies $|L| \leq \varepsilon$ with probability 1.
Problem 2: For DP-SGD with $q=0.01, \sigma=2.0, T=1000$, compare $\varepsilon$ using simple composition vs advanced composition vs moments accountant. Which gives the tightest bound?
Problem 3: A hospital wants to release the average age of patients with $\varepsilon=0.1$. Ages are in $[0, 120]$. With 10,000 patients, how much noise is needed (Gaussian mechanism, $\delta=10^{-6}$)?
Problem 4: Explain the "privacy-utility trade-off" — why can't we have both perfect privacy ($\varepsilon=0$) and perfect utility?
Problem 5: Why does subsampling amplify privacy? Derive the amplification factor for the subsampled Gaussian mechanism with sampling probability $q$.
Answers (click to expand)
**Problem 1:** By $\varepsilon$-DP: $P(\mathcal{M}(D)=o) \leq e^\varepsilon P(\mathcal{M}(D')=o)$. Taking logs: $\log(P(D)/P(D')) \leq \varepsilon$. By symmetry (swap $D$ and $D'$): $P(D') \leq e^\varepsilon P(D)$, so $\log(P(D')/P(D)) \leq \varepsilon$, hence $-\varepsilon \leq \log(P(D)/P(D'))$. Combined: $|L| \leq \varepsilon$ with probability 1 for pure $\varepsilon$-DP. For $(\varepsilon,\delta)$-DP, this only holds with probability $\geq 1-\delta$. **Problem 2:** Simple composition: $\varepsilon_{\text{simple}} = 1000\varepsilon_{\text{single}}$. Advanced: $\varepsilon_{\text{adv}} \approx O(\varepsilon_{\text{single}}\sqrt{1000}) \approx 31.6\varepsilon_{\text{single}}$. Moments accountant: finite numerical optimisation over $\lambda$, typically 5-20× tighter than simple composition. Moments accountant gives the tightest bounds by tracking the precise moment-generating function. **Problem 3:** $\Delta_2 = 120/10000 = 0.012$ (sensitivity of the average). $\sigma = \Delta_2 \sqrt{2\log(1.25/\delta)}/\varepsilon$. For $\delta=10^{-6}$: $\sqrt{2\log(1.25/10^{-6})} = \sqrt{2 \cdot 14.039} = 5.30$. $\sigma = 0.012 \cdot 5.30 / 0.1 = 0.636$. Gaussian noise with std dev $0.636$ years on the average age — reasonable utility with strong privacy. **Problem 4:** At $\varepsilon=0$: $P(\mathcal{M}(D) \in S) = P(\mathcal{M}(D') \in S)$ for ALL $S$, meaning $\mathcal{M}(D)$ and $\mathcal{M}(D')$ have identical distributions for ALL neighbouring pairs. The output must be completely independent of the data — an algorithm that ignores its input entirely. Such an algorithm has zero utility. As $\varepsilon$ increases, the output can depend more on the data, improving utility at the cost of privacy. The trade-off is fundamental: DP bounds the mutual information between output and individual records. **Problem 5:** Subsampling amplifies privacy because an adversary doesn't know if a specific record was included in the sampled batch. Even if not sampled, the mechanism adds noise to the query over the sampled subset — the adversary can't tell if a record affected the result or was absent by chance. For sampling probability $q$, approximate amplification: the DP guarantee improves by roughly factor $q$: $\varepsilon_{\text{subsampled}} \approx \log(1 + q(e^\varepsilon - 1)) \approx q\varepsilon$ for small $\varepsilon$.Summary
- Differential privacy mathematically bounds the influence of any single record on an algorithm's output via the privacy parameter $\varepsilon$.
- Gaussian mechanism adds noise calibrated to query sensitivity — cleaner composition properties than Laplace for iterative algorithms.
- DP-SGD clips per-sample gradients and adds Gaussian noise to ensure training does not memorise individual training examples.
- Privacy budget composes across operations — the moments accountant provides tight bounds for iterative mechanisms with subsampling.
- Local vs central DP represents a fundamental trust trade-off: stronger privacy model (LDP) vs better utility (central DP).
Pitfalls
- $\varepsilon$ is not a "privacy switch": $\varepsilon=8$ does not mean "8% privacy lost" — it bounds the log-likelihood ratio. $e^8 \approx 2981$, a 3000× change in output probability, which is essentially no meaningful privacy. Typical useful range is $\varepsilon \in [0.1, 5]$.
- $\delta$ must be cryptographically small: If $\delta > 1/n$, a mechanism could randomly leak the entire dataset and blame it on the $\delta$ failure probability. Always ensure $\delta \ll 1/n$.
- Gradient clipping is critical: Without clipping bounds, the sensitivity is unbounded and DP is impossible with finite noise. The clipping norm $C$ is a crucial hyperparameter — too small loses signal, too large requires more noise.
Quiz
Q1: $\varepsilon$-DP requires that for neighbouring datasets $D, D'$: $P(\mathcal{M}(D) \in S) \leq e^\varepsilon \cdot P(\mathcal{M}(D') \in S)$. What does $\varepsilon=0$ imply?
A) Perfect privacy — $\mathcal{M}$ is independent of its input B) No privacy guarantee C) The mechanism is deterministic D) The output is always the same
Q2: In DP-SGD, why is gradient clipping necessary?
A) To speed up training B) To bound the sensitivity of the gradient computation C) To reduce memory usage D) To improve accuracy
Q3: The moments accountant provides tighter privacy bounds than simple composition because:
A) It uses a different definition of privacy B) It tracks the moment-generating function of privacy loss rather than a simple sum C) It assumes the data is IID D) It uses a smaller $\delta$
Q4: For the Gaussian mechanism, $\sigma = \Delta_2 \sqrt{2\log(1.25/\delta)}/\varepsilon$. To improve privacy (lower $\varepsilon$) without changing $\sigma$, you would:
A) Increase $\delta$ B) Decrease $\Delta_2$ (reduce query sensitivity) C) Increase the dataset size D) Both B and C
Q5: Local differential privacy (LDP) is weaker than central DP for the same $\varepsilon$ because:
A) Each user adds noise independently — no trusted aggregator to average out noise B) LDP uses a different $\varepsilon$ definition C) LDP does not allow subsampling D) LDP requires larger datasets
Answers (click to expand)
**Q1: A** — $\varepsilon=0$ means the distributions are identical, so output is independent of input. **Q2: B** — Without clipping, a single example could have arbitrarily large gradient → unbounded sensitivity. **Q3: B** — The moments accountant captures the precise privacy loss distribution rather than using loose union bounds. **Q4: B** — Lower sensitivity → less noise needed for same $\varepsilon$, OR same noise gives smaller $\varepsilon$. **Q5: A** — No trusted curator to average noise across users; each user's noise is independent and doesn't cancel.Next Steps
Now apply these concepts to the security of neural networks: 25-08 — Adversarial Robustness, where we study how imperceptible perturbations can fool models and how to defend against them.