25-06 — Federated Learning
Phase: 25 — Frontiers & Active Research Areas Subject: 25-06 Prerequisites: 20 (Training & Fine-tuning), 25-07 (Differential Privacy — concurrent), 14 (Optimization) Next subject: 25-07 — Differential Privacy
Learning Objectives
By the end of this subject, you will be able to:
- Formulate the federated learning objective and explain how it differs from centralised training
- Derive the Federated Averaging (FedAvg) algorithm and its convergence properties
- Analyse the effects of non-IID data and heterogeneity on FL convergence
- Explain communication-efficiency techniques including gradient compression and quantisation
- Describe how differential privacy integrates with federated learning for user-level privacy
Core Content
1. The Federated Learning Problem
Federated learning (FL) trains a shared model across $K$ decentralised clients, each with its own local dataset $D_k$, without centralising the data:
$$\min_{\mathbf{w}} F(\mathbf{w}) = \sum_{k=1}^K \frac{n_k}{n} F_k(\mathbf{w})$$
where $n_k = |D_k|$, $n = \sum_k n_k$, and $F_k(\mathbf{w}) = \frac{1}{n_k} \sum_{i \in D_k} \ell(\mathbf{w}; x_i, y_i)$ is the local empirical loss on client $k$.
⚠️ CRITICAL: Data NEVER leaves the client devices. Only model updates (gradients or weight deltas) are communicated. This is the core privacy property of FL, though privacy guarantees require additional mechanisms (see 25-07).
Key challenges distinguishing FL from distributed training: - Non-IID data: Each client's data distribution may be radically different - Unbalanced data: Clients have vastly different amounts of data - Limited communication: Clients may be on slow or intermittent connections - System heterogeneity: Clients have different compute capabilities - Privacy: Raw gradients can leak information about training data
2. Federated Averaging (FedAvg)
The foundational FL algorithm, introduced by McMahan et al. (2017):
Algorithm (FedAvg): 1. Server initialises global model $\mathbf{w}^0$ 2. For each round $t = 1, 2, \ldots, T$: a. Server sends $\mathbf{w}^{t-1}$ to a random subset $S_t$ of clients b. Each client $k \in S_t$ performs $E$ local epochs of SGD: $$\mathbf{w}k^{t,0} = \mathbf{w}^{t-1}$$ For $e = 0, \ldots, E-1$: $$\mathbf{w}_k^{t,e+1} = \mathbf{w}_k^{t,e} - \eta \nabla F_k(\mathbf{w}_k^{t,e}; \xi_k^{t,e})$$ c. Clients send local updates $\Delta_k^t = \mathbf{w}_k^{t,E} - \mathbf{w}^{t-1}$ back to server d. Server aggregates: $\mathbf{w}^t = \mathbf{w}^{t-1} + \sum{k \in S_t} \frac{n_k}{n_{S_t}} \Delta_k^t$
Equivalently, the server computes the weighted average of local models:
$$\mathbf{w}^t = \sum_{k \in S_t} \frac{n_k}{n_{S_t}} \mathbf{w}_k^{t,E}$$
Special cases: - $E = 1$, full participation → Federated SGD (same as distributed SGD) - $E > 1$ → Clients perform multiple local updates, reducing communication rounds but introducing "client drift"
3. Client Drift and Convergence
When $E > 1$, each client's local updates follow its own data distribution. Since data is non-IID, local models drift away from each other and from the "true" global optimum:
$$\mathbb{E}[|\mathbf{w}_k^{t,e} - \mathbf{w}^*|^2] \text{ grows with } e$$
Convergence analysis (simplified for strongly convex $F$):
With $\eta_t = \frac{1}{\mu t}$, FedAvg achieves:
$$\mathbb{E}[F(\mathbf{w}^T) - F^*] \leq O\left(\frac{G^2}{\mu T} + \frac{\sigma^2}{\mu n T} + \frac{(E-1)^2 G^2 \Gamma}{\mu T}\right)$$
where $G$ is the gradient bound, $\sigma^2$ is the stochastic gradient variance, and $\Gamma$ quantifies data heterogeneity (how non-IID the clients are).
The third term is the client drift penalty — it grows with $(E-1)^2\Gamma$, showing that more local epochs amplify the negative effects of non-IID data.
FedProx (Li et al., 2020) addresses this by adding a proximal term to local objectives:
$$\min_{\mathbf{w}_k} F_k(\mathbf{w}_k) + \frac{\mu}{2}|\mathbf{w}_k - \mathbf{w}^{t-1}|^2$$
This penalises local models from drifting too far from the global model, making convergence more robust to heterogeneity.
4. Communication Efficiency
Communication is often the bottleneck in FL. Key techniques:
Gradient compression: - Sparsification: Send only the top-$k$ largest gradient components (by magnitude), zeroing the rest - Quantisation: Reduce gradient precision from 32-bit float to 8-bit or even 1-bit (sign only) - QSGD (Alistarh et al., 2017): Stochastic quantisation with error feedback for unbiased estimates
Error feedback (EF-SGD): When gradients are compressed, accumulate the compression error locally and add it to the next round's gradient before compression — this prevents bias from accumulating:
$$\mathbf{e}_k^{t} = \mathbf{g}_k^t + \mathbf{e}_k^{t-1} - C(\mathbf{g}_k^t + \mathbf{e}_k^{t-1})$$
where $C$ is the compression operator and $\mathbf{e}_k$ is the local error accumulator.
5. Heterogeneous Clients and Stragglers
Straggler problem: Some clients are slow (low compute, poor network). Waiting for all clients makes rounds slow.
Solutions: - Partial participation: Only sample a subset of clients per round - Asynchronous FL: Server updates model as soon as any client sends an update (not waiting for all) - Tiered FL: Group clients by capability; faster clients do more local work
Personalised FL: Instead of one global model, learn both shared and personalised components:
$$\mathbf{w}k = \mathbf{w}{\text{shared}} + \mathbf{v}_k$$
where $\mathbf{v}k$ is a client-specific parameter vector (much smaller than $\mathbf{w}{\text{shared}}$). This handles strong data heterogeneity by allowing per-client adaptation.
Key Terms
- Client heterogeneity
- Communication efficiency
- FedAvg
- FedProx
- Federated learning
- Server initialises
Worked Examples
Example 1: FedAvg Weighted Average
Problem: Server coordinates 4 clients with dataset sizes $n_1=500, n_2=300, n_3=1000, n_4=200$. After local training, model weights are $\mathbf{w}_1=[2.1, 3.0], \mathbf{w}_2=[1.9, 3.2], \mathbf{w}_3=[2.3, 2.8], \mathbf{w}_4=[2.0, 3.1]$. Compute the FedAvg aggregated model.
Solution:
Total data points: $n = 500+300+1000+200 = 2000$
Weights: $p_1=500/2000=0.25, p_2=0.15, p_3=0.5, p_4=0.1$
$$\mathbf{w}_{\text{FedAvg}} = \sum_k p_k \mathbf{w}_k = 0.25[2.1, 3.0] + 0.15[1.9, 3.2] + 0.5[2.3, 2.8] + 0.1[2.0, 3.1]$$
$$= [0.525+0.285+1.15+0.2, 0.75+0.48+1.4+0.31] = [2.16, 2.94]$$
Client 3 (1000 points) dominates because it has half the data — the weighted average naturally gives more influence to data-rich clients.
Example 2: Client Drift Quantification
Problem: For a 1D linear regression $F_k(w) = \frac{1}{2}(w - a_k)^2$ with heterogeneous client optima $a_1=1, a_2=3, a_3=5$. Global optimum is $w^*=3$. With $E=10$ local epochs and learning rate $\eta=0.1$, compute the client drift for client 1 if the global model starts at $w=3$.
Solution:
Client 1's local gradient: $\nabla F_1(w) = w - 1$
Starting from $w_1^0 = 3$, running 10 local SGD steps:
$w_1^1 = 3 - 0.1(3-1) = 3 - 0.2 = 2.8$ $w_1^2 = 2.8 - 0.1(2.8-1) = 2.8 - 0.18 = 2.62$ $w_1^{10} = 1 + (3-1)(1-0.1)^{10} = 1 + 2 \cdot 0.9^{10} = 1 + 2 \cdot 0.349 = 1.698$
Drift: $|w_1^{10} - w^*| = |1.698 - 3| = 1.302$
Client 1's local model has drifted toward its local optimum $a_1=1$, contributing a biased update to the FedAvg aggregate. This is the client drift problem.
With FedProx ($\mu=0.5$), the local objective becomes $\frac{1}{2}(w-1)^2 + \frac{0.5}{2}(w-3)^2$, with optimum at $w = (1 + 0.5 \cdot 3)/(1 + 0.5) = 1.667$, reducing drift.
Example 3: DP-FedAvg Privacy Budget
Problem: FedAvg runs for $T=1000$ rounds with $K=100$ clients, sampling fraction $q=0.1$ per round. Each client clips its gradient to norm $C=1.0$ and adds Gaussian noise with $\sigma=2.0$. Using the moments accountant, estimate $(\varepsilon, \delta=10^{-5})$ after training.
Solution:
Sampling probability $q=0.1$, noise multiplier $z = \sigma/C = 2.0$. For the Gaussian mechanism with subsampling, the privacy loss at order $\lambda$ is approximately:
$$\alpha(\lambda) \approx \frac{q^2 \lambda(\lambda+1)}{(1-q)\sigma^2}$$
For $\lambda=10$: $\alpha(10) \approx \frac{0.01 \cdot 10 \cdot 11}{0.9 \cdot 4} = \frac{1.1}{3.6} \approx 0.306$ per round.
Total privacy loss: $\alpha_{\text{total}}(10) = 1000 \cdot 0.306 = 306$
Using the moments accountant: $\varepsilon \approx \alpha_{\text{total}}(\lambda)/\lambda + \log(1/\delta)/\lambda$
$\varepsilon \approx 306/10 + \log(10^5)/10 = 30.6 + 11.5/10 = 30.6 + 1.15 \approx 31.8$
This is a very high $\varepsilon$ — practical DP-FL typically requires more aggressive noise ($\sigma \geq 5$) or fewer rounds to achieve meaningful privacy ($\varepsilon < 10$).
Practice Problems
Problem 1: For $K=5$ clients with equal data, compute FedAvg with $E=3$ local epochs and learning rate $\eta=0.1$ for a quadratic $F_k(w) = \frac{1}{2}(w - k)^2$, starting from $w^0=0$. Show one round.
Problem 2: Explain why $E=1$ (FedSGD) avoids client drift but requires more communication rounds, and why $E \gg 1$ reduces communication but amplifies drift.
Problem 3: For QSGD with $s$ quantisation levels, the variance of the compressed gradient is bounded by $\min(d/s^2, \sqrt{d}/s)|\mathbf{g}|^2$. Explain the two regimes and which dominates for high compression ($s=1$).
Problem 4: Differential privacy in FL provides user-level privacy. Explain what "user-level" means and why it's stronger than sample-level DP.
Problem 5: Design a FedAvg variant that handles clients with dramatically different dataset sizes (e.g., 10 examples vs 100,000). What weighting scheme would you use?
Answers (click to expand)
**Problem 1:** Round $t=0$: Global $w^0=0$. Each client $k$ starts from $w_k^0=0$ and does 3 local steps toward $a_k=k$: GD update: $w_k^{e+1} = w_k^e - \eta(w_k^e - k) = (1-\eta)w_k^e + \eta k$ $k=1$: $w_1^3 = (1-0.1)^3 \cdot 0 + (1-(1-0.1)^3)\cdot 1 = 0.271 \cdot 0 + 0.729 \cdot 1 = 0.729$ $k=2$: $w_2^3 = 0.729 \cdot 2 = 1.458$ $k=3$: $w_3^3 = 0.729 \cdot 3 = 2.187$ $k=4$: $w_4^3 = 0.729 \cdot 4 = 2.916$ $k=5$: $w_5^3 = 0.729 \cdot 5 = 3.645$ FedAvg (equal weights): $w^1 = (0.729+1.458+2.187+2.916+3.645)/5 = 11.0 \cdot 0.729/5 = 2.187$ The true global optimum is $w^* = 3$. One round has moved from 0 to 2.187 — good progress but not convergence due to client drift bias. **Problem 2:** With $E=1$ (FedSGD), client models don't drift because each client takes exactly one gradient step from the same starting point — it's equivalent to distributed SGD with gradient averaging. Communication cost: $T$ rounds × $K$ clients. With $E \gg 1$, communication rounds decrease by factor $E$, but local models drift toward their individual optima, biasing the aggregate. The optimal $E$ balances these effects — typically $E=5-20$ in practice. **Problem 3:** For large $d$, $d/s^2$ dominates — variance scales with dimension and is large for low bit counts. For small $d$ or large $s$, $\sqrt{d}/s$ dominates. At $s=1$ (1-bit SGD = sign only), variance is $\min(d, \sqrt{d}) \approx d$ for $d \geq 1$, severely degraded. Error feedback partially compensates by accumulating compression errors. **Problem 4:** User-level DP bounds the privacy loss over ALL examples from a single user/client. Even if a user contributes 1000 data points, the privacy guarantee protects against inferring whether that user participated at all (and, by extension, any of their samples). Sample-level DP protects individual data points. User-level DP is stronger because a user's entire dataset being present or absent must be indistinguishable — not just individual samples. **Problem 5:** Equal weighting ignores data quantity differences. Weighting by $n_k$ (standard FedAvg) gives large clients disproportionate influence — appropriate when data quality is uniform. Inverse-variance weighting $\propto 1/\sigma_k^2$ up-weights clients with lower gradient variance. A robust approach: cap $n_k$ (e.g., clip to 5000 max) then weight by $\sqrt{n_k}$ — gives some weight to data quantity without letting one client dominate, while still favouring clients with more data.Summary
- Federated learning trains a shared model across distributed clients without centralising data, using local computation and periodic model aggregation.
- FedAvg with $E > 1$ local epochs reduces communication but introduces client drift — heterogeneous data causes local models to diverge, biasing the aggregate.
- Communication efficiency techniques (gradient sparsification, quantisation, error feedback) reduce bandwidth at the cost of gradient fidelity.
- Client heterogeneity (non-IID data, varying capacities, stragglers) is the central challenge — addressed by proximal regularisation, personalised FL, and partial participation.
Pitfalls
- Assuming IID data: Standard FedAvg convergence analysis assumes bounded gradient dissimilarity. Real FL data is often wildly non-IID and no single global model works well — always check local model performance.
- Ignoring privacy leakage: Raw model updates can be inverted to reconstruct training data. FedAvg alone does NOT provide differential privacy.
- Communication-computation trade-off: Many local epochs reduce communication but may not converge at all with very non-IID data. Monitor divergence.
Quiz
Question 1: What is the primary mathematical difference between federated learning and standard distributed training?
A. Federated learning uses different optimizers B. Data never leaves client devices — only model updates (gradients or weight deltas) are communicated C. Federated learning always requires larger models D. Federated learning only works with IID data
Correct Answer: B
Explanation
- **If you chose A:** The optimizer (typically SGD) is the same — the difference is in how data is accessed. - **If you chose B:** Correct. This is the defining property: raw data stays on client devices; only model updates are shared. - **If you chose C:** Model size is independent of whether training is federated or centralized. - **If you chose D:** Federated learning is specifically designed to handle non-IID data, though it's challenging.Question 2: In FedAvg, what is the effect of increasing the number of local epochs $E$?
A. Reduces communication rounds but amplifies client drift with non-IID data B. Increases communication rounds and reduces accuracy in all cases C. Has no effect on convergence behavior D. Only affects the learning rate, not convergence
Correct Answer: A
Explanation
- **If you chose A:** Correct. Larger $E$ means fewer server synchronizations (reduced communication cost), but local models drift further toward their individual data distributions, biasing the aggregate. - **If you chose B:** Larger $E$ *reduces* communication rounds, not increases them. - **If you chose C:** $E$ is a critical hyperparameter that directly affects convergence through the client drift penalty $O((E-1)^2\Gamma)$. - **If you chose D:** $E$ controls the number of local updates between synchronizations, which is independent of the learning rate.Question 3: What does the proximal term in FedProx penalize?
A. Large model weights (like L2 regularization) B. Local models drifting too far from the global server model C. High communication frequency between clients and server D. Differences in dataset sizes across clients
Correct Answer: B
Explanation
- **If you chose A:** FedProx adds $\frac{\mu}{2}\|\mathbf{w}_k - \mathbf{w}^{t-1}\|^2$, which penalizes deviation from the global model, not weight magnitude. - **If you chose B:** Correct. The proximal term keeps each client's local model close to the global model $\mathbf{w}^{t-1}$, mitigating client drift from data heterogeneity. - **If you chose C:** Communication frequency is controlled by the number of local epochs $E$, not the proximal term. - **If you chose D:** Dataset size differences are handled by the weighted averaging in FedAvg, not the proximal term.Question 4: In QSGD with $s$ quantization levels, what happens to gradient variance as $s$ increases?
A. Variance increases — more levels add more noise B. Variance decreases — more quantization levels mean more precision and lower variance C. Variance stays constant regardless of $s$ D. Variance becomes negative
Correct Answer: B
Explanation
- **If you chose A:** The opposite — fewer levels (lower $s$) introduce more quantization noise and higher variance. - **If you chose B:** Correct. QSGD variance is bounded by $\min(d/s^2, \sqrt{d}/s)\|\mathbf{g}\|^2$. As $s$ increases, this bound decreases, reducing variance. - **If you chose C:** The variance bound explicitly depends on $s$. - **If you chose D:** Variance is always non-negative by definition.Question 5: What does "user-level" differential privacy guarantee in federated learning?
A. Individual data points (samples) cannot be identified B. The presence or absence of an entire user's dataset in the training process is protected C. Only the server administrator can access raw data D. All gradient updates are encrypted during transmission
Correct Answer: B
Explanation
- **If you chose A:** That's sample-level DP. User-level DP is stronger — it protects all samples belonging to a user simultaneously. - **If you chose B:** Correct. Even if a user contributes thousands of data points, user-level DP makes it indistinguishable whether that user participated at all. - **If you chose C:** FL already keeps data local; DP adds formal mathematical guarantees on top. - **If you chose D:** Encryption (e.g., secure aggregation) is a separate concern from differential privacy.Question 6: In FedAvg, why does weighting client updates by $n_k$ (dataset size) make statistical sense?
A. It gives equal voice to all clients regardless of data quantity B. Larger datasets provide more reliable gradient estimates with lower variance, so their updates warrant higher weight C. It reduces the total communication cost D. It eliminates the effects of non-IID data distributions
Correct Answer: B
Explanation
- **If you chose A:** Equal weighting ignores that some clients have far more data (and thus more reliable gradients). - **If you chose B:** Correct. The variance of a stochastic gradient scales as $O(1/n_k)$, so clients with more data produce lower-variance updates. Weighting by $n_k$ is the minimum-variance linear combination under IID assumptions. - **If you chose C:** The weighting scheme doesn't affect communication cost — that's determined by $E$ and the number of participating clients. - **If you chose D:** Non-IID effects persist regardless of the weighting scheme; weighting by $n_k$ doesn't address distribution shift.Next Steps
Now study the formal privacy guarantees behind FL: 25-07 — Differential Privacy, covering $\varepsilon$-DP, the Gaussian mechanism, DP-SGD, and the privacy-utility trade-off.