Math graphic
📐 Concept diagram

22-01 — Autoencoders

Phase: 22 — Generative Models Mathematics Subject: 22-01 Prerequisites: Phase 16–17 (Neural Networks), Phase 8 (Linear Algebra) Next subject: 22-02 — Variational Autoencoders (VAEs)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the encoder-decoder architecture and reconstruction loss for autoencoders
  2. Explain how the bottleneck forces representation learning and prevents trivial identity mapping
  3. Implement and analyze denoising autoencoders using the corruption-plus-reconstruction framework
  4. Connect autoencoders to dimensionality reduction (PCA) and manifold learning
  5. Recognize when autoencoders fail and why they precede VAEs

Core Content

The Autoencoder Architecture

An autoencoder is a neural network trained to copy its input to its output. It consists of two components:

The overall function is:

$$\hat{\mathbf{x}} = d_\theta(e_\phi(\mathbf{x}))$$

The objective is to minimize the reconstruction loss:

$$\mathcal{L}(\phi, \theta) = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{x}^{(i)} - d_\theta(e_\phi(\mathbf{x}^{(i)}))\|_2^2$$

Why Autoencoders Aren't Trivial

Without constraints, the network could learn the identity function $d_\theta(e_\phi(\mathbf{x})) = \mathbf{x}$ by making $h \geq d$ (the latent dimension at least as large as the input dimension). We prevent this through a bottleneck: $h < d$.

⚠️ CRITICAL — Bottleneck Necessity: If $h \geq d$, a linear autoencoder with no activation functions can learn the identity exactly. The bottleneck $h < d$ forces the encoder to compress the data — discarding noise and retaining salient features. This is what makes autoencoders learn useful representations rather than memorizing.

Linear Autoencoders and PCA

Consider a linear autoencoder with encoder $\mathbf{z} = W_{\text{enc}}\mathbf{x} + \mathbf{b}{\text{enc}}$ and decoder $\hat{\mathbf{x}} = W{\text{dec}}\mathbf{z} + \mathbf{b}_{\text{dec}}$, trained with MSE loss.

Theorem: The optimal weights $W_{\text{enc}}$ and $W_{\text{dec}}$ span the same subspace as the top $h$ principal components of the data covariance matrix. That is, a linear autoencoder with bottleneck dimension $h$ learns the PCA projection onto the $h$-dimensional principal subspace.

Proof sketch: The reconstruction loss $\|\mathbf{x} - W_{\text{dec}}W_{\text{enc}}\mathbf{x}\|^2$ (ignoring biases, assuming centered data) is minimized when $W_{\text{dec}}W_{\text{enc}}$ is the orthogonal projection onto the $h$-dimensional subspace that captures maximum variance — exactly the PCA solution. The specific matrices are not unique since any invertible $h \times h$ transformation $T$ gives $(W_{\text{dec}}T^{-1})(TW_{\text{enc}}) = W_{\text{dec}}W_{\text{enc}}$, preserving the projection.

Nonlinear Autoencoders and Manifold Learning

With nonlinear activations (ReLU, sigmoid, tanh), autoencoders learn nonlinear manifolds. The encoder maps points near the data manifold to the latent space; the decoder maps back to the data space. The composition $d_\theta \circ e_\phi$ approximates a projection onto the data manifold.

Geometric interpretation: The latent space $\mathbb{R}^h$ is a coordinate chart for the $h$-dimensional manifold on which the data approximately lies. Each point on the manifold has latent coordinates $\mathbf{z}$.

Denoising Autoencoders (DAE)

A denoising autoencoder is trained to reconstruct the clean input $\mathbf{x}$ from a corrupted version $\tilde{\mathbf{x}}$:

$$\mathcal{L}{\text{DAE}} = \frac{1}{N}\sum{i=1}^{N} \|\mathbf{x}^{(i)} - d_\theta(e_\phi(\tilde{\mathbf{x}}^{(i)}))\|_2^2$$

where $\tilde{\mathbf{x}} = \text{corrupt}(\mathbf{x})$ — typically additive Gaussian noise $\tilde{\mathbf{x}} = \mathbf{x} + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, or dropout (binary masking).

⚠️ CRITICAL — DAE Intuition: The denoising objective forces the autoencoder to learn the data manifold, not just the data points. To reconstruct from a corrupted input, the model must understand the structure of clean data — where valid data points live. This is the conceptual bridge to score-based models and diffusion models.

Connection to score matching: For small Gaussian noise $\sigma$, the optimal DAE reconstruction satisfies:

$$d_\theta(e_\phi(\mathbf{x} + \epsilon)) \approx \mathbf{x} + \sigma^2 \nabla_{\mathbf{x}} \log p(\mathbf{x})$$

where $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is the score function. The DAE learns the score implicitly — a connection we will exploit in Phase 22-05.

Sparse Autoencoders

Another way to prevent trivial solutions: add a sparsity penalty on the latent activations. If most latent units are near zero, the representation is forced to be efficient:

$$\mathcal{L}{\text{sparse}} = \mathcal{L}{\text{reconstruction}} + \lambda \sum_{j=1}^{h} \text{KL}(\rho \| \hat{\rho}_j)$$

where $\hat{\rho}_j = \frac{1}{N}\sum_i z_j^{(i)}$ is the average activation of unit $j$, $\rho$ is a small target sparsity (e.g., 0.05), and KL is the Kullback-Leibler divergence.

Contractive Autoencoders (CAE)

Add a penalty on the Frobenius norm of the encoder's Jacobian:

$$\mathcal{L}{\text{CAE}} = \mathcal{L}{\text{reconstruction}} + \lambda \left\|\frac{\partial e_\phi(\mathbf{x})}{\partial \mathbf{x}}\right\|_F^2$$

This forces the encoder to be insensitive to small input variations — the representation contracts around the data manifold.

⚠️ Common Pitfalls

Pitfall Why It Happens Fix
Identity learning $h \geq d$ or overly powerful decoder Use bottleneck ($h < d$)
Blurry reconstructions MSE loss assumes Gaussian data; averages over modes Consider adversarial loss or VAEs
No meaningful latent structure Encoder maps arbitrarily Use regularization (sparsity, contractive) or switch to VAE
Overfitting Autoencoder memorizes training set Denoising, dropout, weight decay
Latent space collapse All inputs map to the same latent code Regularization, lower learning rate


Key Terms

Worked Examples

Example 1: Linear Autoencoder = PCA

Given data $\mathbf{X} \in \mathbb{R}^{N \times 2}$ with points $\{(1, 2), (2, 3), (3, 4), (4, 5)\}$, find the 1D linear autoencoder weights and verify they align with the first principal component.

Solution:

  1. Center the data. Mean: $(2.5, 3.5)$. Centered data: $$\tilde{\mathbf{X}} = \begin{pmatrix} -1.5 & -1.5 \\ -0.5 & -0.5 \\ 0.5 & 0.5 \\ 1.5 & 1.5 \end{pmatrix}$$

  2. Covariance matrix $\Sigma = \frac{1}{4}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} = \begin{pmatrix} 1.25 & 1.25 \\ 1.25 & 1.25 \end{pmatrix}$.

  3. First eigenvector: solve $\Sigma \mathbf{v} = \lambda \mathbf{v}$. The rank-1 matrix has eigenvector $(1/\sqrt{2}, 1/\sqrt{2})^T$ with eigenvalue 2.5.

  4. The 1D linear autoencoder: $\mathbf{z} = \mathbf{w}{\text{enc}}^T \mathbf{x}$ (scalar), $\hat{\mathbf{x}} = \mathbf{w}{\text{dec}} z$. Minimizing $\|\mathbf{x} - \mathbf{w}{\text{dec}}\mathbf{w}{\text{enc}}^T\mathbf{x}\|^2$ gives $\mathbf{w}{\text{enc}} = \mathbf{w}{\text{dec}} = (1/\sqrt{2}, 1/\sqrt{2})$ (up to scaling). This is exactly the first PC direction.

Click for answer The 1D linear autoencoder projects onto direction $(1,1)/\\sqrt{2}$ — the first principal component. The decoder reconstructs along the same direction, giving points on the line $y = x$ (after adding back the mean).

Example 2: Compute Reconstruction Loss

An autoencoder with bottleneck $h=2$ processes a batch of 3 images (flattened to $d=4$). Given:

$\mathbf{X} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{pmatrix}$, $\hat{\mathbf{X}} = \begin{pmatrix} 0.9 & 0.1 & 0.8 & 0.1 \\ 0.1 & 0.9 & 0.1 & 0.8 \\ 0.8 & 0.9 & 0.1 & 0.1 \end{pmatrix}$

Compute the MSE reconstruction loss.

Solution:

MSE = $\frac{1}{N d}\sum_{i,j}(X_{ij} - \hat{X}_{ij})^2$

Per-sample: - Sample 1: $((1-0.9)^2 + (0-0.1)^2 + (1-0.8)^2 + (0-0.1)^2)/4 = (0.01 + 0.01 + 0.04 + 0.01)/4 = 0.0175$ - Sample 2: $(0.01 + 0.01 + 0.01 + 0.01 + 0.49)/4 = (0.01+0.01+0.01+0.49)/4 = 0.13$ (wait: $(0-0.1)^2=0.01$, $(1-0.9)^2=0.01$, $(0-0.1)^2=0.01$, $(1-0.8)^2=0.04$... let me recompute)

Recomputing carefully: - Row 1: $(0.1)^2 + (-0.1)^2 + (0.2)^2 + (-0.1)^2 = 0.01+0.01+0.04+0.01 = 0.07$, divided by 4 = 0.0175 - Row 2: $(-0.1)^2 + (0.1)^2 + (-0.1)^2 + (0.2)^2 = 0.01+0.01+0.01+0.04 = 0.07$, divided by 4 = 0.0175 - Row 3: $(0.2)^2 + (0.1)^2 + (-0.1)^2 + (-0.1)^2 = 0.04+0.01+0.01+0.01 = 0.07$, divided by 4 = 0.0175

Total MSE = $(0.0175 + 0.0175 + 0.0175)/3 = 0.0175$

Click for answer $\\mathcal{L}_{\\text{MSE}} = 0.0175$. Each sample contributes equally $0.0175$ since the sum of squared errors is $0.07$ for each with $d=4$.

Example 3: Denoising Autoencoder with Gaussian Noise

Train a DAE on scalar data $x \in \{-2, 2\}$ (two clusters). The corruption adds noise $\epsilon \sim \mathcal{N}(0, 0.5^2)$. A simple DAE with encoder $z = w_{\text{enc}} x$ and decoder $\hat{x} = w_{\text{dec}} z$ minimizes $\mathbb{E}[(x - w_{\text{dec}} w_{\text{enc}}(x + \epsilon))^2]$. Find the optimal product $w_{\text{dec}}w_{\text{enc}}$.

Solution:

The objective is $\mathbb{E}[(x - w(x + \epsilon))^2]$ where $w = w_{\text{dec}}w_{\text{enc}}$. Expanding:

$$\mathbb{E}[x^2 - 2wx(x+\epsilon) + w^2(x+\epsilon)^2] = \mathbb{E}[x^2] - 2w\mathbb{E}[x^2] - 2w\mathbb{E}[x\epsilon] + w^2\mathbb{E}[x^2] + 2w^2\mathbb{E}[x\epsilon] + w^2\mathbb{E}[\epsilon^2]$$

Since $\epsilon \perp x$ and $\mathbb{E}[\epsilon]=0$: $\mathbb{E}[x\epsilon] = 0$, $\mathbb{E}[\epsilon^2] = \sigma^2 = 0.25$. $\mathbb{E}[x^2] = \frac{1}{2}((-2)^2 + 2^2) = 4$.

$$\mathcal{L} = 4 - 2w(4) + w^2(4 + 0.25) = 4 - 8w + 4.25w^2$$

Minimizing: $\frac{\partial\mathcal{L}}{\partial w} = -8 + 8.5w = 0 \implies w = \frac{8}{8.5} \approx 0.941$.

Note: without noise ($\sigma^2=0$), the optimum would be $w=1$ (perfect reconstruction). With noise, $w < 1$ — the DAE learns to shrink reconstructions toward the mean, which is the optimal denoising strategy.

Click for answer $w^* = \\frac{\\mathbb{E}[x^2]}{\\mathbb{E}[x^2] + \\sigma^2} = \\frac{4}{4.25} \\approx 0.941$. The DAE naturally learns shrinkage toward zero (the data mean), which is optimal for Gaussian denoising. This is the scalar analog of the score-matching connection.

Practice Problems

  1. Prove that a linear autoencoder with $h \geq d$ can achieve zero reconstruction loss. What are the optimal weights?

    Click for answer Set $W_{\\text{dec}}W_{\\text{enc}} = I_d$. For example, $W_{\\text{enc}} = [I_d; 0]$ (pad with zeros) and $W_{\\text{dec}} = [I_d, 0]$ (strip zeros). The autoencoder learns the identity, which is useless for representation learning.

  2. For data $\mathbf{X}$ with covariance $\Sigma$, the 1D linear autoencoder encoder weight $\mathbf{w}$ satisfies $\Sigma \mathbf{w} = \lambda \mathbf{w}$. Derive this from the MSE objective.

    Click for answer $\\mathcal{L} = \\mathbb{E}[\\|\\mathbf{x} - \\mathbf{v}\\mathbf{w}^T\\mathbf{x}\\|^2] = \\text{tr}(\\Sigma) - 2\\mathbf{w}^T\\Sigma\\mathbf{v} + (\\mathbf{w}^T\\Sigma\\mathbf{w})\\|\\mathbf{v}\\|^2$ where $\\mathbf{w}$ is encoder, $\\mathbf{v}$ is decoder. Setting $\\nabla_{\\mathbf{v}}\\mathcal{L}=0$ gives $\\mathbf{v} = \\frac{\\Sigma\\mathbf{w}}{\\mathbf{w}^T\\Sigma\\mathbf{w}}$. Plugging back and maximizing variance captured gives $\\Sigma\\mathbf{w} = \\lambda\\mathbf{w}$.

  3. A denoising autoencoder is trained with dropout corruption (each input element set to 0 with probability $p=0.3$). For input $\mathbf{x} = (1, 2, 3, 4)$, the corrupted version in one training step is $\tilde{\mathbf{x}} = (1, 0, 3, 0)$. If the encoder is identity and the decoder is linear $\hat{\mathbf{x}} = W\tilde{\mathbf{x}}$, what should $W$ learn?

    Click for answer Minimizing $\\mathbb{E}[\\|\\mathbf{x} - W\\tilde{\\mathbf{x}}\\|^2]$ with dropout gives $W_{ii} \\approx 1/(1-p) = 1/0.7 \\approx 1.429$ (diagonal, compensating for the expected dropout), and $W_{ij} \\approx 0$ for $i \\neq j$. This is the inverse dropout scaling — the DAE learns to undo the corruption.

  4. Compute the reconstruction loss for a batch with binary cross-entropy loss: $\mathbf{x} = (1, 0, 1)$, $\hat{\mathbf{x}} = (0.9, 0.2, 0.8)$. Recall BCE: $\mathcal{L} = -\frac{1}{d}\sum_j [x_j \log \hat{x}_j + (1-x_j)\log(1-\hat{x}_j)]$.

    Click for answer $\\mathcal{L} = -\\frac{1}{3}[1\\cdot\\log 0.9 + 0\\cdot\\log 0.2 + 1\\cdot\\log 0.8 + 0\\cdot\\log 0.1 + 1\\cdot\\log 0.8 + 0\\cdot\\log 0.2]$ $= -\\frac{1}{3}[\\log 0.9 + \\log 0.8 + \\log 0.8] = -\\frac{1}{3}[-0.1054 - 0.2231 - 0.2231] = \\frac{1}{3}(0.5516) = 0.1839$.

  5. Explain why a sparse autoencoder with $\rho = 0.05$ and 100 hidden units has on average only 5 active units per input. How does this differ from a bottleneck of size 5?

    Click for answer A bottleneck of size 5 hard-limits the latent dimension — exactly 5 numbers per input. A sparse autoencoder with 100 units learns distributed representations: different inputs activate different subsets of ~5 units. The effective capacity is much higher because the 5 active units can be chosen from $\\binom{100}{5}$ possible combinations, enabling specialization while maintaining sparsity.


Summary

Key takeaways:


Quiz

  1. What happens if the bottleneck dimension $h$ equals the input dimension $d$?
  2. A) The autoencoder always learns PCA
  3. B) The autoencoder can learn the identity function
  4. C) The autoencoder cannot be trained
  5. D) The autoencoder becomes a VAE Correct: B)
  6. If you chose B: With $h = d$, the model can set $d_\theta \circ e_\phi = \text{id}$, achieving zero loss without extracting useful features.
  7. If you chose A: PCA requires $h < d$ (dimensionality reduction). With $h=d$, the model can learn identity, not PCA.
  8. If you chose C: It trains fine — it just converges to a trivial solution.
  9. If you chose D: VAEs require a probabilistic formulation, not just $h=d$.

  10. A linear autoencoder with MSE loss and bottleneck $h$ is equivalent to:

  11. A) $k$-means clustering with $k=h$
  12. B) PCA retaining $h$ principal components
  13. C) Linear discriminant analysis
  14. D) Independent component analysis Correct: B)
  15. If you chose B: The optimal linear autoencoder weights span the top-$h$ PCA subspace.
  16. If you chose A: $k$-means is a clustering algorithm, not a linear projection method. Though there are connections, the autoencoder explicitly solves PCA.
  17. If you chose C: LDA uses class labels; autoencoders are unsupervised.
  18. If you chose D: ICA finds independent (not just uncorrelated) components — different objective.

  19. A denoising autoencoder trained with additive Gaussian noise $\mathcal{N}(0, \sigma^2 I)$ learns:

  20. A) To output the exact input regardless of noise
  21. B) To output a denoised version biased toward the data mean
  22. C) Only the noise distribution
  23. D) A deterministic identity mapping Correct: B)
  24. If you chose B: For Gaussian noise, the optimal denoiser shrinks toward the mean (as shown in Example 3). More generally, it pushes toward high-density regions of the data distribution.
  25. If you chose A: Impossible — the noise destroys information. The model must infer the most likely clean input given the corrupted one.
  26. If you chose C: The model learns the data distribution, not just the noise.
  27. If you chose D: A DAE is not deterministic identity; it learns to map corrupted inputs to clean reconstructions.

  28. Which regularization method penalizes the sensitivity of the latent representation to input perturbations?

  29. A) Sparse autoencoder
  30. B) Denoising autoencoder
  31. C) Contractive autoencoder
  32. D) Undercomplete autoencoder Correct: C)
  33. If you chose C: The CAE penalty is $\|\partial e_\phi(\mathbf{x})/\partial\mathbf{x}\|_F^2$ — directly penalizing the encoder's Jacobian.
  34. If you chose A: Sparse autoencoders penalize non-zero activations, not sensitivity.
  35. If you chose B: DAEs achieve insensitivity implicitly through training on corrupted inputs, but don't have an explicit Jacobian penalty.
  36. If you chose D: Undercomplete autoencoders use a bottleneck, not a sensitivity penalty.

  37. For mean-centered data with covariance $\Sigma$, the $h$-dimensional linear autoencoder solution minimizes:

  38. A) $\text{tr}((I - W_{\text{dec}}W_{\text{enc}})^T \Sigma (I - W_{\text{dec}}W_{\text{enc}}))$
  39. B) $\|\Sigma - W_{\text{dec}}W_{\text{enc}}\|_F^2$
  40. C) $\text{tr}(\Sigma) - \text{tr}(W_{\text{dec}}W_{\text{enc}}\Sigma)$
  41. D) Both A and C are equivalent formulations Correct: D)
  42. If you chose D: The MSE loss expands to $\mathbb{E}[\|\mathbf{x} - W_{\text{dec}}W_{\text{enc}}\mathbf{x}\|^2] = \text{tr}(\Sigma) - 2\text{tr}(W_{\text{dec}}W_{\text{enc}}\Sigma) + \text{tr}((W_{\text{dec}}W_{\text{enc}})^T\Sigma W_{\text{dec}}W_{\text{enc}})$, which is expression A. Expression C is just the first-order term.

Next Steps

22-02 — Variational Autoencoders (VAEs) — where we add probabilistic structure to the latent space, deriving the ELBO and the reparameterization trick that makes VAEs trainable generative models.


Pitfalls

  1. Treating MSE as universally appropriate for all data types: MSE implicitly assumes Gaussian-distributed pixel errors, which works for continuous-valued images but fails for binary or categorical data. For binary inputs (e.g., black-and-white MNIST), use binary cross-entropy loss instead. Always match the decoder output distribution to the data type: Bernoulli for binary, categorical for discrete, and Gaussian for continuous.

  2. Setting the bottleneck too large without any regularization: A bottleneck of $h = d/2$ with no regularization still allows the autoencoder to copy input patterns rather than learning compressed representations. The bottleneck must be active and restrictive enough. For high-dimensional data like images, even $h = d/10$ can allow trivial solutions if the decoder is too powerful. Combine bottleneck constraints with regularization (dropout, weight decay) to force meaningful compression.

  3. Evaluating autoencoders by reconstruction loss alone: Low reconstruction MSE doesn't guarantee useful representations. A model can achieve near-perfect reconstruction on training data through memorization if the latent space is large enough. Always inspect the latent space structure (e.g., via PCA visualization, interpolation tests) and validate on downstream tasks (classification on latent codes, nearest-neighbor retrieval) before concluding that learning was successful.

  4. Confusing the DAE score-matching connection as an equivalence: The DAE learns $d_\theta(e_\phi(\mathbf{x} + \epsilon)) \approx \mathbf{x} + \sigma^2 \nabla_\mathbf{x} \log p(\mathbf{x})$ only in the limit of small noise and optimal training. For practical noise levels and finite training, the DAE approximates the score only approximately. Don't treat DAE reconstructions as exact score estimates — the relationship is asymptotic and depends on noise magnitude.




Q6: Which combination of autoencoder variants would be most effective for learning representations that are both robust to noise and interpretable in the latent space?

A) Undercomplete autoencoder with large $h$. B) Denoising autoencoder with sparsity constraint on the latent codes. C) Standard autoencoder trained to convergence on clean data. D) Contractive autoencoder with $h = d$.

Answer and Explanations **Correct: B)** Combining denoising (robustness to input corruption, learns data manifold) with sparsity (forces efficient, interpretable latent representations where only a few units are active per input) gives complementary benefits. Denoising handles input noise; sparsity makes individual latent dimensions interpretable. - A) Large $h$ risks identity learning and provides no robustness to noise. - C) Clean-data training without regularization produces no manifold structure or noise robustness. - D) $h = d$ with contractive penalty can still work but the contraction is the only constraint — no explicit denoising or sparsity.