22-08 - Autoregressive Models
Phase: 22 - Generative Models Mathematics
Subject: 22-08
Prerequisites: 22-01 - Autoencoders, Phase 16-17 (Neural Networks & CNNs: convolutions, architectures), Phase 13 (Probability: chain rule)
Next subject: 22-09 - Energy-Based Models
Learning Objectives
By the end of this subject, you will be able to:
- Formulate autoregressive generative modeling via the chain rule of probability
- Derive the masked convolution operation used in PixelCNN and explain causal constraints
- Understand dilated causal convolutions in WaveNet and compute receptive field size
- Compare autoregressive models to latent-variable and score-based approaches in terms of sampling speed, likelihood, and sample quality
- Connect autoregressive image/text generation to modern LLM architectures
Core Content
The Autoregressive Principle
An autoregressive model decomposes a joint distribution over a sequence or grid into a product of conditionals using the chain rule of probability:
$$p(\mathbf{x}) = p(x_1)\prod_{i=2}^{D} p(x_i \mid x_1, \ldots, x_{i-1})$$
For images, $\mathbf{x} = (x_1, \ldots, x_{H \times W \times C})$ is flattened into a sequence according to some ordering (typically raster scan: row-by-row, left-to-right, RGB channels interleaved).
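The raster-scan ordering can be made concrete with a small index mapping. This is a minimal sketch, assuming 0-based indices and channels interleaved per pixel; the function names `raster_index` and `raster_coords` are illustrative, not taken from any library.

```python
def raster_index(h, w, c, W, C):
    """Position of pixel (h, w), channel c in the flattened sequence (0-based)."""
    # Rows first, then columns, then channels within each pixel.
    return (h * W + w) * C + c


def raster_coords(i, W, C):
    """Inverse mapping: flat index back to (h, w, c)."""
    pixel, c = divmod(i, C)
    h, w = divmod(pixel, W)
    return h, w, c
```

For a width-4 RGB image, the red channel of the second pixel in the first row sits at `raster_index(0, 1, 0, W=4, C=3)`, which is index 3: it comes right after the three channels of the first pixel.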
The model learns each conditional $p(x_i \mid \mathbf{x}_{<i})$ parameterized by a neural network. Generation proceeds sequentially:
- Sample $x_1 \sim p(x_1)$
- Sample $x_2 \sim p(x_2 \mid x_1)$
- ...
- Sample $x_D \sim p(x_D \mid x_1, \ldots, x_{D-1})$
⚠️ CRITICAL: Sequential sampling is the fundamental trade-off of autoregressive models: likelihoods are exact and tractable, but sampling takes $O(D)$ sequential steps that cannot be parallelized. Contrast with GANs (fast sampling, no likelihood) and diffusion models (intermediate sampling cost, approximate likelihood).
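The sampling loop above can be sketched for binary variables as follows. This is only an illustration: `toy_conditional` is a hypothetical stand-in for the neural network, and its formula is an arbitrary assumption chosen so that the code runs.

```python
import math
import random


def toy_conditional(prefix):
    """Hypothetical stand-in for a network: returns p(x_i = 1 | prefix)."""
    # Assumption: the chance of a 1 rises with the fraction of 1s seen so far.
    if not prefix:
        return 0.5
    return 0.25 + 0.5 * sum(prefix) / len(prefix)


def ancestral_sample(conditional, D, rng):
    """Draw x_1, ..., x_D one at a time, each conditioned on the prefix."""
    x = []
    log_prob = 0.0
    for _ in range(D):
        p_one = conditional(x)  # p(x_i = 1 | x_<i)
        bit = 1 if rng.random() < p_one else 0
        # Accumulate the exact log-likelihood of the sample as we go.
        log_prob += math.log(p_one if bit == 1 else 1.0 - p_one)
        x.append(bit)
    return x, log_prob
```

Calling `ancestral_sample(toy_conditional, 8, random.Random(0))` returns a length-8 bit list together with its exact log-probability. The loop cannot be parallelized, because step $i$ needs the values sampled at steps $1, \ldots, i-1$.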
PixelCNN
PixelCNN (van den Oord et al., 2016) models $p(x_i \mid \mathbf{x}_{<i})$ using a CNN with masked convolutions that enforce the autoregressive ordering constraint.
Masked Convolutions
A standard convolution computes:
$$y_{h,w} = \sum_{i=-k}^{k}\sum_{j=-k}^{k} W_{i,j} \cdot x_{h+i, w+j}$$
To enforce the autoregressive constraint (pixel $(h,w)$ can only depend on "previous" pixels), we multiply the filter by a mask $M$:
$$y_{h,w} = \sum_{i,j} (W_{i,j} \odot M_{i,j}) \cdot x_{h+i, w+j}$$
Mask Type A (first layer): Center pixel and all "future" pixels masked to zero: $$M_{i,j} = \begin{cases} 1 & \text{if } (h+i < h) \text{ or } (h+i = h \text{ and } w+j < w) \\ 0 & \text{otherwise} \end{cases}$$
Mask Type B (subsequent layers): Allows the center pixel (self-connection) since later layers operate on features, not raw pixels: $$M_{i,j} = \begin{cases} 1 & \text{if } (h+i < h) \text{ or } (h+i = h \text{ and } w+j \leq w) \\ 0 & \text{otherwise} \end{cases}$$
For RGB images with interleaved channels, the mask is extended to channel depth: pixel $(h,w,c)$ can depend on $(h,w,c')$ for $c' < c$ (Type A) or $c' \leq c$ (Type B).
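The two mask types can be built directly from their definitions. The sketch below covers the single-channel (grayscale) case only and uses plain nested lists; `make_mask` is an illustrative name, and the channel-depth extension described above is left out.

```python
def make_mask(kernel_size, mask_type):
    """Return a kernel_size x kernel_size 0/1 mask for a grayscale PixelCNN layer."""
    if mask_type not in ("A", "B"):
        raise ValueError("mask_type must be 'A' or 'B'")
    center = kernel_size // 2  # index of the center row and column
    mask = []
    for r in range(kernel_size):
        row = []
        for col in range(kernel_size):
            above = r < center
            same_row_left = r == center and col < center
            is_center = r == center and col == center
            # Type B additionally lets the center position through.
            allowed = above or same_row_left or (is_center and mask_type == "B")
            row.append(1 if allowed else 0)
        mask.append(row)
    return mask
```

For a $3 \times 3$ kernel, `make_mask(3, "A")` gives `[[1, 1, 1], [1, 0, 0], [0, 0, 0]]`, and `make_mask(3, "B")` differs only in the center entry. In a real layer the mask would be multiplied elementwise into the kernel weights before each convolution.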
Blind Spot Problem
A stack of masked $3 \times 3$ convolutions has a "blind spot": the receptive field extends to the right by at most one column per row it moves up, so some already-generated pixels above and to the right of the current pixel (for example $(h-1, w+2)$) never influence the prediction, no matter how many layers are stacked. Gated PixelCNN solves this by using two separate streams: a vertical stack that sees all rows above, and a horizontal stack that sees the current row to the left.
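One way to see the blind spot is to enumerate which relative positions a stack of masked convolutions can reach. The sketch below assumes $3 \times 3$ kernels, a Type A mask in the first layer and Type B masks afterwards; `receptive_offsets` is an illustrative helper, not part of any library.

```python
def receptive_offsets(num_layers):
    """Offsets (d_row, d_col) reachable by stacking 3x3 masked convolutions.

    Assumes a Type A mask in the first layer and Type B masks in later layers.
    """
    type_a = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]  # unmasked kernel positions
    type_b = type_a + [(0, 0)]                      # Type B also keeps the center
    reachable = {(0, 0)}
    for layer in range(num_layers):
        steps = type_a if layer == 0 else type_b
        # Each layer lets every reachable position look one masked step further.
        reachable = {(r + dr, c + dc) for (r, c) in reachable for (dr, dc) in steps}
    return reachable
```

With ten layers, `(-2, 2)` is in the returned set but `(-1, 2)` is not, even though the pixel one row up and two columns to the right was generated earlier in raster order. Adding layers does not help: moving one row up allows at most one column of rightward shift.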
Gated PixelCNN
The Gated PixelCNN (van den Oord et al., 2016) uses:
$$\mathbf{y} = \tanh(W_{k,f} * \mathbf{x}) \odot \sigma(W_{k,g} * \mathbf{x})$$
This is a gated activation (like an LSTM/GRU gate), where $\sigma$ acts as a multiplicative gate on the $\tanh$ activation. Combined with vertical and horizontal stacks, this significantly outperforms the original PixelCNN.
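The gating itself is a simple elementwise operation. In the sketch below, the two input lists stand for the pre-activations $W_{k,f} * \mathbf{x}$ and $W_{k,g} * \mathbf{x}$, which are assumed to have been computed already by the two convolutions.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_activation(filter_pre, gate_pre):
    """Elementwise tanh(filter) * sigmoid(gate) over two equal-length lists."""
    if len(filter_pre) != len(gate_pre):
        raise ValueError("filter and gate pre-activations must have the same length")
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_pre, gate_pre)]
```

A strongly negative gate pre-activation drives the output toward zero regardless of the filter value, which is what lets the network switch features on and off per position.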
PixelRNN
PixelRNN (van den Oord et al., 2016) uses Row LSTM and Diagonal BiLSTM architectures.
Row LSTM: Processes the image row by row. Each row's hidden state conditions on the previous row and the current row up to the current pixel. A 1D convolution along the row captures local spatial dependencies.
The hidden state update:
$$\mathbf{h}_{h,w} = \text{LSTM}(\mathbf{x}_{h,w}, \mathbf{h}_{h-1,w}, \mathbf{h}_{h,w-1})$$
The LSTM has a triangular (causal) receptive field growing upward and leftward.
Diagonal BiLSTM: Skews the image so that dependencies align diagonally, then processes along diagonals. This gives a larger effective receptive field at the cost of implementation complexity.
PixelRNN achieves better log-likelihood than early PixelCNN but is slower to train (sequential RNN operations vs. parallel convolutions).
⚠️ CRITICAL: WaveNet and Dilated Causal Convolutions
WaveNet (van den Oord et al., 2016) was designed for raw audio generation (16 kHz sample rate, i.e. 16,000 values per second). The key innovation is dilated causal convolutions, whose receptive field grows exponentially with depth.
Causal Convolution
A causal convolution ensures output at time $t$ depends only on inputs at times $\leq t$:
$$y_t = \sum_{i=0}^{k-1} W_i \cdot x_{t-i}$$
Implemented by left-padding the input.
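A minimal sketch of the left-padding approach, assuming a 1D input given as a list and zero padding for times before the start of the signal (`causal_conv1d` is an illustrative name):

```python
def causal_conv1d(x, w):
    """y[t] = sum_i w[i] * x[t - i], treating x[t] as 0 for t < 0."""
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)  # left-pad so the output keeps len(x)
    y = []
    for t in range(len(x)):
        # padded[t + k - 1] is x[t]; stepping back i positions gives x[t - i].
        y.append(sum(w[i] * padded[t + k - 1 - i] for i in range(k)))
    return y
```

With `x = [1, 2, 3, 4]` and `w = [1, 1]`, each output is the current sample plus the previous one, so the result is `[1, 3, 5, 7]`. Changing `x[3]` leaves the first three outputs untouched, which is the causal property.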
Dilation
A dilated convolution with dilation factor $d$ spaces the kernel elements:
$$y_t = \sum_{i=0}^{k-1} W_i \cdot x_{t - d \cdot i}$$
Dilation is doubled at each layer: $d = 1, 2, 4, 8, \ldots, 2^{L-1}$ for $L$ layers.
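The dilated formula and the doubling schedule can be sketched as below. For simplicity the stack reuses the same kernel `w` in every layer and omits nonlinearities; both are assumptions made only to keep the example short.

```python
def dilated_causal_conv1d(x, w, d):
    """y[t] = sum_i w[i] * x[t - d*i], with x[t] = 0 for t < 0."""
    y = []
    for t in range(len(x)):
        total = 0.0
        for i, w_i in enumerate(w):
            src = t - d * i  # index of the input this kernel tap reads
            if src >= 0:
                total += w_i * x[src]
        y.append(total)
    return y


def dilated_stack(x, w, num_layers):
    """Apply num_layers convolutions with dilations 1, 2, 4, ... in sequence."""
    for layer in range(num_layers):
        x = dilated_causal_conv1d(x, w, 2 ** layer)
    return x
```

Feeding a unit impulse at $t=0$ through three layers with `w = [1, 1]` produces ones at $t = 0, \ldots, 7$ and zeros afterwards: an output can be influenced by inputs up to 7 steps in the past, so three layers with kernel size 2 already cover 8 samples.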
Receptive Field
The receptive field after $L$ layers of dilated convolutions with kernel size $k$:
$$\text{RF} = 1 + (k-1)\sum_{\ell=0}^{L-1} d_\ell = 1 + (k-1)\sum_{\ell=0}^{L-1} 2^\ell = 1 + (k-1)(2^L - 1)$$
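With $k=2$ and dilation doubling across all $L=30$ layers, the formula gives RF $= 1 + 1 \cdot (2^{30} - 1) = 2^{30} \approx 1.07 \times 10^9$ samples, which at 16 kHz is about 18.6 hours of audio. In practice, WaveNet resets the dilation after 512 and repeats the cycle $1, 2, 4, \ldots, 512$ several times; three such cycles (30 layers) give RF $= 1 + 3 \cdot 1023 = 3070$ samples, roughly 0.19 seconds at 16 kHz.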
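The receptive-field formula is easy to evaluate for any dilation schedule. The helpers below are a small sketch; `min_layers_for` assumes the pure doubling schedule $d_\ell = 2^\ell$.

```python
def receptive_field(kernel_size, dilations):
    """RF = 1 + (k - 1) * sum of dilations, one dilation per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)


def min_layers_for(kernel_size, target):
    """Smallest L such that dilations 1, 2, ..., 2^(L-1) give RF >= target."""
    num_layers = 0
    while receptive_field(kernel_size, [2 ** l for l in range(num_layers)]) < target:
        num_layers += 1
    return num_layers
```

For example, `receptive_field(3, [2 ** l for l in range(10)])` evaluates to 2047, and `min_layers_for(3, 16000)` returns 13.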
WaveNet Architecture
Residual block:
input → dilated causal conv → tanh ⊙ σ (gate) → 1×1 conv → + input (residual)
↓
skip connection → summed across layers → output
The gated activation (like Gated PixelCNN) uses $\tanh$ and sigmoid branches that multiply. Skip connections from every layer are summed to produce the final output, which parameterizes a 256-way softmax over $\mu$-law quantized audio values.
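The $\mu$-law step can be sketched with the standard companding formula $f(x) = \operatorname{sign}(x)\,\ln(1 + \mu|x|) / \ln(1 + \mu)$ with $\mu = 255$. The rounding convention used to turn the compressed value into one of 256 classes is an assumption here; implementations differ in this detail.

```python
import math


def mu_law_encode(x, mu=255):
    """Map a sample in [-1, 1] to an integer class in {0, ..., mu}."""
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest class (assumed convention).
    return int(math.floor((compressed + 1.0) / 2.0 * mu + 0.5))


def mu_law_decode(q, mu=255):
    """Approximate inverse: class index back to a sample in [-1, 1]."""
    compressed = 2.0 * q / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)
```

The compression allocates more classes to quiet samples than a uniform 256-level quantizer would, so the softmax spends its resolution where small amplitude differences are most audible.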
Autoregressive Image Generation: Modern Connections
VQ-VAE + Autoregressive Prior
VQ-VAE (vector-quantized VAE) compresses images into a discrete latent grid, then trains an autoregressive model (PixelCNN or Transformer) on the discrete latents. This decouples compression (VQ-VAE) from density modeling (autoregressive).
Two-stage training:
1. Train VQ-VAE: encoder $E(\mathbf{x}) \to \mathbf{z}$, nearest-neighbor quantize to codebook, decoder reconstructs
2. Train autoregressive model over discrete code indices $p(z_1, \ldots, z_{HW})$
This approach (used in VQ-GAN, DALL·E 1) is faster than pixel-level autoregression because the latent grid is much smaller (e.g., $16 \times 16$ vs. $256 \times 256$ pixels).
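The nearest-neighbor quantization step of stage 1 can be sketched as follows. Encoder outputs and codebook entries are plain lists of floats here, squared Euclidean distance is assumed as the metric, and `quantize` is an illustrative name.

```python
def quantize(vectors, codebook):
    """Return, for each vector, the index of its closest codebook entry."""

    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda j: sq_dist(v, codebook[j]))
        for v in vectors
    ]
```

The resulting list of indices is the discrete sequence $z_1, \ldots, z_{HW}$ that the stage-2 autoregressive model is trained on.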
Connection to Language Models
Modern LLMs (GPT, LLaMA) are autoregressive models over discrete tokens:
$$p(\text{tokens}) = \prod_{i=1}^{N} p(t_i \mid t_1, \ldots, t_{i-1};\; \text{prompt})$$
The architecture is a causally-masked Transformer (decoder-only). Key parallels:
- PixelCNN's masked convolutions $\leftrightarrow$ causal attention mask in Transformers
- WaveNet's next-sample prediction with dilated causal convolutions $\leftrightarrow$ the "next-token prediction" paradigm
- Both optimize exact log-likelihood via cross-entropy
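The causal attention mask plays the same role as the PixelCNN mask: it blocks information from later positions. The sketch below shows the mask and how it is applied to one row of attention scores; the scores themselves are made-up numbers, and the helper names are illustrative.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over scores, with disallowed positions forced to weight 0."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

For the first position, `masked_softmax([0.3, 2.0, -1.0], causal_mask(3)[0])` puts all weight on itself, because the other two positions lie in the future.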
Training and Evaluation
Training: Maximum likelihood via cross-entropy loss for discretized data:
$$\mathcal{L} = -\frac{1}{D}\sum_{i=1}^{D} \log p_\theta(x_i \mid \mathbf{x}_{<i})$$
This is $O(D)$ per datapoint but fully parallelizable during training (teacher forcing: all conditionals are computed simultaneously using ground-truth $\mathbf{x}_{<i}$).
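A minimal sketch of the loss computation, assuming the network has already produced one vector of logits per position from the ground-truth prefix (teacher forcing). The logits are supplied by the caller; no model is defined here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from unnormalized logits."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - log_z for l in logits]


def average_nll(logits_per_position, targets):
    """Mean of -log p(x_i | x_<i) over positions, in nats."""
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        # Only the log-probability of the observed value enters the loss.
        total -= log_softmax(logits)[target]
    return total / len(targets)
```

Each term of the loop is independent of the others once the logits are available, which is why training can evaluate all $D$ conditionals in one parallel pass. A model that outputs uniform logits over 4 classes scores $\ln 4 \approx 1.386$ nats per position.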
Likelihood evaluation: Exact log-likelihood via the chain rule:
$$\log p(\mathbf{x}) = \sum_{i=1}^{D} \log p(x_i \mid \mathbf{x}_{<i})$$
This is a major advantage over GANs (no likelihood) and VAEs (lower bound only).
Bits per dimension (BPD):
$$\text{BPD} = -\frac{\log_2 p(\mathbf{x})}{D}$$
Lower BPD = better compression. PixelCNN achieves ~3.0 BPD on CIFAR-10; state-of-the-art around 2.8 BPD.
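Converting a log-likelihood to bits per dimension is a one-line change of base. The sketch below assumes the log-likelihood is given in nats (natural logarithm), as is typical for deep learning frameworks.

```python
import math


def bits_per_dim(log_likelihood_nats, num_dims):
    """BPD = -log2 p(x) / D, with log p(x) supplied in nats."""
    return -log_likelihood_nats / (num_dims * math.log(2))
```

As a sanity check, a model that assigns uniform probability $1/256$ to every subpixel has $\log p(\mathbf{x}) = -D \ln 256$ and therefore exactly 8 BPD, the cost of storing raw 8-bit values.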
Key Terms
- Autoregressive models
- Diagonal BiLSTM
- Mask Type A
- Mask Type B
- PixelCNN
- PixelRNN
- Row LSTM
- WaveNet
Worked Examples
Example 1: Chain Rule Decomposition
An image has 4 pixels (2×2), each taking values in $\{0, 1\}$. Write the full autoregressive factorization and compute $p(1, 0, 1, 1)$ given:
- $p(x_1 = 1) = 0.6$
- $p(x_2 = 0 \mid x_1 = 1) = 0.7$
- $p(x_3 = 1 \mid x_1 = 1, x_2 = 0) = 0.4$
- $p(x_4 = 1 \mid x_1 = 1, x_2 = 0, x_3 = 1) = 0.8$
Solution:
$$p(1,0,1,1) = p(x_1{=}1) \cdot p(x_2{=}0 \mid x_1{=}1) \cdot p(x_3{=}1 \mid x_1{=}1,x_2{=}0) \cdot p(x_4{=}1 \mid x_1{=}1,x_2{=}0,x_3{=}1)$$
$$= 0.6 \cdot 0.7 \cdot 0.4 \cdot 0.8 = 0.1344$$
Answer:
$p(1,0,1,1) = 0.1344 = 13.44\%$. The autoregressive factorization is exact by the chain rule, with no approximation or lower bound. Each factor conditions on all previous variables.
Example 2: WaveNet Receptive Field
A WaveNet has kernel size $k=3$ and $L=10$ layers with dilation doubling each layer ($d = 1, 2, 4, \ldots, 512$). Compute the receptive field. How many more layers to cover 1 second at 16 kHz?
Solution:
$$\text{RF} = 1 + (k-1)\sum_{\ell=0}^{9} 2^\ell = 1 + 2 \cdot (2^{10} - 1) = 1 + 2 \cdot 1023 = 2047 \text{ samples}$$
At 16 kHz: 2047/16000 ≈ 0.128 seconds.
For 1 second (16,000 samples): need RF ≥ 16000. Solve for $L$:
$$1 + 2(2^L - 1) \geq 16000 \implies 2^{L+1} - 1 \geq 16000 \implies 2^{L+1} \geq 16001$$
$2^{13} = 8192, 2^{14} = 16384 \implies L+1 = 14 \implies L = 13$ layers.
Answer:
RF = 2047 samples (0.128 sec). Need 13 layers for 1 sec coverage, only 3 more than the 10 given. Because of the exponential growth, each additional layer roughly doubles the receptive field, so going from 10 to 13 layers multiplies RF by about 8.
Example 3: PixelCNN Mask
For a 3×3 convolution kernel $W$, construct the Type A mask for a 2D grayscale image. Assume center is at position (1,1) with indices (0,0) top-left to (2,2) bottom-right.
Solution:
Type A mask: allow only previously generated pixels (above, or same-row-left). Center not allowed.
| $W_{0,0}$=1 | $W_{0,1}$=1 | $W_{0,2}$=1 |
| $W_{1,0}$=1 | $W_{1,1}$=0 | $W_{1,2}$=0 |
| $W_{2,0}$=0 | $W_{2,1}$=0 | $W_{2,2}$=0 |
Row 0 (above): all allowed. Row 1 (same row): only left-of-center allowed. Row 2 (below): none allowed.
Answer:
The mask zeroes the center, the position to its right, and the entire row below. Type B would differ only in $W_{1,1}=1$ (center allowed). This enforces the raster-scan dependency: pixel $(h,w)$ can only see pixels already generated when processing in row-major order.
Practice Problems
1. Given a binary sequence of length 3 and autoregressive conditionals $p(x_1{=}1)=0.5$, $p(x_2{=}1 \mid x_1)=0.3$ if $x_1=1$ else $0.4$, and $p(x_3{=}1 \mid x_1,x_2)=0.6$ if exactly one of $x_1,x_2$ is 1 else $0.2$, compute $p(1, 0, 1)$.
Answer:
$p(1,0,1) = 0.5 \cdot p(x_2{=}0 \mid x_1{=}1) \cdot p(x_3{=}1 \mid x_1{=}1,x_2{=}0)$
$p(x_2{=}0 \mid x_1{=}1) = 1 - 0.3 = 0.7$
$x_1=1, x_2=0$ → exactly one is 1 → $p(x_3{=}1 \mid \ldots) = 0.6$
$p(1,0,1) = 0.5 \cdot 0.7 \cdot 0.6 = 0.21$
2. Show that for discrete data, the autoregressive negative log-likelihood equals the cross-entropy between the predicted distribution and the one-hot ground truth at each position.
Answer:
$\mathcal{L} = -\frac{1}{D}\sum_i \log p_\theta(x_i \mid \mathbf{x}_{<i})$. With a one-hot target $q_i(v) = \mathbb{1}[v = x_i]$, the cross-entropy at position $i$ is $-\sum_v q_i(v) \log p_\theta(v \mid \mathbf{x}_{<i}) = -\log p_\theta(x_i \mid \mathbf{x}_{<i})$, so averaging the per-position cross-entropies gives exactly $\mathcal{L}$.
3. Why can't a standard (undilated) convolutional network achieve the same receptive field as WaveNet without prohibitive depth?
Answer:
Standard conv receptive field grows linearly: $RF = 1 + L(k-1)$. For $k=3$, $L=30$: $RF = 1 + 30 \cdot 2 = 61$. For WaveNet dilation: $RF = 1 + 2(2^{30}-1) \approx 2 \times 10^9$. A standard network would need ~1 billion layers to match this. Dilated convolutions achieve exponential receptive field growth, which is essential for long sequences like audio.
4. Explain why autoregressive models are naturally suited for text (language modeling) but face challenges for high-resolution images.
Answer:
Text is naturally sequential (1D), with ~$10^2$ to $10^4$ tokens per sample. Images are 2D grids flattened to 1D, yielding $256 \times 256 \times 3 \approx 200{,}000$ "tokens" for modest resolution. The $O(D)$ sequential sampling becomes prohibitive. Text also has strong local dependencies; images require long-range 2D spatial dependencies that are awkward in 1D orderings. This is why VQ-VAE + autoregressive prior is preferred: compress to a small latent grid first, then model with the autoregressive prior.
5. Compute bits per dimension for a model achieving $\log p(\mathbf{x}) = -1200$ on a $32 \times 32 \times 3$ image with pixel values in $\{0, \ldots, 255\}$.
Answer:
$D = 32 \times 32 \times 3 = 3072$ dimensions. Taking $\log$ as the natural logarithm, $\text{BPD} = -\frac{\log_2 p(\mathbf{x})}{D} = -\frac{\ln p(\mathbf{x})}{D \ln 2} = \frac{1200}{3072 \cdot 0.6931} = \frac{1200}{2129.0} \approx 0.564$. This is unrealistically low: SOTA is ~2.8 BPD on CIFAR-10. A value of 0.564 would mean near-perfect compression, achievable only if the data is nearly deterministic.
Summary
Key takeaways:
- **Autoregressive models** factor $p(\mathbf{x}) = \prod_i p(x_i \mid \mathbf{x}_{<i})$ …
Answer and Explanations
**Correct: B)** During training, to predict $x_i$, the model receives the true $\mathbf{x}_{<i}$ …