
22-08: Autoregressive Models

Phase: 22 - Generative Models Mathematics
Subject: 22-08
Prerequisites: 22-01 - Autoencoders, Phase 16–17 (Neural Networks & CNNs: convolutions, architectures), Phase 13 (Probability: chain rule)
Next subject: 22-09 - Energy-Based Models


Learning Objectives

By the end of this subject, you will be able to:

  1. Formulate autoregressive generative modeling via the chain rule of probability
  2. Derive the masked convolution operation used in PixelCNN and explain causal constraints
  3. Understand dilated causal convolutions in WaveNet and compute receptive field size
  4. Compare autoregressive models to latent-variable and score-based approaches in terms of sampling speed, likelihood, and sample quality
  5. Connect autoregressive image/text generation to modern LLM architectures

Core Content

The Autoregressive Principle

An autoregressive model decomposes a joint distribution over a sequence or grid into a product of conditionals using the chain rule of probability:

$$p(\mathbf{x}) = p(x_1)\prod_{i=2}^{D} p(x_i \mid x_1, \ldots, x_{i-1})$$

For images, $\mathbf{x} = (x_1, \ldots, x_{H \times W \times C})$ is flattened into a sequence according to some ordering (typically raster scan: row-by-row, left-to-right, RGB channels interleaved).

The model learns each conditional $p(x_i \mid \mathbf{x}_{<i})$ parameterized by a neural network. Generation proceeds sequentially:

  1. Sample $x_1 \sim p(x_1)$
  2. Sample $x_2 \sim p(x_2 \mid x_1)$
  3. ...
  4. Sample $x_D \sim p(x_D \mid x_1, \ldots, x_{D-1})$

⚠️ CRITICAL: This sequential sampling is the fundamental trade-off of autoregressive models: exact likelihoods (tractable!) but slow sampling ($O(D)$ steps, not parallelizable). Contrast with GANs (fast sampling, no likelihood) and diffusion (medium sampling, approximate likelihood).

PixelCNN

PixelCNN (van den Oord et al., 2016) models $p(x_i \mid \mathbf{x}_{<i})$ using a CNN with masked convolutions that enforce the autoregressive ordering constraint.

Masked Convolutions

A standard convolution computes:

$$y_{h,w} = \sum_{i=-k}^{k}\sum_{j=-k}^{k} W_{i,j} \cdot x_{h+i, w+j}$$

To enforce the autoregressive constraint (pixel $(h,w)$ can only depend on "previous" pixels), we multiply the filter by a mask $M$:

$$y_{h,w} = \sum_{i,j} (W_{i,j} \odot M_{i,j}) \cdot x_{h+i, w+j}$$

Mask Type A (first layer): Center pixel and all "future" pixels masked to zero:

$$M_{i,j} = \begin{cases} 1 & \text{if } (h+i < h) \text{ or } (h+i = h \text{ and } w+j < w) \\ 0 & \text{otherwise} \end{cases}$$

Mask Type B (subsequent layers): Allows the center pixel (self-connection) since later layers operate on features, not raw pixels:

$$M_{i,j} = \begin{cases} 1 & \text{if } (h+i < h) \text{ or } (h+i = h \text{ and } w+j \leq w) \\ 0 & \text{otherwise} \end{cases}$$

For RGB images with interleaved channels, the mask is extended to channel depth: pixel $(h,w,c)$ can depend on $(h,w,c')$ for $c' < c$ (Type A) or $c' \leq c$ (Type B).
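As a rough sketch of the two mask types for a single-channel image, the following builds the 0/1 pattern for an odd kernel size. The function name `pixelcnn_mask` is an illustrative choice, and the channel-wise extension described above is deliberately left out.

```python
def pixelcnn_mask(kernel_size, mask_type):
    """Return a kernel_size x kernel_size list of 0/1 entries (one channel)."""
    assert kernel_size % 2 == 1 and mask_type in ("A", "B")
    c = kernel_size // 2                        # index of the center tap
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for r in range(kernel_size):
        for col in range(kernel_size):
            above = r < c                       # any row above the center
            left = (r == c and col < c)         # same row, strictly left
            center = (r == c and col == c)
            if above or left or (center and mask_type == "B"):
                mask[r][col] = 1
    return mask

mask_a = pixelcnn_mask(3, "A")   # [[1, 1, 1], [1, 0, 0], [0, 0, 0]]
mask_b = pixelcnn_mask(3, "B")   # [[1, 1, 1], [1, 1, 0], [0, 0, 0]]
for row in mask_a:
    print(row)
```

In a real layer the mask would be multiplied elementwise into the learned filter before every forward pass, so masked weights never contribute to the output.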

Blind Spot Problem

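Stacking masked convolutions creates a "blind spot": a region above and to the right of the current pixel that precedes it in raster order but never enters its receptive field, however many layers are stacked. With $3 \times 3$ kernels, for example, pixel $(h,w)$ can reach $(h-1,w+1)$ but never $(h-1,w+2)$, because each step up one row moves at most one column to the right. Gated PixelCNN solves this by using two separate streams: a vertical stack that sees all rows above the current pixel and a horizontal stack that sees the current row to its left.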
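One way to see the blind spot is to track which input offsets a stack of masked $3 \times 3$ convolutions can reach. The sketch below assumes one Type A layer followed by Type B layers and represents the receptive field as a set of (row, column) offsets relative to the output pixel. The helper name `stacked_receptive_field` is made up for this illustration.

```python
def stacked_receptive_field(num_layers):
    """Offsets (d_row, d_col) of input pixels visible after num_layers layers."""
    taps_a = {(-1, -1), (-1, 0), (-1, 1), (0, -1)}   # Type A: no center tap
    taps_b = taps_a | {(0, 0)}                        # Type B: center allowed
    field = {(0, 0)}
    for layer in range(num_layers):
        taps = taps_a if layer == 0 else taps_b
        # Each layer lets every reachable position look through one more tap.
        field = {(r + dr, c + dc) for (r, c) in field for (dr, dc) in taps}
    return field

field = stacked_receptive_field(6)
print((-1, 1) in field)   # True: the up-right neighbor is visible
print((-1, 2) in field)   # False: one row up, two columns right is never reached
```

Every reachable offset satisfies `d_col <= -d_row`, so pixels to the right of that diagonal in the rows above stay invisible at any depth.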

Gated PixelCNN

The Gated PixelCNN (van den Oord et al., 2016) uses:

$$\mathbf{y} = \tanh(W_{k,f} * \mathbf{x}) \odot \sigma(W_{k,g} * \mathbf{x})$$

This is a gated activation (like an LSTM/GRU gate), where $\sigma$ acts as a multiplicative gate on the $\tanh$ activation. Combined with vertical and horizontal stacks, this significantly outperforms the original PixelCNN.
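A minimal sketch of the gated activation on its own, assuming the two convolution outputs $W_{k,f} * \mathbf{x}$ and $W_{k,g} * \mathbf{x}$ have already been computed and are passed in as plain lists (`filter_pre` and `gate_pre` are illustrative names):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gated_activation(filter_pre, gate_pre):
    """Elementwise tanh(filter) * sigmoid(gate) on two equal-length lists."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_pre, gate_pre)]

# Same filter value, gate closed vs. open, plus a half-open gate.
y = gated_activation([2.0, 2.0, -1.0], [-10.0, 10.0, 0.0])
print(y)
```

A strongly negative gate input suppresses the unit almost entirely, while a strongly positive one passes the $\tanh$ value through nearly unchanged.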

PixelRNN

PixelRNN (van den Oord et al., 2016) uses Row LSTM and Diagonal BiLSTM architectures.

Row LSTM: Processes the image row by row. Each row's hidden state conditions on the previous row and the current row up to the current pixel. A 1D convolution along the row captures local spatial dependencies.

The hidden state update:

$$\mathbf{h}_{h,w} = \text{LSTM}(\mathbf{x}_{h,w}, \mathbf{h}_{h-1,w}, \mathbf{h}_{h,w-1})$$

The LSTM has a triangular (causal) receptive field growing upward and leftward.

Diagonal BiLSTM: Skews the image so that dependencies align diagonally, then processes along diagonals. This gives a larger effective receptive field at the cost of implementation complexity.

PixelRNN achieves better log-likelihood than early PixelCNN but is slower to train (sequential RNN operations vs. parallel convolutions).


⚠️ CRITICAL β€” WaveNet and Dilated Causal Convolutions

WaveNet (van den Oord et al., 2016) was designed for raw audio generation (16 kHz sample rate, i.e. 16,000 values per second). The key innovation: dilated causal convolutions that grow the receptive field exponentially with depth.

Causal Convolution

A causal convolution ensures output at time $t$ depends only on inputs at times $\leq t$:

$$y_t = \sum_{i=0}^{k-1} W_i \cdot x_{t-i}$$

Implemented by left-padding the input with $k-1$ zeros, so the output has the same length as the input and no future sample is ever read.
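The formula and the padding trick can be sketched directly in plain Python. The function name `causal_conv1d` is an illustrative choice, and the weights are arbitrary numbers picked to make the output easy to read.

```python
def causal_conv1d(x, w):
    """y[t] = sum_i w[i] * x[t - i], with x[s] treated as 0 for s < 0."""
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)      # left padding only
    return [sum(w[i] * padded[(k - 1) + t - i] for i in range(k))
            for t in range(len(x))]

x = [1, 2, 3, 4]
y = causal_conv1d(x, [1, 10])               # y[t] = x[t] + 10 * x[t-1]
print(y)                                    # [1.0, 12, 23, 34]

# Causality check: changing the last input leaves earlier outputs untouched.
y_changed = causal_conv1d([1, 2, 3, 99], [1, 10])
print(y_changed[:3] == y[:3])               # True
```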

Dilation

A dilated convolution with dilation factor $d$ spaces the kernel elements:

$$y_t = \sum_{i=0}^{k-1} W_i \cdot x_{t - d \cdot i}$$

Dilation is doubled at each layer: $d = 1, 2, 4, 8, \ldots, 2^{L-1}$ for $L$ layers.

Receptive Field

The receptive field after $L$ layers of dilated convolutions with kernel size $k$:

$$\text{RF} = 1 + (k-1)\sum_{\ell=0}^{L-1} d_\ell = 1 + (k-1)\sum_{\ell=0}^{L-1} 2^\ell = 1 + (k-1)(2^L - 1)$$

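For $k=2$ and $L=30$ layers with dilation doubling throughout: RF $= 1 + 1 \cdot (2^{30} - 1) = 2^{30} \approx 1.07 \times 10^9$ samples. At 16 kHz this would span about 18.6 hours of audio, far more than long-range dependencies require. The published WaveNet instead repeats a shorter schedule ($1, 2, 4, \ldots, 512$, then back to $1$), so its receptive field follows the general form $1 + (k-1)\sum_\ell d_\ell$ and grows linearly in the number of repeated blocks.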
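The receptive-field formula can be checked against a direct measurement. The sketch below computes $1 + (k-1)\sum_\ell d_\ell$ and then pushes a unit impulse through a stack of dilated causal convolutions with all-ones weights to see how far it spreads. The function names and the all-ones weights are assumptions made for the illustration.

```python
def receptive_field(kernel_size, dilations):
    """RF = 1 + (k - 1) * sum of the per-layer dilations."""
    return 1 + (kernel_size - 1) * sum(dilations)

def dilated_causal_conv(x, w, d):
    # y[t] = sum_i w[i] * x[t - d*i], treating x[s] as 0 for s < 0
    return [sum(w[i] * x[t - d * i] for i in range(len(w)) if t - d * i >= 0)
            for t in range(len(x))]

def measured_receptive_field(kernel_size, dilations, length=64):
    """Send an impulse at t=0 through the stack and count how far it reaches."""
    signal = [1.0] + [0.0] * (length - 1)
    for d in dilations:
        signal = dilated_causal_conv(signal, [1.0] * kernel_size, d)
    return max(t for t, v in enumerate(signal) if v != 0.0) + 1

doubling = [2 ** l for l in range(4)]                   # 1, 2, 4, 8
print(receptive_field(2, doubling))                     # 16
print(measured_receptive_field(2, doubling))            # 16
print(receptive_field(3, [2 ** l for l in range(10)]))  # 2047
# Repeating the schedule 1..512 three times (an illustrative choice):
print(receptive_field(2, [2 ** l for l in range(10)] * 3))  # 3070
```

Because the formula depends only on the sum of dilations, repeating a schedule adds to the receptive field linearly, while extending the doubling adds to it exponentially.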

WaveNet Architecture

Residual block:

    input → dilated causal conv → tanh ⊙ σ (gate) → 1×1 conv → + input (residual)
                                      ↓
                                  skip connection → summed across layers → output

The gated activation (like Gated PixelCNN) uses $\tanh$ and sigmoid branches that multiply. Skip connections from every layer are summed to produce the final output, which parameterizes a 256-way softmax over $\mu$-law quantized audio values.
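The $\mu$-law step can be sketched as follows, using the standard companding formula $f(x) = \operatorname{sign}(x)\,\ln(1 + \mu|x|)/\ln(1 + \mu)$ with $\mu = 255$. The rounding rule that maps the companded value to one of 256 integer classes is one common convention and is an assumption here, as are the function names.

```python
import math

MU = 255  # gives 256 quantization levels

def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in {0, ..., mu}."""
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1) / 2 * mu + 0.5)     # round to the nearest bin

def mu_law_decode(q, mu=MU):
    """Approximate inverse: integer class back to a sample in [-1, 1]."""
    compressed = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu,
                         compressed)

print(mu_law_encode(-1.0), mu_law_encode(1.0))      # 0 255
q = mu_law_encode(0.5)
print(q, mu_law_decode(q))
```

The logarithmic spacing gives quiet samples finer resolution than loud ones, which is why 256 classes are enough for the softmax output.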


Autoregressive Image Generation: Modern Connections

VQ-VAE + Autoregressive Prior

VQ-VAE (vector-quantized VAE) compresses images into a discrete latent grid, then trains an autoregressive model (PixelCNN or Transformer) on the discrete latents. This decouples compression (VQ-VAE) from density modeling (autoregressive).

Two-stage training:

  1. Train VQ-VAE: encoder $E(\mathbf{x}) \to \mathbf{z}$, nearest-neighbor quantize to codebook, decoder reconstructs
  2. Train autoregressive model over discrete code indices $p(z_1, \ldots, z_{HW})$

This approach (used in VQ-GAN, DALL·E 1) is faster than pixel-level autoregression because the latent grid is much smaller (e.g., $16 \times 16$ vs. $256 \times 256$ pixels).
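The quantization step that links the two stages can be sketched with toy numbers. The codebook and the $2 \times 2$ grid of encoder outputs below are invented for illustration; a real VQ-VAE learns both.

```python
def quantize(z, codebook):
    """Index of the codebook vector closest to z (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))

# Hypothetical codebook with K=4 entries of dimension 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical 2x2 grid of encoder outputs E(x).
encoder_grid = [[[0.1, -0.2], [0.9, 0.2]],
                [[0.2, 0.8], [1.1, 0.7]]]

indices = [[quantize(z, codebook) for z in row] for row in encoder_grid]
sequence = [k for row in indices for k in row]   # raster-scan order

print(indices)    # [[0, 1], [2, 3]]
print(sequence)   # [0, 1, 2, 3]
```

The autoregressive prior in stage two is trained on sequences like `sequence`, so its length is the number of latent positions rather than the number of pixels.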

Connection to Language Models

Modern LLMs (GPT, LLaMA) are autoregressive models over discrete tokens:

$$p(\text{tokens}) = \prod_{i=1}^{N} p(t_i \mid t_1, \ldots, t_{i-1};\; \text{prompt})$$

The architecture is a causally-masked Transformer (decoder-only). Key parallels:

- PixelCNN's masked convolutions $\leftrightarrow$ causal attention mask in Transformers
- WaveNet's sample-by-sample prediction with a softmax over quantized values $\leftrightarrow$ the "next-token prediction" paradigm over a vocabulary
- Both optimize exact log-likelihood via cross-entropy
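The causal attention mask plays the same role as the PixelCNN mask: position $i$ may only attend to positions $j \leq i$. A minimal sketch, assuming the raw attention scores are already given as a square list of lists (the function name is illustrative):

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], restricted to j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]             # positions 0..i only
        m = max(visible)                         # subtract max for stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

w = causal_attention_weights([[0.0] * 3 for _ in range(3)])
for row in w:
    print(row)
```

With equal scores, row $i$ spreads its weight uniformly over the first $i+1$ positions and puts none on later ones.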


Training and Evaluation

Training: Maximum likelihood via cross-entropy loss for discretized data:

$$\mathcal{L} = -\frac{1}{D}\sum_{i=1}^{D} \log p_\theta(x_i \mid \mathbf{x}_{<i})$$

This is $O(D)$ per datapoint but, for masked-convolution and Transformer models, fully parallelizable during training: with teacher forcing, all conditionals are computed simultaneously using the ground-truth $\mathbf{x}_{<i}$.
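A small sketch of the teacher-forced loss for binary data. The lookup table of conditionals is a hypothetical stand-in for a trained network and the values are illustrative only. Every term uses the true prefix, so the terms are independent of one another and could be evaluated in parallel even though this sketch loops over them.

```python
import math

def average_nll(x, conditional):
    """-(1/D) * sum_i log p(x_i | x_<i), each term using the true prefix."""
    total = 0.0
    for i, xi in enumerate(x):
        p_one = conditional(x[:i])          # ground-truth prefix, not a sample
        total += math.log(p_one if xi == 1 else 1.0 - p_one)
    return -total / len(x)

# Hypothetical conditionals p(x_i = 1 | prefix) for one particular sequence.
table = {(): 0.6, (1,): 0.3, (1, 0): 0.4, (1, 0, 1): 0.8}

def toy_conditional(prefix):
    return table[tuple(prefix)]

x = [1, 0, 1, 1]
nll = average_nll(x, toy_conditional)
print(nll)
print(math.exp(-len(x) * nll))   # recovers the joint probability p(x)
```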

Likelihood evaluation: Exact log-likelihood via the chain rule:

$$\log p(\mathbf{x}) = \sum_{i=1}^{D} \log p(x_i \mid \mathbf{x}_{<i})$$

This is a major advantage over GANs (no likelihood) and VAEs (lower bound only).

Bits per dimension (BPD):

$$\text{BPD} = -\frac{\log_2 p(\mathbf{x})}{D}$$

Lower BPD = better compression. PixelCNN achieves ~3.0 BPD on CIFAR-10; state-of-the-art around 2.8 BPD.
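As a quick sanity check on the BPD formula, the sketch below converts a log-likelihood in nats to bits per dimension and evaluates it for a model that assigns uniform probability to each of 256 values per dimension, which should cost exactly 8 bits. The function name is an illustrative choice.

```python
import math

def bits_per_dim(log_likelihood_nats, num_dims):
    """BPD = -log2 p(x) / D, with log p(x) supplied in nats."""
    return -log_likelihood_nats / (num_dims * math.log(2))

D = 32 * 32 * 3                             # CIFAR-10 sized image
uniform_log_p = D * math.log(1.0 / 256.0)   # uniform over 256 values per dim
print(bits_per_dim(uniform_log_p, D))       # approximately 8.0
```

Any useful model on 8-bit images should therefore land below 8 BPD.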



Key Terms

Worked Examples

Example 1: Chain Rule Decomposition

An image has 4 pixels (2×2), each taking values in $\{0, 1\}$. Write the full autoregressive factorization and compute $p(1, 0, 1, 1)$ given:

- $p(x_1 = 1) = 0.6$
- $p(x_2 = 0 \mid x_1 = 1) = 0.7$
- $p(x_3 = 1 \mid x_1 = 1, x_2 = 0) = 0.4$
- $p(x_4 = 1 \mid x_1 = 1, x_2 = 0, x_3 = 1) = 0.8$

Solution:

$$p(1,0,1,1) = p(x_1{=}1) \cdot p(x_2{=}0 \mid x_1{=}1) \cdot p(x_3{=}1 \mid x_1{=}1,x_2{=}0) \cdot p(x_4{=}1 \mid x_1{=}1,x_2{=}0,x_3{=}1)$$

$$= 0.6 \cdot 0.7 \cdot 0.4 \cdot 0.8 = 0.1344$$

Answer: $p(1,0,1,1) = 0.1344 = 13.44\%$. The autoregressive factorization is exact by the chain rule, with no approximation or lower bound. Each factor conditions on all previous variables.

Example 2: WaveNet Receptive Field

A WaveNet has kernel size $k=3$ and $L=10$ layers with dilation doubling each layer ($d = 1, 2, 4, \ldots, 512$). Compute the receptive field. How many more layers to cover 1 second at 16 kHz?

Solution:

$$\text{RF} = 1 + (k-1)\sum_{\ell=0}^{9} 2^\ell = 1 + 2 \cdot (2^{10} - 1) = 1 + 2 \cdot 1023 = 2047 \text{ samples}$$

At 16 kHz: 2047/16000 ≈ 0.128 seconds.

For 1 second (16,000 samples): need RF ≥ 16000. Solve for $L$:

$$1 + 2(2^L - 1) \geq 16000 \implies 2^{L+1} - 1 \geq 16000 \implies 2^{L+1} \geq 16001$$

$2^{13} = 8192, 2^{14} = 16384 \implies L+1 = 14 \implies L = 13$ layers.

Answer: RF = 2047 samples (0.128 sec). Covering 1 second needs 13 layers, only 3 more than the 10 given. The exponential growth means each additional layer roughly doubles the receptive field, so going from 10 to 13 layers multiplies RF by about 8 (from 2047 to 16383 samples).

Example 3: PixelCNN Mask

For a 3×3 convolution kernel $W$, construct the Type A mask for a 2D grayscale image. Assume center is at position (1,1) with indices (0,0) top-left to (2,2) bottom-right.

Solution:

Type A mask: allow only previously generated pixels (above, or same-row-left). Center not allowed.

| | col 0 | col 1 | col 2 |
|---|---|---|---|
| row 0 | $M_{0,0}=1$ | $M_{0,1}=1$ | $M_{0,2}=1$ |
| row 1 | $M_{1,0}=1$ | $M_{1,1}=0$ | $M_{1,2}=0$ |
| row 2 | $M_{2,0}=0$ | $M_{2,1}=0$ | $M_{2,2}=0$ |

Row 0 (above): all allowed. Row 1 (same row): only left-of-center allowed. Row 2 (below): none allowed.

Answer: The mask zeros the center tap, the tap to its right, and the entire row below, so only already-generated pixels contribute. Type B would differ only in the center entry at position (1,1), which is set to 1. This enforces the raster-scan dependency: pixel $(h,w)$ can only see pixels already generated when processing in row-major order.

Practice Problems

  1. Given a binary sequence of length 3 and autoregressive conditionals $p(x_1{=}1)=0.5$, $p(x_2{=}1|x_1)=0.3$ if $x_1=1$ else 0.4, $p(x_3{=}1|x_1,x_2)=0.6$ if exactly one of $x_1,x_2$ is 1 else 0.2, compute $p(1, 0, 1)$.

    Answer: $p(1,0,1) = 0.5 \cdot p(x_2{=}0|x_1{=}1) \cdot p(x_3{=}1|x_1{=}1,x_2{=}0)$. Here $p(x_2{=}0|x_1{=}1) = 1 - 0.3 = 0.7$, and since $x_1=1, x_2=0$ means exactly one of them is 1, $p(x_3{=}1|\ldots) = 0.6$. So $p(1,0,1) = 0.5 \cdot 0.7 \cdot 0.6 = 0.21$.

  2. Show that for discrete data, the autoregressive negative log-likelihood equals the cross-entropy between the predicted distribution and the one-hot ground truth at each position.

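    Answer: $\mathcal{L} = -\frac{1}{D}\sum_i \log p_\theta(x_i \mid \mathbf{x}_{<i})$. At position $i$ the ground truth is the one-hot distribution $q_i(v) = \mathbb{1}[v = x_i]$, so the cross-entropy is $H(q_i, p_\theta) = -\sum_v q_i(v) \log p_\theta(v \mid \mathbf{x}_{<i}) = -\log p_\theta(x_i \mid \mathbf{x}_{<i})$. Averaging over positions gives $\mathcal{L}$.

  3. Why can't a standard (undilated) convolutional network achieve the same receptive field as WaveNet without prohibitive depth?
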
    Answer: Standard conv receptive field grows linearly: $RF = 1 + L(k-1)$. For $k=3$, $L=30$: $RF = 1 + 30 \cdot 2 = 61$. With WaveNet-style doubling dilation: $RF = 1 + 2(2^{30}-1) \approx 2 \times 10^9$. A standard network would need ~1 billion layers to match this. Dilated convolutions achieve exponential receptive field growth, which is essential for long sequences like audio.

  4. Explain why autoregressive models are naturally suited for text (language modeling) but face challenges for high-resolution images.

    Answer: Text is naturally sequential (1D), with ~$10^2$–$10^4$ tokens per sample. Images are 2D grids flattened to 1D, yielding $256 \times 256 \times 3 \approx 200,000$ "tokens" for modest resolution. The $O(D)$ sequential sampling becomes prohibitive. Text also has strong local dependencies; images require long-range 2D spatial dependencies that are awkward in 1D orderings. This is why VQ-VAE + autoregressive prior is preferred: compress to a small latent grid first, then model with the autoregressive prior.

  5. Compute bits per dimension for a model achieving $\ln p(\mathbf{x}) = -1200$ (natural log, i.e. nats) on a $32 \times 32 \times 3$ image with pixel values in $\{0, \ldots, 255\}$.

    Answer: $D = 32 \times 32 \times 3 = 3072$ dimensions. $\text{BPD} = -\frac{\log_2 p(\mathbf{x})}{D} = -\frac{\ln p(\mathbf{x})}{D \ln 2} = \frac{1200}{3072 \cdot 0.6931} = \frac{1200}{2129.2} \approx 0.564$. This is unrealistically low: SOTA is ~2.8 BPD on CIFAR-10. A value of 0.564 would mean near-perfect compression, achievable only if the data is nearly deterministic.
Summary

Key takeaways:

- Autoregressive models factor $p(\mathbf{x}) = \prod_i p(x_i \mid \mathbf{x}_{<i})$

Answer and Explanations

Correct: B) During training, to predict $x_i$, the model receives the true $\mathbf{x}_{<i}$