
25-10 — Multimodal Models Mathematics

Phase: 25 (Frontiers & Active Research Areas)
Subject: 25-10 (final subject in the curriculum)
Prerequisites: 18 (LLM Mathematics), 17 (Deep Learning Architectures), 22 (Generative Models), 20 (Training)
Next subject: None. This is the final subject in the LLM Researcher Mathematics Curriculum. Congratulations on completing the full 241-subject journey from Year 8 arithmetic to PhD-level AI research mathematics.

Learning Objectives

By the end of this subject, you will be able to:

Formulate the contrastive language-image pre-training (CLIP) objective as a symmetric cross-entropy over similarity scores
Derive the cross-attention mechanism for multimodal fusion and compare early vs late fusion strategies
Explain how images are tokenised into sequences for transformer processing (ViT patch embeddings)
Analyse the loss functions used in image-text models (contrastive, generative, masked modelling)
Design a multimodal training pipeline with modality-specific encoders and shared representation spaces

Core Content

1. The Multimodal Learning Problem

Multimodal models process and relate information from multiple modalities — typically text and images, but also audio, video, and other sensor data. The fundamental challenge: aligning representations across modalities with fundamentally different structures.

Key mathematical operations in multimodal learning:

- Encoding: Map each modality to a shared embedding space
- Alignment: Ensure corresponding concepts from different modalities map to nearby points
- Fusion: Combine modality-specific representations for downstream tasks
- Generation: Produce output in one modality conditioned on another

⚠️ CRITICAL — Why this matters for LLMs: The frontier of AI is multimodal. GPT-4V, Gemini, Claude, and other frontier models process both text and images. Understanding multimodal architectures is essential for working with or building next-generation AI systems.

2. Vision Transformers (ViT) — Image Tokenisation

To process images with a transformer, images must be converted to sequences of tokens, analogous to text tokenisation.

Patch embedding (Dosovitskiy et al., 2021): An image $I \in \mathbb{R}^{H \times W \times C}$ is divided into a grid of $N$ non-overlapping patches of size $P \times P$:

$$N = \frac{H}{P} \cdot \frac{W}{P}$$

Each patch is flattened to a vector of size $P^2 \cdot C$ and linearly projected to the model dimension $d$:

$$\mathbf{x}_i = \mathbf{E} \cdot \text{flatten}(\text{patch}_i) + \mathbf{p}_i$$

where $\mathbf{E} \in \mathbb{R}^{d \times (P^2 C)}$ is a learned embedding matrix and $\mathbf{p}_i$ is a learned positional embedding.

A special [CLS] token is prepended, and the sequence of $N+1$ tokens is processed by a standard transformer encoder.

For a 224×224 RGB image with 16×16 patches: $N = (224/16)^2 = 196$ patches, each of size $16 \cdot 16 \cdot 3 = 768$, mapped to $d$ dimensions (typically 768 or 1024). The transformer processes 197 tokens — comparable to a 200-word sentence.
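The patch-embedding step can be sketched in a few lines. This is a minimal illustration in plain Python (no array library), using a toy 8×8 image with 4×4 patches so that it runs instantly. The names `patchify` and `embed`, the toy weights `E`, and the positional table `pos` are assumptions made for the example, and the [CLS] token is left out.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened P*P*C patch vectors (row-major)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def embed(patches, E, pos):
    """x_i = E @ flatten(patch_i) + p_i, with E of shape d x (P*P*C)."""
    return [
        [sum(w * x for w, x in zip(E_row, flat)) + p for E_row, p in zip(E, pos_i)]
        for flat, pos_i in zip(patches, pos)
    ]


# Toy 8x8 RGB image with 4x4 patches: N = (8/4)^2 = 4 patches of 4*4*3 = 48 values each.
image = [[[(r + c + ch) % 7 / 7.0 for ch in range(3)] for c in range(8)] for r in range(8)]
patches = patchify(image, patch=4)

d = 5  # toy model dimension
E = [[0.01 * ((i + j) % 3 - 1) for j in range(48)] for i in range(d)]  # stand-in for learned weights
pos = [[0.1 * n for _ in range(d)] for n in range(len(patches))]       # stand-in for learned positions
tokens = embed(patches, E, pos)

print(len(patches), len(patches[0]), len(tokens), len(tokens[0]))  # 4 48 4 5
```

The same arithmetic with a 224×224 image and 16×16 patches gives the 196 patches of 768 values quoted above.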

3. CLIP — Contrastive Language-Image Pre-training

CLIP (Radford et al., 2021) is the foundational multimodal alignment model. It trains separate text and image encoders to produce embeddings where matched pairs have high cosine similarity.

Architecture:

- Image encoder: ViT or ResNet, outputting $\mathbf{v}_i \in \mathbb{R}^d$
- Text encoder: Transformer, outputting $\mathbf{t}_j \in \mathbb{R}^d$

Both encoder outputs are L2-normalised to the unit hypersphere.

Training objective — symmetric cross-entropy over similarity scores:

Given a batch of $B$ (image, text) pairs, compute the $B \times B$ similarity matrix:

$$S_{ij} = \frac{\mathbf{v}_i^\top \mathbf{t}_j}{\tau}$$

where $\tau$ is a learned temperature parameter controlling the sharpness of the softmax.

The loss has two symmetric components:

$$\mathcal{L}_{\text{CLIP}} = \frac{1}{2}\left(\mathcal{L}_{\text{I}\to\text{T}} + \mathcal{L}_{\text{T}\to\text{I}}\right)$$

Image-to-text loss: For image $i$, the correct text is $i$ (diagonal). Cross-entropy over all $B$ texts:

$$\mathcal{L}_{\text{I}\to\text{T}} = -\frac{1}{B}\sum_{i=1}^B \log \frac{\exp(S_{ii})}{\sum_{j=1}^B \exp(S_{ij})}$$

Text-to-image loss: For text $j$, the correct image is $j$:

$$\mathcal{L}_{\text{T}\to\text{I}} = -\frac{1}{B}\sum_{j=1}^B \log \frac{\exp(S_{jj})}{\sum_{i=1}^B \exp(S_{ij})}$$

⚠️ CRITICAL: CLIP's contrastive loss creates a shared embedding space where matched pairs are pulled together and non-matched pairs are pushed apart. This is the same principle as the InfoNCE loss studied in self-supervised learning (Phase 22). Because in-batch negatives provide the contrastive signal, large batches help: the original CLIP used a batch size of 32,768.
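The symmetric objective can be written out directly from the formulas above. The following is a minimal sketch in plain Python, assuming the embeddings are already L2-normalised lists of floats; the function names `log_softmax` and `clip_loss` are chosen for this example and are not taken from any CLIP codebase.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a 1-D list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def clip_loss(image_emb, text_emb, tau):
    """Symmetric cross-entropy over S_ij = v_i . t_j / tau; returns (total, i2t, t2i)."""
    B = len(image_emb)
    S = [[sum(a * b for a, b in zip(v, t)) / tau for t in text_emb] for v in image_emb]
    # Image-to-text: softmax over each row, target is the diagonal entry.
    i2t = -sum(log_softmax(S[i])[i] for i in range(B)) / B
    # Text-to-image: softmax over each column, target is again the diagonal entry.
    columns = [list(col) for col in zip(*S)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(B)) / B
    return 0.5 * (i2t + t2i), i2t, t2i


# Perfectly aligned, mutually orthogonal pairs: loss is close to zero.
aligned, _, _ = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], tau=0.07)
print(aligned < 1e-5)  # True

# Every embedding identical: all logits are equal, so the loss is log(B).
same = [[1.0, 0.0]] * 3
collapsed, _, _ = clip_loss(same, same, tau=0.07)
print(round(collapsed, 4))  # 1.0986, i.e. log(3)
```

The second case shows that identical embeddings give a uniform softmax and a loss of $\log B$, the same value an uninformative model would score.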

At inference time: CLIP enables zero-shot classification by comparing image embeddings to text embeddings of class names ("a photo of a dog", "a photo of a cat", ...) and selecting the most similar text.
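Zero-shot classification then reduces to a nearest-neighbour lookup in the shared space. The sketch below assumes the embeddings have already been produced by the two encoders; the 2-D vectors are made up for illustration, and `zero_shot_classify` is a hypothetical helper name. It accepts several prompt embeddings per class and averages them, which collapses to the single-prompt case when each list has one entry.

```python
import math


def normalise(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, prompt_embs_by_class):
    """Pick the class whose (prompt-averaged) text embedding is most similar to the image."""
    image_emb = normalise(image_emb)
    scores = {}
    for name, prompt_embs in prompt_embs_by_class.items():
        mean = [sum(dim) / len(prompt_embs) for dim in zip(*prompt_embs)]
        class_emb = normalise(mean)  # re-normalise so the dot product is a cosine similarity
        scores[name] = sum(a * b for a, b in zip(image_emb, class_emb))
    return max(scores, key=scores.get), scores


# Made-up embeddings standing in for text-encoder outputs of two prompts per class.
prompt_embs_by_class = {
    "dog": [[0.9, 0.1], [0.8, 0.3]],
    "cat": [[0.1, 0.9], [0.2, 0.7]],
}
label, scores = zero_shot_classify([0.7, 0.2], prompt_embs_by_class)
print(label)  # dog
```

Because the class embeddings do not depend on the image, they can be computed once and reused for every image that is classified.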

4. Cross-Attention for Multimodal Fusion

While CLIP keeps modalities separate (comparing only at the final embedding), many tasks require deep interaction between modalities. Cross-attention enables this.

Standard cross-attention: Given queries from modality $A$ and keys/values from modality $B$:

$$\text{CrossAttn}(\mathbf{Q}_A, \mathbf{K}_B, \mathbf{V}_B) = \text{softmax}\left(\frac{\mathbf{Q}_A \mathbf{K}_B^\top}{\sqrt{d_k}}\right)\mathbf{V}_B$$

This allows modality $A$ to "attend to" modality $B$, retrieving relevant information. The output is in the dimensionality of modality $A$ but incorporates information from modality $B$.

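In practice: A vision encoder produces image features, and a language model (LLM) consumes them in one of two common ways. Flamingo interleaves gated cross-attention layers with the LLM's self-attention layers; the image features serve as keys and values, and the text tokens serve as queries. LLaVA instead projects the image features into the LLM's token-embedding space and places them in the input sequence, so fusion happens through ordinary self-attention.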
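The cross-attention formula above translates almost line for line into code. This is a single-head sketch in plain Python with no learned projections (the inputs are assumed to be already-projected `Q`, `K`, `V`), intended only to make the shapes and the softmax axis concrete.

```python
import math


def softmax(xs):
    """Numerically stable softmax of a 1-D list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with Q: n_q x d_k, K: n_kv x d_k, V: n_kv x d_v."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        alpha = softmax(scores)  # one distribution over the keys per query
        weights.append(alpha)
        outputs.append([sum(a * v[c] for a, v in zip(alpha, V)) for c in range(len(V[0]))])
    return outputs, weights


# Two text queries attending over three image-patch keys/values (toy numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out, attn = cross_attention(Q, K, V)
print(len(out), len(out[0]))  # 2 3
```

The output has one row per query (text token), and each row is a convex combination of the value vectors (image patches).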

5. Fusion Strategies: Early vs Late

| Strategy | How it works | Pros | Cons |
| --- | --- | --- | --- |
| Early fusion | Concatenate raw features from all modalities before any processing | Simple, all layers see all modalities | Input dimensionality grows, modality-specific patterns harder to learn |
| Middle fusion | Process modalities separately, then fuse at intermediate layers via cross-attention | Balances modality-specific processing with interaction | More complex architecture |
| Late fusion | Process modalities completely independently, combine only at final embedding (e.g., CLIP) | Simple, modalities can use different architectures | No cross-modal reasoning: cannot answer "is the red ball to the left of the blue cube?" |

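Published multimodal LLMs typically use middle fusion: the image is processed by a vision encoder into features, which are then injected into the LLM, either through cross-attention at intermediate layers (as in Flamingo) or by concatenating visual tokens with text tokens in the input sequence (as in LLaVA). Closed models such as GPT-4V, Gemini, and Claude 3 are generally assumed to follow a similar pattern, though their architectures are not fully public.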

6. Multimodal Tokenisation and Unified Architectures

Unified token approach: Convert everything to tokens in a shared vocabulary:

- Text tokens: standard subword tokenisation (BPE)
- Image tokens: patch embeddings or discrete VQ-VAE tokens
- Audio tokens: spectrogram patches or neural audio codec tokens

A single transformer then processes the interleaved multimodal token sequence, so the same architecture handles all modalities. This is the approach reported for Gemini and several recent multimodal LLMs.

Mathematical formulation: Given modalities $m_1, m_2, \ldots, m_k$, each with encoder $E_{m_i}$ mapping to shared token space $\mathbb{R}^d$, the full sequence is:

$$\mathbf{X} = [E_{m_1}(x_{m_1}), E_{m_2}(x_{m_2}), \ldots, E_{m_k}(x_{m_k})]$$

A causal transformer with parameters $\theta$ processes this sequence, with special tokens or learned embeddings indicating modality boundaries. The unified loss is typically next-token prediction across the entire multimodal sequence — the model learns to predict both text tokens and image patch tokens.
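Building the sequence $\mathbf{X}$ can be sketched as follows. The per-modality encoder outputs are replaced by made-up vectors, and a learned modality embedding is modelled as a fixed vector added to every token of that modality; `build_sequence` and `modality_embedding` are hypothetical names for this example, not part of any specific model.

```python
def build_sequence(segments, modality_embedding):
    """Concatenate per-modality token embeddings into one sequence X.

    segments: list of (modality_name, tokens) in sequence order, tokens being d-dim vectors
    modality_embedding: dict mapping modality_name to a d-dim vector added to every token
    """
    sequence, modality_ids = [], []
    for name, tokens in segments:
        marker = modality_embedding[name]
        for token in tokens:
            sequence.append([x + m for x, m in zip(token, marker)])
            modality_ids.append(name)
    return sequence, modality_ids


d = 4
image_tokens = [[0.1 * i] * d for i in range(3)]        # stand-in for E_image(x_image): 3 patch tokens
text_tokens = [[1.0 + 0.1 * i] * d for i in range(2)]   # stand-in for E_text(x_text): 2 subword tokens
modality_embedding = {"image": [0.5, 0.0, 0.0, 0.0], "text": [0.0, 0.5, 0.0, 0.0]}

X, ids = build_sequence([("image", image_tokens), ("text", text_tokens)], modality_embedding)
print(len(X), ids)  # 5 ['image', 'image', 'image', 'text', 'text']
```

The `ids` list records which positions belong to which modality, which is the information a training loop would need in order to decide where the next-token loss applies.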

Key Terms

CLIP
Cross-attention
Early fusion
Fusion strategy
Late fusion
Middle fusion
Modern multimodal LLMs
Unified token architectures
Vision Transformers

Worked Examples

Example 1: CLIP Similarity Computation

Problem: In a batch of 3 image-text pairs, the (approximately) L2-normalised embeddings are:

- Images: $\mathbf{v}_1 = [0.8, 0.6], \mathbf{v}_2 = [0.3, 0.95], \mathbf{v}_3 = [-0.5, 0.87]$
- Texts: $\mathbf{t}_1 = [0.7, 0.71], \mathbf{t}_2 = [0.2, 0.98], \mathbf{t}_3 = [-0.6, 0.8]$

With temperature $\tau = 0.07$, compute the CLIP loss.

Solution:

Step 1. Similarity matrix $S_{ij} = \mathbf{v}_i^\top \mathbf{t}_j / \tau$:

$S_{11} = (0.8 \cdot 0.7 + 0.6 \cdot 0.71)/0.07 = (0.56 + 0.426)/0.07 = 0.986/0.07 = 14.086$

$S_{12} = (0.8 \cdot 0.2 + 0.6 \cdot 0.98)/0.07 = (0.16 + 0.588)/0.07 = 0.748/0.07 = 10.686$

$S_{13} = (0.8(-0.6) + 0.6 \cdot 0.8)/0.07 = (-0.48 + 0.48)/0.07 = 0$

Continuing in the same way: $S_{21} = 0.8845/0.07 = 12.636$, $S_{22} = 0.991/0.07 = 14.157$, $S_{23} = 0.58/0.07 = 8.286$, $S_{31} = 0.2677/0.07 = 3.824$, $S_{32} = 0.7526/0.07 = 10.751$, $S_{33} = 0.996/0.07 = 14.229$

$$S \approx \begin{bmatrix} 14.09 & 10.69 & 0 \\ 12.64 & 14.16 & 8.29 \\ 3.82 & 10.75 & 14.23 \end{bmatrix}$$

Step 2. Image-to-text softmax (row-wise):

Row 1: $\text{softmax}([14.09, 10.69, 0]) \approx [0.968, 0.032, 0.000]$

Row 2: $\text{softmax}([12.64, 14.16, 8.29]) \approx [0.179, 0.819, 0.002]$

Row 3: $\text{softmax}([3.82, 10.75, 14.23]) \approx [0.000, 0.030, 0.970]$

$\mathcal{L}_{\text{I}\to\text{T}} = -(\log 0.968 + \log 0.819 + \log 0.970)/3 = (0.033 + 0.200 + 0.030)/3 = 0.263/3 \approx 0.088$

Step 3. Text-to-image softmax (column-wise). The matrix is not symmetric, so the columns must be normalised separately. The diagonal probabilities are $0.810$ (column 1), $0.940$ (column 2), and $0.997$ (column 3), giving

$\mathcal{L}_{\text{T}\to\text{I}} = -(\log 0.810 + \log 0.940 + \log 0.997)/3 = (0.211 + 0.062 + 0.003)/3 = 0.276/3 \approx 0.092$

Total: $\mathcal{L}_{\text{CLIP}} = (0.088 + 0.092)/2 = 0.090$

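The loss is low because every diagonal (matched-pair) logit is the largest entry in its row and column. The cosine similarities themselves differ only modestly in places, but dividing by $\tau = 0.07$ turns a cosine gap of 0.1 into a logit gap of about 1.4, which the softmax amplifies further.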
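Hand arithmetic of this kind is easy to get wrong, so it is worth recomputing the example from the stated embeddings. The sketch below uses only the vectors and temperature given in the problem; the helper name `diagonal_cross_entropy` is an assumption made for this example.

```python
import math

V = [[0.8, 0.6], [0.3, 0.95], [-0.5, 0.87]]   # image embeddings from the problem
T = [[0.7, 0.71], [0.2, 0.98], [-0.6, 0.8]]   # text embeddings from the problem
tau = 0.07

# S_ij = v_i . t_j / tau
S = [[sum(a * b for a, b in zip(v, t)) / tau for t in T] for v in V]


def diagonal_cross_entropy(rows):
    """Mean over k of -log softmax(rows[k])[k], computed stably."""
    total = 0.0
    for k, row in enumerate(rows):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total -= row[k] - log_z
    return total / len(rows)


loss_i2t = diagonal_cross_entropy(S)                            # softmax over rows
loss_t2i = diagonal_cross_entropy([list(c) for c in zip(*S)])   # softmax over columns
loss = 0.5 * (loss_i2t + loss_t2i)

for row in S:
    print([round(x, 3) for x in row])
print(round(loss_i2t, 3), round(loss_t2i, 3), round(loss, 3))
```

Transposing `S` with `zip(*S)` is what switches the softmax from rows (image-to-text) to columns (text-to-image).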

Example 2: Cross-Attention Fusion

Problem: A multimodal model has text queries $\mathbf{Q}_T \in \mathbb{R}^{5 \times 64}$ (5 tokens, $d=64$) and image key-value pairs $\mathbf{K}_I, \mathbf{V}_I \in \mathbb{R}^{196 \times 64}$ (196 ViT patches, $d=64$). Compute the cross-attention output for the first text token if its query has highest dot-product with image patch 42 ($\mathbf{k}_{42}^\top \mathbf{q}_1 = 8.0$) and negligible with others ($\approx 0$). The value for patch 42 is $\mathbf{v}_{42} = [0.5, 0.5, \ldots, 0.5]$ (uniform 64-dim).

Solution:

Attention weights for token 1: $\alpha_{1,j} = \text{softmax}(\mathbf{k}_j^\top \mathbf{q}_1 / \sqrt{64}) = \text{softmax}(\mathbf{k}_j^\top \mathbf{q}_1 / 8)$

$\mathbf{k}_{42}^\top \mathbf{q}_1 / 8 = 1.0$, all others $\approx 0$

$\alpha_{1,42} = e^{1.0} / (e^{1.0} + 195 \cdot e^0) = 2.718 / (2.718 + 195) = 0.0137$

$\alpha_{1,j \neq 42} = 1 / (2.718 + 195) = 0.00506$ each

Cross-attention output: $\mathbf{o}_1 = \sum_j \alpha_{1,j} \mathbf{v}_j = 0.0137\,\mathbf{v}_{42} + 0.00506\sum_{j\neq 42}\mathbf{v}_j$

If all other values are near-zero ($\mathbf{v}_j \approx \mathbf{0}$): $\mathbf{o}_1 \approx 0.0137 \cdot [0.5, \ldots, 0.5] = [0.0069, \ldots, 0.0069]$

Although patch 42 is the best match for $\mathbf{q}_1$, a scaled logit of 1.0 is not enough to stand out from 195 competitors: the softmax gives it only 1.4% of the attention, against about 0.5% for every other patch. This dilution is a known issue with cross-attention over large image feature sets; solutions include learned queries (Perceiver) or restricting attention to the top-K patches.
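The dilution effect is easy to explore numerically. The sketch below keeps the example's simplifying assumption that all other logits are exactly 0; `peak_attention_weight` is a hypothetical helper name. It also computes how large the scaled logit would have to be for the best patch to receive half of the attention.

```python
import math


def peak_attention_weight(scaled_logit, n_keys):
    """Softmax weight of one key with the given scaled logit when the other n_keys - 1 logits are 0."""
    return math.exp(scaled_logit) / (math.exp(scaled_logit) + (n_keys - 1))


d_k, n_patches = 64, 196
scaled = 8.0 / math.sqrt(d_k)  # the raw dot product 8.0 becomes a scaled logit of 1.0
print(round(peak_attention_weight(scaled, n_patches), 4))  # 0.0137

# Scaled logit needed for the best patch to receive half of the attention: ln(n_keys - 1).
half_logit = math.log(n_patches - 1)
print(round(half_logit, 2), round(peak_attention_weight(half_logit, n_patches), 2))  # 5.27 0.5
```

Under these assumptions the required logit grows as $\ln(N-1)$, so the threshold rises only logarithmically with the number of patches.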

Example 3: Image Patch Tokenisation

Problem: A 384×384 RGB image is tokenised with 32×32 patches for a ViT with $d=1024$. Compute the number of image tokens, the input dimension to the patch embedding, and the size of the embedding matrix $\mathbf{E}$.

Solution:

Number of patches: $N = (384/32)^2 = 12^2 = 144$ tokens (+ 1 CLS = 145 total)

Patch vector dimension: $32 \times 32 \times 3 = 3072$

Embedding matrix $\mathbf{E} \in \mathbb{R}^{1024 \times 3072}$: $1024 \times 3072 = 3,145,728$ parameters (3.1M params — small compared to transformer attention layers).

With positional embeddings $\mathbf{P} \in \mathbb{R}^{145 \times 1024}$: 148,480 additional parameters.
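The same bookkeeping can be wrapped in a small function for other image and patch sizes. This is a sketch under the example's assumptions (square patches, a single [CLS] token, no bias term counted for $\mathbf{E}$); `vit_input_stats` is a name invented for the illustration.

```python
def vit_input_stats(height, width, channels, patch, d_model, cls_token=True):
    """Token and parameter counts for the ViT input layer (patch embedding + positions)."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    n_patches = (height // patch) * (width // patch)
    patch_dim = patch * patch * channels
    seq_len = n_patches + (1 if cls_token else 0)
    return {
        "n_patches": n_patches,
        "seq_len": seq_len,
        "patch_dim": patch_dim,
        "embedding_params": d_model * patch_dim,   # entries of E, bias not counted
        "positional_params": seq_len * d_model,    # one d-dim vector per position
    }


print(vit_input_stats(384, 384, 3, 32, 1024))
# {'n_patches': 144, 'seq_len': 145, 'patch_dim': 3072,
#  'embedding_params': 3145728, 'positional_params': 148480}
```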

Practice Problems

Problem 1: For the CLIP objective, show that the gradient $\partial \mathcal{L}/\partial \mathbf{v}_i$ encourages $\mathbf{v}_i$ to move toward $\mathbf{t}_i$ and away from $\mathbf{t}_{j \neq i}$. Derive the exact form.

Problem 2: Why does CLIP learn the temperature $\tau$ rather than fix it as a hyperparameter? What happens if $\tau$ is too small or too large?

Problem 3: Compare the computational cost of early fusion (concatenating 1024-dim text + 1024-dim image features into a 2048-dim input) vs cross-attention fusion (separate 1024-dim encoders with cross-attention layers).

Problem 4: A multimodal LLM processes interleaved text and image tokens in a single sequence. The sequence is: $[IMG_1] [IMG_2] ... [IMG_196] Describe this image.$ The model predicts the next token. Explain why image tokens are NOT autoregressive targets in standard multimodal LLM training.

Problem 5: CLIP enables zero-shot classification by comparing image embeddings to text embeddings of class descriptions. For a 1000-class problem with 10 prompt templates per class ("a photo of a {}", "a blurry photo of a {}", ...), compute the number of forward passes needed and explain the ensembling.

Answers

**Problem 1:** Consider the image-to-text component, $\mathcal{L}_{\text{I}\to\text{T}} = -\frac{1}{B}\sum_i \log \frac{\exp(S_{ii})}{\sum_j \exp(S_{ij})}$ where $S_{ij} = \mathbf{v}_i^\top \mathbf{t}_j / \tau$. Treating $\mathbf{v}_i$ as a free vector (ignoring the L2-normalisation step), only the $i$-th term of the sum depends on $\mathbf{v}_i$, so

$$\frac{\partial \mathcal{L}_{\text{I}\to\text{T}}}{\partial \mathbf{v}_i} = \frac{1}{\tau B}\left(\sum_{j \neq i} p_j \mathbf{t}_j - (1-p_i)\,\mathbf{t}_i\right)$$

where $p_j = \exp(S_{ij})/\sum_k \exp(S_{ik})$ are the softmax probabilities. A gradient-descent step follows the negative gradient, so it moves $\mathbf{v}_i$ toward $\mathbf{t}_i$ (the positive pair) with weight $(1-p_i)$ and away from each $\mathbf{t}_{j\neq i}$ (the negatives) with weight $p_j$. When $p_i$ is large (good match), the pull toward $\mathbf{t}_i$ is small: the model has already succeeded for this pair. The text-to-image component contributes analogous terms, because $\mathbf{v}_i$ also appears in every column softmax.

**Problem 2:** $\tau$ controls the concentration of the similarity distribution. Too small: the logits $\mathbf{v}_i^\top \mathbf{t}_j/\tau$ are large in magnitude, softmax becomes nearly one-hot, and gradients vanish (saturated regime). Too large: the logits are close to 0, softmax becomes near-uniform, and the loss provides a weak learning signal. Learning $\tau$ lets training tune the contrastive sharpness to the scale of the embeddings instead of requiring a hyperparameter search. CLIP parameterises the temperature through a learned logit scale $1/\tau$, initialised to the equivalent of $\tau = 0.07$ (so $1/\tau \approx 14.3$) and clipped so that the logits are never scaled by more than 100 (that is, $\tau \ge 0.01$).

**Problem 3:** Count the attention-score cost per layer with $N_T \approx N_I = N$ tokens. Cross-attention fusion with separate encoders costs $O(N_T^2 d + N_I^2 d)$ for the two self-attentions plus $O(N_T N_I d)$ for cross-attention, giving $O(3N^2 d)$. The cost of early fusion depends on how the features are concatenated. Concatenating along the feature axis, as the problem states ($N$ tokens of width $2d$), costs $O(N^2 \cdot 2d) = O(2N^2 d)$ for attention scores, but every projection matrix grows from $d^2$ to $(2d)^2 = 4d^2$ parameters. Concatenating along the sequence axis instead ($2N$ tokens of width $d$) costs $O((2N)^2 d) = O(4N^2 d)$, so cross-attention fusion is about 25% cheaper in that setting. In both cases cross-attention fusion keeps modality-specific processing, which is often beneficial.

**Problem 4:** In this setup the image tokens are continuous visual features, not discrete symbols, so there is no vocabulary to predict over. The loss is computed only on text tokens (standard next-token prediction). The image tokens are conditioning context: the model attends to them but does not predict them. With discrete VQ-VAE tokenisation, image tokens are discrete and can be autoregressive targets; this is used in models like DALL-E and Parti that generate images autoregressively.

**Problem 5:** One forward pass through the image encoder per image (produces $\mathbf{v}$). For 1000 classes × 10 templates = 10,000 text prompts, each prompt needs one text-encoder forward pass; these can be batched (about 100 batched passes at batch size 100) and computed once, since the text embeddings do not depend on the image. Ensembling: average the 10 text embeddings per class, re-normalise, then compute the cosine similarity between $\mathbf{v}$ and each averaged class embedding. Predict the class with highest similarity. Template ensembling improves robustness to prompt phrasing.
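The Problem 1 gradient can be checked against finite differences. The sketch below reuses the embeddings from Worked Example 1, treats each $\mathbf{v}_i$ as a free vector (no normalisation step), and covers only the image-to-text component; the function names are chosen for this illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax of a 1-D list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def i2t_loss(V, T, tau):
    """L_{I->T} = -(1/B) sum_i log softmax_j(v_i . t_j / tau)[i]."""
    loss = 0.0
    for i, v in enumerate(V):
        p = softmax([sum(a * b for a, b in zip(v, t)) / tau for t in T])
        loss -= math.log(p[i])
    return loss / len(V)


def analytic_grad(V, T, tau, i):
    """(1/(tau*B)) * (sum_j p_j t_j - t_i), equal to the form with (1 - p_i) on t_i."""
    B = len(V)
    p = softmax([sum(a * b for a, b in zip(V[i], t)) / tau for t in T])
    return [
        (sum(p[j] * T[j][c] for j in range(B)) - T[i][c]) / (tau * B)
        for c in range(len(V[i]))
    ]


def numeric_grad(V, T, tau, i, eps=1e-6):
    """Central finite differences of the loss with respect to v_i."""
    grad = []
    for c in range(len(V[i])):
        plus = [row[:] for row in V]
        minus = [row[:] for row in V]
        plus[i][c] += eps
        minus[i][c] -= eps
        grad.append((i2t_loss(plus, T, tau) - i2t_loss(minus, T, tau)) / (2 * eps))
    return grad


V = [[0.8, 0.6], [0.3, 0.95], [-0.5, 0.87]]
T = [[0.7, 0.71], [0.2, 0.98], [-0.6, 0.8]]
for i in range(len(V)):
    a, n = analytic_grad(V, T, 0.07, i), numeric_grad(V, T, 0.07, i)
    print(i, [round(x, 4) for x in a], [round(x, 4) for x in n])
```

The two printed vectors should agree for every $i$, which supports the closed form for this component of the loss.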

Summary

Vision Transformers convert images to token sequences via patch embedding, enabling transformer processing of visual data.
CLIP aligns image and text representations through a symmetric contrastive loss (InfoNCE), with temperature-scaled cosine similarity as the matching score.
Cross-attention enables deep multimodal fusion by allowing one modality to query another's representations, as in Flamingo; LLaVA-style models reach a similar result by placing projected visual tokens in the LLM's input sequence.
Unified token architectures (everything-as-tokens) represent the current frontier — a single transformer processes interleaved multimodal sequences.
Fusion strategy (early vs middle vs late) represents a fundamental architectural choice with distinct computational and representational trade-offs.

Pitfalls

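Too few negatives: With a small batch, each example sees only a handful of in-batch negatives, so the contrastive task becomes easy and the learned embeddings discriminate poorly. Fully collapsed embeddings are not a minimum of this loss (they give $\log B$), so the failure mode is a weak training signal rather than collapse. The original CLIP was trained from scratch with a batch size of 32,768.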
Modality imbalance: One modality may dominate training. Loss weighting or gradient scaling across modalities is often necessary.
Prompt sensitivity: CLIP's zero-shot performance depends heavily on prompt wording. "a photo of a {}" may underperform on non-photographic domains. Prompt ensembling is standard practice.

Quiz

Question 1: In CLIP's contrastive loss, which pairs are pulled together in the embedding space?

A. All image-text pairs in the entire training dataset
B. Matched (diagonal) image-text pairs within each batch, while non-matched (off-diagonal) pairs are pushed apart
C. Randomly selected image-text pairs
D. Only text-text pairs; images are ignored in the loss

Correct Answer: B

Explanation

- **If you chose A:** CLIP operates on in-batch negatives only; it doesn't compare across the whole dataset (that's computationally infeasible).
- **If you chose B:** Correct. The symmetric cross-entropy loss treats diagonal pairs $(i,i)$ as positives and all $j \neq i$ as negatives, creating a contrastive embedding space.
- **If you chose C:** Random pairs would provide no learning signal; the loss relies on knowing which pairs truly match.
- **If you chose D:** CLIP's loss is symmetric over images and texts, so both modalities participate equally.

Question 2: A Vision Transformer (ViT) with 16×16 patches processing a 224×224 RGB image produces how many patch tokens (excluding the [CLS] token)?

A. 14
B. 16
C. 196
D. 224

Correct Answer: C

Explanation

- **If you chose A:** 14 is the grid dimension $(224/16 = 14)$, not the total number of patches.
- **If you chose B:** 16 is the patch size, not the number of patches.
- **If you chose C:** Correct. $N = (224/16)^2 = 14^2 = 196$ patches, each flattened from $16 \times 16 \times 3 = 768$ values and linearly projected to the model dimension.
- **If you chose D:** 224 is the image dimension in pixels, not the number of patches.

Question 3: In cross-attention for multimodal fusion, queries typically come from:

A. The image modality only
B. One modality (e.g., text tokens), while keys and values come from another modality (e.g., image patch features)
C. Random noise vectors
D. The optimizer's gradient buffer

Correct Answer: B

Explanation

- **If you chose A:** Cross-attention is directional, and queries can come from either modality depending on the task. Text-querying-image is common in multimodal LLMs.
- **If you chose B:** Correct. $\text{CrossAttn}(\mathbf{Q}_A, \mathbf{K}_B, \mathbf{V}_B)$ allows modality A to retrieve relevant information from modality B. In Flamingo-style models, text token queries attend to image feature keys/values.
- **If you chose C:** Queries come from actual data representations, not random noise.
- **If you chose D:** Gradients are not used as attention queries.

Question 4: What is the primary advantage of unified token architectures (everything-as-tokens) for multimodal models?

A. They train significantly faster than modality-specific architectures
B. A single transformer can process any modality (text, images, audio) without modality-specific architectural components, using a shared token vocabulary
C. They don't require positional embeddings
D. They eliminate the need for any loss function

Correct Answer: B

Explanation

- **If you chose A:** Unified architectures are not necessarily faster; they trade modality-specific efficiency for architectural simplicity.
- **If you chose B:** Correct. By mapping every modality to a shared $d$-dimensional token space, a single causal transformer processes interleaved multimodal sequences. This is the approach reported for Gemini and several recent multimodal LLMs.
- **If you chose C:** Positional embeddings are still needed to encode token order within each modality's segment.
- **If you chose D:** A loss function (typically next-token prediction) is always required for training.

Question 5: The temperature parameter $\tau$ in CLIP controls:

A. The learning rate schedule during training
B. The sharpness of the contrastive similarity distribution: smaller $\tau$ produces peakier softmax distributions, larger $\tau$ produces more uniform ones
C. The image resolution fed to the vision encoder
D. The batch size used in training

Correct Answer: B

Explanation

- **If you chose A:** $\tau$ is a parameter inside the loss function $S_{ij} = \mathbf{v}_i^\top \mathbf{t}_j / \tau$, unrelated to the learning rate.
- **If you chose B:** Correct. $\tau$ scales the logits before softmax. Too small $\tau$ → saturated softmax (near one-hot, vanishing gradients). Too large $\tau$ → near-uniform softmax (weak learning signal). CLIP learns $\tau$ during training, starting from an initial value of 0.07.
- **If you chose C:** Image resolution is a preprocessing hyperparameter, not controlled by the temperature.
- **If you chose D:** Batch size is a separate hyperparameter (32,768 in the original CLIP).

Question 6: What distinguishes middle fusion from late fusion in multimodal architectures?

A. Middle fusion uses more GPUs for parallel processing
B. Middle fusion processes modalities separately through initial layers, then fuses at intermediate layers (e.g., via cross-attention), enabling cross-modal reasoning; late fusion only combines at the final embedding layer
C. Late fusion always achieves better performance than middle fusion
D. There is no meaningful difference; the terms are interchangeable

Correct Answer: B

Explanation

- **If you chose A:** GPU count is an implementation detail, not what defines the fusion strategy.
- **If you chose B:** Correct. Late fusion (like CLIP) can only compare at the final embedding, so it can't answer "is the red ball to the left of the blue cube?" because that requires cross-modal reasoning. Middle fusion (cross-attention, token concatenation) enables deep interaction between modalities at multiple layers.
- **If you chose C:** Middle fusion generally enables richer cross-modal reasoning, though late fusion is simpler and works well for retrieval tasks.
- **If you chose D:** The terms refer to fundamentally different architectural choices with distinct capabilities and limitations.

A Note on Completion

You have reached the final subject of the LLM Researcher Mathematics Curriculum — a 241-subject journey spanning 26 phases from Year 8 arithmetic through to the cutting edge of AI research. The concepts covered here — multimodal alignment, contrastive learning, cross-attention fusion — are powering the frontier models of 2026.

From whole-number arithmetic to multimodal transformers: this is the mathematical foundation of modern AI research.

What's next? This curriculum provides the theoretical foundation. The next step is applied: implement these concepts, read current papers (arxiv.org), and contribute to open-source AI research. The mathematics you've studied is the language in which AI research is written — now go speak it.
