Module 3 · Lesson 1

What CLIP Is and How It Was Trained

OpenAI's 2021 model that learned to pair images and text — and quietly became the backbone of the generative image era.
How did a model trained on 400 million image-caption pairs become the universal interpreter between human language and machine vision?

When OpenAI published Learning Transferable Visual Models From Natural Language Supervision in January 2021, the reception among computer vision researchers was somewhere between fascination and alarm. The paper described a model — Contrastive Language–Image Pre-Training, or CLIP — that had been trained on roughly 400 million image-text pairs scraped from the internet, using no curated labels whatsoever. It matched the top-1 accuracy of a ResNet-50 trained on ImageNet without ever seeing a single labeled ImageNet example during training.

That zero-shot result was striking. But the more consequential fact, not yet apparent at the time, was that CLIP's encoders and joint embedding space, the mathematical bridge between pixels and words, would within about a year become the steering mechanism for many of the major text-to-image systems.

The Core Idea: Contrastive Pre-Training

CLIP consists of two separate encoder networks: a text encoder (a Transformer) and an image encoder (either a Vision Transformer or a modified ResNet). Each encoder takes its respective input, a sentence or an image, and projects it into a shared embedding space (512 dimensions for ViT-B/32, 768 for ViT-L/14).

Training uses a contrastive objective. Given a batch of N image-text pairs, the model produces N image embeddings and N text embeddings. The goal is to maximize the cosine similarity between the N correctly matched pairs while minimizing similarity for all N²−N mismatched pairs. In practice this is a symmetric cross-entropy loss over the N×N similarity matrix, usually called the InfoNCE loss (van den Oord et al., 2018). It comes from contrastive representation learning: SimCLR (Chen et al., 2020) uses a closely related loss for images alone, and ConVIRT (Zhang et al., 2020) had applied the idea to image-text pairs before CLIP.
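To make the objective concrete, here is a minimal pure-Python sketch of a symmetric contrastive loss over a tiny batch. It is an illustration, not OpenAI's implementation: the function names are hypothetical, the embeddings are toy 2-D vectors, and the temperature is fixed at 0.07 (CLIP's reported initial value; the real model learns it).

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of each batch is assumed to be a matched pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # image -> text direction: each row should pick out its own column
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # text -> image direction: each column should pick out its own row
    loss_t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of two pairs: correctly aligned vs. deliberately swapped captions.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each image is most similar to its own caption and large when the captions are swapped, which is the gradient signal that pulls matched pairs together.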

The critical insight is that scale makes this work. With small datasets, the contrastive signal is noisy and the model learns little. At 400 million pairs, the statistical regularities of language align tightly with visual regularities. The word "sunset" converges to a region of embedding space that actually neighbors orange-sky photographs — not because anyone told it to, but because the internet's image-caption pairing behavior encodes that relationship billions of times.

Figure: CLIP training loop (simplified). An image batch passes through the image encoder (ViT or ResNet) to give embeddings I₁…Iₙ; a caption batch passes through the text encoder (Transformer) to give embeddings T₁…Tₙ. InfoNCE loss: maximize sim(Iᵢ, Tᵢ), minimize sim(Iᵢ, Tⱼ) for j ≠ i.
The WIT Dataset and Data Curation

OpenAI called their training corpus WebImageText (WIT): 400 million image-text pairs collected by searching publicly available internet sources for pairs whose text contained one of 500,000 queries. The base query list was every word occurring at least 100 times in English Wikipedia, expanded with high-information bigrams, popular Wikipedia article titles, and WordNet synsets. The resulting dataset is diverse but noisy, mixing sources such as alt-text for stock photography, captions from news sites, and social media descriptions.

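The research team deliberately trained without the labeled training sets of ImageNet, CIFAR, or other benchmarks, to test whether purely naturalistic supervision could compete. On many benchmarks it could: zero-shot CLIP matched a supervised ResNet-50 on ImageNet and reached 99.3% on STL-10. It did not win everywhere, though; on specialized tasks such as EuroSAT satellite-image classification, zero-shot CLIP fell well behind supervised baselines.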

Later researchers, particularly the LAION project (discussed in Lesson 4), used CLIP's own similarity scores to filter massive web crawls, creating LAION-400M and LAION-5B by retaining only image-text pairs that CLIP itself judged to be well-matched. CLIP thus became both an artifact of web-scale training and the quality-control tool for the next generation of web-scale training.
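The filtering step described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the 2-D vectors stand in for real CLIP embeddings, `filter_pairs` is a hypothetical helper, and the 0.3 threshold is used only as an example value (LAION reported thresholds in this range).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_pairs(pairs, threshold=0.3):
    """Keep captions whose image and text embeddings are similar enough.

    pairs: list of (image_embedding, text_embedding, caption) tuples.
    """
    return [
        caption
        for image_emb, text_emb, caption in pairs
        if cosine_similarity(image_emb, text_emb) >= threshold
    ]

# Toy crawl: the second caption is unrelated to its image and should be dropped.
crawl = [
    ([1.0, 0.0], [0.8, 0.6], "a dog on a beach"),
    ([1.0, 0.0], [0.0, 1.0], "click here to subscribe"),
    ([0.6, 0.8], [0.6, 0.8], "a red bicycle"),
]
kept = filter_pairs(crawl)
```

Anything CLIP scores poorly is discarded, so the filtered dataset inherits CLIP's own notion of what counts as a good match.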

Key Architecture Variants

OpenAI released several CLIP model sizes. ViT-B/32 (Vision Transformer, 32×32-pixel patches) is fast and lightweight and became widely used in downstream systems. ViT-L/14 (14×14-pixel patches, a much larger model) became the standard for high-quality generation pipelines. Stable Diffusion 1.x used the ViT-L/14 text encoder; DALL-E 2 used a custom CLIP variant trained on more data. The patch size controls granularity: smaller patches capture finer spatial detail but produce more tokens per image and cost much more compute.
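The compute cost of smaller patches follows from simple arithmetic. The sketch below assumes a 224×224 input (the common default for these models; other resolutions exist) and uses the rough rule that self-attention cost grows with the square of the token count. The helper name is illustrative.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side

patches_b32 = num_patches(224, 32)  # 7 x 7 grid
patches_l14 = num_patches(224, 14)  # 16 x 16 grid

# Self-attention compares every token with every other token,
# so its cost grows roughly with the square of the token count.
attention_cost_ratio = (patches_l14 / patches_b32) ** 2
```

At this resolution the 14-pixel model sees 256 patches instead of 49, so each attention layer does roughly 27 times more pairwise comparisons, before accounting for the larger width and depth of ViT-L.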

Key Terms
CLIP: Contrastive Language–Image Pre-Training. OpenAI model (2021) trained to align image and text representations in a shared embedding space via contrastive loss on 400M web image-text pairs.
Contrastive Loss (InfoNCE): Training objective that pulls matching image-text pairs together in embedding space while pushing mismatched pairs apart. Scales effectively to batch sizes in the thousands.
Vision Transformer (ViT): Image encoder architecture that splits images into fixed-size patches and processes them with a Transformer. Introduced by Dosovitskiy et al. (Google Brain, 2020).
Zero-Shot Transfer: The ability to perform a classification task on a dataset never seen during training, using only a natural-language description of each class (e.g., "a photo of a dog").
WIT (WebImageText): OpenAI's 400M image-text training corpus for CLIP, assembled from public internet sources using Wikipedia-derived queries.
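Zero-shot transfer, as defined in the key terms above, reduces to comparing one image embedding against the text embeddings of the class descriptions. The following is a minimal sketch with made-up 3-D vectors in place of real encoder outputs; the function names and the temperature of 0.01 are assumptions chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softmax(scores):
    """Convert raw scores to probabilities (max subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, class_text_embs, temperature=0.01):
    """Pick the class prompt whose text embedding is closest to the image embedding."""
    prompts = list(class_text_embs)
    scores = [cosine_similarity(image_emb, class_text_embs[p]) / temperature for p in prompts]
    probs = softmax(scores)
    best = max(range(len(prompts)), key=lambda i: probs[i])
    return prompts[best], dict(zip(prompts, probs))

# Toy embeddings: in a real system these come from the image and text encoders.
image_embedding = [0.9, 0.1, 0.2]
class_prompts = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a cat": [0.1, 1.0, 0.0],
    "a photo of a car": [0.0, 0.2, 1.0],
}
label, probabilities = zero_shot_classify(image_embedding, class_prompts)
```

No classifier is trained here: changing the task only means changing the list of prompts.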

Lesson 1 Quiz

What CLIP Is and How It Was Trained
What training objective does CLIP use to align image and text representations?
Correct. InfoNCE loss pulls matching image-text pairs together and pushes all N²−N mismatched pairs apart within each training batch.
Not quite. CLIP uses no ImageNet labels at all — it trains entirely on web-scraped image-text pairs with a contrastive objective, not classification loss.
Approximately how many image-text pairs comprised OpenAI's WIT training corpus for CLIP?
Correct. WIT contained roughly 400 million image-text pairs sourced from the public internet using Wikipedia-derived queries as seeds.
Not quite. 400 million is the correct figure for WIT. The 5 billion figure belongs to LAION-5B, a later dataset that used CLIP's own scores to filter its contents.
How did later projects like LAION use CLIP in building their own datasets?
Correct. LAION-400M and LAION-5B were both filtered using CLIP cosine similarity scores — only pairs above a threshold were kept, making CLIP a quality-control tool for subsequent training data.
Not quite. LAION used the existing CLIP model's similarity scoring to filter web crawls — retaining pairs CLIP judged as well-matched — rather than retraining or fine-tuning it.

Lab 1 — CLIP's Training and Architecture

Discuss CLIP's design choices with an AI tutor. At least 3 exchanges to complete.

Your Investigation

You have a working understanding of how CLIP was trained. Now go deeper: explore why certain design choices matter, what the contrastive objective actually optimizes for, and what limitations arise from training on noisy web data.

Suggested starting point: "Why does CLIP work better at larger batch sizes, and what happens if you train it on a small, curated dataset instead of messy web data?"
CLIP Architecture Tutor
L1 Lab
Welcome. I'm here to help you explore CLIP's training methodology and architecture in depth. What aspect of CLIP's design would you like to interrogate — the contrastive objective, the data pipeline, the encoder choices, or something else?
Module 3 · Lesson 2

The Embedding Space

How CLIP's shared 512-dimensional space makes "a photo of a serene Japanese garden at dusk" point to something specific — and why that matters for generation.
What does it actually mean for a word and an image to be "close" in embedding space, and how do diffusion models exploit that proximity?

In 2022, the teams building the first widely used text-to-image systems faced a common question: how do you take a text prompt typed by a user and convert it into a signal that can guide a denoising process? Stable Diffusion, DALL-E 2, and many other systems of that era answered with CLIP, using either its joint embedding space or the outputs of its text encoder. CLIP was not perfect, but it was one of the few models trained at sufficient scale to make the geometry of language correspond to the geometry of images.

What an Embedding Space Is

An embedding is a vector — a list of numbers — that represents an object in a mathematical space. CLIP encodes both images and text into the same 512-dimensional space (or 768-dimensional for ViT-L/14). Two objects that are "similar" end up with vectors that point in nearly the same direction. Similarity is measured by cosine similarity: the cosine of the angle between two vectors. A score of 1.0 means identical direction; 0.0 means orthogonal (unrelated); negative values mean opposite.
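As a quick check on the three cases just described, here is a minimal cosine similarity function applied to hand-picked vectors. The vectors are toy values, not CLIP outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])                 # opposite directions
```

Note that the first pair differs in length but not direction, so the score ignores magnitude entirely.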

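After CLIP training, a photograph of a dog and the text "a dog" score a higher cosine similarity than the same photograph paired with unrelated text. The raw values are modest: matched image-text pairs typically score around 0.3 rather than near 1.0, so it is the ranking between candidates that carries the signal. The text "a small brown terrier running through grass" sits closer to a photo matching that description than to a photo of a large white poodle sitting still. The space has learned semantic structure: proximity reflects conceptual relationship, not pixel similarity.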

Linear Probes and What They Reveal

A useful test for embedding quality is the linear probe: freeze all encoder weights, extract embeddings from labeled examples, train only a single linear classifier on top, and measure accuracy. If the embedding space has structured the data well, a linear boundary will separate classes cleanly.
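A linear probe can be sketched with nothing more than a linear classifier trained on fixed vectors. The example below is a simplified stand-in: it uses a perceptron rather than the logistic regression typically used in practice, and the "frozen embeddings" are toy 2-D points rather than real encoder outputs.

```python
def train_linear_probe(embeddings, labels, epochs=20):
    """Train a perceptron (one linear boundary) on frozen embeddings.

    labels are +1 or -1; the encoder that produced the embeddings is never updated.
    """
    weights = [0.0] * len(embeddings[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified (or on the boundary): nudge the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias

def predict(weights, bias, x):
    """Classify one embedding with the learned linear boundary."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

# Toy frozen embeddings: two well-separated clusters, one per class.
frozen_embeddings = [[1.0, 0.1], [0.9, -0.1], [1.1, 0.0], [-1.0, 0.1], [-0.9, -0.2], [-1.0, 0.0]]
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_probe(frozen_embeddings, labels)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(frozen_embeddings, labels)) / len(labels)
```

Because only the weights and bias are trained, any accuracy the probe achieves must come from structure already present in the embeddings.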

In the original CLIP paper, linear probes on ViT-L/14 (at 336-pixel input resolution) achieved 85.4% top-1 accuracy on ImageNet, competitive with supervised models trained end-to-end on the same benchmark. This was remarkable because the CLIP encoder itself had never been trained on ImageNet labels; only the linear classifier on top had. The embeddings had organized themselves so that a 1,000-class taxonomy could be separated by a linear classifier.

The same spatial structure that makes linear probes work is what makes CLIP useful for image generation guidance: if embeddings have clean semantic geometry, then navigating toward a text embedding in that space moves you toward images that look like what the text describes.

Figure: Semantic structure in CLIP embedding space (conceptual). The text "golden retriever" and a photo of a golden retriever: relatively high cosine similarity. The text "golden retriever" and a photo of an airplane: low cosine similarity. The texts "puppy" and "golden retriever": text-text proximity encodes semantics.
CLIP Guidance in DALL-E 2

OpenAI's DALL-E 2 (April 2022) made the role of CLIP embedding space explicit in its architecture. The pipeline had three stages: (1) a text encoder converts the prompt to a CLIP text embedding; (2) a prior model (a diffusion model or autoregressive model) generates a CLIP image embedding conditioned on that text embedding; (3) a decoder diffusion model generates pixels conditioned on the image embedding.

The prior is the key innovation: it generates a point in the CLIP image embedding space that corresponds to the text, then the decoder synthesizes an image whose actual pixels match that location. This two-stage approach let the model benefit from CLIP's rich semantic structure while keeping the pixel-space decoder focused on visual quality.

The approach was documented in the paper Hierarchical Text-Conditional Image Generation with CLIP Latents (Ramesh et al., 2022), where ablation studies showed that using CLIP image embeddings via the prior substantially outperformed conditioning directly on text embeddings alone.

Limitations of CLIP Embedding Space

CLIP's embedding space has well-documented failure modes that directly affect image generation quality. Attribute binding is a persistent problem: CLIP often cannot reliably distinguish "a red cube next to a blue sphere" from "a blue cube next to a red sphere." The two sentences produce similar embeddings because the component words are the same — spatial and relational structure is weakly encoded.

Research from Stanford University (Yuksekgonul et al., 2022, "When and Why Vision-Language Models Behave like Bag-of-Words") showed that CLIP's text-image matching behaves more like a bag-of-words model than a compositional language model: it responds to which concepts are present, not to their relationships. A model guided solely by CLIP will tend to generate images where the right objects appear but in the wrong spatial configuration.
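The extreme case of this behavior is easy to demonstrate. The sketch below uses a literal bag-of-words representation (word counts only), which is a caricature: CLIP's text encoder is a Transformer and is not literally order-blind, but the paper's finding is that its matching behaves closer to this than one would hope.

```python
from collections import Counter

def bag_of_words(sentence):
    """Represent a sentence by its word counts, discarding word order."""
    return Counter(sentence.lower().split())

first = bag_of_words("a red cube next to a blue sphere")
second = bag_of_words("a blue cube next to a red sphere")

# Same words, different bindings: the two representations are indistinguishable.
indistinguishable = first == second
```

Any model whose representation is dominated by which words occur, rather than how they relate, will have the same trouble telling these two prompts apart.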

Later conditioning strategies address parts of this weakness: T5 text encoders (used in Imagen), cross-attention on full token sequences (used in Stable Diffusion's U-Net), and SDXL's dual-encoder approach combining CLIP and OpenCLIP.

Real-World Impact

The bag-of-words weakness documented in Yuksekgonul et al. (2022) is directly observable in early Stable Diffusion outputs. Prompts like "a cat sitting on a red chair" frequently produced images where the chair was not red, or where a cat and a red object were both present but unrelated. This was not a failure of the diffusion model's visual capability — it was a failure of the CLIP embedding to encode the relationship "sitting on" with sufficient specificity to guide the denoiser.

Key Terms
Cosine Similarity: Measure of vector alignment. Value between −1 and 1. Used to measure how "close" two CLIP embeddings are; higher scores mean more semantically related.
Linear Probe: Evaluation method: freeze encoder, extract embeddings, train a single linear classifier. Good performance indicates the embedding space has semantic structure.
Prior (in DALL-E 2): A model that generates CLIP image embeddings conditioned on CLIP text embeddings, bridging the modality gap before pixel decoding.
Bag-of-Words Behavior: When a model responds to which words are present in a prompt but ignores their order and relational structure (e.g., treating "A on B" and "B on A" similarly).
Attribute Binding: The task of correctly associating attributes (color, size, position) with the specific objects they modify. A known weakness of CLIP-only conditioning.

Lesson 2 Quiz

The Embedding Space
What does the DALL-E 2 "prior" model do in the generation pipeline?
Correct. The prior bridges the modality gap: it maps from text embedding space to image embedding space, allowing the decoder to synthesize pixels that correspond to that image-space location.
Not quite. The prior doesn't generate pixels — it generates a CLIP image embedding from a CLIP text embedding. The pixel-level decoder is a separate stage.
What does research by Yuksekgonul et al. (2022) reveal about CLIP's text-image matching?
Correct. "When and Why Vision-Language Models Behave like Bag-of-Words" (Yuksekgonul et al., 2022) showed CLIP treats word presence more than word relationships — a direct cause of attribute binding failures in generation.
Not quite. The paper found the opposite: CLIP ignores relational structure and responds primarily to which concepts are mentioned, not how they relate to each other.
What does a high linear probe accuracy on ImageNet embeddings indicate about CLIP?
Correct. Linear probe accuracy tests whether embeddings are semantically organized. High accuracy without task-specific training means the space has learned a meaningful structure that aligns with the classification taxonomy.
Not quite. CLIP explicitly avoided ImageNet training data. The high linear probe score arises from the emergent semantic structure of the embedding space, not from labeled supervision.

Lab 2 — Embedding Space and Generation Guidance

Explore how CLIP's geometry shapes generation quality. At least 3 exchanges to complete.

Your Investigation

You now understand that CLIP's embedding space has semantic structure but also well-documented weaknesses like attribute binding. Explore how these properties translate into practical generation behaviors and what architectural choices address them.

Suggested starting point: "If CLIP embeddings are bag-of-words, why do text-to-image models still produce reasonable images from compositional prompts? What compensates for this weakness?"
Embedding Space Tutor
L2 Lab
Ready to explore CLIP's embedding space and its role in generation. What would you like to understand — the geometry of the space, how the prior in DALL-E 2 works, or why bag-of-words behavior causes certain generation failures?
Module 3 · Lesson 3

Vision Encoders in Diffusion Models

From CLIP's global embedding to token-by-token cross-attention — how conditioning became richer and more controllable between 2021 and 2023.
Why did Stable Diffusion abandon single-vector CLIP conditioning in favor of cross-attention on the full token sequence — and what did that change about how prompts work?

When Robin Rombach and colleagues at the CompVis group at LMU Munich published High-Resolution Image Synthesis with Latent Diffusion Models in December 2021, they introduced a conditioning mechanism that would become standard across the field. Rather than using a single embedding vector to steer the denoiser, they fed the full token sequence from a text encoder, one vector per token, into the U-Net via cross-attention layers inserted at multiple resolutions. The original paper used a Transformer text encoder trained for the purpose; Stable Diffusion 1.x later swapped in CLIP's frozen text encoder. Either way, the U-Net could attend to individual words at different spatial scales, rather than responding to one compressed summary vector.

This architecture became Stable Diffusion when released publicly in August 2022. Its conditioning mechanism meant that a prompt like "a red apple on the left, a green pear on the right" gave the denoiser access to each word's individual embedding, not just a blended average — a step toward solving the attribute binding problem, even if it didn't fully solve it.

Single-Vector vs. Token-Sequence Conditioning

Early CLIP-guided generation (as in CLIP-guided diffusion from late 2021) typically worked by computing the CLIP embedding of the target prompt and then using gradient ascent during the diffusion process to steer samples toward higher cosine similarity with that embedding. This is a global conditioning mechanism: one vector, one direction.

The LDM / Stable Diffusion approach is different. The text encoder (CLIP ViT-L/14 in SD 1.x) processes the prompt token-by-token, producing a sequence of embeddings — typically 77 vectors of dimension 768. These are fed into the U-Net's cross-attention layers: the spatial features of the image being denoised act as queries, the token embeddings act as keys and values. Each spatial region of the image can attend to whichever words are most relevant to it.
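The query/key/value roles can be sketched with a tiny scaled dot-product attention in pure Python. This is a simplification under stated assumptions: real cross-attention layers first apply learned linear projections to the image features and token embeddings and use multiple heads, both omitted here, and the 2-D vectors are toy values.

```python
import math

def softmax(scores):
    """Convert raw scores to weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query (image position) takes a weighted average of values (text tokens)."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Two spatial positions (queries) attending over three text tokens (keys/values).
spatial_queries = [[4.0, 0.0], [0.0, 4.0]]
token_keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
token_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

conditioned, attention_weights = cross_attention(spatial_queries, token_keys, token_values)
```

Each spatial position ends up with its own mixture of token values, which is how different regions of the image can be conditioned on different words.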

This mechanism is structurally similar to how attention works in language models — but the queries come from a different modality (image features) than the keys/values (text embeddings). The result is that different parts of the image can be conditioned by different parts of the prompt simultaneously.

Figure: Cross-attention conditioning in Stable Diffusion's U-Net. The prompt "red apple left, green pear right" passes through the CLIP text encoder to give 77 token embeddings (keys and values); the noisy latent's spatial features supply the queries. Each spatial region attends to relevant tokens, producing a conditioned feature map that feeds the denoiser output.
Prompt Weighting and Token Importance

Because SD uses token-level conditioning, users and toolkits discovered they could manipulate individual token embeddings to amplify or dampen their effect. The practice of prompt weighting — writing (word:1.4) in Automatic1111 or similar interfaces — works by scaling the embedding vector of specific tokens before they are passed to the cross-attention layers. A higher weight makes that token's embedding louder relative to others, biasing the denoiser toward features associated with that word.
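The core of the idea fits in one function. This sketch shows only the scaling step; the function name and the 2-D embeddings are illustrative, and real interfaces may apply additional normalization after scaling.

```python
def apply_prompt_weights(token_embeddings, weights):
    """Scale each token's embedding vector by its user-assigned weight."""
    return [[w * x for x in emb] for emb, w in zip(token_embeddings, weights)]

tokens = ["a", "red", "apple"]
embeddings = [[0.2, 0.1], [0.5, -0.3], [0.4, 0.4]]  # toy per-token embeddings
weights = [1.0, 1.4, 1.0]                           # as if the user wrote (red:1.4)

weighted = apply_prompt_weights(embeddings, weights)
```

Only the emphasized token's vector changes; the others pass through untouched, so the edit is local to one word.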

This was not a documented feature at launch — it emerged from community experimentation and was later formalized in tools. It reveals that the model's relationship to text is fundamentally about the geometry of individual token embeddings in the conditioning space, not a monolithic reaction to the prompt as a whole sentence.

T5 Encoders — A Different Philosophy

Google Brain's Imagen (Saharia et al., May 2022) made a different architectural choice: instead of CLIP's text encoder, it used T5-XXL — a 4.6-billion parameter text encoder from the T5 language model family (Raffel et al., 2020), trained entirely on text with no image supervision. The motivation was that T5's language understanding was richer than CLIP's, which had been trained primarily to match images rather than to understand language compositionally.

The Imagen paper reported that scaling the text encoder improved image quality and prompt fidelity more than scaling the diffusion model did, a striking result suggesting that the quality of the text representation matters at least as much as the visual architecture. Moving from T5-Small to T5-XXL improved both FID and image-text alignment more than enlarging the U-Net.

SDXL (2023) took a dual-text-encoder approach: CLIP ViT-L and OpenCLIP ViT-bigG, with their per-token outputs concatenated along the channel axis to give 768 + 1,280 = 2,048-dimensional conditioning vectors. The two encoders were trained on different data, so they capture somewhat different aspects of the prompt.

Stable Diffusion 1.x Conditioning

CLIP ViT-L/14 text encoder. 77 tokens × 768 dims. Cross-attention at multiple U-Net resolutions. Strong visual-semantic alignment, weak compositional reasoning. Supports prompt weighting via token scaling.

Imagen Conditioning

T5-XXL text encoder (4.6B params). No image training — pure language understanding. Strongest compositional and relational reasoning of the 2022 generation. Requires substantially more memory for the encoder alone.

DALL-E 2 Conditioning

CLIP text embedding → prior → CLIP image embedding. Single-vector conditioning of the decoder. Semantically rich but loses fine syntactic structure. Strong at style and concept transfer.

SDXL Conditioning (2023)

Dual encoder: CLIP ViT-L + OpenCLIP ViT-bigG, concatenated. 2,048-dimensional combined token embeddings (768 + 1,280). Also conditions on image dimensions and crop coordinates. Better prompt adherence than SD 1.x.

Key Terms
Cross-Attention Conditioning: Mechanism where image spatial features (queries) attend to text token embeddings (keys/values) at multiple U-Net resolutions, enabling localized text-image binding.
Token Sequence Conditioning: Feeding all per-token embeddings from a text encoder into the denoiser, rather than a single summary vector. Preserves word-level information that global embeddings discard.
T5-XXL: 4.6B parameter Google text encoder trained on text only (no images). Used by Imagen; demonstrated that language encoder quality is a major determinant of image fidelity.
Prompt Weighting: Scaling individual token embeddings before cross-attention to amplify or suppress specific words' influence on generation.
Dual Encoder (SDXL): SDXL's conditioning architecture combining CLIP ViT-L and OpenCLIP ViT-bigG embeddings, concatenated to form a richer 2,048-dim conditioning sequence.

Lesson 3 Quiz

Vision Encoders in Diffusion Models
In Stable Diffusion's cross-attention conditioning, what roles do the text token embeddings and the spatial image features play?
Correct. The spatial features of the noisy latent act as queries — each region of the image "asks" which words are relevant to it — while text tokens supply the keys and values.
Not quite. It's the reverse: spatial image features are queries, text token embeddings are keys and values. This allows each image region to attend to specific words in the prompt.
What did Google Brain's Imagen paper find about the T5 text encoder's contribution to generation quality?
Correct. Imagen's ablations showed the text encoder was the most impactful component to scale: moving to T5-XXL improved generation quality more than enlarging the U-Net did.
Not quite. Imagen's key finding was that text encoder quality dominates: scaling T5 from small to XXL produced larger improvements in FID and image-text alignment than scaling the diffusion U-Net.
How does prompt weighting (e.g., writing "word:1.4" in Automatic1111) affect Stable Diffusion's generation?
Correct. Prompt weighting works at the embedding level: scaling a token's vector makes it larger in the conditioning space, so it contributes more strongly to the keys and values that the image queries attend to.
Not quite. Prompt weighting scales the individual token embedding vector, not the guidance scale or token count. It's a geometric operation in the conditioning space.

Lab 3 — Cross-Attention and Encoder Comparison

Compare conditioning strategies across generation systems. At least 3 exchanges to complete.

Your Investigation

You've seen how CLIP token-sequence conditioning, T5 conditioning, and dual-encoder approaches differ. Now dig into the tradeoffs: compute costs, what each approach handles well, and how cross-attention mechanics shape what prompts can and cannot control.

Suggested starting point: "Why would SDXL use two different CLIP encoders rather than just one larger one? What does each encoder contribute that the other doesn't?"
Conditioning Architecture Tutor
L3 Lab
Let's explore how different text conditioning architectures shape generation. I can help you compare CLIP token conditioning vs T5 vs SDXL's dual encoder, discuss cross-attention mechanics, or analyze why certain prompt structures work better in some systems than others. Where would you like to start?
Module 3 · Lesson 4

OpenCLIP, SigLIP, and Beyond

After OpenAI's CLIP came a wave of open-weight reproductions, scaled variants, and architectural innovations — reshaping which vision encoders power which generation systems.
Why did the open-source community reproduce CLIP, what did they improve, and how did the vision encoder landscape diverge between proprietary and open systems by 2023?

OpenAI released CLIP model weights in January 2021, but not the training code, not the WIT dataset, and not the recipe to reproduce it at scale. For researchers who wanted to build on CLIP, or audit its biases, this was a significant constraint. In 2021, the nonprofit LAION (Large-scale Artificial Intelligence Open Network) assembled LAION-400M: a 400-million-pair dataset filtered using CLIP similarity scores from OpenAI's own released model. By 2022 they had expanded to LAION-5B, more than five billion pairs, the largest publicly documented image-text dataset at the time.

Simultaneously, researchers at the University of Washington collaborated with Stability AI and others to train OpenCLIP, an open-source reimplementation of CLIP training using LAION data. By 2022, OpenCLIP ViT-H/14 trained on LAION-2B matched or exceeded OpenAI's ViT-L/14 on several zero-shot benchmarks. By 2023, OpenCLIP ViT-bigG/14 had surpassed it by a wider margin, becoming the encoder used in SDXL's larger conditioning stream.

OpenCLIP: Reproducing and Improving CLIP

OpenCLIP (Ilharco et al., 2021; Cherti et al., 2022) is a PyTorch reimplementation of the CLIP training procedure using publicly available data. The key findings from the OpenCLIP scaling paper (Reproducible Scaling Laws for Contrastive Language-Image Learning, Cherti et al., arXiv 2022; CVPR 2023) were:

Scale follows predictable laws. Zero-shot performance on ImageNet scales as a power law with compute, following similar laws to those found in language model scaling. Models trained with 10× more compute consistently outperformed smaller counterparts in a predictable way, allowing researchers to forecast performance before training.
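Forecasting from a power law amounts to fitting a straight line in log-log space. The sketch below uses synthetic numbers generated from a made-up law (error = 2.0 × compute^−0.1), not results from the paper; the function name and every value are purely illustrative.

```python
import math

def fit_power_law(compute, error):
    """Least-squares fit of error = a * compute**(-b) via a line in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic measurements from three hypothetical training runs.
compute_budgets = [1e9, 1e10, 1e11]
error_rates = [2.0 * c ** -0.1 for c in compute_budgets]

a, b = fit_power_law(compute_budgets, error_rates)

# Extrapolate to a run ten times larger than any that was "measured".
forecast_error = a * (1e12) ** -b
```

With real measurements the fit is noisy, but the workflow is the same: fit on small runs, then extrapolate to decide whether a larger run is worth its cost.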

Data quality matters more than raw scale. Models trained on LAION-2B (filtered with CLIP scores) outperformed models trained on raw unfiltered web data at the same dataset size. The filtering step — which used CLIP to filter CLIP's own training data — produced measurable gains.

Open weights enable systematic comparison. Because OpenCLIP released full model checkpoints at multiple scales and training stages, researchers could study how visual representations evolve during training and how different architectural choices (patch size, width, depth) trade off against each other — something impossible with closed models.

SigLIP: Replacing InfoNCE with Sigmoid Loss

Google introduced SigLIP (Sigmoid Loss for Language Image Pre-Training, Zhai et al., ICCV 2023) as a significant departure from CLIP's training objective. Rather than the InfoNCE contrastive loss — which requires comparing each sample against all others in a batch — SigLIP uses a binary sigmoid loss applied independently to each image-text pair.

The practical consequence is that SigLIP does not require large batch sizes to work well. InfoNCE's quality scales with batch size because you need many negative examples to define the contrast. Sigmoid loss treats each pair as an independent binary classification: "does this text match this image?" — and this can be trained with much smaller batches while maintaining or improving representation quality.
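A minimal sketch of the pairwise sigmoid loss makes the contrast with InfoNCE visible: there is no softmax across the batch, only an independent binary term for every image-text combination. The temperature of 10 and bias of −10 follow the initial values reported in the SigLIP paper but are fixed here for simplicity (the real model learns them), the function names are hypothetical, and the inputs are assumed to be unit-length toy vectors.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pair_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum an independent binary loss over every image-text combination.

    Row i of each list is a matched pair (label +1); all other combinations are
    negatives (label -1). Embeddings are assumed to be L2-normalized already.
    """
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total += -log_sigmoid(label * logit)
    return total / len(image_embs)

aligned_loss = sigmoid_pair_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = sigmoid_pair_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Each term depends only on its own pair, so the loss does not need a global normalization across the batch the way the softmax in InfoNCE does.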

Trained on WebLI (Google's proprietary image-text dataset of roughly 10 billion images), the sigmoid-loss recipe combined with locked-image tuning (SigLiT) reached 84.5% zero-shot ImageNet accuracy, well above OpenAI's original CLIP ViT-L/14 (75.5%). SigLIP encoders, including the ViT-So400M variant, were subsequently adopted in Google's PaLI-3 and PaliGemma vision-language models, marking a shift away from InfoNCE-trained CLIP encoders in some high-profile systems.

EVA-CLIP and the Scale Race

Beijing Academy of Artificial Intelligence (BAAI) released EVA-CLIP in 2023, pushing CLIP-style training to larger scales. EVA-CLIP-18B (Sun et al., 2024) trained an 18-billion-parameter model and reported 80.7% zero-shot top-1 accuracy averaged across 27 image classification benchmarks, which the authors described as the strongest result for an open-source CLIP model at the time.

EVA-CLIP was also notable for its training strategy: it initialized the vision encoder from EVA (Exploring the Limits of Masked Visual Representation Learning at Scale), a Vision Transformer pre-trained with masked image modeling to reconstruct the image features of an existing CLIP model. This two-stage approach (pre-train the vision encoder to reconstruct CLIP features, then train contrastively on image-text pairs) produced more efficient learning than training from random initialization at the same scale.

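EVA-CLIP encoders have mostly been adopted as vision backbones for multimodal language models; BLIP-2, for example, uses a ViT-g/14 from EVA-CLIP. Text-to-image systems such as FLUX.1 (Black Forest Labs, 2024), one of the most capable open image generation models, instead condition on text encoders: T5-XXL together with CLIP ViT-L.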

Vision Encoders in Current Production Systems (2023–2024)

SDXL: CLIP ViT-L + OpenCLIP ViT-bigG text encoders (dual encoder, concatenated). FLUX.1: T5-XXL text encoder plus a separate CLIP ViT-L conditioning stream. PaLI-3 and PaliGemma: SigLIP vision encoder. Gemini and GPT-4V: undisclosed. LLaVA-1.5 (open): CLIP ViT-L/14 @ 336px vision encoder. The diversity reflects that there is no single best encoder: systems optimize for different tradeoffs between zero-shot generalization, compositional accuracy, and training efficiency.

Implications for Image Generation

The choice of vision encoder in a generation system has measurable downstream effects. Encoders trained with better language understanding (T5, larger CLIP variants) improve prompt fidelity. Encoders trained at higher resolution or with finer patch sizes (ViT-L/14 @ 336px vs ViT-B/32) preserve more spatial detail in the conditioning signal. Encoders trained on more diverse data (LAION-5B, WebLI) generalize to unusual prompts more reliably.

The community has largely moved away from the idea of a single universal CLIP encoder toward multi-encoder conditioning — using two or more encoders whose strengths are complementary — and toward larger language-model backbones (T5, LLaMA) as text encoders in systems where prompt fidelity is paramount. The 2021 CLIP architecture was the foundation; by 2024 it had become one ingredient in a much more complex conditioning stack.

Key Terms
OpenCLIP: Open-source reimplementation of CLIP training (LAION + UW + Stability AI). Demonstrated reproducible scaling laws; ViT-bigG variant used in SDXL.
LAION-5B: Dataset of roughly 5.85 billion image-text pairs assembled by LAION using CLIP similarity filtering. Largest public training corpus for contrastive vision-language models as of 2022.
SigLIP: Google's 2023 CLIP variant replacing InfoNCE loss with sigmoid binary classification loss. Works well at small batch sizes; used in PaLI-3 and PaliGemma.
EVA-CLIP: BAAI's scaled CLIP family, with variants up to EVA-CLIP-8B. The vision encoder is initialized from masked image modeling on CLIP features before contrastive training.
InfoNCE vs Sigmoid Loss: InfoNCE relies on large batches to supply negative examples; sigmoid loss treats each pair independently, removing the large-batch requirement while maintaining quality.
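The difference between the two objectives is easiest to see side by side. The sketch below computes both from a precomputed N×N matrix of cosine similarities, where entry [i][j] compares image i with text j and the diagonal holds the true pairs. It is an illustration under assumptions, not either paper's reference code: the function names are hypothetical, and the `temperature`, `scale`, and `bias` defaults are fixed illustrative values, whereas real training learns them.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_softmax_at(row, index):
    """log softmax(row)[index], computed with the max-subtraction trick."""
    m = max(row)
    log_norm = m + math.log(sum(math.exp(v - m) for v in row))
    return row[index] - log_norm


def info_nce_loss(sim, temperature=0.07):
    """Symmetric InfoNCE: each row and each column is an N-way classification."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    columns = [list(col) for col in zip(*logits)]
    image_to_text = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    text_to_image = -sum(log_softmax_at(columns[i], i) for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def sigmoid_pair_loss(sim, scale=10.0, bias=-10.0):
    """Sigmoid loss: every (image, text) cell is its own binary decision."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for true pairs, -1 otherwise
            total += log_sigmoid(label * (scale * sim[i][j] + bias))
    return -total / n


matched = [[1.0, 0.0], [0.0, 1.0]]   # true pairs score highest
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # true pairs score lowest
print(info_nce_loss(matched), info_nce_loss(shuffled))
print(sigmoid_pair_loss(matched), sigmoid_pair_loss(shuffled))
```

In `info_nce_loss` every term is normalized by a softmax over a whole row or column, so the rest of the batch defines what counts as a good score. In `sigmoid_pair_loss` each cell contributes a term that depends only on its own similarity, which is why the objective does not lean on very large batches.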

Lesson 4 Quiz

OpenCLIP, SigLIP, and Beyond
What key practical advantage does SigLIP's sigmoid loss have over CLIP's InfoNCE loss?
Correct. InfoNCE needs many negatives in each batch to define the contrast; sigmoid loss asks only "does this pair match?" for each pair independently, enabling effective training at much smaller batch sizes.
Not quite. SigLIP's advantage is about batch size requirements. The sigmoid formulation makes each pair's loss independent of others in the batch, eliminating the need for the large batches that InfoNCE requires.
What two-stage training strategy does EVA-CLIP use to achieve efficient learning at scale?
Correct. EVA initializes by learning to reconstruct CLIP embedding targets from masked image patches, then switches to contrastive training. This gives the vision encoder a strong semantic initialization before the expensive contrastive phase.
Not quite. EVA-CLIP's first stage is masked autoencoding trained to predict existing CLIP features — giving the vision encoder a semantic head-start before contrastive fine-tuning on image-text pairs.
The OpenCLIP scaling paper (Cherti et al., 2022) found that filtering training data with CLIP similarity scores before training a new CLIP model on that data produced what result?
Correct. OpenCLIP experiments showed CLIP-filtered data consistently outperformed unfiltered data at equivalent scale, confirming that data quality — even when defined circularly by CLIP itself — matters more than raw dataset size.
Not quite. The OpenCLIP paper found clear, consistent improvements from CLIP-filtered data over unfiltered data at the same size. Data quality dominated over raw scale.

Lab 4 — The Modern Vision Encoder Landscape

Analyze tradeoffs across OpenCLIP, SigLIP, and EVA-CLIP. At least 3 exchanges to complete.

Your Investigation

You've seen how the vision encoder space fragmented after OpenAI's CLIP into open-weight reproductions, training objective variants, and extreme-scale models. Now analyze the strategic and technical tradeoffs that determine which encoder gets used in which generation system.

Suggested starting point: "If I'm designing a new text-to-image system today, how would I decide between using SigLIP vs OpenCLIP vs a T5 encoder for text conditioning? What are the most important factors to weigh?"

Module 3 — Module Test

15 questions across all four lessons. 80% required to pass.
1. What does CLIP stand for?
Correct. CLIP = Contrastive Language–Image Pre-Training, introduced by OpenAI in January 2021.
Incorrect. CLIP stands for Contrastive Language–Image Pre-Training.
2. How large was OpenAI's WIT dataset used to train the original CLIP?
Correct. WIT contained approximately 400 million image-text pairs scraped from the public internet.
Incorrect. WIT had approximately 400 million pairs. 5B is LAION-5B, a later dataset.
3. What metric does CLIP use to measure similarity between image and text embeddings?
Correct. Cosine similarity measures the angle between two embedding vectors, ranging from −1 to 1.
Incorrect. CLIP uses cosine similarity — the cosine of the angle between two vectors — to measure how aligned image and text embeddings are.
4. Zero-shot classification in CLIP works by:
Correct. CLIP zero-shot classification embeds each class description (e.g., "a photo of a cat") and picks the class whose text embedding is most similar to the image embedding.
Incorrect. CLIP's zero-shot approach uses cosine similarity between the image embedding and text embeddings of class descriptions — no fine-tuning or lookup tables required.
5. What is the "prior" in DALL-E 2's generation pipeline?
Correct. The prior bridges text and image embedding spaces, allowing the pixel decoder to be conditioned on image-space representations rather than text-space ones directly.
Incorrect. The prior is the model that maps CLIP text embeddings to CLIP image embeddings — a bridge between modality spaces before pixel generation.
6. In Stable Diffusion's U-Net, cross-attention layers condition the denoiser by having:
Correct. Spatial features are queries — each image region asks which tokens are relevant — while text tokens provide the keys and values that answer those queries.
Incorrect. It's the spatial features that are queries; text tokens are keys and values. Each image region attends to relevant words, not vice versa.
7. According to Yuksekgonul et al. (2022), CLIP's "bag-of-words" behavior means it:
Correct. CLIP cannot reliably distinguish "A above B" from "B above A" because it responds to concept presence rather than relational structure — a direct cause of attribute binding failures.
Incorrect. The bag-of-words finding is that CLIP responds to which words appear in a prompt, not how they relate to each other — making spatial/relational prompts unreliable.
8. What text encoder did Google Brain's Imagen use instead of CLIP?
Correct. Imagen used T5-XXL, a 4.6B parameter text-only encoder, demonstrating that language model quality significantly impacts image generation fidelity.
Incorrect. Imagen used T5-XXL — a 4.6 billion parameter text encoder trained purely on text data with no image supervision.
9. Imagen's ablation study found that scaling the text encoder had what effect compared to scaling the diffusion model?
Correct. Imagen's key finding: the text encoder is the most impactful component to scale. A bigger language encoder beat a bigger diffusion model at equal compute.
Incorrect. Imagen found the text encoder more impactful — scaling T5 from small to XXL produced larger FID gains than equivalent scaling of the U-Net.
10. SDXL's dual text-encoder conditioning combines which two encoders?
Correct. SDXL concatenates embeddings from CLIP ViT-L (768-dim) and OpenCLIP ViT-bigG (1280-dim) to form a 2,048-dimensional conditioning sequence.
Incorrect. SDXL uses CLIP ViT-L and OpenCLIP ViT-bigG, concatenated for a combined 2,048-dimensional conditioning stream.
11. OpenCLIP was trained primarily on which dataset?
Correct. OpenCLIP used LAION datasets — publicly assembled image-text pairs that were themselves filtered using OpenAI's CLIP similarity scores.
Incorrect. OpenCLIP used LAION datasets (400M, then 2B and 5B variants), which were assembled and filtered using CLIP scores — making it an open alternative to WIT.
12. SigLIP differs from CLIP in its training objective by:
Correct. SigLIP's sigmoid loss asks "does this pair match?" for each pair independently — no large-batch negative comparisons needed, unlike InfoNCE.
Incorrect. SigLIP's core change is replacing InfoNCE (which needs large batches of negatives) with a sigmoid binary loss applied independently to each image-text pair.
13. Which encoders does FLUX.1 use to condition generation on the prompt?
Correct. FLUX.1 (Black Forest Labs, 2024) conditions on T5-XXL token embeddings together with a pooled embedding from a CLIP ViT-L text encoder.
Incorrect. FLUX.1 (Black Forest Labs), a 2024 open image generation model, pairs T5-XXL with a CLIP ViT-L text encoder; it does not use EVA-CLIP-8B.
14. The LDM (Latent Diffusion Model) paper introduced which specific conditioning improvement over prior CLIP-guided diffusion?
Correct. LDM (Rombach et al., 2021) introduced cross-attention over the full token sequence, allowing spatial regions of the image to attend to individual words rather than a single compressed embedding.
Incorrect. LDM's contribution was inserting cross-attention layers into the U-Net to attend to all 77 text token embeddings — enabling localized, word-level conditioning instead of a single global vector.
15. What did the OpenCLIP scaling paper (Cherti et al., 2022) find about CLIP performance vs. compute?
Correct. OpenCLIP demonstrated reproducible power-law scaling — 10× more compute consistently produced predictable improvements, mirroring the scaling laws found in language models.
Incorrect. OpenCLIP's key finding was that zero-shot CLIP performance follows a clean power law with compute — predictable, reproducible, and consistent across model sizes.