Module 2 · Lesson 1

From Words to Pixels: The Text-to-Image Pipeline

How a sentence becomes a photorealistic image in seconds — and what happens at every step.

Why does changing a single word in a prompt sometimes produce a completely different image?

When Stability AI released Stable Diffusion publicly on August 22, 2022, researchers could run a full text-to-image pipeline on a consumer GPU. Within 48 hours the model had been downloaded hundreds of thousands of times. For the first time, the entire chain, from tokenized text to rendered pixel, was open and inspectable.

The Four-Stage Pipeline

Every major text-to-image system — Stable Diffusion, DALL·E 3, Midjourney, Imagen — shares the same conceptual pipeline, even if the internals differ. Understanding this pipeline explains why prompts work the way they do.

Text Encoding — Your prompt is tokenized and converted into a dense vector embedding by a language model (e.g. CLIP or T5).

↓

Conditioning — The embedding is fed as a conditioning signal into the image-generating network, telling it what the output should look like.

↓

Iterative Denoising — Starting from pure noise, the diffusion model runs 20–50 denoising steps, gradually shaping pixels that match the conditioning.

↓

Decoding / Upscaling — In latent-diffusion models the low-resolution latent is decoded back to pixel space by a VAE; optional upscalers add detail.

Stage 1 in Depth: Text Encoders

The quality of the text encoding determines the ceiling of what the model can achieve. Stable Diffusion 1.x used OpenAI's CLIP ViT-L/14, whose context window is 77 tokens (75 usable once the start and end markers are counted). This hard limit is why very long prompts are often truncated: tokens beyond the window are ignored entirely. SD 2.x switched to OpenCLIP; SDXL keeps CLIP ViT-L and adds a second encoder (OpenCLIP ViT-bigG), concatenating both embeddings so that each token is described by 2048 dimensions instead of 768.
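
The truncation behaviour can be illustrated with a small sketch. It assumes a toy whitespace tokenizer (real CLIP uses byte-pair encoding, so actual token counts differ), and the names `MAX_LEN` and `truncate_prompt` are hypothetical, chosen only for this example.

```python
MAX_LEN = 77  # CLIP context length used by SD 1.x
BOS, EOS = "<|startoftext|>", "<|endoftext|>"

def truncate_prompt(prompt: str, max_len: int = MAX_LEN):
    """Return (kept_tokens, dropped_tokens) using a toy whitespace tokenizer."""
    words = prompt.split()
    budget = max_len - 2            # two slots are reserved for BOS and EOS
    kept = [BOS] + words[:budget] + [EOS]
    dropped = words[budget:]        # these never reach the encoder
    return kept, dropped

# A 200-word prompt, like the one in the lab scenario
long_prompt = " ".join(f"word{i}" for i in range(200))
kept, dropped = truncate_prompt(long_prompt)
print(len(kept), "tokens encoded;", len(dropped), "words silently dropped")
```

Under these assumptions only the first 75 words survive, which is why descriptors placed late in a long prompt may have no effect at all.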

Google's Imagen (2022) took a different approach: it used a frozen T5-XXL language model — the same family used for text translation — as its text encoder. T5-XXL had 4.6 billion parameters dedicated purely to understanding language, far larger than CLIP. The paper showed this drove more faithful prompt adherence, especially for complex compositional prompts like "a blue cube on top of a red sphere."

Why It Matters

When you add qualifiers like "golden hour lighting, Canon 5D, f/1.8" to a prompt, you are adding tokens whose embeddings push the conditioning vector toward regions of the text-embedding space that the model learned to associate with those aesthetic qualities. This is not magic; it is vector arithmetic on learned representations.

Stage 3 in Depth: Denoising Steps and CFG

The iterative denoising step introduces one of the most important user-facing parameters: Classifier-Free Guidance (CFG) scale. During training, diffusion models learn two things simultaneously — how to denoise conditioned on the text prompt, and how to denoise unconditionally (with the prompt replaced by empty tokens). At inference, CFG blends the two predictions:

output = unconditional + CFG_scale × (conditional − unconditional)

A CFG of 1.0 applies no extra guidance: the formula reduces to the conditional prediction alone, and only a scale of 0 would ignore the prompt entirely. A CFG of 7–9 (the typical default) produces prompt-faithful results. Values above 15 tend to oversaturate and distort. This equation was introduced in the paper Classifier-Free Diffusion Guidance by Ho & Salimans (2022) and has become the default control mechanism across virtually all diffusion-based systems.
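
To make the formula concrete, here is a minimal sketch that evaluates it at a few scales. The three-element lists stand in for the U-Net's noise predictions and are made-up numbers; `cfg_blend` is a hypothetical helper name.

```python
def cfg_blend(uncond, cond, scale):
    """Apply output = uncond + scale * (cond - uncond) element-wise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 1.0, 2.0]  # toy prediction with the empty prompt
cond = [1.0, 1.0, 0.0]    # toy prediction with the user's prompt

for scale in (0.0, 1.0, 7.5):
    print(scale, cfg_blend(uncond, cond, scale))
```

With the formula exactly as written, a scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional prediction, and larger values extrapolate beyond it in the direction the prompt points.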

Key Terms

Latent Space: A compressed mathematical representation of images where the diffusion process actually runs. It is typically 8× smaller than pixel space along each spatial dimension, which means roughly 64× fewer spatial positions to process.

VAE: Variational Autoencoder, the encoder/decoder pair that compresses images into latent space and reconstructs them back to pixels.

CFG Scale: Classifier-Free Guidance scale, which controls how strongly the model follows the text prompt versus exploring freely.

Denoising Steps: The number of iterative refinement passes. More steps generally improve quality, with diminishing returns, at the cost of slower generation.
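
The 8× and 64× figures in the Latent Space entry can be checked with a little arithmetic. This sketch assumes a 512×512 RGB image and a 4-channel latent, as in Stable Diffusion 1.x; `latent_shape` is a hypothetical helper.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent tensor for a given image size (channels first)."""
    return (latent_channels, height // downsample, width // downsample)

h, w = 512, 512
c, lh, lw = latent_shape(h, w)

spatial_reduction = (h * w) / (lh * lw)        # positions the U-Net attends over
value_reduction = (h * w * 3) / (c * lh * lw)  # raw numbers stored, RGB vs latent

print((c, lh, lw), spatial_reduction, value_reduction)
```

The 64× figure refers to spatial positions. Because the latent has 4 channels rather than 3, the reduction in raw stored values is somewhat smaller.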

The Scheduler: Controlling the Denoising Path

Between the noise schedule and the U-Net is the sampler (or scheduler), which determines how noise is removed at each step. Different samplers — DDIM, DPM++ 2M Karras, PNDM, Euler Ancestral — take different paths through the same probability landscape, producing images with different textures, sharpness, and detail. Stable Diffusion's community extensively benchmarked these after the 2022 open release, finding that DPM++ 2M Karras reached near-peak quality in just 20–25 steps while DDPM required 1,000.
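
The "Karras" in a sampler name refers to the noise-level spacing proposed by Karras et al. (2022). The sketch below computes that spacing for 20 steps. The `sigma_min` and `sigma_max` values are illustrative assumptions rather than values taken from any particular checkpoint; `rho=7` is the default suggested in that paper.

```python
def karras_sigmas(n, sigma_min=0.03, sigma_max=14.6, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, spaced evenly in sigma**(1/rho)."""
    min_inv = sigma_min ** (1 / rho)
    max_inv = sigma_max ** (1 / rho)
    return [(max_inv + i / (n - 1) * (min_inv - max_inv)) ** rho for i in range(n)]

sigmas = karras_sigmas(20)
first_gap = sigmas[0] - sigmas[1]    # step size at high noise
last_gap = sigmas[-2] - sigmas[-1]   # step size at low noise
print(round(first_gap, 3), round(last_gap, 4))
```

The schedule takes large steps while the image is mostly noise and small steps near the end, where fine detail is resolved. This is one reason a well-chosen sampler can get by with few steps.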

Lesson 1 Quiz

Test your understanding of the text-to-image pipeline.

What is the primary role of the text encoder in a text-to-image pipeline?

Correct. The text encoder transforms your words into a dense vector that serves as the conditioning signal — it tells the diffusion model what to generate.

Not quite. The text encoder specifically converts language into vector embeddings. Noise generation, decoding, and step-count are handled by other components.

Stable Diffusion 1.x had a 77-token limit. What was the practical consequence for users writing long prompts?

Correct. CLIP's fixed 77-token context window meant anything written past that limit simply wasn't encoded — a key reason skilled prompters front-load their most important descriptors.

Incorrect. The model didn't error or slow down — it just silently truncated. Tokens beyond 77 had no effect on the output.

A CFG (Classifier-Free Guidance) scale of 1.0 would produce what kind of output?

Correct. At CFG=1.0 the formula reduces to the plain conditional prediction: the prompt is still used, but its influence is not amplified, so results tend to follow the text only loosely. The "sweet spot" for most use cases is 7–9.

Wrong. CFG=1.0 means no extra guidance beyond the plain conditional prediction. Maximum prompt adherence comes from higher values (7–15+), though very high values cause distortion.

Lab 1: Pipeline Mechanics

Discuss the text-to-image pipeline with your AI tutor. Complete 3 exchanges to finish the lab.

Your Challenge

You are a prompt engineer at a design studio. A client wants to understand why their elaborate 200-word prompt is producing mediocre results. Use what you learned about text encoders, CFG, and denoising steps to diagnose and fix the problem.

Starter question: "Our prompt is 200 words long and detailed, but Stable Diffusion 1.5 seems to ignore half of it. Walk me through exactly why this is happening and what we should do differently."

AI Tutor

Pipeline Mechanics

Welcome to the pipeline lab. I'm your AI tutor for this session. Ask me anything about text encoders, CFG scale, denoising, or samplers — or dive straight into the client scenario above. What would you like to explore?

Module 2 · Lesson 2

Prompt Engineering: The Science of Steering Latent Space

Structured techniques that consistently produce better results — backed by documented model behavior.

Why do professional prompt engineers often front-load subject matter and back-load style — and does the order actually matter?

When Midjourney launched version 4 in November 2022, its user community ran systematic prompt experiments documented on Reddit (r/midjourney) and Discord. One widely-shared analysis compared 400 paired prompts with identical content but different token ordering. The finding: subject nouns placed in the first 20 tokens correlated with stronger representation in the final image than the same nouns placed after position 40. This mirrored what CLIP attention maps showed — earlier tokens receive proportionally higher attention weight during cross-attention conditioning.

Anatomy of a High-Performance Prompt

There is no universal formula, but high-performing prompts across Stable Diffusion, Midjourney, and DALL·E 3 consistently share a structure that reflects how text encoders process language:

Subject — Who or what is in the image. Be specific: "a red fox" outperforms "an animal."

Action / Pose — What the subject is doing: "sitting on a mossy log, looking left."

Setting / Environment — Where: "in a fog-filled pine forest at dawn."

Lighting & Atmosphere — "golden hour rim lighting, volumetric fog."

Style / Medium — "National Geographic photograph, Nikon D850, 400mm telephoto, f/2.8."

Quality Boosters — "award-winning, 8K, ultra-detailed, sharp focus." (Use sparingly — overuse dilutes the semantic field.)
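
The ordering above can be captured in a small helper so that every prompt is assembled subject-first. This is only a sketch: `PromptParts` and its field names are hypothetical and do not belong to any tool's API.

```python
from dataclasses import dataclass

@dataclass
class PromptParts:
    subject: str
    action: str = ""
    setting: str = ""
    lighting: str = ""
    style: str = ""
    boosters: str = ""

    def build(self) -> str:
        """Join the non-empty parts in the fixed order subject -> boosters."""
        ordered = [self.subject, self.action, self.setting,
                   self.lighting, self.style, self.boosters]
        return ", ".join(part for part in ordered if part)

prompt = PromptParts(
    subject="a red fox",
    action="sitting on a mossy log, looking left",
    setting="in a fog-filled pine forest at dawn",
    lighting="golden hour rim lighting, volumetric fog",
    style="National Geographic photograph",
).build()
print(prompt)
```

Leaving `boosters` empty by default reflects the advice to use quality boosters sparingly.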

Negative Prompts: What You Exclude Matters

Negative prompts leverage the unconditional branch of CFG. When you provide a negative prompt, its embedding replaces the empty prompt in that branch, so every guidance step pushes the prediction away from the negative embedding and toward the positive one. Common categories:

Artifact Suppression

"blurry, low quality, watermark, jpeg artifacts, cropped, deformed, ugly, bad anatomy" — removes common failure modes from SD 1.5.

Style Exclusion

"cartoon, anime, painting, illustration" — pushes results toward photorealism when the base model defaults to mixed styles.

Composition Control

"extra limbs, two heads, duplicate, mirror" — reduces anatomical errors common in human figure generation.

Content Steering

"text, watermark, signature, logo" — removes unwanted typography that CLIP-based models frequently hallucinate.

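The mechanism behind negative prompts can be sketched by reusing the CFG formula with the negative-prompt prediction as the baseline. The two-element lists are made-up stand-ins for noise predictions, and `guided_prediction` is a hypothetical name.

```python
def guided_prediction(baseline, positive, scale):
    """CFG with an arbitrary baseline: baseline + scale * (positive - baseline)."""
    return [b + scale * (p - b) for b, p in zip(baseline, positive)]

positive = [1.0, 0.0]   # toy prediction for the positive prompt
empty = [0.0, 0.0]      # toy prediction for the empty prompt
negative = [0.0, 1.0]   # toy prediction for "blurry, watermark, ..."

plain = guided_prediction(empty, positive, 7.5)
with_negative = guided_prediction(negative, positive, 7.5)
print(plain, with_negative)
```

In this toy case the second component, where the negative prediction is large, is driven strongly negative once the negative prompt becomes the baseline. Nothing is subtracted from the positive prompt in the text encoder; the steering happens in the guidance step.
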
Prompt Weighting and Attention Manipulation

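Stable Diffusion's A1111 interface and ComfyUI both support attention weight syntax. Wrapping a token in parentheses with a number sets its weight: (golden:1.4) applies 1.4× weight to that token's embedding, and a value below 1, such as (background:0.8), reduces it. In A1111, plain square brackets such as [background] also lower the weight by a fixed factor of 1.1, whereas a bracketed form containing a colon and a number is read as prompt scheduling rather than weighting. The weights scale the text embeddings that the U-Net's cross-attention layers receive at each denoising step. SDXL's dual-encoder architecture even allows different weights on each encoder's output.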
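
As a rough illustration of how a front end might read this syntax, the sketch below extracts explicit (text:weight) groups from a prompt. It is a simplified, hypothetical parser: it ignores nesting, escapes, and bracket forms, and is not the parser used by A1111 or ComfyUI.

```python
import re

# Matches "(some text:1.4)" with no nested parentheses
WEIGHTED = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")

def parse_weights(prompt: str):
    """Split a prompt into (text, weight) chunks; unmarked text gets weight 1.0."""
    chunks, pos = [], 0
    for match in WEIGHTED.finditer(prompt):
        plain = prompt[pos:match.start()].strip(" ,")
        if plain:
            chunks.append((plain, 1.0))
        chunks.append((match.group(1).strip(), float(match.group(2))))
        pos = match.end()
    tail = prompt[pos:].strip(" ,")
    if tail:
        chunks.append((tail, 1.0))
    return chunks

chunks = parse_weights("a fox, (golden:1.4) light, (background:0.8)")
print(chunks)
```

A real implementation would then multiply each chunk's token embeddings by its weight before passing them to the U-Net.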

In 2022, researchers at Google Research and Tel Aviv University (Prompt-to-Prompt paper, Hertz et al.) showed that by injecting the cross-attention maps from one prompt's generation into another's, you could edit specific elements of a generated image while preserving the overall composition, confirming that individual token attentions genuinely map to spatial regions in the output.

Real Technique — "Style Reference Anchors"

Midjourney's --sref parameter (added 2024) and DALL·E 3's system-prompt style instructions both evolved from the community practice of appending known artistic styles to anchor the output. "In the style of Ansel Adams" or "Caravaggio chiaroscuro lighting" works because CLIP was trained on image-text pairs that include art historical descriptions — those tokens have genuine geometric meaning in embedding space.

The DALL·E 3 Paradigm Shift: Natural Language Prompts

OpenAI's DALL·E 3 (released September 2023) fundamentally changed prompt engineering. Rather than relying on the short, noisy alt-text captions typical of web-scraped data, DALL·E 3 was trained largely on long, descriptive synthetic captions written by a purpose-built image captioner, and GPT-4 expands user prompts at inference time. As a result it handles full sentences, relative clauses, and compositional logic. You can write "a red ball to the left of a blue cube, which is on top of a wooden table, in a room with green walls" and get far more reliable spatial compliance. Earlier systems with CLIP encoders routinely failed at such relational prompts. This demonstrated that the quality of the text conditioning (the captions and the encoder), not just the diffusion model, determines compositional ability.

Lesson 2 Quiz

Test your prompt engineering knowledge.

Why does placing the primary subject in the first tokens of a prompt generally improve results in CLIP-conditioned models?

Correct. Cross-attention in the U-Net weights earlier token positions more heavily, so front-loading your subject increases its influence on spatial layout and representation.

Not correct. The mechanism is cross-attention weighting in the U-Net — earlier tokens have greater influence. The VAE and denoising step count are not involved.

What mechanism allows negative prompts to work in a diffusion pipeline using CFG?

Correct. In CFG, the model computes both a positive-conditioned and negative-conditioned noise prediction. The guidance formula pushes the output away from the negative direction and toward the positive.

Incorrect. Negative prompts work through CFG's directional steering during denoising — not through post-processing, discriminators, or raw vector subtraction in the encoder.

What made DALL·E 3 (2023) significantly better at compositional prompts like "a red ball to the left of a blue cube" compared to earlier CLIP-based systems?

Correct. OpenAI found that the bottleneck was data quality, not model size. Detailed synthetic captions taught the system relational language that web-scraped alt-text lacked.

Wrong. The key was training data: recaptioning the training images with detailed synthetic captions gave DALL·E 3 a much better grasp of spatial relationships, something CLIP-based systems lacked regardless of resolution or steps.

Lab 2: Prompt Engineering Workshop

Practice constructing and critiquing prompts with your AI tutor. Complete 3 exchanges to finish.

Your Challenge

A brand photographer needs a consistent photorealistic image of their product — a matte black coffee thermos — in a Nordic kitchen setting. Their current prompt produces blurry results with cartoon-like styling. Analyze their prompt and rewrite it using what you know about prompt structure, negative prompts, and token ordering.

Their current prompt: "a thermos in a kitchen, nice, realistic, good lighting, detailed, beautiful, high quality, amazing, perfect"

AI Tutor

Prompt Engineering

Let's workshop this prompt together. What do you think is going wrong with "a thermos in a kitchen, nice, realistic, good lighting…"? Start by diagnosing the problem, and we'll build a better version from there.

Module 2 · Lesson 3

Conditioning Beyond Text: ControlNet, IP-Adapter, and Structural Guidance

How spatial and visual conditioning signals give you precise control over composition, pose, and depth.

If text prompts control what an image contains, what controls exactly where everything goes?

On February 10, 2023, researcher Lvmin Zhang posted ControlNet to Hugging Face. Within days it had tens of thousands of downloads. ControlNet solved one of the most persistent complaints about diffusion models: compositional unpredictability. You could write the perfect prompt, but the model placed elements wherever it wanted. ControlNet changed this by injecting structural conditioning signals (edge maps, depth maps, pose skeletons) directly into the U-Net's intermediate features. For the first time, users could hand the model a stick-figure pose and have the generated figure follow it closely.

How ControlNet Works Architecturally

ControlNet creates a trainable copy of the U-Net's encoder layers, connected to the original U-Net via "zero convolution" layers — convolutional layers initialized to zero so they inject nothing at the start of training and gradually learn to add structural guidance. The conditioning image (edges, depth, pose, etc.) passes through this parallel encoder, and its outputs are added to the corresponding layers of the original U-Net at each resolution.

Crucially, the original U-Net weights are frozen during ControlNet training. This means ControlNet can be trained relatively cheaply (Zhang used a single RTX 3090 for one week) and added to any existing checkpoint without destroying its prior capabilities.
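
The effect of zero initialization can be seen in a toy example. The sketch below treats a 1×1 convolution as a per-pixel linear map over channels; all numbers and function names are illustrative assumptions, not ControlNet's actual code.

```python
def conv1x1(x, weight, bias):
    """x: list of pixels, each a list of C_in values; weight: C_out rows of C_in."""
    return [[sum(w * v for w, v in zip(row, pixel)) + b
             for row, b in zip(weight, bias)]
            for pixel in x]

def add(a, b):
    """Element-wise sum of two feature maps with the same shape."""
    return [[p + q for p, q in zip(pa, pb)] for pa, pb in zip(a, b)]

base_features = [[0.5, -1.0], [2.0, 0.25]]    # frozen U-Net activations (toy)
control_features = [[3.0, 1.0], [-2.0, 4.0]]  # trainable-copy activations (toy)

zero_w, zero_b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
at_init = add(base_features, conv1x1(control_features, zero_w, zero_b))

trained_w = [[0.1, 0.0], [0.0, 0.1]]          # pretend some training has happened
after_training = add(base_features, conv1x1(control_features, trained_w, zero_b))

print(at_init == base_features, after_training)
```

At initialization the combined features equal the frozen model's features exactly, so the first training steps cannot degrade the base model. The zero weights can still learn, because the gradient with respect to a weight depends on the incoming feature rather than on the weight's current value.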

Types of ControlNet Conditioning

Canny Edge

Extracts sharp edge lines from a reference image. The model generates content that follows those edges precisely — ideal for maintaining object outlines and architectural shapes.

Depth Map

A grayscale depth estimate (near=white, far=black). The model preserves foreground/background relationships and spatial layering from the reference scene.

OpenPose

A skeleton of 18 body keypoints extracted from a reference photo. Steers the generated figure to replicate the reference pose regardless of clothing, age, or style.

Scribble / Soft Edge

Looser structural guidance from hand-drawn sketches. The model interprets rough lines as compositional intent, filling in detail freely within those constraints.

Normal Map

Surface normal information for controlling 3D surface appearance — useful for product visualization where lighting direction must follow specific geometry.

MLSD (Line Detection)

Detects straight lines — ideal for architectural and interior images where perspective lines and walls must remain precise.

IP-Adapter: Image Prompting

Released by Tencent's AI Lab in August 2023, IP-Adapter introduced a different kind of conditioning: using an image as a style or content reference rather than text. IP-Adapter adds a parallel cross-attention mechanism that accepts CLIP image embeddings (instead of text embeddings) and injects them alongside the text conditioning at each U-Net layer.

The result: you provide a reference photo and a text prompt, and the model generates images that combine the style/content of your reference image with the semantic guidance of your text. A typical use: "Generate a product photo of [our watch] in the aesthetic of [this reference fashion photo]." IP-Adapter achieves this in one pass, whereas previous approaches required fine-tuning the entire model.
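
The parallel cross-attention idea can be sketched for a single query vector. The sketch assumes the keys and values have already been projected, uses made-up numbers, and introduces hypothetical names (`attend`, `decoupled_cross_attention`, `image_scale`). It illustrates the mechanism and is not IP-Adapter's implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

def decoupled_cross_attention(query, text_kv, image_kv, image_scale=1.0):
    """Text attention plus a separately computed, scaled image attention."""
    text_out = attend(query, *text_kv)
    image_out = attend(query, *image_kv)
    return [t + image_scale * i for t, i in zip(text_out, image_out)]

query = [1.0, 0.0]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
image_kv = ([[0.5, 0.5]], [[10.0, 20.0]])

text_only = attend(query, *text_kv)
no_image = decoupled_cross_attention(query, text_kv, image_kv, image_scale=0.0)
mixed = decoupled_cross_attention(query, text_kv, image_kv, image_scale=0.5)
print(text_only, mixed)
```

Setting the image scale to zero recovers ordinary text-only cross-attention, which is why the image branch complements the text conditioning instead of replacing it.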

Real-World Deployment — Adobe Firefly

Adobe Firefly's "Structure Reference" and "Style Reference" features (2024) are commercially deployed features built on the same ideas. Structure Reference conditions generation on the layout of your reference image, much as a depth or edge map does in ControlNet. Style Reference uses image-based conditioning similar to IP-Adapter. Adobe built and trained its own models with an emphasis on commercially safe training data, but the pipeline is similar in principle.

T2I-Adapter and the Composability of Conditioning

A key advance is that multiple conditioning signals can be composed simultaneously. Using ComfyUI or A1111's multi-controlnet support, you can stack: an OpenPose skeleton (for human body), a depth map (for spatial layout), and a Canny edge map (for object outlines) — all simultaneously conditioning the same generation. Each ControlNet contributes its guidance additively to the U-Net residuals, weighted by a per-net "conditioning strength" parameter. Setting one controller to 0.3 and another to 0.8 blends their influence proportionally.
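
The additive blending can be sketched as a weighted sum of residuals. The three-element lists are toy stand-ins for the feature residuals each ControlNet would produce, and `combine_residuals` is a hypothetical helper.

```python
def combine_residuals(residuals, strengths):
    """Weighted element-wise sum of per-ControlNet residuals."""
    combined = [0.0] * len(residuals[0])
    for residual, strength in zip(residuals, strengths):
        for i, value in enumerate(residual):
            combined[i] += strength * value
    return combined

pose_residual = [1.0, 0.0, -1.0]   # toy output of an OpenPose ControlNet
depth_residual = [0.5, 0.5, 0.5]   # toy output of a depth ControlNet

combined = combine_residuals([pose_residual, depth_residual], [0.3, 0.8])
print(combined)
```

Because the contributions simply add, strengths that are too high on several nets at once can over-constrain the result, which is one reason to lower the weight of secondary conditions.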

Key Insight

The modular nature of ControlNet (frozen base weights, parallel trainable encoder, zero-initialized connection) is characteristic of the whole family of "adapter" architectures. LoRA (which predates ControlNet), IP-Adapter, T2I-Adapter, and Adobe Firefly's style references all follow the same broad pattern: inject additional conditioning or capacity without overwriting the base model's learned knowledge.

Lesson 3 Quiz

Test your understanding of ControlNet and structural conditioning.

What is the purpose of "zero convolution" layers in ControlNet's architecture?

Correct. Zero-initialized convolutions ensure that at the start of ControlNet training, the structural adapter adds nothing to the U-Net. It gradually learns to contribute meaningful signals without disrupting pre-trained knowledge.

Incorrect. Zero convolutions serve a training stability purpose — they allow the new ControlNet branch to "ease in" without corrupting the frozen base model from the first gradient step.

If you want to generate a product image that exactly replicates the pose of a model in a reference photo but with completely different clothing and setting, which ControlNet type is most appropriate?

Correct. OpenPose extracts 18 skeletal keypoints and gives them to the model as structural guidance. The generated figure will match the pose but appearance details — clothing, face, hair — are determined by your text prompt.

Wrong. Canny Edge would preserve clothing outlines too. MLSD handles architecture. Depth preserves 3D layout but not specific body positions. OpenPose is the right tool for pose-specific guidance.

How does IP-Adapter inject image reference conditioning into the generation process?

Correct. IP-Adapter's key insight is a decoupled cross-attention mechanism — a parallel attention pathway that processes CLIP image embeddings and injects them at every U-Net layer, complementing (not replacing) text conditioning.

Incorrect. IP-Adapter doesn't fine-tune or caption — it adds a lightweight parallel attention layer trained to inject CLIP image embeddings. This is why it works in a single inference pass without model modification.

Lab 3: ControlNet Strategy

Work through structural conditioning scenarios with your AI tutor. Complete 3 exchanges to finish.

Your Challenge

An e-commerce company wants to generate product images of their furniture line in multiple room settings without reshooting. They have existing product photos and want consistent results. Design a ControlNet conditioning strategy for this use case.

Scenario: They have a high-res photo of a walnut dining table. They want to generate it in (a) a Scandinavian minimalist setting, (b) a moody industrial loft, and (c) a bright Japanese-inspired room — maintaining the table's exact shape, proportion, and surface texture. Which conditioning types would you stack, at what weights, and why?

AI Tutor

ControlNet Strategy

Great scenario — product visualization is one of ControlNet's strongest real-world applications. Let's think through this systematically. What conditioning types are you considering first, and what's your reasoning for preserving the table's shape vs. its surface texture?

Module 2 · Lesson 4

Inpainting, Outpainting, and Image-to-Image: Editing the Generated World

How diffusion models let you selectively modify, extend, and transform existing images — and the pipeline mechanics behind each technique.

When an image is "95% perfect," how do you fix the remaining 5% without regenerating everything?

In August 2022, OpenAI demonstrated DALL·E 2's outpainting capability by expanding Jan Vermeer's 17th-century painting Girl with a Pearl Earring beyond its original canvas — revealing a full room, windows, and surroundings the painter never depicted. The demo went viral because it made the pipeline's core logic visible: the model could reason about what should exist beyond an image's boundary using only the existing content as context. This is computationally identical to inpainting — both mask-and-fill operations — just applied to the border.

Image-to-Image (img2img): Strength and Noise

The most fundamental editing operation is img2img. Rather than starting from pure Gaussian noise, img2img encodes your existing image into the latent space, adds a controlled amount of noise (determined by the denoising strength parameter, from 0.0 to 1.0), and then runs the standard denoising process conditioned on your new prompt.

At strength 0.0, no noise is added and the output is identical to the input. At strength 1.0, full noise is added and the result is equivalent to text-to-image. The practical range is roughly 0.4–0.75: enough noise to allow meaningful changes while retaining the overall composition, colors, and layout of the source image.
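
In practice, strength is often implemented by skipping the early part of the denoising schedule. The sketch below shows one such mapping from strength to steps. The truncating `int()` conversion mirrors a convention seen in open-source implementations but is an assumption here, and `img2img_schedule` is a hypothetical name.

```python
def img2img_schedule(num_steps: int, strength: float):
    """Return (start_index, steps_run) for a given denoising strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_run = min(int(num_steps * strength), num_steps)
    start_index = num_steps - steps_run   # earlier (noisier) steps are skipped
    return start_index, steps_run

for strength in (0.0, 0.5, 0.75, 1.0):
    print(strength, img2img_schedule(40, strength))
```

A side effect of this convention is that lower strengths also run fewer denoising steps, so a 40-step schedule at strength 0.5 performs only 20 U-Net evaluations.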

Real Use Case — Adobe Firefly Generative Fill

Adobe Photoshop's Generative Fill (May 2023) is a production-grade inpainting pipeline. Users draw a selection, type a prompt, and Firefly generates 3 options. The underlying mechanism: the selected region is treated as the inpainting mask, the surrounding pixels are encoded as context, and the model fills the masked area conditioned on both the surrounding image and the text prompt — all in the Photoshop interface.

Inpainting: Masking and Context-Aware Fill

Inpainting is img2img with a binary mask. Pixels in the masked region receive noise; pixels outside the mask are preserved. The model must generate content that is both semantically consistent with the text prompt and visually coherent with the unmasked surrounding pixels — a constraint enforced by encoding the full image into the latent space and only adding noise where the mask is active.
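
One simple way to enforce "preserve outside the mask" on a standard model is to re-impose the original content after every denoising step. The sketch below shows that blend on a flattened toy latent; the values and the name `blend_latents` are illustrative assumptions.

```python
def blend_latents(denoised, original_noised, mask):
    """Keep the model's output where mask == 1 and the original where mask == 0."""
    return [m * d + (1 - m) * o
            for d, o, m in zip(denoised, original_noised, mask)]

denoised = [9, 9, 9, 9]          # what the model produced at this step (toy)
original_noised = [1, 2, 3, 4]   # source latent, noised to the same step (toy)
mask = [0, 1, 1, 0]              # 1 = regenerate, 0 = preserve

blended = blend_latents(denoised, original_noised, mask)
print(blended)
```

Only the two masked positions take the model's values. The unmasked positions stay tied to the source image, so the model sees the real surroundings as context at every step.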

Runway ML's Stable Diffusion inpainting model (late 2022) trained on deliberately masked training pairs — images where regions were masked out and the model had to reconstruct them. This training regime produced a model that understood "fill this region consistently with context," rather than just "generate image from text." Models trained specifically for inpainting substantially outperform applying img2img with a mask on a standard model.

Outpainting: Extending Beyond the Canvas

Outpainting is inpainting applied to an extended canvas. The original image is placed in the center (or to one side) of a larger blank canvas. The blank areas are treated as the inpainting mask. The model fills them conditioned on the original image content — inferring perspective, lighting, and scene continuation from the existing pixels as context.
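
Preparing the inputs for outpainting amounts to padding the image and building a matching mask. The sketch below does this horizontally for a tiny image stored as nested lists; `make_outpaint_canvas` and the fill value are hypothetical choices for illustration.

```python
def make_outpaint_canvas(image, pad_left, pad_right, fill=0):
    """Pad each row horizontally; return (canvas, mask) where mask 1 = generate."""
    canvas, mask = [], []
    for row in image:
        canvas.append([fill] * pad_left + list(row) + [fill] * pad_right)
        mask.append([1] * pad_left + [0] * len(row) + [1] * pad_right)
    return canvas, mask

image = [[5, 6],
         [7, 8]]
canvas, mask = make_outpaint_canvas(image, pad_left=1, pad_right=2)
print(canvas)
print(mask)
```

The resulting canvas and mask can then be fed to any inpainting workflow, which is why the two techniques share the same machinery.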

The main technical challenge is maintaining global coherence — the extended region must match lighting direction, perspective, and color temperature of the original. Stable Diffusion XL's inpainting model (2023) uses a larger receptive field in the attention layers, giving it better awareness of the full scene context when filling extended regions.

Inpainting at Scale: Commercial Deployments

Beyond Photoshop, several platforms have deployed inpainting commercially at scale. Canva's Magic Eraser and Magic Edit (2023) use diffusion inpainting for consumer photo editing. Getty Images and Shutterstock integrated generative fill for their licensed asset libraries. Google's Magic Eraser in Pixel phones (using an on-device diffusion model) fills removed objects using the surrounding pixels as context — running the complete inpainting pipeline on a mobile processor.

Consistency Across Edits: The Remaining Challenge

The central unsolved problem of iterative editing is identity and consistency preservation. Each inpainting or img2img operation is stochastic — the model samples from a probability distribution each time. Apply inpainting twice to the same region with the same prompt and you get different results. Techniques for addressing this include: locking the random seed, using deterministic samplers (DDIM), IP-Adapter face conditioning for person consistency, and ControlNet reference-only mode — but no technique yet provides pixel-perfect reproducibility across multiple edits without additional fine-tuning.
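
Seed locking, the first technique listed, can be illustrated with Python's standard random module. The list of Gaussian samples stands in for the initial noise tensor, and `initial_noise` is a hypothetical helper.

```python
import random

def initial_noise(seed, size=4):
    """Draw a reproducible 'noise tensor' from a private, seeded generator."""
    rng = random.Random(seed)   # avoids touching global random state
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

run_a = initial_noise(42)
run_b = initial_noise(42)
run_c = initial_noise(43)
print(run_a == run_b, run_a == run_c)
```

The same seed gives the same starting noise, so a deterministic sampler with identical settings would retrace the same path. Changing the prompt, mask, or strength still changes the result, which is why a fixed seed alone does not guarantee consistency across edits.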

Key Terms — Lesson 4

Denoising Strength: In img2img, the fraction of noise added to the input image before denoising. 0.0 = unchanged; 1.0 = pure text-to-image. Typical range: 0.4–0.75.

Inpainting Mask: A binary image (black/white) indicating which pixels should be regenerated (white) and which preserved (black).

Outpainting: Extending an image beyond its original borders by treating the extended canvas as an inpainting mask and conditioning generation on the existing content.

Stochastic Sampling: The inherent randomness in diffusion model generation; the same inputs produce different outputs each run unless the random seed is fixed.

Lesson 4 Quiz

Test your knowledge of inpainting, outpainting, and img2img.

In img2img, setting the denoising strength to 0.2 versus 0.9 — what is the fundamental difference in what happens to the input image?

Correct. Denoising strength controls how much noise is added to the latent before denoising begins. Low strength = small deviation from input. High strength = mostly new generation from the text prompt.

Incorrect. Denoising strength is about how much noise is injected into the latent, not step count, pixel selection, or resolution. It controls the degree of departure from the original image.

Why does Google's Magic Eraser on Pixel phones represent a technically significant deployment of inpainting?

Correct. Running diffusion inpainting on a mobile chip demonstrates that the pipeline can be heavily optimized and quantized to run locally — no cloud required. This has significant implications for privacy and latency.

Wrong. The notable aspect is that it runs entirely on-device on a mobile processor — not the cloud, not a large model, not 3D. This required significant model compression and optimization.

What is the primary reason that applying inpainting twice to the same masked region with the same prompt produces different results each time?

Correct. Each generation starts from a freshly sampled random noise tensor. Even with an identical prompt, mask, and model, different noise seeds produce different images. Fixing the random seed is the primary way to get reproducible outputs.

Incorrect. The source of stochasticity is the initial random noise sample — the starting point of the denoising process. The text encoder and VAE are deterministic; it's the noise initialization that varies.

Lab 4: Editing Pipeline Design

Design inpainting and img2img workflows with your AI tutor. Complete 3 exchanges to finish.

Your Challenge

A real estate marketing team has high-quality photos of vacant apartments that need to look furnished and styled for listings. They want to: (1) add furniture to empty rooms, (2) change wall colors, and (3) extend narrow room shots to show more space. Design a complete diffusion editing workflow for each task.

Start with task 1: adding furniture to an empty living room photo. Walk through which technique you'd use (img2img, inpainting, or outpainting), the denoising strength you'd set, any ControlNet conditioning, and how you'd handle consistency across multiple generated options.

AI Tutor

Editing Pipeline

Real estate virtual staging is a commercial-scale use case for exactly these techniques. Let's design this carefully. For task 1 — adding furniture to an empty room — what's your initial instinct about which editing technique fits best, and why? Think about what you need to preserve versus what you need to change.

Module 2 Test

15 questions across all four lessons. Score 80% or higher to pass.

1. In a latent diffusion model, which component is responsible for converting pixel-space images into the compressed latent representation?

Correct. The VAE encoder compresses images from pixel space into the latent space where diffusion occurs. The VAE decoder converts them back.

Incorrect. The VAE encoder handles compression to latent space. CLIP handles text; the U-Net handles denoising; CFG is a guidance parameter.

2. SDXL improved prompt understanding over SD 1.5 primarily by:

Correct. SDXL's dual-encoder design (OpenAI's CLIP ViT-L plus OpenCLIP ViT-bigG) provides richer text conditioning than SD 1.5's single CLIP encoder.

Incorrect. SDXL added a second, larger text encoder. Steps, architecture, and negative prompts were not the primary improvement.

3. The CFG guidance formula is: output = unconditional + CFG_scale × (conditional − unconditional). What does a CFG of 7.5 actually do mathematically?

Correct. CFG scale is a multiplier on the difference vector between conditioned and unconditioned outputs — it amplifies the prompt's directional influence in the noise prediction.

Incorrect. CFG is a multiplier on the (conditional − unconditional) vector, not a blend percentage, step count, or partial application of the prompt.

4. Samplers like DPM++ 2M Karras achieve high quality in 20–25 steps while DDPM required 1,000. What is the key reason for this efficiency gain?

Correct. Advanced samplers use higher-order numerical solvers and adaptive step sizes that reach high-quality results in far fewer function evaluations than DDPM's simple uniform step approach.

Incorrect. The efficiency comes from smarter numerical solving — not architecture size, skipped conditioning, or pixel vs. latent space differences.

5. When writing a prompt, placing "golden retriever" after 30+ other tokens (beyond the effective attention range) will likely result in:

Correct. Later token positions receive less cross-attention weight during U-Net conditioning. Important subjects should appear early in the prompt.

Incorrect. Token position matters for cross-attention weighting. Later tokens have proportionally less influence — a key reason skilled prompters front-load key subjects.

6. The negative prompt "extra limbs, two heads, duplicate" primarily targets which common diffusion model failure mode?

Correct. Human anatomy is a known weakness of diffusion models — they frequently generate extra fingers, merged limbs, or duplicate body parts. These negative terms steer generation away from those failure modes.

Incorrect. This set of negative terms specifically addresses anatomical errors. Separate negative prompts handle blur, typography, and color issues.

7. The Prompt-to-Prompt paper (Hertz et al., Google Research and Tel Aviv University) demonstrated which key insight about how diffusion models represent spatial information?

Correct. Prompt-to-Prompt showed that each token's cross-attention map has a direct spatial correspondence in the generated image — a foundational insight enabling text-based image editing.

Incorrect. The key finding was that individual token attention maps map to spatial image regions — enabling targeted editing by manipulating those attention maps.
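
To see why a token's attention map is spatial, it helps to compute one by hand. This is a toy sketch of scaled dot-product cross-attention in pure Python, assuming one query vector per image location and one key vector per prompt token. `cross_attention_maps` and the tiny inputs are hypothetical and far smaller than anything in a real U-Net.

```python
import math


def cross_attention_maps(pixel_queries, token_keys, height, width):
    """Return one height x width attention map per prompt token.

    pixel_queries: height*width query vectors, in row-major order.
    token_keys: one key vector per prompt token.
    """
    dim = len(token_keys[0])
    maps = [[[0.0] * width for _ in range(height)] for _ in token_keys]
    for idx, query in enumerate(pixel_queries):
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in token_keys]
        peak = max(scores)                      # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        y, x = divmod(idx, width)
        for t, e in enumerate(exps):
            maps[t][y][x] = e / total           # softmax over tokens at this location
    return maps


# A 1x2 "image": the left pixel aligns with token 0, the right pixel with token 1.
queries = [[4.0, 0.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
maps = cross_attention_maps(queries, keys, height=1, width=2)
print(maps[0][0][0] > 0.5, maps[1][0][1] > 0.5)
```

Because every token ends up with its own grid of weights over image locations, editing or swapping those grids changes where and how strongly a word affects the picture.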

8. Why does ControlNet freeze the original U-Net weights during training?

Correct. Freezing the base U-Net is what makes ControlNet modular and efficient — you train only the parallel encoder, preserving everything the base model already knows how to generate.

Incorrect. The reason is architectural modularity: freezing preserves the base model's capabilities, so a ControlNet can be trained cheaply and attached to any compatible checkpoint built on the same base model.

9. Which conditioning type would best help a model maintain accurate perspective lines when generating architectural interior images?

Correct. MLSD specifically detects straight lines and is optimized for architectural applications where perspective accuracy and wall/floor/ceiling lines must be preserved precisely.

Incorrect. MLSD (Mobile Line Segment Detection) is the right tool for architecture: it detects straight lines, including perspective convergence lines. Depth maps preserve 3D layout but not line precision.

10. IP-Adapter was released by Tencent's AI Lab in 2023. What fundamental problem did it solve?

Correct. IP-Adapter solved the "image prompting" problem — using a reference image's style or content — without expensive per-image fine-tuning, through a lightweight parallel cross-attention mechanism.

Incorrect. IP-Adapter specifically enabled efficient image-reference conditioning via a parallel attention layer — no fine-tuning required per reference image.

11. In the DALL·E 2 outpainting demonstration with Vermeer's "Girl with a Pearl Earring," why is the operation technically identical to inpainting?

Correct. Outpainting is inpainting where the masked region happens to be at the canvas boundary. The pipeline is identical: encode surrounding context, mask target region, condition generation on context.

Incorrect. Outpainting is literally the same pipeline as inpainting — it's just inpainting where the masked area is at the edge. Same mask-and-fill logic, different mask placement.
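
The "same pipeline, different mask" point can be made concrete. This is a minimal sketch on nested lists, assuming a single-channel image; `outpaint_canvas` is a hypothetical helper that only prepares the inputs an inpainting model would receive.

```python
def outpaint_canvas(image, pad, fill=0):
    """Extend a 2D image by `pad` pixels on every side and build the inpainting mask.

    Returns (canvas, mask). The mask is 1 where pixels must be generated
    and 0 where original pixels are kept as context.
    """
    height, width = len(image), len(image[0])
    new_h, new_w = height + 2 * pad, width + 2 * pad
    canvas = [[fill] * new_w for _ in range(new_h)]
    mask = [[1] * new_w for _ in range(new_h)]
    for y in range(height):
        for x in range(width):
            canvas[y + pad][x + pad] = image[y][x]
            mask[y + pad][x + pad] = 0
    return canvas, mask


image = [[5, 6], [7, 8]]
canvas, mask = outpaint_canvas(image, pad=1)
for row in mask:
    print(row)
```

The resulting canvas and mask are ordinary inpainting inputs. The only thing that makes the job "outpainting" is that the masked region forms a border around the original.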

12. A photographer wants to use img2img to subtly change the season in a landscape photo from summer to autumn while keeping the composition identical. What denoising strength range should they use?

Correct. Mid-range denoising strength (0.4–0.6) allows meaningful stylistic change, such as color palette and foliage, while enough of the original latent survives the partial noising to preserve the underlying composition, perspective, and spatial layout.

Incorrect. Very high strength would destroy the composition; very low strength wouldn't change the season. The mid range (0.4–0.6) is the practical sweet spot for stylistic edits.
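
Denoising strength is easier to reason about as a step count. The sketch below assumes a common convention in which strength decides how far into the noise schedule the source image is pushed, and therefore how many of the scheduled steps actually run. `img2img_steps` is a hypothetical helper, and specific tools may round differently.

```python
def img2img_steps(num_inference_steps, strength):
    """Map a denoising strength in [0, 1] to (start_index, steps_to_run).

    strength = 1.0 starts from pure noise and runs every step.
    strength = 0.0 runs no steps and returns the input unchanged.
    """
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_index = max(num_inference_steps - steps_to_run, 0)
    return start_index, steps_to_run


print(img2img_steps(30, 0.5))  # prints: (15, 15)
```

At strength 0.5 with 30 scheduled steps, the first 15 high-noise steps are skipped, which is why the large-scale composition survives while colors and textures can still change.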

13. What does "conditioning strength" control when stacking multiple ControlNets simultaneously in ComfyUI?

Correct. Conditioning strength scales how much each ControlNet's output is added to the U-Net's intermediate activations — letting you balance, for example, a strong depth constraint with a lighter pose suggestion.

Incorrect. Conditioning strength is a weight parameter on each ControlNet's additive contribution to U-Net residuals — not resolution, step count, or per-net CFG.
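
Stacking can be sketched as a weighted sum. This toy example assumes each ControlNet has already produced a residual with the same shape as the U-Net activation it is added to; `combine_controlnets` and the values are hypothetical and are not ComfyUI's actual node code.

```python
def combine_controlnets(base_activation, residuals, strengths):
    """Add each ControlNet residual to the base activation, scaled by its strength."""
    combined = list(base_activation)
    for residual, strength in zip(residuals, strengths):
        for i, value in enumerate(residual):
            combined[i] += strength * value
    return combined


base = [1.0, 1.0]
depth_residual = [2.0, 0.0]   # strong depth constraint
pose_residual = [0.0, 4.0]    # lighter pose suggestion

out = combine_controlnets(base, [depth_residual, pose_residual], [1.0, 0.5])
print(out)  # prints: [3.0, 3.0]
```

Setting a strength to 0.0 removes that ControlNet's influence entirely, and values between 0 and 1 trade its constraint off against the others and against the text prompt.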

14. Google's Imagen (2022) used T5-XXL as its text encoder instead of CLIP. What specific capability did this improve most noticeably?

Correct. Imagen's paper specifically highlighted improved compositional faithfulness from T5-XXL's richer language understanding, demonstrating that the capacity of the text encoder is a primary bottleneck in prompt adherence.

Incorrect. T5-XXL's main benefit was richer language understanding for compositional and relational prompts — not resolution, speed, or style terms specifically.

15. Runway ML trained a dedicated inpainting model rather than using standard img2img with a mask. Why does a purpose-trained inpainting model outperform this workaround?

Correct. The training objective matters — a model trained specifically on masked reconstruction learns context-aware filling as a core skill, not as an afterthought applied to a model trained only for full-image generation.

Incorrect. The key is training objective. Purpose-trained inpainting models learn from masked-reconstruction pairs, giving them explicit context-aware fill capability. Hardware and VAE differences are not the primary factor.