Image Generation Models

1. The Prompt-to-Prompt paper (Hertz et al., 2023) showed that cross-attention maps are spatially interpretable. What practical capability did this research enable?

Correct. If specific attention maps correspond to specific image regions, you can perform targeted edits — swap the noun while preserving the layout, change an attribute in one region while leaving others intact. This confirmed diffusion models build genuine spatial-semantic correspondence, not just global statistical matching.

Prompt-to-Prompt's key contribution was showing that cross-attention maps reliably encode which prompt tokens drive which spatial regions. This made targeted editing possible: swap "cat" for "dog" in the prompt while reusing the original attention maps, and the rest of the spatial structure is preserved. It demonstrated genuine spatial-semantic correspondence inside the diffusion model.

2. T2I-Adapter was released by which research group?

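Correct. T2I-Adapter and IP-Adapter were both produced by Tencent researchers in 2023: T2I-Adapter at Tencent ARC Lab (with Peking University) and IP-Adapter at Tencent AI Lab.

T2I-Adapter came from Tencent (Mou et al., 2023, ARC Lab with Peking University), as did IP-Adapter (Tencent AI Lab).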

3. What is "prompt bleeding" in Stable Diffusion?

Correct. Prompt bleeding occurs when attention weights on a specific token are set too high (e.g., above 1.5), causing that element to dominate the composition at the expense of other prompt components.

Incorrect. Prompt bleeding is the result of excessively high token attention weights — the over-emphasized element dominates unnaturally while other prompt components are suppressed.

4. In a cascade filtering pipeline, what is the primary reason hard filters run before defect detectors?

Correct. Hard filters (rule-based, milliseconds) eliminate obviously disqualifying images cheaply before the more compute-intensive learned defect detectors and aesthetic scorers run on a reduced, pre-filtered candidate set.

Not correct. The reason is computational efficiency: hard filters are extremely cheap (rule-based, milliseconds) and reduce the image pool before more expensive learned classifiers and scoring models need to run.
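The ordering argument can be made concrete with a small sketch. Everything here is hypothetical: the field names, the thresholds, and the precomputed `defect` and `aesthetic` scores stand in for what would really be expensive model calls. The point is only that each later stage sees fewer candidates.

```python
def cascade(candidates, stages):
    """Run stages in order (cheapest first) and record how many
    candidates each stage had to evaluate."""
    evaluated = {}
    survivors = list(candidates)
    for name, keep in stages:
        evaluated[name] = len(survivors)  # workload seen by this stage
        survivors = [c for c in survivors if keep(c)]
    return survivors, evaluated


# Toy candidate pool; scores are stand-ins for real model outputs.
images = [
    {"id": "a", "width": 1024, "height": 1024, "defect": 0.1, "aesthetic": 6.5},
    {"id": "b", "width": 256, "height": 256, "defect": 0.0, "aesthetic": 7.0},
    {"id": "c", "width": 1024, "height": 1024, "defect": 0.8, "aesthetic": 6.0},
    {"id": "d", "width": 1024, "height": 1024, "defect": 0.2, "aesthetic": 4.0},
]

stages = [
    # Rule-based check: reject anything below an assumed minimum resolution.
    ("hard_filter", lambda im: min(im["width"], im["height"]) >= 512),
    # Stand-in for a learned defect detector (assumed threshold 0.5).
    ("defect_detector", lambda im: im["defect"] < 0.5),
    # Stand-in for an aesthetic scorer (assumed threshold 5.0).
    ("aesthetic_scorer", lambda im: im["aesthetic"] >= 5.0),
]

survivors, evaluated = cascade(images, stages)
```

In this toy run the hard filter sees all four images, the defect detector sees three, and the aesthetic scorer sees two, which is the saving the cascade is designed to produce.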

5. What does "Rembrandt lighting" specify in an image prompt?

Correct. "Rembrandt lighting" is a photographic term for a specific single-source lighting pattern producing a triangle of light on the shadow-side cheek, existing as its own training cluster independent of Rembrandt's paintings.

Incorrect. Rembrandt lighting is a well-defined photographic term — a 45° single-source key light producing a characteristic triangle of light on the shadow-side cheek. It activates a photographic training cluster, not a painting style cluster.

6. What does CLIP stand for?

Correct. CLIP = Contrastive Language–Image Pre-Training, introduced by OpenAI in January 2021.

Incorrect. CLIP stands for Contrastive Language–Image Pre-Training.

7. In Midjourney, how are negative prompts specified?

Correct. Midjourney uses --no as its negative equivalent, appended directly to the end of the prompt. It does not have a separate negative prompt field like Stable Diffusion.

Incorrect. Midjourney does not have a separate negative prompt field. Negative specifications are added using the --no parameter flag appended to the main prompt (e.g., --no blurry, watermark).

8. Black Forest Labs adapted ControlNet for FLUX.1 in 2024. What was the key architectural difference from UNet-based ControlNet?

Correct. Diffusion transformers don't have UNet skip connections, so the FLUX ControlNet variant injects into attention feature maps instead.

The key adaptation was targeting attention feature maps for injection — transformers don't have the encoder skip connections that UNet-based ControlNet relies on.

9. Adobe Firefly differs most significantly from Stable Diffusion and Midjourney in which aspect of its training data?

Correct. Adobe Firefly was trained exclusively on licensed content — Adobe Stock, openly licensed works, and public domain images — making it the primary choice for commercial production workflows where copyright clearance is required.

Incorrect. Firefly's distinguishing characteristic is its licensed training dataset: only images with appropriate licensing were used, which underpins its commercial viability and limits which artists' work, and therefore which artist names, the model has learned.

10. What is the Precision/Recall framework's advantage over FID for diagnosing generative model failures?

Correct. Precision measures fidelity — what fraction of generated images look realistic. Recall measures diversity — what fraction of the real distribution is covered. FID conflates both into one number; P&R separates them, enabling more targeted diagnosis.

Not correct. The advantage is diagnostic separation: Precision tells you whether generated images are high quality, Recall tells you whether the model covers the full real distribution. FID combines both into one number, making it impossible to know which is failing.
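One common way to estimate the two numbers separately is with k-nearest-neighbour balls in a feature space. The sketch below applies that idea to raw 2-D points. The function names, the choice k=1, and the toy data are illustrative assumptions; a real evaluation would use image embeddings from a pretrained network rather than coordinates.

```python
import math


def kth_nn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(queries, support, k):
    """Fraction of queries that fall inside at least one k-NN ball of support."""
    radii = kth_nn_radii(support, k)
    hits = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)


def precision_recall(real, fake, k=1):
    precision = coverage(fake, real, k)  # are generated samples realistic?
    recall = coverage(real, fake, k)     # is the real distribution covered?
    return precision, recall


# Toy mode collapse: real data has two clusters, the generator hits only one.
real = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
fake = [(0.5, 0.0), (0.2, 0.0)]
precision, recall = precision_recall(real, fake, k=1)
```

Here every generated point lies near real data, so precision is high, while most real points have no generated neighbour, so recall is low. A single combined score would not show which of the two had failed.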

11. What does ᾱ_t (alpha-bar at timestep t) represent in the DDPM forward process?

Correct. ᾱ_t = ∏(1−β_i) for i=1 to t. When ᾱ_t ≈ 0, the latent is pure noise; when ᾱ_t ≈ 1, almost no noise has been added.

Incorrect. ᾱ_t is the cumulative noise schedule product. Since x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε, the ratio ᾱ_t/(1−ᾱ_t) is the signal-to-noise ratio at forward step t.
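The cumulative product is easy to compute directly. The sketch below assumes a linear β schedule from 1e-4 to 0.02 over 1000 steps, a commonly used DDPM setting that this quiz does not itself specify; the function names are illustrative.

```python
def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced per-step noise variances (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative product of (1 - beta_i) for i = 1..t, one value per t."""
    out = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def signal_to_noise(alpha_bar):
    """SNR implied by x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps."""
    return alpha_bar / (1.0 - alpha_bar)


betas = linear_betas(1000)
abar = alpha_bars(betas)
```

With this schedule `abar[0]` is just under 1 (almost no noise added) and `abar[-1]` is close to 0 (nearly pure noise), matching the two limits described above.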

12. You need to apply rough hand-drawn outlines as ControlNet conditioning input. Which conditioning type is specifically designed to tolerate imprecise input?

Correct. Scribble ControlNet models are specifically trained on rough, imprecise input to tolerate the variation of hand-drawn outlines.

The Scribble/Sketch conditioning type is trained to handle rough input. Canny expects clean edge maps extracted from a reference image, so feeding it hand-drawn lines gives unreliable results.

13. A professional architectural visualization pipeline uses two ControlNets simultaneously — depth for scene layout and Canny for facade detail. Which weighting approach is correct?

Correct. Establishing a priority hierarchy — primary conditioning at higher weight, secondary at lower — prevents conflicting guidance artifacts where the two signals disagree.

When stacking ControlNets, a weight hierarchy is essential. Primary structure (depth) gets higher weight; secondary detail (Canny) gets lower weight to prevent conflict artifacts.

14. What metric does CLIP use to measure similarity between image and text embeddings?

Correct. Cosine similarity measures the angle between two embedding vectors, ranging from −1 to 1.

Incorrect. CLIP uses cosine similarity — the cosine of the angle between two vectors — to measure how aligned image and text embeddings are.
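As a minimal illustration, cosine similarity can be computed in a few lines. The short vectors here are toy stand-ins for real CLIP embeddings, which have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])   # aligned
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposed
```

Because the result depends only on direction, scaling either embedding leaves the score unchanged.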

15. SDXL improved prompt understanding over SD 1.5 primarily by:

Correct. SDXL's dual-encoder design, which pairs OpenAI's CLIP ViT-L with OpenCLIP ViT-bigG, provides richer text conditioning than SD 1.5's single CLIP encoder.

Incorrect. SDXL added a second, larger text encoder. Steps, architecture, and negative prompts were not the primary improvement.

16. The alpha (α) parameter in LoRA controls:

Correct. The effective weight is W₀ + (α/r)·BA, where r is the rank. Alpha scales how much the low-rank update contributes, and the ratio α/r is what practically matters for controlling LoRA strength.

Alpha is a scaling constant in the formula W₀ + (α/r)·BA, where r is the rank of the update. It determines how strongly the trained LoRA update is applied. The ratio α/r is what practitioners actually tune to control the LoRA's contribution strength.
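A tiny worked example makes the scaling visible. This sketch assumes the convention from the original LoRA formulation, in which the low-rank product BA is multiplied by α/r; the helper names and the 2×2 matrices are purely illustrative.

```python
def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w0, b, a, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the rank (rows of A)."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


w0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
b = [[1.0], [2.0]]              # 2 x r, with r = 1
a = [[3.0, 4.0]]                # r x 2
merged = lora_effective_weight(w0, b, a, alpha=2.0)
```

With r = 1 and α = 2 the update BA is doubled before being added to the base weight. Raising r while holding α fixed would shrink the same update, which is why the ratio is the quantity to watch.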

17. Which sampler was identified as the community-recommended default for ControlNet workflows based on a large r/StableDiffusion comparison study?

Correct. DPM++ 2M Karras at 25 steps and CFG 7 was the consensus recommendation from the community comparison study.

DPM++ 2M Karras was the community consensus. It balances step efficiency and quality and handles ControlNet conditioning reliably.

18. In the professional iteration sequence, what is the recommended first step when refining a failing prompt?

Correct. The professional sequence starts with establishing subject and composition, evaluating whether those are correct before adding style, lighting, quality, and negative components in subsequent steps.

Incorrect. The systematic iteration sequence begins with establishing subject and composition — confirming the fundamental visual structure before layering style, lighting, and quality modifications.

19. In Classifier-Free Guidance, what is done during training to enable the unconditional denoising path?

Correct. With probability ~10–20%, the conditioning text embedding is replaced with a null embedding during training, causing the model to learn both conditional and unconditional denoising simultaneously.

Incorrect. CFG training drops the text conditioning randomly, teaching the single model to denoise both with and without text guidance.
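The training-time dropout and the way the two predictions are later combined can be sketched as follows. The function names and the 10% default are assumptions for illustration. The combination rule shown is the standard classifier-free guidance formula, which this quiz does not spell out.

```python
import random


def maybe_drop_condition(text_emb, null_emb, p_drop=0.1, rng=random):
    """During training, swap in the null embedding with probability p_drop."""
    return null_emb if rng.random() < p_drop else text_emb


def guided_noise(eps_uncond, eps_cond, scale):
    """At sampling time, extrapolate from the unconditional prediction
    toward the conditional one: eps_u + scale * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Count how often the condition is dropped over many training steps.
rng = random.Random(0)
num_steps = 10_000
num_dropped = sum(
    maybe_drop_condition("text", "null", 0.1, rng) == "null"
    for _ in range(num_steps)
)
drop_rate = num_dropped / num_steps
```

Because the same network sees both conditioned and null-conditioned inputs during training, a single model can supply both `eps_cond` and `eps_uncond` at sampling time.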

20. The lawsuit filed by Sarah Andersen, Kelly McKernan, and Karla Ortiz in January 2023 specifically raised concerns about LoRA because:

Correct. The accessibility of LoRA — consumer GPU training on 10–30 images — meant that by the time of the lawsuit, hundreds of LoRAs had already been trained specifically on named living artists' styles and shared publicly, without those artists' consent.

The core concern was accessibility: LoRA made it trivial for anyone to train a model specifically targeting an individual artist's style with a consumer GPU and a small image scrape. Civitai already hosted dozens of named-artist LoRAs when the lawsuit was filed.

Final Exam