Module 7 · Lesson 1

The Anatomy of an Image Prompt

Every element you include — or omit — reshapes what the model renders.

What separates a vague request from a precise visual specification?

When Stability AI released Stable Diffusion 1.4 publicly in August 2022, millions of users immediately discovered the same frustration: typing "a beautiful landscape" produced wildly inconsistent results. Within weeks, communities on Reddit and Discord had developed shared prompt libraries — structured templates specifying subject, style, medium, lighting, and quality boosters. These templates circulated as the first informal codifications of image prompt engineering.

Why Prompts Are Not Just Descriptions

Text-to-image models translate token sequences into latent-space coordinates. The model was trained on billions of image-caption pairs, meaning it learned statistical associations between words and visual features — not semantic understanding. When you write "dog," the model activates weighted features across breed, pose, background, lighting, and style simultaneously. Specificity narrows that probability distribution toward your intent.

OpenAI's internal documentation for DALL-E 3 (released October 2023) explicitly noted that longer, more descriptive prompts produced more consistent results, and the system was designed to rewrite short user prompts into expanded specifications before generation. This architectural choice confirmed what practitioners already knew empirically: prompt completeness matters.

The Six-Component Anatomy

Experienced prompt engineers decompose every image request into six addressable components. Missing any component leaves the model free to fill the gap with its own statistical default.

Subject

Who or what is the primary focus. Include count, species/type, pose, expression, and action. "A single red fox sitting on its haunches, ears alert, facing left" outperforms "a fox."

Style

The aesthetic register. Reference movements (impressionism, brutalism), named artists whose style is in the training set, or composite descriptors ("gritty cyberpunk noir"). Style tokens carry enormous distributional weight.

Medium

The physical or digital rendering material. Oil on canvas, watercolor, pencil sketch, 3D render, photograph, linocut — each activates distinct texture and color statistical clusters.

Lighting

The light source and quality. Golden hour, overcast diffuse, harsh rim lighting, bioluminescent glow, candlelight. Lighting is the single most powerful mood modifier available in a prompt.

Quality / Technical

Resolution and rendering quality signals. Tokens like "8K," "highly detailed," "sharp focus," "photorealistic," or "trending on ArtStation" were correlated with high-quality images in training data — including them biases outputs toward that cluster.

Negative Prompt

What to exclude. Supported by most diffusion systems via classifier-free guidance. Common negatives: "blurry, low quality, watermark, extra limbs, deformed hands, text, logo."
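As a minimal sketch of the six-component idea, the Python below keeps each component in its own field and reports which ones are still empty. The class name `PromptSpec` and its methods are hypothetical, chosen only for illustration; no platform requires this structure.

```python
from dataclasses import dataclass, fields


@dataclass
class PromptSpec:
    """Hypothetical container for the six prompt components."""
    subject: str = ""
    style: str = ""
    medium: str = ""
    lighting: str = ""
    quality: str = ""
    negative: str = ""

    def missing(self) -> list[str]:
        # Components left empty are the ones the model fills with its defaults.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> tuple[str, str]:
        # Positive prompt: the five descriptive components, subject first.
        parts = [self.subject, self.style, self.medium, self.lighting, self.quality]
        positive = ", ".join(p.strip() for p in parts if p.strip())
        return positive, self.negative.strip()


spec = PromptSpec(
    subject="a single red fox sitting on its haunches, ears alert, facing left",
    medium="watercolor",
    lighting="golden hour",
)
positive, negative = spec.render()
print(positive)
print("Missing:", spec.missing())
```

Running `missing()` before each generation is one way to make the gaps visible instead of leaving them to chance.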

Weak vs. Strong Prompts

⚠ Weak Prompt

"A city at night"

✓ Strong Prompt

"Aerial photograph of Tokyo's Shinjuku district at 2 AM, neon reflections on wet asphalt, rain-soaked streets, Canon 5D Mark IV, f/1.8, bokeh, cinematic blue-teal color grade, ultra-detailed, 8K resolution. Negative: blurry, cartoon, illustration, daytime."

Research Note

A 2023 study by researchers at Johns Hopkins and Adobe found that adding specific lighting descriptors alone improved human-rated image quality scores by 23% on Stable Diffusion XL outputs, controlling for all other prompt variables. Lighting specificity was the single highest-leverage component tested.

Token Weight and Order

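In many diffusion pipelines, tokens near the start of the prompt tend to exert more influence on the result than tokens near the end. Practitioners report this consistently, although the size of the effect varies by model and prompt length. Front-loading your most critical subject and style information therefore pays dividends. Stable Diffusion's CLIP text encoder reads the prompt with causal (left-to-right) attention, so later tokens are encoded in the context of earlier ones, and in long prompts the closing tokens often have diminishing influence on the image.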

Some interfaces (Automatic1111, ComfyUI) implement explicit weight syntax: (red fox:1.4) multiplies that token's attention weight by 1.4. Overusing high weights causes "prompt bleeding" — artifacts where the emphasized element dominates unnaturally. The sweet spot for most tokens is 1.1–1.5.
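To make the weight syntax concrete, here is a simplified sketch that splits a prompt into (text, weight) pairs and flags anything above the 1.5 ceiling suggested above. It is not the parser any interface actually uses: nested parentheses, escapes, and the bare `(text)` and `[text]` forms are deliberately ignored, and the function names are hypothetical.

```python
import re

# Matches "(some text:1.4)". Nested parentheses, escapes, and the bare "(text)"
# and "[text]" forms that real interfaces accept are deliberately not handled.
WEIGHTED = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")


def parse_weights(prompt: str) -> list[tuple[str, float]]:
    """Split a prompt into (text, weight) pairs; unweighted text gets 1.0."""
    pairs = []
    pos = 0
    for match in WEIGHTED.finditer(prompt):
        plain = prompt[pos:match.start()].strip(" ,")
        if plain:
            pairs.append((plain, 1.0))
        pairs.append((match.group(1).strip(), float(match.group(2))))
        pos = match.end()
    tail = prompt[pos:].strip(" ,")
    if tail:
        pairs.append((tail, 1.0))
    return pairs


def overweighted(pairs, limit=1.5):
    """Return the fragments whose weight exceeds the suggested ceiling."""
    return [text for text, weight in pairs if weight > limit]


pairs = parse_weights("(red fox:1.4), snowy forest, (golden hour:1.8)")
print(pairs)
print(overweighted(pairs))
```

A check like `overweighted` is useful mainly as a lint step before a batch run, when many prompts are generated from templates.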

Cross-attention: The mechanism by which diffusion U-Nets integrate text conditioning. Each spatial position in the image latent attends to text tokens, determining which words shape which regions.

CFG Scale: Classifier-Free Guidance scale. It controls how strongly the model follows the prompt vs. generates freely. Values of 7–12 are typical; higher values increase prompt adherence but reduce diversity and can cause oversaturation.

Negative Prompt: A secondary text input encoded the same way as the main prompt. In classifier-free guidance it typically takes the place of the empty (unconditional) prompt, so each denoising step is pushed away from the features it describes and toward the main prompt.
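The arithmetic behind the CFG scale can be sketched in a few lines. The example below uses two-number lists as stand-ins for the model's noise predictions (real ones are large tensors), and the function name `cfg_combine` is hypothetical. It assumes the common formulation in which the negative-prompt prediction is used as the baseline that guidance moves away from.

```python
def cfg_combine(cond, uncond, scale):
    """Classifier-free guidance: start from the unconditional (or negative-prompt)
    prediction and move `scale` times the distance toward the prompt prediction."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy two-number "noise predictions"; real ones are large tensors.
prompt_pred = [1.0, 2.0]
negative_pred = [0.0, 1.0]

print(cfg_combine(prompt_pred, negative_pred, 1.0))   # scale 1 returns the prompt prediction
print(cfg_combine(prompt_pred, negative_pred, 7.5))   # a typical scale extrapolates past it
```

Because the result extrapolates beyond the prompt prediction, very high scales exaggerate whatever separates the two predictions, which is consistent with the oversaturation noted above.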

Lesson 1 Quiz

The Anatomy of an Image Prompt — 3 questions

1. In Stable Diffusion's CLIP encoder, which tokens generally receive the highest attention weight during generation?

Correct. The CLIP encoder processes tokens sequentially, and cross-attention layers assign diminishing weight to later tokens. Front-loading subject and style information maximizes influence.

Not quite. In CLIP-based diffusion models, earlier tokens receive higher cross-attention weight because of how the sequential encoder feeds into the U-Net's attention mechanism.

2. A 2023 Adobe/Johns Hopkins study found that adding which specific prompt component alone improved human-rated image quality scores by approximately 23%?

Correct. Lighting specificity was the single highest-leverage component in that controlled study, improving quality ratings by 23% on Stable Diffusion XL outputs.

Incorrect. The study isolated lighting descriptors as the single most impactful variable, boosting quality scores by 23% even when all other prompt variables were held constant.

3. What risk arises from setting explicit token attention weights too high (e.g., above 1.5) in Automatic1111?

Correct. Over-weighted tokens cause "prompt bleeding" — artifacts where that element dominates the composition unnaturally, often at the expense of other prompt components.

Not correct. Excessively high token weights cause "prompt bleeding," where the emphasized concept dominates the image unnaturally and other described elements are suppressed.

Lab 1 — Prompt Anatomy Workshop

Practice decomposing and improving image prompts using the six-component framework

Your Task

You'll work with an AI coach to analyze and strengthen image prompts. Start by submitting a weak prompt, then the coach will help you identify which of the six anatomy components are missing and how to add them. Aim for at least three exchanges to complete the lab.

Try starting with: "I want to generate an image of a forest. What components am I missing?"

Prompt Anatomy Coach

Welcome to the Prompt Anatomy Workshop. Share any image prompt — strong or weak — and I'll help you dissect it using the six-component framework: Subject, Style, Medium, Lighting, Quality/Technical, and Negative Prompt. We can also practice going from a vague idea to a fully specified prompt. What are you working on?

Module 7 · Lesson 2

Style, Medium, and Artistic Reference

How to invoke visual traditions, materials, and artists' names without infringing or hallucinating.

Why does writing "in the style of Vermeer" produce such different results than writing "photorealistic"?

In September 2023, Getty Images won a significant ruling allowing its copyright lawsuit against Stability AI to proceed in US federal court. The core allegation: Stable Diffusion had ingested millions of Getty watermarked images, and certain prompts — particularly those referencing specific photographic styles — could produce outputs bearing distorted watermark artifacts. This case forced practitioners to confront a real question: when you invoke a style or artist name in a prompt, what exactly are you activating?

How Style Tokens Work

When a diffusion model is trained on captioned images, artist names and stylistic labels appear alongside thousands of images from that artist or movement. The model learns a dense cluster of visual features — palette, brushstroke texture, compositional tendencies, subject matter — associated with that label. Invoking "Rembrandt" does not access a specific painting; it activates the statistical centroid of every Rembrandt-captioned training image.

This explains why style references are powerful but imprecise. "In the style of Rembrandt" produces baroque chiaroscuro, golden ochres, and psychological portraiture — but not a copy of any specific work. The output is a statistical interpolation of the artist's visual signature.

Legal Context

The 2023 class action lawsuit Andersen et al. v. Stability AI raised the question of whether style itself is copyrightable. US courts have consistently held that style is not protected by copyright — specific expressions are. Prompting for an artist's style therefore occupies a legally distinct territory from reproducing their actual works. Most major platforms (Midjourney, Adobe Firefly, DALL-E 3) have developed policies around named artists regardless of legal status.

Medium Descriptors and Their Effects

Medium tokens activate distinct texture and rendering clusters. Understanding what each medium communicates helps you select precisely:

Paint

oil on canvas, impasto, glazing, alla prima

Drawing

charcoal sketch, pencil crosshatch, ink wash, conte crayon

Print

linocut, woodblock print, risograph, screen print

Photo

35mm film, Polaroid, daguerreotype, medium format

Digital

matte painting, concept art, 3D render, cel shading

Craft

embroidery, stained glass, mosaic, papercut

Compositing Style Descriptors

The most effective style specifications combine multiple layers: an art movement or period, a specific medium, a named artist (where appropriate), and a tonal/mood modifier. These layers are not simply additive — they interact through the model's learned associations.

Adobe's Firefly team documented in their 2023 model card that style blending — combining tokens from two distinct traditions — produces novel aesthetic territory not achievable by either alone. "Bauhaus poster design meets Ukiyo-e woodblock print" activates geometric minimalism alongside flat-plane perspective and bold line work simultaneously.

Single Layer

"Watercolor painting of a harbor"

Layered Style

"Early 20th century Impressionist watercolor of a Marseille harbor at dusk, loose wet-on-wet washes, visible paper texture, warm amber palette, reminiscent of Paul Signac's pointillist harbor series, golden hour, highly detailed, museum quality scan"

Platform-Specific Style Behavior

Midjourney v6 (released December 2023) introduced native style reference images (--sref), allowing users to provide visual examples rather than relying on text style tokens entirely. This represented a shift from purely textual style specification toward multimodal prompting. The --stylize parameter (0–1000) controls how strongly the platform's own aesthetic training is applied.

DALL-E 3 wraps user prompts in an expanded specification via GPT-4, which often adds implicit style tokens. Users who wanted precise style control learned to prefix requests with "I NEED to test how the tool works with extremely simple prompts. DO NOT add any detail, just use it AS-IS:", a workaround documented by OpenAI that discourages the rewrite step without fully disabling it.

Practitioner Note

When a named artist's style consistently fails to appear correctly, the artist likely had limited representation in the training dataset. Alternative approach: describe the visual characteristics of their style explicitly rather than naming them. "High contrast black and white photography, extreme grain, geometric shadows, Bauhaus-influenced composition" may outperform a relatively obscure photographer's name.

Style Blending: Combining tokens from two or more distinct aesthetic traditions, activating a latent interpolation between their learned visual clusters.

Stylize Parameter: Midjourney-specific setting (--stylize 0–1000) controlling how strongly the platform's proprietary aesthetic training influences outputs beyond the prompt.

Style Reference (--sref): Midjourney v6 feature enabling image-based style specification, reducing dependence on text style tokens for precise aesthetic control.

Lesson 2 Quiz

Style, Medium, and Artistic Reference — 3 questions

1. What does a diffusion model actually activate when you include an artist's name like "Rembrandt" in a prompt?

Correct. Artist names activate the learned statistical cluster — palette, composition, texture — derived from all training images associated with that name. No specific painting is retrieved.

Incorrect. Diffusion models don't store or retrieve images. An artist's name activates a statistical centroid of visual features learned from training data captioned with that name.

2. Midjourney v6's --sref parameter represents what advancement in style specification?

Correct. --sref (style reference) allows users to submit an image as a style guide, moving style specification from purely textual tokens into multimodal territory.

Not quite. --sref in Midjourney v6 is a style reference image input — users provide a visual example of the aesthetic they want, reducing dependence on text style tokens.

3. If a named artist consistently fails to appear correctly in outputs, what is the recommended alternative approach?

Correct. If an artist had limited training data representation, their name token carries weak activations. Describing their stylistic characteristics explicitly — contrast, palette, composition — is more effective.

Incorrect. When an artist name underperforms, the model likely had limited exposure to that artist during training. Describing their visual style characteristics explicitly produces more reliable results.

Lab 2 — Style and Medium Composer

Build layered style specifications and explore medium descriptor combinations

Your Task

Work with the AI coach to compose layered style specifications. Try combining art movements with specific mediums, or blend two aesthetic traditions. The coach can suggest medium-specific vocabulary, evaluate your style combinations, and help you find alternatives when an artist name is too obscure.

Try: "I want to create an image that blends two art styles together. Can you help me write the style component of the prompt?"

Style & Medium Coach

Welcome to the Style and Medium Composer lab. I can help you build rich, layered style specifications for image prompts — combining art movements, specific mediums, artist references, and tonal modifiers into coherent aesthetic directions. Share a subject or visual idea and we'll craft the style layer together. What kind of image are you aiming for?

Module 7 · Lesson 3

Lighting, Composition, and Camera Language

Cinematographic and photographic vocabulary gives diffusion models the most precise visual geometry instructions available.

How do you specify not just what appears in an image, but exactly how it is seen?

When Sony Pictures Imageworks began experimenting with Stable Diffusion for concept art in 2023, their visual development artists found that their existing cinematography vocabulary translated directly and powerfully into prompt engineering. Terms from their standard shot sheets — "Dutch angle," "rack focus," "motivated key light from practical source" — produced dramatically different compositions than vague descriptions. Cinematographic language had entered the prompt engineer's toolkit not as metaphor, but as precise technical specification.

Lighting Vocabulary

Lighting descriptors activate learned associations between words and light-quality statistics across the training dataset. Because photography and cinematography have developed precise nomenclature over a century, these terms appear consistently captioned across millions of training images, giving them high activation reliability.

Golden Hour

Warm amber-orange light, long shadows, soft diffuse quality. Strong romantic and nostalgic associations. Activates a well-defined training cluster with high consistency.

Rembrandt Lighting

A triangle of light on the shadow-side cheek, dramatic single-source 45° key light. Named for the painter — now a photographic term with its own dense training cluster independent of Rembrandt's paintings.

Volumetric Light

Visible light rays (crepuscular rays, god rays). Activates atmospheric scattering aesthetics — dusty interiors, forest shafts of light, fog banks.

Backlit / Silhouette

Light source behind subject. Produces rim lighting, lens flare, and edge-definition aesthetics. Strong compositional drama.

Bioluminescence

Self-generated cool blue-green organism light. Strong fantasy and underwater associations. Highly specific cluster.

Overcast / Diffuse

Flat, shadowless, cool-neutral light from cloud cover. Excellent for portrait skin rendering. Documentary and editorial aesthetic.

Camera and Lens Language

Because the training dataset contained billions of photographs, camera and lens specifications are among the most reliable prompt components. The model does not simulate optics, but these terms carry the look that such optics produce in captioned photographs: depth of field, distortion, and perspective compression.

Focal Length

14mm wide, 35mm standard, 85mm portrait, 200mm telephoto

Aperture

f/1.4 shallow DOF, f/8 deep focus, f/22 hyperfocal

Film Stock

Kodak Portra 400, Fuji Velvia 50, Ilford HP5, Cinestill 800T

Shot Type

extreme close-up, medium shot, wide establishing, bird's eye

Angle

Dutch angle, worm's eye, overhead flat lay, eye level

Lens Effect

anamorphic bokeh, barrel distortion, tilt-shift, fisheye

Documented Finding

In Midjourney's internal testing documentation (referenced in their 2023 model notes), specifying a camera model (e.g., "Hasselblad 500CM," "Leica M6") consistently shifted outputs toward medium-format aesthetic characteristics — larger tonal range, specific color rendering, and film grain patterns — even without specifying film stock separately. Camera brand names carry implicit medium associations.

Compositional Direction

Composition tokens guide the model's spatial organization of image elements. These work less reliably than lighting or camera tokens because compositional rules are more abstract and less consistently labeled in training data — but certain well-documented rules still activate reliably.

The rule of thirds — placing subjects at grid intersections — is widely referenced in photography education and appears across training captions. "Rule of thirds composition," "leading lines," "symmetrical composition," and "negative space" all have training precedent. More complex compositional instructions ("spiral composition following the golden ratio") have weaker cluster representation and less predictable results.

Color Grading

Color grading vocabulary from film post-production is highly effective. "Teal and orange color grade" (a dominant Hollywood look from the 2010s), "desaturated muted palette," "Wes Anderson pastel symmetry," and "warm analog filter" all activate well-defined aesthetic clusters because they appear consistently in film review discourse and cinematography documentation that entered training data.

Depth of Field (DOF): The range of distance that appears acceptably sharp. Controlled by aperture (f-number), focal length, and subject distance. Shallow DOF (e.g., f/1.4) isolates subjects; deep DOF (e.g., f/22) keeps everything sharp.

Anamorphic Bokeh: The oval-shaped out-of-focus highlights and horizontal lens flares characteristic of anamorphic cinema lenses. A strong cinematic aesthetic marker.

Crepuscular Rays: The technical term for visible shafts of light through atmosphere. Appearing in scientific and photographic writing, this term activates volumetric light clusters reliably.
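To see why f/1.4 and f/22 stand for such different looks, the depth-of-field definition above can be worked through numerically for a real camera. This sketch uses the standard thin-lens approximations and assumes a 0.03 mm circle of confusion (a common full-frame value). A diffusion model performs no such calculation; the numbers only illustrate what the tokens describe in the photographs it learned from.

```python
def depth_of_field_mm(focal_mm, f_number, subject_mm, coc_mm=0.03):
    """Return (near, far) limits of acceptable sharpness in millimetres.

    Uses the standard thin-lens approximations. coc_mm is the circle of
    confusion; 0.03 mm is a common full-frame assumption.
    """
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    near = subject_mm * (hyperfocal - focal_mm) / (hyperfocal + subject_mm - 2 * focal_mm)
    if subject_mm >= hyperfocal:
        return near, float("inf")  # everything beyond `near` is acceptably sharp
    far = subject_mm * (hyperfocal - focal_mm) / (hyperfocal - subject_mm)
    return near, far


for f_number in (1.4, 8, 22):
    near, far = depth_of_field_mm(focal_mm=85, f_number=f_number, subject_mm=2000)
    print(f"f/{f_number}: sharp from {near:.0f} mm to {far:.0f} mm ({far - near:.0f} mm deep)")
```

For an 85mm lens focused at two metres, the sharp zone grows from a few centimetres at f/1.4 to well over half a metre at f/22, which is the visual difference the aperture tokens are meant to evoke.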

Lesson 3 Quiz

Lighting, Composition, and Camera Language — 3 questions

1. According to Midjourney's 2023 model notes, specifying a camera model like "Hasselblad 500CM" tends to produce what effect even without additional specification?

Correct. Camera brand names carry implicit medium associations learned during training. Hasselblad activates medium-format characteristics — tonal range, color rendering, grain patterns — without needing explicit film stock specification.

Incorrect. Camera model names carry implicit learned associations. Specifying a Hasselblad activates medium-format aesthetic clusters (tonal range, color rendering) even without separately specifying film stock or format.

2. Why do compositional instructions like "spiral composition following the golden ratio" work less reliably than lighting tokens in image prompts?

Correct. Lighting terms have dense, consistent training representation from photography education and practice. Abstract compositional rules appear less frequently and less consistently in training captions, producing weaker activations.

Not quite. The issue is training data distribution: lighting terminology appears consistently captioned across millions of images, while complex compositional concepts have less consistent labeling, weakening their prompt activation.

3. What does "Cinestill 800T" specified in a prompt primarily communicate to a diffusion model?

Correct. Cinestill 800T is a tungsten-balanced cinema film known for halation (red glow around lights) and pronounced grain in low-light conditions — a dense, specific aesthetic cluster in photography training data.

Incorrect. Film stock names like Cinestill 800T specify aesthetic clusters: this particular stock is known for tungsten color balance, distinctive halation around light sources, and high-ISO grain in night photography contexts.

Lab 3 — Cinematography Prompt Builder

Apply lighting, camera, and compositional vocabulary to craft precise visual specifications

Your Task

Work with the AI coach to add precise cinematographic and photographic language to image prompts. Practice specifying lighting setups, camera equipment, focal lengths, film stocks, and compositional directives. Aim to transform a flat scene description into a fully realized visual specification.

Try: "I want to photograph a street musician at night. Help me add lighting and camera language to make this prompt specific."

Cinematography Coach

Welcome to the Cinematography Prompt Builder. I'll help you apply professional lighting setups, camera specifications, film stocks, and compositional vocabulary to your image prompts. Share a scene you want to visualize and we'll build out the technical visual layer together. What's your scene?

Module 7 · Lesson 4

Advanced Techniques: Iteration, Negative Prompts, and Platform Differences

Prompt engineering is not a single transaction — it is a systematic refinement process with platform-specific grammar.

How do professionals turn a first-generation failure into a final polished output through structured iteration?

When Volkswagen's creative agency used Midjourney to develop concept imagery for the 2023 ID.7 launch campaign, their internal process required an average of 47 generation iterations per final approved image. Their documented methodology — later shared at the Cannes Lions festival — involved systematic prompt evolution: establishing the base composition first, then layering style, then refining with negative prompts, and finally using image-to-image (img2img) refinement. No successful prompt was arrived at in a single attempt.

Systematic Iteration Strategy

Experienced practitioners treat prompt engineering as a controlled experiment. Rather than rewriting the entire prompt when a generation fails, they isolate variables: change only the lighting, observe the effect, then adjust style. This single-variable iteration strategy prevents the common mistake of simultaneously changing multiple components and being unable to attribute which change produced which result.

The standard professional iteration sequence is:

Step 1

Establish subject and composition. Start with the core subject and its basic spatial arrangement. Evaluate: is the subject recognizable and positioned correctly?

Step 2

Add style and medium. Once composition is stable, layer in the aesthetic direction. Evaluate: does the style read correctly without disrupting composition?

Step 3

Refine lighting. Specify the lighting setup. This often produces the largest single quality jump in otherwise correct compositions.

Step 4

Add quality boosters and negative prompts. Once the core elements are correct, add technical quality tokens and systematically exclude observed artifacts.

Step 5

img2img or inpainting. Use a good generation as a seed for further refinement, preserving composition while adjusting details at lower denoising strength.
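The single-variable discipline behind these steps can be checked mechanically. The sketch below assumes each prompt version is stored as a dictionary of named components; the function names are hypothetical and the comparison is plain string equality.

```python
def changed_components(previous: dict, current: dict) -> list[str]:
    """List the component names whose text differs between two prompt versions."""
    keys = sorted(set(previous) | set(current))
    return [k for k in keys if previous.get(k, "") != current.get(k, "")]


def is_single_variable_step(previous: dict, current: dict) -> bool:
    # A controlled iteration changes exactly one component at a time.
    return len(changed_components(previous, current)) == 1


v1 = {"subject": "a woman standing in a wheat field"}
v2 = {**v1, "style": "documentary photograph"}
v3 = {**v2, "lighting": "overcast diffuse light", "subject": "a woman walking"}

print(changed_components(v1, v2), is_single_variable_step(v1, v2))
print(changed_components(v2, v3), is_single_variable_step(v2, v3))
```

The second comparison fails the check because lighting and subject changed together, which is exactly the situation where the cause of a better or worse image can no longer be attributed.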

Effective Negative Prompt Construction

Negative prompts are not a generic list to copy-paste — they should be built responsively from observed outputs. When a generation produces a specific artifact, add the descriptor for that artifact to the negative prompt. The most reliable negative prompt vocabulary addresses the model's known failure modes:

Quality

blurry, low quality, jpeg artifacts, noisy, overexposed

Anatomy

extra limbs, deformed hands, missing fingers, fused fingers

Composition

cropped, out of frame, cut off, bad framing

Style

cartoon, anime, illustration (if unwanted)

Text

text, logo, signature, username, watermark

Face

bad face, asymmetrical, crossed eyes, ugly, disfigured
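One way to build a negative prompt responsively is to grow it only when an artifact category is actually observed. The sketch below is a coarse, category-level version that uses part of the vocabulary listed above; in practice you might add only the single descriptor that matches the artifact. The names `NEGATIVE_TERMS` and `build_negative` are hypothetical.

```python
# Vocabulary copied from the categories above; extend it for your own model.
NEGATIVE_TERMS = {
    "quality": ["blurry", "low quality", "jpeg artifacts", "noisy", "overexposed"],
    "anatomy": ["extra limbs", "deformed hands", "missing fingers", "fused fingers"],
    "composition": ["cropped", "out of frame", "cut off", "bad framing"],
    "text": ["text", "logo", "signature", "username", "watermark"],
}


def build_negative(observed: list[str], existing: str = "") -> str:
    """Append terms for the observed artifact categories, skipping duplicates."""
    terms = [t.strip() for t in existing.split(",") if t.strip()]
    for category in observed:
        for term in NEGATIVE_TERMS.get(category, []):
            if term not in terms:
                terms.append(term)
    return ", ".join(terms)


# First pass showed mangled hands; a later pass also showed a stray watermark.
negative = build_negative(["anatomy"])
negative = build_negative(["text"], existing=negative)
print(negative)
```

Keeping the negative prompt tied to observations also keeps it short, which matters on interfaces with a limited token window.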

Platform Grammar Differences

Each major platform has developed distinct prompt conventions. Cross-platform prompts require adaptation, not copy-paste:

Stable Diffusion (Automatic1111/ComfyUI): Supports explicit attention weighting via (token:weight) syntax. Separate positive and negative prompt fields. CFG scale, sampler, and step count are exposed. LoRA and embedding tokens can be injected via <lora:name:weight> syntax. Prompt length matters — CLIP truncates at 77 tokens.

Midjourney v6: Uses natural language more effectively than SD due to its proprietary text encoder. Parameter flags appended to prompt: --ar (aspect ratio), --q (quality), --stylize (aesthetic strength), --chaos (variation), --no (negative equivalent). No separate negative prompt field — negatives go after --no. Responds poorly to explicit attention weighting syntax from SD.

DALL-E 3: Rewrites prompts internally via GPT-4. Best accessed via API with system prompt control. Very literal with spatial instructions. Strongest copyright filtering of the three — artist names are more frequently blocked or moderated.

Adobe Firefly: Trained exclusively on licensed content. Artist name behavior differs significantly — only artists with licensing agreements are in the dataset. Strongest for commercial use. Integrated into Creative Cloud workflows with selection-based inpainting.
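The adaptation between platforms can be sketched as two small formatting functions. This example covers only the differences named above: a separate negative field for Stable Diffusion front ends, and trailing `--ar`, `--stylize`, and `--no` flags for Midjourney. The function names and the dictionary keys are illustrative assumptions, not any tool's required interface.

```python
def to_stable_diffusion(positive: str, negatives: list[str]) -> dict:
    """Stable Diffusion front ends take the negative prompt in its own field."""
    return {"prompt": positive, "negative_prompt": ", ".join(negatives)}


def to_midjourney(positive: str, negatives: list[str], aspect: str = "", stylize=None) -> str:
    """Midjourney takes everything in one string, with flags at the end."""
    parts = [positive]
    if aspect:
        parts.append(f"--ar {aspect}")
    if stylize is not None:
        if not 0 <= stylize <= 1000:
            raise ValueError("--stylize expects a value from 0 to 1000")
        parts.append(f"--stylize {stylize}")
    if negatives:
        parts.append("--no " + ", ".join(negatives))
    return " ".join(parts)


positive = "aerial photograph of Shinjuku at night, neon reflections on wet asphalt"
negatives = ["blurry", "cartoon", "daytime"]

print(to_stable_diffusion(positive, negatives))
print(to_midjourney(positive, negatives, aspect="16:9", stylize=250))
```

A real adapter would also need to strip `(token:weight)` syntax before sending a prompt to Midjourney, since that syntax does not carry over.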

CLIP Token Limit

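Standard CLIP text encoders accept 77 tokens per pass, including the start and end markers, which leaves about 75 tokens (roughly 55–60 words) for the prompt itself. Pipelines that pass the prompt through the encoder once silently drop everything beyond that window. Some interfaces work around the limit: Automatic1111, for example, splits long prompts into 75-token chunks and encodes each chunk separately, which avoids truncation but can weaken the link between words that land in different chunks. Either way, practitioners should front-load critical information. Midjourney and DALL-E 3 use longer-context text encoders and are less susceptible to this constraint.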
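A quick budget check can warn when a prompt is likely to overflow the CLIP window. The sketch below is only a heuristic: it assumes about 75 usable tokens once the start and end markers are counted, and about 1.3 tokens per word, a ratio inferred from the 55–60 word figure above. An exact count requires the real CLIP tokenizer, and the function names here are hypothetical.

```python
import re

USABLE_TOKENS = 75        # 77-token window minus the start and end markers
TOKENS_PER_WORD = 1.3     # rough assumption; real tokenizers vary with vocabulary


def estimate_tokens(prompt: str) -> int:
    """Very rough token estimate: scaled word count plus punctuation marks."""
    words = re.findall(r"\w+", prompt)
    punctuation = re.findall(r"[^\w\s]", prompt)
    return round(len(words) * TOKENS_PER_WORD) + len(punctuation)


def likely_truncated(prompt: str, budget: int = USABLE_TOKENS) -> bool:
    return estimate_tokens(prompt) > budget


short_prompt = "a red fox in snow, golden hour, 85mm photograph"
long_prompt = ", ".join(["highly detailed ornate baroque cathedral interior"] * 12)

print(estimate_tokens(short_prompt), likely_truncated(short_prompt))
print(estimate_tokens(long_prompt), likely_truncated(long_prompt))
```

Treat a positive result as a prompt to reorder or trim, not as proof of truncation.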

Seed Control

Most platforms expose a seed parameter — an integer that initializes the noise pattern before denoising begins. Fixing a seed and changing only one prompt element allows true A/B comparison: the composition and layout remain consistent while the changed token's effect is isolated. This is the practitioner's most powerful debugging tool and is documented in Automatic1111's and Midjourney's official parameter references.
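The reason a fixed seed enables A/B comparison can be shown with Python's standard random number generator standing in for a sampler's noise source. This is a toy illustration: a real pipeline draws a latent tensor, not a four-element list, and `initial_noise` is a hypothetical helper.

```python
import random


def initial_noise(seed: int, size: int = 4) -> list[float]:
    """Stand-in for the latent noise tensor a sampler draws before denoising."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


# Same seed: identical starting noise, so only the prompt change can differ.
run_a = {"seed": 1234, "prompt": "red fox, overcast diffuse light"}
run_b = {"seed": 1234, "prompt": "red fox, golden hour"}
assert initial_noise(run_a["seed"]) == initial_noise(run_b["seed"])

# Different seed: the starting noise (and so the layout) changes as well.
assert initial_noise(1234) != initial_noise(1235)
print("same seed, same starting noise")
```

Recording the seed alongside every prompt version makes any earlier result reproducible on the same model and settings.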

img2img: Image-to-image generation using an existing image as a starting point. Denoising strength (0–1) controls how much the output deviates from the input. Lower values (0.3–0.5) preserve composition while applying prompt modifications.

Inpainting: Selective regeneration of masked regions within an image, leaving unmasked areas intact. Enables precise local correction without re-generating the entire composition.

LoRA (Low-Rank Adaptation): Fine-tuned model adapters that inject specific styles, faces, or concepts into Stable Diffusion without full model retraining. Loaded via <lora:name:weight> syntax and function as high-precision style tokens unavailable through base prompting.
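Denoising strength becomes easier to reason about once it is tied to step counts. The sketch below assumes one common convention, modeled on how some open-source img2img pipelines schedule their work: the input image is noised part-way along the schedule, and only the last `int(steps * strength)` steps are denoised. Exact behavior varies by interface, and `img2img_steps` is a hypothetical helper.

```python
def img2img_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps actually run for a given strength.

    Assumed convention: the input image is noised part-way along the schedule
    and only the remaining int(steps * strength) steps are denoised.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.35, 0.5, 1.0):
    print(strength, img2img_steps(50, strength))
```

Under this convention a strength of 0.35 with 50 scheduled steps runs only 17 of them, which is why the input's composition largely survives.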

Lesson 4 Quiz

Advanced Techniques: Iteration, Negative Prompts, and Platform Differences — 3 questions

1. In the systematic iteration strategy, what is the recommended purpose of fixing a seed while changing one prompt element?

Correct. A fixed seed initializes the same noise pattern, so composition and layout remain consistent. Changing only one prompt element then isolates that token's contribution — functioning as a controlled experiment.

Incorrect. Fixing a seed ensures the same noise initialization, so layout and composition are held constant. This enables true A/B comparison: only the changed prompt element varies, letting you attribute effects accurately.

2. Stable Diffusion's standard CLIP encoder truncates prompts at approximately how many tokens?

Correct. Standard CLIP encoders have a 77-token limit. Prompts exceeding approximately 55–60 words will have later content silently truncated — which is why front-loading critical information is essential.

Incorrect. Standard CLIP in Stable Diffusion truncates at 77 tokens — roughly 55–60 words. Content beyond that limit is silently ignored, making front-loading critical prompt elements essential.

3. In img2img generation, what does a denoising strength of 0.35 primarily preserve?

Correct. A low denoising strength (0.3–0.5) retains much of the input image's composition — subject position, major shapes, spatial arrangement — while the prompt influences detail, texture, and style. Higher values give the prompt more control, eventually replacing the input entirely near 1.0.

Not correct. Denoising strength of 0.35 (low) preserves the composition and layout of the input image while allowing the prompt to modify finer details and stylistic elements. Higher denoising gives the prompt greater control over the output.

Lab 4 — Iteration and Refinement Simulator

Practice systematic prompt iteration strategy and platform-specific adaptation

Your Task

Work with the AI coach to practice the professional iteration sequence: start with a base prompt, identify failures, apply targeted fixes, and build toward a complete specification. Also practice adapting prompts between different platforms (Stable Diffusion, Midjourney, DALL-E 3). Aim for at least three exchanges.

Try: "Here's my current Stable Diffusion prompt that isn't working well: 'a woman standing in a field, beautiful.' Walk me through the iteration process to improve it."

Iteration Coach

Welcome to the Iteration and Refinement Simulator. Share a prompt that's not producing what you want, and we'll work through the systematic refinement process together — isolating variables, adding targeted components, building negative prompts from observed artifacts, and adapting for different platforms. What prompt are you working with, and what's going wrong?

Module 7 — Final Test

Prompt Engineering for Images · 15 questions · Pass at 80%

1. Which of the six prompt anatomy components has been empirically shown to produce the largest single quality improvement when added to an otherwise complete prompt?

Correct. A 2023 Adobe/Johns Hopkins study found lighting descriptors alone improved human-rated quality scores by 23% on Stable Diffusion XL outputs.

Incorrect. The 2023 Adobe/Johns Hopkins study isolated specific lighting descriptors as the single highest-leverage prompt component, improving quality scores by 23%.

2. Why does DALL-E 3 sometimes produce results that don't match short user prompts very precisely?

Correct. DALL-E 3 was designed to expand short user prompts into detailed specifications via GPT-4, which can introduce interpretations not intended by the original user.

Incorrect. DALL-E 3 uses GPT-4 to expand the user's prompt into a detailed specification before sending it to the image model, so results can reflect GPT-4's interpretation rather than the user's exact intent.

3. What does "Rembrandt lighting" specify in an image prompt?

Correct. "Rembrandt lighting" is a photographic term for a single-source lighting pattern that produces a triangle of light on the shadow-side cheek. It forms its own training cluster, independent of Rembrandt's paintings.

Incorrect. Rembrandt lighting is a well-defined photographic term — a 45° single-source key light producing a characteristic triangle of light on the shadow-side cheek. It activates a photographic training cluster, not a painting style cluster.

4. In Midjourney, how are negative prompts specified?

Correct. Midjourney uses --no as its negative-prompt equivalent, appended to the end of the prompt. It does not have a separate negative prompt field like Stable Diffusion.

Incorrect. Midjourney does not have a separate negative prompt field. Negative specifications are added using the --no parameter flag appended to the main prompt (e.g., --no blurry, watermark).
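
A minimal sketch of assembling such a prompt programmatically, assuming you keep exclusions as a list and append them with --no as described above. The helper name midjourney_prompt is hypothetical and not part of any Midjourney tooling.

```python
from typing import Optional, Sequence


def midjourney_prompt(prompt: str, exclude: Optional[Sequence[str]] = None) -> str:
    """Append exclusions to a prompt using Midjourney's --no parameter."""
    prompt = prompt.strip()
    # Drop empty entries so we never emit a dangling "--no".
    items = [item.strip() for item in (exclude or []) if item.strip()]
    if not items:
        return prompt
    return f"{prompt} --no {', '.join(items)}"


print(midjourney_prompt("portrait of a lighthouse keeper, overcast light",
                        ["blurry", "watermark"]))
```

The same exclusion list could feed Stable Diffusion's separate negative prompt field unchanged, which keeps platform adaptation to a formatting step.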

5. "Cinestill 800T" in a prompt primarily activates which aesthetic cluster?

Correct. CineStill 800T is a tungsten-balanced film derived from motion-picture stock, known for its distinctive halation (red glow around light sources) and pronounced grain in low-light night photography.

Incorrect. CineStill 800T is a tungsten-balanced film known for its halation around artificial light sources and high-ISO grain, an aesthetic strongly associated with night and low-light photography.

6. What is "prompt bleeding" in Stable Diffusion?

Correct. Prompt bleeding occurs when attention weights on a specific token are set too high (e.g., above 1.5), causing that element to dominate the composition at the expense of other prompt components.

Incorrect. Prompt bleeding is the result of excessively high token attention weights — the over-emphasized element dominates unnaturally while other prompt components are suppressed.
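
For front ends that support the (token:weight) emphasis syntax, such as the AUTOMATIC1111 web UI, a quick check like the following can flag weights likely to cause this problem. It is only a sketch: the helper name is hypothetical, and the 1.5 limit is the example threshold from the feedback above, not a hard rule.

```python
import re

# Matches emphasis groups of the form (some token:1.3)
WEIGHT_PATTERN = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")


def overweighted_tokens(prompt: str, limit: float = 1.5) -> list:
    """Return (token, weight) pairs whose weight exceeds the limit."""
    return [
        (token.strip(), float(weight))
        for token, weight in WEIGHT_PATTERN.findall(prompt)
        if float(weight) > limit
    ]


prompt = "portrait of a violinist, (red scarf:1.8), (soft window light:1.2)"
print(overweighted_tokens(prompt))
```

Here only the red scarf is flagged. Lowering its weight toward 1.2 or 1.3 is a reasonable first fix before changing anything else in the prompt.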

7. Adobe Firefly differs most significantly from Stable Diffusion and Midjourney in which aspect of its training data?

Correct. Adobe Firefly was trained exclusively on licensed content — Adobe Stock, openly licensed works, and public domain images — making it the primary choice for commercial production workflows where copyright clearance is required.

Incorrect. Firefly's distinguishing characteristic is its licensed training dataset. Only appropriately licensed images were used, which sets it apart for commercial use and affects which artist names appear in its training data.

8. In the professional iteration sequence, what is the recommended first step when refining a failing prompt?

Correct. The professional sequence starts with establishing subject and composition, evaluating whether those are correct before adding style, lighting, quality, and negative components in subsequent steps.

Incorrect. The systematic iteration sequence begins with establishing subject and composition — confirming the fundamental visual structure before layering style, lighting, and quality modifications.

9. Why should negative prompts be built responsively from observed outputs rather than copied from generic lists?

Correct. Generic negative lists can suppress desired elements (e.g., negating "illustration" when you want an illustrative style). Targeted negatives built from observed artifacts address actual problems without collateral suppression.

Incorrect. Generic negative lists risk suppressing desired elements and may not target your actual problem artifacts. Building negatives from what you specifically observe in failed outputs produces more precise, less damaging exclusions.
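
One way to picture this is a small filter that builds the negative prompt from artifacts you actually observed and refuses to negate anything you asked for. This is a sketch under stated assumptions: the function name and the example terms are hypothetical.

```python
def build_negative_prompt(observed_artifacts, desired_terms):
    """Join observed artifacts into a negative prompt, skipping desired terms."""
    desired = {term.strip().lower() for term in desired_terms}
    negatives = []
    for artifact in observed_artifacts:
        key = artifact.strip().lower()
        # Skip blanks, duplicates, and anything the prompt explicitly wants.
        if key and key not in desired and key not in negatives:
            negatives.append(key)
    return ", ".join(negatives)


wanted = ["illustration", "flat colors"]

# Negatives taken from what actually went wrong in the last batch.
print(build_negative_prompt(["extra fingers", "watermark", "Watermark"], wanted))

# A copied generic list would have negated the style we asked for.
print(build_negative_prompt(["illustration", "blurry"], wanted))
```

The second call shows the collateral risk: without the guard, "illustration" would be negated even though the prompt wants an illustrative style.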

10. The Getty Images v. Stability AI lawsuit (2023) was notable because it highlighted what specific visual artifact in some generated outputs?

Correct. The lawsuit alleged that Stable Diffusion had been trained on Getty-watermarked images, and that certain photographic style prompts could produce outputs bearing distorted watermark artifacts, which was presented as evidence of training data memorization.

Incorrect. The core allegation involved distorted watermark artifacts appearing in outputs for certain photographic style prompts — suggesting the model had learned statistical patterns from Getty's watermarked training images.

11. What does the Midjourney --chaos parameter control?

Correct. --chaos in Midjourney controls how much the four generated images in a batch differ from one another. Higher values produce more experimental, varied results; lower values cluster outputs more tightly.

Incorrect. Midjourney's --chaos parameter controls the degree of variation between the four generated options in a batch. Higher values produce more experimentally diverse outputs; lower values produce more consistent, predictable batches.

12. Style blending — combining tokens from two distinct aesthetic traditions — works because of which model property?

Correct. Diffusion model latent spaces are continuous. Activating two style clusters simultaneously produces an interpolation between them — a novel aesthetic hybrid that doesn't exist as a discrete training category.

Incorrect. Because diffusion model latent spaces are continuous, activating two distinct style clusters simultaneously produces an interpolated output — a novel hybrid aesthetic between the two traditions.
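
As a toy picture of what "continuous" means here, the sketch below linearly interpolates between two made-up style vectors. Real models do not expose styles as three-number lists or blend them this literally; the vectors and the blend helper are purely illustrative assumptions.

```python
def blend(style_a, style_b, t=0.5):
    """Linearly interpolate between two equal-length style vectors.

    t=0 returns style_a, t=1 returns style_b, values in between mix them.
    """
    if len(style_a) != len(style_b):
        raise ValueError("style vectors must have the same length")
    return [(1 - t) * a + t * b for a, b in zip(style_a, style_b)]


art_nouveau = [1.0, 0.0, 0.5]  # made-up coordinates, not real embeddings
cyberpunk = [0.0, 1.0, 0.5]

print(blend(art_nouveau, cyberpunk))        # halfway between the two
print(blend(art_nouveau, cyberpunk, 0.25))  # closer to art_nouveau
```

Every value of t yields a valid point between the two styles, which is the intuition behind a hybrid aesthetic that matches neither training category exactly.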

13. In the documented Volkswagen/Midjourney process (Cannes Lions 2023), approximately how many generation iterations were required per approved final image?

Correct. Volkswagen's creative agency documented an average of 47 generation iterations per final approved image — establishing that professional prompt engineering involves systematic, extensive iteration rather than single-shot generation.

Incorrect. The documented VW process averaged 47 iterations per approved image — underscoring that professional image prompt engineering is a systematic refinement process, not a single-attempt endeavor.

14. What is the practical effect of specifying "f/1.4" in an image prompt intended for a photorealistic result?

Correct. f/1.4 is a very wide aperture that produces extremely shallow depth of field in photography — the model's training on captioned photographs means this token reliably activates blurred backgrounds (bokeh) and sharp subject isolation.

Incorrect. f/1.4 is a wide aperture in photography producing shallow depth of field — a sharp subject against a blurred bokeh background. The model learned this association from millions of captioned photographic training examples.
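
The optical reason behind that association can be checked with the standard close-focus approximation for total depth of field, 2 * N * c * u^2 / f^2, where N is the f-number, c the circle of confusion, u the subject distance, and f the focal length. The 50 mm lens, 2 m subject distance, and 0.03 mm circle of confusion below are assumed example values, and the function name is hypothetical.

```python
def depth_of_field_mm(f_number, focal_length_mm=50.0,
                      subject_distance_mm=2000.0,
                      circle_of_confusion_mm=0.03):
    """Approximate total depth of field in millimetres.

    Uses 2 * N * c * u^2 / f^2, which holds when the subject is much
    closer than the hyperfocal distance.
    """
    return (2 * f_number * circle_of_confusion_mm
            * subject_distance_mm ** 2 / focal_length_mm ** 2)


for n in (1.4, 8.0):
    print(f"f/{n}: about {depth_of_field_mm(n) / 10:.0f} cm in focus")
```

Under these assumptions f/1.4 keeps only a thin slice of the scene sharp compared with f/8, which is why captions mentioning f/1.4 so often accompany strongly blurred backgrounds.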

15. Which combination of prompting techniques would be MOST effective for refining a nearly correct generation without regenerating the entire composition?

Correct. img2img at low denoising strength preserves the existing composition while allowing targeted prompt-based modification of details. Inpainting isolates only the problem region for regeneration, leaving the rest intact.

Incorrect. When a composition is nearly correct, img2img at low denoising strength (0.3–0.4) and targeted inpainting are the right tools: they preserve what's working while modifying only the specific problem areas.