In January 1839, Louis Daguerre announced the daguerreotype to the French Academy of Sciences, and the painter Paul Delaroche reportedly declared that "from today, painting is dead." He was wrong, of course, but he was also not entirely wrong. Photography did not kill painting; it reorganized the entire economy of image-making. Portrait studios built on weeks of oil-on-canvas work collapsed within a decade. New professions appeared. The definition of artistic authorship became genuinely contested in courts and academies for the next fifty years. The technology arrived faster than the frameworks to understand it.
Something structurally similar is happening now. In August 2022, an AI-generated image, Théâtre D'Opéra Spatial, produced by Jason Allen using Midjourney, won first place at the Colorado State Fair fine arts competition. The judges did not know it was AI-generated. Within weeks, working illustrators began publicly tracking lost commissions. Getty Images banned AI-generated content in September 2022, then reversed course and launched its own licensed AI tool in 2023. Adobe integrated generative AI into Photoshop in May 2023. The timeline from "research curiosity" to "industry infrastructure" was roughly 18 months.
This course examines the machinery underneath that speed. Specifically, it focuses on diffusion models, the dominant technical paradigm behind Stable Diffusion, DALL·E 3, Midjourney, and Adobe Firefly. You will learn how they work mathematically, how prompts translate into pixels, where they fail and why, and how to evaluate their outputs critically. The goal is not mastery of every parameter but the ability to reason clearly about what these systems are actually doing, which is the prerequisite for using them well and for understanding their limits honestly.
On August 22, 2022, Stable Diffusion 1.4 was released publicly by Stability AI, the first major open-weights diffusion model that could run on a consumer GPU. Within 72 hours, developers had wrapped it in web UIs, integrated it into Blender plugins, and begun fine-tuning it on custom datasets. What made this possible was not just the release of weights but the relative compactness of the underlying architecture: a U-Net denoiser trained on LAION-5B, roughly 860 million parameters, capable of generating a 512×512 image in under 30 seconds on an Nvidia RTX 3080. The release demonstrated something important: the core diffusion algorithm was efficient enough to democratize. Understanding why requires going back to the thermodynamic idea at the heart of the method.
The name "diffusion" is not metaphorical decoration. It refers directly to the physics of how dye spreads through water, a process that, in the forward direction, destroys structure irreversibly. The researchers who developed DDPM (Denoising Diffusion Probabilistic Models, Ho et al., 2020) asked a precise question: if a neural network can learn to reverse one small step of noise addition, can you chain those reversals together to generate coherent images from pure noise? The answer turned out to be yes, and the implications of that yes are still expanding.
A diffusion model is trained in two phases. The first, the forward process, requires no learning at all. You take a clean training image and add a small, mathematically precise amount of Gaussian noise at each of T timesteps (typically 1,000). At step 1, the image looks nearly identical to the original. At step 500, it is recognizably degraded. At step 1,000, it is statistically indistinguishable from pure random noise. This schedule is fixed before training begins; it is not learned.
The critical property of this process is that it is Markovian: each noisy image depends only on the image one step earlier, not on the full history. This simplifies the mathematics dramatically. You can also compute the noisy image at any arbitrary timestep directly, without running through all prior steps, a shortcut that makes training feasible at scale. The noise added follows a Gaussian distribution with a variance controlled by a fixed schedule (linear, cosine, or sigmoid in different papers), so you always know exactly what was added.
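The shortcut to an arbitrary timestep can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the linear schedule with endpoints 1e-4 and 0.02 used in the DDPM paper, and the names `alpha_bar` and `q_sample` are labels chosen here for the cumulative signal fraction and the one-shot noising function.

```python
import math
import random

T = 1000
# Linear variance schedule; the endpoints 1e-4 and 0.02 are the DDPM paper's defaults.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# alpha_bar[t - 1] is the share of the original signal's variance left after t steps.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def q_sample(x0, t, rng):
    """Jump straight to timestep t (1-indexed) without simulating earlier steps."""
    a = alpha_bar[t - 1]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return x_t, noise

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]  # stand-in for a flattened image
for t in (1, 500, 1000):
    x_t, _ = q_sample(x0, t, rng)
    # Print how much of the original signal amplitude survives at this step.
    print(t, round(math.sqrt(alpha_bar[t - 1]), 4))
```

The printed amplitudes fall from nearly 1 at step 1 to nearly 0 at step 1,000, which matches the progression from "nearly identical" to "indistinguishable from noise" described above.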
The forward process produces a vast training dataset at no annotation cost: triples of (noisy image at step t, clean image, timestep t). Every image in your training corpus becomes 1,000 training examples automatically. This is why diffusion models could be trained on LAION-5B (5.8 billion image-text pairs) without requiring expensive human labeling of noise levels.
The forward process is the training data generator. By adding known quantities of noise in a controlled schedule, the model always has ground truth for what was destroyed, which makes the reverse task learnable.
The reverse process is what the neural network actually learns. Given a noisy image at timestep t, the network must predict either (a) the original clean image or (b) the noise that was added. These two framings are mathematically equivalent but have different training dynamics. Ho et al. found that predicting the noise (called epsilon prediction, or ε-prediction) worked better in practice. The network's job at each step is to estimate what random static was layered on top of the underlying signal.
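The equivalence of the two framings follows from the noising formula: once the noisy image and the noise level are known, either quantity determines the other. The sketch below assumes the standard DDPM parameterization, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, which the text does not spell out; the function names and the value of `alpha_bar_t` are illustrative.

```python
import math

def x0_from_eps(x_t, eps, alpha_bar_t):
    """Clean-image estimate implied by a noise prediction."""
    s, n = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(xt - n * e) / s for xt, e in zip(x_t, eps)]

def eps_from_x0(x_t, x0, alpha_bar_t):
    """Noise estimate implied by a clean-image prediction."""
    s, n = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(xt - s * x) / n for xt, x in zip(x_t, x0)]

alpha_bar_t = 0.3            # hypothetical noise level for some mid-schedule step
x0 = [0.5, -0.25, 1.0]       # stand-in clean image
eps = [0.1, -1.2, 0.7]       # stand-in noise sample
x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
       for x, e in zip(x0, eps)]

recovered_x0 = x0_from_eps(x_t, eps, alpha_bar_t)
recovered_eps = eps_from_x0(x_t, x0, alpha_bar_t)
```

Converting in either direction recovers the other quantity up to floating-point error. The two targets differ only in how errors are weighted across timesteps during training.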
The architecture used for this prediction is a U-Net, originally developed for biomedical image segmentation in 2015 by Ronneberger et al. at the University of Freiburg. A U-Net has an encoder path that progressively compresses the image into a compact representation, and a decoder path that expands it back to full resolution. Crucially, skip connections link corresponding encoder and decoder layers, preserving fine spatial detail while also incorporating global context. In diffusion models, the timestep t is embedded as a numerical vector and injected into every layer of the U-Net, so the network always knows how noisy the input is and calibrates its prediction accordingly.
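The text does not say how t becomes a vector. One common choice, used in the DDPM reference code and borrowed from Transformer position encodings, is a sinusoidal embedding. The sketch below shows that idea only; the width `dim=8` and the `max_period` constant are illustrative, and real models usually pass the result through a small learned network before injecting it.

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of an integer timestep; dim must be even."""
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down to roughly 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

early = timestep_embedding(10)    # a lightly noised input
late = timestep_embedding(900)    # a heavily noised input
```

Each timestep maps to a distinct, bounded vector, so the network can condition on the noise level without seeing raw integers of very different magnitude.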
Training the U-Net is straightforward: feed in a noisy image and a timestep, have the network predict the noise, compare its prediction to the actual noise that was added (which you know), and backpropagate the mean squared error. Repeat across millions of image-timestep pairs. After training, the network has learned a surprisingly general model of how images look (what textures, edges, lighting, and compositional structures are probable), because recovering signal from noise requires understanding what signal looks like.
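The loss computation for one training example can be sketched as follows. This is a toy under stated assumptions: the schedule is the linear DDPM one, `zero_model` is a hypothetical stand-in for the U-Net that always predicts "no noise", and there is no gradient step, since that would require an actual network and an autodiff library.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def training_loss(model, x0, rng):
    """One example: pick a timestep, noise the image, score the noise prediction."""
    t = rng.randrange(T)                           # random timestep index
    a = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]        # ground-truth noise
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return mse(model(x_t, t), eps)

def zero_model(x_t, t):
    # Hypothetical stand-in for the U-Net: ignores its input entirely.
    return [0.0] * len(x_t)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]   # stand-in for a flattened image
losses = [training_loss(zero_model, x0, rng) for _ in range(2000)]
avg_loss = sum(losses) / len(losses)
print(round(avg_loss, 2))
```

Because the added noise has unit variance, a model that predicts nothing scores an average loss close to 1. Training succeeds to the extent that the network pushes its loss below that baseline.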
At inference time, when you actually want to generate an image, you start with pure Gaussian noise (a tensor of random numbers the same size as your target image) and run the reverse process step by step. At each step, the U-Net predicts the noise component, you subtract it (scaled appropriately), and you obtain a slightly less noisy image. After T steps, you have a clean image.
The original DDPM paper (2020) used T = 1,000 steps, which was too slow for practical use. Later that year, Song et al. at Stanford introduced DDIM (Denoising Diffusion Implicit Models), a deterministic sampling algorithm that can produce high-quality images in 20–50 steps by taking larger, more carefully calculated steps through the noise schedule. This reduced generation time by roughly 20× without retraining the underlying model. The insight was that the Markovian constraint is only needed during training, not during sampling: you can skip steps if you account for the skipped noise analytically.
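A deterministic DDIM update (the eta = 0 case) can be sketched as below. The assumptions are stated in the code: the linear DDPM schedule, evenly strided timesteps, and a hypothetical `oracle` noise predictor that stands in for a trained U-Net by knowing the target image in advance. With a perfect predictor the sampler lands on the target in 50 steps instead of 1,000, which shows the step-skipping logic in isolation.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def ddim_step(x_t, eps, a_t, a_prev):
    """Deterministic DDIM update from noise level a_t to the cleaner level a_prev."""
    x0_pred = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for x, e in zip(x_t, eps)]
    return [math.sqrt(a_prev) * p + math.sqrt(1.0 - a_prev) * e
            for p, e in zip(x0_pred, eps)]

def sample(eps_model, x_T, num_steps):
    # Visit only num_steps of the T training timesteps, noisiest first.
    stride = T // num_steps
    timesteps = list(range(T - 1, -1, -stride))
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the last visited timestep, jump to the fully clean level (alpha_bar = 1).
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = ddim_step(x, eps_model(x, t), a_t, a_prev)
    return x

target = [0.5, -0.25, 1.0, 0.0]   # the image the oracle "knows"

def oracle(x_t, t):
    # Hypothetical perfect noise predictor, used in place of a trained U-Net.
    a = alpha_bar[t]
    return [(x - math.sqrt(a) * x0) / math.sqrt(1.0 - a) for x, x0 in zip(x_t, target)]

x_T = [0.3, -1.1, 0.8, 2.0]       # stand-in for a pure-noise starting point
result = sample(oracle, x_T, num_steps=50)
```

A real U-Net's predictions are imperfect, so in practice fewer steps means larger accumulated error; the sketch isolates the scheduling arithmetic from that effect.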
Later work by Lu et al. (2022) introduced DPM-Solver, which reduced steps further to 10–20 for many use cases. Today, schedulers like DPM-Solver++, PNDM, and LCM (Latent Consistency Models, 2023) can produce acceptable images in as few as 4 steps by distilling the multi-step process into a learned shortcut. Speed and quality remain in tension (fewer steps means less refinement), but the trajectory has been consistently toward faster generation at equivalent or better quality.
Forward process: adding noise in T steps (fixed, not learned). Reverse process: removing noise in T steps (learned by U-Net). Scheduler: the algorithm governing step size and noise variance during sampling. ε-prediction: predicting the noise rather than the clean image, the standard training objective in most modern diffusion models.
Use the AI tutor below to deepen your understanding of the core diffusion mechanism. Engage seriously: surface questions get surface answers. Push into specifics: what happens at different timesteps, why U-Nets specifically, what DDIM actually changed.
In December 2021, Robin Rombach and colleagues at Ludwig Maximilian University Munich published "High-Resolution Image Synthesis with Latent Diffusion Models", the paper that would become the direct technical basis for Stable Diffusion. The core insight was computational: running diffusion directly on 512×512 pixel images (786,432 numbers) was expensive enough to require industrial GPU clusters. But if you first compressed the image into a 64×64×4 latent representation (16,384 numbers), you could run the same diffusion process on a 48× smaller tensor. The key was a separately trained autoencoder that could compress images into this latent space and decompress them back with high fidelity. Rombach's paper demonstrated that most of the perceptually relevant structure of an image survives compression into the latent space, and therefore that denoising in latent space is effectively the same as denoising in pixel space, at a fraction of the cost.
A Variational Autoencoder (VAE) is trained as a separate component, before and independent of the diffusion model. Its job is to learn a compressed representation of images. The encoder takes a full-resolution image and maps it to a lower-dimensional tensor; in Stable Diffusion's case, that means 8× spatial compression in each dimension, plus 4 latent channels. The decoder takes that compressed tensor and reconstructs the original image as faithfully as possible.
The "variational" aspect introduces an important regularization: the latent space is trained to be approximately Gaussian, meaning similar images cluster near each other, and the space doesn't have arbitrary holes or discontinuities. This property matters enormously for diffusion, because sampling starts from Gaussian noise. If the latent space is also Gaussian-shaped, then starting from noise and denoising gives you something that the decoder can meaningfully reconstruct. If the latent space had arbitrary topology, noise initialization would produce garbage.
The VAE is trained on a perceptual loss (does the reconstruction look similar to the original according to a pre-trained image classifier?) plus a reconstruction loss (are the pixel values numerically close?) plus a KL divergence term (is the latent distribution close to Gaussian?). These three objectives in tension produce a latent space that is both compact and semantically organized.
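Of the three terms, the KL divergence is the one with a simple closed form. The sketch below assumes the usual VAE setup in which the encoder outputs a mean and a log-variance per latent dimension; the function names and the `kl_weight` value are illustrative choices, and the reconstruction and perceptual terms are taken as already-computed numbers.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def vae_loss(reconstruction, perceptual, kl, kl_weight=1e-6):
    # kl_weight is an illustrative value; the text does not give the weighting.
    return reconstruction + perceptual + kl_weight * kl

matched = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])    # already standard normal
shifted = kl_to_standard_normal([2.0, -1.0], [0.0, 0.5])   # off-center, wider
total = vae_loss(reconstruction=0.12, perceptual=0.30, kl=shifted)
```

The penalty is zero only when the encoder's output distribution is exactly a standard normal and grows as the mean drifts or the variance departs from 1. How strongly to weight it against reconstruction quality is a design decision.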
Stable Diffusion runs diffusion on a 64×64×4 latent tensor rather than a 512×512×3 pixel tensor. That is 48× fewer values to process at each denoising step. On an RTX 3080, this is roughly the difference between generating an image in 25 seconds and in approximately 20 minutes. The latent diffusion architecture is what made consumer-grade generation possible.
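The 48× figure can be checked with a few lines of arithmetic. The helper names here are illustrative; the compression factor of 8 and the 4 latent channels are the Stable Diffusion values given in the text.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Latent tensor shape for an image, given the VAE's spatial compression factor."""
    return (height // downsample, width // downsample, channels)

def numel(shape):
    """Total number of values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

pixel = (512, 512, 3)
latent = latent_shape(512, 512)
ratio = numel(pixel) / numel(latent)
print(latent, numel(pixel), numel(latent), ratio)
# prints: (64, 64, 4) 786432 16384 48.0
```

The ratio counts tensor values, not floating-point operations. Actual speedups depend on the network's layers, so 48× is best read as an order-of-magnitude guide.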
The latent representations learned by a well-trained VAE have an interpretable geometry, even though this structure is not explicitly supervised. Directions in latent space tend to correspond to semantically meaningful variations: moving along certain axes shifts color temperature, others shift spatial composition, others shift object identity. This emergent structure is what makes latent space arithmetic sometimes work: the phenomenon where (latent of "king") minus (latent of "man") plus (latent of "woman") roughly equals (latent of "queen"), familiar from word embeddings, has weak analogues in image latent spaces.
More practically, the compressed latent representation preserves high-frequency structural information (edges, textures) through the skip-connection-like paths in the VAE decoder, while the low-dimensional bottleneck forces the model to capture global composition. When the diffusion U-Net operates on this latent representation, it is effectively learning a prior over compressed image descriptions, not over raw pixel values.
The VAE also introduces an important source of quality limitation: decoder artifacts. Certain fine details (text rendering, fingers, high-frequency textures) are difficult to reconstruct faithfully through the compression bottleneck. When Stable Diffusion generates blurry text or malformed hands, part of the blame belongs to the VAE decoder's limitations at these structure types, not only to the diffusion model's prior. Stable Diffusion XL (2023) and SDXL-VAE attempted to address this with a higher-capacity decoder and training data that included more fine-detail examples.
Not all major diffusion models use the latent approach. DALL·E 2 (OpenAI, 2022) used a pixel-space diffusion model for its final stage, operating at 64×64 and then upsampled through a separate diffusion upsampler. Imagen (Google Brain, 2022) similarly used a cascade of pixel-space diffusion models at increasing resolutions (64→256→1024). These approaches traded computational cost for avoiding the VAE reconstruction bottleneck: they never had to decompress from a latent space, so they avoided decoder artifacts entirely.
The latent approach won commercially because the compute savings outweighed the quality tradeoffs for most applications. DALL·E 3 (2023) shifted OpenAI back toward a latent-space approach with a heavily improved VAE. Google's Imagen 2 and the underlying infrastructure of Midjourney v5 and v6 also use latent-space variants. The consensus in the research community by 2023–2024 was that with a sufficiently high-capacity VAE, latent diffusion matches or exceeds pixel diffusion quality at a fraction of the cost.
For Latent Diffusion Models: (1) VAE encoder compresses input or provides latent space structure. (2) Text (or other conditioning) is encoded separately. (3) Diffusion U-Net denoises in latent space over T steps, guided by conditioning. (4) VAE decoder reconstructs the final pixel image from the denoised latent. Quality and speed both depend on all four components.
The VAE is often treated as a black box, but its design choices directly shape what kinds of images a model can and cannot generate well. Use the tutor to explore what the latent space actually contains, how the perceptual loss changes what gets preserved, and why certain content types are systematically harder to reconstruct.
In April 2022, OpenAI released DALL·E 2 and demonstrated something that surprised even researchers who had been following the field: the model could interpret prompts like "a photo of a bowl of cherries in the style of Vermeer, dramatic side lighting" and produce images that were coherently responsive to each element of that description. This was not possible in the 2021 DALL·E, which struggled with compositional prompts. The difference was a new conditioning mechanism: rather than feeding text directly to the diffusion model, DALL·E 2 used CLIP embeddings (a shared vision-language representation space) to guide generation. Understanding this mechanism requires understanding how text and image representations are made compatible, and how that compatibility is used during the denoising process.
Before a text prompt can guide a diffusion model, it must be converted into a numerical representation that the U-Net can process. Stable Diffusion 1.x used the CLIP text encoder (ViT-L/14, OpenAI 2021) β a transformer that maps tokenized text to a sequence of 768-dimensional vectors. Stable Diffusion 2.x switched to OpenCLIP ViT-H/14, producing 1024-dimensional embeddings trained on a larger and more carefully filtered dataset. SDXL (2023) took this further by using two text encoders simultaneously: OpenCLIP ViT-bigG (1280-dimensional) and the original CLIP ViT-L (768-dimensional), concatenating their outputs to produce a richer 2048-dimensional conditioning signal.
The key property of CLIP embeddings is that they were trained to align text and image representations in a shared space: text describing an image should be close in embedding space to the image itself. This alignment is what makes CLIP embeddings useful as conditioning signals: they carry visual semantic meaning, not just syntactic structure. A CLIP embedding of "sunset over mountains" contains information about warm colors, horizontal gradients, and jagged silhouettes, encoded implicitly through the contrastive training objective on 400 million image-text pairs.
For DALL·E 3 (2023), OpenAI moved to a different conditioning regime: rather than CLIP, they used T5 (a pure-language transformer from Google) for text encoding. T5 embeddings are better at preserving syntactic and relational structure, so they handle prompts like "a red cube on top of a blue sphere" more reliably than CLIP, which tends to represent concepts as unordered bags of features. The tradeoff is that T5 embeddings have no inherent visual grounding, so they require a larger and better-trained diffusion model to interpret them correctly.
CLIP text encoders carry implicit visual information (trained on image-text pairs) but struggle with spatial relationships and complex compositions. T5 preserves linguistic structure better but needs more model capacity to translate that structure into visual outputs. DALL·E 3's high compositional accuracy is partly attributable to T5 combined with a better-recaptioned training dataset.
The mechanism by which text embeddings actually influence the U-Net is cross-attention. In the U-Net's intermediate layers, the spatial feature maps (representing the current state of the noisy latent) are treated as queries, while the text embedding sequence is treated as keys and values. At each cross-attention operation, every spatial position in the image attends to every token in the text prompt, weighting its influence based on learned relevance scores.
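A single cross-attention operation can be sketched with toy numbers. This is a simplified, single-head version: real layers first pass the spatial features and text embeddings through learned linear projections to obtain queries, keys, and values, and those projections are omitted here. All values are made up for illustration.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_attention(queries, keys, values):
    """Each spatial position (query) takes a weighted average of the token values."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) * scale for k in keys])   # one weight per text token
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights

# Toy setup: 3 spatial positions, 2 text tokens, feature width 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # derived from the noisy latent
keys = [[4.0, 0.0], [0.0, 4.0]]                  # derived from the text embedding
values = [[1.0, 0.0], [0.0, 1.0]]                # derived from the text embedding
outputs, weights = cross_attention(queries, keys, values)
```

Each row of `weights` is one spatial position's distribution over prompt tokens. Here the first position attends mostly to the first token, the second to the second, and the third splits evenly. Reshaped to the image grid, one token's column of weights gives a per-token attention map.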
This means the conditioning is not applied once at the beginning but at every cross-attention layer throughout the denoising process. The U-Net is constantly re-querying the text embedding, asking (in effect) "given this prompt, what should this spatial region look like at this noise level?" Different attention heads specialize for different aspects of the conditioning: some heads are primarily responsible for style, others for object identity, others for spatial arrangement.
The cross-attention maps produced during generation are interpretable and have been studied explicitly. In a 2023 paper from Google Research (Prompt-to-Prompt, Hertz et al.), researchers showed that specific text tokens reliably activate specific spatial regions in the cross-attention maps, and that you can edit images by manipulating these maps rather than by changing pixels. This research confirmed that the diffusion model is building a genuine spatial-semantic correspondence between prompt tokens and image regions, not simply pattern-matching at a global level.
Even with cross-attention, early conditioned diffusion models tended to produce images that were only weakly responsive to the prompt: they looked like plausible images, but not necessarily of the thing described. The fix was Classifier-Free Guidance (CFG), introduced by Ho and Salimans at Google Brain in 2021.
CFG works by training the model with both conditioned (with text) and unconditioned (with empty text) examples. At inference time, you run the denoiser twice per step: once with your text prompt and once with an empty prompt. The final noise prediction is: unconditioned prediction + guidance_scale × (conditioned prediction − unconditioned prediction). This extrapolates in the direction that makes the output more consistent with the conditioning. A guidance scale of 1.0 is equivalent to no guidance; 7.5 is the Stable Diffusion default; values above 15 typically produce over-saturated, distorted outputs as the model is pushed far outside its training distribution.
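The guidance formula translates directly into code. The two input predictions below are made-up numbers standing in for the U-Net's outputs at one denoising step, and the function name is a label chosen here.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine the two noise predictions made at one denoising step."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.40, 0.25]   # prediction with an empty prompt (illustrative)
eps_cond = [0.30, -0.10, 0.05]     # prediction with the text prompt (illustrative)

no_guidance = classifier_free_guidance(eps_uncond, eps_cond, 1.0)
default_sd = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
```

At a scale of 1.0 the result is just the conditioned prediction. At 7.5, each component is pushed 7.5 times as far from the unconditioned prediction as the conditioned one was, which is why large scales can land outside the range of predictions the model produced during training.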
CFG is computationally expensive: it doubles the number of U-Net forward passes per step. Several alternatives have been proposed. Perturbed Attention Guidance (2024) and Autoguidance (2024) replace the empty-prompt pass with a deliberately degraded prediction, obtained by perturbing the attention maps or by using a weaker version of the model, and guide away from it. At the time of writing, CFG remains the dominant guidance method in production systems, though its compute cost has spurred significant research into cheaper alternatives.
High CFG scales push the model toward image modes that are maximally consistent with the prompt. This often increases sharpness and prompt adherence while reducing diversity. Too high, and the model exceeds the probability distribution it was trained on β producing artifacts, over-saturation, and anatomical distortions. The guidance scale is one of the most impactful inference parameters a user controls.
Conditioning is where the abstract mechanism meets the practical question of why your prompt produces what it produces. Use the tutor to explore: what makes some prompts more effective than others at a mechanistic level, how CFG scale affects outputs, and what the cross-attention architecture implies about the limits of spatial composition in current models.
In January 2023, a viral thread on Twitter documented every attempt to get Stable Diffusion 2.1 to generate a person holding exactly four coins in their left hand. The model consistently produced images with three coins, six coins, coins partially fused with fingers, or hands with too many fingers holding approximately coin-shaped objects. None of 47 attempts produced the correct result. The thread became a reference point in arguments about AI capability, cited both by critics who said it proved AI couldn't count and by defenders who said it proved nothing about intelligence. Both were arguing past the interesting question, which is: why exactly does this happen? The answer is architectural and illuminating about the limits of the entire class of models.
Diffusion models do not have an explicit symbolic reasoning layer. They do not count objects; they learn statistical distributions over what images containing "four coins" tend to look like. In the training data (LAION-5B or similar), images labeled with "four coins" vary enormously: different angles, lighting, arrangements, overlaps. The model learns a blurry average of these visual patterns, not the rule that produces them.
More precisely: the U-Net operates on continuous activations. It has no mechanism to say "I have placed exactly N instances of object X." During each denoising step, it is making a local probabilistic prediction about what the image should look like given the current noisy state and the conditioning. Whether there are 3 or 5 coins emerges from the accumulated momentum of many local decisions, and nothing in the architecture enforces global count consistency.
This is why models like DALL·E 3 made specific improvements to counting by dramatically improving their training data through recaptioning: having a purpose-built image captioning model write detailed, count-explicit captions for training images. This didn't fix the architectural limitation; it worked around it by ensuring the statistical prior better matched count-explicit descriptions. The improvement is real but fragile: push counts high enough (7 coins, 12 coins) and the failures return.
Counting failures stem from the absence of discrete symbolic representation in continuous neural networks. The model approximates the visual statistics of count-labeled images rather than implementing the counting rule. Improvements via better training data work around this but do not eliminate it.
Prompts specifying spatial relationships (for example, "the red sphere is to the left of the blue cube") fail at high rates in most diffusion models. The underlying cause relates to how CLIP embeddings encode meaning. CLIP training used contrastive objectives on image-text pairs where the text was typically a caption describing the image's content, not its precise spatial layout. The training signal did not reward precise spatial binding, so the embeddings don't encode it reliably.
Cross-attention can, in principle, bind attributes to spatial positions, and the Prompt-to-Prompt work (Hertz et al., 2023) showed that attention maps do develop loose spatial specialization. But this specialization is learned statistically, not architecturally guaranteed. A text token for "left" will activate attention patterns associated with the left half of images in the training set, but only as a probabilistic tendency. It does not force the object to appear on the left.
Research efforts to fix this include structure-guided generation (providing bounding boxes as additional conditioning, as in GLIGEN, 2023), multi-object diffusion (segmenting the generation into object-level passes), and training on datasets with explicit spatial annotation. These help but each introduces new constraints: bounding box conditioning requires users to specify layouts, which defeats the convenience of natural language generation.
Diffusion models trained on the standard LAION-5B dataset are notoriously bad at rendering legible text within images. The structural reason is that text rendering requires precise spatial arrangement of strokes into recognizable symbols, a task that requires understanding discrete character shapes and their sequential arrangement, not just the statistical distribution of pixel patterns near text-like regions.
The U-Net's denoising process has no mechanism that corresponds to "write these specific characters in this sequence at this position." Text tokens from the CLIP encoder describe what the text says, not how the specific glyphs look spatially. The model learned that images with prompts mentioning a word tend to have text-like regions, but learned this distribution too coarsely to reproduce specific characters reliably.
DALL·E 3 made substantial improvements in text rendering, partly through training on synthetic images with clean, diverse text and partly through the T5 encoder's better preservation of the specific character sequence. Imagen 2 (Google, 2023) and Stable Diffusion 3 (2024) also made significant progress using improved text encoders and architecture changes. SD3 specifically introduced a transformer-based architecture (DiT, Diffusion Transformer) rather than a U-Net, and the attention mechanism in DiT handles long-range dependencies (including character sequences) more directly.
The "six-fingered hand" failure is one of the most widely observed artifacts in early diffusion models and merits specific explanation. Human hands are topologically complex (five distinct articulated appendages emerging from a palm in consistent spatial arrangement) and are photographically variable. In the training data, hands are often partially occluded, blurred, unusual in gesture, or seen from ambiguous angles. The model's prior over hand appearance is therefore wide and uncertain.
More importantly, the diffusion process operates locally: at each denoising step, the U-Net makes predictions based on local receptive fields. Whether a specific region looks like a finger is evaluated locally. The constraint that exactly five non-overlapping fingers must emerge from a palm in biologically consistent arrangement is a global topological constraint. The diffusion process has no mechanism to enforce global topological consistency; it optimizes for local plausibility at each step. The result is that individual regions each look finger-like, but their aggregate arrangement violates the global constraint.
By 2024, models like Midjourney v6, DALL·E 3, and Stable Diffusion 3 substantially reduced hand artifacts through larger training datasets that over-represented clear hand photographs, better VAE decoders, and architecture improvements that increased effective receptive field size. The improvement demonstrates that the problem is addressable through scale and data, but the architectural analysis explains why it exists in the first place.
Counting, spatial relations, text rendering, topological consistency: these failures share a common structure. They each require either discrete symbolic reasoning (counting), global spatial consistency (spatial arrangement, finger topology), or precise sequential structure (character sequences). Diffusion models are continuous, local, probabilistic generative processes. They are exceptionally good at learning the statistical texture of visual appearance and are structurally disadvantaged at tasks requiring global discrete constraints. Progress on these failures has come from better data and scale, not from architectural fixes to this fundamental property.
Each failure mode we covered has a specific architectural cause. The goal of this lab is to practice tracing from observable artifact to root cause, which will help you both use these models more effectively and evaluate claims about their capabilities more critically.