When Adobe demonstrated Generative Fill live at its MAX conference, the audience watched a photographer erase a tourist from a crowded Colosseum shot and replace him with contextually appropriate stone pavement, all in under eight seconds. The crowd gasped. But the mechanism behind that moment had been in development for years, rooted in a branch of machine learning called latent diffusion modeling that had nothing to do with photography at all.
Understanding why generative fill works so convincingly, and where it still fails, requires going one layer deeper than the magic show.
Inpainting is the process of reconstructing missing or masked regions of an image so that the result is visually coherent with the surrounding content. The term predates AI entirely: conservators used it for centuries to restore damaged paintings. In digital imaging it first appeared in academic computer vision literature in 2000, when Bertalmío, Sapiro, Caselles, and Ballester published their landmark paper "Image Inpainting", demonstrating a PDE-based approach for filling small gaps.
Early digital inpainting was purely algorithmic: the system sampled pixels from nearby regions and tiled or blended them inward. Photoshop's Content-Aware Fill, introduced in CS5 in 2010, used a patch-based algorithm (drawing from research by Barnes et al. on PatchMatch) that searched the image for visually similar texture regions and synthesized a fill. This worked well for simple backgrounds like grass, sky, and water but failed on structured or semantic content: faces, architecture, text, complex objects.
The shift to deep-learning-based inpainting began in earnest around 2018 with NVIDIA's paper "Image Inpainting for Irregular Holes Using Partial Convolutions" (Liu et al., 2018), which showed that convolutional neural networks could fill arbitrarily shaped holes with semantically plausible content. But those early models still had telltale artifacts at mask boundaries and limited resolution.
Modern generative fill, as implemented in Adobe Firefly, Stable Diffusion's inpainting pipeline, DALL-E 3's edit mode, and Google's Magic Eraser, is built on latent diffusion models (LDMs). The foundational paper, "High-Resolution Image Synthesis with Latent Diffusion Models" by Rombach et al. (CompVis group, 2022), introduced the core insight: instead of running the expensive diffusion process in pixel space, compress the image into a lower-dimensional latent space using a variational autoencoder (VAE), diffuse there, and decode back to pixels.
For inpainting, the model receives three inputs simultaneously: the original image (encoded to latent), a binary mask (1 = region to fill, 0 = region to preserve), and an optional text prompt. The masked region is set to noise in latent space, and the denoising U-Net iteratively removes that noise, conditioned on both the surrounding unmasked latents and the text embedding. The preserved region exerts structural and color pressure on what gets generated in the filled region, which is why good generative fill respects lighting, perspective, and texture continuity.
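The masked-denoising loop described above can be sketched in a few lines of NumPy. This is a toy illustration, not the actual Firefly or Stable Diffusion code: `neighbour_average` is a hypothetical stand-in for the denoising U-Net, the text prompt is omitted, and the 4×8×8 latent shape simply assumes the 8× downsampling and 4 latent channels used by Stable Diffusion's VAE. What it does show is the key mechanic: after every step the preserved latents are re-imposed, so only the masked region is ever generated.

```python
import numpy as np

def blend_latents(generated, original, mask):
    """Take generated latents where mask == 1 and original latents where mask == 0."""
    return mask * generated + (1.0 - mask) * original

def neighbour_average(latents, step):
    """Toy stand-in for the U-Net: average each cell with its four spatial neighbours."""
    shifted = [np.roll(latents, shift, axis=axis) for axis in (-2, -1) for shift in (-1, 1)]
    return (latents + sum(shifted)) / 5.0

def toy_inpaint(original, mask, denoise_step, num_steps=10, seed=0):
    """Fill the masked region starting from noise, keeping the unmasked latents fixed."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(original.shape)
    # The masked region starts as pure noise; everything else keeps its latents.
    latents = blend_latents(noise, original, mask)
    for step in range(num_steps):
        latents = denoise_step(latents, step)             # context leaks into the hole
        latents = blend_latents(latents, original, mask)  # re-impose preserved context
    return latents

original = np.ones((4, 8, 8))   # assumed latent for a 64x64 px image (4 channels, 8x downsampling)
mask = np.zeros((8, 8))         # 1 = region to fill, 0 = region to preserve
mask[2:6, 2:6] = 1.0
result = toy_inpaint(original, mask, neighbour_average)
```

Even with this crude denoiser, the filled values drift toward the surrounding latents with each step, which is the "structural and color pressure" the preserved region exerts. Real pipelines additionally noise the preserved latents to match each timestep, and dedicated inpainting checkpoints may receive the mask as extra input channels.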
Adobe Firefly's implementation adds one more layer: the model was trained exclusively on licensed Adobe Stock images, openly licensed content, and public domain material, a deliberate choice made in response to ongoing litigation against other AI image providers. This training data decision shapes the aesthetic and capability range of Firefly's fills.
The shape, feathering, and placement of your selection mask directly influence output quality. Diffusion inpainting models use the mask boundary to infer what should grow inward. A hard, pixel-precise mask on a complex edge (hair, foliage, transparent glass) gives the model contradictory structural signals. Slightly feathering or expanding the mask by 2 to 4 pixels almost always yields better edge blending, because the model needs a transition zone to interpolate between preserved and generated content.
Generative fill excels at removing objects against consistent backgrounds, extending skies, and adding generic environmental elements. It struggles with highly specific subject identity (a particular person's face, a trademarked logo, a precise architectural detail), with complex transparency, and with regions where physical consistency matters critically: reflections in water, cast shadows from a removed object, or glass refractions. Knowing this shapes the decision of whether AI fill is the right tool or whether manual compositing will be required.
In this lab you'll talk through the technical underpinnings of generative fill with an AI tutor. Ask about how masks are processed, how the VAE works in context, why text prompts guide fill direction, and where the algorithm tends to fail. The goal is to move from surface-level familiarity to genuine conceptual understanding.
In 2023, National Geographic publicly reaffirmed its longstanding ban on AI-generated or AI-altered content in its editorial photography, explicitly calling out generative fill as a tool that violates the documentary integrity of its images. The ban underscored something photographers often miss: every masking decision is also an editorial decision. What you choose to include, exclude, or replace is not a neutral technical choice; it is authorship. Developing masking fluency means understanding not just the how, but the when.
Photoshop's AI-backed selection tools have improved dramatically since the introduction of Select Subject (2018, powered by Adobe Sensei) and the dedicated Object Selection Tool (2019). In the context of generative fill, four modes are most relevant:
1. Object Selection Tool (Rectangular/Lasso refinement): Adobe Sensei analyses the image and identifies distinct objects. You click or drag over a region, the model detects the object boundary, and returns a pixel-level mask. In Photoshop 2022 (version 23.0), Adobe upgraded this with a hover mode in which objects highlight as you move your cursor, before any click. This works well for well-defined foreground objects against clean backgrounds.
2. Select Subject: A one-click operation that masks the dominant subject of the entire image. Useful for portraits and product shots, less reliable for group scenes or images with multiple competing subjects. Internally it uses the same Sensei segmentation model as Object Selection.
3. Select and Mask Workspace with AI Refinement: Accessed via the Refine Edge / Select and Mask dialog, this mode allows manual correction with the Refine Edge Brush, which is specifically tuned for hair, fur, and foliage. It uses a separate edge-aware model to detect fine strands and semi-transparent edges that solid selections miss.
4. Quick Mask + Manual Paint: The oldest method, fully manual. You paint a mask directly in Quick Mask mode using brushes. Total control, highest time cost. Remains indispensable for complex compositional edits where automated selection repeatedly misfires.
Once a selection is made, three parameters directly affect generative fill quality: expansion, feathering, and smoothing. The Modify > Expand command (or the Expand field in Select and Mask) grows the selection outward. This is critically important when filling in a removed object, because you want to include the shadow fringe and edge pixels that belong to the removed element. Leaving those edge pixels unmasked causes a ghosting halo in the final output.
Feathering adds a soft gradient to the mask edge: the selection transitions from 100% to 0% across a specified pixel radius. For most generative fills, a feather of 2 to 5 pixels is appropriate for mid-sized objects. Very large objects on complex backgrounds may benefit from 8 to 12 pixels of feather. Zero feathering produces a hard cut and almost always creates visible seaming.
Smoothing reduces the jaggedness of selection edges, which is particularly important when using lasso tools on curved surfaces. A smoothing value of 2 to 4 pixels prevents the serrated edge artifact that appears when hard jagged masks are used with diffusion models, which generate content at a continuous resolution.
When removing a physical object from a scene, its shadow remains in the image after masking just the object itself. Generative fill cannot logically infer that a shadow belongs to a now-absent object; it will attempt to preserve and integrate it as meaningful scene content. Always extend your mask to include the object's cast shadow. If the shadow falls across a complex surface (carpet texture, pavement cracks, a second subject), mask it with Quick Mask mode for precision, because Object Selection will rarely detect a shadow as belonging to the same masking group as the object.
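To make expansion and feathering concrete, here is a minimal NumPy sketch of both operations on a binary mask. It is an approximation for illustration only: `expand_mask` and `feather_mask` are hypothetical helper names, the expansion is a simple 4-connected dilation, and the feather is a box blur, which is not necessarily the falloff Photoshop itself applies.

```python
import numpy as np

def expand_mask(mask, pixels):
    """Grow a binary mask outward by `pixels` using 4-connected dilation."""
    grown = mask.astype(bool)
    for _ in range(pixels):
        padded = np.pad(grown, 1, mode="constant", constant_values=False)
        grown = (padded[1:-1, 1:-1] | padded[:-2, 1:-1] | padded[2:, 1:-1]
                 | padded[1:-1, :-2] | padded[1:-1, 2:])
    return grown

def feather_mask(mask, radius):
    """Soften mask edges with a separable box blur; returns floats in [0, 1]."""
    soft = mask.astype(float)
    if radius <= 0:
        return soft
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)

    def blur_1d(line):
        # Edge padding keeps the output the same length as the input.
        return np.convolve(np.pad(line, radius, mode="edge"), kernel, mode="valid")

    for axis in (0, 1):
        soft = np.apply_along_axis(blur_1d, axis, soft)
    return soft

mask = np.zeros((21, 21), dtype=bool)
mask[8:13, 8:13] = True              # a 5x5 "object" selection
expanded = expand_mask(mask, 3)      # include the edge fringe around the object
soft = feather_mask(expanded, 3)     # 100% inside, fading to 0% outside
```

The feathered mask stays at full strength in the interior and fades to zero outside the expanded boundary, which is the transition zone the fill needs for blending.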
Generative fill models do not randomly invent content. They condition on a context window drawn from the pixels surrounding the masked region. In Stable Diffusion's inpainting pipeline (used by many third-party tools), this context is drawn from the full image resized to 512×512 or 1024×1024 depending on model version. Adobe Firefly processes context differently, operating at higher native resolution with its own architecture.
Practically, this means the spatial relationship between your mask and surrounding content matters. If you mask the center of a plain blue sky, the model sees predominantly sky-blue context and will fill with contextually appropriate sky. If you mask a region that spans a horizon line (half sky, half water), the model must simultaneously satisfy two different surface types and may produce inconsistent results or visible fill seams. In such cases, splitting the mask into two separate fills, one above and one below the horizon, produces significantly better results.
Text prompts, when provided, add a third input dimension: semantic guidance. "Cobblestone road" tells the model what the fill region should depict, and it will attempt to reconcile that description with the structural context. Leaving the prompt blank instructs the model to generate purely from surrounding context, which is useful for object removal on consistent backgrounds but risky on complex scenes where you want specific replacement content.
Always work on a duplicate layer or a Smart Object before applying generative fill in Photoshop. Generative Fill in Photoshop 2024 non-destructively creates a new "Generative Layer" automatically, but in earlier versions and in third-party tools this protection is not guaranteed. The fill is not easily undone once baked into a flattened layer, and you will often want to compare multiple fill variations. Photoshop's Generative Fill generates three variations by default for exactly this reason.
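The horizon-splitting tactic is simple enough to express directly. The sketch below assumes the mask is a NumPy array and that you already know the horizon's pixel row; `split_mask_at_row` is a hypothetical helper, and a tilted or curved horizon would need a per-column boundary instead.

```python
import numpy as np

def split_mask_at_row(mask, horizon_row):
    """Split one mask into an above-horizon mask and a below-horizon mask."""
    above = mask.copy()
    above[horizon_row:, :] = 0   # keep only the sky portion
    below = mask.copy()
    below[:horizon_row, :] = 0   # keep only the water/ground portion
    return above, below

mask = np.zeros((10, 10), dtype=int)
mask[2:8, 3:7] = 1                       # a selection straddling the horizon
sky_mask, water_mask = split_mask_at_row(mask, horizon_row=5)
```

Each half can then be filled in its own pass, so the model only has to satisfy one surface type at a time.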
Present specific masking challenges to the AI tutor and work through the optimal selection strategy. Describe the scene (subject, background complexity, edge type), what you've tried, and what went wrong. The tutor will walk through which Photoshop masking tool to use, what mask parameters to set, and how to sequence multiple fills when a single fill fails. Shadows, reflections, hair edges, and transparent objects are all fair game.
In April 2023, World Press Photo withdrew a prize from Spanish photographer Álvaro Moriño after an investigation found that AI generation, specifically inpainting-style content synthesis, had materially altered the image. The photograph of a flock of flamingos had areas of background sky that investigators concluded were AI-extended. The prompt that generated those pixels left no metadata trace. It was the invisible prompt, the one nobody sees, that undid the image's credibility.
This case established a precedent: in photojournalism, provenance of every pixel now matters. But in commercial and creative photography, the prompt is simply the most powerful creative control available to the artist. Knowing how to write it well is an essential skill.
When you provide a text prompt to generative fill, it is encoded by a text encoder (typically CLIP, short for Contrastive Language-Image Pre-training, or an alternative such as OpenCLIP or T5) into a vector embedding. That embedding is passed as conditioning information into the cross-attention layers of the denoising U-Net at every denoising step. The model uses it to steer the direction of denoising toward images that are semantically consistent with the prompt.
Critically, the prompt competes with the structural context from surrounding pixels. If you mask out a person in front of a fireplace and write "brick wall," the model will attempt to satisfy both the brick wall prompt and the spatial cues from the fireplace surround, mantelpiece, and ambient orange light. The result may be a brick wall that is unnaturally warm and irregularly shaped: the structural context wins on local spatial consistency, the prompt wins on surface texture and subject matter. Understanding this competition is key to writing prompts that work with context, not against it.
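The cross-attention step mentioned above can be sketched with plain NumPy. This is a single-head, untrained illustration under stated assumptions: the projection matrices are random rather than learned, the token counts and dimensions are arbitrary, and `cross_attention` is a hypothetical function name rather than any library's API. It shows only the data flow: queries come from the image latents, while keys and values come from the text embedding.

```python
import numpy as np

def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens attend over text tokens."""
    q = image_tokens @ w_q                        # (n_img, d) queries from latents
    k = text_tokens @ w_k                         # (n_txt, d) keys from the prompt
    v = text_tokens @ w_v                         # (n_txt, d) values from the prompt
    scores = q @ k.T / np.sqrt(q.shape[-1])       # scaled dot-product similarity
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
image_tokens = rng.standard_normal((16, 32))   # 16 latent positions, 32 features (assumed)
text_tokens = rng.standard_normal((5, 24))     # 5 prompt tokens, 24 features (assumed)
w_q = rng.standard_normal((32, 8))
w_k = rng.standard_normal((24, 8))
w_v = rng.standard_normal((24, 8))
attended, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
```

Each row of `weights` says how strongly one latent position listens to each prompt token, which is how a phrase like "brick wall" can influence some regions of the fill more than others.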
Empty prompts (no text) tell the model to rely entirely on context. This is ideal for simple background removal (grass, sky, water, plain walls) and often worse for anything requiring specific replacement content. The choice between empty and text-prompted fill is itself a craft decision.
Effective fill prompts follow a different grammar than text-to-image generation prompts. In full image generation, you describe a complete scene. In fill, you describe only the region being generated, and that region must integrate with a surrounding scene it cannot change. This shifts prompt strategy significantly:
Surface-first language: Begin with the physical surface or material, not the mood. "Weathered concrete wall with horizontal crack lines" is more useful than "gritty urban backdrop" because the model needs to place a texture, not interpret an aesthetic.
Lighting continuity hints: If the surrounding image is lit from the left with warm afternoon sun, include "warm side lighting from the left" in your prompt. The model has no explicit knowledge of the existing light direction. It can only infer it from pixel gradients in the surrounding context, so a prompt that reinforces this inference produces more consistent shadow and highlight directionality in the fill.
Avoid over-specification on details that must match context: Don't specify "red brick" if the surrounding wall is orange-toned, because the model will attempt to generate genuinely red bricks that clash with the existing color environment. Allow color to be controlled by context; use prompts to control structure and surface type.
Negative-space description: Firefly and some other tools accept both positive and negative guidance. "No people, no text, no signage" removes common fill artifacts in urban scenes. In Stable Diffusion, negative prompts are an explicit parameter; in Photoshop's Generative Fill, the prompt field accepts natural-language phrasing that includes exclusions.
Object removal on grass: Leave blank or use "grass lawn, natural daylight" (context is usually sufficient).
Extending architectural background: "Continuation of [material], matching perspective, no new objects" (explicitly signals structural continuation).
Adding environmental elements: "Low morning fog drifting through pine trees, soft diffused light" (specific, physical, lighting-aware).
Removing distracting signage: "Bare painted wall, flat surface, slight ambient shadow" (tells the model what IS there, not just what isn't).
Adobe Photoshop's Generative Fill generates three variations by default per prompt, accessible via the Properties panel. Before accepting any fill, examine all three, since they often differ significantly in texture, object placement, and boundary integration. The best strategy is to generate once, review all three, then generate again with a refined prompt if none are acceptable. Do not simply accept the first variation.
When variations are consistently unsatisfactory in the same way (always too dark, always placing an unwanted object near the edge, always producing a seam at the bottom), the problem is usually the mask, not the prompt. Expand, feather, or shift the mask boundaries and regenerate before changing the prompt text. Mask geometry and prompt text address different failure modes, and diagnosing which is causing the problem is a key professional skill.
Some practitioners keep a prompt log: a simple text document recording which prompt/mask combinations produced acceptable results for common fill scenarios (sky extensions, foliage gaps, interior wall removal, crowd thinning). Because generative fill uses stochastic sampling, you cannot reproduce an exact output, but you can reproduce the conditions that reliably produce acceptable outputs.
Adobe Firefly embeds Content Credentials (C2PA standard) metadata into images when generative fill is used. This includes a machine-readable record that AI content was added and a cryptographic signature. Export to JPEG or PNG preserves this metadata by default in Photoshop 2024. Some publications, including the Associated Press, which published updated AI guidelines in August 2023, now require photographers to declare any AI-assisted alterations. Understanding that your prompt choices create a traceable record changes the ethical calculus of how you use the tool.
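A prompt log does not need special software. As one possible shape for it, the sketch below stores each entry as a line of JSON; the field names (`scenario`, `prompt`, `expand_px`, `feather_px`, `notes`) are illustrative assumptions, not a standard format, and the in-memory `io.StringIO` stands in for a real text file opened in append mode.

```python
import io
import json

def make_entry(scenario, prompt, expand_px, feather_px, notes=""):
    """Bundle the conditions that produced an acceptable fill."""
    return {"scenario": scenario, "prompt": prompt,
            "expand_px": expand_px, "feather_px": feather_px, "notes": notes}

def append_entry(stream, entry):
    """Write one log entry as a single JSON line."""
    stream.write(json.dumps(entry) + "\n")

def find_entries(stream, scenario):
    """Return every logged entry for a given scenario."""
    stream.seek(0)
    entries = [json.loads(line) for line in stream if line.strip()]
    return [entry for entry in entries if entry["scenario"] == scenario]

log = io.StringIO()   # stand-in for open("prompt_log.jsonl", "a+")
append_entry(log, make_entry("sky extension", "scattered cirrus clouds, warm side lighting", 4, 8))
append_entry(log, make_entry("object removal on grass", "", 3, 2, notes="blank prompt worked"))
sky_recipes = find_entries(log, "sky extension")
```

Recording mask settings alongside the prompt matters because, as noted above, the mask is often the real cause of a repeated failure.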
Describe specific fill scenarios to the AI tutor and workshop your prompt text together. The tutor will evaluate your drafts, explain why certain phrasings work or fail, suggest specific wording improvements, and help you build a repertoire of reliable prompt patterns for common photographic scenarios. Bring real problems: sky extensions, architecture removals, crowd thinning, product background cleanup.
When Adobe launched automated Sky Replacement in Photoshop 2021 (version 22.0), it relied on machine-learning sky segmentation and compositing rather than generative AI: the sky was detected and masked automatically, a replacement sky image was composited in, and lighting effects were applied to the foreground to simulate the color cast of the new sky. It was technically impressive but demonstrably artificial: if the replacement sky's cloud formations were lit from the right and the foreground was lit from the left, no automatic correction could reconcile the contradiction.
By 2023, generative outpainting offered a different approach: rather than replacing a detected region with a library image, the model synthesized a continuation of the existing sky, extending whatever real sky existed outward. The tool worked with the photograph's actual lighting conditions rather than against an imported image's incompatible lighting. For the first time, expanding a frame felt like photography rather than compositing.
Outpainting in Photoshop is accessed by expanding the canvas beyond the image boundary (Image > Canvas Size) and then selecting the empty area for generative fill. The model conditions on the edge pixels of the existing image and extends them outward. This can be used to change an image's aspect ratio (expanding a 3:2 image to 16:9 for widescreen use without cropping), to add visual breathing room around a tightly composed shot, or to recover a poorly framed image where a critical element is partially cut off at the edge.
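The canvas size needed for an aspect-ratio change is straightforward arithmetic. The helper below is a small sketch (the name `canvas_for_aspect` is hypothetical) that only ever adds canvas, never crops, and rounds up to whole pixels; the 6000×4000 example dimensions are an assumed 3:2 frame.

```python
def canvas_for_aspect(width, height, target_w, target_h):
    """Return the (width, height) canvas that reaches target_w:target_h by adding pixels only."""
    if width * target_h < height * target_w:
        # Image is too narrow for the target ratio: widen it (integer ceiling).
        return -(-height * target_w // target_h), height
    # Image is too wide (or already matches): add height instead.
    return width, -(-width * target_h // target_w)

new_width, new_height = canvas_for_aspect(6000, 4000, 16, 9)   # 3:2 frame to 16:9
extra_per_side = (new_width - 6000) // 2                       # canvas to fill on each side
```

For this example the canvas grows from 6000 to 7112 pixels wide, so roughly 556 pixels per side must be generated if the original stays centered.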
Key constraint: outpainting performs well when edges are consistent in texture and structure (plain sky, uniform wall, flat ground). It struggles when edges contain complex, partially visible elements: a partially cropped face at the image edge, a building whose structural geometry is cut mid-window, or a waterfall whose direction implies a continuation not compatible with the edge geometry. In these cases, the model must invent the continuation of a structure it cannot see the full context of, and the result is often geometrically inconsistent.
The practical workflow for frame expansion involves multiple incremental outpainting passes rather than one large expansion. Expanding 15 to 20% per pass, then using that new content as context for the next pass, produces significantly better results than attempting a 100% expansion in a single step. The incremental approach ensures each new pass has context from the previous generation rather than only from the original image edge.
Sky replacement using generative fill differs from Photoshop's dedicated Sky Replacement tool in a fundamental way: instead of compositing a separate image, you mask the existing sky and prompt the model to generate a new one in its place. This means the generated sky is synthesized to match the color temperature, horizon color gradient, and structural boundary of the specific image. It emerges from the photograph's context rather than being imported into it.
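The incremental schedule can be planned ahead of time. The sketch below assumes the 15 to 20% figure refers to growth of the expanded dimension per pass, which is one reasonable reading of the guideline rather than a documented rule; `plan_outpaint_passes` is a hypothetical helper that uses integer arithmetic to avoid rounding surprises.

```python
def plan_outpaint_passes(start_px, target_px, step_percent=20):
    """List the canvas size after each pass when growing by step_percent per pass."""
    if start_px <= 0 or step_percent <= 0:
        raise ValueError("start_px and step_percent must be positive")
    sizes = []
    current = start_px
    while current < target_px:
        grown = -(-current * (100 + step_percent) // 100)   # integer ceiling of the growth
        current = min(target_px, grown)                      # never overshoot the target
        sizes.append(current)
    return sizes

passes_20 = plan_outpaint_passes(4000, 8000, step_percent=20)
passes_15 = plan_outpaint_passes(4000, 8000, step_percent=15)
```

Doubling a 4000-pixel edge takes four passes at 20% (4800, 5760, 6912, 8000) and five passes at 15%, which gives a feel for the time cost of the incremental approach.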
Effective sky replacement via generative fill requires a precise sky mask. The boundary between sky and non-sky content is often the most complex edge in landscape photography: tree branches, power lines, rooftop details, and atmospheric haze all create semi-transparent transitions. Photoshop's Select Sky command (introduced in Photoshop 2021) uses a dedicated segmentation model for sky detection and is more reliable than general-purpose Object Selection for this task. After auto-selection, refinement with the Refine Edge Brush at tree canopy boundaries is nearly always necessary.
Prompt recommendations for sky fills: specify time of day, weather condition, and cloud type specifically. "Golden hour, scattered cirrus clouds, deep blue zenith" produces dramatically different results than the generic "dramatic sky." Because the model generates sky conditioned on the horizon line of the existing image, dramatic lighting in the prompt will be synthesized to match the horizon gradient already present, a technically elegant behavior that makes sky replacement far more convincing than traditional compositing methods.
One of the most commercially valuable applications of inpainting in editorial and event photography is crowd thinning: selectively masking individual people from a scene to create a cleaner, less cluttered image. The technique requires masking each person individually (not in a group mask) so the model can fill each gap with contextually appropriate background. Group masking of multiple people in a complex scene forces the model to invent too large a region at once, producing visible tiling artifacts. Process one person per fill operation, starting with those furthest from the camera (smallest in frame), and work toward the foreground. This ensures each subsequent fill has more realistic context from the previous fills already in place.
The Associated Press updated its AI policy in August 2023 to explicitly prohibit altering the editorial content of news photographs using AI tools, including "the addition, alteration, or removal of content within the frame." The same policy permits AI tools for adjusting overall technical quality (noise reduction, sharpening, color correction) in ways that do not alter what the image depicts.
In commercial photography (advertising, product photography, real estate, stock), the constraints are reversed. Clients routinely commission generative fill to remove distracting elements, extend scenes to fit new aspect ratios, replace overcast skies with blue skies for real estate listings, and add seasonal elements to lifestyle imagery. The ethical question in commercial contexts is not documentary integrity (there is no documentary claim) but disclosure: consumers seeing a real estate listing with an AI-generated blue sky replacing a real overcast sky are receiving implicitly misleading information about typical weather conditions. Several countries are developing advertising standards guidance on AI-altered commercial imagery.
Professional photographers navigating this landscape need to operate under different standards simultaneously depending on client type and use context. A photojournalist and a product photographer may use identical tools under completely different ethical frameworks, and understanding which framework applies is as important as mastering the tools themselves.
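The far-to-near ordering can be approximated by sorting the per-person masks by area, on the assumption stated above that people furthest from the camera are the smallest in frame. The names and mask sizes below are made up for illustration, and `fill_order` is a hypothetical helper.

```python
import numpy as np

def fill_order(person_masks):
    """Order person names from smallest mask area (furthest away) to largest."""
    return sorted(person_masks, key=lambda name: int(person_masks[name].sum()))

def rect_mask(shape, top, left, height, width):
    """Build a rectangular boolean mask as a stand-in for a real selection."""
    mask = np.zeros(shape, dtype=bool)
    mask[top:top + height, left:left + width] = True
    return mask

frame = (100, 150)
person_masks = {
    "foreground walker": rect_mask(frame, 40, 10, 60, 25),
    "distant figure": rect_mask(frame, 45, 120, 8, 3),
    "midground couple": rect_mask(frame, 42, 70, 30, 18),
}
order = fill_order(person_masks)   # run one fill per name, in this order
```

Area is only a proxy for distance; a small child in the foreground would be sorted early, so the computed order is worth a quick manual check.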
1. Diagnose whether fill or manual compositing is the right tool for this edit.
2. Duplicate the layer or ensure you're working non-destructively.
3. Create the selection using the most appropriate Photoshop tool for the edge type.
4. Expand the selection by 2 to 4 px, then apply appropriate feathering and smoothing.
5. Include cast shadows and reflections in the mask.
6. Write a context-aware prompt: surface-first, lighting-consistent, color-agnostic.
7. Generate three variations; evaluate all before accepting.
8. If all variations share the same flaw, modify the mask, not the prompt.
9. Document your prompt for future reference.
10. Check Content Credentials metadata and declare AI use per client or publication requirements.
Bring your most complex generative fill challenges to this lab. You might be planning a sky replacement for a real estate shoot, figuring out how to extend a portrait for a new aspect ratio, or working through the ethics of AI alteration for a specific client context. The AI tutor can walk you through step-by-step workflows, compare tool choices (Sky Replacement vs. generative fill, Photoshop vs. Lightroom Denoise), and discuss the ethical and disclosure considerations that apply to your specific use context.