Module 4 · Lesson 1

The Training Data Problem

How copyrighted text enters AI systems — and why the law hasn't caught up

If an AI reads every book ever written, does it own none of them — or all of them?

In the summer of 2023, the Authors Guild organized a letter signed by more than 10,000 writers — including Nora Roberts, John Grisham, and George R.R. Martin — demanding that AI companies compensate authors whose books had been used to train large language models. The letter was not a lawsuit. It was a warning shot. Within months, several of those same authors did file suit, and the legal battles that followed would reshape how the entire AI industry thought about copyright.

What "Training" Actually Means

When an AI language model is trained, it ingests enormous quantities of text scraped from the web, digitized books, academic papers, code repositories, and more. GPT-3, for instance, was trained on roughly 300 billion tokens (words, punctuation marks, and word fragments); OpenAI has not published a comparable figure for GPT-4. The model does not store these texts the way a hard drive stores a file. Instead, it adjusts billions of numerical parameters, called weights, so that it becomes better at predicting what comes next in a sequence of words.
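To make "adjusting numerical parameters to predict what comes next" concrete, here is a minimal sketch of a toy bigram model trained by gradient descent in plain Python. The corpus, learning rate, and function names are illustrative assumptions; real language models use far larger networks and datasets, but the principle of nudging numbers toward better predictions is the same.

```python
import math

def train_bigram(tokens, epochs=200, lr=0.5):
    """Learn a table of logits W[prev][next] by gradient descent on cross-entropy."""
    vocab = sorted(set(tokens))
    idx = {t: i for i, t in enumerate(vocab)}
    n = len(vocab)
    W = [[0.0] * n for _ in range(n)]  # the model's parameters: only numbers
    pairs = [(idx[a], idx[b]) for a, b in zip(tokens, tokens[1:])]
    for _ in range(epochs):
        for prev, nxt in pairs:
            row = W[prev]
            m = max(row)
            exps = [math.exp(v - m) for v in row]  # numerically stable softmax
            total = sum(exps)
            probs = [e / total for e in exps]
            # Gradient of cross-entropy w.r.t. the logits is probs - one_hot(nxt)
            for j in range(n):
                row[j] -= lr * (probs[j] - (1.0 if j == nxt else 0.0))
    return vocab, W

def predict_next(vocab, W, prev):
    """Return the token with the highest learned logit after `prev`."""
    row = W[vocab.index(prev)]
    return vocab[max(range(len(row)), key=row.__getitem__)]

tokens = "the cat sat on the mat and the cat slept".split()
vocab, W = train_bigram(tokens)
print(predict_next(vocab, W, "the"))
```

After training, the table `W` holds only floating-point numbers, yet it has absorbed the fact that "cat" follows "the" more often than "mat" does in this tiny corpus.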

The distinction matters legally: a trained model does not store books as files that can simply be opened and read. What lives there is a statistical residue, patterns compressed into numbers, although models can sometimes reproduce passages they saw during training. This is why AI companies have consistently argued that training is "transformative use" under copyright law, similar to how a human author reads thousands of novels before writing their own. Critics counter that the scale is incomparable and the commercial benefit unmistakable.

Documented Case — The Books3 Dataset

In 2023, journalist Alex Reisner and researcher Anna Ridler helped surface the contents of Books3, a dataset of approximately 196,640 entire books scraped from a piracy site called Bibliotik. Books3 was used to train models including Meta's LLaMA and several early versions of other large language models. Authors could search for their own titles and find them listed. Comedian Sarah Silverman and novelists Christopher Golden and Richard Kadrey were among those who filed suit against Meta specifically because of Books3.

The Scraping Pipeline

Most training data for large language models passes through several stages before it reaches a model. A web crawler — Common Crawl is the most widely used — continuously downloads pages from across the internet. That raw data is then filtered, deduplicated, and sometimes augmented with higher-quality sources like Wikipedia, academic papers, or licensed datasets.

The problem is that the web contains copyrighted material at extraordinary scale. News articles, song lyrics, screenplay excerpts, forum posts containing quoted prose, fan-fiction sites reproducing published passages — all of it enters the pipeline unless actively filtered out. Most early pipelines had no systematic mechanism for filtering by copyright status.
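The filter-and-deduplicate stage described above can be sketched roughly as follows. This is a simplified illustration, not the code of any real pipeline: the word-count threshold, the normalization rule, and the function names are all assumptions.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())

def clean_corpus(pages, min_words=5):
    """Drop very short pages, then drop exact duplicates by content hash."""
    seen = set()
    kept = []
    for page in pages:
        norm = normalize(page)
        if len(norm.split()) < min_words:
            continue  # quality filter: too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # deduplication: identical content already kept
        seen.add(digest)
        kept.append(page)
    return kept

pages = [
    "Hello world",
    "A long page about cats and dogs here",
    "a long  page about cats and dogs HERE",
    "Another distinct page with enough words inside",
]
print(clean_corpus(pages))
```

Notice what the sketch does not do: nothing in it inspects who wrote a page or whether it is licensed. A filter for length and duplicates passes a pirated novel as readily as a public-domain one.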

OpenAI's GPT-3 paper (2020) disclosed that training data included a filtered version of Common Crawl, WebText2 (web pages linked from Reddit), two book corpora of undisclosed origin labeled "Books1" and "Books2," and Wikipedia. The company has not publicly disclosed the full contents of those book datasets. In the 2023 New York Times lawsuit against OpenAI and Microsoft, the Times presented evidence that GPT-4 could reproduce near-verbatim passages from Times articles, an argument that the model had "memorized" specific text.

Scale of the Problem

The Pile, a popular open-source training dataset, contained over 800GB of text drawn from sources including GitHub, PubMed, FreeLaw, DM Mathematics, and Books3, assembled without licensing individual items. Many model builders used it without scrutinizing its contents.

The Memorization Effect

Researchers at Google, DeepMind, and universities have shown that models can "memorize" training data — reproducing exact or near-exact sequences when prompted. The probability rises with duplication: text that appears many times in training is more likely to be reproduced verbatim.
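The link between duplication and verbatim reproduction can be illustrated with a toy count-based model. This is a deliberately simplified sketch: real language models are neural networks, not lookup tables, and the documents and function names below are made up for the example. The passage is the public-domain opening of Moby-Dick.

```python
from collections import Counter, defaultdict

def train_counts(docs, n=2):
    """Count which token follows each n-token context across all documents."""
    table = defaultdict(Counter)
    for doc in docs:
        toks = doc.split()
        for i in range(len(toks) - n):
            table[tuple(toks[i:i + n])][toks[i + n]] += 1
    return table

def greedy_continue(table, prompt, length, n=2):
    """Extend the prompt by always choosing the most frequent next token."""
    out = prompt.split()
    for _ in range(length):
        options = table.get(tuple(out[-n:]))
        if not options:
            break
        out.append(options.most_common(1)[0][0])
    return " ".join(out)

passage = "call me ishmael some years ago never mind how long"
other = [
    "call me maybe later tonight",
    "call me when you arrive",
    "call me when you land",
]

once = train_counts(other + [passage])        # passage seen one time
many = train_counts(other + [passage] * 10)   # passage duplicated ten times

print(greedy_continue(once, "call me", 3))
print(greedy_continue(many, "call me", 3))
```

When the passage appears once, other continuations of "call me" outnumber it and the model wanders elsewhere. When it appears ten times, the same prompt pulls the passage back out word for word.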

Why This Matters for Your Creative Work

When you use an AI tool to help write a story, design a logo, or compose a song, the outputs are shaped by everything that went into training the model. If the model was trained on copyrighted work without authorization, any resemblance between its output and that source material — even unintentional — could implicate you as the person publishing the result.

This is not hypothetical. In 2023, comedian and writer Katy O'Brian discovered that an AI image generator had produced work almost identical to a specific illustrator's style. The illustrator — Karla Ortiz — was one of three artists who filed suit against Stability AI, Midjourney, and DeviantArt in January 2023, arguing that their styles had been absorbed into systems used commercially without consent or compensation.

Understanding the pipeline — where training data comes from, how it's filtered, and where it isn't — is the first step in using AI responsibly. In the next lesson, we look at what happens when AI outputs too closely echo a specific source.

Key Takeaway

AI models are trained on vast text corpora that frequently include copyrighted material. The legal status of this practice is actively contested. As a creator using AI tools, you bear some responsibility for understanding whether the tool you're using has addressed training-data rights — and for reviewing outputs for unintended similarity to existing protected works.

Training data: The corpus of text, images, or other data used to adjust a model's parameters during the learning process, distinct from the model itself.

Memorization: A documented phenomenon where AI models reproduce verbatim or near-verbatim sequences from training data, particularly for text that appeared many times in the corpus.

Transformative use: A doctrine in U.S. copyright law that may permit use of copyrighted material if the new work adds new meaning, expression, or message. It is a central consideration under the first of the four factors in fair use analysis (the purpose and character of the use).

Lesson 1 Quiz

The Training Data Problem · 4 questions

1. The Books3 dataset, used to train several large language models, obtained its books from what source?

Correct. Books3 contained ~196,640 books scraped from Bibliotik, a piracy site. It was used to train Meta's LLaMA among other models, and led directly to lawsuits by authors including Sarah Silverman.

Not quite. Books3 was sourced from Bibliotik, a piracy site — not through any licensed or legitimate channel.

2. Why do AI companies argue that training on copyrighted text is legal "transformative use"?

Correct. The core argument is that training converts text into numerical parameters — no readable copy is stored — making it analogous to a human reading books before writing their own, and potentially transformative under copyright law.

That's not the argument. The transformative-use claim rests on the idea that training produces statistical weights, not stored copies of the original text.

3. Which of these accurately describes the "memorization" phenomenon in large language models?

Correct. Research from Google, DeepMind, and others has documented that duplication in training data increases the likelihood of verbatim reproduction. The New York Times lawsuit used this phenomenon as a central exhibit.

Memorization is a documented, non-random effect. Text that appears many times in training is statistically more likely to be reproduced verbatim — it has been demonstrated empirically across multiple research teams.

4. What practical implication does training-data provenance have for someone creating content with AI tools?

Correct. Because AI outputs can reflect — and sometimes reproduce — training material, creators who publish AI-assisted work carry some responsibility for reviewing outputs for unintended similarity to protected works.

Responsibility doesn't disappear at the model builder. A person who publishes content that closely resembles a copyrighted work may face legal exposure regardless of whether the similarity was generated by a human or an AI.

Lab 1 — Interrogating Training Data

Explore what AI knows about its own training provenance

Your Task

Ask the AI assistant below about training data, copyright, and how it handles situations where it might reproduce copyrighted content. Try to get specific — ask about particular datasets, about memorization, about what safeguards exist. The goal is to understand the gap between what AI companies say about training data and what is actually disclosed.

Suggested start: "Can you tell me what datasets were used to train you? Do you know if any of them contained copyrighted books or articles without licenses?" — then follow the thread wherever it leads.

AI Lab Assistant

Training Data & Copyright

Hello! I'm here to explore training data and copyright with you. Ask me about datasets, memorization, what gets filtered — or anything else from Lesson 1. What's on your mind?

Module 4 · Lesson 2

When Output Echoes Input

Style, similarity, and the difference between inspiration and infringement

Can AI copy an artist's style? And if it does — whose problem is that?

On January 13, 2023, three visual artists — Sarah Andersen, Kelly McKernan, and Karla Ortiz — filed a class-action lawsuit against Stability AI, Midjourney, and DeviantArt in the Northern District of California. Their complaint argued that these companies had trained image-generation models on five billion images scraped from the internet without consent, including the artists' own work, and now offered a commercial product that could generate images "in the style of" any artist in the training set — on demand.

Karla Ortiz, a fantasy illustrator whose work appeared in Marvel and DC projects, demonstrated that typing her name into Midjourney produced images strikingly similar to her distinctive style. "My entire career," she said, "compressed into a product I never agreed to."

What Copyright Protects — and What It Doesn't

Under U.S. copyright law, style is not protectable. You cannot copyright "the way you paint" — the use of particular brushstrokes, a color palette, a mood. What copyright protects is specific expression: the actual painting, the specific arrangement of words on a page, the exact melody. This is why thousands of writers can write in the style of Raymond Carver without infringing, and why imitating the Impressionist aesthetic is not a legal problem.

The challenge with AI is one of degree and systematization. A human imitating an artist's style needs skill and time, and usually produces work that is recognizably different. An AI system trained on thousands of examples of an artist's work can produce near-identical outputs on demand, at scale, commercially, and without the artist's knowledge. Courts and legal scholars have noted that while this may not technically be copyright infringement, it may be deeply unfair and may support other legal theories: unfair competition, right of publicity, or unjust enrichment.

Documented Case — The New York Times v. OpenAI (2023)

In December 2023, The New York Times filed suit against OpenAI and Microsoft. Central to the complaint was evidence that GPT-4 could reproduce near-verbatim passages from Times articles — sometimes hundreds of words — when prompted. The Times included 100 examples in its filing. OpenAI responded that such reproduction was a "bug" and a "hallucination artifact," not evidence of stored copying. The distinction between memorization and creative hallucination became one of the central technical disputes of the case.

The Diffusion Model Mechanics

Image-generation models like Stable Diffusion and Midjourney work differently from language models, but the copyright questions are analogous. These systems are trained by adding noise to images and learning to reverse that process — effectively learning the statistical structure of image types, styles, and compositions from the training set.
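The "adding noise" half of that training process can be sketched in a few lines. The code below follows the widely used DDPM-style formulation, in which a noisy image is a weighted blend of the clean image and Gaussian noise. The linear schedule, its endpoints, and the function names are assumptions chosen for illustration, not the settings of any particular product.

```python
import math
import random

def alpha_bar(t, steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta) under a linear noise schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(pixels, t, steps, rng):
    """Forward step: blend the clean image with Gaussian noise at timestep t."""
    a = alpha_bar(t, steps)
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
             for x, e in zip(pixels, noise)]
    # During training, a network is asked to predict `noise` given `noisy` and t.
    return noisy, noise

rng = random.Random(0)
image = [0.2, 0.8, -0.5, 1.0]  # stand-in for pixel values
slightly_noisy, _ = add_noise(image, 10, 1000, rng)
mostly_noise, _ = add_noise(image, 1000, 1000, rng)
print(slightly_noisy)
print(mostly_noise)
```

At early timesteps the image is barely disturbed; by the final timestep almost none of the original signal remains. Learning to undo each step across millions of training images is what teaches the model the structure of those images.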

In 2023, researcher Ryan Webster published a paper demonstrating that Stable Diffusion could reproduce training images near-verbatim under certain prompting conditions — not just in style, but in actual pixel-level content. This "data extraction" from an image model became a significant exhibit in the ongoing artist lawsuits.

The LAION-5B dataset, 5.85 billion image-text pairs scraped from the web, was the primary training set for Stable Diffusion. A December 2023 investigation by the Stanford Internet Observatory found that LAION-5B contained links to child sexual abuse material, leading LAION to take the dataset offline temporarily. The incident illustrated that large-scale scraping without human review creates problems far beyond copyright.

Style vs. Expression

Copyright protects specific expression, not general style or technique. But when AI systems are trained to reproduce style on demand at commercial scale, courts are being asked whether existing doctrine is adequate for the technology.

The "In the Style of" Prompt

Midjourney and similar tools explicitly allow "in the style of [artist name]" prompts. Some artists have found their names produce outputs nearly indistinguishable from their own work. Midjourney later restricted some living artist names, but the underlying model was not retrained.

What This Means When You Create

When you use an AI image generator or text model to produce content in a particular style, you are participating in a practice whose legality is unsettled. Practically speaking, outputs that too closely resemble a specific artist's distinctive work — even if produced by AI — could expose you to claims of copying. More importantly, the artists whose work made those outputs possible received nothing.

Some companies have responded to these concerns. Adobe says Firefly was trained only on licensed or public-domain images, and Spawning.ai's "Have I Been Trained" lets artists check whether their work appears in training datasets and register an opt-out. Understanding these distinctions helps you make informed choices about which tools align with your values as a creator.

Key Takeaway

Style cannot be copyrighted, but specific expression can. AI systems trained on artists' work can reproduce style and, sometimes, near-exact outputs — at scale and commercially. The legal framework is actively evolving. As a creator, using AI tools that are transparent about training data and that offer artist protections is both an ethical choice and a practical risk-reduction strategy.

LAION-5B: A dataset of 5.85 billion image-text pairs scraped from the web, used to train Stable Diffusion and other image generation models. Its provenance and content became central to multiple lawsuits and safety investigations.

Data extraction: The ability to recover near-verbatim training data from a trained model through specific prompting, a documented capability that complicates arguments that training is purely transformative.

Opt-out registry: A mechanism allowing creators to request their work be excluded from training datasets. It is reactive rather than proactive, and not universally honored.

Lesson 2 Quiz

When Output Echoes Input · 4 questions

1. In the January 2023 lawsuit by Sarah Andersen, Kelly McKernan, and Karla Ortiz, what was the primary legal complaint against Stability AI and Midjourney?

Correct. The lawsuit centered on training on scraped images without consent and then commercializing a tool that systematically reproduced individual artists' styles — at scale, without compensation.

The complaint focused on training on scraped images without consent and building a commercial product that reproduced individual artists' styles on demand.

2. Under current U.S. copyright law, which of these can be directly protected?

Correct. Copyright protects specific expression — the actual work — not style, technique, genre, or mood. This is why imitating an artist's style has traditionally not been legally actionable.

Copyright protects specific expression, not general style, technique, or genre. The specific painting or exact text is what's protected, not the approach used to create it.

3. What did researcher Ryan Webster's 2023 paper demonstrate about Stable Diffusion?

Correct. Webster demonstrated data extraction from an image model — not just style transfer but near-verbatim pixel-level reproduction — which became a significant exhibit in artist lawsuits against Stability AI.

Webster's paper demonstrated the opposite — that training data could be extracted as near-verbatim images, not just style. This was an important development in the legal cases against image-generation companies.

4. What distinguishes Adobe Firefly from tools like Midjourney in terms of training data ethics?

Correct. Adobe built Firefly explicitly on licensed and public-domain images. This makes it a meaningfully different choice for creators who want to avoid tools whose training process may have infringed on artists' rights.

The key distinction is the provenance of training data. Adobe Firefly was trained on licensed and public-domain images — a deliberate choice to address the consent problem at the source.

Lab 2 — Style, Similarity, and Your Choices

Explore the line between inspiration and infringement with AI

Your Task

Ask the AI assistant below about how you might approach creating in the style of an existing artist — ethically and legally. Explore what "style" means legally, how to be inspired without infringing, and how to evaluate whether an AI tool you want to use has ethical training provenance.

Suggested start: "I want to create illustrations inspired by Moebius's science fiction art style using an AI tool. How do I think through whether this is okay — artistically, legally, and ethically?"

AI Lab Assistant

Style, Similarity & Ethics

Ready to explore the style and similarity questions with you. Ask me about what copyright does and doesn't protect, how to evaluate AI tools ethically, or how to create work inspired by artists you admire without crossing legal or ethical lines.

Module 4 · Lesson 3

Who Owns What AI Makes

If no human wrote it, can anyone own it?

When Stephen Thaler applied to register copyright for an image he said was created autonomously by his "Creativity Machine" AI system, an image titled "A Recent Entrance to Paradise," the U.S. Copyright Office refused, issuing its final denial in 2022. Thaler sued. In August 2023, U.S. District Judge Beryl Howell upheld the Copyright Office's position: copyright requires human authorship. "Human authorship," she wrote, "is a bedrock requirement of copyright." The ruling was direct: under that decision, works generated by AI alone cannot be copyrighted in the United States.

The Human Authorship Requirement

The U.S. Copyright Office has maintained since at least the 1970s that copyright protection requires human creative expression. The doctrine has roots in the Constitution, which grants Congress power to protect "Authors" — a term courts have consistently interpreted to mean human beings. Animals, computers, and nature cannot be authors.

This has significant practical implications. When you type a prompt into an AI image generator and it produces an image, the current U.S. legal position is that the AI's contribution to that image is not copyrightable. If the image requires no significant human creative expression beyond the prompt, it may be in the public domain the moment it's created — usable by anyone, including your competitors.

In February 2023, the Copyright Office partially cancelled a registration it had granted in September 2022 for a graphic novel called "Zarya of the Dawn" by Kristina Kashtanova. The Office kept the registration for the text and the creative arrangement, but withdrew protection from the individual AI-generated images (produced using Midjourney), holding that they lacked sufficient human authorship. The decision established a partial-registration framework: human creative choices are protectable; AI-generated elements are not.

Documented Policy — Copyright Office Guidance (March 2023)

The U.S. Copyright Office issued formal guidance in March 2023 stating it will register works containing AI-generated material only where a human author has exercised "sufficient creative control" over the final expression. Prompts alone are generally insufficient. The Office cited the example of a human who selects, arranges, and modifies AI outputs using their own creative judgment: that arrangement and selection can be protected. The underlying AI-generated content itself cannot.

International Divergence

The United States is not alone in its position, but other jurisdictions are reaching different conclusions. In 2021, South Africa's patent office granted the first patent listing an AI as inventor (DABUS, a system built by Stephen Thaler), a decision that drew international attention but has not been replicated in the U.S. or EU. China's approach has been more flexible: a Beijing court ruled in 2023 that an AI-generated image was protectable under copyright when a human had made "intellectual inputs" in the prompt and output-selection process, establishing a lower threshold for human contribution than U.S. doctrine currently requires.

The EU AI Act (2024) touches on AI-generated content obligations — including labeling requirements and transparency about AI involvement — but leaves core copyright questions to member states, creating a patchwork of rules for creators working across jurisdictions.

What You Can Protect

Your selection and arrangement of AI outputs, your editing, your added text, your creative decisions about which outputs to use and how to combine them — these human creative contributions can qualify for copyright under current U.S. doctrine.

What You Cannot Protect

Pure AI output generated by a prompt, where no significant human creative judgment shapes the final result, is currently unprotectable in the U.S. Anyone can copy it freely. Your prompt itself is also generally not copyrightable as a work of sufficient originality.

What This Means for Your Creative Work

If you want to protect AI-assisted creative work, the current legal framework rewards heavy human involvement. Using AI to generate a draft that you then substantially rewrite gives you much stronger copyright claims than publishing AI output with minimal editing. Curating, selecting, arranging, and combining AI outputs into a larger work — where your creative judgment is evident — also strengthens your position.

From a business perspective, this creates an irony: the more you rely on AI to do the creative work, the less legal protection you have over the result. Heavy AI reliance makes work harder to protect — and potentially puts it in the public domain where competitors can freely copy it.

Key Takeaway

U.S. courts and the Copyright Office have established that AI-only creative output cannot be copyrighted. Human creative contribution — selection, arrangement, substantial editing — is required. This means AI-heavy work may be unprotectable. Internationally, the rules vary. The practical lesson: if protecting your creative work matters, document your human creative decisions and ensure they are substantial, not superficial.
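One way to document creative decisions is a simple timestamped log that records who produced each version of a work and a fingerprint of its text. The sketch below is a hypothetical format, not one endorsed by the Copyright Office; the field names and the `log_entry` helper are assumptions, and whether any particular record would carry weight in a registration or dispute is an open question.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_entry(stage, description, text, source):
    """Build one record of a creative step; `source` is 'human' or 'ai'."""
    if source not in ("human", "ai"):
        raise ValueError("source must be 'human' or 'ai'")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stage": stage,
        "source": source,
        "description": description,
        # The hash identifies this exact version without storing the text itself.
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }

log = [
    log_entry("draft", "AI-generated first draft from my outline",
              "raw draft text", "ai"),
    log_entry("rewrite", "Rewrote the opening, cut two scenes, changed the ending",
              "my revised text", "human"),
]
print(json.dumps(log, indent=2))
```

Keeping the drafts themselves alongside the log lets you show later exactly what changed between the AI's output and your final version.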

Human authorship requirement: The U.S. doctrine that copyright protection requires creative expression by a human being. AI-generated works without sufficient human creative contribution cannot be copyrighted.

Zarya of the Dawn: A graphic novel at the center of a 2023 U.S. Copyright Office decision, in which the Office cancelled protection for the book's AI-generated images while retaining protection for the human-authored text and arrangement.

Sufficient creative control: The standard articulated by the U.S. Copyright Office for when AI-assisted work qualifies for registration, requiring human creative judgment beyond simply entering a prompt.

Lesson 3 Quiz

Who Owns What AI Makes · 4 questions

1. What did Federal Judge Beryl Howell rule in the 2023 case over Stephen Thaler's AI-generated image?

Correct. Judge Howell upheld the Copyright Office's rejection of Thaler's application, ruling that "human authorship is a bedrock requirement of copyright" — AI-only works cannot be registered in the U.S.

Judge Howell's ruling was clear: copyright requires human authorship. The AI company, the AI itself, and the person who runs it do not automatically receive copyright for AI-only creative output under U.S. law.

2. In the Zarya of the Dawn case, what did the Copyright Office ultimately protect?

Correct. The partial-registration framework established in Zarya protects human-authored elements (text, arrangement, selection) but not the AI-generated images themselves — which the Office held lacked sufficient human authorship.

The Copyright Office applied a partial-registration approach: the human-authored text and the creative arrangement are protected, while the AI-generated images (produced by Midjourney) are not.

3. According to the U.S. Copyright Office's March 2023 guidance, when can AI-assisted work qualify for registration?

Correct. The "sufficient creative control" standard means the human must do more than prompt — they must exercise creative judgment in selecting, arranging, or substantially modifying the AI's output.

The standard is qualitative, not quantitative or procedural. The Copyright Office looks for human creative judgment in selection, arrangement, and modification — not just the act of prompting an AI.

4. Why does heavy reliance on AI for creative work create a business problem beyond just aesthetics?

Correct. If AI-heavy work lacks sufficient human authorship to qualify for copyright, it enters the public domain immediately — anyone can copy, sell, or adapt it. This creates a paradox where the more AI does, the less protection the creator has.

The real business problem is that AI-heavy work may be unprotectable from the moment of creation — landing in the public domain and available for free copying by anyone, including competitors.

Lab 3 — Documenting Your Human Authorship

Build a practice for protecting your AI-assisted creative work

Your Task

Ask the AI assistant to help you develop a practical system for documenting human creative choices when working with AI — so that if you ever need to prove copyright ownership of AI-assisted work, you have evidence of your creative decisions. Explore what counts as "sufficient creative control" and how to build it into your workflow.

Suggested start: "I write short fiction and use AI to help generate drafts. How should I document my process so I can demonstrate sufficient human authorship if I ever need to copyright my work?"

AI Lab Assistant

Let's build a solid documentation practice for your AI-assisted creative work. Tell me about what you create and how you currently use AI in the process — I'll help you think through how to demonstrate your human creative contribution clearly.

Module 4 · Lesson 4

Navigating the Gray Zone

Practical frameworks for creators in a world of unsettled law

When the rules aren't clear yet, how do you make decisions you can defend?

In late 2023, the major U.S. film studios and the Writers Guild of America reached a landmark agreement after a 148-day strike. Buried in the contract language was a provision that neither affirmed nor denied AI's role in writing: writers could not be required to use AI, AI-generated text could not be used to lower their compensation floors, and the studios were required to disclose when AI-generated material was provided to writers. The contract did not resolve whether AI-generated scripts were copyrightable. It simply built fences around the most immediate harms while the law caught up.

The Three Questions to Ask About Any AI Tool

Given the unsettled legal landscape, professional creators and organizations have developed pragmatic frameworks for evaluating AI tools. The questions aren't yet resolved by law, but they structure defensible decision-making:

1. What was in the training data?

Did the company disclose its data sources? Were licenses obtained? Does the tool have an opt-out registry for artists? Adobe Firefly (licensed data), Getty AI (licensed data), and tools built on LAION-5B (scraped without consent) represent different answers to this question.

2. What happens to my inputs?

Are your prompts or the content you upload used to further train the model? Are they shared with third parties? Some platforms (notably early versions of Google's Workspace AI and GitHub Copilot) sparked controversy by defaulting to training on user inputs without prominent disclosure.

3. Who owns the output?

Different platforms have different terms. OpenAI's terms (as of 2024) assign output ownership to the user, subject to copyright law. Midjourney's terms are more complex and have changed multiple times. Adobe Firefly offers an "indemnification" promise — if a customer is sued over outputs, Adobe will help defend them.

4. (Bonus) Has the company indemnified users?

Microsoft, Google, and Adobe have all offered some form of copyright indemnification to enterprise customers using their AI tools — agreeing to cover legal costs if customers are sued over AI outputs. This doesn't resolve the legal questions, but it shifts financial risk back to the platform.

The Ongoing Legal Landscape

As of 2024, the following major legal proceedings were active or recently decided:

Getty Images v. Stability AI (Delaware, 2023): Getty alleged that Stability AI copied more than 12 million of its images to train Stable Diffusion. Some generated outputs contained distorted versions of Getty's watermark, which Getty cited as visible evidence that its images had been copied into the training set.

The New York Times v. OpenAI (SDNY, filed December 2023): The Times claimed GPT-4 could reproduce verbatim Times articles, competing directly with the paper's subscription revenue. OpenAI has argued fair use. The case is likely to be the highest-stakes copyright case involving AI language models in the U.S. for years.

Concord Music Group v. Anthropic (filed in Tennessee, October 2023): Music publishers including Universal Music Group, Concord, and ABKCO sued Anthropic for reproducing copyrighted song lyrics in Claude's outputs, an example of the memorization problem applied to highly repetitive, easily identified text. The suit is sometimes reported as "Universal Music Group v. Anthropic"; it is the same case. It was later transferred to the Northern District of California, and as of 2024 no court had ruled on the merits of the infringement claims.

Practical Guidance — Building Your Framework

Until law settles, creators can manage risk by: (1) preferring tools with disclosed, licensed training data; (2) documenting their human creative contributions for any work they intend to protect; (3) reviewing AI outputs for unintended similarity to known works before publishing; (4) understanding platform terms of service regarding output ownership and indemnification; and (5) staying informed as case law develops — the landscape is shifting rapidly.
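Step (3), reviewing outputs for unintended similarity, can be partly automated with a crude overlap check. The sketch below measures what fraction of an output's five-word sequences also appear in a reference text. It is a rough screening aid under stated assumptions (whitespace tokenization, exact matches only, a hypothetical `overlap_ratio` helper), not a legal test for infringement, and it only works for references you already have on hand.

```python
def ngrams(text, n=5):
    """Return the set of n-word sequences in the text, case-insensitive."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(output, reference, n=5):
    """Fraction of the output's n-grams that also appear in the reference."""
    out = ngrams(output, n)
    if not out:
        return 0.0  # output shorter than n words: nothing to compare
    return len(out & ngrams(reference, n)) / len(out)

reference = "it is a truth universally acknowledged that a single man in possession of a good fortune"
copied = "It is a truth universally acknowledged that a single man must be rich"
original = "my neighbor owns three loud dogs and a very patient old cat"

print(overlap_ratio(copied, reference))
print(overlap_ratio(original, reference))
```

A high ratio is a signal to look more closely before publishing; a low ratio does not prove the output is free of borrowed material, since paraphrase and structural copying escape exact matching.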

The Ethical Dimension Beyond Law

Law defines minimums. Many creators and organizations have chosen standards beyond what current law requires. The argument is straightforward: the artists and authors whose work made AI possible had no say in whether their work was used. They received no compensation. The AI systems that scraped their work are now competitive with them in the marketplace.

Choosing to use tools built on licensed data — even when tools built on scraped data are cheaper or more capable — is a stance about what kind of creative ecosystem you want to support. It is also, increasingly, a reputational consideration: brands and publishers are beginning to ask whether their AI-generated content was produced ethically.

In 2024, the Authors Guild launched the AI Registry, a licensing mechanism allowing AI companies to license authors' works for training on an opt-in basis. Platforms including OpenAI have begun exploring licensing deals with publishers. The infrastructure for consent is being built — slowly, imperfectly, but it is being built.

Key Takeaway

The law is unsettled, but the decisions you make as a creator are not arbitrary. Evaluating tools by their training data provenance, understanding output ownership terms, documenting your own creative contributions, and reviewing outputs before publishing are practical steps that reduce both legal and ethical risk. The goal isn't paralysis — it's informed creativity.

Indemnification: A contractual commitment by an AI platform to cover legal costs if a user is sued over AI-generated outputs, offered by Microsoft, Google, and Adobe to enterprise customers as of 2024.

Opt-in licensing: A mechanism by which creators affirmatively grant permission for their work to be used in AI training, in exchange for compensation. It is distinct from opt-out registries, where creators must actively remove themselves from existing datasets.

WGA AI provisions: Contract terms negotiated by the Writers Guild of America in 2023 prohibiting studios from requiring writers to use AI, mandating disclosure of AI-generated materials, and protecting minimum compensation floors against AI-driven reduction.

Lesson 4 Quiz

Navigating the Gray Zone · 4 questions

1. What did the 2023 WGA contract with major studios establish regarding AI?

Correct. The WGA agreement built practical fences around the most immediate harms without resolving copyright questions — protecting writers from being replaced by AI while the law catches up.

The WGA deal protected writers from being forced to use AI, required disclosure of AI materials, and protected compensation floors — it did not resolve copyright, and it did not ban AI outright.

2. The Getty Images lawsuit against Stability AI included which particularly striking piece of evidence?

Correct. Getty's complaint included examples of Stable Diffusion outputs containing distorted versions of Getty's watermark — powerful evidence that training had involved verbatim copying of watermarked images, not just style learning.

The watermark evidence was central: Stable Diffusion sometimes produced images containing distorted Getty watermarks, suggesting the training process had included verbatim copying of those images — not merely statistical style learning.

3. What does "copyright indemnification" mean in the context of AI tools like Adobe Firefly or Microsoft Copilot?

Correct. Indemnification shifts financial risk back to the platform. Adobe, Microsoft, and Google have offered this to enterprise customers — it doesn't resolve the legal questions, but it means a company using their tools won't face legal fees alone if sued.

Indemnification is a financial protection — the company agrees to cover legal defense costs if the user faces a copyright lawsuit over AI outputs. It doesn't create copyright or certify originality.

4. What is the difference between an "opt-out registry" and "opt-in licensing" for AI training data?

Correct. The distinction is fundamental: opt-out assumes scraping is acceptable unless you object; opt-in assumes creators must affirmatively consent. The latter better reflects traditional copyright principles — which is why the Authors Guild's AI Registry uses opt-in rather than opt-out.

The consent structure is reversed in each model. Opt-out registries place the burden on creators to actively remove their work from datasets that already contain it. Opt-in licensing requires AI companies to get permission first — closer to how copyright traditionally works.

Lab 4 — Building Your AI Ethics Checklist

Apply the gray-zone framework to real decisions you face as a creator

Your Task

Use this lab to develop a personal checklist for evaluating AI tools before you use them in creative work. Ask the assistant to help you think through the key questions — training data, input usage, output ownership, indemnification — and apply them to a specific tool or use case you're actually considering.

Suggested start: "Help me build a practical checklist I can use every time I'm about to use a new AI creative tool — covering the legal, ethical, and practical questions I should research before I start using it for real work."

AI Lab Assistant

Ethics Framework & Decision Tools

Let's build a decision framework you can actually use. Tell me what kinds of creative work you do and which AI tools you're currently considering or already using — I'll help you develop a checklist that fits your specific situation.

Module 4 — Module Test

When AI Copies Without Knowing · 15 questions · Pass at 80%

1. The Books3 dataset, used to train several major AI models, was sourced from:

Correct. Books3 scraped ~196,640 books from Bibliotik without authorization and became a central exhibit in lawsuits by Sarah Silverman and others against Meta.

Books3 was sourced from Bibliotik, a piracy site — not any legitimate or licensed source.

2. Why is "memorization" in language models legally significant?

Correct. The ability to reproduce training data verbatim, especially for frequently repeated text, directly challenges the "transformative use" argument that no copies are stored in the model.

Memorization demonstrates that models can output near-verbatim training content — which is central to the New York Times lawsuit and challenges the claim that training is purely transformative.

3. Under U.S. copyright law, which element of creative work is NOT protectable?

Correct. Style, technique, and mood are not protectable — only specific expression is. This is both why style imitation has traditionally been legal and why AI "style copying" is legally complex.

Style cannot be copyrighted. Copyright protects specific expression — the particular words, melody, or image — not the general approach or aesthetic.

4. In the Zarya of the Dawn case, the Copyright Office established that:

Correct. The partial-registration framework protects human creative contribution while denying copyright to the AI-generated elements themselves — a nuanced but practically important distinction.

Zarya established a partial-registration approach: human-authored elements are protectable; the AI-generated images are not. Neither extreme — full protection or full denial — applied.

5. What did researcher Ryan Webster demonstrate about Stable Diffusion in 2023?

Correct. Data extraction from image models — near-verbatim pixel-level reproduction — became an important exhibit in the Getty Images v. Stability AI case and other artist lawsuits.

Webster demonstrated data extraction: the ability to recover near-verbatim training images, not just style — directly relevant to the legal cases against image-generation companies.

6. Federal Judge Beryl Howell's ruling in the Stephen Thaler case (Thaler v. Perlmutter) held that:

Correct. "Human authorship is a bedrock requirement of copyright" — Judge Howell's ruling was unambiguous and upheld the Copyright Office's long-standing position.

The ruling was clear: human authorship is required for copyright. No human involvement = no copyright, regardless of who owns or operates the AI.

7. The LAION-5B dataset, used to train Stable Diffusion, became newsworthy beyond copyright for which additional reason?

Correct. The CSAM finding illustrated that large-scale scraping without human review creates harms far beyond copyright — a powerful example of why dataset curation matters.

The Stanford Internet Observatory's finding about CSAM in LAION-5B showed that the problems with unreviewed scraping extend well beyond copyright infringement.

8. The U.S. Copyright Office's March 2023 guidance stated that AI-assisted work can qualify for registration when:

Correct. The standard is qualitative — human creative judgment in selection, arrangement, or modification — not a simple percentage calculation or procedural requirement.

The "sufficient creative control" standard focuses on the quality and nature of human contribution, not the percentage of AI-generated content or procedural factors.

9. The New York Times lawsuit against OpenAI centered on which technical phenomenon?

Correct. The Times included 100 examples of GPT-4 reproducing near-verbatim article passages in its complaint — making memorization the central technical and legal exhibit.

The core exhibit was near-verbatim reproduction of Times articles: the memorization phenomenon applied to high-value, easily identified journalism.

10. What distinguishes Adobe Firefly from Stable Diffusion in terms of training data ethics?

Correct. Adobe's deliberate choice to train on licensed and public-domain content makes Firefly meaningfully different — and is why Adobe can offer copyright indemnification to Firefly users.

The training data provenance is the key distinction: licensed/public-domain (Firefly) versus web-scraped without consent (Stable Diffusion/LAION-5B).

11. The WGA 2023 agreement with major studios did NOT include which of these provisions?

Correct. The WGA contract did not resolve copyright questions for AI-generated scripts — it protected writers from immediate harms (forced use, compensation reduction) while the legal framework continues to develop.

Copyright status of AI-generated scripts was not resolved by the WGA deal. The contract protected writers from immediate practical harms without addressing the deeper legal questions.

12. In the Concord Music Group v. Anthropic case (filed 2023, with Universal Music Group among the plaintiffs), what was the central allegation?

Correct. Song lyrics are short, repetitive, and highly distinctive — making them particularly susceptible to memorization and verbatim reproduction in AI outputs.

The lawsuit centered on Claude reproducing copyrighted lyrics verbatim — the memorization problem applied to a category of text (song lyrics) that is commercially valuable and easily identified.

13. What is an "opt-in licensing" mechanism for AI training data?

Correct. Opt-in licensing requires affirmative consent before inclusion — the model that most closely resembles traditional copyright principles, and the approach taken by the Authors Guild AI Registry launched in 2024.

Opt-in licensing requires prior permission from creators — consent before use, not notification after. This is meaningfully different from opt-out registries that assume consent unless you actively object.

14. A creator using AI heavily to generate content faces which practical copyright disadvantage compared to a creator who uses AI minimally?

Correct. The heavy AI-reliance paradox: the more AI does, the less legal protection the creator has. Minimal human contribution may mean the work is unprotectable from the moment it's created.

The key disadvantage is unprotectability. Work lacking sufficient human authorship enters the public domain — anyone can copy it. More AI means less protection, not more.

15. Which of the following best describes how to build defensible copyright protection for AI-assisted creative work under current U.S. doctrine?

Correct. Substantial human creative contribution — especially when documented — is the path to copyright protection under current doctrine. Selection, arrangement, editing, and rewriting all strengthen the claim of sufficient human authorship.

Human creative contribution — documented, substantial, and articulable — is what current doctrine requires. Rewriting, arranging, and editing AI outputs with your own creative judgment is the practical path to protection.