In the summer of 2023, the Authors Guild organized a letter signed by more than 10,000 writers — including Nora Roberts, John Grisham, and George R.R. Martin — demanding that AI companies compensate authors whose books had been used to train large language models. The letter was not a lawsuit. It was a warning shot. Within months, several of those same authors did file suit, and the legal battles that followed would reshape how the entire AI industry thought about copyright.
When an AI language model is trained, it ingests enormous quantities of text. GPT-3, for instance, was trained on roughly 300 billion tokens (words, punctuation marks, fragments), and OpenAI has not published comparable figures for GPT-4. Training text of this kind is scraped from the web, digitized books, academic papers, code repositories, and more. The model does not store these texts the way a hard drive stores a file. Instead, it adjusts billions of numerical parameters, called weights, so that it becomes better at predicting what comes next in a sequence of words.
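To make "predicting what comes next" concrete, here is a minimal sketch in Python. It is a far simpler stand-in for a real language model: it counts which word follows which instead of adjusting billions of weights, and the function names and the toy corpus are illustrative assumptions, not anything a production system uses.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count how often each word is followed by each other word."""
    tokens = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_counts(corpus)

# "cat" follows "the" twice and "mat" follows it once, so "cat" is predicted.
print(predict_next(model, "the"))
```

What the sketch keeps is a table of counts, not the sentence itself. Real models are vastly larger and learn their parameters by gradient descent, but the basic idea of turning text into numbers that drive next-word prediction is the same.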
The distinction matters legally: a trained model does not hold a book as a readable file. What lives there is a statistical residue, patterns compressed into numbers. This is why AI companies have consistently argued that training is "transformative use" under copyright law, similar to how a human author reads thousands of novels before writing their own. Critics counter that the scale is incomparable and the commercial benefit unmistakable.
In 2023, journalist Alex Reisner helped surface the contents of Books3, a dataset of approximately 196,640 entire books scraped from a piracy site called Bibliotik. Books3 was used to train models including Meta's LLaMA, BloombergGPT, and EleutherAI's GPT-J. Authors could search for their own titles and find them listed. Comedian Sarah Silverman and novelists Christopher Golden and Richard Kadrey were among those who filed suit against Meta specifically because of Books3.
Most training data for large language models passes through several stages before it reaches a model. A web crawler — Common Crawl is the most widely used — continuously downloads pages from across the internet. That raw data is then filtered, deduplicated, and sometimes augmented with higher-quality sources like Wikipedia, academic papers, or licensed datasets.
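The filtering and deduplication stages can be sketched in a few lines of Python. This is an illustrative toy, not the code of Common Crawl or any real pipeline: the `MIN_WORDS` threshold, the function names, and the sample pages are all assumptions made up for the example.

```python
import hashlib
import re

MIN_WORDS = 5  # hypothetical quality threshold; real pipelines use richer heuristics


def normalize(text):
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(pages):
    """Drop very short pages and exact duplicates (after normalization)."""
    seen_hashes = set()
    kept = []
    for page in pages:
        norm = normalize(page)
        if len(norm.split()) < MIN_WORDS:
            continue  # filtered out as low quality
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # duplicate of a page already kept
        seen_hashes.add(digest)
        kept.append(page)
    return kept


raw_pages = [
    "Breaking news: the council voted on the new budget today.",
    "breaking news:   the council voted on the new budget today.",
    "Click here",
    "An original essay about gardening in small urban spaces.",
]
cleaned = clean_corpus(raw_pages)
print(len(cleaned))  # 2: one duplicate and one too-short page were dropped
```

Notice what the sketch does not do: nothing in it looks at who owns the text. A page passes or fails on length and duplication alone.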
The problem is that the web contains copyrighted material at extraordinary scale. News articles, song lyrics, screenplay excerpts, forum posts containing quoted prose, fan-fiction sites reproducing published passages — all of it enters the pipeline unless actively filtered out. Most early pipelines had no systematic mechanism for filtering by copyright status.
OpenAI's GPT-3 paper (2020) disclosed that training data included a filtered version of Common Crawl, WebText2 (web pages linked from Reddit), two book corpora of undisclosed origin labeled "Books1" and "Books2," and Wikipedia. The company has not publicly disclosed the full contents of those book datasets. In its December 2023 lawsuit against OpenAI and Microsoft, The New York Times presented evidence that GPT-4 could reproduce near-verbatim passages from Times articles, arguing that the model had "memorized" specific text.
The Pile, a popular open-source training dataset, contained over 800GB of text including GitHub, PubMed, FreeLaw, and DM Mathematics — assembled without licensing individual items. Many model builders used it without scrutiny of its contents.
Researchers at Google, DeepMind, and universities have shown that models can "memorize" training data — reproducing exact or near-exact sequences when prompted. The probability rises with duplication: text that appears many times in training is more likely to be reproduced verbatim.
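One simple way to look for verbatim reproduction is to measure how many word sequences in a model's output also appear in a known source. The Python sketch below does this with 5-word sequences. It is an assumption-laden illustration, not the method used in the research described above: the function names, the window size, and the sample texts are all chosen for the example.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(output, source, n=5):
    """Fraction of the output's n-word sequences that also occur in the source."""
    output_grams = ngrams(output, n)
    if not output_grams:
        return 0.0  # output is shorter than n words
    shared = output_grams & ngrams(source, n)
    return len(shared) / len(output_grams)


source = "it was the best of times it was the worst of times"
copied = "he wrote that it was the best of times it was the worst"
fresh = "the weather was mild and the streets were quiet that evening"

print(round(overlap_ratio(copied, source), 2))  # 0.67
print(overlap_ratio(fresh, source))             # 0.0
```

A high ratio does not prove infringement and a low one does not rule it out. The check only flags passages worth a closer human look.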
When you use an AI tool to help write a story, design a logo, or compose a song, the outputs are shaped by everything that went into training the model. If the model was trained on copyrighted work without authorization, any resemblance between its output and that source material — even unintentional — could implicate you as the person publishing the result.
This is not hypothetical. In 2023, comedian and writer Katy O'Brian discovered that an AI image generator had produced work almost identical to a specific illustrator's style. The illustrator — Karla Ortiz — was one of three artists who filed suit against Stability AI, Midjourney, and DeviantArt in January 2023, arguing that their styles had been absorbed into systems used commercially without consent or compensation.
Understanding the pipeline — where training data comes from, how it's filtered, and where it isn't — is the first step in using AI responsibly. In the next lesson, we look at what happens when AI outputs too closely echo a specific source.
AI models are trained on vast text corpora that frequently include copyrighted material. The legal status of this practice is actively contested. As a creator using AI tools, you bear some responsibility for understanding whether the tool you're using has addressed training-data rights — and for reviewing outputs for unintended similarity to existing protected works.
Ask the AI assistant below about training data, copyright, and how it handles situations where it might reproduce copyrighted content. Try to get specific — ask about particular datasets, about memorization, about what safeguards exist. The goal is to understand the gap between what AI companies say about training data and what is actually disclosed.
On January 13, 2023, three visual artists — Sarah Andersen, Kelly McKernan, and Karla Ortiz — filed a class-action lawsuit against Stability AI, Midjourney, and DeviantArt in the Northern District of California. Their complaint argued that these companies had trained image-generation models on five billion images scraped from the internet without consent, including the artists' own work, and now offered a commercial product that could generate images "in the style of" any artist in the training set — on demand.
Karla Ortiz, a fantasy illustrator whose work appeared in Marvel and DC projects, demonstrated that typing her name into Midjourney produced images strikingly similar to her distinctive style. "My entire career," she said, "compressed into a product I never agreed to."
Under U.S. copyright law, style is not protectable. You cannot copyright "the way you paint" — the use of particular brushstrokes, a color palette, a mood. What copyright protects is specific expression: the actual painting, the specific arrangement of words on a page, the exact melody. This is why thousands of writers can write in the style of Raymond Carver without infringing, and why imitating the Impressionist aesthetic is not a legal problem.
The challenge with AI is one of degree and systematization. A human imitating an artist's style needs skill and time, and usually produces work that is recognizably different. An AI system trained on thousands of examples of an artist's work can produce near-identical outputs on demand, at scale, commercially, and without the artist's knowledge. Courts and legal scholars have noted that while this may not technically infringe copyright, it may be deeply unfair and may support other legal theories: unfair competition, right of publicity, or unjust enrichment.
In December 2023, The New York Times filed suit against OpenAI and Microsoft. Central to the complaint was evidence that GPT-4 could reproduce near-verbatim passages from Times articles, sometimes hundreds of words, when prompted. The Times included 100 examples in its filing. OpenAI responded that such "regurgitation" was a rare bug, not evidence of stored copying, and that the Times had manipulated its prompts to trigger it. Whether these outputs reflect memorized training text or an unusual failure mode became one of the central technical disputes of the case.
Image-generation models like Stable Diffusion and Midjourney work differently from language models, but the copyright questions are analogous. These systems are trained by adding noise to images and learning to reverse that process — effectively learning the statistical structure of image types, styles, and compositions from the training set.
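The "adding noise" half of that process can be sketched in Python. The formula below is the forward step as it is commonly published for diffusion models; it is not Stable Diffusion's or Midjourney's actual code, and the function name, the four-pixel "image," and the noise levels are assumptions for illustration.

```python
import math
import random


def add_noise(pixels, alpha_bar, rng):
    """Blend an image with Gaussian noise.

    alpha_bar = 1.0 keeps the image intact; alpha_bar = 0.0 leaves only noise.
    Returns the noisy image and the noise that was added.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal_scale * p + noise_scale * e for p, e in zip(pixels, noise)]
    return noisy, noise


rng = random.Random(0)            # fixed seed so runs are repeatable
image = [0.0, 0.5, 1.0, 0.25]     # a toy four-pixel "image"

# Walk from a nearly clean image to nearly pure noise.
for alpha_bar in (0.99, 0.5, 0.01):
    noisy, _ = add_noise(image, alpha_bar, rng)
    print(alpha_bar, [round(v, 2) for v in noisy])
```

During training, a network is shown the noisy version and asked to estimate the noise that was added. Learning to undo the corruption across millions of images is how the model absorbs the statistical structure of its training set.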
In 2023, researcher Ryan Webster published a paper demonstrating that Stable Diffusion could reproduce training images near-verbatim under certain prompting conditions — not just in style, but in actual pixel-level content. This "data extraction" from an image model became a significant exhibit in the ongoing artist lawsuits.
The LAION-5B dataset — 5.85 billion image-text pairs scraped from the web — was the primary training set for Stable Diffusion. A subsequent investigation by the Stanford Internet Observatory found that LAION-5B contained links to child sexual abuse material, leading to temporary suspension of the dataset. The incident illustrated that large-scale scraping without human review creates problems far beyond copyright.
Copyright protects specific expression, not general style or technique. But when AI systems are trained to reproduce style on demand at commercial scale, courts are being asked whether existing doctrine is adequate for the technology.
Midjourney and similar tools explicitly allow "in the style of [artist name]" prompts. Some artists have found their names produce outputs nearly indistinguishable from their own work. Midjourney later restricted some living artist names, but the underlying model was not retrained.
When you use an AI image generator or text model to produce content in a particular style, you are participating in a practice whose legality is unsettled. Practically speaking, outputs that too closely resemble a specific artist's distinctive work — even if produced by AI — could expose you to claims of copying. More importantly, the artists whose work made those outputs possible received nothing.
Some tools now address provenance directly. Adobe says Firefly was trained only on licensed or public-domain images, and Spawning.ai's "Have I Been Trained" lets artists check whether their work appears in training datasets and register an opt-out. Understanding these distinctions helps you make informed choices about which tools align with your values as a creator.
Style cannot be copyrighted, but specific expression can. AI systems trained on artists' work can reproduce style and, sometimes, near-exact outputs — at scale and commercially. The legal framework is actively evolving. As a creator, using AI tools that are transparent about training data and that offer artist protections is both an ethical choice and a practical risk-reduction strategy.
Ask the AI assistant below about how you might approach creating in the style of an existing artist — ethically and legally. Explore what "style" means legally, how to be inspired without infringing, and how to evaluate whether an AI tool you want to use has ethical training provenance.
When Stephen Thaler applied to register copyright for "A Recent Entrance to Paradise," an image he said was created entirely by his AI system, the Creativity Machine, the U.S. Copyright Office denied the application in 2022. Thaler sued. In August 2023, U.S. District Judge Beryl Howell upheld the Copyright Office's position: copyright requires human authorship. "Human authorship," she wrote, "is a bedrock requirement of copyright." The ruling was direct and unambiguous. AI-only works, in the United States, cannot be copyrighted as of that decision.
The U.S. Copyright Office has maintained since at least the 1970s that copyright protection requires human creative expression. The doctrine has roots in the Constitution, which grants Congress power to protect "Authors" — a term courts have consistently interpreted to mean human beings. Animals, computers, and nature cannot be authors.
This has significant practical implications. When you type a prompt into an AI image generator and it produces an image, the current U.S. legal position is that the AI's contribution to that image is not copyrightable. If the image requires no significant human creative expression beyond the prompt, it may be in the public domain the moment it's created — usable by anyone, including your competitors.
In February 2023, the Copyright Office partially cancelled the registration it had granted in 2022 for a graphic novel called "Zarya of the Dawn" by Kristina Kashtanova. The office kept the registration for the text and the creative arrangement, but withdrew protection from the individual AI-generated images (produced using Midjourney), holding that they lacked sufficient human authorship. The decision established a partial-registration framework: human creative choices are protectable; AI-generated elements are not.
The U.S. Copyright Office issued formal guidance in March 2023 stating it will register works containing AI-generated material only where a human author has exercised "sufficient creative control" over the final expression. Prompts alone are generally insufficient. The Office cited the example of a human who selects, arranges, and modifies AI outputs using their own creative judgment: that arrangement and selection can be protected. The underlying AI-generated content itself cannot.
The United States is not alone in its position, but other jurisdictions are reaching different conclusions. In 2021, South Africa's patent office granted the first patent listing an AI (DABUS, a system built by Stephen Thaler) as inventor, a decision that drew international attention but has not been replicated in the U.S. or EU. China's approach has been more flexible: a Beijing court ruled in 2023 that an AI-generated image was protectable under copyright when a human had made "intellectual inputs" in the prompt and output-selection process, establishing a lower threshold for human contribution than U.S. doctrine currently requires.
The EU AI Act (2024) touches on AI-generated content obligations — including labeling requirements and transparency about AI involvement — but leaves core copyright questions to member states, creating a patchwork of rules for creators working across jurisdictions.
Your selection and arrangement of AI outputs, your editing, your added text, your creative decisions about which outputs to use and how to combine them — these human creative contributions can qualify for copyright under current U.S. doctrine.
Pure AI output generated by a prompt, where no significant human creative judgment shapes the final result, is currently unprotectable in the U.S. Anyone can copy it freely. Your prompt itself is also generally not copyrightable as a work of sufficient originality.
If you want to protect AI-assisted creative work, the current legal framework rewards heavy human involvement. Using AI to generate a draft that you then substantially rewrite gives you much stronger copyright claims than publishing AI output with minimal editing. Curating, selecting, arranging, and combining AI outputs into a larger work — where your creative judgment is evident — also strengthens your position.
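One way to make your involvement visible is to keep a running log of the creative decisions you make while working with an AI tool. The Python sketch below shows what such a log might look like. The class names, the decision categories, and the sample entries are assumptions for illustration; no court or the Copyright Office has prescribed a format, and a log like this is supporting evidence at best.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class CreativeDecision:
    kind: str          # e.g. "selection", "arrangement", "rewrite"
    description: str   # what you changed and why
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class ProvenanceLog:
    work_title: str
    tool_name: str     # which AI tool produced the raw material
    decisions: list = field(default_factory=list)

    def record(self, kind, description):
        """Append one human creative decision to the log."""
        self.decisions.append(CreativeDecision(kind, description))

    def to_json(self):
        """Serialize the whole log so it can be saved alongside the work."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical project and tool names.
log = ProvenanceLog(work_title="Harbor Lights", tool_name="ExampleGen")
log.record("selection", "Chose draft 3 of 12 for its opening image.")
log.record("rewrite", "Rewrote the second act by hand; kept two AI sentences.")
print(log.to_json())
```

Saving the JSON next to each draft gives you a dated record of what you selected, arranged, and rewrote, which is the kind of contribution the current framework rewards.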
From a business perspective, this creates an irony: the more you rely on AI to do the creative work, the less legal protection you have over the result. Heavy AI reliance makes work harder to protect — and potentially puts it in the public domain where competitors can freely copy it.
U.S. courts and the Copyright Office have established that AI-only creative output cannot be copyrighted. Human creative contribution — selection, arrangement, substantial editing — is required. This means AI-heavy work may be unprotectable. Internationally, the rules vary. The practical lesson: if protecting your creative work matters, document your human creative decisions and ensure they are substantial, not superficial.
Ask the AI assistant to help you develop a practical system for documenting human creative choices when working with AI — so that if you ever need to prove copyright ownership of AI-assisted work, you have evidence of your creative decisions. Explore what counts as "sufficient creative control" and how to build it into your workflow.
In late 2023, the major U.S. film studios and the Writers Guild of America reached a landmark agreement after a 148-day strike. Buried in the contract language was a provision that neither affirmed nor denied AI's role in writing: writers could not be required to use AI, AI-generated text could not be used to lower their compensation floors, and the studios were required to disclose when AI-generated material was provided to writers. The contract did not resolve whether AI-generated scripts were copyrightable. It simply built fences around the most immediate harms while the law caught up.
Given the unsettled legal landscape, professional creators and organizations have developed pragmatic frameworks for evaluating AI tools. The questions aren't yet resolved by law, but they structure defensible decision-making:
Did the company disclose their data sources? Were licenses obtained? Does the tool have an opt-out registry for artists? Adobe Firefly (licensed data), Getty AI (licensed data), and tools built on LAION-5B (scraped without consent) represent different answers to this question.
Are your prompts or the content you upload used to further train the model? Are they shared with third parties? Some platforms (notably early versions of Google's Workspace AI and GitHub Copilot) sparked controversy by defaulting to training on user inputs without prominent disclosure.
Different platforms have different terms. OpenAI's terms (as of 2024) assign output ownership to the user, subject to copyright law. Midjourney's terms are more complex and have changed multiple times. Adobe Firefly offers an "indemnification" promise — if a customer is sued over outputs, Adobe will help defend them.
Microsoft, Google, and Adobe have all offered some form of copyright indemnification to enterprise customers using their AI tools — agreeing to cover legal costs if customers are sued over AI outputs. This doesn't resolve the legal questions, but it shifts financial risk back to the platform.
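The questions above can be turned into a small, repeatable checklist. The Python sketch below is one possible shape for it. The field names, the wording of the concerns, and the tool called "ExampleGen" are hypothetical, and the answers you fill in should come from a tool's published terms and documentation, not from this example.

```python
from dataclasses import dataclass


@dataclass
class ToolAssessment:
    name: str
    discloses_training_data: bool
    uses_licensed_data: bool
    trains_on_user_inputs: bool
    user_owns_outputs: bool
    offers_indemnification: bool

    def open_concerns(self):
        """List the checklist items this tool does not satisfy."""
        checks = [
            (self.discloses_training_data, "training data sources are not disclosed"),
            (self.uses_licensed_data, "training data is not confirmed as licensed"),
            (not self.trains_on_user_inputs, "your inputs may be used for further training"),
            (self.user_owns_outputs, "output ownership terms are unclear or unfavorable"),
            (self.offers_indemnification, "no indemnification is offered"),
        ]
        return [concern for passed, concern in checks if not passed]


# Hypothetical tool with made-up answers.
tool = ToolAssessment(
    name="ExampleGen",
    discloses_training_data=True,
    uses_licensed_data=False,
    trains_on_user_inputs=True,
    user_owns_outputs=True,
    offers_indemnification=False,
)
for concern in tool.open_concerns():
    print(f"{tool.name}: {concern}")
```

An empty list does not mean a tool is risk-free. It only means the tool answered the questions you chose to ask.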
As of 2024, the following major legal proceedings were active or recently decided:
Getty Images v. Stability AI (Delaware, 2023): Getty alleged Stability AI copied over 12 million images to train Stable Diffusion. As evidence that its photographs were in the training set, Getty pointed to distorted versions of its own watermark appearing in generated images, one of the most visible examples of training data surfacing in AI outputs.
The New York Times v. OpenAI (SDNY, filed December 2023): The Times claimed GPT-4 could reproduce verbatim Times articles, competing directly with the paper's subscription revenue. OpenAI has argued fair use. The case is likely to be the highest-stakes copyright case involving AI language models in the U.S. for years.
Concord Music Group v. Anthropic (Tennessee, filed 2023; transferred to the Northern District of California in 2024): Music publishers including Universal Music Group sued Anthropic for reproducing copyrighted song lyrics in Claude's outputs, an example of the memorization problem applied specifically to highly repetitive, easily identified text. As of 2024, the court had not ruled on the merits of the infringement claims.
Until law settles, creators can manage risk by: (1) preferring tools with disclosed, licensed training data; (2) documenting their human creative contributions for any work they intend to protect; (3) reviewing AI outputs for unintended similarity to known works before publishing; (4) understanding platform terms of service regarding output ownership and indemnification; and (5) staying informed as case law develops — the landscape is shifting rapidly.
Law defines minimums. Many creators and organizations have chosen standards beyond what current law requires. The argument is straightforward: the artists and authors whose work made AI possible had no say in whether their work was used. They received no compensation. The AI systems that scraped their work are now competitive with them in the marketplace.
Choosing to use tools built on licensed data — even when tools built on scraped data are cheaper or more capable — is a stance about what kind of creative ecosystem you want to support. It is also, increasingly, a reputational consideration: brands and publishers are beginning to ask whether their AI-generated content was produced ethically.
In 2024, the Authors Guild launched the AI Registry, a licensing mechanism allowing AI companies to license authors' works for training on an opt-in basis. Platforms including OpenAI have begun exploring licensing deals with publishers. The infrastructure for consent is being built — slowly, imperfectly, but it is being built.
The law is unsettled, but the decisions you make as a creator are not arbitrary. Evaluating tools by their training data provenance, understanding output ownership terms, documenting your own creative contributions, and reviewing outputs before publishing are practical steps that reduce both legal and ethical risk. The goal isn't paralysis — it's informed creativity.
Use this lab to develop a personal checklist for evaluating AI tools before you use them in creative work. Ask the assistant to help you think through the key questions — training data, input usage, output ownership, indemnification — and apply them to a specific tool or use case you're actually considering.