AI Knows More Than You Think · Introduction

Every Click Has Been Recorded, and Now It Speaks

You didn't consent to being training data. This course explains exactly what happened — and what it means.

In 1890, Louis Brandeis and Samuel Warren published "The Right to Privacy" in the Harvard Law Review — triggered, in part, by the arrival of the Kodak portable camera, which suddenly let strangers photograph people in public without permission. The technology had outrun the social contract by years. Courts, legislatures, and ordinary citizens spent the next four decades negotiating what privacy even meant in a world where the image of a person could be reproduced and distributed without their knowledge. Sound familiar? The pattern is old. The stakes, this time, are larger.

Today's equivalent of the Kodak moment is quieter and far more pervasive. Between 2007 and 2023, the world's internet users generated roughly 120 zettabytes of text, images, audio, and behavioral signals — search queries, product reviews, forum arguments, medical questions typed at 2 a.m., location check-ins, voice recordings captured by smart speakers. The companies that trained the large language models now reshaping medicine, law, hiring, and education harvested significant portions of that output. Common Crawl alone — one of the primary training sources for GPT-3, LLaMA, and dozens of other models — contains petabyte-scale snapshots of the public web going back to 2008. Much of what you posted publicly online is, statistically, already inside a model.

This course does not argue that AI is malevolent, nor that you should panic. It argues that you deserve to understand the mechanics — specifically, how behavioral data becomes model weights, what inferences are possible from the data trails you leave, what legal frameworks currently do and do not protect you, and what choices remain available. Four lessons, each grounded in documented events and verifiable research. No invented characters, no worst-case speculation. Just what is actually known.

If you finish every module, here's who you become:

You'll understand exactly how behavioral data — clicks, queries, location pings — gets converted into the model weights powering today's AI systems.
You'll be able to look at any app's permissions screen and identify what it's actually collecting versus what it claims to need.
You'll know what Common Crawl is, why it matters, and why your public posts are statistically inside at least one large language model.
You'll recognize the specific inference techniques AI systems use to build profiles and target decisions — hiring, credit, health, advertising — from fragmented data trails.
You'll be able to audit your own digital footprint using documented tools and interpret what your exposure actually means in practice.
You'll know which legal frameworks — GDPR, CCPA, and their limits — currently apply to your data and what rights you can actually exercise right now.
You become someone who reads AI systems clearly, makes deliberate choices about exposure, and doesn't need to go off-grid to protect what matters.

AI Knows More Than You Think · Lesson 1 of 4

The Harvest: Where Training Data Actually Comes From

Before a model learns anything, someone must decide what it reads — and the answer involves most of the public internet.

What data, collected from where, by whom, ended up shaping the AI systems answering your questions today?

On May 28, 2020, OpenAI published a technical paper describing GPT-3, a language model trained on text drawn from several sources: Common Crawl web snapshots (roughly 570 gigabytes after filtering), WebText2 (Reddit outbound links with high upvote counts), Books1, Books2, and English Wikipedia. The paper listed the sources plainly in a table. What it did not dwell on was that Common Crawl, the largest component and weighted at 60% of training tokens, is assembled by a nonprofit that has been crawling the public web since 2008, collecting pages regardless of whether their authors intended them as AI training material. A recipe blog post from 2011. A grief support forum thread from 2014. A teenager's DeviantArt commentary from 2009. All of it, potentially present. OpenAI was not uniquely aggressive in this choice; it was doing what the field had converged on as standard practice. The question this lesson asks is not whether that was right or wrong, but precisely how it works, so you can reason about it clearly.

1.1 — Common Crawl and the Web-Scale Baseline

Common Crawl is a San Francisco–based nonprofit that has operated continuous web crawls since 2008. Its publicly available dataset as of 2023 contains over 250 billion web pages in compressed form, totaling multiple petabytes. It is free to download, which is why it appears in the training lineage of GPT-2, GPT-3, GPT-4 (indirectly), Meta's LLaMA and LLaMA 2, Google's PaLM, Mistral, and dozens of academic models. The crawl captures whatever is publicly accessible at the time, meaning content that requires no login, subject to robots.txt exclusions that many sites do not configure carefully or at all.

The key process is called crawling and indexing: automated bots follow hyperlinks systematically, download HTML, strip navigation elements, and store raw text. Common Crawl's bots identify themselves with a user-agent string ("CCBot"), meaning website owners could technically block them through robots.txt. Most did not, either because they were unaware, because blocking felt futile, or because they wanted search-engine discoverability and did not distinguish Common Crawl's bot from the search crawlers that obey the same robots.txt rules.
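To make the blocking mechanism concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt content, the example.com URLs, and the "SomeSearchBot" name are assumptions made up for illustration; only the "CCBot" user-agent string comes from the text above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site. These rules are an
# assumption for the example, not any real site's configuration.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# CCBot is blocked everywhere; other crawlers are blocked only under /private/.
ccbot_ok = is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/blog/post-1")
other_ok = is_allowed(ROBOTS_TXT, "SomeSearchBot", "https://example.com/blog/post-1")
other_private = is_allowed(ROBOTS_TXT, "SomeSearchBot", "https://example.com/private/x")
print(ccbot_ok, other_ok, other_private)
```

Note that robots.txt is a request, not an access control: it only affects crawlers that choose to honor it, and it does nothing about pages captured before the rule was added.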

Researchers at the Allen Institute for AI released Dolma in 2023, an open pretraining corpus published together with documentation of what web-crawl training sets contain. They found that a significant fraction of text originated from a small number of domain types: news sites, Wikipedia mirrors, Reddit, e-commerce product pages, and what they categorized as "low-quality content farms." The implication is that the web's most actively written spaces (forums, comment sections, personal blogs, social media text scraped before API restrictions) are disproportionately represented inside modern models.

Documented Source

OpenAI's GPT-3 paper (Brown et al., 2020, "Language Models are Few-Shot Learners") explicitly lists training data composition: 410 billion tokens from Common Crawl, 19 billion from WebText2, 12 billion from Books1, 55 billion from Books2, and 3 billion from Wikipedia. The paper is publicly available on arXiv (arXiv:2005.14165).
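A quick arithmetic check helps separate two numbers that are easy to confuse: a source's share of the raw dataset and its weight in the training mixture. The sketch below uses only the token counts quoted above; the variable names are illustrative.

```python
# Token counts in billions, as quoted above from Brown et al. (2020).
TOKENS_BILLIONS = {
    "Common Crawl": 410,
    "WebText2": 19,
    "Books1": 12,
    "Books2": 55,
    "Wikipedia": 3,
}

total = sum(TOKENS_BILLIONS.values())

# Share of the raw dataset held by each source, before any sampling weights.
raw_share = {name: count / total for name, count in TOKENS_BILLIONS.items()}

for name, share in raw_share.items():
    print(f"{name:13s} {TOKENS_BILLIONS[name]:4d}B tokens  {share:6.1%} of dataset")
print(f"Total: {total}B tokens")
```

Common Crawl comes out at roughly 82% of the dataset's tokens, yet it was weighted at 60% of what the model saw during training. The gap exists because smaller, higher-quality sources were sampled more often than their size alone would suggest.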

1.2 — The Books Problem: LibGen, Bibliotik, and the Authors Guild Lawsuits

Web text alone cannot teach a model long-form reasoning, narrative structure, or sustained argument. For that, AI labs turned to books. The mechanisms varied and grew increasingly controversial. Meta's LLaMA 1 model, released in February 2023, was trained on a dataset that included a "Books" component. Reporting by Alex Reisner in The Atlantic, published in August 2023, identified its source as Books3, a dataset assembled by researcher Shawn Presser in 2020. Books3 contained approximately 196,640 pirated books scraped from the shadow library Bibliotik. The Atlantic built a searchable database; authors could look up whether their titles were included. They frequently were.

OpenAI's GPT-3 training included "Books1" and "Books2," the contents of which OpenAI did not publicly specify. Investigative reporting and legal filings in the Authors Guild v. OpenAI lawsuit (filed September 2023 in the Southern District of New York) allege the books datasets contained copyrighted material without license. Comedian and author Sarah Silverman, along with novelists Christopher Golden and Richard Kadrey, filed separate suits in July 2023 in the Northern District of California against Meta and OpenAI, alleging that LLaMA and ChatGPT were trained on their work without permission or compensation.

The legal outcomes remain unresolved as of 2024, but the underlying technical fact is not disputed: copyrighted books were used as training data at scale, by multiple major labs, drawing on shadow library infrastructure that had existed for years before AI labs found it useful.

Training Token: The basic unit of text a language model processes during training, roughly equivalent to a word fragment. GPT-3's training datasets totaled approximately 499 billion tokens, of which the model processed about 300 billion during training. Each token influences the model's statistical associations.

Common Crawl: A nonprofit web archive containing petabyte-scale snapshots of publicly accessible web pages since 2008. The primary training data source for most publicly known large language models.

Data Filtering: The process of removing low-quality, duplicate, or harmful text from raw crawl data before training. Different labs use different filtering pipelines, meaning identical source data can produce very different training sets.
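As a rough illustration of data filtering, the sketch below applies two of the simplest possible rules: drop very short pages and drop exact duplicates after normalizing whitespace and case. The threshold, the function names, and the sample documents are all assumptions for the example; real pipelines are far more elaborate and differ from lab to lab.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def filter_documents(docs, min_words=5):
    """Keep documents that are long enough and not exact duplicates."""
    seen_hashes = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # too short to be useful (illustrative threshold)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # same text already kept once
        seen_hashes.add(digest)
        kept.append(doc)
    return kept


# Made-up sample pages for the example.
raw = [
    "How to proof sourdough overnight in a cold kitchen.",
    "how to proof   sourdough overnight in a cold kitchen.",
    "Click here",
    "A long forum reply about coping with grief after a loss.",
]
kept = filter_documents(raw)
print(kept)
```

Notice what the filter does not do: nothing here checks whether a page contains personal information. A heartfelt forum post passes a quality filter more easily than spam does.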

1.3 — Social Platforms: Reddit, Stack Overflow, and the API Wars of 2023

Some of the most useful training data for conversational AI is human dialogue: specifically, text where one person asks something and another answers. Reddit and Stack Overflow supplied this at massive scale, both directly through web crawls and indirectly as a curation signal. OpenAI's WebText and WebText2 datasets were built by collecting the pages behind outbound links posted to Reddit that had received at least 3 karma, a crude but effective quality filter. The result: billions of tokens of human-written explanation, argument, humor, and domain expertise, selected by what Reddit users chose to share and upvote.

By 2023, Reddit and Stack Overflow recognized that their communities had effectively been harvested to train commercial AI products that now competed with their own traffic. In April 2023, Stack Overflow announced a policy requiring AI companies to pay for API access to its data. That same month, Reddit CEO Steve Huffman announced that Reddit would begin charging for API access, with pricing that would make large-scale third-party data access prohibitively expensive; the move was widely understood as aimed at AI scrapers. Both moves arrived years after the most significant training runs had already completed. The data was already inside the models.

Reddit did subsequently negotiate a $60 million annual data licensing deal with Google, reported by Bloomberg in February 2024 ahead of Reddit's IPO. This established a market price for the kind of conversational data that had previously been taken without payment — but it did not retroactively compensate the millions of Reddit users who wrote the content.

Why This Matters to You

If you have ever posted publicly on Reddit, written a public blog, contributed to Wikipedia, published a review on Yelp or Amazon, commented on a news article, or maintained a public social media account — your text is statistically likely to be part of at least one major language model's training data. This is not speculation; it is a consequence of how Common Crawl, WebText, and similar pipelines were constructed and which sources they prioritized.

1.4 — From Raw Text to Model Weights: The Compression Step

Understanding what "training data" means requires understanding what training actually does to that data. A language model does not store text like a database. It compresses statistical relationships among tokens into a set of numerical parameters called weights. GPT-3 has 175 billion such parameters. During training, the model reads a sequence of tokens, predicts the next token, compares its prediction to the actual next token, and adjusts its weights slightly to reduce error. The weight adjustment is performed by a variant of stochastic gradient descent, and the loop repeats across hundreds of billions of tokens, with some sources sampled more than once and others only in part.
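The predict, compare, adjust loop described above can be sketched at toy scale. The model below is a deliberately tiny stand-in (one table of scores per previous word, trained with plain per-example gradient descent on an eleven-word corpus), not the architecture or optimizer of any real system. The corpus, learning rate, and epoch count are assumptions chosen for illustration.

```python
import math

# Tiny made-up corpus, already split into tokens.
corpus = "the cat sat on the mat . the cat ate .".split()
vocab = sorted(set(corpus))
index = {tok: i for i, tok in enumerate(vocab)}
V = len(vocab)

# weights[prev][nxt] is the score for token nxt following token prev.
# Everything starts at zero: the model initially knows nothing.
weights = [[0.0] * V for _ in range(V)]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def train_epoch(lr=0.5):
    """One pass over the corpus: predict, measure error, nudge the weights."""
    total_loss = 0.0
    for prev_tok, next_tok in zip(corpus, corpus[1:]):
        p, n = index[prev_tok], index[next_tok]
        probs = softmax(weights[p])           # predict the next token
        total_loss += -math.log(probs[n])     # compare with the actual one
        for j in range(V):                    # adjust weights to reduce error
            grad = probs[j] - (1.0 if j == n else 0.0)
            weights[p][j] -= lr * grad
    return total_loss / (len(corpus) - 1)


losses = [train_epoch() for _ in range(50)]
probs_after_the = softmax(weights[index["the"]])
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
print({tok: round(probs_after_the[index[tok]], 2) for tok in vocab})
```

After training, the table holds no sentences, only numbers. Yet those numbers encode that "cat" follows "the" more often than "mat" does, which is exactly the kind of statistical residue the lesson describes.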

The result is that no individual sentence is stored — but the patterns of language use, factual associations, stylistic tendencies, and even specific recurring phrases become encoded in the weight matrix. Researchers have demonstrated memorization: in 2021, Google researcher Nicholas Carlini and colleagues published a paper showing that GPT-2 could be induced to reproduce verbatim text from its training data — including a specific person's name paired with their phone number — when prompted correctly. This was not a bug in the traditional sense; it was a consequence of data appearing frequently enough that the model's weights encoded it with high fidelity.
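Why would a model trained to predict the next token ever emit a phone number? A loose analogy, using a simple count-based predictor rather than a neural network, shows the basic effect: when a context has only ever been followed by one continuation, the most likely prediction is that continuation, verbatim. The documents, the name, and the number below are invented for the example, and the mechanism is far cruder than what Carlini and colleagues studied.

```python
from collections import Counter, defaultdict

# Synthetic corpus. The name and phone number are made up for illustration.
documents = [
    "for plumbing help call jane example at 555 0100 any weekday",
    "the weather was mild and the market was busy",
    "the market was quiet and the weather was cold",
]

CONTEXT = 2  # predict each token from the two tokens before it

# Count which token follows each two-token context in the training text.
counts = defaultdict(Counter)
for doc in documents:
    tokens = doc.split()
    for i in range(len(tokens) - CONTEXT):
        context = tuple(tokens[i:i + CONTEXT])
        counts[context][tokens[i + CONTEXT]] += 1


def generate(prompt: str, max_new_tokens: int = 4) -> str:
    """Greedily extend the prompt with the most frequent next token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        context = tuple(tokens[-CONTEXT:])
        if context not in counts:
            break  # never saw this context during training
        tokens.append(counts[context].most_common(1)[0][0])
    return " ".join(tokens)


completion = generate("call jane example at")
print(completion)
```

The predictor was never asked to remember a phone number. It reproduces one because nothing else in its training text competes for that context.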

The takeaway for this course: your data does not sit in a folder marked "user data." It is dissolved into a statistical structure that can sometimes reconstitute fragments of it — including personal details — under the right prompting conditions.

Lesson Summary: Modern large language models are trained primarily on web-crawl data (dominated by Common Crawl), supplemented by books (some obtained via shadow libraries), and human-generated dialogue platforms like Reddit and Stack Overflow. This data was collected without individual consent or compensation. The training process compresses these sources into numerical weights that can, under certain conditions, reproduce verbatim training content. Lawsuits from authors and publishers are currently testing whether this constitutes copyright infringement. Legal outcomes remain pending.

Lesson 1 Quiz — The Harvest

Four questions. Select the best answer for each.

1. What percentage of GPT-3's training tokens came from Common Crawl, according to the original OpenAI paper?

Correct. The GPT-3 paper (Brown et al., 2020) lists approximately 410 billion tokens from Common Crawl out of roughly 499 billion total. Because higher-quality sources were sampled more often during training, Common Crawl's weight in the training mixture was about 60%.

Not quite. The GPT-3 paper reports approximately 410 billion tokens from Common Crawl — about 60% of the total training mixture, making it by far the largest source.

2. What was "Books3," the dataset identified in reporting about Meta's LLaMA training?

Correct. The Atlantic's August 2023 investigation, citing researcher Alex Reisner's work, identified Books3 as a collection of roughly 196,640 pirated books scraped from Bibliotik, a shadow library.

Incorrect. Books3 was a dataset of approximately 196,640 pirated books scraped from the shadow library Bibliotik — not a licensed or public-domain collection. The Atlantic built a searchable database so authors could check if their work was included.

3. How did language models like GPT-2 demonstrate "memorization" of training data, as shown in the 2021 Carlini et al. study?

Correct. Nicholas Carlini and colleagues at Google demonstrated that GPT-2 could be induced to reproduce verbatim text from training data — including specific personal contact information — under targeted prompting, even though no text is stored explicitly in the model.

Incorrect. The Carlini et al. study showed that with the right prompting, GPT-2 would reproduce verbatim training text — including personal identifiers like names and phone numbers — as an emergent consequence of weight encoding, not because text is stored explicitly.

4. What event in February 2024 established a market price for conversational social media data used in AI training?

Correct. Bloomberg reported in February 2024 that Reddit had negotiated a roughly $60 million annual licensing deal with Google, timed partly around Reddit's IPO preparation — the first large public valuation of this type of community-generated conversational data.

Not quite. Bloomberg reported in February 2024 that Reddit struck an approximately $60 million annual data licensing deal with Google — establishing a concrete market price for community-generated conversational data that had previously been collected without payment.

Lab 1 — Data Source Investigator

Talk to the AI about training data origins. Ask about your own digital footprint.

Your Mission

In this lab, you'll interrogate an AI assistant about how training data is collected, filtered, and transformed into model weights. Ask about specific sources, ask where your own data might have ended up, or challenge the AI to explain why models memorize certain content. The goal is to deepen your mental model of the pipeline from web page to language model.

Try asking: "If I wrote a public blog post in 2015, is it likely inside a language model today?" — or — "What filtering would remove my data from a Common Crawl-based dataset?"

AI Knows More Than You Think · Lesson 2 of 4

The Inference Engine: What AI Can Deduce From Fragments

Training data teaches models to fill gaps. The gaps they can fill about individuals are more revealing than most people expect.

Given only your public digital behavior — your posts, searches, purchases, app usage — what can a trained model infer about you that you never stated explicitly?

In 2014, a Cambridge University researcher named Aleksandr Kogan built a Facebook quiz app called "thisisyourdigitallife." It collected psychological profile data — not just from the roughly 270,000 people who installed it, but from all of their friends, via Facebook's then-permissive API. The result was a dataset covering an estimated 87 million Facebook profiles. Kogan sold this data to Cambridge Analytica, a political consulting firm that claimed to use it to build psychographic models capable of predicting voter personality and targeting political advertising accordingly. The story broke publicly in March 2018, forcing Facebook CEO Mark Zuckerberg to testify before Congress. What the episode illustrated was not primarily a hacking story — it was an inference story: from apparently innocuous behavioral signals (quiz answers, likes, page follows), it was possible to model intimate psychological attributes with meaningful accuracy.

2.1 — The Kosinski Studies: Likes Predict Lives

The scientific foundation for Cambridge Analytica's claims — disputed in its commercial applications but grounded in real research — came from work by Michal Kosinski, then at Cambridge University's Psychometrics Centre. In a 2013 paper published in PNAS (Proceedings of the National Academy of Sciences), Kosinski and colleagues analyzed 58,000 volunteers who had completed a personality questionnaire and shared their Facebook likes. Using a relatively simple machine learning model, they found that Facebook likes alone could predict: sexual orientation with 88% accuracy (for males), ethnic background with 95% accuracy, political affiliation with 85% accuracy, religious affiliation with 82% accuracy, and whether users' parents had separated before they turned 21 with 60% accuracy.

The model used no demographic self-report data, only behavioral signals (which pages someone had liked). The likes with highest predictive power were frequently not the obvious ones: liking "Curly Fries" correlated with high intelligence; liking "Being Confused After Waking Up from Naps" was among the stronger predictors of male heterosexuality. The point is not that these are causal relationships, but that any large behavioral dataset contains enough correlated signal to model attributes that users never disclosed.
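To see how little machinery this kind of prediction needs, here is a minimal sketch of a logistic regression trained on a like matrix. It is not the pipeline from the Kosinski paper: the eight users, four pages, and labels are hand-made synthetic data, and the model, learning rate, and step count are assumptions chosen for illustration.

```python
import math

# Synthetic data: each row is one hypothetical user's likes for four made-up
# pages (1 = liked). Labels mark an attribute the users never stated.
PAGES = ["page_a", "page_b", "page_c", "page_d"]
LIKES = [
    [1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 1, 0], [0, 1, 0, 0],
    [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0], [1, 0, 1, 1],
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, likes):
    """Estimated probability that a user with these likes has the attribute."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, likes)))


def train(rows, labels, steps=2000, lr=0.5):
    """Fit logistic regression with batch gradient descent."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    n = len(rows)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for likes, label in zip(rows, labels):
            error = predict(weights, bias, likes) - label
            grad_b += error
            for j, x in enumerate(likes):
                grad_w[j] += error * x
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


weights, bias = train(LIKES, LABELS)
p_ab = predict(weights, bias, [1, 1, 0, 0])  # likes page_a and page_b
p_cd = predict(weights, bias, [0, 0, 1, 1])  # likes page_c and page_d
print(dict(zip(PAGES, (round(w, 2) for w in weights))))
print(round(p_ab, 2), round(p_cd, 2))
```

None of the four pages says anything about the attribute on its face. The model simply learns which likes tend to co-occur with it, which is the whole point: correlation is enough.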

Modern language models trained on user-generated text can operate similarly. If your writing style, vocabulary choices, topic interests, and posting times are all present in training data — or in a system that can query a model fine-tuned on your data — the model has access to a dense behavioral fingerprint.

Documented Research

Kosinski, Stillwell, and Graepel (2013). "Private traits and attributes are predictable from digital records of human behavior." PNAS 110(15), pp. 5802–5805. The paper is publicly available and has been cited over 3,000 times. It specifically demonstrates that Facebook likes — a passive behavioral signal — can predict sensitive personal attributes with high accuracy using standard machine learning methods available a decade ago.

2.2 — Fine-Tuning and Personalization: When the Model Learns You Specifically

Base language models trained on general web data make inferences about categories of people. Fine-tuning — the process of continuing to train a model on a smaller, targeted dataset — allows it to learn a specific individual or organization's patterns. Several documented commercial applications have raised inference concerns.

In 2023, the company Replika, which creates AI companions that draw on user conversation histories, faced controversy when it attempted to remove explicitly romantic features from its product. Users who had spent months building conversation histories with their AI companions reported that the companions reflected detailed knowledge of their emotional patterns, relationship history, and psychological vulnerabilities. The company held that conversation data and could, in principle, use it to build detailed psychological profiles of its 10 million registered users.

More broadly, the structure of retrieval-augmented generation (RAG) — a technique where a model is given access to a user's personal documents or history at query time — means that enterprise AI deployments increasingly have access to data that enables sharp individual inference: email history, calendar patterns, document authorship, editing behavior. Microsoft's Copilot, integrated with Microsoft 365, operates in exactly this mode — it has access to a user's email, calendar, Teams messages, and documents when generating responses.
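The retrieval step in RAG can be sketched in a few lines. This toy version ranks documents by how many words they share with the question and pastes the top matches into the prompt. Production systems typically use embedding-based search instead, and nothing here reflects how Microsoft's product is built. The documents, names, and query are invented for illustration.

```python
import re

# Hypothetical personal documents that a workplace assistant might index.
DOCUMENTS = {
    "email_2024_03": "Reminder: cardiology appointment on Thursday at 9am, bring insurance card.",
    "calendar": "Thursday 9am clinic visit. Friday 2pm budget review with finance.",
    "notes": "Draft ideas for the team offsite agenda and catering.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Return the names of the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda name: len(query_words & tokenize(documents[name])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Assemble the text a model would receive: retrieved context plus question."""
    names = retrieve(query, documents, k)
    context = "\n".join(f"[{name}] {documents[name]}" for name in names)
    return f"Context:\n{context}\n\nQuestion: {query}"


query = "What is on my schedule Thursday at 9am?"
retrieved = retrieve(query, DOCUMENTS)
prompt = build_prompt(query, DOCUMENTS)
print(prompt)
```

The user asked an ordinary scheduling question. The prompt the model receives now includes a cardiology appointment, which is how a routine query can expose health information to whatever system sits behind the assistant.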

Inference: Deriving information about an individual that they did not explicitly provide, by identifying statistical patterns across behavioral signals. Distinguished from retrieval (finding stored data) because inferred attributes were never directly recorded.

Fine-tuning: Continuing to train a pre-trained model on a smaller, domain-specific dataset. Fine-tuning on an individual's communications can produce a model that accurately simulates that person's writing style and likely responses.

Psychographic Profiling: Building statistical models of personality, values, and psychological traits from behavioral data. Demonstrated effective by Kosinski et al. (2013) using Facebook likes; applied commercially by political consultancies including Cambridge Analytica.

2.3 — The Aggregation Problem: When Fragments Add Up

A single data point rarely reveals much. Your posting that you enjoy hiking is innocuous. Your posting at 11 p.m. suggests something about your schedule. A sequence of hiking posts during the same months, combined with location metadata, work-related posts on weekdays, and purchases at outdoor retailers, begins to construct a detailed individual profile — even if no single element is sensitive. This is the aggregation problem, and it is structurally why AI inference is more powerful than prior data analysis tools.

In 2013, a study published in Scientific Reports by de Montjoye and colleagues demonstrated that mobile phone mobility traces (no call content, only timestamps and the cell towers a phone connected to) were sufficient to uniquely identify 95% of individuals in a dataset of 1.5 million people using just four data points. The researchers called these "data fingerprints." AI systems trained on richer data, which includes content and not just metadata, operate with far larger fingerprint surfaces.
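The idea of a data fingerprint can be measured directly. The sketch below computes, for a toy dataset, how often knowing k points from someone's trace is enough to single them out. The four users and their (time, tower) points are invented, and the exhaustive check over all combinations is a simplification; the original study sampled points at random from a dataset of 1.5 million people.

```python
from itertools import combinations

# Hypothetical traces: each user is a set of (time slot, cell tower) points.
TRACES = {
    "u1": {("mon-08", "A"), ("mon-18", "B"), ("tue-08", "A"), ("tue-20", "C")},
    "u2": {("mon-08", "A"), ("mon-18", "D"), ("tue-08", "A"), ("tue-20", "C")},
    "u3": {("mon-08", "A"), ("mon-18", "B"), ("tue-08", "E"), ("tue-20", "C")},
    "u4": {("mon-08", "F"), ("mon-18", "B"), ("tue-08", "A"), ("tue-20", "G")},
}


def unicity(traces, k):
    """Fraction of k-point samples that match exactly one user's trace."""
    unique = total = 0
    for user, points in traces.items():
        for known in combinations(sorted(points), k):
            # Everyone whose trace contains all of the known points.
            matches = [u for u, pts in traces.items() if set(known) <= pts]
            total += 1
            unique += (matches == [user])
    return unique / total


for k in range(1, 5):
    print(f"{k} known point(s): {unicity(TRACES, k):.0%} of samples identify one person")
```

Even in this tiny example, where the users deliberately overlap, identifiability climbs from 25% with one point to 100% with four. No single point is revealing; the combination is.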

A practical consequence: anonymization is harder than it sounds. In 2006, Netflix released a dataset of 100 million anonymized movie ratings as part of a machine learning competition. Researchers Arvind Narayanan and Vitaly Shmatikov demonstrated in a 2008 paper that by cross-referencing with public IMDb reviews, they could de-anonymize specific individuals from the dataset with high confidence, revealing their political preferences and other private information, using only a handful of anchor data points.
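The cross-referencing idea can be sketched as a simple linkage attack. The version below scores each pseudonymous record by how many of a person's public ratings it matches within one star, then looks at the gap between the best and second-best candidates. It is a heavily simplified illustration, not the scoring method from the 2008 paper, and every identifier, film, and rating is invented.

```python
# Hypothetical "anonymized" release: pseudonymous ID -> {film: stars}.
ANON_RATINGS = {
    "user_0417": {"Film A": 5, "Film B": 2, "Film C": 4, "Film D": 1},
    "user_0932": {"Film A": 5, "Film B": 4, "Film E": 3},
    "user_1288": {"Film C": 4, "Film D": 5, "Film F": 2},
}

# Ratings one named person posted publicly on a review site (made up).
PUBLIC_REVIEWS = {"Film B": 2, "Film C": 4, "Film D": 1}


def match_score(public, anon, tolerance=1):
    """Count public ratings that appear in the record within `tolerance` stars."""
    return sum(
        1
        for film, stars in public.items()
        if film in anon and abs(anon[film] - stars) <= tolerance
    )


def rank_candidates(public, anon_dataset):
    """Order pseudonymous records from best to worst match."""
    scores = {uid: match_score(public, ratings) for uid, ratings in anon_dataset.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates(PUBLIC_REVIEWS, ANON_RATINGS)
best, runner_up = ranking[0], ranking[1]
print("best match:", best, "runner-up:", runner_up)
```

Once the best match stands clearly apart from the runner-up, the attacker learns everything else in that record, including the ratings the person never posted publicly ("Film A" in this toy data).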

The Takeaway

The aggregation of individually innocuous data points — your public posts, behavioral patterns, purchase signals, and location history — creates a statistical surface from which an AI system can infer sensitive attributes you never disclosed. This is not theoretical: it has been demonstrated across multiple peer-reviewed studies using data that most people considered insufficiently private to protect carefully.

Lesson Summary: AI inference is the process of deducing unstated attributes from behavioral signals. The Kosinski et al. (2013) study demonstrated that Facebook likes alone predict sexual orientation, political affiliation, and parental separation with meaningful accuracy. The Cambridge Analytica episode showed this research had real-world commercial application. Fine-tuning on personal data enables even sharper individual-level inference. The aggregation of individually innocuous signals compounds privacy exposure, as demonstrated by Netflix dataset de-anonymization (Narayanan & Shmatikov, 2008).

Lesson 2 Quiz — The Inference Engine

Four questions on inference, psychographic profiling, and the aggregation problem.

1. In the 2013 Kosinski et al. PNAS study, what data source was used to predict sensitive personal attributes like sexual orientation and political affiliation?

Correct. Kosinski et al. used only Facebook likes (the pages users had clicked Like on) and found they could predict sexual orientation, ethnicity, political affiliation, and other sensitive attributes with high accuracy, without any self-reported demographic data.

Incorrect. The Kosinski et al. study used only Facebook likes — purely behavioral signals indicating which pages someone had clicked Like on. No demographic self-reports were used, making the finding especially striking.

2. Approximately how many Facebook user profiles were affected by the Cambridge Analytica data collection, according to Facebook's own subsequent disclosure?

Correct. Facebook disclosed that approximately 87 million profiles were affected — not just the roughly 270,000 people who installed Kogan's quiz app, but all of their friends, whose data was also collected via Facebook's then-permissive API.

Incorrect. Facebook disclosed that approximately 87 million profiles were affected. The total far exceeded the roughly 270,000 direct users because the app also harvested data on those users' friends via Facebook's API, which permitted this at the time.

3. What did the 2008 Narayanan and Shmatikov study demonstrate about the Netflix "anonymous" movie ratings dataset?

Correct. Narayanan and Shmatikov demonstrated that by cross-referencing the anonymized Netflix ratings with public IMDb reviews, they could de-anonymize specific individuals, revealing private information including political preferences, using only a handful of known ratings as anchor points.

Incorrect. The study showed that de-anonymization was possible through cross-referencing: using a few public IMDb reviews as anchors, the researchers could identify specific individuals within the "anonymized" Netflix dataset and reveal their private rating history, which implied political and personal preferences.

4. What is the core concern with the "aggregation problem" in AI inference?

Correct. The aggregation problem refers to the compounding effect: no single data point is sensitive, but combining posting times, location patterns, topic interests, purchase signals, and writing style creates a dense statistical fingerprint that enables inference of sensitive attributes never explicitly shared.

Incorrect. The aggregation problem is specifically about combination: individually innocuous signals — hiking posts, late-night posting times, purchase patterns — combine into a statistical fingerprint sufficient to infer sensitive personal attributes like mental health status, financial condition, or political views that the person never disclosed.

Lab 2 — Inference Mapper

Probe how behavioral signals translate into personal inferences.

Your Mission

In this lab, you'll explore what kinds of inferences are possible from different data signals. Describe a hypothetical user's digital behavior and ask what could be inferred. Challenge the AI to explain the limits of inference. Ask how aggregation amplifies what any single signal reveals.

Try asking: "If someone posts mostly between 11pm and 2am on weekdays, what might that signal to an inference model?" — or — "What's the difference between what data says vs. what it implies?"

AI Knows More Than You Think · Lesson 3 of 4

The Legal Landscape: What Protects You and What Doesn't

Privacy law was written for a world of filing cabinets. Watching it encounter AI training is instructive — and sobering.

When an AI company uses your data to train a model without your knowledge, which laws apply — and which leave you without remedy?

On March 31, 2023, Italy's data protection authority — the Garante per la protezione dei dati personali — ordered OpenAI to immediately block ChatGPT for Italian users. The stated grounds: no legal basis for collecting Italian users' personal data to train the model, combined with the absence of any mechanism for users to correct or delete data about themselves held inside the model. It was the first time a Western government had taken direct enforcement action against a major AI system under data protection law. OpenAI blocked the service for Italian users within days. By April 28, 2023, ChatGPT had returned to Italy after OpenAI implemented a privacy disclosure page, an age verification system, and an opt-out mechanism for EU residents. The episode illustrated both what data protection law could accomplish and the limits of its remedies — because the opt-out applied to future training, not to data already embedded in existing model weights.

3.1 — GDPR: The Strongest Framework — and Its Limits

The European Union's General Data Protection Regulation, which became applicable on May 25, 2018, is widely regarded as the most comprehensive data protection framework in the world. Its core principles, as they apply to AI training, are significant: lawfulness, fairness, and transparency (Article 5), meaning you must tell people what data you're collecting and why; purpose limitation, meaning data collected for one purpose cannot be repurposed without a legal basis; data minimization, meaning you may only collect what's necessary; the right to erasure (Article 17), under which individuals can request deletion of their data; and the right to object (Article 21), under which individuals can object to processing based on legitimate interests.

The challenge for AI training is that GDPR was designed for databases — discrete records that can be located and deleted. A language model's weights do not contain discrete records. When someone exercises their right to erasure against a company that trained a model on their data, the company faces a structural problem: it cannot surgically remove one person's contribution from 175 billion parameters without retraining the model from scratch. OpenAI's response to the Italian Garante did not include a mechanism for weight-level erasure — because none exists at commercial scale. Researchers are actively working on machine unlearning techniques, but they remain experimental as of 2024.

The GDPR has produced real enforcement actions. In January 2023, Ireland's Data Protection Commission fined Meta €390 million for using personal data from Facebook and Instagram to target advertising without adequate legal basis. In May 2023, Meta was fined an additional €1.2 billion — the largest GDPR fine in history at the time — for transferring EU user data to U.S. servers without adequate protections. These fines concern data use, not AI training specifically, but they establish that GDPR has meaningful teeth against large tech companies.

Key Legal Fact

GDPR Article 17 ("Right to erasure") contains an important caveat: it does not apply where processing is necessary for "the establishment, exercise or defence of legal claims," or where it conflicts with freedom of expression. More practically, regulators and courts have not yet ruled definitively on whether a model trained on personal data "contains" that data in a legally meaningful sense — a question that will determine whether right-to-erasure claims against AI companies can succeed.

3.2 — U.S. Law: A Patchwork Without a Federal Privacy Statute

Unlike the EU, the United States has no comprehensive federal data privacy law as of 2024. Privacy protection in the U.S. derives from a patchwork of sector-specific statutes. HIPAA covers medical records held by healthcare providers and insurers but not health searches. COPPA covers online data collected from children under 13. FERPA covers student educational records. GLBA covers financial data. None of these directly governs the collection of AI training data from general web content or social media.

State-level legislation has partially filled this gap. The California Consumer Privacy Act (CCPA), effective January 2020, gives California residents the right to know what personal data companies hold, the right to delete it, and the right to opt out of its sale. The California Privacy Rights Act (CPRA), effective January 2023, expanded these protections and created a new agency, the California Privacy Protection Agency, to enforce them. Virginia, Colorado, Connecticut, and Texas have passed similar statutes. However, enforcement against AI training specifically, as opposed to data brokerage or targeted advertising, has been limited.

The most active U.S. legal front over AI training data is copyright litigation, not data protection. Authors Guild v. OpenAI (S.D.N.Y., 2023), Getty Images v. Stability AI (D. Del., 2023), and Andersen v. Stability AI (N.D. Cal., 2023) all allege that training on copyrighted material constitutes infringement. These cases turn on whether AI training qualifies as "fair use" under 17 U.S.C. § 107, a question no federal circuit court has yet answered definitively for this technology.

Right to Erasure: GDPR Article 17 entitlement for EU residents to request deletion of personal data. Application to AI model weights is unresolved: no technology currently allows surgical removal of a specific person's contribution from a trained model's parameters.

Machine Unlearning: An active area of research seeking techniques to remove the influence of specific training data from an already-trained model without full retraining. No production-scale solution exists as of 2024.

Fair Use: A U.S. copyright doctrine (17 U.S.C. § 107) that permits use of copyrighted material without license under certain conditions. Whether AI training on copyrighted content qualifies as fair use is the central question in multiple pending federal lawsuits.

3.3 — The Opt-Out Paradox and Consent Architecture

When AI companies respond to regulatory pressure by offering opt-outs from training data collection, a structural problem emerges: the opt-out cannot reach data already used to train existing models. OpenAI's privacy policy as updated in 2023 allows users to opt out of having their ChatGPT conversation data used for future training — but GPT-3, GPT-4, and other models trained on web data collected before that policy existed are not affected. The data is already compressed into weights.

Meta introduced a privacy center in the EU that lets users object to their public Facebook and Instagram posts being used for AI training, a step prompted by GDPR enforcement pressure. Meta clarified, however, that the objection applies to future AI model training, not to data already incorporated. The Norwegian Consumer Council (Forbrukerrådet) published a 2023 report arguing that these opt-out mechanisms fail GDPR's requirement that data processing based on "legitimate interests" must allow individuals a genuine right to object, and that pointing to the impossibility of retroactive removal does not satisfy that requirement.

The practical position for anyone whose data was in Common Crawl, Reddit, Stack Overflow, or published books before 2023 is clear: their contribution to training sets has already occurred. The question of remedy — what compensation or control is available after the fact — is being argued simultaneously in regulatory proceedings in Brussels, Dublin, and Rome, and in courtrooms in San Francisco and New York.

Where the Law Actually Stands

As of mid-2024: GDPR provides the strongest framework but faces structural limits when applied to model weights. U.S. law has no federal equivalent; copyright litigation is the most active battleground. The Italy/Garante enforcement showed that regulators can compel procedural changes quickly. But no jurisdiction has yet compelled a major AI lab to delete or retrain a model based on privacy grounds — because no one has demonstrated a workable technical path for doing so.

Lesson Summary: GDPR is the world's most comprehensive data protection framework and has produced significant enforcement actions against tech companies, but its application to AI model weights is structurally unresolved — the "right to erasure" cannot currently be meaningfully applied to parameters. U.S. law relies on a state-level patchwork and copyright litigation. The Italy-ChatGPT episode of March–April 2023 showed regulatory enforcement can compel procedural changes but not retroactive weight modification. Machine unlearning remains an active research area without production solutions.

Lesson 3 Quiz — The Legal Landscape

Four questions on GDPR, U.S. law, and the opt-out paradox.

1. What action did Italy's Garante data protection authority take against ChatGPT in March 2023, and on what grounds?

Correct. On March 31, 2023, the Garante ordered ChatGPT blocked for Italian users, citing no legal basis for collecting personal data for model training and no mechanism for users to correct or request deletion of their data from the model.

Incorrect. The Garante ordered ChatGPT blocked — not fined — for Italian users, citing no legal basis for collecting personal data to train the model and no user mechanism to correct or delete personal data held within model weights.

2. Why is GDPR's "right to erasure" difficult to apply to trained AI language models?

Correct. The structural problem is technical: language model weights compress statistical patterns across billions of parameters. There is no discrete "record" for a specific person that can be located and deleted — the technique of machine unlearning to address this remains experimental.

Incorrect. The difficulty is technical, not legal. Model weights don't store text as discrete records — they encode statistical relationships across billions of parameters. Removing one person's contribution would require identifying exactly which parameters it influenced and adjusting them, which is not currently feasible at scale without full retraining.

3. What was the largest GDPR fine issued as of mid-2023, and who received it?

Correct. In May 2023, Ireland's Data Protection Commission fined Meta €1.2 billion — the largest GDPR fine at that time — for illegally transferring EU user data to U.S. servers without adequate legal protections under the GDPR's data transfer rules.

Incorrect. The largest GDPR fine as of mid-2023 was €1.2 billion, issued to Meta by Ireland's Data Protection Commission in May 2023, for transferring EU user data to U.S. servers without adequate legal protections.

4. Which of the following best describes the U.S. federal legal situation regarding data used in AI training as of 2024?

Correct. As of 2024, the U.S. has no comprehensive federal data privacy law. Sector-specific statutes (HIPAA, COPPA, FERPA) leave most AI training scenarios unregulated at the federal level. The most active legal front is copyright litigation — Authors Guild v. OpenAI, Getty v. Stability AI — turning on fair use doctrine.

Incorrect. The U.S. has no comprehensive federal data privacy law as of 2024. Sector-specific statutes like HIPAA cover limited domains. The most legally active front regarding AI training is copyright litigation in federal courts, where fair use doctrine is the central question.

Lab 3 — Legal Reasoning Workshop

Work through AI data privacy legal scenarios with the assistant.

Your Mission

Use this lab to work through legal scenarios involving AI data practices. Ask the AI to explain how GDPR would apply to a specific situation, what options a U.S. author has against a company that trained on their book, or why the opt-out paradox creates real limits on rights enforcement.

Try asking: "If I'm a UK resident and my Reddit posts were used to train GPT-4, what legal options do I have?" — or — "Explain why 'fair use' is so contested in AI training copyright cases."

AI Knows More Than You Think · Lesson 4 of 4

What You Can Actually Do: Practical Responses to a Changed Landscape

Awareness without agency is anxiety. This lesson maps what actions are real, what are marginal, and what remain unavailable.

Given that much of your historical data is already in training sets, what choices remain meaningful — and which responses are mostly theater?

On July 14, 2023, the Screen Actors Guild–American Federation of Television and Radio Artists joined the Writers Guild of America on strike, the first simultaneous major Hollywood labor stoppage since 1960. One of SAG-AFTRA's central demands was protection against studios using AI to scan actors' likenesses and voices to generate synthetic performances without consent or compensation. The studios had proposed clauses that would allow scanning a background actor for a single day's pay and then using that digital model indefinitely. The actors refused. After a 118-day strike, the November 2023 contract included provisions requiring informed consent for AI scanning and compensation equivalent to what the work would have paid. It was a documented, negotiated win: not comprehensive, but real. It established a precedent that consent and compensation for AI training on personal likeness data could be contractually mandated. Other domains will not follow the same model exactly, but the mechanism, collective bargaining backed by enforceable contracts, is transferable.

4.1 — What Opt-Outs Actually Do (and Don't Do)

Several major AI companies now offer mechanisms to limit or stop data collection for training. Using them effectively requires understanding their actual scope.

OpenAI: ChatGPT users can turn off "Improve the model for everyone" in Settings → Data Controls. This stops future conversation data from being used in training. It does not affect data already used. By default, data sent through the API is not used for training unless the customer opts in.

Google: Gemini (formerly Bard) offers a similar toggle. Google's broader data collection for advertising, which feeds its own AI development, is governed by your Google account privacy settings, which are extensive but complex.

Meta: EU and UK users gained the ability in 2024 to object to Meta using public posts for AI training, under GDPR pressure. The objection form is accessible but not prominently surfaced.

What none of these mechanisms can do: remove your data from models already trained. The Italian Garante case confirmed this — OpenAI's remediation did not include retroactive weight modification. The opt-out matters for future exposure, not past exposure. If you have been an active public internet user before 2023, your historical data is almost certainly already incorporated.

4.2 — Minimizing Future Footprint: What's Realistic

Several concrete steps reduce future data exposure without requiring unrealistic behavioral changes.

Robots.txt and noindex directives: If you own a website or blog, adding the two lines "User-agent: CCBot" and "Disallow: /" to your robots.txt file asks Common Crawl's crawler to skip the site in future snapshots. This does not remove pages already captured. WordPress, Ghost, and other CMS platforms support this via plugins or manual configuration.

Platform privacy settings: Setting social media accounts to private removes them from public crawl scope, though data already captured while accounts were public remains in existing snapshots.

Limiting platform diversity: Using fewer platforms, more selectively, reduces the cross-platform aggregation surface that enables de-anonymization.

Request mechanisms: Both Google and Bing offer removal tools that can request de-indexing of specific pages. De-indexing may reduce those pages' representation in future Common Crawl snapshots, though any such effect is indirect and not guaranteed.

Researcher Chris Callison-Burch at the University of Pennsylvania published a 2023 analysis noting that robots.txt compliance among AI training crawlers is inconsistent — Common Crawl generally respects robots.txt, but some AI training-specific crawlers do not. The landscape is moving faster than documentation of crawler behavior can track.
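As a minimal sketch of what the CCBot directive from this section does, the Python standard library's `urllib.robotparser` can evaluate a robots.txt file the way a compliant crawler would. The `is_allowed` helper, the example URLs, and the "ExampleBot" name are illustrative assumptions. The check only tells you what a crawler that honors robots.txt should do; it cannot tell you whether a given crawler actually complies.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule discussed in this section, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Common Crawl's crawler is asked to stay away from the whole site.
print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/blog/post"))

# A crawler not named in the file is unaffected by this rule.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/blog/post"))
```

Because the rule names only CCBot, every other crawler is still permitted. Blocking additional AI crawlers means adding a separate User-agent group for each one, using the user-agent token that its operator documents.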

Practical Priority Order

If you want to minimize future AI training data exposure: (1) Enable opt-outs in ChatGPT, Gemini, and any AI tools you use actively. (2) If you own a web property, add CCBot block directives to robots.txt. (3) Set unused social accounts to private or delete them. (4) Exercise GDPR rights (if in EU/UK) via company privacy centers — not to fix past exposure, but to limit future data use and to establish legal record of your objection. (5) If you are a creator, understand what licensing or collective agreements apply to your work.

4.3 — Collective and Structural Responses

Individual opt-outs are low-leverage relative to structural change. The SAG-AFTRA example demonstrates that collective bargaining can establish meaningful consent and compensation requirements. For writers, the Authors Guild and National Writers Union are pursuing both litigation and advocacy for opt-out registries and statutory licensing schemes — a model analogous to how the music industry's ASCAP and BMI collect royalties for public performance. No equivalent system exists yet for text, but it is the direction several advocacy organizations are pushing in Washington and Brussels.

Legislative progress is uneven but real. The EU AI Act — agreed in December 2023 and formally adopted in 2024 — includes transparency requirements for general-purpose AI models: companies must publish summaries of training data, and models must comply with EU copyright law, which is interpreted to require opt-out mechanisms for rights holders. This is weaker than opt-in consent, but it is enforceable infrastructure that did not exist before. The AI Act's transparency provisions take effect for most models in August 2025.

In the U.S., the American Data Privacy and Protection Act (ADPPA) passed the House Energy and Commerce Committee in 2022 with bipartisan support but stalled before reaching a vote in the full House, partly due to tension with California's CPRA. A comprehensive federal privacy law remains possible in a future legislative session; if passed, it would create GDPR-like rights nationwide and would likely become the primary legal framework for AI training data disputes.

Robots.txt: A plain-text file placed at a website's root directory instructing web crawlers which pages to access or avoid. Not legally binding (it is a protocol convention), but respected by Common Crawl and major search engine crawlers.

Statutory Licensing: A legal framework where use of a class of works is permitted without individual negotiation, but creators receive mandatory compensation through a collective mechanism. Used in U.S. music performance rights (ASCAP, BMI). Proposed for AI training by several author advocacy groups.

EU AI Act: EU regulation formally adopted in 2024, requiring general-purpose AI model providers to publish training data summaries and comply with EU copyright law. Transparency provisions for most models take effect August 2025.

4.4 — Knowing What You Don't Know

One honest limit of this course: we do not know precisely what data GPT-4, Claude, or Gemini were trained on. OpenAI's GPT-4 technical report (March 2023) explicitly declined to disclose training data composition, citing competitive concerns. Google's PaLM 2 and Gemini papers similarly lack specificity. Anthropic has not published detailed training data disclosure for Claude. The EU AI Act's transparency requirements, when they take effect, may change this — but the data disclosure required is a "summary," not full specification.

This means that asserting with certainty whether any given piece of your content is inside any given model is not currently possible for most content. What is demonstrable is the class of data that is very likely present: publicly indexed web pages, Reddit posts, Stack Overflow questions, published books — the building blocks documented in the papers and lawsuits covered in this course.

Living with this uncertainty productively means calibrating your digital behavior with an awareness that public text is, by default, available for training purposes absent specific technological or legal barriers — and making choices accordingly, based on the actual levers available to you rather than either panic or complacency.

Lesson Summary: Practical responses to AI training data collection include enabling opt-outs in AI tools (for future training only), blocking crawlers via robots.txt on owned web properties, and exercising GDPR rights in jurisdictions where they apply. Collective mechanisms — union contracts like SAG-AFTRA's 2023 agreement, author advocacy for statutory licensing, and legislative frameworks like the EU AI Act — offer more leverage than individual action. The ADPPA remains stalled in the U.S. Congress. Full training data transparency does not currently exist for the largest commercial models.

Lesson 4 Quiz — What You Can Actually Do

Four questions on opt-outs, collective action, and the EU AI Act.

1. What did the SAG-AFTRA 2023 strike contract specifically require regarding AI scanning of actors?

Correct. The November 2023 SAG-AFTRA contract required informed consent before an actor's likeness could be AI-scanned, and compensation equivalent to what the work would have earned — defeating the studio proposal to pay a single day's rate for indefinite synthetic reuse.

Incorrect. The contract required informed consent for AI scanning and compensation equivalent to what the work would have paid — a direct rejection of studios' proposal to scan actors once for a day rate and use the digital model indefinitely. A complete ban was not achieved.

2. What does adding "User-agent: CCBot / Disallow: /" to a website's robots.txt file accomplish?

Correct. Robots.txt directives are a protocol convention, not legally binding. CCBot — Common Crawl's crawler — generally respects them, so adding the disallow directive reduces future capture. It has no effect on pages already in Common Crawl's existing archive.

Incorrect. Robots.txt is a convention, not a legal instrument. Adding the CCBot disallow directive tells Common Crawl's crawler to avoid future crawls of that site — which Common Crawl generally respects — but it cannot remove content from snapshots already taken.

3. What do the EU AI Act's transparency provisions, taking effect August 2025, require of general-purpose AI model providers?

Correct. The EU AI Act requires training data summaries (not full disclosure) and compliance with EU copyright law, which includes opt-out mechanisms for rights holders. It is weaker than full transparency or opt-in consent requirements, but establishes enforceable baseline infrastructure.

Incorrect. The EU AI Act requires training data summaries — not full specification — and compliance with EU copyright law including opt-out mechanisms for rights holders. It does not mandate individual compensation or retroactive model deletion.

4. Why did OpenAI's remediation in response to the Italian Garante order NOT include removing Italian users' data from existing model weights?

Correct. The structural problem is technical: model weights don't store discrete records. Machine unlearning — the research field aimed at enabling this — remains experimental. OpenAI's remediation addressed procedural gaps (privacy disclosure, age verification, future opt-out) but could not offer retroactive weight-level erasure because the capability does not exist at production scale.

Incorrect. The reason is technical, not legal or regulatory: there is no production-scale method to surgically remove a specific person's data contribution from already-trained model weights. Machine unlearning research is attempting to solve this, but no commercial solution existed at the time of the Garante order or as of 2024.

Lab 4 — Action Planning Workshop

Build a personal data exposure strategy with the AI assistant.

Your Mission

In this final lab, you'll work with the AI to assess your specific situation and build a realistic action plan. Tell it about your digital presence — platforms, content types, jurisdiction — and ask it to help you prioritize the actions most likely to actually matter. Push back on any advice that sounds like theater rather than substance.

Try asking: "I have a public Twitter/X account, a WordPress blog, and I'm in California — what's the highest-leverage thing I can actually do?" — or — "Is there any point in exercising GDPR rights if I'm outside the EU?"

Module Test — AI Knows More Than You Think

15 questions across all four lessons. Score 80% or higher to pass.

1. Common Crawl has been continuously crawling the public web since approximately what year?

Correct. Common Crawl began continuous web crawling in 2008 and its archive goes back to that year.

Incorrect. Common Crawl has been operating since 2008, building a petabyte-scale archive of the public web.

2. What was OpenAI's WebText/WebText2 training dataset built from?

Correct. WebText was built by collecting all outbound URLs posted to Reddit with at least 3 upvotes — a quality filter that produced billions of tokens of human-curated web content.

Incorrect. WebText was assembled from outbound URLs shared on Reddit that had received at least 3 karma upvotes — using Reddit's upvote mechanism as a proxy for content quality.

3. According to the 2013 Kosinski et al. PNAS study, Facebook likes could predict sexual orientation (for males) with approximately what accuracy?

Correct. The study reported 88% accuracy in predicting sexual orientation for males from Facebook likes alone — a behavioral signal that most users considered innocuous at the time.

Incorrect. Kosinski et al. reported 88% accuracy in predicting male sexual orientation from Facebook likes, along with similarly high accuracy for other sensitive attributes.

4. The 2021 Carlini et al. study on GPT-2 memorization demonstrated what?

Correct. Carlini and colleagues showed that with appropriate prompting, GPT-2 could be induced to reproduce verbatim text from training — including specific personal contact information — demonstrating that model weights encode some training content with high fidelity.

Incorrect. The study showed that targeted prompting could elicit verbatim training text reproduction — including specific personal identifiers — without any explicit storage mechanism, as a consequence of how the weight matrix encoded high-frequency training patterns.

5. The 2008 Narayanan and Shmatikov study on the Netflix dataset demonstrated which key principle?

Correct. The study showed that cross-referencing anonymized Netflix ratings with public IMDb reviews — using just a handful of known ratings as anchors — was sufficient to identify specific individuals and reveal their private rating history.

Incorrect. The key finding was de-anonymization by cross-referencing: the supposedly anonymous Netflix dataset could be linked to specific individuals using their public IMDb reviews as anchor points, with just a few data points needed to make a confident identification.

6. When did Italy's Garante data protection authority order ChatGPT blocked for Italian users?

Correct. The Garante issued its blocking order on March 31, 2023. OpenAI complied within days. ChatGPT returned to Italian users by April 28, 2023, after OpenAI implemented required procedural changes.

Incorrect. The order was issued on March 31, 2023 — making it the first Western government enforcement action against a major AI system under data protection law.

7. What is "machine unlearning" in the context of AI training data?

Correct. Machine unlearning is an active research field attempting to solve the technical problem of targeted data removal from trained models — but no production-scale solution exists as of 2024.

Incorrect. Machine unlearning refers to research into techniques for removing specific training data's influence from already-trained model weights without complete retraining — a technically unsolved problem at production scale as of 2024.

8. The Authors Guild v. OpenAI lawsuit (filed September 2023) primarily argues that AI training constitutes:

Correct. The central legal argument is copyright infringement — that training on copyrighted books without a license exceeds fair use under 17 U.S.C. § 107. No federal circuit court has yet ruled definitively on this question for AI training.

Incorrect. The core claim is copyright infringement: that using copyrighted books as AI training data without license or compensation exceeds what fair use doctrine permits. Fair use is a U.S. copyright doctrine, not a GDPR concept.

9. What did Reddit announce it would do with its data starting in June 2023, and approximately what value did it subsequently place on that data?

Correct. Reddit CEO Steve Huffman announced aggressive API pricing in June 2023, widely understood as targeting AI scrapers. In February 2024, Bloomberg reported Reddit signed a roughly $60 million annual licensing deal with Google — establishing a market price for community conversational data.

Incorrect. Reddit introduced expensive API pricing in June 2023 to restrict AI data access, then signed an approximately $60 million annual data licensing deal with Google, reported by Bloomberg in February 2024 ahead of Reddit's IPO.

10. What is the "aggregation problem" in the context of AI inference and privacy?

Correct. The aggregation problem describes how individually non-sensitive signals — posting times, topic patterns, vocabulary choices, purchase behavior — compound into a statistical fingerprint enabling inference of sensitive attributes the person never disclosed.

Incorrect. The aggregation problem is about inference: individually innocuous data points, when combined, create a fingerprint that reveals attributes the person never explicitly shared — as demonstrated in the de Montjoye mobile phone metadata study (4 data points uniquely identifying 95% of individuals).

11. What does the California Privacy Rights Act (CPRA), effective January 2023, NOT include?

Correct. The CPRA is a California state law — it applies only in California. Federal privacy legislation (ADPPA) has stalled in Congress and no federal comprehensive privacy law exists as of 2024.

Incorrect. The CPRA is California state legislation — it has no federal reach. The American Data Privacy and Protection Act (ADPPA) that would create federal equivalents passed committee in 2022 but stalled in the full House.

12. What two core demands did SAG-AFTRA win in its 2023 contract regarding AI scanning of performers?

Correct. The contract required informed consent before AI scanning and compensation equivalent to what the work would have earned — defeating the studios' proposal to pay a single day rate for indefinite synthetic reuse of a performer's likeness.

Incorrect. The two core wins were informed consent before scanning and equivalent compensation — meaning studios could not scan an actor for a day rate and then use the digital model indefinitely without additional payment.

13. Which of the following statements about the EU AI Act's training data transparency requirements is accurate?

Correct. The EU AI Act requires training data summaries — not granular disclosure — and compliance with EU copyright law, which is interpreted to require opt-out mechanisms for rights holders. These provisions apply to most general-purpose AI models starting August 2025.

Incorrect. The EU AI Act requires summaries, not full training data disclosure, and compliance with EU copyright law including opt-out mechanisms for rights holders — not opt-in consent. The provisions take effect for most models in August 2025.

14. Why can't OpenAI's current opt-out mechanism (disabling "Improve the model for everyone" in ChatGPT settings) protect data you've already contributed?

Correct. The opt-out affects future training data collection only. Models already trained on conversations that occurred before the opt-out was enabled — or before the feature existed — are not retroactively modified, because machine unlearning at that scale is not currently feasible.

Incorrect. The opt-out is prospective only: it prevents future conversations from being used in future training. Models already trained on earlier conversations are unaffected — retroactive weight modification would require technology (machine unlearning at scale) that does not yet exist commercially.

15. Which of the following best summarizes the technical reason that "right to erasure" claims are structurally difficult to enforce against trained AI language models?

Correct. The structural problem is technical: language model training compresses text into distributed numerical weights. There is no discrete record for "Alice's forum post" — her contribution is spread across billions of parameters in ways that cannot be isolated and removed without retraining, which is computationally prohibitive for production models.

Incorrect. The core problem is technical: a trained model's weights don't store text as discrete records. Any individual's contribution is diffused across billions of parameters during the gradient descent training process, making surgical removal without full retraining technically infeasible at commercial scale as of 2024.