In 1890, Louis Brandeis and Samuel Warren published "The Right to Privacy" in the Harvard Law Review — triggered, in part, by the arrival of the Kodak portable camera, which suddenly let strangers photograph people in public without permission. The technology had outrun the social contract by years. Courts, legislatures, and ordinary citizens spent the next four decades negotiating what privacy even meant in a world where the image of a person could be reproduced and distributed without their knowledge. Sound familiar? The pattern is old. The stakes, this time, are larger.
Today's equivalent of the Kodak moment is quieter and far more pervasive. Between 2007 and 2023, the world's internet users generated roughly 120 zettabytes of text, images, audio, and behavioral signals — search queries, product reviews, forum arguments, medical questions typed at 2 a.m., location check-ins, voice recordings captured by smart speakers. The companies that trained the large language models now reshaping medicine, law, hiring, and education harvested significant portions of that output. Common Crawl alone — one of the primary training sources for GPT-3, LLaMA, and dozens of other models — contains petabyte-scale snapshots of the public web going back to 2008. Much of what you posted publicly online is, statistically, already inside a model.
This course does not argue that AI is malevolent, nor that you should panic. It argues that you deserve to understand the mechanics — specifically, how behavioral data becomes model weights, what inferences are possible from the data trails you leave, what legal frameworks currently do and do not protect you, and what choices remain available. Four lessons, each grounded in documented events and verifiable research. No invented characters, no worst-case speculation. Just what is actually known.
On May 28, 2020, OpenAI published a technical paper describing GPT-3, a language model trained on text drawn from several sources: Common Crawl web snapshots (about 570 gigabytes after filtering), WebText2 (pages linked from Reddit posts that received at least 3 karma), Books1, Books2, and English Wikipedia. The paper listed the sources plainly in a table. What it did not dwell on was that Common Crawl, the largest component and weighted at 60% of the training mix, is assembled by a nonprofit that has been crawling the public web since 2008, collecting pages regardless of whether their authors intended them as AI training material. A recipe blog post from 2011. A grief support forum thread from 2014. A teenager's DeviantArt commentary from 2009. All of it, potentially present. OpenAI was not uniquely aggressive in this choice; it was doing what the field had converged on as standard practice. The question this lesson asks is not whether that was right or wrong, but precisely how it works, so you can reason about it clearly.
Common Crawl is a San Francisco–based nonprofit that has operated continuous web crawls since 2008. Its publicly available dataset as of 2023 contains over 250 billion web pages in compressed form, totaling multiple petabytes. It is free to download, which is why it appears, directly or through derived datasets, in the documented or reported training lineage of GPT-3, Meta's LLaMA and LLaMA 2, Google's PaLM, and dozens of academic models. (GPT-2, by contrast, was trained on WebText, which was built from Reddit outbound links rather than from Common Crawl.) The crawl captures whatever is publicly accessible without a login at the time, subject to robots.txt exclusions that many sites do not configure carefully or at all.
The key process is called crawling and indexing: automated bots follow hyperlinks systematically, download HTML, strip navigation elements, and store raw text. Common Crawl's bots identify themselves with a user-agent string ("CCBot"), meaning website owners could technically block them — but most did not, either because they were unaware, because blocking felt futile, or because they wanted search-engine discoverability and Common Crawl shared infrastructure logic with search crawlers.
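The crawl step described above (collect links, drop navigation, keep text) can be sketched with Python's standard html.parser module. This is a minimal illustration, not Common Crawl's actual pipeline: the class name PageExtractor, the list of skipped tags, and the sample page are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as navigation or non-content (an assumption).
SKIPPED_TAGS = {"nav", "header", "footer", "script", "style"}

class PageExtractor(HTMLParser):
    """Collects outbound links and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        if tag == "a":
            # Links are kept even inside navigation: the crawler follows them.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text is stored only when it sits outside the skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())

def extract(html):
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)

sample = """<html><body>
<nav><a href="/home">Home</a></nav>
<p>My sourdough starter failed twice.</p>
<a href="https://example.org/recipes">More recipes</a>
</body></html>"""

links, text = extract(sample)
print(links)
print(text)
```

The links feed the next round of downloads, and the text is what ends up in a corpus. Nothing in this step asks who wrote the page or why.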
Researchers at the Allen Institute for AI released Dolma in 2023, an open training corpus whose accompanying documentation describes what web-crawl training sets contain. They found that a significant fraction of text originated from a small number of domain types: news sites, Wikipedia mirrors, Reddit, e-commerce product pages, and what they categorized as "low-quality content farms." The implication is that the web's most actively written spaces (forums, comment sections, personal blogs, social media text scraped before API restrictions) are disproportionately represented inside modern models.
OpenAI's GPT-3 paper (Brown et al., 2020, "Language Models are Few-Shot Learners") explicitly lists training data composition: 410 billion tokens from Common Crawl, 19 billion from WebText2, 12 billion from Books1, 55 billion from Books2, and 3 billion from Wikipedia. The paper is publicly available on arXiv (arXiv:2005.14165).
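A short calculation makes the difference between dataset size and training weight concrete. The token counts are the ones listed above. The sampling weights are taken from the same table in Brown et al. (2020); only the 60% figure for Common Crawl appears elsewhere in this lesson, so treat the other four as an added assumption to check against the paper. The reported weights are rounded and sum to 101%.

```python
# (tokens in billions, sampling weight in % of the training mix)
datasets = {
    "Common Crawl": (410, 60),
    "WebText2": (19, 22),
    "Books1": (12, 8),
    "Books2": (55, 8),
    "Wikipedia": (3, 3),
}

total_tokens = sum(tokens for tokens, _ in datasets.values())

def raw_share(name):
    """Share of all available tokens contributed by one source, in percent."""
    tokens, _ = datasets[name]
    return 100 * tokens / total_tokens

def oversampling(name):
    """Ratio of sampling weight to raw share; above 1 means the source was upweighted."""
    _, weight = datasets[name]
    return weight / raw_share(name)

for name in datasets:
    print(f"{name:13s} raw share {raw_share(name):5.1f}%  "
          f"weight {datasets[name][1]:2d}%  ratio {oversampling(name):.2f}")
```

Read this way, Common Crawl supplies most of the available tokens but is downweighted, while the smaller curated sources are sampled several times more often than their size alone would suggest.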
Web text alone cannot teach a model long-form reasoning, narrative structure, or sustained argument. For that, AI labs turned to books. The mechanisms varied and grew increasingly controversial. Meta's LLaMA 1 model, released in February 2023, was trained on a dataset that included a "Books" component. Reporting by Alex Reisner in The Atlantic, published in August 2023, traced that component to Books3, a dataset assembled by researcher Shawn Presser in 2020. Books3 contained roughly 196,000 pirated books scraped from the shadow library Bibliotik. The Atlantic built a searchable database; authors could look up whether their titles were included. They frequently were.
OpenAI's GPT-3 training included "Books1" and "Books2," the contents of which OpenAI did not publicly specify. Investigative reporting and legal filings in the Authors Guild v. OpenAI lawsuit (filed September 2023 in the Southern District of New York) allege the books datasets contained copyrighted material without license. Comedian and author Sarah Silverman, along with novelists Christopher Golden and Richard Kadrey, filed separate suits in July 2023 in the Northern District of California against Meta and OpenAI, alleging that LLaMA and ChatGPT were trained on their work without consent or compensation.
The legal outcomes remain unresolved as of 2024, but the underlying technical fact is not disputed: copyrighted books were used as training data at scale, by multiple major labs, drawing on shadow library infrastructure that had existed for years before AI labs found it useful.
Some of the most useful training data for conversational AI is human dialogue, specifically text where one person asks something and another answers. Reddit and Stack Overflow supplied this at massive scale. OpenAI's WebText and WebText2 datasets were built by collecting the outbound links in Reddit posts that received at least 3 karma, a crude but effective quality filter. The result: billions of tokens of human-generated question-and-answer exchange, argument, humor, and domain expertise.
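The karma threshold described above amounts to a one-line filter. The sketch below is illustrative only: the Submission record, the select_urls function, and the sample posts are hypothetical, and OpenAI's real pipeline included further cleaning steps not shown here.

```python
from dataclasses import dataclass

MIN_KARMA = 3  # threshold described in the text

@dataclass
class Submission:
    url: str    # outbound link posted to Reddit
    karma: int  # net score of the post

def select_urls(submissions, min_karma=MIN_KARMA):
    """Return de-duplicated outbound URLs from posts that meet the karma threshold."""
    seen = set()
    selected = []
    for s in submissions:
        if s.karma >= min_karma and s.url not in seen:
            seen.add(s.url)
            selected.append(s.url)
    return selected

posts = [
    Submission("https://example.org/a", 5),
    Submission("https://example.org/b", 2),   # below threshold, dropped
    Submission("https://example.org/a", 40),  # duplicate URL, dropped
    Submission("https://example.org/c", 3),
]
kept = select_urls(posts)
print(kept)
```

The filter uses other people's votes as a free quality signal. Neither the person who posted the link nor the people who voted on it were asked about this use.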
By 2023, Reddit and Stack Overflow recognized that their communities had effectively been harvested to train commercial AI products that now competed with their own traffic. In April 2023, Stack Overflow announced a policy requiring AI companies to pay for API access to its data. In June 2023, Reddit CEO Steve Huffman announced aggressive API pricing that would make third-party data access prohibitively expensive — widely understood as aimed at AI scrapers. Both moves arrived years after the most significant training runs had already completed. The data was already inside the models.
Reddit did subsequently negotiate a $60 million annual data licensing deal with Google, reported by Bloomberg in February 2024 ahead of Reddit's IPO. This established a market price for the kind of conversational data that had previously been taken without payment — but it did not retroactively compensate the millions of Reddit users who wrote the content.
If you have ever posted publicly on Reddit, written a public blog, contributed to Wikipedia, published a review on Yelp or Amazon, commented on a news article, or maintained a public social media account — your text is statistically likely to be part of at least one major language model's training data. This is not speculation; it is a consequence of how Common Crawl, WebText, and similar pipelines were constructed and which sources they prioritized.
Understanding what "training data" means requires understanding what training actually does to that data. A language model does not store text like a database. It compresses statistical relationships among tokens (words and word fragments) into a set of numerical parameters called weights. GPT-3 has 175 billion such parameters. During training, the model reads a sequence of tokens, predicts the next token, compares its prediction to the actual next token, and adjusts its weights slightly to reduce the error. The adjustment step is stochastic gradient descent (or a close variant), and it repeats across the training corpus. In GPT-3's case, the paper reports that the smaller, higher-quality sources were seen two to three times, while Common Crawl and Books2 were sampled less than once.
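The predict, compare, adjust loop can be shown at toy scale. The sketch below trains a table of scores to predict the next word from the current word only. It is a deliberately tiny stand-in for a real language model: the twelve-word corpus, the learning rate, and the one-word context are all assumptions chosen to keep the example readable.

```python
import math
import random

text = "the cat sat on the mat the cat sat on the hat"
tokens = text.split()
vocab = sorted(set(tokens))
index = {tok: i for i, tok in enumerate(vocab)}
# Training examples: (current token, next token) pairs from the corpus.
pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]

V = len(vocab)
# weights[i][j] is the score for token j following token i.
weights = [[0.0] * V for _ in range(V)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss():
    """Mean cross-entropy of the true next token over all training pairs."""
    return sum(-math.log(softmax(weights[c])[n]) for c, n in pairs) / len(pairs)

def train(epochs=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        order = pairs[:]
        rng.shuffle(order)  # "stochastic": one example at a time, in random order
        for cur, nxt in order:
            probs = softmax(weights[cur])
            for j in range(V):
                # Gradient of cross-entropy with respect to each score.
                grad = probs[j] - (1.0 if j == nxt else 0.0)
                weights[cur][j] -= lr * grad

def predict_next(token):
    scores = weights[index[token]]
    return vocab[scores.index(max(scores))]

loss_before = average_loss()
train()
loss_after = average_loss()
print(loss_before, loss_after, predict_next("cat"))
```

After training, no sentence from the corpus is stored anywhere, yet the table predicts that "sat" follows "cat". That is the sense in which text becomes weights.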
The result is that no individual sentence is stored — but the patterns of language use, factual associations, stylistic tendencies, and even specific recurring phrases become encoded in the weight matrix. Researchers have demonstrated memorization: in 2021, Google researcher Nicholas Carlini and colleagues published a paper showing that GPT-2 could be induced to reproduce verbatim text from its training data — including a specific person's name paired with their phone number — when prompted correctly. This was not a bug in the traditional sense; it was a consequence of data appearing frequently enough that the model's weights encoded it with high fidelity.
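One simple way to reason about memorization is to measure how long a run of words in a model's output matches a known source word for word. The sketch below does that with Python's difflib. It is a simplified check, not the extraction method used in the research described above, and the sample strings (including the fictional name and number) are invented for illustration.

```python
from difflib import SequenceMatcher

def longest_verbatim_overlap(generated, training_doc):
    """Return the longest run of consecutive words shared by both texts."""
    gen = generated.split()
    doc = training_doc.split()
    matcher = SequenceMatcher(None, gen, doc, autojunk=False)
    match = matcher.find_longest_match(0, len(gen), 0, len(doc))
    return gen[match.a : match.a + match.size]

# Hypothetical training document and hypothetical model output.
training_doc = "for press inquiries contact Jane Example at 555 0100 during office hours"
generated = "you could try to contact Jane Example at 555 0100 or send a letter"

overlap = longest_verbatim_overlap(generated, training_doc)
print(len(overlap), " ".join(overlap))
```

A long exact overlap that includes a name and a number is the kind of output that suggests reproduction rather than paraphrase, although a real audit would need access to the training corpus to confirm it.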
The takeaway for this course: your data does not sit in a folder marked "user data." It is dissolved into a statistical structure that can sometimes reconstitute fragments of it — including personal details — under the right prompting conditions.
Lesson Summary: Modern large language models are trained primarily on web-crawl data (dominated by Common Crawl), supplemented by books (some obtained via shadow libraries), and human-generated dialogue platforms like Reddit and Stack Overflow. This data was collected without individual consent or compensation. The training process compresses these sources into numerical weights that can, under certain conditions, reproduce verbatim training content. Lawsuits from authors and publishers are currently testing whether this constitutes copyright infringement. Legal outcomes remain pending.
In this lab, you'll interrogate an AI assistant about how training data is collected, filtered, and transformed into model weights. Ask about specific sources, ask where your own data might have ended up, or challenge the AI to explain why models memorize certain content. The goal is to deepen your mental model of the pipeline from web page to language model.
In 2014, a Cambridge University researcher named Aleksandr Kogan built a Facebook quiz app called "thisisyourdigitallife." It collected psychological profile data — not just from the roughly 270,000 people who installed it, but from all of their friends, via Facebook's then-permissive API. The result was a dataset covering an estimated 87 million Facebook profiles. Kogan sold this data to Cambridge Analytica, a political consulting firm that claimed to use it to build psychographic models capable of predicting voter personality and targeting political advertising accordingly. The story broke publicly in March 2018, forcing Facebook CEO Mark Zuckerberg to testify before Congress. What the episode illustrated was not primarily a hacking story — it was an inference story: from apparently innocuous behavioral signals (quiz answers, likes, page follows), it was possible to model intimate psychological attributes with meaningful accuracy.
The scientific foundation for Cambridge Analytica's claims, disputed in its commercial applications but grounded in real research, came from work by Michal Kosinski, then at Cambridge University's Psychometrics Centre. In a 2013 paper published in PNAS (Proceedings of the National Academy of Sciences), Kosinski and colleagues analyzed 58,000 volunteers who had completed a personality questionnaire and shared their Facebook likes. Using a relatively simple machine learning pipeline (dimensionality reduction followed by regression), they found that Facebook likes alone could distinguish gay from heterosexual men in 88% of cases, African American from Caucasian American users in 95%, Democrats from Republicans in 85%, Christians from Muslims in 82%, and users whose parents had separated before they turned 21 from those whose parents had not in 60%. These figures measure how well the model separates two groups (area under the ROC curve), not accuracy over the whole population.
The model used no demographic self-report data — only behavioral signals (which pages someone had liked). The likes with highest predictive power were frequently not the obvious ones: liking "Curly Fries" correlated with high intelligence; liking "Being Confused After Waking Up from Naps" predicted certain personality traits. The point is not that these are causal relationships, but that any large behavioral dataset contains enough correlated signal to model attributes that users never disclosed.
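The mechanism can be sketched with synthetic data. The example below generates invented "likes" for 400 imaginary users, where five of twenty pages happen to be liked more often by users with some hidden trait, and fits a plain logistic regression. Everything here is an assumption for illustration: the data is random, the probabilities are made up, and the dimensionality-reduction step used in the published study is omitted.

```python
import math
import random

N_PAGES = 20      # hypothetical pages a user can like
SIGNAL_PAGES = 5  # pages 0-4 correlate with the hidden trait (assumption)

def make_user(trait, rng):
    """Return a 0/1 like vector for one synthetic user."""
    likes = []
    for page in range(N_PAGES):
        if page < SIGNAL_PAGES:
            p = 0.7 if trait else 0.1
        else:
            p = 0.3  # noise pages, unrelated to the trait
        likes.append(1 if rng.random() < p else 0)
    return likes

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    return b + sum(wj * xj for wj, xj in zip(w, x))

def train(X, y, epochs=50, lr=0.1):
    """Logistic regression fitted by stochastic gradient descent."""
    w, b = [0.0] * N_PAGES, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, b, xi)) - yi
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

rng = random.Random(0)
traits = [i % 2 for i in range(400)]
users = [make_user(t, rng) for t in traits]

# Fit on the first 300 users, evaluate on the remaining 100.
w, b = train(users[:300], traits[:300])
predictions = [1 if score(w, b, x) > 0 else 0 for x in users[300:]]
accuracy = sum(p == t for p, t in zip(predictions, traits[300:])) / 100
print(accuracy)
```

The model never sees the trait for the held-out users, only their likes. No single page gives the trait away, but the combination does.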
Modern language models trained on user-generated text can operate similarly. If your writing style, vocabulary choices, topic interests, and posting times are all present in training data — or in a system that can query a model fine-tuned on your data — the model has access to a dense behavioral fingerprint.
Kosinski, Stillwell, and Graepel (2013). "Private traits and attributes are predictable from digital records of human behavior." PNAS 110(15), pp. 5802–5805. The paper is publicly available and has been cited over 3,000 times. It specifically demonstrates that Facebook likes — a passive behavioral signal — can predict sensitive personal attributes with high accuracy using standard machine learning methods available a decade ago.
Base language models trained on general web data make inferences about categories of people. Fine-tuning — the process of continuing to train a model on a smaller, targeted dataset — allows it to learn a specific individual or organization's patterns. Several documented commercial applications have raised inference concerns.
In 2023, the company Replika, which creates AI companions trained on user conversation histories, faced controversy when it attempted to remove explicitly romantic features from its product. Users who had spent months building conversation histories with their AI companions reported that the system had built up a detailed picture of their emotional patterns, relationship history, and psychological vulnerabilities. The company held the training data and could, in principle, use it to build detailed psychological profiles of its 10 million registered users.
More broadly, the structure of retrieval-augmented generation (RAG) — a technique where a model is given access to a user's personal documents or history at query time — means that enterprise AI deployments increasingly have access to data that enables sharp individual inference: email history, calendar patterns, document authorship, editing behavior. Microsoft's Copilot, integrated with Microsoft 365, operates in exactly this mode — it has access to a user's email, calendar, Teams messages, and documents when generating responses.
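A stripped-down version of retrieval-augmented generation shows why access matters more than the model itself. The sketch below ranks a user's files by word overlap with a question and pastes the best matches into a prompt. It is an assumption-laden toy: production systems typically rank by vector embeddings rather than word overlap, the function names are hypothetical, and no model is actually called.

```python
import re

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=2):
    """Return up to k documents sharing at least one word with the query, best first."""
    q = tokenize(query)
    scored = [(len(q & tokenize(doc)), i) for i, doc in enumerate(documents)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[i] for overlap, i in scored[:k] if overlap > 0]

def build_prompt(query, documents, k=2):
    """Assemble the text that would be sent to a language model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Context from the user's files:\n{context}\n\nQuestion: {query}"

# Hypothetical contents of one user's mailbox, calendar, and documents.
user_files = [
    "Email: dentist appointment moved to Tuesday 3pm",
    "Calendar: quarterly budget review with finance on Friday",
    "Doc: draft resignation letter, not yet sent",
]

prompt = build_prompt("When is my dentist appointment?", user_files)
print(prompt)
```

Only the dentist email reaches the prompt for this question, but the retrieval step had to be able to read every file, including the unsent resignation letter, in order to choose.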
A single data point rarely reveals much. Your posting that you enjoy hiking is innocuous. Your posting at 11 p.m. suggests something about your schedule. A sequence of hiking posts during the same months, combined with location metadata, work-related posts on weekdays, and purchases at outdoor retailers, begins to construct a detailed individual profile — even if no single element is sensitive. This is the aggregation problem, and it is structurally why AI inference is more powerful than prior data analysis tools.
In 2013, a study published in Scientific Reports by de Montjoye and colleagues demonstrated that mobile phone metadata (records with no content, only timestamps and cell tower locations) was sufficient to uniquely identify 95% of individuals in a dataset of 1.5 million people using just four data points. The researchers called these "data fingerprints." AI systems trained on richer data, which includes content and not just metadata, operate with far larger fingerprint surfaces.
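The uniqueness measurement behind this result can be sketched in a few lines: pick a handful of points from one person's trace and count how many people in the dataset are consistent with them. The traces below are invented, the unicity function is a hypothetical name, and a three-person dataset only illustrates the mechanics, not the published figures.

```python
import random

def unicity(traces, k, seed=0):
    """Fraction of people uniquely pinned down by k of their own (tower, hour) points."""
    rng = random.Random(seed)
    unique = 0
    for person, points in traces.items():
        sample = set(rng.sample(sorted(points), min(k, len(points))))
        # Everyone whose trace contains all the sampled points is a candidate.
        matches = [p for p, pts in traces.items() if sample <= pts]
        if matches == [person]:
            unique += 1
    return unique / len(traces)

# Hypothetical traces: each point is (cell tower, hour of day).
traces = {
    "A": {("tower1", 8), ("tower2", 12), ("tower3", 18), ("tower1", 22)},
    "B": {("tower1", 8), ("tower2", 12), ("tower4", 18), ("tower1", 22)},
    "C": {("tower1", 8), ("tower5", 12), ("tower3", 18), ("tower6", 22)},
}

for k in (1, 2, 4):
    print(k, unicity(traces, k))
```

All three people share the same morning tower, so one point is often not enough. With all four points, every trace in this toy dataset is unique.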
A practical consequence: anonymization is harder than it sounds. In 2006, Netflix released a dataset of 100 million anonymized movie ratings as part of a machine learning competition. Researchers Arvind Narayanan and Vitaly Shmatikov demonstrated in a 2008 paper that by cross-referencing with public IMDb reviews, they could de-anonymize specific individuals from the dataset, revealing their political preferences and other private information, with high confidence, using only a handful of anchor data points.
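The cross-referencing idea in the Netflix example can be sketched as a simple linkage: compare one person's public ratings against every anonymized record and see whether a single record stands out. This is a simplified illustration with invented data and a hypothetical best_match function. The published attack used a more tolerant similarity score that allowed for approximate dates and ratings, which this exact-match version does not.

```python
def best_match(public_reviews, anonymized, min_overlap=3):
    """Return the anonymized ID that clearly overlaps most with the public reviews, or None."""
    scores = {
        anon_id: len(public_reviews & ratings)
        for anon_id, ratings in anonymized.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    top_id, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    # Require enough anchor points and a clear winner before claiming a match.
    if top >= min_overlap and top > runner_up:
        return top_id
    return None

# Hypothetical "anonymized" release: ID -> set of (title, stars).
anonymized = {
    "user_1041": {("Film A", 5), ("Film B", 2), ("Film C", 4), ("Film D", 1)},
    "user_2290": {("Film A", 5), ("Film E", 3), ("Film F", 4)},
    "user_3377": {("Film G", 2), ("Film H", 5)},
}

# Hypothetical public reviews posted under a real name elsewhere.
public_reviews = {("Film A", 5), ("Film B", 2), ("Film C", 4)}

print(best_match(public_reviews, anonymized))
```

Once the link is made, every other rating in the matched record (here, Film D) is attached to a named person, even though that rating was never posted publicly.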
The aggregation of individually innocuous data points — your public posts, behavioral patterns, purchase signals, and location history — creates a statistical surface from which an AI system can infer sensitive attributes you never disclosed. This is not theoretical: it has been demonstrated across multiple peer-reviewed studies using data that most people considered insufficiently private to protect carefully.
Lesson Summary: AI inference is the process of deducing unstated attributes from behavioral signals. The Kosinski et al. (2013) study demonstrated that Facebook likes alone predict sexual orientation, political affiliation, and whether a user's parents had separated, with accuracy above chance. The Cambridge Analytica episode showed this research had real-world commercial application. Fine-tuning on personal data enables even sharper individual-level inference. The aggregation of individually innocuous signals compounds privacy exposure, as demonstrated by Netflix dataset de-anonymization (Narayanan & Shmatikov, 2008).
In this lab, you'll explore what kinds of inferences are possible from different data signals. Describe a hypothetical user's digital behavior and ask what could be inferred. Challenge the AI to explain the limits of inference. Ask how aggregation amplifies what any single signal reveals.
On March 31, 2023, Italy's data protection authority — the Garante per la protezione dei dati personali — ordered OpenAI to immediately block ChatGPT for Italian users. The stated grounds: no legal basis for collecting Italian users' personal data to train the model, combined with the absence of any mechanism for users to correct or delete data about themselves held inside the model. It was the first time a Western government had taken direct enforcement action against a major AI system under data protection law. OpenAI blocked the service for Italian users within days. By April 28, 2023, ChatGPT had returned to Italy after OpenAI implemented a privacy disclosure page, an age verification system, and an opt-out mechanism for EU residents. The episode illustrated both what data protection law could accomplish and the limits of its remedies — because the opt-out applied to future training, not to data already embedded in existing model weights.
The European Union's General Data Protection Regulation (GDPR), which became applicable on May 25, 2018, is the most comprehensive data protection framework in the world. Its core principles, as they apply to AI training, are significant: lawfulness, fairness, and transparency (Article 5), meaning you must tell people what data you're collecting and why; purpose limitation, meaning data collected for one purpose cannot be repurposed without a legal basis; data minimization, meaning you may only collect what's necessary; the right to erasure (Article 17), under which individuals can request deletion of their data; and the right to object (Article 21), under which individuals can object to processing based on legitimate interests.
The challenge for AI training is that GDPR was designed for databases — discrete records that can be located and deleted. A language model's weights do not contain discrete records. When someone exercises their right to erasure against a company that trained a model on their data, the company faces a structural problem: it cannot surgically remove one person's contribution from 175 billion parameters without retraining the model from scratch. OpenAI's response to the Italian Garante did not include a mechanism for weight-level erasure — because none exists at commercial scale. Researchers are actively working on machine unlearning techniques, but they remain experimental as of 2024.
The GDPR has produced real enforcement actions. In January 2023, Ireland's Data Protection Commission fined Meta €390 million for using personal data from Facebook and Instagram to target advertising without adequate legal basis. In May 2023, Meta was fined an additional €1.2 billion — the largest GDPR fine in history at the time — for transferring EU user data to U.S. servers without adequate protections. These fines concern data use, not AI training specifically, but they establish that GDPR has meaningful teeth against large tech companies.
GDPR Article 17 ("Right to erasure") contains an important caveat: it does not apply where processing is necessary for "the establishment, exercise or defence of legal claims," or where it conflicts with freedom of expression. More practically, regulators and courts have not yet ruled definitively on whether a model trained on personal data "contains" that data in a legally meaningful sense — a question that will determine whether right-to-erasure claims against AI companies can succeed.
Unlike the EU, the United States has no comprehensive federal data privacy law as of 2024. Privacy protection in the U.S. derives from a patchwork of sector-specific statutes: HIPAA covers medical records but not health searches. COPPA covers children under 13. FERPA covers student educational records. GLBA covers financial data. None of these directly govern AI training data collection from general web content or social media.
State-level legislation has partially filled this gap. The California Consumer Privacy Act (CCPA), effective January 2020, gives California residents the right to know what personal data companies hold, the right to delete it, and the right to opt out of its sale. The California Privacy Rights Act (CPRA), effective January 2023, expanded these protections and created a new agency (the California Privacy Protection Agency) to enforce them. Virginia, Colorado, Connecticut, and Texas have passed similar statutes. However, enforcement against AI training specifically, as opposed to data brokerage or targeted advertising, has been limited.
The most active legal front in U.S. AI privacy law is copyright litigation, not data protection. The Authors Guild v. OpenAI (SDNY, 2023), Getty Images v. Stability AI (D. Del., 2023), and Andersen v. Stability AI (N.D. Cal., 2023) cases all argue that training on copyrighted material constitutes infringement. These cases turn on whether AI training constitutes "fair use" under 17 U.S.C. § 107 — a question no federal circuit court has yet answered definitively for this technology.
When AI companies respond to regulatory pressure by offering opt-outs from training data collection, a structural problem emerges: the opt-out cannot reach data already used to train existing models. OpenAI's privacy policy as updated in 2023 allows users to opt out of having their ChatGPT conversation data used for future training — but GPT-3, GPT-4, and other models trained on web data collected before that policy existed are not affected. The data is already compressed into weights.
Meta introduced a privacy center in the EU allowing users to object to their public Facebook and Instagram posts being used for AI training — required by GDPR enforcement pressure — but clarified that this applied to future AI model training, not data already incorporated. The Norwegian Consumer Council (Forbrukerrådet) published a 2023 report arguing that these opt-out mechanisms fail GDPR's requirement that data processing based on "legitimate interests" must allow individuals a genuine right to object — and that retroactive impossibility does not satisfy that requirement.
The practical position for anyone whose data was in Common Crawl, Reddit, Stack Overflow, or published books before 2023 is clear: their contribution to training sets has already occurred. The question of remedy — what compensation or control is available after the fact — is being argued simultaneously in regulatory proceedings in Brussels, Dublin, and Rome, and in courtrooms in San Francisco and New York.
As of mid-2024: GDPR provides the strongest framework but faces structural limits when applied to model weights. U.S. law has no federal equivalent; copyright litigation is the most active battleground. The Italy/Garante enforcement showed that regulators can compel procedural changes quickly. But no jurisdiction has yet compelled a major AI lab to delete or retrain a model based on privacy grounds — because no one has demonstrated a workable technical path for doing so.
Lesson Summary: GDPR is the world's most comprehensive data protection framework and has produced significant enforcement actions against tech companies, but its application to AI model weights is structurally unresolved — the "right to erasure" cannot currently be meaningfully applied to parameters. U.S. law relies on a state-level patchwork and copyright litigation. The Italy-ChatGPT episode of March–April 2023 showed regulatory enforcement can compel procedural changes but not retroactive weight modification. Machine unlearning remains an active research area without production solutions.
Use this lab to work through legal scenarios involving AI data practices. Ask the AI to explain how GDPR would apply to a specific situation, what options a U.S. author has against a company that trained on their book, or why the opt-out paradox creates real limits on rights enforcement.
On July 14, 2023, the Screen Actors Guild–American Federation of Television and Radio Artists joined the Writers Guild of America on strike — the first simultaneous major Hollywood labor stoppage since 1960. One of SAG-AFTRA's central demands was protection against studios using AI to scan actors' likenesses and voices to generate synthetic performances without consent or compensation. The studios had proposed clauses that would allow scanning a background actor for a single day's pay and then using that digital model indefinitely. The actors refused. After 118 days of strike, the November 2023 contract included provisions requiring informed consent for AI scanning and compensation equivalent to what the work would have paid. It was a documented, negotiated win — not comprehensive, but real. It established a precedent that consent and compensation for AI training on personal likeness data could be contractually mandated. The model for other domains is not identical, but the mechanism — collective bargaining, enforceable contracts — is transferable.
Several major AI companies now offer mechanisms to limit or stop data collection for training. Knowing their actual scope is necessary for using them well. OpenAI: ChatGPT users can turn off "Improve the model for everyone" in Settings → Data Controls. This stops future conversation data from being used in training. It does not affect data already used. By default, data sent through the API is not used for training unless the customer opts in. Google: Gemini (formerly Bard) offers a similar toggle. Google's broader data collection for advertising, which feeds its own AI development, is governed by your Google account privacy settings, which are extensive but complex. Meta: EU and UK users gained the ability in 2024 to object to Meta using public posts for AI training, under GDPR pressure. The objection form is accessible but not prominently surfaced.
What none of these mechanisms can do: remove your data from models already trained. The Italian Garante case confirmed this — OpenAI's remediation did not include retroactive weight modification. The opt-out matters for future exposure, not past exposure. If you have been an active public internet user before 2023, your historical data is almost certainly already incorporated.
Several concrete steps reduce future data exposure without requiring unrealistic behavioral changes. Robots.txt and noindex directives: If you own a website or blog, adding the two lines "User-agent: CCBot" and "Disallow: /" to your robots.txt file asks Common Crawl's crawler to skip the site in future snapshots. This does not remove pages already captured. A noindex meta tag, by contrast, addresses search engine indexing rather than crawling. WordPress, Ghost, and other CMS platforms support both via plugins or manual configuration. Platform privacy settings: Setting social media accounts to private removes them from public crawl scope, though data already captured while accounts were public remains in existing snapshots. Limiting platform diversity: Using fewer platforms, more selectively, reduces the cross-platform aggregation surface that enables de-anonymization. Request mechanisms: Both Google and Bing have URL removal tools that can request de-indexing of specific pages, which may reduce their representation in future Common Crawl snapshots (since Common Crawl partially follows search index signals).
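Before relying on a robots.txt rule, it helps to check how a compliant crawler would read it. The sketch below uses Python's standard urllib.robotparser to test the CCBot rule from the paragraph above against a sample file. The example.org URL and the second user-agent name are placeholders, and the check only shows what a rule-following crawler would do, not what every crawler does.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks Common Crawl's bot and allows everyone else.
robots_txt = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://example.org/blog/post-1"  # placeholder URL
ccbot_allowed = parser.can_fetch("CCBot", page)
other_allowed = parser.can_fetch("SomeSearchBot", page)  # placeholder crawler name

print("CCBot may fetch:", ccbot_allowed)
print("Other crawlers may fetch:", other_allowed)
```

To block additional AI crawlers, the same pattern applies with each crawler's published user-agent string. The strings themselves need to be taken from each operator's own documentation.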
Researcher Chris Callison-Burch at the University of Pennsylvania published a 2023 analysis noting that robots.txt compliance among AI training crawlers is inconsistent — Common Crawl generally respects robots.txt, but some AI training-specific crawlers do not. The landscape is moving faster than documentation of crawler behavior can track.
If you want to minimize future AI training data exposure: (1) Enable opt-outs in ChatGPT, Gemini, and any AI tools you use actively. (2) If you own a web property, add CCBot block directives to robots.txt. (3) Set unused social accounts to private or delete them. (4) Exercise GDPR rights (if in EU/UK) via company privacy centers — not to fix past exposure, but to limit future data use and to establish legal record of your objection. (5) If you are a creator, understand what licensing or collective agreements apply to your work.
Individual opt-outs are low-leverage relative to structural change. The SAG-AFTRA example demonstrates that collective bargaining can establish meaningful consent and compensation requirements. For writers, the Authors Guild and National Writers Union are pursuing both litigation and advocacy for opt-out registries and statutory licensing schemes — a model analogous to how the music industry's ASCAP and BMI collect royalties for public performance. No equivalent system exists yet for text, but it is the direction several advocacy organizations are pushing in Washington and Brussels.
Legislative progress is uneven but real. The EU AI Act — agreed in December 2023 and formally adopted in 2024 — includes transparency requirements for general-purpose AI models: companies must publish summaries of training data, and models must comply with EU copyright law, which is interpreted to require opt-out mechanisms for rights holders. This is weaker than opt-in consent, but it is enforceable infrastructure that did not exist before. The AI Act's transparency provisions take effect for most models in August 2025.
In the U.S., the American Data Privacy and Protection Act (ADPPA) passed the House Energy and Commerce Committee in 2022 with bipartisan support but stalled in the full House, partly due to tension with California's CPRA. A federal comprehensive privacy law remains possible in the next legislative session; its passage would create GDPR-like rights nationwide and immediately become the primary legal framework for AI training data disputes.
One honest limit of this course: we do not know precisely what data GPT-4, Claude, or Gemini were trained on. OpenAI's GPT-4 technical report (March 2023) explicitly declined to disclose training data composition, citing competitive concerns. Google's PaLM 2 and Gemini papers similarly lack specificity. Anthropic has not published detailed training data disclosure for Claude. The EU AI Act's transparency requirements, when they take effect, may change this — but the data disclosure required is a "summary," not full specification.
This means that asserting with certainty whether any given piece of your content is inside any given model is not currently possible for most content. What is demonstrable is the class of data that is very likely present: publicly indexed web pages, Reddit posts, Stack Overflow questions, published books — the building blocks documented in the papers and lawsuits covered in this course.
Living with this uncertainty productively means calibrating your digital behavior with an awareness that public text is, by default, available for training purposes absent specific technological or legal barriers — and making choices accordingly, based on the actual levers available to you rather than either panic or complacency.
Lesson Summary: Practical responses to AI training data collection include enabling opt-outs in AI tools (for future training only), blocking crawlers via robots.txt on owned web properties, and exercising GDPR rights in jurisdictions where they apply. Collective mechanisms — union contracts like SAG-AFTRA's 2023 agreement, author advocacy for statutory licensing, and legislative frameworks like the EU AI Act — offer more leverage than individual action. The ADPPA remains stalled in the U.S. Congress. Full training data transparency does not currently exist for the largest commercial models.
In this final lab, you'll work with the AI to assess your specific situation and build a realistic action plan. Tell it about your digital presence — platforms, content types, jurisdiction — and ask it to help you prioritize the actions most likely to actually matter. Push back on any advice that sounds like theater rather than substance.