AI & Media · Introduction

The news used to be what journalists reported. Now it's what algorithms surface.

Media literacy is becoming a different skill. This course is about the new one.

For most of the twentieth century, the news was a relatively knowable thing. A handful of national broadcasters, a few national newspapers, a local paper — you knew where it came from, who chose it, and roughly what their incentives were. You could disagree with it, but you could locate it.

That has been coming apart for twenty years, and AI is finishing the job. The news you see is increasingly chosen by algorithms optimizing for engagement. Some of it is written or summarized by AI, and an increasing fraction is entirely synthetic. Your media experience may differ meaningfully from that of the person sitting next to you on the train.

This course is about media in the age of AI, for both consumers and producers. It covers how algorithmic feeds actually select content, the state of AI-generated news, the economics of the news industry under AI, how to detect synthetic media, the emerging provenance standards, and the skill of navigating an information environment where the old heuristics (trust this outlet, distrust that one) no longer suffice on their own.

If you finish every module, here's who you become:

  • You'll understand how algorithmic feeds actually select what you see — and whose interests those selections serve.
  • You'll be able to identify synthetic media, from AI-written news copy to deepfake video, using current detection methods and provenance standards.
  • You'll know why the Los Angeles Times published a story three minutes after an earthquake, and what that moment revealed about where journalism was already heading.
  • You'll think critically about content moderation at scale — understanding where automated speech policing works, where it breaks, and why the failures are rarely random.
  • You'll be able to navigate an information environment where outlet-level trust is no longer enough, replacing old heuristics with more durable ones.
  • You'll understand the economics forcing news organisations toward AI — and what that pressure means for the journalism that survives it.
  • You'll become someone who reads the media environment itself, not just the stories inside it.
Lesson 1 · AI & Media — Module 1

How AI Entered the Newsroom

From wire-copy automation to large language models — tracing the arc of machine-assisted journalism.
When did news organizations first hand writing to a machine, and what crossed the line from tool to journalist?

On March 17, 2014, the Los Angeles Times published a story about a 4.4-magnitude earthquake near Westwood, California. The article appeared about three minutes after the quake struck. No human reporter wrote it. A program called Quakebot, built by Times journalist and programmer Ken Schwencke, pulled data from the USGS feed, slotted it into a template, and saved the draft in the CMS. Schwencke, woken by the quake, reviewed the draft and published it.

That article was not the first automated text in journalism. But it was among the first to be indistinguishable in format from a reporter's byline piece and to circulate widely as an example of what automation could do at deadline speed.

Automation Before "AI"

The Associated Press began using Automated Insights' Wordsmith platform to generate quarterly earnings reports in 2014, the same year as Quakebot's debut. By 2016 the AP was producing 3,700 earnings stories per quarter — roughly ten times its previous human output — covering companies that would otherwise receive no coverage at all. The platform used structured financial data and fill-in-the-blank natural-language templates, not neural networks.

The Washington Post deployed its own system, Heliograf, for the 2016 Rio Olympics and the 2016 U.S. election. Heliograf generated short updates when election results crossed preset thresholds. The Post reported Heliograf produced more than 500 short articles during election night. Again: structured data in, templated sentences out.

These systems are correctly called natural language generation (NLG) rather than artificial intelligence in the modern sense. They do not learn; they execute rules. Understanding this distinction matters because the public conversation about "AI in journalism" often conflates rule-based automation with machine learning.
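The template approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation of Wordsmith or Heliograf: the field names, the sample figures, and the wording rules are all hypothetical. It shows why such systems execute rules rather than learn: every sentence the program can produce is fixed in advance by its template and its branching logic.

```python
# Hypothetical structured input, standing in for a financial data feed.
report = {
    "company": "Example Corp",
    "quarter": "Q2",
    "eps": 1.42,           # earnings per share, in dollars
    "eps_forecast": 1.30,  # analysts' consensus forecast (made-up figure)
}

# A fill-in-the-blank sentence: the wording is written by a human in advance.
TEMPLATE = (
    "{company} reported {quarter} earnings of ${eps:.2f} per share, "
    "{verdict} analyst expectations of ${eps_forecast:.2f}."
)


def verdict(eps: float, forecast: float) -> str:
    """Choose a phrase using fixed rules; nothing here is learned from data."""
    if eps > forecast:
        return "beating"
    if eps < forecast:
        return "missing"
    return "matching"


def write_earnings_sentence(data: dict) -> str:
    """Slot the structured values into the template."""
    return TEMPLATE.format(verdict=verdict(data["eps"], data["eps_forecast"]), **data)


print(write_earnings_sentence(report))
```

Changing the output style means editing the template or the rules by hand, which is the practical difference from an LLM prompted in open-ended language.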

Why It Matters

Newsrooms that adopted early NLG tools freed reporters from commodity data work. The AP's journalists, relieved of writing thousands of boilerplate earnings stories, could pursue investigative work. Automation expanded coverage breadth without proportionally expanding staff — a trade-off that shaped how editors would later approach large language models.

The LLM Turn: 2022–2023

The release of ChatGPT in November 2022 marked a qualitative shift. Unlike Wordsmith or Heliograf, large language models can generate coherent prose from unstructured prompts without pre-written templates. News organizations responded rapidly — and inconsistently.

CNET quietly published more than 70 AI-generated personal finance articles between November 2022 and January 2023. When Futurism broke the story in January 2023, CNET acknowledged the practice and paused the program after editors found factual errors, plagiarism-adjacent phrasing, and compound interest miscalculations in multiple articles. The episode became a widely cited case study in the risks of publishing LLM output with insufficient human review.

Sports Illustrated faced similar scrutiny in November 2023 when Futurism again reported the outlet had published product-recommendation articles under fictitious author names with AI-generated headshots. The publisher, The Arena Group, initially denied the articles were AI-written, then acknowledged using a third-party content vendor, and subsequently terminated its relationship with that vendor.

These cases established a reputational baseline: newsrooms deploying generative AI without robust editorial oversight risk credibility damage that can be swift and severe.

Key Concepts
NLG: Natural Language Generation. Rule-based systems that produce text from structured data using pre-written templates. No learning occurs.
LLM: Large Language Model. A neural network trained on vast text corpora that generates prose from open-ended prompts. Output is probabilistic, not rule-bound.
Automated Journalism: The use of software, whether NLG or LLM, to produce publishable text with reduced or no human writing involvement.
Human-in-the-Loop: An editorial workflow in which a human reviews and approves AI-generated content before publication. CNET's 2022–2023 program lacked adequate human-in-the-loop review.
Documented Timeline

  • 2014: AP adopts Wordsmith for earnings reports; LA Times publishes Quakebot earthquake story.
  • 2016: Washington Post deploys Heliograf for Olympics and election coverage.
  • Nov 2022: CNET begins AI article program.
  • Jan 2023: Futurism exposes CNET; program paused after error audit.
  • Nov 2023: Sports Illustrated AI-author scandal; Arena Group terminates vendor contract.

Lesson 1 Quiz

How AI Entered the Newsroom · 5 questions
1. What program did the Los Angeles Times use in 2014 to automatically publish earthquake stories within minutes of a seismic event?
Correct. Quakebot was built by LA Times journalist Ken Schwencke. It pulled USGS data and published the earthquake story three minutes after the 4.4-magnitude Westwood quake struck.
Not quite. Quakebot was the LA Times program. Heliograf was the Washington Post's system; Wordsmith was used by the AP.
2. Approximately how many earnings stories per quarter was the AP producing using Automated Insights' Wordsmith by 2016?
Correct. The AP reported approximately 3,700 earnings stories per quarter — roughly ten times its previous human output — covering companies that previously received no coverage.
The correct figure is approximately 3,700 per quarter, about ten times the AP's previous human output for earnings stories.
3. The Washington Post's Heliograf system was first deployed for which two events in 2016?
Correct. Heliograf debuted at the 2016 Rio Olympics and was used on election night 2016, generating over 500 short articles as results came in.
Heliograf's documented first deployments were the 2016 Rio Olympics and the 2016 U.S. presidential election.
4. What is the key technical difference between NLG systems like Wordsmith and large language models like ChatGPT?
Correct. NLG systems execute pre-written rules — structured data fills template slots. LLMs are neural networks that generate text probabilistically, without hard-coded templates.
The core difference is architectural: NLG fills templates with structured data; LLMs generate probabilistic prose from open-ended inputs using learned statistical patterns.
5. Which outlet broke the story that CNET had been quietly publishing AI-generated articles, and what specific problem triggered the public audit?
Correct. Futurism reported in January 2023 that CNET had published 70+ AI articles. An internal audit found factual errors and plagiarism-adjacent phrasing; CNET paused the program.
Futurism broke the CNET story in January 2023. The subsequent internal review found factual errors, including compound interest miscalculations, and plagiarism-adjacent phrasing.

Lab 1 — Newsroom Automation Audit

Analyze real deployment decisions with an AI research assistant

Your Task

You are a digital editor at a mid-size regional newspaper. Your publisher wants to know whether deploying an automated system for local government data stories is worth the risk. Use the AI assistant below to work through the key questions an editorial team must answer before adopting automation.

Starter prompt: "What editorial safeguards should a newsroom put in place before publishing AI-generated content, based on what happened at CNET and the AP?"
AI Research Assistant
Hello! I'm here to help you think through the editorial and ethical dimensions of newsroom automation. We'll look at real cases — the AP's Wordsmith deployment, the LA Times Quakebot, the CNET controversy, and more — to help you make a grounded recommendation. What would you like to explore first?
Lesson 2 · AI & Media — Module 1

Verification in the Age of Synthetic Media

Deepfakes, cloned audio, and fabricated images have made fact-checking harder — and more consequential.
When a photograph can be generated in seconds and a politician's voice can be cloned from a 30-second sample, what does journalistic verification mean?

On March 22, 2023, a set of images depicting former U.S. President Donald Trump being physically arrested began circulating on Twitter and other platforms. The images were photorealistic (crowds, police officers, a struggle) but entirely synthetic, generated by Bellingcat founder Eliot Higgins using Midjourney v5. Higgins posted them explicitly labeled as AI-generated to demonstrate the technology's capabilities.

Within hours, portions of the image set had been reshared without the label by accounts that presented them as real. Several journalists from mainstream outlets contacted Higgins for comment, uncertain whether the images were authentic. The episode illustrated a gap that had opened almost overnight: the time required to generate a convincing synthetic image had collapsed from days to seconds, while the time required to verify authenticity had not.

The Verification Problem

Traditional image verification relied on metadata analysis (EXIF data), reverse image search, geolocation matching, and source tracing. These tools remain useful but are increasingly insufficient against AI-generated content because:

1. Synthetic images contain no authentic EXIF data. A Midjourney or DALL-E image will carry metadata from the generation software, not from a camera in a specific location at a specific time. Tools like InVID/WeVerify and FotoForensics look for compression artifacts and metadata inconsistencies — signals that are absent or misleading in synthetic images.

2. Reverse image search finds only previously indexed images. A newly generated image has no prior index presence. Google Reverse Image Search and TinEye cannot find what has never been uploaded before.

3. Deepfake video has outpaced detection tools. The Reuters Institute reported in 2023 that commercial deepfake detection APIs had accuracy rates between 65% and 80% on synthetic video. That is useful, but insufficient for publication-standard verification.
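To make the first point concrete, here is a minimal sketch of one metadata check: scanning a JPEG's header segments for an embedded Exif block. It uses only the Python standard library and the publicly documented JPEG segment layout; the function name and the sample bytes are hypothetical, and this is not how InVID/WeVerify or FotoForensics work internally. The result is a weak signal only. Many platforms strip metadata from genuine photos, and metadata can be forged, so a missing Exif block does not prove an image is synthetic and a present one does not prove it is authentic.

```python
import struct


def has_exif_segment(data: bytes) -> bool:
    """Return True if JPEG bytes contain an APP1 segment tagged as Exif."""
    if data[:2] != b"\xff\xd8":            # every JPEG starts with the SOI marker
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:              # segment markers always begin with 0xFF
            return False
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):         # start of image data, or end of image
            return False
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False


def segment(marker: int, payload: bytes) -> bytes:
    """Build one JPEG header segment (helper for the made-up samples below)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


# Two tiny hand-built headers, not real photographs.
camera_like = b"\xff\xd8" + segment(0xE0, b"JFIF\x00") + segment(0xE1, b"Exif\x00\x00MM") + b"\xff\xda"
stripped = b"\xff\xd8" + segment(0xE0, b"JFIF\x00") + b"\xff\xda"

print(has_exif_segment(camera_like), has_exif_segment(stripped))
```

In a verification workflow, a check like this would only decide which questions to ask next, for example requesting the original file from the source.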

Documented Incident — 2023 Pentagon Explosion

On May 22, 2023, a fabricated image of an explosion near the Pentagon circulated on Twitter. Verified accounts amplified it, including one named "Bloomberg Feed" that was not affiliated with Bloomberg News or its editorial staff. The S&P 500 dipped approximately 0.3% in the minutes before the image was debunked by Arlington County officials. The episode is widely cited as the first documented case of a synthetic image causing a measurable market reaction.

Audio Deepfakes and Cloned Voices

In January 2024, robocalls using an AI-generated voice cloned from President Biden were sent to tens of thousands of New Hampshire voters ahead of the state's primary. The calls, which appeared to come from a number associated with a Democratic operative, instructed recipients not to vote in the primary. The New Hampshire Attorney General launched an investigation, and the FCC subsequently ruled that AI-generated voices in robocalls are covered by the Telephone Consumer Protection Act.

For journalists, the incident underscored that audio alone is no longer reliable evidence. The standard practice of recording a source to confirm quotes must now contend with the possibility that a recording was fabricated with a cloned voice. Verification protocols now recommended by the First Draft coalition and the Poynter Institute include callback verification on a known number, in-person confirmation for high-stakes quotes, and forensic waveform analysis.

Emerging Verification Tools

The Coalition for Content Provenance and Authenticity (C2PA), formed in 2021, released its 1.0 specification in January 2022 and has since been adopted by Adobe, Microsoft, Google, Sony, Nikon, and Leica. C2PA embeds a cryptographically signed content credential into media files at the point of creation, recording the device, software, time, and any edits made. The New York Times began embedding C2PA credentials into its photojournalism in 2023.

C2PA is not a complete solution: credentials can be stripped by re-saving or screenshotting, and not all cameras or platforms support it. But it represents the most substantive technical attempt to establish a chain of custody for media content.
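The idea of a signed credential, and the two ways it fails, can be illustrated with a simplified sketch. This is not the C2PA format: real content credentials use certificate-based signatures and a defined manifest structure, whereas the code below substitutes a shared-secret HMAC from the Python standard library so that it stays self-contained. The key, the claim fields, and the function names are all hypothetical.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-production"  # hypothetical stand-in for a signing certificate


def make_credential(media: bytes, claims: dict, key: bytes = SIGNING_KEY) -> dict:
    """Bind provenance claims to the exact bytes of a media file and sign the result."""
    manifest = {"claims": claims, "content_sha256": hashlib.sha256(media).hexdigest()}
    payload = json.dumps(manifest, sort_keys=True).encode()
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"manifest": manifest, "signature": signature}


def check_credential(media: bytes, credential, key: bytes = SIGNING_KEY) -> str:
    """Classify a file as verified, altered, or lacking any credential."""
    if credential is None:
        return "no credential"            # e.g. stripped by a screenshot or re-save
    payload = json.dumps(credential["manifest"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, credential["signature"]):
        return "credential tampered"      # the claims were edited after signing
    if hashlib.sha256(media).hexdigest() != credential["manifest"]["content_sha256"]:
        return "content changed"          # the media no longer matches what was signed
    return "verified"


photo = b"raw image bytes"                # placeholder for a real file's contents
cred = make_credential(photo, {"device": "Example Camera", "captured": "2023-05-22T10:00:00Z"})

print(check_credential(photo, cred))
print(check_credential(photo + b" edited", cred))
print(check_credential(photo, None))
```

The third case is the limitation noted above: a file with no credential is not flagged as fake, it is simply unverifiable by this method.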

Deepfake: AI-generated or AI-manipulated video, audio, or images that depict real people doing or saying things they did not do or say.
C2PA: Coalition for Content Provenance and Authenticity. An industry standard for cryptographically signing media to establish its creation and edit history.
Voice Cloning: The use of AI to synthesize a convincing replica of a specific person's voice from audio samples, sometimes from only a few seconds of audio.
Content Credential: A C2PA-compliant metadata record embedded in a media file that documents provenance, device, and editing history using cryptographic signatures.
Key Takeaway

Verification has historically been a process of confirming what already happened. Synthetic media requires journalists to verify that something actually happened at all — a fundamentally harder epistemic task. Technical tools like C2PA help establish provenance at creation but cannot authenticate media that circulates without credentials. Human judgment, corroboration, and source relationships remain the core of the verification chain.

Lesson 2 Quiz

Verification in the Age of Synthetic Media · 5 questions
1. Who created the synthetic images of Donald Trump's "arrest" in March 2023, and for what stated purpose?
Correct. Eliot Higgins, founder of Bellingcat, created the images and posted them explicitly labeled as AI-generated to demonstrate Midjourney v5's photorealism capabilities.
Eliot Higgins of Bellingcat created the images as a demonstration of Midjourney v5, labeling them as AI-generated. They were later reshared without that label.
2. Why is reverse image search insufficient for verifying AI-generated images?
Correct. Reverse image search can only surface previously indexed images. A brand-new synthetic image has no prior online presence and therefore returns no results.
The problem is that a newly generated image has never been uploaded before, so no search index contains it. Reverse search can only find images it has already seen.
3. What measurable real-world consequence resulted from the fabricated Pentagon explosion image on May 22, 2023?
Correct. The fabricated image caused an approximately 0.3% dip in the S&P 500 in the minutes before it was debunked — the first documented case of a synthetic image triggering a market reaction.
The S&P 500 dipped about 0.3% before Arlington County officials debunked the image, marking the first documented synthetic-image-driven market movement.
4. The C2PA content provenance standard has been adopted by which of the following organizations?
Correct. C2PA's coalition includes major tech companies, camera manufacturers, and software firms. The New York Times began embedding C2PA credentials in its photojournalism in 2023.
C2PA has broad industry adoption including Adobe, Microsoft, Google, Sony, Nikon, and Leica. The New York Times also adopted C2PA credentials for photojournalism in 2023.
5. The January 2024 New Hampshire robocall incident prompted the FCC to issue a ruling that AI-generated voices in robocalls are covered under which existing law?
Correct. The FCC ruled that AI-generated voices used in robocalls fall under the Telephone Consumer Protection Act, extending existing robocall regulations to synthetic voice content.
The FCC applied the Telephone Consumer Protection Act to AI-generated robocall voices, extending existing robocall regulation to synthetic audio.

Lab 2 — Verification Protocol Design

Build a synthetic media verification checklist with AI support

Your Task

Your newsroom's standards editor has asked you to draft a first version of a synthetic media verification protocol. Use the assistant below to stress-test your thinking against documented cases and identify gaps in your proposed workflow.

Starter prompt: "Help me design a step-by-step protocol for verifying whether a photograph or audio clip is AI-generated before my newsroom publishes it."
Verification Protocol Assistant
Ready to help you build a verification protocol. We can draw on documented cases — the Eliot Higgins Trump images, the Pentagon explosion fake, the New Hampshire voice-clone robocalls — to make sure your protocol addresses real-world failure modes. Where would you like to start: images, audio, or video?
Lesson 3 · AI & Media — Module 1

AI-Assisted Investigative Reporting

How data journalism teams use machine learning to find stories inside datasets too large for any human to read.
What investigative stories have only become possible because AI could process what human reporters could not?

In April 2016, the International Consortium of Investigative Journalists (ICIJ) published the Panama Papers — at the time the largest leak in journalistic history. The dataset: 11.5 million documents, 2.6 terabytes of data, spanning 40 years of offshore financial records from the law firm Mossack Fonseca. No human team could read it in any reasonable timeframe.

The ICIJ used Apache Solr for full-text search, Nuix for document processing, and a custom graph database to map relationships between shell companies, directors, and named individuals. Natural language processing tools extracted entity names from unstructured text, an automated first pass that let reporters flag documents for deeper investigation. The result: stories naming 143 politicians, 12 current or former world leaders, and figures from 200 countries in a single coordinated global publication.
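A relationship graph of this kind can be sketched with plain Python dictionaries. The example below is illustrative only and does not reflect the ICIJ's actual database or schema: the names, roles, and companies are invented, and the input is assumed to be the output of an earlier entity-extraction step.

```python
from collections import defaultdict

# Hypothetical extracted records: (person, role, company).
records = [
    ("A. Example", "director", "Alpha Holdings Ltd"),
    ("A. Example", "shareholder", "Beta Trading SA"),
    ("B. Sample", "director", "Beta Trading SA"),
    ("C. Placeholder", "director", "Gamma Ventures Inc"),
]


def build_graph(rows):
    """Store people and companies as nodes, with an undirected edge per record."""
    graph = defaultdict(set)
    roles = {}
    for person, role, company in rows:
        graph[person].add(company)
        graph[company].add(person)
        roles[(person, company)] = role   # keep the edge label for later reporting
    return graph, roles


def linked_people(graph, person):
    """Return everyone who shares at least one company with the given person."""
    found = set()
    for company in graph[person]:
        found |= graph[company]
    found.discard(person)
    return found


graph, roles = build_graph(records)
print(linked_people(graph, "A. Example"))
```

A query like `linked_people` surfaces a lead, not a finding: the connection still has to be confirmed against the underlying documents by a reporter.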

What AI Actually Does in Data Journalism

It is important to be precise about where AI contributes in investigative contexts, because the term is often used loosely. The documented uses fall into several categories:

Entity extraction and relationship mapping: NLP models identify named persons, companies, dates, and amounts in unstructured documents and link them into networks. The ICIJ's OffshoreLeaks database — which grew from the Panama Papers to include the Pandora Papers (2021, 11.9 million documents) — relies on this approach.

Anomaly detection: Machine learning models trained on baseline patterns can flag statistical outliers. ProPublica used this approach in its Surgeon Scorecard (2015), which analyzed Medicare data to identify surgeons with statistically elevated complication rates. The model identified which surgeons to investigate; reporters then verified findings through medical records and interviews.

Document classification: Supervised learning models sort large document sets by relevance or category. The Marshall Project used classification models to identify police misconduct records from hundreds of thousands of disciplinary documents obtained via public records requests across multiple states.
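As an illustration of the document classification category, the sketch below trains a tiny naive Bayes text classifier using only the standard library. It is a teaching example under stated assumptions, not The Marshall Project's method: the four training snippets and both labels are invented, and a real project would use thousands of hand-labeled documents and an established machine learning library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labeled examples: (text, label).
training = [
    ("officer suspended for excessive force complaint", "misconduct"),
    ("internal affairs sustained complaint against officer", "misconduct"),
    ("vehicle maintenance invoice for patrol fleet", "other"),
    ("training schedule and overtime budget memo", "other"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)   # prior for this label
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue                                      # ignore unseen words
            # Add-one smoothing avoids zero probability for rare words.
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(training)
print(classify(model, "complaint of excessive force sustained"))
print(classify(model, "overtime budget invoice"))
```

The classifier only sorts documents into piles for review; misclassified records are expected, which is why a human still reads whatever it surfaces.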

ProPublica — Surgeon Scorecard

ProPublica's 2015 Surgeon Scorecard analyzed Medicare claims data covering 17,000 surgeons and approximately 3.3 million procedures. The ML model identified surgeons with complication rates more than one standard deviation above their specialty's adjusted baseline. Human reporters then investigated flagged cases through interviews, hospital records, and regulatory filings. No AI-identified finding was published without human corroboration. The project won multiple journalism awards and led to policy discussions about surgical outcome transparency.
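The flagging rule described here, a rate more than one standard deviation above the specialty baseline, can be sketched as follows. This is a simplified illustration and not ProPublica's model: the surgeon IDs and rates are invented, and the sketch compares raw rates, whereas the text above refers to an adjusted baseline.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical complication rates: (surgeon_id, specialty, rate).
rows = [
    ("A", "orthopedics", 0.020), ("B", "orthopedics", 0.030),
    ("C", "orthopedics", 0.025), ("D", "orthopedics", 0.080),
    ("E", "urology", 0.040), ("F", "urology", 0.040),
    ("G", "urology", 0.040), ("H", "urology", 0.100),
]


def flag_outliers(rows, k=1.0):
    """Flag surgeons whose rate exceeds their specialty mean by more than k standard deviations."""
    by_specialty = defaultdict(list)
    for _, specialty, rate in rows:
        by_specialty[specialty].append(rate)
    # One threshold per specialty, so surgeons are compared only with their peers.
    thresholds = {s: mean(r) + k * pstdev(r) for s, r in by_specialty.items()}
    return sorted(sid for sid, specialty, rate in rows if rate > thresholds[specialty])


print(flag_outliers(rows))
```

The output is a list of cases for reporters to examine, consistent with the workflow described above in which no flagged finding was published without human corroboration.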

The Pandora Papers: Scale Escalation

The 2021 Pandora Papers — coordinated by the same ICIJ — exceeded the Panama Papers in scale: 11.9 million documents from 14 financial service providers. Processing required more sophisticated ML pipelines than 2016. The ICIJ used transformer-based NLP models (architecturally similar to BERT) to extract and disambiguate entities across documents in 16 languages. Cross-lingual entity matching — recognizing that "Vladimir Putin" and "Владимир Путин" refer to the same person in a relationship graph — required models specifically fine-tuned for the task.
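The matching problem can be made concrete with a deliberately naive sketch: transliterate Cyrillic letters to Latin, normalize case and spacing, and compare. This is a stand-in for illustration, not the fine-tuned models described above. The letter table is one simplified romanization chosen for this example, and it would miss spelling variants, nicknames, and other scripts, which is part of why learned models were needed.

```python
# Simplified Russian-to-Latin letter table (an assumption for this sketch;
# several competing romanization standards exist).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def normalize(name: str) -> str:
    """Lowercase, transliterate letter by letter, and collapse whitespace."""
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in name.lower())
    return " ".join(latin.split())


def same_entity(a: str, b: str) -> bool:
    """Treat two names as the same entity if their normalized forms match exactly."""
    return normalize(a) == normalize(b)


print(same_entity("Vladimir Putin", "Владимир Путин"))
print(same_entity("Vladimir Putin", "Vladimir Potanin"))
```

Exact matching after normalization is brittle by design here; it shows the task, not a production-grade solution.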

The Pandora Papers implicated 35 current or former world leaders and 330 politicians across more than 90 countries. Without AI-assisted processing, the dataset would have taken an estimated 600 years to read manually.

Limits and Ethical Considerations

AI-assisted investigation raises specific ethical questions. Anomaly detection models can reflect historical biases in the data they were trained on — ProPublica's 2016 investigation into the COMPAS recidivism prediction algorithm demonstrated how a "neutral" ML model can encode racial disparities from historical criminal justice data. When newsrooms use similar tools to identify stories, they must interrogate whether the baseline the model is trained on is itself fair.

Publication decisions remain human decisions. In every documented case of consequential AI-assisted investigation — Panama Papers, Surgeon Scorecard, The Marshall Project's misconduct database — human editors and reporters made final publication calls, and legal review was conducted on findings before they were named in print.

Entity Extraction: An NLP process that identifies and classifies named entities (persons, organizations, locations, dates) in unstructured text.
Anomaly Detection: A machine learning technique that identifies data points statistically deviating from a trained baseline; useful for finding outliers worth investigating.
Relationship Graph: A data structure representing entities as nodes and connections (directorship, ownership, family ties) as edges; essential for mapping offshore finance networks.
Human Corroboration: The editorial requirement that AI-identified findings be independently verified by human reporting before publication.
The Core Principle

AI in investigative journalism functions best as a triage and discovery layer — processing at scale to identify what humans should investigate, not to determine what should be published. Every major AI-assisted investigation on record has maintained this division: machines read and flag; humans verify and decide.

Lesson 3 Quiz

AI-Assisted Investigative Reporting · 5 questions
1. How large was the Panama Papers dataset released in 2016, and which organization coordinated its analysis?
Correct. The Panama Papers comprised 11.5 million documents and 2.6 terabytes of data from Mossack Fonseca, coordinated by the ICIJ across 400 journalists in 80 countries.
The Panama Papers were 11.5 million documents (2.6 TB) coordinated by the ICIJ — the largest journalistic leak at the time of publication.
2. What methodology did ProPublica use in its 2015 Surgeon Scorecard to identify surgeons worth investigating?
Correct. ProPublica's ML model identified statistical outliers in Medicare complication data; human reporters then verified flagged cases through records and interviews before publication.
ProPublica used anomaly detection on Medicare data — surgeons more than one standard deviation above their specialty's adjusted complication rate were flagged for human investigation.
3. Why did the Pandora Papers (2021) require more sophisticated AI processing than the Panama Papers (2016)?
Correct. At 11.9 million documents from 14 providers across 16 languages, cross-lingual entity disambiguation (e.g., matching Cyrillic and Latin spellings of the same name) required transformer-based models fine-tuned for the task.
The Pandora Papers' added complexity came from scale (11.9M docs from 14 providers) and cross-lingual entity matching that required more advanced transformer-based NLP than 2016 tools.
4. What ethical concern did ProPublica's 2016 COMPAS investigation raise about using ML models in journalism?
Correct. ProPublica's COMPAS analysis showed that a model trained on historical criminal justice data could produce racially disparate risk scores even without explicitly using race as a variable.
The core issue was training data bias: COMPAS inherited racial disparities from the historical criminal justice data it was trained on, producing disparate outcomes without any intentional discrimination in the algorithm itself.
5. In every major documented AI-assisted investigation, what role did human journalists retain throughout the process?
Correct. In Panama Papers, Surgeon Scorecard, Pandora Papers, and similar investigations, AI served as a triage and discovery layer. All verification, publication decisions, and legal review remained with human journalists and editors.
The consistent documented pattern: AI reads at scale and flags; humans verify, investigate, and decide what to publish. No AI finding was published in a major investigation without human corroboration.

Lab 3 — Investigative Data Strategy

Map an AI-assisted investigation workflow for a real-world scenario

Your Task

Your investigative team has obtained 800,000 documents from a public records request covering ten years of state contractor payments. You suspect there are patterns of favoritism but don't know where to start. Use the assistant to design your ML-assisted investigation strategy.

Starter prompt: "I have 800,000 contractor payment documents. Walk me through how to design an anomaly detection approach to find potential favoritism patterns, and what human verification steps I need at each stage."
Investigative Data Assistant
Great investigative scenario. We can build on the approaches used in the Panama Papers entity extraction, ProPublica's Surgeon Scorecard anomaly detection, and The Marshall Project's document classification work. What do you know about the structure of your 800,000 documents — are they structured data (spreadsheets, databases) or unstructured text (PDFs, scanned contracts)?
Lesson 4 · AI & Media — Module 1

Ethics, Disclosure, and the Future of AI in News

What newsrooms owe their audiences — and each other — as AI rewrites the rules of production and trust.
When AI writes or assists in writing a published article, what does the audience have a right to know?

In 2023, a survey by the Reuters Institute for the Study of Journalism found that fewer than 20% of major news organizations had published any explicit policy on AI use in editorial content. Among those that had, definitions varied widely: some required disclosure only for fully AI-written articles; others required disclosure for any AI involvement in drafting; others had no disclosure requirement at all but required human review.

The gap between practice and policy was significant. The same survey found that over 75% of journalists surveyed at organizations with no official policy reported using AI tools — primarily for research, summarization, and translation — on a regular basis.

The Disclosure Debate

The journalism ethics community has not reached consensus on AI disclosure requirements, but several positions have emerged clearly from documented debates:

Position 1 — Disclose AI authorship, not AI assistance. Under this view, if a human reporter used an LLM to summarize a 200-page report before writing their own analysis, that is analogous to using a search engine — a tool, not an author. Disclosure is only required when AI substantially generates published text. The Associated Press's 2023 AI guidelines take roughly this position.

Position 2 — Disclose all material AI involvement. Under this view, readers are owed transparency about any process that might affect accuracy or perspective, including AI-assisted drafting, translation, or summarization used in reporting. The BBC's editorial guidelines, updated in 2023, lean toward this more expansive disclosure.

Position 3 — Treat AI as infrastructure, not authorship. Under this view, requiring disclosure of AI tools is like requiring disclosure of spell-checkers or grammar tools. This position is less common among editorial ethics bodies but appears in arguments made by some publishers focused on efficiency.

AP AI Guidelines — 2023

The Associated Press published explicit AI guidelines for its journalists in 2023. Key provisions: AP journalists may not use AI-generated text in published stories without explicit editor approval. AI-generated images are prohibited from use in editorial content. Journalists may use AI tools for research and background reading but must verify all AI-generated facts independently. The AP also announced it would not use LLMs trained on AP content without licensing agreements, and it entered into such an agreement with OpenAI in July 2023.

Copyright and Training Data

The AP-OpenAI licensing agreement was one of several that emerged in 2023 as news organizations began asserting intellectual property rights over their archives. In December 2023, The New York Times filed suit against OpenAI and Microsoft, alleging that its articles had been used without authorization to train GPT models. The Times alleged it could demonstrate instances where ChatGPT reproduced near-verbatim passages of Times journalism, which it argued constituted copyright infringement.

The case — still in litigation as of this writing — raises questions that will define the legal framework for AI training data for years. Key issues include: whether training on publicly accessible text constitutes fair use; whether reproducing memorized text in model outputs constitutes infringement; and whether the scale of training data reproduction qualitatively changes the fair-use analysis.

In parallel, several smaller outlets, including The Intercept, Raw Story, and AlterNet, filed similar suits against OpenAI in early 2024. The Center for Investigative Reporting filed against both OpenAI and Microsoft in June 2024.

Audience Trust and Transparency

A 2023 Reuters Institute Digital News Report survey of 94,000 respondents across 46 countries found that 52% were uncomfortable with news articles written primarily by AI, even if checked by a human. Comfort levels were significantly higher when AI was used for data visualization (63% comfortable) or translation (68% comfortable) compared to text generation.

The data suggests a reader distinction between AI as a production tool and AI as an author — audiences appear more willing to accept AI in support roles than in the byline. Newsrooms attempting to build trust in AI use are likely to face different reception depending on how transparently they communicate what AI did and what humans decided.

Disclosure Policy: A newsroom's explicit statement to its audience about when and how AI involvement in content production will be labeled or communicated.
Fair Use: A legal doctrine allowing limited use of copyrighted material without permission under certain conditions; currently being tested in AI training data litigation.
Training Data: The corpus of text, images, or other media on which an AI model is trained; the subject of ongoing intellectual property disputes between AI developers and publishers.
Memorization: A documented phenomenon where LLMs reproduce near-verbatim sequences from training data; central to The New York Times' copyright claims against OpenAI.

Where the Field Stands

AI in journalism is simultaneously a production tool, an investigative asset, a verification challenge, and a copyright battleground. The organizations navigating it most successfully share a common approach: they define specific, bounded use cases with clear human-oversight requirements; they publish their policies; and they treat disclosure as an audience-trust investment rather than a liability. The newsrooms struggling most are those that adopted AI broadly without the ethical infrastructure to match the speed of the technology.

Lesson 4 Quiz

Ethics, Disclosure, and the Future of AI in News · 5 questions
1. According to the 2023 Reuters Institute survey, what percentage of major news organizations had published an explicit policy on AI use in editorial content?
Correct. The Reuters Institute survey found fewer than 20% of major news organizations had published any explicit AI editorial policy — despite over 75% of journalists at policy-less organizations already using AI tools regularly.
The Reuters Institute found fewer than 20% of major news organizations had published explicit AI policies, revealing a significant gap between practice and documented policy.
2. The Associated Press entered into a licensing agreement with which AI company in July 2023 regarding use of AP content for model training?
Correct. The AP and OpenAI announced a licensing agreement in July 2023, allowing OpenAI to access AP's archive of news content — a model other publishers have since pursued or litigated for.
The AP signed a licensing agreement with OpenAI in July 2023, becoming an early example of a news organization negotiating terms for AI training data use rather than pursuing litigation.
3. What was the central legal claim in The New York Times' December 2023 lawsuit against OpenAI and Microsoft?
Correct. The Times alleged unauthorized training on its archive and pointed to instances of near-verbatim reproduction in ChatGPT outputs (the "memorization" phenomenon) as evidence of infringement rather than transformative fair use.
The Times' core claim was unauthorized training use plus memorization: ChatGPT could reproduce near-verbatim Times passages, which the Times argued constituted infringement beyond fair-use protection.
4. A 2023 Reuters Institute Digital News Report survey found what percentage of respondents were uncomfortable with news articles written primarily by AI, even if checked by a human?
Correct. 52% of the 94,000-person survey sample was uncomfortable with primarily AI-written news even with human review — compared to significantly higher comfort with AI used for translation or data visualization.
52% expressed discomfort. Notably, the same audience was much more comfortable with AI in support roles (translation: 68% comfortable; data visualization: 63% comfortable) than in text generation.
5. Which of the following best describes the AP's 2023 AI guidelines regarding AI-generated text in published stories?
Correct. The AP's guidelines require editor approval for any AI-generated text in published stories, ban AI editorial images entirely, and require independent verification of any AI-suggested facts.
The AP requires explicit editor approval for AI-generated text, prohibits AI editorial images, and mandates independent verification of AI-suggested facts — a structured, bounded approach to AI use.

Lab 4 — Drafting an AI Editorial Policy

Build a defensible disclosure and use policy for a real newsroom context

Your Task

You've been asked to draft your newsroom's first AI editorial policy. It needs to address: what AI tools journalists may use, when disclosure to readers is required, how AI-generated facts must be verified, and what AI is prohibited from doing entirely. Use the assistant to stress-test your draft against real-world cases and industry standards.
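
One way to stress-test a draft is to write its rules down as an explicit checklist and run concrete cases through it. The sketch below is a minimal illustration in Python, not an industry-standard implementation. The category names (research, generated_text, generated_image, and so on) and the AIPolicy and Decision classes are hypothetical placeholders, loosely modeled on the AP provisions described in this lesson; a real newsroom policy would define its own categories and rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool    # may the content go forward?
    disclose: bool   # must readers be told AI was involved?
    reason: str      # short explanation for the editor


@dataclass
class AIPolicy:
    # Hypothetical categories for a draft policy; adjust to your newsroom.
    permitted: frozenset = frozenset({"research", "translation", "summarization"})
    approval_required: frozenset = frozenset({"generated_text"})
    prohibited: frozenset = frozenset({"generated_image"})
    disclose: frozenset = frozenset({"generated_text", "translation", "summarization"})

    def review(self, use, editor_approved=False, facts_verified=False):
        """Check one proposed AI use against the policy."""
        if use in self.prohibited:
            return Decision(False, False, "prohibited use")
        if use not in self.permitted and use not in self.approval_required:
            return Decision(False, False, "use not covered by policy")
        if use in self.approval_required and not editor_approved:
            return Decision(False, False, "editor approval missing")
        if not facts_verified:
            return Decision(False, False, "AI-supplied facts not independently verified")
        return Decision(True, use in self.disclose, "ok")


if __name__ == "__main__":
    policy = AIPolicy()
    cases = [
        ("research", False, True),
        ("generated_text", False, True),
        ("generated_text", True, True),
        ("generated_image", True, True),
    ]
    for use, approved, verified in cases:
        print(use, policy.review(use, editor_approved=approved, facts_verified=verified))
```

Writing the policy this way forces every rule to be stated unambiguously: each proposed use is either prohibited, permitted, or permitted only with approval, and the disclosure requirement is decided separately from whether the use is allowed at all.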

Starter prompt: "Help me draft a three-section AI editorial policy covering permitted uses, disclosure requirements, and prohibited uses, drawing on what the experiences of the AP, the BBC, and CNET have taught the industry."
Editorial Policy Assistant
Let's build this policy carefully. We'll draw on the AP's 2023 guidelines, the BBC's expanded disclosure approach, the lessons from CNET's AI content failure, and the Reuters Institute survey data on audience trust. To calibrate the policy: is your newsroom primarily a print/digital outlet, a broadcast organization, or a data journalism team? That will affect which sections need the most detail.

Module Test — AI in Journalism

15 questions · 80% to pass
1. What year did the Associated Press first deploy Automated Insights' Wordsmith for earnings reporting?
Correct. The AP adopted Wordsmith in 2014, the same year the LA Times' Quakebot drew wide attention for its Westwood earthquake story.
The AP adopted Wordsmith in 2014, the same year the LA Times' Quakebot earthquake story drew wide attention. Both marked the public emergence of automated journalism.
2. Ken Schwencke built Quakebot for which newspaper?
Correct. Ken Schwencke built Quakebot while working at the LA Times. The bot published the March 2014 Westwood earthquake story three minutes after the event.
Quakebot was built by Ken Schwencke at the Los Angeles Times. It published the Westwood earthquake story three minutes after the 4.4-magnitude quake struck.
3. The Washington Post's Heliograf generated more than how many short articles during the 2016 election night?
Correct. The Washington Post reported Heliograf produced over 500 short articles on election night 2016, triggered when results crossed pre-set thresholds.
Heliograf produced more than 500 articles on 2016 election night, generating updates automatically as results crossed predefined thresholds.
4. Which publication exposed CNET's undisclosed AI article program in January 2023?
Correct. Futurism broke the CNET AI story in January 2023. Futurism also reported the Sports Illustrated AI-author scandal in November 2023.
Futurism exposed CNET's AI program in January 2023 and later broke the Sports Illustrated AI-author story in November 2023.
5. A C2PA content credential establishes which of the following?
Correct. C2PA embeds a cryptographic chain of custody — device, software, timestamp, and edit history — into a media file at creation, allowing downstream verification of provenance.
C2PA creates a cryptographically signed content credential recording the device, software, time, and editing history of a media file — a chain of custody for provenance verification.
6. The fabricated Pentagon explosion image on May 22, 2023 caused which measurable financial effect before being debunked?
Correct. The S&P 500 dipped approximately 0.3% in the minutes after the fake Pentagon explosion image began circulating, the first documented case of a synthetic image causing a market reaction.
The S&P 500 dipped ~0.3% before Arlington County officials debunked the image — the first documented instance of a synthetic image triggering a measurable market reaction.
7. The Panama Papers were analyzed using which combination of tools by the ICIJ?
Correct. The ICIJ used Apache Solr for full-text search, Nuix for document processing, and a custom graph database to map relationships between entities across 11.5 million documents.
The ICIJ used Apache Solr (search), Nuix (document processing), and a custom graph database (relationship mapping) — plus NLP for entity extraction from unstructured text.
8. ProPublica's Surgeon Scorecard (2015) covered approximately how many surgeons and procedures?
Correct. The Surgeon Scorecard analyzed Medicare data covering 17,000 surgeons and approximately 3.3 million procedures, using ML anomaly detection to flag statistical outliers.
ProPublica's model analyzed Medicare data for 17,000 surgeons across approximately 3.3 million procedures, flagging those more than one standard deviation above their specialty's adjusted complication rate.
9. How many documents were in the 2021 Pandora Papers, and from how many financial service providers?
Correct. The Pandora Papers comprised 11.9 million documents from 14 financial service providers across 16 languages — larger and more complex than the 2016 Panama Papers.
11.9 million documents from 14 providers in 16 languages — larger than the Panama Papers and requiring more advanced multilingual NLP for entity extraction and disambiguation.
10. The FCC ruled that AI-generated voices in robocalls are covered under which law, following the January 2024 New Hampshire primary incident?
Correct. The FCC extended the Telephone Consumer Protection Act to cover AI-generated voices in robocalls, following the Biden voice-clone calls to New Hampshire voters.
The FCC applied the Telephone Consumer Protection Act — existing robocall law — to AI-generated voice content following the New Hampshire primary incident.
11. The Sports Illustrated AI-author scandal involved which specific deceptive practice?
Correct. The Arena Group's SI articles appeared under completely fictitious authors with AI-generated profile photos — combining AI-generated text and synthetic identity creation.
The SI scandal involved AI-generated articles under completely fictitious author names with AI-generated headshots — a combination of AI content and synthetic identity presentation.
12. Under the AP's 2023 AI guidelines, which of the following is explicitly prohibited?
Correct. The AP's guidelines explicitly prohibit AI-generated images in editorial content. AI-generated text requires editor approval but is not outright prohibited under all circumstances.
The AP specifically prohibits AI-generated images in editorial use. AI text requires explicit editor approval rather than an outright ban.
13. The New York Times filed its copyright lawsuit against OpenAI and Microsoft in which month and year?
Correct. The Times filed in December 2023, alleging unauthorized training use and citing instances of memorization in which ChatGPT reproduced near-verbatim Times passages.
The Times filed in December 2023. The suit alleged both unauthorized training-data use and LLM memorization — ChatGPT reproducing near-verbatim Times content in outputs.
14. According to the 2023 Reuters Institute Digital News Report, readers were most comfortable with AI being used for which journalism application?
Correct. The Reuters Institute survey found readers were most comfortable with AI in translation (68%) and data visualization (63%) — support roles rather than authorship roles.
Translation (68% comfortable) and data visualization (63%) were the most accepted AI applications. By contrast, 52% of respondents were uncomfortable with primarily AI-written news, even with human review.
15. Which of the following best describes the role AI has played in every major documented investigative journalism case — Panama Papers, Surgeon Scorecard, Pandora Papers?
Correct. The consistent pattern across all major AI-assisted investigations: AI as a triage and discovery layer; humans as verifiers, interviewers, and decision-makers. No AI finding was published without human corroboration.
In every documented major investigation, AI processed and flagged; humans verified, investigated, and decided. The human-in-the-loop principle was maintained throughout all cases.