For most of the twentieth century, the news was a relatively knowable thing. A handful of national broadcasters, a few national newspapers, a local paper — you knew where it came from, who chose it, and roughly what their incentives were. You could disagree with it, but you could locate it.
That has been coming apart for twenty years, and AI is finishing the job. The news you see is increasingly chosen by algorithms optimizing for engagement. Some of it is written or summarized by AI, and a growing fraction is entirely synthetic. As a result, your media experience may differ meaningfully from that of the person sitting next to you on the train.
This course is about media in the age of AI, for both consumers and producers. It covers how algorithmic feeds actually select content, the state of AI-generated news, the economics of the news industry under AI, how to detect synthetic media, the emerging provenance standards, and the skill of navigating an information environment where the old heuristics (trust this outlet, distrust that one) no longer suffice on their own.
On March 17, 2014, the Los Angeles Times published a story about a 4.4-magnitude earthquake near Westwood, California. The article appeared about three minutes after the quake struck. No human reporter wrote it. A program called Quakebot, built by journalist and programmer Ken Schwencke, pulled data from the USGS feed, slotted it into a template, and saved the draft to the CMS. Schwencke, woken by the quake, reviewed the draft and published it.
That article was not the first automated text in journalism. But it was among the first to be indistinguishable in format from a reporter's byline piece and to circulate widely as an example of what automation could do at deadline speed.
The Associated Press began using Automated Insights' Wordsmith platform to generate quarterly earnings reports in 2014, the same year as Quakebot's debut. By 2016 the AP was producing 3,700 earnings stories per quarter — roughly ten times its previous human output — covering companies that would otherwise receive no coverage at all. The platform used structured financial data and fill-in-the-blank natural-language templates, not neural networks.
The Washington Post deployed its own system, Heliograf, for the 2016 Rio Olympics and the 2016 U.S. election. Heliograf generated short updates when election results crossed preset thresholds. The Post reported Heliograf produced more than 500 short articles during election night. Again: structured data in, templated sentences out.
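The "structured data in, templated sentences out" pattern can be sketched in a few lines of Python. This is a minimal illustration, not the code behind Quakebot, Wordsmith, or Heliograf: the `QuakeEvent` fields, the template wording, and the 3.0 magnitude threshold are all assumptions made for the example, and the sample distance and time are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuakeEvent:
    """Hypothetical record; a real data feed defines its own schema."""
    magnitude: float
    place: str
    distance_miles: int
    local_time: str


# Fill-in-the-blank sentence: every variable part comes from structured data.
TEMPLATE = (
    "A magnitude {magnitude:.1f} earthquake was reported at {local_time}, "
    "{distance_miles} miles from {place}, according to the data feed."
)

MIN_MAGNITUDE = 3.0  # assumed newsworthiness threshold, not a documented value


def draft_story(event: QuakeEvent) -> Optional[str]:
    """Return a templated draft, or None if the event is below the threshold."""
    if event.magnitude < MIN_MAGNITUDE:
        return None
    return TEMPLATE.format(
        magnitude=event.magnitude,
        local_time=event.local_time,
        distance_miles=event.distance_miles,
        place=event.place,
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = QuakeEvent(4.4, "Westwood, California", 5, "6:25 a.m.")
    print(draft_story(sample))
```

Nothing in this sketch learns from data: the threshold and the sentence are fixed rules written by a person, and the output is a draft that still waits for an editor to publish it.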
These systems are correctly called natural language generation (NLG) rather than artificial intelligence in the modern sense. They do not learn; they execute rules. Understanding this distinction matters because the public conversation about "AI in journalism" often conflates rule-based automation with machine learning.
Newsrooms that adopted early NLG tools freed reporters from commodity data work. The AP's journalists, relieved of writing thousands of boilerplate earnings stories, could pursue investigative work. Automation expanded coverage breadth without proportionally expanding staff — a trade-off that shaped how editors would later approach large language models.
The release of ChatGPT in November 2022 marked a qualitative shift. Unlike Wordsmith or Heliograf, large language models can generate coherent prose from unstructured prompts without pre-written templates. News organizations responded rapidly — and inconsistently.
CNET quietly published more than 70 AI-generated personal finance articles between November 2022 and January 2023. When Futurism broke the story in January 2023, CNET acknowledged the practice and paused the program after editors found factual errors, plagiarism-adjacent phrasing, and compound-interest miscalculations in multiple articles. The episode became a widely cited case study in the risks of publishing LLM output with insufficient human review.
Sports Illustrated faced similar scrutiny in November 2023 when Futurism again reported the outlet had published product-recommendation articles under fictitious author names with AI-generated headshots. The publisher, The Arena Group, initially denied the articles were AI-written, then acknowledged using a third-party content vendor, and subsequently terminated its relationship with that vendor.
These cases established a reputational baseline: newsrooms deploying generative AI without robust editorial oversight risk credibility damage that can be swift and severe.
2014: AP adopts Wordsmith for earnings reports; LA Times publishes Quakebot earthquake story.
2016: Washington Post deploys Heliograf for Olympics and election coverage.
Nov 2022: CNET begins AI article program.
Jan 2023: Futurism exposes CNET; program paused after error audit.
Nov 2023: Sports Illustrated AI-author scandal; Arena Group terminates vendor contract.
You are a digital editor at a mid-size regional newspaper. Your publisher wants to know whether deploying an automated system for local government data stories is worth the risk. Use the AI assistant below to work through the key questions an editorial team must answer before adopting automation.
On March 20, 2023, a set of images depicting former U.S. President Donald Trump being physically arrested began circulating on Twitter and other platforms. The images were photorealistic (crowds, police officers, a struggle) but entirely synthetic, generated by Eliot Higgins, founder of the investigative outlet Bellingcat, using Midjourney v5. Higgins posted them explicitly labeled as AI-generated to demonstrate the technology's capabilities.
Within hours, portions of the image set had been reshared without the label by accounts that presented them as real. Several journalists from mainstream outlets contacted Higgins for comment, uncertain whether the images were authentic. The episode illustrated a gap that had opened almost overnight: the time required to generate a convincing synthetic image had collapsed from days to seconds, while the time required to verify authenticity had not.
Traditional image verification relied on metadata analysis (EXIF data), reverse image search, geolocation matching, and source tracing. These tools remain useful but are increasingly insufficient against AI-generated content because:
1. Synthetic images contain no authentic EXIF data. A Midjourney or DALL-E image will carry metadata from the generation software, not from a camera in a specific location at a specific time. Tools like InVID/WeVerify and FotoForensics look for compression artifacts and metadata inconsistencies — signals that are absent or misleading in synthetic images.
2. Reverse image search finds only previously indexed images. A newly generated image has no prior index presence. Google Reverse Image Search and TinEye cannot find what has never been uploaded before.
3. Deepfake video has outpaced detection tools. The Reuters Institute reported in 2023 that commercial deepfake detection APIs had accuracy rates between 65% and 80% on synthetic video: useful, but insufficient for publication-standard verification.
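The metadata point in the list above can be made concrete with a small triage check. The sketch below uses only the Python standard library to test whether a JPEG byte string contains an APP1 segment carrying an Exif header. It is a first-pass filter under simplifying assumptions (well-formed JPEG, no padding bytes between segments), not a forensic tool, and the function names are invented for this example.

```python
import struct


def has_exif_segment(data: bytes) -> bool:
    """Return True if a JPEG byte string contains an APP1 Exif segment."""
    if data[:2] != b"\xff\xd8":           # missing SOI marker: not a JPEG
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:              # malformed segment header; give up
            return False
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):         # start of scan or end of image
            return False
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:                     # length field includes its own 2 bytes
            return False
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length                  # jump to the next segment header
    return False


def triage(data: bytes) -> str:
    """Turn the check into a note for a verification log."""
    if has_exif_segment(data):
        return "EXIF present: inspect the fields, but metadata can be forged"
    return "No EXIF: seen in synthetic, screenshotted, and platform-stripped images"


if __name__ == "__main__":
    # Tiny hand-built byte strings standing in for real files.
    exif_body = b"Exif\x00\x00" + b"\x00" * 4
    with_exif = (b"\xff\xd8\xff\xe1"
                 + struct.pack(">H", 2 + len(exif_body)) + exif_body + b"\xff\xd9")
    print(triage(with_exif))
    print(triage(b"\xff\xd8\xff\xd9"))
```

Neither result settles anything alone. Many platforms strip metadata from authentic photographs on upload, so a missing Exif segment only tells the verifier which of the other checks to lean on.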
On May 22, 2023, a fabricated image of an explosion near the Pentagon circulated on Twitter. Verified accounts, including one named "Bloomberg Feed" that carried a verification badge but was not affiliated with Bloomberg News, briefly amplified the image. The S&P 500 dipped approximately 0.3% in the minutes before the image was debunked by Arlington County officials. The episode has been widely described as one of the first cases of a synthetic image causing a measurable market reaction.
In January 2024, robocalls using an AI-generated voice cloned from President Biden were sent to tens of thousands of New Hampshire voters ahead of the state's primary. The calls, which appeared to come from a number associated with a Democratic operative, instructed recipients not to vote in the primary. The New Hampshire Attorney General launched an investigation, and the FCC subsequently ruled that AI-generated voices in robocalls are covered by the Telephone Consumer Protection Act.
For journalists, the incident underscored that audio alone is no longer reliable evidence. The standard practice of recording a source to confirm quotes must now contend with the possibility that a recording was itself fabricated with a cloned voice. Verification practices recommended by organizations such as First Draft and the Poynter Institute include callback verification on a known number, in-person confirmation for high-stakes quotes, and waveform or spectral analysis with audio forensics tools.
The Coalition for Content Provenance and Authenticity (C2PA), founded in 2021, released version 1.0 of its specification in January 2022, and the standard has since been adopted by Adobe, Microsoft, Google, Sony, Nikon, and Leica. C2PA embeds a cryptographically signed content credential into media files at the point of creation, recording the device, software, time, and any edits made. The New York Times began embedding C2PA credentials into its photojournalism in 2023.
C2PA is not a complete solution: credentials can be stripped by re-saving or screenshotting, and not all cameras or platforms support it. But it represents the most substantive technical attempt to establish a chain of custody for media content.
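The core idea behind a signed content credential, and the reason re-saving breaks it, can be shown with a toy example. The sketch below is not the C2PA format: real credentials are embedded in the file and signed with certificate-based public-key signatures, whereas this example keeps the credential as a separate dictionary and uses an HMAC with a shared demo key as a stand-in. The function names, field names, and key are all hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical key standing in for a camera's or publisher's signing key.
SIGNING_KEY = b"demo-key-not-for-real-use"


def make_credential(media: bytes, device: str, created: str) -> dict:
    """Bind a claim about the media to a hash of its exact bytes, then sign it."""
    claim = {
        "content_sha256": hashlib.sha256(media).hexdigest(),
        "device": device,
        "created": created,
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify_credential(media: bytes, credential: dict) -> bool:
    """Check that the claim is untampered and still matches the media bytes."""
    payload = json.dumps(credential["claim"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, credential["signature"])
    content_ok = (
        credential["claim"]["content_sha256"] == hashlib.sha256(media).hexdigest()
    )
    return signature_ok and content_ok


if __name__ == "__main__":
    photo = b"raw image bytes"                      # placeholder for a real file
    cred = make_credential(photo, "ExampleCam", "2023-05-22T10:00:00Z")
    print(verify_credential(photo, cred))           # original bytes
    print(verify_credential(photo + b"x", cred))    # re-encoded or edited bytes
```

Any change to the bytes, including a re-save, produces a different hash, so the old credential no longer verifies. A screenshot goes further and yields a new file with no credential at all, which is why a missing credential cannot be read as evidence of fabrication.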
Verification has historically been a process of confirming what already happened. Synthetic media requires journalists to verify that something actually happened at all — a fundamentally harder epistemic task. Technical tools like C2PA help establish provenance at creation but cannot authenticate media that circulates without credentials. Human judgment, corroboration, and source relationships remain the core of the verification chain.
Your newsroom's standards editor has asked you to draft a first version of a synthetic media verification protocol. Use the assistant below to stress-test your thinking against documented cases and identify gaps in your proposed workflow.
In April 2016, the International Consortium of Investigative Journalists (ICIJ) published the Panama Papers — at the time the largest leak in journalistic history. The dataset: 11.5 million documents, 2.6 terabytes of data, spanning 40 years of offshore financial records from the law firm Mossack Fonseca. No human team could read it in any reasonable timeframe.
The ICIJ used Apache Solr for full-text search, Nuix for document processing, and a graph database to map relationships between shell companies, directors, and named individuals. Natural language processing tools extracted entity names from unstructured text, automating a first pass over the documents so that humans could flag material for deeper investigation. The result: stories naming 143 politicians, 12 current or former world leaders, and figures from 200 countries in a single coordinated global publication.
It is important to be precise about where AI contributes in investigative contexts, because the term is often used loosely. The documented uses fall into several categories:
Entity extraction and relationship mapping: NLP models identify named persons, companies, dates, and amounts in unstructured documents and link them into networks. The ICIJ's OffshoreLeaks database — which grew from the Panama Papers to include the Pandora Papers (2021, 11.9 million documents) — relies on this approach.
Anomaly detection: Machine learning models trained on baseline patterns can flag statistical outliers. ProPublica used this approach in its Surgeon Scorecard (2015), which analyzed Medicare data to identify surgeons with statistically elevated complication rates. The model identified which surgeons to investigate; reporters then verified findings through medical records and interviews.
Document classification: Supervised learning models sort large document sets by relevance or category. The Marshall Project used classification models to identify police misconduct records from hundreds of thousands of disciplinary documents obtained via public records requests across multiple states.
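Of the categories above, entity extraction and relationship mapping is the easiest to show in miniature. The sketch below links entities that appear in the same document and records which documents support each link. Production pipelines use trained named-entity recognition models; here the "extractor" is a simple lookup against a hand-made list, which is an assumption made to keep the example self-contained. All names and documents are invented.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Invented entity list standing in for the output of a trained NER model.
KNOWN_ENTITIES = ["Alpha Holdings Ltd", "Beta Trust", "J. Doe"]


def extract_entities(text: str) -> Set[str]:
    """Return the known entities mentioned in a document (case-insensitive)."""
    lowered = text.lower()
    return {name for name in KNOWN_ENTITIES if name.lower() in lowered}


def build_graph(documents: Dict[str, str]) -> Dict[Tuple[str, str], List[str]]:
    """Map each co-occurring entity pair to the document IDs that link them."""
    edges = defaultdict(list)
    for doc_id, text in documents.items():
        # Sorting gives each pair one canonical ordering.
        for pair in combinations(sorted(extract_entities(text)), 2):
            edges[pair].append(doc_id)
    return dict(edges)


if __name__ == "__main__":
    docs = {
        "d1": "J. Doe is listed as a director of Alpha Holdings Ltd.",
        "d2": "Alpha Holdings Ltd transferred shares to Beta Trust.",
        "d3": "Routine correspondence with no named parties.",
    }
    for pair, sources in build_graph(docs).items():
        print(pair, sources)
```

Keeping the source document IDs on every edge matters more than the graph itself: it lets a reporter go straight from a suspicious link to the pages that need to be read and verified.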
ProPublica's 2015 Surgeon Scorecard analyzed Medicare claims data covering 17,000 surgeons and approximately 3.3 million procedures. The ML model identified surgeons with complication rates more than one standard deviation above their specialty's adjusted baseline. Human reporters then investigated flagged cases through interviews, hospital records, and regulatory filings. No AI-identified finding was published without human corroboration. The project won multiple journalism awards and led to policy discussions about surgical outcome transparency.
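The flagging rule described above, a rate more than one standard deviation above the specialty baseline, can be sketched as a per-group z-score. This is a simplified illustration, not ProPublica's method: it compares raw rates and skips the risk adjustment the text refers to, and the function name, record layout, and sample figures are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Iterable, List, Tuple

Record = Tuple[str, str, float]  # (surgeon_id, specialty, complication_rate)


def flag_outliers(records: Iterable[Record], threshold_sd: float = 1.0) -> List[str]:
    """Return IDs whose rate exceeds their specialty mean by more than threshold_sd."""
    by_specialty = defaultdict(list)
    for surgeon_id, specialty, rate in records:
        by_specialty[specialty].append((surgeon_id, rate))

    flagged = []
    for rows in by_specialty.values():
        rates = [rate for _, rate in rows]
        mu, sd = mean(rates), pstdev(rates)
        if sd == 0:
            continue                      # no spread in this group, nothing to flag
        for surgeon_id, rate in rows:
            if (rate - mu) / sd > threshold_sd:
                flagged.append(surgeon_id)
    return flagged


if __name__ == "__main__":
    # Invented figures for illustration only.
    sample = [
        ("S1", "ortho", 0.02), ("S2", "ortho", 0.03), ("S3", "ortho", 0.02),
        ("S4", "ortho", 0.03), ("S5", "ortho", 0.10),
        ("S6", "spine", 0.05), ("S7", "spine", 0.05),
    ]
    print(flag_outliers(sample))
```

The output is a list of leads, not findings. A flagged ID says only that a number is unusual relative to its group, which is why the reporting step that follows cannot be skipped.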
The 2021 Pandora Papers — coordinated by the same ICIJ — exceeded the Panama Papers in scale: 11.9 million documents from 14 financial service providers. Processing required more sophisticated ML pipelines than 2016. The ICIJ used transformer-based NLP models (architecturally similar to BERT) to extract and disambiguate entities across documents in 16 languages. Cross-lingual entity matching — recognizing that "Vladimir Putin" and "Владимир Путин" refer to the same person in a relationship graph — required models specifically fine-tuned for the task.
The Pandora Papers implicated 35 current or former world leaders and 330 politicians across more than 90 countries. Without AI-assisted processing, the dataset would have taken an estimated 600 years to read manually.
AI-assisted investigation raises specific ethical questions. Anomaly detection models can reflect historical biases in the data they were trained on — ProPublica's 2016 investigation into the COMPAS recidivism prediction algorithm demonstrated how a "neutral" ML model can encode racial disparities from historical criminal justice data. When newsrooms use similar tools to identify stories, they must interrogate whether the baseline the model is trained on is itself fair.
Publication decisions remain human decisions. In every documented case of consequential AI-assisted investigation — Panama Papers, Surgeon Scorecard, The Marshall Project's misconduct database — human editors and reporters made final publication calls, and legal review was conducted on findings before they were named in print.
AI in investigative journalism functions best as a triage and discovery layer — processing at scale to identify what humans should investigate, not to determine what should be published. Every major AI-assisted investigation on record has maintained this division: machines read and flag; humans verify and decide.
Your investigative team has obtained 800,000 documents from a public records request covering ten years of state contractor payments. You suspect there are patterns of favoritism but don't know where to start. Use the assistant to design your ML-assisted investigation strategy.
In 2023, a survey by the Reuters Institute for the Study of Journalism found that fewer than 20% of major news organizations had published any explicit policy on AI use in editorial content. Among those that had, definitions varied widely: some required disclosure only for fully AI-written articles; others required disclosure for any AI involvement in drafting; others had no disclosure requirement at all but required human review.
The gap between practice and policy was significant. The same survey found that over 75% of journalists surveyed at organizations with no official policy reported using AI tools — primarily for research, summarization, and translation — on a regular basis.
The journalism ethics community has not reached consensus on AI disclosure requirements, but several positions have emerged clearly from documented debates:
Position 1 — Disclose AI authorship, not AI assistance. Under this view, if a human reporter used an LLM to summarize a 200-page report before writing their own analysis, that is analogous to using a search engine — a tool, not an author. Disclosure is only required when AI substantially generates published text. The Associated Press's 2023 AI guidelines take roughly this position.
Position 2 — Disclose all material AI involvement. Under this view, readers are owed transparency about any process that might affect accuracy or perspective, including AI-assisted drafting, translation, or summarization used in reporting. The BBC's editorial guidelines, updated in 2023, lean toward this more expansive disclosure.
Position 3 — Treat AI as infrastructure, not authorship. Under this view, requiring disclosure of AI tools is like requiring disclosure of spell-checkers or grammar tools. This position is less common among editorial ethics bodies but appears in arguments made by some publishers focused on efficiency.
The Associated Press published explicit AI guidelines for its journalists in August 2023. Key provisions: AP journalists may not use AI-generated text in published stories without explicit editor approval. AI-generated images are prohibited from use in editorial content. Journalists may use AI tools for research and background reading but must verify all AI-generated facts independently. The AP also announced it would not use LLMs trained on AP content without licensing agreements; it had already entered into such an agreement with OpenAI in July 2023.
The AP-OpenAI licensing agreement was one of several that emerged in 2023 as news organizations began asserting intellectual property rights over their archives. In December 2023, The New York Times filed suit against OpenAI and Microsoft, alleging that its articles had been used without authorization to train GPT models. The Times alleged it could demonstrate instances where ChatGPT reproduced near-verbatim passages of Times journalism, which it argued constituted copyright infringement.
The case — still in litigation as of this writing — raises questions that will define the legal framework for AI training data for years. Key issues include: whether training on publicly accessible text constitutes fair use; whether reproducing memorized text in model outputs constitutes infringement; and whether the scale of training data reproduction qualitatively changes the fair-use analysis.
In parallel, several smaller outlets, including The Intercept, Raw Story, and AlterNet, filed similar suits against OpenAI in early 2024. The Center for Investigative Reporting filed against both OpenAI and Microsoft in June 2024.
A 2023 Reuters Institute Digital News Report survey of 94,000 respondents across 46 countries found that 52% were uncomfortable with news articles written primarily by AI, even if checked by a human. Comfort levels were significantly higher when AI was used for data visualization (63% comfortable) or translation (68% comfortable) compared to text generation.
The data suggests a reader distinction between AI as a production tool and AI as an author — audiences appear more willing to accept AI in support roles than in the byline. Newsrooms attempting to build trust in AI use are likely to face different reception depending on how transparently they communicate what AI did and what humans decided.
AI in journalism is simultaneously a production tool, an investigative asset, a verification challenge, and a copyright battleground. The organizations navigating it most successfully share a common approach: they define specific, bounded use cases with clear human-oversight requirements; they publish their policies; and they treat disclosure as an audience-trust investment rather than a liability. The newsrooms struggling most are those that adopted AI broadly without the ethical infrastructure to match the speed of the technology.
You've been asked to draft your newsroom's first AI editorial policy. It needs to address: what AI tools journalists may use, when disclosure to readers is required, how AI-generated facts must be verified, and what AI is prohibited from doing entirely. Use the assistant to stress-test your draft against real-world cases and industry standards.