AI & Media · Module 4 · Lesson 1

The Scale Problem

Why human review alone collapsed — and what platforms built to replace it

When billions of posts arrive every day, who decides what stays — and can any system be fair at that volume?

Facebook's head of global policy, Monika Bickert, testified to the U.S. Senate that the company employed 7,500 human content reviewers worldwide, a number that sounded large until senators did the arithmetic: 500 million Stories posted per day divided by 7,500 reviewers comes to roughly 67,000 posts per reviewer per day, or more than two posts every second of an eight-hour shift. The scale mismatch was not a staffing failure. It was a structural impossibility.
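Dividing the two figures quoted above gives the per-reviewer load directly. The sketch below assumes, for illustration only, that every reviewer works one eight-hour shift per day and that posts are split evenly across reviewers.

```python
# Figures quoted in the lesson text.
STORIES_PER_DAY = 500_000_000
REVIEWERS = 7_500

# Illustrative assumption: one eight-hour shift per reviewer per day.
SHIFT_SECONDS = 8 * 60 * 60

posts_per_reviewer = STORIES_PER_DAY / REVIEWERS
posts_per_second = posts_per_reviewer / SHIFT_SECONDS

print(f"Posts per reviewer per day: {posts_per_reviewer:,.0f}")
print(f"Posts per reviewer per second of shift: {posts_per_second:.2f}")
```

Under these assumptions each reviewer would face tens of thousands of posts per shift, which is why adding staff alone could not close the gap.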

The Volume Numbers That Changed Everything

In 2023, YouTube reported that users upload 500 hours of video every minute. Meta processes roughly 100 billion messages per day across its apps. TikTok's own transparency report for H1 2023 showed it removed 112 million videos in six months — approximately 600,000 per day — before those videos accumulated a single view. These numbers make human-only review mathematically impossible.
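The same kind of back-of-envelope check works for the YouTube and TikTok figures. The sketch below uses the numbers quoted above; the 181-day length of H1 2023 and the idea of a reviewer watching uploads in real time for eight hours a day are assumptions added for illustration.

```python
# Figures quoted in the lesson text.
YOUTUBE_HOURS_UPLOADED_PER_MINUTE = 500
TIKTOK_REMOVALS_H1_2023 = 112_000_000

# Assumptions for illustration only.
DAYS_IN_H1_2023 = 181        # 1 January through 30 June 2023
REVIEW_HOURS_PER_SHIFT = 8   # hypothetical reviewer watching in real time

hours_uploaded_per_day = YOUTUBE_HOURS_UPLOADED_PER_MINUTE * 60 * 24
reviewers_needed = hours_uploaded_per_day / REVIEW_HOURS_PER_SHIFT
tiktok_removals_per_day = TIKTOK_REMOVALS_H1_2023 / DAYS_IN_H1_2023

print(f"YouTube hours uploaded per day: {hours_uploaded_per_day:,}")
print(f"Reviewers needed just to watch it once: {reviewers_needed:,.0f}")
print(f"TikTok removals per day: {tiktok_removals_per_day:,.0f}")
```

Watching every YouTube upload once would take on the order of 90,000 full-time reviewers under these assumptions, before any time is spent on deciding or documenting an outcome.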

The platforms' response was to build automated detection pipelines that triage content before any human sees it. A small percentage of that triage is now handled by large language models and vision transformers; the majority is still handled by narrower classifiers trained on labeled violation datasets. The human reviewer, in 2024, is largely an appeals judge rather than a first screener.

500 hrs: Video uploaded to YouTube per minute

112M: TikTok videos removed, H1 2023

3.5B: Pieces reviewed by Meta AI per quarter

Three Eras of Moderation Architecture

Understanding how platforms got here requires tracking three distinct eras.

2004–12

Report-and-Review: Users flagged content. A human reviewer acted within days. Effective when platforms had millions of users; collapsed at billions.

2012–18

Hash Matching & Early ML: PhotoDNA (Microsoft, 2009, deployed by Facebook 2011) let platforms match images against known CSAM hashes without human review. Early text classifiers flagged hate speech by keyword proximity. Fast but brittle — easily evaded by minor edits.

2018–present

Transformer-Based Proactive Detection: Models trained on millions of labeled violations scan content at upload, before distribution. Meta's Community Standards Enforcement Report for Q3 2021 stated that 97.3% of hate speech removed that quarter was identified by AI before user reports.
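The brittleness of the keyword-era classifiers in the timeline above is easy to reproduce. The sketch below is a deliberately simplified stand-in (exact keyword matching rather than keyword proximity), and "spamword" is a placeholder for a real prohibited term.

```python
import re

# Placeholder blocklist; "spamword" stands in for a real prohibited term.
BLOCKLIST = {"spamword"}

def keyword_flag(text: str) -> bool:
    """Flag text if any alphanumeric token exactly matches the blocklist."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return any(token in BLOCKLIST for token in tokens)

print(keyword_flag("buy spamword now"))      # exact spelling is caught
print(keyword_flag("buy sp4mword now"))      # one swapped character slips through
print(keyword_flag("buy s p a m w o r d"))   # added spacing slips through
```

Each evasion costs the poster a few keystrokes, while each fix requires the platform to anticipate the variant in advance.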

Documented Case — YouTube BERT Deployment (2020)

YouTube disclosed in its 2020 Q3 transparency report that it had deployed BERT-based natural language understanding to detect borderline content — videos that don't violate policies outright but sit near the line. The model reduced recommendations of such content by 70% on U.S. English queries within six months. The company noted the same model performed significantly worse on non-English content, a disparity it acknowledged would require separate training data for each language market.

What Automated Systems Can and Cannot Do

Automated systems excel at three tasks: matching known violations (hashed CSAM, known terrorist imagery), classifying unambiguous violations at scale (nudity, graphic violence with no news context), and routing borderline content to human queues faster than report-based systems.

They struggle with context-dependent speech. In 2020, Facebook's systems incorrectly removed thousands of posts discussing that year's Belarusian protests because phrases used by protesters overlapped with terms associated with coordinated inauthentic behavior. The company acknowledged the error after Belarusian journalists and human rights groups documented the pattern publicly. Context, such as knowing that "take to the streets" in Minsk in August 2020 was protest journalism rather than incitement, requires more than pattern matching.

Key Terms

Proactive detection: AI identifies violations before user reports, often before any view is registered.

Hash matching: Comparing a digital fingerprint of content against a database of known violations; near-perfect precision on known material, useless for novel content.

Classifier: A model trained to assign content to categories (violating / not violating) based on features learned from labeled examples.

Lesson 1 Quiz

The Scale Problem

Three questions.

1. What percentage of hate speech removed by Meta in Q3 2021 was identified by AI before any user report?

Correct. Meta's Community Standards Enforcement Report for Q3 2021 reported that 97.3% of hate speech actioned that quarter was detected proactively by AI systems before user reports triggered review.

Not quite. Meta's own report put the figure at 97.3% — a number that surprised many observers and illustrated how thoroughly automated detection had displaced reactive review by that point.

2. PhotoDNA, the hash-matching tool first deployed by Facebook in 2011, is most effective at detecting which type of content?

Correct. Hash matching is near-perfect on known material because it compares digital fingerprints. It is, however, useless against novel content that hasn't been hashed and entered into the reference database.

Hash matching compares digital fingerprints against a reference database of known violations. It excels at catching re-uploads of previously identified material — like CSAM — but cannot detect content it has never seen before.

3. YouTube's 2020 deployment of BERT-based understanding achieved which documented outcome?

Correct. YouTube's Q3 2020 transparency report documented a 70% reduction in borderline content recommendations on U.S. English queries, while acknowledging the model performed significantly worse on non-English content.

YouTube's own report said BERT reduced recommendations of borderline content by 70% on U.S. English — not removal, and not cross-language. The company specifically flagged the non-English performance gap as a remaining challenge.

Lab 1 — AI Tutor

Thinking Through Scale

Discuss the moderation scale problem with your AI tutor. At least 3 exchanges to complete.

Your Task

You've seen the volume numbers and the three eras of moderation. Now think critically: if you were a platform policy director in 2018 deciding whether to invest in transformer-based proactive detection, what would your key concerns be? What might go wrong?

Start by describing one specific risk you'd worry about when deploying an AI system that removes content before any human sees it — then explore that concern with your tutor.

AESOP Tutor

Content Moderation Scale

Welcome to the Scale lab. I'm here to help you think through the real tradeoffs platform policy teams face when deploying AI moderation at billions-of-posts scale. What's one concern you'd raise in that 2018 meeting?

AI & Media · Module 4 · Lesson 2

How Automated Detection Works

From training data to deployment: the pipeline that decides what gets removed

What does it actually mean to train a model to detect hate speech — and who decides what counts as a violation in the training data?

Google's Jigsaw unit released Perspective API in February 2017, a publicly accessible tool that scored text for "toxicity." Within weeks, researchers at Carnegie Mellon found that the model assigned higher toxicity scores to phrases containing words like "gay," "lesbian," and "Black" — not because those words were inherently toxic, but because the training data, drawn from Wikipedia talk pages and New York Times comment threads, reflected the contexts in which those words had historically appeared alongside abuse. The tool was measuring the training data's biases as faithfully as it measured actual toxicity.

The Training Pipeline

Every automated moderation system begins with labeled examples: millions of pieces of content that human annotators have marked as violating or not violating specific policies. The model learns to generalize from those labels. The quality of moderation is therefore bounded by the quality of the labels — and labels are produced by humans who bring their own cultural contexts, fatigue levels, and disagreements.

A 2019 study published in the ACL Anthology ("Hate Speech Detection Is Not as Easy as You May Think") found inter-annotator agreement on hate speech labels was as low as 60% in some datasets, meaning two trained human annotators disagreed on up to 40% of examples. A model trained on that data cannot reliably exceed that ceiling of human agreement.
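To make the agreement figure concrete, the sketch below computes raw percent agreement and Cohen's kappa, a standard chance-corrected measure, for two annotators. The labels are invented to produce 60% raw agreement; the lesson does not say which metric the cited study used, so showing both is an assumption made for illustration.

```python
from collections import Counter

def percent_agreement(a: list, b: list) -> float:
    """Share of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list, b: list) -> float:
    """Agreement corrected for what two annotators would match on by chance.

    Undefined (division by zero) if chance agreement is exactly 1.
    """
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(a) | set(b)
    )
    return (observed - expected) / (1 - expected)

# Invented labels for ten posts; not drawn from any real dataset.
annotator_1 = ["hate", "ok", "hate", "ok", "ok", "hate", "ok", "ok", "hate", "ok"]
annotator_2 = ["hate", "ok", "ok", "ok", "hate", "hate", "ok", "hate", "ok", "ok"]

print(f"Raw agreement: {percent_agreement(annotator_1, annotator_2):.2f}")
print(f"Cohen's kappa: {cohens_kappa(annotator_1, annotator_2):.2f}")
```

In this toy data a raw agreement of 0.60 corresponds to a kappa of only about 0.17, because two annotators who each label 40% of posts as hate would already agree on about half of them by chance.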

Text Classification Pipeline

Raw text → tokenizer → transformer encoder → classification head → probability score → threshold → action. Each step introduces potential failure: tokenizers miss non-standard spellings, encoders inherit training-data biases, thresholds are set by policy teams balancing false positives against false negatives.
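The stages above can be sketched end to end with toy stand-ins. Everything below is hypothetical: the bag-of-words `encode` function replaces a transformer encoder, and `TOY_WEIGHTS` and `TOY_BIAS` replace parameters a real model would learn from labeled data.

```python
import math
import re

# Hypothetical weights standing in for what a trained model would learn.
TOY_WEIGHTS = {"attack": 1.5, "vermin": 2.5, "love": -1.0}
TOY_BIAS = -2.0

def tokenize(text: str) -> list[str]:
    """Split lowercase text into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def encode(tokens: list[str]) -> dict[str, int]:
    """Stand-in for the encoder: count how often each token appears."""
    counts: dict[str, int] = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

def classification_head(features: dict[str, int]) -> float:
    """Weighted sum of token counts plus a bias, producing a raw logit."""
    return TOY_BIAS + sum(
        TOY_WEIGHTS.get(token, 0.0) * count for token, count in features.items()
    )

def to_probability(logit: float) -> float:
    """Sigmoid: map the logit to a score between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

def moderate(text: str, threshold: float = 0.5) -> tuple[float, str]:
    """Run every stage and apply the threshold to choose an action."""
    probability = to_probability(classification_head(encode(tokenize(text))))
    action = "flag" if probability >= threshold else "allow"
    return probability, action

print(moderate("They are vermin"))   # known token pushes the score over the threshold
print(moderate("They are v3rmin"))   # tokenizer splits the misspelling; score stays low
```

The second call illustrates the tokenizer failure mode named above: the non-standard spelling never reaches the weights as a recognizable token, so every later stage works from the wrong input.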

Image & Video Hashing

Known violations are hashed (PhotoDNA for images, TMK+PDQF for video). Uploads are compared against the hash database. Matching triggers automatic removal. Near-duplicate detection (perceptual hashing) catches minor edits. Completely novel violating content is invisible to hash systems.
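The difference between exact and perceptual hashing can be shown on a tiny grayscale grid. This is a sketch only: PhotoDNA and TMK+PDQF are not reproduced here, and the "average hash" below is a much simpler perceptual hash used as a stand-in. The pixel grids are invented.

```python
import hashlib

Image = list[list[int]]  # rows of grayscale values in the range 0-255

def exact_hash(pixels: Image) -> str:
    """Cryptographic hash: changes completely if any pixel changes."""
    flat = bytes(value for row in pixels for value in row)
    return hashlib.sha256(flat).hexdigest()

def average_hash(pixels: Image) -> int:
    """Toy perceptual hash: one bit per pixel, set if brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")

known = [[10, 200, 10, 200], [200, 10, 200, 10],
         [10, 200, 10, 200], [200, 10, 200, 10]]
# Minor edit: every pixel brightened slightly.
edited = [[min(255, value + 5) for value in row] for row in known]
# Unrelated image that is not in the reference database.
novel = [[200, 200, 10, 10], [200, 200, 10, 10],
         [10, 10, 200, 200], [10, 10, 200, 200]]

print(exact_hash(known) == exact_hash(edited))                         # False
print(hamming_distance(average_hash(known), average_hash(edited)))     # 0
print(hamming_distance(average_hash(known), average_hash(novel)))      # 8
```

The exact hash misses the brightened copy, the perceptual hash still matches it, and neither has anything to say about the novel image, since there is no reference entry to compare it against.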

The Threshold Decision

After a model produces a probability score, a policy team sets a threshold: above X, remove automatically; between Y and X, send to human review; below Y, allow. This threshold is not a technical decision — it is a values decision. A high threshold protects free expression but allows more violations through. A low threshold catches more violations but over-removes legitimate speech.
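A minimal sketch of the two-threshold routing described above, followed by a count of how errors shift as the removal threshold moves. The parameter names `remove_at` (X) and `review_at` (Y), the choice to treat scores exactly at a threshold as meeting it, and the sample scores are all assumptions made for illustration.

```python
def route(score: float, remove_at: float, review_at: float) -> str:
    """Map a probability score to an action using two thresholds."""
    if not 0.0 <= review_at <= remove_at <= 1.0:
        raise ValueError("need 0 <= review_at <= remove_at <= 1")
    if score >= remove_at:
        return "remove"
    if score >= review_at:
        return "human_review"
    return "allow"

# Invented (score, truly_violating) pairs; not from any real classifier.
SAMPLE = [
    (0.95, True), (0.85, True), (0.80, False), (0.65, True),
    (0.55, False), (0.30, False), (0.10, False),
]

def auto_removal_errors(remove_at: float) -> tuple[int, int]:
    """Return (legitimate posts auto-removed, violations not auto-removed)."""
    wrongly_removed = sum(1 for s, bad in SAMPLE if s >= remove_at and not bad)
    not_removed = sum(1 for s, bad in SAMPLE if s < remove_at and bad)
    return wrongly_removed, not_removed

print(route(0.72, remove_at=0.75, review_at=0.50))  # human_review
print(auto_removal_errors(0.90))  # (0, 2): nothing legitimate removed, two violations left
print(auto_removal_errors(0.60))  # (1, 0): every violation removed, one legitimate post lost
```

Neither threshold is correct in a technical sense. The code can report the counts, but choosing which kind of error to accept is the values decision the paragraph describes.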

This tradeoff became visible during the COVID-19 infodemic of 2020. YouTube lowered thresholds for medical misinformation in March 2020, which also caused removal of videos from legitimate public health researchers discussing vaccine hesitancy as a topic to study, not promote. YouTube acknowledged in a blog post that week that "automated systems trained on past violations struggle with novel policy categories" — COVID was new, and the training data was not.

Documented Case — Facebook's Proactive Detection Surge (2020–21)

After the January 6, 2021 Capitol attack, Facebook's systems flagged an unprecedented volume of content in the 48-hour window, removing significantly more content than in any comparable period. Internal documents later reviewed by journalists (the "Facebook Papers," released October 2021) showed that the surge in removals included legitimate news reporting and political commentary alongside actual incitement. The company acknowledged in a statement that "operating at unprecedented volume increases the error rate of automated systems."

Multimodal Challenges

Text-only and image-only models miss content that requires understanding both modalities together. A 2022 research paper from Meta AI ("HateMM") documented that memes — image-text combinations — were misclassified at nearly twice the rate of text-only hate speech. A meme combining a benign image with a hateful caption, or vice versa, requires understanding the ironic relationship between image and text. As of 2024, multimodal moderation remains an active area of research, not a solved problem.

Key Terms

Annotator agreement: The degree to which different human labelers assign the same category to the same content. Low agreement means noisy training data.

Detection threshold: The probability score above which a system takes automatic action. Setting this threshold is a policy choice, not a purely technical one.

Multimodal content: Content that combines multiple formats (image + text, audio + video). Harder to moderate because meaning may emerge from the combination rather than either element alone.

Lesson 2 Quiz

How Automated Detection Works

Three questions.

1. What did Carnegie Mellon researchers find about Google's Perspective API shortly after its 2017 launch?

Correct. CMU researchers found the API over-scored identity-group mentions as toxic — a direct artifact of the training data, which reflected historical patterns of abuse around those terms rather than their inherent toxicity.

CMU researchers found the opposite problem: the API assigned higher toxicity scores to phrases containing words like "gay" or "Black" — not because those words are inherently toxic, but because the training data reflected contexts where those words had historically appeared alongside abuse.

2. A 2019 ACL study on hate speech detection datasets found inter-annotator agreement as low as 60%. What is the key implication for automated moderation?

Correct. If annotators disagree on 40% of training examples, the model learns a noisy signal. It cannot reliably generalize beyond the ceiling set by human agreement in the data it was trained on.

When humans disagree on 40% of examples, those disagreements are baked into the training data. The model learns a noisy signal — it cannot exceed the agreement level of the humans who labeled the data it was trained on.

3. Why did YouTube's lowered moderation thresholds during early COVID-19 (March 2020) cause problems?

Correct. YouTube's own blog post acknowledged that "automated systems trained on past violations struggle with novel policy categories." COVID was new, and the training data didn't include context for distinguishing promotion of misinformation from research discussion of it.

YouTube's systems had no training data for a novel policy category — COVID misinformation. Without that context, lower thresholds swept in legitimate public health research discussing vaccine hesitancy as a subject of study, not as advocacy. YouTube acknowledged this in their own published statement.

Lab 2 — AI Tutor

Training Data & Thresholds

Explore annotation bias and threshold tradeoffs with your AI tutor. At least 3 exchanges to complete.

Your Task

You've learned that moderation model quality is bounded by training data quality, and that thresholds are values decisions. In this lab, you'll explore a concrete scenario: you're setting the detection threshold for a hate speech classifier on a social platform with a global user base.

Start with this scenario: your classifier gives a post a 0.72 probability score of being hate speech. Your threshold for automatic removal is 0.75. The post is from a user in a country where this content is illegal. What factors should affect whether you change the threshold? Discuss with your tutor.

AESOP Tutor

Training Data & Thresholds

Ready to dig into thresholds and training data tradeoffs. The scenario I've given you is deliberately ambiguous — the right answer depends on assumptions about error rates, legal liability, and who you think should make this decision. Where do you want to start?

AI & Media · Module 4 · Lesson 3

Bias, Error, and Disparate Impact

Why automated moderation consistently makes different errors for different communities

If a moderation system makes mistakes at the same overall rate across all languages, why might it still be discriminatory?

In the weeks preceding and during the May 2021 Israeli-Palestinian conflict, Human Rights Watch and Palestinian digital rights groups documented hundreds of cases in which Meta's automated systems removed Arabic-language posts, Stories, and accounts belonging to journalists and human rights workers covering the conflict. Many posts were news photographs or eyewitness accounts. Meta acknowledged in a subsequent statement that "we made errors that affected people's ability to share their experiences" and attributed part of the problem to its systems performing less accurately on Arabic-language content than on English.

Why Disparate Error Rates Emerge

Moderation AI is trained predominantly on English-language data. When deployed globally, models encounter languages, dialects, and cultural contexts radically different from their training distribution. The result is not a neutral error rate: errors concentrate in under-resourced languages and communities already marginalized in digital spaces.

A 2021 paper from the University of Washington ("Measuring Model Biases in the Absence of Ground Truth") documented that Facebook's hate speech classifier had a false positive rate three times higher for African American Vernacular English (AAVE) compared to Standard American English — even when content was identical in meaning. Phrases common in AAVE ("I'm dead" meaning "I'm laughing," for example) triggered hate speech flags at disproportionately high rates.

False Positive Disparities

A false positive in content moderation means legitimate speech is removed. When false positive rates are higher for specific communities, those communities bear a disproportionate burden of censorship — losing access to the platform's audience, appeals processes, and monetization at higher rates than others.

False Negative Disparities

A false negative means violating content stays up. When false negative rates are higher for specific communities — meaning attacks against them go undetected — those communities receive less protection from harassment. Both error types can be discriminatory depending on their distribution.
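Both error types can be measured per group from the same audit records. In the sketch below, the groups "A" and "B" and all records are invented, and are constructed so that both groups have the same overall error rate while the kind of error differs.

```python
from collections import defaultdict

# Invented audit records: (group, truly_violating, flagged_by_system).
RECORDS = (
    [("A", True, True)] * 3 + [("A", True, False)] * 2 + [("A", False, False)] * 5
    + [("B", True, True)] * 5 + [("B", False, True)] * 2 + [("B", False, False)] * 3
)

def error_rates(records):
    """Compute overall, false positive, and false negative rates per group."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for group, violating, flagged in records:
        if violating:
            counts[group]["tp" if flagged else "fn"] += 1
        else:
            counts[group]["fp" if flagged else "tn"] += 1

    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]   # posts that were actually legitimate
        positives = c["tp"] + c["fn"]   # posts that were actually violating
        rates[group] = {
            "error_rate": (c["fp"] + c["fn"]) / (negatives + positives),
            "false_positive_rate": c["fp"] / negatives if negatives else float("nan"),
            "false_negative_rate": c["fn"] / positives if positives else float("nan"),
        }
    return rates

for group, r in sorted(error_rates(RECORDS).items()):
    print(group, r)
```

Both groups show a 20% overall error rate, yet group A's errors are all missed violations (less protection) and group B's are all wrongful removals (more censorship). An aggregate accuracy figure would hide that difference entirely.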

The Transparency Gap

Until 2021, Meta published aggregated accuracy statistics but not breakdowns by language, geography, or demographic. The company's transparency reports showed high-level "proactive detection rates" that masked significant performance disparities. It was civil society organizations — not platform transparency reports — that documented the Arabic moderation failure.

The European Union's Digital Services Act (DSA), which came into effect for very large platforms in August 2023, requires platforms to provide researchers with access to data for auditing algorithmic systems. The first DSA-mandated audit cycles began in 2024. Researchers at the Oxford Internet Institute are among those conducting audits specifically examining cross-language moderation error rates under this framework.

Documented Case — Twitter's Image Cropping Algorithm (2020)

In October 2020, researchers and users demonstrated that Twitter's automated image cropping algorithm, which selected the most "salient" part of a long image to display in timeline previews, consistently favored white faces over Black faces when both appeared in the same image. Twitter's own subsequent analysis confirmed the disparity and found the algorithm also favored women's bodies, in ways that reflected training data drawn from web images that encoded existing societal biases. Twitter removed the automated cropping algorithm in May 2021, stating "the risks of harm with our image cropping algorithm are not acceptable."

Structural vs. Incidental Bias

The key distinction regulators and researchers draw is between incidental bias — errors that occur somewhat randomly — and structural bias — patterns of error that consistently disadvantage the same groups. Structural bias in moderation is not simply a technical flaw to be patched. It reflects choices about which training data to collect, which languages to support, which communities to prioritize in annotation, and which errors are acceptable. These are policy choices made by platform teams, often without public accountability.

Key Terms

Disparate impact: When a system produces systematically different outcomes for different groups, regardless of intent.

False positive rate: The proportion of non-violating content incorrectly flagged as violating; when higher for specific groups, it constitutes over-censorship of those groups.

DSA audit: Under the EU Digital Services Act, independent researchers can audit algorithmic systems for systemic risk, including moderation bias.

Lesson 3 Quiz

Bias, Error, and Disparate Impact

Three questions.

1. What did Human Rights Watch document about Meta's automated moderation during the May 2021 Israeli-Palestinian conflict?

Correct. HRW documented hundreds of removals of Arabic-language journalism and eyewitness accounts. Meta acknowledged the errors and attributed part of the problem to lower accuracy on Arabic — a direct example of disparate impact from language-skewed training data.

Human Rights Watch documented that Arabic-language posts — including news photographs and eyewitness accounts from journalists — were removed at high rates. Meta acknowledged errors and cited lower accuracy on Arabic content specifically, illustrating how language biases in training data produce disparate real-world impact.

2. University of Washington research on Facebook's hate speech classifier found what disparity regarding AAVE (African American Vernacular English)?

Correct. The UW 2021 paper found false positive rates roughly three times higher for AAVE — meaning legitimate speech in AAVE was flagged as hate speech at far higher rates, even when content was identical in meaning to Standard American English equivalents.

The UW research found AAVE had a false positive rate approximately three times higher than Standard American English. Phrases common in AAVE triggered hate speech flags disproportionately — not because the content was more harmful, but because the training data didn't adequately represent AAVE as a legitimate dialect.

3. Twitter removed its automated image cropping algorithm in May 2021. What was the documented problem?

Correct. Twitter's own analysis confirmed the algorithm favored white faces over Black faces, and also showed gendered biases in how it selected "salient" regions. Twitter concluded the "risks of harm are not acceptable" and removed the feature.

Twitter's own subsequent analysis confirmed that the saliency-based cropping algorithm consistently favored white faces over Black faces and showed gendered biases — reflecting training data drawn from web images that encoded existing societal biases. Twitter removed the algorithm rather than attempt further fixes.

Lab 3 — AI Tutor

Auditing for Disparate Impact

Explore how to detect and address moderation bias with your AI tutor. At least 3 exchanges to complete.

Your Task

You're on the trust and safety team at a platform with 200 million users across 40 languages. Your English moderation accuracy is 94%. You've just received your first DSA audit report showing your Arabic and Swahili false positive rates are 3× and 4× your English rate respectively.

What are your immediate obligations — to users, to regulators, to the public? And what interventions are actually available to you to reduce the disparity? Discuss with your tutor — there is no single right answer here.

AESOP Tutor

Disparate Impact & Auditing

This is a real type of situation trust and safety teams are now facing under DSA frameworks. The audit has found something — now what? Let's think through the options systematically. What's the first thing you'd want to know more about before deciding on an intervention?

AI & Media · Module 4 · Lesson 4

Governance, Appeals, and Accountability

Oversight structures, user rights, and the unresolved question of who governs the governors

When an AI system removes your speech, what recourse do you have — and who is accountable for the decision?

In January 2021, Facebook suspended former President Donald Trump's account following the Capitol attack. In May 2021, the Oversight Board — a nominally independent body Facebook had created and funded — upheld the suspension but ruled that indefinite suspension was not a defined penalty in Facebook's own policies. The board ordered Facebook to review its own decision within 180 days. Facebook responded by maintaining the suspension and adjusting its stated policies to allow indefinite suspension. The case revealed that even a dedicated appeals body could not compel the platform to follow its own rules.

The Appeals Architecture

Under Meta's current system, a user whose content is removed can appeal through an in-app flow. If the initial review affirms the removal, the user can escalate to the Oversight Board — but only for a tiny fraction of cases. The Oversight Board accepted 20 cases for full review in its first year of operation; Meta removes millions of pieces of content per day. The board functions as a symbolic and precedent-setting body, not a scalable appeals mechanism.
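The gap between the board's caseload and the volume of removals can be put in rough numbers. The lesson says only "millions" of removals per day, so the sketch below assumes one million per day as a deliberately low bound; the real ratio would be larger.

```python
# Figure quoted in the lesson text.
OVERSIGHT_BOARD_CASES_FIRST_YEAR = 20

# Assumption: lowest value consistent with "millions" of removals per day.
ASSUMED_REMOVALS_PER_DAY = 1_000_000

removals_per_year = ASSUMED_REMOVALS_PER_DAY * 365
removals_per_board_case = removals_per_year / OVERSIGHT_BOARD_CASES_FIRST_YEAR

print(f"Removals per year (assumed): {removals_per_year:,}")
print(f"Removals per case the board reviewed: {removals_per_board_case:,.0f}")
```

Even at this conservative bound, the board fully reviewed about one case for every 18 million removals, which is consistent with describing it as a precedent-setting body rather than an appeals mechanism.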

The EU Digital Services Act requires platforms to provide users with access to out-of-court dispute settlement bodies for content moderation decisions. These bodies must be independent, expert, and free to users. The first certified dispute settlement bodies began operations in 2024 under DSA requirements. Whether they can meaningfully scale to handle the volume of moderation decisions remains untested.

Documented Case — YouTube's Counter-Terrorism Removal Surge (2017)

Following the 2017 London Bridge attack, YouTube accelerated its AI-powered removal of terrorist-related content. Within weeks, the Syrian Archive — a civil society organization documenting war crimes — reported that over 300,000 videos had been removed, many of which constituted evidence of atrocities in Syria that human rights organizations and the International Criminal Court were using in investigations. YouTube acknowledged the removals and created a limited appeals process for "at-risk" archival content. The incident became the foundational case study for why moderation systems need carve-outs for documentation and journalism.

Transparency Reporting Requirements

Prior to 2022, platform transparency reports were voluntary and inconsistently formatted, making cross-platform comparison nearly impossible. The DSA mandates standardized reporting for very large platforms. The Global Network Initiative (GNI) — a multi-stakeholder body — has developed principles for transparency that more than 30 platform and telecom companies have endorsed, though adherence is self-reported.

The EU's DSA Transparency Database, launched in 2023, requires platforms to publish every content moderation decision as structured data within 24 hours. As of mid-2024, Meta has submitted over 1.5 billion records to the database. Researchers at Tilburg University found that roughly 95% of these records lacked sufficient context in the "reason" field to enable meaningful analysis — indicating that transparency compliance in form does not guarantee transparency in substance.

Platform-Internal Mechanisms

User reports → automated review → human escalation → in-app appeal → Oversight Board (Meta only, limited cases). Speed: days to weeks. Binding on platform: yes, but platform retains final authority. Accountability: limited to internal standards.

External / Regulatory Mechanisms

DSA out-of-court dispute settlement → national competent authority → European Commission (systemic risk). Speed: weeks to months. Binding: yes for DSA-covered platforms. Accountability: fines up to 6% of global revenue for systemic failures.

The Unresolved Governance Problem

No current governance architecture solves the core problem: privately operated AI systems making billions of speech decisions per day, with minimal real-time accountability, serving users in jurisdictions with incompatible legal frameworks. The DSA applies only in the EU. The U.S. has no comparable federal framework. Brazil's LGPD addresses data but not speech. India's IT Rules 2021 require platforms to appoint local compliance officers — but do not mandate algorithmic transparency.

The academic and policy community has proposed several frameworks: algorithmic impact assessments before deployment (analogous to environmental impact assessments), mandatory audits by independent third parties, data access regimes for researchers, and interoperability requirements that would allow users to migrate between platforms without losing their social graph. As of 2024, none of these frameworks has been enacted at scale outside the EU.

Key Terms

Oversight Board: Meta's nominally independent appeals body, funded by a trust Meta endowed. Accepts a small fraction of cases; decisions are binding in form but platform retains policy authority.

DSA Transparency Database: EU-mandated repository of moderation decisions from large platforms, launched 2023. Records must be submitted within 24 hours of each decision.

Algorithmic impact assessment: Pre-deployment analysis of potential harms from an automated decision system, analogous to environmental or privacy impact assessments.

Lesson 4 Quiz

Governance, Appeals, and Accountability

Three questions.

1. What was the Oversight Board's ruling on Facebook's suspension of Donald Trump's account, and what happened next?

Correct. The board upheld the suspension but found "indefinite" suspension was not a defined policy penalty, ordering a 180-day review. Facebook maintained the suspension and revised its policies to permit indefinite suspensions — demonstrating that the platform retains final authority even over its own appeals body.

The Oversight Board upheld the suspension but ruled that "indefinite" was not a defined penalty in Facebook's own policies. It ordered Facebook to review the decision within 180 days. Facebook's response was to maintain the suspension and update its policies to allow indefinite suspension — illustrating the platform's retained authority over its own governance body.

2. The 2017 YouTube counter-terrorism removal surge damaged which type of important resource?

Correct. The Syrian Archive reported over 300,000 videos removed, many constituting evidence of atrocities that human rights organizations and the ICC were using in investigations. This case became the foundational example for why moderation systems need carve-outs for documentation and journalism.

The Syrian Archive documented that over 300,000 videos were removed in the surge — many of them eyewitness documentation of war crimes being used by human rights organizations and international courts. This incident is now the canonical case study for why over-removal by AI systems has concrete humanitarian consequences beyond platform policy debates.

3. What did researchers at Tilburg University find about Meta's submissions to the EU DSA Transparency Database?

Correct. Tilburg researchers found that ~95% of Meta's 1.5 billion+ submitted records had insufficient "reason" field content for meaningful analysis — illustrating that formal compliance with transparency requirements doesn't guarantee substantive transparency.

Tilburg University researchers found that roughly 95% of Meta's submitted records lacked sufficient context to be analytically useful — a finding that illustrates the gap between compliance in form (submitting records) and transparency in substance (records that actually enable accountability).

Lab 4 — AI Tutor

Designing an Appeals System

Think through scalable moderation accountability with your AI tutor. At least 3 exchanges to complete.

Your Task

Current appeals mechanisms are either too slow to be meaningful or too limited in scope to matter. You've learned about the Oversight Board's limitations, the DSA's structured but still-unproven dispute resolution requirement, and the Syrian Archive case that showed the stakes of over-removal.

Design a content moderation appeals system that could work at scale. What would make it genuinely accountable — not just procedurally compliant? What are the hardest tradeoffs in your design? Start by proposing one structural feature and defending it.

AESOP Tutor

Governance & Appeals Design

This is one of the hardest open problems in platform governance. There's no deployed system that has solved it yet — so this is genuinely exploratory design work. What's the first structural feature you'd build into your appeals system, and why would it make accountability more than symbolic?

Module 4 — Final Test

Content Moderation at Scale

15 questions across all four lessons. Score 80% or higher to pass.

1. In 2021, what percentage of hate speech actioned by Meta was detected proactively by AI before any user report?

Correct — Meta's Q3 2021 Community Standards Enforcement Report.

Meta's report stated 97.3%.

2. PhotoDNA was originally developed by Microsoft and first deployed by Facebook in what year?

Correct — PhotoDNA was developed in 2009 and Facebook deployed it in 2011.

Facebook deployed PhotoDNA in 2011 for CSAM detection.

3. YouTube's 2020 BERT deployment reduced recommendations of borderline content by what proportion on U.S. English queries?

Correct — documented in YouTube's Q3 2020 transparency report.

YouTube reported a 70% reduction on U.S. English queries.

4. The "report-and-review" era of moderation is best described as collapsing at what platform scale?

Correct — the arithmetic of billions of daily posts versus thousands of human reviewers made report-and-review structurally impossible.

Human review was structurally overwhelmed at the billions-of-users scale — it was a mathematical impossibility, not a choice.

5. What is the core limitation of hash-matching systems like PhotoDNA?

Correct — hash matching is effective only on previously identified and hashed material; entirely new violating content is invisible to it.

Hash matching's fundamental limitation is that it can only match known, previously hashed material. Novel violating content evades it entirely.
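
To see why novel content is invisible to a hash database, here is a minimal sketch. It assumes a toy fingerprint built on SHA-256 as a stand-in; PhotoDNA uses its own image-hashing method, not SHA-256, and the names `fingerprint`, `KNOWN_HASHES`, and `is_known_violation` are hypothetical.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    # Stand-in fingerprint for illustration; this is NOT the hash PhotoDNA uses.
    return hashlib.sha256(content).hexdigest()

# Hypothetical database holding fingerprints of previously identified material.
KNOWN_HASHES = {fingerprint(b"previously-identified-file")}

def is_known_violation(content: bytes) -> bool:
    # A lookup can only succeed if the fingerprint was added to the database earlier.
    return fingerprint(content) in KNOWN_HASHES

print(is_known_violation(b"previously-identified-file"))  # True
print(is_known_violation(b"never-seen-before-file"))      # False
```

The second lookup fails not because the file is acceptable but because nobody has hashed it yet, which is the limitation the question targets.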

6. A 2019 ACL study on hate speech datasets found inter-annotator agreement as low as 60%. What is the primary implication?

Correct — noisy training labels set an upper bound on what a classifier trained on that data can reliably learn.

When training data has low agreement, the model learns a noisy signal — it cannot generalize beyond that ceiling of human agreement.
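
As a rough illustration of what a 60% agreement figure can mean, the sketch below computes simple pairwise percent agreement. This is an assumption: the question does not say which agreement metric the study used, and the labels are invented for the example.

```python
from itertools import combinations

# Hypothetical labels from three annotators on five posts (1 = hate speech, 0 = not).
LABELS = {
    "annotator_a": [1, 0, 1, 1, 0],
    "annotator_b": [1, 1, 0, 1, 0],
    "annotator_c": [0, 0, 1, 1, 0],
}

def pairwise_agreement(labels):
    """Mean fraction of items on which each pair of annotators gives the same label."""
    rates = []
    for first, second in combinations(labels.values(), 2):
        matches = sum(1 for x, y in zip(first, second) if x == y)
        rates.append(matches / len(first))
    return sum(rates) / len(rates)

print(f"{pairwise_agreement(LABELS):.0%}")  # 60%
```

If annotators disagree on two posts in five, a classifier trained on their labels has no consistent target to learn for those posts.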

7. Google's Perspective API, released in 2017, was found to over-score toxicity for what type of content?

Correct — CMU researchers documented that the API reflected training data bias by scoring identity-group mentions as disproportionately toxic.

Carnegie Mellon researchers found the API over-scored phrases containing identity group terms — a training data artifact, not a deliberate policy.

8. Setting a moderation detection threshold is best understood as what kind of decision?

Correct. Threshold-setting balances false positives (over-censorship) against false negatives (under-enforcement), and where a platform strikes that balance reflects its value priorities, not a technical fact.

Threshold-setting is a values decision: higher thresholds mean more violations slip through; lower thresholds mean more legitimate speech is removed. Neither direction is objectively correct.
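
The tradeoff can be sketched in a few lines, assuming a hypothetical classifier that gives each post a violation score between 0 and 1. The scores and labels below are invented for illustration and are not platform data.

```python
# Hypothetical (score, is_violation) pairs; invented for illustration only.
SCORED_POSTS = [
    (0.95, True), (0.80, True), (0.65, False), (0.60, True),
    (0.40, False), (0.35, True), (0.20, False), (0.10, False),
]

def count_errors(posts, threshold):
    """Return (false_positives, false_negatives) if posts scoring >= threshold are removed."""
    false_positives = sum(1 for score, bad in posts if score >= threshold and not bad)
    false_negatives = sum(1 for score, bad in posts if score < threshold and bad)
    return false_positives, false_negatives

for threshold in (0.3, 0.7):
    fp, fn = count_errors(SCORED_POSTS, threshold)
    print(f"threshold={threshold}: {fp} legitimate posts removed, {fn} violations missed")
```

On this toy data, a threshold of 0.3 removes two legitimate posts and misses no violations, while 0.7 removes no legitimate posts and misses two violations. Nothing in the code says which outcome is preferable; that choice is the values decision.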

9. During the May 2021 Israeli-Palestinian conflict, Meta's moderation system was criticized for what pattern?

Correct — documented by Human Rights Watch; Meta acknowledged lower accuracy on Arabic content contributed to erroneous removals.

HRW documented high rates of removal of Arabic-language journalism and eyewitness accounts, which Meta acknowledged was partly due to lower model accuracy on Arabic.

10. Twitter's image cropping algorithm was removed in May 2021 because it demonstrated what bias?

Correct — Twitter's own analysis confirmed the racial and gendered disparities and concluded the "risks of harm are not acceptable."

Twitter's own analysis confirmed consistent favoritism of white faces over Black faces and gendered biases — both reflecting the web training data's societal biases.

11. The DSA Transparency Database requires platforms to submit moderation decision records within what timeframe?

Correct — the DSA requires records to be submitted within 24 hours of each decision.

The DSA Transparency Database requires submission within 24 hours of each moderation decision.

12. What did YouTube's 2017 counter-terrorism removal surge damage, making it a landmark accountability case?

Correct — over 300,000 videos were removed, including evidence of atrocities being used in international investigations, making this the canonical case for documentation carve-outs in moderation policy.

The Syrian Archive documented 300,000+ removed videos including war crimes documentation used by human rights organizations and international courts — the foundational case for why over-removal has humanitarian consequences.

13. The Oversight Board's ruling on Trump's suspension revealed what fundamental limitation of platform-internal governance?

Correct — Facebook maintained the suspension and updated its own policies to accommodate its decision, demonstrating that the board's rulings are binding in form but not in substance when platforms can revise their policies in response.

The board ruled, but Facebook maintained the suspension anyway and updated its policies to permit indefinite suspension. Even formal governance bodies cannot override platform authority when platforms can simply change their own rules.

14. University of Washington researchers found Facebook's hate speech classifier had what false positive rate disparity for AAVE content?

Correct — the 2021 UW paper documented AAVE false positive rates approximately three times those for Standard American English.

The UW 2021 research found AAVE false positive rates approximately three times higher than Standard American English — a significant disparate impact on AAVE speakers.
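
For readers who want the arithmetic behind "three times higher," here is a minimal sketch of how a false positive rate disparity is computed. The counts are invented to produce a 3x ratio and are not figures from the study.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """FPR = benign posts wrongly flagged / all benign posts."""
    return false_positives / (false_positives + true_negatives)

# Invented counts of benign posts per dialect group; not data from the study.
aave_fpr = false_positive_rate(false_positives=30, true_negatives=70)
sae_fpr = false_positive_rate(false_positives=10, true_negatives=90)

disparity = aave_fpr / sae_fpr
print(f"AAVE FPR {aave_fpr:.0%}, SAE FPR {sae_fpr:.0%}, ratio {disparity:.1f}x")
```

Both rates are measured only over benign posts, so the ratio describes how much more often harmless speech from one group is wrongly flagged.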

15. What did Tilburg University researchers find about Meta's submissions to the DSA Transparency Database?

Correct — compliance in form (submitting records) did not guarantee transparency in substance; the reason fields were insufficiently detailed for accountability purposes.

Tilburg researchers found ~95% of Meta's 1.5B+ records lacked meaningful "reason" field content — illustrating that formal DSA compliance doesn't guarantee substantive accountability.