On May 25, 2018, the day the GDPR took effect, Max Schrems and his nonprofit noyb filed complaints against Facebook, Google, Instagram, and WhatsApp within hours of midnight. His argument was precise: these platforms relied on "forced consent," meaning you could not use the service at all unless you accepted data collection. The option to say no did not exist in any meaningful sense. Within twelve hours of the regulation's start, the companies faced complaints carrying up to €7.6 billion in potential fines: €3.7 billion for Google and €3.9 billion for Facebook, Instagram, and WhatsApp combined.
What Schrems had named was not a technical violation. It was an architecture — a system designed so that the question of consent could never actually be answered in the negative.
In law and ethics, valid consent has four components: it must be informed (you know what you're agreeing to), voluntary (refusal must be a real option), specific (blanket consent to "anything we might do" doesn't count), and revocable (you can withdraw it). Digital platforms have historically undermined all four simultaneously.
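As a rough illustration, the four components can be written down as a checklist that a consent request either passes or fails. The sketch below is a teaching aid, not a legal test: the `ConsentRequest` fields and the list of catch-all purposes are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRequest:
    """Hypothetical description of how a service asked for consent."""
    purposes_disclosed: bool     # informed: the user was told what the data is for
    can_refuse_and_use: bool     # voluntary: saying no does not lock the user out
    purposes: tuple              # specific: each purpose is named individually
    withdrawal_available: bool   # revocable: consent can be withdrawn later


# Assumed catch-all labels that amount to blanket consent.
CATCH_ALL = {"any", "anything", "all"}


def failed_components(req):
    """Return the components of valid consent that this request fails."""
    failures = []
    if not req.purposes_disclosed:
        failures.append("informed")
    if not req.can_refuse_and_use:
        failures.append("voluntary")
    # An empty purpose list or a catch-all purpose is treated as blanket consent.
    if not req.purposes or any(p.lower() in CATCH_ALL for p in req.purposes):
        failures.append("specific")
    if not req.withdrawal_available:
        failures.append("revocable")
    return failures


forced = ConsentRequest(
    purposes_disclosed=True,
    can_refuse_and_use=False,
    purposes=("anything",),
    withdrawal_available=True,
)
print(failed_components(forced))
```

Here the request fails on voluntariness and specificity: the user was told what would happen and could withdraw later, but could not refuse and was asked to agree to everything at once.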
A 2008 Carnegie Mellon study by Aleecia McDonald and Lorrie Faith Cranor, widely reported in 2012, estimated that reading every privacy policy an average US internet user encounters in a year would take about 76 work days. The policies are long by design. Length lets a company record agreement while making informed agreement practically impossible.
AI systems compound this problem. Where a 2005 website might collect your email address and browsing history, a 2024 AI platform may train on the text of the prompts you submit, may use your conversations to improve future models, and may retain inferences about your beliefs, health status, and relationships, none of which you explicitly disclosed.
In 2022, Google's privacy policy update quietly extended the right to use profile photos in AI training. Users who had uploaded family photos years earlier had not consented to this secondary use. No notification was sent; the update was embedded in routine policy language. Consumer advocacy groups in the EU filed formal complaints under GDPR Article 6 (lawfulness of processing), arguing the original consent could not retroactively extend to AI training purposes not yet disclosed at the time of upload.
Modern data collection rarely happens in a single step. Your fitness app shares data with its analytics provider, which sells aggregate data to a health insurer's data broker, which sells enriched profiles to an employer background-check service. At each handoff, the original consent — "allow this app to track your steps" — is stretched further from its original meaning.
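One way to make this stretching visible is to compare the purpose attached to each handoff with the purposes the user actually agreed to. The chain below mirrors the fitness-app example; the `Handoff` type, the purpose labels, and the exact-match rule are hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    recipient: str
    purpose: str


def first_unconsented_handoff(consented_purposes, chain):
    """Return the first handoff whose purpose the user never agreed to, or None."""
    for handoff in chain:
        if handoff.purpose not in consented_purposes:
            return handoff
    return None


consented = {"step tracking"}
chain = [
    Handoff("fitness app", "step tracking"),
    Handoff("analytics provider", "usage analytics"),
    Handoff("data broker", "insurance risk profiling"),
    Handoff("background-check service", "employment screening"),
]

drift = first_unconsented_handoff(consented, chain)
print(drift)
```

In this toy chain the consent is already exceeded at the second hop, long before the data reaches a broker or an employer-facing service.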
AI systems sit at the end of many such chains. A large language model trained on scraped internet data contains, implicitly, the words of bloggers who never imagined being training data, forum posts from people who later deleted their accounts, and medical questions asked anonymously on patient support sites. The scrape may have been lawful; the consent was never sought.
The Common Crawl dataset — used in training GPT-3, LLaMA, and many other major models — contains approximately 3.1 trillion tokens scraped from the open web. No individual whose writing contributed to it gave consent for this purpose. Common Crawl itself is a nonprofit that makes no commercial use of the data, but the companies that train billion-dollar commercial models on it did not obtain separate consent from content creators.
Traditional data collection is transactional: a company takes X and uses X for purpose Y. AI training is transformative. A model trained on your data does not keep it as a retrievable record; it distills patterns from it into weights. You cannot be "deleted" from a trained model the way you can be deleted from a database. This makes the right to erasure (GDPR Article 17) technically complex and, in many cases, practically impossible to honor without retraining the entire model.
When Adobe updated its Terms of Service in June 2024 to include language allowing the company to access users' work via "automated techniques" to improve its AI products, users and professional photographers interpreted this as consent to train AI on their creative work. Adobe denied this interpretation, but the episode illustrated how ambiguous policy language — even unintentionally — can create a consent gap between what companies say they do and what users believe they are agreeing to.
The problem isn't that companies are necessarily acting in bad faith. It's that the existing architecture of consent — long policies, bundled agreements, retroactive updates — was designed for a world of simpler data transactions. AI broke that architecture, and we haven't built a replacement yet.
In this lab you'll interrogate the consent frameworks of real AI platforms by discussing specific cases with the AI. Your goal is to develop precise language for describing when and why consent fails in AI data collection — and what valid consent would look like.
Work through at least three exchanges. Identify a specific consent problem, explain which component of valid consent it violates, and propose a concrete design fix.
According to a 2012 New York Times account, a father walked into a Target store near Minneapolis to complain that his teenage daughter had been receiving coupons for cribs and maternity clothing. He demanded an apology from the manager. When the two next spoke, it was the father who apologized: his daughter was pregnant, and she hadn't told him. Target's statistical model had inferred her pregnancy from purchasing patterns (unscented lotion, calcium supplements, cotton balls) before she disclosed it to her own family.
Target's model assigned each customer a "pregnancy prediction score." It had been built on purchase data customers gave for an entirely different purpose: to get discounts on items they already wanted to buy. No one agreed to be pregnancy-scored.
There is a fundamental difference between what you disclose and what a system infers from that disclosure. You might share your location so an app can show local weather. The app — or its advertising partners — can then infer your commute pattern, your workplace, your religious attendance on Sunday mornings, your visits to medical facilities, and your approximate income bracket. None of these were shared; all were inferred.
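The gap between disclosure and inference takes very little code to reproduce. The sketch below uses a handful of made-up location pings and a deliberately crude rule: the most frequent place at night is treated as home, and the most frequent place during office hours as work. Real systems use far richer signals, and all names and data here are invented for illustration.

```python
from collections import Counter


def most_common_place(pings, hours):
    """Most frequent place among pings whose hour falls in `hours`, or None."""
    counts = Counter(place for hour, place in pings if hour in hours)
    return counts.most_common(1)[0][0] if counts else None


# (hour of day, coarse place label) pairs shared "for local weather".
pings = [
    (2, "Elm St"), (3, "Elm St"), (23, "Elm St"),
    (10, "Harbor Office Park"), (11, "Harbor Office Park"), (15, "Harbor Office Park"),
    (13, "Clinic on 5th"),
]
night = set(range(0, 6)) | {22, 23}
office_hours = set(range(9, 18))

# Nothing below was disclosed by the user; all of it is derived.
inferred = {
    "home": most_common_place(pings, night),
    "workplace": most_common_place(pings, office_hours),
}
print(inferred)
```

The user shared seven coordinates; the system now holds a home address, an employer, and a record of a clinic visit.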
AI dramatically scales this inference gap. Classical statistics could identify pregnancy from purchasing data with reasonable accuracy. Modern machine learning can infer sexual orientation from facial photographs (a 2017 Stanford study achieved 81% accuracy for men from a single photo), political affiliation from social media likes, and mental health status from typing patterns — none of which require you to disclose these attributes directly.
The consent problem is layered: you consented to share X; the system inferred Y from X without telling you; the inferred Y was then used for decisions affecting you. At what point in this chain was your consent sought for Y?
In 2014, Facebook published research in the Proceedings of the National Academy of Sciences showing that it had secretly manipulated the emotional tone of approximately 700,000 users' News Feeds for one week in 2012 to study emotional contagion. Users had not consented to participate in a psychological experiment. Facebook argued the manipulation was covered by its data use policy. Cornell University's IRB (which had partial involvement) later acknowledged the study raised serious ethical concerns. The US Senate Commerce Committee opened an inquiry. Facebook's defense — "it was in the terms of service" — became a landmark example of how platform consent frameworks can be stretched far beyond their intended scope.
Legal protections around protected characteristics — race, religion, health status, sexual orientation — mean very little if those attributes can be inferred from non-protected data and then acted upon. A lender cannot legally ask about your health. But if an AI model trained on purchasing data can infer chronic illness from pharmacy shopping patterns and use that signal — even implicitly — in a credit decision, the legal protection is bypassed through inference.
This problem has a name: attribute inference. It is closely related to re-identification. Researchers at the University of Texas at Austin showed in 2008 that Netflix ratings, shared voluntarily for movie recommendations, could be matched against public IMDb reviews to de-anonymize users and expose their apparent political views and sexual orientation. The data Netflix users shared was innocuous; what was extracted was not.
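The core of such a linkage can be sketched in a few lines: score each public profile by how many (title, rating) pairs it shares with an anonymous record and pick the best match. This is a simplified overlap count on invented data, not the weighted similarity measure used in the published research.

```python
def best_match(anonymous_record, public_profiles):
    """Return (name, overlap) for the public profile sharing the most (title, rating) pairs."""
    overlaps = {
        name: len(anonymous_record & ratings)
        for name, ratings in public_profiles.items()
    }
    name = max(overlaps, key=overlaps.get)
    return name, overlaps[name]


# Invented data: one "anonymized" ratings record and two public review profiles.
anonymous_record = {("Film A", 5), ("Film B", 2), ("Film C", 4), ("Film D", 1)}
public_profiles = {
    "reviewer_one": {("Film A", 5), ("Film C", 4), ("Film D", 1)},
    "reviewer_two": {("Film A", 3), ("Film B", 2)},
}

match = best_match(anonymous_record, public_profiles)
print(match)
```

Once the anonymous record is tied to a name, every other rating in it, including ones the person never posted publicly, is tied to that name too.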
Regulators are beginning to respond. The EU AI Act (2024) requires "high-risk" AI systems to document the data they process and the inferences they produce. But enforcement of inference-based discrimination is still nascent — regulators can audit what data is collected; auditing what is inferred is far harder.
Helen Nissenbaum, a philosopher at Cornell Tech, proposed that the right question isn't "was this information public?" but "does this use match the norms of the context in which it was shared?" Your medical information shared with a doctor flows appropriately to other treating physicians but not to employers. Your grocery purchases shared for discounts flow appropriately to inventory management but not to pregnancy scoring.
AI systems routinely violate contextual integrity by aggregating data from many contexts — medical, commercial, social — into a single model that serves entirely different purposes. The information was shared in specific contexts with specific implicit norms; the aggregation breaks all of them simultaneously.
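Contextual integrity can be approximated in code as a lookup: each sharing context carries a set of recipients the information may appropriately flow to, and any other flow is flagged. The norms table below is a hypothetical simplification built from the examples in the text; real norms are social and far less tidy.

```python
# Hypothetical norms: for each sharing context, the recipients a flow may appropriately reach.
NORMS = {
    "medical visit": {"treating physician"},
    "grocery loyalty card": {"discount engine", "inventory management"},
}


def violates_contextual_integrity(context, recipient):
    """True if the recipient falls outside the norms of the context the data was shared in."""
    return recipient not in NORMS.get(context, set())


flows = [
    ("medical visit", "treating physician"),
    ("medical visit", "employer"),
    ("grocery loyalty card", "inventory management"),
    ("grocery loyalty card", "pregnancy scoring model"),
]

violations = [flow for flow in flows if violates_contextual_integrity(*flow)]
print(violations)
```

A model trained on data pooled from both contexts would need every one of its downstream uses to pass this check for every source, which is why aggregation tends to break the norms of all contexts at once.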
The 2023 FTC report on commercial surveillance cited contextual integrity violations as a primary harm of the data broker industry, noting that AI-powered profiling companies now combine data from over 5,000 distinct sources to build individual profiles. Each source had its own consent context; none foresaw the profile.
In this lab you'll practice identifying inference gaps — where AI systems derive sensitive attributes from innocuous disclosed data. Choose a real AI application (recommendation systems, credit scoring, health apps, ad targeting) and map what is disclosed, what is inferred, and what consent gap exists.
Complete at least three exchanges. Your analysis should name: (1) the data disclosed, (2) the attribute inferred, (3) the purpose the inference serves, and (4) whether contextual integrity is violated.
On March 31, 2023, Italy's data protection authority — the Garante — ordered OpenAI to immediately stop processing Italian users' data and temporarily blocked ChatGPT in the country. The Garante cited multiple violations: no age verification, no legal basis for mass data collection for training, and the impossibility of correcting inaccurate personal information that the model had, in effect, memorized. ChatGPT had been producing false biographical claims about real Italian citizens — information it could not simply "correct" by updating a database record, because the incorrect information was baked into model weights.
OpenAI's response included new privacy controls and an opt-out for Italian users. The service was restored in April. But the episode crystallized a structural problem no engineering patch could fully solve: a trained language model is not a database. It does not store facts; it stores patterns. Removing a fact from a pattern requires changing the pattern — which means, at minimum, retraining.
The GDPR's "right to erasure" (Article 17) and the California Consumer Privacy Act's "right to delete" were written with databases in mind. In a relational database, a DELETE statement removes a row. In a trained neural network, there is no equivalent operation. The network's knowledge is distributed across billions of parameters; no single parameter encodes a single person's data.
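The contrast can be shown with a toy example using Python's built-in `sqlite3` module. The "model" here is a single number, the mean age fitted on all rows, which stands in for a network's weights; the table, names, and values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY, age REAL)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ana", 30.0), ("ben", 40.0), ("cai", 80.0)],
)

# "Training": a one-parameter model (the mean age) fitted on every row.
model_weight = conn.execute("SELECT AVG(age) FROM users").fetchone()[0]

# Erasure request: the row disappears from the database immediately.
conn.execute("DELETE FROM users WHERE name = ?", ("cai",))
rows_left = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# The already-fitted weight is unchanged; only retraining reflects the deletion.
retrained_weight = conn.execute("SELECT AVG(age) FROM users").fetchone()[0]

print(rows_left, model_weight, retrained_weight)
```

After the DELETE, the database holds no trace of the erased row, but `model_weight` still carries that row's influence until the model is refitted.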
The emerging field of machine unlearning attempts to solve this. Researchers at Google, Stanford, and elsewhere have developed techniques to reduce a model's reliance on specific training examples — retraining from a checkpoint, gradient ascent on "forgotten" data, and influence function methods that identify which parameters were most affected by particular training examples. But none of these methods offer the clean equivalence of deletion that GDPR assumes, and all are computationally expensive.
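To make the gradient-ascent idea concrete, here is a toy sketch on a one-parameter model `y = w * x`. It trains by gradient descent on all the data, then takes a few ascent steps on the example to be forgotten. The data, learning rate, and step counts are arbitrary assumptions, and nothing here amounts to a guarantee of erasure.

```python
def mse_gradient(w, data):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)


def loss(w, data):
    """Mean squared error of y = w * x on the given points."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


retain = [(1.0, 2.0), (2.0, 4.0)]  # points that follow y = 2x
forget = [(3.0, 9.0)]              # the point whose influence should be reduced

# Ordinary training: gradient descent on all the data.
w = 0.0
for _ in range(500):
    w -= 0.01 * mse_gradient(w, retain + forget)
w_trained = w

# "Unlearning": a few gradient ASCENT steps on the forget set only.
for _ in range(5):
    w += 0.01 * mse_gradient(w, forget)
w_unlearned = w

print(w_trained, w_unlearned)
print(loss(w_trained, forget), loss(w_unlearned, forget))
```

The ascent steps make the model fit the forgotten point worse, but they do not prove that the point's influence is gone, and too many steps would degrade the model on everything else. That is the gap between these techniques and the deletion a data subject expects.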
A 2023 paper from researchers at the University of Washington and Stanford found that even after applying machine unlearning techniques, models retained measurable residual knowledge about "deleted" individuals at rates between 3% and 28% depending on the method — far from the complete erasure a data subject might expect when exercising a legal right.
Clearview AI scraped over 30 billion facial images from social media platforms and built a facial recognition model sold to law enforcement. When Vermont's Attorney General filed suit in 2020 and plaintiffs in Illinois sued under the state's Biometric Information Privacy Act, Clearview faced a fundamental problem: even if it deleted a person's photos from its database, the neural network trained on those photos had already learned that person's facial geometry. The model's "memory" of a face persists even if the training image is deleted. Clearview settled multiple suits, agreeing to stop selling to most private companies in the US, but conceded it could not "un-train" its model on specific individuals. Data protection regulators in Australia and the UK ordered the deletion of collected data; Clearview complied with database deletions while acknowledging the model itself could not be similarly purged.
In 2023, researchers at Google DeepMind and collaborating institutions published a study demonstrating that ChatGPT (GPT-3.5 Turbo) could be prompted to reproduce verbatim training data — including personal information — through a technique as simple as asking the model to repeat a word indefinitely. The model would eventually "diverge" into memorized text, producing names, email addresses, phone numbers, and private content that had appeared in its training set.
This memorization problem has direct consent implications. People whose personal information appeared in web pages that were scraped into training data never consented to that information being permanently encoded into a model that could reproduce it on request. The information isn't stored in a deletable database; it's encoded in weights and can be elicited by anyone with API access.
The FTC opened an inquiry into OpenAI in July 2023 partly on these grounds, asking for documentation of what personal data the models were trained on and what steps were taken to prevent harmful outputs of personal information. OpenAI produced over a thousand pages of documentation.
The EU AI Act (in force since August 2024, with most obligations applying from August 2026) requires providers of general-purpose AI models to publish summaries of training data and to comply with copyright and data protection law, including erasure requests. But the Act stops short of specifying how erasure must be technically achieved, delegating that to future guidance from the European AI Office.
In the United States, the FTC has signaled through its "Algorithmic Accountability" framework that companies may be required to delete not just training data but also the models trained on illegally collected data, a position sometimes called "algorithmic disgorgement." In a 2021 settlement with Everalbum, a photo-sharing app that had trained facial recognition models without consent, the FTC required deletion of both the training images and the models built from them. The order is widely cited as an early instance of a US regulator requiring model deletion as a remedy.
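Disgorgement presupposes that a company can tell which models descend from which data. A minimal sketch of that bookkeeping is a lineage graph walked from the tainted source. The artifact names and the `LINEAGE` table below are hypothetical and not drawn from any actual case.

```python
from collections import deque

# Hypothetical lineage graph: artifact -> artifacts derived from it.
LINEAGE = {
    "photos_no_consent": ["face_embeddings_v1"],
    "face_embeddings_v1": ["face_model_v1"],
    "face_model_v1": ["face_model_v2_finetuned"],
    "licensed_stock_photos": ["face_model_v2_finetuned", "tagging_model"],
}


def artifacts_to_delete(tainted_source, lineage):
    """Breadth-first walk returning the tainted source and everything derived from it."""
    seen = {tainted_source}
    queue = deque([tainted_source])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


doomed = artifacts_to_delete("photos_no_consent", LINEAGE)
print(sorted(doomed))
```

Note that `face_model_v2_finetuned` is swept up even though it was also trained on licensed data, while `tagging_model` survives. Without lineage records of this kind, a deletion order cannot be scoped at all.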
Machine unlearning, differential privacy during training, and federated learning (training on data that never leaves users' devices) are the primary technical approaches being developed to make consent-compatible AI training possible at scale. None is yet a complete solution.
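Two of these ideas fit in a few lines each. The sketch below shows the Laplace mechanism applied to a counting query (a basic building block of differential privacy, simpler than the noisy-gradient methods used during model training) and the weighted averaging step at the heart of federated learning. The epsilon value, the data, and the function names are assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(values, predicate, epsilon, rng):
    """Noisy count. One person changes a count by at most 1, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)


def federated_average(client_updates):
    """Average client weights, weighted by example count; raw examples never reach the server."""
    total_examples = sum(n for _, n in client_updates)
    return sum(weight * n for weight, n in client_updates) / total_examples


rng = random.Random(0)  # seeded only so the sketch is repeatable
ages = [23, 35, 67, 71, 44]
noisy = dp_count(ages, lambda age: age >= 65, epsilon=1.0, rng=rng)

# Each tuple is (locally trained weight, number of local examples).
global_weight = federated_average([(1.0, 10), (3.0, 30)])

print(noisy, global_weight)
```

Smaller epsilon means more noise and stronger privacy at the cost of accuracy. In the federated step the server sees only weights, although weights themselves can still leak information, which is one reason neither technique is a complete solution on its own.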
The right to erasure collides with the architecture of trained neural networks. In this lab, you'll explore what a technically realistic consent framework for AI training might look like — and where current approaches fall short. Consider: machine unlearning, differential privacy, federated learning, and data minimization.
Complete at least three exchanges. Identify a specific type of AI system, propose a technical approach to making it more consent-compatible, and evaluate its limitations honestly.
In April 2021, Apple released iOS 14.5 with a feature called App Tracking Transparency. Every app that wanted to track users across other apps or websites was now required to display a prompt: "Allow [App] to track your activity across other companies' apps and websites?" The choices were binary: "Ask App Not to Track" or "Allow."
The result was dramatic. Within months, industry measurements found that 85% of US users chose not to be tracked when given a clear, low-friction choice. Facebook's parent Meta projected in February 2022 that ATT would cut its revenue that year by roughly $10 billion. The lesson was blunt: when consent is genuinely voluntary and clearly explained, most people decline. The prior consent architecture had not been designed for genuine refusal to be the common outcome.
The Apple ATT example shows that the design of a consent interface is not neutral. Deliberately confusing opt-out flows, pre-ticked consent boxes, and consent bundled with service access all produce artificially high consent rates. Real consent architecture has several identifiable features:
Granularity: Users can consent to some uses and refuse others — not all-or-nothing. Spotify's privacy settings allow separate choices for personalization, third-party sharing, and research use. This is closer to valid specific consent than a single "I agree."
Timing: Consent is sought before data collection, not retrospectively. Amazon's Alexa Skills now require developers to obtain user consent before accessing voice history — a requirement that was not present in Alexa's initial consent framework.
Plain language: The UK's Information Commissioner's Office has issued guidance requiring that consent requests be "as prominent as possible and separate from other terms." Ireland's DPC fined WhatsApp €225 million in 2021 partly because its privacy notice was insufficiently clear about the legal basis for processing — a case directly about whether users could understand what they were agreeing to.
Genuine revocability: Withdrawal of consent must be as easy as giving it. The GDPR's Article 7(3) makes this explicit, but many platforms still bury opt-outs in settings menus that take multiple steps to reach.
In 2024, Mozilla added a feature to Firefox called "Privacy-Preserving Attribution" (PPA), an experimental API that let websites measure ad conversions without tracking users across sites. The feature was enabled by default for all users without explicit notification. Privacy advocates criticized the move: even if the technology was more privacy-preserving than alternatives, enrolling users in an ad measurement system without opt-in consent violated the principle that consent should be active, not passive. Mozilla subsequently acknowledged the issue and added clearer disclosure. The episode illustrated that good technical intentions do not substitute for a consent process. Even a privacy improvement can be a consent violation if deployed without transparency.
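Several of these features translate directly into how a settings store is built: one flag per purpose (granularity), everything off until the user acts (timing, no pre-ticked boxes), and a revoke call exactly as simple as the grant call (revocability). The class below is a hypothetical sketch, not any platform's actual API.

```python
class ConsentSettings:
    """Hypothetical per-purpose consent store: default deny, one call to grant or revoke."""

    def __init__(self, purposes):
        # Nothing is pre-ticked: every purpose starts as refused.
        self._granted = {purpose: False for purpose in purposes}

    def grant(self, purpose):
        self._set(purpose, True)

    def revoke(self, purpose):
        # Withdrawal takes the same single call as granting.
        self._set(purpose, False)

    def allows(self, purpose):
        # Unknown purposes are never allowed.
        return self._granted.get(purpose, False)

    def _set(self, purpose, value):
        if purpose not in self._granted:
            raise KeyError(f"unknown purpose: {purpose}")
        self._granted[purpose] = value


settings = ConsentSettings(["personalization", "third-party sharing", "research use"])
settings.grant("personalization")
print(settings.allows("personalization"), settings.allows("third-party sharing"))
```

Because `allows` returns False for any purpose that was never listed, a new use such as AI training cannot silently inherit an older consent.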
Traditional consent is a one-time event. AI's ongoing learning creates a need for dynamic consent — frameworks where users can review, adjust, and withdraw consent as AI systems evolve and their data is used in new ways. Several implementations exist at scale:
The UK Biobank, which collects genetic and health data from 500,000 participants, uses a dynamic consent model where participants can log in to a portal and update their consent preferences — specifying which research uses they approve, which they withdraw, and receiving notifications when new uses are proposed. This is considered a gold standard in biomedical research.
Google's "My Ad Center" (launched 2022) allows users to see and adjust what topics their ad profile includes, turn off personalization by category, and review what data Google infers about them. While imperfect — it doesn't allow full opt-out from all inference — it is a meaningful step toward granular dynamic consent at consumer scale.
The IEEE's Ethically Aligned Design framework (first edition, 2019) recommends that AI systems provide "consent dashboards" giving users visibility into what data is held, what has been inferred, what decisions were made based on that data, and granular controls for each use. No major AI platform yet fully implements this, but it provides a normative target.
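A consent dashboard of this kind is, at minimum, a grouping of a user's event log into those views plus a set of per-use switches. The sketch below assumes a hypothetical log of `(kind, detail)` pairs; the event kinds and example entries are invented.

```python
from collections import defaultdict

VIEWS = ("held", "inferred", "decision")


def build_dashboard(events, controls):
    """Group (kind, detail) events into per-view lists and attach per-use controls."""
    grouped = defaultdict(list)
    for kind, detail in events:
        if kind not in VIEWS:
            raise ValueError(f"unknown event kind: {kind}")
        grouped[kind].append(detail)
    dashboard = {view: grouped[view] for view in VIEWS}
    dashboard["controls"] = dict(controls)
    return dashboard


events = [
    ("held", "purchase history"),
    ("inferred", "likely new parent"),
    ("decision", "shown baby-product ads"),
    ("held", "location pings"),
]
controls = {"ad personalization": False, "research use": True}

dashboard = build_dashboard(events, controls)
print(dashboard)
```

The hard part is not the grouping but the logging: a system can only show inferences and decisions it has recorded in the first place.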
The FTC's 2023 report on commercial surveillance identified three structural changes needed to make AI consent frameworks meaningful: first, a federal privacy law with a private right of action (so individuals can sue, not just regulators); second, data minimization requirements that limit what can be collected to what is genuinely necessary; and third, algorithmic transparency mandates that require AI systems to disclose not just what data they collect but what they infer and how those inferences are used in decisions.
The EU AI Act's Article 13 requires that high-risk AI systems be transparent enough that users can make informed decisions — including about whether to interact with the system at all. This is the first major legal framework to treat AI-system-level transparency as a precondition for valid consent.
The path from where consent frameworks stand today to where they need to be for AI is not primarily technical. The technical tools exist: differential privacy, federated learning, machine unlearning, consent dashboards, data minimization. What is lacking is the regulatory mandate and the economic incentive to deploy them at the cost of reduced data collection. Apple's ATT experiment suggests that when given clear choice, most people prefer privacy — and that genuine consent architecture might fundamentally change the economics of AI data collection.
In this capstone lab, you'll design a concrete consent framework for a real AI application. Your design must address: what data is collected and for which specific purposes, what is inferred and disclosed, how users revoke consent, and how the system handles existing trained models if consent is withdrawn.
Complete at least three exchanges. Reference at least one real regulatory standard (GDPR, EU AI Act, CCPA, FTC guidelines) and one technical mechanism (differential privacy, federated learning, machine unlearning, data minimization) in your design.