AI Risk for Business Leaders · Module 6 · Lesson 1

How AI Systems Consume and Expose Personal Data

Training pipelines, inference logs, and embedding stores create data exposure surfaces that legacy compliance frameworks never anticipated.

Within three weeks of Samsung Semiconductor permitting engineers to use ChatGPT for code assistance, three separate employees had pasted proprietary source code and confidential meeting notes directly into the chat interface. Samsung discovered the incidents only after an internal review. The data had already traversed OpenAI's servers. Samsung subsequently banned generative AI tools company-wide, a reactive measure that illustrated a fundamental misunderstanding: employees had treated an AI assistant as a private notepad, unaware that submitted text could be retained for model improvement.

The episode crystallised a governance gap that has since become one of the most urgent questions in enterprise AI: where exactly does your data go when it enters an AI system, and who controls it once it leaves your perimeter?

The AI Data Lifecycle

AI systems interact with personal and proprietary data across three distinct phases, each carrying its own exposure profile. Understanding these phases is the prerequisite for any meaningful data-risk programme.

Training phase. Foundation models are pre-trained on massive internet-sourced corpora. Researchers from Google DeepMind and EleutherAI have demonstrated that large language models memorise and can reproduce verbatim text fragments from training data, including personal emails, medical forum posts, and private documents that were publicly accessible at crawl time. A 2021 study by Carlini et al. extracted hundreds of memorised training examples from GPT-2 using targeted prompts. When your vendor says a model was trained on "public data," that data may have included personal information about your customers.

Fine-tuning and retrieval-augmented generation (RAG) phase. Enterprises increasingly fine-tune base models on proprietary datasets or connect them to internal knowledge bases via RAG pipelines. Every document fed into these systems becomes a potential retrieval target. If access controls on the underlying vector database are misconfigured, an AI assistant may surface confidential HR records or legal memos to users who should not have access, not through a security breach but through ordinary question-answering.

Inference phase. Every query submitted to a hosted AI service is an API call that may be logged, cached, or reviewed by vendor personnel. OpenAI's terms of service prior to March 2023 allowed conversation data to be used for model training unless users opted out, a default that most enterprise customers did not notice.
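One way to reduce what leaves the perimeter at inference time is to scrub obvious identifiers from a prompt before it reaches a hosted API. The following is a minimal Python sketch under stated assumptions: the two regular expressions and the `redact` helper are illustrative names invented here, and simple pattern matching is only a first filter, not a complete PII detector.

```python
import re

# Illustrative patterns only; real deployments need far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each match with a placeholder naming the data type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

prompt = "Customer jane.doe@example.com (SSN 123-45-6789) disputes her bill."
safe_prompt = redact(prompt)
print(safe_prompt)  # Customer [EMAIL] (SSN [SSN]) disputes her bill.
```

A filter like this catches only well-formatted identifiers. It does nothing for source code or meeting notes, which is why policy and training still matter.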

Data Types at Risk

Personally Identifiable Information (PII). Names, email addresses, social security numbers, and biometric data are the traditional targets of privacy law. AI systems compound the risk because they can generate synthetic PII that resembles real individuals, or infer sensitive attributes (health status, political views, sexuality) from apparently innocuous inputs.

Special category data. The EU's General Data Protection Regulation (GDPR) Article 9 designates health data, racial or ethnic origin, religious beliefs, and sexual orientation as requiring heightened protection. AI inference engines can derive these categories from behavioural signals; a loan-assessment model trained on zip codes may effectively proxy race without ever processing a racial category field.

Proprietary and trade-secret data. Source code, pricing models, M&A targets, and customer lists do not fall under privacy law but represent equally serious exposure. The Samsung incident involved this category: engineers were not violating privacy law, but they were violating trade-secret obligations.

Aggregated and inferred data. GDPR Recital 26 treats data as anonymised, and therefore outside the Regulation's scope, only if re-identification is not reasonably likely. AI models can re-identify individuals from supposedly anonymised datasets with high accuracy. A 2019 Nature Communications paper by Rocher, Hendrickx, and de Montjoye estimated that 99.98% of Americans could be correctly re-identified in any dataset using just 15 demographic attributes.
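The re-identification risk can be made concrete by counting how many records become unique once a few attributes are combined. This is a toy Python sketch, not the method of the cited paper: the four records, the attribute names, and the `unique_fraction` helper are all invented for illustration.

```python
from collections import Counter

# Toy "anonymised" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "02139", "birth_year": 1985, "gender": "F"},
    {"zip": "02139", "birth_year": 1985, "gender": "F"},
    {"zip": "02139", "birth_year": 1990, "gender": "M"},
    {"zip": "94105", "birth_year": 1972, "gender": "F"},
]

def unique_fraction(rows, keys):
    """Share of rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in rows)
    unique = sum(1 for r in rows if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(rows)

zip_only = unique_fraction(records, ["zip"])
all_three = unique_fraction(records, ["zip", "birth_year", "gender"])
print(zip_only, all_three)  # 0.25 0.5
```

Each added attribute can only increase the share of unique rows, which is why datasets with many demographic columns are hard to anonymise.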

Regulatory Flash Point

Italy's data protection authority, the Garante, suspended ChatGPT access for Italian residents in March 2023, citing a lack of lawful basis for processing personal data in training and insufficient age-verification mechanisms. OpenAI restored service within a month after implementing opt-out controls and a transparency page, but the episode demonstrated that regulators will apply existing privacy law to AI systems regardless of whether those systems were designed with such law in mind.

Embedding Stores and Vector Database Risk

The dominant enterprise AI architecture in 2024 pairs a foundation model with a vector database containing embeddings of internal documents. Embeddings are dense numerical representations of text; they are not the text itself, but they are not truly anonymised either. Researchers at Google have shown that embeddings can be partially "inverted": enough original text can be reconstructed from an embedding to constitute a privacy exposure.

The practical implication for business leaders: a Pinecone, Weaviate, or pgvector deployment housing your HR documents, legal contracts, or customer support transcripts is a data asset subject to the same access-control, retention, and audit requirements as the original documents. Many organisations that would never store unencrypted customer data in a public S3 bucket are deploying vector databases with default authentication settings and no data-retention policies.

Key Terms

Memorisation: The phenomenon by which a trained model reproduces fragments of its training data verbatim in response to prompts.

RAG (Retrieval-Augmented Generation): Architecture that connects a language model to an external knowledge base, retrieving relevant documents at inference time.

Vector database: A database that stores high-dimensional numerical embeddings of text or other content, enabling semantic similarity search.

Inference log: A record of queries and responses processed by a hosted AI service, potentially retained by the vendor.

Lesson 1 Quiz

3 questions: free, untracked, retake anytime.
1. In the 2023 Samsung incident, what was the primary category of data that employees inadvertently exposed by using ChatGPT?
✓ Correct. Samsung engineers pasted proprietary source code and internal meeting notes (trade-secret material), not personal data governed by privacy law.
✗ The Samsung incident involved proprietary source code and confidential meeting notes: trade-secret data, not customer PII, health records, or biometrics.
2. The Carlini et al. (2021) research on LLM memorisation demonstrated which of the following risks?
✓ Correct. Carlini et al. showed that targeted prompts could extract verbatim memorised training examples, including personal information, from models like GPT-2.
✗ The research showed verbatim reproduction of training data fragments through targeted prompting, not full embedding reconstruction or fine-tuning comparisons.
3. Why does the Italy Garante's March 2023 suspension of ChatGPT matter to business leaders outside Italy?
✓ Correct. The Garante applied existing GDPR provisions, not new AI-specific law, signalling that privacy frameworks already in force can constrain AI deployment globally.
✗ The Garante's action was significant because it applied existing GDPR law to AI, not because it created new data-localisation or training-data rules.

Lab 1: Mapping Your AI Data Exposure Surface

Identify where personal and proprietary data enters, travels through, and exits your AI systems.

Your Task

In this lab you will work with an AI advisor to map the data exposure surface of a realistic enterprise AI deployment. Consider a scenario where your organisation has deployed a customer-facing chatbot powered by a third-party LLM API, backed by a RAG pipeline over your internal CRM and knowledge base documents.

Ask the AI to help you identify specific data flows, categorise data types by risk level, and flag which phases of the AI lifecycle create the greatest exposure. Push for specifics: regulatory citations, real-world analogues, and concrete mitigation steps.

Try asking: "Walk me through every point where customer personal data could leave our control in a RAG-powered chatbot deployment, and rank those points by severity."
AI Risk for Business Leaders · Module 6 · Lesson 2

GDPR, CCPA, and the Global Privacy Patchwork

AI systems cross jurisdictions instantly. The privacy obligations that follow are neither consistent nor simple.

Clearview AI scraped more than three billion facial images from public social media platforms and trained a facial-recognition AI that it sold to law enforcement agencies. Between 2020 and 2023, data protection authorities in Italy, France, Greece, the UK, Australia, and Canada each issued enforcement actions against the company, imposing fines totalling tens of millions of euros and demanding deletion of data related to their residents. The company had no offices in any of these jurisdictions.

Clearview's experience crystallised a fundamental principle of modern privacy law: where your servers sit is irrelevant; what matters is whose data you process and where those individuals are located. A business leader deploying AI anywhere in the world must understand that their system's data-processing activities will be judged under the laws of every jurisdiction whose residents' data the system touches.

GDPR: The Extraterritorial Standard

The EU's General Data Protection Regulation, effective May 2018, sets the global benchmark. Its extraterritorial reach under Article 3 applies to any organisation processing EU residents' data, regardless of where the organisation is established. For AI systems, this has several concrete implications.

Lawful basis for processing. Every processing activity requires a lawful basis (Article 6). For AI training on customer data, the most commonly invoked bases are legitimate interests and consent. Consent must be freely given, specific, informed, and unambiguous; pre-ticked boxes and bundled consent do not satisfy the standard. Legitimate interests requires a balancing test demonstrating the processing does not override individuals' rights.

Purpose limitation. Data collected for one purpose cannot be repurposed for AI training without fresh analysis and, often, fresh consent. A customer service transcript collected to resolve a billing dispute cannot automatically become training data for a sentiment analysis model.

Data minimisation and storage limitation. AI systems have an appetite for data that directly conflicts with these principles. A model trained on more data is generally more capable; GDPR demands that organisations collect only what is necessary and delete it when no longer needed. The tension is structural.

Automated decision-making. Article 22 grants individuals the right not to be subject to solely automated decisions that significantly affect them, including profiling. Credit scoring, insurance pricing, and hiring screening via AI trigger this provision. The accompanying right to "meaningful information about the logic involved," set out in Articles 13 to 15, has been interpreted by courts to require more than a generic explanation of model type.

CCPA and the US State Patchwork

The California Consumer Privacy Act (2020) and its 2023 amendment, the California Privacy Rights Act (CPRA), introduced GDPR-adjacent rights to American consumers (access, deletion, portability, and opt-out of sale) but with important structural differences. Unlike GDPR, CCPA does not require a lawful basis for every processing activity; it relies instead on disclosure obligations and opt-out rights.

The CPRA specifically addresses automated decision-making: the California Privacy Protection Agency is developing regulations requiring businesses to conduct risk assessments for "significant decisions" made using personal information. These assessments will resemble GDPR's Data Protection Impact Assessments.

As of 2024, Virginia (VCDPA), Colorado (CPA), Connecticut, Texas, Montana, Oregon, and a dozen additional states have enacted comprehensive privacy laws. While they share a family resemblance, their opt-in versus opt-out defaults for sensitive data, their definitions of covered entities, and their enforcement mechanisms vary materially. A nationally deployed AI system must be mapped against this patchwork individually.

Enforcement Reality

In May 2023 the Irish Data Protection Commission, lead supervisory authority for Meta's EU operations, issued a €1.2 billion fine for transferring European users' personal data to US servers without adequate safeguards. The fine, the largest in GDPR history, related to Facebook's data transfers, not AI specifically. But it demonstrated that cross-border data flows underlying any AI training pipeline face the same scrutiny. The EU-US Data Privacy Framework, adopted in July 2023, provides a new transfer mechanism, but its legal durability is contested.

Sector-Specific Overlays

General privacy law is overlaid by sector-specific regimes that impose stricter controls. Healthcare organisations in the US must comply with HIPAA, which prohibits using protected health information for AI training without a valid authorisation or a formal de-identification procedure meeting the Safe Harbor or Expert Determination standard. A 2023 enforcement action by HHS found that a hospital network had used patient records to train a readmission-prediction model without adequate authorisation.

Financial services firms face GLBA restrictions on sharing nonpublic personal information. In the EU, financial data is additionally subject to PSD2 data-access rules. Credit reporting AI must comply with FCRA adverse-action notice requirements: if your AI denies credit, the applicant must receive a specific, human-interpretable reason, not a model output.

Children's data is the most heavily regulated category. COPPA in the US requires verifiable parental consent for collecting data from children under 13. The UK's Age Appropriate Design Code (the "Children's Code") requires privacy-by-default for services likely to be accessed by under-18s. Any AI system deployed in an educational context or a general consumer application faces these obligations.

Jurisdiction Matrix Principle

When assessing a new AI deployment, map data flows against four variables: (1) where data subjects are located, (2) where data is stored and processed, (3) where the organisation is established, and (4) what sector the data concerns. Each variable can independently trigger regulatory obligations, and in most real deployments multiple regimes apply simultaneously.
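The four-variable mapping can be sketched as a simple lookup. The Python below is a deliberately crude illustration, not legal advice: the `Deployment` class, the `candidate_regimes` function, the location labels, and the sector-to-statute table are assumptions made here, and the rules cover only regimes named in this lesson.

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    subject_locations: set      # (1) where data subjects are located
    processing_locations: set   # (2) where data is stored and processed
    establishment: str          # (3) where the organisation is established
    sectors: set                # (4) what sector the data concerns

# Hypothetical shorthand for the US sector overlays discussed above.
SECTOR_RULES_US = {"health": "HIPAA", "financial": "GLBA",
                   "credit": "FCRA", "children": "COPPA"}

def candidate_regimes(d: Deployment) -> set:
    """Return regimes worth reviewing; a starting list, not a legal conclusion."""
    regimes = set()
    if "EU" in d.subject_locations or d.establishment == "EU":
        regimes.add("GDPR")
        if d.processing_locations - {"EU"}:
            regimes.add("GDPR cross-border transfer review")
    if "California" in d.subject_locations:
        regimes.add("CCPA/CPRA")
    if {"US", "California"} & d.subject_locations:
        regimes |= {rule for sector, rule in SECTOR_RULES_US.items()
                    if sector in d.sectors}
    return regimes

loan_bot = Deployment({"EU", "California"}, {"US"}, "US", {"credit"})
print(sorted(candidate_regimes(loan_bot)))
```

Even this toy case returns four overlapping regimes for a single deployment, which is the point of the matrix principle.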

Lesson 2 Quiz

3 questions: free, untracked, retake anytime.
1. Clearview AI received enforcement actions from multiple countries despite having no offices there. Which legal principle explains why those actions were valid?
✓ Correct. GDPR Article 3 and similar provisions in other jurisdictions establish extraterritorial reach based on where data subjects reside, regardless of processor location.
✗ The valid principle is extraterritoriality: GDPR and equivalent laws apply based on the location of data subjects, not the location of the company processing their data.
2. A retail company collects customer transaction data to process purchases. It wants to use that same data to train a recommendation AI. Under GDPR, what is the primary legal issue?
✓ Correct. GDPR's purpose limitation principle (Article 5(1)(b)) requires that data only be used for the purpose for which it was collected; repurposing for AI training requires a compatibility assessment and often new consent.
✗ The key issue is purpose limitation. Data collected for one purpose (transaction processing) cannot automatically be used for a different purpose (AI training) without fresh legal justification.
3. A US financial services company uses an AI model to screen loan applications. Which regulation specifically requires that denied applicants receive a human-interpretable reason for the adverse decision?
✓ Correct. FCRA's adverse-action notice requirements predate AI but apply directly: when an AI model contributes to a credit denial, the applicant must receive a specific, actionable reason.
✗ The Fair Credit Reporting Act (FCRA) is the applicable US regulation. It requires specific adverse-action notices for credit denials, predating AI but fully applicable to AI-driven decisions.

Lab 2: Multi-Jurisdiction Compliance Analysis

Work through the regulatory obligations triggered by a cross-border AI deployment.

Your Task

Your organisation operates in the US, UK, EU, and Australia. You are deploying an AI-powered HR screening tool that processes employee and job-applicant data. Use this AI advisor to identify which privacy and employment regulations apply, what specific obligations they impose on AI-driven screening, and what your compliance checklist should include before go-live.

Be specific about your geography and use case: the more detail you provide, the more useful the analysis. Ask follow-up questions about GDPR Article 22, CCPA automated decision-making regulations, and any sector-specific rules you are uncertain about.

Try asking: "We are deploying an AI resume-screening tool used by HR in the US, UK, and Germany. Walk me through every privacy and employment law obligation this triggers in each jurisdiction."
AI Risk for Business Leaders · Module 6 · Lesson 3

Data Breach Amplification and AI-Specific Incident Vectors

AI systems do not just suffer breaches: they create new attack surfaces, accelerate exfiltration, and make harm harder to bound.

In February 2023, researcher Kevin Liu demonstrated that Microsoft's Bing Chat, powered by a GPT-4 variant, could be manipulated through "prompt injection": by embedding instructions in a webpage that Bing was asked to summarise, Liu caused the AI to reveal its confidential system prompt and adopt a different persona. Separately, researcher Marvin von Hagen extracted the full text of Bing Chat's system prompt, which Microsoft had explicitly instructed the model to keep secret, through social engineering-style prompting.

The incidents illustrated that AI systems create attack surfaces that have no direct analogue in traditional software: the model itself can be weaponised against the organisation that deployed it, disclosing confidential configuration, bypassing access controls, or being induced to exfiltrate data it has been granted access to retrieve.

AI-Specific Attack Vectors

Prompt injection. When an AI system processes external content (a webpage, a document, an email), that content can contain embedded instructions the model follows, overriding the developer's intended system prompt. In a RAG deployment, a malicious document in the knowledge base could instruct the AI to exfiltrate other retrieved documents to an attacker. The OWASP Top 10 for LLM Applications (2023) lists prompt injection as the number-one vulnerability for LLM-based systems.

Training data poisoning. If an attacker can influence the data used to train or fine-tune a model, they can embed a "backdoor": a trigger pattern that causes the model to behave maliciously when that pattern appears at inference time. A 2021 paper from the University of Maryland demonstrated poisoning attacks on sentiment classifiers with as little as 0.1% of training data corrupted. Organisations that fine-tune models on user-generated content or web-scraped data face this risk.

Model inversion and membership inference. An attacker with API access to a model can query it strategically to infer properties of the training data. Membership inference attacks determine whether a specific individual's data was used in training, a privacy violation even if no actual data is extracted. Model inversion attacks attempt to reconstruct training data from model outputs. Both techniques were demonstrated against clinical AI models in a 2021 study by Carlini et al., successfully inferring sensitive attributes of patients whose records were in the training set.

Sensitive data in model weights. Fine-tuned models can memorise sensitive training data in their weights. If a competitor or attacker obtains a copy of your fine-tuned model, whether through API extraction attacks or an insider threat, they may be able to recover proprietary information embedded during training.

Breach Amplification Effect

Traditional data breaches expose the records in the compromised database. AI breaches can expose more. A language model granted access to a customer database to answer support queries may, if manipulated, retrieve and surface records far beyond what any individual query should return. The March 2023 ChatGPT bug, confirmed by OpenAI, illustrated how the AI layer can amplify access beyond intended scope: users could briefly see the chat titles and, in some cases, messages of other users because of a caching bug in an open-source Redis client library.

Notification Obligations and Breach Bounding

GDPR Article 33 requires notification to the supervisory authority within 72 hours of becoming aware of a personal data breach "likely to result in a risk to the rights and freedoms of natural persons." Article 34 requires notification to affected individuals when the breach is likely to result in "high risk." These timelines were designed for traditional breaches where the scope is bounded by a database's record count.

AI breaches complicate bounding. If a language model has been exposed to prompt injection attacks over a period of weeks, it may be impossible to determine exactly which data was accessed, by whom, and what instructions were followed. Organisations must be able to answer the questions regulators will ask: what data was at risk, how many individuals are affected, and what is the potential harm? Without AI-specific logging of inputs, outputs, and retrieved documents, these questions may be unanswerable.

The Federal Trade Commission's breach notification rules under GLBA require financial institutions to notify customers "as soon as possible" and regulators within 30 days. The HHS breach notification rule requires notification within 60 days for HIPAA-covered entities. All of these regimes apply to AI systems that process the relevant data, but none were written with AI's distinctive breach characteristics in mind.
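Because the regimes above start their clocks from different events and run for different lengths, it can help to compute the deadlines the moment an incident is recognised. This is a minimal Python sketch: the window lengths are the ones stated in this lesson, while the `WINDOWS` labels and the `notification_deadlines` helper are illustrative names. The windows should be confirmed against current law before anyone relies on them.

```python
from datetime import datetime, timedelta

# Windows as stated in the lesson text; verify before operational use.
WINDOWS = {
    "GDPR Art. 33 (supervisory authority)": timedelta(hours=72),
    "GLBA (regulator)": timedelta(days=30),
    "HIPAA (HHS rule)": timedelta(days=60),
}

def notification_deadlines(aware_at: datetime) -> dict:
    """Map each regime to its latest notification time, counted from awareness."""
    return {regime: aware_at + window for regime, window in WINDOWS.items()}

aware = datetime(2024, 3, 1, 9, 0)
deadlines = notification_deadlines(aware)
for regime, due in deadlines.items():
    print(f"{regime}: notify by {due:%Y-%m-%d %H:%M}")
```

The shortest window governs the pace of the investigation, which is why scoping evidence has to be available within the first three days.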

Organisational Controls

Input and output logging. Every query to and response from a production AI system should be logged with sufficient detail to reconstruct what data was accessed and what was returned. Logs must be tamper-resistant and retained for a period consistent with the applicable breach investigation and notification timeline.
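One common way to make such a log tamper-evident is to chain entries together with hashes, so that altering any past record breaks every hash after it. The Python sketch below is an assumption-laden illustration: the field names and the `append_entry` and `verify` helpers are invented here, and timestamps, storage, and access control are omitted for brevity.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append_entry(log: list, query: str, response: str, doc_ids: list) -> dict:
    """Append a record whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"query": query, "response": response,
            "doc_ids": doc_ids, "prev": prev_hash}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry

def verify(log: list) -> bool:
    """Recompute every hash and check that each entry links to the one before."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True

audit_log = []
append_entry(audit_log, "What is our refund policy?", "Refunds within 30 days.", ["kb-12"])
append_entry(audit_log, "Show my last invoice", "Invoice attached.", ["crm-881"])
intact = verify(audit_log)                        # True: chain is consistent
audit_log[0]["response"] = "edited after the fact"
tamper_detected = not verify(audit_log)           # True: edit breaks the chain
```

Recording the retrieved document IDs alongside each query and response is what later allows an investigator to say which records a manipulated model actually touched.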

Least-privilege data access. A RAG pipeline should retrieve only the minimum documents necessary to answer a query, and the retrieval system should enforce the same access controls as the underlying data store. An AI assistant available to all employees should not have retrieval access to executive compensation data or M&A documents simply because those documents exist in the same vector database.
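The principle can be sketched as a retrieval step that filters by permission before it ranks by similarity. The Python below is a toy illustration: the in-memory `STORE`, the two-dimensional vectors, the group names, and the `retrieve` function are all hypothetical and stand in for whatever vector database and identity system a deployment actually uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical store: each embedding carries the groups allowed to read its source.
STORE = [
    {"id": "handbook", "vec": [0.9, 0.1], "allowed": {"all-staff"}},
    {"id": "exec-comp", "vec": [0.8, 0.2], "allowed": {"hr-leadership"}},
    {"id": "it-faq", "vec": [0.1, 0.9], "allowed": {"all-staff"}},
]

def retrieve(query_vec, user_groups, k=2):
    """Filter by permission first, then rank the remainder by similarity."""
    visible = [d for d in STORE if d["allowed"] & user_groups]
    ranked = sorted(visible, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]

staff_view = retrieve([1.0, 0.0], {"all-staff"})
hr_view = retrieve([1.0, 0.0], {"all-staff", "hr-leadership"})
print(staff_view)  # ['handbook', 'it-faq']
print(hr_view)     # ['handbook', 'exec-comp']
```

Filtering before ranking matters: if the model never receives a restricted document, no prompt can persuade it to reveal that document.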

Adversarial testing. AI systems should be subject to red-team exercises designed to probe prompt injection, jailbreaking, and data exfiltration vulnerabilities before deployment and on a recurring basis thereafter. This is now a requirement under the EU AI Act for high-risk systems and an expectation articulated by the US NIST AI Risk Management Framework.

The 72-Hour Problem

GDPR's 72-hour breach notification window was designed when a "breach" meant extracting a database. AI incidents (a prompt injection attack, or a weeks-long session in which a model surfaced data it shouldn't have) may not even be detectable within 72 hours without purpose-built AI logging infrastructure. Organisations that cannot detect and scope an AI incident within 72 hours are already in a GDPR compliance gap before any breach occurs.

Lesson 3 Quiz

3 questions: free, untracked, retake anytime.
1. In a RAG-based enterprise AI deployment, prompt injection is most dangerous because:
✓ Correct. In RAG systems, documents in the knowledge base can contain embedded instructions the model follows, potentially causing it to retrieve and expose additional data to an attacker.
✗ Prompt injection in RAG systems works by embedding instructions in documents the model retrieves, causing it to follow those instructions, including potentially exfiltrating other retrieved data.
2. What made the March 2023 ChatGPT bug particularly illustrative of AI-specific breach risk?
✓ Correct. The Redis caching bug briefly exposed chat titles and messages between users, illustrating how infrastructure supporting an AI system can amplify cross-user data exposure.
✗ The March 2023 incident was a Redis caching bug that caused users to see other users' chat content, showing how AI infrastructure can expose data beyond its intended scope.
3. Why does GDPR's 72-hour breach notification requirement create a particular governance challenge for organisations running AI systems?
✓ Correct. AI incidents often lack the clear "database extracted" signal of traditional breaches. Without AI-specific input/output logging, organisations may not know a breach occurred, let alone its scope, within the 72-hour window.
✗ The challenge is detectability and scoping: AI incidents like prompt injection may not be discoverable or quantifiable within 72 hours without AI-specific logging, creating a compliance gap before any breach occurs.

Lab 3: AI Incident Response Planning

Build a breach detection and response protocol for AI-specific attack vectors.

Your Task

Your organisation has discovered that a prompt injection attack may have caused your customer-facing AI assistant to retrieve and surface customer records beyond what the querying user was authorised to see. You do not yet know when the attack started, how many customers are affected, or what data was exposed. You have 72 hours before GDPR notification obligations may apply.

Use this AI advisor to build your immediate response protocol. Cover detection and scoping steps, internal escalation, regulatory notification decision trees, and customer communication drafts. Ask specifically about what logging evidence you need and how to assess whether the "high risk" threshold for Article 34 individual notification is met.

Try asking: "We suspect a prompt injection attack on our customer AI assistant starting up to 3 weeks ago. Walk me through the first 72 hours of incident response including how we scope the breach and determine GDPR notification obligations."
AI Risk for Business Leaders · Module 6 · Lesson 4

Privacy by Design and Governance Frameworks for AI Data Risk

Compliance is a floor, not a ceiling. The organisations that manage AI data risk well build it into architecture, not into legal review after deployment.

In 2017, the UK Information Commissioner's Office found that the Royal Free London NHS Foundation Trust had provided 1.6 million identifiable patient records to DeepMind without adequate legal basis. The records were supplied to develop an acute kidney injury alert app called Streams. The ICO found that patients had not been told their data would be used this way, and that the lawful basis (direct care) did not extend to the development of a new clinical product. No fine was issued, but the Trust signed an undertaking to bring its processing into compliance.

The case became a landmark not because it involved AI explicitly, but because it illustrated the central principle that would define AI privacy governance for the next decade: the urgency of an AI project does not create a lawful basis for processing personal data, and regulators will not accept innovation value as a substitute for compliance. DeepMind subsequently published a commitment to separate its health data processing activities from its broader AI research business, a structural separation with direct implications for how organisations should govern AI data pipelines.

Privacy by Design: The Seven Foundational Principles

Privacy by Design, developed by Ann Cavoukian and now codified in GDPR Article 25, requires that privacy protections be embedded into systems from the outset rather than bolted on after the fact. For AI systems, this translates into specific architectural requirements.

Proactive, not reactive. Conduct a Data Protection Impact Assessment (DPIA) before deploying any AI system that is likely to result in high risk to individuals. GDPR Article 35 mandates DPIAs for systematic profiling, large-scale processing of sensitive data, and automated decision-making. Many organisations conduct DPIAs only when forced to by a regulator, at which point remediation is costly or impossible.

Privacy as the default. AI systems should collect the minimum data necessary to perform their function, not the maximum data available. In practice, this means defining the system's purpose narrowly before building the data pipeline, not after. A customer churn prediction model that genuinely requires only transaction frequency and recency should not ingest full transaction history by default simply because it is available.

Privacy embedded into design. Differential privacy, which adds calibrated statistical noise to training data or model outputs, is a mathematically grounded technique that allows models to be trained on sensitive data with provable privacy guarantees. Apple uses differential privacy for collecting usage statistics from iOS devices. Google applies it in products including Google Maps traffic data. The technique is not a panacea, but it is increasingly a regulatory expectation for high-risk AI deployments.
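The "calibrated noise" idea is easiest to see on a single statistic rather than a full training run. The sketch below applies the Laplace mechanism to a counting query in Python. It illustrates the general technique only, not how Apple or Google implement it; the `private_count` helper, the sample ages, and the choice of epsilon are assumptions for illustration.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """The difference of two exponential draws is Laplace-distributed."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """A count changes by at most 1 per person, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)

ages = [34, 71, 66, 45, 80, 29]          # toy data; true count of 65+ is 3
rng = random.Random(0)                   # seeded only to make the demo repeatable
noisy = private_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng)
print(round(noisy, 2))
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate answer and weaker privacy. Choosing that trade-off is a governance decision as much as a technical one.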

End-to-end security. Encryption of data at rest and in transit is table stakes; for AI systems, it must extend to the vector database, the inference API, and the logging infrastructure. Federated learning, in which models are trained on distributed data without centralising it, offers an architectural approach that reduces the exposure surface by keeping personal data on local devices or in local environments.

Governance Frameworks: NIST, ISO, and the EU AI Act

The US National Institute of Standards and Technology published the AI Risk Management Framework (AI RMF) in January 2023. Its four core functions (Govern, Map, Measure, Manage) provide a structured approach to AI risk that explicitly includes privacy. The Govern function requires organisations to establish accountability structures, policies, and procedures before deploying AI. The Map function includes identifying privacy risks alongside other categories of AI harm.

ISO/IEC 42001:2023, the international standard for AI management systems, requires organisations to identify legal obligations applicable to AI systems and maintain documented evidence of compliance. It is structurally similar to ISO 27001 for information security and is an increasingly common baseline for enterprise AI governance.

The EU AI Act, which entered into force in August 2024, imposes the most prescriptive requirements. High-risk AI systems, including those used in employment, credit, education, and critical infrastructure, must undergo a conformity assessment that explicitly covers data quality and data governance. Article 10 of the Act requires that training datasets be subject to data governance practices addressing the purposes of processing, possible biases, and the measures taken to ensure accuracy and statistical properties. This is Privacy by Design codified as hard law.

The Vendor Due Diligence Imperative

When you deploy a third-party AI model, the data-processing activities of your vendor become your compliance problem. GDPR Article 28 requires a Data Processing Agreement (DPA) with every vendor that processes personal data on your behalf. The DPA must specify the subject matter, duration, nature, and purpose of processing; the type of personal data; categories of data subjects; and the obligations and rights of the controller. Many AI vendor agreements in 2024 remain inadequate on these dimensions. The UK ICO has published a checklist for DPAs with AI vendors that has become a practical standard for procurement teams.
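A procurement team could turn the Article 28 list above into a simple completeness check. The sketch below is a hypothetical helper, not an official checklist: the key names are invented for illustration, and a passing result means only that each term is present, not that it is legally adequate.

```python
from typing import Dict, List, Optional

# Terms named in the paragraph above; the key names are hypothetical.
REQUIRED_DPA_TERMS = (
    "subject_matter",
    "duration",
    "nature_of_processing",
    "purpose_of_processing",
    "personal_data_types",
    "data_subject_categories",
    "controller_obligations_and_rights",
)


def missing_dpa_terms(dpa: Dict[str, Optional[str]]) -> List[str]:
    """Return the required terms that are absent or blank in a DPA summary."""
    return [
        term for term in REQUIRED_DPA_TERMS
        if not (dpa.get(term) or "").strip()
    ]


if __name__ == "__main__":
    draft = {
        "subject_matter": "Customer-service chatbot",
        "duration": "24 months",
        "purpose_of_processing": "Answering support queries",
        "personal_data_types": "   ",  # blank entries count as missing
    }
    for term in missing_dpa_terms(draft):
        print("Missing from DPA:", term)
```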

Building an AI Data Governance Programme

AI data inventory. Maintain a register of every AI system, the personal and proprietary data it processes, the legal basis for that processing, the vendor or internal team responsible, and the applicable retention period. This is the AI-specific extension of the GDPR Article 30 Records of Processing Activities (RoPA).
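One way to picture such a register is as a structured record per AI system, with an automated check for gaps. The following is a minimal sketch under stated assumptions: the field names, the example systems, and the rule that a retention period must be a positive number of days are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AISystemRecord:
    """One register entry, mirroring the fields described above."""
    system_name: str
    data_categories: List[str]   # personal and proprietary data processed
    legal_basis: str             # e.g. "contract", "legitimate interests"
    responsible_party: str       # vendor or internal team
    retention_days: int


def incomplete_fields(record: AISystemRecord) -> List[str]:
    """List the fields an auditor would flag as missing for one record."""
    gaps = []
    if not record.data_categories:
        gaps.append("data_categories")
    if not record.legal_basis.strip():
        gaps.append("legal_basis")
    if not record.responsible_party.strip():
        gaps.append("responsible_party")
    if record.retention_days <= 0:
        gaps.append("retention_days")
    return gaps


def audit_register(register: List[AISystemRecord]) -> Dict[str, List[str]]:
    """Map each incomplete system to its missing fields; complete systems are omitted."""
    findings = {}
    for record in register:
        gaps = incomplete_fields(record)
        if gaps:
            findings[record.system_name] = gaps
    return findings


if __name__ == "__main__":
    register = [
        AISystemRecord("support-chatbot", ["customer emails"], "contract", "Vendor A", 90),
        AISystemRecord("cv-screening", ["applicant CVs"], "", "HR analytics team", 0),
    ]
    print(audit_register(register))
```

A spreadsheet can hold the same information; the point of a structured record is that gaps such as a missing legal basis can be detected automatically rather than during an audit.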

Data subject rights operationalisation. AI systems complicate the fulfilment of data subject rights. A deletion request under GDPR Article 17 raises the question of whether you must retrain the model, or apply machine unlearning, to remove an individual's data from its weights. This remains an active area of technical and legal debate. The UK ICO's 2023 guidance on machine unlearning acknowledges that complete technical compliance may be impractical, but requires organisations to document the limits and compensating measures.

Continuous monitoring. Privacy risk is not static. A model that was compliant at deployment may drift in ways that create new exposure: through fine-tuning on new data, changes in the query patterns it receives, or new research demonstrating previously unknown inference capabilities. Quarterly AI risk reviews, analogous to information security penetration tests, are becoming a governance expectation in regulated industries.

Training and culture. The Samsung incident was not a technology failure; it was a training and culture failure. Employees who understand that AI tools are not private notepads, that pasting proprietary code into a third-party API is equivalent to emailing it to a stranger, and that their queries may be retained and reviewed are significantly less likely to create inadvertent exposure incidents.

Leadership Takeaway

Privacy by Design is not a technical function delegated to engineers; it is a business decision made when the organisation chooses what problem the AI will solve, what data it will use, and which vendor will supply it. By the time an AI system reaches legal review or a DPIA, most of the architecturally significant privacy decisions have already been made. The business leaders who commission AI projects are making privacy decisions whether they recognise it or not.

Lesson 4 Quiz

3 questions: free, untracked, retake anytime.
1. The 2017 Royal Free / DeepMind case established which key principle for AI data governance?
✓ Correct. The ICO found that the direct-care lawful basis did not extend to developing a new product, establishing that innovation value is not a substitute for a valid lawful basis.
✗ The case established that the value or urgency of an AI project cannot substitute for a valid lawful basis. The "direct care" basis did not extend to product development.
2. Differential privacy is significant for AI data governance because it:
✓ Correct. Differential privacy provides a mathematically grounded guarantee: the added noise bounds what can be learned about any individual in the training data. Apple and Google use it in production systems.
✗ Differential privacy works by adding calibrated statistical noise, providing a mathematical bound on privacy loss. It does not work by blocking inclusion, guaranteeing GDPR compliance, or encrypting weights.
3. Under GDPR Article 28, when a company deploys a third-party AI model that processes customers' personal data, the company must:
✓ Correct. Article 28 requires a DPA specifying the subject matter, duration, nature, and purpose of processing. The deploying company remains the data controller and cannot outsource its compliance obligations.
✗ Article 28 requires a Data Processing Agreement with the vendor. The deploying organisation remains the data controller; it cannot transfer GDPR liability to the vendor simply by deploying their model.

Lab 4: Designing an AI Data Governance Programme

Build the policies, controls, and accountability structures that make AI data risk manageable.

Your Task

Your organisation is establishing a formal AI data governance programme. You need to present a board-level governance framework covering: AI data inventory requirements, DPIA triggers and process, vendor due diligence and DPA standards, data subject rights operationalisation for AI systems, and employee training requirements. Use this AI advisor to stress-test your framework against real regulatory requirements and documented failure modes.

Bring your own organisation's context β€” industry sector, geographic footprint, types of AI systems in use β€” and ask the advisor to tailor the framework to your specific risk profile. Challenge it to identify gaps you may not have considered.

Try asking: "We are a mid-size financial services firm operating in the US and UK deploying AI for credit scoring and customer service. Help me build a board-ready AI data governance framework that covers GDPR, CCPA, FCRA, and the EU AI Act."

Module 6 Test: Data Risk and Privacy

15 questions. Score 80% or above to pass the module.
1. In the 2023 Samsung incident, which category of information was primarily exposed to OpenAI's servers?
✓ Correct. Samsung engineers pasted proprietary source code and meeting notes (trade-secret category data) into ChatGPT.
✗ The Samsung incident involved proprietary source code and internal meeting notes, not customer PII, health records, or financial statements.
2. Which AI lifecycle phase creates data exposure risk through the memorisation of training examples that can be reproduced by targeted prompting?
✓ Correct. Memorisation occurs during training: model weights encode fragments of training data that can later be reproduced through targeted prompting, as demonstrated by Carlini et al.
✗ Memorisation occurs during training, when model weights encode information from the training corpus. This is a training-phase risk, not an inference or logging phase risk.
3. A vector database storing embeddings of your company's HR documents should be treated as:
✓ Correct. Embeddings can be partially inverted to reconstruct original text. They are not anonymised and must be governed with the same rigour as the underlying documents.
✗ Embeddings are not truly anonymised: researchers have demonstrated partial reconstruction. They must be treated as sensitive data assets with full access controls and retention policies.
4. GDPR's extraterritorial scope (Article 3) means that a US-based AI company processing EU residents' data must comply with GDPR:
✓ Correct. The Clearview AI enforcement actions confirmed that GDPR applies based on where data subjects are located, regardless of company establishment or server location.
✗ GDPR's extraterritorial reach applies based on where data subjects reside, not where the company is established or where data is stored.
5. Under GDPR's purpose limitation principle, customer transaction data collected for billing purposes can automatically be used to train a marketing AI model:
✓ Correct. Purpose limitation (Article 5(1)(b)) requires that data be collected for specified purposes and not further processed in a way that is incompatible with them. Repurposing for AI training requires a compatibility assessment and frequently new consent.
✗ False. Purpose limitation prevents automatic repurposing of data. A compatibility assessment is required, and often fresh consent or a new lawful basis must be established.
6. GDPR Article 22 gives individuals the right not to be subject to solely automated decisions that significantly affect them. Which enterprise AI use-case most clearly triggers this provision?
✓ Correct. Automated loan approval/rejection is a significant decision with direct legal and financial effects on individuals, which is exactly the scenario Article 22 targets.
✗ Article 22 targets solely automated decisions with significant effects. Automated loan approval/rejection, with its direct financial and legal consequences, is the clearest trigger.
7. What is the primary mechanism through which prompt injection attacks create data privacy risk in RAG-based AI deployments?
✓ Correct. Malicious instructions in retrieved documents can override developer-set system prompts, potentially causing the model to retrieve and send additional data to an attacker.
✗ Prompt injection in RAG works by embedding instructions in documents the model retrieves, causing those instructions to override the system prompt and potentially triggering data exfiltration.
8. Training data poisoning attacks require an attacker to:
✓ Correct. Poisoning attacks work by corrupting training data, even a small fraction of it (0.1% in demonstrated attacks), to embed backdoor behaviours triggered by specific patterns at inference.
✗ Poisoning attacks work by corrupting training data, not by accessing APIs, intercepting gradients, or requiring physical access. Even a tiny fraction of poisoned data can embed backdoor behaviours.
9. The March 2023 ChatGPT bug that exposed users' chat content to other users was caused by:
✓ Correct. The incident was an infrastructure bug (a Redis caching library vulnerability), demonstrating that AI data breaches can originate in supporting infrastructure, not just the model itself.
✗ The bug was in a Redis caching library, which is infrastructure supporting the AI system, not the model itself. This illustrates how AI-adjacent infrastructure can create data exposure incidents.
10. GDPR Article 33 requires breach notification to supervisory authorities within 72 hours. For AI systems, this timeline is particularly challenging because:
✓ Correct. Without AI-specific input/output logging, an organisation may not be able to detect that a prompt injection attack occurred or determine its scope within the 72-hour window.
✗ The challenge is that, without purpose-built logging, AI incidents may be impossible to detect or scope within 72 hours. Traditional security monitoring does not capture what data an AI model retrieved and returned.
11. Privacy by Design, as codified in GDPR Article 25, requires that privacy protections be:
✓ Correct. Article 25 requires data protection by design and by default: privacy considerations must be built in from the beginning, not reviewed or documented after deployment.
✗ Privacy by Design requires embedding protections from the outset, before development decisions are made, not after deployment or via documentation.
12. The 2017 Royal Free / DeepMind case was found by the ICO to violate data protection law because:
✓ Correct. The ICO found that direct care, a valid basis for processing in a clinical context, did not extend to product development, establishing the principle that AI use must fit within the original lawful basis.
✗ The ICO found that the "direct care" basis did not justify supplying records for product development, which is a different purpose requiring a different lawful basis.
13. Differential privacy is used by Apple and Google in production systems primarily because it:
✓ Correct. Differential privacy adds calibrated noise that provides a formal mathematical guarantee about the maximum privacy loss for any individual, enabling useful aggregate statistics while protecting individual records.
✗ Differential privacy's value is its mathematical guarantee: the added noise formally bounds what can be inferred about any individual, enabling data utility while providing provable privacy protection.
14. When deploying a third-party AI vendor that will process your customers' personal data, GDPR Article 28 requires your organisation to:
✓ Correct. Article 28 requires a DPA with every processor. Your organisation remains the data controller and cannot outsource GDPR liability to the vendor through the vendor agreement.
✗ Article 28 requires a Data Processing Agreement. Your organisation remains the data controller; you cannot transfer GDPR liability to a vendor simply by deploying their AI service.
15. A company receives a GDPR Article 17 "right to erasure" request from a customer whose data was used to train a fine-tuned AI model. The most accurate description of the company's obligation is:
✓ Correct. The UK ICO's 2023 guidance acknowledges that complete technical unlearning may be impractical, but requires organisations to document the limits of what is achievable and implement compensating measures.
✗ The ICO's guidance requires organisations to evaluate machine unlearning, document what is technically feasible, and implement compensating measures if complete erasure from model weights cannot be achieved. It does not allow the organisation to ignore the obligation, nor does it require immediate retraining.