Within three weeks of Samsung Semiconductor permitting engineers to use ChatGPT for code assistance, three separate employees had pasted proprietary source code and confidential meeting notes directly into the chat interface. Samsung discovered the incidents only after an internal review. The data had already traversed OpenAI's servers. Samsung subsequently banned generative AI tools company-wide, a reactive measure that illustrated a fundamental misunderstanding: employees had treated an AI assistant as a private notepad, unaware that submitted text could be retained for model improvement.
The episode crystallised a governance gap that has since become one of the most urgent questions in enterprise AI: where exactly does your data go when it enters an AI system, and who controls it once it leaves your perimeter?
AI systems interact with personal and proprietary data across three distinct phases, each carrying its own exposure profile. Understanding these phases is the prerequisite for any meaningful data-risk programme.
Training phase. Foundation models are pre-trained on massive internet-sourced corpora. Researchers from Google DeepMind and EleutherAI have demonstrated that large language models memorise and can reproduce verbatim text fragments from training data, including personal emails, medical forum posts, and private documents that were publicly accessible at crawl time. A 2021 study by Carlini et al. extracted hundreds of memorised training examples from GPT-2 using targeted prompts. When your vendor says a model was trained on "public data," that data may have included personal information about your customers.
Fine-tuning and retrieval-augmented generation (RAG) phase. Enterprises increasingly fine-tune base models on proprietary datasets or connect them to internal knowledge bases via RAG pipelines. Every document fed into these systems becomes a potential retrieval target. If access controls on the underlying vector database are misconfigured, an AI assistant may surface confidential HR records or legal memos to users who should not have access, not through a security breach, but through ordinary question-answering.
Inference phase. Every query submitted to a hosted AI service is an API call that may be logged, cached, or reviewed by vendor personnel. OpenAI's terms of service prior to March 2023 allowed conversation data to be used for model training unless users opted out, a default that most enterprise customers did not notice.
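One way to reduce inference-phase exposure is to redact obvious identifiers before a prompt leaves your perimeter. The following is a minimal sketch, not a complete detector: the `PII_PATTERNS` table and the `redact` helper are hypothetical names introduced here, and the two regular expressions cover only email addresses and US social security numbers in their most common written form.

```python
import re

# Illustrative patterns only (assumption): real deployments need broader,
# locale-aware detection and usually a dedicated PII-detection service.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with a placeholder and count what was removed."""
    counts: dict[str, int] = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts


prompt = "Customer jane.doe@example.com (SSN 123-45-6789) disputes her invoice."
safe_prompt, found = redact(prompt)
print(safe_prompt)  # Customer [EMAIL] (SSN [US_SSN]) disputes her invoice.
print(found)        # {'EMAIL': 1, 'US_SSN': 1}
```

The counts returned alongside the redacted text can feed an audit log, so that the organisation knows how often staff attempt to send identifiers to an external service.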
Personally Identifiable Information (PII). Names, email addresses, social security numbers, and biometric data are the traditional targets of privacy law. AI systems compound the risk because they can generate synthetic PII that resembles real individuals, or infer sensitive attributes (health status, political views, sexuality) from apparently innocuous inputs.
Special category data. The EU's General Data Protection Regulation (GDPR) Article 9 designates health data, racial or ethnic origin, religious beliefs, and sexual orientation as requiring heightened protection. AI inference engines can derive these categories from behavioural signals; a loan-assessment model trained on zip codes may effectively proxy race without ever processing a racial category field.
Proprietary and trade-secret data. Source code, pricing models, M&A targets, and customer lists do not fall under privacy law but represent equally serious exposure. The Samsung incident involved this category: engineers were not violating privacy law, but they were violating trade-secret obligations.
Aggregated and inferred data. GDPR Recital 26 places anonymised data outside the Regulation's scope only if re-identification is not reasonably likely. AI models can re-identify individuals from supposedly anonymised datasets with high accuracy. A 2019 Nature Communications paper by Rocher, Hendrickx, and de Montjoye estimated that 99.98% of Americans could be correctly re-identified in any dataset using 15 demographic attributes.
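The re-identification risk can be made concrete by measuring how many records become unique as demographic attributes are combined. The sketch below is an illustration on invented toy records, not the method of the paper cited above; `uniqueness_report` is a hypothetical helper that returns the smallest group size (the k in k-anonymity) and the fraction of records that are unique on the chosen columns.

```python
from collections import Counter


def uniqueness_report(records, quasi_identifiers):
    """Return (k, fraction_unique) for the chosen quasi-identifier columns."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    k = min(group_sizes.values())  # smallest group of indistinguishable records
    unique = sum(1 for key in keys if group_sizes[key] == 1)
    return k, unique / len(keys)


# Toy data (invented for illustration): names are removed, demographics remain.
records = [
    {"zip": "02139", "birth_year": 1984, "sex": "F"},
    {"zip": "02139", "birth_year": 1984, "sex": "F"},
    {"zip": "02139", "birth_year": 1990, "sex": "M"},
    {"zip": "94110", "birth_year": 1984, "sex": "F"},
]

print(uniqueness_report(records, ["zip"]))                       # (1, 0.25)
print(uniqueness_report(records, ["zip", "birth_year", "sex"]))  # (1, 0.5)
```

Even in this four-record example, adding two attributes doubles the share of records that single out one person, which is the dynamic that makes "anonymised" datasets fragile at scale.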
Italy's data protection authority, the Garante, suspended ChatGPT access for Italian residents in March 2023, citing a lack of lawful basis for processing personal data in training and insufficient age-verification mechanisms. OpenAI restored service within a month after implementing opt-out controls and a transparency page, but the episode demonstrated that regulators will apply existing privacy law to AI systems regardless of whether those systems were designed with such law in mind.
The dominant enterprise AI architecture in 2024 pairs a foundation model with a vector database containing embeddings of internal documents. Embeddings are dense numerical representations of text; they are not the text itself, but they are not truly anonymised either. Researchers at Google have shown that embeddings can be partially "inverted": enough original text can be reconstructed from an embedding to constitute a privacy exposure.
The practical implication for business leaders: a Pinecone, Weaviate, or pgvector deployment housing your HR documents, legal contracts, or customer support transcripts is a data asset subject to the same access-control, retention, and audit requirements as the original documents. Many organisations that would never store unencrypted customer data in a public S3 bucket are deploying vector databases with default authentication settings and no data-retention policies.
Memorisation: The phenomenon by which a trained model reproduces fragments of its training data verbatim in response to prompts.
RAG (Retrieval-Augmented Generation): Architecture that connects a language model to an external knowledge base, retrieving relevant documents at inference time.
Vector database: A database that stores high-dimensional numerical embeddings of text or other content, enabling semantic similarity search.
Inference log: A record of queries and responses processed by a hosted AI service, potentially retained by the vendor.
In this lab you will work with an AI advisor to map the data exposure surface of a realistic enterprise AI deployment. Consider a scenario where your organisation has deployed a customer-facing chatbot powered by a third-party LLM API, backed by a RAG pipeline over your internal CRM and knowledge base documents.
Ask the AI to help you identify specific data flows, categorise data types by risk level, and flag which phases of the AI lifecycle create the greatest exposure. Push for specifics: regulatory citations, real-world analogues, and concrete mitigation steps.
Clearview AI scraped more than three billion facial images from public social media platforms and trained a facial-recognition AI that it sold to law enforcement agencies. Between 2020 and 2023, data protection authorities in Italy, France, Greece, the UK, Australia, and Canada each issued enforcement actions against the company, imposing fines totalling tens of millions of euros and demanding deletion of data related to their residents. The company had no offices in any of these jurisdictions.
Clearview's experience crystallised a fundamental principle of modern privacy law: where your servers sit is irrelevant; what matters is whose data you process and where those individuals are located. A business leader deploying AI anywhere in the world must understand that their system's data-processing activities will be judged under the laws of every jurisdiction whose residents' data the system touches.
The EU's General Data Protection Regulation, effective May 2018, sets the global benchmark. Its extraterritorial reach under Article 3 applies to any organisation processing EU residents' data, regardless of where the organisation is established. For AI systems, this has several concrete implications.
Lawful basis for processing. Every processing activity requires a lawful basis (Article 6). For AI training on customer data, the most commonly invoked bases are legitimate interests and consent. Consent must be freely given, specific, informed, and unambiguous; pre-ticked boxes and bundled consent do not satisfy the standard. Legitimate interests requires a balancing test demonstrating the processing does not override individuals' rights.
Purpose limitation. Data collected for one purpose cannot be repurposed for AI training without fresh analysis and, often, fresh consent. A customer service transcript collected to resolve a billing dispute cannot automatically become training data for a sentiment analysis model.
Data minimisation and storage limitation. AI systems have an appetite for data that directly conflicts with these principles. A model trained on more data is generally more capable; GDPR demands that organisations collect only what is necessary and delete it when no longer needed. The tension is structural.
Automated decision-making. Article 22 grants individuals the right not to be subject to solely automated decisions that significantly affect them, including profiling. Credit scoring, insurance pricing, and hiring screening via AI can trigger this provision. The accompanying right to "meaningful information about the logic involved" (Articles 13 to 15) has been interpreted by courts to require more than a generic explanation of model type.
The California Consumer Privacy Act (2020) and its 2023 amendment, the California Privacy Rights Act (CPRA), introduced GDPR-adjacent rights to California consumers (access, deletion, portability, and opt-out of sale) but with important structural differences. Unlike GDPR, CCPA does not require a lawful basis for every processing activity; it relies instead on disclosure obligations and opt-out rights.
The CPRA specifically addresses automated decision-making: the California Privacy Protection Agency is developing regulations requiring businesses to conduct risk assessments for "significant decisions" made using personal information. These assessments will resemble GDPR's Data Protection Impact Assessments.
As of 2024, Virginia (VCDPA), Colorado (CPA), Connecticut, Texas, Montana, Oregon, and a dozen additional states have enacted comprehensive privacy laws. While they share a family resemblance, their opt-in versus opt-out defaults for sensitive data, their definitions of covered entities, and their enforcement mechanisms vary materially. A nationally deployed AI system must be mapped against this patchwork individually.
In May 2023 the Irish Data Protection Commission (lead supervisory authority for Meta's EU operations) issued a €1.2 billion fine for transferring European users' personal data to US servers without adequate safeguards. The fine, the largest in GDPR history, related to Facebook's data transfers, not AI specifically. But it demonstrated that cross-border data flows underlying any AI training pipeline face the same scrutiny. The EU-US Data Privacy Framework, adopted in July 2023, provides a new transfer mechanism, but its legal durability is contested.
General privacy law is overlaid by sector-specific regimes that impose stricter controls. Healthcare organisations in the US must comply with HIPAA, which prohibits using protected health information for AI training without a valid authorisation or a formal de-identification procedure meeting the Safe Harbor or Expert Determination standard. A 2023 enforcement action by HHS found that a hospital network had used patient records to train a readmission-prediction model without adequate authorisation.
Financial services firms face GLBA restrictions on sharing nonpublic personal information. In the EU, financial data is additionally subject to PSD2 data-access rules. Credit reporting AI must comply with FCRA adverse-action notice requirements: if your AI denies credit, the applicant must receive a specific, human-interpretable reason, not a model output.
Children's data is the most heavily regulated category. COPPA in the US requires verifiable parental consent for collecting data from children under 13. The UK's Age Appropriate Design Code (the "Children's Code") requires privacy-by-default for services likely to be accessed by under-18s. Any AI system deployed in an educational context or a general consumer application faces these obligations.
When assessing a new AI deployment, map data flows against four variables: (1) where data subjects are located, (2) where data is stored and processed, (3) where the organisation is established, and (4) what sector the data concerns. Each variable can independently trigger regulatory obligations, and in most real deployments, multiple regimes apply simultaneously.
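The four-variable mapping can be sketched as a small lookup that returns the regimes a data flow should be reviewed against. Everything here is an assumption made for illustration: the `DataFlow` structure, the `candidate_regimes` function, the location codes, and especially the two rule tables, which are deliberately tiny and are a starting list for counsel rather than a legal determination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    subject_locations: frozenset     # (1) where data subjects are located
    processing_locations: frozenset  # (2) where data is stored and processed
    establishment: str               # (3) where the organisation is established
    sector: str                      # (4) what sector the data concerns


# Hypothetical, heavily simplified rule tables (assumption, not legal advice).
SUBJECT_RULES = {"EU": "GDPR", "UK": "UK GDPR", "US-CA": "CCPA/CPRA"}
SECTOR_RULES = {("health", "US"): "HIPAA", ("finance", "US"): "GLBA"}


def candidate_regimes(flow: DataFlow) -> list:
    """List regimes that each variable independently brings into scope."""
    regimes = {SUBJECT_RULES[loc] for loc in flow.subject_locations
               if loc in SUBJECT_RULES}
    if flow.establishment in SUBJECT_RULES:
        regimes.add(SUBJECT_RULES[flow.establishment])
    for (sector, country), regime in SECTOR_RULES.items():
        if flow.sector == sector and flow.establishment.startswith(country):
            regimes.add(regime)
    # EU or UK residents' data processed elsewhere raises a transfer question.
    if (flow.subject_locations & {"EU", "UK"}
            and not flow.processing_locations <= {"EU", "UK"}):
        regimes.add("cross-border transfer review")
    return sorted(regimes)


flow = DataFlow(
    subject_locations=frozenset({"EU", "US-CA"}),
    processing_locations=frozenset({"US"}),
    establishment="US",
    sector="health",
)
print(candidate_regimes(flow))
# ['CCPA/CPRA', 'GDPR', 'HIPAA', 'cross-border transfer review']
```

The example flow triggers four review items from a single deployment, which mirrors the point that multiple regimes usually apply at once.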
Your organisation operates in the US, UK, EU, and Australia. You are deploying an AI-powered HR screening tool that processes employee and job-applicant data. Use this AI advisor to identify which privacy and employment regulations apply, what specific obligations they impose on AI-driven screening, and what your compliance checklist should include before go-live.
Be specific about your geography and use-case: the more detail you provide, the more useful the analysis. Ask follow-up questions about GDPR Article 22, CCPA automated decision-making regulations, and any sector-specific rules you are uncertain about.
In February 2023, researcher Kevin Liu demonstrated that Microsoft's Bing Chat (powered by a GPT-4 variant) could be manipulated through "prompt injection": by instructing the model to ignore its previous instructions, Liu caused the AI to reveal its confidential system prompt. Separately, researcher Marvin von Hagen extracted the full text of Bing Chat's system prompt, which Microsoft had explicitly instructed the model to keep secret, through social engineering-style prompting.
The incidents illustrated that AI systems create attack surfaces that have no direct analogue in traditional software: the model itself can be weaponised against the organisation that deployed it, disclosing confidential configuration, bypassing access controls, or being induced to exfiltrate data it has been granted access to retrieve.
Prompt injection. When an AI system processes external content (a webpage, a document, an email), that content can contain embedded instructions the model follows, overriding the developer's intended system prompt. In a RAG deployment, a malicious document in the knowledge base could instruct the AI to exfiltrate other retrieved documents to an attacker. The OWASP Top 10 for LLM Applications (2023) lists prompt injection as the number-one vulnerability for LLM-based systems.
Training data poisoning. If an attacker can influence the data used to train or fine-tune a model, they can embed a "backdoor": a trigger pattern that causes the model to behave maliciously when that pattern appears at inference time. A 2021 paper from the University of Maryland demonstrated poisoning attacks on sentiment classifiers with as little as 0.1% of training data corrupted. Organisations that fine-tune models on user-generated content or web-scraped data face this risk.
Model inversion and membership inference. An attacker with API access to a model can query it strategically to infer properties of the training data. Membership inference attacks determine whether a specific individual's data was used in training, a privacy violation even if no actual data is extracted. Model inversion attacks attempt to reconstruct training data from model outputs. Both techniques have been demonstrated on health data: model inversion against a pharmacogenetic dosing model by Fredrikson et al. (2014), and membership inference against models trained on hospital discharge records by Shokri et al. (2017).
Sensitive data in model weights. Fine-tuned models can memorise sensitive training data in their weights. If a competitor or attacker obtains a copy of your fine-tuned model (through API extraction attacks or an insider threat), they may be able to recover proprietary information embedded during training.
Traditional data breaches expose the records in the compromised database. AI breaches can expose more. A language model granted access to a customer database to answer support queries may, if manipulated, retrieve and surface records far beyond what any individual query should return. The 2023 ChatGPT bug that briefly exposed users' chat history to other users, which OpenAI confirmed, illustrated how the AI layer can amplify access beyond intended scope. In March 2023, users could see the chat titles and, in some cases, messages of other users because of a caching bug in an open-source Redis client library.
GDPR Article 33 requires notification to the supervisory authority within 72 hours of becoming aware of a personal data breach "likely to result in a risk to the rights and freedoms of natural persons." Article 34 requires notification to affected individuals when the breach is likely to result in "high risk." These timelines were designed for traditional breaches where the scope is bounded by a database's record count.
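The 72-hour window runs from the moment of awareness, so incident tooling needs to record that timestamp and track the time remaining. A minimal sketch follows; the function names are hypothetical, and deciding when the organisation actually "became aware" remains a legal judgement that code cannot make.

```python
from datetime import datetime, timedelta, timezone

ARTICLE_33_WINDOW = timedelta(hours=72)


def notification_deadline(aware_at: datetime) -> datetime:
    """Latest time to notify the supervisory authority under Article 33."""
    if aware_at.tzinfo is None:
        # Naive timestamps make cross-region incident timelines ambiguous.
        raise ValueError("use a timezone-aware timestamp")
    return aware_at + ARTICLE_33_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline (negative once it has passed)."""
    return (notification_deadline(aware_at) - now).total_seconds() / 3600


aware = datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc)
now = datetime(2024, 3, 3, 9, 30, tzinfo=timezone.utc)
print(notification_deadline(aware).isoformat())  # 2024-03-04T09:30:00+00:00
print(hours_remaining(aware, now))               # 24.0
```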
AI breaches complicate bounding. If a language model has been exposed to prompt injection attacks over a period of weeks, it may be impossible to determine exactly which data was accessed, by whom, and what instructions were followed. Organisations must be able to answer the questions regulators will ask: what data was at risk, how many individuals are affected, and what is the potential harm? Without AI-specific logging of inputs, outputs, and retrieved documents, these questions may be unanswerable.
The Federal Trade Commission's breach notification rules under GLBA require financial institutions to notify customers "as soon as possible" and regulators within 30 days. The HHS breach notification rule requires notification within 60 days for HIPAA-covered entities. All of these regimes apply to AI systems that process the relevant data, but none were written with AI's distinctive breach characteristics in mind.
Input and output logging. Every query to and response from a production AI system should be logged with sufficient detail to reconstruct what data was accessed and what was returned. Logs must be tamper-resistant and retained for a period consistent with the applicable breach investigation and notification timeline.
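One common way to make such a log tamper-evident is to chain entries by hash, so that altering any earlier record invalidates everything after it. The sketch below assumes an in-memory list and hypothetical `append_entry` and `verify` helpers; a production log would also carry timestamps and user identifiers and would be written to append-only storage, since hash chaining detects tampering but does not prevent it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """Hash a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, query: str, response: str, retrieved_doc_ids) -> None:
    """Append a record linked to the hash of the previous record."""
    body = {
        "query": query,
        "response": response,
        "retrieved": list(retrieved_doc_ids),
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    log.append({**body, "hash": _digest(body)})


def verify(log: list) -> bool:
    """Recompute every hash and check each link in the chain."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, "What is my balance?", "Your balance is shown in the portal.", ["kb-12"])
append_entry(log, "Show refund policy", "Refunds take 5 days.", ["kb-7", "kb-9"])
print(verify(log))  # True

log[0]["retrieved"] = []  # simulate someone hiding which documents were surfaced
print(verify(log))  # False
```

Recording the retrieved document identifiers alongside each query is what later allows an investigator to say which records were exposed, not merely that an incident occurred.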
Least-privilege data access. A RAG pipeline should retrieve only the minimum documents necessary to answer a query, and the retrieval system should enforce the same access controls as the underlying data store. An AI assistant available to all employees should not have retrieval access to executive compensation data or M&A documents simply because those documents exist in the same vector database.
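Least-privilege retrieval can be sketched as a permission filter applied before ranking, so that documents a user may not read never reach the model's context. The `Chunk` structure, the group names, and the `retrieve` function are assumptions for illustration, and the similarity scores are stand-in values rather than output from a real vector search; in practice the same filter is usually pushed down into the vector store's own metadata filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read the source document
    score: float               # similarity score (stand-in value)


def retrieve(candidates, user_groups, top_k=3):
    """Drop chunks the user may not read, then rank what remains."""
    permitted = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(permitted, key=lambda c: c.score, reverse=True)[:top_k]


candidates = [
    Chunk("handbook", "Holiday policy ...", frozenset({"all-staff"}), 0.71),
    Chunk("exec-comp", "Executive bonus plan ...", frozenset({"hr-leadership"}), 0.93),
    Chunk("ma-memo", "Acquisition target ...", frozenset({"legal"}), 0.88),
]

results = retrieve(candidates, user_groups={"all-staff"})
print([c.doc_id for c in results])  # ['handbook']
```

The highest-scoring chunks are the restricted ones, which is why filtering after ranking or relying on the model to withhold content is not an adequate control.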
Adversarial testing. AI systems should be subject to red-team exercises designed to probe prompt injection, jailbreaking, and data exfiltration vulnerabilities before deployment and on a recurring basis thereafter. This is now a requirement under the EU AI Act for high-risk systems and an expectation articulated by the US NIST AI Risk Management Framework.
GDPR's 72-hour breach notification window was designed when a "breach" meant extracting a database. AI incidents (a prompt injection attack, a weeks-long session where a model surfaced data it shouldn't have) may not even be detectable within 72 hours without purpose-built AI logging infrastructure. Organisations that cannot detect and scope an AI incident within 72 hours are already in a GDPR compliance gap before any breach occurs.
Your organisation has discovered that a prompt injection attack may have caused your customer-facing AI assistant to retrieve and surface customer records beyond what the querying user was authorised to see. You do not yet know when the attack started, how many customers are affected, or what data was exposed. You have 72 hours before GDPR notification obligations may apply.
Use this AI advisor to build your immediate response protocol. Cover detection and scoping steps, internal escalation, regulatory notification decision trees, and customer communication drafts. Ask specifically about what logging evidence you need and how to assess whether the "high risk" threshold for Article 34 individual notification is met.
In 2017, the UK Information Commissioner's Office found that the Royal Free London NHS Foundation Trust had provided 1.6 million identifiable patient records to DeepMind without adequate legal basis. The records were supplied to develop an acute kidney injury alert app called Streams. The ICO found that patients had not been told their data would be used this way, and that the lawful basis relied on, direct care, did not extend to the development of a new clinical product. No fine was issued, but the Trust signed an undertaking to bring its processing into compliance.
The case became a landmark not because it involved AI explicitly, but because it illustrated the central principle that would define AI privacy governance for the next decade: the urgency of an AI project does not create a lawful basis for processing personal data, and regulators will not accept innovation value as a substitute for compliance. DeepMind subsequently published a commitment to separate its health data processing activities from its broader AI research business, a structural separation with direct implications for how organisations should govern AI data pipelines.
Privacy by Design, developed by Ann Cavoukian and now codified in GDPR Article 25, requires that privacy protections be embedded into systems from the outset rather than bolted on after the fact. For AI systems, this translates into specific architectural requirements.
Proactive, not reactive. Conduct a Data Protection Impact Assessment (DPIA) before deploying any AI system that is likely to result in high risk to individuals. GDPR Article 35 mandates DPIAs for systematic profiling, large-scale processing of sensitive data, and automated decision-making. Many organisations conduct DPIAs only when forced to by a regulator, at which point remediation is costly or impossible.
Privacy as the default. AI systems should collect the minimum data necessary to perform their function, not the maximum data available. In practice, this means defining the system's purpose narrowly before building the data pipeline, not after. A customer churn prediction model that genuinely requires only transaction frequency and recency should not ingest full transaction history by default simply because it is available.
Privacy embedded into design. Differential privacy, which adds calibrated statistical noise to training data or model outputs, is a mathematically grounded technique that allows models to be trained on sensitive data with provable privacy guarantees. Apple uses differential privacy for collecting usage statistics from iOS devices. Google applies it in products including Google Maps traffic data. The technique is not a panacea, but it is increasingly a regulatory expectation for high-risk AI deployments.
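The simplest instance of "calibrated statistical noise" is the Laplace mechanism applied to a counting query, sketched below. This illustrates noise on an output statistic rather than differentially private model training, and the helper names `laplace_noise` and `private_count`, the sample ages, and the choice of epsilon are assumptions made for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count satisfying epsilon-differential privacy."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


ages = [34, 71, 68, 45, 80, 29, 66, 52]  # invented sample data
rng = random.Random(42)                  # seeded only to make the demo repeatable

# Smaller epsilon means more noise and stronger privacy.
noisy = private_count(ages, lambda age: age >= 65, epsilon=0.5, rng=rng)
print(round(noisy, 2))
```

The noise scale is the query's sensitivity divided by epsilon, so the privacy budget is an explicit, reviewable parameter rather than an implementation detail.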
End-to-end security. Encryption of data at rest and in transit is table stakes; for AI systems, it must extend to the vector database, the inference API, and the logging infrastructure. Federated learning (training models on distributed data without centralising it) offers an architectural approach that reduces the exposure surface by keeping personal data on local devices or in local environments.
The US National Institute of Standards and Technology published the AI Risk Management Framework (AI RMF) in January 2023. Its four core functions (Govern, Map, Measure, Manage) provide a structured approach to AI risk that explicitly includes privacy. The Govern function requires organisations to establish accountability structures, policies, and procedures before deploying AI. The Map function includes identifying privacy risks alongside other categories of AI harm.
ISO/IEC 42001:2023, the international standard for AI management systems, requires organisations to identify legal obligations applicable to AI systems and maintain documented evidence of compliance. It is structurally similar to ISO 27001 for information security, and is an increasingly common baseline for enterprise AI governance.
The EU AI Act, which entered into force in August 2024, imposes the most prescriptive requirements. High-risk AI systems (including those used in employment, credit, education, and critical infrastructure) must undergo a conformity assessment that explicitly covers data quality and data governance. Article 10 of the Act requires that training datasets be subject to data governance practices addressing the purposes of processing, possible biases, and the measures taken to ensure accuracy and statistical properties. This is Privacy by Design codified as hard law.
When you deploy a third-party AI model, the data-processing activities of your vendor become your compliance problem. GDPR Article 28 requires a Data Processing Agreement (DPA) with every vendor that processes personal data on your behalf. The DPA must specify the subject matter, duration, nature, and purpose of processing; the type of personal data; categories of data subjects; and the obligations and rights of the controller. Many AI vendor agreements in 2024 remain inadequate on these dimensions. The UK ICO has published a checklist for DPAs with AI vendors that has become a practical standard for procurement teams.
AI data inventory. Maintain a register of every AI system, the personal and proprietary data it processes, the legal basis for that processing, the vendor or internal team responsible, and the applicable retention period. This is the AI-specific extension of the GDPR Article 30 Records of Processing Activities (RoPA).
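The register described above can start as a simple structured record with an automated gap check. The field names, the `gaps` helper, and the two sample systems below are hypothetical; the fields mirror the items listed in the text rather than any prescribed RoPA schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AISystemRecord:
    name: str
    owner: str                     # vendor or internal team responsible
    personal_data: List[str]       # categories of personal data processed
    proprietary_data: List[str]    # categories of proprietary data processed
    legal_basis: Optional[str]     # None if no basis has been documented
    retention_days: Optional[int]  # None if no retention period is defined


def gaps(record: AISystemRecord) -> List[str]:
    """Flag entries that would not survive a basic compliance review."""
    issues = []
    if record.personal_data and not record.legal_basis:
        issues.append("no legal basis recorded for personal data")
    if record.retention_days is None:
        issues.append("no retention period defined")
    return issues


# Sample entries (invented for illustration).
register = [
    AISystemRecord("support-chatbot", "Vendor A (hypothetical)",
                   ["contact details"], ["knowledge base"],
                   "legitimate interests", 365),
    AISystemRecord("hr-screening", "People Analytics", ["CVs"], [], None, None),
]

for record in register:
    print(record.name, gaps(record))
```

Running the check on every register entry before go-live turns the inventory from a static document into a control that surfaces undocumented processing early.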
Data subject rights operationalisation. AI systems complicate the fulfilment of data subject rights. A deletion request under GDPR Article 17 raises the question of whether you must retrain or unlearn the model to remove an individual's data from its weights, an active area of technical and legal debate. The UK ICO's 2023 guidance on machine unlearning acknowledges that complete technical compliance may be impractical, but requires organisations to document the limits and compensating measures.
Continuous monitoring. Privacy risk is not static. A model that was compliant at deployment may drift in ways that create new exposure: through fine-tuning on new data, changes in the query patterns it receives, or new research demonstrating previously unknown inference capabilities. Quarterly AI risk reviews, analogous to information security penetration tests, are becoming a governance expectation in regulated industries.
Training and culture. The Samsung incident was not a technology failure; it was a training and culture failure. Employees who understand that AI tools are not private notepads, that pasting proprietary code into a third-party API is equivalent to emailing it to a stranger, and that their queries may be retained and reviewed are significantly less likely to create inadvertent exposure incidents.
Privacy by Design is not a technical function delegated to engineers; it is a business decision made when the organisation chooses what problem the AI will solve, what data it will use, and which vendor will supply it. By the time an AI system reaches legal review or a DPIA, most of the architecturally significant privacy decisions have already been made. The business leaders who commission AI projects are making privacy decisions whether they recognise it or not.
Your organisation is establishing a formal AI data governance programme. You need to present a board-level governance framework covering: AI data inventory requirements, DPIA triggers and process, vendor due diligence and DPA standards, data subject rights operationalisation for AI systems, and employee training requirements. Use this AI advisor to stress-test your framework against real regulatory requirements and documented failure modes.
Bring your own organisation's context (industry sector, geographic footprint, types of AI systems in use) and ask the advisor to tailor the framework to your specific risk profile. Challenge it to identify gaps you may not have considered.