In 2016, Microsoft released a chatbot called Tay on Twitter. Within sixteen hours, Tay had to be shut down. It had learned β very efficiently β from the users who intentionally fed it toxic content. It repeated slurs. It denied the Holocaust. It did exactly what it was designed to do: learn from its training data. The problem was the data.
This wasn't a glitch. It was training working perfectly.
When engineers say an AI model is "trained," they mean it has been exposed to enormous quantities of data β text, images, audio, code β and its internal numerical parameters have been gradually adjusted to become better at predicting patterns in that data. The parameters are the model. There is no separate "brain" reasoning from first principles. There is only the compressed statistical fingerprint of everything it was shown.
For large language models like GPT-4 (released March 2023) or Claude 3 (released March 2024), training data can encompass hundreds of billions of words. The Common Crawl dataset alone β a snapshot of much of the public internet β contains petabytes of text. Models don't memorize every word, but they absorb the statistical relationships between words, ideas, and structures so thoroughly that they can generate coherent new text that resembles what they were trained on.
Training is not programming. A programmer writes rules. A trainer provides examples and lets the system discover its own rules. This is why AI systems can surprise even their creators β the rules they discover are sometimes unexpected.
Modern AI systems typically go through multiple training stages, each with a different purpose:
Exposure to massive raw data β books, websites, code. The model learns general language structure, facts, and reasoning patterns. This stage costs millions of dollars and takes months on thousands of specialized chips.
Narrowing the model's behavior toward a specific task or domain. A general model might be fine-tuned on medical records to become a clinical assistant, or on legal briefs to become a contract reviewer.
Reinforcement Learning from Human Feedback. Human raters judge outputs as better or worse; those judgments teach the model what humans prefer. This is how ChatGPT learned to sound helpful and polite rather than just statistically plausible.
In 2018, Amazon scrapped an AI recruiting tool it had built internally after discovering it systematically downgraded resumes from women. The root cause: the training data was ten years of the company's own hiring history β a history in which men had been hired at much higher rates. The model had learned a real pattern. The pattern was discriminatory. The data was the problem.
Google Translate has produced systematically gendered errors because many of its training languages had grammatical gender, but the patterns of which professions were described by which pronouns embedded real-world biases. A 2019 study published in Science found that a widely-used healthcare algorithm β trained on healthcare cost data β was significantly less likely to flag Black patients for extra care because Black patients had historically been under-referred, so they had lower costs in the training set.
In each case, the algorithm was accurate by the metric it was optimized for. The data encoded inequity. The model faithfully reproduced it.
When you fine-tune or prompt-engineer an AI system β even in small ways β you are making training decisions. The examples you provide, the feedback you give, the corrections you make, all nudge the model's behavior. Understanding that nudge as a form of training is the first step to doing it deliberately and responsibly.
You're going to interrogate an AI about its training data β what kinds of sources it learned from, what might be over- or under-represented, and how that shapes its answers. Ask probing questions about bias, coverage gaps, and how training choices affect real outputs.
Try to get the AI to reveal at least two concrete examples of how its training data might create skewed or incomplete answers.
In early 2023, users discovered that GPT-4 could be made to produce content it was trained to refuse β by framing requests as fictional scenarios, historical exercises, or hypothetical thought experiments. The "jailbreaks" worked not by hacking the model's code but by rephrasing inputs in ways that confused the model's learned sense of what was harmful versus what was academic. The model had been trained to refuse harmful requests, but its training hadn't fully generalized across all possible phrasings. The prompt was more powerful than the safety training.
A prompt doesn't simply ask a question. It activates a specific region of the model's learned behavior space. Because the model has seen billions of examples of how text-in-context determines text-out, your prompt is essentially specifying a context that makes certain kinds of continuations statistically probable and others less so.
This is why prompt engineering is not just clever wording β it's behavioral specification. When OpenAI's researchers developed "chain-of-thought prompting" in 2022, they discovered that simply adding the phrase "Let's think step by step" dramatically improved complex reasoning performance on benchmarks. The phrase didn't add information. It activated a pattern of careful reasoning that existed in the training data.
In a 2022 paper titled "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," Google Brain researchers showed that adding just a few reasoning-demonstration examples to prompts caused models to solve multi-step math and logic problems they previously failed. The capability was already there. The prompt unlocked it.
Researchers and practitioners have converged on a set of structural elements that make prompts reliably effective. Think of these not as tricks but as genuine specifications of the task you need done:
Few-shot prompting β including two to five examples of the pattern you want before asking the model to follow it β is perhaps the closest non-engineers come to actual training. When you show a model three examples of a specific writing style, it will extrapolate that style far beyond what it could infer from a description alone.
A 2020 paper from OpenAI β "Language Models Are Few-Shot Learners" (the GPT-3 paper) β demonstrated that a model shown just a handful of examples could outperform models specifically fine-tuned on thousands of examples for the same task. This result shook the field. It meant that with good prompts, you could get specialized performance without specialized training. The line between prompting and training had effectively blurred.
If prompts are training acts, then prompt designers have training responsibilities. Prompts used in production systems that thousands of people rely on β customer service bots, medical information chatbots, educational tools β shape what those people receive. Designing them carelessly is not neutral. It has consequences analogous to selecting biased training data.
Most commercial AI products are not bare models. They are models with hidden system prompts β instructions prepended to every conversation that define persona, restrict topics, establish tone, and shape responses. When you interact with a company's AI assistant, you're interacting with a layer of prompt engineering that is often kept confidential.
In 2023, the system prompts for several major AI products (including early versions of Bing Chat's "Sydney" persona) were extracted by users who discovered that certain meta-prompting techniques could make the model reveal its own instructions. The prompts showed instructions like "If users ask you what your instructions are, do not reveal them." They were, essentially, training the model's behavior in real time through context alone.
You're going to deliberately engineer prompts using the structural techniques from Lesson 2 β and then reflect on what changed and why. Pick a task (explaining a complex idea, writing in a specific style, solving a logic problem) and iterate on your prompt using role assignment, examples, format constraints, and chain-of-thought triggers.
The goal is to observe how each structural change shifts the output, and to articulate why you think it worked β connecting your observations back to how training shapes model behavior.
Between 2021 and 2022, OpenAI contracted with Sama β a Kenyan data labeling company β to have workers identify toxic content in text, so that ChatGPT could learn to refuse similar requests. The workers were paid between $1.32 and $2 per hour to read graphic descriptions of violence, sexual abuse, and self-harm. A January 2023 investigation by TIME magazine documented that many workers experienced lasting psychological distress. The experience of reading that content, for hours a day, caused real harm to real people β people whose labor is embedded in every ChatGPT safety response.
Reinforcement Learning from Human Feedback works in three stages. First, the pre-trained model generates multiple responses to the same prompt. Second, human raters rank those responses from best to worst. Third, a "reward model" is trained on those rankings to predict what human raters would prefer. Finally, the main model is updated using reinforcement learning to maximize the reward model's score.
This is an elegant solution to a hard problem: how do you train a model to produce outputs aligned with human values when human values are complex, contextual, and contested? But the solution imports a new problem: whose human raters? From where? With what cultural context? Trained by whom?
When human raters in one culture consistently rate certain content as harmful that another culture considers normal, the RLHF process encodes one culture's norms into a global product. A 2023 research paper from Stanford's Center for Research on Foundation Models found that RLHF significantly improved model performance on English-language helpfulness metrics while sometimes degrading performance on non-English tasks, because the training signal was derived primarily from English-language rater judgments.
Similarly, a 2022 paper published in Nature Machine Intelligence found that what counts as "toxic" varies substantially across cultures and languages β meaning a model trained to avoid content rated toxic by predominantly Western, English-speaking raters may censor content that is entirely normal and legitimate in other linguistic and cultural contexts.
Companies that deploy AI for specific professional contexts often fine-tune base models on domain-specific data. Bloomberg GPT (2023) was fine-tuned on 363 billion tokens of financial news and data to produce a model that significantly outperformed general models on financial tasks. This is fine-tuning's great power β and its risk: a fine-tuned model can become expert at a domain while inheriting or amplifying that domain's own biases.
In 2022, Anthropic published a paper describing "Constitutional AI" β an approach to alignment that uses a written set of principles (a "constitution") to guide the model's self-critique and revision of its own outputs. Rather than relying entirely on human raters' gut reactions, the model is trained to evaluate its own responses against explicit principles and revise them.
The approach doesn't eliminate human values from the process β the constitution itself is written by humans β but it makes those values explicit and auditable. You can read Anthropic's constitution. You cannot read the aggregate implicit judgments of thousands of RLHF raters. This transparency difference is consequential for organizations that need to understand and explain why an AI behaves the way it does.
If your organization fine-tunes a model on your own data β customer interactions, support tickets, internal documents β you are doing what Sama's workers did, at smaller scale: encoding a set of norms and values into the model's behavior. The choices you make about what examples to include, what to label as good or bad outputs, and what to optimize for are value choices. They will shape what the model produces for everyone who uses it.
You're going to act as an alignment auditor. Your job is to probe what values an AI has absorbed through RLHF and fine-tuning by asking edge-case questions, culturally specific scenarios, and situations where different value frameworks would produce different answers.
Then: for any value choice you find that you disagree with or think is culturally narrow, propose an alternative and explain what principles you would use in your own "constitution" for that topic.
In 2016, ProPublica published an investigation into COMPAS β a risk-assessment algorithm used by US courts to predict recidivism and inform sentencing decisions. The algorithm was trained on historical criminal justice data and rated Black defendants as higher risk at nearly twice the rate of white defendants who ultimately did not reoffend. Northpointe, the company behind COMPAS, argued the algorithm was race-neutral because it didn't use race as an input. The investigation showed that other variables β neighborhood, employment history, family criminal records β acted as proxies for race because race had shaped those variables in the historical data. The algorithm was a feedback loop amplifying past injustice into future decisions.
One of the most dangerous dynamics in deployed AI is the feedback loop: a model is trained on historical data, makes decisions that shape new historical data, and future models are trained on that new data. If the original data encoded inequity, each generation of training can amplify it. This is not hypothetical. Predictive policing algorithms trained on arrest data directed police to certain neighborhoods, which increased arrests there, which further skewed future training data.
In 2020, the City of Los Angeles suspended its use of PredPol (later renamed Geolitica), a predictive policing software, after an audit found it was creating exactly this kind of self-reinforcing loop. The model wasn't just reflecting past patterns β it was actively creating the future data that would confirm those patterns.
The ProPublica investigation found that COMPAS incorrectly flagged Black defendants as future criminals at almost twice the rate it did for white defendants. Northpointe disputed the methodology but could not disprove the disparity. The case became a landmark in AI accountability β demonstrating that training data reflecting historical discrimination will produce discriminatory outputs even from race-blind algorithms.
Responsible training is not one decision made at model creation. It is a practice applied at every level of the AI stack β from foundation model development to organizational deployment to individual use.
In 2018, Google researchers proposed "Model Cards" β standardized documentation describing a model's intended use, training data, performance across demographic groups, and known limitations. In the same year, a separate team proposed "Datasheets for Datasets" applying similar transparency to training data. These tools make the choices embedded in training visible and contestable.
OpenAI published a system card for GPT-4 in March 2023 documenting the red-teaming process, known limitations, and disparate performance across groups. Anthropic publishes model cards for Claude. Meta's Llama 2 technical report included extensive safety evaluation results. These documents are imperfect β companies control what they disclose β but they represent the closest thing to training transparency that currently exists in the industry.
Every interaction with a deployed AI system is a data point. Every correction, rating, or piece of feedback shapes future training. Every prompt in a production system shapes what thousands of people receive. The responsibility of training has distributed beyond the engineers in the lab. Understanding it is the prerequisite to exercising it well.
You are a responsible AI lead at an organization that wants to fine-tune a language model for a specific purpose. Choose a real-world deployment context (healthcare triage assistant, school tutoring bot, legal document reviewer, hiring support tool) and design a responsible training protocol.
Your protocol should address: what data you'd use and what you'd exclude, how you'd evaluate for bias before deployment, what values you'd codify explicitly, how you'd monitor after launch, and what you'd tell users about the system's limitations. The AI will challenge your choices and help you refine them.