In 1966, a computer scientist named Joseph Weizenbaum at MIT deployed a program called ELIZA. It could do almost nothing: it matched patterns in typed sentences and reflected them back as questions, mimicking a Rogerian psychotherapist. Weizenbaum expected users to find it trivially mechanical. Instead, his own secretary asked him to leave the room so she could speak with it privately. Colleagues suggested it might replace human therapists. Weizenbaum was disturbed enough by the reaction that he spent the next decade writing a book (Computer Power and Human Reason, 1976) warning about exactly the kind of misplaced trust that a language-mimicking machine could generate.
What is happening now is different in degree by many orders of magnitude, but the underlying dynamic is strikingly similar. When OpenAI released ChatGPT in November 2022, one million users signed up within five days, widely reported at the time as the fastest adoption of any consumer application. Within two months there were one hundred million. The sensation people reported was not that they had found a useful tool; it was that something had, apparently, understood them. That feeling, accurate or not, is the central subject of this course.
This course will give you a grounded, technically honest account of how conversational AI systems actually work: the NLP fundamentals underneath them, how they are designed and trained, where they fail and why, and how they are being deployed across industries today. You will finish with enough vocabulary and conceptual depth to evaluate claims made about these systems and to make informed decisions about building or using them. We will not tell you AI is magic, and we will not tell you it is a fraud. The truth is considerably more interesting than either.
In 1988, IBM's Peter Brown and colleagues at the Thomas J. Watson Research Center were building a speech recognition system for financial news transcription. They faced a choice that defined the next three decades of NLP: write explicit rules about English grammar and vocabulary, or collect enormous amounts of text and let a statistical model infer the patterns. They chose statistics. Their system, trained on the Wall Street Journal corpus, outperformed handcrafted rule systems that had taken years to build. The lesson was uncomfortable for linguists who had spent careers formalizing grammar: raw data, combined with the right mathematical framework, could approximate linguistic competence without anyone ever specifying a rule.
That same tension (rules versus learning, symbolic versus statistical) runs through every generation of NLP research and resurfaces today in debates about large language models. Understanding it is the foundation of understanding conversational AI.
Human language operates at multiple simultaneous levels. Linguists have catalogued these for over a century, and NLP systems must grapple with all of them, either explicitly, through rules, or implicitly, through learned representations.
Phonology is the sound system: how phonemes combine and which combinations are permissible. For text-based systems this matters less, but for speech recognition it is foundational.
Morphology is word structure: how prefixes, suffixes, and roots combine. The word unbreakable contains three morphemes (un-, break, -able), each contributing to meaning in a predictable way. A system that does not handle morphology will treat "run," "runs," "running," and "ran" as entirely unrelated.
Syntax is grammatical structure: the rules governing how words combine into phrases and sentences. A syntactic parser builds a tree representing this structure. Knowing that "The dog bit the man" and "The man bit the dog" have identical words but opposite meanings requires syntactic understanding.
Semantics is meaning: what words and sentences refer to in the world. This is where most of the difficulty concentrates. "Bank" can mean a financial institution or a riverbank; disambiguating requires context.
Pragmatics is use in context: what a speaker intends beyond literal meaning. "Can you pass the salt?" is grammatically a question about physical ability but pragmatically a request. Most conversational failures occur at the pragmatic level.
Early rule-based chatbots like ELIZA operated purely at the surface level: pattern matching on words. They had no morphological, syntactic, semantic, or pragmatic processing. Modern large language models operate primarily at the statistical distribution level across all these layers simultaneously, without explicitly modeling any of them. Understanding the layers helps you diagnose where a system fails.
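To make "pattern matching on words" concrete, here is a minimal sketch of an ELIZA-style responder. The two rules, the reflection table, and the function names are hypothetical and written for illustration; they are not taken from Weizenbaum's original script.

```python
import re

# Hypothetical rules for illustration; not Weizenbaum's original script.
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
]

# Swap first- and second-person words so the echoed fragment reads as a reply.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are"}


def reflect(fragment):
    """Lowercase the fragment and swap pronouns word by word."""
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())


def respond(utterance):
    """Return the first matching template, or a generic prompt if none match."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."


print(respond("I am worried about my job."))
print(respond("The weather is nice"))
```

Nothing here parses grammar or represents meaning: the program only copies a matched substring into a template, which is why such systems break as soon as the input leaves the anticipated patterns.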
The dominant paradigm in NLP from roughly 1957 to the mid-1980s was symbolic AI: represent knowledge as explicit rules and logical structures that a computer could manipulate. Noam Chomsky's 1957 work Syntactic Structures gave linguists and computer scientists a formal grammar framework (context-free grammars and later transformational generative grammars) that seemed to promise a complete account of syntactic structure.
Systems like SHRDLU (Terry Winograd, MIT, 1970) demonstrated that within a tiny constrained domain (a simulated world of blocks on a table) a computer could conduct remarkably natural conversations about that domain. SHRDLU could respond correctly to "Pick up the big red block" or "Which cube is sitting on the table?" It knew about blocks, their colors, their positions, and the actions possible with them. But it knew nothing else. Winograd himself later concluded that the success of SHRDLU was partly illusory: the world it operated in was so constrained that the appearance of understanding was easy to manufacture.
The fundamental problem with symbolic approaches is knowledge acquisition: building the rules is enormously expensive, they break down on language variation and ambiguity, and they cannot handle input they were not specifically designed for. Real language is boundlessly variable.
By the late 1980s and through the 1990s, statistical methods, which learn patterns from large text corpora, began outperforming hand-crafted rule systems on nearly every benchmark. The shift was driven by three factors: growing digital text, increasing computational power, and mathematical frameworks from information theory and probability.
N-gram language models were among the first statistical NLP tools to see wide deployment. An n-gram model estimates the probability of the next word given the previous n − 1 words. Given enough training text, such a model learns that "the cat sat on the" is much more likely to be followed by "mat" than by "philosophy." Google's autocomplete, as deployed from 2008 onward, relied heavily on n-gram statistics derived from billions of web pages.
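The next-word estimate described above can be sketched with a bigram model (n = 2), which conditions on only one previous word. The three-sentence corpus and the function names are made up for illustration, and the estimate is a plain relative frequency with no smoothing.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, word):
    """Relative-frequency estimate of P(word | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    if total == 0:
        return 0.0
    return counts[prev][word] / total


# Toy corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat sat on the sofa",
]
model = train_bigram(corpus)

print(next_word_probability(model, "the", "mat"))         # 2 of the 6 words after "the"
print(next_word_probability(model, "the", "philosophy"))  # never observed
```

The unseen continuation gets probability exactly zero, which is the rare-input weakness in its simplest form; practical n-gram systems add smoothing to avoid it.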
Statistical methods brought robustness and scalability but introduced new problems: they required enormous training data, they performed poorly on rare or novel inputs, and they provided no interpretable explanation for their outputs. A rule-based system that got an answer wrong could at least show you which rule failed. A statistical model was, from the beginning, a black box.
The third paradigm shift began around 2013 with the publication of Word2Vec by Tomas Mikolov and colleagues at Google. Word2Vec learned dense vector representations of words from text such that words with similar meanings ended up near each other in vector space. Famously, the vector arithmetic king − man + woman ≈ queen suggested that geometry learned purely from co-occurrence statistics could encode genuinely semantic relationships.
This representation learning approach, letting neural networks discover features rather than hand-engineering them, proved extraordinarily powerful. It culminated in 2017 with the publication of "Attention Is All You Need" by Vaswani et al. at Google Brain, which introduced the Transformer architecture. Transformers replaced recurrent neural networks with a mechanism called self-attention, allowing the model to weigh the relevance of every word in a sequence against every other word simultaneously. This parallelism enabled training on far larger datasets than any previous architecture.
BERT (Google, 2018), GPT-2 (OpenAI, 2019), and GPT-3 (OpenAI, 2020) followed in rapid succession, each dramatically larger and more capable than the last. By the time ChatGPT launched in November 2022, the underlying model (GPT-3.5) had been trained on hundreds of billions of words and fine-tuned with human feedback to produce conversational behavior. The ELIZA effect (the sensation of being understood) had been industrialized.
Modern NLP systems do not understand language the way humans do. They learn to predict likely continuations of text, a task that turns out to require capturing enormous amounts of syntactic, semantic, and world knowledge implicitly. That distinction, prediction versus comprehension, matters enormously when you are deciding what to trust these systems to do.
In this lab you will explore the five layers of linguistic analysis (phonology, morphology, syntax, semantics, and pragmatics) by discussing real examples with an AI tutor. Try asking the tutor to walk you through how a specific sentence or word would be analyzed at each layer, or ask it to show you where a chatbot might fail at a particular level.
In June 2020, OpenAI released GPT-3 with a technical paper noting that the model had been trained on roughly 570 gigabytes of filtered text, the equivalent of several hundred thousand novels. But before any of that text touched the neural network, every word, punctuation mark, and space had been transformed through a process called tokenization, reducing the rich surface of human writing to sequences of integer IDs. Token 50256 was the end-of-text marker. Token 198 was a newline. The entire English language, for the purposes of GPT-3, was a vocabulary of about 50,000 such tokens, and meaning was, somehow, to emerge from their statistical relationships.
Understanding what tokenization actually does, and what it obscures, is essential for understanding both the power and the characteristic failure modes of modern conversational AI.
Tokenization is the first step in almost every NLP pipeline: splitting raw text into atomic units the model can process. The choice of tokenization strategy has significant downstream effects.
Word tokenization is the simplest approach: split on whitespace and punctuation. It works reasonably well for English but fails badly for languages like German (which compounds words freely), Chinese (which uses no spaces), and any text containing new vocabulary, proper nouns, or deliberate misspellings.
Character tokenization splits text into individual characters. This handles any new word and any language, but sequences become very long, and the model must learn to compose meaning from scratch rather than exploiting word-level regularities.
Subword tokenization, used by virtually all modern large language models, is the pragmatic middle ground. Algorithms like Byte-Pair Encoding (BPE), developed by Philip Gage in 1994 as a data compression technique and adapted for NLP by Rico Sennrich et al. in 2016, iteratively merge the most frequent pair of adjacent symbols into a new token. Common words become single tokens; rare words are split into smaller subword pieces, which sometimes align with morphemes. "unbreakable" might tokenize as ["un", "break", "able"]. This keeps vocabulary size manageable while handling novel words gracefully.
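The merge loop at the heart of BPE can be sketched in a few lines. This is a simplified illustration and not the implementation used by any production tokenizer: it works on characters rather than bytes, omits the end-of-word marker used in Sennrich et al.'s formulation, and runs on invented word counts.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Start from single characters and apply the most frequent merge repeatedly."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Toy word counts, invented for this example.
toy_counts = {"low": 5, "lower": 2, "lowest": 3}
merges, segmented = learn_bpe(toy_counts, 2)
print(merges)
print(segmented)
```

After two merges the frequent stem "low" has become a single symbol while the rarer endings are still spelled out character by character, the same effect the paragraph describes at vocabulary scale.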
GPT-4's tokenizer (the cl100k_base encoding, available through the tiktoken library) splits "ChatGPT" into ["Chat", "G", "PT"]. It splits "tokenization" into ["token", "ization"]. It encodes the word "unfortunately" as a single token. These choices, made entirely on statistical frequency grounds, influence which words the model has seen "as a unit" and which it must compose from parts, with real effects on how it handles those words.
Once text is tokenized, each token must be converted to a numerical representation a neural network can process. The breakthrough insight of Word2Vec (Mikolov et al., Google, 2013) was that meaningful geometry could be learned entirely from distributional statistics, that is, from which words tend to appear near which other words in large corpora.
Word2Vec trains a shallow neural network on a massive text corpus using one of two tasks: predict a word from its neighbors (Continuous Bag of Words), or predict neighbors from a word (Skip-gram). Neither task has anything to do with meaning, explicitly. But in solving these prediction tasks, the network develops internal representations (vectors of typically 100 to 300 floating-point numbers per word) that encode semantic and syntactic relationships.
The result was remarkable. Words with similar meanings cluster in vector space. Analogical relationships appear as vector arithmetic. Paris − France + Germany ≈ Berlin. Doctor − man + woman ≈ nurse (which also revealed concerning biases baked into the training data). GloVe (Pennington et al., Stanford, 2014) and FastText (Bojanowski et al., Facebook AI Research, 2017) extended this approach with different training objectives and subword-level representations.
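The analogy arithmetic can be illustrated with a toy example. The two-dimensional vectors below are written by hand so that one axis loosely tracks "royalty" and the other "gender"; they are assumptions made for the demonstration, not output from Word2Vec, whose vectors are learned and have hundreds of dimensions.

```python
import math

# Hand-written toy vectors (axes: royalty, gender). Real embeddings are learned.
VECTORS = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "prince": (0.7, 0.7),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word whose vector is closest to vector(a) - vector(b) + vector(c)."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    # The three query words are excluded, as is conventional in analogy evaluation.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("king", "man", "woman"))
```

Because the result is simply the nearest remaining neighbor, the same mechanism that recovers "queen" will also surface whatever stereotyped associations the training text contains.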
Static embeddings like Word2Vec have a critical flaw: each word gets exactly one vector regardless of context. "Bank" gets the same embedding whether it appears in "river bank" or "bank account." This is clearly wrong: the word means different things and should have different representations.
ELMo (Embeddings from Language Models, Peters et al., Allen Institute, 2018) addressed this by producing contextual embeddings: the representation of each word is a function of the entire sentence around it, computed by a bidirectional LSTM. The word "bank" now has different vectors in "river bank" and "bank account."
BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., Google, 2018) took this further using the Transformer architecture. BERT is trained on two tasks: masked language modeling (predict randomly hidden tokens given all surrounding context) and next-sentence prediction. The resulting representations proved extraordinarily useful for downstream tasks: fine-tuning BERT on a specific task like sentiment analysis or question answering with just a small labeled dataset produced state-of-the-art results across a wide range of NLP benchmarks in 2018.
The critical insight is that in contextual embedding models, the vector for a word is computed anew for every sentence. This is computationally expensive but semantically much richer. Modern large language models like GPT-4 and Claude extend this principle to their full architecture: there is no static vocabulary of meaning, only context-dependent representations computed on the fly.
Embeddings capture statistical regularities in text, but text is not the world. A model trained exclusively on text has never seen a table, heard music, tasted food, or experienced time passing. When it discusses these things fluently, it is drawing on the way humans have written about them, which is extensive and often accurate but is not the same as grounded experience. This gap between linguistic fluency and world knowledge is a persistent source of subtle errors.
Explore tokenization decisions and word embedding concepts with an AI tutor. Ask how specific words or phrases would be tokenized, why subword methods are preferred, or how embedding geometry captures semantic relationships. Probe the limits of what embeddings can and cannot represent.
In 2016, Facebook opened its Messenger platform to chatbots and offered developers tooling built on Wit.ai, the natural-language startup it had acquired the year before, promising that any programmer could build a chatbot that understood natural language by simply providing training examples. Thousands of companies did. Many of those bots promptly failed when users phrased requests differently than the training examples anticipated, asking "I need to cancel" instead of "cancel my order," or saying "tomorrow" instead of a specific date. The problem was not that the NLP was bad. The problem was that developers had underestimated how much structural information a conversational system needs to extract, reliably and robustly, before it can do anything useful.
The discipline of intent classification and entity extraction addresses exactly this. It is older than deep learning and remains critically important even in the era of large language models.
In task-oriented dialogue systems (chatbots designed to accomplish specific goals like booking flights, checking account balances, or scheduling appointments) the first job of the NLP pipeline is to determine what the user wants to do. This is intent classification: mapping an utterance to one of a predefined set of intent categories.
An airline chatbot might recognize intents like BookFlight, CheckStatus, CancelReservation, and RequestRefund. A user saying "I want to fly to Tokyo next Tuesday" maps to BookFlight. "Where is my bag?" maps to CheckStatus. The challenge is that users express the same intent in wildly different ways, with different vocabulary, different sentence structures, misspellings, dialect variation, and code-switching between languages.
Classical intent classification used Support Vector Machines or logistic regression over bag-of-words or TF-IDF features. Modern systems fine-tune BERT-like models on intent classification datasets, dramatically improving accuracy but requiring labeled training data for each intent in each domain. The practical problem for small organizations is that collecting enough labeled examples (typically hundreds per intent) is expensive and time-consuming.
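A bag-of-words intent classifier can be sketched with the standard library alone. Note the substitution: this example uses multinomial Naive Bayes with add-one smoothing rather than the SVM or logistic regression named above, because it needs no external packages. The six training utterances are hypothetical, and a usable system would need far more of them.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training utterances; a real system needs many more per intent.
TRAINING = [
    ("i want to fly to tokyo next tuesday", "BookFlight"),
    ("book a flight to london", "BookFlight"),
    ("is my flight on time", "CheckStatus"),
    ("what is the status of my flight", "CheckStatus"),
    ("cancel my reservation", "CancelReservation"),
    ("i need to cancel my booking", "CancelReservation"),
]


def train(examples):
    """Collect per-intent word counts, intent frequencies, and the vocabulary."""
    word_counts = defaultdict(Counter)
    intent_counts = Counter()
    vocab = set()
    for text, intent in examples:
        tokens = text.lower().split()
        word_counts[intent].update(tokens)
        intent_counts[intent] += 1
        vocab.update(tokens)
    return word_counts, intent_counts, vocab


def classify(text, model):
    """Pick the intent with the highest Naive Bayes log-score."""
    word_counts, intent_counts, vocab = model
    total_examples = sum(intent_counts.values())
    scores = {}
    for intent, n_examples in intent_counts.items():
        total_words = sum(word_counts[intent].values())
        score = math.log(n_examples / total_examples)  # log prior
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen (intent, word) pairs above zero.
            score += math.log((word_counts[intent][token] + 1) / (total_words + len(vocab)))
        scores[intent] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("I need to cancel", model))
print(classify("is my flight delayed", model))
```

The classifier sees only word counts, so a paraphrase built from words outside the training vocabulary gives it nothing to work with; that is the failure the Wit.ai-era bots ran into.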
Zero-shot and few-shot intent classification using large language models addresses this by leveraging the LLM's pretrained knowledge of language. Instead of training a separate classifier, you describe the intent in natural language and ask the LLM to classify. This works surprisingly well for common intents but degrades on domain-specific or highly technical classifications.
Intent classification tells you what the user wants to do. Named Entity Recognition (NER) tells you with what: it extracts the specific pieces of information mentioned in the utterance.
For "I want to fly from London to Tokyo on Tuesday for two people," a NER system must extract: ORIGIN = London, DESTINATION = Tokyo, DATE = Tuesday (resolved to a specific date), PASSENGER_COUNT = 2. Without this extraction, the chatbot knows the user wants to book a flight but cannot actually book anything.
Standard NER systems recognize canonical entity types: PERSON, ORGANIZATION, LOCATION, DATE, TIME, MONEY, PERCENT. Deployed chatbots typically require domain-specific entity types beyond these standards (product SKUs, account numbers, medical codes), which require custom annotated training data.
NER is typically framed as a sequence labeling problem: each token in the sentence gets a label indicating whether it is the beginning, inside, or outside of a named entity (the BIO tagging scheme). Early systems used Conditional Random Fields (CRFs), which model dependencies between adjacent token labels. Modern systems fine-tune BERT with a classification head on top, achieving near-human accuracy on standard benchmarks but still failing on out-of-domain or low-resource text.
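To show what a BIO-tagged sequence looks like and how it turns back into entities, here is a small decoder. The tags in the example are written by hand, standing in for the output of a trained tagger, and the entity type names are borrowed from the flight-booking example.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hand-written tags standing in for a model's predictions.
tokens = ["Fly", "from", "London", "to", "New", "York", "on", "Tuesday"]
tags = ["O", "O", "B-ORIGIN", "O", "B-DESTINATION", "I-DESTINATION", "O", "B-DATE"]
print(decode_bio(tokens, tags))
```

The B/I distinction is what lets "New York" come out as one entity instead of two adjacent ones.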
Task-oriented dialogue systems organize extracted entities into "slots": structured fields required to complete a task. A flight booking might need slots for ORIGIN, DESTINATION, DATE, and PASSENGER_COUNT. If the user provides all slots in one utterance, the system can proceed. If slots are missing, the system must ask follow-up questions. Managing this multi-turn slot-filling process (tracking what has been provided, what is missing, and what has changed) is the core challenge of dialogue state tracking, covered in Module 2.
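The slot-filling loop can be sketched as a dictionary of filled slots plus a rule for what to do next. The function names and the "ask" and "book" action labels are hypothetical, and the sketch assumes entity extraction has already produced a slot dictionary for each turn.

```python
REQUIRED_SLOTS = ("ORIGIN", "DESTINATION", "DATE", "PASSENGER_COUNT")


def update_state(state, extracted):
    """Merge newly extracted slots into the state; later values overwrite earlier ones."""
    new_state = dict(state)
    new_state.update(extracted)
    return new_state


def next_action(state):
    """Ask for the first missing slot, or proceed to booking once all are filled."""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return ("ask", slot)
    return ("book", None)


# Turn 1: the user gives only the route.
state = update_state({}, {"ORIGIN": "London", "DESTINATION": "Tokyo"})
print(next_action(state))

# Turn 2: the user supplies the rest and changes the destination.
state = update_state(state, {"DATE": "Tuesday", "PASSENGER_COUNT": 2, "DESTINATION": "Osaka"})
print(next_action(state), state["DESTINATION"])
```

Overwriting on update is the simplest possible way to handle a user changing their mind; real dialogue state trackers also have to decide whether a new value is a correction or an extraction error.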
The hardest problem in NLP is not classification; it is ambiguity. Natural language is systematically ambiguous at every level, and real users exploit this constantly without realizing it.
Lexical ambiguity: "I deposited money at the bank." (Financial institution or riverbank? Almost certainly financial institution in context, but the model must make that inference.)
Structural ambiguity: "I saw the man with the telescope." (Did I use a telescope to see him, or does he have a telescope?) Both parse trees are grammatically valid. Humans resolve this from context and world knowledge; NLP systems often get it wrong.
Referential ambiguity: "John told Paul he was wrong." (Who is wrong: John or Paul?) This requires coreference resolution, which means tracking which pronouns refer to which entities across a conversation. Coreference resolution remains one of the harder problems in NLP, with even state-of-the-art systems making errors that humans would find obvious.
Scope ambiguity: "Every student reads a book." (Is it one book that all students read, or a different book for each student?) This requires formal semantics to represent correctly and is largely beyond the reach of current NLP systems.
Large language models handle many of these ambiguities better than classical systems because they can draw on enormous statistical context: they have seen most typical patterns before. But they are not immune. They systematically fail on novel ambiguities, rare constructions, and cases where the correct resolution requires genuine world knowledge they lack.
When building or evaluating a conversational AI system, the question is never "does it handle ambiguity?" because every real-world system will encounter ambiguous inputs. The question is: what does it do when it encounters an ambiguous input? Does it make a reasonable default choice and proceed? Ask for clarification? Fail silently? A system that fails silently on ambiguous input, producing confident-sounding but wrong output, is far more dangerous than one that explicitly acknowledges uncertainty.
Practice identifying intents and entities in natural language utterances, and explore how ambiguity complicates extraction. Ask the tutor to classify example sentences, extract entities, or explain why a specific sentence is structurally or referentially ambiguous. Design a simple intent taxonomy for a domain of your choice.
In April 2023, researchers at Stanford posted a paper titled "Are Emergent Abilities of Large Language Models a Mirage?", challenging the widely reported claim that LLMs suddenly acquire new capabilities at certain scale thresholds. The paper argued the apparent emergence was partly an artifact of the evaluation metrics chosen. The debate was technical but the stakes were not: if capabilities emerge unpredictably at scale, no one can reliably anticipate what a next-generation model will be able to do. If capabilities grow smoothly with scale, planning and safety analysis become more tractable. The argument is not resolved. But it illustrates something important: we are building systems whose behavior at scale we do not fully predict from their architecture.
Understanding the Transformer architecture well enough to reason about these questions, rather than simply using these systems, is what this lesson is about.
A language model assigns probabilities to sequences of tokens. More precisely, it learns to estimate P(token_n | token_1, token_2, ..., token_{n-1}), the probability of the next token given all preceding tokens. This is the autoregressive language modeling task, and it is what GPT-2, GPT-3, GPT-4, and similar models are fundamentally trained to do.
This seems like a narrow task. Why would predicting the next word require learning anything interesting? The answer becomes clear when you consider what accurate next-word prediction actually demands. To predict "Einstein" as a likely continuation of "The physicist who published the theory of special relativity in 1905 was," the model must know who published the theory of special relativity, when, and be able to match that knowledge to the grammatical expectation of a name at that point in the sentence. Accurate prediction across all possible sentences requires, implicitly, encoding enormous amounts of world knowledge, grammatical structure, logical inference, and even stylistic convention.
This is why large language models appear to "know" so much: not because they were explicitly taught facts, but because predicting text accurately requires representing facts as an intermediate computational step.
The Transformer, introduced by Vaswani et al. in 2017, consists of two main components: an encoder that processes input sequences and a decoder that generates output sequences. For language generation (GPT-style models), only the decoder is used. For understanding tasks (BERT-style models), only the encoder is used. For sequence-to-sequence tasks like translation (the original Transformer application), both are used.
The central innovation is self-attention. For each token in a sequence, self-attention computes a weighted sum of the representations of all tokens in the sequence, where each weight reflects how relevant that token is to understanding the current one. In the sentence "The animal didn't cross the street because it was too tired," self-attention allows the model to learn that "it" most strongly attends to "animal", resolving the pronoun reference that has tripped up classical NLP systems for decades.
Self-attention is computed via three learned linear projections of each token: Query (Q), Key (K), and Value (V). The attention weight between token i and token j is the dot product of Q_i and K_j, scaled and softmax-normalized. This is the "scaled dot-product attention" formula: Attention(Q, K, V) = softmax(QK^T / √d_k) · V. Multiple attention "heads" run in parallel (multi-head attention), each attending to different aspects of the relationships between tokens.
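The scaled dot-product formula can be written out directly. This sketch uses plain Python lists to stay dependency-free and takes Q, K, and V as given; in a real Transformer they come from learned projections of the token representations, and there would also be multiple heads and, for GPT-style decoders, a causal mask. The input matrices are arbitrary toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the weighted sum of the value vectors.
        output = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy inputs: two tokens, two dimensions.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = attention(Q, K, V)
print(weights)
print(outputs)
```

Each row of weights sums to one, so every output vector is a blend of the value vectors, tilted toward the tokens whose keys best match the query.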
Scaling a Transformer (increasing the number of parameters, the training data, and the compute) produces dramatic and consistent improvements across benchmarks. Kaplan et al. at OpenAI published scaling laws in 2020 showing that loss decreases predictably as a power law of model size, dataset size, and compute budget. This empirical regularity gave researchers confidence that simply building larger models and training them on more data would continue to yield improvements, a prediction that has, so far, largely held.
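A power law in model size has the form L(N) = (N_c / N)^α. The sketch below shows what that implies in practice; the constants are illustrative placeholders chosen for the example, not the fitted values from Kaplan et al.

```python
def power_law_loss(n_params, n_c, alpha):
    """Loss predicted by a power law in parameter count: (n_c / n_params) ** alpha."""
    return (n_c / n_params) ** alpha


# Illustrative placeholder constants, NOT fitted values from the paper.
N_C = 1e13
ALPHA = 0.08

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} parameters -> predicted loss {power_law_loss(n, N_C, ALPHA):.3f}")

# Under a power law, each doubling of N multiplies the loss by the same factor.
ratio = power_law_loss(2e9, N_C, ALPHA) / power_law_loss(1e9, N_C, ALPHA)
print(ratio, 2 ** -ALPHA)
```

The constant ratio per doubling is what makes the curve a straight line on log-log axes, and with a small exponent it also shows why each further improvement costs so much more scale.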
But a pretrained language model is not immediately a useful chatbot. A model trained purely on next-token prediction will complete text in the statistical style of its training data: helpful sometimes, but also capable of generating harmful, biased, or nonsensical continuations with equal facility. Converting a pretrained model into a system that follows instructions and declines harmful requests requires additional training steps.
Instruction tuning fine-tunes the model on datasets of (instruction, desired response) pairs. Reinforcement Learning from Human Feedback (RLHF), applied by OpenAI to create InstructGPT (2022) and subsequently ChatGPT, goes further: human raters compare model outputs and rank them; a reward model is trained on these preferences; the language model is then fine-tuned using reinforcement learning to maximize the reward model's score. RLHF is the primary technique responsible for making large language models helpful and relatively safe, and its limitations explain much of why they still sometimes fail in predictable ways.
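The step where "a reward model is trained on these preferences" is commonly formulated as a pairwise loss: the reward model should score the response raters preferred above the one they rejected. The text above does not specify a loss, so the form below, the negative log-sigmoid of the score difference, is an assumption based on that common formulation, and the reward values are made up.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Written as softplus(-margin) so large margins do not overflow exp().
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Made-up reward scores for one rater comparison.
print(preference_loss(2.0, 0.0))  # reward model agrees with the rater: small loss
print(preference_loss(0.0, 0.0))  # no preference expressed: log(2)
print(preference_loss(0.0, 2.0))  # reward model disagrees: large loss
```

The loss depends only on the rater's choice, not on whether the chosen response is true, so the reward model inherits whatever the raters could not check.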
Language models are trained to produce fluent, contextually appropriate continuations of text. They are not trained to be accurate; accuracy is only incidentally rewarded insofar as accurate text appears more often in training data. When a model generates a plausible-sounding but false statement (for example, "The Eiffel Tower was built in 1871," when it was actually completed in 1889), this is called a hallucination. RLHF reduces but does not eliminate hallucination, because human raters often cannot verify factual claims. This is one of the most important practical limitations of current conversational AI for high-stakes applications.
Understanding the Transformer architecture and training process directly illuminates the characteristic behaviors of modern chatbots. A model that generates one token at a time, conditioning on all previous context, will be coherent within its context window and coherent-seeming beyond the facts it knows, but it has no persistent memory between conversations, no ability to access current information without external tools, and no mechanism to verify its own outputs.
These are not bugs to be fixed. They are structural properties of the architecture. Retrieval-augmented generation (RAG), tool use, and multimodal extensions address some of these limitations, but they do so by adding external systems, not by changing the fundamental prediction engine. The conversational AI systems you will encounter in the rest of this course all sit on top of this foundation. Understanding the foundation is what allows you to reason about why they behave as they do, and what to expect when they are pushed beyond their design envelope.
NLP has moved from hand-coded rules, to statistical learning, to deep learning, to the current era of large Transformer-based language models trained on internet-scale text. At each stage, the systems became more capable and the internal representations became less interpretable. Modern conversational AI systems are extraordinarily capable text predictors that have implicitly absorbed enormous world knowledge, but they predict rather than comprehend. That limitation, the prediction-versus-comprehension gap, is the most important thing to carry into every practical application decision you make.
Engage with an AI tutor about Transformer architecture, the autoregressive generation process, RLHF, and the structural reasons for hallucination. Ask the tutor to explain self-attention in plain language, trace through how a specific sentence would be generated token by token, or discuss why RLHF is necessary but not sufficient for safety.