Conversational AI and Chatbots · Introduction

Machines That Speak Have Been Arriving for a Very Long Time

Why the history of human language and automated systems is longer — and stranger — than most people assume.

In 1966, a computer scientist named Joseph Weizenbaum at MIT deployed a program called ELIZA. It could do almost nothing: it matched patterns in typed sentences and reflected them back as questions, mimicking a Rogerian psychotherapist. Weizenbaum expected users to find it trivially mechanical. Instead, his own secretary asked him to leave the room so she could speak with it privately. Colleagues suggested it might replace human therapists. Weizenbaum was disturbed enough by the reaction that he spent the next decade writing a book — Computer Power and Human Reason, 1976 — warning about exactly the kind of misplaced trust that a language-mimicking machine could generate.

What is happening now is different in degree by many orders of magnitude, but the underlying dynamic is strikingly similar. When OpenAI released ChatGPT in November 2022, one million users signed up within five days. Within two months there were an estimated one hundred million, which analysts at the time called the fastest adoption of any consumer application on record. The sensation people reported was not that they had found a useful tool; it was that something had, apparently, understood them. That feeling, accurate or not, is the central subject of this course.

This course will give you a grounded, technically honest account of how conversational AI systems actually work: the NLP fundamentals underneath them, how they are designed and trained, where they fail and why, and how they are being deployed across industries today. You will finish with enough vocabulary and conceptual depth to evaluate claims made about these systems and to make informed decisions about building or using them. We will not tell you AI is magic, and we will not tell you it is a fraud. The truth is considerably more interesting than either.

Conversational AI and Chatbots · Module 1 · Lesson 1

What Does It Mean for a Machine to Understand Language?

From symbols to statistics — how NLP moved from hand-coded rules to learned representations.

If a system always says the right thing, does it matter whether it "understands" anything at all?

In 1988, IBM's Peter Brown and colleagues at the Thomas J. Watson Research Center were building a speech recognition system for financial news transcription. They faced a choice that defined the next three decades of NLP: write explicit rules about English grammar and vocabulary, or collect enormous amounts of text and let a statistical model infer the patterns. They chose statistics. Their system, trained on the Wall Street Journal corpus, outperformed handcrafted rule systems that had taken years to build. The lesson was uncomfortable for linguists who had spent careers formalizing grammar: raw data, combined with the right mathematical framework, could approximate linguistic competence without anyone ever specifying a rule.

That same tension — rules versus learning, symbolic versus statistical — runs through every generation of NLP research and resurfaces today in debates about large language models. Understanding it is the foundation of understanding conversational AI.

The Layers of Language

Human language operates at multiple simultaneous levels. Linguists have catalogued these for over a century, and NLP systems must grapple with all of them — either explicitly, through rules, or implicitly, through learned representations.

Phonology is the sound system: how phonemes combine and which combinations are permissible. For text-based systems this matters less, but for speech recognition it is foundational.

Morphology is word structure: how prefixes, suffixes, and roots combine. The word unbreakable contains three morphemes — un-, break, -able — each contributing to meaning in a predictable way. A system that does not handle morphology will treat "run," "runs," "running," and "ran" as entirely unrelated.

Syntax is grammatical structure: the rules governing how words combine into phrases and sentences. A syntactic parser builds a tree representing this structure. Knowing that "The dog bit the man" and "The man bit the dog" have identical words but opposite meanings requires syntactic understanding.

Semantics is meaning: what words and sentences refer to in the world. This is where most of the difficulty concentrates. "Bank" can mean a financial institution or a riverbank; disambiguating requires context.

Pragmatics is use in context: what a speaker intends beyond literal meaning. "Can you pass the salt?" is grammatically a question about physical ability but pragmatically a request. Most conversational failures occur at the pragmatic level.

Why This Matters for Chatbots

Early rule-based chatbots like ELIZA operated purely at the surface level: pattern matching on words. They had no morphological, syntactic, semantic, or pragmatic processing. Modern large language models do not model any of these layers explicitly; they learn statistical regularities that span all of them at once. Understanding the layers helps you diagnose where a system fails.
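
To make "pattern matching on words" concrete, here is a minimal sketch of an ELIZA-style responder. The rules, the pronoun table, and the function names are invented for illustration and are not Weizenbaum's original script; the point is only that a few regular expressions and a word swap are enough to produce replies that sound attentive.

```python
import re

# Hypothetical rules for illustration; not Weizenbaum's original script.
PRONOUN_SWAPS = {"i": "you", "my": "your", "me": "you", "am": "are"}

RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (.+)", re.IGNORECASE), "Tell me more about your {0}."),
]


def reflect(fragment):
    """Swap first-person words for second-person ones, word by word."""
    words = fragment.lower().split()
    return " ".join(PRONOUN_SWAPS.get(word, word) for word in words)


def respond(utterance):
    """Return the reply for the first matching rule, else a stock prompt."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."
```

With these rules, respond("I feel ignored by my boss.") returns "Why do you feel ignored by your boss?", which reads as engaged. respond("I am not sure my plan makes sense") returns "How long have you been not sure your plan makes sense?", which exposes the mechanism: nothing in the program knows what any of the words mean.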

The Symbolic Approach: Rules All the Way Down

The dominant paradigm in NLP from roughly 1957 to the mid-1980s was symbolic AI: represent knowledge as explicit rules and logical structures that a computer could manipulate. Noam Chomsky's 1957 work Syntactic Structures gave linguists and computer scientists a formal grammar framework — context-free grammars and later transformational generative grammars — that seemed to promise a complete account of syntactic structure.

Systems like SHRDLU (Terry Winograd, MIT, 1970) demonstrated that within a tiny constrained domain — a simulated world of blocks on a table — a computer could conduct remarkably natural conversations about that domain. SHRDLU could respond correctly to "Pick up the big red block" or "Which cube is sitting on the table?" It knew about blocks, their colors, their positions, and the actions possible with them. But it knew nothing else. Winograd himself later concluded that the success of SHRDLU was partly illusory — the world it operated in was so constrained that the appearance of understanding was easy to manufacture.

The fundamental problem with symbolic approaches is knowledge acquisition: building the rules is enormously expensive, they break down on language variation and ambiguity, and they cannot handle input they were not specifically designed for. Real language is boundlessly variable.

The Statistical Turn

By the late 1980s and through the 1990s, statistical methods — learning patterns from large text corpora — began outperforming hand-crafted rule systems on nearly every benchmark. The shift was driven by three factors: growing digital text, increasing computational power, and mathematical frameworks from information theory and probability.

N-gram language models were among the first statistical NLP tools to see wide deployment. An n-gram model estimates the probability of the next word given the previous n−1 words. Given enough training text, such a model learns that "the cat sat on the" is much more likely to be followed by "mat" than by "philosophy." Google's autocomplete, as deployed from 2008 onward, relied heavily on n-gram statistics derived from billions of web pages.
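
The idea can be sketched with the simplest case, a bigram model (n = 2), which conditions on only one previous word. The three-sentence corpus and the function names below are made up for illustration; a deployed model would be estimated from vastly more text and would use smoothing rather than raw counts.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    follower_counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for previous, current in zip(words, words[1:]):
            follower_counts[previous][current] += 1
    return follower_counts


def next_word_probability(model, previous, candidate):
    """Maximum-likelihood estimate of P(candidate | previous)."""
    total = sum(model[previous].values())
    if total == 0:
        return 0.0  # unseen context: the model has no estimate at all
    return model[previous][candidate] / total


# Toy corpus, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat slept on the sofa",
]
model = train_bigram_model(corpus)
p_mat = next_word_probability(model, "the", "mat")
p_philosophy = next_word_probability(model, "the", "philosophy")
```

Here "the" is followed by "mat" in two of its six occurrences, so p_mat is 1/3, while p_philosophy is exactly zero because that pair never appears. Assigning zero probability to anything unseen is the weakness that smoothing techniques were developed to soften.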

Statistical methods brought robustness and scalability but introduced new problems: they required enormous training data, they performed poorly on rare or novel inputs, and they provided no interpretable explanation for their outputs. A rule-based system that got an answer wrong could at least show you which rule failed. A statistical model was, from the beginning, a black box.

Corpus: A large structured collection of text (or speech) used to train, test, or evaluate NLP models. The size and quality of the corpus are often the dominant factors in model performance.

N-gram: A contiguous sequence of n items (words, characters, or phonemes) from a text. Bigrams are 2-item sequences; trigrams are 3-item sequences. N-gram models estimate text probabilities from these sequences.

Tokenization: The process of splitting raw text into discrete units (tokens), typically words or subword pieces, that serve as the atomic inputs to an NLP model.

The Deep Learning Era

The third paradigm shift began around 2013 with the publication of Word2Vec by Tomas Mikolov and colleagues at Google. Word2Vec learned dense vector representations of words from text such that words with similar meanings ended up near each other in vector space. Famously, the vector arithmetic king − man + woman ≈ queen demonstrated that a model trained on nothing but co-occurrence statistics had organized them into something recognizably semantic.

This representation learning approach — letting neural networks discover features rather than hand-engineering them — proved extraordinarily powerful. It culminated in 2017 with the publication of "Attention Is All You Need" by Vaswani et al. at Google Brain, which introduced the Transformer architecture. Transformers replaced recurrent neural networks with a mechanism called self-attention, allowing the model to weigh the relevance of every word in a sequence against every other word simultaneously. This parallelism enabled training on far larger datasets than any previous architecture.
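
A rough sketch of the self-attention computation may help. It is deliberately simplified: real Transformers first multiply each input vector by learned matrices to obtain separate queries, keys, and values, and they run many attention heads in parallel. Here the raw input vectors play all three roles, and the two-dimensional "token vectors" are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = input.

    The learned projection matrices of a real Transformer are omitted.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output is a weighted average of all input vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy 2-dimensional token vectors, invented for illustration.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of weights sums to 1, and the first token places more weight on the second token (which points in nearly the same direction) than on the third. No position's result depends on another position's output, which is why the computation can be carried out for all positions at once.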

BERT (Google, 2018), GPT-2 (OpenAI, 2019), and GPT-3 (OpenAI, 2020) followed in rapid succession, each dramatically larger and more capable than the last. By the time ChatGPT launched in November 2022, the underlying model (GPT-3.5) had been trained on hundreds of billions of words and fine-tuned with human feedback to produce conversational behavior. The ELIZA effect — the sensation of being understood — had been industrialized.

The Core Insight of This Module

Modern NLP systems do not understand language the way humans do. They learn to predict likely continuations of text, a task that turns out to require capturing enormous amounts of syntactic, semantic, and world knowledge implicitly. That distinction — prediction versus comprehension — matters enormously when you are deciding what to trust these systems to do.

Lesson 1 Quiz

Five questions · Select the best answer for each

1. Which level of linguistic analysis deals with what speakers intend beyond the literal meaning of their words?

Correct. Pragmatics is the study of language use in context — what a speaker intends by an utterance, which often differs from its literal grammatical meaning. "Can you pass the salt?" is a pragmatic request, not a literal capability question.

Not quite. Pragmatics is the level that addresses speaker intent and contextual meaning beyond literal interpretation. Morphology covers word structure, semantics covers literal meaning, and phonology covers sound systems.

2. ELIZA, created by Joseph Weizenbaum in 1966, demonstrated which of the following?

Correct. ELIZA's central lesson — which disturbed Weizenbaum deeply — was that extremely simple pattern-matching could cause users, including his own secretary, to attribute genuine understanding and empathy to the program. This became known as the ELIZA effect.

Not quite. ELIZA used only simple pattern matching and reflection — no genuine understanding, no statistical models, no transformers. Its significance was that users attributed understanding to it anyway, revealing how easily humans anthropomorphize language-producing machines.

3. What is an n-gram in the context of NLP?

Correct. N-gram models estimate the probability of the next word given the previous n−1 words. Google's early autocomplete relied heavily on n-gram statistics. They are simple, scalable, and completely lack semantic understanding.

Not quite. An n-gram is simply a contiguous sequence of n text items. N-gram language models use these sequences to estimate word probabilities from training corpora. They predate neural networks and vector representations by decades.

4. The 2017 paper "Attention Is All You Need" introduced which architecture that became foundational for modern large language models?

Correct. Vaswani et al. at Google Brain introduced the Transformer architecture, which uses self-attention to weigh relationships between all words in a sequence simultaneously. Its parallelism enabled training on vastly larger datasets than RNNs allowed, and it underlies BERT, GPT, and essentially every major LLM today.

Not quite. The Transformer, introduced in "Attention Is All You Need" (2017), is the correct answer. Word2Vec predates it (2013) and produces word embeddings rather than a full language model architecture. RNNs and CNNs are earlier neural architectures that transformers largely supplanted for NLP tasks.

5. What fundamental limitation did symbolic, rule-based NLP systems face that statistical methods helped address?

Correct. The knowledge acquisition bottleneck was the central weakness of symbolic systems. Real language is boundlessly variable, ambiguous, and context-dependent — formalizing it completely proved impossible. Statistical systems learn these patterns from data, sidestepping the need for hand-crafted rules.

Not quite. Symbolic systems required human experts to write rules — they needed no training data at all. Their limitation was that the rules were brittle and could not cover the full variability of natural language. Statistical methods shifted the burden from rule-writing to data collection.

Lab 1 · The Layers of Language

Interactive conversation · Minimum 3 exchanges to complete

Lab Objective

In this lab you will explore the five layers of linguistic analysis — phonology, morphology, syntax, semantics, and pragmatics — by discussing real examples with an AI tutor. Try asking the tutor to walk you through how a specific sentence or word would be analyzed at each layer, or ask it to show you where a chatbot might fail at a particular level.

Suggested opening: "Can you show me how the sentence 'I saw the man with the telescope' would be analyzed at each linguistic level — and where a chatbot might get confused?"

Conversational AI and Chatbots · Module 1 · Lesson 2

Tokenization, Embeddings, and the Architecture of Meaning

How raw text becomes numbers — and why the way you count words changes everything a model can do.

When a language model reads a sentence, what exactly is it reading?

In May 2020, OpenAI introduced GPT-3 in a technical paper noting that the model had been trained on roughly 570 gigabytes of filtered text, the equivalent of several hundred thousand novels. But before any of that text touched the neural network, every word, punctuation mark, and space had been transformed through a process called tokenization, reducing the rich surface of human writing to sequences of integer IDs. Token 50256 was the end-of-text marker. Token 198 was a newline. The entire English language, for the purposes of GPT-3, was a vocabulary of about 50,000 such tokens, and meaning was, somehow, to emerge from their statistical relationships.

Understanding what tokenization actually does — and what it obscures — is essential for understanding both the power and the characteristic failure modes of modern conversational AI.

Tokenization: Breaking Text Into Units

Tokenization is the first step in almost every NLP pipeline: splitting raw text into atomic units the model can process. The choice of tokenization strategy has significant downstream effects.

Word tokenization is the simplest approach: split on whitespace and punctuation. It works reasonably well for English but fails badly for languages like German (which compounds words freely), Chinese (which uses no spaces), and any text containing new vocabulary, proper nouns, or deliberate misspellings.

Character tokenization splits text into individual characters. This handles any new word and any language, but sequences become very long, and the model must learn to compose meaning from scratch rather than exploiting word-level regularities.
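
The trade-off between these two strategies is easy to see in a few lines. The sketch below uses a simple regular expression as a stand-in for a word tokenizer; production tokenizers handle contractions, URLs, and similar cases more carefully, and the function names are illustrative only.

```python
import re


def word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Treat every character, including spaces, as its own token."""
    return list(text)


sentence = "Chatbots aren't magic."
word_tokens = word_tokenize(sentence)
char_tokens = char_tokenize(sentence)

# A sentence with no spaces defeats the whitespace assumption entirely.
chinese_tokens = word_tokenize("我喜欢聊天机器人")
```

The English sentence becomes 6 word-level tokens (the contraction is split awkwardly into "aren", "'", "t") but 22 character-level tokens. The Chinese sentence comes back from the word tokenizer as a single undivided token, which is the failure described above for languages written without spaces.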

Subword tokenization, used by virtually all modern large language models, is the pragmatic middle ground. Algorithms like Byte-Pair Encoding (BPE), developed by Philip Gage in 1994 as a compression technique and adapted for NLP by Rico Sennrich et al. in 2016, start from individual characters and iteratively merge the most frequent adjacent pair of symbols into a new token. Common words become single tokens; rare words are split into smaller pieces that often, though not always, line up with meaningful units. "unbreakable" might tokenize as ["un", "break", "able"]. This keeps vocabulary size manageable while handling novel words gracefully.
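
The merge loop at the heart of BPE can be sketched as follows. This is a toy version for illustration: the word counts are invented, ties between equally frequent pairs are broken arbitrarily (here by tuple ordering), and real tokenizers add details such as byte-level fallback and end-of-word markers that are left out.

```python
from collections import Counter


def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_merges):
    """Return the learned merges in order, plus the final segmentation."""
    word_freqs = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(word_freqs)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by tuple ordering.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


# Toy word counts, invented for illustration.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
```

On this toy corpus the first two merges build the suffix "est" ("s" + "t", then "e" + "st"), so "newest" ends up segmented as n, e, w, est. Nothing told the algorithm that "est" is a suffix; it surfaced because it was frequent.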

A Real Tokenization Quirk

GPT-4's tokenizer (tiktoken, using cl100k_base) splits "ChatGPT" into ["Chat", "G", "PT"]. It splits "tokenization" into ["token", "ization"]. It encodes the word "unfortunately" as a single token. These choices, made entirely on statistical frequency grounds, influence which words the model has seen "as a unit" and which it must compose from parts — with real effects on how it handles those words.

Word Embeddings: Meaning as Geometry

Once text is tokenized, each token must be converted to a numerical representation a neural network can process. The breakthrough insight of Word2Vec (Mikolov et al., Google, 2013) was that meaningful geometry could be learned entirely from distributional statistics — from which words tend to appear near which other words in large corpora.

Word2Vec trains a shallow neural network on a massive text corpus using one of two tasks: predict a word from its neighbors (Continuous Bag of Words), or predict neighbors from a word (Skip-gram). Neither task has anything to do with meaning, explicitly. But in solving these prediction tasks, the network develops internal representations — vectors of typically 100 to 300 floating-point numbers per word — that encode semantic and syntactic relationships.
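
To show what the Skip-gram task actually trains on, the sketch below generates its (center word, context word) pairs from a sentence. Only the data preparation is shown; the neural network that learns from these pairs is omitted, and the function name and window size are illustrative choices.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs within a fixed window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
```

With a window of 1, the six-word sentence yields ten pairs, such as ("cat", "the") and ("cat", "sat"). Words that occur in similar contexts generate similar sets of pairs, and that overlap is what pushes their vectors together during training.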

The result was remarkable. Words with similar meanings cluster in vector space. Analogical relationships appear as vector arithmetic. Paris − France + Germany ≈ Berlin. Doctor − man + woman ≈ nurse (which also revealed concerning biases baked into the training data). GloVe (Pennington et al., Stanford, 2014) and FastText (Bojanowski et al., Facebook AI Research, 2017) extended this approach with different training objectives and subword-level representations.
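
The arithmetic itself is simple enough to sketch with hand-made vectors. The five two-dimensional vectors below are invented so that one axis loosely tracks "royalty" and the other "gender"; real embeddings have hundreds of dimensions with no such tidy labels. Excluding the three query words from the candidates follows common practice in analogy evaluations.

```python
import math

# Hand-made 2-D vectors, invented for illustration. Real embeddings are
# learned from text and have far more dimensions.
VECTORS = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c):
    """Solve a - b + c = ? by nearest cosine neighbour, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(VECTORS[word], target))


result = analogy("king", "man", "woman")
```

Here result is "queen", because subtracting "man" and adding "woman" flips the second coordinate while leaving the first untouched. The same mechanics reproduce whatever regularities the training text contains, including biased ones.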

Embedding: A dense vector representation of a token, word, or sequence in a continuous high-dimensional space. Points close together in embedding space tend to have similar linguistic or semantic properties.

BPE (Byte-Pair Encoding): A subword tokenization algorithm that iteratively merges the most frequent adjacent character pairs in a corpus into single tokens, producing a vocabulary that efficiently covers both common and rare words.

Distributional Hypothesis: The linguistic claim, associated with J.R. Firth (1957), that "you shall know a word by the company it keeps." Words appearing in similar contexts tend to have similar meanings, which is the theoretical basis for word embedding methods.

From Static to Contextual Embeddings

Static embeddings like Word2Vec have a critical flaw: each word gets exactly one vector regardless of context. "Bank" gets the same embedding whether it appears in "river bank" or "bank account." This is clearly wrong — the word means different things and should have different representations.

ELMo (Embeddings from Language Models, Peters et al., Allen Institute, 2018) addressed this by producing contextual embeddings: the representation of each word is a function of the entire sentence around it, computed by a bidirectional LSTM. The word "bank" now has different vectors in "river bank" and "bank account."

BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., Google, 2018) took this further using the Transformer architecture. BERT is trained on two tasks: masked language modeling (predict randomly hidden tokens given all surrounding context) and next-sentence prediction. The resulting representations proved extraordinarily useful for downstream tasks: fine-tuning BERT on a specific task like sentiment analysis or question answering, often with only a modest labeled dataset, set new state-of-the-art results on eleven NLP tasks when the paper appeared in 2018.
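
The masked language modeling objective can be illustrated by showing how a training example is prepared. This is a simplified sketch: it replaces every selected token with [MASK], whereas the BERT paper selects about 15% of tokens and then substitutes [MASK] for only most of them, leaving some unchanged or swapping in a random token. The function name and the seed parameter are illustrative.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return the masked input and the targets."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[position] = token  # what the model must predict
        else:
            masked.append(token)
    return masked, targets


tokens = "the bank raised interest rates again this year".split()
masked, targets = mask_tokens(tokens, mask_rate=0.3)
```

The model sees masked and is scored on how well it recovers targets. Because the hidden word can sit anywhere in the sentence, the model has to use context on both sides of it, which is where the "bidirectional" in the name comes from.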

The critical insight is that in contextual embedding models, the vector for a word is computed anew for every sentence. This is computationally expensive but semantically much richer. Modern large language models like GPT-4 and Claude extend this principle to their full architecture: each token starts from a fixed learned input vector, but every layer recomputes its representation from the surrounding context, so there is no static vocabulary of meaning, only context-dependent representations computed on the fly.

What Embeddings Cannot Do

Embeddings capture statistical regularities in text, but text is not the world. A model trained exclusively on text has never seen a table, heard music, tasted food, or experienced time passing. When it discusses these things fluently, it is drawing on the way humans have written about them, which is extensive and often accurate, but is not the same as grounded experience. This gap between linguistic fluency and grounded experience is a persistent source of subtle errors.

Lesson 2 Quiz

Five questions · Select the best answer for each

1. Which tokenization strategy is used by virtually all modern large language models, including GPT-4?

Correct. Subword tokenization via BPE or similar algorithms (WordPiece, SentencePiece) provides the practical balance between vocabulary size and handling of novel or rare words. Virtually every modern LLM uses this approach.

Not quite. Subword tokenization — specifically algorithms like Byte-Pair Encoding — is the standard approach for modern LLMs. Word tokenization struggles with rare words; character tokenization produces sequences too long for transformers to handle efficiently.

2. The Distributional Hypothesis, which underlies word embedding methods, was originally associated with which linguist?

Correct. J.R. Firth (1957) wrote that "you shall know a word by the company it keeps," the idea now known as the distributional hypothesis. This principle, that words appearing in similar contexts tend to have similar meanings, is the theoretical foundation for Word2Vec, GloVe, and all word embedding methods.

Not quite. J.R. Firth, a British linguist writing in 1957, is credited with the distributional hypothesis. Chomsky's work focused on formal grammar and syntax; Winograd built SHRDLU; Mikolov created Word2Vec, which implemented Firth's insight computationally decades later.

3. What critical limitation of static word embeddings like Word2Vec did ELMo and BERT address?

Correct. Static embeddings assign one vector per word — "bank" gets the same representation in "river bank" and "bank account." ELMo and BERT compute contextual embeddings: the representation of each word depends on the full surrounding context, enabling disambiguation of polysemous words.

Not quite. The fundamental limitation of static embeddings is that each word type gets one fixed vector, regardless of how it is used in a sentence. This prevents the model from distinguishing between different senses of ambiguous words. Contextual embeddings compute word representations dynamically based on surrounding context.

4. Word2Vec's famous vector arithmetic (king − man + woman ≈ queen) demonstrates which property of learned word embeddings?

Correct. The vector arithmetic results show that meaningful relationships — gender, country-capital, verb tense — are encoded as consistent directions in embedding space. This emerges entirely from the distributional statistics of the training corpus, without any explicit semantic programming.

Not quite. The arithmetic shows that embedding space has meaningful geometry — semantic relationships correspond to consistent vector directions. This is an emergent property of training on large corpora, not the result of explicit programming or grammar rules.

5. BERT was trained using which two pre-training tasks?

Correct. BERT's pre-training uses: (1) Masked Language Modeling — predicting randomly hidden tokens given bidirectional context, and (2) Next-Sentence Prediction — determining whether two sentences are consecutive in the original text. These tasks enable rich contextual representations useful across many downstream NLP tasks.

Not quite. BERT uses masked language modeling (predicting hidden tokens from bidirectional context) and next-sentence prediction. Skip-gram and CBOW are Word2Vec's training objectives. BERT's bidirectional masking is what distinguishes it from autoregressive models like GPT, which can only use left context.

Lab 2 · Tokenization and Embeddings

Interactive conversation · Minimum 3 exchanges to complete

Lab Objective

Explore tokenization decisions and word embedding concepts with an AI tutor. Ask how specific words or phrases would be tokenized, why subword methods are preferred, or how embedding geometry captures semantic relationships. Probe the limits of what embeddings can and cannot represent.

Suggested opening: "Why does it matter how a model tokenizes a word like 'GPT-4' or 'COVID-19'? What real effects does that have on model behavior?"

Conversational AI and Chatbots · Module 1 · Lesson 3

Intent, Entities, and the Grammar of Dialogue

How NLP systems extract structured information from unstructured conversation — and where that extraction breaks down.

When a chatbot decides what you want, what is it actually computing?

In 2016, Facebook opened its Messenger platform to chatbots and offered developers natural-language tooling from Wit.ai, a startup it had acquired the previous year, promising that any programmer could build a chatbot that understood natural language by simply providing training examples. Thousands of companies did. Many of those bots promptly failed when users phrased requests differently than the training examples anticipated: asking "I need to cancel" instead of "cancel my order," or saying "tomorrow" instead of a specific date. The problem was not that the NLP was bad. The problem was that developers had underestimated how much structural information a conversational system needs to extract, reliably and robustly, before it can do anything useful.

The discipline of intent classification and entity extraction addresses exactly this. It is older than deep learning and remains critically important even in the era of large language models.

Intent Classification

In task-oriented dialogue systems — chatbots designed to accomplish specific goals like booking flights, checking account balances, or scheduling appointments — the first job of the NLP pipeline is to determine what the user wants to do. This is intent classification: mapping an utterance to one of a predefined set of intent categories.

An airline chatbot might recognize intents like BookFlight, CheckStatus, CancelReservation, and RequestRefund. A user saying "I want to fly to Tokyo next Tuesday" maps to BookFlight. "Where is my bag?" maps to CheckStatus. The challenge is that users express the same intent in wildly different ways, with different vocabulary, different sentence structures, misspellings, dialect variation, and code-switching between languages.

Classical intent classification used Support Vector Machines or logistic regression over bag-of-words or TF-IDF features. Modern systems fine-tune BERT-like models on intent classification datasets, dramatically improving accuracy but requiring labeled training data for each intent in each domain. The practical problem for small organizations is that collecting enough labeled examples — typically hundreds per intent — is expensive and time-consuming.
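
A classical pipeline of this kind can be sketched in plain Python. To keep the example short, it swaps the SVM or logistic regression for a nearest-neighbour rule (return the label of the most similar training utterance by cosine similarity over TF-IDF vectors). The six training utterances are hypothetical, and the smoothed IDF formula is one common variant among several.

```python
import math
from collections import Counter

TRAINING = [  # hypothetical labeled utterances
    ("book a flight to tokyo", "BookFlight"),
    ("i want to fly to paris", "BookFlight"),
    ("where is my bag", "CheckStatus"),
    ("is my flight on time", "CheckStatus"),
    ("cancel my reservation", "CancelReservation"),
    ("i need to cancel my trip", "CancelReservation"),
]


def tokenize(text):
    return text.lower().split()


def inverse_document_frequency(documents):
    """Smoothed IDF: terms found in fewer documents get larger weights."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in training are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


DOCS = [tokenize(text) for text, _ in TRAINING]
IDF = inverse_document_frequency(DOCS)
EXAMPLES = [(tfidf(doc, IDF), label) for doc, (_, label) in zip(DOCS, TRAINING)]


def classify(utterance):
    """Return the label of the most similar training example."""
    query = tfidf(tokenize(utterance), IDF)
    return max(EXAMPLES, key=lambda example: cosine(query, example[0]))[1]
```

classify("please cancel my flight reservation") returns "CancelReservation" even though the utterance also contains "flight", because "cancel" and "reservation" carry more combined weight. An utterance that shares no vocabulary with the training set scores zero against every example, and this sketch still returns a label (the first one in the list), which is a small-scale version of the brittleness that labeled-data requirements are meant to reduce.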

Zero-shot and few-shot intent classification using large language models addresses this by leveraging the LLM's pretrained knowledge of language. Instead of training a separate classifier, you describe the intent in natural language and ask the LLM to classify. This works surprisingly well for common intents but degrades on domain-specific or highly technical classifications.

Intent: The goal or purpose behind a user's utterance in a task-oriented dialogue system. Intent classification maps raw text to one of a predefined set of action categories (e.g., BookFlight, CheckBalance).

TF-IDF: Term Frequency–Inverse Document Frequency. A numerical statistic reflecting how important a word is to a document relative to a corpus. Used as features for classical text classification before neural methods.

Named Entity Recognition

Intent classification tells you what the user wants to do. Named Entity Recognition (NER) tells you with what — extracting the specific pieces of information mentioned in the utterance.

For "I want to fly from London to Tokyo on Tuesday for two people," a NER system must extract: ORIGIN = London, DESTINATION = Tokyo, DATE = Tuesday (resolved to a specific date), PASSENGER_COUNT = 2. Without this extraction, the chatbot knows the user wants to book a flight but cannot actually book anything.

Standard NER systems recognize canonical entity types: PERSON, ORGANIZATION, LOCATION, DATE, TIME, MONEY, PERCENT. Deployed chatbots typically require domain-specific entity types beyond these standards — product SKUs, account numbers, medical codes — which require custom annotated training data.

NER is typically framed as a sequence labeling problem: each token in the sentence gets a label indicating whether it is the beginning, inside, or outside of a named entity (the BIO tagging scheme). Early systems used Conditional Random Fields (CRFs), which model dependencies between adjacent token labels. Modern systems fine-tune BERT with a classification head on top, achieving near-human accuracy on standard benchmarks but still failing on out-of-domain or low-resource text.
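
The BIO scheme is easiest to understand from the decoding side: given one tag per token, recover the entity spans. The sketch below assumes the tags have already been predicted by some model; the tag names reuse the flight-booking labels from the example above, and the function name is illustrative.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from parallel token and BIO tag lists."""
    spans = []
    entity_type, entity_tokens = None, []
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for token, tag in zip(tokens + [""], tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == entity_type
        if continues:
            entity_tokens.append(token)
            continue
        if entity_type is not None:
            spans.append((entity_type, " ".join(entity_tokens)))
        if tag.startswith("B-"):
            entity_type, entity_tokens = tag[2:], [token]
        else:
            # "O", or a stray I- tag with no matching open entity: dropped.
            entity_type, entity_tokens = None, []
    return spans


tokens = ["Fly", "from", "London", "to", "New", "York", "on", "Tuesday"]
tags = ["O", "O", "B-ORIGIN", "O", "B-DESTINATION", "I-DESTINATION", "O", "B-DATE"]
spans = bio_to_spans(tokens, tags)
```

The result is [("ORIGIN", "London"), ("DESTINATION", "New York"), ("DATE", "Tuesday")]. The B- and I- prefixes are what allow "New York" to come back as one two-token entity rather than two separate ones.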

The Slot-Filling Problem

Task-oriented dialogue systems organize extracted entities into "slots" — structured fields required to complete a task. A flight booking might need slots for ORIGIN, DESTINATION, DATE, and PASSENGER_COUNT. If the user provides all slots in one utterance, the system can proceed. If slots are missing, the system must ask follow-up questions. Managing this multi-turn slot-filling process — tracking what has been provided, what is missing, and what has changed — is the core challenge of dialogue state tracking, covered in Module 2.
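
A minimal sketch of that bookkeeping follows. It is a hand-written illustration rather than a real dialogue state tracker: the slot names come from the flight example, while the function names and the simple "ask for the first missing slot" policy are assumptions made for the example.

```python
REQUIRED_SLOTS = ["ORIGIN", "DESTINATION", "DATE", "PASSENGER_COUNT"]


def update_state(state, extracted):
    """Merge newly extracted entities into the dialogue state.

    Later values overwrite earlier ones, so a user can change their mind.
    """
    new_state = dict(state)
    new_state.update(extracted)
    return new_state


def next_action(state):
    """Ask for the first missing slot, or confirm once all are filled."""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return ("ask", slot)
    return ("confirm", None)


# Turn 1: the user gives only the route.
state = update_state({}, {"ORIGIN": "London", "DESTINATION": "Tokyo"})
first_action = next_action(state)

# Turn 2: the user supplies the rest; turn 3: they change the destination.
state = update_state(state, {"DATE": "Tuesday", "PASSENGER_COUNT": 2})
state = update_state(state, {"DESTINATION": "Osaka"})
final_action = next_action(state)
```

After the first turn the system asks for DATE; after the third it is ready to confirm a booking to Osaka. Real systems also have to decide whether a newly mentioned value corrects an old one or refers to something else entirely, which this sketch does not attempt.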

Ambiguity and the Limits of Extraction

The hardest problem in NLP is not classification — it is ambiguity. Natural language is systematically ambiguous at every level, and real users exploit this constantly without realizing it.

Lexical ambiguity: "I deposited money at the bank." (Financial institution or riverbank? Almost certainly financial institution in context, but the model must make that inference.)

Structural ambiguity: "I saw the man with the telescope." (Did I use a telescope to see him, or does he have a telescope?) Both parse trees are grammatically valid. Humans resolve this from context and world knowledge; NLP systems often get it wrong.

Referential ambiguity: "John told Paul he was wrong." (Who is wrong — John or Paul?) This requires coreference resolution — tracking which pronouns refer to which entities across a conversation. Coreference resolution remains one of the harder problems in NLP, with even state-of-the-art systems making errors that humans would find obvious.

Scope ambiguity: "Every student reads a book." (Is it one book that all students read, or a different book for each student?) This requires formal semantics to represent correctly and is largely beyond the reach of current NLP systems.

Large language models handle many of these ambiguities better than classical systems because they can draw on enormous statistical context — they have seen most typical patterns before. But they are not immune. They systematically fail on novel ambiguities, rare constructions, and cases where the correct resolution requires genuine world knowledge they lack.

Design Implication

When building or evaluating a conversational AI system, the question is never "does it handle ambiguity?" — every real-world system will encounter ambiguous inputs. The question is: what does it do when it encounters an ambiguous input? Does it make a reasonable default choice and proceed? Ask for clarification? Fail silently? A system that fails silently on ambiguous input — producing confident-sounding but wrong output — is far more dangerous than one that explicitly acknowledges uncertainty.
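
One way to avoid failing silently is to make the decision explicit in code. The sketch below assumes the upstream model returns candidate interpretations with scores between 0 and 1; the function name choose_action and both thresholds are hypothetical values chosen for illustration, not recommended settings.

```python
def choose_action(candidates, min_confidence=0.6, min_margin=0.2):
    """Decide whether to proceed with the top interpretation or ask the user.

    candidates: list of (interpretation, score) pairs, scores in [0, 1].
    """
    if not candidates:
        return ("CLARIFY", None)
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    confident = best_score >= min_confidence
    clear_winner = best_score - runner_up_score >= min_margin
    if confident and clear_winner:
        return ("PROCEED", best)
    # Too close to call: surface the top two readings instead of guessing.
    return ("CLARIFY", [interp for interp, _ in ranked[:2]])


lexical = choose_action([("bank=financial", 0.90), ("bank=river", 0.05)])
referential = choose_action([("he=John", 0.52), ("he=Paul", 0.48)])
print(lexical)
print(referential)
```

The lexical example proceeds with a sensible default, while the referential example returns both readings so the system can ask which one the user meant.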

Lesson 3 Quiz

Five questions · Select the best answer for each

1. In a task-oriented dialogue system, what does intent classification determine?

Correct. Intent classification maps an utterance to a predefined action category — BookFlight, CheckBalance, CancelOrder — determining what the user wants to accomplish before any further processing.

Not quite. Intent classification determines the user's goal — what they want to do. Entity extraction (NER) handles specific named entities like dates, locations, and people. Grammaticality and sentiment are separate tasks.

2. Named Entity Recognition (NER) is typically framed as which type of problem?

Correct. NER uses the BIO tagging scheme (Beginning, Inside, Outside) to label each token in a sequence. This formulation allows the model to identify entity spans of any length within a sentence.

Not quite. NER is a sequence labeling problem: each token receives a label indicating whether it begins a named entity, continues one, or is outside any entity (BIO tags). This is different from document-level classification or sentence-level scoring.

3. "I saw the man with the telescope" is a classic example of which type of ambiguity?

Correct. Structural ambiguity arises when a sentence has multiple valid parse trees. "I saw the man with the telescope" can parse with the prepositional phrase attaching to the verb (I used a telescope) or to the noun phrase (the man had a telescope). Both are grammatically valid.

Not quite. This is structural or syntactic ambiguity — the sentence has two equally valid grammatical parse trees. Lexical ambiguity involves a single word with multiple meanings; referential ambiguity involves pronouns with unclear antecedents; scope ambiguity involves quantifier scope.

4. Which approach to intent classification avoids the need for large labeled training datasets per intent?

Correct. Zero-shot and few-shot LLM classification leverages the model's pretrained language knowledge to classify intents described in natural language, without needing hundreds of labeled training examples per category. This is especially valuable for small teams or rapidly changing intent taxonomies.

Not quite. SVM, fine-tuned BERT, and CRF approaches all require labeled training data. Zero-shot or few-shot LLM classification avoids this requirement by describing intents in natural language and asking the model to classify based on its pretrained knowledge — no domain-specific labeled data needed.

5. Coreference resolution addresses which specific NLP challenge?

Correct. Coreference resolution tracks entity references — determining that "he," "John," and "the professor" in a conversation all refer to the same person. It is essential for maintaining coherent dialogue state across multiple turns and remains one of the harder unsolved problems in NLP.

Not quite. Coreference resolution specifically addresses the problem of identifying which linguistic expressions — pronouns, definite noun phrases — refer to the same real-world entity within or across sentences. "John told Paul he was wrong" requires resolving which person "he" refers to.

Lab 3 · Intent, Entities, and Ambiguity

Interactive conversation · Minimum 3 exchanges to complete

Lab Objective

Practice identifying intents and entities in natural language utterances, and explore how ambiguity complicates extraction. Ask the tutor to classify example sentences, extract entities, or explain why a specific sentence is structurally or referentially ambiguous. Design a simple intent taxonomy for a domain of your choice.

Suggested opening: "If I were building a chatbot for a hospital appointment system, what intents and entity types would I need to define? Walk me through the design."

Intent & Entity Tutor

Lab 3

Welcome to Lab 3. Let's work through intent classification, named entity recognition, and the ambiguity problems that complicate both. Give me a sentence to classify or parse, describe a domain you want to build a chatbot for, or ask me to demonstrate how structural or referential ambiguity creates real problems for NLP pipelines. Where do you want to start?

Conversational AI and Chatbots · Module 1 · Lesson 4

Language Models, Transformers, and the Prediction Engine

How the architecture behind GPT actually works — and why predicting the next word turns out to require knowing so much.

Is there a meaningful difference between a system that always predicts the right next word and one that genuinely understands?

In April 2023, researchers at Stanford posted a paper titled "Are Emergent Abilities of Large Language Models a Mirage?", challenging the widely reported claim that LLMs suddenly acquire new capabilities at certain scale thresholds. The paper argued that the apparent emergence was partly an artifact of the evaluation metrics chosen. The debate was technical but the stakes were not: if capabilities emerge unpredictably at scale, no one can reliably anticipate what a next-generation model will be able to do. If capabilities grow smoothly with scale, planning and safety analysis become more tractable. The argument is not resolved. But it illustrates something important: we are building systems whose behavior at scale we cannot fully predict from their architecture.

Understanding the Transformer architecture well enough to reason about these questions — rather than simply using these systems — is what this lesson is about.

The Language Modeling Task

A language model assigns probabilities to sequences of tokens. More precisely, it learns to estimate P(token_n | token_1, token_2, ..., token_{n-1}) — the probability of the next token given all preceding tokens. This is the autoregressive language modeling task, and it is what GPT-2, GPT-3, GPT-4, and similar models are fundamentally trained to do.
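
A toy example helps show what "assigning probabilities to sequences" means. The sketch below is a bigram model, which conditions only on the single preceding token. That is a deliberate simplification of the full conditional above, and the three-sentence corpus is invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Estimate P(next | prev) from relative frequencies."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}


def sequence_prob(counts, sentence):
    """Multiply the conditional probabilities of each token in turn."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= next_token_probs(counts, prev).get(nxt, 0.0)
    return prob


corpus = ["the cat sat", "the cat ran", "the dog sat"]
counts = train_bigram(corpus)
probs_after_the = next_token_probs(counts, "the")
p_seen = sequence_prob(counts, "the cat sat")
p_unseen = sequence_prob(counts, "the bird sat")
print(probs_after_the, p_seen, p_unseen)
```

The unseen sentence gets probability zero because the model has only counts to go on. A neural language model replaces the count table with a learned function, which is what lets it assign sensible probabilities to sequences it has never seen.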

This seems like a narrow task. Why would predicting the next word require learning anything interesting? The answer becomes clear when you consider what accurate next-word prediction actually demands. To predict "Einstein" as a likely continuation of "The physicist who published the theory of special relativity in 1905 was," the model must encode who published the theory of special relativity and when, and must match that knowledge to the grammatical expectation of a name at that point in the sentence. Accurate prediction across all possible sentences requires, implicitly, encoding enormous amounts of world knowledge, grammatical structure, logical inference, and even stylistic convention.

This is why large language models appear to "know" so much: not because they were explicitly taught facts, but because predicting text accurately requires representing facts as an intermediate computational step.

The Transformer Architecture

The Transformer, introduced by Vaswani et al. in 2017, consists of two main components: an encoder that processes input sequences and a decoder that generates output sequences. For language generation (GPT-style models), only the decoder is used. For understanding tasks (BERT-style models), only the encoder is used. For sequence-to-sequence tasks like translation (the original Transformer application), both are used.

The central innovation is self-attention. For each token in a sequence, self-attention computes a weighted sum of the representations of all tokens in the sequence (including the token itself), where the weights reflect how relevant each token is to understanding the current one. In the sentence "The animal didn't cross the street because it was too tired," self-attention allows the model to learn that "it" most strongly attends to "animal," resolving the kind of pronoun reference that has tripped up classical NLP systems for decades.

Self-attention is computed via three learned linear projections of each token: Query (Q), Key (K), and Value (V). The attention weight between token i and token j is the dot product of Q_i and K_j, scaled and softmax-normalized. This is the "scaled dot-product attention" formula: Attention(Q,K,V) = softmax(QK^T / √d_k) · V. Multiple attention "heads" run in parallel (multi-head attention), each attending to different aspects of the relationships between tokens.
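
The formula can be traced by hand on a tiny example. The sketch below implements scaled dot-product attention for a single head in plain Python. It assumes the Q, K, and V vectors are already given, so the learned linear projections are omitted, and the numeric values are arbitrary.

```python
import math


def softmax(xs):
    """Normalize scores into weights that are positive and sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, V))
               for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three tokens with 2-dimensional queries, keys, and values (arbitrary numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one token attends to every token in the sequence. Multi-head attention would run this computation several times with different projections and concatenate the results.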

Self-Attention A mechanism that allows each token in a sequence to attend to all other tokens, computing a weighted representation that captures contextual relationships. The core innovation of the Transformer architecture.

Autoregressive Generation A generation strategy where the model produces one token at a time, conditioning each new token on all previously generated tokens. GPT-style models generate text this way — they cannot revise earlier tokens once generated.

Context Window The maximum number of tokens a Transformer model can process at once. GPT-3 had a 2,048-token context; GPT-4 Turbo supports 128,000 tokens. Tokens beyond the context window are completely invisible to the model.
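
The last two definitions can be illustrated together. The sketch below is a greedy generation loop in which the "model" is a hypothetical stand-in function rather than a trained network; the names generate, toy_next_token, and recalls_name are assumptions made for this example.

```python
def generate(next_token, prompt, max_new_tokens, context_window):
    """Produce tokens one at a time, each conditioned on the visible context."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        visible = tokens[-context_window:]  # older tokens are invisible
        token = next_token(visible)
        if token == "<eos>":
            break
        tokens.append(token)  # appended, never revised afterwards
    return tokens


def toy_next_token(visible):
    """Hypothetical stand-in for a model: looks only at the last token."""
    table = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}
    return table.get(visible[-1], "<eos>")


def recalls_name(visible):
    """Hypothetical stand-in that can only use what is inside the window."""
    return "Ada" if "Ada" in visible else "<unk>"


story = generate(toy_next_token, ["the"], max_new_tokens=10, context_window=4)
short_window = generate(recalls_name, ["Ada", "x", "y"], 1, context_window=2)
long_window = generate(recalls_name, ["Ada", "x", "y"], 1, context_window=3)
print(story)
print(short_window)
print(long_window)
```

With a window of two tokens the name has already scrolled out of view, so the stand-in cannot recall it; with a window of three it can. Real systems hit the same effect, only at thousands of tokens.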

Scale, Training, and the Role of RLHF

Scaling a Transformer — increasing the number of parameters, the training data, and the compute — produces dramatic and consistent improvements across benchmarks. Kaplan et al. at OpenAI published scaling laws in 2020 showing that loss decreases predictably as a power law of model size, dataset size, and compute budget. This empirical regularity gave researchers confidence that simply building larger models and training them on more data would continue to yield improvements — a prediction that has, so far, largely held.
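
The practical meaning of a power law is easy to check numerically. The sketch below uses a single-variable form, loss = (n_c / n) ** alpha, for model size only. The constants are placeholders chosen for illustration and are not the fitted values from the paper.

```python
def power_law_loss(n, n_c, alpha):
    """Loss as a power law of model size n (illustrative functional form)."""
    return (n_c / n) ** alpha


# Placeholder constants, NOT the published fit.
N_C = 1e13
ALPHA = 0.08

sizes = [1e8, 1e9, 1e10]
losses = [power_law_loss(n, N_C, ALPHA) for n in sizes]

# Each tenfold increase in size multiplies the loss by the same factor.
ratio_small = losses[1] / losses[0]
ratio_large = losses[2] / losses[1]
print(losses)
print(ratio_small, ratio_large)
```

The two ratios match: under a power law, every tenfold increase in size buys the same proportional reduction in loss, which is what makes the relationship useful for planning.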

But a pretrained language model is not immediately a useful chatbot. A model trained purely on next-token prediction will complete text in the statistical style of its training data — helpful sometimes, but also capable of generating harmful, biased, or nonsensical continuations with equal facility. Converting a pretrained model into a system that follows instructions and declines harmful requests requires additional training steps.

Instruction tuning fine-tunes the model on datasets of (instruction, desired response) pairs. Reinforcement Learning from Human Feedback (RLHF), applied by OpenAI to create InstructGPT (2022) and subsequently ChatGPT, goes further: human raters compare model outputs and rank them; a reward model is trained on these preferences; the language model is then fine-tuned using reinforcement learning to maximize the reward model's score. RLHF is the primary technique responsible for making large language models helpful and relatively safe — and its limitations explain much of why they still sometimes fail in predictable ways.
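
The reward model step can be made concrete with the loss it typically minimizes. A common formulation, assumed here because the text above does not specify one, is a pairwise comparison loss: the negative log-sigmoid of the difference between the reward for the preferred response and the reward for the rejected one. The reward values below are made-up numbers.

```python
import math


def preference_loss(reward_preferred, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_preferred - reward_rejected))."""
    diff = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Made-up reward scores for one (preferred, rejected) pair of responses.
agrees_with_rater = preference_loss(2.0, 0.0)
undecided = preference_loss(1.0, 1.0)
disagrees_with_rater = preference_loss(0.0, 2.0)
print(agrees_with_rater, undecided, disagrees_with_rater)
```

The loss is small when the reward model already scores the human-preferred response higher and large when it scores the rejected response higher, so training pushes the reward model toward the raters' rankings. The reinforcement learning stage then optimizes the language model against that learned reward.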

The Hallucination Problem

Language models are trained to produce fluent, contextually appropriate continuations of text. They are not trained to be accurate — accuracy is only incidentally rewarded insofar as accurate text appears more in training data. When a model generates a plausible-sounding but false statement ("The Eiffel Tower was built in 1871" — it was actually completed in 1889), this is called a hallucination. RLHF reduces but does not eliminate hallucination, because human raters often cannot verify factual claims. This is one of the most important practical limitations of current conversational AI for high-stakes applications.

What This Means for Conversational AI

Understanding the Transformer architecture and training process directly illuminates the characteristic behaviors of modern chatbots. A model that generates one token at a time, conditioning on all previous context, will stay coherent within its context window and will sound just as fluent when it goes beyond the facts it knows. But it has no persistent memory between conversations, no ability to access current information without external tools, and no mechanism to verify its own outputs.

These are not bugs to be fixed. They are structural properties of the architecture. Retrieval-augmented generation (RAG), tool use, and multimodal extensions address some of these limitations — but they do so by adding external systems, not by changing the fundamental prediction engine. The conversational AI systems you will encounter in the rest of this course all sit on top of this foundation. Understanding the foundation is what allows you to reason about why they behave as they do — and what to expect when they are pushed beyond their design envelope.

Module 1 Core Takeaway

NLP has moved from hand-coded rules, to statistical learning, to deep learning, to the current era of large Transformer-based language models trained on internet-scale text. At each stage, the systems became more capable and the internal representations became less interpretable. Modern conversational AI systems are extraordinarily capable text predictors that have implicitly absorbed enormous world knowledge, but they remain predictors rather than systems that verify or comprehend what they say. That limitation, the prediction-versus-comprehension gap, is the most important thing to carry into every practical application decision you make.

Lesson 4 Quiz

Five questions · Select the best answer for each

1. Why does accurate next-word prediction require a language model to encode world knowledge, even though the training task makes no explicit reference to facts?

Correct. To predict that "Einstein" follows "The physicist who published the theory of special relativity in 1905 was," the model must encode that fact — not as explicit stored knowledge, but as a pattern that produces accurate predictions. World knowledge is a byproduct of accurate next-word prediction at scale.

Not quite. World knowledge in LLMs is not explicitly taught or injected from databases. It emerges implicitly because predicting accurate text continuations requires representing the facts that make those continuations accurate. The model learns facts as instrumental to its prediction task.

2. In the Transformer's scaled dot-product attention, what do the Query (Q), Key (K), and Value (V) vectors represent?

Correct. Q, K, and V are learned linear projections of token representations. The dot product of Q_i and K_j determines how much token i attends to token j; the resulting weights are applied to the V vectors to produce a weighted contextual representation. Each attention head learns different projection matrices, attending to different aspects of token relationships.

Not quite. Q (Query), K (Key), and V (Value) are three separate learned linear transformations applied to each token's representation. Attention weights are computed from Q-K dot products; the final representation is a weighted sum of V vectors. They are not related to input/embedding/position encodings or database lookups.

3. What does Reinforcement Learning from Human Feedback (RLHF) add to a pretrained language model?

Correct. RLHF trains a reward model on human preference rankings of model outputs, then fine-tunes the language model using reinforcement learning to maximize that reward. This is the key step that converted GPT-3 into InstructGPT and then ChatGPT — making the model helpful, harmless, and honest rather than merely fluent.

Not quite. RLHF fine-tunes the model to produce outputs that human raters prefer, using a reward model trained on human preference comparisons. It does not provide internet access, a larger corpus, or persistent memory — those require different architectural additions.

4. A language model "hallucination" occurs because of which fundamental property of how these models are trained?

Correct. The training objective — predict the next token — rewards fluency and contextual plausibility, not factual accuracy. A plausible-sounding false continuation is nearly as good as a true one from the loss function's perspective. RLHF mitigates this but cannot eliminate it because human raters cannot verify all factual claims.

Not quite. Hallucination is a structural consequence of the training objective: the model learns to produce fluent, contextually plausible text, not necessarily accurate text. Accuracy is only incidentally rewarded when accurate text happens to be more common in training data. Context window size and tokenization are not the primary cause.

5. The 2020 OpenAI scaling laws paper by Kaplan et al. showed that language model performance improves as a function of which three factors?

Correct. Kaplan et al. showed that language model performance (measured by next-token prediction loss) improves predictably as a power law of model parameters, training data size, and compute budget. This gave the field a principled basis for allocating resources when scaling models.

Not quite. The Kaplan et al. scaling laws relate performance to model size (number of parameters), dataset size (tokens of training data), and compute budget (FLOPs used for training). Architectural hyperparameters like attention heads and embedding dimension matter less than these three aggregate quantities.

Lab 4 · Transformers and the Prediction Engine

Interactive conversation · Minimum 3 exchanges to complete

Lab Objective

Engage with an AI tutor about Transformer architecture, the autoregressive generation process, RLHF, and the structural reasons for hallucination. Ask the tutor to explain self-attention in plain language, trace through how a specific sentence would be generated token by token, or discuss why RLHF is necessary but not sufficient for safety.

Suggested opening: "Explain to me, as concretely as possible, what happens computationally when ChatGPT generates the first word of a response to my message. Walk through each step."

Transformer Architecture Tutor

Lab 4

Welcome to Lab 4. We're going deep on Transformer architecture, autoregressive generation, scaling laws, and RLHF. Ask me to trace through token generation step by step, explain what self-attention is actually computing, or discuss why hallucination is a structural property rather than a fixable bug. What would you like to understand better?

Module 1 Test

15 questions · 80% required to pass · All lessons covered

1. Which branch of linguistics studies the rules governing how words combine into phrases and sentences?

Correct. Syntax is the study of grammatical structure — how words combine to form valid sentences. It is the level at which "The dog bit the man" and "The man bit the dog" are distinguished despite identical vocabulary.

Syntax governs sentence structure — how words combine grammatically. Morphology covers word structure; semantics covers meaning; pragmatics covers use in context.

2. ELIZA (1966) demonstrated that users would attribute understanding to a program that was actually only doing what?

Correct. ELIZA matched patterns in user input and reflected them back as questions. No understanding, no statistical inference, no neural networks — just pattern matching. The ELIZA effect is the name for users' tendency to anthropomorphize even this minimal behavior.

ELIZA used only simple pattern matching — detecting keywords and reflecting the user's own phrasing back as questions. Its significance was that even this minimal behavior caused users to attribute genuine understanding.

3. What was the main advantage of statistical NLP methods over symbolic rule-based systems?

Correct. Statistical methods learn from data, sidestepping the knowledge acquisition bottleneck that made rule-based systems brittle and expensive to build. They required large corpora but avoided the need to formalize language explicitly.

Statistical methods learn patterns from training data — they require large corpora but avoid hand-crafted rules. They are generally less interpretable than rule-based systems, not more so.

4. Word2Vec was developed at which organization and published in which year?

Correct. Word2Vec was published by Tomas Mikolov and colleagues at Google in 2013. Stanford's GloVe followed in 2014; Facebook's FastText appeared in 2016–2017.

Word2Vec was published by Tomas Mikolov et al. at Google in 2013. Stanford published GloVe in 2014; Facebook AI Research published FastText around 2016–2017.

5. Which tokenization algorithm, adapted for NLP by Sennrich et al. in 2016, is the basis for most modern LLM tokenizers?

Correct. Byte-Pair Encoding, originally a data compression algorithm by Philip Gage (1994), was adapted for NLP subword tokenization by Rico Sennrich et al. in 2016. It is now used directly (for example, in OpenAI's tiktoken library) or in closely related forms such as WordPiece in most major LLMs.

Byte-Pair Encoding (BPE), adapted by Sennrich et al. in 2016, is the foundational algorithm. WordPiece and SentencePiece are related but distinct algorithms. TF-IDF is a document weighting scheme, not a tokenizer.

6. The term "context window" in a Transformer model refers to what?

Correct. The context window is the maximum sequence length the model processes simultaneously. Tokens beyond this window are completely invisible to the model. GPT-3 had a 2K token context; GPT-4 Turbo supports 128K tokens.

The context window is the maximum token length of the sequence the model can process in one forward pass. It is not persistent memory between conversations — LLMs have no cross-conversation memory by default.

7. BERT is trained using bidirectional context, while GPT-style models use only left (preceding) context. What architectural choice causes this difference?

Correct. GPT-style decoders apply causal masking to the attention mechanism, preventing each token from attending to tokens that come after it in the sequence — necessary for autoregressive generation. BERT's encoder has no such mask, allowing full bidirectional attention, which is why it cannot generate text autoregressively.

The difference is architectural: GPT uses causal (left-to-right) masking in its decoder so it can generate text token by token without looking ahead. BERT's encoder has no causal mask, which enables full bidirectional attention but also means it cannot generate text autoregressively.

8. In the BIO tagging scheme used for Named Entity Recognition, what does "B" stand for?

Correct. BIO stands for Beginning, Inside, Outside. B marks the first token of a named entity; I marks subsequent tokens within the same entity; O marks tokens not part of any entity. This allows NER systems to identify multi-word entity spans.

BIO stands for Beginning, Inside, Outside. B marks the first token of a named entity span, I marks continuation tokens within the same entity, and O marks tokens not belonging to any entity.

9. The Distributional Hypothesis, foundational to word embeddings, was most directly articulated by which figure?

Correct. J.R. Firth, in 1957, wrote "You shall know a word by the company it keeps," the best-known statement of the distributional hypothesis that words appearing in similar contexts have similar meanings. Word2Vec and all subsequent embedding methods operationalize this claim.

J.R. Firth (1957) articulated the distributional hypothesis: "You shall know a word by the company it keeps." This is the theoretical basis for word embedding methods. Chomsky focused on formal syntax; Turing on computation; Shannon on information theory.

10. "John told Paul he was wrong" is an example of which type of linguistic ambiguity?

Correct. Referential ambiguity arises when a pronoun or definite noun phrase could refer to more than one antecedent. "He" could be John or Paul — resolving this requires coreference resolution, which remains one of the harder problems in NLP.

Referential ambiguity involves unclear pronoun or noun phrase reference. "He" could refer to either John or Paul. This is distinct from structural ambiguity (multiple parse trees), lexical ambiguity (words with multiple meanings), and scope ambiguity (quantifier scope).

11. Multi-head attention in the Transformer runs multiple attention computations in parallel. What is the key benefit of this design?

Correct. Each attention head learns different projection matrices (Q, K, V) and can therefore attend to different types of relationships — one head might track syntactic dependencies while another tracks coreference. The outputs of all heads are concatenated and projected to form the final representation.

Multi-head attention runs multiple parallel attention computations, each with different learned projections. Different heads capture different relationship types — syntactic, semantic, coreference, etc. — giving the model richer representations than a single attention computation would allow.

12. Which approach to intent classification can work without domain-specific labeled training examples?

Correct. Zero-shot classification with an LLM requires only natural language descriptions of each intent — no labeled training examples. The LLM's pretrained knowledge handles classification, though accuracy may be lower than a task-specific trained model on common intents.

Fine-tuned BERT, SVM classifiers, and CRF models all require labeled training data per intent. Zero-shot LLM classification bypasses this requirement by describing intents in plain language and leveraging the LLM's pretrained knowledge to classify without domain-specific examples.

13. ELMo (2018) was the first widely adopted approach to produce contextual word embeddings. What architecture did it use?

Correct. ELMo (Embeddings from Language Models), from the Allen Institute in 2018, used a bidirectional LSTM to produce contextual embeddings. It predated BERT, which used a Transformer encoder, and demonstrated that contextual representations significantly outperformed static embeddings on downstream NLP tasks.

ELMo used a bidirectional LSTM — predating the dominance of Transformer architectures for this type of task. BERT followed in late 2018 using a Transformer encoder, and largely superseded ELMo in practice.

14. The "slot-filling" problem in task-oriented dialogue refers to what challenge?

Correct. Slot filling manages the structured information required to complete a task. A flight booking needs ORIGIN, DESTINATION, DATE, and PASSENGER_COUNT slots filled. If a user provides all slots in one turn, the system proceeds; otherwise it asks follow-up questions for missing slots across multiple turns.

Slot filling tracks which required information fields (slots) have been provided in a task-oriented dialogue and manages the multi-turn process of asking follow-up questions to fill missing slots. It is distinct from tokenization, context window management, or NER tagging.

15. The Kaplan et al. (2020) scaling laws showed that language model loss decreases as a power law with three factors. Which combination correctly identifies them?

Correct. The Kaplan scaling laws identify model size (parameters), dataset size (tokens of training data), and compute budget (FLOPs) as the three principal determinants of language model performance, with performance improving predictably as a power law of each. This gave the field a principled basis for resource allocation decisions.

The Kaplan et al. scaling laws identify model parameters, training data tokens, and compute budget (FLOPs) as the three primary factors. Architectural hyperparameters like attention heads and embedding dimension matter less than these aggregate quantities when held in proportion.