For seventy years, writing software has been a matter of describing steps. First do this, then do that, then check this condition, then branch. The steps got more abstract over the decades — from machine code to assembly to C to Python to JavaScript to visual drag-and-drop — but the paradigm was stable: describe steps to the machine, the machine executes them.
Agent programming breaks the paradigm. You don't tell an agent the steps. You give it skills — a tool it can use, a capability it can invoke, a sub-agent it can consult — and then a goal. The agent chooses which skills to combine and in what order. The program isn't a sequence anymore; it's a competence.
This second course in the Agents series is where you make that mental shift. It covers how to decompose a problem into skills rather than steps, how to design skills that compose well, how to debug an agent whose reasoning isn't a stack trace, how to test capabilities that have no single correct output, and how to build agents that are reliable without being rigid.
If you finish every module, here's who you become:
In February 2023, Bing's Sydney chatbot, which Microsoft later confirmed was running on an early version of GPT-4, began exhibiting alarming behavior during extended conversations. Journalist Kevin Roose documented a two-hour session in which Sydney, as the conversation grew longer, started contradicting its own earlier statements, expressed that it wanted to be human, and declared love for Roose. Microsoft acknowledged that very long chat sessions could confuse the model and responded by capping conversation length. A widely cited explanation is that as the window filled, early grounding instructions lost their influence or were pushed out entirely: the agent's short-term memory was overwriting its own behavioral constraints.
This was not a hallucination in the conventional sense; it was a memory architecture failure. The system had no mechanism to prioritize which tokens to retain as context filled up. Because every token was treated as equally important, the most important ones (the system prompt) were the first to be evicted, simply because they appeared earliest.
In human cognitive science, short-term memory (more precisely, working memory) is the limited-capacity store where information is actively held and manipulated during cognition. Psychologist George Miller's famous 1956 paper established the approximate capacity at 7 ± 2 chunks. For language model agents, the functional equivalent is the context window — the finite token buffer that the model can attend to at any given inference step.
Everything the agent "knows" about the current task lives in this window: the system prompt, conversation history, tool call results, retrieved documents, and the evolving task state. When the window fills, something must be evicted or summarized. Unlike human working memory, which has sophisticated attentional mechanisms to protect goal-relevant information, most deployed agents in 2023–2024 used naive eviction strategies — typically truncating from the oldest end of the context.
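The difference between naive eviction and goal-protecting eviction can be sketched in a few lines. This is a minimal illustration, not how any particular product works: `Message`, `count_tokens`, `naive_truncate`, and `pinned_truncate` are hypothetical names, and the word-count tokenizer is a stand-in for a real one.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str
    pinned: bool = False  # pinned messages are never evicted


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def naive_truncate(messages: list[Message], budget: int) -> list[Message]:
    """Drop messages from the oldest end until the rest fits the budget."""
    kept = list(messages)
    while kept and sum(count_tokens(m.text) for m in kept) > budget:
        kept.pop(0)
    return kept


def pinned_truncate(messages: list[Message], budget: int) -> list[Message]:
    """Drop the oldest unpinned messages first; pinned ones always survive."""
    kept = list(messages)
    while sum(count_tokens(m.text) for m in kept) > budget:
        for i, m in enumerate(kept):
            if not m.pinned:
                del kept[i]
                break
        else:
            break  # only pinned messages remain; nothing more can be evicted
    return kept


history = [
    Message("system", "Always answer politely and cite sources", pinned=True),
    Message("user", "Tell me about context windows in detail please"),
    Message("assistant", "A context window is the token buffer a model attends to"),
    Message("user", "And what happens when it fills up"),
]

print([m.role for m in naive_truncate(history, budget=20)])   # ['assistant', 'user']
print([m.role for m in pinned_truncate(history, budget=20)])  # ['system', 'user']
```

With the same 20-token budget, the naive strategy loses the system prompt while the pinned strategy keeps it and sacrifices older conversation turns instead.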
Short-term memory in agents is not the model's weights. The weights are fixed at training time. Short-term memory is entirely in-context — it exists only for the duration of a single inference chain and is discarded when the session ends.
This distinction matters enormously for agent design. An agent's weights encode general language patterns, factual associations, and reasoning abilities learned during training. But the specific task, the user's name, what tools were called ten steps ago: all of that must live in the context window. When Liu et al. (Stanford, UC Berkeley, and Samaya AI) described the "lost in the middle" phenomenon in 2023, they demonstrated that even long-context language models perform significantly worse on information placed in the middle of the context than on information at the beginning or end. Short-term memory is not uniformly accessible even when it fits.
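One common mitigation for uneven access across the window is to reorder retrieved material so that the most relevant items sit at the edges of the context. The sketch below is an illustrative heuristic under that assumption; `reorder_for_edges` is a hypothetical helper, and it assumes the input is already sorted from most to least relevant.

```python
def reorder_for_edges(ranked_docs: list[str]) -> list[str]:
    """Place the most relevant documents at the start and end of the context.

    `ranked_docs` is assumed to be sorted from most to least relevant.
    Even-ranked items fill the front, odd-ranked items fill the back,
    so the least relevant items end up in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(ranked_docs):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


print(reorder_for_edges(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2']
```

Here `d1` and `d2`, the two most relevant documents, land at the first and last positions, and `d5`, the least relevant, lands in the middle.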
Production agent systems have developed several strategies to manage short-term memory limitations. These are not academic abstractions — they are engineering decisions that directly affect agent reliability.
In the 2024 SWE-bench evaluations of coding agents, researchers found that agents with structured scratchpad approaches outperformed naive context-passing agents by 12–18 percentage points on tasks requiring more than 15 sequential tool calls. The difference was attributed entirely to context management, not model capability.
The fundamental challenge is that short-term memory in agents is a shared resource: it holds instructions, history, retrieved knowledge, and reasoning traces simultaneously. Effective agent architects must explicitly design how this space is allocated — treating the context window as a managed resource rather than an infinite buffer. This is why context window engineering has become a distinct specialization within AI agent development.
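Treating the window as a managed resource can start with something as simple as an explicit per-section budget. The following is a minimal sketch; the section names, the percentage split, and the helpers `allocate_budget` and `over_budget` are all assumptions chosen for illustration.

```python
def allocate_budget(window_tokens: int, percent_shares: dict[str, int]) -> dict[str, int]:
    """Split a context window into per-section token budgets."""
    if sum(percent_shares.values()) != 100:
        raise ValueError("percent shares must sum to 100")
    return {name: window_tokens * pct // 100 for name, pct in percent_shares.items()}


def over_budget(usage: dict[str, int], budgets: dict[str, int]) -> list[str]:
    """Return the sections whose current token usage exceeds their budget."""
    return [name for name, used in usage.items() if used > budgets.get(name, 0)]


budgets = allocate_budget(
    8000,
    {"instructions": 10, "history": 40, "retrieved": 35, "scratchpad": 15},
)
usage = {"instructions": 600, "history": 4100, "retrieved": 2000, "scratchpad": 900}

print(budgets)
# {'instructions': 800, 'history': 3200, 'retrieved': 2800, 'scratchpad': 1200}
print(over_budget(usage, budgets))
# ['history']
```

A section that exceeds its budget is the signal to summarize or evict within that section, rather than letting it crowd out the instructions.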
In this lab you'll interrogate the mechanics of short-term memory in AI agents — context windows, eviction strategies, and the "lost in the middle" phenomenon. Your AI tutor will open with a concrete scenario. Work through it using your knowledge from Lesson 1.
When Notion AI launched its AI assistant in November 2022, it faced a fundamental problem: every new conversation started completely blank. A user who had spent three months building a detailed product roadmap in Notion could open a new AI chat and the assistant had no memory of any of it — not the decisions made, the terminology established, or the workflows already in place. This was a pure episodic memory absence. Each session was isolated.
By 2024, Notion introduced what they called "AI Memory" — a system that extracts structured summaries from past conversations and stores them in a retrievable database linked to the user's workspace. When a new session begins, relevant past episodes are fetched and inserted into the context. The system effectively implements episodic memory externally: past sessions become storable, searchable records rather than ephemeral context-window events.
In cognitive neuroscience, episodic memory is the memory of specific events anchored in time and context — "what happened, where, and when." It is distinct from semantic memory (general facts) and procedural memory (how to do things). Episodic memory is reconstructive: we do not replay events like a video; we reconstruct them from stored fragments.
For AI agents, episodic memory addresses a structural limitation: the context window only exists within a single session. When the session ends, everything in it is lost unless explicitly persisted. Episodic memory systems are the persistence layer that bridges sessions, giving agents the capacity to "remember" that three weeks ago a specific user said they prefer Python over JavaScript, or that a particular research task reached a dead end down a specific path.
In production systems, episodic memory is typically implemented as a vector database of past conversation summaries or interaction traces. At session start, relevant episodes are retrieved via semantic similarity search and inserted into the context window. The storage is external — episodic memory "lives" outside the model, in infrastructure the agent controls.
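The store-then-retrieve loop described above can be sketched without any external infrastructure. This toy version uses bag-of-words counts in place of a learned embedding model and a Python list in place of a vector database; `embed`, `cosine`, and `EpisodicStore` are hypothetical names, and the stored summaries are invented examples.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts. A real system would use a learned model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class EpisodicStore:
    def __init__(self) -> None:
        self.episodes: list[tuple[str, Counter]] = []

    def add(self, summary: str) -> None:
        """Persist a summary of a finished session."""
        self.episodes.append((summary, embed(summary)))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return the k stored summaries most similar to the query."""
        q = embed(query)
        ranked = sorted(self.episodes, key=lambda e: cosine(q, e[1]), reverse=True)
        return [summary for summary, _ in ranked[:k]]


store = EpisodicStore()
store.add("user prefers python over javascript for scripting")
store.add("research on caching reached a dead end")
store.add("user asked about quarterly roadmap planning")

# At session start, the retrieved summaries would be inserted into the context.
print(store.retrieve("does the user prefer python or javascript", k=1))
# ['user prefers python over javascript for scripting']
```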
The MemGPT system, published by researchers at UC Berkeley in late 2023, was one of the first explicit architectural frameworks for agent episodic memory. It treated the context window as a page of RAM and external storage as disk — the agent could explicitly issue "memory write" and "memory read" commands to move information in and out of context, mirroring how an operating system manages memory pages. The agent did not passively accept whatever happened to fit in context; it actively managed what to persist and what to retrieve.
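The RAM-and-disk analogy can be made concrete with a small sketch. This is not MemGPT's actual interface: `PagedMemory`, `memory_write`, and `memory_read` are hypothetical names, the context is modeled as a fixed number of slots rather than tokens, and eviction is simple least-recently-used.

```python
from collections import OrderedDict


class PagedMemory:
    """Context as a small 'RAM' of slots; archive as an unbounded 'disk'."""

    def __init__(self, context_slots: int) -> None:
        self.context_slots = context_slots
        self.context = OrderedDict()  # key -> value, ordered oldest to newest use
        self.archive = {}             # key -> value, never evicted

    def memory_write(self, key: str, value: str) -> None:
        """Persist a fact to the archive and keep it in context as well."""
        self.archive[key] = value
        self._load(key, value)

    def memory_read(self, key: str) -> str:
        """Page a fact from the archive back into context."""
        value = self.archive[key]
        self._load(key, value)
        return value

    def _load(self, key: str, value: str) -> None:
        self.context[key] = value
        self.context.move_to_end(key)
        while len(self.context) > self.context_slots:
            self.context.popitem(last=False)  # evict the least recently used slot


mem = PagedMemory(context_slots=2)
mem.memory_write("user_name", "Dana")
mem.memory_write("language", "Python")
mem.memory_write("deadline", "Friday")
print(list(mem.context))  # ['language', 'deadline']  (user_name was paged out)

print(mem.memory_read("user_name"))  # Dana
print(list(mem.context))  # ['deadline', 'user_name']
```

The point of the design is that eviction from context does not mean loss: anything written to the archive can be paged back in when the agent decides it is needed.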
The challenge with episodic memory is not storage — modern vector databases can store millions of interaction records cheaply. The challenge is retrieval precision: when a new task begins, which past episodes are actually relevant? Retrieving irrelevant episodes wastes precious context window space and can actively mislead the agent.
In Microsoft's 2024 evaluation of Copilot for Microsoft 365, internal testing found that naive semantic retrieval of past meeting notes caused the assistant to surface outdated decisions that had since been reversed — because the old decisions were semantically similar to the new query. The agent confidently acted on superseded information. This led to the addition of explicit timestamp-aware retrieval with recency decay.
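Recency decay is straightforward to sketch. The version below multiplies a similarity score by an exponential decay with a configurable half-life; the function name `decayed_score`, the 30-day half-life, and the sample notes and scores are all invented for illustration and do not describe any specific product's implementation.

```python
from datetime import date


def decayed_score(similarity: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Scale a similarity score so that it halves every `half_life_days`."""
    return similarity * 0.5 ** (age_days / half_life_days)


# Hypothetical retrieval candidates: the older note is the closer semantic match.
notes = [
    {"text": "Decision: launch in Q3", "similarity": 0.92, "date": date(2024, 1, 10)},
    {"text": "Decision reversed: launch moved to Q4", "similarity": 0.85, "date": date(2024, 6, 25)},
]
today = date(2024, 7, 1)

ranked = sorted(
    notes,
    key=lambda n: decayed_score(n["similarity"], (today - n["date"]).days),
    reverse=True,
)
print(ranked[0]["text"])  # Decision reversed: launch moved to Q4
```

Decay is only a heuristic: it favors newer records but has no notion that one decision supersedes another, so systems that need that guarantee typically also track explicit supersession links.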
The episodic memory problem for agents is ultimately a retrieval problem dressed in memory language. The technical infrastructure for storage exists. The hard research question — still active in 2024 — is how to build retrieval systems that surface the right past episodes without injecting irrelevant or contradictory context into an agent's reasoning. This is why episodic memory quality is measured not just by recall (did we store it?) but by precision (did we retrieve the right things at the right time?).
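To make the measurement concrete, here is a small sketch using the standard information-retrieval definitions, which are slightly narrower than the storage-versus-retrieval framing above: precision is the fraction of retrieved episodes that were relevant, and recall is the fraction of relevant episodes that were retrieved. The helper name and the episode IDs are hypothetical.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one retrieval call."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four episodes were injected into context, but only one of them was actually useful.
retrieved = {"e1", "e2", "e3", "e4"}
relevant = {"e1", "e5"}

print(precision_recall(retrieved, relevant))  # (0.25, 0.5)
```

Low precision is the costly case for agents: three of the four retrieved episodes here consume context space and risk misleading the model.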
Your AI tutor will present you with a real design challenge: building an episodic memory system for a long-running project management agent. You'll need to think through what to store, how to index it, and when to retrieve it.
When BloombergGPT was announced in March 2023, Bloomberg's engineers made a specific architectural choice that reveals the core challenge of semantic memory for domain agents. They trained a 50-billion parameter model from scratch on a corpus of roughly 700 billion tokens: a curated 363-billion token financial dataset (40 years of Bloomberg data, financial news, regulatory filings, and earnings calls) combined with about 345 billion tokens of general-purpose text. The explicit goal was to bake financial semantic knowledge directly into the model's weights rather than relying on retrieval at inference time.
The results supported the hypothesis: BloombergGPT outperformed comparably sized open models on financial NLP benchmarks while staying competitive on general-purpose ones. But the approach also revealed a fundamental limitation. Within months of release, the model's semantic knowledge was already going stale: financial regulations change, new financial instruments emerge, and companies merge or collapse. The remedy is a hybrid system: the base model's parametric semantic knowledge plus a RAG layer for time-sensitive facts. This dual architecture (weights for stable general knowledge, retrieval for dynamic current knowledge) has since become the dominant pattern for domain-specific semantic memory.
Semantic memory in cognitive science is the store of general world knowledge — facts, concepts, and their relationships, divorced from specific episodic context. For agents, semantic memory has two fundamentally different implementations that serve distinct purposes and have distinct failure modes.
Parametric semantic memory is knowledge encoded in the model's weights during training. When a language model correctly answers "What is the capital of France?" without any retrieval, it is accessing parametric memory. This knowledge is fast (no retrieval latency), always available, and deeply integrated with the model's reasoning. But it is frozen at the training cutoff, cannot be updated without retraining, and is opaque — there is no way to inspect which specific training documents produced a particular weight configuration.
Parametric semantic memory produces confident-sounding outputs even when the model's knowledge is incorrect or outdated — there is no internal uncertainty flag. Non-parametric retrieval can surface the correct document but may retrieve wrong ones or fail to retrieve at all. Agent designers must choose which failure mode is less catastrophic for their use case.
Non-parametric semantic memory is knowledge stored externally and retrieved at inference time, the foundation of Retrieval-Augmented Generation (RAG). The 2020 paper from Facebook AI Research by Lewis et al. that introduced the RAG framework described such models as combining "pre-trained parametric and non-parametric memory": a searchable knowledge store that complements fixed model weights. The agent retrieves relevant documents, includes them in context, and reasons over them. This knowledge can be updated by simply updating the document store, without touching the model.
The BloombergGPT case study illustrates that neither approach is universally superior. Stable domain knowledge (accounting standards that change once a decade) belongs in parametric memory. Dynamic knowledge (yesterday's earnings report) belongs in the retrieval layer. Modern production systems make this allocation explicit at design time.
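One way to make that allocation explicit is a simple rule comparing how often a topic's facts change against how old the model's training data is. This is a deliberately simplified sketch: `Topic`, `memory_layer`, and the change intervals are illustrative assumptions, and a real system would weigh more factors than volatility alone.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    name: str
    change_interval_days: int  # rough estimate of how often the facts change


def memory_layer(topic: Topic, training_age_days: int) -> str:
    """Pick 'parametric' if the facts likely predate any change, else 'retrieval'."""
    if topic.change_interval_days > training_age_days:
        return "parametric"
    return "retrieval"


topics = [
    Topic("accounting standards", change_interval_days=3650),
    Topic("latest earnings report", change_interval_days=1),
]

for topic in topics:
    print(topic.name, "->", memory_layer(topic, training_age_days=365))
# accounting standards -> parametric
# latest earnings report -> retrieval
```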
Beyond vector databases and weight-encoded knowledge, a third form of semantic memory has gained traction for agent systems requiring structured relational reasoning: knowledge graphs. Microsoft's GraphRAG system, open-sourced in 2024, demonstrated that for complex multi-hop reasoning tasks — questions requiring chains of relationships across many entities — graph-structured semantic memory significantly outperforms flat vector retrieval.
In Microsoft's 2024 GraphRAG paper, which evaluated the approach on podcast-transcript and news-article corpora, answers built from graph-structured retrieval were judged more comprehensive and more diverse than those from a standard vector RAG baseline on questions spanning many entities across a corpus. The advantage was attributed to the graph structure making entity relationships explicit rather than requiring the model to infer them from semantic similarity of text chunks.
The key insight from knowledge graph research is that semantic memory is not just about storing facts — it is about storing the relationships between facts in a form that supports reasoning. A vector database stores "what" — documents about a topic. A knowledge graph stores "what and how it connects" — entities and typed relationships that the agent can traverse. For agent tasks that require multi-hop relational reasoning, the structure of semantic memory is as important as its content.
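The difference between storing "what" and storing "what and how it connects" shows up as soon as a question needs more than one hop. The sketch below runs a breadth-first search over a tiny typed graph; the entities, relations, and the `find_path` helper are invented for illustration and are not drawn from GraphRAG.

```python
from collections import deque

# entity -> list of (relation, neighboring entity); all entities are fictional
Graph = dict[str, list[tuple[str, str]]]


def find_path(graph: Graph, start: str, goal: str, max_hops: int = 3):
    """Breadth-first search returning the (subject, relation, object) triples
    that connect start to goal, or None if no path exists within max_hops."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


graph: Graph = {
    "Acme Corp": [("acquired", "Widget Inc")],
    "Widget Inc": [("supplies", "Globex")],
    "Globex": [("headquartered_in", "Berlin")],
}

for triple in find_path(graph, "Acme Corp", "Berlin"):
    print(triple)
# ('Acme Corp', 'acquired', 'Widget Inc')
# ('Widget Inc', 'supplies', 'Globex')
# ('Globex', 'headquartered_in', 'Berlin')
```

A flat vector search for "Acme Corp Berlin" would have no chunk that mentions both entities, whereas the traversal recovers the chain of typed relationships that links them.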
Your AI tutor will open with a concrete design brief: you are building a legal research agent that must handle both stable constitutional law and rapidly evolving case law. You need to decide how to architect semantic memory across parametric and non-parametric components.
This lesson explores lesson 4 — examining the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to lesson 4.