Module 3 · Lesson 1

Multimodal AI: Seeing, Hearing, and Speaking

Language was only the beginning. AI systems are rapidly acquiring the full range of human senses.

What changes when AI can process images, audio, and video alongside text — all at once?

When OpenAI demonstrated GPT-4V — the vision-capable version of GPT-4 — one of the most-shared examples was a researcher photographing a hand-drawn circuit diagram and asking the model to identify the flaw. The model located the error correctly. No OCR. No preprocessing. Just a photo and a question.

That single demo signaled something: the wall between "language AI" and "computer vision" had quietly dissolved.

What Multimodal Actually Means

Early AI systems operated in silos. Image classifiers handled images. Speech recognition systems handled audio. Language models handled text. Each required its own training pipeline, its own inputs, its own outputs.

Multimodal AI collapses those silos. A single model receives inputs across multiple modalities — text, images, audio, video, documents — and reasons across all of them simultaneously. The model doesn't translate; it integrates.

The key technical enabler is the transformer architecture, which turns out to generalize well beyond text. Researchers discovered that images, audio spectrograms, and even video frames can all be tokenized — broken into discrete chunks — and fed into the same attention mechanism that powers language models.

Landmark Example

In May 2024, OpenAI released GPT-4o ("o" for omni). Unlike previous multimodal systems that processed modalities sequentially, GPT-4o handled text, audio, and vision in a unified model. It could respond to spoken questions with spoken answers in real time — with the ability to detect and respond to emotional tone in the speaker's voice. The demo showed it being interrupted mid-sentence and adjusting its response live.

The Modalities Now Available

As of 2024–2025, leading frontier models routinely handle:

Images → Reading charts, identifying objects, describing scenes, reading handwritten text, interpreting medical scans (with domain fine-tuning), analyzing satellite imagery.

Audio → Transcription, translation, speaker identification, tone detection, and now real-time spoken conversation with emotional responsiveness.

Video → Temporal reasoning across frames — understanding what happens over time, not just in a single image. Google's Gemini 1.5 Pro demonstrated processing a full-length film and answering detailed plot questions.

Documents → PDFs, spreadsheets, slides — treated as visual objects with embedded text, not just text files. Layout and structure become part of the input.

The Gemini 1.5 Pro Context Window

In February 2024, Google DeepMind released Gemini 1.5 Pro with a 1-million-token context window — large enough to hold approximately 11 hours of audio, 30,000 lines of code, or an entire feature film. In testing, the model correctly located a specific scene in a 402-page transcript it had never been trained on, retrieving it from the middle of the document with high accuracy.

This matters for multimodal work specifically because it means a model can hold an entire document, an entire meeting recording, and a follow-up question all in the same context simultaneously — without needing to chunk and retrieve.

Near-Horizon Implication

The practical shift is already underway: AI stops being a text assistant and becomes a perception layer. Systems that can watch a manufacturing line and flag anomalies, listen to a customer call and generate a summary, or read a legal filing and cross-reference it against a database of prior rulings are not hypothetical. They are in active deployment or in well-documented pilots as of 2024.

What This Changes for You

Practically, multimodal capability means the inputs you can provide to AI systems have expanded dramatically. You no longer need to describe an image in text — you show it. You no longer need to transcribe a meeting — you upload the audio. The interface is becoming more natural, and the information density you can share with a model in a single exchange has increased by an order of magnitude.

The limiting factor is no longer modality support. It is context window size, latency, and cost per token — all of which are on declining curves as of 2024.

Lesson 1 Quiz

Multimodal AI: Seeing, Hearing, and Speaking

1. What architectural feature allowed language model techniques to extend naturally to images and audio?

Correct. Transformers generalize across modalities because any input — text, image patches, audio spectrograms — can be tokenized and processed by the same attention mechanism.

Not quite. The key insight was that the transformer architecture, originally built for text, generalizes to any tokenized input, including images and audio.

2. What was the significance of GPT-4o's release in May 2024, compared to earlier multimodal systems?

Correct. GPT-4o ("omni") was notable for integrating all three modalities in one model, enabling real-time spoken conversation with emotional-tone awareness rather than sequential processing pipelines.

Not quite. GPT-4o's significance was that it unified all modalities — text, audio, vision — in a single model, enabling real-time responses without sequential pipelines.

3. Gemini 1.5 Pro's 1-million-token context window is significant for multimodal AI primarily because it allows:

Correct. A million-token context means you can load hours of audio, tens of thousands of lines of code, or a full feature film alongside your question — without chunking and retrieval workarounds.

Not quite. The large context window primarily enables holding entire multimodal inputs — documents, audio recordings, full films — in context simultaneously without needing chunking strategies.

4. As of 2024, what are the main limiting factors on multimodal AI use — given that modality support is now broadly available?

Correct. The lesson notes that the modality barrier is largely solved — the remaining practical limits are context size, speed, and cost, and all three are improving.

Not quite. By 2024, modality support is broadly available. The remaining practical limits are context window size, latency, and cost per token — all of which are on declining curves.

Lab 1: Multimodal Scenario Mapping

Explore how multimodal capabilities change real workflows

Your Mission

You're going to think through how multimodal AI capabilities — vision, audio, video, document understanding — could change a specific workflow you know. The AI assistant will help you map the before/after and surface implications you might not have considered.

This lab is complete after 3 exchanges.

Start by telling the assistant: what industry or type of work are you thinking about? Or ask it to suggest a compelling multimodal use case and walk you through the implications.

Multimodal Scenario Lab

Lesson 1

Welcome to the Multimodal Scenario Lab. I'm here to help you map out how multimodal AI — systems that can see, hear, and read across formats simultaneously — could reshape a workflow you care about.

Tell me about a field or workflow you're curious about, and we'll explore what changes when AI gains full perceptual capability. Or I can suggest a compelling use case if you'd prefer to start from an example.

Module 3 · Lesson 2

Agentic AI: Systems That Act

From answering questions to completing tasks — AI is beginning to operate in the world, not just respond to it.

What changes when AI can browse the web, write and run code, and manage files — without a human hand-holding each step?

When OpenAI released plugins for ChatGPT in March 2023, and then the Code Interpreter in July, something structurally changed. The model was no longer just generating text to be read. It was writing code, executing it, observing the output, debugging, and repeating — all in a loop, inside a sandboxed environment, on behalf of the user. The user described a goal. The system took steps.

That loop — perceive, plan, act, observe, repeat — is the core of what researchers call an AI agent.

What an AI Agent Is

A language model by itself is reactive. It receives a prompt and generates a completion. It has no memory across sessions, no ability to take actions in the world, and no mechanism for iterating toward a goal.

An AI agent wraps a language model in an architecture that gives it all three of those capabilities:

Memory → Short-term (within the context window), long-term (retrieved from vector databases or files), and episodic (records of previous actions and their outcomes).

Tools → Web search, code execution, API calls, file read/write, browser control, calendar access, email — anything the agent can invoke via structured function calls.

Planning → The ability to decompose a high-level goal into sub-steps, execute them in sequence, check results, and revise the plan based on what it observes.

Documented Example — Devin, March 2024

Cognition AI's "Devin" was introduced in March 2024 as an autonomous software engineering agent. In public benchmarks (SWE-bench), it resolved 13.86% of real GitHub issues end-to-end — far below human-level but far above prior automated systems. More significantly, it used a browser, a terminal, and a code editor simultaneously — planning, coding, running tests, reading error messages, and debugging — without human intervention on individual steps.

The AutoGPT Moment

In March 2023, a developer named Toran Bruce Richards released AutoGPT on GitHub. It was an open-source framework that gave GPT-4 the ability to spawn sub-agents, search the web, write files, and run code in pursuit of a user-defined goal. Within weeks it became one of the fastest-growing GitHub repositories in history, reaching 100,000 stars in under two weeks.

AutoGPT's actual performance on complex tasks was unreliable — it frequently looped, got stuck, or took wrong turns. But its explosive popularity demonstrated something important: there was massive demand for AI systems that could do, not just say. The architecture was right, even if execution needed work.

Multi-Agent Systems

A more recent development is multi-agent architectures, where multiple AI agents with specialized roles collaborate on a task. A researcher agent gathers information. A writer agent drafts output. A critic agent reviews it. An orchestrator agent routes tasks and manages the overall workflow.

Microsoft's AutoGen framework, released in late 2023, is one of the most widely adopted multi-agent frameworks, enabling developers to define conversation patterns between agents with different roles and tools. Andrew Ng's team at AI Fund has described multi-agent workflows as one of the most important near-term developments for AI productivity — not because individual agents are superhuman, but because parallel specialization mimics how human teams actually work.

The Real Near-Horizon Challenge: Reliability

Current AI agents are unreliable over long task horizons. Errors compound: a wrong assumption at step 3 cascades into failure by step 10. Agents also face security challenges — prompt injection attacks can hijack an agent's tool-use by embedding malicious instructions in web content the agent reads.

The research community broadly agrees that agentic AI's near-horizon progress will be measured not by capability expansion but by reliability improvement — the ability to complete multi-step tasks without human supervision at an error rate acceptable for real deployment.

Where This Is Headed

Anthropic's Claude 3.5 Sonnet (June 2024) introduced "computer use" capability in a research preview — the model could take over a computer's mouse and keyboard to complete tasks in a standard desktop environment. Google's Gemini team is building similar capabilities. The trajectory is toward agents that operate on entire computers, not just within sandboxed tool APIs.

Lesson 2 Quiz

Agentic AI: Systems That Act

5. What three capabilities distinguish an AI agent from a standard language model?

Correct. Agents add memory (across sessions and tasks), tools (web, code, APIs), and planning (decomposing goals into executable sub-steps) to a base language model.

Not quite. The lesson defines three architectural additions that make a language model agentic: memory, tools, and planning toward a multi-step goal.

6. Why was AutoGPT's viral success in March 2023 significant even though its task performance was unreliable?

Correct. AutoGPT's speed to 100,000 GitHub stars — despite unreliable execution — showed that the demand for action-oriented AI was massive, even before the technical reliability was there.

Not quite. AutoGPT's significance was demand-side: its explosive popularity showed that users wanted AI that could pursue goals autonomously, even if execution was imperfect at that stage.

7. What is the primary near-horizon challenge for agentic AI systems, according to the research community?

Correct. Current agents fail because errors compound over multiple steps. The near-horizon priority is reliability — getting error rates low enough for real deployment without constant human supervision.

Not quite. The lesson emphasizes that the primary challenge is reliability over long horizons — errors cascade from early steps, making current agents unsuitable for unsupervised long-task deployment.

8. What capability did Anthropic introduce in Claude 3.5 Sonnet's "computer use" research preview in 2024?

Correct. Computer use was a significant milestone — rather than sandboxed tool APIs, the model could operate a real desktop environment, representing a step toward agents that work on entire computers.

Not quite. Anthropic's computer use capability meant the model could control an actual computer — mouse, keyboard, screen — rather than being limited to predefined tool APIs.

Lab 2: Designing an AI Agent Workflow

Think through what a real agentic system would need to accomplish a goal

Your Mission

You're going to design the architecture of a hypothetical AI agent for a specific real-world task. The assistant will help you think through: what tools the agent needs, how it should plan sub-steps, where it is likely to fail, and what reliability safeguards you'd want to build in.

This lab is complete after 3 exchanges.

Start with a task you'd want an agent to accomplish — something like "research competitors and draft a briefing," "monitor a shared inbox and route messages," or your own idea.

Agentic Workflow Design Lab

Lesson 2

Welcome to the Agentic Workflow Design Lab. We're going to think through what a real AI agent architecture looks like for a task you care about.

For any agentic task, we need to figure out: what the goal is, what tools the agent needs, how it should decompose the task into steps, and where it's likely to fail or need human review.

What task would you like to design an agent for? It can be a professional workflow, a personal productivity use case, or something you've seen discussed in industry.

Module 3 · Lesson 3

Long-Context and Memory: AI That Remembers

The amnesia problem is being solved — and it reshapes what AI can do across sessions, projects, and relationships.

What becomes possible when AI remembers everything about you, your organization, and every conversation you've ever had with it?

For most of AI's language model era, the "context window" was the hard limit. Everything the model knew about you vanished when the conversation ended. Every new session started from zero. Users learned to paste previous context back in, or to keep notes they'd re-feed the model each time. It was a workaround for a fundamental architectural constraint.

That constraint is now under serious engineering attack on multiple fronts — and the solutions being deployed look different from each other in important ways.

The Three Approaches to Memory

Researchers and product teams are pursuing memory in three distinct directions, each with different trade-offs:

Longer Contexts → Simply extending how much the model can read in one pass. GPT-4 launched with 8K tokens. By late 2024, Claude 3.5 Sonnet supports 200K tokens, Gemini 1.5 Pro supports 1 million. A million tokens is approximately 750,000 words — a 25-novel library.

Retrieval-Augmented Generation → Storing information outside the context window in a vector database, then retrieving relevant chunks on demand. The model "searches its own memory" before each response. This scales to essentially unlimited information but trades off coherence for breadth.

Persistent Memory Systems → Explicitly storing summaries, facts, and user preferences across sessions in structured form. OpenAI introduced persistent memory for ChatGPT in February 2024, letting the system remember user preferences, projects, and relationships across all conversations.

OpenAI Memory Feature — February 2024

When OpenAI launched persistent memory in ChatGPT in early 2024, it allowed the model to remember facts users told it — their profession, their family structure, their writing style preferences, ongoing projects — and surface them in future conversations. Users could view, edit, or delete stored memories. The feature launched to significant discussion about privacy, but also to widespread reports of meaningfully better conversation quality for regular users.

The Needle-in-a-Haystack Problem

Extending context window size doesn't automatically mean the model uses that context well. Early evaluations of large-context models revealed a counter-intuitive failure mode: models could recall information at the beginning and end of a long document reliably, but would "forget" details buried in the middle — even when those details were explicitly within their stated context limit.

A 2023 Stanford study by Nelson Liu et al. formalized this as the "lost in the middle" problem. Subsequent model generations — particularly Gemini 1.5 Pro and Claude 3 — made targeted improvements to this failure mode, with Gemini 1.5 demonstrating near-perfect recall of a single sentence inserted at a random position within a 1-million-token document.

What Organizational Memory Could Mean

The near-horizon implication that gets least attention in public discourse is organizational memory. Individuals' use of AI is already changing — but the more structural change is enterprise-level: AI systems that hold the entire history of a company's decisions, communications, code, documents, and institutional knowledge, and can reason across all of it.

Microsoft's Copilot for Microsoft 365 moves in this direction, indexing organizational email, Teams conversations, documents, and meetings and making them queryable. The vision is an AI that knows your organization the way a 20-year employee does — but can surface specific information in seconds.

The Privacy and Power Tension

Systems that remember everything are systems where data control becomes existential. Who owns an AI's memory of your conversations? What happens to that memory if your subscription lapses? Can it be subpoenaed? Can it be used to train future models? These questions are not yet resolved in law, contract, or social norm — and they will be among the most contested issues of the near-horizon AI era.

Practical Near-Horizon State

By the end of 2024, the memory landscape looks like this: long-context models are production-ready and widely deployed; RAG systems are mature enterprise infrastructure; persistent personal memory is live in consumer products but still primitive; and full organizational memory systems are in early deployment with significant adoption friction. The next two to three years will likely see all four categories mature substantially.

Lesson 3 Quiz

Long-Context and Memory: AI That Remembers

9. What are the three distinct approaches to solving AI's memory limitation, as described in the lesson?

Correct. The three approaches are extending context windows (more in-context), RAG (retrieval from external stores), and persistent memory (structured storage across sessions).

Not quite. The lesson identifies three architectural approaches: extending context windows, retrieval-augmented generation (RAG), and persistent cross-session memory systems.

10. What is the "lost in the middle" problem identified by Nelson Liu et al. at Stanford?

Correct. Liu et al. found a U-shaped recall curve: models perform well on content near the beginning and end of their context window but degrade significantly for information positioned in the middle.

Not quite. "Lost in the middle" describes a failure mode where models recall content near the start and end of a long context well, but information buried in the middle degrades — even when it's within the stated context limit.

11. What key distinction does the lesson draw between RAG (retrieval-augmented generation) and longer context windows?

Correct. RAG can scale to arbitrarily large knowledge stores by retrieving on demand, but it only surfaces chunks at a time, which can fragment coherent reasoning. Longer contexts hold everything at once but have hard limits.

Not quite. The trade-off is scale vs. coherence: RAG scales to unlimited knowledge by retrieving chunks on demand, while longer context windows hold everything simultaneously but have a hard size ceiling.

12. Which of the following correctly describes the state of AI memory systems as of late 2024?

Correct. The lesson explicitly maps each approach to its maturity level: long-context is production-ready, RAG is mature, personal memory is live-but-primitive, and organizational memory is in early deployment.

Not quite. The lesson provides a specific maturity map: long-context models are production-ready, RAG is mature enterprise infrastructure, personal memory is live but primitive, and organizational memory is in early deployment with adoption friction.

Lab 3: Memory Architecture Advisor

Choose the right memory approach for a real use case

Your Mission

You're going to work through a memory architecture decision for a specific AI use case. The assistant will help you evaluate which memory approach — long context, RAG, persistent memory, or a hybrid — is most appropriate, and why.

This lab is complete after 3 exchanges.

Describe a use case where an AI system needs to "remember" something: a customer service bot, a personal assistant, a code review tool, an enterprise knowledge system — whatever interests you. The assistant will help you think through the architecture.

Memory Architecture Lab

Lesson 3

Welcome to the Memory Architecture Lab. We're going to reason through which memory approach — long context windows, retrieval-augmented generation, persistent cross-session memory, or some combination — best fits a specific AI use case.

Each approach has different trade-offs in scale, coherence, cost, and privacy. The right choice depends on what the system needs to remember, how often it changes, how large it is, and how quickly it needs to be accessed.

What use case are you working with? Describe what the AI needs to remember and I'll help you think through the architecture.

Module 3 · Lesson 4

Reasoning Models and Scientific AI

A new class of AI systems is emerging — ones that pause to think before answering, and ones trained to accelerate scientific discovery itself.

What changes when AI doesn't just retrieve knowledge but actively reasons toward answers — and begins generating new scientific knowledge?

When OpenAI released o1 in September 2024, the benchmark scores looked like the usual AI announcement: impressive numbers, cautious excitement. But the underlying mechanism was different from every prior model. o1 didn't just predict the next token. It spent time — sometimes many seconds, sometimes minutes — working through a problem step by step before producing its final response. It was, in effect, thinking.

And the competition math scores reflected it: o1 scored in the 89th percentile on the American Mathematics Competition (AMC), a feat GPT-4 couldn't approach.

What Reasoning Models Actually Do

Standard large language models are trained to predict the most likely next token. This works remarkably well for many tasks — summarization, translation, code generation — but it fails on tasks that require chains of logical steps, because each step needs to be conditioned on whether the previous step was actually correct.

Reasoning models address this through a technique called chain-of-thought with reinforcement learning. The model is trained to produce internal reasoning traces — effectively a scratchpad of working — and then graded on the correctness of its final answers. Over many training iterations, it learns reasoning strategies that work, not just token sequences that look plausible.

OpenAI o1 → Released September 2024. Scored 83% on AIME (vs. GPT-4's 13%), 89th percentile on AMC, and reached the 49th percentile on Codeforces competitive programming benchmarks. Used "thinking time" as an explicit compute investment.

DeepSeek-R1 → Released January 2025 by a Chinese AI lab. Matched o1's performance on many reasoning benchmarks at a fraction of the reported training cost, and was released open-source. Caused significant market movement, with Nvidia stock dropping 17% on the day of release.

Google Gemini 2.0 Flash Thinking → Google's reasoning-mode model, introduced late 2024. Designed for fast reasoning with lower latency than o1, targeting developer use cases requiring step-by-step logical inference.

Why This Is a Category Change

Prior AI progress on benchmarks was often attributed to scale — more parameters, more data. Reasoning models represent a different lever: compute at inference time. You can make a model "smarter" not by retraining it but by giving it more time to think about a specific problem. This changes the economics and the design space of AI deployment significantly.

AlphaFold and the Scientific AI Frontier

The second major thread in this lesson is AI for scientific discovery — a domain where AI has already achieved demonstrably superhuman performance in at least one major area.

DeepMind's AlphaFold 2, released in 2020 and published in Nature in 2021, solved the protein structure prediction problem — a challenge that had stumped structural biologists for 50 years. Given an amino acid sequence, AlphaFold could predict the 3D folded structure of the resulting protein with accuracy matching experimental methods. By 2022, AlphaFold had predicted the structure of over 200 million proteins — essentially every protein known to science — and deposited them in a publicly accessible database.

The Nobel Prize Committee recognized AlphaFold's impact in 2024, awarding the Chemistry Nobel to Demis Hassabis and John Jumper of DeepMind, alongside David Baker whose lab had developed independent AI-based protein design methods.

What Comes After AlphaFold

AlphaFold solved structure prediction. The next AI frontier in biology is design: creating novel proteins with specified functions — enzymes that break down plastics, antibodies that neutralize specific pathogens, drugs that bind precise molecular targets. DeepMind's AlphaProteo (2024) and David Baker's RFdiffusion (2023) represent early progress on this harder problem.

Beyond biology, AI systems are accelerating materials science (Google DeepMind's GNoME identified 2.2 million new stable crystal structures in a 2023 paper), mathematics (DeepMind's FunSearch solved open combinatorics problems in 2023), and climate modeling (Google's GraphCast outperformed the best traditional numerical weather prediction models in 2023).

The Near-Horizon Synthesis

Reasoning models and scientific AI represent the convergence of two trajectories: AI that can reason carefully about hard problems, and AI that has access to structured scientific knowledge. When reasoning models are applied to drug discovery, materials design, or climate modeling — the combination may produce the most significant near-horizon impact of any AI capability discussed in this module. The timeline from AI-suggested hypothesis to clinical trial or deployed material is still long. But the first steps are already documented.

Lesson 4 Quiz

Reasoning Models and Scientific AI

13. What is the core technical mechanism that makes reasoning models like OpenAI o1 perform better on mathematical and logical tasks?

Correct. o1 and similar models use RL-trained chain-of-thought — producing a scratchpad of reasoning steps and being optimized on the correctness of final answers, learning strategies that actually work rather than plausible-looking token sequences.

Not quite. Reasoning models use chain-of-thought with reinforcement learning: the model generates internal reasoning traces and is graded on final answer correctness, learning reasoning strategies over many training iterations.

14. Why did DeepSeek-R1's release in January 2025 cause significant market disruption despite matching o1's performance?

Correct. DeepSeek-R1 challenged the assumption that frontier reasoning performance requires massive compute investment — it was reportedly trained at a fraction of the cost of o1, and its open-source release created immediate competitive pressure.

Not quite. DeepSeek-R1 was disruptive because it achieved comparable reasoning benchmark performance to o1 at a fraction of the training cost, and was released open-source — challenging assumptions about the compute required for frontier AI.

15. What structural biology problem did DeepMind's AlphaFold 2 solve, and how was its impact recognized?

Correct. AlphaFold 2 solved the 50-year-old protein folding problem, predicted over 200 million protein structures, and Demis Hassabis and John Jumper were awarded the 2024 Nobel Prize in Chemistry in recognition.

Not quite. AlphaFold 2 solved protein structure prediction — given an amino acid sequence, predict the 3D folded structure — a 50-year-old problem. Hassabis and Jumper received the 2024 Nobel Prize in Chemistry for this work.

Lab 4: Reasoning and Scientific AI Exploration

Probe the boundaries of what reasoning models and scientific AI can and cannot do

Your Mission

You're going to explore the practical capabilities and limits of reasoning models and scientific AI. The assistant can help you understand how reasoning models approach a specific problem type, what scientific domains AI is making the most progress in, or how to evaluate whether a task is "reasoning-hard" vs. "retrieval-easy."

This lab is complete after 3 exchanges.

Ask about a specific reasoning challenge you're curious about, a scientific domain where AI progress interests you, or have the assistant walk you through why a task is easier or harder for a reasoning model vs. a standard LLM.

Reasoning & Scientific AI Lab

Lesson 4

Welcome to the Reasoning and Scientific AI Lab. We're going to dig into what reasoning models actually do differently, and where AI is making real scientific progress — versus where the hype outpaces the reality.

Some productive directions we can take this:
• I can walk you through how a reasoning model approaches a specific type of problem (math, logic, code debugging, legal analysis) and why it does better than a standard LLM
• We can explore a scientific domain — biology, materials science, climate — and map where AI is genuinely ahead
• We can work through how to evaluate whether your own tasks would benefit from a reasoning model vs. a faster standard model

What would you like to explore?

Module 3 Test

Capabilities on the Near Horizon — 15 questions · Pass at 80%

1. Which of the following best describes how multimodal AI processes different input types?

Correct. The transformer architecture generalizes: any input type can be tokenized and processed through the same attention mechanism in a single model.

Not quite. Modern multimodal models use a single transformer that tokenizes all input types — they don't require separate specialist models.

2. GPT-4o ("omni"), released May 2024, was notable primarily for:

Correct. GPT-4o integrated all three modalities in one model with real-time audio, including emotional-tone awareness — a step beyond sequential pipeline approaches.

Not quite. GPT-4o was notable for its unified multimodal architecture: text, audio, and vision in one model, enabling real-time emotionally-aware spoken conversation.

3. Gemini 1.5 Pro's 1-million-token context window can hold approximately:

Correct. Gemini 1.5 Pro's million-token context was demonstrated to hold 11 hours of audio, a feature film, or 30,000 lines of code — enabling whole-document reasoning without chunking.

Not quite. One million tokens represents approximately 11 hours of audio, 30,000 lines of code, or an entire feature film — enabling reasoning across truly large multimodal inputs in one context.

4. The "perceive, plan, act, observe, repeat" loop describes:

Correct. This loop — perceive state, plan next action, act, observe result, repeat — is the fundamental cycle of an AI agent operating toward a goal across multiple steps.

Not quite. The perceive-plan-act-observe-repeat loop is the core architecture of an AI agent — what distinguishes agents from single-turn language models.

5. Cognition AI's "Devin," introduced in March 2024, demonstrated which agentic capability?

Correct. Devin resolved 13.86% of SWE-bench issues end-to-end — far below human level but significant for autonomous multi-step software engineering, using real tools without step-by-step human guidance.

Not quite. Devin resolved 13.86% of SWE-bench issues autonomously — significant for its use of browser, terminal, and code editor together, not for surpassing humans.

6. What is the primary security vulnerability specific to agentic AI systems described in the lesson?

Correct. Prompt injection is a specific risk for agents: web content the agent reads as part of a task can contain embedded instructions that redirect the agent's actions — bypassing the original user's intent.

Not quite. Prompt injection is the key agentic security risk: content the agent reads during a task (web pages, emails, documents) can contain instructions that hijack the agent's behavior.

7. Andrew Ng's framing of multi-agent workflows draws an analogy to:

Correct. Ng's argument is that multi-agent systems mimic human team structures — a researcher, a writer, a critic — and the productivity gains come from parallel specialization, not superhuman individual agents.

Not quite. Ng's analogy is to human team structure: just as teams use parallel specialization (researcher, writer, reviewer), multi-agent AI achieves more than a single general agent by assigning roles.

8. In the context of AI memory, what trade-off does RAG (retrieval-augmented generation) make compared to long context windows?

Correct. RAG can reference unlimited information via retrieval, but it only surfaces relevant chunks at a time — which can fragment reasoning across a large knowledge base. Long contexts hold everything at once but have a hard size limit.

Not quite. RAG trades coherence for scale: it can access unlimited stored information, but only retrieves chunks at a time, while long context windows hold everything in one pass up to their size limit.

9. The "lost in the middle" problem, documented by Nelson Liu et al. at Stanford, describes what failure mode?

Correct. Liu et al. found a U-shaped recall curve — reliable at the edges, degraded in the middle of the context — even when the context was within the model's stated limit.

Not quite. "Lost in the middle" is a positional failure: models handle content at context edges well but lose track of information buried in the middle, even within their stated context window.

10. OpenAI's persistent memory feature for ChatGPT, launched February 2024, allows the system to:

Correct. OpenAI's memory feature stores user-specific facts and preferences across sessions, with transparency and user control over what is remembered.

Not quite. The feature stores structured facts about users (preferences, projects, relationships) across conversations, with user visibility and control over the stored memory.

11. What makes reasoning models like o1 a fundamentally different lever for AI improvement compared to prior scaling?

Correct. Prior scaling invested compute in training. Reasoning models invest compute at inference time — you make the model think harder about a specific problem without retraining it.

Not quite. The key innovation is inference-time compute: reasoning models spend more computation on each query (thinking through the problem) rather than relying solely on larger training runs.

12. What was OpenAI o1's score on the American Mathematics Competition (AMC) — and how did it compare to GPT-4?

Correct. o1 scored in the 89th percentile on AMC — a dramatic improvement over GPT-4, which could not approach that level, demonstrating the reasoning model's step-change on mathematical reasoning.

Not quite. The lesson states o1 scored in the 89th percentile on AMC — a level GPT-4 couldn't approach — illustrating the step-change that inference-time reasoning provides on math benchmarks.

13. What recognition did AlphaFold's creators receive in 2024, and why?

Correct. Demis Hassabis and John Jumper received the 2024 Chemistry Nobel for AlphaFold's solution to the protein folding problem — one of the most significant experimental biology challenges of the past half-century.

Not quite. Hassabis and Jumper received the 2024 Nobel Prize in Chemistry specifically for solving protein structure prediction — predicting 3D structure from amino acid sequence, a challenge that had stumped biology for 50 years.

14. Google DeepMind's GNoME paper (2023) demonstrated AI progress in which scientific domain?

Correct. GNoME identified 2.2 million new stable crystal structures in a single 2023 paper — a scale of materials discovery that would take decades by conventional experimental methods.

Not quite. GNoME's contribution was in materials science: identifying 2.2 million new stable crystal structures, dramatically expanding the known catalog of potentially useful materials.

15. Which of the following best captures the "near-horizon synthesis" of reasoning models and scientific AI, as described in Lesson 4?

Correct. The lesson's synthesis is that reasoning + scientific knowledge represents a powerful combination, but cautions that the path from AI-generated hypothesis to real-world impact (clinical trial, deployed material) is still measured in years.

Not quite. The lesson argues that reasoning models applied to scientific domains represent perhaps the most significant near-horizon combination — but carefully notes that deployment timelines remain long even when the AI science is strong.