When OpenAI demonstrated GPT-4V, the vision-capable version of GPT-4, one of the most-shared examples was a researcher photographing a hand-drawn circuit diagram and asking the model to identify the flaw. The model located the error correctly. No OCR. No preprocessing. Just a photo and a question.
That single demo signaled something: the wall between "language AI" and "computer vision" had quietly dissolved.
Early AI systems operated in silos. Image classifiers handled images. Speech recognition systems handled audio. Language models handled text. Each required its own training pipeline, its own inputs, its own outputs.
Multimodal AI collapses those silos. A single model receives inputs across multiple modalities (text, images, audio, video, documents) and reasons across all of them simultaneously. The model doesn't translate; it integrates.
The key technical enabler is the transformer architecture, which turns out to generalize well beyond text. Researchers discovered that images, audio spectrograms, and even video frames can all be tokenized (broken into discrete chunks) and fed into the same attention mechanism that powers language models.
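As a rough illustration of what "tokenizing" an image can mean, the sketch below splits a small grayscale image into fixed-size patches and flattens each patch into one vector. This is a simplified version of the patch-based approach used by vision transformers; the function name `patchify` and the toy image are assumptions for illustration, and a real model would also pass each patch through a learned projection before attention.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles,
    flattening each tile into one vector (one "token")."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Read the tile row by row so pixel order is preserved.
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            tokens.append(tile)
    return tokens


# A 4x4 toy image becomes four 2x2 patches, each a 4-value vector.
toy = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = patchify(toy, patch=2)
print(len(tokens), tokens[0])
```

Once every modality is reduced to a sequence of vectors like this, the same attention layers can process them regardless of where they came from.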
In May 2024, OpenAI released GPT-4o ("o" for omni). Unlike previous multimodal systems that processed modalities sequentially, GPT-4o handled text, audio, and vision in a unified model. It could respond to spoken questions with spoken answers in real time, and it could detect and respond to emotional tone in the speaker's voice. The demo showed it being interrupted mid-sentence and adjusting its response live.
As of 2024-2025, leading frontier models routinely handle text, images, audio, video, and documents.
In February 2024, Google DeepMind released Gemini 1.5 Pro with a 1-million-token context window, large enough to hold approximately 11 hours of audio, more than 30,000 lines of code, or about an hour of video. In Google's demonstration, the model located specific moments in the 402-page Apollo 11 mission transcript, retrieving them from deep within the document with high accuracy.
This matters for multimodal work specifically because it means a model can hold an entire document, an entire meeting recording, and a follow-up question all in the same context simultaneously, without needing to chunk and retrieve.
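To make "chunk and retrieve" concrete, here is a minimal sketch of the workaround that a long context window can replace. It assumes naive fixed-size word chunks and a word-overlap score; the helper names (`chunk_words`, `overlap_score`, `retrieve`) and the sample text are hypothetical, and production systems typically use embedding-based search instead.

```python
from typing import List


def chunk_words(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(query: str, chunk: str) -> int:
    """Count distinct query words that also appear in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks with the highest word overlap with the query."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:k]


document = (
    "The meeting opened with budget updates. "
    "Later the team agreed to ship the audio feature in March. "
    "The session closed with hiring plans."
)
chunks = chunk_words(document, size=8)
top = retrieve("when will the audio feature ship", chunks, k=1)
print(top)
```

In this toy run the best-matching chunk ends at "in", and the word "March" lands in the next chunk. That kind of boundary loss is one reason holding the whole document in context can be attractive.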
The practical shift is already underway: AI stops being a text assistant and becomes a perception layer. Systems that can watch a manufacturing line and flag anomalies, listen to a customer call and generate a summary, or read a legal filing and cross-reference it against a database of prior rulings are not hypothetical. They are in active deployment or in well-documented pilots as of 2024.
Practically, multimodal capability means the inputs you can provide to AI systems have expanded dramatically. You no longer need to describe an image in text; you show it. You no longer need to transcribe a meeting; you upload the audio. The interface is becoming more natural, and the amount of information you can share with a model in a single exchange has grown substantially.
The limiting factor is no longer modality support. It is context window size, latency, and cost per token, all of which are improving as of 2024: context windows are growing while latency and per-token cost are falling.
You're going to think through how multimodal AI capabilities (vision, audio, video, document understanding) could change a specific workflow you know. The AI assistant will help you map the before/after and surface implications you might not have considered.
This lab is complete after 3 exchanges.
When OpenAI released plugins for ChatGPT in March 2023, and then the Code Interpreter in July, something structurally changed. The model was no longer just generating text to be read. It was writing code, executing it, observing the output, debugging, and repeating, all in a loop, inside a sandboxed environment, on behalf of the user. The user described a goal. The system took steps.
That loop (perceive, plan, act, observe, repeat) is the core of what researchers call an AI agent.
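The perceive, plan, act, observe cycle can be sketched in a few lines. This is a toy under stated assumptions, not any product's implementation: `run_agent`, `toy_plan`, and `toy_tools` are hypothetical names, the "world" is a single integer, and the planner is a hand-written rule where a real agent would call a language model.

```python
from typing import Callable, Dict, List


def run_agent(goal: int,
              plan: Callable[[int, int], str],
              tools: Dict[str, Callable[[int], int]],
              state: int = 0,
              max_steps: int = 20) -> List[str]:
    """Loop until the observed state equals the goal or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        if state == goal:                        # perceive: compare world to goal
            break
        action = plan(state, goal)               # plan: choose the next tool
        state = tools[action](state)             # act: run the chosen tool
        history.append(f"{action} -> {state}")   # observe: record the result
    return history


# Hypothetical stand-ins: a real agent would call a language model in `plan`
# and real tools (search, code execution) in `tools`.
def toy_plan(state: int, goal: int) -> str:
    return "double" if state and state * 2 <= goal else "increment"


toy_tools = {"increment": lambda x: x + 1, "double": lambda x: x * 2}

trace = run_agent(goal=10, plan=toy_plan, tools=toy_tools)
print(trace)
```

The `max_steps` budget matters in practice: without it, an agent that never reaches its goal would loop indefinitely.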
A language model by itself is reactive. It receives a prompt and generates a completion. It has no memory across sessions, no ability to take actions in the world, and no mechanism for iterating toward a goal.
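An AI agent wraps a language model in an architecture that gives it all three of those capabilities: memory that persists across sessions, tools for taking actions in the world, and a control loop for iterating toward a goal.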
Cognition AI's "Devin" was introduced in March 2024 as an autonomous software engineering agent. On the public SWE-bench benchmark, it resolved 13.86% of real GitHub issues end-to-end, far below human-level but far above prior automated systems. More significantly, it used a browser, a terminal, and a code editor together (planning, coding, running tests, reading error messages, and debugging) without human intervention on individual steps.
In March 2023, a developer named Toran Bruce Richards released AutoGPT on GitHub. It was an open-source framework that gave GPT-4 the ability to spawn sub-agents, search the web, write files, and run code in pursuit of a user-defined goal. It quickly became one of the fastest-growing GitHub repositories in history, passing 100,000 stars within weeks of release.
AutoGPT's actual performance on complex tasks was unreliable: it frequently looped, got stuck, or took wrong turns. But its explosive popularity demonstrated something important: there was massive demand for AI systems that could do, not just say. The architecture was right, even if execution needed work.
A more recent development is multi-agent architectures, where multiple AI agents with specialized roles collaborate on a task. A researcher agent gathers information. A writer agent drafts output. A critic agent reviews it. An orchestrator agent routes tasks and manages the overall workflow.
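The division of labor described above can be sketched with plain functions standing in for agents. Everything here is hypothetical and simplified: each "agent" is a stub that returns a string, and `orchestrator` is an assumed name for the routing logic. A real system would back each role with a model call and its own tools.

```python
from typing import Dict, List


def researcher(topic: str) -> str:
    # Stand-in for an agent that would search and gather sources.
    return f"notes on {topic}"


def writer(notes: str) -> str:
    # Stand-in for an agent that drafts output from the notes.
    return f"draft based on {notes}"


def critic(draft: str) -> str:
    # Stand-in for a reviewing agent; approves only drafts grounded in notes.
    return "approve" if "notes on" in draft else "revise"


def orchestrator(topic: str, max_rounds: int = 3) -> Dict[str, object]:
    """Route the task through researcher -> writer -> critic, retrying on 'revise'."""
    log: List[str] = []
    draft = ""
    for round_number in range(1, max_rounds + 1):
        notes = researcher(topic)
        draft = writer(notes)
        verdict = critic(draft)
        log.append(f"round {round_number}: {verdict}")
        if verdict == "approve":
            break
    return {"draft": draft, "log": log}


result = orchestrator("multimodal models")
print(result["log"], result["draft"])
```

The orchestrator owns the retry budget and the stopping rule, which keeps each specialist simple.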
Microsoft's AutoGen framework, released in late 2023, is one of the most widely adopted multi-agent frameworks, enabling developers to define conversation patterns between agents with different roles and tools. Andrew Ng's team at AI Fund has described multi-agent workflows as one of the most important near-term developments for AI productivity, not because individual agents are superhuman, but because parallel specialization mimics how human teams actually work.
Current AI agents are unreliable over long task horizons. Errors compound: a wrong assumption at step 3 cascades into failure by step 10. Agents also face security challenges: prompt injection attacks can hijack an agent's tool use by embedding malicious instructions in web content the agent reads.
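A back-of-the-envelope calculation shows why long horizons are hard. The sketch assumes each step succeeds independently with the same probability, which is a simplification (real errors are correlated and can cascade), and the 95% figure is an illustrative assumption rather than a measured rate.

```python
def task_success_probability(step_success: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return step_success ** steps


for steps in (3, 10, 30):
    p = task_success_probability(0.95, steps)
    print(f"{steps:>2} steps at 95% per step -> {p:.1%} end-to-end")
```

Under these assumptions, a step that works 95% of the time yields roughly a 60% chance of finishing a 10-step task and roughly a 21% chance of finishing a 30-step one.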
The research community broadly agrees that agentic AI's near-horizon progress will be measured not by capability expansion but by reliability improvement: the ability to complete multi-step tasks without human supervision at an error rate acceptable for real deployment.
Anthropic's upgraded Claude 3.5 Sonnet (October 2024) introduced a "computer use" capability in public beta: the model could control a computer's mouse and keyboard to complete tasks in a standard desktop environment. Google's Gemini team is building similar capabilities. The trajectory is toward agents that operate on entire computers, not just within sandboxed tool APIs.
You're going to design the architecture of a hypothetical AI agent for a specific real-world task. The assistant will help you think through: what tools the agent needs, how it should plan sub-steps, where it is likely to fail, and what reliability safeguards you'd want to build in.
This lab is complete after 3 exchanges.
For most of AI's language model era, the "context window" was the hard limit. Everything the model knew about you vanished when the conversation ended. Every new session started from zero. Users learned to paste previous context back in, or to keep notes they'd re-feed the model each time. It was a workaround for a fundamental architectural constraint.
That constraint is now under serious engineering attack on multiple fronts, and the solutions being deployed look different from each other in important ways.
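Researchers and product teams are pursuing memory in three distinct directions, each with different trade-offs: longer context windows, retrieval-augmented generation (RAG), and persistent memory stored across sessions.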
When OpenAI launched persistent memory in ChatGPT in early 2024, it allowed the model to remember facts users told it (their profession, their family structure, their writing style preferences, ongoing projects) and surface them in future conversations. Users could view, edit, or delete stored memories. The feature launched to significant discussion about privacy, but also to widespread reports of meaningfully better conversation quality for regular users.
Extending context window size doesn't automatically mean the model uses that context well. Early evaluations of large-context models revealed a counter-intuitive failure mode: models could recall information at the beginning and end of a long document reliably, but would "forget" details buried in the middle, even when those details were explicitly within their stated context limit.
A 2023 Stanford study by Nelson Liu et al. formalized this as the "lost in the middle" problem. Subsequent model generations, particularly Gemini 1.5 Pro and Claude 3, made targeted improvements on this failure mode, with Gemini 1.5 demonstrating near-perfect recall of a single sentence inserted at a random position within a 1-million-token document.
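The single-sentence recall test described above is often called a "needle in a haystack" evaluation, and its shape is easy to sketch. The harness below is a minimal illustration with assumed names (`build_haystack`, `recall_by_position`); the `edge_only_model` stub is hypothetical and merely imitates a model that ignores the middle of its input, so the scores it produces are not real model results.

```python
from typing import Callable, Dict, List


def build_haystack(filler: str, needle: str, total: int, position: int) -> str:
    """Return `total` filler sentences with the needle inserted at `position`."""
    sentences = [filler] * total
    sentences.insert(position, needle)
    return " ".join(sentences)


def recall_by_position(ask_model: Callable[[str, str], str],
                       needle: str, answer: str, question: str,
                       total: int, positions: List[int]) -> Dict[int, bool]:
    """For each insertion position, check whether the reply contains the answer."""
    results = {}
    for position in positions:
        context = build_haystack("The sky was grey that day.", needle, total, position)
        reply = ask_model(context, question)
        results[position] = answer in reply
    return results


# Hypothetical stub that only "reads" the first and last 200 characters,
# imitating a model that loses the middle. Replace with a real model call.
def edge_only_model(context: str, question: str) -> str:
    return context[:200] + context[-200:]


scores = recall_by_position(
    edge_only_model,
    needle="The access code is 7431.",
    answer="7431",
    question="What is the access code?",
    total=100,
    positions=[0, 50, 100],
)
print(scores)
```

Sweeping `positions` and `total` across a grid is what produces the recall-by-depth charts commonly shown for long-context models.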
The near-horizon implication that gets least attention in public discourse is organizational memory. Individuals' use of AI is already changing, but the more structural change is at the enterprise level: AI systems that hold the entire history of a company's decisions, communications, code, documents, and institutional knowledge, and can reason across all of it.
Microsoft's Copilot for Microsoft 365 moves in this direction, indexing organizational email, Teams conversations, documents, and meetings and making them queryable. The vision is an AI that knows your organization the way a 20-year employee does, but can surface specific information in seconds.
Systems that remember everything are systems where data control becomes existential. Who owns an AI's memory of your conversations? What happens to that memory if your subscription lapses? Can it be subpoenaed? Can it be used to train future models? These questions are not yet resolved in law, contract, or social norm, and they will be among the most contested issues of the near-horizon AI era.
By the end of 2024, the memory landscape looks like this: long-context models are production-ready and widely deployed; RAG systems are mature enterprise infrastructure; persistent personal memory is live in consumer products but still primitive; and full organizational memory systems are in early deployment with significant adoption friction. The next two to three years will likely see all four categories mature substantially.
You're going to work through a memory architecture decision for a specific AI use case. The assistant will help you evaluate which memory approach (long context, RAG, persistent memory, or a hybrid) is most appropriate, and why.
This lab is complete after 3 exchanges.
When OpenAI released o1 in September 2024, the benchmark scores looked like the usual AI announcement: impressive numbers, cautious excitement. But the underlying mechanism was different from every prior model. o1 didn't just predict the next token. It spent time (sometimes many seconds, sometimes minutes) working through a problem step by step before producing its final response. It was, in effect, thinking.
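And the competition scores reflected it: OpenAI reported that o1 ranked in the 89th percentile on Codeforces competitive programming problems and placed among the top 500 US students on the AIME, a qualifying exam for the USA Math Olympiad. These were results GPT-4 couldn't approach.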
Standard large language models are trained to predict the most likely next token. This works remarkably well for many tasks (summarization, translation, code generation) but it fails on tasks that require chains of logical steps, because each step needs to be conditioned on whether the previous step was actually correct.
Reasoning models address this through a technique called chain-of-thought with reinforcement learning. The model is trained to produce internal reasoning traces, effectively a scratchpad of working, and then graded on the correctness of its final answers. Over many training iterations, it learns reasoning strategies that work, not just token sequences that look plausible.
Prior AI progress on benchmarks was often attributed to scale β more parameters, more data. Reasoning models represent a different lever: compute at inference time. You can make a model "smarter" not by retraining it but by giving it more time to think about a specific problem. This changes the economics and the design space of AI deployment significantly.
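One simple way to see inference-time compute as a lever is majority voting over several sampled answers (often called self-consistency). This is a different and much simpler technique than whatever o1 does internally, and it is used here only to illustrate the trade-off. The `noisy_solver` stub, its 60% accuracy, and the target value 42 are all assumptions; no real model is called.

```python
import random
from collections import Counter
from typing import List


def majority_vote(answers: List[int]) -> int:
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def noisy_solver(rng: random.Random, truth: int = 42, accuracy: float = 0.6) -> int:
    """Hypothetical stand-in for one sampled model answer: right 60% of the
    time, otherwise one of four nearby wrong values."""
    if rng.random() < accuracy:
        return truth
    return truth + rng.choice([-2, -1, 1, 2])


def accuracy_with_samples(samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which voting over `samples` answers recovers the truth."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        answers = [noisy_solver(rng) for _ in range(samples)]
        correct += majority_vote(answers) == 42
    return correct / trials


single = accuracy_with_samples(1)
voted = accuracy_with_samples(15)
print(f"1 sample: {single:.2f}, 15 samples: {voted:.2f}")
```

The weights of the "model" never change between the two runs; only the number of samples per question does. That is the economic shift the paragraph above describes: accuracy can be bought per query, at the price of more compute and latency.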
The second major thread in this lesson is AI for scientific discovery, a domain where AI has already achieved demonstrably superhuman performance in at least one major area.
DeepMind's AlphaFold 2, released in 2020 and published in Nature in 2021, solved the protein structure prediction problem, a challenge that had stumped structural biologists for 50 years. Given an amino acid sequence, AlphaFold could predict the 3D folded structure of the resulting protein with accuracy matching experimental methods. By 2022, AlphaFold had predicted the structure of over 200 million proteins, essentially every protein known to science, and deposited them in a publicly accessible database.
The Nobel Committee recognized AlphaFold's impact in 2024, awarding the Chemistry Nobel to Demis Hassabis and John Jumper of DeepMind, alongside David Baker, whose lab had developed independent AI-based protein design methods.
AlphaFold solved structure prediction. The next AI frontier in biology is design: creating novel proteins with specified functions, such as enzymes that break down plastics, antibodies that neutralize specific pathogens, and drugs that bind precise molecular targets. DeepMind's AlphaProteo (2024) and David Baker's RFdiffusion (2023) represent early progress on this harder problem.
Beyond biology, AI systems are accelerating materials science (Google DeepMind's GNoME predicted 2.2 million new crystal structures, roughly 380,000 of them stable, in a 2023 paper), mathematics (DeepMind's FunSearch found new results on open combinatorics problems such as the cap set problem in 2023), and weather and climate modeling (Google's GraphCast outperformed the leading traditional numerical weather prediction system on most evaluated targets in 2023).
Reasoning models and scientific AI represent the convergence of two trajectories: AI that can reason carefully about hard problems, and AI that has access to structured scientific knowledge. When reasoning models are applied to drug discovery, materials design, or climate modeling, the combination may produce the most significant near-horizon impact of any AI capability discussed in this module. The timeline from AI-suggested hypothesis to clinical trial or deployed material is still long. But the first steps are already documented.
You're going to explore the practical capabilities and limits of reasoning models and scientific AI. The assistant can help you understand how reasoning models approach a specific problem type, what scientific domains AI is making the most progress in, or how to evaluate whether a task is "reasoning-hard" vs. "retrieval-easy."
This lab is complete after 3 exchanges.