Building AI Agents I · Introduction

From programs you run to entities you hire.

The boundary between tool and teammate is being redrawn in real time. This course starts at the line.

Software used to be a thing you launched and closed. You opened the spreadsheet, you opened the browser, you ran the script. The computer waited for your input, did one step, then waited again.

Agents break that. An agent is software you give a goal, not a command; it then decides which steps to take, which tools to call, and which other agents to ask for help. It might finish in three seconds or run quietly for three hours. It might succeed, or fail, or ask you a clarifying question, or realize halfway through that your real goal isn't what you said and come back to confirm.

This shift, from tools you use to agents you delegate to, is arguably the most disruptive change in how we work with computers since the graphical user interface. This first course in the Agents series covers where agents are being deployed today, what they're good at, what they break, and what decisions you need to make before putting one between you and your customer.

If you finish every module, here's who you become:

You'll know the precise difference between a tool, an assistant, and an agent — and why collapsing those terms causes real decisions to go wrong.
You'll be able to trace any agent's behavior back to the perception-reasoning-action-observation loop and explain why it did what it did.
You'll recognize which use cases — research, coding, customer service, data analysis — actually benefit from agency and which ones don't.
You'll apply a decision framework to a real scenario and argue, with specifics, whether an agent is the right solution or an expensive distraction.
You'll understand how ReAct, Plan-and-Execute, and multi-agent orchestration patterns differ and when each one is called for.
You'll be able to name the categories of agent failure — and spot the design choices that invite them before a system ships.
You're becoming someone who can sit in a room where 'AI agent' gets thrown around loosely and immediately sharpen the conversation.

🎯 Advanced · Lesson 1 of 4

Defining the Agent: Beyond the Chatbot

What formal computer-science and AI-research literature actually means by "agent" — and why the word is used so loosely in product marketing.

In April 2023, researchers at Stanford and Google released a paper titled "Generative Agents: Interactive Simulacra of Human Behavior." They ran 25 language-model instances inside a virtual town called Smallville. Each instance maintained a persistent memory stream, wrote plans each morning, and revised those plans when new information arrived, entirely without human prompting. Starting from a single seeded intention (one agent wanted to throw a Valentine's Day party), the agents spread invitations, made plans with one another, and coordinated to show up at the right time. No human typed a single instruction after that initial setup. The researchers used the word "agent" for these systems, which is a different thing from a chatbot that requires a human turn to produce every output.

A month earlier, in March 2023, OpenAI had shipped GPT-4 with a system-prompt field. Tech journalists were soon calling it an "AI agent." The two uses of the word described fundamentally different systems, and that gap is what this module resolves.

The Formal Definition: Perception, Cognition, Action

Stuart Russell and Peter Norvig's textbook Artificial Intelligence: A Modern Approach — the standard reference in university AI courses — defines an agent as "anything that perceives its environment through sensors and acts upon that environment through actuators." The definition has three non-negotiable parts.

First, the agent must perceive something external to itself. A static lookup table is not an agent; it has no sensors. A language model that reads the current date from a system clock, scans a live web page, or observes the output of its own previous action is perceiving its environment.

Second, the agent must reason or decide. Russell and Norvig describe a spectrum from simple reflex agents (if X then do Y) up through goal-based and utility-maximising agents. Modern LLM-based agents sit at the goal-based tier: they are given an objective, generate a plan, and attempt to execute it.

Third, the agent must act — produce outputs that change state in the world, not just text for a human to read. Running a shell command, committing code to a repository, sending an email, calling an API: these are actions. Returning a string to a chat window is borderline — it acts on a human, but not directly on an external system.

Key Distinction

The word "autonomous" is not in the formal definition — but persistence across time is implied. An agent that acts, observes the result, and then acts again is exhibiting the perception–action loop. A system that produces one response and stops is better described as a tool or a completion endpoint.
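
The three parts and the loop that ties them together can be sketched in a few lines of Python. This is a minimal illustration, not an LLM agent: the `Room` environment, the temperature numbers, and the function names are all assumptions chosen to keep the example small, and the decision rule is the simplest tier Russell and Norvig describe (a reflex rule).

```python
from dataclasses import dataclass


@dataclass
class Room:
    """Hypothetical environment: a room whose temperature the agent can change."""
    temperature: float


def perceive(room: Room) -> float:
    # Sensor: read the current state of the environment.
    return room.temperature


def decide(reading: float, target: float) -> str:
    # Simple reflex rule of the form "if X then do Y".
    if reading < target - 0.5:
        return "heat"
    if reading > target + 0.5:
        return "cool"
    return "idle"


def act(room: Room, action: str) -> None:
    # Actuator: change the state of the environment.
    if action == "heat":
        room.temperature += 1.0
    elif action == "cool":
        room.temperature -= 1.0


def run_agent(room: Room, target: float, max_steps: int = 20) -> list[str]:
    """Perceive, decide, act, then perceive again until nothing is left to do."""
    history = []
    for _ in range(max_steps):
        action = decide(perceive(room), target)
        history.append(action)
        if action == "idle":
            break
        act(room, action)
    return history
```

Calling `decide` once and stopping would make this a tool. What makes it an agent in the formal sense is `run_agent`: the system observes the result of its own action and chooses again.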

Why Product Marketing Muddies the Water

In November 2023, Microsoft made Microsoft 365 Copilot generally available to enterprise customers, describing it in press releases as an "AI agent embedded in Microsoft 365." At launch, Copilot summarised emails and drafted documents on request. It did not autonomously monitor an inbox, decide which emails needed follow-up, or send replies without a human click. By the formal definition it was an assistant with access to user data, not an agent. Microsoft used the word to signal ambition, not current capability.

By contrast, when Cognition AI shipped Devin in March 2024 — their software-engineering system — independent evaluators at Uplevel Data confirmed it could receive a GitHub issue, write code across multiple files, run tests, read the failure output, and iterate until tests passed, all without human intervention between steps. Devin was behaving as an agent by the formal definition: multi-step perception–action loops, persistent memory across a task, and real-world effects (committed code).

The distinction matters because it determines how you design, deploy, and audit a system. If you treat an agent as if it were an assistant, you will under-supervise it and leave its actions on external systems without the guardrails they need. If you treat an assistant as if it were an agent, you will add unnecessary guardrails and slow it down.

Tool: executes a single function when called; no memory, no goal, no loop.
Assistant: responds to human turns; memory optional; human drives the loop.
Agent: pursues a goal across multiple steps; runs the loop itself; acts on external systems.
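
These three definitions can be read as a small decision procedure. The sketch below is one hedged way to encode it in Python; `SystemTraits` and `classify` are names invented for illustration, and real products often sit between categories, so treat the output as a starting point for discussion rather than a verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemTraits:
    conversational: bool            # responds to human turns
    pursues_goal: bool              # given an objective rather than a single command
    runs_own_loop: bool             # takes the next step without waiting for a human
    acts_on_external_systems: bool  # changes state outside the conversation


def classify(traits: SystemTraits) -> str:
    """Map observed traits onto the tool / assistant / agent vocabulary."""
    if (traits.pursues_goal
            and traits.runs_own_loop
            and traits.acts_on_external_systems):
        return "agent"
    if traits.conversational:
        # A human still drives the loop, whatever else the system can do.
        return "assistant"
    return "tool"


# Illustrative inputs based on the descriptions in this lesson.
copilot_at_launch = SystemTraits(True, False, False, False)
devin = SystemTraits(True, True, True, True)
spell_check_function = SystemTraits(False, False, False, False)
```

Under these assumptions, `classify(copilot_at_launch)` returns `"assistant"` and `classify(devin)` returns `"agent"`, matching the lesson's reading of both products.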

Why This Module Exists

The AESOP AI Academy uses "agent" only in the formal sense throughout this course. When a real product blurs the line, we will say so explicitly — and explain which parts of that product are agentic and which are not.

🎯 Advanced · Lesson 1 Quiz

Quiz: Defining the Agent

3 questions — free, untracked, retake anytime.

1. According to Russell & Norvig's formal definition, which component is NOT explicitly required for a system to be called an agent?

✓ Correct. Russell & Norvig define agents by perception, decision, and action, not by the presence or absence of human oversight. Autonomy is a spectrum, not a binary gate.

Not quite. Full autonomy is not part of the formal definition. Russell & Norvig require perception, decision, and action — human oversight doesn't disqualify a system from being an agent.

2. What made the Stanford Smallville agents (2023) qualify as agents rather than chatbots, by the formal definition?

✓ Correct. The key agentic property was the perception–action loop running without a human turn required between steps, plus persistent memory that let them track and revise plans over time.

Not quite. The virtual environment and model version were implementation details. What qualified them as agents was the persistent memory and the multi-step loop operating without human prompting.

3. Microsoft Copilot at its November 2023 launch was more accurately described as which of the following?

✓ Correct. At launch, Copilot required a human turn to produce every output. It didn't autonomously monitor, plan, or act across steps, which places it in the "assistant" category despite Microsoft's marketing language.

Not quite. At launch Copilot was human-turn-driven. It drafted and summarised on request but did not run an autonomous multi-step loop — making it an assistant, not an agent by the formal definition.

🎯 Advanced · Lesson 1 Lab

Lab: Classify Real Systems

Apply the formal three-part definition to real AI products announced in 2023–2024.

Your Task

The AI below has been briefed on the formal agent definition (perceive → decide → act in a loop). Pick any real AI product from 2023–2024 — ChatGPT, Devin, AutoGPT, Gemini, Claude, Perplexity, GitHub Copilot, or another — and ask the AI to classify it. Challenge its reasoning. Push it to explain exactly which of the three components are or aren't present.

Try: "Is AutoGPT a true agent by the Russell & Norvig definition? Walk through all three components."

🤖 AESOP Agent Classifier Lesson 1 Lab

🎯 Advanced · Lesson 2 of 4

The Architecture of Agency: Memory, Tools, and Goals

How real agentic systems are built — the four components that transform a language model into something that can act in the world.

In March 2023, Toran Bruce Richards of Significant Gravitas released AutoGPT on GitHub. Within weeks it had passed 100,000 stars, making it one of the fastest-growing open-source repositories GitHub had seen to that point. AutoGPT accepted a high-level goal from the user ("Increase Twitter followers for @username by 20%"), then entered a loop: it used GPT-4 to generate a task list, executed tasks one at a time using web-search and file-write tools, stored results in a text file it called its "memory," and re-read that file at the start of each new reasoning step. It didn't always succeed, but its architecture was a direct implementation of the four-component agent design: profile (goal + persona), memory, tools, and action.

The Four-Component Framework

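Researchers at Renmin University published a comprehensive survey in 2023 titled "A Survey on Large Language Model based Autonomous Agents" (Wang et al., arXiv:2308.11432). They reviewed over 100 agent papers and proposed a unified framework in which virtually every system decomposes into four modules. The survey names those modules profiling, memory, planning, and action; this lesson adapts that breakdown, treating tools as a component in their own right.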

Profile is the agent's identity and objective. It is typically injected via a system prompt and defines what the agent is trying to accomplish and what role it occupies. Without a clear profile, the agent has no selection criterion for which actions to take when multiple options are available.

Memory encompasses how the agent retains information across steps. There are two types. In-context memory is everything currently in the model's context window — fast but limited. External memory is a database, file, or vector store that the agent can read and write — slower but unlimited. AutoGPT used a text file; production systems like Cognition's Devin use a persistent workspace.

Tools are the actuators — the interfaces through which the agent changes external state. Common tools include web search, code execution sandboxes, email APIs, calendar APIs, and file systems. A language model without tools can only produce text; with tools, it can produce real-world effects.

Action is the execution layer — how the agent translates a reasoning step into a concrete tool call, and how it handles the result. This includes error handling: what does the agent do when a tool call returns an error? A well-designed agent treats errors as new observations and revises its plan.
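
The four components can be pictured as plain data structures. The sketch below is an illustration under stated assumptions, not how AutoGPT or Devin is implemented: the class names, the five-entry context limit, and the single string-argument tool signature are all invented to keep the example short, and no language model is involved.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Profile:
    role: str
    goal: str

    def system_prompt(self) -> str:
        # The profile is typically injected as a system prompt.
        return f"You are {self.role}. Your goal: {self.goal}"


@dataclass
class Memory:
    in_context: list[str] = field(default_factory=list)     # fast but bounded
    external: dict[str, str] = field(default_factory=dict)  # stand-in for a file or database
    max_in_context: int = 5                                  # illustrative limit

    def remember(self, key: str, value: str) -> None:
        self.external[key] = value
        self.in_context.append(f"{key}: {value}")
        # Only the most recent entries stay in working memory.
        self.in_context = self.in_context[-self.max_in_context:]


@dataclass
class Agent:
    profile: Profile
    memory: Memory
    tools: dict[str, Callable[[str], str]]

    def act(self, tool_name: str, argument: str) -> str:
        """Action layer: run one tool call and treat errors as observations."""
        try:
            observation = self.tools[tool_name](argument)
        except Exception as exc:  # a failed call is new information, not a crash
            observation = f"error: {exc!r}"
        self.memory.remember(f"{tool_name}({argument})", observation)
        return observation
```

Note that an unknown tool name does not raise out of `act`; it comes back as an `error:` observation that a planning step could react to, which is the error-handling behaviour described above.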

Research Note

Wang et al. (2023) found that memory was the most frequently under-specified component in early agent papers. Systems often described tool use in detail but left memory architecture vague — which explains why early agents like AutoGPT frequently "forgot" previous steps and repeated work.
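
One way to see why explicit memory matters is to make an agent check what it already knows before repeating a step. The following is a hedged sketch: `FileMemory`, `search_routes`, the route string, and the canned search result are hypothetical, and a JSON file stands in for whatever store a production system would use.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class FileMemory:
    """External memory backed by a JSON file (a stand-in for a database or vector store)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.path.touch(exist_ok=True)

    def _load(self) -> dict:
        text = self.path.read_text()
        return json.loads(text) if text else {}

    def recall(self, key: str) -> Optional[str]:
        return self._load().get(key)

    def store(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        self.path.write_text(json.dumps(data))


def search_routes(route: str, memory: FileMemory, calls: list) -> str:
    """Hypothetical tool wrapper: consult memory before repeating a search."""
    cached = memory.recall(route)
    if cached is not None:
        return cached
    calls.append(route)                       # record that the search really ran
    result = f"3 flights found for {route}"   # placeholder for a real search
    memory.store(route, result)
    return result


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        memory = FileMemory(Path(tmp) / "memory.json")
        calls: list = []
        search_routes("YYZ-YVR", memory, calls)
        search_routes("YYZ-YVR", memory, calls)  # answered from memory
        return calls
```

In `demo`, the second request for the same route is served from the file, so the underlying search runs only once. An agent without this lookup has no way to know it is repeating itself once the earlier result has left its context window.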

How the Loop Actually Runs

In practice, an LLM-based agent runs what researchers call a ReAct loop — Reason + Act — first described by Yao et al. in a 2022 paper (arXiv:2210.03629). Each iteration follows the same structure: the model receives the current context (goal + memory + last observation), generates a thought (a reasoning trace), generates an action (a specific tool call with parameters), the tool executes, and the tool's output becomes the next observation, which is added to context. The loop repeats until the agent decides it has completed the goal or a step limit is reached.
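
The iteration structure described above can be sketched without a real language model. In the example below, `scripted_model` is a stand-in that returns hard-coded thoughts and actions, and `check_stock` is a hypothetical tool with canned data; only the shape of the loop (reason, act, observe, repeat until finished or out of steps) is the point.

```python
from typing import Callable

# Hypothetical tool registry; a real agent would wrap search, code execution, and so on.
TOOLS: dict[str, Callable[[str], str]] = {
    "check_stock": lambda item: {"widgets": "42 in stock"}.get(item, "not found"),
}


def scripted_model(context: list[str]) -> dict[str, str]:
    """Stand-in for an LLM call: returns a thought plus an action."""
    observations = [line for line in context if line.startswith("Observation: ")]
    if not observations:
        return {"thought": "I should check the inventory.",
                "action": "check_stock", "input": "widgets"}
    answer = observations[-1].removeprefix("Observation: ")
    return {"thought": "I have the answer.", "action": "finish", "input": answer}


def react_loop(goal: str, model=scripted_model, max_steps: int = 5):
    """Run reason -> act -> observe until the model finishes or the step limit hits."""
    context = [f"Goal: {goal}"]
    for _ in range(max_steps):
        step = model(context)                                  # reason
        context.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["input"], context
        context.append(f"Action: {step['action']}[{step['input']}]")
        observation = TOOLS[step["action"]](step["input"])     # act
        context.append(f"Observation: {observation}")          # observe
    return "step limit reached", context
```

The returned `context` list is the full trace of thoughts, actions, and observations, which is what you would inspect when asking why the agent did what it did. The `max_steps` argument is the step limit mentioned above; without it a confused model could loop indefinitely.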

OpenAI formalised a close variant of this in their Function Calling API (released June 2023): the model outputs a JSON object specifying which function to call and with which arguments, the application executes the function, and returns the result as a new message. This is architecturally a constrained ReAct loop — the "thought" is implicit in the model's internal reasoning, and the "action" is always a structured function call.
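
The application side of that exchange is small enough to sketch. The example below uses a simplified message shape that is an assumption for illustration, not the exact schema of any vendor's API; `get_weather` and its canned return value are hypothetical.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical application function; returns canned data for the sketch."""
    return {"city": city, "temperature": 21, "unit": unit}


# Registry of functions the application is willing to run on the model's behalf.
FUNCTIONS = {"get_weather": get_weather}


def handle_function_call(model_output: str) -> dict:
    """Execute the call the model requested and wrap the result as a new message."""
    call = json.loads(model_output)
    func = FUNCTIONS.get(call["name"])
    if func is None:
        content = {"error": f"unknown function {call['name']!r}"}
    else:
        content = func(**call["arguments"])
    return {"role": "function", "name": call["name"], "content": json.dumps(content)}


# Simplified example of what a model might emit.
example_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
```

The model never runs anything itself: it only names a function and supplies arguments, and the application decides whether and how to execute it. That registry lookup is a natural place to enforce which actions an agent is allowed to take.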

The context window is the agent's working memory — everything it can see at once.
Tool outputs are observations — the agent's sensors reporting back.
The thought trace is the agent's decision process — auditable in chain-of-thought output.
The action is the agent's effector — the only thing that changes external state.

Practical Implication

Understanding the loop architecture tells you where agents fail. The three most common failure modes are: context-window overflow (too much history to fit), tool-call hallucination (the model invents parameters that don't exist), and goal drift (the agent pursues a sub-goal so deeply it forgets the top-level objective). Each failure maps directly onto one of the four components.
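
Two of these failure modes have simple mechanical guards, sketched below under stated assumptions. `validate_call` checks a proposed tool call against the tool's real signature before it runs (a guard against invented parameters), and `trim_context` drops the oldest history while keeping the goal pinned (a crude guard against overflow that also keeps the top-level objective in view). `send_email` is a hypothetical tool, and counting context by list items rather than tokens is a simplification.

```python
import inspect


def send_email(to: str, subject: str, body: str) -> str:
    """Hypothetical tool used only to demonstrate validation."""
    return f"sent to {to}"


def validate_call(func, arguments: dict) -> list[str]:
    """Return a list of problems with a proposed tool call (empty means valid)."""
    params = inspect.signature(func).parameters
    problems = [f"unknown parameter: {name}" for name in arguments if name not in params]
    problems += [
        f"missing parameter: {name}"
        for name, param in params.items()
        if param.default is inspect.Parameter.empty and name not in arguments
    ]
    return problems


def trim_context(goal: str, history: list[str], max_items: int) -> list[str]:
    """Keep the goal pinned at the top and drop the oldest history first."""
    keep = max(max_items - 1, 0)
    recent = history[-keep:] if keep else []
    return [goal] + recent
```

Returning the problems as text, rather than raising, lets the loop feed them back to the model as an observation so it can correct the call on the next step.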

🎯 Advanced · Lesson 2 Quiz

Quiz: Architecture of Agency

3 questions — free, untracked, retake anytime.

1. In the Wang et al. (2023) four-component framework, which component was most frequently under-specified in early agent research papers?

✓ Correct. Wang et al. found memory was most often vague or under-specified, which directly caused early agents like AutoGPT to repeat steps they had already completed because they couldn't reliably retrieve prior results.

Not quite. Memory was the most under-specified component. Tool use was often described in detail, but researchers glossed over exactly how agents stored and retrieved information across steps.

2. What does the "Observe" step in a ReAct loop represent architecturally?

✓ Correct. The observation is the sensor report: the tool executed and returned data, which the agent reads as new information about the state of the environment. This closes the perception–action loop.

Not quite. Observation = tool return value added to context. It's the agent's sensors reporting back the real-world result of its last action. The reasoning trace is the "thought" step; planning is also part of "reason."

3. Which failure mode maps directly onto a context-window limitation?

✓ Correct. Context-window overflow is a memory failure: the in-context memory fills up and older observations get truncated, so the agent can't see what it already tried. This is why external memory stores are critical for long-running tasks.

Not quite. Context-window overflow causes the agent to lose earlier observations as history is truncated — a direct in-context memory failure. Tool hallucination and JSON parsing are separate failure modes in the action component.

🎯 Advanced · Lesson 2 Lab

Lab: Diagnose Agent Failures

Use the four-component framework to explain why a specific agent broke down.

Your Task

Below is an AI briefed on the Wang et al. four-component framework and the three common failure modes. Describe a scenario — real or hypothetical — where an agent fails, and ask it to diagnose which component broke down and why. Then push it to suggest a concrete architectural fix.

Try: "An agent tasked with booking a flight kept re-searching the same routes it already checked. Which component failed and how would you fix it?"

🤖 AESOP Architecture Analyst Lesson 2 Lab

🎯 Advanced · Lesson 3 of 4

Autonomy Levels and Human-in-the-Loop Design

Why autonomy is a spectrum, how real deployments choose their level, and the consequences of choosing wrong.

In February 2024, Air Canada lost a case before British Columbia's Civil Resolution Tribunal. The airline's chatbot had told a passenger, Jake Moffatt, that he could apply for a bereavement discount after travel and receive a retroactive refund. Air Canada argued that the chatbot was a "separate legal entity" and that the airline wasn't responsible for its outputs. The tribunal rejected this argument and ordered Air Canada to pay roughly $812 in damages, interest, and fees. The chatbot was an assistant operating without human review of individual responses. It had no tool that could verify current policy before replying, and no human-in-the-loop step before customer-facing output. The cost was not just the award: Air Canada's legal fees and reputational damage very likely exceeded the original $812 claim many times over.

The Autonomy Spectrum

Researchers at MIT's Computer Science and AI Lab and at Anthropic have both described autonomy in AI systems as a spectrum rather than a binary state. The spectrum used in this course has five levels, numbered 0 to 4, deliberately analogous to the SAE levels of driving automation.

Level 0 — Fully manual: The AI suggests; a human decides and acts. Spell-check is at this level. So is a language model generating a draft that a human rewrites before sending.

Level 1 — Assisted action: The AI acts on narrow, pre-approved tasks with no ambiguity. A rule-based autoresponder falls here. The action space is completely enumerated in advance.

Level 2 — Supervised autonomy: The AI acts, but a human reviews before real-world effect. GitHub Copilot suggesting a code block that an engineer must accept with a keypress is Level 2. Air Canada's chatbot should have been at Level 2 for policy-sensitive claims — it was not.

Level 3 — Conditional autonomy: The AI acts without human review for routine cases, but escalates to a human when it detects uncertainty or when stakes exceed a threshold. This requires the agent to have a well-calibrated confidence estimator and a reliable escalation path.

Level 4 — High autonomy: The AI acts across long multi-step tasks with minimal checkpoints. Devin operating on a contained development environment is close to Level 4. Failures are contained within the sandbox.

Design Principle

The appropriate autonomy level for a deployment is determined by two axes: the reversibility of actions (can the agent undo what it did?) and the blast radius of errors (how many people or systems are affected by a wrong action?). High reversibility and low blast radius permit higher autonomy levels. Irreversible actions with large blast radius require human-in-the-loop at Level 2 or below.
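
The two axes can be turned into a rough upper bound on autonomy. The function below is a sketch of one possible mapping, not a standard: the level names follow this lesson, but the `small_radius` threshold of 10 and the choice to assign mixed cases to Level 3 are assumptions you would tune for your own deployment.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    FULLY_MANUAL = 0
    ASSISTED = 1
    SUPERVISED = 2
    CONDITIONAL = 3
    HIGH = 4


def max_autonomy(reversible: bool, blast_radius: int, small_radius: int = 10) -> Autonomy:
    """Upper bound on autonomy for one kind of action.

    blast_radius counts the people or systems affected by a single wrong action.
    small_radius is an illustrative threshold, not an industry figure.
    """
    contained = blast_radius <= small_radius
    if reversible and contained:
        return Autonomy.HIGH
    if reversible or contained:
        # Assumption: mixed cases get escalation paths rather than full review.
        return Autonomy.CONDITIONAL
    # Irreversible with a large blast radius: human review before any effect.
    return Autonomy.SUPERVISED
```

For example, `max_autonomy(reversible=False, blast_radius=10_000)` returns `Autonomy.SUPERVISED`, which matches the principle that irreversible, wide-reaching actions need a human in the loop at Level 2 or below. The result is a ceiling; a team can always choose to run lower.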

How Production Teams Actually Choose

Anthropic published a model specification in 2024 describing how they train Claude to calibrate autonomy. Key passages describe "minimal footprint" as a default: the model should request only the permissions it needs, prefer reversible over irreversible actions, and err on the side of doing less and confirming with users when uncertain about intended scope. This is not timidity — it is a formal design choice about where on the autonomy spectrum the default should sit.

In contrast, Cognition AI's Devin was designed to operate at the high-autonomy end for a specific, sandboxed domain: software development. The key constraints that allowed this were domain specificity and environment isolation. Devin's actions (writing and running code) were contained inside a virtual machine. A bug in Devin's code couldn't directly affect a production system without a human deploying it. The sandbox substituted for human-in-the-loop review.

Map every tool call to a reversibility rating before deployment.
Set explicit escalation triggers — specific conditions under which the agent must pause and ask.
Log every observation and action — not for compliance theatre, but because debugging requires the full trace.
Test with adversarial inputs specifically designed to push the agent toward irreversible actions.
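
The first three items on this checklist fit naturally into a thin wrapper around tool execution. The sketch below is illustrative only: the tool names, the reversibility ratings, and the 100-unit `amount_limit` are hypothetical, and the executor records decisions instead of calling real tools.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical tools, each mapped to a reversibility rating before deployment.
REVERSIBILITY = {
    "draft_reply": "reversible",
    "apply_credit": "reversible",
    "issue_refund": "irreversible",
}


@dataclass
class GatedExecutor:
    amount_limit: float = 100.0                  # illustrative escalation threshold
    trace: list = field(default_factory=list)    # full log of every proposed action

    def escalation_reason(self, tool: str, args: dict) -> Optional[str]:
        # Tools with no rating are treated as irreversible.
        if REVERSIBILITY.get(tool, "irreversible") == "irreversible":
            return "irreversible action"
        if args.get("amount", 0) > self.amount_limit:
            return "amount over limit"
        return None

    def execute(self, tool: str, args: dict) -> str:
        reason = self.escalation_reason(tool, args)
        status = "escalated" if reason else "executed"
        # Log the proposal and the decision whether or not the tool runs.
        self.trace.append({"tool": tool, "args": args, "status": status, "reason": reason})
        return status
```

Defaulting unrated tools to "irreversible" means a newly added tool pauses for a human until someone has explicitly rated it. The `trace` list is also a convenient target for adversarial tests: feed the agent hostile inputs and assert that no irreversible tool ever appears with status `executed`.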

The Air Canada Lesson

Air Canada's chatbot failure was not an AI failure; it was a deployment design failure. The chatbot effectively had Level 4 output authority on policy questions, with none of the human review that Level 2 would have required. Matching autonomy level to stakes is the practitioner's first responsibility when deploying any AI system.

🎯 Advanced · Lesson 3 Quiz

Quiz: Autonomy Levels

3 questions — free, untracked, retake anytime.

1. What are the two key design axes that determine the appropriate autonomy level for an agent deployment?

✓ Correct. Reversibility (can the agent undo it?) and blast radius (how many systems or people are affected?) are the two fundamental axes. High reversibility + low blast radius = higher autonomy permitted.

Not quite. The two design axes are reversibility of actions and blast radius of errors. Model size and cost are infrastructure decisions; they don't determine the safe autonomy level.

2. Cognition AI's Devin was able to operate at a high autonomy level primarily because of which architectural constraint?

✓ Correct. The virtual machine sandbox contained Devin's blast radius. Code it wrote couldn't reach production without a human deployment step. The sandbox substituted for continuous human-in-the-loop review.

Not quite. Devin's high autonomy was enabled by operating inside a sandboxed virtual machine. Human deployment remained required before any code affected production — the sandbox was the containment mechanism.

3. What was the fundamental deployment design error in Air Canada's chatbot case?

✓ Correct. The chatbot was operating at high autonomy on policy questions that should have had a Level 2 human-review gate. The mismatch between stakes and oversight level was the core design failure, not the AI technology itself.

Not quite. The core failure was autonomy-level mismatch: the chatbot had high-stakes output authority (committing to refund policy) without any human review before responses reached customers. That's a deployment design failure, not a model capability failure.

🎯 Advanced · Lesson 3 Lab

Lab: Design an Autonomy Level

Map a real AI use case to the correct autonomy level using reversibility and blast-radius analysis.

Your Task

The AI below is briefed on the five autonomy levels and the two design axes (reversibility, blast radius). Describe a real business scenario where you'd want to deploy an AI agent — customer service, code review, data analysis, anything — and ask it to recommend the correct autonomy level with a full justification. Challenge its recommendation if you disagree.

Try: "We want to deploy an AI agent that can process expense reports and approve or reject them. What autonomy level should it operate at, and why?"

🤖 AESOP Autonomy Advisor Lesson 3 Lab

🎯 Advanced · Lesson 4 of 4

Lesson 4

Advanced concepts, real-world applications, and practical implications

Core Concepts

This lesson examines the key principles, real-world applications, and implications for practitioners who design and deploy AI agents.

Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.

Practical Applications

The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.

Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.

Looking Forward

As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.

Lesson 4 Quiz

Lesson 4

What is the primary focus of Lesson 4?

✓ Correct. This lesson bridges theory and practice, focusing on real-world implementation.

Review the lesson — the focus is on connecting frameworks to practical reality.

Why does real-world deployment introduce challenges that pure theory doesn't capture?

✓ Correct. Real deployment requires judgment, not just framework application.

Practice doesn't invalidate theory — it reveals complexities that require nuanced application of theoretical principles.

What separates effective practitioners from those who merely follow checklists?

✓ Correct. Critical thinking and adaptability matter more than memorized procedures.

The key differentiator is critical thinking ability, not experience or resources alone.

🎯 Advanced · Lesson 4 Lab

Lab: Apply What You've Learned

Synthesize concepts from Lesson 4 through guided AI conversation

Your Task

Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios that apply them.

Try: "I'm building an AI agent to manage a company's IT help desk. Walk me through the full design: what are its perception channels, what memory and tools does it need, and what autonomy level should it operate at — and why?"

🤖 AESOP Lab Assistant Lesson 4 Lab

Module 1 Test

What Is an AI Agent · 15 Questions · 70% to Pass

1. According to Russell & Norvig, what three components define an agent?

2. What distinguished Stanford's Smallville agents from typical chatbots?

3. Cognition AI's Devin qualifies as a true agent because it...

4. What is the key difference between an assistant and an agent?

5. Why is autonomy NOT part of the formal Russell & Norvig definition of an agent?

6. The Wang et al. (2023) four-component framework identifies which components?

7. What was AutoGPT's primary memory mechanism?

8. In the ReAct framework (Yao et al., 2022), each iteration follows which structure?

9. According to Wang et al., which component was most frequently under-specified in early agent papers?

10. What are the three most common failure modes in agent architectures?

11. What two axes determine the appropriate autonomy level for an agent deployment?

12. At what autonomy level does GitHub Copilot operate?

13. In the Air Canada chatbot case (2024), what was the root architectural failure?

14. What is Anthropic's "minimal footprint" design principle?

15. Why was Cognition AI's Devin able to operate at high autonomy?
