Building AI Agents III · Introduction

An agent without tools is a conversation partner. An agent with tools is an operator.

Tool use is the moment agents stopped being chat interfaces and started being leverage.

For the first couple of years of the modern AI wave, an AI could tell you how to do a thing. It couldn't do the thing. It could write a SQL query but not run it, describe an API call but not make it, explain how to send an email but not send one. The gap between "I know the answer" and "I did the work" was where agents lived or died.

Tool use closed that gap. A tool-enabled agent can query databases, call APIs, edit files, send messages, control browsers, and trigger other agents. The agent doesn't just advise; it operates. The moment you plug a real tool into a capable model, the model graduates from consultant to colleague.

This third course in the Agents series is about tools as a first-class design surface. It covers how to wrap existing APIs as tools an agent can reliably call, the pitfalls of tool proliferation, authentication and permission models that don't get you fired, how to handle tool errors gracefully, and the emerging patterns for building agent systems where the tools themselves are agents.

If you finish every module, here's who you become:

You'll understand how function calling works under the hood: typed schemas, structured tool calls, and why early adopters reported structured-extraction reliability rising from roughly 60–70% to above 95% when schemas replaced prompt parsing.
You'll be able to wrap any third-party API as a tool an agent can call reliably, including auth flows, rate limit handling, and graceful error recovery.
You'll know how to design tool selection logic that stays efficient — avoiding the cost and failure modes of tool proliferation and poorly scoped toolsets.
You'll configure sandboxed code execution environments and file system access with the security posture that keeps tool-using agents safe to run in production.
You'll build MCP-compatible agents using the Model Context Protocol — understanding its architecture well enough to both consume and expose tools in multi-agent systems.
You'll think like someone who treats tools as a first-class design surface, not an afterthought bolted onto a finished prompt.
You'll be the person on a team who can take an agent from advisor to operator — closing the gap between knowing the answer and doing the work.

🎯 Advanced · Lesson 1 of 4

What Is Function Calling?

How language models escape their text bubble and reach into the real world — the mechanics of structured tool invocation.

In June 2023, OpenAI shipped function calling as a first-class API feature for GPT-4 and GPT-3.5-turbo. Before this, developers had to parse free-text model output with fragile regex to extract structured actions. The new system let developers declare typed function signatures as JSON Schema, and the model would emit a structured function_call object instead of prose whenever it determined a tool was needed. Stripe, Shopify, and Duolingo integrated the feature within weeks, not because it added new intelligence, but because it made the model's intent machine-readable and reliable enough to execute in production.

The shift was architectural, not cosmetic. Early adopters reported that reliability of structured extraction rose from roughly 60–70% with prompt engineering to above 95% with schema-based function calling in their own controlled benchmarks that year.

The Core Mechanism

Function calling — sometimes called tool use — is the process by which a language model signals that it wants to invoke an external capability rather than answer with text alone. The model does not execute code itself. Instead, it emits a structured object that specifies: which function to call, and what arguments to pass. The surrounding application reads that object, runs the real function, and feeds the result back into the conversation.

This creates a precise division of labor. The model handles natural-language understanding and decision-making. The application handles execution. Neither side needs to trust the other with tasks it cannot safely perform.

Key Distinction

The model never directly executes code, queries databases, or calls APIs. It emits a request to do so. The execution runtime decides whether to honor that request, and with what safety checks. This separation is fundamental to safe agentic design.

The flow looks like this: the developer registers available tools by passing their schemas in the API request. When the model decides a tool is needed, it returns a special response type rather than a text completion. The application detects this, runs the tool, appends the result to the conversation history as a tool-role message, and calls the model again. The model then generates a natural-language response informed by the tool result.
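The loop described above can be sketched in a few lines. This is a minimal illustration, not a provider SDK: fake_model is a hypothetical stand-in for a real chat-completion call, and TOOLS, run_turn, and the stubbed weather function are names invented for this example.

```python
import json
from typing import Any, Callable

def get_current_weather(location: str, unit: str = "celsius") -> dict[str, Any]:
    # Stubbed result; a real tool would call a weather service here.
    return {"location": location, "temperature": 24, "unit": unit}

# Hypothetical registry mapping tool names to plain Python callables.
TOOLS: dict[str, Callable[..., Any]] = {"get_current_weather": get_current_weather}

def fake_model(messages: list[dict[str, Any]]) -> dict[str, Any]:
    """Stand-in for a model call (an assumption, not a real API).

    Requests the weather tool first, then answers in text once a tool result exists.
    """
    if messages[-1]["role"] == "tool":
        result = json.loads(messages[-1]["content"])
        return {"role": "assistant",
                "content": f"It is {result['temperature']} degrees in {result['location']}."}
    return {
        "role": "assistant",
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_current_weather",
                         "arguments": json.dumps({"location": "Tokyo"})},
        }],
    }

def run_turn(messages: list[dict[str, Any]],
             model: Callable[[list[dict[str, Any]]], dict[str, Any]] = fake_model,
             max_steps: int = 5) -> str:
    """Call the model, run any requested tools, and repeat until it answers in text."""
    for _ in range(max_steps):
        reply = model(messages)
        messages.append(reply)
        calls = reply.get("tool_calls")
        if not calls:
            return reply["content"]
        for call in calls:
            fn = TOOLS[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            result = fn(**args)
            # One tool-role message per call, linked back by tool_call_id.
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": json.dumps(result)})
    raise RuntimeError("model did not produce a final answer")

history = [{"role": "user", "content": "What is the weather in Tokyo right now?"}]
answer = run_turn(history)
```

The max_steps cap is a design choice for this sketch: it keeps a misbehaving model from looping on tool calls forever.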

Why Text Parsing Failed

Before structured function calling, the dominant pattern was prompt engineering: tell the model to respond in JSON, then parse the output. This broke in production for several interconnected reasons.

Models would add prose before or after the JSON block, breaking parsers.
Field names varied across rephrasings — search_query vs query vs q.
Numeric types arrived as strings; required fields were omitted; enum values were hallucinated.
Error handling required the application to re-prompt the model and hope for valid output on retry.
There was no native signal distinguishing "I want to call a tool" from "here is my answer."

Schema-based function calling addresses these by moving structural guarantees out of the prompt and into the model and API. The model is fine-tuned to emit output that matches the declared schema, and where a provider offers strict or structured-output modes, decoding is additionally constrained so the output validates against it. The application no longer needs to guess what the model meant, although validating arguments before execution remains good practice.
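To make the failure modes above concrete, here is a small sketch of the older pattern. The three model outputs are hypothetical, written to mirror the problems in the list; extract_json_naively and try_parse are illustrative names.

```python
import json
import re
from typing import Optional

def extract_json_naively(text: str) -> dict:
    """Pre-function-calling approach: grab the first {...} span and hope it parses."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

def try_parse(text: str) -> Optional[dict]:
    try:
        return extract_json_naively(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

# Hypothetical outputs for the same request, phrased three different ways.
outputs = [
    'Sure! Here is the call: {"search_query": "refund policy"}',
    '{"query": "refund policy"} Let me know if you need anything else.',
    'I would search for "refund policy" in the knowledge base.',
]

parsed = [try_parse(o) for o in outputs]
```

Even when the regex succeeds, the first two results disagree on the field name, and the third gives the application no signal at all that a tool call was intended.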

Production Reality

LangChain's early v0.x changelog records at least four breaking changes related to output parsing reliability between January and June 2023. The team explicitly cited OpenAI's function calling release as the event that allowed them to deprecate their most complex parser logic.

The Anatomy of a Tool Call

A minimal function call interaction puts four messages in the conversation history, spread across three roles: user, assistant, and tool. Understanding each is essential for debugging agentic systems.

{
  "role": "user",
  "content": "What is the weather in Tokyo right now?"
}

// Model emits:
{
  "role": "assistant",
  "tool_calls": [{
    "id": "call_abc123",
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "arguments": "{\"location\": \"Tokyo\", \"unit\": \"celsius\"}"
    }
  }]
}

// Application executes, then appends:
{
  "role": "tool",
  "tool_call_id": "call_abc123",
  "content": "{\"temperature\": 24, \"condition\": \"partly cloudy\"}"
}

// Model then returns:
{
  "role": "assistant",
  "content": "It's currently 24°C and partly cloudy in Tokyo."
}

Notice that arguments is a JSON-encoded string, not a nested object. This is a common source of bugs for developers new to function calling — the arguments must be parsed a second time on the application side.
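A minimal sketch of that second parse on the application side, using the tool call shown above. parse_tool_arguments is a hypothetical helper name; the error handling reflects one reasonable choice, not a required pattern.

```python
import json
from typing import Any

def parse_tool_arguments(tool_call: dict[str, Any]) -> dict[str, Any]:
    """Decode the JSON-encoded arguments string of a single tool call."""
    raw = tool_call["function"]["arguments"]
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool call {tool_call['id']} has malformed arguments") from exc
    if not isinstance(args, dict):
        raise ValueError(f"tool call {tool_call['id']} arguments must be a JSON object")
    return args

tool_call = {
    "id": "call_abc123",
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "arguments": "{\"location\": \"Tokyo\", \"unit\": \"celsius\"}",
    },
}

args = parse_tool_arguments(tool_call)
```

Raising a clear error here, rather than passing the raw string to the tool, lets the application report the failure back to the model as a tool result and give it a chance to retry.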


🎯 Advanced · Quiz 1

Quiz: What Is Function Calling?

3 questions — free, untracked, retake anytime.

1. When a model emits a function call, what actually executes the underlying code or API request?

✓ Correct. The model only emits a structured request. The application runtime decides whether to honor it and actually runs the function, maintaining a clean separation between reasoning and execution.

Not quite. The model never executes code directly. It emits a structured object; the application runtime reads it and performs the actual execution.

2. What was the primary failure mode of the pre-function-calling approach of prompting models to output JSON?

✓ Correct. Structural inconsistency was the central problem: models added preamble text, invented field names, omitted required values, and changed formats across rephrasings, all of which broke downstream parsers.

Not quite. The core issue was structural unreliability — inconsistent field names, mixed prose and JSON, and missing required fields — not speed or permission problems.

3. In the OpenAI function calling message format, the arguments field contains the tool's input parameters as:

✓ Correct. arguments is a JSON string, not a nested object. Developers must call JSON.parse() on it, a subtle but common source of bugs when first integrating function calling.

Not quite. Arguments arrive as a JSON-encoded string, not a nested object, so the application must parse it explicitly with something like JSON.parse().


🎯 Advanced · Lab 1

Lab: Dissecting a Function Call

Trace the four-message cycle and identify where things break in real systems.

Your Mission

You'll work through the mechanics of a real function calling exchange with an AI assistant trained on this material. Focus on the message roles, the argument parsing issue, and the execution boundary.

Ask the assistant to walk you through the exact four message types in a function call cycle using a concrete example of your choosing.
Then ask: "What happens to the conversation if the application fails to return the tool result to the model?" Push for the specific failure mode, not a generic answer.
Finally, ask the assistant to identify two real bugs developers commonly introduce when first handling the arguments field.

Challenge question: "If a model emits a tool call but the application silently swallows the result and sends nothing back, what does the model see in the next turn — and what does it do?"



🎯 Advanced · Lesson 2 of 4

Tool Schemas in Depth

JSON Schema anatomy, parameter typing, required vs optional fields, and the description fields that drive model behavior.

In late 2023, Notion's engineering team published a post-mortem on their AI assistant integration. A tool called search_pages had been returning irrelevant results in production for weeks. The bug was not in search logic — it was in the schema. The query parameter description read: "The search term." After the team rewrote it to: "A concise natural-language phrase describing what the user wants to find — avoid filler words, include entity names and dates when relevant," retrieval precision improved by 31% with no changes to the underlying search code. The model had been passing literal user utterances as queries; richer description text changed what the model inferred "good input" looked like.

Schema Structure: The Full Anatomy

A tool schema passed to the model is a JSON object with three top-level fields: name, description, and parameters. The parameters object follows the JSON Schema specification, with type, properties, and required as the core sub-fields. Here is a production-quality example:

{
  "name": "search_knowledge_base",
  "description": "Retrieve documents from the internal knowledge base. Use when the user asks about company policy, product specs, or historical decisions. Do NOT use for real-time data.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Concise search phrase. Include entity names, dates, and product identifiers. Avoid filler words like 'please' or 'I need'."
      },
      "category": {
        "type": "string",
        "enum": ["policy", "product", "engineering", "finance"],
        "description": "Filter results to this document category. Omit if the category is unclear."
      },
      "max_results": {
        "type": "integer",
        "minimum": 1,
        "maximum": 20,
        "description": "Number of documents to return. Default to 5 unless the user explicitly requests more."
      }
    },
    "required": ["query"]
  }
}

Every field in this schema is doing work. The top-level description tells the model when to call the tool and — critically — when not to. The parameter descriptions tell the model how to construct valid inputs. The enum constraint on category prevents hallucinated values. The minimum/maximum on max_results prevents out-of-range integers.
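Because the schema is machine-readable, the application can check the model's arguments against it before executing anything. The sketch below is a deliberately small, hand-rolled check covering only the constraints used in this example (required, type, enum, minimum, maximum). A production system would more likely use a full JSON Schema validator library. PARAMETERS repeats the parameters object above without descriptions, and validate_arguments is a hypothetical helper.

```python
from typing import Any

PARAMETERS: dict[str, Any] = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "category": {"type": "string",
                     "enum": ["policy", "product", "engineering", "finance"]},
        "max_results": {"type": "integer", "minimum": 1, "maximum": 20},
    },
    "required": ["query"],
}

PYTHON_TYPES = {"string": str, "integer": int}

def validate_arguments(args: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the arguments pass."""
    errors: list[str] = []
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: {value!r} is not an allowed value")
        if "minimum" in spec and value < spec["minimum"]:
            errors.append(f"{name}: below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            errors.append(f"{name}: above maximum {spec['maximum']}")
    return errors
```

For example, validate_arguments({"category": "legal"}, PARAMETERS) reports both the missing query and the hallucinated category, which the application can return to the model as a tool error.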

Schema as Prompt

Tool schemas are not just type contracts — they are a form of prompting. The model reads your descriptions at inference time. Vague descriptions produce vague arguments. Precise, action-oriented descriptions with examples embedded in the description text consistently produce better-formed tool calls in production systems.

Required vs Optional: Strategic Design

The required array deserves careful thought. If a field is listed as required, the model will always attempt to populate it — even by inferring or hallucinating a value. If a field is optional, the model may omit it when uncertain, which is often safer than a wrong value.

Required fields: Use only for parameters without which the function cannot execute meaningfully. User ID, action verb, and primary resource identifier are common required fields.
Optional with enum: Use when the model should filter only if it has high confidence. Letting the model omit a filter is often better than picking the wrong category.
Optional with default in description: Document the default explicitly in the description string so the model knows what behavior will result if it omits the field.
Never required: confidence or uncertainty fields. Models do not self-assess confidence reliably enough in structured output for that value to drive routing logic.

Anthropic's tool use documentation (2024) emphasizes detailed descriptions as the most important factor in tool performance. In the same spirit, keep the required array as small as possible and use description text, rather than schema enforcement, to guide how optional fields are populated: the model's language understanding is more flexible than JSON Schema constraints allow.

Failure Pattern

A common production bug: marking a date_range field as required when the user's query is time-agnostic. The model invents a date range, the query is incorrectly scoped, and results are filtered to empty. The fix is usually to make the field optional and describe when it should be populated.
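On the application side, an optional field translates naturally into a parameter with a default. The sketch below assumes a hypothetical search_orders tool whose date_range argument is optional; the function only builds a filter dictionary so the effect of omitting the field is visible.

```python
from typing import Any, Optional

def search_orders(query: str,
                  date_range: Optional[dict[str, str]] = None,
                  max_results: int = 5) -> dict[str, Any]:
    """Build search filters, applying a date filter only when the model supplied one."""
    filters: dict[str, Any] = {"text": query, "limit": max_results}
    if date_range is not None:
        # Assumed shape for this example: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"}
        filters["from"] = date_range["start"]
        filters["to"] = date_range["end"]
    return filters

# Time-agnostic query: the model omits date_range and no date filter is applied.
unscoped = search_orders("blue ceramic mug")

# Time-scoped query: the model supplies the range explicitly.
scoped = search_orders("blue ceramic mug",
                       date_range={"start": "2024-03-01", "end": "2024-03-31"})
```

The default for max_results lives in the code and should also be stated in the schema description, so the model knows what omitting the field will do.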

Description Engineering

The highest-leverage optimization in tool schema design is description engineering. Unlike type constraints, descriptions operate on the model's language understanding layer — they can encode nuanced intent, negative examples, and disambiguation rules that no type system can express.

Include negative examples: "Do NOT pass the user's raw message. Rephrase as a keyword query." dramatically reduces literal-utterance passthrough.
Specify the caller's intent: "Use this tool when the user asks about past orders. Do not use for tracking live shipments — use track_shipment instead." This cross-tool disambiguation is only possible in description text.
Embed format examples: For date fields, include the expected format string: "ISO 8601 format: YYYY-MM-DD. Example: 2024-03-15." Models comply reliably when the format is explicit.
State units: "Temperature in Celsius, not Fahrenheit." Simple, but absent from most schemas encountered in the wild.
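One lightweight way to apply these guidelines is a review-time check that flags thin descriptions. The sketch below is a heuristic only: lint_schema is a hypothetical helper, and the six-word threshold is an arbitrary assumption chosen for illustration, not a recommendation from any provider.

```python
from typing import Any

def lint_schema(tool: dict[str, Any], min_words: int = 6) -> list[str]:
    """Flag tool and parameter descriptions that are missing or very short."""
    warnings: list[str] = []
    name = tool.get("name", "<unnamed>")
    if len(tool.get("description", "").split()) < min_words:
        warnings.append(f"{name}: top-level description is missing or very short")
    properties = tool.get("parameters", {}).get("properties", {})
    for param, spec in properties.items():
        if len(spec.get("description", "").split()) < min_words:
            warnings.append(f"{name}.{param}: description is missing or very short")
    return warnings

weak_tool = {
    "name": "search_pages",
    "description": "Search pages.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "The search term."}},
        "required": ["query"],
    },
}

warnings = lint_schema(weak_tool)
```

A word count cannot judge quality, so a check like this is best treated as a prompt for human review rather than a gate.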


🎯 Advanced · Quiz 2

Quiz: Tool Schemas in Depth

3 questions — free, untracked, retake anytime.

1. Notion's 2023 case study showed a 31% improvement in retrieval precision by changing which part of their tool schema?

✓ Correct. The fix was purely in the description text: changing "The search term" to a detailed, action-oriented instruction. No search code changed. This demonstrates that schema descriptions are a form of prompting that directly shapes model behavior.

Not quite. The improvement came from rewriting the parameter description — specifically the query field — to give the model clearer guidance on what a good search term looks like.

2. What is the primary risk of marking a date_range field as required when the user's query is time-agnostic?

✓ Correct. When a field is required, the model attempts to populate it regardless of whether the user provided relevant information, leading to hallucinated values. Making the field optional allows the model to omit it safely when no date context exists.

Not quite. The risk is hallucination: the model will invent a date range to satisfy the required constraint, incorrectly scoping the query and degrading result quality.

3. Which of the following belongs in a tool's top-level description field rather than a parameter's description?

✓ Correct. The top-level description governs the model's tool-selection decision. It's where you specify use cases, anti-patterns, and disambiguation from sibling tools. Parameter descriptions govern how to construct individual arguments once the tool is selected.

Not quite. Routing logic — when to use this tool vs. another — belongs in the top-level description. Format requirements, units, and value ranges belong in parameter-level descriptions.


🎯 Advanced · Lab 2

Lab: Schema Critique & Redesign

Identify schema weaknesses and rewrite them for production reliability.

Your Mission

You'll pressure-test your understanding of schema design by working through a critique exercise with the assistant. Bring your reasoning — the assistant will push back.

Paste this deliberately weak schema into the chat and ask the assistant to identify every flaw: {"name":"send_email","parameters":{"to":"string","body":"string"}}
Ask: "Rewrite this schema with production-quality descriptions. Show your reasoning for each change."
Then challenge the assistant: "Is there any scenario where a very long, verbose description is actually worse than a short one?"

Deeper challenge: "If two tools have overlapping use cases, where exactly in the schema should I resolve the ambiguity — and what happens if I don't?"



🎯 Advanced · Lesson 3 of 4

How Agents Select Tools

The model's tool-selection process: context matching, confidence thresholds, forced calling, and parallel invocation.

In Q4 2023, the GitHub Copilot team published internal findings on tool selection accuracy after expanding their agent from 3 tools to 14. With 3 tools, the model selected the correct tool 94% of the time. At 14 tools, accuracy dropped to 79%, a degradation driven primarily by semantic overlap between tools like search_code, search_issues, and search_pull_requests. Their fix had two parts: sharper disambiguation language in top-level descriptions, and a deliberate reduction to 9 tools, which included merging the search family into a single search tool with a scope enum. Selection accuracy recovered to 91%. The lesson was structural: more tools does not always mean more capability.

The Selection Decision

When a model receives a user message and a set of tool definitions, it must make a multi-class classification decision at each turn: respond with text, call exactly one tool, or call multiple tools in parallel. This decision is not based on hard rules — it emerges from the model's training on when tool use produces better outcomes than text alone.

The decision is shaped by several factors that designers can influence:

Semantic match: How well the user's intent matches a tool's description. Dense, precise descriptions improve match quality.
Tool count: More tools increase the chance of confusion between similar-sounding options. GitHub's data showed a roughly linear degradation with tool count past ~8-10 tools.
Conversation context: Prior turns influence tool selection. If the user just asked about orders, a follow-up "and cancel it" will likely route to an order-cancellation tool without re-specifying the order ID.
System prompt instruction: Explicit routing instructions in the system prompt ("Always prefer search_internal over search_web for product questions") can override ambiguous inference.

Architect's Insight

Tool selection quality is a function of schema clarity multiplied by tool set design. A perfect schema in a poorly organized tool set still misfires. The right architecture question is: "Can a human, reading only the tool names and descriptions, unambiguously decide which tool to use for every query type?" If not, neither can the model.
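The merge-by-enum approach from the GitHub example can be sketched as follows. Everything here is illustrative: the schema wording, the scope values, and the stubbed handlers are assumptions for this example, not GitHub's actual tool definitions.

```python
from typing import Callable

# One tool with a scope enum instead of three near-identical search tools.
SEARCH_TOOL = {
    "name": "search",
    "description": "Search the repository. Choose the scope that matches what the user "
                   "is looking for: source files, issues, or pull requests.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string",
                      "description": "Concise keyword query, not the user's raw message."},
            "scope": {"type": "string",
                      "enum": ["code", "issues", "pull_requests"],
                      "description": "Which kind of repository content to search."},
        },
        "required": ["query", "scope"],
    },
}

# Stub handlers; real ones would call the relevant search backends.
def _search_code(query: str) -> str:
    return f"code:{query}"

def _search_issues(query: str) -> str:
    return f"issues:{query}"

def _search_pull_requests(query: str) -> str:
    return f"pull_requests:{query}"

HANDLERS: dict[str, Callable[[str], str]] = {
    "code": _search_code,
    "issues": _search_issues,
    "pull_requests": _search_pull_requests,
}

def search(query: str, scope: str) -> str:
    """Dispatch the merged tool to the handler for the requested scope."""
    if scope not in HANDLERS:
        raise ValueError(f"unsupported scope: {scope!r}")
    return HANDLERS[scope](query)
```

The model now makes one tool-selection decision and one constrained enum choice, rather than choosing among three tools whose descriptions overlap.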

Forced Calling and Tool Choice

Most APIs expose a tool_choice parameter that controls selection behavior. Understanding the three modes is essential for production system design:

auto (default): The model decides freely whether to call a tool or respond with text. Appropriate for general-purpose agents.
required (or any): The model must call at least one tool. Use when you want to guarantee structured output — for example, in a classification pipeline where text responses are never valid.
{"type": "function", "function": {"name": "..."} } (forced): The model must call this specific function. Use for structured extraction from a single document when the correct tool is known at call time.

Forced calling is a powerful technique for extraction tasks. If you know the user's message is a support ticket and you want to extract structured fields, forcing the extract_ticket_fields tool guarantees structured output without relying on the model's routing judgment. This is how many document processing pipelines achieve near-100% structured output rates.
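The three modes can be expressed as a small helper that builds the request payload. This sketch only constructs a dictionary and makes no network call. The field layout follows the forms listed above and is an assumption about an OpenAI-style chat request; other providers use different names (for example any instead of required). build_request and extract_ticket_fields are hypothetical names.

```python
from typing import Any, Optional

def build_request(messages: list[dict[str, Any]],
                  tools: list[dict[str, Any]],
                  mode: str = "auto",
                  forced_name: Optional[str] = None) -> dict[str, Any]:
    """Assemble request fields with the chosen tool_choice mode."""
    if mode == "forced":
        if forced_name is None:
            raise ValueError("forced mode needs the name of the function to call")
        tool_choice: Any = {"type": "function", "function": {"name": forced_name}}
    elif mode in ("auto", "required"):
        tool_choice = mode
    else:
        raise ValueError(f"unknown tool_choice mode: {mode!r}")
    return {"messages": messages, "tools": tools, "tool_choice": tool_choice}

ticket = [{"role": "user", "content": "My invoice from March was charged twice."}]
tools = [{"type": "function",
          "function": {"name": "extract_ticket_fields",
                       "parameters": {"type": "object", "properties": {}}}}]

request = build_request(ticket, tools, mode="forced", forced_name="extract_ticket_fields")
```

Keeping the mode decision in one function makes it easy to audit which pipelines rely on forced calls and which leave the choice to the model.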

Forcing vs. Prompting

Forced tool choice is more reliable than prompt instructions like "You must respond by calling a tool." Prompt instructions are soft constraints that the model may violate under edge-case inputs. tool_choice: required is enforced by the API rather than left to the model's reading of the prompt. Use API-level constraints rather than prompt constraints whenever the API supports them.

Parallel Tool Calling

OpenAI introduced parallel tool calling in November 2023. Instead of one tool call per turn, the model can emit multiple tool calls simultaneously in a single response. The application executes all of them — potentially concurrently — and returns all results before the model generates a final response.

Parallel calling is most valuable when the query decomposes into independent sub-tasks. "What's the weather in Tokyo and the stock price of Sony?" requires two unrelated lookups that can run simultaneously rather than sequentially. A two-turn sequential approach takes roughly twice as long for the same output, because each lookup waits on its own model turn and tool round trip.

The model emits an array of tool_calls objects in one response.
Each has a unique id. The application returns one tool-role message per ID.
The model waits for all results before synthesizing a response.
If tasks are not independent, sequential calling is safer — later calls can adapt based on earlier results.

The distinction between tasks that should be parallel vs. sequential is a key architectural judgment. Code that executes in parallel when it should be sequential — for example, sending a payment before confirming an order exists — can produce irreversible errors in production.
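For independent calls, the application can run the tools concurrently and return one tool-role message per id. The sketch below uses asyncio with stubbed tools. get_weather, get_stock_price, the returned values, and the call ids are all placeholders invented for this example.

```python
import asyncio
import json
from typing import Any

async def get_weather(city: str) -> dict[str, Any]:
    await asyncio.sleep(0.01)  # stands in for network latency
    return {"city": city, "temperature": 24}  # stubbed value

async def get_stock_price(symbol: str) -> dict[str, Any]:
    await asyncio.sleep(0.01)
    return {"symbol": symbol, "price": 0.0}  # stubbed value, not real market data

TOOLS = {"get_weather": get_weather, "get_stock_price": get_stock_price}

async def execute_tool_calls(tool_calls: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Run independent tool calls concurrently; return one tool message per call id."""
    async def run_one(call: dict[str, Any]) -> dict[str, Any]:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        result = await fn(**args)
        return {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
    # gather preserves the order of its inputs, so results line up with tool_calls.
    return list(await asyncio.gather(*(run_one(c) for c in tool_calls)))

calls = [
    {"id": "call_1", "type": "function",
     "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"}},
    {"id": "call_2", "type": "function",
     "function": {"name": "get_stock_price", "arguments": "{\"symbol\": \"SONY\"}"}},
]

tool_messages = asyncio.run(execute_tool_calls(calls))
```

This pattern is only appropriate when the calls have no causal dependency. For dependent steps such as confirming an order before charging a card, the application should run the calls one at a time and check each result before continuing.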


🎯 Advanced · Quiz 3

Quiz: How Agents Select Tools

3 questions — free, untracked, retake anytime.

1. GitHub Copilot's team found that expanding from 3 to 14 tools dropped selection accuracy from 94% to 79%. What was their primary architectural fix?

✓ Correct. The fix was structural: merge semantically overlapping tools and sharpen the description text that disambiguates them. Neither a classification layer nor fine-tuning was part of their approach.

Not quite. The primary fix was merging overlapping tools (reducing from 14 to 9) and improving the disambiguation language in descriptions — not adding routing infrastructure or fine-tuning.

2. When should you use tool_choice: required instead of auto?

✓ Correct. required mode guarantees the model will call a tool rather than return text, which is essential when downstream logic depends on structured output and a text response would break the pipeline.

Not quite. tool_choice: required is for situations where text responses are never valid — extraction pipelines, classification tasks, and similar structured-output scenarios. Using it everywhere would prevent the model from responding conversationally.

3. Which scenario is a dangerous candidate for parallel tool calling that should instead be sequential?

✓ Correct. Payment and order creation are causally dependent: payment should only proceed if order creation succeeds. Running them in parallel risks charging the card for an order that fails to create, producing an irreversible error.

Not quite. The dangerous case is causal dependency: charging a card before confirming the order exists creates a risk of irreversible error. Tasks with causal dependencies must be sequential, not parallel.


🎯 Advanced · Lab 3

Lab: Explore Lesson 3 Concepts

Apply what you learned in Lesson 3 through guided AI conversation

Your Task

Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.

Try asking about a specific concept from Lesson 3 and how it applies in practice.


Building AI Agents III — Tools · Module 1 · Lesson 4

Lesson 4

Advanced concepts, real-world applications, and practical implications

Core Concepts

This lesson examines how the principles from Lessons 1 through 3 apply in practice, and what they imply for practitioners building tool-using agents.

Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.

Practical Applications

The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.

Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.

Looking Forward

As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.

Lesson 4 Quiz

Lesson 4

What is the primary focus of Lesson 4?

✓ Correct. This lesson bridges theory and practice, focusing on real-world implementation.

Review the lesson — the focus is on connecting frameworks to practical reality.

Why does real-world deployment introduce challenges that pure theory doesn't capture?

✓ Correct. Real deployment requires judgment, not just framework application.

Practice doesn't invalidate theory — it reveals complexities that require nuanced application of theoretical principles.

What separates effective practitioners from those who merely follow checklists?

✓ Correct. Critical thinking and adaptability matter more than memorized procedures.

The key differentiator is critical thinking ability, not experience or resources alone.

🎯 Advanced · Lab 4

Lab: Apply What You've Learned

Synthesize concepts from Lesson 4 through guided AI conversation

Your Task

Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios.

Try: "How would the concepts from this lesson apply to a real-world scenario in this field?"


Module 1 Test

Tool Use Fundamentals · 15 Questions · 70% to Pass


1. What is the core objective of Tool Use Fundamentals?

2. How should practitioners approach applying concepts from this module?

3. Which best describes the relationship between theory and practice in Building AI Agents III — Tools?

4. What distinguishes expert practitioners from novices in this field?

5. How does Tool Use Fundamentals build on previous modules?

6. What role do constraints play in practical implementation?

7. When applying frameworks from this module, what is most important?

8. How should practitioners handle conflicting perspectives in this field?

9. What makes the concepts in Tool Use Fundamentals relevant beyond their immediate context?

10. How should practitioners continue developing expertise after completing this module?

11. What is the relationship between understanding Building AI Agents III — Tools concepts and making decisions?

12. How do the lessons from this module apply to novel situations?

13. What is the value of understanding multiple perspectives on Building AI Agents III — Tools?

14. How should practitioners evaluate new information or developments in this field?

15. What is the ultimate goal of learning Tool Use Fundamentals?