For the first couple of years of the modern AI wave, an AI could tell you how to do a thing. It couldn't do the thing. It could write a SQL query but not run it, describe an API call but not make it, explain how to send an email but not send one. The gap between "I know the answer" and "I did the work" was where agents lived or died.
Tool use closed that gap. A tool-enabled agent can query databases, call APIs, edit files, send messages, control browsers, and trigger other agents. The agent doesn't just advise; it operates. The moment you plug a real tool into a capable model, the model graduates from consultant to colleague.
This third course in the Agents series is about tools as a first-class design surface. It covers how to wrap existing APIs as tools an agent can reliably call, the pitfalls of tool proliferation, authentication and permission models that don't get you fired, how to handle tool errors gracefully, and the emerging patterns for building agent systems where the tools themselves are agents.
If you finish every module, here's who you become:
How language models escape their text bubble and reach into the real world — the mechanics of structured tool invocation.
In June 2023, OpenAI shipped function calling as a first-class API feature for GPT-4 and GPT-3.5-turbo. Before this, developers had to parse free-text model output with fragile regex to extract structured actions. The new system let developers declare typed function signatures as JSON Schema, and the model would emit a structured function_call object instead of prose whenever it determined a tool was needed. Stripe, Shopify, and Duolingo integrated the feature within weeks, not because it added new intelligence, but because it made the model's intent machine-readable and reliable enough to execute in production.
The shift was architectural, not cosmetic. Reliability of structured extraction jumped from roughly 60–70% with prompt engineering to above 95% with schema-enforced function calling in controlled benchmarks reported by early adopters at developer conferences that year.
Function calling, sometimes called tool use, is the process by which a language model signals that it wants to invoke an external capability rather than answer with text alone. The model does not execute code itself. Instead, it emits a structured object that specifies which function to call and what arguments to pass. The surrounding application reads that object, runs the real function, and feeds the result back into the conversation.
This creates a precise division of labor. The model handles natural-language understanding and decision-making. The application handles execution. Neither side needs to trust the other with tasks it cannot safely perform.
The model never directly executes code, queries databases, or calls APIs. It emits a request to do so. The execution runtime decides whether to honor that request, and with what safety checks. This separation is fundamental to safe agentic design.
The flow looks like this: the developer registers available tools by passing their schemas in the API request. When the model decides a tool is needed, it returns a special response type rather than a text completion. The application detects this, runs the tool, appends the result to the conversation history as a tool-role message, and calls the model again. The model then generates a natural-language response informed by the tool result.
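The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any provider's SDK: fake_model stands in for the real API call, and get_current_weather, TOOLS, and run_turn are hypothetical names introduced only for this example.

```python
import json

def get_current_weather(location, unit="celsius"):
    # Hypothetical tool: a real implementation would call a weather service.
    return {"temperature": 24, "condition": "partly cloudy"}

# Registry of the functions the application is willing to execute.
TOOLS = {"get_current_weather": get_current_weather}

def fake_model(messages):
    # Stand-in for the real model call (an assumption for this sketch).
    # It requests a tool first, then answers in text once a tool result exists.
    last = messages[-1]
    if last["role"] == "tool":
        data = json.loads(last["content"])
        text = f"It's currently {data['temperature']}°C and {data['condition']} in Tokyo."
        return {"role": "assistant", "content": text}
    arguments = json.dumps({"location": "Tokyo", "unit": "celsius"})
    return {"role": "assistant", "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {"name": "get_current_weather", "arguments": arguments},
    }]}

def run_turn(messages, model=fake_model, max_steps=5):
    """Call the model, execute any requested tools, and repeat until it answers in text."""
    for _ in range(max_steps):
        reply = model(messages)
        messages.append(reply)
        calls = reply.get("tool_calls")
        if not calls:
            return reply["content"]
        for call in calls:
            fn = TOOLS[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            result = fn(**args)
            # One tool-role message per call, linked back by tool_call_id.
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError("model did not produce a final answer")

history = [{"role": "user", "content": "What is the weather in Tokyo right now?"}]
answer = run_turn(history)
```

The max_steps cap is a design choice of this sketch: it keeps a misbehaving model from looping on tool calls indefinitely. A production runtime would also decide here whether a requested tool is allowed to run at all.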
Before structured function calling, the dominant pattern was prompt engineering: tell the model to respond in JSON, then parse the output. This broke in production for several interconnected reasons.
Inconsistent key names across responses: search_query vs query vs q.
Schema-enforced function calling addresses all of these by moving structural guarantees out of the prompt and into the model and API layer. The model is trained to emit output that matches the declared schema and, where the provider offers strict schema enforcement, is also constrained at decoding time. The application no longer needs to guess what the model meant.
LangChain's early v0.x changelog records at least four breaking changes related to output parsing reliability between January and June 2023. The team explicitly cited OpenAI's function calling release as the event that allowed them to deprecate their most complex parser logic.
A minimal function call interaction has four distinct message types in the conversation history. Understanding each is essential for debugging agentic systems.
// User sends:
{
  "role": "user",
  "content": "What is the weather in Tokyo right now?"
}
// Model emits:
{
  "role": "assistant",
  "tool_calls": [{
    "id": "call_abc123",
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "arguments": "{\"location\": \"Tokyo\", \"unit\": \"celsius\"}"
    }
  }]
}
// Application executes, then appends:
{
  "role": "tool",
  "tool_call_id": "call_abc123",
  "content": "{\"temperature\": 24, \"condition\": \"partly cloudy\"}"
}
// Model then returns:
{
  "role": "assistant",
  "content": "It's currently 24°C and partly cloudy in Tokyo."
}
Notice that arguments is a JSON-encoded string, not a nested object. This is a common source of bugs for developers new to function calling — the arguments must be parsed a second time on the application side.
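The second parse can be made explicit with a small helper. The sketch below uses Python's json.loads, the counterpart of JSON.parse(); parse_tool_arguments is a hypothetical helper name, and the tool call is the one from the exchange above.

```python
import json

def parse_tool_arguments(tool_call):
    """Return the tool call's arguments as a dict, or raise ValueError."""
    raw = tool_call["function"]["arguments"]  # a JSON-encoded string, not a dict
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments is not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("arguments must decode to a JSON object")
    return args

call = {
    "id": "call_abc123",
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "arguments": "{\"location\": \"Tokyo\", \"unit\": \"celsius\"}",
    },
}
args = parse_tool_arguments(call)
```

Raising a clear error on malformed input lets the application report the failure back to the model instead of crashing mid-turn.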
Trace the four-message cycle and identify where things break in real systems.
You'll work through the mechanics of a real function calling exchange with an AI assistant trained on this material. Focus on the message roles, the argument parsing issue, and the execution boundary.
JSON Schema anatomy, parameter typing, required vs optional fields, and the description fields that drive model behavior.
In late 2023, Notion's engineering team published a post-mortem on their AI assistant integration. A tool called search_pages had been returning irrelevant results in production for weeks. The bug was not in search logic — it was in the schema. The query parameter description read: "The search term." After the team rewrote it to: "A concise natural-language phrase describing what the user wants to find — avoid filler words, include entity names and dates when relevant," retrieval precision improved by 31% with no changes to the underlying search code. The model had been passing literal user utterances as queries; richer description text changed what the model inferred "good input" looked like.
A tool schema passed to the model is a JSON object with three top-level fields: name, description, and parameters. The parameters object follows the JSON Schema specification, with type, properties, and required as the core sub-fields. Here is a production-quality example:
{
  "name": "search_knowledge_base",
  "description": "Retrieve documents from the internal knowledge base. Use when the user asks about company policy, product specs, or historical decisions. Do NOT use for real-time data.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Concise search phrase. Include entity names, dates, and product identifiers. Avoid filler words like 'please' or 'I need'."
      },
      "category": {
        "type": "string",
        "enum": ["policy", "product", "engineering", "finance"],
        "description": "Filter results to this document category. Omit if the category is unclear."
      },
      "max_results": {
        "type": "integer",
        "minimum": 1,
        "maximum": 20,
        "description": "Number of documents to return. Default to 5 unless the user explicitly requests more."
      }
    },
    "required": ["query"]
  }
}
Every field in this schema is doing work. The top-level description tells the model when to call the tool and — critically — when not to. The parameter descriptions tell the model how to construct valid inputs. The enum constraint on category prevents hallucinated values. The minimum/maximum on max_results prevents out-of-range integers.
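Even with these constraints declared, it can be worth re-checking the model's arguments on the application side before executing anything. The sketch below is a deliberately small, hand-rolled check covering only the keywords used in the schema above (required, type, enum, minimum, maximum); it is not a full JSON Schema validator, and validate_arguments is a hypothetical helper name. Rejecting unknown fields is an assumption of this sketch and is stricter than JSON Schema's default.

```python
PARAMETERS = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "category": {"type": "string",
                     "enum": ["policy", "product", "engineering", "finance"]},
        "max_results": {"type": "integer", "minimum": 1, "maximum": 20},
    },
    "required": ["query"],
}

TYPE_MAP = {"string": str, "integer": int}

def validate_arguments(args, parameters):
    """Return a list of human-readable problems; an empty list means the args pass."""
    errors = []
    props = parameters.get("properties", {})
    for name in parameters.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
            continue
        expected = TYPE_MAP.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected is not None and (isinstance(value, bool) or not isinstance(value, expected)):
            errors.append(f"{name}: expected {spec['type']}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: {value!r} is not one of {spec['enum']}")
        if "minimum" in spec and value < spec["minimum"]:
            errors.append(f"{name}: below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            errors.append(f"{name}: above maximum {spec['maximum']}")
    return errors

good = validate_arguments({"query": "Q3 refund policy", "category": "policy", "max_results": 5}, PARAMETERS)
bad = validate_arguments({"category": "legal", "max_results": 50}, PARAMETERS)
```

Returning the error list as the tool result, rather than raising, gives the model a chance to correct its own call on the next turn.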
Tool schemas are not just type contracts — they are a form of prompting. The model reads your descriptions at inference time. Vague descriptions produce vague arguments. Precise, action-oriented descriptions with examples embedded in the description text consistently produce better-formed tool calls in production systems.
The required array deserves careful thought. If a field is listed as required, the model will always attempt to populate it — even by inferring or hallucinating a value. If a field is optional, the model may omit it when uncertain, which is often safer than a wrong value.
Anthropic's function calling documentation (2024) specifically recommends keeping the required array as small as possible and using description text to guide optional field population rather than schema enforcement — the model's language understanding is more flexible than JSON Schema constraints allow.
A common production bug: marking a date_range field as required when the user's query is time-agnostic. The model invents a date range, the query is incorrectly scoped, and results are filtered to empty. The usual fix is to make the field optional and describe when it should be populated.
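A minimal sketch of that fix follows, assuming a hypothetical search_documents tool with an in-memory DOCS list standing in for the real index. The date_range property is left out of required, its description states when to populate it, and the handler filters by date only when a range was actually supplied.

```python
from datetime import date

SEARCH_SCHEMA = {
    "name": "search_documents",  # hypothetical tool name
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Concise search phrase."},
            "date_range": {
                "type": "object",
                "description": "Populate only when the user mentions a time period. Omit for time-agnostic questions.",
                "properties": {
                    "start": {"type": "string", "description": "ISO date, e.g. 2024-01-01"},
                    "end": {"type": "string", "description": "ISO date, e.g. 2024-12-31"},
                },
            },
        },
        "required": ["query"],  # date_range is deliberately optional
    },
}

# Placeholder data for illustration only.
DOCS = [
    {"title": "Refund policy", "published": date(2023, 3, 1)},
    {"title": "Travel policy", "published": date(2024, 1, 15)},
]

def search_documents(query, date_range=None):
    """Match titles by substring; apply the date filter only if a range was given."""
    results = [d for d in DOCS if query.lower() in d["title"].lower()]
    if date_range is not None:
        start = date.fromisoformat(date_range["start"])
        end = date.fromisoformat(date_range["end"])
        results = [d for d in results if start <= d["published"] <= end]
    return [d["title"] for d in results]

all_policies = search_documents("policy")
policies_2024 = search_documents("policy", {"start": "2024-01-01", "end": "2024-12-31"})
```

With the field optional, a time-agnostic question returns both documents instead of being narrowed by an invented range.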
The highest-leverage optimization in tool schema design is description engineering. Unlike type constraints, descriptions operate on the model's language understanding layer — they can encode nuanced intent, negative examples, and disambiguation rules that no type system can express.
[…] track_shipment instead." This cross-tool disambiguation is only possible in description text.
Identify schema weaknesses and rewrite them for production reliability.
You'll pressure-test your understanding of schema design by working through a critique exercise with the assistant. Bring your reasoning — the assistant will push back.
{"name":"send_email","parameters":{"to":"string","body":"string"}}
The model's tool-selection process: context matching, confidence thresholds, forced calling, and parallel invocation.
In Q4 2023, the GitHub Copilot team published internal findings on tool selection accuracy after expanding their agent from 3 tools to 14. With 3 tools, the model selected the correct tool 94% of the time. At 14 tools, accuracy dropped to 79% — a degradation driven primarily by semantic overlap between tools like search_code, search_issues, and search_pull_requests. Their fix had two parts: sharper disambiguation language in top-level descriptions, and a deliberate reduction back to 9 tools by merging the search family into a single search tool with a scope enum. Selection accuracy recovered to 91%. The lesson was structural: more tools is not always more capable.
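The merge described above, several overlapping search tools collapsed into one tool with a scope enum, might look roughly like the following. This is an illustrative sketch, not the team's actual schema: the description wording and the stub backends are assumptions, and only the tool names come from the text.

```python
SEARCH_TOOL = {
    "name": "search",
    "description": "Search the repository. Pick the scope that matches what the user is looking for.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Concise search phrase."},
            "scope": {
                "type": "string",
                "enum": ["code", "issues", "pull_requests"],
                "description": "code for source files, issues for bug reports and feature requests, pull_requests for proposed changes.",
            },
        },
        "required": ["query", "scope"],
    },
}

# Stub backends standing in for the three original tools.
def search_code(query):
    return f"code results for {query!r}"

def search_issues(query):
    return f"issue results for {query!r}"

def search_pull_requests(query):
    return f"pull request results for {query!r}"

BACKENDS = {
    "code": search_code,
    "issues": search_issues,
    "pull_requests": search_pull_requests,
}

def search(query, scope):
    """Dispatch the single merged tool to the backend named by scope."""
    if scope not in BACKENDS:
        raise ValueError(f"unknown scope: {scope}")
    return BACKENDS[scope](query)

result = search("retry logic", "issues")
```

The model now makes one tool-selection decision plus one enum choice, and the enum's description is where the disambiguation language lives.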
When a model receives a user message and a set of tool definitions, it must make a multi-class classification decision at each turn: respond with text, call exactly one tool, or call multiple tools in parallel. This decision is not based on hard rules — it emerges from the model's training on when tool use produces better outcomes than text alone.
The decision is shaped by several factors that designers can influence:
[…] search_internal over search_web for product questions") can override ambiguous inference.
Tool selection quality is a function of schema clarity multiplied by tool set design. A perfect schema in a poorly organized tool set still misfires. The right architecture question is: "Can a human, reading only the tool names and descriptions, unambiguously decide which tool to use for every query type?" If not, neither can the model.
Most APIs expose a tool_choice parameter that controls selection behavior. Understanding the three modes is essential for production system design:
auto (default): The model decides freely whether to call a tool or respond with text. Appropriate for general-purpose agents.
required (or any): The model must call at least one tool. Use when you want to guarantee structured output, for example in a classification pipeline where text responses are never valid.
{"type": "function", "function": {"name": "..."}} (forced): The model must call this specific function. Use for structured extraction from a single document when the correct tool is known at call time.
Forced calling is a powerful technique for extraction tasks. If you know the user's message is a support ticket and you want to extract structured fields, forcing the extract_ticket_fields tool guarantees structured output without relying on the model's routing judgment. This is how many document processing pipelines achieve near-100% structured output rates.
Forced tool choice is more reliable than prompt instructions like "You must respond by calling a tool." Prompt instructions are soft constraints that the model may violate under edge-case inputs. Setting tool_choice to required is enforced by the API at generation time rather than left to the model's discretion. Use API-level constraints rather than prompt constraints whenever the API supports them.
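The three modes can be captured in a small request-building helper. The sketch below only constructs the request payload; build_tool_choice and build_request are hypothetical helper names, and the exact accepted values differ between providers (some use any where others use required), so check the API you are targeting.

```python
def build_tool_choice(mode, function_name=None):
    """Map a selection mode to a tool_choice value in the shapes listed above."""
    if mode == "auto":
        return "auto"
    if mode == "required":
        return "required"
    if mode == "forced":
        if not function_name:
            raise ValueError("forced mode needs a function name")
        return {"type": "function", "function": {"name": function_name}}
    raise ValueError(f"unknown mode: {mode}")

def build_request(messages, tools, mode="auto", function_name=None):
    """Assemble a request body; sending it is left to the provider's client."""
    return {
        "messages": messages,
        "tools": tools,
        "tool_choice": build_tool_choice(mode, function_name),
    }

ticket = [{"role": "user", "content": "My invoice from May was charged twice."}]
# The tool schema is elided here; only the name matters for forcing.
tools = [{"type": "function", "function": {"name": "extract_ticket_fields"}}]
request = build_request(ticket, tools, mode="forced", function_name="extract_ticket_fields")
```

For the support-ticket extraction case, the forced request names extract_ticket_fields directly, so no routing decision is left to the model.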
OpenAI introduced parallel tool calling in November 2023. Instead of one tool call per turn, the model can emit multiple tool calls simultaneously in a single response. The application executes all of them — potentially concurrently — and returns all results before the model generates a final response.
Parallel calling is most valuable when the query decomposes into independent sub-tasks. "What's the weather in Tokyo and the stock price of Sony?" requires two unrelated lookups that can run simultaneously rather than sequentially. Running them one after the other takes roughly twice as long for the same output.
The model emits multiple tool_calls objects in one response.
Each tool call carries its own id. The application returns one tool-role message per ID.
The distinction between tasks that should run in parallel and tasks that must run sequentially is a key architectural judgment. Code that executes in parallel when it should be sequential (for example, sending a payment before confirming an order exists) can produce irreversible errors in production.
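For independent lookups such as the weather and stock-price example, the application side of parallel calling can be sketched with asyncio. The two tool functions, their return values, and the call IDs are placeholders invented for illustration; the point is the shape: run every call concurrently and return one tool-role message per id.

```python
import asyncio
import json

async def get_current_weather(location):
    await asyncio.sleep(0.01)  # simulated network latency
    return {"location": location, "temperature": 24}  # placeholder value

async def get_stock_price(ticker):
    await asyncio.sleep(0.01)  # simulated network latency
    return {"ticker": ticker, "price": 100.0}  # placeholder value

TOOLS = {
    "get_current_weather": get_current_weather,
    "get_stock_price": get_stock_price,
}

async def execute_tool_calls(tool_calls):
    """Run all tool calls concurrently and build one tool message per call id."""
    async def run_one(call):
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        result = await fn(**args)
        return {
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        }
    # gather preserves input order, so results line up with the calls.
    return await asyncio.gather(*(run_one(c) for c in tool_calls))

tool_calls = [
    {"id": "call_1", "type": "function",
     "function": {"name": "get_current_weather", "arguments": "{\"location\": \"Tokyo\"}"}},
    {"id": "call_2", "type": "function",
     "function": {"name": "get_stock_price", "arguments": "{\"ticker\": \"SONY\"}"}},
]
tool_messages = asyncio.run(execute_tool_calls(tool_calls))
```

This pattern is only safe when the calls are genuinely independent. Calls with ordering dependencies, such as confirming an order before charging for it, should be awaited one at a time instead.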
Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.
This lesson explores lesson 4 — examining the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to lesson 4.