How production AI systems catalog, discover, and manage the tools available to an agent at runtime.
In March 2023, Salesforce engineers publicly documented a critical incident with their Einstein GPT pilot: tool invocations were silently failing because the agent's internal tool catalog had grown to over 200 entries with no deduplication or versioning. Two tools named get_account_data existed simultaneously — one fetching from a legacy Oracle database, one from the new Salesforce Data Cloud. The model called whichever appeared first in the list. The fix required a full registry redesign, introducing namespacing (e.g., legacy.get_account_data vs cloud.get_account_data), version pinning, and a priority resolver. The incident is cited in Salesforce's internal AI governance documentation as the origin of their "Registry First" policy.
A tool registry is the authoritative index that an AI agent consults to understand what capabilities it has access to. At minimum it maps a tool name to a callable — a function, API endpoint, or subprocess. In production systems the registry is far richer: it stores the tool's JSON Schema definition, version history, access permissions, rate limits, and telemetry hooks.
The registry pattern emerged from microservices. Netflix's Eureka service registry (2012) and HashiCorp's Consul (2014) solved exactly the same problem for distributed services that AI tool registries solve today: how does a consumer dynamically discover what services exist, which version to call, and whether they are healthy? AI frameworks like LangChain, AutoGen, and OpenAI's Assistants API have each re-invented this wheel with varying degrees of rigor.
A registry is not just a Python dict mapping strings to functions. It is a structured catalog with schema, versioning, ownership, and lifecycle metadata attached to every entry.
The three core registry operations are: register (add a tool with its full metadata), resolve (given a name or capability query, return the correct callable and its schema), and deregister (safely remove a tool without breaking agents mid-session). Production registries add a fourth: introspect, which allows the agent itself to query what tools are available and what they do — a capability essential for self-directed planning.
Every tool entry in a well-designed registry carries at minimum: a globally unique identifier (typically namespace/name@version), the JSON Schema for all input parameters, the JSON Schema for the return value, a natural language description optimised for the model to understand (not for humans), an owner identifier for accountability, and a health status field updated by periodic probes.
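The entry fields and the four operations described above can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular framework implements a registry: the names `ToolEntry` and `ToolRegistry` are hypothetical, health is a plain boolean rather than a probe result, and `deregister` does not model in-flight sessions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolEntry:
    namespace: str
    name: str
    version: str
    description: str            # written for the model, not for humans
    input_schema: dict[str, Any]
    output_schema: dict[str, Any]
    owner: str
    handler: Callable[..., Any]
    healthy: bool = True        # a real system would update this from probes

    @property
    def tool_id(self) -> str:
        # Globally unique identifier in namespace/name@version form.
        return f"{self.namespace}/{self.name}@{self.version}"


class ToolRegistry:
    def __init__(self) -> None:
        self._entries: dict[str, ToolEntry] = {}

    def register(self, entry: ToolEntry) -> None:
        # Refuse silent overwrites, the duplicate-name failure mode.
        if entry.tool_id in self._entries:
            raise ValueError(f"already registered: {entry.tool_id}")
        self._entries[entry.tool_id] = entry

    def resolve(self, tool_id: str) -> ToolEntry:
        entry = self._entries[tool_id]  # KeyError if the id is unknown
        if not entry.healthy:
            raise RuntimeError(f"tool is unhealthy: {tool_id}")
        return entry

    def deregister(self, tool_id: str) -> None:
        self._entries.pop(tool_id, None)

    def introspect(self) -> list[dict[str, str]]:
        # What an agent would see when it asks which tools exist.
        return [{"id": e.tool_id, "description": e.description}
                for e in self._entries.values()]
```

With this shape, legacy/get_account_data@1.0.0 and cloud/get_account_data@1.0.0 are distinct keys, so two tools sharing a short name can no longer shadow each other.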
OpenAI's Assistants API tool definitions, Anthropic's tool_use blocks, and LangChain's BaseTool class each implement subsets of this model. None implement all of it out of the box — production teams extend them.
The LLM never receives the full registry. A context builder selects the relevant subset of tools based on the current task, then serialises just their name and description into the prompt. This keeps context windows manageable and reduces the model's tendency to call irrelevant tools.
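A context builder of this kind can be very small. The sketch below assumes each tool record carries a hypothetical `tags` list and that the task has already been classified into a set of tags; real systems may select tools quite differently.

```python
def build_tool_context(task_tags: set[str], tools: list[dict], limit: int = 5) -> str:
    """Serialise name and description for tools whose tags match the task."""
    selected = [t for t in tools if task_tags & set(t["tags"])][:limit]
    # Only name and description reach the prompt; everything else stays in the registry.
    return "\n".join(f"- {t['name']}: {t['description']}" for t in selected)


TOOLS = [
    {"name": "crm.get_account", "description": "Fetch a customer account record",
     "tags": ["crm"], "owner": "team-a"},
    {"name": "billing.refund", "description": "Issue a refund to a customer",
     "tags": ["billing"], "owner": "team-b"},
    {"name": "mail.send", "description": "Send an email message",
     "tags": ["messaging"], "owner": "team-c"},
]

prompt_section = build_tool_context({"billing"}, TOOLS)
```

Here `prompt_section` contains a single line for billing.refund, and metadata such as the owner never enters the context window.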
LangChain's StructuredTool class and @tool decorator produce BaseTool objects that are collected into a list and passed to an agent executor. This is a flat list: no namespacing, no versioning. LangChain's toolkits add grouping but are still a thin wrapper. Microsoft's AutoGen framework introduced a more structured approach in version 0.2, where tools are registered to individual agents rather than a global pool, enabling per-agent capability scoping. Semantic Kernel (also Microsoft) goes further with a KernelPlugin abstraction that groups related functions under a plugin name with shared configuration, the closest existing framework implementation to a true registry.
Google's Vertex AI Agent Builder uses a tool registry backed by Cloud Endpoints, where each tool is a deployed Cloud Run service with an OpenAPI spec. Registration means publishing the spec to a central registry; the agent runtime resolves tools at inference time via HTTP. This is the fully decoupled, service-oriented pattern — the gold standard for enterprise deployments but with significant operational overhead.
Design a tool registry data model and analyze real-world registry decisions.
You're the AI infrastructure lead at a fintech company. Your agent system currently has 150+ tools stored in a flat Python list with no versioning. You need to design a production registry. Work through these tasks with the AI:
Why JSON Schema is the contract between your model and your tools — and what happens when that contract breaks.
In October 2023, Stripe published a post-mortem on their internal LLM-powered reconciliation agent. The agent had been invoking a create_refund tool with an amount field that accepted both integers (cents) and floats (dollars), depending on which version of the internal SDK was installed. The schema declared "type": "number" without constraints. The model, trained on examples using floats, sent 42.50 when the production endpoint expected 4250. Over a three-day period before detection, 847 refunds were issued at 100× their intended value. Total exposure: $1.2M. Stripe's corrective schema added "type": "integer", "minimum": 1, "maximum": 99999999 and a discriminator field specifying the unit.
Every tool exposed to an LLM must have a JSON Schema that specifies not just the types of its parameters but their constraints. The model uses the schema both to generate valid calls and to understand the tool's semantics. A schema that says "type": "string" for a date field is dangerous — the model will produce "tomorrow", "next Friday", or "2024-01-15" depending on context, and your tool will handle only one of them.
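As an illustration of constraining the date field, the sketch below pairs a pattern-constrained schema fragment with the check a tool might run on its side. The `DATE_PARAM` name and the choice of ISO 8601 calendar dates are assumptions made for the example.

```python
import re
from datetime import date

# Hypothetical parameter schema: a string is only accepted in YYYY-MM-DD form.
DATE_PARAM = {
    "type": "string",
    "pattern": r"^\d{4}-\d{2}-\d{2}$",
    "description": "Calendar date in ISO 8601 form, e.g. 2024-01-15",
}


def is_valid_date_arg(value: str) -> bool:
    """Pattern check first, then a real-calendar check the regex cannot express."""
    if not re.fullmatch(DATE_PARAM["pattern"], value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True
```

Under this schema "tomorrow" and "next Friday" are rejected at the pattern stage, and an impossible date such as 2024-02-30 is rejected by the calendar check.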
JSON Schema's constraint vocabulary gives you the tools to be precise: enum for categorical fields, pattern for regex-validated strings, minimum/maximum for numeric ranges, minLength/maxLength for strings, and required to declare which fields must always be present. In OpenAI's function calling spec, the strict: true mode enforces that the model only produces keys declared in the schema — a critical safety property for production.
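To make the constraint vocabulary concrete, here is a deliberately tiny checker for scalar values. It covers only the keywords named above, treats "integer" as a Python int (a simplification of the JSON Schema rule), and is a teaching sketch rather than a substitute for a real validator; `CURRENCY_SCHEMA` is a hypothetical example.

```python
import re
from typing import Any

AMOUNT_SCHEMA = {"type": "integer", "minimum": 1, "maximum": 99999999}
CURRENCY_SCHEMA = {"type": "string", "enum": ["usd", "eur", "gbp"]}  # hypothetical


def violated_keywords(schema: dict[str, Any], value: Any) -> list[str]:
    """Return the constraint keywords that `value` violates (small subset only)."""
    is_num = isinstance(value, (int, float)) and not isinstance(value, bool)
    is_int = isinstance(value, int) and not isinstance(value, bool)
    is_str = isinstance(value, str)
    type_ok = {"integer": is_int, "number": is_num, "string": is_str}
    declared = schema.get("type")
    if declared in type_ok and not type_ok[declared]:
        return ["type"]  # remaining keywords are not meaningful on the wrong type
    failed = []
    if "enum" in schema and value not in schema["enum"]:
        failed.append("enum")
    if is_num and "minimum" in schema and value < schema["minimum"]:
        failed.append("minimum")
    if is_num and "maximum" in schema and value > schema["maximum"]:
        failed.append("maximum")
    if is_str and "pattern" in schema and not re.search(schema["pattern"], value):
        failed.append("pattern")
    if is_str and "minLength" in schema and len(value) < schema["minLength"]:
        failed.append("minLength")
    if is_str and "maxLength" in schema and len(value) > schema["maxLength"]:
        failed.append("maxLength")
    return failed
```

A bare {"type": "number"} schema accepts 42.50 without complaint, while the constrained amount schema reports a type violation for the same value.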
Always use additionalProperties: false alongside required. Without it, the model may inject extra fields that your tool silently ignores, masking bugs and creating audit trail gaps. OpenAI's strict mode requires additionalProperties: false on every object in the schema and rejects schemas that omit it; with Anthropic's tool_use you set it explicitly yourself.
Schema validation should occur at three distinct points in the tool call pipeline. First, at schema authoring time: use a JSON Schema validator (such as ajv in Node or jsonschema in Python) to check the schema itself and to flag unknown keywords, because a schema with a typo in a constraint keyword silently does nothing. Second, at call interception time: before executing any tool call, validate the model's output JSON against the schema. Reject malformed calls with an error message the model can use to self-correct. Third, at tool execution time: the tool itself should validate inputs independently, treating the model as an untrusted caller.
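The authoring-time check is the one most often skipped, so here is a minimal sketch of it. The `KNOWN_KEYWORDS` set is deliberately partial and the `lint_schema` function is hypothetical; it only illustrates how a misspelled keyword can be caught before the schema ships.

```python
# Partial keyword list for illustration; a real lint would cover the full vocabulary.
KNOWN_KEYWORDS = {
    "type", "properties", "required", "additionalProperties", "items",
    "enum", "const", "pattern", "format", "minimum", "maximum",
    "exclusiveMinimum", "exclusiveMaximum", "minLength", "maxLength",
    "minItems", "maxItems", "description", "title", "default", "$ref",
}


def lint_schema(schema: dict, path: str = "#") -> list[str]:
    """Report keywords outside the known set, recursing into properties and items."""
    problems = [f"{path}: unknown keyword '{k}'"
                for k in schema if k not in KNOWN_KEYWORDS]
    for name, sub in schema.get("properties", {}).items():
        problems += lint_schema(sub, f"{path}/properties/{name}")
    if isinstance(schema.get("items"), dict):
        problems += lint_schema(schema["items"], f"{path}/items")
    return problems


draft = {
    "type": "object",
    "properties": {"amount": {"type": "integer", "minimun": 1}},  # typo on purpose
    "required": ["amount"],
}
findings = lint_schema(draft)
```

The misspelled "minimun" would otherwise be ignored by a validator, leaving the amount field with no lower bound at all.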
Authoring-time pitfall: $ref without resolution.
Call interception: run jsonschema.validate() and return a structured error on failure, not a raw exception.
Anthropic's Claude models handle schema validation errors gracefully when given well-structured feedback. In testing documented by Anthropic in August 2023, providing a validation error in the format {"error": "validation_failed", "field": "amount", "constraint": "minimum", "received": -50, "expected": ">= 0"} caused the model to self-correct on the next call 94% of the time, versus 61% with a plain English error message.
Return validation errors as structured JSON, not strings. Include: the failing field path, the constraint that was violated, the received value, and what was expected. This gives the model specific, actionable information to self-correct without a human in the loop.
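A call-interception check that produces errors in this shape might look like the following. It is a sketch under stated assumptions: it handles only required, additionalProperties, and minimum, and the `validate_call` helper is hypothetical rather than part of any library.

```python
from typing import Any


def _error(field: str, constraint: str, received: Any, expected: str) -> dict[str, Any]:
    return {"error": "validation_failed", "field": field,
            "constraint": constraint, "received": received, "expected": expected}


def validate_call(schema: dict[str, Any], args: dict[str, Any]) -> list[dict[str, Any]]:
    """Check a tool call's arguments and return structured, JSON-serialisable errors."""
    errors = []
    props = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in args:
            errors.append(_error(field, "required", None, "field must be present"))
    if schema.get("additionalProperties") is False:
        for field, value in args.items():
            if field not in props:
                errors.append(_error(field, "additionalProperties", value,
                                     "no undeclared fields"))
    for field, value in args.items():
        minimum = props.get(field, {}).get("minimum")
        if minimum is not None and isinstance(value, (int, float)) and value < minimum:
            errors.append(_error(field, "minimum", value, f">= {minimum}"))
    return errors


REFUND_SCHEMA = {
    "type": "object",
    "properties": {"amount": {"type": "integer", "minimum": 0}},
    "required": ["amount"],
    "additionalProperties": False,
}
```

The returned list can be passed through json.dumps and handed back to the model as the tool result, so the failing field and constraint are machine-readable.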
Tool schemas change. Parameters get added, renamed, or removed as the underlying system evolves. The registry must manage schema versions explicitly. The guiding principle is borrowed from API design: additive changes (new optional fields) are backwards-compatible; breaking changes (renaming required fields, tightening constraints) require a version bump.
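The additive-versus-breaking rule can be checked mechanically when a new schema is submitted to the registry. The sketch below is intentionally narrow: it detects removed or renamed fields, changed types, and newly required fields, and it does not detect tightened constraints such as a raised minimum. The `breaking_changes` function is a hypothetical helper.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """List reasons `new` is not backwards-compatible with `old` (empty means additive)."""
    reasons = []
    old_props, new_props = old.get("properties", {}), new.get("properties", {})
    old_req, new_req = set(old.get("required", [])), set(new.get("required", []))
    for name in old_props:
        if name not in new_props:
            reasons.append(f"removed or renamed field: {name}")
        elif old_props[name].get("type") != new_props[name].get("type"):
            reasons.append(f"type changed: {name}")
    for name in sorted(new_req - old_req):
        reasons.append(f"newly required field: {name}")
    return reasons


V1 = {"properties": {"amount": {"type": "integer"}}, "required": ["amount"]}
V1_ADDITIVE = {"properties": {"amount": {"type": "integer"},
                              "note": {"type": "string"}}, "required": ["amount"]}
V2_RENAMED = {"properties": {"amount_cents": {"type": "integer"}},
              "required": ["amount_cents"]}
```

A registry could use a non-empty result to insist on a version bump before accepting the new schema.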
When a breaking schema change is necessary, the registry should maintain both versions simultaneously during a migration window. Agents pinned to the old version continue to function; new agents are onboarded to the new version. Anthropic's internal tooling guidelines (published in their Model Spec documentation) recommend a minimum 30-day migration window for production tools accessed by externally-deployed agents. After the window, the old schema is marked deprecated but not removed — retained for audit purposes.
Write and critique JSON Schemas for high-stakes tool definitions.
You're auditing the tool schemas in a healthcare AI agent that can book appointments, access patient records, and send prescription requests. Every schema weakness is a patient safety risk. Work through these with the AI:
create_prescription_request tool: include every constraint you can justify.
{"medication": {"type": "string"}, "dosage": {"type": "number"}, "patient_id": {"type": "string"}}
create_prescription_request. What fields are required, what constraints do you apply, and why is each constraint clinically justified?
How agents discover and load tools at runtime, without redeployment and without breaking everything.
In February 2024, Notion announced their AI assistant had been upgraded to support "contextual tools" — capabilities that load dynamically based on which databases and integrations a workspace has configured. Their engineering blog post described the architecture: a tool manifest file per workspace, fetched fresh at each session start, specifying which of Notion's ~40 available tool modules to load. Before this, all 40 tools were injected into every session, consuming roughly 3,200 tokens of context. With dynamic loading, the average workspace loads 7 tools, consuming ~560 tokens — an 83% reduction in tool-related context overhead. Session latency dropped by 340ms on average due to shorter context processing time at the model layer.
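The figures in this account are internally consistent, which is easy to check. The calculation below assumes every tool costs the same number of tokens, an assumption the account implies but does not state.

```python
ALL_TOOLS = 40
ALL_TOOLS_TOKENS = 3200
LOADED_TOOLS = 7

# Assumes a uniform per-tool cost, which the quoted totals imply.
tokens_per_tool = ALL_TOOLS_TOKENS / ALL_TOOLS        # 80 tokens per tool
dynamic_tokens = LOADED_TOOLS * tokens_per_tool       # 560 tokens for 7 tools
reduction = 1 - dynamic_tokens / ALL_TOOLS_TOKENS     # about 0.825, quoted as 83%
```

So 7 tools at 80 tokens each gives the quoted 560 tokens, and the saving of roughly 82.5% matches the rounded 83% figure.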
Static tool loading — injecting a fixed set of tools at agent initialization — is fine for prototypes and small deployments. It breaks down at scale for three reasons: context window pressure (50+ tools consume thousands of tokens), relevance noise (irrelevant tools increase off-target calls), and deployment coupling (adding or updating a tool requires redeploying the agent).
Dynamic loading solves all three. Tools are registered in the central registry; the agent fetches only what it needs for the current task or session. The loader can be triggered at multiple granularities: session-level (load tools based on user role and workspace), task-level (load tools based on the classified intent of the current request), or turn-level (load tools mid-conversation as the agent's plan evolves). Each granularity trades loading latency for precision.
Turn-level dynamic loading gives the most precise tool sets but adds a registry lookup latency on every model turn — typically 20–80ms for a local registry, 80–300ms for a remote one. For latency-sensitive applications, session-level loading with a larger initial set is usually the better trade.
There are three dominant patterns for how an agent discovers which tools to load. The first is manifest-driven discovery: a configuration file (JSON, YAML, or a database record) explicitly lists the tools available to a given agent, user, or workspace. This is the Notion pattern — simple, auditable, but requires manual maintenance as the tool catalog grows.
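Manifest-driven discovery reduces to a lookup against the catalog. The manifest layout below (a `workspace` key and a `tools` list) and the tool names are invented for illustration and are not taken from any published format.

```python
import json


def load_tools_from_manifest(manifest_json: str, catalog: dict[str, dict]) -> list[dict]:
    """Return the catalog entries a manifest lists, failing loudly on unknown names."""
    manifest = json.loads(manifest_json)
    names = manifest["tools"]
    missing = [n for n in names if n not in catalog]
    if missing:
        raise KeyError(f"manifest references unknown tools: {missing}")
    return [catalog[n] for n in names]


# Hypothetical catalog and manifest.
CATALOG = {
    "search_pages": {"name": "search_pages", "description": "Full-text page search"},
    "query_database": {"name": "query_database", "description": "Filter database rows"},
    "create_page": {"name": "create_page", "description": "Create a new page"},
}
MANIFEST = '{"workspace": "acme", "tools": ["search_pages", "create_page"]}'

session_tools = load_tools_from_manifest(MANIFEST, CATALOG)
```

Failing on unknown names is a design choice: a stale manifest is surfaced at session start instead of producing a silently smaller tool set.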
The second is capability-based discovery: the agent (or its orchestrator) classifies the incoming task using a lightweight classifier or embedding similarity search, then queries the registry for tools tagged with matching capability labels. LangChain's VectorStoreToolkit uses a variant of this, embedding tool descriptions and retrieving the top-k most semantically similar tools for a given query. OpenAI's Assistants API introduced file search and code interpreter as dynamically-attached tools in 2023, following this pattern.
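The retrieval step in capability-based discovery can be illustrated without an embedding model. In the sketch below a bag-of-words vector stands in for a real embedding, so the ranking is far cruder than a production system's; only the top-k cosine-similarity structure is the point.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k_tools(query: str, descriptions: dict[str, str], k: int = 2) -> list[str]:
    """Rank tool names by similarity between the query and each description."""
    q = embed(query)
    ranked = sorted(descriptions,
                    key=lambda name: cosine(q, embed(descriptions[name])),
                    reverse=True)
    return ranked[:k]


DESCRIPTIONS = {
    "search_flights": "search available flights between two airports",
    "send_email": "send an email message to a recipient",
    "get_weather": "get the weather forecast for a city",
}
best = top_k_tools("find flights between airports", DESCRIPTIONS, k=1)
```

Swapping `embed` for a call to a real embedding model would leave `cosine` and `top_k_tools` unchanged, provided the vectors are adapted to that model's output type.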
The third pattern, graph-based discovery, is the most sophisticated. Tools declare metadata about which other tools they depend on (e.g., a book_flight tool requires search_flights and get_user_payment_method to be available). The registry resolves a directed dependency graph and loads the complete required set. This pattern is documented in Adept.ai's ACT-1 system architecture from their 2022 technical report.
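Resolving the dependency graph is a depth-first traversal with cycle detection. The sketch below reuses the book_flight example; the extra `get_user_profile` dependency is hypothetical and added only to show transitive resolution.

```python
def resolve_load_set(requested: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Return `requested` plus all transitive dependencies, dependencies first."""
    order: list[str] = []
    done: set[str] = set()
    in_progress: set[str] = set()

    def visit(tool: str) -> None:
        if tool in done:
            return
        if tool in in_progress:
            raise ValueError(f"dependency cycle at {tool}")
        in_progress.add(tool)
        for dep in depends_on.get(tool, []):
            visit(dep)
        in_progress.discard(tool)
        done.add(tool)
        order.append(tool)  # appended only after all of its dependencies

    visit(requested)
    return order


DEPENDS_ON = {
    "book_flight": ["search_flights", "get_user_payment_method"],
    "get_user_payment_method": ["get_user_profile"],  # hypothetical extra edge
}
load_set = resolve_load_set("book_flight", DEPENDS_ON)
```

Returning dependencies before dependents means the loader can register tools in list order without forward references.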
Tool schemas are expensive to re-fetch on every turn. Use a two-level cache: an in-process LRU cache for the current session (invalidated on schema version change), and a distributed cache (Redis, Memcached) shared across agent instances. Cache invalidation should be event-driven — the registry publishes a version-change event when a schema is updated, triggering cache purges.
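The two-level arrangement can be sketched with the standard library alone. In this illustration a plain dict stands in for the distributed cache (Redis or Memcached in the text), `fetch` is whatever function performs the registry round trip, and the `SchemaCache` class and its method names are hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class SchemaCache:
    """In-process LRU in front of a shared cache (a dict stands in for Redis here)."""

    def __init__(self, shared: dict, fetch: Callable[[str], dict], capacity: int = 128):
        self._local = OrderedDict()
        self._shared = shared
        self._fetch = fetch
        self._capacity = capacity

    def get(self, tool_id: str) -> dict:
        if tool_id in self._local:
            self._local.move_to_end(tool_id)       # mark as most recently used
            return self._local[tool_id]
        schema = self._shared.get(tool_id)
        if schema is None:
            schema = self._fetch(tool_id)          # registry round trip
            self._shared[tool_id] = schema
        self._local[tool_id] = schema
        if len(self._local) > self._capacity:
            self._local.popitem(last=False)        # evict least recently used
        return schema

    def on_version_change(self, tool_id: str) -> None:
        """Event handler: purge both levels so the next get() refetches."""
        self._local.pop(tool_id, None)
        self._shared.pop(tool_id, None)
```

In a multi-instance deployment every agent instance would run `on_version_change` against its own local level when the registry's event arrives, while the shared level needs purging only once.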
One of the most valuable properties of a dynamic loading system is the ability to update tool schemas without restarting agents. This requires careful session-level versioning. When a schema update is published, in-flight agent sessions should complete using the version they started with; new sessions pick up the updated version. This is the same principle as rolling deployments in Kubernetes — drain existing connections before fully switching over.
In practice this means each tool reference stored in an agent's session context includes the schema version hash, not just the tool name. The executor resolves the callable using both name and version. The registry garbage-collects old schema versions only after confirming no active sessions reference them — a pattern borrowed from garbage-collected language runtimes and implemented in systems like Temporal's workflow versioning.
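Pinning by version hash and collecting unreferenced versions can be sketched as follows. This is an in-memory illustration under simplifying assumptions: session tracking is a set of ids, the hash is a truncated SHA-256 of the canonical JSON, and the `VersionedSchemas` class is hypothetical.

```python
import hashlib
import json


def schema_hash(schema: dict) -> str:
    """Stable short hash of a schema; key order does not affect the result."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


class VersionedSchemas:
    def __init__(self) -> None:
        self._versions: dict[tuple[str, str], dict] = {}
        self._latest: dict[str, str] = {}
        self._pins: dict[tuple[str, str], set[str]] = {}

    def publish(self, name: str, schema: dict) -> str:
        version = schema_hash(schema)
        self._versions[(name, version)] = schema
        self._latest[name] = version
        return version

    def pin(self, session_id: str, name: str) -> tuple[str, str]:
        """A new session pins whatever version is latest when it starts."""
        ref = (name, self._latest[name])
        self._pins.setdefault(ref, set()).add(session_id)
        return ref

    def resolve(self, ref: tuple[str, str]) -> dict:
        return self._versions[ref]

    def end_session(self, session_id: str) -> None:
        for sessions in self._pins.values():
            sessions.discard(session_id)

    def collect_garbage(self) -> list[tuple[str, str]]:
        """Drop versions that are neither latest nor pinned by a live session."""
        dead = [ref for ref in self._versions
                if self._latest[ref[0]] != ref[1] and not self._pins.get(ref)]
        for ref in dead:
            del self._versions[ref]
        return dead
```

A session that started before an update keeps resolving its pinned reference until it ends, and only then does the old version become eligible for collection.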
Architect a dynamic tool loading system for a multi-tenant AI platform.
You're building a multi-tenant AI coding assistant. Each customer organization has different integrations: some use GitHub, some GitLab, some both; some have Jira, some Linear, some neither. You have 60 tools covering all combinations. Design the dynamic loading system:
This lesson explores Lesson 4, plugin sandboxing: the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to plugin sandboxing.