In May 2023, Google announced function calling support in the Gemini API. The capability was not new — OpenAI had released it in June 2023 with GPT-4, and earlier research showed similar patterns — but the framing revealed something important. The model does not execute the function. It returns a structured object describing which function to call and with what arguments. Your application code does the actual work and returns the result.
This asymmetry is the entire foundation of safe, controllable tool use. The model reasons about intent; your infrastructure manages execution.
Function calling in Vertex AI follows a deterministic loop. You provide a set of function declarations alongside your prompt. These declarations are JSON Schema objects describing each tool's name, purpose, parameters, and parameter types. The model reads them as part of its context and decides — based on the user's message — whether a function call is appropriate.
If the model determines a function is needed, it returns a function call response instead of a text answer. This response contains the function name and a JSON object of arguments. Your code then executes the actual function, collects the result, and passes it back to the model as a function response in the next turn. The model uses that result to produce a final natural language answer.
The loop: user message → model requests function → your code runs function → result returned to model → model answers user. Nothing executes without your code in the middle.
The model selects which function to call — and what arguments to supply — based almost entirely on the description fields you write. A vague description like "gets product info" will produce inconsistent routing. A precise description like "Returns current list price in USD and available inventory count for a single product, identified by its 12-character SKU" gives the model enough signal to route correctly even in ambiguous situations.
Parameter descriptions matter equally. Specifying "The ISO 4217 three-letter currency code, e.g. USD, EUR, GBP" prevents the model from passing "dollars" when your API expects "USD". Every token in your function declaration is part of the model's decision context.
The Gemini model never touches your database, API key, or external service. It produces a structured intent. Your application code is the only thing that executes real actions. This design is intentional — it gives you complete control over authorization, validation, rate limiting, and error handling before anything external is called.
Vertex AI exposes three function calling modes through the tool_config parameter. AUTO (the default) lets the model decide whether to call a function or respond with text. ANY forces the model to always call one of the provided functions — useful when you need guaranteed structured output. NONE disables function calling entirely even if tools are declared, useful for testing baseline text responses.
The ANY mode with allowed_function_names is particularly powerful: you can constrain the model to call exactly one specific function, effectively using function calling as a structured extraction mechanism rather than autonomous tool selection.
Model decides based on context. May respond with text or call a function. Appropriate for general-purpose agents where either response type is valid.
Model must call one of the declared functions. Use for structured data extraction, form filling, or when you require machine-readable output every time.
Gemini 1.5 and later models support parallel function calling — the model can return multiple function call objects in a single response when it determines that several independent calls are needed. For example, if a user asks for a weather report and stock price simultaneously, the model may request both in one turn rather than sequentially. Your application handles each call in whatever order suits your infrastructure, then returns all results before the model generates its answer.
This capability significantly reduces latency in multi-tool agents. Google's internal benchmarks for Gemini 1.5 Pro showed that parallel function calling reduced average round-trip time by roughly 40% on tasks requiring three or more independent tool calls — compared to strictly sequential architectures.
Each function declaration consumes tokens in the model's context window. Keep declarations concise but complete. In production agents with large tool libraries (50+ functions), Google recommends using Vertex AI Extensions or dynamic tool selection to avoid exhausting context on unused declarations.
You're building an e-commerce agent that needs to call three backend services: inventory lookup, order status, and product recommendations. Your challenge is to write function declarations whose descriptions are precise enough to prevent misrouting.
Work with the AI instructor to design declarations, test edge cases, and understand what makes descriptions reliable vs. ambiguous.
At Google Cloud Next 2024, several enterprise customers presented production Vertex AI agents running with 20–40 declared tools. The most cited failure pattern was not hallucination — it was argument drift: the model correctly identified the right function but passed slightly wrong parameter formats, causing API errors that cascaded into unhelpful user responses. The fix in most cases was not model-side — it was stricter JSON Schema validation in the function declarations and explicit enum lists for constrained fields.
When an agent needs more than a handful of tools, the organization of your Tool objects becomes architecturally significant. Vertex AI allows passing multiple Tool objects in a single request, each containing up to 64 FunctionDeclarations. Grouping related functions within the same Tool object helps the model reason about toolsets as coherent capability clusters — e.g., all financial operations in one Tool, all customer data operations in another.
Keep in mind that every declaration in every Tool consumes tokens. Google recommends auditing your tool library regularly: if a function has never been called in 10,000 production turns, its description may be too similar to another function's, or the use case may not actually arise in your user base. Remove or merge it.
Your function execution code will fail. APIs time out, schemas change, permissions expire. The question is what you pass back to the model when execution fails. Returning a bare null or empty string produces vague model responses. Returning a structured error object — with an error code, a human-readable message, and ideally a suggested recovery action — gives the model enough context to respond helpfully.
A model in AUTO mode can enter a loop: it calls a function, gets a result, decides it needs another function call, gets another result, and so on indefinitely. Without guardrails, this consumes both tokens and money. Production Vertex AI agents should always implement a max_tool_calls counter in the orchestration loop — most teams set this between 5 and 15 depending on task complexity.
When the limit is reached, force the model to respond in text mode by setting tool_config to NONE for the final turn. This ensures users receive a response even if the agent couldn't fully complete the task.
Shopify's internal Vertex AI agents (described in their 2024 engineering blog) implemented a "tool budget" per conversation turn — maximum 8 function calls before forced text response. Combined with structured error returns, this reduced infinite-loop incidents by over 90% compared to early deployments without turn limits.
Never trust model-generated arguments directly. Even with excellent function declarations, models occasionally produce arguments that pass JSON Schema validation but fail at the semantic level — e.g., a date range where end_date precedes start_date, or a quantity that's negative. Your execution layer should perform semantic validation before calling external APIs.
Use enum fields liberally in your JSON Schema. If a parameter accepts only three valid values, declare them as an enum. The model will almost always respect enums, dramatically reducing the argument drift problem documented in production deployments.
Validate argument types, ranges, enums, and semantic constraints. Log the raw model-generated call object. Set a per-turn tool call budget. Check authorization for the requested action.
Return structured success or error objects — never null. Log the result. Check if the model is looping by comparing recent function call history. Enforce the tool call budget limit.
You're debugging a customer service agent that has 12 declared tools and is occasionally looping — calling functions repeatedly without producing a response. You need to design the orchestration logic to prevent this.
Work through the error handling strategy, tool budget design, and structured error return format with the AI instructor.
When Duet AI (now Gemini for Google Workspace) was extended with external API connectivity in late 2023, Google's engineering team published a technical post describing the "translation layer" problem: real-world APIs return data in shapes the model hasn't been trained to reason about efficiently. A weather API might return 47 fields; the model only needs 4 of them to answer the user's question. Without a transformation step, you're wasting context tokens on noise and potentially confusing the model with irrelevant data.
The solution — truncate and transform API responses before returning them to the model — became a standard pattern in Vertex AI agent architectures.
The cardinal rule: API keys and OAuth tokens must never appear in function declarations, system prompts, or any content visible to the model. The model's context window is logged and may be subject to various monitoring systems. Store credentials in Secret Manager on Google Cloud and access them in your execution layer — the code that runs between the model's function call response and the actual API request.
For Google Cloud APIs called from a Vertex AI agent, use Application Default Credentials via a service account. Assign the minimum IAM roles required. For third-party APIs (Stripe, Salesforce, Slack), store tokens in Secret Manager and retrieve them at execution time. Never interpolate tokens into the function declaration schema itself.
Most external APIs return far more data than the model needs. A REST endpoint for a customer record might return 80+ fields. Your transformation function should extract the 5–10 fields relevant to the agent's task and return a clean, flat JSON object. This has three benefits: it reduces token consumption, it prevents the model from latching onto irrelevant fields, and it protects sensitive data (PII, internal IDs, financial details) from entering the model's context.
Define your transformation functions as part of the same module as your function declarations — it forces you to think about input/output contracts explicitly when writing descriptions.
External APIs rate-limit requests. Your execution layer needs exponential backoff with jitter — not simple fixed-interval retries — to avoid thundering herd problems in multi-user agent deployments. For Vertex AI agents specifically, use the tenacity library or Cloud Tasks for retry orchestration rather than synchronous blocking retries, which degrade user experience and consume model turn budgets.
Return a structured rate_limited error with an estimated retry_after_seconds field when rate limits are hit. The model can then inform the user of the delay rather than silently failing.
In January 2024, a well-documented prompt injection attack against a commercial agent demonstrated that malicious content from an external API response could instruct the agent to call additional functions with attacker-controlled arguments. Always sanitize API responses before returning them to the model — strip markup, limit string lengths, and validate that response content matches expected schema types.
Many agent function calls within a single conversation retrieve the same data — user profile, account balance, product catalog. Implement an in-conversation cache keyed on function name plus argument hash. On Vertex AI, the recommended pattern uses a simple Python dict within the agent session scope, cleared at conversation end. For cross-session caching of slow/expensive lookups, use Cloud Memorystore (Redis).
Agentic tasks at Wayfair (documented in their 2024 ML engineering blog) found that simple in-session caching of product lookup calls reduced API spending by 34% in their customer service agent — because the same SKU was often queried 3–5 times within a single complex order-assistance conversation.
Reference data (product info, user profiles), slow external lookups, any data that doesn't change during the conversation scope. Cache key: function name + sorted argument hash.
Live inventory, real-time pricing, transaction results, anything that must reflect the current state of the world. Stale cache misses here cause incorrect agent responses.
Set aggressive timeouts on external API calls — 2–5 seconds for synchronous function calls within an agent turn. Users expect conversational-speed responses; a 30-second API call destroys that experience. If a needed API is reliably slow, move its invocation to an asynchronous pattern using Vertex AI Extensions with background task support, or restructure the agent to use a "processing" acknowledgment + webhook pattern.
You're integrating a Salesforce CRM API into a Vertex AI agent. The Salesforce contact endpoint returns 60+ fields. You need to design: (1) a response transformation that returns only what the model needs, (2) a caching strategy for in-conversation re-queries, and (3) confirm your credential storage approach.
Work through the design decisions with the AI instructor. You'll be challenged on edge cases — what if the user asks for a field you're not returning? What if the cache is stale?
At Google I/O 2024, Google announced the general availability of Grounding with Google Search for Vertex AI. The feature routes the model's information needs to live Google Search results, then cites sources in the response. For enterprise deployments in legal, finance, and healthcare, this addressed a critical gap: agents that need current, verifiable information rather than model knowledge with an arbitrary training cutoff date.
The announcement coincided with the release of Vertex AI Extensions — a framework for registering, versioning, and serving custom tools to agents without managing function declarations manually in application code.
Enabling Google Search grounding in Vertex AI is a one-line change to your generation config. When grounding is active, the model can retrieve and cite live search results for queries that require current information. The response includes a grounding_metadata field containing the search queries issued and the source URLs used.
Search grounding is not free — it incurs separate per-query pricing. At scale, implement grounding selectively: use a classifier or keyword filter to identify queries that require current information (news, prices, regulatory changes) versus queries answerable from model knowledge alone.
Vertex AI's built-in Code Interpreter tool allows the model to write and execute Python code in a sandboxed environment. The model generates code, the tool executes it, and the output is returned to the model for incorporation into its response. This is particularly powerful for data analysis, mathematical computations, chart generation, and any task requiring deterministic computation rather than language model estimation.
Code Interpreter runs in an isolated execution environment — it cannot access the internet, your filesystem, or external APIs. This constraint is a security feature. If you need the model to perform computation on data retrieved from your APIs, retrieve the data with a custom function call, then pass it to Code Interpreter for analysis.
Deutsche Bank's Vertex AI-powered financial analyst agent (described in a Google Cloud case study, Q1 2024) uses Code Interpreter to perform portfolio calculations after retrieving position data via custom function calls to their internal trading systems. The pattern: function call retrieves raw data → Code Interpreter performs computation → model formats and explains the result. This avoids floating-point errors that would occur if the model performed arithmetic in natural language generation.
The Extensions framework provides a managed way to register and version tool definitions centrally, rather than embedding FunctionDeclarations in application code. An Extension is a registered resource in your Vertex AI project — you define it once with an OpenAPI spec, and multiple agents can reference it by resource name without duplicating declaration code.
Extensions support authentication configs that handle OAuth 2.0, API key, and service account auth transparently — your agent code never touches credentials. Google manages the token refresh lifecycle. This is particularly valuable for enterprise deployments with dozens of tool integrations maintained by different teams.
Tool library exceeds 15 declarations. Multiple agents share the same tools. Different teams own different integrations. You need centralized versioning and rollback. Auth management should be handled by infrastructure, not application code.
Fewer than 10 tools. Single agent, single team. Rapid prototyping. Tool definitions change frequently during development. You need fine-grained control over which tools appear in which conversation contexts.
A single Vertex AI agent can use Google Search grounding, Code Interpreter, and custom FunctionDeclarations simultaneously. Pass all tools in a single list to the tools parameter. The model will select among them based on the task at hand — searching for current information, computing against retrieved data, or calling your custom APIs as needed.
The practical constraint is context length: each active tool declaration consumes tokens. Measure the token overhead of your tool set with count_tokens() before production deployment. If overhead exceeds 10% of your context budget for typical queries, prune the tool list or adopt the Extensions framework with dynamic tool selection.
Before finishing this module, map your planned agent's tool set across four categories: (1) built-in Google tools (Search, Code Interpreter), (2) Google Cloud service integrations, (3) third-party API function declarations, (4) internal API function declarations. Each category has different auth, versioning, and error-handling patterns. Treating them uniformly is a common source of production incidents.
You're designing a financial research agent for an investment firm. It needs to: (1) search for current market news, (2) retrieve portfolio positions from an internal API, (3) perform quantitative analysis on the retrieved data, and (4) summarize findings. This requires all three tool types — Search grounding, custom function calls, and Code Interpreter.
Work with the AI instructor to design the tool selection logic, token budget, and sequencing for a typical research query.