Module 2 · Lesson 1

BigQuery as an Agent Tool

How AI agents connect to Google's serverless data warehouse — architecture, authentication, and the anatomy of a tool call.

What does it actually mean for an agent to "query BigQuery," and how does the plumbing work under the hood?

When Google announced Duet AI for BigQuery at Cloud Next 2023, the headline feature was natural-language-to-SQL generation inside the console. Behind that feature sat a critical architectural decision: the LLM would never touch production data directly. Every data interaction passed through a typed, auditable tool interface — an early public demonstration of the agent-tool pattern at enterprise scale.

Why BigQuery is the Default Agent Data Backend

BigQuery processes over 110 petabytes of data per day across Google's internal and customer workloads. For agents built on Vertex AI, it is the natural first-choice analytical backend for three structural reasons: serverless execution (no connection pools to manage), columnar storage (fast aggregations over huge tables), and IAM-native access control (service accounts grant granular, auditable permissions without storing credentials in agent code).

Unlike a traditional database where a client holds a persistent connection, BigQuery operates via REST and gRPC APIs. Each query is a stateless job submission — which maps cleanly onto the stateless tool-call model agents use. An agent fires a tool call, BigQuery runs the job asynchronously, the agent polls or awaits the result, and the conversation continues.

The Tool Call Anatomy

In LangChain's BigQuery integration and in Vertex AI Agent Builder's native connectors, a BigQuery tool is registered with a JSON schema describing its inputs (SQL string, project ID, dataset, optional timeout) and outputs (rows as JSON, schema metadata, job statistics). The LLM never writes to this schema — it only reads the spec, produces a valid call, and interprets the structured response.

# Simplified Vertex AI / LangChain BigQuery tool registration
from langchain_google_community import BigQueryToolkit
from langchain_google_vertexai import ChatVertexAI

llm = ChatVertexAI(model_name="gemini-1.5-pro")
toolkit = BigQueryToolkit(
    project="my-project",
    dataset="analytics",
    llm=llm
)
tools = toolkit.get_tools()
# tools contains: QueryBigQueryTool, ListBigQueryTablesTool,
#                  GetBigQueryTableSchemaTool

Three tools are typically exposed. ListBigQueryTablesTool lets the agent discover what tables exist. GetBigQueryTableSchemaTool returns column names and types before the agent commits to a query. QueryBigQueryTool actually executes SQL and returns rows. This three-step discovery pattern — list, inspect, query — prevents the agent from hallucinating column names, a common failure mode when schema context is absent.

Authentication: Service Accounts and ADC

Agents running on Vertex AI infrastructure authenticate to BigQuery via Application Default Credentials (ADC). When deployed to Cloud Run, Vertex AI Workbench, or a Vertex AI Agent, the runtime environment automatically provides a service account identity. That service account must be granted roles/bigquery.dataViewer on the dataset and roles/bigquery.jobUser on the project — and nothing more. The principle of least privilege is enforced at the IAM layer, not inside agent code.

Architecture Pattern

In Google's own internal deployments of Duet AI for BigQuery (documented in the 2023 Cloud Next technical session "Duet AI: Under the Hood"), every agent-initiated query runs under a dedicated service account with row-level access controls enforced via BigQuery's column-level security and row access policies. The LLM receives only the rows it is authorized to see — the access boundary is set at the data layer, not the prompt layer.

Key Concepts

Tool SchemaJSON definition of a tool's inputs and outputs that the LLM uses to decide when and how to invoke the tool — never written to at runtime.

ADCApplication Default Credentials — Google Cloud's mechanism for resolving credentials from the runtime environment without hardcoding secrets.

Job-Based QueryBigQuery's async execution model where each SQL submission creates a trackable job with metadata, cost estimates, and row-level results.

Schema DiscoveryThe agent pattern of calling list-tables → get-schema before running SQL, reducing hallucinated column names.

Coming Up

Lesson 2 dives into SQL generation strategies — how Gemini constructs queries, what prompting patterns reduce errors, and how the agent handles multi-step queries that require intermediate results.

Lesson 1 Quiz

BigQuery as an Agent Tool — 4 questions

Which three tools does a standard BigQuery toolkit expose to a Vertex AI agent?

Correct. The standard trio is ListBigQueryTablesTool, GetBigQueryTableSchemaTool, and QueryBigQueryTool — enabling a discover-then-query pattern.

Not quite. The standard toolkit deliberately omits write tools (create/insert) and exposes: list tables, get schema, and execute query.

Why does BigQuery's job-based execution model map well onto the agent tool-call pattern?

Correct. Agent tool calls are stateless invocations, and BigQuery's REST/gRPC job submission is stateless — no persistent connections required on either side.

Incorrect. The key alignment is statelessness — BigQuery jobs are discrete submissions, just like tool calls. Agents do not hold persistent connections.

What two IAM roles are minimally required for an agent's service account to query BigQuery?

Correct. dataViewer grants read access to table data; jobUser grants the ability to run query jobs. Together they are the least-privilege baseline.

Not right. The least-privilege pair is roles/bigquery.dataViewer (on dataset) plus roles/bigquery.jobUser (on project).

In Google's Duet AI for BigQuery architecture, where is the access boundary enforced — at the prompt layer or the data layer?

Correct. Per Google's Cloud Next 2023 technical session, access is enforced at the data layer via BigQuery's native row access policies and column security — the LLM only receives authorized rows.

Incorrect. Google's documented approach places the access boundary at the data layer (BigQuery row/column security), not in the prompt.

Lab 1: BigQuery Tool Architecture

Chat with your AI lab assistant about registering and securing BigQuery tools in a Vertex AI agent

Lab Objective

You are designing a Vertex AI agent that will query a BigQuery dataset containing customer transaction records. Work through the architecture decisions: tool registration, IAM setup, and schema discovery flow.

Starter prompt: "I need to register a BigQuery tool for my Vertex AI agent. My dataset is called 'transactions' in project 'retail-analytics-prod'. Walk me through the setup, and what IAM roles should I assign to the service account?"

BigQuery Tools Lab

AI Assistant

Welcome to Lab 1. I'm your lab assistant for BigQuery tool architecture on Vertex AI. Ask me about tool registration, IAM configuration, schema discovery patterns, or the anatomy of a BigQuery tool call. What would you like to explore first?

Module 2 · Lesson 2

SQL Generation Strategies

How Gemini constructs BigQuery SQL — prompting patterns, schema injection, and multi-step query decomposition.

What makes the difference between an agent that writes correct SQL and one that confidently hallucinates column names?

At Google I/O 2024, the Gemini for Google Cloud team demonstrated a live natural-language analytics session in BigQuery Studio. An analyst typed: "Show me the top 10 products by revenue last quarter, broken down by region." The agent issued three sequential tool calls before writing a single byte of SQL: it listed available tables, fetched schemas for two candidate tables, then synthesized a JOIN query. The SQL was syntactically valid on the first attempt — a result the team attributed to schema-in-context, not model size alone.

Schema Injection: The Most Important Variable

The single biggest predictor of SQL correctness in LLM-generated queries is whether the full relevant schema is present in the context window at generation time. A study published by Google Research in 2023 ("Can LLMs Generate Correct SQL?") found that Gemini Pro's SQL accuracy on BigQuery benchmarks improved from 61% to 89% when schema context was injected — a 28-point gain without any fine-tuning.

In practice, agents inject schema by first calling GetBigQueryTableSchemaTool for each candidate table, then including the JSON schema response verbatim in the system prompt for the SQL-generation step. The token cost is modest — a typical BigQuery table schema is 200–800 tokens — and the accuracy gain is substantial.

# Schema injection pattern in a LangChain BigQuery agent
from langchain.prompts import ChatPromptTemplate

SYSTEM_PROMPT = """You are a BigQuery SQL expert.
Available tables and their schemas:
{schema_context}

Rules:
- Always qualify table names: `project.dataset.table`
- Use LIMIT on exploratory queries
- Prefer APPROX_COUNT_DISTINCT over COUNT(DISTINCT) for large tables
- Never use SELECT * — enumerate columns explicitly
"""

# schema_context is populated by GetBigQueryTableSchemaTool calls
# before the LLM generates any SQL

Multi-Step Query Decomposition

Complex analytical questions rarely map to a single SQL statement. Agents that attempt one-shot SQL for multi-step logic (e.g., "find customers who bought product A but not B in the last 90 days, then rank them by lifetime value") frequently produce syntactically valid but logically wrong queries. The more robust pattern is query decomposition: the agent breaks the question into sub-queries, executes each as a separate tool call, and passes intermediate results back into the context before constructing the final query.

Decompose — LLM identifies the logical sub-problems (e.g., "first find the product cohort, then join to LTV table")

Execute sub-queries — each sub-problem becomes a targeted BigQuery tool call returning a small result set

Re-inject results — intermediate row counts and sample data are added to context before the final synthesis step

Synthesize — LLM writes the final query (or directly answers the question from the intermediate data)

BigQuery-Specific SQL Patterns Agents Must Know

Standard SQL generation prompts trained on generic databases produce subtly wrong BigQuery SQL. The most common failures are use of backtick-free table references (valid in some engines, rejected by BigQuery's parser), missing project qualifiers, and using CURRENT_DATE() syntax from MySQL rather than BigQuery's CURRENT_DATE (no parentheses).

Pattern	Generic SQL (Wrong in BQ)	BigQuery Correct
Table reference	FROM orders o	FROM `proj.dataset.orders` o
Current date	WHERE date = CURRENT_DATE()	WHERE date = CURRENT_DATE
Approx distinct	COUNT(DISTINCT user_id)	APPROX_COUNT_DISTINCT(user_id)
Array unnest	UNNEST(items) AS item	UNNEST(items) AS item WITH OFFSET pos
Date truncation	DATE_TRUNC('month', dt)	DATE_TRUNC(dt, MONTH)

Prompting Tip

Include three to five BigQuery-specific syntax reminders in the system prompt whenever the agent may generate SQL. This costs roughly 100 tokens and eliminates the most common syntax errors observed in production Vertex AI agent deployments.

Key Concepts

Schema InjectionAdding full table schema to the LLM context before SQL generation — the highest-impact technique for query correctness.

Query DecompositionBreaking complex questions into sequential sub-queries, each executed independently, with results fed back into context.

Syntax GroundingIncluding BigQuery-specific syntax rules in the system prompt to override generic SQL patterns the model may have learned.

Lesson 2 Quiz

SQL Generation Strategies — 4 questions

According to Google Research (2023), how much did schema injection improve Gemini Pro's SQL accuracy on BigQuery benchmarks?

Correct. The documented improvement was 61% → 89% (28 points) without any fine-tuning — purely from schema context injection.

Incorrect. The documented figure is 61% → 89% — a 28-point accuracy gain from schema injection alone, with no fine-tuning.

In BigQuery SQL, how is the CURRENT_DATE function correctly written?

Correct. BigQuery uses CURRENT_DATE without parentheses — unlike MySQL's CURRENT_DATE() — a subtle difference that breaks agent-generated queries.

Incorrect. BigQuery's syntax is CURRENT_DATE (no parentheses). CURRENT_DATE() is MySQL syntax and will cause a parse error in BigQuery.

What is the recommended alternative to COUNT(DISTINCT col) for large BigQuery tables?

Correct. APPROX_COUNT_DISTINCT uses HyperLogLog++ sketches for near-exact counts at a fraction of the cost and query time.

Incorrect. The correct function is APPROX_COUNT_DISTINCT — it uses HyperLogLog++ internally and is dramatically faster than exact COUNT(DISTINCT) on large tables.

Why is one-shot SQL generation often insufficient for complex analytical questions?

Correct. Complex questions (e.g., cohort + LTV analysis) require validating intermediate results — decomposition into sub-queries gives the agent data to reason from before synthesizing a final answer.

Incorrect. The issue is logical correctness — complex multi-step logic needs intermediate validation. Breaking questions into sub-queries lets the agent check results before synthesizing a final query.

Lab 2: SQL Generation & Schema Injection

Practice schema injection patterns and multi-step query decomposition with your AI lab assistant

Lab Objective

You are building an agent that answers natural-language questions about an e-commerce BigQuery dataset. Work through schema injection setup and practice decomposing a complex analytical question into sub-queries.

Starter prompt: "My agent needs to answer: 'Which product categories had declining revenue for three consecutive months in 2023?' Walk me through how to decompose this into sub-queries and what schema context I need to inject before generating SQL."

SQL Generation Lab

AI Assistant

Welcome to Lab 2. I'll help you work through SQL generation strategies for Vertex AI BigQuery agents — schema injection, query decomposition, and BigQuery-specific syntax patterns. What analytical problem would you like to tackle?

Module 2 · Lesson 3

Cost Control and Query Safety

Preventing runaway query costs, enforcing dry-run validation, and building guardrails that protect production data from agent mistakes.

An autonomous agent submitting arbitrary SQL to a petabyte-scale warehouse — what could go wrong, and how do you prevent it?

In late 2022, a major US retailer (documented in a BigQuery community forum post by a Google Cloud engineer) suffered a $47,000 single-day BigQuery bill when an internal automation script submitted a full-table scan on a 12TB events table inside a feedback loop. The script had no byte-limit guard and no dry-run step. The incident directly influenced Google's decision to make dry-run validation a default step in the Duet AI for BigQuery architecture — before any agent-generated query executes, BigQuery's dry-run API estimates bytes scanned and the agent checks against a configured threshold.

BigQuery's Dry-Run API

Every BigQuery query job can be submitted in dry-run mode — the query is parsed and optimized but not executed, and the response includes an estimate of bytes that would be processed. This costs nothing. For agents, dry-run is the mandatory first step before submitting any LLM-generated SQL to production tables.

# Dry-run validation before execution
from google.cloud import bigquery

def safe_query(client, sql, max_bytes=10_737_418_240):  # 10 GB default
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_run = client.query(sql, job_config=job_config)
    estimated = dry_run.total_bytes_processed
    
    if estimated > max_bytes:
        raise ValueError(
            f"Query would scan {estimated/1e9:.1f} GB — exceeds {max_bytes/1e9:.0f} GB limit."
        )
    # Only execute if within budget
    return client.query(sql).result()

Five Guardrail Layers for Agent-Generated SQL

Production deployments at Google Cloud enterprise customers typically implement five distinct guardrail layers, each catching a different failure class:

Structural Guards

Require LIMIT clause on all SELECT queries
Block DDL statements (CREATE, DROP, ALTER)
Block DML statements (INSERT, UPDATE, DELETE)
Reject queries referencing tables not in the approved list

Cost Guards

Dry-run byte estimation before execution
Per-agent daily byte quota via custom metadata
BigQuery project-level custom quotas (max bytes billed)
Partition filter requirements on large tables

Partition Filter Enforcement

BigQuery allows tables to be marked with requirePartitionFilter — any query that does not include a filter on the partition column is rejected at parse time, before any bytes are scanned. For agent-generated SQL over date-partitioned event tables (a common pattern in analytics), this is the single most effective cost control: an agent that forgets a WHERE clause on the date column gets a hard error rather than a $10,000 bill.

Google's own documentation recommends this setting for any partitioned table larger than 100GB that an agent may query. It is set at table creation or via ALTER TABLE DDL and is enforced by BigQuery's query planner, not by application code — meaning it cannot be bypassed even if the agent's guardrail logic has a bug.

Real Cost Data

BigQuery charges $5 per TB scanned (on-demand pricing as of 2024). A single full-table scan on a 12TB events table costs $60 — and agents in feedback loops can submit dozens of queries per minute. Dry-run + partition filters + project-level max bytes billed together cap worst-case agent costs to a predictable ceiling regardless of what SQL the LLM generates.

Query Result Size Controls

Even a syntactically correct, cost-efficient query can overwhelm an agent's context window if it returns millions of rows. The standard pattern is to enforce a maximum result row count in the QueryBigQueryTool wrapper — returning at most 1,000 rows by default — and to have the agent summarize or aggregate before returning large result sets to the user. LangChain's built-in BigQueryTool enforces a 1,000-row default and surfaces the total row count so the agent can request aggregated queries if needed.

Key Concepts

Dry-Run ModeBigQuery API flag that estimates bytes scanned without executing — zero cost, mandatory for agent-generated SQL validation.

requirePartitionFilterTable-level setting that forces all queries to include a partition column filter — enforced by BigQuery's query planner, not application code.

Max Bytes BilledProject or query-level quota that hard-kills any job exceeding the byte scan limit before charges accrue.

Lesson 3 Quiz

Cost Control and Query Safety — 4 questions

What does BigQuery's dry-run mode return, and what does it cost?

Correct. Dry-run returns an estimate of bytes processed (total_bytes_processed field) at zero cost — the query is parsed and planned but never executed.

Incorrect. Dry-run mode returns a bytes-scanned estimate at zero cost. No data is read and no charge is incurred.

Why is requirePartitionFilter superior to application-level guardrails for cost control?

Correct. requirePartitionFilter is enforced at the query planner level — no application code path can circumvent it, making it a true hard stop rather than a soft guardrail.

Incorrect. The key advantage is enforcement location: requirePartitionFilter is enforced by BigQuery itself, not application code, so agent bugs cannot bypass it.

At BigQuery's 2024 on-demand pricing, approximately how much does a full scan of a 12TB table cost?

Correct. At $5/TB, a 12TB full scan costs $60. In an agent feedback loop submitting dozens of queries per minute, this can compound rapidly into large bills.

Incorrect. At $5 per TB on-demand: 12TB × $5 = $60 per full table scan. An agent in a loop can compound this very quickly.

What is the standard maximum row count enforced by LangChain's built-in BigQueryTool to protect the agent's context window?

Correct. LangChain's BigQueryTool enforces a 1,000-row default limit and surfaces the total row count, prompting the agent to request aggregated queries when needed.

Incorrect. The standard default is 1,000 rows — enough for meaningful samples while protecting the context window from being overwhelmed.

Lab 3: Cost Control & Query Guardrails

Design dry-run validation, partition filter enforcement, and cost budgets for a production BigQuery agent

Lab Objective

Your agent will query a 15TB event tracking table partitioned by day. Design the complete guardrail stack: dry-run threshold, partition filter requirement, max bytes billed, and result row cap. Discuss trade-offs with your assistant.

Starter prompt: "I have a 15TB events table partitioned by event_date. My agent needs to answer daily trend questions. Design a guardrail stack that keeps worst-case daily query costs under $50 while still allowing meaningful analysis. Walk me through each layer."

Cost Control Lab

AI Assistant

Welcome to Lab 3. I'll help you design a production-grade cost and safety guardrail stack for BigQuery agents. We'll work through dry-run thresholds, partition filter enforcement, project quotas, and result size controls. What's your starting scenario?

Module 2 · Lesson 4

Streaming Results and Grounding Responses

Streaming BigQuery results into agent context, grounding LLM answers in actual data, and surfacing provenance so users can trust the output.

How does an agent turn a BigQuery result set into a trustworthy, verifiable answer — and how do you make sure the LLM doesn't embellish what it finds?

Wayfair, in a 2023 Google Cloud blog post co-authored with the Vertex AI team, described deploying a natural-language analytics agent over their BigQuery product catalog and sales data. The team's critical insight was that users stopped trusting the agent when it paraphrased query results — they wanted to see the exact rows that drove the answer. The solution was a grounding layer: every agent response that referenced data included a formatted table of the supporting rows, with the BigQuery job ID appended so analysts could re-run and audit the query directly in the console.

Streaming BigQuery Results into Context

For small result sets (under ~200 rows), the standard pattern is to convert BigQuery rows directly to JSON or a Markdown table and inject them into the LLM's context as a tool response. The LLM then synthesizes its answer from this grounded data rather than from its parametric knowledge — a technique called Retrieval-Augmented Generation (RAG) over structured data.

For larger result sets, the agent uses BigQuery's Storage Read API to stream results in Arrow format, applying aggregations or sampling before injecting into context. The BigQuery Storage Read API (launched 2019, now generally available) delivers data at 10x the throughput of the REST export API — enabling real-time agent workflows over large tables without staging to Cloud Storage first.

# Convert BigQuery result rows to grounded context
import json

def format_bq_result_for_context(rows, max_rows=50):
    """Format BigQuery rows as structured context for LLM grounding."""
    row_list = [dict(row) for row in rows][:max_rows]
    total = len(row_list)
    context = {
        "source": "bigquery",
        "row_count": total,
        "data": row_list,
        "note": f"Showing {total} rows from BigQuery result."
    }
    return json.dumps(context, default=str, indent=2)

# Inject into LLM system prompt as grounded evidence
# Agent instruction: "Answer ONLY from the data in the context.
#                    Do not extrapolate or add information."

The Grounding Instruction Pattern

Without explicit grounding instructions, LLMs frequently extrapolate beyond the returned data — inferring trends from small samples, adding context from training data, or smoothing over gaps. The documented production pattern is to add a strict grounding instruction to the synthesis step:

Grounding System Prompt Template

Answer ONLY using the data provided in the [DATA] section below. If the data does not contain enough information to answer the question, say so explicitly. Do not use your training knowledge to fill gaps. Cite the number of rows your answer is based on. If asked for a trend, only assert a trend if at least three data points support it.

Provenance: BigQuery Job IDs as Citations

Every BigQuery query job returns a unique job ID (format: project:region.job_XXXXXXXX). Including this ID in the agent's response gives analysts a direct audit trail — they can paste the job ID into the BigQuery console and see the exact SQL, execution time, bytes processed, and result set. This is the structured-data equivalent of citation links in a RAG document answer.

The Wayfair implementation referenced above exposed job IDs in a collapsible "Data Sources" section in their internal analytics chat UI, allowing data engineers to immediately audit any agent-generated insight. According to their blog post, this single feature reduced analyst trust questions from approximately 40% of sessions to under 5% — users with audit access simply checked the job rather than questioning the output.

Handling No-Data and Partial-Data Cases

Agents must be explicitly instructed on two failure modes. Empty result sets — where the query executes successfully but returns zero rows — should prompt the agent to acknowledge the absence of data and suggest why (wrong date range, mis-spelled filter value, data not yet loaded). Partial result sets — where the row limit was hit — should prompt the agent to disclose that the answer is based on a sample, not the full data, and offer to run an aggregated query instead.

Good Agent Response Pattern

States the data source (table name)
Cites row count used for analysis
Includes BigQuery job ID
Flags if result was sampled/truncated
Declines to extrapolate beyond data

Anti-Patterns to Avoid

Asserting trends from 2–3 data points
Adding context from training knowledge
Rounding numbers without disclosure
Presenting sampled results as complete
Ignoring empty result sets silently

Key Concepts

RAG over Structured DataInjecting BigQuery result rows into LLM context as grounded evidence, causing the LLM to answer from data rather than parametric knowledge.

Job ID ProvenanceIncluding the BigQuery job ID in agent responses, giving analysts a direct audit link to the exact SQL and data that drove the answer.

Grounding InstructionExplicit system prompt directive restricting the LLM to answer only from provided data, suppressing extrapolation from training knowledge.

Storage Read APIBigQuery's high-throughput Arrow-format data streaming API, enabling real-time agent ingestion of large result sets without Cloud Storage staging.

Lesson 4 Quiz

Streaming Results and Grounding Responses — 4 questions

According to Wayfair's 2023 Google Cloud blog post, what feature reduced analyst trust questions from ~40% of sessions to under 5%?

Correct. Per Wayfair's documented case study, exposing job IDs in a collapsible "Data Sources" section allowed analysts to audit queries directly, dramatically reducing trust challenges.

Incorrect. The documented trust-building feature was BigQuery job ID provenance — analysts with audit access could verify the exact SQL rather than questioning the output.

What data format does BigQuery's Storage Read API use for high-throughput streaming?

Correct. The BigQuery Storage Read API delivers data in Apache Arrow format, enabling 10x the throughput of the REST export API for real-time agent workflows.

Incorrect. The BigQuery Storage Read API uses Apache Arrow columnar format — this enables the high-throughput streaming needed for real-time agent result ingestion.

What should an agent do when a BigQuery query executes successfully but returns zero rows?

Correct. Empty result sets are informative — the agent should explicitly acknowledge the zero-row result and help the user diagnose why (date range, filters, pipeline latency).

Incorrect. The agent should explicitly acknowledge empty results and suggest diagnostic causes. Using training knowledge to fill gaps violates the grounding principle.

What is the core purpose of a grounding instruction in the agent's system prompt for the SQL synthesis step?

Correct. Grounding instructions explicitly constrain the LLM to the provided data — without them, models frequently extrapolate trends or add context from parametric training knowledge.

Incorrect. Grounding instructions tell the LLM: answer ONLY from the data provided. Their purpose is suppressing training-knowledge extrapolation, not SQL formatting.

Lab 4: Grounding and Provenance

Build grounding instructions, format result sets for LLM context, and design job-ID provenance for an analytics agent

Lab Objective

You are building the synthesis layer of a BigQuery analytics agent. Design the grounding instruction, result formatting function, and provenance output format. Consider how the agent should handle empty and truncated result sets.

Starter prompt: "I want my analytics agent to show users the data behind its answers and include a way to audit the query. Design the grounding instruction and the response format — including how to expose the BigQuery job ID and disclose when results were truncated."

Grounding & Provenance Lab

AI Assistant

Welcome to Lab 4. I'll help you design grounding instructions, result formatting, and provenance patterns for BigQuery agents. We'll cover how to constrain LLM synthesis to actual query data, how to surface job IDs, and how to handle edge cases like empty or truncated results. What would you like to work through?

Module 2 Test

Querying BigQuery from Agents — 15 questions · Pass at 80% (12/15)

1. Which BigQuery tool in the standard LangChain toolkit allows an agent to discover available tables before writing SQL?

Correct. ListBigQueryTablesTool is the first tool in the discover-then-query pattern.

Incorrect. The correct tool is ListBigQueryTablesTool — it lists available tables before the agent commits to writing SQL.

2. Application Default Credentials (ADC) resolve credentials from which source when an agent runs on Cloud Run?

Correct. ADC automatically uses the service account attached to the Cloud Run service — no credential files or environment variables needed.

Incorrect. ADC on Cloud Run uses the service account attached to the service at deploy time — automatically, without credential files.

3. In Google's Duet AI for BigQuery architecture, does the LLM directly access production data?

Correct. The LLM never touches production data directly — all access flows through the tool interface, providing type safety and an audit trail.

Incorrect. The Duet AI architecture explicitly prevents direct LLM data access — all interactions go through a typed tool interface.

4. What does schema injection primarily prevent in agent-generated SQL?

Correct. When the LLM lacks schema context, it invents plausible-sounding column names that don't exist — schema injection eliminates this failure mode.

Incorrect. Schema injection's primary purpose is preventing hallucinated column names — the LLM needs real schema context to generate valid SQL.

5. What is the correct BigQuery Standard SQL syntax for date truncation to month?

Correct. BigQuery's DATE_TRUNC takes the date expression first, then the date part — opposite of some other SQL dialects.

Incorrect. BigQuery's syntax is DATE_TRUNC(date_expression, date_part) — note the argument order differs from MySQL/PostgreSQL.

6. In query decomposition, what happens between executing sub-queries and synthesizing the final answer?

Correct. Re-injecting intermediate results into context is the key step — it gives the LLM real data to reason from before constructing the final synthesis.

Incorrect. The critical step is re-injection: intermediate row counts and sample data go back into the LLM context before final synthesis.

7. Which IAM role allows a service account to submit query jobs in a BigQuery project?

Correct. roles/bigquery.jobUser grants the ability to run query jobs. It is required alongside roles/bigquery.dataViewer for a read-only agent setup.

Incorrect. roles/bigquery.jobUser specifically grants the ability to submit and run query jobs in a project.

8. What happens when requirePartitionFilter is set on a table and a query omits the partition column filter?

Correct. BigQuery's query planner rejects the query immediately — no data is scanned, no cost is incurred, and no application-level guardrail needs to be correct.

Incorrect. requirePartitionFilter causes BigQuery to reject the query at parse time — before execution, before scanning, before any cost.

9. At $5/TB on-demand pricing, what is the cost of ten full scans of a 5TB table?

Correct. 5TB × $5/TB = $25 per scan × 10 scans = $250. An agent in a feedback loop without byte guards can accumulate this in minutes.

Incorrect. 5TB × $5 = $25 per scan × 10 = $250 total. This illustrates why byte-limit guards are critical for agents.

10. What is the purpose of including the BigQuery job ID in an agent's response?

Correct. Job IDs are the provenance link — analysts can audit the exact SQL, timing, cost, and data that produced the agent's answer.

Incorrect. Job IDs serve as provenance citations — paste one into the BigQuery console to see the exact query, execution details, and results.

11. The BigQuery Storage Read API delivers data approximately how much faster than the REST export API?

Correct. Google's documentation states the Storage Read API delivers ~10x the throughput of the REST export API, enabling real-time agent workflows.

Incorrect. Google's Storage Read API documentation cites approximately 10x throughput improvement over the REST export API.

12. When a BigQuery query returns a truncated result set (row limit hit), what should a well-designed agent do?

Correct. Transparency about sampling is a core grounding principle — the agent should disclose the truncation and offer an aggregated alternative.

Incorrect. Presenting truncated results as complete violates grounding principles. The agent must disclose the truncation and suggest aggregated alternatives.

13. Which SQL generation issue is addressed by including BigQuery-specific syntax rules in the system prompt?

Correct. Models trained on mixed SQL datasets carry MySQL/PostgreSQL syntax patterns that break in BigQuery — syntax grounding overrides these with BigQuery-specific rules.

Incorrect. The issue is cross-dialect contamination — models learn MySQL/Postgres syntax patterns that are invalid in BigQuery. Syntax grounding corrects this.

14. In the Wayfair analytics agent case study, what was the primary reason analysts stopped trusting the agent before the fix?

Correct. Users wanted to see the exact rows driving the answer — paraphrased responses without data provenance eroded trust regardless of accuracy.

Incorrect. The documented issue was paraphrased results without raw data visibility. Users couldn't verify whether the summary accurately represented the underlying data.

15. What is the recommended minimum number of data points required before an agent asserts a trend, per the grounding instruction template in Lesson 4?

Correct. The grounding template specifies: "only assert a trend if at least three data points support it" — preventing trend claims from two-point extrapolation.

Incorrect. The grounding instruction template specifies at least three data points before the agent may assert a trend — two points define a line, not a trend.