Module 5 · Lesson 1

The Multi-Cloud Reality

Why enterprises operate across AWS, Azure, and Google Cloud simultaneously — and what that means for agentic data pipelines.

How do organizations actually structure data flows when their stack spans three different cloud providers?

In 2022, Spotify's engineering blog documented a concrete tension: the company's ML training infrastructure ran on Google Cloud, its recommendation serving layer sat in AWS, and a legacy licensing data system lived in Azure. Every model refresh required orchestrating three separate authentication contexts, three billing boundaries, and three sets of networking constraints — all before a single prediction reached a listener.

This was not unusual. By 2024, Flexera's State of the Cloud report found that 87% of enterprises used multiple public clouds simultaneously. The question was never whether multi-cloud would happen — it was how to build agentic workflows that could treat cross-cloud data sources as first-class inputs rather than awkward exceptions.

Why Multi-Cloud Happens (Not Why You'd Choose It)

Multi-cloud architectures almost never result from a clean architectural decision. They accumulate. An acquisition brings an Azure-native analytics stack. A machine learning team standardizes on SageMaker before the platform team commits to Vertex AI. A regulatory requirement mandates that certain data never leave a specific geography where only one provider has compliant regions.

Understanding this history matters for agentic pipelines because the data sources your agents must reach were designed for different principals, different auth systems, and different network models. The agent can't assume any shared control plane.

Common Pattern

Acquisition-Driven

Acquired company retains its cloud. AWS S3 buckets or Azure Data Lake Gen2 stores land in the portfolio without migration budget.

Common Pattern

Best-of-Breed Services

Teams choose Redshift for warehousing, BigQuery for analytics, Azure Synapse for Power BI integration — each independently justified.

Common Pattern

Regulatory Isolation

GDPR data residency rules or FedRAMP requirements force specific workloads to specific clouds regardless of architecture preference.

Common Pattern

Vendor Risk Hedge

Procurement teams diversify deliberately to prevent single-vendor lock-in on critical infrastructure. Both AWS and GCP contracts are maintained.

What Agentic Pipelines Face

A Vertex AI agent reading from BigQuery can rely on a single IAM model, unified logging, and shared VPC networking. The moment that agent also needs to read from Amazon S3, it encounters a parallel universe: IAM roles are AWS-native, credentials are signed with SigV4, and access policies live in a completely separate console.

The agent itself doesn't care — it calls a tool, the tool returns data. But the engineering work that makes that tool safe, monitored, and auditable must bridge two fundamentally different identity and access frameworks. This lesson establishes the conceptual model; subsequent lessons address each connection pattern in depth.

Workload Identity Federation

Google Cloud mechanism allowing a GCP service account to assume identities in external systems (AWS, Azure AD) without storing long-lived credentials. Central to cross-cloud agent auth.

BigQuery Omni

Google-managed compute deployed in AWS and Azure regions that runs BigQuery SQL against data in S3 or ADLS Gen2 without moving the data to GCP. Introduced in 2021, GA since 2022.

Data Gravity

The tendency for applications and processing to migrate toward where data already lives, due to egress costs and latency. A key reason cross-cloud query (rather than copy) matters.

OIDC Token Exchange

The protocol underlying Workload Identity Federation — GCP issues a short-lived OIDC token that external cloud IAM services accept in exchange for cloud-native credentials.

The Three Architectural Approaches

When connecting an agentic workflow on Google Cloud to external cloud data sources, three architectural patterns emerge. They are not mutually exclusive — most production deployments combine all three depending on data volume, latency requirements, and cost constraints.

Federated Query (No Data Movement)

BigQuery Omni or Vertex AI Pipelines query data in AWS S3 / Azure ADLS Gen2 directly via managed compute in the source cloud region. Data never crosses cloud boundaries. Best for large, rarely-updated datasets where egress costs would be prohibitive.

Scheduled Replication (Controlled Movement)

Dataflow or Datastream pipelines copy data from S3 or Azure Event Hub into Google Cloud Storage or BigQuery on a cadence. The agent always reads from GCP-native sources. Best for low-latency queries and transformation-heavy workloads.

Real-Time Event Bridge (Streaming Movement)

Events from AWS Kinesis or Azure Event Hubs are forwarded in near-real-time to Pub/Sub, then processed by Dataflow and made available to agents. Best for operational data where freshness within seconds matters.

Real-World Reference

When Twitter (now X) migrated parts of its infrastructure in 2022, engineering teams publicly documented the challenge of maintaining data pipelines that spanned both AWS and GCP simultaneously during the transition. Kafka clusters on AWS fed Dataflow jobs on GCP — the event bridge pattern at production scale.

Identity Is the Hard Problem

Network connectivity between clouds is solved: VPN, Interconnect, or simply HTTPS over the public internet. The genuinely hard problem is identity. How does a Vertex AI agent, running as a GCP service account, prove to AWS IAM or Azure AD that it has permission to read a specific S3 bucket or Azure Blob container?

The naive answer — store AWS access keys or an Azure client secret in Secret Manager — works but creates a credential management burden and eliminates the auditability benefits of federated identity. The preferred answer is Workload Identity Federation, which we explore in depth in Lesson 2.

Key Insight

Multi-cloud data access for agents is fundamentally an identity problem dressed as a networking problem. Once you have credential federation working, the actual data retrieval is standard API calls. Most of the engineering work lives in the auth layer.

Lesson 1 Quiz

The Multi-Cloud Reality · 4 questions

According to Flexera's 2024 State of the Cloud report cited in the lesson, what percentage of enterprises used multiple public clouds simultaneously?

Correct. Flexera's report found 87% of enterprises operating across multiple public clouds — confirming that multi-cloud is the norm, not the exception.

Not quite. The figure was 87%, from Flexera's State of the Cloud 2024 report, underlining that multi-cloud is nearly universal at enterprise scale.

What is the primary reason the lesson argues that multi-cloud architectures are difficult for agentic pipelines — even when network connectivity is available?

Correct. Identity is the core challenge. Each cloud has its own IAM, credential signing, and policy model — the agent must operate across all of them without a shared control plane.

The lesson frames identity — not latency or cost — as the hard problem. Network connectivity between clouds is considered solved; credential federation is the genuine engineering challenge.

BigQuery Omni enables cross-cloud querying by:

Correct. BigQuery Omni deploys Google-managed compute inside AWS and Azure regions, running BigQuery SQL against data stored there without moving that data to GCP.

BigQuery Omni avoids copying data. It deploys managed compute inside the source cloud region, so data never crosses cloud boundaries during the query.

Which architectural pattern is best suited for a Vertex AI agent that needs financial transaction data from Azure Event Hub with freshness requirements of under 10 seconds?

Correct. When freshness within seconds is required, the real-time event bridge pattern — forwarding from Azure Event Hub to Pub/Sub — is the appropriate choice.

A 10-second freshness requirement rules out scheduled replication and federated batch queries. The real-time event bridge pattern (Event Hub → Pub/Sub → Dataflow) is designed for exactly this scenario.

Lab 1: Mapping a Multi-Cloud Architecture

Conversational lab · discuss cross-cloud patterns with an AI assistant

Scenario

Your organization runs analytics on Google Cloud (BigQuery + Vertex AI) but acquired a company whose entire data estate lives in AWS S3, plus a legacy HR system that writes to Azure SQL Database. You need to design an agentic pipeline that can pull from all three sources to generate a weekly workforce analytics report.

Discuss your architectural choices with the AI assistant. Consider: which pattern fits each source, how you'll handle identity, and what the egress cost implications are. Aim for at least 3 exchanges to complete the lab.

Architecture Advisor

Vertex AI Data Agents · M5 Lab 1

Hello! I'm your cross-cloud architecture advisor for this lab. You're designing a Vertex AI agentic pipeline that needs to read from three sources: AWS S3 (acquired company data), Azure SQL Database (HR system), and BigQuery (your existing analytics). Let's work through the architecture together. Which source would you like to tackle first, and what's your initial instinct on the right connectivity pattern for it?

Module 5 · Lesson 2

Workload Identity Federation

How Google Cloud service accounts acquire short-lived credentials in AWS and Azure without storing static secrets.

What does the token exchange actually look like — and where do the security boundaries sit?

In 2023, HashiCorp's Vault team published detailed documentation on the shift enterprises were making away from static cloud credentials toward OIDC-based federation. Their telemetry showed that organizations using federated identity for cross-cloud access reduced credential rotation incidents by roughly 70% — not because federation is inherently more secure, but because eliminating stored secrets eliminates the entire class of secret-leakage incidents that come with them.

Google Cloud's Workload Identity Federation, launched in GA in late 2021, was Google's answer to this requirement. It allows a service account running in GCP to present an OIDC token to AWS STS or Azure AD and receive cloud-native, short-lived credentials in return — no AWS access key stored in Secret Manager, no Azure client secret to rotate.

The Token Exchange Flow

Workload Identity Federation works through a three-party exchange. Understanding each step clarifies where to configure permissions and where failures surface when debugging.

GCP Issues a Service Account OIDC Token

Your Vertex AI agent (running as a GCP service account) calls the Google metadata server to obtain a short-lived OIDC ID token. This token is signed by Google and contains the service account email as the subject claim. Token lifetime: 1 hour by default.

GCP Security Token Service Exchanges for External Credentials

The GCP STS endpoint (sts.googleapis.com) accepts the OIDC token and a credential configuration file specifying the target external provider. It exchanges the GCP token for a short-lived federated token scoped to the external cloud's role.

External Cloud Validates and Issues Native Credentials

AWS STS (AssumeRoleWithWebIdentity) or Azure AD (token exchange) validates the GCP-issued token against a configured trust relationship, then issues AWS temporary credentials or an Azure access token scoped to the permitted resources.

Configuring Federation for AWS

On the AWS side, you create an IAM Identity Provider that trusts GCP's OIDC endpoint, then an IAM Role with a trust policy allowing the specific GCP service account to assume it. The critical constraint: the GCP service account email must match the condition in the AWS trust policy, otherwise AssumeRoleWithWebIdentity returns an access denied error.

# AWS IAM Trust Policy (JSON) — allows specific GCP service account
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Federated": "arn:aws:iam::ACCOUNT:oidc-provider/accounts.google.com"},
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "accounts.google.com:sub": "SERVICE_ACCOUNT_NUMERIC_ID"
      }
    }
  }]
}

On the GCP side, you generate a credential configuration JSON using gcloud iam workload-identity-pools create and store it alongside your agent code (not in Secret Manager — it contains no secrets, only configuration). The Google Cloud client libraries automatically handle the token exchange when this file is set as the Application Default Credentials source.

# Generate credential config for AWS federation (gcloud CLI)
gcloud iam workload-identity-pools create-cred-config \
  projects/PROJECT_NUM/locations/global/workloadIdentityPools/POOL_ID/providers/aws-provider \
  --service-account=agent-sa@PROJECT.iam.gserviceaccount.com \
  --aws \
  --output-file=aws-cred-config.json

# Set as Application Default Credentials
export GOOGLE_APPLICATION_CREDENTIALS=aws-cred-config.json

Configuring Federation for Azure

Azure federation requires registering a Workload Identity Pool provider in GCP that points to Azure AD's OIDC discovery endpoint (https://login.microsoftonline.com/TENANT_ID/v2.0). The Azure-side configuration creates an app registration with a federated credential that specifies the GCP issuer, pool, and subject.

A key difference from AWS: Azure issues access tokens scoped to specific resources (e.g., Azure SQL, Azure Blob Storage), not to a generic role. Your agent must request tokens for the specific Azure resource it intends to access, which means the credential configuration is resource-type-specific rather than account-level.

Workload Identity Pool

A GCP resource that manages external identity provider relationships. Each pool can have multiple providers (one for AWS, one for Azure). Agents reference the pool when requesting federated credentials.

AssumeRoleWithWebIdentity

The AWS STS API call that exchanges an external OIDC token for temporary AWS credentials (AccessKeyId, SecretAccessKey, SessionToken). Credentials expire in 1-12 hours.

Attribute Mapping

GCP feature allowing claims from the external identity token to be mapped to Google attributes (e.g., mapping AWS ARN to google.subject). Used for fine-grained conditional access policies.

Security Boundary

The GCP service account numeric ID (not email) is the identifier used in AWS trust policies. If you grant trust to the service account email without pinning to the numeric ID, account renaming could allow a different account to assume your AWS role. Always use the numeric ID in the sub condition.

Debugging Federation Failures

The three most common federation failures in production and how to diagnose them:

1. "Token audience mismatch" — The OIDC token was issued for a different audience than what the AWS or Azure provider expects. Check that the credential configuration file's audience field matches the OIDC provider configuration in the target cloud.

2. "No matching statement" (AWS) or "AADSTS70021" (Azure) — The service account subject claim doesn't match the trust policy condition. Verify the numeric ID (not email) is in the AWS condition, or that the federated credential subject on the Azure app registration matches the pool subject pattern.

3. "Token expired" — Federation tokens have short lifetimes. Ensure your agent code uses the Google Cloud client libraries rather than manually caching credentials. The libraries handle automatic refresh.

Operational Note

Workload Identity Federation eliminates secret rotation but introduces a new audit surface: Workload Identity Pool activity logs in Cloud Audit Logs and AWS CloudTrail AssumeRole events in parallel. Monitor both — a compromised agent shows up in both logs as unusual AssumeRole frequency from an unexpected source IP.

Lesson 2 Quiz

Workload Identity Federation · 4 questions

Which AWS STS API call does Workload Identity Federation use to exchange a GCP OIDC token for temporary AWS credentials?

Correct. AssumeRoleWithWebIdentity is the STS call designed specifically for exchanging external OIDC tokens (including GCP-issued ones) for temporary AWS credentials.

The correct call is AssumeRoleWithWebIdentity — the STS endpoint specifically designed for OIDC token exchange, which is distinct from AssumeRole (which requires existing AWS credentials).

In the AWS IAM trust policy for Workload Identity Federation, which claim should be used in the StringEquals condition to prevent impersonation via service account renaming?

Correct. The sub claim contains the numeric service account ID, which is immutable. Using the email claim is less secure because account naming could change.

The sub claim with the numeric ID is the correct choice. Email-based conditions are riskier because service account emails could theoretically be reused if an account is deleted and recreated.

What is stored in the credential configuration JSON file generated by the gcloud workload-identity-pools create-cred-config command?

Correct. The credential configuration file contains no secrets — only the pool ID, provider configuration, and target role ARN. This is what makes it safe to store in source control.

The credential configuration file contains no secrets — it is pure configuration (pool IDs, provider endpoints, role references). Secrets are never written to disk; they are obtained dynamically via token exchange at runtime.

A key difference between configuring Workload Identity Federation for Azure versus AWS is that Azure requires:

Correct. Azure issues access tokens scoped to specific resources (Azure SQL, Blob Storage, etc.), so the credential configuration is resource-type-specific — unlike AWS which issues account-level role credentials.

The key Azure distinction is resource-scoped tokens. Unlike AWS (which issues credentials for a broad IAM role), Azure access tokens are issued for a specific resource, so your agent must know which Azure resource it's targeting before requesting credentials.

Lab 2: Debugging a Federation Configuration

Conversational lab · troubleshoot Workload Identity Federation errors

Scenario

Your Vertex AI agent is returning an error when attempting to access an S3 bucket via Workload Identity Federation. The error message is: "An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRole". Your credential configuration file looks syntactically correct. Walk through the debugging process with the assistant.

Ask the assistant to help diagnose the error. Consider: trust policy conditions, subject claims, audience matching, and IAM Role permissions. Complete at least 3 exchanges.

Federation Debugger

Vertex AI Data Agents · M5 Lab 2

I see you're getting an AccessDenied on AssumeRoleWithWebIdentity — a classic federation debugging scenario. This error means AWS STS received the OIDC token but rejected it at the authorization stage, not the token validation stage. That narrows things down significantly. Can you share: (1) whether you used the service account email or numeric ID in the AWS trust policy condition, and (2) which field you used — accounts.google.com:email or accounts.google.com:sub?

Module 5 · Lesson 3

BigQuery Omni and Federated Queries

Querying data in AWS S3 and Azure ADLS Gen2 directly from BigQuery — without egress costs or data movement.

When does it make more sense to query data where it lives rather than move it to GCP?

When Wayfair's data engineering team published their 2023 multi-cloud analytics architecture, they highlighted a specific problem: raw clickstream data accumulated in AWS S3 at petabyte scale. Moving it to BigQuery for analysis would cost over $80,000 monthly in AWS egress fees alone. BigQuery Omni — which they adopted in late 2022 — eliminated that cost by running BigQuery compute directly in the AWS us-east-1 region where the data lived.

The tradeoff they documented: Omni queries ran 30-40% slower than equivalent queries on data stored natively in BigQuery, and cross-cloud JOIN operations (joining S3 data with BigQuery-native tables) required careful query planning because data had to be physically moved at join time. But for 80 million dollars per year, slower was acceptable.

How BigQuery Omni Works

BigQuery Omni is Google's managed compute deployed in specific AWS and Azure regions. When you create an Omni dataset, you choose a region that maps to an AWS or Azure region (e.g., aws-us-east-1 maps to AWS us-east-1). BigQuery tables in that dataset are backed by data in S3 or ADLS Gen2 — specifically, by external tables or BigLake tables pointing to files in those storage services.

When you run a query against an Omni dataset, BigQuery routes the query to the Omni compute cluster in the corresponding cloud region. That compute reads from S3 or ADLS Gen2 over local networking (no cross-cloud egress), executes the query, and returns results to your BigQuery project. The result data does cross cloud boundaries — but typically query results are much smaller than the underlying data.

Supported AWS Regions (2024)

AWS Locations

aws-us-east-1 (N. Virginia), aws-us-west-2 (Oregon), aws-ap-southeast-1 (Singapore), aws-eu-west-1 (Ireland). Additional regions added periodically.

Supported Azure Regions (2024)

Azure Locations

azure-eastus2 (Virginia), azure-northeurope (Ireland). Azure Omni support was GA'd in 2023, with fewer regions than AWS.

Supported Formats

File Formats

Parquet, ORC, Avro, CSV, JSON, and Delta Lake (via BigLake). Parquet with column pruning gives the best query performance.

Pricing Model

Cost Structure

Omni queries are billed at BigQuery on-demand rates (per TB scanned) plus a surcharge for the Omni compute. No AWS or Azure egress charges for the scan — only for result transfer.

Creating an Omni External Table

The first step is creating a BigQuery connection to S3 or ADLS Gen2. This connection uses a dedicated Google-managed service account that must be granted read access to the source storage.

-- Create BigLake table over S3 data (BigQuery SQL)
CREATE EXTERNAL TABLE `my-project.aws_dataset.clickstream`
WITH CONNECTION `aws-us-east-1.my-s3-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-bucket/clickstream/year=2024/*'],
  hive_partition_uri_prefix = 's3://my-bucket/clickstream'
);

# Grant the connection service account access to S3 (AWS CLI)
aws iam create-role \
  --role-name bq-omni-readonly \
  --assume-role-policy-document file://omni-trust-policy.json

aws iam attach-role-policy \
  --role-name bq-omni-readonly \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Bind GCP connection service account to the role (in trust policy)

Using Omni Data in Vertex AI Agents

From the agent's perspective, an Omni external table is indistinguishable from any other BigQuery table. The agent's BigQuery tool calls standard BigQuery APIs — it doesn't need to know that the underlying data lives in S3. This is the key architectural benefit: cross-cloud data complexity is absorbed by the BigQuery layer, not the agent layer.

In practice, agents should be aware of Omni query latency. A query that runs in 2 seconds on BigQuery-native data might take 6-8 seconds on Omni, because the compute cluster in AWS has a cold-start overhead and the query is routed cross-cloud. For synchronous agent flows, this matters; for batch or background analysis, it typically doesn't.

Cross-Cloud JOIN Consideration

When you JOIN an Omni table (data in S3) with a BigQuery-native table (data in GCP), BigQuery must physically broadcast one side of the join across clouds. For large tables, this can incur significant data transfer costs and latency. Structure your queries to filter Omni data heavily before joining — push predicates into the Omni scan, not the join.

BigLake vs. Standard External Tables

For Omni workloads, Google recommends BigLake tables over standard external tables. BigLake adds row- and column-level security policies that are enforced even when the data is accessed via third-party tools (Spark, Presto) outside of BigQuery. For agentic pipelines where the agent's output might be consumed by multiple downstream tools, BigLake's centralized access control is significantly easier to manage than per-tool S3 bucket policies.

BigLake

Google Cloud's managed lake service that adds IAM-based access control, row/column security, and a metadata layer on top of files in GCS, S3, or ADLS Gen2. Supersedes standard external tables for most production use cases.

Omni Connection

A BigQuery resource that holds the configuration for accessing external cloud storage, including the Google-managed service account that needs permissions in the external cloud.

Hive Partitioning

Directory structure convention (key=value/...) that BigQuery Omni can use to prune which partitions are scanned, dramatically reducing scan costs for time-partitioned datasets in S3.

When NOT to Use Omni

Omni is not appropriate when: (1) your S3 data is in a region without Omni coverage, (2) query latency under 3 seconds is required, (3) your queries require complex cross-cloud JOINs on large tables, or (4) data must be transformed before analysis (use Dataflow replication instead). Use Omni when data gravity and egress costs favor keeping data in the source cloud.

Lesson 3 Quiz

BigQuery Omni and Federated Queries · 4 questions

What is the primary reason Wayfair adopted BigQuery Omni for their S3 clickstream data according to the lesson's industry reference?

Correct. The egress cost of moving petabyte-scale S3 data to BigQuery was the decisive factor — Omni eliminated that cost by querying data in place.

The lesson explicitly states the motive was egress cost — over $80,000/month to move the S3 data. Omni queries actually ran 30-40% slower than native BigQuery, but cost savings justified the tradeoff.

When a Vertex AI agent queries an Omni external table, which layer handles the cross-cloud complexity — the agent, the BigQuery API, or the Omni compute cluster?

Correct. The cross-cloud complexity is fully absorbed by the BigQuery/Omni layer. The agent calls standard BigQuery APIs and is unaware that data lives in S3.

From the agent's perspective, Omni tables are indistinguishable from native BigQuery tables. Standard BigQuery APIs work without modification — the routing to Omni compute is handled transparently by the BigQuery service.

What is the recommended file format for Omni queries to achieve the best query performance through column pruning?

Correct. Parquet's columnar format enables column pruning — Omni only reads the columns referenced in the query, dramatically reducing scan volume and cost.

Parquet is the recommended format. Its columnar structure allows Omni to read only the columns needed by the query (column pruning), which is critical for cost efficiency when scanning large datasets in S3.

When performing a JOIN between an Omni table (data in S3) and a BigQuery-native table, what query optimization technique is most important?

Correct. Because cross-cloud JOINs require data movement at join time, pushing predicates into the Omni scan reduces the data volume that must cross cloud boundaries before joining.

The lesson specifically recommends pushing predicates into the Omni scan — filtering aggressively before the join to minimize the data that must be broadcast across cloud boundaries at join execution time.

Lab 3: Designing an Omni Query Strategy

Conversational lab · optimize cross-cloud queries for cost and performance

Scenario

You have 5TB of daily Parquet files in S3 (us-east-1) representing e-commerce order events. Your Vertex AI agent needs to produce a daily summary joining this S3 data with a 10M-row customer dimension table in BigQuery. Current query runtime is 45 seconds and costs $2.50 per execution. Your target is under 15 seconds and under $0.75.

Work with the assistant to redesign the query strategy. Discuss: partition pruning with Hive directories, column selection, pre-filtering before JOINs, and whether any data should be replicated to BigQuery instead. Complete at least 3 exchanges.

Query Optimizer

Vertex AI Data Agents · M5 Lab 3

Good optimization challenge. At 5TB daily with a 45-second runtime and $2.50 cost, you're scanning too much data. Let's diagnose the bottlenecks before proposing solutions. First: is the S3 data organized with Hive-style date partitioning (e.g., date=2024-01-15/), and is your current query filtering on a date range? Also, how many columns does the order events Parquet file contain — and how many does your summary actually need?

Module 5 · Lesson 4

Streaming Data Across Clouds

Bridging AWS Kinesis and Azure Event Hubs to Google Cloud Pub/Sub for real-time agentic pipelines.

How do you maintain sub-second event delivery when your producers are in one cloud and your agents are in another?

When Lyft announced its data infrastructure architecture in 2023, engineers described maintaining Kafka clusters in AWS while analytics and ML pipelines ran on Google Cloud. The bridge between them — a set of Kafka MirrorMaker 2 connectors replicating topics from AWS MSK to Confluent Cloud, then Pub/Sub subscribers pulling from Confluent — added roughly 800ms of median latency compared to a single-cloud Kafka setup.

That latency was acceptable for their fraud scoring pipeline, which operated on 5-second windows. But it was unacceptable for driver location updates, which required sub-200ms propagation. The lesson: cross-cloud streaming latency is bounded by internet RTT plus serialization overhead — typically 400-900ms — and architectural decisions must account for whether use cases can tolerate this floor.

Cross-Cloud Streaming: The Latency Floor

All cross-cloud streaming approaches share a fundamental latency constraint: data must physically traverse the internet between cloud providers. The minimum round-trip time between major cloud regions in the same geography is approximately 10-30ms for the network hop alone. When you add serialization, protocol overhead, and buffering, the practical minimum end-to-end latency for cross-cloud streaming is 200-500ms under good conditions and 800-1500ms when including buffering for reliability.

For most agentic workflows — fraud analysis on 10-second windows, demand forecasting, log anomaly detection — this latency is acceptable. For HFT or real-time gaming, it is not. Know your latency budget before choosing cross-cloud streaming.

Option 1: Kinesis → Pub/Sub Bridge (AWS to GCP)

The most direct path from AWS Kinesis to Google Cloud Pub/Sub uses the Dataflow Kinesis to Pub/Sub template, which Google provides as a managed Dataflow job. The Dataflow job runs in GCP, polls Kinesis shards using the Kinesis Consumer Library (KCL), and publishes each record as a Pub/Sub message.

Authentication for the Dataflow job to read from Kinesis uses Workload Identity Federation (covered in Lesson 2). The Dataflow worker service account federates into an AWS IAM role with Kinesis:GetRecords and Kinesis:GetShardIterator permissions.

# Launch Dataflow Kinesis → Pub/Sub bridge job (gcloud CLI)
gcloud dataflow jobs run kinesis-bridge \
  --gcs-location gs://dataflow-templates/latest/Kinesis_to_PubSub \
  --region us-central1 \
  --parameters \
    kinesisDataStream=my-kinesis-stream,\
    awsRegion=us-east-1,\
    pubSubTopic=projects/my-project/topics/kinesis-events,\
    awsRoleArn=arn:aws:iam::ACCOUNT:role/dataflow-kinesis-reader,\
    awsCredentialsProvider=WORKLOAD_IDENTITY_FEDERATION,\
    credentialConfigFile=gs://my-bucket/aws-cred-config.json

The Dataflow job checkpoints shard progress in Kinesis (using the sequence number) and in Cloud Spanner or GCS, providing at-least-once delivery semantics. Messages may be duplicated across the bridge — downstream Dataflow jobs reading from Pub/Sub should implement deduplication logic using the Kinesis sequence number preserved in the Pub/Sub message attributes.

Option 2: Azure Event Hubs → Pub/Sub Bridge (Azure to GCP)

Azure Event Hubs exposes an AMQP 1.0 interface that is compatible with Apache Kafka's wire protocol. This means a Kafka consumer running in GCP can read from Event Hubs using the standard Kafka Java client, configuring Event Hubs as if it were a Kafka broker. The Dataflow KafkaIO source supports this directly.

// Dataflow pipeline: Azure Event Hubs → Pub/Sub (Java, Apache Beam)
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);

Map<String, Object> kafkaConfig = new HashMap<>();
kafkaConfig.put("bootstrap.servers",
  "my-namespace.servicebus.windows.net:9093");
kafkaConfig.put("security.protocol", "SASL_SSL");
kafkaConfig.put("sasl.mechanism", "PLAIN");
kafkaConfig.put("sasl.jaas.config", azureConnectionString);

p.apply(KafkaIO.<String, byte[]>read()
  .withBootstrapServers(...)
  .withTopic("my-event-hub")
  .withConsumerConfigUpdates(kafkaConfig)
  .withoutMetadata())
 .apply(PubsubIO.writeMessages().to(
   "projects/my-project/topics/azure-events"));

Option 3: Pub/Sub Ingestion via Push Endpoints

For simpler scenarios — particularly when the source system supports webhooks or HTTP push — you can configure AWS Lambda or Azure Functions to publish events directly to a Pub/Sub REST endpoint. The source function calls the Pub/Sub publish API using a service account token obtained via Workload Identity Federation.

This approach adds no persistent infrastructure (no Dataflow job to manage) but limits throughput. The Pub/Sub HTTP API handles millions of messages per second globally, so the bottleneck is typically the Lambda/Function concurrency limit at the source. For event rates above ~10,000/second from a single source, prefer the Dataflow bridge approach.

Bridge Type

Dataflow Kinesis Template

Best for: high-volume Kinesis streams, production workloads. Managed, auto-scaling, at-least-once delivery. Requires WIF setup.

Bridge Type

Dataflow KafkaIO (Event Hubs)

Best for: Azure Event Hubs via Kafka protocol compat. Reuses existing Beam expertise. Requires SASL/SSL config with Event Hubs connection string.

Bridge Type

Lambda/Function → Pub/Sub REST

Best for: low-volume events, webhook-compatible sources. No persistent infrastructure. Limited to ~10K events/sec per function.

Bridge Type

Confluent Cloud (Managed Kafka)

Best for: organizations already on Confluent. Confluent's managed connectors bridge Kinesis/Event Hubs to Kafka, then Pub/Sub Source Connector pulls from Kafka.

Agents Consuming Streamed Cross-Cloud Events

Once events reach Pub/Sub, the agent pipeline is identical to a native GCP streaming scenario. A Vertex AI agent tool can subscribe to a Pub/Sub subscription, read messages, parse the payload, and take action — completely unaware that the events originated in Kinesis or Event Hubs.

The design consideration for agents in this context is windowing. Because cross-cloud bridges introduce variable latency, events may arrive out of order relative to their source timestamps. Agents performing time-windowed analysis should use Dataflow's event-time windowing (which handles late arrivals via watermarks) rather than processing-time windows. This is particularly important for fraud detection and anomaly detection use cases where event ordering matters.

KCL (Kinesis Client Library)

AWS-provided Java library for consuming Kinesis streams. Used by the Dataflow Kinesis template to handle shard enumeration, checkpointing, and consumer group coordination.

AMQP 1.0 / Kafka Protocol Compatibility

Azure Event Hubs exposes a Kafka-compatible endpoint, allowing standard Kafka consumers (including Dataflow KafkaIO) to read from Event Hubs without Azure-specific client libraries.

Event-Time Windowing

Apache Beam/Dataflow windowing mode that uses the timestamp embedded in the event itself (not arrival time) to assign events to windows. Critical for cross-cloud streaming where arrival order is non-deterministic.

Cost Warning

Pub/Sub message ingestion is priced per TiB of message payload. For high-volume cross-cloud streams, compress messages before publishing. Avro with snappy compression typically reduces Pub/Sub costs by 60-75% compared to JSON, at the cost of requiring schema management via Schema Registry or Pub/Sub Schemas.

Architecture Summary

Cross-cloud streaming is a solved problem — the bridge patterns are well-documented and managed templates exist. The non-trivial design decisions are: (1) latency budget analysis against your use case, (2) deduplication strategy for at-least-once delivery, (3) out-of-order event handling via event-time windowing, and (4) compression strategy to control Pub/Sub ingestion costs at volume.

Lesson 4 Quiz

Streaming Data Across Clouds · 4 questions

What was the measured cross-cloud streaming latency overhead in Lyft's 2023 architecture (AWS MSK → Confluent → Pub/Sub), as documented in the lesson?

Correct. Lyft's architecture added ~800ms of median latency — acceptable for 5-second windowed fraud scoring, but unacceptable for sub-200ms driver location updates.

The lesson cites ~800ms median latency for Lyft's cross-cloud Kafka bridge. This was within their fraud pipeline's budget (5-second windows) but not for driver location updates requiring under 200ms.

Why can Dataflow KafkaIO read directly from Azure Event Hubs without Azure-specific client libraries?

Correct. Azure Event Hubs exposes a Kafka protocol-compatible endpoint, so any standard Kafka consumer — including Dataflow KafkaIO — can read from it using standard SASL/SSL configuration.

The key is protocol compatibility. Azure Event Hubs exposes a Kafka-compatible wire protocol, allowing standard Kafka clients (including Dataflow KafkaIO) to consume from it without Azure-specific SDKs.

At what approximate event rate does the lesson recommend switching from the Lambda/Function → Pub/Sub REST bridge to the Dataflow Kinesis template for higher throughput?

Correct. The lesson recommends the Dataflow bridge approach for event rates above ~10,000 events/second from a single source, where Lambda/Function concurrency limits become the bottleneck.

The lesson specifies that above ~10,000 events/second from a single source, Lambda/Function concurrency limits become the bottleneck and the Dataflow bridge pattern is preferred.

Why should agents performing time-windowed analysis on cross-cloud event streams use event-time windowing rather than processing-time windowing?

Correct. Variable cross-cloud bridge latency means events arrive out of order by arrival time. Event-time windowing uses the embedded source timestamp and handles late arrivals via watermarks, preserving correct windowing semantics.

The issue is out-of-order arrival. Cross-cloud latency is variable, so an event from 10:00:00 may arrive after an event from 10:00:05. Event-time windowing uses the source timestamp (not arrival time) and handles late events via watermarks, making it correct for this scenario.

Lab 4: Designing a Cross-Cloud Stream Pipeline

Conversational lab · architect a Kinesis-to-Pub/Sub fraud detection pipeline

Scenario

A payment processor runs their transaction event stream on AWS Kinesis (us-east-1, 50,000 events/sec peak). A Vertex AI agent on Google Cloud performs fraud scoring using a custom model. The agent needs transaction events within 2 seconds of occurrence at the source. Design the complete pipeline from Kinesis to the Vertex AI agent, including bridge architecture, deduplication, windowing, and failure handling.

Design the pipeline with the assistant. Address: which bridge pattern handles 50K events/sec, how you handle at-least-once delivery deduplication, what window size works for fraud scoring given 800ms bridge latency, and how you handle Dataflow job failure. Complete at least 3 exchanges.

Pipeline Architect

Vertex AI Data Agents · M5 Lab 4

50,000 events/second with a 2-second end-to-end latency budget from Kinesis to your Vertex AI fraud agent — that's a tight but achievable spec. Let's design this systematically. First question: at 50K events/sec, the Lambda push approach is immediately off the table (concurrency limits). That leaves us the Dataflow Kinesis template as the primary bridge. Given your 2-second budget, and accounting for ~800ms of cross-cloud latency, what window size are you thinking for fraud scoring — and is your fraud model stateful (needs sequence of events per card) or stateless (scores each event independently)?

Module 5 Test

Cross-Cloud Data — Connecting to AWS and Azure Sources · 15 questions · 80% to pass

1. According to Flexera's 2024 State of the Cloud report, what percentage of enterprises operate across multiple public clouds?

Correct. 87% of enterprises use multiple public clouds simultaneously.

The figure is 87% from Flexera's 2024 report.

2. The lesson argues that the core difficulty of multi-cloud agentic pipelines is primarily a problem of:

Correct. Identity is the hard problem — network connectivity is solved, but bridging IAM systems requires significant engineering.

Identity — not latency or serialization — is the core challenge. Network connectivity between clouds is considered a solved problem.

3. Which architectural pattern is most appropriate for a Vertex AI agent that needs data from an S3 bucket with strict freshness requirements of under 5 seconds?

Correct. Sub-5-second freshness requires the real-time event bridge pattern.

For sub-5-second freshness, the real-time event bridge (Kinesis → Pub/Sub) is the appropriate choice.

4. In Workload Identity Federation, what does the GCP Security Token Service (sts.googleapis.com) do?

Correct. GCP STS performs the token exchange — accepting the GCP OIDC token and returning a federated token accepted by the external cloud.

GCP STS performs the token exchange: it accepts the GCP-issued OIDC token and returns a short-lived federated credential for the external cloud.

5. What AWS IAM claim should be used in the StringEquals condition of a trust policy to prevent impersonation via service account renaming?

Correct. The sub claim contains the immutable numeric ID — more secure than the email which could be reassigned.

The sub claim with the numeric ID is correct — it's immutable, unlike the email address.

6. What does the credential configuration JSON file created by gcloud iam workload-identity-pools create-cred-config contain?

Correct. The file is pure configuration with no secrets — safe to store in source control or on disk.

The credential config file contains no secrets — only configuration references. Credentials are obtained dynamically at runtime via token exchange.

7. BigQuery Omni enables cross-cloud querying by:

Correct. Omni deploys managed compute inside AWS and Azure regions, reading from S3/ADLS Gen2 over local networking without data movement.

BigQuery Omni runs managed compute inside the source cloud — no data movement, no dedicated interconnect required.

8. According to the lesson's Wayfair industry reference, what was the primary factor driving their adoption of BigQuery Omni for petabyte-scale S3 data?

Correct. Egress costs exceeding $80,000/month made Omni — despite slower queries — economically compelling.

Egress cost — over $80K/month — was the decisive factor. Omni queries actually ran 30-40% slower, but cost savings justified the tradeoff.

9. What file format provides the best query performance for BigQuery Omni against S3 data, and why?

Correct. Parquet's columnar structure enables Omni to read only the queried columns, dramatically reducing scan cost and time.

Parquet is recommended for its columnar structure, which enables column pruning — Omni reads only the columns your query actually references.

10. When should you NOT use BigQuery Omni? Select the best answer.

Correct. Omni is inappropriate for sub-3-second latency requirements or when data is in unsupported regions.

Omni is inappropriate when latency under 3 seconds is required, when the S3 region lacks Omni coverage, or when complex cross-cloud JOINs on large tables are needed.

11. In Lyft's 2023 multi-cloud architecture (AWS MSK → Confluent → Pub/Sub), what was the approximate median cross-cloud streaming latency overhead?

Correct. ~800ms median latency, acceptable for 5-second fraud windows but not for sub-200ms driver location updates.

~800ms was the documented median latency. The practical cross-cloud streaming floor is 400-900ms including buffering.

12. Why can Dataflow KafkaIO consume Azure Event Hubs messages without Azure-specific client libraries?

Correct. Azure Event Hubs exposes a Kafka-compatible wire protocol endpoint, making it consumable by any standard Kafka client.

Protocol compatibility is the key — Event Hubs exposes a Kafka-compatible endpoint, so standard Kafka clients (KafkaIO) can consume from it with SASL/SSL configuration.

13. At what event rate does the lesson recommend switching from Lambda/Function → Pub/Sub REST to the Dataflow Kinesis template bridge?

Correct. Above ~10,000 events/second, Lambda/Function concurrency limits become the bottleneck and Dataflow is preferred.

The lesson specifies ~10,000 events/second as the threshold where Lambda/Function concurrency limits become the bottleneck.

14. Why should Vertex AI agents performing time-windowed analysis on cross-cloud streams use event-time windowing rather than processing-time windowing?

Correct. Variable bridge latency means arrival order ≠ source timestamp order. Event-time windowing with watermarks handles late arrivals correctly.

Variable cross-cloud latency causes out-of-order arrival. Event-time windowing uses source timestamps and watermarks to handle late events correctly.

15. What is the recommended compression approach for high-volume cross-cloud Pub/Sub messages to minimize ingestion costs, and what typical cost reduction does it achieve?

Correct. Avro with snappy compression typically reduces Pub/Sub ingestion costs by 60-75% compared to JSON at the cost of schema management.

The lesson recommends Avro with snappy compression for Pub/Sub messages, achieving 60-75% cost reduction versus JSON — at the cost of requiring schema management via Schema Registry or Pub/Sub Schemas.