Module 4 · Lesson 1

Foundation Models & API Access

How the frontier labs distribute intelligence — and what it costs to tap in.

Which model should your business actually call — and why does the answer change every six months?

When OpenAI released the GPT-4 API on March 14, 2023, Stripe's engineering team integrated it into their fraud-detection pipeline within seventy-two hours. They didn't build a model — they made an API call. The distinction matters enormously. Every AI-first business in this course starts the same way: with a key, an endpoint, and a pricing page.

What a Foundation Model Is

A foundation model is a large neural network trained on massive, general-purpose data that can be adapted to many downstream tasks. Unlike the narrow classifiers businesses built in the 2010s, foundation models transfer knowledge across domains: the same model that summarizes legal contracts can also write marketing copy or debug Python.

The frontier is currently dominated by four providers: OpenAI (GPT-4o, o1, o3), Anthropic (Claude 3.5 Sonnet, Claude 3 Opus), Google DeepMind (Gemini 1.5 Pro, Gemini 2.0 Flash), and Meta AI (Llama 3.1, open weights). Each exposes capabilities through an API measured in tokens — chunks of roughly four characters.

The API-First Access Model

API access means you send text in, pay per token, and receive generated text out. In 2024, OpenAI's GPT-4o cost $5 per million input tokens and $15 per million output tokens. A 500-word customer support reply consumes roughly 700 tokens — about half a cent. At scale, a business handling 100,000 support queries per day would spend approximately $500/day on generation alone, before infrastructure.

The key business decision is which tier to call. Every major provider now offers a speed/cost/quality spectrum:

Frontier / Flagship

Maximum Quality

Highest reasoning, best for complex analysis, legal review, multi-step planning. Most expensive.

GPT-4o, Claude 3 Opus, Gemini 1.5 Pro

Balanced / Mid

Speed + Quality

80–90% of flagship quality at 20–40% of the cost. Best for most production workloads.

GPT-4o-mini, Claude 3.5 Sonnet, Gemini Flash

Nano / Edge

Speed + Cost

Ultra-low latency, near-zero cost. Classification, routing, simple extraction tasks.

GPT-3.5-turbo, Claude 3 Haiku, Gemini Flash Lite

Open Weights

Self-Hosted

Download and run on your own compute. No per-token cost; fixed infrastructure cost. Data stays on premise.

Llama 3.1 70B, Mistral Large, Mixtral 8x7B

Real Decision: Klarna's Model Routing (2024)

Klarna's AI assistant — announced in February 2024 as handling the work of 700 customer service agents — does not call a single model. It routes requests by complexity. Simple account-status queries hit a small, fast model. Disputes requiring policy reasoning escalate to a flagship model. This cascade routing pattern cut per-query cost by an estimated 60% while maintaining quality on hard cases. By September 2024 Klarna reported the assistant had handled 2.3 million conversations, with 79% resolved without a human agent.

Key Concept — Context Window

Every API call has a context window: the maximum tokens the model can "see" at once, including both your input and its output. GPT-4o supports 128,000 tokens (~96,000 words). Gemini 1.5 Pro supports 1 million tokens. Larger windows enable full-document analysis but cost proportionally more. Match window size to task, not to maximum.

Key Terms

Token~4 characters of text; the unit of measurement for API pricing and context limits.

Context WindowTotal tokens a model can process in a single call (input + output combined).

Cascade RoutingSending simple requests to cheap/fast models and escalating complex ones to frontier models.

Open WeightsA model whose parameters are publicly released, allowing self-hosting without per-token fees.

Bottom Line

The foundation model layer is a commodity market moving toward price parity on quality. Your competitive advantage comes not from which model you pick but from how well you architect the calls around it — routing, prompting, caching, and feedback loops.

Lesson 1 Quiz

Foundation Models & API Access — 4 questions

1. In API pricing, what is a "token"?

Correct. Tokens are ~4-character chunks. A 500-word document is roughly 700 tokens. All API pricing and context limits are expressed in tokens.

Not quite. A token is approximately 4 characters of text — the fundamental unit by which APIs measure both cost and the context window size.

2. Klarna's February 2024 AI assistant announcement demonstrated which architectural pattern?

Correct. Klarna's system routes by complexity, reducing per-query cost ~60% while maintaining quality on hard cases — a textbook cascade routing architecture.

Klarna used cascade routing: simple queries go to cheaper, faster models and complex ones escalate to flagship models. This cut per-query cost by ~60%.

3. What is the primary business advantage of open-weight models like Llama 3.1?

Correct. Open-weight models eliminate per-token fees and allow you to keep sensitive data on-premise — critical for regulated industries.

The main advantages are cost (no per-token fees once deployed) and data sovereignty (no sending data to a third-party API). Quality is strong but typically trails frontier commercial models at the same parameter size.

4. According to the lesson, where does an AI-first business's competitive advantage primarily come from?

Correct. Foundation models are commoditizing. The moat is in how you architect around them — routing complexity, prompt engineering, caching repeated queries, and building feedback loops that improve over time.

Foundation model quality is converging. The competitive moat comes from how well you architect around calls: routing, prompting, caching, and feedback loops — not from picking one provider over another.

Lab 1: Model Selection Advisor

Practice choosing the right model tier and architecture for real business scenarios.

Your Task

Describe a specific business use case — the task type, expected volume, latency requirements, and data sensitivity. The AI advisor will recommend a model tier, explain the routing strategy, and estimate rough costs. Push back, ask for alternatives, or give it edge cases.

Example: "We run a legal SaaS and want to auto-summarize 500-page court filings. We process about 200 per day. Latency isn't critical — we batch overnight. Data is confidential client files."

AI Model Selection Advisor

Lab 1

Welcome to the Model Selection Lab. Describe your business use case — task type, daily volume, latency tolerance, and any data-sensitivity constraints — and I'll walk you through the right model tier, routing strategy, and rough cost estimate. What are you building?

Module 4 · Lesson 2

Orchestration, Agents & Workflow Automation

Single API calls are table stakes. The value is in what happens between them.

How do you turn a language model that answers one question at a time into a system that completes multi-step business processes?

In July 2024, Salesforce launched Einstein Copilot with agent capabilities — an AI that could autonomously query CRM records, draft follow-up emails, schedule meetings, and log activities across multiple systems without a human in each loop. The enabling architecture was not a smarter model. It was an orchestration layer that broke goals into steps, called tools, and recovered from failures.

From Single Calls to Agentic Systems

A bare API call is stateless: you send context, get a response, done. Real business workflows are stateful: approve this, then notify that person, then update the database, then check if a condition is met before proceeding. Orchestration layers add state, memory, and tool use on top of raw model calls.

The dominant frameworks in 2024 were LangChain (launched 2022, >100k GitHub stars), LlamaIndex (focused on data ingestion and retrieval), and AutoGen (Microsoft, multi-agent conversations). In parallel, every major cloud provider released managed orchestration: AWS Bedrock Agents, Google Vertex AI Agents, Azure AI Studio.

The Agent Loop Architecture

An agent follows a Reason → Act → Observe loop, often called ReAct (from the 2022 Princeton/Google paper by Yao et al.). The model reasons about the current state, selects a tool to call, observes the result, and reasons again — repeating until the goal is complete or a stop condition fires.

Step 1

Reason

Model receives goal + current state + available tools. It outputs a plan or selects the next action to take.

Step 2

Act

Orchestrator calls the selected tool: API, database query, web search, code execution, file read/write.

Step 3

Observe

Tool result is returned to the model as new context. The model updates its understanding of the state.

Step 4

Repeat or Stop

Loop continues until goal is achieved, max steps exceeded, or model signals completion.

Real Case: Harvey AI's Legal Workflow (2023–2024)

Harvey AI — backed by OpenAI and valued at over $1.5 billion by late 2024 — built a legal workflow automation system for firms including Allen & Overy. Their architecture chains multiple steps: ingest a contract, extract clauses, compare against a precedent database, flag deviations, draft a redline, and route to the relevant attorney's queue. None of this is a single prompt. It is an orchestrated pipeline of model calls, retrieval steps, and business logic. Harvey's reported productivity gain at Allen & Overy was a 50% reduction in contract review time for participating associates.

Pitfall — Infinite Agent Loops

Without hard stop conditions, agents can loop indefinitely, burning tokens and budget. Every production agent system needs maximum step limits, timeout guardrails, and human escalation paths when confidence is low. This is not optional — it's a production requirement.

Workflow Automation vs. True Agents

Not every multi-step AI task needs a self-directed agent. Workflow automation uses a predefined DAG (directed acyclic graph) of steps — deterministic, auditable, easy to debug. Agents choose their own steps dynamically — flexible but harder to audit and predict. The right choice depends on how variable and unpredictable the inputs are.

Dimension	Workflow Automation	Autonomous Agent
Step sequence	Fixed, predefined	Dynamic, model-chosen
Auditability	Easy to trace	Harder to explain
Flexibility	Rigid to new inputs	Adapts to novel situations
Cost predictability	Predictable token use	Variable (can over-loop)
Best for	Structured, repetitive tasks	Open-ended research, planning

ReActReason-Act-Observe loop; the architectural pattern underlying most LLM agents.

Orchestration LayerSoftware (e.g., LangChain) that manages state, tool calls, and memory across multiple model interactions.

DAGDirected Acyclic Graph; a workflow where steps flow in one direction with no cycles — the structure for deterministic pipelines.

Lesson 2 Quiz

Orchestration, Agents & Workflow Automation — 4 questions

1. What does the ReAct pattern stand for, and where was it introduced?

Correct. ReAct (Reason-Act-Observe) was introduced in a 2022 paper by Yao et al. from Princeton and Google, and it became the foundational pattern for LLM agents.

ReAct stands for Reason-Act-Observe, introduced in a 2022 paper by Yao et al. (Princeton/Google). It's the loop underlying most production agent architectures.

2. Harvey AI's legal workflow at Allen & Overy demonstrated what key architectural principle?

Correct. Harvey chains: ingest → extract → retrieve precedents → compare → draft redline → route. Each step is a model call or retrieval operation, orchestrated together — delivering a reported 50% reduction in review time.

Harvey's value came from orchestrating multiple steps: ingestion, clause extraction, precedent retrieval, deviation flagging, and routing. No single prompt achieves that. The orchestration layer is where the value lives.

3. When is a deterministic workflow (DAG) preferable to an autonomous agent?

Correct. DAG workflows excel at structured, repetitive tasks where audit trails matter. They're predictable in cost and behavior. Agents shine when inputs vary wildly and dynamic step selection is needed.

Deterministic DAG workflows are best for structured, repetitive tasks where you need auditability and cost predictability. Agents are better when inputs are open-ended and dynamic adaptation is required.

4. What is a critical production requirement for autonomous agent systems that is described as "not optional"?

Correct. Without hard stop conditions and human escalation paths, agents can loop indefinitely burning budget. These guardrails are a non-negotiable production requirement.

Production agents require maximum step limits, timeout guardrails, and human escalation when confidence is low. Without these, agents can loop indefinitely — burning tokens and causing downstream errors.

Lab 2: Agent Design Workshop

Design a multi-step agent workflow for a real business process.

Your Task

Describe a multi-step business process you want to automate. The advisor will help you decide: agent vs. fixed workflow, what tools to expose, what the ReAct loop looks like, and where to put guardrails. Be specific about the steps, edge cases, and what happens when something goes wrong.

Example: "I want to automate our vendor invoice approval process — receive invoice email, extract line items, match against PO, flag discrepancies, route for approval if <$10k, escalate to CFO if over. About 150 invoices/week."

Agent Design Advisor

Lab 2

Welcome to the Agent Design Workshop. Tell me about the business process you want to automate — describe the steps, the inputs, the decisions involved, and what a failed run looks like. I'll help you design the right architecture: agent loop or deterministic workflow, tool set, and guardrails. What process are we tackling?

Module 4 · Lesson 3

Retrieval-Augmented Generation & the Data Layer

A model knows what it was trained on. RAG makes it know what you know.

How do you give a language model access to your proprietary data without fine-tuning — and without hallucinating?

When Morgan Stanley deployed an internal GPT-4-powered assistant in March 2023, they faced a problem shared by every enterprise: the model knew nothing about their 100,000 internal research documents, advisor guidelines, or product inventory. The solution was not fine-tuning — it was Retrieval-Augmented Generation. They embedded their entire content library, built a vector search layer, and injected relevant chunks into every prompt. By Q4 2023, over 200 Morgan Stanley advisors were using the system daily.

Why RAG Exists

Foundation models have a knowledge cutoff — they know nothing that happened after training ended. More critically, they know nothing about your business: your contracts, your SOPs, your product catalog, your customer history. Fine-tuning can address some of this, but it is expensive (thousands of dollars and days of compute), requires labeled data, and needs retraining every time data changes.

RAG solves the same problem more cheaply and dynamically: at query time, retrieve the most relevant documents from a vector database, inject them as context, and let the model answer against that context. The model's "knowledge" updates the moment you update your documents — no retraining required.

The RAG Pipeline

Phase 1 — Indexing

Chunk & Embed

Split documents into chunks (typically 256–512 tokens). Run each chunk through an embedding model (e.g., OpenAI text-embedding-3-large) to produce a numeric vector representing its meaning.

Chunking strategies: fixed-size, sentence, semantic

Phase 2 — Storage

Vector Database

Store embeddings in a purpose-built vector DB that supports approximate nearest-neighbor (ANN) search. Pinecone, Weaviate, Qdrant, pgvector (Postgres extension).

Pinecone, Weaviate, Qdrant, Chroma

Phase 3 — Retrieval

Semantic Search

Embed the user's query. Find the top-k most similar chunks by cosine similarity. Optionally rerank with a cross-encoder for higher precision.

Top-k=5 to 10 is common; reranking with Cohere Rerank

Phase 4 — Generation

Augmented Prompt

Inject retrieved chunks into the prompt context. Model answers using retrieved information + its pretrained knowledge. Cite sources to enable verification.

System prompt + retrieved context + user query

Real Case: Notion AI & Perplexity AI (2023–2024)

Notion AI, launched November 2022 and expanded in 2023, used RAG over a user's own Notion workspace to answer questions about their notes, documents, and databases — turning the entire workspace into a queryable knowledge base. The model never "learned" any user's content; it retrieved it at query time. Perplexity AI, which raised $73.6 million in January 2024 and was valued at $520 million, built its entire product on real-time web RAG — embedding live search results into every answer, with citations. By mid-2024 Perplexity reported over 10 million monthly active users.

RAG vs. Fine-Tuning — When to Use Which

Use RAG when your data changes frequently, you need source citations, or you want to avoid retraining costs. Use fine-tuning when you need to change the model's tone, style, or format of outputs, or when a specific skill must be deeply embedded (e.g., a clinical coding model trained on thousands of labeled examples). Many production systems use both: RAG for knowledge, fine-tuning for behavior.

Chunking Strategy Matters More Than You Think

Poor chunking is the most common cause of RAG quality failures. If chunks are too small, they lose context — a number without the sentence that gives it meaning. If chunks are too large, retrieval becomes imprecise — you pull an entire chapter when you needed one paragraph. The 2024 research consensus favors semantic chunking: splitting at natural topic boundaries detected by embedding similarity, not arbitrary character counts. Overlap between chunks (e.g., 50-token overlap) prevents context from being lost at boundaries.

EmbeddingA high-dimensional numeric vector representing the semantic meaning of text, produced by an embedding model.

Vector DatabaseA database optimized for storing and searching embeddings by semantic similarity (ANN search).

RAGRetrieval-Augmented Generation — dynamically injecting retrieved document context into a model prompt at query time.

RerankingA second-pass scoring step that reorders retrieved chunks by relevance using a cross-encoder model, improving precision.

Stack Reality

By 2024, the minimal RAG stack had become: LlamaIndex or LangChain for orchestration → OpenAI embeddings or open alternatives → Pinecone or pgvector for storage → GPT-4o or Claude for generation. Total setup time for a proof of concept: under one day. Total monthly cost for a small internal tool: $50–$300 depending on query volume.

Lesson 3 Quiz

Retrieval-Augmented Generation & the Data Layer — 4 questions

1. What key advantage does RAG have over fine-tuning for enterprise knowledge management?

Correct. RAG retrieves at query time, so updating your knowledge base means just updating the document index. No model retraining, no days of compute, no labeled data required.

RAG's key advantage is dynamic updating: add or change a document in your vector DB and the system immediately "knows" it — no retraining cycle. Fine-tuning bakes knowledge into weights and requires retraining to update.

2. In the RAG pipeline, what is an "embedding"?

Correct. Embeddings are dense numeric vectors — typically 1,536 or 3,072 dimensions — that encode semantic meaning. Similar texts produce nearby vectors, enabling similarity search.

An embedding is a high-dimensional numeric vector encoding the semantic meaning of text. It's produced by an embedding model (e.g., OpenAI's text-embedding-3-large) and stored in a vector database for similarity search.

3. Perplexity AI's core product differentiator, which helped it reach over 10 million monthly active users by mid-2024, was:

Correct. Perplexity built its entire product on real-time web RAG — retrieving live search results, embedding them, and generating cited answers. The model itself was not proprietary; the retrieval + UX layer was the product.

Perplexity's moat was real-time web RAG with citations. They didn't build or own a foundation model — they built a retrieval + generation pipeline that pulls live web content and produces cited answers, which differentiated them from static chatbots.

4. What is the most common cause of RAG quality failures identified in 2024 research?

Correct. Chunking is the most overlooked RAG design decision. Too small and you lose meaning; too large and retrieval becomes imprecise. Semantic chunking with overlap is the 2024 best-practice consensus.

Poor chunking is the top culprit. Chunks too small lose the context that gives a number or phrase meaning. Chunks too large reduce retrieval precision. The 2024 consensus favors semantic chunking with ~50-token overlap at boundaries.

Lab 3: RAG Architecture Planner

Design a retrieval-augmented system for your specific knowledge base.

Your Task

Describe your organization's knowledge base — document types, volume, update frequency, and who queries it. The advisor will help you design the full RAG pipeline: chunking strategy, embedding model choice, vector DB selection, retrieval parameters, and generation setup. Ask about tradeoffs and edge cases.

Example: "We have 3,000 PDF policy documents updated monthly, a 200-page product manual revised quarterly, and employees ask ~500 questions/day. Sensitive HR data is included. We need citations on every answer."

RAG Architecture Advisor

Lab 3

Welcome to the RAG Architecture Lab. Tell me about your knowledge base: what types of documents, how many, how often they update, who queries them, and any sensitivity or compliance requirements. I'll design your chunking strategy, embedding approach, vector DB, retrieval parameters, and generation setup. What are we indexing?

Module 4 · Lesson 4

Evaluation, Observability & Cost Control

You cannot improve what you cannot measure. AI systems in production need the same rigor as any software.

How do you know if your AI feature is actually working — and how do you stop it from quietly burning your infrastructure budget?

In May 2024, Shopify's VP of Engineering publicly described their AI infrastructure bill as "the fastest-growing line item in the company" — and credited their cost-control work with keeping it manageable. That work included aggressive prompt caching (reducing repeated context tokens), output length limits, and model-tier routing. Companies that shipped AI features without these controls reported 3–5× higher-than-expected monthly API bills within sixty days.

The Evaluation Problem

Traditional software has deterministic outputs — you can write unit tests. AI outputs are probabilistic and often subjective. A customer support reply might be technically accurate but too terse; a summary might miss the key point without being factually wrong. Measuring this requires a different approach: LLM-as-judge, human evaluation pipelines, and golden dataset regression testing.

Method 1

LLM-as-Judge

Use a powerful model (GPT-4o, Claude 3 Opus) to score your production model's outputs against rubrics: accuracy, helpfulness, tone, citation quality. Scales to thousands of samples. Used by Anthropic's internal evals team.

Method 2

Golden Dataset

Maintain 100–500 hand-labeled input/output pairs. Run your pipeline against them after every prompt or model change. Catch regressions before users do. Analogous to regression test suites in traditional software.

Method 3

Human-in-Loop Sampling

Route 1–5% of production outputs to human reviewers. Tag errors, capture edge cases, and continuously expand the golden dataset. Closes the feedback loop from deployment to improvement.

Method 4

User Signal Proxies

Track downstream behavior: did the user edit the AI draft? Did they accept the suggestion? Did they ask a follow-up that implies confusion? Behavioral signals are imperfect but free and scalable.

Observability: Tracing Every Call

AI observability is the practice of logging, tracing, and monitoring model calls in production — not just catching errors but understanding behavior. The leading dedicated tools in 2024 were LangSmith (LangChain's observability product, launched 2023), Weights & Biases Weave, Helicone, and Arize AI. Each captures: the full prompt sent, the model's response, latency, token counts, cost, and any retrieved RAG context.

Without observability, debugging a hallucination in production is nearly impossible — you don't know what the model actually received as input, what context was retrieved, or whether the error was in retrieval, prompting, or generation.

Real Case — Scale AI's Evaluation Infrastructure (2023)

Scale AI, which provides data labeling and evaluation services, published that their enterprise AI evaluation pipeline runs LLM-as-judge at 95% of the scale of human evaluation — but at 1/40th the cost and 100× the speed. Their finding: for most quality dimensions (accuracy, format, completeness), GPT-4-class models as judges correlate at 0.85+ with expert human raters. For safety-critical dimensions, human review remains essential.

Cost Control in Production

Uncontrolled AI API spend is one of the most common operational failures at AI-first companies in 2024. The five most effective cost controls, used across Shopify, Notion, Intercom, and others:

Control	Mechanism	Typical Savings
Prompt Caching	Cache repeated system prompt tokens (OpenAI Prompt Cache, Anthropic Cache). Pay once, reuse thousands of times.	50–90% on system prompt tokens
Output Length Limits	Set max_tokens explicitly. Many tasks need 100 tokens; without limits, models can output 1,000+.	30–60% reduction
Model Tier Routing	Route simple tasks to mini/flash models. Reserve flagship for complex queries only.	40–70% on per-query cost
Semantic Caching	Cache full responses for semantically similar queries (GPTCache, Redis). Same question, same answer — don't call the API twice.	20–50% at scale
Budget Alerts	Set hard spend limits and daily alerts via provider dashboards (OpenAI, Anthropic) or middleware. Kill switches prevent runaway costs.	Prevents catastrophic overruns

The Eval-Deploy-Monitor Loop

Production AI requires a continuous loop: evaluate a change on your golden dataset → deploy with observability instrumented → monitor quality and cost metrics in real time → sample for human review → update golden dataset → repeat. Teams that skip evaluation before deployment consistently regress quality on edge cases that weren't tested. Teams that skip monitoring miss cost spikes and silent quality degradation.

LLM-as-JudgeUsing a powerful model to score another model's outputs against defined rubrics, scaling evaluation cheaply.

Golden DatasetA curated set of hand-labeled input/output pairs used for regression testing after every pipeline change.

ObservabilityLogging and tracing every AI call — full prompt, retrieved context, response, latency, cost — to enable debugging and monitoring.

Semantic CachingStoring AI responses and returning cached answers for semantically equivalent future queries without calling the API again.

The Professional Standard

An AI-first business that ships without evaluation, observability, and cost controls is not a mature AI company — it is running blind. The infrastructure for these is now cheap (LangSmith free tier, Helicone free tier, OpenAI usage dashboards). There is no excuse for skipping it. Instrument on day one.

Lesson 4 Quiz

Evaluation, Observability & Cost Control — 4 questions

1. Scale AI found that LLM-as-judge evaluation correlates with expert human raters at 0.85+ for most quality dimensions at what cost and speed advantage?

Correct. Scale AI's published findings: LLM-as-judge runs at 100× the speed and 1/40th the cost of human evaluation, with 0.85+ correlation for most quality dimensions. Safety-critical dimensions still require human review.

Scale AI found LLM-as-judge operates at 100× speed and 1/40th the cost of human evaluation, correlating at 0.85+ with expert raters on most dimensions. The exception: safety-critical assessments still need humans.

2. What is "prompt caching" and what savings does it typically produce?

Correct. OpenAI's Prompt Cache and Anthropic's prompt caching let you pay for system prompt tokens once and reuse across thousands of calls. Savings of 50–90% on those tokens are typical.

Prompt caching stores the KV-cache of your system prompt tokens so you pay to process them once, then reuse across many calls. The saving is on system prompt tokens specifically: 50–90% in typical deployments.

3. What does an AI observability tool like LangSmith capture that makes production debugging possible?

Correct. Observability tools capture the complete picture: full prompt, retrieved context, model response, latency, token counts, and cost. Without this, debugging a hallucination is nearly impossible — you don't know what the model actually saw.

LangSmith and similar tools capture: the full prompt sent (including system prompt + retrieved context), the model's response, latency, token counts, and cost per call. This complete trace is what makes production debugging possible.

4. A company sets no max_tokens limit in their production API calls. What is the most likely consequence?

Correct. Without max_tokens limits, models tend to generate verbose outputs. A task needing 100 tokens might produce 800+. At scale, this drives 30–60% unnecessary token spend. Always set explicit output limits matched to the task.

Without max_tokens limits, models often produce far more output than needed — a 100-token task might produce 800+ tokens. At production scale, this translates to 30–60% higher-than-necessary output token costs. Always set explicit limits.

Lab 4: Eval & Cost Audit

Audit your AI stack for evaluation gaps and cost inefficiencies.

Your Task

Describe an AI feature or product you're building or have shipped — or a hypothetical scenario. The advisor will conduct a structured audit: evaluation coverage, observability instrumentation, and cost-control mechanisms. You'll get specific recommendations with tooling and estimated savings.

Example: "We shipped a GPT-4o-powered email drafting assistant. It processes 5,000 emails/day. We have no evaluation pipeline — just user thumbs up/down. Our monthly API bill is $8,000 and growing 20% per month. We have no idea why."

Eval & Cost Audit Advisor

Lab 4

Welcome to the Eval & Cost Audit Lab. Describe an AI feature you've built or are planning — include the model(s) used, query volume, current evaluation approach (if any), and what you know about your cost breakdown. I'll audit for evaluation gaps, observability blind spots, and cost inefficiencies, with specific tooling recommendations and estimated savings. What are we auditing?

Module 4 Test

The AI-First Tech Stack — 15 questions · Score 80% or higher to pass

1. What is the approximate token count of a 500-word document?

Correct. At ~4 characters per token, 500 words (~2,500–3,000 characters) equals roughly 700 tokens.

At ~4 characters per token, a 500-word document (~2,500 characters) equals roughly 700 tokens.

2. Klarna's AI assistant, announced February 2024, handled what percentage of conversations without a human agent?

Correct. Klarna reported 79% of 2.3 million conversations resolved without a human agent by September 2024.

Klarna reported 79% of conversations resolved without human agents, across 2.3 million conversations by September 2024.

3. "Nano" or "edge" model tiers (e.g., GPT-3.5-turbo, Claude 3 Haiku) are best suited for:

Correct. Nano models excel at simple, high-volume tasks: intent classification, query routing, basic extraction — where speed and cost matter more than frontier-level reasoning.

Nano tier models are optimized for simple, high-volume tasks — classification, routing, extraction — where ultra-low latency and minimal cost are paramount, not complex reasoning.

4. The ReAct agent loop was introduced in a paper by which institution(s)?

Correct. ReAct (Reason+Act) was introduced in "ReAct: Synergizing Reasoning and Acting in Language Models" by Yao et al. from Princeton and Google in 2022.

ReAct was published in a 2022 paper by Yao et al. from Princeton University and Google Research, titled "ReAct: Synergizing Reasoning and Acting in Language Models."

5. In an agentic system, what is a DAG?

Correct. A DAG (Directed Acyclic Graph) defines a fixed workflow with no loops — steps execute in a deterministic order, making it auditable and cost-predictable.

DAG stands for Directed Acyclic Graph. In AI workflows, it describes a deterministic pipeline where steps flow in a fixed sequence with no cycles — the opposite of an autonomous agent's dynamic step selection.

6. Harvey AI's reported impact at Allen & Overy was:

Correct. Harvey AI's orchestrated legal pipeline at Allen & Overy delivered a reported 50% reduction in contract review time for associates using the system.

Harvey AI reported a 50% reduction in contract review time for participating associates at Allen & Overy — achieved through an orchestrated pipeline, not a single model call.

7. Morgan Stanley's internal GPT-4 assistant, deployed March 2023, solved the proprietary data problem by:

Correct. Morgan Stanley built a RAG pipeline over their 100,000 documents — embedding the library and retrieving relevant chunks at query time, without fine-tuning.

Morgan Stanley used RAG: they embedded their entire 100,000-document content library into a vector database and injected relevant chunks into each query's context. No fine-tuning was required.

8. What is the recommended chunk overlap in a RAG pipeline, and why?

Correct. A ~50-token overlap between adjacent chunks ensures that context split at a boundary is preserved in both neighboring chunks, preventing meaning loss.

A ~50-token overlap between adjacent chunks prevents context from being lost at boundaries. A sentence split across two chunks would lose meaning without overlap — the neighboring chunks share that boundary text.

9. Perplexity AI's core product was built on which architectural approach?

Correct. Perplexity's entire product is real-time web RAG: retrieve live search results, embed them, generate a cited answer. The foundation model is not proprietary — the retrieval pipeline is the product.

Perplexity built on real-time web RAG: every query retrieves live search results, embeds them, and generates a cited answer. They reached 10M+ monthly active users without owning a foundation model.

10. What does "LLM-as-judge" evaluation mean?

Correct. LLM-as-judge uses a separate, typically more powerful model to evaluate production outputs against rubrics — scaling evaluation to thousands of samples at low cost.

LLM-as-judge uses a separate powerful model (not the one being evaluated) to score outputs against rubrics like accuracy, helpfulness, and citation quality — scaling cheaply to thousands of samples.

11. What is semantic caching, and what savings does it produce at scale?

Correct. Semantic caching detects when a new query is semantically equivalent to a previous one and returns the cached response — no API call. Saves 20–50% on query costs at scale.

Semantic caching stores responses and serves them for semantically similar future queries without calling the API. Tools like GPTCache implement this. At scale, savings of 20–50% on per-query costs are typical.

12. A "golden dataset" in AI evaluation refers to:

Correct. A golden dataset is your regression test suite for AI: 100–500 hand-labeled examples that let you detect quality degradation after any change to prompts, models, or retrieval.

A golden dataset is a curated, hand-labeled collection of input/output pairs used as a regression test suite. After any change to prompts, models, or retrieval, you run the pipeline against it to catch quality regressions before users do.

13. Notion AI's approach to giving users access to their own workspace data was architecturally identical to:

Correct. Notion AI used RAG over each user's workspace — embedding content, retrieving relevant chunks at query time. The model itself was never updated with user data.

Notion AI used RAG: embed the workspace, retrieve relevant chunks at query time, generate against them. No model retraining or per-user fine-tuning. The knowledge base updates immediately when the user adds content.

14. Which of the following is NOT an AI observability tool mentioned in Lesson 4?

Correct. Pinecone is a vector database, not an observability tool. Observability tools mentioned include LangSmith, Weights & Biases Weave, Helicone, and Arize AI.

Pinecone is a vector database for RAG — not an observability tool. The observability tools covered are LangSmith, Weights & Biases Weave, Helicone, and Arize AI.

15. What is the correct order of the Eval-Deploy-Monitor loop described in Lesson 4?

Correct. The full loop: evaluate change on golden dataset → deploy with observability → monitor quality/cost → sample for human review → update golden dataset → repeat. Skipping evaluation before deployment is the most common failure mode.

The correct loop is: evaluate on golden dataset → deploy with observability instrumented → monitor quality and cost → sample production outputs for human review → update golden dataset → repeat. Always evaluate before deploying.