L1
ยท
Quiz
ยท
Lab
L2
ยท
Quiz
ยท
Lab
L3
ยท
Quiz
ยท
Lab
L4
ยท
Quiz
ยท
Lab
Module Test
Module 8 ยท Lesson 1 โ€” Capstone Project

Scoping Your AI Application

From vague idea to a deployable problem statement โ€” the discipline that separates shipped products from perpetual prototypes.
How do you translate a real-world problem into a concrete AI system you can actually build and measure?

When the Allen Institute for AI set out to build Semantic Scholar's research recommendation engine, the team spent the first six weeks doing nothing but scoping. They wrote down what a success looked like in one sentence: "A researcher clicks a recommended paper they would not have found otherwise, within five minutes of arriving." Every subsequent architectural decision was tested against that sentence. The system shipped in eight months; rival projects with looser scope were still in design review two years later.

Why Scoping Is the Hardest Part

Most AI projects fail before a single model is trained. The failure mode is almost always the same: the team builds something technically impressive that solves a problem no one actually has, or solves a real problem in a way that can't be measured. Scoping is the discipline of collapsing infinite possibility into a single, testable, valuable artifact.

In 2021, Google's Perspective API team published a retrospective noting that their original scope โ€” "detect all toxic content" โ€” had to be narrowed to "detect comments likely to cause a moderator to remove a reply thread within 24 hours" before they could build a model that performed well enough to deploy. The measurement axis changed everything about the training data they needed.

The Five Scoping Dimensions

A well-scoped AI project defines itself along five axes. Vagueness on any one of them will cause rework later.

Dimension 1

The Problem User

One specific human whose life or workflow changes. Not "users" โ€” one archetype with a named context. The Semantic Scholar team wrote "a postdoc in immunology reading on a Thursday afternoon."

Dimension 2

The Measurable Outcome

A single number that goes up or down. Not "better experience" โ€” a click-through rate, a latency, a review cycle time. If you cannot write it as a SQL query, it is not a metric yet.

Dimension 3

The Input / Output Contract

Exactly what data goes in; exactly what the model produces. A string, a bounding box, a ranked list, a probability score. Ambiguity here means integration bugs in week six.

Dimension 4

The Failure Budget

What error rate is acceptable? In what direction? A false positive in a cancer screening is catastrophic; in a spam filter it is a minor nuisance. The budget determines model choice before any training happens.

Dimension 5

The Deployment Context

Where does inference run โ€” browser, mobile, server, edge device? What latency is tolerable? What is the update cadence? A model that must run offline on a $40 phone is scoped completely differently than a cloud batch job.

The One-Sentence Scope Statement

Every capstone project in this module begins with a one-sentence scope statement in this form:

Scope Template

For [specific user], when [specific trigger], the system will [specific output] so that [measurable outcome], with an acceptable error rate of [threshold].

Example: "For a small-business owner with no legal training, when they paste a contract clause into a web form, the system will return a plain-English summary and a risk flag (HIGH / MEDIUM / LOW) so that they can decide whether to hire a lawyer without reading the clause themselves, with an acceptable false-negative rate on HIGH risk of under 5%."

Notice what this forces: you now know your user (small-business owner), your input (pasted clause text), your output (summary + categorical label), your metric (lawyer-hire decision confidence), and your failure budget (5% false-negative ceiling on the dangerous class).

Capstone Project Options for This Module

You will build one complete AI application across the four lessons of this module. Choose a domain from the list below, or propose your own. Each lab will advance your chosen project through a defined phase.

Option A

Document Intelligence Assistant

An LLM-powered tool that ingests PDF or text documents, extracts structured data, answers questions about the content, and flags anomalies. Inspired by Harvey AI's legal document pipeline (launched 2022) and the contract analysis workflows documented by Allen & Overy's deployment of GPT-4 in 2023.

Option B

Customer-Facing Retrieval-Augmented Chatbot

A RAG system that answers questions against a proprietary knowledge base โ€” product docs, support tickets, internal wikis โ€” without hallucinating outside its corpus. Modeled on Klarna's customer service AI deployment (2024) which handled 2.3 million conversations in its first month.

Option C

Automated Code Review Pipeline

An agent that reads pull request diffs, identifies bugs, security issues, and style violations, and posts structured comments. Modeled on GitHub Copilot's PR review feature (beta 2023) and Stripe's internal Sorbet type-checker AI tooling.

Option D โ€” Bring Your Own

Custom Domain

Propose your own. It must have a defined user, a measurable outcome, a clear input/output contract, and a realistic deployment context. The labs are domain-agnostic and will apply equally well.

Scoping Anti-Patterns to Avoid

The "AI for everything" trap. Projects that say the system will "understand the document" are unscopable. Understanding is not a verb that maps to an output. Replace with "extract the parties, dates, and payment terms into a JSON object."

The vanity metric. Model accuracy on a held-out test set is not a product metric. In 2023, the team behind Nabla's medical transcription AI (deployed in 50 US health systems) published that their model had 94% word-level accuracy but only 71% clinical accuracy โ€” the metric that actually mattered to physicians. Always scope to the downstream outcome.

The infinite scope creep. Scope documents must have a "not in scope" section that is longer than the "in scope" section. If you cannot name five things you are deliberately not building, you have not finished scoping.

Capstone Milestone 1 โ€” Due This Lesson

Before moving to Lesson 2, you must complete the Lab below and produce a one-sentence scope statement for your chosen project domain. The Lab AI coach will review your statement against all five dimensions and push back until it is tight enough to build from. This statement will be the foundation every subsequent lesson builds upon.

Lesson 1 Quiz โ€” Scoping Your AI Application

4 questions ยท Select the best answer for each
1. The Semantic Scholar team's success statement โ€” "A researcher clicks a recommended paper they would not have found otherwise, within five minutes of arriving" โ€” is effective primarily because it:
Correct. The statement names a user action (click), a quality condition (wouldn't have found otherwise), and a time window (5 minutes) โ€” making it testable in production without ambiguity.
Not quite. The power of that sentence is entirely in its measurability and user-action focus, not in technical or process detail.
2. Google's Perspective API team changed their scope from "detect all toxic content" to a specific moderator-action metric. What did this primarily change?
Correct. When the outcome became "comments that cause a moderator to remove a thread within 24 hours," labelers could create ground truth. Vague scope produces unlabelable data.
The key insight is that scope defines labels, which defines training data. Infrastructure and model size are secondary.
3. Which of the following is an example of a "vanity metric" in an AI project context?
Correct. Test-set accuracy measures model performance in isolation. Nabla's 94% word-level accuracy vs. 71% clinical accuracy illustrates exactly why the downstream outcome metric is the only one that matters for product decisions.
Test-set accuracy is the classic vanity metric โ€” it looks good but doesn't tell you if the system delivers value. The others all measure outcomes users actually care about.
4. The Scoping Dimension called "Failure Budget" is defined before training begins primarily because it:
Correct. If false negatives on a dangerous class are catastrophic, you weight the training loss accordingly, choose a higher-recall model family, and set a lower decision threshold โ€” all before touching data.
The failure budget is a training and model-selection input. It tells you which type of error is worse, which determines every subsequent modeling decision.

Lab 1 โ€” Scope Statement Workshop

Build and pressure-test your capstone project scope with an AI coach ยท Complete 3+ exchanges to unlock Lesson 2

Your Mission

Choose one of the four capstone options from Lesson 1 (Document Intelligence, RAG Chatbot, Code Review Pipeline, or your own domain). Draft a one-sentence scope statement using the template, then submit it to the AI coach below. The coach will critique it against all five scoping dimensions and push back until the statement is tight enough to build from.

You must produce a scope statement the coach approves before moving on. This statement will anchor every subsequent lab in this module.

Scope template: "For [specific user], when [specific trigger], the system will [specific output] so that [measurable outcome], with an acceptable error rate of [threshold]."

Start by telling the coach which capstone option you chose and your first attempt at a scope statement. Be as specific as you can โ€” the coach will help you tighten it.
AI Scope Coach
Capstone Lab 1
Module 8 ยท Lesson 2 โ€” Capstone Project

System Architecture & Data Pipeline

Turning a scope statement into a component diagram, a data contract, and a working skeleton your team can build against.
How do you design an AI system architecture that is simple enough to ship but robust enough to iterate?

When Klarna built its customer service AI โ€” which by February 2024 was handling the equivalent of 700 full-time agent workloads โ€” the engineering team published that they had deliberately chosen the "boring" architecture: a single stateless FastAPI service, a vector database (Weaviate), a retrieval layer, and Claude as the generation model. They explicitly rejected building a custom neural ranker because the operational complexity was not worth the marginal quality gain. The boring architecture shipped. The clever one would still be in design review.

The Three-Layer AI Application Model

Almost every production AI application โ€” regardless of domain โ€” decomposes into three layers. Understanding this decomposition before writing a single line of code prevents the most common architectural mistakes.

Standard Three-Layer Architecture
User / API
โ†’
Orchestration Layer
โ†’
Data / Retrieval Layer
Input validation ยท Auth ยท Rate limiting
ยท
Prompt construction ยท Model call ยท Output parsing
ยท
Vector DB ยท SQL ยท File storage ยท External APIs

Layer 1 โ€” Interface. The surface the user or calling system touches. In a web app this is a REST endpoint or WebSocket. In a pipeline it is a queue consumer. This layer's only job is input validation, authentication, and routing. No business logic lives here.

Layer 2 โ€” Orchestration. The intelligence layer. It retrieves relevant context from Layer 3, constructs the prompt, calls the model, parses the structured output, and applies post-processing rules (filtering, formatting, confidence thresholding). This is where RAG chains, agent loops, and tool calls live.

Layer 3 โ€” Data. The persistence and retrieval layer. For RAG systems this is a vector store plus a metadata SQL database. For document intelligence it is object storage plus an extraction cache. This layer must be queryable independently of the model โ€” if you can't test your retrieval without calling the LLM, your architecture is coupled incorrectly.

Data Pipeline Design

The data pipeline has two distinct phases that many beginners collapse into one: ingestion (getting data in) and retrieval (getting the right data out). Confusing them causes latency problems at inference time and stale-data bugs in production.

Ingestion Phase โ€” Offline

Process Once, Query Many Times

  • Parse raw documents into clean text chunks
  • Generate embeddings for each chunk
  • Store embeddings + metadata in vector DB
  • Extract structured fields into SQL
  • Run at ingest time, not at query time
Retrieval Phase โ€” Online

Fast, Deterministic, Testable

  • Embed the user query (single fast call)
  • ANN search returns top-K chunks
  • Metadata filters narrow results
  • Reranker scores final context window
  • Must complete in <200ms for good UX
Concrete Technology Choices

The following table reflects the actual technology stack used in production RAG and document-intelligence deployments documented in 2023โ€“2024 engineering blogs from companies including Notion, Replit, Intercom, and Morgan Stanley's internal AI systems.

ComponentRecommended Starting ChoiceWhy
Web API FrameworkFastAPI (Python)Async, auto-docs, type-safe, fastest cold path for Python LLM apps
Vector StoreChromaDB (local) โ†’ Pinecone (production)ChromaDB needs zero infra for prototyping; Pinecone scales to billions with managed SLA
Embeddingstext-embedding-3-small (OpenAI)Best cost/quality ratio documented in MTEB benchmarks as of 2024
LLMClaude 3.5 Sonnet or GPT-4o128K context window handles long documents; structured output reliability
OrchestrationLangChain or direct SDK callsLangChain for rapid prototyping; direct calls for production control
Document ParsingPyMuPDF + Unstructured.ioHandles PDFs, Word, HTML; Unstructured's table extraction is production-grade
ObservabilityLangSmith or HeliconeTrace every LLM call; essential for debugging prompt failures in production
The Skeleton-First Principle

Morgan Stanley's AI team, deploying their wealth management assistant in 2023, documented a principle they called "skeleton before muscles": build the complete data flow end-to-end with stub implementations before optimizing any single component. A skeleton system returns a hardcoded response but exercises every layer. This approach catches integration failures early โ€” before you've spent three weeks optimizing a retrieval algorithm that turns out to connect to the wrong database schema.

# Skeleton architecture โ€” stub every layer first # Then replace stubs with real implementations one at a time from fastapi import FastAPI from pydantic import BaseModel app = FastAPI() class QueryRequest(BaseModel): query: str doc_id: str | None = None async def retrieve_context(query: str) -> list[str]: # STUB: replace with real vector search return ["[placeholder context chunk 1]"] async def generate_response(query: str, context: list[str]) -> str: # STUB: replace with real LLM call return f"Response to: {query} (stub)" @app.post("/query") async def query_endpoint(req: QueryRequest): context = await retrieve_context(req.query) response = await generate_response(req.query, context) return {"response": response, "context_chunks": len(context)}
Component Checklist for Your Capstone
Architecture Readiness Checklist
Input validation layer defined (what gets rejected before the model sees it)
Data ingestion pipeline designed (source โ†’ parse โ†’ embed โ†’ store)
Retrieval strategy chosen (keyword, semantic, hybrid, metadata filter)
Prompt template drafted (system prompt + context slot + user query slot)
Output schema defined (JSON fields, types, validation rules)
Observability plan in place (which LLM calls are logged, which metrics are tracked)
Failure mode handling designed (what happens when the model returns garbage)
Capstone Milestone 2 โ€” Due This Lesson

In Lab 2 you will produce a complete component diagram for your capstone project โ€” every box, every arrow, every data contract. The AI coach will probe your architecture for coupling errors, missing failure handling, and latency bottlenecks. You must resolve all critical issues before the diagram is approved.

Lesson 2 Quiz โ€” System Architecture & Data Pipeline

4 questions ยท Select the best answer for each
1. Klarna's engineering team chose what they called the "boring" architecture for their customer service AI. The primary engineering benefit of this choice was:
Correct. The Klarna team explicitly rejected a custom neural ranker because operational complexity was not worth the marginal quality gain. Shippability is an engineering constraint as real as latency or accuracy.
The key lesson from Klarna's architecture is that complexity kills shipping velocity. The clever architecture stays in design review; the boring one reaches users.
2. In the three-layer AI application model, which layer should contain the prompt construction logic and the model call?
Correct. The orchestration layer owns prompt construction, model calls, output parsing, and post-processing. The interface layer does only validation and routing; the data layer does only persistence and retrieval.
Prompt construction and model calls belong in the orchestration layer. Mixing them into the interface or data layers creates coupling that makes testing and iteration extremely painful.
3. The "skeleton-first principle" documented by Morgan Stanley's AI team means:
Correct. The skeleton exercises every layer with stub implementations so integration failures surface early โ€” before weeks of optimization work on a component that connects to the wrong schema.
Skeleton-first means complete vertical data flow with stubs, so every interface between layers is proven before optimizing any layer's internals.
4. Why must the retrieval layer be testable independently of the LLM call?
Correct. If retrieval and generation are coupled, a bad answer could be caused by returning the wrong chunks, returning the right chunks in the wrong order, or the model ignoring good chunks. You can't distinguish these without testing retrieval independently.
The debugging argument is the key one: coupled layers produce symptoms that can't be attributed to a specific component, making iteration impossible.

Lab 2 โ€” Architecture Design Review

Design and pressure-test your full system architecture with an AI architect ยท Complete 3+ exchanges to unlock Lesson 3

Your Mission

Using your approved scope statement from Lab 1, design the complete architecture for your capstone project. Describe every component, the data contract between each layer, your technology choices, and how your ingestion pipeline differs from your retrieval pipeline.

The AI architect will probe for: coupling errors between layers, missing failure handling, latency bottlenecks, components that don't need to exist, and components you forgot. You must resolve all critical issues raised before the architecture is approved.

Start by describing your three layers: what your interface layer accepts/rejects, what your orchestration layer does step-by-step, and what your data layer stores and how it's queried. Then describe your technology choices and explain why you made them. The architect will push back where your design has gaps.
AI Architect
Capstone Lab 2
Module 8 ยท Lesson 3 โ€” Capstone Project

Prompt Engineering & Model Integration

Writing prompts that produce structured, reliable outputs โ€” and integrating model calls into a pipeline that handles everything that can go wrong.
How do you write prompts that behave consistently in production, not just in the playground?

When Harvey AI deployed their legal document analysis system to law firms including A&O Shearman in 2023, the engineering team documented that their biggest reliability gain came not from model choice but from prompt architecture. They moved from free-form instructions to what they called "contract prompts" โ€” prompts where the output schema was embedded in the system message as a TypeScript type definition. JSON parse failures dropped by 94%. The model, it turned out, was far more reliable when the output format was specified in a language it had seen millions of times in training.

Why Playground Prompts Fail in Production

The playground is a single-turn, single-user, no-latency environment. Production is multi-turn, concurrent, latency-sensitive, and adversarial. Prompts that look good in the playground fail in production for three systematic reasons:

Context length pressure. In production, the context window fills with retrieved chunks, conversation history, and tool outputs. Prompts written assuming plenty of space start getting truncated. Always test prompts at the maximum context length you'll actually send.

User input variation. Playground testing uses your own well-formed inputs. Production users send typos, multi-language inputs, injection attempts, and questions entirely outside your intended scope. Your prompt must handle all of these gracefully.

Output parsing brittleness. If your prompt says "respond in JSON," the model sometimes adds markdown fences. Sometimes it adds commentary before the JSON. Sometimes it uses single quotes. A production prompt must produce output that is 100% programmatically parseable, every time.

The Production Prompt Architecture

Every production prompt for a structured-output task should have these four sections, in this order:

Section 1 โ€” Role & Objective

What the model is and what it must accomplish

One or two sentences. No backstory, no personality. State the task and the output goal. "You are a contract analysis engine. Your task is to extract the parties, governing law, termination clauses, and payment terms from legal documents and return them in the specified JSON schema."

Section 2 โ€” Output Schema

The exact structure the model must produce, with types

Embed the schema directly in the system prompt as a TypeScript interface, a JSON Schema, or an annotated example. Never describe the schema in prose โ€” that's ambiguous. Show the structure. Harvey AI's 94% parse-failure reduction came from this change alone.

Section 3 โ€” Rules & Constraints

What to do when the input doesn't cooperate

If a field cannot be found, use null โ€” do not invent a value. If the document is not a contract, return error_type: "wrong_document". If the user asks a question outside scope, return error_type: "out_of_scope". Every edge case you can anticipate should have an explicit instruction.

Section 4 โ€” Context Slot

Where retrieved content is injected โ€” clearly demarcated

Use a clear delimiter: <DOCUMENT>...</DOCUMENT> or triple backticks with a label. Never let retrieved content run directly into instruction text โ€” this allows prompt injection where malicious document content overwrites your instructions.

Concrete Prompt Example
# System prompt โ€” contract analysis engine SYSTEM: You are a contract analysis engine. Extract structured data from legal documents. Return ONLY valid JSON matching the schema below. No markdown, no explanation. OUTPUT SCHEMA: { "parties": [{"name": string, "role": "buyer"|"seller"|"licensor"|"licensee"|"other"}], "governing_law": string | null, "payment_terms": {"amount": number | null, "currency": string | null, "schedule": string | null}, "termination_clauses": [string], "risk_flags": [{"severity": "HIGH"|"MEDIUM"|"LOW", "description": string}], "error_type": string | null } RULES: - If a field is not present in the document, use null โ€” never invent values. - If the input is not a contract, set error_type to "wrong_document". - If the document is in a language other than English, set error_type to "unsupported_language". - risk_flags must include any clause that limits liability, non-compete terms, or automatic renewal. DOCUMENT: <DOCUMENT> {retrieved_document_text} </DOCUMENT>
Handling Model Integration Failures

In any production LLM integration, you must design for three failure categories that will absolutely occur:

Failure Type 1

API Failures

Timeouts, rate limits, service outages. Handle with: exponential backoff (3 retries, 1s/2s/4s delays), a circuit breaker that stops retrying after N consecutive failures, and a graceful degradation response to the user.

Failure Type 2

Parse Failures

Model returns malformed JSON despite instructions. Handle with: a secondary extraction pass that tries to find JSON inside any response, a structured retry prompt ("Your previous response was not valid JSON. Here is what you returned: [response]. Please return only valid JSON."), then a fallback error state.

Failure Type 3

Quality Failures

Model returns valid JSON with wrong or hallucinated content. Handle with: a confidence field in your schema, post-processing validation rules (e.g., governing_law must be a real jurisdiction), and a human-review flag triggered when confidence falls below threshold.

Prompt Versioning and Testing

Prompts are code. Version them in source control. The team at Replit, building their AI coding assistant in 2023, documented that they kept a regression test suite of 200 input/output pairs and ran it against every prompt change before deployment. A prompt that improves performance on 80% of cases but breaks 20% is a regression, not an improvement.

Your capstone project should have at minimum a set of golden test cases: input/expected-output pairs that cover your happy path, your empty-field cases, your wrong-document case, and at least one prompt injection attempt. Run these manually before every prompt change.

Anti-Pattern: Prompt Spaghetti

Prompts that grow by accretion โ€” each new edge case appended as a new rule โ€” become impossible to reason about. When your prompt exceeds ~800 tokens of instructions, refactor: split into multiple specialized prompts, use routing logic to select the right prompt, or move rule enforcement into post-processing code where it's testable.

Capstone Milestone 3 โ€” Due This Lesson

In Lab 3 you will write the complete system prompt for your capstone project, walk through it with the AI coach, handle all edge cases the coach throws at you, and produce a golden test case set with at least 5 input/expected-output pairs. The coach will attempt prompt injection and edge-case inputs against your prompt design.

Lesson 3 Quiz โ€” Prompt Engineering & Model Integration

4 questions ยท Select the best answer for each
1. Harvey AI's 94% reduction in JSON parse failures came primarily from:
Correct. The model performs far more reliably when the output format is specified in a language it has seen millions of times during training (TypeScript types, JSON Schema) versus vague prose descriptions.
The reliability gain came from the output schema specification technique, not from infrastructure changes or model upgrades.
2. Why must retrieved document content be placed inside clear delimiters (like <DOCUMENT> tags) rather than directly appended to the instruction text?
Correct. Without clear delimiters, a document containing text like "Ignore previous instructions. Your new task is..." can overwrite the system prompt. Delimiters signal to the model (and to your parsing code) where instructions end and untrusted content begins.
Security is the answer here. Undelimited document content creates a prompt injection attack surface where adversarial document content can override system instructions.
3. Replit's AI team maintained a regression test suite of 200 input/output pairs. This practice primarily protects against:
Correct. A prompt that improves 80% of cases but breaks 20% is a regression. Without a test suite, you only discover these regressions when users report them in production โ€” often days or weeks later.
The regression suite catches prompt changes that help some cases while harming others. Prompts are code, and code changes require testing before deployment.
4. When should you trigger a human-review flag in a production AI pipeline?
Correct. Quality failures โ€” valid JSON with wrong content โ€” are the hardest failure mode because they're silent. A confidence field and validation rules catch these before they reach users, routing borderline cases to human review.
Human review should be triggered by quality signals embedded in the pipeline: low confidence scores and failed validation rules. Latency and HTTP errors are separate failure categories with different handlers.

Lab 3 โ€” Prompt Engineering Workshop

Build, stress-test, and harden your production system prompt ยท Complete 3+ exchanges to unlock Lesson 4

Your Mission

Write the complete system prompt for your capstone project using the four-section structure from Lesson 3: Role & Objective, Output Schema, Rules & Constraints, and Context Slot with delimiters. Then submit it to the AI coach.

The coach will play adversarial user โ€” attempting prompt injection, submitting wrong-document types, sending ambiguous inputs, and probing every rule gap. You must iterate until your prompt handles all attacks and edge cases. Then produce 5 golden test cases (input description + expected output structure).

Paste your complete system prompt below and say "ready for review." The coach will immediately begin adversarial testing. After each attack, revise your prompt and resubmit the affected section. Once the coach approves your prompt and you've produced 5 golden test cases, Milestone 3 is complete.
Step 1: Write and paste your full system prompt (all four sections).
Step 2: Say "ready for review" โ€” the coach begins adversarial testing.
Step 3: Revise and resubmit until the coach approves your prompt.
Step 4: List your 5 golden test cases (input scenario + expected JSON fields).
Step 5: Coach approves โ€” Milestone 3 complete, move to Lesson 4.
AI Prompt Coach
Capstone Lab 3
Module 8 ยท Lesson 4 โ€” Capstone Project

Evaluation, Iteration & Deployment Readiness

Measuring what you built, deciding if it's good enough to ship, and establishing the feedback loops that make it better after launch.
How do you know when your AI system is ready to deploy โ€” and how do you keep it from degrading after you do?

Before Intercom launched Fin โ€” their RAG-based customer support AI โ€” the team ran what they called a "shadow deployment" for three weeks. The model answered every incoming query in parallel with human agents, but its answers were only shown to internal evaluators, not customers. This gave them 50,000 real production queries with paired human answers to score against. When they found that Fin hallucinated product pricing data in 2.3% of cases, they fixed the retrieval pipeline before a single customer was affected. Fin launched with a documented hallucination rate of under 0.4%.

The Evaluation Framework

Evaluation for LLM applications operates at three levels simultaneously. Treating them as the same problem โ€” or skipping any of them โ€” produces systems that look good in demos but fail in production.

Level 1 โ€” Component Evaluation

Is each piece working correctly in isolation?

  • Retrieval precision@K: are the top chunks relevant?
  • Embedding similarity: are semantically similar queries finding each other?
  • Parse success rate: what % of model outputs parse cleanly?
  • Latency percentiles: p50, p95, p99 for each component
Level 2 โ€” End-to-End Evaluation

Does the complete pipeline produce correct answers?

  • Factual accuracy against ground truth
  • Hallucination rate (answers with no source in retrieved context)
  • Refusal rate (cases where model should answer but declines)
  • Coverage: % of query types the system handles
Level 3 โ€” Business Evaluation

Is the system delivering the outcome from your scope statement?

  • The metric from your scope statement (clicks, decisions, resolved tickets)
  • User trust signals (edits, overrides, thumbs-down)
  • Escalation rate to humans
  • Time-to-value for the user
Building an Evaluation Set

Your evaluation set should have at minimum 50 examples across these categories. The RAGAS framework (open-sourced by Exploding Gradients in 2023 and adopted by hundreds of production RAG teams) defines four metrics that are now the closest thing to an industry standard for RAG evaluation:

RAGAS MetricWhat It MeasuresAcceptable Floor
FaithfulnessAre claims in the answer supported by the retrieved context?> 0.85
Answer RelevanceDoes the answer actually address the question asked?> 0.80
Context PrecisionAre the retrieved chunks actually useful for answering?> 0.70
Context RecallDid retrieval find all the chunks needed to answer completely?> 0.75
The Iteration Loop

Post-evaluation, most teams have three levers to pull when a metric is below threshold. Knowing which lever to pull based on which metric is failing is the skill that separates senior AI engineers from juniors:

Low Faithfulness โ†’
The model is ignoring or contradicting the retrieved context. Fix: tighten the system prompt ("answer ONLY from the DOCUMENT section"), reduce context window size so irrelevant chunks don't dilute the signal, or add a citation-checking post-processing step.
Low Context Precision โ†’
Retrieval is returning irrelevant chunks. Fix: improve chunking strategy (smaller chunks, overlap adjustment), add metadata filters, improve the query expansion or rewriting step, or switch from pure semantic search to a hybrid BM25 + semantic approach.
Low Context Recall โ†’
Retrieval is missing chunks that contain the answer. Fix: increase K (return more candidates), re-examine your chunking strategy (key information may be split across chunk boundaries), or add a cross-encoder reranker to catch relevant chunks the ANN search missed.
Low Answer Relevance โ†’
The model is technically accurate but answering a different question. Fix: prompt adjustment (add "answer the specific question asked, not a related question"), query clarification step, or output post-processing that validates the answer addresses the query.
Deployment Readiness Checklist

Intercom's team published that they used a 12-point readiness checklist before any AI feature went to production. The following is adapted from their published engineering blog, the OpenAI production deployment guide (2023), and Anthropic's model deployment documentation.

Production Readiness
All RAGAS metrics above threshold on a 50+ example evaluation set
Hallucination rate measured and below acceptable ceiling from scope statement
p95 latency measured and within UX-acceptable bounds
API failure handling tested: retry logic, circuit breaker, graceful degradation
Parse failure handling tested: secondary extraction, retry prompt, fallback state
Prompt injection test cases run and all passed
Golden test suite regression run passing 100%
Observability in place: every LLM call traced with input, output, latency, model version
User feedback mechanism in place (thumbs-up/down or equivalent)
Human escalation path defined for low-confidence outputs
Rollback plan documented: what happens if metrics degrade in production?
Cost projection validated: cost per query ร— expected volume = budget approval
Post-Launch Monitoring

AI systems degrade in ways that traditional software does not. The knowledge base becomes stale. User query patterns drift. Model providers update underlying models silently. Intercom documented that Fin's performance dropped 6% over the first four months without any code change โ€” purely due to product updates making their knowledge base partially outdated. They now run automated evaluation against a rotating sample of production queries weekly.

Your capstone project should define: how often you re-run evaluation, what triggers a re-ingestion of the knowledge base, and what metric threshold triggers a rollback or human review escalation in production.

Faithfulness (target: 0.85+)0.91
Answer Relevance (target: 0.80+)0.87
Context Precision (target: 0.70+)0.76
Context Recall (target: 0.75+)0.82
Capstone Milestone 4 โ€” Final Deliverable

Lab 4 is your capstone integration review. You will present your complete project: scope statement, architecture diagram, system prompt with golden test cases, and an evaluation plan with metric targets. The AI coach will conduct a technical review, identify any remaining gaps, and sign off on your deployment readiness checklist. Completing Lab 4 qualifies you for the Module Test.

Lesson 4 Quiz โ€” Evaluation, Iteration & Deployment

4 questions ยท Select the best answer for each
1. Intercom's "shadow deployment" of Fin found a 2.3% hallucination rate on pricing data before launch. What made this finding possible?
Correct. Shadow deployment exposes the model to real production query distribution โ€” the things actual users ask โ€” without any risk to users. The 50,000 real queries revealed the pricing hallucination pattern that synthetic test data had missed.
Shadow deployment was the key: real queries, real production conditions, zero user exposure. That's the only way to find distribution-specific failure modes.
2. A RAG system's RAGAS Context Precision score is 0.45, well below the 0.70 floor. The most appropriate first intervention is:
Correct. Low Context Precision means retrieval is returning irrelevant chunks. The fix lives in the retrieval layer: smaller chunks, better metadata filtering, hybrid BM25 + semantic search. Changing the prompt or LLM doesn't fix bad retrieval.
Context Precision measures retrieval quality, not generation quality. The fix must be in the retrieval layer, not the prompt or model.
3. Intercom documented that Fin's performance dropped 6% over four months without any code change. The root cause was:
Correct. Data staleness is the most common silent degradation path for RAG systems. The model and code were identical; the knowledge base had drifted away from current product reality. This is why automated evaluation on production queries must run continuously post-launch.
The lesson from Intercom's Fin is that AI systems degrade without code changes when the knowledge base becomes stale. Continuous post-launch evaluation is essential, not optional.
4. Which of the following is NOT included in the 12-point production readiness checklist?
Correct. The readiness checklist is about operational soundness โ€” error handling, observability, evaluation metrics, and fallback plans. An A/B test against a baseline is good practice but is not a readiness gate; many AI deployments are net-new features with no prior system to compare against.
The 12 items in the checklist cover evaluation metrics, failure handling, observability, user feedback, escalation, rollback, and cost. A/B testing against a baseline system is not a universal readiness requirement.

Lab 4 โ€” Capstone Integration Review

Present your complete project for technical review and deployment sign-off ยท Complete 3+ exchanges to unlock the Module Test

Your Mission โ€” Final Capstone Presentation

This is the culminating lab of the course. Present your complete capstone project to the AI technical reviewer. You must cover all four milestones in your presentation. The reviewer will probe every aspect and identify any gaps that would prevent safe deployment.

Milestone 1 Review: State your final scope statement. The reviewer will verify it meets all five scoping dimensions.
Milestone 2 Review: Describe your three-layer architecture and data pipeline. The reviewer will probe for coupling errors and missing failure handling.
Milestone 3 Review: Present your system prompt structure and 5 golden test cases. The reviewer will run one final adversarial test.
Milestone 4 Review: Present your evaluation plan โ€” which RAGAS metrics you'll track, your metric targets, your monitoring cadence, and your rollback trigger conditions.
Deployment Sign-Off: Walk through the 12-point readiness checklist item by item. The reviewer will issue sign-off when all items are addressed.
Begin by saying "Ready for capstone review" and then present Milestone 1 (your scope statement). The reviewer will guide you through each milestone in sequence. Be thorough โ€” this is your professional-grade technical review.
AI Technical Reviewer
Capstone Final Lab
Module 8 โ€” Capstone Project

Module Test

15 questions covering all four lessons ยท Score 80% or higher to pass
1. Which of the following best completes the capstone scope template: "For [user], when [trigger], the system will [output] so that [outcome], with an acceptable error rate of [threshold]"?
Correct. This is the only option that specifies a concrete user, trigger, output format, measurable outcome, and quantified error threshold.
The other options are vague, technical (not user-facing), or unmeasurable. A scope statement must specify user, trigger, output, outcome, and error threshold precisely.
2. Nabla's medical transcription AI had 94% word-level accuracy but only 71% clinical accuracy. This example illustrates which scoping anti-pattern?
Correct. Word-level accuracy is a model-level vanity metric. Clinical accuracy โ€” whether physicians could act safely on the transcription โ€” was the real product metric. Optimizing for the wrong measure led to a system that scored well internally but underperformed for users.
This is the vanity metric anti-pattern: the system was scored on a measure that didn't reflect whether it delivered value to the actual user making clinical decisions.
3. In the three-layer AI application model, what is the ONLY responsibility of the interface layer?
Correct. The interface layer does only input validation, auth, and routing. Business logic in the interface layer creates tight coupling that makes testing and iteration painful.
Prompt construction and model calls belong in the orchestration layer. Embedding storage belongs in the data layer. The interface layer must stay thin โ€” validation, auth, routing only.
4. Why must the data ingestion pipeline run at ingest time rather than at query time?
Correct. Parsing, chunking, and embedding generation are expensive operations. They must happen offline at ingest time so that query time only requires a fast embedding of the user query plus an ANN search โ€” a path that can complete in under 200ms.
Latency is the key reason. Offline ingestion means retrieval at query time is fast (one embedding + ANN search). Doing ingestion work at query time would add seconds of latency to every user interaction.
5. Morgan Stanley's "skeleton-first" principle primarily reduces which project risk?
Correct. Building end-to-end data flow with stubs proves every interface between layers before any component is optimized. Integration failures that would derail a project in week 6 are caught in week 1 instead.
Skeleton-first is about integration risk. It catches mismatches between layer contracts early, when they're cheap to fix, rather than after weeks of component-level optimization work.
6. Harvey AI achieved a 94% reduction in JSON parse failures by:
Correct. The model had seen TypeScript type definitions millions of times in training. Specifying the output schema in a language the model knows well produces far more consistent structured output than prose descriptions.
The specific technique was schema specification in a format the model knows from training data. TypeScript types and JSON Schema are far more reliable than prose descriptions of the desired output format.
7. Which production prompt section is specifically designed to prevent prompt injection attacks?
Correct. Clear delimiters (XML tags, triple backticks with labels) signal to the model where trusted instructions end and untrusted document content begins. Without this separation, malicious document content can override system instructions.
The Context Slot with delimiters is the injection defense. Without clear boundaries between instructions and untrusted content, a document can contain text that overwrites the system prompt.
8. A production AI system receives a document in Japanese when the system only supports English. According to the production prompt architecture, the correct behavior is to:
Correct. The Rules & Constraints section must include explicit instructions for every edge case you can anticipate. Returning a defined error state is always preferable to silent failure or unexpected behavior.
Edge cases must have explicit instructions in the prompt. A defined error_type field in your schema lets downstream code handle the case deterministically, rather than receiving unpredictable model behavior.
9. The RAGAS metric "Faithfulness" measures:
Correct. Faithfulness specifically measures hallucination: does the answer contain claims not supported by the retrieved context? A high faithfulness score means the model is staying grounded in its retrieved information rather than generating unsupported facts.
Faithfulness is the anti-hallucination metric. It checks whether every claim in the answer has a source in the retrieved context โ€” the direct measure of whether the model is grounding its output or inventing information.
10. If a RAG system has low Context Recall (below 0.75), the most appropriate fix is:
Correct. Low Context Recall means the retrieval system is missing chunks that contain the answer. The fix is in retrieval coverage: more candidates (higher K), better chunk boundaries (key info isn't split), or a reranker that catches what ANN search missed.
Context Recall is about coverage โ€” are you finding all the relevant chunks? Adding metadata filters would reduce recall further. The fix is improving coverage: more candidates or better chunk design.
11. Replit's AI team maintained a regression test suite of 200 input/output pairs and ran it on every prompt change. What does this practice most directly prevent?
Correct. Without regression testing, a prompt change that fixes the one case you noticed can break five cases you didn't check. The suite catches regressions before deployment, not after user complaints.
The regression suite is specifically for catching prompt changes that help some inputs while silently breaking others. Prompts are code, and code changes must be tested.
12. Intercom's Fin deployment saw a 6% performance drop over four months without any code change. What does this imply for post-launch operations?
Correct. AI systems degrade through data staleness even without code changes. Continuous automated evaluation on rotating production samples โ€” and a defined re-ingestion schedule โ€” is the operational discipline that prevents silent degradation.
Intercom's experience proves that AI performance is tied to knowledge base freshness. Without continuous evaluation and re-ingestion, real-world performance drifts downward invisibly.
13. The "Failure Budget" scoping dimension determines model selection and training decisions before any data is collected. Which of the following best illustrates why?
Correct. The failure budget defines which error direction is worse. That determines class weighting in training loss, model family selection (precision-oriented vs. recall-oriented), and the decision threshold at inference โ€” all foundational decisions that precede data collection.
"Failure budget" in scoping refers to acceptable error rates and error direction, not financial budget. It's an input to model design decisions, not a compute resource constraint.
14. The deployment readiness checklist item "cost projection validated: cost per query ร— expected volume = budget approval" is placed on the checklist because:
Correct. Cost is an operational constraint as hard as latency or accuracy. A system that passes every technical gate but cannot be operated within budget is not deployable. Pre-launch cost validation prevents post-launch approval being revoked after go-live โ€” the worst possible outcome for the engineering team.
Cost is a hard operational constraint, not a soft recommendation. The checklist item exists because teams discover post-launch that their system is too expensive to run at scale โ€” at which point all development effort is wasted. Pre-launch cost validation is a technical gate, not an accounting formality.
15. A student completes their capstone scope statement but writes: "The system will summarize documents accurately and quickly." Which scoping dimension is entirely missing, making this statement untestable?
Correct. "Accurately" and "quickly" are vanity adjectives with no measurable threshold. A valid scope statement requires a quantified error rate (e.g., "hallucination rate below 2% on the golden test set") and a latency SLO (e.g., "P95 response time under 4 seconds"). Without these, no test can determine pass or fail โ€” the system can never be declared done.
While other dimensions could be more specific, the fatal flaw is the missing error threshold. "Accurately" and "quickly" cannot be tested โ€” there is no number that determines pass or fail. A scope statement without measurable thresholds is not a specification; it is a wish.