When the Allen Institute for AI set out to build Semantic Scholar's research recommendation engine, the team spent the first six weeks doing nothing but scoping. They wrote down what a success looked like in one sentence: "A researcher clicks a recommended paper they would not have found otherwise, within five minutes of arriving." Every subsequent architectural decision was tested against that sentence. The system shipped in eight months; rival projects with looser scope were still in design review two years later.
Most AI projects fail before a single model is trained. The failure mode is almost always the same: the team builds something technically impressive that solves a problem no one actually has, or solves a real problem in a way that can't be measured. Scoping is the discipline of collapsing infinite possibility into a single, testable, valuable artifact.
In 2021, Google's Perspective API team published a retrospective noting that their original scope โ "detect all toxic content" โ had to be narrowed to "detect comments likely to cause a moderator to remove a reply thread within 24 hours" before they could build a model that performed well enough to deploy. The measurement axis changed everything about the training data they needed.
A well-scoped AI project defines itself along five axes. Vagueness on any one of them will cause rework later.
One specific human whose life or workflow changes. Not "users" โ one archetype with a named context. The Semantic Scholar team wrote "a postdoc in immunology reading on a Thursday afternoon."
A single number that goes up or down. Not "better experience" โ a click-through rate, a latency, a review cycle time. If you cannot write it as a SQL query, it is not a metric yet.
Exactly what data goes in; exactly what the model produces. A string, a bounding box, a ranked list, a probability score. Ambiguity here means integration bugs in week six.
What error rate is acceptable? In what direction? A false positive in a cancer screening is catastrophic; in a spam filter it is a minor nuisance. The budget determines model choice before any training happens.
Where does inference run โ browser, mobile, server, edge device? What latency is tolerable? What is the update cadence? A model that must run offline on a $40 phone is scoped completely differently than a cloud batch job.
Every capstone project in this module begins with a one-sentence scope statement in this form:
For [specific user], when [specific trigger], the system will [specific output] so that [measurable outcome], with an acceptable error rate of [threshold].
Example: "For a small-business owner with no legal training, when they paste a contract clause into a web form, the system will return a plain-English summary and a risk flag (HIGH / MEDIUM / LOW) so that they can decide whether to hire a lawyer without reading the clause themselves, with an acceptable false-negative rate on HIGH risk of under 5%."
Notice what this forces: you now know your user (small-business owner), your input (pasted clause text), your output (summary + categorical label), your metric (lawyer-hire decision confidence), and your failure budget (5% false-negative ceiling on the dangerous class).
You will build one complete AI application across the four lessons of this module. Choose a domain from the list below, or propose your own. Each lab will advance your chosen project through a defined phase.
An LLM-powered tool that ingests PDF or text documents, extracts structured data, answers questions about the content, and flags anomalies. Inspired by Harvey AI's legal document pipeline (launched 2022) and the contract analysis workflows documented by Allen & Overy's deployment of GPT-4 in 2023.
A RAG system that answers questions against a proprietary knowledge base โ product docs, support tickets, internal wikis โ without hallucinating outside its corpus. Modeled on Klarna's customer service AI deployment (2024) which handled 2.3 million conversations in its first month.
An agent that reads pull request diffs, identifies bugs, security issues, and style violations, and posts structured comments. Modeled on GitHub Copilot's PR review feature (beta 2023) and Stripe's internal Sorbet type-checker AI tooling.
Propose your own. It must have a defined user, a measurable outcome, a clear input/output contract, and a realistic deployment context. The labs are domain-agnostic and will apply equally well.
The "AI for everything" trap. Projects that say the system will "understand the document" are unscopable. Understanding is not a verb that maps to an output. Replace with "extract the parties, dates, and payment terms into a JSON object."
The vanity metric. Model accuracy on a held-out test set is not a product metric. In 2023, the team behind Nabla's medical transcription AI (deployed in 50 US health systems) published that their model had 94% word-level accuracy but only 71% clinical accuracy โ the metric that actually mattered to physicians. Always scope to the downstream outcome.
The infinite scope creep. Scope documents must have a "not in scope" section that is longer than the "in scope" section. If you cannot name five things you are deliberately not building, you have not finished scoping.
Before moving to Lesson 2, you must complete the Lab below and produce a one-sentence scope statement for your chosen project domain. The Lab AI coach will review your statement against all five dimensions and push back until it is tight enough to build from. This statement will be the foundation every subsequent lesson builds upon.
Choose one of the four capstone options from Lesson 1 (Document Intelligence, RAG Chatbot, Code Review Pipeline, or your own domain). Draft a one-sentence scope statement using the template, then submit it to the AI coach below. The coach will critique it against all five scoping dimensions and push back until the statement is tight enough to build from.
You must produce a scope statement the coach approves before moving on. This statement will anchor every subsequent lab in this module.
When Klarna built its customer service AI โ which by February 2024 was handling the equivalent of 700 full-time agent workloads โ the engineering team published that they had deliberately chosen the "boring" architecture: a single stateless FastAPI service, a vector database (Weaviate), a retrieval layer, and Claude as the generation model. They explicitly rejected building a custom neural ranker because the operational complexity was not worth the marginal quality gain. The boring architecture shipped. The clever one would still be in design review.
Almost every production AI application โ regardless of domain โ decomposes into three layers. Understanding this decomposition before writing a single line of code prevents the most common architectural mistakes.
Layer 1 โ Interface. The surface the user or calling system touches. In a web app this is a REST endpoint or WebSocket. In a pipeline it is a queue consumer. This layer's only job is input validation, authentication, and routing. No business logic lives here.
Layer 2 โ Orchestration. The intelligence layer. It retrieves relevant context from Layer 3, constructs the prompt, calls the model, parses the structured output, and applies post-processing rules (filtering, formatting, confidence thresholding). This is where RAG chains, agent loops, and tool calls live.
Layer 3 โ Data. The persistence and retrieval layer. For RAG systems this is a vector store plus a metadata SQL database. For document intelligence it is object storage plus an extraction cache. This layer must be queryable independently of the model โ if you can't test your retrieval without calling the LLM, your architecture is coupled incorrectly.
The data pipeline has two distinct phases that many beginners collapse into one: ingestion (getting data in) and retrieval (getting the right data out). Confusing them causes latency problems at inference time and stale-data bugs in production.
The following table reflects the actual technology stack used in production RAG and document-intelligence deployments documented in 2023โ2024 engineering blogs from companies including Notion, Replit, Intercom, and Morgan Stanley's internal AI systems.
| Component | Recommended Starting Choice | Why |
|---|---|---|
| Web API Framework | FastAPI (Python) | Async, auto-docs, type-safe, fastest cold path for Python LLM apps |
| Vector Store | ChromaDB (local) โ Pinecone (production) | ChromaDB needs zero infra for prototyping; Pinecone scales to billions with managed SLA |
| Embeddings | text-embedding-3-small (OpenAI) | Best cost/quality ratio documented in MTEB benchmarks as of 2024 |
| LLM | Claude 3.5 Sonnet or GPT-4o | 128K context window handles long documents; structured output reliability |
| Orchestration | LangChain or direct SDK calls | LangChain for rapid prototyping; direct calls for production control |
| Document Parsing | PyMuPDF + Unstructured.io | Handles PDFs, Word, HTML; Unstructured's table extraction is production-grade |
| Observability | LangSmith or Helicone | Trace every LLM call; essential for debugging prompt failures in production |
Morgan Stanley's AI team, deploying their wealth management assistant in 2023, documented a principle they called "skeleton before muscles": build the complete data flow end-to-end with stub implementations before optimizing any single component. A skeleton system returns a hardcoded response but exercises every layer. This approach catches integration failures early โ before you've spent three weeks optimizing a retrieval algorithm that turns out to connect to the wrong database schema.
In Lab 2 you will produce a complete component diagram for your capstone project โ every box, every arrow, every data contract. The AI coach will probe your architecture for coupling errors, missing failure handling, and latency bottlenecks. You must resolve all critical issues before the diagram is approved.
Using your approved scope statement from Lab 1, design the complete architecture for your capstone project. Describe every component, the data contract between each layer, your technology choices, and how your ingestion pipeline differs from your retrieval pipeline.
The AI architect will probe for: coupling errors between layers, missing failure handling, latency bottlenecks, components that don't need to exist, and components you forgot. You must resolve all critical issues raised before the architecture is approved.
When Harvey AI deployed their legal document analysis system to law firms including A&O Shearman in 2023, the engineering team documented that their biggest reliability gain came not from model choice but from prompt architecture. They moved from free-form instructions to what they called "contract prompts" โ prompts where the output schema was embedded in the system message as a TypeScript type definition. JSON parse failures dropped by 94%. The model, it turned out, was far more reliable when the output format was specified in a language it had seen millions of times in training.
The playground is a single-turn, single-user, no-latency environment. Production is multi-turn, concurrent, latency-sensitive, and adversarial. Prompts that look good in the playground fail in production for three systematic reasons:
Context length pressure. In production, the context window fills with retrieved chunks, conversation history, and tool outputs. Prompts written assuming plenty of space start getting truncated. Always test prompts at the maximum context length you'll actually send.
User input variation. Playground testing uses your own well-formed inputs. Production users send typos, multi-language inputs, injection attempts, and questions entirely outside your intended scope. Your prompt must handle all of these gracefully.
Output parsing brittleness. If your prompt says "respond in JSON," the model sometimes adds markdown fences. Sometimes it adds commentary before the JSON. Sometimes it uses single quotes. A production prompt must produce output that is 100% programmatically parseable, every time.
Every production prompt for a structured-output task should have these four sections, in this order:
One or two sentences. No backstory, no personality. State the task and the output goal. "You are a contract analysis engine. Your task is to extract the parties, governing law, termination clauses, and payment terms from legal documents and return them in the specified JSON schema."
Embed the schema directly in the system prompt as a TypeScript interface, a JSON Schema, or an annotated example. Never describe the schema in prose โ that's ambiguous. Show the structure. Harvey AI's 94% parse-failure reduction came from this change alone.
If a field cannot be found, use null โ do not invent a value. If the document is not a contract, return error_type: "wrong_document". If the user asks a question outside scope, return error_type: "out_of_scope". Every edge case you can anticipate should have an explicit instruction.
Use a clear delimiter: <DOCUMENT>...</DOCUMENT> or triple backticks with a label. Never let retrieved content run directly into instruction text โ this allows prompt injection where malicious document content overwrites your instructions.
In any production LLM integration, you must design for three failure categories that will absolutely occur:
Timeouts, rate limits, service outages. Handle with: exponential backoff (3 retries, 1s/2s/4s delays), a circuit breaker that stops retrying after N consecutive failures, and a graceful degradation response to the user.
Model returns malformed JSON despite instructions. Handle with: a secondary extraction pass that tries to find JSON inside any response, a structured retry prompt ("Your previous response was not valid JSON. Here is what you returned: [response]. Please return only valid JSON."), then a fallback error state.
Model returns valid JSON with wrong or hallucinated content. Handle with: a confidence field in your schema, post-processing validation rules (e.g., governing_law must be a real jurisdiction), and a human-review flag triggered when confidence falls below threshold.
Prompts are code. Version them in source control. The team at Replit, building their AI coding assistant in 2023, documented that they kept a regression test suite of 200 input/output pairs and ran it against every prompt change before deployment. A prompt that improves performance on 80% of cases but breaks 20% is a regression, not an improvement.
Your capstone project should have at minimum a set of golden test cases: input/expected-output pairs that cover your happy path, your empty-field cases, your wrong-document case, and at least one prompt injection attempt. Run these manually before every prompt change.
Prompts that grow by accretion โ each new edge case appended as a new rule โ become impossible to reason about. When your prompt exceeds ~800 tokens of instructions, refactor: split into multiple specialized prompts, use routing logic to select the right prompt, or move rule enforcement into post-processing code where it's testable.
In Lab 3 you will write the complete system prompt for your capstone project, walk through it with the AI coach, handle all edge cases the coach throws at you, and produce a golden test case set with at least 5 input/expected-output pairs. The coach will attempt prompt injection and edge-case inputs against your prompt design.
Write the complete system prompt for your capstone project using the four-section structure from Lesson 3: Role & Objective, Output Schema, Rules & Constraints, and Context Slot with delimiters. Then submit it to the AI coach.
The coach will play adversarial user โ attempting prompt injection, submitting wrong-document types, sending ambiguous inputs, and probing every rule gap. You must iterate until your prompt handles all attacks and edge cases. Then produce 5 golden test cases (input description + expected output structure).
Before Intercom launched Fin โ their RAG-based customer support AI โ the team ran what they called a "shadow deployment" for three weeks. The model answered every incoming query in parallel with human agents, but its answers were only shown to internal evaluators, not customers. This gave them 50,000 real production queries with paired human answers to score against. When they found that Fin hallucinated product pricing data in 2.3% of cases, they fixed the retrieval pipeline before a single customer was affected. Fin launched with a documented hallucination rate of under 0.4%.
Evaluation for LLM applications operates at three levels simultaneously. Treating them as the same problem โ or skipping any of them โ produces systems that look good in demos but fail in production.
Your evaluation set should have at minimum 50 examples across these categories. The RAGAS framework (open-sourced by Exploding Gradients in 2023 and adopted by hundreds of production RAG teams) defines four metrics that are now the closest thing to an industry standard for RAG evaluation:
| RAGAS Metric | What It Measures | Acceptable Floor |
|---|---|---|
| Faithfulness | Are claims in the answer supported by the retrieved context? | > 0.85 |
| Answer Relevance | Does the answer actually address the question asked? | > 0.80 |
| Context Precision | Are the retrieved chunks actually useful for answering? | > 0.70 |
| Context Recall | Did retrieval find all the chunks needed to answer completely? | > 0.75 |
Post-evaluation, most teams have three levers to pull when a metric is below threshold. Knowing which lever to pull based on which metric is failing is the skill that separates senior AI engineers from juniors:
Intercom's team published that they used a 12-point readiness checklist before any AI feature went to production. The following is adapted from their published engineering blog, the OpenAI production deployment guide (2023), and Anthropic's model deployment documentation.
AI systems degrade in ways that traditional software does not. The knowledge base becomes stale. User query patterns drift. Model providers update underlying models silently. Intercom documented that Fin's performance dropped 6% over the first four months without any code change โ purely due to product updates making their knowledge base partially outdated. They now run automated evaluation against a rotating sample of production queries weekly.
Your capstone project should define: how often you re-run evaluation, what triggers a re-ingestion of the knowledge base, and what metric threshold triggers a rollback or human review escalation in production.
Lab 4 is your capstone integration review. You will present your complete project: scope statement, architecture diagram, system prompt with golden test cases, and an evaluation plan with metric targets. The AI coach will conduct a technical review, identify any remaining gaps, and sign off on your deployment readiness checklist. Completing Lab 4 qualifies you for the Module Test.
This is the culminating lab of the course. Present your complete capstone project to the AI technical reviewer. You must cover all four milestones in your presentation. The reviewer will probe every aspect and identify any gaps that would prevent safe deployment.