When OpenAI opened GPT-4 access through its API in March 2023, the first wave of developers discovered a jarring reality: the model itself was the easy part. Stripe's engineering team, among the earliest to integrate GPT-4 for their support documentation system, spent roughly three weeks on the model and eight weeks on the surrounding plumbing β rate-limit handling, token counting, streaming response parsing, and cost attribution per request. The lesson echoed across hundreds of early integrations: APIs are interfaces, and interfaces have rules that cost you before you ever hit the intelligence.
An Application Programming Interface (API) is a defined contract between two software systems. When a company like Anthropic, OpenAI, or Google exposes an AI model through an API, they are saying: "Send us a specific JSON payload at this HTTPS endpoint, and we will return a specific JSON response." Everything else β the model weights, the inference hardware, the safety filters β is abstracted away on their side.
For AI specifically, the request payload typically contains three things: a model identifier (e.g., claude-sonnet-4), a messages array describing the conversation so far, and a set of parameters (maximum tokens, temperature, stop sequences). The response returns the model's generated text plus metadata: token counts consumed, stop reason, and the model version that actually served the request.
Every AI API call is stateless by design. The server does not remember your previous request. Conversation context must be re-sent in full with each new request β the "messages array" is the application's responsibility to maintain and pass.
Access to AI APIs is gated by API keys β long cryptographic strings issued by the provider when you create an account. The key travels in the HTTP header of every request (typically as Authorization: Bearer sk-β¦) and serves two purposes: identifying who is making the request for billing, and authorizing that the caller has permission to use the service.
A leaked API key is a serious incident. In February 2023, a developer accidentally committed an OpenAI key to a public GitHub repository; within eleven minutes, automated scanners had found it and begun making requests. OpenAI's own systems detected the anomalous usage pattern and revoked the key β but the developer had already accumulated $1,200 in charges that OpenAI ultimately waived given the circumstances. Most providers now offer usage alerts and automatic key rotation precisely because of incidents like this.
A typical AI API integration follows a predictable lifecycle. The client application assembles a request object, attaches authentication headers, and sends an HTTPS POST to the provider's endpoint. On the provider's infrastructure, the request is authenticated, queued, dispatched to inference hardware, processed by the model, safety-filtered, and a response assembled β all within seconds. The response arrives as a JSON object the client application must parse.
Two response patterns exist: synchronous, where the full response arrives when complete, and streaming, where tokens arrive incrementally as the model generates them. Streaming dramatically improves perceived latency and is the default for most chat interfaces β it is why ChatGPT's text appears word by word rather than all at once.
Stripe's 2023 developer survey found that teams integrating AI APIs spent an average of 43% of integration time on error handling and retry logic β not on prompt design. Rate limits, network timeouts, and partial streaming failures are routine, not edge cases.
By 2024, developers could choose from multiple major AI API providers: OpenAI (GPT-4o, GPT-4o mini), Anthropic (Claude 3.5 Sonnet, Claude Haiku), Google (Gemini 1.5 Pro, Gemini Flash), Meta's Llama models via third-party hosts, and Mistral AI. Each differs in pricing per million tokens, context window size, rate limits, latency, and performance on specific task types.
The practical decision framework involves three axes: cost (smaller/faster models like GPT-4o mini or Claude Haiku cost 10β20Γ less than flagship models), context window (tasks involving long documents need models with 128k+ token windows), and capability floor (some tasks, particularly complex reasoning, genuinely require flagship models; others work fine on smaller ones). A/B testing with real production data, not benchmarks alone, is the reliable method for choosing.
In this lab you'll work with an AI assistant that understands AI API mechanics. Ask it to explain request structures, help you reason through parameter choices (temperature, max tokens, model selection), or walk through what happens during rate limiting and error handling.
Try asking it to help you design an API integration for a specific use case, or to explain the difference between synchronous and streaming responses in practical terms.
When GitHub shipped Copilot to general availability in June 2022, its most surprising engineering challenge was not the model β it was the system prompt. GitHub's team maintained what they internally called the "ghost layer": a carefully engineered prompt that instructed Copilot to behave as a pair programmer, not a code generator, to refuse certain categories of requests, and to format suggestions in ways that matched IDE constraints. Over 18 months the ghost layer grew from 200 tokens to over 1,000 as the team discovered edge cases β situations where without explicit instruction the model would produce plausible-sounding but subtly incorrect completions. Prompt engineering was not a workaround for them; it was the product.
Most AI APIs distinguish between three message roles: system, user, and assistant. The system message is sent before any user interaction and establishes the model's operating context β its persona, constraints, output format requirements, and behavioral rules. It is the part of the conversation the end user typically never sees.
A well-designed system prompt accomplishes several things: it defines scope ("You are a technical support agent for Acme Cloud Storage β answer questions only about Acme products"), establishes tone and format requirements ("Respond in three sentences or fewer unless the user explicitly asks for detail"), encodes safety behaviors ("Never share information about competitors"), and provides any persistent context the model needs ("The current date isβ¦ The user's account tier isβ¦").
Anthropic's documentation for Claude notes that system prompts are processed before user messages in every request and consume tokens from the context window. A 2,000-token system prompt at $3/million tokens costs $0.006 per conversation β trivial individually, but significant at millions of daily requests.
Few-shot prompting means including examples of desired input-output pairs directly in the prompt before asking the model to perform a task. Rather than describing what you want abstractly, you show it. This technique, introduced systematically in the GPT-3 paper by Brown et al. in 2020, consistently outperforms instruction-only prompting on structured tasks like classification, extraction, and format-specific generation.
In production, few-shot examples are often stored separately and assembled dynamically β selecting the most relevant examples for each query using semantic similarity. Notion AI, for instance, uses a retrieval step to select which few-shot examples of its writing assistant behavior to include based on the document type the user is editing.
Unstructured prose from an AI model is difficult to integrate into an application reliably. The solution is instructing the model to produce structured output β typically JSON β that can be parsed programmatically. OpenAI introduced a "JSON mode" in November 2023 guaranteeing valid JSON output; Anthropic's Claude follows explicit formatting instructions with high reliability when properly specified.
A typical structured output instruction might read: "Respond only with a JSON object with keys: 'category' (string), 'confidence' (float 0β1), 'reasoning' (string, max 50 words). No other text." The application then parses this JSON rather than trying to extract information from free-form prose.
When an AI application processes user-supplied text and that text is passed to a model that also has a system prompt, users can attempt prompt injection β crafting input that instructs the model to ignore or override its system prompt. In September 2023, security researcher Johann Rehberger demonstrated successful prompt injection against Bing Chat, causing it to reveal portions of its system prompt and change its behavior. Microsoft subsequently patched the attack surface.
Defenses include instructing the model in the system prompt to be skeptical of instructions in user text, separating user input clearly in the message structure, and validating outputs against expected schemas rather than trusting the model's self-reported compliance.
The most robust AI applications treat prompts as code: they are version-controlled, tested against regression suites, reviewed before deployment, and monitored in production. Klarna's AI team runs automated evaluations against 200+ test cases before any system prompt change goes live.
Practice designing effective system prompts with an AI advisor that specializes in prompt engineering. Describe a use case, and it will help you craft system prompts, identify weaknesses, suggest few-shot examples, and think through prompt injection risks.
Challenge it to critique a system prompt you write β or ask it to help you design one from scratch for a real application scenario you have in mind.
When Notion shipped its AI features in early 2023, the team faced a constraint that defines nearly every enterprise AI deployment: the model knew nothing about the user's own workspace. GPT-4 had no knowledge of a particular company's internal policies, project histories, or proprietary processes. Notion's solution was Retrieval-Augmented Generation β a system that, before calling the language model, searched the user's workspace for relevant documents, extracted the most pertinent passages, and injected them into the prompt as context. The model received a question plus a curated set of facts from the user's own data. It did not need to have been trained on that data; it simply received it as text and reasoned over it.
Retrieval-Augmented Generation (RAG) is an architectural pattern for AI applications that combines a retrieval system with a generative model. Instead of asking the model to answer from parametric memory alone (information encoded in its weights during training), RAG first retrieves relevant documents from an external corpus, then passes those documents β along with the user's question β to the model as context.
RAG solves three of the most critical limitations of pure language model deployments: knowledge cutoff (the model's training data ends at a fixed date, but your document store can be updated continuously), hallucination on specifics (models are more accurate when reasoning over retrieved facts than when generating from memory), and proprietary data access (you can expose company-specific information to the model at inference time without retraining).
Retrieval in RAG is almost universally done through vector embeddings. An embedding model converts text β a sentence, a paragraph, a document chunk β into a high-dimensional numerical vector (commonly 768 to 3,072 dimensions) that captures semantic meaning. Texts with similar meaning produce vectors that are geometrically close in this high-dimensional space.
The process works as follows: at index time, every document in your corpus is split into chunks, each chunk is converted to a vector by an embedding model, and those vectors are stored in a vector database (Pinecone, Weaviate, Chroma, or pgvector in PostgreSQL). At query time, the user's question is also embedded, and the database returns the N document chunks whose vectors are closest to the query vector β the most semantically relevant passages. These retrieved chunks are then passed to the language model as context.
Morgan Stanley deployed a RAG system in 2023 that indexed over 100,000 research reports and financial documents for its financial advisors. The system used GPT-4 as the generative layer. Internal evaluations showed the system reduced the time advisors spent searching for information by approximately 30%, with high citation accuracy because answers were grounded in retrieved source documents.
RAG is not a solved problem β the quality of generation is bounded by the quality of retrieval. If the retrieval step fails to surface the relevant passage, the model answers from parametric memory and may hallucinate. Common failure modes include: chunking mismatches (an answer requires context that spans chunk boundaries), query-document mismatch (the user's question uses different vocabulary than the document, reducing cosine similarity), and context window overflow (retrieving too many chunks exhausts the model's context window).
Advanced RAG implementations address these with reranking (a second model scores retrieved chunks for relevance before they go into the prompt), query rewriting (the model first rephrases the user's question to match document vocabulary), and hybrid search (combining semantic vector search with keyword-based BM25 search for better recall).
For most enterprise use cases, RAG outperforms fine-tuning for knowledge integration. Fine-tuning is expensive, slow to update, and does not improve factual grounding. RAG can be updated in real time by adding documents to the vector store β no retraining required. The 2023 meta-analysis by Lewis et al. at Meta AI, who originally proposed RAG, confirmed this pattern holds across most knowledge-intensive tasks.
Fine-tuning β adjusting model weights on domain-specific data β remains valuable for changing style and format behavior rather than factual knowledge. If your application requires the model to respond in a very specific tone, use particular jargon, or follow rigid output formats that prompt engineering alone cannot achieve consistently, fine-tuning helps. For factual knowledge and current information, RAG is almost always the better choice.
Several organizations have found that combining both works well: fine-tune for style and persona, then layer RAG on top for factual grounding. Intercom's Fin AI product used this approach in 2023 β fine-tuned on support conversation patterns for tone, with RAG over the company's help documentation for factual accuracy.
This lab advisor specializes in RAG architecture. Describe a use case β a company's internal knowledge base, product documentation, research archives β and explore the design decisions: chunking strategy, embedding model choice, vector database selection, retrieval parameters, and how to handle retrieval failures.
You can also ask it to compare RAG vs. fine-tuning for a specific scenario, or to explain how reranking and hybrid search improve retrieval quality.
In February 2024, the British Columbia Civil Resolution Tribunal ruled against Air Canada after its AI chatbot gave a passenger incorrect information about bereavement fare refund policies β information that contradicted Air Canada's own policies. The chatbot had hallucinated a policy that did not exist, the passenger had relied on it, and a court found Air Canada responsible for the chatbot's statements. Air Canada's lawyers had argued the chatbot was "a separate legal entity" β a position the tribunal dismissed. The absence of output monitoring had allowed an incorrect, confident answer to stand unchallenged until a customer acted on it. The case became a widely cited example in AI governance discussions of why evaluation and observability are not optional.
Evaluating an AI application requires different approaches at different stages. Offline evaluation happens before deployment: you build a test set of inputs with known expected outputs, run your system against them, and measure accuracy, format compliance, refusal rate, and latency. This is analogous to unit and integration testing in traditional software β a prerequisite for deployment, not a one-time exercise.
Evaluation metrics depend on task type. For classification tasks: accuracy, precision, recall, F1. For extraction tasks: exact-match rate, field-level accuracy. For generation tasks: human evaluation, or automated metrics like ROUGE (recall-oriented understudy for gisting evaluation), BERTScore, or LLM-as-judge (using a second model to evaluate outputs). OpenAI's evals framework, open-sourced in 2023, provides infrastructure for this.
Using a capable LLM to evaluate the outputs of another LLM has become a standard evaluation technique. The evaluator model is given a rubric and asked to score responses for accuracy, helpfulness, and safety. Anthropic published research in 2024 showing Claude-3-Opus as an evaluator correlates with human judgment at 85%+ on most rubrics β viable but not a replacement for human evaluation on high-stakes tasks.
Traditional software observability β logs, metrics, traces β applies to AI applications but must be extended. Every AI application should log: the full prompt sent to the model (including system prompt and retrieved context), the model's response, token counts, latency, the model version served, and any application-layer decisions made based on the output.
Specialized observability platforms for AI have emerged to address these needs. LangSmith (LangChain's observability product), Langfuse (open-source), Helicone, and Weights & Biases Weave all capture this data and provide dashboards for monitoring quality metrics over time. Klarna reported in 2024 that their AI observability stack allowed them to detect and roll back a prompt regression within four hours β a system prompt change that had subtly degraded response quality across a category of queries.
Relying solely on the model to follow safety and format instructions is insufficient for production systems. Guardrails are programmatic layers applied before and after the model that enforce hard constraints regardless of what the model generates. Input guardrails might strip or redact personally identifiable information before it reaches the API; output guardrails might validate that the model's response is valid JSON before passing it to downstream systems, or run a toxicity classifier on generated text before displaying it to users.
NVIDIA's open-source NeMo Guardrails framework (released 2023) provides infrastructure for this. Guardrails AI, a Python library, applies structured validation to model outputs. In March 2024, a financial services firm using Guardrails AI to validate model outputs reported catching and blocking malformed responses in 0.3% of production calls β a rate that would have caused thousands of downstream failures per day without the layer.
AI API costs at scale are significant and require active management. The primary levers are: model routing (sending simple queries to cheaper models and complex ones to flagship models β reducing costs 60β80% with minimal quality impact, as reported by Martian and Portkey in 2024), prompt caching (Anthropic and OpenAI both offer prompt caching that discounts repeated system prompt tokens by 90%), and output length control (tight max_tokens limits prevent verbose responses that add cost without value).
Monitoring cost per query, cost per user, and cost per successful task completion β rather than just aggregate API spend β allows teams to identify which features or query types are disproportionately expensive and optimize them specifically.
The Air Canada case established a legal precedent: organizations are responsible for their AI systems' outputs, regardless of whether those outputs were generated autonomously. This makes systematic evaluation and output monitoring not just engineering best practice but a legal necessity for customer-facing AI deployments.
A mature AI application deployment pipeline mirrors CI/CD practices from traditional software engineering but adds AI-specific stages. After a prompt or model change: run offline evaluations against the test suite, review any regressions, deploy to a staging environment with shadow traffic, compare quality metrics against the production baseline, run a canary deployment to 5β10% of traffic while monitoring quality and cost metrics, then gradually roll out. Rollback should be as simple as reverting to the previous prompt version β which requires system prompts to be version-controlled like code.
This advisor specializes in AI evaluation and production observability. Describe an AI application you're building or maintaining and work through: what evaluation metrics matter, how to build a test set, what to log in production, how to set up guardrails, and how to structure a deployment pipeline for safe prompt iteration.
You can also bring a specific scenario β an AI application that has produced bad outputs β and work through how you would have detected and prevented it.