Lesson 1 · Module 3

What Is an API and How Do AI APIs Work?

From raw HTTP requests to intelligent responses — understanding the plumbing behind AI-powered software.

How does a developer's code actually reach a language model, and what travels back?

When OpenAI released the GPT-3.5-turbo API in March 2023 at a price of $0.002 per 1,000 tokens, developer sign-ups surged by hundreds of thousands within days. The company's rate-limit infrastructure buckled under load. Startups building on the API — including Snap's My AI feature, which launched the same month — had to implement exponential back-off logic overnight simply to keep their products functional. The incident made one thing sharply clear: calling an AI API is not pressing a button; it is entering a contract with a system that has its own throughput, pricing, and failure modes.

APIs: The Universal Connectors

An Application Programming Interface (API) is a defined contract between two software systems. One system exposes a set of endpoints (addressable URLs with specified input and output formats), and any other system that knows that contract can send requests and receive structured responses. REST APIs, which dominate the web, use standard HTTP verbs (GET, POST, PUT, DELETE) and typically return data in JSON.

AI APIs follow this same pattern but the payload is richer. Instead of querying a database record or triggering a webhook, you are sending natural language text (and increasingly images, audio, or documents) to a model that runs inference — a computationally expensive forward pass through billions of parameters — and returns generated text, embeddings, or structured data.

Anatomy of an AI API Call

A typical call to a modern chat completion API (OpenAI, Anthropic, Google Gemini) involves three layers of structure: the endpoint, the headers, and the JSON body. The OpenAI form looks like this:

POST https://api.openai.com/v1/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json

{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user",   "content": "Summarize the Turing Test in one sentence."}
  ],
  "max_tokens": 80,
  "temperature": 0.4
}

The Authorization header carries your API key, a credential that identifies your account for billing and rate limiting. The messages array encodes the conversation history in turn order. The model sees the entire array on every call; there is no server-side memory between separate API requests. max_tokens caps the number of tokens the model may generate, which bounds the output cost of a single call; temperature controls output randomness (0 = near-deterministic, 1 and above = more varied and creative).
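The request above can be assembled in code without any provider SDK. The following is a minimal sketch using only the Python standard library; the helper names build_chat_request and add_turn are hypothetical, and the key shown is a placeholder. The request is built but not sent, so the structure can be inspected offline.

```python
import json
import urllib.request

# Endpoint taken from the example request above.
API_URL = "https://api.openai.com/v1/chat/completions"


def build_chat_request(api_key, messages, model="gpt-4o",
                       max_tokens=80, temperature=0.4):
    """Assemble (but do not send) the HTTP request shown above."""
    body = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content}]


# The API is stateless, so every request carries the full history.
history = [{"role": "system", "content": "You are a concise assistant."}]
history = add_turn(history, "user", "Summarize the Turing Test in one sentence.")
request = build_chat_request("sk-placeholder", history)
# Sending it would be: urllib.request.urlopen(request, timeout=30)
```

For a multi-turn conversation, the assistant's reply would be appended with add_turn before the next user message, and the whole list sent again.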

Key Vocabulary

Endpoint: A specific URL that accepts requests for a particular capability (e.g., /v1/chat/completions vs. /v1/embeddings).

API Key: A secret credential string passed in the Authorization header; never expose it client-side or commit it to source control.

Token: The unit of text a model reads and generates, roughly 0.75 English words. Billing and context limits are denominated in tokens, not words or characters.

Context window: The maximum number of tokens a model can process in one call, encompassing all messages. GPT-4o supports 128 K tokens; Claude 3.5 Sonnet supports 200 K.

Rate limit: Provider-imposed caps on requests per minute (RPM) and tokens per minute (TPM) that vary by tier and model.
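The token and context-window definitions above can be turned into a quick pre-flight check. This is a rough sketch that applies the 0.75 words-per-token rule of thumb; it is an estimate, not a real tokenizer. The function names and the dictionary keys used for the two models are illustrative assumptions.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from the vocabulary above

# Context window sizes quoted above, in tokens (keys are illustrative labels).
CONTEXT_WINDOWS = {"gpt-4o": 128_000, "claude-3.5-sonnet": 200_000}


def estimate_tokens(text):
    """Rough token estimate from a whitespace word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_context(messages, model, reserved_for_output=0):
    """Check whether the estimated prompt size leaves room for the reply."""
    prompt_tokens = sum(estimate_tokens(m["content"]) for m in messages)
    return prompt_tokens + reserved_for_output <= CONTEXT_WINDOWS[model]
```

Because the window covers all messages, the check sums every message in the array rather than only the newest one.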

The Response Object

The server returns a JSON object containing the generated text nested inside choices[0].message.content, plus a usage block reporting prompt tokens, completion tokens, and total tokens consumed. Your code must parse this JSON and handle cases where finish_reason equals length (truncated due to max_tokens) versus stop (natural end).

{
  "id": "chatcmpl-abc123",
  "choices": [{
    "finish_reason": "stop",
    "message": {
      "role": "assistant",
      "content": "The Turing Test proposes that a machine is intelligent if its text responses are indistinguishable from a human's."
    }
  }],
  "usage": {"prompt_tokens": 28, "completion_tokens": 22, "total_tokens": 50}
}
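Reading that response in code might look like the following sketch. The function name extract_reply is hypothetical; the field names are the ones shown in the sample response above.

```python
import json

SAMPLE = """
{
  "id": "chatcmpl-abc123",
  "choices": [{
    "finish_reason": "stop",
    "message": {
      "role": "assistant",
      "content": "The Turing Test proposes that a machine is intelligent if its text responses are indistinguishable from a human's."
    }
  }],
  "usage": {"prompt_tokens": 28, "completion_tokens": 22, "total_tokens": 50}
}
"""


def extract_reply(raw_json):
    """Return (text, truncated, usage) from a chat completion response body."""
    data = json.loads(raw_json)
    choice = data["choices"][0]
    text = choice["message"]["content"]
    # "length" means the max_tokens cap cut the output short.
    truncated = choice["finish_reason"] == "length"
    return text, truncated, data.get("usage", {})


text, truncated, usage = extract_reply(SAMPLE)
```

When truncated is true, the caller can decide whether to retry with a higher max_tokens or to ask the model to continue.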

Real-world implication

GitHub Copilot, which uses OpenAI APIs under the hood, sends the entire visible file buffer plus surrounding context on every keystroke-triggered completion. At scale — GitHub reported 1 million paid subscribers in June 2023 — this generated enormous token throughput, which is why GitHub negotiated enterprise-tier rate limits and a dedicated capacity arrangement with OpenAI rather than using the public API tier.

Synchronous vs. Streaming Responses

By default, AI API calls are synchronous: your client waits, sometimes 5–30 seconds for long outputs, before receiving any data. Most providers also support streaming via Server-Sent Events (SSE), where the model sends tokens as they are generated. Streaming reduces perceived latency dramatically — users see words appearing rather than a blank screen — but requires your code to handle chunked responses and partial JSON fragments. ChatGPT's own interface uses streaming for exactly this reason.
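Handling a stream means reading one event at a time and stitching the text fragments together. The sketch below assumes an OpenAI-style stream in which each event is a line beginning with "data:", each payload carries a choices[0].delta object, and the stream ends with "data: [DONE]"; other providers use different event shapes. The function name collect_stream is hypothetical.

```python
import json


def collect_stream(lines):
    """Join the text deltas from an iterable of SSE lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if not chunk.get("choices"):
            continue
        delta = chunk["choices"][0].get("delta", {})
        # The first chunk may carry only a role, with no content.
        parts.append(delta.get("content") or "")
    return "".join(parts)
```

A user interface would normally render each fragment as it arrives instead of waiting for the joined string; collecting them here keeps the example easy to check.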

Builder's takeaway

Before writing a single line of code using an AI API, confirm three things: your API key is stored in an environment variable (never hardcoded), you understand the model's token pricing and have set a billing cap, and you have written a handler for non-200 HTTP status codes (429 rate limit, 503 overload, 400 bad request). Skipping any of these produces applications that either leak credentials, generate surprise invoices, or fail silently in production.

Quiz — Lesson 1

What Is an API and How Do AI APIs Work?

1. In a REST API call to an AI provider, where is the API key typically placed?

Correct. The Authorization header carries "Bearer sk-..." — keeping the key out of the URL (which appears in logs) and out of the body (which may be logged by proxies).

Not quite. Standard practice is to transmit the API key in the HTTP Authorization header as a Bearer token.

2. What does a temperature setting of 0 produce in a language model API call?

Correct. Temperature 0 makes the model select the highest-probability token at every step, producing highly consistent and reproducible outputs — useful for classification and structured extraction tasks.

Incorrect. Temperature 0 suppresses randomness, yielding near-deterministic, consistent output — the opposite of creative randomness.

3. If an API response returns with finish_reason: "length", what happened?

Correct. finish_reason "length" means generation stopped because the max_tokens cap was reached, not because the model finished its thought. The output may be incomplete and should trigger a retry with a higher limit or a continuation prompt.

Incorrect. finish_reason "length" indicates the response was truncated by the max_tokens limit — the model did not reach a natural stopping point.

4. How does streaming (SSE) improve user experience in AI applications?

Correct. Streaming reduces perceived latency by delivering tokens as they are generated, so users experience a typing effect rather than a blank screen followed by a sudden complete response.

Incorrect. Streaming does not reduce tokens or eliminate rate limits — it improves perceived latency by delivering partial responses progressively.

5. An AI API has no server-side memory between separate requests. What is the practical consequence for developers?

Correct. Because AI APIs are stateless, developers must maintain conversation history in their own application and send the entire messages array — system prompt, all prior turns, and the new user message — with every request.

Incorrect. AI APIs are stateless. Each request must include the complete conversation history in the messages array; the provider stores nothing between calls.

Lab 1 — Anatomy of an API Call

Discuss API structure, tokens, and error handling with your AI lab assistant

Your Task

You are building a small app that uses an AI API. Use this lab to get hands-on with API concepts: ask about structuring a request, interpreting a response object, handling a 429 rate-limit error, or calculating token costs. Engage in at least 3 exchanges to complete the lab.

Suggested start: "I want to call an AI API and I need to understand how to structure the messages array for a multi-turn conversation. Can you walk me through it?"

Lesson 2 · Module 3

Authentication, Keys, and Security

How API credentials work, why they leak, and the engineering practices that prevent catastrophic exposure.

Why do so many developers accidentally publish their API keys — and what is the real-world cost when they do?

In January 2023, security researchers at GitGuardian reported that over 10 million secrets were exposed in public GitHub repositories in 2022 — a 67 percent increase year over year. OpenAI API keys were among the most common credentials found. One widely circulated case involved a developer who accidentally committed a .env file containing an active OpenAI key to a public repository; within 72 hours, unauthorized parties had consumed over $3,000 in API credits. OpenAI's abuse team confirmed they detected anomalous usage and revoked the key, but by then the charges had accrued. The incident became a recurring teaching example in security documentation from both OpenAI and Anthropic.

How API Key Authentication Works

AI API providers implement authentication via long-lived secret keys — typically 40–60 character random strings prefixed with a provider identifier (e.g., sk- for OpenAI). When you include this key in the Authorization header, the provider's gateway server validates it against its credential store, looks up your account, applies rate limits and quotas for that account tier, and routes the request to model infrastructure. There is no secondary factor by default; possession of the key is sufficient for full API access.

This design prioritizes developer convenience — machine-to-machine calls cannot do browser-based OAuth flows — but creates severe risk if keys are mishandled. Unlike a password, an API key does not require a username; it alone is sufficient to consume your quota and generate charges.

The Exposure Attack Surface

API keys are typically exposed through five failure modes, in roughly descending order of frequency:

Source control commits — developers hardcode keys in config files or directly in source code and commit to a public (or later-made-public) repository.
Client-side JavaScript bundles — keys embedded in frontend code are visible to anyone who inspects the page source or network traffic.
Shared .env files — environment files checked into version control, often accidentally when the .gitignore is misconfigured.
Log files — request logging that captures Authorization headers verbatim, then logs are written to accessible storage or monitoring services.
Screenshot and demo recordings — developers sharing terminal sessions or IDE screenshots in blog posts or Stack Overflow answers without redacting the key.
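The log-file failure mode in the list above can be reduced by redacting credential headers before anything reaches a logger. A minimal sketch follows; the function name and the set of header names are illustrative assumptions (x-api-key is included because some providers use it in place of Authorization).

```python
SENSITIVE_HEADERS = {"authorization", "x-api-key"}


def redact_headers(headers):
    """Return a copy of the headers that is safe to write to a log."""
    safe = {}
    for name, value in headers.items():
        if name.lower() in SENSITIVE_HEADERS:
            safe[name] = "[REDACTED]"
        else:
            safe[name] = value
    return safe
```

The original dictionary is left untouched, so the real request can still be sent with its credentials while only the redacted copy is logged.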

Correct Key Management Practice

The industry-standard approach separates secrets from code entirely. Environment variables — set in the deployment environment and read at runtime via os.environ (Python), process.env (Node.js), or equivalent — keep keys out of the codebase. Secret management services like AWS Secrets Manager, HashiCorp Vault, or Vercel environment variables extend this by encrypting secrets at rest and providing audit logs of access.

import os
import openai

# WRONG: never hardcode a key in source
client = openai.OpenAI(api_key="sk-abc123realkey")

# CORRECT: read the key from the environment at runtime
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

GitGuardian's 2023 finding

GitGuardian scanned over 1 billion new GitHub commits in 2022 and found that generic high-entropy strings (likely API keys) represented the single largest category of detected secrets at 45% of all exposures. They noted that even private repositories carry risk because employees leave companies, forks get made public, and repository visibility settings can be changed accidentally.

Scoped Keys and Least Privilege

Several providers now offer project-scoped keys — credentials that can only access specific models, have per-month spending caps, and are revocable without affecting other keys on the same account. OpenAI introduced project API keys with configurable spending limits in late 2023. Anthropic allows key revocation and creation per workspace. Google Cloud's Vertex AI uses IAM-based service accounts with per-API-method permissions. The principle of least privilege — giving each key only the permissions it actually needs — limits blast radius when a key is compromised.

Key Rotation and Revocation

A leaked key should be revoked immediately, before investigating how it leaked. Most provider dashboards offer one-click revocation. After revocation, audit usage logs (available in the provider dashboard) for the period of suspected exposure to assess what was consumed. Then rotate: generate a new key, update all deployment environments, verify the service is working, and only then investigate the root cause. The common mistake is investigating first and revoking second — which extends the exposure window.

Builder's takeaway

Add .env and *.env to your .gitignore before writing the first line of code in any project that uses API keys. Use a pre-commit hook (like git-secrets or truffleHog) that scans diffs for high-entropy strings before commits are made. Set a spending cap in the provider dashboard the day you generate your key — not after the first unexpected bill arrives.
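To illustrate what "scanning for high-entropy strings" means, here is a simplified sketch of the idea. It is not how git-secrets or truffleHog are implemented; the regular expression, the 4.0 bits-per-character threshold, and the function names are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Candidate tokens: long runs of characters commonly found in keys.
CANDIDATE = re.compile(r"[A-Za-z0-9_\-]{20,}")


def shannon_entropy(s):
    """Bits of entropy per character of the string."""
    counts = Counter(s)
    total = len(s)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_suspect_strings(text, threshold=4.0):
    """Return substrings that look like secrets: long and high-entropy."""
    return [m for m in CANDIDATE.findall(text) if shannon_entropy(m) >= threshold]
```

A pre-commit hook built on this idea would run the check over the staged diff and block the commit when the list is non-empty. Real tools add provider-specific patterns and allow-lists to keep false positives manageable.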

Quiz — Lesson 2

Authentication, Keys, and Security

1. What is the correct way to provide an API key to your application at runtime?

Correct. Environment variables keep secrets outside the codebase. They are set in the deployment environment and read at runtime — never committed to source control.

Incorrect. The right approach is environment variables — secrets must never live in source code, config files in the repo, or URLs.

2. According to GitGuardian's 2023 State of Secrets Sprawl report, which was the most common type of exposed secret on GitHub?

Correct. GitGuardian found generic high-entropy strings — typically API keys — represented 45% of all detected secrets in their 2022 scan of over 1 billion GitHub commits.

Incorrect. GitGuardian's 2023 report identified generic high-entropy strings (API keys) as the single largest category of exposed secrets at 45%.

3. When a developer discovers their API key has been leaked publicly, what should be the first action?

Correct. Revoke first — this stops the exposure immediately. Deleting the repo does not help because keys found by scanners are already cached. Investigation and root cause analysis come after revocation.

Incorrect. Revocation must come first to stop ongoing unauthorized use. Investigation of root cause is important but secondary — every second of delay extends the exposure window.

4. Why is embedding an API key directly in client-side JavaScript a critical security risk?

Correct. JavaScript delivered to the browser is fully readable by the end user. Any key bundled in frontend code is effectively public — anyone can open DevTools and find it within seconds.

Incorrect. The issue is visibility: any visitor can view page source or network requests, immediately exposing any key embedded in client-side JavaScript.

5. What is the principle of least privilege as applied to API keys?

Correct. Least privilege means a key used only for text generation should not also have access to fine-tuning or file uploads. Scoped keys with spending caps limit damage if one key is compromised.

Incorrect. Least privilege means each credential gets the minimum permissions necessary — not shared master keys, not arbitrary rotation schedules.

Lab 2 — API Key Security Scenarios

Work through real key management decisions with your lab assistant

Your Task

You are a developer setting up an application that calls the OpenAI API. Your lab assistant will help you think through secure key management — storage, exposure risks, spending caps, and rotation procedures. Engage in at least 3 exchanges to complete the lab.

Suggested start: "I'm building a web app where users can submit text and get AI summaries. Should the API call happen on the frontend or the backend, and why does it matter for key security?"

Lesson 3 · Module 3

Rate Limits, Cost Management, and Reliability

How providers throttle usage, what it costs when they don't, and how real teams build applications that survive quota limits.

When your AI feature goes viral overnight, why does it fail — and how do you engineer for that moment?

When Notion launched its AI writing assistant in private alpha in November 2022, demand exceeded every projection. The product required a live OpenAI API call for every AI action — summarize, fix grammar, continue writing — and the volume of concurrent users immediately collided with OpenAI's per-organization rate limits. Notion's engineering team publicly described implementing request queuing and exponential back-off within the first 72 hours of launch to prevent cascading 429 errors from crashing the feature entirely. They also negotiated an elevated rate-limit tier directly with OpenAI. The incident became a widely cited example in developer communities of why rate-limit handling is not optional scaffolding — it is a core feature of any production AI integration.

Understanding Rate Limits

AI API providers enforce two principal rate limits simultaneously. Requests per minute (RPM) caps how many API calls your application can make in a 60-second rolling window. Tokens per minute (TPM) caps the total input plus output tokens across all those calls. Both limits apply concurrently — you can hit the TPM cap while still under the RPM cap if each of your requests is large.
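A client can track both limits locally to avoid sending requests that are likely to be rejected. The sketch below is one possible client-side approximation, not a description of how any provider enforces its limits; the class name and the injectable clock parameter are assumptions made so the logic can be tested without waiting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side check against both RPM and TPM over a rolling 60 s window."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for each admitted request

    def _expire(self, now):
        # Drop requests that have aged out of the 60-second window.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

    def try_acquire(self, tokens):
        """Record the request and return True if it fits under both caps."""
        now = self.clock()
        self._expire(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) + 1 > self.rpm or used_tokens + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        return True
```

With rpm=3 and tpm=1000, a second 600-token request is refused on the token cap even though only one request has been made, which is the "TPM before RPM" case described above.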

Limits vary dramatically by account tier. OpenAI's free tier offered 3 RPM on GPT-4 as of 2024; a Tier 5 (enterprise) account gets 10,000 RPM and 2,000,000 TPM on GPT-4o. The practical implication: a prototype that works fine on a personal account can fail catastrophically the moment it faces real user traffic, because production volume pushes past free-tier limits within seconds.

HTTP 429 — Rate Limited

The provider rejects your request because you have exceeded RPM or TPM. The response typically includes a Retry-After header indicating when you may send the next request. Your code must catch this and wait rather than retry immediately in a tight loop, which makes the problem worse.

HTTP 503 — Service Unavailable

The provider is overloaded (distinct from your personal rate limit). This is a transient infrastructure issue. Exponential back-off — waiting 1s, then 2s, then 4s, then 8s before retrying — is the standard mitigation. Most provider client libraries implement this automatically.
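The two cases can share one decision function: honor Retry-After when it is present and numeric, otherwise fall back to doubling. This is a sketch under assumptions; the function name, the 60-second cap, and the plain-dictionary headers (with an exact "Retry-After" key) are illustrative choices.

```python
RETRYABLE = {429, 503}


def retry_delay(status, headers, attempt, cap=60.0):
    """Seconds to wait before retrying, or None if the error is not retryable."""
    if status not in RETRYABLE:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # header held an HTTP date; fall back to doubling
    # attempt is 0-based: 1 s, 2 s, 4 s, 8 s, ...
    return min(2 ** attempt, cap)
```

Returning None for other status codes (a 400, for example) signals that retrying the same request will not help.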

Exponential Back-Off with Jitter

Exponential back-off means doubling the wait time between retries: attempt 1 waits 1 second, attempt 2 waits 2 seconds, attempt 3 waits 4 seconds, and so on up to a maximum (typically 60–120 seconds). Jitter adds a small random component to prevent the thundering herd problem — if 1,000 clients all hit a rate limit simultaneously and all retry at exactly the same interval, they generate another synchronized spike. Adding a random offset (e.g., wait_time + random(0, 1) seconds) spreads the retry load.

import random
import time

import openai

# Assumes OPENAI_API_KEY is set in the environment.
client = openai.OpenAI()

def call_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o", messages=messages, max_tokens=300)
        except openai.RateLimitError:
            # Double the wait on each attempt and add up to 1 s of jitter.
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s")
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")
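The retry function above doubles the wait without an upper bound. A capped variant of the delay calculation, matching the "up to a maximum" described earlier, could be sketched as follows. The function name, the 60-second default cap, and the injectable rng parameter are assumptions for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Delay before retry number `attempt` (0-based): capped doubling plus jitter."""
    exponential = min(cap, base * (2 ** attempt))
    return exponential + rng()  # rng() returns a value in [0, 1)
```

Passing a fixed rng makes the schedule deterministic for tests; in production the default random source spreads clients apart.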

Token Cost Estimation and Budget Controls

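Token pricing as of mid-2024 ranged from $0.15 per million input tokens (GPT-4o mini) to $15 per million output tokens (GPT-4o, whose input tokens cost $5 per million). A 500-word document is roughly 650 tokens. At GPT-4o pricing, summarizing 10,000 such documents costs approximately $42.25, assuming summaries of about 65 tokens each: 6.5 million input tokens at $5 per million ($32.50) plus 0.65 million output tokens at $15 per million ($9.75). That is affordable for a batch job, catastrophic if triggered inadvertently in a production loop.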
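The batch estimate can be reproduced with a few lines of arithmetic. The sketch below assumes mid-2024 GPT-4o list prices of $5 per million input tokens and $15 per million output tokens, and a hypothetical summary length of 65 output tokens per document; change those inputs and the result changes with them.

```python
def estimate_cost(n_requests, input_tokens, output_tokens,
                  input_price_per_m, output_price_per_m):
    """Dollar cost of n identical requests, given per-million-token prices."""
    input_cost = n_requests * input_tokens / 1_000_000 * input_price_per_m
    output_cost = n_requests * output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# 10,000 documents of ~650 tokens each, with an assumed 65-token summary.
batch = estimate_cost(10_000, 650, 65,
                      input_price_per_m=5.0, output_price_per_m=15.0)
print(f"${batch:.2f}")  # $42.25
```

Input and output tokens are priced separately, so long prompts with short answers and short prompts with long answers can cost very different amounts at the same total token count.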

The open-source library tiktoken (released by OpenAI) allows exact token counting before making a call, enabling cost estimation at request time. Set max_tokens conservatively for each use case. Use smaller, cheaper models (GPT-4o mini, Claude Haiku) for tasks that do not require full model capability — classification, formatting, short Q&A — and reserve large models for complex reasoning.

The $50,000 Loop — a documented pattern

Multiple developers have publicly reported accidentally running infinite or near-infinite loops against AI APIs — typically a retry loop missing a success condition check, or a webhook that re-triggered itself. In one widely discussed 2023 incident on the OpenAI developer forum, a developer reported $47,000 in API charges accrued over a weekend from a misconfigured automation. OpenAI's soft spending limit alerts (configurable in the dashboard) existed but had not been set. The lesson: set a monthly spending hard limit on day one, not after the invoice arrives.
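A dashboard limit is the primary defense, but an application can also carry its own circuit breaker. The following sketch shows one way to do that; BudgetGuard, BudgetExceeded, and run_loop are hypothetical names, and the per-call cost is assumed to come from the application's own estimate.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the configured cap."""


class BudgetGuard:
    """Track estimated spend and refuse calls once a cap would be crossed."""

    def __init__(self, cap_dollars):
        self.cap = cap_dollars
        self.spent = 0.0

    def charge(self, dollars):
        if self.spent + dollars > self.cap:
            raise BudgetExceeded(
                f"cap ${self.cap:.2f} would be exceeded (spent ${self.spent:.2f})")
        self.spent += dollars


def run_loop(guard, cost_per_call, max_calls):
    """Simulate a runaway loop; the guard stops it before the cap is crossed."""
    calls = 0
    try:
        for _ in range(max_calls):
            guard.charge(cost_per_call)  # would precede the real API call
            calls += 1
    except BudgetExceeded:
        pass
    return calls
```

Charging before the call, rather than after, means a loop with a missing exit condition halts at the cap instead of one call past it.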

Caching and Prompt Caching

Application-level caching stores API responses for identical or near-identical inputs, serving them without a new API call. This is appropriate for static queries — e.g., a product description that 10,000 users will all request — but not for personalized or dynamic content. Anthropic introduced prompt caching in 2024, where repeated system prompts or large context blocks (minimum 1,024 tokens) are cached server-side for up to 5 minutes, reducing cost on those tokens by 90% for cache hits. OpenAI added a similar feature (Prompt Caching) for inputs over 1,024 tokens in late 2024, automatically discounting repeated prefixes by 50%.
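Application-level caching for exactly identical inputs can be sketched as a dictionary keyed by a hash of the request. This illustrates only the exact-match case, not near-identical matching or provider-side prompt caching; the class name and the call_model callback signature are assumptions.

```python
import hashlib
import json


class ResponseCache:
    """Serve repeated identical requests from memory instead of the API."""

    def __init__(self, call_model):
        self.call_model = call_model  # function(model, messages) -> str
        self.store = {}
        self.hits = 0

    @staticmethod
    def _key(model, messages):
        # Canonical JSON so that equal requests always hash the same way.
        canonical = json.dumps({"model": model, "messages": messages},
                               sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, model, messages):
        key = self._key(model, messages)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        result = self.call_model(model, messages)
        self.store[key] = result
        return result
```

A production version would also need an expiry policy and a size bound, which are omitted here to keep the example short.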

Builder's takeaway

Before deploying an AI feature to production, answer these four questions: What is the worst-case token consumption per user action? Have you set a monthly spending hard cap in the provider dashboard? Does your retry logic implement exponential back-off with jitter? Have you tested what happens to user experience when a 429 error occurs? If any answer is "not yet," that is a production blocker, not a nice-to-have.

Quiz — Lesson 3

Rate Limits, Cost Management, and Reliability

1. What does a 429 HTTP status code from an AI API indicate?

Correct. HTTP 429 "Too Many Requests" means the rate limit has been exceeded. The response typically includes a Retry-After header indicating when to next attempt the request.

Incorrect. 429 specifically means rate limit exceeded — too many requests or too many tokens in the current time window.

2. Why is random jitter added to exponential back-off retry logic?

Correct. Jitter prevents the thundering herd problem: if many clients hit a limit at the same time and all retry at identical intervals, they create a synchronized second spike. Random offsets spread the retries out.

Incorrect. Jitter solves the thundering herd problem — when many clients retry simultaneously at the same interval, they recreate the spike that caused the rate limit in the first place.

3. According to the lesson, what did Notion's engineering team do within 72 hours of their AI feature launch to handle rate limit issues?

Correct. Notion's team implemented queuing and back-off to prevent cascading 429 failures, and simultaneously negotiated a higher rate-limit tier directly with OpenAI to accommodate their launch volume.

Incorrect. Notion implemented request queuing and exponential back-off, and negotiated elevated API tiers with OpenAI — a two-pronged approach to the rate limit problem.

4. What is prompt caching (as introduced by Anthropic in 2024) and what does it reduce?

Correct. Anthropic's prompt caching stores frequently repeated context blocks (system prompts, large documents) server-side for up to 5 minutes. Cached tokens cost 90% less than uncached tokens on cache hits.

Incorrect. Prompt caching is a provider-side feature where repeated prompt prefixes (over 1,024 tokens) are cached server-side, dramatically reducing cost for those repeated tokens.

5. What is the most important cost-control action a developer should take on the day they generate an AI API key?

Correct. Setting a spending hard limit on day one is the single most effective prevention against runaway charges from bugs, loops, or unauthorized usage. It takes 30 seconds and prevents potentially thousands of dollars in accidental spend.

Incorrect. The first cost-control action should be setting a monthly spending hard limit in the provider dashboard — before any code is written that could trigger runaway API calls.

Lab 3 — Rate Limits & Cost Scenarios

Work through rate limiting and cost estimation problems with your lab assistant

Your Task

Your application is scaling up and you're starting to hit rate limits. Work with the lab assistant to think through: exponential back-off implementation, estimating token costs for a feature, choosing between model tiers, and structuring requests efficiently. Engage in at least 3 exchanges to complete the lab.

Suggested start: "My app lets users summarize documents up to 5,000 words. I'm on GPT-4o and I want to estimate monthly costs if 1,000 users each summarize 10 documents per day. Help me work through the math."

Lesson 4 · Module 3

Building Real Features: Prompt Design and Output Handling

Turning raw API capabilities into reliable product features — through structured prompts, output parsing, and defensive engineering.

Why does the same AI model produce wildly inconsistent outputs in production, and how do teams engineer around it?

In 2023, Stripe built an AI-powered documentation assistant that answered developer questions about its API. The team described in a public engineering post how their initial prototype — using a simple user message with no system prompt — produced answers that confidently cited non-existent Stripe API parameters and gave outdated deprecation dates. The solution was a layered approach: a system prompt that constrained the model to only information in the retrieved document chunks, an instruction to say "I don't know" rather than speculate, and a post-processing step that cross-referenced cited parameter names against the live API reference before displaying the answer. This architecture — prompt constraint plus output validation — became a template for Stripe's subsequent AI feature builds.

The System Prompt as Architecture

The system prompt is the developer's primary tool for shaping model behavior. It sits at the start of the messages array, ahead of any user input, and because the API is stateless it must be resent with every request to stay in effect for the whole conversation. A well-engineered system prompt specifies: the model's role and persona, the scope of topics it should and should not address, the output format expected (plain text, JSON, markdown), the tone and length constraints, and explicit instructions for edge cases (what to do when uncertain, when asked about competitors, when asked for personal advice).

Weak system prompts — one sentence descriptions — produce unpredictable behavior in edge cases because the model fills unspecified behaviors with its training defaults, which may not match product requirements. Strong system prompts are often 200–800 words and are treated as a critical engineering artifact, version-controlled and reviewed like code.

Structured Output: JSON Mode and Function Calling

Parsing free-text AI responses in production is fragile: natural language output can vary in structure between calls even with identical inputs. Two approaches enforce parseable output. JSON mode (available in GPT-4o, and achievable in Claude with proper prompting) instructs the model to return only valid JSON. The response can then be parsed with standard JSON libraries without regex heuristics. Function calling (OpenAI) and tool use (Anthropic) go further: you define a JSON schema describing a function signature, and the model returns a structured object intended to conform to that schema. The result is far more reliably parseable than free text, but conformance is not absolute, so the arguments should still be validated before use.

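import openai

# Assumes OPENAI_API_KEY is set in the environment.
client = openai.OpenAI()

# Function calling (OpenAI): the function's arguments are described as a JSON schema
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role":"user","content":"Extract name, email, and company from: John Smith, john@acme.com, Acme Corp"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_contact",
            "description": "Record the contact details found in the text.",
            "parameters": {
                "type": "object",
                "properties": {
                    "name":    {"type": "string"},
                    "email":   {"type": "string"},
                    "company": {"type": "string"}
                },
                "required": ["name","email","company"]
            }
        }
    }],
    tool_choice="auto"  # the model decides whether to call the function
)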

Output Validation and Defensive Handling

Even with JSON mode or function calling, production code must validate model output before acting on it. Models can hallucinate field values, return null for required fields, or produce structurally valid but semantically wrong content. The engineering pattern is: parse → validate schema → validate business rules → act. For high-stakes outputs (medical, financial, legal), a second AI call asking the model to verify its own answer — or a rule-based cross-check against known data — provides an additional safety layer.
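The parse, validate schema, validate business rules sequence can be sketched for the contact-extraction example. The input is assumed to be the JSON string of function arguments returned by the model (in OpenAI's format this is found at choices[0].message.tool_calls[0].function.arguments). The function name and the deliberately loose email pattern are illustrative assumptions.

```python
import json
import re

REQUIRED = ("name", "email", "company")
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def parse_contact(arguments_json):
    """Parse, then validate schema, then validate business rules."""
    try:
        data = json.loads(arguments_json)            # 1. parse
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field in REQUIRED:                           # 2. schema check
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field}")
    if not EMAIL_SHAPE.match(data["email"]):         # 3. business rule
        raise ValueError("email does not look like an address")
    return {field: data[field].strip() for field in REQUIRED}
```

Only a value returned by this function would be passed on to the "act" step; any ValueError is a signal to retry, fall back, or surface an error to the user.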

String outputs require similar care. Before displaying model-generated text to users: check length bounds (a runaway model can generate thousands of tokens if max_tokens is unset), sanitize for XSS if rendering HTML, and check for policy-violating content if your system prompt's guardrails could be bypassed by adversarial user inputs.
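For the length and XSS checks, a minimal sketch using the standard library might look like this. The function name and the 2,000-character default are assumptions; content-policy checks are out of scope for the example.

```python
import html


def prepare_for_display(text, max_chars=2000):
    """Bound the length of model output and escape it for HTML rendering."""
    if len(text) > max_chars:
        text = text[:max_chars].rstrip() + "..."
    # Escapes &, <, > and quote characters so markup is shown, not executed.
    return html.escape(text)
```

Truncating before escaping keeps the cut from landing in the middle of an escaped entity such as &amp;lt;.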

Prompt injection — a documented production attack

In 2023, researchers demonstrated prompt injection attacks against Bing Chat (powered by GPT-4) in which malicious instructions embedded in web pages the model was asked to summarize caused it to relay those instructions as if they were its own. Kevin Liu and others published public demonstrations, including cases where text hidden in white-on-white font in a webpage overrode the system prompt. Microsoft shipped mitigations across multiple updates. The incident established prompt injection as a first-class security concern for any AI feature that processes user-controlled or external content.

Latency, Timeouts, and Graceful Degradation

AI API calls are slow by web standards: a GPT-4o call generating 500 tokens takes roughly 2–5 seconds on average, with occasional spikes to 20+ seconds during high-load periods. Production applications must set explicit HTTP timeouts (typically 30–60 seconds), implement graceful degradation — a non-AI fallback when the API is slow or unavailable — and use streaming where the UX requires responsiveness. Stripe, Notion, and GitHub Copilot all use streaming as a core architectural choice, not an optional enhancement, because synchronous waits of 5+ seconds violate user experience expectations established by instant web responses.
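Graceful degradation can be sketched as a wrapper that gives the AI call a deadline and substitutes a non-AI result when the deadline passes or the call fails. The function name and the use of a worker thread are assumptions for illustration; a real integration would also set the timeout on the HTTP client itself.

```python
import concurrent.futures


def call_with_fallback(primary, fallback, timeout_s):
    """Run primary(); if it is slow or raises, return fallback() instead."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and an error raised by primary().
        return fallback()
    finally:
        # Do not block on a slow call that is still running in the background.
        executor.shutdown(wait=False)
```

The fallback might return a cached summary, a truncated excerpt, or a message that the feature is temporarily unavailable, depending on the product.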

Testing AI-Powered Features

Traditional unit tests assert deterministic outputs, but AI outputs are probabilistic. The emerging practice uses eval suites — collections of representative inputs with expected output properties (not exact strings) — run against the model to detect prompt regressions. When you change a system prompt, run the eval suite and compare pass rates. Companies including Anthropic, OpenAI, and Brex have published internal evals frameworks. The open-source promptfoo library provides a structured way to define test cases and assertions against AI outputs as part of a CI/CD pipeline.
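The idea of asserting properties instead of exact strings can be shown with a very small harness. This is an illustrative sketch, not promptfoo's format or any vendor's framework; run_evals, fake_model, and the two cases are hypothetical, with fake_model standing in for a real API call so the example runs offline.

```python
def run_evals(model_fn, cases):
    """Return the fraction of cases whose output passes every check."""
    passed = 0
    for case in cases:
        output = model_fn(case["input"])
        if all(check(output) for check in case["checks"]):
            passed += 1
    return passed / len(cases) if cases else 0.0


def fake_model(prompt):
    """Stand-in for a real API call so the harness can run offline."""
    if "unknown" in prompt:
        return "I don't know."
    return "Use the amount parameter."


CASES = [
    {"input": "How do I set the charge amount?",
     "checks": [lambda out: "amount" in out, lambda out: len(out) < 200]},
    {"input": "What is the unknown_flag parameter?",
     "checks": [lambda out: "don't know" in out.lower()]},
]

pass_rate = run_evals(fake_model, CASES)
```

Running the same cases before and after a system prompt change, and comparing the two pass rates, is the regression check described above.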

Builder's takeaway

Treat your system prompt as production code: version control it, test changes against an eval suite before deploying, and document why each instruction exists. Use function calling or JSON mode for any output your code will parse programmatically — never parse free-text responses with regex in production. Set HTTP timeouts, implement streaming for user-facing features, and always have a graceful degradation path for when the AI API is unavailable.

Quiz — Lesson 4

Building Real Features: Prompt Design and Output Handling

1. What is function calling (OpenAI) / tool use (Anthropic) primarily used for in production applications?

Correct. Function calling constrains the model to return a structured object matching a schema you define — making output reliably parseable without fragile string parsing.

Incorrect. Function calling / tool use makes the model return structured JSON conforming to a developer-specified schema, enabling reliable programmatic parsing of AI output.

2. What was the core architectural solution Stripe's documentation assistant team used to prevent hallucinated API parameter names?

Correct. Stripe combined prompt-level constraints (only answer from retrieved documents, say "I don't know" when uncertain) with programmatic output validation that cross-referenced cited parameter names against the live API reference.

Incorrect. Stripe used a two-layer approach: a constraining system prompt plus post-processing validation of cited API parameters against their live reference documentation.

3. What is a prompt injection attack in the context of AI applications?

Correct. Prompt injection occurs when content the model is asked to process contains instructions that cause it to ignore or override its system prompt — a serious concern for any AI feature that reads external content like webpages or documents.

Incorrect. Prompt injection is when malicious instructions embedded in user-controlled or external content (like a webpage) override the developer's system prompt, redirecting the model's behavior.

4. Why do production AI applications require explicit HTTP timeouts?

Correct. Unlike typical database queries that return in milliseconds, AI inference can take 2–30 seconds and occasionally longer. Without explicit timeouts, slow requests block threads or connections indefinitely, cascading into broader application failures.

Incorrect. AI API calls are genuinely slow (2–30 seconds) by web standards. Without explicit timeouts, a slow or stalled request will hang the application indefinitely rather than failing fast and triggering a fallback.

5. What is an eval suite in the context of AI feature development?

Correct. Eval suites are test collections — inputs paired with expected output properties (not exact strings) — that developers run when changing prompts or models to catch behavioral regressions before deploying to production.

Incorrect. An eval suite is a developer-maintained set of test inputs with expected output properties, used to detect prompt regressions when the system prompt or model changes — the AI equivalent of a unit test suite.

Lab 4 — Prompt Design and Output Handling

Practice designing system prompts and handling structured output with your lab assistant

Your Task

You are building an AI feature for a real product. Work with the lab assistant to design effective system prompts, choose between free-text and structured JSON output, think through output validation logic, and plan for edge cases including prompt injection risks. Engage in at least 3 exchanges to complete the lab.

Suggested start: "I'm building an AI feature that reads customer support emails and classifies them into categories: billing, technical issue, feature request, or other. Should I use JSON mode or function calling, and what should my system prompt say?"

Module 3 — Test

Working with APIs · 15 questions · Pass at 80%

1. In which part of an HTTP request to an AI API is the API key correctly placed?

Correct. The API key goes in the HTTP Authorization header, in the form "Bearer sk-...".

Incorrect. The API key belongs in the HTTP Authorization header.

2. What is a "token" in the context of AI API billing?

Correct. Tokens are the billing unit — roughly 0.75 words or 4 characters of English text.

Incorrect. Tokens are sub-word units — roughly 0.75 English words — used to denominate both input and output for billing purposes.

3. AI chat completion APIs are stateless. What must developers do to support multi-turn conversations?

Correct. Stateless APIs remember nothing between calls. Developers must store conversation history and resend it in full each time.

Incorrect. AI APIs are stateless — developers must send the full conversation history in the messages array with every request.

4. What does temperature control in an AI API call affect?

Correct. Temperature scales output randomness: 0 is near-deterministic, 1+ is highly varied and creative.

Incorrect. Temperature controls output randomness: 0 is near-deterministic, and higher values increase creativity and variation.

5. According to GitGuardian's 2023 report, secrets exposed on GitHub increased by what percentage in 2022?

Correct. GitGuardian reported a 67% year-over-year increase in exposed secrets on GitHub in 2022, with over 10 million secrets detected.

Incorrect. GitGuardian reported a 67% increase — from roughly 6 million to over 10 million exposed secrets on GitHub.

6. Which approach prevents a hardcoded API key from being accidentally committed to a public repository?

Correct. Storing keys in environment variables, with the .env file listed in .gitignore, is the foundational approach. Base64 is not encryption; it is trivially reversible. Private repos can be made public accidentally.

Incorrect. Environment variables with .env in .gitignore is the correct approach — base64 is trivially reversible, and private repos can change visibility.

7. What is the thundering herd problem in the context of API retry logic?

Correct. When many clients hit a rate limit together and all retry at the same interval, they generate a synchronized second spike — the thundering herd. Jitter solves this by randomizing retry timing.

Incorrect. The thundering herd problem occurs when many clients retry simultaneously after a shared rate limit event, recreating the traffic spike. Jitter (random offsets) is the mitigation.

8. What did Notion's engineering team do within 72 hours of launching Notion AI to handle rate limit issues at launch?

Correct. Notion implemented queuing and back-off engineering plus negotiated API tier upgrades — a two-pronged response to their launch-day rate limit crisis.

Incorrect. Notion implemented queuing, exponential back-off, and negotiated higher limits with OpenAI — making rate-limit handling a core engineering priority rather than an afterthought.

9. What is the primary advantage of using streaming (SSE) over synchronous API calls in a user-facing AI feature?

Correct. Streaming delivers tokens as generated — users see a typing effect rather than a blank screen followed by a sudden complete answer, dramatically improving perceived responsiveness.

Incorrect. Streaming's main benefit is perceived latency reduction: users see tokens appear progressively instead of waiting for the complete response.

10. What is prompt caching (as introduced by Anthropic) and what cost reduction does it provide on cache hits?

Correct. Anthropic's prompt caching stores large repeated prompt blocks (1,024+ tokens) on the provider's servers for up to 5 minutes, with 90% cost reduction for those tokens on cache hits.

Incorrect. Prompt caching is provider-side caching of repeated large context blocks. Cache hits receive a 90% cost reduction on those cached tokens.

11. What is function calling / tool use primarily designed to solve in production AI integrations?

Correct. Function calling constrains the model to return output matching a developer-specified schema — eliminating fragile free-text parsing in production code.

Incorrect. Function calling / tool use makes the model return structured output conforming to a developer-defined JSON schema, enabling reliable programmatic parsing.

12. Which documented prompt injection attack was demonstrated against Bing Chat in 2023?

Correct. Researchers demonstrated that text hidden on webpages Bing Chat summarized could override its system prompt, establishing prompt injection as a first-class production security concern.

Incorrect. The 2023 Bing Chat injection attack used hidden text on webpages to insert instructions that overrode the model's system prompt, redirecting its behavior.

13. What is an eval suite used for in AI feature development?

Correct. Eval suites are test collections — inputs with expected output properties — run to catch behavioral regressions when system prompts or underlying models change.

Incorrect. Eval suites detect prompt regressions — they run test inputs with expected output properties to flag when prompt or model changes break expected behavior.

14. Why must production applications set explicit HTTP timeouts on AI API calls?

Correct. AI API calls are slow (2–30s typically, longer during overload). Without a timeout, a single stalled request can block a thread or connection indefinitely, cascading into broader service degradation.

Incorrect. AI inference is genuinely slow by web standards. Without timeouts, stalled requests hang indefinitely and cascade into service-level failures.

15. What is the correct order of operations for handling a discovered API key leak?

Correct. Revoke first — this stops the exposure immediately. Then audit what was consumed during the exposure window. Then rotate and update deployments. Root cause investigation comes last, not first.

Incorrect. Revoke first — every second of delay extends the attack window. Investigation comes after the immediate threat is stopped and a new key is deployed.