When OpenAI released the GPT-4 API on March 14, 2023, Stripe's engineering team integrated it into their fraud-detection pipeline within seventy-two hours. They didn't build a model — they made an API call. The distinction matters enormously. Every AI-first business in this course starts the same way: with a key, an endpoint, and a pricing page.
A foundation model is a large neural network trained on massive, general-purpose data that can be adapted to many downstream tasks. Unlike the narrow classifiers businesses built in the 2010s, foundation models transfer knowledge across domains: the same model that summarizes legal contracts can also write marketing copy or debug Python.
The frontier is currently dominated by four providers: OpenAI (GPT-4o, o1, o3), Anthropic (Claude 3.5 Sonnet, Claude 3 Opus), Google DeepMind (Gemini 1.5 Pro, Gemini 2.0 Flash), and Meta AI (Llama 3.1, open weights). Each exposes capabilities through an API measured in tokens — chunks of roughly four characters.
API access means you send text in, pay per token, and receive generated text out. In 2024, OpenAI's GPT-4o cost $5 per million input tokens and $15 per million output tokens. A 500-word customer support reply consumes roughly 700 tokens, which costs about one cent at the output rate. At scale, a business handling 100,000 support queries per day would spend approximately $1,000/day on generated replies alone, before input tokens and infrastructure.
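The per-token arithmetic can be sketched in a few lines of Python. This is a rough estimator, not a billing tool: the prices are the 2024 GPT-4o figures quoted above, the four-characters-per-token rule is only a heuristic, and the function names are made up for illustration.

```python
# Prices are the 2024 GPT-4o figures quoted in the text, in USD per million tokens.
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 15.00


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about four characters per token."""
    return max(1, round(len(text) / 4))


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, pricing input and output tokens separately."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def daily_cost(calls_per_day: int, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a day of identical calls."""
    return calls_per_day * call_cost(input_tokens, output_tokens)
```

Because input and output are priced differently, the same 700 tokens cost about a third of a cent as input and about one cent as output. Which side of the call the tokens fall on matters as much as the count.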
The key business decision is which tier to call. Every major provider now offers a speed/cost/quality spectrum, from small, fast, inexpensive models to slower, more capable flagship models.
Klarna's AI assistant — announced in February 2024 as handling the work of 700 customer service agents — does not call a single model. It routes requests by complexity. Simple account-status queries hit a small, fast model. Disputes requiring policy reasoning escalate to a flagship model. This cascade routing pattern cut per-query cost by an estimated 60% while maintaining quality on hard cases. By September 2024 Klarna reported the assistant had handled 2.3 million conversations, with 79% resolved without a human agent.
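A cascade router can be sketched as a cheap classification step that runs before any model call. This is not Klarna's implementation: the tier names, prices, keywords, and word-count threshold below are all hypothetical, and a production router would more likely use a small classifier model than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical tier names and prices (USD per million output tokens).
TIERS = {"small": 0.60, "flagship": 15.00}

# Hypothetical keywords that signal a request needing policy reasoning.
ESCALATION_KEYWORDS = ("dispute", "refund", "chargeback", "complaint")


@dataclass
class Route:
    tier: str
    reason: str


def route(query: str, max_simple_words: int = 40) -> Route:
    """Send simple queries to the small tier and escalate the rest."""
    text = query.lower()
    for keyword in ESCALATION_KEYWORDS:
        if keyword in text:
            return Route("flagship", f"keyword: {keyword}")
    if len(text.split()) > max_simple_words:
        return Route("flagship", "long query")
    return Route("small", "simple query")


def blended_price_per_m(share_small: float) -> float:
    """Average price per million tokens when a share of traffic goes to the small tier."""
    return share_small * TIERS["small"] + (1 - share_small) * TIERS["flagship"]
```

With these illustrative prices, routing 80% of traffic to the small tier brings the blended price from $15.00 to $3.48 per million tokens. Most of the saving comes from the share of traffic that never reaches the flagship model.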
Every API call has a context window: the maximum tokens the model can "see" at once, including both your input and its output. GPT-4o supports 128,000 tokens (~96,000 words). Gemini 1.5 Pro supports 1 million tokens. Larger windows enable full-document analysis, but you pay for every token you send, so cost grows with how much of the window you fill. Match window size to task, not to maximum.
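Since the window covers input and output together, a pre-flight check has to reserve room for the reply. The sketch below uses the 128,000-token figure from the text as its default. The function names are hypothetical, and some models also impose a separate cap on output length that is not modeled here.

```python
def fits_context(input_tokens: int, max_output_tokens: int,
                 window: int = 128_000) -> bool:
    """True if the prompt plus the reserved reply space fits in the window."""
    return input_tokens + max_output_tokens <= window


def max_input_budget(max_output_tokens: int, window: int = 128_000) -> int:
    """Tokens left for the prompt after reserving space for the reply."""
    return max(0, window - max_output_tokens)
```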
The foundation model layer is a commodity market moving toward price parity on quality. Your competitive advantage comes not from which model you pick but from how well you architect the calls around it — routing, prompting, caching, and feedback loops.
Describe a specific business use case — the task type, expected volume, latency requirements, and data sensitivity. The AI advisor will recommend a model tier, explain the routing strategy, and estimate rough costs. Push back, ask for alternatives, or give it edge cases.
In July 2024, Salesforce launched Einstein Copilot with agent capabilities — an AI that could autonomously query CRM records, draft follow-up emails, schedule meetings, and log activities across multiple systems without a human in each loop. The enabling architecture was not a smarter model. It was an orchestration layer that broke goals into steps, called tools, and recovered from failures.
A bare API call is stateless: you send context, get a response, done. Real business workflows are stateful: approve this, then notify that person, then update the database, then check if a condition is met before proceeding. Orchestration layers add state, memory, and tool use on top of raw model calls.
The dominant frameworks in 2024 were LangChain (launched 2022, >100k GitHub stars), LlamaIndex (focused on data ingestion and retrieval), and AutoGen (Microsoft, multi-agent conversations). In parallel, every major cloud provider released managed orchestration: AWS Bedrock Agents, Google Vertex AI Agents, Azure AI Studio.
An agent follows a Reason → Act → Observe loop, often called ReAct (from the 2022 Princeton/Google paper by Yao et al.). The model reasons about the current state, selects a tool to call, observes the result, and reasons again — repeating until the goal is complete or a stop condition fires.
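The loop can be sketched without any framework. In the example below, `scripted_model` stands in for a real model call and `lookup_order` is a hypothetical tool. Only the control flow (reason, act, observe, stop) is meant to carry over to a real system.

```python
from typing import Callable

# Hypothetical tools: plain functions from a string argument to a string result.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
}


def scripted_model(history: list[str]) -> dict:
    """Stand-in for a real model call: picks the next step from the history."""
    if not any(line.startswith("observe:") for line in history):
        return {"thought": "I need the order status.",
                "action": "lookup_order", "input": "A123"}
    return {"thought": "I have what I need.",
            "final": history[-1].removeprefix("observe: ")}


def react(goal: str, model: Callable[[list[str]], dict] = scripted_model,
          max_steps: int = 5) -> str:
    history = [f"goal: {goal}"]
    for _ in range(max_steps):
        step = model(history)                          # Reason
        history.append(f"thought: {step['thought']}")
        if "final" in step:                            # Goal complete
            return step["final"]
        result = TOOLS[step["action"]](step["input"])  # Act
        history.append(f"observe: {result}")           # Observe
    return "escalate: step limit reached"              # Stop condition fired
```

Calling `react("Where is order A123?")` runs one tool call and returns the observed status. A model that never produces a final answer hits `max_steps` and returns the escalation message instead of looping forever.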
Harvey AI — backed by OpenAI and valued at over $1.5 billion by late 2024 — built a legal workflow automation system for firms including Allen & Overy. Their architecture chains multiple steps: ingest a contract, extract clauses, compare against a precedent database, flag deviations, draft a redline, and route to the relevant attorney's queue. None of this is a single prompt. It is an orchestrated pipeline of model calls, retrieval steps, and business logic. Harvey's reported productivity gain at Allen & Overy was a 50% reduction in contract review time for participating associates.
Without hard stop conditions, agents can loop indefinitely, burning tokens and budget. Every production agent system needs maximum step limits, timeout guardrails, and human escalation paths when confidence is low. This is not optional — it's a production requirement.
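The three guardrails named above can be collected into one small object that the agent loop consults before each step. This is a minimal sketch: the class name, default limits, and return strings are illustrative, and the confidence score is assumed to come from somewhere else in the system, for example a model self-rating or a separate classifier.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Guardrails:
    max_steps: int = 10
    timeout_s: float = 30.0
    min_confidence: float = 0.6
    started: float = field(default_factory=time.monotonic)
    steps: int = 0

    def check(self, confidence: float) -> str:
        """Return 'continue', or the reason to stop or hand off to a human."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "stop: step limit"
        if time.monotonic() - self.started > self.timeout_s:
            return "stop: timeout"
        if confidence < self.min_confidence:
            return "escalate: low confidence"
        return "continue"
```

Keeping the limits outside the model's control is the point of the design: the agent cannot reason its way past a check that plain code enforces.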
Not every multi-step AI task needs a self-directed agent. Workflow automation uses a predefined DAG (directed acyclic graph) of steps — deterministic, auditable, easy to debug. Agents choose their own steps dynamically — flexible but harder to audit and predict. The right choice depends on how variable and unpredictable the inputs are.
| Dimension | Workflow Automation | Autonomous Agent |
|---|---|---|
| Step sequence | Fixed, predefined | Dynamic, model-chosen |
| Auditability | Easy to trace | Harder to explain |
| Flexibility | Rigid to new inputs | Adapts to novel situations |
| Cost predictability | Predictable token use | Variable (can over-loop) |
| Best for | Structured, repetitive tasks | Open-ended research, planning |
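
The fixed-workflow column can be illustrated with Python's standard-library `graphlib`, which orders the steps of a DAG so that every step runs after its dependencies. The invoice workflow and its step names are hypothetical, and the handlers are placeholders for real business logic or model calls.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical invoice workflow: each step maps to the steps it depends on.
DAG = {
    "extract": set(),
    "validate": {"extract"},
    "approve": {"validate"},
    "notify": {"approve"},
    "update_db": {"approve"},
}


def run_workflow(dag: dict[str, set[str]],
                 handlers: dict[str, Callable[[], None]]) -> list[str]:
    """Run every step after its dependencies and return the audit trail."""
    trail = []
    for step in TopologicalSorter(dag).static_order():
        handlers[step]()
        trail.append(step)
    return trail
```

The returned trail is what makes this style easy to audit: the set of steps is known before the run starts, and the order always respects the declared dependencies.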
Describe a multi-step business process you want to automate. The advisor will help you decide: agent vs. fixed workflow, what tools to expose, what the ReAct loop looks like, and where to put guardrails. Be specific about the steps, edge cases, and what happens when something goes wrong.
When Morgan Stanley deployed an internal GPT-4-powered assistant in March 2023, they faced a problem shared by every enterprise: the model knew nothing about their 100,000 internal research documents, advisor guidelines, or product inventory. The solution was not fine-tuning — it was Retrieval-Augmented Generation. They embedded their entire content library, built a vector search layer, and injected relevant chunks into every prompt. By Q4 2023, over 200 Morgan Stanley advisors were using the system daily.
Foundation models have a knowledge cutoff — they know nothing that happened after training ended. More critically, they know nothing about your business: your contracts, your SOPs, your product catalog, your customer history. Fine-tuning can address some of this, but it is expensive (thousands of dollars and days of compute), requires labeled data, and needs retraining every time data changes.
RAG solves the same problem more cheaply and dynamically: at query time, retrieve the most relevant documents from a vector database, inject them as context, and let the model answer against that context. The model's "knowledge" updates the moment you update your documents — no retraining required.
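The retrieve-then-inject flow can be sketched with the standard library alone. Note the substitution: `embed` here is a toy bag-of-words counter standing in for a real embedding model, and a Python list stands in for the vector database. The prompt wording is also just an illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Inject the retrieved documents as numbered context for the model."""
    hits = retrieve(query, docs, k)
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(hits))
    return ("Answer using only the context below and cite sources by number.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

Updating the system's knowledge means editing the `docs` list. Nothing about the model changes, which is the property the paragraph above describes.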
Notion AI, launched November 2022 and expanded in 2023, used RAG over a user's own Notion workspace to answer questions about their notes, documents, and databases — turning the entire workspace into a queryable knowledge base. The model never "learned" any user's content; it retrieved it at query time. Perplexity AI, which raised $73.6 million in January 2024 and was valued at $520 million, built its entire product on real-time web RAG — embedding live search results into every answer, with citations. By mid-2024 Perplexity reported over 10 million monthly active users.
Use RAG when your data changes frequently, you need source citations, or you want to avoid retraining costs. Use fine-tuning when you need to change the model's tone, style, or format of outputs, or when a specific skill must be deeply embedded (e.g., a clinical coding model trained on thousands of labeled examples). Many production systems use both: RAG for knowledge, fine-tuning for behavior.
Poor chunking is the most common cause of RAG quality failures. If chunks are too small, they lose context — a number without the sentence that gives it meaning. If chunks are too large, retrieval becomes imprecise — you pull an entire chapter when you needed one paragraph. The 2024 research consensus favors semantic chunking: splitting at natural topic boundaries detected by embedding similarity, not arbitrary character counts. Overlap between chunks (e.g., 50-token overlap) prevents context from being lost at boundaries.
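The overlap idea is easiest to see with fixed-size chunks. The sketch below is deliberately the simpler technique, a sliding window with overlap, not semantic chunking, which would need an embedding model to find topic boundaries. It treats a list of words as a stand-in for model tokens, and the default sizes are illustrative.

```python
def chunk_with_overlap(tokens: list[str], size: int = 200,
                       overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # The last window already reached the end of the text.
    return chunks
```

Each chunk begins with the final `overlap` tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.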
By 2024, the minimal RAG stack had become: LlamaIndex or LangChain for orchestration → OpenAI embeddings or open alternatives → Pinecone or pgvector for storage → GPT-4o or Claude for generation. Total setup time for a proof of concept: under one day. Total monthly cost for a small internal tool: $50–$300 depending on query volume.
Describe your organization's knowledge base — document types, volume, update frequency, and who queries it. The advisor will help you design the full RAG pipeline: chunking strategy, embedding model choice, vector DB selection, retrieval parameters, and generation setup. Ask about tradeoffs and edge cases.
In May 2024, Shopify's VP of Engineering publicly described their AI infrastructure bill as "the fastest-growing line item in the company" — and credited their cost-control work with keeping it manageable. That work included aggressive prompt caching (reducing repeated context tokens), output length limits, and model-tier routing. Companies that shipped AI features without these controls reported 3–5× higher-than-expected monthly API bills within sixty days.
Traditional software has deterministic outputs — you can write unit tests. AI outputs are probabilistic and often subjective. A customer support reply might be technically accurate but too terse; a summary might miss the key point without being factually wrong. Measuring this requires a different approach: LLM-as-judge, human evaluation pipelines, and golden dataset regression testing.
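Golden dataset regression testing can be sketched as a pass rate computed over a fixed set of cases and compared between versions. The dataset below is hypothetical, and `judge` is a simple rule-based check standing in for an LLM or human judge. The structure stays the same when a real judge is swapped in.

```python
from typing import Callable

# Hypothetical golden dataset: inputs paired with facts a good answer must contain.
GOLDEN = [
    {"input": "What is the refund window?", "must_include": ["14 days"]},
    {"input": "Do you ship abroad?", "must_include": ["yes", "customs"]},
]


def judge(answer: str, must_include: list[str]) -> bool:
    """Rule-based check standing in for an LLM-as-judge or human rating."""
    return all(term.lower() in answer.lower() for term in must_include)


def pass_rate(generate: Callable[[str], str], dataset: list[dict] = GOLDEN) -> float:
    """Share of golden cases for which the generated answer passes the judge."""
    passed = sum(judge(generate(case["input"]), case["must_include"])
                 for case in dataset)
    return passed / len(dataset)


def regressed(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True if the candidate's pass rate fell below the baseline by more than tolerance."""
    return candidate < baseline - tolerance
```

Running `pass_rate` on both the current and the proposed prompt or model, then gating deployment on `regressed`, turns a subjective quality question into a repeatable check.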
AI observability is the practice of logging, tracing, and monitoring model calls in production — not just catching errors but understanding behavior. The leading dedicated tools in 2024 were LangSmith (LangChain's observability product, launched 2023), Weights & Biases Weave, Helicone, and Arize AI. Each captures: the full prompt sent, the model's response, latency, token counts, cost, and any retrieved RAG context.
Without observability, debugging a hallucination in production is nearly impossible — you don't know what the model actually received as input, what context was retrieved, or whether the error was in retrieval, prompting, or generation.
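At its simplest, instrumentation is a wrapper that records every call as one structured log line. The sketch below does not use the API of any of the tools named above. `traced_call` and its field names are hypothetical, token counts are estimated at about four characters per token, and a single blended price is assumed for simplicity.

```python
import json
import time
from typing import Callable


def traced_call(model_fn: Callable[[str], str], prompt: str,
                retrieved_context: list[str],
                price_per_m_tokens: float = 5.0,
                sink: Callable[[str], None] = print) -> str:
    """Call the model and emit one JSON record describing what happened."""
    start = time.perf_counter()
    response = model_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    # Rough token estimate at about four characters per token.
    tokens = (len(prompt) + len(response)) // 4
    record = {
        "prompt": prompt,
        "response": response,
        "retrieved_context": retrieved_context,
        "latency_ms": round(latency_ms, 2),
        "tokens_estimate": tokens,
        "cost_estimate_usd": tokens * price_per_m_tokens / 1_000_000,
    }
    sink(json.dumps(record))
    return response
```

With the prompt and the retrieved context logged side by side, a bad answer can be traced to its stage: wrong documents point to retrieval, while right documents with a wrong answer point to prompting or generation.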
Scale AI, which provides data labeling and evaluation services, published that their enterprise AI evaluation pipeline runs LLM-as-judge at 95% of the scale of human evaluation — but at 1/40th the cost and 100× the speed. Their finding: for most quality dimensions (accuracy, format, completeness), GPT-4-class models as judges correlate at 0.85+ with expert human raters. For safety-critical dimensions, human review remains essential.
Uncontrolled AI API spend was one of the most common operational failures at AI-first companies in 2024. Five of the most effective cost controls, used across Shopify, Notion, Intercom, and others, are:
| Control | Mechanism | Typical Savings |
|---|---|---|
| Prompt Caching | Reuse repeated system prompt tokens through provider prompt caching (OpenAI, Anthropic). Cached tokens are billed at a discount instead of full price on every call. | 50–90% on system prompt tokens |
| Output Length Limits | Set max_tokens explicitly. Many tasks need 100 tokens; without limits, models can output 1,000+. | 30–60% reduction |
| Model Tier Routing | Route simple tasks to mini/flash models. Reserve flagship for complex queries only. | 40–70% on per-query cost |
| Semantic Caching | Cache full responses for semantically similar queries (GPTCache, Redis). Same question, same answer — don't call the API twice. | 20–50% at scale |
| Budget Alerts | Set hard spend limits and daily alerts via provider dashboards (OpenAI, Anthropic) or middleware. Kill switches prevent runaway costs. | Prevents catastrophic overruns |
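
The last row of the table, a budget alert plus a kill switch, can also be enforced in middleware. The sketch below is a minimal in-process version: the class names, the 80% alert threshold, and the in-memory counter are all assumptions, and a real deployment would persist spend and reset it daily.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the hard limit."""


class BudgetGuard:
    def __init__(self, daily_limit_usd: float, alert_fraction: float = 0.8):
        self.limit = daily_limit_usd
        self.alert_at = daily_limit_usd * alert_fraction
        self.spent = 0.0
        self.alerted = False

    def charge(self, cost_usd: float) -> None:
        """Record a call's cost, or refuse it if the limit would be exceeded."""
        if self.spent + cost_usd > self.limit:
            # Kill switch: block the call instead of recording the spend.
            raise BudgetExceeded(
                f"limit {self.limit:.2f} USD would be exceeded")
        self.spent += cost_usd
        if self.spent >= self.alert_at:
            # A real system would notify someone here.
            self.alerted = True
```

Calling `charge` before each API request, with an estimated cost, means a runaway loop fails fast with an exception and does not surface weeks later on an invoice.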
Production AI requires a continuous loop: evaluate a change on your golden dataset → deploy with observability instrumented → monitor quality and cost metrics in real time → sample for human review → update golden dataset → repeat. Teams that skip evaluation before deployment consistently regress quality on edge cases that weren't tested. Teams that skip monitoring miss cost spikes and silent quality degradation.
An AI-first business that ships without evaluation, observability, and cost controls is not a mature AI company — it is running blind. The infrastructure for these is now cheap (LangSmith free tier, Helicone free tier, OpenAI usage dashboards). There is no excuse for skipping it. Instrument on day one.
Describe an AI feature or product you're building or have shipped — or a hypothetical scenario. The advisor will conduct a structured audit: evaluation coverage, observability instrumentation, and cost-control mechanisms. You'll get specific recommendations with tooling and estimated savings.