In 2023, Klarna deployed an AI customer-service agent that handled the equivalent workload of 700 full-time agents in its first month. One engineering decision central to that throughput: the system did not wait for one reasoning step to finish before starting the next retrievable data fetch. Intent classification, account data retrieval, and policy lookup ran as concurrent tasks. The agent assembled its answer only after all three streams resolved — typically within 600 ms instead of the 2+ seconds a strictly sequential design would have required. Klarna publicly cited this architecture as key to their cost reduction from roughly $40 per resolved conversation to under $1.
Most LLM agent code is written sequentially by default because it mirrors how developers think: ask a question, get an answer, use that answer to ask the next question. The problem is that this serializes operations that have no actual data dependency on each other, stacking their latencies end-to-end.
Consider a research agent that must: (1) search the web for a company's latest earnings, (2) retrieve internal CRM notes about that company, and (3) pull the user's calendar to check meeting context. None of these three tasks needs the result of another to begin. Done sequentially with 800 ms each, the agent waits 2.4 seconds before it can write a single word of output. Done in parallel, all three complete in ~800 ms — a 3× wall-clock improvement with zero change to answer quality.
Latency in a parallel system is determined by the slowest single task, not the sum of all tasks. Identifying and eliminating that bottleneck is more valuable than optimizing any fast task.
The dependency graph of an agent's subtasks is the map you need. Draw it explicitly. Any nodes with no incoming edges from other in-flight tasks can be dispatched simultaneously. In Python, this is typically implemented with asyncio.gather() or a thread pool executor for blocking I/O calls.
A more aggressive form of parallelization is speculative execution — launching work you might need before you have confirmed you need it. This is how modern CPUs achieve their speed, and it maps directly to agent design.
Anthropic's documentation on building effective agents describes a pattern where, if an agent has a 70 % probability of needing a particular tool call based on early context, it can fire that call speculatively while the LLM finishes its reasoning. If the tool result is ultimately unneeded, it is discarded. If it is needed, it is already available. The expected latency gain is: P(need) × latency_of_tool_call. At 70 % probability and a 500 ms tool call, that is 350 ms of expected savings per speculative dispatch.
Google's NotebookLM team described in a 2024 engineering post that their audio overview generation pipeline pre-fetches source embeddings for all cited documents in parallel with the outline generation step, cutting perceived generation latency by approximately 40 % for multi-source notebooks.
Speculative execution carries a cost: wasted API calls and compute when the branch is unused. The break-even point depends on your P(use), tool latency, and token cost. For high-probability branches with slow tool calls — like database lookups — speculation is almost always worth it. For low-probability branches with fast tools, it usually is not.
The canonical parallel agent pattern is fan-out / fan-in. A coordinator agent receives a complex task, decomposes it into independent subtasks, dispatches all subtasks simultaneously (fan-out), then collects and synthesizes results when all complete (fan-in). This is the architecture behind systems like LangChain's parallel chains and LlamaIndex's sub-question query engine.
The synthesis step — the fan-in — is itself a latency risk. If you fan out to five subagents but your synthesis prompt includes all five outputs verbatim, you have a large-context synthesis call that can take longer than any single subagent. Mitigation strategies include: streaming synthesis so output begins as soon as the first subagent resolves; summarizing each subagent output before passing it forward; and using a smaller, faster model for the synthesis pass when the task is primarily assembly rather than reasoning.
You are designing a travel-booking agent that must: search flights, check hotel availability, look up visa requirements, and retrieve the user's loyalty tier from a database — all before composing a recommendation. Your AI coach will challenge you on dependency graphs, parallel dispatch decisions, and fan-in synthesis strategies.
When Anthropic launched prompt caching for Claude in August 2024, they documented a specific benchmark: a 45,000-token legal document used as context for repeated Q&A. Without caching, each follow-up question re-processed all 45,000 tokens at full cost and latency. With caching, after the first call, subsequent calls using that same context prefix saw time-to-first-token drop by up to 85 % on cache hits. Anthropic's pricing reflected the compute reality: cached input tokens were billed at 10 % of the standard input rate. For enterprise customers running document-heavy agents — legal review, code analysis, long-context RAG — this represented both a latency and cost transformation at the same time.
When a model processes a prompt, it converts each token into a sequence of internal vector representations (the KV cache — key-value pairs in the attention mechanism). Normally this computation happens fresh on every API call. Prompt caching stores those computed representations server-side, keyed to a specific token sequence. If a subsequent call begins with the same prefix, the server skips the re-computation and loads from cache instead.
The latency reduction is most dramatic on time-to-first-token (TTFT), which is what users perceive as "how long until the agent starts responding." A 50,000-token system prompt that previously caused 3–4 seconds of TTFT can drop to under 500 ms on a cache hit. This changes what is architecturally feasible: agents that carry large persistent context (coding environments, lengthy policy documents, full conversation histories) become viable for latency-sensitive applications.
For Anthropic's prompt caching: the cached prefix must be at least 1,024 tokens; the prefix must match exactly (character-for-character); and the cache expires after approximately 5 minutes of inactivity. Insertions mid-prompt invalidate the cache for all tokens after the insertion point.
Cache utilization is an architectural decision, not an automatic benefit. The fundamental rule is: stable content goes first, dynamic content goes last. A prompt structured as [system instructions (5,000 tokens)] + [document context (40,000 tokens)] + [user question (50 tokens)] will cache the first 45,000 tokens on every call where the instructions and document are unchanged, varying only the tiny trailing question.
The antipattern is injecting dynamic content into the middle of a prompt. If you put a timestamp, user ID, or session-specific variable anywhere before a large stable block, you break the cache for everything after that injection point. In agents with conversation history, this means appending new turns at the very end — never inserting them before tool results or document context.
OpenAI introduced automatic prompt caching for GPT-4o in October 2024, caching prefixes ≥ 1,024 tokens automatically with no explicit markup required. The cache hit discount is 50 % on input tokens, with hit rates improving as request volume increases. Both platforms reward the same structural principle: long stable prefixes.
Cache warming — deliberately sending a priming request to populate the cache before production traffic arrives — is a documented production pattern for latency-sensitive agents. If you know users will query a 100-page PDF starting at 9 AM, you fire a throwaway request with that document at 8:55 AM so the cache is hot when real traffic hits. This trades a small pre-warm cost for guaranteed low TTFT on the first real request.
TTL (time-to-live) management matters when caches expire between requests. Anthropic's ~5-minute TTL means a user who pauses for 6 minutes in a conversation will experience full-latency on their next message. In UX terms, this is an argument for showing "typing" indicators or interim status messages during cold-cache calls — setting accurate user expectations rather than masking variance. Some teams implement client-side keep-alive pings (a lightweight re-send of the cached prefix every 4 minutes) to extend cache lifetime during active sessions.
You are auditing a legal document review agent. Its current prompt structure is: [timestamp + user ID] → [45,000-token contract text] → [system instructions] → [user query]. Cache hit rate is near zero despite the contract being the same for every query in a session.
In a 2024 benchmarking study published by the AI engineering team at Brex, their internal expense-categorization agent was taking an average of 4.1 seconds to respond. The model being used (GPT-4) was not the problem — the prompt was. It began with 800 tokens of general background about the company before reaching the actual task instruction. When the team restructured the prompt to lead with a direct, specific instruction — task first, context second — and stripped the preamble, average latency dropped to 2.3 seconds with no measurable accuracy regression. The same tokens, reorganized, reduced TTFT by 44 % because the model's generation of the output began earlier in its forward pass once the instruction was unambiguous from the first tokens.
Output tokens are significantly more expensive in latency than input tokens. Each output token is generated autoregressively — the model cannot parallelize output generation the way it processes input. A response that uses 800 tokens where 300 would suffice has added measurable latency that compounds across a multi-turn agent conversation.
The most effective output-compression technique is explicit format constraint in the system prompt. Instead of "provide a thorough analysis," use "respond in JSON with keys: decision (string), reasoning (≤2 sentences), confidence (0–1)." This removes the model's default tendency to pad with transitions, acknowledgments, and elaborations it cannot know you do not want. In Anthropic's documentation, they call this "defining the output schema explicitly" and note it can reduce output token count by 30–60 % for structured-output tasks.
For APIs that support it (including Claude's), you can set the assistant's first tokens (the "prefill") to skip the model's preamble entirely. Setting prefill to {"decision": forces JSON output starting from the very first token, eliminating opener sentences like "Here is the analysis you requested:" and saving 10–30 tokens per call.
Extended chain-of-thought (CoT) reasoning dramatically improves accuracy on hard tasks but adds significant latency — sometimes 5–10× the output token count of a direct answer. The key engineering insight is that CoT is not always necessary, and when it is necessary, it does not always need to be visible to the user.
For routing and classification tasks (is this a refund request, a technical question, or account inquiry?), a small fast model with no CoT typically matches a large model's accuracy while being 5–10× faster. Anthropic's model hierarchy — Haiku, Sonnet, Opus — is designed explicitly for this pattern: use Haiku for high-volume classification, Sonnet for moderate reasoning, Opus only when the task genuinely requires frontier capability.
Anthropic's agent design guide (2024) explicitly recommends building routing layers that classify task complexity before model selection. A two-step pipeline — classify first with a fast model, then route to the appropriate capability tier — reduces median latency even when the second step uses a slow model, because most tasks end up in the fast tier.
Streaming does not reduce actual token generation latency, but it transforms perceived latency — which is what users measure. A 3-second response streamed from token 1 feels faster than a 2-second response that displays all at once. The psychological principle is "time-to-meaningful-content" rather than "time-to-complete."
For agent applications, the practical implication is to surface the first meaningful token as early as possible. If the agent is doing tool calls before generating a response, consider streaming a status message ("Searching for flights...") during the tool call rather than presenting a blank screen. GitHub Copilot's engineering team documented in 2023 that showing intermediate completions — even before the model's full output was certain — reduced perceived latency by 35 % in user studies despite identical actual generation times.
{"decision":), you skip 10–30 tokens of preamble the model would otherwise generate. The model begins producing the actual content from token 1 of its response.Below is a real-world-style agent prompt that is generating slow responses. Your coach will guide you through rewriting it for maximum speed without accuracy loss.
This lesson explores architecture & tradeoffs — examining the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to architecture & tradeoffs.