In 2023, Klarna's AI assistant — built on OpenAI's API and handling over 700,000 customer service conversations per month — reported that a significant portion of inbound queries were near-identical: order status lookups, return policy questions, and payment FAQ variants. By routing these through a result cache keyed on normalized query text, Klarna reduced redundant LLM calls dramatically. The principle was simple: if two queries resolve to the same intent and the answer doesn't change faster than the cache TTL, one LLM call is enough.
Every piece of data an agent handles sits somewhere on a spectrum from "always cacheable" to "never cacheable." At one end: static system instructions, tool schemas, and reference documents that change only on deliberate deployment events. At the other end: live financial quotes, user authentication tokens, and real-time sensor readings that are stale within seconds of generation.
The critical analytical skill is placing each data type accurately on that spectrum, because a cache hit on stale financial data is actively dangerous, while refusing to cache a static FAQ response is just wasteful. Neither failure mode is trivial at scale.
Cache data at the granularity that matches its actual change rate. Combining fast-changing and slow-changing data into a single cache entry forces you to use the shorter TTL for everything, destroying the value of caching the stable component.
Anthropic's prompt caching feature — introduced in 2024 for Claude — introduced a specific dimension of cacheability that applies to the token processing stage rather than the result stage. When a long system prompt or document context is marked with cache_control: ephemeral, the KV (key-value) cache of the transformer's attention computation is stored on Anthropic's infrastructure. Subsequent requests reusing that exact prefix skip recomputation and are billed at 10% of the normal input token cost.
OpenAI rolled out analogous automatic prompt caching for GPT-4o in late 2024, caching prefixes of 1,024 tokens or more automatically at 50% cost reduction. The design difference matters: Anthropic's system requires explicit cache_control markers, giving the developer precise control; OpenAI's system is automatic but less controllable. For agents with long, stable system prompts — legal analysis agents, code review agents carrying large style guides — this can represent 60–80% reduction in per-call costs.
To maximize prompt cache hit rates, structure your prompts so stable content appears at the top (system role, instructions, reference documents) and variable content appears at the bottom (user message, dynamic context). Cache hits require prefix matching — any change to a character before the cached region breaks the hit.
Beyond raw LLM inputs, agents produce several categories of intermediate output that are worth caching. Tool call results — especially API responses from external services — are expensive to re-fetch and often stable within a request window. Embedding vectors for documents that don't change are computationally expensive to regenerate. Retrieved chunks from a RAG pipeline, given a stable document corpus, can be cached keyed on the query embedding and document set version hash.
Planning outputs from a reasoning step — the task decomposition a ReAct agent produces before executing — are worth caching when the same high-level goal recurs across sessions. GitHub Copilot's infrastructure caches code completion prefixes across users working in the same repository, leveraging the fact that common boilerplate appears repeatedly across sessions even from different developers.
You're architecting a customer-service agent for an e-commerce platform. The agent has access to: a static brand voice guide, live inventory counts, user order history, product descriptions, and real-time shipping carrier status APIs.
In this lab you'll work through:
When Notion launched its AI features in early 2023 — embedded directly into the document editor and backed by OpenAI's API — latency was a primary user complaint. The team implemented a multi-layer caching architecture: an in-process memory cache for prompt prefixes within a single editing session, a Redis-backed distributed cache for repeated block-level completions shared across users, and Cloudflare's edge cache for static system prompts served at the CDN layer. Each layer had a different TTL and eviction policy. The result was sub-100ms response times for common completions that would otherwise require a round trip to OpenAI's API.
Production agent systems typically operate across three distinct cache layers, each optimized for different access patterns and scale characteristics.
functools.lru_cache, LangChain's InMemoryCache.Cache lookups should cascade: check L1 first, then L2, then L3, then origin. Writes should populate all lower layers. A miss at any layer should warm the layers above it so the next request is faster. This is the same write-through / read-through pattern used in CPU cache hierarchies.
Agents introduce concerns that pure web caches don't face. The first is statefulness: a web cache stores response payloads, but an agent's cache may store intermediate reasoning states, partially completed tool call sequences, or RAG context windows. These need careful scoping — a cached tool result from user A's session must never be served for user B without explicit cross-user sharing logic.
The second concern is the multi-step nature of agent execution. In a ReAct loop, caching the output of step 2 is only valid if steps 0 and 1 were identical. The cache key must encode enough of the execution context to guarantee this. LangChain's SQLiteCache, for example, keys on the exact prompt text sent to the LLM — meaning it won't incorrectly reuse a cached response if the context changed.
LangChain's production caching backends — including Redis, MongoDB, and Cassandra integrations released in 2023–2024 — all key on the full serialized prompt. This means cache keys can be very large, and teams often hash them (SHA-256) to keep key sizes manageable while avoiding collisions.
A third concern specific to agents is tool call deduplication within a single run. When an agent's ReAct loop generates the same tool call twice in one execution (a known failure mode called "looping"), an L1 in-process cache that returns the cached result immediately breaks the loop without incurring a second external API call.
For serverless or edge-deployed agents (AWS Lambda, Cloudflare Workers), L1 in-process caching is largely useless because each invocation may run in a fresh process. The architecture must rely on L2/L3 for any cache benefit. Redis on ElastiCache or Upstash's serverless Redis (which offers HTTP-based access compatible with edge runtimes) are the standard choices in 2024 production deployments.
For long-running agent servers — common in enterprise deployments running LangGraph or CrewAI — L1 caching provides substantial benefit and should be the first layer added. The @lru_cache decorator on embedding generation functions alone can halve embedding API costs when documents repeat across requests within a server's lifetime.
You're building a code review agent that runs on AWS Lambda, uses a large GPT-4o system prompt containing your organization's coding standards, calls a GitHub API to fetch PR diffs, and generates inline comments. The agent processes ~500 PRs per day, with many PRs touching the same files.
In this lab you'll work through:
In 2024, a widely-reported issue with several RAG-based legal research tools involved documents being updated by regulatory bodies — GDPR guidance updates, SEC rule amendments — while the tools' vector stores and result caches retained the old versions. Attorneys received AI-generated summaries confidently citing superseded rules. The root cause was not missing invalidation logic, but invalidation logic tied only to TTL rather than to source document version hashes. When the source document changed, the cache had no mechanism to detect it until the TTL expired — sometimes weeks later.
TTL (time-to-live) invalidation is the simplest strategy: every cache entry expires after a fixed duration. It requires no coordination between the cache and the data source. But it has two failure modes: entries that expire too early cause unnecessary cache misses, and entries that expire too late serve stale data. For agent knowledge bases, "too late" can mean users receiving wrong answers for the entire TTL window after the source changes.
Event-driven invalidation solves the staleness problem by coupling the cache to the data source's change events. When a document is updated, a webhook, message queue event, or database trigger fires and explicitly deletes or updates the affected cache entries. This is more complex to implement but eliminates the staleness window entirely.
The legal tool failures stemmed from using TTL alone on regulatory documents. The fix: compute a SHA-256 hash of each source document at cache write time, store it with the cache entry, and re-verify on read. If the hash doesn't match the current document hash, treat as a miss and re-fetch. This adds a lightweight source read on every cache hit but eliminates silent staleness.
When an invalidation event fires, you need to invalidate the right scope. Three invalidation granularities matter in practice:
Entry-level invalidation deletes a single cache key. This is precise but requires knowing the exact key. For agent caches keyed on SHA-256 prompt hashes, the system must maintain a reverse mapping from source document ID to all cache keys that included that document. Without this, entry-level invalidation on a document change requires a full cache scan.
Tag-based invalidation groups cache entries under semantic tags. When the "pricing-policy-v3" document updates, all cache entries tagged with "pricing-policy-v3" are invalidated in one operation. Redis 7.4 (released 2024) introduced native support for key expiration callbacks that can power this pattern. FastAPI's fastapi-cache2 library supports tag-based invalidation out of the box.
Namespace invalidation wipes an entire cache namespace at once. Used when a major update affects too many entries to invalidate individually — for example, when a company's entire product catalog is re-priced. The trade-off is a temporary cache miss storm (cache stampede) as all requests hit the origin simultaneously.
A cache stampede after namespace invalidation can overwhelm your origin. Mitigation techniques include: probabilistic early expiration (regenerate cache entries slightly before expiry using a random jitter), request coalescing (a single "lock" key ensures only one request regenerates a cache entry while others wait), and staggered TTL assignment so entries don't all expire simultaneously.
For LLM-side prompt caches (Anthropic's and OpenAI's), invalidation is implicit: any change to the cached prefix — including a single character — creates a new cache key and starts a new cache entry. This means prompt cache invalidation is automatic but also accidental. A tempting anti-pattern is injecting dynamic content (current date, user-specific flags) into the system prompt, which defeats prefix caching entirely. Anthropic's documentation explicitly recommends moving all dynamic content below the cached prefix boundary.
In multi-tenant agent deployments, each tenant's customized system prompt is a different cache entry. When updating a base template that all tenants share, the update propagates only to newly created sessions — existing cached sessions retain the old prefix until their ephemeral cache expires (Anthropic's ephemeral TTL is 5 minutes). This 5-minute window must be accounted for in deployment rollout plans for agents where prompt changes have correctness implications.
You're building a compliance monitoring agent that watches regulatory document sources (SEC filings, GDPR guidance updates, OSHA standards) and answers questions about current rules. Documents can update at any time. Your cache currently uses TTL-only invalidation with a 24-hour window.
In this lab you'll work through:
This lesson explores l4: semantic caching — examining the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to l4: semantic caching.