🎯 Advanced

What to Cache

Not everything deserves to be stored. Understanding cacheability is the first step to building agents that scale without accumulating garbage.

In 2023, Klarna's AI assistant — built on OpenAI's API and handling over 700,000 customer service conversations per month — reported that a significant portion of inbound queries were near-identical: order status lookups, return policy questions, and payment FAQ variants. By routing these through a result cache keyed on normalized query text, Klarna reduced redundant LLM calls dramatically. The principle was simple: if two queries resolve to the same intent and the answer doesn't change faster than the cache TTL, one LLM call is enough.

The Cacheability Spectrum

Every piece of data an agent handles sits somewhere on a spectrum from "always cacheable" to "never cacheable." At one end: static system instructions, tool schemas, and reference documents that change only on deliberate deployment events. At the other end: live financial quotes, user authentication tokens, and real-time sensor readings that are stale within seconds of generation.

The critical analytical skill is placing each data type accurately on that spectrum, because a cache hit on stale financial data is actively dangerous, while refusing to cache a static FAQ response is just wasteful. Neither failure mode is trivial at scale.

Static artifacts: System prompts, tool definitions, policy documents, brand voice guides — cache aggressively with long or indefinite TTL.
Slow-changing data: Product catalogs, user preference profiles, pricing tiers — cache with TTLs measured in hours, with event-driven invalidation on update.
Session-scoped data: Conversation context, current task state — cache for session duration, invalidate on session end.
Real-time data: Live inventory, token prices, authentication challenges — do not cache or use sub-second TTLs with explicit staleness flags.

Key Principle

Cache data at the granularity that matches its actual change rate. Combining fast-changing and slow-changing data into a single cache entry forces you to use the shorter TTL for everything, destroying the value of caching the stable component.

LLM-Specific Cacheability: Prompt Caching

Anthropic's prompt caching feature — introduced in 2024 for Claude — introduced a specific dimension of cacheability that applies to the token processing stage rather than the result stage. When a long system prompt or document context is marked with cache_control: ephemeral, the KV (key-value) cache of the transformer's attention computation is stored on Anthropic's infrastructure. Subsequent requests reusing that exact prefix skip recomputation and are billed at 10% of the normal input token cost.

OpenAI rolled out analogous automatic prompt caching for GPT-4o in late 2024, caching prefixes of 1,024 tokens or more automatically at 50% cost reduction. The design difference matters: Anthropic's system requires explicit cache_control markers, giving the developer precise control; OpenAI's system is automatic but less controllable. For agents with long, stable system prompts — legal analysis agents, code review agents carrying large style guides — this can represent 60–80% reduction in per-call costs.

Design Implication

To maximize prompt cache hit rates, structure your prompts so stable content appears at the top (system role, instructions, reference documents) and variable content appears at the bottom (user message, dynamic context). Cache hits require prefix matching — any change to a character before the cached region breaks the hit.

What Agents Specifically Produce Worth Caching

Beyond raw LLM inputs, agents produce several categories of intermediate output that are worth caching. Tool call results — especially API responses from external services — are expensive to re-fetch and often stable within a request window. Embedding vectors for documents that don't change are computationally expensive to regenerate. Retrieved chunks from a RAG pipeline, given a stable document corpus, can be cached keyed on the query embedding and document set version hash.

Planning outputs from a reasoning step — the task decomposition a ReAct agent produces before executing — are worth caching when the same high-level goal recurs across sessions. GitHub Copilot's infrastructure caches code completion prefixes across users working in the same repository, leveraging the fact that common boilerplate appears repeatedly across sessions even from different developers.

❓ Quiz

Lesson 1 Quiz

3 questions — free, untracked, retake anytime.

1. Klarna's AI caching strategy worked primarily because a large portion of their customer queries shared which characteristic?

✓ Correct — ✓ Correct. Near-identical intent with a stable answer is the exact condition that makes result caching effective — one LLM call satisfies all equivalent requests.

✗ Not quite. The key was that many queries resolved to the same intent and the answer didn't change faster than the cache TTL — not surface-level query similarity.

2. Anthropic's prompt caching charges cached input tokens at what fraction of normal input token cost?

✓ Correct — ✓ Correct. Anthropic prices prompt cache hits at 10% of normal input token cost, versus OpenAI's 50% for their automatic caching on GPT-4o.

✗ Not correct. Anthropic's prompt caching hits are billed at 10% of normal input token cost. OpenAI's automatic caching is 50%.

3. To maximize prompt cache hit rates with prefix-based caching, where should variable content (like the user message) be placed?

✓ Correct — ✓ Correct. Prefix caching requires that the cached portion be an exact prefix match. Placing variable content at the bottom preserves the stable prefix and maximizes cache hits.

✗ Not quite. Since prefix caching requires the cached text to appear at the start of the prompt unchanged, all variable content must go at the bottom to preserve that prefix.

🧪 Lab

Lab 1: Classifying Cacheability

Work with an AI tutor to analyze real agent data types and place them on the cacheability spectrum.

Your Mission

You're architecting a customer-service agent for an e-commerce platform. The agent has access to: a static brand voice guide, live inventory counts, user order history, product descriptions, and real-time shipping carrier status APIs.

In this lab you'll work through:

Classifying each data type on the cacheability spectrum with justification
Deciding whether to use result caching, prompt caching, or both for each type
Identifying which data types, if cached incorrectly, could cause user-facing errors

Start by asking the tutor to walk you through classifying the brand voice guide — then work through the remaining data types one by one.

🧪 Lab Assistant — Cacheability Classification Advanced · Module 5

🎯 Advanced

Cache Layers

Where you store a cache is as important as what you store. Agents operating across distributed infrastructure need a deliberate layering strategy.

When Notion launched its AI features in early 2023 — embedded directly into the document editor and backed by OpenAI's API — latency was a primary user complaint. The team implemented a multi-layer caching architecture: an in-process memory cache for prompt prefixes within a single editing session, a Redis-backed distributed cache for repeated block-level completions shared across users, and Cloudflare's edge cache for static system prompts served at the CDN layer. Each layer had a different TTL and eviction policy. The result was sub-100ms response times for common completions that would otherwise require a round trip to OpenAI's API.

The Three-Layer Model

Production agent systems typically operate across three distinct cache layers, each optimized for different access patterns and scale characteristics.

L1 — In-process / in-memory: Fastest access (nanoseconds), zero serialization overhead. Limited to the memory of a single process. Appropriate for: per-request memoization, session state that doesn't need to survive process restart. Tools: Python's functools.lru_cache, LangChain's InMemoryCache.
L2 — Shared distributed cache: Millisecond access, shared across all agent instances. Survives individual process restarts. Appropriate for: shared tool results, embedding vectors, RAG retrieval results. Tools: Redis, Memcached, DynamoDB with TTL.
L3 — Persistent / edge cache: Hundreds of milliseconds to seconds for cold reads, effectively zero for CDN-edge hits. Appropriate for: static system prompts, documentation chunks, tool schemas. Tools: Cloudflare KV, Fastly, AWS CloudFront with S3 origin.

Architecture Rule

Cache lookups should cascade: check L1 first, then L2, then L3, then origin. Writes should populate all lower layers. A miss at any layer should warm the layers above it so the next request is faster. This is the same write-through / read-through pattern used in CPU cache hierarchies.

Agent-Specific Considerations

Agents introduce concerns that pure web caches don't face. The first is statefulness: a web cache stores response payloads, but an agent's cache may store intermediate reasoning states, partially completed tool call sequences, or RAG context windows. These need careful scoping — a cached tool result from user A's session must never be served for user B without explicit cross-user sharing logic.

The second concern is the multi-step nature of agent execution. In a ReAct loop, caching the output of step 2 is only valid if steps 0 and 1 were identical. The cache key must encode enough of the execution context to guarantee this. LangChain's SQLiteCache, for example, keys on the exact prompt text sent to the LLM — meaning it won't incorrectly reuse a cached response if the context changed.

Real Deployment

LangChain's production caching backends — including Redis, MongoDB, and Cassandra integrations released in 2023–2024 — all key on the full serialized prompt. This means cache keys can be very large, and teams often hash them (SHA-256) to keep key sizes manageable while avoiding collisions.

A third concern specific to agents is tool call deduplication within a single run. When an agent's ReAct loop generates the same tool call twice in one execution (a known failure mode called "looping"), an L1 in-process cache that returns the cached result immediately breaks the loop without incurring a second external API call.

Choosing a Cache Backend

For serverless or edge-deployed agents (AWS Lambda, Cloudflare Workers), L1 in-process caching is largely useless because each invocation may run in a fresh process. The architecture must rely on L2/L3 for any cache benefit. Redis on ElastiCache or Upstash's serverless Redis (which offers HTTP-based access compatible with edge runtimes) are the standard choices in 2024 production deployments.

For long-running agent servers — common in enterprise deployments running LangGraph or CrewAI — L1 caching provides substantial benefit and should be the first layer added. The @lru_cache decorator on embedding generation functions alone can halve embedding API costs when documents repeat across requests within a server's lifetime.

❓ Quiz

Lesson 2 Quiz

3 questions — free, untracked, retake anytime.

1. In Notion's AI caching architecture, which layer handled repeated block-level completions shared across multiple users?

✓ Correct — ✓ Correct. Redis handled the shared, cross-user completions. The in-process cache was session-scoped, and Cloudflare handled static system prompts.

✗ Not quite. Redis was the shared distributed layer. The in-process cache was per-session, and Cloudflare handled static content at the edge.

2. Why is L1 in-process caching largely ineffective for agents deployed on serverless platforms like AWS Lambda?

✓ Correct — ✓ Correct. Without a persistent process, L1 memory caches don't survive across invocations. The architecture must rely on L2 distributed caches like Redis for any cache persistence.

✗ Not quite. The issue is process lifecycle: each Lambda invocation may be a cold start with a fresh process, so anything stored in memory from a previous invocation is gone.

3. LangChain's production cache backends key on the full serialized prompt. Why do teams typically SHA-256 hash these keys?

✓ Correct — ✓ Correct. Full serialized prompts can be very large. SHA-256 produces a fixed-length 64-character key regardless of prompt size, keeping the cache index lightweight.

✗ Not quite. The reason is practical key size management. Full prompt strings can be thousands of tokens long — SHA-256 hashing produces a compact, fixed-length key while preserving uniqueness.

🧪 Lab

Lab 2: Designing a Cache Layer Stack

Design a three-layer cache architecture for a real production scenario with the AI tutor.

Your Mission

You're building a code review agent that runs on AWS Lambda, uses a large GPT-4o system prompt containing your organization's coding standards, calls a GitHub API to fetch PR diffs, and generates inline comments. The agent processes ~500 PRs per day, with many PRs touching the same files.

In this lab you'll work through:

Deciding which cache layer is appropriate for the system prompt vs. PR diff results vs. generated comments
Designing the cache key structure for each layer
Handling the Lambda-specific constraint that L1 caching may not persist between invocations

Ask the tutor to start by helping you decide what goes in L2 (Redis) vs. L3 (edge/CDN) for this specific agent scenario.

🧪 Lab Assistant — Cache Layer Design Advanced · Module 5

🎯 Advanced

Invalidation Strategies

Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. For agents, stale data can mean wrong answers delivered confidently.

In 2024, a widely-reported issue with several RAG-based legal research tools involved documents being updated by regulatory bodies — GDPR guidance updates, SEC rule amendments — while the tools' vector stores and result caches retained the old versions. Attorneys received AI-generated summaries confidently citing superseded rules. The root cause was not missing invalidation logic, but invalidation logic tied only to TTL rather than to source document version hashes. When the source document changed, the cache had no mechanism to detect it until the TTL expired — sometimes weeks later.

TTL-Based vs. Event-Driven Invalidation

TTL (time-to-live) invalidation is the simplest strategy: every cache entry expires after a fixed duration. It requires no coordination between the cache and the data source. But it has two failure modes: entries that expire too early cause unnecessary cache misses, and entries that expire too late serve stale data. For agent knowledge bases, "too late" can mean users receiving wrong answers for the entire TTL window after the source changes.

Event-driven invalidation solves the staleness problem by coupling the cache to the data source's change events. When a document is updated, a webhook, message queue event, or database trigger fires and explicitly deletes or updates the affected cache entries. This is more complex to implement but eliminates the staleness window entirely.

TTL: Simple, low-coordination overhead. Best for data with predictable and acceptable staleness windows.
Event-driven: Zero staleness window. Requires change event infrastructure (Kafka, SNS, database triggers). Best for data where any staleness is unacceptable.
Version-hash invalidation: Cache entries include a hash of the source document. On retrieval, hash is verified against source. Stale entries are detected immediately without requiring push events from the source.
Hybrid: TTL as a safety net, event-driven for known high-change data, version hashing for critical documents.

The Legal RAG Lesson

The legal tool failures stemmed from using TTL alone on regulatory documents. The fix: compute a SHA-256 hash of each source document at cache write time, store it with the cache entry, and re-verify on read. If the hash doesn't match the current document hash, treat as a miss and re-fetch. This adds a lightweight source read on every cache hit but eliminates silent staleness.

Invalidation Scope: Entry, Tag, and Namespace

When an invalidation event fires, you need to invalidate the right scope. Three invalidation granularities matter in practice:

Entry-level invalidation deletes a single cache key. This is precise but requires knowing the exact key. For agent caches keyed on SHA-256 prompt hashes, the system must maintain a reverse mapping from source document ID to all cache keys that included that document. Without this, entry-level invalidation on a document change requires a full cache scan.

Tag-based invalidation groups cache entries under semantic tags. When the "pricing-policy-v3" document updates, all cache entries tagged with "pricing-policy-v3" are invalidated in one operation. Redis 7.4 (released 2024) introduced native support for key expiration callbacks that can power this pattern. FastAPI's fastapi-cache2 library supports tag-based invalidation out of the box.

Namespace invalidation wipes an entire cache namespace at once. Used when a major update affects too many entries to invalidate individually — for example, when a company's entire product catalog is re-priced. The trade-off is a temporary cache miss storm (cache stampede) as all requests hit the origin simultaneously.

Cache Stampede Mitigation

A cache stampede after namespace invalidation can overwhelm your origin. Mitigation techniques include: probabilistic early expiration (regenerate cache entries slightly before expiry using a random jitter), request coalescing (a single "lock" key ensures only one request regenerates a cache entry while others wait), and staggered TTL assignment so entries don't all expire simultaneously.

Prompt Cache Invalidation

For LLM-side prompt caches (Anthropic's and OpenAI's), invalidation is implicit: any change to the cached prefix — including a single character — creates a new cache key and starts a new cache entry. This means prompt cache invalidation is automatic but also accidental. A tempting anti-pattern is injecting dynamic content (current date, user-specific flags) into the system prompt, which defeats prefix caching entirely. Anthropic's documentation explicitly recommends moving all dynamic content below the cached prefix boundary.

In multi-tenant agent deployments, each tenant's customized system prompt is a different cache entry. When updating a base template that all tenants share, the update propagates only to newly created sessions — existing cached sessions retain the old prefix until their ephemeral cache expires (Anthropic's ephemeral TTL is 5 minutes). This 5-minute window must be accounted for in deployment rollout plans for agents where prompt changes have correctness implications.

❓ Quiz

Lesson 3 Quiz

3 questions — free, untracked, retake anytime.

1. The legal RAG tool failures described in this lesson were caused by what specific invalidation design flaw?

✓ Correct — ✓ Correct. TTL-only invalidation has no mechanism to detect that the source document changed. The cache could serve stale regulatory guidance for the entire TTL window — potentially weeks.

✗ Not quite. The flaw was specifically that TTL-only invalidation has no coupling to source document changes. When a document was updated, the cache had no way to know until the TTL expired.

2. What is a "cache stampede" in the context of namespace invalidation?

✓ Correct — ✓ Correct. A cache stampede (also called a thundering herd) occurs when a mass expiration or invalidation causes all requests to simultaneously miss the cache and hit the origin system, potentially overwhelming it.

✗ Not quite. A cache stampede happens after mass invalidation when all requests simultaneously miss the cache and hit the origin server at once, which can overwhelm the origin.

3. Anthropic's ephemeral prompt cache TTL is approximately how long, and why does this matter for deployment rollouts?

✓ Correct — ✓ Correct. Anthropic's ephemeral cache has a ~5-minute TTL. For correctness-critical prompt updates, deployment plans must account for this window where old prompts remain cached.

✗ Not quite. The ephemeral TTL is approximately 5 minutes. Any existing session may continue using the old cached prompt for up to 5 minutes after a prompt update is deployed.

🧪 Lab

Lab 3: Building an Invalidation Strategy

Design a multi-strategy invalidation system for a knowledge-intensive agent with the AI tutor.

Your Mission

You're building a compliance monitoring agent that watches regulatory document sources (SEC filings, GDPR guidance updates, OSHA standards) and answers questions about current rules. Documents can update at any time. Your cache currently uses TTL-only invalidation with a 24-hour window.

In this lab you'll work through:

Identifying the specific risks of your current 24-hour TTL-only strategy for regulatory content
Designing a version-hash invalidation layer that runs alongside TTL
Planning a tag-based invalidation structure so updating one regulatory document invalidates only the affected entries

Start by asking the tutor to help you articulate the specific failure scenario that could occur with the 24-hour TTL if an SEC rule is amended overnight.

🧪 Lab Assistant — Invalidation Strategy Advanced · Module 5

Building AI Agents V — Optimization · Module 5 · Lesson 4

L4: Semantic Caching

Advanced concepts, real-world applications, and practical implications

Core Concepts

This lesson explores l4: semantic caching — examining the key principles, real-world applications, and implications for practitioners working in this domain.

Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.

Practical Applications

The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.

Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.

Looking Forward

As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.

Lesson 4 Quiz

L4: Semantic Caching

What is the primary focus of L4: Semantic Caching?

✓ Correct — Correct. This lesson bridges theory and practice, focusing on real-world implementation.

Review the lesson — the focus is on connecting frameworks to practical reality.

Why does real-world deployment introduce challenges that pure theory doesn't capture?

✓ Correct — Correct. Real deployment requires judgment, not just framework application.

Practice doesn't invalidate theory — it reveals complexities that require nuanced application of theoretical principles.

What separates effective practitioners from those who merely follow checklists?

✓ Correct — Correct. Critical thinking and adaptability matter more than memorized procedures.

The key differentiator is critical thinking ability, not experience or resources alone.

🎯 Advanced · Lesson 4 Lab

Lab: Apply What You've Learned

Synthesize concepts from L4: Semantic Caching through guided AI conversation

Your Task

Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to l4: semantic caching.

Try: "How would the concepts from this lesson apply to a real-world scenario in this field?"

🤖 AESOP Lab Assistant Lesson 4 Lab

Module 5 Test

Caching for Agents · 15 Questions · 70% to Pass

Score: 0/15

1. What is the core objective of Caching for Agents?

2. How should practitioners approach applying concepts from this module?

3. Which best describes the relationship between theory and practice in Building AI Agents V — Optimization?

4. What distinguishes expert practitioners from novices in this field?

5. How does Caching for Agents build on previous modules?

6. What role do constraints play in practical implementation?

7. When applying frameworks from this module, what is most important?

8. How should practitioners handle conflicting perspectives in this field?

9. What makes the concepts in Caching for Agents relevant beyond their immediate context?

10. How should practitioners continue developing expertise after completing this module?

11. What is the relationship between understanding Building AI Agents V — Optimization concepts and making decisions?

12. How do the lessons from this module apply to novel situations?

13. What is the value of understanding multiple perspectives on {course_title}?

14. How should practitioners evaluate new information or developments in this field?

15. What is the ultimate goal of learning Caching for Agents?