In 2023, the team behind the customer-support agent at Intercom published a post-mortem on their first production chatbot. Users would open a new conversation and reference something they'd said "last time": a ticket number, a billing complaint, a specific feature request. The agent had no idea what they meant. Every session started from zero. Churn in that cohort was 34% higher than among users who reached human agents. The fix was not a smarter model. It was a memory layer that persisted user-level facts between sessions. After six weeks of work, that churn gap narrowed to 4%.
The problem was architectural, not algorithmic. The agent had only one kind of memory — a context window that evaporated at session end. What it needed was two kinds working together.
Every production AI agent that handles multi-turn interactions must manage two distinct memory regimes. In-memory context (also called working memory or session memory) holds the current conversation: messages, tool outputs, intermediate reasoning. It lives in RAM and in the model's context window. It is fast, directly accessible, and ephemeral. When the process ends or the session closes, it is gone.
Persistent memory is anything written to a durable store — a database, a vector index, a file — that survives process restarts and session boundaries. Persistent memory is slower to access, requires a retrieval step, and must be explicitly managed. But it is the only way an agent can accumulate knowledge over time.
The OpenClaw framework (the open-source agent orchestration layer used in several documented production deployments including the Dust.tt platform and early Fixie.ai prototypes) formalizes this split with two explicit memory objects: a SessionBuffer for in-memory state and a MemoryStore interface for persistence. Understanding why they are separate — not just what they do — is the core competency of this module.
In-memory context is scoped to a single agent run. Persistent memory is scoped to an entity — a user, a project, a domain — and outlasts any individual run. Conflating them is the single most common architecture mistake in first-generation agent deployments.
Context windows are not free. As of mid-2024, GPT-4o charges $5 per million input tokens. A naive implementation that stuffs all historical memory into every prompt to avoid building a retrieval layer will quickly hit two walls: cost and latency. The Anthropic Claude 3 models support 200K-token contexts, but sending 200K tokens on every agent turn costs roughly $0.60 per call at Claude 3 Sonnet's $3 per million input tokens (and $3.00 at Opus's $15 per million), which is catastrophic at scale.
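The cost argument can be made concrete with a few lines of arithmetic. The sketch below uses the $5 per million input-token rate quoted above; the 4,000-token retrieved prompt and the 100,000 calls per day are illustrative assumptions, not figures from any deployment.

```python
def prompt_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of a single model call."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


RATE = 5.00              # USD per million input tokens (the GPT-4o figure quoted in the text)
CALLS_PER_DAY = 100_000  # hypothetical traffic level

# Stuffing a full 200K-token history into every call.
full_history_call = prompt_cost_usd(200_000, RATE)

# Sending only a retrieved subset (hypothetical 4,000-token prompt).
retrieved_call = prompt_cost_usd(4_000, RATE)

for label, cost in [("full history", full_history_call), ("retrieved", retrieved_call)]:
    print(f"{label}: ${cost:.2f} per call, ${cost * CALLS_PER_DAY:,.0f} per day")
```

Under these assumed numbers the gap is a factor of 50 on every call, which is why the retrieval layer pays for itself quickly.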
The OpenClaw design answer is that in-memory context should contain only what the model needs right now to reason correctly. Persistent storage holds everything else, and a retrieval step fetches only the relevant subset when a new turn begins. This is not just an optimization — it is what allows agents to scale to users with years of interaction history.
The Dust.tt team documented in their 2024 engineering blog that switching from a full-history-in-context approach to a retrieved-summary approach reduced their average prompt size by 71% while increasing user-reported continuity satisfaction by 28%. The numbers confirm what the architecture already implies: retrieval is not a crutch, it is the design.
OpenClaw's MemoryStore.retrieve(query, topK) is called at the start of every agent turn, injecting the top-K most relevant memories as a compressed block at the head of the system prompt. The rest of the context window is reserved for the live conversation and tool outputs.
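The retrieve-then-inject step described above can be sketched as follows. This is a minimal illustration, not OpenClaw's implementation: ToyMemoryStore, Memory, and build_system_prompt are hypothetical names, and word overlap stands in for the embedding similarity a real vector store would use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Memory:
    key: str
    text: str


class ToyMemoryStore:
    """Hypothetical stand-in for a persistent store with a retrieve(query, top_k) method."""

    def __init__(self, memories: list[Memory]) -> None:
        self._memories = list(memories)

    def retrieve(self, query: str, top_k: int) -> list[Memory]:
        # Score by shared words; a production store would use vector similarity.
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(m.text.lower().split())), m)
            for m in self._memories
        ]
        relevant = [pair for pair in scored if pair[0] > 0]
        relevant.sort(key=lambda pair: pair[0], reverse=True)
        return [memory for _, memory in relevant[:top_k]]


def build_system_prompt(identity: str, memories: list[Memory]) -> str:
    """Place the retrieved memories as one block at the head of the system prompt."""
    if not memories:
        return identity
    block = "\n".join(f"- {m.text}" for m in memories)
    return f"Relevant memories:\n{block}\n\n{identity}"


store = ToyMemoryStore([
    Memory("t1", "User disputed a billing charge in March"),
    Memory("t2", "User prefers email over phone"),
    Memory("t3", "User asked for a refund on the annual plan"),
])
retrieved = store.retrieve("billing refund status", top_k=2)
prompt = build_system_prompt("You are a customer-support agent.", retrieved)
```

Only the two memories that share words with the query reach the prompt; the unrelated preference stays in the store.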
What does a MemoryStore.retrieve(query, topK) call inject into the agent's context?
You are reviewing the architecture of a new customer-success agent. The agent needs to handle: (a) the current conversation turn, (b) a user's subscription tier and past purchase history, (c) tool call results from a live inventory lookup, and (d) a summary of the user's last three support tickets. Describe what the retrieve() call should look like for this agent at the start of a new session.
In late 2023, the engineering team at Fixie.ai (a production agent platform that served thousands of developers) published a detailed breakdown of their session management challenges. Their agents ran on GPT-4 with an 8K-token context window. As conversations grew, especially in coding assistant workflows where tool outputs could run to hundreds of lines, the context would silently overflow. The model would start "forgetting" earlier parts of the conversation mid-session, producing contradictory responses. Users reported it as the model "going crazy." The fix was a structured buffer with explicit windowing and mid-session summarization, cutting mid-session contradiction rates by over 60%.
OpenClaw's SessionBuffer is a typed data structure — not a raw list of strings — that holds the live state of a single agent run. It contains four slots: a system block (the agent's identity and injected memories), a message list (user/assistant/tool turns in order), a tool result cache (raw outputs from tool calls, which can be large), and a token counter (a running tally of current usage against the configured budget).
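The four slots described above might be modeled roughly like this. It is a sketch under stated assumptions: SessionBufferSketch, Message, and estimate_tokens are hypothetical names, and the four-characters-per-token estimate is a crude placeholder for a real tokenizer.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


@dataclass
class SessionBufferSketch:
    system_block: str                                   # agent identity plus injected memories
    max_tokens: int                                     # configured budget
    messages: list[Message] = field(default_factory=list)
    tool_result_cache: dict[str, str] = field(default_factory=dict)
    token_count: int = 0                                # running tally against the budget

    def __post_init__(self) -> None:
        self.token_count = estimate_tokens(self.system_block)

    def add_message(self, role: str, content: str) -> None:
        self.messages.append(Message(role, content))
        self.token_count += estimate_tokens(content)

    def cache_tool_result(self, call_id: str, output: str) -> None:
        self.tool_result_cache[call_id] = output
        self.token_count += estimate_tokens(output)

    def utilization(self) -> float:
        return self.token_count / self.max_tokens


buffer = SessionBufferSketch(system_block="x" * 40, max_tokens=100)
buffer.add_message("user", "y" * 80)
buffer.cache_tool_result("call-1", "z" * 200)
```

Keeping the token counter inside the structure means every append updates the tally, so a windowing check never has to re-tokenize the whole conversation.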
The buffer enforces a configurable maxTokens ceiling. When the ceiling is approached — OpenClaw defaults to triggering at 80% utilization — the buffer executes one of two strategies: sliding window (drop the oldest N messages) or compress-and-summarize (call the LLM to produce a summary of the dropped messages and insert that summary as a single compressed block). The choice between them is a design decision with real tradeoffs.
Sliding window is fast and cheap but loses exact phrasing and detail. Compress-and-summarize preserves semantic content but costs an extra LLM call — roughly $0.01–0.03 per compression event. For agents in high-frequency workflows, that adds up. OpenClaw lets you configure the strategy per agent type.
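The two strategies can be compared side by side in a short sketch. All function names here are hypothetical, messages are plain strings for brevity, and the summarize argument is a stub standing in for the extra LLM call that compress-and-summarize would make.

```python
from __future__ import annotations

from typing import Callable


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def needs_trim(messages: list[str], max_tokens: int, trigger: float = 0.8) -> bool:
    """True once usage reaches the trigger fraction of the budget (80% by default)."""
    used = sum(estimate_tokens(m) for m in messages)
    return used >= trigger * max_tokens


def sliding_window(messages: list[str], drop_count: int) -> list[str]:
    """Drop the oldest messages outright: cheap, but their detail is lost."""
    return messages[drop_count:]


def compress_and_summarize(
    messages: list[str],
    drop_count: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace the oldest messages with one summary block produced by `summarize`."""
    dropped, kept = messages[:drop_count], messages[drop_count:]
    if not dropped:
        return kept
    return [f"[summary] {summarize(dropped)}"] + kept


history = ["a" * 40 for _ in range(10)]          # ten messages of ~10 tokens each
windowed = sliding_window(history, drop_count=3)
compressed = compress_and_summarize(
    history, drop_count=3, summarize=lambda old: f"{len(old)} earlier messages"
)
```

The sliding window returns seven messages; the compressed version returns eight, because the dropped three are replaced by a single summary entry.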
A critical but underappreciated aspect of buffer management is that different parts of the context have different "priority" for being retained. OpenClaw implements a priority scoring system: the system block is never evicted, tool results from the current turn are protected, and older tool results from prior turns are marked as eviction candidates first. User and assistant message pairs are ranked by recency, with the most recent always protected.
This matters because naive FIFO (first-in, first-out) eviction can remove the user's initial problem statement — the most important message in the conversation — long before it should be dropped. OpenClaw's priority eviction ensures the problem statement (the first user turn) is one of the last things removed, after all tool results and intermediate exchanges.
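One way to express the ordering described above is a sort key over eviction candidates. The sketch below is an assumption about how such a policy could look, not OpenClaw's code: Item, eviction_order, and evict_until_fits are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Item:
    kind: str    # "system", "user", "assistant", or "tool"
    turn: int
    tokens: int


def eviction_order(items: list[Item], current_turn: int) -> list[Item]:
    """Return evictable items, first-to-evict first."""
    user_turns = [i.turn for i in items if i.kind == "user"]
    first_user_turn = min(user_turns) if user_turns else None

    def is_protected(item: Item) -> bool:
        # The system block and everything from the current turn are never evicted.
        return item.kind == "system" or item.turn == current_turn

    def rank(item: Item) -> tuple[int, int]:
        if item.kind == "tool":
            return (0, item.turn)      # prior-turn tool results go first, oldest first
        if item.kind == "user" and item.turn == first_user_turn:
            return (2, item.turn)      # the problem statement goes last
        return (1, item.turn)          # intermediate exchanges, oldest first

    return sorted((i for i in items if not is_protected(i)), key=rank)


def evict_until_fits(items: list[Item], current_turn: int, budget: int) -> list[Item]:
    kept = list(items)
    for candidate in eviction_order(items, current_turn):
        if sum(i.tokens for i in kept) <= budget:
            break
        kept = [i for i in kept if i is not candidate]
    return kept


session = [
    Item("system", 0, 100),
    Item("user", 1, 50), Item("tool", 1, 400), Item("assistant", 1, 50),
    Item("user", 2, 30), Item("tool", 2, 300), Item("assistant", 2, 40),
    Item("user", 3, 20), Item("tool", 3, 200),
]
order = eviction_order(session, current_turn=3)
kept = evict_until_fits(session, current_turn=3, budget=500)
```

With a 500-token budget, dropping the two prior-turn tool results is enough; the first user turn and every message exchange survive.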
The Fixie.ai team's 2023 implementation closely mirrored this pattern. Their internal tooling showed that prior-turn tool results — source code outputs, API responses — constituted 62% of their context window usage but less than 15% of the tokens that actually influenced the model's final responses. Evicting them first was a straightforward win.
OpenClaw exposes SessionBuffer.setEvictionPolicy(policy) where policy can be 'fifo', 'priority', or a custom comparator function. Production deployments almost always use 'priority' or a custom policy. Default is 'priority'.
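A policy that accepts either a named strategy or a comparator could be resolved along these lines. This is a hypothetical Python analogue, not the OpenClaw API: eviction_candidates, fifo, and largest_first are invented for illustration, and the built-in 'priority' policy is omitted for brevity.

```python
from __future__ import annotations

from functools import cmp_to_key
from typing import Callable, Union

Comparator = Callable[[dict, dict], int]


def fifo(a: dict, b: dict) -> int:
    """Oldest turn is evicted first."""
    return a["turn"] - b["turn"]


def largest_first(a: dict, b: dict) -> int:
    """Custom comparator: evict the biggest tool results first."""
    return b["tokens"] - a["tokens"]


BUILT_IN_POLICIES: dict[str, Comparator] = {"fifo": fifo}


def eviction_candidates(items: list[dict], policy: Union[str, Comparator]) -> list[dict]:
    """Order items first-to-evict first, using a named policy or a comparator."""
    comparator = BUILT_IN_POLICIES[policy] if isinstance(policy, str) else policy
    return sorted(items, key=cmp_to_key(comparator))


tool_results = [
    {"id": "case_law", "turn": 1, "tokens": 4000},
    {"id": "statute", "turn": 2, "tokens": 800},
    {"id": "citation", "turn": 3, "tokens": 200},
    {"id": "case_law_2", "turn": 4, "tokens": 5000},
]
by_age = [r["id"] for r in eviction_candidates(tool_results, "fifo")]
by_size = [r["id"] for r in eviction_candidates(tool_results, largest_first)]
```

A size-based comparator suits agents whose tool outputs vary widely in length, since evicting one large result can free more budget than dropping several small ones.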
You are building a legal research agent that runs long multi-turn sessions. The agent calls three tools: a case law database (returns 2,000–5,000 token results), a statute lookup (returns 500–1,000 tokens), and a citation checker (returns 100–300 tokens). Sessions can span 20–40 turns.
When Replit launched their AI coding assistant "Ghostwriter" in 2022, an early version stored user preferences and project context in a standard PostgreSQL table with a single JSONB column per user. This worked for simple key-value facts ("preferred language: Python") but completely failed at semantic lookup — there was no way to retrieve "what did this user say about authentication handling six months ago" without scanning every row. In mid-2023, Replit migrated to a hybrid architecture: structured facts remained in Postgres (fast exact lookup), while semantic memory — past explanations, design decisions, debugging patterns — moved to a pgvector extension with embeddings. Query latency for semantic lookup dropped from 800ms (full table scan) to 12ms (ANN index).
OpenClaw's MemoryStore is an interface, not an implementation. It defines four methods: write(key, value, metadata), retrieve(query, topK), delete(key), and list(filter). Any compliant backend can be plugged in. OpenClaw ships with three official adapters: an in-process JSON store (for development), a Redis adapter (for fast key-value production use), and a pgvector adapter (for semantic retrieval in production).
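The four-method contract can be sketched as a structural interface with a throwaway in-process backend. The Python types, the snake_case top_k parameter, and the InProcessStore class are assumptions made for illustration; substring matching stands in for real semantic retrieval.

```python
from __future__ import annotations

from typing import Any, Dict, List, Protocol


class MemoryStore(Protocol):
    """Hypothetical Python rendering of the four-method interface."""

    def write(self, key: str, value: str, metadata: Dict[str, Any]) -> None: ...
    def retrieve(self, query: str, top_k: int) -> List[str]: ...
    def delete(self, key: str) -> None: ...
    def list(self, filter: Dict[str, Any]) -> List[str]: ...


class InProcessStore:
    """Development-only backend: everything lives in a dict and dies with the process."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def write(self, key: str, value: str, metadata: Dict[str, Any]) -> None:
        self._records[key] = {"value": value, "metadata": dict(metadata)}

    def retrieve(self, query: str, top_k: int) -> List[str]:
        # Substring match on any query word; a vector backend would rank by similarity.
        words = query.lower().split()
        hits = [
            record["value"]
            for record in self._records.values()
            if any(word in record["value"].lower() for word in words)
        ]
        return hits[:top_k]

    def delete(self, key: str) -> None:
        self._records.pop(key, None)

    def list(self, filter: Dict[str, Any]) -> List[str]:
        # Return keys whose metadata matches every field in the filter.
        return [
            key
            for key, record in self._records.items()
            if all(record["metadata"].get(name) == value for name, value in filter.items())
        ]


store: MemoryStore = InProcessStore()
store.write("user:1:pref", "Prefers Python for examples", {"user": "1", "type": "preference"})
store.write("user:1:goal", "Wants to migrate auth to OAuth", {"user": "1", "type": "goal"})
store.write("user:2:pref", "Prefers Go", {"user": "2", "type": "preference"})
```

Because the agent code depends only on the interface, swapping this backend for a Redis or pgvector adapter should not require changes outside the store's construction.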
Choosing the right backend requires understanding your retrieval pattern. If you always look up memory by an exact key — user ID, session ID, entity name — a key-value store like Redis is optimal. If you need to retrieve memories that are semantically related to a current query — "what does this user know about tax law?" — you need a vector store. Most production agents need both, which is why Replit's hybrid approach is the industry norm rather than the exception.
Key-value (Redis): exact user facts, session flags, counters, preferences.
Vector (pgvector, Pinecone, Weaviate): episodic memories, past reasoning, domain knowledge snippets.
Relational (Postgres): structured records with filtering — purchase history, ticket logs, user accounts.
Hybrid: any production agent that needs all three query patterns simultaneously.
Knowing what to write to persistent memory is as important as knowing how to retrieve it. OpenClaw formalizes three write strategies.
Write-on-close: at session end, a summarization step extracts key facts and decisions from the session buffer and writes them to the store. This is the most common pattern: cheap, simple, and sufficient for most use cases.
Write-on-event: specific trigger conditions (user confirms a fact, agent makes a commitment, a milestone is reached) cause an immediate write. This is used when session loss (crash, timeout) would be costly.
Write-on-every-turn: the current buffer state is checkpointed after each turn. This is the most expensive but enables full session recovery.
The Dust.tt engineering team documented in 2024 that they use write-on-event as their primary strategy, with write-on-close as a fallback. Their events are: user provides a new preference, agent produces a plan with numbered steps, agent calls an external action (email, calendar event). This gives them crash resilience for the things that matter without the cost of full turn-by-turn checkpointing.
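Combining write-on-event with a write-on-close fallback might look like the sketch below. SessionWriter and the event names are hypothetical (the names are loosely modeled on the three triggers listed above), a plain dict stands in for the persistent store, and summarize is a stub for the summarization step.

```python
from __future__ import annotations

from typing import Callable, Dict, List, Set


class SessionWriter:
    """Writes immediately on trigger events and summarizes the rest at session close."""

    def __init__(
        self,
        store: Dict[str, str],
        trigger_events: Set[str],
        summarize: Callable[[List[str]], str],
    ) -> None:
        self.store = store
        self.trigger_events = trigger_events
        self.summarize = summarize
        self.turns: List[str] = []
        self._event_count = 0

    def record_turn(self, text: str) -> None:
        self.turns.append(text)

    def on_event(self, event_type: str, fact: str) -> bool:
        """Write-on-event: persist right away so a crash cannot lose the fact."""
        if event_type not in self.trigger_events:
            return False
        self._event_count += 1
        self.store[f"event:{self._event_count}:{event_type}"] = fact
        return True

    def on_close(self) -> None:
        """Write-on-close: one summary of the whole session, written at the end."""
        if self.turns:
            self.store["session_summary"] = self.summarize(self.turns)


store: Dict[str, str] = {}
writer = SessionWriter(
    store,
    trigger_events={"new_preference", "plan_produced", "external_action"},
    summarize=lambda turns: f"{len(turns)} turns",
)
writer.record_turn("I'd like weekly digests instead of daily ones.")
wrote_preference = writer.on_event("new_preference", "User prefers weekly digests")
wrote_small_talk = writer.on_event("small_talk", "User said hello")
writer.record_turn("Done, I've switched you to weekly digests.")
writer.on_close()
```

If the process crashed before on_close ran, the preference would already be in the store; only the session summary would be lost.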
Write-on-every-turn creates a new failure mode: partial writes. If a write succeeds but the turn fails, you have a memory store that is ahead of the actual conversation state. OpenClaw's Redis adapter uses optimistic locking to handle this, but it requires deliberate configuration — it is not on by default.
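Optimistic locking in general works by attaching a version to each record and rejecting any write based on a stale version. The sketch below shows that generic pattern in memory; it is not the OpenClaw Redis adapter's implementation, and VersionedStore and VersionConflict are hypothetical names. In Redis itself, the same idea is commonly built on WATCH with MULTI/EXEC.

```python
from __future__ import annotations

from typing import Dict, Optional, Tuple


class VersionConflict(Exception):
    """Raised when a checkpoint was written against a stale version."""


class VersionedStore:
    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, str]] = {}   # key -> (version, value)

    def read(self, key: str) -> Tuple[int, Optional[str]]:
        return self._data.get(key, (0, None))

    def write_if_unchanged(self, key: str, expected_version: int, value: str) -> int:
        """Commit only if nobody else has written since `expected_version` was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (new_version, value)
        return new_version


checkpoints = VersionedStore()
version, _ = checkpoints.read("session:42")
new_version = checkpoints.write_if_unchanged("session:42", version, "state after turn 1")

try:
    # A second writer still holding the old version is rejected instead of overwriting.
    checkpoints.write_if_unchanged("session:42", version, "stale state")
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

The caller that hits a conflict can re-read the record and decide whether to retry or discard its write, rather than silently leaving the store out of step with the conversation.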
You are architecting the memory system for a financial planning agent. It needs to store: (a) a user's exact account balances updated daily, (b) the user's past stated goals ("I want to retire at 60"), (c) past advisory conversations going back two years, and (d) a lookup of the user's risk tolerance category (conservative/moderate/aggressive).