Designing Memory for Agents That Outlast a Single Context Window
Source: simonwillison
Simon Willison’s guide to agentic engineering patterns is a solid survey of what it actually takes to build systems where LLMs take sequences of actions. It covers tool design, prompt injection, and the engineering discipline required to make agentic loops reliable. One area worth exploring further is how these systems manage memory, specifically how they maintain relevant state across tool calls, across long tasks, and across sessions. Getting this right matters more than any individual framework choice, and the failure modes are distinct enough to be worth examining on their own terms.
Four Kinds of Memory
Cognitive scientists have long distinguished several types of human memory, and their categories map usefully onto what agentic systems need to manage.
Working memory corresponds to the context window. Everything in context is immediately available to the model for reasoning; everything outside it is inaccessible unless explicitly retrieved. Context windows have expanded substantially, with Claude’s current limit at 200,000 tokens and Gemini 1.5 Pro supporting up to one million tokens in certain configurations, but those limits are reached faster than intuition suggests once tool results, retrieved documents, and accumulated reasoning traces start stacking up across a multi-step task.
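One practical consequence is that the working-memory budget has to be managed explicitly. A minimal sketch, assuming messages are dicts with `role` and `content` keys: when the history exceeds a token budget, drop the oldest tool results first, keeping the system prompt and user turns. The `len(text) // 4` token estimate is a rough stand-in for a real tokenizer.

```python
def approx_tokens(text):
    # Crude approximation; a real implementation would use the model's tokenizer.
    return len(text) // 4

def trim_to_budget(messages, budget):
    """Drop the oldest 'tool' messages until the history fits the budget."""
    total = sum(approx_tokens(m["content"]) for m in messages)
    trimmed = list(messages)
    i = 0
    while total > budget and i < len(trimmed):
        if trimmed[i]["role"] == "tool":
            total -= approx_tokens(trimmed[i]["content"])
            trimmed.pop(i)
        else:
            i += 1
    return trimmed
```

Dropping tool results first reflects the observation above: they are usually the bulk of the accumulation, and they are the easiest to re-retrieve if needed.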
Episodic memory corresponds to external storage the agent can query: a vector database of past interactions, a structured log of previous runs, a record of what has been observed and done. The model does not hold this in context directly; it retrieves specific pieces on demand through a lookup or search tool, much like querying an external service.
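The lookup-tool shape can be sketched in a few lines. This is an illustrative in-memory version, not any particular library's API: a real system would back the log with a database or vector index, and the substring search here stands in for proper retrieval.

```python
class EpisodicLog:
    """Append-only log of observations the agent can search on demand."""

    def __init__(self):
        self.entries = []  # list of (step, text) records

    def record(self, step, text):
        self.entries.append((step, text))

    def search(self, query, limit=3):
        # Return the most recent entries mentioning the query term.
        hits = [(s, t) for s, t in self.entries if query.lower() in t.lower()]
        return hits[-limit:]
```

The key property is the asymmetry: the agent writes everything, but reads back only what a specific query pulls in, keeping the context window free of the full history.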
Semantic memory is the general world knowledge baked into model weights during training. Engineers cannot directly modify it within a session, though retrieval-augmented approaches can supplement it at query time.
Procedural memory is the tool set: what the agent is equipped to do. It is defined structurally before the session starts, rather than accumulated during it.
The engineering complexity concentrates on episodic memory and the interaction between episodic and working memory, because that is where design choices produce the most consequential differences in production behavior.
The Retrieval Problem
Retrieval-augmented generation is the standard approach to episodic memory in agentic systems. You store documents or past observations as vector embeddings and query the embedding space at each reasoning step to find the most relevant chunks to pull into context. Libraries like Chroma and Qdrant handle the infrastructure cleanly, and the approach works well for lookup-heavy tasks: a coding agent retrieving relevant API documentation, a support agent retrieving past ticket resolutions.
The limitation is that semantic similarity and task relevance diverge often enough to cause real problems. The chunks most similar to the query in embedding space may not be the ones the model actually needs. A document about “rate limits” might rank below one about “API performance” even when rate limiting is the precise constraint at issue, because the embedding distance between those concepts is smaller than it ought to be.
More structured retrieval approaches address this by building explicit indexes rather than dense vector spaces. Microsoft’s GraphRAG extracts entities and relationships from a corpus and stores them in a knowledge graph; queries traverse the graph structure rather than ranking embeddings. The precision gains are real for structured knowledge domains. The setup overhead is correspondingly larger, and the approach works best when the corpus has clear entity boundaries and relationships worth modeling explicitly.
For agents with more open-ended retrieval needs, the practical answer is usually a hybrid: dense retrieval for broad coverage, filtered by metadata like document date or source type, combined with keyword or structured lookup for high-precision cases. Neither alone is sufficient for tasks where the model needs to draw on a mix of recent history and reference material.
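The hybrid pattern can be sketched without any retrieval library. In this illustration, each document carries a precomputed `dense_score` (standing in for embedding similarity from a real vector store), which is blended with a keyword-overlap score after a metadata filter on `source`; the equal 0.5/0.5 weighting and the field names are illustrative choices, not a recommendation.

```python
def keyword_overlap(query, text):
    # Fraction of query terms that appear in the document text.
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def hybrid_search(query, docs, source_filter=None, k=2):
    """docs: list of {'text': ..., 'source': ..., 'dense_score': ...}."""
    scored = []
    for d in docs:
        if source_filter and d["source"] != source_filter:
            continue  # metadata filter runs before any scoring
        score = 0.5 * d["dense_score"] + 0.5 * keyword_overlap(query, d["text"])
        scored.append((score, d["text"]))
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]
```

This is exactly the rate-limits case from above: a document with the precise keyword can outrank one that merely sits closer in embedding space.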
State Across Sessions
Single-session agents sidestep the cross-session problem because everything lives in the context window and terminates cleanly when the session ends. Agents that need to resume interrupted work, track state over days, or coordinate across subagents face a harder question: what to persist, in what form, and how to reload it without consuming the entire context budget before any useful work happens.
The naive approach is to serialize the full conversation history and reload it next session. This fails at scale because history grows linearly with sessions and the model eventually cannot fit it into context. A more sustainable pattern is hierarchical summarization: at the end of each session, ask the model to summarize what happened, what was decided, and what remains to be done. Store those summaries rather than the raw transcript. Load the recent summaries rather than the full history on resume.
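The pattern above can be sketched as follows. `summarize_with_model` is a hypothetical placeholder for a real model call; the point is the structure, where only summaries are stored and only the most recent ones are reloaded.

```python
def summarize_with_model(transcript):
    # Placeholder: a real implementation would prompt the model to report
    # what happened, what was decided, and what remains to be done.
    return f"Session summary ({len(transcript)} messages)"

class SessionMemory:
    def __init__(self, keep_recent=3):
        self.summaries = []
        self.keep_recent = keep_recent

    def end_session(self, transcript):
        # Store the compressed summary, never the raw transcript.
        self.summaries.append(summarize_with_model(transcript))

    def resume_context(self):
        """Load only the most recent summaries, not the full history."""
        return self.summaries[-self.keep_recent:]
```

Because each session contributes a bounded summary and resume loads a bounded window, context cost stays flat no matter how many sessions accumulate.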
Libraries like Mem0 automate a version of this by classifying observations into facts, preferences, and events and storing them in structured form. Whether the overhead is worth it depends on task duration and how much cross-session context matters for the specific application. For agents tracking long-running projects or user preferences across many interactions, structured external memory tends to outperform raw summarization because retrieval is more precise: you can ask “what did the user say about their deployment environment” and get a specific fact back rather than a compressed summary that may or may not mention it.
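A dependency-free sketch in the spirit of that structured approach (not Mem0's actual API): observations are filed under a category and retrieved by keyword, so a question about the deployment environment returns the stored fact directly. The category names and matching logic here are illustrative.

```python
class StructuredMemory:
    def __init__(self):
        self.items = {"fact": [], "preference": [], "event": []}

    def add(self, category, text):
        self.items[category].append(text)

    def query(self, keyword, category=None):
        # Search one category, or all of them when none is given.
        categories = [category] if category else self.items
        return [t for c in categories for t in self.items[c]
                if keyword.lower() in t.lower()]
```

The precision benefit described above falls out of the structure: a query hits the individual stored fact rather than a summary that may have compressed it away.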
Explicit State Representation
A pattern that complements all of the above is making the agent’s working state explicit rather than leaving it implicit in conversation history. In most agent implementations, state is whatever the conversation history happens to contain at a given moment. The result is state that is opaque, hard to inspect, and difficult to hand off to a subagent or resumed session.
An alternative is to define a state schema and treat it as a first-class artifact that the agent reads and updates on each turn:
from pydantic import BaseModel
from typing import List, Optional

class AgentState(BaseModel):
    task_description: str
    completed_steps: List[str]
    pending_steps: List[str]
    retrieved_documents: List[str]
    notes: Optional[str] = None

state = AgentState(
    task_description="Summarize Q4 financials",
    completed_steps=["Retrieved Q4 report"],
    pending_steps=["Extract key metrics", "Write summary"],
    retrieved_documents=["q4_report.pdf"],
)

# `messages` is the running conversation history passed to the model.
messages.append({
    "role": "user",
    "content": f"Current state: {state.model_dump_json()}\n\nContinue with the next step."
})
The model receives the current state, executes the next step, and returns an updated state as part of its output. You persist that state before the next turn. When the agent stalls or produces unexpected output, you can read the state object and know exactly where it is and what it was trying to accomplish. Handoffs are clean: a subagent or a resumed session receives a structured summary of what is known and what remains.
The trade-off is that you are asking the model to maintain a structured object through every turn, which increases prompt complexity and occasionally produces malformed state updates. Validating the returned state with the Pydantic model and falling back on the previous state if validation fails keeps that failure mode contained.
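That containment can be sketched with the standard library alone; with Pydantic you would call `AgentState.model_validate_json` inside the try block instead, which is the stricter option. The required-keys set mirrors the schema above.

```python
import json

REQUIRED_KEYS = {"task_description", "completed_steps",
                 "pending_steps", "retrieved_documents"}

def apply_state_update(previous_state, model_output):
    """Parse the model's state update; keep the previous state on failure."""
    try:
        candidate = json.loads(model_output)
        if not isinstance(candidate, dict) or not REQUIRED_KEYS.issubset(candidate):
            raise ValueError("missing required state fields")
        return candidate
    except (json.JSONDecodeError, ValueError):
        # Malformed or incomplete update: the agent continues from the
        # last known-good state rather than crashing or corrupting it.
        return previous_state
```

Logging the rejected output alongside the fallback is worth doing in practice, since repeated validation failures usually indicate a prompt problem rather than transient noise.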
What Memory Architecture Reveals
Memory architecture is where agentic systems become hard to design well, as distinct from hard to get running. A demo agent with five tool calls and a single well-bounded session can ignore memory architecture entirely and still look impressive. A production agent that handles complex tasks across multiple sessions, or coordinates across several subagents working in parallel, will produce failures that trace back to memory decisions rather than model capability.
The four-type framework provides a useful vocabulary for diagnosing those failures. An agent that keeps repeating work it already completed has an episodic memory problem. An agent that loses consistency across sessions has a persistence problem. An agent that drifts from its original goal during a long task has a working memory management problem, likely because intermediate tool results have pushed the original task context out of the active window. Each diagnosis points to a different class of solution.
Willison’s framing of agentic engineering as a distinct discipline is useful precisely because the failure modes do not map onto traditional software failures. Memory architecture is one of the clearest places where that distinction is concrete rather than rhetorical.