Memory Architecture for AI Agents: Four Problems That Context Windows Don't Solve

The stateless problem

LLMs don’t remember anything between invocations. Each call is a fresh start: you provide a sequence of tokens, the model produces output, and the weights remain unchanged. For a simple chatbot, this is manageable. For agents that track ongoing tasks, adapt to user preferences, or maintain coherent behavior over weeks and months, it is the central engineering challenge.

An article at tombedor.dev frames agent memory as a taxonomy problem, which is a productive starting point. Different things an agent needs to remember require different storage strategies. The taxonomy comes from cognitive psychology, and it maps onto implementation choices in ways worth understanding carefully before reaching for the nearest vector database.

The cognitive science taxonomy

Cognitive psychologists divide long-term memory into three broad types. Semantic memory holds general knowledge: facts, concepts, and world knowledge. Episodic memory holds personal experience: specific events, conversations, and the sequence of things that happened. Procedural memory holds skills and habits: how to perform tasks, which strategies work in which contexts.

Working memory is separate from long-term memory. It is the small, fast buffer where active reasoning happens. The LLM context window is a reasonable analogy: limited in size, fast to access, and cleared when the session ends.

This taxonomy is imperfect when applied to agents. A user’s preference for concise responses is simultaneously semantic (a fact about the user), episodic (derived from past interactions), and procedural (it should govern future behavior). The categories blur constantly, and the blurring is precisely where interesting engineering problems live.

In-context memory: simple and surprisingly viable

The simplest memory approach is to put everything in the context window. Conversation history enters the prompt as a sequence of user and assistant turns. No infrastructure required, no retrieval pipeline, no embeddings. The model attends to anything in its context directly.

Context windows have grown substantially. Claude 3.5 Sonnet supports 200,000 tokens; Gemini 1.5 Pro reaches 1 million. For a Discord bot handling short conversations, keeping a rolling window of recent messages in the prompt is often sufficient and has no moving parts to fail.
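
A rolling window of this kind can be sketched in a few lines. This is a minimal illustration rather than a production implementation: the `RollingWindow` class is hypothetical, and the four-characters-per-token estimate is an assumption standing in for a real tokenizer.

```python
from collections import deque


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


class RollingWindow:
    """Keeps the most recent messages that fit inside a token budget."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.messages = deque()
        self.total = 0

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        self.total += estimate_tokens(content)
        # Drop the oldest messages until the window fits the budget again,
        # always keeping at least the newest message.
        while self.total > self.max_tokens and len(self.messages) > 1:
            dropped = self.messages.popleft()
            self.total -= estimate_tokens(dropped["content"])

    def as_prompt(self) -> list:
        """Return the surviving messages, oldest first."""
        return list(self.messages)
```

Calling `add` after every turn and passing `as_prompt()` to the model keeps the prompt bounded. Anything that falls off the front of the window is simply gone, which is the trade-off this approach accepts.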

The problems are cost, latency, and attention quality. Every token in the context costs money and processing time. More critically, a 2023 paper by Liu et al., “Lost in the Middle”, showed that language models use information in the middle of a long context less reliably than information near the beginning or end. A 200,000-token context is not uniformly useful; material buried deep in a long history is often effectively invisible to the model.

External memory: retrieval-augmented generation and its limits

The standard solution to context limits is retrieval-augmented generation. Information gets stored outside the context window, retrieved at query time based on relevance, and injected into the prompt. Vector embeddings and approximate nearest-neighbor (ANN) search are the dominant mechanism.

The pipeline is well understood. Text gets embedded into high-dimensional vectors using models like OpenAI’s text-embedding-3-small (1,536 dimensions) or Voyage AI’s voyage-large-2 (1,536 dimensions). These vectors are stored in a database that supports ANN search: Chroma for local development, pgvector for Postgres deployments, Weaviate or Pinecone for managed scale. At query time, the user’s message gets embedded, and the top-k nearest stored chunks are retrieved and prepended to the prompt.
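
The shape of that pipeline can be shown without any external service. The sketch below makes two loud assumptions: `toy_embed` is a word-count stand-in for a real embedding model, and the search is brute force rather than approximate. `VectorStore` and `build_prompt` are hypothetical names, not any library's API.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        self.items.append((text, toy_embed(text)))

    def top_k(self, query, k=3):
        # Brute-force scan; a real store would use an ANN index here.
        q = toy_embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


def build_prompt(store, user_message, k=2):
    """Prepend the k most similar stored chunks to the user's message."""
    context = "\n".join(f"- {chunk}" for chunk in store.top_k(user_message, k))
    return f"Relevant notes:\n{context}\n\nUser: {user_message}"
```

Swapping `toy_embed` for a real model and the list scan for an ANN index changes the quality and the scale, but not the structure: embed on write, embed on read, rank, inject.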

This approach works well for semantic memory (retrieving factual knowledge or documentation) but poorly for episodic memory, which is where most conversational agent use cases land. Semantic similarity and temporal or causal relevance are different axes. “What did the user ask me to do last week?” is an episodic query: the relevant stored memory might share no semantic overlap with the query text at all. Time-based and causal queries do not map cleanly onto cosine similarity.
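
One way to see the gap is to answer the "last week" question with a timestamp filter instead of a similarity score. The sketch below is an illustration under assumptions: the `Episode` record and both helper functions are hypothetical, and "last week" is taken to mean the seven days ending now.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Episode:
    timestamp: datetime
    text: str


def episodes_between(episodes, start, end):
    """Return episodes with start <= timestamp < end, oldest first."""
    hits = [e for e in episodes if start <= e.timestamp < end]
    return sorted(hits, key=lambda e: e.timestamp)


def last_week(now):
    """Assumed meaning of 'last week': the 7 days ending at `now`."""
    return now - timedelta(days=7), now
```

The stored text never has to resemble the query. The retrieval key is time, which an embedding of the message body does not carry.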

Zep and Mem0 have both moved to address this by storing memories in hybrid form: as vector embeddings for semantic retrieval and as structured entity graphs for relationship and attribute queries. This allows retrieval that combines embedding similarity with graph traversal, handling episodic queries better than embeddings alone. Neither system fully solves the problem, but both are meaningfully better than pure vector retrieval for conversational agents.
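
The general idea of hybrid retrieval can be sketched as follows. This is not the API of Zep or Mem0: `EntityGraph` and `hybrid_retrieve` are hypothetical, and the similarity scores are assumed to come from a separate embedding search.

```python
from collections import defaultdict


class EntityGraph:
    """Tiny store of (subject, relation, object) facts."""

    def __init__(self):
        self.edges = defaultdict(list)  # subject -> [(relation, object)]

    def add(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def neighbors(self, subject, relation=None):
        """Objects linked to `subject`, optionally filtered by relation."""
        return [o for r, o in self.edges.get(subject, []) if relation in (None, r)]


def hybrid_retrieve(query_entities, graph, scored_chunks, k=3):
    """Merge graph facts about the query's entities with top-k similarity hits.

    `scored_chunks` is a list of (similarity, text) pairs from a vector search.
    """
    facts = [
        f"{subject} {relation} {obj}"
        for subject in query_entities
        for relation, obj in graph.edges.get(subject, [])
    ]
    top = [text for _, text in sorted(scored_chunks, reverse=True)[:k]]
    # Graph facts first, then similarity hits, dropping duplicates.
    seen, merged = set(), []
    for item in facts + top:
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged
```

Placing graph facts ahead of similarity hits is a design choice made here for simplicity; a real system would more likely score and interleave the two sources.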

MemGPT: explicit memory management

The most architecturally ambitious approach comes from MemGPT (now marketed as Letta), described in a 2023 paper from Berkeley. The authors drew an analogy to OS memory management: an LLM agent could manage the boundary between its context window (fast, limited) and external storage (slow, unlimited) in the same way an operating system manages the boundary between RAM and disk.

In the MemGPT architecture, the agent has explicit programmatic control over its own context. It uses tool calls to write facts to external storage, retrieve memories back into context on demand, and compress old context to make room for new information. A small “core memory” block in the system prompt holds critical persistent facts. An “archival storage” layer holds everything else, searched via embeddings when the agent decides to retrieve something.
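
A toy version of this layout makes the division of labor concrete. The sketch below is not Letta's SDK: the `AgentMemory` class and its method names are hypothetical, word overlap stands in for embedding search, and overflowing messages are moved to archival storage verbatim where a fuller implementation would summarize them.

```python
class AgentMemory:
    def __init__(self, max_context_messages=4):
        self.core = {}        # small block rendered into the system prompt
        self.context = []     # recent messages (the "RAM")
        self.archival = []    # unbounded external storage (the "disk")
        self.max_context_messages = max_context_messages

    # Tools the model could invoke through function calling.
    def core_memory_write(self, key, value):
        self.core[key] = value

    def archival_insert(self, text):
        self.archival.append(text)

    def archival_search(self, query, k=3):
        # Stand-in for embedding search: rank by number of shared words.
        q = set(query.lower().split())
        scored = [(len(q & set(t.lower().split())), t) for t in self.archival]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [t for score, t in ranked if score > 0][:k]

    # Context management.
    def add_message(self, message):
        self.context.append(message)
        # When the context is full, evict the oldest messages to archival.
        while len(self.context) > self.max_context_messages:
            self.archival_insert(self.context.pop(0))

    def system_prompt(self):
        facts = "\n".join(f"{k}: {v}" for k, v in self.core.items())
        return f"Core memory:\n{facts}"
```

In an agent loop, the three tool methods would be exposed to the model as callable functions, so the decision about when to write or search sits with the model rather than with a fixed pipeline.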

This makes memory management a first-class concern for the agent rather than a background infrastructure concern. The upside is that the agent can reason about what it knows and doesn’t know, and retrieve accordingly. The downside is that this reasoning is itself fallible; an agent that misjudges what to retrieve will perform worse than one with a simpler automatic pipeline. You are trading infrastructure complexity for model-level complexity, and it is not obvious which is more reliable.

Letta has refined this into a practical SDK with configurable memory blocks, server-side persistence, and tool-based memory editing. For agents that need long-term persistent identity across many sessions, this architecture is the most principled currently available.

Procedural memory: the underappreciated dimension

Procedural memory receives less attention than episodic and semantic in most discussions of agent architecture, but it matters more for behavioral consistency than either. For an LLM agent, procedural memory translates to: behavioral constraints in the system prompt, few-shot examples demonstrating correct output format, tool schemas specifying external capabilities, and corrections extracted from past user feedback.

The last category is where most systems fall short. When a user corrects an agent’s behavior repeatedly, that correction should persist across sessions. If someone tells a bot to stop using bullet points, or to always include source citations, or to check a database before answering pricing questions, those preferences should be stored as durable rules and injected into future system prompts.
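
Storage and injection are the easy half of this, and can be sketched directly. The `Rule` and `RuleStore` types and the `scope` tag below are hypothetical; the sketch assumes the rule text has already been extracted from the user's feedback.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    text: str
    scope: str = "global"  # hypothetical tag limiting where the rule applies


@dataclass
class RuleStore:
    rules: list = field(default_factory=list)

    def add(self, text, scope="global"):
        # Avoid storing the same correction twice.
        if not any(r.text == text and r.scope == scope for r in self.rules):
            self.rules.append(Rule(text, scope))

    def for_scope(self, scope):
        """Rules that apply everywhere plus rules specific to this scope."""
        return [r for r in self.rules if r.scope in ("global", scope)]


def build_system_prompt(base, store, scope):
    """Append the applicable standing rules to a base system prompt."""
    lines = [f"- {r.text}" for r in store.for_scope(scope)]
    if not lines:
        return base
    return base + "\n\nStanding user rules:\n" + "\n".join(lines)
```

Scoping rules is one simple guard against a correction leaking into unrelated contexts: a rule tagged "pricing" is injected only when the agent is handling a pricing question.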

This is harder than it sounds. Corrections are usually implicit rather than explicit; the user says “that’s not quite right, I meant…” rather than “here is a rule I want you to follow.” Extracting a generalizable rule from an implicit correction, storing it in a form that can be consistently applied, and preventing it from causing unintended behavior in unrelated contexts is a non-trivial extraction and classification task. Most current frameworks handle procedural memory poorly or not at all, which is why agents that learn from feedback over time remain rare in production.

The forgetting problem

One insight the cognitive science framing makes concrete is that forgetting is functional, not a failure. Human memory selectively retains information based on recency, access frequency, emotional salience, and interference from similar memories. Selective forgetting prevents older, outdated information from overwhelming current context.

Most agent memory systems do not forget. Every conversation gets stored, every fact gets embedded, and the retrieval corpus grows indefinitely. This creates consistency problems over time: a user’s preferences from a year ago may conflict with current preferences; facts that were true six months ago may now be false; old interactions create noise in retrieval results for current queries.

Memory consolidation (reviewing and compressing stored memories to resolve conflicts and remove outdated information) is largely unsolved in current agent frameworks. MemGPT performs some conversation history compression, but it lacks principled forgetting. Mem0 includes a memory update mechanism that attempts to reconcile conflicting facts, but it is heuristic-based and brittle at scale.
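
The crudest form of consolidation is last-write-wins: for each attribute of each subject, keep only the most recent observation. The sketch below shows that policy and nothing more. It is not how MemGPT or Mem0 work internally, and the `Fact` record is a hypothetical structure that assumes facts have already been extracted into subject, attribute, and value.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    observed_at: datetime


def consolidate(facts):
    """Keep only the most recently observed value per (subject, attribute)."""
    latest = {}
    for fact in facts:
        key = (fact.subject, fact.attribute)
        if key not in latest or fact.observed_at > latest[key].observed_at:
            latest[key] = fact
    return sorted(latest.values(), key=lambda f: f.observed_at)
```

The policy is brittle in exactly the way the paragraph above describes. It assumes conflicting facts share a key, and real conversational memories rarely arrive that tidy.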

The practical engineering solution most teams reach for is TTL-based expiration: memories expire after a fixed period. This is coarse but works for many use cases. A bot that forgets user preferences after 90 days of inactivity behaves reasonably in most scenarios. An agent tracking the full history of a long-standing client relationship requires more careful architecture, and the tooling for that does not yet exist in a reliable, production-ready form.
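
A sketch of inactivity-based expiration, assuming each read counts as activity and resets the clock. The `TTLMemory` class is hypothetical, and the caller passes the current time explicitly so the behavior is easy to test.

```python
from datetime import datetime, timedelta


class TTLMemory:
    """Key-value memory whose entries expire after a period of inactivity."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._items = {}  # key -> (value, last_touched)

    def put(self, key, value, now):
        self._items[key] = (value, now)

    def get(self, key, now):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, last_touched = entry
        if now - last_touched > self.ttl:
            del self._items[key]  # expired: forget it
            return None
        self._items[key] = (value, now)  # a read counts as activity
        return value

    def purge(self, now):
        """Delete all expired entries and return how many were removed."""
        expired = [k for k, (_, t) in self._items.items() if now - t > self.ttl]
        for key in expired:
            del self._items[key]
        return len(expired)
```

With `ttl=timedelta(days=90)`, a preference that keeps being read stays alive indefinitely, while one that is untouched for more than 90 days disappears on the next `get` or `purge`.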

Choosing an architecture

The choice comes down to what the agent needs to remember and over what time horizon.

For short sessions with no cross-session requirements, in-context storage is sufficient. A rolling window of recent messages, truncated to fit the context limit, requires no infrastructure and adds few failure modes beyond the model’s own context handling, though anything truncated is lost.

For cross-session factual recall, a vector store for semantic memory is the right first addition. Embed key facts extracted from conversations, retrieve at query time, and inject into the prompt. This handles questions like “what did the user tell me about their project” adequately.

For cross-session behavioral consistency, a structured store for procedural memory becomes important. Corrections and preferences should be stored as discrete, queryable rules and injected selectively into future system prompts.

For long-term episodic memory with causal or temporal queries, something closer to the MemGPT architecture is warranted: hybrid retrieval combining embeddings and structured metadata, explicit time-based indexing, and agent-driven memory management.
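
The four cases above stack rather than exclude one another, which a small helper can make explicit. The function and its flag names are hypothetical, and the returned strings only restate the recommendations in this section.

```python
def recommend_memory(cross_session_facts=False,
                     behavioral_consistency=False,
                     temporal_queries=False):
    """Map an agent's requirements to the memory components discussed above."""
    components = ["rolling in-context window"]
    if cross_session_facts:
        components.append("vector store for semantic memory")
    if behavioral_consistency:
        components.append("structured rule store for procedural memory")
    if temporal_queries:
        components.append(
            "hybrid retrieval with time-based indexing and agent-driven memory management"
        )
    return components
```

A support bot that must recall project details and honor standing corrections would call `recommend_memory(cross_session_facts=True, behavioral_consistency=True)` and get three components back, with the in-context window always present as the base layer.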

The frameworks are still catching up to the full problem. Most current memory libraries handle semantic retrieval adequately, handle episodic retrieval with improving hybrid approaches, and handle procedural memory inconsistently. Building agents that learn from experience over time, consolidate and forget appropriately, and maintain consistent behavior across months of deployment remains an open systems engineering problem. The cognitive science taxonomy is a useful map, but the territory is messier than any clean four-quadrant diagram suggests.