When building anything that persists across sessions, memory is the problem everything else depends on. The article on approaches to agent memory lays out the main categories cleanly: in-context storage, external vector stores, knowledge graphs, and parametric memory baked into weights. These categories are real and useful. But the taxonomy hides a question with more practical weight: who is responsible for deciding what gets remembered?
That question divides agent memory architectures into two fundamentally different schools, and understanding the divide matters more for system design than picking a storage backend.
The System-Managed Approach
The dominant pattern today is memory managed by the surrounding system. Application code decides what to store, when to store it, and what to retrieve before building each prompt. Retrieval-augmented generation is the clearest example: your pipeline embeds documents, queries a vector store at inference time, and injects the top-k retrieved chunks into the prompt before the model sees anything.
This is a tractable engineering problem. Vector databases like Pinecone, Qdrant, and Weaviate are mature infrastructure. pgvector adds approximate nearest-neighbor search to Postgres as an extension. The retrieval pipeline has a known shape:
user query → embedding model → ANN search → top-k chunks → context injection → LLM call
LangChain’s VectorStoreRetrieverMemory, LlamaIndex’s VectorMemory, and mem0’s extraction pipeline all follow variations on this shape. The system extracts information from conversations, embeds it, stores it, and retrieves relevant pieces on demand.
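The pipeline shape above can be sketched in plain Python. This is a toy illustration, not any library's API: `embed` is a hypothetical stand-in for a real embedding model (it only hashes word counts into a fixed-size vector), `ToyVectorStore` and `build_prompt` are invented names, and the linear scan stands in for a real ANN index.

```python
import math
import re
import zlib
from collections import Counter

DIM = 256  # toy embedding width, chosen arbitrarily for this sketch


def embed(text):
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * DIM
    words = re.findall(r"[a-z0-9]+", text.lower())
    for word, count in Counter(words).items():
        vec[zlib.crc32(word.encode()) % DIM] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, so dot product = cosine


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyVectorStore:
    """Exact (linear scan) search; a real system would use an ANN index."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(store, user_query, k=2):
    """Context injection: retrieved chunks go in before the user's message."""
    context = "\n".join(f"- {chunk}" for chunk in store.search(user_query, k))
    return f"Relevant memory:\n{context}\n\nUser: {user_query}"


store = ToyVectorStore()
store.add("User prefers async patterns in Python")
store.add("The database schema uses Postgres tables")
store.add("Deploys happen on Fridays")
prompt = build_prompt(store, "async Python patterns", k=1)
```

Note that nothing in this loop ever writes to the store on its own; every `add` call is a decision made by application code.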
The problems with this approach are well-documented but often understated in tutorials. Retrieval quality on domain-specific corpora in production typically lands between 60 and 85% recall@5, meaning that in roughly 15 to 40% of queries the fact the agent needs is not in the retrieved chunks. The BEIR benchmark puts NDCG@10 for state-of-the-art dense retrieval models like E5-large and BGE at 54 to 60 across heterogeneous retrieval tasks. NDCG is a ranking-quality score rather than a miss rate, so those numbers do not translate directly into a share of missed documents, but they do show that top-10 rankings remain well short of ideal. For a QA system that’s disappointing but survivable. For an agent that needs accurate memory to take consequential actions, those misses accumulate into unreliable behavior.
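Recall@k and NDCG@k measure different things: recall@k counts whether relevant items appear in the top k at all, while NDCG@k also weighs where they appear. A minimal sketch with binary relevance and hypothetical document IDs:

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: a relevant item counts for more when ranked higher."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Hypothetical ranking: both relevant documents are retrieved, but at ranks 2 and 5.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d1"}

recall = recall_at_k(ranked, relevant, 5)
ndcg = ndcg_at_k(ranked, relevant, 5)
```

Here recall@5 is 1.0 while NDCG@5 is about 0.62: nothing was missed, but the ranking is penalized for placing relevant items low.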
Chunking strategy is a larger source of failure than most practitioners expect. The right chunk for retrieval (small, semantically tight, good embedding representation) is often not the right chunk for comprehension (needs surrounding context, section headers, prior setup). Anthropic’s contextual retrieval approach addresses this by prepending a chunk-specific summary before embedding, using an LLM call during ingest to write context for each chunk. Anthropic reported a 49% reduction in retrieval failures in their tests. The technique works, but it doubles ingest cost and introduces a dependency on generation quality during indexing.
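The ingest-time step can be sketched as follows. This is an illustration of the idea, not Anthropic's code: `contextualize_chunks` is an invented name, and `write_context` is a hypothetical callable standing in for the LLM call that reads the full document plus one chunk and returns a short situating sentence.

```python
from typing import Callable, List


def contextualize_chunks(
    document: str,
    chunks: List[str],
    write_context: Callable[[str, str], str],
) -> List[str]:
    """Prepend model-written context to each chunk before it is embedded."""
    contextualized = []
    for chunk in chunks:
        context = write_context(document, chunk).strip()  # one LLM call per chunk
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized


def fake_write_context(document: str, chunk: str) -> str:
    """Placeholder for the LLM call so the sketch runs offline."""
    title = document.splitlines()[0]
    return f"From the document titled '{title}'."


doc = "Q2 Report\nRevenue grew 3%.\nCosts fell."
to_embed = contextualize_chunks(doc, ["Revenue grew 3%."], fake_write_context)
```

The strings in `to_embed`, not the raw chunks, are what would be passed to the embedding model, which is why the quality of the generated context becomes part of index quality.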
Another problem specific to memory use cases: the write path is not automatic. Standard RAG is a read-only memory system. The application has to decide separately when to write, what to write, and how to handle conflicts between new information and existing stored facts. mem0 handles this explicitly with an LLM extraction call on each conversation turn and a deduplication pass that merges conflicting facts. That extracts real value but adds latency and model spend to every interaction, not just retrievals.
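To make the write-path decisions concrete, here is a deliberately simple sketch. It is not how mem0 works internally: facts are assumed to arrive already extracted as (subject, attribute, value) tuples, and `FactStore` is a hypothetical class that resolves conflicts with a last-write-wins rule where a real system might ask an LLM to merge them.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Toy write path: one current value per (subject, attribute) key."""

    facts: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def write(self, subject, attribute, value):
        key = (subject, attribute)
        old = self.facts.get(key)
        if old == value:
            return "duplicate"  # already stored, nothing to do
        self.facts[key] = value
        if old is None:
            return "added"
        self.history.append((key, old))  # keep the superseded value for audit
        return "updated"


store = FactStore()
first = store.write("user", "editor", "vim")
second = store.write("user", "editor", "vim")
third = store.write("user", "editor", "emacs")
```

Even this toy version has to answer the three questions from the paragraph above: when to write (on every extracted fact), what to write (the keyed value), and what to do on conflict (overwrite and log).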
The Agent-Managed Approach
The MemGPT paper from UC Berkeley, now productized as Letta, introduced a different framing. Instead of the surrounding system deciding what the agent should remember, the model itself is given memory management tools and decides autonomously what to store, retrieve, and discard.
The architecture borrows explicitly from operating systems. There is a main context (working memory, bounded by the context window), archival memory (an external vector store, effectively unlimited), and recall storage (searchable conversation history). The model is given function calls:
core_memory_append(name="human", content="User prefers async patterns in Python")
archival_memory_insert(content="Discussed database schema design on 2025-03-15")
archival_memory_search(query="user preferences for code style")
The qualitative shift is that the model decides when to call these. After each interaction, the agent can choose to write a fact to its core memory block, push something to archival storage, or do nothing. When approaching context limits, the agent compresses older content and pushes it to archival, then retrieves it later. Memory becomes a capability the agent exercises, not infrastructure the application manages.
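A minimal sketch of what sits on the other side of those function calls. The three method names mirror the calls shown above; everything else is hypothetical, including the `dispatch` helper, the tool-call dictionary shape, and the keyword match that stands in for vector search. It is not Letta's implementation.

```python
class AgentMemory:
    def __init__(self):
        self.core = {"human": [], "persona": []}  # always included in the prompt
        self.archival = []                         # searched only on demand

    def core_memory_append(self, name, content):
        self.core[name].append(content)
        return "ok"

    def archival_memory_insert(self, content):
        self.archival.append(content)
        return "ok"

    def archival_memory_search(self, query):
        # Keyword overlap as a stand-in for embedding similarity.
        terms = query.lower().split()
        return [m for m in self.archival if any(t in m.lower() for t in terms)]


ALLOWED_TOOLS = {
    "core_memory_append",
    "archival_memory_insert",
    "archival_memory_search",
}


def dispatch(memory, tool_call):
    """Execute one model-emitted call shaped like {"name": ..., "arguments": {...}}."""
    if tool_call["name"] not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool_call['name']}")
    return getattr(memory, tool_call["name"])(**tool_call["arguments"])


memory = AgentMemory()
dispatch(memory, {
    "name": "archival_memory_insert",
    "arguments": {"content": "Discussed database schema design on 2025-03-15"},
})
found = dispatch(memory, {
    "name": "archival_memory_search",
    "arguments": {"query": "schema"},
})
```

The application code here only executes calls. Whether any call happens at all is up to the model, which is the whole point of the design.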
Letta exposes this through a REST API with persistent agent state stored server-side:
from letta import create_client
from letta.schemas.memory import BasicBlockMemory, Block
client = create_client()
agent = client.create_agent(
    name="persistent-assistant",
    memory=BasicBlockMemory(
        blocks=[
            Block(label="human", value=""),
            Block(label="persona", value="Technical assistant with memory across sessions"),
        ]
    )
)
Each agent’s memory blocks persist between API calls. Multiple agents can share memory blocks, which enables patterns like a team of specialized agents operating against a common knowledge base.
The limitation is trust. You are delegating memory management decisions to the model, and models make mistakes. An agent might fail to archive a critical fact, write conflicting information to its memory blocks, or surface irrelevant archival context when it matters most. System-managed memory is more predictable because human-written retrieval logic is easier to inspect and debug than LLM decisions about what to remember. In a customer-facing application where memory errors produce visible failures, that predictability has real value.
Knowledge Graphs: The Case for Structure
Both in-context and vector approaches share a weakness with multi-hop reasoning. “What projects did Alice work on that relied on the technology Bob’s team built?” requires connecting at least three entities across at least two relation types. Dense retrieval doesn’t traverse edges; it retrieves documents ranked by similarity, and two-hop reasoning requires either multiple retrieval passes or an LLM that can bridge the gap without explicit structure.
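With explicit edges, the example question becomes a short traversal. The triples, relation names, and helper functions below are all hypothetical, chosen only to mirror the question in the text:

```python
triples = [
    ("alice", "worked_on", "checkout-service"),
    ("alice", "worked_on", "docs-site"),
    ("checkout-service", "relies_on", "payments-sdk"),
    ("docs-site", "relies_on", "static-generator"),
    ("bob-team", "built", "payments-sdk"),
]


def objects(subject, relation):
    """All objects reachable from `subject` over one `relation` edge."""
    return {o for s, r, o in triples if s == subject and r == relation}


def projects_relying_on_team_tech(person, team):
    team_tech = objects(team, "built")  # hop 1: what the team built
    return {
        project
        for project in objects(person, "worked_on")   # hop 2: the person's projects
        if objects(project, "relies_on") & team_tech  # hop 3: dependency overlap
    }


answer = projects_relying_on_team_tech("alice", "bob-team")
```

No single stored fact contains the answer; it only exists as a join across three relations, which is what similarity ranking over documents cannot do on its own.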
Microsoft’s GraphRAG addressed this by pre-building an entity knowledge graph from source documents, partitioning it into communities, and then using hierarchical community summaries at query time. On global summarization queries requiring synthesis across many documents, GraphRAG outperformed naive RAG by 20 to 40% in Microsoft’s evaluations.
The graph construction step is where this approach struggles. Using an LLM to extract entity and relation triples from unstructured text introduces extraction errors at 10 to 20% rates, and those errors compound: a misattributed relation is retrieved with high confidence because it matches the structure of a query exactly. Correcting a wrong triple requires finding it in the graph, not just re-indexing a document.
The Cognitive Architectures for Language Agents survey frames knowledge graphs as “semantic memory” in its cognitive taxonomy, distinct from episodic memory (specific past experiences) and procedural memory (how to perform tasks). That framing clarifies when a graph is worth building: when your agent needs precise, structured, queryable facts about a relatively stable domain with clear entity types and relations. For conversational history or user preference tracking, a graph adds more overhead than it resolves.
What Stays Unsolved
The Generative Agents paper from Stanford introduced a memory stream architecture where every agent observation is stored with a timestamp, an LLM-rated importance score from 1 to 10, and an embedding. Retrieval combines recency, importance, and relevance into a single score. This is elegant and it works for simulated environments with low message volumes.
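The combined score can be sketched as a weighted sum of three normalized components. The exponential decay factor, the min-max normalization, and the equal weights used here should be treated as assumptions for illustration, and relevance is supplied as a precomputed number instead of being derived from real embeddings.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    hours_since_access: float
    importance: int   # LLM-rated, 1 to 10
    relevance: float  # similarity to the current query, precomputed for this sketch


def min_max(values):
    """Scale values to [0, 1]; all-equal inputs carry no signal and map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank(memories, decay=0.995, weights=(1.0, 1.0, 1.0)):
    """Order memories by weighted recency + importance + relevance, best first."""
    recency = min_max([decay ** m.hours_since_access for m in memories])
    importance = min_max([m.importance for m in memories])
    relevance = min_max([m.relevance for m in memories])
    w_rec, w_imp, w_rel = weights
    scored = [
        (w_rec * rec + w_imp * imp + w_rel * rel, m)
        for rec, imp, rel, m in zip(recency, importance, relevance, memories)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored]


memories = [
    Memory("Said hello this morning", hours_since_access=1, importance=3, relevance=0.2),
    Memory("User is allergic to peanuts", hours_since_access=100, importance=9, relevance=0.9),
    Memory("Weather was cloudy", hours_since_access=50, importance=1, relevance=0.1),
]
ordered = rank(memories)
```

In this example the old but important and relevant memory outranks the recent trivial one, which is the behavior the three-part score is designed to produce.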
Scaling it to real applications with high interaction frequency exposes the consolidation problem: when do you summarize memories, and how do you do it without losing information that turns out to matter later? This is analogous to deciding what to write down in notes during a meeting. You cannot know at the time what will be important to recall six months later. Most agent memory systems handle this poorly, either keeping everything (expensive, noisy) or applying arbitrary time-based or size-based cutoffs (cheap, lossy).
The MemoryBench benchmark, which evaluates agents on verbatim recall within a session, semantic recall across sessions, and behavioral adaptation from prior feedback, found that most production agents perform well on single-session recall but score below 50% on cross-session tasks. That gap is not primarily a storage problem; the data is there. It is a retrieval timing and query formulation problem: the agent does not know, at the moment it needs a fact, how to construct the query that would surface it.
Long context windows defer but do not resolve this. At GPT-4o’s pricing of roughly $2.50 per million input tokens, filling a 128K context window costs around $0.32 per call, which becomes untenable at any real usage volume. Beyond cost, Liu et al.’s “Lost in the Middle” research demonstrated that model accuracy on multi-document QA drops significantly when the relevant document appears in the middle of a long context rather than at the beginning or end. The attention pattern problem is real and not easily resolved by prompt engineering.
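The cost arithmetic is easy to check. The per-token price is the rough figure quoted above, and the daily call volume is a hypothetical number chosen only to show how the per-call cost scales:

```python
PRICE_PER_MILLION_INPUT_TOKENS = 2.50  # USD, rough figure quoted in the text
CONTEXT_TOKENS = 128_000               # a full 128K window
CALLS_PER_DAY = 10_000                 # hypothetical usage volume


def input_cost(tokens, price_per_million=PRICE_PER_MILLION_INPUT_TOKENS):
    """Input-token cost in USD for a single call."""
    return tokens / 1_000_000 * price_per_million


cost_per_call = input_cost(CONTEXT_TOKENS)
cost_per_day = cost_per_call * CALLS_PER_DAY
print(f"per call: ${cost_per_call:.2f}, per day: ${cost_per_day:,.2f}")
```

At that assumed volume the input side alone comes to about $3,200 per day, before any output tokens are counted.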
Choosing an Approach
Production systems end up using hybrids: a short-term in-context buffer for the current session, a vector store for long-term semantic memory, and optionally a knowledge graph for domain-specific structured facts. mem0’s architecture makes this explicit with a dual vector-plus-graph store, an LLM extraction step on ingest, and deduplication via a second LLM call that merges conflicting facts rather than creating duplicates.
The choice of who manages memory, system or agent, matters more than which storage backend you pick. System-managed memory is more predictable and debuggable but requires you to anticipate what will be relevant and encode that into retrieval logic ahead of time. Agent-managed memory is more adaptive but introduces a dependency on the model’s judgment for a class of decisions that directly affect its own reliability.
Neither school has produced a clearly dominant solution. Cross-session memory remains weak in most production deployments. Retrieval precision problems are real and widespread. The tombedor.dev article is a useful map of the terrain. What the map doesn’t show is that we are still in the early phase of figuring out which roads are worth building.