The Hard Part of Agent Memory Is the Write Path

Source: lobsters

The retrieval side of agent memory gets most of the engineering attention: which vector database to use, which embedding model, what similarity threshold to set before returning results. A recent piece by Tom Bedor on approaches to agent memory surveys the main strategies well. But retrieval is only half the problem, and arguably the less interesting half. The harder problem sits upstream: the write path.

What gets stored? When? How does the system handle a new fact that contradicts an existing one? How do you prevent memory from accumulating noise over thousands of sessions? These questions don’t map cleanly onto traditional database engineering, and most production implementations either simplify them away or treat them as an afterthought.

A Useful Taxonomy

The field borrows terminology from cognitive science. In-context working memory is whatever the model can see right now, bounded by the context window. Episodic memory stores records of specific past events, timestamped and scoped to an interaction. Semantic memory holds distilled, general-purpose facts that stay stable across sessions. Procedural memory encodes how to do things, typically baked into system prompts and tool schemas.

Most discussions treat these as a storage hierarchy, with different backends appropriate to each tier. What the taxonomy obscures is that these tiers don’t exist independently. The hard engineering problem is the promotion path: how raw episodic observations get distilled into semantic facts, how semantic facts get updated when they become stale, and who decides when to run that distillation.
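
As a rough illustration of that promotion path, here is a minimal sketch. The `Episode` and `SemanticFact` types, the `promote` function, and the `extract` callable are all hypothetical names; in practice `extract` would be an LLM call, and "newest observation wins" is only one possible update policy.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Episode:
    """One raw, timestamped observation from a single session."""
    session_id: str
    text: str
    at: datetime


@dataclass
class SemanticFact:
    """A distilled fact plus the episodes that support it."""
    key: str
    value: str
    updated_at: datetime
    sources: list[Episode] = field(default_factory=list)


def promote(episodes, facts, extract):
    """Fold episodes into the semantic store, oldest first.

    `extract` is a hypothetical callable returning (key, value) or None.
    """
    for ep in sorted(episodes, key=lambda e: e.at):
        pair = extract(ep.text)
        if pair is None:
            continue  # nothing worth promoting
        key, value = pair
        current = facts.get(key)
        if current is None:
            facts[key] = SemanticFact(key, value, ep.at, [ep])
        elif ep.at >= current.updated_at:
            # Newer observation: overwrite the value but keep provenance.
            current.value = value
            current.updated_at = ep.at
            current.sources.append(ep)
    return facts
```

Keeping the source episodes on each fact is a deliberate choice here: it lets a later pass re-derive or audit a semantic fact instead of trusting the distilled value blindly.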

MemGPT’s Answer: Give the Agent the Keys

The MemGPT paper (arXiv:2310.08560) framed agent memory as OS memory management. The model’s context window maps to RAM; external storage maps to disk. When the context fills, content gets paged out, and the agent retrieves it later via tool calls.

The paging mechanism itself is familiar; what’s architecturally significant is that memory operations became first-class tool calls the agent invokes by its own judgment. MemGPT, now rebranded as Letta, exposes functions like these to the agent:

archival_memory_insert(content: str) -> None
archival_memory_search(query: str, page: int) -> list[MemoryFragment]
memory_replace(old: str, new: str) -> None
conversation_search(query: str, page: int) -> list[Message]

The agent decides when to call these, what to store, and how to update existing entries. Memory management becomes part of the reasoning loop rather than a background pipeline running independently of the agent’s awareness.
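
To make the "memory as tool calls" idea concrete, here is a small sketch of a dispatcher that routes parsed tool calls to a store. This is not Letta’s implementation: `ArchivalStore` and `make_dispatcher` are hypothetical, substring matching stands in for embedding search, and a single store handles both insert and replace for brevity.

```python
from typing import Callable


class ArchivalStore:
    """Toy stand-in for external storage; substring search instead of embeddings."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def insert(self, content: str) -> None:
        self.entries.append(content)

    def search(self, query: str, page: int = 0, page_size: int = 5) -> list[str]:
        hits = [e for e in self.entries if query.lower() in e.lower()]
        start = page * page_size
        return hits[start:start + page_size]

    def replace(self, old: str, new: str) -> None:
        self.entries = [e.replace(old, new) for e in self.entries]


def make_dispatcher(store: ArchivalStore) -> Callable[[dict], object]:
    """Map tool-call names emitted by the model onto store methods."""
    tools = {
        "archival_memory_insert": store.insert,
        "archival_memory_search": store.search,
        "memory_replace": store.replace,
    }

    def dispatch(call: dict) -> object:
        # `call` mimics a parsed tool call: {"name": ..., "arguments": {...}}
        return tools[call["name"]](**call["arguments"])

    return dispatch
```

The store never decides anything on its own; every write happens because the model chose to emit a call, which is exactly where the coupling to model quality comes from.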

The tradeoff is direct coupling to model quality. A capable model consolidates redundant memories and resolves contradictions. A weaker model stores noise, hallucinates false recollections, or fails to retrieve context it needs. The system’s memory quality is bound to the model’s reasoning quality in a way that more automated pipelines avoid.

mem0: Automated Extraction with Deduplication

mem0 (pip install mem0ai) takes the opposite approach. Rather than giving the agent explicit memory tools, it runs an extraction pipeline behind the scenes. When you call memory.add(), an LLM pass pulls discrete facts from the conversation and stores them as structured memory objects:

from mem0 import Memory

m = Memory()
result = m.add(
    "I prefer concise responses and I use Python 3.11",
    user_id="alice",
)
# → {"results": [{"id": "...", "memory": "User prefers concise responses", "event": "ADD"}, ...]}

memories = m.search("communication style", user_id="alice")
# → {"results": [{"memory": "User prefers concise responses", "score": 0.91}]}

Before writing each extracted fact, mem0 runs a semantic similarity check against existing memories. If the incoming fact is sufficiently close to an existing one, it updates or merges rather than creating a duplicate. If it contradicts an existing fact, the newer one wins. The default storage backend is Qdrant, a Rust-based vector database with straightforward Python bindings.
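
The similarity gate on the write path can be sketched in a few lines. This is not mem0’s code: `embed` is a toy bag-of-words vector standing in for a real embedding model, and `write_fact` and the 0.6 threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9.]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def write_fact(store: list[dict], fact: str, threshold: float = 0.6) -> str:
    """Append `fact`, or overwrite the closest existing memory above `threshold`."""
    vec = embed(fact)
    best = max(store, key=lambda m: cosine(vec, m["vec"]), default=None)
    if best is not None and cosine(vec, best["vec"]) >= threshold:
        best["memory"], best["vec"] = fact, vec  # newer wording wins
        return "UPDATE"
    store.append({"memory": fact, "vec": vec})
    return "ADD"
```

Everything interesting hides in the threshold: set it too low and unrelated facts overwrite each other, set it too high and near-duplicates pile up.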

This automated approach handles basic deduplication reasonably well. Where it struggles is with nuanced contradictions. If a user says in January “I mostly work in Python” and says in March “I’ve switched to TypeScript for most of my projects,” the system needs to determine whether this is a contradiction to resolve or an addendum to merge. That distinction requires semantic judgment that similarity thresholds alone don’t reliably provide.

Zep’s Knowledge Graph Layer

Zep approaches the problem through a temporal knowledge graph. Rather than storing text strings or extracted fact blobs, it builds a structured graph of entities and relationships from conversations. Each entity is a node; each relationship is a typed, timestamped edge.

When a user mentions their colleague Sarah, Zep creates an entity for Sarah and links it to the user entity with a relationship. Future mentions of Sarah resolve to the same node, and the relationship gets updated as new information arrives. Retrieval uses graph traversal alongside vector similarity, which enables structured lookups that pure embedding search cannot express cleanly.
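
A stripped-down version of a temporal graph might look like the following. It is a sketch under assumptions, not Zep’s schema: the `Edge` fields `valid_from` and `invalid_at`, the `TemporalGraph` class, and the `exclusive` flag are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Edge:
    source: str
    relation: str
    target: str
    valid_from: datetime
    invalid_at: Optional[datetime] = None  # None means still current


class TemporalGraph:
    def __init__(self) -> None:
        self.nodes: set[str] = set()
        self.edges: list[Edge] = []

    def assert_fact(self, source, relation, target, at, exclusive=True):
        """Add an edge; if `exclusive`, close any open edge with the same source and relation."""
        self.nodes.update((source, target))  # repeated mentions resolve to one node
        if exclusive:
            for e in self.edges:
                if e.source == source and e.relation == relation and e.invalid_at is None:
                    e.invalid_at = at
        self.edges.append(Edge(source, relation, target, at))

    def neighbors(self, source, relation, as_of):
        """Targets whose edge was valid at time `as_of`."""
        return [
            e.target for e in self.edges
            if e.source == source and e.relation == relation
            and e.valid_from <= as_of and (e.invalid_at is None or as_of < e.invalid_at)
        ]
```

Closing an edge instead of deleting it means the graph can still answer "what was true in February?", which a store of overwritten text blobs cannot.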

Zep’s hybrid retrieval combines BM25 full-text search, vector cosine similarity, and graph traversal, then re-ranks results. For factual lookups where exact phrasing matters, BM25 often outperforms embedding similarity. For fuzzy semantic queries, the vector index fills the gap. The Zep Community Edition requires PostgreSQL with the pgvector extension and optionally Neo4j for the full graph layer, which raises operational requirements compared to a simpler vector store setup.
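
One common way to merge several ranked lists is reciprocal rank fusion. The sketch below uses it purely as an illustration of the re-ranking step; nothing here is taken from Zep’s implementation, and the function name and the constant `k = 60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of ids; ids ranked high in several lists rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because it works on ranks instead of raw scores, this kind of fusion sidesteps the problem that BM25 scores, cosine similarities, and graph distances live on incompatible scales.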

The Generative Agents Scoring Model

The Generative Agents paper from Stanford and Google introduced a retrieval scoring model worth understanding even outside the full simulation context. Memory retrieval scores candidates on three dimensions simultaneously:

  • Recency: exponential decay since the memory was last accessed
  • Importance: an LLM-assigned score from 1 to 10 at storage time, reflecting how significant the event seemed
  • Relevance: cosine similarity between the query embedding and the memory embedding

Final score = α·recency + β·importance + γ·relevance, with tunable weights. Multi-factor scoring prevents any single dimension from dominating. Pure semantic similarity returns the most topically similar memories, which are not always the most useful ones: a note from an hour ago about the current task can rank below a months-old memory that sits closer to the query in embedding space but is less actionable.
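
The weighted sum is simple enough to sketch directly. The `Memory` type, the per-hour decay of 0.995, and the division of importance by 10 are assumptions made to keep the three terms on comparable scales; treat this as an illustration of the formula, not the paper’s code.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    importance: int            # 1 to 10, assigned at storage time
    hours_since_access: float
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(mem: Memory, query_embedding: list[float],
          alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0,
          decay: float = 0.995) -> float:
    recency = decay ** mem.hours_since_access   # exponential decay, in (0, 1]
    importance = mem.importance / 10            # rescaled to (0, 1]
    relevance = cosine(query_embedding, mem.embedding)
    return alpha * recency + beta * importance + gamma * relevance
```

Setting `alpha` to 0 reduces the ranking to importance plus similarity, which is a quick way to see how much work the recency term is doing for a given workload.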

The paper also introduced reflection: periodic synthesis passes where the agent examines its recent memory stream and generates higher-order insights, storing them as new memories with elevated importance scores. This creates a more abstract and compressed layer of knowledge above the raw episodic record. Of the approaches covered here, it comes closest to genuine memory consolidation, and it requires deliberate engineering rather than emerging naturally from the storage layer.
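
A reflection pass can be sketched as a trigger plus a synthesis step. In the sketch below, `StreamEntry`, `maybe_reflect`, the importance-sum trigger with its threshold of 150, and the fixed importance of 9 for insights are all assumptions; `synthesize` is a hypothetical callable that would wrap an LLM prompt.

```python
from dataclasses import dataclass


@dataclass
class StreamEntry:
    text: str
    importance: int
    kind: str = "observation"   # or "reflection"


def maybe_reflect(stream: list[StreamEntry], synthesize, threshold: int = 150) -> list[StreamEntry]:
    """Append synthesized insights once recent observations carry enough importance."""
    # Only observations recorded since the last reflection count toward the trigger.
    last = max((i for i, m in enumerate(stream) if m.kind == "reflection"), default=-1)
    recent = [m for m in stream[last + 1:] if m.kind == "observation"]
    if sum(m.importance for m in recent) < threshold:
        return []
    insights = [StreamEntry(text, importance=9, kind="reflection") for text in synthesize(recent)]
    stream.extend(insights)
    return insights
```

Gating on accumulated importance instead of a fixed schedule means quiet sessions trigger no synthesis, while eventful ones trigger it sooner.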

LangGraph’s State-Based Model

LangChain’s original memory classes provided a menu of standalone backends: ConversationSummaryBufferMemory, VectorStoreRetrieverMemory, ConversationKGMemory, and others. The current recommended path is LangGraph, where memory is represented as typed state that persists across graph nodes via checkpointers.

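import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver

# In recent releases from_conn_string() is a context manager, so pass a connection directly.
conn = sqlite3.connect("memory.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)
graph = builder.compile(checkpointer=checkpointer)  # builder: a StateGraph defined elsewhere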

This is a meaningfully different model. Memory is no longer a separate module bolted onto the LLM call; it’s part of the graph’s execution state, visible to all nodes and automatically serialized at each step. The tradeoff is that it’s more opinionated. You manage a typed state schema rather than selecting a memory backend, and migrating to a different architecture later means rethinking your state shape, not just swapping a dependency.

What the Vector Database Choice Determines

The choice of vector database (Pinecone, Chroma, Weaviate, Qdrant, pgvector, or Milvus) matters less than the design decisions upstream of it. For typical agent memory workloads at moderate scale, these systems tend to perform comparably. The embedding model has more impact on recall quality: text-embedding-3-small (OpenAI, 1536 dimensions) is a reasonable default for most use cases; open-source alternatives like bge-large-en-v1.5 from BAAI (1024 dimensions) compete well when avoiding external API calls matters.

What determines memory quality in production is the extraction logic, the deduplication strategy, and the conflict resolution approach. Those decisions are harder to swap out once a system has accumulated real user data. A change in extraction granularity, for instance, means existing memories and new memories are no longer comparable in scope, which breaks retrieval scoring and may require a full re-indexing pass.
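
One way to soften that migration cost is to record which extractor produced each memory and keep the raw input alongside it. The sketch below assumes exactly that; `StoredMemory`, `EXTRACTOR_VERSION`, `stale`, and `reindex` are hypothetical names, and `extract` stands in for whatever extraction step the system uses.

```python
from dataclasses import dataclass

EXTRACTOR_VERSION = 2  # hypothetical: bump when extraction granularity changes


@dataclass
class StoredMemory:
    memory: str
    source_text: str        # raw input kept so the fact can be re-extracted
    extractor_version: int


def stale(memories, current=EXTRACTOR_VERSION):
    """Memories written by an older extractor than `current`."""
    return [m for m in memories if m.extractor_version < current]


def reindex(memories, extract, current=EXTRACTOR_VERSION):
    """Re-extract stale memories from their raw source; return the refreshed store."""
    fresh = [m for m in memories if m.extractor_version >= current]
    for m in stale(memories, current):
        for fact in extract(m.source_text):
            fresh.append(StoredMemory(fact, m.source_text, current))
    return fresh
```

Storing the source text costs space, but without it a granularity change leaves no way to bring old memories in line with new ones short of discarding them.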

Where This Leaves the Field

Long-context models have partially deferred the memory problem for session-scoped interactions. Claude’s 200k context window and Gemini’s 1M window can hold substantial conversation history without any external memory machinery. Cross-session memory, where the agent accumulates knowledge about a user or domain over weeks and months, still requires deliberate architecture.

The choice between automated extraction (mem0-style), agent-controlled memory (Letta-style), and structured knowledge graph approaches (Zep-style) depends on how much control you want to hand to the model, how structured your domain knowledge is, and how much you need to reason over relationships between entities rather than individual facts. There is no universally correct answer. Different teams are reaching different conclusions, often based on their specific workloads rather than first principles.

What’s consistent across the approaches is that treating memory as an add-on layer rather than a core architectural concern tends to produce systems where the write path is an afterthought. You end up with a retrieval system that performs well on benchmarks and poorly on the specific facts that matter most to actual users, because the garbage-in problem was never solved upstream.
