Every LLM agent runs against a bounded context window. That window is the agent’s working memory: everything it can reason about in a given moment. Once it fills, the agent faces a decision: forget the oldest content, compress it into a summary, or offload it to external storage. The mechanism an agent uses to make that decision, and how reliably it makes it, define what kind of system it actually is.
Tom Bedor’s survey of approaches to agent memory does a good job laying out the landscape. The standard taxonomy maps loosely to cognitive science: in-context memory (the active window), episodic memory (records of past events), semantic memory (factual knowledge), and procedural memory (encoded skills and behaviors). These map reasonably well to engineering primitives: the token buffer, a timestamped event log, a vector store, and fine-tuned weights or a tool schema. Most agents combine a subset of these.
The retrieval side of this taxonomy is well-studied. Retrieval-augmented generation has a mature benchmark ecosystem; MTEB and BEIR give you something concrete to optimize against. Vector databases have standardized around a handful of well-understood data structures: HNSW for approximate nearest-neighbor search, IVF-PQ for compressed large-scale retrieval. The embedding model quality gap between providers has narrowed significantly. If your problem is fetching relevant knowledge from a large corpus, the tools are in decent shape.
The write side has received far less systematic treatment, and it is the harder problem.
What the Write Decision Actually Involves
Deciding what to store requires answering at least four questions: what is worth keeping, in what form, when, and for how long. None of these has a clean answer that generalizes across use cases.
Storing everything is the naive approach. It is expensive, it fills whatever storage you provision, and it defers the selection problem to retrieval time. Retrieval systems do not perform better with more noise. Every irrelevant chunk that returns from a query competes for context window space with the actually useful content. At scale, storing everything without filtering is not a neutral choice; it degrades retrieval quality.
Selective storage requires a function that scores whether a given piece of information is worth keeping. One of the most principled attempts to define that function appeared in the Generative Agents paper from Park et al. at Stanford in 2023. Their architecture stores every observation a simulated agent makes in a memory stream, but scores each one for importance on a 1-10 scale by querying the LLM, asking it to rate the memory from 1 (purely mundane) to 10 (extremely poignant). Retrieval then combines three signals:
score = α_recency × decay(t) + α_importance × importance + α_relevance × similarity(query, memory)
Recency uses exponential decay over the time since the memory was last accessed. Importance is the stored score from write time. Relevance is cosine similarity between the query embedding and the memory embedding. The weights α are tunable per application; the paper sets all three to 1 after normalizing each signal to the range [0, 1]. This formulation is elegant because it separates the three legitimate reasons a memory might matter: it just happened, it was significant when it occurred, or it is similar to what the agent needs right now.
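The weighted sum above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the Memory class, the decay_rate default, measuring age in hours, and dividing importance by 10 to bring it onto a comparable scale are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    importance: float       # score stored at write time, on the 1-10 scale
    age_hours: float        # hours since the memory was written or last accessed
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_score(memory: Memory, query_embedding: list[float],
                    decay_rate: float = 0.995,
                    w_recency: float = 1.0,
                    w_importance: float = 1.0,
                    w_relevance: float = 1.0) -> float:
    recency = decay_rate ** memory.age_hours      # exponential decay with age
    importance = memory.importance / 10.0         # assumed rescaling to (0, 1]
    relevance = cosine(query_embedding, memory.embedding)
    return (w_recency * recency
            + w_importance * importance
            + w_relevance * relevance)
```

Ranking a memory stream is then a sort by retrieval_score against the current query embedding, and changing the three weights shifts the balance between fresh, significant, and on-topic memories.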
The cost is real. Scoring importance at write time requires an LLM call per observation. The Generative Agents system also includes a reflection step: periodically, the agent queries its own memory stream for “what are the most important observations I’ve made recently?” and generates higher-order summaries that get added as new memories. Each reflection pass involves multiple LLM calls. This works for a research simulation running on a timeline you control. It adds up fast in a production system handling continuous event streams.
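A back-of-envelope count makes the scaling concrete. The function below is only arithmetic over inputs you would have to measure for your own system; the parameter names and the figures in the usage note are hypothetical, not numbers from the paper.

```python
def llm_calls_per_day(observations_per_day: int,
                      reflections_per_day: int,
                      calls_per_reflection: int) -> int:
    """One importance-scoring call per observation, plus the reflection passes."""
    scoring_calls = observations_per_day
    reflection_calls = reflections_per_day * calls_per_reflection
    return scoring_calls + reflection_calls
```

With a hypothetical 10,000 observations a day, hourly reflection, and 5 calls per reflection pass, this comes to 10,120 calls a day, nearly all of them spent on write-time scoring rather than on the task itself.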
The Storage Form Problem
Assuming you have decided to store something, what form should it take? Raw text preserves everything but compresses poorly and retrieves inconsistently. Structured extraction (converting text to entity-attribute-value triples or JSON facts) is more precise but lossy. Summarization trades recall for compression. The right answer depends on what you will need to retrieve later, which you often do not know at write time.
Mem0 takes the extraction approach: it processes conversation turns through an LLM that extracts discrete facts, deduplicates them against what is already stored, and maintains a structured memory graph. Zep builds a temporal knowledge graph from conversation history, assigning validity windows to facts so that “user prefers dark mode” recorded in January can be superseded by “user switched to light mode” recorded in March. These are sensible approaches to the storage form problem, but they add complexity and latency to the write path.
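The validity-window idea can be sketched without any graph machinery. The code below is an assumption-laden illustration, not Zep's data model or API: Fact, TemporalFactStore, and their methods are hypothetical names, and facts are assumed to arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None   # None means "still current"


class TemporalFactStore:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, attribute: str, value: str,
                    at: datetime) -> None:
        # Close the window of the current fact for this key, then append.
        for fact in self.facts:
            same_key = (fact.subject, fact.attribute) == (subject, attribute)
            if same_key and fact.valid_to is None:
                fact.valid_to = at
        self.facts.append(Fact(subject, attribute, value, valid_from=at))

    def value_at(self, subject: str, attribute: str,
                 at: datetime) -> Optional[str]:
        # Return the value whose window [valid_from, valid_to) contains `at`.
        for fact in self.facts:
            if (fact.subject, fact.attribute) != (subject, attribute):
                continue
            if fact.valid_from <= at and (fact.valid_to is None or at < fact.valid_to):
                return fact.value
        return None
```

Recording "dark" in January and "light" in March leaves both facts in the store; a query dated February returns "dark" and a query dated April returns "light". The superseded fact is closed, not deleted, which is what distinguishes this from a plain overwrite.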
The Contradiction Problem Vector Stores Cannot Solve
Vector databases store embeddings without semantic awareness of their content’s relationship to other stored content. “The deployment target is AWS” and “The deployment target is GCP” sit as equally valid vectors in the same index. A query for “deployment infrastructure” may return either one, or both, depending on how the query embeds. There is no native contradiction detection.
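A toy example shows why similarity search alone cannot separate the two statements. The bag-of-words embed function here is a deliberately crude stand-in for a real embedding model (an assumption of this sketch), but the effect it demonstrates carries over: the contradicting sentences are near neighbors, and the query scores them identically.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a stand-in for a real embedding model.
    return Counter(text.lower().replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


index = ["The deployment target is AWS", "The deployment target is GCP"]
query = embed("deployment infrastructure")

# Similarity of each stored statement to the query.
scores = {doc: cosine(query, embed(doc)) for doc in index}

# Similarity of the two contradicting statements to each other.
mutual = cosine(embed(index[0]), embed(index[1]))
```

Both entries in scores are equal, and mutual is 0.8: by the only measure the index has, the two facts are nearly the same thing. Nothing in the vector space encodes that they cannot both be true.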
The most common workarounds are timestamp-based metadata filtering (always prefer the most recent version of a fact), entity-based key-value overwriting (treat facts about a named entity as mutable state), or LLM-mediated reconciliation at retrieval time (prompt the model to resolve conflicts in the returned context). All of these require the developer to anticipate which categories of information are subject to contradiction. For an agent that might store arbitrary facts about an arbitrary domain, that is a hard precondition to satisfy.
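The first workaround, preferring the most recent version of a fact, might look like the post-retrieval filter below. The hit format is hypothetical: each hit is assumed to be a dict with "text" and "timestamp", plus an optional "fact_key" that some write-time process attached to mark which mutable fact the text is about.

```python
from typing import Any


def prefer_latest(hits: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the newest hit per fact_key; hits without a key pass through."""
    newest: dict[str, dict[str, Any]] = {}
    untagged: list[dict[str, Any]] = []
    for hit in hits:
        key = hit.get("fact_key")
        if key is None:
            untagged.append(hit)          # no key, so no way to detect conflicts
        elif key not in newest or hit["timestamp"] > newest[key]["timestamp"]:
            newest[key] = hit
    return list(newest.values()) + untagged
```

The filter is trivial; the hard part sits upstream. Somebody had to decide at write time that "deployment target" is a mutable fact and tag it with a key, which is exactly the anticipation the paragraph above describes. Untagged hits flow through unchecked.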
Knowledge graphs handle this more naturally because edges between entities make relationships explicit and inconsistent triples are detectable by graph traversal. The trade-off is that graph construction from unstructured text is itself an imperfect process, and querying a knowledge graph requires either SPARQL/Cypher queries or a translation layer that converts natural language to graph queries.
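With facts held as explicit triples, one simple form of conflict detection is a grouping pass. This sketch assumes (subject, predicate, object) tuples and a hand-declared set of single-valued predicates; FUNCTIONAL_PREDICATES and find_conflicts are hypothetical names, and a real graph database would express the same check as a query or a schema constraint.

```python
from collections import defaultdict

# Predicates that may hold only one object per subject. This is an assumption
# the developer must declare; the graph cannot infer it on its own.
FUNCTIONAL_PREDICATES = {"deployment_target"}

Triple = tuple[str, str, str]


def find_conflicts(triples: list[Triple]) -> dict[tuple[str, str], set[str]]:
    """Return (subject, predicate) keys that have more than one object."""
    objects: dict[tuple[str, str], set[str]] = defaultdict(set)
    for subject, predicate, obj in triples:
        objects[(subject, predicate)].add(obj)
    return {
        key: values
        for key, values in objects.items()
        if key[1] in FUNCTIONAL_PREDICATES and len(values) > 1
    }
```

Given triples saying a service's deployment_target is both AWS and GCP, the function reports that key with both values, while a multi-valued predicate such as a list of languages is left alone. Detection is cheap once the triples exist; extracting correct triples from free text is where the imperfection enters.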
Agent-Controlled Memory as a Different Philosophy
The MemGPT paper (the system is now called Letta) takes a different approach to all of this. Rather than building a memory infrastructure that operates around the agent, it makes memory management an explicit part of the agent’s behavior. Memory is split into tiers: a fixed system prompt and an editable core memory block for key facts, both held inside the context window, plus searchable archival storage outside it for older content. The agent gets tool functions: core_memory_append, core_memory_replace, archival_memory_insert, archival_memory_search.
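To make the tool surface concrete, here is a minimal in-process sketch. The method names follow the tool names listed above, but the signatures, the character budget, and the substring search are assumptions for illustration and are not Letta's actual API; a real system would count tokens and search archival storage by embedding.

```python
class AgentMemory:
    """Hypothetical backing store for the four memory tools."""

    def __init__(self, core_limit_chars: int = 2000) -> None:
        self.core: str = ""               # always included in the prompt
        self.archival: list[str] = []     # outside the prompt, searched on demand
        self.core_limit_chars = core_limit_chars

    def core_memory_append(self, content: str) -> None:
        candidate = (self.core + "\n" + content).strip()
        if len(candidate) > self.core_limit_chars:
            # Surface the pressure to the agent instead of silently truncating.
            raise ValueError("core memory full; archive or replace something first")
        self.core = candidate

    def core_memory_replace(self, old: str, new: str) -> None:
        if old not in self.core:
            raise ValueError(f"{old!r} not found in core memory")
        self.core = self.core.replace(old, new)

    def archival_memory_insert(self, content: str) -> None:
        self.archival.append(content)

    def archival_memory_search(self, query: str, limit: int = 5) -> list[str]:
        # Case-insensitive substring match stands in for embedding search.
        needle = query.lower()
        return [m for m in self.archival if needle in m.lower()][:limit]
```

Each method would be exposed to the model as a callable tool. The errors matter as much as the happy path: a "core memory full" result is the signal that prompts the agent to decide what to move to archival storage.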
When context pressure builds, the agent decides what to archive and what to keep. The decision is not made by a rule or a scoring function external to the agent; the agent itself reasons about it. This sidesteps the question of how to design a universal importance function, because the agent’s existing reasoning capabilities are deployed for the task.
The obvious limitation is that this requires the agent to be reliably self-aware about its own memory needs. Weaker models, or models under time pressure, may make poor decisions about what to retain. There is also a cost to context: the memory management tool calls consume tokens that could otherwise be used for the task. For systems where memory operations are infrequent, this is a reasonable trade. For high-frequency interactive agents, the overhead compounds.
What This Means in Practice
The right memory architecture depends heavily on what kind of continuity the agent actually needs.
For short-lived assistants with no cross-session requirements, a conversation buffer with optional summarization on session close is sufficient. For agents that need to personalize across sessions, a hybrid approach works best: a structured key-value or entity store for facts that change (preferences, account state, recent decisions), plus a vector store for episodic recall of past interactions. For domain-specific agents grounded in a fixed knowledge base, pure RAG is the appropriate tool and episodic memory adds noise without benefit.
For fully autonomous agents that act in the world over long periods, the MemGPT-style agent-controlled approach combined with importance scoring at write time is the most defensible architecture, at the cost of being the most expensive and the most dependent on model quality.
The consistent thread is that retrieval quality is downstream of write-time decisions. What you choose to store, in what form, and with what metadata determines what is retrievable and how reliably. Frameworks and benchmarks have converged on good primitives for the read path. The write path, the question of what matters and how to encode it, is still largely an open problem that each system solves for itself.