The standard treatment of AI agent memory begins with a taxonomy of types and mechanisms. This piece on tombedor.dev covers that terrain well: in-context memory, episodic memory, semantic memory, procedural memory, and the retrieval mechanisms that serve each. It is a useful orientation to the design space.
The harder problems sit in the policy layer: when to write a memory, what exactly to write, how to consolidate overlapping entries, and when to discard information that has become stale or contradictory. These decisions determine whether a memory system genuinely improves agent behavior over time or accumulates noise that degrades retrieval quality and eats context budget.
Context Window as Default Memory
The simplest memory system for an LLM agent is no external memory at all. For short-lived interactions, the context window is sufficient, and as windows have expanded substantially over the past two years, the class of problems solvable by in-context approaches has widened. Claude 3.5 Sonnet supports 200k tokens; Gemini 1.5 Pro reaches 1M.
Two problems limit context-as-memory for serious agent work. First, token cost scales linearly with context length for most providers. Second, attention quality degrades on information buried in the middle of long contexts. The “lost in the middle” problem, documented by Liu et al. in 2023, shows measurably weaker retrieval for content that falls between the strong primacy and recency zones of a context window. A 200k context is not 200k of uniformly reliable memory; you get reliable recall at the boundaries and degraded recall in the interior.
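One common mitigation for the interior-recall problem is to order retrieved items so the strongest ones land at the edges of the prompt. What follows is a minimal sketch of that idea, not a method from the Liu et al. paper. It assumes the input is already sorted from most to least relevant, and the function name `reorder_for_edges` is hypothetical.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_edges(ranked: List[T]) -> List[T]:
    """Place the most relevant items at the start and end of the list.

    `ranked` is assumed to be sorted from most to least relevant. Items are
    dealt alternately to the front and the back, so the weakest items end up
    in the middle of the returned list.
    """
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item sits at the very end.
    return front + back[::-1]
```

This only rearranges what is already in the prompt. It does nothing about token cost, which still grows with every item included.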
For agents that maintain state across long sessions or across entirely separate invocations, the context window alone fails. External memory systems are the answer, but the taxonomy does not say which kind to use, how to structure it, what retrieval mechanism to pair it with, or on what schedule to update it.
The MemGPT Framing
The most conceptually clear treatment of agent memory comes from the MemGPT paper by Packer et al. (2023), which forms the basis of the Letta framework. MemGPT draws an explicit analogy to operating system memory management: the LLM’s context window is RAM, external storage is disk, and the agent manages its own memory hierarchy through explicit page-in and page-out operations. When the context fills, the agent decides what to evict based on relevance to the current task.
This framing is valuable because it makes the policy question visible. In a conventional OS, page eviction is handled by hardware-informed algorithms: LRU, the clock algorithm, working set models. In MemGPT, the LLM issues the eviction decisions itself. That is flexible. It is also expensive in tokens, adds latency to every memory operation, and means the model can make poor eviction choices, particularly when it cannot know in advance what will be needed later in the session.
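For contrast with model-driven eviction, here is a sketch of the fixed-policy baseline: a context buffer that pages out the least recently used item once a token budget is exceeded. This is not MemGPT's mechanism. The class name `PagedContext`, the word-count tokenizer, and the dictionary "disk" are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Dict


class PagedContext:
    """Toy context buffer that evicts least-recently-used items to an archive."""

    def __init__(self, token_budget: int) -> None:
        self.token_budget = token_budget
        self.in_context: "OrderedDict[str, str]" = OrderedDict()  # the "RAM"
        self.archive: Dict[str, str] = {}  # the "disk"

    @staticmethod
    def _tokens(text: str) -> int:
        # Crude stand-in for a real tokenizer: one token per whitespace word.
        return len(text.split())

    def _used(self) -> int:
        return sum(self._tokens(t) for t in self.in_context.values())

    def put(self, key: str, text: str) -> None:
        """Page an item into context, evicting older items if over budget."""
        self.in_context[key] = text
        self.in_context.move_to_end(key)
        # Always keep the newest item, even if it alone exceeds the budget.
        while self._used() > self.token_budget and len(self.in_context) > 1:
            old_key, old_text = self.in_context.popitem(last=False)
            self.archive[old_key] = old_text  # page out

    def get(self, key: str) -> str:
        """Read an item, paging it back in from the archive if needed."""
        if key in self.in_context:
            self.in_context.move_to_end(key)
            return self.in_context[key]
        text = self.archive.pop(key)  # raises KeyError for unknown keys
        self.put(key, text)
        return text
```

LRU costs no tokens and adds no model latency, but it has the same blind spot described above: it cannot know that an item untouched for an hour is about to matter.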
The analogy also clarifies something usually left implicit: memory hierarchy is a tradeoff between speed and capacity. In-context memory is fast and small. Vector retrieval is slower, larger, and introduces retrieval error. Designing a memory system means deciding how to move information up and down that hierarchy based on predicted access patterns, and every choice gives something up on one axis to gain on another.
The Write Side
Most published guidance on agent memory focuses on retrieval: embedding models, chunking strategies, hybrid BM25 plus dense vector search, re-ranking, metadata filtering. Tools like Mem0, Zep, and pgvector have made the read side progressively more tractable. Managed memory services can return relevant context with low latency, blending keyword and semantic search in ways that required significant custom infrastructure a few years ago.
The write side attracts less attention and causes more problems in practice.
The core question is: when should an agent write a memory, and what should it write? Three common approaches each have distinct failure modes.
Write everything. Log every message, every tool call, every observation. Simple to implement. Storage grows without bound, and retrieval quality degrades as the index fills with low-signal content. A vector index with 50,000 undifferentiated conversation turns returns worse results than one with 500 well-curated entries.
Write on explicit signal. Store information only when the user asks or when a predefined trigger fires. Predictable, low noise. Systematically misses implicit context that would prove useful later, because users do not narrate what is worth remembering.
Write based on LLM judgment. After each interaction, ask the model to extract what is worth retaining. More intelligent than the alternatives. Also expensive in tokens, inconsistent across similar interactions, and subject to its own context limitations during the extraction step.
Production systems typically mix all three. LangChain’s memory primitives offer ConversationBufferMemory (write everything), ConversationSummaryMemory (rolling LLM-compressed summary), and VectorStoreRetrieverMemory (embedding-indexed storage). Each trades off differently on the write problem, but none offers a principled answer to what is worth storing in the first place. That question requires judgment the framework cannot supply.
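One way to mix the three approaches is a layered gate that tries the cheap checks first and reaches for model judgment last. The sketch below is an illustration under stated assumptions, not a pattern taken from LangChain or any other framework. The trigger phrases, the `min_words` cutoff, and the `judge` callable (a stand-in for an LLM extraction call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WriteDecision:
    store: bool
    reason: str


# Hypothetical trigger phrases; a real system would tune these.
EXPLICIT_TRIGGERS = ("remember that", "note that", "don't forget")


def decide_write(
    message: str,
    judge: Optional[Callable[[str], bool]] = None,
    min_words: int = 4,
) -> WriteDecision:
    """Layered write gate: explicit signal, cheap filter, then model judgment."""
    lowered = message.lower()
    # 1. Explicit signal: always store, no model call needed.
    if any(trigger in lowered for trigger in EXPLICIT_TRIGGERS):
        return WriteDecision(True, "explicit")
    # 2. Cheap filter: drop very short messages without spending tokens.
    if len(message.split()) < min_words:
        return WriteDecision(False, "too_short")
    # 3. Model judgment: `judge` stands in for an LLM extraction call.
    if judge is not None and judge(message):
        return WriteDecision(True, "judged")
    return WriteDecision(False, "no_signal")
```

Ordering the checks this way keeps the expensive step off the common path, but the gate is only as good as the judge behind it, so the question of what is worth storing remains open.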
Consolidation and Contradiction
Episodic memory stores events with temporal context: on Tuesday a user asked about a database migration, on Thursday they revised a key assumption. Semantic memory stores persistent facts: this project targets PostgreSQL, this user prefers concise explanations. The conversion from episodic to semantic, called consolidation, is one of the harder problems in memory system design.
Human memory consolidation happens during sleep, with the hippocampus replaying episodic records and integrating them into cortical semantic storage. AI memory systems have no equivalent idle process, though some experimental systems run off-peak consolidation passes where an LLM reviews recent episodic logs and extracts generalizations. Whether the extracted generalizations are accurate, internally consistent, or compatible with prior semantic memory is an open problem that consolidation passes alone do not close.
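A consolidation pass does not have to start with an LLM. As a deliberately simple stand-in, the sketch below promotes a claim to semantic memory once it has appeared in a minimum number of distinct sessions. The `Episode` type, the normalized `claim` strings, and the `min_sessions` threshold are assumptions for illustration, and recurrence is only a weak proxy for accuracy.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Episode:
    session_id: str
    claim: str  # normalized statement from a session, e.g. "db=postgresql"


def consolidate(episodes: Iterable[Episode], min_sessions: int = 2) -> List[str]:
    """Promote claims that recur in at least `min_sessions` distinct sessions."""
    sessions_by_claim: Dict[str, Set[str]] = defaultdict(set)
    for ep in episodes:
        # A set ignores repeats within one session; only distinct sessions count.
        sessions_by_claim[ep.claim].add(ep.session_id)
    return sorted(
        claim
        for claim, sessions in sessions_by_claim.items()
        if len(sessions) >= min_sessions
    )
```

A pass like this says nothing about whether the promoted claims agree with what semantic memory already holds, which is the part the paragraph above flags as unsolved.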
Contradiction handling sits adjacent to consolidation. An agent stores that a user prefers brief responses, then that user spends three sessions requesting detailed technical explanations. Naive systems append both facts. A production system needs to detect the conflict, determine which record is more current or more reliable, and update accordingly.
Mem0’s memory update operations address this by comparing new observations against the current store and deciding whether to add, modify, or delete existing entries rather than always appending. This outperforms append-only storage, but it remains fragile for subtle or gradual contradictions where neither the old nor the new record is obviously incorrect. Detecting that a belief has been superseded requires tracking its provenance and age, and most memory systems do not maintain that metadata with enough fidelity to reason over it.
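To make the provenance point concrete, here is a minimal sketch of a store that keeps a timestamp and a source on every fact and retains superseded versions. It is not Mem0's implementation. The `Fact` type, the `upsert` function, and the action names it returns are hypothetical, and the "newer wins" rule is the simplest possible resolution strategy.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class Fact:
    key: str
    value: str
    observed_at: datetime
    source: str  # provenance, e.g. a session or message id
    history: List["Fact"] = field(default_factory=list)  # superseded versions


def upsert(store: Dict[str, Fact], new: Fact) -> str:
    """Apply a new observation and report which action was taken."""
    current = store.get(new.key)
    if current is None:
        store[new.key] = new
        return "add"
    if current.value == new.value:
        # Same belief seen again: refresh it if this sighting is newer.
        if new.observed_at > current.observed_at:
            current.observed_at = new.observed_at
            current.source = new.source
        return "confirm"
    if new.observed_at > current.observed_at:
        # Conflict and the new record is newer: replace, but keep the old
        # record in history so the change stays auditable.
        new.history = current.history + [current]
        current.history = []
        store[new.key] = new
        return "update"
    # Conflict, but the incoming record is older: file it, keep current.
    current.history.append(new)
    return "ignore_stale"
```

This handles the abrupt case. The gradual drift described above, where neither record is clearly wrong, would need more than a timestamp comparison, for example counting how many recent observations support each value.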
Forgetting as Policy
Human memory forgets by design. The Ebbinghaus forgetting curve describes roughly exponential decay of memory strength without reinforcement, and spaced repetition systems exploit this property, scheduling review at the point where decay would otherwise cause loss. Forgetting is load-shedding under bounded resources, not a deficiency.
Most agent memory systems treat forgetting as a failure mode. The implicit design goal is total retention. This produces indexes that grow indefinitely and gradually become noisier, more expensive to search, and harder to keep consistent as contradictions accumulate.
A more considered design implements decay: memories not accessed over time lose retrieval weight and are eventually compressed or pruned. This mirrors how human episodic memories that are never recalled tend to fade while frequently accessed memories consolidate into durable semantic representations. LangChain’s TimeWeightedVectorStoreRetriever implements a basic version of this, where a recency term in the retrieval score decays with the time since a memory was last accessed. In practice, decay-based memory management remains a peripheral concern rather than a first-class design decision in most agent frameworks. You have to go looking for it, and the defaults push toward full retention.
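A decay policy can be sketched in a few lines. The version below adds an exponentially shrinking recency weight to a similarity score and prunes memories whose weight has fallen below a floor. It is similar in spirit to time-weighted retrieval but is not LangChain's code. The `Memory` type, the default `decay_rate` of 0.01 per hour, and the `min_weight` floor are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Memory:
    text: str
    similarity: float  # query similarity in [0, 1], computed elsewhere
    last_accessed: datetime


def recency_weight(last_accessed: datetime, now: datetime,
                   decay_rate: float = 0.01) -> float:
    """Weight in (0, 1] that shrinks exponentially with hours since last access."""
    hours_idle = max((now - last_accessed).total_seconds() / 3600.0, 0.0)
    return (1.0 - decay_rate) ** hours_idle


def decayed_score(mem: Memory, now: datetime, decay_rate: float = 0.01) -> float:
    """Retrieval score: similarity plus the recency weight."""
    return mem.similarity + recency_weight(mem.last_accessed, now, decay_rate)


def prune(memories: List[Memory], now: datetime, min_weight: float,
          decay_rate: float = 0.01) -> List[Memory]:
    """Keep only memories whose recency weight is still at or above `min_weight`."""
    return [
        m for m in memories
        if recency_weight(m.last_accessed, now, decay_rate) >= min_weight
    ]
```

Because the weight is keyed to last access rather than creation time, a memory that keeps getting retrieved stays strong, which is the reinforcement half of the forgetting curve. A real system would probably compress pruned items into a summary instead of dropping them outright.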
What the Tooling Misses
When I built the memory layer for a long-running Discord bot, the problems I kept hitting were not on the retrieval side. Given a reasonable embedding model and a working vector store, retrieval returns useful results. The friction was consistently in policy: what triggers a write without spending more tokens on the write decision than on the original interaction, how to detect when a stored fact has been superseded by something said two weeks later, and how to keep the index lean enough that retrieval stays fast without discarding information that turns out to matter months down the line.
The infrastructure for memory-enabled agents is now mature enough to build production systems on. Vector databases, embedding APIs, retrieval frameworks, and managed memory services have converged on reasonable defaults. The guidance that has not kept pace is in policy: write triggers, consolidation schedules, contradiction resolution strategies, and decay functions. These are the decisions that determine whether a memory system compounds in value or compounds in noise.
The tombedor.dev piece provides a clear map of the architecture choices available. The more pressing conversation, and where the interesting research is concentrated, is about the policies that run above that architecture. Those policies do not fall out of infrastructure selection. The field is still working out what it would mean to get them consistently right, and that gap is where memory systems fail in practice.