Most conversations about agent memory converge on retrieval: vector databases, semantic search, hybrid BM25-plus-embedding approaches, rerankers. The tooling ecosystem has followed suit, with products like Pinecone, Weaviate, and Chroma all competing on retrieval quality. This focus makes sense at a surface level, because retrieval is where failures are visible. When an agent pulls the wrong memory or misses a relevant one, the resulting response is obviously wrong. The write side fails silently.
Tombedor’s overview of agent memory approaches lays out the classification space clearly: in-context vs. external storage, episodic vs. semantic memory, the tradeoffs at each layer. It is a useful map. What it points toward, without fully exploring, is that the interesting engineering work happens before retrieval: in deciding what to write, when to write it, and what form the stored representation should take.
The Stateless Baseline
An LLM call is inherently stateless. Everything the model knows about a conversation comes from what you put in the context window. This is both the simplicity and the limitation of the baseline case: you accumulate messages, and the model responds based on all of them. No external system, no retrieval, no write logic.
This works for short conversations. For anything longer, two problems appear. First, context windows have hard limits. Even at 200k tokens, a sufficiently long interaction will eventually overflow. Second, and less discussed, long context hurts retrieval quality even when it fits. The “Lost in the Middle” research from 2023 showed that LLMs attend unevenly to long input sequences, with content in the middle of the context receiving less weight than content at the start and end. Putting everything in context is not the same as the model reliably using everything in context.
The Taxonomy
Cognitive science distinguishes between episodic memory (records of specific events: what happened, when, in what sequence) and semantic memory (facts about the world, disconnected from the episode that introduced them). Working memory is a third category: what is actively held in mind right now, with limited capacity.
The mapping to agents is approximate but useful. In-context content is working memory. External storage splits into raw conversation logs (episodic) and extracted facts or user preferences (semantic). Procedural memory, how to do things, mostly lives in system prompts and tool schemas and is less often discussed in memory architecture conversations.
The practical difference between episodic and semantic storage matters for retrieval strategy. Episodic retrieval often wants chronological ordering or recency weighting: what happened in this conversation last week. Semantic retrieval wants conceptual proximity: what do I know about this user’s preferences. Treating them the same, as many implementations do by storing everything in one vector store, means using a tool designed for one shape of problem to solve two different ones.
MemGPT’s Tiered Model
The MemGPT paper, now productized as Letta, introduced a tiered architecture that maps more carefully to the cognitive model. Core memory, the agent’s main context, is always in the prompt. Archival memory is external vector storage, queried explicitly by the agent using a tool call. Recall memory is conversation history, also retrieved via tool call rather than always included.
The key architectural decision is that the agent controls its own memory operations. Rather than a framework deciding what to retrieve and when, the model itself calls memory_search() and memory_insert() as tools, alongside other capabilities. This makes memory a first-class part of the agent’s action space rather than invisible scaffolding.
The tradeoff is that this requires the model to reason about when to look something up. A poorly prompted or less capable model will either over-retrieve, adding latency and context noise, or under-retrieve, missing relevant history. The memory system’s quality becomes partly a function of the model’s metacognitive ability. With weaker models, a more automatic retrieval approach, pulling relevant memories before each turn based on the current input, often outperforms in practice even though it is architecturally less elegant.
The Write Problem
Here is where most implementations are underbuilt. Retrieval quality is bounded by storage quality. If what you have stored is raw conversation turns, retrieval is essentially a search over chat logs: high recall, low precision, noisy context. If what you have stored is extracted, structured facts, retrieval is more precise but only as good as the extraction step.
Extraction means running an LLM pass over conversation content to pull out memorable facts: user preferences, corrections, stated goals, important context. This adds latency and cost at write time. It also introduces extraction failures. The model might miss something important, store something wrong, or store a contradiction of something stored earlier without recognizing the conflict.
The conflict case is particularly tricky. If a user tells an agent one thing in January and the opposite in March, what should the memory system do? Overwrite? Append with a timestamp and let the model sort it out at read time? Flag as conflicting? Most implementations pick one of these without making the choice explicit, which means the behavior under contradiction is undefined. Systems like mem0 attempt structured conflict resolution, versioning stored facts and prioritizing more recent ones, but this remains an active area without a clear standard approach.
Staleness is a related problem. Semantic memories about a user’s preferences or situation can become wrong over time, and nothing in a typical vector store signals that a memory is outdated. The stored embedding for “user prefers short responses” will keep matching “user preferences” queries long after the user has changed their mind and said so in a more recent session. There is no equivalent of a cache expiry for embedded knowledge.
Retrieval Is the Visible Problem
Once the write side is handled reasonably well, retrieval reduces to a search problem with well-understood tradeoffs. Pure vector search gives semantic proximity but can miss exact matches and is sensitive to embedding model quality. BM25 keyword search gives exact matching but no semantic generalization. Hybrid approaches combine both, with a weighting parameter that becomes another hyperparameter to tune per application.
pgvector has become a practical default for teams already running PostgreSQL: it adds vector search as a native extension, letting you query on both structured columns and semantic similarity in the same statement. This matters when you want to filter episodic memories by user ID and date range before ranking by semantic similarity. A single query like WHERE user_id = $1 AND created_at > $2 ORDER BY embedding <=> $3 handles both constraints without a second-pass filter in application code.
Reranking, running a second model pass to score retrieved candidates for relevance to the current query, improves precision at the cost of added latency. Whether the latency is acceptable depends on the application. In an interactive chat context, adding 300-500ms for a reranker pass is often not worth it. In an async or background processing context, it frequently is.
Practical Observations
Building the memory layer for a Discord bot brings a specific set of constraints into focus. Conversation turns are short, often fragmented across channels, and lack the long coherent structure that many memory extraction approaches assume. A user corrects the bot in one message, asks an unrelated question, then returns to the correction topic three hours later in a different channel. Episodic logs capture this faithfully but make extraction hard. Semantic extraction has to work on short, context-dependent utterances without full conversation context around them.
The approach that works best in practice: extract at natural breaks, such as the end of a conversation thread or when an explicit correction or preference statement appears, store structured key-value facts alongside the raw turn for fallback, and apply a recency weight that decays stored facts’ retrieval scores over time. It is not elegant, but it handles the cases that purely semantic approaches miss.
The framing in Tombedor’s article is accurate: the design space is defined by tradeoffs between storage types and retrieval mechanisms. The lesson from working with these systems is that the tradeoffs compound. A decision about storage format constrains retrieval strategy, which constrains what you can usefully extract, which constrains what gets stored. The architecture has to be designed end-to-end, not assembled from independently chosen components. That is harder than it sounds, and it is why the retrieval problem keeps getting attention while the write problem keeps causing failures.