The Retrieval Problem Is the Hard Part of Agent Memory

When I started adding memory to my Discord bots, the first instinct was to reach for a vector database. Store everything, embed it, do a similarity search at query time. It works well enough in demos. In production it produces confident nonsense, because the retrieval problem is harder than the storage problem, and most of the writing on agent memory skips past it.

Tom Bedor’s overview of approaches to agent memory is a good map of the taxonomy. This post is about the gap that taxonomy leaves open: how the retrieval decision gets made, what breaks when it goes wrong, and what a more principled approach looks like.

The Fundamental Tension

Context windows are fast, coherent, and immediately accessible to the model. They are also finite. An external memory store can hold arbitrarily much, but every retrieval is an approximation. You are asking the system to decide, before the model has reasoned about the query, which subset of its accumulated knowledge is relevant. That decision is made with less information than the model will have once it reads the retrieved results. The ordering is backwards, and there is no clean way to fix it.

The four broad categories of agent memory map onto this tension:

In-context (working memory): what is in the prompt right now; fast, coherent, expensive per token
External vector store (episodic/semantic memory): ChromaDB, Pinecone, Weaviate; large but retrieval is approximate
Structured KV store (procedural/factual memory): deterministic lookup, no approximation, but requires knowing the key
Parametric memory (weights): baked into the model at training time; unchangeable at inference

For most stateful bot use cases, parametric memory is irrelevant and KV stores handle a narrow slice of well-structured facts. The interesting design space is the boundary between in-context and vector retrieval.

What MemGPT Gets Right

MemGPT (now Letta) reframes this as an OS memory management problem. The model has a fixed “main context” analogous to RAM, and the external store is treated like disk. The system exposes explicit functions to the model for paging memory in and out: archival_memory_search, archival_memory_insert, core_memory_replace. The model itself decides what to retrieve and when, rather than having a retrieval step run automatically before every inference.

This is a meaningful inversion. Instead of pre-retrieval (retrieve, then reason), MemGPT does in-reasoning retrieval: the model reads the current context, determines it needs more information, and issues a retrieval call as part of its reasoning chain. The retrieval is conditioned on the full context the model already has, which is a much richer query signal than the raw user input alone.

The cost is latency and token overhead. Multi-step reasoning with retrieval calls in the loop is slower than a single-pass pre-retrieval approach. For a Discord bot responding in a conversational channel, that latency budget matters. For an async task agent, it usually does not.
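
To make the control flow concrete, here is a minimal sketch of in-reasoning retrieval with a hard cap on retrieval calls, which is one way to bound the latency cost. It is not Letta's API: `run_turn`, `model_step`, `search_archive`, and the action dictionaries are hypothetical names, and the model and the archive are stubbed out as plain callables.

```python
from typing import Callable

# Hypothetical action shape returned by the model wrapper:
#   {"type": "search", "query": "..."}  or  {"type": "answer", "text": "..."}
Action = dict


def run_turn(
    user_input: str,
    model_step: Callable[[list[str]], Action],
    search_archive: Callable[[str], list[str]],
    max_retrievals: int = 3,
) -> tuple[str, int]:
    """Run one turn, letting the model request retrievals mid-reasoning.

    Returns the answer text and the number of retrieval calls performed.
    """
    context = [f"user: {user_input}"]
    for retrievals in range(max_retrievals + 1):
        action = model_step(context)
        if action["type"] == "answer":
            return action["text"], retrievals
        if retrievals == max_retrievals:
            break  # budget spent; do not run another search
        hits = search_archive(action["query"])
        # The query was written by the model after reading the context,
        # so it carries more signal than the raw user input.
        context.append(f"retrieved({action['query']}): {hits}")
    return "I could not find enough information to answer.", max_retrievals


# Stubs standing in for a real model and a real archival store.
def fake_model(context: list[str]) -> Action:
    if not any(line.startswith("retrieved(") for line in context):
        return {"type": "search", "query": "deployment target"}
    return {"type": "answer", "text": "You deploy to Fly.io."}


def fake_search(query: str) -> list[str]:
    return ["user migrated from Heroku to Fly.io"]


answer, calls = run_turn("Where do I deploy?", fake_model, fake_search)
```

Lowering `max_retrievals` for conversational channels and raising it for async tasks is one way to express the latency budget in a single parameter.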

Embedding Similarity Is Not Semantic Relevance

The assumption under naive vector retrieval is that embedding distance approximates semantic relevance. This breaks in several concrete ways for agent memory.

First, the embedding model was trained on a different distribution than your memory corpus. Conversation snippets, code fragments, and structured notes produce embeddings in different regions of the space. A query about a user’s preferred programming language may return conversation fragments about programming in general rather than the specific preference statement you stored six weeks ago.

Second, top-k retrieval has no notion of redundancy, either with what the agent already has in context or among the results themselves. If the same fact has been stored multiple times in slightly different forms, top-k will return several near-duplicate chunks, consuming context budget with redundant information and crowding out genuinely diverse memories.

Third, similarity ignores time. A memory from yesterday is not inherently more relevant than one from three months ago, but in many conversational contexts recency should count as a signal, and pure cosine similarity has no temporal dimension.

A more robust retrieval strategy combines several signals:

from datetime import datetime, timezone
import numpy as np

def score_memory(memory: dict, query_embedding: list[float], now: datetime) -> float:
    """
    Score a memory candidate combining cosine similarity, recency, and
    an explicit importance weight set at storage time.
    """
    mem_vec = np.array(memory["embedding"])
    q_vec = np.array(query_embedding)
    cosine_sim = float(
        np.dot(mem_vec, q_vec) / (np.linalg.norm(mem_vec) * np.linalg.norm(q_vec))
    )

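    # Recency decay: half-life of 30 days. Fractional days keep same-day
    # memories ordered; created_at must be timezone-aware to match `now`.
    age_days = max((now - memory["created_at"]).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / 30.0)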

    importance = memory.get("importance", 0.5)

    return 0.5 * cosine_sim + 0.3 * recency + 0.2 * importance


def retrieve_memories(
    query_embedding: list[float],
    candidates: list[dict],
    top_k: int = 5,
    dedup_threshold: float = 0.97,
) -> list[dict]:
    now = datetime.now(timezone.utc)
    scored = [(score_memory(m, query_embedding, now), m) for m in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)

    results = []
    for score, mem in scored:
        too_similar = any(
            np.dot(np.array(mem["embedding"]), np.array(r["embedding"])) /
            (np.linalg.norm(mem["embedding"]) * np.linalg.norm(r["embedding"])) > dedup_threshold
            for r in results
        )
        if not too_similar:
            results.append(mem)
        if len(results) >= top_k:
            break

    return results

The importance weight requires discipline at write time: something has to decide what matters enough to flag. In my experience, a second LLM call at storage time that rates the importance of what just happened has produced better results than trying to score importance after the fact.
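
A minimal sketch of that write-time rating step, under some assumptions: `llm` is a hypothetical callable wrapping whatever model client you use, the 1 to 10 scale and the prompt wording are illustrative choices, and the result is normalized to the 0 to 1 range that the `importance` field in the scoring function expects.

```python
import re
from typing import Callable


def rate_importance(
    text: str,
    llm: Callable[[str], str],
    default: float = 0.5,
) -> float:
    """Ask a model to rate a memory from 1 to 10 and map it onto [0, 1]."""
    prompt = (
        "Rate from 1 to 10 how important the following is to remember "
        "for future conversations. Reply with a single integer.\n\n" + text
    )
    reply = llm(prompt)
    match = re.search(r"\d+", reply)
    if match is None:
        # Unparseable reply: fall back to a neutral weight instead of failing
        # the write path.
        return default
    rating = min(max(int(match.group()), 1), 10)
    return (rating - 1) / 9.0


# Stub model for illustration; a real call would go to your LLM client.
importance = rate_importance("User said they migrated to Fly.io.", lambda p: "8")
```

Falling back to the default on a bad reply keeps a flaky rating call from blocking storage, at the cost of some unweighted memories.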

Memory Staleness and Invalidation

Stale memories are worse than no memories. If a user told the bot their preferred deployment target was Heroku eighteen months ago and has since migrated to Fly.io, a retrieval that surfaces the old preference will produce confidently wrong behavior. The model has no way to know the memory is stale; it treats retrieved context as ground truth.

A few partial mitigations work in combination:

Versioned facts with explicit supersession. When storing a fact about a user or system state, check whether a similar fact already exists and mark the old version as superseded. This requires the storage step to do a similarity search before writing, which adds overhead but prevents silent accumulation of contradictory facts.

TTL by memory type. Episodic memories should expire quickly. Semantic memories should be long-lived but periodically reviewed. Storing a ttl_days field at write time and filtering expired memories before retrieval is low-overhead and catches the obvious cases.

Contradiction detection at retrieval time. Before injecting retrieved memories into context, check whether any retrieved memories directly contradict each other on the same subject. If so, surface the most recent one and flag the conflict. This can be approximated with deterministic field matching for structured facts.
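
The TTL filter and the field-matching contradiction check can be sketched together. This assumes structured memories carry `subject`, `attribute`, and `value` fields alongside `created_at` and an optional `ttl_days`; those field names, and the function names `is_expired` and `resolve_conflicts`, are illustrative rather than taken from any library.

```python
from datetime import datetime, timedelta, timezone


def is_expired(memory: dict, now: datetime) -> bool:
    """A memory without ttl_days never expires."""
    ttl_days = memory.get("ttl_days")
    if ttl_days is None:
        return False
    return now - memory["created_at"] > timedelta(days=ttl_days)


def resolve_conflicts(memories: list[dict], now: datetime):
    """Drop expired memories, then keep only the newest value for each
    (subject, attribute) pair. Returns (kept, conflicting_keys)."""
    newest: dict[tuple, dict] = {}
    conflicts: set[tuple] = set()
    unstructured = []
    for m in memories:
        if is_expired(m, now):
            continue
        if "subject" not in m or "attribute" not in m:
            unstructured.append(m)  # free text: no deterministic check possible
            continue
        key = (m["subject"], m["attribute"])
        current = newest.get(key)
        if current is None:
            newest[key] = m
            continue
        if current["value"] != m["value"]:
            conflicts.add(key)  # flag the disagreement for logging or review
        if m["created_at"] > current["created_at"]:
            newest[key] = m
    return unstructured + list(newest.values()), sorted(conflicts)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
retrieved = [
    {"subject": "user:42", "attribute": "deploy_target", "value": "Heroku",
     "created_at": now - timedelta(days=540)},
    {"subject": "user:42", "attribute": "deploy_target", "value": "Fly.io",
     "created_at": now - timedelta(days=30)},
    {"text": "debugged a webhook timeout", "ttl_days": 7,
     "created_at": now - timedelta(days=10)},
    {"text": "prefers terse answers", "ttl_days": 90,
     "created_at": now - timedelta(days=10)},
]
kept, conflicts = resolve_conflicts(retrieved, now)
```

Free-text memories pass through unchecked here; catching contradictions among those would need a model call or some other non-deterministic comparison.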

In the bot codebase I maintain, the most practical approach turned out to be a combination of explicit supersession for user preference facts stored in a KV store, and TTL-based expiry for episodic vector memories. The KV store handles the cases where correctness matters; the vector store handles fuzzy recall where a stale result is a degraded experience rather than a broken one.
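
A minimal sketch of explicit supersession for preference facts, using SQLite from the standard library. The table layout and the `set_fact` and `get_fact` helpers are assumptions for illustration. Matching is by exact subject and attribute, which is what a KV store allows, rather than by the similarity search described earlier for free-form facts.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS facts (
    id INTEGER PRIMARY KEY,
    subject TEXT NOT NULL,
    attribute TEXT NOT NULL,
    value TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    superseded_by INTEGER REFERENCES facts(id)
)
"""


def set_fact(conn: sqlite3.Connection, subject: str, attribute: str, value: str) -> int:
    """Insert a new fact and mark any live older version as superseded."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO facts (subject, attribute, value) VALUES (?, ?, ?)",
            (subject, attribute, value),
        )
        new_id = cur.lastrowid
        conn.execute(
            "UPDATE facts SET superseded_by = ? "
            "WHERE subject = ? AND attribute = ? "
            "AND id != ? AND superseded_by IS NULL",
            (new_id, subject, attribute, new_id),
        )
    return new_id


def get_fact(conn: sqlite3.Connection, subject: str, attribute: str):
    """Return the current (non-superseded) value, or None if unknown."""
    row = conn.execute(
        "SELECT value FROM facts "
        "WHERE subject = ? AND attribute = ? AND superseded_by IS NULL",
        (subject, attribute),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
set_fact(conn, "user:42", "deploy_target", "Heroku")
set_fact(conn, "user:42", "deploy_target", "Fly.io")
current = get_fact(conn, "user:42", "deploy_target")
```

Old rows are kept rather than deleted, so the history of a preference stays available for auditing while lookups only ever see the live value.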

Memory Distillation

As episodic memories accumulate, the vector store grows, retrieval gets noisier, and storage costs increase. Memory distillation compresses episodic memories into semantic ones: instead of storing fifty individual conversation fragments about a user’s interests, store a single synthesized profile that captures the signal.

This is roughly what human memory does. Specific episodes fade; the abstracted knowledge they contributed persists. For agents, distillation can be triggered on a schedule or when the episodic store for a given entity exceeds a size threshold. The distillation step is an LLM call that reads a batch of episodic memories and writes a condensed semantic summary.

Distillation discards the specific context of individual episodes. For debugging and auditability, keeping episodic memories in cold storage even after distillation is worth the cost.
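
A threshold-triggered distillation pass might look like the sketch below. `summarize` stands in for the LLM call, and `maybe_distill`, the threshold of 50, and the choice to keep the 10 most recent episodes in the live store are all assumptions for illustration. The distilled episodes are returned rather than deleted so the caller can move them to cold storage.

```python
from typing import Callable, Optional


def maybe_distill(
    episodes: list[dict],
    summarize: Callable[[list[str]], str],
    threshold: int = 50,
    keep_recent: int = 10,
) -> tuple[Optional[dict], list[dict], list[dict]]:
    """Return (profile, live_episodes, archived_episodes).

    Does nothing until the episodic store for an entity exceeds `threshold`.
    """
    if len(episodes) <= threshold:
        return None, episodes, []
    ordered = sorted(episodes, key=lambda e: e["created_at"])
    split = max(len(ordered) - keep_recent, 0)
    to_distill, live = ordered[:split], ordered[split:]
    profile = {
        "kind": "semantic_profile",
        "text": summarize([e["text"] for e in to_distill]),
        # Keep provenance so a summary can be traced back to its episodes.
        "source_ids": [e["id"] for e in to_distill],
    }
    return profile, live, to_distill


# Stub summarizer; integer timestamps keep the example short.
episodes = [{"id": i, "created_at": i, "text": f"episode {i}"} for i in range(60)]
profile, live, archived = maybe_distill(
    episodes, lambda texts: f"{len(texts)} episodes summarized"
)
```

Recording `source_ids` on the profile is what makes the cold-storage copy useful: a questionable line in the summary can be checked against the episodes that produced it.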

What Actually Works

For a stateful conversational agent built on top of a Discord bot, the architecture that has held up is straightforward in structure if not in execution:

In-context window carries the last N turns plus any explicitly pinned facts. This is always present regardless of retrieval results.
Structured KV store (SQLite for anything beyond a prototype) holds typed facts about users and system state, with explicit supersession on write.
Vector store (ChromaDB locally, Pinecone when scale demands it) holds episodic and semantic memories with hybrid scoring.
Distillation job runs periodically, compressing episodic memories per-user into a semantic profile that gets persisted to the structured store.
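
How these layers meet at prompt-assembly time can be sketched as follows. This is an illustration of the priority ordering, not the bot's actual code: `build_context` is a hypothetical helper, and it budgets by character count as a crude stand-in for tokens. Pinned facts, structured facts, and recent turns are always included; retrieved memories only fill whatever budget is left.

```python
def build_context(
    pinned: list[str],
    recent_turns: list[str],
    facts: dict[str, str],
    retrieved: list[str],
    budget_chars: int = 4000,
    last_n: int = 8,
) -> str:
    """Assemble prompt context with retrieval as the lowest-priority layer."""
    fixed = [f"[pinned] {p}" for p in pinned]
    fixed += [f"[fact] {k} = {v}" for k, v in sorted(facts.items())]
    turns = recent_turns[-last_n:]
    used = sum(len(line) + 1 for line in fixed + turns)

    memories = []
    for text in retrieved:  # assumed to arrive sorted best-first
        line = f"[memory] {text}"
        if used + len(line) + 1 > budget_chars:
            break  # stop at the first memory that does not fit
        memories.append(line)
        used += len(line) + 1

    # Recent turns go last so the latest message sits closest to the reply.
    return "\n".join(fixed + memories + turns)


tight = build_context(
    pinned=["Be concise."],
    recent_turns=["user: hi", "bot: hello"],
    facts={"deploy_target": "Fly.io"},
    retrieved=["x" * 50, "short note"],
    budget_chars=100,
)
roomy = build_context(
    pinned=["Be concise."],
    recent_turns=["user: hi", "bot: hello"],
    facts={"deploy_target": "Fly.io"},
    retrieved=["x" * 50, "short note"],
)
```

With the tight budget, the retrieved memories are dropped while the pinned fact, the structured fact, and both turns survive, which is the degradation order you want when the context window is under pressure.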

Libraries like Mem0 and Zep implement variations of this pattern with more polish. Mem0 has a well-designed memory layer that handles the add/search/update lifecycle and includes automatic contradiction resolution. Zep’s temporal knowledge graph approach addresses the staleness problem more directly by treating memory as a graph of facts with explicit validity windows rather than a flat vector index.

LangChain’s built-in memory abstractions are convenient starting points. ConversationBufferMemory is pure in-context management; VectorStoreRetrieverMemory is naive top-k retrieval with none of the hybrid scoring or deduplication described here. They are reasonable for prototyping and insufficient for production use where memory correctness matters.

The retrieval decision is where most agent memory systems break down. The storage layer is, at this point, a largely solved problem with multiple mature options at every scale. Getting the right facts into context at the right time, without surfacing stale contradictions or drowning the context window in near-duplicates, requires treating retrieval as a first-class design concern rather than an afterthought bolted onto a vector database.