Designing Memory for AI Agents: Where Each Approach Breaks Down

Source: lobsters

When I started building memory into the Discord bot I maintain, I assumed the core problem was storage: where do you persist information between sessions? After several iterations, storage turned out to be the straightforward part. The harder problems are deciding what counts as a memory, how long it should survive, and how to retrieve it without flooding the model with irrelevant context.

Tom Bedor’s piece on approaches to agent memory lays out a useful taxonomy, but the interesting engineering questions emerge once you start implementing. Each approach makes different assumptions about retrieval cost, precision requirements, and acceptable information loss. Misjudging any of those means building a system that either confabulates context it never had or loses things it should have kept.

Four Memory Types, Four Tradeoffs

The cognitive science framing, borrowed heavily by the agent memory literature, divides memory into four types. Each maps to a distinct implementation pattern.

Working memory is the context window: everything the model can attend to right now. It is exact, zero-latency, and ephemeral. For a conversational agent, this is the current message thread, injected user preferences, and the system prompt. The constraint is obvious: context windows are finite and every token costs money. Loading full conversation history on every turn is accurate but does not scale past a few hundred exchanges.
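To make the scaling problem concrete, here is a rough back-of-the-envelope sketch. It assumes every exchange adds the same number of tokens and that the full history is resent on each turn; the figure of 200 tokens per exchange is purely illustrative, and system prompt and output tokens are ignored.

```python
def cumulative_prompt_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total prompt tokens sent over a conversation that resends full history each turn."""
    # Turn k resends k exchanges of history, so the total is t * (1 + 2 + ... + n).
    return tokens_per_turn * turns * (turns + 1) // 2


# Illustrative numbers only: 300 exchanges at 200 tokens each.
print(cumulative_prompt_tokens(300, 200))  # 9030000
```

Under these assumptions the total grows quadratically with the number of turns, which is why per-turn cost keeps climbing even though each new message is small.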

Episodic memory is time-stamped, event-based storage. “User asked about deploying to Fly.io on March 15th.” This maps cleanly to a conversation log. The failure mode is information density: storing everything makes retrieval expensive; summarizing aggressively loses specificity.

Semantic memory stores facts independent of when or how they were acquired. “User prefers TypeScript over JavaScript.” This is where vector databases come in. You embed a fact as a dense vector, persist it, and at query time retrieve the nearest entries by cosine similarity, usually through an approximate nearest neighbor index rather than an exhaustive scan. LangGraph’s persistence layer, Letta (the open-source successor to MemGPT), and LlamaIndex all provide implementations here, with pgvector, Chroma, and Pinecone as common storage backends.
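The embed, persist, and retrieve loop can be sketched in a few lines. This is a minimal in-memory illustration, not how any of the frameworks above implement it: the `SemanticMemory` class is hypothetical, the three-dimensional vectors are hand-written stand-ins for real embeddings, and the search is an exhaustive scan where a real backend would use an index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticMemory:
    """Hypothetical in-memory fact store; a vector database plays this role in production."""

    def __init__(self) -> None:
        self._entries: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], fact: str) -> None:
        self._entries.append((embedding, fact))

    def search(self, query_embedding: list[float], k: int = 3) -> list[tuple[float, str]]:
        # Exhaustive scan: score every stored fact, then keep the k best.
        scored = [(cosine(query_embedding, emb), fact) for emb, fact in self._entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Toy vectors standing in for the output of an embedding model.
memory = SemanticMemory()
memory.add([0.9, 0.1, 0.0], "User prefers TypeScript over JavaScript.")
memory.add([0.0, 0.2, 0.9], "User deploys to Fly.io.")
top = memory.search([1.0, 0.0, 0.0], k=1)
```

With these toy vectors, the single result in `top` is the TypeScript preference, because its vector points in nearly the same direction as the query.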

Procedural memory encodes how to do things. For agents, this mostly lives in the system prompt or in fine-tuned model weights. Whatever lives in the weights is also the hardest memory to update, because you cannot cheaply modify weights at runtime.

The Retrieval Problem

The appealing promise of semantic memory is that you can store millions of facts and surface the relevant ones at query time. The reality is that embedding-based retrieval has a precision ceiling. Cosine similarity between dense vectors gives you semantic proximity, not logical relevance. Two sentences can be semantically close but contextually useless, and retrieving them wastes context tokens while potentially misleading the model.

The “lost in the middle” problem compounds this. A 2023 paper by Liu et al. showed that language models attend poorly to information placed in the middle of long prompts, with performance peaking near the beginning and end of the context. If you retrieve ten memory items and inject them mid-prompt, the model may effectively ignore half of them regardless of their relevance.
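One mitigation that is sometimes used is to reorder retrieved items so the strongest ones sit at the edges of the injected block and the weakest land in the middle. The sketch below assumes the input is already sorted from most to least relevant; `order_for_prompt` is a hypothetical helper, and whether the reordering helps depends on the model and the prompt layout.

```python
def order_for_prompt(ranked: list[str]) -> list[str]:
    """Place the most relevant items at both ends and the least relevant in the middle.

    `ranked` must be sorted from most to least relevant.
    """
    front: list[str] = []
    back: list[str] = []
    for i, item in enumerate(ranked):
        # Alternate: even ranks fill from the start, odd ranks fill from the end.
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


print(order_for_prompt(["m1", "m2", "m3", "m4", "m5"]))
# ['m1', 'm3', 'm5', 'm4', 'm2']
```

Here `m1` and `m2`, the two best matches, end up first and last, while `m5` sits in the middle where attention is weakest.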

The practical implication: injecting the top 10 raw results tends to perform worse than retrieving a wider candidate set and reranking it down to the best 3. Cross-encoders take a query and a candidate document as joint input, which typically yields more accurate relevance scores than bi-encoder similarity alone, at the cost of one model pass per pair. The two-stage retrieve-then-rerank pattern is now common in production RAG pipelines.

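from sentence_transformers import CrossEncoder

# Load once at startup; the model is downloaded on first use.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_memories(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    if not candidates:
        return []
    # Score each (query, candidate) pair jointly; higher means more relevant.
    pairs = [[query, candidate] for candidate in candidates]
    scores = reranker.predict(pairs)
    # Sort on the score alone so ties never fall back to comparing text.
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:top_k]]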

Cohere’s reranking API is a managed alternative if you do not want to host a cross-encoder locally.

Consolidation and the Cost of Forgetting

When a conversation runs long enough that full history no longer fits in context, you have two options: truncate or consolidate.

Truncation discards old messages and assumes the model will not need them. For casual conversations, this works most of the time.
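A minimal sketch of budget-based truncation follows. It assumes messages are dictionaries with `role` and `content` keys and uses a crude four-characters-per-token estimate; both `estimate_tokens` and `truncate_history` are hypothetical helpers, and a real tokenizer should replace the estimate.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def truncate_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep system messages plus the newest other messages that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept: list[dict] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break  # everything older than this point is dropped
        kept.append(message)
        budget -= cost
    # System messages go first, followed by the kept messages in original order.
    return system + kept[::-1]
```

Walking backward from the newest message means the cut always falls on the oldest side, and system messages are exempt so the agent's instructions survive truncation.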

Consolidation is the principled approach: summarize old exchanges, extract key facts into semantic memory, and drop the raw logs. MemGPT’s architecture, now continued under the Letta project, pioneered this for LLM agents. It maintains a fixed-size main context and treats the model itself as the memory manager, allowing it to issue explicit read and write operations against external storage. The paper describes this as self-directed memory management, which is meaningfully different from passive retrieval, because the model decides what to persist.
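The idea of the model issuing its own reads and writes can be illustrated with a toy tool dispatcher. This is a sketch of the general pattern only, not Letta's or MemGPT's actual API: the tool names `memory_write` and `memory_search`, the `ExternalMemory` class, and the substring search are all assumptions made for illustration.

```python
class ExternalMemory:
    """Toy archival store the model can write to and search via tool calls."""

    def __init__(self) -> None:
        self._records: list[str] = []

    def write(self, text: str) -> str:
        self._records.append(text)
        return f"stored record {len(self._records) - 1}"

    def search(self, query: str) -> list[str]:
        # Naive case-insensitive substring match stands in for real retrieval.
        needle = query.lower()
        return [r for r in self._records if needle in r.lower()]


def handle_tool_call(memory: ExternalMemory, call: dict) -> object:
    """Dispatch a memory operation that the model emitted as a structured tool call."""
    name, args = call["name"], call.get("arguments", {})
    if name == "memory_write":
        return memory.write(args["text"])
    if name == "memory_search":
        return memory.search(args["query"])
    raise ValueError(f"unknown tool: {name}")


# The dicts below imitate what a model's tool-call output might look like.
store = ExternalMemory()
handle_tool_call(store, {"name": "memory_write",
                         "arguments": {"text": "User prefers TypeScript over JavaScript."}})
hits = handle_tool_call(store, {"name": "memory_search",
                                "arguments": {"query": "typescript"}})
```

The difference from passive retrieval is where the decision sits: here nothing is stored or fetched unless the model emits a call asking for it.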

The engineering cost is real. Consolidation requires an LLM call to summarize, which adds latency and token spend. For a high-traffic bot, consolidating on every turn compounds quickly. The pattern I have settled on is lazy consolidation: trigger summarization when context utilization crosses a threshold rather than on every message.

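// Message, estimateTokens, and summarizeOldMessages are assumed to be defined elsewhere.
const KEEP_RECENT = 20;

async function maybeConsolidate(
  history: Message[],
  maxTokens: number
): Promise<Message[]> {
  const used = estimateTokens(history);
  // Consolidate only past 70% utilization, and only when something is older than the recent window.
  if (used / maxTokens > 0.7 && history.length > KEEP_RECENT) {
    const summary = await summarizeOldMessages(history.slice(0, -KEEP_RECENT));
    return [
      { role: "system", content: `Prior context: ${summary}` },
      ...history.slice(-KEEP_RECENT)
    ];
  }
  return history;
}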

This keeps the most recent twenty messages verbatim, which preserves conversational coherence, while compressing everything older into a summary that occupies far fewer tokens.

Fine-Tuning Is Not Memory

Fine-tuning gets conflated with memory in a lot of architectural discussions, and the conflation leads to bad decisions. Fine-tuning is appropriate for encoding stable procedural knowledge into model weights: how the agent should respond, what communication style it should maintain, what constraints it should respect. It is not appropriate for episodic or semantic memory, because you cannot update it cheaply as knowledge changes.

The right signal is update frequency. If a piece of information changes more than once a month, it should not be in fine-tuning. If it is essentially static, fine-tuning is worth considering. For anything in between, RAG or in-context injection is the correct default.

There is also a generalization risk. Fine-tuning on a narrow set of facts can degrade the model’s broader capabilities, a phenomenon the continual learning literature documents extensively as catastrophic forgetting. That problem does not disappear just because the use case is a custom chatbot.

What the Taxonomy Misses

The four-type model maps reasonably well onto agent implementations, but it underspecifies two important dimensions.

The first is the cost of retrieval errors. For a customer support agent, surfacing incorrect episodic memory and presenting it confidently is worse than retrieving nothing. For a coding assistant, hallucinating a code snippet from semantic memory is worse than admitting ignorance. Most frameworks treat the retrieval confidence threshold as a hyperparameter to tune, but the appropriate value depends on the failure mode of a wrong answer in the specific domain, not on generic benchmarks.
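One way to make that dependence explicit is to key the threshold on the deployment domain instead of using a single global value. The sketch below is illustrative only: the domain names and numeric thresholds in `MIN_SCORE` are hypothetical placeholders, and `select_memories` assumes scores are similarity values where higher means more relevant.

```python
# Hypothetical per-domain minimum scores; tune against your own failure costs.
MIN_SCORE = {"customer_support": 0.85, "casual_chat": 0.50}


def select_memories(
    scored: list[tuple[float, str]], domain: str, default_min: float = 0.7
) -> list[str]:
    """Return memories whose score meets the domain's threshold; may return nothing."""
    threshold = MIN_SCORE.get(domain, default_min)
    return [text for score, text in scored if score >= threshold]


candidates = [(0.9, "Order #A was refunded"), (0.6, "User mentioned a late delivery")]
strict = select_memories(candidates, "customer_support")
lenient = select_memories(candidates, "casual_chat")
```

With these placeholder values, `strict` keeps only the 0.9 match while `lenient` keeps both. Returning an empty list is a valid outcome in the strict case, which is the point: retrieving nothing is preferable to surfacing a weak match confidently.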

The second is multi-session identity. Most memory frameworks assume a single agent with a single undivided memory store. When multiple users interact with a shared agent, you need per-user episodic isolation, so what Alice said does not get attributed to Bob, while potentially sharing semantic memory, where facts about the shared codebase or project apply to everyone. Getting that boundary right is more of a data modeling problem than a machine learning problem, and it is where most production deployments hit trouble first.

A naive implementation might store all memories in the same vector index and rely on metadata filters to segregate them by user. That works until retrieval latency at scale pushes you toward approximate indexing, where filtering becomes fragile: filtering after the approximate search can return too few of a user’s own memories, and a single query path that omits the filter crosses the user boundary. Strict per-user indices are more reliable but more expensive to maintain. The right answer depends on your throughput and isolation requirements.
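The boundary between per-user episodic memory and shared semantic memory can be sketched as a data model. This is a minimal in-memory illustration under the strict-isolation option; the `AgentMemory` class and its method names are hypothetical, and plain lists stand in for real indices.

```python
from collections import defaultdict


class AgentMemory:
    """Per-user episodic logs plus one semantic store shared by all users."""

    def __init__(self) -> None:
        # One episodic log per user, so a lookup cannot touch another user's events.
        self._episodic: dict[str, list[str]] = defaultdict(list)
        # Facts that apply to everyone, e.g. facts about a shared codebase.
        self._shared_semantic: list[str] = []

    def record_event(self, user_id: str, event: str) -> None:
        self._episodic[user_id].append(event)

    def add_shared_fact(self, fact: str) -> None:
        self._shared_semantic.append(fact)

    def context_for(self, user_id: str) -> list[str]:
        # .get avoids creating an empty log for users who have never spoken.
        return self._shared_semantic + self._episodic.get(user_id, [])


mem = AgentMemory()
mem.record_event("alice", "Alice asked about deploying to Fly.io.")
mem.add_shared_fact("The project is written in TypeScript.")
bob_context = mem.context_for("bob")
alice_context = mem.context_for("alice")
```

Bob's context contains only the shared fact, while Alice's contains the shared fact plus her own event. Isolation here is structural, not filter-based: there is no query that can return another user's episodic entries by forgetting a condition.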

The design space is genuinely complex, and Bedor’s taxonomy is a sensible starting frame. The implementation decisions that follow from choosing one approach compound in ways that are not obvious until the system is under load. Starting with the simplest viable approach, measuring where context quality degrades, and adding retrieval infrastructure only when degradation is observable is usually the right sequence. Memory systems that are over-engineered before you have production traffic tend to optimize for problems you never actually encounter.
