The Engineering Reality of AI Agent Memory

Source: lobsters

The way an AI agent handles memory shapes its behavior more than most builders expect. A model that remembers nothing treats every conversation as its first; one that stores everything eventually retrieves the wrong context at the wrong moment. The interesting design work lives in the middle, and most of it is unsolved.

This overview of agent memory approaches is a useful starting point for thinking through the taxonomy. But taxonomy is the easy part. The harder questions are about consolidation, invalidation, and retrieval precision, and those are what determine whether a memory system helps or hurts.

The Four-Type Framework and Its Limits

Most treatments of agent memory borrow from cognitive science and arrive at four categories: working memory (the current context window), episodic memory (records of past events or conversations), semantic memory (general facts and user-specific knowledge), and procedural memory (workflows, skills, how to do things).

This map is accurate enough to be useful and imprecise enough to mislead. A user correction like “I prefer Python, not JavaScript” is semantic knowledge but was learned through an episode. A stored workflow is procedural but may reference semantic facts that become stale. In practice these categories are not cleanly separated stores; they are overlapping concerns a system has to balance simultaneously.

Working Memory Is Not Free

Working memory is the context window. Many current large models accept 128k to 200k tokens, and some accept more, but the relevant cost is not just money per token; it is attention quality. The Lost in the Middle paper from Liu et al. (2023) showed empirically that language models use information positioned in the middle of a long context much less reliably. Performance on multi-document question-answering tasks drops substantially when relevant documents are buried in the middle versus placed at the start or end of the prompt. Naively dumping all retrieved memory into a context window, even one that technically fits, is not a neutral choice.

The practical response is to be selective about what goes into context and where it appears. High-priority facts and recent events belong near the top. Retrieved chunks belong near the user’s message. Everything else either stays in external storage or gets summarized before inclusion.
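As a rough illustration of that ordering, here is a minimal sketch of a prompt builder. The names `MemoryItem`, `estimate_tokens`, and `build_prompt` are hypothetical, the four-characters-per-token estimate is a crude assumption standing in for a real tokenizer, and dropping chunks that do not fit is a simplification of "summarize before inclusion".

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    priority: int  # higher means more important; hypothetical scale


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def build_prompt(pinned, retrieved, user_message, budget_tokens):
    """Place pinned facts first and retrieved chunks just before the user message.

    `retrieved` is assumed to be sorted best-first. Anything that does not fit
    in the remaining budget is left out of the prompt.
    """
    used = estimate_tokens(user_message)

    head = []
    for item in sorted(pinned, key=lambda m: m.priority, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost > budget_tokens:
            break
        head.append(item.text)
        used += cost

    tail = []
    for chunk in retrieved:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        tail.append(chunk)
        used += cost

    # Reverse so the best retrieved chunk sits closest to the user message.
    tail.reverse()
    return "\n".join(head + tail + [user_message])
```

The point of the sketch is the layout: the start and end of the prompt are reserved for the material that matters most, and the budget check decides what stays in external storage.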

External Storage and the Retrieval Problem

Episodic and semantic memory in most agent systems live in external storage, retrieved on demand. The dominant implementation pairs a vector database (Pinecone, Weaviate, or pgvector in Postgres) with embeddings generated from something like OpenAI’s text-embedding-3-small. When a query arrives, the agent embeds it and pulls the top-k nearest chunks.

This works reasonably well for broad semantic similarity, but it has a structural weakness: it retrieves what is similar to the query, not necessarily what is relevant to answering it. A user asking “what did we decide about the database schema?” will surface chunks about schema discussions but may miss a later chunk where that decision was reversed. Recency and causality are not encoded in embedding space.

Hybrid retrieval helps. Combining semantic search with BM25 keyword search improves recall on factual lookups, and adding metadata filters brings precision up further. A filtered, reranked query against a generic memory store looks something like this:

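# memory_store is a hypothetical client; embed(), user_id, and cutoff are defined elsewhere.
results = memory_store.query(
    embedding=embed(query),  # dense vector for the semantic pass
    filters={"user_id": user_id, "created_at": {"$gte": cutoff}},  # this user, recent items only
    top_k=5,
    rerank=True,  # cross-encoder reranker, e.g. Cohere or a local model
)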

Reranking top candidates with a cross-encoder before injecting them into context is one of the more reliable ways to improve relevance without overhauling storage architecture. The semantic pass casts a wide net; the cross-encoder narrows it by scoring query-document pairs jointly rather than independently.
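The two-stage shape can be sketched in plain Python. This is an illustration under stated assumptions, not any particular library's API: `retrieve_then_rerank` and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a toy stand-in for a cross-encoder so that the example runs without a model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_text, query_vec, docs, pair_scorer, wide_k=20, final_k=5):
    """docs is a list of (text, vector); pair_scorer(query, text) returns a float."""
    # Stage 1: cheap pass. Each document is scored independently against the query vector.
    candidates = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)[:wide_k]
    # Stage 2: expensive pass. Query and document are scored together, short list only.
    reranked = sorted(candidates, key=lambda d: pair_scorer(query_text, d[0]), reverse=True)
    return [text for text, _ in reranked[:final_k]]


def overlap_scorer(query, text):
    # Toy stand-in for a cross-encoder: fraction of query words found in the text.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0
```

In a real system the second stage would call a reranking model; the structure stays the same, with `wide_k` trading recall against reranking cost.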

MemGPT and the OS Metaphor

The most architecturally interesting take on agent memory came from the MemGPT paper out of Berkeley in late 2023. The core idea is that the LLM context window is analogous to main memory (RAM): fast, limited, and volatile. External storage is disk. The agent’s job is to manage page-ins and page-outs explicitly, using function calls to move information between layers.

MemGPT, now called Letta, implements this with a structured in-context memory block that the model can read and write through tool calls, plus archival storage for long-term facts. The model decides what to persist and what to discard, which means the memory system benefits from the model’s own reasoning rather than relying entirely on embedding similarity.

The trade-off is that this burns context on memory management overhead. Every memory update is a tool call round-trip. For latency-sensitive applications this adds up quickly. For agents running over long time horizons, days rather than minutes, it is often the right trade-off. The regime matters: a customer support agent optimized for 30-second resolutions has different memory requirements than an engineering assistant that tracks a codebase over months.
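To make the pattern concrete, here is a minimal sketch of a two-tier memory exposed as tools. It is not Letta's actual API: the class `TieredMemory`, its method names, the character-based limit, and the substring search are all assumptions chosen to keep the example self-contained.

```python
class TieredMemory:
    def __init__(self, core_limit_chars=200):
        self.core = []       # always rendered into the prompt
        self.archival = []   # kept out of context, searched on demand
        self.core_limit_chars = core_limit_chars

    def core_append(self, text):
        size = sum(len(t) for t in self.core) + len(text)
        if size > self.core_limit_chars:
            # Leave the decision to the model: it must archive or edit first.
            return "error: core memory full"
        self.core.append(text)
        return "ok"

    def core_replace(self, old, new):
        for i, entry in enumerate(self.core):
            if old in entry:
                self.core[i] = entry.replace(old, new)
                return "ok"
        return "error: not found"

    def archival_insert(self, text):
        self.archival.append(text)
        return "ok"

    def archival_search(self, query):
        # Substring match as a stand-in for embedding search.
        return [t for t in self.archival if query.lower() in t.lower()]

    def dispatch(self, call):
        """Route a model-issued tool call shaped like {"name": ..., "args": {...}}."""
        tools = {
            "core_append": self.core_append,
            "core_replace": self.core_replace,
            "archival_insert": self.archival_insert,
            "archival_search": self.archival_search,
        }
        return tools[call["name"]](**call["args"])
```

Each `dispatch` call corresponds to one tool-call round-trip, which is where the latency overhead described above comes from.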

The Consolidation Problem

What most taxonomy articles skip is consolidation: how do you decide which experiences become long-term memory, and how do you update memories that have grown stale or wrong?

Human memory does not store everything. It encodes experiences selectively based on salience and repetition, and consolidates them over time into more abstract semantic knowledge. AI agents have no equivalent process by default. They either store everything and retrieve noise, or store nothing and lose useful context.

One practical approach is periodic summarization. At the end of a conversation, a second model call produces a structured summary of key facts, decisions, and corrections, which gets stored with higher priority than raw logs. This is cheap and reliable, though it loses nuance.

A more principled approach uses the agent itself to evaluate memory candidates. Mem0 does this with its memory extraction pipeline: after each interaction it runs a structured extraction pass over the conversation, deduplicates against existing memories, and resolves conflicts by treating newer evidence as higher confidence. The result is a maintained semantic store that converges on accurate facts rather than accumulating contradictions.
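The deduplicate-then-resolve step can be sketched as follows. This is not Mem0's pipeline or API; it assumes the extraction pass has already normalized each memory into a hypothetical `Fact` with a slot-like `key`, which is the hard part in practice, and it applies only the simple newer-wins rule.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    key: str            # e.g. "preferred_language"; hypothetical normalized slot name
    value: str
    observed_at: float  # Unix timestamp of the observation


def consolidate(store, extracted):
    """Merge newly extracted facts into store (a dict mapping key to Fact).

    Exact duplicates are skipped; on conflict the newer observation wins.
    Returns a list of (action, key) pairs, useful for logging.
    """
    actions = []
    for fact in extracted:
        current = store.get(fact.key)
        if current is None:
            store[fact.key] = fact
            actions.append(("add", fact.key))
        elif current.value == fact.value:
            # Same value: keep one copy, but remember the latest sighting.
            current.observed_at = max(current.observed_at, fact.observed_at)
            actions.append(("noop", fact.key))
        elif fact.observed_at >= current.observed_at:
            store[fact.key] = fact
            actions.append(("update", fact.key))
        else:
            actions.append(("ignore_stale", fact.key))
    return actions
```

Returning the action list makes the store's evolution auditable, which helps when a memory turns out to be wrong and someone has to find out when it changed.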

Invalidation is the harder sibling of consolidation. A memory that was true six months ago may now be wrong. User preferences change; architectural decisions get reversed. Without explicit invalidation logic, memory systems serve outdated information with full confidence. The minimum viable solution is to store a timestamp with every memory and apply recency decay in retrieval scoring. A more robust solution tracks corroboration across multiple recent observations and downgrades memories that contradict newer evidence.
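The minimum viable version, a timestamp plus recency decay, fits in a few lines. The sketch below assumes an exponential half-life decay multiplied into the similarity score; the 30-day default and the names `ScoredMemory`, `decayed_score`, and `rank_with_decay` are illustrative choices, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class ScoredMemory:
    text: str
    similarity: float  # raw retrieval score, assumed to be in [0, 1]
    created_at: float  # Unix timestamp


def decayed_score(similarity, age_seconds, half_life_seconds):
    # The score halves every half_life_seconds of age.
    return similarity * 0.5 ** (age_seconds / half_life_seconds)


def rank_with_decay(memories, now, half_life_seconds=30 * 24 * 3600):
    """Sort memories by similarity discounted for age, best first."""
    def score(m):
        age = max(0.0, now - m.created_at)
        return decayed_score(m.similarity, age, half_life_seconds)
    return sorted(memories, key=score, reverse=True)
```

With a half-life of 100 seconds, a memory with similarity 0.9 that is 200 seconds old scores 0.225 and ranks below a fresh memory with similarity 0.6. The half-life is the tuning knob: too short and stable facts fade, too long and reversed decisions linger.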

Procedural Memory Belongs in Code

Procedural memory, how to do things, rarely belongs in a retrieval system. Storing workflows as text chunks and embedding them is fragile because executing a procedure depends on precision, and semantic retrieval is approximate. A workflow surfaced at middling similarity might be subtly wrong in ways the model cannot detect until execution fails.

Better to encode procedural knowledge as actual code or structured configuration that the agent invokes deterministically. The prompt instructs the agent to call a known function; the function handles the procedure. This keeps the natural language layer thin and the reliable layer robust. Mixing procedural knowledge into the same embedding space as factual memory is a reliable way to introduce silent failures that are difficult to debug.
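One way to picture this is a small registry of named procedures that the agent invokes by exact name. Everything here is hypothetical: the `procedure` decorator, the `rotate_api_key` example, and its placeholder body are only meant to show deterministic dispatch in place of similarity search.

```python
from typing import Callable, Dict

PROCEDURES: Dict[str, Callable[..., str]] = {}


def procedure(name):
    """Register a function under an exact name the agent can call."""
    def register(fn):
        PROCEDURES[name] = fn
        return fn
    return register


@procedure("rotate_api_key")
def rotate_api_key(service: str) -> str:
    # Placeholder body; a real implementation would call the service's API.
    return f"rotated key for {service}"


def run_procedure(name, **kwargs):
    # Exact-name lookup: an unknown procedure fails loudly instead of
    # silently running the nearest match.
    if name not in PROCEDURES:
        raise KeyError(f"unknown procedure: {name}")
    return PROCEDURES[name](**kwargs)
```

A misspelled or nonexistent procedure name raises immediately, which is the opposite of the silent failure mode that approximate retrieval invites.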

Where the Model Fits

A recurring issue in agent memory designs is delegating too much to the model. The model is good at reasoning over retrieved context; it is less reliable as a memory manager when given no scaffolding. Designing explicit storage schemas, retrieval pipelines, and consolidation schedules, then giving the model well-structured input from those systems, produces more consistent behavior than asking the model to figure out memory management on its own.

The taxonomy is a starting point, not a solution. The real design work is in the interfaces between layers: what triggers retrieval, what triggers consolidation, what triggers invalidation, and how the model receives the output of each. Getting those right matters more than which vector database you pick.
