Layering Memory in AI Agents: Beyond the Context Window

Building an agent that holds a conversation is tractable. Building one that remembers anything useful across time, across sessions, across weeks of operation, is a different problem entirely.

The Discord bot I maintain runs continuous conversations in multiple channels. After a few weeks, the gap between what happened and what the model knows about what happened starts to matter. A user references a decision from three weeks ago. The model has no idea. The conversation history is too long to fit in context. The relevant facts are buried somewhere in a pile of messages about something else. This is the practical face of a problem the field calls agent memory, and a recent overview at tombedor.dev lays out the design space clearly. But the taxonomy is the beginning, not the end.

The Four Patterns

Most treatments of agent memory converge on four main patterns.

In-context memory is the naive baseline. You include relevant history directly in the prompt. It works until the context window fills up, and it degrades well before that. The “Lost in the Middle” paper from Liu et al. (2023) documented this clearly: models retrieve facts placed in the middle of a long context markedly less accurately than facts near the beginning or end. A model does not attend to every position in its context equally. Stuffing a 50,000-token history into a prompt and expecting the model to find the relevant sentence is not a retrieval strategy.
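One common way to keep in-context memory bounded is to fill a fixed token budget with the most recent messages. The sketch below is a minimal illustration under stated assumptions: `approx_tokens` and `build_prompt` are hypothetical names, and counting whitespace-separated words is a crude stand-in for a real tokenizer.

```python
def approx_tokens(text: str) -> int:
    # Assumption: roughly one token per whitespace-separated word.
    return len(text.split())


def build_prompt(system: str, history: list[str], new_message: str, budget: int) -> list[str]:
    """Keep the system prompt and new message, then add recent history until the budget runs out."""
    remaining = budget - approx_tokens(system) - approx_tokens(new_message)
    kept: list[str] = []
    # Walk backwards so the newest messages are kept first.
    for message in reversed(history):
        cost = approx_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return [system, *kept, new_message]
```

Note what this does to a decision made three weeks ago: it is simply dropped, which is why the remaining patterns exist.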

External retrieval is the RAG pattern. You embed documents and conversation history, store them in a vector database, and pull relevant chunks at query time using semantic similarity. Storage is no longer bounded by the context window, so you can keep years of history. The cost is retrieval quality: semantic search finds things that are similar in embedding space, not necessarily things that are contextually relevant. If a user asks “what was the decision about authentication,” you need the embedding of that question to be close enough to the stored answer. Sometimes it is. Often it is not.
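The mechanics of similarity search fit in a few lines. This is a toy sketch, not a production retriever: `embed` here is a bag-of-words counter standing in for a real embedding model, and `VectorStore` is a hypothetical in-memory class rather than any particular database. A real embedding model handles paraphrase far better than word overlap does, but the shape of the failure is the same: a question and its answer can share very little surface content.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for an embedding model: lowercase bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """Hypothetical in-memory store: embed at write time, rank by cosine at query time."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        q = embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```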

Key-value stores are explicit, named memory. The agent writes to named slots: user_preference_language = "Python", project_status = "in review". This is reliable for known, structured facts. The problem is that you have to know in advance what to write down, and agents are inconsistent at deciding what matters enough to persist.
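A minimal sketch of named slots, assuming a single JSON file is an acceptable backing store. `KeyValueMemory` and its method names are hypothetical; a real deployment would more likely sit on a database with locking.

```python
import json
from pathlib import Path
from typing import Any


class KeyValueMemory:
    """Named memory slots persisted to a JSON file (hypothetical helper)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        # Load existing slots if the file is already there.
        self.slots: dict[str, Any] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def write(self, key: str, value: Any) -> None:
        self.slots[key] = value
        self.path.write_text(json.dumps(self.slots, indent=2))

    def read(self, key: str, default: Any = None) -> Any:
        return self.slots.get(key, default)

    def render(self) -> str:
        # One line per slot, sorted by key, ready to inject into a prompt.
        return "\n".join(f"{k} = {json.dumps(v)}" for k, v in sorted(self.slots.items()))
```

Lookup here is exact: a key either exists or it does not, and no similarity threshold is involved. The hard part, deciding when to call `write`, is outside the code.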

Summarization is compression. As history grows, you collapse older portions into shorter representations and carry those forward. This preserves the shape of past interactions without the full token cost. What it loses is detail, and detail has a way of mattering at inconvenient moments.
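As a rough sketch, rolling summarization keeps the last few messages verbatim and collapses everything older into one summary entry. `compact` and `naive_summarize` are hypothetical names, and `naive_summarize` is a deliberately lossy placeholder for what would be an LLM call in a real system.

```python
from typing import Callable


def compact(
    history: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace everything except the last `keep_recent` messages with a single summary."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(history) <= keep_recent:
        return list(history)
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [f"[summary] {summarize(older)}", *recent]


def naive_summarize(messages: list[str]) -> str:
    # Placeholder: first five words of each message. A real system would call a model here.
    return " / ".join(" ".join(m.split()[:5]) for m in messages)
```

Whatever the summarizer drops is gone for good, which is the detail loss described above.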

The Cognitive Science Frame

The agent memory patterns map roughly onto the memory systems Endel Tulving distinguished in cognitive psychology. Episodic memory is autobiographical and time-stamped: “the user complained about the deployment pipeline on March 12th.” Semantic memory is general and context-free: “the user prefers Kubernetes over Docker.” Procedural memory covers skills and behaviors rather than facts, which in an agent context maps closer to the system prompt and tool definitions than to stored history.

This mapping matters because different memory types call for different storage strategies. Episodic memory benefits from vector retrieval because you are searching by semantic similarity to a query: you want memories that are about the same kind of thing. Semantic memory, the settled facts, works better in structured key-value storage because you want reliable lookup, not fuzzy search. A preference or a configuration value should not depend on whether the embedding of your query happens to land near the embedding of the stored answer.

Mixing both types into a single vector store treats all memory as the same kind of thing. This creates the frustrating failure mode where the agent reliably retrieves interesting episodic context but misses the plain factual answer that was right there.

MemGPT and the Virtual Memory Model

The most architecturally coherent attempt to address this is MemGPT, introduced in the 2023 paper “MemGPT: Towards LLMs as Operating Systems” by Packer et al. from UC Berkeley. The core insight is treating the LLM context window like RAM in a virtual memory system. There is a “main context” that holds active information, an external context made up of recall storage (the message history) and “archival storage” (everything else), and the model itself manages paging between them using function calls.

The agent is given tools to search archival memory, insert new memories, and edit existing ones. When context gets full, it summarizes and offloads. When it needs something specific, it searches and loads. The paging is explicit and model-controlled rather than hidden infrastructure.

This is a real architectural insight: the context window is working memory, and everything else is secondary storage. The model becomes responsible for its own memory management, which shifts what the agent is doing at any given step from just reasoning to reasoning plus memory housekeeping.
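The paging idea can be sketched without any LLM at all. This is not MemGPT's implementation: `PagedMemory` is a hypothetical class, capacity is counted in messages rather than tokens, eviction moves messages verbatim instead of summarizing them, and the archival search is a substring match standing in for embedding search. In the real design the model would invoke operations like these as tool calls.

```python
from dataclasses import dataclass, field


@dataclass
class PagedMemory:
    capacity: int  # max messages in main context (stand-in for a token limit)
    main_context: list[str] = field(default_factory=list)
    archival: list[str] = field(default_factory=list)

    def append(self, message: str) -> None:
        """Add to main context, evicting the oldest messages to archival when full."""
        self.main_context.append(message)
        while len(self.main_context) > self.capacity:
            self.archival.append(self.main_context.pop(0))

    def archival_insert(self, note: str) -> None:
        """Explicit write: the agent decides this is worth keeping."""
        self.archival.append(note)

    def archival_search(self, query: str) -> list[str]:
        """Return archived messages containing any query term (toy matching)."""
        terms = query.lower().split()
        return [m for m in self.archival if any(t in m.lower() for t in terms)]
```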

The Berkeley team subsequently built Letta, an open-source framework that productizes this approach. It provides persistent agent state across conversations, memory blocks with explicit character limits, and a layered structure where the agent can introspect and edit its own memory. Each agent has a core_memory block that is always present in the system prompt, and archival_memory that is externally stored and searchable.

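# Written against a Letta release that exposes create_client;
# check the client API of your installed version before copying this.
from letta import create_client
from letta.schemas.memory import BasicBlockMemory, Block

client = create_client()
agent = client.create_agent(
    memory=BasicBlockMemory(
        blocks=[
            # What the agent knows about the user.
            Block(label="human", value="Name: Alice. Prefers concise answers. Works on backend infra."),
            # How the agent describes itself.
            Block(label="persona", value="Helpful assistant. Engineering focus. Direct and technical."),
        ]
    )
)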

The core_memory block is always present. The agent can call core_memory_append or core_memory_replace as tools when it decides something is important enough to persist at that level. This makes memory management visible and auditable rather than a side effect of whatever the embedding model does.

The tradeoff is that memory management becomes part of the task. Every agent step potentially involves a memory operation. This adds latency, increases token costs, and introduces a new failure mode: the agent deciding to forget or overwrite something it should have kept.

The Retrieval Problem Is Harder Than It Looks

Most implementations that reach for vector search underestimate the retrieval problem. A user asking “is that resolved?” after a previous conversation about a deployment bug needs the agent to retrieve the memory about the deployment bug, but the query has essentially no semantic content on its own. It depends on conversational context that may not be embedded with the stored memory fragment.

Several mitigations exist, none of them complete.

Hypothetical document embedding (HyDE), from Gao et al. (2022), addresses the query-document mismatch directly. Instead of embedding the raw query, you ask the model to generate a hypothetical answer, then embed that. An answer will tend to be semantically closer to the stored memory than a vague question. This works often enough to justify the extra LLM call, but it adds latency and can hallucinate its way into the wrong retrieval neighborhood.
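A toy sketch of the HyDE flow, under loud assumptions: `toy_embed` is a bag-of-words stand-in for an embedding model, `fake_llm` returns a canned string where a real system would call a model, and `hyde_search` is a hypothetical name. The only point is the order of operations: generate a guess, embed the guess, search with that.

```python
import math
import re
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    # Stand-in for an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(
    query: str,
    memories: list[str],
    generate_answer: Callable[[str], str],
    k: int = 1,
) -> list[str]:
    hypothetical = generate_answer(query)  # the extra LLM call
    probe = toy_embed(hypothetical)        # embed the guess, not the question
    ranked = sorted(memories, key=lambda m: cosine(probe, toy_embed(m)), reverse=True)
    return ranked[:k]


def fake_llm(query: str) -> str:
    # Canned output standing in for a model's hypothetical answer.
    return "The team decided to use OAuth tokens for authentication."
```

If the generated guess is wrong in the wrong direction, the probe lands near the wrong memories, which is the hallucination risk noted above.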

Metadata filtering tags memories with user ID, topic, timestamp, and channel, then filters before running vector search. This shrinks the search space and prevents cross-user contamination. It requires consistent tagging at write time, which is its own discipline.
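A sketch of the pre-filter step, assuming each memory carries the four tags named above. The `Memory` dataclass and `filter_memories` are hypothetical; most vector databases expose an equivalent filter in their query API, and the vector search would then run only over what this returns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Memory:
    text: str
    user_id: str
    topic: str
    channel: str
    timestamp: float  # seconds since epoch, set at write time


def filter_memories(
    memories: list[Memory],
    *,
    user_id: Optional[str] = None,
    channel: Optional[str] = None,
    since: Optional[float] = None,
) -> list[Memory]:
    """Keep only memories matching every filter that was supplied."""

    def keep(m: Memory) -> bool:
        return (
            (user_id is None or m.user_id == user_id)
            and (channel is None or m.channel == channel)
            and (since is None or m.timestamp >= since)
        )

    return [m for m in memories if keep(m)]
```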

Re-ranking retrieves a larger candidate set from the vector store and then scores each candidate against the full query using a cross-encoder model. A cross-encoder processes the query and the candidate together, which typically gives better relevance scores than bi-encoder cosine similarity. The cost is speed: because each score depends on the query, nothing can be precomputed at write time, and every candidate requires its own model pass at query time.
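The two-stage shape looks roughly like this. Both scorers are toys chosen so the example runs without a model: `word_overlap` stands in for cheap bi-encoder similarity, and `bigram_overlap` stands in for a cross-encoder only in the sense that it looks at the query and candidate jointly. All function names are hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    pair_score: Scorer,
    candidates: int = 10,
    k: int = 3,
) -> list[str]:
    # Stage 1: cheap scoring over the whole corpus, keep a generous shortlist.
    shortlist = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:candidates]
    # Stage 2: expensive joint scoring over the shortlist only.
    return sorted(shortlist, key=lambda d: pair_score(query, d), reverse=True)[:k]


def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def _bigrams(text: str) -> set[tuple[str, str]]:
    words = text.lower().split()
    return set(zip(words, words[1:]))


def bigram_overlap(query: str, doc: str) -> float:
    return float(len(_bigrams(query) & _bigrams(doc)))
```

The shortlist size is the tuning knob: too small and the right memory never reaches stage two, too large and the per-query cost grows with it.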

Explicit writes at the moment of establishment sidestep retrieval entirely for facts that are confirmed at a specific time. When a preference is stated, write it to a structured store immediately. Do not leave it to retrieval to surface it later, because retrieval will fail at the worst time.

Long-Context Models Are Not Sufficient Alone

Gemini 1.5 Pro’s million-token context window prompted a wave of takes arguing that RAG is unnecessary. The argument is understandable and incorrect, or at least incomplete.

Long context helps with bounded tasks: summarizing a large document, reasoning over a full codebase, analyzing a long conversation that happened today. It does not help with unbounded history. An agent that runs for months accumulates more interaction history than any current context window can hold, and the Lost in the Middle degradation is well-documented even in models with expanded windows.

More importantly, long context is expensive. Prefilling a million-token context on every query is not economically viable for production workloads. The cost of retrieving 500 tokens of relevant history from a vector store is orders of magnitude lower. Long context is a fallback for cases where other layers fail, not the default architecture.
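A back-of-envelope version of that comparison, with the price as an explicit assumption. `PRICE_PER_MILLION` is a hypothetical placeholder, not any provider's rate, and the calculation ignores the base prompt, prompt caching, and the cost of running the vector store itself. The token ratio does not depend on the price chosen.

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost of sending `tokens` input tokens at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


PRICE_PER_MILLION = 1.00    # hypothetical price, in arbitrary currency units
FULL_CONTEXT = 1_000_000    # tokens prefilled on every query
RETRIEVED = 500             # tokens of retrieved history, per the figure above

full_cost = input_cost(FULL_CONTEXT, PRICE_PER_MILLION)
retrieved_cost = input_cost(RETRIEVED, PRICE_PER_MILLION)
token_ratio = FULL_CONTEXT // RETRIEVED  # 2000
```

A factor of 2000 per query is a little over three orders of magnitude, before accounting for the latency of prefilling the larger context.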

A Practical Tiered Architecture

For the bot I run, the architecture separates memory by update frequency and reliability requirements.

The system prompt carries a small block of settled facts: server context, recurring users, current project state. This is manually curated, updated infrequently, and always in context. It is the closest thing to long-term semantic memory the system has.

Conversation history per channel is stored in a vector database with metadata tags. A lightweight classifier first decides whether the incoming message is likely to benefit from historical context, and the retrieval query fires only when it does.

Confirmed preferences, decisions, and explicit facts are written to a structured JSON store at the time they are established. These are injected by key at query time, not searched by embedding.

This is not elegant. It requires deciding which store each piece of information belongs in, and that decision is not always obvious at write time. But it fails gracefully in ways that pure vector retrieval does not. When retrieval misses, the structured store still has the settled facts. When the structured store lacks the context, retrieval can find the relevant episode.
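The three tiers can be wired together along these lines. This is a sketch of the shape, not the bot's actual code: `assemble_context` and its parameters are hypothetical, and the classifier and retriever are passed in as plain callables so the example stays self-contained.

```python
from typing import Callable


def assemble_context(
    message: str,
    settled_facts: str,                   # tier 1: curated block, always present
    structured: dict[str, str],           # tier 2: confirmed facts, looked up by key
    keys: list[str],                      # which structured keys to inject
    needs_history: Callable[[str], bool], # lightweight classifier
    retrieve: Callable[[str], list[str]], # tier 3: vector retrieval
) -> str:
    parts = [settled_facts]
    # Structured facts are injected by exact key, never searched by embedding.
    injected = [f"{k}: {structured[k]}" for k in keys if k in structured]
    if injected:
        parts.append("\n".join(injected))
    # Retrieval only fires when the classifier says history is likely to help.
    if needs_history(message):
        parts.extend(retrieve(message))
    parts.append(message)
    return "\n\n".join(parts)
```

If `retrieve` returns nothing, the settled facts and structured entries are still in the prompt, which is the graceful failure described above.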

The Open Questions

Memory consistency over time is still unsolved. If a fact stored three months ago conflicts with something said last week, which should the agent surface? Which should it trust? Without a principled answer, the agent either silently picks one or surfaces both and leaves the contradiction to the model to resolve, which often does not go well.
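One policy, offered as an illustration rather than an answer: trust the newest write, but return the older disagreeing values alongside it so the conflict is at least visible. `Fact` and `resolve` are hypothetical names, and "newest wins" is an assumption that will be wrong whenever the recent statement was a mistake or a joke.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    timestamp: float  # seconds since epoch, recorded at write time


def resolve(facts: list[Fact], key: str) -> tuple[Optional[Fact], list[Fact]]:
    """Return the newest fact for `key` plus any older facts that disagree with it."""
    matching = sorted(
        (f for f in facts if f.key == key),
        key=lambda f: f.timestamp,
        reverse=True,
    )
    if not matching:
        return None, []
    newest = matching[0]
    conflicts = [f for f in matching[1:] if f.value != newest.value]
    return newest, conflicts
```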

Memory about gaps is harder still. A retrieval miss is invisible to the model. It simply has no information and may confabulate rather than acknowledge ignorance. Building an agent that knows what it does not know requires the retrieval layer to report confidence, not just results, and the model to reason about that confidence appropriately.
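A minimal sketch of a retrieval layer that reports a miss explicitly instead of returning an empty list. The names and the 0.5 threshold are assumptions for illustration; a usable threshold depends entirely on the embedding model and has to be calibrated against real queries.

```python
from dataclasses import dataclass


@dataclass
class RetrievalResult:
    hits: list[tuple[float, str]]  # (score, text) pairs above the threshold
    confident: bool                # False when nothing cleared the threshold


def retrieve_with_confidence(
    scored: list[tuple[float, str]],
    threshold: float = 0.5,
    k: int = 3,
) -> RetrievalResult:
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    hits = [(score, text) for score, text in top if score >= threshold]
    return RetrievalResult(hits=hits, confident=bool(hits))


def render_for_prompt(result: RetrievalResult) -> str:
    # Make the miss visible to the model instead of silently injecting nothing.
    if not result.confident:
        return "No stored memory matched this query with sufficient confidence."
    return "\n".join(f"({score:.2f}) {text}" for score, text in result.hits)
```

Whether the model then acknowledges ignorance rather than confabulating is a prompting and evaluation question that this code does not settle.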

The field has good tooling for storing and retrieving information. What it does not have is reliable judgment about what to remember, when to retrieve, and how to behave when the memory system returns nothing useful. The design patterns emerging from projects like Letta are pointing in the right direction: explicit memory structures, model-controlled paging, layered stores for different memory types. The defaults are not yet obvious, and the failure modes are still educational.