Memory as Architecture: How AI Agents Decide What to Remember

When you build a stateless HTTP server, reasoning about memory is straightforward: each request gets its state, processes it, returns a response, and the slate is wiped. AI agents break this model. An agent handling a multi-step task, or interacting with a user across sessions, needs to carry state forward in ways that don’t fit neatly into a request/response cycle. The context window is the only thing an LLM actually “sees,” so everything the agent knows must either be in that window or retrievable on demand.

This is the fundamental design constraint that drives all agent memory architecture, and tombedor.dev’s overview of memory approaches captures the major categories well. But the choice between them isn’t just technical. It’s operational, economic, and deeply dependent on what your agent is actually supposed to do over time.

Three Categories of Memory

Most memory systems fall into three broad categories.

In-context memory is the simplest: put everything relevant directly in the prompt. The LLM processes it all at once, with full attention across every token. No retrieval step, no index to maintain, no separate infrastructure. The tradeoff is obvious: context windows are finite and expensive. GPT-4o and Claude 3.7 have 128K-200K token windows, which sounds generous until you’re running a long-horizon task or maintaining user history across months of interactions. At scale, always-on context also has real cost implications: every token in context is a token you pay to process on every call.

External memory uses a database outside the model, retrieved at inference time. This is where RAG (retrieval-augmented generation) lives, along with vector stores like Chroma, Weaviate, Pinecone, and Qdrant. The agent queries the store, gets back relevant chunks, and injects them into context. This scales arbitrarily but introduces retrieval quality as a new failure mode: if the wrong chunks come back, the agent reasons from incomplete or misleading information, and the error is silent.

Parametric memory is what the model learned during training. It isn’t “memory” in the agentic sense since you can’t update it at runtime without fine-tuning, but it matters because it sets the baseline competency. When an agent knows Python syntax or common API patterns, that’s parametric memory at work. The agent’s ability to reason usefully from retrieved information depends heavily on this foundation.

Most production systems combine all three. The question is how to balance them.

The Cognitive Science Framing

Researchers building memory-augmented agents often borrow from cognitive science, specifically the taxonomy of human memory types:

Working memory: what’s actively in mind right now. Maps directly to the context window.
Episodic memory: memories of specific events. Maps to retrievable records of past interactions.
Semantic memory: general facts and knowledge. Maps to a knowledge base or the model’s parametric knowledge.
Procedural memory: how to do things. Maps to system prompts, tool definitions, and learned behavioral patterns.

This framing clarifies what you’re actually building when you add “memory” to an agent. A vector store of past conversations is episodic memory. A knowledge base populated with domain facts is semantic memory. A carefully crafted system prompt encoding behavioral rules is procedural memory. Mixing these without recognizing the distinction leads to architectures that retrieve the wrong kind of information for the task at hand.

MemGPT: The Virtual Memory Metaphor

The paper “MemGPT: Towards LLMs as Operating Systems” from UC Berkeley (2023) made the most explicit analogy between OS memory management and LLM context management. The core idea: treat the context window like RAM and external storage like disk. The LLM explicitly manages what gets paged in and out, just as an OS manages physical memory.

In MemGPT’s architecture, the “main context” holds the current system prompt, recent conversation, and a small set of retrieved documents. “External context” holds the full conversation history, a persona and user model, and an archival store. The model itself decides when to call memory functions to read from or write to external storage.

# MemGPT-style memory function signatures the model can call
def search_archival_memory(query: str) -> list[str]:
    """Retrieve relevant memories from long-term storage"""
    ...

def insert_archival_memory(content: str) -> None:
    """Write a new memory to long-term storage"""
    ...

def recall_memory(query: str) -> list[dict]:
    """Search conversation history"""
    ...

This differs from standard RAG, where retrieval happens mechanically before the model sees anything. MemGPT gives the model agency over its own memory management, which produces better behavior on long-horizon tasks but adds latency and depends on the model following tool-use instructions consistently. In practice, models sometimes forget to search before answering, which defeats the purpose.

Letta, the company that grew out of the MemGPT research, has productized this architecture and continues to develop it. Their open-source framework treats stateful agents with managed memory as a first-class primitive.

The Write Problem

Most discussion of agent memory focuses on the read side: how to retrieve the right context at the right time. The write side is equally hard and gets less attention.

When should an agent write to memory? A few approaches are in use.

Write everything: log every interaction to the store. Safe and auditable, but creates noise and grows unbounded. Retrieval quality degrades as the store fills with redundant or low-signal content.

Write on explicit cue: the agent decides what to remember. This is what MemGPT does. The problem is that models are inconsistent about what they choose to memorize, and they miss important information when they don’t recognize it as significant in the moment.

Write on compression: periodically summarize recent interactions and write the summary. Many production chatbot systems use this approach. The compression step is lossy, which causes subtle failures when the specific wording of a previous instruction mattered.

Write on schema extraction: identify structured facts from interactions and write to a key-value or relational store. This works well for factual attributes (“user prefers TypeScript”, “user’s timezone is UTC-5”) but handles unstructured knowledge poorly.

Mem0, an open-source memory layer for AI agents, uses a hybrid approach: it runs an extraction step after each interaction to identify facts worth retaining, deduplicates against existing memories, and writes structured records. The extraction uses the LLM itself, which adds latency but produces higher-quality signal than embedding-based filtering alone.

Retrieval Quality as a First-Class Concern

With external memory, retrieval quality determines how useful the memory is. Vector similarity search finds semantically related content, but “semantically related” and “contextually relevant” are not the same thing.

Consider a user who, in session one, mentioned they’re building a Discord bot in Python. In session five, they ask about rate limiting. A naive vector search for “Discord rate limiting” might return API documentation chunks but miss the earlier mention of their Python context, which shapes how you’d usefully answer.

Retrieval approaches in production use:

Dense retrieval: embed the query and documents, find nearest neighbors. Fast and scalable, but struggles with exact keyword matching and rare terms.

Sparse retrieval (BM25): classic term-frequency matching. Good for exact terms, weaker at semantic generalization.

Hybrid search: combine dense and sparse scores. Most mature vector databases support this now. Weaviate, Qdrant, and Elasticsearch all have hybrid search modes that mix both signals.

Reranking: use a cross-encoder model to rescore top candidates after initial retrieval. Adds 50-200ms of latency per call in typical deployments but improves relevance substantially. Cohere Rerank, ColBERT, and BGE-reranker are common choices.

GraphRAG: Microsoft’s approach, which builds a knowledge graph from documents and retrieves via graph traversal in addition to semantic search. Handles multi-hop reasoning better than pure vector search but is significantly more complex to operate.

Context Budget Management

Even with good retrieval, you have to manage how much retrieved content goes into the context window. This is a budget allocation problem.

A typical context budget for a tool-using agent might look like:

System prompt:          ~2,000 tokens  (fixed)
Recent conversation:    ~4,000 tokens  (sliding window)
Retrieved memories:     ~6,000 tokens  (dynamic)
Tool definitions:       ~1,500 tokens  (fixed per capability)
User message:           variable
Total reserved:         ~14,000 tokens minimum

With a 128K context window, you have substantial headroom. With a 16K window, which still appears in cost-constrained deployments, budget management becomes critical. Exceeding the budget means truncating something, and what you truncate determines what the agent forgets.

Some systems use a tiered approach: recent messages get full token allocation, older messages get progressively summarized, and the oldest messages are dropped or archived. This mirrors how human working memory functions, with recency bias and lossy compression of older episodes. The tradeoff is that aggressive summarization loses specifics that turn out to matter later.

The Operational Reality

The choice between memory architectures is not only a technical one. In-context memory is easy to debug: log the full prompt and inspect what the model saw. External memory systems have more moving parts. The embedding model, the vector store, the retrieval parameters, and any reranking step are all independent failure points. When an agent gives a wrong answer based on retrieved memory, diagnosing whether the failure was in retrieval, in relevance ranking, or in reasoning requires tooling that most teams haven’t built yet.

The right architecture depends on several factors:

Time horizon: single session versus persistent user history versus shared knowledge across agent instances
Update frequency: how often memories need to be corrected or superseded
Query patterns: broad semantic search versus exact key lookup versus relational queries
Latency requirements: external retrieval adds measurable latency that accumulates across an agentic loop
Cost: embedding, storing, and retrieving at scale carries real infrastructure cost

Where the Field Is Converging

Memory is increasingly treated as a modular architectural component rather than an afterthought. LangGraph treats memory as explicit state passed between graph nodes, making the flow of information across an agent’s reasoning steps visible and inspectable. The Model Context Protocol from Anthropic standardizes how tools and context sources connect to models, which shapes how memory systems can plug in as first-class resources. The OpenAI Assistants API includes built-in file-based retrieval as a native capability.

The deeper shift is that memory is increasingly understood as a system design problem, not a prompting problem. How you structure, retrieve, and inject memory determines agent behavior as much as the base model does. A well-designed memory system lets an agent accumulate useful context across interactions and degrade gracefully when relevant history is absent. A poorly designed one produces an agent that confidently reasons from stale or irrelevant context, which is often worse than no memory at all.

The hardest part isn’t storage or retrieval in isolation. It’s knowing when to write, what to write, and how to surface the right subset at inference time without overwhelming the context budget. Those three problems interact with each other in ways that make memory systems genuinely difficult to tune in production.