Stale Facts and the Consistency Problem in AI Agent Memory

Every agent memory system carries an assumption that rarely gets named: that the things it has stored are still true. A user preference captured in January may be wrong by March. A decision made about a database schema may be reversed a week later. A project dependency noted during one session may be replaced before the next. The system has no way to know unless something explicitly tells it.

Tom Bedor’s survey of memory approaches touches on the write path and the choice between episodic, semantic, and procedural storage. But the consistency problem, keeping stored memories coherent as the underlying reality changes, is a distinct issue from the write-time decision about what to store. You can build an excellent extraction pipeline that produces clean, well-structured memory entries and still end up with a system that confidently reasons from stale facts three months later.

What consistency means in traditional databases

Relational databases have mature tooling for handling changing facts. Multi-version concurrency control (MVCC) keeps a history of row versions, allowing reads to see a consistent snapshot even as writes proceed. Temporal tables, standardized in SQL:2011, store validity periods alongside each row:

-- Read the preference as the table recorded it on 15 January 2026
SELECT value FROM preferences
FOR SYSTEM_TIME AS OF '2026-01-15'
WHERE user_id = 'alice' AND key = 'primary_language';

A row that was current from January to March drops out of ordinary queries in April, but it remains recoverable for historical analysis. Primary keys and unique constraints enforce entity integrity, so two conflicting facts about the same attribute cannot coexist unless the schema explicitly permits it.

None of this transfers to unstructured text stored as embeddings in a vector database. There are no foreign key constraints on meaning. A vector store does not know that “I work mainly in Python” and “I’ve switched to TypeScript for everything” describe the same attribute of the same entity. Both entries sit as equally valid high-dimensional points, equally retrievable by any query that touches the semantic neighborhood of programming language preferences.

Why the embedding space has no concept of supersession

This is the structural problem. Cosine similarity measures proximity in embedding space; it says nothing about whether one fact should be believed over another, or whether one was recorded more recently, or whether they describe the same real-world state.
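A small sketch makes the point concrete. The three-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions and come from a model), and the `recorded` field is an assumed piece of metadata. The ranking function never looks at it.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the numbers are made up for this example.
memories = [
    {"text": "I work mainly in Python", "recorded": "2026-01-10",
     "vec": [0.9, 0.4, 0.1]},
    {"text": "I've switched to TypeScript for everything", "recorded": "2026-03-02",
     "vec": [0.8, 0.5, 0.2]},
    {"text": "My cat is named Miso", "recorded": "2026-02-01",
     "vec": [0.1, 0.1, 0.9]},
]

# Stand-in for the embedded query "what language does this user prefer?"
query_vec = [0.85, 0.45, 0.15]

# Rank purely by similarity; the "recorded" date plays no part.
ranked = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
for m in ranked:
    print(f"{cosine(query_vec, m['vec']):.3f}  {m['text']}")
```

Both language memories come back with near-identical scores, and nothing in the arithmetic distinguishes the January statement from the March one.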

The standard workaround is metadata. Tag each stored memory with a timestamp, a user ID, and a topic label. At retrieval time, filter by user and apply a recency preference. This works for clear-cut cases where you can identify the key and retrieve the most recent version. It breaks down in three situations:

The contradicting entries describe the same thing using different surface forms: “I use a Mac,” “I prefer macOS over Windows,” and “I installed Homebrew” may be about the same attribute or may carry independent meaning.
The contradiction is implicit rather than explicit: “I usually work alone” from January sits alongside a March message mentioning a deadline the team is pushing against, with no flag marking the earlier fact as stale.
The newer fact is adjacent rather than a direct replacement: “I mostly use Python” in January versus “I’ve been learning Rust lately” in March, where neither cleanly supersedes the other.

Metadata filtering solves the first case sometimes, the second never, and the third only if you’ve already modeled the attribute as a mutable key.
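As a rough sketch of the metadata workaround and where it stops helping: the record layout, the topic labels, the `latest_for` helper, and the example messages are all assumptions made for illustration, not any particular library's schema.

```python
from datetime import date

# Hypothetical memory records tagged with user, topic label, and timestamp.
memories = [
    {"user_id": "alice", "topic": "primary_language", "recorded": date(2026, 1, 10),
     "text": "I work mainly in Python"},
    {"user_id": "alice", "topic": "primary_language", "recorded": date(2026, 3, 2),
     "text": "I've switched to TypeScript for everything"},
    {"user_id": "alice", "topic": "collaboration", "recorded": date(2026, 1, 12),
     "text": "I usually work alone"},
    {"user_id": "alice", "topic": "deadlines", "recorded": date(2026, 3, 9),
     "text": "The team is pushing against a deadline"},
]

def latest_for(store, user_id, topic):
    """Return the most recent memory for (user_id, topic), or None if absent."""
    candidates = [m for m in store
                  if m["user_id"] == user_id and m["topic"] == topic]
    return max(candidates, key=lambda m: m["recorded"], default=None)

# Clear-cut case: same key, newest entry wins.
print(latest_for(memories, "alice", "primary_language")["text"])

# Implicit case: the March message was filed under a different topic,
# so the January fact is still returned as current.
print(latest_for(memories, "alice", "collaboration")["text"])
```

The second lookup still returns the January entry because recency only helps once two entries have been recognized as sharing a key.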

How mem0 approaches it

mem0 runs an LLM reconciliation pass on each new fact before storing it. The incoming fact is embedded and compared against existing memories; semantically close matches are passed to the LLM to determine whether the new entry contradicts, duplicates, or extends the existing ones. When a contradiction is found, the newer version replaces the older:

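from mem0 import Memory

# Default configuration; this assumes LLM and embedding provider
# credentials are already available in the environment.
m = Memory()
m.add("I prefer Python for all my projects", user_id="alice")
m.add("I've switched to TypeScript for most things now", user_id="alice")
# The LLM reconciliation pass is expected to detect the contradiction,
# so the Python preference is superseded rather than kept alongside.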

This handles unambiguous cases reasonably well. The failure mode is the nuanced or conditional statement: “I prefer TypeScript now” after “I prefer Python” is clear enough to reconcile; “I use TypeScript at work but Python for scripts” after “I prefer Python” requires a more careful merge that semantic similarity thresholds and a single LLM call don’t reliably produce. The reconciliation prompt has to work for all possible combinations of incoming and existing facts, which is a demanding generalization.
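The shape of such a reconciliation step can be sketched independently of any library. This is not mem0's implementation: the verdict names, the `reconcile` function, and the hard-coded `toy_judge` (a stand-in for the LLM call) are all hypothetical, and the neighbor indexes would normally come from a similarity search.

```python
from typing import Callable

# Illustrative verdict labels; these are not mem0's names.
CONTRADICTS, DUPLICATES, EXTENDS = "contradicts", "duplicates", "extends"

def reconcile(store: list[str], new_fact: str, neighbors: list[int],
              judge: Callable[[str, str], str]) -> list[str]:
    """Return a new store after reconciling new_fact with nearby entries.

    neighbors: indexes of existing entries flagged as semantically close.
    judge: stand-in for an LLM call; takes (old, new), returns a verdict.
    """
    result = list(store)
    keep_new = True
    # Walk indexes from highest to lowest so deletions don't shift later ones.
    for i in sorted(neighbors, reverse=True):
        verdict = judge(result[i], new_fact)
        if verdict == CONTRADICTS:
            del result[i]        # the newer fact supersedes the older one
        elif verdict == DUPLICATES:
            keep_new = False     # nothing new to record
        # EXTENDS: keep both entries side by side
    if keep_new:
        result.append(new_fact)
    return result

def toy_judge(old: str, new: str) -> str:
    """Hard-coded verdicts for the demo; a real judge would prompt a model."""
    if old == new:
        return DUPLICATES
    if "prefer Python" in old and "switched to TypeScript" in new:
        return CONTRADICTS
    return EXTENDS

store = ["I prefer Python for all my projects"]
store = reconcile(store, "I've switched to TypeScript for most things now",
                  [0], toy_judge)
print(store)
```

Feeding the conditional statement through the same function shows the weak spot: a judge that answers "extends" leaves "I prefer Python for all my projects" sitting next to "I use TypeScript at work but Python for scripts", and no merged fact is ever produced.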

How Zep’s knowledge graph handles it

Zep takes the most architecturally complete approach: it builds a temporal knowledge graph where entities are nodes and relationships are typed, timestamped edges with explicit validity windows. “Works primarily in Python” becomes an edge from the user node to the Python node with a start timestamp. When “now mainly uses TypeScript” arrives, Zep ends the validity window on the Python edge and opens a new one to TypeScript. Queries filtered to the current timestamp automatically see the current state; historical queries can still traverse the older edges.
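The validity-window idea can be illustrated with a few lines of plain Python. This is a minimal sketch and not Zep's API: the `Edge` and `TemporalGraph` classes are hypothetical, and the sketch assumes each (subject, relation) pair holds only one current value at a time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Edge:
    subject: str
    relation: str
    obj: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still valid"

class TemporalGraph:
    def __init__(self) -> None:
        self.edges: list[Edge] = []

    def assert_fact(self, subject: str, relation: str, obj: str,
                    at: datetime) -> None:
        """Close any open edge for (subject, relation), then open a new one."""
        for e in self.edges:
            if (e.subject == subject and e.relation == relation
                    and e.valid_to is None):
                e.valid_to = at
        self.edges.append(Edge(subject, relation, obj, valid_from=at))

    def as_of(self, subject: str, relation: str, at: datetime) -> list[str]:
        """Objects whose validity window contains the timestamp `at`."""
        return [
            e.obj for e in self.edges
            if e.subject == subject and e.relation == relation
            and e.valid_from <= at and (e.valid_to is None or at < e.valid_to)
        ]

g = TemporalGraph()
g.assert_fact("alice", "primary_language", "Python", datetime(2026, 1, 10))
g.assert_fact("alice", "primary_language", "TypeScript", datetime(2026, 3, 2))

print(g.as_of("alice", "primary_language", datetime(2026, 2, 1)))   # historical
print(g.as_of("alice", "primary_language", datetime(2026, 4, 1)))   # current
```

The old edge is never deleted, only closed, which is what keeps historical queries answerable.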

This is the closest analog to SQL temporal tables, applied to the unstructured text domain. Multi-hop queries, such as “what languages does this user know and for how long,” become graph traversals; the same questions are much harder to express in pure embedding search. The operational overhead is real: you’re running a graph construction pipeline over unstructured text, which is imperfect, and the system requires PostgreSQL with pgvector and optionally Neo4j for the full graph layer.

The deeper limitation is that graph construction from unstructured text makes mistakes. “My colleague Sarah” and “my manager” may or may not refer to the same person; the graph builder has to decide, and it won’t always decide correctly. Entity resolution errors compound over time, producing a graph where some nodes represent multiple real-world entities and some real-world entities are scattered across multiple nodes.

How Letta handles it, and why it’s the hardest approach operationally

Letta, the production form of MemGPT, gives the model explicit control over its own memory updates. The model calls core_memory_replace(label, old_content, new_content) when it determines an existing memory should be updated. This is principled: the model applies its own reasoning to the consistency decision rather than a mechanical rule.

The failure mode is that model consistency maintenance is only as reliable as model reasoning, and models are inconsistent. They forget to check existing memories before writing new ones; they miss the implication that a new piece of information supersedes an older one; they hallucinate memory entries that don’t exist when asked to retrieve. The system’s consistency guarantees are bounded by model quality in a way that automated pipelines avoid. That bound is significant when deploying smaller models or models under cost pressure.
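To show what a model-invoked replace tool has to do, here is a toy stand-in. It is not Letta's code: the `CoreMemory` class is hypothetical, and raising an error when the quoted text is missing is a design choice assumed for this sketch, so that a hallucinated "old" entry fails loudly instead of silently doing nothing.

```python
class CoreMemory:
    """Toy stand-in for labeled, editable memory blocks (not Letta's classes)."""

    def __init__(self, blocks: dict[str, str]) -> None:
        self.blocks = dict(blocks)

    def replace(self, label: str, old_content: str, new_content: str) -> None:
        """Swap old_content for new_content inside the block named label."""
        current = self.blocks[label]
        if old_content not in current:
            # The caller quoted text that isn't in the block: refuse the edit.
            raise ValueError(f"{old_content!r} not found in block {label!r}")
        self.blocks[label] = current.replace(old_content, new_content)

memory = CoreMemory({"human": "Name: Alice. Prefers Python for all projects."})

# What a correct tool call from the model would amount to.
memory.replace("human", "Prefers Python for all projects",
               "Mainly uses TypeScript now")
print(memory.blocks["human"])
```

The mechanics are trivial; everything that matters happens before the call, when the model has to notice that an update is needed and quote the existing text accurately.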

The implicit contradiction problem

The hardest case is the implicit contradiction, and none of the systems above solve it well. A user’s situation changes in ways that don’t produce explicit statements. “I usually work alone” from six months ago; a recent message about a team deadline. Nothing in the second message declares the first fact stale, but a human with context would update their model of the situation.

The Generative Agents paper from Stanford is the most serious attempt to address this at the architecture level. Periodic reflection passes, in which the agent reviews its own recent memories and generates higher-order summaries, can surface changed circumstances that individual memories don’t flag as contradictions. A reflection pass that reads “user mentioned team deadline” alongside “user prefers working alone” can produce a synthesis note that flags the tension. The cost is multiple LLM calls per reflection pass, which limits how often this can run in cost-constrained production.

What a practical system looks like

When building the memory system for my Discord bot, the approach that worked best for mutable facts was not embeddings at all, but a keyed structure. Preferences and operational decisions are stored under explicit keys derived from a schema at write time: user:alice:primary_language, server:1234:pr-review-channel, bot:conventions:commit-style. A new value for an existing key atomically replaces the old one. Contradiction is structurally impossible because the key space is designed to have at most one current value per attribute.
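A minimal in-memory sketch of the keyed structure follows. The `KeyedFacts` class is hypothetical and the lock is only one assumed way to make the overwrite atomic within a single process; a real deployment would more likely lean on a database upsert.

```python
import threading
from typing import Optional

class KeyedFacts:
    """At most one current value per key; writes overwrite under a lock."""

    def __init__(self) -> None:
        self._facts: dict[str, str] = {}
        self._lock = threading.Lock()

    def set(self, key: str, value: str) -> Optional[str]:
        """Store value under key and return the value it replaced, if any."""
        with self._lock:
            previous = self._facts.get(key)
            self._facts[key] = value
            return previous

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            return self._facts.get(key)

facts = KeyedFacts()
facts.set("user:alice:primary_language", "Python")
replaced = facts.set("user:alice:primary_language", "TypeScript")

print(replaced)                                   # the value that was overwritten
print(facts.get("user:alice:primary_language"))   # the single current value
```

Returning the replaced value is optional, but it gives the caller a cheap hook for logging what changed.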

This isn’t general. It requires knowing at schema design time which attributes you want to track as mutable, named facts. For unstructured episodic content, confidence decay approximates staleness: entries not referenced in 30 days have their confidence score reduced; entries below threshold for 90 days are garbage collected. This doesn’t detect implicit contradictions, but it does prevent the memory store from indefinitely treating six-month-old context as current.
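The decay rule can be sketched as a periodic sweep. The 30-day and 90-day windows come from the description above; the halving factor, the 0.3 threshold, and the `Entry` and `sweep` names are assumptions picked for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

IDLE_BEFORE_DECAY = timedelta(days=30)  # from the description above
GC_AFTER_BELOW = timedelta(days=90)     # from the description above
DECAY_FACTOR = 0.5                      # assumed value
THRESHOLD = 0.3                         # assumed value

@dataclass
class Entry:
    text: str
    last_referenced: datetime
    confidence: float = 1.0
    last_decayed: Optional[datetime] = None
    below_since: Optional[datetime] = None

def sweep(entries: list[Entry], now: datetime) -> list[Entry]:
    """Decay idle entries and drop those below threshold for too long."""
    kept = []
    for e in entries:
        # Decay at most once per idle window since the last reference or decay.
        anchor = max(e.last_referenced, e.last_decayed or e.last_referenced)
        if now - anchor >= IDLE_BEFORE_DECAY:
            e.confidence *= DECAY_FACTOR
            e.last_decayed = now
        # Track how long the entry has been under the threshold.
        if e.confidence < THRESHOLD:
            e.below_since = e.below_since or now
        else:
            e.below_since = None
        if e.below_since is not None and now - e.below_since >= GC_AFTER_BELOW:
            continue  # garbage collect
        kept.append(e)
    return kept

entries = [
    Entry("user mentioned a conference trip", datetime(2026, 1, 1)),
    Entry("user asked about CI caching", datetime(2026, 6, 1)),
]
for when in (datetime(2026, 2, 1), datetime(2026, 3, 5), datetime(2026, 6, 10)):
    entries = sweep(entries, when)
print([e.text for e in entries])
```

In a fuller version, any retrieval that actually uses an entry would refresh `last_referenced`, which is what lets frequently used context survive.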

The limitation of the keyed approach is obvious: it only works when you can name the attribute in advance. The category of memory that doesn’t fit a predefined schema, which is most of what makes agents genuinely useful over long horizons, still requires either a knowledge graph, LLM-mediated reconciliation, or accepting that some stale facts will persist.

Where this leaves the design question

The tombedor.dev taxonomy is a useful map of storage types and retrieval strategies. What it points toward, without fully resolving, is that the consistency requirements differ meaningfully across tiers. Episodic memories can tolerate staleness because they’re anchored to specific past moments; semantic memories cannot, because they’re relied upon as current state. A system that treats both the same will eventually surface stale semantic facts with high confidence, simply because they scored well on similarity.

The practical design question, for any long-running agent, is not just “where do I store this” but “what consistency guarantee does this tier need to provide.” The answer shapes whether you need a keyed overwrite mechanism, a temporal knowledge graph, model-controlled update functions, or some combination. Getting that wrong tends to be a quiet failure: the agent answers confidently, the answer is wrong, and the source is a fact that was true once and no longer is.