Causal Memory for LLM Agents: The Architecture Behind ALTK-Evolve

Most LLM-based agents have a memory problem that goes deeper than context length. They can be given access to past conversations, vector-searched fragments of prior interactions, and summary transcripts, and still repeat the same mistakes the next time a similar task arrives. The issue lies in what gets stored, not in how much can be stored.

IBM Research’s ALTK-Evolve calls this the “eternal intern” problem: an agent that processes every task from scratch, re-reads past transcripts instead of extracting transferable lessons, and never accumulates the working knowledge that would make it more effective over time. The accompanying paper, “Trajectory-Informed Memory Generation for Self-Improving Agent Systems”, draws a sharp distinction between episodic memory (what happened) and procedural guidance (what to do differently). Most agent memory systems, including Mem0, Letta, and basic RAG-over-transcripts pipelines, operate in the episodic mode. ALTK-Evolve operates in the procedural one.

The Learning Taxonomy

The framework introduces three categories of learnable tips, and the taxonomy is worth understanding before looking at the technical pipeline.

Strategy tips come from clean successful executions. When an agent completes a task efficiently and correctly, the system extracts the pattern: what triggered the sequence, which API calls were made, what context was established before proceeding. These are the most familiar type of memory in agent systems, roughly analogous to what Agent Workflow Memory calls “workflows.”

Recovery tips come from failure-then-recovery sequences. When an agent hits an error at step 15, diagnoses a root cause missed at step 3, and successfully corrects course, there is valuable signal in that trajectory that purely success-focused systems discard entirely. The recovery tip includes a negative example: a concrete description of the suboptimal approach, not just the correct one. LLMs benefit from knowing what to avoid, and encoding that explicitly rather than hoping the model infers it is a meaningful design decision.

Optimization tips come from inefficient but ultimately successful executions. If an agent called remove_from_cart() in a loop ten times when empty_cart() existed and would have accomplished the same thing, that trajectory has signal. The system identifies the more efficient alternative and stores it with the trigger condition that should prompt its use.

Prior work tends to collapse all three categories into one or ignore two of them. Reflexion (Shinn et al., 2023) does verbal reinforcement within a retry loop, capturing failure-and-correction patterns but discarding them after the episode ends. Voyager (Wang et al., 2023) stores executable skill programs from successful Minecraft runs, with no concept of failure or inefficiency as a learning signal. Most retrieval-augmented memory systems, similarly, index what happened rather than why it succeeded or failed.

The Causal Attribution Pipeline

The extraction process is where ALTK-Evolve invests most of its engineering effort. Raw agent trajectories pass through a Trajectory Intelligence Extractor that classifies each agent thought as analytical, planning, validation, or reflection, then identifies cognitive patterns such as self-correction and API discovery before determining the outcome of the trajectory. This structured intermediate representation feeds into a Decision Attribution Analyzer.

The attribution analysis distinguishes four causal levels for failures: the immediate cause (what directly triggered the failure), the proximate cause (recent decisions that enabled it), the root cause (the underlying originating issue), and contributing factors. This is not a trivial distinction. An agent that calls the wrong API at step 15 may have made the enabling decision at step 3; conflating these in the stored tip produces guidance that addresses the symptom rather than the source.
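One way to picture the four causal levels is as a small record that keeps them apart. The sketch below is illustrative only: the class name FailureAttribution, its field names, the step fields, and the example values are all assumptions made for this article, not the paper's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FailureAttribution:
    """Hypothetical record separating the four causal levels of a failure."""
    immediate_cause: str   # what directly triggered the failure
    proximate_cause: str   # recent decisions that enabled it
    root_cause: str        # the underlying originating issue
    contributing_factors: list[str] = field(default_factory=list)
    failure_step: Optional[int] = None     # step where the error surfaced
    root_cause_step: Optional[int] = None  # step of the enabling decision

    def symptom_distance(self) -> Optional[int]:
        """Steps between the originating decision and the visible failure."""
        if self.failure_step is None or self.root_cause_step is None:
            return None
        return self.failure_step - self.root_cause_step


# Invented example mirroring the step-3 / step-15 scenario described above.
attribution = FailureAttribution(
    immediate_cause="Called the wrong API at step 15",
    proximate_cause="Reused an identifier without checking where it came from",
    root_cause="Chose the wrong app to query at step 3",
    contributing_factors=["No validation after the first call"],
    failure_step=15,
    root_cause_step=3,
)
print(attribution.symptom_distance())
```

Keeping the root cause in its own field is what lets a stored tip target the decision at step 3 rather than the error message at step 15.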

For recoveries, the analyzer traces what enabled the failure, how the agent recognized it, what corrective action was taken, and why the correction worked. For inefficiencies, it identifies what made execution suboptimal and what the more efficient alternative would have been. Each resulting memory entry carries a unique ID, tip category, actionable content, purpose explanation, concrete steps, trigger condition, optional negative example, application context, task category, priority level, and the source trajectory ID for provenance.
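The fields listed above map naturally onto a typed record. The following is a minimal sketch under stated assumptions: the names TipCategory and TipEntry, the field names, the string-valued priority, and the example values are hypothetical and are not taken from the ALTK-Evolve codebase.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TipCategory(Enum):
    STRATEGY = "strategy"          # from clean successful executions
    RECOVERY = "recovery"          # from failure-then-recovery sequences
    OPTIMIZATION = "optimization"  # from inefficient but successful runs


@dataclass
class TipEntry:
    """Hypothetical memory entry carrying the fields described in the text."""
    category: TipCategory
    content: str                 # the actionable guidance itself
    purpose: str                 # why the tip exists
    steps: list[str]             # concrete steps to follow
    trigger: str                 # condition that should prompt its use
    application_context: str
    task_category: str
    priority: str                # assumed to be a label such as "high"
    source_trajectory_id: str    # provenance
    negative_example: Optional[str] = None
    tip_id: str = field(default_factory=lambda: uuid.uuid4().hex)


# Invented example based on the cart scenario from the taxonomy section.
tip = TipEntry(
    category=TipCategory.OPTIMIZATION,
    content="Use empty_cart() instead of removing items one at a time.",
    purpose="Avoid redundant API calls when clearing a cart.",
    steps=["Check whether every item must go", "Call empty_cart() once"],
    trigger="The task requires removing all items from a cart",
    application_context="shopping",
    task_category="cart management",
    priority="medium",
    source_trajectory_id="traj-0001",
    negative_example="Calling remove_from_cart() in a loop for each item",
)
```

Because the source trajectory ID travels with every entry, removing a bad tip or auditing where it came from is a lookup rather than an investigation.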

Subtask Granularity

Tips can be extracted at the task level (treating the entire trajectory as a unit) or at the subtask level (segmenting first, then extracting per segment). The paper’s ablations are clear: subtask-level extraction produces tips that transfer more reliably to novel tasks.

The reason is reusability. A task-level tip for “order a birthday gift and send a payment reminder” bundles authentication, cart management, checkout, and messaging into one unit. A subtask-level tip for “authenticate with a shopping service” applies to any task that requires that subtask, regardless of what follows. The system uses a two-phase segmentation: an LLM first segments the trajectory into logical subtasks with generalized descriptions, then a second LLM pass extracts two to four concrete, generalizable tips per segment.
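The two-phase flow can be sketched as a small function that takes the two LLM passes as parameters. Everything here is an assumption for illustration: the function name extract_subtask_tips, the step dictionaries, and the fake_segment and fake_extract stand-ins, which replace real LLM calls with trivial deterministic logic so the example runs on its own.

```python
from typing import Callable

Step = dict                        # one trajectory step; shape is assumed
Segment = tuple[str, list[Step]]   # (generalized description, steps)


def extract_subtask_tips(
    trajectory: list[Step],
    segment: Callable[[list[Step]], list[Segment]],
    extract: Callable[[str, list[Step]], list[str]],
    max_tips: int = 4,
) -> dict[str, list[str]]:
    """Phase 1 segments the trajectory; phase 2 extracts tips per segment."""
    tips_by_subtask: dict[str, list[str]] = {}
    for description, steps in segment(trajectory):
        tips = extract(description, steps)[:max_tips]  # cap tips per segment
        tips_by_subtask.setdefault(description, []).extend(tips)
    return tips_by_subtask


def fake_segment(trajectory: list[Step]) -> list[Segment]:
    # Stand-in for the first LLM pass: group steps by a pre-assigned label.
    groups: dict[str, list[Step]] = {}
    for step in trajectory:
        groups.setdefault(step["subtask"], []).append(step)
    return list(groups.items())


def fake_extract(description: str, steps: list[Step]) -> list[str]:
    # Stand-in for the second LLM pass: emit one templated tip.
    return [f"When you need to {description.lower()}, start with {steps[0]['api']}"]


trajectory = [
    {"subtask": "Authenticate with a shopping service", "api": "login"},
    {"subtask": "Authenticate with a shopping service", "api": "get_token"},
    {"subtask": "Manage cart contents", "api": "empty_cart"},
]
result = extract_subtask_tips(trajectory, fake_segment, fake_extract)
```

Keying the output by generalized subtask description is what makes the authentication tip reusable by any later task that begins the same way.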

Those generalized descriptions then pass through semantic clustering. “Retrieve Spotify password for john.doe@email.com” gets abstracted to “Retrieve service account credentials,” entity references are stripped out, and semantically equivalent descriptions (using cosine similarity at threshold ~0.85) are clustered so that conflicting tips from different sources can be merged according to trajectory outcome quality. Tips from successful trajectories rank above tips from failed ones during consolidation.
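A rough sketch of the clustering and ranking step follows. It assumes a simple greedy scheme that compares each description to the first member of each existing cluster; the paper's actual clustering algorithm may differ. The function names, the from_success flag, and the three-dimensional toy vectors are hypothetical, and a real system would use embeddings from a sentence-embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_descriptions(embedded, threshold: float = 0.85):
    """Greedy clustering: join the first cluster whose representative
    (its first member) is at least `threshold` similar."""
    clusters = []  # each cluster is a list of (description, vector) pairs
    for description, vector in embedded:
        for cluster in clusters:
            if cosine(vector, cluster[0][1]) >= threshold:
                cluster.append((description, vector))
                break
        else:
            clusters.append([(description, vector)])
    return [[d for d, _ in cluster] for cluster in clusters]


def rank_for_consolidation(tips):
    """Stable sort placing tips from successful trajectories first."""
    return sorted(tips, key=lambda tip: not tip["from_success"])


# Toy vectors standing in for real embeddings.
embedded = [
    ("Retrieve service account credentials", [0.9, 0.1, 0.0]),
    ("Look up stored login details for a service", [0.85, 0.2, 0.05]),
    ("Empty the shopping cart", [0.0, 0.2, 0.95]),
]
clusters = cluster_descriptions(embedded)

ranked = rank_for_consolidation([
    {"text": "Guess the password field name", "from_success": False},
    {"text": "Read credentials from the account store", "from_success": True},
])
```

With these toy vectors the two credential descriptions land in one cluster and the cart description in another, and the tip from the successful trajectory is ranked first.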

Retrieval and the SGC Metric

At runtime, the incoming task description is embedded and compared against stored subtask descriptions. Two retrieval strategies are available: cosine similarity filtering with a threshold and top-k selection (no additional LLM calls, low latency), or LLM-guided selection with metadata filtering (one LLM call per task, slower but more consistent).
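The cosine path amounts to score, filter, sort, and truncate. The sketch below assumes an in-memory list of (vector, tip) pairs and two-dimensional toy vectors; the function name retrieve_tips and the default threshold of 0.7 are placeholders chosen for illustration, not values from the paper.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_tips(query_vec, store, threshold: float = 0.7, top_k: int = 3):
    """Keep tips whose subtask embedding clears the threshold, then
    return the top_k highest-scoring ones. No LLM call is involved."""
    scored = [(cosine(query_vec, vec), tip) for vec, tip in store]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [tip for _, tip in kept[:top_k]]


# Toy store: (embedding of subtask description, tip text).
store = [
    ([1.0, 0.0], "Authenticate before calling cart APIs"),
    ([0.8, 0.6], "Prefer empty_cart() over repeated remove_from_cart()"),
    ([0.0, 1.0], "Paginate when listing playlists"),
]
query = [1.0, 0.1]

strict = retrieve_tips(query, store, threshold=0.9)
loose = retrieve_tips(query, store, threshold=0.05)
```

Here the strict threshold returns only the authentication tip, while the loose one also admits the unrelated playlist tip. That is a small-scale picture of how a permissive threshold lets irrelevant guidance into the prompt.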

The choice between them is where the SGC metric becomes important. Task Goal Completion (TGC) measures whether individual tasks pass their unit tests. Scenario Goal Completion (SGC) measures whether all variants of a scenario pass simultaneously, a much stricter consistency criterion. A system that passes only three of five variants of each scenario scores 0% SGC even though its TGC is a respectable-looking 60%.
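The gap between the two metrics is easy to see in code. This sketch assumes results are stored as a mapping from scenario name to a list of per-variant pass/fail booleans; the function names and data layout are illustrative and are not AppWorld's evaluation harness.

```python
def tgc(results: dict[str, list[bool]]) -> float:
    """Task Goal Completion: fraction of individual tasks that pass."""
    outcomes = [passed for variants in results.values() for passed in variants]
    return sum(outcomes) / len(outcomes)


def sgc(results: dict[str, list[bool]]) -> float:
    """Scenario Goal Completion: fraction of scenarios in which
    every variant passes."""
    return sum(all(variants) for variants in results.values()) / len(results)


# Two scenarios, each passing three of five variants.
results = {
    "scenario_a": [True, True, True, False, False],
    "scenario_b": [True, True, True, False, False],
}
print(f"TGC = {tgc(results):.0%}, SGC = {sgc(results):.0%}")
```

A single failing variant zeroes out its whole scenario under SGC, which is why the metric rewards consistent behavior rather than a good average.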

On the AppWorld benchmark, a suite of 750 complex tasks across nine simulated applications with 457 APIs and 1,470 arguments, cosine retrieval yields slightly higher TGC (+4.2 pp vs. +3.6 pp for LLM-guided), but LLM-guided retrieval achieves a +14.3 pp SGC improvement compared to +7.1 pp for cosine. If consistency across task variants matters more than the raw per-task pass rate, LLM-guided selection is the better choice.

The Hard difficulty results are the most striking. Hard tasks in AppWorld average 9.5 API calls across 1.8 apps with up to 26 unique APIs. The best configuration improves SGC on Hard tasks from 19.1% to 47.6%, a gain of 28.5 pp and a 149% relative improvement. The low baseline reflects how poorly general-purpose agents perform on multi-hop, multi-app coordination without accumulated task-specific knowledge. It also suggests that the agents with the most to gain from this kind of memory are precisely the ones tackling complex, multi-step work rather than simple single-API queries.

One ablation worth noting: setting the cosine similarity threshold to 0.5 with top-3 selection pushes TGC below the no-memory baseline entirely. Retrieving loosely matched, irrelevant tips is worse than retrieving nothing, which reinforces the design principle behind LLM-guided selection.

How It Compares to Weight-Update Approaches

The contrast with RLHF is worth examining. RLHF requires training a reward model from human preference data, then fine-tuning the base model with PPO against that reward signal. The reward formula penalizes KL divergence from the reference model, and meaningful improvement typically requires around 50K labeled preference pairs. The learned behavior gets embedded into weights in a way that is opaque and difficult to audit.
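For reference, the KL-penalized objective is commonly written in the form below. This is the standard formulation from the RLHF literature, not something taken from the ALTK-Evolve paper, and the notation is chosen here: r_φ is the learned reward model, π_θ the policy being fine-tuned, π_ref the frozen reference model, D the prompt distribution, and β the penalty weight.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}
  \big[ r_{\phi}(x, y) \big]
  \;-\;
  \beta \, D_{\mathrm{KL}}\!\big(
    \pi_{\theta}(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x)
  \big)
\Big]
```

The second term is what keeps the tuned policy close to the reference model. Whatever is learned ends up distributed across θ, which is why it cannot be inspected or removed the way a stored tip can.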

ALTK-Evolve learns from production trajectories with no weight updates, no preference labels, and no reward model. The memory is structured, readable, and carries provenance. Each tip traces back to the trajectory it came from; if a tip turns out to be wrong, it can be removed. The consolidation layer can demote tips from failed trajectories over time. This interpretability is a core property of the design, not an incidental benefit, and it makes the system substantially easier to debug and correct than a fine-tuned model.

Deployment and Limitations

The framework offers three integration tiers. The Lite mode adds a plugin to Claude Code, Codex, or IBM’s internal tooling with no code changes. The low-code path uses a single import altk_evolve.auto statement to wrap any OpenAI or LiteLLM agent. The pro-code path integrates via Model Context Protocol with explicit calls:

# Before task execution:
guidelines = get_guidelines()   # retrieve relevant tips

# After task execution:
save_trajectory(trace)          # extract and store new tips

The limitations are real. The evaluation uses GPT-4 throughout; open-source models remain untested. Medium-difficulty tasks show no SGC improvement in the best configuration, which the paper does not fully explain. Dev partition results (+26.3 pp SGC) dramatically exceed Test-Normal results (+14.3 pp SGC), suggesting that proximity to the tip-generation data still matters for retrieval quality. And the framework is single-agent: causal attribution across concurrent agents in multi-agent workflows remains an open problem.

For anyone building production agents today, the core insight is worth separating from the full framework. Most of what an agent learns from any given run gets discarded, because the event log gets saved while the derived principle does not. ALTK-Evolve’s contribution is making the distinction between those two modes of storage architectural, building a pipeline that extracts causal structure, classifies learning type, and generalizes to subtask granularity before anything reaches the vector store. Whether the specific numbers generalize beyond GPT-4 on AppWorld is an open question, but the design reasoning is sound and the gap it addresses is clearly there.