Closing the Adaptation Gap: How ALTK-Evolve Teaches AI Agents to Learn From Their Own Deployments

Most AI agents arrive at a deployment and never get better at it. An agent that consistently misinterprets a specific error code from a legacy service will misinterpret it on the thousandth run the same way it did on the first. The model weights have not changed, the environment has not changed, and nothing has been learned.

The underlying failure mode is an adaptation gap. The model knows everything it learned during pretraining and nothing it has observed since deployment. Every session begins from the same starting state: same weights, same default behaviors, same blind spots. A human contractor picks up environment-specific knowledge within the first week: which internal API has an undocumented rate limit, which legacy service requires retrying with alternate parameters on a specific error code, which multi-step workflow silently fails if you skip an intermediate verification step. An AI agent, in default configuration, never accumulates any of this.

IBM Research describes this gap in its recent work on ALTK-Evolve (arXiv:2603.10600), citing a figure worth pausing on: an MIT study found that 95% of AI agent pilots fail, with adaptation being a central factor. The analogy the authors use is a line cook who has memorized every cookbook but forgets the kitchen every morning, never learning that the oven runs hot, never remembering that the regulars want extra salt.

What Not to Do: Trajectory Replay

The naive solution is to feed previous runs back into the context window: store the last N agent trajectories, append them to the system prompt, and let the model reason across runs. This approach has a fundamental flaw: it does not compress. Each trajectory might be thousands of tokens, so relevant patterns get buried in irrelevant details. The model spends its context budget re-reading history rather than extracting generalizations from it, and the signal-to-noise ratio degrades with every additional session logged.

The academic literature has explored this space. Reflexion (Shinn et al., 2023) addressed it by having agents reflect on failure trajectories to generate verbal self-critique, stored in an episodic buffer. This was a meaningful step: instead of replaying raw trajectories, the agent crystallizes a text reflection from each run and retrieves those reflections on subsequent runs. The limitation is that Reflexion is per-episode and append-only. The buffer grows without bound, there is no mechanism for merging redundant insights, and weak or incorrect reflections survive alongside useful ones indefinitely.

ExpeL (Zhao et al., 2023) went further, extracting structured insights from both successful and failed trajectories into a reusable experience pool retrieved at inference time. ExpeL is the most direct academic predecessor to ALTK-Evolve. The gap is in production engineering: ExpeL demonstrates the research pattern; ALTK-Evolve adds the infrastructure for running it continuously at deployment scale.

The ALTK-Evolve Architecture

ALTK-Evolve is not a fine-tuning system. It operates entirely at inference time, requires no gradient updates to the underlying model, and is compatible with any LLM accessible through a standard API. The architecture divides into three layers and two flows.

The Application Layer is the user-facing surface. The Interaction Layer handles observation and retrieval: it captures full agent trajectories using OpenTelemetry-compatible tracing (the paper names Langfuse and Arize Phoenix as compatible observability backends), and serves as the injection point where retrieved guidelines reach the agent just before each run. The Entity Memory Layer is where the substance lives: persistent storage of extracted entities (guidelines, policies, SOPs) along with background consolidation, scoring, and pruning.

The two flows are what distinguish this from a fancier trajectory buffer. The downward flow observes raw trajectories and runs pluggable extractors to mine them for structural patterns, persisting results as candidate entities. The upward flow runs background consolidation: merging duplicate insights, scoring entities by demonstrated utility across runs, and pruning entities that consistently correlate with poor outcomes. At inference time, the top-k relevant entities (five, in the reported evaluation) are retrieved and injected into the agent’s context.

The consolidation and scoring loop is the intellectual core of the system. Entities that prove useful across many independent runs accumulate evidence and get promoted; entities that appear in failure trajectories accumulate negative signal and get pruned. The result is a self-curating knowledge base rather than a growing junk drawer, which is what keeps retrieval useful as the store scales. Context injection degrades once memory grows large enough to crowd out task-relevant information; pruning is meant to keep the store from reaching that point.
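To make the promote-and-prune idea concrete, here is a minimal Python sketch. It is not the ALTK-Evolve implementation: the Entity schema, the smoothed success-rate score, and the thresholds are all assumptions chosen for illustration. Only the top-k default of five comes from the reported evaluation.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """A stored guideline plus the evidence gathered about it (hypothetical schema)."""
    text: str
    successes: int = 0
    failures: int = 0

    @property
    def utility(self) -> float:
        # Assumed scoring rule: Laplace-smoothed success rate.
        # An entity with no evidence yet starts at 0.5.
        return (self.successes + 1) / (self.successes + self.failures + 2)


def record_outcome(entity: Entity, run_succeeded: bool) -> None:
    """Credit or blame an entity that was injected into a finished run."""
    if run_succeeded:
        entity.successes += 1
    else:
        entity.failures += 1


def prune(store: list[Entity], min_utility: float = 0.3,
          min_evidence: int = 5) -> list[Entity]:
    """Drop entities that have enough evidence and still score poorly."""
    kept = []
    for entity in store:
        evidence = entity.successes + entity.failures
        if evidence >= min_evidence and entity.utility < min_utility:
            continue  # consistently associated with failed runs
        kept.append(entity)
    return kept


def top_k(store: list[Entity], relevance: dict[str, float],
          k: int = 5) -> list[Entity]:
    """Rank by task relevance weighted by utility and return the k best."""
    ranked = sorted(
        store,
        key=lambda e: relevance.get(e.text, 0.0) * e.utility,
        reverse=True,
    )
    return ranked[:k]
```

The min_evidence guard is a design choice in this sketch: it keeps a new entity from being pruned after one unlucky run, so negative signal has to accumulate before anything is removed.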

Benchmark Results

The evaluation uses AppWorld (Trivedi et al., 2024), a benchmark where agents complete multi-step tasks spanning multiple apps via API calls, averaging 9.5 API calls across 1.8 apps per task. The metric is Scenario Goal Completion (SGC), which requires consistent success across multiple scenario variants of the same underlying task, making it harder to inflate than a single-run pass rate.
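The difference between SGC and a plain pass rate is easy to see in a few lines of Python. This is a sketch under an assumption: a scenario counts as completed only when every one of its task variants passes. The scenario IDs and results below are invented for illustration and are not AppWorld data.

```python
from collections import defaultdict


def task_pass_rate(results: list[tuple[str, bool]]) -> float:
    """Fraction of individual task runs that passed."""
    return sum(passed for _, passed in results) / len(results)


def scenario_goal_completion(results: list[tuple[str, bool]]) -> float:
    """Fraction of scenarios in which every task variant passed."""
    by_scenario: dict[str, list[bool]] = defaultdict(list)
    for scenario_id, passed in results:
        by_scenario[scenario_id].append(passed)
    completed = sum(all(outcomes) for outcomes in by_scenario.values())
    return completed / len(by_scenario)


# Invented example: two scenarios with three task variants each.
results = [
    ("s1", True), ("s1", True), ("s1", True),
    ("s2", True), ("s2", False), ("s2", True),
]
```

With this data, five of six tasks pass, but only one of the two scenarios is fully completed, so the scenario-level score is well below the task-level one.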

A ReAct baseline agent was evaluated with and without ALTK-Evolve memory, where memory was built from training and development runs, then tested on a held-out partition.

Difficulty	Baseline SGC	With ALTK-Evolve	Improvement
Easy	79.0%	84.2%	+5.2pp
Medium	56.2%	62.5%	+6.3pp
Hard	19.1%	33.3%	+14.2pp
Aggregate	50.0%	58.9%	+8.9pp

The hard-task result is where the value of accumulated procedural knowledge is clearest. Hard AppWorld tasks involve longer chains of dependent API calls, error recovery across multiple app boundaries, and state management where a wrong assumption about one service cascades into failures downstream. A 74% relative improvement (19.1% to 33.3%) on those tasks is a meaningful signal, not a rounding error in a favorable metric.
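The relative figure follows directly from the table. A short Python check, using only the baseline and with-memory numbers reported above (the function and variable names are just for this example):

```python
# (baseline SGC, SGC with ALTK-Evolve), in percent, copied from the table.
rows = {
    "Easy": (79.0, 84.2),
    "Medium": (56.2, 62.5),
    "Hard": (19.1, 33.3),
    "Aggregate": (50.0, 58.9),
}


def improvement(baseline: float, with_memory: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = with_memory - baseline
    relative = absolute / baseline * 100
    return round(absolute, 1), round(relative, 1)


for difficulty, (baseline, with_memory) in rows.items():
    absolute, relative = improvement(baseline, with_memory)
    print(f"{difficulty}: +{absolute}pp absolute, +{relative}% relative")
```

The hard tier has the smallest baseline, which is why its 14.2-point gain becomes the largest relative improvement of the four rows.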

The generalization result matters as much as the raw numbers. Memory built from training and development runs improved performance on the held-out test-normal partition, which suggests that the system is learning transferable principles rather than fitting to previously seen task instances. That distinction matters for deployment: a system that memorizes seen scenarios provides no benefit on the novel situations that constitute almost all production usage.

Why Inference-Time Learning Matters

The alternative to inference-time adaptation is fine-tuning. ToolLLM (Qin et al., 2023) and Gorilla (Patil et al., 2023) bake tool-use knowledge into model weights through training on annotated tool-call corpora. Toolformer (Schick et al., 2023) taught models to self-annotate tool invocations via a self-supervised objective. These approaches improve base model capability for general tool use, but they cannot adapt to deployment-specific environments: your internal API’s undocumented behavior is not in any training corpus, and it never will be.

Fine-tuning also requires ML infrastructure, trained personnel, and a training loop that runs continuously as the environment evolves. For closed-source models like GPT-4o or Claude 3.5 Sonnet, it is available only on the provider's terms, if at all. ALTK-Evolve sidesteps all of this: it runs as a wrapper around any LLM, adding a retrieval call before each agent run and a trajectory capture call after. The model never needs to be updated.

This matters particularly for organizations running commercial API-based agents. The fine-tuning path is closed to them by default, and even when open, environment-specific quirks are too narrow to justify the cost of a dedicated training run. Inference-time memory is the practical option.

Three Ways to Deploy

The system ships with three integration paths. The no-code path targets Claude Code, Codex, and IBM Bob, installing as a plugin with automatic trajectory extraction and retrieval via hooks. The low-code path requires a single import:

import altk_evolve.auto

This emits traces to Arize Phoenix and works with OpenAI, LiteLLM, and Hugging Face agents. The pro-code path integrates with IBM’s CUGA agent framework via MCP, with explicit get_guidelines() calls before each run and save_trajectory() calls after, enabling the tightest cross-session learning loop at the cost of more integration code.
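The shape of that retrieve-before, save-after loop can be sketched in plain Python. Everything here is a local stand-in, not the toolkit's client: the InMemoryGuidelineStore class, the method signatures, and the run_with_memory helper are hypothetical, and only the names get_guidelines and save_trajectory are borrowed from the description above.

```python
from typing import Callable


class InMemoryGuidelineStore:
    """Local stand-in for a guideline service; NOT the ALTK-Evolve API."""

    def __init__(self) -> None:
        self.guidelines: list[str] = []
        self.trajectories: list[dict] = []

    def get_guidelines(self, task: str, k: int = 5) -> list[str]:
        # A real system would rank by relevance to `task`;
        # this stand-in simply returns the k most recent guidelines.
        return self.guidelines[-k:]

    def save_trajectory(self, task: str, steps: list[str],
                        succeeded: bool) -> None:
        self.trajectories.append(
            {"task": task, "steps": steps, "succeeded": succeeded}
        )


def run_with_memory(
    store: InMemoryGuidelineStore,
    agent: Callable[[str], tuple[list[str], bool]],
    task: str,
) -> bool:
    """Inject guidelines, run the agent once, then record what happened."""
    guidelines = store.get_guidelines(task)
    prompt = task
    if guidelines:
        bullet_list = "\n".join(f"- {g}" for g in guidelines)
        prompt = f"Guidelines from earlier runs:\n{bullet_list}\n\nTask: {task}"
    steps, succeeded = agent(prompt)
    store.save_trajectory(task, steps, succeeded)
    return succeeded
```

In this sketch the agent is any callable that takes a prompt and returns its steps plus a success flag. Extraction of new guidelines from saved trajectories is left out, since it happens in the background flow rather than in the run loop.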

The GitHub repository includes the toolkit; documentation covers all three paths with setup walkthroughs.

The Broader Picture

The implementation pattern in ALTK-Evolve is familiar from other persistence systems: a write path that logs observed behaviors, a background compaction process that merges and scores entries, and a read path that returns only the relevant subset. Applied to agent deployment, the log entries are procedural guidelines and the compaction is semantic deduplication with utility scoring.
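A toy version of the compaction step, in Python, shows the log-merge pattern. It is a sketch under a loud assumption: token-overlap (Jaccard) similarity stands in for semantic similarity, which a real system would more plausibly compute with embeddings or an LLM. The function names and the 0.6 threshold are hypothetical.

```python
def tokens(text: str) -> set[str]:
    """Lowercase word set with basic punctuation stripped."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return set(cleaned.split())


def jaccard(a: str, b: str) -> float:
    """Token overlap between two guidelines, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def compact(entries: list[tuple[str, int]],
            threshold: float = 0.6) -> list[tuple[str, int]]:
    """Merge near-duplicate guidelines, summing their hit counts.

    Each entry is (guideline text, number of runs that supported it).
    The first wording seen is kept as the canonical text.
    """
    merged: list[list] = []
    for text, hits in entries:
        for slot in merged:
            if jaccard(slot[0], text) >= threshold:
                slot[1] += hits  # fold evidence into the existing entry
                break
        else:
            merged.append([text, hits])
    return [(text, hits) for text, hits in merged]
```

Summing hit counts during the merge is what lets repeated observations of the same lesson strengthen one entry instead of cluttering the store with variants.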

Several related systems approach the same goal from different directions. Mobile UI agents like AppAgent maintain per-app observation notes to guide future runs, which is conceptually similar but without the consolidation and scoring layer. Reinforcement learning with verifiable rewards, as in DeepSeek-R1 and related systems, updates model weights through outcome supervision and targets a different part of the stack entirely. ALTK-Evolve occupies the inference-time, model-agnostic position: no weight updates, no training infrastructure, works with any LLM, and improves continuously as the agent accumulates field experience.

For anyone building agents that run repeatedly against the same APIs and tools, the core proposition holds regardless of the benchmark numbers. An agent that distills its own deployment experience into retrievable, pruned guidelines will outperform one that starts from zero each time. The engineering infrastructure for this pattern is now available in a documented, multi-path package. What remains is adopting it.