When I first started wiring up agentic loops for my Discord bot, my instinct was to give the agent a scratchpad. Write intermediate reasoning to a file. Keep a task list on disk. Maintain a state file so the agent could resume if interrupted. It felt like responsible engineering.
Six months later, I had a bot generating dozens of tiny JSON files per session, a cleanup cron that occasionally failed, and a retrieval layer quietly becoming its own maintenance project. The agent was functional, but most of my engineering time had shifted to the infrastructure around it.
A post from a Stanford researcher making the rounds on Hacker News this week captures exactly why this pattern emerges and why it tends to go wrong. The argument, compressed: the complexity you spend building filesystem scaffolding for your agents is complexity not invested in the agents themselves. Put more plainly, a well-designed agent context and a well-maintained state-file ecosystem both draw on the same finite engineering budget, and they tend to pull in opposite directions.
Why Agents Reach for Files
The pattern has clear historical causes. Early LLM agents ran against models with small context windows, often 4k to 8k tokens. When you cannot fit much in context, externalizing state to files is not a crutch; it is the only viable approach. The ReAct paper (Yao et al., 2022) demonstrated agents interleaving reasoning and action, and in the systems built on that idea, “acting” frequently meant reading from and writing to external storage because there was no other way to maintain state across steps without blowing the budget.
This thinking got embedded in the frameworks. LangChain’s memory system offers filesystem-backed stores as a first-class primitive. AutoGen ships with disk-backed caching and logging for multi-agent conversations. Most “production-ready” agent templates include a vector database, a SQLite state store, and a directory of markdown files the agent uses as working memory. These tools were designed around 2022-era constraints. The constraints have since changed substantially.
The Context Window Shifted the Math
Claude 3.7 Sonnet operates at 200k tokens. Gemini 1.5 Pro supports 1 million tokens. GPT-4o runs at 128k. These are not incremental improvements over GPT-3.5’s 4k window; they change what you can hold in-flight without ever touching the filesystem.
A 200k-token context window holds roughly 150,000 words, more than most novels and more than the full source of many medium-sized software projects. The reasoning chain for a complex multi-step task, along with all relevant documents and prior outputs, can often stay entirely in context without a single file write.
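The arithmetic behind that estimate can be sketched as a rough budget check. The 0.75 words-per-token ratio is a common rule of thumb for English prose, not a property of any particular tokenizer, and the function names and the `reserve` figure are hypothetical choices for illustration.

```python
# Rough context-budget check. WORDS_PER_TOKEN is a rule-of-thumb assumption
# for English prose; real tokenizers vary by model and by content (source
# code usually costs more tokens per word).
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate a token count from a whitespace word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def fits_in_context(documents: list[str], window: int = 200_000,
                    reserve: int = 20_000) -> bool:
    """Return True if the documents fit while leaving `reserve` tokens free."""
    used = sum(estimate_tokens(doc) for doc in documents)
    return used <= window - reserve


if __name__ == "__main__":
    working_set = ["word " * 30_000, "word " * 60_000]
    print(fits_in_context(working_set))
```

For anything beyond a sanity check, the provider's own token counter is the better source of truth; this only shows the order of magnitude involved.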
When you externalize state to disk under these conditions, you are frequently solving a problem that no longer exists while introducing new ones. The file write adds latency. The read requires retrieval logic. The file format becomes a schema you own and must evolve. The path becomes a coupling point between invocations.
What Investing in the Agent Layer Actually Means
The recommendation from the Stanford post is not “never use the filesystem.” The argument is about where your complexity budget goes.
Complexity spent on filesystem scaffolding looks like this: designing a state file schema, writing serialization and deserialization logic, building cleanup routines, adding retrieval layers to find relevant prior state, and debugging corruption when two agent invocations overlap.
Complexity spent on agent capability looks like this: better system prompts that maintain coherent reasoning chains, structured output schemas that make the agent’s current state explicit in-context, tool designs that reduce surface area and ambiguity, and evaluation loops that catch when reasoning drifts.
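One way to read "structured output schemas that make the agent's current state explicit in-context" is to have the agent restate a small, typed state object as an ordinary message. The sketch below is an assumption about what that could look like; `AgentState`, its fields, and the `CURRENT_STATE` marker are hypothetical names, not part of any framework.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentState:
    """Hypothetical explicit state the agent restates as it works."""
    goal: str
    completed_steps: list[str] = field(default_factory=list)
    next_step: str = ""
    open_questions: list[str] = field(default_factory=list)


def state_message(state: AgentState) -> dict:
    """Render the state as a message that lives in the conversation itself."""
    return {
        "role": "assistant",
        "content": "CURRENT_STATE\n" + json.dumps(asdict(state), indent=2),
    }


def parse_state(message: dict) -> AgentState:
    """Recover the state from a message produced by state_message."""
    _, payload = message["content"].split("\n", 1)
    return AgentState(**json.loads(payload))


if __name__ == "__main__":
    state = AgentState(goal="Summarize the report", next_step="Read section 1")
    print(state_message(state)["content"])
```

The state still has a schema, but it travels with the conversation instead of living in a file that a later invocation has to locate and trust.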
The second list compounds across tasks. Improvements to how an agent reasons benefit every task it runs. Improvements to your state file schema address one specific failure mode.
Here is a concrete example. A typical filesystem-heavy agent step might look like this:
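# Filesystem-heavy pattern
import json

def run_agent_step(task, step_num):
    # Reload the whole plan from disk on every step
    with open("plan.json") as f:
        state = json.load(f)
    context = build_minimal_context(state, step_num)
    result = llm.complete(context)
    state["steps"][step_num] = result
    # Write the updated plan back so the next step can find it
    with open("plan.json", "w") as f:
        json.dump(state, f)
    return result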
The same agent designed to carry state natively in the conversation:
# Agent-native pattern
def run_agent(task):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    while not is_complete(messages):
        response = llm.complete(messages)
        # response is treated as an object here; store only its text as content
        messages.append({"role": "assistant", "content": response.content})
        if response.requires_tool_call:
            result = execute_tool(response.tool_call)
            messages.append({"role": "tool", "content": result})
    return extract_final_answer(messages)
The second version is simpler to debug, because the entire reasoning chain is one ordered list. It also tends to work better, because the model can attend to everything it has done at once, rather than re-reading serialized fragments that were written under a different context state.
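To make the in-context loop executable without any model provider, here is a minimal sketch with a scripted stand-in for the model. `Response`, `FakeLLM`, the `add` tool, and the `max_turns` cap are all hypothetical; a real client would return its own response type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    content: str
    tool_call: Optional[dict] = None  # e.g. {"name": "add", "args": [2, 3]}


class FakeLLM:
    """Scripted stand-in for a model: requests one tool call, then answers."""

    def complete(self, messages: list[dict]) -> Response:
        tool_results = [m for m in messages if m["role"] == "tool"]
        if not tool_results:
            return Response("I need to add the numbers.",
                            tool_call={"name": "add", "args": [2, 3]})
        return Response(f"FINAL: {tool_results[-1]['content']}")


TOOLS = {"add": lambda a, b: a + b}


def run_agent(task: str, llm: FakeLLM, max_turns: int = 10) -> list[dict]:
    """Run the loop and return the full, ordered message history."""
    messages = [
        {"role": "system", "content": "Use tools when needed."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):  # hard cap so a confused model cannot loop forever
        response = llm.complete(messages)
        messages.append({"role": "assistant", "content": response.content})
        if response.tool_call is None:
            break  # no tool requested, so this turn is the final answer
        call = response.tool_call
        result = TOOLS[call["name"]](*call["args"])
        messages.append({"role": "tool", "content": str(result)})
    return messages


if __name__ == "__main__":
    for message in run_agent("What is 2 + 3?", FakeLLM()):
        print(message["role"], "->", message["content"])
```

Printing the returned list is the whole debugging story: every decision and tool result appears in the order it happened.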
The Cases Where Filesystem State Is Correct
There are legitimate reasons to persist state to disk, and a serious argument distinguishes between them and the anti-pattern.
Long-running tasks across process boundaries are the clearest case. If an agent task spans hours or days, or if the process might restart, in-context state does not survive. A database or filesystem is genuinely necessary here, and the engineering cost is justified.
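For this case, one reasonable approach is to checkpoint the message list itself rather than inventing a separate state format. The sketch below assumes the messages are JSON-serializable; `save_checkpoint` and `load_checkpoint` are hypothetical names. It writes to a temporary file and renames it into place so a reader sees either the old checkpoint or the new one.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(messages: list[dict], path: Path) -> None:
    """Write the conversation to a temp file, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(messages, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave stray temp files behind
        raise


def load_checkpoint(path: Path) -> list[dict]:
    """Return the saved conversation, or an empty list for a fresh task."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

On restart, the agent resumes by passing the loaded list straight back into its loop, so the persisted artifact and the working memory are the same structure.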
Multi-agent coordination is similar. When multiple independent agent processes need to share information, some external store is required. The key is designing the interface carefully so agents do not step on each other’s state, which tends to mean treating the shared store as append-only logs rather than mutable working memory.
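The append-only idea can be sketched with a JSON Lines file: each agent only ever adds records, and readers filter what they need. The record fields and function names are hypothetical, and this sketch does not add file locking, so concurrent writers on a shared or network filesystem would need more care than is shown here.

```python
import json
import time
from pathlib import Path
from typing import Optional


def append_event(log_path: Path, agent_id: str, kind: str, payload: dict) -> None:
    """Append one JSON line; existing lines are never rewritten."""
    record = {"ts": time.time(), "agent": agent_id, "kind": kind, "payload": payload}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_events(log_path: Path, kind: Optional[str] = None) -> list[dict]:
    """Read all events in write order, optionally filtered by kind."""
    if not log_path.exists():
        return []
    events = []
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                event = json.loads(line)
                if kind is None or event["kind"] == kind:
                    events.append(event)
    return events
```

Because nothing is updated in place, one agent cannot overwrite another agent's record; the worst case is a reader seeing an event later than it would like.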
Audit logs and observability are a different category entirely. Writing a record of what an agent decided and why is valuable for debugging and compliance. This is not working memory; it is an output, and outputs belong on disk.
The problem pattern is using the filesystem as a substitute for context design, particularly when it compensates for weak prompting, unclear tool schemas, or agents that cannot reason coherently across a long task without external crutches.
The Coupling Problem
There is a subtler issue with filesystem-heavy agent systems that rarely appears in architecture discussions: they create implicit coupling between agent invocations that is difficult to reason about in production.
When an agent writes state to a file and a later invocation reads it, those two execution contexts are coupled through shared mutable state. The file’s schema becomes a contract. Its presence or absence becomes a precondition. Its content becomes a potential attack surface: prompt injection through filesystem contents is a genuine concern for agents that read from directories with any user-controlled content, and every additional file the agent reads widens that exposure.
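The contract and the precondition are easiest to see when the reader is forced to check them. The following sketch shows the kind of guard code a state file tends to require; `SCHEMA_VERSION`, `StateFileError`, and the expected `steps` field are hypothetical, chosen only to illustrate the coupling.

```python
import json
from pathlib import Path

SCHEMA_VERSION = 2  # hypothetical: bumped whenever the file layout changes


class StateFileError(Exception):
    """Raised when the state file breaks the contract the reader expects."""


def load_state(path: Path) -> dict:
    """Load a state file, checking each assumption the reader depends on."""
    # Precondition: an earlier invocation must already have written the file.
    if not path.exists():
        raise StateFileError(f"precondition failed: {path} is missing")
    try:
        state = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        # A writer that was interrupted mid-write can leave a truncated file.
        raise StateFileError(f"{path} is not valid JSON") from exc
    # Contract: writer and reader must agree on the schema version.
    if not isinstance(state, dict) or state.get("schema_version") != SCHEMA_VERSION:
        raise StateFileError(f"{path} has an unexpected schema version")
    if not isinstance(state.get("steps"), list):
        raise StateFileError(f"{path} is missing a 'steps' list")
    return state
```

None of these checks makes the agent smarter. They exist only because two invocations communicate through a file, and they do nothing about what the file's text might say to the model once it is loaded.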
In-context state does not have the same problem. Each message appends to the conversation; nothing is mutated in place. The reasoning chain is auditable in sequence. There is no hidden shared state that a concurrent process can silently corrupt.
This is part of why the HN discussion around the Stanford post has focused on the architectural angle rather than just the practical convenience argument. The filesystem-heavy pattern is not just harder to maintain; it introduces a class of failure modes that purely in-context designs avoid by construction.
The Broader Shift in Agent Design
What the Stanford post is pointing at connects to a wider change in how capable agent systems are being built in 2026. The frameworks that emerged from the 2022-to-2024 period, built around retrieval-augmented generation, vector stores as primary memory, and elaborate file-based state machines, were reasonable responses to real constraints. Those constraints are now partially obsolete.
The direction that seems to be winning in practice looks more like this: give the agent a well-designed set of tools, a clear system prompt that establishes reasoning conventions, and enough context to hold its entire working set in-flight. Reserve external persistence for genuine persistence needs, which tend to be narrower than most early implementations assumed.
Building Ralph, I saw the same thing in practice. The agent became more reliable when given better context and cleaner tool schemas, not when given more files to read. The working memory files, the intermediate state stores, and the partial task logs all became sources of stale data and subtle bugs. The configuration persistence, the scheduled task metadata, the user preferences: those paid off because they were genuine persistence needs, not architectural crutches.
The filesystem is good at being a filesystem. Used as a substitute for a well-designed agent context, it adds complexity without adding intelligence, and that gap between the two is where reliability tends to break down in ways that are difficult to debug and slow to fix.