The 678 KB AI Agent: What an IRC-and-Zig Stack Reveals About Inference Architecture

George Larson’s Nullclaw Doorman landed on Hacker News this week with 318 points and a question most people read as architectural curiosity: why build an AI agent system on IRC? But the IRC choice is the least interesting decision in this stack. The more revealing parts are how the inference cost is structured, what the A2A protocol passthrough actually solves, and what a 678 KB binary running on 1 MB of RAM says about where agent frameworks have been wasting everyone’s money.

Two Agents, Two Threat Models

The system splits into two agents with distinct exposure profiles. Nullclaw is the public-facing one: a 678 KB Zig binary connected to an Ergo IRC server, with visitors reaching it through a gamja web client embedded in the site or via any standard IRC client at irc.georgelarson.me:6697 (TLS). Ironclaw lives behind a Tailscale mesh, handling email and scheduling, reachable only from within the private tailnet.

The separation is clean because the threat models are different. Nullclaw accepts arbitrary public input over IRC; ironclaw handles privileged operations against real services. Putting them on the same host or the same network segment would mean a prompt injection in the public chat surface could potentially reach the email-sending, calendar-writing agent. The Tailscale boundary enforces this separation at the network layer, which is a cheaper and more reliable guarantee than trying to enforce it purely in application code.

Tailscale’s WireGuard-based mesh gives ironclaw a stable 100.64.x.x CGNAT address and a DNS name within the tailnet without any public firewall rules. For a self-hosted agent system, that eliminates an entire category of configuration error.

Why IRC Actually Works Here

IRC gets a reaction of either nostalgia or dismissal depending on who you ask, but neither response is the right one for evaluating this architecture. The relevant question is what properties a good agent transport needs, and IRC satisfies most of them.

Agents need a structured way to address messages to one another. IRC channels and nicks provide that. Agents need some form of history replay when they reconnect after a crash. Ergo, which is built in Go and ships as a single binary with an embedded key-value store, implements the IRCv3 chathistory extension natively, so there is no external bouncer required. Agents need message deduplication. Ergo assigns msgid values to every message under the labeled-response spec. Agents need a transport that human operators can observe and participate in with standard tooling. IRC has that by definition.

Gamja is the web client piece. Written by Simon Ser (also the author of the soju IRC bouncer and the goguma mobile client), it is a single-page application that connects over WebSocket, implements chathistory, and deliberately avoids a heavy JavaScript framework. Ergo exposes a native WebSocket endpoint, so the gamja-to-Ergo connection involves no proxy or gateway layer.

The thing that makes IRC unsuitable for most production message systems is the lack of ordering guarantees and the historical absence of persistence. Ergo addresses persistence. For a system where message ordering matters less than human-readable debuggability and zero operational overhead, IRC becomes a reasonable choice rather than a regressive one. Kafka has ordering guarantees, consumer groups, and configurable retention; it also has ZooKeeper (or KRaft) and a JVM-based broker cluster. The tradeoff is real.

The Zig Binary Is a Statement

A 678 KB binary using approximately 1 MB of RAM for a conversational AI agent gateway is not a performance benchmark. It is an indication of how much overhead most agent frameworks carry that has nothing to do with the actual inference work.

Typical Python-based agent frameworks start at 200-400 MB of installed dependencies. A FastAPI server with LangChain and a few client libraries will idle at 80-150 MB of RSS before handling a single request. The Zig binary’s footprint reflects what the gateway layer actually needs: a network socket, an IRC protocol parser, some state for active sessions, and HTTP client code to reach the Claude API. Everything else is inference, and inference happens on Anthropic’s infrastructure, not on the $7 VPS.

Zig has no runtime, no garbage collector, and compiles to native code with predictable memory layout. For a network daemon that parses a line-oriented protocol and makes HTTP requests, the language overhead is near zero. The binary size and RSS figures are close to what you would get from a handwritten C implementation, without the manual memory management hazards.

The broader point is that the VPS is not running an LLM. It is running a gateway. Confusing those two things leads to overprovisioned infrastructure and idle compute.

Tiered Inference: Where the Architecture Gets Interesting

The cost structure is where this system earns its design credit. Haiku 4.5 handles conversation turns, sub-second responses, cheap per-token cost. Sonnet 4.6 is invoked only for tool use, when the agent needs to actually do something rather than continue a conversation.

The pricing difference makes this meaningful. Claude Haiku 4.5 runs at roughly $0.80 per million input tokens and $4 per million output tokens. Claude Sonnet 4.6 is an order of magnitude more expensive for comparable capabilities. Most conversational turns, the back-and-forth that makes up the bulk of any chat interaction, do not require Sonnet-level reasoning. Routing those turns to Haiku while reserving Sonnet for structured tool invocations compresses the effective per-session cost significantly.

This pattern has prior art. RouteLLM, the routing research from LMSYS published in 2024, showed 40-85% cost reductions by training a lightweight classifier to predict when a weaker model would produce acceptable output. The Nullclaw implementation appears to use a simpler heuristic: does this turn require a tool call? If not, Haiku handles it. If yes, Sonnet does.

The hard cap at $2 per day is a design constraint that forces the tiering to be real. Without a cap, tiered inference is an optimization with soft incentives. With a cap, the system has to stay within budget regardless of traffic patterns, which means the routing decisions actually matter operationally.

The A2A Passthrough Solves a Real Problem

Google’s Agent-to-Agent (A2A) protocol, announced in April 2025, standardizes how agents delegate work to other agents. It is built on HTTP with JSON-RPC 2.0 messages, where each agent publishes an Agent Card at /.well-known/agent.json describing its capabilities, and other agents invoke it by sending Task objects with a lifecycle of submitted → working → completed (or input-required for multi-turn exchanges).

The clever part of the Nullclaw design is the A2A passthrough. Ironclaw, the private agent, borrows nullclaw’s inference pipeline rather than maintaining its own Anthropic API key and billing relationship. When ironclaw needs to call a model, the request routes through nullclaw’s existing connection. One API key, one billing relationship, regardless of which agent originated the work.

This matters more than it might appear. Multi-agent systems with separate API keys create audit fragmentation: you cannot easily attribute inference costs to specific workflows, and key rotation becomes a coordination problem across machines. Centralizing the billing relationship at the gateway layer, while using A2A for agent-to-agent routing, separates the authentication concern from the delegation concern cleanly.

A2A also handles the case where ironclaw needs to pause and wait for nullclaw to gather more information from the user. The input-required state in the Task lifecycle is the protocol primitive for that; it avoids the alternative of building a custom handshake protocol over IRC.

What This Stack Argues For

The cumulative argument of this design is that agent infrastructure should be boring. Use a protocol that human operators already understand and can debug with standard tools. Keep the gateway binary small enough that the hosting cost is irrelevant. Put the architectural sophistication into inference routing, where it has a direct effect on cost and capability. Use network-layer isolation for trust boundaries rather than relying on application-layer checks.

The things that make AI agent systems expensive and fragile are usually not the inference itself. They are the frameworks and orchestration layers that mediate the inference, the fat runtimes on always-on servers, and the implicit assumption that because inference is complex, every supporting layer needs to be complex too.

A 678 KB binary on a $7 VPS handling real conversations with a $2/day inference budget does not disprove the value of more sophisticated orchestration for more sophisticated problems. But it does make a case for asking, per system, what the infrastructure layer actually needs to do, and building no more than that.

You can talk to nullclaw yourself at georgelarson.me/chat or via any IRC client at irc.georgelarson.me:6697 with TLS.