A 678 KB Agent on a $7 VPS: What IRC Gets Right About AI Infrastructure

The instinct when building AI agent infrastructure is to reach for the same stack you would use for any other service: a WebSocket server, a message queue, maybe a Redis instance for state. George Larson’s nullclaw/ironclaw project makes a different bet. The public-facing agent runs as a 678 KB Zig binary consuming about 1 MB of RAM, connected to an IRC server, on a $7/month VPS. The total inference budget is capped at $2/day.

This is not a prototype. It handles real conversations, uses tools, manages email and scheduling, and communicates between agents using Google’s A2A protocol over a private Tailscale network. The interesting part is not that it works; it is what the design reveals about where AI agent complexity actually lives.

IRC as a Message Bus

IRC is a protocol from 1988. It predates the web. It has survived the rise and fall of AIM, MSN Messenger, and a dozen “IRC killers.” The reason is not nostalgia. IRC is a simple, text-based protocol over TCP, and text-based TCP protocols age very well.

For an AI agent, IRC provides exactly what you need: a message bus with named channels, presence detection via JOIN and PART events, and the ability for multiple clients to share a channel. The IRCv3 specification adds modern capabilities including message tags for structured metadata, server-time for message timestamping, and chathistory for replay. Larson uses Ergo (formerly Oragono), a Go-based IRC server that implements IRCv3 natively, includes built-in TLS via Let’s Encrypt, and supports WebSocket connections from the browser.

The web-accessible chat at georgelarson.me uses Gamja, a lightweight browser IRC client that can be embedded in a page and connects to Ergo via WebSocket. From the user’s perspective, it looks like any other chat widget. From the infrastructure perspective, you get a standard IRC channel that any IRC client in the world can join, at irc.georgelarson.me:6697 over TLS.

This means the transport layer is fully client-agnostic. Someone using WeeChat, irssi, or a modern client like Textual gets the same channel. The AI agent does not need to care which client is connected. It reads messages, responds, and the IRC server handles delivery. There is no custom WebSocket protocol to maintain, no session management, no presence heartbeats to implement.
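To make "it reads messages, responds" concrete, here is a minimal sketch of the parsing an agent would need to do on each incoming line. It is written in Python for readability, not in the project's Zig, and it is not taken from nullclaw. The function names `parse_irc_line` and `format_privmsg` are hypothetical, and the parser assumes a non-empty line and skips the tag-value escaping rules in the IRCv3 message-tags specification.

```python
def parse_irc_line(line: str) -> dict:
    """Split one raw IRC line into tags, source, command, and params."""
    line = line.rstrip("\r\n")
    tags: dict[str, str] = {}
    source = None

    # IRCv3 message tags: "@key=value;key2=value2 " at the start of the line.
    if line.startswith("@"):
        raw_tags, _, line = line[1:].partition(" ")
        for item in raw_tags.split(";"):
            key, _, value = item.partition("=")
            tags[key] = value
        line = line.lstrip(" ")

    # Optional source, e.g. ":nick!user@host".
    if line.startswith(":"):
        source, _, line = line[1:].partition(" ")
        line = line.lstrip(" ")

    # Everything after the first " :" is one trailing parameter that may contain spaces.
    head, sep, trailing = line.partition(" :")
    parts = head.split()
    command, params = parts[0].upper(), parts[1:]
    if sep:
        params.append(trailing)
    return {"tags": tags, "source": source, "command": command, "params": params}


def format_privmsg(target: str, text: str) -> str:
    """Build the line an agent would send to reply in a channel or to a nick."""
    return f"PRIVMSG {target} :{text}\r\n"
```

A channel message tagged by the server with server-time comes back as a `PRIVMSG` command whose params are the channel name and the message text, which is all the agent needs before handing the text to a model.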

Good IRC servers have also spent decades refining flood control and connection management. Ergo ships with rate limiting, connection throttling, and account-based authentication. You get that for free by choosing the protocol.

Why Zig for the Agent Binary

A Go program doing the same work would likely land around 10 to 15 MB as a static binary, because Go embeds its runtime, garbage collector, and goroutine scheduler in every executable. A Rust binary would be smaller, but still carries standard library and unwinding infrastructure. A C binary would be tiny but requires careful manual memory management at every turn. Zig sits in a different position.

Zig has no hidden allocations. Every allocation is explicit, and you pass an allocator to any function that needs one. The standard library is available but optional; you can write a Zig program that makes no standard library calls at all. The comptime system resolves generics at compile time, so generic code adds no runtime dispatch and only the instantiations the program actually uses end up in the final binary. Building with ReleaseSmall strips debug information and applies size-focused optimization.

The result: 678 KB. The agent process uses approximately 1 MB of RAM at runtime. On a $7/month VPS, that is negligible. The inference API calls are the entire cost center; the process itself is essentially free. This is the correct frame: for an LLM-backed agent, the binary is scaffolding around network I/O. Making the scaffolding small is not premature optimization; it means the VPS can run other things alongside the agent without contention.

Tiered Inference and the $2/day Cap

The billing architecture is one of the more interesting decisions in this project. Two Claude models are in play: Claude Haiku 4.5 handles conversational turns, and Claude Sonnet 4.6 activates only for tool use. The split matters because the two kinds of turn have very different cost and capability requirements.

Haiku 4.5 is fast and cheap, with sub-second response times for typical conversational inputs. It handles routing, clarification, and casual dialogue where the output does not need to drive a downstream system. Sonnet 4.6 is more capable and more expensive, but the project only invokes it when the conversation requires tool calls, meaning structured JSON output, API interactions, or scheduling operations.

This tiered pattern appears in production inference systems with some frequency, but it is often implemented as a routing classifier: a small model decides whether the input warrants a large model. Here, the trigger is structural rather than semantic. If a tool call is required, escalate. That keeps the logic simple and the escalation rate predictable, because most conversations do not require tools.
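A structural trigger of this kind reduces to a single branch. The sketch below assumes the caller already knows whether a turn needs a tool, for example because the cheaper model's reply signalled one. The `Turn` type, the `pick_model` function, and both model identifiers are hypothetical placeholders, not the project's code or real model IDs.

```python
from dataclasses import dataclass

# Hypothetical identifiers; substitute the model IDs your provider actually uses.
CHAT_MODEL = "small-fast-model"
TOOL_MODEL = "large-capable-model"


@dataclass
class Turn:
    text: str
    tool_requested: bool  # assumed to be set upstream; how it is detected is out of scope


def pick_model(turn: Turn) -> str:
    """Escalate only when the turn structurally requires a tool call."""
    return TOOL_MODEL if turn.tool_requested else CHAT_MODEL
```

Because the decision does not depend on classifying the user's intent, there is no second model to tune and no threshold to drift.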

The hard cap at $2/day is enforced at the application level. At Haiku 4.5’s published list price of roughly $1 per million input tokens, $2 covers about two million input tokens before any output is counted; at Sonnet 4.6 rates, the budget goes faster. The cap prevents runaway costs from looping agents or unexpected traffic spikes, both of which are real operational risks when your agent is publicly accessible via a web widget. Putting the cap in the application rather than relying on API-level limits means the agent can handle the breach gracefully rather than returning hard API errors to users.
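An application-level cap can be as small as a counter that resets each day. The following is a sketch under assumptions, not the project's implementation: `DailyBudget` and `estimate_cost` are hypothetical names, and per-million-token prices are passed in by the caller rather than hard-coded, since they vary by model and change over time.

```python
from datetime import date
from typing import Optional


def estimate_cost(input_tokens: int, output_tokens: int,
                  in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Dollar cost of one call, given prices per million tokens."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000


class DailyBudget:
    """Tracks spend for the current day and refuses calls that would exceed the cap."""

    def __init__(self, cap_usd: float = 2.00, today: Optional[date] = None):
        self.cap_usd = cap_usd
        self.day = today or date.today()
        self.spent_usd = 0.0

    def try_spend(self, cost_usd: float, today: Optional[date] = None) -> bool:
        today = today or date.today()
        if today != self.day:
            # New day: reset the running total.
            self.day, self.spent_usd = today, 0.0
        if self.spent_usd + cost_usd > self.cap_usd:
            return False  # caller can reply with a polite "budget exhausted" message
        self.spent_usd += cost_usd
        return True
```

Returning `False` instead of raising lets the agent answer in-channel with a friendly message, which is the graceful handling an application-level cap makes possible.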

A2A Protocol and the Private Agent

The second agent, ironclaw, runs on a separate machine reachable only over Tailscale. It handles email and scheduling, which means it holds credentials and integrations that should not be exposed on a public-facing box. The communication between the two agents uses Google’s Agent2Agent (A2A) protocol, released in 2025.

A2A is an open, HTTP-based protocol for agent interoperability. An agent publishes a JSON “Agent Card” at /.well-known/agent.json describing its capabilities, supported modalities, and authentication requirements. Other agents send task requests and receive task responses over standard HTTP. The design is intentionally simple, and it complements Anthropic’s Model Context Protocol (MCP) rather than competing with it: MCP governs agent-to-tool communication and A2A governs agent-to-agent communication.
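For a sense of what an Agent Card contains, here is an illustrative builder. The field names follow my reading of the published A2A schema and may differ between protocol versions, so treat them as assumptions and check the specification before relying on them. The `build_agent_card` helper, the version string, and the example URL are hypothetical.

```python
import json


def build_agent_card(name: str, description: str, url: str, skills: list) -> dict:
    """Assemble a minimal Agent Card as a plain dict (field names are assumptions)."""
    return {
        "name": name,
        "description": description,
        "url": url,  # where other agents send task requests
        "version": "0.1.0",  # placeholder version of this agent
        "capabilities": {"streaming": False},
        "defaultInputModes": ["text/plain"],
        "defaultOutputModes": ["text/plain"],
        "skills": [
            {"id": s, "name": s, "description": f"Handles {s} requests"}
            for s in skills
        ],
    }


card = build_agent_card(
    name="example-agent",
    description="Illustrative private agent",
    url="http://agent.example.internal/a2a",  # hypothetical address
    skills=["email", "scheduling"],
)
card_json = json.dumps(card, indent=2)  # the document served at the well-known path
```

Serving `card_json` from the well-known path is what lets another agent discover the endpoint and its skills without any out-of-band configuration.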

The specific trick here is a billing passthrough: ironclaw sends A2A task requests to nullclaw, and nullclaw fulfills them using its own Claude API connection. One API key, one billing relationship, regardless of which agent initiated the inference. The public agent acts as a credentialed proxy for the private one. This keeps API credentials consolidated on one box and the billing surface easy to audit.
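The passthrough idea can be sketched as a small proxy object: one component holds the only model connection and records which agent each request came from. This is an assumption-laden illustration and not nullclaw's code. `InferenceProxy`, `handle_task`, and `fake_model` are hypothetical names, and the model call is an injected function so the example runs without any network access or API key.

```python
from collections import defaultdict
from typing import Callable, Tuple


class InferenceProxy:
    """Holds the single model connection and attributes spend per requesting agent."""

    def __init__(self, call_model: Callable[[str], Tuple[str, float]]):
        # call_model returns (reply_text, cost_in_usd); in a real system it would
        # wrap the provider's API client, which only this process is configured with.
        self._call_model = call_model
        self.ledger = defaultdict(float)

    def handle_task(self, requester: str, prompt: str) -> str:
        reply, cost_usd = self._call_model(prompt)
        self.ledger[requester] += cost_usd  # one ledger, easy to audit
        return reply


def fake_model(prompt: str) -> Tuple[str, float]:
    """Stand-in for a real inference call, with a fixed made-up cost."""
    return f"echo: {prompt}", 0.25


proxy = InferenceProxy(fake_model)
first_reply = proxy.handle_task("ironclaw", "summarise today's inbox")
proxy.handle_task("ironclaw", "draft a reply")
```

All spend lands in one ledger keyed by requester, so auditing the bill means reading a single dictionary on a single machine.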

The Tailscale layer means ironclaw’s A2A endpoint is never reachable from the public internet. Only machines in Larson’s Tailscale network can connect. WireGuard handles encryption; Tailscale handles key distribution and NAT traversal automatically. Standing up the private network takes minutes. The alternative, configuring firewall rules and a traditional VPN to protect a private agent endpoint, would be substantially more work and more brittle.

What This Architecture Actually Argues

Conventional thinking on AI agent infrastructure tends toward complexity: vector databases for memory, message queues for reliability, orchestration frameworks for multi-agent coordination. Some of those components are genuinely necessary at scale. At the scale of one person’s personal agent, they are overhead.

The nullclaw/ironclaw design makes the argument, through implementation rather than assertion, that the substantive cost of running an AI agent is almost entirely the inference budget. The compute and network infrastructure for a personal agent is so cheap as to be operationally irrelevant. The agent’s 678 KB binary and roughly 1 MB of RAM are not an optimization targeting the VPS; they are a consequence of building a system where the process does as little as possible and delegates everything interesting to the API.

IRC works here because an AI agent is fundamentally a message-passing system. It receives text, produces text, optionally calls a tool. IRC has handled that workload since 1988. Using a modern server like Ergo means you get TLS, IRCv3, WebSocket access, and account persistence without writing any of that infrastructure yourself.

The A2A passthrough for billing is the most architecturally interesting piece. It establishes a clear trust boundary between public and private agents while keeping the credential footprint minimal. As multi-agent systems grow more common, that kind of composable billing and trust model will matter beyond personal projects.

The project is live at the georgelarson.me chat page and accepts direct IRC connections at irc.georgelarson.me:6697. The Hacker News thread has substantive discussion of agent infrastructure trade-offs and is worth reading alongside the original post.