· 6 min read ·

What a 678 KB Zig Binary Teaches About AI Agent Architecture

Source: hackernews

George Larson’s nullclaw project showed up on Hacker News this week and picked up over 300 upvotes. The headline pitch is an AI agent running on a $7/month VPS with IRC as its transport layer. That framing undersells the interesting parts.

The binary is 678 KB. It uses approximately 1 MB of RAM at runtime. It is written in Zig. Those three facts together are not an accident; they are a design philosophy that cascades into every other decision in the system.

The Transport Is Not the Point

I build Discord bots for a living, or close enough to one. Discord’s gateway API is genuinely rich: you get presence events, interaction callbacks, slash command routing, file attachments, embeds, role management, audit logs, and a WebSocket protocol that reconnects and resumes sessions automatically. It handles a lot.

It also hands you a programming model. You are reactive to gateway events. Your bot exists inside Discord’s lifecycle. When Discord changes rate limits, or deprecates an intent, or reworks interaction handling, you update or you break. That is a reasonable trade for most use cases.

IRC gives you almost none of this. A message on IRC is a line of text, 512 bytes by default (extendable to 8192 with the LINELEN IRCv3 extension). Channels are pub/sub primitives. Nicks are identifiers. The server relays; it does not orchestrate. You write the orchestration yourself.

For a single-purpose agent that needs to accept visitor input and route it to an LLM, that is enough. More than enough. The protocol imposes nothing on your agent’s internal logic. There is no platform to integrate with, no permission model to reason about, no API changes to track. The transport layer is inert.

Larson is running Ergo, a modern IRCv3-compliant server written in Go. Ergo ships as a single binary, supports WebSocket connections natively, has a built-in bouncer mode, and handles TLS termination. The web client, gamja, is a minimal JavaScript SPA by Simon Ser that connects to Ergo over WebSocket with no build step required. The user experience is a chat window on a webpage. The underlying protocol is line-delimited text over a WebSocket tunnel to an IRC server. The stack has almost no surface area.

Zig at This Scale

Zig produces small binaries because it has no runtime, no garbage collector, and no hidden allocations. In ReleaseSmall mode, a non-trivial Zig program regularly comes in under a megabyte. A Go equivalent of the same IRC client would be roughly 5-8 MB after stripping. A Rust binary with full standard library pulls 300-600 KB minimum and grows from there. 678 KB for a functioning IRC bot with agent plumbing built in is at the low end of what you would expect from Zig, which means Larson has kept the feature set tight.

At 1 MB of RAM, this process fits comfortably in the L3 cache of any server CPU made in the last decade. On a $7/month VPS, you are sharing a host with dozens of other tenants. Staying small means staying out of the way of the kernel’s memory pressure decisions. It also means you could run this on hardware that has no business running modern software: an old Raspberry Pi, a repurposed router, anything with 64 MB to spare.

There is a practical argument here beyond elegance. Small, fast-starting binaries are easier to supervise, easier to restart on failure, and easier to reason about under load. The agent’s footprint does not grow with traffic because there is no runtime accumulating heap state across requests.

Tiered Inference Is the Right Default

The inference setup is worth copying directly. Haiku 4.5 handles conversational turns, which are cheap and sub-second. Sonnet 4.6 only activates when tool use is required. There is a hard spend cap of $2 per day.

Haiku 4.5 costs roughly $0.80 per million input tokens and $4 per million output tokens. Sonnet 4.6 is around $3 per million input and $15 per million output. For a public-facing agent that might see high conversational volume with occasional agentic tasks, running everything on Sonnet would cost an order of magnitude more per interaction for the same user experience on most messages.

The pattern is: use the cheapest model that can complete the current subtask, escalate only when necessary. This is not new as an idea, but most bot and agent implementations skip it because the routing logic feels like extra work. Larson’s setup makes it concrete: model selection is a per-turn decision, not a per-deployment one.

The $2/day ceiling is a hard budget limit at the API level, not a soft guideline. At current pricing, that is roughly 2.5 million Haiku input tokens per day, or about 130,000 Sonnet input tokens. For a personal agent, that ceiling is generous. For a public-facing one, it sets a real constraint that forces you to think about what interactions are worth servicing with expensive inference.

A2A as an API Gateway for Agents

The most technically interesting piece is the private agent, ironclaw, and how it relates to the public one.

Ironclaw handles email and scheduling. It is reachable only over Tailscale, a WireGuard-based mesh VPN that assigns stable 100.x.x.x addresses to each node and works through NAT without port forwarding. Ironclaw is not on the public internet. It communicates with nullclaw via Google’s A2A (Agent-to-Agent) protocol, which was released as an open specification in April 2025.

A2A uses JSON-RPC 2.0 over HTTP with server-sent events for streaming. Each agent exposes an Agent Card, a JSON document at /.well-known/agent.json, that describes its capabilities, authentication requirements, and supported message formats. Agents exchange Task objects containing typed parts: text, files, structured data. It is designed to be the agent-to-agent complement to Model Context Protocol, which handles agent-to-tool communication.

The passthrough arrangement Larson describes is the clever part: ironclaw borrows nullclaw’s inference pipeline. This means there is one Anthropic API key, one billing account, and one outbound inference path regardless of which agent initiated the work. From the API’s perspective, all requests come from nullclaw. Ironclaw delegates its LLM calls across the A2A boundary to its partner, which proxies them upstream.

This pattern solves a real problem in multi-agent systems: API key proliferation and billing fragmentation. If you have five agents, each with its own key, you have five points of credential management, five billing relationships, and no shared spend visibility. If one agent’s key leaks, you rotate one key and audit one usage log. With a gateway arrangement, you get centralized credential management, unified spend caps, and a single point of observability for all inference traffic.

Tailscale handles the trust boundary cleanly. Ironclaw is not accessible from the public internet. The only way to reach it is through the tailnet, which means the A2A communication channel between the two agents is authenticated and encrypted at the network layer before the application protocol even runs.

What This Adds Up To

The system as a whole is: a 678 KB binary on a $7 VPS, fronted by an IRC server and a static web client, with a private companion agent reachable over a mesh VPN, two models selected per-turn, a $2/day ceiling, and one API key for everything.

The design choices compound. A small Zig binary fits in cheap infrastructure. IRC’s minimal protocol surface means less framework code in the binary. Ergo and gamja are both single-artifact deployments. Tailscale eliminates the private networking problem without standing up a VPN server. A2A gives the two agents a typed communication contract without requiring a bespoke RPC layer. Tiered inference keeps the per-interaction cost low enough that a daily cap of $2 is not a constraint in practice.

None of these individual choices is novel. The composition is what is interesting. It demonstrates that a functioning multi-agent system with a public interface, private capabilities, unified billing, and sub-second response times has a minimum viable footprint that is much smaller than most people assume.

Building on Discord gives you a lot. You get authentication, push delivery, rich UI primitives, mobile clients, and a user base. What you give up is ownership of the transport and the platform lifecycle. For personal projects and internal tools, that trade is often not worth making. Larson’s setup is a reminder of how little infrastructure you actually need when the transport layer is just a conduit.

Was this interesting?