What 'Inference for Agents' Actually Requires at the Infrastructure Level

The distinction between running a model and building infrastructure for agents is not semantic. A completion API that returns tokens is a solved problem. An inference layer that holds state between tool calls, routes across models, retries gracefully, and does all of this at global scale without a round trip to a centralized datacenter is a different engineering problem.

Cloudflare’s AI Platform announcement is an attempt to address that distinction directly. They’re pulling together Workers AI, AI Gateway, and Vectorize into something more coherent, and framing the whole stack around agent workflows rather than simple inference. The Hacker News thread has the usual mix of skepticism and genuine interest, but the underlying architecture is worth examining on its own terms.

The Request/Response Model Doesn’t Fit Agents

Traditional LLM APIs are stateless. You send a prompt, you get tokens back. This works fine for chat completions, text summarization, and one-shot generation. The session state lives wherever your application code lives; the model is just a function.

Agents break this model in several ways. An agent orchestration loop makes multiple inference calls in sequence. Between calls, it may read from a database, call an external tool, write intermediate results to state, and branch based on model output. If each of those inference calls takes 50-200ms of network latency to reach a centralized model server, your agent loop accumulates that latency with every step.

For a five-step agent task, the difference between 10ms and 150ms per inference call is 700ms of overhead that has nothing to do with model quality or token throughput.
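The arithmetic behind that figure is simple enough to check directly. The sketch below assumes the steps run strictly in sequence and that per-call network latency is constant; the function name is illustrative, and the 10ms and 150ms values are the ones used above rather than measurements.

```typescript
// Network overhead accumulated by a sequential agent loop.
// Assumes every step waits on the previous one and latency is constant per call.
function loopNetworkOverheadMs(steps: number, perCallLatencyMs: number): number {
  return steps * perCallLatencyMs;
}

const steps = 5;
const edgeMs = loopNetworkOverheadMs(steps, 10); // 50
const centralMs = loopNetworkOverheadMs(steps, 150); // 750
const extraOverheadMs = centralMs - edgeMs; // 700

console.log({ edgeMs, centralMs, extraOverheadMs });
```

The overhead grows linearly with step count, so a 20-step loop under the same assumptions would pay 2.8 seconds of extra network time.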

Cloudflare’s pitch is that running inference on their edge network, colocated with the Workers that run your agent logic, collapses that gap. Your orchestration code and your inference endpoint are in the same region, sometimes on the same machine.

What the Platform Actually Provides

Workers AI is Cloudflare’s inference runtime, distributed across their global network of data centers. The model catalog has grown substantially: Llama 3, Mistral, Gemma, Phi, and several embedding and image generation models are available. The API surface follows the OpenAI completion format, which means migration from other inference providers requires minimal code changes. Workers AI supports streaming responses natively, which matters for agents where you want to begin processing tool call outputs before the full response is generated.

AI Gateway is the more immediately interesting piece for production agent workloads. It acts as a proxy and control plane in front of any inference provider: Workers AI, OpenAI, Anthropic, Together AI, and others. Through the gateway you get request logging, cost tracking, rate limiting, semantic caching, and fallback routing. For an agent that might make dozens of inference calls per task, having all of that instrumented at the infrastructure level rather than scattered through application code is a meaningful operational improvement.
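To make fallback routing concrete, here is a minimal sketch of the logic a gateway takes off your hands. It is not AI Gateway's API or configuration format: the Provider type and completeWithFallback function are hypothetical names, and each provider is just an async function you would wire to a real client.

```typescript
// A provider is any async function from prompt to completion text.
// These are placeholders for real clients, not actual SDK calls.
interface Provider {
  name: string;
  complete: (prompt: string) => Promise<string>;
}

interface RoutedResult {
  provider: string;
  text: string;
  attempts: number;
}

// Try providers in priority order and return the first successful completion.
async function completeWithFallback(
  providers: Provider[],
  prompt: string,
): Promise<RoutedResult> {
  const errors: string[] = [];
  let attempts = 0;
  for (const p of providers) {
    attempts += 1;
    try {
      const text = await p.complete(prompt);
      return { provider: p.name, text, attempts };
    } catch (err) {
      // Record the failure and move on to the next provider.
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

Writing this once is easy; keeping it consistent across dozens of call sites, with logging and rate limits attached, is the part that benefits from living in the infrastructure layer.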

Semantic caching deserves specific attention. If your agent repeatedly makes similar requests, whether summarizing documents, extracting structured data, or routing queries, the gateway can serve cached responses for semantically equivalent inputs. Depending on your workload, this can reduce inference costs substantially without changes to agent logic.
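The general idea of a semantic cache can be sketched as a similarity lookup over request embeddings. This illustrates the concept only and makes no claim about how the gateway implements it: the SemanticCache class and its threshold are hypothetical, and the embeddings are assumed to come from whatever embedding model you already use.

```typescript
type Vec = number[];

// Cosine similarity between two vectors; returns 0 for zero-length vectors.
function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    na += x * x;
    nb += y * y;
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

class SemanticCache {
  private entries: { embedding: Vec; response: string }[] = [];
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Return the cached response whose embedding is most similar to the query,
  // provided the similarity clears the threshold; otherwise undefined.
  lookup(embedding: Vec): string | undefined {
    let bestScore = -Infinity;
    let bestResponse: string | undefined;
    for (const e of this.entries) {
      const score = cosine(embedding, e.embedding);
      if (score >= this.threshold && score > bestScore) {
        bestScore = score;
        bestResponse = e.response;
      }
    }
    return bestResponse;
  }

  store(embedding: Vec, response: string): void {
    this.entries.push({ embedding, response });
  }
}
```

The threshold is the main tuning knob: set it too low and the cache returns answers to questions that were not asked, too high and it degenerates into exact matching.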

Vectorize is Cloudflare’s vector database, integrated directly with Workers AI’s embedding models. The standard RAG pattern (embed a query, retrieve relevant chunks, inject into context) runs within the same runtime environment as your agent code, without a separate API call to an external vector service.
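The retrieve-and-inject half of that pattern looks roughly like the sketch below. It uses an in-memory array in place of a vector database and assumes the embeddings are already computed and unit length; Chunk, retrieve, and buildPrompt are illustrative names, not part of Vectorize or Workers AI.

```typescript
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Dot product; equals cosine similarity when both vectors are unit length.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    sum += (a[i] ?? 0) * (b[i] ?? 0);
  }
  return sum;
}

// Return the k chunks most similar to the query embedding, best first.
function retrieve(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .slice() // copy so the caller's array is not reordered
    .sort((x, y) => dot(query, y.embedding) - dot(query, x.embedding))
    .slice(0, k);
}

// Inject the retrieved chunks into the prompt as numbered context.
function buildPrompt(question: string, context: Chunk[]): string {
  const ctx = context.map((c, i) => `[${i + 1}] ${c.text}`).join("\n");
  return `Answer using only the context below.\n\nContext:\n${ctx}\n\nQuestion: ${question}`;
}
```

In a real deployment the linear scan in retrieve is what the vector index replaces; the prompt assembly step stays the same.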

Durable Objects as the Missing State Primitive

The most underrated part of Cloudflare’s agent story isn’t in the AI Platform announcement specifically. Durable Objects are Cloudflare’s stateful compute primitive: each object is a single, globally unique instance with strongly consistent storage that holds state in memory and in durable storage, and handles WebSocket connections directly.

For agent workloads, Durable Objects solve the state problem in a clean way. A running agent needs somewhere to hold its conversation history, tool call results, pending tasks, and execution context. The typical approach is a database round trip on every step, or in-memory state that dies when the serverless function completes.

A Durable Object lets you instantiate a stateful agent that persists across multiple requests, holds its context in memory during active execution, and stores state durably without a separate database call. Cloudflare has started explicitly framing Durable Objects as an agent primitive, and their Agents SDK builds directly on this model.

The WebSocket support matters here too. Long-running agent tasks that stream intermediate results back to a frontend need a persistent connection. Durable Objects handle that connection natively, without a separate WebSocket service or coordination layer.

A basic stateful agent in this model looks roughly like this:

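import { DurableObject } from 'cloudflare:workers';

// Assumptions: the Workers AI binding is named AI in the Worker's configuration,
// and the Ai type comes from @cloudflare/workers-types.
interface Env { AI: Ai }
interface Message { role: 'user' | 'assistant'; content: string }

export class AgentState extends DurableObject<Env> {
  // Held in memory only; lost if the runtime evicts the object while idle.
  private history: Message[] = [];
  private context: Record<string, unknown> = {};

  async run(input: string): Promise<string> {
    this.history.push({ role: 'user', content: input });

    // Bindings are reached through this.env inside a Durable Object class.
    const result = await this.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: this.history,
      stream: false,
    });

    // With stream: false the result is an object carrying the generated text.
    const text = (result as { response?: string }).response ?? '';
    this.history.push({ role: 'assistant', content: text });
    return text;
  }
}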

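While the object stays resident, this.history carries over between requests without a database read on every turn. In-memory fields are lost if the runtime evicts an idle object, so anything that must outlive the active session should also be written to the object’s storage.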
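One caveat is that in-memory fields last only as long as the instance does, and an idle object can be evicted. A common way to handle this is a write-through cache: keep the history in memory for speed, and persist it on every append so a fresh instance can reload it. The sketch below shows the pattern against a hypothetical KeyValueStorage interface shaped like an async get/put store; InMemoryStorage is a stand-in for testing, not Cloudflare's storage implementation.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical minimal async key-value interface used for illustration.
interface KeyValueStorage {
  get<T>(key: string): Promise<T | undefined>;
  put<T>(key: string, value: T): Promise<void>;
}

// Test double that copies values through JSON to mimic serialization.
class InMemoryStorage implements KeyValueStorage {
  private data = new Map<string, string>();

  async get<T>(key: string): Promise<T | undefined> {
    const raw = this.data.get(key);
    return raw === undefined ? undefined : (JSON.parse(raw) as T);
  }

  async put<T>(key: string, value: T): Promise<void> {
    this.data.set(key, JSON.stringify(value));
  }
}

class PersistentHistory {
  private cache: Message[] | undefined;
  private readonly storage: KeyValueStorage;

  constructor(storage: KeyValueStorage) {
    this.storage = storage;
  }

  // Load from storage once, then serve from memory.
  async load(): Promise<Message[]> {
    if (this.cache === undefined) {
      this.cache = (await this.storage.get<Message[]>("history")) ?? [];
    }
    return this.cache;
  }

  // Append in memory and write the full history through to storage.
  async append(message: Message): Promise<void> {
    const history = await this.load();
    history.push(message);
    await this.storage.put("history", history);
  }
}
```

Rewriting the whole history on each append is the simplest option and fine for short sessions; long conversations would want per-message keys or periodic compaction instead.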

Where the Architecture Has Real Constraints

Edge inference means edge hardware. You are not running Llama 3 70B in full precision at a Cloudflare PoP; at 16 bits per parameter the weights alone come to roughly 140GB, more than a single 80GB GPU can hold. The models available through Workers AI are quantized and sized to run on GPU hardware that scales across hundreds of locations globally. The top of the capability curve is lower than what you get from a centralized provider running H100s.

For many agent tasks this is fine. A tool-use loop that extracts structured data, routes queries, or summarizes documents doesn’t require a frontier model. Llama 3.1 8B or Mistral 7B, well-prompted, handles a wide class of agent subtasks competently and at substantially lower cost than a frontier model API call.

But if your agent needs the reasoning depth of Claude 3 Opus or GPT-4o for complex multi-step planning, you’re routing through AI Gateway to an external provider anyway. The edge inference story applies to the “fast, cheap inference for routine subtasks” tier of a multi-model agent, not to the planning tier.

This is actually a reasonable architectural pattern. Build a pipeline where cheap, low-latency edge inference handles tool dispatch, data extraction, and routing decisions, while complex reasoning calls go to a frontier model through the gateway. Cloudflare’s platform supports this pattern well, but the marketing tends to obscure the distinction between what runs on their infrastructure and what still relies on external providers.
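A minimal sketch of that split is a lookup from task kind to model tier. The task kinds, the tier table, and the frontier model name below are all hypothetical; only the edge model identifier appears elsewhere in this article.

```typescript
type TaskKind = "tool_dispatch" | "extraction" | "routing" | "planning";
type Tier = "edge" | "frontier";

// Which tier handles which kind of subtask. Adjust to your own workload.
const TIER_BY_TASK: Record<TaskKind, Tier> = {
  tool_dispatch: "edge",
  extraction: "edge",
  routing: "edge",
  planning: "frontier",
};

// Model identifiers per tier; the frontier entry is a placeholder.
const MODELS: Record<Tier, string> = {
  edge: "@cf/meta/llama-3.1-8b-instruct",
  frontier: "hypothetical-frontier-model",
};

interface RouteDecision {
  tier: Tier;
  model: string;
}

// Pick the tier and model for a given subtask.
function routeTask(kind: TaskKind): RouteDecision {
  const tier = TIER_BY_TASK[kind];
  return { tier, model: MODELS[tier] };
}
```

A static table is the simplest version. Some systems instead start on the cheap tier and escalate when the output fails validation, which trades an occasional extra call for fewer frontier requests overall.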

Comparing the Field

Modal takes a different approach: arbitrary Python environments on serverless GPU, with a programming model that lets you define inference as a regular function call within your own code. You bring your own models and get full control over hardware configuration. The tradeoff is that Modal doesn’t have an edge network, so globally distributed low-latency inference isn’t in scope.

Together AI and Fireworks AI offer deep model catalogs with strong throughput and competitive pricing on open models, but they’re pure inference APIs without the surrounding compute and state primitives. Integrating them into an agent workflow requires building your own orchestration layer.

AWS Bedrock Agents is the enterprise-grade alternative, with access to frontier models, built-in tool use orchestration, and deep integration with AWS services. The complexity is correspondingly higher, and the developer experience is optimized for teams that already operate inside the AWS ecosystem.

Cloudflare’s differentiation is the integration between compute, inference, storage, and networking within a single global deployment target. The primitives are simpler, the operational surface is smaller, and the deployment model (push code, it runs globally) is already familiar to frontend and full-stack developers who want agent capabilities without managing infrastructure.

A Coherent Bet

The framing of “inference layer designed for agents” is partly a positioning statement, but the underlying infrastructure combination is genuinely coherent. Edge inference, a well-designed AI gateway with semantic caching and multi-provider routing, Durable Objects for stateful agent execution, and a colocated vector database fit together as a stack in a way that most competing offerings don’t.

Cloudflare has consistently followed a pattern of taking infrastructure concerns that required dedicated expertise and turning them into configurations. CDN, DDoS mitigation, DNS, zero-trust networking: each of these required specialized vendors or deep operational knowledge before Cloudflare made them accessible at the network layer. Applying that same instinct to AI inference, specifically for the agent orchestration pattern that is becoming the dominant production use case, is a natural extension of what they do.

The constraint on model capability at the edge is real, but it also enforces architectural discipline. An agent that routes every subtask to a frontier model will be slow and expensive regardless of where the inference runs. A layered system where edge inference handles the fast path and frontier models handle the hard cases is better engineering regardless of Cloudflare’s involvement. Their platform just makes that pattern easier to implement and operate.
