Local LLM Tooling Has Grown Past Ollama

Ollama got a lot of people running language models locally for the first time, and the ollama pull llama3 workflow had a Docker-for-models quality that made getting something running nearly frictionless. The question worth asking now is whether that convenience abstraction still makes sense for developers who know what they’re doing with local models.

A Sleeping Robots article that drew significant discussion on Hacker News argues that the ecosystem does not need Ollama at all. That framing overstates the case, but the underlying technical argument is sound: Ollama wraps llama.cpp, and llama.cpp already ships a capable server binary with an OpenAI-compatible API. The wrapper adds a daemon, a proprietary model registry, a custom storage layout, and a Modelfile abstraction. For power users, each of those is a thing to work around rather than a thing to use.

The Stack Under the Hood

Ollama is written in Go. Its core function is to manage a running instance of llama.cpp’s inference engine, provide a REST API, handle model downloads from ollama.com, and store model files under ~/.ollama/models/ in a content-addressed blob store keyed by SHA256 hash. When you run ollama serve, it starts an HTTP server on port 11434 that exposes both Ollama’s native API and, since version 0.1.24, an OpenAI-compatible endpoint at /v1/chat/completions.
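To make the content-addressed layout concrete, here is a minimal Python sketch of how a blob path could be derived from a file’s SHA256 digest. The blobs subdirectory and the sha256-<hex> file naming are assumptions about Ollama’s on-disk layout and may differ between versions; the helper names are hypothetical.

```python
import hashlib
from pathlib import Path

# Assumed layout: ~/.ollama/models/blobs/sha256-<hex digest>
BLOB_DIR = Path.home() / ".ollama" / "models" / "blobs"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def blob_path_for(path: Path, blob_dir: Path = BLOB_DIR) -> Path:
    """Return where a content-addressed store would keep this file."""
    return blob_dir / f"sha256-{sha256_of_file(path)}"
```

Two byte-identical files map to the same blob path, which is what makes deduplication cheap. It is also why the stored files carry no human-readable hint about which model or quantization they hold.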

The inference itself is entirely handled by llama.cpp. Ollama vendors a specific version of it, compiled into the binary with support for whichever hardware backends (CUDA, Metal, Vulkan, ROCm) the Ollama team has built and tested. That vendoring is a real architectural decision with real consequences. When llama.cpp ships improved CUDA kernels, better flash attention support, or a new quantization format, Ollama users wait for the Ollama team to update their vendor pin and cut a release. You’re on their schedule, not upstream’s.

What llama-server Gives You Directly

llama.cpp ships its own server binary, called llama-server (previously just server in older builds). If you’ve been treating llama.cpp as only a CLI inference tool, the server is worth a closer look.

llama-server \
  --model ~/.local/models/mistral-7b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --threads 8

That starts an HTTP server with an OpenAI-compatible API on port 8080. Any client that speaks OpenAI’s chat completions format works against it without modification:

from openai import OpenAI

# The client library insists on a non-empty key; llama-server only checks it
# if the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Explain KV cache in transformers."}],
    temperature=0.7,
)
print(response.choices[0].message.content)

The llama-server API exposes endpoints for chat completions, text completions, embeddings, tokenization, and model health at /health. It ships a built-in web UI at the root path for quick manual testing. You control every inference parameter directly: context length, batch size, GPU layer offload count, rope frequency scaling, mirostat sampling, and grammar-constrained generation via GBNF. None of these require translating through a Modelfile or depending on Ollama’s API surface having exposed the relevant flag.
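The /health endpoint is handy for scripts that launch the server and need to wait for the model to finish loading. The sketch below uses only the standard library and assumes the endpoint answers HTTP 200 once the server is ready and a non-2xx status before that; the function names and default timeouts are illustrative.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """True only when GET /health answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status (raised as HTTPError).
        return False


def wait_until_ready(base_url: str, deadline_s: float = 60.0, poll_s: float = 1.0) -> bool:
    """Poll /health until it succeeds or the deadline passes."""
    stop_at = time.monotonic() + deadline_s
    while time.monotonic() < stop_at:
        if is_healthy(base_url):
            return True
        time.sleep(poll_s)
    return False
```

A launcher script could call wait_until_ready("http://localhost:8080") after starting llama-server and only then begin sending requests.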

Grammar-constrained output is a concrete example of where the gap shows. llama.cpp’s GBNF grammar support lets you constrain generation to valid JSON, specific enumeration values, or arbitrary context-free grammars defined in Backus-Naur form. Ollama’s exposure of this feature has historically been partial and version-dependent, meaning developers who need reliable structured output have often ended up working around the abstraction layer anyway.
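As a small illustration of what a GBNF constraint looks like, the sketch below builds a grammar that only admits a fixed set of strings and wraps it in a request body. It assumes the server’s native completion endpoint accepts a grammar field alongside prompt and n_predict; the helper names are hypothetical, and the token limit of 8 is an arbitrary choice.

```python
import json


def gbnf_literal(text: str) -> str:
    """Quote a string as a GBNF terminal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def enum_grammar(values: list[str]) -> str:
    """Build a grammar whose only valid outputs are the given strings."""
    if not values:
        raise ValueError("need at least one allowed value")
    alternatives = " | ".join(gbnf_literal(v) for v in values)
    return f"root ::= {alternatives}"


def completion_payload(prompt: str, values: list[str]) -> str:
    """JSON body for a completion request that carries the grammar."""
    return json.dumps({
        "prompt": prompt,
        "grammar": enum_grammar(values),
        "n_predict": 8,  # arbitrary small cap; enum answers are short
    })
```

For a sentiment classifier, enum_grammar(["positive", "negative", "neutral"]) yields a one-rule grammar, and the sampler can then only emit one of those three labels.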

The Modelfile Problem

Ollama’s Modelfile borrows Docker’s syntax to describe a model configuration:

FROM llama3
SYSTEM "You are a concise technical assistant."
PARAMETER temperature 0.5
PARAMETER num_ctx 8192

This is workable for personal configuration. The issue is that it creates a proprietary abstraction over what is fundamentally just a GGUF file plus runtime parameters. Those same parameters can be passed directly to llama-server at startup, kept in a launch script or config file of your own, or specified per-request via the API. The Modelfile doesn’t add expressiveness; it adds a format to learn, a vendor-specific deployment artifact, and a source of confusion when parameters conflict between Modelfile defaults and per-request overrides.
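To show how thin the abstraction is, here is a rough sketch that turns a simple Modelfile into llama-server arguments. The mapping from num_ctx to --ctx-size and from temperature to --temp is an assumption that covers only the two parameters used above, and the parser handles single-line directives only, not triple-quoted blocks.

```python
import shlex
from typing import Optional

# Assumed mapping from Modelfile parameter names to llama-server flags.
FLAG_FOR_PARAMETER = {
    "num_ctx": "--ctx-size",
    "temperature": "--temp",
}


def modelfile_to_args(modelfile: str) -> tuple[list[str], Optional[str]]:
    """Return (llama-server flags, system prompt) for a simple Modelfile."""
    args: list[str] = []
    system_prompt = None
    for raw in modelfile.splitlines():
        parts = shlex.split(raw, comments=True)
        if not parts:
            continue
        keyword = parts[0].upper()
        if keyword == "PARAMETER" and len(parts) == 3:
            flag = FLAG_FOR_PARAMETER.get(parts[1])
            if flag:  # parameters without a known equivalent are skipped
                args += [flag, parts[2]]
        elif keyword == "SYSTEM" and len(parts) >= 2:
            system_prompt = " ".join(parts[1:])
    return args, system_prompt
```

The FROM line has no flag equivalent here because the model is just the GGUF path given to --model, and the system prompt would travel as a system message in each chat request instead of being baked into a named model.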

The model registry at ollama.com raises a related concern. It makes model discovery easy and the pull workflow ergonomic, but the quantization choices made for each registry entry are not always documented clearly. When you run ollama pull mistral, you get whatever quantization the Ollama team decided to package. When you download a GGUF directly from Hugging Face, the filename tells you exactly what you’re getting: mistral-7b-instruct-v0.2.Q4_K_M.gguf. The difference between Q4_K_M, Q5_K_S, and Q8_0 matters in practice; it affects both memory footprint and output quality in ways that depend on your hardware and task. K-quants in particular group weights into super-blocks whose per-block scales are themselves quantized, and mixed variants such as Q4_K_M keep some attention and feed-forward tensors at higher precision, which tends to preserve quality better than older uniform quantization at a similar bit width.
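The memory side of that trade-off is simple arithmetic once you know the quantization. The sketch below reads the tag from a GGUF filename and estimates weight storage from a bits-per-weight figure. The regular expression is a heuristic for common naming (it does not cover every scheme, for example IQ-style tags), and the only bits-per-weight value used is Q8_0 at 8.5, which follows from its block layout of 32 eight-bit weights plus one 16-bit scale; figures for K-quants depend on the tensor mix and are left out.

```python
import re
from typing import Optional

# Matches tags such as Q4_K_M, Q5_K_S, Q8_0 just before the .gguf suffix.
QUANT_TAG = re.compile(
    r"[.\-_](Q\d+_[A-Z0-9]+(?:_[A-Z0-9]+)?)\.gguf$", re.IGNORECASE
)


def quant_from_filename(filename: str) -> Optional[str]:
    """Extract the quantization tag from a GGUF filename, if present."""
    match = QUANT_TAG.search(filename)
    return match.group(1).upper() if match else None


def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, ignoring KV cache and metadata."""
    return n_params * bits_per_weight / 8


print(quant_from_filename("mistral-7b-instruct-v0.2.Q4_K_M.gguf"))
print(weight_bytes(1_000_000_000, 8.5))  # one billion weights at Q8_0
```

Real memory use is higher than this estimate because the KV cache and compute buffers grow with context length, but the estimate is usually enough to decide which quantization fits a given GPU at all.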

The Broader Ecosystem

The local LLM tooling space has matured considerably beyond any single tool.

llamafile, a Mozilla project, takes an architecturally different approach. It bundles llama.cpp and optionally a model into a single executable that runs on Linux, macOS, Windows, FreeBSD, and OpenBSD without installation. The binary is a polyglot file that is simultaneously a shell script, a Windows PE executable, and a ZIP archive containing weights. For distributing a local model as a self-contained artifact or embedding inference capability into a larger application, the format is technically interesting and removes the server management problem entirely.

LM Studio and Jan are GUI frontends built on llama.cpp that handle model management, download, and configuration through a graphical interface. They target the same non-technical user Ollama targets but tend to be more transparent about quantization choices and more configurable around inference parameters. Jan in particular has been building out an OpenAI-compatible local server that makes it usable as a development backend without touching the GUI at all.

koboldcpp is popular in the creative writing community. It wraps llama.cpp with additional sampling modes including mirostat v1/v2, typical sampling, and tail-free sampling, along with a web UI designed for story generation with context management features that vanilla llama.cpp doesn’t expose natively.

vLLM is worth mentioning in a different category. It’s not a llama.cpp wrapper; it’s a separate high-performance inference engine focused on throughput for CUDA hardware. Its PagedAttention implementation manages the KV cache as a paged memory system, substantially improving throughput under concurrent load. For serving a model to multiple users on a capable GPU machine, vLLM is the more appropriate tool. It’s less suited for CPU or Apple Silicon inference but relevant for any server-side multi-user scenario.
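The paging idea can be illustrated with a toy allocator. This is not vLLM’s implementation or API, only a sketch of the bookkeeping: the cache is split into fixed-size blocks, each sequence keeps a table of the blocks it owns, and a new block is claimed only when the previous one fills up. The class and method names are hypothetical.

```python
class PagedKVCache:
    """Toy block allocator: sequences own lists of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq id -> physical blocks
        self.lengths: dict[int, int] = {}             # seq id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one more token; return (block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no room left in owned blocks
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller must preempt or wait")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because memory is claimed one block at a time, a short request never reserves space for the full context window, so more concurrent sequences fit in the same cache than with one contiguous maximum-length allocation per request.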

When Ollama Still Makes Sense

Dismissing Ollama entirely overstates the case against it. The ollama CLI is well-designed, the pull workflow is frictionless, and the systemd service integration, automatic hardware detection, and cross-platform packaging all represent real engineering effort. Recent Ollama releases have improved the OpenAI compatibility surface, and for straightforward single-user chat the abstraction cost is low.

For someone setting up local inference for the first time, especially on Windows where building llama.cpp from source is not always smooth, Ollama remains a reasonable starting point. The argument against it sharpens as usage becomes more sophisticated: managing multiple models for different tasks, tuning inference parameters tightly, staying current with llama.cpp improvements, or building production tooling around local inference where the abstraction layer introduces unpredictable behavior.

The Practical Migration

For developers already working with Ollama who want more direct control, the migration is low friction. If your client already talks to Ollama’s OpenAI-compatible endpoint, the code changes by one line: replace http://localhost:11434/v1 with http://localhost:8080/v1. Code written against Ollama’s native API needs porting to the OpenAI-style request format first. Model management shifts from ollama pull to downloading GGUF files directly from Hugging Face repositories, which also gives you direct control over quantization choices.
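One way to keep the switch reversible during a migration is to make the backend a lookup and leave the rest of the request untouched. The sketch below builds, but does not send, an OpenAI-style chat request using only the standard library. The two base URLs are the defaults mentioned in this article, and the BACKENDS table and chat_request helper are hypothetical names.

```python
import json
import urllib.request

# Assumed defaults, taken from the ports mentioned in this article.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "llama-server": "http://localhost:8080/v1",
}


def chat_request(backend: str, model: str, user_text: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{BACKENDS[backend]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = chat_request("llama-server", "mistral-7b", "Explain KV cache in transformers.")
print(req.full_url)
```

Passing the request to urllib.request.urlopen sends it once a server is listening. Only the base URL and the model name differ between the two backends.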

The local LLM tooling space was genuinely rough in 2023, and Ollama filled a real gap at the right time. The gap has since narrowed as llama-server has matured, the GGUF ecosystem has consolidated around a stable format, and the broader ecosystem has grown to cover the use cases Ollama originally dominated. That’s less a critique of Ollama than an observation about normal infrastructure maturity. The convenience layer that made local models accessible is now a layer developers can reasonably skip, and in many cases should.