The Local LLM Stack Has Matured Past Needing a Wrapper

Source: hackernews

When this piece on the sleepingrobots blog landed on Hacker News with 600+ points, the reaction split pretty cleanly: people who had already migrated away from Ollama nodded along, and people for whom ollama run llama3 is still the entire workflow got defensive. Both reactions make sense. The argument is not that Ollama is broken. It’s that Ollama solved a problem the rest of the ecosystem has since solved itself, and the abstraction it leaves behind now costs more than it saves.

What Ollama Was Actually Doing

In early 2023, getting a quantized model running locally meant cloning llama.cpp, figuring out which build flags matched your hardware, converting weights from PyTorch format to GGML (later GGUF), and invoking the binary with a wall of flags. Ollama packaged all of that behind ollama pull and ollama run. The model registry at ollama.com meant you didn’t have to touch Hugging Face. The Go daemon handled model loading and unloading. The Modelfile format gave you a Docker-inspired way to pin system prompts and parameters alongside a model.

For someone who just wanted GPT-4-like responses without a cloud dependency, this was genuinely the right abstraction. The CLI was clean. The API was consistent. The model naming made sense to humans.

The problem is that llama.cpp kept moving, the rest of the tooling ecosystem caught up, and Ollama’s convenience layer started freezing users out of capabilities that matter.

The OpenAI-Compatible API as the Great Equalizer

The pivot that changed the calculus was the standardization of the OpenAI REST API shape as the universal interface for local inference. Today, every serious local inference tool exposes POST /v1/chat/completions, GET /v1/models, POST /v1/embeddings, and streaming via server-sent events. Ollama added this in version 0.1.24. llama.cpp’s server binary has supported it for longer. LM Studio serves it on port 1234. Jan.ai serves it on port 1337. vLLM serves it on port 8000.
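As a rough illustration of the streaming side, the sketch below pulls text fragments out of an OpenAI-style server-sent event stream. The iter_deltas helper and the sample lines are made up for illustration. The data: prefix, the [DONE] sentinel, and the choices[0].delta.content path follow the OpenAI streaming shape as commonly implemented, and individual servers may add or omit fields.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        # Skip keep-alive blanks, SSE comments, and non-data fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # sentinel that ends the stream
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hand-written sample stream (illustrative, not captured from a real server).
SAMPLE_STREAM = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Ro"}}]}',
    'data: {"choices":[{"delta":{"content":"PE"}}]}',
    "data: [DONE]",
]

streamed_text = "".join(iter_deltas(SAMPLE_STREAM))
```

Because the parser only depends on the wire format, the same function should work against any of the servers listed above, provided they follow that format.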

Once your application code calls http://localhost:PORT/v1/chat/completions with a standard JSON body, the tool running behind that port is interchangeable. Apart from the port, the model name in the model field is the only thing that needs to change. The request looks the same whether you’re talking to Ollama, to a direct llama-server process, or to vLLM:

{
  "model": "llama3",
  "messages": [{"role": "user", "content": "Explain RoPE scaling"}],
  "temperature": 0.3,
  "max_tokens": 1024,
  "stream": true
}
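To make the interchangeability concrete, here is a minimal Python sketch that builds the same request for several backends and changes only the port. The build_chat_request helper and the BACKEND_PORTS table are hypothetical names. Ports 1234, 1337, and 8000 come from the text above, while 11434 (Ollama) and 8080 (llama-server) are those tools’ usual defaults and may not match your setup.

```python
import json
import urllib.request

# Assumed default ports; adjust to whatever your servers actually listen on.
BACKEND_PORTS = {
    "ollama": 11434,
    "llama-server": 8080,
    "lm-studio": 1234,
    "jan": 1337,
    "vllm": 8000,
}


def build_chat_request(backend: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for one backend."""
    url = f"http://localhost:{BACKEND_PORTS[backend]}/v1/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "max_tokens": 1024,
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req_ollama = build_chat_request("ollama", "llama3", "Explain RoPE scaling")
req_vllm = build_chat_request("vllm", "llama3", "Explain RoPE scaling")
# Sending is one more call, e.g. urllib.request.urlopen(req_ollama), and
# needs a server listening on that port.
```

The two request objects carry byte-identical bodies; only the URL differs. In practice the model string usually has to match whatever name each server registered for the model.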

When the API is a commodity, what differentiates the tools is everything that sits around it: what parameters you can control, what the model loading behavior looks like, how multi-user throughput scales, and what you can observe at runtime.

What llama.cpp Server Exposes That Ollama Doesn’t

Running llama-server directly gives you a surface area that Ollama’s abstraction layer doesn’t reach. Some examples that matter in practice:

KV cache quantization. The --cache-type-k q8_0 and --cache-type-v q8_0 flags let you quantize the key-value cache to 8-bit integers, which cuts memory usage significantly at the cost of a small quality reduction. On a 24GB GPU running a 13B model with a long context, this is the difference between fitting and not fitting. Ollama’s Modelfile PARAMETER instruction doesn’t expose this.
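A back-of-the-envelope calculation shows where the savings come from. This sketch assumes a Llama-2-13B-style shape (40 layers, 40 KV heads of dimension 128, no grouped-query attention) and a 4096-token context. Those numbers are illustrative assumptions, not measurements, and real allocations include extra overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_block: int, block_size: int) -> int:
    """Approximate KV cache size: K and V tensors for every layer and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K plus V
    return elements * bytes_per_block // block_size


# Assumed model shape (Llama-2-13B-style); swap in your model's values.
SHAPE = dict(n_layers=40, n_kv_heads=40, head_dim=128, n_ctx=4096)

# f16 stores each element in 2 bytes.
f16_bytes = kv_cache_bytes(**SHAPE, bytes_per_block=2, block_size=1)
# q8_0 packs 32 int8 values plus one 16-bit scale into 34 bytes.
q8_bytes = kv_cache_bytes(**SHAPE, bytes_per_block=34, block_size=32)

GIB = 1024 ** 3
print(f"f16:  {f16_bytes / GIB:.2f} GiB")
print(f"q8_0: {q8_bytes / GIB:.2f} GiB")
```

Under these assumptions the cache shrinks from roughly 3.1 GiB to roughly 1.7 GiB, and the gap grows linearly with context length.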

Flash Attention. --flash-attn enables Flash Attention, which reduces memory bandwidth pressure for long sequences. Again, not accessible via Ollama’s Modelfile.

Continuous batching. The --cont-batching flag and --parallel N together allow multiple simultaneous requests to share GPU compute cycles. For any multi-user scenario, this is the difference between a single-user setup and something that can handle a small team. Ollama manages one model instance per daemon and doesn’t expose the batching parameters. If you’re building an application that needs to serve several concurrent users from a local GPU, this gap is significant.

Prometheus metrics. When started with the --metrics flag, llama-server exposes GET /metrics with inference timing, token counts, slot utilization, and queue depth. If you’re running local inference as a service and want to alert on it, this endpoint is what you instrument. Ollama has no equivalent.
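If you want to inspect that endpoint without a full Prometheus setup, a few lines of Python can parse the text exposition format. The sketch below is deliberately simplified: it assumes no trailing timestamps, and the metric names in the sample are illustrative assumptions, so check them against what your server actually emits.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {metric_name: value}."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is the last whitespace-separated field (no timestamp assumed).
        name, value = line.rsplit(None, 1)
        metrics[name] = float(value)
    return metrics


# Illustrative sample; metric names are assumptions, not a captured response.
SAMPLE = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1523
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 2
"""

parsed = parse_metrics(SAMPLE)
```

For real alerting you would point a Prometheus scraper at the endpoint instead; this is only meant for quick checks from a script.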

Fill-in-the-middle. POST /infill supports FIM-style code completion for models that were trained with prefix-suffix-middle structure (DeepSeek Coder, CodeLlama). Ollama doesn’t expose this endpoint.
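A FIM request boils down to splitting the file at the cursor. The sketch below builds a JSON body for POST /infill using the input_prefix and input_suffix fields as I understand them from the llama.cpp server documentation. The build_infill_body helper is a hypothetical name, and the field names are worth verifying against your server version.

```python
import json


def build_infill_body(source: str, cursor: int, n_predict: int = 64) -> str:
    """Split source at the cursor and build a JSON body for POST /infill."""
    body = {
        "input_prefix": source[:cursor],  # code before the gap
        "input_suffix": source[cursor:],  # code after the gap
        "n_predict": n_predict,           # cap on generated tokens
    }
    return json.dumps(body)


# Toy example: the cursor sits right after "return ".
SOURCE = "def add(a, b):\n    return \n"
CURSOR = SOURCE.index("return ") + len("return ")
infill_body = json.loads(build_infill_body(SOURCE, CURSOR))
```

The server is then expected to generate the text that belongs between the prefix and the suffix, which only works well with a model trained for infilling.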

Native context size. This one has bitten a lot of people. Ollama historically defaulted num_ctx to 2048 tokens regardless of what the underlying model supported. Running Llama 3.1, which has a native context window of 128K (131,072) tokens, through Ollama without explicitly setting PARAMETER num_ctx 131072 in your Modelfile meant you got a 2048-token context, and long conversations were silently truncated. llama-server’s --ctx-size defaults to 4096, and because it is a flag on the command line, the limit is explicit. The 2048 default was a real source of degraded responses for users who didn’t know to look for it.
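Pulling the flags from this section together, a launch script might assemble the command line along these lines. This is a sketch: the llama_server_argv helper is hypothetical, the flag spellings are copied from the text above and can differ between llama.cpp versions (llama-server --help is the authority), and the model path and numeric values are placeholders.

```python
import shlex


def llama_server_argv(model_path: str, ctx_size: int, parallel: int,
                      kv_type: str = "q8_0") -> list[str]:
    """Assemble a llama-server command line from the flags discussed above."""
    return [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx_size),    # explicit context limit
        "--cache-type-k", kv_type,      # quantized K cache
        "--cache-type-v", kv_type,      # quantized V cache
        "--flash-attn",
        "--cont-batching",
        "--parallel", str(parallel),    # number of concurrent slots
    ]


# Placeholder path and sizes; pick values that fit your hardware.
argv = llama_server_argv("models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
                         ctx_size=32768, parallel=4)
command = shlex.join(argv)  # printable form for logs or a shell script
```

Passing argv to subprocess.Popen would start the server. Keeping the flags in a list makes the configuration easy to diff and version-control.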

The Model Storage Problem

Ollama stores models in ~/.ollama/models/blobs/ as content-addressed files named by their SHA256 digest. The directory looks like this:

~/.ollama/models/blobs/
  sha256-8a7c9f3b2a1d...
  sha256-c4e2a1f8d3b9...
  sha256-7d1e6a4b9c3f...

You cannot tell what model a given blob corresponds to without consulting the manifest files in ~/.ollama/models/manifests/. If you have GGUF files downloaded from Hugging Face sitting in a ~/models/ directory, Ollama won’t use them directly. You import them via ollama create, which creates a new copy under the content-addressed structure. You now have duplicate storage.
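For the curious, the manifest lookup can be sketched in Python. This assumes the manifest is OCI-style JSON whose layers carry a mediaType and a digest, and that the weights layer uses the media type shown below. Both are assumptions about Ollama’s on-disk layout rather than a documented contract, and the manifest here is hand-written with made-up digests.

```python
from pathlib import PurePosixPath

# Assumed media type of the weights layer in an Ollama manifest.
MODEL_LAYER = "application/vnd.ollama.image.model"


def model_blob_path(manifest: dict, models_dir: str = "~/.ollama/models") -> str:
    """Map a parsed manifest to the blob file holding the model weights."""
    for layer in manifest.get("layers", []):
        if layer.get("mediaType") == MODEL_LAYER:
            # A digest "sha256:<hex>" is stored on disk as "sha256-<hex>".
            filename = layer["digest"].replace(":", "-", 1)
            return str(PurePosixPath(models_dir) / "blobs" / filename)
    raise ValueError("manifest has no model layer")


# Hand-written manifest with made-up digests, for illustration only.
EXAMPLE_MANIFEST = {
    "schemaVersion": 2,
    "layers": [
        {"mediaType": "application/vnd.ollama.image.template",
         "digest": "sha256:" + "b" * 64},
        {"mediaType": MODEL_LAYER, "digest": "sha256:" + "a" * 64},
    ],
}

blob = model_blob_path(EXAMPLE_MANIFEST)
```

In a real setup you would load the manifest with json.load from a file under ~/.ollama/models/manifests/ and expand the ~ before opening the blob.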

Direct llama-server loads whatever GGUF file you point it at: llama-server -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf. The filename is the name. There’s nothing to manage.

llamafile Takes the Opposite Approach

Mozilla’s llamafile project, created by Justine Tunney, is worth understanding because it attacks the same problem Ollama originally set out to solve, setup friction, through a completely different mechanism. A llamafile is a single executable file that bundles llama.cpp plus model weights using the APE (Actually Portable Executable) format from Cosmopolitan Libc. The same binary runs on Linux x86-64, macOS ARM64, macOS x86-64, Windows x86-64, and FreeBSD without installation.

chmod +x llama-3.1-8b.llamafile
./llama-3.1-8b.llamafile  # starts an OpenAI-compatible server

No daemon. No registry. No abstraction over llama.cpp’s parameters, because llamafile passes flags directly through. The distribution story is unusual (a 4GB+ single file is awkward), but for the use case of handing someone a model that just works on their machine, it’s hard to beat.

When Ollama Is Still the Right Choice

None of this means Ollama should be avoided universally. The model registry at ollama.com is genuinely convenient for discovery and version pinning. The CLI for quick experimentation is hard to beat: ollama run llama3 starts an interactive session from a cold download in one command. The Modelfile system is useful for teams that want to version-control a system prompt alongside a model reference without writing server startup scripts.

For someone exploring local models for the first time, Ollama is still probably the right starting point. The daemon model means you don’t have to think about process management. The auto-download behavior removes the Hugging Face navigation step.

The friction shows up when you need more: more context, better throughput, visibility into what the inference engine is actually doing, or access to model capabilities that Ollama’s Modelfile doesn’t expose. At that point, llama-server with its full flag set is a process start, not a migration.

The Version Coupling Issue

One last friction point that compounds over time: Ollama bundles a specific version of llama.cpp internally. When a new model architecture ships, llama.cpp typically adds GGUF support within days. Ollama users wait for the Ollama team to cut a new release with the updated llama.cpp. For mainstream models this lag rarely matters, but for anyone following the edge of the open model space, it’s a recurring annoyance.

The broader point from the original article is correct: the local LLM ecosystem converged on a standard API, standard model formats, and standard hardware backends. Ollama was a bridge to that world that made sense when the world was rougher. The bridge is still there, but the road on the other side is paved now, and driving directly is faster.
