Ollama Solved a 2023 Problem. The Ecosystem Has Moved On.

There is a post making the rounds on Hacker News from sleepingrobots.com arguing that the local LLM ecosystem no longer needs Ollama. The discussion thread hit 615 points with 200 comments, which tells you this resonates with people who have been running models locally for a while. The argument is correct, but the more interesting question is not whether to stop using Ollama. It is understanding exactly what work Ollama is doing on your behalf, what it gets wrong, and when that matters.

What Ollama Actually Is

Ollama is a Go daemon that wraps llama.cpp. When you run ollama serve, you start a background process that maintains a model registry, handles downloading models from the Ollama library, keeps one model loaded in memory at a time, and exposes an HTTP API. Requests from ollama run or any OpenAI-compatible client travel: client → HTTP → Ollama Go daemon → CGo → llama.cpp subprocess. Every inference request crosses that boundary.

In late 2023 and early 2024, this was a reasonable trade. llama.cpp’s server example was still maturing. There was no built-in web UI, model downloading required manual GGUF hunting, and running models with correct chat templates required reading documentation carefully. Ollama removed all of that friction with a clean ollama pull llama3 workflow. For someone who just wanted to run a model locally without archaeology, it was the right tool.

The underlying library has not stood still.

What llama.cpp Ships Today

llama-server is the built-in HTTP server that ships with every llama.cpp build. It provides a full OpenAI-compatible API that covers chat completions, text completions, and embeddings, along with a set of native endpoints that expose capabilities the OpenAI spec does not have a slot for.

Starting the server directly:

llama-server \
  -m ./models/meta-llama-3.1-8b-q4_k_m.gguf \
  -ngl 99 \
  -c 32768 \
  -np 4 \
  --host 0.0.0.0 \
  --port 8080 \
  --metrics \
  -fa

The flags here do things Ollama either does not expose or defaults poorly: -ngl 99 offloads all layers to GPU (Ollama auto-selects this, sometimes conservatively), -c 32768 sets the actual context window (Ollama defaults many models to 2048 regardless of what the model supports), -np 4 creates four parallel KV cache slots for concurrent requests, --metrics enables a Prometheus endpoint at /metrics, and -fa enables Flash Attention, which reduces VRAM usage and improves throughput at long contexts.

The native /completion endpoint accepts parameters that have no equivalent in Ollama’s API surface:

{
  "prompt": "...",
  "cache_prompt": true,
  "slot_id": 0,
  "grammar": "root ::= (\"yes\" | \"no\")",
  "json_schema": { "type": "object", "properties": { "answer": { "type": "string" } } },
  "mirostat": 2,
  "min_p": 0.05,
  "n_probs": 10,
  "logit_bias": [[1234, -10.0]]
}

cache_prompt: true is particularly significant. llama.cpp will reuse the KV cache for any request that shares an identical prefix with a previously computed sequence. If your system prompt is 500 tokens and you are running a chatbot, every turn after the first saves 500 tokens of prefill computation. Ollama does not expose this. The web UI served at / is functional for ad-hoc conversation without any extra software.

The Chat Template Problem

This is the most dangerous footgun in Ollama, and it does not produce obvious errors. Every instruct-tuned model ships with a chat template, usually a Jinja2 template embedded in the model’s tokenizer_config.json. This template defines how the assistant, user, and system turns get formatted into the raw text the model actually sees. Get it wrong and the model behaves degraded in ways that look like a capability issue rather than a formatting issue.

llama.cpp reads the chat template directly from the GGUF file’s metadata, where it is embedded during quantization from the original HuggingFace weights. When you use POST /v1/chat/completions, llama.cpp applies the correct template natively.

Ollama uses its own Modelfile TEMPLATE field, which is Go template syntax, not Jinja2. For models in the Ollama library, someone has translated the original Jinja2 template into Go templates. Translations are imperfect. Edge cases around system prompt handling, tool call formatting, and special token placement have produced incorrect behavior for various models over the years. The Llama 3 series, Mistral’s function-calling variants, and models with non-standard BOS/EOS handling have all had documented Modelfile template bugs.

When you pull a model with ollama pull, you are trusting that translation. When you run the same GGUF through llama-server, you are running the original template. For serious use, that distinction matters.

The Disk and Memory Overhead

Ollama stores models in a content-addressed blob registry under ~/.ollama/models/blobs/. When you ollama pull a model, it downloads and stores the GGUF in this registry regardless of whether you already have that file elsewhere on disk. If you maintain a model library for use with multiple tools, Ollama creates a second copy. For a 7B Q4_K_M model at roughly 4.5GB, this is a real cost on laptops with limited NVMe space.

You can work around it with ollama create from a Modelfile that references your existing file, but that adds friction to a workflow Ollama is supposed to simplify. llama.cpp server accepts a direct path to any .gguf file with -m. No registry, no import step.

The update lag is also real. llama.cpp merges changes at a pace that reflects being the core engine for a global research and deployment community. New quantization types like IQ2_XXS and IQ1_S, GPU backend improvements, Flash Attention, and model architecture support for new releases all land in llama.cpp first. Ollama bundles a specific llama.cpp version and updates on its own schedule. At any given moment, Ollama users are running a llama.cpp that is weeks to months behind HEAD.

Performance in Practice

The performance gap between Ollama and direct llama-server is not dramatic for single-user casual use. Community benchmarks on threads like the HN discussion above consistently show 5 to 15 percent lower tokens-per-second through Ollama versus direct llama.cpp with equivalent settings on the same hardware. The causes are the extra IPC hop, CGo’s interaction with Go’s garbage collector in latency-sensitive paths, and Ollama’s conservative defaults.

The gap grows under concurrent load. llama.cpp’s -np flag creates multiple KV cache slots that allow genuinely parallel inference requests to share a single model load. Ollama’s concurrency handling is more limited. If you are running something like a local API for multiple users or integrating local inference into a multi-agent setup, this becomes significant.

For a genuinely high-throughput use case on NVIDIA hardware, neither Ollama nor llama.cpp server is the right answer. vLLM implements PagedAttention and continuous batching, delivering throughput that llama.cpp cannot match under concurrent load, at the cost of requiring CUDA and not supporting GGUF quantization formats.

When Ollama Is Still Fine

If you are pulling a model for a one-off experiment, running it once, and moving on, Ollama’s UX is genuinely better. The model library with single-command download, the clean CLI, and the automatic template handling for common models remove real friction for casual use.

If you are building something that will run in production, integrating local inference into a bot or application, using models that require precise prompt formatting, or care about parameters like grammar-constrained generation and prefix caching, Ollama is the wrong layer to sit behind. You are paying the abstraction tax without getting a corresponding benefit.

llama-cpp-python is worth mentioning for Python-heavy workflows. It provides both a library interface and an OpenAI-compatible server, exposes the same parameter surface as llama-server, and integrates naturally with Python tooling. Install with CUDA support via CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python and you have everything Ollama provides minus the model registry, plus everything llama.cpp exposes that Ollama does not.

The Actual Decision

Ollama made a specific moment in local LLM adoption easier to get through. That moment has passed. llama.cpp ships a stable, well-documented HTTP server with a web UI, full OpenAI compatibility, native chat template handling, and a parameter surface that covers production requirements. The wrapper is no longer pulling its weight for anyone who has moved past “I want to try a model locally.”

The local LLM ecosystem has consolidated around llama.cpp as the inference engine and the OpenAI API spec as the wire format. Anything that sits between your code and those two things needs to be earning its place with concrete benefits. For most workflows past the initial experimentation phase, Ollama is not.