There is a post making the rounds that argues the local LLM ecosystem has outgrown Ollama. The argument is blunt: Ollama wraps llama.cpp, and llama.cpp already ships a capable HTTP server. The wrapper is no longer earning its keep.
That framing is worth unpacking carefully, because it is almost right but not quite. Ollama solved a real problem when it launched. The question worth examining now is whether that problem still exists, and whether the cost of the abstraction has grown past its benefit.
What Ollama Actually Provided
When Ollama arrived in 2023, running a local LLM required manual steps that most people were not prepared to handle. You needed to compile llama.cpp from source, understand what -ngl meant and how many GPU layers your card could accommodate, find a quantized model on HuggingFace, figure out the chat template for whatever model you picked, and wire all of that into a server you could actually talk to. None of the steps was conceptually hard, but the list was long.
Ollama collapsed that into ollama pull llama3.2 && ollama run llama3.2. That is a genuine contribution. It handles GPU layer detection automatically, wraps model management into a familiar pull/push model, and surfaces a REST API on port 11434 that downstream tools can point at.
The REST API was the key part. The ecosystem around local LLMs was converging on the OpenAI wire protocol as a de facto standard, and Ollama eventually adopted it via /v1/chat/completions and related endpoints. Tools like Open WebUI, Continue.dev, and various IDE plugins started targeting the OpenAI-compatible surface, and Ollama’s support meant any of those tools could use local models with minimal configuration.
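To make the "minimal configuration" point concrete, here is a minimal sketch of a chat request built with nothing but the Python standard library. The base URL is the only server-specific part. The helper names (build_chat_request, send) are hypothetical, and the response shape assumed in send is the standard OpenAI chat completion format.

```python
import json
import urllib.request

# Ollama's default port, as mentioned in the text above.
OLLAMA_BASE = "http://localhost:11434/v1"


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to {base_url}/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server, so it is left as a comment):
#   print(send(build_chat_request(OLLAMA_BASE, "llama3.2", "Hello")))
```

Pointing the same code at a different OpenAI-compatible server means changing the base URL and nothing else, which is why the wire protocol mattered more than any single tool.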
The Problem Is the Pinned llama.cpp
Here is where the abstraction starts showing its cost: Ollama vendors its own copy of llama.cpp at a specific commit. When upstream llama.cpp ships improvements, whether that is a new quantization format, a performance fix, or architecture support for a new model family, Ollama users wait for Ollama’s maintainers to update the vendor tree and cut a release.
llama.cpp moves fast. New quantization schemes like the IQ-series quants, Flash Attention support, and new context extension techniques show up regularly. If you are using llama-server directly, you build from HEAD or from the latest tagged release and get those improvements immediately. If you are behind Ollama, you get them weeks or months later, sometimes after the conversation has already moved on.
This is not a hypothetical concern. The community routinely documents cases where Ollama’s bundled llama.cpp lacks support for models that llama-server handles fine, because the model requires features added after Ollama’s vendor commit. The lag is structural, not incidental.
What llama-server Provides Today
The llama-server binary, which ships as part of the llama.cpp repository, now provides essentially the same API surface as Ollama, without the wrapper. The endpoint list covers the core OpenAI-compatible surface: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models. Beyond that, it adds things Ollama does not expose cleanly: a Prometheus metrics endpoint at /metrics (enabled with the --metrics flag), a /slots endpoint for inspecting concurrent inference state, grammar-constrained generation via GBNF grammars and JSON schema, speculative decoding with a draft model, and LoRA adapter loading at runtime.
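The /metrics endpoint returns the Prometheus text exposition format, which is simple enough to read without a Prometheus server. Below is a minimal sketch of a parser for that format. The function name is hypothetical, and the metric names in the usage comment are illustrative rather than a guaranteed list of what any given llama-server build emits.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition format into {series: value}.

    Comment lines (# HELP / # TYPE) are skipped. Labels, if present, are kept
    as part of the series key. Optional trailing timestamps are ignored.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Form: name{label="x"} value [timestamp]
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            # Form: name value [timestamp]
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue  # malformed line with no value; skip it
        samples[series] = float(fields[0])
    return samples


# Usage with an illustrative payload (metric names are assumptions):
#   parse_prometheus_text("llamacpp:requests_processing 2\n")
#   gives {"llamacpp:requests_processing": 2.0}
```

For a quick health check in a script, fetching /metrics and running it through a parser like this is often enough. A real deployment would scrape the endpoint with Prometheus itself.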
Starting it looks like this:
./llama-server \
    -m /path/to/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -c 8192 \
    -ngl 35 \
    --flash-attn \
    -np 4 \
    --host 0.0.0.0 \
    --port 8080
Once it is running, any OpenAI SDK client works against it unchanged:
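from openai import OpenAI

# llama-server does not check the key, but the SDK requires a non-empty value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

# Print tokens as they arrive; delta.content can be None on some chunks.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")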
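The model name is ignored by llama-server; it serves whatever GGUF you pointed it at. That is worth understanding: a single-model server like llama-server treats the model field as mostly decorative, since the model is determined by server configuration rather than request-time selection. Servers that manage several models, Ollama among them, do use the field to choose which model to load, so a client moved from Ollama to llama-server keeps working but loses that per-request selection.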
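If a client wants to report what it is actually talking to rather than trust the name it sent, it can ask the server. The sketch below reads the /v1/models endpoint listed earlier, assuming the standard OpenAI list shape (a data array of objects with an id field). Both function names are hypothetical.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models list response."""
    return [
        entry["id"]
        for entry in models_response.get("data", [])
        if "id" in entry
    ]


def fetch_served_model_ids(base_url: str = "http://localhost:8080/v1") -> list[str]:
    """Query {base_url}/models on a running server and return the model ids."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/models") as resp:
        return served_model_ids(json.load(resp))
```

Logging the returned id alongside each response is a cheap way to avoid confusion when the name in the request and the GGUF on disk do not match.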
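The -ngl parameter is the one thing you need to set manually, where Ollama would auto-detect it. That is a real usability gap. For most setups, setting it to at least the number of layers in the model (32 transformer layers for Llama-style 7B/8B models, 80 for 70B-class models, with llama.cpp counting the output layer as one more) and letting llama.cpp handle the rest is straightforward enough, since any value at or above the total, such as the 35 used above, simply offloads everything. It is still not zero friction, particularly when the model does not fit in VRAM and you have to pick a partial count.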
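When the model does not fit entirely in VRAM, a rough starting value for -ngl can be worked out by hand. The sketch below is a back-of-the-envelope heuristic, not what Ollama or llama.cpp actually computes. It assumes weights are spread evenly across layers and sets aside a fixed reserve (2 GiB by default, an arbitrary assumption) for the KV cache and scratch buffers. The function name and the 32-layer figure in the usage comment are illustrative.

```python
GIB = 1024**3


def estimate_gpu_layers(
    model_bytes: int,
    n_layers: int,
    free_vram_bytes: int,
    reserve_bytes: int = 2 * GIB,
) -> int:
    """Rough guess at how many layers fit in VRAM, for use as a starting -ngl.

    Assumes every layer takes model_bytes / n_layers and that reserve_bytes
    covers the KV cache and buffers. Real usage depends on context size and
    quantization, so treat the result as a first try and adjust.
    """
    if model_bytes <= 0 or n_layers <= 0:
        raise ValueError("model_bytes and n_layers must be positive")
    per_layer = model_bytes / n_layers
    usable = max(free_vram_bytes - reserve_bytes, 0)
    return min(n_layers, int(usable // per_layer))


# Usage: a 4 GiB model file with an assumed 32 layers on a card with 3 GiB free
#   estimate_gpu_layers(4 * GIB, 32, 3 * GIB)
```

If the server then fails to allocate memory at startup, lower the number and retry; if there is headroom left, raise it.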
The Model Format Situation
Ollama stores models in a content-addressed blob store under ~/.ollama/models/blobs/. A GGUF file you have downloaded from HuggingFace yourself is not directly usable; you either pull from the Ollama registry or go through a Modelfile workflow:
FROM /path/to/model.gguf
PARAMETER temperature 0.8
SYSTEM "You are a helpful assistant."
Then ollama create mymodel -f Modelfile. This creates a new entry in the blob store, sometimes a hard link but sometimes an actual copy depending on the filesystem. The friction is not severe, but it is present, and it is entirely absent when you use llama-server directly against a GGUF file.
The GGUF format itself is well-specified and self-describing. It contains the model’s chat template, recommended context length, tokenizer, and quantization information in its header. llama-server reads all of that natively. Ollama’s Modelfile system partly duplicates metadata that already exists in the GGUF, introducing a second source of truth that occasionally gets out of sync. When a model’s built-in chat template and the Ollama registry’s template disagree, you get subtly wrong behavior that can be difficult to diagnose.
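The "self-describing" claim is easy to check by hand. The sketch below reads only the fixed-size start of a GGUF file: the 4-byte magic, a 32-bit version, and two 64-bit counts (tensors and metadata key/value pairs), all little-endian. That layout is an assumption that holds for GGUF version 2 and later; the type and function names are hypothetical. Decoding the metadata entries themselves (chat template, context length, and so on) takes more code and is left out.

```python
import struct
from typing import BinaryIO, NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Read the fixed 24-byte GGUF preamble from a binary stream.

    Assumed layout (GGUF v2+): b"GGUF", uint32 version, uint64 tensor count,
    uint64 metadata key/value count, little-endian.
    """
    raw = stream.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError("not a GGUF file (bad magic or truncated header)")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return GGUFHeader(version, tensor_count, kv_count)


# Usage (path is a placeholder):
#   with open("/path/to/model.gguf", "rb") as f:
#       print(read_gguf_header(f))
```

The metadata key/value count is the number of entries llama-server reads to configure itself, which is the same information a Modelfile would otherwise restate.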
For pulling models directly from HuggingFace without Ollama in the middle, the huggingface-cli tool handles it cleanly:
pip install huggingface_hub
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
    --local-dir ./models
The bartowski account on HuggingFace has become a de facto source for community GGUF quantizations of major models, and download speeds from HuggingFace are generally comparable to Ollama’s registry.
The Alternatives Worth Knowing
llama-server is not the only answer for people who want to step outside Ollama.
llamafile, Mozilla’s project using Justine Tunney’s Cosmopolitan Libc, packages a llama.cpp build and optionally model weights into a single portable executable. One binary, no installation, runs on Linux, macOS, Windows, and several BSDs from the same file. The embedded HTTP server is OpenAI-compatible. For distribution or air-gapped deployment, nothing else comes close to this level of portability.
For Apple Silicon users, mlx_lm uses Apple’s MLX framework rather than llama.cpp’s Metal backend, and community benchmarks often report roughly 10-30% better throughput on M-series chips as a result. It exposes an OpenAI-compatible server and downloads models from HuggingFace directly:
pip install mlx-lm
python -m mlx_lm.server \
    --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
    --port 8080
For high-concurrency production serving, neither Ollama nor llama-server is the right tool. vLLM’s PagedAttention gives dramatically better throughput under concurrent load, at the cost of requiring a full NVIDIA (or ROCm) GPU stack and lacking llama.cpp-style partial offloading of layers to the CPU. With around ten simultaneous users, the throughput difference between vLLM and a llama.cpp-based server becomes significant. For a personal assistant or a small team tool, llama-server is adequate; for anything resembling production traffic, vLLM is the correct starting point.
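Claims about throughput under load are best checked against your own hardware. Here is a minimal sketch of a harness for that, assuming you supply a send_request callable (hypothetical) that performs one completion against whichever server you are testing and returns the number of tokens it generated. The harness itself knows nothing about vLLM or llama.cpp.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(
    send_request: Callable[[int], int],
    n_requests: int,
    concurrency: int,
) -> float:
    """Run n_requests calls across `concurrency` threads; return tokens/second.

    send_request(i) must perform one request and return its generated-token
    count. Threads are sufficient here because the work is network-bound.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send_request, range(n_requests)))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed if elapsed > 0 else float("inf")


# Usage: compare aggregate throughput at 1 and 10 simultaneous clients
#   measure_throughput(my_send, n_requests=40, concurrency=1)
#   measure_throughput(my_send, n_requests=40, concurrency=10)
```

Running the same harness against each candidate server at the concurrency you actually expect gives a more useful answer than any general benchmark figure.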
If you want to preserve the model management convenience without Ollama’s llama.cpp lock-in, LM Studio’s local server mode exposes an OpenAI-compatible API on port 1234, updates its llama.cpp backend independently, and pulls directly from HuggingFace with a searchable UI. It is not a server you would run headlessly in CI, but for a developer machine it covers the gap.
Where Ollama Still Makes Sense
The “stop using Ollama” framing slightly overreaches. For someone getting started with local inference who wants to try a few models without reading documentation, Ollama is still the shortest path. The automatic GPU detection, the pull-by-name model management, and the stable CLI surface have real value for that audience.
The problem is when Ollama becomes the foundation you build on. If you are writing an application that talks to a local LLM, care about accessing the latest llama.cpp capabilities, or need observability into what the inference server is doing, you are fighting the abstraction rather than benefiting from it. The Prometheus metrics endpoint that llama-server exposes natively is, by itself, a meaningful reason to prefer it for any deployment you intend to monitor.
The deeper issue is that the ecosystem has matured in exactly the areas where Ollama filled gaps. The OpenAI-compatible wire protocol is now a stable convention every serious local inference tool implements. HuggingFace’s model hub is a more complete registry than Ollama’s, with direct GGUF downloads and community quantizations for models Ollama often does not carry. The GPU configuration question, while still requiring a number to be set manually in llama-server, is well-documented and consistent across hardware.
Abstractions have a lifecycle. They exist because the underlying layer is not ready to use directly. When the underlying layer catches up, the abstraction either evolves to provide something the layer cannot, or it starts costing more than it provides. Ollama is at that inflection point, and the right response is not categorical rejection, but clarity about what you are giving up. For a developer building anything beyond a quick experiment, the overhead, the version lag, and the format friction now exceed the setup savings. llama-server, llamafile, and mlx_lm have all crossed the threshold where they are genuinely usable without a wrapper, and that is the point the sleepingrobots post is making.